<img src="images/astradrel_datascience_logo.png" style="height: 100px;" align=left>
<img src="images/python_logo.png" style="height: 100px;" align=right>

# Project: Web Scraping Job Postings from Indeed Malaysia

Most fresh graduates in Malaysia look for jobs on online job portals such as **Indeed, LinkedIn and JobStreet**. Many opportunities are advertised online, yet only a fraction of fresh graduates land these jobs, which shows how competitive online job seeking is. Many attributes of a candidate can affect the chances of landing a dream job, such as skills, experience and salary expectations.

In Data Science, the recent influx of working professionals and fresh graduates from different backgrounds looking for a career transition makes it harder for candidates to meet recruiters' criteria. With so many people competing for the same job posts, recruiters need to filter the candidates and pick only the cream of the crop. In this project we gather data science job postings from various online job posting websites and determine which attributes recruiters look for in a candidate.

In this project, we use web scraping tools to scrape job information, including the job description, employment type, salary, location and more, for **Data Scientist** positions on **Indeed Malaysia**.

# Module: Dataset Cleaning and NLP Implementation

## Importing libraries

In [45]:
import pandas as pd
import os
import time
import datetime

# Natural Language Processing
import nltk
import string
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer


## Importing the scraped job post dataset from Indeed Malaysia

In [2]:
df_indeed = pd.read_excel('data/indeed_jobs.xlsx')

In [3]:
df_indeed.head(5)

Unnamed: 0,Job Title,Company,Location,Salary,Post Date,Job Link,Type,Description
0,Meteorologist/ Meteorological Data Scientist,AkiraKan (Marine Technology) Sdn Bhd,Kuala Lumpur,"RM 3,500 a month",13 days ago,https://malaysia.indeed.com/company/AkiraKan-(...,Full time,AkiraKan [ AKN Technologies ] is hiring Meteor...
1,Data Scientist,Doo Technology MY Sdn. Bhd.,Kuala Lumpur,"RM 6,000 - RM 7,999 a month",6 days ago,https://malaysia.indeed.com/rc/clk?jk=e9d15ef5...,No Detail,Responsibilities: - Exploratory data analysis ...
2,Data Scientist,Datalabs Asia (M) Sdn Bhd,Kuala Lumpur,"RM 5,000 - RM 5,999 a month",13 days ago,https://malaysia.indeed.com/rc/clk?jk=4b53359d...,Contract,Data scientists find and interpret rich data s...
3,Data Scientist,BASF Asia Pacific,Kuala Lumpur,No Detail,15 days ago,https://malaysia.indeed.com/rc/clk?jk=3d272182...,No Detail,"LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY..."
4,Geomatics Surveyor (Geographic Information Sys...,AkiraKan (Marine Technology) Sdn Bhd,Kuala Lumpur,"RM 3,500 a month",13 days ago,https://malaysia.indeed.com/company/AkiraKan-(...,Full time,AkiraKan [ AKN Technologies ] is hiring survey...


## Data Cleaning and Transformation

In [4]:
# Convert the relative 'Post Date' column (e.g. '13 days ago') into a 'Job Date' column in dd/mm/yy format

Date_Posted = []

for data in df_indeed['Post Date']:
    if re.findall(r'[0-9]', data):
        period = int(''.join(re.findall(r'[0-9]', data)))
        period_date = (datetime.datetime.today() - datetime.timedelta(period)).strftime('%d/%m/%y')
        Date_Posted.append(period_date)
    else:
        Date_Posted.append(datetime.datetime.today().strftime('%d/%m/%y'))

df_indeed['Job Date'] = Date_Posted
df_indeed = df_indeed.drop(['Post Date'], axis=1)
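The conversion above depends on the day the notebook is run. A minimal sketch of the same idea as a reusable function with an explicit reference date is shown below. The name `relative_to_date` and its `today` parameter are hypothetical and not part of the original notebook. The sketch also assumes each string contains at most one number and that strings without digits refer to the current day.

```python
import datetime
import re


def relative_to_date(text, today=None):
    """Convert a relative date such as '13 days ago' to a dd/mm/yy string.

    Strings without digits are assumed to mean "posted today".
    """
    today = today or datetime.date.today()
    match = re.search(r'\d+', text)          # first run of digits, if any
    days = int(match.group()) if match else 0
    return (today - datetime.timedelta(days=days)).strftime('%d/%m/%y')


# Reference date chosen for illustration only
reference = datetime.date(2021, 11, 10)
print(relative_to_date('13 days ago', reference))  # 28/10/21
print(relative_to_date('6 days ago', reference))   # 04/11/21
print(relative_to_date('Today', reference))        # 10/11/21
```

With pandas, the function could be applied as `df_indeed['Post Date'].apply(relative_to_date)`.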

In [5]:
df_indeed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Job Title    150 non-null    object
 1   Company      150 non-null    object
 2   Location     150 non-null    object
 3   Salary       150 non-null    object
 4   Job Link     150 non-null    object
 5   Type         150 non-null    object
 6   Description  150 non-null    object
 7   Job Date     150 non-null    object
dtypes: object(8)
memory usage: 9.5+ KB


In [6]:
# Clean the type column and standardize the job type labels

#df_indeed['Type'].unique()
Type = []

for x in df_indeed['Type']:

    if x == 'Full time':
        Type.append('Fulltime')

    elif x == 'No Detail':
        Type.append('Unspecified')

    elif ' ' in x:
        # Replace spaces with slashes in the remaining labels
        Type.append(x.replace(' ', '/'))

    else:
        Type.append(x)

df_indeed['Job Type'] = Type
df_indeed = df_indeed.drop(['Type'], axis=1)
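The same standardization rules can be written as a small lookup-based function. This is only a sketch: `standardize_job_type` is a hypothetical name, and the label `'Contract Internship'` is a made-up input used to show the space-to-slash rule.

```python
def standardize_job_type(raw):
    """Map a raw job-type label to the standardized form used in 'Job Type'."""
    special_cases = {'Full time': 'Fulltime', 'No Detail': 'Unspecified'}
    if raw in special_cases:
        return special_cases[raw]
    # Remaining labels: replace spaces with slashes (no change if there is no space)
    return raw.replace(' ', '/')


print(standardize_job_type('Full time'))            # Fulltime
print(standardize_job_type('No Detail'))            # Unspecified
print(standardize_job_type('Contract'))             # Contract
print(standardize_job_type('Contract Internship'))  # Contract/Internship
```

With pandas, this could be applied as `df_indeed['Type'].map(standardize_job_type)`.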

In [7]:
df_indeed.Location.unique()

array(['Kuala Lumpur', 'Petaling Jaya', 'Malaysia', 'Seremban',
       'Kuala Lumpur+1 location', 'i-City', 'Selangor',
       'Bangsar South•Remote', 'Petaling Jaya•Remote',
       'Kuala Lumpur+2 locations', 'Subang Jaya', 'Kuala Lumpur•Remote',
       'Simpang Ampat', 'Penang', 'Puchong', 'Port Klang', 'Perai',
       'Cyberjaya', 'Malaysia+1 location', 'Kota Damansara', 'Batu Caves',
       'Bukit Gelugor', 'Melaka', 'Kulai', 'Brickfields', 'Johor'],
      dtype=object)

In [8]:
# Clean the location column: capitalize each word and drop suffixes such as '•Remote' and '+1 location'

Location = []

for x in df_indeed["Location"]:
    
    x = re.sub(r'\b[a-z]', lambda m: m.group().upper(), x)
    
    if "•" in x:

        Location.append(x.split('•')[0])
        
    elif "+" in x:
        
        Location.append(x.split('+')[0])        
              
    else:
        Location.append(x)
        
df_indeed["Job Location"] = Location
df_indeed = df_indeed.drop(['Location'], axis=1)
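The location rules can be condensed into one function that splits on either separator. This is a sketch with a hypothetical name, `clean_location`. It differs from the loop above in one edge case: if a value contained both `•` and `+`, this version cuts at whichever comes first.

```python
import re


def clean_location(raw):
    """Capitalize lowercase word starts and keep the text before '•' or '+'."""
    # Upper-case any lowercase letter that starts a word (e.g. 'i-City' -> 'I-City')
    titled = re.sub(r'\b[a-z]', lambda m: m.group().upper(), raw)
    # Keep only the part before the first '•' or '+'
    return re.split(r'[•+]', titled, maxsplit=1)[0].strip()


print(clean_location('Kuala Lumpur+2 locations'))  # Kuala Lumpur
print(clean_location('Bangsar South•Remote'))      # Bangsar South
print(clean_location('i-City'))                    # I-City
```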

In [9]:
df_indeed.head(5)

Unnamed: 0,Job Title,Company,Salary,Job Link,Description,Job Date,Job Type,Job Location
0,Meteorologist/ Meteorological Data Scientist,AkiraKan (Marine Technology) Sdn Bhd,"RM 3,500 a month",https://malaysia.indeed.com/company/AkiraKan-(...,AkiraKan [ AKN Technologies ] is hiring Meteor...,28/10/21,Fulltime,Kuala Lumpur
1,Data Scientist,Doo Technology MY Sdn. Bhd.,"RM 6,000 - RM 7,999 a month",https://malaysia.indeed.com/rc/clk?jk=e9d15ef5...,Responsibilities: - Exploratory data analysis ...,04/11/21,Unspecified,Kuala Lumpur
2,Data Scientist,Datalabs Asia (M) Sdn Bhd,"RM 5,000 - RM 5,999 a month",https://malaysia.indeed.com/rc/clk?jk=4b53359d...,Data scientists find and interpret rich data s...,28/10/21,Contract,Kuala Lumpur
3,Data Scientist,BASF Asia Pacific,No Detail,https://malaysia.indeed.com/rc/clk?jk=3d272182...,"LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY...",26/10/21,Unspecified,Kuala Lumpur
4,Geomatics Surveyor (Geographic Information Sys...,AkiraKan (Marine Technology) Sdn Bhd,"RM 3,500 a month",https://malaysia.indeed.com/company/AkiraKan-(...,AkiraKan [ AKN Technologies ] is hiring survey...,28/10/21,Fulltime,Kuala Lumpur


In [10]:
# Split the salary into minimum and maximum values and convert them to integers

Salary = []

for x in df_indeed['Salary']:

    splits = x.split(' ')
    digits = []
    for split in splits:
        split = split.replace(',', '')
        if split.isdigit():
            digits.append(int(split))

    if len(digits) == 0:
        # No amount given (e.g. 'No Detail'): use 0 as a placeholder
        digits = [0, 0]
    elif len(digits) == 1:
        # Single amount: use it as both minimum and maximum
        digits.append(digits[0])

    Salary.append(digits)

Min_Salary = []
Max_Salary = []

for item in Salary:

    Min_Salary.append(item[0])
    Max_Salary.append(item[1])

df_indeed['Min Salary'] = Min_Salary
df_indeed['Max Salary'] = Max_Salary

df_indeed = df_indeed.drop(columns=['Salary'])

In [11]:
df_indeed.head()

Unnamed: 0,Job Title,Company,Job Link,Description,Job Date,Job Type,Job Location,Min Salary,Max Salary
0,Meteorologist/ Meteorological Data Scientist,AkiraKan (Marine Technology) Sdn Bhd,https://malaysia.indeed.com/company/AkiraKan-(...,AkiraKan [ AKN Technologies ] is hiring Meteor...,28/10/21,Fulltime,Kuala Lumpur,3500,3500
1,Data Scientist,Doo Technology MY Sdn. Bhd.,https://malaysia.indeed.com/rc/clk?jk=e9d15ef5...,Responsibilities: - Exploratory data analysis ...,04/11/21,Unspecified,Kuala Lumpur,6000,7999
2,Data Scientist,Datalabs Asia (M) Sdn Bhd,https://malaysia.indeed.com/rc/clk?jk=4b53359d...,Data scientists find and interpret rich data s...,28/10/21,Contract,Kuala Lumpur,5000,5999
3,Data Scientist,BASF Asia Pacific,https://malaysia.indeed.com/rc/clk?jk=3d272182...,"LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY...",26/10/21,Unspecified,Kuala Lumpur,0,0
4,Geomatics Surveyor (Geographic Information Sys...,AkiraKan (Marine Technology) Sdn Bhd,https://malaysia.indeed.com/company/AkiraKan-(...,AkiraKan [ AKN Technologies ] is hiring survey...,28/10/21,Fulltime,Kuala Lumpur,3500,3500
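The `Min Salary` and `Max Salary` values above can also be derived with a single regular expression. The sketch below uses a hypothetical helper, `parse_salary_range`, and assumes salaries are whole ringgit amounts with optional thousands separators. Unlike the loop in the notebook, it takes the first and last amounts found in the string.

```python
import re


def parse_salary_range(text):
    """Return (min_salary, max_salary) as integers, or (0, 0) if no amount is found."""
    # Match runs of digits that may contain thousands separators, e.g. '6,000'
    amounts = [int(a.replace(',', '')) for a in re.findall(r'\d[\d,]*', text)]
    if not amounts:
        return (0, 0)
    return (amounts[0], amounts[-1])


print(parse_salary_range('RM 3,500 a month'))             # (3500, 3500)
print(parse_salary_range('RM 6,000 - RM 7,999 a month'))  # (6000, 7999)
print(parse_salary_range('No Detail'))                    # (0, 0)
```

Note that using 0 as a placeholder for a missing salary will pull down any later average. A missing-value marker may be a safer choice if the columns are used for statistics.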


## Natural Language Processing

In [12]:
df_indeed['Description']

0      AkiraKan [ AKN Technologies ] is hiring Meteor...
1      Responsibilities: - Exploratory data analysis ...
2      Data scientists find and interpret rich data s...
3      LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY...
4      AkiraKan [ AKN Technologies ] is hiring survey...
                             ...                        
145    If you are looking to excel and make a differe...
146    About the role A Product Manager’s mission at ...
147                                           No Details
148                                           No Details
149    General information\nAgency: Kinesso\nJob Func...
Name: Description, Length: 150, dtype: object

In [15]:
# Create a new dataframe for the job descriptions
# Note: len(x) returns the number of characters, so 'Word Counts' holds character lengths, not word counts

word_counts = []
for x in df_indeed['Description']:
    word_counts.append(len(x))

df_indeed_text = pd.concat([df_indeed['Description'], pd.DataFrame(word_counts, columns=['Word Counts'])], axis=1)

In [16]:
df_indeed_text

Unnamed: 0,Description,Word Counts
0,AkiraKan [ AKN Technologies ] is hiring Meteor...,1307
1,Responsibilities: - Exploratory data analysis ...,2396
2,Data scientists find and interpret rich data s...,2554
3,"LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY...",5252
4,AkiraKan [ AKN Technologies ] is hiring survey...,1015
...,...,...
145,If you are looking to excel and make a differe...,3529
146,About the role A Product Manager’s mission at ...,3782
147,No Details,10
148,No Details,10
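The value 10 for `No Details` shows that the `Word Counts` column measures characters rather than words. A short illustration of the difference, using plain Python string methods:

```python
description = 'No Details'

char_count = len(description)          # number of characters, spaces included
word_count = len(description.split())  # number of whitespace-separated words

print(char_count, word_count)  # 10 2
```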


In [17]:
# Identify whether each description contains URLs

word_urls = []

for x in df_indeed_text['Description']:
    re_pattern=r'\b(?:http).+\b'
    if re.findall(re_pattern,x):
        word_urls.append(re.findall(re_pattern,x))
    else:
        word_urls.append([])
         
word_urls

[[],
 [],
 ...
 ['https://www.accenture.com/my-en/about/inclusion-diversity/gender-equality'],
 [],
 ...
 ['https://careers.airasia.com/how-we-hire", "value": "HIRING_PROCESS"}, {"text": "Our employee benefits", "value": "EMPLOYEE_BENEFITS"}, {"text": "Life at Air Asia", "linkOutText": "check out how life is at Air Asia.", "linkOut": true, "link": "https://careers.airasia.com/meet-allstars", "value": "ALLSTARS"}]}}, "EMPLOYEE_BENEFITS": {"prompts": [{"delay": 200, "content": "Great Question."}, {"delay": 1000, "content": "At Air Asia we believe that It has always been about the people. We offer comprehensive benefits to our employees."}, {"delay": 1200, "content": "Here\'s a quick overv

In [18]:
word_wo_urls = []

for x in df_indeed_text['Description']:
    re_pattern=r'\b(?:http).+\b'
    if re.findall(re_pattern,x):
        word_wo_urls.append(re.sub(re_pattern,' ',x))
    else:
        word_wo_urls.append(x)
        
df_indeed_text['Description w/o URL'] = word_wo_urls

In [19]:
df_indeed_text

Unnamed: 0,Description,Word Counts,Description w/o URL
0,AkiraKan [ AKN Technologies ] is hiring Meteor...,1307,AkiraKan [ AKN Technologies ] is hiring Meteor...
1,Responsibilities: - Exploratory data analysis ...,2396,Responsibilities: - Exploratory data analysis ...
2,Data scientists find and interpret rich data s...,2554,Data scientists find and interpret rich data s...
3,"LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY...",5252,"LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY..."
4,AkiraKan [ AKN Technologies ] is hiring survey...,1015,AkiraKan [ AKN Technologies ] is hiring survey...
...,...,...,...
145,If you are looking to excel and make a differe...,3529,If you are looking to excel and make a differe...
146,About the role A Product Manager’s mission at ...,3782,About the role A Product Manager’s mission at ...
147,No Details,10,No Details
148,No Details,10,No Details


In [20]:
word_counts = []
for x in df_indeed_text['Description w/o URL']:
    word_counts.append(len(x))

df_indeed_text['Word Counts w/o URL'] = word_counts

In [21]:
df_indeed_text['Word Counts Difference'] = df_indeed_text['Word Counts'] - df_indeed_text['Word Counts w/o URL']

In [22]:
# Some descriptions became shorter (in characters) because everything from the first 'http' to the end of that line was removed

df_indeed_text['Word Counts Difference'].unique()

array([    0,    72, 11528,    41,   171], dtype=int64)
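The large differences above, such as 11528 characters, are consistent with the pattern `\b(?:http).+\b` being greedy: `.+` runs from `http` to the end of the line, so any text after the URL is removed as well. The sketch below compares it with a narrower pattern. It assumes a URL ends at the first whitespace character, and the sample sentence and `example.com` address are made up for illustration.

```python
import re

greedy_pattern = r'\b(?:http).+\b'  # pattern used in the notebook
narrow_pattern = r'https?://\S+'    # assumption: a URL ends at the first whitespace

text = 'Apply at https://example.com/careers today. Python and SQL required.'

greedy_result = re.sub(greedy_pattern, ' ', text)
narrow_result = re.sub(narrow_pattern, ' ', text)

# The greedy pattern also removes the skills that follow the URL
print('Python' in greedy_result)  # False
print('Python' in narrow_result)  # True
```

Since the skills are what the later keyword extraction looks for, the narrower pattern may preserve more useful text.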

In [68]:
# Removal of punctuation, digits and stopwords

stop_words = stopwords.words('english')
word_cleaned = []

for x in df_indeed_text['Description w/o URL']:
    # remove punctuation and digits, then tokenize
    clean = word_tokenize(''.join(n for n in x.lower() if n not in string.punctuation and not n.isdigit()))

    # remove single-character tokens and stopwords
    clean = [token for token in clean if (len(token) > 1 and token not in stop_words)]
    word_cleaned.append(clean)
    
df_indeed_text['Cleaned Description'] = word_cleaned
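Deleting punctuation outright joins words that were separated only by a punctuation mark. For example, a slash-separated pair such as `Surveyor/GIS` would become a single token. One possible alternative is to replace punctuation and digits with spaces before splitting. The sketch below avoids the NLTK dependency so it can run on its own. `clean_tokens` and `DEMO_STOPWORDS` are hypothetical names, and the tiny stopword set only stands in for NLTK's English list.

```python
import string

# Translation table that maps every punctuation character and digit to a space
TO_SPACE = str.maketrans({ch: ' ' for ch in string.punctuation + string.digits})

# Small stand-in for NLTK's English stopword list (assumption, demo only)
DEMO_STOPWORDS = {'is', 'a', 'the', 'and', 'for'}


def clean_tokens(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, replace punctuation/digits with spaces, split, and filter tokens."""
    tokens = text.lower().translate(TO_SPACE).split()
    # Drop single-character tokens and stopwords, as in the notebook cell
    return [t for t in tokens if len(t) > 1 and t not in stop_words]


print(clean_tokens('Hiring a Surveyor/GIS analyst for 2 projects.'))
# ['hiring', 'surveyor', 'gis', 'analyst', 'projects']
```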

In [81]:
lemmatizer = WordNetLemmatizer()  # create the lemmatizer once and reuse it
Words_lemmatize_list = []

for words in df_indeed_text['Cleaned Description']:
    # lemmatize every token of the description
    Words_lemmatize_list.append([lemmatizer.lemmatize(word) for word in words])

df_indeed_text['Lemmatize Description'] = Words_lemmatize_list

In [82]:
# This is the final result of the lemmatization of the job description

df_indeed_text

Unnamed: 0,Description,Word Counts,Description w/o URL,Word Counts w/o URL,Word Counts Difference,Cleaned Description,Lemmatize Description
0,AkiraKan [ AKN Technologies ] is hiring Meteor...,1307,AkiraKan [ AKN Technologies ] is hiring Meteor...,1307,0,"[akirakan, akn, technologies, hiring, meteorol...","[akirakan, akn, technology, hiring, meteorolog..."
1,Responsibilities: - Exploratory data analysis ...,2396,Responsibilities: - Exploratory data analysis ...,2396,0,"[responsibilities, exploratory, data, analysis...","[responsibility, exploratory, data, analysis, ..."
2,Data scientists find and interpret rich data s...,2554,Data scientists find and interpret rich data s...,2554,0,"[data, scientists, find, interpret, rich, data...","[data, scientist, find, interpret, rich, data,..."
3,"LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY...",5252,"LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY...",5252,0,"[location, kuala, lumpur, company, basf, asia,...","[location, kuala, lumpur, company, basf, asia,..."
4,AkiraKan [ AKN Technologies ] is hiring survey...,1015,AkiraKan [ AKN Technologies ] is hiring survey...,1015,0,"[akirakan, akn, technologies, hiring, surveyor...","[akirakan, akn, technology, hiring, surveyorgi..."
...,...,...,...,...,...,...,...
145,If you are looking to excel and make a differe...,3529,If you are looking to excel and make a differe...,3529,0,"[looking, excel, make, difference, take, close...","[looking, excel, make, difference, take, close..."
146,About the role A Product Manager’s mission at ...,3782,About the role A Product Manager’s mission at ...,3782,0,"[role, product, manager, mission, supahands, s...","[role, product, manager, mission, supahands, s..."
147,No Details,10,No Details,10,0,[details],[detail]
148,No Details,10,No Details,10,0,[details],[detail]


## Skills Keyword Extraction

In [89]:
skill_word = ['Python','SQL','AWS', 'Machine learning','Deep learning','Text mining',
'NLP','SAS','Tableau','Sagemaker','Tensorflow','Spark', 'numpy', 'MongoDB','PSQL',
"Postgres", 'Pandas', 'RESTFUL','NLP','Statistics','Algorithms','Visualization',
'GCP','Google Cloud','Naive Bayes','Random Forest','Bachelors degree','Masters degree',
'Java','Pyspark','Postgres','MySQL','Github','Docker','Machine Learning','C+',
'C++','Pytorch','Jupyter Notebook','R Studio','R-Studio','Forecasting','Hive',
'PhD','GCP','Numpy','NoSQL','Neo4j','Neural Network','Clustering','Linear Algebra',
'Google Colab','Data Mining','Regression','Time Series','ETL','Data Wrangling',
'Web Scraping','Feature Extraction','Feature Engineering','Scipy','ML','DL']

In [90]:
skill_word = [x.lower() for x in skill_word]
skill_word

['python',
 'sql',
 'aws',
 'machine learning',
 'deep learning',
 'text mining',
 'nlp',
 'sas',
 'tableau',
 'sagemaker',
 'tensorflow',
 'spark',
 'numpy',
 'mongodb',
 'psql',
 'postgres',
 'pandas',
 'restful',
 'nlp',
 'statistics',
 'algorithms',
 'visualization',
 'gcp',
 'google cloud',
 'naive bayes',
 'random forest',
 'bachelors degree',
 'masters degree',
 'java',
 'pyspark',
 'postgres',
 'mysql',
 'github',
 'docker',
 'machine learning',
 'c+',
 'c++',
 'pytorch',
 'jupyter notebook',
 'r studio',
 'r-studio',
 'forecasting',
 'hive',
 'phd',
 'gcp',
 'numpy',
 'nosql',
 'neo4j',
 'neural network',
 'clustering',
 'linear algebra',
 'google colab',
 'data mining',
 'regression',
 'time series',
 'etl',
 'data wrangling',
 'web scraping',
 'feature extraction',
 'feature engineering',
 'scipy',
 'ml',
 'dl']
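One way the lowercase skill list could be used is to count how many descriptions mention each skill. The sketch below is an assumption about that next step, not the notebook's own method. `count_skill_mentions` and the two demo descriptions are hypothetical. It matches against the lowercased raw text rather than the cleaned tokens, because multi-word phrases and entries containing punctuation (such as `c++` or `r-studio`) do not survive punctuation removal.

```python
import re
from collections import Counter


def count_skill_mentions(descriptions, skills):
    """Count how many descriptions mention each skill at least once."""
    counts = Counter()
    for text in descriptions:
        lowered = text.lower()
        for skill in set(skills):  # set() ignores duplicate entries in the list
            # Lookarounds act as word boundaries that also work for 'c++'
            pattern = r'(?<!\w)' + re.escape(skill) + r'(?!\w)'
            if re.search(pattern, lowered):
                counts[skill] += 1
    return counts


demo_descriptions = [
    'Strong Python and SQL skills; machine learning experience preferred.',
    'Experience with C++ or Python. Knowledge of NoSQL is a plus.',
]
demo_skills = ['python', 'sql', 'machine learning', 'c++', 'nosql']

counts = count_skill_mentions(demo_descriptions, demo_skills)
print(counts['python'], counts['sql'], counts['c++'])  # 2 1 1
```

The lookarounds prevent `sql` from being counted inside `nosql`, which a plain substring test would not.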