<img src="images/astradrel_datascience_logo.png" style="height: 100px;" align=left>
<img src="images/python_logo.png" style="height: 100px;" align=right>

# Project: Web Scraping Job Postings from Indeed Malaysia

Almost all fresh graduates in Malaysia use online job posting websites such as **Indeed, LinkedIn, JobStreet** and the like to look for jobs. Many opportunities exist on the internet, yet only a fraction of fresh graduates land these jobs, which shows how competitive online job seeking is. Many attributes of the candidates themselves can affect their chances of landing their dream jobs, such as skills, experience, salary expectations and more.

In the case of Data Science, the recent influx of working professionals and fresh graduates from different backgrounds looking for a career transition can make it harder to meet recruiters' criteria. With so many people competing for the same job posts, recruiters need to filter the candidates down and take only the cream of the crop. In this project we gather data science job postings from various online job posting websites and determine which attributes recruiters look for in a candidate.

In this project, we use web scraping tools to scrape job information, including the job description, employment type, salary, location and more, for **Data Scientist** positions on **Indeed Malaysia**.

# Module: Dataset Cleaning and NLP Implementation

## Importing libraries

In [1]:
import pandas as pd
import os
import time
import datetime

# Natural Language Processing
import nltk
import string
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords


## Importing the scraped Job Post dataset from Indeed Malaysia

In [2]:
main_path = "C:\\Users\\astra\\Desktop\\CADS Datastar\\Junior Data Scientist\\Kaggle\\datasciencejobswebscrapping"


In [3]:
df_indeed = pd.read_excel(os.path.join(main_path, 'data', 'indeed_jobs.xlsx'))

In [4]:
df_indeed.head(5)

| | Job Title | Company | Location | Salary | Post Date | Job Link | Type | Description |
|---|---|---|---|---|---|---|---|---|
| 0 | Meteorologist/ Meteorological Data Scientist | AkiraKan (Marine Technology) Sdn Bhd | Kuala Lumpur | RM 3,500 a month | 13 days ago | https://malaysia.indeed.com/company/AkiraKan-(... | Full time | AkiraKan [ AKN Technologies ] is hiring Meteor... |
| 1 | Data Scientist | Doo Technology MY Sdn. Bhd. | Kuala Lumpur | RM 6,000 - RM 7,999 a month | 6 days ago | https://malaysia.indeed.com/rc/clk?jk=e9d15ef5... | No Detail | Responsibilities: - Exploratory data analysis ... |
| 2 | Data Scientist | Datalabs Asia (M) Sdn Bhd | Kuala Lumpur | RM 5,000 - RM 5,999 a month | 13 days ago | https://malaysia.indeed.com/rc/clk?jk=4b53359d... | Contract | Data scientists find and interpret rich data s... |
| 3 | Data Scientist | BASF Asia Pacific | Kuala Lumpur | No Detail | 15 days ago | https://malaysia.indeed.com/rc/clk?jk=3d272182... | No Detail | LOCATION\n\nKuala Lumpur, MY, 50000\n\nCOMPANY... |
| 4 | Geomatics Surveyor (Geographic Information Sys... | AkiraKan (Marine Technology) Sdn Bhd | Kuala Lumpur | RM 3,500 a month | 13 days ago | https://malaysia.indeed.com/company/AkiraKan-(... | Full time | AkiraKan [ AKN Technologies ] is hiring survey... |


## Data Cleaning and Transformation

In [5]:
# Convert the relative 'Post Date' values (e.g. '13 days ago') into a 'Job Date' column in dd/mm/yy format

Date_Posted = []

for data in df_indeed['Post Date']:
    if re.findall(r'[0-9]', data):
        period = int(''.join(re.findall(r'[0-9]', data)))
        period_date = (datetime.datetime.today() - datetime.timedelta(period)).strftime('%d/%m/%y')
        Date_Posted.append(period_date)
    else:
        Date_Posted.append(datetime.datetime.today().strftime('%d/%m/%y'))

df_indeed['Job Date'] = Date_Posted
df_indeed = df_indeed.drop(['Post Date'], axis=1)
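The relative-date conversion above can also be written as a small function, which makes it possible to check against a fixed reference date instead of the current day. This is a minimal sketch, not the notebook's own code: the helper name `relative_to_date` and its `today` parameter are hypothetical, and the input `'Just posted'` is an assumed example of a value without digits (the dataset preview only shows values of the form `'N days ago'`). Unlike the loop above, it reads the first run of digits rather than joining every digit in the string.

```python
import datetime
import re


def relative_to_date(text, today=None):
    """Convert a relative posting age such as '13 days ago' to a dd/mm/yy string."""
    # `today` is a hypothetical parameter so the result is reproducible.
    today = today or datetime.date.today()
    # Take the first run of digits; values without digits count as posted today.
    match = re.search(r'\d+', text)
    days = int(match.group()) if match else 0
    return (today - datetime.timedelta(days=days)).strftime('%d/%m/%y')


reference = datetime.date(2022, 1, 31)  # arbitrary reference date for illustration
print(relative_to_date('13 days ago', today=reference))   # 18/01/22
print(relative_to_date('Just posted', today=reference))   # 31/01/22
```

Applied to the dataframe, this could look like `df_indeed['Post Date'].map(relative_to_date)`, which would replace the explicit loop and the intermediate `Date_Posted` list.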

In [6]:
df_indeed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Job Title    150 non-null    object
 1   Company      150 non-null    object
 2   Location     150 non-null    object
 3   Salary       150 non-null    object
 4   Job Link     150 non-null    object
 5   Type         150 non-null    object
 6   Description  150 non-null    object
 7   Job Date     150 non-null    object
dtypes: object(8)
memory usage: 9.5+ KB


In [7]:
# Clean the 'Type' column and standardize its values

#df_indeed['Type'].unique()
Type = []

for x in df_indeed['Type']:
    if x == 'Full time':
        Type.append('Fulltime')
    elif x == 'No Detail':
        Type.append('Unspecified')
    elif ' ' in x:
        # Join multi-word types with a slash
        #Type.append(re.sub(r"\s+", '/', x))
        Type.append(x.replace(' ', '/'))
    else:
        Type.append(x)

df_indeed['Job Type'] = Type
df_indeed = df_indeed.drop(['Type'], axis=1)
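The same standardization rules can be sketched as a lookup table plus a fallback, which keeps the special cases in one place. The function name `standardize_job_type` is hypothetical, and the sample value `'Permanent Contract'` is only an illustrative multi-word type, not one confirmed to be in the dataset.

```python
def standardize_job_type(value):
    """Map a raw job type to the labels used in the loop above."""
    # Special cases taken from the original if/elif chain.
    special_cases = {'Full time': 'Fulltime', 'No Detail': 'Unspecified'}
    if value in special_cases:
        return special_cases[value]
    # Any other multi-word type is joined with a slash; single words pass through.
    return value.replace(' ', '/')


samples = ['Full time', 'No Detail', 'Contract', 'Permanent Contract']
print([standardize_job_type(v) for v in samples])
# ['Fulltime', 'Unspecified', 'Contract', 'Permanent/Contract']
```

With pandas, the function could be applied as `df_indeed['Type'].map(standardize_job_type)` instead of building the `Type` list by hand.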

In [8]:
df_indeed.Location.unique()

array(['Kuala Lumpur', 'Petaling Jaya', 'Malaysia', 'Seremban',
       'Kuala Lumpur+1 location', 'i-City', 'Selangor',
       'Bangsar South•Remote', 'Petaling Jaya•Remote',
       'Kuala Lumpur+2 locations', 'Subang Jaya', 'Kuala Lumpur•Remote',
       'Simpang Ampat', 'Penang', 'Puchong', 'Port Klang', 'Perai',
       'Cyberjaya', 'Malaysia+1 location', 'Kota Damansara', 'Batu Caves',
       'Bukit Gelugor', 'Melaka', 'Kulai', 'Brickfields', 'Johor'],
      dtype=object)

In [9]:
# Clean the 'Location' column and standardize its values

Location = []

for x in df_indeed["Location"]:
    # Capitalize the first letter of each word (e.g. 'i-City' -> 'I-City')
    x = re.sub(r'\b[a-z]', lambda m: m.group().upper(), x)

    if "•" in x:
        # Drop the '•Remote' suffix
        Location.append(x.split('•')[0])
    elif "+" in x:
        # Drop the '+N location(s)' suffix
        Location.append(x.split('+')[0])
    else:
        Location.append(x)
        
df_indeed["Job Location"] = Location
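The location clean-up can be condensed into one helper that strips the suffix and fixes capitalization in a single pass. This is a sketch under the assumption that `•` and `+` are the only suffix separators, as in the unique values listed earlier; the name `clean_location` is hypothetical.

```python
import re


def clean_location(value):
    """Strip '•Remote' / '+N location(s)' suffixes and capitalize each word."""
    # Keep only the text before the first '•' or '+'.
    base = re.split(r'[•+]', value, maxsplit=1)[0].strip()
    # Upper-case a lowercase letter at the start of each word, as in the loop above.
    return re.sub(r'\b[a-z]', lambda m: m.group().upper(), base)


for raw in ['Kuala Lumpur+1 location', 'Bangsar South•Remote', 'i-City', 'Penang']:
    print(raw, '->', clean_location(raw))
# Kuala Lumpur+1 location -> Kuala Lumpur
# Bangsar South•Remote -> Bangsar South
# i-City -> I-City
# Penang -> Penang
```

Splitting before capitalizing gives the same result as the original order here, because the discarded suffix never contributes to the kept text.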