# ITBW51 Text & Social Analytic Project

## **Project Title**: *Suitable Job Classifier*

### WEB SCRAPING indeed.com

___

**Module Group**: *ITBW51-01*

**Group Name**: *Anyhow*

**Tutor**: *Ms Jane Zhang*

**By**: *Zhang Xiang*
___



### Project Summary:

**Dataset**: Our team scraped data from four recruitment websites (JobStreet, Indeed, Glassdoor, and LinkedIn) to help job candidates find an ideal job based on their qualifications.

**Target Audience**: Fresh IT graduates looking for job opportunities.

**Objective**: To ***recommend the ideal job for a candidate's skills and qualifications.*** This will only be attempted if the single-label classification approach is completed successfully and enough time remains.

**Goal**: Build a multi-class classifier that predicts job roles such as Data Analyst, Data Scientist, Software Engineer, and Business Analyst by analyzing job description texts.


Justifications for the amendments are stated under modelling and evaluation.

### Additional Info:
All the code below was re-run with a smaller scraping sample to test it before submission.

## Download & Import necessary libraries

~~~Python
!pip install -U selenium
!pip install beautifulsoup4
~~~

In [1]:
import time
from bs4 import BeautifulSoup

# !pip install selenium
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

import pandas as pd
import math

### Dynamic URL Function

In [2]:
# set the URL we want to scrape
def get_url(position, location, page):
    url_template = """https://sg.indeed.com/jobs?q={}&l={}&start={}"""
    url = url_template.format(position, location, str(page * 10))
    
    return url
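
Role names such as `Data Analyst` contain spaces, so the query values may need URL encoding before they are placed in the address. The following is a minimal sketch of one way to do that with the standard library. `build_search_url` and `results_per_step` are hypothetical names used only for illustration; the step of 10 per page is taken from `get_url` above.

```python
from urllib.parse import urlencode

BASE_URL = "https://sg.indeed.com/jobs"  # same host and path as get_url above

def build_search_url(position, location, page, results_per_step=10):
    """Hypothetical variant of get_url that URL-encodes the query values."""
    query = {"q": position, "l": location, "start": page * results_per_step}
    # urlencode escapes spaces and other special characters (spaces become '+')
    return f"{BASE_URL}?{urlencode(query)}"

print(build_search_url("Data Analyst", "Singapore", 2))
```

Building the query from a dictionary keeps the parameter names in one place and avoids hand-written escaping.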

### Selenium Chrome Driver

In [3]:
chrome_options = Options()
# proxy = "154.26.134.217:80"
# chrome_options.add_argument("--proxy-server={}".format(proxy))

# Control the size of the browser window
chrome_options.add_argument("--window-size=1024,768")

# Set a regular browser user agent; like the proxy above, this can help avoid IP blocks
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36")

# Make sure the content is encoded as UTF-8
chrome_options.add_argument("--accept-encoding=utf-8")

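# Create a Chrome browser (Selenium 4 style: driver path via Service, options via `options=`)
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('chromedriver.exe'), options=chrome_options)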



### Begin Scraping

In [4]:
# List of job roles to scrape
job_roles = ['Software Engineer', 'Data Scientist', 'Business Analyst', 'Data Analyst']

# Each page contains about 15 job postings.
# The target is to scrape 700+ job postings per job role for each person.
job_posts=[]

# Loop through all 4 job roles
for role in job_roles:
    
    # having the browser visit the first page URL
    driver.get(get_url(role, 'Singapore', 0))
    
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    
    num_job_posted_str = soup.find('div',class_='jobsearch-JobCountAndSortPane-jobCount').text
    num_job_posted = int(num_job_posted_str.replace(",", "").split(" jobs")[0])
    print(role)
    print("No. of Job Posted:{}".format(num_job_posted))
    
    no_of_pages = math.ceil(num_job_posted/15)
    if no_of_pages > 100:
        no_of_pages = 80
        print(f"Since there are more than 100 of pages, I will be scraping the first {no_of_pages} pages only.")
    else:
        print(f"Scraping {no_of_pages}")
        
    print('')
    
    # For testing only: uncomment the line below to limit the run to 2 pages.
#     no_of_pages = 2
    
    # Loop through every page
    for page in range(no_of_pages):
#         chrome_options = Options()
#         chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.75 Safari/537.36")
#         chrome_options.add_argument("--accept-encoding=utf-8")
#         # create a new Chrome browser
#         driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='path\to\chromedriver.exe')

        # have the browser visit the URL
        driver.get(get_url(role, 'Singapore', page))

        soup = BeautifulSoup(driver.page_source, 'html.parser')
        
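        # Remove the pop-up if it hides the job cards, then re-parse the page.
        # soup.find() returns None (it does not raise) when the element is missing.
        outer_most_point = soup.find('div', attrs={'id':'mosaic-provider-jobcards'})
        if outer_most_point is None:
            # A compound class name needs a CSS selector rather than By.CLASS_NAME
            close_popup = driver.find_element(By.CSS_SELECTOR, ".icl-CloseButton.icl-Modal-close")
            close_popup.click()
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            outer_most_point = soup.find('div', attrs={'id':'mosaic-provider-jobcards'})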

        # Iterate over the direct child tags of the job list only
        for job in outer_most_point.find('ul').find_all(recursive=False):
            
            # Outer layer of "title, company, link"
            job_title = job.find('h2',{'class':'jobTitle'})
            job_company = job.find('span', class_='companyName')
            
            # Skip list items that are not job cards; otherwise the previous
            # job's values would be appended again as a duplicate row
            if job_title is None or job_title.a is None:
                continue
            
            # Scrape title
            title = job_title.a.text.strip()
            # Scrape company (empty string if it is missing)
            company = job_company.text.strip() if job_company is not None else ''
            # Scrape the posting link for further scraping of the description.
            # The link is built from the attribute 'data-jk', instead of 'href'
            link = 'https://sg.indeed.com/viewjob?jk=' + job_title.a['data-jk'].strip()
                
            # Robot scraping prevention
            time.sleep(1)
            
            # Append the scraped data to the list called job_posts
            job_posts.append([role, title, company, link])

        # Not necessary
#         driver.quit()

Software Engineer
No. of Job Posted:9615
Since there are more than 100 of pages, I will be scraping the first 80 pages only.

Data Scientist
No. of Job Posted:890
Scraping 60

Business Analyst
No. of Job Posted:2432
Since there are more than 100 of pages, I will be scraping the first 80 pages only.

Data Analyst
No. of Job Posted:2663
Since there are more than 100 of pages, I will be scraping the first 80 pages only.



In [5]:
# View the first 2 scraped records
job_posts[0:2]

[['Software Engineer',
  'Software Engineer',
  'Globotron (S) Pte Ltd',
  'https://sg.indeed.com/viewjob?jk=29fc16d129b696fc'],
 ['Software Engineer',
  'Software Engineer (Nucleus Graduate Programme)',
  'NCS',
  'https://sg.indeed.com/viewjob?jk=714d3347ae7782b0']]
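
Each record has the form `[role, title, company, link]`. The extraction can be tried without a browser by parsing a static HTML string. The sketch below assumes markup shaped like the selectors used in the scraping cell (`mosaic-provider-jobcards`, `jobTitle`, `companyName`, `data-jk`); the sample HTML is made up, and `parse_job_cards` is a hypothetical helper, not part of the notebook.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure the scraper expects
SAMPLE_HTML = """
<div id="mosaic-provider-jobcards"><ul>
  <li><h2 class="jobTitle"><a data-jk="abc123">Data Analyst</a></h2>
      <span class="companyName">Example Pte Ltd</span></li>
  <li><div>Not a job card</div></li>
</ul></div>
"""

def parse_job_cards(html, role):
    """Hypothetical helper: return [role, title, company, link] per job card."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="mosaic-provider-jobcards")
    if container is None or container.find("ul") is None:
        return []
    records = []
    for card in container.find("ul").find_all("li", recursive=False):
        title_tag = card.find("h2", class_="jobTitle")
        if title_tag is None or title_tag.a is None:
            continue  # skip list items without a job title
        company_tag = card.find("span", class_="companyName")
        company = company_tag.text.strip() if company_tag is not None else ""
        link = "https://sg.indeed.com/viewjob?jk=" + title_tag.a["data-jk"]
        records.append([role, title_tag.a.text.strip(), company, link])
    return records

print(parse_job_cards(SAMPLE_HTML, "Data Analyst"))
```

Skipping list items that have no title keeps values from one card from leaking into the next record.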

In [6]:
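# Create a DataFrame
df = pd.DataFrame(job_posts, columns=['Role', 'Title', 'Company', 'Link'])
# Keep an independent copy of the original data before duplicates are removed
# (plain assignment would only create a second name for the same DataFrame)
df1 = df.copy()
df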

Unnamed: 0,Role,Title,Company,Link
0,Software Engineer,Software Engineer,Globotron (S) Pte Ltd,https://sg.indeed.com/viewjob?jk=29fc16d129b696fc
1,Software Engineer,Software Engineer (Nucleus Graduate Programme),NCS,https://sg.indeed.com/viewjob?jk=714d3347ae7782b0
2,Software Engineer,Software Engineer (TypeScript) Adrestia (Remote),Cord,https://sg.indeed.com/viewjob?jk=02da0f57bbf32cfb
3,Software Engineer,Software Engineer - Haskell (Remote),Cord,https://sg.indeed.com/viewjob?jk=26da81ef7d13c127
4,Software Engineer,Software Engineer (Dev) Scala - Sidechains (Re...,Cord,https://sg.indeed.com/viewjob?jk=f6eaec70aa994958
...,...,...,...,...
139,Data Analyst,[GOVT] Data Analyst (Entry Level) - JL,BGC GROUP PTE. LTD.,https://sg.indeed.com/viewjob?jk=740ab533f5952718
140,Data Analyst,Data Analyst,Publicis Groupe,https://sg.indeed.com/viewjob?jk=b2dcf92887c34c30
141,Data Analyst,Data Analyst Intern #GeneralInternship,Singtel,https://sg.indeed.com/viewjob?jk=edf81911b0c87c66
142,Data Analyst,Data Analyst,Y3 TECHNOLOGIES PTE LTD,https://sg.indeed.com/viewjob?jk=05700fa9cf222f9b


### Check for duplicated row

This prevents the same posting from being scraped again during description scraping.

In [7]:
# Flag fully duplicated rows (the last occurrence is not flagged)
print(df.duplicated(keep='last'))

0      False
1      False
2      False
3      False
4       True
       ...  
139    False
140    False
141    False
142     True
143    False
Length: 144, dtype: bool


### Duplicate Removal

In [8]:
# Remove records whose entire row is duplicated, keeping the last occurrence
df.drop_duplicates(keep='last', inplace=True)

# Reset the index after the duplicates are removed
df.reset_index(drop=True, inplace=True)

df

Unnamed: 0,Role,Title,Company,Link
0,Software Engineer,Software Engineer,Globotron (S) Pte Ltd,https://sg.indeed.com/viewjob?jk=29fc16d129b696fc
1,Software Engineer,Software Engineer (Nucleus Graduate Programme),NCS,https://sg.indeed.com/viewjob?jk=714d3347ae7782b0
2,Software Engineer,Software Engineer (TypeScript) Adrestia (Remote),Cord,https://sg.indeed.com/viewjob?jk=02da0f57bbf32cfb
3,Software Engineer,Software Engineer - Haskell (Remote),Cord,https://sg.indeed.com/viewjob?jk=26da81ef7d13c127
4,Software Engineer,Software Engineer (Dev) Scala - Sidechains (Re...,Cord,https://sg.indeed.com/viewjob?jk=f6eaec70aa994958
...,...,...,...,...
100,Data Analyst,Data Analysis Support Officer,RMA CONTRACTS PTE. LTD.,https://sg.indeed.com/viewjob?jk=6f7fdca176e451c4
101,Data Analyst,[GOVT] Data Analyst (Entry Level) - JL,BGC GROUP PTE. LTD.,https://sg.indeed.com/viewjob?jk=740ab533f5952718
102,Data Analyst,Data Analyst,Publicis Groupe,https://sg.indeed.com/viewjob?jk=b2dcf92887c34c30
103,Data Analyst,Data Analyst Intern #GeneralInternship,Singtel,https://sg.indeed.com/viewjob?jk=edf81911b0c87c66
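
The duplicate check above compares entire rows, so a posting that appears under two different roles would keep both rows because the `Role` values differ. If de-duplication by link is wanted instead, a sketch along these lines might work. The rows are made up for illustration, and which role label to keep for a shared posting is a modelling decision this sketch does not settle.

```python
import pandas as pd

# Made-up rows: the second and third share a Link but differ in Role
jobs = pd.DataFrame(
    [
        ["Data Analyst", "Data Analyst", "A", "https://example.com/job/1"],
        ["Data Analyst", "Analyst", "B", "https://example.com/job/2"],
        ["Business Analyst", "Analyst", "B", "https://example.com/job/2"],
    ],
    columns=["Role", "Title", "Company", "Link"],
)

# Full-row comparison finds no duplicates because the Role values differ
full_row_dupes = int(jobs.duplicated(keep="last").sum())

# Comparing on Link alone flags the earlier of the two matching rows
link_dupes = int(jobs.duplicated(subset="Link", keep="last").sum())

# Returning a new frame leaves the original `jobs` untouched
deduped = jobs.drop_duplicates(subset="Link", keep="last").reset_index(drop=True)

print(full_row_dupes, link_dupes, len(jobs), len(deduped))
```

Because `drop_duplicates` is called without `inplace=True`, the original frame stays available for comparison without needing a separate copy.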


In [9]:
# For checking and re-scraping the data above
# df1.to_csv("IT_JOB_210896X.csv")

In [10]:
# View the link
df['Link'][0]

'https://sg.indeed.com/viewjob?jk=29fc16d129b696fc'

### Scrape Description

In [11]:
# chrome_options = Options()
# chrome_options.add_argument("--accept-encoding=utf-8")
# driver = webdriver.Chrome(chrome_options=chrome_options, executable_path='path\to\chromedriver.exe')

df['Description'] = ''

# Loop through each link scraped previously
for idx in df.index:
    url = df['Link'][idx]
    
    # Open the browser using the url
    driver.get(url)

    soup_ = BeautifulSoup(driver.page_source, 'html.parser')
    
    # Find description
    descriptions = soup_.find('div',{'class':'jobsearch-jobDescriptionText'})
    job_descr_txt=[]
    # soup_.find() returns None when the description block is missing, so guard the loop
    for desc in (descriptions if descriptions is not None else []):
        try:
            # Append the stripped text of each child node
            job_descr_txt.append(desc.text.strip())
        except AttributeError:
            # If the node has no text attribute, append an empty string
            job_descr_txt.append('')
            
    time.sleep(2)

    # Join all the text into one.
    description = ' '.join(job_descr_txt).strip()
    
    # Assign it to the corresponding DataFrame row (.loc avoids chained assignment)
    df.loc[idx, 'Description'] = description

# Close Browser
driver.quit()

# Show DataFrame
df

Unnamed: 0,Role,Title,Company,Link,Description
0,Software Engineer,Software Engineer,Globotron (S) Pte Ltd,https://sg.indeed.com/viewjob?jk=29fc16d129b696fc,Responsibilities Prepare IT Status report\nMa...
1,Software Engineer,Software Engineer (Nucleus Graduate Programme),NCS,https://sg.indeed.com/viewjob?jk=714d3347ae7782b0,Will you be part of the extraordinary? We're ...
2,Software Engineer,Software Engineer (TypeScript) Adrestia (Remote),Cord,https://sg.indeed.com/viewjob?jk=02da0f57bbf32cfb,"IO Global, creator of the Cardano blockchain p..."
3,Software Engineer,Software Engineer - Haskell (Remote),Cord,https://sg.indeed.com/viewjob?jk=26da81ef7d13c127,IO Global is searching for a Senior Software E...
4,Software Engineer,Software Engineer (Dev) Scala - Sidechains (Re...,Cord,https://sg.indeed.com/viewjob?jk=f6eaec70aa994958,"IO Global, developer of the Cardano blockchain..."
...,...,...,...,...,...
100,Data Analyst,Data Analysis Support Officer,RMA CONTRACTS PTE. LTD.,https://sg.indeed.com/viewjob?jk=6f7fdca176e451c4,RESPONSIBILITIES \n\n\nData Processing \n\nCol...
101,Data Analyst,[GOVT] Data Analyst (Entry Level) - JL,BGC GROUP PTE. LTD.,https://sg.indeed.com/viewjob?jk=740ab533f5952718,"Job Type : Contract, Entry Level \nContract Du..."
102,Data Analyst,Data Analyst,Publicis Groupe,https://sg.indeed.com/viewjob?jk=b2dcf92887c34c30,Company Description\n Publicis Media is one o...
103,Data Analyst,Data Analyst Intern #GeneralInternship,Singtel,https://sg.indeed.com/viewjob?jk=edf81911b0c87c66,Data Analyst Intern #GeneralInternship\n\n\n\n...
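
The description extraction can also be exercised on static HTML. The sketch below is a simplified alternative, not the notebook's exact logic: it uses BeautifulSoup's `get_text` to collect all nested text in one call and returns an empty string when the description block is absent. `extract_description` is a hypothetical helper and both sample pages are made up. Unlike the cell above, it collapses line breaks inside the description into single spaces.

```python
from bs4 import BeautifulSoup

def extract_description(html):
    """Hypothetical helper: return the description text, or '' if the block is absent."""
    soup = BeautifulSoup(html, "html.parser")
    block = soup.find("div", class_="jobsearch-jobDescriptionText")
    if block is None:
        return ""
    # get_text joins the text of all nested nodes; strip=True trims each piece
    return block.get_text(" ", strip=True)

# Made-up pages for illustration
with_block = '<div class="jobsearch-jobDescriptionText"><p>Build reports</p><ul><li>SQL</li></ul></div>'
without_block = "<div>Page not found</div>"

print(repr(extract_description(with_block)))
print(repr(extract_description(without_block)))
```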


In [12]:
# Check whether the distribution of scraped roles meets the target.
df['Role'].value_counts()

Software Engineer    28
Data Analyst         27
Data Scientist       25
Business Analyst     25
Name: Role, dtype: int64

In [13]:
# Note: passing index=False to to_csv leaves the index column out of the file
df.to_csv("Web_Scraping_210896X.csv")
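
The effect of the `index` parameter can be seen with an in-memory round trip. This is a small sketch with made-up data: when the index is written, reading the file back produces an extra `Unnamed: 0` column, and `index=False` avoids it. `round_trip_columns` is a hypothetical helper name.

```python
import io
import pandas as pd

sample = pd.DataFrame({"Role": ["Data Analyst"], "Title": ["Data Analyst Intern"]})

def round_trip_columns(frame, **to_csv_kwargs):
    """Write the frame to an in-memory CSV, read it back, and return the column names."""
    buffer = io.StringIO()
    frame.to_csv(buffer, **to_csv_kwargs)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)

print(round_trip_columns(sample))               # index written as an unnamed column
print(round_trip_columns(sample, index=False))  # index left out
```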