<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 4: Web Scraping Job Postings

Done by: Darren Siow

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

Determine the industry factors that are most important in predicting the salary amounts for these data.
Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?
To limit the scope, your principal has suggested that you focus on data-related job postings, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by limiting your search to a single region.

In [1]:
import time
import re
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from bs4 import BeautifulSoup
from scrapy.selector import Selector
import pandas as pd
import numpy as np

In [2]:
# Setting the webdriver path
webdriver_path = "C://Users/dsiow/Desktop/chromedriver_win32/chromedriver.exe"

# Set page load timeout
timeout = 6

# Creating a new Chrome session
driver = webdriver.Chrome(executable_path=webdriver_path)

# Defining the url for data scientist jobs
url = "https://www.mycareersfuture.sg/search?search=machine%20learning&sortBy=new_posting_date&page=14"

# Navigating to the webpage
driver.get(url)
time.sleep(10)

# Close the pop-up if it appears
try:
    button = driver.find_element_by_xpath('//*[@class="tr pointer OverlayNavigation__icon-cross___1wfSE"]')
    button.click()
    time.sleep(timeout)
except:
    pass

main_webpage = "https://www.mycareersfuture.sg"

# Initializing a pandas dataframe that will hold the scraped job information
df = pd.DataFrame(columns = ['title','company','employment_type','job_level','category',
                             'salary_lower','salary_upper','salary_time_interval','roles','requirements'])

# Initializing lists that will hold the scraped job information
title = []
company = []
employment_type = []
job_level = []
category = []
salary_lower = []
salary_upper = []
salary_time_interval = []
roles = []
requirements = []

# Initializing a job_card_count variable to keep track of the number of job cards scraped
job_card_count = 0

# Initializing a script_run variable that is True by default, and we be changed to False when the scraping is completed
script_run = True

while(script_run == True):
    # Selenium hands the page source to BeautifulSoup - so that BeautifulSoup can look for the JS link for each job title
    search_result_soup = BeautifulSoup(driver.page_source, 'lxml')
    result_links = search_result_soup.findAll(name='div', attrs={'class':'card relative'})
    
    # result_link contains the html content for each job card
    for result_link in result_links:

        # Finding the 'a' tag will lead us to the JavaScript element (the link that leads to more comprehensive job info)
        job_title_link = result_link.find('a')

        # Getting the link path that leads us to more comprehensive job info
        job_url = job_title_link.get('href')

        # Opening the comprehensive job info link
        sub_driver = webdriver.Chrome(executable_path=webdriver_path)
        sub_driver.get(main_webpage + job_url)

        # Sleep for a while to give the page some time to load
        time.sleep(timeout)

        # Selenium hands over the html content of the comprehensive job info page to BeautifulSoup
        job_details_soup = BeautifulSoup(sub_driver.page_source, 'lxml')

        # job_info contains information on job title, salary, company name, etc.
        job_info = job_details_soup.find('div', attrs={'class':'jobInfo'})
        
        # job_body contains information on roles & responsibilities and requirements
        job_body = job_details_soup.find('div', attrs={'class':'jobDescription w-100 v-top relative'})

        # Getting the job title, company name, employment type
        title.append(job_info.find('h1', attrs={'id':'job_title'}).text)
        company.append(job_info.find('p', attrs={'name': 'company'}).text)

        # If employment type is unavailable..
        try:
            employment_type.append(job_info.find('p', attrs={'id':'employment_type'}).text)
        except:
            employment_type.append(np.nan)

        # If job_level is unavailable..
        try:
            job_level.append(job_info.find('p', attrs={'id':'seniority'}).text)
        except:
            job_level.append(np.nan)

        # If category is unavailable..
        try:
            category.append(job_info.find('p', attrs={'id':'job-categories'}).text)
        except:
            category.append(np.nan)

        # Getting the salaries, and the salary time intervals
        # If there are undisclosed salaries...
        try:
            job_salary_raw = job_info.findAll('div', attrs={'class':'lh-solid'})[0]
        except:
            salary_lower.append(np.nan)
            salary_upper.append(np.nan)
            salary_time_interval.append(np.nan)
        else:
            # Getting the salaries
            job_salary_raw = job_info.findAll('div', attrs={'class':'lh-solid'})[0]

            # Getting the salary(lower end)
            job_salary_raw_lower = job_salary_raw.text.split('to')[0]
            job_salary_lower = int(''.join(e for e in job_salary_raw_lower if e.isalnum()))
            salary_lower.append(job_salary_lower)

            # Getting the salary(upper end)
            job_salary_raw_upper = job_salary_raw.text.split('to')[1]
            job_salary_upper = int(''.join(e for e in job_salary_raw_upper if e.isalnum()))
            salary_upper.append(job_salary_upper)

            # Getting the salary time interval
            job_salary_time_interval = job_info.find('span', attrs={'class':"salary_type dib f5 fw4 black-60 pr1 i pb"}).text
            salary_time_interval.append(job_salary_time_interval)

        # Getting the roles and responsibilities
        try:
            roles_resp = job_details_soup.find('div', {'class':'f5 fw4 black-70 lh-copy break-word'}).text
            roles.append(roles_resp)
        except:
            roles.append(np.nan)

        # Getting the requirements
        try:
            req = job_details_soup.find('div', {'id':'requirements'}).text
            requirements.append(roles_resp)
        except:
            requirements.append(np.nan)
        
        job_card_count += 1
        print('There are {} jobs scraped so far'.format(job_card_count))
        print('title {}, company {}, employment_type {}, job_level {}, category {}, salary_lower {}, salary_upper {}, \
        salary_time_interval {}, roles {}, requirements {}'.format(len(title), len(company), len(employment_type),\
                                                                  len(job_level), len(category), len(salary_lower), \
                                                                  len(salary_upper), len(salary_time_interval), len(roles),\
                                                                  len(requirements)))
        
        # Closing sub_driver
        sub_driver.close()

    if(search_result_soup.findAll("div", {"class":"tc pv3"})[0].text.find('❯') != -1):
        button = driver.find_element_by_xpath('//*[@type="action"][last()]')
        button.click()
        time.sleep(timeout)
    else:
        print('Reached end of the search result...')
        script_run = False



There are 1 jobs scraped so far
title 1, company 1, employment_type 1, job_level 1, category 1, salary_lower 1, salary_upper 1,         salary_time_interval 1, roles 1, requirements 1
There are 2 jobs scraped so far
title 2, company 2, employment_type 2, job_level 2, category 2, salary_lower 2, salary_upper 2,         salary_time_interval 2, roles 2, requirements 2
There are 3 jobs scraped so far
title 3, company 3, employment_type 3, job_level 3, category 3, salary_lower 3, salary_upper 3,         salary_time_interval 3, roles 3, requirements 3
There are 4 jobs scraped so far
title 4, company 4, employment_type 4, job_level 4, category 4, salary_lower 4, salary_upper 4,         salary_time_interval 4, roles 4, requirements 4
There are 5 jobs scraped so far
title 5, company 5, employment_type 5, job_level 5, category 5, salary_lower 5, salary_upper 5,         salary_time_interval 5, roles 5, requirements 5
There are 6 jobs scraped so far
title 6, company 6, employment_type 6, job_level

In [3]:
print(len(title))
print(len(company))
print(len(employment_type))
print(len(job_level))
print(len(category))
print(len(salary_lower))
print(len(salary_upper))
print(len(salary_time_interval))
print(len(roles))
print('-'*100)
print(len(requirements))

28
28
28
28
28
28
28
28
28
----------------------------------------------------------------------------------------------------
28


In [4]:
df['title'] = title
df['company'] = company
df['employment_type'] = employment_type
df['job_level'] = job_level
df['category'] = category
df['salary_lower'] = salary_lower
df['salary_upper'] = salary_upper
df['salary_time_interval'] = salary_time_interval
df['roles'] = roles
df['requirements'] = requirements

df

Unnamed: 0,title,company,employment_type,job_level,category,salary_lower,salary_upper,salary_time_interval,roles,requirements
0,Research Fellow,NANYANG TECHNOLOGICAL UNIVERSITY,"Contract, Full Time",Professional,Sciences / Laboratory / R&D,4000.0,5500.0,Monthly,There is a postdoctoral position in condensed ...,There is a postdoctoral position in condensed ...
1,AA RPA Consultant,TOSS-EX PTE. LTD.,Contract,Executive,Information Technology,6000.0,7500.0,Monthly,At least 1-2 years of experience on developin...,At least 1-2 years of experience on developin...
2,Senior Principal Investigator / Principal In...,A*STAR RESEARCH ENTITIES,"Contract, Full Time",Non-executive,Sciences / Laboratory / R&D,7100.0,14200.0,Monthly,About Singapore Institute for Clinical Science...,About Singapore Institute for Clinical Science...
3,Senior Data Scientist,RESMED ASIA PTE. LTD.,"Permanent, Full Time",Senior Executive,"Engineering, Sciences / Laboratory / R&D",10000.0,14000.0,Monthly,Senior Data Scientist The Senior Data Scientis...,Senior Data Scientist The Senior Data Scientis...
4,Data Sciences & Analytics Engineer,SINGAPORE AIRLINES LIMITED,Permanent,"Professional, Executive",Information Technology,4000.0,8000.0,Monthly,SIA has multiple positions for junior and seni...,SIA has multiple positions for junior and seni...
5,Data Scientist,ALPHATECH BUSINESS SOLUTIONS PTE. LTD.,"Permanent, Full Time",Professional,Information Technology,5500.0,8000.0,Monthly,· Use machine learning and analytical te...,· Use machine learning and analytical te...
6,"Senior Specialist / Manager, CS Process & Project",APL CO. PTE LTD,Permanent,Senior Executive,Others,6000.0,8000.0,Monthly,BRIEF DESCRIPTION This position reports to the...,BRIEF DESCRIPTION This position reports to the...
7,"Senior Specialist / Manager, CS Process & Project",APL CO. PTE LTD,Permanent,Senior Executive,Others,4500.0,6000.0,Monthly,BRIEF DESCRIPTION This position reports to the...,BRIEF DESCRIPTION This position reports to the...
8,"Program Manager, Business Integrity",FACEBOOK SINGAPORE PTE. LTD.,Full Time,Professional,Advertising / Media,6000.0,12000.0,Monthly,Business Integrity's mission is to ensure safe...,Business Integrity's mission is to ensure safe...
9,Data Scientist (JD#4567),SCIENTE INTERNATIONAL PTE. LTD.,"Contract, Full Time",Professional,Banking and Finance,,,,An excellent opportunity to gain experience in...,An excellent opportunity to gain experience in...


In [5]:
df.to_csv('./careersfuture_ML_part2.csv', index = False)