# Project 4: Web Scraping Job Postings

### Factors that impact salary
    To predict salary, you will build either a classification or a regression model using features such as the location, title, and summary of the job. If you frame this as a regression problem, you will estimate the listed salary amounts. If you frame it as a classification problem instead, you will create labels from these salaries (for example, high vs. low salary) according to a threshold such as the median salary.
    
    Predictors: Location (Use NLP to extract Country), Title, Job Summary
    Target: Salary Amount (Regression) / Salary Category (Classification)

### Factors that distinguish job category
    Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:

    What components of a job posting distinguish data scientists from other data jobs?
    
    Predictors: To be determined
    Target: Data Scientist Or Not 
    
    What features are important for distinguishing junior vs. senior positions?
    
    Predictors: To be determined
    Target: Junior / Senior Position [Position Level]
    
    Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?
    
    Predictors: Title Features
    Target: Job Category (id=job-categories)
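
The first two targets above could be derived from the job title alone. The sketch below assumes simple keyword rules; the keyword lists and the function names `is_data_scientist` and `position_level` are illustrative assumptions, not part of the project brief.

```python
import re

# Assumed keyword lists; adjust them after inspecting the scraped titles
SENIOR_PATTERN = re.compile(r'\b(senior|lead|principal|manager|head)\b', re.IGNORECASE)
JUNIOR_PATTERN = re.compile(r'\b(junior|assistant|associate|intern|trainee)\b', re.IGNORECASE)


def is_data_scientist(title):
    """Return True if the title mentions 'data scientist' (case-insensitive)."""
    return 'data scientist' in title.lower()


def position_level(title):
    """Return 'senior', 'junior', or 'unknown' based on keywords in the title."""
    if SENIOR_PATTERN.search(title):
        return 'senior'
    if JUNIOR_PATTERN.search(title):
        return 'junior'
    return 'unknown'


print(is_data_scientist('Senior Data Scientist'), position_level('Senior Data Scientist'))
```

Titles that match neither list are labelled `'unknown'` so they can be dropped or handled separately rather than silently assigned to a class.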
    

In [1]:
# Define Search Terms: data scientist, data analyst, research scientist, business intelligence

### Let's select a website to crawl

| Website         | Location (NWES) | Salary  | HTML class structure |
|-----------------|-----------------|---------|----------------------|
| JobCentral      | Absent          | Partial | Messy                |
| MyCareersFuture | Present         | Partial | OK                   |
| CareersGov      | Absent          | Absent  | Messy                |



In [2]:
# import requests

# url_front = 'https://www.mycareersfuture.sg/search?search='
# url_back = '&sortBy=new_posting_date&page='

# search_terms = ['data scientist', 'data analyst']
# for search_term in search_terms:
#     response = requests.get(url_front+search_term.replace(' ', '%20')+url_back+str(0)).text
#     soup = BeautifulSoup(response)
#     print(soup)
#     print(soup.find_all('a'))
# # https://www.mycareersfuture.sg/search?search=data%20analyst&sortBy=new_posting_date&page=0
# The page is rendered dynamically, so requests alone does not return the listings; use Selenium instead.
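
The search URL pattern in the comment above can also be built with `urllib.parse.quote`, which percent-encodes spaces and any other unsafe characters in the search term. This is a small sketch; the helper name `build_search_url` is hypothetical.

```python
from urllib.parse import quote

URL_FRONT = 'https://www.mycareersfuture.sg/search?search='
URL_BACK = '&sortBy=new_posting_date&page='


def build_search_url(search_term, page_number):
    """Build the search results URL for one search term and page number."""
    return URL_FRONT + quote(search_term) + URL_BACK + str(page_number)


print(build_search_url('data analyst', 0))
```

Unlike a manual `replace(' ', '%20')`, `quote` also handles characters such as `&` or `+` if they ever appear in a search term.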

In [3]:
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup

url_front = 'https://www.mycareersfuture.sg/search?search='
url_back = '&sortBy=new_posting_date&page='

search_terms = ['data scientist', 'data analyst', 'research scientist', 'business intelligence', 'software developer', 'project manager', 'accountant']

job_links = []  # [search_term, relative job URL] pairs

for search_term in search_terms:
    for page_number in range(0, 100):  # assume at most 100 result pages per term
        n_links = 0
        # Selenium 3 call style; Selenium 4 passes the driver path through a Service object
        browser = webdriver.Chrome('./chromedriver')
        browser.get(url_front + search_term.replace(' ', '%20') + url_back + str(page_number))
        sleep(5)  # give the dynamic page time to render
        soup = BeautifulSoup(browser.page_source, 'html.parser')
        browser.close()
        all_listing = soup.find('div', attrs={'class': 'card-list'})
        if all_listing is None:
            break
        for link in all_listing.find_all('a', href=True):
            if '/job/' in link['href']:  # keep only links to job postings
                print(link['href'])
                job_links.append([search_term, link['href']])
                n_links += 1
        if n_links == 0:  # no listings on this page, so stop paging
            break

/job/data-scientist-jabil-circuit-51ea2ce5b0eedff231dc658097a05543
/job/data-scientist-hewlett-packard-enterprise-singapore-4cd401024939234386449a21c78de05b
/job/data-scientist-smartsoft-c605a552bf2a0ae3caad295b5a238394
/job/data-scientist-schellden-global-services-2cce3a0fd90690d4b4505d77e0eff1e7
/job/data-scientist-schellden-global-7b446f16f1967685b15c4d50565d63bd
/job/data-scientist-siemens-f76d3f9f081a63cff0d2e4319ad98d58
/job/data-scientist-graebel-apac-center-295662cfb61e5e92694f9dbd081bc1c0
/job/data-scientist-mannhummel-ventures-9ffef96084b9ec6e133c6bee88026928
/job/data-scientist-sensorflow-1e90b327703a5b42c6961ef23e2994f1
/job/data-scientist-cisco-systems-ab78825f2f9666e3ea8ac0c296a1476d
/job/data-scientist-singapore-telecommunications-521d6cc01d708764e5a131f3c18b3a47
/job/data-scientist-sopra-steria-asia-02a1196b078ee709327436f183222379
/job/research-associat-nanyang-technological-university-ad08206598803824ebf0cf7780ecf07c
/job/data-scientist-government-technology-agency-f4

/job/senior-manager-irisnation-singapore-eedf3e4ed9b4a1109dede573c065e4c7
/job/senior-manager-irisnation-singapore-6e14524c2a61703ef7a0d12144f5474e
/job/risk-data-analyst-cac40e8808a6a0c4112864a0633e3778
/job/data-analyst-sbs-transit-80bfa22b3640dd3083526ddc05ad28d3
/job/quality-data-analyst-adecco-personnel-34f4175aa229b0e405a44ef9f88ff16c
/job/data-analyst-adecco-personnel-53aa81b4703a7168cfd3288ad3625e7b
/job/market-analyst-grabtaxi-holdings-962b6b110b66841082656a7f6fc8c7eb
/job/digital-data-analyst-kerry-consulting-7b9dca9acb10200b5152f48c6e23d25d
/job/data-analyst-transformation-project-chandler-macleod-group-24926b459347ba55308cd4fa6b57d7b5
/job/commercial-data-analyst-gmp-recruitment-services-791c9b319e8fc26ee88a691f7f4867c9
/job/senior-data-analyst-99-a17b2cb30afd62ddc9d73df3948816a8
/job/data-analyst-optimum-solutions-d8fc06c411925aea9449ce92e0e77cc1
/job/data-analyst-database-analyst-groupm-asia-pacific-holdings-631ebc36a74a85ee710dd711480e199d
/job/data-analyst-mindtree-f03f

/job/research-associate-scientist-prestige-bioresearch-6ac699dfb3abfb14f25bd125181f03fb
/job/protein-scientist-immunoscape-e5fd15ad910d2ac72ceb120c584d082b
/job/research-scientist-ihpc-astar-research-entities-553d5867006df80835242c3884d44e4d
/job/research-scientist-chugai-pharmabody-research-c0e4fc6d961bd2e3caf5376dc9f06410
/job/research-engineer-scientist-tcoms-technology-centre-offshore-marine-singapore-4a35245e8d555a73c361e54c5d10ee14
/job/scientist-simtech-astar-research-entities-f3f021f59687bf97df4b51af5bc66955
/job/research-engineer-astar-research-entities-3a3c70eb175dd9ae8a2f7e8e44fa9c60
/job/postdoctoral-associate-research-scientist-freight-behavioural-models-singapore-mit-alliance-research-technology-centre-be2e080dc9a500d02d1a1b848e5e6da9
/job/sics-research-fellow-astar-research-entities-dcf3f29b9509239f9bc097b53cf4b7bb
/job/ux-developer-%E2%80%93-european-mnc-globesoft-services-afb72171e824598909768b9fae224075
/job/research-associate-scientist-prestige-bioresearch-129f82c0d2

/job/business-intelligence-intropls-bdf09d4508beac710e9323c1adf31ce3
/job/fpa-manager-international-baccalaureate-organization-751ae87b2f17d1f2d39cb5a36a164b33
/job/asean-business-operations-analyst-amazon-web-services-singapore-ce98f60c448c9c55c2ce00f175e96b22
/job/lead-senior-systems-analyst-bi-reporting-analytics-m1-d02462ead0e335428f20f3933f490938
/job/application-support-analyst-titansoft-3cac59d0ac0db7a4820c1087ab6bf096
/job/cross-team-editor-%E2%80%93-singapore-blackpeak-f8b0192b476d7aede924b09c7f89356b
/job/business-development-manager-aspire-global-network-e85ab53b009a888aa10beb0cade5ad70
/job/regional-market-manager-gmp-recruitment-services-6ce004f1501048e80beb27add8b1bd09
/job/regional-senior-business-analyst-sephora-asia-678b2307342186a13bf79ebe7519ef43
/job/senior-business-intelligence-analyst-alphatech-business-solutions-e8d59e776a022c474d8eca460576a5e3
/job/business-intelligence-consultant-oneaston-ad41988150f2ab35538db4631f5acaa0
/job/adobe-google-analytics-developer-al

/job/construction-manager-tiong-seng-civil-engineering-9b9cdc42627fe6a16ff5a9541cdd2d89
/job/project-engineer-suez-water-technologies-solutions-singapore-7a33ad147762b43fb5a1effcb914f2d5
/job/project-engineer-active-fire-protection-systems-726d84b059c016e730258f7ebb1c49fc
/job/project-engineer-teambuild-engineering-construction-3ca78e88872e6945ed9f147225194129
/job/senior-lighting-designer-lightbox-04b315efd70f4b4a51e386ca900b42de
/job/wsho-cum-eco-h-p-construction-engineering-18c824d9d90d297eb38c64475e59cefb
/job/executive-assistant-regional-leader-senior-director-operations-world-vision-international-a93bd025c5add3afd50efab472a6d3b3
/job/application-developer-3yrs-exp-api-web-services-direct-asia-management-services-ce0ca9ea6e531974ac14198caa78479c
/job/avp-kyc-fcc-consultant-robert-walters-d4eb02d10a5d222f5296c824cf338b79
/job/business-analyst-murex-specialist-global-treasury-oversea-chinese-banking-corporation-2447fbe7e054ff393521697b1d1b6315
/job/site-engineer-vigcon-construction-

/job/admin-cum-account-5-days-2800-3000-clarke-quay-senior-level-admin-supreme-hr-advisory-7bd17397dcce6db5ee9a954942262a2d
/job/accounts-officer-machspeed-human-resources-fc203968a206e99cad3b248e6abd642b
/job/accounts-executive-tiong-woon-crane-transport-ad5e75b5914e7cdf2459dd3be3406f95
/job/accounts-executive-hai-leck-engineering-4597b55853380f1b862543835459a496
/job/accounting-officer-haulio-1e2e37dcde45c3597cbdbef1330f2527
/job/accounts-luminous-dental-clinic-tampines-plaza-a8c5faf8ae203b54549937db940067df
/job/regional-finance-controller-iss-asia-pacific-bdd9f12a1e596a2dc3872497c83298b4
/job/consolidation-accountant-go-jek-singapore-25dc6a233d4f833ca24a8d1b3e2c0512
/job/senior-accountant-fujifilm-asia-pacific-328ababb532f40290470f9d84ec474b1
/job/supervising-accountant-assistant-accounting-manager-accounting-manager-pkf-cap-corporate-services-2f409c9d35b3581bf0e9db421416149c
/job/fund-accountant-bluechip-platforms-asia-6c4ca8241f5d410bd6683bcc1acc51cd
/job/accountant-alpha-manpowe

/job/senior-specialist-accountant-hays-specialist-recruitment-dac3cbbf54521142a6ca41dd5a9b8010
/job/temporary-accountant-erni-asia-holding-3fd7a0213d3d643615942bfa1a704543
/job/global-accountant-jtron-a0d35e0762370627f30ed17d4a887979
/job/accountant-dyna-mac-engineering-services-dae3cbc7fcc6c926e15e023415b99345
/job/senior-executive-finance-singapore-sports-hub-global-spectrum-pico-01b057f914aefc67eedceef881c5da2d
/job/accounts-executive-mothers-work-23a0bd8331740f315e71ba4f22a93610
/job/accounts-executive-jpn-industrial-trading-691c2aa8096ba4f3b3d7fdb9e84fa85c
/job/accounts-assistant-kgm-brothers-contractors-8da2f18ac063dfcad65c0fa88ad9a8a5
/job/accounts-executive-thye-hua-kwan-moral-charities-d24bc2f758b447655035123470231a5b
/job/accounts-executive-laguna-national-golf-country-club-726d0a7d66eafd214cfc1ad993e8c5b5
/job/senior-accountant-chartered-765d3a62863453d84eb6bf061f138ac5
/job/accounts-audit-semi-senior-657fd373daff80ae98a3bf4d11926cea
/job/technical-accountant-beathchapman-af

In [4]:
print(len(job_links))

757
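
Because several search terms are crawled, the same posting could in principle be collected more than once. The sketch below shows one way to drop repeated URLs while keeping the first search term that found each posting; the function name `deduplicate_links` and the example links are hypothetical, and whether duplicates actually occur in this crawl has not been checked here.

```python
def deduplicate_links(job_links):
    """Keep the first [search_term, href] pair seen for each href."""
    seen = set()
    unique_links = []
    for search_term, href in job_links:
        if href not in seen:
            seen.add(href)
            unique_links.append([search_term, href])
    return unique_links


# Hypothetical links, for illustration only
example_links = [
    ['data scientist', '/job/example-a'],
    ['data analyst', '/job/example-a'],
    ['data analyst', '/job/example-b'],
]
print(len(deduplicate_links(example_links)))
```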


In [5]:
job_links[:4]

[['data scientist',
  '/job/data-scientist-jabil-circuit-51ea2ce5b0eedff231dc658097a05543'],
 ['data scientist',
  '/job/data-scientist-hewlett-packard-enterprise-singapore-4cd401024939234386449a21c78de05b'],
 ['data scientist',
  '/job/data-scientist-smartsoft-c605a552bf2a0ae3caad295b5a238394'],
 ['data scientist',
  '/job/data-scientist-schellden-global-services-2cce3a0fd90690d4b4505d77e0eff1e7']]

In [6]:
job_links[0][1]

'/job/data-scientist-jabil-circuit-51ea2ce5b0eedff231dc658097a05543'

In [7]:
job_url_front = 'https://www.mycareersfuture.sg'
data_store = []


def find_text(soup, tag, attrs):
    """Return the text of the first matching element, or None if it is missing."""
    element = soup.find(tag, attrs=attrs)
    return element.text if element is not None else None


for job_link in job_links:
    browser = webdriver.Chrome('./chromedriver')
    browser.get(job_url_front + job_link[1])
    sleep(3)  # give the dynamic page time to render
    soup = BeautifulSoup(browser.page_source, 'html.parser')
    browser.close()

    job_title = find_text(soup, 'h1', {'id': 'job_title'})
    job_categories = find_text(soup, 'p', {'id': 'job-categories'})
    job_location = find_text(soup, 'a', {'href': '#location_map'})
    job_employment_type = find_text(soup, 'p', {'id': 'employment_type'})
    job_seniority = find_text(soup, 'p', {'id': 'seniority'})
    job_last_posted_date = find_text(soup, 'span', {'id': 'last_posted_date'})
    job_expiry_date = find_text(soup, 'span', {'id': 'expiry_date'})
    job_description = find_text(soup, 'div', {'id': 'description-content'})
    job_company_name = find_text(soup, 'p', {'name': 'company'})
    job_company_info = find_text(soup, 'div', {'data-cy': 'companyinfo-writeup'})
    job_requirement = find_text(soup, 'div', {'id': 'requirements-content'})

    # Salary: parse the "min to max" range and convert annual figures to monthly
    min_salary = None
    max_salary = None
    salary_range = find_text(soup, 'span', {'class': 'salary_range dib f2-5 fw6 black-80'})
    salary_type = find_text(soup, 'div', {'class': 'salary tr-l'})
    if salary_range is not None:
        try:
            low, high = salary_range.replace('$', '').replace(',', '').split('to')
            min_salary, max_salary = int(low), int(high)
        except ValueError:
            pass  # unexpected format: leave both salaries as None
    if min_salary is not None and salary_type is not None:
        is_annual = 'Annual' in salary_type
        print('salary_type:', is_annual)
        if is_annual:
            min_salary = min_salary / 12
            max_salary = max_salary / 12

    # Column order: search_term, job_url, job_title, job_categories, job_location,
    # job_employment_type, job_seniority, job_last_posted_date, job_expiry_date,
    # job_description, job_company_name, job_company_info, job_requirement,
    # min_salary, max_salary
    data_per_job = [
        job_link[0],
        job_link[1],
        job_title,
        job_categories,
        job_location,
        job_employment_type,
        job_seniority,
        job_last_posted_date,
        job_expiry_date,
        job_description,
        job_company_name,
        job_company_info,
        job_requirement,
        min_salary,
        max_salary
    ]
    data_store.append(data_per_job)

salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: True
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: True
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: True
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: Fa

salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: True
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: False
salary_type: 
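
The salary handling in the cell above can be checked in isolation, without a browser. The sketch below assumes the salary text looks like `'$3,000to$6,000'` once the page markup is stripped; that input format and the function name `parse_salary_range` are assumptions for illustration.

```python
def parse_salary_range(salary_text, is_annual=False):
    """Convert text like '$3,000to$6,000' into (min, max) monthly amounts.

    Returns (None, None) when the text does not contain exactly two integers
    separated by 'to'.
    """
    cleaned = salary_text.replace('$', '').replace(',', '')
    parts = cleaned.split('to')
    if len(parts) != 2:
        return None, None
    try:
        low, high = int(parts[0]), int(parts[1])
    except ValueError:
        return None, None
    if is_annual:
        # Annual figures are divided by 12 so all salaries are monthly
        low, high = low / 12, high / 12
    return low, high


print(parse_salary_range('$3,000to$6,000'))
print(parse_salary_range('$60,000to$120,000', is_annual=True))
```

Returning `(None, None)` for anything unparseable keeps half-converted values (for example, an integer minimum with a string maximum) out of the dataset.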

In [12]:
len(data_store)

757

In [13]:
import pandas as pd
column_names = ['search_term', 'job_url', 'job_title', 'job_categories', 'job_location', 'job_employment_type', 'job_seniority','job_last_posted_date', 'job_expiry_date', 'job_description', 'job_company_name', 'job_company_info', 'job_requirement', 'min_salary', 'max_salary']
df = pd.DataFrame(data_store, columns=column_names)
df.head()

Unnamed: 0,search_term,job_url,job_title,job_categories,job_location,job_employment_type,job_seniority,job_last_posted_date,job_expiry_date,job_description,job_company_name,job_company_info,job_requirement,min_salary,max_salary
0,data scientist,/job/data-scientist-jabil-circuit-51ea2ce5b0ee...,Data Scientist,"Information Technology, Manufacturing",16 TAMPINES INDUSTRIAL CRESCENT 528604,"Permanent, Full Time",Executive,Posted 25 Jan 2019,Closing on 24 Feb 2019,JOB SUMMARY The Data Scientist I is a statisti...,JABIL CIRCUIT (SINGAPORE) PTE. LTD.,Jabil is one of the world’s largest electronic...,"● Advanced Statistics, operations rese...",3000.0,6000.0
1,data scientist,/job/data-scientist-hewlett-packard-enterprise...,Data Scientist,Consulting,1 DEPOT CLOSE 109841,Permanent,Professional,Posted 25 Jan 2019,Closing on 24 Feb 2019,Hewlett Packard Enterprise is an industry lead...,HEWLETT PACKARD ENTERPRISE SINGAPORE PTE. LTD.,\n\tHEWLETT PACKARD ENTERPRISE SINGAPORE PTE. ...,Responsibilities Collaborate with business pa...,5100.0,10000.0
2,data scientist,/job/data-scientist-smartsoft-c605a552bf2a0ae3...,Data Scientist,Information Technology,"INTERNATIONAL PLAZA, 10 ANSON ROAD 079903",Full Time,Senior Executive,Posted 25 Jan 2019,Closing on 24 Feb 2019,We are looking for highly motivated Individual...,SMARTSOFT PTE. LTD.,SMARTSOFT PTE. LTD. Smartsoft is a leading pr...,• Masters/Bachelor degree in Statistics/...,6000.0,11000.0
3,data scientist,/job/data-scientist-schellden-global-services-...,Data Scientist,Information Technology,"INTERNATIONAL PLAZA, 10 ANSON ROAD 079903",Full Time,Senior Executive,Posted 25 Jan 2019,Closing on 24 Feb 2019,We are looking for highly motivated Individual...,SCHELLDEN GLOBAL SERVICES,Schellden is a fast growing company specialise...,• Masters/Bachelor degree in Statistics/...,6000.0,11000.0
4,data scientist,/job/data-scientist-schellden-global-7b446f16f...,Data Scientist,Information Technology,"INTERNATIONAL PLAZA, 10 ANSON ROAD 079903",Full Time,Senior Executive,Posted 25 Jan 2019,Closing on 24 Feb 2019,We are looking for highly motivated Individual...,SCHELLDEN GLOBAL PTE. LTD.,Schellden Global Pte Ltd is Provider of IT Sol...,• Masters/Bachelor degree in Statistics/...,6000.0,11000.0
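
The `job_last_posted_date` and `job_expiry_date` columns hold strings such as `Posted 25 Jan 2019` and `Closing on 24 Feb 2019`. If they are needed as features, they could be converted to dates along these lines. This is a sketch that assumes the date is always the last three whitespace-separated tokens and that month names are English abbreviations; the helper name `parse_posting_date` is hypothetical.

```python
from datetime import datetime


def parse_posting_date(text):
    """Extract a date from text such as 'Posted 25 Jan 2019'."""
    # Assumes the date is the last three whitespace-separated tokens
    day_month_year = ' '.join(text.split()[-3:])
    return datetime.strptime(day_month_year, '%d %b %Y').date()


posted = parse_posting_date('Posted 25 Jan 2019')
closing = parse_posting_date('Closing on 24 Feb 2019')
days_open = (closing - posted).days
print(posted, closing, days_open)
```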


In [14]:
df.to_csv('data_req.csv')

In [15]:
df[df['min_salary'] > 100000]

Unnamed: 0,search_term,job_url,job_title,job_categories,job_location,job_employment_type,job_seniority,job_last_posted_date,job_expiry_date,job_description,job_company_name,job_company_info,job_requirement,min_salary,max_salary
