## Project 4

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
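
As a minimal sketch of the classification framing, the snippet below parses salary strings in the `$5,000to$7,000` form that the scraper in this notebook returns and labels each listing as high (1) or low (0) relative to the median. The helper names `parse_salary_range` and `label_high_low` are illustrative, and using the midpoint of each range is an assumption, not a requirement of the project.

```python
import re
from statistics import median


def parse_salary_range(text):
    """Return (low, high) as ints from a string like '$5,000to$7,000', else None."""
    if not isinstance(text, str):
        return None  # the scraper stores 0 when no salary is listed
    amounts = [int(a.replace(",", "")) for a in re.findall(r"\$(\d[\d,]*)", text)]
    return (amounts[0], amounts[1]) if len(amounts) == 2 else None


def label_high_low(salary_texts):
    """Label each parseable salary 1 (above the median midpoint) or 0 (at or below)."""
    midpoints = {}
    for i, text in enumerate(salary_texts):
        parsed = parse_salary_range(text)
        if parsed is not None:
            midpoints[i] = sum(parsed) / 2  # assumption: midpoint represents the range
    threshold = median(midpoints.values())
    return {i: int(m > threshold) for i, m in midpoints.items()}, threshold


raw = ["$5,000to$7,000", "$2,200to$6,000", "$3,500to$4,500", 0, "$9,000to$12,000"]
labels, threshold = label_high_low(raw)
print(threshold)  # median of the four midpoints
print(labels)     # listing index -> 1 (high) or 0 (low); index 3 has no salary
```

Listings without a parseable salary are left out of the labels, which keeps them available as the set to predict later.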

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.
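
One simple baseline for the listings without a salary, before fitting any model, is to fill in the median salary of comparable listings. The sketch below does this per seniority level. The record layout, the numbers, and the `impute_by_group` helper are all hypothetical; a real solution would compare a trained model against a baseline like this one.

```python
from collections import defaultdict
from statistics import median


def impute_by_group(records, group_key="seniority", target="salary"):
    """Fill missing targets with the group median, falling back to the overall median."""
    by_group = defaultdict(list)
    for rec in records:
        if rec[target] is not None:
            by_group[rec[group_key]].append(rec[target])
    overall = median(v for values in by_group.values() for v in values)
    filled = []
    for rec in records:
        value = rec[target]
        if value is None:
            group_values = by_group.get(rec[group_key])
            value = median(group_values) if group_values else overall
        filled.append({**rec, target: value})
    return filled


# Toy records (made-up numbers, for illustration only).
listings = [
    {"seniority": "Executive", "salary": 6000},
    {"seniority": "Executive", "salary": 5000},
    {"seniority": "Professional", "salary": 10500},
    {"seniority": "Executive", "salary": None},
    {"seniority": "Manager", "salary": None},
]
filled = impute_by_group(listings)
for row in filled:
    print(row)
```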


In [1]:
#question 1
#Scrape and prepare your own data.

In [2]:
from bs4 import BeautifulSoup
import urllib
import requests
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

In [3]:
# Visit our relevant page.
driver = webdriver.Chrome(executable_path="./chromedriver/chromedriver")
driver.get("https://www.mycareersfuture.sg/")
# Wait 3 seconds for the page to load.
sleep(3)

In [4]:
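# Find the search box.
# find_element("name", ...) works in Selenium 3 and 4; find_element_by_name was removed in Selenium 4.
elem = driver.find_element("name", "search-text")
# Clear it.
elem.clear()
# Type in "data".
elem.send_keys("data")
# Submit the search.
elem.send_keys(Keys.RETURN)
# Wait 5 seconds for the results to load.
sleep(5)
# Close it.
# driver.close()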

In [5]:
#grab current url
url_page = driver.current_url
print(url_page)

https://www.mycareersfuture.sg/search?search=data&sortBy=new_posting_date&page=0


In [6]:
# Each results page holds 20 job listings, so collect the job links from every page.
url_list = []
# Replace only the page number, not every "0" in the URL.
url = url_page.replace("page=0", "page={}")
#url = "https://www.mycareersfuture.sg/search?search=data&sortBy=new_posting_date&page={}"
for page in range(0, 200):
    print('Page {}'.format(page))
    driver.get(url.format(page))
    sleep(5)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    job_links = [a['href'] for a in soup.find_all('a', href=True) if '/job/' in a['href']]
    print(len(job_links))
    for link in job_links:
        url_list.append('https://www.mycareersfuture.sg' + link)
print('Total URL', len(url_list))
print('==Done==')

Page 0
20
Page 1
20
Page 2
20
[... output for pages 3 to 89 omitted; each page reported 20 links ...]
Page 90
20
Page 91
2
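
The page URLs in the loop above can also be built by rewriting only the `page` query parameter, which avoids touching any other `0` in the address. This is a minimal sketch using the standard library; `with_page` is an illustrative name, not part of the original notebook.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit


def with_page(url, page):
    """Return `url` with its `page` query parameter set to `page`."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    query["page"] = [str(page)]
    return urlunsplit(parts._replace(query=urlencode(query, doseq=True)))


first_page = "https://www.mycareersfuture.sg/search?search=data&sortBy=new_posting_date&page=0"
page_urls = [with_page(first_page, n) for n in range(3)]
for page_url in page_urls:
    print(page_url)
```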

In [8]:
print(url_list[0])

https://www.mycareersfuture.sg/job/data-engineer-moneysmart-singapore-d56b1a5839b83d1bf6486c4beb22ed91


In [11]:
#create a new data frame for jobs
import pandas as pd
jobs = pd.DataFrame(columns=["company","job_title","address","employment_type","seniority","category", "salary", "salary_period","requirements"])

In [12]:
#job_links = ['/job/data-science-lead-large-customer-sales-singapore-google-asia-pacific-4a23f0baa5c6bbdf7c07755a43ad57ff']

def get_text(entry, tag, attrs):
    """Return the text of the first matching element, or 0 if it is missing."""
    element = entry.find(tag, attrs)
    # find() returns None when nothing matches; keep 0 as the placeholder.
    return element.text if element is not None else 0

# (column, tag, attributes) for each field, in the same order as the DataFrame columns.
fields = [
    ("company", "p", {"name": "company"}),
    ("job_title", "h1", {"id": "job_title"}),
    ("address", "p", {"id": "address"}),
    ("employment_type", "p", {"id": "employment_type"}),
    ("seniority", "p", {"id": "seniority"}),
    ("category", "p", {"id": "job-categories"}),
    ("salary", "div", {"class": "lh-solid"}),
    ("salary_period", "span", {"class": "salary_type dib f5 fw4 black-60 pr1 i pb"}),
    ("requirements", "div", {"id": "requirements"}),
]

for i, row in enumerate(url_list):
    driver.get(row)
    # Wait 5 seconds for the page to render.
    sleep(5)
    html = BeautifulSoup(driver.page_source, 'lxml')
    if (i % 10) == 0:
        print(i, "records done")
    for entry in html.find_all('div', {'class': 'w-70-l w-60-ms w-100 pr2-l pr2-ms relative'}):
    #for entry in html.find_all('div', {'class':'jobInfo w-100 dib v-top relative'}):
        # Grab the job listing and add it to the DataFrame.
        jobs.loc[len(jobs)] = [get_text(entry, tag, attrs) for _, tag, attrs in fields]


0 records done
10 records done
20 records done
[... progress lines for records 30 to 570 omitted ...]
580 records done
590 reco

In [13]:
jobs.shape

(3975, 9)

In [14]:
jobs.head()

| Unnamed: 0 | company | job_title | address | employment_type | seniority | category | salary | salary_period | requirements |
|---|---|---|---|---|---|---|---|---|---|
| 0 | MONEYSMART SINGAPORE PTE. LTD. | Data Engineer | GRANDE BUILDING, 8 COMMONWEALTH LANE 149555 | Full Time | Executive | Information Technology | $5,000to$7,000 | Monthly | RequirementsCompetencies Degree in Computer S... |
| 1 | PORTCAST PTE. LTD. | Data Scientist | 32 CARPENTER STREET 059911 | Full Time | Middle Management | Engineering | $2,200to$6,000 | Monthly | Requirements● Comfortable working with large ... |
| 2 | SINGAPORE PRESS HOLDINGS LIMITED | Data Visualisation Designer | NEWS CENTRE, 1000 TOA PAYOH NORTH 318994 | Permanent | Junior Executive | Design | $3,500to$4,500 | Monthly | Requirements Prior experience in a data visual... |
| 3 | GRABTAXI HOLDINGS PTE. LTD. | Data Analyst | OUE DOWNTOWN, 6 SHENTON WAY 068809 | Full Time | Executive | Information Technology | 0 | | RequirementsThe must haves: A Bachelor's/Mast... |
| 4 | AMAZON ASIA-PACIFIC RESOURCES PRIVATE LIMITED | Data Center Engineering Project Engineer APAC | AIA TOWER, 1 ROBINSON ROAD 048542 | Full Time | Professional | Design, Engineering | $9,000to$12,000 | Monthly | RequirementsBasic Qualifications - Minimum 5 ... |
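
The `requirements` column in the preview above still carries the section heading glued to the first word (for example `RequirementsCompetencies`) and bullet characters such as `●`. A small cleaning step before any text modelling might look like the sketch below. The function names and the exact rules are assumptions, and the sample string is modelled on the preview with a made-up ending.

```python
import re


def clean_requirements(text):
    """Strip the leading 'Requirements' heading, bullets, and extra whitespace."""
    if not isinstance(text, str):
        return ""  # the scraper stores 0 when the section is missing
    text = re.sub(r"^Requirements", "", text)
    text = text.replace("●", " ")
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Lowercase and split into word-like tokens, keeping '+' and '#' for C++ and C#."""
    return re.findall(r"[a-z][a-z0-9+#]*", text.lower())


sample = "Requirements● Comfortable working with large data sets in Python and C++"
cleaned = clean_requirements(sample)
tokens = tokenize(cleaned)
print(cleaned)
print(tokens)
```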


In [15]:
#save to csv
jobs.to_csv('jobs_data.csv')

In [None]:
#Analysis to be continued in another python file

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.
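
To make the framing concrete, one option is a binary target derived from the job title plus binary skill indicators derived from the requirements text. The sketch below uses plain Python. The skill list, the `is_data_scientist` target, and the sample postings are illustrative assumptions, and the resulting rows could be fed to any classifier.

```python
import re

SKILLS = ["python", "sql", "tableau", "machine learning", "excel"]  # illustrative list


def skill_features(requirements):
    """Return a 0/1 indicator per skill based on whole-word matches."""
    text = requirements.lower()
    return {
        skill: int(re.search(r"\b" + re.escape(skill) + r"\b", text) is not None)
        for skill in SKILLS
    }


def is_data_scientist(job_title):
    """Binary target: 1 if the title mentions 'data scientist', else 0."""
    return int("data scientist" in job_title.lower())


postings = [  # made-up examples
    ("Senior Data Scientist", "Strong Python and machine learning background; SQL a plus."),
    ("Data Analyst", "Advanced Excel, SQL and Tableau dashboards."),
]
X = [skill_features(req) for _, req in postings]
y = [is_data_scientist(title) for title, _ in postings]
print(X)
print(y)
```

Stating the target this explicitly also makes it easy to swap in another question from the list above, such as junior vs. senior, by changing only the target function.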



In [None]:
#question 2

### BONUS PROBLEM

Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.
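
The asymmetry described here makes a false "high salary" prediction the costlier error, so one way to adjust a probabilistic classifier is to raise the decision threshold for predicting "high". The sketch below, with made-up scores and labels, computes the false and true positive rates at several thresholds; these are the points an ROC curve plots. The function name `rates_at` is illustrative.

```python
def rates_at(threshold, scores, labels):
    """Return (false positive rate, true positive rate) when predicting 1 for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    positives = sum(labels)
    negatives = len(labels) - positives
    return fp / negatives, tp / positives


# Made-up predicted probabilities of "high salary" and true labels (1 = high).
scores = [0.95, 0.85, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.5, 0.65, 0.8):
    fpr, tpr = rates_at(threshold, scores, labels)
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

In this toy data, moving the threshold from 0.5 to 0.8 removes every false "high" prediction but also misses more of the truly high-salary jobs, which is the tradeoff to explain. Plotting the (FPR, TPR) pairs over all thresholds gives the ROC curve.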

In [None]:
# 5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

# 6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

## Suggestions for Getting Started

1. Collect data from [Indeed.com](https://www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
   - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc.).
   - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
   - Your boss is interested in what overall features hold the greatest significance.
   - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.