# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.
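
As a starting point for the collection step, the sketch below builds one search URL per query and location pair. It assumes the `q`, `l`, and `sort` query parameters seen in the Indeed search URL used later in this notebook; the helper name `search_url` and the example query and location lists are hypothetical, and the site may change its URL scheme at any time.

```python
from urllib.parse import urlencode

# Base search endpoint, taken from the search URL used later in this notebook.
BASE_URL = "https://www.indeed.com/jobs"


def search_url(query, location):
    """Build a search URL for one query/location pair (hypothetical helper)."""
    params = {"q": query, "l": location, "sort": "date"}
    return BASE_URL + "?" + urlencode(params)


# Example inputs; adjust to the job titles and regions you want to cover.
queries = ["Data Scientist", "Data Analyst"]
locations = ["United States", "New York, NY"]

urls = [search_url(q, loc) for q in queries for loc in locations]
for u in urls:
    print(u)
```

Each URL can then be passed to `driver.get(...)` or `requests.get(...)`, depending on whether the page needs a browser to render.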


---

## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, or SVM).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. Write a brief executive summary for a non-technical audience.
   - Writeups should be 500-1000 words and should define any technical terms, explain your approach, and cover any risks and limitations.

#### BONUS

5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

---

## Suggestions for Getting Started

1. Collect data from [Indeed.com](https://www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
  - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
  - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
  - Your boss is interested in which overall features hold the greatest significance.
  - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.   
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.
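
One simple way to get a first look at which skills separate high-paying from low-paying postings (suggestion 3) is to compare how often each skill appears in each group. The sketch below assumes each posting has already been reduced to a list of skills, like the `desired` column scraped later in this notebook; the two small groups are made-up examples, not scraped data, and `skill_rates` is a hypothetical helper.

```python
from collections import Counter


def skill_rates(postings):
    """Return the share of postings that mention each skill."""
    counts = Counter(skill for skills in postings for skill in set(skills))
    return {skill: n / len(postings) for skill, n in counts.items()}


# Made-up example groups; in practice these come from the labeled postings.
high = [["Python", "Spark", "Machine Learning"], ["Python", "TensorFlow"]]
low = [["SQL", "Python"], ["SQL", "Excel"]]

high_rates, low_rates = skill_rates(high), skill_rates(low)
all_skills = set(high_rates) | set(low_rates)

# Positive gap: skill is more common in high-salary postings.
gap = {s: high_rates.get(s, 0.0) - low_rates.get(s, 0.0) for s in all_skills}
ranked = sorted(gap.items(), key=lambda kv: kv[1], reverse=True)

for skill, diff in ranked:
    print(f"{skill:20s} {diff:+.2f}")
```

This is only a descriptive check; feature importances or coefficients from a fitted model are still needed to back up any claim about significance.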

In [1]:
import requests
from bs4 import BeautifulSoup
import os
from selenium import webdriver
from time import sleep

In [2]:
import numpy as np
import pandas as pd

In [3]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException, WebDriverException

In [4]:
chromedriver = "/Users/Han/Downloads/Data/Git/GA/chromedriver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(executable_path=chromedriver)

In [439]:
#url = 'https://www.indeed.com/jobs?q=Data+Scientist&l=United+States&sort=date'
url_130k = 'https://www.indeed.com/q-Data-Scientist-$130,000-l-United-States-jobs.html'
driver.get(url_130k)
sleep(3)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

In [406]:
scrape = pd.DataFrame(columns=['title','company','location','desc','date','salary','estimate','desired'])

In [381]:
def getlinks_indeed():
    for row in soup.findAll("a", {"class" : "turnstileLink"}):
        if "/cmp/" not in row['href']:
            links.append(row['href'])

In [440]:
def getinfo_indeed():
    #convert to css format
    css_selector = 'a[href*="' + href + '"]'
    driver.find_element(By.CSS_SELECTOR, css_selector).click()
    sleep(1.5)

    soup1 = BeautifulSoup(driver.page_source, 'lxml')
    title = soup1.find("div", {"id" : "vjs-jobinfo"}).find("div", {"id" : "vjs-jobtitle"}).text
    company = soup1.find("span", {"id" : "vjs-cn"}).text
    location = soup1.find("div", {"id" : "vjs-jobinfo"}).find("span", {"id":"vjs-loc"}).text
#     try:
#         rating = soup1.find("div", {"id" : "vjs-jobinfo"}).find("span", {"class":"rating"})['style']
#         reviews = soup1.find("div", {"id" : "vjs-jobinfo"}).find("span", {"class":"slNoUnderline"}).text
#     except TypeError:
#         rating = 0
#         reviews = 0
    desc = soup1.find("div", {"id" : "vjs-desc"}).text
    date = soup1.find("span", {"class" : "date"}).text
    s_test = (soup1.find("div", {"id" : "vjs-jobinfo"}).find_all('span'))
    if True in ['$' in x.text for x in s_test]:
        salary = s_test[['$' in x.text for x in s_test].index(True)].text
    else:
        salary = -1
    
    estimate = '$13000+' #additional column for jobs using salary estimates
    desired = []
    for skills in soup1.find('div', {'id':'vjs-container'}).findAll('span', {'class':'experienceListItem'}):
        desired.append(skills.text)

    scrape.loc[len(scrape)] = [title, company, location, desc, date, salary, estimate, desired]

In [327]:
#try again function
def try_again():
    last_page = scrape[scrape['title'] == 'Next page'].index.max()
    scrape.drop(index=scrape.index[last_page+1:],inplace=True)

In [452]:
for pages in range(50):
    sleep(50)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    
    links = []
    getlinks_indeed()
    for href in links:
        try:
            getinfo_indeed()
        
        except WebDriverException:
            # handles popups
            driver.find_element(By.CLASS_NAME, 'popover-x').click()
            getinfo_indeed()
            
        except Exception:
            # handles incomplete loading: re-parse the live page, drop the
            # partial rows for this page, and scrape its links again
            sleep(5)
            soup = BeautifulSoup(driver.page_source, 'lxml')
            try_again()
            links = []
            getlinks_indeed()
            for href in links:
                getinfo_indeed()
    
    #go to next page
    driver.find_elements(By.CLASS_NAME, 'pn')[-1].click()
    scrape.loc[len(scrape)] = ['Next page']*8

In [444]:
scrape.tail(30)

| | title | company | location | desc | date | salary | estimate | desired |
|---|---|---|---|---|---|---|---|---|
| 1132 | Data Scientist - Machine Learning Engineer | Aera Technology | - Mountain View, CA 94041 | Do you want to shape the future of enterprise ... | 10 days ago | -1 | $13000+ | [TensorFlow, Linux, Spark, Machine Learning, K... |
| 1133 | Program Manager, Machine Learning Data Ecosyst... | Google | - Mountain View, CA | Google's projects, like our users, span the gl... | 10 days ago | -1 | $13000+ | [Machine Learning, Project Management] |
| 1134 | Scientist, High Impact Data- Protein Analytics... | Bristol-Myers Squibb | - Redwood City, CA | High Impact Data Scientist for Protein Analyti... | 10 days ago | -1 | $13000+ | [Spectroscopy] |
| 1135 | Machine Learning Engineer | Dropbox | - San Francisco, CA | ----------------------\nRole Description\n----... | 10 days ago | -1 | $13000+ | [TensorFlow, AI, Machine Learning, Image Proce... |
| 1136 | Data Scientist - Machine Learning | Affirm | - San Francisco, CA 94126 (Financial District... | What You'll Do\nBuild production fraud and cre... | 10 days ago | -1 | $13000+ | [Machine Learning, Data Analysis, Python] |
| 1137 | Analytics Software – Machine Learning Scientist | FICO | - San Jose, CA 95110 (Downtown area) | The Machine Learning Scientist will be a contr... | 10 days ago | -1 | $13000+ | [Big Data, JavaScript, Java, Spark, Machine Le... |
| 1138 | Data Scientist | Convoy | - Seattle, WA | Convoy, one of the fastest growing startups in... | 10 days ago | -1 | $13000+ | [Big Data, Spark, Machine Learning, Hadoop, R,... |
| 1139 | Data Scientist | LeapYear | - Berkeley, CA | As a data scientist at LeapYear, you will be r... | 10 days ago | -1 | $13000+ | [TensorFlow, Hive, Spark, Machine Learning, Da... |
| 1140 | Senior Data Scientist | Picarro | - Santa Clara, CA 95054 | The Opportunity:\nPicarro is seeking an accomp... | 10 days ago | -1 | $13000+ | [ArcGIS, Statisical Analysis, Python, GIS, SQL] |
| 1141 | Data Scientist | Paradigm Infotech, Inc | - San Jose, CA | Role: Data ScientistLocation: San Jose CADurat... | 10 days ago | -1 | $13000+ | [Spark, Machine Learning, Hadoop, R, Python, SQL] |


In [None]:
# try_again()

In [445]:
scrape.shape

(1162, 8)

In [446]:
len(scrape[scrape[['desc','company']].duplicated()].sort_values(by='desc'))

504

In [450]:
1162-504

658

In [448]:
scrape_unique = scrape.drop(scrape[scrape[['desc','company']].duplicated()].index)
scrape_indeed = scrape_unique[~(scrape_unique['title'] == 'Next page')]

scrape_indeed[scrape_indeed['salary'] != -1].reset_index(drop=True).shape

(8, 8)

In [451]:
#60K: stopped at page 52 (505 rows)
#95K: stopped at page 52 (473 rows)
#130K: stopped at page ? (658 rows)
# scrape_indeed.to_csv('/Users/Han/Downloads/scrape_indeed_60.csv')
# scrape_indeed.to_csv('/Users/Han/Downloads/scrape_indeed_95.csv')
scrape_indeed.to_csv('/Users/Han/Downloads/scrape_indeed_130.csv')

## efinancialcareers.com

In [160]:
#driver.close()

In [4]:
chromedriver = "/Users/Han/Downloads/Data/Git/GA/chromedriver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(executable_path=chromedriver)

In [12]:
url = 'https://www.efinancialcareers.com/search/?q=data%20scientist&countryCode=US&currencyCode=USD&language=en&facets=*&page=1&pageSize=10&filters.salaryBand=FIRST_TIER%7CSECOND_TIER%7CTHIRD_TIER%7CFOURTH_TIER%7CFIFTH_TIER%7CSIXTH_TIER&ds=sr'
driver.get(url)
sleep(3)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

In [20]:
scrape = pd.DataFrame(columns=['title','company','location','desc','date','salary'])

In [25]:
def getinfo_efc():
    soup1 = BeautifulSoup(driver.page_source, 'lxml')

    title = soup1.find('h1').text
    company = soup1.find('li', {'class':'company'}).find('span').text
    location = soup1.find('li', {'class':'location'}).find('span').text
    location = location.splitlines()[-1].lstrip()
    desc = soup1.find('section', {'class':'description'}).text + \
            soup1.find('li', {'class':'position'}).find('span').text
    date = soup1.find('li', {'class':'updated'}).find('span').text
    salary = soup1.find('li', {'class':'salary'}).find('span').text

    scrape.loc[len(scrape)] = [title, company, location, desc, date, salary]

In [230]:
# various types of formatting for special jobs/companies
def getinfo_efc_alt():
    soup1 = BeautifulSoup(driver.page_source, 'lxml')

    title = soup1.find(['h1','h2']).text
    company = soup1.findAll('h2')[1].text
    company = company.splitlines()[-1].lstrip()
    try:
        location = soup1.find('div', {'class':'detailsvisible'}).findAll('li')[0].text
        location = location.split(':')[-1].lstrip()
    except:
        location = 0
    
    desc = soup1.find(['div','section'], {'class':['description','job-description']}).text 
       
    try:
        salary = soup1.find('div', {'class':'detailsvisible'}).findAll('li')[1].text
        salary = salary.split(':')[-1].lstrip()
    except:
        salary = 0
    
    date = 0
    
    scrape.loc[len(scrape)] = [title, company, location, desc, date, salary]

In [112]:
# resume scraping from last left off page
#url = 'https://www.efinancialcareers.com/search/?q=data%20scientist&countryCode=US&currencyCode=USD&language=en&facets=*&page=6&pageSize=10&filters.salaryBand=FIRST_TIER%7CSECOND_TIER%7CTHIRD_TIER%7CFOURTH_TIER%7CFIFTH_TIER%7CSIXTH_TIER&ds=sr'
#driver.get(url)

In [391]:
for pages in range(50):
    sleep(5)
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    
    for link in driver.find_elements(By.CLASS_NAME, 'card-title-link'):
        try:
            link.click()
        
        #to handle (ignore) pop-ups
        except WebDriverException:
            driver.switch_to.active_element.click()

        driver.switch_to.window(driver.window_handles[1])
        sleep(5)
        
        try: #scrape
            getinfo_efc()
            driver.close()
            
        except AttributeError: 
            soup1 = BeautifulSoup(driver.page_source, 'lxml')
            if soup1.find('h2') is not None:
                getinfo_efc_alt() # to handle alternative formatting for special jobs/companies
                driver.close()
                
            else:
                sleep(5)
                getinfo_efc()
                driver.close()
            
        #switch back to main page
        driver.switch_to.window(driver.window_handles[0])
        sleep(0.5)

    #go to next page
    driver.find_element(By.ID, 'searchPaginationNext-li').click()
    # marker row must match the number of columns in `scrape`
    scrape.loc[len(scrape)] = ['Next page'] * len(scrape.columns)

In [75]:

# soup1 = BeautifulSoup(driver.page_source, 'lxml')

# title = soup1.find('h2').text
# company = soup1.findAll('h2')[1].text
# company = company.splitlines()[-1].lstrip()
# location = soup1.find('div', {'class':'detailsvisible'}).findAll('li')[0].text
# location = location.split(':')[-1].lstrip()


# rating = 0
# reviews = 0
# desc = soup1.find('section', {'class':'description'}).text + \
#         soup1.find('div', {'class':'detailsvisible'}).findAll('li')[2].text
# date = 0
# salary = soup1.find('div', {'class':'detailsvisible'}).findAll('li')[1].text
# salary = salary.split(':')[-1].lstrip()

# test = pd.DataFrame(columns=['title','company','location','rating','reviews','desc','date','salary'])
# test.loc[len(test)] = [title, company, location, rating, reviews, desc, date, salary]
# test

In [231]:
# driver.close()

# #switch back to main page
# driver.switch_to.window(driver.window_handles[0])
# sleep(0.5)

# try_again()
# scrape.shape

In [250]:
scrape_unique = scrape.drop(scrape[scrape[['desc','company']].duplicated()].index)
scrape_efc = scrape_unique[~(scrape_unique['title'] == 'Next page')]

In [277]:
scrape_efc['salary'].tail(10)

150     £excellent + bonus + good package
151                                 $High
152    up to $240,000 plus bonus and bens
153         GBP63000 - GBP73000 per annum
154      £50,000 - £90,000 base + package
155                               £55,000
156         GBP55707 - GBP75707 per annum
157              Up to GBP60000 per annum
158                           Competitive
159                                     0
Name: salary, dtype: object

In [265]:
(x,y) = (scrape_efc.shape)
x-sum(scrape['salary'].value_counts().head().values)

49

In [284]:
scrape_efc.to_csv('/Users/Han/Downloads/scrape_efc.csv')

## simplyhired.com

In [160]:
#driver.close()

In [292]:
chromedriver = "/Users/Han/Downloads/Data/Git/GA/chromedriver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(executable_path=chromedriver)

In [293]:
url = 'https://www.simplyhired.com/search?q=data+scientist&l=united+states&mip=%2460%2C000&pp=&'
driver.get(url)
sleep(3)
html = driver.page_source
soup = BeautifulSoup(html, 'lxml')

In [294]:
scrape = pd.DataFrame(columns=['title','company','location','desc','date','salary','estimate'])

In [295]:
def getlinks_simplyhired():
    for row in soup.findAll("a", {"class" : "card-link"}):
         links.append("https://www.simplyhired.com"+row['href'])

In [296]:
def getinfo_simplyhired(i):
    soup1 = BeautifulSoup(driver1.page_source, 'lxml')

    title = soup1.find('h1', {'itemprop' : 'title'}).text
    company = soup1.find('span', {'class':'company'}).text
    location = soup1.find('span', {'class':'location'}).text
    desc = soup1.find('div', {'class':'viewjob-description'}).text
    date = soup.findAll('span', {'class':'jobposting-timestamp'})[i].find('time')['datetime']
    salary = soup.findAll('span', {'class':'jobposting-salary'})[i]['data-salary']
    estimate = soup.findAll('span', {'class':'jobposting-salary'})[i]['data-est']

    scrape.loc[len(scrape)] = [title, company, location, desc, date, salary, estimate]

In [297]:
#try again function
def try_again():
    last_page = scrape[scrape['title'] == 'Next page'].index.max()
    scrape.drop(index=scrape.index[last_page+1:],inplace=True)

In [390]:
for pages in range(100):
    html = driver.page_source
    soup = BeautifulSoup(html, 'lxml')
    
    links = []
    getlinks_simplyhired()
    for i, job in enumerate(links):
        driver1 = webdriver.Chrome(executable_path=chromedriver)
        driver1.get(job)
        sleep(3)
    
        getinfo_simplyhired(i)
        driver1.quit()  # end the session so chromedriver processes do not pile up
  
    #go to next page
    driver.find_element(By.CLASS_NAME, 'next-pagination').click()
    scrape.loc[len(scrape)] = ['Next page']*7
    sleep(3)

In [311]:
scrape.shape

(1042, 7)

In [312]:
len(scrape[scrape[['desc','company']].duplicated()].sort_values(by='desc'))

77

In [320]:
scrape_unique = scrape.drop(scrape[scrape[['desc','company']].duplicated()].index)
scrape_simplyhired = scrape_unique[~(scrape_unique['title'] == 'Next page')]

scrape_simplyhired.shape

(964, 7)

In [321]:
scrape_simplyhired[scrape_simplyhired['estimate'] == 'false'].shape

(18, 7)

In [322]:
#stopped at page 50 (964 rows)
scrape_simplyhired.to_csv('/Users/Han/Downloads/scrape_simplyhired.csv')

In [255]:
# try_again()

In [235]:
# for i, job in enumerate(links[:2]):
#     chromedriver = "/Users/Han/Downloads/Data/Git/GA/chromedriver/chromedriver"
#     os.environ["webdriver.chrome.driver"] = chromedriver
#     driver1 = webdriver.Chrome(executable_path=chromedriver)
#     driver1.get(job)
#     sleep(3)
    
#     soup1 = BeautifulSoup(driver1.page_source, 'lxml')
#     print(soup1.find('h1', {'itemprop' : 'title'}).text)
#     print(soup1.find('span', {'class':'company'}).text)
#     print(soup1.find('span', {'class':'location'}).text)
#     print(soup1.find('div', {'class':'viewjob-description'}).text)
    
#     print(soup.findAll('span', {'class':'jobposting-salary'})[i]['data-salary'])
#     print(soup.findAll('span', {'class':'jobposting-salary'})[i]['data-est'])
#     print(soup.findAll('span', {'class':'jobposting-timestamp'})[i].find('time')['datetime'])
#     driver1.close()
#     print('---')

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
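
If you take the classification route, the labels can be derived directly from the numeric salaries. The sketch below assumes salaries have already been cleaned into numbers and uses the median as the cutoff; the helper name `label_high_low` and the example salary values are made up for illustration.

```python
from statistics import median


def label_high_low(salaries):
    """Return 1 for salaries above the median and 0 otherwise."""
    cutoff = median(salaries)
    return [1 if s > cutoff else 0 for s in salaries]


# Made-up annual salaries; replace with the cleaned salary column.
salaries = [95000, 130000, 60000, 150000, 110000]
labels = label_high_low(salaries)
print(labels)
```

A median split gives roughly balanced classes, which keeps plain accuracy meaningful; other thresholds (quartiles, a fixed dollar amount) are equally valid as long as the choice is justified.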

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.
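
Before any modeling, the scraped salary text has to be turned into numbers, and postings without a usable figure have to be flagged. The sketch below is one possible approach, assuming free-text values like those in the scraped `salary` columns above. The helper `salary_midpoint` and the 1000 cutoff for ignoring small numbers are assumptions, and currencies are not converted, so GBP and USD values would still need separate handling.

```python
import re

# Matches digit runs with optional thousands separators, e.g. "240,000" or "63000".
AMOUNT = re.compile(r"\d[\d,]*")


def salary_midpoint(text, minimum=1000):
    """Return the midpoint of the amounts in a salary string, or None.

    Amounts below `minimum` are ignored (assumed to be placeholders such as
    0 or -1, or non-annual figures); this cutoff is an assumption.
    """
    amounts = [int(m.replace(",", "")) for m in AMOUNT.findall(str(text))]
    amounts = [a for a in amounts if a >= minimum]
    if not amounts:
        return None
    return (min(amounts) + max(amounts)) / 2


examples = [
    "up to $240,000 plus bonus and bens",
    "GBP63000 - GBP73000 per annum",
    "Competitive",
    -1,
]
for value in examples:
    print(repr(value), "->", salary_midpoint(value))
```

Rows where the function returns `None` are the ones whose salary has to be predicted rather than read off the posting.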



### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.
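
To make the framing concrete, the sketch below shows one way to build a target and a feature matrix for the junior vs. senior question. It assumes each posting has a title and a list of skills (as in the `title` and `desired` columns scraped above). The keyword list for seniority, the helper names, and the three example postings are illustrative assumptions rather than a recommended definition.

```python
import re

# Assumed markers of seniority in a job title; extend or change as needed.
SENIOR_PATTERN = re.compile(r"\b(senior|sr|lead|principal)\b", re.IGNORECASE)


def is_senior(title):
    """Return 1 if the title contains a seniority keyword, else 0."""
    return 1 if SENIOR_PATTERN.search(title) else 0


def multi_hot(skill_lists):
    """Encode lists of skills as 0/1 rows over a sorted vocabulary."""
    vocab = sorted({s for skills in skill_lists for s in skills})
    rows = [[1 if s in skills else 0 for s in vocab] for skills in skill_lists]
    return vocab, rows


# Example postings, loosely modeled on the scraped rows shown earlier.
titles = ["Senior Data Scientist", "Data Scientist", "Machine Learning Engineer"]
skills = [
    ["Python", "SQL", "GIS"],
    ["Spark", "Python"],
    ["TensorFlow", "Machine Learning"],
]

y = [is_senior(t) for t in titles]
vocab, X = multi_hot(skills)
print(vocab)
print(X)
print(y)
```

`X` and `y` can then be handed to whichever classifier you choose; the same encoding works for other targets, such as data scientist vs. other data jobs.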




### BONUS PROBLEM

Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.
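
One way to read this requirement: a false "high salary" prediction is the costly error, so the decision threshold on the predicted probability of "high" can be raised above the default 0.5. The sketch below uses made-up labels and scores (not model output) to show the tradeoff and to compute the points of a ROC curve by hand; the helper names are hypothetical, and it assumes both classes are present.

```python
def confusion_at(y_true, scores, threshold):
    """Count (TP, FP, TN, FN) when predicting 'high' for scores >= threshold."""
    tp = fp = tn = fn = 0
    for truth, score in zip(y_true, scores):
        predicted_high = score >= threshold
        if predicted_high and truth:
            tp += 1
        elif predicted_high and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def roc_points(y_true, scores):
    """Return (FPR, TPR) pairs, one per distinct score used as a threshold."""
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp, fp, tn, fn = confusion_at(y_true, scores, t)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points


# Made-up example: 1 = high salary, scores = predicted probability of "high".
y_true = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]

default = confusion_at(y_true, scores, 0.5)   # standard cutoff
strict = confusion_at(y_true, scores, 0.65)   # fewer false "high" calls
print("default (TP, FP, TN, FN):", default)
print("strict  (TP, FP, TN, FN):", strict)
print(roc_points(y_true, scores))
```

In this toy example the stricter cutoff removes the one false "high" prediction at the cost of missing one genuinely high-salary posting. The `(FPR, TPR)` pairs can be passed to any plotting library to draw the ROC curve.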


