# Project 4: Web Scraping Job Postings

## Business Case Overview

You're working as a data scientist for a contracting firm that's rapidly expanding. Now that they have their most valuable employee (you!), they need to leverage data to win more contracts. Your firm offers technology and scientific solutions and wants to be competitive in the hiring market. Your principal has two main objectives:

   1. Determine the industry factors that are most important in predicting the salary amounts for these data.
   2. Determine the factors that distinguish job categories and titles from each other. For example, can required skills accurately predict job title?

To limit the scope, your principal has suggested that you *focus on data-related job postings*, e.g. data scientist, data analyst, research scientist, business intelligence, and any others you might think of. You may also want to decrease the scope by *limiting your search to a single region.*

Hint: Aggregators like [Indeed.com](https://www.indeed.com) regularly pool job postings from a variety of markets and industries. 

**Goal:** Scrape your own data from a job aggregation tool like Indeed.com in order to collect the data to best answer these two questions.

---

## Directions

In this project you will be leveraging a variety of skills. The first will be to use the web-scraping and/or API techniques you've learned to collect data on data jobs from Indeed.com or another aggregator. Once you have collected and cleaned the data, you will use it to answer the two questions described above.

### QUESTION 1: Factors that impact salary

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).
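
As a minimal sketch of the classification framing, the snippet below labels postings as high or low salary using the median as the threshold. The `salary` column name and the values are made-up placeholders, not scraped data.

```python
import pandas as pd

# Hypothetical monthly salaries; real values would come from the scraped postings
jobs = pd.DataFrame({"salary": [3500, 4500, 5500, 8000, 11000]})

# Use the median as the cut-off between "low" (0) and "high" (1) salary
threshold = jobs["salary"].median()
jobs["high_salary"] = (jobs["salary"] > threshold).astype(int)

print(threshold)
print(jobs["high_salary"].tolist())
```

A median split gives roughly balanced classes, which keeps accuracy easy to interpret. Other thresholds (quartiles, for example) are equally valid if you justify them.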

You have learned a variety of new skills and models that may be useful for this problem:
- NLP
- Unsupervised learning and dimensionality reduction techniques (PCA, clustering)
- Ensemble methods and decision tree models
- SVM models

Whatever you decide to use, the most important thing is to justify your choices and interpret your results. *Communication of your process is key.* Note that most listings **DO NOT** come with salary information. You'll need to be able to extrapolate or predict the expected salaries for these listings.

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
- Do the requirements for titles vary significantly with industry (e.g. healthcare vs. government)?

You may end up making multiple classification models to tackle different questions. Be sure to clearly explain your hypotheses and framing, any feature engineering, and what your target variables are. The type of classification model you choose is up to you. Be sure to interpret your results and evaluate your models' performance.


### BONUS PROBLEM

Your boss would rather tell a client incorrectly that they would get a lower salary job than tell a client incorrectly that they would get a high salary job. Adjust one of your models to ease his mind, and explain what it is doing and any tradeoffs. Plot the ROC curve.
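
A rough sketch of one way to approach this is to raise the probability threshold for predicting "high salary", so the model only says "high" when it is more confident. The data below is synthetic (generated with `make_classification`), and the 0.8 threshold is an arbitrary assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary data: 1 = high salary, 0 = low salary
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba_high = model.predict_proba(X_test)[:, 1]

def false_positives(threshold):
    """Count low-salary jobs wrongly predicted as high salary at a threshold."""
    predictions = (proba_high >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, predictions, labels=[0, 1]).ravel()
    return fp

fp_default = false_positives(0.5)
fp_strict = false_positives(0.8)
print("False 'high salary' predictions:", fp_default, "->", fp_strict)

# ROC curve points; plot fpr against tpr with matplotlib to see the tradeoff
fpr, tpr, thresholds = roc_curve(y_test, proba_high)
print("AUC:", round(roc_auc_score(y_test, proba_high), 3))
```

The tradeoff is that a stricter threshold also misses more genuinely high-salary jobs, so recall for the high class drops as false positives fall.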

---

## Requirements

1. Scrape and prepare your own data.

2. **Create and compare at least two models for each section**. One of the two models should be a decision tree or ensemble model. The other can be a classifier or regression of your choosing (e.g. Ridge, logistic regression, KNN, SVM, etc).
   - Section 1: Job Salary Trends
   - Section 2: Job Category Factors

3. Prepare a polished Jupyter Notebook with your analysis for a peer audience of data scientists. 
   - Make sure to clearly describe and label each section.
   - Comment on your code so that others could, in theory, replicate your work.

4. Write a brief executive summary for a non-technical audience.
   - Writeups should be 500-1000 words and should define any technical terms, explain your approach, and describe any risks and limitations.

#### BONUS

5. Answer the salary discussion by using your model to explain the tradeoffs between detecting high vs low salary positions.

6. Convert your executive summary into a public blog post of at least 500 words, in which you document your approach in a tutorial for other aspiring data scientists. Link to this in your notebook.

---

## Suggestions for Getting Started

1. Collect data from [Indeed.com](https://www.indeed.com) (or another aggregator) on data-related jobs to use in predicting salary trends for your analysis.
   - Select and parse data from *at least 1000 postings* for jobs, potentially from multiple location searches.
2. Find out what factors most directly impact salaries (e.g. title, location, department, etc).
   - Test, validate, and describe your models. What factors predict salary category? How do your models perform?
3. Discover which features have the greatest importance when determining a low vs. high paying job.
   - Your boss is interested in what overall features hold the greatest significance.
   - HR is interested in which SKILLS and KEY WORDS hold the greatest significance.
4. Author an executive summary that details the highlights of your analysis for a non-technical audience.
5. If tackling the bonus question, try framing the salary problem as a classification problem detecting low vs. high salary positions.

---

## Useful Resources

- Scraping is one of the most fun, useful and interesting skills out there. Don’t lose out by copying someone else's code!
- [Here is some advice on how to write for a non-technical audience](http://programmers.stackexchange.com/questions/11523/explaining-technical-things-to-non-technical-people)
- [Documentation for BeautifulSoup can be found here](http://www.crummy.com/software/BeautifulSoup/).

---

In [1]:
# Data modules
import numpy as np
import scipy.stats as stats
import pandas as pd

# Plotting modules
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')

# Make sure charts appear in the notebook
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

# Stats/regressions packages
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error, confusion_matrix, roc_curve, auc, classification_report
from sklearn.linear_model import Ridge, Lasso, ElasticNet, LinearRegression, RidgeCV, LassoCV, ElasticNetCV, LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
from sklearn.neural_network import MLPRegressor
import statsmodels.api as sm
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

# Webscraping packages
import requests
from bs4 import BeautifulSoup

# Hide warnings
import warnings
warnings.filterwarnings('ignore')

# Enable viewing of all columns for DataFrames
pd.set_option('display.max_columns', None)

<div class="alert alert-info">

### Setting up functions to extract each field from the Indeed.com.sg search results

</div>

In [72]:
def title(soup):
    '''
    Function to extract the job titles from BeautifulSoup
    Input soup = BeautifulSoup
    Returns a list of job titles
    '''
    title = []
    for job in soup:
        title.append(job.a.text)
        
    return title

In [73]:
def location(soup):
    '''
    Function to extract the location from BeautifulSoup
    Input soup = BeautifulSoup
    Returns a list of locations
    '''
    locations = []
    for job in soup:
        try:
            locations.append(job.find_all("div",{"class":"location"})[0].text)
        except:
            locations.append(job.find_all("span",{"class":"location"})[0].text)
            
    return locations

In [74]:
def salary(soup):
    '''
    Function to extract the salary from BeautifulSoup
    If salary is not found, 'No Salary Information' would be appended
    Input soup = BeautifulSoup
    Returns a list of salaries
    '''
    salary = []
    for job in soup:
        try:
            salary.append(job.find_all("div",{"class":"salarySnippet"})[0].text.strip())
        except:
            salary.append('No Salary Information')
            
    return salary

In [75]:
def summary(soup):
    '''
    Function to extract the summary from BeautifulSoup
    Input soup = BeautifulSoup
    Returns a list of summaries
    '''
    summary = []
    for job in soup:
        summary.append(job.find_all("div",{"class":"paddedSummary"})[0].text.strip())
            
    return summary

In [76]:
def postdate(soup):
    '''
    Function to extract the post date from BeautifulSoup
    If the post date is not found, 'No Post Date' would be appended
    Input soup = BeautifulSoup
    Returns a list of post dates
    '''
    date = []
    for job in soup:
        try:
            date.append(job.find_all("span",{"class":"date"})[0].text)
        except:
            date.append('No Post Date')
            
    return date
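
The extraction functions above share one pattern: look up an element by tag and class, and fall back to a default when it is missing. A minimal sketch of that pattern as a single helper follows. The name `first_text` and the sample HTML are hypothetical and not part of the original notebook.

```python
from bs4 import BeautifulSoup

def first_text(card, tag, class_name, default=None):
    """Return the stripped text of the first matching element, or `default`."""
    element = card.find(tag, class_=class_name)
    return element.get_text(strip=True) if element is not None else default

# Made-up HTML that mimics two result cards; the second has no salary
html = """
<div class="result"><a>Data Analyst</a><div class="salarySnippet"> $4,000 a month </div></div>
<div class="result"><a>Data Engineer</a></div>
"""
cards = BeautifulSoup(html, "html.parser").find_all("div", class_="result")

salaries = [
    first_text(card, "div", "salarySnippet", "No Salary Information")
    for card in cards
]
print(salaries)
```

Because `find` returns `None` when nothing matches, the helper avoids a bare `except` and will not hide unrelated errors.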

In [112]:
# Creating an empty DataFrame to store the webscraped information
results = pd.DataFrame(columns=['Title', 'Location', 'Salary', 'Summary', 'Post_Date'])
roles = ['data+science', 'data+scientist', 'data+analyst', 'business+intelligence',
         'machine+learning', 'data+engineer']

for role in roles:
    for n in range(10,110,10):
        # Target web page
        url = "https://www.indeed.com.sg/jobs?q={}&l=singapore&start={}".format(role, n)

        # Establishing the connection to the web page:
        response = requests.get(url)

        # Ensuring that the HTTP response code is OK
        if response.status_code != 200:
            print('Error Status Code:', response.status_code)
        else:
            # Setting up the html into BeautifulSoup
            html = response.text
            soup = BeautifulSoup(html, 'lxml')
            element = soup.find_all("div",{"class":"jobsearch-SerpJobCard row result"})

            # Extracting the Title, Location, Salary, Summary & Post Date information into a temp DataFrame
            df = pd.DataFrame(columns=['Title', 'Location', 'Salary', 'Summary', 'Post_Date'])
            df['Title'] = title(element)
            df['Location'] = location(element)
            df['Salary'] = salary(element)
            df['Summary'] = summary(element)
            df['Post_Date'] = postdate(element)

            # Appending the temp DataFrame into the results
            results = pd.concat([results, df], axis=0, ignore_index=True)
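
The search URLs in the loop above are built with string formatting and pre-encoded queries such as `data+scientist`. As a hedged alternative, the sketch below builds them with `urllib.parse.urlencode`, which handles the encoding. The helper `search_urls` is hypothetical, and it assumes result offsets begin at 0, whereas the loop above starts at 10.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.indeed.com.sg/jobs"

def search_urls(query, location, pages, page_size=10):
    """Build one search URL per results page; `start` is the result offset."""
    return [
        BASE_URL + "?" + urlencode(
            {"q": query, "l": location, "start": page * page_size}
        )
        for page in range(pages)
    ]

urls = search_urls("data scientist", "singapore", pages=3)
for url in urls:
    print(url)
```

Each URL can then be passed to `requests.get`, ideally with a `timeout` argument and a short pause between requests.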

In [113]:
results.shape

(771, 5)

In [114]:
results.drop_duplicates(keep='first').shape

(525, 5)

In [115]:
results.drop_duplicates(keep='first', inplace=True)

In [116]:
results[results['Salary'] != 'No Salary Information'].shape

(37, 5)

<div class="alert alert-info">
    
### Data from Indeed.com.sg is not clean and lacks the salary information needed for the project

Only 37 of the 525 unique postings include salary information, so we move on to scrape mycareersfuture.sg using Selenium

</div>

In [11]:
import os
from selenium import webdriver
from time import sleep

In [12]:
# Using Selenium to extract the href of the various jobs in www.mycareersfuture.sg

# Setting up Selenium driver
chromedriver = "/Users/darren/desktop/General_Assembly/classes/week-06/labs/python-webscraping_opentable-lab-master/chromedriver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(executable_path=chromedriver)

# List of job searches and pages
searches = ['business%20analyst','big%20data','data%20scientist','data%20analyst',
            'deep%20learning','data%20engineer','data%20architect','artificial%20intelligence',
            'machine%20learning','business%20intelligence','business%20data']
pages = range(100)

links = []

# Looping through the pages
for s in searches:
    for p in pages:
        driver.get("https://www.mycareersfuture.sg/search?search={}&sortBy=new_posting_date&page={}".format(s,p))
        sleep(5)
        
        # Grab the page source.
        html = driver.page_source
        soup = BeautifulSoup(html, 'lxml')
        elements = soup.find_all("a", {"class":"bg-white mb3 w-100 dib v-top pa3 no-underline flex-ns flex-wrap JobCard__card___22xP3"})
        
        for element in elements:
            try:
                links.append(element.get('href'))
            except:
                continue

# Close browser
driver.close()
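
Because the searches overlap, the same posting can be collected more than once, and `element.get('href')` returns `None` rather than raising when the attribute is missing. A small sketch of de-duplicating the links before visiting them follows. The hrefs are invented placeholders.

```python
# Made-up hrefs standing in for the scraped list of links
scraped_links = ["/job/a1", "/job/b2", "/job/a1", None, "/job/c3", "/job/b2"]

# dict.fromkeys keeps the first occurrence of each key and preserves order;
# the `if link` filter drops None and empty strings
unique_links = [link for link in dict.fromkeys(scraped_links) if link]

print(len(scraped_links), "->", len(unique_links))
```

Removing duplicates at this stage avoids loading the same job page several times in the next step, where each page visit includes a five-second wait.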

In [44]:
# Using Selenium to extract the relevant data from the various links that were previously extracted

# Creating a DataFrame to store the information scraped
data = pd.DataFrame(columns=['Title', 'Company', 'Location', 'Salary from', 'Salary to', 'Salary Type', 'Employment Type', 'Seniority', 'Summary'])

# Setting up Selenium driver
chromedriver = "/Users/darren/desktop/General_Assembly/classes/week-06/labs/python-webscraping_opentable-lab-master/chromedriver/chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(executable_path=chromedriver)

# Extracting the relevant data and appending to DataFrame
for link in links:
    try:
        driver.get("https://www.mycareersfuture.sg{}".format(link))
        sleep(5)                  
        html = driver.page_source
        soup = BeautifulSoup(html, 'lxml')

        try:
            title = soup.find_all("div", {"class":"jobInfo w-100 dib v-top relative"})[0].h1.text
        except:
            title = None
        try:
            company = soup.find_all("div", {"class":"jobInfo w-100 dib v-top relative"})[0].p.text
        except:
            company = None
        try:
            location = soup.find_all("div", {"class":"jobInfo w-100 dib v-top relative"})[0].find(id="job_info").p.text
        except:
            location = None
        try:
            salary_from = soup.find_all("div", {"class":"lh-solid"})[0].find_all("span",{"class":"dib"})[0].text
        except:
            salary_from = None
        try:
            salary_to = soup.find_all("div", {"class":"lh-solid"})[0].find_all("span",{"class":"dib"})[1].text[3:]
        except:
            salary_to = None
        try:
            salary_type = soup.find_all("div", {"class":"salary tr-l"})[0].find_all("span",{"class":"salary_type"})[0].text
        except:
            salary_type = None
        try:
            employment = soup.find_all("div", {"class":"jobInfo w-100 dib v-top relative"})[0].find(id="employment_type").text
        except:
            employment = None
        try:
            seniority = soup.find_all("div", {"class":"jobInfo w-100 dib v-top relative"})[0].find(id="seniority").text
        except:
            seniority = None
        try:
            summary = soup.find(id="description-content").text
        except:
            summary = None

        # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
        row = pd.DataFrame([{
            'Title':title,
            'Company':company,
            'Location':location,
            'Salary from':salary_from,
            'Salary to':salary_to,
            'Salary Type': salary_type,
            'Employment Type':employment,
            'Seniority':seniority,
            'Summary':summary
        }])
        data = pd.concat([data, row], ignore_index=True)
        
    except:
        continue
    
driver.close()                           
                             

In [45]:
data.head()

| Unnamed: 0 | Title | Company | Location | Salary from | Salary to | Employment Type | Seniority | Summary |
|---|---|---|---|---|---|---|---|---|
| 0 | Solution Architect, Warehouse Management System | DFS VENTURE SINGAPORE (PTE) LIMITED | SINGAPORE LAND TOWER, 50 RAFFLES PLACE 048623 | \$11,000 | to\$12,000 | Permanent, Full Time | Manager | Job Description: We are looking for a strong c... |
| 1 | Business Analyst Consultant | INFOSYS CONSULTING PTE. LTD. | SUNTEC TOWER TWO, 9 TEMASEK BOULEVARD 038989 | \$5,500 | to\$10,000 | Full Time | Professional, Non-executive | Responsibility Support Analyst for Order & Exe... |
| 2 | Senior Business Analyst | THE BOSTON SOFTWARE SOLUTIONS INTERNATIONAL PT... | HUDSON TECHNOCENTRE, 16 NEW INDUSTRIAL ROAD 53... | \$7,000 | to\$8,200 | Contract | Senior Executive | We are looking for Senior Business Analyst o ... |
| 3 | Technical Business Analyst (Digital / Internet... | OPTIMUM SOLUTIONS (SINGAPORE) PTE LTD | PLAZA 8 @ CBP, 1 CHANGI BUSINESS PARK CRESCENT... | \$5,500 | to\$8,000 | Contract | Senior Management, Manager | Optimum Solutions (Company Registration Number... |
| 4 | IT Business Analyst | THATZ INTERNATIONAL PTE LTD | THE ADELPHI, 1 COLEMAN STREET 179803 | \$3,500 | to\$4,500 | Full Time | Executive | Works with Subject Matter Experts & Applicati... |
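
The salary columns shown above are still strings such as `$11,000` and `to$12,000`, so they need converting before modelling. A minimal sketch of one way to do that with a regular expression follows. The helper name `parse_salary` is hypothetical.

```python
import re

def parse_salary(text):
    """Convert strings like '$11,000' or 'to$12,000' to a float.

    Returns None when the input is empty or contains no number.
    """
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    return float(match.group().replace(",", "")) if match else None

print(parse_salary("$11,000"), parse_salary("to$12,000"), parse_salary(None))
```

With pandas, the same function could be applied column-wise, for example `data['Salary from'].apply(parse_salary)`.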


In [46]:
import pickle

with open('links', 'wb') as f:
    pickle.dump(links, f)

In [None]:
with open('links', 'rb') as f:
    links = pickle.load(f)

In [43]:
len(links)

1363

In [67]:
data.shape

(1363, 8)

In [71]:
data[data['Salary from'] != 'No Salary Info'].shape

(1255, 8)

In [72]:
data[data['Company'] != 'No Company Info'].shape

(1355, 8)

In [73]:
data[data['Title'] != 'No Title Info'].shape

(1363, 8)

In [94]:
data.to_csv('webscrapedata.csv')

In [99]:
data.drop_duplicates(keep='first').shape

(927, 8)