# Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary classifier.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, we will attempt to predict its salary. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use a random forest classifier, as well as another classifier of your choice: logistic regression, SVM, or KNN.

- **Question**: Why would we want this to be a classification problem?
- **Answer**: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful enough.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10)

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Set up a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

The URL here has several query parameters:

- `q` for the job search
- This is followed by "+$20,000" (URL-encoded as `+%2420%2C000`) to return results with salaries (or expected salaries) above $20,000
- `l` for a location 
- `start` for what result number to start on
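
To see what these parameters decode to, the query string can be split with the standard library. This is a minimal sketch in Python 3 syntax (the notebook cells themselves use Python 2, where the same functions live in `urlparse` and `urllib`); the variable names are only illustrative.

```python
from urllib.parse import urlparse, parse_qs, urlencode

url = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

# Split off the query string and decode each parameter into a list of values.
params = parse_qs(urlparse(url).query)
print(params)

# Going the other way: build the same query string from plain values.
query = urlencode({"q": "data scientist $20,000", "l": "New York", "start": 10})
print(query)
```

The decoded `q` value shows that `%24` is the dollar sign and `%2C` is the comma.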

'''For some reason, I didn't look at the starter code and spent the better part of the last week trying to come up with,
from scratch and without this guidance, the scraper code below. I gave up and went with Import.IO, which pulled
everything for me.'''

In [None]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [None]:
import requests
import bs4
import urllib
from bs4 import BeautifulSoup

In [None]:
r = requests.get(URL)

In [None]:
# Parse the response fetched above instead of downloading the page a second time
soup = BeautifulSoup(r.text, "html.parser")
results = soup.find_all('div', class_='result')

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip'`.
- The title of a job is in a link with `data-tn-element="jobTitle"`, inside an `h2` with class `jobtitle`.
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 

### Write 4 functions to extract each item: location, company, job, and salary.

example: 
```python
def extract_location_from_result(result):
    return result.find ...
```


- **Make sure these functions are robust and can handle cases where the data/field may not be available.**
    - Remember to check if a field is empty or `None` before attempting to call methods on it
    - Remember to use `try/except` if you anticipate errors
- **Test** the functions on the results above and simple examples
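
One possible shape for these four functions is sketched below. It assumes the markup described above (`span.location`, `span.company`, a link with `data-tn-element="jobTitle"`, and `nobr` inside `td.snip`); the helper `_text_or_none` and the trimmed sample listing are hypothetical and exist only for illustration.

```python
from bs4 import BeautifulSoup

def _text_or_none(tag):
    """Return the stripped text of a tag, or None if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None

def extract_location_from_result(result):
    return _text_or_none(result.find("span", class_="location"))

def extract_company_from_result(result):
    return _text_or_none(result.find("span", class_="company"))

def extract_job_from_result(result):
    return _text_or_none(result.find("a", attrs={"data-tn-element": "jobTitle"}))

def extract_salary_from_result(result):
    cell = result.find("td", class_="snip")
    # Only look for <nobr> if the enclosing cell exists.
    return _text_or_none(cell.find("nobr")) if cell is not None else None

# A trimmed, hypothetical listing modeled on the sample above (no location span).
sample = """
<div class=" row result">
  <h2 class="jobtitle"><a data-tn-element="jobTitle">AVP/Quantitative Analyst</a></h2>
  <span class="company"><span itemprop="name"><a> AllianceBernstein</a></span></span>
  <table><tr><td class="snip"><nobr>$117,500 - $127,500 a year</nobr></td></tr></table>
</div>
"""
soup = BeautifulSoup(sample, "html.parser")
result = soup.find_all("div", class_="result")[0]

print(extract_job_from_result(result))
print(extract_company_from_result(result))
print(extract_salary_from_result(result))
print(extract_location_from_result(result))  # missing field comes back as None
```

Returning `None` for a missing field, rather than raising, keeps the later dataframe step simple because pandas treats `None` as a null value.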

In [None]:
## YOUR CODE HERE

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here that we can alter to collect more results: `l=New+York` and `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start; each page gives 10 results, so we can keep incrementing by 10 to go further in the list.

#### Complete the following code to collect results from multiple cities and starting points. 
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

In [None]:
YOUR_CITY = ''

In [None]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 100 # Set this to a high value (5000) to generate more results.
# Crawling more results will also take much longer. First test your code on a small number of results and then expand.

results = []

for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', YOUR_CITY]):
    for start in range(0, max_results_per_city, 10):
        # Grab the results from the request (as above)
        # Append to the full set of results
        pass
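
One way to fill in the loop above is to generate the page URLs first and fetch them afterwards. This is a sketch under assumptions: `build_page_urls` is a hypothetical helper, and the fetching itself (for example `requests.get(url)` followed by parsing, as earlier) is left out so the snippet runs without network access.

```python
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"

def build_page_urls(cities, max_results_per_city, step=10):
    """Return one URL per (city, start offset) pair; empty city names are skipped."""
    urls = []
    for city in sorted(set(cities)):
        if not city:
            continue  # e.g. YOUR_CITY left as ''
        for start in range(0, max_results_per_city, step):
            urls.append(url_template.format(city, start))
    return urls

urls = build_page_urls(["New+York", "Chicago", ""], max_results_per_city=30)
for u in urls:
    print(u)
```

Separating URL generation from fetching makes it easy to test the pagination on a small `max_results_per_city` before crawling thousands of pages.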

#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

# Begin Here

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
%matplotlib inline

In [2]:
df = pd.read_csv('/Users/charlesrice/desktop/Jobhuntr 7-Sept-2016.csv')
df.head()

Unnamed: 0,snippet,location,title/_source,title,title/_title,title/_text,pageUrl,salary,company/_text,company/_source,company
0,Experience working with imperfect data. At Civ...,"Chicago, IL",/rc/clk?jk=cb86f3205f73582a&fccid=85806092f576...,http://www.indeed.com/rc/clk?jk=cb86f3205f7358...,New Graduate - Data Scientist,New Graduate - Data Scientist,http://www.indeed.com/jobs?q=%22data+scientist...,,,,
1,Chief Data Scientist – Oil & Gas Analytics Gra...,"San Jose, CA",/cmp/Digitek-Corp-Solution-Inc.-(DCSI)/jobs/Ch...,http://www.indeed.com/cmp/Digitek-Corp-Solutio...,Chief Data Scientist – Oil & Gas Analytics,Chief Data Scientist – Oil & Gas Analytics,http://www.indeed.com/jobs?q=%22data+scientist...,"$170,000 a year",,,
2,The Data Scientist Analytics role has work acr...,"Mountain View, CA",/rc/clk?jk=342373f6d9bee6ab&fccid=1639254ea847...,http://www.indeed.com/rc/clk?jk=342373f6d9bee6...,"Data Scientist, Analytics (WhatsApp)","Data Scientist, Analytics (WhatsApp)",http://www.indeed.com/jobs?q=%22data+scientist...,,Facebook,/cmp/Facebook?from=SERP&campaignid=serp-linkco...,http://www.indeed.com/cmp/Facebook?from=SERP&c...
3,US Citizenship is required in order to obtain ...,"Reston, VA 20191",/rc/clk?jk=3b436f2ebaf33c60&fccid=6ccb0eba9956...,http://www.indeed.com/rc/clk?jk=3b436f2ebaf33c...,Data Scientist,Data Scientist,http://www.indeed.com/jobs?q=%22data+scientist...,,Engility Corporation,/cmp/Engility-Corp?from=SERP&campaignid=serp-l...,http://www.indeed.com/cmp/Engility-Corp?from=S...
4,"As a Data Scientist, you have the ability to l...","San Francisco, CA 94103 (South Of Market area)",/rc/clk?jk=c2142d8baff9d10b&fccid=34a475954feb...,http://www.indeed.com/rc/clk?jk=c2142d8baff9d1...,Data Scientist,Data Scientist,http://www.indeed.com/jobs?q=%22data+scientist...,,DoubleDutch,/cmp/Doubledutch?from=SERP&campaignid=serp-lin...,http://www.indeed.com/cmp/Doubledutch?from=SER...


Ew. What an ugly bunch of data that is.

In [3]:
# Let's clean it up a bit
df.drop(['snippet', 'title/_source', 'title', 'title/_title', 'pageUrl', 'company/_source', 'company'], axis=1, inplace=True)

In [None]:
df.head()
# Much better

In [4]:
# Any nulls to speak of?
print df.isnull().sum()
print
print df.isnull().sum()/len(df)

location            0
title/_text         0
salary           2879
company/_text    1351
dtype: int64

location         0.000000
title/_text      0.000000
salary           0.928710
company/_text    0.435806
dtype: float64


In [None]:
# Good grief that's a lot of nulls. Almost all of our data is lacking the data we want. Great.

Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly; these will not be useful to us for now
1. Some of the entries may be duplicated
1. The salaries are given as text and usually with ranges.

#### Find the entries with annual salaries by filtering out entries without salaries and entries whose salaries are not yearly (those that refer to an hour or a week). Also, remove duplicate entries

In [5]:
# Note: under Python 2 this is integer division, so any fraction below 1 prints as 0
print df.duplicated().sum()/len(df)
df[df.duplicated()]
df.dropna(subset=['salary'], inplace=True)

0


In [None]:
# Given the search terms used in the extractor, removing duplicate entries would remove about 50% of the dataset, which
# is already small enough as it is. Since the feature set is so small, there's no way to know if something is a legit 
# duplicate or not

In [None]:
# hour = 'an hour'
# for elem in df.salary:
#     if hour in elem:
#         elem == 42 # dead end road for what I'm trying to do

In [6]:
# Get rid of the hourly and monthly salaries
df = df[~df.salary.str.contains(' an hour')]

df = df[~df.salary.str.contains(' a month')]

In [None]:
df.shape

In [7]:
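# There are still a lot of unnecessary strings around that lovely salary data.

# Note: rstrip removes any trailing characters from the set 'a', ' ', 'y', 'e', 'r', not the
# literal suffix; it works here only because the text before ' a year' ends in a digit
df['sal_1'] = df.salary.str.rstrip('a year')

df['sal_2'] = df.sal_1.str.replace('$','')

df['sal_2'] = df.sal_2.str.replace(',', '')

df['sal_2'] = df.sal_2.str.split(' - ')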

In [None]:
df.head()

In [None]:
# d = [[1.1, 2.1], [4.4, 3.3]]

# def interize(list):
#     for x in d:
#         for y in x:
#             y = int(y)
#     return y

# # interize(d)

In [8]:
# Need to make the salaries numbers and not strings.

df['sal_2'] = df['sal_2'].apply(pd.to_numeric)

In [None]:
for elem in df.sal_2:
    for v in elem:
        print type(v)

In [9]:
# Now, get the salaries or the average salaries in the event of a range
sal_3 = []
for elem in df.sal_2:
    if len(elem) == 2:
        s = (elem[0] + elem[1])/2
        sal_3.append(s)
    else:
        s = elem[0]
        sal_3.append(s)

In [10]:
print type(sal_3)

<type 'list'>


In [11]:
# The index was all mucked up from earlier slices, so it required a reset before we could move forward

df.reset_index(drop=True, inplace=True)

In [None]:
df.head()

In [12]:
# Append the new average salary information to the existing dataframe. Although in retrospect, I probably could have
# mapped some function onto the salary column

df2 = pd.DataFrame(data=sal_3, columns=['avg_sal'])

In [None]:
df2.head()

In [13]:
df3 = df.join(df2)

In [14]:
df3.drop(['salary', 'sal_1', 'sal_2'], axis=1, inplace=True)

In [15]:
pnames = ['location', 'title', 'company', 'avg_sal']
df3.columns = pnames

In [None]:
# That's cleaned up, but it's still messy. We've got unrecognized overlap due to discrepancies in names.
df3.location.value_counts()

In [16]:
# Let's clean them up the most efficient way we know how: regex
import re

In [17]:
df3['location'] = df3['location'].str.replace(r"\(.*\)","")

In [18]:
df3['location'] = df3['location'].str.replace(r"^\d+\s|\s\d+\s|\s\d+$","")

In [19]:
df3['location'] = df3['location'].str.replace(',', '')
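
The three replacements above can be traced on a single string with the standard `re` module. This is a small sketch: `clean_location` is a hypothetical helper name, and the final `strip()` is an addition assumed here to drop leftover whitespace.

```python
import re

def clean_location(location):
    """Apply the same three substitutions used in the cells above, in order."""
    # Drop parenthesized neighborhood notes such as "(South Of Market area)".
    location = re.sub(r"\(.*\)", "", location)
    # Drop ZIP-like runs of digits at the start, middle, or end.
    location = re.sub(r"^\d+\s|\s\d+\s|\s\d+$", "", location)
    # Drop commas so "Chicago, IL" and "Chicago IL" collapse to one value.
    location = location.replace(",", "")
    return location.strip()

for raw in ["Reston, VA 20191", "San Francisco, CA 94103 (South Of Market area)"]:
    print(repr(clean_location(raw)))
```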

#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary
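
A possible version of such a function is sketched below in Python 3 syntax. The name `salary_to_number` and the choice to return `None` for missing or non-yearly salaries are assumptions made for this illustration, not part of the original notebook.

```python
import re

def salary_to_number(salary):
    """Convert a salary string to a yearly figure, averaging a range.

    Returns None for missing values or salaries that are not yearly.
    """
    if not isinstance(salary, str) or "a year" not in salary:
        return None
    # Pull out every number such as "117,500" or "170000" and drop the commas.
    amounts = [float(m.replace(",", ""))
               for m in re.findall(r"\d[\d,]*(?:\.\d+)?", salary)]
    if not amounts:
        return None
    return sum(amounts) / len(amounts)

print(salary_to_number("$117,500 - $127,500 a year"))
print(salary_to_number("$170,000 a year"))
print(salary_to_number("$50 an hour"))
```

With pandas, this could be applied as `df['salary'].apply(salary_to_number)`, which would replace the intermediate `sal_1` and `sal_2` columns.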

In [None]:
# Turn all that into a function. Yay!

### Save your results as a CSV

In [20]:
df3.to_csv('clean_job.csv')

## Predicting salaries using Random Forests + Another Classifier

#### Load in the data of scraped salaries

In [21]:
df = pd.read_csv('clean_job.csv')

#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW salary.

While performing regression may be better, performing classification may help remove some of the noise of the extreme salaries. We don't have to choose the `median` as the splitting point; we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of salaries.
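
The split points mentioned above can be sketched with pandas. The five salaries are taken from rows of the cleaned data in this notebook; the column names `high`, `top_quarter`, and `level` are illustrative assumptions.

```python
import pandas as pd

# A few salaries from the cleaned data in this notebook.
df = pd.DataFrame({"avg_sal": [170000, 75000, 90000, 140000, 112500]})

# Binary target: 1 when the salary is above the median.
median_sal = df["avg_sal"].median()
df["high"] = (df["avg_sal"] > median_sal).astype(int)

# Alternative split point: the 75th percentile.
p75 = df["avg_sal"].quantile(0.75)
df["top_quarter"] = (df["avg_sal"] > p75).astype(int)

# Several salary levels instead of two: three equal-sized bins.
df["level"] = pd.qcut(df["avg_sal"], q=3, labels=["low", "mid", "high"])
print(df)
```

A median split gives roughly balanced classes by construction, which keeps the baseline accuracy near 50% and makes model improvements easier to read.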

In [22]:
df.drop(['Unnamed: 0'], axis=1, inplace=True)

In [23]:
# Take a look at our lovely, clean dataframe before we muck it all up again
df.head()

Unnamed: 0,location,title,company,avg_sal
0,San Jose CA,Chief Data Scientist – Oil & Gas Analytics,,170000
1,Arlington VA,Data Scientist,Jobspring Partners,75000
2,St. Louis MO,SENIOR DATA SCIENTIST,Analytic Recruiting,90000
3,New York NY,Data Scientist,Workbridge Associates,140000
4,Norwalk CT,Data Scientist,,112500


In [24]:
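# Note: despite the name, this is the mean; df['avg_sal'].median() would give the median split the exercise asks for
med = df['avg_sal'].mean()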

In [25]:
med # mean salary for the dataset is 126,000

126356.0052631579

In [26]:
# Check salaries against the mean computed above (stored in med)
high = []
for row in df.avg_sal:
    high.append(row > med)
    

In [27]:
df2 = pd.DataFrame(data=high, columns=['high'])

In [28]:
df = df.join(df2)

In [None]:
df.head()

In [29]:
df['high'] = df.high.map({True:1, False:0})

In [None]:
df.head()

In [30]:
loc_enc = LabelEncoder()

In [31]:
# Encode the location names with numbers. Binary (dummy) columns would probably be better, but all those dummies are hard to interpret
df['locus'] = loc_enc.fit_transform(df.location)

In [None]:
df.head()

In [None]:
loc_enc.classes_

#### Thought experiment: What is the baseline accuracy for this model?

In [32]:
df.high.value_counts(normalize=True)

1    0.563158
0    0.436842
Name: high, dtype: float64

In [None]:
# Baseline: always predicting the majority class (high, i.e. above the mean) would be correct about 56% of the time
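
The baseline is the accuracy of always guessing the most common class. The sketch below uses hypothetical labels: 107 high and 83 low reproduces the 0.563 / 0.437 split printed above, assuming 190 rows in total.

```python
from collections import Counter

# Hypothetical labels chosen to match the class proportions printed above.
y = [1] * 107 + [0] * 83

counts = Counter(y)
majority_class, majority_count = counts.most_common(1)[0]

# float() keeps this a true division under Python 2 as well.
baseline_accuracy = float(majority_count) / len(y)
print(majority_class)
print(round(baseline_accuracy, 4))
```

Any model worth keeping should beat this number on held-out data.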

#### Create a Random Forest model to predict High/Low salary using scikit-learn. Start by ONLY using the location as a feature. 

In [33]:
from sklearn.ensemble import RandomForestClassifier
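# sklearn.cross_validation was removed in scikit-learn 0.20; newer versions provide these in sklearn.model_selection
from sklearn.cross_validation import cross_val_score, StratifiedKFold, KFold, train_test_split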
from sklearn.metrics import confusion_matrix, classification_report

In [34]:
X = df.locus.values
y = df.high.values

In [35]:
X
X = X[:,None]

In [36]:
rf = RandomForestClassifier()

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=.3, random_state = 42)

In [38]:
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [39]:
y_pred = rf.predict(X_test)

In [40]:
confusion_matrix(y_test, y_pred)

array([[20,  7],
       [ 5, 25]])

In [41]:
print classification_report(y_test, y_pred)

             precision    recall  f1-score   support

          0       0.80      0.74      0.77        27
          1       0.78      0.83      0.81        30

avg / total       0.79      0.79      0.79        57
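
The report can be checked by hand against the confusion matrix, whose rows are the actual classes and whose columns are the predicted classes. A short sketch of that arithmetic, using the four counts printed above (the variable names are illustrative):

```python
# Confusion matrix from the cell above: rows = actual, columns = predicted.
tn, fp = 20, 7   # actual low:  20 predicted low, 7 predicted high
fn, tp = 5, 25   # actual high: 5 predicted low, 25 predicted high

# Of the jobs predicted high, how many really were high?
precision_high = tp / float(tp + fp)
# Of the jobs that really were high, how many did the model find?
recall_high = tp / float(tp + fn)
# Share of all test rows classified correctly.
accuracy = (tp + tn) / float(tp + tn + fp + fn)

print(round(precision_high, 2))
print(round(recall_high, 2))
print(round(accuracy, 2))
```

These reproduce the 0.78 precision and 0.83 recall reported for class 1, and an overall accuracy of about 0.79 on the 57 test rows.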



In [57]:
cross_val_score(rf, X,y)

array([ 0.5625  ,  0.796875,  1.      ])

#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title 
- or whether 'Manager' is in the title. 
- Then build a new Random Forest with these features. Do they add any value? 
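
Title flags like these can be built with regular expressions. The sketch below uses word boundaries so that a short pattern such as `sr` does not match inside longer words; the function name, the flag names, and the shortened example titles are assumptions for illustration.

```python
import re

# \b keeps "sr" from matching inside longer words.
SENIOR_RE = re.compile(r"\b(?:senior|sr)\b", re.IGNORECASE)
MANAGER_RE = re.compile(r"\bmanager\b", re.IGNORECASE)

def title_features(title):
    """Return 0/1 flags describing a job title."""
    return {
        "is_senior": int(bool(SENIOR_RE.search(title))),
        "is_manager": int(bool(MANAGER_RE.search(title))),
    }

# Titles adapted (and shortened) from listings in this dataset.
titles = [
    "Sr. Data Scientist",
    "Senior Data Scientist",
    "Data Scientist/Analytics Manager with SQL",
    "Data Scientist",
]
for t in titles:
    print(t, title_features(t))
```

In a dataframe, something like `df['title'].str.contains(r'\b(?:senior|sr)\b', case=False)` should give the same senior flag in one step.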


In [42]:
pat1 = 'senior'
pat2 = 'sr'

In [43]:
df['title_in1'] = df.title.str.contains(pat1, case=False)
df['title_in2'] = df.title.str.contains(pat2, case=False)

In [44]:
df['title_in1'] = df.title_in1.map({False:0, True:1})
df['title_in2'] = df.title_in2.map({False:0, True:1})

In [45]:
df.title_in1.value_counts()

0    147
1     43
Name: title_in1, dtype: int64

In [46]:
df.title_in2.value_counts()

0    167
1     23
Name: title_in2, dtype: int64

In [47]:
df['title_in'] = df.title_in1 + df.title_in2

In [48]:
df.title_in.value_counts()

0    124
1     66
Name: title_in, dtype: int64

In [49]:
df.title.value_counts()

Data Scientist                                                     58
Lead Data Scientist                                                24
Sr. Data Scientist                                                 23
Senior Data Scientist - Security Experience is Huge                21
Senior Data Scientist                                               6
Data Scientist - Experienced                                        2
Principal Statistical Analyst / Data Scientist ( USC or GC O...     1
Senior Data Scientist for Multi-Billion Dollar Hedge Fund           1
Senior Marketing Data Scientist                                     1
Principal Data Scientist - Supply Chain                             1
Director of Data Science/Chief Product Officer                      1
California Client seeks Ph.D. or M.S. Property and Casualty...      1
Data Scientist/Analytics Manager with SQL (requires Secret C...     1
Predictive Analytics Professionals-- ROAD WARRIORS                  1
Data Engineer (Elast

In [54]:
df.corr()

Unnamed: 0,avg_sal,high,locus,title_in1,title_in2,title_in
avg_sal,1.0,0.869775,-0.042897,0.311877,0.238406,0.437416
high,0.869775,1.0,-0.154623,0.222758,0.294322,0.397402
locus,-0.042897,-0.154623,1.0,0.080561,0.457014,0.38389
title_in1,0.311877,0.222758,0.080561,1.0,-0.200716,0.741335
title_in2,0.238406,0.294322,0.457014,-0.200716,1.0,0.50868
title_in,0.437416,0.397402,0.38389,0.741335,0.50868,1.0


In [55]:
df.drop(['title_in1', 'title_in2'], axis=1, inplace=True)

In [56]:
df.corr()

Unnamed: 0,avg_sal,high,locus,title_in
avg_sal,1.0,0.869775,-0.042897,0.437416
high,0.869775,1.0,-0.154623,0.397402
locus,-0.042897,-0.154623,1.0,0.38389
title_in,0.437416,0.397402,0.38389,1.0


In [58]:
X_2 = df[['locus', 'title_in']]

In [59]:
X_2_train, X_2_test, y_2_train, y_2_test = train_test_split(X_2, y, test_size=.3, random_state=42)

In [60]:
rf_2 = RandomForestClassifier()

In [61]:
rf_2.fit(X_2_train, y_2_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [62]:
y_pred_2 = rf_2.predict(X_2_test)

In [63]:
confusion_matrix(y_2_test, y_pred_2)

array([[15, 12],
       [ 2, 28]])

In [64]:
print classification_report(y_2_test, y_pred_2)

             precision    recall  f1-score   support

          0       0.88      0.56      0.68        27
          1       0.70      0.93      0.80        30

avg / total       0.79      0.75      0.74        57



In [67]:
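# cv=3 here; the five scores shown below appear to come from an earlier run of this cell (note the execution counts)
scores_2 = cross_val_score(rf_2, X_2, y, cv=3)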

In [66]:
scores_2

array([ 0.53846154,  0.43589744,  0.84210526,  1.        ,  1.        ])

#### Logistic Regression

Let's try treating this as a regression problem. 

- Train a random forest regressor on the regression problem and predict your dependent variable.
- Evaluate the score with 5-fold cross-validation.
- Make a scatter plot of the predicted vs. actual values for each of the 5 folds. Do they match?

In [68]:
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import train_test_split

In [69]:
logr = LogisticRegression()

In [70]:
X_train, X_test, y_train, y_test = train_test_split(X,y, random_state=42)

In [71]:
logr.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [72]:
y_pred = logr.predict(X_test)

In [73]:
from sklearn.metrics import r2_score, confusion_matrix, classification_report

In [74]:
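# R^2 is a regression metric; on 0/1 class labels it says little, so rely on the confusion matrix and report below
r2_score(y_test, y_pred)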

-1.685314685314685

In [75]:
confusion_matrix(y_test, y_pred)

array([[ 0, 26],
       [ 6, 16]])

In [76]:
print classification_report(y_test, y_pred)

             precision    recall  f1-score   support

          0       0.00      0.00      0.00        26
          1       0.38      0.73      0.50        22

avg / total       0.17      0.33      0.23        48



# Playing with CountVectorizer

In [None]:
# This didn't go anywhere, which doesn't bode well for my capstone.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
titles = df.title.values

In [None]:
stops = ['data', 'scientist', 'science', 'the', 'a', 'an', 'is', 'senior', 'sr']

In [None]:
cv = CountVectorizer(stop_words=stops, max_features=6)

In [None]:
cv.fit_transform(titles)

In [None]:
cv.vocabulary_.get(u'security')

In [None]:
cv.get_feature_names()

In [None]:
cv

In [None]:
## YOUR CODE HERE

#### Continue to incorporate other text features from the title or summary that you believe will predict the salary and examine their coefficients

#### Take ~100 scraped entries with salaries. Convert them to use with your model and predict the salary - which entries have the highest predicted salaries?

### BONUS 

#### Bonus: Use `CountVectorizer` from scikit-learn to create features from the text summaries. 
- Examine using count or binary features in the model
- Re-evaluate your models using these. Does this improve the model performance? 
- What text features are the most valuable? 
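
The difference between count and binary features can be seen on a tiny example. The three summaries below are invented purely for illustration and are not scraped data; `CountVectorizer`'s `binary=True` option caps every count at 1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical job summaries, invented for illustration only.
summaries = [
    "machine learning and python, python required",
    "sql reporting and dashboards",
    "deep learning research with python",
]

# Count features: how many times each word occurs in each summary.
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(summaries)

# Binary features: only whether each word occurs at all.
binary_vec = CountVectorizer(stop_words="english", binary=True)
binary = binary_vec.fit_transform(summaries)

col = count_vec.vocabulary_["python"]
print(counts[0, col])
print(binary[0, col])
```

For short texts such as job titles, the two encodings are often nearly identical; the choice tends to matter more for longer summaries where words repeat.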

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE