# Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

We are going to collect salary information on data science jobs in a variety of markets. Then, using the location, title and summary of the job, we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being able to extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use Logistic Regression.

- Question: Why would we want this to be a classification problem?
- Answer: While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Set up a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for `div` tags with class name `result`)

In [1]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [265]:
import numpy as np
import pandas as pd
import seaborn as sns
import requests
import bs4
from bs4 import BeautifulSoup
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.model_selection import train_test_split, cross_val_score  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import confusion_matrix
from sklearn import metrics
import statsmodels.api as sm
%matplotlib inline

### Learning to use BeautifulSoup to extract information out of Indeed.com

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip'`.
- The title of a job is in a link with `data-tn-element="jobTitle"`, inside an `h2` with class `jobtitle`.
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 
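
As a rough illustration of the structure listed above, the sketch below parses a trimmed copy of the sample result with BeautifulSoup. The `sample_html` string and the `parse_salary` helper are assumptions made for this example, not part of the assignment, and Python's built-in `html.parser` is used so that no extra parser is required.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the sample listing shown above (assumed to be representative).
sample_html = """
<div class=" row result">
<h2 class="jobtitle">
<a class="turnstileLink" data-tn-element="jobTitle" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company"><span itemprop="name"><a href="#">
    AllianceBernstein</a></span></span>
<table><tr><td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div><span class="summary">Conduct quantitative and statistical research.</span></div>
</td></tr></table>
</div>
"""

def parse_salary(text):
    """Hypothetical helper: midpoint of the dollar amounts in a salary string."""
    amounts = [float(m.replace(",", "")) for m in re.findall(r"\$([\d,]+)", text)]
    return sum(amounts) / len(amounts) if amounts else None

soup = BeautifulSoup(sample_html, "html.parser")
listings = []
for result in soup.find_all("div", class_="result"):
    salary_tag = result.find("nobr")   # absent on listings without a salary
    listings.append({
        "title": result.find("a", attrs={"data-tn-element": "jobTitle"}).get_text().strip(),
        "company": result.find("span", class_="company").get_text().strip(),
        "summary": result.find("span", class_="summary").get_text().strip(),
        "salary": parse_salary(salary_tag.get_text()) if salary_tag else None,
    })

print(listings)
```

Matching on `class_="result"` checks for a single class token, so it does not depend on the exact spacing of the `class` attribute.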

#### Obtaining results for low salary range



In [23]:

base_url = "http://www.indeed.com/jobs?q=data+scientist+%2430%2C000+-+%2480%2C000&l="

start_from = '&start='    # query parameter for the result offset

rows = []   # one dict per job listing

for city in set(['New+York%2C+NY', 'Chicago%2C+IL', 'San+Francisco%2C+CA', 
                                        'Austin%2C+TX', 'Boston%2C+MA']):
    for page in range(1,21): # pages 1 to 20
        offset = (page-1) * 10  # each page holds 10 results: 0, 10, ..., 190
        url = "%s%s%s%d" % (base_url, city, start_from, offset) # full URL for this city and page
        r = requests.get(url)
        soup = BeautifulSoup(r.content,'lxml') 

        results = soup.find_all("div", {'class' : ' row result'}) # one div per job listing
    
        # pull the company name, job title and summary out of each listing
        for result in results: 
            comp_name = result.find('span', {'class':'company'}).get_text().strip()
            job_title = result.find('a', {'class':'turnstileLink'}).get_text().strip()
            summary = result.find('span',{'class':'summary'}).get_text().strip()

            # Salary Index 0 marks the low salary range
            rows.append({'Company Name': comp_name, 'Job Title': job_title, 
                         'Location': city, 'Salary Index': 0, 'Summary': summary})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the list
df = pd.DataFrame(rows)
df

| | Company Name | Job Title | Location | Salary Index | Summary |
|---|---|---|---|---|---|
| 0 | Alambic Investment Management, LP | Data Scientist / Engineer | San+Francisco%2C+CA | 0.0 | We are a small, entrepreneurial San Francisco-... |
| 1 | Twitch | Data Scientist | San+Francisco%2C+CA | 0.0 | We think of Emmett, the CEO, as Twitch’s origi... |
| 2 | Dolby | [Summer Session] Social Media and Data Analysi... | San+Francisco%2C+CA | 0.0 | Analyze online trends and data. Strong data an... |
| 3 | University of California San Francisco | Research Data Analyst | San+Francisco%2C+CA | 0.0 | The Research Data Analyst will provide statist... |
| 4 | University of California San Francisco | Research Analyst | San+Francisco%2C+CA | 0.0 | Assisting off-site research assistants with qu... |
| 5 | The Nielsen Company | Research Analyst/ Sr. Research Analyst | San+Francisco%2C+CA | 0.0 | Data Processing Management. Oversee all data p... |
| 6 | Natera | Data Scientist | San+Francisco%2C+CA | 0.0 | Natera is seeking a highly motivated Data Scie... |
| 7 | Ancestry | Senior Data Scientist | San+Francisco%2C+CA | 0.0 | Data Mining Product team is looking for an exp... |
| 8 | Ancestry | Scientific Data Wrangler | San+Francisco%2C+CA | 0.0 | Working with a nimble team of physicians, gene... |
| 9 | DiscoveRx Corporation | Scientist/Senior Scientist | San+Francisco%2C+CA | 0.0 | Scientist/Senior Scientist position currently ... |


#### Obtaining results for high salary range

In [26]:
# NOTE: "%25" decodes to "%", not "$" ("%24"), so the upper bound of this query reads "%30,000"
base_url = "http://www.indeed.com/jobs?q=data+scientist+%2480%2C000+-+%2530%2C000&l="

start_from = '&start='    # query parameter for the result offset

rows = []   # one dict per job listing

for city in set(['New+York%2C+NY', 'Chicago%2C+IL', 'San+Francisco%2C+CA', 
                                        'Austin%2C+TX', 'Boston%2C+MA']):
    for page in range(1,21): # pages 1 to 20
        offset = (page-1) * 10  # each page holds 10 results: 0, 10, ..., 190
        url = "%s%s%s%d" % (base_url, city, start_from, offset) # full URL for this city and page
        r = requests.get(url)
        soup = BeautifulSoup(r.content,'lxml') 

        results = soup.find_all("div", {'class' : ' row result'}) # one div per job listing
    
        # pull the company name, job title and summary out of each listing
        for result in results: 
            comp_name = result.find('span', {'class':'company'}).get_text().strip()
            job_title = result.find('a', {'class':'turnstileLink'}).get_text().strip()
            summary = result.find('span',{'class':'summary'}).get_text().strip()

            # Salary Index 1 marks the high salary range
            rows.append({'Company Name': comp_name, 'Job Title': job_title, 
                         'Location': city, 'Salary Index': 1, 'Summary': summary})

# DataFrame.append was removed in pandas 2.0, so build the frame once from the list
df2 = pd.DataFrame(rows)
df2

| | Company Name | Job Title | Location | Salary Index | Summary |
|---|---|---|---|---|---|
| 0 | MarkMonitor | Data Scientist | San+Francisco%2C+CA | 1.0 | Data skills (SQL, Hive, Pig). Applying machine... |
| 1 | Docker | Senior Manager/Director, Alliances-Strategic D... | San+Francisco%2C+CA | 1.0 | Have a pulse on partner industry to identify t... |
| 2 | Glassdoor | Principal Data Scientist | San+Francisco%2C+CA | 1.0 | Mentor a team of data scientists and ML engine... |
| 3 | Thomson Reuters | Data Scientist | San+Francisco%2C+CA | 1.0 | Explores existing data for insights and recomm... |
| 4 | O'Reilly Media | Junior Research Scientist | San+Francisco%2C+CA | 1.0 | Junior Research Scientist. This position is re... |
| 5 | Jawbone | Data Scientist | San+Francisco%2C+CA | 1.0 | The Team You can’t wait to join a team of data... |
| 6 | Lawrence Berkeley National Laboratory | Atomic and Molecular Dynamics Postdoctoral Fel... | San+Francisco%2C+CA | 1.0 | Demonstrated experience analyzing multidimensi... |
| 7 | Glassdoor | Lead Data Scientist, Machine Learning | San+Francisco%2C+CA | 1.0 | As a Lead Data Scientist, you will head up a 2... |
| 8 | Smith Hanley Associates | Data Scientist | San+Francisco%2C+CA | 1.0 | This person will recruit, build and lead a tea... |
| 9 | HSF Consulting | VP of Data Services | San+Francisco%2C+CA | 1.0 | Teams included Data Services(including data en... |
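
Both scraping cells hand-encode their query strings, which is easy to get wrong. The following is a minimal sketch of building and checking the same kind of URL with the standard library. `build_search_url` is a hypothetical helper, and the parameter names `q`, `l` and `start` are taken from the URLs above.

```python
from urllib.parse import urlencode, unquote_plus

def build_search_url(query, city, start):
    """Hypothetical helper: percent-encode the search parameters."""
    params = urlencode({"q": query, "l": city, "start": start})
    return "http://www.indeed.com/jobs?" + params

low_url = build_search_url("data scientist $30,000 - $80,000", "New York, NY", 0)
print(low_url)

# Decoding a hand-written query shows what is actually being sent.
decoded = unquote_plus("data+scientist+%2480%2C000+-+%2530%2C000")
print(decoded)
```

The decoded high-range query reads `data scientist $80,000 - %30,000`, because `%25` is the encoding of `%` while `$` is `%24`.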


#### Learning about the data sets

In [27]:
categorical = df.dtypes[df.dtypes == "object"].index
df[categorical].describe()

| | Company Name | Job Title | Location | Summary |
|---|---|---|---|---|
| count | 820 | 820 | 820 | 820 |
| unique | 502 | 621 | 5 | 720 |
| top | Natera | Research Analyst | Boston%2C+MA | The Lab Technician I reports to a Supervisor, ... |
| freq | 24 | 40 | 180 | 14 |


In [28]:
categorical = df2.dtypes[df2.dtypes == "object"].index  # build the mask from df2 itself, not df
df2[categorical].describe()

| | Company Name | Job Title | Location | Summary |
|---|---|---|---|---|
| count | 514 | 514 | 514 | 514 |
| unique | 77 | 167 | 5 | 142 |
| top | Biogen | Research Scientist | Boston%2C+MA | Physician Affiliate Group of New York (PAGNY) ... |
| freq | 59 | 20 | 164 | 41 |


#### Merging the two dataframes together

In [29]:
# stack the low and high salary frames and renumber the rows from 0
df4 = pd.concat([df, df2], ignore_index=True)

### Save your results as a CSV

In [67]:
df4.to_csv("datascience.csv", encoding='utf-8')

## Predicting salaries using Logistic Regression

#### Load in the data of scraped salaries, create dummy variables for the cities and perform a train test split on the data.

In [237]:
df4 = pd.read_csv("../Assets/Project4/datascience.csv")
df5=df4[['Location','Salary Index']]
dummies = pd.get_dummies( df5["Location"], prefix = "Location" )
dummies['Intercept']=1
dummies.columns=['Austin','Boston', 'Chicago','New York','San Francisco','Intercept']
y=df5['Salary Index']
X=dummies[['Boston','Chicago','New York','San Francisco','Intercept']]
# split data randomly into datasets, 70% train, 30% test using test train split
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=30)
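
The cell above renames the dummy columns by position, which relies on `get_dummies` returning the cities in alphabetical order. A small sketch of an alternative that keeps readable names and drops the baseline city automatically is shown below. The `toy` frame is made-up data standing in for the scraped CSV.

```python
import pandas as pd
from urllib.parse import unquote_plus

# Toy rows standing in for the scraped CSV (values are made up).
toy = pd.DataFrame({
    "Location": ["Austin%2C+TX", "Boston%2C+MA", "Chicago%2C+IL", "Austin%2C+TX"],
    "Salary Index": [0, 1, 0, 1],
})

# Decode the URL-encoded city strings, e.g. "Austin%2C+TX" -> "Austin, TX".
city = toy["Location"].map(unquote_plus)

# drop_first=True removes the first category (Austin), which becomes the baseline.
X_toy = pd.get_dummies(city, drop_first=True, dtype=int)
X_toy["Intercept"] = 1
y_toy = toy["Salary Index"]

print(X_toy)
```

With Austin dropped, each remaining coefficient is read relative to Austin, which matches the choice of columns in `X` above.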

#### Create a Logistic Regression model to predict High/Low salary using statsmodels. Start by ONLY using the location as a feature. Display the coefficients and write a short summary of what they mean.

In [238]:
logit = sm.Logit(y_train, X_train)
result = logit.fit()
result.summary()

Optimization terminated successfully.
         Current function value: 0.644395
         Iterations 5


| | | | |
|---|---|---|---|
| Dep. Variable: | Salary Index | No. Observations: | 933.0 |
| Model: | Logit | Df Residuals: | 928.0 |
| Method: | MLE | Df Model: | 4.0 |
| Date: | Thu, 30 Jun 2016 | Pseudo R-squ.: | 0.03073 |
| Time: | 18:03:02 | Log-Likelihood: | -601.22 |
| converged: | True | LL-Null: | -620.28 |
| | | LLR p-value: | 1.057e-07 |

| | coef | std err | z | P>\|z\| | [95.0% Conf. Int.] |
|---|---|---|---|---|---|
| Boston | 0.0009 | 0.217 | 0.004 | 0.997 | -0.424 0.426 |
| Chicago | -0.6417 | 0.236 | -2.716 | 0.007 | -1.105 -0.179 |
| New York | -0.1917 | 0.221 | -0.868 | 0.385 | -0.624 0.241 |
| San Francisco | -1.2042 | 0.262 | -4.603 | 0.000 | -1.717 -0.691 |
| Intercept | -0.1355 | 0.174 | -0.780 | 0.436 | -0.476 0.205 |


In [239]:
np.exp(result.params) #convert your parameters to odds ratio

Boston           1.000889
Chicago          0.526405
New York         0.825581
San Francisco    0.299923
Intercept        0.873239
dtype: float64

In [240]:
conf=result.conf_int()
conf['OR']=result.params
conf.columns=['2.5%','97.5%','OR']
np.exp(conf)#convert to odds ratio

| | 2.5% | 97.5% | OR |
|---|---|---|---|
| Boston | 0.654124 | 1.531481 | 1.000889 |
| Chicago | 0.331291 | 0.836431 | 0.526405 |
| New York | 0.53565 | 1.272444 | 0.825581 |
| San Francisco | 0.1796 | 0.500858 | 0.299923 |
| Intercept | 0.621122 | 1.227692 | 0.873239 |
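
Odds ratios can be hard to read on their own, so the sketch below converts the fitted coefficients into a predicted probability of a high salary for each city. The numbers are copied from the summary table above. Austin is the dropped baseline, so its coefficient is taken to be 0.

```python
import math

# Coefficients copied from the statsmodels summary above (log-odds scale).
intercept = -0.1355
coefs = {"Austin": 0.0, "Boston": 0.0009, "Chicago": -0.6417,
         "New York": -0.1917, "San Francisco": -1.2042}

def sigmoid(z):
    """Map log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

probs = {city: sigmoid(intercept + b) for city, b in coefs.items()}
for city, p in sorted(probs.items(), key=lambda kv: -kv[1]):
    print("%-14s P(high salary) = %.3f" % (city, p))
```

Under this model no city has a predicted probability above roughly 0.47, so any classification cutoff at or above that value labels every listing low salary.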


In [241]:
dfTest = X_test.copy()   # copy so the added columns do not modify X_test
dfTest['predictedSalary'] = result.predict(X_test)
dfTest["actualSalary"] = y_test


#### This loop helped me determine the best cutoff point for my model

In [242]:
for pct in range(45,56):
    cutoff = pct/100.0
    # label a listing high salary (1) when its predicted probability reaches the cutoff
    dfTest["Predicted"] = [0 if p < cutoff else 1 for p in dfTest["predictedSalary"]]
    A = confusion_matrix(dfTest['actualSalary'],dfTest['Predicted'])
    Accuracy = (A[0,0]+ A[1,1])/float(len(dfTest["Predicted"]))   # correct predictions / all predictions
    print("My cutoff is: %r and ACC: %r" % (cutoff,Accuracy))

My cutoff is: 0.45 and ACC: 0.55361596009975067
My cutoff is: 0.46 and ACC: 0.55361596009975067
My cutoff is: 0.47 and ACC: 0.6059850374064838
My cutoff is: 0.48 and ACC: 0.6059850374064838
My cutoff is: 0.49 and ACC: 0.6059850374064838
My cutoff is: 0.5 and ACC: 0.6059850374064838
My cutoff is: 0.51 and ACC: 0.6059850374064838
My cutoff is: 0.52 and ACC: 0.6059850374064838
My cutoff is: 0.53 and ACC: 0.6059850374064838
My cutoff is: 0.54 and ACC: 0.6059850374064838
My cutoff is: 0.55 and ACC: 0.6059850374064838


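#### Using the information from the previous step, a cutoff of 0.46 is used. 

Cutoffs of 0.47 and above report a higher accuracy (about 0.606), but that figure equals the share of low-salary listings in the test set (243 of 401), which is what a model scores when it labels every listing low. A cutoff of 0.46 is the highest value in the sweep at which the model still predicts both classes.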

In [243]:
cutoff = 0.46
dfTest["Predicted"] = [0 if p < cutoff else 1 for p in dfTest["predictedSalary"]]
print(pd.crosstab(dfTest['actualSalary'], dfTest['Predicted'], rownames=['Actual']))

Predicted    0   1
Actual            
0.0        151  92
1.0         87  71



In [245]:
# correct predictions (151 + 71 = 222, the diagonal of the crosstab) over all test rows
Accuracy = (151 + 71)/float(len(dfTest["Predicted"]))
Accuracy

0.5536159600997507

The accuracy of the model is around 55%, which is lower than the roughly 61% obtained by labeling every test listing low salary (243 of 401). I believe that there is a lot of bias in the data collected and that a more reliable source of data should be found. I will continue to use this data set to see whether other tools can increase the accuracy of the model.
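
Accuracy alone hides how the errors are distributed. As a small worked example, the sketch below derives accuracy, precision, recall and the majority-class baseline from the counts in the crosstab above. The variable names are chosen for this illustration.

```python
# Counts copied from the crosstab above (rows = actual, columns = predicted).
tn, fp = 151, 92   # actual low:  predicted low, predicted high
fn, tp = 87, 71    # actual high: predicted low, predicted high
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision = tp / (tp + fp)      # of the listings predicted high, the share that are high
recall = tp / (tp + fn)         # of the truly high listings, the share that were found
baseline = (tn + fp) / total    # accuracy of always predicting low salary

print("accuracy  %.3f" % accuracy)
print("precision %.3f" % precision)
print("recall    %.3f" % recall)
print("baseline  %.3f" % baseline)
```

Comparing `accuracy` against `baseline` is a quick check on whether a classifier is doing better than always guessing the more common class.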

#### Rebuild this model with scikit-learn.
- You can either create the dummy features manually or use the `dmatrix` function from `patsy`
- Remember to scale the feature variables as well!


In [257]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=30)
model2 = LogisticRegression()
model2.fit(X_train, y_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [258]:
dfTest2 = X_test.copy()   # copy so the added columns do not modify X_test
dfTest2['predictedSalary'] = model2.predict(X_test)
dfTest2["actualSalary"] = y_test


In [261]:
print(metrics.accuracy_score(y_test, dfTest2['predictedSalary']))
print(pd.crosstab(y_test, dfTest2['predictedSalary'], rownames=['Actual']))

0.605985037406
predictedSalary  0.0
Actual              
0.0              243
1.0              158


Using scikit-learn's logistic regression raises the accuracy to about 61%, but the model is not useful: it predicts low salary for every listing, so the score is simply the share of low-salary listings in the test set (243 of 401).

#### Use cross-validation in scikit-learn to evaluate the model above. 

In [264]:
scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
print(scores)
print(scores.mean())

[ 0.6119403   0.6119403   0.6119403   0.17910448  0.61654135  0.61654135
  0.61654135  0.38345865  0.37593985  0.61654135]
0.524048928291
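
The fold scores above range from about 0.18 to 0.62. One plausible cause, not verified here, is that the rows are ordered (low-salary listings first, grouped by city) and an integer `cv` splits the data without shuffling. The sketch below compares unshuffled and shuffled stratified folds on synthetic data. The per-city rates and the row ordering are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the scraped data: five cities with made-up high-salary rates.
rng = np.random.RandomState(0)
city = np.repeat(np.arange(5), 100)
p_high = np.array([0.47, 0.47, 0.31, 0.42, 0.21])[city]
y_syn = (rng.rand(city.size) < p_high).astype(int)
X_syn = np.eye(5)[city][:, 1:]            # one-hot cities, first city as baseline

# Put the low-salary rows first, keeping the city order within each class.
order = np.argsort(y_syn, kind="stable")
X_syn, y_syn = X_syn[order], y_syn[order]

unshuffled = StratifiedKFold(n_splits=10)
shuffled = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

cv_results = {}
for name, cv in [("unshuffled", unshuffled), ("shuffled", shuffled)]:
    fold_scores = cross_val_score(LogisticRegression(), X_syn, y_syn,
                                  scoring="accuracy", cv=cv)
    cv_results[name] = fold_scores
    print(name, fold_scores.round(3), "mean %.3f" % fold_scores.mean())
```

Passing a splitter object as `cv` instead of an integer makes the shuffling explicit and reproducible through `random_state`.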


#### Compare L1 and L2 regularization for this logistic regression model. What effect does this have on the coefficients learned?

In [267]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=30)
L1Model = LogisticRegression(penalty='l1', solver='liblinear')   # liblinear supports the L1 penalty
L1Model.fit(X_train, y_train)
dfTest3 = X_test.copy()   # copy so the added columns do not modify X_test
dfTest3['predictedSalary'] = L1Model.predict(X_test)
dfTest3["actualSalary"] = y_test
print(metrics.accuracy_score(y_test, dfTest3['predictedSalary']))
print(pd.crosstab(y_test, dfTest3['predictedSalary'], rownames=['Actual']))

0.605985037406
predictedSalary  0.0
Actual              
0.0              243
1.0              158



In [268]:
X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.30, random_state=30)
L2Model = LogisticRegression(penalty='l2', solver='liblinear')   # same solver as the L1 model
L2Model.fit(X_train, y_train)
dfTest4 = X_test.copy()   # copy so the added columns do not modify X_test
dfTest4['predictedSalary'] = L2Model.predict(X_test)
dfTest4["actualSalary"] = y_test
print(metrics.accuracy_score(y_test, dfTest4['predictedSalary']))
print(pd.crosstab(y_test, dfTest4['predictedSalary'], rownames=['Actual']))

0.605985037406
predictedSalary  0.0
Actual              
0.0              243
1.0              158
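
The two cells above compare predictions, which come out identical, but the question asks about the coefficients. The sketch below fits both penalties on synthetic data with two informative features and eight noise features, then prints the learned coefficients side by side. The data, the value `C=0.05` and the variable names are assumptions for illustration, not results from the scraped listings.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 2 informative features plus 8 pure-noise features.
rng = np.random.RandomState(0)
X_demo = rng.randn(400, 10)
logits = 2.0 * X_demo[:, 0] - 1.5 * X_demo[:, 1]
y_demo = (rng.rand(400) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# liblinear supports both penalties; a small C means strong regularization.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.05).fit(X_demo, y_demo)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.05).fit(X_demo, y_demo)

print("L1 coefficients:", l1.coef_.round(3))
print("L2 coefficients:", l2.coef_.round(3))

n_zero_l1 = int(np.sum(l1.coef_ == 0))
n_zero_l2 = int(np.sum(l2.coef_ == 0))
print("exact zeros  L1: %d  L2: %d" % (n_zero_l1, n_zero_l2))
```

In general an L1 penalty tends to set weak coefficients exactly to zero, while an L2 penalty shrinks them toward zero without eliminating them. The same inspection can be applied to the models above through `L1Model.coef_` and `L2Model.coef_`.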



### BONUS 

#### Bonus: Use Count Vectorizer from scikit-learn to create features from the text summaries. 
- Examine using count or binary features in the model
- Re-evaluate the logistic regression model using these. Does this improve the model performance? 
- What text features are the most valuable? 
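
As a starting point for the bonus, here is a minimal sketch of turning text into binary word features and ranking the words by their logistic regression coefficient. The `summaries` and `labels` below are made up for illustration. On the real data they would come from the `Summary` and `Salary Index` columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Made-up summaries and labels (1 = high salary), standing in for the Summary column.
summaries = [
    "Senior data scientist leading machine learning projects",
    "Lead machine learning engineer with senior responsibilities",
    "Principal scientist building machine learning models",
    "Research assistant supporting laboratory data entry",
    "Junior analyst preparing data entry reports",
    "Laboratory technician assistant for research studies",
]
labels = [1, 1, 1, 0, 0, 0]

# binary=True records presence or absence; binary=False would keep raw counts.
vectorizer = CountVectorizer(binary=True, stop_words="english")
X_text = vectorizer.fit_transform(summaries)

text_model = LogisticRegression()
text_model.fit(X_text, labels)

# vocabulary_ maps each term to its column index in X_text.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
ranked = sorted(zip(text_model.coef_[0], terms), reverse=True)
print("terms pushing toward high salary:", [t for _, t in ranked[:3]])
print("terms pushing toward low salary: ", [t for _, t in ranked[-3:]])
```

Sorting terms by coefficient is one simple way to approach the question of which text features matter most. With six toy sentences the ranking itself carries no meaning.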

In [None]:
## YOUR CODE HERE

In [None]:
## YOUR CODE HERE

#### Re-test L1 and L2 regularization. You can use LogisticRegressionCV to find the optimal regularization parameters. 
- Re-test what text features are most valuable.  
- How do L1 and L2 change the coefficients?
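
A possible outline for this step, sketched on synthetic data rather than the scraped listings, is shown below. `LogisticRegressionCV` searches a grid of `C` values by cross-validation for each penalty. The feature matrix, the grid size `Cs=5` and `cv=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV

# Synthetic data: the label depends on the first two of eight features.
rng = np.random.RandomState(1)
X_cv = rng.randn(300, 8)
y_cv = (X_cv[:, 0] - X_cv[:, 1] + 0.5 * rng.randn(300) > 0).astype(int)

searches = {}
for penalty in ("l1", "l2"):
    search = LogisticRegressionCV(Cs=5, cv=5, penalty=penalty, solver="liblinear")
    search.fit(X_cv, y_cv)
    searches[penalty] = search
    best_C = np.ravel(search.C_)[0]   # regularization strength chosen by cross-validation
    print(penalty, "best C:", best_C, "coef:", search.coef_.round(2))
```

On the real data, `X_cv` would be replaced by the vectorized summaries, and the printed coefficients could be ranked to re-test which terms are most valuable under each penalty.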

In [None]:
## YOUR CODE HERE