# Web Scraping for Indeed.com & Predicting Salaries

In this project, we will practice two major skills: collecting data by scraping a website and then building a binary predictor with Logistic Regression.

We are going to collect salary information on data science jobs in a variety of markets. Then using the location, title and summary of the job we will attempt to predict the salary of the job. For job posting sites, this would be extraordinarily useful. While most listings DO NOT come with salary information (as you will see in this exercise), being to able extrapolate or predict the expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert this problem into classification and use Logistic Regression.

- Question: Why would we want this to be a classification problem?
- Answer: While more precision may be better, there is a fair amount of natural variance in job salaries - predicting a range be may be useful.

Therefore, the first part of the assignment will be focused on scraping Indeed.com. In the second, we'll focus on using listings with salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries.

First, look at the source of an Indeed.com page: (http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10")

Notice, each job listing is underneath a `div` tag with a class name of `result`. We can use BeautifulSoup to extract those. 

#### Setup a request (using `requests`) to the URL below. Use BeautifulSoup to parse the page and extract all results (HINT: Look for div tags with class name result)

The URL here has many query parameters

- `q` for the job search
- This is followed by "+20,000" to return results with salaries (or expected salaries >$20,000)
- `l` for a location 
- `start` for what result number to start on

In [300]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [301]:
import urllib2
import requests
import bs4
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen(URL).read())

In [302]:
## YOUR CODE HERE
divs = soup.find_all('div', class_=' row result')

for idx, i in enumerate(divs):
    print '+++++++++++', idx, '\n', i

+++++++++++ 0 
<div class=" row result" data-jk="03da19f3a2a1f3d7" data-tn-component="organicJob" id="p_03da19f3a2a1f3d7" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_03da19f3a2a1f3d7">
<a class="turnstileLink" data-tn-element="jobTitle" href="/rc/clk?jk=03da19f3a2a1f3d7&amp;fccid=f7029f63fe5c906e" itemprop="title" onclick="setRefineByCookie(['salest']); return rclk(this,jobmap[0],true,0);" onmousedown="return rclk(this,jobmap[0],0);" rel="nofollow" target="_blank" title="Data Scientist"><b>Data</b> <b>Scientist</b></a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Verizon" onmousedown="this.href = appendParamsOnce(this.href, 'from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=03da19f3a2a1f3d7&amp;jcid=f7029f63fe5c906e')" target="_blank">
        Verizon</a></span>
</span>

 - <a class="turnstileLink slNoUnderline " data-tn-element="reviewStars" data-tn

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
C onduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

While this has some more verbose elements removed, we can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip`.
- The title of a job is in a link with class set to `jobtitle` and a `data-tn-element="jobTitle`.  
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 

### Write 4 functions to extract each item: location, company, job, and salary.

example: 
```python
def extract_location_from_result(result):
    return result.find ...
```


- **Make sure these functions are robust and can handle cases where the data/field may not be available.**
    - Remember to check if a field is empty or `None` for attempting to call methods on it
    - Remember to use `try/except` if you anticipate errors
- **Test** the functions on the results above and simple examples

In [303]:
import re
## YOUR CODE HERE
def extract_loc(result):
    dicty = {}
    for idx, div in enumerate(result):
        for idx2, j in enumerate(div.find_all('span', class_='location')):
            dicty[idx2] = j.text
    return dicty

def extract_comp(result):
    dicty = {}
    for idx, div in enumerate(result):
        for idx2, j in enumerate(div.find_all('span', class_='company')):
            dicty[idx2] = j.text
    return dicty
        
def extract_job(result):
    dicty = {}
    for idx, div in enumerate(result):
        for idx2, h in enumerate(div.find_all('h2')):
            dicty[idx2] = h.text
    return dicty

def extract_sal(result):
    dicty = {}
    for idx, div in enumerate(result):
        for idx2, i in enumerate(div.find_all('td', class_='snip')):
            for idx3, j in enumerate(i.find_all('nobr')):
                dicty[idx2] = j.text
    return dicty
        
            

Now, to scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).

#### Complete the following code to collect results from multiple cities and starting points. 
- Enter your city below to add it to the search
- Remember to convert your salary to U.S. Dollars to match the other cities if the currency is different

In [304]:
YOUR_CITY = 'Seattle'

In [305]:
url_template = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l={}&start={}"
max_results_per_city = 500 # Set this to a high-value (5000) to generate more results. 
# Crawling more results, will also take much longer. First test your code on a small number of results and then expand.

results = []

for city in ['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami']:
    for start in range(0, max_results_per_city, 10):
        url_ = 'http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l='+city+'&start='+str(start)
        soupy = (BeautifulSoup(urllib2.urlopen(url_).read()))
        results.append(soupy.find_all('div', class_=' row result'))

#### Use the functions you wrote above to parse out the 4 fields - location, title, company and salary. Create a dataframe from the results with those 4 columns.

In [540]:
import pandas as pd



In [829]:
listy = []
for i in results:
    listy.append(extract_loc(i))
df = pd.DataFrame(listy)


In [830]:
listy = []
for i in results:
    listy.append(extract_job(i))
df['job'] = pd.DataFrame(listy)

In [831]:
listy = []
for i in results:
    listy.append(extract_sal(i))
df['sal'] = pd.DataFrame(listy)


In [832]:
listy = []
for i in results:
    listy.append(extract_comp(i))
df['comp'] = pd.DataFrame(listy)
df.head()

Unnamed: 0,0,job,sal,comp
0,"New York, NY",\nChief Data Scientist\n,,"\n\n Knotch, Inc.\n"
1,"New York, NY","\nJunior Data Scientist, Data & Statistics - D...",,\n\n\n Federal Reserve Bank of New York\n
2,"New York, NY 10029 (Yorkville area)",\nData Scientist - Scientific Computing\n,,\n\n\n Mount Sinai Health System\n
3,"New York, NY",\nData Scientist\n,$90 an hour,\n\n Countr\n
4,"New York, NY",\nLoyalty - Research Analyst (Entry-Level)\n,,\n\n\n Ipsos North America\n


In [833]:
df.columns = ['city', 'job', 'salary', 'company']

In [834]:
no_sal = df[df['salary'].isnull()]
no_sal.head()

Unnamed: 0,city,job,salary,company
0,"New York, NY",\nChief Data Scientist\n,,"\n\n Knotch, Inc.\n"
1,"New York, NY","\nJunior Data Scientist, Data & Statistics - D...",,\n\n\n Federal Reserve Bank of New York\n
2,"New York, NY 10029 (Yorkville area)",\nData Scientist - Scientific Computing\n,,\n\n\n Mount Sinai Health System\n
4,"New York, NY",\nLoyalty - Research Analyst (Entry-Level)\n,,\n\n\n Ipsos North America\n
5,"Jersey City, NJ",\nData Scientist\n,,\n\n SITO Mobile\n


Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly or weekly, these will not be useful to us for now
1. Some of the entries may be duplicated
1. The salaries are given as text and usually with ranges.

#### Find the entries with annual salary entries, by filtering the entries without salaries or salaries that are not yearly (filter those that refer to hour or week). Also, remove duplicate entries

In [835]:
## YOUR CODE HERE
df.dropna(how='any', axis=0, inplace=True)
df.drop_duplicates(inplace=True)
print df.shape
df.head()

(169, 4)


Unnamed: 0,city,job,salary,company
3,"New York, NY",\nData Scientist\n,$90 an hour,\n\n Countr\n
6,"New York, NY 10029 (Yorkville area)",\nASSOCIATE RESEARCHER I\n,"$120,000 a year",\n\n\n Mount Sinai Health System\n
7,"New York, NY","\nCross Asset Quantitative Strategist, Analyst...","$120,000 - $150,000 a year",\n\n\n Goldman Sachs\n
8,"New York, NY",\nData Scientist / Lead Quantitative Analyst\n,"$150,000 - $200,000 a year",\n\n\n Guidepoint Global\n
10,"New York, NY",\nQuantitative Researcher in Machine Learning\n,"$110,000 - $275,000 a year","\n\n Two Sigma Investments, LLC.\n"


In [836]:
listy = []
for i in df['job']:
    listy.append(str(i.encode('ascii', 'ignore')).split())
dicty = {}
for idx, i in enumerate(listy):
    for j in i:
        if '\\n' in j:
            continue
        elif idx not in dicty:
            dicty[idx] = str(j.encode('ascii', 'ignore'))
        else:
            dicty[idx] += ' ' + str(j)
listyy = []
for i in range(len(dicty)):
    listyy.append(dicty[i])
df['job'] = listyy
df.head()



Unnamed: 0,city,job,salary,company
3,"New York, NY",Data Scientist,$90 an hour,\n\n Countr\n
6,"New York, NY 10029 (Yorkville area)",ASSOCIATE RESEARCHER I,"$120,000 a year",\n\n\n Mount Sinai Health System\n
7,"New York, NY","Cross Asset Quantitative Strategist, Analyst/A...","$120,000 - $150,000 a year",\n\n\n Goldman Sachs\n
8,"New York, NY",Data Scientist / Lead Quantitative Analyst,"$150,000 - $200,000 a year",\n\n\n Guidepoint Global\n
10,"New York, NY",Quantitative Researcher in Machine Learning,"$110,000 - $275,000 a year","\n\n Two Sigma Investments, LLC.\n"


In [837]:
listy = []
for i in df['company']:
    listy.append(str(i).split())
dicty = {}
for idx, i in enumerate(listy):
    for j in i:
        if '\\n' in j:
            continue
        elif idx not in dicty:
            dicty[idx] = str(j)
        else:
            dicty[idx] += ' ' + str(j)
listyy = []
for i in range(len(dicty)):
    listyy.append(dicty[i])
df['company'] = listyy
df.head()

Unnamed: 0,city,job,salary,company
3,"New York, NY",Data Scientist,$90 an hour,Countr
6,"New York, NY 10029 (Yorkville area)",ASSOCIATE RESEARCHER I,"$120,000 a year",Mount Sinai Health System
7,"New York, NY","Cross Asset Quantitative Strategist, Analyst/A...","$120,000 - $150,000 a year",Goldman Sachs
8,"New York, NY",Data Scientist / Lead Quantitative Analyst,"$150,000 - $200,000 a year",Guidepoint Global
10,"New York, NY",Quantitative Researcher in Machine Learning,"$110,000 - $275,000 a year","Two Sigma Investments, LLC."


In [838]:
listy = []
for i in df['city']:
    listy.append(str(i).split(',')[0])
df['city'] = listy
df.head()

Unnamed: 0,city,job,salary,company
3,New York,Data Scientist,$90 an hour,Countr
6,New York,ASSOCIATE RESEARCHER I,"$120,000 a year",Mount Sinai Health System
7,New York,"Cross Asset Quantitative Strategist, Analyst/A...","$120,000 - $150,000 a year",Goldman Sachs
8,New York,Data Scientist / Lead Quantitative Analyst,"$150,000 - $200,000 a year",Guidepoint Global
10,New York,Quantitative Researcher in Machine Learning,"$110,000 - $275,000 a year","Two Sigma Investments, LLC."


#### Write a function that takes a salary string and converts it to a number, averaging a salary range if necessary

In [839]:
## YOUR CODE HERE
listy = []
for i in df['salary']:
    listy.append(str(i).replace('$', '').replace(',', '').replace('-', '').replace(' a year', '')
                .replace('a week', '').replace('a month', '').replace('an hour', ''))

listy = [i.split() for i in listy]
num_list = []
for idx, i in enumerate(listy):
    for j in i:
        num_list += [(idx, float(j))]
num_list
        
dicty = {}
for j, k in num_list:
    if j not in dicty:
        dicty[j] = k
    else:
        dicty[j] += k
        dicty[j] /= 2
dicty
for i in dicty:
    if dicty[i] < 100:
        dicty[i] *= 2080
    elif dicty[i] < 3000:
        dicty[i] *= 52
    elif dicty[i] < 10000:
        dicty[i] *= 12
sals = []
for i in range(len(dicty)):
    sals.append(dicty[i])


In [840]:
df['salary'] = sals
df.head()

Unnamed: 0,city,job,salary,company
3,New York,Data Scientist,187200.0,Countr
6,New York,ASSOCIATE RESEARCHER I,120000.0,Mount Sinai Health System
7,New York,"Cross Asset Quantitative Strategist, Analyst/A...",135000.0,Goldman Sachs
8,New York,Data Scientist / Lead Quantitative Analyst,175000.0,Guidepoint Global
10,New York,Quantitative Researcher in Machine Learning,192500.0,"Two Sigma Investments, LLC."


### Save your results as a CSV

In [841]:
## YOUR CODE HERE
df.to_csv('Salary_Data2')

## Predicting salaries using Logistic Regression

#### Load in the the data of scraped salaries

In [842]:
## YOUR CODE HERE
## But they're already here :(


#### We want to predict a binary variable - whether the salary was low or high. Compute the median salary and create a new binary variable that is true when the salary is high (above the median)

We could also perform Linear Regression (or any regression) to predict the salary value here. Instead, we are going to convert this into a _binary_ classification problem, by predicting two classes, HIGH vs LOW salary.

While performing regression may be better, performing classification may help remove some of the noise of the extreme salaries. We don't have to choice the `median` as the splitting point - we could also split on the 75th percentile or any other reasonable breaking point.

In fact, the ideal scenario may be to predict many levels of salaries, 

In [843]:
df.head()

Unnamed: 0,city,job,salary,company
3,New York,Data Scientist,187200.0,Countr
6,New York,ASSOCIATE RESEARCHER I,120000.0,Mount Sinai Health System
7,New York,"Cross Asset Quantitative Strategist, Analyst/A...",135000.0,Goldman Sachs
8,New York,Data Scientist / Lead Quantitative Analyst,175000.0,Guidepoint Global
10,New York,Quantitative Researcher in Machine Learning,192500.0,"Two Sigma Investments, LLC."


In [844]:
## YOUR CODE HERE
import numpy as np
frst_q = df['salary'].quantile(.25)
med = np.median(df['salary'])
thrd_q = df['salary'].quantile(.75)
highness = []
for i in df['salary']:
    if i < frst_q:
        highness.append('very_low')
    elif i < med:
        highness.append('low')
    elif i < thrd_q:
        highness.append('high')
    else:
        highness.append('very_high')
df['highness'] = highness
df.head()

listy = []
for i in df['salary']:
    if i < med:
        listy.append(1)
    else:
        listy.append(0)
df['is_high'] = listy

#### Thought experiment: What is the baseline accuracy for this model?

In [851]:
## YOUR CODE HERE
# The baseline for this model would be the most frequent observation for all outcomes:
print df['highness'].value_counts(), '\n'
print 'The baseline accuracy is', 44.0 / 169

low          44
very_high    43
high         42
very_low     40
Name: highness, dtype: int64 

The baseline accuracy is 0.260355029586


#### Create a Logistic Regression model to predict High/Low salary using statsmodel. Start by ONLY using the location as a feature. Display the coefficients and write a short summary of what they mean.

In [852]:
import statsmodels.formula.api as sm
model = sm.logit(
    "is_high ~ city",
    data = df
).fit()

model.summary()


         Current function value: 1.444789
         Iterations: 35




0,1,2,3
Dep. Variable:,is_high,No. Observations:,169.0
Model:,Logit,Df Residuals:,110.0
Method:,MLE,Df Model:,58.0
Date:,"Sun, 30 Oct 2016",Pseudo R-squ.:,-1.084
Time:,16:38:40,Log-Likelihood:,-244.17
converged:,False,LL-Null:,-117.14
,,LLR p-value:,1.0

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-4.7035,10.599,-0.444,0.657,-25.477 16.070
city[T.Alpharetta],5.7746,10.722,0.539,0.590,-15.241 26.790
city[T.Atlanta],4.2980,10.638,0.404,0.686,-16.553 25.149
city[T.Aurora],4.7035,10.693,0.440,0.660,-16.254 25.661
city[T.Austin],4.8858,10.616,0.460,0.645,-15.922 25.693
city[T.Aventura],2.3771,11.166,0.213,0.831,-19.508 24.262
city[T.Bellevue],-41.4587,1.06e+10,-3.92e-09,1.000,-2.07e+10 2.07e+10
city[T.Berkeley],-41.4588,1.06e+10,-3.92e-09,1.000,-2.07e+10 2.07e+10
city[T.Boulder],4.7035,10.693,0.440,0.660,-16.254 25.661


The coefficients are essentially the weight that the model is giving to each variable.  Technically, the coefficients are the log of the odds that, when combined with the intercept, can give the formula for the probability of the salary being high or not for the particular city.  The odds ratio derived from the coefficient can also tell you the multiplicative increase or decrease in odds of the salary being high or low in each city.

#### Create a few new variables in your dataframe to represent interesting features of a job title.
- For example, create a feature that represents whether 'Senior' is in the title 
- or whether 'Manager' is in the title. 
- Then build a new Logistic Regression model with these features. Do they add any value? 


In [853]:
## YOUR CODE HERE
def new_var(word):    
    sen = []
    for i in df['job']:
        if word in i.lower():
            sen.append(1)
        else:
            sen.append(0)
    df[word] = sen

words = ['manager', 'senior', 'research', 'machine',
         'python', 'developer', 'analyst', 'scientist']
for i in words:
    new_var(i)


In [854]:
model = sm.logit(
    "is_high ~ city + senior + manager + research + machine + python + developer + analyst + scientist",
    data = df
).fit()

model.summary()


         Current function value: 0.406682
         Iterations: 35




0,1,2,3
Dep. Variable:,is_high,No. Observations:,169.0
Model:,Logit,Df Residuals:,103.0
Method:,MLE,Df Model:,65.0
Date:,"Sun, 30 Oct 2016",Pseudo R-squ.:,0.4133
Time:,16:38:48,Log-Likelihood:,-68.729
converged:,False,LL-Null:,-117.14
,,LLR p-value:,0.006397

0,1,2,3,4,5
,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,18.6341,1.11e+04,0.002,0.999,-2.18e+04 2.18e+04
city[T.Alpharetta],3.9680,3.84e+04,0.000,1.000,-7.53e+04 7.53e+04
city[T.Atlanta],-18.9615,1.11e+04,-0.002,0.999,-2.18e+04 2.18e+04
city[T.Aurora],-18.7223,1.11e+04,-0.002,0.999,-2.18e+04 2.18e+04
city[T.Austin],-17.7919,1.11e+04,-0.002,0.999,-2.18e+04 2.18e+04
city[T.Aventura],4.0278,4.85e+04,8.3e-05,1.000,-9.51e+04 9.51e+04
city[T.Bellevue],-39.3276,3.31e+04,-0.001,0.999,-6.49e+04 6.48e+04
city[T.Berkeley],-39.3276,3.31e+04,-0.001,0.999,-6.49e+04 6.48e+04
city[T.Boulder],-18.6341,1.11e+04,-0.002,0.999,-2.18e+04 2.18e+04


The new variables did appear to help to a certain extent, which can be seen in the higher pseudo R^2 score.  Also, the lower p-values for some of the new variables, together with their coefficients, indicate that they may be better predictor for salary.  This is particularly true for senior, developer, and scientist.

#### Rebuild this model with scikit-learn.
- You can either create the dummy features manually or use the `dmatrix` function from `patsy`
- Remember to scale the feature variables as well!


In [855]:
## YOUR CODE HERE
import patsy
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.grid_search import GridSearchCV


X = patsy.dmatrix('~ C(city) + C(manager) + C(senior) + C(research) + C(machine) + C(python) + C(developer) + C(analyst) + C(scientist)', df)
y = df['highness']

In [856]:
logreg = LogisticRegression()
C_vals = [0.0001, 0.001, 0.01, 0.1, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]
penalties = ['l1','l2']

gs = GridSearchCV(logreg, {'penalty':penalties, 'C':C_vals}, verbose=True, cv=5, scoring='f1_macro')
gs.fit(X, y)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


[Parallel(n_jobs=1)]: Done 120 out of 120 | elapsed:    1.0s finished


GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.0001, 0.001, 0.01, 0.1, 0.5, 0.75, 1.0, 2.5, 5.0, 10.0, 100.0, 1000.0]},
       pre_dispatch='2*n_jobs', refit=True, scoring='f1_macro',
       verbose=True)

In [857]:
gs.best_params_

{'C': 1.0, 'penalty': 'l2'}

In [858]:
logreg = LogisticRegression(C=1.0, penalty='l2', solver='liblinear')
logreg.fit(X, y)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

#### Use cross-validation in scikit-learn to evaluate the model above. 
- Evaluate the accuracy, AUC, precision and recall of the model. 
- Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.

In [859]:
## YOUR CODE HERE
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y)

array([ 0.24137931,  0.21428571,  0.2       ])

In [860]:
y_pred = logreg.predict(X)
print(classification_report(y, y_pred))
print pd.DataFrame(confusion_matrix(y, y_pred), index=['is_very_low','is_low', 'is_high', 'is_very_high'],
                   columns=['pred_very_low','pred_low', 'pred_high', 'pred_very_high'])


             precision    recall  f1-score   support

       high       0.69      0.64      0.67        42
        low       0.65      0.59      0.62        44
  very_high       0.74      0.72      0.73        43
   very_low       0.58      0.70      0.64        40

avg / total       0.67      0.66      0.66       169

              pred_very_low  pred_low  pred_high  pred_very_high
is_very_low              27         5          3               7
is_low                    4        26          6               8
is_high                   2         5         31               5
is_very_high              6         4          2              28


In [861]:
print 'coefficients:', logreg.coef_

coefficients: [[ -5.25684482e-01  -3.83787316e-01   4.27261873e-01   3.91867510e-01
    1.81174527e-01  -1.00797502e-01  -2.19157848e-01   6.08848742e-01
    3.40972552e-01  -1.37485589e-01  -1.00614814e-01   6.09558934e-01
    1.18529770e+00  -2.19157848e-01  -1.72421771e-01  -3.36010181e-01
   -2.19157848e-01   8.89927730e-01  -1.43522358e-01  -2.27436709e-01
   -3.84404718e-01  -2.18622322e-01   6.93712707e-01   1.40369745e-01
   -2.18622322e-01   3.81295096e-01  -2.19157848e-01   4.07465017e-01
   -1.58446145e-01   9.66498604e-01   6.09558934e-01  -9.33845107e-02
   -3.84404718e-01  -1.21571146e-01  -3.43278602e-01  -2.18622322e-01
   -2.19157848e-01   6.08848742e-01  -3.36010181e-01  -2.19157848e-01
    2.81390265e-01  -5.27244857e-01  -5.16874398e-01  -2.19157848e-01
   -3.36010181e-01  -2.18622322e-01  -3.83787316e-01  -2.18622322e-01
   -1.58446145e-01  -2.65820380e-01  -3.18992199e-01  -2.19157848e-01
   -2.18622322e-01  -1.71964711e-01  -3.36010181e-01  -2.19157848e-01
    6.

# Evaluate the accuracy, AUC, precision and recall of the model.
# Discuss the differences and explain when you want a high-recall or a high-precision model in this scenario.



Firstly, I could not figure out how to create an AUC-ROC curve and get the score of a multiclass regression.  Everything I found in the documentation stated that it is meant for binarized response variables and got overly complicated when it came to multiclass response variables.  Given more time, I would like to figure this out.  I'm turning this in as is, but would like to revisit it when my wife is not visiting and I can have more time to play with it.

Overall, I would say that the accuracy, precision, and recall of my model are pretty good.  Recall is the number of positive predictions divided by the number of true positive values in the class.  A high recall rate in this scenario would be useful in a situation where you you are not worried about making an incorrect prediction but want to make sure that you catch all positives.  This could matter when you are mostly worried about getting everyone a high salary that deserves it at the expense of giving people who don't deserve it high salaries as well.  Precision is the number of predicted positives divided by the number of true and false positives.  A high precision model in the this scenario would be useful for when you want to keep the number of false positives at a minimum.  This could mean that you wouldn't make a bad prediction and give someone too high of a salary.

#### Compare L1 and L2 regularization for this logistic regression model. What effect does this have on the coefficients learned?

In [865]:
## YOUR CODE HERE
logreg = LogisticRegression(C=1.0, penalty='l1', solver='liblinear')
logreg.fit(X, y)
y_pred = logreg.predict(X)
print 'coefficients:', logreg.coef_
print(classification_report(y, y_pred))
print pd.DataFrame(confusion_matrix(y, y_pred), index=['is_very_low','is_low', 'is_high', 'is_very_high'],
                   columns=['pred_very_low','pred_low', 'pred_high', 'pred_very_high'])


coefficients: [[-0.98937529  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          1.20684745
   0.          0.          0.          0.          0.73615163  0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.93244866  0.          0.          0.          0.
  -0.12419087  0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.          0.          0.          0.
   0.          0.          0.          0.         -0.42475248  0.
  -0.21292336 -1.1077494   0.          0.         -0.39754342  0.        ]
 [-0.79929861  1.04096929  0.          0.          0.44028763  0.          0.
   0.          0.          0.          0.          0.         -0.6666716
   0.          0.          0.          0.          0.          0.         

In [866]:
## YOUR CODE HERE
logreg = LogisticRegression(C=1.0, penalty='l2', solver='liblinear')
logreg.fit(X, y)
y_pred = logreg.predict(X)
print 'coeficients:', logreg.coef_
print(classification_report(y, y_pred))
print pd.DataFrame(confusion_matrix(y, y_pred), index=['is_very_low','is_low', 'is_high', 'is_very_high'],
                   columns=['pred_very_low','pred_low', 'pred_high', 'pred_very_high'])


coeficients: [[ -5.25684482e-01  -3.83787316e-01   4.27261873e-01   3.91867510e-01
    1.81174527e-01  -1.00797502e-01  -2.19157848e-01   6.08848742e-01
    3.40972552e-01  -1.37485589e-01  -1.00614814e-01   6.09558934e-01
    1.18529770e+00  -2.19157848e-01  -1.72421771e-01  -3.36010181e-01
   -2.19157848e-01   8.89927730e-01  -1.43522358e-01  -2.27436709e-01
   -3.84404718e-01  -2.18622322e-01   6.93712707e-01   1.40369745e-01
   -2.18622322e-01   3.81295096e-01  -2.19157848e-01   4.07465017e-01
   -1.58446145e-01   9.66498604e-01   6.09558934e-01  -9.33845107e-02
   -3.84404718e-01  -1.21571146e-01  -3.43278602e-01  -2.18622322e-01
   -2.19157848e-01   6.08848742e-01  -3.36010181e-01  -2.19157848e-01
    2.81390265e-01  -5.27244857e-01  -5.16874398e-01  -2.19157848e-01
   -3.36010181e-01  -2.18622322e-01  -3.83787316e-01  -2.18622322e-01
   -1.58446145e-01  -2.65820380e-01  -3.18992199e-01  -2.19157848e-01
   -2.18622322e-01  -1.71964711e-01  -3.36010181e-01  -2.19157848e-01
    6.0

L2 regularization outperformed L1 regularization.  This can be seen in the precision, recall, and f1-score results for each model.  L1 regularization brought many of the coefficients down to 0, effectively saying that those variables have no value as a predictor variable.  L2 regularization just brought the coefficients down to very small numbers, reducing their effect on the model, but not taking them out completely.

#### Continue to incorporate other text features from the title or summary that you believe will predict the salary and examine their coefficients

#### Take ~100 scraped entries with salaries. Convert them to use with your model and predict the salary - which entries have the highest predicted salaries?

In [867]:
df['predicted'] = y_pred
very_high_jobs = df.ix[df['predicted'] == 'very_high'][['job','salary']]
print 'Top 10 highest paying jobs', '\n', very_high_jobs.sort_values('salary', ascending=False)[:10]
df.to_csv('Sals_w_Preds')

Top 10 highest paying jobs 
                                                   job    salary
325  Scientist or Sr. Scientist Formulation Filling...  250000.0
44   Data Architect for Big Data Systems (VP) - Int...  212500.0
16   Senior Systematic Quantitative Analyst - Multi...  200000.0
10         Quantitative Researcher in Machine Learning  192500.0
3                                       Data Scientist  187200.0
61                        Senior Research Data Analyst  180000.0
73                 Senior Credit REIT Research Analyst  180000.0
8           Data Scientist / Lead Quantitative Analyst  175000.0
100                                     Data Scientist  160000.0
114                                     Data Scientist  160000.0


This is my report!  
https://docs.google.com/document/d/1FmT08kqMBDGWDAc_lqF29ZIODKy4_1EAPEOZ8i8lf80/pub


This is my presentation!  
https://docs.google.com/presentation/d/1aZyuqP-XDjQhHDPF4h4if2Qxbo1b2Jw3stRulctEFgo/pub?start=false&loop=false&delayms=60000

### BONUS 

#### Bonus: Use Count Vectorizer from scikit-learn to create features from the text summaries. 
- Examine using count or binary features in the model
- Re-evaluate the logistic regression model using these. Does this improve the model performance? 
- What text features are the most valuable? 

In [23]:
## YOUR CODE HERE

In [24]:
## YOUR CODE HERE

#### Re-test L1 and L2 regularization. You can use LogisticRegressionCV to find the optimal reguarlization parameters. 
- Re-test what text features are most valuable.  
- How do L1 and L2 change the coefficients?

In [25]:
## YOUR CODE HERE