# Web Scraping for Indeed.com & Predicting Salaries

In this project, I scrape job listings and their salaries from Indeed.com and use features of those listings to predict salaries for other listings.

I collect salary information on data science jobs in a variety of markets. Then, using the location, title, and summary of each job, I attempt to predict its salary. For job posting sites, this would be extraordinarily useful: most listings do not come with salary information, so being able to extrapolate or predict expected salaries from other listings can help guide negotiations.

Normally, we could use regression for this task; however, we will convert the problem into classification and use logistic regression. While more precision may be better, there is a fair amount of natural variance in job salaries, so predicting a range may be useful enough.

The first part of the project therefore focuses on scraping Indeed.com. The second part uses the listings that include salary information to build a model and predict additional salaries.

### Scraping job listings from Indeed.com

We will be scraping job listings from Indeed.com using BeautifulSoup. Luckily, Indeed.com is a simple text page where we can easily find relevant entries. 

The search URL takes several query parameters:

- `q` for the job search terms
- In the URL used here, the search terms end with `+%2420%2C000` (URL-encoded `$20,000`) to return results with salaries (or expected salaries) above $20,000
- `l` for a location 
- `start` for what result number to start on
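
As a minimal sketch of how these parameters combine, the snippet below builds the search URL with the standard library instead of hand-encoding it. It is written for Python 3 (in Python 2, `urlencode` lives in `urllib`), and the helper name `build_search_url` is hypothetical, not part of the notebook.

```python
from urllib.parse import urlencode

# Same endpoint as the URL used in this notebook
BASE_URL = "http://www.indeed.com/jobs"


def build_search_url(query, location, start=0):
    """Return a search URL with URL-encoded q, l and start parameters.

    build_search_url is a hypothetical helper for illustration only.
    """
    params = {"q": query, "l": location, "start": start}
    # urlencode turns spaces into '+', '$' into %24 and ',' into %2C
    return BASE_URL + "?" + urlencode(params)


url = build_search_url("data scientist $20,000", "New York", start=10)
print(url)
```

Building the query string this way avoids mixing encoded and unencoded forms of `$20,000` in different cells.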

## Problem statement

The aim of this project is to determine what affects salaries for data science jobs in the United States. The salary data can be combined with data on location and job titles. It would be interesting to see how location and certain words in a job title (such as "Senior" or "Manager") affect the probability that a listing pays above the median salary for data scientists.

In [1]:
import numpy as np
import pandas as pd
import urllib2

In [2]:
URL = "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

In [3]:
import requests
import bs4
from bs4 import BeautifulSoup

In [4]:
response = requests.get(URL)

In [6]:
page = response.text

In [7]:
soup = BeautifulSoup(page, "lxml")

Let's look at one result more closely. A single `result` looks like

```
<div class=" row result" data-jk="2480d203f7e97210" data-tn-component="organicJob" id="p_2480d203f7e97210" itemscope="" itemtype="http://schema.org/JobPosting">
<h2 class="jobtitle" id="jl_2480d203f7e97210">
<a class="turnstileLink" data-tn-element="jobTitle" onmousedown="return rclk(this,jobmap[0],1);" rel="nofollow" target="_blank" title="AVP/Quantitative Analyst">AVP/Quantitative Analyst</a>
</h2>
<span class="company" itemprop="hiringOrganization" itemtype="http://schema.org/Organization">
<span itemprop="name">
<a href="/cmp/Alliancebernstein?from=SERP&amp;campaignid=serp-linkcompanyname&amp;fromjk=2480d203f7e97210&amp;jcid=b374f2a780e04789" target="_blank">
    AllianceBernstein</a></span>
</span>
<tr>
<td class="snip">
<nobr>$117,500 - $127,500 a year</nobr>
<div>
<span class="summary" itemprop="description">
Conduct quantitative and statistical research as well as portfolio management for various investment portfolios. Collaborate with Quantitative Analysts and</span>
</div>
</div>
</td>
</tr>
</table>
</div>
```

We can see that there is some structure to the above:
- The salary is available in a `nobr` element inside of a `td` element with `class='snip'`.
- The title of a job is in a link with `data-tn-element="jobTitle"`, inside an `h2` with class `jobtitle`.
- The location is set in a `span` with `class='location'`. 
- The company is set in a `span` with `class='company'`. 

In [9]:
body = soup.find('body')
print type(body)

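# Result containers are divs whose class list includes "row" (and "result").
# The third positional argument of find_all is `recursive`, not a second class,
# so the class filter is passed explicitly instead.
results = body.find_all('div', class_='row')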

<class 'bs4.element.Tag'>


In [10]:
job =[]
company=[]
location=[]
salary=[]

for row in results:
    try:
        job.append(row.a.getText())
    except:
        job.append(np.nan)
    try:
        company.append(row.span.getText().strip())
    except:
        company.append(np.nan)
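    # find() returns None when the span is missing, so a placeholder is
    # appended and all four lists keep the same length
    loc = row.find('span', class_='location')
    location.append(loc.getText().strip() if loc else np.nan)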
    try:
        salary.append(row.nobr.getText())
    except:
        salary.append(np.nan)

In [11]:
company

[u'Galvanize',
 u'Cumulus Media',
 u'United Nations',
 u'Yodle',
 u'7Park Data',
 u'Morgan Stanley',
 u'JPMorgan Chase',
 u'EXPERIAN',
 u'MassMutual Financial Group',
 u'Citi']

The code seems to be working. To scale up our scraping, we need to accumulate more results. We can do this by examining the URL above.

- "http://www.indeed.com/jobs?q=data+scientist+%2420%2C000&l=New+York&start=10"

There are two query parameters here we can alter to collect more results, the `l=New+York` and the `start=10`. The first controls the location of the results (so we can try a different city). The second controls where in the results to start and gives 10 results (thus, we can keep incrementing by 10 to go further in the list).
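
The pagination described above can be sketched as follows. This is an illustration under the assumption that each page holds 10 results and that offsets up to 990 are wanted (matching `range(0, 1000, 10)` in the scraping cell); `start_offsets` and `page_urls` are hypothetical helper names.

```python
# URL template taken from the search URL above, with city and start left open
TEMPLATE = ("http://www.indeed.com/jobs"
            "?q=data+scientist+%2420%2C000&l={city}&start={start}")


def start_offsets(max_results=1000, page_size=10):
    """Offsets 0, 10, 20, ... below max_results (one per results page)."""
    return list(range(0, max_results, page_size))


def page_urls(city, max_results=1000, page_size=10):
    """One URL per results page for a city given in '+'-separated form."""
    return [TEMPLATE.format(city=city, start=s)
            for s in start_offsets(max_results, page_size)]


urls = page_urls("Chicago", max_results=30)
for u in urls:
    print(u)
```

Generating the full list of URLs up front keeps the request loop simple: each URL is fetched once, and failures can be recorded per URL rather than by mutating the range being iterated.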

In [4]:
df_dic = {}
for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Boston']):
    print city

Houston
Boston
Phoenix
Chicago
San+Francisco
New+York
Dallas
Philadelphia
Denver
Los+Angeles
Pittsburgh
Miami
Atlanta
Seattle
Austin
Portland


In [None]:
### The following code was used to scrape the data, but is not being run in this notebook.

df_dic = {}
for city in set(['New+York', 'Chicago', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Boston']):
    print city
    city_dic = {}
    scrape_range = range(0,1000,10)
    while scrape_range != []:
        for page in list(scrape_range): # iterate over a copy, because pages are removed from scrape_range inside the loop; start offsets run 0, 10, ..., 990
            try:
                print scrape_range
                job=[]
                company=[]
                salary=[]
                location=[]
                url = "http://www.indeed.com/jobs?q=data+scientist+$20,000&l=%s&start=%s" % (city,page)
                pval = urllib2.urlopen(url).read()
                soup = BeautifulSoup(pval, 'lxml')
                resultcol = soup.findAll('div', "row")
                for row in resultcol:
                    try:
                        job.append(row.a.getText())
                    except:
                        job.append('0')
                    try:
                        company.append(row.span.getText().strip())
                    except:
                        company.append('0')
                    # find() returns None when the span is missing, so the
                    # placeholder keeps all four lists the same length
                    loc = row.find('span', class_='location')
                    location.append(loc.getText().strip() if loc else '0')
                    try:
                        salary.append(row.nobr.getText())
                    except:
                        salary.append(np.nan)
                if page == 0:
                    ref_job = job
                    ref_company = company
                    ref_location = location
                    df = pd.DataFrame({'Job':job,'Company':company,'Location':location,'Salary':salary})
                    ref_df = df
                    city_dic[city+'_'+str(page)] = df
                    scrape_range.remove(page)
                elif (job!=ref_job)&(company!=ref_company)&(location!=ref_location):
                    df = pd.DataFrame({'Job':job,'Company':company,'Location':location,'Salary':salary})
                    city_dic[city+'_'+str(page)] = df
                    scrape_range.remove(page)
            except:
                scrape_range.remove(page)
    data = pd.concat(city_dic.values())
    data = data.reset_index(drop=True)
    df_dic[city] = data
    print 'Finished scraping %s' % (city)

df_dic

In [120]:
df_dic['Chicago'].head()

Unnamed: 0,Company,Job,Location,Salary
0,Trunk Club,Director of Data Science,"Chicago, IL 60654 (Loop area)",
1,Ipsos North America,USPA - Statistical Programmer,"Chicago, IL",
2,University of Chicago,Research Data Analyst,"Chicago, IL",
3,Ipsos North America,Connect - Senior Research Analyst,"Chicago, IL",
4,Amyx,SEC Financial Computer Scientist (Python),"Chicago, IL",


In [13]:
final_df = df_dic['Chicago']

In [14]:
# Chicago is already in final_df; the remaining scraped cities are appended below
cities = (['New+York', 'San+Francisco', 'Austin', 'Seattle', 
    'Los+Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Boston'])

We merge all the dataframes from different cities so that we get our final dataframe.

In [15]:
for i in cities:
    final_df= final_df.append(df_dic[i])

In [16]:
final_df.describe()

Unnamed: 0,Company,Job,Location,Salary
count,12334,12334,12334,670
unique,2354,5195,729,306
top,"Magic Leap, Inc.",Data Scientist,"Houston, TX","$50,000 a year"
freq,577,369,698,94


In [17]:
df_sal = final_df.dropna()

In [18]:
df_sal.head()

Unnamed: 0,Company,Job,Location,Salary
29,Modis,Statistical Data Analyst- Long-term Contract i...,"Northbrook, IL 60062",$25 - $35 an hour
30,S.C. International,Statistical Analyst -- Health-7444 - 7444,"Chicago, IL","$80,000 a year"
88,Workbridge Associates,"Senior Data Scientist (H20, Python, and R)","Chicago, IL","$110,000 - $155,000 a year"
95,Workbridge Associates,Data Scientist (Healthcare Data),"Chicago, IL","$110,000 a year"
103,Jobspring Partners,Quality Engineer (Big Data),"Chicago, IL","$70,000 - $105,000 a year"


Lastly, we need to clean up salary data. 

1. Only a small number of the scraped results have salary information - only these will be used for modeling.
1. Some of the salaries are not yearly but hourly, daily, weekly, or monthly; these will not be useful to us for now.
1. Some of the entries may be duplicated
1. The salaries are given as text and usually with ranges.
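
The cleaning steps listed above can be sketched in one plain-Python function. This is an illustrative alternative to the pandas cells that follow, not the notebook's own method; `parse_yearly_salary` is a hypothetical name, and it assumes salary strings look like the examples shown earlier (`"$80,000 a year"`, `"$110,000 - $155,000 a year"`, `"$25 - $35 an hour"`).

```python
import re

# Pay periods that are excluded from the yearly-salary data
NON_YEARLY = ("hour", "day", "week", "month")


def parse_yearly_salary(text):
    """Return the yearly salary as a float, or None if it cannot be used.

    A range such as "$110,000 - $155,000 a year" yields its midpoint.
    Missing values and non-yearly salaries yield None.
    """
    if not isinstance(text, str):
        return None  # e.g. NaN for listings without salary information
    if any(word in text for word in NON_YEARLY):
        return None
    # Every dollar amount in the string, with thousands separators removed
    amounts = [int(m.replace(",", ""))
               for m in re.findall(r"\$(\d[\d,]*)", text)]
    if not amounts:
        return None
    return sum(amounts) / len(amounts)


print(parse_yearly_salary("$110,000 - $155,000 a year"))
print(parse_yearly_salary("$25 - $35 an hour"))
```

Parsing with a regular expression avoids depending on column positions and on character-set stripping such as `rstrip('a year')`.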

In [19]:
# Keep yearly salaries only: drop rows quoted per hour, day, week, or month
df_sal = df_sal[~df_sal.Salary.str.contains('hour|day|week|month')]
df_sal = df_sal.drop_duplicates()

In [20]:
df_sal['Location']= df_sal['Location'].astype(str)

In [23]:
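# rstrip('a year') strips a set of characters rather than a suffix, so remove the suffix explicitly
df_sal['Salary'] = df_sal['Salary'].map(lambda x: x.replace(' a year', '').lstrip('$'))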

In [24]:
df_sal[['Salary', 'Salary_2']]= df_sal['Salary'].str.split('$', expand=True)
df_sal = df_sal.fillna(value='0')

In [25]:
df_sal['Salary'] = df_sal['Salary'].map(lambda x: x.rstrip(' -'))

In [28]:
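# Both salary columns must be integers before they are averaged below
df_sal['Salary'] = df_sal['Salary'].map(lambda x: int(x.replace(',', '')))
df_sal['Salary_2'] = df_sal['Salary_2'].map(lambda x: int(x.replace(',', '')))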

In [29]:
df_sal['Sal_avg']=0
for i in range(len(df_sal['Salary'])):
    if df_sal.iloc[i, 4]==0:
        df_sal.iloc[i, 5]= df_sal.iloc[i,3]
    else:
        df_sal.iloc[i, 5]= (df_sal.iloc[i,3]+df_sal.iloc[i,4])/2

The code above has cleaned up salary, so that we have a new column which displays the yearly salary per title. If a range is given, this column takes the average of the minimum and the maximum value of that range. This dataset is saved to a csv for later use.

In [190]:
df_sal.to_csv('Salaries.csv', sep=';', encoding='utf-8')

## Predicting salaries using Logistic Regression

In this part of the project, the scraped csv is loaded in so that I can do some initial predictions using logistic regression.

In [5]:
import pandas as pd
import numpy as np

In [7]:
df = pd.read_csv('Dataset/Salaries.csv', sep=';')

In [8]:
df = df.drop(['Salary'], axis=1)
df = df.drop(['Salary_2'], axis=1)

In [9]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company,Job,Location,Sal_avg
0,30,S.C. International,Statistical Analyst -- Health-7444 - 7444,Chicago,80000
1,88,Workbridge Associates,"Senior Data Scientist (H20, Python, and R)",Chicago,132500
2,95,Workbridge Associates,Data Scientist (Healthcare Data),Chicago,110000
3,103,Jobspring Partners,Quality Engineer (Big Data),"Chicago, IL",87500
4,114,Analytic Recruiting,Manager / Director of Predictive Modeling,Chicago,115000


In [10]:
df['Location'].value_counts().head(20)

New York                             62
Chicago                              20
San Francisco, CA                    19
Boston, MA                           18
Atlanta, GA                          17
Los Angeles, CA                      17
New York, NY                         13
Coral Gables, FL                     11
Manhattan, NY                        10
Cambridge, MA                        10
Philadelphia, PA                      9
Seattle, WA                           7
Chicago, IL                           7
Phoenix, AZ                           7
Austin, TX                            7
Houston, TX                           7
Denver, CO                            5
Pittsburgh, PA                        5
Boston, MA 02116 (South End area)     5
Phoenix, AZ 85012 (Alhambra area)     4
Name: Location, dtype: int64

Since some of the less frequent locations in these search results are not very clean, I loop over the dataframe and replace every location that contains (part of) one of the scraped city names with that short name. Rows that match none of the names are then dropped.

In [11]:
city_no_spaces = (['York', 'Francisco', 'Austin', 'Seattle', 
    'Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh', 
    'Portland', 'Chicago', 'Phoenix', 'Denver', 'Houston', 'Miami', 'Boston'])

In [12]:
for city in city_no_spaces:
    for i in df.index:
        if city in df.loc[i, 'Location']:
            df.loc[i, 'Location']= city

In [13]:
df = df.loc[df['Location'].isin(city_no_spaces)]
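
The same matching rule can be written as a small standalone function, which makes its behaviour easy to check on individual strings. This is a plain-Python sketch, not the notebook's code; `normalize_location` is a hypothetical name, and the key list is copied from `city_no_spaces` above.

```python
CITY_KEYS = ['York', 'Francisco', 'Austin', 'Seattle',
             'Angeles', 'Philadelphia', 'Atlanta', 'Dallas', 'Pittsburgh',
             'Portland', 'Chicago', 'Phoenix', 'Denver', 'Houston',
             'Miami', 'Boston']


def normalize_location(raw, keys=CITY_KEYS):
    """Return the first key contained in the raw location, else None."""
    for key in keys:
        if key in raw:
            return key
    return None


raw_locations = ["New York, NY",
                 "Boston, MA 02116 (South End area)",
                 "Coral Gables, FL",
                 "Chicago, IL"]
# Locations that match no key are dropped, as in the isin() filter above
cleaned = [c for c in map(normalize_location, raw_locations) if c is not None]
print(cleaned)
```

Note that suburbs and boroughs such as "Coral Gables, FL" or "Manhattan, NY" contain none of the keys, so those listings drop out of the modeling data.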


In [14]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company,Job,Location,Sal_avg
0,30,S.C. International,Statistical Analyst -- Health-7444 - 7444,Chicago,80000
1,88,Workbridge Associates,"Senior Data Scientist (H20, Python, and R)",Chicago,132500
2,95,Workbridge Associates,Data Scientist (Healthcare Data),Chicago,110000
3,103,Jobspring Partners,Quality Engineer (Big Data),Chicago,87500
4,114,Analytic Recruiting,Manager / Director of Predictive Modeling,Chicago,115000


I want to predict a binary variable: whether the salary is high or low. We could also use linear regression (or any other regression) to predict the salary value itself, but here the task is converted into a binary classification problem with two classes, HIGH and LOW salary.

While regression may give more precise predictions, classification may help remove some of the noise from extreme salaries. We don't have to choose the median as the splitting point; we could also split on the 75th percentile or any other reasonable break point.

In fact, the ideal scenario may be to predict several salary levels.
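
A minimal sketch of the split, in plain Python: a salary is labelled high only if it is strictly above the threshold, so ties fall into the low class. The function name `label_high_salary` is hypothetical. The five salaries are the ones shown in `df.head()` above, so the median of this small sample (110,000) differs from the median of the full dataset.

```python
from statistics import median


def label_high_salary(salaries, threshold=None):
    """Return 1 for salaries strictly above the threshold, else 0.

    If no threshold is given, the median of the supplied salaries is used.
    """
    if threshold is None:
        threshold = median(salaries)
    return [1 if s > threshold else 0 for s in salaries]


sample = [80000, 132500, 110000, 87500, 115000]  # Sal_avg values from df.head()

# Split on the sample's own median (110000): the tie at 110000 is labelled low
print(label_high_salary(sample))

# Split on an explicit threshold, e.g. a median computed on a larger dataset
print(label_high_salary(sample, threshold=105000))
```

Passing another cut-off (for example a 75th percentile computed elsewhere) as `threshold` gives the alternative splits mentioned above.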

In [15]:
med_salary= np.median(df['Sal_avg'])
print med_salary

105000.0


In [16]:
# 1.0 if the salary is strictly above the median, else 0.0 (ties count as low)
df['high_salary'] = (df['Sal_avg'] > med_salary).astype(float)

In [17]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company,Job,Location,Sal_avg,high_salary
0,30,S.C. International,Statistical Analyst -- Health-7444 - 7444,Chicago,80000,0.0
1,88,Workbridge Associates,"Senior Data Scientist (H20, Python, and R)",Chicago,132500,1.0
2,95,Workbridge Associates,Data Scientist (Healthcare Data),Chicago,110000,1.0
3,103,Jobspring Partners,Quality Engineer (Big Data),Chicago,87500,0.0
4,114,Analytic Recruiting,Manager / Director of Predictive Modeling,Chicago,115000,1.0


I run a model using statsmodels to get a baseline accuracy for the classification problem.

In [18]:
import statsmodels.formula.api as smf

model1 = smf.logit(formula="high_salary ~ 1",data = df).fit() 
model1.summary() 

Optimization terminated successfully.
         Current function value: 0.692964
         Iterations 3


Dep. Variable:,high_salary,No. Observations:,261.0
Model:,Logit,Df Residuals:,260.0
Method:,MLE,Df Model:,0.0
Date:,"Sun, 22 Jan 2017",Pseudo R-squ.:,0.0
Time:,21:04:22,Log-Likelihood:,-180.86
converged:,True,LL-Null:,-180.86
,,LLR p-value:,

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-0.0383,0.124,-0.309,0.757,-0.281 0.204


In [19]:
outcomes = df["high_salary"].value_counts()

In [20]:
outcomes

0.0    133
1.0    128
Name: high_salary, dtype: int64

In [21]:
from __future__ import division

In [22]:
prob_high_salary=outcomes[1]/(outcomes[0]+outcomes[1])
print(prob_high_salary)

0.490421455939


Since the criterion for a high salary is to be strictly above the median, there is a slightly bigger chance of a low salary: salaries exactly equal to the median are put into the low-salary category.
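
As a quick check, the intercept of the intercept-only logit model above is the log-odds of the high-salary class, so it can be recomputed from the class counts alone. The sketch below uses only the counts printed above (133 low, 128 high); the variable names are illustrative.

```python
import math

n_low, n_high = 133, 128  # class counts from value_counts() above
n_total = n_low + n_high

p_high = n_high / n_total              # share of high-salary listings
log_odds = math.log(p_high / (1 - p_high))  # intercept of an intercept-only logit
majority_baseline = max(n_low, n_high) / n_total  # accuracy of always guessing the larger class

print(round(p_high, 4))
print(round(log_odds, 4))
print(round(majority_baseline, 4))
```

The log-odds come out at about -0.038, in line with the reported intercept of -0.0383, and always predicting "low" would be right about 51% of the time.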

In [23]:
model1 = smf.logit(formula="high_salary ~ Location", data = df).fit() 
model1.summary() 

         Current function value: 0.550270
         Iterations: 35




Dep. Variable:,high_salary,No. Observations:,261.0
Model:,Logit,Df Residuals:,245.0
Method:,MLE,Df Model:,15.0
Date:,"Sun, 22 Jan 2017",Pseudo R-squ.:,0.2059
Time:,21:05:34,Log-Likelihood:,-143.62
converged:,False,LL-Null:,-180.86
,,LLR p-value:,7.01e-10

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-1.2528,0.567,-2.210,0.027,-2.364 -0.142
Location[T.Atlanta],0.1542,0.767,0.201,0.841,-1.349 1.657
Location[T.Austin],0.7419,0.925,0.802,0.422,-1.070 2.554
Location[T.Boston],1.4069,0.690,2.039,0.041,0.054 2.759
Location[T.Chicago],1.4759,0.687,2.150,0.032,0.130 2.822
Location[T.Dallas],-0.1335,1.254,-0.107,0.915,-2.590 2.323
Location[T.Denver],-0.6931,1.210,-0.573,0.567,-3.065 1.679
Location[T.Francisco],24.2105,2.16e+04,0.001,0.999,-4.23e+04 4.24e+04
Location[T.Houston],0.1542,0.994,0.155,0.877,-1.794 2.102


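As the logit results above show, the probability of a salary above the median depends on the location of the job. Each location coefficient is measured relative to the reference category (the alphabetically first location, `Angeles`, i.e. Los Angeles), which is captured by the intercept. Because that intercept is negative, meaning a below-even chance of a high salary in the reference city, most other locations show a positive coefficient. Among the rows shown, Boston and Chicago are significant at the 5% level; the table is cut off after Houston, but in the full output all significant effects are positive, and the most significant seem to be New York and Seattle. Two caveats apply: the fit reports `converged: False`, and the `Francisco` coefficient (24.2, with a standard error of about 2.16e+04) is the usual sign of (quasi-)complete separation. This suggests that the San Francisco listings all fall into one class, so that coefficient and its p-value should not be interpreted.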
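
To make the coefficients easier to read, the sketch below converts log-odds into probabilities with the logistic function. It uses a few coefficients copied from the summary above and assumes that the omitted reference level is `Angeles` (the alphabetically first location, which has no row of its own). `sigmoid` and `prob_high` are illustrative names.

```python
import math

# Coefficients copied from the Location-only logit summary above
INTERCEPT = -1.2528
LOCATION_COEFS = {"Boston": 1.4069, "Chicago": 1.4759, "Denver": -0.6931}


def sigmoid(z):
    """Logistic function: maps log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def prob_high(city=None):
    """Predicted probability of a high salary.

    city=None means the reference category (assumed to be Angeles).
    """
    z = INTERCEPT + LOCATION_COEFS.get(city, 0.0)
    return sigmoid(z)


for name in [None, "Boston", "Chicago", "Denver"]:
    print(name or "reference", round(prob_high(name), 3))
```

Under that assumption, the reference city has a predicted probability of roughly 0.22, while Chicago comes out at roughly 0.56, which is why a positive coefficient here mostly means "better than the reference city" rather than "high in absolute terms".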

A next step is to find if certain words in a job title have a predictive effect on salary.

In [25]:
df['Job'] = df['Job'].map(lambda x: x.decode('utf-8'))

In [26]:
# 1.0 if the job title contains 'Senior', else 0.0
df['Senior'] = df['Job'].str.contains('Senior', regex=False).astype(float)

In [27]:
# 1.0 if the job title contains 'Manager', else 0.0
df['Manager'] = df['Job'].str.contains('Manager', regex=False).astype(float)

In [28]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company,Job,Location,Sal_avg,high_salary,Senior,Manager
0,30,S.C. International,Statistical Analyst -- Health-7444 - 7444,Chicago,80000,0.0,0.0,0.0
1,88,Workbridge Associates,"Senior Data Scientist (H20, Python, and R)",Chicago,132500,1.0,1.0,0.0
2,95,Workbridge Associates,Data Scientist (Healthcare Data),Chicago,110000,1.0,0.0,0.0
3,103,Jobspring Partners,Quality Engineer (Big Data),Chicago,87500,0.0,0.0,0.0
4,114,Analytic Recruiting,Manager / Director of Predictive Modeling,Chicago,115000,1.0,0.0,1.0


In [29]:
model2 = smf.logit(formula="high_salary ~ Location+ Manager + Senior", data = df).fit() 
model2.summary() 

         Current function value: 0.543869
         Iterations: 35




Dep. Variable:,high_salary,No. Observations:,261.0
Model:,Logit,Df Residuals:,243.0
Method:,MLE,Df Model:,17.0
Date:,"Sun, 22 Jan 2017",Pseudo R-squ.:,0.2152
Time:,21:12:32,Log-Likelihood:,-141.95
converged:,False,LL-Null:,-180.86
,,LLR p-value:,9.3e-10

,coef,std err,z,P>|z|,[95.0% Conf. Int.]
Intercept,-1.3559,0.575,-2.359,0.018,-2.482 -0.229
Location[T.Atlanta],0.2116,0.772,0.274,0.784,-1.301 1.725
Location[T.Austin],0.6379,0.937,0.681,0.496,-1.198 2.474
Location[T.Boston],1.3249,0.697,1.900,0.057,-0.042 2.692
Location[T.Chicago],1.5141,0.693,2.186,0.029,0.157 2.871
Location[T.Dallas],-0.2165,1.267,-0.171,0.864,-2.699 2.266
Location[T.Denver],-0.6733,1.236,-0.545,0.586,-3.095 1.749
Location[T.Francisco],27.5351,1.18e+05,0.000,1.000,-2.32e+05 2.32e+05
Location[T.Houston],0.0338,1.008,0.034,0.973,-1.941 2.009


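In the regression results (the rows for `Senior` and `Manager` fall in the part of the output that is cut off above), only the word 'Senior' in the job title seems to have a statistically significant positive effect at the 10% level, whereas the word 'Manager' does not. Using the word 'Senior' thus provides some additional explanatory power, although the gain is modest: the pseudo R-squared rises from 0.2059 to 0.2152. As with the previous model, the fit reports `converged: False`, so the estimates should be read with caution.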
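
The fit statistics in the two summaries can be related with a little arithmetic. The sketch below recomputes McFadden's pseudo R-squared (the quantity statsmodels reports as "Pseudo R-squ.") from the printed log-likelihoods and runs a likelihood-ratio test for the two added title variables. It is a rough check only: both fits report `converged: False`, so the log-likelihoods are approximate. For 2 degrees of freedom the chi-squared survival function is exactly `exp(-x / 2)`, which is why no statistics library is needed.

```python
import math

# Log-likelihoods copied from the summaries above
LL_NULL = -180.86      # intercept-only model
LL_LOCATION = -143.62  # high_salary ~ Location
LL_FULL = -141.95      # high_salary ~ Location + Manager + Senior


def mcfadden_r2(ll_model, ll_null=LL_NULL):
    """McFadden's pseudo R-squared: 1 - LL_model / LL_null."""
    return 1.0 - ll_model / ll_null


# Likelihood-ratio statistic for adding Manager and Senior (2 extra parameters)
lr_stat = 2.0 * (LL_FULL - LL_LOCATION)
# Survival function of a chi-squared distribution with 2 degrees of freedom
p_value = math.exp(-lr_stat / 2.0)

print(round(mcfadden_r2(LL_LOCATION), 4))
print(round(mcfadden_r2(LL_FULL), 4))
print(round(lr_stat, 2), round(p_value, 3))
```

On these numbers the joint test gives a p-value of roughly 0.19, so taken together the two title variables add little beyond location, even if 'Senior' on its own is borderline significant.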

In [30]:
df = pd.concat([df, pd.get_dummies(df['Location'])], axis=1)

In [31]:
df.head()

Unnamed: 0.1,Unnamed: 0,Company,Job,Location,Sal_avg,high_salary,Senior,Manager,Angeles,Atlanta,...,Denver,Francisco,Houston,Miami,Philadelphia,Phoenix,Pittsburgh,Portland,Seattle,York
0,30,S.C. International,Statistical Analyst -- Health-7444 - 7444,Chicago,80000,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,88,Workbridge Associates,"Senior Data Scientist (H20, Python, and R)",Chicago,132500,1.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,95,Workbridge Associates,Data Scientist (Healthcare Data),Chicago,110000,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,103,Jobspring Partners,Quality Engineer (Big Data),Chicago,87500,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,114,Analytic Recruiting,Manager / Director of Predictive Modeling,Chicago,115000,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now I set up the same model in scikit-learn, using a simple logistic regression.

In [32]:
X = pd.DataFrame(df.loc[:, 'Senior':])
y = df['high_salary']
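from sklearn.linear_model import LogisticRegression
# sklearn.cross_validation was removed in scikit-learn 0.20; model_selection replaces it
from sklearn.model_selection import cross_val_score

# liblinear supports both the l1 and l2 penalties tried in the grid search further down
model = LogisticRegression(solver='liblinear') 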

model.fit(X,y)
print model.score(X,y)

print cross_val_score(model, X,y).mean()

0.689655172414
0.386973180077




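The model scores far worse under cross-validation (0.39) than on the data it was trained on (0.69), which points to overfitting. A cross-validated accuracy below the roughly 50% baseline suggests something more than ordinary overfitting, though; one possible contributor is that the rows are still ordered by city and the default folds are not shuffled. Perhaps a grid search over the regularization penalty will improve performance.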

In [33]:
from sklearn.model_selection import GridSearchCV
params = {'C': [0.01, 0.1, 1, 10, 100], 'penalty':['l1', 'l2']}
logreg_cv = GridSearchCV(model, param_grid=params, cv=5)

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.25, random_state=42)

logreg_cv.fit(X_train,y_train)
print 'Score: ' + str(logreg_cv.best_score_)
print 'Best parameters: '+ str(logreg_cv.best_params_)

Score: 0.651282051282
Best parameters: {'penalty': 'l2', 'C': 0.01}


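With the grid search, the cross-validated performance becomes much better (0.65). This score comes from 5-fold cross-validation on the training split only, so it is not directly comparable to the earlier figure. The selected model uses a much stronger penalty (`C=0.01`) than the scikit-learn default of `C=1`, which counters the overfitting. I use these parameters to train the model and to make predictions on the test set, which is unseen data for the model. First, however, I examine the model's coefficients.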

In [34]:
logreg = LogisticRegression(penalty= 'l2', C= 0.01).fit(X_train, y_train)

In [35]:
def examine_coefficients(model, df):
    df = pd.DataFrame(
        { 'Coefficient' : model.coef_[0] , 'Feature' : df.columns}
    ).sort_values(by='Coefficient')
    return df[df.Coefficient !=0 ]

In [36]:
examine_coefficients(logreg, X)

Unnamed: 0,Coefficient,Feature
8,-0.033705,Denver
2,-0.027829,Angeles
14,-0.024344,Pittsburgh
3,-0.023144,Atlanta
13,-0.018964,Phoenix
15,-0.014608,Portland
11,-0.014608,Miami
7,-0.009528,Dallas
1,-0.009316,Manager
10,-0.004591,Houston


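The most important feature in determining a data scientist's salary is the location of the job: working in San Francisco or New York has a positive impact on salary (the positive coefficients are cut off in the output above). Next to that, being Senior is also an important factor. The `Manager` coefficient is slightly negative (-0.009), but with such a strong penalty all coefficients are shrunk close to zero, and 'Manager' was not statistically significant in the statsmodels fit, so this is weak evidence at best that being a manager lowers a data scientist's salary.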

In [37]:
from sklearn.metrics import classification_report

y_pred = logreg.predict(X_test)
print(classification_report(y_test, y_pred))

             precision    recall  f1-score   support

        0.0       0.59      0.71      0.65        31
        1.0       0.69      0.57      0.62        35

avg / total       0.65      0.64      0.64        66



The model has a reasonably high precision for high salaries: 69% of the listings that the model labels as high salary really are high salary (over $105,000 per year). Recall for that class is lower, at 57%, so the model misses a fair share of the high-salary listings. It could therefore be useful as a filter for screening out low salaries, with roughly 7 in 10 of the listings it returns actually paying above the median.
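
The figures in the classification report can be reproduced from a confusion matrix. The notebook does not print one, so the four counts below are inferred from the reported precision, recall, and support (31 low and 35 high listings in the test set). They are an assumption, although they are the only whole-number counts consistent with the rounded report.

```python
# Counts inferred from the classification report above (not printed in the notebook)
tn, fp = 22, 9    # actual low:  predicted low / predicted high
fn, tp = 15, 20   # actual high: predicted low / predicted high

precision_high = tp / (tp + fp)  # of predicted-high listings, share that are truly high
recall_high = tp / (tp + fn)     # of truly high listings, share that were found
precision_low = tn / (tn + fn)
recall_low = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_high, 2), round(recall_high, 2))
print(round(precision_low, 2), round(recall_low, 2))
print(round(accuracy, 3))
```

On these counts the test-set accuracy is 42 out of 66, or about 64%, slightly below the 65.1% cross-validated score from the grid search.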

### Conclusion

These results indicate that the model improves on the baseline accuracy of approximately 50%: the grid search reaches a cross-validated accuracy of 65.1% on the training data, and the classification report above gives about 64% on the held-out test set. To get there, however, it was very important to cross-validate and to tune the penalty of the logistic regression.

Location is the most important feature in determining a data scientist's salary, with San Francisco and New York having a positive impact. Being Senior also matters. The negative effect of 'Manager' in the job title is very small and was not statistically significant in the statsmodels fit, so it should be read with caution.