### Creating a pipeline to tune tf-idf + ridge regularization parameters and select the best model.


In [46]:
import pandas as pd
import numpy as np

from sklearn import linear_model
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.grid_search import GridSearchCV  # moved to sklearn.model_selection in newer scikit-learn releases


from matplotlib import pyplot as plt
%matplotlib inline

We are going to try to build a model that predicts the salary offer for a job based on the description of the job listing.

In [2]:
train = pd.read_csv("https://raw.githubusercontent.com/ajschumacher/gadsdata/master/salary/train.csv")

In [41]:
y = train.SalaryNormalized

In [4]:
train.head(3)

Unnamed: 0,Id,Title,FullDescription,LocationRaw,LocationNormalized,ContractType,ContractTime,Company,Category,SalaryRaw,SalaryNormalized,SourceName
0,12612628,Engineering Systems Analyst,Engineering Systems Analyst Dorking Surrey Sal...,"Dorking, Surrey, Surrey",Dorking,,permanent,Gregory Martin International,Engineering Jobs,20000 - 30000/annum 20-30K,25000,cv-library.co.uk
1,12612830,Stress Engineer Glasgow,Stress Engineer Glasgow Salary **** to **** We...,"Glasgow, Scotland, Scotland",Glasgow,,permanent,Gregory Martin International,Engineering Jobs,25000 - 35000/annum 25-35K,30000,cv-library.co.uk
2,12612844,Modelling and simulation analyst,Mathematical Modeller / Simulation Analyst / O...,"Hampshire, South East, South East",Hampshire,,permanent,Gregory Martin International,Engineering Jobs,20000 - 40000/annum 20-40K,30000,cv-library.co.uk


We will just use the description and build a pipeline - this is insanely easy in sklearn!

In [43]:
estimators = [("tf_idf", TfidfVectorizer()), 
              ("ridge", linear_model.Ridge())]
model = Pipeline(estimators)

#### So we just plug in the raw descriptions: the tf_idf step transforms them into a sparse matrix, and the ridge model is then fitted on that matrix. Genius.

### $$\text{Description} \longrightarrow X, \qquad (X, y) \longrightarrow \text{model}$$
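To make the two arrows concrete, here is a minimal sketch that writes the pipeline's two steps out by hand and checks that both routes give the same prediction. The toy `docs` and `salaries` are made-up stand-ins for `train.FullDescription` and `train.SalaryNormalized`, not the real data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the job descriptions and salaries.
docs = ["senior software engineer london",
        "junior data analyst glasgow",
        "locum consultant london hospital",
        "graduate engineer surrey"]
salaries = np.array([60000.0, 25000.0, 90000.0, 28000.0])

# Pipeline version: one object handles both steps.
pipe = Pipeline([("tf_idf", TfidfVectorizer()), ("ridge", Ridge())])
pipe.fit(docs, salaries)

# Manual version: the same two steps written out.
tf_idf = TfidfVectorizer()
X = tf_idf.fit_transform(docs)      # Description -> X (sparse tf-idf matrix)
ridge = Ridge().fit(X, salaries)    # X, y -> model

# At prediction time the vectorizer only transforms; it is not refitted.
new_docs = ["senior consultant london"]
manual_pred = ridge.predict(tf_idf.transform(new_docs))
pipe_pred = pipe.predict(new_docs)
print(np.allclose(manual_pred, pipe_pred))
```

The pipeline saves us from carrying the fitted vectorizer around separately, which matters once the whole thing is handed to a grid search.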

In [44]:
model.fit(train.FullDescription, y) 

Pipeline(steps=[('tf_idf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
...it_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001))])

In [130]:
params = {"ridge__alpha": [0.1, 0.3, 1, 3, 10],      # regularization strength
          "tf_idf__min_df": [1, 3, 10],              # min number of documents a term must appear in
          "tf_idf__ngram_range": [(1, 1), (1, 2)],   # unigrams only, or unigrams + bigrams
          "tf_idf__stop_words": [None, "english"]}   # keep or drop English stop words

How many different models must we run? Since we're doing a grid search, we can just multiply the number of options for each parameter: `5*3*2*2` gives a total of 60 candidate models - a decent number. Each candidate is also fitted once per cross-validation fold, and for every one of those fits the tf_idf vectorizer has to be built all over again.
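The count can be checked directly with `ParameterGrid`, which enumerates the same combinations the grid search will try. This sketch imports it from `sklearn.model_selection`, which assumes a newer scikit-learn than the one used in this notebook. The fold count of 3 is also an assumption, based on the three `cv_validation_scores` per row in the results further down.

```python
from sklearn.model_selection import ParameterGrid

params = {"ridge__alpha": [0.1, 0.3, 1, 3, 10],
          "tf_idf__min_df": [1, 3, 10],
          "tf_idf__ngram_range": [(1, 1), (1, 2)],
          "tf_idf__stop_words": [None, "english"]}

n_candidates = len(ParameterGrid(params))
n_folds = 3  # assumption: three cross-validation folds

print(n_candidates)            # 60 parameter combinations
print(n_candidates * n_folds)  # 180 fits during cross-validation
```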

In [131]:
grid = GridSearchCV(estimator=model, param_grid=params, scoring="mean_squared_error")

grid.fit(train.FullDescription, y)

In [157]:
grid.best_params_

{'ridge__alpha': 0.3,
 'tf_idf__min_df': 1,
 'tf_idf__ngram_range': (1, 2),
 'tf_idf__stop_words': 'english'}

In [161]:
np.sqrt(-grid.best_score_)

10532.473521325306

We can also look at the scores for every parameter combination:

In [143]:
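# Expand each candidate's parameter dict into columns and attach the scores.
param_table = pd.DataFrame([i[0] for i in grid.grid_scores_])
results = pd.DataFrame(grid.grid_scores_)
results = pd.concat([param_table, results], axis=1)
# The scores are negated MSE values, so flip the sign before taking the root.
results["rmse"] = np.sqrt(-results.mean_validation_score)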

In [209]:
results.head(3)

Unnamed: 0,ridge__alpha,tf_idf__min_df,tf_idf__ngram_range,tf_idf__stop_words,parameters,mean_validation_score,cv_validation_scores,rmse
0,0.1,1,"(1, 1)",,"{'ridge__alpha': 0.1, 'tf_idf__stop_words': No...",-138398600.0,"[-103831685.851, -141229157.862, -170145315.841]",11764.29327
1,0.1,1,"(1, 1)",english,"{'ridge__alpha': 0.1, 'tf_idf__stop_words': 'e...",-140887000.0,"[-105929048.004, -144749023.148, -171993435.294]",11869.583228
2,0.1,1,"(1, 2)",,"{'ridge__alpha': 0.1, 'tf_idf__stop_words': No...",-111302600.0,"[-77620035.3972, -108499379.09, -147798481.11]",10550.004578
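The cells above rely on an older scikit-learn API (`sklearn.grid_search`, `grid_scores_`, and the `"mean_squared_error"` scorer). As a rough sketch of the same workflow on a recent release, the search lives in `sklearn.model_selection`, the scorer is called `"neg_mean_squared_error"`, and the per-candidate results are in `cv_results_`. The toy `docs` and `y` and the reduced `small_params` grid below are made up so the example runs quickly. They are not the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data; the real notebook uses train.FullDescription and y.
docs = ["senior software engineer london",
        "junior data analyst glasgow",
        "locum consultant london hospital",
        "graduate engineer surrey",
        "investment director london",
        "contract manager hampshire"]
y = np.array([60000.0, 25000.0, 90000.0, 28000.0, 85000.0, 45000.0])

pipe = Pipeline([("tf_idf", TfidfVectorizer()), ("ridge", Ridge())])
small_params = {"ridge__alpha": [0.1, 1],
                "tf_idf__ngram_range": [(1, 1), (1, 2)]}

grid = GridSearchCV(pipe, param_grid=small_params,
                    scoring="neg_mean_squared_error", cv=3)
grid.fit(docs, y)

# cv_results_ is a dict of arrays with one entry per candidate.
results = pd.DataFrame(grid.cv_results_)
results["rmse"] = np.sqrt(-results["mean_test_score"])
print(results[["params", "mean_test_score", "rmse"]])
```

The sign convention is the same as before: the score is a negated MSE so that larger is better, which is why it is negated again before taking the square root.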


### Examining the Best Model:

In [153]:
model = grid.best_estimator_

Every time we predict, the model first runs the tf-idf step (already fitted on the training set) and then applies the ridge regression model.


In [167]:
model.predict(train.FullDescription)

array([ 25975.84531928,  32824.5058169 ,  32127.26976225, ...,
        50386.2916183 ,  50138.40072399,  27588.69246637])

One issue with using the pipeline is that we don't see the little details that go into fitting the models.

What if we want to examine more closely what goes on in each step of the model? Say, for example, I want to look at the coefficients of my ridge regression. That's also pretty straightforward using the `named_steps` attribute.

In [207]:
grid.best_estimator_.named_steps["ridge"].coef_

array([ -465.8824938 ,  1697.39286267,  1304.56896049, ...,  1416.89223231,
        -596.29992468,  -596.29992468])

In [183]:
ridge_model = model.named_steps["ridge"]
tf_idf_model = model.named_steps["tf_idf"]

In [188]:
# Note: newer scikit-learn releases replace get_feature_names() with get_feature_names_out().
coefficients = pd.DataFrame({"names": tf_idf_model.get_feature_names(),
                             "coef": ridge_model.coef_})

Let's look at the tokens with the largest coefficients:

In [211]:
coefficients.sort_values("coef", ascending=False).head(10)

Unnamed: 0,coef,names
88432,51890.453609,consultant grade
235331,48488.999766,locum
399929,45052.453063,subsea
174963,43441.208259,global
235338,40651.232641,locum consultant
90843,40016.08387,contract
235682,38554.092136,london
211090,36076.259999,investment
121657,34280.854922,director
244094,33309.500667,manager


We see some of the usual suspects, such as london, consultant, director and manager. However, given how many features we have (over 400k), it's hard to interpret these coefficients with much confidence. Perhaps a Lasso model with strong $l_1$ regularization would help, since that will reduce the number of non-zero coefficients.
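A minimal sketch of that idea, assuming we simply swap `Ridge` for `Lasso` in the same pipeline and count the non-zero coefficients at a few regularization strengths. The toy data and the `alpha` values are made up for illustration. Suitable values for the real salary data would have to come from a grid search like the one above.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the job descriptions and salaries.
docs = ["senior software engineer london",
        "junior data analyst glasgow",
        "locum consultant london hospital",
        "graduate engineer surrey",
        "investment director london",
        "contract manager hampshire"]
y = np.array([60000.0, 25000.0, 90000.0, 28000.0, 85000.0, 45000.0])

counts = {}
for alpha in [1.0, 100.0, 1e6]:  # illustrative values only
    lasso_pipe = Pipeline([("tf_idf", TfidfVectorizer()),
                           ("lasso", Lasso(alpha=alpha, max_iter=10000))])
    lasso_pipe.fit(docs, y)
    coef = lasso_pipe.named_steps["lasso"].coef_
    counts[alpha] = int(np.count_nonzero(coef))

n_features = len(lasso_pipe.named_steps["tf_idf"].vocabulary_)
print("features:", n_features)
print("non-zero coefficients per alpha:", counts)
```

With a large enough `alpha` every coefficient is driven to exactly zero, so the useful range sits somewhere below that, where only a small set of tokens keeps non-zero weights.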