<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Model tuning

---

We have covered various models for classification and regression that fit a response variable to a set of feature variables. Most models have parameters that can be tuned to find the best among all candidate models. Best in this context means not only fitting the training data well, but also generalising to previously unseen test data.

In the lessons, you have seen how to combine regularisation and grid search with cross validation to find a good model according to scores such as MSE and R² for regression, or accuracy, recall and precision for classification.
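
As a quick reminder of what the classification scores measure, here is a minimal sketch on a handful of made-up labels (the `y_true` and `y_pred` lists are invented for illustration and are not taken from the lesson data).

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels, made up purely for illustration
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

# accuracy: correct predictions / all predictions
accuracy = accuracy_score(y_true, y_pred)
# precision: true positives / all predicted positives
precision = precision_score(y_true, y_pred)
# recall: true positives / all actual positives
recall = recall_score(y_true, y_pred)

print(accuracy, precision, recall)
```

Here there are 3 true positives, 1 false positive, 1 false negative and 1 true negative, so accuracy is 4/6 while precision and recall are both 3/4.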

As a reference, we illustrate this procedure below for a binary classification problem with logistic regression.

The same steps can be used for other models if you substitute the appropriate tuning parameters.
 

We assume you already have a data set with the features and the target you want to model (this is the output you get from doing the EDA).

Then the basic workflow generalises in the following way:
-----

1. Create an instance of the model

1. Check for the parameters of the model (each model will come with its own particular set of parameters)

1. Set up a search grid for the parameters you want to tune, in the form of a dictionary with the tuning parameters as keys and lists of possible values as values

1. Call GridSearchCV with the model, the parameter grid, the number of cross validation folds and the scoring of your choice

1. Fit the model

1. Extract the best score

1. Extract the best parameters

1. Choose the model with the best parameters and predict target variable values (here chosen for some of the values in the dataset)
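
To show how the steps carry over to another model, here is a minimal sketch that wraps steps 4 to 7 in a small function and applies it to a k-nearest-neighbours classifier. The helper name `tune`, the choice of `KNeighborsClassifier` and the values in `knn_grid` are assumptions made for illustration; they are not part of the lesson.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier


def tune(model, param_grid, X, y, cv=5, scoring='accuracy'):
    """Hypothetical helper: grid search with cross validation (steps 4 to 7)."""
    gs = GridSearchCV(estimator=model, param_grid=param_grid,
                      cv=cv, scoring=scoring)
    gs.fit(X, y)
    # Best cross-validated score, the parameters that produced it,
    # and the model refitted on all of X, y with those parameters
    return gs.best_score_, gs.best_params_, gs.best_estimator_


X, y = load_breast_cancer(return_X_y=True)

# Steps 1 to 3 for a different model: instance plus its own tuning parameters
knn_grid = {'n_neighbors': [3, 5, 11],
            'weights': ['uniform', 'distance']}
best_score, best_params, best_model = tune(KNeighborsClassifier(), knn_grid, X, y)

print(best_score, best_params)
```

Only the model instance and the parameter grid change between models; the search itself stays the same.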

## Example

First we load the libraries that we will need. 

In [1]:
import numpy as np
import pandas as pd
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn import model_selection

#### Now we load the breast cancer dataset that's included in sklearn.

In [2]:
bc = datasets.load_breast_cancer()

#### Load the feature matrix with data about the instances of breast cancer

In [3]:
features_df = pd.DataFrame(bc.data, columns=bc.feature_names)

features_df.head(3)

| | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst radius | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 25.38 | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 24.99 | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 23.57 | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |


#### Load the target variable (0 for malignant, 1 for benign)

In [4]:
target_df = pd.Series(bc.target)
target_df.head(3)

0    0
1    0
2    0
dtype: int64

### Logistic regression

Now that we have the data (features and target), we will go through the steps in the workflow, so we can fit a logistic regression on _**all**_ of our features.

#### 1. Create an instance of the model (here you could also use other classifiers)

In [5]:
# liblinear supports both the 'l1' and 'l2' penalties searched over below
model = LogisticRegression(solver='liblinear')

#### 2. Check for the parameters of the model (each model will come with its own particular set of parameters)

In [6]:
model.get_params().keys()

['warm_start',
 'C',
 'n_jobs',
 'verbose',
 'intercept_scaling',
 'fit_intercept',
 'max_iter',
 'penalty',
 'multi_class',
 'random_state',
 'dual',
 'tol',
 'solver',
 'class_weight']

**Note:** Not all of the model parameters are tuning parameters. 'C', 'penalty', 'fit_intercept' for example are tuning parameters, whereas 'n_jobs', 'verbose' or 'random_state' are not. 
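
To see the values a tuning parameter currently has before building a grid around it, you can filter the dictionary returned by `get_params()`. This is a small sketch; the variable name `tuning` is just an illustrative choice, and the defaults printed depend on your scikit-learn version.

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

# get_params() returns a dict mapping parameter names to current values
all_params = model.get_params()

# Keep only the parameters we intend to tune
wanted = ('C', 'penalty', 'fit_intercept')
tuning = {name: value for name, value in all_params.items() if name in wanted}

print(tuning)
```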

#### 3. Set up a search grid for the parameters you want to tune, in the form of a dictionary with the tuning parameters as keys and lists of possible values as values

In [7]:
params = {'C': [0.1, 1, 10],
          'penalty': ['l1', 'l2'],
          'fit_intercept': [True, False]}
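
The grid search tries every combination of the listed values, so the number of candidates is the product of the list lengths. As a rough sketch, scikit-learn's `ParameterGrid` can enumerate them; the variable names `n_candidates` and `n_fits` are illustrative, and the factor of 5 assumes 5-fold cross validation.

```python
from sklearn.model_selection import ParameterGrid

params = {'C': [0.1, 1, 10],
          'penalty': ['l1', 'l2'],
          'fit_intercept': [True, False]}

grid = ParameterGrid(params)

# 3 values of C * 2 penalties * 2 intercept settings
n_candidates = len(grid)
# Each candidate is fitted once per fold (assuming cv=5)
n_fits = n_candidates * 5

print(n_candidates, n_fits)

# Each element of the grid is one dictionary of parameter settings
for candidate in list(grid)[:3]:
    print(candidate)
```

With this grid there are 12 candidates and therefore 60 model fits during the search, plus one final refit of the best candidate.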

#### 4. Call GridSearchCV with the model, the parameter grid, the number of cross validation folds and the scoring of your choice

In [8]:
gs = model_selection.GridSearchCV(estimator=model,
                                  param_grid=params,
                                  cv=5,
                                  scoring='accuracy')
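
Both `cv` and `scoring` accept alternatives. The sketch below passes an explicit, shuffled `StratifiedKFold` splitter instead of an integer and scores by recall instead of accuracy. The names `gs_recall` and `cv_splitter`, the reduced grid and the `random_state` value are illustrative assumptions. Note that `scoring='recall'` measures recall for the class labelled 1, so it is worth checking which class that is in your data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)

# Explicit splitter: 5 folds, shuffled, class proportions preserved per fold
cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

gs_recall = GridSearchCV(estimator=LogisticRegression(solver='liblinear'),
                         param_grid={'C': [0.1, 1, 10]},
                         cv=cv_splitter,
                         scoring='recall')
gs_recall.fit(X, y)

print(gs_recall.best_params_, gs_recall.best_score_)
```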

#### 5. Fit the model

In [9]:
gs.fit(features_df, target_df)

GridSearchCV(cv=5, error_score='raise',
       estimator=LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params={}, iid=True, n_jobs=1,
       param_grid={'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10], 'fit_intercept': [True, False]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring='accuracy', verbose=0)

#### 6. Extract the best score

In [10]:
print(gs.best_score_)

0.957820738137


#### 7. Extract the best parameters

In [11]:
print(gs.best_params_)

{'penalty': 'l1', 'C': 10, 'fit_intercept': True}


#### 8. Choose the model with the best parameters and predict target variable values (here chosen for some of the values in the dataset)

In [12]:
gs.best_estimator_.predict(features_df.iloc[100:110])

array([0, 1, 1, 1, 1, 0, 1, 1, 0, 1])
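
The rows predicted above were also part of the data the model was fitted on. To check how well the tuned model generalises to unseen data, one option is to hold out a test set before the grid search. The following is a sketch under assumptions: the 25% test size, the `random_state`, the reduced grid and the name `gs_holdout` are illustrative choices, not part of the lesson.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that the grid search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

gs_holdout = GridSearchCV(estimator=LogisticRegression(solver='liblinear'),
                          param_grid={'C': [0.1, 1, 10],
                                      'penalty': ['l1', 'l2']},
                          cv=5,
                          scoring='accuracy')
gs_holdout.fit(X_train, y_train)

# score() uses the refitted best estimator and the chosen scoring (accuracy)
test_accuracy = gs_holdout.score(X_test, y_test)

print(gs_holdout.best_params_, gs_holdout.best_score_, test_accuracy)
```

Comparing `best_score_` (cross-validated on the training part) with `test_accuracy` gives a rough sense of whether the tuned model holds up on data it has not seen.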

### Further analysis

#### Get all results from the grid search (they are returned as a dictionary that you can conveniently read into a data frame)

In [13]:
results = pd.DataFrame(gs.cv_results_)
results

| | mean_fit_time | mean_score_time | mean_test_score | mean_train_score | param_C | param_fit_intercept | param_penalty | params | rank_test_score | split0_test_score | ... | split2_test_score | split2_train_score | split3_test_score | split3_train_score | split4_test_score | split4_train_score | std_fit_time | std_score_time | std_test_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.100128 | 0.007826 | 0.926186 | 0.938488 | 0.1 | True | l1 | {u'penalty': u'l1', u'C': 0.1, u'fit_intercept... | 11 | 0.904348 | ... | 0.929204 | 0.938596 | 0.920354 | 0.942982 | 0.938053 | 0.934211 | 0.019164 | 0.014913 | 0.012914 | 0.005018 |
| 1 | 0.004863 | 0.000277 | 0.945518 | 0.946395 | 0.1 | True | l2 | {u'penalty': u'l2', u'C': 0.1, u'fit_intercept... | 9 | 0.930435 | ... | 0.982301 | 0.940789 | 0.929204 | 0.953947 | 0.946903 | 0.947368 | 0.000142 | 5e-06 | 0.019395 | 0.004328 |
| 2 | 0.077948 | 0.00036 | 0.926186 | 0.93981 | 0.1 | False | l1 | {u'penalty': u'l1', u'C': 0.1, u'fit_intercept... | 11 | 0.904348 | ... | 0.929204 | 0.938596 | 0.920354 | 0.942982 | 0.938053 | 0.934211 | 0.023562 | 1.9e-05 | 0.012914 | 0.003778 |
| 3 | 0.005059 | 0.000304 | 0.945518 | 0.945956 | 0.1 | False | l2 | {u'penalty': u'l2', u'C': 0.1, u'fit_intercept... | 9 | 0.930435 | ... | 0.982301 | 0.938596 | 0.929204 | 0.953947 | 0.946903 | 0.947368 | 0.000317 | 3.9e-05 | 0.019395 | 0.004942 |
| 4 | 0.196167 | 0.000373 | 0.950791 | 0.961778 | 1.0 | True | l1 | {u'penalty': u'l1', u'C': 1, u'fit_intercept':... | 6 | 0.93913 | ... | 0.973451 | 0.953947 | 0.946903 | 0.969298 | 0.964602 | 0.958333 | 0.052644 | 8e-06 | 0.01594 | 0.005277 |
| 5 | 0.005769 | 0.000283 | 0.950791 | 0.958704 | 1.0 | True | l2 | {u'penalty': u'l2', u'C': 1, u'fit_intercept':... | 6 | 0.930435 | ... | 0.973451 | 0.949561 | 0.946903 | 0.967105 | 0.964602 | 0.953947 | 0.000296 | 1.6e-05 | 0.01594 | 0.006234 |
| 6 | 0.23235 | 0.000368 | 0.952548 | 0.961778 | 1.0 | False | l1 | {u'penalty': u'l1', u'C': 1, u'fit_intercept':... | 3 | 0.947826 | ... | 0.973451 | 0.953947 | 0.946903 | 0.969298 | 0.964602 | 0.958333 | 0.07441 | 6e-06 | 0.01501 | 0.005277 |
| 7 | 0.006059 | 0.000275 | 0.952548 | 0.958266 | 1.0 | False | l2 | {u'penalty': u'l2', u'C': 1, u'fit_intercept':... | 3 | 0.93913 | ... | 0.973451 | 0.949561 | 0.946903 | 0.967105 | 0.964602 | 0.951754 | 0.000239 | 8e-06 | 0.013955 | 0.006619 |
| 8 | 0.273643 | 0.000412 | 0.957821 | 0.977158 | 10.0 | True | l1 | {u'penalty': u'l1', u'C': 10, u'fit_intercept'... | 1 | 0.956522 | ... | 0.955752 | 0.969298 | 0.955752 | 0.97807 | 0.964602 | 0.97807 | 0.210128 | 6.2e-05 | 0.003393 | 0.004275 |
| 9 | 0.006672 | 0.000273 | 0.952548 | 0.96617 | 10.0 | True | l2 | {u'penalty': u'l2', u'C': 10, u'fit_intercept'... | 3 | 0.947826 | ... | 0.973451 | 0.964912 | 0.946903 | 0.973684 | 0.964602 | 0.958333 | 0.000436 | 5e-06 | 0.01501 | 0.004913 |


#### The most interesting result is the mean test score for each parameter combination

In [14]:
results[['mean_test_score'] + [col for col in results.columns if 'param_' in col]]

| | mean_test_score | param_C | param_fit_intercept | param_penalty |
|---|---|---|---|---|
| 0 | 0.926186 | 0.1 | True | l1 |
| 1 | 0.945518 | 0.1 | True | l2 |
| 2 | 0.926186 | 0.1 | False | l1 |
| 3 | 0.945518 | 0.1 | False | l2 |
| 4 | 0.950791 | 1.0 | True | l1 |
| 5 | 0.950791 | 1.0 | True | l2 |
| 6 | 0.952548 | 1.0 | False | l1 |
| 7 | 0.952548 | 1.0 | False | l2 |
| 8 | 0.957821 | 10.0 | True | l1 |
| 9 | 0.952548 | 10.0 | True | l2 |
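
A table like this is often easier to read when sorted by score or pivoted so that two parameters form the rows and columns. The sketch below does both with pandas, using the `fit_intercept=True` rows copied from the output above so that it runs on its own; with a fitted grid search you would start from `pd.DataFrame(gs.cv_results_)` instead. The names `scores`, `ranked` and `pivot` are illustrative.

```python
import pandas as pd

# Rows with fit_intercept=True, copied from the results shown above
scores = pd.DataFrame({
    'mean_test_score': [0.926186, 0.945518, 0.950791,
                        0.950791, 0.957821, 0.952548],
    'param_C': [0.1, 0.1, 1.0, 1.0, 10.0, 10.0],
    'param_penalty': ['l1', 'l2', 'l1', 'l2', 'l1', 'l2'],
})

# Best combinations first
ranked = scores.sort_values('mean_test_score', ascending=False)
ranked = ranked.reset_index(drop=True)

# One row per C, one column per penalty
pivot = scores.pivot(index='param_C', columns='param_penalty',
                     values='mean_test_score')

print(ranked.head(3))
print(pivot)
```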
