# Introduction to parameter tuning

**Hyperparameters**

A machine learning model is a mathematical formula with a number of parameters that are learnt from the data. That is the crux of machine learning: fitting a model to the data.

However, there is another kind of parameter that cannot be learned directly from the regular training process. These parameters express “higher-level” properties of the model, such as its complexity or how fast it should learn. They are called hyperparameters, and they are usually fixed before the actual training process begins.

So, how are hyperparameters decided?

Broadly speaking, this is done by setting different values for those hyperparameters, training different models, and deciding which ones work best by testing them.

To summarize, hyperparameters:

- Define higher-level concepts about the model, such as complexity or capacity to learn.
- Cannot be learned directly from the data in the standard model training process and need to be predefined.
- Can be decided by setting different values, training different models, and choosing the values that test better.
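The "set values, train, test" loop can be sketched in a few lines. This is a minimal illustration on synthetic data, not the notebook's dataset; the candidate depths and the 70/30 split are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data stands in for a real dataset (assumption for illustration).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

val_scores = {}
for depth in [1, 3, 5, 10]:
    # max_depth is a hyperparameter: it is fixed before training starts.
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # The split rules inside the tree are the parameters learned from the data.
    model.fit(X_train, y_train)
    # Accuracy on held-out data tells us how well this setting "tests".
    val_scores[depth] = model.score(X_val, y_val)

best_depth = max(val_scores, key=val_scores.get)
print(val_scores)
print("best max_depth:", best_depth)
```

Grid search, covered later in this notebook, automates the same loop and replaces the single validation split with cross-validation.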

Some examples of hyperparameters:

- Number of leaves or depth of a tree
- Number of latent factors in a matrix factorization
- Learning rate (in many models)
- Number of hidden layers in a deep neural network
- Number of clusters in a k-means clustering

source: [Quora](https://www.quora.com/What-are-hyperparameters-in-machine-learning)
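In scikit-learn, hyperparameters are typically passed to the estimator's constructor, while learned parameters appear as attributes ending in an underscore after `fit`. The sketch below uses k-means on synthetic blobs purely as an illustration; the data and the choice of three clusters are assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D points (assumption for illustration).
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_clusters is a hyperparameter: we choose it before fitting.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
hyperparams = kmeans.get_params()
print("n_clusters:", hyperparams["n_clusters"])

# The cluster centers are parameters learned from the data.
kmeans.fit(X)
centers = kmeans.cluster_centers_
print("learned centers shape:", centers.shape)
```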


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('fivethirtyeight')

In [None]:
#Read the data 

**Exercise**

In [2]:
#load the train data


In [18]:
#load the test data


**Basic checks**

Check if the columns are the same in train and test.

What else will you check?  [**Discuss**]

In [19]:
train.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'deposit'],
      dtype='object')

In [20]:
test.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome', 'deposit'],
      dtype='object')

In [21]:
print(train.shape, test.shape)

(35211, 17) (10000, 17)


In [24]:
print("train")
print(train.dtypes)
print()
print("test")
print(test.dtypes)

train
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object

test
age           int64
job          object
marital      object
education    object
default      object
balance       int64
housing      object
loan         object
contact      object
day           int64
month        object
duration      int64
campaign      int64
pdays         int64
previous      int64
poutcome     object
deposit      object
dtype: object
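Beyond eyeballing the output, a few of these checks can be done programmatically. The sketch below uses two small made-up frames (`train_toy` and `test_toy` are hypothetical stand-ins, not the notebook's data) to compare column names and dtypes, count missing values, and look for categories that appear only in the test set.

```python
import pandas as pd

# Toy frames standing in for train and test (hypothetical values).
train_toy = pd.DataFrame({
    "balance": [2143, 29],
    "job": ["management", "technician"],
})
test_toy = pd.DataFrame({
    "balance": [5445, None],          # a missing value turns the column into float
    "job": ["services", "technician"],
})

# 1. Same columns, in the same order?
same_columns = list(train_toy.columns) == list(test_toy.columns)

# 2. Which columns have different dtypes?
dtype_mismatch = train_toy.dtypes[train_toy.dtypes != test_toy.dtypes]

# 3. How many missing values per column in test?
missing_in_test = test_toy.isnull().sum()

# 4. Categories present in test but never seen in train.
unseen_jobs = set(test_toy["job"]) - set(train_toy["job"])

print(same_columns)
print(dtype_mismatch)
print(missing_in_test)
print(unseen_jobs)
```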


The categorical data should be encoded.

We saw LabelEncoder earlier. Now, we will use one-hot encoding.

### One-hot encoding

![](img/onehot.jpg)
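As a small illustration of what one-hot encoding does, the sketch below applies `pd.get_dummies` to a made-up three-row frame. Passing `dtype=int` is an assumption made here to keep the 0/1 output style shown in this notebook, since newer pandas versions default to booleans.

```python
import pandas as pd

# A tiny made-up frame with one numeric and one categorical column.
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "marital": ["married", "single", "married"],
})

# Each category becomes its own 0/1 column; numeric columns pass through.
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```

The single `marital` column becomes `marital_married` and `marital_single`, which is why the column count grows after encoding.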

In [6]:
train_updated = pd.get_dummies(train)

In [7]:
train.shape

(35211, 17)

In [8]:
train_updated.shape

(35211, 53)

In [28]:
#print the first record
train_updated.iloc[0]

age                      58
balance                2143
day                       5
duration                261
campaign                  1
pdays                    -1
previous                  0
job_admin.                0
job_blue-collar           0
job_entrepreneur          0
job_housemaid             0
job_management            1
job_retired               0
job_self-employed         0
job_services              0
job_student               0
job_technician            0
job_unemployed            0
job_unknown               0
marital_divorced          0
marital_married           1
marital_single            0
education_primary         0
education_secondary       0
education_tertiary        1
education_unknown         0
default_no                1
default_yes               0
housing_no                0
housing_yes               1
loan_no                   1
loan_yes                  0
contact_cellular          0
contact_telephone         0
contact_unknown           1
month_apr           

**Exercise**
Apply one-hot encoding to the test dataset and store the result in `test_updated`.

In [None]:
#Code here

In [26]:
print(test.shape, test_updated.shape)

(10000, 17) (10000, 53)
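The shapes match here, but encoding train and test separately can produce different columns when a category is missing from one of them. One possible way to guard against that, sketched on made-up frames (`train_toy` and `test_toy` are hypothetical), is to reindex the encoded test set onto the training columns.

```python
import pandas as pd

# "unknown" appears in train but not in test (made-up data).
train_toy = pd.DataFrame({"contact": ["cellular", "telephone", "unknown"]})
test_toy = pd.DataFrame({"contact": ["cellular", "cellular", "telephone"]})

train_enc = pd.get_dummies(train_toy, dtype=int)
test_enc = pd.get_dummies(test_toy, dtype=int)
print(train_enc.shape, test_enc.shape)  # column counts differ

# Align test to the training columns; categories absent in test become all zeros.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```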


In [27]:
#print a sample record (row index 1, i.e. the second row)
test_updated.iloc[1]

age                      58
balance                5445
day                      14
duration                391
campaign                  1
pdays                    -1
previous                  0
job_admin.                0
job_blue-collar           1
job_entrepreneur          0
job_housemaid             0
job_management            0
job_retired               0
job_self-employed         0
job_services              0
job_student               0
job_technician            0
job_unemployed            0
job_unknown               0
marital_divorced          0
marital_married           1
marital_single            0
education_primary         1
education_secondary       0
education_tertiary        0
education_unknown         0
default_no                1
default_yes               0
housing_no                0
housing_yes               1
loan_no                   1
loan_yes                  0
contact_cellular          1
contact_telephone         0
contact_unknown           0
month_apr           

In [10]:
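#Split this into features and label
#Columns 0-50 are the features; the last two columns should be the
#one-hot encoded target (deposit_no, deposit_yes)
train_1 = train_updated.iloc[:,:51]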

In [29]:
train_1.iloc[0]

age                      58
balance                2143
day                       5
duration                261
campaign                  1
pdays                    -1
previous                  0
job_admin.                0
job_blue-collar           0
job_entrepreneur          0
job_housemaid             0
job_management            1
job_retired               0
job_self-employed         0
job_services              0
job_student               0
job_technician            0
job_unemployed            0
job_unknown               0
marital_divorced          0
marital_married           1
marital_single            0
education_primary         0
education_secondary       0
education_tertiary        1
education_unknown         0
default_no                1
default_yes               0
housing_no                0
housing_yes               1
loan_no                   1
loan_yes                  0
contact_cellular          0
contact_telephone         0
contact_unknown           1
month_apr           

In [12]:
#Store label as a separate variable
#Note: column 51 should be deposit_no, so a label of 1 means "no deposit"
label = train_updated.iloc[:,51]

In [13]:
print(train_1.shape, label.shape)

(35211, 51) (35211,)
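Positional slicing works here, but it depends on the column order that `pd.get_dummies` happens to produce. A sketch of selecting by name instead, on a made-up frame, is shown below; using `deposit_yes` as the label is an assumption for illustration and differs from the notebook, where position 51 appears to be `deposit_no`.

```python
import pandas as pd

# Made-up frame with one feature and the target column.
toy = pd.DataFrame({
    "age": [58, 44, 33],
    "deposit": ["no", "yes", "no"],
})
toy_updated = pd.get_dummies(toy, dtype=int)

# Drop both one-hot target columns by name to get the features.
label_columns = ["deposit_no", "deposit_yes"]
features = toy_updated.drop(columns=label_columns)

# Pick one of them explicitly as the label, so the meaning of 1 is clear.
label_yes = toy_updated["deposit_yes"]

print(features.columns.tolist())
print(label_yes.tolist())
```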


In [14]:
#Let's build a random forest model

In [15]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# min_samples_split must be at least 2 (or a fraction) in current scikit-learn
model_rf = RandomForestClassifier(n_estimators=100,
                                 criterion="gini",
                                 max_depth=4,
                                 min_samples_split=2,
                                 min_samples_leaf= 1,
                                 oob_score=True,
                                 n_jobs=-1
                                 )

In [None]:
model_rf.fit(train_1, label)

In [None]:
model_rf.oob_score_
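For context, `oob_score_` is an accuracy estimate computed on the samples each tree did not see in its bootstrap sample, so it needs no separate validation set. A minimal sketch on synthetic data (the dataset and the settings are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data stands in for the encoded training set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# oob_score=True asks the forest to score each sample using only
# the trees whose bootstrap sample did not include it.
forest = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
forest.fit(X, y)

print("OOB accuracy estimate:", forest.oob_score_)
```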

Let's do cross-validation and see how well the model generalizes.

### Cross-validation

In [32]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_curve, auc

In [None]:
# min_samples_split must be at least 2 (or a fraction) in current scikit-learn
model_rf = RandomForestClassifier(n_estimators=100,
                                 criterion="gini",
                                 max_depth=3,
                                 min_samples_split=2,
                                 min_samples_leaf= 1,
                                 oob_score=True,
                                 n_jobs=-1
                                 )

In [None]:
%%time

#Or use %%timeit -n1 -r1 to time the cell

cross_val_score_rf = cross_val_score(model_rf, 
                                     train_1, 
                                     label, scoring="roc_auc",
                                     cv=5,
                                     n_jobs=-1
                                    )

In [None]:
cross_val_score_rf
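`cross_val_score` returns one score per fold, so with `cv=5` the result is an array of five values. The sketch below shows this on synthetic data (an assumption for illustration) and summarizes the folds with their mean and standard deviation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for train_1 and label.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

model = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)

# One ROC AUC value per fold.
fold_scores = cross_val_score(model, X, y, scoring="roc_auc", cv=5)

print(fold_scores)
print("mean: %0.3f, std: %0.3f" % (fold_scores.mean(), fold_scores.std()))
```

A large spread across folds suggests the score depends heavily on which rows end up in each fold.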

**Exercise**

In [None]:
#What is the average cross validation score?


### Grid search

The above was for some arbitrarily chosen parameter values.

How do we run the model on various choices of hyperparameters?
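A grid search simply tries every combination of the listed values. To make that concrete, the sketch below uses scikit-learn's `ParameterGrid` to enumerate the candidates for a small grid; the values mirror the ones used in this notebook.

```python
from sklearn.model_selection import ParameterGrid

# Two values for each of two hyperparameters gives 2 x 2 = 4 candidates.
grid = ParameterGrid({
    "n_estimators": [50, 100],
    "max_depth": [3, 4],
})

candidates = list(grid)
for params in candidates:
    print(params)

print("number of candidates:", len(candidates))
```

With 5-fold cross-validation, each candidate is fitted five times, so this grid costs 20 model fits plus one final refit on the full training data.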


In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

In [34]:
%%timeit -n1 -r1

# Set the parameters by cross-validation
tuned_parameters = [{'n_estimators': [50,100], 
                     'max_depth': [3, 4]
                    }]

scores = ['roc_auc']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(RandomForestClassifier(n_jobs=-1), 
                       tuned_parameters, cv=5,
                       scoring='%s' % score)
    clf.fit(train_1, label)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r"
              % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = test_updated.iloc[:,51],  clf.predict(test_updated.iloc[:,:51])
    
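    # Note: y_pred holds hard class labels, so this ROC curve has a single
    # operating point; the roc_auc scorer used by GridSearchCV ranks
    # predicted scores instead, so the two numbers are not directly comparable.
    false_positive_rate, true_positive_rate, thresholds = roc_curve(y_true, y_pred)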
    roc_auc = auc(false_positive_rate, true_positive_rate)
    print("AUC:", roc_auc)
    
    print(classification_report(y_true, y_pred))
    print()


# Tuning hyper-parameters for roc_auc

Best parameters set found on development set:

{'max_depth': 3, 'n_estimators': 50}

Grid scores on development set:

0.644 (+/-0.424) for {'max_depth': 3, 'n_estimators': 50}
0.631 (+/-0.411) for {'max_depth': 3, 'n_estimators': 100}
0.622 (+/-0.398) for {'max_depth': 4, 'n_estimators': 50}
0.641 (+/-0.363) for {'max_depth': 4, 'n_estimators': 100}

Detailed classification report:

The model is trained on the full development set.
The scores are computed on the full evaluation set.

AUC: 0.504958329704
             precision    recall  f1-score   support

          0       0.80      0.01      0.02      1170
          1       0.88      1.00      0.94      8830

avg / total       0.87      0.88      0.83     10000


1 loop, best of 1: 10.4 s per loop
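The test AUC of about 0.50 above is computed from hard class labels (`clf.predict`), while the cross-validated `roc_auc` scores are computed from predicted scores. One possible way to make the two comparable is to pass class probabilities to the metric. The sketch below does this on synthetic, imbalanced data; the dataset and settings are assumptions and the numbers will not match the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic data: roughly 90% of rows belong to class 1.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.1, 0.9], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, max_depth=3, random_state=0)
forest.fit(X_train, y_train)

# AUC from hard labels: only one threshold is evaluated.
auc_from_labels = roc_auc_score(y_test, forest.predict(X_test))

# AUC from probabilities of class 1: all thresholds are evaluated.
auc_from_scores = roc_auc_score(y_test, forest.predict_proba(X_test)[:, 1])

print("AUC from labels:", auc_from_labels)
print("AUC from probabilities:", auc_from_scores)
```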


**Exercise**

- For `max_depth`, also include 6 and 10.
- Add `min_samples_split` and `min_samples_leaf` to the grid search.
- In addition to `roc_auc`, add `precision` and `recall`.
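As a hint for the last two points, `GridSearchCV` can evaluate several metrics in one run when `scoring` is a list and `refit` names the metric used to pick the best model. The sketch below runs on synthetic data with a deliberately small grid; the specific values are assumptions, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data stands in for train_1 and label.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {
    "max_depth": [3, 6],
    "min_samples_split": [2, 10],   # must be >= 2
    "min_samples_leaf": [1, 5],
}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid,
    scoring=["roc_auc", "precision", "recall"],
    refit="roc_auc",   # metric used to choose best_params_
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
# Each metric gets its own column in cv_results_.
print(search.cv_results_["mean_test_precision"])
print(search.cv_results_["mean_test_recall"])
```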