# Hyperparameters

We know that 
* the choice of **model** and 
* the choice of **hyperparameters**

are among the most important decisions when using supervised techniques effectively.

# Hyperparameter Tuning 

<img align="left" style="padding-right:30px;" src="figures/tuning.jpeg" width="40%"> 
Gathering **more data** and **feature engineering** usually have the greatest payoff in terms of time invested versus improved performance, <br>
but once we have exhausted all data sources, it’s time to move on to **model hyperparameter tuning**.

## A Brief Explanation of Hyperparameter Tuning

* **Model parameters** are learned during training (such as the slope and intercept in a linear regression).
* **Hyperparameters** must be set by the data scientist before training.

**Example of a random forest**: hyperparameters include 
* the **number of decision trees** in the forest 
* the **number of features** considered by each tree when splitting a node. 
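
The distinction can be made concrete with a minimal sketch. The toy data below is made up for illustration, and the hyperparameter values are arbitrary choices, not recommendations.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier

# Toy data lying exactly on y = 2x + 1 (made up for illustration).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

# Model parameters: learned from the data during fit().
linreg = LinearRegression().fit(X, y)
slope, intercept = linreg.coef_[0], linreg.intercept_
print("learned slope:", round(slope, 3), "intercept:", round(intercept, 3))

# Hyperparameters: chosen by us before any training happens.
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt")
print("n_estimators:", forest.n_estimators, "max_features:", forest.max_features)
```

Here `n_estimators` is the number of trees and `max_features` controls how many features are considered at each split.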


## Scikit-Learn and Hyperparameter Tuning

`Scikit-Learn` implements a set of sensible **default hyperparameters** for all models, but these are not guaranteed to be optimal for a problem. 
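
As a small sketch, the defaults of any estimator can be inspected with `get_params()`. The exact values printed depend on the installed Scikit-Learn version, so no output is shown here.

```python
from sklearn.ensemble import RandomForestClassifier

# get_params() returns the hyperparameters of an estimator as a dict;
# on a freshly constructed estimator these are the library defaults.
defaults = RandomForestClassifier().get_params()
for name in ("n_estimators", "max_depth", "max_features"):
    print(name, "=", defaults[name])
```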

The **best hyperparameters** are usually **impossible** to determine ahead of time, and tuning a model is where machine learning turns from a science into **trial-and-error based engineering**.



### The Hyperparameter Tuning Approach
Hyperparameter tuning 
* relies more on **experimental results** than theory, and thus 
* the best method to determine the optimal settings is 
  * to **try many different combinations** and
  * to **evaluate** the performance of each resulting model. 
  
However, **evaluating** each model only **on the training set** can lead to one of the most fundamental problems in machine learning: `overfitting`.
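
The gap between training and held-out performance can be illustrated with a short sketch. The choice of an unconstrained decision tree and of the iris data is an assumption made for illustration; on such a model the training accuracy is typically at or near 100%, which says little about performance on unseen data.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# An unconstrained tree can memorize the training data.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)
print("train accuracy:", train_acc)
print("test accuracy: ", test_acc)
```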

### Cross Validation for Hyperparameter Tuning

<img align="left" style="padding-right:30px;" src="figures/cv_hyperparameters.png" width="40%"> 
* perform **many iterations** of the **entire K-Fold CV process**, 
* each time using **different model settings**. 
* **compare** all of the models, 
* **select** the best one, 
* **train it** on the full training set, and then 
* **evaluate** on the testing set. 

Scikit-Learn implements model tuning with K-Fold CV in `GridSearchCV`.
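
To make the procedure concrete, the steps listed above can be sketched by hand with `cross_val_score`. The candidate settings and the use of a random forest on iris are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Candidate settings (illustrative values, not a recommendation).
candidates = [{"n_estimators": 10, "max_depth": 2},
              {"n_estimators": 10, "max_depth": 5},
              {"n_estimators": 50, "max_depth": 5}]

# One full 5-fold CV run per candidate, using only the training set.
cv_means = []
for params in candidates:
    model = RandomForestClassifier(random_state=0, **params)
    scores = cross_val_score(model, X_train, y_train, cv=5)
    cv_means.append(scores.mean())

# Select the best setting, retrain on the full training set, test once.
best_params = candidates[cv_means.index(max(cv_means))]
best_model = RandomForestClassifier(random_state=0, **best_params)
best_model.fit(X_train, y_train)
test_score = best_model.score(X_test, y_test)
print("best settings:", best_params)
print("test accuracy:", test_score)
```

The test set is touched only once, after the settings have been chosen, so it remains an honest estimate of generalization.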

### GridSearchCV

<img align="left" style="padding-right:30px;" src="figures/grid-search.png" width="30%"> 
* It **brute-forces** all combinations. 
* It performs an **exhaustive search** over the specified parameter values for an estimator.
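
What "all combinations" means can be sketched with the standard library alone. The grid below is the one used in the MLP example that follows; the enumeration is a plain-Python illustration of the idea, not Scikit-Learn's internal code.

```python
from itertools import product

# Same grid as the MLP example in this section.
grid = {
    "hidden_layer_sizes": [(5,), (10,), (20,), (5, 2), (10, 2), (20, 2)],
    "learning_rate_init": [0.0001, 0.001, 0.01, 0.1],
}

# Cartesian product of all value lists: one dict per candidate setting.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combinations))  # 6 * 4 = 24 candidate settings
print(combinations[0])
```

With 5-fold cross-validation, each of the 24 candidates is fitted 5 times, so the search costs 120 model fits. The cost grows multiplicatively with every hyperparameter added to the grid.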
  


#### Example

Grid search example tuning two hyperparameters of an MLP:
  * Learning rate (`learning_rate_init`)
  * Hidden layer sizes (`hidden_layer_sizes`), which set the number of layers and their widths

In [1]:
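from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

iris = datasets.load_iris()

# Base estimator; the grid below overrides hidden_layer_sizes and learning_rate_init.
# Note: learning_rate_init is only used by the 'sgd' and 'adam' solvers, so with
# solver='lbfgs' it has no effect and the scores do not vary with it.
mlp = MLPClassifier(solver='lbfgs', alpha=1e-5,
                    random_state=1, learning_rate='constant')

# 6 layer configurations x 4 learning rates = 24 candidate settings.
parameters = {'hidden_layer_sizes': [(5,), (10,), (20,), (5, 2), (10, 2), (20, 2)],
              'learning_rate_init': [0.0001, 0.001, 0.01, 0.1]}

# Exhaustive search with 5-fold cross-validation for every candidate.
clf = GridSearchCV(mlp, parameters, cv=5)
clf.fit(iris.data, iris.target)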

print("Best parameters set found on development set:")
print(clf.best_params_)
print("Grid scores on development set:")

means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

Best parameters set found on development set:
{'hidden_layer_sizes': (10,), 'learning_rate_init': 0.0001}
Grid scores on development set:
0.467 (+/-0.533) for {'hidden_layer_sizes': (5,), 'learning_rate_init': 0.0001}
0.467 (+/-0.533) for {'hidden_layer_sizes': (5,), 'learning_rate_init': 0.001}
0.467 (+/-0.533) for {'hidden_layer_sizes': (5,), 'learning_rate_init': 0.01}
0.467 (+/-0.533) for {'hidden_layer_sizes': (5,), 'learning_rate_init': 0.1}
0.980 (+/-0.053) for {'hidden_layer_sizes': (10,), 'learning_rate_init': 0.0001}
0.980 (+/-0.053) for {'hidden_layer_sizes': (10,), 'learning_rate_init': 0.001}
0.980 (+/-0.053) for {'hidden_layer_sizes': (10,), 'learning_rate_init': 0.01}
0.980 (+/-0.053) for {'hidden_layer_sizes': (10,), 'learning_rate_init': 0.1}
0.980 (+/-0.080) for {'hidden_layer_sizes': (20,), 'learning_rate_init': 0.0001}
0.980 (+/-0.080) for {'hidden_layer_sizes': (20,), 'learning_rate_init': 0.001}
0.980 (+/-0.080) for {'hidden_layer_sizes': (20,), 'learning_rate_ini

## Random Search Cross Validation in Scikit-Learn

Usually, we only have a vague idea of the best hyperparameters, so the best approach to narrow our search is to evaluate a wide range of values for each hyperparameter. 

Using Scikit-Learn’s `RandomizedSearchCV` class, we can 
* define a **grid of hyperparameter** ranges, and 
* **randomly sample** from the grid, 
* performing K-Fold CV with each combination of values.
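
Besides lists of values, `RandomizedSearchCV` also accepts distributions that provide an `rvs` method, which is a convenient way to express a range. The sketch below assumes SciPy is installed; the ranges and the random forest on iris are illustrative choices.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Integer ranges to sample from (upper bound exclusive); illustrative values.
distributions = {
    "n_estimators": randint(5, 50),
    "max_depth": randint(1, 10),
}

# Draw 5 random settings and run 3-fold CV on each of them.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    distributions, n_iter=5, cv=3, random_state=0)
search.fit(X, y)

for params, mean in zip(search.cv_results_["params"],
                        search.cv_results_["mean_test_score"]):
    print("%0.3f for %r" % (mean, params))
```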

### RandomizedSearchCV

<img align="left" style="padding-right:30px;" src="figures/random-search.png" width="30%"> 

For example,  
instead of evaluating all 100,000 combinations in a grid,  
we can evaluate only 1,000 randomly sampled combinations.
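
The numbers above can be reproduced with a plain-Python sketch. The grid is hypothetical (5 hyperparameters named `param_0` to `param_4` with 10 values each), chosen only so that it contains 100,000 combinations; the sampling mimics the idea of random search and is not Scikit-Learn's implementation.

```python
import math
import random

# Hypothetical grid: 5 hyperparameters x 10 values each = 10**5 combinations.
grid = {f"param_{i}": list(range(10)) for i in range(5)}
n_total = math.prod(len(values) for values in grid.values())

def decode(index):
    """Map a flat index in [0, n_total) to one combination of the grid."""
    combo = {}
    for name, values in grid.items():
        index, position = divmod(index, len(values))
        combo[name] = values[position]
    return combo

# Sample 1,000 distinct combinations without enumerating the whole grid.
rng = random.Random(42)
n_iter = 1000
sampled = [decode(i) for i in rng.sample(range(n_total), n_iter)]

print(n_total, len(sampled))
```

Only 1% of the grid is evaluated, so the cost drops by a factor of 100, at the price of possibly missing the best combination.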

In [None]:
from sklearn.model_selection import RandomizedSearchCV

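# Reuses `mlp` and `parameters` from the grid search cell above;
# n_iter=8 samples 8 of the 24 combinations instead of trying all of them.
clf = RandomizedSearchCV(mlp, parameters, cv=5, verbose=2, random_state=42, n_jobs=-1,
                         return_train_score=True, n_iter=8)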
clf.fit(iris.data, iris.target)

print("Best parameters set found on development set:")
print(clf.best_params_)
print("Grid scores on development set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

<div class="alert alert-success">
    
## Practice 
* Compare the best parameters of a RandomForest found by `RandomizedSearchCV` and `GridSearchCV`
* Also compare execution time
* Use the dataset available in `data/pima-indians-diabetes.data.csv`

</div>

In [9]:
# Pandas is used for data manipulation (not needed below: this cell reuses the iris data)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import sklearn

# `iris` comes from the grid search cell above
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

RSEED = 50
# Base model; n_estimators=100 is overridden by the values in the search grid
model = RandomForestClassifier(n_estimators=100,
                               random_state=RSEED,
                               max_features='sqrt',
                               n_jobs=-1, verbose=1)
# 3 x 3 = 9 candidate settings
parameters = {'n_estimators': [5, 10, 20],
              'max_depth': [1, 5, 10]}
clf = GridSearchCV(model, parameters, cv=5)
clf.fit(X_train, y_train)

# Optional held-out evaluation (not run here); predict with the fitted search
# object, because `model` itself is never fitted:
# predictions = clf.predict(X_test)
# sklearn.metrics.accuracy_score(y_test, predictions)

print("Best parameters set found on development set:")
print(clf.best_params_)
print("Grid scores on development set:")

means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

# Randomized search over the same grid: 8 of the 9 combinations are sampled
rclf = RandomizedSearchCV(model, parameters, cv=5, verbose=2, random_state=42, n_jobs=-1,
                          return_train_score=True, n_iter=8)
rclf.fit(X_train, y_train)

print("Best parameters set found on development set:")
print(rclf.best_params_)
print("Grid scores on development set:")
means = rclf.cv_results_['mean_test_score']
stds = rclf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, rclf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))



[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done   5 out of   5 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   

[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    0.0s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    0.1s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  10 out of  10 | elapsed:    0.0s finished
[Parallel(n_jobs=4)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jo

Best parameters set found on development set:
{'max_depth': 5, 'n_estimators': 20}
Grid scores on development set:
0.905 (+/-0.148) for {'max_depth': 1, 'n_estimators': 5}
0.971 (+/-0.077) for {'max_depth': 1, 'n_estimators': 10}
0.971 (+/-0.077) for {'max_depth': 1, 'n_estimators': 20}
0.971 (+/-0.047) for {'max_depth': 5, 'n_estimators': 5}
0.971 (+/-0.047) for {'max_depth': 5, 'n_estimators': 10}
0.981 (+/-0.048) for {'max_depth': 5, 'n_estimators': 20}
0.971 (+/-0.047) for {'max_depth': 10, 'n_estimators': 5}
0.971 (+/-0.047) for {'max_depth': 10, 'n_estimators': 10}
0.971 (+/-0.047) for {'max_depth': 10, 'n_estimators': 20}
Fitting 5 folds for each of 8 candidates, totalling 40 fits


[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    3.3s


Best parameters set found on development set:
{'n_estimators': 20, 'max_depth': 5}
Grid scores on development set:
0.971 (+/-0.047) for {'n_estimators': 10, 'max_depth': 10}
0.971 (+/-0.077) for {'n_estimators': 10, 'max_depth': 1}
0.981 (+/-0.048) for {'n_estimators': 20, 'max_depth': 5}
0.905 (+/-0.148) for {'n_estimators': 5, 'max_depth': 1}
0.971 (+/-0.047) for {'n_estimators': 20, 'max_depth': 10}
0.971 (+/-0.077) for {'n_estimators': 20, 'max_depth': 1}
0.971 (+/-0.047) for {'n_estimators': 10, 'max_depth': 5}
0.971 (+/-0.047) for {'n_estimators': 5, 'max_depth': 5}


[Parallel(n_jobs=-1)]: Done  40 out of  40 | elapsed:    3.7s finished
[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    0.1s finished
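
The practice also asks for a comparison of execution time. A minimal sketch of one way to measure it with `time.perf_counter` is given below. It uses iris as a stand-in dataset, like the cell above, and `n_iter=4` is an arbitrary choice; timings depend on the machine, so none are shown.

```python
import time

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier(random_state=50)
parameters = {"n_estimators": [5, 10, 20], "max_depth": [1, 5, 10]}

searches = {
    # Tries all 9 combinations.
    "grid": GridSearchCV(model, parameters, cv=5),
    # Tries only 4 randomly sampled combinations.
    "random": RandomizedSearchCV(model, parameters, cv=5,
                                 n_iter=4, random_state=42),
}

elapsed = {}
for name, search in searches.items():
    start = time.perf_counter()
    search.fit(X, y)
    elapsed[name] = time.perf_counter() - start
    print(f"{name}: best={search.best_params_}, {elapsed[name]:.2f}s")
```

Since the randomized search fits fewer candidates, it would be expected to finish sooner, while its best parameters may differ from those of the exhaustive search.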
