## What are Hyperparameters?

We briefly covered the distinction between parameters and hyperparameters in Week 9. To recap, model parameters are the components of a model that are learned from the data during training, and they are what the model needs in order to make predictions. Examples include the slope and intercept (the coefficients) of a linear regression, and the weights and biases of an artificial neural network.

In contrast to model parameters, hyperparameters relate to the architecture or configuration of your machine learning model. They cannot be learned directly during training and are usually set before training begins. Because they control how training proceeds, the hyperparameters influence the final values of the model parameters.

The process of searching for the set of hyperparameters that leads to the best model outcome is referred to as hyperparameter tuning or optimization.
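
To make the distinction concrete, here is a minimal sketch in plain Python (not scikit-learn). It fits a straight line by gradient descent on made-up data. The function name `fit_line` and the values chosen for `learning_rate` and `n_iterations` are assumptions for illustration only: the slope and intercept are parameters (learned), while the learning rate and iteration count are hyperparameters (set by us).

```python
# Toy data generated from y = 2x + 1 (no noise), so the ideal parameters are known.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]


def fit_line(xs, ys, learning_rate, n_iterations):
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0  # model parameters: learned from the data
    n = len(xs)
    for _ in range(n_iterations):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        grad_slope = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2.0 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# Hyperparameters: chosen by us before training, never updated by the training loop.
slope, intercept = fit_line(xs, ys, learning_rate=0.05, n_iterations=2000)
print("learned slope:", round(slope, 3), "learned intercept:", round(intercept, 3))

# A poor hyperparameter choice leaves the parameters far from their ideal values.
bad_slope, bad_intercept = fit_line(xs, ys, learning_rate=0.05, n_iterations=5)
print("after 5 iterations:", round(bad_slope, 3), round(bad_intercept, 3))
```

The same data and the same model give different learned parameters depending on the hyperparameters, which is why tuning them matters.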

Let's take random forest as an example and look at its hyperparameters.



In [158]:
## create a simple random forest estimator using sklearn and print it out
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier()
print(rf)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)


These are all settings of a random forest model that we can tune to achieve the best model performance. What do all these hyperparameters mean? A detailed description of each random forest hyperparameter is available here: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html.

Let's take an example:

n_estimators: the number of trees in the forest. The default is 10 in the scikit-learn version used here (it changed to 100 in version 0.22). Typically, more trees lead to better performance, with diminishing returns, but they also increase computation time.




***Let's check the hyperparameters of the KNN classifier***

In [28]:
## create a simple knn estimator using sklearn and print it out
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')


KNN has fewer hyperparameters than the RF model. See the scikit-learn documentation page to learn more about the hyperparameters for KNN.

## Hyperparameter tuning/optimization

As we saw with the random forest (RF) estimator, the RF model has many different hyperparameters that we need to set. As a machine learning model becomes more complex, the number of hyperparameters we need to set also increases. The performance of a model depends heavily on the choice of its hyperparameters.

Before training, we don't know what the optimal set of hyperparameters is for a given model and data set, so we need to explore a range of possibilities. This process of searching for the set of hyperparameters that improves model performance (for example, a lower value of the cost function) is referred to as hyperparameter tuning or optimization.
 
In general, hyperparameter tuning/optimization includes the following steps:

1. Define the model, the hyperparameters to tune (not all hyperparameters are important), and the range of values for each selected hyperparameter
2. Define the method for sampling hyperparameter values
3. Define the metric to be minimized or maximized for the problem at hand
4. Split the data into training, validation and test sets
5. Repeat the optimization loop a number of times over your sampled hyperparameter space:

  (a) Select a new set of model hyperparameters   
  (b) Train the model on the training subset using the selected set of hyperparameters  
  (c) Apply the model to the validation subset and generate the corresponding predictions  
  (d) Evaluate the validation predictions using the appropriate scoring metric for the problem at hand, such as accuracy or mean absolute error. Store the metric value that corresponds to the selected set of hyperparameters
  
6. Compare all metric values and choose the hyperparameter set that yields the best metric value. Use the test set only once, at the end, to estimate how well the chosen model generalizes.
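
The loop above can be sketched end to end in plain Python. This is only an illustration under stated assumptions: the data are synthetic, the "model" is a hand-written k-nearest-neighbours regressor (for which training simply means storing the training rows), and the names `knn_predict`, `mean_abs_error` and `candidate_k` are made up for this sketch. Candidates are scored on a held-out validation subset, and the test subset is kept aside for one final check.

```python
import random

random.seed(0)

# Hypothetical 1-D data set: y = x^2 plus a little noise.
data = [(x, x * x + random.gauss(0.0, 0.5)) for x in [i / 10 for i in range(-30, 31)]]
random.shuffle(data)

# Split the 61 rows into training, validation and test subsets (roughly 60/20/20).
train, valid, test = data[:37], data[37:49], data[49:]


def knn_predict(train_rows, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train_rows, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_abs_error(rows, train_rows, k):
    """Mean absolute error of the k-NN predictions on the given rows."""
    return sum(abs(y - knn_predict(train_rows, x, k)) for x, y in rows) / len(rows)


# The hyperparameter to tune and its candidate values; every candidate is tried.
candidate_k = [1, 3, 5, 9, 15, 25]

# Optimization loop: score each candidate on the validation subset and store the result.
results = {k: mean_abs_error(valid, train, k) for k in candidate_k}

# Compare the stored metric values and keep the candidate with the lowest validation MAE.
best_k = min(results, key=results.get)

# The test subset is touched once, only to report the final error.
test_mae = mean_abs_error(test, train, best_k)
print("validation MAE per k:", results)
print("best k:", best_k, "test MAE:", round(test_mae, 3))
```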


## Hyperparameter tuning methods

As we know by now, hyperparameter tuning involves "searching" the hyperparameter space for the optimum values. In general, hyperparameter tuning strategies can be grouped as:
1. Manual search
2. Grid Search
3. Random Search
4. Bayesian Optimization

**Manual Search**

In manual search, we initially choose a set of hyperparameters based on heuristics or experience. The model is trained with these hyperparameters and scored on the validation set; the process is repeated with new values until a satisfactory result is obtained, and the accuracy of the final model is then assessed on the test set. The problem with manual search is that it is slow and labor-intensive: with each trial, we have to train a model on the training data and evaluate its performance on the validation data. This process can quickly become intractable for complex models with a large number of hyperparameters, such as deep learning models.

**Grid Search**

In grid search, we first set up a grid of hyperparameter values. The model is then trained for each combination of hyperparameters and evaluated on the validation set (or by cross-validation) to find the optimal combination.  

Let's take an example for random forest and consider the n_estimators and max_depth hyperparameters below, which result in a total of thirty (5 × 6) models:

n_estimators = [10, 50, 100, 200, 500]  
max_depth = [3, 10, 20, 40, 60, 100]

Each model is fit to the training data and evaluated on the validation data. Depending on the model and the data size, grid search can quickly become computationally very expensive, because the number of combinations multiplies with every hyperparameter added.
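
As a quick check of the count, the grid can be enumerated with `itertools.product` from the standard library. This sketch only builds the list of candidate settings; it does not train any model, and the variable name `grid` is an assumption for illustration.

```python
from itertools import product

n_estimators = [10, 50, 100, 200, 500]
max_depth = [3, 10, 20, 40, 60, 100]

# Every pairing of one n_estimators value with one max_depth value is a candidate model.
grid = [
    {"n_estimators": n, "max_depth": d}
    for n, d in product(n_estimators, max_depth)
]

print(len(grid))   # 5 * 6 = 30 candidates
print(grid[0])     # first combination
print(grid[-1])    # last combination
```

Adding a third hyperparameter with four values would multiply the count to 120, which is why grids grow so quickly.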

**Random Search**

In random search, we no longer provide a discrete set of values to explore the hyperparameter space. Instead, values for each hyperparameter are randomly sampled from a statistical distribution.

One of the main theoretical backings to motivate the use of random search in place of grid search is the fact that for most cases, hyperparameters are not equally important.

*"A Gaussian process analysis of the function from hyper-parameters to validation set performance reveals that **for most data sets only a few of the hyper-parameters really matter, but that different hyper-parameters are important on different data sets**. This phenomenon makes grid search a poor choice for configuring algorithms for new data sets."* - ***Bergstra, 2012*** 

In a grid, each value of one hyperparameter is re-evaluated for every combination of the other hyperparameters, so only a handful of distinct values of each hyperparameter are ever tried. When a hyperparameter has little effect on the resulting model score, those repeated evaluations are wasted effort. Random search, in contrast, draws a new value of every hyperparameter in each trial, so for the same budget it explores many more distinct values of the hyperparameters that really matter.
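
The difference in exploratory power can be seen by counting how many distinct values of a single hyperparameter each strategy tries for the same budget. The sketch below is an illustration with two hypothetical hyperparameters, `a` and `b`, both ranging over [0, 1]; no model is trained.

```python
import random

random.seed(42)

# Grid search: 3 x 3 = 9 trials, but only 3 distinct values per hyperparameter.
grid_a = [0.1, 0.5, 0.9]
grid_b = [0.1, 0.5, 0.9]
grid_trials = [(a, b) for a in grid_a for b in grid_b]

# Random search: the same 9 trials, each drawing fresh values from uniform(0, 1).
random_trials = [(random.uniform(0, 1), random.uniform(0, 1)) for _ in range(9)]

distinct_grid_a = len({a for a, _ in grid_trials})
distinct_random_a = len({a for a, _ in random_trials})
print("distinct values of a tried by grid search:  ", distinct_grid_a)
print("distinct values of a tried by random search:", distinct_random_a)
```

If only `a` matters for the score, the grid has effectively run three experiments three times each, while random search has run nine different ones.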

**Bayesian Optimization**

Both grid search and random search build models from a set of hyperparameters and record the performance of each model. However, each model is independent of the others, and the information from previous models is not used to improve the next one. Bayesian optimization uses the results of previous iterations when choosing the hyperparameter set to evaluate next. By choosing hyperparameters in an informed way, it narrows the search of the hyperparameter space and typically requires fewer iterations to find a good set of hyperparameter values. Go [here](https://towardsdatascience.com/a-conceptual-explanation-of-bayesian-model-based-hyperparameter-optimization-for-machine-learning-b8172278050f) to learn more about Bayesian optimization for hyperparameter tuning.

In [7]:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

%matplotlib inline
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform

## note: load_boston was removed in scikit-learn 1.2, so this import needs an older version
from sklearn.datasets import load_boston
from sklearn.model_selection import (cross_val_score, train_test_split, 
                                     GridSearchCV, RandomizedSearchCV)
from sklearn.metrics import r2_score

In [8]:

## load the boston dataset from sklearn
boston_dataset = load_boston()

## check what the boston dataset contains
print(boston_dataset.keys())

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])


In [9]:
## Check the dataset characteristics for the Boston dataset.
#print(boston_dataset.DESCR)


### The house price, which is our target variable, is the attribute MEDV. The remaining attributes are the feature variables that we will use to predict the value of the house.

***Pandas Dataframe Conversion***


In [10]:
## convert the data into a pandas dataframe and check the first 5 rows

boston = pd.DataFrame(boston_dataset.data, columns=boston_dataset.feature_names)
boston.head()

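| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 |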


In [11]:
boston['MEDV'] = boston_dataset.target
boston.head()

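| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.00632 | 18.0 | 2.31 | 0.0 | 0.538 | 6.575 | 65.2 | 4.09 | 1.0 | 296.0 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 0.02731 | 0.0 | 7.07 | 0.0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2.0 | 242.0 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 0.02729 | 0.0 | 7.07 | 0.0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2.0 | 242.0 | 17.8 | 392.83 | 4.03 | 34.7 |
| 3 | 0.03237 | 0.0 | 2.18 | 0.0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3.0 | 222.0 | 18.7 | 394.63 | 2.94 | 33.4 |
| 4 | 0.06905 | 0.0 | 2.18 | 0.0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3.0 | 222.0 | 18.7 | 396.9 | 5.33 | 36.2 |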


In [12]:
## check if there are any missing values in the data frame
boston.isnull().sum()


CRIM       0
ZN         0
INDUS      0
CHAS       0
NOX        0
RM         0
AGE        0
DIS        0
RAD        0
TAX        0
PTRATIO    0
B          0
LSTAT      0
MEDV       0
dtype: int64

In [13]:
## Check the data type of all the variables
boston.dtypes

CRIM       float64
ZN         float64
INDUS      float64
CHAS       float64
NOX        float64
RM         float64
AGE        float64
DIS        float64
RAD        float64
TAX        float64
PTRATIO    float64
B          float64
LSTAT      float64
MEDV       float64
dtype: object

## Exploratory Analysis


In [14]:
## we use the describe method to understand general distribution of the data. T is used to swap rows and columns
boston.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CRIM | 506.0 | 3.613524 | 8.601545 | 0.00632 | 0.082045 | 0.25651 | 3.677083 | 88.9762 |
| ZN | 506.0 | 11.363636 | 23.322453 | 0.0 | 0.0 | 0.0 | 12.5 | 100.0 |
| INDUS | 506.0 | 11.136779 | 6.860353 | 0.46 | 5.19 | 9.69 | 18.1 | 27.74 |
| CHAS | 506.0 | 0.06917 | 0.253994 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| NOX | 506.0 | 0.554695 | 0.115878 | 0.385 | 0.449 | 0.538 | 0.624 | 0.871 |
| RM | 506.0 | 6.284634 | 0.702617 | 3.561 | 5.8855 | 6.2085 | 6.6235 | 8.78 |
| AGE | 506.0 | 68.574901 | 28.148861 | 2.9 | 45.025 | 77.5 | 94.075 | 100.0 |
| DIS | 506.0 | 3.795043 | 2.10571 | 1.1296 | 2.100175 | 3.20745 | 5.188425 | 12.1265 |
| RAD | 506.0 | 9.549407 | 8.707259 | 1.0 | 4.0 | 5.0 | 24.0 | 24.0 |
| TAX | 506.0 | 408.237154 | 168.537116 | 187.0 | 279.0 | 330.0 | 666.0 | 711.0 |


In [15]:
#sns.set(rc={'figure.figsize':(11.7,8.27)})
#sns.distplot(boston['MEDV'], bins=30)


In [16]:
X = boston.drop("MEDV", axis=1)
Y = boston["MEDV"]
print(X.shape)
print(Y.shape)

(506, 13)
(506,)


Let's split the data into train and test sets. We use 80% of the samples for training and 20% for testing.


In [17]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(404, 13)
(102, 13)
(404,)
(102,)


#### Let's fit the model with default hyperparameters from sklearn to get a baseline idea of model performance

In [18]:
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)



#### We will use mean absolute error and R squared to assess model performance

In [19]:
from sklearn.metrics import mean_absolute_error, r2_score
mae = mean_absolute_error(y_test, y_pred)
r2=r2_score(y_test, y_pred)
print("The model performance for baseline model is:")
print("---------------------------------------------")
print('mean absolute error is {}'.format(mae))
print('R2 score is {}'.format(r2))

The model performance for baseline model is:
---------------------------------------------
mean absolute error is 2.266078431372548
R2 score is 0.8574924093945536
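
To show what these two metrics compute, here is a plain-Python sketch of both. The function names `mean_abs_error` and `r_squared` are hypothetical stand-ins for illustration (the notebook itself uses the scikit-learn functions), and the predictions are made up.

```python
def mean_abs_error(y_true, y_pred):
    """Average absolute difference between observed and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Five observed prices and made-up predictions that are each off by 1, for illustration only.
y_true_demo = [24.0, 21.6, 34.7, 33.4, 36.2]
y_pred_demo = [25.0, 20.6, 33.7, 34.4, 35.2]

mae_demo = mean_abs_error(y_true_demo, y_pred_demo)
r2_demo = r_squared(y_true_demo, y_pred_demo)
print("MAE:", round(mae_demo, 4))
print("R2: ", round(r2_demo, 4))
```

A lower MAE is better, while R2 is better the closer it gets to 1, so the two metrics move in opposite directions as a model improves.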
R2 score is 0.8574924093945536


#### With the default random forest regression hyperparameters, the mean absolute error for housing price prediction is about 2.27 and R2 is about 0.86. Let's set up a dictionary with the hyperparameters that we want to test.

In [22]:
param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 30],
    'max_features': [2,3,5],
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [100, 200]
}

#### We now create an instance of GridSearchCV with the hyperparameter grid. The parameter cv sets the number of folds for cross-validation. Once the GridSearchCV object is initialized, we call its fit method and pass it the training set:
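
As a rough picture of what a 3-fold split does to the 404 training rows, here is a plain-Python sketch. The helper `k_fold_indices` is hypothetical and written for this illustration; it assumes contiguous, unshuffled folds, and it is not the scikit-learn implementation.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each fold, without shuffling."""
    # The first (n_samples % n_folds) folds receive one extra row.
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


# 404 training rows split into 3 folds.
folds = list(k_fold_indices(404, 3))
for fold, (train_idx, valid_idx) in enumerate(folds, start=1):
    print("fold", fold, "train rows:", len(train_idx), "validation rows:", len(valid_idx))
```

Each candidate is trained three times, each time holding out a different fold for scoring, and the three scores are averaged.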

In [23]:
from sklearn.model_selection import GridSearchCV
rf = RandomForestRegressor()
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = -1, verbose = 2)
grid_search.fit(x_train, y_train);

Fitting 3 folds for each of 54 candidates, totalling 162 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:    6.3s
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed:   20.4s
[Parallel(n_jobs=-1)]: Done 162 out of 162 | elapsed:   21.4s finished
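
The counts in the log follow directly from the grid: the number of candidates is the product of the number of values per hyperparameter, and each candidate is fit once per fold. A small arithmetic sketch (the grid is repeated here so the snippet runs on its own):

```python
from math import prod

param_grid = {
    'bootstrap': [True],
    'max_depth': [10, 20, 30],
    'max_features': [2, 3, 5],
    'min_samples_leaf': [3, 4, 5],
    'n_estimators': [100, 200],
}
cv = 3

# 1 * 3 * 3 * 3 * 2 candidate combinations, each fit once per cross-validation fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv
print(n_candidates, "candidates,", n_fits, "fits")
```

With the default `refit=True`, the best candidate is then fit once more on the full training set, which is why `best_estimator_` is ready for prediction.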


#### Check the hyperparameters that gave the highest cross-validated score (R2 by default for a regressor), using the *best_estimator_* attribute of the GridSearchCV object as shown below:

In [24]:
best_grid = grid_search.best_estimator_
best_grid


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=10,
                      max_features=5, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=3, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=200,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

#### Use the best model returned from grid search to predict on the test set and calculate the model error

In [25]:
y_pred_gridsearch = best_grid.predict(x_test)

In [26]:
mae_gridsearch = mean_absolute_error(y_test, y_pred_gridsearch)
r2_gridsearch=r2_score(y_test, y_pred_gridsearch)
print("The model performance for testing set from grid search")
print("--------------------------------------")
print('mean absolute error is {}'.format(mae_gridsearch))
print('R2 score is {}'.format(r2_gridsearch))
print('Improvement of {:0.2f}%.'.format( 100 * (r2_gridsearch- r2) / r2))

The model performance for testing set from grid search
--------------------------------------
mean absolute error is 2.064521797240613
R2 score is 0.8890151457923472
Improvement of 3.68%.
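
The improvement figure is a relative change in R2, not a difference in percentage points. The arithmetic can be reproduced from the two scores printed in this notebook (the variable names here are for illustration only):

```python
r2_baseline = 0.8574924093945536    # R2 of the baseline model printed earlier
r2_gridsearch = 0.8890151457923472  # R2 of the best grid search model

# Relative change, as used in the notebook, and the plain difference in R2.
relative_improvement = 100 * (r2_gridsearch - r2_baseline) / r2_baseline
absolute_improvement = r2_gridsearch - r2_baseline

print('Relative improvement: {:0.2f}%'.format(relative_improvement))
print('Absolute change in R2: {:0.4f}'.format(absolute_improvement))
```

Because neither forest was created with a fixed `random_state`, these scores will vary somewhat from run to run, so small differences should be interpreted with care.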


#### Randomized Search for Hyperparameter tuning

##### Create a random grid for the hyperparameter search

In [29]:
param_dist = {
               'max_depth': list(np.linspace(10, 1200, 10, dtype = int)),
               'n_estimators': list(np.linspace(100, 1200, 10, dtype = int))}

In [30]:
random_grid_search = RandomizedSearchCV(estimator = rf, param_distributions = param_dist, 
                          cv = 3, n_jobs = -1, verbose = 2,n_iter = 10)
random_grid_search.fit(x_train, y_train);

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:   22.4s finished
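
To see what the randomized search draws from, the sketch below rebuilds the two value lists in plain Python and samples 10 of the 100 possible combinations. It is an illustration only: the seed is arbitrary, and `random.sample` stands in for the library's own sampler, which also samples without replacement when every hyperparameter is given as a list.

```python
import random
from itertools import product

# Same values as np.linspace(start, stop, 10, dtype=int): evenly spaced, truncated to int.
max_depth = [int(10 + i * 1190 / 9) for i in range(10)]
n_estimators = [int(100 + i * 1100 / 9) for i in range(10)]

all_combinations = list(product(max_depth, n_estimators))

# n_iter = 10: draw 10 distinct combinations out of the 100 possible ones.
random.seed(0)  # seed chosen only to make this sketch repeatable
sampled = random.sample(all_combinations, 10)

print(len(all_combinations), "possible combinations")
for depth, trees in sampled:
    print("max_depth =", depth, " n_estimators =", trees)
```

Only a tenth of the combinations are evaluated, which explains why the run takes about as long as the grid search above despite covering a much wider range of values.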


In [31]:
best_grid_randomsearch= random_grid_search.best_estimator_
best_grid_randomsearch


RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=142,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=222,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [32]:
y_pred_randomsearch = best_grid_randomsearch.predict(x_test)
mae_randomsearch = mean_absolute_error(y_test, y_pred_randomsearch)
r2_randomsearch=r2_score(y_test, y_pred_randomsearch)
print("The model performance for testing set from random search")
print("--------------------------------------")
print('mean absolute error is {}'.format(mae_randomsearch))
print('R2 score is {}'.format(r2_randomsearch))
print('Improvement of {:0.2f}%.'.format( 100 * (r2_randomsearch- r2) / r2))


The model performance for testing set from random search
--------------------------------------
mean absolute error is 2.2317832538420803
R2 score is 0.8558904773238419
Improvement of -0.19%.


#### Assignment: 
1. Change the hyperparameter values, check how the values in the best estimator change, and use the new best estimator to predict on the test set.
2. Use the *best_score_* attribute of the fitted search object to check the cross-validated score of the best estimator.