# Hyperparameter search

For this exercise, you will use cross-validation with a grid-search to optimise the hyperparameters for probabilistic classification of mode choice using Random Forests and Gradient Boosting Decision Trees. 

You will first optimise the model parameters using only the `train_validate` data, and evaluate the model performance on the `test` data.

**Note**: remember, grid-searches are very intensive and can take a long time to run. Always check your code is working by obtaining results for a simple reduced grid first, before running your code on the full grid.

Tasks:

1. Investigate the documentation for `GridSearchCV` in *scikit-learn*


2. Use `GridSearchCV` to optimise the hyperparameters for the `RandomForestClassifier` on the London Passenger Mode Choice dataset
    * Use 5-fold cross-validation.
    * Evaluate the model performance using the negative log likelihood during the grid search.
    * Search for values of *maximum tree depth* over an appropriate range (e.g. between 2 and 15).
    * Search for ensemble sizes over an appropriate range (e.g. between 100 and 1000 trees). 
    * Note that the search may be very slow, so you will have to set up your search with an appropriate grid size.
    * Save the trained object as a pickle file to avoid having to rerun the grid search multiple times.

    
3. Investigate the `cv_results_` attribute of the fitted `GridSearchCV` object
    * Comment on the returned values: 
        * What is used to determine the `best_params_`/`rank_test_score` in the search?
        * What other returned values are important?
        * How could these results be visualised?
    * Select the parameters from the grid-search that you believe to be optimal, and *if necessary* refit the model to all of the `train_validate` data using these hyperparameters


4. Test the fitted model on the `test` data.
    * How does the log-likelihood score differ for the test data compared to the cross-validation score?
    * What are the possible reasons for any discrepancy? (We will discuss this in detail next week!)


### Extension exercise

For the extension exercise, you will optimise the `xgboost` classifier using the `early_stopping_rounds` parameter.


* In order to perform cross-validation with early stopping for the `xgboost` classifier, you will need to use the `xgboost.cv` function from the xgboost Learning API, documented [here](https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.training).
    * Again, use 5-fold CV for the search. 
    * Set `metrics='mlogloss'` to evaluate the model on the log-likelihood loss. 
    * Note that `xgboost.cv` is **not** compatible with `GridSearchCV` in *scikit-learn*. As such, you will need to set up the search manually, as we did in class.


* Set `num_boost_round=1000` and `early_stopping_rounds=10`, so that the number of boosting rounds is capped at 1000, but training stops earlier if the model performance does not improve for 10 rounds.


* Fix the learning rate (called `eta` in learning API) to be 0.1, and investigate appropriate values of `max_depth`.


* Test the optimised model on the `test` data, and comment on the results. 


* Remember to use Google/Stack Overflow for help! This has been attempted many times before. 

# GridSearchCV and RF

We first use the `GridSearchCV` to optimise the parameters for the `RandomForestClassifier`. 

We will first import necessary libraries...

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import log_loss
from sklearn.ensemble import RandomForestClassifier

import time
import xgboost as xgb

import matplotlib.pyplot as plt
import matplotlib

...and do the usual data preprocessing

In [2]:
df_train_validate = pd.read_csv('data/train_validate.csv', index_col='trip_id')

target = ['travel_mode']
id_context = ['trip_id', 
              'household_id', 
              'person_n', 
              'trip_n',
              'survey_year',
              'travel_year'
             ]
features = [c for c in df_train_validate.columns 
            if c not in (target + id_context)]

y_train_validate = df_train_validate[target].values.ravel()
X_train_validate = df_train_validate[features]

We can create an instance of the Random Forest Classifier, specifying all of the parameters that won't be modified during the search. 

In [3]:
rf = RandomForestClassifier(criterion='entropy', 
                            random_state=42,
                            n_jobs=-1)


Next, we can specify the search space for the Grid search.

In [4]:
parameters = {'n_estimators': [50, 100, 200, 500, 1000],
              'max_depth': list(range(2,16,2))}
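
Before launching the search, it helps to check how many model fits the grid implies. The following is a minimal sketch using only the standard library; `count_fits` and `reduced_parameters` are hypothetical names introduced here, and the reduced grid values are purely illustrative choices for a quick test run.

```python
from math import prod

# Full grid from the cell above.
parameters = {'n_estimators': [50, 100, 200, 500, 1000],
              'max_depth': list(range(2, 16, 2))}

# Hypothetical reduced grid for a quick check that the code runs
# (values are illustrative, not recommendations).
reduced_parameters = {'n_estimators': [10, 20],
                      'max_depth': [2, 4]}


def count_fits(grid, n_folds=5):
    """Return (number of candidates, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * n_folds


full_candidates, full_fits = count_fits(parameters)
small_candidates, small_fits = count_fits(reduced_parameters)

print(f'Full grid: {full_candidates} candidates, {full_fits} fits')
print(f'Reduced grid: {small_candidates} candidates, {small_fits} fits')
```

The counts cover the cross-validation fits only; the final refit on all of the data adds one more.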


The grid search is a *class*, which is used like any other model in *scikit-learn*. First we create an instance of the class.

In [5]:
clf = GridSearchCV(rf, parameters, scoring='neg_log_loss', cv=5, verbose=1)


Then we can fit the `GridSearchCV` instance we have created. This may take a considerable amount of time. Remember, it is relatively straightforward to parallelise a grid search over multiple computers, and combine the results afterwards.

In [6]:
clf.fit(X_train_validate, y_train_validate)


Fitting 5 folds for each of 35 candidates, totalling 175 fits


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 175 out of 175 | elapsed: 18.5min finished


GridSearchCV(cv=5,
             estimator=RandomForestClassifier(criterion='entropy', n_jobs=-1,
                                              random_state=42),
             param_grid={'max_depth': [2, 4, 6, 8, 10, 12, 14],
                         'n_estimators': [50, 100, 200, 500, 1000]},
             scoring='neg_log_loss', verbose=1)
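
Since the search is slow, the exercise asks for the fitted object to be saved as a pickle file. The sketch below shows one way this might be done. To stay quick and self-contained it fits a tiny search on synthetic data; `X_demo`, `y_demo`, `demo_search` and the file name `rf_grid_search.pkl` are all assumptions made for illustration. In the notebook you would pickle `clf` instead.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in for the mode choice data (illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=42)

demo_search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=42),
    {'max_depth': [2, 4]},
    scoring='neg_log_loss', cv=3)
demo_search.fit(X_demo, y_demo)

# Hypothetical file name; a temporary directory keeps the example tidy.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'rf_grid_search.pkl')

    # Save the fitted search object...
    with open(path, 'wb') as f:
        pickle.dump(demo_search, f)

    # ...and load it back, e.g. in a later session.
    with open(path, 'rb') as f:
        restored_search = pickle.load(f)

print(restored_search.best_params_)
```

Pickle files should only be loaded from sources you trust, and they are generally only reliable when loaded with the same library versions that created them.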

The "best score" and associated parameters are stored as attributes of the `GridSearchCV` object. Note that the underscore at the end of the attribute name indicates that this attribute is only available once the object has been fitted.

In [7]:
print(f'Best score: {clf.best_score_:.3f} with parameters {clf.best_params_}')


Best score: -0.662 with parameters {'max_depth': 14, 'n_estimators': 1000}
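
The best score is negative because `scoring='neg_log_loss'` reports the log loss with its sign flipped, so that higher scores are always better. A score of -0.662 therefore corresponds to a log loss of 0.662. The sketch below works through the calculation by hand with NumPy; the function name `multiclass_log_loss` and the small arrays are assumptions for illustration, not data from the exercise.

```python
import numpy as np


def multiclass_log_loss(y_true, proba):
    """Mean negative log probability assigned to the true class."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    # Pick out, for each row, the probability predicted for the true class.
    true_class_proba = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(true_class_proba))


# Illustrative labels (coded 0, 1, 2) and predicted probabilities.
y_demo = [0, 1, 2, 1]
proba_demo = [[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.3, 0.5],
              [0.3, 0.4, 0.3]]

loss = multiclass_log_loss(y_demo, proba_demo)
neg_log_loss_score = -loss  # sign convention used by scoring='neg_log_loss'

print(f'log loss: {loss:.3f}, neg_log_loss score: {neg_log_loss_score:.3f}')
```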


We can investigate the `cv_results_` attribute to see more detail of what is stored during the search. 

In [8]:
clf.cv_results_


{'mean_fit_time': array([ 0.77885876,  0.67633042,  1.15194407,  2.54367452,  4.98540425,
         0.57866468,  0.979353  ,  1.75176353,  4.20671277,  9.98450956,
         0.89696541,  1.44573307,  2.91394386,  6.40828648, 12.45935192,
         0.95720854,  1.84413424,  3.15651121,  7.73939381, 15.43058615,
         1.14947162,  2.01956511,  3.85093932,  9.34747758, 20.2559443 ,
         1.42939262,  2.52801294,  4.92684979, 11.78022909, 22.22877393,
         1.3956141 ,  2.61734524,  5.35444336, 13.01463552, 25.23190522]),
 'std_fit_time': array([0.7309393 , 0.0383258 , 0.03012608, 0.04514808, 0.06288824,
        0.02430845, 0.02386224, 0.04651695, 0.07347834, 1.24083135,
        0.06021578, 0.16355033, 0.28825183, 0.43865496, 0.59596438,
        0.04509335, 0.28436911, 0.12941174, 0.27088153, 0.46945876,
        0.08765582, 0.09668774, 0.13074995, 0.08050457, 0.73157468,
        0.10453383, 0.05740265, 0.09592768, 0.64558709, 0.18608215,
        0.02610772, 0.07200383, 0.30801553, 0.

The ranking is determined based purely on the mean test score (in this case the negative log likelihood), calculated over the 5 cross-validation folds. 

As well as the mean score, the *standard deviation* (*std*) of the scores and the *fit times* and *score times* are also important. 

The standard deviation of the scores could be used to investigate whether the performance differences between different parameters are statistically significant. Note that this is not commonly applied in practice. 
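
As a rough illustration of how the standard deviations might be used, the sketch below flags every candidate whose mean score lies within one standard deviation of the best mean. This is a crude heuristic introduced here as an assumption, not a formal significance test (the fold scores are not independent), and the numbers are made up rather than taken from the search above.

```python
import numpy as np

# Made-up values for illustration only; NOT results from the search above.
mean_test_score = np.array([-0.80, -0.70, -0.67, -0.66])
std_test_score = np.array([0.02, 0.02, 0.015, 0.02])

# Scores are negated losses, so the highest value is the best.
best = int(np.argmax(mean_test_score))

# Candidates whose mean lies within one standard deviation of the best mean.
threshold = mean_test_score[best] - std_test_score[best]
comparable = np.flatnonzero(mean_test_score >= threshold)

print(f'Best candidate: {best}, comparable candidates: {comparable.tolist()}')
```

With real results, the two arrays would come from `clf.cv_results_['mean_test_score']` and `clf.cv_results_['std_test_score']`.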

We can see there is a large variation in the score and fit times depending on the model parameters, with small ensembles having much quicker score and fit times than larger ensembles. Knowing the fit and score times allows the modeller to make a more informed decision on appropriate model parameters, penalising parameter combinations with very high times.

Whilst the fit times are larger than the score times, it may be desirable to prioritise low score times if the model is to be used for real-time prediction on large quantities of data.
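
The raw `cv_results_` dictionary is hard to read, so one option is to load it into a pandas DataFrame and pivot the mean scores into a depth-by-ensemble-size table, which can then be drawn as a heatmap or as one line per depth. The sketch below assumes a tiny search on synthetic data (`X_demo`, `y_demo`, `demo_grid` and `demo_search` are illustrative names); with the notebook's objects you would pass `clf.cv_results_` instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic stand-in data and a tiny grid (illustration only).
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, n_classes=3,
                                     random_state=42)
demo_grid = {'n_estimators': [5, 10], 'max_depth': [2, 4, 6]}
demo_search = GridSearchCV(RandomForestClassifier(random_state=42), demo_grid,
                           scoring='neg_log_loss', cv=3)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination.
results = pd.DataFrame(demo_search.cv_results_)
columns = ['param_max_depth', 'param_n_estimators', 'mean_test_score',
           'std_test_score', 'mean_fit_time', 'mean_score_time',
           'rank_test_score']
summary = results[columns].sort_values('rank_test_score')

# Rows are tree depths, columns are ensemble sizes, cells are mean scores.
score_table = results.pivot(index='param_max_depth',
                            columns='param_n_estimators',
                            values='mean_test_score')

print(summary)
print(score_table)
```

The same pivot with `values='mean_fit_time'` gives a matching table of fit times, which makes the trade-off between score and cost easier to see.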

For now, we will simply use the parameters with the best mean score. By default, the model is automatically retrained on all of the train/validate data with these parameters at the end of the grid search. The refitted model can be obtained using the attribute `best_estimator_`.

In [9]:
rf = clf.best_estimator_


In order to test the final model, we must import and process the holdout test data.

In [10]:
df_test = pd.read_csv('data/test.csv', index_col='trip_id')

y_test = df_test[target].travel_mode.values
X_test = df_test[features]

Finally, we can output predicted probabilities for the final model on the test data and compute the cross-entropy loss.

In [11]:
pred_proba = rf.predict_proba(X_test)


In [12]:
log_loss(y_test, pred_proba)


0.6776073815996025

Comparing the test log loss (0.678) with the cross-validation log loss (0.662, the negative of the best score above), we can see that cross-validation has overestimated the performance of the classifier. We will discuss the reasons for this in the next lecture.

This highlights the need to perform true external validation on a holdout test sample collected separately from the training data. 

# Cross validation and early stopping with XGBoost

We will now perform a similar grid search for the `xgboost` classifier. However, due to the sequential nature of boosting, we can use early stopping to determine the appropriate ensemble size automatically. We therefore do not need to include ensemble size in the grid itself.
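
To make the idea of early stopping concrete, the sketch below applies a patience rule to a list of validation losses in plain Python. It is a simplified illustration of the general technique, not the xgboost implementation; `early_stopping_round` and the loss values are assumptions introduced here.

```python
def early_stopping_round(losses, patience=10):
    """Return (best_round, rounds_run) for a sequence of validation losses.

    Training stops once the loss has failed to improve on its best value
    for `patience` consecutive rounds. Rounds are zero-indexed.
    """
    best_round = 0
    best_loss = float('inf')
    rounds_run = 0
    for i, loss in enumerate(losses):
        rounds_run = i + 1
        if loss < best_loss:
            # New best loss: remember the round at which it occurred.
            best_loss, best_round = loss, i
        elif i - best_round >= patience:
            # No improvement for `patience` rounds: stop training.
            break
    return best_round, rounds_run


# Illustrative loss curve: improves for 5 rounds, then slowly gets worse.
demo_losses = [1.0, 0.8, 0.7, 0.65, 0.64] + [0.64 + 0.01 * k for k in range(1, 30)]

best_round, rounds_run = early_stopping_round(demo_losses, patience=10)
print(f'Best round: {best_round}, stopped after {rounds_run} rounds')
```

The ensemble size to keep is the best round plus one, since rounds are counted from zero.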

In order to make use of the `cv` function from `xgboost`, we first have to convert the data to a `DMatrix` object, which is the data structure used by the `xgboost` library.

In [13]:
xgtrain = xgb.DMatrix(X_train_validate, label=y_train_validate)


We can then set up the grid search manually, as we did in class, but making use of the `cv` method to perform cross-validation with the classifier. 

In [14]:
max_depths = list(range(2, 15, 2))

res = {'scores': [], 'n_estimators': [], 'params': [], 'full_results': []}

for d in max_depths:
    
    print(f'CV for max_depth = {d}')
    
    clf = xgb.XGBClassifier(objective='multi:softprob', 
                            max_depth=d,
                            eta=0.1,
                            n_jobs=-1)
    
    xgb_param = clf.get_xgb_params()
    
    xgb_param['num_class'] = 4

    cvresult = xgb.cv(xgb_param, xgtrain, num_boost_round=1000, nfold=5,
                      metrics=['mlogloss'], early_stopping_rounds=10, 
                      verbose_eval=False, seed=42)
    
    res['scores'].append(min(cvresult['test-mlogloss-mean']))
    res['n_estimators'].append(cvresult['test-mlogloss-mean'].idxmin())
    res['params'].append(xgb_param)
    res['full_results'].append(cvresult)


CV for max_depth = 2
CV for max_depth = 4
CV for max_depth = 6
CV for max_depth = 8
CV for max_depth = 10
CV for max_depth = 12
CV for max_depth = 14


In [15]:
print(f'Best log loss found: {np.min(res["scores"]):.3f}')


Best log loss found: 0.496


In [17]:
best_run = np.argmin(res["scores"])
best_params = res['params'][best_run]
best_n = res['n_estimators'][best_run]+1

print(f'Best parameters: {best_params}')
print(f'{best_n} rounds of boosting needed')


Best parameters: {'objective': 'multi:softprob', 'base_score': None, 'booster': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'gamma': None, 'gpu_id': None, 'interaction_constraints': None, 'learning_rate': None, 'max_delta_step': None, 'max_depth': 10, 'min_child_weight': None, 'monotone_constraints': None, 'n_jobs': -1, 'num_parallel_tree': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None, 'eta': 0.1, 'num_class': 4}
298 rounds of boosting needed
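
The `+1` in the cell above is needed because the results table is indexed from zero, so the index of the lowest loss is one less than the number of boosting rounds. A small sketch with made-up loss values (the `demo_cv` table is an assumption for illustration, shaped like the DataFrame returned by `xgb.cv`):

```python
import pandas as pd

# Illustrative per-round CV losses (made-up values), indexed from 0.
demo_cv = pd.DataFrame({'test-mlogloss-mean': [0.90, 0.70, 0.60, 0.58, 0.59]})

best_index = demo_cv['test-mlogloss-mean'].idxmin()  # zero-based round index
n_rounds = best_index + 1                            # number of trees to fit

print(f'Lowest loss at index {best_index}, so use {n_rounds} boosting rounds')
```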


We can now fit the model on all of the `train_validate` data with the best `max_depth` and number of boosting rounds, and test it on the test data as we did with the previous model.

In [18]:
clf = xgb.XGBClassifier(**best_params, n_estimators=best_n,label_encoder=False)

clf.fit(X_train_validate, y_train_validate)




Parameters: { label_encoder } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, eta=0.1, gamma=0,
              gpu_id=-1, importance_type='gain', interaction_constraints='',
              label_encoder=False, learning_rate=0.100000001, max_delta_step=0,
              max_depth=10, min_child_weight=1, missing=nan,
              monotone_constraints='()', n_estimators=298, n_jobs=-1,
              num_class=4, num_parallel_tree=1, objective='multi:softprob',
              random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=None,
              subsample=1, tree_method='exact', validate_parameters=1,
              verbosity=None)

In [19]:
pred_proba = clf.predict_proba(X_test)


In [20]:
log_loss(y_test, pred_proba)


0.7496789120988846

As with the Random Forest classifier, there is a large discrepancy between the performance estimated using cross-validation (log loss of 0.496) and the true out-of-sample performance (log loss of 0.750), but here the difference is even greater.