# Hyperparameter selection

### Classifier selected: Gradient Boosting

Description: This script selects the best-performing hyperparameters for the classifier with a randomized search. The reported result is the log loss, which measures how close the predicted probabilities are to the true labels (lower is better). Important: this is not the log loss of the final trained model, because no external validation is performed during hyperparameter selection. The final model's log loss is expected to be slightly worse than the value obtained here.

Author: Caroline Risoud

License:

Last update date: 23.10.2021
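
To make the metric in the description concrete, the following is a minimal sketch of the multiclass log loss computed by hand: the mean negative natural log of the probability assigned to the true class. The function name `multiclass_log_loss` and the toy labels and probabilities are illustrative assumptions, not part of the original notebook.

```python
import math


def multiclass_log_loss(y_true, proba):
    """Mean negative log-probability assigned to the true class.

    Hypothetical helper for illustration; y_true holds integer class
    indices and proba holds one probability row per sample.
    """
    eps = 1e-15  # clip to avoid log(0)
    total = 0.0
    for label, row in zip(y_true, proba):
        p = min(max(row[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(y_true)


# toy data (assumed): three samples, three classes
y_true = [0, 2, 1]
proba = [
    [0.8, 0.1, 0.1],  # confident and correct: small penalty
    [0.2, 0.2, 0.6],  # fairly confident and correct
    [0.3, 0.4, 0.3],  # barely correct: larger penalty
]

loss = multiclass_log_loss(y_true, proba)
print(round(loss, 4))
```

A confident correct prediction contributes almost nothing to the loss, while a confident wrong prediction is penalised heavily, which is why the metric rewards well-calibrated probabilities rather than only correct labels.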


In [8]:
import pandas as pd

from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import log_loss

import xgboost as xgb

from scipy.stats import randint

from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import matplotlib.pyplot as plt
import matplotlib

In [2]:
%store -r X_train_validate_rev
%store -r y_train_validate_rev
%store -r X_test_rev
%store -r y_test_rev

### Randomized search to find the best performing hyperparameters

In [24]:
# definition of the gradient boosting classifier
# a random state of 42 is specified for reproducibility
clf = xgb.XGBClassifier(n_jobs=-1, random_state=42, use_label_encoder=False,
                        objective='multi:softprob', eval_metric='mlogloss')


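# specify parameters and distributions to sample from
# the ranges are kept narrow to limit the running time of the hyperparameter search
# note: the upper bound of scipy's randint is exclusive, so randint(5, 6) always yields 5
# and randint(2, 3) always yields 2; only n_estimators (300 to 309) actually varies

param_dist_clf = {
    'max_depth': randint(5, 6),
    'n_estimators': randint(300, 310),
    'gamma': randint(2, 3),
    'min_child_weight': randint(2, 3),
    'max_delta_step': randint(2, 3),
}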
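
The behaviour of the sampling distributions above can be checked directly. This is a small sketch showing that `scipy.stats.randint(low, high)` excludes `high`; the wider range `randint(3, 8)` is an assumed example of how the search space could be opened up, not a setting from the original notebook.

```python
from scipy.stats import randint

# randint(5, 6) has a single possible value because the upper bound is exclusive
dist = randint(5, 6)
samples = dist.rvs(size=20, random_state=0)
unique = sorted(set(int(s) for s in samples))
print(unique)

# an assumed wider range: draws integers from 3 up to and including 7
wider = randint(3, 8)
wide_samples = wider.rvs(size=200, random_state=0)
print(int(wide_samples.min()), int(wide_samples.max()))
```

Widening a range increases the number of distinct candidates the search can draw, so the number of iterations usually needs to grow with it.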

In [25]:
# number of parameter settings sampled - we start with one iteration and increase it afterwards
n_iter_search_clf = 1

# setting up the randomized search with 5-fold cross-validation
# scoring is the negated log loss, so higher (closer to zero) is better
random_search_clf = RandomizedSearchCV(clf,
                                       param_distributions=param_dist_clf,
                                       cv=5,
                                       n_iter=n_iter_search_clf,
                                       scoring='neg_log_loss')

In [26]:
# ----------- THIS SECTION MAY TAKE A FEW MINUTES (2-4 min) TO RUN ---------------------


# performing the randomized search
random_search_clf.fit(X_train_validate_rev, y_train_validate_rev)


# keeping the best-performing estimator
# (with the default refit=True it is refit on the full training/validation data)
clf = random_search_clf.best_estimator_
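
To see which hyperparameters were chosen and what cross-validated score they reached, the fitted search object exposes `best_params_` and `best_score_`. The sketch below illustrates this on synthetic data, using scikit-learn's `GradientBoostingClassifier` as a stand-in for the XGBoost model so that it runs quickly and without the stored notebook variables; the dataset, parameter ranges, and variable names are assumptions for illustration only.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# synthetic three-class data (assumed, stands in for X_train_validate_rev)
X, y = make_classification(n_samples=200, n_features=8, n_informative=4,
                           n_classes=3, random_state=42)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),  # stand-in for xgb.XGBClassifier
    param_distributions={
        'max_depth': randint(2, 5),       # 2, 3 or 4
        'n_estimators': randint(10, 30),  # 10 to 29
    },
    n_iter=3,
    cv=3,
    scoring='neg_log_loss',
    random_state=42,  # makes the sampled candidates reproducible
)
search.fit(X, y)

best_params = search.best_params_
# best_score_ is the negated log loss, so flip the sign to read it as a loss
cv_log_loss = -search.best_score_

print(best_params)
print(cv_log_loss)
```

Because `scoring='neg_log_loss'` reports the negated loss, `best_score_` is negative; flipping its sign gives a mean cross-validated log loss that can be compared with a log loss computed on held-out data.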

In [27]:
# calculating the log loss of the best estimator on the test set

pred_probaclf = clf.predict_proba(X_test_rev)

log_loss(y_test_rev, pred_probaclf)

0.15345155836824453