# Hyperparameter selection

### Classifier selected: Gradient Boosting

Description: The script selects the best-performing hyperparameters with a randomized search. The final result is the log loss, which measures how close the predicted probabilities are to the corresponding true labels. Important: this is not the log loss of the final trained model, because no external validation is done during hyperparameter selection. The final model is expected to score slightly worse than the log loss obtained here.

Author: Caroline Risoud

License:

Last update date: 23.10.2021
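
To make the log loss mentioned in the description concrete, the following minimal sketch computes it by hand and compares the result with scikit-learn's `log_loss`. The labels and probabilities are hypothetical toy values, not taken from this notebook's data.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical toy data: 3 samples, 3 classes (not the notebook's dataset).
y_true = np.array([0, 2, 1])
proba = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
])

# Multiclass log loss: mean of -log(probability assigned to the true class).
manual = -np.mean(np.log(proba[np.arange(len(y_true)), y_true]))

# scikit-learn's implementation should agree on this well-formed input.
reference = log_loss(y_true, proba, labels=[0, 1, 2])

print(manual, reference)
```

A lower value means the predicted probabilities sit closer to the true labels; a perfect prediction would give a log loss of 0.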


In [None]:
from time import time

import matplotlib
import matplotlib.pyplot as plt
import pandas as pd
import xgboost as xgb
from scipy.stats import randint  # integer distributions sampled by the randomized search
from sklearn.metrics import accuracy_score, log_loss, precision_recall_fscore_support
from sklearn.model_selection import RandomizedSearchCV

In [None]:
%store -r X_train_validate_rev, y_train_validate_rev
%store -r X_test_rev, y_test_rev

### Randomized search to find the best performing hyperparameters

In [None]:
# Define the gradient boosting classifier
clf = xgb.XGBClassifier(n_jobs=-1, random_state=42, use_label_encoder=False,
                        objective='multi:softprob', eval_metric='mlogloss')

# Specify the parameters and the distributions to sample from.
# scipy.stats.randint(low, high) samples integers in [low, high),
# so the upper bound is exclusive.
param_dist_clf = {
    'max_depth': randint(2, 16),
    'n_estimators': randint(100, 1001),
    'gamma': randint(1, 6),
    'min_child_weight': randint(1, 6),
    'max_delta_step': randint(1, 6),
}
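
To see what a randomized search draws from distributions like these, the sketch below uses scikit-learn's `ParameterSampler` on a reduced, illustrative version of the dictionary. The variable names `param_dist` and `samples` are chosen for this example only, and no model is trained.

```python
from scipy.stats import randint
from sklearn.model_selection import ParameterSampler

# Reduced, illustrative copy of the search space (two parameters only).
param_dist = {
    'max_depth': randint(2, 16),         # integers 2..15 (upper bound exclusive)
    'n_estimators': randint(100, 1001),  # integers 100..1000
}

# Draw 5 candidate settings; random_state makes the draw repeatable.
samples = list(ParameterSampler(param_dist, n_iter=5, random_state=42))

for candidate in samples:
    print(candidate)
```

Each printed dictionary is one candidate of the kind the randomized search would evaluate with cross-validation.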

In [None]:
# Number of parameter settings sampled by the search
n_iter_search_clf = 20

# Set up the randomized search: 5-fold cross-validation, scored by
# negated log loss (scikit-learn maximizes scores, so higher is better)
random_search_clf = RandomizedSearchCV(clf,
                                       param_distributions=param_dist_clf,
                                       cv=5,
                                       n_iter=n_iter_search_clf,
                                       scoring='neg_log_loss')

In [None]:
start = time()

# performing the randomized search
random_search_clf.fit(X_train_validate_rev, y_train_validate_rev)

print("RandomizedSearchCV took %.2f seconds for %d candidate"
      " parameter settings." % ((time() - start), n_iter_search_clf))

# Print the best-performing parameters and keep the estimator refit with them
print(random_search_clf.best_params_)
clf = random_search_clf.best_estimator_

In [None]:
# Calculate the log loss of the best estimator on the test set

pred_probaclf = clf.predict_proba(X_test_rev)

log_loss(y_test_rev, pred_probaclf)
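
Because the search above uses `scoring = 'neg_log_loss'`, the score scikit-learn reports is the negated log loss. The sketch below shows how the cross-validated score can be converted and compared with a held-out log loss. It uses the iris dataset and a small random forest purely as fast stand-ins; neither is this notebook's data or classifier.

```python
from scipy.stats import randint
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Stand-in data and model (assumptions for illustration only).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=42),
    param_distributions={'max_depth': randint(2, 6)},
    n_iter=3, cv=3, scoring='neg_log_loss', random_state=42)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated *negated* log loss; flip the sign.
cv_log_loss = -search.best_score_

# Log loss of the refit best estimator on data the search never saw.
test_log_loss = log_loss(y_test, search.predict_proba(X_test))

print(cv_log_loss, test_log_loss)
```

Comparing the two numbers gives a rough sense of how optimistic the score from the search is relative to unseen data.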