### Imports
NumPy - Load the data.

XGBoost - The gradient-boosting package used for the models.

train_test_split - Split the data into a _training and a testing_ set.

RandomizedSearchCV / GridSearchCV - Figure out the best _hyperparameters_.

f1_score / recall_score / accuracy_score - Metrics used to score the various models.

In [10]:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import f1_score, recall_score, accuracy_score
import sklearn

### Load in data 
The data is loaded with NumPy (the first row of the CSV is skipped) and divided into `X` and `y`. `X` holds the features (every column except the last) and `y` holds the targets (the last column).

In [3]:
data = np.loadtxt("ml_models/weekly_delta_binary.csv", delimiter=",", skiprows=1)

X = data[:,:-1]
y = data[:,-1]
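
As a small illustration of the slicing above, the sketch below loads a made-up three-row CSV from memory instead of `weekly_delta_binary.csv`. The column names and values are invented for the example; only the `np.loadtxt` call and the `[:, :-1]` / `[:, -1]` slices mirror the cell above.

```python
import io

import numpy as np

# Hypothetical CSV content: one header row, two feature columns, one target column.
csv_text = "f1,f2,target\n0.1,0.2,1\n0.3,0.4,0\n0.5,0.6,1\n"

# skiprows=1 drops the header row, as in the cell above.
data_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

X_demo = data_demo[:, :-1]  # every column except the last
y_demo = data_demo[:, -1]   # the last column only

print(X_demo.shape)  # (rows, number of feature columns)
print(y_demo)        # one target value per row
```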

#### Split data
Split the data into an 80/20 train-test split, without shuffling, so the original row order is preserved.

In [4]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
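
To make the effect of `shuffle=False` concrete, here is a minimal sketch on a toy array of ten ordered rows (not the real data). With shuffling disabled, the first 80% of the rows become the training set and the last 20% become the test set.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data: 10 ordered rows with a single feature each.
X_demo = np.arange(10).reshape(-1, 1)
y_demo = np.arange(10) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, shuffle=False
)

print(X_tr.ravel())  # the first 8 rows, in their original order
print(X_te.ravel())  # the last 2 rows
```

If the rows are ordered in time (an assumption suggested by the file name, not confirmed here), this means the model is always tested on data that comes after everything it was trained on.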

### Base model
This is the most basic model, with every hyperparameter left at its default value. It is unlikely to be optimal, since nothing has been tuned yet, and it is most likely overfit.

In [5]:
# xgb_clf = xgb.XGBClassifier(objective="multi:softmax")
xgb_clf = xgb.XGBClassifier()


xgb_model = xgb_clf.fit(X_train, y_train)
y_train_preds = xgb_model.predict(X_train)
y_test_preds = xgb_model.predict(X_test)

print("Test Recall: ", recall_score(y_test, y_test_preds))
print("Test F1 Average: ", f1_score(y_test, y_test_preds))
print("Test Accuracy: ", accuracy_score(y_test, y_test_preds))

Test Recall:  0.77
Test F1 Average:  0.6968325791855204
Test Accuracy:  0.6104651162790697
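
The cell above computes `y_train_preds` but only reports test scores. One rough way to check the overfitting suspicion is to compare the training score with the test score: a large gap suggests the model has memorised the training data. The helper below, `accuracy_gap`, is a hypothetical name introduced for this sketch, and the labels passed to it are toy values, not results from this notebook.

```python
from sklearn.metrics import accuracy_score


def accuracy_gap(y_train, y_train_preds, y_test, y_test_preds):
    """Return (train accuracy, test accuracy, train minus test)."""
    train_acc = accuracy_score(y_train, y_train_preds)
    test_acc = accuracy_score(y_test, y_test_preds)
    return train_acc, test_acc, train_acc - test_acc


# Toy labels: perfect on the "training" data, 2 of 4 correct on the "test" data.
demo = accuracy_gap([1, 0, 1, 1], [1, 0, 1, 1], [1, 0, 1, 0], [1, 1, 1, 1])
print(demo)
```

In the notebook, the same helper could be called as `accuracy_gap(y_train, y_train_preds, y_test, y_test_preds)`.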


### Most important feature
Each feature is given a weight that reflects how important it is for the predictions. This cell prints the index of the most important feature, along with its weight.

In [6]:
for index, feature in enumerate(xgb_model.feature_importances_):
    if feature == xgb_model.feature_importances_.max():
        print(index, feature)

70 0.031139895
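
A single index says little about how the importance is spread across the other features. As a possible extension, the sketch below ranks the top `k` features with `np.argsort`. The function name `top_features` and the importance values are made up for the example; in the notebook the input would be `xgb_model.feature_importances_`.

```python
import numpy as np


def top_features(importances, k=5):
    """Return the k largest importances as (index, weight) pairs, largest first."""
    importances = np.asarray(importances)
    order = np.argsort(importances)[::-1][:k]  # indices sorted by descending weight
    return [(int(i), float(importances[i])) for i in order]


# Made-up importance values for five features.
demo_importances = np.array([0.10, 0.40, 0.05, 0.30, 0.15])

print(top_features(demo_importances, k=3))
print(int(np.argmax(demo_importances)))  # index of the single most important feature
```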


### Get the best parameters
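`GridSearchCV` tries every combination in the grid, so it finds the actual best parameters within that grid, but it can take days and is best run on Google Cloud. `RandomizedSearchCV` only evaluates a random sample of the combinations, which keeps the computational cost low at the price of possibly missing the best combination.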

In [13]:
%%time
%%capture

xgb_clf = xgb.XGBClassifier()

parameters = {"learning_rate": [0.1, 0.01, 0.001],
               "gamma" : [0.01, 0.1, 0.3, 0.5, 1, 1.5, 2],
               "max_depth": [2, 4, 7, 10],
               "colsample_bytree": [0.3, 0.6, 0.8, 1.0],
               "subsample": [0.2, 0.4, 0.5, 0.6, 0.7],
               "reg_alpha": [0, 0.5, 1],
               "reg_lambda": [1, 1.5, 2, 3, 4.5],
               "min_child_weight": [1, 3, 5, 7],
               "n_estimators": [100, 250, 500, 1000]}

xgb_rscv = RandomizedSearchCV(xgb_clf, param_distributions=parameters, scoring="accuracy",
                             cv=10, verbose=3)

model_xgboost = xgb_rscv.fit(X_train, y_train)

CPU times: user 4min 45s, sys: 2.46 s, total: 4min 47s
Wall time: 24.9 s
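
To get a feel for why the exhaustive search is so much more expensive, the number of model fits can be counted directly from the grid. The sketch below repeats the `parameters` dictionary from the cell above and assumes `RandomizedSearchCV` is left at its default of `n_iter=10`, which is the case there since `n_iter` is not passed.

```python
from math import prod

# Same grid as in the search cell above.
parameters = {
    "learning_rate": [0.1, 0.01, 0.001],
    "gamma": [0.01, 0.1, 0.3, 0.5, 1, 1.5, 2],
    "max_depth": [2, 4, 7, 10],
    "colsample_bytree": [0.3, 0.6, 0.8, 1.0],
    "subsample": [0.2, 0.4, 0.5, 0.6, 0.7],
    "reg_alpha": [0, 0.5, 1],
    "reg_lambda": [1, 1.5, 2, 3, 4.5],
    "min_child_weight": [1, 3, 5, 7],
    "n_estimators": [100, 250, 500, 1000],
}

cv_folds = 10  # cv=10 in the search cell
n_iter = 10    # RandomizedSearchCV default when n_iter is not given

# Every combination of one value per hyperparameter.
n_combinations = prod(len(values) for values in parameters.values())

grid_fits = n_combinations * cv_folds  # what GridSearchCV would train
random_fits = n_iter * cv_folds        # what RandomizedSearchCV trains

# Each search also does one final refit on the full training set (not counted here).
print(n_combinations, grid_fits, random_fits)
```

With this grid the random search covers only 10 of the 403,200 possible combinations, so its result depends heavily on which ones happen to be sampled.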


#### Best parameters
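Refit the model with the best parameters found by `RandomizedSearchCV` (or `GridSearchCV`) and test it, then check whether the metric scores have improved. In the run shown below, recall rises slightly (0.77 to 0.79), but F1 (about 0.70 to 0.66) and accuracy (about 0.61 to 0.52) both drop compared with the base model.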

In [14]:
params = model_xgboost.best_estimator_.get_params()

xgb_clf = xgb.XGBClassifier(**params)

xgb_model = xgb_clf.fit(X_train, y_train)
y_train_preds = xgb_model.predict(X_train)
y_test_preds = xgb_model.predict(X_test)

print("Test Recall: ", recall_score(y_test, y_test_preds))
print("Test F1 Average: ", f1_score(y_test, y_test_preds))
print("Test Accuracy: ", accuracy_score(y_test, y_test_preds))

Test Recall:  0.79
Test F1 Average:  0.6556016597510373
Test Accuracy:  0.5174418604651163
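
`best_estimator_.get_params()` returns every hyperparameter of the fitted estimator, defaults included. The search object also exposes `best_params_` (only the values that were searched) and `best_score_` (the mean cross-validated score of the winning candidate), which can be compared against the held-out test score. The sketch below is only an illustration: it uses synthetic data and a small decision tree as stand-ins for the real data and XGBoost so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; not the notebook's data set.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# A deliberately tiny search space for the stand-in estimator.
search_space = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_distributions=search_space,
    n_iter=4,
    scoring="accuracy",
    cv=5,
    random_state=0,  # makes the sampled combinations repeatable
)
search.fit(X_demo, y_demo)

print(search.best_params_)  # only the searched hyperparameters
print(search.best_score_)   # mean cross-validated accuracy of the best candidate
print(len(search.best_estimator_.get_params()))  # all hyperparameters, defaults included
```

In the notebook itself the equivalent attributes would be `model_xgboost.best_params_` and `model_xgboost.best_score_`.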
