### 1 Imports
NumPy - Load the data.

XGBoost - ML package used.

train_test_split - Split the data into a _training_ and a _testing_ set.

RandomizedSearchCV / GridSearchCV - Figure out the best _hyperparameters_.

f1_score / recall_score / accuracy_score - Metrics used to score the various models.

In [1]:
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn.metrics import f1_score, recall_score, accuracy_score
import sklearn

### 2 Load in data 
The data is loaded with NumPy and split into features and targets: `X` holds the features (every column after the first) and `y` holds the targets (the first column).

In [2]:
data = np.loadtxt("data/processed/weekly-delta-binary-curated.csv", delimiter=",")

X = data[:,1:]
y = data[:,0]
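
To make the column layout concrete, here is a minimal sketch that runs the same slicing on a tiny in-memory CSV. The two rows of values are made up for illustration and are not taken from the real file.

```python
import io

import numpy as np

# Hypothetical two-row CSV: first column is the target, the rest are features.
csv_text = "1,0.5,-0.2\n0,-0.1,0.3\n"
data_demo = np.loadtxt(io.StringIO(csv_text), delimiter=",")

X_demo = data_demo[:, 1:]  # every column after the first
y_demo = data_demo[:, 0]   # the first column

print(X_demo.shape)  # (2, 2)
print(y_demo)        # [1. 0.]
```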

#### 2.1 Split data
Split the data into an 80/20 train-test split without shuffling, so the rows keep their original order and the test set is the final 20% of the rows.

In [3]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)
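
The effect of `shuffle=False` can be checked on a toy array whose values are simply their row positions. This is an illustrative sketch with made-up data, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-in for the weekly data: 10 rows whose values are their row index.
X_toy = np.arange(10).reshape(-1, 1)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, shuffle=False)

# Without shuffling, the first 80% of rows train and the last 20% test.
print(y_tr)  # [0 1 2 3 4 5 6 7]
print(y_te)  # [8 9]
```

For time-ordered data this keeps the model from being trained on rows that come after the ones it is tested on.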

### 3 Base model
The most basic model, with no hyperparameter tuning apart from a fixed `learning_rate`. It is unlikely to be optimal, since the hyperparameters haven't been optimized, and it is most likely overfit: the training accuracy is well above the test accuracy.

In [10]:
xgb_clf = xgb.XGBClassifier(learning_rate=0.001)


xgb_model = xgb_clf.fit(X_train, y_train)
y_train_preds = xgb_model.predict(X_train)
y_test_preds = xgb_model.predict(X_test)

print("Test Recall: ", recall_score(y_test, y_test_preds))
print("Test F1 Average: ", f1_score(y_test, y_test_preds))
print("Test Accuracy: ", accuracy_score(y_test, y_test_preds))
print("Training Accuracy: ", accuracy_score(y_train, y_train_preds))

Test Recall:  0.62
Test F1 Average:  0.5767441860465116
Test Accuracy:  0.4678362573099415
Training Accuracy:  0.7090643274853801


### 4 Annual return model
This computes the annualized return of the model's predictions and of a buy-and-hold strategy over the test period. It is used to compare algorithms.

In [11]:
%%capture

import yfinance as yf

ticker_df = yf.download("^DJI", period="max", interval="1wk")
ticker_df = ticker_df.drop(["Open", "High", "Low", "Adj Close", "Volume"], axis=1)
ticker_df["Pct_Change"] = ticker_df["Close"].pct_change()
ticker_df.loc[:, "Pct_Change"] += 1
ticker_df = ticker_df[-len(y_test_preds)-4:-4]
ticker_df = ticker_df.reset_index()

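# Both strategies start from a notional 100.
ml = 100
bh = 100

i = 0
j = 0  # number of weeks in which the index did not fall

while i < len(y_test_preds):
    # Predicted up week: apply the weekly change; otherwise apply its inverse.
    if y_test_preds[i] == 1:
        ml *= ticker_df["Pct_Change"][i]
    else:
        ml /= ticker_df["Pct_Change"][i]

    if ticker_df["Pct_Change"][i] >= 1:
        j += 1

    # Buy-and-hold always applies the weekly change.
    bh *= ticker_df["Pct_Change"][i]

    i += 1

# Total return over the test period, as a fraction of the starting value.
ml = (ml - 100) / 100
bh = (bh - 100) / 100

# Annualize, assuming 52 weekly observations per year.
ml = (1+ml) ** (1 / (len(y_test_preds)/52)) - 1
bh = (1+bh) ** (1 / (len(y_test_preds)/52)) - 1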
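
The compounding loop above can also be written without an explicit index. The following is a rough sketch of an equivalent vectorized form; the function name `compound_growth` and the three-week example values are hypothetical.

```python
import numpy as np


def compound_growth(preds, growth_factors, start=100.0):
    """Return (model_end_value, buy_and_hold_end_value).

    growth_factors are weekly values of 1 + pct_change. A prediction of 1
    applies the factor; any other prediction applies its inverse.
    """
    preds = np.asarray(preds)
    g = np.asarray(growth_factors, dtype=float)
    model_factors = np.where(preds == 1, g, 1.0 / g)
    return start * model_factors.prod(), start * g.prod()


# Hypothetical three-week example: up 10%, down 20%, up 5%.
ml_end, bh_end = compound_growth([1, 0, 1], [1.10, 0.80, 1.05])
print(round(ml_end, 3), round(bh_end, 3))  # 144.375 92.4
```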

In [6]:
print("Algorithm Annual Return:", (ml * 100).round(1), "%")
print("Buy-and-hold Annual Return:", (bh * 100).round(1), "%")
print("Buy-and-hold Accuracy:", j / len(y_test_preds))

Algorithm Annual Return: 16.1 %
Buy-and-hold Annual Return: 7.1 %
Buy-and-hold Accuracy: 0.5847953216374269
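
The annualization step follows the usual compound-growth formula, `(1 + total_return) ** (1 / years) - 1`, with `years` taken as the number of weekly observations divided by 52. Below is a small sketch of that formula as a standalone function; the name `annualize` is hypothetical.

```python
def annualize(total_return: float, n_weeks: int, weeks_per_year: int = 52) -> float:
    """Convert a cumulative return over n_weeks into a compound annual rate."""
    years = n_weeks / weeks_per_year
    return (1 + total_return) ** (1 / years) - 1


# Doubling the starting value over 104 weeks (2 years) is about 41.4% per year.
print(round(annualize(1.0, 104), 4))  # 0.4142
```

Over exactly one year (52 weeks) the annualized return equals the total return, which is a quick sanity check on the formula.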


### 5 Feature importance
Each feature is given a weight reflecting how important it is for the predictions. This outputs the index of the most important feature, along with its weight.

In [7]:
for index, feature in enumerate(xgb_model.feature_importances_):
    if feature == xgb_model.feature_importances_.max():
        print(index, feature)

1 0.092999764
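
The same lookup can be done with `np.argmax`, and `np.argsort` extends it to a ranking of several features. This is a sketch on a made-up importance array, not on the fitted model's values.

```python
import numpy as np

# Hypothetical importance weights for four features.
importances = np.array([0.05, 0.40, 0.25, 0.30])

top_index = int(np.argmax(importances))
print(top_index, importances[top_index])  # 1 0.4

# Indices of the three most important features, largest weight first.
top_three = np.argsort(importances)[::-1][:3]
print(top_three)  # [1 3 2]
```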


### 6 Get the best parameters
Use `GridSearchCV` with Google Cloud to get the actual best parameters from the grid, since it tries every combination and can take days. `RandomizedSearchCV` only evaluates a random sample of combinations, so it needs far less computational power.

In [8]:
%%capture

xgb_clf = xgb.XGBClassifier()

parameters = {"learning_rate": [0.1, 0.01, 0.001],
               "gamma" : [0.01, 0.1, 0.3, 0.5, 1, 1.5, 2],
               "max_depth": [2, 4, 7, 10],
               "colsample_bytree": [0.3, 0.6, 0.8, 1.0],
               "subsample": [0.2, 0.4, 0.5, 0.6, 0.7],
               "reg_alpha": [0, 0.5, 1],
               "reg_lambda": [1, 1.5, 2, 3, 4.5],
               "min_child_weight": [1, 3, 5, 7],
               "n_estimators": [100, 250, 500, 1000]}

xgb_rscv = RandomizedSearchCV(xgb_clf, param_distributions=parameters, scoring="accuracy",
                             cv=10, verbose=3)

model_xgboost = xgb_rscv.fit(X_train, y_train)
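
To see why the exhaustive search is so much more expensive, the size of the grid can be counted with scikit-learn's `ParameterGrid`. The sketch below reuses the parameter dictionary from the cell above and assumes `RandomizedSearchCV` is left at its default of `n_iter=10`, as it is in that cell.

```python
from sklearn.model_selection import ParameterGrid

parameters = {"learning_rate": [0.1, 0.01, 0.001],
              "gamma": [0.01, 0.1, 0.3, 0.5, 1, 1.5, 2],
              "max_depth": [2, 4, 7, 10],
              "colsample_bytree": [0.3, 0.6, 0.8, 1.0],
              "subsample": [0.2, 0.4, 0.5, 0.6, 0.7],
              "reg_alpha": [0, 0.5, 1],
              "reg_lambda": [1, 1.5, 2, 3, 4.5],
              "min_child_weight": [1, 3, 5, 7],
              "n_estimators": [100, 250, 500, 1000]}

cv_folds = 10
n_combinations = len(ParameterGrid(parameters))

grid_fits = n_combinations * cv_folds  # every combination, once per fold
random_fits = 10 * cv_folds            # default n_iter=10 sampled combinations

print(n_combinations)  # 403200
print(grid_fits)       # 4032000
print(random_fits)     # 100
```

With only 10 of the 403,200 combinations sampled, the randomized search covers a very small part of the grid; passing a larger `n_iter` is the usual way to trade more compute for better coverage.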

#### 6.1 Best parameters
Insert the best parameters found by `RandomizedSearchCV` (or `GridSearchCV`) into a new model and test it to check whether the metric scores have improved.

In [9]:
params = model_xgboost.best_estimator_.get_params()

xgb_clf = xgb.XGBClassifier(**params)

xgb_model = xgb_clf.fit(X_train, y_train)
y_train_preds = xgb_model.predict(X_train)
y_test_preds = xgb_model.predict(X_test)

print("Test Recall: ", recall_score(y_test, y_test_preds))
print("Test F1 Average: ", f1_score(y_test, y_test_preds))
print("Test Accuracy: ", accuracy_score(y_test, y_test_preds))

Test Recall:  1.0
Test F1 Average:  0.7380073800738007
Test Accuracy:  0.5847953216374269
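
A recall of exactly 1.0 is worth a second look, and the test accuracy matches the buy-and-hold accuracy from section 4. The sketch below assumes a test set of 171 weeks containing 100 up weeks. These counts are inferred from the printed scores, not read from the data. It shows that predicting 1 for every week gives the same three figures up to floating-point rounding.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Assumed class counts, inferred from the reported scores (not read from the data).
n_up, n_down = 100, 71

y_true = [1] * n_up + [0] * n_down
y_const = [1] * (n_up + n_down)  # predict an up week every time

const_recall = recall_score(y_true, y_const)
const_f1 = f1_score(y_true, y_const)
const_accuracy = accuracy_score(y_true, y_const)

print("Recall:", const_recall)
print("F1:", const_f1)
print("Accuracy:", const_accuracy)
```

If the real predictions behave the same way, the tuned model is no better than always predicting an up week. Inspecting `np.unique(y_test_preds)` on the notebook's own predictions would confirm or rule this out.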
