
# Jane Street Market Prediction: XGBoost and Hyperparameter Tuning

## Using Financial Market Data to Predict Successful Trades

Financial market data drives the decisions traders make every day. As data collection and processing capabilities grow, and as electronic trading expands through free trading apps and algorithmic strategies, the data captured from the markets grows more complex each day. This makes it possible to develop machine learning models that predict, with some degree of accuracy, whether to pass on a trade or to buy an asset that will yield a positive return at a later date.

### About Our Data

This dataset contains an anonymized set of features, feature_{0...129}, representing real stock market data. Each row in the dataset represents a trading opportunity, for which I will be predicting an action value: 1 to make the trade and 0 to pass on it. Each trade has an associated weight and resp, which together represent a return on the trade. The date column is an integer which represents the day of the trade, while ts_id represents a time ordering. In addition to the anonymized feature values, metadata about the features is provided in features.csv.

In the training set, train.csv, we are provided a resp value, as well as several other resp_{1,2,3,4} values that represent returns over different time horizons. These variables are not included in the test set. Trades with weight = 0 were intentionally included in the dataset for completeness.

* train.csv - the training set, contains historical data and returns
* features.csv - metadata pertaining to the anonymized features


Data is from the Jane Street Market Prediction dataset on Kaggle: https://www.kaggle.com/c/jane-street-market-prediction

# Exploratory Data Analysis

In [1]:
# Importing the data visualization and machine learning packages used in this notebook.
%matplotlib inline
import time
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt
from scipy import stats
from google.colab import drive
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
import plotly as py
import plotly.graph_objs as go
import plotly.tools as tls
from plotly.offline import iplot, init_notebook_mode
import cufflinks as cf
import plotly.figure_factory as ff
import os
import warnings

# Suppress warnings.
warnings.filterwarnings("ignore")

In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=False)

Mounted at /content/drive


In [2]:
# Loading train.csv into a pandas DataFrame.
train = pd.read_csv("/content/drive/My Drive/Data/train.csv")

In [3]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime > "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

Mon Feb  8 05:38:08 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.39       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P0    26W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
# Viewing DataFrame column types to make sure data types are correct.
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2390491 entries, 0 to 2390490
Columns: 138 entries, date to ts_id
dtypes: float64(135), int64(3)
memory usage: 2.5 GB


In [None]:
# Calling pandas head function to take a glance at our data.
train.head()

Unnamed: 0,date,weight,resp_1,resp_2,resp_3,resp_4,resp,feature_0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,feature_11,feature_12,feature_13,feature_14,feature_15,feature_16,feature_17,feature_18,feature_19,feature_20,feature_21,feature_22,feature_23,feature_24,feature_25,feature_26,feature_27,feature_28,feature_29,feature_30,feature_31,feature_32,...,feature_91,feature_92,feature_93,feature_94,feature_95,feature_96,feature_97,feature_98,feature_99,feature_100,feature_101,feature_102,feature_103,feature_104,feature_105,feature_106,feature_107,feature_108,feature_109,feature_110,feature_111,feature_112,feature_113,feature_114,feature_115,feature_116,feature_117,feature_118,feature_119,feature_120,feature_121,feature_122,feature_123,feature_124,feature_125,feature_126,feature_127,feature_128,feature_129,ts_id
0,0,0.0,0.009916,0.014079,0.008773,0.00139,0.00627,1,-1.872746,-2.191242,-0.474163,-0.323046,0.014688,-0.002484,,,-0.989982,-1.05509,,,-2.667671,-2.001475,-1.703595,-2.196892,,,1.483295,1.307466,,,1.1752,0.967805,1.60841,1.319365,,,-0.515073,-0.448988,,,...,1.15877,,3.754522,7.137163,-1.863069,,0.434466,,-0.292035,0.317003,-2.60582,,2.896986,,1.485813,4.147254,-2.238831,,-0.892724,,-0.156332,0.622816,-3.921523,,2.561593,,3.457757,6.64958,-1.472686,,,1.168391,8.313583,1.782433,14.018213,2.653056,12.600292,2.301488,11.445807,0
1,0,16.673515,-0.002828,-0.003226,-0.007319,-0.011114,-0.009792,-1,-1.349537,-1.704709,0.068058,0.028432,0.193794,0.138212,,,-0.151877,-0.384952,,,1.225838,0.789076,1.11058,1.102281,,,-0.5906,-0.625682,,,-0.543425,-0.547486,-0.7066,-0.667806,,,0.910558,0.914465,,,...,1.157671,,1.297679,1.281956,-2.427595,,0.024913,,-0.413607,-0.073672,-2.434546,,0.949879,,0.724655,1.622137,-2.20902,,-1.332492,,-0.586619,-1.040491,-3.946097,,0.98344,,1.357907,1.612348,-1.664544,,,-1.17885,1.777472,-0.915458,2.831612,-1.41701,2.297459,-1.304614,1.898684,1
2,0,0.0,0.025134,0.027607,0.033406,0.03438,0.02397,-1,0.81278,-0.256156,0.806463,0.400221,-0.614188,-0.3548,,,5.448261,2.668029,,,3.836342,2.183258,3.902698,3.045431,,,-1.141082,-0.979962,,,-1.157585,-0.966803,-1.430973,-1.103432,,,5.131559,4.314714,,,...,2.420089,,0.800962,1.143663,-3.214578,,1.585939,,0.193996,0.953114,-2.674838,,2.200085,,0.537175,2.156228,-3.568648,,1.193823,,0.097345,0.796214,-4.090058,,2.548596,,0.882588,1.817895,-2.432424,,,6.115747,9.667908,5.542871,11.671595,7.281757,10.060014,6.638248,9.427299,2
3,0,0.0,-0.00473,-0.003273,-0.000461,-0.000476,-0.0032,-1,1.174378,0.34464,0.066872,0.009357,-1.006373,-0.676458,,,4.508206,2.48426,,,2.902176,1.799163,3.1927,2.848359,,,-1.401637,-1.428248,,,-1.421175,-1.487976,-1.756415,-1.647543,,,4.766182,4.528353,,,...,2.330484,,0.182066,1.088451,-3.527752,,-1.338859,,-1.257774,-1.194013,-1.719062,,-0.94019,,-1.510224,-1.781693,-3.373969,,2.513074,,0.424964,1.992887,-2.616856,,0.561528,,-0.994041,0.09956,-2.485993,,,2.838853,0.499251,3.033732,1.513488,4.397532,1.266037,3.856384,1.013469,3
4,0,0.138531,0.001252,0.002165,-0.001215,-0.006219,-0.002604,1,-3.172026,-3.093182,-0.161518,-0.128149,-0.195006,-0.14378,,,2.683018,1.450991,,,1.257761,0.632336,0.905204,0.575275,,,2.550883,2.484082,,,2.502828,2.60644,2.731251,2.566561,,,-1.477905,-1.722451,,,...,4.345282,,2.737738,2.602937,-1.785502,,-0.172561,,-0.299516,-0.420021,-2.354611,,0.762192,,1.59862,0.623132,-1.74254,,-0.934675,,-0.373013,-1.21354,-3.677787,,2.684119,,2.861848,2.134804,-1.279284,,,0.34485,4.101145,0.614252,6.623456,0.800129,5.233243,0.362636,3.926633,4


In [None]:
# Setting max rows to show all percent missing values.
pd.set_option("display.max_rows", None)

In [None]:
# Finding the percent of missing values by calling pandas .isna, which returns a boolean mask that is True
# wherever an element is NA, then taking the column-wise mean of that mask and rounding it.
percent_missing = train.isna().mean().round(4) * 100
percent_missing = pd.DataFrame(percent_missing)
percent_missing

Unnamed: 0,0
date,0.0
weight,0.0
resp_1,0.0
resp_2,0.0
resp_3,0.0
resp_4,0.0
resp,0.0
feature_0,0.0
feature_1,0.0
feature_2,0.0


Some of our features are missing a fair amount of data, but the share is small enough that we can impute the gaps (done below with column means) and proceed with the machine learning analysis.
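
As a minimal sketch of the same missing-value check, the snippet below applies `isna().mean()` to a small made-up DataFrame and sorts the result so the sparsest columns come first. The column names and values are invented for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; all values are invented for illustration.
toy = pd.DataFrame({
    "feature_a": [1.0, np.nan, 3.0, 4.0],
    "feature_b": [np.nan, np.nan, 0.5, 0.7],
    "feature_c": [0.1, 0.2, 0.3, 0.4],
})

# isna() marks missing cells as True; the column-wise mean of that mask is
# the fraction missing, which is then scaled to a percentage.
percent_missing = (toy.isna().mean() * 100).round(2)

# Sort so the columns with the most missing data are listed first.
percent_missing = percent_missing.sort_values(ascending=False)
print(percent_missing)
```

Sorting in descending order makes it easier to spot the worst columns than scanning all 138 rows of the full output.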

In [4]:
# Loading features.csv into a pandas DataFrame.
features =  pd.read_csv("/content/drive/My Drive/Data/features.csv")

In [None]:
# Viewing DataFrame column types to make sure data types are correct.
features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 130 entries, 0 to 129
Data columns (total 30 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   feature  130 non-null    object
 1   tag_0    130 non-null    bool  
 2   tag_1    130 non-null    bool  
 3   tag_2    130 non-null    bool  
 4   tag_3    130 non-null    bool  
 5   tag_4    130 non-null    bool  
 6   tag_5    130 non-null    bool  
 7   tag_6    130 non-null    bool  
 8   tag_7    130 non-null    bool  
 9   tag_8    130 non-null    bool  
 10  tag_9    130 non-null    bool  
 11  tag_10   130 non-null    bool  
 12  tag_11   130 non-null    bool  
 13  tag_12   130 non-null    bool  
 14  tag_13   130 non-null    bool  
 15  tag_14   130 non-null    bool  
 16  tag_15   130 non-null    bool  
 17  tag_16   130 non-null    bool  
 18  tag_17   130 non-null    bool  
 19  tag_18   130 non-null    bool  
 20  tag_19   130 non-null    bool  
 21  tag_20   130 non-null    bool  
 22  ta

In [None]:
# Calling pandas head function to take a glance at our data.
features.head()

Unnamed: 0,feature,tag_0,tag_1,tag_2,tag_3,tag_4,tag_5,tag_6,tag_7,tag_8,tag_9,tag_10,tag_11,tag_12,tag_13,tag_14,tag_15,tag_16,tag_17,tag_18,tag_19,tag_20,tag_21,tag_22,tag_23,tag_24,tag_25,tag_26,tag_27,tag_28
0,feature_0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,feature_1,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,feature_2,False,False,False,False,False,False,True,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,feature_3,False,False,False,False,False,False,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,feature_4,False,False,False,False,False,False,True,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


# Target Variable and Train-Test-Split

In [9]:
# Dropping trades with a weight of 0, which do not contribute to scoring.
train = train[train['weight'] != 0]
# Keeping only trades after day 85, i.e. dropping the first 86 days (dates 0 through 85).
train = train.query('date > 85').reset_index(drop = True) 
# Limiting memory usage by converting float64 columns to float32.
train = train.astype({c: np.float32 for c in train.select_dtypes(include='float64').columns})

# Creating target variable by multiplying weight and resp values and selecting those with a positive return.
train['action'] = ((train['weight'].values * train['resp'].values) > 0).astype('int')

# Filling missing values with the mean.
train.fillna(train.mean(),inplace=True)

cols = [c for c in train.columns if 'feature' in c]

# Creating independent variables by extracting feature columns to create X variable.
X = train.loc[:, train.columns.str.contains('feature')]
# Creating our target variable using our newly created action variable.
Y = train.loc[:, 'action']
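
To make the labeling and imputation steps concrete, here is a small sketch on invented rows. The `toy` frame and the `feature_x` column are hypothetical stand-ins, not part of the dataset. Once zero-weight trades are removed and the remaining weights are positive, `weight * resp > 0` gives the same label as `resp > 0`.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; feature_x is a hypothetical column.
toy = pd.DataFrame({
    "weight": [0.0, 16.67, 0.14, 2.5],
    "resp": [0.0063, -0.0098, -0.0026, 0.0120],
    "feature_x": [1.0, np.nan, 3.0, 5.0],
})

# Drop zero-weight trades, mirroring the filter applied to train.
toy = toy[toy["weight"] != 0].reset_index(drop=True)

# action = 1 when the weighted return is positive, otherwise 0.
toy["action"] = ((toy["weight"] * toy["resp"]) > 0).astype("int")

# Fill the remaining missing feature value with the column mean.
toy["feature_x"] = toy["feature_x"].fillna(toy["feature_x"].mean())
print(toy)
```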

In [None]:
# Creating a simple plot to show class balance.
x = train['action'].value_counts().index
y = train['action'].value_counts().values

trace2 = go.Bar(
     x=x ,
     y=y,
     marker=dict(
         color=y,
         colorscale = 'Viridis',
         reversescale = True
     ),
     name="Imbalance",    
 )
layout = dict(
     title="Class Balance for Target Variable",
     #width = 900, height = 500,
     xaxis=go.layout.XAxis(
     automargin=True),
     yaxis=dict(
         showgrid=False,
         showline=False,
         showticklabels=True,
 #         domain=[0, 0.85],
     ), 
)
fig1 = go.Figure(data=[trace2], layout=layout)
iplot(fig1)

The target variable is balanced and does not need any further processing to fix class imbalance.

In [10]:
# Using sklearns train_test_split to split data into random train and test subsets to calculate accuracy.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.20, random_state=42)

# XGBoost Classifier

First we will set up a simple XGBoost classification model without hyperparameter tuning, as a baseline for measuring how much RandomizedSearchCV improves performance.

In [11]:
import xgboost as xgb
print("XGBoost version:", xgb.__version__)

XGBoost version: 0.90


In [None]:
clf = xgb.XGBClassifier(
    random_state=2021,
    tree_method='gpu_hist'  # Needed for GPU usage.
)

In [None]:
%time clf.fit(X_train, y_train)

CPU times: user 3.49 s, sys: 1.53 s, total: 5.02 s
Wall time: 5.41 s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0,
              learning_rate=0.1, max_delta_step=0, max_depth=3,
              min_child_weight=1, missing=None, n_estimators=100, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=2021,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, tree_method='gpu_hist', verbosity=1)

In [None]:
# Print classification report for our XGBoost classifier.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Classification report:")
print(report)
print("Confusion matrix:")
print(cm)

Accuracy:  0.5278141038490787
Classification report:
              precision    recall  f1-score   support

           0       0.53      0.48      0.50    156239
           1       0.53      0.57      0.55    158044

    accuracy                           0.53    314283
   macro avg       0.53      0.53      0.53    314283
weighted avg       0.53      0.53      0.53    314283

Confusion matrix:
[[75013 81226]
 [67174 90870]]


In [None]:
# Print cross validation score.
cross_val_score(clf, X_train, y_train, cv=5)

array([0.5290124 , 0.52668568, 0.52656448, 0.52629402, 0.52789688])

These performance metrics leave much to be desired. We will adjust hyperparameters in the following models to demonstrate the importance of hyperparameter tuning.
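
One way to put these numbers in context is to compare the cross-validated accuracy with a trivial model that always predicts the larger class. The sketch below uses only figures printed above: the five fold scores and the support column of the classification report. Treating the test-set supports as the class proportions is an assumption made for illustration.

```python
from statistics import mean, stdev

# Fold accuracies copied from the cross_val_score output above.
cv_scores = [0.5290124, 0.52668568, 0.52656448, 0.52629402, 0.52789688]

# Test-set class counts from the support column of the classification report.
support = {0: 156239, 1: 158044}

cv_mean = mean(cv_scores)
cv_std = stdev(cv_scores)  # sample standard deviation across the 5 folds

# Accuracy of a trivial model that always predicts the larger class.
majority_baseline = max(support.values()) / sum(support.values())

print(f"CV accuracy: {cv_mean:.4f} +/- {cv_std:.4f}")
print(f"Majority-class baseline: {majority_baseline:.4f}")
print(f"Margin over baseline: {cv_mean - majority_baseline:.4f}")
```

The fold scores are tightly clustered, so the small margin over the baseline reflects a weak model rather than noisy folds.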

# XGBoost: Hyperparameter Tuning Using RandomizedSearchCV

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Create parameter grid for RandomizedSearchCV.
parameters = {'n_estimators':[1000,2000],
              'max_depth':[12,22],
              'learning_rate':[0.03,0.06],
              'subsample':[0.5,1],
              'colsample_bytree':[1,0.5],
              'gamma':[0.5,1],
              'min_child_weight':[1,2],
              'tree_method':['gpu_hist']}

# Wrap the XGBoost classifier in RandomizedSearchCV.
CV_clf = RandomizedSearchCV(clf, parameters, random_state=0)

# Fit the randomized search on the training data.
CV_clf.fit(X_train, y_train)

RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.1, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=100,
                                           n_jobs=1, nthread=None,
                                           objective='binary:logistic',
                                           random_state=2021, reg_alpha=0,
                                           reg_lambd...
                                           verbosity=1),
                   iid='deprecated', n_iter=10, n_jobs=None,
                   param_distributions={'colsample_bytree': 
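
It helps to know how much of the grid this search covers. When every parameter is given as a list, RandomizedSearchCV samples candidate combinations without replacement. The sketch below counts the combinations in the grid above and estimates the number of model fits. `n_iter=10` is taken from the printed repr, and the 5-fold default for `cv=None` is an assumption about the scikit-learn version in use.

```python
from math import prod

# Same grid as passed to RandomizedSearchCV above.
parameters = {
    "n_estimators": [1000, 2000],
    "max_depth": [12, 22],
    "learning_rate": [0.03, 0.06],
    "subsample": [0.5, 1],
    "colsample_bytree": [1, 0.5],
    "gamma": [0.5, 1],
    "min_child_weight": [1, 2],
    "tree_method": ["gpu_hist"],
}

# Number of distinct combinations is the product of the list lengths.
n_combinations = prod(len(values) for values in parameters.values())

n_iter = 10   # taken from the RandomizedSearchCV repr above
cv_folds = 5  # assumption: cv=None falls back to 5-fold cross-validation

total_fits = n_iter * cv_folds          # excludes the final refit on all data
coverage = n_iter / n_combinations      # share of the grid actually sampled

print(f"{n_combinations} combinations, {total_fits} fits, "
      f"{coverage:.1%} of the grid sampled")
```

Because only a small share of the grid is sampled, the reported best parameters are the best among the sampled candidates, not necessarily the best in the grid.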

In [None]:
# Calling .best_params_ to pull best parameters from RandomizedSearchCV.
CV_clf.best_params_

{'colsample_bytree': 0.5,
 'gamma': 1,
 'learning_rate': 0.06,
 'max_depth': 12,
 'min_child_weight': 2,
 'n_estimators': 1000,
 'subsample': 1,
 'tree_method': 'gpu_hist'}

In [None]:
clf = xgb.XGBClassifier(
    random_state=0,
    tree_method='gpu_hist',  # Needed for GPU usage.
    colsample_bytree=0.5,
    gamma=1,
    learning_rate=0.06,
    max_depth=12,
    min_child_weight=2,
    n_estimators=1000,
    subsample=1,
)

In [None]:
%time clf.fit(X_train, y_train)

CPU times: user 1min 19s, sys: 13 s, total: 1min 32s
Wall time: 1min 32s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=1,
              learning_rate=0.06, max_delta_step=0, max_depth=12,
              min_child_weight=2, missing=None, n_estimators=1000, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, tree_method='gpu_hist', verbosity=1)

In [None]:
# Print classification report for our XGBoost classifier.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Classification report:")
print(report)
print("Confusion matrix:")
print(cm)

Accuracy:  0.6407378063719641
Classification report:
              precision    recall  f1-score   support

           0       0.64      0.63      0.64    156239
           1       0.64      0.65      0.64    158044

    accuracy                           0.64    314283
   macro avg       0.64      0.64      0.64    314283
weighted avg       0.64      0.64      0.64    314283

Confusion matrix:
[[ 98847  57392]
 [ 55518 102526]]


In [None]:
# Print cross validation score.
cross_val_score(clf, X_train, y_train, cv=5)

array([0.62702892, 0.62617778, 0.6226325 , 0.6273138 , 0.62595356])

Tuning raised test accuracy from about 53% to 64%, a gain of roughly 11 percentage points. This makes the model considerably more viable, but we will tune it further with another hyperparameter grid that explores a deeper max_depth along with higher learning_rate and min_child_weight values.
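
The reported metrics can be reproduced directly from the confusion matrix, which is a useful sanity check. The sketch below assumes scikit-learn's convention that rows are true classes and columns are predicted classes, and uses the tuned model's matrix printed above.

```python
# Confusion matrix of the tuned model, copied from the output above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
cm = [[98847, 57392],
      [55518, 102526]]

tn, fp = cm[0]  # true class 0: predicted 0 / predicted 1
fn, tp = cm[1]  # true class 1: predicted 0 / predicted 1
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)  # of predicted trades, share that were correct
recall_1 = tp / (tp + fn)     # of profitable trades, share that were found

print(f"accuracy={accuracy:.4f} precision(1)={precision_1:.4f} "
      f"recall(1)={recall_1:.4f}")
```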

In [None]:
# Creating another parameter grid for RandomizedSearchCV.
parameters = {'n_estimators':[1000,2000],
              'max_depth':[12,24],
              'learning_rate':[0.06,0.2],
              'subsample':[1],
              'colsample_bytree':[0.5],
              'gamma':[0.5,1],
              'min_child_weight':[2,4],
              'tree_method':['gpu_hist']}

# Wrap the XGBoost classifier in RandomizedSearchCV.
CV_clf = RandomizedSearchCV(clf, parameters, random_state=0)

# Fit the randomized search on the training data.
CV_clf.fit(X_train, y_train)

RandomizedSearchCV(cv=None, error_score=nan,
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=0.5, gamma=1,
                                           learning_rate=0.06, max_delta_step=0,
                                           max_depth=12, min_child_weight=2,
                                           missing=None, n_estimators=1000,
                                           n_jobs=1, nthread=None,
                                           objective='binary:logistic',
                                           random_state=0, reg_alpha=0,
                                           reg_lam...
                                           verbosity=1),
                   iid='deprecated', n_iter=10, n_jobs=None,
                   param_distributions={'colsample_bytree': 

In [35]:
# Calling .best_params_ to pull best parameters from RandomizedSearchCV.
CV_clf.best_params_

{'colsample_bytree': 0.5,
 'gamma': 1,
 'learning_rate': 0.06,
 'max_depth': 24,
 'min_child_weight': 2,
 'n_estimators': 1000,
 'subsample': 1,
 'tree_method': 'gpu_hist'}

In [36]:
clf = xgb.XGBClassifier(
    random_state=0,
    tree_method='gpu_hist',  # Needed for GPU usage.
    colsample_bytree=0.5,
    gamma=1,
    learning_rate=0.06,
    max_depth=24,
    min_child_weight=2,
    n_estimators=1000,
    subsample=1,
)

In [37]:
%time clf.fit(X_train, y_train)

CPU times: user 13min 2s, sys: 1min, total: 14min 3s
Wall time: 14min 2s


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.5, gamma=1,
              learning_rate=0.06, max_delta_step=0, max_depth=24,
              min_child_weight=2, missing=None, n_estimators=1000, n_jobs=1,
              nthread=None, objective='binary:logistic', random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
              silent=None, subsample=1, tree_method='gpu_hist', verbosity=1)

In [38]:
# Print classification report for our XGBoost classifier.
y_pred = clf.predict(X_test)
accuracy = accuracy_score(y_test,y_pred)
report = classification_report(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)

print("Accuracy: ", accuracy)
print("Classification report:")
print(report)
print("Confusion matrix:")
print(cm)

Accuracy:  0.6748949195470324
Classification report:
              precision    recall  f1-score   support

           0       0.68      0.67      0.67    156239
           1       0.67      0.68      0.68    158044

    accuracy                           0.67    314283
   macro avg       0.67      0.67      0.67    314283
weighted avg       0.67      0.67      0.67    314283

Confusion matrix:
[[103972  52267]
 [ 49908 108136]]


Increasing the model's max_depth from 12 to 24 improves accuracy again, this time by about 3.4 percentage points (from roughly 64.1% to 67.5%). Further testing with max_depth values up to 100 showed no additional improvement, only longer run times.

# Conclusion

This notebook demonstrates the importance of hyperparameter tuning for model performance. Our initial model showed little useful predictive power, with an accuracy of about 53%. After hyperparameter tuning we reached an accuracy of about 67.5%, a gain of roughly 15 percentage points. XGBoost continues to be a useful model for working with large amounts of complex numerical data like the market data here.
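
As a compact recap, the sketch below tabulates the three runs using the test accuracies and fit wall times printed in this notebook. It reports each gain both in percentage points and relative to the baseline, since the two are easy to confuse. The run labels are illustrative names chosen here.

```python
# Test accuracies and fit wall times copied from the outputs in this notebook.
runs = {
    "baseline (defaults)": {"accuracy": 0.5278141038490787, "fit_seconds": 5.41},
    "tuned, max_depth=12": {"accuracy": 0.6407378063719641, "fit_seconds": 92.0},
    "tuned, max_depth=24": {"accuracy": 0.6748949195470324, "fit_seconds": 842.0},
}

base_acc = runs["baseline (defaults)"]["accuracy"]
summary = {}
for name, run in runs.items():
    # Absolute gain in percentage points over the baseline.
    point_gain = (run["accuracy"] - base_acc) * 100
    # Relative gain as a percent of the baseline accuracy.
    relative_gain = (run["accuracy"] / base_acc - 1) * 100
    summary[name] = (point_gain, relative_gain)
    print(f"{name:22s} acc={run['accuracy']:.4f} "
          f"+{point_gain:5.2f} pts  +{relative_gain:5.2f}% rel  "
          f"fit={run['fit_seconds']:.0f}s")
```

The last step from max_depth 12 to 24 buys a few more points of accuracy at roughly nine times the fit time, which matches the diminishing returns noted above.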