<h1>Predicting short-term price movement in crypto markets with the XGBoost algorithm</h1>

This notebook is an investigation into predicting short-term movement of crypto markets using the XGBoost algorithm.

The hypothesis is that there is some predictability in price movement based on the relationship between price and pivot-highs/pivot-lows. We will see if the XGBoost algorithm can make accurate predictions based on this data.

I have already calculated the pivot-high and pivot-low points for the data in this repository. If you'd like to know more about how pivot highs and pivot lows are calculated, you can read about it here:

https://www.fidelity.com/learning-center/trading-investing/technical-analysis/technical-indicator-guide/pivot-points-high-low
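
To make the idea concrete, here is a minimal sketch of one common way to detect pivot points: a bar is a pivot high if its high is strictly above the highs of the `window` bars on either side, and a pivot low is the mirror image. The function name `find_pivots`, the `window` parameter and the toy price lists are assumptions for illustration only. This is not necessarily how the pivot columns in this repository were calculated.

```python
def find_pivots(highs, lows, window=5):
    """Return (pivot_high_indices, pivot_low_indices).

    A bar is treated as a pivot high when its high is strictly greater than
    the highs of the `window` bars before and after it (pivot lows mirror
    this). The default window of 5 is an assumed value, not the author's.
    """
    highs, lows = list(highs), list(lows)
    pivot_highs, pivot_lows = [], []
    # Bars closer than `window` to either end cannot be confirmed as pivots
    for i in range(window, len(highs) - window):
        neighbour_highs = highs[i - window:i] + highs[i + 1:i + window + 1]
        neighbour_lows = lows[i - window:i] + lows[i + 1:i + window + 1]
        if highs[i] > max(neighbour_highs):
            pivot_highs.append(i)
        if lows[i] < min(neighbour_lows):
            pivot_lows.append(i)
    return pivot_highs, pivot_lows


# Toy data (made up for illustration)
highs = [10, 11, 12, 15, 12, 11, 10, 9, 11, 12]
lows = [9, 10, 11, 13, 11, 10, 8, 7, 9, 10]
print(find_pivots(highs, lows, window=2))
```

Under this definition a pivot at bar `i` is only confirmed once `window` further bars have closed, so any feature built from it should use the pivot only from that later bar onwards.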

In [1]:
# Import required modules

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import metrics, model_selection
from xgboost import XGBClassifier

plt.rcParams["figure.figsize"] = (12, 8)


In [2]:
# Import the data and check it has loaded

data = pd.read_pickle('XBTUSD_1h_with_pivs_5')  # hourly XBTUSD candles with precomputed pivots


data.tail()


index,timestamp,open,high,low,close,volume,pivot,pivH1,pivL1,pivH2,pivH3,pivH4,pivH5,pivL2,pivL3,pivL4,pivL5
11408,2019-04-21 08:00:00,5323.5,5334.0,5315.5,5316.0,31191090,1,5334.0,5250.5,5342.5,5348.0,5333.0,5328.0,5295.0,5292.0,5250.0,5234.5
11409,2019-04-21 09:00:00,5316.0,5317.5,5279.0,5306.5,96627345,0,5334.0,5250.5,5342.5,5348.0,5333.0,5328.0,5295.0,5292.0,5250.0,5234.5
11410,2019-04-21 10:00:00,5306.5,5306.5,5222.0,5250.0,191676251,0,5334.0,5250.5,5342.5,5348.0,5333.0,5328.0,5295.0,5292.0,5250.0,5234.5
11411,2019-04-21 11:00:00,5250.0,5271.0,5232.5,5261.0,79802394,0,5334.0,5250.5,5342.5,5348.0,5333.0,5328.0,5295.0,5292.0,5250.0,5234.5
11412,2019-04-21 12:00:00,5261.0,5264.5,5247.0,5258.0,26239526,0,5334.0,5250.5,5342.5,5348.0,5333.0,5328.0,5295.0,5292.0,5250.0,5234.5


<h2>Adding features</h2>

The data above is hourly candle data from the BitMEX exchange covering 2018 to 2019. The columns "pivH1" to "pivH5" and "pivL1" to "pivL5" hold the price levels of the most recent pivot highs and pivot lows at each candle. We will use these as the basis for the features that XGBoost will work with and see if we can think of some combinations that make sense.

In the case below I have chosen to look at the relationship between price and the latest pivot points as well as the relationship between the latest pivot points and each other.
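
As a quick worked example of what these ratios mean, the sketch below computes three of them by hand for the last row of the table above. It assumes, as the text suggests, that `pivH1` and `pivL1` are the most recent pivot high and pivot low. The variable names are illustrative only.

```python
# Values copied from the last row shown above (2019-04-21 12:00:00)
close, piv_h1, piv_l1 = 5258.0, 5334.0, 5250.5

# Below 1 means the close is under the latest pivot high
close_over_piv_h1 = close / piv_h1
# Above 1 means the close is over the latest pivot low
close_over_piv_l1 = close / piv_l1
# Ratio of the latest pivot low to the latest pivot high:
# the closer to 1, the narrower the range between them
piv_l1_over_piv_h1 = piv_l1 / piv_h1

print(close_over_piv_h1, close_over_piv_l1, piv_l1_over_piv_h1)
```

Because each feature is a ratio rather than an absolute price, it stays comparable across periods where the price level itself is very different.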

In [4]:
# Add the features we want to use

# Close relative to pivot highs and lows 1 to 3
data['close/pivH1'] = data['close'] / data['pivH1']
data['close/pivL1'] = data['close'] / data['pivL1']
data['close/pivH2'] = data['close'] / data['pivH2']
data['close/pivL2'] = data['close'] / data['pivL2']
data['close/pivH3'] = data['close'] / data['pivH3']
data['close/pivL3'] = data['close'] / data['pivL3']

# Pivot lows relative to pivot highs
data['pivL1/pivH1'] = data['pivL1'] / data['pivH1']
data['pivL2/pivH2'] = data['pivL2'] / data['pivH2']

# Candle high and low relative to pivot high 1 and pivot low 1
data['high/pivH1'] = data['high'] / data['pivH1']
data['low/pivH1'] = data['low'] / data['pivH1']
data['high/pivL1'] = data['high'] / data['pivL1']
data['low/pivL1'] = data['low'] / data['pivL1']

# Close relative to the previous candle's close
data['close/prevClose'] = data['close'] / data['close'].shift(1)


# Below are the things we are interested in predicting:

# Relative size of the move from this close to the next close
data['next_candle_size'] = abs(data['close'].shift(-1) - data['close']) / data['close']

# Binary label: 1 if the next close is higher than this close, otherwise 0
# (an unchanged or lower close counts as 0). XGBClassifier expects 0/1 labels.
data['next_candle_color'] = np.where(data['close'].shift(-1) > data['close'], 1, 0)

data = data[['timestamp', 'open', 'high', 'low', 'close', 'close/prevClose','low/pivL1', 'close/pivH3', 'close/pivH1', 'close/pivL2', 'close/pivL1', 'close/pivH2',
       'high/pivL1', 'close/pivL3', 'low/pivH1', 'pivL2/pivH2', 'high/pivH1', 'next_candle_color', 'next_candle_size']]




In [5]:
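# Split into training and validation data
# Note: data[8001:] starts one row late, so row 8000 ends up in neither set

train_data = data[:8000].copy()      # copies, so later in-place edits don't touch `data`
validate_data = data[8001:].copy()

df = train_data
df_2 = validate_data

split = len(df) / (len(df_2) + len(df))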

print(len(data))
print(len(df))
print(len(df_2))

print(split)

11413
8000
3412
0.7010164738871364


In [6]:
df.dropna(axis=0, inplace=True)

X = df.iloc[:,5:17]   # the 12 feature columns, 'close/prevClose' through 'high/pivH1'
y = df.iloc[:,-2]     # the 'next_candle_color' label

# split data randomly into 70% training and 30% test
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.3, random_state=123)



In [7]:
params = {
    'objective': 'binary:logistic',
    'max_depth': 2,
    'learning_rate': 1,
    'verbosity': 0,  # quiet output; replaces the deprecated 'silent' flag
    'n_estimators': 5
}

model = XGBClassifier(**params).fit(X_train, y_train)

In [8]:
# use the model to make predictions with the test data
y_pred = model.predict(X_test)
# how did our model perform?
count_misclassified = (y_test != y_pred).sum()
print('Misclassified samples: {}'.format(count_misclassified))
accuracy = metrics.accuracy_score(y_test, y_pred)
print('Accuracy: {:.2f}'.format(accuracy))

Misclassified samples: 913
Accuracy: 0.62




In [9]:
# Double check the results in more detail

df_results = X_test.copy()  # copy so the extra columns are not added to X_test itself

# Attach the true values (aligned on the index) and the model's predictions
df_results['next_candle_color'] = df['next_candle_color']
df_results['next_candle_size'] = df['next_candle_size']
df_results['prediction'] = y_pred

success = df_results.prediction == df_results.next_candle_color
failure = np.where(df_results.prediction != df_results.next_candle_color, 1, 0)

df_results['success'] = np.where(success, 1, 0)

win_percent = sum(df_results['success']) / len(df_results)

print("Correct predictions: " + str(sum(df_results['success'])))
print("Incorrect predictions: " + str(sum(failure)))
print("Win percent: " + str(win_percent * 100) + "%")
print(" ")



Correct predictions: 1480
Incorrect predictions: 913
Win percent: 61.84705390722942%
 



In [10]:
# The real validation starts here by using the out of sample data

df_2.dropna(axis=0, inplace=True)

X1 = df_2.iloc[:,5:17]   # same feature columns as used for training
y1 = df_2.iloc[:,-2]     # the 'next_candle_color' label

y_pred1 = model.predict(X1)

df_results_2 = X1.copy()  # copy so the extra columns are not added to X1 itself

df_results_2['prediction'] = y_pred1
df_results_2['next_candle_color'] = df_2['next_candle_color']

success_2 = df_results_2.prediction == df_results_2.next_candle_color
failure_2 = np.where(df_results_2.prediction != df_results_2.next_candle_color, 1, 0)

df_results_2['success'] = np.where(success_2, 1, 0)

win_percent_2 = sum(df_results_2['success']) / len(df_results_2)

print("Accurate predictions: " + str(sum(df_results_2['success'])))
print("Incorrect predictions: " + str(sum(failure_2)))
print("Win percent: " + str(win_percent_2 * 100) + "%")

Accurate predictions: 2095
Incorrect predictions: 1316
Win percent: 61.41893872764585%




In [11]:
# checking the feature importance according to XGBoost:

feature_imp = pd.DataFrame(model.feature_importances_, index=X.columns)

feature_imp.sort_values(by=0,ascending=False)

feature,importance
low/pivL1,0.45324
high/pivH1,0.267552
close/pivH1,0.142302
close/prevClose,0.136905
close/pivH3,0.0
close/pivL2,0.0
close/pivL1,0.0
close/pivH2,0.0
high/pivL1,0.0
close/pivL3,0.0


<h2>Preliminary results</h2>

We can see from this experiment that XGBoost is able to predict the direction of the next hourly candle with roughly 61% accuracy: 61.8% on the random test split and 61.4% on the out-of-sample data. It seems that the most recent pivot-high and pivot-low and their relationship with the latest price high and price low are the most significant predictors. This is a massive edge over the market. However, this is only a binary prediction (whether the price will go up or down) and does NOT predict the distance the price will move. In order to be profitable with a 60% win rate we would need an average reward-to-risk ratio (average win divided by average loss) of at least 0.4 / 0.6, which is about 0.67 or roughly 0.7, not including BitMEX fees. To develop a trading strategy with this we would need to backtest using stop limits etc. However, this certainly could be part of a trading signal in a more complex strategy.
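
The break-even ratio mentioned above can be checked with a little arithmetic. With win rate p, average win W and average loss L, the expected profit per trade is p·W − (1 − p)·L, which is zero when W / L = (1 − p) / p. The sketch below works this out. The function names are illustrative, and fees and slippage are ignored.

```python
def breakeven_reward_to_risk(win_rate):
    """Average win / average loss ratio at which expected profit is zero."""
    if not 0 < win_rate < 1:
        raise ValueError("win_rate must be strictly between 0 and 1")
    return (1 - win_rate) / win_rate


def expectancy(win_rate, reward_to_risk):
    """Expected profit per trade, in units of the average loss (fees ignored)."""
    return win_rate * reward_to_risk - (1 - win_rate)


# Break-even ratio for a 60% win rate
print(breakeven_reward_to_risk(0.60))

# Expected profit per trade at a 60% win rate and a 0.7 reward-to-risk ratio
print(expectancy(0.60, 0.7))
```

At a 60% win rate, any average reward-to-risk ratio above the break-even value gives a positive expectancy before fees, so a ratio of 0.7 leaves only a thin margin for trading costs.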