# Machine Learning for Cryptocurrency Trading
## An Introduction

This is a brief crash course on how to use machine learning to make cryptocurrency trading decisions.  **The code used to perform most of the cryptocurrency specific tasks is located in the** ```/ml4t``` **directory of this repo. This library should be a good starting place for building your own strategies.** Other popular Python libraries handle general data structure and machine learning algorithm implementation.

This isn't a comprehensive implementation of cutting-edge techniques, but rather a clean and logical place to start. Suggestions for extending this work, and references to good resources for doing so are noted throughout.

Trading is definitionally zero-sum: every transaction has two sides and if one wins the other must lose. As you might imagine then, building a money-making machine for a zero-sum game is quite competitive. We'll just scratch the surface here, but it's a really fun problem to spend time on because (1) the stakes are real and (2) you develop skills that generalize to a lot of interesting problems.

**The goal here is to predict price movements and build a profitable trading strategy based on those predictions.** 

Data collection and trade execution vary by exchange but are generally easy either manually or via API. (See [this repo](https://github.com/danpaquin/gdax-python) for a nice Python package for interacting with the [GDAX API](https://docs.gdax.com).) Sample data from GDAX is provided for this notebook.

<a id='toc'></a>
### Table of Contents:
This course is broken down into the following sections:
1. [Quick Orientation on Trading](#section1)
2. [Overview](#section2)
3. [Cleaning Raw Trading Data](#section3)
4. [Feature Engineering](#section4)
5. [Split Data](#section5)
6. [Predicting Price](#section6)
7. [Optimizing Trading Strategy](#section7)

The repo for this course includes a library that will perform much of the work. The code is in ```/ml_for_trading/ml4t```. This should be a useful jumping-off point from which you can build your own strategies.

<a id='section1'></a>

[Jump back to the TOC](#toc)
# 1. Quick Orientation on Trading

On an exchange like GDAX, trading involves matching buyers and sellers. Limit orders underlie the market and "give it liquidity." A limit order is when someone with an asset offers to sell some or all of it (say 10 Bitcoin) at a specific per-unit price (say \$1,000 per Bitcoin). Or, someone could offer to buy an amount of an asset (say 1 Bitcoin) at a specific per-unit price (say \$995 per Bitcoin).

Limit orders sit open on the exchange, waiting for a counter party to take the other side of the trade. Traders who put in limit orders are known as "price makers" because they determine the price of their sell or buy (if it gets executed). Fees for limit orders are typically the cheapest because they attract more people to the exchange; the more liquidity the better. On GDAX, for example, limit orders are free. But, they are not dynamic and may not get executed if no one else wants to take the other side of the trade. There's always a risk that the market will move away from a limit order: for example, if you're trying to sell Bitcoin at one price but the price is falling, you're stuck with an asset that is losing value.

The other major type of order is a market order. Market orders are "price takers" and execute against the outstanding limit orders in the order book. They are structured as specifying the amount of money you're willing to spend buying an asset, or the volume of assets you want to sell. For example, you might execute a market order to buy \$10,000 worth of Bitcoin. In that case, you will get the cheapest \$10,000 worth of Bitcoin available in outstanding limit orders. Note that in doing this, you may be paying several different prices for various amounts of the asset. For example, there may be only 1 Bitcoin available at \$5,000 and the next costs \$5,050. In this case, you'd end up with about 1.99 Bitcoin in total for your \$10,000, even though the quoted price at the execution of your market order was \$5,000. Because market orders "remove liquidity from the market" and execute immediately (provided there are outstanding limit orders to trade against at _some_ price), they are more expensive. GDAX charges around 0.3% of the value of the trade for market orders.
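To make the arithmetic in that example concrete, here is a minimal sketch of a market buy walking up the ask side of an order book. The function name and the order book contents are hypothetical (the second price level's size is made up), and fees are ignored.

```python
def fill_market_buy(usd_to_spend, asks):
    """Spend a USD budget against outstanding sell limit orders, cheapest first.

    asks: list of (price_per_coin, coins_available) tuples (hypothetical book).
    Returns (coins_bought, average_price_paid).
    """
    remaining = usd_to_spend
    coins = 0.0
    for price, available in sorted(asks):
        if remaining <= 0:
            break
        # Spend whatever is left, capped by the USD value resting at this level
        spend = min(remaining, price * available)
        coins += spend / price
        remaining -= spend
    spent = usd_to_spend - remaining
    avg_price = spent / coins if coins else float("nan")
    return coins, avg_price


# 1 Bitcoin offered at $5,000, then more at $5,050
asks = [(5000.0, 1.0), (5050.0, 5.0)]
coins, avg_price = fill_market_buy(10000, asks)
print("Bought {:.4f} BTC at an average of ${:,.2f}".format(coins, avg_price))
```

The first \$5,000 buys exactly 1 Bitcoin and the remaining \$5,000 buys 5000 / 5050 of a Bitcoin, which is where the "about 1.99" figure comes from. The average price paid lands between the two levels, above the \$5,000 quote.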

<a id='section2'></a>

[Jump back to the TOC](#toc)
# 2. Overview
## Approach: Inference & Prediction
Broadly, there are two things you can do with machine learning: 
1. You can develop an understanding of how systems work: how is an output generated as a function of some input. This is **inference**.
2. You can attempt to guess, as accurately as possible, what output will result from some input. This is **prediction**. 

This notebook is focused almost exclusively on prediction. If you can systematically predict which way the price of a cryptocurrency is going to move and can act efficiently to trade based on that information, you will make above-market returns. 

Ideally, strong inference work underpins the development of prediction systems; a logical understanding of the system should guide prediction modeling decisions. Here, however, there may not be much sound inference to be drawn because cryptocurrencies aren't _really_ underlying any meaningful utility yet. Given this, and a desire for brevity, it makes sense to focus on developing a solid prediction architecture.

## Approach: Fundamental & Technical/ Quantitative Analysis
Another distinction broadly classifies the three types of equity trading strategies:
1. **Fundamental Analysis** attempts to evaluate an asset's intrinsic value. For example, based on the performance of a company, its assets and liabilities, the state of the market, and the overall economy, what is its true present value? Cryptocurrencies are currently being valued not for what they are, but what they could be. The evolution will likely be weirder than we think. High uncertainty about not just the probability of specific outcomes, but even the topography of possible outcomes, isn't a tractable machine learning problem.
2. **Technical Analysis** attempts to predict future price movements based on the recent trading data.
3. **Quantitative Analysis** attempts to predict future price based on mathematical relationships to basically anything other than an asset's historical price and volume. For example, counting the number of cars in retail store parking lots could predict revenue. Or, fuel prices may be predictive of airline margins.

Both technical analysis and quantitative analysis are well suited for machine learning. Here, we'll focus on technical analysis, if only because of the simplicity of working with a single dataset. Much of the overhead should generalize to quantitative analysis strategies, however.

## Data Sources
This notebook will only use price and volume data for Bitcoin (though Ethereum and Litecoin data are also included in ```/data``` for your enjoyment). Each observation is called a "candlestick" because of what it looks like when graphed, and includes the open, close, high and low price, as well as volume traded, for a specific period of time (1 minute in this case). This is about the most parsimonious dataset with which you could imagine modeling price movements.
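As a rough illustration of what a candlestick summarizes, the sketch below collapses one period's individual trades into the five fields described above. The `build_candle` helper and the trade values are made up for illustration; the sample data in this repo are already aggregated.

```python
def build_candle(trades):
    """Aggregate one period's trades into a candlestick.

    trades: list of (price, coin_volume) tuples in time order (made-up data).
    """
    prices = [price for price, _ in trades]
    return {
        "low": min(prices),
        "high": max(prices),
        "open": prices[0],    # first trade of the period
        "close": prices[-1],  # last trade of the period
        "coin_vol": sum(volume for _, volume in trades),
    }


trades = [(8247.78, 0.5), (8254.99, 1.2), (8250.00, 0.3)]
candle = build_candle(trades)
print(candle)
```

Note what is lost in the aggregation: the candle no longer records how many trades occurred, in what order the high and low were reached, or how volume was distributed across prices.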

There are a lot of other data sources you should consider if you want to get serious with your algorithmic trading, including:
- Other cryptocurrencies:
 - Trading price and volume of other cryptocurrencies
 - Statistical relationships between other cryptocurrencies (_e.g.,_ relative price movements have converged and decoupled at various points)
- Deeper exchange data:
 - By-order data: candlestick data, which we're going to use here, are an aggregation of one or more orders over a set period of time. Though noticeably more challenging to handle, by-order data could provide granularity that yields additional predictive power
 - Order book data: candlestick and by-order data record the trades that actually occurred. What about orders that went unfilled? Complete order book data would add a lot of additional market context. It would also add a lot more complexity: orders can be outstanding for an arbitrary amount of time and can end in three different terminal states: cancelled, partially filled, or fully filled
- Other exchanges: there are a lot of exchanges that have meaningful liquidity. Some may lead or lag in price movements. There may even be pure arbitrage opportunities (make sure to consider transaction time and fees)
- Network information: the information stored on an asset's blockchain. Data included varies by asset, but some of the common things include
 - Nodes: number and power of miners on the network
 - Transaction fees
 - Amount and velocity of transactions
 - Number and concentration of wallets
- News sentiment: Reddit, Telegram, Twitter, various other chat communities, and media coverage
- Code: many open source projects save their code, and incremental updates, to public GitHub repos. Summarizing the evolution of these code bases could indicate some rough signal of quality or momentum
 - Volume and size of commits
 - Number and quality of contributors
- Major events: some way to capture big changes to the relevant currencies (_e.g.,_ announcements of adoption by large institutions), or environment (_e.g.,_ SEC regulation)
- Non-cryptocurrency assets and macro economy indicators

## Tools:
The code in this course is Python 3.6. It relies on a number of popular libraries:

In [1]:
import numpy as np
import pandas as pd
import time
import datetime as dt
import matplotlib.pyplot as plt
import ml4t  # ~/ml4t in this repo; e.g., ml4t.TradeData()

# Viewing options
%matplotlib notebook
pd.options.display.max_columns = 2000
pd.options.display.max_rows = 2000
pd.options.display.float_format = '{:,.4f}'.format  # 4 decimal places
# %config InlineBackend.figure_format = 'retina'   # if viewing on MBP retina and want high res plots


<a id='section3'></a>

[Jump back to the TOC](#toc)
# Section 3: Cleaning Raw Data
Raw trading data were pulled from GDAX's API and saved as CSV files. In this section, these raw data are cleaned and prepared for price prediction modeling.


In [2]:
REPO_PATH = %pwd
DATA_PATH = REPO_PATH + "/data/"
DATA_STARTTIME = '2017-01-01 00:00:00'
DATA_ENDTIME   = '2018-03-15 23:59:59'

In [3]:
# Create minute-level index for df based on beginning and end of data range
datetimes = pd.date_range(DATA_STARTTIME, DATA_ENDTIME, freq='min')
datetimes[0:10]

DatetimeIndex(['2017-01-01 00:00:00', '2017-01-01 00:01:00',
               '2017-01-01 00:02:00', '2017-01-01 00:03:00',
               '2017-01-01 00:04:00', '2017-01-01 00:05:00',
               '2017-01-01 00:06:00', '2017-01-01 00:07:00',
               '2017-01-01 00:08:00', '2017-01-01 00:09:00'],
              dtype='datetime64[ns]', freq='T')

In [4]:
# Create df with datetimes range as index
df_template = pd.DataFrame(index=datetimes)
df_template.index.name = 'dt'

In [5]:
# Import csv data
def import_data(exchange, periods):
    # Column names (the raw CSV files have no header row)
    COLS = ['dt'
                , exchange + '_low'
                , exchange + '_high'
                , exchange + '_open'
                , exchange + '_close'
                , exchange + '_coin_vol']

    # Periods of data
    PERIODS = periods  # ['2017', '2018_through_03-15']

    # Read each period's file, then stack them into one exchange df
    frames = []
    for q in PERIODS:
        df_tmp = pd.read_csv(DATA_PATH + str(q) + '_' + exchange + '_60sec_candles.csv', header=None)
        df_tmp.columns = COLS
        frames.append(df_tmp)
    # pd.concat is used because DataFrame.append was removed in pandas 2.0
    df_exchange = pd.concat(frames)
    df_exchange['dt'] = pd.to_datetime(df_exchange['dt'], unit='s')
    df_exchange = df_exchange.set_index('dt')
    
    # Overlay data on template df (with _every_ minute in series, not just those traded)
    return df_template.join(df_exchange, how='outer')

df_btc = import_data('USD-BTC', ['2017', '2018_through_03-15'])

Let's take a look at what the raw data look like:

In [6]:
display(df_btc.tail())

| dt | USD-BTC_low | USD-BTC_high | USD-BTC_open | USD-BTC_close | USD-BTC_coin_vol |
|---|---|---|---|---|---|
| 2018-03-15 23:55:00 | 8247.78 | 8254.99 | 8247.78 | 8254.99 | 22.1226 |
| 2018-03-15 23:56:00 | 8254.99 | 8258.11 | 8255.0 | 8255.0 | 8.3839 |
| 2018-03-15 23:57:00 | 8255.0 | 8260.0 | 8255.0 | 8260.0 | 4.3908 |
| 2018-03-15 23:58:00 | 8259.99 | 8260.0 | 8259.99 | 8259.99 | 1.8671 |
| 2018-03-15 23:59:00 | 8259.99 | 8260.0 | 8260.0 | 8259.99 | 2.7873 |


### Sanity Check Raw Data
See what proportion of minutes over the time series actually have trades. Within candles, how much does price tend to vary?

In [7]:
# Density of data
print('Total minutes over time series: {}'.format(df_btc.shape[0]))

# Proportion of minutes with a trade
print('% minutes in time series with {} trades: {}'.format('USD-BTC', df_btc['USD-BTC_low'].notna().sum()/df_btc.shape[0] ))


Total minutes over time series: 632160
% minutes in time series with USD-BTC trades: 0.9785418248544672


In [8]:
# % of minute candles with >0 volume in which high and low price are NOT the same
# Range of w/i minute price variation: high / low
# Note: the more variation of price w/i a candle, the more likely the USD volume estimate will be incorrect
ex = 'USD-BTC'
high_low_magnitude = (df_btc.loc[:, ex + '_high'] / df_btc.loc[:, ex + '_low']) - 1

print('% of {} minute trade candles w/ multiple prices : {}'.format(ex, (high_low_magnitude[high_low_magnitude.notna()] != 0.0).sum() / high_low_magnitude.notna().sum()))
print('Max {} (high - low)/low : {}'.format(ex, high_low_magnitude.max()))
print('Mean {} (high - low)/low : {}'.format(ex, high_low_magnitude.mean()))
print('Median {} (high - low)/low : {}'.format(ex, high_low_magnitude.median()))

del ex, high_low_magnitude

% of USD-BTC minute trade candles w/ multiple prices : 0.9128557456817465
Max USD-BTC (high - low)/low : 0.3184444444444443
Mean USD-BTC (high - low)/low : 0.0010631585507540806
Median USD-BTC (high - low)/low : 0.00039111883691544946


97.9% of minutes in the time series have trades. Of those minutes with trades, 91.3% have multiple prices within the minute. The largest within-minute difference between high and low price is 31.8%, the mean is about 0.1%, and the median is about 0.04%. Large variation is possible when, for instance, a big market order is placed and, in order to fulfill the volume, the market moves deep into the outstanding limit order book. 

The technique used to calculate the USD volume of a minute is (open_price + close_price) / 2 * coin_volume. This will be inaccurate to the degree that within-minute prices are not distributed uniformly. On average, it should be pretty close.

### Clean and Aggregate Data
```ml4t.TradeData()``` is used to clean and aggregate the raw data. The ```TradeData.clean_df_m()``` method will perform the following:
 - Fill forward missing data
  - For periods without trades, the low, high, open, and close price are set equal to the most recent closing price. Volume is set to 0.
 - USD volume is calculated as ```(open_price + close_price) / 2 * coin_volume```
 - Period return percentage is calculated as ```(close_price - open_price) / open_price```
 - Aggregations of the data are created as the following attributes of the object:
  - Minute-level: ```TradeData.df_m```
  - Hour-level: ```TradeData.df_h```
  - Day-level: ```TradeData.df_d```

Note: there may be leading NaNs in these series which can cause issues down the line. A note on how many leading NaN periods there are will be printed.
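The cleaning rules listed above can be sketched in plain pandas as follows. This is an illustration of the described steps on a tiny made-up frame, not the actual ```ml4t``` implementation; the unprefixed column names are assumptions chosen for brevity.

```python
import numpy as np
import pandas as pd

# Four made-up minute candles; minutes 2 and 4 had no trades
idx = pd.date_range("2017-01-01 00:00", periods=4, freq="min")
raw = pd.DataFrame(
    {
        "low": [100.0, np.nan, 101.0, np.nan],
        "high": [102.0, np.nan, 103.0, np.nan],
        "open": [100.0, np.nan, 101.0, np.nan],
        "close": [101.0, np.nan, 103.0, np.nan],
        "coin_vol": [2.0, np.nan, 1.0, np.nan],
    },
    index=idx,
)

clean = raw.copy()
# Carry the most recent closing price forward into periods without trades...
clean["close"] = clean["close"].ffill()
# ...and set the other prices in those periods equal to that close
for col in ["low", "high", "open"]:
    clean[col] = clean[col].fillna(clean["close"])
# No trades means zero volume
clean["coin_vol"] = clean["coin_vol"].fillna(0.0)

# Derived columns, using the formulas given above
clean["usd_vol"] = (clean["open"] + clean["close"]) / 2 * clean["coin_vol"]
clean["return"] = (clean["close"] - clean["open"]) / clean["open"]
print(clean)
```

If the series started with an empty minute, the forward fill would have nothing to carry forward and those rows would stay NaN, which is the "leading NaN" case the note above warns about.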

In [9]:
trade_data_btc = ml4t.TradeData(df_btc)
trade_data_btc.clean_df_m()

Note: df has 0 leading NaN periods.
Minute-level data have been cleaned.
Minute, hour, and day-level data are available as .df_m, .df_h, and .df_d, respectively.


<a id='section4'></a>

[Jump back to the TOC](#toc)
# Section 4: Feature Engineering
Before we attempt to build a price prediction model, we'll create potential model features.

One of the main goals of this feature engineering is to build variables that are normalized. This means we're adjusting the raw values to a common scale that is comparable across the time series. For example, one day the price might move from \$500 to \$1000. Another day it might move from \$10,000 to \$10,500. Although both of these days resulted in a \$500 increase in value, the normalized changes are dramatically different: the first increased 100%, while the second only increased 5%. Percentage increase is a normalized metric and is generally more easily comparable across observations than an absolute metric. Building models based on normalized features tends to make them better at predicting the future because the measurements account for the context in which they were calculated.

As noted above, here we'll build models based strictly off of historical price and volume movements.

Price prediction based strictly on historical price and volume is called "technical analysis" in finance. There is _a lot_ of finance literature on different metrics for this type of prediction. Here, we'll build a few features that are closely related to common normalized statistics across domains, as well as a popular technical analysis indicator:
- Normalized price dispersion: rolling std of closing price / rolling mean of closing price
- Deviation from rolling mean price: (period closing price - rolling mean of closing price) / rolling std of closing price; (note: Bollinger Bands are simply +2/-2 thresholds for this measurement)
- Normalized volume dispersion: usd volume rolling std / usd volume rolling mean
- Deviation from rolling mean USD volume: (period usd volume - rolling mean of usd volume) / rolling std of usd volume
- Rolling mean return
- Boolean for if previous period return was positive
- \# of consecutive periods in which the return has stayed positive
- \# of consecutive periods in which the return has stayed negative (we won't be including interaction terms in our models, so we want both the negative and positive version of this metric)
- [Stochastic oscillator](https://www.investopedia.com/terms/s/stochasticoscillator.asp): A technical analysis indicator that ranges from 0-100 and aims to measure price momentum
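A few of the features listed above can be sketched with pandas rolling windows. The function below is an illustration, not the ```ml4t.calc_features()``` implementation: the function name is hypothetical, the column names only loosely mirror the ones in the feature table, and the stochastic oscillator uses the common %K definition, 100 * (close - lowest low) / (highest high - lowest low) over the window.

```python
import numpy as np
import pandas as pd


def sketch_features(close, high, low, window=10):
    """Rolling, normalized price features (illustrative only)."""
    mean = close.rolling(window).mean()
    std = close.rolling(window).std()

    feats = pd.DataFrame(index=close.index)
    # Normalized price dispersion: rolling std / rolling mean
    feats["close_stdmag"] = std / mean
    # Deviation from rolling mean, measured in rolling standard deviations
    feats["close_stds"] = (close - mean) / std
    # Stochastic oscillator (%K): where the close sits in the window's range
    lowest = low.rolling(window).min()
    highest = high.rolling(window).max()
    feats["so"] = 100 * (close - lowest) / (highest - lowest)
    return feats


# Made-up, steadily rising prices so the numbers are easy to check by hand
close = pd.Series(np.linspace(100, 119, 20))
feats = sketch_features(close, high=close + 1, low=close - 1, window=10)
print(feats.tail())
```

Each feature at period t only uses data up to and including period t, and the first `window - 1` rows are NaN because the rolling window is not yet full.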

### If you want to extend this work:
- Create additional features with these data (see papers on "technical indicators" for ideas)
- Explore the different sources of data mentioned in Section 2. All of these features are descriptions of what's happening to _that_ cryptocurrency. An obvious next step is to add _between_ cryptocurrency features. Then, explore relevant data outside of cryptocurrencies.
- The number of periods over which rolling statistics are calculated is something that should be sensitivity tested. The default in ```.calc_features()``` is 10, though this can be changed and multiple rolling trends can be calculated in the same feature set. In your own work, once you have found a good model, try re-specifying it with different rolling feature windows to see if you can improve performance.


In [10]:
Y_btc, X_btc = ml4t.calc_features(trade_data_btc.df_h, rolling_periods=[10], y_binary=False)

Here is what our features data look like:

In [11]:
X_btc.head(5)

| dt | USD-BTC_return_lag1 | USD-BTC_return_pos_momentum | USD-BTC_return_neg_momentum | USD-BTC_close_stdmag_10 | USD-BTC_close_stds_10 | USD-BTC_return_rm_10 | USD-BTC_usd_vol_stdmag_10 | USD-BTC_usd_vol_stds_10 | USD-BTC_so_10 |
|---|---|---|---|---|---|---|---|---|---|
| 2017-01-01 10:00:00 | 0.0013 | 3.0 | 0.0 | 0.0012 | 1.1151 | -0.0003 | 0.4104 | -1.2532 | 52.9347 |
| 2017-01-01 11:00:00 | 0.0048 | 4.0 | 0.0 | 0.0023 | 2.4564 | 0.0005 | 0.3543 | 0.0179 | 54.763 |
| 2017-01-01 12:00:00 | 0.0 | 5.0 | 0.0 | 0.0029 | 1.7826 | 0.0005 | 0.371 | -1.0882 | 69.8106 |
| 2017-01-01 13:00:00 | -0.0005 | 0.0 | 0.0 | 0.0032 | 1.2841 | 0.0007 | 0.4112 | -0.9336 | 85.4991 |
| 2017-01-01 14:00:00 | 0.0037 | 0.0 | 0.0 | 0.0038 | 1.7251 | 0.0011 | 0.412 | 1.1632 | 91.5741 |


<a id='section5'></a>

[Jump back to the TOC](#toc)
# Section 5: Split Data

Below, we'll split the cleaned data into a training set, a cross validation set, and an out-of-sample test set. Then, we'll take a look at the training and cross validation sets.

In building and assessing price models, it's crucial to not let information about the future get brought back in time and artificially inflate the performance of a model in a way that isn't possible in the real world. For example, using the i to i+100 period mean price to predict the price at i+2 is using information about the future that obviously wouldn't be available in actual trading. In this spirit, data are split according to the principle of roll-forward cross-validation, which says that all data used to assess the performance of a model should come after the training data. 

In this notebook, we'll use the first 90% to train and cross validate a model. We'll set aside the last 10% as an out-of-sample test set to assess "real world" performance. Within that training and cross-validation set, the first 80% of the data will be used as training data to specify models and the last 20% will be cross-validation data used to tune parameters.
- More explicitly: 
    - Training data: January 1, 2017 - November 11, 2017
    - Cross-validation data: November 12, 2017 - January 29, 2018
    - Real-world performance test data: January 30, 2018 - March 15, 2018
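The 90/10 and then 80/20 split described above amounts to cutting a time-ordered series at two positions, with no shuffling. Here is a minimal sketch; the `chronological_split` helper is hypothetical and only returns positional ranges, which you could pass to ```.iloc``` on the feature and target frames.

```python
def chronological_split(n_obs, test_pct=0.1, val_pct=0.2):
    """Return (train, val, test) position ranges for time-ordered observations.

    The last test_pct of rows become the out-of-sample test set. Of the
    remaining rows, the last val_pct become the cross-validation set.
    """
    i_test = int(n_obs * (1 - test_pct))
    i_val = int(i_test * (1 - val_pct))
    return range(0, i_val), range(i_val, i_test), range(i_test, n_obs)


train, val, test = chronological_split(1000)
print(len(train), len(val), len(test))
```

Because every validation row comes after every training row, and every test row after every validation row, no information from the future can leak backwards into model fitting.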

### Balancing Positive and Negative Return Periods
The ```.split_test_train()``` method below forces the training dataset to have the same number of positive and negative return periods. It does this by oversampling the minority class. The observations of the less-likely return type are randomly duplicated until there are as many positive as negative return periods. This helps avoid biasing the model towards one type of outcome. (This oversampling of the minority can be easily toggled off in the method if you prefer.)
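Oversampling the minority class can be sketched in a few lines. This is an illustration of the idea rather than the library's code: the helper name is hypothetical, and treating a return of exactly zero as "not positive" is an assumption.

```python
import random


def oversample_minority(returns, seed=64):
    """Return row positions with the minority class randomly duplicated.

    returns: list of period returns. Positions of the rarer class (positive
    vs. non-positive) are re-drawn at random until both classes are equal.
    """
    rng = random.Random(seed)
    pos = [i for i, r in enumerate(returns) if r > 0]
    neg = [i for i, r in enumerate(returns) if r <= 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        return pos + neg  # only one class present, nothing to duplicate
    n_extra = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return sorted(pos + neg + extra)


returns = [0.01, 0.02, -0.01, 0.005, 0.03, -0.002, 0.004]
balanced = oversample_minority(returns)
n_pos = sum(1 for i in balanced if returns[i] > 0)
n_neg = len(balanced) - n_pos
print(n_pos, n_neg)
```

Only the training rows should be resampled this way. Duplicating rows in the validation or test sets would distort the error estimates.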

In [12]:
# Create 90-10 train/test split
i_split = int(Y_btc.shape[0] * (1 - 0.1))

# Y
Y_btc_model = Y_btc.iloc[0:i_split,]
Y_btc_test = Y_btc.iloc[i_split:,]

# X
X_btc_model = X_btc.iloc[0:i_split,]
X_btc_test = X_btc.iloc[i_split:,]

In [13]:
Y_btc_train, X_btc_train, Y_btc_val, X_btc_val = ml4t.split_test_train(Y_btc_model
                                                                         , X_btc_model
                                                                         , test_pct = 0.2
                                                                         , upsample = True)

<a id='section6'></a>

[Jump back to the TOC](#toc)
# Section 6: Predicting Price
In this section, we make several attempts to build a model that can predict future returns based on historical data. The high-level modeling strategy is as follows:

### Continuous Dependent Variable
The thing we'll try to predict here is **return percentage for the next hour**. This is a continuous value. 

Alternatively, we could try to model this as a classification problem (_i.e.,_ will the price an hour from now be higher or not). Classification problems have a really nice, intuitive set of performance criteria. In this situation, however, transaction fees are meaningful relative to the typical period change. In order to build a successful strategy with market orders we have to be able to say something about _how large_ of a return we expect in the future, not just the direction.

### Measure of Accuracy
There are arguably several measures of accuracy we could focus on here. A more extensive analysis should certainly explore various ways of assessing fit. Here, however, we'll focus on Mean Absolute Error because it relates directly to the percentage fees associated with market orders. After all, a trade can't be profitable unless it is both directionally correct and a large enough magnitude to more than cover fees.
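For reference, Mean Absolute Error is just the average size of the prediction misses, in the same units as the returns. The sketch below computes it by hand on made-up numbers and sets it next to a fee rate; the `mae` helper is written out for illustration and does the same job as ```sklearn.metrics.mean_absolute_error```.

```python
def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted returns."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up hourly returns as fractions (0.004 means 0.4%)
actual = [0.004, -0.012, 0.020, -0.003]
predicted = [0.001, -0.002, 0.010, 0.002]

error = mae(actual, predicted)
fee = 0.003  # the roughly 0.3% market-order fee mentioned earlier
print("MAE: {:.4f}, fee: {:.4f}".format(error, fee))
```

When the typical miss is larger than the fee, as in this made-up case, a predicted return that only barely exceeds the fee is not much of a basis for trading.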

You can also imagine a strategy where we're not concerned with the mean error, but rather identifying some subset of period predictions (ideally large ones) that we could model very well, even if the average prediction isn't very good. Then, we could just trade during periods that matched that subset. That gets complicated quickly, so we're not going to attempt that here.

### Considering Fees and Price Impact of Trading
Predicting the future return is just one component of building a complete trading strategy. In the following section we'll look at making trading decisions in the context of transaction costs and the fact that trading impacts the market price.

## Random Forest
We'll use random forests to try to predict future returns. This ensemble learner aggregates the predictions of many decision trees, each grown with some randomness (a different random sample of the data and, typically, a random subset of features considered at each split). Every tree is created from a bootstrap aggregated ("bagged") set of training observations the size of the training data. This means that every tree in the forest is created on its own training dataset. Each of those datasets is created by sampling, with replacement, from the original training set for a "bag" that's as big as the original training set.

When we use the random forest to predict an observation, each of the trees in the forest is used to calculate its own answer, then the answers are averaged. Random forests are pleasantly intuitive and a great first model to try for many machine learning problems. 
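The bagging-and-averaging mechanics can be sketched without any tree code at all. In the toy example below, each "tree" is replaced by a stand-in that simply predicts the mean of its own bag, so this shows only the resampling and averaging steps, not how a decision tree is grown. All names are hypothetical.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement: one 'bag'."""
    return [rng.choice(rows) for _ in rows]


def bagged_prediction(y_train, n_estimators=24, seed=64):
    """Average the predictions of n_estimators stand-in models."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_estimators):
        bag = bootstrap_sample(y_train, rng)
        # Stand-in for a fitted tree: predict the mean of this bag
        predictions.append(sum(bag) / len(bag))
    # The ensemble's answer is the average of the individual answers
    return sum(predictions) / len(predictions)


y_train = [0.004, -0.012, 0.020, -0.003, 0.007, -0.001]
one_bag = bootstrap_sample(y_train, random.Random(0))
ensemble = bagged_prediction(y_train)
print(one_bag)
print(ensemble)
```

Because sampling is done with replacement, a bag usually repeats some observations and omits others, which is what makes the individual models differ from one another.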

There are two main hyperparameters of the model we'll focus on tuning:
- Number of trees in the forest
- Minimum leaf size

There could certainly be interaction between these two criteria, but we'll evaluate them sequentially here.



In [14]:
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor

The calibration below takes a long time to run. For evaluation purposes, leave the calibration code commented out and just run the 24 tree, 5 leaf size version which turns out to be about the best in cross validation.

In practice, you should play around with the commented out code, however!

In [15]:
## Calibrate number of trees with lowest cross-validation error
## Note: this calibration takes a while to run!
#from sklearn.ensemble import AdaBoostRegressor
#
#tree_size = [i for i in range(20, 51) if not (i % 2)]  # only run even values to speed things up
#tree_size_train_mae = []
#tree_size_val_mae = []
#
#for t in tree_size:
#    rf = AdaBoostRegressor(RandomForestRegressor(min_samples_leaf = 5
#                               , criterion='mae'
#                               , random_state = 64)
#                           , n_estimators = t
#                           , random_state = 64)
#    rf.fit(X_btc_train, Y_btc_train)
#    tree_size_train_mae.append(mean_absolute_error(Y_btc_train, rf.predict(X_btc_train)))
#    tree_size_val_mae.append(mean_absolute_error(Y_btc_val, rf.predict(X_btc_val)))
#    print("finished: {}; validation mae: {}".format(t, tree_size_val_mae[-1]))

In [16]:
# Plot training and cross-validation error from the calibration above
#plt.plot(tree_size, tree_size_val_mae)
#plt.plot(tree_size, tree_size_train_mae)
#plt.legend(['Validation MAE', 'Train MAE'])
#plt.xlabel('Number of Estimators')
#plt.ylabel('MAE')
#plt.show()

In [17]:
## Calibrate size of leaf with lowest cross-validation error
## Note: this calibration takes a while to run!
#from sklearn.ensemble import AdaBoostRegressor
#
#leaf_size = [i for i in range(2, 11) if not (i % 2)]  # only run even values to speed things up
#leaf_size_train_mae = []
#leaf_size_val_mae = []
#
#for l in leaf_size:
#    rf = AdaBoostRegressor(RandomForestRegressor(min_samples_leaf = l
#                               , criterion='mae'
#                               , random_state = 64)
#                           , n_estimators = 24
#                           , random_state = 64)
#    rf.fit(X_btc_train, Y_btc_train)
#    leaf_size_train_mae.append(mean_absolute_error(Y_btc_train, rf.predict(X_btc_train)))
#    leaf_size_val_mae.append(mean_absolute_error(Y_btc_val, rf.predict(X_btc_val)))
#    print("finished: {}; validation mae: {}".format(l, leaf_size_val_mae[-1]))

In [18]:
from sklearn.ensemble import AdaBoostRegressor
# By period error using tree
rf = RandomForestRegressor(n_estimators = 24
                           , min_samples_leaf = 5
                           , criterion='mae'  # named 'absolute_error' in scikit-learn >= 1.0
                           , random_state = 64)
rf.fit(X_btc_train, Y_btc_train)
print("MAE of CV set with trained random forest is: {}".format(mean_absolute_error(Y_btc_val, rf.predict(X_btc_val))))

MAE of CV set with trained random forest is: 0.011035390504066293


In [19]:
# For perspective, guessing the training mean return in the val set results in the following MAE:
mean_absolute_error(Y_btc_val, [Y_btc_train.mean() for i in range(Y_btc_val.shape[0])])

0.010996089662615375

<a id='section7'></a>

[Jump back to the TOC](#toc)
# Section 7: Assess Strategy

It looks like, by total Mean Absolute Error, the random forest model is no better at predicting the next hour's return than the simple average of the training data.

Perhaps, at the extremes, however, it's good enough at identifying large impending increases or decreases that it would be a useful model to trade based on.

Below we'll identify where the random forest predicts movements of +1.0% or -1.0% returns over the next hour. In those instances, we'll buy or sell respectively.

Then, we'll see how this trading strategy would have performed with real pricing data.

For simplicity, we'll start our back-testing portfolio with \$10,000 in cash. Then, we'll take one of two positions: either long or short Bitcoin. As the order table below shows, the first order is for 1 Bitcoin and each later order is for 2 Bitcoin, which flips the position between long 1 Bitcoin and short 1 Bitcoin (we'll assume we're allowed to hold these volumes "on margin" even if their total value happens to exceed our cash at any point). 

To more accurately reflect _real_ trading mechanics, we impose two types of costs on each transaction:
- A 0.29% transaction fee based on the dollar value of the trade
- A 0.1% price impact: purchase prices for our portfolio will be _higher_ than the historical market rate and sales prices will be _lower_ than the market rate by 0.1%; this is to reflect the fact that our trading would have actually impacted the market price if we'd been active at that time
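To see how these two costs combine, here is a minimal sketch of the cash change from a single order. It assumes the fee is charged on the impact-adjusted fill price; ```ml4t.compute_portvals()``` may apply them differently, and the `trade_cash_flow` helper is hypothetical.

```python
def trade_cash_flow(order, coins, market_price, commission=0.0029, impact=0.001):
    """Cash change from one order after price impact and fee (assumed mechanics)."""
    if order == "BUY":
        fill_price = market_price * (1 + impact)  # we pay above the market rate
        gross = -coins * fill_price
    elif order == "SELL":
        fill_price = market_price * (1 - impact)  # we receive below the market rate
        gross = coins * fill_price
    else:
        raise ValueError("order must be 'BUY' or 'SELL'")
    fee = commission * coins * fill_price
    return gross - fee


# Buy and immediately sell 1 Bitcoin at an unchanged $10,000 market price
buy = trade_cash_flow("BUY", 1.0, 10000.0)
sell = trade_cash_flow("SELL", 1.0, 10000.0)
round_trip = buy + sell
print("Round-trip cost: ${:,.2f}".format(-round_trip))
```

Under these assumptions a round trip at a flat price costs about 0.78% of the position's value, so a predicted move has to be larger than that before a trade can pay for itself.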

In [20]:
# Create df from rf model predictions of test set data
dfprediction = pd.DataFrame(data={'y_hat': rf.predict(X_btc_test)}, index=X_btc_test.index)

In [21]:
# Where would trades be if only bought when predicted return > 1% and sold when < -1%
to_trade = np.abs(dfprediction['y_hat']) > 0.01
to_trade.sum() # will result in 29 trades
dfprediction[to_trade]

| dt | y_hat |
|---|---|
| 2018-02-01 17:00:00 | 0.011 |
| 2018-02-01 20:00:00 | 0.0127 |
| 2018-02-02 00:00:00 | -0.0103 |
| 2018-02-02 16:00:00 | -0.0103 |
| 2018-02-04 16:00:00 | 0.0146 |
| 2018-02-05 15:00:00 | 0.01 |
| 2018-02-05 19:00:00 | 0.0123 |
| 2018-02-05 20:00:00 | 0.0237 |
| 2018-02-06 06:00:00 | -0.0163 |
| 2018-02-06 08:00:00 | 0.0231 |


Now, we build buy or sell orders based on these predicted price moves.

**Note: the** ```ml4t.build_orders()``` **method can be pretty easily changed to output order calls into whatever format your preferred exchange API requires.**

In [22]:
# Build orders based on this threshold
dforders = ml4t.build_orders(dfprediction, abs_threshold=0.01, startin=False, symbol='USD-BTC')
dforders

| dt | Order | Shares | Symbol |
|---|---|---|---|
| 2018-02-01 16:00:00 | BUY | 1.0 | USD-BTC |
| 2018-02-01 23:00:00 | SELL | 2.0 | USD-BTC |
| 2018-02-04 15:00:00 | BUY | 2.0 | USD-BTC |
| 2018-02-06 05:00:00 | SELL | 2.0 | USD-BTC |
| 2018-02-06 07:00:00 | BUY | 2.0 | USD-BTC |
| 2018-02-06 09:00:00 | SELL | 2.0 | USD-BTC |
| 2018-02-07 02:00:00 | BUY | 2.0 | USD-BTC |
| 2018-02-28 14:00:00 | SELL | 2.0 | USD-BTC |
| 2018-03-09 04:00:00 | BUY | 2.0 | USD-BTC |
| 2018-03-12 18:00:00 | SELL | 2.0 | USD-BTC |


Finally, we use these trades to simulate actual trading on our hold-out testing data.

In [23]:
portvals = ml4t.compute_portvals(dforders=dforders, dfprices=trade_data_btc.df_h, trend=X_btc_test.index
                                , start_val=10000, commission=0.0029, impact=0.001)
portvals

| dt | 0 |
|---|---|
| 2018-01-31 03:00:00 | 10000.0 |
| 2018-01-31 04:00:00 | 10000.0 |
| 2018-01-31 05:00:00 | 10000.0 |
| 2018-01-31 06:00:00 | 10000.0 |
| 2018-01-31 07:00:00 | 10000.0 |
| 2018-01-31 08:00:00 | 10000.0 |
| 2018-01-31 09:00:00 | 10000.0 |
| 2018-01-31 10:00:00 | 10000.0 |
| 2018-01-31 11:00:00 | 10000.0 |
| 2018-01-31 12:00:00 | 10000.0 |


Over the test period, this results in a 36.89% return. The default buy-and-hold return over this period is as follows:

In [24]:
first_price = trade_data_btc.df_h.loc[Y_btc_test.index[0]]
last_price = trade_data_btc.df_h.loc[Y_btc_test.index[-1]]
(last_price - first_price) / first_price

USD-BTC_low        -0.1607
USD-BTC_high       -0.1678
USD-BTC_open       -0.1678
USD-BTC_close      -0.1652
USD-BTC_coin_vol   -0.3037
USD-BTC_usd_vol    -0.4175
USD-BTC_return     -0.2889
dtype: float64

The buy-and-hold return is the change in closing price, the `USD-BTC_close` row above: -16.52%. (The other rows are changes in volume or in the per-period return, not returns on a held position.) Clearly the difference between 36.89% and -16.52% is enormous. In fact, it's so big that it seems suspicious. We shouldn't take this as evidence that our simple strategy is obviously a winner. We almost certainly just got lucky here because nobody is going to create a strategy that returns 50%+ more than the market, much less with such a simple model and feature set. Instead, we should treat this as a fun coincidence, and a motivating example to get us to keep modeling!