In this notebook I demonstrate my pipeline for training a machine learning model for algo trading. 

This notebook walks you through:
* loading bars data
* extracting features based on technical indicators
* labelling target variables for modelling using the triple barrier method
* dynamically adjusting stop loss and take profit levels
* selecting useful features using recursive feature elimination with random forests OR based on your own reasoning
* training an XGBoost model to make buy decisions
* backtesting and evaluating model performance


In [1]:
import pandas as pd
from stonks import feature_extraction, feature_selection, backtest, labelling, models, visualisation
import warnings
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)
warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)
warnings.simplefilter(action='ignore', category=pd.errors.SettingWithCopyWarning)

Loading stock price data.
I get minute bar data from the Alpaca API. For your convenience, this notebook loads the data from a CSV file instead.
This gives you 1-minute bars for PLTR (Palantir) starting from 2022-05-01.

In [2]:
hist = pd.read_csv('data/bars_PLTR.csv', index_col='timestamp')
hist.index = pd.DatetimeIndex(hist.index)
hist.head()

timestamp,open,high,low,close,volume,trade_count,vwap
2022-05-02 09:30:00+00:00,10.38,10.39,10.28,10.33,583607,1561,10.333511
2022-05-02 09:31:00+00:00,10.325,10.3263,10.26,10.2699,201313,731,10.280256
2022-05-02 09:32:00+00:00,10.26,10.295,10.2175,10.29,271984,1168,10.254898
2022-05-02 09:33:00+00:00,10.28,10.29,10.22,10.23,152645,685,10.259573
2022-05-02 09:34:00+00:00,10.23,10.285,10.1918,10.28,311727,1016,10.237203


In [3]:
# If you would like to get your own minute bars, you can create a Free Alpaca account and use this code:
# from alpaca_trade_api.rest import REST
# from alpaca.data.timeframe import TimeFrame, TimeFrameUnit

# API_KEY = "YOURKEY"
# SECRET_KEY = "YOURKEY"
# api = REST(key_id=API_KEY,secret_key=SECRET_KEY,base_url="https://paper-api.alpaca.markets")
# symbol = ['PLTR']
# hist=api.get_bars(symbol,timeframe=TimeFrame(amount=1, unit=TimeFrameUnit.Minute), start='2022-05-01').df
# hist = hist.drop(columns = ['symbol'])

Now that we have bars data we can extract features that we will use to train our machine learning model.

The features consist of various technical indicators such as moving averages, volatility indicators (rolling standard deviation, ATR), oversold/overbought indicators (RSI), etc.

I create features algorithmically by iterating over different time windows. For example, macd_12_26 compares exponential moving averages over the last 12 and 26 minutes, but the window pair could equally be [26, 50], [100, 1000] or [100, 9000] minutes.

The features are added to the dataframe and given a descriptive feature name. For example:
* macd_12_26 - is the moving average convergence divergence for the 12 and 26 minute windows
* vol_std_pct_50 - is volatility measured as a rolling exponentially averaged standard deviation over 50 minutes
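
As a rough illustration of generating features over several window pairs, here is a minimal sketch using pandas. The helper `add_macd_features`, the synthetic price series and the chosen window pairs are assumptions for the example; the internals of `feature_extraction.get_features` are not shown in this notebook and may differ.

```python
import numpy as np
import pandas as pd

def add_macd_features(df, window_pairs, price_col="close"):
    """Add one MACD column per (fast, slow) window pair (hypothetical helper)."""
    out = df.copy()
    for fast, slow in window_pairs:
        ema_fast = out[price_col].ewm(span=fast, adjust=False).mean()
        ema_slow = out[price_col].ewm(span=slow, adjust=False).mean()
        # Descriptive name encodes the indicator and both windows
        out[f"macd_{fast}_{slow}"] = ema_fast - ema_slow
    return out

# Synthetic minute closes, for illustration only
rng = np.random.default_rng(0)
idx = pd.date_range("2022-05-02 09:30", periods=500, freq="min")
bars = pd.DataFrame({"close": 10 + rng.normal(0, 0.01, 500).cumsum()}, index=idx)

features = add_macd_features(bars, [(12, 26), (26, 50), (50, 200)])
print(features.filter(like="macd_").tail())
```

Encoding the windows in the column name keeps hundreds of generated features traceable back to their definition.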

In [4]:
# Warnings were silenced in the first cell so that pandas does not print a long list of them for each column added with .loc
# window_limit ensures that not too many features are generated
hist = feature_extraction.get_features(hist, window_limit=300)

In [5]:
hist.tail()

timestamp,open,high,low,close,volume,trade_count,vwap,returns,volume_pct_change,minutes_open,date,datetime,market_open,close_market_open,datetime_market_open,pct_change_daily,30_min_bin,first_30_min,last_30_min,ATR_1,RSI_1,ema_1,ema_volume_1,vol_std_1,vol_std_volume_1,vol_pct_1,vol_pct_volume_1,returns_vol_std_1,swing_low_1,swing_high_1,swing_low_pct_1,swing_high_pct_1,min_vol_std_1,max_vol_std_1,close_ema_vol_std_1,close_ema_pct_1,ema_vol_std_shift_1,vol_std_pct_1,vol_std_pct_ratio_1,returns_ema_1,exp_ema_pct_shift_1,exp_ema_pct_abs_shift_1,exp_ema_pct_dir_shift_1,exp_pct_volatility_1,exp_pct_volatility_old_1,ATR_2,RSI_2,ADX_2,ema_2,ema_volume_2,...,macd_26_100_signal,volume_macd_26_100,volume_macd_26_100_signal,ema_volume_vol_std_26_100,ema_volume_vol_std_new_26_100,volume_macd_pct_26_100,ema_vol_std_shift_26_100,ema_vol_std_shift_new_26_100,macd_pct_26_100,vol_std_pct_ratio_26_100,macd_50_100,macd_50_100_signal,volume_macd_50_100,volume_macd_50_100_signal,ema_volume_vol_std_50_100,ema_volume_vol_std_new_50_100,volume_macd_pct_50_100,ema_vol_std_shift_50_100,ema_vol_std_shift_new_50_100,macd_pct_50_100,vol_std_pct_ratio_50_100,macd_50_200,macd_50_200_signal,volume_macd_50_200,volume_macd_50_200_signal,ema_volume_vol_std_50_200,ema_volume_vol_std_new_50_200,volume_macd_pct_50_200,ema_vol_std_shift_50_200,ema_vol_std_shift_new_50_200,macd_pct_50_200,vol_std_pct_ratio_50_200,macd_60_150,macd_60_150_signal,volume_macd_60_150,volume_macd_60_150_signal,ema_volume_vol_std_60_150,ema_volume_vol_std_new_60_150,volume_macd_pct_60_150,ema_vol_std_shift_60_150,ema_vol_std_shift_new_60_150,macd_pct_60_150,vol_std_pct_ratio_60_150,returns_EMA_20,returns_EMA_50,returns_EMA_80,returns_EMA_200,returns_EMA_1200,candle,candle_pct
2023-09-14 15:56:00+00:00,15.81,15.82,15.8,15.815,135683,645,15.810052,0.000943,-0.579341,386,2023-09-14,2023-09-14 15:56:00+00:00,2023-09-14 09:30:00+00:00,15.72,2023-09-14 09:30:00+00:00,0.604326,12,0,1,0.02,100.0,15.815,135683.0,0.010112,99852.71,0.06397,54.453957,0.000707,15.815,15.815,100.0,100.0,0.0,0.0,0.0,0.0,1.473484,0.000639,1.652933,0.000473,4e-05,0.000553,-4e-05,0.000621,0.000642,0.024013,84.758237,69.780324,15.809802,178290.7,...,1,39829.602996,1,0.427968,0.528489,32.749058,0.430885,0.834881,0.141242,0.515374,0.013499,1,19398.426918,1,0.208436,0.274531,19.170408,0.260799,0.38262,0.085537,0.681032,0.028604,1,29718.365237,1,0.208328,0.420582,29.369041,0.380522,0.810763,0.18125,0.468487,0.01922,1,21890.650486,1,0.177078,0.306144,22.847503,0.293635,0.499343,0.121814,0.587327,9.7e-05,6.004226e-05,4.6e-05,2.6e-05,1.4e-05,0.005,0.031626
2023-09-14 15:57:00+00:00,15.81,15.83,15.81,15.815,337571,1019,15.815863,0.0,1.487939,387,2023-09-14,2023-09-14 15:57:00+00:00,2023-09-14 09:30:00+00:00,15.72,2023-09-14 09:30:00+00:00,0.604326,12,0,1,0.02,100.0,15.815,337571.0,0.008487,117906.4,0.053674,45.266605,0.000578,15.815,15.815,100.0,100.0,0.0,0.0,0.0,0.0,0.0,0.000537,0.839247,0.000236,3.9e-05,0.000547,-3.9e-05,0.00062,0.000539,0.022006,84.758237,79.037335,15.813267,284477.6,...,1,50760.978045,1,0.527122,0.584948,36.885707,0.44724,0.858515,0.146903,0.520181,0.013881,1,23603.341977,1,0.245106,0.300549,21.368362,0.267608,0.390516,0.087952,0.684665,0.029299,1,36340.470182,1,0.253245,0.462735,32.899422,0.389648,0.824248,0.185636,0.471854,0.01968,1,26325.126668,1,0.21046,0.336275,25.376422,0.300452,0.508171,0.124718,0.590504,9.2e-05,5.886496e-05,4.5e-05,2.6e-05,1.4e-05,0.005,0.031626
2023-09-14 15:58:00+00:00,15.815,15.82,15.81,15.815,133852,586,15.814584,0.0,-0.603485,388,2023-09-14,2023-09-14 15:58:00+00:00,2023-09-14 09:30:00+00:00,15.72,2023-09-14 09:30:00+00:00,0.604326,12,0,1,0.01,100.0,15.815,133852.0,0.006421,113855.6,0.040608,57.747367,0.000433,15.815,15.815,100.0,100.0,0.0,0.0,0.0,0.0,0.0,0.000406,0.756656,0.000118,3.9e-05,0.000542,-3.9e-05,0.000618,0.000408,0.016003,84.758237,83.665841,15.814422,184060.5,...,1,49551.479452,1,0.51614,0.580937,36.079936,0.461353,0.879658,0.151823,0.523673,0.01423,1,23590.088159,1,0.24572,0.302286,21.18046,0.273796,0.397712,0.090157,0.687809,0.029948,1,36663.4712,1,0.256029,0.46981,32.918452,0.398177,0.836981,0.189735,0.474827,0.02011,1,26564.922521,1,0.212922,0.340911,25.366155,0.306827,0.51644,0.127434,0.593361,8.8e-05,5.771075e-05,4.5e-05,2.6e-05,1.4e-05,0.0,0.0
2023-09-14 15:59:00+00:00,15.815,15.84,15.81,15.825,634831,1728,15.82651,0.000632,3.742783,389,2023-09-14,2023-09-14 15:59:00+00:00,2023-09-14 09:30:00+00:00,15.72,2023-09-14 09:30:00+00:00,0.667939,12,0,1,0.03,100.0,15.825,634831.0,0.008568,279847.3,0.054166,67.2716,0.000439,15.825,15.825,100.0,100.0,0.0,0.0,0.0,0.0,1.167069,0.000541,1.333522,0.000375,4.5e-05,0.000543,-4.5e-05,0.000617,0.000544,0.023002,96.774374,90.850191,15.821474,484574.2,...,1,75570.246253,1,0.683712,0.578776,43.383958,0.482695,0.902452,0.159493,0.534017,0.014742,1,33285.146473,1,0.301143,0.306582,25.234319,0.282454,0.406599,0.093391,0.694027,0.030847,1,51617.78093,1,0.34837,0.475441,39.132757,0.409727,0.850752,0.195408,0.480664,0.020707,1,36572.305613,1,0.276173,0.349819,29.95119,0.315391,0.526146,0.131206,0.59865,0.000114,6.897742e-05,5.2e-05,2.9e-05,1.5e-05,0.01,0.063231
2023-09-14 16:00:00+00:00,15.83,15.83,15.77,15.77,4225103,48,15.829994,-0.003476,5.655477,390,2023-09-14,2023-09-14 16:00:00+00:00,2023-09-14 09:30:00+00:00,15.72,2023-09-14 09:30:00+00:00,0.318066,12,0,1,0.06,0.0,15.77,4225103.0,0.030652,2340970.0,0.194067,100.88,0.002378,15.77,15.77,100.0,100.0,0.0,0.0,0.0,0.0,-1.794335,0.001944,3.589777,-0.00155,1e-05,0.000572,-1e-05,0.00063,0.001946,0.041501,10.005558,72.548767,15.787158,2978260.0,...,1,293925.370142,1,0.6902,0.371371,61.975962,0.44796,0.845208,0.147362,0.529219,0.014145,1,112090.191571,1,0.263212,0.191607,38.331671,0.272251,0.394012,0.089612,0.690352,0.03008,1,170893.422629,1,0.522036,0.292126,58.440711,0.400342,0.837852,0.190557,0.476909,0.020193,1,116268.058032,1,0.321313,0.216451,45.305569,0.308433,0.517314,0.127953,0.595458,-5.7e-05,-5.224009e-07,8e-06,1.1e-05,1.2e-05,-0.06,-0.379027


Now, in order to train the model we need to label desired outcomes (targets).

So far the best method I have found is the triple barrier method (see López de Prado, Advances in Financial Machine Learning).

The triple barrier method considers the future change in stock price over a specified time window, e.g. 50 minutes. If the price hits the upper barrier first (increases by X amount) the bar gets a 1; if it hits the lower barrier first it gets a -1. If it doesn't cross either barrier within the window it gets a 0.

The barriers could be set to a fixed number, such as 1% or 1 dollar above current price. However, I will set them dynamically by considering the recent trend and volatility.
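
To make the labelling rule concrete, here is a minimal sketch of it in NumPy. The function `triple_barrier_labels` is a hypothetical stand-in written for this example; it is not the code behind `labelling.get_barrier`, and it assumes barriers are checked against closing prices only.

```python
import numpy as np

def triple_barrier_labels(close, upper, lower, window):
    """Label each bar 1, -1 or 0 by which barrier is touched first.

    close, upper and lower are equal-length arrays; upper[i] and lower[i] are
    the barrier prices fixed at bar i. Only the next `window` bars are checked.
    """
    close = np.asarray(close, dtype=float)
    labels = np.zeros(len(close))
    for i in range(len(close)):
        future = close[i + 1 : i + 1 + window]
        hit_up = np.flatnonzero(future >= upper[i])
        hit_down = np.flatnonzero(future <= lower[i])
        # Use `window` as a sentinel meaning "never touched in time"
        first_up = hit_up[0] if hit_up.size else window
        first_down = hit_down[0] if hit_down.size else window
        if first_up < first_down:
            labels[i] = 1
        elif first_down < first_up:
            labels[i] = -1
        # otherwise neither barrier was touched: the label stays 0
    return labels

close = np.array([10.0, 10.05, 10.12, 10.02, 9.85, 9.90])
upper = close * 1.01  # fixed 1% barriers, purely for the demo
lower = close * 0.99
print(triple_barrier_labels(close, upper, lower, window=3))
```

The third (vertical) barrier is the end of the window: bars near the end of the data simply have fewer future bars to inspect and tend to be labelled 0.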

In [6]:
# This multiplier gives us the expected volatility over a 50 minute period, expressed as a decimal fraction of price (0.01 = 1%).
# I calculate the standard deviation over the last 50 minutes as a fraction of the 50 minute EMA,
# then I take the exponential moving average of the last 5000 rows of those volatilities.

hist['multiplier_volatility'] = hist['exp_pct_volatility_50']
print('Expected volatility over a 50 minute time period (% change as decimal points):')
print(hist['multiplier_volatility'].describe())
print('')


# To set a reasonable take profit level we can also add the trend to the volatility: by what fraction does the EMA typically move over a 50 minute window?
hist['multiplier'] = hist['exp_ema_pct_abs_shift_50']
print('Expected EMA shift over a 50 minute time period (% change as decimal points):')
print(hist['multiplier'].describe())

Expected volatility over a 50 minute time period (% change as decimal points):
count    134004.000000
mean          0.009144
std           0.002030
min           0.005583
25%           0.008001
50%           0.008723
75%           0.009877
max           0.020173
Name: multiplier_volatility, dtype: float64

Expected EMA shift over a 50 minute time period (% change as decimal points):
count    133954.000000
mean          0.007825
std           0.001627
min           0.003336
25%           0.006859
50%           0.007548
75%           0.008510
max           0.016046
Name: multiplier, dtype: float64
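
For readers without access to the feature code, the volatility multiplier described in the cell comments could be approximated along these lines. This is a sketch under assumptions: the helper name `expected_pct_volatility`, the use of `span` for both EMAs and the synthetic prices are mine, and the actual `exp_pct_volatility_50` column may be computed differently.

```python
import numpy as np
import pandas as pd

def expected_pct_volatility(close, window=50, smoothing_span=5000):
    """Rolling std as a fraction of the EMA, smoothed with a long EMA (assumed recipe)."""
    ema = close.ewm(span=window, adjust=False).mean()
    pct_vol = close.rolling(window).std() / ema
    return pct_vol.ewm(span=smoothing_span, adjust=False).mean()

# Synthetic closes, for illustration only
rng = np.random.default_rng(1)
close = pd.Series(10 + rng.normal(0, 0.01, 2000).cumsum())

vol = expected_pct_volatility(close)
upper = close * (1 + vol)  # dynamic upper barrier
lower = close * (1 - vol)  # dynamic lower barrier
print(vol.describe())
```

Because the barriers scale with recent volatility, quiet periods get tight barriers and turbulent periods get wide ones, which keeps the label classes more balanced than a fixed 1% band would.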


In [7]:
# then we obtain our barriers by scaling the current price up and down by the multipliers.
# I have found that the volatility multiplier alone is sufficient, so the trend multiplier gets a weight of 0.
hist['higher_bound'] = hist.close * (1 + (hist.multiplier * 0 + hist.multiplier_volatility * 1))
hist['lower_bound'] = hist.close * (1 - (hist.multiplier * 0 + hist.multiplier_volatility * 1))

In [8]:
# let's use the triple barrier function to label every minute bar based on whether price will hit our desired target within the next `window` bars (26 minutes here)
window = 26
hist = labelling.get_barrier(hist,hist.close.values,hist.higher_bound.values,hist.lower_bound.values,window=window,side=None)
hist['barrier'].value_counts()

 0.0    86013
-1.0    24022
 1.0    23993
Name: barrier, dtype: int64

In [9]:
# some algorithms, such as Keras LSTMs, work better with non-negative class labels,
# so shift the labels from {-1, 0, 1} to {0, 1, 2} in a single step
hist['barrier'] = hist['barrier'] + 1

Now that we have created our features and target variables, we have to reduce the number of features we are considering, to avoid overfitting and to keep the model interpretable.

We can use different methods for this, such as recursive feature elimination (RFE), genetic algorithms or mutual information. Below is an example using RFE.

Running the cell below takes about 18 minutes on my machine. If you want to skip it, just use the model features from my own run (provided below).
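
If you have not seen RFE before, the following small example shows the idea with scikit-learn directly. It is a sketch on synthetic data, not the implementation of `feature_selection.select_best_features`; the column names, sample size and hyperparameters are arbitrary assumptions chosen so that it runs in a few seconds.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
n = 600
X = pd.DataFrame(rng.normal(size=(n, 12)), columns=[f"feat_{i}" for i in range(12)])
# Synthetic three-class target (0, 1, 2) driven by the first two columns only
y = np.digitize(X["feat_0"] + X["feat_1"], [-0.7, 0.7])

selector = RFE(
    estimator=RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0),
    n_features_to_select=4,
    step=2,  # drop the 2 least important features per round
)
selector.fit(X, y)

# ranking_ == 1 marks the features that survived every elimination round
ranking = pd.Series(selector.ranking_, index=X.columns)
selected = list(ranking[ranking == 1].index)
print(selected)
```

A larger `step` removes more features per round and speeds things up considerably, at the cost of a coarser ranking.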

In [10]:
feature_ranking = feature_selection.select_best_features(hist.drop(columns=['date','datetime','market_open','datetime_market_open']), 
                                       method='RFE', 
                                       estimator_type = 'random_forest', 
                                       #class_weight = {0:5,1:1,2:10},
                                       max_depth = 4, 
                                       n_features_to_select=10,
                                       step = 20)

f = pd.Series(feature_ranking,index=hist.drop(columns=['barrier','date','datetime','market_open','datetime_market_open']).columns)
model_features = list(f.loc[f==1].index)

Fitting estimator with 668 features.
Fitting estimator with 648 features.
Fitting estimator with 628 features.
Fitting estimator with 608 features.
Fitting estimator with 588 features.
Fitting estimator with 568 features.
Fitting estimator with 548 features.
Fitting estimator with 528 features.
Fitting estimator with 508 features.
Fitting estimator with 488 features.
Fitting estimator with 468 features.
Fitting estimator with 448 features.
Fitting estimator with 428 features.
Fitting estimator with 408 features.
Fitting estimator with 388 features.
Fitting estimator with 368 features.
Fitting estimator with 348 features.
Fitting estimator with 328 features.
Fitting estimator with 308 features.
Fitting estimator with 288 features.
Fitting estimator with 268 features.
Fitting estimator with 248 features.
Fitting estimator with 228 features.
Fitting estimator with 208 features.
Fitting estimator with 188 features.
Fitting estimator with 168 features.
Fitting estimator with 148 features.
F

Index(['minutes_open', '30_min_bin', 'last_30_min', 'returns_vol_std_26',
       'volume_macd_50_200', 'volume_macd_pct_50_200', 'volume_macd_60_150',
       'ema_volume_vol_std_60_150', 'ema_volume_vol_std_new_60_150',
       'volume_macd_pct_60_150'],
      dtype='object')

In [14]:
# here are the features you get from RFE
#print(model_features)

['minutes_open', '30_min_bin', 'last_30_min', 'returns_vol_std_26', 'volume_macd_50_200', 'volume_macd_pct_50_200', 'volume_macd_60_150', 'ema_volume_vol_std_60_150', 'ema_volume_vol_std_new_60_150', 'volume_macd_pct_60_150']


In [15]:
# alternatively, skip the RFE cell and use the features I got from my own run:
model_features = ['minutes_open', '30_min_bin', 'returns_vol_std_20',
       'volume_macd_50_100', 'ema_volume_vol_std_50_200',
       'volume_macd_pct_50_200', 'volume_macd_60_150',
       'ema_volume_vol_std_60_150', 'ema_volume_vol_std_new_60_150',
       'volume_macd_pct_60_150']

This example shows that RFE can help identify useful features, but it should not be used on its own.

The algorithm identified a useful pair of time windows: comparing the 60 minute and 150 minute EMAs.

However, it also selected redundant features that share most of their information: 'volume_macd_60_150', 'volume_macd_pct_60_150' and 'ema_volume_vol_std_60_150' are all the same variable (MACD of volume), merely expressed in raw units, as pct, or in standard deviations.
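
One simple way to spot this kind of redundancy is to look at pairwise correlations between the selected features. The sketch below uses a hypothetical helper `drop_correlated`, an arbitrary 0.95 threshold and synthetic columns that merely borrow the naming style used here; it is not part of the stonks package.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.95):
    """Greedily keep columns whose |correlation| with every kept column is below threshold."""
    corr = df.corr().abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] < threshold for k in kept):
            kept.append(col)
    return kept

# Synthetic demo: the second column is a rescaled, slightly noisy copy of the first
rng = np.random.default_rng(0)
base = pd.Series(rng.normal(size=1000))
demo = pd.DataFrame({
    "volume_macd": base,
    "volume_macd_pct": base * 100 + rng.normal(scale=0.5, size=1000),
    "minutes_open": rng.normal(size=1000),
})
print(drop_correlated(demo))
```

Pearson correlation only catches linear relationships; a rank correlation (`df.corr(method="spearman")`) may be a better fit when one feature is a monotonic but non-linear transform of another.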

In [16]:
# add barrier to the list of features as the next function will use it
model_features = model_features + ['barrier']

Now it is time to train our model. I am using gradient boosting to predict which barrier the stock price will hit first.

The backtest_days parameter controls how much of the most recent data to set aside for backtesting. Let's set aside the last 30 days for testing.

Be mindful that the classification metrics are hard to interpret. I usually get a low F1 score for the target class (2), around 0.30, but the win rate may still come out at 50%. Or the win rate may be ~30% with a 2:1 win to loss ratio and decent returns.
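
Holding out the most recent period, rather than a random sample, matters because shuffled splits leak future information into training. Here is a minimal sketch of such a split; `split_last_days` is a hypothetical helper, and it assumes calendar days, which may or may not match how `backtest_days` is interpreted inside `models.predict_barrier`.

```python
import pandas as pd

def split_last_days(df, backtest_days):
    """Hold out the final `backtest_days` calendar days of a time-indexed frame."""
    cutoff = df.index.max().normalize() - pd.Timedelta(days=backtest_days - 1)
    train = df.loc[df.index < cutoff]
    test = df.loc[df.index >= cutoff]
    return train, test

# One row per day keeps the demo easy to check by eye
idx = pd.date_range("2023-01-01 09:30", periods=100, freq="D")
frame = pd.DataFrame({"close": range(100)}, index=idx)

train, test = split_last_days(frame, backtest_days=30)
print(len(train), len(test))
```

Every training row precedes every test row, so the model is always evaluated on data it could not have seen.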

In [17]:
ypred, model_index,model = models.predict_barrier(hist, 
                                                  features = model_features, 
                                                  max_depth=4, 
                                                  model_type='boosting', 
                                                  parallel=True, 
                                                  normalise=False,
                                                  tree_method='hist',
                                                  backtest_days=30,
                                                  prod_model=False)

              precision    recall  f1-score   support

         0.0       0.42      0.22      0.29      1266
         1.0       0.83      0.96      0.89      5901
         2.0       0.44      0.33      0.38      1410

    accuracy                           0.75      8577
   macro avg       0.56      0.50      0.52      8577
weighted avg       0.70      0.75      0.72      8577



In [18]:
# which features are most useful for our predictions?
pd.Series(model.feature_importances_,index=model.feature_names_in_).sort_values()

ema_volume_vol_std_60_150        0.036540
ema_volume_vol_std_new_60_150    0.047213
ema_volume_vol_std_50_200        0.066125
30_min_bin                       0.068681
volume_macd_pct_50_200           0.069833
volume_macd_pct_60_150           0.090011
returns_vol_std_20               0.109757
volume_macd_50_100               0.130391
minutes_open                     0.185698
volume_macd_60_150               0.195751
dtype: float32

In [19]:
# use the model output to create a dataframe for backtesting
# .copy() makes sure we assign into a new dataframe rather than a view of hist
backtest_hist = hist.loc[model_index].copy()
backtest_hist['ml_prediction'] = ypred
backtest_hist['strategy1_sell'] = 0
backtest_hist['strategy1_buy'] = 0

# whenever the model predicts that we will hit the upper barrier, we will take it as a buy signal
backtest_hist.loc[backtest_hist.ml_prediction == 2, 'strategy1_buy'] = 1

# optionally, we can use the same model to generate sell signals
#backtest_hist.loc[backtest_hist.ml_prediction == 0, 'strategy1_sell'] = 1

# how many buy signals (1) do we get in our sample?
backtest_hist.strategy1_buy.value_counts()

0    7516
1    1061
Name: strategy1_buy, dtype: int64

In [20]:
# before we backtest, we need to set the stop loss and take profit levels
# these can be different from the barriers we used during labelling
# let's use 26 minute volatility and a 3:1 ratio of take profit to stop loss

backtest_hist['multiplier'] = backtest_hist['exp_ema_pct_abs_shift_26']
backtest_hist['multiplier_volatility'] = backtest_hist['exp_pct_volatility_26']

backtest_hist['higher_bound'] = backtest_hist.close * (1 + ((0 * backtest_hist.multiplier + 3 * backtest_hist.multiplier_volatility) * 1))
backtest_hist['lower_bound'] = backtest_hist.close * (1 - ((0 * backtest_hist.multiplier + 1 * backtest_hist.multiplier_volatility) * 1))

The backtest iterates over our test dataframe and simulates trading decisions. A buy is made based on the ML model predictions.

A sell decision can be made based on the ML model and/or on crossing the stop loss/take profit (SLTP) levels. The backtest params below use the ML model to make buy decisions and the SLTP levels to sell. We set the holding period to 30 minutes, meaning that if no barrier is hit within 30 minutes we will sell. The hold_on_buy_signal parameter controls whether we keep holding the stock when there is a new buy signal, even if the price is beyond an SLTP level; it is set to False in the cell below.
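
The core loop of such a backtest can be sketched in a few lines of plain Python. This is a simplified illustration, not the code behind `backtest.run_backtest`: the function `simulate_trades` is hypothetical, it is long-only with one position at a time, it checks levels against closing prices, and it ignores position sizing, cash and fees.

```python
def simulate_trades(close, buy_signal, take_profit, stop_loss, holding_period):
    """Long-only, one position at a time; exits on take profit, stop loss or timeout.

    All inputs are equal-length sequences indexed by bar. take_profit[i] and
    stop_loss[i] are the levels fixed when a position is opened at bar i.
    Returns a list of (entry_bar, exit_bar, fractional_return) tuples.
    """
    trades = []
    i, n = 0, len(close)
    while i < n - 1:
        if not buy_signal[i]:
            i += 1
            continue
        entry, tp, sl = close[i], take_profit[i], stop_loss[i]
        last = min(i + holding_period, n - 1)
        exit_bar = last  # default: sell when the holding period runs out
        for j in range(i + 1, last + 1):
            if close[j] >= tp or close[j] <= sl:
                exit_bar = j
                break
        trades.append((i, exit_bar, close[exit_bar] / entry - 1))
        i = exit_bar + 1  # only look for a new entry after the exit
    return trades

close = [10.0, 10.1, 10.4, 10.3, 10.2, 10.0, 9.8, 9.9]
buy = [1, 0, 0, 1, 0, 0, 0, 0]
tp = [c * 1.03 for c in close]  # 3:1 take profit to stop loss, as a demo
sl = [c * 0.99 for c in close]
for trade in simulate_trades(close, buy, tp, sl, holding_period=3):
    print(trade)
```

In this toy run the first trade exits at the take profit and the second at the stop loss; buy signals that arrive while a position is open are skipped.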

In [21]:
trades = backtest.run_backtest(
    backtest_hist=backtest_hist.loc[backtest_hist.lower_bound.notna()], 
    buy_index=backtest_hist.loc[backtest_hist.strategy1_buy==1].index, 
    sell_index=[],
    take_profit_series=backtest_hist.higher_bound,
    stop_loss_series=backtest_hist.lower_bound,
    risk=1,
    cash = 10000,
    holding_period= 30,
    min_holding_period=0,
    hold_on_buy_signal=False,
    sltp=True,
    sltp_update=False)

Cash money:
9793.476002842042
112
112
Winrate:
0.2857142857142857
Avg win:
0.012199857873811865
Avg loss:
-0.006830136406268766


With the automatically selected features the model does not perform well. The average win is almost 2x the average loss, but that is not enough to compensate for the low win rate of about 29%. Let's try to create a model with different features.
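
The trade-off between win rate and win/loss size can be made explicit with a back-of-the-envelope expectancy calculation. The sketch below is a simplification: it assumes every trade ends as either an average win or an average loss, and it ignores position sizing and compounding, so it will not reproduce the final cash figure from the backtest. The function names are my own.

```python
def expectancy(win_rate, avg_win, avg_loss):
    """Expected fractional return per trade; avg_loss is a negative number."""
    return win_rate * avg_win + (1 - win_rate) * avg_loss

def breakeven_win_rate(avg_win, avg_loss):
    """Win rate at which the expectancy is exactly zero."""
    return -avg_loss / (avg_win - avg_loss)

# Rounded figures printed by the backtest above
win_rate, avg_win, avg_loss = 0.2857, 0.0122, -0.00683

print(f"expectancy per trade: {expectancy(win_rate, avg_win, avg_loss):.5f}")
print(f"break-even win rate:  {breakeven_win_rate(avg_win, avg_loss):.3f}")
```

Under these assumptions, wins of this size would need a win rate of roughly 36% just to break even, which is why a win/loss ratio near 2:1 is not sufficient on its own.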

Let's use:
* Swing high/swing low pct - how far was the recent swing high/low from the EMA line, expressed as pct?
* Volume and closing price MACDs expressed as pct
* Volatility metric vol_std_pct_ based on an EMA of the standard deviation, expressed as pct
* Trend and volatility variables we used during target labelling:
    * 'exp_ema_pct_abs_shift_50' is the expected change in EMA over 50 minutes based on the last 5000 such shifts.
    * 'exp_pct_volatility_50' is the expected 50 minute volatility based on the last 5000 such volatilities.
* 30_min_bin and minutes_open capture how long the market has been open on that day
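
As an example of how one of these features might be built, here is a sketch of the swing high/low percentages. It is an assumption based on the `hist.tail()` output above, where `swing_low_1` equals the close and `swing_low_pct_1` is 100.0 when the swing level equals the EMA: the swing is taken as the rolling max/min of the close and expressed as a percentage of the EMA. The helper `swing_pct` is hypothetical and the real feature code may differ.

```python
import numpy as np
import pandas as pd

def swing_pct(close, window):
    """Rolling max/min of close as a percentage of the EMA of close (assumed definition)."""
    ema = close.ewm(span=window, adjust=False).mean()
    swing_high = close.rolling(window).max()
    swing_low = close.rolling(window).min()
    return pd.DataFrame({
        f"swing_high_pct_{window}": swing_high / ema * 100,
        f"swing_low_pct_{window}": swing_low / ema * 100,
    })

# Synthetic closes, for illustration only
rng = np.random.default_rng(2)
close = pd.Series(10 + rng.normal(0, 0.01, 300).cumsum())

print(swing_pct(close, window=26).tail())
```

Expressing the swing relative to the EMA makes the feature comparable across price levels, which matters when the same model is reused on other stocks.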

In [24]:
# hand picked features by logic
model_features = [
    'swing_high_pct_12', 'swing_low_pct_50','swing_low_pct_100', 'swing_high_pct_100','swing_high_pct_26',
    'volume_macd_pct_12_100','volume_macd_pct_26_100', 'macd_pct_26_100',
    'vol_std_pct_100', 'vol_std_pct_26', 
    'exp_ema_pct_abs_shift_50', 'exp_pct_volatility_50',
    '30_min_bin', 'minutes_open'
    ]
model_features = model_features + ['barrier']

In [25]:
ypred, model_index,model = models.predict_barrier(hist, 
                                                  features = model_features, 
                                                  max_depth=4, 
                                                  model_type='boosting', 
                                                  parallel=True, 
                                                  normalise=False,
                                                  tree_method='hist',
                                                  backtest_days=30,
                                                  prod_model=False)

              precision    recall  f1-score   support

         0.0       0.42      0.31      0.36      1266
         1.0       0.83      0.95      0.89      5901
         2.0       0.48      0.33      0.39      1410

    accuracy                           0.75      8577
   macro avg       0.58      0.53      0.54      8577
weighted avg       0.71      0.75      0.73      8577



In [26]:
# use the model output to create a dataframe for backtesting
# .copy() makes sure we assign into a new dataframe rather than a view of hist
backtest_hist = hist.loc[model_index].copy()
backtest_hist['ml_prediction'] = ypred
backtest_hist['strategy1_sell'] = 0
backtest_hist['strategy1_buy'] = 0

# whenever the model predicts that we will hit the upper barrier, we will take it as a buy signal
backtest_hist.loc[backtest_hist.ml_prediction == 2, 'strategy1_buy'] = 1

# optionally, we can use the same model to generate sell signals
#backtest_hist.loc[backtest_hist.ml_prediction == 0, 'strategy1_sell'] = 1

# how many buy signals (1) do we get in our sample?
backtest_hist.strategy1_buy.value_counts()

0    7620
1     957
Name: strategy1_buy, dtype: int64

In [27]:
backtest_hist['multiplier'] = backtest_hist['exp_ema_pct_abs_shift_26']
backtest_hist['multiplier_volatility'] = backtest_hist['exp_pct_volatility_26']

backtest_hist['higher_bound'] = backtest_hist.close * (1 + ((0 * backtest_hist.multiplier + 3 * backtest_hist.multiplier_volatility) * 1))
backtest_hist['lower_bound'] = backtest_hist.close * (1 - ((0 * backtest_hist.multiplier + 1 * backtest_hist.multiplier_volatility) * 1))

In [28]:
trades = backtest.run_backtest(
    backtest_hist=backtest_hist.loc[backtest_hist.lower_bound.notna()], 
    buy_index=backtest_hist.loc[backtest_hist.strategy1_buy==1].index, 
    sell_index=[],
    take_profit_series=backtest_hist.higher_bound,
    stop_loss_series=backtest_hist.lower_bound,
    risk=1,
    cash = 10000,
    holding_period= 30,
    min_holding_period=0,
    hold_on_buy_signal=False,
    sltp=True,
    sltp_update=False)

Cash money:
10700.544221768389
98
98
Winrate:
0.2755102040816326
Avg win:
0.015003647388568185
Avg loss:
-0.0066050339319929


The algorithm ends with about 10,700 from a starting 10,000, so a 7% return over the 30 day backtest, with a win rate of 28%, an average win of 1.5% and an average loss of 0.66% (better than a 2:1 ratio). You can review the simulated trades here:

In [36]:
trades[0].tail(10)

Unnamed: 0,buy_index,buy_price,position,stop_loss,take_profit,sell_index,price,$ gain,returns,holding_duration
88,2023-09-13 09:31:00+00:00,15.6588,673.040723,15.575146,15.909761,2023-09-13 09:33:00+00:00,15.57,-59.766016,-0.005671,2
89,2023-09-13 09:34:00+00:00,15.55,673.90637,15.466987,15.799038,2023-09-13 09:41:00+00:00,15.82,181.95472,0.017363,7
90,2023-09-13 15:37:00+00:00,15.555,685.387256,15.475966,15.792102,2023-09-13 16:00:00+00:00,15.59,23.988554,0.00225,23
91,2023-09-14 09:31:00+00:00,15.775,677.349434,15.695299,16.014103,2023-09-14 09:41:00+00:00,16.0314,173.672395,0.016254,10
92,2023-09-14 09:42:00+00:00,15.9588,680.430842,15.877983,16.201252,2023-09-14 09:48:00+00:00,15.855,-70.628721,-0.006504,6
93,2023-09-14 09:49:00+00:00,15.8233,681.794,15.742925,16.064424,2023-09-14 09:53:00+00:00,15.73,-63.61138,-0.005896,4
94,2023-09-14 09:54:00+00:00,15.755,680.712131,15.674822,15.995535,2023-09-14 09:56:00+00:00,15.64,-78.281895,-0.007299,2
95,2023-09-14 09:57:00+00:00,15.72,677.247947,15.63992,15.960239,2023-09-14 10:08:00+00:00,15.6395,-54.51846,-0.005121,11
96,2023-09-14 10:09:00+00:00,15.605,678.745227,15.525264,15.844207,2023-09-14 10:18:00+00:00,15.525,-54.299618,-0.005127,9
97,2023-09-14 10:19:00+00:00,15.5001,679.835591,15.420723,15.738231,2023-09-14 10:38:00+00:00,15.7399,163.024575,0.015471,19


The next stage in strategy development is to deploy the model in a real trading environment, which is what I am doing right now using the Alpaca Trading API.

Trading strategies developed in this way tend to rapidly decay over time with changing market conditions.

In order to ensure consistent returns, I am currently automating the process of iterating over hundreds of such strategies using different features and stocks, and then retraining and deploying the best models.

Contact me if you have some questions, feedback or insights to share: \
email: gluzman64@gmail.com \
linkedin: linkedin.com/in/egluzman/ \
whatsapp: +447463457579