<a href="https://colab.research.google.com/github/dgalassi99/quant-trading-self-study/blob/main/more/Meta_Labelling_%26_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Main Code

This exercise comes from studying "Machine Learning for Asset Managers". Due to the size of the exercise I created a separate notebook file.

The full book study and exercise is here: https://github.com/dgalassi99/quant-trading-self-study/blob/main/00_books/Machine_Learning_for_Asset_Managers.ipynb

Using the labels generated with the triple-barrier method:
- (a) Fit a random forest classifier on those labels. Use as features estimates of
mean return, volatility, skewness, kurtosis, and various differences in
moving averages.
- (b) Backtest those predictions using as a trading rule the same rule used to
generate the labels.
- (c) Apply meta-labeling on the backtest results.
- (d) Refit the random forest on meta-labels, adding as a feature the label
predicted in (a).
- (e) Size (a) bets according to predictions in (d), and recompute the backtest.
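
To make the meta-labeling step more concrete, here is a minimal sketch of how meta-labels could be derived from the primary predictions and the realized trade returns. The function name `make_meta_labels` and the toy series are hypothetical and only illustrate the idea: a meta-label is 1 when the primary bet made money and 0 otherwise, and bars without a bet are skipped.

```python
import pandas as pd

def make_meta_labels(side: pd.Series, trade_return: pd.Series) -> pd.Series:
    """Return 1 where the primary model's bet made money, 0 otherwise.

    side: primary prediction in {-1, 0, 1}.
    trade_return: realized return of the underlying over the trade
    (long convention). Rows with side == 0 are dropped (no bet placed).
    """
    active = side != 0
    pnl = side[active] * trade_return[active]
    return (pnl > 0).astype(int)

# Toy example (made-up numbers, not the BTC data)
side = pd.Series([1, -1, 0, 1, -1])
trade_return = pd.Series([0.02, 0.02, 0.01, -0.02, -0.02])
meta = make_meta_labels(side, trade_return)
print(meta.tolist())  # [1, 0, 0, 1]
```

A secondary classifier fitted on such labels outputs a probability that the primary bet is correct, which can then be used to size the bet.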




## Labelling with Triple Barrier Method

In [1]:
from google.colab import drive
import pandas as pd, numpy as np
import matplotlib.pyplot as plt

drive.mount('/content/drive')
# Load the CSV data
file_path = '/content/drive/MyDrive/QUANT/DATA/btc_1h_data_2018_to_2025.csv'
df = pd.read_csv(file_path, date_format='%Y%m%d %H%M%S')
df['Open time'] = pd.to_datetime(df['Open time'])
df = df[['Open time', 'Close','Volume']].copy()  # explicit copy avoids SettingWithCopyWarning below

#define a target volume --> we use the average
target_vol = df.Volume.mean()
df['cum_vol'] = df.Volume.cumsum()
#create the bar_num in the df (vectorized integer division of the cumulative volume)
df['vol_bar_num'] = (df.cum_vol / target_vol).astype(int)
#keep only when vol_bar_num changes
df_vol = df.groupby('vol_bar_num').first()

#compute returns and std
df_vol['returns'] = df_vol.Close.pct_change()
std_returns = df_vol.returns.std()
# horizontal barriers #
hrz_barrier = 2*std_returns
print(f"Standard deviation of returns: {std_returns:.6f}")
print(f"Horizontal barrier (2 * std): {hrz_barrier:.6f}")

# the maximum holding period is defined as the
# average number of bars per day
#reset index to use timestamps again
df_vol = df_vol.reset_index()
#add date column (only the day)
df_vol['date'] = df_vol['Open time'].dt.date
# Count bars per day, then average
bars_per_day = df_vol.groupby('date').size()
max_holding_period = int(bars_per_day.mean())
print(f"Max holding period (avg bars/day): {max_holding_period}")

labels = []  # list to store labels

#loop through the rows, stopping max_holding_period early
for i in range(len(df_vol.Close) - max_holding_period):
    label = 0  # default label
    start_price = df_vol.Close.iloc[i]

    #look forward up to max_holding_period steps
    for j in range(1, max_holding_period + 1):
        future_price = df_vol.Close.iloc[i + j]
        returns = (future_price - start_price) / start_price
        #check labelling conditions
        if returns >= hrz_barrier:
            label = 1
            break
        elif returns <= -hrz_barrier:
            label = -1
            break

    labels.append(label)
#append NaNs to align with df_vol's length
labels += [np.nan] * max_holding_period
df_vol["label"] = labels
print(df_vol['label'].value_counts(dropna=True))


Mounted at /content/drive


Standard deviation of returns: 0.009658
Horizontal barrier (2 * std): 0.019315
Max holding period (avg bars/day): 14
label
 1.0    13132
 0.0    13017
-1.0    12436
Name: count, dtype: int64
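
The labelling loop above can also be wrapped in a reusable function. The following is a sketch under the same rules (symmetric horizontal barrier on the return, vertical barrier after `max_holding` bars, NaN for the last bars that cannot be fully evaluated). The function name `triple_barrier_labels` and the toy prices are hypothetical.

```python
import numpy as np

def triple_barrier_labels(close, barrier, max_holding):
    """Label each bar +1 / -1 / 0 depending on which barrier is hit first."""
    close = np.asarray(close, dtype=float)
    n = len(close)
    labels = np.full(n, np.nan)  # trailing bars stay NaN
    for i in range(n - max_holding):
        # returns of the next max_holding bars relative to bar i
        path = close[i + 1 : i + 1 + max_holding] / close[i] - 1.0
        hit = np.flatnonzero(np.abs(path) >= barrier)
        # first horizontal barrier touched decides the sign; none touched -> 0
        labels[i] = np.sign(path[hit[0]]) if hit.size else 0.0
    return labels

# Toy example (made-up prices)
prices = [100, 101, 103, 102, 99, 100, 100.5, 100]
toy_labels = triple_barrier_labels(prices, barrier=0.02, max_holding=2)
print(toy_labels)
```

On the real data the call would be `triple_barrier_labels(df_vol.Close, hrz_barrier, max_holding_period)`, which should reproduce the `label` column built in the cell above.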


In [2]:
# overall distribution
overall_dist = df_vol['label'].value_counts(normalize=True)
print("Overall label distribution (Triple Barrier):")
print(overall_dist)

df_vol['hour'] = df_vol['Open time'].dt.hour
hourly_dist = df_vol.groupby('hour')['label'].value_counts(normalize=True).unstack().fillna(0)
print("Hourly label distribution (Triple Barrier):")
print(hourly_dist)

Overall label distribution (Triple Barrier):
label
 1.0    0.340340
 0.0    0.337359
-1.0    0.322301
Name: proportion, dtype: float64
Hourly label distribution (Triple Barrier):
label      -1.0       0.0       1.0
hour                               
0      0.343131  0.328134  0.328734
1      0.327804  0.332449  0.339748
2      0.330812  0.323934  0.345254
3      0.325747  0.327881  0.346373
4      0.317073  0.335725  0.347202
5      0.313584  0.317919  0.368497
6      0.327610  0.340659  0.331731
7      0.325032  0.332695  0.342273
8      0.324040  0.332714  0.343247
9      0.338054  0.327586  0.334360
10     0.307359  0.333333  0.359307
11     0.319436  0.321888  0.358676
12     0.328054  0.328054  0.343891
13     0.329570  0.318817  0.351613
14     0.330474  0.346115  0.323411
15     0.315628  0.356997  0.327375
16     0.315817  0.356280  0.327903
17     0.305035  0.353630  0.341335
18     0.330055  0.344806  0.325138
19     0.315231  0.347066  0.337703
20     0.311746  0.333333  0.

In [3]:
df = df_vol.copy()
df.head()

| | vol_bar_num | Open time | Close | Volume | cum_vol | returns | date | label | hour |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-01-01 00:00:00 | 13529.01 | 443.356199 | 443.356199 | NaN | 2018-01-01 | -1.0 | 0 |
| 1 | 1 | 2018-01-01 07:00:00 | 13570.35 | 292.188777 | 2978.419643 | 0.003056 | 2018-01-01 | -1.0 | 7 |
| 2 | 2 | 2018-01-01 15:00:00 | 13247.0 | 427.071957 | 6130.108727 | -0.023828 | 2018-01-01 | 1.0 | 15 |
| 3 | 3 | 2018-01-02 00:00:00 | 13750.01 | 466.596114 | 9076.511958 | 0.037972 | 2018-01-02 | -1.0 | 0 |
| 4 | 4 | 2018-01-02 04:00:00 | 13127.31 | 992.418927 | 11967.867697 | -0.045287 | 2018-01-02 | 1.0 | 4 |


## Classification Using a Random Forest Classifier

- We labeled the data based on *actual future price behavior* using the triple-barrier method (TBM).
- Now we build a model that predicts those labels in *real time* using only features known at the current time.

Picking the right features is important for making predictions with ML methods. So the first step is to pick some features and check their correlation. Then we can fit a tree-based method and check the feature importance a posteriori, possibly eliminating redundant or overfitting covariates.
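
As a sketch of these two checks, the snippet below flags highly correlated feature pairs and then reads the impurity-based importances of a fitted random forest. It runs on synthetic data (the columns `signal`, `noise` and `signal_copy` are made up for illustration), and the 0.9 correlation threshold is an arbitrary assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Synthetic stand-ins for the real features (assumption, not the BTC data)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
X_demo["signal_copy"] = X_demo["signal"] + 0.01 * rng.normal(size=n)
y_demo = (X_demo["signal"] > 0).astype(int)

# 1) Correlation check: keep the upper triangle and flag |corr| > 0.9
corr = X_demo.corr()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
redundant_pairs = pairs[pairs.abs() > 0.9]
print(redundant_pairs)

# 2) Feature importance a posteriori from a fitted forest
rf_demo = RandomForestClassifier(n_estimators=50, max_depth=4, random_state=0)
rf_demo.fit(X_demo, y_demo)
importance = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
print(importance.sort_values(ascending=False))
```

With the notebook's data, `X_demo` would be replaced by the feature columns and `y_demo` by the `label` column. Note that impurity-based importances tend to be split between correlated features, so the two checks are best read together.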

In [8]:
df['mean_return_5'] = df['returns'].rolling(window=5).mean()
df['mean_return_10'] = df['returns'].rolling(window=10).mean()
df['volatility_5'] = df['returns'].rolling(window=5).std()
df['volatility_10'] = df['returns'].rolling(window=10).std()
df['skew_5'] = df['returns'].rolling(window=5).skew()
df['skew_10'] = df['returns'].rolling(window=10).skew()
df['kurt_5'] = df['returns'].rolling(window=5).kurt()
df['kurt_10'] = df['returns'].rolling(window=10).kurt()
df['ma_fast'] = df['Close'].rolling(window=5).mean()
df['ma_slow'] = df['Close'].rolling(window=20).mean()
df['ma_diff'] = df['ma_fast'] - df['ma_slow']
df.dropna(inplace=True)
df.head()

| | vol_bar_num | Open time | Close | Volume | cum_vol | returns | date | label | hour | mean_return_5 | mean_return_10 | volatility_5 | volatility_10 | skew_5 | skew_10 | kurt_5 | kurt_10 | ma_fast | ma_slow | ma_diff |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 38 | 2018-01-07 03:00:00 | 16740.68 | 833.027889 | 110900.578404 | -0.01928 | 2018-01-07 | -1.0 | 3 | 0.003773 | 0.001892 | 0.018824 | 0.017107 | -0.553142 | -0.705871 | -2.911667 | -1.0202 | 16670.322 | 15926.319 | 744.003 |
| 39 | 39 | 2018-01-07 06:00:00 | 16417.05 | 721.569884 | 113096.296472 | -0.019332 | 2018-01-07 | -1.0 | 6 | 0.002645 | -0.000917 | 0.020247 | 0.01813 | -0.517318 | -0.237222 | -3.209968 | -1.818041 | 16710.75 | 16027.223 | 683.527 |
| 40 | 40 | 2018-01-07 14:00:00 | 16520.02 | 271.768469 | 116038.520367 | 0.006272 | 2018-01-07 | -1.0 | 14 | -0.00031 | -0.001537 | 0.017823 | 0.017723 | -0.375124 | -0.17333 | -3.016762 | -1.695892 | 16703.51 | 16135.5735 | 567.9365 |
| 41 | 41 | 2018-01-07 19:00:00 | 16100.9 | 295.322998 | 119141.35482 | -0.02537 | 2018-01-07 | -1.0 | 19 | -0.007967 | -0.005253 | 0.018912 | 0.018497 | 0.736678 | 0.263101 | -2.020346 | -1.715613 | 16569.688 | 16200.0685 | 369.6195 |
| 42 | 42 | 2018-01-08 02:00:00 | 16037.38 | 765.781916 | 122465.354713 | -0.003945 | 2018-01-08 | -1.0 | 2 | -0.012331 | -0.002845 | 0.013074 | 0.016681 | 0.770446 | 0.166691 | -1.225086 | -1.528775 | 16363.206 | 16262.0435 | 101.1625 |


In [11]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

#features used by the classifier
feature_cols = [
    'returns', 'mean_return_5', 'mean_return_10',
    'volatility_5', 'volatility_10', 'skew_5', 'skew_10',
    'kurt_5', 'kurt_10', 'ma_fast', 'ma_slow', 'ma_diff'
]
#drop any remaining rows with NaNs; copy so predictions can be stored later without warnings
df_model = df.dropna(subset=['label'] + feature_cols).copy()

X = df_model[feature_cols]
y = df_model['label']

#chronological train/test split (no shuffling) to respect the time ordering
train_size = int(0.8 * len(X))
X_train, X_test = X.iloc[:train_size], X.iloc[train_size:]
y_train, y_test = y.iloc[:train_size], y.iloc[train_size:]

#fit classifier
clf = RandomForestClassifier(n_estimators=100, max_depth=6, random_state=42)
clf.fit(X_train, y_train)

#predict
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

#store predictions
df_model.loc[X_test.index, 'predicted_label'] = y_pred


              precision    recall  f1-score   support

        -1.0       0.36      0.77      0.49      2482
         0.0       0.51      0.36      0.42      2430
         1.0       0.33      0.08      0.13      2798

    accuracy                           0.39      7710
   macro avg       0.40      0.40      0.35      7710
weighted avg       0.40      0.39      0.34      7710



Considering the low quality of the metrics, we try a randomized hyperparameter search with cross-validation.

In [19]:
from sklearn.model_selection import RandomizedSearchCV
#defining the grid of parameters for cross-validation
param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2'],
    'class_weight': [None, 'balanced', 'balanced_subsample']
}

#our metric to maximize is the macro-averaged F1
scoring = 'f1_macro'  # gives equal weight to each class

rf = RandomForestClassifier(random_state=42)
search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_grid,
    n_iter=50,
    scoring=scoring,
    cv=3,
    verbose=2,
    n_jobs=1,
    random_state=42
)

search.fit(X_train, y_train)
best_rf = search.best_estimator_
y_pred = best_rf.predict(X_test)

print(classification_report(y_test, y_pred))


Fitting 3 folds for each of 50 candidates, totalling 150 fits
[CV] END class_weight=balanced_subsample, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=500; total time= 1.3min
[CV] END class_weight=balanced_subsample, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=500; total time= 1.0min
[CV] END class_weight=balanced_subsample, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=500; total time= 1.0min
[CV] END class_weight=balanced, max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=300; total time=  13.3s
[CV] END class_weight=balanced, max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=300; total time=  13.3s
[CV] END class_weight=balanced, max_depth=5, max_features=sqrt, min_samples_leaf=1, min_samples_split=5, n_estimators=300; total time=  13.4s
[CV] END class_weight=balanced_subsample, max_d

Even after the cross-validated search the model does not improve significantly. The best hyperparameters found are printed below.

In [20]:
print("Best parameters found:")
print(search.best_params_)

Best parameters found:
{'n_estimators': 100, 'min_samples_split': 5, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 10, 'class_weight': 'balanced_subsample'}
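
One caveat on the search above: passing `cv=3` with a classifier makes scikit-learn use stratified, unshuffled K-fold splits, so some candidates are trained on data that comes after their validation fold. A possible alternative, sketched below on synthetic data, is `TimeSeriesSplit`, where every validation fold lies strictly after its training fold. The `gap=14` argument is an assumption chosen to match the 14-bar maximum holding period, so that labels at the end of each training fold do not look into the validation fold. The reduced parameter grid and the `X_demo`/`y_demo` arrays are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))          # synthetic features (assumption)
y_demo = rng.choice([-1, 0, 1], size=300)   # synthetic 3-class labels

gap = 14  # assumed equal to max_holding_period
tscv = TimeSeriesSplit(n_splits=3, gap=gap)

# Every validation fold starts more than `gap` bars after its training fold ends
fold_gaps = [val_idx.min() - train_idx.max() for train_idx, val_idx in tscv.split(X_demo)]
print(fold_gaps)

search_ts = RandomizedSearchCV(
    estimator=RandomForestClassifier(n_estimators=20, random_state=42),
    param_distributions={"max_depth": [3, 5], "min_samples_leaf": [1, 4]},
    n_iter=2,
    scoring="f1_macro",
    cv=tscv,
    random_state=42,
)
search_ts.fit(X_demo, y_demo)
print(search_ts.best_params_)
```

In the notebook this would amount to replacing `cv=3` with `cv=tscv` in the existing `RandomizedSearchCV` call. Whether it changes the selected hyperparameters has not been checked here.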


## Backtesting the Predictions

Now we backtest the predictions on the test set.

The idea is the following:
- if y_pred = 1 (-1) --> go long (short)
- if y_pred = 0 --> do nothing

Once a long position is entered at price P(t), with 2sigma = hrz_barrier expressed as a return:
- TP when the upper barrier is hit, that is when P(t+j) >= P(t) * (1 + 2sigma)
- SL when the lower barrier is hit, that is when P(t+j) <= P(t) * (1 - 2sigma)

For a short position the two barriers swap roles. So we have a symmetric risk/reward configuration in this case!
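
A minimal sketch of this trading rule is shown below. It is not the notebook's actual backtest: the function name `backtest_barriers` is hypothetical, and it assumes that every signal is an independent unit bet, that exits happen at the close of the bar that crosses a barrier (so the realized return can overshoot 2sigma), that a position still open after `max_holding` bars is closed at that bar (the vertical barrier of the labelling rule), and that there are no transaction costs.

```python
import numpy as np

def backtest_barriers(close, side, barrier, max_holding):
    """Return the realized return of the trade opened at each bar (0 if no trade)."""
    close = np.asarray(close, dtype=float)
    side = np.asarray(side, dtype=float)
    trade_ret = np.zeros(len(close))
    for i in range(len(close) - 1):
        if side[i] == 0:
            continue  # no position
        end = min(i + max_holding, len(close) - 1)
        for j in range(i + 1, end + 1):
            # signed return: positive means the bet is winning
            ret = side[i] * (close[j] / close[i] - 1.0)
            # exit on take-profit, stop-loss, or the vertical barrier
            if abs(ret) >= barrier or j == end:
                trade_ret[i] = ret
                break
    return trade_ret

# Toy example (made-up prices and signals)
prices = [100, 103, 100, 97, 98, 98.5]
signals = [1, -1, 1, 0, 1, 0]
toy_ret = backtest_barriers(prices, signals, barrier=0.02, max_holding=2)
print(toy_ret.round(4))
```

On the real data the inputs would be the test-set `Close` prices, the `predicted_label` column, `hrz_barrier` and `max_holding_period`.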

In [26]:
df_vol.loc[X_test.index, 'predicted_label'] = y_pred #append the prediction column to the data
df_vol.dropna(inplace=True) #remove all the rows without a prediction (training set)


| | vol_bar_num | Open time | Close | Volume | cum_vol | returns | date | label | hour | predicted_label |
|---|---|---|---|---|---|---|---|---|---|---|
| 30875 | 55326 | 2023-04-19 15:00:00 | 29260.44 | 3660.78588 | 1.604081e+08 | -0.004525 | 2023-04-19 | 0.0 | 15 | 1.0 |
| 30876 | 55327 | 2023-04-19 16:00:00 | 29315.24 | 1921.49063 | 1.604100e+08 | 0.001873 | 2023-04-19 | -1.0 | 16 | 1.0 |
| 30877 | 55328 | 2023-04-19 17:00:00 | 29318.14 | 2569.77171 | 1.604126e+08 | 0.000099 | 2023-04-19 | -1.0 | 17 | 1.0 |
| 30878 | 55329 | 2023-04-19 18:00:00 | 29277.45 | 2641.42752 | 1.604152e+08 | -0.001388 | 2023-04-19 | -1.0 | 18 | 1.0 |
| 30879 | 55330 | 2023-04-19 20:00:00 | 29237.71 | 2565.25344 | 1.604195e+08 | -0.001357 | 2023-04-19 | -1.0 | 20 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 38580 | 64228 | 2025-05-02 01:00:00 | 96970.40 | 566.90548 | 1.862164e+08 | 0.004632 | 2025-05-02 | -1.0 | 1 | -1.0 |
| 38581 | 64229 | 2025-05-02 06:00:00 | 96715.00 | 383.92171 | 1.862188e+08 | -0.002634 | 2025-05-02 | -1.0 | 6 | -1.0 |
| 38582 | 64230 | 2025-05-02 12:00:00 | 97182.53 | 893.40406 | 1.862225e+08 | 0.004834 | 2025-05-02 | -1.0 | 12 | -1.0 |
| 38583 | 64231 | 2025-05-02 14:00:00 | 97748.40 | 2123.65060 | 1.862257e+08 | 0.005823 | 2025-05-02 | -1.0 | 14 | -1.0 |


In [27]:
hrz_barrier

0.01931505532734949