<a href="https://colab.research.google.com/github/dgalassi99/quant-trading-self-study/blob/main/03_ML_finance/W9_data_stationarity_%26_financial_features.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Stationarity and Financial Features

In this short notebook we look at why classical/traditional ML struggles when we serve it financial data as the "daily meal", and we write some cells to prepare data for later ML methods.

## A Pinch of Theory

The first important question to answer is: *Why does traditional ML struggle in finance?*

Financial data present a series of issues that we need to account for. Traditional ML algorithms are built to predict or classify one feature of an observation from a set of other features (covariates). But many of these methods also need a series of assumptions to hold...





### Low Signal-to-Noise Ratio (SNR)


A time series is made of a signal (risk premia, structural inefficiencies), which is the component we would like to predict or learn, and noise (news, market sentiment, ...), which is nothing more than random fluctuation.

ML methods do very well when the data have strong patterns and the noise component is only a small share of the total variation. In other words, we need the features to explain a substantial fraction of the variance of the target we'd like to predict.

Unfortunately, this does not hold for financial data. Here the SNR is very low, so an ML model will either underfit (flipping a coin performs as well as our model) or overfit (we force the model to learn 'from the noise' --> poor out-of-sample performance).
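
To make the low-SNR point concrete, here is a minimal sketch on synthetic data (not market data). The coefficient `beta` and the unit noise scale are assumptions chosen purely for illustration; the point is how little variance a weak signal explains and how close to a coin flip the directional hit rate stays.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical setup: one feature carrying a weak signal, buried in unit-variance noise.
beta = 0.05
x = rng.standard_normal(n)
noise = rng.standard_normal(n)
y = beta * x + noise

# Share of the target variance explained by the feature (squared correlation).
r2 = np.corrcoef(x, y)[0, 1] ** 2

# Directional hit rate of the best possible sign predictor, sign(x).
hit_rate = np.mean(np.sign(x) == np.sign(y))

print(f"R^2 = {r2:.4f}, hit rate = {hit_rate:.3f}")
```

Even with the true feature in hand, the explained variance is a fraction of a percent and the hit rate sits only slightly above 50%.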

### Stationarity


Standard ML theory requires stationary data. OK, what does that mean?

Most statistical theory rests on a key assumption about the observations: that they are *independent and identically distributed (IID)*. Well, for financial data this is pretty much bullshit. But why?

Observations are: (1) autocorrelated (the series is correlated with a lagged copy of itself); (2) non-stationary, as statistical properties (mean, variance, ...) change over time; (3) overlapping, as labelled events overlap and create dependence.

These issues invalidate any form of random shuffling --> we need to respect time ordering and dependence.
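
A small sketch of why shuffling is a problem, using a plain integer array as a stand-in for a chronological index (an assumption for illustration). It measures how many training rows lie in the "future" of the earliest validation row under each splitting scheme.

```python
import numpy as np

n = 1_000
idx = np.arange(n)  # stand-in for a chronological index
cut = int(n * 0.8)

# Chronological split: every training row precedes every validation row.
train_chrono, val_chrono = idx[:cut], idx[cut:]

# Shuffled split: training rows end up interleaved with validation rows.
rng = np.random.default_rng(0)
perm = rng.permutation(n)
train_shuf, val_shuf = perm[:cut], perm[cut:]

# Fraction of training rows that come after the earliest validation row.
leak_chrono = np.mean(train_chrono > val_chrono.min())
leak_shuf = np.mean(train_shuf > val_shuf.min())
print(leak_chrono, leak_shuf)
```

With the chronological split the fraction is zero; with shuffling, almost all training rows postdate some validation row, so autocorrelated data would leak information.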

But why are financial data non-stationary? Well, consider that:
- Prices (generally) trend up (equities) or oscillate around changing means (FX, commodities)
- Volatility is time-varying
- Market microstructure, news, market regimes, politics, regulation and countless other events introduce breaks, disruptions and outright changes in the structure of the data itself
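
The first bullet can be illustrated with a simulated price path. This is only a sketch: the drift and volatility values are exaggerated assumptions, and the returns are drawn i.i.d., which real returns are not. It compares the mean of each half of the sample for prices and for log-returns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 10_000

# Hypothetical log-returns: i.i.d. Gaussian with a positive drift.
log_ret = pd.Series(rng.normal(loc=5e-4, scale=0.005, size=n))
price = 100 * np.exp(log_ret.cumsum())

half = n // 2
# Mean of each half of the sample, for prices and for returns.
price_means = (price.iloc[:half].mean(), price.iloc[half:].mean())
ret_means = (log_ret.iloc[:half].mean(), log_ret.iloc[half:].mean())

print("price means by half :", price_means)
print("return means by half:", ret_means)
```

The price level has a completely different mean in the two halves, while the mean of the returns is roughly the same, which is the motivation for modelling returns rather than prices.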

### Anything Else?

Well, unfortunately, yes. Besides the issues mentioned above, we also need to take into account:
- Non-Gaussianity, as financial data have fat tails due to rare but extremely important events
- Regime changes between bull and bear markets
- Dependence between observations, which reduces the effective sample size we can use for training
- ... and many more :(
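
To get a feel for what "fat tails" means, the sketch below compares a Gaussian sample with a Student-t sample. Using a Student-t with 5 degrees of freedom as a stand-in for fat-tailed returns is an assumption for illustration, not a fitted model.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 200_000

gaussian = pd.Series(rng.standard_normal(n))
# Student-t with 5 degrees of freedom: a hypothetical fat-tailed distribution.
fat_tailed = pd.Series(rng.standard_t(df=5, size=n))

# pandas' kurt() reports excess kurtosis (about 0 for a Gaussian sample).
print("excess kurtosis, Gaussian :", gaussian.kurt())
print("excess kurtosis, Student-t:", fat_tailed.kurt())

def tail_share(s, k=4.0):
    """Share of observations more than k standard deviations from the mean."""
    z = (s - s.mean()) / s.std()
    return (z.abs() > k).mean()

print("share beyond 4 sigma:", tail_share(gaussian), tail_share(fat_tailed))
```

The same `kurt()` and `tail_share` checks could be run on `df.log_ret` once it has been computed in the practice section.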

### What can we do?

There are some solutions to our problems:

- Transform raw prices into log-returns and, optionally, standardize them by dividing by rolling volatility
- Fractional differencing, since simple differencing (taking returns) removes too much memory. This is something we will see in AFML Chap 5.
- Alternative sampling: group data into volume/dollar bars, which helps stabilize the information per bar
- De-overlap training and test observations by purging or by applying embargo periods
- Standardization/normalization of features by z-score, IQR, median, mean ...
- Event-driven sampling: train models only on informative events (selected by primary filters such as CUSUM)
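
As a preview of the fractional-differencing bullet, here is a minimal sketch based on the binomial-expansion weights of $(1 - B)^d$, where $B$ is the lag operator. The helper names `frac_diff_weights` and `frac_diff` and the fixed window length are hypothetical choices for illustration, not a reference implementation.

```python
import numpy as np
import pandas as pd

def frac_diff_weights(d, size):
    """Weights of (1 - B)^d, from lag 0 to lag size - 1."""
    w = [1.0]
    for k in range(1, size):
        w.append(-w[-1] * (d - k + 1) / k)
    return np.array(w)

def frac_diff(series, d, window):
    """Fixed-window fractional differencing; the first window - 1 values are NaN."""
    # Reverse so the weights line up with the rolling window (oldest value first).
    w = frac_diff_weights(d, window)[::-1]
    return series.rolling(window).apply(lambda x: np.dot(w, x), raw=True)

print(frac_diff_weights(1.0, 4))  # d = 1 recovers plain first differencing
print(frac_diff_weights(0.5, 4))  # d = 0.5 keeps slowly decaying weights on older lags

s = pd.Series([1.0, 3.0, 6.0, 10.0])
print(frac_diff(s, d=1.0, window=2))
```

With `d = 1` the weights collapse to `[1, -1, 0, ...]`, i.e. an ordinary difference; with a fractional `d` the older observations keep a small, non-zero weight, which is the "memory" that simple differencing throws away.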

I hope we will see all (or at least some) of these solutions to learn how to better deal with financial data!
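
For the event-driven sampling idea, a simplified symmetric CUSUM filter could look like the sketch below. The function name `cusum_events`, the toy return series and the threshold `h` are all illustrative assumptions.

```python
import pandas as pd

def cusum_events(returns, h):
    """Return the index labels where the cumulative up or down move exceeds h."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for t, r in returns.items():
        # Running sums of upward and downward moves, each floored/capped at zero.
        s_pos = max(0.0, s_pos + r)
        s_neg = min(0.0, s_neg + r)
        if s_neg < -h:
            s_neg = 0.0  # reset the side that triggered
            events.append(t)
        elif s_pos > h:
            s_pos = 0.0
            events.append(t)
    return events

rets = pd.Series([0.01, 0.01, 0.01, -0.02, -0.02, 0.005])
events = cusum_events(rets, h=0.025)
print(events)
```

Only the bars where the accumulated move crosses the threshold are kept, so quiet periods produce few samples and active periods produce more.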

## Practice

### Data importing and computation of financial features

In [48]:
import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from google.colab import drive

In [10]:
drive.mount('/content/drive')
csv_path = '/content/drive/MyDrive/QUANT/DATA/btc_1h_data_2018_to_2025.csv'

# load the CSV into a DataFrame
df = pd.read_csv(csv_path)
# rename columns and drop the unused ones
df = df.rename(columns={'Close time': 'datetime', 'Open': 'open', 'High': 'high', 'Low': 'low', 'Close': 'close', 'Volume': 'volume'})
df = df.drop(columns=['Open time', 'Quote asset volume', 'Number of trades', 'Taker buy base asset volume', 'Taker buy quote asset volume', 'Ignore'])
# convert the close time to datetime type
df['datetime'] = pd.to_datetime(df['datetime'])
# set the timestamp as index and round it to the nearest hour for readability
df = df.set_index('datetime')
# 'h' is the hourly alias; the uppercase 'H' is deprecated in recent pandas versions
df.index = df.index.round('h')
df.head()

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).




| datetime | open | high | low | close | volume |
|---|---|---|---|---|---|
| 2018-01-01 01:00:00 | 13715.65 | 13715.65 | 13400.01 | 13529.01 | 443.356199 |
| 2018-01-01 02:00:00 | 13528.99 | 13595.89 | 13155.38 | 13203.06 | 383.697006 |
| 2018-01-01 03:00:00 | 13203.0 | 13418.43 | 13200.0 | 13330.18 | 429.064572 |
| 2018-01-01 04:00:00 | 13330.26 | 13611.27 | 13290.0 | 13410.03 | 420.08703 |
| 2018-01-01 05:00:00 | 13434.98 | 13623.29 | 13322.15 | 13601.01 | 340.807329 |


In [11]:
# computing some financial features

rolling_window = 20
#moving average
df['ma'] = df.close.rolling(window=rolling_window).mean()
#log returns
df['log_ret'] = np.log(df.close/df.close.shift(1))
#rolling volatility
df['roll_vol'] = df.log_ret.rolling(window=rolling_window).std()
#momentum as % price change over a window
df['momentum'] = df.close/df.close.shift(rolling_window) - 1
#price to ma ratio
df['ma_ratio'] = df.close/df.ma
#volume z-score: (volume - rolling mean of volume) / rolling std of volume
df['ma_vol'] = df.volume.rolling(window=rolling_window).mean()
df['volatility_vol'] = df.volume.rolling(window=rolling_window).std()
df['vol_zscore'] = (df.volume-df.ma_vol)/df.volatility_vol

#dropping NaN as rolling creates NaN cells
df = df.dropna()
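
The theory section also mentioned standardizing log-returns by rolling volatility, which the cell above does not do. A minimal sketch on a synthetic close series (an assumption, so the snippet runs without the CSV) follows; the volatility is lagged by one bar so that the scaling factor does not contain the return being scaled.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Hypothetical close prices from i.i.d. Gaussian log-returns.
close = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.01, 500))))

window = 20  # same length as rolling_window above
log_ret = np.log(close / close.shift(1))
roll_vol = log_ret.rolling(window).std()

# Divide by the previous bar's volatility estimate, so the denominator
# only uses returns observed before the one being scaled.
vol_adj_ret = log_ret / roll_vol.shift(1)

print(vol_adj_ret.dropna().describe())
```

On the real data the same three lines could be applied to `df.close`; whether to lag the volatility is a design choice, made here to keep the feature free of look-ahead.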

In [12]:
#check the main summary statistics of the numeric features before normalization
df.describe()

|  | open | high | low | close | volume | ma | log_ret | roll_vol | momentum | ma_ratio | ma_vol | volatility_vol | vol_zscore |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 |
| mean | 31056.542574 | 31195.225912 | 30911.297124 | 31057.800704 | 2900.085168 | 31045.792647 | 3.1e-05 | 0.006103 | 0.001133 | 1.000364 | 2900.044652 | 1575.064902 | 0.004965 |
| std | 25505.978003 | 25607.159576 | 25402.673086 | 25507.093787 | 4097.682716 | 25492.243358 | 0.007531 | 0.004439 | 0.032249 | 0.017766 | 3334.993271 | 1867.21913 | 1.0642 |
| min | 3172.62 | 3184.75 | 3156.26 | 3172.05 | 0.0 | 3204.7525 | -0.201033 | 0.00045 | -0.459832 | 0.682039 | 192.851938 | 43.908663 | -2.996039 |
| 25% | 9173.745 | 9209.075 | 9139.9925 | 9173.57 | 936.548487 | 9170.2465 | -0.002485 | 0.003347 | -0.012448 | 0.993756 | 1210.082947 | 595.697127 | -0.704502 |
| 50% | 23801.035 | 23903.32 | 23700.0 | 23800.915 | 1604.352842 | 23729.761 | 7.4e-05 | 0.005066 | 0.0006 | 1.000408 | 1834.658439 | 1013.593544 | -0.309411 |
| 75% | 46642.2775 | 46887.9225 | 46359.495 | 46643.395 | 3062.536628 | 46559.977875 | 0.002643 | 0.007441 | 0.014672 | 1.007292 | 3004.357562 | 1764.47134 | 0.413954 |
| max | 108320.0 | 109588.0 | 107780.51 | 108320.01 | 137207.1886 | 106923.293 | 0.16028 | 0.081097 | 0.346492 | 1.223169 | 35321.897255 | 33502.972918 | 4.232165 |


### Normalization


In [15]:
df.columns

Index(['open', 'high', 'low', 'close', 'volume', 'ma', 'log_ret', 'roll_vol',
       'momentum', 'ma_ratio', 'ma_vol', 'volatility_vol', 'vol_zscore'],
      dtype='object')

In [32]:
# define the features to be normalized
features_to_normalize = ['log_ret', 'roll_vol', 'momentum', 'ma_ratio', 'ma_vol', 'volatility_vol', 'vol_zscore']
#define X as the feature matrix
X = df[features_to_normalize]
#normalize X with sklearn StandardScaler
#(note: the scaler is fitted on the full sample, validation period included)
std_scaler = StandardScaler()
X_scaled = std_scaler.fit_transform(X)
#transform X back into a pandas df
X = pd.DataFrame(X_scaled, index=X.index, columns=features_to_normalize)
#check the main stats again
X.describe()

|  | log_ret | roll_vol | momentum | ma_ratio | ma_vol | volatility_vol | vol_zscore |
|---|---|---|---|---|---|---|---|
| count | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 | 64226.0 |
| mean | 7.13574e-18 | -1.557693e-16 | 6.637898e-18 | 9.487769e-16 | 2.83217e-17 | -1.416085e-17 | 7.522951e-18 |
| std | 1.000008 | 1.000008 | 1.000008 | 1.000008 | 1.000008 | 1.000008 | 1.000008 |
| min | -26.69871 | -1.273596 | -14.29386 | -17.91796 | -0.8117599 | -0.820026 | -2.819986 |
| 25% | -0.3339875 | -0.6208056 | -0.4211172 | -0.3719065 | -0.5067401 | -0.5245102 | -0.6666728 |
| 50% | 0.005750661 | -0.2335569 | -0.01652791 | 0.002511221 | -0.3194593 | -0.3007016 | -0.2954132 |
| 75% | 0.34683 | 0.3015248 | 0.4198411 | 0.3899791 | 0.03127854 | 0.1014385 | 0.3843191 |
| max | 21.27911 | 16.89491 | 10.70909 | 12.54133 | 9.72179 | 17.09931 | 3.972217 |
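
One caveat about the normalization cell: it fits `StandardScaler` on every row, so the mean and standard deviation used to scale the early observations already include information from the later (validation) period. A possible alternative, sketched here on a synthetic feature matrix (the `*_demo` names and the drifting scale are assumptions for illustration), is to fit the scaler on the training rows only and reuse it on the validation rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
n = 1_000
# Hypothetical feature matrix; the scale of f1 drifts upward over time.
X_demo = pd.DataFrame({
    "f1": rng.normal(0, 1, n) * np.linspace(1, 3, n),
    "f2": rng.normal(5, 2, n),
})

cut = int(len(X_demo) * 0.8)
X_tr_demo, X_val_demo = X_demo.iloc[:cut], X_demo.iloc[cut:]

# Statistics are estimated from the training rows only ...
scaler = StandardScaler().fit(X_tr_demo)
X_tr_scaled = pd.DataFrame(scaler.transform(X_tr_demo), index=X_tr_demo.index, columns=X_demo.columns)
# ... and reused unchanged on the validation rows.
X_val_scaled = pd.DataFrame(scaler.transform(X_val_demo), index=X_val_demo.index, columns=X_demo.columns)

print(X_tr_scaled.std(ddof=0).round(3).to_dict())
print(X_val_scaled.std(ddof=0).round(3).to_dict())
```

The validation features no longer have exactly unit variance, which is expected: that is what the model would actually face on unseen data.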


### Training a simple classifier

In [64]:
# create a binary target for classification

#remove duplicate timestamps if present
X = X[~X.index.duplicated()]
#very simple target in this case, just to see how to do it: 1 if the next bar's log-return is positive
#(the last row has no next bar: NaN > 0 is False, so it is labelled 0)
y = (df.log_ret.shift(-1) > 0).astype(int)
y = y[~y.index.duplicated()]
#align y with X to be sure
y = y.loc[X.index]

print(f'shape of y {y.shape} and of X {X.shape}')

shape of y (64220,) and of X (64220, 7)
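
The `shift(-1)` labelling can be checked on a toy series. This is only an illustration with made-up numbers; it shows how the target lines up with the features and what happens to the last row, whose next return is unknown.

```python
import pandas as pd

log_ret = pd.Series([0.01, -0.02, 0.03, -0.01])

# Return realized over the NEXT bar, aligned with the current bar's features.
next_ret = log_ret.shift(-1)

# NaN > 0 evaluates to False, so the last row silently becomes class 0.
y_naive = (next_ret > 0).astype(int)

# Alternative: drop the row whose label is unknown before building the target.
y_clean = (next_ret.dropna() > 0).astype(int)

print(y_naive.tolist())
print(y_clean.tolist())
```

With tens of thousands of rows a single mislabelled observation is harmless, but dropping it keeps the target definition exact.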


In [65]:
#train and validation split --> MAKE SURE SHUFFLE IS FALSE --> we need to preserve the time ordering
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, shuffle=False)
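
A single chronological split gives one estimate only. A possible extension, sketched below, is a walk-forward scheme with a gap between the training and validation rows, in the spirit of the purging/embargo idea from the theory section. The helper `walk_forward_splits` and its parameters are hypothetical, and the gap is a simplified stand-in for proper purging.

```python
import numpy as np

def walk_forward_splits(n_samples, n_splits=3, embargo=0):
    """Yield (train_idx, val_idx) pairs with an expanding training window.

    `embargo` rows are skipped between the end of the training window and
    the start of the validation window.
    """
    fold = n_samples // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_end = fold * k
        val_start = train_end + embargo
        val_end = fold * (k + 1)
        yield np.arange(0, train_end), np.arange(val_start, val_end)

splits = list(walk_forward_splits(100, n_splits=3, embargo=5))
for tr, va in splits:
    print(f"train 0..{tr[-1]}  val {va[0]}..{va[-1]}")
```

The index arrays can be passed to `X.iloc[...]` and `y.iloc[...]`. A gap of at least the rolling window length would keep the feature windows of the first validation rows from overlapping the training period.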

In [66]:
# train Random Forest
rf = RandomForestClassifier(
    n_estimators=100,
    max_depth=5,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_tr, y_tr)

#predictions
y_pred = rf.predict(X_val)

#main metrics
print("\nConfusion Matrix:\n", confusion_matrix(y_val, y_pred))
print("\nClassification Report:\n", classification_report(y_val, y_pred))


Confusion Matrix:
 [[3075 3229]
 [2764 3776]]

Classification Report:
               precision    recall  f1-score   support

           0       0.53      0.49      0.51      6304
           1       0.54      0.58      0.56      6540

    accuracy                           0.53     12844
   macro avg       0.53      0.53      0.53     12844
weighted avg       0.53      0.53      0.53     12844
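
To put the 0.53 accuracy in context, it helps to compare it with the trivial strategy of always predicting the more frequent class. The sketch below recomputes both numbers from the confusion matrix printed above.

```python
import numpy as np

# Confusion matrix reported above (rows: true class, columns: predicted class).
cm = np.array([[3075, 3229],
               [2764, 3776]])

n = cm.sum()
accuracy = np.trace(cm) / n
# Accuracy of always predicting the more frequent class in the validation set.
majority_baseline = cm.sum(axis=1).max() / n

print(f"accuracy           = {accuracy:.4f}")
print(f"majority baseline  = {majority_baseline:.4f}")
print(f"edge over baseline = {accuracy - majority_baseline:.4f}")
```

The model scores about 0.533 against a baseline of about 0.509, an edge of roughly 2.4 percentage points, which is consistent with the low signal-to-noise ratio discussed in the theory section.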



In [67]:
#feature importance
feat_imp = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Feature Importances")
feat_imp

Feature Importances


| feature | importance |
|---|---|
| log_ret | 0.471712 |
| ma_ratio | 0.199757 |
| momentum | 0.10064 |
| vol_zscore | 0.067126 |
| volatility_vol | 0.055465 |
| roll_vol | 0.053442 |
| ma_vol | 0.051859 |
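
`feature_importances_` is an in-sample, impurity-based measure, so it can look informative even when the out-of-sample edge is small. One possible cross-check is permutation importance on held-out rows, sketched here on synthetic data (the `*_demo` names and the data-generating process are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(5)
n = 2_000
# Hypothetical features: only "signal" is related to the target.
X_demo = pd.DataFrame(rng.standard_normal((n, 3)), columns=["signal", "noise_a", "noise_b"])
y_demo = (X_demo["signal"] + 0.5 * rng.standard_normal(n) > 0).astype(int)

cut = int(n * 0.8)
model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=42)
model.fit(X_demo.iloc[:cut], y_demo.iloc[:cut])

# Drop in held-out accuracy when each feature is shuffled in turn.
result = permutation_importance(
    model, X_demo.iloc[cut:], y_demo.iloc[cut:], n_repeats=10, random_state=42
)
perm_imp = pd.Series(result.importances_mean, index=X_demo.columns).sort_values(ascending=False)
print(perm_imp)
```

On the notebook's data the same call would take `rf`, `X_val` and `y_val`; features whose permutation importance is close to zero contribute little to validation accuracy, whatever their impurity-based score.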
