# Cross Validation - Shuffling, Purging and Embargoing

This notebook answers exercises 7.1 to 7.5 from Chapter 7 of the AFML book.

## Exercise 7.1

Why is shuffling a dataset before conducting k-fold CV generally a bad idea in
finance? What is the purpose of shuffling? Why does shuffling defeat the purpose
of k-fold CV in financial datasets?

Shuffling a dataset before performing k-fold cross-validation is generally a bad idea in finance because financial data are time-dependent and not independent and identically distributed (IID).

Purpose of shuffling in standard CV:

In classical machine learning, shuffling ensures that each fold is representative of the full dataset. It prevents bias when the rows arrive in a meaningful order (for example, sorted by class or by collection date). This is appropriate only when the data are IID.

Why shuffling is problematic in finance:

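Financial prices and returns exhibit serial correlation, meaning the value at time t depends on previous periods t−1, t−2, …. Shuffling destroys the temporal structure, so the model is trained on observations that come after, or overlap with, the test observations: information that would not have been available at prediction time. This leads to look-ahead bias and data leakage, giving artificially optimistic performance estimates. Because the test folds are then no longer truly unseen, the CV score stops being an estimate of out-of-sample performance, which is the whole purpose of k-fold CV.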
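The leakage channel can be made concrete with a small sketch that is not part of the original exercise. It counts, for each test fold, the share of test observations whose immediate time neighbour (t−1 or t+1) sits in the training set. The helper names `kfold_indices` and `share_with_adjacent_train` are hypothetical, and the splitter is a plain NumPy stand-in for a k-fold splitter rather than scikit-learn's `KFold`.

```python
import numpy as np

def kfold_indices(n_samples, n_splits, shuffle, seed=0):
    """Yield (train_idx, test_idx) pairs for a plain k-fold split."""
    idx = np.arange(n_samples)
    if shuffle:
        idx = np.random.default_rng(seed).permutation(n_samples)
    for test_idx in np.array_split(idx, n_splits):
        train_idx = np.setdiff1d(np.arange(n_samples), test_idx)
        yield train_idx, np.sort(test_idx)

def share_with_adjacent_train(n_samples, n_splits, shuffle):
    """Average share of test points whose neighbour at t-1 or t+1 is in train."""
    shares = []
    for train_idx, test_idx in kfold_indices(n_samples, n_splits, shuffle):
        in_train = np.zeros(n_samples, dtype=bool)
        in_train[train_idx] = True
        # Clipping maps out-of-range neighbours back onto the test point itself,
        # which is never in train, so edge points are handled safely.
        prev_in_train = in_train[np.clip(test_idx - 1, 0, n_samples - 1)]
        next_in_train = in_train[np.clip(test_idx + 1, 0, n_samples - 1)]
        shares.append(np.mean(prev_in_train | next_in_train))
    return float(np.mean(shares))

contiguous_share = share_with_adjacent_train(1000, 10, shuffle=False)
shuffled_share = share_with_adjacent_train(1000, 10, shuffle=True)
print(f"contiguous folds: {contiguous_share:.3f}, shuffled folds: {shuffled_share:.3f}")
```

With contiguous folds only the observations at the fold boundaries have a neighbour in the training set. After shuffling, almost every test observation does, so a serially correlated series leaks into training almost everywhere.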

Implications for cross-validation:

Instead of shuffling, we need time-aware CV methods, such as:

- Purged k-fold CV
- Embargoed CV
- Sequential bootstrap sampling

These methods respect the causal ordering of financial events and prevent overlapping information from contaminating the training set.
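As a rough illustration of purging and embargoing, the sketch below drops from the training set every observation whose label interval overlaps the test block, plus those starting within an embargo window after it. It uses integer times and a hypothetical helper `purged_embargoed_train`; it is not the `cvScore` implementation used later in this notebook. The strict inequality on the "before" side is a conservative choice made here, not something taken from the book.

```python
import numpy as np

def purged_embargoed_train(t0, t1, test_pos, embargo):
    """Return training positions for one contiguous test block.

    t0, t1   : start and end time of each label, ordered by t0
    test_pos : positions that form the test block
    embargo  : extra time units to exclude after the test block ends
    """
    t0 = np.asarray(t0)
    t1 = np.asarray(t1)
    test_start = t0[test_pos].min()
    test_end = t1[test_pos].max()
    before = t1 < test_start           # label fully resolved before the test block starts
    after = t0 > test_end + embargo    # label starts after the test block plus embargo
    keep = before | after
    keep[test_pos] = False             # never train on the test observations
    return np.flatnonzero(keep)

# Ten observations, each label spanning two time units
t0 = np.arange(10)
t1 = t0 + 2
train_no_embargo = purged_embargoed_train(t0, t1, test_pos=[4, 5], embargo=0)
train_embargo = purged_embargoed_train(t0, t1, test_pos=[4, 5], embargo=1)
print(train_no_embargo, train_embargo)
```

Observations 2 and 3 are purged because their labels end inside the test block, 6 and 7 because they start before the test labels have resolved, and the embargo additionally removes observation 8.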


## Exercise 7.2

### Get Features and Labels

In [30]:
# import libraries

import sys
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import datetime
import seaborn as sns
import statsmodels
from statsmodels.tsa.stattools import acf
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import coint
from scipy.stats import jarque_bera
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

#import functions from scripts folder

sys.path.append('../scripts')
from fracdiff import *
from labelling import *
from samp_weights import *
from fin_data_management import * 
from fetch_yf_data import fetch_data
from AFML_book_scripts import *
from AFML_my_scripts import *
from sequential_CV import *

#### Data

In [2]:
# import data 
data = pd.read_csv("../data/SP_futures_tick_data.csv")
# manipulate data so that we can transform it into a dollar bar series
datetime_str = data['date'] + ' ' + data['time']
data['datetime'] = pd.to_datetime(datetime_str, errors='coerce')
#drop date and time columns
data = data.drop(['date', 'time'], axis=1)
#get the dollar bar dataframe
dollars_bars_size = 1000000  
df = DollarBarsDfVectorized(data, dollar_per_bar=dollars_bars_size)
#check for duplicates
print(df.index[df.index.duplicated()])
# reindex the dataframe by datetime, as we will need time-indexed series objects
df = df.drop('start_date', axis=1 )
df = df.rename(columns={'end_date': 'datetime'})
df = df.set_index('datetime')
#remove duplicate indices and check again
df = df[~df.index.duplicated(keep='first')]
print(df.index[df.index.duplicated()])

Index([], dtype='int64')
DatetimeIndex([], dtype='datetime64[ns]', name='datetime', freq=None)
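`DollarBarsDfVectorized` lives in the local scripts folder and its code is not shown here. As a rough idea of what a vectorized dollar-bar aggregation can look like, here is a minimal sketch. The function `dollar_bars_sketch` and the tick column names `price` and `volume` are assumptions for illustration and may not match the actual script or data file.

```python
import pandas as pd

def dollar_bars_sketch(ticks: pd.DataFrame, dollar_per_bar: float) -> pd.DataFrame:
    """Aggregate ticks into bars holding roughly `dollar_per_bar` of traded value."""
    dollar_value = ticks["price"] * ticks["volume"]
    # The bar id steps up each time cumulative traded value crosses a multiple
    # of the threshold; the crossing tick is assigned to the new bar.
    bar_id = (dollar_value.cumsum() // dollar_per_bar).astype(int)
    grouped = ticks.groupby(bar_id)
    bars = pd.DataFrame({
        "start_date": grouped["datetime"].first(),
        "end_date": grouped["datetime"].last(),
        "open": grouped["price"].first(),
        "high": grouped["price"].max(),
        "low": grouped["price"].min(),
        "close": grouped["price"].last(),
        "dollar_value": dollar_value.groupby(bar_id).sum(),
    })
    return bars.reset_index(drop=True)

# Toy ticks, one per minute
ticks = pd.DataFrame({
    "datetime": pd.date_range("2020-01-01 09:30", periods=6, freq=pd.Timedelta(minutes=1)),
    "price": [100.0, 101.0, 102.0, 101.0, 100.0, 99.0],
    "volume": [5, 5, 5, 5, 5, 5],
})
bars = dollar_bars_sketch(ticks, dollar_per_bar=1000)
print(bars[["start_date", "end_date", "close", "dollar_value"]])
```

Assigning the crossing tick to the next bar keeps the sketch fully vectorized, at the cost of bars that are only approximately equal in dollar value.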


#### CUSUM filter

In [18]:
std

np.float64(316.01634790805633)

In [19]:
df.close.mean()

np.float64(1340.346529209622)
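The filter code itself does not appear in the cells above. For reference, a symmetric CUSUM filter in the spirit of the one described in AFML can be sketched as follows. The function name `cusum_events` and the toy series are assumptions for illustration, not the notebook's own implementation.

```python
import pandas as pd

def cusum_events(series: pd.Series, threshold: float) -> pd.Index:
    """Return the index labels at which a symmetric CUSUM filter fires."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for t, change in series.diff().dropna().items():
        # Accumulate upward and downward drift separately, floored/capped at zero
        s_pos = max(0.0, s_pos + change)
        s_neg = min(0.0, s_neg + change)
        if s_neg < -threshold:
            s_neg = 0.0
            events.append(t)
        elif s_pos > threshold:
            s_pos = 0.0
            events.append(t)
    return pd.Index(events)

prices = pd.Series([100.0, 101.0, 102.0, 103.0, 100.0, 97.0])
events_demo = cusum_events(prices, threshold=2.5)
print(list(events_demo))
```

An event is recorded only when the cumulative move since the last reset exceeds the threshold, so quiet periods produce no samples.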

#### Feature Matrix

In [None]:
# drop columns that are not needed
df = df.drop(['open', 'high', 'low'],axis =1)
#create some extra features
window = 5  
df['rolling_mean'] = df['close'].rolling(window).mean()
df['rolling_std'] = df['close'].rolling(window).std()
df['returns'] = df['close'].pct_change()


#### Triple-Barrier Method (TBM)

In [20]:
ptSL = (1,1)
target = GetTargetforTBM(df.close,ema_periods=window)
min_ret = target.mean()*2
numDays = 1
close = df.close
tEvents = close.index

t1=df.close.index.searchsorted(tEvents+pd.Timedelta(days=numDays))
t1=t1[t1<df.close.shape[0]]
t1=pd.Series(df.close.index[t1],index=tEvents[:t1.shape[0]])
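The three lines above that build `t1` can be hard to read at first. The toy example below repeats the same `searchsorted` steps on a six-point index with 12-hour spacing, to show how each event is mapped to the first bar at or after the one-day vertical barrier. The names `close_index`, `pos` and `t1_demo` are illustrative only.

```python
import pandas as pd

# Six timestamps, 12 hours apart
close_index = pd.date_range("2020-01-01", periods=6, freq=pd.Timedelta(hours=12))
num_days = 1

# Position of the first bar at or after (event time + num_days)
pos = close_index.searchsorted(close_index + pd.Timedelta(days=num_days))
# Drop events whose barrier falls beyond the end of the data
pos = pos[pos < len(close_index)]
# Map each remaining event time to its vertical-barrier timestamp
t1_demo = pd.Series(close_index[pos], index=close_index[:len(pos)])
print(t1_demo)
```

The last two events have no bar one day ahead, so they receive no vertical barrier. This is the reason rows with a missing `t1` are dropped later.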

In [21]:
#get events
events = getEventsMeta(df.close,tEvents,ptSL,target,min_ret,t1)
#get labels
labels = getTBMLabels(events, df.close)

In [22]:
# drop NaN rows from events and df, since their vertical barriers are not realized
events = events.dropna()
df = df.loc[events.index]
#check if all the dataframes are aligned
print(f'The shapes of events, feature matrix and labels df are {events.shape}, {df.shape} and {labels.shape}')

The shapes of events, feature matrix and labels df are (1428, 4), (1428, 6) and (1428, 4)


In [23]:
y = labels.bin
X = df

print(f'shape of X {X.shape} and shape of y {y.shape}')

shape of X (1428, 6) and shape of y (1428,)


### Solve the Exercise

In [35]:
from sklearn.model_selection import KFold, cross_val_score

# NO SHUFFLING - KFOLD

clf = RandomForestClassifier(n_estimators=100, random_state=42)

kfold_no_shuffle = KFold(n_splits=10,shuffle=False)
scores_no_shuffle = cross_val_score(clf, X, y, cv=kfold_no_shuffle, scoring='accuracy')

# SHUFFLING - KFOLD
kf_shuffle = KFold(n_splits=10, shuffle=True, random_state=42)

scores_shuffle = cross_val_score(clf, X, y, cv=kf_shuffle, scoring='accuracy')


# PURGE AND EMBARGO (NO SHUFFLING)
t1_series = events["t1"]
scores_purged = cvScore(
    clf,
    X,
    y,
    sample_weight=pd.Series(1, index=X.index),  # equal weights
    scoring='accuracy',
    t1=t1_series,
    cv=10,
    pctEmbargo=0.01
)
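To get a feel for what `pctEmbargo=0.01` means here, the arithmetic below assumes that the local `cvScore` follows the AFML convention of converting the fraction into a whole number of bars with `int(n_samples * pctEmbargo)` and of splitting the index into contiguous folds. That assumption is not verified against the script.

```python
import numpy as np

n_samples, n_splits, pct_embargo = 1428, 10, 0.01

# Embargo length in bars under the assumed convention
embargo_bars = int(n_samples * pct_embargo)

# Sizes of contiguous folds produced by an even split of the index
fold_sizes = [len(fold) for fold in np.array_split(np.arange(n_samples), n_splits)]

print(f"embargo: {embargo_bars} bars, fold sizes: {fold_sizes}")
```

Under these assumptions each test fold of about 143 bars is followed by 14 embargoed bars, on top of whatever purging removes.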


In [36]:

print('--------- RESULTS WITH SHUFFLE = TRUE ---------')
print("Mean accuracy:", round(scores_shuffle.mean(),2))

print('--------- RESULTS WITH SHUFFLE = FALSE ---------')
print("Mean accuracy:", round(scores_no_shuffle.mean(),2))

print('--------- RESULTS WITH PURGE AND EMBARGO ---------')
print("Mean accuracy:", round(scores_purged.mean(),2))

--------- RESULTS WITH SHUFFLE = TRUE ---------
Mean accuracy: 0.56
--------- RESULTS WITH SHUFFLE = FALSE ---------
Mean accuracy: 0.49
--------- RESULTS WITH PURGE AND EMBARGO ---------
Mean accuracy: 0.51
