In [1]:
with open('requirements.txt', 'w') as f:
    f.write("""pandas>=1.5.0
numpy>=1.21.0
scikit-learn>=1.6.1
matplotlib>=3.5.0
seaborn>=0.11.0
xgboost>=1.7.6
yfinance>=0.2.0
statsmodels>=0.13.0""")

In [None]:
# __future__ imports must be the first statement in the cell
from __future__ import annotations

import os
import sys
from pathlib import Path
from typing import List, Optional, Tuple, Type

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import silhouette_score, calinski_harabasz_score, mean_squared_error
from sklearn.model_selection import TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Python 3.13.7


## NOTE: HOW TO RUN CODE/NOTEBOOK FOR ASSESSOR

**Use of Artificial Intelligence Acknowledgement: Some files within this assignment have been created with the assistance of Generative Artificial Intelligence, indicated by a top comment in each such file**

### Option 1: Estimated Completion Time: 122 minutes
If you wish to run the full notebook, you can do so via the "Run All" button at the very top. This includes all horizons (horizons = [1, 5, 10, 20, 40, 50, 70, 100]) for student_ext.py and the baseline comparison with McGreevy et al. (2024).

-----------------------------------------------------


### Option 2: Estimated Completion Time: 10-15 mins (for one horizon)
If you wish to run only the required code for a specified horizon: run the first two Python cells in this notebook, then search (Ctrl+F) for **%run DataPreparation/KMeans_Data_Prep_robustScaler** and run that cell. Next, search for **%run student_ext.py**, change the "horizons = [1, 5, 10, 20, 40, 50, 70, 100]" line (line 380 of student_ext.py) to a horizon of your choosing, for example "horizons = [10]", and run that cell.

-----------------------------------------------------

To save time, I have included my Part 1 results, my original clustering model and my clustering experiments as .csv files, and I have imported their findings into this notebook.

Please note: the plots presented in this notebook are saved externally in Appendix/Images... - so that they can be viewed without having to wait for the entire notebook to run


# Part 2: Multivariate Regime Detection and Predictive Modelling

Within this project, I will be performing the following:

#### 1. Data Preparation:
"Using the same set of tickers as in Part 1 (XLK, XLP, XLV, XLF, XLE, XLI) to construct a multivariate timeseries from daily or weekly prices or other derived features (such as returns and volatility)."

For this assignment, I have decided to use weekly adj_close prices. I used weekly data for clustering because:
- weekly data is less noisy than daily data,
- rolling statistics are more stable on weekly data than on more volatile daily data,
- it has a lower computational cost, leading to faster clustering.

I use adj_close because it accounts for dividends, corporate actions and splits, which makes it better suited to analysis for long-term investing.

#### 2. Regime Detection:

"Apply a clustering algorithm of your choice (e.g. k-means, Gaussian Mixture Models, hierarchical clustering) To identify latent regimes (ie clusters) from this multivariate time series."

For this assignment, I will use K-Means as my clustering algorithm.

#### 3. Integration with Prediction
"Use the detected regime labels as additional features in your predictive modelling task from Part 1 (the Student class with fit/predict).
• Evaluate whether regime-informed prediction improves over your Part 1 Student class as the baseline"

For this assignment, I will attempt to integrate my chosen clustering algorithm with my original Part 1 baseline student.py. I will also run experiments to see whether changing certain fields, methods, etc. within my *Regime Detection* section can improve the overall directional accuracy and decrease the overall MAE and RMSE.

#### 4. Baseline Comparison

"In addition to your own Part 1 baseline, you must compare your results against at least one published baseline method" I will be using: "McGreevy, J., Muguruza, A., Issa, Z., Salvi, C., Chan, J., & Zuric, Z. (2024). Detecting Multivariate Market Regimes Via Clustering Algorithms. SSRN [http://dx.doi.org/10.2139/ssrn.4758243]"[1]

## 1. Data Preparation

The clustering algorithm I'm attempting is K-Means clustering. K-means has proven successful in previous academic work. For example, Horvath, Issa & Muguruza (2021), "Clustering Market Regimes Using the Wasserstein Distance" [2], found that Wasserstein k-means (a variant of k-means, more on this later) achieved high accuracy in detecting market regimes in univariate time series.




In [3]:
%run DataPreparation/KMeans_Data_Prep




-------------------------------
K = 2
Silhouette: 0.334 | CH: 140.4

[INTERPRETATION - TRAIN]
Cluster 0: ret=0.0072, vol=0.0113 -> BULL
Cluster 1: ret=-0.0193, vol=0.0140 -> BEAR
Saved: detected_regimes_k2.csv

-------------------------------
K = 3
Silhouette: 0.303 | CH: 141.0

[INTERPRETATION - TRAIN]
Cluster 0: ret=0.0204, vol=0.0137 -> BULL
Cluster 1: ret=0.0035, vol=0.0109 -> NEUTRAL
Cluster 2: ret=-0.0300, vol=0.0141 -> BEAR
Saved: detected_regimes_k3.csv

-------------------------------
K = 4
Silhouette: 0.149 | CH: 120.1

[INTERPRETATION - TRAIN]
Cluster 0: ret=-0.0065, vol=0.0111 -> NEUTRAL
Cluster 1: ret=0.0192, vol=0.0142 -> NEUTRAL
Cluster 2: ret=-0.0353, vol=0.0157 -> BEAR
Cluster 3: ret=0.0116, vol=0.0106 -> BULL
Saved: detected_regimes_k4.csv

-----------------------------------------------------


Within Appendix/DataPreparation/KMeans_Data_Prep.py, I load daily prices from prices.csv (specifically adj_close), normalise the tickers and convert the long format into a wide weekly panel of the six sector ETFs. From this weekly panel, I create causal features, including weekly log returns, 12-week rolling volatility and 6-week momentum for each ticker, along with market-level aggregates (i.e. cross-sectional volatility) that capture overall market conditions. I drop rows with incomplete features and then split the dataset into train and test sets at a fixed date to avoid look-ahead bias. I chose 4th Jan 2019 because it is a Friday and it provides 9 years of training data (2010-2018) and 5 years of testing data (2019-2023).
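As a rough illustration of the feature construction described above, the sketch below builds the same kinds of features from a synthetic price panel. It is not the code in KMeans_Data_Prep.py: the function name `build_weekly_features`, the column suffixes, the `W-FRI` resampling rule, the synthetic prices and the choice to put the split date itself in the test set are all assumptions made for the example.

```python
import numpy as np
import pandas as pd


def build_weekly_features(wide_daily: pd.DataFrame,
                          vol_window: int = 12,
                          mom_window: int = 6) -> pd.DataFrame:
    """Build causal weekly features from a wide daily price panel (hypothetical helper)."""
    # Last available price of each week, with weeks ending on Friday.
    weekly = wide_daily.resample("W-FRI").last()
    log_ret = np.log(weekly).diff()
    # Rolling statistics only look at the current and past weeks, so they are causal.
    vol = log_ret.rolling(vol_window).std()
    mom = log_ret.rolling(mom_window).sum()
    feats = pd.concat(
        [log_ret.add_suffix("_ret"), vol.add_suffix("_vol"), mom.add_suffix("_mom")],
        axis=1,
    )
    # Market-level aggregate: dispersion of returns across tickers in each week.
    feats["xs_vol"] = log_ret.std(axis=1)
    # Drop the warm-up rows where rolling windows are incomplete.
    return feats.dropna()


# Synthetic random-walk prices stand in for prices.csv (illustrative only).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2018-01-01", "2019-12-31")
tickers = ["XLK", "XLP", "XLV", "XLF", "XLE", "XLI"]
steps = rng.normal(0.0, 0.01, size=(len(dates), len(tickers)))
prices = pd.DataFrame(100 * np.exp(steps.cumsum(axis=0)), index=dates, columns=tickers)

features = build_weekly_features(prices)

# Fixed-date split: everything before the split date is training data.
split_date = pd.Timestamp("2019-01-04")
train = features.loc[features.index < split_date]
test = features.loc[features.index >= split_date]
print(train.shape, test.shape)
```

Splitting on a date rather than on a random sample keeps every training row strictly earlier than every test row, which is the property the look-ahead argument relies on.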

An Introduction to Statistical Learning (ISLP) [3] recommends "... always running K-means clustering with a large value of n_init, such as 20 or 50, since otherwise an undesirable local optimum may be obtained", which is why I set n_init=50. Later in this assignment, I will also try n_init values of 20 and 30.

As a practical consideration, I standardise the inputs with StandardScaler, because K-means minimises within-cluster Euclidean distances and features on larger scales would otherwise dominate the clustering. Page 532 of ISLP [3] notes that standardisation is preferred for K-means. Later in this assignment, I will try a different scaler from the sklearn.preprocessing package to see which gives the best directional accuracy and the lowest MAE and RMSE.

To choose the best K value for my K-means clustering on the prices.csv data, I use a combination of the silhouette score and the Calinski-Harabasz index. To prevent data leakage, I also create a regime_lag1 column so that predictions only use information available at prediction time.

Interpreting the Data:
The terms; "Bull", "Neutral" and "Bear" are used to refer to stock market conditions - used to describe how the market is doing in general. According to Investopedia [4], "A bull market is a market that is on the rise and where the economy is sound. A bear market exists in an economy that is receding, where most stocks are declining in value." Neutral market indicates a market that is stable, which is neither rising, nor falling.

Within my data preparation file, I have a heuristic rule for the different K values.

For K=2, I use binary classification on returns: if the average return is positive the cluster is "Bull", and if it is negative the cluster is "Bear".

For K=3 and K=4, I use ternary classification on return and volatility, based on a 2x2 decision matrix:


| Return vs Median | Volatility vs Median | Label   |
|------------------|----------------------|---------|
| Above Median     | Below Median         | Bull    |
| Above Median     | Above Median         | Neutral |
| Below Median     | Below Median         | Neutral |
| Below Median     | Above Median         | Bear    |
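The labelling rule above can be sketched in a few lines. This is a minimal illustration rather than the code in my data preparation file: the function name `label_clusters` is hypothetical, and treating a value equal to the median as "not above" is an assumption. The example inputs are the per-cluster mean return and volatility printed for K=4 above.

```python
from statistics import median


def label_clusters(stats: dict[int, tuple[float, float]]) -> dict[int, str]:
    """Map cluster id -> label from (mean return, mean volatility) per cluster."""
    if len(stats) == 2:
        # K=2: binary rule on the sign of the average return.
        return {c: ("BULL" if ret > 0 else "BEAR") for c, (ret, _) in stats.items()}

    # K>=3: compare each cluster with the median return and median volatility.
    ret_med = median(ret for ret, _ in stats.values())
    vol_med = median(vol for _, vol in stats.values())
    labels = {}
    for c, (ret, vol) in stats.items():
        high_ret, high_vol = ret > ret_med, vol > vol_med  # ties count as "not above"
        if high_ret and not high_vol:
            labels[c] = "BULL"
        elif not high_ret and high_vol:
            labels[c] = "BEAR"
        else:
            labels[c] = "NEUTRAL"
    return labels


# Cluster statistics taken from the K=4 training output above.
k4_stats = {0: (-0.0065, 0.0111), 1: (0.0192, 0.0142),
            2: (-0.0353, 0.0157), 3: (0.0116, 0.0106)}
print(label_clusters(k4_stats))
```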


I use the silhouette score and the Calinski-Harabasz index to decide on the best K value.

Use of Silhouette Score:
According to NumberAnalytics [5], the silhouette score is "bounded between -1 and 1. [A positive value] indicates that the point is appropriately clustered, whereas values near 0 denote ambiguity, and negative values hint at a possible misclassification."

Use of Calinski-Harabasz (CH):
The Calinski-Harabasz index (also known as the variance ratio criterion) measures "how similar an object is to its own cluster (cohesion) compared to other clusters (separation)" according to GeeksForGeeks [6]. A higher CH value means that clusters are dense and well separated.
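As a minimal sketch of how both scores can be computed for each candidate K with scikit-learn, the example below uses two synthetic blobs in place of my scaled training features. The data, the `random_state` values and the rule of picking K by the highest silhouette score are assumptions for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Two well separated synthetic blobs stand in for the training features.
X = np.vstack([
    rng.normal(-2.0, 0.5, size=(150, 4)),
    rng.normal(2.0, 0.5, size=(150, 4)),
])
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in (2, 3, 4):
    km = KMeans(n_clusters=k, n_init=50, random_state=0).fit(X_scaled)
    sil = silhouette_score(X_scaled, km.labels_)          # bounded in [-1, 1]
    ch = calinski_harabasz_score(X_scaled, km.labels_)    # unbounded, higher is better
    scores[k] = (sil, ch)
    print(f"K = {k} | Silhouette: {sil:.3f} | CH: {ch:.1f}")

# Pick the K with the highest silhouette score (one possible selection rule).
best_k = max(scores, key=lambda k: scores[k][0])
print("Best K by silhouette:", best_k)
```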

Results:
For K=2:
- Silhouette Score is 0.334, CH Index is 140.4
- Cluster 0 is bull
- Cluster 1 is bear

The silhouette score is the highest of all the K values, indicating the clearest regime distinction. I consider this the best K value, as it is also the simplest model that captures the main market states.

For K=3:
-  Silhouette Score is 0.303, CH index is 141.0
- Cluster 0 is bull
- Cluster 1 is neutral
- Cluster 2 is bear

The silhouette score is lower than for K=2, but the CH index is slightly higher. I give the silhouette score more weight, as it has a bounded range (-1 to 1) and is therefore easier to interpret than CH, which is unbounded. This K value captures neutral periods (slightly positive returns, low volatility).

For K=4
- Silhouette Score is 0.149, CH index is 120.1

The silhouette score drops dramatically to 0.149 and the CH index is also the lowest of all the K values, indicating the worst separation.

For this assignment, I will be using k=2, but will investigate further with k=3 and k=4 in my experimental section.


### Integrating with Prediction:

Below are the results of my baseline student from Coursework Part 1 (student.py, which uses ElasticNet and is available to view in /Appendix as "student.py"):

In [15]:
import pandas as pd

filename = "Results/part1_results_student.csv"  
df = pd.read_csv(filename)
print(df)

   HORIZON  DIRACC    MAE   RMSE
0        1   0.525  0.009  0.013
1        5   0.560  0.019  0.027
2       10   0.578  0.027  0.038
3       20   0.598  0.038  0.053
4       40   0.636  0.051  0.070
5       50   0.656  0.056  0.076
6       70   0.682  0.064  0.086
7      100   0.706  0.074  0.099



To integrate my initial student.py from Coursework Part 1 with K-means, I extended the class (please see Appendix/student_kmeans_before_experimentation.py). The data preparation file (Appendix/DataPreparation/KMeans_Data_Prep.py) writes an external regime file, "detected_regimes_k{K}.csv" (e.g. detected_regimes_k2.csv), and my prediction class, student_kmeans_before_experimentation.py, requires this file.

Within my prediction class, the regime file must contain the columns date, regime and regime_lag1 (the last of these to prevent data leakage). The inputs are the X dataframe (same as the original code) and my detected_regimes_k2.csv. The DatetimeIndex is sorted because time series operations like .loc[date] and shift() assume chronological order.

The _make_features() function in my prediction class now produces 13 features instead of 12 (the original Part 1 student features plus my newly created regime column). The helper methods are the same as in my original Part 1 student code.

My prediction class has a new function, _get_regime_for_date(). This is a time-safe regime lookup that uses the lagged regime data, again to prevent data leakage (the regime used at time t only relies on information up to t-1), and it handles some edge cases, such as NaN values.
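A time-safe lookup of this kind could look roughly like the sketch below. It is not the body of _get_regime_for_date() in my class: the standalone function `get_regime_for_date`, the toy regime table and the fallback value of -1 are assumptions chosen to show the lagging and the NaN handling.

```python
import pandas as pd

# Toy regime table standing in for detected_regimes_k2.csv (illustrative values).
regimes = pd.DataFrame(
    {"date": pd.to_datetime(["2019-01-04", "2019-01-11", "2019-01-18"]),
     "regime": [1, 0, 1]}
).set_index("date").sort_index()

# Lag by one period: the label attached to week t is the one detected at week t-1.
regimes["regime_lag1"] = regimes["regime"].shift(1).astype("Int64")


def get_regime_for_date(date, table: pd.DataFrame, default: int = -1) -> int:
    """Return the most recent lagged regime at or before `date` (hypothetical helper)."""
    # Label-based slicing on a sorted DatetimeIndex never looks past `date`.
    past = table.loc[:pd.Timestamp(date), "regime_lag1"].dropna()
    # Fall back to `default` when no lagged label is available yet.
    return int(past.iloc[-1]) if len(past) else default


print(get_regime_for_date("2019-01-15", regimes))
```

Dates that fall between two weekly rows reuse the latest earlier row, so a daily prediction date never picks up a regime computed from a later week.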

For the fit() function, I've added one-hot encoding of regime labels (lines 200-203), creating regime-specific indicator features. Additionally, I've stored the training column structure to ensure consistent feature alignment during prediction.

For predict(), I have added column alignment logic to ensure prediction features match the training structure, including creating one-hot regime columns for new data and reordering columns to stay consistent with the training phase.
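The fit/predict alignment described above can be illustrated with pandas. This is a sketch under assumptions, not the code in my class: the helper `one_hot_regimes`, the `regime_` column prefix and the toy frames are made up for the example.

```python
import pandas as pd


def one_hot_regimes(features: pd.DataFrame, regime_col: str = "regime_lag1") -> pd.DataFrame:
    """Replace the regime column with one indicator column per regime (hypothetical helper)."""
    dummies = pd.get_dummies(features[regime_col], prefix="regime", dtype=float)
    return pd.concat([features.drop(columns=[regime_col]), dummies], axis=1)


train = pd.DataFrame({"ret_1": [0.01, -0.02, 0.00], "regime_lag1": [0, 1, 0]})
new = pd.DataFrame({"ret_1": [0.03, 0.01], "regime_lag1": [1, 2]})  # regime 2 is unseen

# At fit time: encode and remember the exact column order.
train_X = one_hot_regimes(train)
train_columns = list(train_X.columns)

# At predict time: add missing indicator columns as 0, drop columns for unseen
# regimes and restore the training column order in a single reindex call.
new_X = one_hot_regimes(new).reindex(columns=train_columns, fill_value=0.0)
print(new_X)
```

With this approach a row whose regime was never seen in training ends up with all indicator columns set to 0, which is one simple way to handle unseen regimes.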

Compared to my original student, my prediction class contains more error handling: it validates that the regime file has the required columns, handles regimes not seen during training, aligns prediction features to the training structure and manages nullable Int64 types. Overall, it is more robust than my original student.py class.

Within my K-means student class, I've also embedded the core functions of mltester, specifically forward_log_return(), compute_metrics(), walk_forward_predict() and evaluate_ticker(). Within my walk_forward_predict(), unlike the original mltester.py, each walk-forward block gets a fresh model with the same regime file. I've also simplified the data loading compared to the original mltester.py.
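For readers unfamiliar with the three metrics, the sketch below shows one common way to compute a forward log return, directional accuracy, MAE and RMSE. The function names match those mentioned above, but the bodies are my assumptions about typical definitions (for example, directional accuracy as the share of matching signs) and are not copied from mltester.py.

```python
import numpy as np


def forward_log_return(prices, horizon: int) -> np.ndarray:
    """log(P[t+h] / P[t]) for horizon >= 1; the last `horizon` entries are NaN."""
    prices = np.asarray(prices, dtype=float)
    out = np.full(prices.shape, np.nan)
    out[:-horizon] = np.log(prices[horizon:] / prices[:-horizon])
    return out


def compute_metrics(y_true, y_pred) -> dict:
    """Directional accuracy, MAE and RMSE between realised and predicted returns."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "DIRACC": float(np.mean(np.sign(y_true) == np.sign(y_pred))),
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
    }


fwd = forward_log_return([100.0, 110.0, 121.0], horizon=1)
metrics = compute_metrics([0.01, -0.02, 0.03, -0.01], [0.02, 0.01, 0.01, -0.03])
print(fwd, metrics)
```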

Looking at the DirAcc, MAE and RMSE results for the different K values, the results differ across K, which shows that the regime labels do influence the predictions. As expected, K=2 performs best, with better DirAcc and lower MAE/RMSE on average than K=3 and K=4.


In [17]:
import pandas as pd

# These results were generated from the .csv output by Appendix/DataPreparation/KMeans_Data_Prep.py
# and were produced by "Appendix/student_kmeans_before_experimentation.py"; they are saved in the .csv file below:

filename = "Results/student_kmeans_k2_standardscaler_init50_NOpca_results.csv"  
df = pd.read_csv(filename)
print(df)

   HORIZON  DIRACC     MAE    RMSE
0        1   0.529  0.0086  0.0127
1        5   0.569  0.0192  0.0270
2       10   0.589  0.0267  0.0376
3       20   0.619  0.0374  0.0525
4       40   0.657  0.0510  0.0696
5       50   0.677  0.0564  0.0764
6       70   0.713  0.0639  0.0858
7      100   0.737  0.0735  0.0983


### Experimentation of Different Methods within my KMeans clustering.

At this point, my K-means clustering file (Appendix/DataPreparation/KMeans_Data_Prep.py) uses:
- StandardScaler(),
- no PCA when generating clusters,
- n_init = 50.

In an attempt to improve the directional accuracy and decrease the MAE and RMSE values, I will try:
- changing StandardScaler() to RobustScaler(),
- implementing PCA within the cluster generation,
- changing the n_init value to 20, and
- changing the n_init value to 30.


Below I describe each experiment in turn.

#### RobustScaler (aka Robust):

Within my KMeans_Data_Prep, I changed the scaler on line 42 from StandardScaler() to RobustScaler() (Appendix/Experimentation/KMeans_Data_Prep_robustScaler). I also changed the scaler used in the prediction file (Appendix/Experimentation/student_kmeans_multvariate_robust.py), which runs the mltester functionality on my computed time series. I decided to try this replacement because StandardScaler(), according to GeeksForGeeks [7], "subtracts the mean of the data and divides it by standard deviation. This centers the data around zero and standardizes the variability", which makes it more sensitive to outliers. RobustScaler, by contrast, "subtracts the median of data and divides by interquartile range (IQR) which helps in reducing the effect of outliers while maintaining distribution of non-outlier values." I expect RobustScaler to produce better results, as it should make the prediction class less sensitive to extreme events.
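The difference between the two scalers is easiest to see on a small series with one extreme value. The numbers below are invented for illustration (six ordinary weekly returns and one crash-like week) and are not taken from prices.csv.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Six ordinary weekly returns plus one crash-like outlier (illustrative values).
x = np.array([0.01, 0.00, -0.01, 0.02, -0.02, 0.01, -0.30]).reshape(-1, 1)

standard = StandardScaler().fit_transform(x).ravel()  # (x - mean) / std
robust = RobustScaler().fit_transform(x).ravel()      # (x - median) / IQR

# Spread (max - min) of the six ordinary weeks after scaling.
standard_spread = np.ptp(standard[:-1])
robust_spread = np.ptp(robust[:-1])
print(f"StandardScaler spread of ordinary weeks: {standard_spread:.3f}")
print(f"RobustScaler spread of ordinary weeks:   {robust_spread:.3f}")
```

Because the outlier inflates the standard deviation, StandardScaler squeezes the ordinary weeks into a narrow band, whereas the median and IQR used by RobustScaler are barely affected, so the ordinary weeks keep more of their relative spread.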

#### PCA:
Within my KMeans_Data_Prep, I added PCA (principal component analysis) for dimensionality reduction, to help "reduce the number of features in a dataset while keeping the most important information" according to GeeksForGeeks [8]. This can remove noise and redundant information, resulting in cleaner, more stable clusters. PCA can also mitigate the curse of dimensionality, since data in high-dimensional spaces becomes sparse, which can lead to overfitting. I expect the inclusion of PCA to produce better results due to noise reduction.
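One way to place PCA in front of K-means is a scikit-learn Pipeline, sketched below. The random placeholder data, the 95% explained-variance threshold and the step names are assumptions for the example and may differ from what my experimentation file does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 19))  # placeholder for the training features
X_test = rng.normal(size=(50, 19))    # placeholder for the test features

regime_model = Pipeline([
    ("scale", StandardScaler()),
    # Keep enough components to explain 95% of the variance (assumed threshold).
    ("pca", PCA(n_components=0.95, svd_solver="full")),
    ("kmeans", KMeans(n_clusters=2, n_init=50, random_state=0)),
])

# Fit the scaler, PCA and K-means on the training window only, then assign
# test observations to the nearest training centroid.
regime_model.fit(X_train)
test_regimes = regime_model.predict(X_test)
print(regime_model.named_steps["pca"].n_components_, np.bincount(test_regimes))
```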

#### n_init20 (aka Init20) and n_init30 (aka Init30)
Within my KMeans_Data_Prep, I changed the n_init value from 50 to 20 and to 30 via the n_init parameter in the KMeans initialisation (line 55).
The n_init parameter is defined as the "Number of times the k-means algorithm is run with different centroid seeds" according to the scikit-learn docs [9]. For both n_init=20 and n_init=30, I expect results similar to my current prediction class results, "Results/student_kmeans_k2_standardscaler_init50_NOpca_results.csv" (printed above this cell).


#### Directional Accuracy Results

![directional_accuracy_zoomed.png](Appendix/Images/directional_accuracy_zoomed1.png)
(The above graph is zoomed in for easier distinction between bars)



From the results:
- Init20 and Init30 have the same averages as each other across the different K values, indicating that changing the n_init value makes no difference to the directional accuracy.
- Adding PCA to the regime detection gives a high DirAcc at K=2 and K=3 and the second highest at K=4, which is still higher than my original K-means code.
- RobustScaler is the lowest-scoring method for DirAcc at K=2, matches the other methods at K=3 and has the highest score at K=4.
- My original student K-means has a high DirAcc at K=2 and K=3 but the lowest DirAcc overall at K=4, suggesting that my baseline clustering becomes unstable at higher K values.

The worst performer was Robust at K=2; at K=4 it was Init20, Init30 and my K-means student.

Overall, I have found that regimes can improve prediction accuracy. For K=2, the highest directional accuracy is given by Init20, Init30, PCA and my original K-means student. DirAcc is the same across all methods at K=3, and Robust has the highest DirAcc at K=4.

#### MAE Results

![mae_zoomed.png](Appendix/Images/mae_zoomed1.png)
(The above graph is zoomed in for easier distinction between bars)

From the results:
- As with DirAcc, Init20 and Init30 have the same MAE averages as each other across the different K values, indicating that changing the n_init value makes no difference to the MAE.
- PCA has the third highest MAE at K=2, the highest MAE at K=3 and the second lowest at K=4.
- Robust has the lowest MAE at every K value.
- My original K-means student has the highest MAE at K=2, the second highest at K=3 and the same MAE as Init20 and Init30 at K=4.

Overall, for minimising the MAE, RobustScaler() performed best at every K value. The worst performers (that is, the highest MAE) were my student K-means at K=2, PCA at K=3, and Init20, Init30 and my student K-means at K=4.

#### RMSE Results

![rmse_zoomed.png](Appendix/Images/rmse_zoomed1.png)
(The above graph is zoomed in for easier distinction between bars)

The RMSE results closely mirror the MAE results. Overall, for minimising the RMSE, RobustScaler() performed best at every K value. The worst performers (that is, the highest RMSE) were my student K-means at K=2, PCA at K=3, and Init20, Init30 and my student K-means at K=4.

#### Decisions made based on results

Based on the results of my experimentation, I must consider the trade-off between higher directional accuracy with higher MAE and RMSE, and slightly lower directional accuracy with lower MAE and RMSE. As RobustScaler() was stable across the DirAcc, MAE and RMSE results (especially considering that the difference between the highest-scoring methods and Robust at K=2 was only 0.0008), I have decided to implement it in my prediction model, which now becomes student_ext.py.

To run my student_ext.py, I first prepare the data (now using the RobustScaler file):

In [19]:
%run DataPreparation/KMeans_Data_Prep_robustScaler


-------------------------------
K = 2
Silhouette: 0.334 | CH: 144.5

[INTERPRETATION - TRAIN]
Cluster 0: ret=0.0075, vol=0.0114 -> BULL
Cluster 1: ret=-0.0207, vol=0.0137 -> BEAR
Saved: detected_regimes_k2_robustscaler.csv

-------------------------------
K = 3
Silhouette: 0.296 | CH: 142.5

[INTERPRETATION - TRAIN]
Cluster 0: ret=0.0211, vol=0.0137 -> BULL
Cluster 1: ret=-0.0294, vol=0.0140 -> BEAR
Cluster 2: ret=0.0035, vol=0.0109 -> NEUTRAL
Saved: detected_regimes_k3_robustscaler.csv

-------------------------------
K = 4
Silhouette: 0.154 | CH: 122.5

[INTERPRETATION - TRAIN]
Cluster 0: ret=-0.0067, vol=0.0111 -> NEUTRAL
Cluster 1: ret=0.0195, vol=0.0141 -> NEUTRAL
Cluster 2: ret=0.0115, vol=0.0106 -> BULL
Cluster 3: ret=-0.0361, vol=0.0158 -> BEAR
Saved: detected_regimes_k4_robustscaler.csv

-----------------------------------------------------


In [None]:
%run student_ext.py #estimated run time: 63 mins


TESTING K=2 REGIMES

Evaluating horizon 1 days...
   XLE: DirAcc=0.5127  MAE=0.011796  RMSE=0.017381
   XLF: DirAcc=0.5148  MAE=0.009424  RMSE=0.013988
   XLI: DirAcc=0.5376  MAE=0.008376  RMSE=0.012288
   XLK: DirAcc=0.5490  MAE=0.009201  RMSE=0.013440
   XLP: DirAcc=0.5323  MAE=0.005938  RMSE=0.008649
   XLV: DirAcc=0.5281  MAE=0.007118  RMSE=0.010159
Horizon   1 days: DirAcc=0.529 | MAE=0.0086 | RMSE=0.0127

Evaluating horizon 5 days...
   XLE: DirAcc=0.5305  MAE=0.026839  RMSE=0.039030
   XLF: DirAcc=0.5568  MAE=0.020812  RMSE=0.029535
   XLI: DirAcc=0.5807  MAE=0.019007  RMSE=0.026991
   XLK: DirAcc=0.5958  MAE=0.020012  RMSE=0.027484
   XLP: DirAcc=0.5791  MAE=0.012665  RMSE=0.017620
   XLV: DirAcc=0.5714  MAE=0.015699  RMSE=0.021462
Horizon   5 days: DirAcc=0.569 | MAE=0.0192 | RMSE=0.0270

Evaluating horizon 10 days...
   XLE: DirAcc=0.5349  MAE=0.038098  RMSE=0.055896
   XLF: DirAcc=0.5857  MAE=0.028969  RMSE=0.041096
   XLI: DirAcc=0.5937  MAE=0.026170  RMSE=0.037358
   XLK:

The following visualisation compares three models: the original Part 1 baseline (no regime detection), my K-means model with StandardScaler() (before experimentation) and my new K-means class (after experimentation, using RobustScaler() instead of StandardScaler()).

![part1baseline_originalkmeans_robustkmeans.png](Appendix/Images/part1baseline_originalkmeans_robustkmeans.png)

I've also included a heatmap to view results:

![comparison_heatmaps.png](Appendix/Images/comparison_heatmaps.png)


Overall, I have found that both K-means classes outperform the Part 1 baseline in directional accuracy across all horizons, with smaller improvements at short-term horizons and larger improvements at long-term horizons. MAE and RMSE are also reduced (by a small amount) in some places. This demonstrates the value of incorporating market regime information.

RobustScaler() vs StandardScaler() Trade Offs:

My original K-means before experimentation, using StandardScaler(), has slightly better directional accuracy (for example, at horizon 70), but my K-means after experimentation, using RobustScaler(), has lower RMSE. As mentioned before, because RobustScaler() shows good stability (and the differences in directional accuracy are negligible), this version has become my student_ext.py.

In [20]:
%run Appendix/PublishedComparison/ComputeRegimes_McGreevy.py

# this creates regimes_mcgreevy_weekly.csv

Saved: regimes_mcgreevy_weekly.csv


In [21]:
%run Appendix/PublishedComparison/student_ext_mcgreevy.py #takes 59 mins to run - RESULTS OF THIS ARE IN "Results/mcgreevy_results.csv"


TESTING K=2 REGIMES

Evaluating horizon 1 days...


KeyboardInterrupt: 

## 4. Baseline Comparison

For the baseline comparison, I use the method from "Detecting Multivariate Market Regimes via Clustering Algorithms" (2024) by James McGreevy, Aitor Muguruza-Gonzalez, Zacharia Issa, Jonathan Chan and Cris Salvi. I prepared the data via ComputeRegimes_McGreevy.py (which outputs regimes_mcgreevy_weekly.csv) and ran student_ext_mcgreevy.py on it (the only difference between this file and student_ext.py is that it uses the regimes_mcgreevy_weekly.csv output by ComputeRegimes_McGreevy.py).

The key takeaways of the McGreevy et al. paper are as follows:

"• We develop an adapted k-means algorithm that uses the 2-Wasserstein distance metric or Maximum Mean Discrepancy, and d-dimensional data in order to identify changes in joint market regimes between assets, in particular correlation.
• We create a two-step process for finding the marginal and joint market regimes in synthetic and real data.
• Using the two-step process, we form approximations to the mean, variance and correlation which then subsequently inform profitable portfolios of pairs of stocks."

To create my version of McGreevy et al. for this assignment, I first wrote a script to compute the regimes (ComputeRegimes_McGreevy.py), and then created the class containing fit(), predict() and the relevant mltester functions, which uses the computed regimes. The main contribution of McGreevy et al. (2024) is adapting clustering to this specific problem: it uses a probability metric (specifically, the 2-Wasserstein distance between empirical measures) within its clustering framework to compare the distributions of data segments, rather than raw data points. In other words, the novelty is replacing Euclidean distance with Wasserstein distance.

### Creation of the Regimes (ComputeRegimes_McGreevy.py)

Following McGreevy et al.'s framework, I used log returns of the six sector ETFs. Weekly sampling (as I've mentioned before) reduces noise compared to daily data. Using the wide panel format (tickers as columns, dates as rows) allows direct application of their multivariate clustering approach. The paper states: "We work with the log-returns of elements of the form S = (s0, . . . , sN ) ∈ S(R^d), which are price paths of d financial assets."

Although McGreevy et al. do not mention standardisation, I standardised returns within a walk-forward framework to maintain consistency with my baseline models and ensure fair distance comparisons.

![h1_minus_h2_block_mcgreevy.png](Appendix/Images/h1_minus_h2_block_mcgreevy.png)

(Image is taken from another McGreevy et al paper (2023)).

Figure 5.2 is a horizontal timeline of returns divided into boxes (where each box is one observation, e.g. one week). h1 represents the segment length (in the image, 5 boxes = 5 weeks). h2 represents the overlap between consecutive segments (e.g. spanning all 5 boxes / weeks). The red block indicates the current segment being clustered. Each segment ($\mu_1$, $\mu_2$, etc.) represents an empirical distribution of returns over h1 weeks, and the lower arrows show the overlapping window sliding forward by h1 - h2 steps.

I implemented the paper's segmentation method using 20-week windows (h1) and 16-week overlaps (h2), resulting in a 4-week step between consecutive segments. As mentioned above, each segment represents an empirical distribution that forms the basis for regime detection.
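The windowing with h1 = 20 and h2 = 16 can be sketched as follows. The function name `make_segments` and the toy input are hypothetical; the code in ComputeRegimes_McGreevy.py may differ in detail.

```python
import numpy as np


def make_segments(returns: np.ndarray, h1: int = 20, h2: int = 16) -> list[np.ndarray]:
    """Split a (T, d) array into windows of h1 rows that overlap by h2 rows."""
    if not 0 <= h2 < h1:
        raise ValueError("need 0 <= h2 < h1")
    step = h1 - h2  # each new window starts `step` rows after the previous one
    return [returns[s:s + h1] for s in range(0, len(returns) - h1 + 1, step)]


# 52 weeks of placeholder data for 6 tickers (the values are just row/column indices).
weekly_returns = np.arange(52 * 6, dtype=float).reshape(52, 6)
segments = make_segments(weekly_returns)
print(len(segments), segments[0].shape)
```

Each element of `segments` is one empirical measure with 20 atoms in 6 dimensions; consecutive segments share 16 of their 20 rows.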

The core innovation in McGreevy et al.'s paper is replacing Euclidean distance with Wasserstein distance for comparing empirical distributions. Within my w2() function, I implemented their 2-Wasserstein metric using optimal transport, specifically the Hungarian algorithm for segments. In "The d-dimensional WK algorithm", the paper refers to: "Computation of the p-Wasserstein distance... Hungarian algorithm" (the Hungarian algorithm being "linear_sum_assignment(C)" in my code).
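For two empirical measures with the same number of equally weighted atoms, the optimal transport plan is a one-to-one matching, so the 2-Wasserstein distance reduces to an assignment problem. The sketch below shows that idea with SciPy's `linear_sum_assignment`; it is a minimal version written for this explanation and is not necessarily identical to the w2() in my script.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def w2(seg_a: np.ndarray, seg_b: np.ndarray) -> float:
    """2-Wasserstein distance between two (n, d) segments with equal atom counts."""
    a = np.asarray(seg_a, dtype=float)
    b = np.asarray(seg_b, dtype=float)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError("segments must both have shape (n, d)")
    # Cost matrix of squared Euclidean distances between every pair of atoms.
    diff = a[:, None, :] - b[None, :, :]
    C = np.sum(diff ** 2, axis=2)
    # Optimal one-to-one matching of atoms (Hungarian-style assignment).
    rows, cols = linear_sum_assignment(C)
    # Each matched pair carries mass 1/n, hence the mean before the square root.
    return float(np.sqrt(C[rows, cols].mean()))


a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
b = a + np.array([3.0, 4.0])  # the same point cloud shifted by a vector of length 5
print(w2(a, b))
```

Shifting a point cloud by a fixed vector gives a distance equal to the length of that vector, which makes a convenient sanity check for the function.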

Following McGreevy et al.'s two-step approach, the first step focuses on marginal distributions (volatility patterns). I first applied univariate Wasserstein k-means to each asset separately, identifying high and low volatility regimes (k=2). McGreevy et al. (2024) explain that this step removes the influence of marginal distributions (mean and variance) before analysing correlation. They write: "We begin by applying the uni-d 1-WK-means algorithm to the data in order to remove the effects of the marginal distribution on each asset and subsequently apply the 2-d WK-means or 2-d MMDK-means algorithm to this transformed data. This method is based on the theory of copulas."

To follow McGreevy et al.'s recommendation of separating volatility effects from correlation structure, I implemented a copula transformation. In my code, this happens after univariate clustering, in the ecdf_from_atoms() function. For each ETF and its volatility cluster, I built an empirical CDF (ECDF) from pooled segment values, then converted each segment's returns into Uniform(0,1) values using this ECDF. This ensures that all marginals are standardised, leaving only the dependence structure.
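A probability integral transform of this kind can be sketched with NumPy. The function name matches ecdf_from_atoms() above, but the body is an assumption written for illustration: in particular, dividing by n + 1 and clipping the ranks (so that outputs stay strictly between 0 and 1) are choices made for this example.

```python
import numpy as np


def ecdf_from_atoms(atoms):
    """Return a function mapping values to empirical CDF levels in (0, 1)."""
    sorted_atoms = np.sort(np.asarray(atoms, dtype=float).ravel())
    n = sorted_atoms.size

    def ecdf(x):
        # Number of pooled atoms that are <= x.
        ranks = np.searchsorted(sorted_atoms, x, side="right")
        # Clip and rescale by n + 1 so that results never hit exactly 0 or 1.
        return np.clip(ranks, 1, n) / (n + 1)

    return ecdf


# Pooled returns of one ticker within one volatility cluster (illustrative values).
pooled = np.array([0.03, -0.01, 0.02, -0.02])
to_uniform = ecdf_from_atoms(pooled)
u = to_uniform(pooled)
print(u)
```

After the transform, every ticker's values live on the same (0, 1) scale regardless of its mean or variance, so differences between segments reflect how the tickers move together rather than how volatile each one is.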

Once volatility effects have been removed via the copula transformation, I perform multivariate Wasserstein k-means clustering on these copula segments to detect correlation regimes (k=2). This second stage identifies periods of high versus low interdependence among the ETFs.

Finally, regime labels are lagged by one period to prevent look-ahead bias, as mentioned previously in my "Data Preparation" section. The output .csv file is used by "student_ext_mcgreevy.py", which uses regime_lag1 when merging features to avoid leakage.

For the comparison itself, I compare my student_ext.py from this assignment with my McGreevy et al. implementation.

For Directional Accuracy:

![diracc_heatmap_large.png](Appendix/Images/diracc_heatmap_large.png)

Results for DirAcc are quite similar to each other, but my student_ext.py outperforms at horizons 20 and 100, whereas McGreevy et al. outperforms at horizons 40 and 70.

For MAE:

![mae_heatmap_large.png](Appendix/Images/mae_heatmap_large.png)

Results for MAE are very similar, but student_ext.py outperforms at horizon 100, with an MAE that is 0.01 lower.


For RMSE:

![rmse_heatmap_large.png](Appendix/Images/rmse_heatmap_large.png)

Again, results for RMSE are very similar, but student_ext.py outperforms at horizons 10 and 20, with an RMSE that is 0.01 lower at each.

Overall, my McGreevy et al. implementation was a strong competitor to my student_ext.py. My student_ext.py still outperformed it, though by very small margins, which further supports the effectiveness of my student_ext.py.


References:
[1] James McGreevy, Aitor Muguruza Gonzalez, Zacharia Issa, Jonathan Chan, Cris Salvi (2024). Detecting Multivariate Market Regimes via Clustering Algorithms. Imperial College London.

[2] Horvath, Issa & Muguruza (2021). Clustering Market Regimes Using the Wasserstein Distance. https://arxiv.org/abs/2110.11848

[3] Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, Jonathan Taylor (2023). An Introduction to Statistical Learning.

[4] Leslie Kramer (June 2024). Investopedia. https://www.investopedia.com/insights/digging-deeper-bull-and-bear-markets/

[5] Sarah Lee (AI GENERATED) (March 2025). NumberAnalytics. https://www.numberanalytics.com/blog/silhouette-score-clustering-evaluation

[6] Debomit Dey (July 2025). Calinski-Harabasz Index – Cluster Validity indices. GeeksForGeeks. https://www.geeksforgeeks.org/machine-learning/calinski-harabasz-index-cluster-validity-indices-set-3/

[7] Ashwin (Oct 2025). StandardScaler, MinMaxScaler and RobustScaler techniques. GeeksForGeeks. https://www.geeksforgeeks.org/machine-learning/standardscaler-minmaxscaler-and-robustscaler-techniques-ml/

[8] Aishwarya (Nov 2025). Principal Component Analysis (PCA). GeeksForGeeks. https://www.geeksforgeeks.org/data-analysis/principal-component-analysis-pca/

[9] scikit-learn (2025). sklearn.cluster.KMeans. https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

