# Market Regime Classification

Following the financial crisis, quantitative value strategies, especially in equities, have experienced a decade of underperformance, making it challenging for investors to maintain commitment to them. This notebook aims to identify the corresponding regimes and implement tactical allocation changes by evaluating the performance of four well-known supervised methods for binary (Value or Momentum) regime classification. Based on the results of [Fernández-Delgado et al. (2014)](https://jmlr.org/papers/v15/delgado14a.html), we focus on the following methods:

- Logistic Regression (baseline model)
- Random Forest
- Support Vector Machine
- Multi-layer Perceptron

Even though these are not sequence models, time series problems can be reformulated as supervised learning problems by lagging the feature time series, extracting suitable features that capture the relevant aspects, and forming (X, y) tuples with the target variable.
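
As a rough sketch of this reformulation, the following builds lagged copies of each feature and pairs every row with the label of the following period. The helper `make_supervised`, the column name `spread`, and the toy values are illustrative assumptions and not part of the data set used below.

```python
import pandas as pd


def make_supervised(features: pd.DataFrame, target: pd.Series, n_lags: int = 2):
    """Return (X, y): row t holds the features at t, t-1, ..., t-n_lags+1,
    and y holds the target observed at t+1."""
    lagged = {
        f"{col}_lag{lag}": features[col].shift(lag)
        for col in features.columns
        for lag in range(n_lags)
    }
    X = pd.DataFrame(lagged, index=features.index)
    y = target.shift(-1).rename("target_next")
    # Keep only rows where every lag and the next-period target are available
    valid = X.notna().all(axis=1) & y.notna()
    return X[valid], y[valid]


# Toy data (hypothetical values, for illustration only)
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
toy_features = pd.DataFrame({"spread": [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]}, index=idx)
toy_regime = pd.Series([1, 0, 1, 1, 0, 1], index=idx)

X_toy, y_toy = make_supervised(toy_features, toy_regime, n_lags=2)
print(X_toy.join(y_toy))
```

The first row is lost to the lag and the last row to the forward-shifted target, so six observations yield four usable (X, y) pairs.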

### Libraries

These are the necessary libraries we rely on:

In [None]:
# General libraries for data manipulation and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import missingno as msno
from arch.bootstrap import CircularBlockBootstrap, optimal_block_length
from scipy.cluster import hierarchy
import time

# Sklearn libraries for model selection, metrics, and models
from sklearn.model_selection import train_test_split, TimeSeriesSplit, RandomizedSearchCV
from sklearn.metrics import (confusion_matrix, roc_curve, roc_auc_score, accuracy_score, precision_score,
                             recall_score, f1_score, classification_report, log_loss, make_scorer)
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.decomposition import PCA
from sklearn.tree import plot_tree
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance
from sklearn.feature_selection import RFECV, SequentialFeatureSelector
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.base import clone

# Skopt for hyperparameter tuning
from skopt import BayesSearchCV
from skopt.space import Real, Categorical, Integer

# SciPy libraries for statistical functions
from scipy.stats import randint, uniform, reciprocal
from scipy.stats import rv_continuous

# Numpy random state
from numpy.random import RandomState

## Data Engineering

The monthly return data used here are from the paper [How Do Factor Premia Vary Over Time? A Century of Evidence](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3400998) by Ilmanen et al. (2019). It explores the following four factor strategies across liquid asset classes:

1. Value
2. Momentum
3. Carry
4. Defensive

We focus on **Value** and **Momentum**, aggregated across all liquid asset classes, for two reasons. First, they can be considered antagonists due to their most pronounced negative pairwise correlation, which results in the highest potential added value through timing them. Second, aggregation across asset classes helps lower the impact of asset-specific factors.

### Price based Features

From the return data set, a few features can be engineered to capture several possibly informative aspects:

- **Value & Momentum Returns**: They capture the momentum of these factors, embodying the persistence of absolute returns.

- **Value-Momentum Spread**: Designed to exploit potential mean-reversion tendencies, this feature captures divergences in returns that may trigger counter-movements, such as those driven by portfolio rebalancing activities.

- **Value & Momentum 12M returns**: These annual cumulative returns provide a window into traders' portfolios over longer periods of time. They also capture trend effects over this time horizon.

- **Value-Momentum 12M Spread**: This feature serves as another measure of mean reversion and extends the horizon to 12 months, allowing for the identification of longer-term cyclical behaviour.

- **Value & Momentum 36M Returns**: Cumulative returns over this three-year period are an important criterion for the evaluation of investments of institutional investors. This feature captures that aspect.

- **Value-Momentum 36M Spread**: This feature extends the mean reversion capture window to 36 months, providing a longer-term perspective on the cyclicality of these strategies.

- **Rolling Value & Momentum Means**: These rolling averages over the past year provide insight into current trends and shifts in expected returns, helping to identify potential strategic pivots.

- **Rolling Mean Difference**: The difference between the rolling means of Value & Momentum provides a comparative perspective on the current expectations of each strategy.

- **Value & Momentum Volatilities**: The annualized standard deviations of returns capture the inherent uncertainty associated with each strategy.

- **Volatility Difference**: The difference in volatility between Value & Momentum captures the comparative uncertainty associated with each strategy.

- **Value & Momentum Skewness**: These 12-month rolling statistics capture asymmetries in the risk and return profiles.

- **Skewness Difference**: The difference between the skewness of Value & Momentum returns indicates potential profit opportunities associated with the asymmetry of each strategy's return distribution.

- **Value & Momentum Excess Kurtosis**: Measures tail risk, highlighting the potential for extreme outcomes and capturing the degree of outlier risk in each strategy.

- **Kurtosis Difference**: This measure of the divergence between the tail risks of value and momentum strategies helps to identify which may be more prone to extreme returns.

- **Correlation**: This feature captures the interdependence between value and momentum trading returns, helping to capture potential portfolio overlap and diversification implications for investors.

- **Value & Momentum Drawdowns**: These measures encapsulate the most significant losses investors would have suffered prior to a rebound, providing a tangible picture of the losses investors suffer.

- **Drawdown Difference**: This feature compares the drawdowns of value and momentum strategies to capture which strategy may be causing investors more pain at this time.
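
The data set loaded below already contains these features. As a minimal sketch of how a few of them could be derived from two monthly return series, consider the following. The function names, the output column names, the 12-month window, and the drawdown definition (loss relative to the running peak of cumulative wealth) are assumptions made for illustration, and the returns are simulated.

```python
import numpy as np
import pandas as pd


def trailing_cumulative_return(returns: pd.Series, window: int) -> pd.Series:
    """Compound simple returns over a trailing window."""
    return (1 + returns).rolling(window).apply(np.prod, raw=True) - 1


def price_features(value: pd.Series, momentum: pd.Series, window: int = 12) -> pd.DataFrame:
    out = pd.DataFrame(index=value.index)
    # Trailing cumulative returns and their spread
    out[f"Value {window}M"] = trailing_cumulative_return(value, window)
    out[f"Momentum {window}M"] = trailing_cumulative_return(momentum, window)
    out[f"Spread {window}M"] = out[f"Value {window}M"] - out[f"Momentum {window}M"]
    # Annualized volatility from monthly returns
    out["Value Vol"] = value.rolling(window).std() * np.sqrt(12)
    # Rolling skewness and rolling correlation between the two strategies
    out["Value Skew"] = value.rolling(window).skew()
    out["Correlation"] = value.rolling(window).corr(momentum)
    # Drawdown: current loss relative to the running peak of cumulative wealth
    wealth = (1 + value).cumprod()
    out["Value Drawdown"] = wealth / wealth.cummax() - 1
    return out


# Simulated monthly returns (not the Ilmanen et al. data)
rng = np.random.default_rng(0)
idx = pd.date_range("2000-01-01", periods=36, freq="MS")
value = pd.Series(rng.normal(0.005, 0.03, 36), index=idx)
momentum = pd.Series(rng.normal(0.006, 0.04, 36), index=idx)

features = price_features(value, momentum)
print(features.tail())
```

The first `window - 1` rows of the rolling features are NaN by construction, which is one reason why the usable sample starts later than the raw return history.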

In [None]:
# Load the dataset from a csv file, specify the separator as ";"
data = pd.read_csv('return based data.csv', sep = ";")

# Convert the 'Date' column to datetime format (day first)
data['Date'] = pd.to_datetime(data['Date'], dayfirst=True)

# Set 'Date' as the index of the DataFrame
data.set_index('Date', inplace=True)

# Remove the last two columns of the DataFrame
data = data.iloc[:, :-2]

# Remove the last row of the DataFrame
data = data.iloc[:-1]

# Convert the 'Regime' column to integer type
data['Regime'] = data['Regime'].astype(int)

# Display the DataFrame
display(data)

First, we visualize the cumulative performance of both strategies to get a first impression.

In [None]:
# Cumulative performance (growth of one unit invested) of both strategies
value_cum = (1 + data['Value Returns']).cumprod()
momentum_cum = (1 + data['Momentum Returns']).cumprod()

# Set figure size and dpi
fig, ax = plt.subplots(figsize=(11, 6), dpi=100)

# Plot the data
ax.plot(value_cum, label='Value', color='red')
ax.plot(momentum_cum, label='Momentum', color='black')

# Add a horizontal line at the starting value of one
ax.axhline(1, color='black', linestyle='--')

# Shade the background: light grey where Value is ahead of Momentum, light coral otherwise
value_ahead = (value_cum > momentum_cum).to_numpy()
ax.fill_between(data.index, 0, 1, where=value_ahead, color='lightgrey',
                transform=ax.get_xaxis_transform(), zorder=0)
ax.fill_between(data.index, 0, 1, where=~value_ahead, color='lightcoral',
                transform=ax.get_xaxis_transform(), zorder=0)

# Set axis labels and use a log scale for the cumulative returns
ax.set_xlabel("Time")
ax.set_ylabel("Cumulative return")
ax.set_yscale('log')

# Set plot title
ax.set_title("Value vs. Momentum")

# Add legend
ax.legend()

# Show the plot
plt.show()

As an example of a feature, the 36-month cumulative return spread between Value and Momentum is plotted here.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

# Plot the data
plt.plot(data["Value-Momentum 36M Spread"], color='black')

# Add a horizontal line at zero
plt.axhline(0, color='red', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("return spread")

# Set plot title
plt.title("Value-Momentum 36M Spread")

# Show the plot
plt.show()

### Economic Features

As a next step, we supplement the data with the most important non-price-based financial and macroeconomic features, such as the **[Federal Funds Effective Rate](https://fred.stlouisfed.org/series/DFF)**.

In [None]:
# Read the 'Federal Funds Effective Rate.csv' file using pandas read_csv function
DFF = pd.read_csv('Federal Funds Effective Rate.csv', sep = ",")

# Convert the 'DATE' column to datetime format using the pandas to_datetime function
DFF['DATE'] = pd.to_datetime(DFF['DATE'])

# Set the 'DATE' column as the index of the DataFrame using the set_index function
DFF = DFF.set_index('DATE')

# Rename the index to 'Date'
DFF.index.name = 'Date'

# Rename the 'DFF' column to 'Effective Federal Funds Rate' using the rename function
DFF = DFF.rename(columns={'DFF': 'Effective Federal Funds Rate'})

# Merge the 'data' DataFrame with the 'DFF' DataFrame using the merge function. 
# The merge is performed on the indices of the two DataFrames (left_index=True and right_index=True)
# and any rows that do not have a match in the other DataFrame are kept (how='outer')
combined_df = data.merge(DFF, left_index=True, right_index=True, how='outer')

# Normalize the index to remove any time component from the datetime objects
combined_df.index = pd.to_datetime(combined_df.index).normalize()

# Display the combined DataFrame using the display function
display(combined_df)

As a next step, we add the **[Federal Reserve Assets](https://fred.stlouisfed.org/series/WALCL)** to our set of features:

In [None]:
# Read the 'Federal Reserve Assets.csv' file into a DataFrame
WALCL = pd.read_csv('Federal Reserve Assets.csv', sep = ",")

# Convert the 'DATE' column to datetime format
WALCL['DATE'] = pd.to_datetime(WALCL['DATE'])  

# Set the 'DATE' column as the index of the DataFrame
WALCL = WALCL.set_index('DATE')  

# Rename the index to 'Date'
WALCL.index.name = 'Date'  

# Rename the 'WALCL' column to 'Federal Reserve Assets'
WALCL = WALCL.rename(columns={'WALCL': 'Federal Reserve Assets'})

# Merge the current DataFrame with the 'WALCL' DataFrame. The merge is performed on the indices
# of the two DataFrames (left_index=True and right_index=True), and rows that do not have a match 
# in the other DataFrame are kept (how='outer')
combined_df = combined_df.merge(WALCL, left_index=True, right_index=True, how='outer')

# Display the combined DataFrame
display(combined_df)

As a next step, we add the **[Mortgage Rates](https://fred.stlouisfed.org/graph/?g=QGRW)** to our set of features.

In [None]:
# Load the CSV file 'Mortage Rates.csv' into a pandas DataFrame
MR = pd.read_csv('Mortage Rates.csv', sep = ",")

# Convert the 'DATE' column to datetime format
MR['DATE'] = pd.to_datetime(MR['DATE'])

# Set 'DATE' column as the index of the DataFrame
MR = MR.set_index('DATE')

# Rename the index to 'Date'
MR.index.name = 'Date'

# Rename the column 'MORTGAGE30US' to '30Y Mortage Rates'
MR = MR.rename(columns={'MORTGAGE30US': '30Y Mortage Rates'})

# Rename the column 'MORTGAGE15US' to '15Y Mortage Rates'
MR = MR.rename(columns={'MORTGAGE15US': '15Y Mortage Rates'})

# Rename the column 'MORTGAGE5US' to '5/1Y Mortage Rates'
MR = MR.rename(columns={'MORTGAGE5US': '5/1Y Mortage Rates'})

# Merge 'combined_df' DataFrame with 'MR' DataFrame using their indices
# Rows with indices that are not common in both DataFrames are filled with NaNs ('outer' join)
combined_df = combined_df.merge(MR, left_index=True, right_index=True, how='outer')

# Display the merged DataFrame
display(combined_df)

As a next step, we add the **[Treasury Yields](https://fred.stlouisfed.org/graph/?g=I2cQ)** to our set of features:

In [None]:
# Load the CSV file 'Treasury Yields.csv' into a pandas DataFrame
TY = pd.read_csv('Treasury Yields.csv', sep = ",")

# Convert the 'DATE' column to datetime format
TY['DATE'] = pd.to_datetime(TY['DATE'])

# Set 'DATE' column as the index of the DataFrame
TY = TY.set_index('DATE')

# Rename the index to 'Date'
TY.index.name = 'Date'

# Rename the column 'DGS3MO' to '3M Treasury Yields'
TY = TY.rename(columns={'DGS3MO': '3M Treasury Yields'})

# Rename the column 'DGS2' to '2Y Treasury Yields'
TY = TY.rename(columns={'DGS2': '2Y Treasury Yields'})

# Rename the column 'DGS5' to '5Y Treasury Yields'
TY = TY.rename(columns={'DGS5': '5Y Treasury Yields'})

# Rename the column 'DGS10' to '10Y Treasury Yields'
TY = TY.rename(columns={'DGS10': '10Y Treasury Yields'})

# Merge 'combined_df' DataFrame with 'TY' DataFrame using their indices
# Rows with indices that are not common in both DataFrames are filled with NaNs ('outer' join)
combined_df = combined_df.merge(TY, left_index=True, right_index=True, how='outer')

# Display the merged DataFrame
display(combined_df)

As a next step, we add the **[Yield Spreads](https://fred.stlouisfed.org/graph/?g=I2c3)** to our set of features:

In [None]:
# Load the 'Yield Spreads.csv' file into a DataFrame
YS = pd.read_csv('Yield Spreads.csv', sep = ",")

# Convert the 'DATE' column to datetime
YS['DATE'] = pd.to_datetime(YS['DATE'])

# Set the 'DATE' column as the index of the DataFrame
YS = YS.set_index('DATE')

# Rename the index to 'Date'
YS.index.name = 'Date'

# Rename the 'T10Y2Y' column to '10-2Y Yield Spreads'
YS = YS.rename(columns={'T10Y2Y': '10-2Y Yield Spreads'})

# Rename the 'T10Y3M' column to '10-3M Yield Spreads'
YS = YS.rename(columns={'T10Y3M': '10-3M Yield Spreads'})

# Merge the 'YS' DataFrame into the 'combined_df' DataFrame, aligning on their indexes
# The 'outer' method ensures that all data is retained, even if a match isn't found in the other DataFrame
combined_df = combined_df.merge(YS, left_index=True, right_index=True, how='outer')

# Display the merged DataFrame
display(combined_df)

As a next step, we add the **[Financial Stress](https://fred.stlouisfed.org/graph/?g=12QTs)** indicators to our set of features:

In [None]:
# Load the 'Financial Stress.csv' file into a DataFrame
FS = pd.read_csv('Financial Stress.csv', sep = ",")

# Convert the 'DATE' column to datetime format
FS['DATE'] = pd.to_datetime(FS['DATE'])

# Set the 'DATE' column as the index of the DataFrame
FS = FS.set_index('DATE')

# Rename the index to 'Date'
FS.index.name = 'Date'

# Rename the 'NFCI' column to 'Chicago Fed National Financial Stress'
FS = FS.rename(columns={'NFCI': 'Chicago Fed National Financial Stress'})

# Rename the 'STLFSI4' column to 'St. Louis Fed Financial Stress'
FS = FS.rename(columns={'STLFSI4': 'St. Louis Fed Financial Stress'})

# Merge the 'FS' DataFrame into the 'combined_df' DataFrame, aligning on their indexes
# The 'outer' method ensures that all data is retained, even if a match isn't found in the other DataFrame
combined_df = combined_df.merge(FS, left_index=True, right_index=True, how='outer')

# Display the merged DataFrame
display(combined_df)

Now, we add the **[Sentiment Index](http://people.stern.nyu.edu/jwurgler/)** from Baker and Wurgler (2006).

In [None]:
# Define a function to parse dates in the format "%Y%m"
def date_parser(date):
    return datetime.strptime(date, "%Y%m")

# Load the 'Sentiment.csv' file into a DataFrame, parsing 'yearmo' as dates using the defined date_parser function
SENT = pd.read_csv("Sentiment.csv", sep=';', parse_dates=['yearmo'], date_parser=date_parser)

# Convert the 'yearmo' column to datetime format
SENT['yearmo'] = pd.to_datetime(SENT['yearmo'])

# Set the 'yearmo' column as the index of the DataFrame
SENT = SENT.set_index('yearmo')

# Rename the index to 'Date'
SENT.index.name = 'Date'

# Rename the 'SENT' column to 'Sentiment'
SENT = SENT.rename(columns={'SENT': 'Sentiment'})

# Select only the first column of the DataFrame (i.e., 'Sentiment')
SENT = SENT.iloc[:, 0:1]

# Merge the 'SENT' DataFrame into the 'combined_df' DataFrame, aligning on their indexes
# The 'outer' method ensures that all data is retained, even if a match isn't found in the other DataFrame
combined_df = combined_df.merge(SENT, left_index=True, right_index=True, how='outer')

# Display the merged DataFrame
display(combined_df)

Now, we add the **[Total return cyclically adjusted price-to-earnings ratio](http://www.econ.yale.edu/~shiller/data.htm)** to our features, which we call **CAPE**.

It is actually a modified version of the original CAPE that takes into account changes in payout policy, shifting from dividends to buybacks, which has taken place in recent years.

In [None]:
# Define a function to parse dates in the format "%Y.%m" (e.g. "1926.07")
# Note: strptime's "%m" also accepts one-digit months, whereas "%-m" is not a valid
# strptime directive, so a single format is sufficient here
def date_parser(date):
    return datetime.strptime(date, "%Y.%m")

# Load the 'CAPE.csv' file into a DataFrame, parsing 'Date' as dates using the defined date_parser function
CAPE = pd.read_csv("CAPE.csv", parse_dates=['Date'], date_parser=date_parser, sep=';')

# Convert the 'Date' column to datetime format
CAPE['Date'] = pd.to_datetime(CAPE['Date'])

# Set the 'Date' column as the index of the DataFrame
CAPE = CAPE.set_index('Date')

# Rename the index to 'Date'
CAPE.index.name = 'Date'

# Merge the 'CAPE' DataFrame into the 'combined_df' DataFrame, aligning on their indexes
# The 'outer' method ensures that all data is retained, even if a match isn't found in the other DataFrame
combined_df = combined_df.merge(CAPE, left_index=True, right_index=True, how='outer')

# Display the merged DataFrame
display(combined_df)

Here we add the total returns of the US stock market, i.e. the **[Beta](http://www.econ.yale.edu/~shiller/data.htm)** factor, as a major risk factor to our set of features.

In [None]:
# Define a function to parse dates in the format "%Y.%m" (e.g. "1926.07")
# Note: strptime's "%m" also accepts one-digit months, whereas "%-m" is not a valid
# strptime directive, so a single format is sufficient here
def date_parser(date):
    return datetime.strptime(date, "%Y.%m")

# Load the 'Beta.csv' file into a DataFrame, parsing 'Date' as dates using the defined date_parser function
BETA = pd.read_csv("Beta.csv", parse_dates=['Date'], date_parser=date_parser, sep=';')

# Convert the 'Date' column to datetime format
BETA['Date'] = pd.to_datetime(BETA['Date'])

# Set the 'Date' column as the index of the DataFrame
BETA = BETA.set_index('Date')

# Rename the index to 'Date'
BETA.index.name = 'Date'

# Select only the second column of the DataFrame (i.e., 'Return')
BETA = BETA.iloc[:, 1:2]

# Rename the 'Return' column to 'Beta'
BETA = BETA.rename(columns={'Return': 'Beta'})

# Merge the 'BETA' DataFrame into the 'combined_df' DataFrame, aligning on their indexes
# The 'outer' method ensures that all data is retained, even if a match isn't found in the other DataFrame
combined_df = combined_df.merge(BETA, left_index=True, right_index=True, how='outer')

# Display the merged DataFrame
display(combined_df)

Here, we add the **[Geopolitical Risk Index](http://www.policyuncertainty.com/gpr.html)** developed by Dario Caldara and Matteo Iacoviello at the Federal Reserve Board.

In [None]:
# Define a function to parse dates in the format "%d.%m.%Y"
def date_parser(date):
    return pd.to_datetime(date, format="%d.%m.%Y")

# Load the 'Geopolitical Risk.csv' file into a DataFrame, parsing 'month' as dates using the defined date_parser function
GEO = pd.read_csv("Geopolitical Risk.csv", sep=';', parse_dates=['month'], date_parser=date_parser)

# Convert the 'month' column to datetime format
GEO['month'] = pd.to_datetime(GEO['month'])

# Set the 'month' column as the index of the DataFrame
GEO = GEO.set_index('month')

# Rename the index to 'Date'
GEO.index.name = 'Date'

# Select only the fourth column of the DataFrame (i.e., 'GPRH')
GEO = GEO.iloc[:, 3:4]

# Rename the 'GPRH' column to 'Geopolitical Risk'
GEO = GEO.rename(columns={'GPRH': 'Geopolitical Risk'})

# Merge the 'GEO' DataFrame into the 'combined_df' DataFrame, aligning on their indexes
# The 'outer' method ensures that all data is retained, even if a match isn't found in the other DataFrame
combined_df = combined_df.merge(GEO, left_index=True, right_index=True, how='outer')

# Display the merged DataFrame
display(combined_df)

Some of the source files encode missing observations as a single dot (`.`), so we replace these placeholders with NaN.
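
As an alternative sketch, such a placeholder can also be mapped to NaN while a file is being read, using the `na_values` argument of `pd.read_csv`. The inline CSV text and its values below are made up for illustration and only mimic the layout of the files used above.

```python
import io

import pandas as pd

# Hypothetical file content: the second row uses "." as a missing-value marker
raw = "DATE,DGS10\n2020-01-01,.\n2020-01-02,1.88\n2020-01-03,1.80\n"

toy = pd.read_csv(
    io.StringIO(raw),
    na_values=".",          # treat a lone dot as missing
    parse_dates=["DATE"],
    index_col="DATE",
)
print(toy)
print(toy.dtypes)
```

Because the dot is recognized as missing at load time, the column is parsed as a float directly and no later string replacement or numeric conversion is needed for it.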

In [None]:
# Replace any instances of '.' in the 'combined_df' DataFrame with NaN (representing missing data)
combined_df = combined_df.replace('.', np.nan)

# Display the updated DataFrame
display(combined_df)

Because our features are sampled at different frequencies (some are available weekly, others monthly), we need to decide how to merge them and how to handle missing values.

A forward fill is the natural first step, since the last published value of a feature is the one that is actually available at any given point in time.
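
To make the effect of the outer merge followed by a forward fill concrete, here is a small sketch with one monthly and one weekly series. The column names `ret` and `stress`, the dates, and the values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly and weekly observations
monthly = pd.DataFrame(
    {"ret": [0.01, -0.02]},
    index=pd.to_datetime(["2020-01-31", "2020-02-29"]),
)
weekly = pd.DataFrame(
    {"stress": [0.5, 0.7, 0.6]},
    index=pd.to_datetime(["2020-01-24", "2020-02-07", "2020-02-28"]),
)

# The outer merge keeps every date from both frames and leaves gaps as NaN
merged = monthly.merge(weekly, left_index=True, right_index=True, how="outer").sort_index()

# The forward fill carries the last known value into each gap, never a future one
filled = merged.ffill()
print(filled)
```

On 2020-02-29 the weekly series has no observation of its own, so the value published on 2020-02-28 is used. The leading NaN in `ret` stays missing because there is nothing earlier to carry forward.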

In [None]:
# Iterate over each column in the 'combined_df' DataFrame
for col in combined_df.columns:
    # Convert the column's data to numeric values, replacing any errors (like non-numeric strings) with NaN
    combined_df[col] = pd.to_numeric(combined_df[col], errors='coerce')

# Fill any NaN values in the DataFrame with the preceding value in each column (forward fill)
filled_df = combined_df.ffill()

# Display the updated DataFrame
display(filled_df)

Some of the features go back much further than return data is available for both Value and Momentum. Therefore, we need to remove these periods.

In [None]:
# Set pandas option to display all columns of the DataFrame when it is displayed
pd.set_option('display.max_columns', None)

# Convert the start and end date strings to datetime objects
start_date = pd.to_datetime("1926-07-01")
end_date = pd.to_datetime("2023-02-28")

# Filter the 'filled_df' DataFrame to only include rows where the index (date) is between the start_date and end_date
filtered_df = filled_df[(filled_df.index >= start_date) & (filled_df.index <= end_date)]

# Display the filtered DataFrame
display(filtered_df)

Let us look at the **Beta** variable to see if everything looks reasonable so far.

In [None]:
import matplotlib.pyplot as plt

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

# Plot the data
plt.plot(filtered_df["Beta"], color='black')

# Add a horizontal line at zero
plt.axhline(0, color='red', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("return")

# Set plot title
plt.title("Beta")

# Show the plot
plt.show()

### Merging the Data

Because some features are available on a weekly basis, the merged data now contain many more rows than there are monthly return observations.

Let us remind ourselves of our goal: at the end of a month, we want to predict whether Value or Momentum will perform better in the upcoming month.

Therefore, we downsample our data to a monthly frequency.
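
The idea of keeping the last available observation of each month can be sketched on toy data as follows. This sketch groups by calendar month via `to_period`, which is one of several ways to express the operation. The column name `stress` and the values are hypothetical.

```python
import pandas as pd

# Hypothetical forward-filled observations at irregular dates
idx = pd.to_datetime(
    ["2020-01-10", "2020-01-24", "2020-01-31", "2020-02-14", "2020-02-28"]
)
toy = pd.DataFrame({"stress": [0.40, 0.50, 0.55, 0.70, 0.60]}, index=idx)

# Keep the last observation within each calendar month
month_end = toy.groupby(toy.index.to_period("M")).last()
print(month_end)
```

Unlike `resample`, this grouping labels the rows with monthly periods rather than month-end timestamps, but the selected values are the same: the information set an investor has at the end of each month.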

In [None]:
# Resample the 'filtered_df' DataFrame to a monthly frequency, using the last observation of each month
df_monthly = filtered_df.resample('M').last()

# Display the resampled DataFrame
display(df_monthly)

As a quick plausibility check, let us plot our target variable, the Regime.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

# Plot the target variable, selected by name rather than by column position
plt.plot(df_monthly['Regime'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Regime")

# Show the plot
plt.show()

In a last step, we need to shift the regime variable back in time by one month, since we face a prediction problem: the features at the end of one month are paired with the regime of the following month.
However, we also keep the current regime as a feature to capture possible persistence.

In [None]:
# Create a copy of the original DataFrame
df_modified = df_monthly.copy()

df_modified["Current Regime"] = df_modified['Regime']

# Shift the "Regime" column up by one row
df_modified['Regime'] = df_modified['Regime'].shift(-1)

# Remove the last row
df_modified = df_modified.drop(df_modified.tail(1).index)

# Display the updated DataFrame
display(df_modified)

### Unbalanced Data

Now we come to the most critical part of this project: Dealing with the fact that not all features go back equally far into the past, which is visualized by the following bar plot.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'df_modified' DataFrame
msno.bar(df_modified, sort='descending', color = 'black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label (the bar heights show the share of non-missing values)
plt.ylabel("Non-Missing Values")

# Set plot title
plt.title("Data Completeness by Feature")

# Show the plot
plt.show()

We obviously face a trade-off between more potentially informative features and more historical data, which also provides very valuable information.

How do we solve this?

In statistics, there are two basic objectives in modeling: structural analysis and forecasting. The former does not allow variables to be discarded simply because of statistical redundancy, which is why, for example, L1 regularization (lasso) is not useful when estimating regression models that are used for structural analysis. In the bias-variance dilemma, structural analysis is more concerned about bias, i.e., underfitting.

Forecasting, on the other hand, has different priorities and is more concerned about variance, i.e., overfitting. Why a model makes good predictions and where it gets the relevant information from is of secondary importance. For this reason, L1 regularization, with its inherent feature selection that avoids overfitting, is very reasonable.
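
The feature-selection effect of L1 regularization can be illustrated with a small simulation. This is only a sketch on synthetic data: ten features are drawn at random, only the first two drive the label, and the regularization strength `C=0.02` is an arbitrary choice for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only features 0 and 1 are informative, the rest is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (logits + rng.logistic(size=500) > 0).astype(int)

# Same regularization strength, different penalty
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.02).fit(X, y)
l2_model = LogisticRegression(penalty="l2", solver="liblinear", C=0.02).fit(X, y)

# Count coefficients that are exactly zero under each penalty
n_zero_l1 = int(np.sum(l1_model.coef_ == 0))
n_zero_l2 = int(np.sum(l2_model.coef_ == 0))
print("Coefficients set exactly to zero with L1:", n_zero_l1)
print("Coefficients set exactly to zero with L2:", n_zero_l2)
```

The L1 penalty sets uninformative coefficients exactly to zero, whereas the L2 penalty only shrinks them. This is helpful for forecasting but problematic for structural analysis, where a variable should not be dropped merely because it is statistically redundant.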

What does that mean for our problem? If a feature with a short history is very similar to another feature with a longer history, then we can safely discard the shorter one to gain more historical data.

The coefficient of determination (R^2) is a simple way to assess statistical similarity.
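
For two variables, the R^2 of a simple linear regression with intercept equals the squared Pearson correlation, which is why pairwise R^2 values can be read off a correlation matrix. The following sketch checks this identity numerically on simulated data (the variables `x` and `y` are illustrative only).

```python
import numpy as np

# Simulated pair of related variables
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

# R^2 from an ordinary least squares fit of y on x (with intercept)
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)
r2_regression = 1 - residuals.var() / y.var()

# R^2 as the squared Pearson correlation
r2_correlation = np.corrcoef(x, y)[0, 1] ** 2

print(f"R^2 from regression:  {r2_regression:.6f}")
print(f"R^2 from correlation: {r2_correlation:.6f}")
```

Note that this measure only captures linear similarity and is symmetric in the two variables.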

In the next steps, we will successively go through our data, starting with the shortest available time series, and see if we can find similarities to longer time series.

#### 5/1Y Mortage Rates

This is our most limiting, i.e. shortest, time series. From an economic point of view, however, the 5/1Y Mortage Rates are expected to carry much of the same information as the 15Y Mortage Rates and the 30Y Mortage Rates.

Let us back this up with the data:

First, we need to extract the period in which all features are complete.

In [None]:
# Drop any rows that have missing values
df_complete = df_modified.dropna()

# Display the smaller DataFrame
display(df_complete)

Let us take a visual look at the data, where the similarity to the 15Y Mortage Rates is particularly striking:

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ['30Y Mortage Rates', '15Y Mortage Rates', '5/1Y Mortage Rates']

# Plot the columns on the current axes (pandas adds a legend with the column names)
df_complete[columns_to_plot].plot(ax=plt.gca())

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('interest rate')
plt.title('Mortgage Rates')

# Display the plot
plt.show()

Now, let us compute the pairwise coefficients of determination with all other variables.

We define a function for this, since we will need to do this multiple times in the rest of this notebook.

In [None]:
def plot_r_squared(dataframe, column_name):
    # Calculate the correlation matrix
    corr_matrix = dataframe.corr()

    # Select the correlations related to the specified column
    correlations = corr_matrix[column_name]

    # Compute R^2 from correlations
    r_squared = correlations.apply(lambda x: x ** 2)

    # Sort the R^2 values
    sorted_r_squared = r_squared.sort_values(ascending=True)

    # Plot the R^2 values
    plt.figure(figsize=(11, 11), dpi = 100)
    sorted_r_squared.drop(column_name).plot(kind='barh', color='black')  # Exclude the specified column as its R^2 with itself is 1
    plt.xlabel('R^2')
    plt.title(f'Pairwise R^2 with {column_name}')
    plt.show()

In [None]:
# Specify data
df = df_complete
column = '5/1Y Mortage Rates'

# Plot
plot_r_squared(df, column)

The R^2 with the 15Y Mortage Rates is indeed close to one, indicating that differences are mostly due to noise.

To test how plausible it is that both features carry the same information plus some noise, we can conduct a bootstrap-based hypothesis test:

**Null Hypothesis (H0):** The two time series ("5/1Y Mortgage Rates" and "15Y Mortgage Rates") are perfectly dependent, with their difference only attributable to the observed noise structure (given by the differences between the series in the data).

**Alternative Hypothesis (H1):** The two time series are not perfectly dependent, even after accounting for the noise structure. In other words: the observed R^2 value is significantly lower than what we would expect under perfect dependency with the given noise structure.

We choose a significance level α of 5%, a conventional choice for trading off the probabilities of a Type I and a Type II error.

The bootstrapping test defined below takes two time series, standardizes them to ensure invariance to scale and correlation direction, extracts the difference time series, and then resamples it with a circular block bootstrap (1,000,000 replications in the call below).
Resampling in blocks ensures that any autodependencies are preserved in this time series context.
The optimal block length is chosen according to [Patton et al. (2009)](https://www.tandfonline.com/doi/abs/10.1080/07474930802459016).
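
To illustrate what a circular block bootstrap does, here is a hand-rolled NumPy sketch. It is not the implementation from the `arch` package used below, and the function name `circular_block_resample` is hypothetical. Blocks of consecutive observations are drawn with random start points, and a block that runs past the end of the series wraps around to its beginning.

```python
import numpy as np


def circular_block_resample(x, block_length, rng):
    """Draw one circular block bootstrap resample with the same length as x."""
    x = np.asarray(x)
    n = len(x)
    n_blocks = int(np.ceil(n / block_length))
    # Random start position for every block
    starts = rng.integers(0, n, size=n_blocks)
    # Each block covers block_length consecutive positions, wrapping past the end
    idx = (starts[:, None] + np.arange(block_length)[None, :]) % n
    # Concatenate the blocks and trim to the original length
    return x[idx.ravel()[:n]]


rng = np.random.default_rng(0)
series = np.arange(10)
sample = circular_block_resample(series, block_length=3, rng=rng)
print(sample)
```

Within each block the original ordering is kept, so short-range dependence survives the resampling, while the random block starts break up the long-range alignment with the original series.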

Then, these resampled difference series are added to the first time series, and the coefficient of determination between each resulting series and the first time series itself is computed.
This generates a null distribution under the assumption that the true coefficient of determination is 1 (H0) and the differences are just due to noise.

By doing so, we can compute a p-value for testing H0 against H1.
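
In symbols, and using notation introduced here only for illustration, let $x$ be the standardized first series, $y$ the standardized second series, $d = y - x$ their difference, and $d^{*}_{b}$ the $b$-th bootstrap resample of $d$ out of $B$ replications. The one-sided p-value described above can then be written as

$$
\hat{p} \;=\; \frac{1}{B} \sum_{b=1}^{B} \mathbf{1}\!\left\{ R^2\!\left(x,\; x + d^{*}_{b}\right) \le R^2_{\text{obs}} \right\},
\qquad R^2_{\text{obs}} = R^2(x, y).
$$

H0 is rejected at level $\alpha$ if $\hat{p} < \alpha$, i.e. if the observed $R^2$ lies in the lower tail of the null distribution.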

In [None]:
def standardize_series(s):
    scaler = StandardScaler()
    s_values_scaled = scaler.fit_transform(s.values.reshape(-1, 1))
    return pd.Series(s_values_scaled.flatten(), index=s.index)

def r_squared_bootstrap_test(ts1, ts2, n_permutations):
    
    # Standardize the time series
    ts1 = standardize_series(ts1)
    ts2 = standardize_series(ts2)
    
    if np.corrcoef(ts1, ts2)[0, 1] < 0:
        ts2 = ts2 * -1

    # Calculate observed R^2
    observed_r_squared = np.square(np.corrcoef(ts1, ts2)[0, 1])
    
    # Calculate observed R^2
    observed_r_squared = np.square(np.corrcoef(ts1, ts2)[0, 1])

    # Calculate differences between ts2 and ts1
    differences = ts2 - ts1
    differences_values = differences.values

    # Calculate optimal block length
    opt_block_length_df = optimal_block_length(differences_values)
    opt_block_length = int(opt_block_length_df['circular'])  # Use the optimal block length
    
    # Check if the optimal block length is zero, if so, set it to 1
    if opt_block_length == 0:
        opt_block_length = 1

    # Initialize circular block bootstrap with optimal block length
    bs = CircularBlockBootstrap(opt_block_length, differences_values)

    # Compute R^2 for bootstrapped series and generate null distribution
    null_r_squared = []
    for bs_diffs in bs.bootstrap(n_permutations):
        # Generate new ts2 by adding bootstrapped differences to ts1
        # bs_diffs is a (positional data, keyword data) tuple; trim the resampled array to the length of ts1
        bs_diffs_trimmed = bs_diffs[0][0][:len(ts1)]
        ts2_new = ts1 + pd.Series(bs_diffs_trimmed, index=ts1.index)
        r_squared = np.square(np.corrcoef(ts1, ts2_new)[0, 1])
        null_r_squared.append(r_squared)  # Store R^2 for each bootstrapped series

    # Compute p-value: proportion of null R^2 values less than or equal to observed R^2
    p_value = np.mean(np.array(null_r_squared) <= observed_r_squared)

    # Output
    print(f"p-value: {p_value*100:.2f}%")  # Print the calculated p-value as a percentage with two decimal points
    
    # Set figure size and dpi
    plt.figure(figsize=(11, 6), dpi=100)

    # Plot null distribution
    plt.hist(null_r_squared, bins=100, color='black')  # Plot the histogram of null R^2 values
    plt.axvline(observed_r_squared, color='red', linestyle='dashed', linewidth=2, label=f'Observed R^2: {observed_r_squared:.2f}')
    plt.legend()
    plt.xlabel('R^2')
    plt.ylabel('Frequency')
    plt.title('Null distribution of R^2')
    plt.show()

In [None]:
# Define time series
ts1 = df_complete["5/1Y Mortage Rates"]  # Time series 1
ts2 = df_complete["15Y Mortage Rates"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

We can see that, accounting for the difference structure of the two time series, we cannot reject H0.

Therefore, we decide to drop 5/1Y Mortage Rates.

In [None]:
# Drop the column from the DataFrame df_modified
lean_sample_1 = df_modified.drop('5/1Y Mortage Rates', axis=1)

# Display the DataFrame
display(lean_sample_1)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_1' DataFrame
msno.bar(lean_sample_1, sort='descending', color = 'black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label
plt.ylabel("Missing Values")

# Set plot title
plt.title("Missing Data by Feature")

# Show the plot
plt.show()

#### Federal Reserve Assets

The above steps are now repeated for the variable Federal Reserve Assets.

It should be noted that this sequential approach suffers from the same problems as common greedy feature selection approaches such as stepwise backward feature selection.

In [None]:
# Drop any rows that have missing values
df_complete_1 = lean_sample_1.dropna()

# Display the smaller DataFrame
display(df_complete_1)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

# Plot the data with black color
plt.plot(df_complete_1["Federal Reserve Assets"], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("USD")

# Set plot title
plt.title("Federal Reserve Assets")

# Show the plot
plt.show()

The level of Federal Reserve Assets, i.e., the total assets held by the central bank, is of limited use on its own, since absolute values lack context.

By transforming the Federal Reserve Assets into percentage changes, we shift the focus to the relative change over time. This provides an economically more meaningful and insightful feature for our purposes, since large changes in Assets indicate relevant economic events.

In [None]:
# Compute the percentage changes of the "Federal Reserve Assets" variable
percentage_changes = df_complete_1["Federal Reserve Assets"].pct_change()

# Create a new column for percentage changes in the DataFrame
df_complete_1["Federal Reserve Assets Percentage Changes"] = percentage_changes

# Drop rows with missing values
df_complete_1.dropna(inplace=True)

# Drop the column 'Federal Reserve Assets'
df_complete_1 = df_complete_1.drop('Federal Reserve Assets', axis=1)

# Verify the updated DataFrame
df_complete_1

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

# Plot the data with black color
plt.plot(df_complete_1["Federal Reserve Assets Percentage Changes"], color='black')

# Add a horizontal line at zero
plt.axhline(0, color='red', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("change")

# Set plot title
plt.title("Federal Reserve Assets Percentage Changes")

# Show the plot
plt.show()

In [None]:
# Specify data
df = df_complete_1
column = 'Federal Reserve Assets Percentage Changes'

# Plot
plot_r_squared(df, column)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["Federal Reserve Assets Percentage Changes", 'St. Louis Fed Financial Stress']

# Plotting the columns
plt.plot(df_complete_1[columns_to_plot])

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('')
plt.title('Financial Stress Indicators')

# Display the plot
plt.show()

In [None]:
# Define time series
ts1 = df_complete_1["Federal Reserve Assets Percentage Changes"]  # Time series 1
ts2 = df_complete_1["St. Louis Fed Financial Stress"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

**Federal Reserve Assets Percentage Changes** is currently the shortest time series for which we cannot reject H0.

This means that we would have to discard all the data going back to 1930. This would be absurd, but to compare how this trade-off between more data and more features behaves, we use this first short data set to train our models.

## First Modelling

First, the dataset must be divided into a training dataset and a test dataset. The former is used to tune the hyperparameters, perform feature selection, and, of course, train the model, while the latter is used to test it.

We choose the usual 70/30 split because the training process requires as much data as possible, but the test dataset should not be too short to obtain economically meaningful test results. It also ensures consistency with the longer data sets we will be working with later.

In [None]:
# Sort the DataFrame by the time-related column
df_sorted = df_complete_1.sort_values('Date')

# Calculate the index to split the data
split_index = int(0.7 * len(df_sorted))

# Split the data into training and test sets
df_train_1 = df_sorted[:split_index]
df_test_1 = df_sorted[split_index:]

#Display
df_train_1

### Feature Space

The following hierarchically clustered R^2 matrix shows that many features are very similar.

In [None]:
# Drop the target variable
X = df_train_1.drop('Regime', axis=1)

# Initialize an empty matrix for R^2 values
r2_matrix = np.zeros((len(X.columns), len(X.columns)))

# Calculate R^2 values for each pair of variables
for i, col1 in enumerate(X.columns):
    for j, col2 in enumerate(X.columns):
        if i != j:
            # Fit a linear regression model
            lr = LinearRegression()
            lr.fit(X[col1].values.reshape(-1, 1), X[col2])

            # Calculate R^2 score
            r2 = lr.score(X[col1].values.reshape(-1, 1), X[col2])
            r2_matrix[i, j] = r2

# Perform hierarchical clustering
linkage_matrix = hierarchy.linkage(r2_matrix, method='complete')

# Obtain the order of rows and columns based on the dendrogram
order = hierarchy.dendrogram(linkage_matrix, no_plot=True)['leaves']

# Sort the R^2 matrix based on the order
sorted_r2_matrix = pd.DataFrame(r2_matrix[order, :][:, order], index=X.columns[order], columns=X.columns[order])

# Plot the sorted R^2 matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(sorted_r2_matrix, annot=False, cmap='coolwarm', cbar=True)
plt.title('Sorted R^2 Matrix with Clusters')
plt.show()

Accordingly, the feature space is too high dimensional and we can reduce it while rotating it so that the resulting new features are orthogonal to each other.

For example, when fitting trees, rotating the feature space in a direction that aligns with the axes reduces the number of levels needed by the tree. This speeds up computations and makes overfitting less likely.

In [None]:
# Create an instance of StandardScaler and perform scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create an instance of PCA and perform PCA transformation
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Create a new DataFrame with the date index and components
df_pca = pd.DataFrame(X_pca, index=X.index)

# Rename the columns to indicate the component number
df_pca.columns = [f"PC {i+1}" for i in range(X_pca.shape[1])]

# Print the new DataFrame
df_pca

We see that we can reduce the feature space from 42 to 20 dimensions while preserving at least 99% of its variance.

In [None]:
# Plot the cumulative explained variance ratio
explained_variance_ratio_cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(11, 6), dpi=100)
plt.plot(range(1, len(explained_variance_ratio_cumulative) + 1), explained_variance_ratio_cumulative, color = 'black')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance')
plt.grid(True)

# Determine the number of components explaining at least 99% of variance
n_components_99 = np.argmax(explained_variance_ratio_cumulative >= 0.99) + 1
print("Number of components explaining at least 99% of variance:", n_components_99)

# Add a vertical line at the number of components where 99% is reached
plt.axvline(x=n_components_99, color='red', linestyle='--')

plt.show()

Let us drop the remaining components:

In [None]:
# Keep only the principal components explaining at least 99% of variance
df_pca_99 = df_pca.iloc[:, :n_components_99]

#Display
df_pca_99

The big disadvantage of this transformation is that interpretability is lost. But as mentioned above, our goal is not structural analysis but prediction, so statistical arguments have to take precedence.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("PCA Transformed Features")

# Show the plot
plt.show()

By construction, the variance of the principal components decreases from one component to the next. However, since many models are sensitive to differences in feature scale, we standardize them again.

In [None]:
# Initialize the StandardScaler
pca_scaler = StandardScaler()

# Fit the scaler to the data and transform the data
df_pca_99_scaled = pca_scaler.fit_transform(df_pca_99)

# If you want to convert the scaled data back to a DataFrame:
df_pca_99_scaled = pd.DataFrame(df_pca_99_scaled, index=df_pca_99.index, columns=df_pca_99.columns)

df_pca_99_scaled

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99_scaled)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Features")

# Show the plot
plt.show()

### Modeling

#### Logistic Regression (baseline)

Logistic regression is one of the most fundamental models, especially for binary classification problems, such as the one in this case. The basic principle is to use a sigmoid function to model the probability of occurrence of a particular class y, given X. The sigmoid function takes real numbers as input and outputs values in the range [0, 1], allowing the output to be interpreted as a probability.
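As a small illustration of this principle, the sketch below computes the class probability for a single observation in plain Python. The coefficients, the intercept and the observation are hypothetical values chosen for the example, not estimates from our data.

```python
import math

def sigmoid(z):
    """Map a real-valued score z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, weights, intercept):
    """Probability of class 1 for one observation under a logistic model."""
    score = intercept + sum(w * xi for w, xi in zip(weights, x))
    return sigmoid(score)

# Hypothetical coefficients and a single observation with two features
p_momentum = predict_proba(x=[0.5, -1.0], weights=[0.8, 0.3], intercept=0.1)
p_value_regime = 1.0 - p_momentum
print(p_momentum, p_value_regime)
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides which regime is predicted.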

In [None]:
log_reg = LogisticRegression(penalty = 'elasticnet',
                             class_weight = 'balanced',
                             solver = 'saga',
                             l1_ratio=0.5,
                             max_iter =100000,
                             n_jobs=-1)

This model has some hyperparameters that need to be specified or tuned in an informed way:

- "**penalty**" defines the type of regularization (L1, L2, elastic net, or none). We set it to elasticnet, since it combines the advantages of both L1 and L2.
- "**dual**" defines the formulation of the problem (primal or dual). We leave it at False, since the dual formulation is only implemented for the L2 penalty with the liblinear solver, which we don't use.
- "**tol**" is the tolerance for the stopping criterion, which we leave at 1e-4.
- "**C**" is the inverse of the regularization strength. This is one of the most important hyperparameters and needs to be tuned.
- "**fit_intercept**" decides whether to include an intercept in the model. We leave it at True.
- "**intercept_scaling**" is an artificial parameter to scale the intercept. It is only used by the liblinear solver, so we leave it at 1.
- "**class_weight**" is the weight of the classes. We set it to balanced, as this helps to avoid possible misclassification of minority classes.
- "**solver**" is the optimization algorithm. We set it to saga, since this is the only one that can handle elastic net based objective functions.
- "**max_iter**" is the maximum number of iterations for the solver. We set it to 100,000 to ensure convergence.
- "**multi_class**" is the strategy for multiple classes. We leave it at auto.
- "**verbose**" controls the level of detail in the output. We leave it at 0 to keep the output clean.
- "**warm_start**" decides whether to reuse the previous solution for the next fit. We leave it at False.
- "**n_jobs**" determines the number of CPUs used for training. We set it to -1 so that all are used to speed up computations.
- "**l1_ratio**" is the share of L1 regularization in the elastic net penalty. This hyperparameter must also be tuned.
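To illustrate what the balanced setting of class_weight does, the sketch below reproduces the documented heuristic, n_samples / (n_classes * class_count), in plain Python. The label vector is a hypothetical toy example and the helper name `balanced_class_weights` is our own.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical regime labels: 6 months of class 0 and 4 months of class 1
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer class receives the larger weight, so errors on it count more in the objective function.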

In [None]:
from skopt.space import Real

# Define the hyperparameter grid you want to search over
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'l1_ratio': Real(0, 1)
}
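The log-uniform prior for C means that every order of magnitude between 0.01 and 100 is equally likely to be explored. The following plain-Python sketch shows the idea; it is an assumption about how such a prior can be sampled, not the code used inside scikit-optimize.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniformly distributed between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(42)
samples = [sample_log_uniform(0.01, 100, rng) for _ in range(10_000)]

# Roughly half of the draws fall below 1, the geometric midpoint of the interval
share_below_one = sum(s < 1 for s in samples) / len(samples)
print(share_below_one)
```

With a plain uniform prior, by contrast, about 99% of the draws would lie above 1, and strong regularization would hardly be explored.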

To tune these hyperparameters, we use nested time series cross-validation and Bayesian optimization.
This is useful for several reasons:

- First, nested cross-validation combines both feature selection and hyperparameter tuning, which is important because they are interdependent. Furthermore, feature selection must also be cross-validated, otherwise we will overfit to the chosen predictors.
- Second, time series cross-validation avoids information leakage from the future because it resembles live prediction situations.
- Third, Bayesian optimization searches the hyperparameter space in an informed way compared to exhaustive grid search or randomized search. As a result, it speeds up computation and increases the likelihood of finding a globally optimal solution.
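The way time series cross-validation avoids leakage can be sketched with an expanding training window. The helper below is a simplified illustration in plain Python that loosely mirrors the default behaviour of TimeSeriesSplit; the function name and the toy sizes are assumptions.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, validation_indices) with the training window growing over time."""
    fold_size = n_samples // (n_splits + 1)
    first_validation_start = n_samples - n_splits * fold_size
    for start in range(first_validation_start, n_samples, fold_size):
        train = list(range(0, start))  # everything before the validation fold
        validation = list(range(start, start + fold_size))
        yield train, validation

splits = list(expanding_window_splits(n_samples=12, n_splits=3))
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

In every fold, all training observations precede all validation observations, which is exactly the situation we face in live prediction.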

For feature selection, we use Recursive Feature Elimination, which is one of the multivariate wrapper methods.
Wrapper methods refer to a family of supervised feature selection methods that use a specific model to evaluate different subsets of features to finally select the best one.
A major advantage of wrapper methods is the fact that they tend to provide the best performing feature set for the particular type of model chosen.

In [None]:
# Initialize the time series cross-validator for the external and internal loops
tscv_outer = TimeSeriesSplit(n_splits=10)
tscv_inner = TimeSeriesSplit(n_splits=10)

But how are the subsets of hyperparameters and features evaluated in the validation fold?

For classification problems, accuracy is the most common choice, but in finance it is not a particularly good one. The reason is that accuracy scores a classifier in terms of its proportion of correct predictions. This has the disadvantage that it does not take into account class probabilities, which is a problem since we want to use them as portfolio weights in our tactical asset allocation model.

A good alternative is log-loss (also known as cross-entropy loss). It evaluates a classifier in terms of the negative average log-likelihood of the true labels, so lower values are better. This intuitively makes the most sense, since it matters how large our bets are when we are wrong or right.
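The difference between the two metrics can be seen in a small worked example. The sketch below uses made-up labels and probabilities for two hypothetical classifiers that have the same accuracy but differ in how confident their probabilities are.

```python
import math

def accuracy(y_true, p_class1, threshold=0.5):
    """Share of correct predictions after thresholding the probabilities."""
    predictions = [1 if p >= threshold else 0 for p in p_class1]
    return sum(int(pred == y) for pred, y in zip(predictions, y_true)) / len(y_true)

def log_loss_binary(y_true, p_class1):
    """Negative average log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, p_class1):
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return -total / len(y_true)

y_true = [1, 0, 1, 1]
p_confident = [0.9, 0.1, 0.8, 0.4]  # hypothetical classifier A
p_timid = [0.6, 0.4, 0.6, 0.4]      # hypothetical classifier B

acc_confident, acc_timid = accuracy(y_true, p_confident), accuracy(y_true, p_timid)
ll_confident, ll_timid = log_loss_binary(y_true, p_confident), log_loss_binary(y_true, p_timid)
print(acc_confident, acc_timid)
print(ll_confident, ll_timid)
```

Accuracy cannot tell the two classifiers apart, whereas log loss rewards the one whose probabilities are more informative when it is right.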

In [None]:
# Scoring metric
scorer = make_scorer(log_loss, greater_is_better=False, needs_proba=True, labels=[0.0, 1.0])

Now, we can tune the model:

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_1['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(log_reg, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=log_reg,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_log_reg = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_log_reg)

(Your results may vary due to the stochastic nature of the process.)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_log_reg])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

The model uses only L1 regularization (due to l1_ratio = 1), but the regularization strength is relatively low (C is close to 2).
In addition, it has discarded all but one feature, which is in line with 100% L1 regularization.

Overall, it is clear that the model has been greatly simplified to avoid overfitting. This is to be expected, as the signal/noise ratio in finance is usually very low, due to the fact that strong relationships are arbitraged away.

Now, we can specify its hyperparameters and fit it on the whole training data set.

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
log_reg_final = LogisticRegression(penalty='elasticnet',
                                   class_weight='balanced',
                                   solver='saga',
                                   max_iter=100000,
                                   n_jobs=-1,
                                   **best_hyperparameters)
# Fit the model
log_reg_final.fit(X[best_features_log_reg], y)

Let us look at the feature coefficients and the intercept that have been estimated by the fitting process:

In [None]:
# Feature Coefficients
print(log_reg_final.coef_)

# Intercept
print(log_reg_final.intercept_)

Interestingly, the coefficient is quite low and the intercept is slightly positive, indicating a slight bias toward Momentum.

Let us look at how both regimes are distributed in the data.

In [None]:
proportion = df_train_1['Regime'].value_counts(normalize=True)
print(proportion)

Now we can use the model to predict the regimes:

In [None]:
# Predict classes
y_pred = log_reg_final.predict(X[best_features_log_reg])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Logistic Regression (Training)")

# Show the plot
plt.show()

Next, we visualize the predicted probabilities that we use as weights for our tactical asset allocation implications.
The probabilities are always close to 50%, indicating that the signal picked up by the model is quite weak.

In [None]:
# Predict probabilities
y_proba = log_reg_final.predict_proba(X[best_features_log_reg])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Logistic Regression (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

For the sake of completeness, the confusion matrix, accuracy and log loss are shown below.

The values are actually pretty bad, worse than chance.

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

Now we implement the tactical allocation model based on the predicted probabilities.

This means that each month we invest the probability of Class 1 in Momentum and the rest in Value, assuming monthly rebalancing with no transaction costs.

As expected, there is little difference in performance.
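The allocation rule itself can be illustrated with a toy example in plain Python. The probabilities and returns below are invented numbers; the point is only that the weights for month t come from the probability predicted in month t-1.

```python
def tactical_returns(prob_momentum, value_returns, momentum_returns):
    """Portfolio returns when month t is weighted with the probability predicted in month t-1."""
    returns = []
    for t in range(1, len(prob_momentum)):
        w_momentum = prob_momentum[t - 1]  # weight known before month t starts
        w_value = 1.0 - w_momentum
        returns.append(w_value * value_returns[t] + w_momentum * momentum_returns[t])
    return returns

# Hypothetical inputs for three months
prob_momentum = [0.6, 0.4, 0.7]
value_returns = [0.01, -0.02, 0.03]
momentum_returns = [0.02, 0.01, -0.01]

portfolio_returns = tactical_returns(prob_momentum, value_returns, momentum_returns)
print(portfolio_returns)
```

The first month drops out because no lagged probability is available for it, which is why the notebook code removes the rows with missing values after shifting.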

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_1.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract monthly returns
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Logistic Regression Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Logistic Regression Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

The following somewhat sparse histogram confirms what we see in the cumulative return plot and the corresponding Sharpe ratios printed above.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the densities of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Logistic Regression Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Logistic Regression Model (Training)')
plt.show()

It is not necessary at this point, but later we need to formally test whether the Sharpe ratio of our model is significantly higher than that of the simple 50/50 benchmark. Again, this can be done with a right-sided bootstrap test with the following pair of hypotheses:

**Null Hypothesis (H0)**: The model has a Sharpe Ratio equal to or lower than the benchmark.

**Alternative Hypothesis (H1)**: The model has a higher Sharpe Ratio than the benchmark.

In this test, data points are drawn with replacement from the return sample of the benchmark to create a new sample. The Sharpe ratio of this new sample is then computed and stored.
This process is repeated a total of 1,000,000 times to obtain a null distribution of the benchmark's Sharpe ratios, which can then be used to perform a right-sided hypothesis test comparing the model's Sharpe ratio to this null distribution.

In [None]:
def sharpe_ratio_bootstrap_test(sample1, sample2, n_permutations=1000000):
    
    # Bootstrap sampling from sample1 and compute Sharpe ratios
    sharpe_ratios = []
    n = len(sample1)
    for _ in range(n_permutations):
        bootstrap_sample = np.random.choice(sample1, size=n, replace=True)
        mean_return = np.mean(bootstrap_sample)
        std_return = np.std(bootstrap_sample)
        sharpe_ratio = mean_return / std_return
        sharpe_ratios.append(sharpe_ratio)

    # Compute the observed Sharpe ratio for sample2
    observed_mean_return = np.mean(sample2)
    observed_std_return = np.std(sample2)
    observed_sharpe_ratio = observed_mean_return / observed_std_return

    # Calculate p-value (right-sided): proportion of null Sharpe ratios at least as large as the observed Sharpe ratio
    p_value = (np.array(sharpe_ratios) >= observed_sharpe_ratio).mean()
    
    # Output
    print(f"p-value: {p_value*100:.2f}%")  # Print the calculated p-value as a percentage with two decimal points

    # Plot the null distribution of Sharpe ratios
    plt.figure(figsize=(11, 6), dpi=100)
    plt.hist(sharpe_ratios, bins=100, color='black')
    plt.axvline(observed_sharpe_ratio, color='red', linestyle='dashed', linewidth=2,
                label=f'Model Sharpe Ratio: {observed_sharpe_ratio:.2f}')
    plt.legend()
    plt.xlabel('Sharpe Ratio')
    plt.ylabel('frequency')
    plt.title('Null Distribution of Benchmark Sharpe Ratios')
    plt.show()

To demonstrate the test, we can run it for the logistic regression model.

We see that we cannot reject H0: the small difference between the two Sharpe ratios is consistent with chance.

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Support Vector Machine

Support Vector Machines (SVMs) are a family of supervised learning algorithms used for classification and regression problems. In the context of binary classification, SVMs attempt to find an optimal hyperplane that separates the two classes.
Specifically, we use the **C-Support Vector Classification (C-SVC)** variant, whose regularization parameter C helps to avoid overfitting, similar to the hyperparameter 'C' in logistic regression.
Technically, C controls the trade-off between achieving the largest possible margin and minimizing classification errors.
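This trade-off can be made explicit with the soft-margin objective of a linear SVM, 0.5 * ||w||^2 + C * (sum of slacks). The sketch below evaluates it in plain Python for a hypothetical weight vector and three toy points with labels in {-1, +1}. It covers only the linear primal form and is not the solver behind SVC.

```python
def svc_primal_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * sum of slack variables, for labels in {-1, +1}."""
    margin_term = 0.5 * sum(wi * wi for wi in w)
    slack_sum = 0.0
    for xi, yi in zip(X, y):
        decision = sum(wj * xj for wj, xj in zip(w, xi)) + b
        slack_sum += max(0.0, 1.0 - yi * decision)  # hinge loss = slack of this point
    return margin_term + C * slack_sum

# Hypothetical toy data: the third point lies inside the margin
X = [(2.0, 0.0), (-2.0, 0.0), (0.5, 0.0)]
y = [1, -1, 1]
w, b = (1.0, 0.0), 0.0

objective_small_C = svc_primal_objective(w, b, X, y, C=1.0)
objective_large_C = svc_primal_objective(w, b, X, y, C=100.0)
print(objective_small_C, objective_large_C)
```

With a large C, the single margin violation dominates the objective, so the optimizer would rather shrink the margin than tolerate it.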

In [None]:
# Initialize the Support Vector Classifier
svm = SVC(shrinking=True,
          probability=True,
          cache_size=1000,
          class_weight='balanced',
          decision_function_shape ='ovo')

This model has some hyperparameters that need to be specified or tuned in an informed way:

- "**C**" is the regularization parameter; the strength of the regularization is inversely proportional to C. We need to tune this.
- "**kernel**" specifies the type of kernel used (linear, poly, rbf, sigmoid) for the so-called "kernel trick". We need to tune this as well.
- "**degree**" is the degree of the polynomial kernel ('poly'). We need to tune this too, in case poly is chosen as the kernel.
- "**gamma**" is the kernel coefficient for 'rbf', 'poly' and 'sigmoid'. This must also be tuned if one of these kernels is selected.
- "**coef0**" is the independent term in the kernel functions of 'poly' and 'sigmoid'. This should also be tuned.
- "**shrinking**" determines whether heuristic shrinking is used or not. We leave it at True as it can speed up the computation significantly.
- "**probability**" controls whether probabilities are calculated. Since we need them for our scoring function and portfolio weights, we set it to True.
- "**tol**" is the tolerance for the stopping criterion. We leave it at 1e-3.
- "**cache_size**" is the size of the kernel cache (in MB). We set it significantly higher than the default to speed up computations.
- "**class_weight**" is the weight of the classes. We set it to 'balanced' in case the class distribution is unbalanced.
- "**verbose**" controls the level of detail in the output. We leave it at False.
- "**max_iter**" is the maximum number of iterations. We leave it at -1 (no limit) so that the solver can run until convergence.
- "**decision_function_shape**" specifies the shape of the decision function ('ovr' or 'ovo'). Since we are dealing with a binary classification problem, the specification does not matter, but we set it to 'ovo'.
- "**break_ties**" decides whether to break ties when there are multiple classes. We leave it at False since it doesn't really matter for a binary classification problem.

In [None]:
from skopt.space import Categorical, Integer, Real

# Define the hyperparameter search space
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'kernel': Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
    'degree': Integer(1, 10),
    'gamma': Categorical(['scale', 'auto']),
    'coef0': Real(0, 1)
}

As with logistic regression, we use nested time series cross-validation with log loss as the scoring metric.

However, Recursive Feature Elimination (RFE) does not work with SVCs in general, because non-linear kernels prevent the model from providing feature importances (coefficients are only available for the linear kernel). For this reason, we use Sequential Backward Feature Selection, which is similar to RFE but model agnostic.
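The idea behind sequential backward selection can be sketched in a few lines of plain Python. The scoring function here is a made-up stand-in for the cross-validated score of the model, and the function and feature names are hypothetical.

```python
def sequential_backward_selection(features, score_fn, n_features_to_select):
    """Repeatedly drop the feature whose removal leaves the best-scoring subset."""
    selected = list(features)
    while len(selected) > n_features_to_select:
        # Score every candidate subset obtained by removing exactly one feature
        candidates = [[f for f in selected if f != removed] for removed in selected]
        selected = max(candidates, key=score_fn)  # keep the best-scoring subset
    return selected

# Hypothetical stand-in for a cross-validated score: each feature adds a fixed amount
toy_signal = {"PC 1": 2.0, "PC 2": 0.1, "PC 3": 1.0, "PC 4": 0.0}

def toy_score(subset):
    return sum(toy_signal[f] for f in subset)

kept = sequential_backward_selection(list(toy_signal), toy_score, n_features_to_select=2)
print(kept)
```

Since only the score of each subset is needed, the procedure works with any estimator, at the price of refitting the model once per candidate subset.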

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_1['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Sequential backward feature selection
    selector = SequentialFeatureSelector(svm,
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=svm,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_svm = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_svm)

(Your results may vary due to the stochastic nature of the process.)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_svm])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

Let us talk about the two most important hyperparameters:

A C of 100 is quite large, meaning that the model has very little regularization.
The sigmoid function is typically used as an activation function in neural networks, and with a 'sigmoid' kernel the SVM behaves like a two-layer perceptron.
Therefore, the combination of a high C and a sigmoid kernel allows for a high-complexity model that can fit the data very flexibly.
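
The link to neural networks becomes visible when the sigmoid kernel is written out: it applies a tanh to a scaled inner product, much like a hidden unit does. A minimal sketch on random data, where the values of `gamma_demo` and `coef0_demo` are arbitrary illustrative choices:

```python
import numpy as np
from sklearn.metrics.pairwise import sigmoid_kernel

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
B = rng.normal(size=(4, 3))
gamma_demo, coef0_demo = 0.5, 1.0

# Sigmoid kernel: K(a, b) = tanh(gamma * <a, b> + coef0)
K_manual = np.tanh(gamma_demo * A @ B.T + coef0_demo)
K_sklearn = sigmoid_kernel(A, B, gamma=gamma_demo, coef0=coef0_demo)
kernels_match = np.allclose(K_manual, K_sklearn)
print(kernels_match)
```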

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
svm_final = SVC(shrinking=True,
                probability=True,
                cache_size=1000,
                class_weight='balanced',
                decision_function_shape='ovo',
                **best_hyperparameters)
# Fit the model
svm_final.fit(X[best_features_svm], y)

Let us look at the predicted regimes.

In [None]:
# Predict classes
y_pred = svm_final.predict(X[best_features_svm])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Support Vector Machine (Training)")

# Show the plot
plt.show()

It may come as a surprise that Class 1, i.e. Momentum, is consistently given the higher probability in the plot below, although Class 0, i.e. Value, is predicted quite often in the plot above.
What is the reason for this?

The SVC model in scikit-learn predicts class probabilities using Platt scaling, which fits a logistic regression model to the scores of the SVM. This is done in a cross-validated manner and may therefore result in a different decision boundary than the raw decision function of the model.
So it's possible for these two methods to produce different results, especially for small data sets.
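
A minimal sketch of how this can be checked, assuming a small synthetic data set (all `*_demo` names are hypothetical). It compares the labels from `predict` with the labels implied by the raw decision function and by `predict_proba`:

```python
import numpy as np
from sklearn.svm import SVC

# Small, noisy synthetic binary problem (illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 2))
y_demo = (X_demo[:, 0] + rng.normal(scale=1.0, size=80) > 0).astype(int)

clf_demo = SVC(probability=True, random_state=0).fit(X_demo, y_demo)

# predict() follows the sign of the raw decision function ...
pred_from_scores = clf_demo.classes_[(clf_demo.decision_function(X_demo) > 0).astype(int)]
labels = clf_demo.predict(X_demo)

# ... whereas predict_proba() comes from the separately fitted Platt scaling
proba_demo = clf_demo.predict_proba(X_demo)
pred_from_proba = clf_demo.classes_[proba_demo.argmax(axis=1)]

print("disagreements:", int((labels != pred_from_proba).sum()))
```

The number of disagreements depends on the data and may well be zero; the point is only that nothing forces the two outputs to agree.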

Nevertheless, we can still use them.

In [None]:
# Predict probabilities
y_proba = svm_final.predict_proba(X[best_features_svm])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Support Vector Machine (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

The confusion matrix shows that the SVM performs even worse than the logistic regression, even though this is training data. It was not able to fit the data well.

The reason could be the mismatch between the scoring metric used in hyperparameter tuning and feature selection, i.e. log loss, and the loss function the model itself minimizes, i.e. the margin-based hinge loss.
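
The difference between the two losses can be made concrete with a few hand-picked scores. This is a minimal sketch with illustrative values, using labels coded as -1 and +1:

```python
import numpy as np

# Signed labels in {-1, +1} and raw decision scores f(x) (illustrative values)
y_signed = np.array([1, 1, -1, -1])
scores = np.array([2.0, 0.3, -1.5, 0.4])
margins = y_signed * scores

# Hinge loss (what the SVM minimizes): zero once the margin reaches 1
hinge = np.maximum(0.0, 1.0 - margins)

# Logistic loss (what log loss measures): positive for every finite margin
logistic = np.log1p(np.exp(-margins))

print(hinge)
print(logistic.round(3))
```

Observations that lie beyond the margin no longer influence the SVM, while log loss keeps rewarding more confident scores, so the two criteria can prefer different models.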

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

Now let us look at the trading performance:

During the critical period of the global financial crisis, the model relies on the wrong one of the two strategies and consequently loses relative performance that it cannot recover.

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_1.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

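# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly returns after dropping NaNs so that all statistics use the same rows
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']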

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Support Vector Machine Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Support Vector Machine Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

The distribution of returns is as follows:

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Support Vector Machine Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Support Vector Machine Model (Training)')
plt.show()

The model's Sharpe ratio is worse than the benchmark's, but not statistically significantly so when we reformulate the above pair of hypotheses as a left-sided test.
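
For intuition, a left-sided bootstrap test of this kind can be sketched as follows. This is not the `sharpe_ratio_bootstrap_test` function used in this notebook, which is defined earlier and may differ in its details. The helper name, the block size, and the number of resamples are hypothetical, and the circular block bootstrap is written in plain numpy.

```python
import numpy as np

def sharpe(returns):
    # Sharpe ratio per period, without a risk-free rate
    return returns.mean() / returns.std(ddof=1)

def left_sided_sharpe_test_sketch(benchmark, model, block_size=6, n_boot=2000, seed=0):
    """Hypothetical helper: p-value for H1 'model Sharpe < benchmark Sharpe'."""
    benchmark = np.asarray(benchmark, dtype=float)
    model = np.asarray(model, dtype=float)
    n = len(model)
    rng = np.random.default_rng(seed)
    observed = sharpe(model) - sharpe(benchmark)

    n_blocks = int(np.ceil(n / block_size))
    boot_diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Circular block bootstrap: blocks wrap around the end of the sample;
        # the same indices are used for both series to keep them paired
        starts = rng.integers(0, n, size=n_blocks)
        idx = ((starts[:, None] + np.arange(block_size)) % n).ravel()[:n]
        boot_diffs[b] = sharpe(model[idx]) - sharpe(benchmark[idx])

    # Center the bootstrap differences to mimic the null of equal Sharpe ratios
    centered = boot_diffs - observed
    return float(np.mean(centered <= observed))

# Illustrative usage with simulated monthly returns
rng_demo = np.random.default_rng(1)
bench_demo = rng_demo.normal(0.01, 0.05, size=120)
p_same = left_sided_sharpe_test_sketch(bench_demo, bench_demo)
p_worse = left_sided_sharpe_test_sketch(bench_demo, bench_demo - 0.05)
print(p_same, p_worse)
```

A large p-value means that the data do not support the claim that the model's Sharpe ratio is lower than the benchmark's.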

In [None]:
# Define the sample to test
sample1 = benchmark_monthly_returns  # Benchmark
sample2 = model_monthly_returns  # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Random Forest

The Random Forest algorithm is a bagging-based ensemble technique, which creates an ensemble of multiple decision tree models and uses the majority decision of these trees for prediction. Each individual tree in the Random Forest is trained on a random subset of the training data (called a bootstrap sample) and uses a random selection of features to find the best split at each node of the tree. This randomness leads to increased diversity among individual trees and helps avoid overfitting.
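
What a bootstrap sample looks like can be shown with a few lines of numpy. This is a minimal sketch in which the sample size is an arbitrary illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# A bootstrap sample draws n rows with replacement from the n available rows
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Roughly 1 - 1/e (about 63%) of the rows appear at least once; the rest are "out-of-bag"
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(round(unique_fraction, 3))
```

Each tree therefore sees a somewhat different data set, which is one of the two sources of diversity in the forest.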

But why didn't we use boosting, e.g. XGBoost, which is often considered the king, rather than bagging, as in Random Forests?
The reason is that boosting also works on reducing bias, which comes with a greater risk of overfitting, while bagging targets variance. In finance, where data has a very low signal-to-noise ratio because every strong relationship is arbitraged away, it is very easy to overfit a model to noise and random patterns. So bagging is more favorable in this field.

In [None]:
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_jobs=-1,
                            class_weight='balanced')

This model has some hyperparameters that need to be specified or tuned in an informed way:

- "**n_estimators**" is the number of trees in the forest. We tune it.
- "**criterion**" is the function to measure the quality of a split. Although 'gini' and 'entropy' hardly differ, we tune it as well.
- "**max_depth**" is the maximum depth of trees. This is an important interdependent parameter, which has to be tuned.
- "**min_samples_split**" is the minimum number of samples required to split an interior node. This has to be tuned as well.
- "**min_samples_leaf**" is the minimum number of samples required to have a leaf node. This has to be tuned as well.
- "**min_weight_fraction_leaf**" is the minimum weighted fraction of the total number of input samples required to be present at a leaf node. This parameter should be high enough to ensure that out-of-bag accuracy converges to out-of-sample (k-fold) accuracy.
- "**max_features**" is the number of features considered for finding the best split. This has also to be tuned.
- "**max_leaf_nodes**" is the maximum number of leaf nodes. This is related to 'max_depth' which is why we leave it at None.
- "**min_impurity_decrease**" is a threshold for stopping tree growth early. Since the signal is low in finance, we cannot expect a high decrease in impurity, so we leave it at 0.
- "**bootstrap**" decides whether to use bootstrap samples when building trees. Since this is the whole idea of bagging, we leave it at True.
- "**oob_score**" decides whether to compute out-of-bag scores. We leave it at False because we don't need this metric.
- "**n_jobs**" determines the number of CPUs used for training. We set it to -1 such that all are used.
- "**verbose**" controls the level of detail in the output. We leave it as it is.
- "**warm_start**" decides whether to reuse the previous solution for the next fit. We leave it at False as it can mess up the process.
- "**class_weight**" is for the weight of the classes. We set it to 'balanced' to be consistent.
- "**ccp_alpha**" is the complexity parameter for the minimal cost complexity pruning. We don't use it as it can prevent the trees from being different enough to drive down variance.
- "**max_samples**" is the number of samples drawn for adapting each base learner. We tune this as well.
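
To get a feeling for the two 'max_features' options that we tune, the following sketch translates them into the number of candidate features per split. The helper is hypothetical and only mirrors, to our understanding, how scikit-learn resolves the string options:

```python
import numpy as np

def n_candidate_features(n_features, max_features):
    # Hypothetical helper translating the string option into a feature count
    if max_features == 'sqrt':
        return max(1, int(np.sqrt(n_features)))
    if max_features == 'log2':
        return max(1, int(np.log2(n_features)))
    raise ValueError(f"Unsupported option: {max_features}")

for n in (10, 30, 100):
    print(n, n_candidate_features(n, 'sqrt'), n_candidate_features(n, 'log2'))
```

For a small number of features both options lead to almost the same value, so the difference only matters for wider data sets.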

In [None]:
# Define the hyperparameter grid you want to search over
param_dist = {
    'n_estimators': Integer(100, 1000),  # Number of trees in the forest
    'criterion': Categorical(['gini', 'entropy']),
    'max_depth': Integer(1, n_components_99),  # Maximum number of levels in each decision tree
    'min_samples_split': Integer(2, 10),  # Minimum number of data points placed in a node before the node is split
    'min_samples_leaf': Integer(1, 10),  # Minimum number of data points allowed in a leaf node
    'min_weight_fraction_leaf': Real(0.05, 0.2),  # Minimum weighted fraction of the total population required to be at a leaf node
    'max_features': Categorical(['sqrt', 'log2']),  # Number of features to consider at every split
    'max_samples': Real(0.01, 1.0)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_1['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(rf, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=rf,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_rf = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_rf)

(Your results may vary due to the stochastic nature of the process.)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_rf])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

It is interesting that all the features have been retained this time, as if they were all important, which is highly unlikely.
What is the reason for this?

The reason for this is related to the nature of bagging. Increased performance in bagging is related to the diversity of the constituent models; ensembling models that are effectively the same does not reduce the variance in model predictions. For this reason, random forests force the trees to contain suboptimal splits of predictors using a random sample of them.

This is also why we've left the 'ccp_alpha' for pruning and 'min_impurity_decrease' parameters at 0, so as not to interfere with this.
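
The role of diversity can be illustrated with the textbook formula for the variance of an average of equally correlated predictions. The sketch below assumes that all trees have the same variance and the same pairwise correlation, which is an idealization:

```python
def ensemble_variance(sigma2, rho, n_models):
    # Variance of the mean of n_models predictions with common variance sigma2
    # and pairwise correlation rho
    return rho * sigma2 + (1.0 - rho) * sigma2 / n_models

for rho in (0.0, 0.5, 1.0):
    print(rho, ensemble_variance(1.0, rho, 100))
```

With a correlation of 1 the ensemble is no better than a single tree, no matter how many trees are averaged, whereas lower correlation lets the variance shrink with the number of trees.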

Regarding the hyperparameters, we can say that the combination of a low max_depth, relatively high min_samples_split, and high min_weight_fraction_leaf suggests a strongly constrained, high-bias model.

In [None]:
# Initialize the Random Forest Classifier
rf_final = RandomForestClassifier(n_jobs=-1,
                                  class_weight='balanced',
                                  **best_hyperparameters)
# Fit the model
rf_final.fit(X[best_features_rf], y)

We see that the Random Forest also has more variability in its predictions:

In [None]:
# Predict classes
y_pred = rf_final.predict(X[best_features_rf])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Random Forest (Training)")

# Show the plot
plt.show()

Unlike the SVM model above, the predicted probabilities are consistent with the class predictions.
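
This consistency can be verified with a minimal sketch on synthetic data (all `*_demo` names are hypothetical). In scikit-learn, the random forest derives its class prediction from the averaged tree probabilities:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Noisy synthetic binary problem (illustrative data only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 4))
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=150) > 0).astype(int)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# The predicted class is the one with the highest predicted probability
proba_demo = rf_demo.predict_proba(X_demo)
labels_from_proba = rf_demo.classes_[proba_demo.argmax(axis=1)]
consistent = bool((rf_demo.predict(X_demo) == labels_from_proba).all())
print(consistent)
```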

In [None]:
# Predict probabilities
y_proba = rf_final.predict_proba(X[best_features_rf])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Random Forest (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

The accuracy is now much better than that of the logistic regression and the SVM. However, the log loss is still quite weak, and this is still training data, so we should take this with a grain of salt.

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

The performance is exceptionally good compared to the previous two models.

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_1.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

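# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly returns after dropping NaNs so that all statistics use the same rows
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']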

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Random Forest Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Random Forest Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Random Forest Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Random Forest Model (Training)')
plt.show()

If this were the test data set, the result of the Sharpe ratio test would already be significant.

In [None]:
# Define the sample to test
sample1 = benchmark_monthly_returns  # Benchmark
sample2 = model_monthly_returns  # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Multi-layer Perceptron

Multilayer perceptrons (MLPs) are a type of artificial neural network consisting of at least three layers of neurons: an input layer, one or more "hidden" layers, and an output layer. Each layer is fully connected to the next, with each node receiving a weighted sum of inputs from the previous layer to which an activation function is applied.

MLPs are so-called universal approximators, meaning that they can approximate any continuous function to arbitrary accuracy, provided there are enough neurons in the hidden layers. This allows them to model complex, nonlinear relationships between our input features and the target variable.
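
The forward pass described above can be sketched in a few lines of numpy. The layer sizes and the random, untrained weights are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 4, 3

# Randomly initialized weights and biases (illustrative, not trained)
W1, b1 = rng.normal(size=(n_inputs, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)

def forward(X):
    # Hidden layer: weighted sum of the inputs followed by a tanh activation
    hidden = np.tanh(X @ W1 + b1)
    # Output layer: weighted sum followed by a logistic function, giving P(class 1)
    return 1.0 / (1.0 + np.exp(-(hidden @ W2 + b2)))

X_demo = rng.normal(size=(5, n_inputs))
p_class1 = forward(X_demo).ravel()
print(p_class1.round(3))
```

Training then amounts to adjusting the weights and biases so that these probabilities match the observed regimes as closely as possible.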

In [None]:
# Initialize the Multi-layer Perceptron Classifier
mlp = MLPClassifier(solver = 'lbfgs',
                    learning_rate = 'adaptive',
                    max_iter = 100000,
                    early_stopping = True)

MLPs have a lot of hyperparameters:

- "**hidden_layer_sizes**" is the number of neurons in the hidden layers. This needs to be tuned, but we limit it to two layers, since this is theoretically enough to approximate any function given enough neurons per layer, and to 20 neurons per layer so as not to exceed the number of input neurons.
- "**activation**" is the activation function for the hidden layers. We need to tune it as well.
- "**solver**" is the optimization algorithm. We set it to 'lbfgs' since it converges better & faster for smaller data sets.
- "**alpha**" is the L2 regularization term. We tune it as well.
- "**batch_size**" is the size of the mini-batches for stochastic optimizers. We leave it at 'auto', but it won't be used with 'lbfgs' optimizing anyway.
- "**learning_rate**" controls the type of learning rate. We set it to 'adaptive', but it isn't used with 'lbfgs' anyway.
- "**learning_rate_init**" is the initial learning rate. With 'lbfgs' it is not used, so we leave it as it is.
- "**power_t**" is the exponent for the inverse scaling of the learning rate. We leave it as well since it is not used with 'lbfgs'.
- "**max_iter**" is the maximum number of iterations. We set it to 100,000 to ensure convergence.
- "**shuffle**" determines whether the samples are shuffled in each iteration. Since it is not used with 'lbfgs' we leave it as it is.
- "**tol**" is the tolerance for the optimization. We leave it at 1e-4.
- "**verbose**" controls the level of detail in the output. We leave it at its default.
- "**warm_start**" decides whether to reuse the previous solution for the next fit. We leave it at False as it can mess up the process.
- "**momentum**" is the momentum term for the gradient descent update. We don't use the 'sgd' optimizer, so we leave it at its default.
- "**nesterovs_momentum**" enables Nesterov's momentum. It is also only relevant for the 'sgd' optimizer.
- "**early_stopping**" enables early stopping to avoid overfitting. This is definitely something you want to set to True, although scikit-learn only applies it with the 'sgd' and 'adam' solvers, so it has no effect with 'lbfgs'.
- "**validation_fraction**" is the fraction of the training data that is retained for validation. We leave it as 10%, since this resembles a 10-fold partition.
- "**beta_1**" is the exponential decay rate for estimates of the first moment vector in 'adam'. Not relevant for us.
- "**beta_2**" is the exponential decay rate for estimates of the second moment vector in 'adam'. Not relevant for us, too.
- "**epsilon**" is a constant for numerical stability for the 'adam' optimizer. We don't use it so we forget about it.
- "**n_iter_no_change**" is the number of iterations without improvement before stopping early. Again, only relevant with the 'sgd' and 'adam' solvers.
- "**max_fun**" is the maximum number of function calls. We set it to 100,000 for consistency with 'max_iter'.

In [None]:
# Parameter distributions
param_dist = {
    'clf__layer_size': Integer(1, int(n_components_99/2)),
    'clf__num_layers': Integer(1, 2),
    'clf__alpha': Real(1e-6, 1e-1, prior='log-uniform'),
    'clf__activation': Categorical(['identity', 'logistic', 'tanh', 'relu'])
}

The problem with using the BayesSearchCV function from the scikit-optimize library with the MLPClassifier function from scikit-learn is that the MLPClassifier takes the number of neurons and hidden layers as a tuple (for example, (50, 50) for two hidden layers with 50 neurons each). However, the BayesSearchCV function cannot optimize tuples as parameters.

To get around this problem, we need to use a little trick.

We implemented a wrapper class for the MLPClassifier that takes the number of neurons and hidden layers as separate decoupled parameters. By doing so, we can use BayesSearchCV, which we definitely want.

In [None]:
class MLPWrapper(MLPClassifier):
    def __init__(self, layer_size=1, num_layers=1, alpha=0.0001, activation='relu'):
        # Only store the parameters here; the model itself is built in fit(),
        # so that values assigned later via set_params() actually take effect
        self.layer_size = layer_size
        self.num_layers = num_layers
        self.alpha = alpha
        self.activation = activation

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            if parameter in ("layer_size", "num_layers", "alpha", "activation"):
                setattr(self, parameter, value)
            else:
                raise ValueError('Invalid parameter %s for estimator %s. '
                                 'Check the list of available parameters '
                                 'with `estimator.get_params().keys()`.' %
                                 (parameter, self))
        # Return self, as the scikit-learn estimator API expects
        return self

    def fit(self, X, y):
        # Translate the decoupled parameters into the tuple expected by MLPClassifier
        hidden_layer_sizes = tuple([self.layer_size] * self.num_layers)
        self.model = MLPClassifier(hidden_layer_sizes=hidden_layer_sizes,
                                   alpha=self.alpha,
                                   activation=self.activation,
                                   solver='lbfgs',
                                   learning_rate='adaptive',
                                   max_iter=100000,
                                   early_stopping=True)
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.predict(X)

    def predict_proba(self, X):
        return self.model.predict_proba(X)

    def score(self, X, y):
        # Negative log loss, so that higher scores are better
        y_pred_proba = self.model.predict_proba(X)
        return -log_loss(y, y_pred_proba, labels=[0.0, 1.0])

As with the SVM, recursive feature elimination cannot be combined with MLPs, since they also do not provide model-specific feature importance. For this reason, we again use sequential backward feature selection.

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Wrapper Pipeline
pipe = Pipeline([
    ('clf', MLPWrapper())
])

# Define data
X = df_pca_99_scaled
y = df_train_1['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Sequential backward feature selection
    selector = SequentialFeatureSelector(MLPWrapper(),
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         #scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=pipe,
                                 search_spaces=[(param_dist, 50)],
                                 cv=tscv_inner,
                                 #scoring=scorer,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_mlp = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_mlp)

(Your results may differ due to the stochastic nature of the procedure.)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_mlp])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

Tuning a multilayer perceptron is tricky because there is a high risk of overfitting. So what can we say about the hyperparameters?

- The activation function 'tanh' is a non-linear function that allows the model to learn more complex representations. However, its output is bounded, which makes it more controlled than the unbounded 'relu' function.

- The alpha parameter is an L2 penalty term. A relatively small value (0.0330897095330755) indicates a lower degree of regularization, meaning that the model is allowed more complexity.

- The hidden_layer_sizes setting with a single hidden layer and 8 neurons indicates a relatively simple architecture, which is favorable given that overfitting is a problem in finance.

In summary, the MLPClassifier allows some complexity (tanh activation, low regularization), while the relatively simple hidden layer architecture helps to prevent overfitting. Note that the early stopping and adaptive learning rate settings only take effect with the stochastic solvers and therefore offer no additional protection with 'lbfgs'.

Now we can fit it.

In [None]:
hyperparameters_dict = dict(best_hyperparameters)

activation = hyperparameters_dict['clf__activation']
alpha = hyperparameters_dict['clf__alpha']
num_layers = hyperparameters_dict['clf__num_layers']
layer_size = hyperparameters_dict['clf__layer_size']

hidden_layer_sizes = tuple(layer_size for _ in range(num_layers))

# Convert the selected feature names to a list
best_features_mlp = list(best_features_mlp)

# Initialize the Multi-layer Perceptron classifier
mlp_final = MLPClassifier(solver = 'lbfgs',
                          learning_rate = 'adaptive',
                          max_iter = 100000,
                          early_stopping = True,
                          activation=activation,
                          alpha=alpha,
                          hidden_layer_sizes=hidden_layer_sizes)

# Fit the model
mlp_final.fit(X[best_features_mlp], y)

Next we can use it to predict the regimes.

In [None]:
# Predict classes
y_pred = mlp_final.predict(X[best_features_mlp])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Multi-layer Perceptron (Training)")

# Show the plot
plt.show()

We can now also predict the class probabilities, which later serve as portfolio weights. The resulting weights are quite extreme, and it remains to be seen how well this generalizes.

In [None]:
# Predict probabilities
y_proba = mlp_final.predict_proba(X[best_features_mlp])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Multi-layer Perceptron (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

The metrics are exceptionally good, actually a little too good, indicating overfitting.

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

The return performance is excellent as well, and again a little too excellent.

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_1.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return streams
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Multi-layer Perceptron Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Multi-layer Perceptron Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the densities of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Multi-layer Perceptron Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Multi-layer Perceptron Model (Training)')
plt.show()

The bootstrapping test shows similar results, too.

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

## First Testing

Now we are ready to test all four models.

In [None]:
df_test_1

Since we have transformed the training data multiple times (PCA & scaling), we need to apply exactly the same transformations to the test data to ensure consistency.

In [None]:
X_test = df_test_1.drop("Regime", axis=1)

# Perform scaling on the test set using the same scaler
X_test_scaled = scaler.transform(X_test)

# Perform PCA transformation on the scaled test set using the same PCA instance
X_test_pca = pca.transform(X_test_scaled)

# Cap the data
X_test_pca = X_test_pca[:, :n_components_99]

# Apply the already fitted PCA scaler to the test data (transform only, no refitting)
df_test_pca = pca_scaler.transform(X_test_pca)

# Create a new DataFrame with the transformed test set
df_test_pca_99 = pd.DataFrame(df_test_pca, index=X_test.index)

# Rename the columns to indicate the component number
df_test_pca_99.columns = [f"PC {i+1}" for i in range(X_test_pca.shape[1])]

# Print the transformed test set DataFrame
df_test_pca_99
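
The principle behind the cell above (estimate every transformation on the training data, then only apply it to the test data) can be sketched compactly with a scikit-learn pipeline. This is an illustrative sketch on synthetic data, not the notebook's actual objects: the names `X_train_demo`, `X_test_demo` and `preprocess` are hypothetical, and `PCA(n_components=0.99)` assumes that `n_components_99` denotes the number of components explaining 99% of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the training and test feature matrices
X_train_demo = rng.normal(size=(120, 10))
X_test_demo = rng.normal(size=(40, 10))

# Scale -> PCA (keep 99% of the variance) -> rescale the components
preprocess = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.99),
    StandardScaler(),
)

# Means, variances and loadings are estimated on the training data only
Z_train = preprocess.fit_transform(X_train_demo)

# The test data is only transformed, never used for fitting
Z_test = preprocess.transform(X_test_demo)

print(Z_train.shape, Z_test.shape)
```

Bundling the steps in one object makes it harder to accidentally call `fit` on the test set, which would leak test information into the preprocessing.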

For verification purposes, let us have a look at the data.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_test_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Test Features")

# Show the plot
plt.show()

This data can now be used to create forecasts:

In [None]:
X = df_test_pca_99

# Predict classes
y_log_reg = log_reg_final.predict(X[best_features_log_reg])
y_svm = svm_final.predict(X[best_features_svm])
y_rf = rf_final.predict(X[best_features_rf])
y_mlp = mlp_final.predict(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for each model
colors = ['black', 'red', 'green', 'blue']

# Plot the data on each subplot with distinct colors
axs[0].plot(df_test_pca_99.index, y_log_reg, color=colors[0], label='Logistic Regression')
axs[1].plot(df_test_pca_99.index, y_svm, color=colors[1], label='SVM')
axs[2].plot(df_test_pca_99.index, y_rf, color=colors[2], label='Random Forest')
axs[3].plot(df_test_pca_99.index, y_mlp, color=colors[3], label='MLP')

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Regime by Models (Test)")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

And we can also predict probabilities.

In [None]:
# Get the predicted probabilities for each model
y_log_reg_prob = log_reg_final.predict_proba(X[best_features_log_reg])
y_svm_prob = svm_final.predict_proba(X[best_features_svm])
y_rf_prob = rf_final.predict_proba(X[best_features_rf])
y_mlp_prob = mlp_final.predict_proba(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for each model
colors = ['blue', 'black', 'green', 'red']

# Plot the data on each subplot with distinct colors
axs[0].stackplot(df_test_pca_99.index, y_log_reg_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[1].stackplot(df_test_pca_99.index, y_svm_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[2].stackplot(df_test_pca_99.index, y_rf_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[3].stackplot(df_test_pca_99.index, y_mlp_prob.T, colors=colors, labels=['Value', 'Momentum'])

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Probabilities by Models (Test)")

# Add legend to the plot
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels)

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

Unfortunately, the classification metrics are poor. The MLP model performs particularly badly, confirming the suspicion that it is overfitted to the training dataset.

In [None]:
y = df_test_1["Regime"]

# Calculate confusion matrices for each model
cm1 = confusion_matrix(y, y_log_reg, normalize='all')
cm2 = confusion_matrix(y, y_svm, normalize='all')
cm3 = confusion_matrix(y, y_rf, normalize='all')
cm4 = confusion_matrix(y, y_mlp, normalize='all')

# Set figure size and dpi
fig, axs = plt.subplots(2, 2, figsize=(11, 11), dpi=100)

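# Collect the confusion matrices and model names
cms = [cm1, cm2, cm3, cm4]
models = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']
metrics = []

for i, cm in enumerate(cms):
    # Accuracy is the trace of the normalized confusion matrix
    accuracy = np.diag(cm).sum() / cm.sum()
    # Note: this is the negative log of the per-class recall, not the
    # probabilistic log loss, which is computed with log_loss in the next cell
    logloss = -np.log(np.diag(cm) / np.sum(cm, axis=1))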
    
    # Store metrics
    metrics.append((accuracy, logloss))
    
    # Plot confusion matrix
    ax = axs[i // 2, i % 2]
    sns.heatmap(cm, annot=True, cmap='coolwarm', ax=ax)
    ax.set_title(f'{models[i]}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# assuming y_true is your ground truth
y_true = df_test_1["Regime"]

# Calculate accuracy scores
log_reg_acc = accuracy_score(y_true, y_log_reg)
svm_acc = accuracy_score(y_true, y_svm)
rf_acc = accuracy_score(y_true, y_rf)
mlp_acc = accuracy_score(y_true, y_mlp)

# Calculate log loss scores
log_reg_loss = log_loss(y_true, y_log_reg_prob)
svm_loss = log_loss(y_true, y_svm_prob)
rf_loss = log_loss(y_true, y_rf_prob)
mlp_loss = log_loss(y_true, y_mlp_prob)

# Create a dataframe to display
df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'],
    'Accuracy': [log_reg_acc, svm_acc, rf_acc, mlp_acc],
    'Log Loss': [log_reg_loss, svm_loss, rf_loss, mlp_loss]
})

display(df)
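
A useful reference point when reading the log loss column: a forecaster that always predicts 50/50 scores ln(2) ≈ 0.693, so any value above that is worse than having no view at all. The sketch below uses made-up labels and probabilities (the helper `binary_log_loss` is hypothetical and written only for this illustration) to show that a model with extreme probabilities can have decent accuracy and still a worse log loss than the 50/50 baseline.

```python
import numpy as np

def binary_log_loss(y_true, p_class1, eps=1e-15):
    """Average negative log-likelihood of binary labels given P(class 1)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_class1, dtype=float), eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))

# Made-up labels, for illustration only
y_demo = np.array([0, 1, 1, 0, 1, 0, 0, 1])

# An uninformative forecaster that always predicts 50/50
baseline = binary_log_loss(y_demo, np.full(len(y_demo), 0.5))

# A very confident forecaster that is wrong on two of the eight observations
p_confident = np.array([0.01, 0.99, 0.01, 0.01, 0.99, 0.99, 0.01, 0.99])
confident = binary_log_loss(y_demo, p_confident)

print(f"50/50 baseline:             {baseline:.4f}")
print(f"confident but partly wrong: {confident:.4f}")
```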

Despite the poor classification metrics, the models perform solidly, including the MLP model, which actually outperforms on a non-risk-adjusted basis. This is remarkable given its extreme positioning. The first three models perform solidly because their positioning stays close to 50/50, i.e. close to the benchmark.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

df_log_reg_prob = pd.DataFrame(y_log_reg_prob, index=df_test_1.index, columns=['Probability_0', 'Probability_1'])
df_svm_prob = pd.DataFrame(y_svm_prob, index=df_test_1.index, columns=['Probability_0', 'Probability_1'])
df_rf_prob = pd.DataFrame(y_rf_prob, index=df_test_1.index, columns=['Probability_0', 'Probability_1'])
df_mlp_prob = pd.DataFrame(y_mlp_prob, index=df_test_1.index, columns=['Probability_0', 'Probability_1'])

# Repeat the process for each model
for df_prob, model_name in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                               ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_1.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']

    # Drop rows with NaN values (the first row has no lagged probabilities)
    df_merged = df_merged.dropna()
    
    # Compute Sharpe ratios
    model_sharpe_ratio = df_merged['Weighted Returns'].mean() / df_merged['Weighted Returns'].std()

    # Print Sharpe ratios
    print(f"{model_name} Sharpe Ratio:", model_sharpe_ratio)

    # Calculate the cumulative returns
    df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
    df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

    # Plot the cumulative returns
    plt.plot(df_merged['Cumulative Weighted Returns'], label=model_name)

# Plot the 50/50 benchmark
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Model Comparison (Test)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

benchmark_sharpe_ratio = df_merged['Simple Weighted Returns'].mean() / df_merged['Simple Weighted Returns'].std()
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)
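
The weighting scheme used in the cell above can be isolated in a small helper. This is a sketch on made-up returns; the function name `probability_weighted_returns` and the toy series are hypothetical. It also illustrates why models whose probabilities hover around 50/50 track the benchmark closely: a flat 50/50 forecast reproduces the benchmark exactly.

```python
import numpy as np
import pandas as pd

def probability_weighted_returns(proba_value, value_returns, momentum_returns):
    """Weight each period's returns by the previous period's predicted probabilities."""
    w_value = proba_value.shift(1)   # lag by one period to avoid look-ahead
    w_momentum = 1.0 - w_value
    return (w_value * value_returns + w_momentum * momentum_returns).dropna()

# Made-up monthly returns, for illustration only
idx = pd.date_range("2020-01-01", periods=6, freq="MS")
value = pd.Series([0.01, -0.02, 0.03, 0.00, 0.01, -0.01], index=idx)
momentum = pd.Series([0.02, 0.01, -0.01, 0.02, -0.03, 0.01], index=idx)

# A model that always predicts 50/50 reproduces the equal-weight benchmark
flat = pd.Series(0.5, index=idx)
strategy = probability_weighted_returns(flat, value, momentum)
benchmark = (0.5 * value + 0.5 * momentum).loc[strategy.index]

print(np.allclose(strategy, benchmark))
```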

Let us take a look at the distribution of returns.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 12), dpi=100)

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Repeat the process for each model
for df_prob, model_name, ax in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                                   ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'], 
                                   axes):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_1.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']

    # The benchmark uses fixed weights, so it does not need to be shifted

    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Plot the densities of monthly returns for the model and the benchmark
    ax.hist(df_merged['Weighted Returns'], bins=100, alpha=0.5, label=model_name, color='red')
    ax.hist(df_merged['Simple Weighted Returns'], bins=100, alpha=0.5, label='50/50 Benchmark', color='black')

    # Add a vertical line at zero
    ax.axvline(0, color='red', linestyle='--')
    
    ax.legend()
    ax.set_xlabel('monthly returns')
    ax.set_ylabel('frequency')
    ax.set_title(f'Return Histogram of {model_name} (Test)')

plt.tight_layout()
plt.show()

Finally, let's look at the Sharpe ratios of all the models.

To do this, we need to redefine the corresponding bootstrapping test function.

In [None]:
def multi_sharpe_ratio_bootstrap_test(sample1, model_samples, model_names, n_permutations=1000000):
    
    # Bootstrap sampling from sample1 and compute Sharpe ratios
    sharpe_ratios = []
    n = len(sample1)
    for _ in range(n_permutations):
        bootstrap_sample = np.random.choice(sample1, size=n, replace=True)
        mean_return = np.mean(bootstrap_sample)
        std_return = np.std(bootstrap_sample)
        sharpe_ratio = mean_return / std_return
        sharpe_ratios.append(sharpe_ratio)

    # Plot the null distribution of Sharpe ratios
    plt.figure(figsize=(11, 6), dpi=100)
    plt.hist(sharpe_ratios, bins=100, color='black')

    # Assign a color for each model
    colors = ['red', 'blue', 'green', 'orange']

    # Compute the observed Sharpe ratio for each model sample
    for model_sample, model_name, color in zip(model_samples, model_names, colors):
        observed_mean_return = np.mean(model_sample)
        observed_std_return = np.std(model_sample)
        observed_sharpe_ratio = observed_mean_return / observed_std_return

        # Calculate p-value: proportion of null Sharpe ratios more extreme than observed Sharpe ratio
        p_value = (np.abs(sharpe_ratios) >= np.abs(observed_sharpe_ratio)).mean()

        # Output
        print(f"{model_name} p-value: {p_value*100:.2f}%")  # Print the calculated p-value as a percentage with two decimal points

        # Add vertical line for the model's Sharpe ratio
        plt.axvline(observed_sharpe_ratio, color=color, linestyle='dashed', linewidth=2,
                    label=f'{model_name}: {observed_sharpe_ratio:.2f}')

    plt.legend()
    plt.xlabel('Sharpe Ratio')
    plt.ylabel('frequency')
    plt.title('Null Distribution of Benchmark Sharpe Ratio')
    plt.show()
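
With one million resamples, the Python loop in the function above can take a long time. A possible speed-up, sketched below under the same i.i.d. resampling assumption, draws the resamples in vectorized chunks with NumPy. The function name `bootstrap_sharpe_ratios`, its parameters and the demo returns are hypothetical. Note that i.i.d. resampling ignores serial dependence in the returns; a block bootstrap such as the `CircularBlockBootstrap` imported at the top of the notebook would be one way to account for it.

```python
import numpy as np

def bootstrap_sharpe_ratios(returns, n_resamples=10_000, chunk_size=1_000, seed=0):
    """Sharpe ratios of i.i.d. bootstrap resamples, computed in vectorized chunks."""
    returns = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(returns)
    out = np.empty(n_resamples)
    for start in range(0, n_resamples, chunk_size):
        stop = min(start + chunk_size, n_resamples)
        # Each row is one resample of the same length as the original series
        idx = rng.integers(0, n, size=(stop - start, n))
        resamples = returns[idx]
        out[start:stop] = resamples.mean(axis=1) / resamples.std(axis=1)
    return out

# Made-up monthly returns, for illustration only
demo_returns = np.random.default_rng(1).normal(0.005, 0.03, size=120)
null_sharpes = bootstrap_sharpe_ratios(demo_returns)
print(null_sharpes.shape, null_sharpes.mean())
```

The chunk size trades memory for speed: each chunk holds `chunk_size * len(returns)` floats at a time.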

None of the Sharpe ratios is statistically significantly better or worse than that of the benchmark. In other words, it is not really worth using these models, especially not after transaction costs.

In [None]:
# Prepare lists to store model returns and model names
model_returns = []
model_names = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']

# Get returns for each model
for df_prob in [df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob]:
    
    # Merge the DataFrames based on their indexes
    df_merged = df_test_1.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']
    
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Append model returns to the list
    model_returns.append(df_merged['Weighted Returns'])


# The 50/50 benchmark over the same period as the model returns
benchmark_returns = 0.5 * df_merged['Value Returns'] + \
                    0.5 * df_merged['Momentum Returns']
    
# Run bootstrap test
multi_sharpe_ratio_bootstrap_test(benchmark_returns, model_returns, model_names)

## First further Feature Discarding

That was the first period, and we are now discarding **Federal Reserve Assets** to extend the historical period.

In [None]:
lean_sample_2 = lean_sample_1.drop('Federal Reserve Assets', axis=1)

display(lean_sample_2)

The next-shortest variable is now the **St. Louis Fed Financial Stress** Index.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_2' DataFrame
msno.bar(lean_sample_2, sort='descending', color = 'black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label
plt.ylabel("Missing Values")

# Set plot title
plt.title("Missing Data by Feature")

# Show the plot
plt.show()
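
As a numeric complement to the bar chart, the shortest series can also be identified directly with pandas. The sketch below uses a made-up panel (all names and values are hypothetical) and shows the observation count per feature, the first date with data for each feature, and the resulting start of the complete-case sample.

```python
import numpy as np
import pandas as pd

# Made-up monthly panel in which each feature starts at a different date
idx = pd.date_range("2000-01-01", periods=8, freq="MS")
demo = pd.DataFrame({
    "long_feature": np.arange(8, dtype=float),
    "medium_feature": [np.nan, np.nan] + [1.0] * 6,
    "short_feature": [np.nan] * 5 + [2.0] * 3,
}, index=idx)

# Number of available observations per feature, shortest first
n_obs = demo.notna().sum().sort_values()

# First date with data for each feature
first_dates = demo.apply(lambda s: s.first_valid_index())

print(n_obs)
print(first_dates)

# The complete-case sample starts where the shortest feature starts
complete = demo.dropna()
print(complete.index.min())
```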

#### St. Louis Fed Financial Stress

We only consider the period in which the St. Louis Fed financial stress variable is present:

In [None]:
# Drop any rows that have missing values
df_complete_2 = lean_sample_2.dropna()

# Display the smaller DataFrame
display(df_complete_2)

Let us plot it:

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

# Plot the data with black color
plt.plot(df_complete_2["St. Louis Fed Financial Stress"], color='black')

# Add a horizontal line at zero
plt.axhline(0, color='red', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("stress level")

# Set plot title
plt.title("St. Louis Fed Financial Stress")

# Show the plot
plt.show()

As expected, the highest pairwise coefficient of determination is the one with the **Chicago Fed National Financial Stress** Index.

In [None]:
df = df_complete_2
column = 'St. Louis Fed Financial Stress'

plot_r_squared(df, column)
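
For intuition about the pairwise coefficient of determination: in a simple linear regression with an intercept, R² equals the squared Pearson correlation between the two series. The following sketch checks this on synthetic data. The helper `pairwise_r_squared` is hypothetical and is not the notebook's `plot_r_squared` function, whose internals are not shown in this section.

```python
import numpy as np

def pairwise_r_squared(x, y):
    """R^2 of a simple linear regression of y on x (with intercept)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

# Synthetic, noisily related series (illustrative only)
rng = np.random.default_rng(0)
x_demo = rng.normal(size=200)
y_demo = 0.8 * x_demo + rng.normal(scale=0.5, size=200)

r2 = pairwise_r_squared(x_demo, y_demo)
corr_squared = np.corrcoef(x_demo, y_demo)[0, 1] ** 2
print(r2, corr_squared)
```

Because the squared correlation is symmetric, it does not matter which of the two series is treated as the regressor.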

Let us take a look at both to visualize the similarity. We see that the resemblance is really high, because economically both represent the same thing.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["St. Louis Fed Financial Stress", 'Chicago Fed National Financial Stress']

# Plotting the columns
plt.plot(df_complete_2[columns_to_plot])
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('stress level')
plt.title('Financial Stress Indicators')

# Display the plot
plt.show()

The bootstrapping test confirms the visual impression.

In [None]:
# Define time series
ts1 = df_complete_2["St. Louis Fed Financial Stress"]  # Time series 1
ts2 = df_complete_2["Chicago Fed National Financial Stress"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

Therefore, we discard the St. Louis Fed's financial stress index.

In [None]:
lean_sample_3 = lean_sample_2.drop('St. Louis Fed Financial Stress', axis=1)

display(lean_sample_3)

The shortest time series is now the **15Y Mortgage Rate**.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_3' DataFrame
msno.bar(lean_sample_3, sort='descending', color='black')

# Display the plot
plt.show()

#### 15Y Mortgage Rates

Again, we need to shorten the time period to the one where all the features left so far are complete.

In [None]:
# Drop any rows that have missing values
df_complete_3 = lean_sample_3.dropna()

# Display the smaller DataFrame
display(df_complete_3)

To get a visual impression of the variable, we plot it again.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

# Plot the data with black color
plt.plot(df_complete_3["15Y Mortage Rates"], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("%")

# Set plot title
plt.title("15Y Mortgage Rates")

# Show the plot
plt.show()

The highest similarity is with the variable 30Y Mortgage Rates.

In [None]:
df = df_complete_3
column = '15Y Mortage Rates'

plot_r_squared(df, column)

The strong similarity, which can already be assumed theoretically/economically, is also confirmed visually.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["15Y Mortage Rates", '30Y Mortage Rates']

# Plotting the columns
plt.plot(df_complete_3[columns_to_plot])
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('Mortgage Rates')

# Display the plot
plt.show()

The bootstrapping test formally confirms this.

In [None]:
# Define time series
ts1 = df_complete_3["15Y Mortage Rates"]  # Time series 1
ts2 = df_complete_3["30Y Mortage Rates"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

Thus, we discard **15Y Mortgage Rates**.

In [None]:
lean_sample_4 = lean_sample_3.drop('15Y Mortage Rates', axis=1)

display(lean_sample_4)

The shortest variable is now the 10-3M yield spread.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_4' DataFrame
msno.bar(lean_sample_4, sort='descending', color='black')

# Display the plot
plt.show()

#### 10-3M Yield Spreads

Although it is only about 4 months, we have 3M Treasury Yields and 10Y Treasury Yields as distinct features, so we can compute the spread ourselves, which yields more data for this particular feature.

In [None]:
# Recompute the spread as the difference between the 10Y and 3M Treasury yields
lean_sample_4["10-3M Yield Spreads"] = lean_sample_4["10Y Treasury Yields"] - lean_sample_4["3M Treasury Yields"]

lean_sample_4
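
The gain in history from recomputing the spread can be checked by comparing the number of available observations before and after. Here is a minimal sketch on made-up values (the column names match the notebook, but the numbers and dates are purely illustrative): the computed spread exists wherever both yields exist, and it agrees with the published series where the two overlap.

```python
import numpy as np
import pandas as pd

# Made-up yields; the published spread starts later than its components
idx = pd.date_range("2000-01-01", periods=6, freq="MS")
demo = pd.DataFrame({
    "10Y Treasury Yields": [5.0, 5.1, 5.2, 5.3, 5.4, 5.5],
    "3M Treasury Yields": [4.0, 4.3, 4.6, 4.9, 5.2, 5.6],
    "10-3M Yield Spreads": [np.nan, np.nan, np.nan, np.nan, 0.2, -0.1],
}, index=idx)

published = demo["10-3M Yield Spreads"]
computed = demo["10Y Treasury Yields"] - demo["3M Treasury Yields"]

print("published observations:", int(published.notna().sum()))
print("computed observations: ", int(computed.notna().sum()))

# Where both exist, the two series agree
overlap = published.notna()
print(np.allclose(computed[overlap], published[overlap]))
```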

As before, we restrict the data set to the period where all features are complete.

In [None]:
# Drop any rows that have missing values
df_complete_4 = lean_sample_4.dropna()

# Display the smaller DataFrame
display(df_complete_4)

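This variable is interesting to visualize because it is often interpreted as a herald of economic and financial turmoil. This makes sense because it is a proxy for the slope of the yield curve: a negative spread means that the curve is inverted, i.e. short-term yields exceed long-term yields, which has historically tended to precede recessions.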

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

plt.plot(df_complete_4['10-3M Yield Spreads'], color='black')
plt.axhline(0, color='red', linestyle='--')

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('10-3M Yield Spreads')

# Display the plot
plt.show()
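
The periods below the red zero line in the plot above are the inversions. They can also be extracted programmatically, as in this small sketch on a made-up spread series (values and dates are illustrative only):

```python
import pandas as pd

# Made-up spread series in percentage points
idx = pd.date_range("2000-01-01", periods=8, freq="MS")
spread = pd.Series([1.2, 0.6, 0.1, -0.2, -0.4, -0.1, 0.3, 0.9], index=idx)

# The yield curve is inverted whenever the spread is negative
inverted = spread < 0
inverted_dates = spread[inverted].index

print("months inverted:", int(inverted.sum()))
print("first inversion:", inverted_dates[0])
print("last inversion: ", inverted_dates[-1])
```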

As expected, the most similar variable is the **10-2Y Yield Spreads**.

In [None]:
# Select the DataFrame for this period
df = df_complete_4

# Select the column to compare against all other features
column = '10-3M Yield Spreads'

plot_r_squared(df, column)

The following visualization confirms that both represent the same economic object, the yield curve.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["10-3M Yield Spreads", '10-2Y Yield Spreads']

# Plotting the columns
plt.plot(df_complete_4[columns_to_plot])
plt.axhline(0, color='red', linestyle='--')
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('Yield Spreads')

# Display the plot
plt.show()

The similarity is supported by the bootstrap test.

In [None]:
# Define time series
ts1 = df_complete_4["10-3M Yield Spreads"]  # Time series 1
ts2 = df_complete_4["10-2Y Yield Spreads"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

We therefore discard the **10-3M Yield Spreads**.

In [None]:
lean_sample_5 = lean_sample_4.drop('10-3M Yield Spreads', axis=1)

display(lean_sample_5)

The shortest variable is now the 3M Treasury Yields.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_5' DataFrame
msno.bar(lean_sample_5, sort='descending', color='black')

# Display the plot
plt.show()

#### 3M Treasury Yields

The period of time is again limited to the complete one.

In [None]:
# Drop any rows that have missing values
df_complete_5 = lean_sample_5.dropna()

# Display the smaller DataFrame
display(df_complete_5)

Here is a plot again:

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

plt.plot(df_complete_5['3M Treasury Yields'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("%")

# Set plot title
plt.title("3M Treasury Yields")

# Show the plot
plt.show()

The highest similarity is with 2Y Treasury Yields.

In [None]:
# Select the DataFrame for this period
df = df_complete_5

# Select the column to compare against all other features
column = '3M Treasury Yields'

plot_r_squared(df, column)

Here is a comparative plot of all Treasury Yields over all maturities available to us. They are all relatively closely correlated.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["3M Treasury Yields", '2Y Treasury Yields', '5Y Treasury Yields', "10Y Treasury Yields"]

# Plotting the columns
plt.plot(df_complete_5[columns_to_plot])

plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('Treasury Yields')

# Display the plot
plt.show()

The test confirms this.

In [None]:
# Define time series
ts1 = df_complete_5["3M Treasury Yields"]  # Time series 1
ts2 = df_complete_5["2Y Treasury Yields"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

And again, we can discard the current shortest variable.

In [None]:
lean_sample_6 = lean_sample_5.drop('3M Treasury Yields', axis=1)

display(lean_sample_6)

The shortest variable is now **2Y Treasury Yields**.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_6' DataFrame
msno.bar(lean_sample_6, sort='descending', color='black')

# Display the plot
plt.show()

#### 2Y Treasury Yields

The period is shortened again.

In [None]:
# Drop any rows that have missing values
df_complete_6 = lean_sample_6.dropna()

# Display the smaller DataFrame
display(df_complete_6)

Here is a plot again:

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

plt.plot(df_complete_6["2Y Treasury Yields"], color = 'black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("%")

# Set plot title
plt.title("2Y Treasury Yields")

# Show the plot
plt.show()

The highest similarity is with 5Y Treasury Yields.

In [None]:
# Select the DataFrame for this period
df = df_complete_6

# Select the column to compare against all other features
column = '2Y Treasury Yields'

plot_r_squared(df, column)

And again, a plot that visually supports that.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["2Y Treasury Yields", '5Y Treasury Yields', "10Y Treasury Yields"]

# Plotting the columns
plt.plot(df_complete_6[columns_to_plot])
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('Treasury Yields')

# Display the plot
plt.show()

The bootstrapping test again confirms the visual impression.

In [None]:
# Define time series
ts1 = df_complete_6["2Y Treasury Yields"]  # Time series 1
ts2 = df_complete_6["5Y Treasury Yields"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

Accordingly, this variable is discarded.

In [None]:
lean_sample_7 = lean_sample_6.drop('2Y Treasury Yields', axis=1)

display(lean_sample_7)

The shortest feature time series is now the **10-2Y Yield Spreads**.

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_7' DataFrame
msno.bar(lean_sample_7, sort='descending', color='black')

# Display the plot
plt.show()

(Note: From here on, the individual steps are no longer commented, since the procedure is the same as above.)

#### 10-2Y Yield Spreads

In [None]:
# Drop any rows that have missing values
df_complete_7 = lean_sample_7.dropna()

# Display the smaller DataFrame
display(df_complete_7)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

plt.plot(df_complete_7["10-2Y Yield Spreads"], color = 'black')

# Add a horizontal line at zero
plt.axhline(0, color='red', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("%")

# Set plot title
plt.title("10-2Y Yield Spreads")

# Show the plot
plt.show()

In [None]:
# Select the DataFrame to analyze
df = df_complete_7

# Select the column to compare against all other features
column = '10-2Y Yield Spreads'

plot_r_squared(df, column)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["10-2Y Yield Spreads", "Effective Federal Funds Rate"]

# Plotting the columns
plt.plot(df_complete_7[columns_to_plot])

plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('10-2Y Yield Spreads and Effective Federal Funds Rate')

# Display the plot
plt.show()

The null hypothesis is rejected for the second time, giving us the second period and set of features to model.

In [None]:
# Define time series
ts1 = df_complete_7["10-2Y Yield Spreads"]  # Time series 1
ts2 = df_complete_7["Effective Federal Funds Rate"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

## Second Modeling

In [None]:
# Sort the DataFrame by the time-related column
df_sorted = df_complete_7.sort_values('Date')

# Calculate the index to split the data
split_index = int(0.7 * len(df_sorted))

# Split the data into training and test sets
df_train_2 = df_sorted[:split_index]
df_test_2 = df_sorted[split_index:]

#Display
df_train_2

### Feature Space

In [None]:
# Drop the target variable
X = df_train_2.drop('Regime', axis=1)

# Initialize an empty matrix for R^2 values
r2_matrix = np.zeros((len(X.columns), len(X.columns)))

# Calculate R^2 values for each pair of variables
for i, col1 in enumerate(X.columns):
    for j, col2 in enumerate(X.columns):
        if i != j:
            # Fit a linear regression model
            lr = LinearRegression()
            lr.fit(X[col1].values.reshape(-1, 1), X[col2])

            # Calculate R^2 score
            r2 = lr.score(X[col1].values.reshape(-1, 1), X[col2])
            r2_matrix[i, j] = r2

# Perform hierarchical clustering
linkage_matrix = hierarchy.linkage(r2_matrix, method='complete')

# Obtain the order of rows and columns based on the dendrogram
order = hierarchy.dendrogram(linkage_matrix, no_plot=True)['leaves']

# Sort the R^2 matrix based on the order
sorted_r2_matrix = pd.DataFrame(r2_matrix[order, :][:, order], index=X.columns[order], columns=X.columns[order])

# Plot the sorted R^2 matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(sorted_r2_matrix, annot=False, cmap='coolwarm', cbar=True)
plt.title('Sorted R^2 Matrix with Clusters')
plt.show()
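
For a regression with a single predictor and an intercept, R^2 equals the squared Pearson correlation, so the pairwise loop above can be cross-checked against a squared correlation matrix. The sketch below does this on synthetic data; the frame `demo` and its column names are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the feature matrix (hypothetical columns)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo["b"] += 0.8 * demo["a"]  # make two columns related

# Pairwise R^2 via one-variable regressions, as in the loop above
r2_loop = np.zeros((demo.shape[1], demo.shape[1]))
for i, col1 in enumerate(demo.columns):
    for j, col2 in enumerate(demo.columns):
        if i != j:
            x = demo[[col1]].to_numpy()
            r2_loop[i, j] = LinearRegression().fit(x, demo[col2]).score(x, demo[col2])

# Same quantity from the squared correlation matrix (diagonal zeroed to match)
r2_corr = demo.corr().to_numpy() ** 2
np.fill_diagonal(r2_corr, 0.0)

print(np.allclose(r2_loop, r2_corr))
```

If the two matrices agree, the vectorized form is a much cheaper way to build the same matrix for wide feature sets.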

In [None]:
# Create an instance of StandardScaler and perform scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create an instance of PCA and perform PCA transformation
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Create a new DataFrame with the date index and components
df_pca = pd.DataFrame(X_pca, index=X.index)

# Rename the columns to indicate the component number
df_pca.columns = [f"PC {i+1}" for i in range(X_pca.shape[1])]

# Print the new DataFrame
df_pca

In [None]:
# Plot the cumulative explained variance ratio
explained_variance_ratio_cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(11, 6), dpi=100)
plt.plot(range(1, len(explained_variance_ratio_cumulative) + 1), explained_variance_ratio_cumulative, color = 'black')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance')
plt.grid(True)

# Determine the number of components explaining at least 99% of variance
n_components_99 = np.argmax(explained_variance_ratio_cumulative >= 0.99) + 1
print("Number of components explaining at least 99% of variance:", n_components_99)

# Add a vertical line at the number of components where 99% is reached
plt.axvline(x=n_components_99, color='red', linestyle='--')

plt.show()
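
As a cross-check on the manual threshold, scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (with `svd_solver='full'`) and then keeps as many components as needed to exceed that share of variance. The sketch below compares both approaches on synthetic data; `demo` and its factor structure are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic data: 10 columns driven by 3 latent factors plus small noise
latent = rng.normal(size=(300, 3))
demo = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(300, 10))
demo_scaled = StandardScaler().fit_transform(demo)

# Manual selection, as in the cell above
pca_full = PCA().fit(demo_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
n_manual = int(np.argmax(cumulative >= 0.99) + 1)

# Built-in selection via a variance threshold
pca_99 = PCA(n_components=0.99, svd_solver="full").fit(demo_scaled)

print(n_manual, pca_99.n_components_)
```

The two counts can differ only in the edge case where the cumulative ratio hits the threshold exactly, because the built-in rule requires the variance share to be strictly exceeded.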

In [None]:
# Keep only the principal components explaining at least 99% of variance
df_pca_99 = df_pca.iloc[:, :n_components_99]

#Display
df_pca_99

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("PCA Transformed Features")

# Show the plot
plt.show()

In [None]:
# Initialize the StandardScaler
pca_scaler = StandardScaler()

# Fit the scaler to the data and transform the data
df_pca_99_scaled = pca_scaler.fit_transform(df_pca_99)

# Convert the scaled data back to a DataFrame
df_pca_99_scaled = pd.DataFrame(df_pca_99_scaled, index=df_pca_99.index, columns=df_pca_99.columns)

df_pca_99_scaled

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99_scaled)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Features")

# Show the plot
plt.show()

### Modeling

#### Logistic Regression (baseline)

In [None]:
log_reg = LogisticRegression(penalty = 'elasticnet',
                             class_weight = 'balanced',
                             solver = 'saga',
                             l1_ratio=0.5,
                             max_iter =100000,
                             n_jobs=-1)

In [None]:
# Define the hyperparameter grid you want to search over
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'l1_ratio': Real(0, 1)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_2['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(log_reg, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=log_reg,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_log_reg = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_log_reg)
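
The splitters `tscv_outer` and `tscv_inner` are defined earlier in the notebook. Assuming they are `TimeSeriesSplit` instances (the class is imported at the top), each fold trains on an expanding window of past observations and validates on the block that follows. The sketch below prints the folds for a toy series; `n_splits=3` and the 12 observations are illustrative choices, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve time-ordered observations (hypothetical toy data)
X_demo = np.arange(12).reshape(-1, 1)

# n_splits=3 is an assumption for illustration
tscv_demo = TimeSeriesSplit(n_splits=3)

folds = []
for train_idx, test_idx in tscv_demo.split(X_demo):
    folds.append((train_idx.tolist(), test_idx.tolist()))
    print("train:", train_idx.tolist(), "test:", test_idx.tolist())
```

Because every training window ends before its validation block starts, no future observation leaks into model selection.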

(Your results may vary due to the stochastic nature of the process.)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_log_reg])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
log_reg_final = LogisticRegression(penalty='elasticnet',
                                   class_weight='balanced',
                                   solver='saga',
                                   max_iter=100000,
                                   n_jobs=-1,
                                   **best_hyperparameters)
# Fit the model
log_reg_final.fit(X[best_features_log_reg], y)

In [None]:
# Feature Coefficients
print(log_reg_final.coef_)

# Intercept
print(log_reg_final.intercept_)

In [None]:
proportion = df_train_2['Regime'].value_counts(normalize=True)
print(proportion)
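
The class proportions matter because the estimators in this section use `class_weight='balanced'`, which in scikit-learn weights each class by `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula on hypothetical labels; `y_demo` is not the notebook's `Regime` series.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: six observations of class 0, two of class 1
y_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y_demo)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# Same formula by hand: n_samples / (n_classes * count per class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.tolist())))
```

The rarer class receives the larger weight, so misclassifying it costs more during fitting.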

In [None]:
# Predict classes
y_pred = log_reg_final.predict(X[best_features_log_reg])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Logistic Regression (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = log_reg_final.predict_proba(X[best_features_log_reg])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Logistic Regression (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()
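
The cell above uses `normalize='all'`, so every cell of the confusion matrix is a share of all observations and the four cells sum to one. A small sketch with made-up labels contrasts this with `normalize='true'`, which normalizes each row (actual class) separately.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions
y_true_demo = np.array([0, 0, 0, 1, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 1, 1, 1, 1, 0, 0])

# Shares of the whole sample: all cells sum to 1
cm_all = confusion_matrix(y_true_demo, y_pred_demo, normalize="all")

# Shares within each actual class: every row sums to 1
cm_true = confusion_matrix(y_true_demo, y_pred_demo, normalize="true")

print(cm_all)
print(cm_true)
```

With `normalize='all'` the diagonal adds up to the accuracy, whereas the row-normalized version shows the recall of each class on its diagonal.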

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_2.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract monthly returns on the same sample for model and benchmark
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Logistic Regression Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Logistic Regression Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()
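
Shifting the probabilities by one row means that the weights applied to the returns of month t come from the prediction made in month t-1, which avoids look-ahead in the allocation. The toy frame below uses made-up numbers; only the column names mirror the notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Probability_0": [0.8, 0.3, 0.6],        # predicted P(Value)
    "Probability_1": [0.2, 0.7, 0.4],        # predicted P(Momentum)
    "Value Returns": [0.01, -0.02, 0.03],
    "Momentum Returns": [0.02, 0.01, -0.01],
})

# Weights known at the end of the previous period
lagged = toy[["Probability_0", "Probability_1"]].shift()

toy["Weighted Returns"] = (lagged["Probability_0"] * toy["Value Returns"]
                           + lagged["Probability_1"] * toy["Momentum Returns"])

# The first row has no prior weights, hence NaN
print(toy["Weighted Returns"])
```

The second row, for example, combines the first row's weights (0.8 and 0.2) with the second row's returns.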

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Logistic Regression Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Logistic Regression Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)
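
The helper `sharpe_ratio_bootstrap_test` is defined earlier in the notebook and is not reproduced here. As a rough illustration of the idea only, the sketch below resamples paired monthly returns in circular blocks and builds a percentile interval for the difference in Sharpe ratios. The function name, block length, number of resamples, and synthetic data are all assumptions and may differ from the notebook's implementation.

```python
import numpy as np

def sharpe(returns):
    """Per-period Sharpe ratio (no risk-free rate, not annualized)."""
    return returns.mean() / returns.std(ddof=1)

def sharpe_diff_block_bootstrap(bench, model, block=6, n_boot=2000, seed=0):
    """Hypothetical helper: 95% percentile interval for Sharpe(model) - Sharpe(bench)."""
    rng = np.random.default_rng(seed)
    n = len(bench)
    n_blocks = int(np.ceil(n / block))
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        # Circular blocks: indices wrap around the end of the sample
        idx = ((starts[:, None] + np.arange(block)[None, :]) % n).ravel()[:n]
        # Use the same indices for both series to keep the pairing intact
        diffs[b] = sharpe(model[idx]) - sharpe(bench[idx])
    return np.percentile(diffs, [2.5, 97.5])

# Synthetic monthly returns (hypothetical)
rng = np.random.default_rng(1)
bench_demo = rng.normal(0.005, 0.04, size=240)
model_demo = bench_demo + rng.normal(0.002, 0.01, size=240)

low, high = sharpe_diff_block_bootstrap(bench_demo, model_demo)
print(low, high)
```

Resampling blocks rather than single months preserves short-range dependence in the returns, and an interval that excludes zero would point to a difference between the two Sharpe ratios.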

#### Support Vector Machine

In [None]:
# Initialize the Support Vector Classifier
svm = SVC(shrinking=True,
          probability=True,
          cache_size=1000,
          class_weight='balanced',
          decision_function_shape ='ovo')

In [None]:
# Define the hyperparameter grid you want to search over
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'kernel': Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
    'degree': Integer(1, 10),
    'gamma': Categorical(['scale', 'auto']),
    'coef0': Real(0, 1)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_2['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Sequential backward feature selection
    selector = SequentialFeatureSelector(svm,
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=svm,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_svm = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_svm)
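
Unlike the logistic regression cell, this one uses `SequentialFeatureSelector` instead of `RFECV`. A likely reason is that `RFECV` ranks features via `coef_` or `feature_importances_`, which an `SVC` with a non-linear kernel does not expose, whereas sequential selection only needs cross-validated scores. The sketch below runs backward selection on synthetic data; the data, `n_features_to_select=2`, and `n_splits=3` are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import TimeSeriesSplit
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
# Only the first column carries signal in this toy setup
y_demo = (X_demo[:, 0] + 0.1 * rng.normal(size=120) > 0).astype(int)

# Start from all features and drop the least useful one at each step
selector_demo = SequentialFeatureSelector(
    SVC(kernel="rbf"),
    n_features_to_select=2,
    direction="backward",
    cv=TimeSeriesSplit(n_splits=3),
)
selector_demo.fit(X_demo, y_demo)

# Boolean mask of the retained columns
print(selector_demo.get_support())
```

Backward selection refits the model once per candidate feature at every step, so its cost grows quickly with the number of features.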

(Your results may vary due to the stochastic nature of the process.)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_svm])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
svm_final = SVC(shrinking=True,
                probability=True,
                cache_size=1000,
                class_weight='balanced',
                decision_function_shape ='ovo',
                **best_hyperparameters)
# Fit the model
svm_final.fit(X[best_features_svm], y)

In [None]:
# Predict classes
y_pred = svm_final.predict(X[best_features_svm])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Support Vector Machine (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = svm_final.predict_proba(X[best_features_svm])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Support Vector Machine (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_2.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract monthly returns on the same sample for model and benchmark
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Support Vector Machine Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Support Vector Machine Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Support Vector Machine Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Support Vector Machine Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Random Forest

In [None]:
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_jobs=-1,
                            class_weight='balanced')

In [None]:
# Define the hyperparameter grid you want to search over
param_dist = {
    'n_estimators': Integer(100, 1000),  # Number of trees in the forest
    'criterion': Categorical(['gini', 'entropy']),
    'max_depth': Integer(1, 20),  # Maximum number of levels in each decision tree
    'min_samples_split': Integer(2, 10),  # Minimum number of data points placed in a node before the node is split
    'min_samples_leaf': Integer(1, 10),  # Minimum number of data points allowed in a leaf node
    'min_weight_fraction_leaf': Real(0.05, 0.2),  # Minimum weighted fraction of the total population required to be at a leaf node
    'max_features': Categorical(['sqrt', 'log2']),  # Number of features to consider at every split
    'max_samples': Real(0.01, 1.0)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_2['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(rf, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=rf,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_rf = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_rf)

(Your results may vary due to the stochastic nature of the process.)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_rf])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Initialize the Random Forest Classifier
rf_final = RandomForestClassifier(n_jobs=-1,
                                  class_weight='balanced',
                                  **best_hyperparameters)
# Fit the model
rf_final.fit(X[best_features_rf], y)

In [None]:
# Predict classes
y_pred = rf_final.predict(X[best_features_rf])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Random Forest (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = rf_final.predict_proba(X[best_features_rf])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Random Forest (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_2.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract monthly returns on the same sample for model and benchmark
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Random Forest Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Random Forest Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Random Forest Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Random Forest Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Multi-layer Perceptron

In [None]:
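# Initialize the Multi-layer Perceptron Classifier
# (note: scikit-learn uses learning_rate only with solver='sgd' and
#  early_stopping only with solver='sgd' or 'adam')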
mlp = MLPClassifier(solver = 'lbfgs',
                    learning_rate = 'adaptive',
                    max_iter = 100000,
                    early_stopping = True)

In [None]:
# Parameter distributions
param_dist = {
    'clf__layer_size': Integer(1, int(n_components_99/2)),
    'clf__num_layers': Integer(1, 2),
    'clf__alpha': Real(1e-6, 1e-1, prior='log-uniform'),
    'clf__activation': Categorical(['identity', 'logistic', 'tanh', 'relu'])
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Wrapper Pipeline
pipe = Pipeline([
    ('clf', MLPWrapper())
])

# Define data
X = df_pca_99_scaled
y = df_train_2['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Sequential backward feature selection
    selector = SequentialFeatureSelector(MLPWrapper(),
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         #scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=pipe,
                                 search_spaces=[(param_dist, 100)],
                                 cv=tscv_inner,
                                 #scoring=scorer,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_mlp = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_mlp)

(Your results may vary due to the stochastic nature of the process.)
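
Run-to-run differences of this kind typically stem from random weight initialisation and from the randomised parts of the search. A minimal sketch, assuming a small synthetic dataset rather than the notebook's features, of how fixing `random_state` makes an `MLPClassifier` fit repeatable; the names `X_demo`, `y_demo`, and `fit_proba` are illustrative only:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Small synthetic binary problem (illustrative data, not the notebook's features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

def fit_proba(seed):
    """Fit a small MLP with a fixed seed and return its class probabilities."""
    clf = MLPClassifier(solver='lbfgs', hidden_layer_sizes=(4,),
                        max_iter=500, random_state=seed)
    clf.fit(X_demo, y_demo)
    return clf.predict_proba(X_demo)

# Two fits with the same seed start from the same initial weights
proba_a = fit_proba(seed=42)
proba_b = fit_proba(seed=42)

print(np.allclose(proba_a, proba_b))
```

Passing a seed to every stochastic component would be needed for a fully repeatable run; this sketch only covers the classifier itself.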

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_mlp])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
hyperparameters_dict = dict(best_hyperparameters)

activation = hyperparameters_dict['clf__activation']
alpha = hyperparameters_dict['clf__alpha']
num_layers = hyperparameters_dict['clf__num_layers']
layer_size = hyperparameters_dict['clf__layer_size']

hidden_layer_sizes = tuple(layer_size for _ in range(num_layers))

# Convert the Index to a list
#best_features = best_features.tolist()

# Initialize the final Multi-layer Perceptron Classifier with the tuned hyperparameters
mlp_final = MLPClassifier(solver = 'lbfgs',
                          learning_rate = 'adaptive',
                          max_iter = 100000,
                          early_stopping = True,
                          activation=activation,
                          alpha=alpha,
                          hidden_layer_sizes=hidden_layer_sizes)
# Fit the model
mlp_final.fit(X[best_features_mlp], y)

In [None]:
# Predict classes
y_pred = mlp_final.predict(X[best_features_mlp])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Multi-layer Perceptron (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = mlp_final.predict_proba(X[best_features_mlp])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Multi-layer Perceptron (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_2.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the shift leaves the first row without weights)
df_merged = df_merged.dropna()

# Extract monthly returns, so model and benchmark cover the same sample
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Multi-layer Perceptron Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Multi-layer Perceptron Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()
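
The allocation rule used in the cell above can be traced on a toy example. The sketch below uses made-up numbers (not the notebook's data) and a hypothetical DataFrame `toy` to show how last month's probabilities become this month's weights, and why the first row is lost to the shift:

```python
import pandas as pd

# Toy data (made-up numbers): three months of returns and model probabilities
toy = pd.DataFrame({
    'Value Returns':    [0.01, 0.02, -0.01],
    'Momentum Returns': [0.03, -0.02, 0.04],
    'Probability_0':    [0.8, 0.3, 0.5],   # weight on Value
    'Probability_1':    [0.2, 0.7, 0.5],   # weight on Momentum
})

# Use last month's probabilities as this month's weights
lagged = toy[['Probability_0', 'Probability_1']].shift()
toy['Weighted Returns'] = (lagged['Probability_0'] * toy['Value Returns']
                           + lagged['Probability_1'] * toy['Momentum Returns'])

# The first row has no lagged weights, so it is dropped before any statistics
toy = toy.dropna()

# Monthly Sharpe ratio (mean over standard deviation, no risk-free rate)
sharpe = toy['Weighted Returns'].mean() / toy['Weighted Returns'].std()
print(toy['Weighted Returns'].tolist(), sharpe)
```

Because the weights are lagged, the probabilities of the final month are never used, and the return of the first month is never allocated.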

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the densities of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Multi-layer Perceptron Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Multi-layer Perceptron Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1 = benchmark_monthly_returns  # Benchmark
sample2 = model_monthly_returns  # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

## Second Testing

In [None]:
df_test_2

In [None]:
X_test = df_test_2.drop("Regime", axis=1)

# Perform scaling on the test set using the same scaler
X_test_scaled = scaler.transform(X_test)

# Perform PCA transformation on the scaled test set using the same PCA instance
X_test_pca = pca.transform(X_test_scaled)

# Cap the data
X_test_pca = X_test_pca[:, :n_components_99]

# Scale the PCA data using the scaler fitted on the training set
df_test_pca = pca_scaler.transform(X_test_pca)

# Create a new DataFrame with the transformed test set
df_test_pca_99 = pd.DataFrame(df_test_pca, index=X_test.index)

# Rename the columns to indicate the component number
df_test_pca_99.columns = [f"PC {i+1}" for i in range(X_test_pca.shape[1])]

# Print the transformed test set DataFrame
df_test_pca_99
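
The preprocessing above reuses the scaler and PCA objects fitted on the training set, so that no test-set information enters the transformation. A minimal sketch of that pattern on random placeholder data (`train_demo`, `test_demo`, `scaler_demo`, `pca_demo`, and `n_keep` are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_demo = rng.normal(size=(50, 4))   # placeholder training features
test_demo = rng.normal(size=(10, 4))    # placeholder test features

# Fit both transformers on the training data only
scaler_demo = StandardScaler().fit(train_demo)
pca_demo = PCA().fit(scaler_demo.transform(train_demo))

# Reuse the fitted objects on the test data; nothing is re-estimated here
n_keep = 2  # illustrative number of retained components
test_pcs = pca_demo.transform(scaler_demo.transform(test_demo))[:, :n_keep]

print(test_pcs.shape)
```

The scaler's stored mean equals the training mean, which is why scaled test features are generally not centred exactly at zero.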

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_test_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Test Features")

# Show the plot
plt.show()

In [None]:
X = df_test_pca_99

# Predict classes
y_log_reg = log_reg_final.predict(X[best_features_log_reg])
y_svm = svm_final.predict(X[best_features_svm])
y_rf = rf_final.predict(X[best_features_rf])
y_mlp = mlp_final.predict(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for each model
colors = ['black', 'red', 'green', 'blue']

# Plot the data on each subplot with distinct colors
axs[0].plot(df_test_pca_99.index, y_log_reg, color=colors[0], label='Logistic Regression')
axs[1].plot(df_test_pca_99.index, y_svm, color=colors[1], label='SVM')
axs[2].plot(df_test_pca_99.index, y_rf, color=colors[2], label='Random Forest')
axs[3].plot(df_test_pca_99.index, y_mlp, color=colors[3], label='MLP')

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Regime by Models (Test)")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Get the predicted probabilities for each model
y_log_reg_prob = log_reg_final.predict_proba(X[best_features_log_reg])
y_svm_prob = svm_final.predict_proba(X[best_features_svm])
y_rf_prob = rf_final.predict_proba(X[best_features_rf])
y_mlp_prob = mlp_final.predict_proba(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for the two regimes (Value, Momentum)
colors = ['blue', 'black']

# Plot the data on each subplot with distinct colors
axs[0].stackplot(df_test_pca_99.index, y_log_reg_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[1].stackplot(df_test_pca_99.index, y_svm_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[2].stackplot(df_test_pca_99.index, y_rf_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[3].stackplot(df_test_pca_99.index, y_mlp_prob.T, colors=colors, labels=['Value', 'Momentum'])

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Probabilities by Models (Test)")

# Add legend to the plot
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels)

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
y = df_test_2["Regime"]

# Calculate confusion matrices for each model
cm1 = confusion_matrix(y, y_log_reg, normalize='all')
cm2 = confusion_matrix(y, y_svm, normalize='all')
cm3 = confusion_matrix(y, y_rf, normalize='all')
cm4 = confusion_matrix(y, y_mlp, normalize='all')

# Set figure size and dpi
fig, axs = plt.subplots(2, 2, figsize=(11, 11), dpi=100)

# Generate confusion matrix and metrics for each model
cms = [cm1, cm2, cm3, cm4]
models = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']
metrics = []

for i, cm in enumerate(cms):
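    # Calculate metrics: accuracy and the negative log of per-class recall
    # (this is not a log loss; the probability-based log loss is computed in the next cell)
    accuracy = np.diag(cm).sum() / cm.sum()
    neg_log_recall = -np.log(np.diag(cm) / np.sum(cm, axis=1))

    # Store metrics
    metrics.append((accuracy, neg_log_recall))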
    
    # Plot confusion matrix
    ax = axs[i // 2, i % 2]
    sns.heatmap(cm, annot=True, cmap='coolwarm', ax=ax)
    ax.set_title(f'{models[i]}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Ground-truth regimes for the test set
y_true = df_test_2["Regime"]

# Calculate accuracy scores
log_reg_acc = accuracy_score(y_true, y_log_reg)
svm_acc = accuracy_score(y_true, y_svm)
rf_acc = accuracy_score(y_true, y_rf)
mlp_acc = accuracy_score(y_true, y_mlp)

# Calculate log loss scores
log_reg_loss = log_loss(y_true, y_log_reg_prob)
svm_loss = log_loss(y_true, y_svm_prob)
rf_loss = log_loss(y_true, y_rf_prob)
mlp_loss = log_loss(y_true, y_mlp_prob)

# Create a dataframe to display
df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'],
    'Accuracy': [log_reg_acc, svm_acc, rf_acc, mlp_acc],
    'Log Loss': [log_reg_loss, svm_loss, rf_loss, mlp_loss]
})

display(df)
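
For reference, the two metrics in the table can be computed by hand from predicted probabilities. The sketch below, using made-up labels and probabilities, assumes the columns are ordered as P(class 0), P(class 1), which is how `predict_proba` orders them for labels 0 and 1:

```python
import numpy as np

# Toy labels and predicted probabilities (made-up numbers)
y_demo = np.array([0, 1, 1])
proba_demo = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.4, 0.6]])

# Probability assigned to the class that actually occurred
p_true = proba_demo[np.arange(len(y_demo)), y_demo]

# Log loss: mean negative log-probability of the true class
manual_log_loss = -np.mean(np.log(p_true))

# Accuracy: share of observations whose most likely class is the true one
manual_accuracy = np.mean(proba_demo.argmax(axis=1) == y_demo)

print(manual_log_loss, manual_accuracy)
```

Accuracy only checks which side of 0.5 a prediction falls on, whereas log loss also penalises correct but unconfident predictions, so the two rankings of models need not agree.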

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

df_log_reg_prob = pd.DataFrame(y_log_reg_prob, index=df_test_2.index, columns=['Probability_0', 'Probability_1'])
df_svm_prob = pd.DataFrame(y_svm_prob, index=df_test_2.index, columns=['Probability_0', 'Probability_1'])
df_rf_prob = pd.DataFrame(y_rf_prob, index=df_test_2.index, columns=['Probability_0', 'Probability_1'])
df_mlp_prob = pd.DataFrame(y_mlp_prob, index=df_test_2.index, columns=['Probability_0', 'Probability_1'])

# Repeat the process for each model
for df_prob, model_name in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                               ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_2.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']
    
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']
    
    # Compute Sharpe ratios
    model_sharpe_ratio = df_merged['Weighted Returns'].mean() / df_merged['Weighted Returns'].std()

    # Print Sharpe ratios
    print(f"{model_name} Sharpe Ratio:", model_sharpe_ratio)

    # Calculate the cumulative returns
    df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
    df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

    # Plot the cumulative returns
    plt.plot(df_merged['Cumulative Weighted Returns'], label=model_name)

# Plot the 50/50 benchmark
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Model Comparison (Test)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

benchmark_sharpe_ratio = df_merged['Simple Weighted Returns'].mean() / df_merged['Simple Weighted Returns'].std()
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 12), dpi=100)

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Repeat the process for each model
for df_prob, model_name, ax in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                                   ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'], 
                                   axes):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_2.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']

    # The benchmark needs no shift; dropping NaN rows below aligns it with the model returns

    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Plot the densities of monthly returns for the model and the benchmark
    ax.hist(df_merged['Weighted Returns'], bins=100, alpha=0.5, label=model_name, color='red')
    ax.hist(df_merged['Simple Weighted Returns'], bins=100, alpha=0.5, label='50/50 Benchmark', color='black')

    # Add a vertical line at zero
    ax.axvline(0, color='red', linestyle='--')
    
    ax.legend()
    ax.set_xlabel('monthly returns')
    ax.set_ylabel('frequency')
    ax.set_title(f'Return Histogram of {model_name} (Test)')

plt.tight_layout()
plt.show()

In [None]:
# Prepare lists to store model returns and model names
model_returns = []
model_names = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']

# Get returns for each model
for df_prob in [df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob]:
    # Merge the DataFrames based on their indexes
    df_merged = df_test_2.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Append model returns to the list
    model_returns.append(df_merged['Weighted Returns'])
    
# Compute the 50/50 benchmark over the same NaN-free sample
benchmark_returns = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']
    
# Run bootstrap test
multi_sharpe_ratio_bootstrap_test(benchmark_returns, model_returns, model_names)

## Second further Feature Discarding

In [None]:
lean_sample_8 = lean_sample_7.drop('10-2Y Yield Spreads', axis=1)

display(lean_sample_8)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_8' DataFrame
msno.bar(lean_sample_8, sort='descending', color = 'black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label
plt.ylabel("Missing Values")

# Set plot title
plt.title("Missing Data by Feature")

# Show the plot
plt.show()

#### 30Y Mortgage Rates

In [None]:
# Drop any rows that have missing values
df_complete_8 = lean_sample_8.dropna()

# Display the smaller DataFrame
display(df_complete_8)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

plt.plot(df_complete_8["30Y Mortage Rates"], color = 'black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("%")

# Set plot title
plt.title("30Y Mortgage Rates")

# Show the plot
plt.show()

In [None]:
# Select the DataFrame and the column to analyze
df = df_complete_8
column = '30Y Mortage Rates'

plot_r_squared(df, column)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["30Y Mortage Rates", '10Y Treasury Yields']

# Plotting the columns
plt.plot(df_complete_8[columns_to_plot])
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('Yield')

# Display the plot
plt.show()

In [None]:
# Define time series
ts1 = df_complete_8["30Y Mortage Rates"]  # Time series 1
ts2 = df_complete_8["10Y Treasury Yields"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

In [None]:
lean_sample_9 = lean_sample_8.drop('30Y Mortage Rates', axis=1)

display(lean_sample_9)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_9' DataFrame
msno.bar(lean_sample_9, sort='descending', color = 'black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label
plt.ylabel("Missing Values")

# Set plot title
plt.title("Missing Data by Feature")

# Show the plot
plt.show()

#### Chicago Fed National Financial Stress

In [None]:
# Drop any rows that have missing values
df_complete_9 = lean_sample_9.dropna()

# Display the smaller DataFrame
display(df_complete_9)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi = 100)

plt.plot(df_complete_9["Chicago Fed National Financial Stress"], color = 'black')

# Add a horizontal line at zero
plt.axhline(0, color='red', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("stress")

# Set plot title
plt.title("Chicago Fed National Financial Stress")

# Show the plot
plt.show()

In [None]:
# Select the DataFrame and the column to analyze
df = df_complete_9
column = 'Chicago Fed National Financial Stress'

plot_r_squared(df, column)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["Chicago Fed National Financial Stress", 'Effective Federal Funds Rate']

# Plotting the columns
plt.plot(df_complete_9[columns_to_plot])

# Show a legend
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('')
plt.title('Chicago Fed National Financial Stress vs. Effective Federal Funds Rate')

# Display the plot
plt.show()

In [None]:
# Define time series
ts1 = df_complete_9["Chicago Fed National Financial Stress"]  # Time series 1
ts2 = df_complete_9["Effective Federal Funds Rate"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

## Third Modeling

In [None]:
# Sort the DataFrame by the time-related column
df_sorted = df_complete_9.sort_values('Date')

# Calculate the index to split the data
split_index = int(0.7 * len(df_sorted))

# Split the data into training and test sets
df_train_3 = df_sorted[:split_index]
df_test_3 = df_sorted[split_index:]

#Display
df_train_3

### Feature Space

In [None]:
# Drop the target variable
X = df_train_3.drop('Regime', axis=1)

# Initialize an empty matrix for R^2 values
r2_matrix = np.zeros((len(X.columns), len(X.columns)))

# Calculate R^2 values for each pair of variables
for i, col1 in enumerate(X.columns):
    for j, col2 in enumerate(X.columns):
        if i != j:
            # Fit a linear regression model
            lr = LinearRegression()
            lr.fit(X[col1].values.reshape(-1, 1), X[col2])

            # Calculate R^2 score
            r2 = lr.score(X[col1].values.reshape(-1, 1), X[col2])
            r2_matrix[i, j] = r2

# Perform hierarchical clustering
linkage_matrix = hierarchy.linkage(r2_matrix, method='complete')

# Obtain the order of rows and columns based on the dendrogram
order = hierarchy.dendrogram(linkage_matrix, no_plot=True)['leaves']

# Sort the R^2 matrix based on the order
sorted_r2_matrix = pd.DataFrame(r2_matrix[order, :][:, order], index=X.columns[order], columns=X.columns[order])

# Plot the sorted R^2 matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(sorted_r2_matrix, annot=False, cmap='coolwarm', cbar=True)
plt.title('Sorted R^2 Matrix with Clusters')
plt.show()

In [None]:
# Create an instance of StandardScaler and perform scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create an instance of PCA and perform PCA transformation
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Create a new DataFrame with the date index and components
df_pca = pd.DataFrame(X_pca, index=X.index)

# Rename the columns to indicate the component number
df_pca.columns = [f"PC {i+1}" for i in range(X_pca.shape[1])]

# Print the new DataFrame
df_pca

In [None]:
# Plot the cumulative explained variance ratio
explained_variance_ratio_cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(11, 6), dpi=100)
plt.plot(range(1, len(explained_variance_ratio_cumulative) + 1), explained_variance_ratio_cumulative, color = 'black')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance')
plt.grid(True)

# Determine the number of components explaining at least 99% of variance
n_components_99 = np.argmax(explained_variance_ratio_cumulative >= 0.99) + 1
print("Number of components explaining at least 99% of variance:", n_components_99)

# Add a vertical line at the number of components where 99% is reached
plt.axvline(x=n_components_99, color='red', linestyle='--')

plt.show()
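
The component count above relies on `np.argmax` returning the index of the first `True` in a boolean array. A small sketch with made-up explained-variance ratios (the arrays `ratios`, `cumulative`, and `reached` are illustrative):

```python
import numpy as np

# Made-up explained-variance ratios for five components (they sum to 1)
ratios = np.array([0.60, 0.25, 0.10, 0.045, 0.005])
cumulative = np.cumsum(ratios)

# The comparison yields a boolean array; argmax returns the index of the first True
reached = cumulative >= 0.99

# Add one to convert the zero-based index into a component count
n_keep = int(np.argmax(reached)) + 1

print(cumulative, n_keep)
```

Note that `np.argmax` returns 0 when no entry is `True`, so the pattern is only safe when the threshold is actually reached, as it is for a full PCA whose cumulative ratio ends at 1.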

In [None]:
# Keep only the principal components explaining at least 99% of variance
df_pca_99 = df_pca.iloc[:, :n_components_99]

#Display
df_pca_99

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("PCA Transformed Features")

# Show the plot
plt.show()

In [None]:
# Initialize the StandardScaler
pca_scaler = StandardScaler()

# Fit the scaler to the data and transform the data
df_pca_99_scaled = pca_scaler.fit_transform(df_pca_99)

# Convert the scaled data back to a DataFrame
df_pca_99_scaled = pd.DataFrame(df_pca_99_scaled, index=df_pca_99.index, columns=df_pca_99.columns)

df_pca_99_scaled

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99_scaled)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Features")

# Show the plot
plt.show()

### Modeling

#### Logistic Regression (baseline)

In [None]:
log_reg = LogisticRegression(penalty = 'elasticnet',
                             class_weight = 'balanced',
                             solver = 'saga',
                             l1_ratio=0.5,
                             max_iter =100000,
                             n_jobs=-1)

In [None]:
# Define the hyperparameter search space
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'l1_ratio': Real(0, 1)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_3['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(log_reg, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=log_reg,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_log_reg = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_log_reg)
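
The nested loop above depends on the splitters `tscv_outer` and `tscv_inner` defined earlier in the notebook. Assuming they are `TimeSeriesSplit` objects (the class is imported at the top), the following sketch shows the kind of expanding-window folds such a splitter produces; `tscv_demo` and its `n_splits=3` are hypothetical settings chosen for illustration:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight time-ordered observations (placeholder data)
X_demo = np.arange(8).reshape(-1, 1)

# Hypothetical splitter; the notebook's tscv_outer may use different settings
tscv_demo = TimeSeriesSplit(n_splits=3)

# Collect the index lists of every fold
folds = [(train.tolist(), test.tolist()) for train, test in tscv_demo.split(X_demo)]

for train, test in folds:
    # Each training window ends before its test window begins
    print("train:", train, "test:", test)
```

Since every training window precedes its test window, hyperparameters and features are always evaluated on data that lies in the future relative to the fit.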

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_log_reg])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
log_reg_final = LogisticRegression(penalty='elasticnet',
                                   class_weight='balanced',
                                   solver='saga',
                                   max_iter=100000,
                                   n_jobs=-1,
                                   **best_hyperparameters)
# Fit the model
log_reg_final.fit(X[best_features_log_reg], y)

In [None]:
# Feature coefficients
print(log_reg_final.coef_)

# Intercept
print(log_reg_final.intercept_)

In [None]:
proportion = df_train_3['Regime'].value_counts(normalize=True)
print(proportion)

In [None]:
# Predict classes
y_pred = log_reg_final.predict(X[best_features_log_reg])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Logistic Regression (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = log_reg_final.predict_proba(X[best_features_log_reg])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Logistic Regression (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_3.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series of the model and the benchmark
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Logistic Regression Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the initial portfolio value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Logistic Regression Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()
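
The `shift()` in the cell above means that the weights applied to a month's returns come from the previous row, so the first row has no weights and becomes NaN. A toy example with made-up numbers (the column names follow the notebook, the values are illustrative):

```python
import pandas as pd

# Toy data: three months of returns and model probabilities (illustrative values)
df = pd.DataFrame({
    "Value Returns":    [0.02, -0.01, 0.03],
    "Momentum Returns": [0.00,  0.04, -0.02],
    "Probability_0":    [0.75, 0.25, 0.50],
    "Probability_1":    [0.25, 0.75, 0.50],
})

# Weights from row t-1 are applied to the returns realised in row t
w = df[["Probability_0", "Probability_1"]].shift()
df["Weighted Returns"] = (w["Probability_0"] * df["Value Returns"]
                          + w["Probability_1"] * df["Momentum Returns"])

print(df["Weighted Returns"])
```

In the second row the portfolio holds 75% Value and 25% Momentum, the weights taken from the first row.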

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Logistic Regression Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Logistic Regression Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)
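
`sharpe_ratio_bootstrap_test` is defined earlier in the notebook. As a rough illustration of the idea behind such a test, and not a reproduction of that function, the sketch below resamples both return series jointly with a circular block bootstrap and reports a percentile interval for the Sharpe ratio difference. The name `sharpe_diff_bootstrap`, the block size, the number of resamples and the synthetic data are all assumptions.

```python
import numpy as np

def sharpe(r):
    """Per-period Sharpe ratio (no risk-free rate), sample standard deviation."""
    return r.mean() / r.std(ddof=1)

def sharpe_diff_bootstrap(bench, model, block_size=6, n_boot=2000, seed=0):
    """Hypothetical helper: circular block bootstrap of sharpe(model) - sharpe(bench)."""
    bench = np.asarray(bench, dtype=float)
    model = np.asarray(model, dtype=float)
    # Keep only periods where both series are observed
    mask = ~(np.isnan(bench) | np.isnan(model))
    bench, model = bench[mask], model[mask]
    n = len(bench)
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_size))
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        # Blocks wrap around the end of the sample (circular)
        idx = (starts[:, None] + np.arange(block_size)[None, :]).ravel() % n
        idx = idx[:n]
        # The same indices are used for both series to preserve their pairing
        diffs[b] = sharpe(model[idx]) - sharpe(bench[idx])
    observed = sharpe(model) - sharpe(bench)
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    return observed, (lo, hi)

# Synthetic monthly returns, for illustration only
rng = np.random.default_rng(1)
bench = rng.normal(0.005, 0.04, size=240)
model = bench + rng.normal(0.002, 0.01, size=240)

observed, (lo, hi) = sharpe_diff_bootstrap(bench, model)
print("Observed difference:", observed)
print("95% bootstrap interval:", (lo, hi))
```

Resampling in blocks rather than single observations keeps short-run serial dependence in the returns, which is the usual reason to prefer it over an i.i.d. bootstrap for time series.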

#### Support Vector Machine

In [None]:
# Initialize the Support Vector Classifier
svm = SVC(shrinking=True,
          probability=True,
          cache_size=1000,
          class_weight='balanced',
          decision_function_shape ='ovo')

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'kernel': Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
    'degree': Integer(1, 10),
    'gamma': Categorical(['scale', 'auto']),
    'coef0': Real(0, 1)
}
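
The `log-uniform` prior on `C` samples the exponent uniformly, so every order of magnitude between 0.01 and 100 is explored about equally often. A minimal NumPy sketch of the difference to a plain uniform prior (this only mimics the sampling idea and does not call scikit-optimize):

```python
import numpy as np

rng = np.random.default_rng(0)
low, high = 0.01, 100.0

# Log-uniform: draw the exponent uniformly, then exponentiate
log_uniform = 10 ** rng.uniform(np.log10(low), np.log10(high), size=10_000)

# Plain uniform draws over the same interval, for comparison
uniform = rng.uniform(low, high, size=10_000)

# Share of draws with C < 1
share_log = (log_uniform < 1.0).mean()   # roughly 0.5
share_uni = (uniform < 1.0).mean()       # roughly 0.01

print(share_log, share_uni)
```

With a uniform prior almost all candidates would have `C` above 1, leaving the strongly regularised region nearly unexplored.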

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_3['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Sequential backward feature selection
    selector = SequentialFeatureSelector(svm,
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=svm,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_svm = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_svm)
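
The outer and inner splitters (`tscv_outer`, `tscv_inner`) are defined earlier in the notebook. Assuming they are `TimeSeriesSplit` objects, each fold trains on an expanding window and validates on the block that follows it, so no fold is fitted on observations that come after its validation data. A small sketch on twelve dummy observations (`X_demo` and `n_splits=3` are illustrative choices):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve dummy observations in time order
X_demo = np.arange(12).reshape(-1, 1)

outer = TimeSeriesSplit(n_splits=3)
splits = [(train.tolist(), test.tolist()) for train, test in outer.split(X_demo)]

for train, test in splits:
    print("train:", train, "test:", test)
```

Because the training window grows from fold to fold, scores from different folds are based on different sample sizes, which is worth keeping in mind when comparing them.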

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_svm])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
svm_final = SVC(shrinking=True,
                probability=True,
                cache_size=1000,
                class_weight='balanced',
                decision_function_shape ='ovo',
                **best_hyperparameters)
# Fit the model
svm_final.fit(X[best_features_svm], y)

In [None]:
# Predict classes
y_pred = svm_final.predict(X[best_features_svm])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Support Vector Machine (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = svm_final.predict_proba(X[best_features_svm])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Support Vector Machine (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_3.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series of the model and the benchmark
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Support Vector Machine Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the initial portfolio value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Support Vector Machine Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()
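
The Sharpe ratios printed in the cell above are per-month figures with no risk-free rate subtracted. If an annualised number is wanted for comparison with other sources, a common convention is to multiply by the square root of 12, which assumes the monthly returns are roughly serially uncorrelated. A sketch on made-up returns:

```python
import numpy as np

# Made-up monthly returns, for illustration only
monthly = np.array([0.01, -0.02, 0.03, 0.015, -0.005, 0.02])

# Monthly Sharpe ratio with the sample standard deviation (pandas' default)
sharpe_monthly = monthly.mean() / monthly.std(ddof=1)

# Annualised under the assumption of serially uncorrelated returns
sharpe_annual = sharpe_monthly * np.sqrt(12)

print(sharpe_monthly, sharpe_annual)
```

The scaling changes the level of both the model and the benchmark figure by the same factor, so their ranking is unaffected.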

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Support Vector Machine Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Support Vector Machine Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Random Forest

In [None]:
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_jobs=-1,
                            class_weight='balanced')

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'n_estimators': Integer(100, 1000),  # Number of trees in the forest
    'criterion': Categorical(['gini', 'entropy']),
    'max_depth': Integer(1, 20),  # Maximum number of levels in each decision tree
    'min_samples_split': Integer(2, 10),  # Minimum number of data points placed in a node before the node is split
    'min_samples_leaf': Integer(1, 10),  # Minimum number of data points allowed in a leaf node
    'min_weight_fraction_leaf': Real(0.05, 0.2),  # Minimum weighted fraction of the total population required to be at a leaf node
    'max_features': Categorical(['sqrt', 'log2']),  # Number of features to consider at every split
    'max_samples': Real(0.01, 1.0)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_3['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(rf, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=rf,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_rf = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_rf)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_rf])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Initialize the Random Forest Classifier
rf_final = RandomForestClassifier(n_jobs=-1,
                                  class_weight='balanced',
                                  **best_hyperparameters)
# Fit the model
rf_final.fit(X[best_features_rf], y)

In [None]:
# Predict classes
y_pred = rf_final.predict(X[best_features_rf])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Random Forest (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = rf_final.predict_proba(X[best_features_rf])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Random Forest (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_3.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series of the model and the benchmark
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Random Forest Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the initial portfolio value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Random Forest Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Random Forest Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Random Forest Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Multi-layer Perceptron

In [None]:
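# Initialize the Multi-layer Perceptron Classifier
# Note: with solver='lbfgs', scikit-learn ignores learning_rate (used only by 'sgd')
# and early_stopping (effective only for 'sgd' or 'adam')
mlp = MLPClassifier(solver='lbfgs',
                    learning_rate='adaptive',
                    max_iter=100000,
                    early_stopping=True)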

In [None]:
# Parameter distributions
param_dist = {
    'clf__layer_size': Integer(1, int(n_components_99/2)),
    'clf__num_layers': Integer(1, 2),
    'clf__alpha': Real(1e-6, 1e-1, prior='log-uniform'),
    'clf__activation': Categorical(['identity', 'logistic', 'tanh', 'relu'])
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Wrapper Pipeline
pipe = Pipeline([
    ('clf', MLPWrapper())
])

# Define data
X = df_pca_99_scaled
y = df_train_3['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Sequential backward feature selection
    selector = SequentialFeatureSelector(MLPWrapper(),
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         #scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=pipe,
                                 search_spaces=[(param_dist, 100)],
                                 cv=tscv_inner,
                                 #scoring=scorer,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_mlp = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_mlp)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_mlp])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
hyperparameters_dict = dict(best_hyperparameters)

activation = hyperparameters_dict['clf__activation']
alpha = hyperparameters_dict['clf__alpha']
num_layers = hyperparameters_dict['clf__num_layers']
layer_size = hyperparameters_dict['clf__layer_size']

hidden_layer_sizes = tuple(layer_size for _ in range(num_layers))

# Convert the Index to a list
#best_features = best_features.tolist()

# Initialize the Multi-layer Perceptron Classifier with the tuned hyperparameters
mlp_final = MLPClassifier(solver = 'lbfgs',
                          learning_rate = 'adaptive',
                          max_iter = 100000,
                          early_stopping = True,
                          activation=activation,
                          alpha=alpha,
                          hidden_layer_sizes=hidden_layer_sizes)
# Fit the model
mlp_final.fit(X[best_features_mlp], y)

In [None]:
# Predict classes
y_pred = mlp_final.predict(X[best_features_mlp])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Multi-layer Perceptron (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = mlp_final.predict_proba(X[best_features_mlp])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Multi-layer Perceptron (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_3.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series of the model and the benchmark
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Multi-layer Perceptron Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the initial portfolio value)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Multi-layer Perceptron Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Multi-layer Perceptron Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Multi-layer Perceptron Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

## Third Testing

In [None]:
df_test_3

In [None]:
X_test = df_test_3.drop("Regime", axis=1)

# Perform scaling on the test set using the same scaler
X_test_scaled = scaler.transform(X_test)

# Perform PCA transformation on the scaled test set using the same PCA instance
X_test_pca = pca.transform(X_test_scaled)

# Keep only the first n_components_99 principal components
X_test_pca = X_test_pca[:, :n_components_99]

# Scale the PCA data with the previously fitted scaler (transform only, no refit)
df_test_pca = pca_scaler.transform(X_test_pca)

# Create a new DataFrame with the transformed test set
df_test_pca_99 = pd.DataFrame(df_test_pca, index=X_test.index)

# Rename the columns to indicate the component number
df_test_pca_99.columns = [f"PC {i+1}" for i in range(X_test_pca.shape[1])]

# Display the transformed test set DataFrame
df_test_pca_99
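
The test features above are only passed through `transform`, never `fit`, so every statistic comes from the data the transformers were fitted on. The NumPy sketch below shows the same principle for plain standardisation on synthetic data (the arrays and their sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
test = rng.normal(loc=5.0, scale=2.0, size=(50, 3))

# Statistics are estimated on the training window only
mu = train.mean(axis=0)
sigma = train.std(axis=0)

train_scaled = (train - mu) / sigma
test_scaled = (test - mu) / sigma   # reuses the training statistics, no refit

print(train_scaled.mean(axis=0), test_scaled.mean(axis=0))
```

The scaled test columns are close to, but not exactly, zero mean and unit variance. That is expected: refitting on the test window would leak information about the evaluation period into the features.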

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_test_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Test Features")

# Show the plot
plt.show()

In [None]:
X = df_test_pca_99

# Predict classes
y_log_reg = log_reg_final.predict(X[best_features_log_reg])
y_svm = svm_final.predict(X[best_features_svm])
y_rf = rf_final.predict(X[best_features_rf])
y_mlp = mlp_final.predict(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for each model
colors = ['black', 'red', 'green', 'blue']

# Plot the data on each subplot with distinct colors
axs[0].plot(df_test_pca_99.index, y_log_reg, color=colors[0], label='Logistic Regression')
axs[1].plot(df_test_pca_99.index, y_svm, color=colors[1], label='SVM')
axs[2].plot(df_test_pca_99.index, y_rf, color=colors[2], label='Random Forest')
axs[3].plot(df_test_pca_99.index, y_mlp, color=colors[3], label='MLP')

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Regime by Models (Test)")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Get the predicted probabilities for each model
y_log_reg_prob = log_reg_final.predict_proba(X[best_features_log_reg])
y_svm_prob = svm_final.predict_proba(X[best_features_svm])
y_rf_prob = rf_final.predict_proba(X[best_features_rf])
y_mlp_prob = mlp_final.predict_proba(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for each regime (Value, Momentum)
colors = ['blue', 'black']

# Plot the data on each subplot with distinct colors
axs[0].stackplot(df_test_pca_99.index, y_log_reg_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[1].stackplot(df_test_pca_99.index, y_svm_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[2].stackplot(df_test_pca_99.index, y_rf_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[3].stackplot(df_test_pca_99.index, y_mlp_prob.T, colors=colors, labels=['Value', 'Momentum'])

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Probabilities by Models (Test)")

# Add legend to the plot
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels)

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
y = df_test_3["Regime"]

# Calculate confusion matrices for each model
cm1 = confusion_matrix(y, y_log_reg, normalize='all')
cm2 = confusion_matrix(y, y_svm, normalize='all')
cm3 = confusion_matrix(y, y_rf, normalize='all')
cm4 = confusion_matrix(y, y_mlp, normalize='all')

# Set figure size and dpi
fig, axs = plt.subplots(2, 2, figsize=(11, 11), dpi=100)

# Generate confusion matrix and metrics for each model
cms = [cm1, cm2, cm3, cm4]
models = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']
metrics = []

for i, cm in enumerate(cms):
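    # Calculate metrics from the normalized confusion matrix
    accuracy = np.diag(cm).sum() / cm.sum()
    # Negative log of per-class recall (not the probabilistic log loss from sklearn.metrics.log_loss)
    logloss = -np.log(np.diag(cm) / np.sum(cm, axis=1))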
    
    # Store metrics
    metrics.append((accuracy, logloss))
    
    # Plot confusion matrix
    ax = axs[i // 2, i % 2]
    sns.heatmap(cm, annot=True, cmap='coolwarm', ax=ax)
    ax.set_title(f'{models[i]}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()
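
The loop above stores `-log(diag / row sums)` under the name `logloss`. That quantity is the negative log of per-class recall, which is different from the probabilistic log loss returned by `sklearn.metrics.log_loss`. A minimal sketch with a made-up 2x2 matrix (the numbers are illustrative, not results from this notebook) shows what can be read off a confusion matrix normalized with `normalize='all'`:

```python
import numpy as np

# Hypothetical confusion matrix normalized over all samples (entries sum to 1).
# Rows are actual classes, columns are predicted classes.
cm_demo = np.array([[0.40, 0.10],
                    [0.20, 0.30]])

accuracy = np.trace(cm_demo)                         # share of correct predictions
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)      # per actual class (row-wise)
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)   # per predicted class (column-wise)

print(f"accuracy={accuracy:.2f}")
print("recall:", np.round(recall, 3))
print("precision:", np.round(precision, 3))
```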

In [None]:
# Ground-truth regimes for the test period
y_true = df_test_3["Regime"]

# Calculate accuracy scores
log_reg_acc = accuracy_score(y_true, y_log_reg)
svm_acc = accuracy_score(y_true, y_svm)
rf_acc = accuracy_score(y_true, y_rf)
mlp_acc = accuracy_score(y_true, y_mlp)

# Calculate log loss scores
log_reg_loss = log_loss(y_true, y_log_reg_prob)
svm_loss = log_loss(y_true, y_svm_prob)
rf_loss = log_loss(y_true, y_rf_prob)
mlp_loss = log_loss(y_true, y_mlp_prob)

# Create a dataframe to display
df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'],
    'Accuracy': [log_reg_acc, svm_acc, rf_acc, mlp_acc],
    'Log Loss': [log_reg_loss, svm_loss, rf_loss, mlp_loss]
})

display(df)
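
For reference, the log loss in the table is computed from predicted probabilities rather than from hard labels. The sketch below reproduces the formula by hand on made-up labels and probabilities (illustrative only), assuming the second column holds the probability of class 1, as in the output of `predict_proba`:

```python
import numpy as np

# Made-up labels and predicted probabilities for illustration
y_demo = np.array([0, 1, 1, 0])
proba_demo = np.array([[0.9, 0.1],
                       [0.2, 0.8],
                       [0.6, 0.4],
                       [0.7, 0.3]])

# Probability assigned to the class that actually occurred
p_true = proba_demo[np.arange(len(y_demo)), y_demo]

# Mean negative log-likelihood (binary cross-entropy)
manual_log_loss = -np.mean(np.log(p_true))
print(round(manual_log_loss, 4))
```

A confident wrong prediction (such as the third row) is penalized much more heavily than a hesitant one, which is why two models with similar accuracy can differ noticeably in log loss.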

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

df_log_reg_prob = pd.DataFrame(y_log_reg_prob, index=df_test_3.index, columns=['Probability_0', 'Probability_1'])
df_svm_prob = pd.DataFrame(y_svm_prob, index=df_test_3.index, columns=['Probability_0', 'Probability_1'])
df_rf_prob = pd.DataFrame(y_rf_prob, index=df_test_3.index, columns=['Probability_0', 'Probability_1'])
df_mlp_prob = pd.DataFrame(y_mlp_prob, index=df_test_3.index, columns=['Probability_0', 'Probability_1'])

# Repeat the process for each model
for df_prob, model_name in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                               ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_3.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']
    
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']
    
    # Compute Sharpe ratios
    model_sharpe_ratio = df_merged['Weighted Returns'].mean() / df_merged['Weighted Returns'].std()

    # Print Sharpe ratios
    print(f"{model_name} Sharpe Ratio:", model_sharpe_ratio)

    # Calculate the cumulative returns
    df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
    df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

    # Plot the cumulative returns
    plt.plot(df_merged['Cumulative Weighted Returns'], label=model_name)

# Plot the 50/50 benchmark
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative return index)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Model Comparison (Test)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

benchmark_sharpe_ratio = df_merged['Simple Weighted Returns'].mean() / df_merged['Simple Weighted Returns'].std()
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)
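
The merge, shift, and weighting steps recur in several cells. One way to factor them out is sketched below. `probability_weighted_returns` and `sharpe_ratio` are hypothetical helpers, not functions defined in this notebook; the annualization factor of 12 assumes monthly returns, and the zero risk-free rate is likewise an assumption. The toy data are made up.

```python
import numpy as np
import pandas as pd


def probability_weighted_returns(returns: pd.DataFrame, proba: pd.DataFrame) -> pd.Series:
    """Weight this period's returns by the previous period's class probabilities."""
    lagged = proba.shift(1)  # use last period's forecast to avoid look-ahead
    weighted = (lagged['Probability_0'] * returns['Value Returns']
                + lagged['Probability_1'] * returns['Momentum Returns'])
    return weighted.dropna()  # the first row has no lagged forecast


def sharpe_ratio(r: pd.Series, periods_per_year: int = 12) -> float:
    """Annualized Sharpe ratio with a zero risk-free rate (an assumption)."""
    return float(np.sqrt(periods_per_year) * r.mean() / r.std())


# Made-up month-end data for illustration
idx = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30'])
returns_demo = pd.DataFrame({'Value Returns': [0.01, -0.02, 0.03, 0.00],
                             'Momentum Returns': [0.02, 0.01, -0.01, 0.04]}, index=idx)
proba_demo = pd.DataFrame({'Probability_0': [0.5, 0.2, 0.9, 0.4],
                           'Probability_1': [0.5, 0.8, 0.1, 0.6]}, index=idx)

model_returns_demo = probability_weighted_returns(returns_demo, proba_demo)
print(model_returns_demo)
print("Annualized Sharpe:", sharpe_ratio(model_returns_demo))
```

The cells in this notebook report the unannualized ratio `mean / std`; multiplying by the square root of 12 only rescales it and does not change the ranking of the models.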

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 12), dpi=100)

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Repeat the process for each model
for df_prob, model_name, ax in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                                   ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'], 
                                   axes):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_3.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']

    # The benchmark needs no shift: its weights are constant, and dropna() below
    # removes the first row so both series cover the same months

    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Plot the densities of monthly returns for the model and the benchmark
    ax.hist(df_merged['Weighted Returns'], bins=100, alpha=0.5, label=model_name, color='red')
    ax.hist(df_merged['Simple Weighted Returns'], bins=100, alpha=0.5, label='50/50 Benchmark', color='black')

    # Add a vertical line at zero
    ax.axvline(0, color='red', linestyle='--')
    
    ax.legend()
    ax.set_xlabel('monthly returns')
    ax.set_ylabel('frequency')
    ax.set_title(f'Return Histogram of {model_name} (Test)')

plt.tight_layout()
plt.show()

In [None]:
# Prepare lists to store model returns and model names
model_returns = []
model_names = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']

# Get returns for each model
for df_prob in [df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob]:
    # Merge the DataFrames based on their indexes
    df_merged = df_test_3.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Append model returns to the list
    model_returns.append(df_merged['Weighted Returns'])
    
# Compute the 50/50 benchmark returns over the same sample
benchmark_returns = (0.5 * df_merged['Value Returns']
                     + 0.5 * df_merged['Momentum Returns'])

# Run bootstrap test
multi_sharpe_ratio_bootstrap_test(benchmark_returns, model_returns, model_names)
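
`multi_sharpe_ratio_bootstrap_test` is defined elsewhere in this notebook, so its details are not repeated here. As a rough illustration of the general idea, the sketch below draws circular block bootstrap samples with plain NumPy and looks at the resulting distribution of the Sharpe ratio difference. The helper names, the block length of 6, and the simulated returns are all assumptions made for this example, and the sketch is not claimed to match the notebook's own test.

```python
import numpy as np


def circular_block_indices(n, block_length, rng):
    """Draw n indices as concatenated blocks that wrap around the sample end."""
    n_blocks = int(np.ceil(n / block_length))
    starts = rng.integers(0, n, size=n_blocks)
    idx = (starts[:, None] + np.arange(block_length)[None, :]) % n
    return idx.ravel()[:n]


def sharpe(r):
    """Unannualized Sharpe ratio with a zero risk-free rate."""
    return r.mean() / r.std(ddof=1)


def bootstrap_sharpe_difference(model, benchmark, block_length=6, n_boot=2000, seed=0):
    """Bootstrap distribution of Sharpe(model) - Sharpe(benchmark).

    Both series are resampled with the same indices to keep their dependence.
    """
    rng = np.random.default_rng(seed)
    n = len(model)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = circular_block_indices(n, block_length, rng)
        diffs[b] = sharpe(model[idx]) - sharpe(benchmark[idx])
    return diffs


# Simulated monthly returns, purely for illustration
rng = np.random.default_rng(42)
benchmark_demo = rng.normal(0.005, 0.03, size=240)
model_demo = benchmark_demo + rng.normal(0.001, 0.01, size=240)

diffs = bootstrap_sharpe_difference(model_demo, benchmark_demo)
lower, upper = np.percentile(diffs, [2.5, 97.5])
print(f"95% interval for the Sharpe difference: [{lower:.3f}, {upper:.3f}]")
```

Resampling blocks rather than single observations preserves short-range autocorrelation in the returns, which is the usual motivation for a block bootstrap on time series.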

## Second further Feature Discarding

In [None]:
lean_sample_10 = lean_sample_9.drop('Chicago Fed National Financial Stress', axis=1)

display(lean_sample_10)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_10' DataFrame
msno.bar(lean_sample_10, sort='descending', color='black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label
plt.ylabel("Missing Values")

# Set plot title
plt.title("Missing Data by Feature")

# Show the plot
plt.show()

#### Sentiment

In [None]:
# Drop any rows that have missing values
df_complete_10 = lean_sample_10.dropna()

# Display the smaller DataFrame
display(df_complete_10)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

plt.plot(df_complete_10["Sentiment"], color='black')

# Add a horizontal line at zero
plt.axhline(0, color='red', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("")

# Set plot title
plt.title("Sentiment")

# Show the plot
plt.show()

In [None]:
# Select the DataFrame and the column to analyze
df = df_complete_10
column = 'Sentiment'

plot_r_squared(df, column)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["Sentiment", 'CAPE']

# Plotting the columns
plt.plot(df_complete_10[columns_to_plot])

# Show a legend
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('')
plt.title('')

# Display the plot
plt.show()

In [None]:
# Define time series
ts1 = df_complete_10["Sentiment"]  # Time series 1
ts2 = df_complete_10["CAPE"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

## Fourth Modeling

In [None]:
# Sort the DataFrame by the time-related column
df_sorted = df_complete_10.sort_values('Date')

# Calculate the index to split the data
split_index = int(0.7 * len(df_sorted))

# Split the data into training and test sets
df_train_4 = df_sorted[:split_index]
df_test_4 = df_sorted[split_index:]

# Display the training set
df_train_4

### Feature Space

In [None]:
# Drop the target variable
X = df_train_4.drop('Regime', axis=1)

# Initialize an empty matrix for R^2 values
r2_matrix = np.zeros((len(X.columns), len(X.columns)))

# Calculate R^2 values for each pair of variables
for i, col1 in enumerate(X.columns):
    for j, col2 in enumerate(X.columns):
        if i != j:
            # Fit a linear regression model
            lr = LinearRegression()
            lr.fit(X[col1].values.reshape(-1, 1), X[col2])

            # Calculate R^2 score
            r2 = lr.score(X[col1].values.reshape(-1, 1), X[col2])
            r2_matrix[i, j] = r2

# Perform hierarchical clustering
linkage_matrix = hierarchy.linkage(r2_matrix, method='complete')

# Obtain the order of rows and columns based on the dendrogram
order = hierarchy.dendrogram(linkage_matrix, no_plot=True)['leaves']

# Sort the R^2 matrix based on the order
sorted_r2_matrix = pd.DataFrame(r2_matrix[order, :][:, order], index=X.columns[order], columns=X.columns[order])

# Plot the sorted R^2 matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(sorted_r2_matrix, annot=False, cmap='coolwarm', cbar=True)
plt.title('Sorted R^2 Matrix with Clusters')
plt.show()
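
For a simple linear regression with one regressor and an intercept, R² equals the squared Pearson correlation, so the double loop above can be cross-checked against `X.corr() ** 2`. The sketch below verifies this on made-up data (the column names `a`, `b`, `c` are placeholders). Note that the shortcut puts ones on the diagonal, whereas the loop above leaves zeros there.

```python
import numpy as np
import pandas as pd

# Made-up data: 'b' depends on 'a', 'c' is independent noise
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({'a': rng.normal(size=200)})
X_demo['b'] = 0.7 * X_demo['a'] + rng.normal(scale=0.5, size=200)
X_demo['c'] = rng.normal(size=200)

# Vectorized shortcut: squared pairwise correlations
r2_fast = X_demo.corr() ** 2

# Explicit least-squares fit of b on a for comparison
a, b = X_demo['a'].to_numpy(), X_demo['b'].to_numpy()
slope, intercept = np.polyfit(a, b, 1)
residuals = b - (slope * a + intercept)
r2_ols = 1 - np.sum(residuals ** 2) / np.sum((b - b.mean()) ** 2)

print(r2_fast.round(3))
print(np.isclose(r2_fast.loc['a', 'b'], r2_ols))
```

The identity also implies that the R² matrix is symmetric, so only half of the pairwise regressions carry new information.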

In [None]:
# Create an instance of StandardScaler and perform scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create an instance of PCA and perform PCA transformation
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Create a new DataFrame with the date index and components
df_pca = pd.DataFrame(X_pca, index=X.index)

# Rename the columns to indicate the component number
df_pca.columns = [f"PC {i+1}" for i in range(X_pca.shape[1])]

# Print the new DataFrame
df_pca

In [None]:
# Plot the cumulative explained variance ratio
explained_variance_ratio_cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(11, 6), dpi=100)
plt.plot(range(1, len(explained_variance_ratio_cumulative) + 1), explained_variance_ratio_cumulative, color = 'black')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance')
plt.grid(True)

# Determine the number of components explaining at least 99% of variance
n_components_99 = np.argmax(explained_variance_ratio_cumulative >= 0.99) + 1
print("Number of components explaining at least 99% of variance:", n_components_99)

# Add a vertical line at the number of components where 99% is reached
plt.axvline(x=n_components_99, color='red', linestyle='--')

plt.show()
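
`np.argmax` on a boolean array returns the position of the first `True`, which is why adding one yields the component count. A toy illustration with a made-up cumulative variance vector:

```python
import numpy as np

# Made-up cumulative explained variance ratios
cumulative_demo = np.array([0.60, 0.85, 0.95, 0.992, 1.00])

# First position where the threshold is reached, converted to a 1-based count
n_components_demo = int(np.argmax(cumulative_demo >= 0.99)) + 1
print(n_components_demo)  # 4
```

One caveat: if no entry reaches the threshold, `np.argmax` returns 0 and the expression evaluates to 1, so the threshold should not exceed the final cumulative value.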

In [None]:
# Keep only the principal components explaining at least 99% of variance
df_pca_99 = df_pca.iloc[:, :n_components_99]

# Display the retained components
df_pca_99

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("PCA Transformed Features")

# Show the plot
plt.show()

In [None]:
# Initialize the StandardScaler
pca_scaler = StandardScaler()

# Fit the scaler to the data and transform the data
df_pca_99_scaled = pca_scaler.fit_transform(df_pca_99)

# Convert the scaled array back to a DataFrame with the original index and column names
df_pca_99_scaled = pd.DataFrame(df_pca_99_scaled, index=df_pca_99.index, columns=df_pca_99.columns)

df_pca_99_scaled

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99_scaled)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Features")

# Show the plot
plt.show()
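
The scaler, the PCA, and the second scaler above are all fitted on the training set. When the same feature space is later built for the test set, a common practice is to reuse these fitted objects with `transform` only, so that no test-period information enters the preprocessing. The sketch below shows that pattern on random data; the `_demo` names and the choice of three components are assumptions for illustration, and this section does not show how the notebook's own test-set cells are written.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random stand-ins for the training and test feature matrices
rng = np.random.default_rng(1)
X_train_demo = rng.normal(size=(100, 5))
X_test_demo = rng.normal(size=(30, 5))

# Fit every preprocessing step on the training data only
scaler_demo = StandardScaler().fit(X_train_demo)
pca_demo = PCA(n_components=3).fit(scaler_demo.transform(X_train_demo))
pc_scaler_demo = StandardScaler().fit(
    pca_demo.transform(scaler_demo.transform(X_train_demo)))

# Apply the fitted steps to the test data without refitting
X_test_pcs = pc_scaler_demo.transform(
    pca_demo.transform(scaler_demo.transform(X_test_demo)))
print(X_test_pcs.shape)  # (30, 3)
```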

### Modeling

#### Logistic Regression (baseline)

In [None]:
log_reg = LogisticRegression(penalty = 'elasticnet',
                             class_weight = 'balanced',
                             solver = 'saga',
                             l1_ratio=0.5,
                             max_iter =100000,
                             n_jobs=-1)

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'l1_ratio': Real(0, 1)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_4['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(log_reg, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=log_reg,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_log_reg = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_log_reg)
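
The loop above relies on `tscv_outer` and `tscv_inner`, which are defined earlier in the notebook. Assuming they are `TimeSeriesSplit` objects (the class imported at the top), each fold trains on an expanding window and validates on the block that follows it. The sketch below prints the folds for a toy series; `n_splits=3` is chosen for illustration only and need not match the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy series of 12 observations in chronological order
X_demo = np.arange(12).reshape(-1, 1)
tscv_demo = TimeSeriesSplit(n_splits=3)  # illustrative setting

splits = list(tscv_demo.split(X_demo))
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

In the loop above, the outer test indices are assigned to `_` and never used, so `best_score_` is the inner cross-validation score on each outer training window rather than a score on a held-out outer fold.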

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_log_reg])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
log_reg_final = LogisticRegression(penalty='elasticnet',
                                   class_weight='balanced',
                                   solver='saga',
                                   max_iter=100000,
                                   n_jobs=-1,
                                   **best_hyperparameters)
# Fit the model
log_reg_final.fit(X[best_features_log_reg], y)

In [None]:
# Feature Coefficients
print(log_reg_final.coef_)

# Intercept
print(log_reg_final.intercept_)
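
Because the inputs are standardized principal components, each coefficient can be read as the change in the log-odds of class 1 per one standard deviation of that component. Exponentiating gives odds ratios, which are often easier to interpret. The coefficients below are made up for illustration and are not output from `log_reg_final`.

```python
import numpy as np

# Hypothetical coefficient row, shaped like LogisticRegression.coef_ for a binary problem
coef_demo = np.array([[0.8, -0.3, 0.0]])

# Multiplicative change in the odds of class 1 per one-standard-deviation increase
odds_ratios = np.exp(coef_demo)
print(np.round(odds_ratios, 3))
```

A coefficient of exactly zero, which the L1 part of the elastic-net penalty can produce, corresponds to an odds ratio of one, meaning the component has no effect on the prediction.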

In [None]:
proportion = df_train_4['Regime'].value_counts(normalize=True)
print(proportion)

In [None]:
# Predict classes
y_pred = log_reg_final.predict(X[best_features_log_reg])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Logistic Regression (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = log_reg_final.predict_proba(X[best_features_log_reg])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Logistic Regression (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_4.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract monthly returns after dropping NaNs so both series cover the same sample
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Logistic Regression Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative return index)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Logistic Regression Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the densities of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Logistic Regression Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Logistic Regression Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Support Vector Machine

In [None]:
# Initialize the Support Vector Classifier
svm = SVC(shrinking=True,
          probability=True,
          cache_size=1000,
          class_weight='balanced',
          decision_function_shape ='ovo')

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'kernel': Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
    'degree': Integer(1, 10),
    'gamma': Categorical(['scale', 'auto']),
    'coef0': Real(0, 1)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_4['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Backward sequential feature selection
    selector = SequentialFeatureSelector(svm,
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=svm,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_svm = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_svm)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_svm])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
svm_final = SVC(shrinking=True,
                probability=True,
                cache_size=1000,
                class_weight='balanced',
                decision_function_shape ='ovo',
                **best_hyperparameters)
# Fit the model
svm_final.fit(X[best_features_svm], y)

In [None]:
# Predict classes
y_pred = svm_final.predict(X[best_features_svm])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Support Vector Machine (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = svm_final.predict_proba(X[best_features_svm])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Support Vector Machine (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_4.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract monthly returns after dropping NaNs so both series cover the same sample
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Support Vector Machine Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative return index)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Support Vector Machine Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the densities of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Support Vector Machine Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Support Vector Machine Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Random Forest

In [None]:
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_jobs=-1,
                            class_weight='balanced')

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'n_estimators': Integer(100, 1000),  # Number of trees in the forest
    'criterion': Categorical(['gini', 'entropy']),
    'max_depth': Integer(1, 20),  # Maximum number of levels in each decision tree
    'min_samples_split': Integer(2, 10),  # Minimum number of data points placed in a node before the node is split
    'min_samples_leaf': Integer(1, 10),  # Minimum number of data points allowed in a leaf node
    'min_weight_fraction_leaf': Real(0.05, 0.2),  # Minimum weighted fraction of the total population required to be at a leaf node
    'max_features': Categorical(['sqrt', 'log2']),  # Number of features to consider at every split
    'max_samples': Real(0.01, 1.0)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_4['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(rf, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=rf,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_rf = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_rf)
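
The nested loop above relies on time-ordered splits, so that every fold trains on the past and validates on the future. The following is a minimal sketch of that behaviour, assuming `tscv_outer` and `tscv_inner` are `TimeSeriesSplit` instances (their definitions are not shown in this section, and the split counts and toy data below are illustrative assumptions):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 12 time-ordered observations (illustrative only)
X_toy = np.arange(12).reshape(-1, 1)

# Assumed split counts; the notebook's tscv_outer / tscv_inner may differ
outer = TimeSeriesSplit(n_splits=3)
inner = TimeSeriesSplit(n_splits=2)

folds = []
for train_idx, test_idx in outer.split(X_toy):
    # Inner splits are drawn from the outer training window only
    inner_folds = [(tr.tolist(), va.tolist())
                   for tr, va in inner.split(X_toy[train_idx])]
    folds.append((train_idx.tolist(), test_idx.tolist(), inner_folds))
    print("outer train:", train_idx, "outer test:", test_idx)
```

Each outer training window grows over time, and the inner validation data always lies after the inner training data, which is what keeps future information out of the tuning step.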

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_rf])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Initialize the Random Forest Classifier
rf_final = RandomForestClassifier(n_jobs=-1,
                                  class_weight='balanced',
                                  **best_hyperparameters)
# Fit the model
rf_final.fit(X[best_features_rf], y)

In [None]:
# Predict classes
y_pred = rf_final.predict(X[best_features_rf])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Random Forest (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = rf_final.predict_proba(X[best_features_rf])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Random Forest (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()
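
With `normalize='all'`, every cell of the confusion matrix is a share of all observations, so the four cells sum to one and the diagonal sums to the accuracy. A small NumPy-only sketch on made-up labels (the arrays below are illustrative, not notebook data):

```python
import numpy as np

# Made-up labels: 0 = Value, 1 = Momentum
y_true_toy = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_hat_toy = np.array([0, 1, 1, 1, 0, 0, 1, 0])

# Count matrix: rows are actual classes, columns are predicted classes
counts = np.zeros((2, 2))
for actual, predicted in zip(y_true_toy, y_hat_toy):
    counts[actual, predicted] += 1

# Equivalent of normalize='all': divide by the total number of observations
cm_all = counts / counts.sum()

# With this normalization, accuracy is the sum of the diagonal
accuracy_from_cm = np.trace(cm_all)
print(cm_all)
print("accuracy:", accuracy_from_cm)
```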

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_4.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series from the cleaned sample
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Random Forest Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Random Forest Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()
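
The allocation rule in the cell above weights each strategy by the regime probability predicted one month earlier. A minimal sketch on made-up numbers, using the same column names as the notebook (the values are illustrative assumptions):

```python
import pandas as pd

# Toy monthly data (illustrative numbers, not from the notebook)
toy = pd.DataFrame({
    "Value Returns": [0.02, -0.01, 0.03, 0.00],
    "Momentum Returns": [0.01, 0.02, -0.02, 0.01],
    "Probability_0": [0.8, 0.3, 0.6, 0.5],  # predicted P(Value)
    "Probability_1": [0.2, 0.7, 0.4, 0.5],  # predicted P(Momentum)
})

# Weights for month t come from the prediction made in month t-1
w = toy[["Probability_0", "Probability_1"]].shift()
toy["Weighted Returns"] = (w["Probability_0"] * toy["Value Returns"]
                           + w["Probability_1"] * toy["Momentum Returns"])

# The first month has no prior prediction, so it is dropped
toy = toy.dropna()

# Per-period Sharpe ratio with a zero risk-free rate (not annualized)
sharpe = toy["Weighted Returns"].mean() / toy["Weighted Returns"].std()
print(toy["Weighted Returns"].tolist(), sharpe)
```

The lag is what prevents look-ahead: without `shift()`, each month's weights would be based on a prediction for that same month.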

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of the monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Random Forest Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Random Forest Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1 = benchmark_monthly_returns  # Benchmark
sample2 = model_monthly_returns  # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)
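
`sharpe_ratio_bootstrap_test` is defined elsewhere in the notebook and its internals are not shown here. As a rough illustration of the idea, the sketch below resamples both return series with a circular block bootstrap and looks at the resulting Sharpe ratio differences. The helper name, block length, number of resamples, and the synthetic data are all assumptions, and the summary statistic may differ from the one the notebook's function reports.

```python
import numpy as np

def sharpe(r):
    """Per-period Sharpe ratio with a zero risk-free rate."""
    return r.mean() / r.std(ddof=1)

def sharpe_diff_block_bootstrap(bench, model, block_len=6, n_boot=2000, seed=0):
    """Hypothetical helper: circular block bootstrap of the Sharpe ratio difference."""
    bench, model = np.asarray(bench, float), np.asarray(model, float)
    n = len(bench)
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_len))
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Draw block start points and wrap around the end of the sample
        starts = rng.integers(0, n, size=n_blocks)
        idx = (starts[:, None] + np.arange(block_len)[None, :]).ravel() % n
        idx = idx[:n]
        # The same indices are used for both series to keep them paired
        diffs[b] = sharpe(model[idx]) - sharpe(bench[idx])
    observed = sharpe(model) - sharpe(bench)
    # Share of resamples in which the model does not beat the benchmark
    p_not_better = float(np.mean(diffs <= 0))
    return observed, p_not_better

# Synthetic monthly returns (illustrative only)
rng = np.random.default_rng(1)
bench = rng.normal(0.005, 0.04, size=240)
model = bench + rng.normal(0.002, 0.01, size=240)

observed, p_not_better = sharpe_diff_block_bootstrap(bench, model)
print(observed, p_not_better)
```

Resampling blocks rather than single months preserves short-run autocorrelation in the returns, which is the usual reason for preferring a block bootstrap with time series.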

#### Multi-layer Perceptron

In [None]:
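# Initialize the Multi-layer Perceptron Classifier
# Note: with solver='lbfgs', scikit-learn ignores learning_rate (used only by 'sgd')
# and early_stopping (effective only for 'sgd' and 'adam')
mlp = MLPClassifier(solver='lbfgs',
                    learning_rate='adaptive',
                    max_iter=100000,
                    early_stopping=True)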

In [None]:
# Parameter distributions
param_dist = {
    'clf__layer_size': Integer(1, int(n_components_99/2)),
    'clf__num_layers': Integer(1, 2),
    'clf__alpha': Real(1e-6, 1e-1, prior='log-uniform'),
    'clf__activation': Categorical(['identity', 'logistic', 'tanh', 'relu'])
}
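
The search space above tunes `layer_size` and `num_layers` as two scalars, whereas `MLPClassifier` expects a single `hidden_layer_sizes` tuple. The notebook's `MLPWrapper` is defined elsewhere; the class below is only a hypothetical sketch of how such a wrapper could translate between the two, with assumed defaults and toy data.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neural_network import MLPClassifier

class MLPWrapperSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for the notebook's MLPWrapper."""

    def __init__(self, layer_size=4, num_layers=1, alpha=1e-4, activation="relu"):
        self.layer_size = layer_size
        self.num_layers = num_layers
        self.alpha = alpha
        self.activation = activation

    def fit(self, X, y):
        # Translate the two scalar parameters into the tuple MLPClassifier expects
        hidden = tuple(int(self.layer_size) for _ in range(int(self.num_layers)))
        self.model_ = MLPClassifier(hidden_layer_sizes=hidden, alpha=self.alpha,
                                    activation=self.activation, solver="lbfgs",
                                    max_iter=500, random_state=0)
        self.model_.fit(X, y)
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        return self.model_.predict(X)

    def predict_proba(self, X):
        return self.model_.predict_proba(X)

# Toy data (illustrative only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

clf = MLPWrapperSketch(layer_size=5, num_layers=2).fit(X_toy, y_toy)
print(clf.model_.hidden_layer_sizes)
```

Inside a pipeline step named `clf`, these constructor arguments become `clf__layer_size`, `clf__num_layers`, and so on, which matches the keys used in the search space.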

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Wrapper Pipeline
pipe = Pipeline([
    ('clf', MLPWrapper())
])

# Define data
X = df_pca_99_scaled
y = df_train_4['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Sequential backward feature selection
    selector = SequentialFeatureSelector(MLPWrapper(),
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         #scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=pipe,
                                 search_spaces=[(param_dist, 100)],
                                 cv=tscv_inner,
                                 #scoring=scorer,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_mlp = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_mlp)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_mlp])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
hyperparameters_dict = dict(best_hyperparameters)

activation = hyperparameters_dict['clf__activation']
alpha = hyperparameters_dict['clf__alpha']
num_layers = hyperparameters_dict['clf__num_layers']
layer_size = hyperparameters_dict['clf__layer_size']

hidden_layer_sizes = tuple(layer_size for _ in range(num_layers))

# Convert the Index to a list
#best_features = best_features.tolist()

# Initialize the Multi-layer Perceptron Classifier
mlp_final = MLPClassifier(solver = 'lbfgs',
                          learning_rate = 'adaptive',
                          max_iter = 100000,
                          early_stopping = True,
                          activation=activation,
                          alpha=alpha,
                          hidden_layer_sizes=hidden_layer_sizes)
# Fit the model
mlp_final.fit(X[best_features_mlp], y)

In [None]:
# Predict classes
y_pred = mlp_final.predict(X[best_features_mlp])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Multi-layer Perceptron (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = mlp_final.predict_proba(X[best_features_mlp])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Multi-layer Perceptron (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_4.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series from the cleaned sample
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Multi-layer Perceptron Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Multi-layer Perceptron Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of the monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Multi-layer Perceptron Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Multi-layer Perceptron Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1 = benchmark_monthly_returns  # Benchmark
sample2 = model_monthly_returns  # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

## Fourth Testing

In [None]:
df_test_4

In [None]:
X_test = df_test_4.drop("Regime", axis=1)

# Perform scaling on the test set using the same scaler
X_test_scaled = scaler.transform(X_test)

# Perform PCA transformation on the scaled test set using the same PCA instance
X_test_pca = pca.transform(X_test_scaled)

# Cap the data
X_test_pca = X_test_pca[:, :n_components_99]

# Scale the PCA components with the scaler fitted on the training data (transform only, no refit)
df_test_pca = pca_scaler.transform(X_test_pca)

# Create a new DataFrame with the transformed test set
df_test_pca_99 = pd.DataFrame(df_test_pca, index=X_test.index)

# Rename the columns to indicate the component number
df_test_pca_99.columns = [f"PC {i+1}" for i in range(X_test_pca.shape[1])]

# Display the transformed test set DataFrame
df_test_pca_99
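
The cell above reuses the scaler and PCA fitted on the training data instead of refitting them on the test set. The NumPy-only sketch below shows the same principle on synthetic data; it stands in for `StandardScaler` and `PCA`, the number of retained components is an assumption, and it omits the additional rescaling of the components that the notebook performs with `pca_scaler`.

```python
import numpy as np

# Synthetic train and test windows (illustrative only)
rng = np.random.default_rng(0)
train = rng.normal(loc=5.0, scale=2.0, size=(100, 4))
test = rng.normal(loc=5.5, scale=2.0, size=(30, 4))

# Statistics are estimated on the training window only
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_z = (train - mu) / sigma
test_z = (test - mu) / sigma  # reuse training statistics, never refit

# Principal axes from the standardized training data (rows of vt), via SVD
_, _, vt = np.linalg.svd(train_z, full_matrices=False)
k = 2  # assumed number of retained components
train_pc = train_z @ vt[:k].T
test_pc = test_z @ vt[:k].T  # project test data onto the training axes

print(train_pc.shape, test_pc.shape)
```

Refitting either step on the test window would let test-period information shape the features, which would make the out-of-sample results look better than they are.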

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_test_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Test Features")

# Show the plot
plt.show()

In [None]:
X = df_test_pca_99

# Predict classes
y_log_reg = log_reg_final.predict(X[best_features_log_reg])
y_svm = svm_final.predict(X[best_features_svm])
y_rf = rf_final.predict(X[best_features_rf])
y_mlp = mlp_final.predict(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for each model
colors = ['black', 'red', 'green', 'blue']

# Plot the data on each subplot with distinct colors
axs[0].plot(df_test_pca_99.index, y_log_reg, color=colors[0], label='Logistic Regression')
axs[1].plot(df_test_pca_99.index, y_svm, color=colors[1], label='SVM')
axs[2].plot(df_test_pca_99.index, y_rf, color=colors[2], label='Random Forest')
axs[3].plot(df_test_pca_99.index, y_mlp, color=colors[3], label='MLP')

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Regime by Models (Test)")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Get the predicted probabilities for each model
y_log_reg_prob = log_reg_final.predict_proba(X[best_features_log_reg])
y_svm_prob = svm_final.predict_proba(X[best_features_svm])
y_rf_prob = rf_final.predict_proba(X[best_features_rf])
y_mlp_prob = mlp_final.predict_proba(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for the two regimes (Value, Momentum)
colors = ['blue', 'black']

# Plot the data on each subplot with distinct colors
axs[0].stackplot(df_test_pca_99.index, y_log_reg_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[1].stackplot(df_test_pca_99.index, y_svm_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[2].stackplot(df_test_pca_99.index, y_rf_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[3].stackplot(df_test_pca_99.index, y_mlp_prob.T, colors=colors, labels=['Value', 'Momentum'])

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Probabilities by Models (Test)")

# Add legend to the plot
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels)

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
y = df_test_4["Regime"]

# Calculate confusion matrices for each model
cm1 = confusion_matrix(y, y_log_reg, normalize='all')
cm2 = confusion_matrix(y, y_svm, normalize='all')
cm3 = confusion_matrix(y, y_rf, normalize='all')
cm4 = confusion_matrix(y, y_mlp, normalize='all')

# Set figure size and dpi
fig, axs = plt.subplots(2, 2, figsize=(11, 11), dpi=100)

# Generate confusion matrix and metrics for each model
cms = [cm1, cm2, cm3, cm4]
models = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']
metrics = []

for i, cm in enumerate(cms):
    # Calculate metrics
    accuracy = np.diag(cm).sum() / cm.sum()
    # Negative log of the per-class recall (a rough proxy, not the probabilistic log loss)
    logloss = -np.log(np.diag(cm) / np.sum(cm, axis=1))
    
    # Store metrics
    metrics.append((accuracy, logloss))
    
    # Plot confusion matrix
    ax = axs[i // 2, i % 2]
    sns.heatmap(cm, annot=True, cmap='coolwarm', ax=ax)
    ax.set_title(f'{models[i]}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Ground-truth regimes of the test set
y_true = df_test_4["Regime"]

# Calculate accuracy scores
log_reg_acc = accuracy_score(y_true, y_log_reg)
svm_acc = accuracy_score(y_true, y_svm)
rf_acc = accuracy_score(y_true, y_rf)
mlp_acc = accuracy_score(y_true, y_mlp)

# Calculate log loss scores
log_reg_loss = log_loss(y_true, y_log_reg_prob)
svm_loss = log_loss(y_true, y_svm_prob)
rf_loss = log_loss(y_true, y_rf_prob)
mlp_loss = log_loss(y_true, y_mlp_prob)

# Create a dataframe to display
df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'],
    'Accuracy': [log_reg_acc, svm_acc, rf_acc, mlp_acc],
    'Log Loss': [log_reg_loss, svm_loss, rf_loss, mlp_loss]
})

display(df)
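
Accuracy and log loss measure different things: accuracy only checks which class receives the higher probability, while log loss penalizes how much probability was placed on the class that actually occurred. A small hand computation on made-up probabilities (illustrative values, not model output):

```python
import numpy as np

y_true_toy = np.array([0, 1, 1, 0])
# Columns: P(class 0), P(class 1)
proba_toy = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.6, 0.4],
                      [0.7, 0.3]])

# Probability assigned to the class that actually occurred
p_true = proba_toy[np.arange(len(y_true_toy)), y_true_toy]

# Log loss: average negative log-probability of the realized class
manual_log_loss = -np.mean(np.log(p_true))

# Accuracy only looks at which class has the larger probability
manual_accuracy = np.mean(proba_toy.argmax(axis=1) == y_true_toy)
print(manual_log_loss, manual_accuracy)
```

A model can therefore rank well on accuracy and still have a poor log loss if its wrong predictions are made with high confidence.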

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

df_log_reg_prob = pd.DataFrame(y_log_reg_prob, index=df_test_4.index, columns=['Probability_0', 'Probability_1'])
df_svm_prob = pd.DataFrame(y_svm_prob, index=df_test_4.index, columns=['Probability_0', 'Probability_1'])
df_rf_prob = pd.DataFrame(y_rf_prob, index=df_test_4.index, columns=['Probability_0', 'Probability_1'])
df_mlp_prob = pd.DataFrame(y_mlp_prob, index=df_test_4.index, columns=['Probability_0', 'Probability_1'])

# Repeat the process for each model
for df_prob, model_name in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                               ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_4.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_test_4['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_test_4['Momentum Returns']
    
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_test_4['Value Returns'] + \
                                           0.5 * df_test_4['Momentum Returns']
    
    # Compute Sharpe ratios
    model_sharpe_ratio = df_merged['Weighted Returns'].mean() / df_merged['Weighted Returns'].std()

    # Print Sharpe ratios
    print(f"{model_name} Sharpe Ratio:", model_sharpe_ratio)

    # Calculate the cumulative returns
    df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
    df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

    # Plot the cumulative returns
    plt.plot(df_merged['Cumulative Weighted Returns'], label=model_name)

# Plot the 50/50 benchmark
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Model Comparison (Test)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

benchmark_sharpe_ratio = df_merged['Simple Weighted Returns'].mean() / df_merged['Simple Weighted Returns'].std()
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 12), dpi=100)

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Repeat the process for each model
for df_prob, model_name, ax in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                                   ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'], 
                                   axes):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_4.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']

    # No shift is needed for the benchmark; dropna() below keeps the same rows for both series

    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Plot histograms of the monthly returns for the model and the benchmark
    ax.hist(df_merged['Weighted Returns'], bins=100, alpha=0.5, label=model_name, color='red')
    ax.hist(df_merged['Simple Weighted Returns'], bins=100, alpha=0.5, label='50/50 Benchmark', color='black')

    # Add a vertical line at zero
    ax.axvline(0, color='red', linestyle='--')
    
    ax.legend()
    ax.set_xlabel('monthly returns')
    ax.set_ylabel('frequency')
    ax.set_title(f'Return Histogram of {model_name} (Test)')

plt.tight_layout()
plt.show()

In [None]:
# Prepare lists to store model returns and model names
model_returns = []
model_names = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']

# Get returns for each model
for df_prob in [df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob]:
    # Merge the DataFrames based on their indexes
    df_merged = df_test_4.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Append model returns to the list
    model_returns.append(df_merged['Weighted Returns'])

benchmark_returns = 0.5 * df_merged['Value Returns'] + \
                    0.5 * df_merged['Momentum Returns']

# Run bootstrap test
multi_sharpe_ratio_bootstrap_test(benchmark_returns, model_returns, model_names)

## Third further Feature Discarding

In [None]:
lean_sample_11 = lean_sample_10.drop('Sentiment', axis=1)

display(lean_sample_11)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_11' DataFrame
msno.bar(lean_sample_11, sort='descending', color='black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label
plt.ylabel("Missing Values")

# Set plot title
plt.title("Missing Data by Feature")

# Show the plot
plt.show()

#### 5Y Treasury Yield

In [None]:
# Drop any rows that have missing values
df_complete_11 = lean_sample_11.dropna()

# Display the smaller DataFrame
display(df_complete_11)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

plt.plot(df_complete_11["5Y Treasury Yields"], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("%")

# Set plot title
plt.title("5Y Treasury Yields")

# Show the plot
plt.show()

In [None]:
# DataFrame and column to analyze
df = df_complete_11
column = '5Y Treasury Yields'

plot_r_squared(df, column)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ['5Y Treasury Yields', '10Y Treasury Yields']

# Plotting the columns
plt.plot(df_complete_11[columns_to_plot])

# Show a legend
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('Treasury Yields')

# Display the plot
plt.show()

In [None]:
# Define time series
ts1 = df_complete_11["5Y Treasury Yields"]  # Time series 1
ts2 = df_complete_11["10Y Treasury Yields"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)
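
`r_squared_bootstrap_test` is defined elsewhere in the notebook. To illustrate what a resampling test of the R² between two series can look like, the sketch below uses a plain permutation test. The helper names and synthetic data are assumptions, and simple shuffling ignores autocorrelation, so the notebook's own function may well resample differently.

```python
import numpy as np

def r_squared(a, b):
    """R^2 of a one-variable linear regression, i.e. the squared Pearson correlation."""
    return float(np.corrcoef(a, b)[0, 1] ** 2)

def r_squared_permutation_pvalue(a, b, n_permutations=2000, seed=0):
    """Hypothetical helper: share of shuffled samples with R^2 at least as large."""
    rng = np.random.default_rng(seed)
    observed = r_squared(a, b)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling one series breaks any relationship between the two
        if r_squared(a, rng.permutation(b)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive
    return observed, (hits + 1) / (n_permutations + 1)

# Synthetic, strongly related series (illustrative only)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y_lin = 0.9 * x + rng.normal(scale=0.3, size=200)

observed, p_value = r_squared_permutation_pvalue(x, y_lin)
print(observed, p_value)
```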

In [None]:
lean_sample_12 = lean_sample_11.drop('5Y Treasury Yields', axis=1)

display(lean_sample_12)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_12' DataFrame
msno.bar(lean_sample_12, sort='descending', color='black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label
plt.ylabel("Missing Values")

# Set plot title
plt.title("Missing Data by Feature")

# Show the plot
plt.show()

#### 10Y Treasury Yield

In [None]:
# Drop any rows that have missing values
df_complete_12 = lean_sample_12.dropna()

# Display the smaller DataFrame
display(df_complete_12)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

plt.plot(df_complete_12["10Y Treasury Yields"], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("%")

# Set plot title
plt.title("10Y Treasury Yields")

# Show the plot
plt.show()

In [None]:
# DataFrame and column to analyze
df = df_complete_12
column = '10Y Treasury Yields'

plot_r_squared(df, column)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["10Y Treasury Yields", 'Effective Federal Funds Rate']

# Plotting the columns
plt.plot(df_complete_12[columns_to_plot])

# Show a legend
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('%')
plt.title('10Y Treasury Yields vs. Effective Federal Funds Rate')

# Display the plot
plt.show()

In [None]:
# Define time series
ts1 = df_complete_12["10Y Treasury Yields"]  # Time series 1
ts2 = df_complete_12["Effective Federal Funds Rate"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

In [None]:
lean_sample_13 = lean_sample_12.drop('10Y Treasury Yields', axis=1)

display(lean_sample_13)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_13' DataFrame
msno.bar(lean_sample_13, sort='descending', color='black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label
plt.ylabel("Missing Values")

# Set plot title
plt.title("Missing Data by Feature")

# Show the plot
plt.show()

#### Effective Federal Funds Rate

In [None]:
# Drop any rows that have missing values
df_complete_13 = lean_sample_13.dropna()

# Display the smaller DataFrame
display(df_complete_13)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

plt.plot(df_complete_13["Effective Federal Funds Rate"], color = 'black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("%")

# Set plot title
plt.title("Effective Federal Funds Rate")

# Show the plot
plt.show()

In [None]:
# DataFrame and column to analyze
df = df_complete_13
column = 'Effective Federal Funds Rate'

plot_r_squared(df, column)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Select the desired columns from the DataFrame
columns_to_plot = ["Effective Federal Funds Rate", 'CAPE']

# Plotting the columns
plt.plot(df_complete_13[columns_to_plot])

# Show a legend
plt.legend(columns_to_plot)

# Add labels and title to the plot
plt.xlabel('time')
plt.ylabel('')
plt.title('')

# Display the plot
plt.show()

In [None]:
# Define time series
ts1 = df_complete_13["Effective Federal Funds Rate"]  # Time series 1
ts2 = df_complete_13["CAPE"]  # Time series 2

# Call the function
r_squared_bootstrap_test(ts1, ts2, n_permutations=1000000)

## Fifth Modeling

In [None]:
# Sort the DataFrame by the time-related column
df_sorted = df_complete_13.sort_values('Date')

# Calculate the index to split the data
split_index = int(0.7 * len(df_sorted))

# Split the data into training and test sets
df_train_5 = df_sorted[:split_index]
df_test_5 = df_sorted[split_index:]

#Display
df_train_5
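
The split above is chronological rather than random, so the test set always lies strictly after the training set. A minimal sketch on a toy frame, assuming the dates live in the index (the notebook sorts by a `Date` label instead) and using made-up data:

```python
import numpy as np
import pandas as pd

# Toy monthly frame in shuffled order (illustrative only)
dates = pd.date_range("2000-01-01", periods=10, freq="MS")
toy = pd.DataFrame({"feature": np.arange(10.0)}, index=dates)
toy = toy.sample(frac=1, random_state=0)

# Restore chronological order before splitting
toy_sorted = toy.sort_index()

# First 70% of observations for training, the remainder for testing
split = int(0.7 * len(toy_sorted))
toy_train, toy_test = toy_sorted.iloc[:split], toy_sorted.iloc[split:]
print(len(toy_train), len(toy_test))
```

A random split such as `train_test_split` with shuffling would mix future observations into the training data, which is why it is avoided for time series.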

### Feature Space

In [None]:
# Drop the target variable
X = df_train_5.drop('Regime', axis=1)

# Initialize an empty matrix for R^2 values
r2_matrix = np.zeros((len(X.columns), len(X.columns)))

# Calculate R^2 values for each pair of variables
for i, col1 in enumerate(X.columns):
    for j, col2 in enumerate(X.columns):
        if i != j:
            # Fit a linear regression model
            lr = LinearRegression()
            lr.fit(X[col1].values.reshape(-1, 1), X[col2])

            # Calculate R^2 score
            r2 = lr.score(X[col1].values.reshape(-1, 1), X[col2])
            r2_matrix[i, j] = r2

# Perform hierarchical clustering
linkage_matrix = hierarchy.linkage(r2_matrix, method='complete')

# Obtain the order of rows and columns based on the dendrogram
order = hierarchy.dendrogram(linkage_matrix, no_plot=True)['leaves']

# Sort the R^2 matrix based on the order
sorted_r2_matrix = pd.DataFrame(r2_matrix[order, :][:, order], index=X.columns[order], columns=X.columns[order])

# Plot the sorted R^2 matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(sorted_r2_matrix, annot=False, cmap='coolwarm', cbar=True)
plt.title('Sorted R^2 Matrix with Clusters')
plt.show()
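
For a univariate linear regression with intercept, R^2 equals the squared Pearson correlation between the two variables. The pairwise loop above can therefore be cross-checked against a squared correlation matrix, which also has ones on the diagonal by construction. The sketch below uses a synthetic frame `demo` (an assumption, not the notebook's data).

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with one correlated pair (a, b)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo["b"] += 0.8 * demo["a"]

# R^2 from an explicit univariate regression of b on a
lr = LinearRegression().fit(demo[["a"]], demo["b"])
r2_regression = lr.score(demo[["a"]], demo["b"])

# The same quantity read off the squared correlation matrix
r2_from_corr = demo.corr() ** 2

print(np.isclose(r2_regression, r2_from_corr.loc["a", "b"]))
```

On wide feature sets the correlation route avoids fitting one regression per column pair.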

In [None]:
# Create an instance of StandardScaler and perform scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create an instance of PCA and perform PCA transformation
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Create a new DataFrame with the date index and components
df_pca = pd.DataFrame(X_pca, index=X.index)

# Rename the columns to indicate the component number
df_pca.columns = [f"PC {i+1}" for i in range(X_pca.shape[1])]

# Print the new DataFrame
df_pca

In [None]:
# Plot the cumulative explained variance ratio
explained_variance_ratio_cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(11, 6), dpi=100)
plt.plot(range(1, len(explained_variance_ratio_cumulative) + 1), explained_variance_ratio_cumulative, color = 'black')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance')
plt.grid(True)

# Determine the number of components explaining at least 99% of variance
n_components_99 = np.argmax(explained_variance_ratio_cumulative >= 0.99) + 1
print("Number of components explaining at least 99% of variance:", n_components_99)

# Add a vertical line at the number of components where 99% is reached
plt.axvline(x=n_components_99, color='red', linestyle='--')

plt.show()
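
The threshold rule used above (first component count whose cumulative explained variance reaches 99%) can be compared with scikit-learn's built-in option of passing a float to `n_components`. The following sketch runs both on synthetic low-rank data; the array `data` and all variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three latent factors spread over ten noisy features
rng = np.random.default_rng(1)
latent = rng.normal(size=(300, 3))
mixing = rng.normal(size=(3, 10))
data = latent @ mixing + 0.05 * rng.normal(size=(300, 10))
data_scaled = StandardScaler().fit_transform(data)

# Manual rule: first index where the cumulative ratio reaches 0.99
cum = np.cumsum(PCA(svd_solver="full").fit(data_scaled).explained_variance_ratio_)
n_manual = int(np.argmax(cum >= 0.99) + 1)

# Built-in rule: a float in (0, 1) asks PCA to pick the component count itself
pca_auto = PCA(n_components=0.99, svd_solver="full").fit(data_scaled)
n_auto = pca_auto.n_components_

print(n_manual, n_auto)
```

The two rules can differ only when the cumulative ratio hits the threshold exactly, which is unlikely with real-valued data.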

In [None]:
# Keep only the principal components explaining at least 99% of variance
df_pca_99 = df_pca.iloc[:, :n_components_99]

# Display the retained components
df_pca_99

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("PCA Transformed Features")

# Show the plot
plt.show()

In [None]:
# Initialize the StandardScaler
pca_scaler = StandardScaler()

# Fit the scaler to the data and transform the data
df_pca_99_scaled = pca_scaler.fit_transform(df_pca_99)

# Convert the scaled array back to a DataFrame
df_pca_99_scaled = pd.DataFrame(df_pca_99_scaled, index=df_pca_99.index, columns=df_pca_99.columns)

df_pca_99_scaled
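
The three transformers above (`scaler`, `pca`, `pca_scaler`) are all fitted on the training split. How the test split is transformed is not shown in this section; one common way to avoid leaking test information is to chain the steps in a `Pipeline`, fit it on the training data only, and call `transform` on the test data. The sketch below assumes synthetic arrays `train` and `test` and a hypothetical pipeline name `feature_pipeline`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the training and test feature matrices
rng = np.random.default_rng(2)
train = rng.normal(size=(140, 8))
test = rng.normal(loc=0.5, size=(60, 8))

# Scale -> PCA (99% variance) -> rescale the component scores
feature_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=0.99, svd_solver="full")),
    ("rescale", StandardScaler()),
])

# Fit on training data only; the test data is transformed with the fitted parameters
train_features = feature_pipeline.fit_transform(train)
test_features = feature_pipeline.transform(test)

print(train_features.shape, test_features.shape)
```

Only the training features are guaranteed to have zero mean and unit variance; the test features inherit the training statistics.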

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99_scaled)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Features")

# Show the plot
plt.show()

### Models

#### Logistic Regression (baseline)

In [None]:
log_reg = LogisticRegression(penalty = 'elasticnet',
                             class_weight = 'balanced',
                             solver = 'saga',
                             l1_ratio=0.5,
                             max_iter =100000,
                             n_jobs=-1)

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'l1_ratio': Real(0, 1)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_5['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(log_reg, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=log_reg,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_log_reg = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_log_reg)
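
The splitters `tscv_outer` and `tscv_inner` used in the loop above are defined outside this section. Assuming they are `TimeSeriesSplit` objects, the sketch below shows the expanding-window structure such a nested scheme produces; `outer_sketch`, `inner_sketch`, and the fold counts are hypothetical choices, not the notebook's settings.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical splitters standing in for tscv_outer / tscv_inner
outer_sketch = TimeSeriesSplit(n_splits=4)
inner_sketch = TimeSeriesSplit(n_splits=3)

positions = np.arange(100)  # row positions of a 100-observation sample

fold_summary = []
for train_idx, test_idx in outer_sketch.split(positions):
    # Inner folds are built only from the outer training window
    inner_sizes = [(len(tr), len(va)) for tr, va in inner_sketch.split(train_idx)]
    ordered = train_idx.max() < test_idx.min()
    fold_summary.append((len(train_idx), len(test_idx), ordered, inner_sizes))

for row in fold_summary:
    print(row)
```

Note that the loop above discards the outer test fold (it is assigned to `_`), so the values in `best_scores` are inner cross-validation scores computed on training windows of different lengths.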

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_log_reg])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
log_reg_final = LogisticRegression(penalty='elasticnet',
                                   class_weight='balanced',
                                   solver='saga',
                                   max_iter=100000,
                                   n_jobs=-1,
                                   **best_hyperparameters)
# Fit the model
log_reg_final.fit(X[best_features_log_reg], y)

In [None]:
# Feature coefficients
print(log_reg_final.coef_)

# Intercept
print(log_reg_final.intercept_)

In [None]:
proportion = df_train_5['Regime'].value_counts(normalize=True)
print(proportion)

In [None]:
# Predict classes
y_pred = log_reg_final.predict(X[best_features_log_reg])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Logistic Regression (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = log_reg_final.predict_proba(X[best_features_log_reg])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Logistic Regression (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()
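
With `normalize='all'` every cell of the confusion matrix is a share of all observations, so the diagonal sums to the accuracy. With imbalanced regimes, `normalize='true'` (row-wise) can be easier to read because each row then shows per-class recall. A small worked example with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: six observations of class 0, four of class 1
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# Shares of all observations; the trace equals the accuracy
cm_all = confusion_matrix(y_true, y_hat, normalize="all")

# Shares within each actual class; the diagonal holds per-class recall
cm_true = confusion_matrix(y_true, y_hat, normalize="true")

print(cm_all)
print(cm_true)
```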

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_5.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

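# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series after dropping NaN rows so both samples stay aligned
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']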

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Logistic Regression Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal reference line at the starting value of 1
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Logistic Regression Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()
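
The allocation rule above weights next month's Value and Momentum returns by this month's predicted probabilities. The toy example below reproduces the arithmetic on four made-up rows with the same column names. The annualization by the square root of 12 is an added assumption (independent monthly returns, zero risk-free rate); the notebook itself reports the monthly ratio.

```python
import numpy as np
import pandas as pd

# Four made-up months with the same column names as df_merged
toy = pd.DataFrame({
    "Value Returns":    [0.02, -0.01, 0.03, 0.00],
    "Momentum Returns": [-0.01, 0.02, 0.01, 0.04],
    "Probability_0":    [0.8, 0.3, 0.6, 0.5],
    "Probability_1":    [0.2, 0.7, 0.4, 0.5],
})

# Use the previous month's probabilities as weights for the current month's returns
weights = toy[["Probability_0", "Probability_1"]].shift()
toy["Weighted Returns"] = (weights["Probability_0"] * toy["Value Returns"]
                           + weights["Probability_1"] * toy["Momentum Returns"])

# The first row has no previous probabilities and is dropped
toy = toy.dropna()

monthly_sharpe = toy["Weighted Returns"].mean() / toy["Weighted Returns"].std()
annualized_sharpe = np.sqrt(12) * monthly_sharpe  # assumption: i.i.d. monthly returns

print(toy["Weighted Returns"].round(4).tolist())
```

For example, the second month earns 0.8 * (-0.01) + 0.2 * 0.02 = -0.004, because the weights come from the first month's probabilities.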

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Logistic Regression Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Logistic Regression Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)
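
The helper `sharpe_ratio_bootstrap_test` is defined outside this section and is not reproduced here. As a rough illustration of the idea, the sketch below builds a percentile interval for the Sharpe ratio difference with a hand-written circular block bootstrap in NumPy. The function `sharpe_diff_block_bootstrap`, its defaults (`block_size=6`, `n_boot=2000`), and the synthetic return series are all assumptions, not the notebook's implementation.

```python
import numpy as np

def sharpe(x):
    """Monthly Sharpe ratio with a zero risk-free rate (assumption)."""
    return x.mean() / x.std(ddof=1)

def sharpe_diff_block_bootstrap(bench, model, block_size=6, n_boot=2000, seed=0):
    """Hypothetical helper: circular block bootstrap of Sharpe(model) - Sharpe(bench)."""
    bench = np.asarray(bench, dtype=float)
    model = np.asarray(model, dtype=float)
    n = len(bench)
    rng = np.random.default_rng(seed)
    n_blocks = int(np.ceil(n / block_size))
    offsets = np.arange(block_size)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        starts = rng.integers(0, n, size=n_blocks)
        # Blocks wrap around the end of the sample (circular); trim to length n
        idx = ((starts[:, None] + offsets[None, :]) % n).ravel()[:n]
        # The same indices are used for both series to preserve their dependence
        diffs[b] = sharpe(model[idx]) - sharpe(bench[idx])
    observed = sharpe(model) - sharpe(bench)
    ci = np.percentile(diffs, [2.5, 97.5])
    return observed, ci

# Synthetic, aligned, NaN-free monthly returns
rng = np.random.default_rng(42)
bench = rng.normal(0.005, 0.04, size=240)
model = bench + rng.normal(0.002, 0.01, size=240)

observed, ci = sharpe_diff_block_bootstrap(bench, model)
print(observed, ci)
```

Resampling blocks rather than single months keeps short-run autocorrelation in the resampled series. Both inputs need to be aligned and free of NaN values, since the same index draws are applied to each.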

#### Support Vector Machines

In [None]:
# Initialize the Support Vector Classifier
svm = SVC(shrinking=True,
          probability=True,
          cache_size=1000,
          class_weight='balanced',
          decision_function_shape ='ovo')

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'kernel': Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
    'degree': Integer(1, 10),
    'gamma': Categorical(['scale', 'auto']),
    'coef0': Real(0, 1)
}
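
Some dimensions of the search space above only matter for particular kernels: in scikit-learn's `SVC`, `degree` is used by the `poly` kernel only, and `coef0` by `poly` and `sigmoid` only. The sketch below checks this on synthetic data (`X_demo`, `y_demo` are made up) by fitting two RBF models that differ only in those parameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary classification data
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Two RBF models that differ only in parameters the RBF kernel does not use
svc_a = SVC(kernel="rbf", degree=2, coef0=0.0).fit(X_demo, y_demo)
svc_b = SVC(kernel="rbf", degree=9, coef0=1.0).fit(X_demo, y_demo)

same_scores = np.allclose(svc_a.decision_function(X_demo),
                          svc_b.decision_function(X_demo))
print(same_scores)
```

Part of the search budget is therefore spent on combinations that yield the same model whenever a non-polynomial kernel is drawn.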

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_5['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Backward sequential feature selection
    selector = SequentialFeatureSelector(svm,
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=svm,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_svm = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_svm)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_svm])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
svm_final = SVC(shrinking=True,
                probability=True,
                cache_size=1000,
                class_weight='balanced',
                decision_function_shape ='ovo',
                **best_hyperparameters)
# Fit the model
svm_final.fit(X[best_features_svm], y)

In [None]:
# Predict classes
y_pred = svm_final.predict(X[best_features_svm])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Support Vector Machine (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = svm_final.predict_proba(X[best_features_svm])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Support Vector Machine (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_5.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

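# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series after dropping NaN rows so both samples stay aligned
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']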

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Support Vector Machine Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal reference line at the starting value of 1
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Support Vector Machine Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Support Vector Machine Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Support Vector Machine Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Random Forest

In [None]:
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_jobs=-1,
                            class_weight='balanced')

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'n_estimators': Integer(100, 1000),  # Number of trees in the forest
    'criterion': Categorical(['gini', 'entropy']),
    'max_depth': Integer(1, 20),  # Maximum number of levels in each decision tree
    'min_samples_split': Integer(2, 10),  # Minimum number of data points placed in a node before the node is split
    'min_samples_leaf': Integer(1, 10),  # Minimum number of data points allowed in a leaf node
    'min_weight_fraction_leaf': Real(0.05, 0.2),  # Minimum weighted fraction of the total population required to be at a leaf node
    'max_features': Categorical(['sqrt', 'log2']),  # Number of features to consider at every split
    'max_samples': Real(0.01, 1.0)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_5['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(rf, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=rf,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_rf = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_rf)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_rf])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
rf_final = RandomForestClassifier(n_jobs=-1,
                                  class_weight='balanced',
                                  **best_hyperparameters)
# Fit the model
rf_final.fit(X[best_features_rf], y)

In [None]:
# Predict classes
y_pred = rf_final.predict(X[best_features_rf])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Random Forest (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = rf_final.predict_proba(X[best_features_rf])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Random Forest (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_5.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

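# Drop rows with NaN values (the first row has no shifted probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series after dropping NaN rows so both samples stay aligned
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']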

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Random Forest Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal reference line at the starting value of 1
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Random Forest Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Random Forest Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Random Forest Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Multi-layer Perceptron

In [None]:
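# Initialize the Multi-layer Perceptron Classifier
# Note: with solver='lbfgs', scikit-learn ignores learning_rate (used by 'sgd' only)
# and early_stopping (effective for 'sgd' and 'adam' only)
mlp = MLPClassifier(solver = 'lbfgs',
                    learning_rate = 'adaptive',
                    max_iter = 100000,
                    early_stopping = True)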

In [None]:
# Parameter distributions
param_dist = {
    'clf__layer_size': Integer(1, int(n_components_99/2)),
    'clf__num_layers': Integer(1, 2),
    'clf__alpha': Real(1e-6, 1e-1, prior='log-uniform'),
    'clf__activation': Categorical(['identity', 'logistic', 'tanh', 'relu'])
}
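
The search space above tunes `layer_size` and `num_layers` rather than `hidden_layer_sizes` directly, which is why a wrapper estimator (`MLPWrapper`, defined outside this section) is needed. The class below, `MLPWrapperSketch`, is a hypothetical stand-in that shows one way such a wrapper can map the two integers to a tuple of equal-width hidden layers; its defaults and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.neural_network import MLPClassifier

class MLPWrapperSketch(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for the notebook's MLPWrapper (defined elsewhere)."""

    def __init__(self, layer_size=4, num_layers=1, alpha=1e-4, activation="relu"):
        self.layer_size = layer_size
        self.num_layers = num_layers
        self.alpha = alpha
        self.activation = activation

    def fit(self, X, y):
        # Build equal-width hidden layers from the two tunable integers
        hidden = tuple(int(self.layer_size) for _ in range(int(self.num_layers)))
        self.model_ = MLPClassifier(hidden_layer_sizes=hidden,
                                    alpha=self.alpha,
                                    activation=self.activation,
                                    solver="lbfgs",
                                    max_iter=2000,
                                    random_state=0)
        self.model_.fit(X, y)
        self.classes_ = self.model_.classes_
        return self

    def predict(self, X):
        return self.model_.predict(X)

    def predict_proba(self, X):
        return self.model_.predict_proba(X)

# Synthetic binary problem
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

clf = MLPWrapperSketch(layer_size=3, num_layers=2).fit(X_demo, y_demo)
print(clf.model_.hidden_layer_sizes, clf.predict_proba(X_demo[:2]).shape)
```

Inside a `Pipeline` step named `clf`, these parameters would be addressed as `clf__layer_size` and `clf__num_layers`, matching the keys in the search space.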

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Wrapper Pipeline
pipe = Pipeline([
    ('clf', MLPWrapper())
])

# Define data
X = df_pca_99_scaled
y = df_train_5['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Backward sequential feature selection
    selector = SequentialFeatureSelector(MLPWrapper(),
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         #scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=pipe,
                                 search_spaces=[(param_dist, 100)],
                                 cv=tscv_inner,
                                 #scoring=scorer,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_mlp = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_mlp)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_mlp])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
hyperparameters_dict = dict(best_hyperparameters)

activation = hyperparameters_dict['clf__activation']
alpha = hyperparameters_dict['clf__alpha']
num_layers = hyperparameters_dict['clf__num_layers']
layer_size = hyperparameters_dict['clf__layer_size']

hidden_layer_sizes = tuple(layer_size for _ in range(num_layers))

# Convert the Index to a list
#best_features = best_features.tolist()

# Initialize the final Multi-layer Perceptron classifier
mlp_final = MLPClassifier(solver = 'lbfgs',
                          learning_rate = 'adaptive',
                          max_iter = 100000,
                          early_stopping = True,
                          activation=activation,
                          alpha=alpha,
                          hidden_layer_sizes=hidden_layer_sizes)
# Fit the model
mlp_final.fit(X[best_features_mlp], y)

In [None]:
# Predict classes
y_pred = mlp_final.predict(X[best_features_mlp])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Multi-layer Perceptron (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = mlp_final.predict_proba(X[best_features_mlp])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Multi-layer Perceptron (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_5.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract the aligned monthly return series
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Multi-layer Perceptron Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Multi-layer Perceptron Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of the monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Multi-layer Perceptron Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Multi-layer Perceptron Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1 = benchmark_monthly_returns  # Benchmark
sample2 = model_monthly_returns  # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)
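
`sharpe_ratio_bootstrap_test` is defined elsewhere in the notebook, so its internals are not visible in this section. As a rough illustration of the idea only, the following is a minimal sketch of a circular block bootstrap for the difference of two Sharpe ratios. The function names, the block length of 6 months, and the toy return series are assumptions made for this example, not the notebook's actual implementation.

```python
import numpy as np


def circular_block_indices(n, block_length, rng):
    """Draw n resampled indices from blocks that wrap around the end of the sample."""
    n_blocks = int(np.ceil(n / block_length))
    starts = rng.integers(0, n, size=n_blocks)
    offsets = np.arange(block_length)
    idx = (starts[:, None] + offsets[None, :]) % n
    return idx.ravel()[:n]


def sharpe(returns):
    """Per-period Sharpe ratio (no annualization, zero risk-free rate)."""
    return returns.mean() / returns.std(ddof=1)


def sharpe_difference_bootstrap(benchmark, model, block_length=6, n_boot=2000, seed=0):
    """Return the observed Sharpe difference and a 95% percentile interval."""
    benchmark = np.asarray(benchmark, dtype=float)
    model = np.asarray(model, dtype=float)
    if benchmark.shape != model.shape:
        raise ValueError("Both return series must have the same length.")
    rng = np.random.default_rng(seed)
    n = len(model)
    observed = sharpe(model) - sharpe(benchmark)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        # Use the same indices for both series to preserve their dependence
        idx = circular_block_indices(n, block_length, rng)
        diffs[b] = sharpe(model[idx]) - sharpe(benchmark[idx])
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return observed, lower, upper


# Toy data (hypothetical): 240 months of simulated returns
toy_rng = np.random.default_rng(42)
toy_benchmark = toy_rng.normal(0.004, 0.03, size=240)
toy_model = toy_benchmark + toy_rng.normal(0.002, 0.01, size=240)

observed, lower, upper = sharpe_difference_bootstrap(toy_benchmark, toy_model)
print("Observed Sharpe difference:", observed)
print("95% bootstrap interval:", (lower, upper))
```

Resampling both series with the same block indices keeps their cross-correlation intact, and the blocks preserve short-range serial dependence. Both series must be aligned and free of NaN values for this to work.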

## Fifth Testing

In [None]:
df_test_5

In [None]:
X_test = df_test_5.drop("Regime", axis=1)

# Perform scaling on the test set using the same scaler
X_test_scaled = scaler.transform(X_test)

# Perform PCA transformation on the scaled test set using the same PCA instance
X_test_pca = pca.transform(X_test_scaled)

# Cap the data
X_test_pca = X_test_pca[:, :n_components_99]

# Scale the PCA data with the scaler that was fitted on the training set
df_test_pca = pca_scaler.transform(X_test_pca)

# Create a new DataFrame with the transformed test set
df_test_pca_99 = pd.DataFrame(df_test_pca, index=X_test.index)

# Rename the columns to indicate the component number
df_test_pca_99.columns = [f"PC {i+1}" for i in range(X_test_pca.shape[1])]

# Display the transformed test set DataFrame
df_test_pca_99

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_test_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Test Features")

# Show the plot
plt.show()

In [None]:
X = df_test_pca_99

# Predict classes
y_log_reg = log_reg_final.predict(X[best_features_log_reg])
y_svm = svm_final.predict(X[best_features_svm])
y_rf = rf_final.predict(X[best_features_rf])
y_mlp = mlp_final.predict(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for each model
colors = ['black', 'red', 'green', 'blue']

# Plot the data on each subplot with distinct colors
axs[0].plot(df_test_pca_99.index, y_log_reg, color=colors[0], label='Logistic Regression')
axs[1].plot(df_test_pca_99.index, y_svm, color=colors[1], label='SVM')
axs[2].plot(df_test_pca_99.index, y_rf, color=colors[2], label='Random Forest')
axs[3].plot(df_test_pca_99.index, y_mlp, color=colors[3], label='MLP')

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Regime by Models (Test)")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Get the predicted probabilities for each model
y_log_reg_prob = log_reg_final.predict_proba(X[best_features_log_reg])
y_svm_prob = svm_final.predict_proba(X[best_features_svm])
y_rf_prob = rf_final.predict_proba(X[best_features_rf])
y_mlp_prob = mlp_final.predict_proba(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for the two regimes (Value, Momentum)
colors = ['blue', 'black']

# Plot the data on each subplot with distinct colors
axs[0].stackplot(df_test_pca_99.index, y_log_reg_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[1].stackplot(df_test_pca_99.index, y_svm_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[2].stackplot(df_test_pca_99.index, y_rf_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[3].stackplot(df_test_pca_99.index, y_mlp_prob.T, colors=colors, labels=['Value', 'Momentum'])

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Probabilities by Models (Test)")

# Add legend to the plot
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels)

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
y = df_test_5["Regime"]

# Calculate confusion matrices for each model
cm1 = confusion_matrix(y, y_log_reg, normalize='all')
cm2 = confusion_matrix(y, y_svm, normalize='all')
cm3 = confusion_matrix(y, y_rf, normalize='all')
cm4 = confusion_matrix(y, y_mlp, normalize='all')

# Set figure size and dpi
fig, axs = plt.subplots(2, 2, figsize=(11, 11), dpi=100)

# Collect the confusion matrices and summary metrics for each model
cms = [cm1, cm2, cm3, cm4]
models = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']
metrics = []

for i, cm in enumerate(cms):
    # Accuracy is the trace of the normalized confusion matrix
    accuracy = np.diag(cm).sum() / cm.sum()
    # Per-class recall; a true log loss needs predicted probabilities, not a confusion matrix
    recall_per_class = np.diag(cm) / np.sum(cm, axis=1)

    # Store metrics
    metrics.append((accuracy, recall_per_class))
    
    # Plot confusion matrix
    ax = axs[i // 2, i % 2]
    sns.heatmap(cm, annot=True, cmap='coolwarm', ax=ax)
    ax.set_title(f'{models[i]}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Ground-truth regimes for the test set
y_true = df_test_5["Regime"]

# Calculate accuracy scores
log_reg_acc = accuracy_score(y_true, y_log_reg)
svm_acc = accuracy_score(y_true, y_svm)
rf_acc = accuracy_score(y_true, y_rf)
mlp_acc = accuracy_score(y_true, y_mlp)

# Calculate log loss scores
log_reg_loss = log_loss(y_true, y_log_reg_prob)
svm_loss = log_loss(y_true, y_svm_prob)
rf_loss = log_loss(y_true, y_rf_prob)
mlp_loss = log_loss(y_true, y_mlp_prob)

# Create a dataframe to display
df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'],
    'Accuracy': [log_reg_acc, svm_acc, rf_acc, mlp_acc],
    'Log Loss': [log_reg_loss, svm_loss, rf_loss, mlp_loss]
})

display(df)
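
The table reports accuracy, which is based on hard class labels, next to log loss, which is based on predicted probabilities. The sketch below is a hand-rolled illustration of why the two can disagree: two hypothetical models make identical class predictions, yet the less confident one has a higher log loss. It assumes regimes are coded 0 and 1 and that the probability columns are ordered `[P(0), P(1)]`. The helper `binary_log_loss` and the toy arrays are made up for this example.

```python
import numpy as np


def binary_log_loss(y_true, proba, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    y_true = np.asarray(y_true, dtype=int)
    proba = np.clip(np.asarray(proba, dtype=float), eps, 1 - eps)
    # Pick, for each row, the probability of the class that actually occurred
    p_true = proba[np.arange(len(y_true)), y_true]
    return -np.mean(np.log(p_true))


# Toy labels and probabilities (hypothetical)
y_toy = np.array([0, 1, 1, 0])
confident = np.array([[0.9, 0.1], [0.1, 0.9], [0.2, 0.8], [0.8, 0.2]])
hesitant = np.array([[0.6, 0.4], [0.4, 0.6], [0.45, 0.55], [0.55, 0.45]])

# Both models predict the same classes, so their accuracy is identical
acc_confident = np.mean(confident.argmax(axis=1) == y_toy)
acc_hesitant = np.mean(hesitant.argmax(axis=1) == y_toy)

loss_confident = binary_log_loss(y_toy, confident)
loss_hesitant = binary_log_loss(y_toy, hesitant)

print("Accuracy:", acc_confident, acc_hesitant)
print("Log loss:", loss_confident, loss_hesitant)
```

Because the allocation later weights returns by these probabilities rather than by the predicted class, log loss is arguably the more relevant of the two metrics for the strategy.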

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

df_log_reg_prob = pd.DataFrame(y_log_reg_prob, index=df_test_5.index, columns=['Probability_0', 'Probability_1'])
df_svm_prob = pd.DataFrame(y_svm_prob, index=df_test_5.index, columns=['Probability_0', 'Probability_1'])
df_rf_prob = pd.DataFrame(y_rf_prob, index=df_test_5.index, columns=['Probability_0', 'Probability_1'])
df_mlp_prob = pd.DataFrame(y_mlp_prob, index=df_test_5.index, columns=['Probability_0', 'Probability_1'])

# Repeat the process for each model
for df_prob, model_name in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                               ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_5.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_test_5['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_test_5['Momentum Returns']
    
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_test_5['Value Returns'] + \
                                           0.5 * df_test_5['Momentum Returns']
    
    # Compute Sharpe ratios
    model_sharpe_ratio = df_merged['Weighted Returns'].mean() / df_merged['Weighted Returns'].std()

    # Print Sharpe ratios
    print(f"{model_name} Sharpe Ratio:", model_sharpe_ratio)

    # Calculate the cumulative returns
    df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
    df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

    # Plot the cumulative returns
    plt.plot(df_merged['Cumulative Weighted Returns'], label=model_name)

# Plot the 50/50 benchmark
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Model Comparison (Test)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

benchmark_sharpe_ratio = df_merged['Simple Weighted Returns'].mean() / df_merged['Simple Weighted Returns'].std()
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)
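
The allocation rule used in the loop above can be isolated in a small helper: the probabilities predicted at month t-1 become the portfolio weights for the returns of month t. The following is a minimal sketch under the assumption that the column names match those used in this notebook. The helper name `probability_weighted_returns` and the four-month toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def probability_weighted_returns(returns, proba):
    """Weight Value/Momentum returns by the previous period's regime probabilities."""
    lagged = proba.shift(1)  # use only information available before the return period
    weighted = (lagged['Probability_0'] * returns['Value Returns']
                + lagged['Probability_1'] * returns['Momentum Returns'])
    # The first period has no lagged probabilities and is dropped
    return weighted.dropna()


# Toy data (hypothetical): four month-end observations
idx = pd.to_datetime(['2020-01-31', '2020-02-29', '2020-03-31', '2020-04-30'])
toy_returns = pd.DataFrame({'Value Returns': [0.01, 0.02, -0.01, 0.03],
                            'Momentum Returns': [0.02, -0.01, 0.04, 0.00]}, index=idx)
toy_proba = pd.DataFrame({'Probability_0': [0.8, 0.3, 0.5, 0.6],
                          'Probability_1': [0.2, 0.7, 0.5, 0.4]}, index=idx)

model_returns_toy = probability_weighted_returns(toy_returns, toy_proba)

# 50/50 benchmark restricted to the same periods as the model
benchmark_toy = (0.5 * toy_returns['Value Returns']
                 + 0.5 * toy_returns['Momentum Returns']).loc[model_returns_toy.index]

print(model_returns_toy)
print(benchmark_toy)
```

Restricting the benchmark to the model's index keeps both series the same length, which matters when they are compared through Sharpe ratios or a paired bootstrap.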

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 12), dpi=100)

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Repeat the process for each model
for df_prob, model_name, ax in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                                   ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'], 
                                   axes):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_5.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']

    # The benchmark uses constant weights, so it needs no shift;
    # dropna() below aligns its rows with the model's

    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Plot histograms of monthly returns for the model and the benchmark
    ax.hist(df_merged['Weighted Returns'], bins=100, alpha=0.5, label=model_name, color='red')
    ax.hist(df_merged['Simple Weighted Returns'], bins=100, alpha=0.5, label='50/50 Benchmark', color='black')

    # Add a vertical line at zero
    ax.axvline(0, color='red', linestyle='--')
    
    ax.legend()
    ax.set_xlabel('monthly returns')
    ax.set_ylabel('frequency')
    ax.set_title(f'Return Histogram of {model_name} (Test)')

plt.tight_layout()
plt.show()

In [None]:
# Prepare lists to store model returns and model names
model_returns = []
model_names = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']

# Get returns for each model
for df_prob in [df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob]:
    # Merge the DataFrames based on their indexes
    df_merged = df_test_5.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']
    
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Append model returns to the list
    model_returns.append(df_merged['Weighted Returns'])


# Calculate the simple 50/50 weighted return stream
benchmark_returns = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']
    
# Run bootstrap test
multi_sharpe_ratio_bootstrap_test(benchmark_returns, model_returns, model_names)

# Fourth further Feature Discarding

In [None]:
lean_sample_14 = lean_sample_13.drop('Effective Federal Funds Rate', axis=1)

display(lean_sample_14)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Use the missingno package to visualize the completeness of the 'lean_sample_14' DataFrame
msno.bar(lean_sample_14, sort='descending', color='black')

# Set x-axis label
plt.xlabel("Features")

# Set y-axis label (msno.bar shows the non-missing values per column)
plt.ylabel("Non-Missing Values")

# Set plot title
plt.title("Data Completeness by Feature")

# Show the plot
plt.show()

# Final Data Set

In [None]:
# Drop any rows that have missing values
df_complete_14 = lean_sample_14.dropna()

# Display the smaller DataFrame
display(df_complete_14)

# Final Modeling

In [None]:
# Chronological train/test split of the final data set (target column: 'Regime')

# Sort the DataFrame by the time-related column (e.g., date)
df_sorted = df_complete_14.sort_values('Date')

# Calculate the index to split the data (first 70% for training, last 30% for testing)
split_index = int(0.7 * len(df_sorted))

# Split the data into training and test sets
df_train_6 = df_sorted[:split_index]
df_test_6 = df_sorted[split_index:]

df_train_6

In [None]:
# Drop the target variable
X = df_train_6.drop('Regime', axis=1)

# Initialize an empty matrix for R^2 values
r2_matrix = np.zeros((len(X.columns), len(X.columns)))

# Calculate R^2 values for each pair of variables
for i, col1 in enumerate(X.columns):
    for j, col2 in enumerate(X.columns):
        if i != j:
            # Fit a linear regression model
            lr = LinearRegression()
            lr.fit(X[col1].values.reshape(-1, 1), X[col2])

            # Calculate R^2 score
            r2 = lr.score(X[col1].values.reshape(-1, 1), X[col2])
            r2_matrix[i, j] = r2

# Perform hierarchical clustering
linkage_matrix = hierarchy.linkage(r2_matrix, method='complete')

# Obtain the order of rows and columns based on the dendrogram
order = hierarchy.dendrogram(linkage_matrix, no_plot=True)['leaves']

# Sort the R^2 matrix based on the order
sorted_r2_matrix = pd.DataFrame(r2_matrix[order, :][:, order], index=X.columns[order], columns=X.columns[order])

# Plot the sorted R^2 matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(sorted_r2_matrix, annot=False, cmap='coolwarm', cbar=True)
plt.title('Sorted R^2 Matrix with Clusters')
plt.show()
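
For a simple regression with one regressor and an intercept, R^2 equals the squared Pearson correlation, so the pairwise matrix built in the double loop above can also be obtained from a correlation matrix. The sketch below checks this identity on simulated data. The variables `x` and `z` are toy series made up for this example.

```python
import numpy as np

# Toy data (hypothetical): two correlated series
rng = np.random.default_rng(0)
x = rng.normal(size=200)
z = 0.7 * x + rng.normal(scale=0.5, size=200)

# R^2 from an explicit least-squares fit of z on x (with intercept)
slope, intercept = np.polyfit(x, z, deg=1)
residuals = z - (slope * x + intercept)
r2_fit = 1 - residuals.var() / z.var()

# R^2 as the squared Pearson correlation
r2_corr = np.corrcoef(x, z)[0, 1] ** 2

print("R^2 from regression: ", r2_fit)
print("Squared correlation: ", r2_corr)
```

In the notebook's setting this suggests `X.corr() ** 2` as a vectorized alternative to the loop. One difference to keep in mind is the diagonal, which is 1 in the correlation-based matrix but is left at 0 by the loop.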

In [None]:
# Create an instance of StandardScaler and perform scaling
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create an instance of PCA and perform PCA transformation
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Create a new DataFrame with the date index and components
df_pca = pd.DataFrame(X_pca, index=X.index)

# Rename the columns to indicate the component number
df_pca.columns = [f"PC {i+1}" for i in range(X_pca.shape[1])]

# Print the new DataFrame
df_pca

In [None]:
# Plot the cumulative explained variance ratio
explained_variance_ratio_cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(11, 6), dpi=100)
plt.plot(range(1, len(explained_variance_ratio_cumulative) + 1), explained_variance_ratio_cumulative, color = 'black')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance Ratio')
plt.title('Cumulative Explained Variance')
plt.grid(True)

# Determine the number of components explaining at least 99% of variance
n_components_99 = np.argmax(explained_variance_ratio_cumulative >= 0.99) + 1
print("Number of components explaining at least 99% of variance:", n_components_99)

# Add a vertical line at the number of components where 99% is reached
plt.axvline(x=n_components_99, color='red', linestyle='--')

plt.show()
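
The component count above relies on `np.argmax` returning the first position where the cumulative ratio crosses the threshold. One caveat is that `np.argmax` returns 0 when the condition is never met, which would silently report one component. A minimal sketch with an explicit guard follows. The helper name `components_for_variance` and the toy variance ratios are assumptions for illustration.

```python
import numpy as np


def components_for_variance(explained_variance_ratio, threshold=0.99):
    """Smallest number of leading components whose cumulative ratio reaches the threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    reached = cumulative >= threshold
    if not reached.any():
        # Without this check, np.argmax on an all-False array would return 0
        raise ValueError("Threshold is never reached by the given components.")
    return int(np.argmax(reached)) + 1


# Toy explained-variance ratios (hypothetical), cumulative: 0.60, 0.85, 0.95, 0.995, 1.0
toy_ratios = np.array([0.60, 0.25, 0.10, 0.045, 0.005])

n_99 = components_for_variance(toy_ratios, threshold=0.99)
n_90 = components_for_variance(toy_ratios, threshold=0.90)
print("Components for 99%:", n_99)
print("Components for 90%:", n_90)
```

With a full PCA the cumulative ratio ends at approximately 1, so the guard only matters when the PCA was fitted with a truncated number of components.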

In [None]:
# Keep only the principal components explaining at least 99% of variance
df_pca_99 = df_pca.iloc[:, :n_components_99]

# Display
df_pca_99

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("PCA Transformed Features")

# Show the plot
plt.show()

In [None]:
# Initialize the StandardScaler
pca_scaler = StandardScaler()

# Fit the scaler to the data and transform the data
df_pca_99_scaled = pca_scaler.fit_transform(df_pca_99)

# Convert the scaled data back to a DataFrame
df_pca_99_scaled = pd.DataFrame(df_pca_99_scaled, index=df_pca_99.index, columns=df_pca_99.columns)

df_pca_99_scaled

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_pca_99_scaled)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Features")

# Show the plot
plt.show()

### Modeling

#### Logistic Regression (baseline)

In [None]:
log_reg = LogisticRegression(penalty = 'elasticnet',
                             class_weight = 'balanced',
                             solver = 'saga',
                             l1_ratio=0.5,
                             max_iter =100000,
                             n_jobs=-1)

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'l1_ratio': Real(0, 1)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_6['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(log_reg, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=log_reg,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_log_reg = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_log_reg)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_log_reg])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
log_reg_final = LogisticRegression(penalty='elasticnet',
                                   class_weight='balanced',
                                   solver='saga',
                                   max_iter=100000,
                                   n_jobs=-1,
                                   **best_hyperparameters)
# Fit the model
log_reg_final.fit(X[best_features_log_reg], y)

In [None]:
# Feature coefficients
print(log_reg_final.coef_)

# Intercept
print(log_reg_final.intercept_)

In [None]:
proportion = df_train_6['Regime'].value_counts(normalize=True)
print(proportion)

In [None]:
# Predict classes
y_pred = log_reg_final.predict(X[best_features_log_reg])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Logistic Regression (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = log_reg_final.predict_proba(X[best_features_log_reg])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Logistic Regression (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_6.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract the aligned monthly return series
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute Sharpe ratio based on monthly returns
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Logistic Regression Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Logistic Regression Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of the monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Logistic Regression Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Logistic Regression Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1 = benchmark_monthly_returns  # Benchmark
sample2 = model_monthly_returns  # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Support Vector Machine

In [None]:
# Initialize the Support Vector Classifier
svm = SVC(shrinking=True,
          probability=True,
          cache_size=1000,
          class_weight='balanced',
          decision_function_shape ='ovo')

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'C': Real(0.01, 100, prior='log-uniform'),
    'kernel': Categorical(['linear', 'poly', 'rbf', 'sigmoid']),
    'degree': Integer(1, 10),
    'gamma': Categorical(['scale', 'auto']),
    'coef0': Real(0, 1)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_6['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Sequential backward feature selection
    selector = SequentialFeatureSelector(svm,
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=svm,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_svm = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_svm)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_svm])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Train the final model on all of the training data using the best feature subset and the best hyperparameters
svm_final = SVC(shrinking=True,
                probability=True,
                cache_size=1000,
                class_weight='balanced',
                decision_function_shape ='ovo',
                **best_hyperparameters)
# Fit the model
svm_final.fit(X[best_features_svm], y)

In [None]:
# Predict classes
y_pred = svm_final.predict(X[best_features_svm])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Support Vector Machine (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = svm_final.predict_proba(X[best_features_svm])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Support Vector Machine (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()
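
Accuracy scores only the hard class labels, whereas log loss scores the predicted probabilities, so two models with identical accuracy can differ widely in log loss. The following dependency-free sketch shows both metrics for the binary case. The labels and probabilities are made-up illustrative values, not results from this notebook, and `accuracy_sketch` and `binary_log_loss` are hypothetical helpers rather than scikit-learn's implementations.

```python
import math

def accuracy_sketch(y_true, y_pred):
    """Share of observations whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def binary_log_loss(y_true, p_class1, eps=1e-15):
    """Mean negative log-likelihood of the true labels under P(class 1)."""
    total = 0.0
    for t, p in zip(y_true, p_class1):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical toy data: 0 = Value, 1 = Momentum
toy_true = [0, 1, 1, 0]
confident = [0.1, 0.9, 0.8, 0.2]    # P(Momentum), well separated
hesitant = [0.45, 0.55, 0.6, 0.4]   # same hard labels, weaker conviction

labels_confident = [int(p >= 0.5) for p in confident]
labels_hesitant = [int(p >= 0.5) for p in hesitant]

print("Accuracy:", accuracy_sketch(toy_true, labels_confident),
      accuracy_sketch(toy_true, labels_hesitant))
print("Log loss:", binary_log_loss(toy_true, confident),
      binary_log_loss(toy_true, hesitant))
```

Both toy models classify every observation correctly, but the hesitant one has the higher log loss because it assigns less probability to the true class.
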

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_6.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

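# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series after dropping NaNs so that the model
# and the benchmark cover the same months
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute the Sharpe ratio as mean over standard deviation of monthly returns
# (not annualized, no risk-free rate subtracted)
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()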

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Support Vector Machine Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Support Vector Machine Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()
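
The backtest above weights each month's Value and Momentum returns by the probabilities predicted in the previous month, which keeps the allocation free of look-ahead. A minimal, dependency-free sketch of that lagged weighting and of the Sharpe ratio calculation follows. All inputs are made-up toy numbers, and `lagged_weighted_returns` and `sharpe_ratio` are hypothetical helper names used only for illustration.

```python
import statistics

def lagged_weighted_returns(prob_value, value_ret, momentum_ret):
    """Weight this period's returns by the previous period's probabilities."""
    out = []
    for t in range(1, len(prob_value)):
        w_value = prob_value[t - 1]      # known at the end of period t-1
        w_momentum = 1.0 - w_value       # binary case: probabilities sum to one
        out.append(w_value * value_ret[t] + w_momentum * momentum_ret[t])
    return out

def sharpe_ratio(returns):
    """Mean over sample standard deviation; not annualized, no risk-free rate."""
    return statistics.mean(returns) / statistics.stdev(returns)

# Hypothetical toy inputs
toy_prob_value = [0.8, 0.3, 0.6, 0.5]
toy_value_ret = [0.01, -0.02, 0.03, 0.01]
toy_momentum_ret = [0.02, 0.01, -0.01, 0.02]

toy_model = lagged_weighted_returns(toy_prob_value, toy_value_ret, toy_momentum_ret)
# Drop the first period of the benchmark so both series cover the same months
toy_benchmark = [0.5 * v + 0.5 * m
                 for v, m in zip(toy_value_ret, toy_momentum_ret)][1:]

print("Model returns:", toy_model)
print("Model Sharpe:", sharpe_ratio(toy_model))
print("Benchmark Sharpe:", sharpe_ratio(toy_benchmark))
```

The first period is lost because no earlier probability exists for it, which mirrors the NaN row that `shift()` produces in the pandas version.
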

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Support Vector Machine Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Support Vector Machine Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)
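
The helper `sharpe_ratio_bootstrap_test` is defined earlier in the notebook and is not reproduced here. As background, the sketch below illustrates the general idea of a circular block bootstrap for the difference of two Sharpe ratios. It is an independent, dependency-free illustration under stated assumptions (a fixed block length, a percentile interval, and simulated toy returns) and may differ from what the notebook's helper actually does. All function names are hypothetical.

```python
import random
import statistics

def sharpe(returns):
    """Mean over sample standard deviation of a return series."""
    return statistics.mean(returns) / statistics.stdev(returns)

def circular_block_indices(n, block_length, rng):
    """Draw n indices as consecutive blocks that wrap around the sample end."""
    idx = []
    while len(idx) < n:
        start = rng.randrange(n)
        idx.extend((start + k) % n for k in range(block_length))
    return idx[:n]

def bootstrap_sharpe_difference(benchmark, model, block_length=3,
                                n_boot=1000, seed=0):
    """Percentile interval for Sharpe(model) - Sharpe(benchmark).

    Both series are resampled with the same indices so that the
    contemporaneous dependence between them is preserved.
    """
    rng = random.Random(seed)
    n = len(model)
    diffs = []
    for _ in range(n_boot):
        idx = circular_block_indices(n, block_length, rng)
        resampled_benchmark = [benchmark[i] for i in idx]
        resampled_model = [model[i] for i in idx]
        diffs.append(sharpe(resampled_model) - sharpe(resampled_benchmark))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]

# Simulated toy returns (assumption: 60 months, purely illustrative)
toy_rng = random.Random(42)
toy_benchmark_returns = [toy_rng.gauss(0.005, 0.03) for _ in range(60)]
toy_model_returns = [b + toy_rng.gauss(0.001, 0.01) for b in toy_benchmark_returns]

lower, upper = bootstrap_sharpe_difference(toy_benchmark_returns, toy_model_returns)
print("95% interval for the Sharpe ratio difference:", lower, upper)
```

If the interval excludes zero, the difference in Sharpe ratios would be considered significant at roughly the 5% level under these assumptions.
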

#### Random Forest

In [None]:
# Initialize the Random Forest Classifier
rf = RandomForestClassifier(n_jobs=-1,
                            class_weight='balanced')

In [None]:
# Define the hyperparameter search space for the Bayesian search
param_dist = {
    'n_estimators': Integer(100, 1000),  # Number of trees in the forest
    'criterion': Categorical(['gini', 'entropy']),
    'max_depth': Integer(1, n_components_99),  # Maximum number of levels in each decision tree
    'min_samples_split': Integer(2, 10),  # Minimum number of data points placed in a node before the node is split
    'min_samples_leaf': Integer(1, 10),  # Minimum number of data points allowed in a leaf node
    'min_weight_fraction_leaf': Real(0.05, 0.2),  # Minimum weighted fraction of the total population required to be at a leaf node
    'max_features': Categorical(['sqrt', 'log2']),  # Number of features to consider at every split
    'max_samples': Real(0.01, 1.0)
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Define data
X = df_pca_99_scaled
y = df_train_6['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Recursive feature elimination
    selector = RFECV(rf, step=1, cv=tscv_outer, scoring=scorer, n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=rf,
                                 search_spaces=param_dist,
                                 cv=tscv_inner,
                                 scoring=scorer,
                                 n_iter=100,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_rf = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_rf)
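
The loops above rely on time-series cross-validation, in which every training window precedes its test window so that no future information leaks into the fit. Assuming `tscv_outer` and `tscv_inner` are scikit-learn `TimeSeriesSplit` objects (the class is imported at the top of the notebook, but the definitions are not shown in this section), the sketch below mimics their default expanding-window behaviour without a gap or a maximum training size. The function name is hypothetical.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) with a growing training window.

    Assumes n_samples > n_splits so that every test window is non-empty.
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test

toy_splits = list(expanding_window_splits(n_samples=10, n_splits=3))
for train, test in toy_splits:
    print("train:", train, "test:", test)
```

Each successive split extends the training window by one test block, so later folds are fitted on more history than earlier ones.
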

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_rf])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
# Initialize the Random Forest Classifier
rf_final = RandomForestClassifier(n_jobs=-1,
                                  class_weight='balanced',
                                  **best_hyperparameters)
# Fit the model
rf_final.fit(X[best_features_rf], y)

In [None]:
# Predict classes
y_pred = rf_final.predict(X[best_features_rf])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Random Forest (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = rf_final.predict_proba(X[best_features_rf])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Random Forest (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_6.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

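# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series after dropping NaNs so that the model
# and the benchmark cover the same months
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute the Sharpe ratio as mean over standard deviation of monthly returns
# (not annualized, no risk-free rate subtracted)
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()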

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Random Forest Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Random Forest Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Random Forest Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Random Forest Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

#### Multi-layer Perceptron

In [None]:
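# Initialize the Multi-layer Perceptron Classifier
# Note: with solver='lbfgs', scikit-learn ignores `learning_rate` (used only by 'sgd')
# and `early_stopping` (effective only for 'sgd' and 'adam')
mlp = MLPClassifier(solver='lbfgs',
                    learning_rate='adaptive',
                    max_iter=100000,
                    early_stopping=True)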

In [None]:
# Parameter distributions
param_dist = {
    'clf__layer_size': Integer(1, int(n_components_99/2)),
    'clf__num_layers': Integer(1, 2),
    'clf__alpha': Real(1e-6, 1e-1, prior='log-uniform'),
    'clf__activation': Categorical(['identity', 'logistic', 'tanh', 'relu'])
}

In [None]:
# Measure the time needed to execute this cell
start_time = time.time()

# Wrapper Pipeline
pipe = Pipeline([
    ('clf', MLPWrapper())
])

# Define data
X = df_pca_99_scaled
y = df_train_6['Regime']

# Lists to store results
best_params = []
best_scores = []
selected_features_list = []

# Outer loop for feature selection
for train_index_outer, test_index_outer in tscv_outer.split(X):
    X_train_outer, _ = X.iloc[train_index_outer], X.iloc[test_index_outer]
    y_train_outer, _ = y.iloc[train_index_outer], y.iloc[test_index_outer]
    
    # Backward sequential feature selection
    selector = SequentialFeatureSelector(MLPWrapper(),
                                         n_features_to_select='auto',
                                         direction='backward',
                                         tol=None,
                                         cv=tscv_outer,
                                         #scoring=scorer,
                                         n_jobs=-1)
    selector = selector.fit(X_train_outer, y_train_outer)
    selected_features = X_train_outer.columns[selector.support_]
    selected_features_list.append(selected_features)
    
    X_train_outer_selected = X_train_outer[selected_features]
    
    # Initialize the Bayes Search CV object for the internal loop
    bayes_search = BayesSearchCV(estimator=pipe,
                                 search_spaces=[(param_dist, 100)],
                                 cv=tscv_inner,
                                 #scoring=scorer,
                                 n_jobs=-1)

    # Inner loop for hyperparameter tuning
    bayes_search.fit(X_train_outer_selected, y_train_outer)
    
    # Store the best parameters and the best score
    best_params.append(bayes_search.best_params_)
    best_scores.append(bayes_search.best_score_)
    
# Find the index of the best score
best_index = np.argmax(best_scores)

# Find the best hyperparameters
best_hyperparameters = best_params[best_index]

# Find the best feature subset
best_features_mlp = selected_features_list[best_index]

# End Time
end_time = time.time()
execution_time = end_time - start_time
print("Execution time:", execution_time / 60, "min")

print("Best hyperparameters: ", best_hyperparameters)
print("Best feature subset: ", best_features_mlp)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(X[best_features_mlp])

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Selected Features")

# Show the plot
plt.show()

In [None]:
hyperparameters_dict = dict(best_hyperparameters)

activation = hyperparameters_dict['clf__activation']
alpha = hyperparameters_dict['clf__alpha']
num_layers = hyperparameters_dict['clf__num_layers']
layer_size = hyperparameters_dict['clf__layer_size']

hidden_layer_sizes = tuple(layer_size for _ in range(num_layers))

# Initialize the Multi-layer Perceptron Classifier with the tuned hyperparameters
# Note: with solver='lbfgs', scikit-learn ignores `learning_rate` (used only by 'sgd')
# and `early_stopping` (effective only for 'sgd' and 'adam')
mlp_final = MLPClassifier(solver='lbfgs',
                          learning_rate='adaptive',
                          max_iter=100000,
                          early_stopping=True,
                          activation=activation,
                          alpha=alpha,
                          hidden_layer_sizes=hidden_layer_sizes)
# Fit the model
mlp_final.fit(X[best_features_mlp], y)

In [None]:
# Predict classes
y_pred = mlp_final.predict(X[best_features_mlp])

# Convert predictions to a pandas dataframe with time index
df_predictions = pd.DataFrame(y_pred, columns=['Predictions'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_predictions.index, df_predictions['Predictions'], color='black')

# Set x-axis label
plt.xlabel("time")

# Set y-axis label
plt.ylabel("regime")

# Set plot title
plt.title("Predicted Regime by Multi-layer Perceptron (Training)")

# Show the plot
plt.show()

In [None]:
# Predict probabilities
y_proba = mlp_final.predict_proba(X[best_features_mlp])

# Convert probabilities to a pandas DataFrame with time index
df_proba = pd.DataFrame(y_proba, columns=['Probability_0', 'Probability_1'], index=df_pca.index)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the stacked graph of predicted probabilities
plt.stackplot(df_proba.index, df_proba['Probability_0'], df_proba['Probability_1'],
              labels=['Value', 'Momentum'],
              colors=['blue', 'black'])

# Add a horizontal line
plt.axhline(y=0.5, color='red', linestyle='--')

# Set x-axis label
plt.xlabel('time')

# Set y-axis label
plt.ylabel('probability')

# Set plot title
plt.title('Predicted Probabilities by Multi-layer Perceptron (Training)')

# Show a legend
plt.legend()

# Show the plot
plt.show()

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y, y_pred, normalize='all')
accuracy = accuracy_score(y, y_pred)
logloss = log_loss(y, y_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"Log Loss: {logloss}")

# Create a heat map from the confusion matrix
plt.figure(figsize=(11, 6), dpi=100)
sns.heatmap(cm, annot=True, cmap='coolwarm')

# Add labels to the plot
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')

plt.show()

In [None]:
# Merge the DataFrames based on their indexes
df_merged = df_train_6.merge(df_proba, left_index=True, right_index=True)

# Shift the probabilities by one row to use the probabilities from the previous time step
shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

# Calculate the weighted returns using the shifted probabilities
df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + 0.5 * df_merged['Momentum Returns']

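# Drop rows with NaN values (the first row has no lagged probabilities)
df_merged = df_merged.dropna()

# Extract the monthly return series after dropping NaNs so that the model
# and the benchmark cover the same months
model_monthly_returns = df_merged['Weighted Returns']
benchmark_monthly_returns = df_merged['Simple Weighted Returns']

# Compute the Sharpe ratio as mean over standard deviation of monthly returns
# (not annualized, no risk-free rate subtracted)
model_sharpe_ratio = model_monthly_returns.mean() / model_monthly_returns.std()
benchmark_sharpe_ratio = benchmark_monthly_returns.mean() / benchmark_monthly_returns.std()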

# Calculate the cumulative returns
df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

# Sharpe Ratio
print("Model Sharpe Ratio:", model_sharpe_ratio)
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the cumulative returns on a logarithmic scale
plt.plot(df_merged['Cumulative Weighted Returns'], label='Multi-layer Perceptron Model', color="red")
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Multi-layer Perceptron Model (Training)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot histograms of monthly returns
plt.hist(model_monthly_returns, bins=100, alpha=0.5, label='Multi-layer Perceptron Model', color='red')
plt.hist(benchmark_monthly_returns, bins=100, alpha=0.5, label='50/50 Benchmark', color='black')
plt.axvline(0, color='red', linestyle='--')
plt.legend()
plt.xlabel('monthly returns')
plt.ylabel('frequency')
plt.title('Return Histogram of Multi-layer Perceptron Model (Training)')
plt.show()

In [None]:
# Define the sample to test
sample1= benchmark_monthly_returns # Benchmark
sample2= model_monthly_returns # Model

# Bootstrapping test
sharpe_ratio_bootstrap_test(sample1, sample2)

## Final Testing

In [None]:
df_test_6

In [None]:
X_test = df_test_6.drop("Regime", axis=1)

# Perform scaling on the test set using the same scaler
X_test_scaled = scaler.transform(X_test)

# Perform PCA transformation on the scaled test set using the same PCA instance
X_test_pca = pca.transform(X_test_scaled)

# Cap the data
X_test_pca = X_test_pca[:, :n_components_99]

# Scale the PCA components with the scaler already fitted on the training data (transform only, no refit)
df_test_pca = pca_scaler.transform(X_test_pca)

# Create a new DataFrame with the transformed test set
df_test_pca_99 = pd.DataFrame(df_test_pca, index=X_test.index)

# Rename the columns to indicate the component number
df_test_pca_99.columns = [f"PC {i+1}" for i in range(X_test_pca.shape[1])]

# Print the transformed test set DataFrame
df_test_pca_99
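
The cell above only calls `transform` on the test set, so the scaling and PCA parameters come exclusively from the training data and the test set cannot influence the preprocessing. The sketch below illustrates that fit-on-train, transform-on-test principle with a small dependency-free standardizer. `SimpleStandardizer` is a hypothetical stand-in, not scikit-learn's `StandardScaler`, and the numbers are toy values.

```python
import statistics

class SimpleStandardizer:
    """Column-wise standardizer that remembers training statistics."""

    def fit(self, rows):
        columns = list(zip(*rows))
        self.means_ = [statistics.mean(c) for c in columns]
        # Population standard deviation (ddof=0)
        self.stds_ = [statistics.pstdev(c) for c in columns]
        return self

    def transform(self, rows):
        return [[(x - m) / s for x, m, s in zip(row, self.means_, self.stds_)]
                for row in rows]

toy_train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
toy_test = [[4.0, 40.0]]

# Statistics are estimated on the training data only
scaler_sketch = SimpleStandardizer().fit(toy_train)

# The test data is transformed with the stored training statistics
toy_train_scaled = scaler_sketch.transform(toy_train)
toy_test_scaled = scaler_sketch.transform(toy_test)
print(toy_test_scaled)
```

Refitting on the test data would instead centre the test set at zero and hide any shift between the two samples from the model.
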

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

# Plot the data
plt.plot(df_test_pca_99)

# Add a horizontal line at zero
plt.axhline(0, color='black', linestyle='--')

# Set x-axis label
plt.xlabel("time")

# Set plot title
plt.title("Scaled PCA Transformed Test Features")

# Show the plot
plt.show()

In [None]:
X = df_test_pca_99

# Predict classes
y_log_reg = log_reg_final.predict(X[best_features_log_reg])
y_svm = svm_final.predict(X[best_features_svm])
y_rf = rf_final.predict(X[best_features_rf])
y_mlp = mlp_final.predict(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for each model
colors = ['black', 'red', 'green', 'blue']

# Plot the data on each subplot with distinct colors
axs[0].plot(df_test_pca_99.index, y_log_reg, color=colors[0], label='Logistic Regression')
axs[1].plot(df_test_pca_99.index, y_svm, color=colors[1], label='SVM')
axs[2].plot(df_test_pca_99.index, y_rf, color=colors[2], label='Random Forest')
axs[3].plot(df_test_pca_99.index, y_mlp, color=colors[3], label='MLP')

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Regime by Models (Test)")

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
# Get the predicted probabilities for each model
y_log_reg_prob = log_reg_final.predict_proba(X[best_features_log_reg])
y_svm_prob = svm_final.predict_proba(X[best_features_svm])
y_rf_prob = rf_final.predict_proba(X[best_features_rf])
y_mlp_prob = mlp_final.predict_proba(X[best_features_mlp])

# Set figure size and dpi
fig, axs = plt.subplots(4, 1, figsize=(11, 11), dpi=100, sharex=True)

# Define colors for the two classes (Value, Momentum)
colors = ['blue', 'black']

# Plot the data on each subplot with distinct colors
axs[0].stackplot(df_test_pca_99.index, y_log_reg_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[1].stackplot(df_test_pca_99.index, y_svm_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[2].stackplot(df_test_pca_99.index, y_rf_prob.T, colors=colors, labels=['Value', 'Momentum'])
axs[3].stackplot(df_test_pca_99.index, y_mlp_prob.T, colors=colors, labels=['Value', 'Momentum'])

# Set x-axis label for the bottom subplot
axs[3].set_xlabel("time")

# Set y-axis label for each subplot
axs[0].set_ylabel("Logistic Regression")
axs[1].set_ylabel("Support Vector Machine")
axs[2].set_ylabel("Random Forest")
axs[3].set_ylabel("Multi-Layer Perceptron")

# Set plot title
plt.suptitle("Predicted Probabilities by Models (Test)")

# Add legend to the plot
handles, labels = axs[0].get_legend_handles_labels()
fig.legend(handles, labels)

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()

In [None]:
y = df_test_6["Regime"]

# Calculate confusion matrices for each model
cm1 = confusion_matrix(y, y_log_reg, normalize='all')
cm2 = confusion_matrix(y, y_svm, normalize='all')
cm3 = confusion_matrix(y, y_rf, normalize='all')
cm4 = confusion_matrix(y, y_mlp, normalize='all')

# Set figure size and dpi
fig, axs = plt.subplots(2, 2, figsize=(11, 11), dpi=100)

# Generate confusion matrix and metrics for each model
cms = [cm1, cm2, cm3, cm4]
models = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']
metrics = []

for i, cm in enumerate(cms):
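    # Calculate metrics from the normalized confusion matrix
    accuracy = np.diag(cm).sum() / cm.sum()
    # Negative log of the per-class recall; this is not the log loss, which
    # requires predicted probabilities and is computed in the next cell
    neg_log_recall = -np.log(np.diag(cm) / np.sum(cm, axis=1))
    
    # Store metrics
    metrics.append((accuracy, neg_log_recall))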
    
    # Plot confusion matrix
    ax = axs[i // 2, i % 2]
    sns.heatmap(cm, annot=True, cmap='coolwarm', ax=ax)
    ax.set_title(f'{models[i]}')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

# Adjust spacing between subplots
plt.tight_layout()

# Show the plot
plt.show()
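
With `normalize='all'`, every cell of the confusion matrix is a share of all observations, so the diagonal sums to the accuracy and each diagonal entry divided by its row sum gives the per-class recall. A dependency-free sketch with made-up labels follows; `normalized_confusion_matrix` is a hypothetical helper and not scikit-learn's function.

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes=2):
    """Confusion matrix whose entries sum to one (rows: actual, columns: predicted)."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    total = len(y_true)
    return [[c / total for c in row] for row in counts]

# Hypothetical toy labels: 0 = Value, 1 = Momentum
toy_actual = [0, 0, 1, 1, 1]
toy_predicted = [0, 1, 1, 1, 0]

cm_sketch = normalized_confusion_matrix(toy_actual, toy_predicted)

# Accuracy is the sum of the diagonal shares
accuracy_from_cm = sum(cm_sketch[i][i] for i in range(2))

# Recall per class is the diagonal share divided by the row total
recall_per_class = [cm_sketch[i][i] / sum(cm_sketch[i]) for i in range(2)]

print(cm_sketch)
print("Accuracy:", accuracy_from_cm)
print("Recall per class:", recall_per_class)
```
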

In [None]:
# Ground-truth regimes of the test set
y_true = df_test_6["Regime"]

# Calculate accuracy scores
log_reg_acc = accuracy_score(y_true, y_log_reg)
svm_acc = accuracy_score(y_true, y_svm)
rf_acc = accuracy_score(y_true, y_rf)
mlp_acc = accuracy_score(y_true, y_mlp)

# Calculate log loss scores
log_reg_loss = log_loss(y_true, y_log_reg_prob)
svm_loss = log_loss(y_true, y_svm_prob)
rf_loss = log_loss(y_true, y_rf_prob)
mlp_loss = log_loss(y_true, y_mlp_prob)

# Create a dataframe to display
df = pd.DataFrame({
    'Model': ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'],
    'Accuracy': [log_reg_acc, svm_acc, rf_acc, mlp_acc],
    'Log Loss': [log_reg_loss, svm_loss, rf_loss, mlp_loss]
})

display(df)

In [None]:
# Set figure size and dpi
plt.figure(figsize=(11, 6), dpi=100)

df_log_reg_prob = pd.DataFrame(y_log_reg_prob, index=df_test_6.index, columns=['Probability_0', 'Probability_1'])
df_svm_prob = pd.DataFrame(y_svm_prob, index=df_test_6.index, columns=['Probability_0', 'Probability_1'])
df_rf_prob = pd.DataFrame(y_rf_prob, index=df_test_6.index, columns=['Probability_0', 'Probability_1'])
df_mlp_prob = pd.DataFrame(y_mlp_prob, index=df_test_6.index, columns=['Probability_0', 'Probability_1'])

# Repeat the process for each model
for df_prob, model_name in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                               ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_6.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_test_6['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_test_6['Momentum Returns']
    
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_test_6['Value Returns'] + \
                                           0.5 * df_test_6['Momentum Returns']
    
    # Compute Sharpe ratios
    model_sharpe_ratio = df_merged['Weighted Returns'].mean() / df_merged['Weighted Returns'].std()

    # Print Sharpe ratios
    print(f"{model_name} Sharpe Ratio:", model_sharpe_ratio)

    # Calculate the cumulative returns
    df_merged['Cumulative Weighted Returns'] = (1 + df_merged['Weighted Returns']).cumprod()
    df_merged['Cumulative Simple Weighted Returns'] = (1 + df_merged['Simple Weighted Returns']).cumprod()

    # Plot the cumulative returns
    plt.plot(df_merged['Cumulative Weighted Returns'], label=model_name)

# Plot the 50/50 benchmark
plt.plot(df_merged['Cumulative Simple Weighted Returns'], label='50/50 Benchmark', color='black')

# Add a horizontal line at one (the starting value of the cumulative returns)
plt.axhline(1, color='black', linestyle='--')

plt.legend()
plt.xlabel('time')
plt.ylabel('returns')
plt.title('Model Comparison (Test)')
plt.yscale('log')  # Set y-axis scale to logarithmic
plt.show()

benchmark_sharpe_ratio = df_merged['Simple Weighted Returns'].mean() / df_merged['Simple Weighted Returns'].std()
print("Benchmark Sharpe Ratio:", benchmark_sharpe_ratio)

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(11, 12), dpi=100)

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Repeat the process for each model
for df_prob, model_name, ax in zip([df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob], 
                                   ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron'], 
                                   axes):
    # Merge the DataFrames based on their indexes
    df_merged = df_test_6.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']

    # Calculate the simple 50/50 weighted return stream
    df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                           0.5 * df_merged['Momentum Returns']

    # No shift is needed for the benchmark: dropping the NaN rows below
    # aligns it with the lagged model returns

    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Plot histograms of monthly returns for the model and the benchmark
    ax.hist(df_merged['Weighted Returns'], bins=100, alpha=0.5, label=model_name, color='red')
    ax.hist(df_merged['Simple Weighted Returns'], bins=100, alpha=0.5, label='50/50 Benchmark', color='black')

    # Add a vertical line at zero
    ax.axvline(0, color='red', linestyle='--')
    
    ax.legend()
    ax.set_xlabel('monthly returns')
    ax.set_ylabel('frequency')
    ax.set_title(f'Return Histogram of {model_name} (Test)')

plt.tight_layout()
plt.show()

In [None]:
# Prepare lists to store model returns and model names
model_returns = []
model_names = ['Logistic Regression', 'Support Vector Machine', 'Random Forest', 'Multi-layer Perceptron']

# Get returns for each model
for df_prob in [df_log_reg_prob, df_svm_prob, df_rf_prob, df_mlp_prob]:
    # Merge the DataFrames based on their indexes
    df_merged = df_test_6.merge(df_prob, left_index=True, right_index=True)

    # Shift the probabilities by one row to use the probabilities from the previous time step
    shifted_probabilities = df_merged[['Probability_0', 'Probability_1']].shift()

    # Calculate the weighted returns using the shifted probabilities
    df_merged['Weighted Returns'] = shifted_probabilities['Probability_0'] * df_merged['Value Returns'] + \
                                    shifted_probabilities['Probability_1'] * df_merged['Momentum Returns']
    
    # Drop rows with NaN values
    df_merged = df_merged.dropna()

    # Append model returns to the list
    model_returns.append(df_merged['Weighted Returns'])


# Calculate the simple 50/50 weighted return stream
df_merged['Simple Weighted Returns'] = 0.5 * df_merged['Value Returns'] + \
                                       0.5 * df_merged['Momentum Returns']

benchmark_returns = df_merged['Simple Weighted Returns']
    
# Run bootstrap test
multi_sharpe_ratio_bootstrap_test(benchmark_returns, model_returns, model_names)