## Overview
In this project, I've engineered an adaptive machine learning algorithm that undergoes biannual recalibration to select the most accurate model for sector-based investment strategies. To counteract the pitfalls of over-forecasting, the algorithm employs a custom loss function that penalizes overpredictions. It integrates a diverse range of financial indicators, including equity, debt, commodities, and market volatility. To enhance computational efficiency and model precision, I employed Principal Component Analysis for feature reduction. The model's robustness was substantiated through a 15-year backtest, during which it outperformed the SPY ETF by an estimated 91.85%. The finalized, vetted model has been encapsulated in a real-time dashboard.

## Business Understanding: Adaptive Sector Selection
The mercurial landscape of the financial markets warrants strategies that are dynamic and adaptive, but even strategies like sector rotation often fall short due to their reliance on static heuristics. This project mitigates such limitations by employing a machine learning-driven "model of models" framework. This ensemble of algorithms undergoes biannual retraining and evaluation. The best-performing model is then selected for the next six-month cycle, ensuring the investment strategy continually adapts to current market conditions.

Once the leading model is identified, it selects the investment sector based on its predicted mean returns, specifically targeting the sector forecasted to yield the highest return. This dynamic, model-driven sector selection aims to optimize investment outcomes by leveraging timely and precise machine learning predictions.

The strategy is then tested via a 15-year backtest, offering empirical validation of its sector-based approach. Thus, the framework's utility manifests in its ability to not only adapt to market vicissitudes but also pinpoint the most promising sectors for investment based on forecasts.

---

## 2. Modeling
This notebook goes through the process of modeling:
1. [Custom Error Function: Over-Under Error](#custom-error-function-over-under-error)
2. [The Nature of Walk Forward Cross-Validation and Training](#the-nature-of-walk-forward-cross-validation-and-training)
3. ["Model of Models" Architecture](#model-of-models-architecture)
4. [Modeling](#modeling)
    - [The Use of PCA](#the-use-of-pca)
    - [Naive Model](#naive-model)
    - [`ARIMAX` Model](#arimax-model)
    - [`sklearn` Models](#sk)

Below are the necessary imports.

In [3]:
# Data manipulation and analysis
import pandas as pd
import numpy as np

# Custom modules and functions
from capstone.model_selection import (
    overunder_error, 
    naive_cross_val_score, 
    arimax_cross_val_score
)

from capstone.utils import read_file, get_sectors, set_plot_style

# SARIMAX model from statsmodels
from statsmodels.tsa.statespace.sarimax import SARIMAX

# sklearn models
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Progress bar for loops
from tqdm.auto import tqdm

# Ignore warnings
from warnings import filterwarnings
filterwarnings("ignore")

set_plot_style()

### Custom Error Function: Over-Under Error

The Over-Under Error (OUE) calculates a loss based on the differences between true and predicted portfolio values. It is given by the formula:

$$\text{loss} = 
\begin{cases} 
\text{underpred penalty} \times \left| \text{residual} \right|^\alpha & \text{if residual} < 0 \\
\text{overpred penalty} \times \left| \text{residual} \right|^\alpha & \text{otherwise}
\end{cases}$$

Read below for the documentation or visit the [`capstone.model_selection`](capstone/model_selection.py) module for the source code.

In [4]:
help(overunder_error)

Help on function overunder_error in module capstone.model_selection:

overunder_error(y_true, y_pred, underpred_penalty=1.0, overpred_penalty=1.0, alpha=0.5)
    Calculate the Over-Under Error for portfolio optimization.
    
    Parameters:
        y_true (array-like): True portfolio values.
        y_pred (array-like): Predicted portfolio values.
        underpred_penalty (float, optional): Penalty factor for underpredictions. Default is 1.0.
        overpred_penalty (float, optional): Penalty factor for overpredictions. Default is 1.0.
        alpha (float, optional): Exponent for residual calculation. Default is 0.5.
    
    Returns:
        float: Mean Over-Under Error.
    
    The Over-Under Error is a custom loss function written for portfolio optimization. It calculates a loss based on the 
    differences between true and predicted portfolio values. The function allows for penalties on overpredictions and 
    underpredictions using the 'overpred_penalty' and 'underpred_pena

Though the function allows parameterization of the overprediction penalty (`overpred_penalty`) and underprediction penalty (`underpred_penalty`), for the purposes of a long-only portfolio I've set `underpred_penalty = 0` and `overpred_penalty = 2`. This reflects a specific calculus, namely:

1. **Underprediction Tolerance**: In a long-only portfolio, underpredicting an asset's performance isn't inherently detrimental. If an asset outperforms your model's predictions, the consequence is a positive surprise. Hence, `underpred_penalty = 0`.
2. **Overprediction Risk**: Overestimating an asset's performance, however, can be more damaging in a long-only strategy. It may lead to an allocation that is disproportionately heavy in underperforming assets. This can hamper the portfolio's overall returns and increase its risk profile. Therefore, `overpred_penalty = 2` provides a stringent control mechanism for overoptimistic forecasts.
3. **Risk Aversion**: The higher overprediction penalty factor acts as a risk management tool, pushing the model toward conservative estimates and mitigating the impact of potential overallocations.
4. **Resource Allocation**: Penalizing overprediction guides the model to allocate resources more judiciously, reinforcing positions in assets that have a more reliable performance outlook.

That said, whether overpredicting or underpredicting is detrimental to the strategy is still contingent on one's specific risk tolerance, so feel free to play around with the parameters.
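To make the formula and the penalty choice concrete, here is a minimal sketch of an asymmetric loss of this shape. It is not the source of `capstone.model_selection.overunder_error`; the name `overunder_error_sketch` is hypothetical, and the sign convention (residual taken as prediction minus truth, so a negative residual is an underprediction) is an assumption chosen to match the case split in the formula above.

```python
def overunder_error_sketch(y_true, y_pred, underpred_penalty=1.0,
                           overpred_penalty=1.0, alpha=0.5):
    """Mean asymmetric loss; the residual is ASSUMED to be y_pred - y_true."""
    losses = []
    for actual, predicted in zip(y_true, y_pred):
        # Assumption: a negative residual means the model underpredicted
        residual = predicted - actual
        penalty = underpred_penalty if residual < 0 else overpred_penalty
        losses.append(penalty * abs(residual) ** alpha)
    return sum(losses) / len(losses)


# Hypothetical returns: the first is underpredicted, the second overpredicted
y_true = [0.02, -0.01]
y_pred = [0.01, 0.03]

# Long-only setting described above: ignore underpredictions, double overpredictions
long_only_loss = overunder_error_sketch(
    y_true, y_pred, underpred_penalty=0, overpred_penalty=2
)
```

With these made-up numbers, the underprediction contributes nothing, while the overprediction of 0.04 contributes 2 × 0.04^0.5 = 0.4, so the mean loss is about 0.2.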

### The Nature of Walk Forward Cross-Validation and Training

Walk-Forward Cross-Validation (WFCV) is a form of time-series cross-validation that simulates a realistic trading environment. Unlike traditional cross-validation methods, which randomly partition the data, WFCV respects the temporal order of observations: each model is trained only on data that precedes its test window.

![Walk-Forward Cross Validation](img/wfcv.png)

WFCV can be "rolling" or "expanding". The picture above shows an expanding approach, where each testing window gets added to the subsequent training window; the training window thus "expands". In a rolling approach, the training window "rolls" forward, maintaining a fixed size. As new data points become available for testing, the earliest data points in the training set are removed to keep its size constant.

In this project I use the expanding approach, as the validation slice per time frame is small (126 days). The algorithm thus needs as much data as it can get from each time frame.
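The difference between the two approaches can be sketched with a small index generator. This is an illustrative helper, not the splitter used inside `capstone.model_selection`; the function name and its parameters are hypothetical.

```python
def walk_forward_splits(n_samples, test_size, min_train_size, expanding=True):
    """Yield (train_indices, test_indices) pairs in temporal order."""
    test_start = min_train_size
    while test_start + test_size <= n_samples:
        # Expanding: training always starts at 0.
        # Rolling: training keeps only the last `min_train_size` points.
        train_start = 0 if expanding else test_start - min_train_size
        yield range(train_start, test_start), range(test_start, test_start + test_size)
        test_start += test_size


# Toy series of 10 observations, test windows of 2, initial training window of 4
expanding_folds = list(walk_forward_splits(10, test_size=2, min_train_size=4))
rolling_folds = list(
    walk_forward_splits(10, test_size=2, min_train_size=4, expanding=False)
)
```

In the expanding folds the training window grows from 4 to 6 to 8 observations, whereas in the rolling folds it stays at 4 and slides forward. In both cases every test index comes strictly after every training index.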

### "Model of Models" Architecture

The framework in this project employs a dynamic "Model of Models" architecture that re-trains each constituent model biannually, using data from the preceding six months. The model yielding the best Over-Under Error (OUE) score is selected as the lead model for the subsequent period. This chosen model identifies the most promising sector for investment based on the mean of its predicted returns. Stocks from the chosen sector are identified via their GICS segmentation.
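The selection step for a single six-month period can be sketched as follows. The function names and all numbers are hypothetical, and the sketch assumes that the "best" OUE score is the lowest one, since OUE is a loss.

```python
def select_lead_model(oue_by_model):
    """Return the model name with the lowest OUE (assumes lower is better)."""
    return min(oue_by_model, key=oue_by_model.get)


def select_sector(mean_pred_by_sector):
    """Return the sector with the highest mean predicted return."""
    return max(mean_pred_by_sector, key=mean_pred_by_sector.get)


# Hypothetical cross-validated OUE scores for one period
oue_by_model = {"Naive": 0.115, "ARIMAX": 0.098}

# Hypothetical mean predicted returns per sector, per model
preds_by_model = {
    "Naive": {"ENERGY": 0.012, "MATERIALS": 0.004},
    "ARIMAX": {"ENERGY": 0.003, "MATERIALS": 0.007},
}

lead_model = select_lead_model(oue_by_model)
chosen_sector = select_sector(preds_by_model[lead_model])
```

Here the ARIMAX entry has the lower loss, so it becomes the lead model, and its highest mean prediction points to MATERIALS even though the naive model would have picked ENERGY.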

### Modeling

Before we get to modeling, we'll need to load `master_df.csv` from `data`, after which we'll separate it into its features and targets.

In [5]:
# Load in files
sectors = get_sectors()
df = read_file("master_df", index_col=0)

# Separate `df` into features and targets
y_all = df[sectors]
X = df[df.columns[~df.columns.isin(sectors)]]

X.shape, y_all.shape

((4230, 109), (4230, 11))

#### Naive Model

We'll start with our naive model. This model uses the past 6 months of returns *as* the forecasted returns and will serve as a baseline against which we can assess the efficacy of the more sophisticated models. Below is the documentation for the cross-validation function I wrote for the naive model. Visit the [`capstone.model_selection`](capstone/model_selection.py) module to see the source code.

In [6]:
help(naive_cross_val_score)

Help on function naive_cross_val_score in module capstone.model_selection:

naive_cross_val_score(r_true: pandas.core.series.Series, r_hat: pandas.core.series.Series, cv: int, scorer: Callable, **scorer_kwargs: Optional[Any]) -> List[float]
    Perform time-series cross-validation using a naive forecast model.
    
    Parameters:
        - r_true (Series): Actual target values.
        - r_hat (Series): Predicted target values.
        - cv (int): Number of splits/folds for time-series cross-validation.
        - scorer (Callable): Scoring function to evaluate the predictions. 
                             Must take two arrays 'y_true' and 'y_pred' as arguments,
                             along with any additional keyword arguments (**scorer_kwargs).
        - **scorer_kwargs (Optional[Any]): Additional keyword arguments to pass to the scoring function.
    
    Returns:
        - cv_scores (List[float]): List of scores calculated for each fold during cross-validation.
    
    This

In [8]:
# Define the forecast horizon as half of a 252-trading-day year (126 days)
trading_days = 252
forecast = int(trading_days / 2)

# Shift returns for forecasting, align indices
returns_shifted = y_all.shift(forecast).dropna()
returns_reind = y_all.reindex(returns_shifted.index)

# Initialize output DataFrames
naive_oues = pd.DataFrame()
naive_preds = pd.DataFrame()

# Loop through sectors
for sector in sectors:
    r_trues = returns_reind[sector]
    r_hats = returns_shifted[sector]
    
    # Time-chunk loop
    for i in range(forecast + 1, len(returns_reind), forecast):
        r_hat = r_hats.iloc[i-forecast:i]
        r_true = r_trues.iloc[i-forecast:i]
        
        # Calculate and store mean over-under loss
        mean_oul = np.mean(
            naive_cross_val_score(
                r_true, r_hat, cv=2, scorer=overunder_error,
                overpred_penalty=2, underpred_penalty=0
            )
        )

        naive_oues.loc[r_hat.index.max(), sector] = mean_oul
        naive_preds.loc[r_hat.index.max(), sector] = np.mean(r_hat)

In [11]:
# Take the mean Over-Under Error (OUE) across sectors
mean_naive_oues = pd.DataFrame(naive_oues.mean(axis=1), columns=["Naive"])
mean_naive_oues.tail()

| | Naive |
|---|---|
| 2021-05-27 | 0.112676 |
| 2021-11-24 | 0.089535 |
| 2022-05-26 | 0.11422 |
| 2022-11-25 | 0.126298 |
| 2023-05-30 | 0.114692 |


In [13]:
# Take the sector with the highest predicted mean return
naive_sectors = pd.DataFrame(naive_preds.idxmax(axis=1), columns=["Naive"])
naive_sectors.tail()

| | Naive |
|---|---|
| 2021-05-27 | MATERIALS |
| 2021-11-24 | ENERGY |
| 2022-05-26 | ENERGY |
| 2022-11-25 | ENERGY |
| 2023-05-30 | INDUSTRIALS |


#### ARIMAX Model

Now that we have a baseline to compare against, we'll move on to our ARIMAX model. ARIMA (AutoRegressive Integrated Moving Average) is a time-series forecasting model that predicts future points from lagged observations of the series and lagged forecast errors. ARIMAX extends ARIMA by incorporating external (X) variables, allowing the model to capture additional influencing factors not present in the time series itself. Learn more about ARIMAX [here](https://www.smarten.com/blog/arimax-forecasting-enterprise-analysis/#:~:text=An%20Autoregressive%20Integrated%20Moving%20Average,moving%20average%20(MA)%20terms.).
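As a sketch, for the order (1, 0, 1) used in this notebook, and in the regression-with-ARMA-errors form that statsmodels' `SARIMAX` fits by default when `exog` is supplied, the model can be written as below. The symbols are notation introduced here for illustration: $y_t$ is the sector return, $x_t$ the vector of exogenous features (here, the PCA components), $\beta$ their coefficients, $\phi_1$ and $\theta_1$ the autoregressive and moving-average coefficients, and $\varepsilon_t$ white noise.

$$\begin{aligned}
y_t &= \beta^\top x_t + u_t \\
u_t &= \phi_1 u_{t-1} + \varepsilon_t + \theta_1 \varepsilon_{t-1}
\end{aligned}$$

Because the middle term of the order is 0, no differencing is applied to $y_t$.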

In [34]:
# Shift the features to match the forecast horizon, and drop any missing values
X_shifted = X.shift(forecast).dropna()

# Create a pipeline for standardizing and applying PCA
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=.8, random_state=42))

# Define the non-seasonal ARIMA order (p, d, q); no seasonal component is used
order = (1, 0, 1) # Returns are usually stationary, so no differencing applied

# Initialize empty DataFrames to store predictions and over-under loss scores
arimax_preds = pd.DataFrame()
arimax_ouls = pd.DataFrame()

# Loop through each sector
for sector in tqdm(sectors):

    # Extract the target variable for the current sector
    y = y_all[sector].reindex(X_shifted.index)

    # Loop through the data with a window equal to the forecast horizon
    for i in range(forecast, len(y), forecast):
        
        # Split the data into training and testing sets
        X_train, X_test = X_shifted.iloc[i-forecast:i], X_shifted.iloc[i:i+forecast]
        y_train, y_test = y.iloc[i-forecast:i], y.iloc[i:i+forecast]

        # Apply PCA to the training and testing feature sets
        X_train_pca = pca_pipe.fit_transform(X_train)
        X_test_pca = pca_pipe.transform(X_test)

        # Perform time-series cross-validation and calculate the mean over-under loss
        mean_oul = np.mean(
            arimax_cross_val_score(
                X_train,
                y_train,
                order=order,
                pca=pca_pipe,
                cv=2,
                scorer=overunder_error,
                overpred_penalty=2,
                underpred_penalty=0
            )
        )
        
        # Store the mean over-under loss score
        arimax_ouls.loc[X_test.index.min(), sector] = mean_oul

        # Fit the ARIMAX model to the training data
        model = SARIMAX(y_train.values, X_train_pca, order=order).fit()

        # Generate forecasts for the testing data
        forecast_results = model.get_forecast(steps=len(X_test_pca), exog=X_test_pca)
        y_hat = forecast_results.predicted_mean

        # Store the mean forecasted value
        arimax_preds.loc[X_test.index.min(), sector] = np.mean(y_hat)
