### Forecasting BIG HiBM Returns using XGBoost

_In this notebook, we forecast the monthly returns of the BIG HiBM portfolio with XGBoost. We first engineer features from lagged returns and rolling statistics, then split the data into a training set (through December 2015) and a test set (January 2016 onward). We then tune the model with time-series-aware cross-validation, evaluate it on the test set, and plot the forecasts against the actual returns._

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit
from sklearn.metrics import mean_absolute_error, mean_squared_error

#### Feature Engineering

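The function below creates lag features for the previous 1, 2, and 3 months, as well as the rolling mean and standard deviation over 3- and 6-month windows. As written, each rolling window ends at the current month, so it includes the target value `y` itself.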

In [2]:
def create_features(series, lags=[1, 2, 3], windows=[3, 6]):
    """
    Generate lagged features and rolling statistics for a time series.
    
    Parameters:
        series (pd.Series): Time series data.
        lags (list): List of lag periods to include.
        windows (list): List of rolling window sizes (in months) for computing statistics.
        
    Returns:
        pd.DataFrame: DataFrame containing the target variable and generated features.
    """
    df = pd.DataFrame({'y': series})
    # Create lag features
    for lag in lags:
        df[f'lag_{lag}'] = df['y'].shift(lag)
    # Create rolling features: rolling mean and std
    for window in windows:
        df[f'roll_mean_{window}'] = df['y'].rolling(window=window).mean()
        df[f'roll_std_{window}'] = df['y'].rolling(window=window).std()
    df.dropna(inplace=True)
    return df
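
Because the rolling windows in `create_features` end at the current row, the rolling mean and standard deviation contain the value being predicted. A minimal sketch of a variant that uses only past observations is shown here. The name `create_lagged_features` and the synthetic series are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def create_lagged_features(series, lags=(1, 2, 3), windows=(3, 6)):
    """Hypothetical leak-free variant of create_features."""
    df = pd.DataFrame({'y': series})
    # Lag features: the value observed `lag` periods earlier
    for lag in lags:
        df[f'lag_{lag}'] = df['y'].shift(lag)
    # Shift by one period first so each window covers t-window .. t-1 only
    past = df['y'].shift(1)
    for window in windows:
        df[f'roll_mean_{window}'] = past.rolling(window=window).mean()
        df[f'roll_std_{window}'] = past.rolling(window=window).std()
    return df.dropna()


# Synthetic monthly series, used only to illustrate the behaviour
idx = pd.date_range('2000-01-01', periods=12, freq='MS')
demo = pd.Series(np.arange(12, dtype=float), index=idx)
demo_features = create_lagged_features(demo)
print(demo_features.head())
```

With this variant, changing the return of a given month leaves that month's feature row unchanged, which is the property a forecasting feature set needs. Note that the shift costs one extra row at the start of the sample compared with the original function.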

#### Loading the data

In [4]:
file_path = r"C:\Users\GIORDANO\Desktop\financial-time-series-forecasting\data\selected_portfolios.csv"

df = pd.read_csv(file_path, parse_dates=True, index_col='date')
df.index = pd.DatetimeIndex(df.index)
df.index.freq = 'MS' 
big_returns = df['BIG HiBM'].dropna()
big_returns

date
1990-07-01    0.0139
1990-08-01   -0.1021
1990-09-01   -0.1160
1990-10-01    0.0874
1990-11-01   -0.0345
               ...  
2024-08-01    0.0148
2024-09-01    0.0069
2024-10-01   -0.0071
2024-11-01    0.0502
2024-12-01   -0.0507
Freq: MS, Name: BIG HiBM, Length: 414, dtype: float64

In [5]:
# Generating features using the create_features function
feature_df = create_features(big_returns, lags=[1,2,3], windows=[3,6])
print("Feature DataFrame shape:", feature_df.shape)
print(feature_df.head())

Feature DataFrame shape: (409, 8)
                 y   lag_1   lag_2   lag_3  roll_mean_3  roll_std_3  \
date                                                                  
1990-12-01  0.0028 -0.0345  0.0874 -0.1160     0.018567    0.062461   
1991-01-01  0.0361  0.0028 -0.0345  0.0874     0.001467    0.035319   
1991-02-01  0.1037  0.0361  0.0028 -0.0345     0.047533    0.051412   
1991-03-01 -0.0338  0.1037  0.0361  0.0028     0.035333    0.068753   
1991-04-01  0.0088 -0.0338  0.1037  0.0361     0.026233    0.070388   

            roll_mean_6  roll_std_6  
date                                 
1990-12-01     -0.02475    0.076466  
1991-01-01     -0.02105    0.079198  
1991-02-01      0.01325    0.081604  
1991-03-01      0.02695    0.059462  
1991-04-01      0.01385    0.051622  


In [6]:
# Split the data into training (up to Dec 2015) and testing (from Jan 2016 onward)
train_df = feature_df.loc[:'2015-12-31']
test_df = feature_df.loc['2016-01-01':]

In [7]:
# Split the data into target and features
X_train, y_train = train_df.drop(columns='y'), train_df['y']
X_test, y_test = test_df.drop(columns='y'), test_df['y']

print("Training set:", X_train.shape, y_train.shape)
print("Testing set:", X_test.shape, y_test.shape)

Training set: (301, 7) (301,)
Testing set: (108, 7) (108,)


#### Model Training using XGBoost

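We tune the hyperparameters with RandomizedSearchCV, which samples 20 of the 108 combinations in the grid below and scores each one by negative mean squared error. Cross-validation uses TimeSeriesSplit with 5 folds, so every validation fold lies strictly after the data used to fit the model and the temporal order is respected.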
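
To make the temporal ordering concrete, the sketch below prints the row ranges that `TimeSeriesSplit` assigns to each fold. It uses a placeholder array with 301 rows (the training-set length reported above) rather than the real features. The names `X_placeholder`, `tscv_demo`, and `fold_bounds` are illustrative only.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 301 matches the training-set length; the array content is irrelevant here
n_train = 301
X_placeholder = np.zeros((n_train, 1))

tscv_demo = TimeSeriesSplit(n_splits=5)
fold_bounds = []
for fold, (train_idx, val_idx) in enumerate(tscv_demo.split(X_placeholder), start=1):
    # Record first/last row of the training and validation parts of each fold
    fold_bounds.append((train_idx[0], train_idx[-1], val_idx[0], val_idx[-1]))
    print(f"Fold {fold}: train rows {train_idx[0]}-{train_idx[-1]}, "
          f"validation rows {val_idx[0]}-{val_idx[-1]}")
```

Each training window starts at row 0 and grows from fold to fold, while the validation block always follows it directly, so no fold is validated on data that precedes its training period.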


In [8]:
# Define a parameter grid for XGBoost
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}

# Initialize the XGBoost regressor
xgb_model = xgb.XGBRegressor(random_state=42)

# Use TimeSeriesSplit for cross-validation
tscv = TimeSeriesSplit(n_splits=5)

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=xgb_model,
    param_distributions=param_grid,
    n_iter=20,
    scoring='neg_mean_squared_error',
    cv=tscv,
    random_state=42,
    n_jobs=-1
)

In [9]:
random_search.fit(X_train, y_train)
print("Best parameters found:", random_search.best_params_)
best_xgb = random_search.best_estimator_

Best parameters found: {'subsample': 0.8, 'n_estimators': 300, 'max_depth': 7, 'learning_rate': 0.1, 'colsample_bytree': 1.0}
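
Since the search is scored with `neg_mean_squared_error`, `best_score_` is a negative MSE averaged over the folds, and taking the square root of its negation gives a cross-validated RMSE on the same scale as the returns. The sketch below shows the conversion on synthetic data, with `Ridge` as a lightweight stand-in for `XGBRegressor`. The data, the `alpha` grid, and the names `search_demo` and `cv_rmse` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic regression data, not the portfolio returns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = 0.5 * X_demo[:, 0] + rng.normal(scale=0.1, size=120)

search_demo = RandomizedSearchCV(
    estimator=Ridge(),  # stand-in estimator; the notebook uses XGBRegressor
    param_distributions={'alpha': [0.1, 1.0, 10.0]},
    n_iter=3,
    scoring='neg_mean_squared_error',
    cv=TimeSeriesSplit(n_splits=3),
    random_state=42,
)
search_demo.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE across folds for the best candidate
cv_rmse = np.sqrt(-search_demo.best_score_)
print(f"Best CV score (negative MSE): {search_demo.best_score_:.5f}")
print(f"Equivalent CV RMSE: {cv_rmse:.5f}")
```

In the notebook, the same conversion would apply to `random_search.best_score_`. The result is the root of the mean fold MSE, which is not identical to the mean of the per-fold RMSEs.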


#### Model Evaluation

In [10]:
predictions = best_xgb.predict(X_test)

mae = mean_absolute_error(y_test, predictions)
rmse = np.sqrt(mean_squared_error(y_test, predictions))

print(f"XGBoost Forecast Performance:\nMAE: {mae:.4f}\nRMSE: {rmse:.4f}")


XGBoost Forecast Performance:
MAE: 0.0151
RMSE: 0.0229
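
MAE and RMSE are easier to interpret next to a naive benchmark. The sketch below computes the same two metrics for a zero forecast and for a constant forecast equal to the training-set mean. The helper `baseline_metrics` and the synthetic series are assumptions for illustration. In the notebook, `y_train` and `y_test` would be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error


def baseline_metrics(y_train, y_test):
    """Hypothetical helper: MAE and RMSE of two naive forecasts."""
    forecasts = {
        'zero': np.zeros(len(y_test)),                        # always predict 0
        'train_mean': np.full(len(y_test), y_train.mean()),   # always predict the training mean
    }
    rows = {}
    for name, pred in forecasts.items():
        rows[name] = {
            'MAE': mean_absolute_error(y_test, pred),
            'RMSE': np.sqrt(mean_squared_error(y_test, pred)),
        }
    # One row per baseline, one column per metric
    return pd.DataFrame(rows).T


# Synthetic returns stand in for y_train / y_test from the notebook
rng = np.random.default_rng(0)
y_train_demo = pd.Series(rng.normal(0.01, 0.05, size=301))
y_test_demo = pd.Series(rng.normal(0.01, 0.05, size=108))
baseline_table = baseline_metrics(y_train_demo, y_test_demo)
print(baseline_table)
```

A model that does not clearly beat these baselines on the test set adds little forecasting value, whatever its absolute error looks like.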


#### Visualizing the forecasts

In [12]:
plt.figure(figsize=(12, 6))
plt.plot(y_test.index, y_test, label='Actual Returns', color='blue')
plt.plot(y_test.index, predictions, label='XGBoost Predictions', color='red', linestyle='--')
plt.title('XGBoost Forecast for BIG HiBM Returns')
plt.xlabel('Date')
plt.ylabel('Return (decimal)')
plt.legend()
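import os
os.makedirs('plots', exist_ok=True)  # savefig raises an error if the output directory is missing
plt.savefig('plots/xgboost_forecast.png')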
plt.close()