# Predicting Next-Month Excess Returns with ML

In this notebook, we train machine learning models to predict next-month excess returns for a set of stocks. 

We use the feature matrix created in the `ml_features` notebook, which includes:

- Momentum indicators (`12-1`, `6-1`, `3-1`)
- 3-month rolling volatility
- Lagged Fama-French 5 factors
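
The momentum and volatility features listed above can be sketched as follows. This is a minimal illustration, not the code from the `ml_features` notebook. It assumes a monthly simple-return series for one stock, and it reads `12-1` as the cumulative return over the last 12 months with the most recent month skipped (likewise for `6-1` and `3-1`). The names `momentum`, `build_features`, `mom_12_1`, `mom_6_1`, `mom_3_1`, and `vol_3m` are hypothetical.

```python
import numpy as np
import pandas as pd


def momentum(returns: pd.Series, lookback: int, skip: int = 1) -> pd.Series:
    """Cumulative simple return over months t-lookback+1 .. t-skip (assumed convention)."""
    log_r = np.log1p(returns)
    # Shift first so the most recent `skip` months are excluded from the window.
    summed = log_r.shift(skip).rolling(lookback - skip).sum()
    return np.expm1(summed)


def build_features(returns: pd.Series) -> pd.DataFrame:
    """Hypothetical feature builder for one stock's monthly return series."""
    return pd.DataFrame(
        {
            "mom_12_1": momentum(returns, 12),
            "mom_6_1": momentum(returns, 6),
            "mom_3_1": momentum(returns, 3),
            "vol_3m": returns.rolling(3).std(),  # 3-month rolling volatility
        }
    )


# Synthetic example: 24 months of a constant 1% return.
monthly = pd.Series([0.01] * 24)
features = build_features(monthly)
print(features.tail(3))
```

The first rows of each column are `NaN` until the corresponding window fills, which is presumably one reason the loading cell calls `dropna()`.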

We compare the performance of three models:
- Linear Regression
- Random Forest Regressor
- XGBoost Regressor

In [1]:
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

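# Load the feature matrix; level 0 of the MultiIndex is the date
data = pd.read_csv("../data/processed/ml_feature_matrix.csv", index_col=[0, 1], parse_dates=[0])
data = data.dropna()

# Features: everything except the target and the non-feature column "key_0"
X = data.drop(columns=["y_next", "key_0"])
y = data["y_next"]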

# Chronological split: earliest 80% of dates for training, latest 20% for testing
dates = X.index.get_level_values(0).sort_values().unique()
split_point = int(len(dates) * 0.8)
train_dates = dates[:split_point]
test_dates = dates[split_point:]

X_train = X.loc[X.index.get_level_values(0).isin(train_dates)]
X_test  = X.loc[X.index.get_level_values(0).isin(test_dates)]
y_train = y.loc[X_train.index]
y_test  = y.loc[X_test.index]

# train Linear Regression
lr = LinearRegression().fit(X_train, y_train)
y_pred = lr.predict(X_test)

# Evaluate
print("R²:", r2_score(y_test, y_pred))
print("MSE:", mean_squared_error(y_test, y_pred))

print("y_test mean:", y_test.mean())
print("y_pred mean:", y_pred.mean())

R²: -0.2538394511906015
MSE: 0.011532806507899838
y_test mean: 0.03783213784293213
y_pred mean: -0.010491452219582255
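
A negative R² means the model's squared error on the test set is larger than that of a constant prediction equal to the test-set mean. One way to read the numbers above is to split the MSE into a squared mean offset and the variance of the residuals, using the identities `MSE = (mean(y) - mean(ŷ))² + Var(y - ŷ)` and `R² = 1 - MSE / Var(y)`. The sketch below applies this to the values printed above; the variable names are ours, and the "offset-free" R² is only a diagnostic under this decomposition, not a metric computed elsewhere in the notebook.

```python
# Values copied from the Linear Regression cell output above.
r2 = -0.2538394511906015
mse = 0.011532806507899838
y_test_mean = 0.03783213784293213
y_pred_mean = -0.010491452219582255

# Squared difference between the average realized and average predicted return.
bias_sq = (y_test_mean - y_pred_mean) ** 2

# Share of the test MSE that comes from the mean offset alone.
bias_share = bias_sq / mse

# Variance of y_test implied by R² = 1 - MSE / Var(y_test).
var_y = mse / (1 - r2)

# R² the predictions would have if their mean matched the test mean.
r2_without_offset = 1 - (mse - bias_sq) / var_y

print("squared mean offset:", bias_sq)
print("share of MSE from offset:", bias_share)
print("R² without offset:", r2_without_offset)
```

Under this decomposition, roughly a fifth of the MSE comes from the mean offset, and the offset-free R² is close to zero. In other words, the negative R² of the linear model appears to be driven mostly by the difference in average return level between predictions and the test period, not by the predictions being worse than noise around the right level.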


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error
import pandas as pd

rf = RandomForestRegressor(
    n_estimators=400,
    max_depth=None,
    min_samples_leaf=3,
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)

rf_pred = rf.predict(X_test)

print("RandomForest R²:", r2_score(y_test, rf_pred))
print("RandomForest MSE:", mean_squared_error(y_test, rf_pred))
print("y_test mean:", y_test.mean())
print("rf_pred mean:", rf_pred.mean())

# Feature importances, sorted from most to least important
rf_imp = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)

RandomForest R²: -0.3728113763178946
RandomForest MSE: 0.012627109443624615
y_test mean: 0.03783213784293213
rf_pred mean: 0.0027211695173584302
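
The cell above computes `rf_imp` but does not display it. The sketch below shows how such a ranking can be inspected, using synthetic data so that it runs on its own. The feature names, the `_demo` variables, and the data-generating process are assumptions for illustration only; they say nothing about which features matter in the real feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
cols = ["mom_12_1", "mom_6_1", "mom_3_1", "vol_3m"]  # hypothetical feature names

# Synthetic data: only the first feature actually drives the target.
X_demo = pd.DataFrame(rng.standard_normal((300, 4)), columns=cols)
y_demo = 2.0 * X_demo["mom_12_1"] + 0.1 * rng.standard_normal(300)

rf_demo = RandomForestRegressor(n_estimators=50, min_samples_leaf=3, random_state=0)
rf_demo.fit(X_demo, y_demo)

# Same construction as rf_imp above: impurity-based importances, sorted descending.
imp_demo = pd.Series(rf_demo.feature_importances_, index=X_demo.columns)
imp_demo = imp_demo.sort_values(ascending=False)
print(imp_demo.head(10))
```

In the notebook itself, `print(rf_imp.head(10))` would show the corresponding ranking. Impurity-based importances sum to 1 and are computed on the training data, so with a negative test R² they are best read as a description of what the forest fit, not as evidence of predictive power.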
