# 04 - Predictions

This notebook completes the stock price prediction pipeline by generating forecasts of stock returns using machine learning.

The dataset prepared in previous steps is used to define the prediction target and to train or load models for forecasting. The predictions are then evaluated and converted into formats suitable for investment analysis.

The key steps include:

- Definition of the target variable as the future log return over a set horizon (in business days).
- Formatting the dataset for compatibility with AutoGluon's time series models.
- Splitting the data chronologically into training, validation, and testing sets.
- Model training or loading of a pre-trained model.
- Generation of return forecasts and their evaluation.
- Conversion of predicted log returns to percentage returns and forecasted prices.
- Creation of a summary report for interpretation or further analysis.

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import warnings

from autogluon.timeseries import TimeSeriesDataFrame, TimeSeriesPredictor

In [2]:
# Suppress all warnings
warnings.filterwarnings('ignore')

## 1. Data Loading and Target Engineering

This section loads the dataset, generates the log return prediction target, and converts the dataset into a time series format compatible with AutoGluon.

In [None]:
# Load the dataset from local storage
dataset = pd.read_csv("dataset/dataset.csv")  # forward slash works on Windows, macOS, and Linux

In [None]:
# Define the prediction horizon in business days
horizon = 30

### 1.1 Target Generation: Log Return over Forecast Horizon

To train a forecasting model, a supervised learning target must first be defined. Here, logarithmic returns (log returns) are computed over a user-defined prediction horizon, set to 30 business days in this notebook.

A function was implemented to calculate this target for each stock symbol using the adjusted close price (`close_adj`). The resulting column, named `log_return_30d`, is the variable the model is trained to predict.

In [None]:
def add_log_return_target(df, price_col="close_adj", horizon=30):
    """
    Adds a log return column to the dataframe for a given forecast horizon.

    Parameters:
        df (pd.DataFrame): The input dataframe containing price data.
        price_col (str): The name of the column containing adjusted close prices.
        horizon (int): The number of business days over which to compute the log return.

    Returns:
        pd.DataFrame: The dataframe with an additional log return target column.
    """
    df = df.copy()
    
    # Compute log return by symbol over the forward horizon
    df[f"log_return_{horizon}d"] = (
        np.log(df.groupby("Symbol")[price_col].shift(-horizon) / df[price_col])
    )
    
    # Remove rows where the future price is not available
    df = df.dropna(subset=[f"log_return_{horizon}d"])
    
    return df

In [None]:
# Add the log return target column to the dataset
dataset_30 = add_log_return_target(dataset, horizon=horizon)
dataset_30

Unnamed: 0,Date,Symbol,Sector,open_adj,high_adj,low_adj,close_adj,volume_adj,dividends,stock_splits,...,volume_raw,roe,pe_ratio,pb_ratio,eps_growth_qoq,eps_growth_yoy,ev_ebitda,de_ratio,fcf_yield,log_return_30d
0,2006-09-01,PG,Consumer Staples,36.455025,36.513832,36.313885,36.407978,4132300,0.0,0.0,...,4132300.0,0.030171,94.884738,2.904178,-0.061759,0.044869,66.807390,1.157039,0.047664,0.002258
1,2006-09-05,PG,Consumer Staples,36.166855,36.278589,35.984550,36.202141,6288700,0.0,0.0,...,6288700.0,0.030171,96.255465,2.904178,-0.061759,0.044869,67.513398,1.157039,0.047664,0.011147
2,2006-09-06,PG,Consumer Staples,36.202144,36.266831,35.955149,36.131573,4595800,0.0,0.0,...,4595800.0,0.030171,95.900091,2.921415,-0.061759,0.044869,67.330359,1.157039,0.047664,0.023537
3,2006-09-07,PG,Consumer Staples,36.396214,36.478545,36.072771,36.184505,7174000,0.0,0.0,...,7174000.0,0.030171,95.510872,2.948577,-0.061759,0.044869,67.129888,1.157039,0.047664,0.020634
4,2006-09-08,PG,Consumer Staples,36.660858,36.660858,35.931638,35.955162,5725900,0.0,0.0,...,5725900.0,0.030171,96.137007,2.951188,-0.061759,0.044869,67.452385,1.157039,0.047664,0.029868
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
22414,2025-04-02,XOM,Energy,117.309555,117.755445,116.586221,117.586998,12614600,0.0,0.0,...,12614605.0,0.028858,64.672615,1.725992,-0.116144,-0.118237,39.044061,0.693460,0.066437,-0.079691
22416,2025-04-03,JNJ,Health Care,158.750000,160.649994,157.479996,159.820007,13249300,0.0,0.0,...,13249313.0,0.047993,112.958125,5.146820,0.273991,-0.574349,79.741272,1.519289,0.057003,-0.054585
22417,2025-04-03,NVDA,Information Technology,103.510002,105.629997,101.599998,101.800003,338769400,0.0,0.0,...,338769412.0,0.278480,156.695381,72.334008,0.147227,5.848701,134.977060,0.406848,0.001761,0.285223
22418,2025-04-03,PG,Consumer Staples,173.150696,173.717172,169.672377,171.322098,9393300,0.0,0.0,...,9393270.0,0.090486,88.772981,8.229180,0.173246,0.020353,69.501532,1.391417,0.036577,-0.048079


**Why Use Log Returns?**

Instead of predicting future prices directly or using raw percentage returns, **logarithmic returns** (log returns) are preferred in financial modeling for several reasons:

- **Mathematical Additivity**
Log returns can be **summed over time**, which simplifies the modeling of multi-step forecasts and enables easier comparisons across different time intervals.

$$
\log\left(\frac{P_t}{P_0}\right) = \log\left(\frac{P_t}{P_{t-1}}\right) + \log\left(\frac{P_{t-1}}{P_{t-2}}\right) + \dots + \log\left(\frac{P_1}{P_0}\right)
$$


- **Symmetry & Normality**
Log returns are typically **more symmetric** and exhibit statistical properties closer to a **normal distribution** compared to raw returns. This can improve the behavior of many forecasting models and of squared-error loss metrics (e.g., RMSE).

- **Scale-Invariance**
A 1% move has the same log return whether a stock is priced at \$10 or \$1,000. This **scale-independence** allows for consistent treatment of securities across varying price levels.

- **Interpretability**
While log returns may appear less intuitive than raw returns, they can be **converted back** into percentage returns or price forecasts:

$$
\text{Percent Return} = \left(e^{\text{log return}} - 1\right) \times 100
$$

This flexibility ensures that forecasts can be easily understood in terms of both return and price.
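
The properties listed above can be checked numerically. The following is a minimal sketch on a made-up price path (the prices and the helper name `log_to_pct` are illustrative assumptions, not part of the pipeline):

```python
import math

# Hypothetical price path, for illustration only
prices = [100.0, 102.0, 99.0, 105.0]

# One-step log returns
step_log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]

# Additivity: the one-step log returns sum to the whole-period log return
total_log_return = math.log(prices[-1] / prices[0])
assert math.isclose(sum(step_log_returns), total_log_return)


def log_to_pct(log_return: float) -> float:
    """Convert a log return to a percentage return."""
    return (math.exp(log_return) - 1) * 100


# Scale independence: a 1% move gives the same log return at any price level
assert math.isclose(math.log(10.10 / 10.00), math.log(1010.0 / 1000.0))

# The path goes from 100 to 105, so the percentage return is 5%
total_pct_return = log_to_pct(total_log_return)
```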


**Implementation: Calculating Log Returns**

The supervised learning target is generated as the log return over a forward-looking **forecast horizon** (e.g., 30 business days):

$$
\log\left(\frac{P_{t+h}}{P_t}\right)
$$

Where:
- $P_t$ is the current adjusted close price
- $P_{t+h}$ is the adjusted close price after the forecast horizon
- $h$ is the number of business days into the future (here, 30)

This log return is stored in the `log_return_30d` column and serves as the value that the model is trained to predict.
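
To make the forward shift concrete, here is a small sketch on a toy table with a horizon of 2 rows. The prices, the symbols `A` and `B`, and the column name `log_return_2d` are hypothetical. It assumes that rows are already sorted by date within each symbol, since `shift` counts rows rather than calendar days.

```python
import numpy as np
import pandas as pd

# Hypothetical two-symbol price table, already sorted by date within each symbol
toy = pd.DataFrame({
    "Symbol": ["A"] * 4 + ["B"] * 4,
    "close_adj": [10.0, 11.0, 12.1, 13.31, 50.0, 50.0, 25.0, 25.0],
})
h = 2  # toy horizon, counted in rows (trading days)

# Future price h rows ahead, looked up within each symbol
future_price = toy.groupby("Symbol")["close_adj"].shift(-h)
toy["log_return_2d"] = np.log(future_price / toy["close_adj"])

# The last h rows of each symbol have no future price, so their target is NaN
n_missing = int(toy["log_return_2d"].isna().sum())
```

Because the shift is applied per symbol, the final rows of `A` never borrow prices from `B`. Those rows are the ones removed by `dropna` in `add_log_return_target`.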

### 1.2 Dataset Structuring for AutoGluon

The dataset was prepared for use with AutoGluon `TimeSeriesPredictor` by converting it into a `TimeSeriesDataFrame`.

This required the removal of unused raw columns, alignment of the index, and inclusion of static features.

In [None]:
def create_time_series_dataframe(df, horizon=30):
    """
    Converts a dataframe with log return targets into an AutoGluon TimeSeriesDataFrame.

    Parameters:
        df (pd.DataFrame): The input dataframe with log return targets.
        horizon (int): The forecast horizon used in the log return column.

    Returns:
        TimeSeriesDataFrame: The formatted time series dataframe.
    """
    df = df.copy()
    
    # Drop raw price/volume columns that are not used in modeling
    df.drop(columns=["open_raw", "high_raw", "low_raw", "close_raw", "volume_raw"], inplace=True)
    
    # Convert to AutoGluon's time series format
    ts_df = TimeSeriesDataFrame.from_data_frame(
        df.dropna(subset=[f"log_return_{horizon}d"]),  # keep only rows that have a target value
        id_column="Symbol",
        timestamp_column="Date",
        static_features_df=df[["Symbol", "Sector"]].drop_duplicates().reset_index(drop=True)
    )
    
    # Drop the per-timestamp copy of Sector from the time-varying columns;
    # the static feature passed via static_features_df is not affected by this drop
    ts_df.drop(columns=["Sector"], inplace=True)

    return ts_df


In [None]:
# Convert the prepared dataset to a TimeSeriesDataFrame
ts_dataset_30 = create_time_series_dataframe(dataset_30, horizon=horizon)
ts_dataset_30

item_id,timestamp,open_adj,high_adj,low_adj,close_adj,volume_adj,dividends,stock_splits,sma_10,sma_50,ema_12,...,stoch_d,roe,pe_ratio,pb_ratio,eps_growth_qoq,eps_growth_yoy,ev_ebitda,de_ratio,fcf_yield,log_return_30d
PG,2006-09-01,36.455025,36.513832,36.313885,36.407978,4132300,0.0,0.0,36.014553,34.175059,35.926273,...,88.502218,0.030171,94.884738,2.904178,-0.061759,0.044869,66.807390,1.157039,0.047664,0.002258
PG,2006-09-05,36.166855,36.278589,35.984550,36.202141,6288700,0.0,0.0,36.060420,34.247270,35.968714,...,89.817134,0.030171,96.255465,2.904178,-0.061759,0.044869,67.513398,1.157039,0.047664,0.011147
PG,2006-09-06,36.202144,36.266831,35.955149,36.131573,4595800,0.0,0.0,36.083355,34.315028,35.993770,...,86.666800,0.030171,95.900091,2.921415,-0.061759,0.044869,67.330359,1.157039,0.047664,0.023537
PG,2006-09-07,36.396214,36.478545,36.072771,36.184505,7174000,0.0,0.0,36.119229,34.386652,36.023114,...,81.069066,0.030171,95.510872,2.948577,-0.061759,0.044869,67.129888,1.157039,0.047664,0.020634
PG,2006-09-08,36.660858,36.660858,35.931638,35.955162,5725900,0.0,0.0,36.124523,34.453455,36.012659,...,72.979108,0.030171,96.137007,2.951188,-0.061759,0.044869,67.452385,1.157039,0.047664,0.029868
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
XOM,2025-04-02,117.309555,117.755445,116.586221,117.586998,12614600,0.0,0.0,116.360300,109.946254,115.664151,...,90.546944,0.028858,64.672615,1.725992,-0.116144,-0.118237,39.044061,0.693460,0.066437,-0.079691
JNJ,2025-04-03,158.750000,160.649994,157.479996,159.820007,13249300,0.0,0.0,161.077002,159.018286,160.749212,...,33.996134,0.047993,112.958125,5.146820,0.273991,-0.574349,79.741272,1.519289,0.057003,-0.054585
NVDA,2025-04-03,103.510002,105.629997,101.599998,101.800003,338769400,0.0,0.0,112.541000,122.355733,111.606592,...,27.657119,0.278480,156.695381,72.334008,0.147227,5.848701,134.977060,0.406848,0.001761,0.285223
PG,2025-04-03,173.150696,173.717172,169.672377,171.322098,9393300,0.0,0.0,167.096436,168.169206,168.261870,...,81.968289,0.090486,88.772981,8.229180,0.173246,0.020353,69.501532,1.391417,0.036577,-0.048079


**What Are Static Features?**

In time series forecasting, **static features** refer to attributes that are **constant over time for each time series** (or entity). These features do **not change across timestamps** but can help the model differentiate and learn patterns across different series.

**Example of Static Features:**
- **Sector**: If each time series corresponds to a stock, its sector (e.g., "Technology", "Healthcare") is a static feature.

**Why Are Static Features Useful?**
Static features allow the model to:
- **Generalize across related series** (e.g., tech stocks vs. energy stocks)
- **Share information across entities** while accounting for categorical differences
- **Improve performance** in multi-series settings where patterns vary by group

In AutoGluon, static features are passed via the `static_features_df` parameter when constructing a `TimeSeriesDataFrame`. They help the model understand contextual relationships without needing to be repeated at every timestamp.
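
Before a column is treated as static, it can be worth confirming that it really is constant within each series. The sketch below uses plain pandas on a hypothetical toy frame; it only illustrates the check and the one-row-per-symbol shape, and does not call AutoGluon.

```python
import pandas as pd

# Hypothetical long-format frame with one row per (Symbol, Date)
toy = pd.DataFrame({
    "Symbol": ["AAPL", "AAPL", "XOM", "XOM"],
    "Date": ["2024-01-02", "2024-01-03", "2024-01-02", "2024-01-03"],
    "Sector": ["Information Technology", "Information Technology", "Energy", "Energy"],
})

# A static feature must take exactly one value per series
values_per_symbol = toy.groupby("Symbol")["Sector"].nunique()
assert (values_per_symbol == 1).all(), "Sector changes over time for some symbol"

# One row per symbol: the same shape this notebook passes as static_features_df
static_features = toy[["Symbol", "Sector"]].drop_duplicates().reset_index(drop=True)
```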


## 2. Time-Based Splitting

To properly train and evaluate a time series forecasting model, the dataset must be divided into **training**, **validation**, and **test** sets based on time. This is essential in time series tasks where data is naturally ordered and future information must not leak into the past.

Unlike standard machine learning tasks, **random splits cannot be used** for time series data, as they would violate temporal ordering.

In [9]:
def split_time_series_data(ts_dataset: TimeSeriesDataFrame, train_start, train_end, val_end, test_end):
    """
    Splits an AutoGluon TimeSeriesDataFrame into train, validation, and test sets
    based on the provided timestamp string boundaries.

    Parameters:
        ts_dataset (TimeSeriesDataFrame): The input time series dataset.
        train_start (str): Start date for training data in 'YYYY-MM-DD' format.
        train_end (str): End date for training data in 'YYYY-MM-DD' format.
        val_end (str): End date for validation data in 'YYYY-MM-DD' format.
        test_end (str): End date for test data in 'YYYY-MM-DD' format.

    Returns:
        Tuple[TimeSeriesDataFrame, TimeSeriesDataFrame, TimeSeriesDataFrame]:
            train_data, val_data, test_data
    """
    train_start = pd.Timestamp(train_start)
    train_end = pd.Timestamp(train_end)
    val_end = pd.Timestamp(val_end)
    test_end = pd.Timestamp(test_end)
    
    timestamps = ts_dataset.index.get_level_values("timestamp")

    train_data = ts_dataset[(timestamps >= train_start) & (timestamps <= train_end)]
    val_data = ts_dataset[(timestamps > train_end) & (timestamps <= val_end)]
    test_data = ts_dataset[(timestamps > val_end) & (timestamps <= test_end)]

    return train_data, val_data, test_data


In [10]:
train_start = "2015-01-01"
train_end = "2021-12-31"
val_end = "2022-12-31"
test_end = "2023-12-31"

train_data, val_data, test_data = split_time_series_data(ts_dataset_30, train_start, train_end, val_end, test_end)
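
As a sanity check on the masking logic, the sketch below applies the same boundary pattern to a toy frame. It uses a plain pandas DataFrame with an `(item_id, timestamp)` index as a stand-in for a `TimeSeriesDataFrame`; the data and the single cut date are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for a TimeSeriesDataFrame: a frame indexed by (item_id, timestamp)
dates = pd.date_range("2021-12-29", "2022-01-05", freq="B")
index = pd.MultiIndex.from_product([["A", "B"], dates], names=["item_id", "timestamp"])
toy = pd.DataFrame({"y": range(len(index))}, index=index)

ts = toy.index.get_level_values("timestamp")
cut = pd.Timestamp("2021-12-31")

# Same masking pattern as split_time_series_data, with a single boundary
earlier = toy[ts <= cut]
later = toy[ts > cut]

# No row is lost or duplicated, and every "earlier" timestamp precedes every "later" one
rows_preserved = len(earlier) + len(later) == len(toy)
ordered = (
    earlier.index.get_level_values("timestamp").max()
    < later.index.get_level_values("timestamp").min()
)
```

Using `<=` on one side of a boundary and `>` on the other is what keeps adjacent splits disjoint.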

In [11]:
train_data

item_id,timestamp,open_adj,high_adj,low_adj,close_adj,volume_adj,dividends,stock_splits,sma_10,sma_50,ema_12,...,stoch_d,roe,pe_ratio,pb_ratio,eps_growth_qoq,eps_growth_yoy,ev_ebitda,de_ratio,fcf_yield,log_return_30d
AAPL,2015-01-02,24.778683,24.789806,23.879985,24.320436,212818400,0.0,0.0,24.959976,24.811783,24.854051,...,68.848125,0.075905,328.602051,23.711892,0.107369,0.136285,219.942254,1.078397,0.020322,0.160269
JNJ,2015-01-02,79.080402,79.456796,78.387832,78.681419,5753600,0.0,0.0,79.254282,80.031228,79.174056,...,46.449074,0.062005,65.054411,4.187313,0.101575,0.245868,44.502018,0.724707,0.052443,-0.039818
NVDA,2015-01-02,0.483099,0.486699,0.475420,0.483099,113680000,0.0,0.0,0.490395,0.479771,0.487895,...,63.026518,0.041145,2640.000336,104.271317,0.381516,-0.177780,1669.115095,0.637702,0.001553,0.105956
PG,2015-01-02,67.739859,67.859175,67.053812,67.441582,7251400,0.0,0.0,68.836786,66.634275,68.343286,...,66.827672,0.030126,128.581307,3.622573,-0.227341,0.038438,71.515276,1.080221,0.052009,-0.049200
XOM,2015-01-02,58.598702,59.106877,58.319205,58.967129,10220400,0.0,0.0,59.143722,59.304266,58.728542,...,74.687690,0.044688,51.188138,2.456961,-0.077435,-0.240010,32.104998,0.915177,0.025136,0.009866
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AAPL,2021-12-31,175.027343,176.147738,174.211615,174.516296,64062300,0.0,0.0,173.029294,159.262729,173.126969,...,78.679125,0.325741,129.957901,40.043824,-0.046534,0.714023,108.652749,4.563512,0.038951,-0.048903
JNJ,2021-12-31,156.898930,157.035335,155.534829,155.571213,4409500,0.0,0.0,154.081625,149.259419,154.127148,...,77.637088,0.052183,118.820415,6.874484,-0.416617,-0.021832,92.744406,1.550490,0.048033,-0.032498
NVDA,2021-12-31,29.622004,29.977381,29.279605,29.359465,266530000,0.0,0.0,29.338504,29.244594,29.648006,...,59.474533,0.103538,3397.562094,380.182940,0.035457,0.524685,2797.578052,0.707370,0.000731,-0.192251
PG,2021-12-31,149.368362,150.166285,149.001504,150.028717,5327000,0.0,0.0,147.465254,137.932864,147.219272,...,91.300076,0.089176,91.366172,8.019255,0.422856,0.108652,74.908403,1.588797,0.044405,-0.037329


In [12]:
val_data

item_id,timestamp,open_adj,high_adj,low_adj,close_adj,volume_adj,dividends,stock_splits,sma_10,sma_50,ema_12,...,stoch_d,roe,pe_ratio,pb_ratio,eps_growth_qoq,eps_growth_yoy,ev_ebitda,de_ratio,fcf_yield,log_return_30d
AAPL,2022-01-03,174.771805,179.734962,174.653874,178.879913,104487900,0.0,0.0,174.097601,159.906422,174.012038,...,77.219084,0.325741,130.338345,39.819279,-0.046534,0.714023,108.943302,4.563512,0.038951,-0.050711
JNJ,2022-01-03,154.789166,156.053233,153.779730,155.998657,6012200,0.0,0.0,154.382642,149.426809,154.415073,...,79.812307,0.052183,116.493463,6.880393,-0.416617,-0.021832,91.242923,1.550490,0.048033,-0.024968
NVDA,2022-01-03,29.762758,30.657188,29.732810,30.068222,391547000,0.0,0.0,29.570097,29.392969,29.712654,...,58.779848,0.103538,3291.571753,384.049156,0.035457,0.524685,2710.466578,0.707370,0.000731,-0.128267
PG,2022-01-03,148.295276,149.441723,146.635222,149.405029,9317300,0.0,0.0,147.964186,138.365213,147.555542,...,86.014278,0.089176,91.976026,7.999626,0.422856,0.108652,75.336707,1.588797,0.044405,-0.032653
XOM,2022-01-03,54.093256,56.177839,54.066754,56.124844,24282400,0.0,0.0,54.050853,55.182937,54.354518,...,64.414851,0.042033,39.178344,1.783683,0.431412,-2.564487,28.957989,1.053509,-0.014848,0.215936
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AAPL,2022-12-30,126.934142,128.456435,125.965402,128.436661,77034200,0.0,0.0,129.922398,141.378203,131.445403,...,8.138785,0.408924,117.685564,50.177278,0.075290,0.088917,95.841108,5.961537,0.049393,0.166274
JNJ,2022-12-30,165.650387,165.911708,163.699792,164.866409,4216600,0.0,0.0,164.942024,162.856056,165.228575,...,43.774699,0.059760,105.186420,6.316988,-0.071654,0.417170,69.909941,1.347538,0.043192,-0.086327
NVDA,2022-12-30,14.322264,14.617023,14.221347,14.602036,310490000,0.0,0.0,15.320446,15.304749,15.337590,...,7.169141,0.031852,5662.807500,131.779611,0.043637,1.230730,3567.387546,0.896482,0.001310,0.452253
PG,2022-12-30,143.273023,143.508005,141.402578,142.455292,4532000,0.0,0.0,142.681830,135.049062,142.597569,...,64.584184,0.089370,90.616464,7.369489,0.300785,0.055598,73.002983,1.632399,0.037160,-0.078240


In [13]:
test_data

item_id,timestamp,open_adj,high_adj,low_adj,close_adj,volume_adj,dividends,stock_splits,sma_10,sma_50,ema_12,...,stoch_d,roe,pe_ratio,pb_ratio,eps_growth_qoq,eps_growth_yoy,ev_ebitda,de_ratio,fcf_yield,log_return_30d
AAPL,2023-01-03,128.782649,129.395518,122.742873,123.632530,112117500,0.0,0.0,128.989248,141.020708,130.243423,...,10.145841,0.408924,116.822116,48.537351,0.075290,0.088917,95.209992,5.961537,0.049393,0.218204
JNJ,2023-01-03,164.409095,166.481009,164.269092,166.303680,6344900,0.0,0.0,165.177211,163.119970,165.393976,...,45.097770,0.059760,105.556542,6.252687,-0.071654,0.417170,70.116631,1.347538,0.043192,-0.111622
NVDA,2023-01-03,14.838841,14.983723,14.084459,14.303280,401277000,0.0,0.0,15.095031,15.347197,15.178465,...,10.008960,0.031852,5893.965000,129.007609,0.043637,1.230730,3712.394096,0.896482,0.001310,0.463872
PG,2023-01-03,141.881942,142.596296,140.161874,142.464706,6447300,0.0,0.0,142.788042,135.511130,142.577128,...,57.926546,0.089370,92.555022,7.350959,0.300785,0.055598,74.323402,1.632399,0.037160,-0.080606
XOM,2023-01-03,100.815354,101.035754,96.875675,97.812386,15146200,0.0,0.0,99.094387,100.055173,99.090112,...,80.902364,0.105642,24.397436,2.837963,0.111639,-2.025777,20.028638,0.951687,0.137824,0.093620
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
AAPL,2023-12-29,192.742785,193.239801,190.585738,191.380966,42628800,0.0,0.0,193.420724,185.446808,192.757678,...,24.855426,0.369388,129.504163,42.802218,0.162231,0.003411,109.154330,4.673462,0.049393,-0.038405
JNJ,2023-12-29,150.502374,150.877404,149.992720,150.723557,4311100,0.0,0.0,149.669627,146.179193,149.804688,...,77.271108,0.365418,14.673412,5.050047,4.210697,-0.137416,63.142382,1.331401,0.043192,-0.001724
NVDA,2023-12-29,49.794301,49.978234,48.732700,49.503410,389293000,0.0,0.0,49.205721,46.664760,48.905561,...,77.831794,0.277860,1330.187298,459.406814,0.496692,-0.547089,1123.508734,0.627777,0.000777,0.376025
PG,2023-12-29,140.713081,141.638325,140.452854,141.233521,5300900,0.0,0.0,140.136725,143.503817,140.727193,...,53.750909,0.094796,82.709238,7.814326,0.337234,0.016052,65.197424,1.562484,0.036577,0.070582


## 3. Model Initialization and Training

In [None]:
# Set the target dynamically
target_column = f"log_return_{horizon}d"

In [15]:
# Set the path dynamically
path = f"AutoGluon_Models/{target_column}"

The workflow supports two options:

- **Loading a Pretrained Model**:  
    If a previously trained model exists at the specified path, it is loaded to save time and resources. This is useful for quick experimentation or deployment.

- **Training a New Model**:  
    Alternatively, the user can initialize a new TimeSeriesPredictor and train it from scratch using the provided training and validation datasets.
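
The choice between the two options could be automated with a small wrapper. The helper below, `load_or_train`, is a hypothetical sketch and not part of AutoGluon. It takes the loading and training steps as callables, so it can be shown here with stand-in functions. In this notebook, `load_fn` would wrap `TimeSeriesPredictor.load` and `train_fn` would build and fit a new predictor.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def load_or_train(path: str, load_fn: Callable[[str], T], train_fn: Callable[[str], T]) -> T:
    """Return load_fn(path) if a saved model directory exists, else train_fn(path).

    Hypothetical helper for illustration; it only checks that the directory
    exists, not that it contains a valid saved model.
    """
    if os.path.isdir(path):
        return load_fn(path)
    return train_fn(path)


# Demonstration with stand-in callables instead of real models
with tempfile.TemporaryDirectory() as existing_dir:
    from_disk = load_or_train(existing_dir, lambda p: "loaded", lambda p: "trained")
    missing_dir = os.path.join(existing_dir, "no_model_here")
    fresh = load_or_train(missing_dir, lambda p: "loaded", lambda p: "trained")
```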

In [None]:
# Load pretrained model (preferred for speed)
predictor = TimeSeriesPredictor.load(path)

To train a new model instead, uncomment the two cells below (select the lines and press `Ctrl + /`).

In [None]:
# predictor = TimeSeriesPredictor(
#     target=target_column,
#     path=path,
#     prediction_length=horizon,
#     freq="B",
#     eval_metric="RMSE"
# )

In [None]:
# predictor.fit(
#     train_data=train_data, 
#     tuning_data=val_data, 
#     presets="best_quality", 
#     hyperparameters = {
#         "PatchTST": {
#             "context_length": 256,
#             "patch_len": 32,
#             "stride": 4,
#             "d_model": 128,
#             "nhead": 8,
#             "num_encoder_layers": 6,
#             "max_epochs": 200,
#             "batch_size": 64,
#             "num_batches_per_epoch": 50
#         }
#     }
# )

**Model Choice: PatchTST**

PatchTST (Patch Time-Series Transformer) was selected as the forecasting model after extensive benchmarking. It demonstrated superior performance on this dataset compared to alternative models.

Key advantages of PatchTST include:

- Effective handling of long-range temporal dependencies via transformer architecture

- Efficient use of local patch information through patch embeddings

- Strong performance in multivariate and univariate time series forecasting tasks

The hyperparameters specified in the training block reflect the best configuration found through experimentation, balancing accuracy and training time.
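
To give a sense of what `patch_len` and `stride` control, the sketch below counts how many patches a context window yields. It assumes that patches are plain sliding windows with no padding. That is an assumption: the PatchTST implementation used by AutoGluon may pad the window and produce a different count. The function name `num_patches` is hypothetical.

```python
def num_patches(context_length: int, patch_len: int, stride: int) -> int:
    """Count sliding windows of length patch_len taken every `stride` steps (no padding)."""
    if patch_len > context_length:
        return 0
    return (context_length - patch_len) // stride + 1


# Values from the training configuration in this notebook
n = num_patches(context_length=256, patch_len=32, stride=4)
```

Under this assumption the 256-step context is split into 57 overlapping patches. The transformer then attends over those patches rather than over 256 individual time steps.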

## 4. Predictions and Evaluation

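After loading or training the model, we forecast values over the specified prediction horizon for each symbol. Given `test_data` as input, the predictor forecasts the next `prediction_length` time steps (here, the horizon in business days) after the last timestamp of each series.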

In [20]:
predictions = predictor.predict(test_data)

data with frequency 'IRREG' has been resampled to frequency 'B'.


### 4.1 Handling Invalid Prediction Dates

Because forecasts are generated on a business-day calendar (frequency `B`), the model may predict values for dates that are absent from the actual dataset, such as market holidays. To ensure alignment with real trading data, we remove predictions made on dates not present in the reference dataset:

In [None]:
def remove_missing_prediction_dates(predictions: pd.DataFrame, reference_ts: pd.DataFrame) -> pd.DataFrame:
    """
    Removes prediction timestamps not present in the reference time series (e.g., due to holidays).

    Parameters:
        predictions (pd.DataFrame): MultiIndex DataFrame with ('item_id', 'timestamp').
        reference_ts (pd.DataFrame): TimeSeriesDataFrame with valid timestamps.

    Returns:
        pd.DataFrame: Cleaned predictions with only valid timestamps.
    """
    pred_timestamps = predictions.index.get_level_values("timestamp").unique()
    reference_timestamps = reference_ts.index.get_level_values("timestamp").unique()
    
    missing_timestamps = list(pred_timestamps.difference(reference_timestamps))
    
    if missing_timestamps:
        print(f"Dropping {len(missing_timestamps)} timestamps not in reference dataset.")
    
    return predictions[~predictions.index.get_level_values("timestamp").isin(missing_timestamps)]


In [None]:
predictions = remove_missing_prediction_dates(predictions, ts_dataset_30)
predictions

Dropping 2 timestamps not in reference dataset.


item_id,timestamp,mean,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9
AAPL,2024-01-02,-0.021978,-0.058285,-0.044966,-0.036065,-0.028729,-0.021978,-0.015226,-0.007890,0.001011,0.014330
AAPL,2024-01-03,-0.019316,-0.058452,-0.044118,-0.034521,-0.026604,-0.019316,-0.012028,-0.004111,0.005487,0.019821
AAPL,2024-01-04,-0.018461,-0.063714,-0.047135,-0.036038,-0.026886,-0.018461,-0.010036,-0.000884,0.010213,0.026791
AAPL,2024-01-05,-0.017092,-0.063759,-0.046405,-0.034990,-0.025655,-0.017092,-0.008529,0.000806,0.012220,0.029574
AAPL,2024-01-08,-0.006309,-0.059162,-0.039697,-0.026748,-0.016100,-0.006309,0.003481,0.014129,0.027078,0.046543
...,...,...,...,...,...,...,...,...,...,...,...
XOM,2024-02-05,0.023755,-0.057190,-0.025696,-0.006069,0.009570,0.023755,0.037940,0.053579,0.073206,0.104700
XOM,2024-02-06,0.021151,-0.059887,-0.029451,-0.009664,0.006427,0.021151,0.035875,0.051966,0.071752,0.102188
XOM,2024-02-07,0.018642,-0.059816,-0.030401,-0.011238,0.004362,0.018642,0.032923,0.048523,0.067686,0.097101
XOM,2024-02-08,0.020114,-0.065096,-0.031927,-0.011269,0.005188,0.020114,0.035039,0.051496,0.072155,0.105323
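
The sketch below illustrates why a business-day calendar can contain non-trading dates. It lists the 30 business days after the last test timestamp (2023-12-29) and flags those that are US federal holidays. The federal calendar is an assumption used as a rough proxy: exchange holidays differ from it (for example, markets close on Good Friday, which is not a federal holiday), so filtering against the reference dataset, as done above, remains the more reliable approach.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# 30 business days following the last test timestamp (2023-12-29)
forecast_days = pd.bdate_range(start="2024-01-01", periods=30)

# US federal holidays, used here only as a rough proxy for exchange holidays
holidays = USFederalHolidayCalendar().holidays(
    start=forecast_days.min(), end=forecast_days.max()
)

# Business days that are also holidays, so no trading data exists for them
non_trading = forecast_days.intersection(holidays)
trading_days = forecast_days.difference(non_trading)
```

In this window the flagged dates are 2024-01-01 and 2024-01-15. This matches the count of two dropped timestamps reported above, although the specific dates are inferred rather than shown in the output.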


### 4.2 Evaluating Predictions

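AutoGluon provides built-in evaluation tools. The following command scores the forecasts with the predictor's evaluation metric (RMSE in this notebook). Note that `evaluate` holds out the last `prediction_length` time steps of each series in `test_data` and scores the forecast for that window, rather than scoring the entire test set. AutoGluon also reports metrics in higher-is-better form, so error metrics such as RMSE are returned with a negative sign: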

In [None]:
predictor.evaluate(test_data)

To gain a more granular understanding of model performance, we compute the Root Mean Squared Error (RMSE) for each symbol individually. This helps identify which stocks the model forecasts well and which it struggles with.

In [None]:
def compute_per_symbol_rmse(
    predictions: pd.DataFrame,
    true_values: pd.DataFrame,
    target_column: str = "log_return_30d"
) -> pd.Series:
    """
    Compute RMSE per symbol from predictions and true values.
    Trims predictions and true values to matching timestamps.

    Parameters:
    - predictions: DataFrame with MultiIndex (item_id, timestamp) and predicted values in 'mean' column (or similar).
    - true_values: DataFrame with MultiIndex (item_id, timestamp) containing target_column.
    - target_column: Name of the true target column in true_values.

    Returns:
    - pd.Series with RMSE per symbol.
    """

    # Select only the target column from true_values
    true_values_target = true_values[[target_column]]

    # Trim predictions and true values to common index
    common_index = predictions.index.intersection(true_values_target.index)
    trimmed_predictions = predictions.loc[common_index]
    trimmed_true_values = true_values_target.loc[common_index]

    symbols = trimmed_true_values.index.get_level_values("item_id").unique()
    rmse_per_symbol = {}

    for symbol in symbols:
        y_true = trimmed_true_values.loc[symbol][target_column]
        y_pred = trimmed_predictions.loc[symbol]

        # Use 'mean' if present, else first column
        if 'mean' in y_pred.columns:
            y_pred_values = y_pred['mean']
        else:
            y_pred_values = y_pred.iloc[:, 0]

        # Align y_true and y_pred_values on timestamps
        y_true, y_pred_values = y_true.align(y_pred_values, join='inner')

        rmse = np.sqrt(((y_true - y_pred_values) ** 2).mean())
        rmse_per_symbol[symbol] = rmse

    return pd.Series(rmse_per_symbol)

In [None]:
compute_per_symbol_rmse(predictions, ts_dataset_30)
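
The same per-symbol metric can also be computed without an explicit loop. The following is a sketch of a vectorized alternative on hypothetical toy values, not a replacement for the function above. It joins predictions and actuals on their shared `(item_id, timestamp)` index and aggregates with `groupby`.

```python
import numpy as np
import pandas as pd

# Hypothetical predictions and actuals sharing an (item_id, timestamp) index
index = pd.MultiIndex.from_product(
    [["A", "B"], pd.to_datetime(["2024-01-02", "2024-01-03"])],
    names=["item_id", "timestamp"],
)
pred = pd.Series([0.01, 0.03, -0.02, 0.00], index=index, name="mean")
actual = pd.Series([0.02, 0.01, -0.02, 0.04], index=index, name="log_return_30d")

# Inner-join on the index, then aggregate squared errors per symbol
joined = pd.concat([pred, actual], axis=1, join="inner")
squared_error = (joined["mean"] - joined["log_return_30d"]) ** 2
rmse_by_symbol = np.sqrt(squared_error.groupby(level="item_id").mean())
```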

## 5. Forecast Interpretation and Reporting

In [26]:
def get_forecast_start_dates(predictions: pd.DataFrame) -> pd.Series:
    """
    Get the first forecast timestamp per symbol.
    """
    return (
        predictions.index.to_frame()
        .reset_index(drop=True)
        .groupby("item_id")["timestamp"]
        .min()
    )

In [27]:
get_forecast_start_dates(predictions)

item_id
AAPL   2024-01-02
JNJ    2024-01-02
NVDA   2024-01-02
PG     2024-01-02
XOM    2024-01-02
Name: timestamp, dtype: datetime64[ns]

### 5.1 Retrieving Base Prices

To translate predicted log returns into future prices, we first identify the most recent adjusted close price available on or before the forecast start date. This serves as the base price.

In [None]:
def get_base_close_prices(dataset: pd.DataFrame, forecast_start_dates: pd.Series) -> pd.Series:
    """
    Extract the last known close_adj price on or before the forecast start date for each symbol.
    
    Parameters:
        dataset: Original dataset with columns ['Symbol', 'Date', 'close_adj']
        forecast_start_dates: Series indexed by symbol with forecast start timestamps.
        
    Returns:
        Series of base close prices indexed by symbol.
    """
    dataset = dataset.copy()
    dataset["Date"] = pd.to_datetime(dataset["Date"])
    dataset_indexed = dataset.set_index(["Symbol", "Date"]).sort_index()
    
    base_close_prices = {}

    for symbol, forecast_start in forecast_start_dates.items():
        try:
            symbol_data = dataset_indexed.loc[symbol]
            valid_dates = symbol_data[symbol_data.index <= forecast_start]
            
            if not valid_dates.empty:
                last_price = valid_dates.iloc[-1]["close_adj"]
                base_close_prices[symbol] = last_price
            else:
                print(f"No data on or before forecast date for symbol: {symbol}")
        except KeyError:
            print(f"Symbol not found in dataset: {symbol}")
    
    return pd.Series(base_close_prices)

In [29]:
base_close_prices = get_base_close_prices(dataset, get_forecast_start_dates(predictions))
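
The lookup uses a `<=` comparison, so when a price exists on the forecast start date itself, that price becomes the base. The sketch below reproduces this behavior on a hypothetical toy table with a single start date shared by all symbols. It is a compact pandas alternative, not the function used above.

```python
import pandas as pd

# Hypothetical prices; symbol A trades on the start date, symbol B does not
toy = pd.DataFrame({
    "Symbol": ["A", "A", "A", "B", "B"],
    "Date": pd.to_datetime(
        ["2023-12-28", "2023-12-29", "2024-01-02", "2023-12-29", "2024-01-03"]
    ),
    "close_adj": [10.0, 11.0, 12.0, 20.0, 21.0],
})
forecast_start = pd.Timestamp("2024-01-02")

# Keep rows on or before the start date, then take the latest price per symbol
base = (
    toy[toy["Date"] <= forecast_start]
    .sort_values("Date")
    .groupby("Symbol")["close_adj"]
    .last()
)
```

Here `A` gets the price from 2024-01-02 (the start date itself), while `B` falls back to 2023-12-29 because it has no row on the start date.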

### 5.2 Generating the Forecast Report

A summary table is produced that converts predicted log returns into:

- Percent Returns: Easier to interpret for non-technical users.

- Future Prices: Estimated prices after the forecast horizon.

It includes:

- PredictionDate: Timestamp of each forecast row

- TargetDate: Date after the horizon (e.g., 30 business days later)

- BasePrice: Most recent actual price on or before the forecast start date (the same value is used for every row of a symbol)

- Forecast columns for:

    - mean (expected return)

    - 0.1 (10th percentile / pessimistic scenario)

    - 0.9 (90th percentile / optimistic scenario)

In [None]:
def predictions_report(log_return_df: pd.DataFrame, base_close_prices: pd.Series, horizon_days: int) -> pd.DataFrame:
    """
    Converts predicted log returns into percent returns and predicted prices.
    Keeps only mean, 10th percentile, and 90th percentile.

    Parameters:
        log_return_df (pd.DataFrame): Predictions with MultiIndex ('item_id', 'timestamp').
        base_close_prices (pd.Series): Base prices indexed by symbol.
        horizon_days (int): Horizon of prediction (e.g. 30).

    Returns:
        pd.DataFrame: Investor-friendly summary of predictions.
    """
    # Select and rename
    df = log_return_df.copy()
    df = df[['mean', '0.1', '0.9']].reset_index()
    df = df.rename(columns={'item_id': 'Symbol', 'timestamp': 'PredictionDate'})

    # Add future date
    df['TargetDate'] = df['PredictionDate'] + pd.offsets.BDay(horizon_days)

    # Percent returns
    for col in ['mean', '0.1', '0.9']:
        df[f'{col}_pct_return'] = (np.exp(df[col]) - 1) * 100  # Convert to %
    
    # Prices
    df['BasePrice'] = df['Symbol'].map(base_close_prices)
    for col in ['mean', '0.1', '0.9']:
        df[f'{col}_price'] = df['BasePrice'] * np.exp(df[col])

    # Clean up
    final_cols = [
        'PredictionDate', 'TargetDate', 'Symbol', 'BasePrice',
        'mean_pct_return', '0.1_pct_return', '0.9_pct_return',
        'mean_price', '0.1_price', '0.9_price'
    ]
    return df[final_cols].sort_values(['Symbol', 'PredictionDate'])


In [None]:
report = predictions_report(predictions, base_close_prices, horizon_days=horizon)
report

Unnamed: 0,PredictionDate,TargetDate,Symbol,BasePrice,mean_pct_return,0.1_pct_return,0.9_pct_return,mean_price,0.1_price,0.9_price
0,2024-01-02,2024-02-13,AAPL,184.532074,-2.173769,-5.661863,1.443267,180.520772,174.084121,187.195364
1,2024-01-03,2024-02-14,AAPL,184.532074,-1.913041,-5.677688,2.001834,181.001900,174.054918,188.226100
2,2024-01-04,2024-02-15,AAPL,184.532074,-1.829195,-6.172621,2.715302,181.156622,173.141608,189.542676
3,2024-01-05,2024-02-16,AAPL,184.532074,-1.694709,-6.176853,3.001559,181.404792,173.133799,190.070913
4,2024-01-08,2024-02-19,AAPL,184.532074,-0.628960,-5.744547,4.764271,183.371441,173.931543,193.323682
...,...,...,...,...,...,...,...,...,...,...
135,2024-02-05,2024-03-18,XOM,97.210678,2.403975,-5.558527,11.037790,99.547598,91.807197,107.940589
136,2024-02-06,2024-03-19,XOM,97.210678,2.137589,-5.812878,10.759187,99.288643,91.559941,107.669756
137,2024-02-07,2024-03-20,XOM,97.210678,1.881742,-5.806243,10.197186,99.039933,91.566390,107.123432
138,2024-02-08,2024-03-21,XOM,97.210678,2.031744,-6.302219,11.106968,99.185750,91.084248,108.007837
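
As a worked check of the conversion, the first AAPL row of the report can be reproduced by hand from its base price and the predicted mean log return shown in the predictions table. Small differences in the last decimals are expected, because the displayed log return is rounded to six places.

```python
import math

# First AAPL row: base price from the report, mean log return from the predictions table
base_price = 184.532074
mean_log_return = -0.021978

# Same formulas as in predictions_report
pct_return = (math.exp(mean_log_return) - 1) * 100
forecast_price = base_price * math.exp(mean_log_return)
```

The results agree with the reported `mean_pct_return` (about -2.17%) and `mean_price` (about 180.52) up to that rounding.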


In [None]:
# Save the report to a CSV file
report.to_csv("predictions_report.csv", index=False)