# Retail Sales Forecasting using the M5 dataset with Granite Time Series - Few-shot finetuning, evaluation, and visualization

In this tutorial, we explore [time series forecasting](https://www.ibm.com/think/insights/time-series-forecasting) using the [IBM Granite Time Series model](https://ibm.com/granite) to predict retail sales. We cover key techniques such as few-shot forecasting and fine-tuning. We use the [M5 datasets](https://drive.google.com/drive/folders/1D6EWdVSaOtrP1LEFh1REjI3vej6iUS_4?usp=sharing) from the official M-Competitions [repository](https://github.com/Mcompetitions/M5-methods) to forecast future sales aggregated by state. The aim of this recipe is to showcase how to use a pre-trained time series foundation model for multivariate forecasting and to explore various features available with Granite Time Series Foundation Models.

This recipe uses TinyTimeMixers (TTMs), which are compact pre-trained models for multivariate time series forecasting, open-sourced by IBM Research. With fewer than 1 million parameters, TTM introduces the notion of the first-ever "tiny" pre-trained models for time series forecasting. TTM outperforms several popular benchmarks demanding billions of parameters in zero-shot and few-shot forecasting and can easily be fine-tuned for multivariate forecasts.

## Setting Up

### Install the TSFM Library

The [granite-tsfm library](https://github.com/ibm-granite/granite-tsfm) provides utilities for working with Time Series Foundation Models (TSFM). Here we install a pinned release of the library (v0.2.22), along with `gdown` for downloading data files from Google Drive.

In [1]:
# Install the tsfm library
! pip install "granite-tsfm[notebooks] @ git+https://github.com/ibm-granite/granite-tsfm.git@v0.2.22"
# Install a utility to help download data files from google drive during the data prep process
! pip install gdown


### Import Packages

From `tsfm_public`, we use the TinyTimeMixer model, forecasting pipeline, and plotting function. We also leverage a few components for the fine-tuning process.

In [18]:
import math
import os

import numpy as np
import pandas as pd
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import Subset
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments, set_seed

from tsfm_public import (
    ForecastDFDataset,
    TimeSeriesForecastingPipeline,
    TimeSeriesPreprocessor,
    TinyTimeMixerForPrediction,
    TrackingCallback,
    count_parameters,
)
from tsfm_public.toolkit.lr_finder import optimal_lr_finder
from tsfm_public.toolkit.time_series_preprocessor import prepare_data_splits
from tsfm_public.toolkit.util import select_by_timestamp
from tsfm_public.toolkit.visualization import plot_predictions

### Specify configuration variables

We specify the forecast length and the context length (both in time steps); the context length is set to match the pretrained model. Additionally, we declare the Granite Time Series Model and the specific revision we are targeting.

The granite-timeseries TTM R2 card has several different revisions of the model available for various context lengths and prediction lengths. In this example we will be working with daily data, so we choose a model suitable for that resolution -- 90 days of history to forecast the next 30 days. We set `forecast_length` to 28, which is shorter than the model's 30-step horizon; it is passed to the model later as `prediction_filter_length`.
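
The relationship between the model's native horizon and the shorter forecast used here can be pictured as a simple truncation. The sketch below is plain Python and only illustrates the idea, assuming the filter keeps the first `forecast_length` steps of the model output; `model_horizon` and `truncate_forecast` are hypothetical names, not part of the library.

```python
# Illustrative only: the selected model revision forecasts 30 steps,
# while this notebook keeps a 28-step forecast.
model_horizon = 30    # native horizon of the "90-30" revision
forecast_length = 28  # horizon used in this notebook


def truncate_forecast(model_output, keep):
    """Keep the first `keep` steps of a forecast (hypothetical helper)."""
    if keep > len(model_output):
        raise ValueError("cannot keep more steps than the model produces")
    return model_output[:keep]


raw_forecast = list(range(1, model_horizon + 1))  # stand-in for 30 predicted values
forecast = truncate_forecast(raw_forecast, forecast_length)
print(len(raw_forecast), len(forecast))
```

Requesting more steps than the model produces raises an error in this sketch, which mirrors why the forecast length should not exceed the horizon of the chosen revision.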

In [17]:
forecast_length = 28
context_length = 90

TTM_MODEL_PATH = "ibm-granite/granite-timeseries-ttm-r2"
REVISION = "90-30-ft-l1-r2.1"

# Use MPS (Metal Performance Shaders) for Apple Silicon GPU acceleration
if torch.backends.mps.is_available():
    device = "mps"
    print("✓ Using Apple Silicon GPU (MPS)")
elif torch.cuda.is_available():
    device = "cuda"
    print("✓ Using CUDA GPU")
else:
    device = "cpu"
    print("⚠ Using CPU (slower)")

✓ Using Apple Silicon GPU (MPS)


## Preparing the Data

As mentioned in the introduction, this notebook makes use of the [M5 datasets](https://drive.google.com/drive/folders/1D6EWdVSaOtrP1LEFh1REjI3vej6iUS_4?usp=sharing) from the official M-Competitions [repository](https://github.com/Mcompetitions/M5-methods).


The original data includes hierarchy and product information. For this example we aggregate the sales by state into three separate time series. The code for downloading and preparing the datasets is available in `M5_retail_data_prep.py`. We first make sure we have access to the `M5_retail_data_prep.py` file (in an environment like Colab, the file must be downloaded) and then run the `prepare_data()` function to save the prepared dataset.


### Read in the data

We read the CSV into a pandas dataframe, parsing the `timestamp` column as datetimes, and then drop the columns that are not needed for modeling.

In [19]:
dataset_path = "/Users/darenpalmer/Desktop/UCL/CS/fyp.nosync/data/combined_obd_dataset_watsonx.csv"
print(f"\nLoading combined dataset from: {dataset_path}")
data = pd.read_csv(dataset_path, parse_dates=["timestamp"])


Loading combined dataset from: /Users/darenpalmer/Desktop/UCL/CS/fyp.nosync/data/combined_obd_dataset_watsonx.csv


In [20]:
data = data.drop(columns=["vehicle", "Time_original", "unique_id", "date", "original_row_id", "route_from", "route_to"])
print(f"✓ Loaded: {len(data):,} total samples")
print(f"✓ Columns: {len(data.columns)}")

# Preview
print("\nDataset preview:")
data.head()

✓ Loaded: 2,693,087 total samples
✓ Columns: 16

Dataset preview:


Unnamed: 0,timestamp,Engine Coolant Temperature [Â°C],Intake Manifold Absolute Pressure [kPa],Engine RPM [RPM],Vehicle Speed Sensor [km/h],Intake Air Temperature [Â°C],Air Flow Rate from Mass Flow Sensor [g/s],Absolute Throttle Position [%],Ambient Air Temperature [Â°C],Accelerator Pedal Position D [%],Accelerator Pedal Position E [%],source_file,traffic_condition,Engine Coolant Temperature [°C],Intake Air Temperature [°C],Ambient Air Temperature [°C]
0,2017-07-05 07:16:31.257,31.0,96.0,0.0,0.0,22.0,0.91,89.0,21.0,14.1,14.5,2017-07-05_Seat_Leon_RT_S_Stau.csv,Stau,,,
1,2017-07-05 07:16:31.345,31.0,96.0,0.0,0.0,22.0,0.91,89.0,21.0,14.1,14.5,2017-07-05_Seat_Leon_RT_S_Stau.csv,Stau,,,
2,2017-07-05 07:16:31.435,31.0,96.0,0.0,0.0,22.0,0.91,89.0,21.0,14.1,14.5,2017-07-05_Seat_Leon_RT_S_Stau.csv,Stau,,,
3,2017-07-05 07:16:31.525,31.0,96.0,0.0,0.0,22.0,0.91,89.0,21.0,14.1,14.5,2017-07-05_Seat_Leon_RT_S_Stau.csv,Stau,,,
4,2017-07-05 07:16:31.615,31.0,96.0,0.0,0.0,22.0,0.91,89.0,21.0,14.1,14.5,2017-07-05_Seat_Leon_RT_S_Stau.csv,Stau,,,


Next, we declare the role of each column: the timestamp column, the ID column that identifies each series, the target column to be predicted, the control (exogenous) columns, and the static categorical column.

In [21]:
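# Every column without another role is treated as a control (exogenous) column
non_control_columns = {"timestamp", "Engine RPM [RPM]", "source_file", "traffic_condition"}
cols = [c for c in data.columns if c not in non_control_columns]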

column_specifiers = {
    "timestamp_column": "timestamp",
    "id_columns": ["source_file"],
    "target_columns": ["Engine RPM [RPM]"],
    "control_columns": cols,
    "static_categorical_columns": ["traffic_condition"],
}
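
Since every column must end up with exactly one role, a quick consistency check can catch a column that was forgotten or listed twice. The sketch below uses toy column names rather than the real dataframe, and `check_column_roles` is a hypothetical helper written for illustration, not a library function.

```python
def check_column_roles(all_columns, specifiers):
    """Return (unassigned, duplicated) column names for a role mapping."""
    assigned = []
    for value in specifiers.values():
        # The timestamp role holds a single name; the other roles hold lists
        assigned.extend([value] if isinstance(value, str) else value)
    duplicated = sorted({c for c in assigned if assigned.count(c) > 1})
    unassigned = [c for c in all_columns if c not in assigned]
    return unassigned, duplicated


# Toy column names standing in for the real dataframe columns
columns = ["timestamp", "series_id", "target", "exog_a", "exog_b", "category"]
specifiers = {
    "timestamp_column": "timestamp",
    "id_columns": ["series_id"],
    "target_columns": ["target"],
    "control_columns": ["exog_a", "exog_b"],
    "static_categorical_columns": ["category"],
}

unassigned, duplicated = check_column_roles(columns, specifiers)
print(unassigned, duplicated)
```

Both lists are empty when the mapping covers every column exactly once.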

### Train the Preprocessor

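The preprocessor learns the scaling factors (and the categorical encoding) from the data it is trained on. In the cell below we also split the source files 80/20 into training and test sets, but the preprocessor itself is trained on the entire dataset, so its scaling statistics include the test files; consider training it on `df_train` instead if the test set must stay fully unseen. The scaling is applied when we call the `preprocess` method of the time series preprocessor.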
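
To make the idea of learned scaling factors concrete, here is a minimal standard-scaling sketch in plain Python. It assumes a single series and uses the population standard deviation; the function names and the toy RPM values are illustrative, and the real preprocessor may differ in details such as scaling each series separately.

```python
import statistics


def fit_standard_scaler(values):
    """Learn the mean and (population) standard deviation from training values."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)  # avoid dividing by zero on constant series


def transform(values, mean, std):
    """Apply previously learned scaling factors."""
    return [(v - mean) / std for v in values]


def inverse_transform(values, mean, std):
    """Map scaled values back to the original units."""
    return [v * std + mean for v in values]


train_rpm = [800.0, 1200.0, 1600.0, 2000.0]  # toy training values
test_rpm = [1000.0, 2400.0]                  # toy held-out values

mean, std = fit_standard_scaler(train_rpm)   # statistics come from training data only
scaled_test = transform(test_rpm, mean, std)
restored = inverse_transform(scaled_test, mean, std)
print(mean, round(std, 2), [round(v, 3) for v in scaled_test])
```

The held-out values are scaled with statistics learned from the training values only, which is the property that is lost when the scaler is fitted on all of the data.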

In [22]:
tsp = TimeSeriesPreprocessor(
    **column_specifiers,
    context_length=context_length,
    prediction_length=forecast_length,
    scaling=True,
    encode_categorical=True,
    scaler_type="standard",
)

# Group data by source file
grouped_data = data.groupby(column_specifiers["id_columns"])

# Get the list of unique source files
source_files = list(grouped_data.groups.keys())

# Determine the split point (e.g., 80% for training, 20% for testing)
split_index = int(len(source_files) * 0.8)

# Split source files into training and testing sets
train_files = source_files[:split_index]
test_files = source_files[split_index:]

# Create training and testing dataframes
df_train = data[data[column_specifiers["id_columns"][0]].isin(train_files)].copy()
df_test = data[data[column_specifiers["id_columns"][0]].isin(test_files)].copy()

# Train the preprocessor on the entire dataset
trained_tsp = tsp.train(data)


## Finetune the model

Now we will focus on fine-tuning the pretrained model. We use the same data splits we defined above, but now include extra columns during the fine-tuning process.

### Preparing the data for fine-tuning

We split the data into training, validation, and test sets. The training set is used to train the model, while the test set is used to evaluate its performance.
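
The split in this notebook works on two levels: whole source files are held out for testing, and the rows of the remaining files are then divided chronologically. The plain-Python sketch below illustrates both levels on toy data; the helper names and file names are hypothetical, and it does not reproduce library details such as prepending context rows to the later splits.

```python
def split_ids(ids, train_fraction):
    """Hold out whole series: the first fraction trains, the rest tests."""
    split_index = int(len(ids) * train_fraction)
    return ids[:split_index], ids[split_index:]


def split_chronologically(rows, train_fraction):
    """Divide one time-ordered series so that later rows never precede earlier ones."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Level 1: split by file so that no test file is seen during training
file_ids = [f"trip_{i:02d}.csv" for i in range(10)]  # toy file names
train_ids, test_ids = split_ids(file_ids, 0.8)

# Level 2: split one training series by time
one_series = list(range(100))  # stand-in for time-ordered rows
early_rows, later_rows = split_chronologically(one_series, 0.75)

print(len(train_ids), len(test_ids), len(early_rows), len(later_rows))
```

Splitting by file keeps entire recordings out of training, while the chronological split ensures that the later portion of a series is never used to predict its own past.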

In [23]:
# Preprocess the entire dataset
data_preprocessed = tsp.preprocess(data)

# Fractions applied within the training files; prepare_data_splits assigns the remainder (1 - train - test) to validation
split_params = {"train": 0.75, "test": 0.25}

# Create training and testing dataframes from the preprocessed data
train_data_preprocessed = data_preprocessed[data_preprocessed[column_specifiers["id_columns"][0]].isin(train_files)].copy()
test_data_preprocessed = data_preprocessed[data_preprocessed[column_specifiers["id_columns"][0]].isin(test_files)].copy()

# Split the preprocessed training data into training and validation sets
train_data, valid_data, _ = prepare_data_splits(
    train_data_preprocessed, id_columns=column_specifiers["id_columns"], split_config=split_params, context_length=context_length
)

test_data = test_data_preprocessed # Use the preprocessed df_test as the test set

The trainer cannot consume pandas dataframes directly, so here we construct torch datasets using `ForecastDFDataset`, a torch dataset class designed specifically for forecasting use cases.

In [24]:
frequency_token = tsp.get_frequency_token(tsp.freq)

dataset_params = column_specifiers.copy()
dataset_params["frequency_token"] = frequency_token
dataset_params["context_length"] = context_length
dataset_params["prediction_length"] = forecast_length


train_dataset = ForecastDFDataset(train_data, **dataset_params)
valid_dataset = ForecastDFDataset(valid_data, **dataset_params)
test_dataset = ForecastDFDataset(test_data, **dataset_params)
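
Conceptually, a forecasting dataset turns each series into overlapping (context, target) windows. The sketch below shows that windowing in plain Python, assuming a stride of one step and a single series; `count_windows` and `make_windows` are hypothetical helpers, not the `ForecastDFDataset` implementation.

```python
def count_windows(series_length, context_length, prediction_length):
    """Number of full (context, target) windows a series can provide."""
    return max(0, series_length - context_length - prediction_length + 1)


def make_windows(series, context_length, prediction_length):
    """Slide over the series one step at a time and cut out each window."""
    windows = []
    for start in range(count_windows(len(series), context_length, prediction_length)):
        mid = start + context_length
        windows.append((series[start:mid], series[mid : mid + prediction_length]))
    return windows


series = list(range(120))  # toy series with 120 time steps
windows = make_windows(series, context_length=90, prediction_length=28)
print(len(windows), len(windows[0][0]), len(windows[0][1]))
```

A series shorter than `context_length + prediction_length` yields no windows at all, which is one reason a very short split can produce only a handful of samples.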



Now let's take a smaller sample from the torch datasets produced above.

In [25]:
# 20% training and validation data (few-shot finetuning)
fewshot_fraction = 0.20
n_train_all = len(train_dataset)
train_index = np.random.permutation(n_train_all)[: int(fewshot_fraction * n_train_all)]
train_dataset = Subset(train_dataset, train_index)

n_valid_all = len(valid_dataset)
valid_index = np.random.permutation(n_valid_all)[: int(fewshot_fraction * n_valid_all)]
valid_dataset = Subset(valid_dataset, valid_index)

n_train_all, len(train_dataset), n_valid_all, len(valid_dataset)

(1509213, 301842, 64, 12)
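
The sampling cell above draws its permutation before any seed is set, so a rerun selects a different subset. The sketch below shows one way to make the few-shot selection reproducible using only the standard library; `few_shot_indices` and the seed value are illustrative assumptions, not part of the original workflow.

```python
import random


def few_shot_indices(n_samples, fraction, seed):
    """Pick a reproducible random subset of sample indices."""
    rng = random.Random(seed)  # local generator, so global state is untouched
    k = int(fraction * n_samples)
    return rng.sample(range(n_samples), k)


idx_a = few_shot_indices(1000, 0.20, seed=1234)
idx_b = few_shot_indices(1000, 0.20, seed=1234)
print(len(idx_a), idx_a == idx_b)
```

The same truncation with `int(...)` explains the sizes printed above: 20% of 1,509,213 is 301,842.6, which is rounded down to 301,842.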

### Load the model for fine-tuning

We must first load the TTM model available on Hugging Face using the model and revision set above. We have one target channel, several exogenous channels, and one static categorical input. To take these into account, we use the `TimeSeriesPreprocessor` to provide the `prediction_channel_indices`, `exogenous_channel_indices`, and `categorical_vocab_size_list` information to the model. Note that we also enable channel mixing in the decoder and forecast channel mixing. This allows the decoder to be tuned to capture interactions between the channels and to adjust the forecasts based on interactions with the exogenous channels.

In [26]:
set_seed(1234)

finetune_forecast_model = TinyTimeMixerForPrediction.from_pretrained(
    TTM_MODEL_PATH,
    revision=REVISION,
    context_length=context_length,
    prediction_filter_length=forecast_length,
    num_input_channels=tsp.num_input_channels,
    decoder_mode="mix_channel",  # exog:  set to mix_channel for mixing channels in history
    prediction_channel_indices=tsp.prediction_channel_indices,
    exogenous_channel_indices=tsp.exogenous_channel_indices,
    fcm_context_length=1,  # exog: indicates lag length to use in the exog fusion. for Ex. if today sales can get affected by discount on +/- 2 days, mention 2
    fcm_use_mixer=True,  # exog: Try true (1st option) or false
    fcm_mix_layers=2,  # exog: Number of layers for exog mixing
    enable_forecast_channel_mixing=True,  # exog: set true for exog mixing
    categorical_vocab_size_list=tsp.categorical_vocab_size_list,  # sizes of the static categorical variables
    fcm_prepend_past=True,  # exog: set true to include lag from history during exog infusion.
)

Some weights of TinyTimeMixerForPrediction were not initialized from the model checkpoint at ibm-granite/granite-timeseries-ttm-r2 and are newly initialized: ['decoder.decoder_block.mixers.0.channel_feature_mixer.gating_block.attn_layer.bias', 'decoder.decoder_block.mixers.0.channel_feature_mixer.gating_block.attn_layer.weight', 'decoder.decoder_block.mixers.0.channel_feature_mixer.mlp.fc1.bias', 'decoder.decoder_block.mixers.0.channel_feature_mixer.mlp.fc1.weight', 'decoder.decoder_block.mixers.0.channel_feature_mixer.mlp.fc2.bias', 'decoder.decoder_block.mixers.0.channel_feature_mixer.mlp.fc2.weight', 'decoder.decoder_block.mixers.0.channel_feature_mixer.norm.norm.bias', 'decoder.decoder_block.mixers.0.channel_feature_mixer.norm.norm.weight', 'decoder.decoder_block.mixers.1.channel_feature_mixer.gating_block.attn_layer.bias', 'decoder.decoder_block.mixers.1.channel_feature_mixer.gating_block.attn_layer.weight', 'decoder.decoder_block.mixers.1.channel_feature_mixer.mlp.fc1.bias', 'dec

### Optional: Freeze the TTM Backbone

Oftentimes, during fine-tuning we freeze the backbone and focus on tuning only the parameters in the decoder. This reduces the overall number of parameters being tuned and maintains what the encoder learned during pretraining.

For this dataset, however, we found that performance was better when the backbone remained unfrozen -- for other datasets one might prefer to freeze the backbone. We have disabled the backbone freezing code, but left it intact as an example of what might need to be done for other datasets.

In [27]:
freeze_backbone = False
if freeze_backbone:
    print(
        "Number of params before freezing backbone",
        count_parameters(finetune_forecast_model),
    )

    # Freeze the backbone of the model
    for param in finetune_forecast_model.backbone.parameters():
        param.requires_grad = False

    # Count params
    print(
        "Number of params after freezing the backbone",
        count_parameters(finetune_forecast_model),
    )

### Set up a Trainer for Fine-tuning

Configure a Trainer for use in fine-tuning and evaluating the model.

In [29]:
num_epochs = 50
# Optimized batch size for M2 with 16GB - can increase if you have more memory headroom
batch_size = 32  # Increased from 16 for better GPU utilization

learning_rate, finetune_forecast_model = optimal_lr_finder(
    finetune_forecast_model,
    train_dataset,
    batch_size=batch_size,
    enable_prefix_tuning=True,
)
print("OPTIMAL SUGGESTED LEARNING RATE =", learning_rate)

INFO:p-31617:t-8814599936:lr_finder.py:optimal_lr_finder:LR Finder: Running learning rate (LR) finder algorithm. If the suggested LR is very low, we suggest setting the LR manually.
INFO:p-31617:t-8814599936:lr_finder.py:optimal_lr_finder:LR Finder: Using CPU.
INFO:p-31617:t-8814599936:lr_finder.py:optimal_lr_finder:LR Finder: Suggested learning rate = 8.697490026177835e-06


OPTIMAL SUGGESTED LEARNING RATE = 8.697490026177835e-06
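
The batch size chosen here interacts with the gradient accumulation used in the training cell: together they determine the effective batch size and how many optimizer updates one epoch contains. The arithmetic is sketched below for the few-shot training set size reported earlier (301,842 samples). The helper name is hypothetical, and the trainer may round slightly differently, so treat the step count as an estimate.

```python
import math


def optimizer_steps_per_epoch(n_samples, batch_size, accumulation_steps):
    """Estimate optimizer updates per epoch when gradients are accumulated."""
    batches = math.ceil(n_samples / batch_size)      # forward/backward passes
    return math.ceil(batches / accumulation_steps)   # one update every few batches


n_train = 301_842        # few-shot training samples reported above
batch_size = 32
accumulation_steps = 2   # value used in the training cell

effective_batch_size = batch_size * accumulation_steps
steps = optimizer_steps_per_epoch(n_train, batch_size, accumulation_steps)
print(effective_batch_size, steps)
```

A learning-rate scheduler that advances once per optimizer update should be sized with this per-epoch count rather than with the number of batches.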


### Train the Model

Here we train the model on the training data.

In [30]:
OUT_DIR = "ttm_finetuned_models/"

print(f"Using learning rate = {learning_rate}")
print(f"Using device = {device}")

finetune_forecast_args = TrainingArguments(
    output_dir=os.path.join(OUT_DIR, "output"),
    overwrite_output_dir=True,
    learning_rate=learning_rate,
    num_train_epochs=num_epochs,
    do_eval=True,
    eval_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=2 * batch_size,
    dataloader_num_workers=2,  # Increased for better data loading on M2
    report_to="none",
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=1,
    logging_dir=os.path.join(OUT_DIR, "logs"),
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    use_cpu=False,  # Always use accelerator (MPS or CUDA) when available
    fp16=False,  # MPS doesn't support fp16 yet
    # Optimization: gradient accumulation for a larger effective batch size
    gradient_accumulation_steps=2,  # Effective batch size = batch_size * 2 = 64
)

# Create the early stopping callback
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=5,  # Number of epochs with no improvement after which to stop
    early_stopping_threshold=0.001,  # Minimum improvement required to consider as improvement
)
tracking_callback = TrackingCallback()

# Optimizer and scheduler
optimizer = AdamW(finetune_forecast_model.parameters(), lr=learning_rate)
scheduler = OneCycleLR(
    optimizer,
    learning_rate,
    epochs=num_epochs,
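    # The scheduler advances once per optimizer update, so account for gradient accumulation
    steps_per_epoch=math.ceil(len(train_dataset) / (batch_size * finetune_forecast_args.gradient_accumulation_steps)),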
)

finetune_forecast_trainer = Trainer(
    model=finetune_forecast_model,
    args=finetune_forecast_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback, tracking_callback],
    optimizers=(optimizer, scheduler),
)

# Fine tune
finetune_forecast_trainer.train()

finetune_forecast_trainer.evaluate(test_dataset)

Using learning rate = 8.697490026177835e-06
Using device = mps




Epoch,Training Loss,Validation Loss


KeyboardInterrupt: 
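
The training cell relies on early stopping with a patience of 5 epochs and a threshold of 0.001. The sketch below shows the general patience logic on a made-up sequence of validation losses. It is a simplified illustration under those assumptions, not the `transformers` implementation, and `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(losses, patience, threshold):
    """Return the 1-based epoch at which training would stop, or None."""
    best = None
    bad_epochs = 0
    for epoch, loss in enumerate(losses, start=1):
        if best is None or best - loss > threshold:
            bad_epochs = 0   # improvement large enough to reset the counter
        else:
            bad_epochs += 1  # no meaningful improvement this epoch
        if best is None or loss < best:
            best = loss
        if bad_epochs >= patience:
            return epoch
    return None


# Made-up validation losses that plateau after the third epoch
losses = [0.90, 0.80, 0.75, 0.7495, 0.7493, 0.7492, 0.7491, 0.7490, 0.7489]
print(stopping_epoch(losses, patience=5, threshold=0.001))
```

With these values the counter starts at epoch 4 and reaches the patience limit at epoch 8, so improvements smaller than the threshold do not keep training alive.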

### Evaluate the Model

Evaluate the fine-tuned model on the test dataset.

In [None]:
# Define some standard metrics.
def custom_metric(actual, prediction, column_header="results"):
    """Compute MSE, RMSE, and MAE, skipping rows whose actual values contain NaN."""
    a = np.asarray(actual.tolist())
    p = np.asarray(prediction.tolist())
    if p.shape[1] < a.shape[1]:
        a = a[:, : p.shape[1]]

    mask = ~np.any(np.isnan(a), axis=1)

    mse = np.mean(np.square(a[mask, :] - p[mask, :]))
    mae = np.mean(np.abs(a[mask, :] - p[mask, :]))
    return pd.DataFrame(
        {
            column_header: {
                "mean_squared_error": mse,
                "root_mean_squared_error": np.sqrt(mse),
                "mean_absolute_error": mae,
            }
        }
    )
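
As a worked example of the same arithmetic, the dependency-free sketch below computes the three metrics on a tiny hand-checkable input, dropping the row whose actual values contain NaN. The helper `masked_errors` and the numbers are made up for illustration, and the sketch omits the length truncation performed by `custom_metric`.

```python
import math


def masked_errors(actual, prediction):
    """Compute MSE, RMSE, and MAE over rows whose actual values are all finite."""
    kept = [
        (a_row, p_row)
        for a_row, p_row in zip(actual, prediction)
        if not any(math.isnan(a) for a in a_row)
    ]
    diffs = [a - p for a_row, p_row in kept for a, p in zip(a_row, p_row)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


actual = [[1.0, 2.0], [float("nan"), 3.0], [2.0, 4.0]]  # second row is skipped
prediction = [[1.5, 2.0], [0.0, 0.0], [1.0, 5.0]]

metrics = masked_errors(actual, prediction)
print(metrics)
```

The remaining errors are -0.5, 0, 1, and -1, giving an MSE of 0.5625, an RMSE of 0.75, and an MAE of 0.625.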

In [None]:
# generate forecasts using the finetuned model
pipeline = TimeSeriesForecastingPipeline(
    finetune_forecast_model,
    device=device,  # Specify your local GPU or CPU.
    feature_extractor=tsp,
    batch_size=batch_size,
)

# Make a forecast on the target column given the input data.
finetune_forecast = pipeline(test_data)
finetune_forecast.head()

In [None]:
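target_column = column_specifiers["target_columns"][0]  # the forecast dataframe holds actuals and "<target>_prediction"
custom_metric(finetune_forecast[target_column], finetune_forecast[f"{target_column}_prediction"], "fine-tune forecast")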

### Plot the Predictions vs. Actuals

Plot the predictions vs. actuals for a few randomly selected time intervals in the test dataset.

In [None]:
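# Plot a single series: pick one of the held-out files by its ID
id_column = column_specifiers["id_columns"][0]
example_id = test_data[id_column].iloc[0]

plot_predictions(
    input_df=test_data[test_data[id_column] == example_id],
    predictions_df=finetune_forecast[finetune_forecast[id_column] == example_id],
    freq=tsp.freq,  # assumes the frequency inferred by the preprocessor; this data is not daily
    timestamp_column=column_specifiers["timestamp_column"],
    channel=column_specifiers["target_columns"][0],
)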