<a href="https://colab.research.google.com/github/azhgh22/Walmart-Recruiting-Store-Sales-Forecasting/blob/main/notebooks/03_patch_tst.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Notebook Overview




The main objective of this notebook is to evaluate how well a neural network model, specifically **PatchTST**, performs on our problem.

To achieve this, we will:

- Experiment with the PatchTST model.
- Tune its hyperparameters to identify the best-performing configuration.
- Use a slightly **different data splitting strategy**. Instead of the usual approach, we split the data so that the **validation score better reflects the real-world score**, which has given more reliable assessments for this model.
- Train the model **only on the time series data**, excluding any additional features. That is, we will predict based solely on the **store**, **department**, and **date** information.
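
As a minimal sketch of what "time series only" input could look like, the snippet below reshapes the frame into the long format that NeuralForecast models consume: one identifier per series (`unique_id`), a timestamp (`ds`), and the target (`y`). The function name `to_long_format` and the `Store_Dept` identifier scheme are assumptions made for illustration; the actual conversion happens inside `NeuralForecastModels` and may differ.

```python
import pandas as pd


def to_long_format(X: pd.DataFrame, y: pd.Series) -> pd.DataFrame:
    """Build a (unique_id, ds, y) frame from Store, Dept, Date and the target.

    Hypothetical helper: the identifier scheme is an assumption, not the
    repository's implementation.
    """
    long_df = pd.DataFrame({
        # One series per (Store, Dept) pair.
        "unique_id": X["Store"].astype(str).to_numpy() + "_" + X["Dept"].astype(str).to_numpy(),
        "ds": pd.to_datetime(X["Date"]).to_numpy(),
        # Positional alignment: row i of y belongs to row i of X.
        "y": y.to_numpy(),
    })
    return long_df.sort_values(["unique_id", "ds"]).reset_index(drop=True)


example_X = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [2, 2],
    "Date": ["2010-02-12", "2010-02-05"],
})
example_y = pd.Series([20.0, 10.0])
print(to_long_format(example_X, example_y))
```

All other columns (markdowns, CPI, fuel price, and so on) are simply left out of this frame.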

We will be using our own implementation of the `NeuralForecastModels` framework, located in:

```
models/neural_forecast_models
```

# Notebook Setup

The following setup is provided as a basic example for initializing the notebook environment. It includes necessary imports, optional configuration, and a placeholder for data loading or downloading.

This section is **not part of the core model logic**, and the code here may vary depending on your environment or data access method.

## Setup Environment


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from google.colab import userdata
token = userdata.get('GITHUB_TOKEN')
user_name = userdata.get('GITHUB_USERNAME')
mail = userdata.get('GITHUB_MAIL')

!git config --global user.name "{user_name}"
!git config --global user.email "{mail}"
!git clone https://{token}@github.com/azhgh22/Walmart-Recruiting-Store-Sales-Forecasting.git

%cd Walmart-Recruiting-Store-Sales-Forecasting

Cloning into 'Walmart-Recruiting-Store-Sales-Forecasting'...
remote: Enumerating objects: 367, done.
remote: Counting objects: 100% (158/158), done.
remote: Compressing objects: 100% (136/136), done.
remote: Total 367 (delta 88), reused 52 (delta 22), pack-reused 209 (from 1)
Receiving objects: 100% (367/367), 6.92 MiB | 4.03 MiB/s, done.
Resolving deltas: 100% (185/185), done.
/content/Walmart-Recruiting-Store-Sales-Forecasting


In [None]:
%%capture
!pip install -r requirements.txt

In [None]:
%%capture
from google.colab import userdata
kaggle_json_path = userdata.get('KAGGLE_JSON_PATH')
! ./src/data_loader.sh -f {kaggle_json_path}

In [None]:
from google.colab import userdata
wandb_api_key = userdata.get('WANDB_API_LOGIN')
!wandb login {wandb_api_key}

wandb: No netrc file found, creating one.
wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


## Load and Split Data

In [None]:
from src import data_loader, processing
import importlib
importlib.reload(processing)

dataframes = data_loader.load_raw_data()
df = processing.run_preprocessing(dataframes, process_test=False, merge_features=True, merge_stores=True)['train']
X_train, y_train, X_valid, y_valid = processing.split_data_by_ratio(df, separate_target=True)

print(f"Shapes of train_df and valid_df: {X_train.shape}, {X_valid.shape}")

Data loading complete.
Shapes of train_df and valid_df: (337256, 15), (84314, 15)
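
The shapes printed above are consistent with an 80/20 split, and the row previews that follow suggest the frame is ordered by date and then cut by position, which would explain why 2012-04-13 appears in both parts. The sketch below shows one way such a split could be written. It is not the repository's `split_data_by_ratio`; the name `split_by_ratio_sketch`, the sort keys, and the default ratio are assumptions.

```python
import pandas as pd


def split_by_ratio_sketch(df: pd.DataFrame, ratio: float = 0.8, target: str = "Weekly_Sales"):
    """Chronological positional split: the first `ratio` share of rows is train.

    Hypothetical helper written for illustration only.
    """
    # Order rows by time first so the validation part is the most recent data.
    ordered = df.sort_values(["Date", "Store", "Dept"]).reset_index(drop=True)
    cut = int(len(ordered) * ratio)
    train, valid = ordered.iloc[:cut], ordered.iloc[cut:]
    return (
        train.drop(columns=target), train[target],
        valid.drop(columns=target), valid[target],
    )


demo = pd.DataFrame({
    "Store": [1] * 10,
    "Dept": [1] * 10,
    "Date": pd.date_range("2010-02-05", periods=10, freq="W-FRI"),
    "Weekly_Sales": [float(i) for i in range(10)],
}).sample(frac=1.0, random_state=0)  # shuffle to show that the sort restores order

X_tr, y_tr, X_va, y_va = split_by_ratio_sketch(demo)
print(X_tr.shape, X_va.shape)
```

Because the cut is positional, a single date can end up on both sides of the boundary, as in the previews of `X_train` and `X_valid`.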


In [None]:
X_train

Unnamed: 0,Store,Dept,Date,IsHoliday,Temperature,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,CPI,Unemployment,Type,Size
0,1,1,2010-02-05,False,42.31,2.572,,,,,,211.096358,8.106,A,151315
1,1,2,2010-02-05,False,42.31,2.572,,,,,,211.096358,8.106,A,151315
2,1,3,2010-02-05,False,42.31,2.572,,,,,,211.096358,8.106,A,151315
3,1,4,2010-02-05,False,42.31,2.572,,,,,,211.096358,8.106,A,151315
4,1,5,2010-02-05,False,42.31,2.572,,,,,,211.096358,8.106,A,151315
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
337251,22,27,2012-04-13,False,49.89,4.025,5981.5,10877.85,9.5,1633.96,1932.86,141.843393,7.671,B,119557
337252,22,28,2012-04-13,False,49.89,4.025,5981.5,10877.85,9.5,1633.96,1932.86,141.843393,7.671,B,119557
337253,22,29,2012-04-13,False,49.89,4.025,5981.5,10877.85,9.5,1633.96,1932.86,141.843393,7.671,B,119557
337254,22,30,2012-04-13,False,49.89,4.025,5981.5,10877.85,9.5,1633.96,1932.86,141.843393,7.671,B,119557


In [None]:
X_valid

Unnamed: 0,Store,Dept,Date,IsHoliday
337256,22,32,2012-04-13,False
337257,22,33,2012-04-13,False
337258,22,34,2012-04-13,False
337259,22,35,2012-04-13,False
337260,22,36,2012-04-13,False
...,...,...,...,...
421565,45,93,2012-10-26,False
421566,45,94,2012-10-26,False
421567,45,95,2012-10-26,False
421568,45,97,2012-10-26,False


# Train

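We begin by defining the `run_patchtst_cv` function, which is used throughout this notebook to grid-search PatchTST hyperparameters. Each configuration is trained on the training split and scored by WMAE on the validation split; despite the `cv` in its name, it relies on this single hold-out split rather than on multiple folds.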

In [None]:
from itertools import product
from neuralforecast.models import PatchTST
from models.neural_forecast_models import NeuralForecastModels
from src.utils import wmae as compute_wmae
import logging

logging.getLogger().setLevel(logging.WARNING)
logging.getLogger("neuralforecast").setLevel(logging.WARNING)
logging.getLogger("pytorch_lightning").setLevel(logging.WARNING)
logging.getLogger("lightning_fabric").setLevel(logging.WARNING)

def run_patchtst_cv(X_train, y_train, X_valid, y_valid,
                    param_grid,
                    fixed_params,
                    return_all=False):
    """Grid-search PatchTST over `param_grid` and score each run by validation WMAE.

    Returns all result dicts if `return_all` is True, otherwise the one with the lowest WMAE.
    """
    results = []

    # Evaluate every combination of the values listed in param_grid.
    keys, values = zip(*param_grid.items())
    for vals in product(*values):
        params = dict(zip(keys, vals))
        # fixed_params take precedence if a key appears in both dicts.
        params.update(fixed_params)

        # Silence per-run trainer output.
        params['enable_progress_bar'] = False
        params['enable_model_summary'] = False

        model = PatchTST(**params)

        nf_model = NeuralForecastModels(models=[model], model_names=['PatchTST'], freq='W-FRI', one_model=True)
        nf_model.fit(X_train, y_train)
        y_pred = nf_model.predict(X_valid)
        score = compute_wmae(y_valid, y_pred, X_valid['IsHoliday'])

        result = {'wmae': score, 'preds': y_pred}
        result.update(params)

        results.append(result)
        # Print the configuration and its score, hiding the display-only flags.
        print(" → ".join(f"{k}={v}" for k,v in params.items() if k not in ['enable_progress_bar','enable_model_summary']) + f" → WMAE={score:.4f}")

    if return_all:
        return results
    else:
        return min(results, key=lambda r: r['wmae'])
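
The `compute_wmae` helper comes from `src/utils` and is not shown in this notebook. Assuming it follows the Walmart competition metric, where holiday weeks carry a weight of 5 and all other weeks a weight of 1, a minimal sketch could look like the following. The name `wmae_sketch` and its signature are illustrative, not the repository's code.

```python
def wmae_sketch(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted mean absolute error with heavier weights on holiday weeks.

    Assumption: weight 5 for holiday weeks and 1 otherwise, as in the
    Walmart competition metric.
    """
    weights = [holiday_weight if flag else 1.0 for flag in is_holiday]
    weighted_error = sum(
        w * abs(t - p) for w, t, p in zip(weights, y_true, y_pred)
    )
    return weighted_error / sum(weights)


# One holiday week (error 2) and one regular week (error 3).
print(wmae_sketch([10.0, 20.0], [12.0, 17.0], [True, False]))
```

In this toy case the holiday error counts five times, so the score is (5 * 2 + 1 * 3) / 6.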


## Input Size and Batch Size




The first hyperparameters we will tune are:

- **Input Size**: This determines how many past time steps the model uses to make a prediction. Choosing an appropriate input size is crucial because too small a window may miss important patterns, while too large may introduce noise and increase computational cost.

- **Batch Size**: This controls how many samples the model processes before updating its internal weights. The batch size affects training stability and speed. Smaller batches offer more frequent updates but can be noisy; larger batches provide smoother gradients but require more memory.
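
To get a feel for how the input size interacts with the forecast horizon, the sketch below counts how many complete training windows of `input_size + h` weeks fit into the training period shown earlier (2010-02-05 to 2012-04-13). It assumes plain contiguous windows with no padding and a series that spans the whole period; the library's own window sampler may pad short series, so treat the numbers as a rough guide only. The helper names are hypothetical.

```python
from datetime import date


def count_weeks(start: date, end: date) -> int:
    """Number of weekly observations between two dates, both inclusive."""
    return (end - start).days // 7 + 1


def complete_windows(n_steps: int, input_size: int, horizon: int) -> int:
    """Contiguous (input, horizon) windows that fit entirely inside the series."""
    return max(n_steps - input_size - horizon + 1, 0)


n_weeks = count_weeks(date(2010, 2, 5), date(2012, 4, 13))
print(f"training weeks: {n_weeks}")
for input_size in (40, 52, 70):
    n = complete_windows(n_weeks, input_size, horizon=53)
    print(f"input_size={input_size}: {n} complete windows")
```

Under these assumptions, a longer input leaves fewer complete windows per series, and `input_size=70` with `h=53` leaves none, so larger inputs are not automatically better on a series of this length.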

In [None]:
from neuralforecast import NeuralForecast
from neuralforecast.models import PatchTST
from models.neural_forecast_models import NeuralForecastModels
from src.utils import wmae as compute_wmae

param_grid = {
    'input_size' : [40, 52, 70],
    'batch_size' : [32, 64, 128]
}

fixed_params = {
    'max_steps': 25 * 104,
    'h': 53,
    'random_seed': 42,
}

best_result = run_patchtst_cv(
    X_train, y_train, X_valid, y_valid,
    param_grid=param_grid,
    fixed_params=fixed_params,
    return_all=False
)

print("\nBest hyperparameters found:")
for param in param_grid.keys():
    print(f"  {param}: {best_result[param]}")
print(f"Best WMAE: {best_result['wmae']:.4f}")


input_size=40, batch_size=32 → WMAE=1727.5545
input_size=40, batch_size=64 → WMAE=1702.6652
input_size=40, batch_size=128 → WMAE=1716.7269
input_size=52, batch_size=32 → WMAE=1562.4860
input_size=52, batch_size=64 → WMAE=1538.3386
input_size=52, batch_size=128 → WMAE=1549.5148
input_size=70, batch_size=32 → WMAE=1637.8330
input_size=70, batch_size=64 → WMAE=1670.5391
input_size=70, batch_size=128 → WMAE=1643.3104
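
The scores above are easier to compare as a grid. The sketch below only re-enters the printed WMAE values (no new runs are involved) and picks the lowest one; the variable names are illustrative.

```python
# WMAE values copied from the output above, keyed by (input_size, batch_size).
scores = {
    (40, 32): 1727.5545, (40, 64): 1702.6652, (40, 128): 1716.7269,
    (52, 32): 1562.4860, (52, 64): 1538.3386, (52, 128): 1549.5148,
    (70, 32): 1637.8330, (70, 64): 1670.5391, (70, 128): 1643.3104,
}

best_input_size, best_batch_size = min(scores, key=scores.get)

input_sizes = sorted({key[0] for key in scores})
batch_sizes = sorted({key[1] for key in scores})

# Print one row per input size and one column per batch size.
print("input_size".ljust(12) + "".join(f"bs={b}".rjust(12) for b in batch_sizes))
for i in input_sizes:
    row = "".join(f"{scores[(i, b)]:.2f}".rjust(12) for b in batch_sizes)
    print(str(i).ljust(12) + row)

print(f"Lowest WMAE: input_size={best_input_size}, batch_size={best_batch_size}")
```

The lowest score belongs to `input_size=52` with `batch_size=64`, the values that the later cells keep fixed.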


In [None]:
param_grid = {
    'dropout': [0.0, 0.1, 0.2],
    # 'patch_len': [2, 4],
    # 'stride': [1, 2, 4],
}

fixed_params = {
    'max_steps': 25 * 104,
    'h': 53,
    'random_seed': 42,
    'input_size': 52,
    'batch_size' : 64,
}

best_result = run_patchtst_cv(
    X_train, y_train, X_valid, y_valid,
    param_grid=param_grid,
    fixed_params=fixed_params,
    return_all=False
)

print("\nBest hyperparameters found:")
for param in param_grid.keys():
    print(f"  {param}: {best_result[param]}")
print(f"Best WMAE: {best_result['wmae']:.4f}")


dropout=0.0 → max_steps=2600 → h=53 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1573.5200
dropout=0.1 → max_steps=2600 → h=53 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1540.9934
dropout=0.2 → max_steps=2600 → h=53 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1538.3386

Best hyperparameters found:
  dropout: 0.2
Best WMAE: 1538.3386


## Regularization



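The preceding cell compared dropout rates of **0.0**, **0.1**, and **0.2** to address potential overfitting; **0.2** scored best (WMAE 1538.34). The next cell checks whether a higher rate (**0.4**) or much shorter patches (`patch_len` of 2 or 4) help. They do not: every combination scores worse than 1538.34, so we keep dropout at 0.2.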

In [None]:
param_grid = {
    'dropout': [0.2, 0.4],
    'patch_len': [2, 4],
}

fixed_params = {
    'max_steps': 25 * 104,
    'h': 53,
    'random_seed': 42,
    'input_size': 52,
    'batch_size' : 64,
}

best_result = run_patchtst_cv(
    X_train, y_train, X_valid, y_valid,
    param_grid=param_grid,
    fixed_params=fixed_params,
    return_all=False
)

print("\nBest hyperparameters found:")
for param in param_grid.keys():
    print(f"  {param}: {best_result[param]}")
print(f"Best WMAE: {best_result['wmae']:.4f}")


dropout=0.2 → patch_len=2 → max_steps=2600 → h=53 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1769.1498
dropout=0.2 → patch_len=4 → max_steps=2600 → h=53 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1657.5459
dropout=0.4 → patch_len=2 → max_steps=2600 → h=53 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1823.3388
dropout=0.4 → patch_len=4 → max_steps=2600 → h=53 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1692.0812

Best hyperparameters found:
  dropout: 0.2
  patch_len: 4
Best WMAE: 1657.5459


## Other Hyperparameters



Although the model already appears to capture the signal well, we will still tune the **stride** and **patch length** parameters. We restrict the search to a few moderate values (`patch_len` of 16 or 32, `stride` of 8 or 16) to balance model complexity against computational cost.
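
To make the effect of these two parameters concrete, the sketch below counts the patches cut from a 52-step input and the share of time steps they cover. It assumes plain strided slicing with no end padding (the library may pad the input, which would add a patch), and the helper names are hypothetical. When `patch_len` is smaller than `stride`, consecutive patches leave gaps; this may partly explain the weaker scores of the earlier `patch_len` 2 and 4 runs, if the stride was left at 8 there.

```python
def patch_starts(input_size, patch_len, stride):
    """Start indices of the patches cut from a window of `input_size` steps."""
    return list(range(0, input_size - patch_len + 1, stride))


def coverage(input_size, patch_len, stride):
    """Fraction of time steps that fall inside at least one patch."""
    covered = set()
    for start in patch_starts(input_size, patch_len, stride):
        covered.update(range(start, start + patch_len))
    return len(covered) / input_size


for patch_len, stride in [(16, 8), (32, 16), (4, 8), (2, 8)]:
    n_patches = len(patch_starts(52, patch_len, stride))
    share = coverage(52, patch_len, stride)
    print(f"patch_len={patch_len:>2} stride={stride:>2} "
          f"patches={n_patches} coverage={share:.2f}")
```

Overlapping patches (`patch_len` larger than `stride`) see almost the whole input, while short patches with a long stride skip a large part of it.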


In [None]:
param_grid = {
    'patch_len': [16, 32],
    'stride' : [8, 16]
}

fixed_params = {
    'max_steps': 25 * 104,
    'h': 53,
    'dropout' : 0.2,
    'random_seed': 42,
    'input_size': 52,
    'batch_size' : 64,
}

best_result = run_patchtst_cv(
    X_train, y_train, X_valid, y_valid,
    param_grid=param_grid,
    fixed_params=fixed_params,
    return_all=False
)

print("\nBest hyperparameters found:")
for param in param_grid.keys():
    print(f"  {param}: {best_result[param]}")
print(f"Best WMAE: {best_result['wmae']:.4f}")


patch_len=16 → stride=8 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1538.3386
patch_len=16 → stride=16 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1558.7274
patch_len=32 → stride=8 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1554.5784
patch_len=32 → stride=16 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1560.6312

Best hyperparameters found:
  patch_len: 16
  stride: 8
Best WMAE: 1538.3386


Finally, we experiment with training-related settings (the optimizer, learning rate, and activation function) to see whether they help the model learn more effectively and extract more information from the data.


In [None]:
import torch.optim as optim

param_grid = {
    'optimizer': [optim.Adam, optim.AdamW],
    'learning_rate': [5e-3, 1e-3, 5e-4]
}

fixed_params = {
    'max_steps': 25 * 104,
    'h': 53,
    'dropout' : 0.2,
    'random_seed': 42,
    'input_size': 52,
    'batch_size' : 64,
}

best_result = run_patchtst_cv(
    X_train, y_train, X_valid, y_valid,
    param_grid=param_grid,
    fixed_params=fixed_params,
    return_all=False
)

print("\nBest hyperparameters found:")
for param in param_grid.keys():
    print(f"  {param}: {best_result[param]}")
print(f"Best WMAE: {best_result['wmae']:.4f}")


optimizer=<class 'torch.optim.adam.Adam'> → learning_rate=0.005 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1813.3649
optimizer=<class 'torch.optim.adam.Adam'> → learning_rate=0.001 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1580.1102
optimizer=<class 'torch.optim.adam.Adam'> → learning_rate=0.0005 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1568.4398
optimizer=<class 'torch.optim.adamw.AdamW'> → learning_rate=0.005 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1764.6460
optimizer=<class 'torch.optim.adamw.AdamW'> → learning_rate=0.001 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1576.4206
optimizer=<class 'torch.optim.adamw.AdamW'> → learning_rate=0.0005 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 →

In [None]:
import torch.optim as optim

param_grid = {
    'learning_rate': [1e-4, 5e-4, 1e-5]
}

fixed_params = {
    'max_steps': 25 * 104,
    'h': 53,
    'dropout' : 0.2,
    'random_seed': 42,
    'input_size': 52,
    'batch_size' : 64,
}

best_result = run_patchtst_cv(
    X_train, y_train, X_valid, y_valid,
    param_grid=param_grid,
    fixed_params=fixed_params,
    return_all=False
)

print("\nBest hyperparameters found:")
for param in param_grid.keys():
    print(f"  {param}: {best_result[param]}")
print(f"Best WMAE: {best_result['wmae']:.4f}")


learning_rate=0.0001 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1538.3386
learning_rate=0.0005 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1568.4398
learning_rate=1e-05 → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1626.1420

Best hyperparameters found:
  learning_rate: 0.0001
Best WMAE: 1538.3386


In [None]:
param_grid = {
    'activation': ['relu', 'gelu']
}


fixed_params = {
    'max_steps': 25 * 104,
    'h': 53,
    'dropout' : 0.2,
    'random_seed': 42,
    'input_size': 52,
    'batch_size' : 64,
}

best_result = run_patchtst_cv(
    X_train, y_train, X_valid, y_valid,
    param_grid=param_grid,
    fixed_params=fixed_params,
    return_all=False
)

print("\nBest hyperparameters found:")
for param in param_grid.keys():
    print(f"  {param}: {best_result[param]}")
print(f"Best WMAE: {best_result['wmae']:.4f}")


activation=relu → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1537.9701
activation=gelu → max_steps=2600 → h=53 → dropout=0.2 → random_seed=42 → input_size=52 → batch_size=64 → WMAE=1538.3386

Best hyperparameters found:
  activation: relu
Best WMAE: 1537.9701


# Final Model


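The best configuration found during tuning is shown below. For the final model, `max_steps` is raised from the 2600 (25 * 104) used during tuning to 6240 (60 * 104):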

| Parameter                | Value           |
|--------------------------|-----------------|
| **input_size**           | 52              |
| **dropout**              | 0.2             |
| **h (forecast horizon)** | 53              |
| **max_steps**            | 6240 (60 * 104) |
| **batch_size**           | 64              |
| **random_seed**          | 42              |
| **activation**           | relu            |


Validation score (WMAE): **1526.46**


In [None]:
from neuralforecast.models import PatchTST
from models.neural_forecast_models import NeuralForecastModels
from src.utils import wmae as compute_wmae

model = PatchTST(
    input_size=52,
    dropout = 0.2,
    h=53,
    max_steps= 60 * 104,
    batch_size=64,
    random_seed=42,
    activation='relu',
    enable_progress_bar=False,
    enable_model_summary=False,
)
nf_model = NeuralForecastModels(models=[model], model_names=['PatchTST'], freq='W-FRI', one_model=True)

nf_model.fit(X_train, y_train)
y_pred = nf_model.predict(X_valid)
wmae = compute_wmae(y_valid, y_pred, X_valid['IsHoliday'])

print(wmae)

1526.4649587770039


Now, we will train the selected best model on the entire dataset to leverage all available data. Additionally, we will log the training process and metrics using **Weights & Biases (wandb)** for experiment tracking.

In [None]:
from neuralforecast.models import PatchTST
from models.neural_forecast_models import NeuralForecastModels
from src.utils import wmae as compute_wmae

model = PatchTST(
    input_size=52,
    dropout = 0.2,
    h=53,
    max_steps= 60 * 104,
    batch_size=64,
    random_seed=42,
    activation='relu',
    enable_progress_bar=False,
    enable_model_summary=False,
)
nf_model = NeuralForecastModels(models=[model], model_names=['PatchTST'], freq='W-FRI', one_model=True)

nf_model.fit(df.drop(columns='Weekly_Sales'), df['Weekly_Sales'])

In [None]:
from configs import basic_config
from configs import nn_models_config

import importlib
importlib.reload(basic_config)
importlib.reload(nn_models_config)

from sklearn.pipeline import Pipeline
from configs.basic_config import minimal_config as cfg
from configs.nn_models_config import patchtst_config
from src.utils import log_to_wandb

log_to_wandb(
    model=nf_model,
    train_score=-1,
    val_score=wmae,
    config=cfg | patchtst_config,
    run_name='patch_tst_01',
    artifact_name="patch_tst",
)

wandb: Currently logged in as: zhorzholianimate (MLBeasts) to https://api.wandb.ai. Use `wandb login --relogin` to force relogin


Run summary:
train_wmae: -1.0
val_wmae: 1526.46496
