**Downloading Kaggle data sets directly into Colab**

Install the Kaggle Python library.

In [3]:
! pip install kaggle



Mount Google Drive so your Kaggle API credentials can be stored there and reused in future sessions.

In [5]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


Download your Kaggle API key (a .json file) by going to your Kaggle account page and clicking 'Create new API token' under the API section.

Then make a .kaggle directory in the home folder of the temporary Colab instance.

In [6]:
! mkdir -p ~/.kaggle

Upload the JSON file to Google Drive, then copy it to the temporary location.

In [7]:
!cp /content/drive/MyDrive/ColabNotebooks/kaggle_API_credentials/kaggle.json ~/.kaggle/kaggle.json

Restrict the file permissions to read/write for the owner only.

In [8]:
! chmod 600 ~/.kaggle/kaggle.json
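
The three shell steps above (create the directory, copy the token, restrict permissions) can also be done from Python. This is a minimal sketch, not part of the kaggle package: `install_kaggle_token` is a hypothetical helper name, and the source path is whatever location you uploaded the token to.

```python
import os
import shutil
import stat
from pathlib import Path


def install_kaggle_token(src, home=None):
    """Copy a Kaggle API token to <home>/.kaggle/kaggle.json with mode 600.

    `src` is wherever the token was uploaded (an assumption of this sketch);
    `home` defaults to the current user's home directory.
    """
    home = Path(home) if home is not None else Path.home()
    target_dir = home / ".kaggle"
    target_dir.mkdir(parents=True, exist_ok=True)  # same effect as `mkdir -p`
    target = target_dir / "kaggle.json"
    shutil.copyfile(src, target)
    # Owner read/write only, equivalent to `chmod 600`.
    os.chmod(target, stat.S_IRUSR | stat.S_IWUSR)
    return target
```

Because the directory is created with `exist_ok=True`, the helper can be re-run in the same session without failing.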

**Competitions and Datasets are the two types of Kaggle data**

**1. Download competition data**

If you get a 403 Forbidden error, you need to click 'Late Submission' on the Kaggle page for that competition.

In [9]:
! kaggle competitions download -c walmart-recruiting-store-sales-forecasting

Downloading walmart-recruiting-store-sales-forecasting.zip to /content
  0% 0.00/2.70M [00:00<?, ?B/s]
100% 2.70M/2.70M [00:00<00:00, 932MB/s]


Unzip the download if it arrives as a zip archive. Refresh the file browser on the left-hand side to update the view.

In [10]:
! unzip walmart-recruiting-store-sales-forecasting

Archive:  walmart-recruiting-store-sales-forecasting.zip
  inflating: features.csv.zip        
  inflating: sampleSubmission.csv.zip  
  inflating: stores.csv              
  inflating: test.csv.zip            
  inflating: train.csv.zip           
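
As the listing shows, the outer archive contains further .csv.zip files. If you prefer to stay in Python rather than call the `unzip` shell command, the standard-library `zipfile` module does the same job. A minimal sketch; `extract_archive` is a hypothetical helper name.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest="."):
    """Extract every member of `zip_path` into `dest` and return their names."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example call (path taken from the download step above):
# extract_archive("walmart-recruiting-store-sales-forecasting.zip")
```

The returned list of member names is a quick way to confirm which files are still zipped after the first extraction.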


In [14]:
!pip install torch torchvision torchaudio
!pip install pytorch-lightning
!pip install optuna
!pip install mlflow==2.7.1
!pip install -q dagshub



In [1]:
!pip install --upgrade numpy
!pip install --upgrade pandas

Collecting pandas
  Using cached pandas-2.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Using cached pandas-2.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Installing collected packages: pandas
  Attempting uninstall: pandas
    Found existing installation: pandas 2.2.2
    Uninstalling pandas-2.2.2:
      Successfully uninstalled pandas-2.2.2
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mlflow 2.7.1 requires numpy<2, but you have numpy 2.3.1 which is incompatible.
google-colab 1.0.0 requires pandas==2.2.2, but you have pandas 2.3.0 which is incompatible.
dask-expr 1.1.21 requires pyarrow>=14.0.1, but you have pyarrow 13.0.0 which is incompatible.
bigframes 2.8.0 requires pyarrow>=15.0.2, but you have pyarrow 13.0.0 which is incompatible.
db-dtypes 1.4.3 requires packaging>=24.2.0, but you hav

In [3]:
import pandas as pd
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor, ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_absolute_error, mean_squared_error
import matplotlib.pyplot as plt
import seaborn as sns
import gc
import optuna
import mlflow
import dagshub
from typing import Optional, List, Tuple, Dict, Any
import warnings
warnings.filterwarnings('ignore')

pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.expand_frame_repr', False)

In [25]:
stores = pd.read_csv('stores.csv')
train = pd.read_csv("train.csv.zip")
features = pd.read_csv('features.csv.zip')
sample = pd.read_csv('sampleSubmission.csv.zip')
test = pd.read_csv('test.csv.zip')
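
The inner .csv.zip files do not need a second unzip step: `pd.read_csv` infers the compression from the file suffix and reads a zip archive directly, provided the archive holds exactly one file. A small sketch with a toy archive (the file name and values are made up for illustration):

```python
import tempfile
import zipfile
from pathlib import Path

import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    zip_path = Path(tmp) / "toy.csv.zip"
    # Build a single-member zip, mirroring the layout of train.csv.zip.
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr(
            "toy.csv",
            "Store,Dept,Weekly_Sales\n1,1,100.0\n1,2,200.0\n",
        )
    # compression="infer" is the default; the .zip suffix triggers unzipping.
    toy = pd.read_csv(zip_path)

print(toy.shape)
```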

In [26]:
# Convert 'Date' columns to datetime objects for easier manipulation
train['Date'] = pd.to_datetime(train['Date'])
test['Date'] = pd.to_datetime(test['Date'])
features['Date'] = pd.to_datetime(features['Date'])

# Merge features with train and test data.
# Note: 'IsHoliday' is present in both train/test and features.csv.
# We'll merge on it to ensure consistency, but if there were discrepancies,
# we'd need a more careful merge strategy.
train_df = pd.merge(train, features, on=['Store', 'Date', 'IsHoliday'], how='left')
test_df = pd.merge(test, features, on=['Store', 'Date', 'IsHoliday'], how='left')

# Merge store information
train_df = pd.merge(train_df, stores, on='Store', how='left')
test_df = pd.merge(test_df, stores, on='Store', how='left')

print("\n--- Merged Train Data Head ---")
print(train_df.head())
print("\n--- Merged Test Data Head ---")
print(test_df.head())

print("\n--- Merged Train Data Info ---")
print(train_df.info())
print("\n--- Merged Test Data Info ---")
print(test_df.info())

# Free up memory
del train, test, features, stores
gc.collect()


--- Merged Train Data Head ---
   Store  Dept       Date  Weekly_Sales  IsHoliday  Temperature  Fuel_Price  MarkDown1  MarkDown2  MarkDown3  MarkDown4  MarkDown5         CPI  Unemployment Type    Size
0      1     1 2010-02-05      24924.50      False        42.31       2.572        NaN        NaN        NaN        NaN        NaN  211.096358         8.106    A  151315
1      1     1 2010-02-12      46039.49       True        38.51       2.548        NaN        NaN        NaN        NaN        NaN  211.242170         8.106    A  151315
2      1     1 2010-02-19      41595.55      False        39.93       2.514        NaN        NaN        NaN        NaN        NaN  211.289143         8.106    A  151315
3      1     1 2010-02-26      19403.54      False        46.63       2.561        NaN        NaN        NaN        NaN        NaN  211.319643         8.106    A  151315
4      1     1 2010-03-05      21827.90      False        46.50       2.625        NaN        NaN        NaN        Na

28950
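
The merge comments above mention that a discrepancy in 'IsHoliday' between the two files would need more care. One way to detect such a discrepancy is to let pandas report it, using the `validate` and `indicator` arguments of `pd.merge`. The sketch below uses toy frames with made-up values, and the last holiday flag is made to disagree on purpose.

```python
import pandas as pd

# Toy stand-ins for train and features (values are illustrative only).
dates = pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19"])
sales = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": dates,
    "IsHoliday": [False, True, False],
    "Weekly_Sales": [100.0, 200.0, 150.0],
})
feats = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": dates,
    "IsHoliday": [False, True, True],  # last flag disagrees on purpose
    "Temperature": [42.31, 38.51, 39.93],
})

merged = pd.merge(
    sales, feats,
    on=["Store", "Date", "IsHoliday"],
    how="left",
    validate="many_to_one",  # raises if the right side has duplicate keys
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)
unmatched = merged[merged["_merge"] == "left_only"]
print(len(unmatched))
```

Rows tagged `left_only` found no partner in the feature table, so their feature columns are NaN. A count of zero on the real data would confirm that merging on 'IsHoliday' is safe.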

## **DATA CLEANING**


In [8]:
class MissingValueImputer(BaseEstimator, TransformerMixin):
    """Custom Transformer to handle missing values for specific columns."""
    def __init__(self, markdown_cols=None, numerical_cols_to_impute=None):
        self.markdown_cols = markdown_cols if markdown_cols is not None else [f'MarkDown{i}' for i in range(1, 6)]
        self.numerical_cols_to_impute = numerical_cols_to_impute if numerical_cols_to_impute is not None else ['Temperature', 'Fuel_Price', 'CPI', 'Unemployment']
        self.means = {}

    def fit(self, X, y=None):
        for col in self.numerical_cols_to_impute:
            if col in X.columns:
                self.means[col] = X[col].mean()
        return self

    def transform(self, X):
        X_copy = X.copy()

        # Fill MarkDown columns with 0 and create missing indicators
        for col in self.markdown_cols:
            if col in X_copy.columns:
                X_copy[f"{col}_was_missing"] = X_copy[col].isna().astype(int)
                X_copy[col] = X_copy[col].fillna(0)

        # Impute other numerical columns
        for col in self.numerical_cols_to_impute:
            if col in X_copy.columns:
                # Series.ffill()/bfill() replace the deprecated fillna(method=...) form
                X_copy[col] = X_copy[col].ffill().bfill()
                if X_copy[col].isnull().any() and col in self.means:
                    X_copy[col] = X_copy[col].fillna(self.means[col])
        return X_copy
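
To make the two imputation strategies concrete, here is a toy illustration of what the transformer does to a single MarkDown column and a single numerical column. The values are made up. Note that a plain forward fill works in row order, so on the full frame it can carry a value from one store's rows into the next; grouping by store first would avoid that, at the cost of a change to the class above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "MarkDown1": [np.nan, 5.0, np.nan],
    "CPI": [np.nan, 211.0, np.nan],
})

# Promotion columns: keep the "was missing" signal, then fill with 0.
toy["MarkDown1_was_missing"] = toy["MarkDown1"].isna().astype(int)
toy["MarkDown1"] = toy["MarkDown1"].fillna(0)

# Slowly varying indicators: carry the last value forward,
# then backfill whatever is still missing at the head.
toy["CPI"] = toy["CPI"].ffill().bfill()

print(toy)
```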

In [9]:
class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    def __init__(self, date_column='Date', keep_date=True):
        self.date_column = date_column
        self.keep_date = keep_date

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_copy = X.copy()
        if self.date_column not in X_copy.columns:
            raise ValueError(f"Date column '{self.date_column}' not found in DataFrame.")

        X_copy[self.date_column] = pd.to_datetime(X_copy[self.date_column])

        # Create time features
        X_copy['Year'] = X_copy[self.date_column].dt.year
        X_copy['Month'] = X_copy[self.date_column].dt.month
        X_copy['Month_sin'] = np.sin(2 * np.pi * X_copy['Month'] / 12)
        X_copy['Month_cos'] = np.cos(2 * np.pi * X_copy['Month'] / 12)
        X_copy['Week'] = X_copy[self.date_column].dt.isocalendar().week.astype(int)
        X_copy['Day'] = X_copy[self.date_column].dt.day
        X_copy['DayOfWeek'] = X_copy[self.date_column].dt.dayofweek

        # Convert boolean to int
        if 'IsHoliday' in X_copy.columns and X_copy['IsHoliday'].dtype == bool:
            X_copy['IsHoliday'] = X_copy['IsHoliday'].astype(int)

        # Drop Month, optionally keep Date
        columns_to_drop = ["Month"]
        if not self.keep_date:
            columns_to_drop.append(self.date_column)

        return X_copy.drop(columns=columns_to_drop)
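
The sine/cosine pair is what lets the raw 'Month' column be dropped: on the unit circle, December and January are neighbours, whereas as plain integers they are 11 apart. A small standard-library sketch of that property (the helper names are hypothetical):

```python
import math


def month_to_cyclic(month):
    """Map a month number (1-12) to a point on the unit circle."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def cyclic_distance(a, b):
    """Euclidean distance between two months in (sin, cos) space."""
    return math.dist(month_to_cyclic(a), month_to_cyclic(b))


# Adjacent months are equally far apart, including across the year boundary.
print(cyclic_distance(12, 1), cyclic_distance(1, 2))
# Months half a year apart sit on opposite sides of the circle.
print(cyclic_distance(6, 12))
```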

In [10]:
class PatchTSTPreprocessor(BaseEstimator, TransformerMixin):
    """Preprocessor for PatchTST model that handles encoding and scaling"""
    def __init__(self):
        self.scalers = {}
        self.categorical_encoders = {}

    def fit(self, X, y=None):
        # Fit scalers for numerical columns
        numerical_cols = X.select_dtypes(include=[np.number]).columns.tolist()
        # Remove the target and the group identifiers so they are not scaled
        for col in ('Weekly_Sales', 'Store', 'Dept'):
            if col in numerical_cols:
                numerical_cols.remove(col)

        for col in numerical_cols:
            if col in X.columns:
                self.scalers[col] = StandardScaler()
                self.scalers[col].fit(X[[col]])

        # Fit encoders for categorical columns
        categorical_cols = ['Type']  # Store and Dept will be handled as group identifiers
        for col in categorical_cols:
            if col in X.columns:
                unique_values = X[col].unique()
                self.categorical_encoders[col] = {val: idx for idx, val in enumerate(unique_values)}

        return self

    def transform(self, X):
        X_copy = X.copy()

        # Scale numerical features
        for col, scaler in self.scalers.items():
            if col in X_copy.columns:
                X_copy[col] = scaler.transform(X_copy[[col]]).flatten()

        # Encode categorical features
        for col, encoder in self.categorical_encoders.items():
            if col in X_copy.columns:
                X_copy[col] = X_copy[col].map(encoder).fillna(-1)

        # Create group identifier
        X_copy['group_id'] = X_copy['Store'].astype(str) + '_' + X_copy['Dept'].astype(str)

        return X_copy

In [11]:
class PatchTSTModel(nn.Module):
    def __init__(self, seq_len, pred_len, patch_len=16, stride=8, d_model=128,
                 n_heads=8, n_layers=3, d_ff=256, dropout=0.1, num_features=1):
        super().__init__()

        self.seq_len = seq_len
        self.pred_len = pred_len
        self.patch_len = patch_len
        self.stride = stride
        self.d_model = d_model
        self.num_features = num_features

        # Calculate number of patches
        self.n_patches = max(1, (seq_len - patch_len) // stride + 1)

        # Input projection
        self.patch_embedding = nn.Linear(patch_len * num_features, d_model)

        # Positional encoding
        self.pos_encoding = nn.Parameter(torch.randn(1, self.n_patches, d_model) * 0.1)

        # Transformer layers
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=n_heads,
            dim_feedforward=d_ff,
            dropout=dropout,
            batch_first=True,
            norm_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)

        # Output projection
        self.head = nn.Sequential(
            nn.Linear(d_model * self.n_patches, d_ff),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, pred_len)
        )

        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            nn.init.xavier_uniform_(module.weight)
            if module.bias is not None:
                nn.init.constant_(module.bias, 0)

    def create_patches(self, x):
        """Create patches from time series data"""
        batch_size, seq_len, num_features = x.shape

        # If sequence is shorter than patch_len, pad it
        if seq_len < self.patch_len:
            padding = self.patch_len - seq_len
            x = F.pad(x, (0, 0, 0, padding), mode='replicate')
            seq_len = self.patch_len

        patches = []
        for i in range(0, seq_len - self.patch_len + 1, self.stride):
            patch = x[:, i:i+self.patch_len, :]
            patches.append(patch)

        # Ensure we have at least one patch
        if len(patches) == 0:
            patches = [x[:, :self.patch_len, :]]

        # Stack patches
        patches = torch.stack(patches, dim=1)  # [batch_size, n_patches, patch_len, num_features]

        # Reshape for embedding
        patches = patches.reshape(batch_size, patches.size(1), -1)  # [batch_size, n_patches, patch_len * num_features]

        return patches

    def forward(self, x):
        # Create patches
        patches = self.create_patches(x)

        # Embed patches
        embedded = self.patch_embedding(patches)

        # Add positional encoding
        embedded = embedded + self.pos_encoding[:, :embedded.size(1), :]

        # Apply transformer
        encoded = self.transformer(embedded)

        # Flatten and project to output
        flattened = encoded.reshape(encoded.size(0), -1)
        output = self.head(flattened)

        return output
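
The patch arithmetic in the model is easy to check by hand. The sketch below restates the formula from `__init__` and the loop bounds from `create_patches` as two plain-Python helpers (hypothetical names), so the patch count for a given configuration can be inspected without building the network.

```python
def count_patches(seq_len, patch_len, stride):
    """Number of patches a window of `seq_len` steps yields.

    Mirrors the formula in PatchTSTModel: windows shorter than one patch
    are padded up to `patch_len`, so at least one patch is produced.
    """
    return max(1, (seq_len - patch_len) // stride + 1)


def patch_starts(seq_len, patch_len, stride):
    """Start offsets of each patch after optional padding to `patch_len`."""
    effective_len = max(seq_len, patch_len)
    return list(range(0, effective_len - patch_len + 1, stride))


print(count_patches(52, 16, 8), patch_starts(52, 16, 8))
print(count_patches(8, 16, 8), patch_starts(8, 16, 8))
```

A window no longer than one patch always collapses to a single patch starting at offset 0, in which case the transformer attends over a sequence of length one.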

In [36]:
# class PatchTSTLightningModule(pl.LightningModule):
#     def __init__(self, seq_len, pred_len, patch_len=16, stride=8, d_model=128,
#                  n_heads=8, n_layers=3, d_ff=256, dropout=0.1, num_features=1,
#                  learning_rate=1e-3, weight_decay=1e-4):
#         super().__init__()
#         self.save_hyperparameters()

#         self.model = PatchTSTModel(
#             seq_len=seq_len,
#             pred_len=pred_len,
#             patch_len=patch_len,
#             stride=stride,
#             d_model=d_model,
#             n_heads=n_heads,
#             n_layers=n_layers,
#             d_ff=d_ff,
#             dropout=dropout,
#             num_features=num_features
#         )

#         self.learning_rate = learning_rate
#         self.weight_decay = weight_decay

#     def forward(self, x):
#         return self.model(x)

#     def training_step(self, batch, batch_idx):
#         x, y = batch
#         y_hat = self(x)

#         # Ensure shapes match
#         if y_hat.shape != y.shape:
#             y_hat = y_hat[:, :y.shape[1]]

#         loss = F.mse_loss(y_hat, y)

#         # Check for NaN/inf
#         if torch.isnan(loss) or torch.isinf(loss):
#             return None

#         self.log('train_loss', loss, on_step=True, on_epoch=True, prog_bar=True)
#         return loss

#     def validation_step(self, batch, batch_idx):
#         x, y = batch
#         y_hat = self(x)

#         # Ensure shapes match
#         if y_hat.shape != y.shape:
#             y_hat = y_hat[:, :y.shape[1]]

#         loss = F.mse_loss(y_hat, y)
#         mae = F.l1_loss(y_hat, y)

#         # Check for NaN/inf
#         if torch.isnan(loss) or torch.isinf(loss):
#             return None

#         self.log('val_loss', loss, on_step=False, on_epoch=True, prog_bar=True)
#         self.log('val_mae', mae, on_step=False, on_epoch=True, prog_bar=True)
#         return loss

#     def configure_optimizers(self):
#         optimizer = torch.optim.AdamW(
#             self.parameters(),
#             lr=self.learning_rate,
#             weight_decay=self.weight_decay
#         )

#         scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
#             optimizer,
#             mode='min',
#             factor=0.5,
#             patience=5,
#             verbose=True
#         )

#         return {
#             'optimizer': optimizer,
#             'lr_scheduler': {
#                 'scheduler': scheduler,
#                 'monitor': 'val_loss',
#                 'frequency': 1
#             }
#         }


class PatchTSTLightningModule(pl.LightningModule):
    def __init__(self, seq_len, pred_len, patch_len=16, stride=8, d_model=128,
                 n_heads=8, n_layers=3, d_ff=256, dropout=0.1, num_features=1,
                 learning_rate=1e-3, weight_decay=1e-4):
        super().__init__()
        self.save_hyperparameters()

        self.model = PatchTSTModel(
            seq_len=seq_len,
            pred_len=pred_len,
            patch_len=patch_len,
            stride=stride,
            d_model=d_model,
            n_heads=n_heads,
            n_layers=n_layers,
            d_ff=d_ff,
            dropout=dropout,
            num_features=num_features
        )

        self.learning_rate = learning_rate
        self.weight_decay = weight_decay

    def configure_optimizers(self):
        optimizer = torch.optim.AdamW(
            self.parameters(),
            lr=self.learning_rate,
            weight_decay=self.weight_decay
        )
        return optimizer

    def training_step(self, batch, batch_idx):
        x, y = batch

        # Add checks for NaN/Inf
        if torch.isnan(x).any() or torch.isinf(x).any():
            print(f"NaN/Inf detected in input at batch {batch_idx}")
            return None

        y_hat = self(x)

        # Add checks for NaN/Inf in output
        if torch.isnan(y_hat).any() or torch.isinf(y_hat).any():
            print(f"NaN/Inf detected in output at batch {batch_idx}")
            return None

        loss = nn.MSELoss()(y_hat, y)

        # Add checks for NaN/Inf in loss
        if torch.isnan(loss) or torch.isinf(loss):
            print(f"NaN/Inf detected in loss at batch {batch_idx}")
            return None

        self.log('train_loss', loss, prog_bar=True)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch

        # Same checks as training
        if torch.isnan(x).any() or torch.isinf(x).any():
            return None

        y_hat = self(x)

        if torch.isnan(y_hat).any() or torch.isinf(y_hat).any():
            return None

        loss = nn.MSELoss()(y_hat, y)

        if torch.isnan(loss) or torch.isinf(loss):
            return None

        self.log('val_loss', loss, prog_bar=True)
        return loss

    def forward(self, x):
        return self.model(x)


In [13]:
class WalmartTSDataset(Dataset):
    def __init__(self, df, seq_len, pred_len, feature_cols, target_col='Weekly_Sales'):
        self.df = df.copy()
        self.seq_len = seq_len
        self.pred_len = pred_len
        self.feature_cols = feature_cols
        self.target_col = target_col

        # Ensure data is sorted
        self.df = self.df.sort_values(['group_id', 'Date']).reset_index(drop=True)

        # Create samples
        self.samples = self._create_samples()

        # Initialize scalers
        self.feature_scaler = StandardScaler()
        self.target_scaler = StandardScaler()

        # Fit scalers and transform data
        self._fit_scalers()

        print(f"Dataset created with {len(self.samples)} samples")

    def _create_samples(self):
        """Create valid samples from the dataset"""
        samples = []

        for group_id in self.df['group_id'].unique():
            group_data = self.df[self.df['group_id'] == group_id].copy()
            # Reset index to ensure continuous indexing within group
            group_data = group_data.reset_index(drop=True)

            # Skip if not enough data
            if len(group_data) < self.seq_len + self.pred_len:
                continue

            # Create sliding window samples using position-based indexing
            for i in range(len(group_data) - self.seq_len - self.pred_len + 1):
                sample = {
                    'group_id': group_id,
                    'start_idx': i,
                    'seq_end_idx': i + self.seq_len,
                    'pred_end_idx': i + self.seq_len + self.pred_len,
                    'group_data': group_data  # Store the group data for this sample
                }
                samples.append(sample)

        return samples

    def _fit_scalers(self):
        """Fit scalers on all data"""
        if len(self.samples) == 0:
            raise ValueError("No samples created. Check your data and parameters.")

        # Get all feature and target data
        all_features = []
        all_targets = []

        for sample in self.samples:
            group_data = sample['group_data']

            # Get sequence data (input features)
            seq_data = group_data.iloc[sample['start_idx']:sample['seq_end_idx']]
            # Get prediction data (target values)
            pred_data = group_data.iloc[sample['seq_end_idx']:sample['pred_end_idx']]

            # Features from sequence
            features = seq_data[self.feature_cols].values
            all_features.append(features)

            # Target from prediction period
            target = pred_data[self.target_col].values
            all_targets.append(target)

        # Combine and fit scalers
        all_features = np.concatenate(all_features, axis=0)
        all_targets = np.concatenate(all_targets, axis=0).reshape(-1, 1)

        # Handle NaN values
        all_features = np.nan_to_num(all_features, nan=0.0, posinf=0.0, neginf=0.0)
        all_targets = np.nan_to_num(all_targets, nan=0.0, posinf=0.0, neginf=0.0)

        self.feature_scaler.fit(all_features)
        self.target_scaler.fit(all_targets)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        sample = self.samples[idx]
        group_data = sample['group_data']

        # Get sequence data (input features)
        seq_data = group_data.iloc[sample['start_idx']:sample['seq_end_idx']]
        # Get prediction data (target values)
        pred_data = group_data.iloc[sample['seq_end_idx']:sample['pred_end_idx']]

        # Extract features and target
        features = seq_data[self.feature_cols].values
        target = pred_data[self.target_col].values

        # Handle NaN values
        features = np.nan_to_num(features, nan=0.0, posinf=0.0, neginf=0.0)
        target = np.nan_to_num(target, nan=0.0, posinf=0.0, neginf=0.0)

        # Scale data
        features_scaled = self.feature_scaler.transform(features)
        target_scaled = self.target_scaler.transform(target.reshape(-1, 1)).flatten()

        # Convert to tensors
        X = torch.FloatTensor(features_scaled)
        y = torch.FloatTensor(target_scaled)

        return X, y
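
The sliding-window logic in `_create_samples` determines how many training samples each Store/Dept series contributes. A minimal sketch of the same index arithmetic, with a hypothetical helper name and a hypothetical series length of 143 rows:

```python
def window_bounds(n_rows, seq_len, pred_len):
    """Return (start, seq_end, pred_end) index triples for one group.

    Inputs cover rows [start, seq_end) and targets cover [seq_end, pred_end),
    matching the iloc slices used in the dataset class.
    """
    last_start = n_rows - seq_len - pred_len
    return [
        (i, i + seq_len, i + seq_len + pred_len)
        for i in range(last_start + 1)
    ]


bounds = window_bounds(143, seq_len=8, pred_len=4)
print(len(bounds), bounds[0], bounds[-1])
```

A group with fewer than `seq_len + pred_len` rows yields an empty list, which corresponds to the `continue` branch in the class.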

In [29]:
preprocessing_pipeline = Pipeline([
    ('imputer', MissingValueImputer()),
    ('date_extractor', DateFeatureExtractor(keep_date=True)),
    ('patchtst_preprocessor', PatchTSTPreprocessor())
])

# Apply preprocessing
train_processed = preprocessing_pipeline.fit_transform(train_df)
test_processed = preprocessing_pipeline.transform(test_df)

print("Train processed shape:", train_processed.shape)
print("Test processed shape:", test_processed.shape)

# Define parameters
seq_len = 8   # 8 weeks of history per input window
pred_len = 4  # 4 weeks to predict per window
validation_cutoff_date = pd.to_datetime('2012-02-01')

# Split data temporally
train_temporal = train_processed[train_processed['Date'] < validation_cutoff_date].copy()
val_temporal = train_processed[train_processed['Date'] >= validation_cutoff_date].copy()

print(f"Training temporal split: {train_temporal['Date'].min()} to {train_temporal['Date'].max()}")
print(f"Validation temporal split: {val_temporal['Date'].min()} to {val_temporal['Date'].max()}")

min_samples_per_group = seq_len + pred_len
train_group_sizes = train_temporal.groupby('group_id').size()
val_group_sizes = val_temporal.groupby('group_id').size()

valid_groups = train_group_sizes[train_group_sizes >= min_samples_per_group].index
valid_groups = valid_groups.intersection(val_group_sizes.index)

train_filtered = train_temporal[train_temporal['group_id'].isin(valid_groups)].copy()
val_filtered = val_temporal[val_temporal['group_id'].isin(valid_groups)].copy()

print(f"Training samples after filtering: {len(train_filtered)}")
print(f"Validation samples after filtering: {len(val_filtered)}")

feature_cols = [
    'Temperature', 'Fuel_Price', 'CPI', 'Unemployment',
    'Month_sin', 'Month_cos', 'Week', 'Day', 'Year', 'IsHoliday', 'DayOfWeek',
    'Size', 'Type'
] + [f'MarkDown{i}' for i in range(1, 6)] + [f'MarkDown{i}_was_missing' for i in range(1, 6)]

# Filter existing columns
feature_cols = [col for col in feature_cols if col in train_filtered.columns]

# Create datasets
train_dataset = WalmartTSDataset(train_filtered, seq_len, pred_len, feature_cols)
val_dataset = WalmartTSDataset(val_filtered, seq_len, pred_len, feature_cols)

print(f"Training dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")

Train processed shape: (421570, 28)
Test processed shape: (115064, 27)
Training temporal split: 2010-02-05 00:00:00 to 2012-01-27 00:00:00
Validation temporal split: 2012-02-03 00:00:00 to 2012-10-26 00:00:00
Training samples after filtering: 304081
Validation samples after filtering: 114789
Dataset created with 270267 samples
Dataset created with 81477 samples
Training dataset size: 270267
Validation dataset size: 81477
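
Each `WalmartTSDataset` instance fits its own `StandardScaler` on 'Weekly_Sales', so losses are measured in standardized units, and the training and validation sets use different means and scales. To read a prediction as a sales figure it has to be mapped back with the statistics of the dataset that produced it. The plain-Python sketch below shows that arithmetic; the helper names are hypothetical, and it uses the population standard deviation, which is what `StandardScaler` computes.

```python
import statistics


def fit_standardizer(values):
    """Return (mean, std) using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # constant series: fall back to scale 1
    return mean, std


def scale(values, mean, std):
    return [(v - mean) / std for v in values]


def unscale(values, mean, std):
    return [v * std + mean for v in values]


sales = [10.0, 20.0, 30.0]          # toy values
mean, std = fit_standardizer(sales)
scaled = scale(sales, mean, std)
restored = unscale(scaled, mean, std)
print(mean, scaled, restored)
```

With scikit-learn the same round trip is `target_scaler.inverse_transform(...)` on a column-shaped array.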


In [18]:
!pip install optuna-integration[pytorch_lightning]

Collecting optuna-integration[pytorch_lightning]
  Downloading optuna_integration-4.4.0-py3-none-any.whl.metadata (12 kB)
Collecting lightning (from optuna-integration[pytorch_lightning])
  Downloading lightning-2.5.2-py3-none-any.whl.metadata (38 kB)
Downloading lightning-2.5.2-py3-none-any.whl (821 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 821.1/821.1 kB 51.6 MB/s eta 0:00:00
Downloading optuna_integration-4.4.0-py3-none-any.whl (98 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.9/98.9 kB 10.3 MB/s eta 0:00:00
Installing collected packages: optuna-integration, lightning
Successfully installed lightning-2.5.2 optuna-integration-4.4.0


In [32]:
from optuna.integration import PyTorchLightningPruningCallback
import torch
import torch.nn as nn # Assuming nn is needed for MSELoss
from torch.utils.data import DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks.early_stopping import EarlyStopping

# Assuming PatchTSTModel and your dataset classes (train_dataset, val_dataset)
# as well as seq_len, pred_len, feature_cols are defined elsewhere in your notebook.

def objective(trial):
    """Optuna objective function"""

    # Sample hyperparameters with better ranges
    patch_len = trial.suggest_categorical('patch_len', [8, 16, 32])
    stride = trial.suggest_categorical('stride', [4, 8, 16])
    d_model = trial.suggest_categorical('d_model', [64, 128, 256])
    n_heads = trial.suggest_categorical('n_heads', [4, 8, 16])
    n_layers = trial.suggest_int('n_layers', 2, 6)
    d_ff = trial.suggest_categorical('d_ff', [256, 512, 1024])
    dropout = trial.suggest_float('dropout', 0.1, 0.3)
    learning_rate = trial.suggest_float('learning_rate', 1e-4, 1e-2, log=True)
    weight_decay = trial.suggest_float('weight_decay', 1e-6, 1e-3, log=True)
    batch_size = trial.suggest_categorical('batch_size', [16, 32, 64])

    try:
        # Create data loaders
        train_loader = DataLoader(
            train_dataset,
            batch_size=batch_size,
            shuffle=True,
            num_workers=0,  # load batches in the main process
        )

        val_loader = DataLoader(
            val_dataset,
            batch_size=batch_size,
            shuffle=False,
            num_workers=0,  # load batches in the main process
        )

        # Check if we have data
        if len(train_loader) == 0 or len(val_loader) == 0:
            raise ValueError("Empty data loaders")

        # Create model
        model = PatchTSTLightningModule(
            seq_len=seq_len,
            pred_len=pred_len,
            patch_len=patch_len,
            stride=stride,
            d_model=d_model,
            n_heads=n_heads,
            n_layers=n_layers,
            d_ff=d_ff,
            dropout=dropout,
            num_features=len(feature_cols),
            learning_rate=learning_rate,
            weight_decay=weight_decay
        )

        # Callbacks
        pruning_callback = PyTorchLightningPruningCallback(trial, monitor='val_loss')
        early_stopping = EarlyStopping(
            monitor='val_loss',
            patience=10,
            verbose=True,
            mode='min'
        )

        # Trainer
        trainer = pl.Trainer(
            max_epochs=50,
            callbacks=[pruning_callback, early_stopping],
            logger=False,
            enable_checkpointing=False,
            enable_progress_bar=False,
            accelerator='gpu' if torch.cuda.is_available() else 'cpu',
            devices=1
        )

        # Train
        trainer.fit(model, train_loader, val_loader)

        # Return validation loss
        return trainer.callback_metrics['val_loss'].item()

    except optuna.TrialPruned:
        # Let Optuna record the trial as pruned instead of as a failure
        raise
    except Exception as e:
        print(f"Trial failed with error: {e}")
        return float('inf')
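
Before spending compute on the search, it is worth checking what the search space means for the window length used in this notebook. The sketch below copies the candidate values from the objective and applies two checks: attention heads must divide `d_model` evenly, and the patch count follows the formula in `PatchTSTModel`. It is an illustrative check, not part of the training code.

```python
from itertools import product

seq_len = 8                 # value set earlier in this notebook
patch_lens = [8, 16, 32]    # candidate values from the objective
strides = [4, 8, 16]
d_models = [64, 128, 256]
head_counts = [4, 8, 16]

# Multi-head attention needs d_model to split evenly across heads.
bad_heads = [(d, h) for d, h in product(d_models, head_counts) if d % h != 0]

# Patches per window for every (patch_len, stride) pair, using the model's formula.
patches = {
    (p, s): max(1, (seq_len - p) // s + 1)
    for p, s in product(patch_lens, strides)
}
print(bad_heads, set(patches.values()))
```

Every head count divides every `d_model`, so no combination is invalid. However, with `seq_len = 8` every (patch_len, stride) pair yields exactly one patch, so these two hyperparameters only change how much padding is applied. A longer input window would be needed for them to influence the patching.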

In [None]:
# %pip install -q dagshub

In [16]:
#!pip install mlflow==2.7.1

Collecting pandas<3 (from mlflow==2.7.1)
  Using cached pandas-2.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (91 kB)
Using cached pandas-2.3.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Traceback (most recent call last):
  File "/usr/lib/python3.11/pathlib.py", line 540, in __str__
    return self._str
           ^^^^^^^^^
AttributeError: 'PosixPath' object has no attribute '_str'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/cli/base_command.py", line 179, in exc_logging_wrapper
    status = run_func(*args)
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/cli/req_command.py", line 67, in wrapper
    return func(self, options, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/pip/_internal/commands/install.py", line 447, in run


In [20]:

import dagshub
# Point MLflow tracking at the DagsHub repo; prompts for OAuth authorization if no token is cached
dagshub.init(
    repo_owner='abarb22',
    repo_name='Walmart-Recruiting---Store-Sales-Forecasting',
    mlflow=True
)






Open the following link in your browser to authorize the client:
https://dagshub.com/login/oauth/authorize?state=328ae7dd-8c55-4193-93e8-bcc1c6b63b9a&client_id=32b60ba385aa7cecf24046d8195a71c07dd345d9657977863b52e7748e0f0f28&middleman_request_id=cb54fa4c8fcf081c06358413a95900798e15feef5b5ae3ffd7a31ab4b90f15b4
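
The authorization link above is shown because no DagsHub credentials were available to the session. A minimal sketch of a non-interactive alternative is to set MLflow's own tracking variables before starting a run. The helper name `configure_dagshub_mlflow_env`, the `username` argument, the placeholder token, and the `https://dagshub.com/<owner>/<repo>.mlflow` URL pattern are assumptions for illustration; check them against your DagsHub repo's MLflow settings.

```python
import os
from typing import MutableMapping


def configure_dagshub_mlflow_env(
    repo_owner: str,
    repo_name: str,
    username: str,
    token: str,
    env: MutableMapping[str, str] = os.environ,
) -> str:
    """Set MLflow's tracking variables for a DagsHub repo and return the URI."""
    # Assumed URL pattern for a DagsHub-hosted MLflow server.
    tracking_uri = f"https://dagshub.com/{repo_owner}/{repo_name}.mlflow"
    env["MLFLOW_TRACKING_URI"] = tracking_uri
    env["MLFLOW_TRACKING_USERNAME"] = username
    env["MLFLOW_TRACKING_PASSWORD"] = token  # access token; never commit it
    return tracking_uri


# Demonstrated on a plain dict so no real environment variable is overwritten.
demo_env = {}
uri = configure_dagshub_mlflow_env(
    repo_owner="abarb22",
    repo_name="Walmart-Recruiting---Store-Sales-Forecasting",
    username="abarb22",            # hypothetical: assumes the owner is also the user
    token="<your-dagshub-token>",  # placeholder, not a real token
    env=demo_env,
)
print(uri)
```

In Colab the token could be read from the Drive folder that already holds the Kaggle credentials, so it never appears in the notebook itself.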




In [33]:
print("Starting hyperparameter optimization...")
study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=20, timeout=3600)

print(f"Best trial: {study.best_trial.number}")
print(f"Best value: {study.best_value:.6f}")
print("Best params:")

for key, value in study.best_params.items():
    print(f"  {key}: {value}")


[I 2025-07-07 13:12:15,402] A new study created in memory with name: no-name-38e7c771-af10-4ee8-9a39-79412be2f64a
[I 2025-07-07 13:12:15,453] Trial 0 finished with value: inf and parameters: {'patch_len': 16, 'stride': 16, 'd_model': 64, 'n_heads': 16, 'n_layers': 2, 'd_ff': 1024, 'dropout': 0.10861796211265512, 'learning_rate': 0.0001866462047343469, 'weight_decay': 4.356911186145703e-05, 'batch_size': 16}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:15,499] Trial 1 finished with value: inf and parameters: {'patch_len': 8, 'stride': 4, 'd_model': 64, 'n_heads': 4, 'n_layers': 4, 'd_ff': 256, 'dropout': 0.16674551931777387, 'learning_rate': 0.00016110882498651525, 'weight_decay': 7.617696335827932e-06, 'batch_size': 16}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:15,559] Trial 2 finished with value: inf and parameters: {'patch_len': 32, 'stride': 16, 'd_model': 256, 'n_heads': 16, 'n_layers': 5, 'd_ff': 256, 'dropout': 0.2992445044594322, 'learning_rate': 0.000118378

Starting hyperparameter optimization...
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent


[I 2025-07-07 13:12:15,618] Trial 3 finished with value: inf and parameters: {'patch_len': 8, 'stride': 8, 'd_model': 128, 'n_heads': 4, 'n_layers': 6, 'd_ff': 256, 'dropout': 0.18845133906535877, 'learning_rate': 0.0001583509136529113, 'weight_decay': 1.442467141270708e-06, 'batch_size': 32}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:15,686] Trial 4 finished with value: inf and parameters: {'patch_len': 32, 'stride': 8, 'd_model': 256, 'n_heads': 4, 'n_layers': 4, 'd_ff': 512, 'dropout': 0.20095593471160472, 'learning_rate': 0.00043916144215369883, 'weight_decay': 3.199946752835361e-05, 'batch_size': 32}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:15,738] Trial 5 finished with value: inf and parameters: {'patch_len': 32, 'stride': 4, 'd_model': 128, 'n_heads': 4, 'n_layers': 3, 'd_ff': 512, 'dropout': 0.11847503284506125, 'learning_rate': 0.000670300563631743, 'weight_decay': 1.4380039237198848e-06, 'batch_size': 64}. Best is trial 0 with value: inf.
[I 2025-07-07

Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent


[I 2025-07-07 13:12:15,843] Trial 7 finished with value: inf and parameters: {'patch_len': 32, 'stride': 16, 'd_model': 128, 'n_heads': 4, 'n_layers': 6, 'd_ff': 512, 'dropout': 0.1496286077305124, 'learning_rate': 0.008841642102633723, 'weight_decay': 5.1134147288885495e-06, 'batch_size': 32}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:15,932] Trial 8 finished with value: inf and parameters: {'patch_len': 16, 'stride': 8, 'd_model': 256, 'n_heads': 16, 'n_layers': 6, 'd_ff': 1024, 'dropout': 0.22063345932337736, 'learning_rate': 0.00011807893437164302, 'weight_decay': 1.6052035945524025e-06, 'batch_size': 16}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:15,992] Trial 9 finished with value: inf and parameters: {'patch_len': 16, 'stride': 16, 'd_model': 256, 'n_heads': 16, 'n_layers': 6, 'd_ff': 256, 'dropout': 0.13646087032170218, 'learning_rate': 0.005265119957909737, 'weight_decay': 3.5726300199939233e-06, 'batch_size': 32}. Best is trial 0 with value: inf.


Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent


[I 2025-07-07 13:12:16,059] Trial 10 finished with value: inf and parameters: {'patch_len': 16, 'stride': 16, 'd_model': 64, 'n_heads': 8, 'n_layers': 2, 'd_ff': 1024, 'dropout': 0.2661305971893949, 'learning_rate': 0.0023703692868288348, 'weight_decay': 0.0008039073484434579, 'batch_size': 64}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:16,128] Trial 11 finished with value: inf and parameters: {'patch_len': 8, 'stride': 4, 'd_model': 64, 'n_heads': 16, 'n_layers': 4, 'd_ff': 1024, 'dropout': 0.16161067875300722, 'learning_rate': 0.0003042571656490628, 'weight_decay': 3.2589271099699285e-05, 'batch_size': 16}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:16,193] Trial 12 finished with value: inf and parameters: {'patch_len': 8, 'stride': 4, 'd_model': 64, 'n_heads': 16, 'n_layers': 3, 'd_ff': 1024, 'dropout': 0.11029913842952208, 'learning_rate': 0.0002801058754879683, 'weight_decay': 8.171612236250178e-05, 'batch_size': 16}. Best is trial 0 with value: inf.


Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent


[I 2025-07-07 13:12:16,260] Trial 13 finished with value: inf and parameters: {'patch_len': 8, 'stride': 4, 'd_model': 64, 'n_heads': 4, 'n_layers': 3, 'd_ff': 256, 'dropout': 0.17195404382117363, 'learning_rate': 0.00157521410665243, 'weight_decay': 1.266301905809342e-05, 'batch_size': 16}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:16,341] Trial 14 finished with value: inf and parameters: {'patch_len': 16, 'stride': 16, 'd_model': 64, 'n_heads': 8, 'n_layers': 5, 'd_ff': 1024, 'dropout': 0.1006776996801308, 'learning_rate': 0.00021824218851070222, 'weight_decay': 0.00010397275851485883, 'batch_size': 16}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:16,400] Trial 15 finished with value: inf and parameters: {'patch_len': 8, 'stride': 16, 'd_model': 64, 'n_heads': 4, 'n_layers': 2, 'd_ff': 256, 'dropout': 0.130495964756822, 'learning_rate': 0.00042289273821251994, 'weight_decay': 1.5549926934725438e-05, 'batch_size': 16}. Best is trial 0 with value: inf.
[I 2025-07-07

Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Trial failed with error: Expected a parent


[I 2025-07-07 13:12:16,599] Trial 18 finished with value: inf and parameters: {'patch_len': 16, 'stride': 16, 'd_model': 64, 'n_heads': 16, 'n_layers': 4, 'd_ff': 1024, 'dropout': 0.21905224319526623, 'learning_rate': 0.00010695940641407221, 'weight_decay': 0.0003256131412391981, 'batch_size': 16}. Best is trial 0 with value: inf.
[I 2025-07-07 13:12:16,661] Trial 19 finished with value: inf and parameters: {'patch_len': 16, 'stride': 8, 'd_model': 128, 'n_heads': 8, 'n_layers': 2, 'd_ff': 256, 'dropout': 0.24821442509156091, 'learning_rate': 0.0004107710535187776, 'weight_decay': 3.929026119662057e-05, 'batch_size': 16}. Best is trial 0 with value: inf.


Trial failed with error: Expected a parent
Trial failed with error: Expected a parent
Best trial: 0
Best value: inf
Best params:
  patch_len: 16
  stride: 16
  d_model: 64
  n_heads: 16
  n_layers: 2
  d_ff: 1024
  dropout: 0.10861796211265512
  learning_rate: 0.0001866462047343469
  weight_decay: 4.356911186145703e-05
  batch_size: 16
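
Every trial in the log above finished with value `inf`, so "Best trial: 0" is only the first sampled configuration, not a tuned one. A minimal sketch of a guard to run before training the final model is shown here. The helper names `count_finite` and `require_successful_trial` are hypothetical, and the list of values is reproduced by hand from the log rather than read from the study.

```python
import math
from typing import Iterable, Optional


def count_finite(values: Iterable[Optional[float]]) -> int:
    """Count objective values that are real numbers (not None, inf, or NaN)."""
    return sum(1 for v in values if v is not None and math.isfinite(v))


def require_successful_trial(values: Iterable[Optional[float]]) -> None:
    """Raise if no trial produced a usable objective value."""
    values = list(values)
    if count_finite(values) == 0:
        raise RuntimeError(
            f"All {len(values)} trials failed; fix the objective before "
            "training a final model."
        )


# With the study above, the values would come from the trials, e.g.:
#   require_successful_trial(t.value for t in study.trials)
# Here the 20 `inf` results from the log are reproduced by hand.
logged_values = [float("inf")] * 20
try:
    require_successful_trial(logged_values)
except RuntimeError as err:
    print(err)
```

Failing loudly at this point avoids spending a full training run on parameters that were never evaluated.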


In [None]:
print("\n=== TRAINING FINAL MODEL ===")
best_params = study.best_params

# Start MLflow run
with mlflow.start_run(run_name="PatchTST_Walmart_Best_Model"):

    # Log parameters
    mlflow.log_param("seq_len", seq_len)
    mlflow.log_param("pred_len", pred_len)
    mlflow.log_param("validation_cutoff_date", str(validation_cutoff_date))
    mlflow.log_param("training_samples", len(train_filtered))
    mlflow.log_param("validation_samples", len(val_filtered))

    for key, value in best_params.items():
        mlflow.log_param(key, value)

    # Create final dataloaders
    final_batch_size = best_params["batch_size"]
    train_loader = DataLoader(train_dataset, batch_size=final_batch_size, shuffle=True, num_workers=0)
    val_loader = DataLoader(val_dataset, batch_size=final_batch_size, shuffle=False, num_workers=0)

    # Create final model
    final_model = PatchTSTLightningModule(
        seq_len=seq_len,
        pred_len=pred_len,
        patch_len=best_params["patch_len"],
        stride=best_params["stride"],
        d_model=best_params["d_model"],
        n_heads=best_params["n_heads"],
        n_layers=best_params["n_layers"],
        d_ff=best_params["d_ff"],
        dropout=best_params["dropout"],
        num_features=len(feature_cols),
        learning_rate=best_params["learning_rate"],
        weight_decay=best_params["weight_decay"]
    )

    # Log model info
    total_params = sum(p.numel() for p in final_model.parameters())
    mlflow.log_param("total_parameters", total_params)
    mlflow.log_param("model_size_mb", total_params * 4 / 1024 / 1024)

    print(f"Number of parameters: {total_params/1e6:.2f}M")

    # Create trainer
    trainer = pl.Trainer(
        max_epochs=100,
        accelerator='gpu' if torch.cuda.is_available() else 'cpu',
        devices=1 if torch.cuda.is_available() else 'auto',
        callbacks=[
            EarlyStopping(monitor="val_loss", patience=15, mode="min"),
            LearningRateMonitor(logging_interval='epoch'),
            ModelCheckpoint(monitor="val_loss", mode="min", save_top_k=1),
        ],
        logger=TensorBoardLogger("lightning_logs"),
        enable_progress_bar=True,
    )

    # Train model
    print("Training final model...")
    trainer.fit(final_model, train_loader, val_loader)

    # Log training results (callback_metrics holds 0-dim tensors, so convert to plain floats)
    final_metrics = {
        name: float(trainer.callback_metrics.get(name, 0))
        for name in ("train_loss", "val_loss", "train_wmae", "val_wmae")
    }
    for name, value in final_metrics.items():
        mlflow.log_metric(f"final_{name}", value)

    print("Training completed!")
    print(f"Final validation loss: {final_metrics['val_loss']:.6f}")
    print(f"Final validation WMAE: {final_metrics['val_wmae']:.6f}")


=== TRAINING FINAL MODEL ===


INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO:pytorch_lightning.accelerators.cuda:LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:pytorch_lightning.callbacks.model_summary:
  | Name  | Type          | Params | Mode 
------------------------------------------------
0 | model | PatchTSTModel | 392 K  | train
------------------------------------------------
392 K     Trainable params
0         Non-trainable params
392 K     Total params
1.570     Total estimated model params size (MB)
29        Modules in train mode
0         Modules in eval mode
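
The summary reports 1.570 MB for 392 K parameters, while the cell above logs `model_size_mb` as `total_params * 4 / 1024 / 1024`. The two figures use different units: the summary value is consistent with dividing by 10^6 (megabytes), whereas dividing by 1024^2 gives mebibytes. A short worked example follows, assuming 4-byte float32 weights and a parameter count of 392,500, chosen only because it rounds to the 392 K and 1.570 MB shown above. The exact count is not in the output.

```python
BYTES_PER_FLOAT32 = 4


def size_mb(num_params: int) -> float:
    """Size in megabytes (10**6 bytes)."""
    return num_params * BYTES_PER_FLOAT32 / 1e6


def size_mib(num_params: int) -> float:
    """Size in mebibytes (1024**2 bytes), as computed for `model_size_mb`."""
    return num_params * BYTES_PER_FLOAT32 / 1024 / 1024


approx_params = 392_500  # assumed value, not taken from the run
print(f"MB : {size_mb(approx_params):.3f}")
print(f"MiB: {size_mib(approx_params):.3f}")
```

Under these assumptions the value logged to MLflow would be about 5% smaller than the figure in the Lightning summary, which is worth knowing when comparing the two.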


Number of parameters: 0.39M
Training final model...


Sanity Checking: |          | 0/? [00:00<?, ?it/s]

Training: |          | 0/? [00:00<?, ?it/s]

Validation: |          | 0/? [00:00<?, ?it/s]
