<a href="https://colab.research.google.com/github/azhgh22/Walmart-Recruiting-Store-Sales-Forecasting/blob/main/notebooks/xgboost_base_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Notebook Objective

The main focus of this notebook is to evaluate how tree-based models, particularly **XGBoost**, perform on our forecasting task.

We will apply several feature engineering techniques to enhance model input and then search for the best-performing XGBoost model through hyperparameter tuning.


## Notebook Setup

The following code handles the initial setup of the notebook: cloning the GitHub repository, loading the data, and splitting it for training and validation. This setup code is user-specific and may need to be adjusted depending on your environment.


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
from google.colab import userdata
token = userdata.get('GITHUB_TOKEN')
user_name = userdata.get('GITHUB_USERNAME')
mail = userdata.get('GITHUB_MAIL')

!git config --global user.name "{user_name}"
!git config --global user.email "{mail}"
!git clone https://{token}@github.com/azhgh22/Walmart-Recruiting-Store-Sales-Forecasting.git

%cd Walmart-Recruiting-Store-Sales-Forecasting

Cloning into 'Walmart-Recruiting-Store-Sales-Forecasting'...
remote: Enumerating objects: 437, done.
remote: Counting objects: 100% (33/33), done.
remote: Compressing objects: 100% (26/26), done.
remote: Total 437 (delta 17), reused 7 (delta 7), pack-reused 404 (from 1)
Receiving objects: 100% (437/437), 9.11 MiB | 16.65 MiB/s, done.
Resolving deltas: 100% (235/235), done.
/content/Walmart-Recruiting-Store-Sales-Forecasting


In [3]:
%%capture
from google.colab import userdata
!pip install -r requirements.txt
kaggle_json_path = userdata.get('KAGGLE_JSON_PATH')
! ./src/data_loader.sh -f {kaggle_json_path}

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [7]:
from src.config import *

stores = pd.read_csv(STORES_PATH)
features = pd.read_csv(FEATURES_PATH)
train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

In [8]:
# Attach store metadata, then the weekly features, to every training row.
merged_train = (
    pd.merge(train, stores, on='Store', how='left')
      .merge(features, on=['Store', 'Date', 'IsHoliday'], how='left')
)

In [9]:
from src.time_series_split import TimeSeriesSplit
from src.config import SPLIT_DATE
merged_train.Date = pd.to_datetime(merged_train.Date)
x_train, x_val = TimeSeriesSplit(SPLIT_DATE).split(merged_train)

y_train = x_train.pop('Weekly_Sales')
y_val = x_val.pop('Weekly_Sales')
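
The implementation of `TimeSeriesSplit` lives in `src/time_series_split` and is not shown in this notebook. As a rough illustration of a date-based holdout split, the sketch below assumes rows strictly before a cutoff go to training and the rest to validation. The helper name `split_by_date`, the toy data, and the cutoff date are all hypothetical and are not taken from the project.

```python
import pandas as pd

def split_by_date(df: pd.DataFrame, split_date: str, date_col: str = "Date"):
    """Return (train, valid): rows strictly before split_date, rows on/after it."""
    dates = pd.to_datetime(df[date_col])
    cutoff = pd.Timestamp(split_date)
    train_part = df[dates < cutoff].copy()
    valid_part = df[dates >= cutoff].copy()
    return train_part, valid_part

# Toy weekly data; the cutoff is illustrative, not the project's SPLIT_DATE.
toy = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=6, freq="7D"),
    "Weekly_Sales": [10.0, 12.0, 11.0, 13.0, 9.0, 14.0],
})
toy_train, toy_valid = split_by_date(toy, "2010-03-05")
print(len(toy_train), len(toy_valid))
```

Splitting on time rather than at random keeps every validation week later than every training week, which mirrors how the model is used for forecasting.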

In [10]:
merged_train.head()

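| | Store | Dept | Date | Weekly_Sales | IsHoliday | Type | Size | Temperature | Fuel_Price | MarkDown1 | MarkDown2 | MarkDown3 | MarkDown4 | MarkDown5 | CPI | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 2010-02-05 | 24924.5 | False | A | 151315 | 42.31 | 2.572 | NaN | NaN | NaN | NaN | NaN | 211.096358 | 8.106 |
| 1 | 1 | 1 | 2010-02-12 | 46039.49 | True | A | 151315 | 38.51 | 2.548 | NaN | NaN | NaN | NaN | NaN | 211.24217 | 8.106 |
| 2 | 1 | 1 | 2010-02-19 | 41595.55 | False | A | 151315 | 39.93 | 2.514 | NaN | NaN | NaN | NaN | NaN | 211.289143 | 8.106 |
| 3 | 1 | 1 | 2010-02-26 | 19403.54 | False | A | 151315 | 46.63 | 2.561 | NaN | NaN | NaN | NaN | NaN | 211.319643 | 8.106 |
| 4 | 1 | 1 | 2010-03-05 | 21827.9 | False | A | 151315 | 46.5 | 2.625 | NaN | NaN | NaN | NaN | NaN | 211.350143 | 8.106 |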


# Custom Feature Engineering Classes

For feature engineering, we will use the following custom-built classes:

1. **FeatureAdder**  
   Adds various time-related features such as Fourier terms, week of the year, and more.  
   Implementation: `feature_engineering/time_features`

2. **NaImputer**  
   Handles missing values by identifying columns with NaNs and applying a specified imputation strategy (default: `'mean'`).  
   It uses `SimpleImputer` internally and fills missing values while leaving the rest of the data unchanged.  
   Implementation: `feature_engineering/imputers`

3. **Cat2Num**  
   Transforms categorical columns into numerical format.  
   - Converts `'IsHoliday'` to integers  
   - Encodes `'Type'` as categorical numeric codes  

   Implementation: `feature_engineering/feature_transformers`
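
To make the descriptions above concrete, here is a minimal sketch of the kind of transformation `FeatureAdder` and `Cat2Num` are described as performing. It is not the project's implementation: the function names `add_week_fourier` and `encode_categoricals` are hypothetical, and the Fourier period of 52 weeks is an assumption (some ISO years have 53 weeks). The output column names `WeekOfYear`, `week_sin`, and `week_cos` match the feature names logged later in this notebook.

```python
import numpy as np
import pandas as pd

def add_week_fourier(df, date_col="Date", period=52.0):
    """Add ISO week number plus one sine/cosine pair encoding yearly seasonality."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    week = dates.dt.isocalendar().week.astype(int)
    out["WeekOfYear"] = week
    angle = 2.0 * np.pi * week / period  # assumed period of 52 weeks
    out["week_sin"] = np.sin(angle)
    out["week_cos"] = np.cos(angle)
    return out

def encode_categoricals(df):
    """Cast IsHoliday to 0/1 and map Type to integer category codes."""
    out = df.copy()
    out["IsHoliday"] = out["IsHoliday"].astype(int)
    out["Type"] = out["Type"].astype("category").cat.codes
    return out

toy = pd.DataFrame({
    "Date": ["2010-02-05", "2010-02-12"],
    "IsHoliday": [False, True],
    "Type": ["A", "B"],
})
features_toy = encode_categoricals(add_week_fourier(toy))
print(features_toy)
```

The sine/cosine pair places week 52 next to week 1, which a raw week number cannot express to a tree model without extra splits.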


In [11]:
from feature_engineering.time_features import FeatureAdder
from feature_engineering.imputers import NaImputer
from feature_engineering.feature_transformers import Cat2Num


# Training and Validating

We begin with a manual grid search over XGBoost hyperparameters, scored by WMAE on the held-out validation split, to establish a strong baseline model.
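
All scores in this notebook are weighted mean absolute errors (WMAE), with holiday weeks weighted 5 and all other weeks weighted 1, as recorded in the run configuration at the end of the notebook. The project's `wmae` function is imported from `src/utils` and not shown here; the sketch below is an independent illustration of the formula, and the name `wmae_sketch` is hypothetical.

```python
import numpy as np

def wmae_sketch(y_true, y_pred, is_holiday, holiday_weight=5.0):
    """Weighted MAE: sum(w * |y - y_hat|) / sum(w), w = holiday_weight on holidays, else 1."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), holiday_weight, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# One ordinary week (error 10) and one holiday week (error 20):
# (1 * 10 + 5 * 20) / (1 + 5) = 110 / 6
example = wmae_sketch([100.0, 200.0], [110.0, 180.0], [False, True])
print(example)
```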


In [12]:
from xgboost import DMatrix
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import root_mean_squared_error
from sklearn.metrics import mean_absolute_error
from src.utils import wmae

In [13]:
from src.cross_validation import manual_model_search
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('adder', FeatureAdder()),
    ('imputer', NaImputer()),
    ('cat2num', Cat2Num()),
  ])


mod = Pipeline([
    ('model', XGBRegressor(
        n_estimators=1000,
        learning_rate=0.1,
        max_depth=7,
        reg_lambda=3,
        min_split_loss=100,
        objective='reg:squarederror',
        random_state=42,
    ))
  ])

transformed_train = pipeline.fit_transform(x_train,y_train)
transformed_val = pipeline.transform(x_val)

param_grid = {
    'model__n_estimators': [200,500,800,1000],
    'model__learning_rate': [0.1],
    'model__max_depth': [7],
    'model__reg_lambda': [3],
    'model__min_split_loss': [100],
}


metric_kwargs = {
    'is_holiday': transformed_val['IsHoliday']
}

best_model, best_params, best_score = manual_model_search(
    model=mod,
    param_grid=param_grid,
    X_train=transformed_train,
    y_train=y_train,
    X_valid=transformed_val,
    y_valid=y_val,
    metric_func=wmae,
    metric_kwargs=metric_kwargs
)

print("\nBest Params:", best_params)
print("Best Validation Score:", best_score)



Params: {'model__n_estimators': 200, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100} -> Score: 3383.7155
Params: {'model__n_estimators': 500, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100} -> Score: 3020.4466
Params: {'model__n_estimators': 800, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100} -> Score: 2895.2265
Params: {'model__n_estimators': 1000, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100} -> Score: 2848.9664

Best Params: {'model__n_estimators': 1000, 'model__learning_rate': 0.1, 'model__max_depth': 7, 'model__reg_lambda': 3, 'model__min_split_loss': 100}
Best Validation Score: 2848.966433266072
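
`manual_model_search` is defined in `src/cross_validation` and its implementation is not shown here. Judging only from the call above, it appears to fit one model per parameter combination and keep the one with the lowest validation score. The sketch below illustrates that idea under that assumption; `grid_search_holdout`, `RidgeLine`, and `mae` are hypothetical names, and the toy model stands in for XGBoost so that the example runs without extra dependencies.

```python
from itertools import product
import numpy as np

def grid_search_holdout(make_model, param_grid, X_train, y_train, X_valid, y_valid, metric):
    """Fit one model per parameter combination; keep the lowest validation score."""
    names = list(param_grid)
    best_model, best_params, best_score = None, None, float("inf")
    for values in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        model = make_model(**params)
        model.fit(X_train, y_train)
        score = metric(y_valid, model.predict(X_valid))
        if score < best_score:
            best_model, best_params, best_score = model, params, score
    return best_model, best_params, best_score

class RidgeLine:
    """Toy 1-D ridge regression through the origin, used only to exercise the search."""
    def __init__(self, alpha):
        self.alpha = alpha
    def fit(self, X, y):
        x = np.asarray(X, dtype=float).ravel()
        self.coef_ = float(x @ np.asarray(y, dtype=float) / (x @ x + self.alpha))
        return self
    def predict(self, X):
        return self.coef_ * np.asarray(X, dtype=float).ravel()

def mae(y_true, y_pred):
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

X_tr, y_tr = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
X_va, y_va = [4.0, 5.0], [8.0, 10.0]
_, toy_best_params, toy_best_score = grid_search_holdout(
    RidgeLine, {"alpha": [0.0, 1.0, 10.0]}, X_tr, y_tr, X_va, y_va, mae
)
print(toy_best_params, toy_best_score)
```

Because the validation set is a single later time window rather than several folds, the selected parameters are tuned to that window specifically.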


In [17]:
from sklearn.pipeline import Pipeline
from feature_engineering.lag_adder import LagAdder

xgb = XGBRegressor(
        n_estimators=1000,
        learning_rate=0.1,
        max_depth=7,
        reg_lambda=3,
        objective='reg:squarederror',
        random_state=42,
    )

pipeline = Pipeline([
    ('adder', FeatureAdder(
        add_week_num=True,
        add_holiday_flags=False,
        add_holiday_proximity=False,
        add_holiday_windows=False,
        add_fourier_features=True,
        add_month_and_year=False,
        list_of_holiday_proximity=None,
        replace_time_index = False,

        add_dummy_date=True,start_date=pd.Timestamp('2010-02-05'))),
    ('imputer', NaImputer()),
    ('cat2num', Cat2Num()),
    ('model', xgb)
  ])

model = pipeline.fit(x_train, y_train)


y_train_predict = model.predict(x_train)
y_val_predict = model.predict(x_val)

train_score = wmae(y_train, y_train_predict,x_train['IsHoliday'].to_list())
val_score = wmae(y_val, y_val_predict,x_val['IsHoliday'].to_list())
print(f"Train wmae: {train_score}, Val wmae: {val_score}")

Train wmae: 1632.747744372342, Val wmae: 2893.6627682140415


Next, we introduce `LagAdder`, a wrapper that plays a key role in making XGBoost act as an **autoregressive** model. It adds lagged versions of the target variable (values predicted for previous time steps) as new features to the dataset, so the model can condition directly on recent sales and capture temporal dependencies. Here it wraps the XGBoost regressor and is configured with 2 lags. This is a common way to adapt non-sequential models such as XGBoost to time series forecasting tasks.
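
The internals of `LagAdder` are not shown in this notebook. As a rough sketch of what a lag feature is, the example below builds lagged target columns on a toy frame, assuming lags are computed separately for each `Store` and `Dept` series. The helper `add_lags` and the `_lag{k}` column names are hypothetical.

```python
import pandas as pd

def add_lags(df, target="Weekly_Sales", keys=("Store", "Dept"), n_lags=2):
    """Add target_lag1..target_lagN, shifting within each (Store, Dept) series."""
    out = df.sort_values([*keys, "Date"]).copy()
    grouped = out.groupby(list(keys))[target]
    for k in range(1, n_lags + 1):
        # shift(k) moves each series down by k rows, so row t sees the value from t - k
        out[f"{target}_lag{k}"] = grouped.shift(k)
    return out

toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Dept": [1, 1, 1, 1, 1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19"] * 2),
    "Weekly_Sales": [10.0, 20.0, 30.0, 100.0, 200.0, 300.0],
})
lagged = add_lags(toy)
print(lagged)
```

The first rows of each series have no history, so their lag columns are NaN. During forecasting the true past sales are unknown beyond the training period, which is why lag values then have to come from earlier predictions.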

In [16]:
from sklearn.pipeline import Pipeline
from feature_engineering.lag_adder import LagAdder

xgb = XGBRegressor(
        n_estimators=1000,
        learning_rate=0.01,
        max_depth=7,
        reg_lambda=1000,
        objective='reg:squarederror',
        random_state=42,
    )

pipeline = Pipeline([
    ('adder', FeatureAdder(
        add_week_num=True,
        add_holiday_flags=False,
        add_holiday_proximity=False,
        add_holiday_windows=False,
        add_fourier_features=True,
        add_month_and_year=False,
        list_of_holiday_proximity=None,
        replace_time_index = False,

        add_dummy_date=True,start_date=pd.Timestamp('2010-02-05'))),
    ('imputer', NaImputer()),
    ('cat2num', Cat2Num()),
    ('model', LagAdder(y_val,xgb,2))
  ])

model = pipeline.fit(x_train, y_train)


y_train_predict = model.predict(x_train)
y_val_predict = model.predict(x_val)

train_score = wmae(y_train, y_train_predict,x_train['IsHoliday'].to_list())
val_score = wmae(y_val, y_val_predict,x_val['IsHoliday'].to_list())
print(f"Train wmae: {train_score}, Val wmae: {val_score}")

0
0
1




Train wmae: 3001.246717622876, Val wmae: 3232.624473235746




We will not proceed with further training in this notebook.

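Our experiments suggest that **autoregressive modeling** using lagged features has a negative impact on performance in this case: validation WMAE rose from about 2894 without lags to about 3233 with them. Note that the lag run also used a lower learning rate (0.01) and much stronger regularization (`reg_lambda=1000`), so the two runs are not a like-for-like comparison. On this evidence, lagged features are not suitable for our XGBoost setup.

**XGBoost** is a strong model, achieving a validation weighted mean absolute error (WMAE) of approximately **2849** in the grid search above, but on its own it is **not sufficient** for capturing the full complexity of the time series patterns in this problem.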


# Log to WandB

In [18]:
xgb = XGBRegressor(
        n_estimators=1000,
        learning_rate=0.1,
        max_depth=7,
        reg_lambda=3,
        objective='reg:squarederror',
        random_state=42,
    )

pipeline = Pipeline([
    ('adder', FeatureAdder(
        add_week_num=True,
        add_holiday_flags=False,
        add_holiday_proximity=False,
        add_holiday_windows=False,
        add_fourier_features=True,
        add_month_and_year=False,
        list_of_holiday_proximity=None,
        replace_time_index = False,

        add_dummy_date=True,start_date=pd.Timestamp('2010-02-05'))),
    ('imputer', NaImputer()),
    ('cat2num', Cat2Num()),
    ('model', xgb)
  ])

In [None]:
! wandb login

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize?ref=models
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: No netrc file found, creating one.
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc
[34m[1mwandb[0m: Currently logged in as: [33mazhgh22[0m ([33mMLBeasts[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
import wandb
import joblib

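y = merged_train['Weekly_Sales'].copy()
x = merged_train.drop(columns=['Weekly_Sales'])

# Refit the pipeline on the full training data (train + validation rows).
fin_model = pipeline.fit(x,y)

# Note: this saves `model` (assigned in an earlier cell), not `fin_model`.
joblib.dump(model, "xgb_pipeline_lags.pkl")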
wandb.init(project="Walmart Recruiting - Store Sales Forecasting", name="xgboost:run2:lag_features")

wandb.config.update({
    'merge1' : 'train, store, how=left, on=Store',
    'merge2' : 'train, features, how=left, on=Store, Date, IsHoliday',
    'merged_tables' : ['train','stores','features'],
    'time_features' : [
        'DateDummy', 'Month', 'Year',
       'WeekOfYear', 'week_sin', 'week_cos', 'month_sin', 'month_cos'
    ],
    'lags' : 2,
    'add_dummy_date' : True,
    'start_date' : '2010-02-05',
    'score_metric' : 'WMAE',
    'score_policy' : {
        'weight on holidays' : 5,
        'weight on non_holidays' : 1
    },
    'model' : 'Xgboost',
    'n_estimators' : 1000,
    'learning_rate' : 0.1,
    'max_depth' : 7,
    'reg_lambda' : 1000,
    'objective' : 'reg:squarederror',
})

wandb.log({
    'train_wmae': train_score,
    'val_wmae': val_score
})


artifact = wandb.Artifact(
    name="xgb_pipeline",
    type="model",
    description="XGBoost pipeline with Date engineering and imputing"
)

artifact.add_file("xgb_pipeline_lags.pkl")
wandb.log_artifact(artifact)

wandb.finish()

[34m[1mwandb[0m: Currently logged in as: [33mazhgh22[0m ([33mMLBeasts[0m) to [32mhttps://api.wandb.ai[0m. Use [1m`wandb login --relogin`[0m to force relogin


Run summary:
train_wmae: 3001.24672
val_wmae: 3232.62447
