# Store Sales Forecasting First Place Solution

## Introduction

In this post, I share the approach behind my first-place solution for the Store Sales Forecasting challenge. The competition required accurately predicting daily sales for multiple stores using historical sales data and additional metadata. My solution combined robust feature engineering and careful validation strategies to achieve top performance. Below, I break down the key components that contributed to the winning result.

Let's first import packages.

## Importing Packages

<div style="background-color: #ffcccc; padding: 10px; border-left: 5px solid red;">
  <strong>Note:</strong> If you're not running this on Kaggle or Google Colab, make sure to install the required packages first using <code>pip install</code> before importing them.
</div>


In [1]:
# Basic Libraries
import os  # Interact with the operating system (e.g., file paths, directories)
from pathlib import Path  # Easier and more readable file path handling
import warnings  # To manage warnings in Python
warnings.filterwarnings('ignore')  # Hide warnings to keep output clean

# Data Handling
import numpy as np  # Useful for numerical operations and handling arrays
import pandas as pd  # For working with data tables (e.g., reading CSV files, dataframes)

# Data Visualization
import matplotlib.pyplot as plt  # Create basic plots like line, bar, scatter
import seaborn as sns  # Build attractive and informative statistical plots

# Machine Learning Tools
from sklearn.model_selection import KFold  # Split data into folds for cross-validation
from sklearn.metrics import mean_squared_error  # Measure prediction error

# Machine Learning Model
import lightgbm as lgb  # A fast and powerful gradient boosting framework

# Additional Utilities
import statistics  # Perform basic statistical calculations (mean, median, etc.)

Let's load our data with pathlib's <code>Path</code> and pandas.

In [2]:
DATA = Path('/kaggle/input/store-sales-forecasting-challenge-for-beginners/store-sales-forecasting-challenge-for-beginners/')

In [3]:
train = pd.read_csv(DATA / 'train.csv')
test = pd.read_csv(DATA / 'test.csv')
dates = pd.read_csv(DATA / 'dates.csv')
holidays = pd.read_csv(DATA / 'holidays.csv')
stores = pd.read_csv(DATA / 'stores.csv')
ss = pd.read_csv(DATA / 'SampleSubmission.csv')

<div style="background-color: lightgreen; padding: 10px; border-left: 5px solid green;">
  <strong>Note:</strong> One of the most important things to do is to familiarize yourself with the data (VERY IMPORTANT).
</div>


Let's explore the data now! 🔥

In [4]:
train.head()

Unnamed: 0,date,store_id,category_id,target,onpromotion,nbr_of_transactions
0,365,store_1,category_24,0.0,0,0.0
1,365,store_1,category_21,0.0,0,0.0
2,365,store_1,category_32,0.0,0,0.0
3,365,store_1,category_18,0.0,0,0.0
4,365,store_1,category_26,0.0,0,0.0


In [5]:
dates.head()

Unnamed: 0,date,year,month,dayofmonth,dayofweek,dayofyear,weekofyear,quarter,is_month_start,is_month_end,is_quarter_start,is_quarter_end,is_year_start,is_year_end,year_weekofyear
0,365,1,1,1,2,1,1,1,True,False,True,False,True,False,101
1,366,1,1,2,3,2,1,1,False,False,False,False,False,False,101
2,367,1,1,3,4,3,1,1,False,False,False,False,False,False,101
3,368,1,1,4,5,4,1,1,False,False,False,False,False,False,101
4,369,1,1,5,6,5,1,1,False,False,False,False,False,False,101


In [6]:
test.head()

Unnamed: 0,date,store_id,category_id,onpromotion
0,1627,store_1,category_24,0
1,1627,store_1,category_21,0
2,1627,store_1,category_32,0
3,1627,store_1,category_18,16
4,1627,store_1,category_26,0


In [7]:
holidays.head()

Unnamed: 0,date,type
0,1,0
1,5,4
2,12,4
3,42,0
4,43,0


In [8]:
stores.head()

Unnamed: 0,store_id,city,type,cluster
0,store_1,0,0,0
1,store_2,0,0,0
2,store_3,0,0,1
3,store_4,0,0,2
4,store_5,1,0,3


What do you think? 🤔

Well, let me guess: we should merge the datasets. Yes, that's the answer, so let's do that😜

## Merge the Datasets

We are going to use pandas' <code>merge()</code>, joining the <code>dates</code> table on <code>date</code> and the <code>stores</code> table on <code>store_id</code>. You can spot these shared columns when you go through the data.

In [9]:
# Merge on 'date'
merged_train = pd.merge(train, dates, on="date", how="left")
merged_test = pd.merge(test, dates, on="date", how="left")

In [10]:
# Merge on 'store_id'
merged_train = pd.merge(merged_train, stores, on="store_id", how="left")
merged_test = pd.merge(merged_test, stores, on="store_id", how="left")
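
A left merge can silently duplicate rows if the right-hand table has repeated keys. Here is a minimal sketch, on made-up toy frames rather than the competition files, of how pandas' `validate` argument could guard against that. The `toy_*` names and values are invented for illustration.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are invented for illustration)
toy_train = pd.DataFrame({
    "date": [365, 365, 366],
    "store_id": ["store_1", "store_2", "store_1"],
    "target": [0.0, 1.5, 2.0],
})
toy_dates = pd.DataFrame({"date": [365, 366], "dayofweek": [2, 3]})
toy_stores = pd.DataFrame({"store_id": ["store_1", "store_2"], "cluster": [0, 1]})

# validate="many_to_one" raises MergeError if a key repeats in the right table
toy_merged = pd.merge(toy_train, toy_dates, on="date", how="left", validate="many_to_one")
toy_merged = pd.merge(toy_merged, toy_stores, on="store_id", how="left", validate="many_to_one")

# A left merge against unique keys should keep the row count unchanged
print(len(toy_train), len(toy_merged))
```

If the two printed counts differ on real data, the lookup table has duplicate keys and is worth inspecting before training.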

## Feature Engineering

Feature engineering is essential, so let's create a weekend feature. It makes sense that weekends can affect sales in a store.

In [11]:
merged_train["is_weekend"] = merged_train["dayofweek"].isin([5, 6]).astype(int)
merged_test["is_weekend"] = merged_test["dayofweek"].isin([5, 6]).astype(int)
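
The flag above treats day codes 5 and 6 as the weekend. That matches the pandas convention (Monday = 0, Sunday = 6), but the dates here are anonymized integers, so this is an assumption worth checking. A small sketch with invented numbers shows what the flag produces and one way to sanity-check it against sales:

```python
import pandas as pd

# Invented sales for one week; dayofweek is assumed to follow Monday=0 ... Sunday=6
toy = pd.DataFrame({
    "dayofweek": [0, 1, 2, 3, 4, 5, 6],
    "target": [10.0, 11.0, 9.0, 10.0, 12.0, 20.0, 18.0],
})
toy["is_weekend"] = toy["dayofweek"].isin([5, 6]).astype(int)

# Compare average sales on flagged vs. unflagged days
mean_by_flag = toy.groupby("is_weekend")["target"].mean()
print(toy["is_weekend"].tolist())
print(mean_by_flag)
```

Running the same `groupby` on `merged_train` would show whether days 5 and 6 really behave differently in this dataset.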

## More Exploratory Data Analysis

In [12]:
merged_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2248884 entries, 0 to 2248883
Data columns (total 24 columns):
 #   Column               Dtype  
---  ------               -----  
 0   date                 int64  
 1   store_id             object 
 2   category_id          object 
 3   target               float64
 4   onpromotion          int64  
 5   nbr_of_transactions  float64
 6   year                 int64  
 7   month                int64  
 8   dayofmonth           int64  
 9   dayofweek            int64  
 10  dayofyear            int64  
 11  weekofyear           int64  
 12  quarter              int64  
 13  is_month_start       bool   
 14  is_month_end         bool   
 15  is_quarter_start     bool   
 16  is_quarter_end       bool   
 17  is_year_start        bool   
 18  is_year_end          bool   
 19  year_weekofyear      int64  
 20  city                 int64  
 21  type                 int64  
 22  cluster              int64  
 23  is_weekend           int64  
dty

In [13]:
merged_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99792 entries, 0 to 99791
Data columns (total 22 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   date              99792 non-null  int64 
 1   store_id          99792 non-null  object
 2   category_id       99792 non-null  object
 3   onpromotion       99792 non-null  int64 
 4   year              99792 non-null  int64 
 5   month             99792 non-null  int64 
 6   dayofmonth        99792 non-null  int64 
 7   dayofweek         99792 non-null  int64 
 8   dayofyear         99792 non-null  int64 
 9   weekofyear        99792 non-null  int64 
 10  quarter           99792 non-null  int64 
 11  is_month_start    99792 non-null  bool  
 12  is_month_end      99792 non-null  bool  
 13  is_quarter_start  99792 non-null  bool  
 14  is_quarter_end    99792 non-null  bool  
 15  is_year_start     99792 non-null  bool  
 16  is_year_end       99792 non-null  bool  
 17  year_weekofy

In [14]:
# Check for Null values
merged_train.isnull().sum()

date                   0
store_id               0
category_id            0
target                 0
onpromotion            0
nbr_of_transactions    0
year                   0
month                  0
dayofmonth             0
dayofweek              0
dayofyear              0
weekofyear             0
quarter                0
is_month_start         0
is_month_end           0
is_quarter_start       0
is_quarter_end         0
is_year_start          0
is_year_end            0
year_weekofyear        0
city                   0
type                   0
cluster                0
is_weekend             0
dtype: int64

In [15]:
# Check for Null values
merged_test.isnull().sum()

date                0
store_id            0
category_id         0
onpromotion         0
year                0
month               0
dayofmonth          0
dayofweek           0
dayofyear           0
weekofyear          0
quarter             0
is_month_start      0
is_month_end        0
is_quarter_start    0
is_quarter_end      0
is_year_start       0
is_year_end         0
year_weekofyear     0
city                0
type                0
cluster             0
is_weekend          0
dtype: int64

From this, there are no null values, and the dates span about three and a half years (I think🤔, never mind😂).

You can see it here:

<div style="background-color: lightgreen; padding: 10px; border-left: 5px solid green;">
The train set contains transaction information for 3 years and 6 months. You are tasked with forecasting the next 8 weeks for the same stores and same products.
</div>
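
The "3 years and 6 months" figure can be roughly cross-checked from the row counts printed by `.info()` above. This is only a back-of-the-envelope sketch: it assumes the test set covers exactly 8 weeks and that every store/category pair has one row per day in both sets.

```python
# Row counts copied from the .info() outputs in this notebook
n_train_rows = 2_248_884
n_test_rows = 99_792

# Assumption: the test set covers exactly 8 weeks (56 days) and every
# store/category pair has exactly one row per day in both train and test
test_days = 8 * 7
n_pairs = n_test_rows // test_days
train_days = n_train_rows // n_pairs

print(n_pairs, train_days, round(train_days / 365, 2))
```

Under those assumptions, the training period works out to roughly 3.46 years, which agrees with the competition's description.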

In [16]:
merged_train.head()

Unnamed: 0,date,store_id,category_id,target,onpromotion,nbr_of_transactions,year,month,dayofmonth,dayofweek,...,is_month_end,is_quarter_start,is_quarter_end,is_year_start,is_year_end,year_weekofyear,city,type,cluster,is_weekend
0,365,store_1,category_24,0.0,0,0.0,1,1,1,2,...,False,True,False,True,False,101,0,0,0,0
1,365,store_1,category_21,0.0,0,0.0,1,1,1,2,...,False,True,False,True,False,101,0,0,0,0
2,365,store_1,category_32,0.0,0,0.0,1,1,1,2,...,False,True,False,True,False,101,0,0,0,0
3,365,store_1,category_18,0.0,0,0.0,1,1,1,2,...,False,True,False,True,False,101,0,0,0,0
4,365,store_1,category_26,0.0,0,0.0,1,1,1,2,...,False,True,False,True,False,101,0,0,0,0


In [17]:
merged_test.head()

Unnamed: 0,date,store_id,category_id,onpromotion,year,month,dayofmonth,dayofweek,dayofyear,weekofyear,...,is_month_end,is_quarter_start,is_quarter_end,is_year_start,is_year_end,year_weekofyear,city,type,cluster,is_weekend
0,1627,store_1,category_24,0,4,6,19,0,170,25,...,False,False,False,False,False,425,0,0,0,0
1,1627,store_1,category_21,0,4,6,19,0,170,25,...,False,False,False,False,False,425,0,0,0,0
2,1627,store_1,category_32,0,4,6,19,0,170,25,...,False,False,False,False,False,425,0,0,0,0
3,1627,store_1,category_18,16,4,6,19,0,170,25,...,False,False,False,False,False,425,0,0,0,0
4,1627,store_1,category_26,0,4,6,19,0,170,25,...,False,False,False,False,False,425,0,0,0,0


We are at the interesting part, Model Training🔥🔥🙌🏽!

Now let's split the training data into X and y: X holds the features, and y is the target.

## Model Training

You may notice something in the training cell below: we train on the log-transformed target, not the raw one.

```
y = np.log1p(merged_train.target)  # not y = merged_train.target
```

Why? The competition's Data section says:

<div style="background-color: #ffcccc; padding: 10px; border-left: 5px solid red;">
<strong>NB:</strong> logp1 (log(x+1)) transformation was applied to the label on the testing set, hence we are asking you to apply the logp1 transformation on your predictions before making any submission.
</div>


This is the reason we converted it first and trained on it.
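
To make the transformation concrete, here is a small sketch with invented numbers. It shows that `np.expm1` undoes `np.log1p`, and that RMSE computed on log1p values is the RMSLE of the raw values. That is the quantity the model optimizes if the leaderboard scores log1p-transformed labels, as the note above suggests.

```python
import numpy as np

y_true = np.array([0.0, 3.0, 120.0])   # invented raw sales values
y_pred = np.array([0.5, 2.0, 100.0])   # invented raw predictions

# log1p maps 0 to 0 and compresses large values; expm1 is its inverse
y_log = np.log1p(y_true)
round_trip = np.expm1(y_log)

# RMSE computed on log1p values is the RMSLE of the raw values
rmse_on_log = np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))
print(round_trip, rmse_on_log)
```

Because zero sales map to exactly zero, the many zero-target rows in this dataset pose no problem for the transformation.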

In [18]:
X = merged_train.drop(columns=['target','nbr_of_transactions','year','weekofyear'])
y = np.log1p(merged_train.target)
X_test = merged_test

Convert string features to categorical features so that our model, LightGBM, can work with them.

In [19]:
def change_object_to_cat(df):
    """Cast object columns to category; return the new dataframe and the list of converted columns."""
    df = df.copy()
    list_str_obj_cols = df.columns[df.dtypes == "object"].tolist()
    for str_obj_col in list_str_obj_cols:
        df[str_obj_col] = df[str_obj_col].astype("category")

    return df, list_str_obj_cols

In [20]:
X, cat_list = change_object_to_cat(X)
X_test, cat_list = change_object_to_cat(X_test)
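
Converting train and test separately gives each frame its own category list. One way to make the mapping explicit is to build a shared `CategoricalDtype` from the training values. This is a sketch on invented toy frames, not what the notebook does, and whether it changes anything depends on how the library handles pandas categories internally.

```python
import pandas as pd

# Invented frames: the test frame is missing store_2
toy_X = pd.DataFrame({"store_id": ["store_1", "store_2", "store_3"]})
toy_X_test = pd.DataFrame({"store_id": ["store_3", "store_1"]})

# Build one dtype from the training categories and apply it to both frames
store_dtype = pd.CategoricalDtype(categories=sorted(toy_X["store_id"].unique()))
toy_X["store_id"] = toy_X["store_id"].astype(store_dtype)
toy_X_test["store_id"] = toy_X_test["store_id"].astype(store_dtype)

# Both frames now share the same category list, so codes mean the same thing
print(toy_X["store_id"].cat.codes.tolist())
print(toy_X_test["store_id"].cat.codes.tolist())
```

With a shared dtype, `store_3` gets the same integer code in both frames even though the test frame has fewer distinct stores.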

The code below trains a LightGBM (LGBM) regression model using cross-validation. This is useful for estimating how well the model generalizes to unseen data and for spotting overfitting.

It contains two key functions:

<strong><code>lgbm_trainer</code></strong>: Trains one LGBM model and evaluates it using a validation set.

This function trains a single LightGBM model as follows:

- <code>train_data</code> and <code>validation_data</code> are prepared using <code>lgb.Dataset</code>, with categorical features properly specified.
- Training is done using <code>lgb.train</code>, with:
  - Early stopping
  - Evaluation logging
  - Evaluation result recording


<strong><code>cv_train_lgbm</code></strong>: Performs cross-validation (here, 6-fold), trains models on each fold, and calculates the average RMSE.

This function performs 6-fold cross-validation and:

- Trains one model per fold using <code>lgbm_trainer</code>
- Collects predictions and computes RMSE for each fold
- Returns the mean RMSE and a list of trained models
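
To see what the 6-fold split does before running it on two million rows, here is a tiny sketch on 12 invented rows. It uses the same `KFold` settings as the training code, and `toy_X` is a placeholder array.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(12).reshape(-1, 1)  # 12 invented rows
kf = KFold(n_splits=6, shuffle=True, random_state=42)

seen = []
fold_sizes = []
for fold, (trn_idx, val_idx) in enumerate(kf.split(toy_X)):
    # Each fold trains on 10 rows and validates on the remaining 2
    seen.extend(val_idx.tolist())
    fold_sizes.append((len(trn_idx), len(val_idx)))
    print(fold, len(trn_idx), len(val_idx))

# Every row lands in a validation fold exactly once
print(sorted(seen) == list(range(12)))
```

Because the folds are shuffled, validation rows come from the same time period as the training rows, so the fold RMSE may look more optimistic than a score on genuinely future weeks.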


In [21]:
def lgbm_trainer(X_train, y_train, X_test, y_test, params, num_round, categorical):
    """
    Trains an LGBM (LightGBM) model using the training data and evaluates it on the validation data.
    Returns the trained model (bst).
    """

    # Prepare the training dataset for LightGBM
    # 'lgb.Dataset' is used to create a LightGBM dataset object from the training data
    # 'categorical_feature' specifies which columns are categorical
    train_data = lgb.Dataset(X_train, y_train, feature_name=X_train.columns.tolist(), categorical_feature=categorical, free_raw_data=False)

    # Prepare the validation dataset for LightGBM in the same way as the training data
    validation_data = lgb.Dataset(X_test, y_test, feature_name=X_train.columns.tolist(), categorical_feature=categorical, free_raw_data=False)

    # Initialize an empty dictionary to record evaluation results during training
    eval_result = {}

    # Train the LightGBM model using the specified parameters and datasets
    bst = lgb.train(
        params,  # Model parameters (e.g., learning rate, number of leaves, etc.)
        train_data,  # The training dataset
        num_round,  # The number of boosting rounds (iterations)
        valid_sets=[train_data, validation_data],  # Datasets to evaluate during training
        callbacks=[  # Callbacks to use during training
            lgb.early_stopping(stopping_rounds=100),  # Stop training early if the validation score doesn't improve for 100 rounds
            lgb.log_evaluation(100),  # Log evaluation results every 100 rounds
            lgb.record_evaluation(eval_result)  # Record evaluation results in 'eval_result'
        ]
    )

    # Return the trained model (bst) after training is complete
    return bst
    
# We start by defining default parameters and setting the objective metric
param = {"verbose": -100}  # Set verbosity to -100 to suppress detailed output
param['metric'] = 'rmse'   # Set the evaluation metric to RMSE (Root Mean Squared Error)
param['device_type'] ='gpu'
param['gpu_use_dp'] = False
param['random_state'] = 42

# Cross-validation helper: trains one model per fold and collects the fold metrics
def cv_train_lgbm(X_train, y_train, params, num_rounds, category):
    """
    Function to perform 6-fold cross-validation and train an LGBM model
    Returns the mean validation RMSE across folds and the models from the cross-validation
    Parameters:
    X_train (DataFrame): The feature matrix for training.
    y_train (Series): The target labels for training.
    params (dict): The parameters for training the LightGBM model.
    num_rounds (int): The maximum number of boosting iterations (rounds) to train the model.
    category (list): A list of categorical feature names or indices.
    
    Returns:
    tuple: A tuple containing the mean RMSE (float) and the list of trained models (list).
    """
    kf = KFold(n_splits=6, random_state=42, shuffle=True)  # 6-fold cross-validation
    lgbm_rmses = []  # List to store RMSE values for each fold
    lgbm_y_vals = []  # List to store true values for each fold (not used in this example)
    lgbm_y_hats = []  # List to store predicted values for each fold (not used in this example)
    lgbm_models = []  # List to store trained models for each fold

    # Loop through each fold of the cross-validation
    for trn_idx, test_idx in kf.split(X_train, y_train):  # Split the data into training and validation sets
        X_tr, X_val = X_train.iloc[trn_idx], X_train.iloc[test_idx]  # Training and validation features
        y_tr, y_val = y_train.iloc[trn_idx], y_train.iloc[test_idx]  # Training and validation labels
        
        # Train the LGBM model using the training data and validation data
        lgbm_cls = lgbm_trainer(X_tr, y_tr, X_val, y_val, params, num_rounds, category)
        lgbm_models.append(lgbm_cls)  # Save the trained model
        
        # Use the trained model to make predictions on the validation set
        lgbm_y_hat = lgbm_cls.predict(X_val, num_iteration=lgbm_cls.best_iteration)
        
        # Calculate RMSE (Root Mean Squared Error) between true and predicted values
        lgbm_rmse = np.sqrt(mean_squared_error(y_val, lgbm_y_hat))  # RMSE is the square root of MSE; avoids the `squared` argument, which newer scikit-learn releases removed
        lgbm_rmses.append(lgbm_rmse)  # Save the RMSE for this fold
    
    # Calculate the mean RMSE across all folds
    lgbm_mean_rmse = statistics.mean(lgbm_rmses)
    print("Mean RMSE: {}".format(lgbm_mean_rmse))  # Print the average RMSE across all folds
    
    return lgbm_mean_rmse, lgbm_models  # Return the average RMSE and the list of trained models


# Run the cross-validation function with the training data, parameters, and category list
lgbm_rmse, lgbm_models = cv_train_lgbm(X, y, param, 3000, cat_list)

# Print the final average RMSE
print(lgbm_rmse)



Training until validation scores don't improve for 100 rounds
[100]	training's rmse: 0.54822	valid_1's rmse: 0.550831
[200]	training's rmse: 0.492301	valid_1's rmse: 0.495234
[300]	training's rmse: 0.462197	valid_1's rmse: 0.465713
[400]	training's rmse: 0.441255	valid_1's rmse: 0.44504
[500]	training's rmse: 0.428024	valid_1's rmse: 0.432162
[600]	training's rmse: 0.417954	valid_1's rmse: 0.422341
[700]	training's rmse: 0.410588	valid_1's rmse: 0.415445
[800]	training's rmse: 0.404062	valid_1's rmse: 0.409305
[900]	training's rmse: 0.398307	valid_1's rmse: 0.404004
[1000]	training's rmse: 0.394149	valid_1's rmse: 0.400256
[1100]	training's rmse: 0.390478	valid_1's rmse: 0.39701
[1200]	training's rmse: 0.38732	valid_1's rmse: 0.394218
[1300]	training's rmse: 0.384562	valid_1's rmse: 0.391918
[1400]	training's rmse: 0.381886	valid_1's rmse: 0.389613
[1500]	training's rmse: 0.379544	valid_1's rmse: 0.387657
[1600]	training's rmse: 0.377223	valid_1's rmse: 0.385702
[1700]	training's rmse:

## 🔮 Making Predictions on Test Set

In [22]:
pred_df = X_test.copy()

# --- Predict using the trained log1p models ---
for i, model in enumerate(lgbm_models):
    if hasattr(model, 'best_iteration') and model.best_iteration:
        pred_log = model.predict(X_test[X.columns], num_iteration=model.best_iteration)
    else:
        pred_log = model.predict(X_test[X.columns])
    
    # Apply np.expm1 to reverse the log1p transformation
    pred_df[f"pred_lgbm_{i}"] = np.expm1(pred_log)

# --- Ensemble the Predictions (Mean and Median) ---
lgbm_preds = pred_df.filter(like='pred_lgbm').values
pred_df["mean_pred"] = lgbm_preds.mean(axis=1).clip(min=0)
pred_df["median_pred"] = np.median(lgbm_preds, axis=1).clip(min=0)

# --- Create ID Column ---
pred_df["ID"] = (
    "year_week_" +
    X_test["year_weekofyear"].astype(str) + "_" +
    X_test["store_id"].astype(str) + "_" +
    X_test["category_id"].astype(str)
)

# --- Aggregate Final Submission ---
submission = (
    pred_df.groupby("ID", as_index=False)["mean_pred"]
    .sum()
    .rename(columns={"mean_pred": "target"})
)

# --- Final Cleanup ---
submission.fillna(0, inplace=True)

In [23]:
submission.target = np.log1p(submission.target)

# Fill any NaNs just in case (precaution)
submission.fillna(0, inplace=True)

Did you see the trick?🤔

Yes, I used NumPy's <code>np.expm1</code> to convert the log values back to raw target values, summed them because we have to submit a weekly target😮‍💨, and then converted the sums back to log values with <code>np.log1p</code>.
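
A worked example with invented numbers shows why the order matters. Summing the daily predictions on the log scale is not the same as taking the log of the weekly total.

```python
import numpy as np

# Invented daily predictions on the log1p scale for one store/category week
daily_log_preds = np.log1p(np.array([10.0, 0.0, 5.0, 20.0, 0.0, 15.0, 50.0]))

# Back to raw units, sum over the week, then return to the log1p scale
weekly_raw = np.expm1(daily_log_preds).sum()
weekly_log = np.log1p(weekly_raw)

# Summing the log values directly gives a different (wrong) number
naive = daily_log_preds.sum()
print(weekly_raw, weekly_log, naive)
```

Here the raw weekly total is 100, so the value to submit is log(101), while the naive sum of logs is far larger.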

In [24]:
submission

Unnamed: 0,ID,target
0,year_week_425_store_10_category_0,4.537493
1,year_week_425_store_10_category_1,0.536428
2,year_week_425_store_10_category_10,1.694017
3,year_week_425_store_10_category_11,1.898512
4,year_week_425_store_10_category_12,6.841830
...,...,...
14251,year_week_432_store_9_category_5,0.760796
14252,year_week_432_store_9_category_6,3.388472
14253,year_week_432_store_9_category_7,10.727893
14254,year_week_432_store_9_category_8,6.236538


In [25]:
# Save to CSV
submission.to_csv("first_lgbm_submission24.csv", index=False)

Ermm, my private leaderboard score is 0.386, but this code might give you 0.393. Forgive me, I forgot to use a random state😂

And we are done!😂

## Don't Just Have A Good Day, Have A Great Day!

## Thank You! 