# Retail Demand Forecasting Project

*Miroslav Dzhokanov, 2025*

**[DISCLAIMER]**: Running all notebooks takes roughly `2-5 min` in total.

## Abstract

**Introduction:** In the field of retail, accurately predicting the future demand for specific products is critical for efficient inventory management and supply chain optimization. Traditional forecasting approaches often struggle with seasonal fluctuations, sudden market events, and complex multi-product, multi-location scenarios.

This project converts a time-series forecasting task into a supervised regression problem by building lagged features from past sales and training classical ML models on them (Random Forest, XGBoost, LightGBM).
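As a minimal sketch of that conversion, the snippet below shifts the target within each product so that every row carries its own past values as ordinary feature columns. The column names (`date`, `product`, `units_sold`) and the helper `add_lag_features` are illustrative assumptions, not the notebooks' actual identifiers, and the row-based shift assumes one row per group per day with no gaps.

```python
import pandas as pd

def add_lag_features(df, group_cols, target, lags=(1, 7, 14, 30)):
    """Return a sorted copy of df with one lagged-target column per lag.

    Hypothetical helper: lags are counted in rows, so it assumes each
    group has exactly one row per day and no missing days.
    """
    out = df.sort_values(list(group_cols) + ["date"]).copy()
    grouped = out.groupby(list(group_cols))[target]
    for lag in lags:
        # shift(lag) moves past values forward, so row t sees the value at t - lag
        out[f"{target}_lag_{lag}"] = grouped.shift(lag)
    return out

# Toy data: two products, five consecutive days each
sales = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=5).tolist() * 2,
    "product": ["A"] * 5 + ["B"] * 5,
    "units_sold": [10, 12, 11, 13, 15, 3, 4, 2, 5, 6],
})

# Rows without a full history have NaN lags and are dropped before training
supervised = add_lag_features(sales, ["product"], "units_sold", lags=(1, 2)).dropna()
```

Each remaining row is now a regular (features, target) pair, which is what lets a tree-based regressor be trained on a time series.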

**Purpose:** This project aims to build accurate demand forecasting models using tree-based ML approaches.

**Data:** Data is fetched from two Kaggle datasets and merged together.

**Results:** LightGBM achieved the best test performance with MAE=2.24 units and R²=0.833, with Random Forest and XGBoost close behind.

**Keywords:** *retail, time series, demand forecasting, random forest, xgboost, lightgbm, lagged features, moving averages*

## 1. Introduction

### Problem Statement

This project focuses on **predicting future product demand** across multiple locations and products. Accurate demand predictions lead to:

- **Inventory Optimization**: Prevent product stockouts and overstock
- **Revenue Maximization**: Ensure product availability when customers want to buy
- **Cost Reduction**: Minimize holding costs and waste (nobody wants to lock $10M in products that simply stay in the warehouse)

### Dataset Overview

**Final Balanced Dataset:**
- **Records**: 144,957 transactions
- **Time Period**: 394 days (2023-01-01 to 2024-01-30)
- **Products**: 69 unique products
- **Locations**: 52 unique sales locations
- **Categories**: 8 product categories
- **Features**: 30 selected features (reduced from 68 via importance analysis)

**Sources:**
- Dataset A: [(Kaggle) Retail Store Inventory and Demand Forecasting](https://www.kaggle.com/datasets/atomicd/retail-store-inventory-and-demand-forecasting)
- Dataset B: [(Kaggle) Product Sales Dataset](https://www.kaggle.com/datasets/yashyennewar/product-sales-dataset-2023-2024?select=product_sales_dataset_final.csv)

**Data Features:**
- Date
- Product, Category
- Location ID, Region, City, State
- SKU price, discount, revenue, profit
- Weather conditions, Seasonality, Epidemic indicator
- Sold units (the quantity we forecast)

## 2. Approach

### Pipeline

**Data Merging & Balancing**
- Merged two heterogeneous datasets
- Applied overlap period filtering (2023-01-01 to 2024-01-30)
- Removed zero-sales records and extreme outliers (0.5% data loss)

**Data Enrichment**
- Added weather data via Open-Meteo API
- Geocoding for location coordinates
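
The weather enrichment step could look roughly like the sketch below, which builds a request for the Open-Meteo historical archive endpoint for one location. The chosen daily variables, the `timezone` setting, and the helper names `build_weather_url` and `fetch_daily_weather` are assumptions for illustration. The project's actual request and geocoding method are not shown here, so coordinates are taken as given.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ARCHIVE_URL = "https://archive-api.open-meteo.com/v1/archive"

def build_weather_url(lat, lon, start_date, end_date):
    """Build an Open-Meteo archive request URL for daily weather (hypothetical helper)."""
    params = {
        "latitude": lat,
        "longitude": lon,
        "start_date": start_date,  # ISO format, e.g. "2023-01-01"
        "end_date": end_date,
        # Assumed variable selection; the project may use different ones
        "daily": "temperature_2m_max,temperature_2m_min,precipitation_sum",
        "timezone": "auto",
    }
    return f"{ARCHIVE_URL}?{urlencode(params)}"

def fetch_daily_weather(lat, lon, start_date, end_date, timeout=30):
    """Download the daily series; returns a dict of lists keyed by variable name."""
    url = build_weather_url(lat, lon, start_date, end_date)
    with urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["daily"]

# Building the URL needs no network access; fetching does
url = build_weather_url(40.71, -74.01, "2023-01-01", "2024-01-30")
```

Calling `fetch_daily_weather` once per unique location and joining the result on date keeps the number of API requests small.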

**Feature Engineering**
- **ABC Analysis**: Product segmentation by revenue
- **FMR Analysis**: Product segmentation by frequency (Fast-moving, Medium-moving, Rare-moving)
- **Temporal features**: Year, month, day, day_of_week, etc.
- **Cyclical encoding**: Sin/cos for temporal cycles
- **Lagged features**: Historical sales (1, 7, 14, 30 days)
- **Rolling statistics**: Moving averages and standard deviations
- **Aggregations**: Product, location, temporal statistics
- **Interactions**: Price×discount, price×weekend
- **One-hot encoding**: Categorical variables
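
Two of the steps above, cyclical encoding and rolling statistics, can be sketched as follows. The column names and the helpers `add_cyclical` and `add_rolling_stats` are hypothetical. The `shift(1)` before the rolling window is an assumption about how leakage is avoided: it keeps the current day's sales out of its own features.

```python
import numpy as np
import pandas as pd

def add_cyclical(df, col, period):
    """Encode a cycle position as sin/cos so the last and first values end up adjacent."""
    out = df.copy()
    angle = 2 * np.pi * out[col] / period
    out[f"{col}_sin"] = np.sin(angle)
    out[f"{col}_cos"] = np.cos(angle)
    return out

def add_rolling_stats(df, group_cols, target, window=7):
    """Add a rolling mean and std of past target values per group (hypothetical helper)."""
    out = df.sort_values(group_cols + ["date"]).copy()
    grouped = out.groupby(group_cols)[target]
    # shift(1) first, so each window ends on the previous row
    out[f"{target}_roll_mean_{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=1).mean()
    )
    out[f"{target}_roll_std_{window}"] = grouped.transform(
        lambda s: s.shift(1).rolling(window, min_periods=2).std()
    )
    return out

# Toy data: one product, four consecutive days starting on a Monday
frame = pd.DataFrame({
    "date": pd.date_range("2023-01-02", periods=4),
    "product": ["A"] * 4,
    "units_sold": [10.0, 12.0, 14.0, 16.0],
})
frame["day_of_week"] = frame["date"].dt.dayofweek  # Monday = 0

features = add_cyclical(frame, "day_of_week", period=7)
features = add_rolling_stats(features, ["product"], "units_sold", window=3)
```

With the sin/cos pair, Sunday (6) and Monday (0) sit next to each other on the circle instead of six units apart.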

**Model Development**

Implemented four models with 5-fold TimeSeriesSplit cross-validation:
- **Ridge Regression**: Linear baseline with L2 regularization
- **Random Forest**: Ensemble of 100 decision trees (max_depth=10)
- **XGBoost**: Gradient boosting with 200 estimators
- **LightGBM**: Fast gradient boosting framework

**Train/Validation/Test Split:** 70/15/15 (time-aware)
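
A time-aware split and the cross-validation folds might be set up along these lines. The sketch assumes the split is made on calendar dates, so that all rows for a given day land in the same partition. The helper `chronological_split` and the column names are illustrative, not taken from the notebooks.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

def chronological_split(df, date_col="date", train=0.70, val=0.15):
    """Split by date so validation and test are strictly later than training.

    Hypothetical helper; assumes enough distinct dates that both cut points exist.
    """
    dates = df[date_col].drop_duplicates().sort_values().tolist()
    train_end = dates[round(len(dates) * train)]        # first validation date
    val_end = dates[round(len(dates) * (train + val))]  # first test date
    train_df = df[df[date_col] < train_end]
    val_df = df[(df[date_col] >= train_end) & (df[date_col] < val_end)]
    test_df = df[df[date_col] >= val_end]
    return train_df, val_df, test_df

# Toy data: 100 days, two products per day
days = pd.date_range("2023-01-01", periods=100)
data = pd.DataFrame({
    "date": days.tolist() * 2,
    "product": ["A"] * 100 + ["B"] * 100,
    "units_sold": np.arange(200, dtype=float),
})
train_df, val_df, test_df = chronological_split(data)

# 5-fold expanding-window CV over the training dates (indices into the sorted dates)
n_train_dates = train_df["date"].nunique()
tscv = TimeSeriesSplit(n_splits=5)
folds = list(tscv.split(np.arange(n_train_dates)))
```

Every fold fits on an earlier block of dates and evaluates on the block that follows, so no fold is scored on data older than what it was trained on.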


## 3. Results

### Model Performance

| Model | MAE | RMSE | R² |
|-------|-----|------|-----|
| LightGBM | 2.24 | 9.35 | 0.833 |
| Random Forest | 2.25 | 9.36 | 0.833 |
| XGBoost | 2.25 | 9.39 | 0.832 |
| Ridge | 2.49 | 9.79 | 0.817 |
| EMA-7 | 2.45 | 10.01 | 0.809 |
| Naive | 3.02 | 12.92 | 0.682 |

Overall, the tree-based models improved accuracy by roughly 7.8% over the statistical exponential smoothing baseline (EMA-7).
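
For reference, the metrics in the table and the two baselines can be computed as in the sketch below. It assumes "Naive" means repeating the previous day's value and "EMA-7" means an exponential moving average with `span=7`; both definitions, and the helper names, are assumptions rather than the notebooks' confirmed setup. The toy data uses `span=3` to keep the arithmetic easy to follow.

```python
import numpy as np
import pandas as pd

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def relative_improvement(baseline_error, model_error):
    """Fraction by which an error metric shrinks relative to a baseline."""
    return (baseline_error - model_error) / baseline_error

def naive_forecast(series):
    # Assumed definition: tomorrow equals today
    return series.shift(1)

def ema_forecast(series, span=7):
    # Assumed definition: the EMA through day t-1 is the forecast for day t
    return series.ewm(span=span, adjust=False).mean().shift(1)

# Toy check of the metric functions
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])
scores = {"MAE": mae(y_true, y_pred), "RMSE": rmse(y_true, y_pred), "R2": r2(y_true, y_pred)}

# Toy check of the baselines on a short series
history = pd.Series([10.0, 12.0, 11.0, 13.0])
naive = naive_forecast(history)
ema = ema_forecast(history, span=3)
```

Which metric a percentage improvement refers to matters, since the same pair of models gives different relative gains on MAE, RMSE, and R².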

## References
- [(YouTube) Feature Engineering for Time Series Forecasting | PyData London 2022](https://www.youtube.com/watch?v=9QtL7m3YS9I)
- [(Reddit) Timeseries Forecasting with Random Forest regression](https://www.reddit.com/r/datascience/comments/15rhb54/timeseries_forecasting_with_random_forest/)
- [(Reddit) Is tree-based model applicable to time-series data?](https://www.reddit.com/r/datascience/comments/193nwuc/is_treebased_model_applicable_to_timeseries_data/)
- [(Kaggle) Store-Item Demand Forecasting with LGBM](https://www.kaggle.com/code/ekrembayar/store-item-demand-forecasting-with-lgbm)
- [(Kaggle) WalMart M5 Forecasting - Accuracy](https://www.kaggle.com/competitions/m5-forecasting-accuracy)
- [(Kaggle) ML & DL: RFR LR XGB MLP Demand Comparison](https://www.kaggle.com/code/muhammedaliyilmazz/ml-dl-rfr-lr-xgb-mlp-demand-comparison)
- [(Kaggle) Retail Store Inventory and Demand Forecasting](https://www.kaggle.com/datasets/atomicd/retail-store-inventory-and-demand-forecasting)
- [(Kaggle) Product Sales Dataset](https://www.kaggle.com/datasets/yashyennewar/product-sales-dataset-2023-2024?select=product_sales_dataset_final.csv)
- [(GeeksForGeeks) AutoCorrelation](https://www.geeksforgeeks.org/machine-learning/autocorrelation/)
- [(GeeksForGeeks) Autoregressive (AR) Model for Time Series Forecasting](https://www.geeksforgeeks.org/data-analysis/autoregressive-ar-model-for-time-series-forecasting/)
- [(GeeksForGeeks) Time Series Decomposition Techniques](https://www.geeksforgeeks.org/python/time-series-decomposition-techniques/)
- [(Medium) Understanding Autocorrelation and Partial Autocorrelation Functions (ACF and PACF)](https://medium.com/@kis.andras.nandor/understanding-autocorrelation-and-partial-autocorrelation-functions-acf-and-pacf-2998e7e1bcb5)
- [(StackExchange) What does it mean for a time series to be autocorrelated?](https://stats.stackexchange.com/questions/616173/what-does-it-mean-for-a-time-series-to-be-autocorrelated)
- [(Growth-onomics) Handling Missing Data in Time Series: 5 Methods](https://growth-onomics.com/handling-missing-data-in-time-series-5-methods)
- [(Towards Data Science) 5 Must-Know Techniques for Mastering Time-Series Analysis](https://towardsdatascience.com/5-must-know-techniques-for-mastering-time-series-analysis-a23ccf4d053a/)