# Sales Analysis, Forecasting and Demand Planning Project  

## Project Overview  
This project focuses on developing a robust sales forecasting and demand planning system for retail businesses. By analyzing historical sales data, we aim to optimize production, inventory management, and resource planning by accurately predicting future sales trends. Leveraging machine learning and statistical models, this project provides actionable insights to improve operational efficiency, reduce costs, and enhance customer satisfaction.

---

## About the Dataset  

This dataset provides historical sales data for the retail furniture sector, serving as a valuable resource for business analysis. It includes detailed transaction-level information that can be used to understand sales trends, forecast future demand, and optimize inventory. The dataset enables informed decision-making to ensure business stability and growth in the competitive retail environment.

## Data Source

This sales data is available on Kaggle in the following Link,

> https://www.kaggle.com/datasets/tanayatipre/store-sales-forecasting-dataset

### Dataset Features  

| Feature          | Description                                                                          |
|-------------------|--------------------------------------------------------------------------------------|
| `Row ID`         | Sequential identifier for each row.                                                 |
| `Order ID`       | Unique identifier for each sales order.                                             |
| `Order Date`     | Date of the sales order.                                                            |
| `Ship Date`      | Date of shipment for the order.                                                     |
| `Ship Mode`      | Mode of shipment for the order.                                                     |
| `Customer ID`    | Unique identifier for each customer.                                                |
| `Customer Name`  | Name of the customer.                                                               |
| `Segment`        | Segment classification of the customer.                                             |
| `Country`        | Country where the sale occurred.                                                    |
| `City`           | City where the sale occurred.                                                       |
| `State`          | State where the sale occurred.                                                      |
| `Postal Code`    | Postal code where the sale occurred.                                                |
| `Region`         | Geographical region where the sale occurred.                                        |
| `Product ID`     | Unique identifier for each product.                                                 |
| `Category`       | Category classification of the product.                                             |
| `Sub-Category`   | Sub-category classification of the product.                                         |
| `Product Name`   | Name of the product.                                                                |
| `Sales`          | Total sales amount for the order.                                                   |
| `Quantity`       | Quantity of products sold in the order.                                             |
| `Discount`       | Discount applied to the order.                                                      |
| `Profit`         | Profit generated from the order.                                                    |

---

## Business Objectives  

1. **Sales Forecasting:**  
   - Predict sales for the next 30 days for each product category.  
   - Identify and leverage trends and seasonality in sales patterns.  

2. **Demand Planning:**  
   - Determine products or categories likely to experience surges in demand.  
   - Reduce overstocking and understocking through accurate forecasts.  

3. **Optimization:**  
   - Optimize production schedules and inventory management.  
   - Identify periods requiring special promotions to counter seasonal declines.  

---

## Methodology  

### 1. **Data Understanding**  
   - **Data Collection:** Gather historical sales data, pricing, promotions, holidays, and external factors (e.g., weather).  
   - **Exploratory Data Analysis (EDA):** Perform statistical analysis and create visualizations to uncover trends, seasonality, and anomalies.  
   - **Data Quality Assessment:** Identify and address missing, inconsistent, or irrelevant data.  

### 2. **Data Preparation**  
   - **Data Cleaning:** Handle missing values, outliers, and duplicates. Normalize sales data if necessary.  
   - **Feature Engineering:** Create lag variables, rolling averages, seasonal indices, and encode categorical variables for modeling.  
   - **Data Splitting:** Split the dataset into training, validation, and testing sets.  

### 3. **Modeling**  
   - **Baseline Models:** Develop simple models such as moving averages or exponential smoothing for benchmarking.  
   - **Advanced Models:** Train machine learning (e.g., ARIMA, SARIMA, XGBoost, Random Forest) and deep learning models (e.g., LSTM, GRU, Prophet).  
   - **Hyperparameter Optimization:** Fine-tune models to enhance accuracy and efficiency.  

### 4. **Evaluation**  
   - **Evaluation Metrics:** Use RMSE, MAPE, MAE, and R² to assess model performance.  
   - **Visualization:** Plot predicted vs. actual sales to analyze trends and deviations.  
   - **Model Selection:** Choose the best-performing model for deployment.  

---

## Applications  

- **Inventory Management:** Ensure optimal inventory levels, minimizing costs associated with overstocking or stockouts.  
- **Production Planning:** Use forecasts to adjust production schedules based on predicted demand.  
- **Promotional Campaigns:** Identify low-demand periods and design targeted promotions to boost sales.  
- **Revenue Forecasting:** Provide accurate revenue projections to guide financial planning.  

---

## Research Questions  

1. What are the expected sales for the next 30 days for each product category?  
2. Which products or categories show clear trends or seasonal demand patterns?  
3. How can accurate demand forecasts improve inventory management and reduce operational costs?  
4. Which time periods require targeted promotional strategies to mitigate sales dips?  

---

## Results and Insights  

1. **Seasonal Trends:** Sales demonstrate clear peaks during holiday seasons and dips during specific months.  
2. **Top-Selling Products:** Analysis of product categories reveals best-performing items and their contribution to revenue.  
3. **Demand Surges:** Certain products experience predictable spikes in demand, enabling proactive inventory management.  
4. **Model Performance:** LSTM and SARIMA models outperformed baseline methods in forecasting accuracy.  

---

## Conclusion  

This project equips retail businesses with powerful forecasting tools to make data-driven decisions. By understanding historical sales patterns, businesses can optimize inventory, enhance production efficiency, and maximize profitability while maintaining customer satisfaction.


### Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
import warnings

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from keras.models import Sequential
from keras.layers import Input, Dense, Dropout, GRU
from sklearn.metrics import mean_absolute_error, mean_squared_error, mean_absolute_percentage_error

2024-12-06 08:02:36.451168: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


### Settings

In [38]:
# Warnings
warnings.filterwarnings("ignore")

# Plot
sns.set_style("darkgrid")

# Path
data_path = "../data"
model_path = "../models"
# csv_path = os.path.join(data_path, "ssf_monthly.csv")
csv_path = os.path.join(data_path, "ssf_cleaned.csv")

### Load Data

In [39]:
df = pd.read_csv(csv_path)

In [40]:
# Check Data
df.head()

Unnamed: 0,Date,Sales
0,2014-01-06,2573.82
1,2014-01-07,76.728
2,2014-01-08,68.465333
3,2014-01-09,60.202667
4,2014-01-10,51.94


### Define Functions for Data Preprocessing, Model Preparation, Training and Evaluation

In [41]:
# Data Preparation

def prepare_data(data, feature):
    data = data[feature].values.reshape(-1, 1)
    sc = MinMaxScaler(feature_range= (0,1))
    data_scaled = sc.fit_transform(data)
    return data_scaled, sc

# Create sequences
def create_sequence(data, time_steps= 10):
    X, y = [], []
    for i in range(time_steps, len(data)):
        X.append(data[i - time_steps:i, 0])  # Sequence of time_steps
        y.append(data[i, 0])   # Next value after the sequence
    return np.array(X), np.array(y)

# Split the data in train and test set
def train_test_split(X, y, train_split_at= 0.8):
    train_size = int(len(X) * train_split_at)
    X_train, X_test = X[:train_size], X[train_size:]
    y_train, y_test = y[:train_size], y[train_size:]
    return X_train, X_test, y_train, y_test

# Build and compile model
def build_compile_lstm(X):
    # Initialize model
    model = Sequential([
        Input(shape= (X.shape[1], 1)), # Input Layer
        GRU(50, return_sequences= True), # Hidden GRU Layer 1
        Dropout(0.2), # Dropout layer 1
        GRU(50, return_sequences= True), # Hidden GRU Layer 2
        Dropout(0.2), # Dropout layer 2
        GRU(50, return_sequences= True), # Hidden GRU Layer 3
        Dropout(0.2), # Dropout Layer 3
        Dense(1) # Output Layer
    ])
    # Compile the model
    model.compile(optimizer="adam", loss= "mean_squared_error")
    return model

# Train Model
def train_model(model, X_train, y_train, X_test, y_test, epochs= 50, batch_size= 16):
    # Train the model
    model.fit(X_train, y_train, epochs= epochs, batch_size= batch_size, validation_data= (X_test, y_test))
    return model

# Evaluate model
def evaluate_model(model, sc, X_train, y_train, X_test, y_test):
    # Make Predictions on train
    y_pred = model.predict(X_train)
    # print("Scaler input features:", sc.n_features_in_)
    # print("y_pred shape:", y_pred.shape)
    # print("y_test shape:", y_test.shape)
    y_pred_last = y_pred[:, -1, :] # Extract the last shape
    # print("y_pred_2d shape:", y_pred_last.shape)
    y_pred = sc.inverse_transform(y_pred_last)
    y_train = y_train.reshape(-1, 1)  # Shape: (num_samples, 1)
    y_train = sc.inverse_transform(y_train)

    # Filter out the small values to avoid division by small values
    threshold= 1
    valid_indices = y_train > threshold
    y_train_filtered= y_train[valid_indices]
    y_pred_last_filtered = y_pred_last[valid_indices]

    # Evaluate the model
    rmse = np.sqrt(mean_squared_error(y_train_filtered, y_pred_last_filtered))
    mape = mean_absolute_percentage_error(y_train_filtered, y_pred_last_filtered)
    mae = mean_absolute_error(y_train_filtered, y_pred_last_filtered)
    # score = r2_score(y_train_filtered, y_pred_last_filtered)
    # Print metrics
    print(f"RMSE: {rmse: 0.4f}")
    print(f"MAPE: {mape * 100: 0.2f}")
    print(f"MAE: {mae: 0.4f}")
    # print(f"Score: {score: 0.2f}")
    # Make Predictions on test
    y_pred = model.predict(X_test)
    # print("Scaler input features:", sc.n_features_in_)
    # print("y_pred shape:", y_pred.shape)
    # print("y_test shape:", y_test.shape)
    y_pred_last = y_pred[:, -1, :] # Extract the last shape
    # print("y_pred_2d shape:", y_pred_last.shape)
    y_pred = sc.inverse_transform(y_pred_last)
    y_test = y_test.reshape(-1, 1)  # Shape: (num_samples, 1)
    y_test = sc.inverse_transform(y_test)

    # Filter out the small values to avoid division by small values
    threshold= 1
    valid_indices = y_test > threshold
    y_test_filtered= y_test[valid_indices]
    y_pred_last_filtered = y_pred_last[valid_indices]

    # Evaluate the model
    rmse = np.sqrt(mean_squared_error(y_test_filtered, y_pred_last_filtered))
    mape = mean_absolute_percentage_error(y_test_filtered, y_pred_last_filtered)
    mae = mean_absolute_error(y_test_filtered, y_pred_last_filtered)
    # score = r2_score(y_test_filtered, y_pred_last_filtered)
    # Print metrics
    print(f"RMSE: {rmse: 0.4f}")
    print(f"MAPE: {mape * 100: 0.2f}")
    print(f"MAE: {mae: 0.4f}")

### Data Preprocessing

In [42]:
data, scaler = prepare_data(df, "Sales")

In [43]:
# Create Data sequence
X, y = create_sequence(data, 5)
# Sanity check
print(f"Shape of X: {X.shape}") # Expected: (num_samples, time_steps, num_features)
print(f"Shape of y: {y.shape}") # Expexted(num_samples, num_features)

Shape of X: (1450, 5)
Shape of y: (1450,)


In [44]:
# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y)
# Sanity check
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (1160, 5)
Shape of y_train: (1160,)
Shape of X_test: (290, 5)
Shape of y_test: (290,)


### Model Training and Evaluation

In [45]:
# Build the model
model = build_compile_lstm(X_train)

# Print model summary
model.summary()

In [46]:
# Train the model
model = train_model(model, X_train, y_train, X_test, y_test, epochs = 500, batch_size= 4)

Epoch 1/500
[1m290/290[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m7s[0m 9ms/step - loss: 0.0075 - val_loss: 0.0090
Epoch 2/500
[1m290/290[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0080 - val_loss: 0.0093
Epoch 3/500
[1m290/290[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 9ms/step - loss: 0.0098 - val_loss: 0.0094
Epoch 4/500
[1m290/290[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 8ms/step - loss: 0.0086 - val_loss: 0.0094
Epoch 5/500
[1m290/290[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 12ms/step - loss: 0.0070 - val_loss: 0.0090
Epoch 6/500
[1m290/290[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0069 - val_loss: 0.0090
Epoch 7/500
[1m290/290[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 7ms/step - loss: 0.0089 - val_loss: 0.0089
Epoch 8/500
[1m290/290[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - loss: 0.0084 - val_loss: 0.0093
Epoch 9/500
[1m290/290[0m [3

In [47]:
# Evaluate the model
evaluate_model(model, scaler, X_train, y_train, X_test, y_test)

[1m37/37[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 37ms/step
RMSE:  1159.3020
MAPE:  99.94
MAE:  757.2377
[1m10/10[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 6ms/step
RMSE:  1279.9966
MAPE:  99.92
MAE:  857.0465


### Insights

#### High Error Rates (MAPE):

- Both training and testing MAPE are near 100%, which is extremely high. This indicates that the model struggles to accurately capture the proportional changes in the time series.
- The model fails to generalize well to unseen data, suggesting issues with either:
    - Model complexity (insufficient learning capacity for this dataset).
    - Data preprocessing or the inherent nature of the data.

#### Overfitting or Underfitting:

- The difference between training and testing metrics is small, indicating that the model may be underfitting the data rather than overfitting.
- This might occur because the model complexity (50 units per layer) is not sufficient to learn intricate patterns in the sales data.

#### MAE and RMSE Insights:

- The RMSE values **(1159.31 for training, 1280.00 for testing)** show that the error magnitude is substantial in both cases.
- The slightly higher testing RMSE suggests a marginal degradation in performance on unseen data, a typical sign of limited model capacity.

#### Effect of Hyperparameters:

- Despite varying epochs and batch sizes, the high MAPE persists. This suggests that neither increasing training duration nor tuning batch sizes effectively improves the model's performance.
