# **Project Name**    - Integrated Retail Analytics for Store Optimization

##### **Project Type**    - EDA + Regression (Sales Forecasting)
##### **Contribution**    - Individual

# **Project Summary -**

This project focuses on **retail analytics for store optimization** using advanced Machine Learning methods.
The aim is to analyze and predict **weekly sales** across multiple stores and departments by leveraging historical sales,
store-level attributes, and external factors such as holidays, temperature, fuel prices, markdown campaigns, CPI,
and unemployment.

The dataset comprises three integrated parts:

- **Sales dataset**: Weekly sales data per store and department, with holiday indicators.  
- **Stores dataset**: Metadata about each store such as type and size.  
- **Features dataset**: Additional economic and promotional features (fuel price, temperature, markdowns, CPI, etc.).  

The workflow includes:

1. **Data Preparation & Merging**: Combining the three datasets into one master table.  
2. **Exploratory Data Analysis (EDA)**: Understanding sales trends, holiday impacts, and store performance.  
3. **Feature Engineering**: Creating new time-based features, encoding store type, handling missing values.  
4. **Modeling**: Building baseline Linear Regression and advanced Random Forest models to predict weekly sales.  
5. **Evaluation**: Comparing models using RMSE and R².  
6. **Insights**: Identifying factors driving sales (e.g., store type, holidays, markdowns), and generating actionable insights for store optimization.

This project provides both predictive power (forecasting sales) and prescriptive insights (what drives sales),
which can help retailers optimize operations, manage inventory, and improve promotions.



# **GitHub Link -**


https://github.com/asadamaanstat/Integrated-Retail-Analytics-for-Store-Optimization.git

# **Problem Statement**


**Problem Statement**: Build a machine learning model to predict **weekly sales** for different stores and departments, incorporating store-level details, promotional markdowns, holiday effects, and economic indicators. The goal is to provide both accurate forecasts and actionable insights for store optimization.

# ***Let's Begin !***

##  Know Your Data

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

### Dataset Loading

In [None]:
# Load Dataset
sales = pd.read_csv("sales data-set.csv")
stores = pd.read_csv("stores data-set.csv")
features = pd.read_csv("Features data set.csv")


### Dataset First View

In [None]:
# Dataset First Look
display(sales.head())
display(stores.head())
display(features.head())

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("Sales :",sales.shape)
print("Stores :",stores.shape)
print("Features :",features.shape)

### Data Exploration

In [None]:
# Dataset Info
print('Shape:', sales.shape)
print('\nInfo:')
print(sales.info())
print('\nDescribe:')
display(sales.describe())

print('\nHead:')
display(sales.head())

print('\nTail:')
display(sales.tail())

print('Shape:', stores.shape)
print('\nInfo:')
print(stores.info())
print('\nDescribe:')
display(stores.describe())

print('\nHead:')
display(stores.head())

print('\nTail:')
display(stores.tail())

print('Shape:', features.shape)
print('\nInfo:')
print(features.info())
print('\nDescribe:')
display(features.describe())

print('\nHead:')
display(features.head())

print('\nTail:')
display(features.tail())

#### Data Cleaning: Missing Values, Duplicates, Sanity Checks

In [None]:
# Missing values and duplicate rows in the sales table
print('Missing values per column:\n', sales.isnull().sum())
print('\nDuplicate rows:', sales.duplicated().sum())

# Basic sanity check: count rows with negative weekly sales
print('\nNegative Weekly_Sales rows:', (sales['Weekly_Sales'] < 0).sum())

##  Merging the Dataset

In [None]:
# Parse dates, then merge sales with store metadata and weekly features
sales['Date'] = pd.to_datetime(sales['Date'], format="%d/%m/%Y")
features['Date'] = pd.to_datetime(features['Date'], format="%d/%m/%Y")

df = sales.merge(stores, on="Store", how="left")
df = df.merge(features, on=["Store", "Date"], how="left")

df.fillna(0, inplace=True)

df.head()
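
The later plots reference a column named `IsHoliday_x`, which suggests that both the sales and features tables carry an `IsHoliday` column and that pandas added its default suffixes during the merge. The sketch below reproduces that behaviour on two tiny made-up tables (`sales_demo` and `features_demo` are hypothetical stand-ins, not the project data) and shows one possible alternative: adding `IsHoliday` to the join keys so only a single copy survives.

```python
import pandas as pd

# Toy stand-ins for the sales and features tables (values are made up).
sales_demo = pd.DataFrame({
    "Store": [1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12"]),
    "Weekly_Sales": [100.0, 120.0],
    "IsHoliday": [False, True],
})
features_demo = pd.DataFrame({
    "Store": [1, 1],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12"]),
    "Fuel_Price": [2.57, 2.55],
    "IsHoliday": [False, True],
})

# Merging on Store and Date only: the shared IsHoliday column is kept twice,
# with pandas' default suffixes _x (left table) and _y (right table).
merged_suffix = sales_demo.merge(features_demo, on=["Store", "Date"], how="left")
print(merged_suffix.columns.tolist())

# Alternative: include IsHoliday in the join keys so only one copy remains.
merged_single = sales_demo.merge(
    features_demo, on=["Store", "Date", "IsHoliday"], how="left"
)
print(merged_single.columns.tolist())
```

The rest of this notebook keeps the suffixed columns, so `IsHoliday_x` is the holiday flag that came from the sales table.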

### Handling Missing values

In [None]:
print(df['Date'].head(10))
print(df['Date'].dtype)
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df = df.dropna(subset=['Date'])
df = df[df['Date'].dt.year >= 2010]   # Walmart dataset starts from 2010




##  Data Visualization : Understand the relationships between variables

#### Line Graph

In [None]:
# visualization code
sales_trend = df.groupby("Date")["Weekly_Sales"].sum()
sales_trend.plot(figsize=(12,6), marker="o")
plt.title("Total Weekly Sales Over Time", fontsize=14)
plt.xlabel("Date")
plt.ylabel("Total Weekly Sales")
#plt.grid(True)
plt.show()


##### 1. Why did you pick the specific chart?



A line plot suits time series data because it shows how total sales evolve week by week.

It helps detect seasonality, trends, and sudden spikes (such as those around holidays).



##### 2. What is/are the insight(s) found from the chart?

Sales generally increase over time, but there are sharp peaks during certain weeks.

Peaks coincide with holiday weeks (Thanksgiving, Christmas, etc.), showing strong seasonal demand.

Non-holiday weeks show relatively stable baseline sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Allows planning inventory and staff allocation during peak holiday periods.

Enables targeted marketing campaigns to maximize revenue during high-demand periods.

#### Boxplot: Holiday vs Non-Holiday Weeks

In [None]:
# visualization code
sns.boxplot(x="IsHoliday_x", y="Weekly_Sales", data=df)
plt.title("Holiday Impact on Weekly Sales")
plt.show()

##### 1. Why did you pick the specific chart?

Boxplots compare distributions between holiday and non-holiday weeks.

Useful to see variability and outliers in sales.

##### 2. What is/are the insight(s) found from the chart?

Weekly sales during holiday weeks tend to be higher than in non-holiday weeks, although the boxplot alone does not establish statistical significance.

Some holidays may show extremely high outliers, indicating exceptional sales events.
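
To put numbers next to the boxplot, the two groups can be summarised directly. The following is a minimal sketch on made-up values (the `demo` frame is hypothetical); in the notebook the same `groupby` could be applied to `df`, assuming the holiday flag is still named `IsHoliday_x`.

```python
import pandas as pd

# Made-up weekly sales for a handful of holiday and non-holiday weeks.
demo = pd.DataFrame({
    "IsHoliday_x": [False, False, False, True, True],
    "Weekly_Sales": [100.0, 110.0, 90.0, 150.0, 170.0],
})

# Count, mean, and median of weekly sales for each group.
summary = demo.groupby("IsHoliday_x")["Weekly_Sales"].agg(["count", "mean", "median"])
print(summary)
```

Comparing the median as well as the mean is useful here because a few extreme holiday weeks can pull the mean upward.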

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Highlights the importance of holidays in driving revenue.

Supports planning promotions and stock ahead of holidays for maximum sales.

Over-reliance on holiday weeks may cause underperformance in non-holiday periods.

Without proper marketing during regular weeks, the business may not grow consistently.

#### Boxplot: Store Type

In [None]:
sns.boxplot(x="Type", y="Weekly_Sales", data=df)
plt.title("Store Type vs Weekly Sales")
plt.show()

##### 2. What is/are the insight(s) found from the chart?

Type A stores have higher median weekly sales than Type B or C.

Type C stores show higher variability, suggesting inconsistent performance.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Helps allocate marketing budget or inventory strategically, prioritizing high-performing stores.

Decisions about expansion or remodeling can be guided by store type performance.

Type C stores may be underperforming due to location, size, or management issues.

Ignoring these insights may waste resources on low-performing stores.

#### Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,6))
sns.heatmap(df.select_dtypes(include=['int64','float64']).corr(), cmap="coolwarm", annot=False)
plt.title("Correlation Heatmap")
plt.show()


##### 1. Why did you pick the specific chart?

Heatmap shows relationships between multiple numerical variables.

Helps in feature selection for modeling weekly sales.

##### 2. What is/are the insight(s) found from the chart?

Markdowns (MarkDown1–5) show positive correlation with Weekly_Sales.

Fuel price and CPI show a slight negative correlation with Weekly_Sales, suggesting that higher costs may reduce sales.

Unemployment has a weak negative correlation with sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Identifies key drivers of sales, e.g., markdown promotions can effectively boost revenue.

External factors (CPI, Fuel Price) can guide strategic pricing or regional marketing campaigns.

Ignoring correlations like markdowns vs sales may lead to inefficient promotions.

Rising CPI or unemployment can reduce customer spending, so proactive strategies are needed.
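
Because the heatmap is drawn without annotations, the sign and size of each correlation are hard to read off. One way to check the statements above is to list the correlations with the target directly. The sketch below uses a small made-up table (`demo` and its values are hypothetical); the same pattern should work on `df` in the notebook.

```python
import pandas as pd

# Made-up data: sales rise with MarkDown1 and fall with Unemployment.
demo = pd.DataFrame({
    "Weekly_Sales": [10.0, 12.0, 15.0, 17.0, 20.0, 22.0],
    "MarkDown1": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    "Unemployment": [9.0, 8.0, 8.0, 7.0, 6.0, 6.0],
    "Type": ["A", "A", "B", "B", "C", "C"],
})

# Keep numeric columns only, then rank features by correlation with the target.
numeric = demo.select_dtypes(include="number")
target_corr = (
    numeric.corr()["Weekly_Sales"]
    .drop("Weekly_Sales")
    .sort_values(ascending=False)
)
print(target_corr)
```

These are Pearson correlations, so they capture only linear association and do not by themselves show that a feature causes sales to change.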

## Feature Engineering & Data Pre-processing

###  Categorical Encoding

In [None]:
# Extract time-based features from the Date column
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month
df["Week"] = df["Date"].dt.isocalendar().week.astype(int)  # cast nullable UInt32 to a plain integer

### Creating Dummy Variables

In [None]:
df = pd.get_dummies(df, columns=["Type"], drop_first=True)
df.head()

#### What all categorical encoding techniques have you used & why did you use those techniques?

We used one-hot encoding for the Store Type variable since it’s nominal and has no natural order. This prevents the model from assuming any ranking between categories. The IsHoliday column was already binary, so no additional encoding was needed. For the Date variable, we extracted Year, Month, and Week to capture seasonality and trends. We avoided label encoding because it imposes a false order on categorical values.
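
The difference between the two encodings can be seen on a tiny example. This is an illustrative sketch on a made-up `types` frame, not the project data: one-hot encoding with `drop_first=True` produces indicator columns with `A` as the implicit baseline, while integer label codes introduce an ordering `A < B < C` that a linear model would treat as meaningful.

```python
import pandas as pd

# Made-up store types.
types = pd.DataFrame({"Type": ["A", "B", "C", "A"]})

# One-hot encoding: Type A becomes the baseline (all indicators are 0/False).
one_hot = pd.get_dummies(types, columns=["Type"], drop_first=True)
print(one_hot)

# Label encoding for comparison: categories are mapped to ordered integers.
label_codes = types["Type"].astype("category").cat.codes
print(label_codes.tolist())
```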

### Splitting of Dataset

In [None]:
X = df.drop(columns=["Weekly_Sales", "Date"])
y = df["Weekly_Sales"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
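
The split above is random, so the test set contains weeks that fall between training weeks. For a forecasting task, a chronological split is a common alternative because it evaluates the model only on weeks later than those it was trained on. The sketch below shows the idea on a made-up frame (`demo` is hypothetical); it was not used to produce the results reported in this notebook.

```python
import pandas as pd

# Made-up weekly data: ten consecutive weeks.
demo = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=10, freq="7D"),
    "Weekly_Sales": range(10),
})

# Use the earliest 80% of distinct dates for training and the rest for testing.
unique_dates = demo["Date"].sort_values().unique()
cutoff = unique_dates[int(len(unique_dates) * 0.8)]

train = demo[demo["Date"] < cutoff]
test = demo[demo["Date"] >= cutoff]
print(len(train), len(test))
```

Splitting on distinct dates keeps all stores and departments for a given week on the same side of the cutoff.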

## ***ML Model Implementation***

## Linear Regression

In [None]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)
print("Linear Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))
print("Linear Regression R²:", r2_score(y_test, y_pred_lr))


##### Which hyperparameter optimization technique have you used and why?

For the Linear Regression model, I directly used scikit-learn’s default implementation without hyperparameter optimization, since it has no significant tunable parameters. The focus was on preparing clean features and encoding categorical variables properly to improve model performance.

## Random Forest

In [None]:
rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("Random Forest R²:", r2_score(y_test, y_pred_rf))


##### Which hyperparameter optimization technique have you used and why?

In this project, I used the default hyperparameters of RandomForestRegressor
with `n_estimators=100` and `random_state=42`.  
I did not apply an explicit hyperparameter optimization technique such as GridSearchCV
or RandomizedSearchCV. The focus was on building a baseline model and comparing it
with Linear Regression to evaluate performance improvements.

As a next step, hyperparameter optimization (e.g., using GridSearchCV or RandomizedSearchCV)
can be applied to further improve model accuracy.
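
As an illustration of that next step, here is a minimal sketch of `RandomizedSearchCV` on synthetic data. The search space, the number of iterations, and the data (`X_demo`, `y_demo`) are assumptions chosen to keep the example fast; they are not tuned values for this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for the real feature matrix.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

# Hypothetical search space; the ranges are illustrative only.
param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_distributions,
    n_iter=4,                               # number of sampled configurations
    cv=3,                                   # 3-fold cross-validation
    scoring="neg_root_mean_squared_error",  # higher is better, so RMSE is negated
    random_state=42,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print("Cross-validated RMSE:", -search.best_score_)
```

On the full dataset a search like this can be slow, so a small `n_iter` or a subsample of the training rows may be a reasonable starting point.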


### Top 15 Feature Importances (Bar Chart from Random Forest)

In [None]:
# Feature importance
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
plt.figure(figsize=(10,6))
importances.head(15).plot(kind='bar')
plt.title("Top 15 Feature Importances")
plt.show()

##### 1. Why did you pick the specific chart?

Bar chart clearly shows which features have the most impact on predicting sales.

Essential for prioritizing business strategies.

##### 2. What is/are the insight(s) found from the chart?

Most important features: IsHoliday, Type, Size, MarkDown1, Fuel Price.

Holidays and markdowns are strong revenue drivers, while store characteristics affect baseline sales.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Allows data-driven decision making: focus on top drivers for promotions, staffing, and inventory.

Optimizing based on top features can maximize ROI.

Ignoring less important features like CPI or Unemployment may be risky in economic downturns.

Relying solely on top features may cause over-optimization without considering other external factors.

### Model Performance Comparison

We evaluated both **Linear Regression** and **Random Forest** models.  
The results clearly show that Random Forest outperforms Linear Regression, as it is able to capture non-linear relationships and interactions between features more effectively.  

**Improvement Observed:**
- **Lower RMSE** with Random Forest → indicating smaller prediction errors.  
- **Higher R² Score** with Random Forest → showing better explanatory power.  

| Model             | RMSE  | R² Score |
|-------------------|-------|----------|
| Linear Regression | 22607 | 0.13     |
| Random Forest     | 3868  | 0.97     |

The improvement in metrics suggests that ensemble-based approaches are more suitable for this dataset compared to a simple linear model.
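
For reference, both metrics in the table can be computed by hand. The sketch below uses four made-up predictions (not the model outputs above) to show how RMSE and R² are derived from the residuals.

```python
import numpy as np

# Made-up actual and predicted values.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

residuals = y_true - y_pred

# RMSE: square root of the mean squared residual, in the units of the target.
rmse = np.sqrt(np.mean(residuals ** 2))

# R²: one minus the ratio of residual variation to total variation around the mean.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

RMSE is expressed in the same units as `Weekly_Sales`, while R² is unitless, which is why the two are reported together.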

# **Conclusion**


Random Forest outperforms Linear Regression in predicting weekly sales.

Store Type, Size, Holidays, and Markdown campaigns strongly influence sales.

External factors such as CPI, Unemployment, and Fuel Prices also contribute.

These insights can help optimize inventory planning, promotional strategies, and store-level operations.


