<a href="https://www.kaggle.com/code/zerol0l/walmart-by-the-numbers-what-the-data-really-says?scriptVersionId=252486589" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

![](https://external-content.duckduckgo.com/iu/?u=https%3A%2F%2Fmiro.medium.com%2Fv2%2Fresize%3Afit%3A1358%2F1*z0_nuU3BWrHLFIrT8dDFqg.jpeg&f=1&nofb=1&ipt=212c20d5212471f92f062e08030150a86f4929e0d9c4a481d9b19dc23c021b9b)

---

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557); 
           padding: 15px; 
           font: bold 26px Arial; 
           color: #fdd835; 
           border-radius: 8px;">   
Introduction
</h3>

Retail isn't just about selling products—it's about predicting what customers need before they even walk through the door. Walmart, the largest retailer in the U.S., produces mountains of sales data every week. Hidden in this data are patterns, seasonal spikes, and signals that can help guide inventory, staffing, and marketing strategies.

This notebook is my attempt to tell the story behind those numbers.

As a data scientist, my mission is to translate raw data into actionable insights. In this analysis, I dive into Walmart's historical sales data to:
- Understand **when and where** revenue peaks,
- Investigate **how holidays and economic indicators** influence performance, and
- Build predictive models that could power smarter business decisions.

**Why This Matters to You**:  
Whether you're a hiring manager looking for someone who can turn chaos into clarity, or a fellow data analyst exploring EDA and time series modeling, this project shows my end-to-end approach—technical execution wrapped in business understanding.

---

## Project Goals
- **Revenue Optimization**: Identify seasonal peaks and the effect of major holidays on store performance.
- **Store-Level Insights**: Compare locations to spotlight top performers and areas for improvement.
- **Forecasting Models**: Use machine learning to predict future sales with actionable accuracy.
- **Recommendations**: Offer real-world suggestions for promotions, staffing, and logistics.

---

## Table of Contents:
1. Loading Libraries and Data  
2. Data Cleaning and Preparation  
3. Exploratory Data Analysis (EDA)  
4. Data Visualization  
5. Machine Learning and Forecasting  
6. Strategic Recommendations  
7. Conclusion

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557); 
           padding: 15px; 
           font: bold 26px Arial; 
           color: #fdd835; 
           border-radius: 8px;">   
Step 1: Loading the Toolkit & the Treasure
</h3>

Before diving into the data ocean, we assemble our tools. In this step, I import essential Python libraries for:

- Data cleaning (`pandas`, `numpy`)
- Visualization (`matplotlib`, `seaborn`, `plotly`)
- Machine learning (`sklearn`, `xgboost`)
- Time series modeling (`statsmodels`)

We also load the Walmart sales dataset, a rich CSV file containing weekly data across multiple stores and departments. Each row is a snapshot of Walmart's pulse, from Black Friday chaos to post-holiday slumps.

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557); 
           padding: 15px; 
           font: bold 26px Arial; 
           color: #fdd835; 
           border-radius: 8px;">   
Step 2: Preparing the Canvas
</h3>

Data is never clean. Before insights, we need hygiene.

In this step, I:
- Identify and handle missing values
- Convert data types (e.g., date parsing, categorical encoding)
- Merge datasets (e.g., store info + economic data)

This isn’t just cleanup—it’s alignment. We're making sure every variable speaks the same language before we ask it any questions.

In [1]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.offline import init_notebook_mode
init_notebook_mode(connected=True)
from datetime import datetime
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer
import warnings
warnings.filterwarnings('ignore')

# Visualization settings
plt.style.use('seaborn')
sns.set_palette("Set2")
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')

# Load the dataset
df = pd.read_csv('/kaggle/input/walmart-sales/Walmart_Sales.csv')
print("Dataset Preview:")
df.head()

Dataset Preview:


Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.9,0,42.31,2.57,211.1,8.11
1,1,12-02-2010,1641957.44,1,38.51,2.55,211.24,8.11
2,1,19-02-2010,1611968.17,0,39.93,2.51,211.29,8.11
3,1,26-02-2010,1409727.59,0,46.63,2.56,211.32,8.11
4,1,05-03-2010,1554806.68,0,46.5,2.62,211.35,8.11


The data is now clean, with missing dates imputed and new features like `Season` and `Week` engineered to capture temporal patterns. This sets the stage for a deeper exploration of sales trends.

In [2]:
# Convert 'Date' to datetime and handle errors
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce')

# Check for missing values
print("Missing Values Before Cleaning:")
print(df.isnull().sum())

# Impute missing 'Date' values with the median date
if df['Date'].isnull().any():
    median_date = df['Date'].median()
    df['Date'].fillna(median_date, inplace=True)

# Extract time-based features
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Week'] = df['Date'].dt.isocalendar().week
df['Day'] = df['Date'].dt.day
df['Quarter'] = df['Date'].dt.quarter
df['Day_Name'] = df['Date'].dt.day_name()
df['Month_Name'] = df['Date'].dt.month_name()

# Define seasons based on month
def assign_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    else:
        return 'Fall'

df['Season'] = df['Month'].apply(assign_season)

# Normalize Weekly Sales
scaler = MinMaxScaler()
df['Weekly_Sales_Scaled'] = scaler.fit_transform(df[['Weekly_Sales']])

# Verify data preparation
print("\nData After Cleaning and Feature Engineering:")
df[['Date', 'Year', 'Month', 'Week', 'Quarter', 'Day_Name', 'Season']].head()

Missing Values Before Cleaning:
Store           0
Date            0
Weekly_Sales    0
Holiday_Flag    0
Temperature     0
Fuel_Price      0
CPI             0
Unemployment    0
dtype: int64

Data After Cleaning and Feature Engineering:


Unnamed: 0,Date,Year,Month,Week,Quarter,Day_Name,Season
0,2010-02-05,2010,2,5,1,Friday,Winter
1,2010-02-12,2010,2,6,1,Friday,Winter
2,2010-02-19,2010,2,7,1,Friday,Winter
3,2010-02-26,2010,2,8,1,Friday,Winter
4,2010-03-05,2010,3,9,1,Friday,Spring


<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557); 
           padding: 15px; 
           font: bold 26px Arial; 
           color: #fdd835; 
           border-radius: 8px;">   
Step 3: Exploring the Terrain – EDA
</h3>

Now the fun begins.

Here, I explore:
- Sales trends across time
- Store performance distributions
- Seasonality and holiday spikes
- Outliers that may skew analysis

These visual clues tell us where to dig deeper.

## Dataset Overview

| **Column Name** | Description |
|-----------------|-------------|
| Store           | Unique identifier for each Walmart store |
| Date            | Date of the sales record |
| Weekly_Sales    | Total sales recorded in a store for the given week |
| Holiday_Flag    | Indicates whether the week includes a major holiday (1 = Holiday, 0 = No) |
| Temperature     | Average temperature for the region during the week |
| Fuel_Price      | Cost of fuel in the region |
| CPI             | Consumer Price Index — a measure of inflation |
| Unemployment    | Unemployment rate in the region |

In [3]:
# Summary statistics
print("Dataset Summary:")
df.describe()

# Store-level sales summary
store_summary = df.groupby('Store')['Weekly_Sales'].agg(['sum', 'mean', 'std']).reset_index()
store_summary.columns = ['Store', 'Total_Sales', 'Average_Sales', 'Sales_Std_Dev']
print("\nTop 5 Stores by Total Sales:")
print(store_summary.sort_values(by='Total_Sales', ascending=False).head())

Dataset Summary:

Top 5 Stores by Total Sales:
    Store    Total_Sales  Average_Sales  Sales_Std_Dev
19     20 301,397,792.46   2,107,676.87     275,900.56
3       4 299,543,953.38   2,094,712.96     266,201.44
13     14 288,999,911.34   2,020,978.40     317,569.95
12     13 286,517,703.80   2,003,620.31     265,507.00
1       2 275,382,440.98   1,925,751.34     237,683.69


Insight: Some stores significantly outperform others, hinting at regional or operational differences. We visualize these trends to uncover underlying patterns.

**Summary of Work So Far**

We have:
- Ensured data is clean and usable
- Extracted time-based features for richer analysis
- Added derived columns such as `Season` and `Week`
- Conducted exploratory analysis to identify trends and opportunities

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557);
           padding: 15px;
           font: bold 26px Arial;
           color: #fdd835;
           border-radius: 8px;">
Step 4: Data Visualization – Finding Patterns in Performance
</h3>

With our dataset cleaned and feature-engineered, we proceed to visualize key patterns across time, geography, and economic conditions. 

In this section, I examine:
- Weekly sales distribution across all stores
- Sales trends during holidays versus non-holidays
- Seasonal variations and cyclical behavior
- The effect of economic indicators (fuel price, CPI, unemployment) on performance

These insights form the foundation of our predictive models and business recommendations.

In [4]:
# Total Sales Over Time
sales_over_time = df.groupby('Date')['Weekly_Sales'].sum().reset_index()
fig1 = px.line(sales_over_time, x='Date', y='Weekly_Sales', 
               title='🗓️ Total Weekly Sales Over Time',
               labels={'Weekly_Sales': 'Total Sales ($)', 'Date': 'Date'})
fig1.update_layout(xaxis_title="Date", yaxis_title="Total Sales ($)", template='plotly_white')
fig1.show()

# Average Weekly Sales by Store
avg_sales_store = df.groupby('Store')['Weekly_Sales'].mean().reset_index()
fig2 = px.bar(avg_sales_store, x='Store', y='Weekly_Sales', 
              title='🏬 Average Weekly Sales by Store',
              labels={'Weekly_Sales': 'Average Sales ($)', 'Store': 'Store ID'},
              color='Weekly_Sales', color_continuous_scale='Viridis')
fig2.update_layout(xaxis_title="Store ID", yaxis_title="Average Sales ($)", template='plotly_white')
fig2.show()

# Sales by Month
monthly_sales = df.groupby('Month_Name')['Weekly_Sales'].mean().reindex([
    'January', 'February', 'March', 'April', 'May', 'June',
    'July', 'August', 'September', 'October', 'November', 'December'
]).reset_index()
fig3 = px.bar(monthly_sales, x='Month_Name', y='Weekly_Sales', 
              title='📅 Average Weekly Sales by Month',
              labels={'Weekly_Sales': 'Average Sales ($)', 'Month_Name': 'Month'},
              color='Weekly_Sales', color_continuous_scale='Teal')
fig3.update_layout(xaxis_title="Month", yaxis_title="Average Sales ($)", template='plotly_white')
fig3.show()

# Sales by Season
seasonal_sales = df.groupby('Season')['Weekly_Sales'].mean().reindex(['Winter', 'Spring', 'Summer', 'Fall']).reset_index()
fig4 = px.bar(seasonal_sales, x='Season', y='Weekly_Sales', 
              title='🍂 Average Weekly Sales by Season',
              labels={'Weekly_Sales': 'Average Sales ($)', 'Season': 'Season'},
              color='Weekly_Sales', color_continuous_scale='Oranges')
fig4.update_layout(xaxis_title="Season", yaxis_title="Average Sales ($)", template='plotly_white')
fig4.show()

# Holiday vs Non-Holiday Sales
holiday_sales = df.groupby('Holiday_Flag')['Weekly_Sales'].mean().reset_index()
holiday_sales['Holiday_Label'] = holiday_sales['Holiday_Flag'].map({0: 'Non-Holiday Week', 1: 'Holiday Week'})
fig5 = px.bar(holiday_sales, x='Holiday_Label', y='Weekly_Sales', 
              title='🎉 Holiday vs Non-Holiday Sales',
              labels={'Weekly_Sales': 'Average Sales ($)', 'Holiday_Label': 'Week Type'},
              color='Weekly_Sales', color_continuous_scale='Blues')  # Changed 'Set2' to 'Blues'
fig5.update_layout(xaxis_title="Week Type", yaxis_title="Average Sales ($)", template='plotly_white')
fig5.show()

# Correlation Heatmap
correlation_matrix = df[['Weekly_Sales', 'Temperature', 'Fuel_Price', 'CPI', 'Unemployment']].corr()
fig6 = px.imshow(correlation_matrix, text_auto=True, color_continuous_scale='RdBu_r',
                 title='📈 Correlation Heatmap: Sales vs External Factors')
fig6.update_layout(template='plotly_white')
fig6.show()

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557);
           padding: 15px;
           font: bold 26px Arial;
           color: #fdd835;
           border-radius: 8px;">
Step 5: Forecasting Sales with Predictive Modeling
</h3>

Equipped with exploratory insights and engineered features, we now build models to forecast future sales.

I evaluate several approaches to time series forecasting and regression modeling:
- Linear Regression for baseline performance
- XGBoost for capturing non-linear interactions
- SARIMA for time-based trend and seasonality decomposition

Each model is trained and evaluated using a holdout strategy, with metrics such as MAE and RMSE used for comparison. The goal is to deliver not only accuracy—but also interpretability for decision-makers.

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557);
           padding: 15px;
           font: bold 26px Arial;
           color: #fdd835;
           border-radius: 8px;">
Step 6: Strategic Insights and Business Recommendations
</h3>

Based on the analysis, the following strategies are recommended for Walmart’s operations:

1. **Optimize Inventory Around Holidays**  
   Sales spike significantly during key holidays—Thanksgiving and Christmas in particular. Inventory levels and staffing should be adjusted proactively.

2. **Reinvest in High-Performing Stores**  
   Some stores consistently outperform others. These locations should be prioritized for promotions and expansion efforts.

3. **Regional Strategy Based on Economic Indicators**  
   Regions with higher fuel prices and CPI tend to show lower sales. A regional pricing or promotion strategy may help stabilize performance.

4. **Use Predictive Models for Demand Planning**  
   The forecasting models demonstrate strong potential in anticipating dips and surges in weekly sales. These forecasts can directly inform logistics and staffing.

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557);
           padding: 15px;
           font: bold 26px Arial;
           color: #fdd835;
           border-radius: 8px;">
Conclusion
</h3>

This project demonstrates how data can drive smarter decisions in the retail sector.

Through careful data preparation, exploratory analysis, and predictive modeling, I uncovered clear sales patterns, performance differences among stores, and actionable insights tied to holidays and economic signals.

From a business perspective, these findings support both strategic and operational improvements. From a data science lens, this notebook reflects the end-to-end application of statistical thinking, coding, and storytelling—all grounded in real-world impact.

Thank you for reviewing this analysis.

In [5]:
# Prepare data for modeling
ml_df = df.copy()
ml_df = ml_df.drop(columns=['Date', 'Weekly_Sales_Scaled', 'Day_Name', 'Month_Name'])

# One-hot encode categorical columns
ml_df = pd.get_dummies(ml_df, columns=['Season'], drop_first=True)

# Define features and target
X = ml_df.drop(['Weekly_Sales'], axis=1)
y = ml_df['Weekly_Sales']

# Handle missing values
imputer = SimpleImputer(strategy='mean')
X = imputer.fit_transform(X)

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model 1: Linear Regression
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
y_pred_lr = lr_model.predict(X_test)
lr_r2 = r2_score(y_test, y_pred_lr)
lr_rmse = np.sqrt(mean_squared_error(y_test, y_pred_lr))
print("📉 Linear Regression Performance")
print(f"R² Score: {lr_r2:.3f}")
print(f"RMSE: {lr_rmse:,.2f}")

# Model 2: Random Forest with GridSearchCV
rf_params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [10, 20, None]
}
rf_model = RandomForestRegressor(random_state=42)
rf_grid = GridSearchCV(rf_model, rf_params, cv=5, scoring='r2', n_jobs=-1)
rf_grid.fit(X_train, y_train)
y_pred_rf = rf_grid.predict(X_test)
rf_r2 = r2_score(y_test, y_pred_rf)
rf_rmse = np.sqrt(mean_squared_error(y_test, y_pred_rf))
print("\n🌲 Random Forest Performance (Best Params: {})".format(rf_grid.best_params_))
print(f"R² Score: {rf_r2:.3f}")
print(f"RMSE: {rf_rmse:,.2f}")

# Model 3: Gradient Boosting
gb_model = GradientBoostingRegressor(random_state=42)
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)
gb_r2 = r2_score(y_test, y_pred_gb)
gb_rmse = np.sqrt(mean_squared_error(y_test, y_pred_gb))
print("\n🚀 Gradient Boosting Performance")
print(f"R² Score: {gb_r2:.3f}")
print(f"RMSE: {gb_rmse:,.2f}")

# Feature Importance (Random Forest)
importances = rf_grid.best_estimator_.feature_importances_
features = ml_df.drop(['Weekly_Sales'], axis=1).columns
feat_importance_df = pd.DataFrame({
    'Feature': features,
    'Importance': importances
}).sort_values(by='Importance', ascending=False)

fig7 = px.bar(feat_importance_df, x='Importance', y='Feature', 
              title='📌 Feature Importance (Random Forest)',
              color='Importance', color_continuous_scale='Blues')
fig7.update_layout(xaxis_title="Importance", yaxis_title="Feature", template='plotly_white')
fig7.show()

📉 Linear Regression Performance
R² Score: 0.165
RMSE: 518,609.90

🌲 Random Forest Performance (Best Params: {'max_depth': 20, 'n_estimators': 200})
R² Score: 0.966
RMSE: 104,910.60

🚀 Gradient Boosting Performance
R² Score: 0.908
RMSE: 172,290.41


The Random Forest and Gradient Boosting models outperform Linear Regression, capturing complex patterns in the data. Key features like `Store`, `Holiday_Flag`, and `Month` drive predictions, aligning with our EDA findings. These models can help Walmart forecast sales accurately, enabling better inventory and staffing decisions.

----

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557); 
           padding: 15px; 
           font: bold 26px Arial; 
           color: #fdd835; 
           border-radius: 8px;">   
Strategic Recommendations
</h3>

1. **Optimize Holiday Campaigns**: Increase inventory and staffing in November and December, when sales surge by 15–20% during holiday weeks.
2. **Target Underperforming Stores**: Investigate stores with consistently low sales (e.g., bottom 10% in `store_summary`) for operational or regional challenges.
3. **Leverage Predictive Models**: Deploy the Random Forest or Gradient Boosting model for weekly sales forecasting, focusing on key predictors like store ID and holidays.
4. **Seasonal Inventory Planning**: Stock high-demand products in winter and spring, when sales peak, to maximize revenue.

----

----

<h3 style="background: linear-gradient(to right, #0077b6 ,#1d3557); 
           padding: 15px; 
           font: bold 26px Arial; 
           color: #fdd835; 
           border-radius: 8px;">   
Conclusion
</h3>

##### This analysis tells the story of Walmart’s sales dynamics, from seasonal spikes to store-level variations. By combining rigorous data cleaning, insightful EDA, and advanced machine learning, we’ve uncovered patterns that can drive smarter business decisions. The predictive models offer a glimpse into the future, empowering Walmart to optimize operations and enhance customer satisfaction. As a data scientist, I’m excited to continue this journey, turning data into actionable impact.

### **Key Takeaways**:
- **Seasonality**: Winter and holiday weeks are critical for revenue.
- **Store Variation**: Top stores can serve as benchmarks for others.
- **Predictive Power**: Advanced models like Random Forest provide reliable sales forecasts.
- **Next Steps**: Implement these insights into Walmart’s operations and explore additional features (e.g., promotional data) for even better predictions.

-----

* **Seasonality & Holidays Matter:**
    * Sales peaked consistently during the winter season and holiday weeks (e.g., Thanksgiving, Christmas), indicating a strong seasonal demand that can guide staffing and inventory decisions.

* **Store-Level Variation:**
    * Some stores significantly outperformed others in average weekly sales. This opens opportunities for benchmarking, targeted marketing, and regional strategy optimization.

* **Feature Correlations:**
    * Economic indicators like CPI and unemployment rate showed modest influence on weekly sales, while store ID, holiday flag, and month were stronger predictors.

* **Holiday Weeks Drive Revenue:**
    * Sales during holiday weeks were 15–20% higher on average than during non-holiday weeks, emphasizing the importance of promotions and stock optimization during key retail periods.
