Objective: To develop a machine learning-based sales forecasting system for Walmart that accurately predicts future store-level sales for the next one week and the next one month.

In [4]:
#import the data 
import numpy as np 
import pandas as pd
df = pd.read_csv(r"Walmart DataSet.csv")
df.head()

Unnamed: 0,Store,Date,Weekly_Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment
0,1,05-02-2010,1643690.9,0,42.31,2.572,211.096358,8.106
1,1,12-02-2010,1641957.44,1,38.51,2.548,211.24217,8.106
2,1,19-02-2010,1611968.17,0,39.93,2.514,211.289143,8.106
3,1,26-02-2010,1409727.59,0,46.63,2.561,211.319643,8.106
4,1,05-03-2010,1554806.68,0,46.5,2.625,211.350143,8.106


In [None]:
#check the number of rows and columns in the dataset
df.shape

In [None]:
#checks count and data type of the dataset
df.info()

In [None]:
#display statistics of the dataset
df.describe()

In [None]:
#to check total null values in the dataset
df.isnull().sum()

In [None]:
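#convert date column from object to date type
#the dates are day-first (DD-MM-YYYY), so the format is given explicitly to avoid swapping day and month
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce')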
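
The sample rows above show day-first dates such as 19-02-2010, so it may be worth checking how they are parsed. A minimal sketch, using a small hand-made series rather than the real file, of how an explicit format keeps day and month from being swapped:

```python
import pandas as pd

# hand-made sample in the same DD-MM-YYYY layout as the Date column
sample = pd.Series(["05-02-2010", "12-02-2010", "19-02-2010"])

# an explicit format removes any day/month ambiguity
parsed = pd.to_datetime(sample, format="%d-%m-%Y")

print(parsed.dt.month.tolist())  # every date is in February
print(parsed.dt.day.tolist())
```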

In [None]:
#split the date column to year, month and week
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Week'] = df['Date'].dt.isocalendar().week

In [None]:
df

In [None]:
df.isnull().sum()

In [None]:
#to ignore the future warnings
import warnings
warnings.filterwarnings("ignore")

In [None]:
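#replace missing values with the previous row's value (forward fill)
#df.ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions
df.ffill(inplace=True)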

Exploratory Data Analysis

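The plots below use Matplotlib and Seaborn to visualise sales over time, by holiday flag, and by store, and to show how sales correlate with external factors.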

In [None]:
#weekly sales
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12,5))
sns.lineplot(data=df, x='Date', y='Weekly_Sales')
plt.title("Weekly Sales")
plt.show()

In [None]:
#monthly sales
monthly_sales = df.groupby(['Year','Month'])['Weekly_Sales'].sum().reset_index()
plt.figure(figsize=(10,5))
sns.lineplot(data=monthly_sales, x='Month', y='Weekly_Sales', hue='Year')
plt.title("Monthly Sales")
plt.show()

In [None]:
#impact of holidays on sales
plt.figure(figsize=(6,4))
sns.boxplot(data=df, x='Holiday_Flag', y='Weekly_Sales')
plt.title("Holiday vs Non-Holiday Sales")
plt.show()

In [None]:
#store-wise sales comparison
store_sales = df.groupby('Store')['Weekly_Sales'].sum().reset_index()
plt.figure(figsize=(12,5))
sns.barplot(data=store_sales, x='Store', y='Weekly_Sales')
plt.title("Total Sales by Store")
plt.show()

In [None]:
#correlation analysis between sales and external factors
#numeric_only=True restricts the correlation to numeric columns
df1=df.corr(numeric_only=True)
plt.figure(figsize=(10,6))
sns.heatmap(df1, annot=True)
plt.title("Correlation Matrix")
plt.show()

Feature Engineering

In [None]:
#select input features and target column
input_features = ['Store', 'Holiday_Flag', 'Temperature', 'Fuel_Price','CPI', 'Unemployment', 'Year', 'Month', 'Week']
target = 'Weekly_Sales'
x = df[input_features]
y = df[target]
x.head()

In [None]:
#to check top elements in y
y.head()

In [None]:
#scale the input features
from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
x_scaled=scaler.fit_transform(x)
x_scaled

In [None]:
x_scaled=pd.DataFrame(x_scaled, columns=x.columns)
x_scaled.head()
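
Min-max scaling maps each feature to the range 0 to 1 using (x - min) / (max - min). A minimal sketch on a made-up single-column array (the values are illustrative, not taken from the dataset) that compares the manual formula with `MinMaxScaler`:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# toy single-column feature, e.g. temperatures (made-up values)
values = np.array([[40.0], [50.0], [60.0], [80.0]])

# manual min-max scaling: (x - min) / (max - min), computed per column
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual = (values - col_min) / (col_max - col_min)

# the same transformation from scikit-learn
scaled = MinMaxScaler().fit_transform(values)

print(manual.ravel())
print(np.allclose(manual, scaled))
```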

In [None]:
#split into training and testing
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,test_size=0.25,random_state=0)
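
Because the goal is to forecast future weeks, a chronological hold-out is one alternative to a random split: train on earlier weeks and test on the latest ones. A minimal sketch on a small synthetic frame (the `toy` data and the 75% cut-off are assumptions for illustration, not the notebook's method):

```python
import pandas as pd

# synthetic weekly data for illustration only
toy = pd.DataFrame({
    "Date": pd.date_range("2010-02-05", periods=8, freq="7D"),
    "Weekly_Sales": [10.0, 12.0, 11.0, 13.0, 15.0, 14.0, 16.0, 18.0],
})

# sort by date and cut at the 75% mark so the test set is strictly later
toy = toy.sort_values("Date").reset_index(drop=True)
cutoff = int(len(toy) * 0.75)
train, test = toy.iloc[:cutoff], toy.iloc[cutoff:]

print(len(train), len(test))
print(train["Date"].max() < test["Date"].min())
```

With a split like this, the test score reflects how the model behaves on weeks it has not seen, which is closer to the forecasting use case.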

Apply ML Algorithms

In [None]:
#Linear Regression
from sklearn.linear_model import LinearRegression
lr=LinearRegression()
lr.fit(x_train,y_train)

In [None]:
#Random Forest Regressor 
from sklearn.ensemble import RandomForestRegressor
#random_state is fixed so the results are reproducible
rf=RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(x_train,y_train)

In [None]:
#Gradient Boosting Regressor 
from sklearn.ensemble import GradientBoostingRegressor
#random_state is fixed so the results are reproducible
gb=GradientBoostingRegressor(n_estimators=100, random_state=0)
gb.fit(x_train,y_train)

Model Evaluation

In [None]:
#model evaluation using evaluation metrics 
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
def evaluate_model(model, x_test, y_test):
    y_pred = model.predict(x_test)
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    return mae, mse, r2
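
To make the three metrics concrete, the sketch below computes them by hand on made-up numbers and checks them against scikit-learn. The arrays `y_true` and `y_hat` are illustrative only, and RMSE is added as an extra because it is in the same units as sales:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# made-up actual and predicted values
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 310.0, 380.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))   # mean absolute error
mse = np.mean(errors ** 2)      # mean squared error
rmse = np.sqrt(mse)             # root mean squared error, same units as the target
# R2 = 1 - (residual sum of squares / total sum of squares)
r2 = 1 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)

print(mae, mse, rmse, r2)

# the hand-computed values agree with scikit-learn
print(np.isclose(mae, mean_absolute_error(y_true, y_hat)))
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```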

In [None]:
lr_metrics=evaluate_model(lr, x_test, y_test)
lr_metrics

In [None]:
rf_metrics = evaluate_model(rf, x_test, y_test)
rf_metrics

In [None]:
gb_metrics = evaluate_model(gb, x_test, y_test)
gb_metrics

In [None]:
#model evaluation is converted into dataframe to compare with other models
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Random Forest', 'Gradient Boosting'],
    'MAE': [lr_metrics[0], rf_metrics[0], gb_metrics[0]],
    'MSE': [lr_metrics[1], rf_metrics[1], gb_metrics[1]],
    'R2 Score': [lr_metrics[2], rf_metrics[2], gb_metrics[2]]
})
results

The Random Forest Regressor achieves the highest R2 score of the three models, ahead of Linear Regression and the Gradient Boosting Regressor.

In [None]:
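#to predict walmart store sales for next 1 week (for the store in the last row)
#the calendar fields are moved one week ahead; other inputs keep their last observed values
next_week = x.tail(1).astype(float)
next_date = df['Date'].iloc[-1] + pd.Timedelta(weeks=1)
next_week['Year'] = next_date.year
next_week['Month'] = next_date.month
next_week['Week'] = next_date.isocalendar()[1]
#rf was trained on scaled features, so the input row must be scaled with the same scaler
next_week_prediction = rf.predict(pd.DataFrame(scaler.transform(next_week), columns=x.columns))

next_week_prediction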

In [None]:
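#to predict walmart store sales for next 1 month (4 weekly steps, for the store in the last row)
#the inputs contain no lagged sales, so predictions are not fed back into the features;
#each step moves the calendar fields one week ahead and keeps the other inputs
#at their last observed values (an assumption)
last_row = x.tail(1).astype(float)
last_date = df['Date'].iloc[-1]
future_predictions = []

for i in range(1, 5):
    next_date = last_date + pd.Timedelta(weeks=i)
    step = last_row.copy()
    step['Year'] = next_date.year
    step['Month'] = next_date.month
    step['Week'] = next_date.isocalendar()[1]

    #scale with the fitted scaler before predicting, as in training
    step_scaled = pd.DataFrame(scaler.transform(step), columns=x.columns)
    future_predictions.append(rf.predict(step_scaled)[0])

print("Next 1 month prediction:", future_predictions)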
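
A recursive multi-week forecast normally needs lagged sales among the inputs, so that each prediction can feed the next step. The notebook's features do not include lags; the sketch below shows, on synthetic data and with a hypothetical `Lag_1` column, how a per-store lag could be built:

```python
import pandas as pd

# synthetic data for two stores, illustration only
toy = pd.DataFrame({
    "Store": [1, 1, 1, 2, 2, 2],
    "Date": pd.to_datetime(["2010-02-05", "2010-02-12", "2010-02-19"] * 2),
    "Weekly_Sales": [100.0, 110.0, 120.0, 200.0, 190.0, 210.0],
})

# sort so that shifting within each store follows time order
toy = toy.sort_values(["Store", "Date"])

# Lag_1 (hypothetical name) holds the previous week's sales for the same store;
# the first week of each store has no previous value and stays NaN
toy["Lag_1"] = toy.groupby("Store")["Weekly_Sales"].shift(1)

print(toy)
```

Rows with a missing lag would need to be dropped or filled before training.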

In [None]:
#shows the actual and predicted sales value graph 
y_pred_test = rf.predict(x_test)

plt.figure(figsize=(10,5))
plt.plot(y_test.values[:100], label='Actual')
plt.plot(y_pred_test[:100], label='Predicted')
plt.legend()
plt.title("Actual vs Predicted Sales")
plt.show()

Summary: The Walmart Sales Prediction project shows that machine learning models trained on historical store-level data can forecast short-term weekly sales; of the three models compared, the Random Forest Regressor achieved the highest R2 score. Such forecasts can help retailers plan inventory, staffing, and strategy.