## 1. Data Loading

We begin by loading the Walmart sales dataset from the `data/raw/` folder. The file contains weekly sales per store, along with store information and economic and environmental indicators. Before loading, we make sure the working directory is set to the project root so that relative paths resolve the same way across notebooks.

In [9]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os

# --- UNIVERSAL PATH FIX ---
# This code block ensures the working directory is always the project root
# when the notebook is run in an interactive environment (like VS Code).
# When Papermill runs it from the root, this block is harmless.
try:
    if os.path.basename(os.getcwd()) == 'notebooks':
        os.chdir('..')
except Exception:
    # Safely ignore if the change fails in a restricted environment
    pass

print("Working Directory:", os.getcwd())
# The path is now consistently relative to the project root for both local and CI/CD
file_path = "data/raw/Walmart Data Analysis and Forecasting.csv" 

# Load data
df = pd.read_csv(file_path)

# Quick review
print("shape:", df.shape)
df.head()

Working Directory: c:\Users\Emron nabizadeh\Documents\Data-analyst\Project\walmart-sales-forecasting


FileNotFoundError: [Errno 2] No such file or directory: 'data/raw/Walmart Data Analysis and Forecasting.csv'

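The `FileNotFoundError` in the output above can have several causes; one common one is that the file name or location differs from the path in the code. The sketch below is one way to check this, not the notebook's own method: it walks upward from a starting folder until it finds a `data/raw` directory, then lists the CSV files that are actually there. The helper names `find_project_root` and `list_csv_files`, the marker argument, and the demo file `sales.csv` are all hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def find_project_root(start: Path, marker: str = "data/raw") -> Path:
    """Walk upward from `start` until a directory containing `marker` exists."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains '{marker}'")


def list_csv_files(raw_dir: Path) -> List[str]:
    """Return the sorted names of the CSV files directly inside `raw_dir`."""
    return sorted(p.name for p in raw_dir.glob("*.csv"))


# Demo on a throwaway directory tree (hypothetical layout and file name).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp).resolve()
    (root / "data" / "raw").mkdir(parents=True)
    (root / "notebooks").mkdir()
    (root / "data" / "raw" / "sales.csv").write_text("store,weekly_sales\n1,100\n")

    # Start from the notebooks folder, as an interactive session might.
    found_root = find_project_root(root / "notebooks")
    csv_names = list_csv_files(found_root / "data" / "raw")
    root_matches = found_root == root

print(csv_names)
```

In the notebook itself, calling `find_project_root(Path.cwd())` and then `list_csv_files(...)` on the resulting `data/raw` folder would show which file names exist, which can then be compared with `file_path`.
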
## 2. Column Cleaning

To simplify analysis, we standardize column names by stripping whitespace, converting to lowercase, and replacing spaces with underscores.



In [None]:
df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
print(df.columns.tolist())




In [None]:
# Preview the key columns after renaming; the remaining columns are omitted here
df[['store', 'date', 'weekly_sales', 'holiday_flag']].head()

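To make the renaming rule concrete, here is a minimal sketch that applies the same three steps to a toy frame. The raw header names below are invented for illustration and are not taken from the actual file; `clean_columns` is a hypothetical helper, and `regex=False` is added so the space is always treated as a literal character.

```python
import pandas as pd


def clean_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, lowercased, underscore-separated column names."""
    out = frame.copy()
    out.columns = (
        out.columns.str.strip()
        .str.lower()
        .str.replace(" ", "_", regex=False)
    )
    return out


# Toy headers with stray whitespace and mixed case (hypothetical examples).
toy = pd.DataFrame(columns=[" Store", "Weekly Sales ", "Holiday_Flag"])
cleaned = clean_columns(toy)

# Renaming can accidentally create duplicate names, so check for that.
names_are_unique = cleaned.columns.is_unique

print(cleaned.columns.tolist())
```

Returning a copy rather than renaming in place keeps the raw frame available for comparison, which may help when a later step fails on an unexpected column name.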

## 3. Data Inspection

We inspect the dataset to understand its structure, check for missing values, and review basic statistics. This helps identify potential cleaning steps and modeling challenges.


In [None]:
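# Basic info (dtypes and non-null counts); info() prints its report directly
df.info()

# Summary stats (printed explicitly, since a cell only displays its last expression)
print(df.describe())

# Missing values per column
print(df.isnull().sum())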

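Raw missing-value counts are easier to act on when shown next to percentages. The following is a small sketch of that idea on toy data; `missing_summary` is a hypothetical helper, and the `indicator` column is a placeholder rather than a column known to exist in the dataset.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column and express them as a percentage of rows."""
    n_missing = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "pct_missing": (n_missing / len(frame) * 100).round(2),
    })
    return summary.sort_values("n_missing", ascending=False)


# Toy data with gaps (values are made up for illustration).
toy = pd.DataFrame({
    "store": [1, 1, 2, 2],
    "weekly_sales": [100.0, np.nan, 250.0, 300.0],
    "indicator": [np.nan, np.nan, 211.1, 212.0],
})

summary = missing_summary(toy)
n_duplicates = int(toy.duplicated().sum())  # fully duplicated rows

print(summary)
print("Duplicate rows:", n_duplicates)
```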

## 4. Weekly Sales Distribution

We plot the distribution of weekly sales to understand its spread and skewness and to detect any outliers. These properties inform later modeling and feature engineering choices.


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 6))
sns.histplot(df['weekly_sales'], bins=50, kde=True)
plt.title("Distribution of Weekly Sales")
plt.xlabel("Sales")
plt.ylabel("Frequency")
plt.show()

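The histogram gives a visual impression; skewness and outliers can also be put into numbers. The sketch below uses the sample skewness from pandas and the common 1.5 × IQR rule on toy values. The rule and the multiplier `k=1.5` are assumptions chosen for illustration, not something the notebook prescribes, and `iqr_outlier_bounds` is a hypothetical helper.

```python
import pandas as pd


def iqr_outlier_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds; points outside them are flagged as outliers."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Toy sales values with one unusually large week (made up for illustration).
toy_sales = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 100.0])

skewness = toy_sales.skew()  # positive means a longer right tail
lower, upper = iqr_outlier_bounds(toy_sales)
outliers = toy_sales[(toy_sales < lower) | (toy_sales > upper)]

print("Skewness:", round(skewness, 2))
print("Bounds:", lower, upper)
print("Outliers:", outliers.tolist())
```

On the real data the same calls would take `df['weekly_sales']` in place of `toy_sales`.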

## 5. Date Range and Holiday Impact

We check the time span of the dataset and count holiday versus non-holiday weeks, a first step toward analyzing how holidays affect weekly sales. This insight will be valuable for time series forecasting.


In [None]:
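# With errors='coerce', unparseable dates silently become NaT, so count them
df['date'] = pd.to_datetime(df['date'], errors='coerce')
print("Unparsed dates:", df['date'].isna().sum())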

print("Date range:", df['date'].min(), "to", df['date'].max())
print("Holiday breakdown:\n", df['holiday_flag'].value_counts())

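Counting holiday weeks does not yet show their effect on sales. A simple first comparison, sketched here on toy data, is to group weekly sales by the holiday flag and compare the averages. This assumes `holiday_flag` is coded as 0 (non-holiday) and 1 (holiday), which the notebook does not state, and all values are invented.

```python
import pandas as pd

# Toy rows; holiday_flag is assumed to be 0 for regular weeks and 1 for holiday weeks.
toy = pd.DataFrame({
    "weekly_sales": [100.0, 150.0, 110.0, 120.0, 170.0],
    "holiday_flag": [0, 1, 0, 0, 1],
})

# Number of weeks plus typical sales level for each group.
holiday_summary = (
    toy.groupby("holiday_flag")["weekly_sales"]
    .agg(["count", "mean", "median"])
)

# Relative difference in mean sales between holiday and regular weeks.
uplift = holiday_summary.loc[1, "mean"] / holiday_summary.loc[0, "mean"] - 1

print(holiday_summary)
print(f"Holiday uplift in mean sales: {uplift:.1%}")
```

A difference in group means is only descriptive: holiday weeks are few and may coincide with seasonal peaks, so it should be read as a hint rather than a measured effect.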

## 6. Time Series Preview

We aggregate weekly sales over time to visualize trends and seasonality. This sets the stage for building forecasting models in later phases.


In [None]:
sales_by_date = df.groupby('date')['weekly_sales'].sum()

plt.figure(figsize=(12, 6))
sales_by_date.plot()
plt.title("Total Weekly Sales Over Time")
plt.xlabel("Date")
plt.ylabel("Sales")
plt.grid(True)
plt.show()

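Week-to-week totals can be noisy, which makes the trend harder to see. One option, sketched below on toy data, is to overlay a rolling mean on the aggregated series. The 4-week window is an arbitrary choice for illustration, and the dates, stores, and sales values are invented.

```python
import pandas as pd

# Six consecutive Fridays for two hypothetical stores.
dates = pd.date_range("2010-02-05", periods=6, freq="W-FRI")
toy = pd.DataFrame({
    "date": list(dates) * 2,
    "store": [1] * 6 + [2] * 6,
    "weekly_sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0,
                     1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

# Same aggregation as in the cell above: total sales across stores per week.
sales_by_date = toy.groupby("date")["weekly_sales"].sum().sort_index()

# Rolling mean over 4 weeks; the first 3 entries are NaN until the window fills.
rolling_mean = sales_by_date.rolling(window=4).mean()

print(pd.DataFrame({"total": sales_by_date, "rolling_4w": rolling_mean}))
```

Calling `rolling_mean.plot()` after `sales_by_date.plot()` on the same figure would draw the smoothed line over the raw totals.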