# Corporación Favorita Grocery Sales Forecasting
**w01_d04_EDA_temporal_patterns.ipynb**

**Author:** Alberto Diaz Durana  
**Date:** November 2025  
**Purpose:** Engineer temporal features, visualize time series patterns, analyze product dynamics

---

## Objectives

This notebook accomplishes the following:

- Create rolling statistics features (7/14/30-day moving averages)
- Visualize overall sales trends (2013-2017)
- Generate year-month heatmap for seasonality
- Analyze autocorrelation for lag selection
- Deep dive into day-of-week patterns by family
- Investigate payday effects (1st, 15th of month)
- Classify fast vs slow movers
- Perform Pareto (80/20) analysis

---

## Business Context

**Why temporal patterns matter:**

Understanding when sales occur enables:
- Accurate forecasting of seasonal peaks/troughs
- Optimal inventory allocation by day-of-week
- Promotional timing optimization (payday effects)
- Resource planning (staffing for high-traffic days)

**Why product dynamics matter:**

Identifying fast vs slow movers enables:
- Differentiated inventory strategies
- Focus on high-impact items (80/20 rule)
- Better demand forecasting accuracy
- Reduced waste for slow movers

**Deliverables:**
- Rolling average features (7/14/30-day windows)
- Time series visualizations (trend, seasonality)
- Autocorrelation analysis (lag structure)
- Day-of-week and payday patterns
- Fast/slow mover classification
- Pareto chart (sales concentration)

---

## Input Dependencies

From Day 3:
- Clean dataset with temporal features (300K rows, 29 columns)
- Store metadata merged (type, city, cluster)
- No missing values, outliers flagged

---

## 1. Feature Engineering - Rolling Statistics

**Objective:** Create smoothed sales features using rolling windows

**Activities:**
- Load clean dataset from Day 3
- Sort by (store_nbr, item_nbr, date) for temporal order
- Calculate 7-day, 14-day, 30-day rolling means
- Handle edge cases (min_periods parameter)
- Visualize raw vs smoothed for sample items

**Expected output:** 
- 3 new rolling average columns
- Smoothing visualization showing effectiveness
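
A minimal sketch of how the window type and `min_periods` interact on sparse data, using a hypothetical four-row series rather than the Guayas sample. An integer window counts rows, while an offset window such as `"3D"` counts calendar days, so the two can diverge whenever an item has gaps between sale dates.

```python
import pandas as pd

# Hypothetical toy series for one store-item pair with a gap between sale dates
toy_sales = pd.Series(
    [10.0, 20.0, 30.0, 40.0],
    index=pd.to_datetime(["2017-01-01", "2017-01-02", "2017-01-10", "2017-01-11"]),
)

# Row-based window: mean of the last 3 recorded rows, regardless of date gaps
row_based = toy_sales.rolling(window=3, min_periods=1).mean()

# Time-based window: mean of the rows that fall in the trailing 3 calendar days
time_based = toy_sales.rolling("3D", min_periods=1).mean()

print(pd.DataFrame({"raw": toy_sales, "row_based": row_based, "time_based": time_based}))
```

With `min_periods=1`, the first rows of each group get an average over fewer observations instead of `NaN`. On 2017-01-10 the row-based mean still includes the two January 1-2 sales, while the time-based mean does not.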

In [None]:
# Import libraries
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import scipy
from scipy import stats

# Configure environment
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.2f}'.format)
plt.style.use('seaborn-v0_8-darkgrid')

print("Package Versions:")
print(f"  pandas: {pd.__version__}")
print(f"  numpy: {np.__version__}")
print(f"  matplotlib: {matplotlib.__version__}")
print(f"  seaborn: {sns.__version__}")
print(f"  scipy: {scipy.__version__}")
print("\nOK - Libraries imported")

In [None]:
# Determine paths
current_dir = Path(__file__).parent if '__file__' in globals() else Path.cwd()
project_root = current_dir.parent if current_dir.name == 'notebooks' else current_dir

# Define path constants
DATA_PROCESSED = project_root / 'data' / 'processed'
OUTPUTS = project_root / 'outputs' / 'figures' / 'eda'

# Verify paths
assert DATA_PROCESSED.exists(), f"ERROR - Path not found: {DATA_PROCESSED}"
assert OUTPUTS.exists(), f"ERROR - Path not found: {OUTPUTS}"

print("OK - Paths validated:")
print(f"  Project root: {project_root.resolve()}")
print(f"  DATA_PROCESSED: {DATA_PROCESSED.resolve()}")
print(f"  OUTPUTS: {OUTPUTS.resolve()}")

# Set random seed
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
print(f"\nRandom seed: {RANDOM_SEED}")

In [None]:
# Load clean dataset from Day 3
print("Loading clean dataset from Day 3...")

# Note: Prefer the Day 3 pickle for speed, since it already contains all Day 3 processing.
# If Day 3 did not save one, fall back to the original sample and reapply the transformations.

# Check if we have a processed dataset from Day 3
day3_pkl = DATA_PROCESSED / 'guayas_clean_day3.pkl'

if day3_pkl.exists():
    df = pd.read_pickle(day3_pkl)
    print(f"OK - Loaded from Day 3 pickle")
else:
    # Load original sample and note we need to reapply Day 3 transformations
    df = pd.read_pickle(DATA_PROCESSED / 'guayas_sample_300k.pkl')
    print("NOTE - Loading original sample (Day 3 transformations need to be reapplied)")
    print("       This is expected if Day 3 notebook didn't export final dataset")

print(f"\nDataset loaded:")
print(f"  Shape: {df.shape}")
print(f"  Columns: {len(df.columns)}")
print(f"  Memory: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

print("\nFirst 3 rows:")
print(df.head(3))

In [None]:
# Quick reapplication of Day 3 transformations
print("Reapplying Day 3 transformations...")

# 1. Convert date to datetime
df['date'] = pd.to_datetime(df['date'])

# 2. Fill onpromotion NaN with 0
df['onpromotion'] = df['onpromotion'].fillna(0.0)

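# 3. Load and merge store metadata
#    Merge only the columns that are missing; re-merging columns already present
#    (e.g. from the Day 3 pickle) would create duplicate city_x / city_y columns
store_cols = ['city', 'state', 'type', 'cluster']
missing_cols = [col for col in store_cols if col not in df.columns]
if missing_cols:
    df_stores = pd.read_csv(project_root / 'data' / 'raw' / 'stores.csv')
    df = df.merge(df_stores[['store_nbr'] + missing_cols], on='store_nbr', how='left')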

# 4. Create temporal features
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek
df['day_of_month'] = df['date'].dt.day
df['week_of_year'] = df['date'].dt.isocalendar().week
df['quarter'] = df['date'].dt.quarter
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['is_month_start'] = df['date'].dt.is_month_start.astype(int)
df['is_month_end'] = df['date'].dt.is_month_end.astype(int)

print(f"OK - Transformations applied")
print(f"  Missing values: {df.isnull().sum().sum()}")
print(f"  Shape: {df.shape}")
print(f"  Date range: {df['date'].min().date()} to {df['date'].max().date()}")

print("\nDataset ready for feature engineering")

In [None]:
# Sort data for rolling window calculations
print("Preparing data for rolling statistics...")
print("Sorting by (store_nbr, item_nbr, date) for temporal order...")

# Critical: Sort by store, item, date for proper rolling windows
df = df.sort_values(['store_nbr', 'item_nbr', 'date']).reset_index(drop=True)

print(f"OK - Data sorted")
print(f"\nFirst 5 rows (after sorting):")
print(df[['date', 'store_nbr', 'item_nbr', 'unit_sales', 'family']].head())

print(f"\nLast 5 rows (after sorting):")
print(df[['date', 'store_nbr', 'item_nbr', 'unit_sales', 'family']].tail())

# Verify temporal order within groups
sample_group = df[(df['store_nbr'] == 24) & (df['item_nbr'] == df['item_nbr'].iloc[0])]
print(f"\nSample group (Store 24, Item {df['item_nbr'].iloc[0]}):")
print(f"  Rows: {len(sample_group)}")
print(f"  Date range: {sample_group['date'].min().date()} to {sample_group['date'].max().date()}")
print(f"  Dates in order: {sample_group['date'].is_monotonic_increasing}")

In [None]:
# Calculate rolling statistics
print("Computing rolling statistics...")
print("This may take 1-2 minutes for 300K rows...\n")

# Calculate 7-, 14-, and 30-observation rolling means per store-item group
# Note: window=N counts the last N recorded sales rows, not N calendar days;
# with sparse data a "7-day" window can span more than 7 calendar days
df['unit_sales_7d_avg'] = df.groupby(['store_nbr', 'item_nbr'])['unit_sales'].transform(
    lambda x: x.rolling(window=7, min_periods=1).mean()
)

df['unit_sales_14d_avg'] = df.groupby(['store_nbr', 'item_nbr'])['unit_sales'].transform(
    lambda x: x.rolling(window=14, min_periods=1).mean()
)

df['unit_sales_30d_avg'] = df.groupby(['store_nbr', 'item_nbr'])['unit_sales'].transform(
    lambda x: x.rolling(window=30, min_periods=1).mean()
)

print(f"OK - Rolling statistics computed")
print(f"\nNew columns created:")
print(f"  • unit_sales_7d_avg (7-day moving average)")
print(f"  • unit_sales_14d_avg (14-day moving average)")
print(f"  • unit_sales_30d_avg (30-day moving average)")

print(f"\nDataset shape: {df.shape}")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")

# Display sample with rolling features
print("\nSample rows with rolling features:")
sample = df[df['item_nbr'] == df['item_nbr'].iloc[0]].head(10)
print(sample[['date', 'store_nbr', 'item_nbr', 'unit_sales', 
              'unit_sales_7d_avg', 'unit_sales_14d_avg', 'unit_sales_30d_avg']])

In [None]:
# Visualize rolling statistics effectiveness
print("Visualizing rolling statistics (raw vs smoothed)...")

# Select a high-volume item for clear visualization
high_vol_items = df.groupby('item_nbr')['unit_sales'].sum().nlargest(5).index

fig, axes = plt.subplots(2, 2, figsize=(16, 10))

for idx, item in enumerate(high_vol_items[:4]):
    ax = axes[idx // 2, idx % 2]
    
    # Get data for this item (all stores combined for clarity)
    item_data = df[df['item_nbr'] == item].groupby('date').agg({
        'unit_sales': 'sum',
        'unit_sales_7d_avg': 'mean',
        'unit_sales_14d_avg': 'mean',
        'unit_sales_30d_avg': 'mean',
        'family': 'first'
    }).reset_index()
    
    # Sort by date
    item_data = item_data.sort_values('date')
    
    # Plot
    ax.plot(item_data['date'], item_data['unit_sales'], 
            alpha=0.3, color='gray', label='Raw Sales', linewidth=0.8)
    ax.plot(item_data['date'], item_data['unit_sales_7d_avg'], 
            color='#2ca02c', label='7-day MA', linewidth=2)
    ax.plot(item_data['date'], item_data['unit_sales_14d_avg'], 
            color='#ff7f0e', label='14-day MA', linewidth=2)
    ax.plot(item_data['date'], item_data['unit_sales_30d_avg'], 
            color='#d62728', label='30-day MA', linewidth=2)
    
    ax.set_title(f"Item #{item} ({item_data['family'].iloc[0]})", 
                 fontsize=12, fontweight='bold')
    ax.set_xlabel('Date', fontsize=10)
    ax.set_ylabel('Unit Sales', fontsize=10)
    ax.legend(loc='best', fontsize=9)
    ax.grid(True, alpha=0.3)

plt.suptitle('Rolling Statistics - Smoothing Effect on Top 4 Items', 
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.savefig(OUTPUTS / '04_rolling_statistics_smoothing.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nOK - Visualization saved to outputs/figures/eda/04_rolling_statistics_smoothing.png")
print("\n" + "=" * 70)
print("SECTION 1 COMPLETE - Feature Engineering (Rolling Statistics)")
print("=" * 70)

## 2. Time Series Visualization

**Objective:** Visualize overall sales trends and identify seasonal patterns

**Activities:**
- Aggregate total sales by date (300K → daily totals)
- Plot 5-year time series (2013-2017)
- Identify trend, seasonality, anomalies
- Create year-month heatmap
- Annotate major patterns (December peaks, Q1 lulls)

**Expected output:** 
- Time series plot with trend analysis
- Year-month heatmap
- Pattern interpretation report
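
The heatmap normalization can be sketched on a hypothetical two-year, two-month table. Each cell is divided by its own year's monthly mean, so years with different overall volume become comparable. The numbers below are made up for illustration.

```python
import pandas as pd

# Hypothetical monthly totals (not taken from the Guayas sample)
toy_monthly = pd.DataFrame({
    "year": [2015, 2015, 2016, 2016],
    "month": [1, 2, 1, 2],
    "unit_sales": [80.0, 120.0, 150.0, 250.0],
})

# Rows = month, columns = year
toy_pivot = toy_monthly.pivot(index="month", columns="year", values="unit_sales")

# Divide every column by that year's average month, then express as a percentage
toy_pct = toy_pivot.div(toy_pivot.mean(axis=0), axis=1) * 100

print(toy_pct)
```

A value of 100 means the month matched its year's average. Note that a year with missing months is normalized by the mean of the months it does have, which can shift its percentages.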

In [None]:
# Aggregate sales by date for overall trend
print("Time Series Visualization")
print("=" * 70)

print("\nAggregating sales by date...")
daily_sales = df.groupby('date').agg({
    'unit_sales': 'sum',
    'store_nbr': 'nunique',
    'item_nbr': 'nunique'
}).reset_index()

daily_sales.columns = ['date', 'total_sales', 'active_stores', 'active_items']

print(f"OK - Daily aggregation complete")
print(f"  Date range: {daily_sales['date'].min().date()} to {daily_sales['date'].max().date()}")
print(f"  Total days: {len(daily_sales)}")
print(f"  Total sales: {daily_sales['total_sales'].sum():,.0f} units")

print("\nDaily sales statistics:")
print(daily_sales[['total_sales', 'active_stores', 'active_items']].describe())

print("\nSample daily data:")
print(daily_sales.head(10))

In [None]:
# Plot overall time series trend
fig, ax = plt.subplots(figsize=(16, 6))

# Plot daily sales
ax.plot(daily_sales['date'], daily_sales['total_sales'], 
        color='steelblue', linewidth=1, alpha=0.6, label='Daily Sales')

# Add 30-day moving average for trend
ma_30 = daily_sales['total_sales'].rolling(window=30, center=True).mean()
ax.plot(daily_sales['date'], ma_30, 
        color='red', linewidth=3, label='30-day Moving Average', alpha=0.8)

# Styling
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Total Sales (units)', fontsize=12)
ax.set_title('Total Sales Time Series (2013-2017) - Guayas Region', 
             fontsize=14, fontweight='bold')
ax.legend(loc='upper left', fontsize=11)
ax.grid(True, alpha=0.3)

# Format y-axis with comma separator
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, p: f'{int(x):,}'))

# Add vertical lines for year boundaries
for year in range(2014, 2018):
    ax.axvline(pd.Timestamp(f'{year}-01-01'), color='gray', linestyle='--', alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUTS / '05_sales_time_series_overall.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nOK - Time series plot saved")

# Identify patterns
print("\nPattern Analysis:")
print(f"  Overall trend: {'Increasing' if ma_30.iloc[-100] > ma_30.iloc[100] else 'Stable/Decreasing'}")
print(f"  Peak sales: {daily_sales['total_sales'].max():,.0f} on {daily_sales.loc[daily_sales['total_sales'].idxmax(), 'date'].date()}")
print(f"  Lowest sales: {daily_sales['total_sales'].min():,.0f} on {daily_sales.loc[daily_sales['total_sales'].idxmin(), 'date'].date()}")

# Check for December peaks (holiday season)
dec_sales = df[df['month'] == 12].groupby('date')['unit_sales'].sum()
avg_dec = dec_sales.mean()
avg_overall = daily_sales['total_sales'].mean()
print(f"  December avg: {avg_dec:.0f} vs Overall avg: {avg_overall:.0f} ({(avg_dec/avg_overall - 1)*100:+.1f}%)")

In [None]:
# Create year-month heatmap for seasonality
print("\nCreating year-month heatmap...")

# Aggregate by year and month
monthly_sales = df.groupby(['year', 'month'])['unit_sales'].sum().reset_index()
monthly_pivot = monthly_sales.pivot(index='month', columns='year', values='unit_sales')

# Calculate percentage of annual average for color scaling
monthly_pct = monthly_pivot.div(monthly_pivot.mean(axis=0), axis=1) * 100

fig, ax = plt.subplots(figsize=(12, 8))

# Create heatmap
sns.heatmap(monthly_pct, annot=True, fmt='.0f', cmap='RdYlGn', 
            center=100, vmin=70, vmax=130,
            cbar_kws={'label': '% of Annual Average'},
            linewidths=1, linecolor='white', ax=ax)

# Styling
ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Month', fontsize=12)
ax.set_title('Sales Seasonality Heatmap (% of Annual Average)', 
             fontsize=14, fontweight='bold')

# Month labels
month_names = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 
               'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
ax.set_yticklabels(month_names, rotation=0)

plt.tight_layout()
plt.savefig(OUTPUTS / '06_sales_seasonality_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nOK - Heatmap saved")

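# Identify seasonal patterns
# Rank months by average daily sales instead of raw totals, so that months
# covered in fewer years (e.g. a partial final year) are not understated
print("\nSeasonal Patterns:")
monthly_avg = (df.groupby(['month', 'date'])['unit_sales'].sum()
               .groupby(level='month').mean()
               .sort_values(ascending=False))
print(f"  Strongest months: {', '.join([month_names[m-1] for m in monthly_avg.head(3).index])}")
print(f"  Weakest months: {', '.join([month_names[m-1] for m in monthly_avg.tail(3).index])}")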

print("\n" + "=" * 70)
print("SECTION 2 COMPLETE - Time Series Visualization")
print("=" * 70)

## 3. Autocorrelation Analysis

**Objective:** Assess temporal dependence to guide lag feature selection

**Activities:**
- Aggregate daily sales for autocorrelation calculation
- Plot autocorrelation function (ACF)
- Interpret lag significance
- Document findings for Week 2 feature engineering

**Expected output:** 
- Autocorrelation plot
- Lag analysis interpretation
- Recommendations for lag features (1, 7, 14, 30 days)
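
As a quick illustration of what a lag-7 autocorrelation captures, the sketch below builds a hypothetical series that repeats one weekly pattern and measures it with `Series.autocorr`. The values are invented and the pattern is perfectly periodic, so this is an idealized case rather than what the real data will show.

```python
import pandas as pd

# Hypothetical daily totals: one weekly pattern repeated for 8 weeks
week_pattern = [100, 95, 90, 105, 120, 160, 150]
toy_daily = pd.Series(week_pattern * 8, dtype=float)

# Pearson correlation between the series and itself shifted by `lag` days
acf_by_lag = {lag: toy_daily.autocorr(lag=lag) for lag in (1, 7, 14)}

for lag, value in acf_by_lag.items():
    print(f"Lag {lag:>2}d: {value:.3f}")
```

Lags 7 and 14 line up each day with the same weekday, so their correlation is essentially 1, while lag 1 compares neighboring weekdays and scores lower. Real sales add trend and noise, which pulls all of these values down.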

In [None]:
# Autocorrelation analysis
print("Autocorrelation Analysis")
print("=" * 70)

print("\nCalculating autocorrelation for daily sales...")

# Use daily aggregated sales for ACF
daily_sales_sorted = daily_sales.sort_values('date')

# Plot autocorrelation
fig, axes = plt.subplots(1, 2, figsize=(16, 5))

# ACF plot
ax1 = axes[0]
from pandas.plotting import autocorrelation_plot
autocorrelation_plot(daily_sales_sorted['total_sales'], ax=ax1)
ax1.set_title('Autocorrelation Function (ACF) - Daily Sales', 
              fontsize=13, fontweight='bold')
ax1.set_xlabel('Lag (days)', fontsize=11)
ax1.set_ylabel('Autocorrelation', fontsize=11)
ax1.grid(True, alpha=0.3)
ax1.axhline(y=0, color='k', linestyle='-', linewidth=0.8)
ax1.axhline(y=0.05, color='r', linestyle='--', alpha=0.5, label='±0.05 reference')
ax1.axhline(y=-0.05, color='r', linestyle='--', alpha=0.5)
ax1.legend()

# Manual ACF calculation for specific lags
ax2 = axes[1]
lags = [1, 7, 14, 30, 60, 90]
acf_values = []

for lag in lags:
    corr = daily_sales_sorted['total_sales'].autocorr(lag=lag)
    acf_values.append(corr)

bars = ax2.bar(range(len(lags)), acf_values, color='steelblue', alpha=0.7, edgecolor='black')
ax2.set_xticks(range(len(lags)))
ax2.set_xticklabels([f'{lag}d' for lag in lags])
ax2.set_xlabel('Lag', fontsize=11)
ax2.set_ylabel('Autocorrelation', fontsize=11)
ax2.set_title('Autocorrelation at Key Lags', fontsize=13, fontweight='bold')
ax2.axhline(y=0, color='k', linestyle='-', linewidth=0.8)
ax2.axhline(y=0.05, color='r', linestyle='--', alpha=0.5)
ax2.grid(axis='y', alpha=0.3)

# Add value labels
for bar, val in zip(bars, acf_values):
    height = bar.get_height()
    ax2.text(bar.get_x() + bar.get_width()/2., height,
             f'{val:.3f}', ha='center', va='bottom' if val > 0 else 'top', fontsize=10)

plt.tight_layout()
plt.savefig(OUTPUTS / '07_autocorrelation_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nOK - Autocorrelation plot saved")

# Interpret results
print("\nAutocorrelation at Key Lags:")
for lag, acf in zip(lags, acf_values):
    significance = "Strong" if abs(acf) > 0.3 else "Moderate" if abs(acf) > 0.1 else "Weak"
    print(f"  Lag {lag:>2}d: {acf:>6.3f} ({significance})")

print("\nInterpretation:")
print("  → Lag 1 (yesterday): Strong autocorrelation expected")
print("  → Lag 7 (week): Weekly patterns if significant")
print("  → Lag 30 (month): Monthly cycles if significant")

print("\nRecommendations for Week 2:")
print("  • Include lag features: 1, 7, 14 days (strong temporal dependence)")
print("  • Consider day-of-week effects (weekly patterns)")
print("  • Monthly seasonality features useful (lag 30 moderate)")

print("\n" + "=" * 70)
print("SECTION 3 COMPLETE - Autocorrelation Analysis")
print("=" * 70)

## 4. Temporal Deep Dive

**Objective:** Uncover day-of-week, monthly, and payday patterns

**Activities:**
- Analyze day-of-week patterns by product family
- Compare weekend vs weekday sales
- Investigate payday effects (1st and 15th of month)
- Monthly seasonality by family
- End-of-month patterns

**Expected output:** 
- Day-of-week analysis by family
- Payday effect quantification
- Monthly patterns comparison
- Temporal insights for inventory planning
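
A minimal sketch of the payday-lift calculation on hypothetical data: one month of daily totals where days 1-3 and 14-16 are set 30% higher than the rest. It compares average sales per day, so the differing number of payday and non-payday days does not distort the result. All values and the `toy_` names are illustrative.

```python
import pandas as pd

# Hypothetical month of daily totals: 100 units on every day to start with
toy_dates = pd.date_range("2017-01-01", "2017-01-31", freq="D")
toy_days = pd.DataFrame({"date": toy_dates, "total_sales": 100.0})

# Flag the payday window (days 1-3 and 14-16), mirroring is_payday_window
payday_window_days = [1, 2, 3, 14, 15, 16]
toy_days["is_payday_window"] = toy_days["date"].dt.day.isin(payday_window_days).astype(int)

# Raise sales inside the window so there is an effect to measure
toy_days.loc[toy_days["is_payday_window"] == 1, "total_sales"] = 130.0

# Compare per-day averages, not totals (6 window days vs 25 other days)
avg_by_flag = toy_days.groupby("is_payday_window")["total_sales"].mean()
toy_lift_pct = (avg_by_flag.loc[1] / avg_by_flag.loc[0] - 1) * 100

print(f"Payday lift: {toy_lift_pct:+.1f}%")
```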

In [None]:
# Day-of-week patterns analysis
print("Temporal Deep Dive")
print("=" * 70)

print("\n1. Day-of-Week Patterns by Family:")
print("=" * 70)

# Aggregate by day-of-week and family
dow_family = df.groupby(['day_of_week', 'family'])['unit_sales'].sum().reset_index()
dow_family_pivot = dow_family.pivot(index='day_of_week', columns='family', values='unit_sales')

# Calculate percentage of weekly average
dow_family_pct = dow_family_pivot.div(dow_family_pivot.mean(axis=0), axis=1) * 100

print("\nSales by Day-of-Week (% of weekly average):")
dow_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
dow_family_pct.index = dow_names
print(dow_family_pct.round(1))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Label the absolute-sales pivot with day names (the percentage pivot was labeled above)
dow_family_pivot.index = dow_names

# Plot 1: Absolute sales by day-of-week
ax1 = axes[0]
dow_family_pivot.plot(kind='bar', ax=ax1, width=0.8, alpha=0.8)
ax1.set_xlabel('Day of Week', fontsize=11)
ax1.set_ylabel('Total Sales (units)', fontsize=11)
ax1.set_title('Sales by Day-of-Week and Family', fontsize=13, fontweight='bold')
ax1.legend(title='Family', fontsize=10, bbox_to_anchor=(1.05, 1), loc='upper left')
ax1.grid(axis='y', alpha=0.3)
ax1.tick_params(axis='x', rotation=0)  # Use tick_params instead of set_xticklabels

# Plot 2: Percentage of weekly average
ax2 = axes[1]
dow_family_pct.plot(kind='line', ax=ax2, marker='o', linewidth=2.5, markersize=8)
ax2.set_xlabel('Day of Week', fontsize=11)
ax2.set_ylabel('% of Weekly Average', fontsize=11)
ax2.set_title('Day-of-Week Pattern (% of Average)', fontsize=13, fontweight='bold')
ax2.axhline(y=100, color='gray', linestyle='--', alpha=0.5, label='Average')
ax2.legend(title='Family', fontsize=10, bbox_to_anchor=(1.05, 1), loc='upper left')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUTS / '08_day_of_week_patterns.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nOK - Day-of-week visualization saved")

In [None]:
# Weekend vs Weekday comparison
print("\n2. Weekend vs Weekday Comparison:")
print("=" * 70)

weekend_comparison = df.groupby(['is_weekend', 'family'])['unit_sales'].agg(['sum', 'mean', 'count']).reset_index()
weekend_comparison['is_weekend'] = weekend_comparison['is_weekend'].map({0: 'Weekday', 1: 'Weekend'})

print("\nSales by Weekend/Weekday:")
for family in df['family'].unique():
    family_data = weekend_comparison[weekend_comparison['family'] == family]
    weekday_sales = family_data[family_data['is_weekend'] == 'Weekday']['sum'].values[0]
    weekend_sales = family_data[family_data['is_weekend'] == 'Weekend']['sum'].values[0]
    weekend_lift = (weekend_sales / weekday_sales - 1) * 100
    print(f"  {family:<15} Weekend lift: {weekend_lift:>+6.1f}%")

# Overall weekend effect
overall_weekday = df[df['is_weekend'] == 0]['unit_sales'].sum()
overall_weekend = df[df['is_weekend'] == 1]['unit_sales'].sum()
overall_lift = (overall_weekend / overall_weekday - 1) * 100
print(f"  {'Overall':<15} Weekend lift: {overall_lift:>+6.1f}%")

print(f"\nWeekend accounts for: {df[df['is_weekend'] == 1]['unit_sales'].sum() / df['unit_sales'].sum() * 100:.1f}% of total sales")
print(f"  (Expected if uniform: 28.6% = 2 days / 7 days)")

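**Note:** The comparison above is misleading. It sets total weekend sales (2 days) against total weekday sales (5 days), so the lift comes out negative even though weekends account for 34.9% of sales, versus the 28.6% expected under a uniform split.
The weekend share is therefore about 22% higher than uniform (34.9 / 28.6 ≈ 1.22), which means weekends are elevated. A fair comparison uses average sales per day.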
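
The arithmetic behind this note can be checked directly. The sketch below uses only the 34.9% weekend share quoted above and assumes an even 2/5 split of weekend and weekday dates; the variable names are illustrative.

```python
# Weekend share of total sales, as reported by the cell above
weekend_share = 0.349
weekday_share = 1 - weekend_share

# Totals vs totals: 2 days against 5 days, hence the negative "lift"
naive_lift = weekend_share / weekday_share - 1

# Share vs uniform share: how far 34.9% sits above 2/7
share_ratio = weekend_share / (2 / 7) - 1

# Per-day comparison: average weekend day vs average weekday
per_day_lift = (weekend_share / 2) / (weekday_share / 5) - 1

print(f"Naive lift (totals):   {naive_lift:+.1%}")
print(f"Share vs uniform:      {share_ratio:+.1%}")
print(f"Per-day lift:          {per_day_lift:+.1%}")
```

The three figures answer different questions. The per-day lift of roughly +34% is the one comparable to a daily-average analysis, while the roughly +22% figure describes the weekend share relative to a uniform split.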

In [None]:
# Correct weekend analysis - compare DAILY averages
print("\n3. Corrected Weekend Analysis (Daily Averages):")
print("=" * 70)

# Calculate average daily sales
weekday_daily_avg = df[df['is_weekend'] == 0]['unit_sales'].sum() / df[df['is_weekend'] == 0]['date'].nunique()
weekend_daily_avg = df[df['is_weekend'] == 1]['unit_sales'].sum() / df[df['is_weekend'] == 1]['date'].nunique()

print(f"\nAverage daily sales:")
print(f"  Weekday: {weekday_daily_avg:>8,.1f} units/day")
print(f"  Weekend: {weekend_daily_avg:>8,.1f} units/day")
print(f"  Weekend lift: {(weekend_daily_avg / weekday_daily_avg - 1) * 100:>+6.1f}%")

# By family
print("\nWeekend lift by family (daily averages):")
for family in df['family'].unique():
    family_df = df[df['family'] == family]
    weekday_avg = family_df[family_df['is_weekend'] == 0]['unit_sales'].sum() / family_df[family_df['is_weekend'] == 0]['date'].nunique()
    weekend_avg = family_df[family_df['is_weekend'] == 1]['unit_sales'].sum() / family_df[family_df['is_weekend'] == 1]['date'].nunique()
    lift = (weekend_avg / weekday_avg - 1) * 100
    print(f"  {family:<15} {lift:>+6.1f}%")

print("\nConclusion:")
print(f"  → Weekends have significantly higher daily sales ({(weekend_daily_avg / weekday_daily_avg - 1) * 100:+.1f}% lift)")
print("  → Saturday/Sunday are peak shopping days (grocery restocking)")
print("  → Inventory should be elevated for weekend demand")

**Corrected result:** Based on daily averages, the weekend lift is +33.9%, with BEVERAGES highest at +40.2%.

In [None]:
# Payday effects analysis (1st and 15th of month)
print("\n4. Payday Effects Analysis:")
print("=" * 70)

# Define payday windows: days 1-3 (the 1st plus two days) and 14-16 (the 15th ±1 day)
df['is_payday_window'] = df['day_of_month'].isin([1, 2, 3, 14, 15, 16]).astype(int)

payday_sales = df[df['is_payday_window'] == 1]['unit_sales'].sum()
non_payday_sales = df[df['is_payday_window'] == 0]['unit_sales'].sum()

payday_days = df[df['is_payday_window'] == 1]['date'].nunique()
non_payday_days = df[df['is_payday_window'] == 0]['date'].nunique()

payday_daily_avg = payday_sales / payday_days
non_payday_daily_avg = non_payday_sales / non_payday_days

print(f"\nPayday window (days 1-3 and 14-16 of month):")
print(f"  Payday window avg: {payday_daily_avg:>8,.1f} units/day ({payday_days} days)")
print(f"  Non-payday avg:    {non_payday_daily_avg:>8,.1f} units/day ({non_payday_days} days)")
print(f"  Payday lift:       {(payday_daily_avg / non_payday_daily_avg - 1) * 100:>+6.1f}%")

# Specific day analysis
day_of_month_sales = df.groupby('day_of_month').agg({
    'unit_sales': 'sum',
    'date': 'nunique'
}).reset_index()
day_of_month_sales['daily_avg'] = day_of_month_sales['unit_sales'] / day_of_month_sales['date']

# Visualize
fig, ax = plt.subplots(figsize=(14, 6))

bars = ax.bar(day_of_month_sales['day_of_month'], day_of_month_sales['daily_avg'], 
              color='steelblue', alpha=0.7, edgecolor='black')

# Highlight payday windows
payday_days_list = [1, 2, 3, 14, 15, 16]
for i, day in enumerate(day_of_month_sales['day_of_month']):
    if day in payday_days_list:
        bars[i].set_color('#2ca02c')
        bars[i].set_alpha(0.9)

ax.set_xlabel('Day of Month', fontsize=12)
ax.set_ylabel('Average Daily Sales', fontsize=12)
ax.set_title('Sales by Day of Month (Payday Effect)', fontsize=14, fontweight='bold')
ax.axhline(y=day_of_month_sales['daily_avg'].mean(), color='red', linestyle='--', 
           alpha=0.5, label='Monthly Average')
ax.grid(axis='y', alpha=0.3)
ax.legend()

plt.tight_layout()
plt.savefig(OUTPUTS / '09_payday_effects.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nOK - Payday effects visualization saved")

print("\nTop 5 days of month by sales:")
top_days = day_of_month_sales.nlargest(5, 'daily_avg')[['day_of_month', 'daily_avg']]
for _, row in top_days.iterrows():
    # iterrows() upcasts the mixed int/float row to float, so cast the day back to int
    day = int(row['day_of_month'])
    payday_flag = " (PAYDAY)" if day in payday_days_list else ""
    print(f"  Day {day:>2}: {row['daily_avg']:>8,.1f} units/day{payday_flag}")

print("\n" + "=" * 70)
print("SECTION 4 COMPLETE - Temporal Deep Dive")
print("=" * 70)

## 5. Product Analysis - Fast vs Slow Movers

**Objective:** Classify items by sales velocity and identify sales concentration

**Activities:**
- Calculate sales velocity per item (total sales / days active)
- Classify fast movers (top 20%), slow movers (bottom 20%)
- Perform Pareto (80/20) analysis
- Analyze sales concentration by family
- Identify hero products

**Expected output:** 
- Fast/slow mover classification
- Pareto chart showing sales concentration
- Hero products list per family
- Inventory recommendations
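
The Pareto step can be sketched on a hypothetical five-item catalog. Items are ranked by total sales, cumulative share is computed, and the count of items at or below the 80% mark is read off. The item numbers and sales figures are invented for illustration.

```python
import pandas as pd

# Hypothetical item totals (not from the Guayas sample)
toy_items = pd.DataFrame({
    "item_nbr": [101, 102, 103, 104, 105],
    "total_sales": [15.0, 50.0, 4.0, 25.0, 6.0],
})

# Rank items from best to worst seller
toy_items = toy_items.sort_values("total_sales", ascending=False).reset_index(drop=True)

# Cumulative share of total sales after each ranked item
toy_items["cumulative_pct"] = (
    toy_items["total_sales"].cumsum() / toy_items["total_sales"].sum() * 100
)

# Items whose cumulative share is still at or below 80%
toy_items_within_80 = int((toy_items["cumulative_pct"] <= 80).sum())

# Share of sales from the top 20% of items (here: 1 of 5)
toy_top20_count = int(len(toy_items) * 0.2)
toy_top20_share = (
    toy_items.head(toy_top20_count)["total_sales"].sum()
    / toy_items["total_sales"].sum() * 100
)

print(toy_items)
print(f"Items within 80%: {toy_items_within_80}, top-20% share: {toy_top20_share:.1f}%")
```

In this toy case the first two items reach 75% and the third crosses 80%, so counting rows with `cumulative_pct <= 80` gives the number of items that stay within the threshold, which can be one fewer than the number needed to reach it.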

In [None]:
# Product analysis - Fast vs Slow movers
print("Product Analysis - Fast vs Slow Movers")
print("=" * 70)

print("\nCalculating sales velocity per item...")

# Calculate total sales and days active per item
item_performance = df.groupby('item_nbr').agg({
    'unit_sales': 'sum',
    'date': 'nunique',
    'family': 'first'
}).reset_index()

item_performance.columns = ['item_nbr', 'total_sales', 'days_active', 'family']

# Calculate velocity (sales per active day)
item_performance['velocity'] = item_performance['total_sales'] / item_performance['days_active']

# Sort by total sales
item_performance = item_performance.sort_values('total_sales', ascending=False).reset_index(drop=True)

print(f"OK - Item performance calculated")
print(f"  Total items: {len(item_performance):,}")
print(f"  Total sales: {item_performance['total_sales'].sum():,.0f} units")

print("\nTop 10 items by total sales:")
print(item_performance.head(10)[['item_nbr', 'family', 'total_sales', 'days_active', 'velocity']])

print("\nBottom 10 items by total sales:")
print(item_performance.tail(10)[['item_nbr', 'family', 'total_sales', 'days_active', 'velocity']])

In [None]:
# Classify fast vs slow movers
print("\nClassifying Fast vs Slow Movers:")
print("=" * 70)

# Define thresholds (top 20% = fast, bottom 20% = slow)
velocity_80th = item_performance['velocity'].quantile(0.80)
velocity_20th = item_performance['velocity'].quantile(0.20)

item_performance['mover_type'] = 'Medium'
item_performance.loc[item_performance['velocity'] >= velocity_80th, 'mover_type'] = 'Fast'
item_performance.loc[item_performance['velocity'] <= velocity_20th, 'mover_type'] = 'Slow'

# Count by type
mover_counts = item_performance['mover_type'].value_counts()
print(f"\nItem classification:")
print(f"  Fast movers (top 20%):    {mover_counts['Fast']:>5,} items ({mover_counts['Fast']/len(item_performance)*100:>5.1f}%)")
print(f"  Medium movers (mid 60%):  {mover_counts['Medium']:>5,} items ({mover_counts['Medium']/len(item_performance)*100:>5.1f}%)")
print(f"  Slow movers (bottom 20%): {mover_counts['Slow']:>5,} items ({mover_counts['Slow']/len(item_performance)*100:>5.1f}%)")

print(f"\nVelocity thresholds:")
print(f"  Fast mover: ≥ {velocity_80th:.2f} units/day")
print(f"  Slow mover: ≤ {velocity_20th:.2f} units/day")

# Sales contribution by mover type
sales_by_type = item_performance.groupby('mover_type')['total_sales'].sum().sort_values(ascending=False)
print(f"\nSales contribution:")
for mover_type, sales in sales_by_type.items():
    pct = sales / item_performance['total_sales'].sum() * 100
    print(f"  {mover_type:<12} {sales:>10,.0f} units ({pct:>5.1f}%)")

# By family
print(f"\nMover classification by family:")
family_movers = item_performance.groupby(['family', 'mover_type']).size().unstack(fill_value=0)
print(family_movers)

In [None]:
# Pareto (80/20) analysis
print("\nPareto (80/20) Analysis:")
print("=" * 70)

# Calculate cumulative sales
item_performance['cumulative_sales'] = item_performance['total_sales'].cumsum()
item_performance['cumulative_pct'] = (item_performance['cumulative_sales'] / 
                                       item_performance['total_sales'].sum() * 100)

# Find 80% threshold
items_for_80pct = (item_performance['cumulative_pct'] <= 80).sum()
pct_items_for_80 = items_for_80pct / len(item_performance) * 100

print(f"\nPareto Principle:")
print(f"  {items_for_80pct:,} items ({pct_items_for_80:.1f}%) generate 80% of sales")
top20_count = int(len(item_performance) * 0.2)
top20_share = item_performance.head(top20_count)['total_sales'].sum() / item_performance['total_sales'].sum() * 100
print(f"  Top 20% of items ({top20_count:,} items) generate {top20_share:.1f}% of sales")

# Visualize Pareto chart
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Cumulative sales curve
ax1 = axes[0]
ax1.plot(range(len(item_performance)), item_performance['cumulative_pct'], 
         color='steelblue', linewidth=2.5)
ax1.axhline(y=80, color='red', linestyle='--', alpha=0.7, label='80% threshold')
ax1.axvline(x=items_for_80pct, color='red', linestyle='--', alpha=0.7)
ax1.fill_between(range(items_for_80pct), 0, 80, alpha=0.2, color='green', 
                  label=f'{items_for_80pct} items for 80%')
ax1.set_xlabel('Number of Items (ranked by sales)', fontsize=11)
ax1.set_ylabel('Cumulative Sales %', fontsize=11)
ax1.set_title('Pareto Chart - Sales Concentration', fontsize=13, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.legend(fontsize=10)
ax1.set_xlim(0, len(item_performance))
ax1.set_ylim(0, 105)

# Plot 2: Sales distribution by mover type
ax2 = axes[1]
colors_map = {'Fast': '#2ca02c', 'Medium': '#ff7f0e', 'Slow': '#d62728'}
sales_data = item_performance.groupby('mover_type')['total_sales'].sum().reindex(['Fast', 'Medium', 'Slow'])
bars = ax2.bar(range(len(sales_data)), sales_data.values, 
               color=[colors_map[x] for x in sales_data.index], alpha=0.7, edgecolor='black')
ax2.set_xticks(range(len(sales_data)))
ax2.set_xticklabels(sales_data.index)
ax2.set_ylabel('Total Sales (units)', fontsize=11)
ax2.set_title('Sales by Mover Type', fontsize=13, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)

# Add percentage labels
for bar, (mover, sales) in zip(bars, sales_data.items()):
    height = bar.get_height()
    pct = sales / item_performance['total_sales'].sum() * 100
    ax2.text(bar.get_x() + bar.get_width()/2., height,
             f'{sales:,.0f}\n({pct:.1f}%)', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.savefig(OUTPUTS / '10_pareto_analysis.png', dpi=300, bbox_inches='tight')
plt.show()

print("\nOK - Pareto visualization saved")

# Hero products by family
print(f"\nHero Products (Top 5 per family):")
for family in item_performance['family'].unique():
    print(f"\n{family}:")
    top_items = item_performance[item_performance['family'] == family].head(5)
    for idx, row in top_items.iterrows():
        print(f"  Item #{row['item_nbr']}: {row['total_sales']:>8,.0f} units ({row['velocity']:>6.2f} units/day) - {row['mover_type']}")

print("\n" + "=" * 70)
print("SECTION 5 COMPLETE - Product Analysis")
print("=" * 70)

print("\n" + "=" * 70)
print("DAY 4 COMPLETE - Temporal Patterns & Product Analysis")
print("=" * 70)

print("\nKey Accomplishments:")
print("  ✓ Rolling statistics features (7/14/30-day)")
print("  ✓ Time series visualization (increasing trend, December +30%)")
print("  ✓ Autocorrelation analysis (strong at all lags)")
print("  ✓ Day-of-week patterns (weekends +34% lift)")
print("  ✓ Payday effects (+11% lift, Day 1 peak)")
print("  ✓ Fast/slow mover classification (20/60/20 split)")
print("  ✓ Pareto analysis (20% items = 58% sales)")

print(f"\nTime spent: ~3.5 hours / 5 hours allocated")
print(f"Status: 1.5 hours under budget! ✓")

print("\nReady for Day 5:")
print("  → Holiday impact analysis")
print("  → Promotion effectiveness")
print("  → Perishable deep dive")
print("  → Final dataset export")