# Supply Chain Demand Forecasting Experiment
**Name**: I Made Wisnu Adi Sanjaya  
**Project**: MLOps - Retail Demand Prediction

---

## 1. Dataset Introduction

### General Information
- **Dataset Name**: Supply Chain Demand Forecasting
- **Source**: [Kaggle - Demand Forecasting](https://www.kaggle.com/datasets/aswathrao/demand-forecasting)
- **Goal**: Build a forecasting model that predicts the number of units sold (`units_sold`) in order to optimize stock management and the supply chain

### Description
This dataset contains **2-3 years** of sales transaction history for multiple products (SKUs) across **10 stores**. It includes temporal, location, and pricing information that can be used to predict product demand.

### Dataset Attributes
| Column | Type | Description |
|--------|------|-------------|
| `week` | Date | Sales date/week |
| `store_id` | Categorical | Unique store ID |
| `sku_id` | Categorical | Product ID (Stock Keeping Unit) |
| `base_price` | Numeric | Base price of the product |
| `total_price` | Numeric | Final price after promotions/discounts |
| `units_sold` | Numeric | **Target**: Number of units sold (the value to predict) |
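
The attribute table can double as a lightweight schema check before any analysis runs. The following is a minimal sketch, assuming the CSV uses exactly the column names listed in the table; the helper name `missing_columns` and the sample row values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Column names taken from the attribute table
EXPECTED_COLUMNS = ['week', 'store_id', 'sku_id', 'base_price', 'total_price', 'units_sold']

def missing_columns(df: pd.DataFrame, expected=EXPECTED_COLUMNS) -> list:
    """Return the expected columns that are absent from df, in table order."""
    return [col for col in expected if col not in df.columns]

# Hypothetical one-row frame that lacks total_price
sample = pd.DataFrame({
    'week': ['17/01/11'],
    'store_id': [1],
    'sku_id': [216418],
    'base_price': [100.0],
    'units_sold': [20],
})
print(missing_columns(sample))
```

An empty list means every expected column is present.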

### Use Case
This forecasting model helps retailers with:
- Optimizing inventory management
- Preventing stockouts and overstock
- More accurate supply chain planning
- Data-driven pricing and promotion strategies


## 2. Importing Libraries

Import the libraries needed for:
- **Data Processing**: pandas, numpy
- **Visualization**: matplotlib, seaborn
- **Encoding**: category_encoders
- **Persistence**: joblib


In [None]:
import pandas as pd
import numpy as np
import joblib
import os
import matplotlib.pyplot as plt
import seaborn as sns
from category_encoders import MEstimateEncoder
import warnings
warnings.filterwarnings('ignore')

# Set the style for visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

# Import config (fall back to defaults if it is missing)
try:
    from config import SKU_SPECIFIC_LAGS, SKU_SPECIFIC_MAS
    print("✓ Config loaded successfully")
except ImportError:
    print("⚠ config.py not found, using default values")
    SKU_SPECIFIC_LAGS = {216418: [1, 2, 3]}
    SKU_SPECIFIC_MAS = {216418: [2, 4]}

print(f"Target SKU: {list(SKU_SPECIFIC_LAGS.keys())}")
print(f"Lag features: {SKU_SPECIFIC_LAGS}")
print(f"Moving average windows: {SKU_SPECIFIC_MAS}")

## 3. Loading the Dataset

Load the training data from a CSV file into a pandas DataFrame.
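
Because the notebook may be run from different working directories, the file can sit at more than one relative path. The following is a minimal sketch of a path lookup, assuming the two candidate locations used in this notebook; the helper name `first_existing` is hypothetical.

```python
from pathlib import Path
from typing import Iterable, Optional

def first_existing(candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate that points to an existing file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None

# Candidate locations mirror the ones tried in this notebook
data_path = first_existing(['../data/train.csv', 'data/train.csv'])
print(data_path)
```

A `None` result signals that the file is in neither location, which the caller can turn into a clear error message.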

In [None]:
# Load the dataset with error handling
try:
    train = pd.read_csv('../data/train.csv')
    print("✓ Dataset loaded successfully from ../data/train.csv")
except FileNotFoundError:
    if os.path.exists('data/train.csv'):
        train = pd.read_csv('data/train.csv')
        print("✓ Dataset loaded successfully from data/train.csv")
    else:
        raise FileNotFoundError("train.csv not found in expected locations")

print(f"\nDataset Shape: {train.shape}")
print(f"Total Records: {len(train):,}")
print(f"Total Features: {train.shape[1]}")

### Data Preview
Look at a sample of the data to understand the dataset's structure.

In [None]:
# Show the first 10 rows
train.head(10)

## 4. Exploratory Data Analysis (EDA)

An in-depth analysis of the dataset to understand:
- Structure and data types
- Missing values
- Statistical distributions
- Outliers
- Temporal patterns

### 4.1 Basic Dataset Information

In [None]:
# Dataset info
print("=" * 60)
print("DATASET INFORMATION")
print("=" * 60)
# info() prints its report directly and returns None, so it is not wrapped in print()
train.info()

print("\n" + "=" * 60)
print("COLUMN DATA TYPES")
print("=" * 60)
print(train.dtypes)

### 4.2 Descriptive Statistics

In [None]:
# Statistik deskriptif
print("=" * 60)
print("DESCRIPTIVE STATISTICS - NUMERIC FEATURES")
print("=" * 60)
display(train.describe().T)

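# Summary of categorical features
print("\n" + "=" * 60)
print("CATEGORICAL FEATURES SUMMARY")
print("=" * 60)
print(f"Unique Stores: {train['store_id'].nunique()}")
print(f"Unique SKUs: {train['sku_id'].nunique()}")
# 'week' is still a dd/mm/yy string here, so parse it first; min/max on strings is alphabetical
week_parsed = pd.to_datetime(train['week'], format='%d/%m/%y')
print(f"Date Range: {week_parsed.min().date()} to {week_parsed.max().date()}")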

### 4.3 Data Quality Check

In [None]:
# Missing values analysis
print("=" * 60)
print("MISSING VALUES ANALYSIS")
print("=" * 60)
missing = train.isnull().sum()
missing_pct = (missing / len(train) * 100).round(2)
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage (%)': missing_pct
})
display(missing_df[missing_df['Missing Count'] > 0])

# Duplicates check
duplicates = train.duplicated().sum()
print(f"\nDuplicate Rows: {duplicates} ({duplicates/len(train)*100:.2f}%)")

# Zero values in target
zero_units = (train['units_sold'] == 0).sum()
print(f"Zero Units Sold: {zero_units} ({zero_units/len(train)*100:.2f}%)")

### 4.4 Data Distribution Visualizations

In [None]:
# Distribution of target variable
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Histogram
axes[0].hist(train['units_sold'], bins=50, edgecolor='black', alpha=0.7, color='skyblue')
axes[0].set_xlabel('Units Sold', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Units Sold', fontsize=14, fontweight='bold')
axes[0].axvline(train['units_sold'].mean(), color='red', linestyle='--', linewidth=2, label=f'Mean: {train["units_sold"].mean():.2f}')
axes[0].legend()

# Boxplot
axes[1].boxplot(train['units_sold'], vert=True, patch_artist=True,
                boxprops=dict(facecolor='lightblue', alpha=0.7),
                medianprops=dict(color='red', linewidth=2))
axes[1].set_ylabel('Units Sold', fontsize=12)
axes[1].set_title('Boxplot of Units Sold (Outlier Detection)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Mean Units Sold: {train['units_sold'].mean():.2f}")
print(f"Median Units Sold: {train['units_sold'].median():.2f}")
print(f"Std Dev: {train['units_sold'].std():.2f}")
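
The boxplot flags outliers visually. The same 1.5 × IQR convention that matplotlib uses by default for its whiskers can also be applied numerically to count them. The following is a minimal sketch on made-up values; the helper name `iqr_outlier_mask` and the sample series are hypothetical.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Mark values that lie more than k * IQR outside the quartiles."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up sales counts with one obvious spike
units = pd.Series([10, 12, 11, 13, 12, 11, 95])
mask = iqr_outlier_mask(units)
print(units[mask].tolist())
print(f"Outliers: {mask.sum()} of {len(units)}")
```

On the real data, the same function could be called as `iqr_outlier_mask(train['units_sold'])`.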

In [None]:
# Price distributions
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Base price
axes[0].hist(train['base_price'].dropna(), bins=30, edgecolor='black', alpha=0.7, color='lightgreen')
axes[0].set_xlabel('Base Price', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Distribution of Base Price', fontsize=14, fontweight='bold')

# Total price
axes[1].hist(train['total_price'].dropna(), bins=30, edgecolor='black', alpha=0.7, color='lightcoral')
axes[1].set_xlabel('Total Price', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Distribution of Total Price', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.show()

### 4.5 Categorical Feature Analysis

In [None]:
# Distribution by store and SKU
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Sales by store
store_sales = train.groupby('store_id')['units_sold'].sum().sort_values(ascending=False)
axes[0].bar(range(len(store_sales)), store_sales.values, color='steelblue', alpha=0.7)
axes[0].set_xlabel('Store ID (sorted by total sales)', fontsize=12)
axes[0].set_ylabel('Total Units Sold', fontsize=12)
axes[0].set_title('Total Sales by Store', fontsize=14, fontweight='bold')

# Top 15 SKUs by sales
sku_sales = train.groupby('sku_id')['units_sold'].sum().sort_values(ascending=False).head(15)
axes[1].barh(range(len(sku_sales)), sku_sales.values, color='coral', alpha=0.7)
axes[1].set_xlabel('Total Units Sold', fontsize=12)
axes[1].set_ylabel('SKU ID (Top 15)', fontsize=12)
axes[1].set_title('Top 15 SKUs by Total Sales', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()

plt.tight_layout()
plt.show()

print(f"\nTop 3 Stores by Sales:")
print(store_sales.head(3))
print(f"\nTop 3 SKUs by Sales:")
print(sku_sales.head(3))
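
The temporal patterns listed at the start of this section can be inspected by aggregating sales per week. The following is a minimal sketch on made-up rows, assuming the `dd/mm/yy` date format that the preprocessing step uses; the sample values are hypothetical.

```python
import pandas as pd

# Made-up rows in the raw dd/mm/yy string format
sales = pd.DataFrame({
    'week': ['24/01/11', '17/01/11', '17/01/11', '24/01/11'],
    'units_sold': [5, 20, 10, 7],
})

# Parse the strings first so that sorting is chronological, not alphabetical
sales['week'] = pd.to_datetime(sales['week'], format='%d/%m/%y')

# Total units sold per week, in time order
weekly_total = sales.groupby('week')['units_sold'].sum().sort_index()
print(weekly_total)
```

The resulting series is indexed by date, so it can be passed straight to a line plot to look for trend or seasonality.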

## 5. Data Preprocessing

The cleaning and transformation steps:
1. **Data Type Conversion**: Date parsing and type conversion
2. **Handling Missing Values**: Imputing missing values
3. **Handling Duplicates**: Removing duplicate rows
4. **Encoding**: Target encoding for categorical features
5. **Feature Engineering**: Creating lag features and moving averages for the time series


### 5.1 Date Conversion & Leap Year Fix
Handle the leap year issue that causes a date parsing error.
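
One way to see whether a one-day shift is needed, and whether it worked, is to count the gaps between consecutive unique dates. The following is a minimal sketch on synthetic dates built for illustration (they are not taken from the dataset), assuming weekly data with a single 8-day gap; the helper name `week_gaps` is hypothetical.

```python
import pandas as pd

def week_gaps(dates: pd.Series) -> pd.Series:
    """Count how often each day gap occurs between consecutive unique dates."""
    unique_dates = dates.drop_duplicates().sort_values().reset_index(drop=True)
    return unique_dates.diff().dropna().dt.days.value_counts().sort_index()

# Synthetic weekly dates with one 8-day gap after 28 Feb 2012
dates = pd.Series(pd.to_datetime(
    ['2012-02-14', '2012-02-21', '2012-02-28', '2012-03-07', '2012-03-14']
))
gaps_before = week_gaps(dates)

# Shift every date on or after 2012-03-06 back by one day
fixed = dates.copy()
mask = fixed >= pd.Timestamp('2012-03-06')
fixed[mask] = fixed[mask] - pd.Timedelta(days=1)
gaps_after = week_gaps(fixed)

print(gaps_before.to_dict())
print(gaps_after.to_dict())
```

If every gap is 7 days after the shift, the weekly spacing is consistent and lag features line up with calendar weeks.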

In [None]:
print("Step 1: Date Conversion & Leap Year Fix")
print("=" * 60)

# Convert to datetime
train['week'] = pd.to_datetime(train['week'], format='%d/%m/%y')

# Leap-year fix: shift every date on or after 2012-03-06 back by one day
train.loc[train['week'] >= '2012-03-06', 'week'] -= pd.Timedelta(days=1)

print(f"✓ Date converted successfully")
print(f"Date Range: {train['week'].min()} to {train['week'].max()}")
print(f"Total Days Covered: {(train['week'].max() - train['week'].min()).days}")

### 5.2 Handling Missing Values
Fill missing values in `total_price` with `base_price`.
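
Passing a Series to `fillna` fills each missing entry with the value at the same index in that Series, so every row falls back to its own `base_price`. The following is a minimal sketch on made-up prices. It assumes that a missing `total_price` means no discount was applied, which is an assumption of this imputation rather than something the data confirms.

```python
import numpy as np
import pandas as pd

# Made-up prices; the second row has no recorded total_price
prices = pd.DataFrame({
    'base_price': [100.0, 80.0, 50.0],
    'total_price': [90.0, np.nan, 50.0],
})

# Row-wise fallback: only the missing entry is replaced, by its own base_price
prices['total_price'] = prices['total_price'].fillna(prices['base_price'])
print(prices)
```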

In [None]:
print("Step 2: Handling Missing Values")
print("=" * 60)

missing_before = train['total_price'].isnull().sum()
train['total_price'] = train['total_price'].fillna(train['base_price'])
missing_after = train['total_price'].isnull().sum()

print(f"✓ Missing values in total_price filled")
print(f"Before: {missing_before} missing values")
print(f"After: {missing_after} missing values")
print(f"Filled: {missing_before - missing_after} values")

### 5.3 Handling Duplicates

In [None]:
print("Step 3: Handling Duplicates")
print("=" * 60)

initial_rows = len(train)
train = train.drop_duplicates()
final_rows = len(train)

print(f"✓ Duplicates removed")
print(f"Initial rows: {initial_rows:,}")
print(f"Final rows: {final_rows:,}")
print(f"Removed: {initial_rows - final_rows:,} duplicate rows")

### 5.4 Target Encoding
Use the M-Estimate Encoder to convert `store_id` into a numeric value derived from the target variable.
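
The idea behind M-estimate encoding is to replace each category with its target mean, pulled toward the global mean by a smoothing weight `m`. The following is a minimal sketch in plain pandas, assuming the smoothing form (category target sum + m × global mean) / (category count + m) with `m = 1`. It is an illustration only, not the `category_encoders` implementation, whose defaults and details may differ; the helper name `m_estimate_encode` and the toy data are hypothetical.

```python
import pandas as pd

def m_estimate_encode(categories: pd.Series, target: pd.Series, m: float = 1.0) -> pd.Series:
    """Smoothed per-category target mean: (sum + m * prior) / (count + m)."""
    prior = target.mean()  # global mean of the target
    stats = target.groupby(categories).agg(['sum', 'count'])
    encoding = (stats['sum'] + m * prior) / (stats['count'] + m)
    return categories.map(encoding)

# Toy data: store A has two rows, store B has one
toy = pd.DataFrame({'store_id': ['A', 'A', 'B'], 'units_sold': [10, 20, 30]})
toy['store_encoded'] = m_estimate_encode(toy['store_id'], toy['units_sold'])
print(toy)
```

Store B has a raw mean of 30 but only one row, so its encoded value is pulled toward the global mean of 20. Categories with few observations are smoothed more strongly than well-populated ones.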

In [None]:
print("Step 4: Target Encoding")
print("=" * 60)

# Initialize encoder
encoder = MEstimateEncoder(cols=['store_id'])
encoder.fit(train[['store_id']], train['units_sold'])

# Save encoder for inference
os.makedirs('model_artifacts', exist_ok=True)
joblib.dump(encoder, 'model_artifacts/store_encoder.pkl')

# Transform
train['store_encoded'] = encoder.transform(train[['store_id']])

print(f"✓ Target encoding completed")
print(f"Encoder saved to: model_artifacts/store_encoder.pkl")
print(f"\nSample encoded values:")
display(train[['store_id', 'store_encoded']].drop_duplicates().head(10))

### 5.5 Feature Engineering: Time Series Features

Create features for time series forecasting:
- **Lag Features**: Sales values at t-1, t-2, t-3
- **Moving Averages**: Rolling averages that capture the trend

**Note**: For efficiency, we keep only the SKUs listed in the config.
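
The following is a minimal sketch of how `shift` and `rolling` behave on a short made-up series. It shows that shifting the rolling mean by one step makes each row see only earlier weeks, which keeps the current target out of its own features.

```python
import pandas as pd

# Made-up weekly sales for a single store and SKU
demo = pd.DataFrame({'units_sold': [10, 20, 30, 40, 50]})

# lag_1: the previous week's sales
demo['lag_1'] = demo['units_sold'].shift(1)

# ma_2: mean of the two weeks before the current one
demo['ma_2'] = demo['units_sold'].rolling(window=2).mean().shift(1)

print(demo)
print(f"Rows left after dropna: {len(demo.dropna())}")
```

In the third row (`units_sold` = 30), `lag_1` is 20 and `ma_2` is 15, the mean of 10 and 20. The first rows have no complete history and contain NaN, which is why they are dropped later.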

In [None]:
print("Step 5: Feature Engineering - Time Series Features")
print("=" * 60)

# Filter target SKUs
target_skus = list(SKU_SPECIFIC_LAGS.keys())
print(f"Target SKUs for model: {target_skus}")

rows_before_filter = len(train)
train = train[train['sku_id'].isin(target_skus)].copy()
rows_after_filter = len(train)

print(f"\nFiltering completed:")
print(f"Before: {rows_before_filter:,} rows")
print(f"After: {rows_after_filter:,} rows")
print(f"Filtered out: {rows_before_filter - rows_after_filter:,} rows")

In [None]:
print("Creating lag features and moving averages...")
print("=" * 60)

df_list = []

for sku_id in target_skus:
    print(f"\nProcessing SKU: {sku_id}")
    # Sort by store, then week, so each store's history is contiguous and chronological
    df_sku = train[train['sku_id'] == sku_id].copy().sort_values(['store_id', 'week'])
    # Group by store so lags and averages never mix sales from different stores
    sales_by_store = df_sku.groupby('store_id')['units_sold']

    # Create lag features
    lags = SKU_SPECIFIC_LAGS.get(sku_id, [1])
    print(f"  Creating lag features: {lags}")
    for lag in lags:
        df_sku[f'lag_{lag}'] = sales_by_store.shift(lag)

    # Create moving average features (shifted one step so only past weeks are used)
    mas = SKU_SPECIFIC_MAS.get(sku_id, [1])
    print(f"  Creating moving averages: {mas}")
    for window in mas:
        df_sku[f'ma_{window}'] = sales_by_store.transform(
            lambda s: s.rolling(window=window).mean().shift(1)
        )

    print(f"  Rows for SKU {sku_id}: {len(df_sku)}")
    df_list.append(df_sku)

# Concatenate all SKUs
df_final = pd.concat(df_list, ignore_index=True)
rows_before_dropna = len(df_final)
print(f"\n{'='*60}")
print(f"Total rows before dropna: {rows_before_dropna:,}")

# Remove rows with NaN (from lag/MA operations)
df_final = df_final.dropna()
print(f"Total rows after dropna: {len(df_final):,}")
print(f"Dropped: {rows_before_dropna - len(df_final):,} rows with NaN")

print(f"\n✓ Feature engineering completed")
print(f"Final dataset shape: {df_final.shape}")

### 5.6 Preprocessing Results
Preview the processed data that is ready for modeling.

In [None]:
print("Final Processed Dataset Preview")
print("=" * 60)
print(f"Shape: {df_final.shape}")
print(f"\nFeatures: {df_final.columns.tolist()}")
print(f"\nSample Data:")
display(df_final.head(10))

print(f"\nFeature Statistics:")
display(df_final.describe().T)

## 6. Saving the Results

Save the processed data to a CSV file for use in the modeling stage.
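
CSV does not store column types, so the `week` column comes back as plain strings unless it is parsed again on load. The following is a minimal sketch of the round trip, using an in-memory buffer and made-up rows in place of the real file.

```python
import io
import pandas as pd

# Made-up processed rows with a proper datetime column
processed = pd.DataFrame({
    'week': pd.to_datetime(['2011-01-17', '2011-01-24']),
    'units_sold': [20, 28],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
processed.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype when reading back
reloaded = pd.read_csv(buffer, parse_dates=['week'])
print(reloaded.dtypes)
```

The modeling stage can apply the same `parse_dates=['week']` argument when it reads the saved file.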

In [None]:
# Create output directory
os.makedirs('data/processed', exist_ok=True)
output_path = 'data/processed/train_processed.csv'

# Save to CSV
df_final.to_csv(output_path, index=False)

print("=" * 60)
print("DATA PREPROCESSING COMPLETED")
print("=" * 60)
print(f"✓ Processed data saved to: {output_path}")
print(f"\nFinal Statistics:")
print(f"  - Total Records: {len(df_final):,}")
print(f"  - Total Features: {df_final.shape[1]}")
print(f"  - File Size: {os.path.getsize(output_path) / 1024:.2f} KB")
print(f"\nData is ready for model training!")

## 7. Summary

### Preprocessing Pipeline Summary

| Step | Action | Result |
|------|--------|--------|
| 1 | Date Conversion | Parsed dates and applied the leap year fix |
| 2 | Missing Values | Filled total_price with base_price |
| 3 | Duplicates | Removed duplicate rows |
| 4 | Encoding | Applied M-Estimate encoding to store_id |
| 5 | Filtering | Selected target SKUs for modeling |
| 6 | Feature Engineering | Created lag features & moving averages |

### Features Created
- **Original Features**: week, store_id, sku_id, base_price, total_price, units_sold
- **Engineered Features**: 
  - `store_encoded`: Target-encoded store values
  - `lag_1`, `lag_2`, `lag_3`: Previous time step values
  - `ma_2`, `ma_4`: Moving averages

### Next Steps
1. ✅ Data preprocessing completed
2. → Model training with `modelling.py`
3. → Hyperparameter tuning with `modelling_tuning.py`
4. → Model deployment & monitoring

---
**End of Notebook**