# **1. Dataset Introduction**

**Dataset Name**: Retail Transaction Data (Demand Forecasting)
**Source**: [Kaggle - Retail Transaction Data](https://www.kaggle.com/datasets/mukuldeshantri/ecommerce-data)
**Brief Description**: This dataset contains historical sales transactions, including price, units sold, and date. The main goal of this experiment is to predict the number of units sold (`units_sold`) in future periods to support inventory management (demand forecasting).

Attribute Information:
- `week`: Sale date (weekly)
- `sku_id`: Unique ID for each product
- `store_id`: Unique ID for each store
- `base_price`: Base price of the product
- `total_price`: Selling price of the product (including discounts/promotions)
- `units_sold`: Target variable (number of units sold)
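
To make the schema concrete, here is a minimal sketch that builds a few rows with the columns listed above. All values are invented for illustration and do not come from the dataset, and `discount_pct` is a hypothetical derived column that the original data does not contain.

```python
import pandas as pd

# Toy rows following the schema above; every value is made up for illustration.
sample = pd.DataFrame({
    "week": ["17/01/11", "17/01/11", "24/01/11"],
    "sku_id": [216418, 216418, 216418],
    "store_id": [1, 2, 1],
    "base_price": [100.0, 100.0, 100.0],
    "total_price": [100.0, 90.0, 80.0],
    "units_sold": [20, 28, 35],
})

# Hypothetical derived feature: the relative discount implied by the two price columns.
sample["discount_pct"] = (
    (sample["base_price"] - sample["total_price"]) / sample["base_price"] * 100
)
print(sample)
```

Each row is one product in one store for one week, so a full time series is identified by the pair (`sku_id`, `store_id`).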


# **2. Import Libraries**
This step imports the libraries needed for data processing, visualization, and modeling.


In [None]:
import pandas as pd
import numpy as np
import joblib
import os
import matplotlib.pyplot as plt
import seaborn as sns
from category_encoders import MEstimateEncoder

# Make sure config.py is in the same folder, or define the settings here
try:
    from config import SKU_SPECIFIC_LAGS, SKU_SPECIFIC_MAS
except ImportError:
    # Fallback definitions if config.py is not found
    SKU_SPECIFIC_LAGS = {216418: [1, 2, 3]}
    SKU_SPECIFIC_MAS = {216418: [2, 4]}


# **3. Load the Dataset**
Load the `train.csv` dataset into a pandas DataFrame.


In [None]:
try:
    train = pd.read_csv('../data/train.csv')  # Relative path; adjust to your folder structure
except FileNotFoundError:
    # Fallback path. pd.read_csv raises FileNotFoundError if this file is missing too,
    # so the notebook stops here instead of failing later with a NameError on `train`.
    train = pd.read_csv('data/train.csv')
print("Dataset loaded successfully.")

train.head()
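
If more candidate locations are needed, the fallback logic above can be generalized. The following is only a sketch: `find_first_existing` is a hypothetical helper name, not part of this notebook or of pandas.

```python
from pathlib import Path


def find_first_existing(candidates):
    """Return the first path in `candidates` that exists on disk.

    Raises FileNotFoundError when none of them exist, so a missing
    dataset is reported at load time rather than further down the notebook.
    """
    for candidate in candidates:
        path = Path(candidate)
        if path.exists():
            return path
    raise FileNotFoundError(f"None of these paths exist: {list(candidates)}")


# Possible usage in this notebook (paths taken from the cell above):
# train = pd.read_csv(find_first_existing(['../data/train.csv', 'data/train.csv']))
```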


# **4. Exploratory Data Analysis (EDA)**
Inspect the data structure, data types, missing values, and descriptive statistics.


In [None]:
# Check data info (info() prints its report itself and returns None,
# so it is not wrapped in print())
print("Data Info:")
train.info()

# Check descriptive statistics
print("\nDescriptive Statistics:")
print(train.describe())

# Check missing values
print("\nMissing Values:")
print(train.isnull().sum())

# Check duplicates
print(f"\nNumber of Duplicates: {train.duplicated().sum()}")

# Simple outlier visualization (boxplot of units_sold)
plt.figure(figsize=(8, 4))
sns.boxplot(x=train['units_sold'])
plt.title('Boxplot Units Sold')
plt.show()
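
The boxplot shows outliers visually. To count them, one option is the common 1.5 × IQR rule, which matches the usual default whisker length of a boxplot. The sketch below uses a toy list rather than the real `units_sold` column, and `iqr_outlier_mask` is a hypothetical helper name.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Return a boolean Series marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    s = pd.Series(values, dtype="float64")
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)


# Toy data for illustration only; in the notebook this would be train['units_sold'].
toy_units = [10, 12, 11, 13, 12, 11, 95]
mask = iqr_outlier_mask(toy_units)
print(f"Outliers flagged: {int(mask.sum())} of {len(mask)}")
```

Whether flagged rows should be removed is a separate decision: in sales data, large values are often genuine promotion weeks rather than errors.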


# **5. Data Preprocessing**
Data cleaning and preparation steps:
1. Data Type Conversion (Date)
2. Handling Missing Values
3. Handling Duplicates
4. Encoding (Target Encoding)
5. Feature Engineering (Lags & Moving Averages)
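
Step 5 is easiest to see on a tiny series. The sketch below uses invented values and the same `shift` and `rolling` calls as the preprocessing cell; the column names `lag_1` and `ma_2` follow the notebook's naming pattern.

```python
import pandas as pd

# Toy weekly series for a single product; values are made up for illustration.
toy = pd.DataFrame({
    "week": pd.date_range("2011-01-03", periods=5, freq="7D"),
    "units_sold": [10, 20, 30, 40, 50],
})

# Lag 1: the previous week's sales.
toy["lag_1"] = toy["units_sold"].shift(1)

# Moving average of window 2, shifted by one week so the value for week t
# is the mean of weeks t-2 and t-1 and never includes week t itself.
toy["ma_2"] = toy["units_sold"].rolling(window=2).mean().shift(1)

print(toy)
```

The first rows contain `NaN` because there is no earlier history to look back on, which is why the preprocessing cell ends with `dropna()`.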


In [None]:
print("Starting preprocessing...")

# 1. Convert dates and apply the leap-year fix
# (rows dated on or after 2012-03-06 are shifted back by one day)
train['week'] = pd.to_datetime(train['week'], format='%d/%m/%y')
train.loc[train['week'] >= '2012-03-06', 'week'] -= pd.Timedelta(days=1)

# 2. Handle missing values (total_price): fall back to base_price, i.e. assume no discount
train['total_price'] = train['total_price'].fillna(train['base_price'])
print("Missing values in total_price filled.")

# 3. Handle duplicates (if any)
initial_rows = len(train)
train = train.drop_duplicates()
print(f"Duplicates dropped: {initial_rows - len(train)}")

# 4. Target Encoding
# Note: the encoder is fit on the full training set. If a validation split is
# made later, refitting on the training portion only avoids target leakage.
print("Applying target encoding...")
encoder = MEstimateEncoder(cols=['store_id'])
encoder.fit(train[['store_id']], train['units_sold'])

# Save the encoder
os.makedirs('model_artifacts', exist_ok=True)
joblib.dump(encoder, 'model_artifacts/store_encoder.pkl')
train['store_encoded'] = encoder.transform(train[['store_id']])

# 5. Feature Engineering (Lags & MA)
# Keep only the SKUs used in the experiment (for efficiency)
target_skus = list(SKU_SPECIFIC_LAGS.keys())
train = train[train['sku_id'].isin(target_skus)].copy()

print("Building time-series features...")
df_list = []
for sku_id in target_skus:
    df_sku = train[train['sku_id'] == sku_id].sort_values(['store_id', 'week']).copy()
    # Group by store so a shift never crosses from one store's series into another's
    # (assumes one row per store and week for each SKU)
    by_store = df_sku.groupby('store_id')['units_sold']

    # Lag
    lags = SKU_SPECIFIC_LAGS.get(sku_id, [1])
    for lag in lags:
        df_sku[f'lag_{lag}'] = by_store.shift(lag)

    # Moving average over the previous `window` weeks (current week excluded)
    mas = SKU_SPECIFIC_MAS.get(sku_id, [1])
    for window in mas:
        df_sku[f'ma_{window}'] = by_store.transform(
            lambda s: s.rolling(window=window).mean().shift(1)
        )

    df_list.append(df_sku)

df_final = pd.concat(df_list)
df_final = df_final.dropna()
print(f"Data final shape: {df_final.shape}")
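
Step 4 above relies on `MEstimateEncoder`. As a rough illustration of what an M-estimate encoding computes, the sketch below applies the formula by hand with plain pandas. It assumes a smoothing value `m = 1.0` (which may differ from the setting actually in effect for the encoder) and uses invented toy values, so it is not a drop-in replacement for the library.

```python
import pandas as pd

# Toy data for illustration only: store A has three observations, store B one.
toy = pd.DataFrame({
    "store_id": ["A", "A", "A", "B"],
    "units_sold": [10, 20, 30, 100],
})

m = 1.0  # smoothing strength (assumed value; check the encoder's own setting)
prior = toy["units_sold"].mean()  # global mean of the target

# M-estimate per category: (sum of target + m * prior) / (count + m)
stats = toy.groupby("store_id")["units_sold"].agg(["sum", "count"])
encoding = (stats["sum"] + m * prior) / (stats["count"] + m)

toy["store_encoded"] = toy["store_id"].map(encoding)
print(encoding)
```

Store B has a raw mean of 100 but only one observation, so its encoded value is pulled strongly toward the global mean, while store A's moves less. This shrinkage is the reason to prefer a smoothed encoding over a plain per-store mean.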


# **6. Save Results**
Save the processed data to `train_processed.csv`.


In [None]:
os.makedirs('data/processed', exist_ok=True)
output_path = 'data/processed/train_processed.csv'
df_final.to_csv(output_path, index=False)
print(f"Data saved to {output_path}")
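
One thing to keep in mind when reading the saved file back: CSV stores dates as text, so the `week` column loses its datetime type unless it is parsed again. The sketch below demonstrates this on a toy frame written to a temporary directory; the file name `toy_processed.csv` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for df_final; values are made up for illustration.
toy = pd.DataFrame({
    "week": pd.to_datetime(["2011-01-17", "2011-01-24"]),
    "units_sold": [20, 35],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_processed.csv")
    toy.to_csv(path, index=False)

    plain = pd.read_csv(path)                         # 'week' comes back as text
    parsed = pd.read_csv(path, parse_dates=["week"])  # 'week' restored as datetime

print(plain["week"].dtype, parsed["week"].dtype)
```

In a later modeling notebook, passing `parse_dates=['week']` when loading `train_processed.csv` would keep time-based splits and sorting working as expected.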
