# Step 3: Data Preprocessing

**Objective:** Prepare data for ARIMA, Prophet, and LSTM modeling

**Based on Step 2 Findings:**
- Non-stationary series → d=1 differencing required
- Weak seasonality → test both ARIMA and SARIMA
- High volatility → feature engineering important

**Tasks:**
1. Train/test split (temporal)
2. Feature engineering (lags, rolling, differences, time)
3. Scaling (for LSTM)
4. Model-specific preparations
5. Save processed datasets

---

## 3.1 Setup and Load Data

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import sys
import warnings
warnings.filterwarnings('ignore')

# Add project root to path
project_root = Path.cwd().parent
sys.path.append(str(project_root))

# Import custom modules
from src.data_loader import load_data
from src import preprocessing as prep
from config.config import TEST_SIZE, RAW_DATA_PATH, PROCESSED_DATA_PATH, LSTM_SEQUENCE_LENGTH

# Display settings
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-darkgrid')

print("[OK] All imports successful!")
print(f"\n[USER INPUT] Configuration:")
print(f"  - Test Size: {TEST_SIZE} months")
print(f"  - LSTM Sequence Length: {LSTM_SEQUENCE_LENGTH} months")

In [None]:
# Load the dataset
df = load_data(filepath=RAW_DATA_PATH, sheet_name='Monthly')
print(f"\nOriginal dataset: {df.shape}")
df.head()

## 3.2 Train/Test Split

**Strategy:** Time-aware split
- Last 12 months (USER INPUT: TEST_SIZE) = Test set
- Earlier data = Training set
- No random shuffling (preserves temporal order)
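
The strategy above can be sketched in plain pandas. This is a minimal illustration on synthetic data, not the implementation of `prep.train_test_split_ts`; the helper name `temporal_split` and the demo frame are assumptions made for the example.

```python
import pandas as pd


def temporal_split(data, date_col, test_size):
    """Hypothetical helper: hold out the last `test_size` rows by date."""
    ordered = data.sort_values(date_col).reset_index(drop=True)
    train = ordered.iloc[:-test_size].copy()
    test = ordered.iloc[-test_size:].copy()
    return train, test


# Synthetic monthly frame standing in for the real dataset
demo = pd.DataFrame({
    'observation_date': pd.date_range('2020-01-01', periods=36, freq='MS'),
    'value': range(36),
})
demo_train, demo_test = temporal_split(demo, 'observation_date', test_size=12)
print(len(demo_train), len(demo_test))
```

Every training date precedes every test date, which is the property a random split would break.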

In [None]:
# Perform train/test split
train_df, test_df = prep.train_test_split_ts(
    data=df,
    date_col='observation_date',
    test_size=TEST_SIZE  # USER INPUT from config
)

print(f"\nTrain shape: {train_df.shape}")
print(f"Test shape: {test_df.shape}")

In [None]:
# Visualize train/test split
plt.figure(figsize=(16, 6))
plt.plot(train_df['observation_date'], train_df['WPU101704'], 
         label='Train', color='blue', linewidth=1.5)
plt.plot(test_df['observation_date'], test_df['WPU101704'], 
         label='Test', color='red', linewidth=1.5)
plt.axvline(x=test_df['observation_date'].iloc[0], color='green', 
           linestyle='--', linewidth=2, label='Split Point')
plt.title('Train/Test Split Visualization', fontsize=14, fontweight='bold')
plt.xlabel('Date', fontsize=12)
plt.ylabel('PPI Value', fontsize=12)
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

## 3.3 Feature Engineering

Create features for machine learning models. These feed potential ML baselines; the LSTM pipeline in Section 3.6.3 is built from the scaled target series alone.
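
As a rough picture of what lag, rolling, and difference features look like, here is a small pandas sketch on made-up numbers. The column names and the use of `shift`, `rolling`, and `diff` are assumptions for illustration; the `prep` helpers may compute these differently.

```python
import pandas as pd

demo = pd.DataFrame({'value': [10.0, 12.0, 11.0, 15.0, 14.0, 18.0]})

# Lag: the value observed k rows (months) earlier
demo['value_lag_1'] = demo['value'].shift(1)

# Rolling statistics over a 3-row window; incomplete windows are NaN
demo['value_rolling_mean_3'] = demo['value'].rolling(window=3).mean()
demo['value_rolling_std_3'] = demo['value'].rolling(window=3).std()

# First difference: change from the previous row
demo['value_diff_1'] = demo['value'].diff(1)

print(demo)
```

Note that a rolling window computed this way includes the current row. If the current value is also the prediction target, shifting the series by one step before rolling keeps the target out of its own predictors.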

In [None]:
# Create lag features
print("Creating lag features...")
train_fe = prep.create_lag_features(
    data=train_df,
    value_col='WPU101704',
    lags=[1, 3, 6, 12]  # 1m, 3m, 6m, 1y lags
)

test_fe = prep.create_lag_features(
    data=test_df,
    value_col='WPU101704',
    lags=[1, 3, 6, 12]
)

print(f"\nColumns after lag features: {len(train_fe.columns)}")

In [None]:
# Create rolling features
print("Creating rolling features...")
train_fe = prep.create_rolling_features(
    data=train_fe,
    value_col='WPU101704',
    windows=[3, 6, 12]  # 3m, 6m, 1y rolling stats
)

test_fe = prep.create_rolling_features(
    data=test_fe,
    value_col='WPU101704',
    windows=[3, 6, 12]
)

print(f"\nColumns after rolling features: {len(train_fe.columns)}")

In [None]:
# Create difference features (based on Step 2: d=1 needed)
print("Creating difference features...")
train_fe = prep.create_difference_features(
    data=train_fe,
    value_col='WPU101704',
    periods=[1, 12]  # First difference and seasonal difference
)

test_fe = prep.create_difference_features(
    data=test_fe,
    value_col='WPU101704',
    periods=[1, 12]
)

print(f"\nColumns after difference features: {len(train_fe.columns)}")

In [None]:
# Create time-based features
print("Creating time features...")
train_fe = prep.create_time_features(
    data=train_fe,
    date_col='observation_date'
)

test_fe = prep.create_time_features(
    data=test_fe,
    date_col='observation_date'
)

print(f"\nTotal columns after all features: {len(train_fe.columns)}")
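
Cyclical encoding maps the month onto a circle so that December and January end up adjacent instead of 11 units apart. The sketch below assumes the common sin/cos formulation with a period of 12; the formula used inside `prep.create_time_features` is not shown in this notebook, and `circle_distance` is a hypothetical helper for the demonstration.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2023-01-01', periods=12, freq='MS')
demo = pd.DataFrame({'observation_date': dates})

demo['month'] = demo['observation_date'].dt.month
# Assumed formulation: angle = 2*pi*month/12
angle = 2 * np.pi * demo['month'] / 12
demo['month_sin'] = np.sin(angle)
demo['month_cos'] = np.cos(angle)


def circle_distance(m1, m2):
    """Euclidean distance between two months in (sin, cos) space."""
    a = demo.loc[demo['month'] == m1, ['month_sin', 'month_cos']].to_numpy()[0]
    b = demo.loc[demo['month'] == m2, ['month_sin', 'month_cos']].to_numpy()[0]
    return float(np.linalg.norm(a - b))


# December-January should be as close as any other pair of adjacent months
print(circle_distance(12, 1), circle_distance(1, 2), circle_distance(6, 1))
```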

In [None]:
# Preview engineered features
print("\nSample of engineered features:")
print("\nTrain:")
print(train_fe.head(15))

print("\nTest:")
print(test_fe.head())

In [None]:
# List all feature columns
print("\nAll columns in processed dataset:")
for i, col in enumerate(train_fe.columns, 1):
    print(f"{i:2d}. {col}")

## 3.4 Handle NaN Values

Feature engineering leaves NaN values in the earliest rows of each frame, because lag, rolling, and difference features have no history to draw on there.
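
To see where the NaN rows come from, consider a toy series: a lag of k leaves the first k rows empty, and a rolling window of w leaves the first w - 1 rows empty. The sketch uses plain pandas with made-up column names and assumes the `prep` helpers behave like `shift` and `rolling` with default settings, which this notebook does not confirm.

```python
import pandas as pd

demo = pd.DataFrame({'value': [float(v) for v in range(1, 25)]})  # 24 rows
demo['lag_12'] = demo['value'].shift(12)
demo['rolling_mean_12'] = demo['value'].rolling(window=12).mean()

# NaN count per column: driven by the longest lag / window
nan_counts = demo.isnull().sum()
print(nan_counts)

# Rows that survive dropna(): limited by the column with the most NaN values
rows_left = len(demo.dropna())
print(rows_left)
```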

In [None]:
# Check for NaN values
print("NaN values in engineered features:")
print("\nTrain:")
print(train_fe.isnull().sum())

print("\nTest:")
print(test_fe.isnull().sum())

In [None]:
# Create version WITHOUT NaN for ML models
train_clean = train_fe.dropna().copy()
test_clean = test_fe.dropna().copy()

print(f"\nAfter dropping NaN:")
print(f"  Train: {len(train_fe)} -> {len(train_clean)} rows ({len(train_fe) - len(train_clean)} dropped)")
print(f"  Test: {len(test_fe)} -> {len(test_clean)} rows ({len(test_fe) - len(test_clean)} dropped)")

print(f"\nTrain date range (after dropna): {train_clean['observation_date'].min()} to {train_clean['observation_date'].max()}")
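
One caveat: when features are computed on the test frame in isolation, a lag of 12 on a 12-row frame has no history to draw from, so `dropna()` can remove every test row. A common workaround is to build the features on the full, ordered series and split afterwards; lags only look backwards, so test rows then take their history from the training period. The sketch below illustrates the difference with plain pandas on synthetic data. Whether the `prep` functions behave the same way is an assumption.

```python
import pandas as pd

TEST_ROWS = 12  # stands in for TEST_SIZE

full = pd.DataFrame({
    'observation_date': pd.date_range('2020-01-01', periods=48, freq='MS'),
    'value': [float(v) for v in range(48)],
})

# Option A: lag computed on the test frame alone
test_alone = full.iloc[-TEST_ROWS:].copy()
test_alone['lag_12'] = test_alone['value'].shift(12)
rows_alone = len(test_alone.dropna())

# Option B: lag computed on the full series, then split
full_fe = full.copy()
full_fe['lag_12'] = full_fe['value'].shift(12)
test_joint = full_fe.iloc[-TEST_ROWS:].dropna()
rows_joint = len(test_joint)

print(rows_alone, rows_joint)
```

Option B corresponds to one-step-ahead evaluation, where each test row may use actual values observed before it.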

## 3.5 Preprocessing Summary

In [None]:
# Display summary
features_list = [
    'WPU101704_lag_1', 'WPU101704_lag_3', 'WPU101704_lag_6', 'WPU101704_lag_12',
    'WPU101704_rolling_mean_3', 'WPU101704_rolling_std_3',
    'WPU101704_rolling_mean_6', 'WPU101704_rolling_std_6',
    'WPU101704_rolling_mean_12', 'WPU101704_rolling_std_12',
    'WPU101704_diff_1', 'WPU101704_diff_12',
    'year', 'month', 'quarter', 'month_sin', 'month_cos'
]

prep.get_preprocessing_summary(
    train_df=train_clean,
    test_df=test_clean,
    features_added=features_list
)

## 3.6 Model-Specific Preparations

Prepare data in the specific formats required by different models

### 3.6.1 ARIMA/SARIMA Preparation

In [None]:
# ARIMA expects simple time series (no features needed)
# Based on Step 2: d=1 recommended

# For ARIMA modeling, we'll use the ORIGINAL train/test (not feature-engineered)
arima_train = train_df['WPU101704'].copy()
arima_test = test_df['WPU101704'].copy()

print("ARIMA/SARIMA Data:")
print(f"  Train: {len(arima_train)} observations")
print(f"  Test: {len(arima_test)} observations")
print(f"\nNote: ARIMA will apply differencing internally (d=1)")
print(f"      pmdarima's auto_arima can search for suitable (p,d,q) parameters")
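
For reference, first differencing (d=1) replaces each value with its change from the previous month, and the original level is recovered by cumulative summation from the first observation. A small pandas sketch with made-up numbers:

```python
import pandas as pd

series = pd.Series([100.0, 103.0, 101.0, 106.0, 110.0])

# First difference (d=1); the first row has no predecessor and is dropped
diffed = series.diff(1).dropna()

# Invert: start value plus the running sum of differences
restored = pd.concat([series.iloc[:1], series.iloc[0] + diffed.cumsum()])

print(diffed.tolist())
print(restored.tolist())
```

This is why the undifferenced series is passed to ARIMA here: the model handles both the differencing and its inversion when producing forecasts on the original scale.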

### 3.6.2 Prophet Preparation

In [None]:
# Prophet requires 'ds' (date) and 'y' (target) columns
prophet_train = prep.prepare_for_prophet(
    data=train_df,
    date_col='observation_date',
    value_col='WPU101704'
)

prophet_test = prep.prepare_for_prophet(
    data=test_df,
    date_col='observation_date',
    value_col='WPU101704'
)

print("\nProphet train data:")
print(prophet_train.head())

print("\nProphet test data:")
print(prophet_test.head())
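
Prophet's input convention is a frame with a datetime column `ds` and a numeric column `y`. A minimal sketch of that reshaping, using a hypothetical helper `to_prophet_frame` (not the actual `prep.prepare_for_prophet`) and made-up values:

```python
import pandas as pd


def to_prophet_frame(data, date_col, value_col):
    """Hypothetical helper: keep two columns and rename them to ds / y."""
    out = data[[date_col, value_col]].rename(
        columns={date_col: 'ds', value_col: 'y'}
    )
    out['ds'] = pd.to_datetime(out['ds'])
    return out.reset_index(drop=True)


# Made-up numbers; 'extra' shows that other columns are dropped
demo = pd.DataFrame({
    'observation_date': ['2024-01-01', '2024-02-01', '2024-03-01'],
    'WPU101704': [250.1, 251.4, 249.8],
    'extra': [1, 2, 3],
})
prophet_demo = to_prophet_frame(demo, 'observation_date', 'WPU101704')
print(prophet_demo.dtypes)
```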

### 3.6.3 LSTM Preparation

LSTM requires:
1. Scaled data (0-1 range)
2. Sequences of fixed length
3. 3D array shape: (samples, timesteps, features)
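
The windowing behind requirements 2 and 3 can be sketched with NumPy: each sample is a run of `sequence_length` consecutive values, and its target is the value that follows. `make_windows` is a hypothetical helper written for this illustration; `prep.create_lstm_sequences` may differ in detail.

```python
import numpy as np


def make_windows(values, sequence_length):
    """Hypothetical helper: X[i] = values[i : i+L], y[i] = values[i+L]."""
    values = np.asarray(values, dtype=float)
    n_samples = len(values) - sequence_length
    if n_samples <= 0:
        # Not enough rows to form even one (window, target) pair
        return np.empty((0, sequence_length, 1)), np.empty((0,))
    X = np.stack([values[i:i + sequence_length] for i in range(n_samples)])
    y = values[sequence_length:]
    return X[..., np.newaxis], y  # add the trailing feature axis


# 508 values with a window of 12 give 508 - 12 samples
X_demo, y_demo = make_windows(np.arange(508), sequence_length=12)
print(X_demo.shape, y_demo.shape)

# A series no longer than the window gives no samples at all
X_short, y_short = make_windows(np.arange(12), sequence_length=12)
print(X_short.shape, y_short.shape)
```

The number of samples is always the series length minus the sequence length, so short series lose a large share of their rows.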

In [None]:
# Step 1: Scale the data (MinMax scaling 0-1)
# IMPORTANT: Fit scaler on TRAIN only

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler(feature_range=(0, 1))

# Scale target column
train_scaled = train_df[['WPU101704']].copy()
test_scaled = test_df[['WPU101704']].copy()

train_scaled['WPU101704'] = scaler.fit_transform(train_df[['WPU101704']])
test_scaled['WPU101704'] = scaler.transform(test_df[['WPU101704']])

print("Data Scaling for LSTM:")
print(f"  Original range: {train_df['WPU101704'].min():.2f} - {train_df['WPU101704'].max():.2f}")
print(f"  Scaled range: {train_scaled['WPU101704'].min():.4f} - {train_scaled['WPU101704'].max():.4f}")
print(f"  Scaler fitted on train data only")
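
With `feature_range=(0, 1)`, min-max scaling computes `(x - min) / (max - min)` using the minimum and maximum of the data the scaler was fitted on. Because those statistics come from the training set only, test values outside the training range map outside [0, 1]. The NumPy sketch below reproduces that arithmetic by hand on made-up numbers; it is an illustration of the formula, not a replacement for `MinMaxScaler`.

```python
import numpy as np

train_vals = np.array([100.0, 120.0, 150.0, 200.0])
test_vals = np.array([180.0, 210.0])  # 210 exceeds the training maximum

# "Fit" on train only: remember its minimum and maximum
lo, hi = train_vals.min(), train_vals.max()


def scale(x):
    return (x - lo) / (hi - lo)


def unscale(z):
    # Inverse transform, needed to report forecasts in original units
    return z * (hi - lo) + lo


train_scaled_demo = scale(train_vals)
test_scaled_demo = scale(test_vals)
print(train_scaled_demo, test_scaled_demo)
```

Scaled test values slightly above 1 or below 0 are expected for a trending series and do not indicate a bug in the scaling step.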

In [None]:
# Step 2: Create sequences
# USER INPUT: sequence_length from config

X_train_lstm, y_train_lstm = prep.create_lstm_sequences(
    data=train_scaled,
    value_col='WPU101704',
    sequence_length=LSTM_SEQUENCE_LENGTH
)

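# NOTE: windows are built from the test rows alone, so a test set with no more
# rows than LSTM_SEQUENCE_LENGTH produces zero sequences (see Summary).
X_test_lstm, y_test_lstm = prep.create_lstm_sequences(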
    data=test_scaled,
    value_col='WPU101704',
    sequence_length=LSTM_SEQUENCE_LENGTH
)

print(f"\nLSTM Training Data:")
print(f"  X_train shape: {X_train_lstm.shape}")
print(f"  y_train shape: {y_train_lstm.shape}")

print(f"\nLSTM Test Data:")
print(f"  X_test shape: {X_test_lstm.shape}")
print(f"  y_test shape: {y_test_lstm.shape}")
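
If the test frame has no more rows than `LSTM_SEQUENCE_LENGTH`, windowing it alone yields no test sequences. One common remedy is to prepend the last `LSTM_SEQUENCE_LENGTH` scaled training values to the test values before windowing, so every test target has a full window of history. The sketch below shows the idea with NumPy on synthetic data. `windows_with_context` is a hypothetical helper; the same effect might be achieved by passing a concatenated frame to `prep.create_lstm_sequences`, whose internals are not shown here.

```python
import numpy as np


def windows_with_context(train_values, test_values, sequence_length):
    """Hypothetical helper: one (window, target) pair per test value."""
    train_values = np.asarray(train_values, dtype=float)
    test_values = np.asarray(test_values, dtype=float)
    if len(train_values) < sequence_length:
        raise ValueError('training series shorter than sequence_length')
    # Last L training values supply the history for the first test targets
    combined = np.concatenate([train_values[-sequence_length:], test_values])
    X = np.stack([
        combined[i:i + sequence_length] for i in range(len(test_values))
    ])
    y = combined[sequence_length:]
    return X[..., np.newaxis], y


# Synthetic stand-ins for the scaled train and test series
train_demo = np.linspace(0.0, 1.0, 508)
test_demo = np.linspace(1.0, 1.1, 12)
X_test_demo, y_test_demo = windows_with_context(train_demo, test_demo, 12)
print(X_test_demo.shape, y_test_demo.shape)
```

Targets are still test values only, and each window contains only values that precede its target, so no future information enters the inputs.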

In [None]:
# Example: How one sequence looks
print("Example LSTM sequence (first sequence):")
print(f"\nInput (X): Last {LSTM_SEQUENCE_LENGTH} values:")
print(X_train_lstm[0].flatten())
print(f"\nTarget (y): Next value: {y_train_lstm[0]}")

## 3.7 Save Processed Data

In [None]:
# Create processed data directory if needed
from config.config import DATA_DIR
processed_dir = DATA_DIR / 'processed'
processed_dir.mkdir(parents=True, exist_ok=True)

# Save different versions for different models

# 1. Original train/test (for ARIMA/SARIMA/Prophet)
train_df.to_csv(processed_dir / 'train_original.csv', index=False)
test_df.to_csv(processed_dir / 'test_original.csv', index=False)

# 2. Feature-engineered (for potential ML baseline models)
train_clean.to_csv(processed_dir / 'train_features.csv', index=False)
test_clean.to_csv(processed_dir / 'test_features.csv', index=False)

# 3. Prophet format
prophet_train.to_csv(processed_dir / 'train_prophet.csv', index=False)
prophet_test.to_csv(processed_dir / 'test_prophet.csv', index=False)

# 4. LSTM sequences (NumPy arrays)
np.save(processed_dir / 'X_train_lstm.npy', X_train_lstm)
np.save(processed_dir / 'y_train_lstm.npy', y_train_lstm)
np.save(processed_dir / 'X_test_lstm.npy', X_test_lstm)
np.save(processed_dir / 'y_test_lstm.npy', y_test_lstm)

# 5. Save scaler for LSTM
import joblib
joblib.dump(scaler, processed_dir / 'lstm_scaler.pkl')

print("[OK] Processed data saved successfully!")
print(f"\nLocation: {processed_dir}")
print("\nFiles created:")
print("  1. train_original.csv, test_original.csv (for ARIMA/Prophet)")
print("  2. train_features.csv, test_features.csv (with engineered features)")
print("  3. train_prophet.csv, test_prophet.csv (Prophet format)")
print("  4. X_train_lstm.npy, y_train_lstm.npy (LSTM sequences)")
print("  5. X_test_lstm.npy, y_test_lstm.npy (LSTM sequences)")
print("  6. lstm_scaler.pkl (for inverse transform)")
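
When these files are read back in a later step, two details are easy to miss: CSV does not store dtypes, so the date column comes back as plain strings unless it is parsed, while `.npy` files preserve array shape and dtype. A small round-trip sketch in a temporary directory (file names and data are illustrative only):

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    frame = pd.DataFrame({
        'observation_date': pd.date_range('2024-01-01', periods=3, freq='MS'),
        'value': [1.0, 2.0, 3.0],
    })
    frame.to_csv(tmp_dir / 'demo.csv', index=False)

    # Without parse_dates the date column is not a datetime column
    raw = pd.read_csv(tmp_dir / 'demo.csv')
    parsed = pd.read_csv(tmp_dir / 'demo.csv', parse_dates=['observation_date'])

    # 3D arrays survive the .npy round trip unchanged
    arr = np.zeros((5, 12, 1))
    np.save(tmp_dir / 'demo.npy', arr)
    arr_back = np.load(tmp_dir / 'demo.npy')

print(raw['observation_date'].dtype, parsed['observation_date'].dtype)
print(arr_back.shape)
```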

## 3.8 Final Summary and Next Steps

In [None]:
print("\n" + "="*70)
print("STEP 3: DATA PREPROCESSING - COMPLETED")
print("="*70)

print("\n1. TRAIN/TEST SPLIT:")
print(f"   - Train: {len(train_df)} observations ({train_df['observation_date'].min().strftime('%Y-%m')} to {train_df['observation_date'].max().strftime('%Y-%m')})")
print(f"   - Test: {len(test_df)} observations ({test_df['observation_date'].min().strftime('%Y-%m')} to {test_df['observation_date'].max().strftime('%Y-%m')})")

print("\n2. FEATURES ENGINEERED:")
print(f"   - Lag features: 1, 3, 6, 12 months")
print(f"   - Rolling features: mean & std for 3, 6, 12 months")
print(f"   - Difference features: 1st diff (MoM), 12th diff (YoY)")
print(f"   - Time features: year, month, quarter, cyclical encoding")
print(f"   - Total engineered columns: {len(train_fe.columns)}")

print("\n3. DATA PREPARED FOR MODELS:")
print(f"   [OK] ARIMA/SARIMA: {len(arima_train)} train, {len(arima_test)} test")
print(f"   [OK] Prophet: {len(prophet_train)} train, {len(prophet_test)} test")
print(f"   [OK] LSTM: {X_train_lstm.shape[0]} train sequences, {X_test_lstm.shape[0]} test sequences")

print("\n4. LSTM CONFIGURATION:")
print(f"   - Sequence length: {LSTM_SEQUENCE_LENGTH} timesteps (USER INPUT)")
print(f"   - Scaling: MinMaxScaler (0-1 range)")
print(f"   - Input shape: {X_train_lstm.shape}")

print("\n5. FILES SAVED:")
print(f"   Location: {processed_dir}")
print(f"   Total files: 11 (6 CSV + 4 NPY + 1 PKL)")

print("\n" + "="*70)
print("[OK] Ready for Step 4: Baseline Models")
print("="*70)

print("\nNext Steps:")
print("  1. Implement baseline models (Naive, SMA, Exponential Smoothing)")
print("  2. Establish performance benchmarks")
print("  3. Compare against more complex models later")

---

## Summary

**Step 3: Data Preprocessing - COMPLETE ✓**

We have successfully:
- ✓ Split data into train (508) and test (12) sets
- ✓ Engineered 17 features (lags, rolling, differences, time)
- ✓ Prepared data for ARIMA (simple time series)
- ✓ Prepared data for Prophet (ds, y format)
- ✓ Prepared data for LSTM (scaled sequences)
- ✓ Saved all processed datasets

**Key Outputs:**
- Train/test datasets in multiple formats
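- LSTM sequences (496 train, 0 test; the 12-row test set is too short to form a full window on its own, so test windows need history from the end of the training data)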
- Fitted scaler for inverse transformation

**Ready for Step 4: Baseline Models**