# 02. Preprocessing Pipeline

**Goal**: Build the preprocessing pipeline for three algorithms: Linear Regression, SVR, and XGBoost

**Corresponds to Report Section 3**: Methodology & Preprocessing Pipeline

---

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from statsmodels.stats.outliers_influence import variance_inflation_factor
import warnings
warnings.filterwarnings('ignore')

import sys
import os
sys.path.append(os.path.abspath('../src'))

print("Libraries loaded!")

## 2.1 Load Raw Data

In [None]:
# Load raw data
df = pd.read_csv('../data/raw/Global_Data_filtered.csv')
print(f"Raw data shape: {df.shape}")
df.head()

## 2.2 Common Preprocessing

Steps shared by all algorithms:
1. Handle missing values (median imputation)
2. Create lag features
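
The lag step depends on row order: `shift(1)` returns the previous row within each group, which is the previous year only if the rows are sorted chronologically per country. A minimal sketch on a toy panel follows; the `toy` frame, the `co2` column, and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy panel (hypothetical values), deliberately out of chronological order
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B"],
    "Year":   [2002, 2000, 2001, 2001, 2000],
    "co2":    [30.0, 10.0, 20.0, 7.0, 5.0],
})

# Sort within each country first, otherwise shift(1) follows file order
toy = toy.sort_values(["Entity", "Year"]).reset_index(drop=True)

# Previous year's value, computed separately for each country
toy["co2_lag1"] = toy.groupby("Entity")["co2"].shift(1)

# The first year of each country has no predecessor, so drop it
toy_lagged = toy.dropna(subset=["co2_lag1"]).reset_index(drop=True)
print(toy_lagged)
```

Each country loses exactly one row (its earliest year), which is why the shape shrinks after lag creation.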

In [None]:
TARGET = 'Value_co2_emissions_kt_by_country'

# 1. Median Imputation
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
imputer = SimpleImputer(strategy='median')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
print(f"Missing values after imputation: {df.isnull().sum().sum()}")

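# 2. Lag Features
# Sort by country and year so that shift(1) returns the previous year's value
df = df.sort_values(['Entity', 'Year']).reset_index(drop=True)
lag_cols = [TARGET, 'gdp_per_capita', 'Primary energy consumption per capita (kWh/person)']
for col in lag_cols:
    if col in df.columns:
        df[f'{col}_lag1'] = df.groupby('Entity')[col].shift(1)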

# Remove first year per country (no lag)
df = df.dropna(subset=[f'{TARGET}_lag1'])
print(f"Shape after lag creation: {df.shape}")

## 2.3 Linear Regression Preprocessing

LR-specific steps:
- Log transform for skewed features
- One-hot encoding for Entity
- IQR outlier removal (with a whitelist of major economies)
- VIF-based feature selection
- Z-score scaling
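
The log transform and the VIF-based selection listed above can be sketched as follows. This is only an illustration under stated assumptions: the data is synthetic, the helper names `vif_table` and `drop_high_vif` are hypothetical, and the threshold of 10 is a common rule of thumb rather than a value given in this notebook. Instead of the `variance_inflation_factor` call imported earlier, the sketch uses the identity that the VIF of each feature (for a regression with an intercept) equals the corresponding diagonal entry of the inverse correlation matrix.

```python
import numpy as np
import pandas as pd

def vif_table(X: pd.DataFrame) -> pd.Series:
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=X.columns)

def drop_high_vif(X: pd.DataFrame, threshold: float = 10.0) -> pd.DataFrame:
    """Drop the column with the largest VIF until every VIF is <= threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        vif = vif_table(X)
        if vif.max() <= threshold:
            break
        X = X.drop(columns=vif.idxmax())
    return X

# Synthetic, right-skewed features (hypothetical, not the real dataset)
rng = np.random.default_rng(0)
n = 200
gdp = rng.lognormal(mean=8.0, sigma=1.0, size=n)
demo = pd.DataFrame({
    "gdp": gdp,
    "energy": rng.lognormal(mean=9.0, sigma=1.0, size=n),
    # Nearly collinear with gdp, to give the VIF filter something to remove
    "gdp_copy": gdp * np.exp(rng.normal(0.0, 0.05, size=n)),
})

# log1p compresses the long right tail and is defined at zero
demo_log = np.log1p(demo)

# Keep only features whose VIF is at or below the threshold
selected = drop_high_vif(demo_log, threshold=10.0)
print(vif_table(demo_log).round(1))
print(list(selected.columns))
```

Dropping one feature at a time and recomputing matters because removing a single collinear column usually lowers the VIF of its partners as well.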

In [None]:
df_lr = df.copy()

# Whitelist major economies (G20+)
WHITELIST = [
    'China', 'United States', 'India', 'Russia', 'Japan', 'Germany', 
    'South Korea', 'Iran', 'Saudi Arabia', 'Indonesia', 'Canada', 
    'Mexico', 'South Africa', 'Brazil', 'Australia', 'Turkey', 
    'United Kingdom', 'France', 'Italy', 'Poland', 'Taiwan', 
    'Thailand', 'Spain', 'Malaysia', 'Egypt', 'Vietnam', 'Pakistan',
    'Argentina', 'Venezuela', 'United Arab Emirates', 'Netherlands',
    'Iraq', 'Philippines', 'Kazakhstan', 'Algeria', 'Kuwait', 
    'Belgium', 'Czechia', 'Morocco'
]

# Remove 2020 (COVID anomaly)
df_lr = df_lr[df_lr['Year'] != 2020]

# IQR Outlier Removal (excluding whitelist)
Q1 = df_lr[TARGET].quantile(0.25)
Q3 = df_lr[TARGET].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outlier_mask = ((df_lr[TARGET] < lower) | (df_lr[TARGET] > upper))
whitelist_mask = df_lr['Entity'].isin(WHITELIST)
df_lr = df_lr[~outlier_mask | whitelist_mask]

print(f"LR data after outlier removal: {df_lr.shape}")

# One-Hot Encoding
df_lr = pd.get_dummies(df_lr, columns=['Entity'], prefix='Entity')

# Z-Score Scaling (numeric columns only; exclude target, Year, and one-hot columns)
numeric_lr_cols = df_lr.select_dtypes(include=[np.number]).columns
scale_cols = [c for c in numeric_lr_cols if c not in [TARGET, 'Year'] and not c.startswith('Entity_')]
scaler = StandardScaler()
df_lr[scale_cols] = scaler.fit_transform(df_lr[scale_cols])

print(f"LR final shape: {df_lr.shape}")

## 2.4 XGBoost Preprocessing

XGBoost-specific steps:
- No log transform needed
- Ordinal encoding for Entity
- No outlier removal
- No scaling needed
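
One detail of ordinal encoding worth illustrating: a mapping built from row order changes whenever the rows are reordered, and countries not present when the mapping was built have no code. The sketch below builds the mapping from sorted names and reserves `-1` for unknown countries. The sample names, the `-1` convention, and the variables `train_entities` and `new_entities` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample of country names
train_entities = pd.Series(["Vietnam", "Brazil", "Vietnam", "Canada"])

# Sort the names so each country's code does not depend on row order
entity_map = {name: code for code, name in enumerate(sorted(train_entities.unique()))}
encoded = train_entities.map(entity_map)

# Countries missing from the mapping become NaN; use -1 as an explicit "unknown" code
new_entities = pd.Series(["Brazil", "Atlantis"])
encoded_new = new_entities.map(entity_map).fillna(-1).astype(int)

print(entity_map)
print(encoded_new.tolist())
```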

In [None]:
df_xgb = df.copy()

# Ordinal Encoding
entity_map = {e: i for i, e in enumerate(df_xgb['Entity'].unique())}
df_xgb['Entity_Encoded'] = df_xgb['Entity'].map(entity_map)
df_xgb = df_xgb.drop('Entity', axis=1)

print(f"XGBoost final shape: {df_xgb.shape}")

## 2.5 Save Processed Data

In [None]:
# Save
df_lr.to_csv('../data/processed/lr_final_prep.csv', index=False)
df_xgb.to_csv('../data/processed/xgb_final_prep.csv', index=False)

print("✅ Saved preprocessed data!")
print(f"  - lr_final_prep.csv: {df_lr.shape}")
print(f"  - xgb_final_prep.csv: {df_xgb.shape}")

## Summary

| Algorithm | Log Transform | Encoding | Outlier Handling | Scaling |
|---|---|---|---|---|
| **Linear Regression** | ✅ Yes | One-Hot | IQR + Whitelist | Z-Score |
| **XGBoost** | ❌ No | Ordinal | None | None |