# 02. Data Preprocessing Pipeline

**Goal**: Implement the preprocessing pipeline for three algorithms: Linear Regression, SVR, and XGBoost

**Corresponds to Report Section 3**: Methodology & Preprocessing Pipeline

---

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, RobustScaler
from sklearn.impute import SimpleImputer
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

import sys
import os
sys.path.append(os.path.abspath('../src'))

print("Libraries loaded!")

## 2.1 Load the Raw Data

In [None]:
# Load raw data
df = pd.read_csv('../data/raw/global-data-on-sustainable-energy.csv')
print(f"Raw data shape: {df.shape}")

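# Convert numeric strings with thousands separators (e.g. "1,234") to float
for col in df.columns:
    if df[col].dtype == 'object' and col not in ['Entity']:
        try:
            df[col] = df[col].str.replace(',', '').astype(float)
        except (AttributeError, ValueError):
            # Column holds non-numeric text; leave it unchanged
            pass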

print(f"Columns: {df.columns.tolist()}")
df.head()
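
The conversion loop above can be checked on a toy frame. The following is a minimal sketch, not the notebook's own code: the column name `Density` and the helper `to_float` are hypothetical, and it swaps in `pd.to_numeric(..., errors="coerce")`, which turns unparseable entries into NaN instead of leaving the whole column as text.

```python
import pandas as pd

# Toy frame; the column names are hypothetical, not taken from the dataset
toy = pd.DataFrame({
    "Entity": ["A", "B", "C"],
    "Density": ["1,234", "56", "n/a"],
})

def to_float(series: pd.Series) -> pd.Series:
    """Strip thousands separators and parse; unparseable entries become NaN."""
    cleaned = series.astype(str).str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")

toy["Density"] = to_float(toy["Density"])
print(toy.dtypes)
```

With this variant, entries such as `"n/a"` surface as missing values and are then picked up by the imputation step, rather than silently keeping the column as `object`.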

## 2.2 Common Preprocessing

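Steps shared by all algorithms:
1. Handle missing values (median imputation)
2. Create lag features
3. Remove year 2020 (COVID anomaly); in the cells below this filter is applied in the LR and SVR steps only, so the XGBoost dataset keeps 2020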

In [None]:
TARGET = 'Value_co2_emissions_kt_by_country'

# Whitelist of major economies (G20+ and other large emitters)
WHITELIST = [
    'China', 'United States', 'India', 'Russia', 'Japan', 'Germany',
    'South Korea', 'Iran', 'Saudi Arabia', 'Indonesia', 'Canada',
    'Mexico', 'South Africa', 'Brazil', 'Australia', 'Turkey',
    'United Kingdom', 'France', 'Italy', 'Poland', 'Taiwan',
    'Thailand', 'Spain', 'Malaysia', 'Egypt', 'Vietnam', 'Pakistan',
    'Argentina', 'Venezuela', 'United Arab Emirates', 'Netherlands',
    'Iraq', 'Philippines', 'Kazakhstan', 'Algeria', 'Kuwait',
    'Belgium', 'Czechia', 'Morocco'
]

# 1. Median Imputation (one median per numeric column, computed across all countries)
numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
imputer = SimpleImputer(strategy='median')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
print(f"Missing values after imputation: {df.isnull().sum().sum()}")

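# 2. Lag Features
# Sort first so that shift(1) always refers to the previous year of the same country
df = df.sort_values(['Entity', 'Year']).reset_index(drop=True)
lag_cols = [TARGET, 'gdp_per_capita', 'Primary energy consumption per capita (kWh/person)']
for col in lag_cols:
    if col in df.columns:
        df[f'{col}_lag1'] = df.groupby('Entity')[col].shift(1)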

# Remove first year per country (no lag available)
df = df.dropna(subset=[f'{TARGET}_lag1'])
print(f"Shape after lag creation: {df.shape}")

# Save common preprocessed data
df.to_csv('../data/processed/common_preprocessed.csv', index=False)
print(f"Saved common_preprocessed.csv: {df.shape}")
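
The lag step above relies on `groupby('Entity').shift(1)` never borrowing a value from a different country. The following is a small sketch on made-up data (the column `co2` and its values are hypothetical) that shows this behaviour, including why the rows need to be in year order within each country.

```python
import pandas as pd

# Deliberately unsorted toy data for two hypothetical countries
toy = pd.DataFrame({
    "Entity": ["A", "B", "A", "B", "A"],
    "Year":   [2001, 2000, 2000, 2001, 2002],
    "co2":    [11.0, 50.0, 10.0, 55.0, 12.0],
})

# Order by country and year so shift(1) means "previous year"
toy = toy.sort_values(["Entity", "Year"]).reset_index(drop=True)

# Shift within each country; the first year of each country gets NaN
toy["co2_lag1"] = toy.groupby("Entity")["co2"].shift(1)

# Drop the first year per country, which has no lag available
toy = toy.dropna(subset=["co2_lag1"]).reset_index(drop=True)
print(toy)
```

Country B's first remaining row takes its lag from B's own year 2000, not from A's last year, which is the property the pipeline depends on.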

## 2.3 Linear Regression Preprocessing

Specific to LR:
- **Log Transform** for skewed features (reduces the influence of extreme values)
- **One-Hot Encoding** for Entity (LR needs dummy variables)
- **IQR Outlier Removal** with a whitelist of major economies
- **StandardScaler** (Z-score scaling)

In [None]:
df_lr = df.copy()

# Remove 2020 (COVID anomaly)
df_lr = df_lr[df_lr['Year'] != 2020]
print(f"After removing 2020: {df_lr.shape}")

# Log transform for skewed features
skewed_cols = ['Financial flows to developing countries (US $)',
               'Electricity from fossil fuels (TWh)',
               'Electricity from nuclear (TWh)',
               'Electricity from renewables (TWh)',
               TARGET, f'{TARGET}_lag1']

for col in skewed_cols:
    if col in df_lr.columns:
        # log1p handles zero values; clip guards against negative inputs
        df_lr[col] = np.log1p(df_lr[col].clip(lower=0))

print(f"Applied log transform to {len(skewed_cols)} columns")

# IQR outlier removal on the log-transformed target; whitelisted countries are always kept
Q1 = df_lr[TARGET].quantile(0.25)
Q3 = df_lr[TARGET].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outlier_mask = ((df_lr[TARGET] < lower) | (df_lr[TARGET] > upper))
whitelist_mask = df_lr['Entity'].isin(WHITELIST)
df_lr = df_lr[~outlier_mask | whitelist_mask]
print(f"After outlier removal: {df_lr.shape}")

# One-Hot Encoding
df_lr = pd.get_dummies(df_lr, columns=['Entity'], prefix='Entity')

# StandardScaler (Z-Score) - exclude binary columns
scale_cols = [c for c in df_lr.columns if c not in [TARGET, 'Year'] and not c.startswith('Entity_')]
scaler_lr = StandardScaler()
df_lr[scale_cols] = scaler_lr.fit_transform(df_lr[scale_cols])

print(f"LR final shape: {df_lr.shape}")
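
The IQR-plus-whitelist rule used above can be isolated into a small function. This is a sketch under assumptions: the helper name `iqr_filter`, its parameters, and the toy data are hypothetical and only mirror the masking logic of the cell.

```python
import pandas as pd

def iqr_filter(frame, value_col, key_col, keep_keys, k=1.5):
    """Drop rows outside [Q1 - k*IQR, Q3 + k*IQR] unless their key is protected."""
    q1 = frame[value_col].quantile(0.25)
    q3 = frame[value_col].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    is_outlier = (frame[value_col] < lower) | (frame[value_col] > upper)
    is_protected = frame[key_col].isin(keep_keys)
    # Keep a row if it is not an outlier OR it belongs to a protected key
    return frame[~is_outlier | is_protected]

toy = pd.DataFrame({
    "Entity": ["A", "B", "C", "D", "Big", "Tiny"],
    "value":  [10.0, 11.0, 12.0, 13.0, 100.0, -80.0],
})

kept = iqr_filter(toy, "value", "Entity", keep_keys=["Big"])
print(kept)
```

Both `Big` and `Tiny` fall outside the fences, but only `Tiny` is dropped because `Big` is on the protected list.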

## 2.4 SVR Preprocessing

Specific to SVR:
- **Log Transform** for skewed features (same as LR)
- **Ordinal Encoding** for Entity (saves memory; One-Hot is not used for SVR)
- **IQR Outlier Removal** with the whitelist (SVR is very sensitive to outliers)
- **RobustScaler** (preferred over StandardScaler here because it centers on the median and scales by the IQR instead of mean/std)

In [None]:
df_svr = df.copy()

# Remove 2020 (COVID anomaly)
df_svr = df_svr[df_svr['Year'] != 2020]
print(f"After removing 2020: {df_svr.shape}")

# Log transform for skewed features (same as LR)
for col in skewed_cols:
    if col in df_svr.columns:
        df_svr[col] = np.log1p(df_svr[col].clip(lower=0))

print(f"Applied log transform to {len(skewed_cols)} columns")

# IQR Outlier Removal (excluding whitelist)
Q1 = df_svr[TARGET].quantile(0.25)
Q3 = df_svr[TARGET].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR

outlier_mask = ((df_svr[TARGET] < lower) | (df_svr[TARGET] > upper))
whitelist_mask = df_svr['Entity'].isin(WHITELIST)
df_svr = df_svr[~outlier_mask | whitelist_mask]
print(f"After outlier removal: {df_svr.shape}")

# Ordinal Encoding (same approach as XGBoost)
entity_map_svr = {e: i for i, e in enumerate(df_svr['Entity'].unique())}
df_svr['Entity_Encoded'] = df_svr['Entity'].map(entity_map_svr)
df_svr = df_svr.drop('Entity', axis=1)

# RobustScaler (preferred over StandardScaler for SVR)
# It uses the median and IQR, so any remaining outliers affect it less
scale_cols_svr = [c for c in df_svr.columns if c not in [TARGET, 'Year', 'Entity_Encoded']]
scaler_svr = RobustScaler()
df_svr[scale_cols_svr] = scaler_svr.fit_transform(df_svr[scale_cols_svr])

print(f"SVR final shape: {df_svr.shape}")
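
To illustrate why RobustScaler is less affected by remaining outliers, the sketch below scales one made-up column containing a single extreme value with both scalers. The data is hypothetical; the scalers are used with their default settings.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature with four ordinary values and one extreme value
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler: (x - median) / IQR, with IQR taken between the 25th and 75th percentiles
robust = RobustScaler().fit_transform(x).ravel()

# StandardScaler: (x - mean) / std, both of which are pulled by the extreme value
standard = StandardScaler().fit_transform(x).ravel()

print("robust:  ", robust)
print("standard:", standard)

# Spread of the four ordinary points under each scaler
print("robust spread:  ", robust[3] - robust[0])
print("standard spread:", standard[3] - standard[0])
```

Under RobustScaler the four ordinary points stay well separated, while under StandardScaler they are squeezed into a narrow band because the extreme value inflates the standard deviation.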

## 2.5 XGBoost Preprocessing

Specific to XGBoost:
- **No Log Transform** (tree-based models handle skewed features without it)
- **Ordinal Encoding** for Entity
- **No outlier removal** (tree-based models are robust to outliers)
- **No scaling** (tree-based models do not depend on feature scale)

In [None]:
df_xgb = df.copy()

# Ordinal Encoding
entity_map_xgb = {e: i for i, e in enumerate(df_xgb['Entity'].unique())}
df_xgb['Entity_Encoded'] = df_xgb['Entity'].map(entity_map_xgb)
df_xgb = df_xgb.drop('Entity', axis=1)

# No scaling and no outlier removal needed
print(f"XGBoost final shape: {df_xgb.shape}")
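
The SVR and XGBoost cells each build their mapping from the rows present in their own frame, so the same country may receive different codes in the two output files. If consistent codes are wanted, one option is a single mapping built once from the full list of names. This is a sketch with a hypothetical helper `build_entity_map` and made-up country rows, not part of the original pipeline.

```python
import pandas as pd

def build_entity_map(names):
    """Map each distinct name to an integer code, in sorted order for reproducibility."""
    return {name: code for code, name in enumerate(sorted(set(names)))}

full = pd.DataFrame({"Entity": ["Chile", "Austria", "Brazil", "Austria"]})
# Stand-in for a frame that lost some rows, e.g. after outlier removal
filtered = full[full["Entity"] != "Austria"]

# Build the mapping once from the full data and reuse it everywhere
entity_map = build_entity_map(full["Entity"])
full_codes = full["Entity"].map(entity_map)
filtered_codes = filtered["Entity"].map(entity_map)

print(entity_map)
print(filtered_codes.tolist())
```

Because the mapping is created before any filtering, a country keeps the same code regardless of which rows a later step removes.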

## 2.6 Save Processed Data

In [None]:
# Save all 3 datasets
df_lr.to_csv('../data/processed/lr_final_prep.csv', index=False)
df_svr.to_csv('../data/processed/svr_final_prep.csv', index=False)
df_xgb.to_csv('../data/processed/xgb_final_prep.csv', index=False)

print("✅ Saved preprocessed data!")
print(f"  - lr_final_prep.csv: {df_lr.shape}")
print(f"  - svr_final_prep.csv: {df_svr.shape}")
print(f"  - xgb_final_prep.csv: {df_xgb.shape}")
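
Before writing the files, a quick sanity check can catch leftover missing values or text columns. The helper below is an optional sketch; the name `check_ready` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

def check_ready(frame, target):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    if target not in frame.columns:
        problems.append(f"missing target column: {target}")
    n_missing = int(frame.isnull().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values")
    # Anything that is neither numeric nor boolean cannot be fed to the models directly
    non_numeric = frame.select_dtypes(exclude=[np.number, "bool"]).columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems

good = pd.DataFrame({"y": [1.0, 2.0], "x": [0.1, 0.2], "Entity_A": [True, False]})
bad = pd.DataFrame({"x": [1.0, None], "name": ["a", "b"]})

print(check_ready(good, "y"))
print(check_ready(bad, "y"))
```

In the notebook this could be called as `check_ready(df_lr, TARGET)` and similarly for the other two frames.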

## Summary

**Preprocessing Pipeline Comparison:**

| Algorithm | Log Transform | Encoding | Outlier Handling | Scaling |
|---|---|---|---|---|
| **Linear Regression** | ✅ Yes (skewed features) | One-Hot | IQR + Whitelist | StandardScaler |
| **SVR** | ✅ Yes (skewed features) | Ordinal | IQR + Whitelist | RobustScaler |
| **XGBoost** | ❌ No | Ordinal | None (keep all) | None |
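
Because the LR and SVR datasets store the target after `np.log1p`, a model trained on them predicts on the log scale. A minimal sketch of mapping values back to kilotonnes with `np.expm1`, using made-up numbers:

```python
import numpy as np

# Hypothetical CO2 values in kt, including zero
co2_kt = np.array([0.0, 1500.0, 10_000_000.0])

# Forward transform, as applied to the target in the LR and SVR cells
log_target = np.log1p(co2_kt)

# Inverse transform to apply to model predictions before reporting them in kt
recovered = np.expm1(log_target)

print(log_target)
print(recovered)
```

Error metrics computed on the log scale and on the original scale are not directly comparable, so it helps to state which one a reported score uses, especially when comparing against XGBoost, whose target is left untransformed.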

**Common Steps (all algorithms):**
- Median imputation for missing values
- One-year lag features for CO2 emissions, GDP per capita, and primary energy consumption per capita

**Key Differences:**
- **LR**: One-Hot Encoding → many columns; StandardScaler (mean=0, std=1)
- **SVR**: RobustScaler (uses median/IQR) → less sensitive to remaining outliers
- **XGBoost**: Tree-based → no transforms needed; handles non-linear patterns on its own
- **Outliers**: LR & SVR remove outliers but keep the major economies (G20+)

**Output Files:**
- `common_preprocessed.csv`: Shared data after imputation + lag features
- `lr_final_prep.csv`: Data for Linear Regression
- `svr_final_prep.csv`: Data for SVR
- `xgb_final_prep.csv`: Data for XGBoost