# Notebook 2: Model-Specific Preprocessing

**Goals:**
- **Linear Regression**: One-Hot Encoding, Outlier Removal (IQR), VIF check, Standard Scaling. *Fixed: One-Hot features are excluded from scaling and outlier detection; skewed features are log-transformed.*
- **SVR**: One-Hot Encoding, Robust Scaling (SVR is sensitive to feature scale and outliers). *Fixed: One-Hot features are excluded from scaling.*
- **XGBoost**: Ordinal Encoding (tree-friendly).
- Export model-specific datasets: `lr_final_prep.csv`, `svr_final_prep.csv`, `xgb_final_prep.csv`.
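
The IQR outlier rule named in the goals can be sketched as below. The notebook's own `remove_outliers` lives in `src/preprocessing` and is not shown here, so `iqr_filter`, its parameters, and the demo frame are hypothetical stand-ins; the only behaviour taken from this notebook is that `Entity_` one-hot columns and columns with IQR=0 are skipped.

```python
import pandas as pd


def iqr_filter(df, threshold=3.0, skip_prefixes=("Entity_",)):
    """Drop rows with a value outside [Q1 - t*IQR, Q3 + t*IQR] in any checked column.

    Hypothetical helper for illustration; not the project's remove_outliers.
    """
    numeric = df.select_dtypes(include="number").columns
    # Leave one-hot indicator columns out of outlier detection.
    checked = [c for c in numeric if not c.startswith(skip_prefixes)]
    keep = pd.Series(True, index=df.index)
    for col in checked:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        if iqr == 0:
            # Constant or heavily imputed column: both fences collapse to one
            # value, so every other value would be flagged. Skip it instead.
            continue
        # Note: rows with NaN in a checked column fail `between` and are dropped.
        keep &= df[col].between(q1 - threshold * iqr, q3 + threshold * iqr)
    return df[keep]


# Made-up demo data: one extreme value in "x", a near-constant column, one dummy.
demo = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "flat": [0.0, 0.0, 0.0, 0.0, 5.0],
    "Entity_A": [1, 0, 0, 0, 0],
})
filtered = iqr_filter(demo)
print(filtered.shape)
```

In the demo only the row with `x = 100.0` is removed: `flat` has IQR=0 and is skipped, and `Entity_A` is never inspected. Without the IQR=0 guard, the lone non-zero value in `flat` would also count as an outlier.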

In [1]:
import pandas as pd
import sys
import os
import numpy as np
from sklearn.preprocessing import RobustScaler, OrdinalEncoder, StandardScaler

# Make the helper module in ../src importable
sys.path.append(os.path.abspath('../src'))
from preprocessing import load_data, encode_features, remove_outliers, remove_high_vif

# Load Common Data
df = load_data('../data/processed/common_preprocessed.csv')

# --- 1. Linear Regression ---
print("--- Processing Linear Regression Data ---")
target = 'Value_co2_emissions_kt_by_country'

# 1.0 Log Transformation for highly skewed/imputed features
skewed_cols = ['Financial flows to developing countries (US $)', 'Renewables (% equivalent primary energy)']
for col in skewed_cols:
    if col in df.columns:
        # Use log1p to handle zeros
        df[col] = np.log1p(df[col])
        print(f"Log-transformed {col}")

df_lr = encode_features(df, method='onehot')

# 1.1 Outliers Removal (IQR)
# threshold=3.0 sets wider fences than the usual 1.5 * IQR; One-Hot columns are excluded from detection
# remove_outliers also skips columns with IQR=0 (e.g. constant or heavily imputed)
df_lr = remove_outliers(df_lr, method='iqr', threshold=3.0)

# 1.2 VIF Removal (Multicollinearity)
# Function logic excludes Entity_ columns automatically
# Also exclude 'Financial flows...' to ensure it's kept per user interest
df_lr = remove_high_vif(df_lr, target, threshold=10, exclude_cols=['Financial flows to developing countries (US $)'])

# 1.3 Standard Scaling
scaler_lr = StandardScaler()
# Only scale continuous features, not One-Hot (which start with 'Entity_')
numeric_cols_lr = df_lr.select_dtypes(include=['float64', 'int64']).columns
feature_cols_lr = [c for c in numeric_cols_lr if c != target and not c.startswith('Entity_')]
df_lr[feature_cols_lr] = scaler_lr.fit_transform(df_lr[feature_cols_lr])

df_lr.to_csv('../data/processed/lr_final_prep.csv', index=False)
print(f"Saved LR data: {df_lr.shape}")

# --- 2. SVR ---
print("\n--- Processing SVR Data ---")
# Reload the common data: df was log-transformed in place above, so reusing it would apply the transform twice
df_svr_base = load_data('../data/processed/common_preprocessed.csv')
# Apply the same log transform here for consistency; SVR is sensitive to feature scale
for col in skewed_cols:
    if col in df_svr_base.columns:
        df_svr_base[col] = np.log1p(df_svr_base[col])

df_svr = encode_features(df_svr_base, method='onehot')
numeric_cols = df_svr.select_dtypes(include=['float64', 'int64']).columns
# Robust Scaling: Exclude target and One-Hot columns
svr_feats = [c for c in numeric_cols if c != target and not c.startswith('Entity_')]
scaler = RobustScaler()
df_svr[svr_feats] = scaler.fit_transform(df_svr[svr_feats])
df_svr.to_csv('../data/processed/svr_final_prep.csv', index=False)
print(f"Saved SVR data: {df_svr.shape}")

# --- 3. XGBoost ---
print("\n--- Processing XGBoost Data ---")
# Tree models handle outliers and collinearity well, and their splits are largely unaffected by
# monotonic transforms such as log1p, so we reload the original data without the log transform
df_xgb_base = load_data('../data/processed/common_preprocessed.csv')
df_xgb = encode_features(df_xgb_base, method='ordinal')
df_xgb.to_csv('../data/processed/xgb_final_prep.csv', index=False)
print(f"Saved XGBoost data: {df_xgb.shape}")

Loaded data from ../data/processed/common_preprocessed.csv: (3473, 25)
--- Processing Linear Regression Data ---
Log-transformed Financial flows to developing countries (US $)
Log-transformed Renewables (% equivalent primary energy)
Skipping outlier removal for 3 columns with IQR=0 (likely imputed): ['Electricity from nuclear (TWh)', 'Renewables (% equivalent primary energy)', 'Financial flows to developing countries (US $)']
Removed 1283 outlier rows (threshold=3.0).


Dropped features due to VIF > 10: ['Primary energy consumption per capita (kWh/person)', 'gdp_per_capita_lag1', 'Year', 'Access to electricity (% of population)', 'Renewables (% equivalent primary energy)']
Saved LR data: (2190, 193)

--- Processing SVR Data ---
Loaded data from ../data/processed/common_preprocessed.csv: (3473, 25)


Saved SVR data: (3473, 198)

--- Processing XGBoost Data ---
Loaded data from ../data/processed/common_preprocessed.csv: (3473, 25)
Saved XGBoost data: (3473, 25)
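
The VIF step above dropped five features at `threshold=10` while keeping the excluded `Financial flows` column. A minimal sketch of that kind of iterative VIF pruning follows. The real `remove_high_vif` is not shown in this notebook and may compute VIF differently, so `vif_values`, `drop_high_vif`, the `protected` argument, and the synthetic data are all assumptions made for illustration. The sketch relies on the identity that the VIF of each column equals the corresponding diagonal entry of the inverse correlation matrix.

```python
import numpy as np


def vif_values(X):
    """VIF per column, read off the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    # Raises LinAlgError if columns are perfectly collinear (singular matrix).
    return np.diag(np.linalg.inv(corr))


def drop_high_vif(X, names, threshold=10.0, protected=()):
    """Repeatedly drop the unprotected column with the largest VIF above threshold.

    Hypothetical helper for illustration; not the project's remove_high_vif.
    """
    names = list(names)
    X = np.asarray(X, dtype=float)
    dropped = []
    while X.shape[1] > 1:
        vifs = vif_values(X)
        # Protected columns are still used to compute VIF but are never dropped.
        candidates = [(v, i) for i, v in enumerate(vifs) if names[i] not in protected]
        if not candidates:
            break
        worst_vif, worst_idx = max(candidates)
        if worst_vif <= threshold:
            break
        dropped.append(names.pop(worst_idx))
        X = np.delete(X, worst_idx, axis=1)
    return names, dropped


# Synthetic data: c is almost exactly a + b, so all three columns start with high VIF.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + 0.01 * rng.normal(size=200)
X = np.column_stack([a, b, c])

kept, dropped = drop_high_vif(X, ["a", "b", "c"], threshold=10.0, protected=("a",))
print("kept:", kept, "dropped:", dropped)
```

Dropping one column at a time and recomputing matters because removing a single collinear feature usually lowers the VIF of the ones that remain, so a one-shot filter would discard more columns than necessary.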
