# Notebook 11: Advanced Country-Specific Modeling

**Objective**: Fit models that learn country-specific effects (slopes and latent features) from the data, moving beyond the single shared slope of the Global Linear Regression.

**Approaches**:
1.  **Linear Mixed Effects Model (LMM)**: Allows each country to have its own Random Intercept and Random Slope (e.g., Effect of GDP on CO2 varies by country).
2.  **Interaction Model (Simulated Insight)**: Manually creates `GDP * Country` features for major economies, so the model can learn country-specific slopes without the full complexity of Mixed Effects.

In [1]:
import pandas as pd
import numpy as np
import sys
import os
import statsmodels.formula.api as smf
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

# Add src to path
sys.path.append(os.path.abspath(os.path.join('../src')))
from preprocessing import load_data

SPLIT_YEAR = 2015
TARGET = 'Value_co2_emissions_kt_by_country'

## 1. Data Preparation
We need a dataset with:
- **Aligned Sequences**: Each country's yearly series has effectively no missing years or gaps.
- **Numeric Features**: Standard-scaled.
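
The "no gaps" requirement can be checked directly. The sketch below is one possible check, not part of the original pipeline: `find_year_gaps` is a hypothetical helper, and the toy frame only assumes the `Entity` and `Year` columns used elsewhere in this notebook.

```python
import pandas as pd

def find_year_gaps(df, entity_col="Entity", year_col="Year"):
    """Return the entities whose years do not form a contiguous run (hypothetical helper)."""
    stats = df.groupby(entity_col)[year_col].agg(["min", "max", "nunique"])
    expected = stats["max"] - stats["min"] + 1   # years a gap-free series would have
    stats["missing_years"] = expected - stats["nunique"]
    return stats[stats["missing_years"] > 0]

# Toy data (made up): entity B skips the year 2001
toy = pd.DataFrame({
    "Entity": ["A", "A", "A", "B", "B", "B"],
    "Year": [2000, 2001, 2002, 2000, 2002, 2003],
})
gaps = find_year_gaps(toy)
print(gaps)
```

An empty result would indicate that every entity has a contiguous yearly series.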

In [2]:
# Load clean common data (No One-Hot yet)
df_raw = load_data('../data/processed/common_preprocessed.csv')

# Align with the 'Whitelist' logic (noisy rows filtered out)
# Simplest route: reuse the row indices behind lr_final_prep.csv (the whitelisted rows) while keeping the original columns
df_lr_template = load_data('../data/processed/lr_final_prep.csv')
map_df = pd.read_csv('../data/processed/recovered_index_map.csv')

# Filter Common Data to Keep only Whitelisted/Clean rows
valid_indices = map_df['Original_Index'].values
df = df_raw.loc[valid_indices].copy().reset_index(drop=True)

print(f"Data Shape (Whitelisted): {df.shape}")

# Define features (same as Phase 1)
# Log-transform (log1p) the skewed columns for consistency
skewed_cols = ['Financial flows to developing countries (US $)', 'Renewables (% equivalent primary energy)']
for col in skewed_cols:
    if col in df.columns:
        df[col] = np.log1p(df[col])

# Select Numeric Features
numeric_cols = df.select_dtypes(include=[np.number]).columns
feature_cols = [c for c in numeric_cols if c not in [TARGET, 'Year']]

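# Standard-scale numeric features
# Note: the scaler is fit on all years (before the split), so test-period
# statistics influence the scaling applied to the training rows
scaler = StandardScaler()
df[feature_cols] = scaler.fit_transform(df[feature_cols])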

# Split Train/Test
train_df = df[df['Year'] < SPLIT_YEAR].copy()
test_df = df[df['Year'] >= SPLIT_YEAR].copy()

print(f"Train: {train_df.shape}, Test: {test_df.shape}")

Loaded data from ../data/processed/common_preprocessed.csv: (3473, 25)
Loaded data from ../data/processed/lr_final_prep.csv: (2232, 192)
Data Shape (Whitelisted): (2232, 25)
Train: (1558, 25), Test: (674, 25)
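
The cell above scales the features before splitting by year. A leakage-free variant would fit the scaler on the training years only and reuse those statistics for the test years. The sketch below shows that ordering on a made-up toy frame; the column `x` is illustrative and not from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

SPLIT_YEAR = 2015  # same cutoff as the notebook

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "Year": [2012, 2013, 2014, 2015, 2016],
    "x": [1.0, 2.0, 3.0, 10.0, 20.0],
})
train = toy[toy["Year"] < SPLIT_YEAR].copy()
test = toy[toy["Year"] >= SPLIT_YEAR].copy()

scaler = StandardScaler()
train[["x"]] = scaler.fit_transform(train[["x"]])  # mean/std estimated from train only
test[["x"]] = scaler.transform(test[["x"]])        # train statistics reused on test

print(scaler.mean_, train["x"].mean(), test["x"].min())
```

Applied to the real data, this would change the scaled values slightly, so the scores reported below would not be reproduced exactly.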


## 2. Linear Mixed Effects Model (LMM)
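We let the slopes of `gdp_per_capita` and `Primary energy consumption per capita (kWh/person)` vary by country, alongside a country-specific intercept.
Formula (lme4-style notation): `CO2 ~ GDP + Energy + (1 + GDP + Energy | Entity)`, where `GDP + Energy` are the fixed effects and the parenthesized term holds the random intercept and random slopes for each `Entity`.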
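
Written out, the model described above takes the following form under the usual LMM assumptions. The symbols are introduced here for illustration: $i$ indexes the country (`Entity`) and $t$ the year.

$$
\text{CO2}_{it} = (\beta_0 + b_{0i}) + (\beta_1 + b_{1i})\,\text{GDP}_{it} + (\beta_2 + b_{2i})\,\text{Energy}_{it} + \varepsilon_{it},
\qquad (b_{0i}, b_{1i}, b_{2i}) \sim \mathcal{N}(0, \Sigma),\quad \varepsilon_{it} \sim \mathcal{N}(0, \sigma^2)
$$

The $\beta$ terms are the fixed effects shared by all countries. The $b_i$ terms are each country's deviations from them: $b_{0i}$ is the random intercept, and $b_{1i}$ and $b_{2i}$ are the random slopes.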

In [3]:
# Simplified Feature names for LMM formula strings
rename_map = {
    'gdp_per_capita': 'GDP',
    'Primary energy consumption per capita (kWh/person)': 'Energy',
    'Value_co2_emissions_kt_by_country': 'CO2'
}
lmm_train = train_df.rename(columns=rename_map)
lmm_test = test_df.rename(columns=rename_map)

# Fit LMM
# Using 'MixedLM' from statsmodels
formula = "CO2 ~ GDP + Energy"
try:
    # Random intercept (groups=Entity) + random slopes (re_formula)
    # Note: complex random structures can fail to converge with little data per group
    # We fit random slopes for both Energy and GDP
    print("Training Mixed Effects Model (this may take a minute)...")
    model_lmm = smf.mixedlm(formula, lmm_train, groups=lmm_train['Entity'], re_formula="~Energy + GDP")
    result_lmm = model_lmm.fit(method='powell')
    print(result_lmm.summary())
    
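    # Note: predict() on a MixedLM result uses only the fixed-effects part of the model;
    # the per-country random intercepts/slopes are not added to these predictions
    lmm_preds = result_lmm.predict(lmm_test)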
    r2_lmm = r2_score(lmm_test['CO2'], lmm_preds)
    print(f"LMM R2 Score: {r2_lmm:.4f}")
except Exception as e:
    print(f"LMM Failed: {e}")
    r2_lmm = np.nan  # NaN rather than 0, so a failed fit is not read as a real score

Training Mixed Effects Model (this may take a minute)...


                      Mixed Linear Model Regression Results
Model:                    MixedLM        Dependent Variable:        CO2           
No. Observations:         1558           Method:                    REML          
No. Groups:               131            Scale:                     223627373.3159
Min. group size:          1              Log-Likelihood:            -17898.4903   
Max. group size:          14             Converged:                 Yes           
Mean group size:          11.9                                                    
----------------------------------------------------------------------------------
                         Coef.         Std.Err.    z   P>|z|   [0.025     0.975]  
----------------------------------------------------------------------------------
Intercept                 170326.886   81002.078 2.103 0.035  11565.731 329088.041
GDP                        30412.798   15918.982 1.910 0.056   -787.834  61613.430
Energy                    2
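
In statsmodels, `predict()` on a fitted `MixedLM` returns the fixed-effects mean only. The estimated per-group effects are stored separately in `random_effects`. The sketch below uses synthetic data and a random-intercept-only model (both are assumptions made to keep the example small) to show how adding the group offsets back changes the fit. It is an illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic panel (made-up data): 20 groups, each with its own intercept offset
n_groups, n_obs = 20, 15
group = np.repeat(np.arange(n_groups), n_obs)
x = rng.normal(size=n_groups * n_obs)
u = rng.normal(scale=5.0, size=n_groups)
y = 1.0 + 2.0 * x + u[group] + rng.normal(scale=0.5, size=x.size)
data = pd.DataFrame({"y": y, "x": x, "g": group})

# Random-intercept model, kept simple so the fit is stable
result = smf.mixedlm("y ~ x", data, groups=data["g"]).fit()

# predict() gives the fixed-effects mean only
fixed_only = np.asarray(result.predict(data))

# random_effects maps each group label to its estimated effects;
# with a random intercept only, the first entry is the intercept offset
offsets = data["g"].map(lambda k: np.asarray(result.random_effects[k])[0]).to_numpy()
with_random = fixed_only + offsets

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

r2_fixed = r2(data["y"].to_numpy(), fixed_only)
r2_full = r2(data["y"].to_numpy(), with_random)
print(f"fixed effects only: {r2_fixed:.3f}, with random effects: {r2_full:.3f}")
```

With random slopes, as in the notebook's `re_formula`, each country's contribution would instead be the dot product of its `random_effects` entry with that row's intercept term and slope variables. Countries absent from the training data have no estimated effects and would fall back to the fixed-effects prediction.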

## 3. Interaction Model (Global LR + Entity Interactions)
Instead of a black-box neural network, we hand-build the country-specific terms.
We create interaction terms such as `GDP * Entity_China` and `Energy * Entity_China`, where `Entity_China` is a 0/1 indicator.
This lets the model learn a China-specific slope adjustment without using Mixed Effects.

In [4]:
# Create Interaction Features
# We focus only on Top Entities to avoid exploding feature space (Curse of Dimensionality)
TOP_ENTITIES = ['China', 'United States', 'India', 'Japan', 'Russian Federation', 'Germany', 'Brazil', 'Canada']

def add_interactions(df_in):
    df_out = df_in.copy()
    # Dummy Encode Top Entities manually
    for entity in TOP_ENTITIES:
        mask = (df_out['Entity'] == entity).astype(int)
        # Interaction: Energy * Is_Entity
        if 'Primary energy consumption per capita (kWh/person)' in df_out.columns:
            df_out[f'Energy_x_{entity}'] = df_out['Primary energy consumption per capita (kWh/person)'] * mask
        # Interaction: GDP * Is_Entity
        if 'gdp_per_capita' in df_out.columns:
            df_out[f'GDP_x_{entity}'] = df_out['gdp_per_capita'] * mask
    return df_out

train_inter = add_interactions(train_df)
test_inter = add_interactions(test_df)

# Prepare X, y for Ridge
features_inter = [c for c in train_inter.columns if c not in [TARGET, 'Year', 'Entity', 'Entity_ID']]
X_tr = train_inter[features_inter].fillna(0)
y_tr = train_inter[TARGET]
X_te = test_inter[features_inter].fillna(0)
y_te = test_inter[TARGET]

print(f"Training Interaction Ridge Model with {len(features_inter)} features...")
model_inter = Ridge(alpha=1.0)
model_inter.fit(X_tr, y_tr)
inter_preds = model_inter.predict(X_te)

r2_inter = r2_score(y_te, inter_preds)
print(f"Interaction Model (Top 8 Slopes) R2: {r2_inter:.4f}")

Training Interaction Ridge Model with 38 features...
Interaction Model (Top 8 Slopes) R2: 0.7715
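
With interaction features, a country's effective slope is the shared coefficient plus that country's interaction coefficient. The sketch below shows how to read it off a fitted `Ridge` model. The data is synthetic and the entities `A` and `B` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 200

# Synthetic data (made up): true GDP slope is 1.0 for entity A and 3.0 for entity B
toy = pd.DataFrame({
    "Entity": np.where(np.arange(n) < 100, "A", "B"),
    "GDP": rng.normal(size=n),
})
true_slope = np.where(toy["Entity"] == "B", 3.0, 1.0)
toy["y"] = true_slope * toy["GDP"] + rng.normal(scale=0.1, size=n)

# Interaction feature: GDP where the row belongs to B, else 0
toy["GDP_x_B"] = toy["GDP"] * (toy["Entity"] == "B").astype(int)

model = Ridge(alpha=1.0).fit(toy[["GDP", "GDP_x_B"]], toy["y"])
coef = pd.Series(model.coef_, index=["GDP", "GDP_x_B"])

slope_A = coef["GDP"]                     # shared slope applies to A
slope_B = coef["GDP"] + coef["GDP_x_B"]   # shared slope plus B's adjustment
print(f"A: {slope_A:.2f}, B: {slope_B:.2f}")
```

The recovered slopes sit slightly off the true values because Ridge shrinks coefficients toward zero.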


## 4. Comparison

In [5]:
comparison = pd.DataFrame({
    'Model': ['Global Linear Regression (Benchmark)', 'Linear Mixed Effects (LMM)', 'Interaction Ridge (Manual Slopes)'],
    # 0.7817 is a hard-coded benchmark value; it is not recomputed in this notebook
    'R2 Score': [0.7817, r2_lmm, r2_inter]
})
print(comparison)

                                  Model  R2 Score
0  Global Linear Regression (Benchmark)   0.78170
1            Linear Mixed Effects (LMM)   0.03757
2     Interaction Ridge (Manual Slopes)   0.77147
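
R2 alone can hide the size of the errors in the target's own units (kt of CO2). A possible extension, sketched here on made-up numbers, reports RMSE next to R2 using the `mean_squared_error` function already imported at the top of the notebook. `summarize` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

def summarize(y_true, y_pred):
    """Return R2 and RMSE for one model (hypothetical helper)."""
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    return {"R2": float(r2_score(y_true, y_pred)), "RMSE": rmse}

# Made-up values, for illustration only
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
metrics = summarize(y_true, y_pred)
print(metrics)
```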
