# FLFP Models

## Setup

In [16]:
import pandas as pd
import numpy as np

# For data splitting and preprocessing
from sklearn.model_selection import train_test_split, GroupKFold, TimeSeriesSplit
from sklearn.preprocessing import StandardScaler

# For modeling and evaluation
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostRegressor

## Data preparation

### Load the FLFP dataset

In [17]:
flfp_df = pd.read_parquet('data/flfp_dataset.parquet')
flfp_df['region'] = flfp_df['region'].str.strip()  # Strip leading/trailing whitespace
flfp_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5208 entries, 0 to 5207
Data columns (total 25 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   country_name             5208 non-null   object 
 1   flfp_15_64               4484 non-null   float64
 2   year                     5208 non-null   int64  
 3   fertility_rate           5208 non-null   float64
 4   fertility_adolescent     5208 non-null   float64
 5   urban_population         5160 non-null   float64
 6   dependency_ratio         5208 non-null   float64
 7   life_exp_female          5208 non-null   float64
 8   infant_mortality         4704 non-null   float64
 9   population_total         5208 non-null   float64
 10  secondary_enroll_fe      3427 non-null   float64
 11  tertiary_enroll_fe       2979 non-null   float64
 12  gender_parity_primary    2874 non-null   float64
 13  gender_parity_secondary  2934 non-null   float64
 14  gdp_per_capita_const    

### Perform train-test split
Because this is panel data, we have to be careful not to leak future observations for any country into the training set.

In this notebook, we use a time-based split: early years for training, later years for validation and testing. This creates a forecasting-style problem in which the model sees the earlier years for each country and is evaluated on the later ones.

In [18]:
# Filter to FLFP observations
modeling_df = flfp_df[flfp_df['flfp_15_64'].notna()].copy()

print("Time-based (temporal) Train/Validation/Test Split")
print("=" * 50)

# Inspect years
years = np.sort(modeling_df['year'].unique())
n_years = len(years)
print(f"Years in data: {years}")
print(f"Number of distinct years: {n_years}")

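# Define an approximate 80/10/10 split by year (earliest → latest).
# Indices are floored, so 24 years split into 19/2/3 years rather than an exact 80/10/10.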
train_end_idx = int(np.floor(0.8 * n_years))        # end (exclusive) of train years
val_end_idx   = int(np.floor(0.9 * n_years))        # end (exclusive) of train+val years

train_years = years[:train_end_idx]
val_years   = years[train_end_idx:val_end_idx]
test_years  = years[val_end_idx:]

print(f"\nTrain years ({len(train_years)}): {train_years}")
print(f"Validation years ({len(val_years)}): {val_years}")
print(f"Test years ({len(test_years)}): {test_years}")

# Create boolean masks based on year
train_mask = modeling_df['year'].isin(train_years)
val_mask   = modeling_df['year'].isin(val_years)
test_mask  = modeling_df['year'].isin(test_years)

# Subset the main dataframe
train_df = modeling_df[train_mask].copy()
val_df   = modeling_df[val_mask].copy()
test_df  = modeling_df[test_mask].copy()

print(f"\nTraining observations: {len(train_df):,} ({len(train_df)/len(modeling_df)*100:.1f}%)")
print(f"Validation observations: {len(val_df):,} ({len(val_df)/len(modeling_df)*100:.1f}%)")
print(f"Test observations: {len(test_df):,} ({len(test_df)/len(modeling_df)*100:.1f}%)")

# Sanity check: the three sets of years must not overlap
assert set(train_years).isdisjoint(val_years)
assert set(train_years).isdisjoint(test_years)
assert set(val_years).isdisjoint(test_years)

time_split = {
    'train_years': train_years,
    'val_years': val_years,
    'test_years': test_years,
    'train_df': train_df,
    'val_df': val_df,
    'test_df': test_df,
}

print("\n✓ Time-based split created successfully (80/10/10 by year)")

Time-based (temporal) Train/Validation/Test Split
Years in data: [2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023]
Number of distinct years: 24

Train years (19): [2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013
 2014 2015 2016 2017 2018]
Validation years (2): [2019 2020]
Test years (3): [2021 2022 2023]

Training observations: 3,553 (79.2%)
Validation observations: 374 (8.3%)
Test observations: 557 (12.4%)

✓ Time-based split created successfully (80/10/10 by year)
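
A year-level split only avoids temporal leakage if every training year precedes every held-out year. The check can be sketched on a toy panel as follows. The helper name `split_years_chronologically`, its parameters, and the toy data are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def split_years_chronologically(years, train_frac=0.8, train_val_frac=0.9):
    """Split sorted unique years into train/val/test blocks (earliest to latest)."""
    years = np.sort(np.unique(years))
    n_years = len(years)
    train_end = int(np.floor(train_frac * n_years))
    val_end = int(np.floor(train_val_frac * n_years))
    return years[:train_end], years[train_end:val_end], years[val_end:]


# Toy panel (assumed): 3 countries observed every year from 2000 to 2023
panel = pd.DataFrame({
    "country_name": np.repeat(["A", "B", "C"], 24),
    "year": np.tile(np.arange(2000, 2024), 3),
})
train_years, val_years, test_years = split_years_chronologically(panel["year"])

# Every training year precedes every validation year, which precedes every test year
assert train_years.max() < val_years.min() < test_years.min()

# Per country, the latest training year precedes the earliest held-out year
in_train = panel["year"].isin(train_years)
last_train = panel[in_train].groupby("country_name")["year"].max()
first_held_out = panel[~in_train].groupby("country_name")["year"].min()
assert (last_train < first_held_out).all()
```

With 24 distinct years this reproduces the 19/2/3 year allocation shown in the output above.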


### Add categorical features (region and income level)


In [19]:
# Filter out 'Not classified' income observations
print("Filtering out 'Not classified' income observations...")
initial_train = len(train_df)
initial_val = len(val_df)
initial_test = len(test_df)

train_df = train_df[train_df['income_level'] != 'Not classified'].copy()
val_df = val_df[val_df['income_level'] != 'Not classified'].copy()
test_df = test_df[test_df['income_level'] != 'Not classified'].copy()

print(f"  Train: {initial_train} → {len(train_df)} (-{initial_train - len(train_df)})")
print(f"  Val: {initial_val} → {len(val_df)} (-{initial_val - len(val_df)})")
print(f"  Test: {initial_test} → {len(test_df)} (-{initial_test - len(test_df)})")

# Create label-encoded income for tree-based models
print("\nCreating label-encoded income (for tree-based models)...")
income_mapping = {
    'Low income': 0,
    'Lower middle income': 1,
    'Upper middle income': 2,
    'High income': 3
}

train_df['income_level_encoded'] = train_df['income_level'].map(income_mapping)
val_df['income_level_encoded'] = val_df['income_level'].map(income_mapping)
test_df['income_level_encoded'] = test_df['income_level'].map(income_mapping)

print(f"  Income encoding: {income_mapping}")

# Create one-hot encoded region with clean names
print("\nCreating one-hot encoded region with clean names...")

# Region name mapping to clean snake_case abbreviations
region_name_mapping = {
    'East Asia & Pacific': 'region_eap',
    'Europe & Central Asia': 'region_eca',
    'Latin America & Caribbean': 'region_lac',
    'Middle East, North Africa, Afghanistan & Pakistan': 'region_mena_afpak',
    'North America': 'region_namerica',
    'South Asia': 'region_sasia',
    'Sub-Saharan Africa': 'region_ssa'
}

# Create dummies for ALL regions (drop_first=False to get all categories)
region_dummies_train = pd.get_dummies(train_df['region'], prefix='region', drop_first=False)
region_dummies_val = pd.get_dummies(val_df['region'], prefix='region', drop_first=False)
region_dummies_test = pd.get_dummies(test_df['region'], prefix='region', drop_first=False)

# Rename columns using mapping
for original, clean in region_name_mapping.items():
    old_col = f'region_{original}'
    if old_col in region_dummies_train.columns:
        region_dummies_train.rename(columns={old_col: clean}, inplace=True)
    if old_col in region_dummies_val.columns:
        region_dummies_val.rename(columns={old_col: clean}, inplace=True)
    if old_col in region_dummies_test.columns:
        region_dummies_test.rename(columns={old_col: clean}, inplace=True)

# Drop reference category: region_mena_afpak
reference_region = 'region_mena_afpak'
if reference_region in region_dummies_train.columns:
    region_dummies_train.drop(columns=[reference_region], inplace=True)
if reference_region in region_dummies_val.columns:
    region_dummies_val.drop(columns=[reference_region], inplace=True)
if reference_region in region_dummies_test.columns:
    region_dummies_test.drop(columns=[reference_region], inplace=True)

# Ensure all sets have the same columns (in case some regions are missing in val/test)
all_region_cols = region_dummies_train.columns.tolist()
for col in all_region_cols:
    if col not in region_dummies_val.columns:
        region_dummies_val[col] = 0
    if col not in region_dummies_test.columns:
        region_dummies_test[col] = 0

# Reorder columns to match
region_dummies_val = region_dummies_val[all_region_cols]
region_dummies_test = region_dummies_test[all_region_cols]

# Add to dataframes
train_df = pd.concat([train_df, region_dummies_train], axis=1)
val_df = pd.concat([val_df, region_dummies_val], axis=1)
test_df = pd.concat([test_df, region_dummies_test], axis=1)

print(f"  Created {len(all_region_cols)} region dummy variables")
print(f"  Reference category: {reference_region} (dropped)")
print(f"  Region columns: {all_region_cols}")

# Create one-hot encoded income with clean names (for linear models)
print("\nCreating one-hot encoded income with clean names (for linear models)...")

# Income level name mapping to clean snake_case abbreviations
income_name_mapping = {
    'High income': 'income_high',
    'Low income': 'income_low',
    'Lower middle income': 'income_lower_mid',
    'Upper middle income': 'income_upper_mid'
}

# Create dummies for ALL income levels (drop_first=False to get all categories)
income_dummies_train = pd.get_dummies(train_df['income_level'], prefix='income', drop_first=False)
income_dummies_val = pd.get_dummies(val_df['income_level'], prefix='income', drop_first=False)
income_dummies_test = pd.get_dummies(test_df['income_level'], prefix='income', drop_first=False)

# Rename columns using mapping
for original, clean in income_name_mapping.items():
    old_col = f'income_{original}'
    if old_col in income_dummies_train.columns:
        income_dummies_train.rename(columns={old_col: clean}, inplace=True)
    if old_col in income_dummies_val.columns:
        income_dummies_val.rename(columns={old_col: clean}, inplace=True)
    if old_col in income_dummies_test.columns:
        income_dummies_test.rename(columns={old_col: clean}, inplace=True)

# Drop reference category: income_low
reference_income = 'income_low'
if reference_income in income_dummies_train.columns:
    income_dummies_train.drop(columns=[reference_income], inplace=True)
if reference_income in income_dummies_val.columns:
    income_dummies_val.drop(columns=[reference_income], inplace=True)
if reference_income in income_dummies_test.columns:
    income_dummies_test.drop(columns=[reference_income], inplace=True)

# Ensure all sets have the same columns
all_income_cols = income_dummies_train.columns.tolist()
for col in all_income_cols:
    if col not in income_dummies_val.columns:
        income_dummies_val[col] = 0
    if col not in income_dummies_test.columns:
        income_dummies_test[col] = 0

# Reorder columns to match
income_dummies_val = income_dummies_val[all_income_cols]
income_dummies_test = income_dummies_test[all_income_cols]

# Add to dataframes
train_df = pd.concat([train_df, income_dummies_train], axis=1)
val_df = pd.concat([val_df, income_dummies_val], axis=1)
test_df = pd.concat([test_df, income_dummies_test], axis=1)

print(f"  Created {len(all_income_cols)} income dummy variables")
print(f"  Reference category: {reference_income} (dropped)")
print(f"  Income columns: {all_income_cols}")

print("\n✓ Categorical features added successfully!")
print(f"  Train shape: {train_df.shape}")
print(f"  Val shape: {val_df.shape}")
print(f"  Test shape: {test_df.shape}")


Filtering out 'Not classified' income observations...
  Train: 3553 → 3515 (-38)
  Val: 374 → 370 (-4)
  Test: 557 → 551 (-6)

Creating label-encoded income (for tree-based models)...
  Income encoding: {'Low income': 0, 'Lower middle income': 1, 'Upper middle income': 2, 'High income': 3}

Creating one-hot encoded region with clean names...
  Created 6 region dummy variables
  Reference category: region_mena_afpak (dropped)
  Region columns: ['region_eap', 'region_eca', 'region_lac', 'region_namerica', 'region_sasia', 'region_ssa']

Creating one-hot encoded income with clean names (for linear models)...
  Created 3 income dummy variables
  Reference category: income_low (dropped)
  Income columns: ['income_high', 'income_lower_mid', 'income_upper_mid']

✓ Categorical features added successfully!
  Train shape: (3515, 35)
  Val shape: (370, 35)
  Test shape: (551, 35)
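
The align-and-reorder steps in the cell above are needed because `pd.get_dummies` only creates columns for the categories it actually sees in each split. One possible alternative, sketched here on toy data, is to cast the column to a categorical dtype with a fixed category list, so that every split yields the same columns in the same order. The toy frames, the shortened category list, and the helper `region_dummies` are assumptions for illustration.

```python
import pandas as pd

# Assumed, shortened category list; the notebook has seven regions
regions = ["East Asia & Pacific", "South Asia", "Sub-Saharan Africa"]

train_toy = pd.DataFrame({"region": ["South Asia", "Sub-Saharan Africa", "East Asia & Pacific"]})
val_toy = pd.DataFrame({"region": ["South Asia", "South Asia"]})  # two regions absent


def region_dummies(df, categories, reference):
    """One-hot encode `region` against a fixed category list and drop a reference level."""
    as_cat = df["region"].astype(pd.CategoricalDtype(categories=categories))
    dummies = pd.get_dummies(as_cat, prefix="region")
    return dummies.drop(columns=[f"region_{reference}"])


train_d = region_dummies(train_toy, regions, reference="East Asia & Pacific")
val_d = region_dummies(val_toy, regions, reference="East Asia & Pacific")

print(list(train_d.columns))
print(list(val_d.columns))
```

Because the category list is fixed up front, a region that is missing from a split still gets an all-zero column, and no separate alignment loop is required.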


### Preprocess the data
We need three datasets for different model types: raw (unimputed) data for XGBoost and LightGBM, which handle missing values natively; imputed but unscaled data for random forest; and imputed and scaled data for linear models, SVM, and KNN.

For imputation, we use the country median (computed on the training set) as the default strategy. If a country has no observed values for a feature, we fall back to the training-set median for that year, and finally to the global training-set median.
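
The imputation cell below applies its fallbacks in a fixed order: country median, then year median, then global median, all taken from the training rows. The following is a vectorized sketch of that order on a toy panel, not the notebook's own function. The column name `x` and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

train_toy = pd.DataFrame({
    "country_name": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "x": [1.0, 3.0, np.nan, np.nan],  # country B has no observed x in training
})
test_toy = pd.DataFrame({
    "country_name": ["A", "B", "C"],  # country C never appears in training
    "year": [2002, 2001, 2005],
    "x": [np.nan, np.nan, np.nan],
})

# All statistics come from the training rows only
country_median = train_toy.groupby("country_name")["x"].median()
year_median = train_toy.groupby("year")["x"].median()
global_median = train_toy["x"].median()

filled = (
    test_toy["x"]
    .fillna(test_toy["country_name"].map(country_median))  # step 1: country median
    .fillna(test_toy["year"].map(year_median))             # step 2: year median
    .fillna(global_median)                                 # step 3: global median
)
print(filled.tolist())
```

Here country A is filled from its own median, country B falls through to the 2001 year median, and the unseen country C in an unseen year falls through to the global median.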

For scaling, we will use standard scaling (mean=0, std=1) based on the training set statistics.

Note that we impute and scale after splitting, using training-set statistics only. Otherwise, information from the validation and test sets would leak into the transformations applied to the training data.
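
To make the leakage point concrete, here is a small sketch with toy arrays (assumed values). The scaler learns its mean and standard deviation from the training rows, and the held-out rows are transformed with those same statistics, so they generally do not come out with mean 0.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_val_toy = np.array([[4.0], [5.0]])

scaler_toy = StandardScaler().fit(X_train_toy)  # statistics from training rows only
X_train_z = scaler_toy.transform(X_train_toy)
X_val_z = scaler_toy.transform(X_val_toy)

print("train mean/std used:", scaler_toy.mean_, scaler_toy.scale_)
print("scaled train mean:", X_train_z.mean())
print("scaled val values:", X_val_z.ravel())
```

Fitting the scaler on the combined data instead would shift the mean toward the later values, which is exactly the information the model should not see at training time.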

In [20]:
# Define base predictor columns (numeric features)
# Note: Removed unemployment_total, unemployment_female, and labor_force_total
# to avoid data leakage (these are mathematically related to FLFP)
base_predictor_cols = [
    'fertility_rate', 'fertility_adolescent', 'urban_population',
    'dependency_ratio', 'life_exp_female', 'infant_mortality',
    'population_total', 'secondary_enroll_fe', 'gdp_per_capita_const',
    'gdp_growth', 'services_gdp', 'industry_gdp', 'rule_of_law'
 ]

# Get the categorical feature column names that were created in the previous cell
region_cols = [col for col in train_df.columns if col.startswith('region_')]
# Get income dummy columns (exclude 'income_level' and 'income_level_encoded')
income_dummy_cols = [col for col in train_df.columns if col.startswith('income_') and col not in ['income_level', 'income_level_encoded']]

# Create separate predictor lists for different model types
# Tree-based models: use label-encoded income + one-hot region
predictor_cols_tree = base_predictor_cols + ['income_level_encoded'] + region_cols

# Linear models: use one-hot encoded income + one-hot region
predictor_cols_linear = base_predictor_cols + income_dummy_cols + region_cols

print("Predictor Column Setup:")
print(f"  Base numeric features: {len(base_predictor_cols)}")
print(f"  Region dummies: {len(region_cols)}")
print(f"  Income dummies: {len(income_dummy_cols)}")
print(f"  Total for tree models: {len(predictor_cols_tree)} (base + income_encoded + regions)")
print(f"  Total for linear models: {len(predictor_cols_linear)} (base + income_dummies + regions)")

target_col = 'flfp_15_64'

# Variables that need imputation (only numeric features need imputation)
variables_to_impute = [
    'secondary_enroll_fe', 'urban_population', 'infant_mortality',
    'gdp_per_capita_const', 'gdp_growth', 'services_gdp',
    'industry_gdp', 'rule_of_law'
 ]

def panel_imputation(train_df, val_df, test_df, variables_to_impute):
    """Impute with training-set medians only: country first, then year, then global."""

    # Calculate imputation rules using ONLY training data
    train_country_medians = {}
    train_year_medians = {}
    train_global_medians = {}

    for var in variables_to_impute:
        # Country-specific medians from training data only
        train_country_medians[var] = train_df.groupby('country_name')[var].median()
        # Year-specific medians from training data only
        train_year_medians[var] = train_df.groupby('year')[var].median()
        # Global median from training data only
        train_global_medians[var] = train_df[var].median()

    # Apply imputation rules to all datasets
    def apply_imputation(df):
        df_imputed = df.copy()
        for var in variables_to_impute:
            if var in df_imputed.columns:
                # Use training-based country medians
                for country in df_imputed['country_name'].unique():
                    country_mask = df_imputed['country_name'] == country
                    country_median = train_country_medians[var].get(country, np.nan)

                    # Fill using country median where available
                    df_imputed.loc[country_mask, var] = df_imputed.loc[country_mask, var].fillna(country_median)

                # Fill remaining NaNs using year medians
                year_values = df_imputed.loc[:, 'year']
                missing_mask = df_imputed[var].isna()
                if missing_mask.any():
                    years_to_fill = year_values[missing_mask]
                    fill_values = years_to_fill.map(train_year_medians[var]).astype(float)
                    df_imputed.loc[missing_mask, var] = fill_values

                # Fall back to training-global median if still missing
                df_imputed[var] = df_imputed[var].fillna(train_global_medians[var])

        return df_imputed

    return apply_imputation(train_df), apply_imputation(val_df), apply_imputation(test_df)

# Log-transform population_total in the original dataframes (before imputation)
# This will affect both raw and imputed datasets
print("\nApplying log transformation to population_total (before extraction)...")
train_df['population_total'] = np.log(train_df['population_total'])
val_df['population_total'] = np.log(val_df['population_total'])
test_df['population_total'] = np.log(test_df['population_total'])
print("  ✓ Log transformation applied to raw data")

# Extract TRULY raw features (before imputation) for models that handle missing values natively
# Use tree predictor columns (includes income_level_encoded + region dummies)
X_train_raw = train_df[predictor_cols_tree].copy()
X_val_raw = val_df[predictor_cols_tree].copy()
X_test_raw = test_df[predictor_cols_tree].copy()

print("\nDataset 1 - Truly Raw/Unimputed (XGBoost, LightGBM):")
print(f"  Using predictor_cols_tree ({len(predictor_cols_tree)} features)")
print(f"  X_train_raw: {X_train_raw.shape}")
print(f"  X_val_raw: {X_val_raw.shape}")
print(f"  X_test_raw: {X_test_raw.shape}")
print(f"  Missing values - Train: {X_train_raw.isna().sum().sum()}")
print(f"  Missing values - Val: {X_val_raw.isna().sum().sum()}")
print(f"  Missing values - Test: {X_test_raw.isna().sum().sum()}")

# Apply imputation
train_clean, val_clean, test_clean = panel_imputation(
    train_df, val_df, test_df, variables_to_impute
)

# Extract imputed features and target
# Use tree predictor columns (includes income_level_encoded + region dummies)
X_train_imputed = train_clean[predictor_cols_tree].copy()
X_val_imputed = val_clean[predictor_cols_tree].copy()
X_test_imputed = test_clean[predictor_cols_tree].copy()

y_train = train_clean[target_col].copy()
y_val = val_clean[target_col].copy()
y_test = test_clean[target_col].copy()

print("\nImputation complete:")
print("Dataset 2 - Imputed (Random Forest):")
print(f"  Using predictor_cols_tree ({len(predictor_cols_tree)} features)")
print(f"  X_train_imputed: {X_train_imputed.shape}")
print(f"  X_val_imputed: {X_val_imputed.shape}")
print(f"  X_test_imputed: {X_test_imputed.shape}")
print(f"  Missing values - Train: {X_train_imputed.isna().sum().sum()}")
print(f"  Missing values - Val: {X_val_imputed.isna().sum().sum()}")
print(f"  Missing values - Test: {X_test_imputed.isna().sum().sum()}")

# Create scaled versions (fit scaler only on training data)
# Use linear predictor columns (includes income dummies + region dummies)
print("\nCreating scaled datasets...")
scaler = StandardScaler()

# Need to extract linear predictor columns from the clean dataframes
X_train_linear = train_clean[predictor_cols_linear].copy()
X_val_linear = val_clean[predictor_cols_linear].copy()
X_test_linear = test_clean[predictor_cols_linear].copy()

X_train_scaled = pd.DataFrame(
    scaler.fit_transform(X_train_linear),
    columns=predictor_cols_linear,
    index=X_train_linear.index
)
X_val_scaled = pd.DataFrame(
    scaler.transform(X_val_linear),
    columns=predictor_cols_linear,
    index=X_val_linear.index
)
X_test_scaled = pd.DataFrame(
    scaler.transform(X_test_linear),
    columns=predictor_cols_linear,
    index=X_test_linear.index
)

print("\nDataset 3 - Scaled + Imputed (Linear models, SVM, KNN):")
print(f"  Using predictor_cols_linear ({len(predictor_cols_linear)} features)")
print(f"  X_train_scaled: {X_train_scaled.shape}")
print(f"  X_val_scaled: {X_val_scaled.shape}")
print(f"  X_test_scaled: {X_test_scaled.shape}")

print("\nTarget variable (same for all models):")
print(f"  y_train: {y_train.shape}")
print(f"  y_val: {y_val.shape}")
print(f"  y_test: {y_test.shape}")

# Create time-based cross-validation within the training period
print("\nSetting up time-based cross-validation within the training period...")

# Sort training data by year so TimeSeriesSplit respects chronology
train_clean_sorted = train_clean.sort_values('year')

# Reorder feature matrices and target to match this sorted index
X_train_raw = X_train_raw.loc[train_clean_sorted.index]
X_train_imputed = X_train_imputed.loc[train_clean_sorted.index]
X_train_scaled = X_train_scaled.loc[train_clean_sorted.index]
y_train = y_train.loc[train_clean_sorted.index]

# Time-based CV: each split trains on earlier rows and validates on later rows.
# Folds are cut by row count, so a boundary year can appear on both sides of a split.
time_kfold = TimeSeriesSplit(n_splits=5)

print(f"  Using TimeSeriesSplit with {time_kfold.n_splits} splits ordered by year")
print("  This keeps validation folds later in time than their training folds")

print("\n✓ All datasets ready for modeling!")
print(f"Tree models: {len(predictor_cols_tree)} features (numeric + income_encoded + region_dummies)")
print(f"Linear models: {len(predictor_cols_linear)} features (numeric + income_dummies + region_dummies)")
print(f"Total observations: {len(X_train_raw) + len(X_val_raw) + len(X_test_raw):,}")

Predictor Column Setup:
  Base numeric features: 13
  Region dummies: 6
  Income dummies: 3
  Total for tree models: 20 (base + income_encoded + regions)
  Total for linear models: 22 (base + income_dummies + regions)

Applying log transformation to population_total (before extraction)...
  ✓ Log transformation applied to raw data

Dataset 1 - Truly Raw/Unimputed (XGBoost, LightGBM):
  Using predictor_cols_tree (20 features)
  X_train_raw: (3515, 20)
  X_val_raw: (370, 20)
  X_test_raw: (551, 20)
  Missing values - Train: 2103
  Missing values - Val: 189
  Missing values - Test: 308

Imputation complete:
Dataset 2 - Imputed (Random Forest):
  Using predictor_cols_tree (20 features)
  X_train_imputed: (3515, 20)
  X_val_imputed: (370, 20)
  X_test_imputed: (551, 20)
  Missing values - Train: 0
  Missing values - Val: 0
  Missing values - Test: 0

Creating scaled datasets...

Dataset 3 - Scaled + Imputed (Linear models, SVM, KNN):
  Using predictor_cols_linear (22 features)
  X_train_sca
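
Because `TimeSeriesSplit` cuts by row position rather than by year, a fold boundary on year-sorted panel data can land inside a year. The sketch below uses a toy panel (assumed data) and prints the year range of each fold, so any overlap at the boundary is visible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Toy panel (assumed): 4 countries observed over 10 years, sorted by year
toy = pd.DataFrame({
    "country_name": np.tile(["A", "B", "C", "D"], 10),
    "year": np.repeat(np.arange(2000, 2010), 4),
})
toy = toy.sort_values("year", kind="stable").reset_index(drop=True)

fold_ranges = []
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(toy):
    train_years = toy.loc[train_idx, "year"]
    val_years = toy.loc[val_idx, "year"]
    # Record the last training year and the first validation year of this fold
    fold_ranges.append((int(train_years.max()), int(val_years.min())))
    print(f"train {train_years.min()}-{train_years.max()} | "
          f"val {val_years.min()}-{val_years.max()}")
```

In this toy setup the validation rows are never earlier than the training rows, but some folds share their boundary year. If strict year separation matters, one option is to build the folds from explicit year lists instead.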

## Linear Regression

### Simple OLS (no regularization)

In [21]:
# Initialize the model
ols_model = LinearRegression()

# Fit on training data (using scaled data for linear models)
ols_model.fit(X_train_scaled, y_train)

# Make predictions on train and validation sets
y_train_pred = ols_model.predict(X_train_scaled)
y_val_pred = ols_model.predict(X_val_scaled)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

# Feature importance for OLS (coefficient magnitudes)
feature_importance = pd.DataFrame({
    'feature': predictor_cols_linear,
    'importance': np.abs(ols_model.coef_)
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Most Important Features:")
print(feature_importance.head(10))


Training Performance:
  RMSE: 11.022
  MAE:  8.460
  R²:   0.509

Validation Performance:
  RMSE: 11.706
  MAE:  8.969
  R²:   0.397

Top 10 Most Important Features:
                 feature  importance
21            region_ssa   11.945144
17            region_eca   11.345783
16            region_eap   11.121724
18            region_lac    9.844976
3       dependency_ratio    7.796105
4        life_exp_female    6.142638
15      income_upper_mid    4.647287
13           income_high    4.312782
8   gdp_per_capita_const    3.681507
0         fertility_rate    3.102939
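
Absolute magnitudes rank the coefficients but hide their direction. A small sketch on synthetic data (assumed, not the FLFP data) that keeps the signed coefficient next to its magnitude:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
# Synthetic target: "a" pushes y up, "b" pushes it down more strongly, "c" is irrelevant
y_toy = 2.0 * X_toy["a"] - 5.0 * X_toy["b"] + rng.normal(scale=0.1, size=200)

ols_toy = LinearRegression().fit(X_toy, y_toy)

coef_table = (
    pd.DataFrame({"feature": X_toy.columns, "coef": ols_toy.coef_})
    .assign(abs_coef=lambda d: d["coef"].abs())
    .sort_values("abs_coef", ascending=False)
    .reset_index(drop=True)
)
print(coef_table)
```

Since the predictors in the notebook are standardized, each coefficient can be read as the change in FLFP (in percentage points) per one standard deviation of that predictor, and the sign says in which direction.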


### Lasso Regression (L1 regularization)

In [22]:
# Define alpha values to test (regularization strength)
alpha_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Initialize Lasso with GridSearch for hyperparameter tuning
lasso_grid = GridSearchCV(
    Lasso(random_state=42, max_iter=2000),
    param_grid={'alpha': alpha_values},
    cv=time_kfold,  # Time-based CV within training period
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Fit on training data (using scaled data)
lasso_grid.fit(X_train_scaled, y_train)

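# Get the best model
lasso_model = lasso_grid.best_estimator_
best_alpha = lasso_grid.best_params_['alpha']

# best_score_ is the negated mean squared error, so the value printed below
# is a cross-validated MSE, not an RMSE
print(f"Best alpha (regularization strength): {best_alpha}")
print(f"Cross-validation score: {-lasso_grid.best_score_:.3f}")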

# Make predictions
y_train_pred = lasso_model.predict(X_train_scaled)
y_val_pred = lasso_model.predict(X_val_scaled)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results (focus on fit measures)
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

# Show feature selection results
non_zero_features = np.sum(lasso_model.coef_ != 0)
print(f"\nFeature Selection:")
print(f"  Features selected: {non_zero_features}/{len(predictor_cols_linear)}")
print(f"  Features eliminated: {len(predictor_cols_linear) - non_zero_features}")

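# Coefficient magnitudes (features are scaled).
# Note: unlike the OLS and Ridge cells, this frame is not sorted, so head(10)
# lists the first ten predictors in column order rather than the ten largest
# coefficients; append .sort_values('importance', ascending=False) to rank them.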
feature_importance = pd.DataFrame({
    'feature': predictor_cols_linear,
    'importance': np.abs(lasso_model.coef_)
})

print(f"\nTop 10 Most Important Features:")
print(feature_importance.head(10))


Best alpha (regularization strength): 0.01
Cross-validation score: 123.082

Training Performance:
  RMSE: 11.023
  MAE:  8.468
  R²:   0.509

Validation Performance:
  RMSE: 11.685
  MAE:  8.948
  R²:   0.399

Feature Selection:
  Features selected: 22/22
  Features eliminated: 0

Top 10 Most Important Features:
                feature  importance
0        fertility_rate    2.807001
1  fertility_adolescent    2.298461
2      urban_population    1.409252
3      dependency_ratio    7.509818
4       life_exp_female    5.919945
5      infant_mortality    2.822093
6      population_total    0.635869
7   secondary_enroll_fe    0.141356
8  gdp_per_capita_const    3.652596
9            gdp_growth    0.289871
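
With the small alpha selected above, Lasso keeps all 22 features. How the number of surviving coefficients shrinks as alpha grows can be sketched on synthetic standardized data. The data and the alpha list here are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_toy = StandardScaler().fit_transform(rng.normal(size=(300, 6)))
# Only the first two features drive the synthetic target
y_toy = 3.0 * X_toy[:, 0] - 2.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=300)

nonzero_by_alpha = {}
for alpha in [0.001, 0.1, 1.0, 100.0]:
    model = Lasso(alpha=alpha, max_iter=5000).fit(X_toy, y_toy)
    nonzero_by_alpha[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero_by_alpha)
```

A very small alpha behaves almost like OLS, while a sufficiently large alpha sets every coefficient to zero and predicts the training mean.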


### Ridge Regression (L2 regularization)

In [23]:
# Define alpha values to test (regularization strength)
alpha_values = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Initialize Ridge with GridSearch for hyperparameter tuning
ridge_grid = GridSearchCV(
    Ridge(random_state=42),
    param_grid={'alpha': alpha_values},
    cv=time_kfold,  # Time-based CV within training period
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

# Fit on training data (using scaled data)
ridge_grid.fit(X_train_scaled, y_train)

# Get the best model
ridge_model = ridge_grid.best_estimator_
best_alpha = ridge_grid.best_params_['alpha']

print(f"Best alpha (regularization strength): {best_alpha}")
print(f"Cross-validation score: {-ridge_grid.best_score_:.3f}")

# Make predictions
y_train_pred = ridge_model.predict(X_train_scaled)
y_val_pred = ridge_model.predict(X_val_scaled)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

# Feature importance for Ridge (coefficient magnitudes)
feature_importance = pd.DataFrame({
    'feature': predictor_cols_linear,
    'importance': np.abs(ridge_model.coef_)
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Most Important Features:")
print(feature_importance.head(10))

Best alpha (regularization strength): 1.0
Cross-validation score: 123.182

Training Performance:
  RMSE: 11.022
  MAE:  8.462
  R²:   0.509

Validation Performance:
  RMSE: 11.703
  MAE:  8.969
  R²:   0.397

Top 10 Most Important Features:
                 feature  importance
21            region_ssa   11.915953
17            region_eca   11.300181
16            region_eap   11.091472
18            region_lac    9.805426
3       dependency_ratio    7.749371
4        life_exp_female    6.120997
15      income_upper_mid    4.618732
13           income_high    4.276166
8   gdp_per_capita_const    3.683888
0         fertility_rate    3.045783
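
The cross-validation score printed above is a mean squared error (the negated `neg_mean_squared_error`), whereas the training and validation figures are RMSEs. To put them on one scale, here is a sketch on synthetic data (assumed) that asks `GridSearchCV` for RMSE directly through the `neg_root_mean_squared_error` scorer:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

grid_toy = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 1.0, 100.0]},
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_root_mean_squared_error",
)
grid_toy.fit(X_toy, y_toy)

cv_rmse = -grid_toy.best_score_  # mean of the per-fold RMSEs for the best alpha
print(f"Best alpha: {grid_toy.best_params_['alpha']}, CV RMSE: {cv_rmse:.3f}")
```

Taking `np.sqrt` of the MSE-based score gives a similar but not identical number, because the square root of a mean is not the mean of the square roots.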


### Elastic Net Regression (L1+L2 regularization)

In [24]:
# Define grids for Elastic Net
alpha_values = [0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0]
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9]  # 0 → pure Ridge, 1 → pure Lasso

elastic_grid = GridSearchCV(
    ElasticNet(max_iter=5000, random_state=42),
    param_grid={'alpha': alpha_values, 'l1_ratio': l1_ratios},
    cv=time_kfold,                 # time-based CV within training period
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

print("Fitting Elastic Net with hyperparameter tuning...")
elastic_grid.fit(X_train_scaled, y_train)

elastic_model = elastic_grid.best_estimator_
best_params = elastic_grid.best_params_

print(f"Best hyperparameters:")
print(f"  alpha: {best_params['alpha']}")
print(f"  l1_ratio: {best_params['l1_ratio']}")
print(f"Cross-validation score: {-elastic_grid.best_score_:.3f}")

# Predictions
y_train_pred = elastic_model.predict(X_train_scaled)
y_val_pred = elastic_model.predict(X_val_scaled)

# Metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

# Coefficient magnitudes
coef_importance = pd.DataFrame({
    'feature': predictor_cols_linear,
    'coef': elastic_model.coef_,
    'abs_coef': np.abs(elastic_model.coef_)
}).sort_values('abs_coef', ascending=False)

print("\nTop 10 features by |coefficient|:")
print(coef_importance.head(10))

Fitting Elastic Net with hyperparameter tuning...
Best hyperparameters:
  alpha: 0.01
  l1_ratio: 0.5
Cross-validation score: 122.971

Training Performance:
  RMSE: 11.034
  MAE:  8.508
  R²:   0.508

Validation Performance:
  RMSE: 11.662
  MAE:  8.958
  R²:   0.402

Top 10 features by |coefficient|:
                 feature       coef   abs_coef
21            region_ssa  11.441329  11.441329
16            region_eap  10.580470  10.580470
17            region_eca  10.534444  10.534444
18            region_lac   9.139997   9.139997
3       dependency_ratio  -6.937745   6.937745
4        life_exp_female  -5.694028   5.694028
15      income_upper_mid  -4.116431   4.116431
8   gdp_per_capita_const   3.695069   3.695069
13           income_high  -3.628104   3.628104
19       region_namerica   2.602425   2.602425


## Support Vector Regression (SVR)

In [25]:
# Set simpler, regularized grid
param_grid = {
    # slightly favor smaller C (stronger regularization)
    'C': [0.01, 0.1, 1, 10],
    # a bit wider epsilon range (flatter function, less overfit)
    'epsilon': [0.05, 0.1, 0.2, 0.5],
    # use data-dependent gamma heuristics; large fixed gamma values can lead to overfitting
    'gamma': ['scale', 'auto']
}

svr_grid = GridSearchCV(
    SVR(kernel='rbf'),
    param_grid=param_grid,
    cv=time_kfold,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1
)

print("Fitting SVR (RBF) with more regularized hyperparameter grid...")
svr_grid.fit(X_train_scaled, y_train)
svr_model = svr_grid.best_estimator_
best_params = svr_grid.best_params_

print(f"Best hyperparameters:")
print(f"  C (regularization): {best_params['C']}")
print(f"  Gamma (kernel coef): {best_params['gamma']}")
print(f"  Epsilon (tolerance): {best_params['epsilon']}")
print(f"Cross-validation score: {-svr_grid.best_score_:.3f}")

# Make predictions
y_train_pred = svr_model.predict(X_train_scaled)
y_val_pred = svr_model.predict(X_val_scaled)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

print(f"\nModel Characteristics:")
print(f"  Kernel: RBF (Radial Basis Function)")
print(f"  Support vectors: {svr_model.n_support_}")
print(f"  Non-linear decision boundary")

Fitting SVR (RBF) with more regularized hyperparameter grid...
Fitting 5 folds for each of 32 candidates, totalling 160 fits
Best hyperparameters:
  C (regularization): 10
  Gamma (kernel coef): auto
  Epsilon (tolerance): 0.5
Cross-validation score: 56.455

Training Performance:
  RMSE: 5.958
  MAE:  3.489
  R²:   0.857

Validation Performance:
  RMSE: 7.780
  MAE:  5.393
  R²:   0.734

Model Characteristics:
  Kernel: RBF (Radial Basis Function)
  Support vectors: [2958]
  Non-linear decision boundary


## K-Nearest Neighbors Regression (KNN)

In [26]:
# Define hyperparameters to test
param_grid = {
    'n_neighbors': [3, 5, 7, 10, 15, 20],     # Number of neighbors
    'weights': ['uniform', 'distance'],        # Weighting scheme
    'metric': ['euclidean', 'manhattan']       # Distance metric
}

# Initialize KNN with GridSearch for hyperparameter tuning
knn_grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid=param_grid,
    cv=time_kfold,  # Time-based CV within training period
    scoring='neg_mean_squared_error',
    n_jobs=-1
)

print("Fitting KNN with hyperparameter tuning...")

# Fit on training data (using scaled data - KNN needs scaling!)
knn_grid.fit(X_train_scaled, y_train)

# Get the best model
knn_model = knn_grid.best_estimator_
best_params = knn_grid.best_params_

print(f"Best hyperparameters:")
print(f"  n_neighbors: {best_params['n_neighbors']}")
print(f"  weights: {best_params['weights']}")
print(f"  metric: {best_params['metric']}")
print(f"Cross-validation score: {-knn_grid.best_score_:.3f}")

# Make predictions
y_train_pred = knn_model.predict(X_train_scaled)
y_val_pred = knn_model.predict(X_val_scaled)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

print(f"\nModel Characteristics:")
print(f"  Non-parametric model")
print(f"  Memory-based learning")
print(f"  Local predictions based on nearest neighbors")

Fitting KNN with hyperparameter tuning...
Best hyperparameters:
  n_neighbors: 3
  weights: distance
  metric: manhattan
Cross-validation score: 8.704

Training Performance:
  RMSE: 0.000
  MAE:  0.000
  R²:   1.000

Validation Performance:
  RMSE: 2.415
  MAE:  1.544
  R²:   0.974

Model Characteristics:
  Non-parametric model
  Memory-based learning
  Local predictions based on nearest neighbors
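
A training RMSE of 0.000 is expected here and is not a sign of a perfect model: with `weights='distance'`, every training row is its own nearest neighbor at distance zero, so it receives all of the weight. The sketch below reproduces this effect with a minimal NumPy implementation on synthetic data. The helper `knn_predict_distance_weighted` is hypothetical and only approximates how scikit-learn handles zero distances; it is not the code used in this notebook.

```python
import numpy as np

def knn_predict_distance_weighted(X_train, y_train, X_query, k=3):
    """Minimal distance-weighted KNN regressor using the Manhattan metric."""
    preds = np.empty(len(X_query))
    for i, x in enumerate(X_query):
        dists = np.abs(X_train - x).sum(axis=1)   # Manhattan distance to every training row
        nearest = np.argsort(dists)[:k]           # indices of the k closest rows
        d, y = dists[nearest], y_train[nearest]
        if np.any(d == 0):
            # Exact matches take all the weight (assumed to mirror scikit-learn)
            preds[i] = y[d == 0].mean()
        else:
            w = 1.0 / d                           # inverse-distance weights
            preds[i] = np.sum(w * y) / np.sum(w)
    return preds

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))   # synthetic features, not the FLFP data
y_demo = rng.normal(size=50)        # synthetic target

train_pred = knn_predict_distance_weighted(X_demo, y_demo, X_demo)
train_rmse_demo = np.sqrt(np.mean((y_demo - train_pred) ** 2))
print(f"Training RMSE: {train_rmse_demo:.3f}")
```

For this reason, only the cross-validation and validation metrics are informative for the distance-weighted KNN model.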


## Tree-based models

### Random Forest

**Note**: Using reduced hyperparameter grids for initial model comparison. These smaller grids provide sufficient exploration to compare algorithm performance while keeping runtime manageable. Full hyperparameter optimization can be done later for the best-performing models.
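
As a rough guide to runtime, the number of model fits in a grid search is the product of the option counts multiplied by the number of CV folds. The sketch below counts this for the Random Forest grid defined in the next cell; the fold count of 5 is taken from the search logs, and the variable names are illustrative only.

```python
from math import prod

# Random Forest grid, copied from the cell that follows
rf_param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [8, 12, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2],
    'max_features': [0.2, 0.3, 'sqrt'],
}

n_splits = 5  # folds in time_kfold, as reported by the search logs

# One candidate per combination of hyperparameter values
n_candidates = prod(len(values) for values in rf_param_grid.values())
n_fits = n_candidates * n_splits

print(f"{n_candidates} candidates x {n_splits} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also fits the best configuration once more on the full training set.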

In [31]:
# Note that we use the imputed but unscaled data for Random Forest

# Define hyperparameters to test (reduced grid for initial comparison)
param_grid = {
    'n_estimators': [100, 200],               # Number of trees
    'max_depth': [8, 12, None],               # Maximum tree depth
    'min_samples_split': [2, 5, 10],          # Min samples to split node
    'min_samples_leaf': [1, 2],               # Min samples in leaf
    'max_features': [0.2, 0.3, 'sqrt']        # Features per split
}

# Initialize Random Forest with GridSearch for hyperparameter tuning
rf_grid = GridSearchCV(
    RandomForestRegressor(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    cv=time_kfold,  # Time-based CV within training period
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1  # Show progress
)

print("Fitting Random Forest with hyperparameter tuning...")
print("This may take several minutes...")

# Fit on training data (using imputed but unscaled data)
rf_grid.fit(X_train_imputed, y_train)

# Get the best model
rf_model = rf_grid.best_estimator_
best_params = rf_grid.best_params_

print(f"Best hyperparameters:")
print(f"  n_estimators: {best_params['n_estimators']}")
print(f"  max_depth: {best_params['max_depth']}")
print(f"  min_samples_split: {best_params['min_samples_split']}")
print(f"  min_samples_leaf: {best_params['min_samples_leaf']}")
print(f"  max_features: {best_params['max_features']}")
print(f"Cross-validation score: {-rf_grid.best_score_:.3f}")

# Make predictions
y_train_pred = rf_model.predict(X_train_imputed)
y_val_pred = rf_model.predict(X_val_imputed)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

# Save feature importance for model interpretation and comparison
feature_importance = pd.DataFrame({
    'feature': predictor_cols_tree,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Most Important Features:")
print(feature_importance.head(10))

Fitting Random Forest with hyperparameter tuning...
This may take several minutes...
Fitting 5 folds for each of 108 candidates, totalling 540 fits
Best hyperparameters:
  n_estimators: 200
  max_depth: None
  min_samples_split: 2
  min_samples_leaf: 1
  max_features: 0.3
Cross-validation score: 39.647

Training Performance:
  RMSE: 0.933
  MAE:  0.595
  R²:   0.996

Validation Performance:
  RMSE: 5.407
  MAE:  3.869
  R²:   0.871

Top 10 Most Important Features:
                 feature  importance
6       population_total    0.105949
7    secondary_enroll_fe    0.090172
0         fertility_rate    0.086890
1   fertility_adolescent    0.079364
2       urban_population    0.074516
4        life_exp_female    0.070776
8   gdp_per_capita_const    0.070467
12           rule_of_law    0.066866
14            region_eap    0.059568
19            region_ssa    0.051988


### XGBoost

In [28]:
# Note that we use the raw unimputed data for XGBoost because it can handle missing values natively

# Clean column names to snake_case for consistency
X_train_raw_xgb = X_train_raw.copy()
X_val_raw_xgb = X_val_raw.copy()
X_test_raw_xgb = X_test_raw.copy()

# Convert to snake_case: lowercase, replace special chars and spaces with underscores
for frame in (X_train_raw_xgb, X_val_raw_xgb, X_test_raw_xgb):
    frame.columns = (frame.columns
                     .str.lower()
                     .str.replace('[^a-z0-9]+', '_', regex=True)
                     .str.strip('_'))

# Define hyperparameters to test (reduced grid for initial comparison)
param_grid = {
    'n_estimators': [100, 200],               # Number of boosting rounds
    'max_depth': [3, 6],                      # Maximum tree depth
    'learning_rate': [0.01, 0.1],             # Step size shrinkage
    'subsample': [0.9, 1.0],                  # Fraction of samples per tree
    'colsample_bytree': [0.9, 1.0]            # Fraction of features per tree
}

# Initialize XGBoost with GridSearch for hyperparameter tuning
xgb_grid = GridSearchCV(
    xgb.XGBRegressor(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    cv=time_kfold,  # Time-based CV within training period
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1  # Show progress
)

print("Fitting XGBoost with hyperparameter tuning...")
print("This may take several minutes...")

# Fit on cleaned training data
xgb_grid.fit(X_train_raw_xgb, y_train)

# Get the best model
xgb_model = xgb_grid.best_estimator_
best_params = xgb_grid.best_params_

print(f"Best hyperparameters:")
print(f"  n_estimators: {best_params['n_estimators']}")
print(f"  max_depth: {best_params['max_depth']}")
print(f"  learning_rate: {best_params['learning_rate']}")
print(f"  subsample: {best_params['subsample']}")
print(f"  colsample_bytree: {best_params['colsample_bytree']}")
print(f"Cross-validation score: {-xgb_grid.best_score_:.3f}")

# Make predictions using cleaned data
y_train_pred = xgb_model.predict(X_train_raw_xgb)
y_val_pred = xgb_model.predict(X_val_raw_xgb)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

# Feature importance for XGBoost
feature_importance = pd.DataFrame({
    'feature': predictor_cols_tree,
    'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Most Important Features:")
print(feature_importance.head(10))

Fitting XGBoost with hyperparameter tuning...
This may take several minutes...
Fitting 5 folds for each of 32 candidates, totalling 160 fits
Best hyperparameters:
  n_estimators: 200
  max_depth: 6
  learning_rate: 0.1
  subsample: 1.0
  colsample_bytree: 0.9
Cross-validation score: 44.433

Training Performance:
  RMSE: 0.923
  MAE:  0.643
  R²:   0.997

Validation Performance:
  RMSE: 6.104
  MAE:  4.073
  R²:   0.836

Top 10 Most Important Features:
                 feature  importance
19            region_ssa    0.241206
14            region_eap    0.191705
16            region_lac    0.188869
15            region_eca    0.115791
13  income_level_encoded    0.059938
4        life_exp_female    0.032786
6       population_total    0.023580
8   gdp_per_capita_const    0.022789
2       urban_population    0.015393
0         fertility_rate    0.015321


### LightGBM

In [29]:
# Note that we use the raw unimputed data for LightGBM because it can handle missing values natively

# IMPORTANT: Clean column names for LightGBM (it doesn't support special characters)
# Convert to snake_case for consistency
X_train_raw_lgb = X_train_raw.copy()
X_val_raw_lgb = X_val_raw.copy()
X_test_raw_lgb = X_test_raw.copy()

# Convert to snake_case: lowercase, replace special chars and spaces with underscores
for frame in (X_train_raw_lgb, X_val_raw_lgb, X_test_raw_lgb):
    frame.columns = (frame.columns
                     .str.lower()
                     .str.replace('[^a-z0-9]+', '_', regex=True)
                     .str.strip('_'))

# Define hyperparameters to test (reduced grid for initial comparison)
param_grid = {
    'n_estimators': [100, 200],               # Number of boosting rounds
    'max_depth': [3, 6],                      # Maximum tree depth
    'learning_rate': [0.01, 0.1],             # Step size shrinkage
    'subsample': [0.9, 1.0],                  # Fraction of samples per tree (ignored by LightGBM unless subsample_freq > 0)
    'colsample_bytree': [0.9, 1.0],           # Fraction of features per tree
    'num_leaves': [31, 50]                    # Maximum number of leaves per tree
}

# Initialize LightGBM with GridSearch for hyperparameter tuning
lgb_grid = GridSearchCV(
    lgb.LGBMRegressor(random_state=42, n_jobs=-1, verbose=-1),
    param_grid=param_grid,
    cv=time_kfold,  # Time-based CV within training period
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1  # Show progress
)

print("Fitting LightGBM with hyperparameter tuning...")
print("This may take several minutes...")

# Fit on cleaned training data
lgb_grid.fit(X_train_raw_lgb, y_train)

# Get the best model
lgb_model = lgb_grid.best_estimator_
best_params = lgb_grid.best_params_

print(f"Best hyperparameters:")
print(f"  n_estimators: {best_params['n_estimators']}")
print(f"  max_depth: {best_params['max_depth']}")
print(f"  learning_rate: {best_params['learning_rate']}")
print(f"  subsample: {best_params['subsample']}")
print(f"  colsample_bytree: {best_params['colsample_bytree']}")
print(f"  num_leaves: {best_params['num_leaves']}")
print(f"Cross-validation score: {-lgb_grid.best_score_:.3f}")

# Make predictions using cleaned data
y_train_pred = lgb_model.predict(X_train_raw_lgb)
y_val_pred = lgb_model.predict(X_val_raw_lgb)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

# Feature importance for LightGBM
feature_importance = pd.DataFrame({
    'feature': predictor_cols_tree,
    'importance': lgb_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Most Important Features:")
print(feature_importance.head(10))

Fitting LightGBM with hyperparameter tuning...
This may take several minutes...
Fitting 5 folds for each of 64 candidates, totalling 320 fits
Best hyperparameters:
  n_estimators: 200
  max_depth: 6
  learning_rate: 0.1
  subsample: 0.9
  colsample_bytree: 0.9
  num_leaves: 31
Cross-validation score: 40.887

Training Performance:
  RMSE: 1.382
  MAE:  0.970
  R²:   0.992

Validation Performance:
  RMSE: 6.481
  MAE:  4.147
  R²:   0.815

Top 10 Most Important Features:
                 feature  importance
6       population_total         581
2       urban_population         483
8   gdp_per_capita_const         478
1   fertility_adolescent         438
12           rule_of_law         378
10          services_gdp         374
7    secondary_enroll_fe         340
5       infant_mortality         332
0         fertility_rate         302
11          industry_gdp         290


### CatBoost


In [30]:
# Note that we use the raw unimputed data for CatBoost because it can handle missing values natively

# Clean column names to snake_case for consistency
X_train_raw_cat = X_train_raw.copy()
X_val_raw_cat = X_val_raw.copy()
X_test_raw_cat = X_test_raw.copy()

# Convert to snake_case: lowercase, replace special chars and spaces with underscores
for frame in (X_train_raw_cat, X_val_raw_cat, X_test_raw_cat):
    frame.columns = (frame.columns
                     .str.lower()
                     .str.replace('[^a-z0-9]+', '_', regex=True)
                     .str.strip('_'))

# Define hyperparameters to test (reduced grid for initial comparison)
param_grid = {
    'iterations': [100, 200],                 # Number of boosting rounds
    'depth': [3, 6],                          # Maximum tree depth
    'learning_rate': [0.01, 0.1],             # Step size shrinkage
    'l2_leaf_reg': [1, 3, 5]                  # L2 regularization
}

# Initialize CatBoost with GridSearch for hyperparameter tuning
catboost_grid = GridSearchCV(
    CatBoostRegressor(random_state=42, verbose=0),
    param_grid=param_grid,
    cv=time_kfold,  # Time-based CV within training period
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    verbose=1  # Show progress
)

print("Fitting CatBoost with hyperparameter tuning...")
print("This may take several minutes...")

# Fit on cleaned training data
catboost_grid.fit(X_train_raw_cat, y_train)

# Get the best model
catboost_model = catboost_grid.best_estimator_
best_params = catboost_grid.best_params_

print(f"Best hyperparameters:")
print(f"  iterations: {best_params['iterations']}")
print(f"  depth: {best_params['depth']}")
print(f"  learning_rate: {best_params['learning_rate']}")
print(f"  l2_leaf_reg: {best_params['l2_leaf_reg']}")
print(f"Cross-validation score: {-catboost_grid.best_score_:.3f}")

# Make predictions using cleaned data
y_train_pred = catboost_model.predict(X_train_raw_cat)
y_val_pred = catboost_model.predict(X_val_raw_cat)

# Calculate performance metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

train_mae = mean_absolute_error(y_train, y_train_pred)
val_mae = mean_absolute_error(y_val, y_val_pred)

train_r2 = r2_score(y_train, y_train_pred)
val_r2 = r2_score(y_val, y_val_pred)

# Display results
print(f"\nTraining Performance:")
print(f"  RMSE: {train_rmse:.3f}")
print(f"  MAE:  {train_mae:.3f}")
print(f"  R²:   {train_r2:.3f}")

print(f"\nValidation Performance:")
print(f"  RMSE: {val_rmse:.3f}")
print(f"  MAE:  {val_mae:.3f}")
print(f"  R²:   {val_r2:.3f}")

# Feature importance for CatBoost
feature_importance = pd.DataFrame({
    'feature': predictor_cols_tree,
    'importance': catboost_model.feature_importances_
}).sort_values('importance', ascending=False)

print(f"\nTop 10 Most Important Features:")
print(feature_importance.head(10))


Fitting CatBoost with hyperparameter tuning...
This may take several minutes...
Fitting 5 folds for each of 24 candidates, totalling 120 fits
Best hyperparameters:
  iterations: 200
  depth: 6
  learning_rate: 0.1
  l2_leaf_reg: 1
Cross-validation score: 35.459

Training Performance:
  RMSE: 2.551
  MAE:  1.879
  R²:   0.974

Validation Performance:
  RMSE: 6.377
  MAE:  4.554
  R²:   0.821

Top 10 Most Important Features:
                 feature  importance
6       population_total   13.031898
14            region_eap   10.052394
19            region_ssa    9.579160
2       urban_population    9.163742
0         fertility_rate    8.635526
1   fertility_adolescent    7.487290
8   gdp_per_capita_const    6.228070
13  income_level_encoded    6.204995
16            region_lac    4.751803
3       dependency_ratio    4.583082
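
The cross-validation scores printed by the searches above are mean squared errors (the negated `neg_mean_squared_error`), so they are not on the same scale as the RMSE values reported for training and validation. A small sketch of the conversion, using the best CV scores copied from the outputs above:

```python
import math

# Best cross-validation MSE per model, copied from the search outputs above
cv_mse = {
    "Elastic Net": 122.971,
    "SVR (RBF)": 56.455,
    "KNN": 8.704,
    "Random Forest": 39.647,
    "XGBoost": 44.433,
    "LightGBM": 40.887,
    "CatBoost": 35.459,
}

# Square root puts the CV score on the same scale as the reported RMSE
cv_rmse = {model: math.sqrt(mse) for model, mse in cv_mse.items()}

for model, rmse in sorted(cv_rmse.items(), key=lambda item: item[1]):
    print(f"{model:<14} CV RMSE ~ {rmse:.2f}")
```

Note that the square root of the fold-averaged MSE is only an approximation of the fold-averaged RMSE; to report the latter directly, the searches could instead be scored with `neg_root_mean_squared_error`.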


## Final Evaluation on Test Set
Evaluate the pre-selected shortlist of models on the held-out test set.

In [32]:
results_test = []

# 1. Elastic Net (uses X_test_scaled)
y_test_pred_en = elastic_model.predict(X_test_scaled)
results_test.append({
    "model": "Elastic Net",
    "rmse": np.sqrt(mean_squared_error(y_test, y_test_pred_en)),
    "mae":  mean_absolute_error(y_test, y_test_pred_en),
    "r2":   r2_score(y_test, y_test_pred_en),
})

# 2. SVR (RBF) (uses X_test_scaled)
y_test_pred_svr = svr_model.predict(X_test_scaled)
results_test.append({
    "model": "SVR (RBF)",
    "rmse": np.sqrt(mean_squared_error(y_test, y_test_pred_svr)),
    "mae":  mean_absolute_error(y_test, y_test_pred_svr),
    "r2":   r2_score(y_test, y_test_pred_svr),
})

# 3. KNN (uses X_test_scaled)
y_test_pred_knn = knn_model.predict(X_test_scaled)
results_test.append({
    "model": "KNN",
    "rmse": np.sqrt(mean_squared_error(y_test, y_test_pred_knn)),
    "mae":  mean_absolute_error(y_test, y_test_pred_knn),
    "r2":   r2_score(y_test, y_test_pred_knn),
})

# 4. Random Forest (uses X_test_imputed)
y_test_pred_rf = rf_model.predict(X_test_imputed)
results_test.append({
    "model": "Random Forest",
    "rmse": np.sqrt(mean_squared_error(y_test, y_test_pred_rf)),
    "mae":  mean_absolute_error(y_test, y_test_pred_rf),
    "r2":   r2_score(y_test, y_test_pred_rf),
})

# 5. CatBoost (uses X_test_raw_cat)
y_test_pred_cat = catboost_model.predict(X_test_raw_cat)
results_test.append({
    "model": "CatBoost",
    "rmse": np.sqrt(mean_squared_error(y_test, y_test_pred_cat)),
    "mae":  mean_absolute_error(y_test, y_test_pred_cat),
    "r2":   r2_score(y_test, y_test_pred_cat),
})

# Nicely formatted table
test_results_df = pd.DataFrame(results_test).sort_values("r2", ascending=False)
print("Test-set performance (2021–2023):")
display(test_results_df)

Test-set performance (2021–2023):


           model       rmse       mae        r2
2            KNN   4.005239  2.284440  0.931236
3  Random Forest   7.481277  5.505966  0.760087
1      SVR (RBF)   7.900602  5.464321  0.732439
4       CatBoost   8.072678  5.730122  0.720657
0    Elastic Net  11.874556  9.197685  0.395582


## Conclusion

Under the country‑based split, where entire countries were held out, every model faced a hard generalization problem and performance was modest. Regularized linear models (Lasso/Elastic Net) and SVR did best, while the tree and boosting models overfit the training countries and underperformed on validation. Moving to a temporal 80/10/10 split by year (training on 2000–2018, validating on 2019–2020, testing on 2021–2023) made the forecasting task much easier: all models improved sharply because they now predict later years for countries and regions already seen in the training data. In this setup, the linear models achieved validation R² around 0.40, SVR around 0.73, and the tree and boosting models (Random Forest, XGBoost, LightGBM, CatBoost) between 0.82 and 0.87. KNN scored highest on validation, with R² ≈ 0.97.

On the held‑out test years, we evaluated a shortlist of models chosen based on temporal CV and validation performance: Elastic Net, SVR (RBF), KNN, Random Forest, and CatBoost. Every model lost some accuracy relative to validation, as expected, but the size of the drop varied: Elastic Net and SVR were essentially unchanged, while Random Forest and CatBoost each lost roughly 0.10 in R². Elastic Net served as the linear baseline (test R² ≈ 0.40). SVR, Random Forest, and CatBoost retained good forecasting accuracy on the test period, with test R² between 0.72 and 0.76, and Random Forest was the best of this group (test RMSE ≈ 7.48, R² ≈ 0.76). The standout performer on the test set was KNN, with very low RMSE (≈ 4.0) and very high R² (≈ 0.93), reflecting its strength as a local interpolation method when future observations closely resemble past ones in feature space.
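
To make the validation-to-test comparison concrete, the sketch below computes the change in R² for each shortlisted model. The numbers are copied from the validation and test outputs reported above; the dictionary layout is only an illustration.

```python
# (validation R², test R²) per model, copied from the outputs above
r2_scores = {
    "Elastic Net":   (0.402, 0.395582),
    "SVR (RBF)":     (0.734, 0.732439),
    "KNN":           (0.974, 0.931236),
    "Random Forest": (0.871, 0.760087),
    "CatBoost":      (0.821, 0.720657),
}

# Positive values mean the model did worse on the test years than on validation
r2_drop = {model: val - test for model, (val, test) in r2_scores.items()}

for model, drop in sorted(r2_drop.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:<14} R² drop from validation to test: {drop:.3f}")
```

Sorting by the size of the drop shows which models degrade most as the forecast horizon extends past the validation years.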

Despite KNN’s superior test metrics, we chose Random Forest as the final model for the app. KNN’s strength here comes from its highly local nature: it effectively averages the outcomes of the most similar historical cases, which works extremely well for short‑horizon forecasting of “more of the same” but makes it fragile when asked to predict for unusual or off‑manifold combinations of predictors. The slider‑based app will deliberately create hypothetical countries that may not closely match any real historical observation, and in such regions KNN can behave unpredictably because of the curse of dimensionality and its reliance on distance in a 20+ dimensional feature space. Random Forest, by contrast, learns a more global mapping from predictors to FLFP by averaging many trees that each partition the full feature space, so its predictions respond more stably as sliders move, and it still delivers strong forecast performance on the temporal test set. It does not truly extrapolate, however: like any tree ensemble, it cannot predict values outside the range of FLFP seen in training. On balance, we treat KNN as a useful benchmark for what local interpolation can achieve, but adopt Random Forest as the primary model for deployment.
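
One way to see when a slider configuration falls outside the region where KNN is reliable is to compare its distance to the nearest training row against the typical spacing between training rows. The sketch below illustrates the idea on synthetic data; the helper `nearest_manhattan_distance`, the 95th-percentile cutoff, and the example queries are all assumptions for illustration, not part of the app.

```python
import numpy as np

def nearest_manhattan_distance(X_ref, x):
    """Manhattan distance from point x to its closest row in X_ref."""
    return np.abs(X_ref - x).sum(axis=1).min()

rng = np.random.default_rng(42)
n_features = 20
X_demo = rng.normal(size=(200, n_features))  # synthetic stand-in for scaled training features

# Typical spacing: distance from each training row to its nearest other row
pairwise = np.abs(X_demo[:, None, :] - X_demo[None, :, :]).sum(axis=2)
np.fill_diagonal(pairwise, np.inf)           # ignore each row's zero distance to itself
threshold = np.percentile(pairwise.min(axis=1), 95)  # assumed cutoff

queries = {
    "near an observed row": X_demo[0] + 0.01,
    "all sliders at +4 SD": np.full(n_features, 4.0),
}
distances = {name: nearest_manhattan_distance(X_demo, q) for name, q in queries.items()}

for name, dist in distances.items():
    status = "far from training data" if dist > threshold else "within typical spacing"
    print(f"{name}: nearest distance {dist:.2f} ({status})")
```

A check of this kind could also be used to warn app users when a hypothetical country lies far from anything observed, whichever model produces the prediction.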
