# MODELING

## PROBLEM DEFINITION AND GOALS

**We aim to model, predict, and simulate foreign aid dynamics for Kenya.**  

### CORE MODELING OBJECTIVES

1. **Aid Flow Stability And Volatility Analysis**  
   - Goal: Identify which sectors, agencies, and partners are most stable, volatile, or dominant in aid delivery.  
   - Approach:  
     - Compute volatility metrics (rolling std, coefficient of variation).  
     - Classify entities using unsupervised clustering or ranking.  
     - Highlight dominant vs. emerging actors.

2. **Predictive And Scenario Forecasting**  
   - Goal: Predict future funding flows or simulate policy shocks (e.g., “25% cut in USAID funding”).  
   - Approach: 
     - Use regression models (XGBoost, Random Forest, Prophet) on aid totals.  
     - Simulate counterfactual scenarios by adjusting model inputs.

3. **Temporal Forecasting And Shock Simulation**  
   - Goal: Model aid trends across fiscal years and assess resilience to fiscal shocks.  
   - Approach:  
     - Time series forecasting (Prophet, ARIMA, LSTM).  
     - Introduce synthetic shocks to test system response.

4. **Sectoral Dependency Clustering**  
   - Goal: Cluster sectors by their dependency on foreign aid.  
   - Approach:  
     - Use KMeans, DBSCAN, or hierarchical clustering.  
     - Base features on aid ratios and concentration metrics.
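
The volatility metrics named in objective 1 can be sketched as follows. This is a minimal illustration on made-up numbers, assuming one row per sector and fiscal year; the helper name `volatility_table` and the toy values are hypothetical, not part of the dataset.

```python
import numpy as np
import pandas as pd

def volatility_table(yearly, entity_col, value_col, window=3):
    """Summarise volatility per entity from one row per (entity, fiscal_year)."""
    yearly = yearly.sort_values([entity_col, "fiscal_year"])
    grouped = yearly.groupby(entity_col)[value_col]
    summary = grouped.agg(mean_aid="mean", std_aid="std")
    # Coefficient of variation: std relative to the mean (undefined when the mean is 0).
    summary["cv"] = summary["std_aid"] / summary["mean_aid"].abs().replace(0, np.nan)
    # Average of the rolling standard deviation over `window` consecutive years.
    rolling_std = grouped.rolling(window).std()
    summary["mean_rolling_std"] = rolling_std.groupby(level=0).mean()
    # Most stable entities (lowest CV) first.
    return summary.sort_values("cv")

# Hypothetical yearly totals, for illustration only.
toy = pd.DataFrame({
    "us_sector_name": ["Health"] * 4 + ["Agriculture"] * 4,
    "fiscal_year": [2020, 2021, 2022, 2023] * 2,
    "total_sector_aid": [100.0, 110.0, 90.0, 105.0, 10.0, 80.0, 5.0, 60.0],
})
ranking = volatility_table(toy, "us_sector_name", "total_sector_aid")
print(ranking)
```

Sorting by `cv` gives a simple stability ranking; the same table could feed a clustering step instead.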

## DATA AUDIT

### STRONG COLUMNS
These columns provide robust quantitative and categorical anchors:
- Temporal: fiscal_year, transaction_date, year, quarter
- Categorical Context: country_name, us_sector_name, us_category_name
- Management Context: managing_subagency_or_bureau_name, funded_and_managed_by, dominant_sector_per_agency
- Monetary & Variability Indicators:  
  current_dollar_amount, constant_dollar_amount, total_aid_fiscal,  
  aid_volatility, rolling_mean_3yr, rolling_std_3yr, aid_concentration_index,  
  top3_agency_share, sector_to_total_ratio, agency_to_total_ratio,  
  aid_per_partner, relative_aid_share, aid_std_fiscal,  
  mean_aid_per_transaction_fiscal

### COLUMNS THAT NEED ATTENTION
- objective, transaction_type_name -> may need grouping or encoding.  
- transaction_lag, transaction_lead -> align correctly with fiscal years.  
- is_end_of_fiscal_year, is_holiday_quarter -> binary encoding.  
- days_since_start_of_year -> useful for time-decay effects.

### COLUMNS TO DROP
- transaction_date -> replace with derived temporal features.  
- country_name -> constant for Kenya, can be dropped.  
- Columns ending in _interaction -> use dimensionality reduction (PCA) or selective inclusion.
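
Replacing `transaction_date` with derived temporal features might look like the sketch below. It relies on the U.S. federal fiscal year running from 1 October to 30 September, which matches the rows shown in this notebook (for example, 2022-12-31 falls in fiscal year 2023). The definitions used for `is_end_of_fiscal_year` (September) and `is_holiday_quarter` (calendar Q4) are assumptions, as is the helper name `add_temporal_features`.

```python
import pandas as pd

def add_temporal_features(df, date_col="transaction_date"):
    """Derive calendar features from the date column, then drop the raw date."""
    out = df.copy()
    dates = pd.to_datetime(out[date_col])
    # U.S. fiscal year N runs from 1 October of year N-1 to 30 September of year N.
    out["fiscal_year_derived"] = dates.dt.year + (dates.dt.month >= 10).astype(int)
    # Assumed definitions: September closes the fiscal year; Q4 is the holiday quarter.
    out["is_end_of_fiscal_year"] = (dates.dt.month == 9).astype(int)
    out["is_holiday_quarter"] = (dates.dt.quarter == 4).astype(int)
    out["days_since_start_of_year"] = dates.dt.dayofyear - 1
    return out.drop(columns=[date_col])

sample = pd.DataFrame({"transaction_date": ["2024-09-02", "2022-12-31", "2024-11-01"]})
derived = add_temporal_features(sample)
print(derived)
```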

## WORKFLOW

### DATA PREPARATION
1. Impute missing values with a method appropriate to each column type.  
2. Reduce skew in monetary variables with a log1p transform. The data contains negative transaction amounts, and log1p is undefined below -1, so those rows need a signed variant or separate handling.  
3. Encode categorical variables using **target encoding** or **frequency encoding**.  
4. Aggregate aid data by **year**, **sector**, or **agency** depending on the model goal.  
5. Split data chronologically (e.g., train up to 2019, test from 2020 onward).
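
A minimal sketch of steps 2, 3, and 5 on toy rows follows. The signed log transform is one possible way to cope with negative amounts, not a settled choice, and the frequency map is fitted on the training years only so that nothing from the test period leaks in. The helper `signed_log1p` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def signed_log1p(x):
    """log1p applied to the magnitude, keeping the sign, so negative amounts stay finite."""
    return np.sign(x) * np.log1p(np.abs(x))

toy = pd.DataFrame({
    "fiscal_year": [2018, 2019, 2019, 2020, 2021],
    "us_sector_name": ["Health", "Health", "Education", "Health", "Education"],
    "constant_dollar_amount": [97399.0, 299116.0, -72650.0, 24689.0, 0.0],
})

# Chronological split: train up to 2019, test from 2020 onward.
train = toy[toy["fiscal_year"] <= 2019].copy()
test = toy[toy["fiscal_year"] >= 2020].copy()

# Frequency encoding learned from the training rows only.
freq = train["us_sector_name"].value_counts(normalize=True)
for part in (train, test):
    # Categories unseen in training get frequency 0.
    part["sector_freq"] = part["us_sector_name"].map(freq).fillna(0.0)
    part["amount_log"] = signed_log1p(part["constant_dollar_amount"])

print(train)
print(test)
```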

### FEATURE ENGINEERING CONSIDERATIONS
- Rolling statistics and growth rates: rolling_mean_3yr, rolling_std_3yr, sector_growth_rate, agency_growth_rate
- Concentration & diversity: aid_concentration_index, aid_diversity_index
- Ratios: sector_to_total_ratio, agency_to_total_ratio
- Interaction terms: agency_sector_interaction, partner_agency_interaction
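
The ratio, concentration, and growth features above could be computed along these lines. The notebook does not define `aid_concentration_index` or `aid_diversity_index`, so this sketch assumes a Herfindahl-Hirschman index over agency shares and its complement; the toy agencies and amounts are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "fiscal_year": [2021, 2021, 2021, 2022, 2022],
    "agency": ["A", "B", "C", "A", "B"],
    "aid": [60.0, 30.0, 10.0, 50.0, 50.0],
})

# Share of each agency in its fiscal-year total.
year_total = toy.groupby("fiscal_year")["aid"].transform("sum")
toy["agency_to_total_ratio"] = toy["aid"] / year_total

# Assumed definition: HHI = sum of squared shares per year (1.0 means a single agency).
hhi = (toy["agency_to_total_ratio"] ** 2).groupby(toy["fiscal_year"]).sum()
yearly_features = pd.DataFrame({
    "aid_concentration_index": hhi,
    "aid_diversity_index": 1.0 - hhi,
})

# Year-over-year growth per agency; the first observed year has no growth rate.
toy = toy.sort_values(["agency", "fiscal_year"])
toy["agency_growth_rate"] = toy.groupby("agency")["aid"].pct_change()

print(yearly_features)
print(toy)
```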

### TARGET PICKING

| Objective | Problem Type | Model Candidates | Key Targets |
|------------|---------------|------------------|--------------|
| **Stability & Volatility** | Unsupervised / Ranking | KMeans, Isolation Forest, PCA, Time-Series Clustering | aid_volatility, aid_concentration_index |
| **Predictive Forecasting** | Supervised Regression | XGBoost, LightGBM, Prophet | total_aid_fiscal, constant_dollar_amount |
| **Temporal Forecasting** | Time Series | Prophet, ARIMA, LSTM | total_aid_fiscal, sector_to_total_ratio |
| **Sector Dependency** | Clustering | KMeans, Spectral, DBSCAN | sector_to_total_ratio, aid_per_partner, aid_concentration_index |
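
As an illustration of the sector dependency row, the sketch below clusters a handful of made-up sectors on two of the listed features and scores the result. The sector labels, feature values, and the choice of two clusters are all assumptions for demonstration; in practice the number of clusters would be selected by comparing scores across several values.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Hypothetical per-sector features, for illustration only.
sectors = pd.DataFrame(
    {
        "sector_to_total_ratio": [0.40, 0.38, 0.42, 0.05, 0.04, 0.06],
        "aid_concentration_index": [0.90, 0.85, 0.95, 0.20, 0.25, 0.15],
    },
    index=["sector_a", "sector_b", "sector_c", "sector_d", "sector_e", "sector_f"],
)

# Standardise first so both features contribute equally to the distance.
X = StandardScaler().fit_transform(sectors)

model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)
score = silhouette_score(X, labels)

sectors["cluster"] = labels
print(sectors)
print("silhouette:", round(score, 3))
```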

### EVALUATION METRICS
- **Regression Models:** RMSE, MAE, MAPE, R²  
- **Forecasting Models:** MAPE, RMSE, Directional Accuracy  
- **Clustering Models:** Silhouette Score, Calinski-Harabasz Index  
- **Volatility Analysis:** Rolling std comparison, stability ranking
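
The regression and forecasting metrics can be written in a few lines of NumPy. This is a sketch rather than the project's evaluation code. Directional accuracy has no single standard definition; here it is assumed to be the share of periods in which the forecast moves in the same direction as the actual value, both measured from the previous actual. MAPE skips periods with an actual of zero, which matters for aid series that can contain zero or negative totals.

```python
import numpy as np

def rmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent, ignoring periods where the actual is 0."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mask = y_true != 0
    return float(np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100)

def directional_accuracy(y_true, y_pred):
    """Share of periods where forecast and actual move the same way from the last actual."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    actual_change = np.sign(y_true[1:] - y_true[:-1])
    predicted_change = np.sign(y_pred[1:] - y_true[:-1])
    return float(np.mean(actual_change == predicted_change))

# Hypothetical actuals and forecasts.
actual = [100.0, 110.0, 105.0, 120.0]
forecast = [98.0, 108.0, 109.0, 118.0]
print(rmse(actual, forecast), mae(actual, forecast),
      mape(actual, forecast), directional_accuracy(actual, forecast))
```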

### SCENARIO SIMULATION
Run post-model scenario experiments:
- Introduce a **shock factor**, e.g., USAID_cut = -0.25  
- Recompute predicted totals and compare to baseline.  
- Measure cascading impacts by sector (sector_to_total_ratio shifts).  
- Visualize using waterfall charts or delta bar plots.
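
The shock experiment can be sketched on a table of baseline predictions. This simplified version scales one agency's predicted aid directly instead of re-running a fitted model with adjusted inputs, so it only captures the mechanical effect of the cut. The amounts, the agency labels, and the helper names `apply_shock` and `sector_shares` are hypothetical.

```python
import pandas as pd

# Hypothetical baseline predictions by agency and sector.
baseline = pd.DataFrame({
    "agency": ["USAID", "USAID", "State", "State"],
    "sector": ["Health", "Education", "Health", "Education"],
    "predicted_aid": [400.0, 200.0, 300.0, 100.0],
})

def apply_shock(pred, agency, shock):
    """Scale one agency's predicted aid by (1 + shock); shock=-0.25 is a 25% cut."""
    shocked = pred.copy()
    mask = shocked["agency"] == agency
    shocked.loc[mask, "predicted_aid"] *= 1.0 + shock
    return shocked

def sector_shares(pred):
    """Each sector's share of total predicted aid."""
    by_sector = pred.groupby("sector")["predicted_aid"].sum()
    return by_sector / by_sector.sum()

USAID_cut = -0.25
shocked = apply_shock(baseline, "USAID", USAID_cut)

total_delta = shocked["predicted_aid"].sum() - baseline["predicted_aid"].sum()
share_delta = sector_shares(shocked) - sector_shares(baseline)
print("change in total:", total_delta)
print(share_delta)
```

The `share_delta` series is the kind of quantity a delta bar plot would display.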

### VISUALS
- **Time Series:** Aid trends per sector/agency.  
- **Heatmaps:** Volatility or concentration over time.  
- **Cluster Maps:** Sector dependency visualization.  
- **Scenario Dashboards:** Policy or funding cut simulations.

In [None]:
# ------- [Import all relevant libraries] -------

# General Utilities
import warnings
warnings.filterwarnings('ignore')

import numpy as np                          # Numerical computing
import pandas as pd                         # Data manipulation and analysis
import datetime as dt                       # Date/time operations
import re                                   # String manipulation
from collections import Counter             # Frequency counting

# Visualization
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-whitegrid')
import seaborn as sns

# Feature Engineering & Preprocessing
from sklearn.preprocessing import (
    LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler
)
from sklearn.model_selection import (
    train_test_split, TimeSeriesSplit, GridSearchCV, StratifiedKFold
)
from sklearn.decomposition import PCA, TruncatedSVD

# Machine Learning Models
# ======================

## Supervised Learning (Regression/Prediction)
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor

## Unsupervised Learning (Volatility and Dependency Clustering)
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
# silhouette_score is imported with the other evaluation metrics below

## Time Series & Forecasting
from statsmodels.tsa.arima.model import ARIMA
from prophet import Prophet

# Evaluation Metrics
from sklearn.metrics import (
    mean_absolute_error, mean_squared_error, r2_score,
    silhouette_score, calinski_harabasz_score
)

# # Imbalanced Data Handling
# from imblearn.over_sampling import SMOTE

# Model Interpretation & Explainability
from lime import lime_tabular
import shap

# Pipelines
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as ImbPipeline

# Display Settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [5]:
# Load modeling dataset
model_df = pd.read_csv("../Clean Data/modeling_data_final.csv")
model_df

Unnamed: 0,relative_aid_share,avg_sector_aid,top3_agency_share,current_dollar_amount,aid_std_fiscal,constant_dollar_amount,partner_transaction_count,fiscal_year,partner_count_fiscal,aid_per_transaction_ratio,...,us_sector_name,managing_subagency_or_bureau_name,dominant_sector_per_agency,funded_and_managed_by,country_name,us_category_name,fiscal_year.1,year.1,quarter,transaction_date
0,0.000055,1.038993e+06,0.933793,100000.0,1.441897e+06,97399.0,1,2024,351,0.319844,...,Development,not applicable,Other/Unspecified,African Development Foundation managed by African Development Foundation,Kenya,Economic Development,2024,2024,3,2024-09-02
1,0.000198,1.228714e+05,0.935470,267335.0,1.371819e+06,299116.0,1,2021,297,0.913227,...,Other/Unspecified,International Narcotics and Law Enforcement Affairs,Human Rights,Department of State managed by Department of State,Kenya,Program Support,2021,2021,2,2021-06-30
2,0.000011,9.415889e+04,0.973222,20484.0,2.656578e+06,21421.0,1,2022,313,0.059108,...,Other/Unspecified,International Narcotics and Law Enforcement Affairs,Human Rights,Department of State managed by Department of State,Kenya,Program Support,2022,2022,2,2022-06-30
3,0.000014,7.097307e+04,0.967888,24689.0,2.111852e+06,24689.0,2,2023,347,0.079962,...,Other/Unspecified,International Narcotics and Law Enforcement Affairs,Human Rights,Department of State managed by Department of State,Kenya,Program Support,2023,2022,4,2022-12-31
4,-0.000041,7.097307e+04,0.967888,-72650.0,2.111852e+06,-72650.0,2,2023,347,-0.235298,...,Other/Unspecified,International Narcotics and Law Enforcement Affairs,Human Rights,Department of State managed by Department of State,Kenya,Program Support,2023,2023,1,2023-03-31
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68259,0.000030,3.889628e+05,0.935470,41095.0,1.371819e+06,45981.0,12,2021,297,0.140384,...,Health,Centers for Disease Control and Prevention,Health,Department of State managed by Department of Health and Human Services,Kenya,Health,2021,2021,3,2021-07-31
68260,0.000017,3.889628e+05,0.935470,23631.0,1.371819e+06,26440.0,12,2021,297,0.080724,...,Health,Centers for Disease Control and Prevention,Health,Department of State managed by Department of Health and Human Services,Kenya,Health,2021,2021,3,2021-08-31
68261,0.000019,3.889628e+05,0.935470,25316.0,1.371819e+06,28326.0,12,2021,297,0.086482,...,Health,Centers for Disease Control and Prevention,Health,Department of State managed by Department of Health and Human Services,Kenya,Health,2021,2021,3,2021-09-30
68262,0.000332,8.555616e+04,0.995510,200000.0,1.048696e+06,190615.0,2,2025,176,0.857045,...,Other/Unspecified,"Bureau for Inclusive Growth, Partnerships, and Innovation",Health,U.S. Agency for International Development managed by U.S. Agency for International Development,Kenya,Economic Development,2025,2024,4,2024-11-01


In [6]:
model_df.columns

Index(['relative_aid_share', 'avg_sector_aid', 'top3_agency_share',
       'current_dollar_amount', 'aid_std_fiscal', 'constant_dollar_amount',
       'partner_transaction_count', 'fiscal_year', 'partner_count_fiscal',
       'aid_per_transaction_ratio', 'total_partner_aid', 'total_aid_fiscal',
       'year', 'rolling_std_3yr', 'avg_agency_aid',
       'mean_aid_per_transaction_fiscal', 'rolling_mean_3yr',
       'aid_per_partner', 'total_sector_aid', 'aid_volatility',
       'aid_concentration_index', 'transaction_year', 'sector_to_total_ratio',
       'us_sector_name', 'managing_subagency_or_bureau_name',
       'dominant_sector_per_agency', 'funded_and_managed_by', 'country_name',
       'us_category_name', 'fiscal_year.1', 'year.1', 'quarter',
       'transaction_date'],
      dtype='object')
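
The column list includes `fiscal_year.1` and `year.1`. pandas appends a `.1` suffix when a CSV header repeats a column name, so these are probably leftovers from an earlier merge. A hedged sketch for removing them only when they are exact copies of their base column is shown below; the helper name `drop_duplicated_suffix_columns` and the toy frame are hypothetical, and whether the real columns are identical has not been checked here.

```python
import pandas as pd

def drop_duplicated_suffix_columns(df):
    """Drop 'name.N' columns that are exact copies of an existing 'name' column."""
    to_drop = []
    for col in df.columns:
        base, dot, suffix = col.rpartition(".")
        if dot and suffix.isdigit() and base in df.columns:
            # Rename before comparing so only index and values are checked.
            if df[col].equals(df[base].rename(col)):
                to_drop.append(col)
    return df.drop(columns=to_drop)

# Toy frame: fiscal_year.1 is a true duplicate, year.1 is not.
toy = pd.DataFrame({
    "fiscal_year": [2023, 2025],
    "year": [2023, 2025],
    "fiscal_year.1": [2023, 2025],
    "year.1": [2022, 2024],
})
cleaned = drop_duplicated_suffix_columns(toy)
print(cleaned.columns.tolist())
```

Columns that differ from their base, like `year.1` in the toy frame, are kept so they can be inspected and renamed deliberately.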