# Datathon 2026 - Humanitarian Funding Analysis
## Team Submission: Crisis Funding Prediction & Effectiveness Scoring

This notebook contains our complete pipeline for:
1. **Data Loading** - Pre-merged dataset with 64 features
2. **Effectiveness Scoring** - Evaluating crisis response quality (Outcome-First: 20/20/40/20)
3. **Model Building** - Predicting optimal funding levels
4. **Visualizations** - Presentation-ready charts

---
## Setup & Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

# Set style
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

print("Libraries loaded successfully!")

---
# Part 1: Load Pre-Merged Dataset

This dataset was created by merging:
- **INFORM Severity** (56 monthly files, 2020-2025)
- **CERF Allocations** (UN emergency fund)
- **CBPF Budgets** (Country pooled funds)
- **HRP Requirements** (Humanitarian Response Plans)
- **FTS Funding** (Global requirements vs actual funding)
- **World Bank** (GDP, Population, Inflation)
- **OCHA Demographics** (IDPs, Refugees, Vulnerable populations)
- **HPC Cluster Data** (Humanitarian sectors)

In [None]:
# Load pre-merged dataset with all 64 features
df = pd.read_csv('complete_funding_dataset.csv')

print(f"Dataset: {len(df)} rows, {len(df.columns)} columns")
print(f"Countries: {df['ISO3'].nunique()}")
print(f"Year range: {df['Year'].min():.0f} - {df['Year'].max():.0f}")
print(f"\nColumns:")
print(df.columns.tolist())

In [None]:
# Quick data overview
print("\n=== Funding Summary ===")
print(f"FTS Requirements: ${df['FTS_Requirements'].sum():,.0f}")
print(f"FTS Actual Funding: ${df['FTS_Funding'].sum():,.0f}")
print(f"FTS Funding Gap: ${df['FTS_Funding_Gap'].sum():,.0f}")
print(f"Average % Funded: {df['FTS_Percent_Funded'].mean():.1f}%")

print("\n=== Severity Summary ===")
print(f"Mean INFORM Score: {df['INFORM_Mean'].mean():.2f}")
print(f"High Severity Crises (>=4.0): {(df['INFORM_Mean'] >= 4.0).sum()}")

df.head()

---
# Part 2: Effectiveness Scoring

Our scoring system uses **Outcome-First weights**:
- **Coverage** (20%): % of funding requirements met
- **Efficiency** (20%): Funding per person in need
- **Outcome** (40%): INFORM severity improvement over time
- **Gap** (20%): Funding gap severity (inverted)
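
The weights above amount to a weighted sum of four component scores. The sketch below is illustrative only: it assumes each component has already been normalized to 0-100 with higher meaning better (so the gap component is already inverted), and the function and argument names are hypothetical rather than the calculation used to build the dataset.

```python
# Hypothetical sketch of the Outcome-First weighting (20/20/40/20).
# Assumption: every component is pre-scaled to 0-100, higher = better,
# so the funding-gap component has already been inverted.
WEIGHTS = {"coverage": 0.20, "efficiency": 0.20, "outcome": 0.40, "gap": 0.20}


def effectiveness_score(coverage, efficiency, outcome, gap):
    """Return the weighted sum of four 0-100 component scores."""
    components = {
        "coverage": coverage,
        "efficiency": efficiency,
        "outcome": outcome,
        "gap": gap,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())


# A crisis with full outcome improvement but a maximal funding gap.
example = effectiveness_score(coverage=50, efficiency=50, outcome=100, gap=0)
print(f"Example score: {example:.1f}")
```

Because Outcome carries twice the weight of any other component, a crisis can clear the "good" threshold of 45 mainly through severity improvement even when its coverage is middling.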

In [None]:
# Effectiveness scores are already calculated in the dataset
print("=== Effectiveness Score Distribution ===")
print(df['Effectiveness_Category'].value_counts())

print(f"\nMean Effectiveness Score: {df['Effectiveness_Score'].mean():.1f}")
print(f"Good Crises (score >= 45): {df['Is_Good_Crisis'].sum()} / {len(df)} ({100*df['Is_Good_Crisis'].mean():.1f}%)")

In [None]:
# Top 10 best managed crises
print("=== Top 10 Best Managed Crises ===")
top_10 = df[df['FTS_Funding'] > 0].nlargest(10, 'Effectiveness_Score')
display_cols = ['Country', 'Year', 'INFORM_Mean', 'FTS_Percent_Funded', 'Effectiveness_Score', 'Effectiveness_Category']
top_10[display_cols]

In [None]:
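# Lowest-scoring funded crises among those with INFORM_Mean >= 3.5
# (note: a lower cutoff than the >= 4.0 used in the severity summary above)
print("=== Critical Underfunded Crises (INFORM >= 3.5) ===")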
critical = df[(df['FTS_Funding'] > 0) & (df['INFORM_Mean'] >= 3.5)].nsmallest(10, 'Effectiveness_Score')
critical_cols = ['Country', 'Year', 'INFORM_Mean', 'FTS_Percent_Funded', 'FTS_Funding_Gap', 'Effectiveness_Score']
critical[critical_cols]

---
# Part 3: Model Building

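Train machine learning models to predict funding levels from crisis characteristics. The target is actual FTS funding (log-transformed), so a prediction reflects what crises with similar characteristics have received; we use it as a proxy for the "optimal" level.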

In [None]:
# Define features
numeric_features = [
    # INFORM severity metrics
    'INFORM_Mean', 'INFORM_Std', 'INFORM_Min', 'INFORM_Max',
    'People_In_Need_Avg', 'Complexity_Avg', 'Impact_Avg',
    
    # Economic indicators
    'GDP_Per_Capita', 'Inflation_Rate',
    
    # Population metrics  
    'Population', 'Vulnerable_Pop_Pct', 'IDP_Rate',
    
    # Crisis metrics
    'Number_Clusters', 'Total_In_Need', 'Coverage_Rate',
    
    # Derived features
    'Need_Per_Capita', 'Economic_Stress',
]

categorical_features = ['Crisis_Type', 'UN_Region']

# Target variable
TARGET = 'FTS_Funding'

# Filter to rows with valid target
df_model = df[(df[TARGET].notna()) & (df[TARGET] > 0)].copy()
print(f"Training samples: {len(df_model)}")

In [None]:
# Prepare features
available_numeric = [f for f in numeric_features if f in df_model.columns]
print(f"Available numeric features: {len(available_numeric)}")

# Fill missing values with median
X_numeric = df_model[available_numeric].copy()
for col in available_numeric:
    X_numeric[col] = X_numeric[col].fillna(X_numeric[col].median())

# One-hot encode categorical features
X_categorical = pd.DataFrame()
for cat_col in categorical_features:
    if cat_col in df_model.columns:
        dummies = pd.get_dummies(df_model[cat_col], prefix=cat_col, drop_first=True)
        X_categorical = pd.concat([X_categorical, dummies], axis=1)
        print(f"  {cat_col}: {df_model[cat_col].nunique()} categories")

# Combine features
X = pd.concat([X_numeric.reset_index(drop=True), X_categorical.reset_index(drop=True)], axis=1)
y = df_model[TARGET].reset_index(drop=True)
y_log = np.log1p(y)  # Log transform for better distribution

print(f"\nFeature matrix: {X.shape}")
print(f"Target range: ${y.min():,.0f} to ${y.max():,.0f}")
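
The cell above computes medians and dummy columns on the full modeling set before the train-test split, so test rows influence the imputation values. One way to avoid that is to put the preprocessing inside a scikit-learn `Pipeline`, so it is fit on training rows only. This is a minimal sketch on synthetic data, not the notebook's pipeline; the `toy` frame, its values, and the reduced feature list are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for df_model (values are made up).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "INFORM_Mean": rng.uniform(1, 5, 40),
    "GDP_Per_Capita": rng.uniform(500, 5000, 40),
    "Crisis_Type": rng.choice(["Conflict", "Drought", "Flood"], 40),
})
toy.loc[::7, "GDP_Per_Capita"] = np.nan          # inject missing values
toy_target = np.log1p(rng.uniform(1e6, 1e9, 40))  # log1p of fake funding

preprocess = ColumnTransformer([
    # Median imputation learned from the training rows only
    ("num", SimpleImputer(strategy="median"), ["INFORM_Mean", "GDP_Per_Capita"]),
    # Unseen categories at predict time become all-zero rows instead of errors
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Crisis_Type"]),
])
pipe = Pipeline([
    ("prep", preprocess),
    ("rf", RandomForestRegressor(n_estimators=20, random_state=42)),
])

pipe.fit(toy.iloc[:30], toy_target[:30])   # fit on "training" rows
toy_pred = pipe.predict(toy.iloc[30:])     # transform + predict held-out rows
print(toy_pred.shape)
```

The same fitted pipeline can then be applied to any later frame with the same columns, which also removes the need to realign dummy columns by hand.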

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y_log, test_size=0.2, random_state=42)

print(f"Training set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples")

In [None]:
# Train Random Forest
print("Training Random Forest...")
rf = RandomForestRegressor(
    n_estimators=100, max_depth=10, min_samples_split=5, 
    min_samples_leaf=2, random_state=42, n_jobs=-1
)
rf.fit(X_train, y_train)

# Predictions
y_pred_rf_log = rf.predict(X_test)
y_pred_rf = np.expm1(y_pred_rf_log)
y_test_actual = np.expm1(y_test)

# Metrics: R² is computed on the log1p scale; MAE and RMSE are in USD
rf_r2 = r2_score(y_test, y_pred_rf_log)
rf_mae = mean_absolute_error(y_test_actual, y_pred_rf)
rf_rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred_rf))

print(f"\nRandom Forest Results:")
print(f"  R² Score: {rf_r2:.4f}")
print(f"  MAE: ${rf_mae:,.0f}")
print(f"  RMSE: ${rf_rmse:,.0f}")

# Cross-validation
cv_scores = cross_val_score(rf, X, y_log, cv=5, scoring='r2')
print(f"  CV R² Score: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")
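
Note that the R² above is measured on log-transformed funding, while MAE and RMSE are in dollars, so the numbers are not directly comparable. The sketch below uses synthetic values (not the notebook's data) to show that R² on the log scale and R² after back-transforming with `expm1` can differ noticeably, because the dollar scale is dominated by the largest crises.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic log-scale targets and noisy predictions (illustrative only).
rng = np.random.default_rng(0)
y_true_log = rng.normal(loc=17, scale=1.5, size=200)
y_pred_log = y_true_log + rng.normal(scale=0.5, size=200)

# R² on the scale the model was trained on
r2_log = r2_score(y_true_log, y_pred_log)

# R² after converting both back to dollars
r2_usd = r2_score(np.expm1(y_true_log), np.expm1(y_pred_log))

print(f"R² (log scale): {r2_log:.3f}")
print(f"R² (USD scale): {r2_usd:.3f}")
```

When reporting a single headline R², it helps to state which scale it refers to.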

In [None]:
# Train Gradient Boosting
print("Training Gradient Boosting...")
gb = GradientBoostingRegressor(
    n_estimators=100, max_depth=5, learning_rate=0.1,
    min_samples_split=5, random_state=42
)
gb.fit(X_train, y_train)

y_pred_gb_log = gb.predict(X_test)
y_pred_gb = np.expm1(y_pred_gb_log)

# R² on the log1p scale; MAE and RMSE in USD
gb_r2 = r2_score(y_test, y_pred_gb_log)
gb_mae = mean_absolute_error(y_test_actual, y_pred_gb)
gb_rmse = np.sqrt(mean_squared_error(y_test_actual, y_pred_gb))

print(f"\nGradient Boosting Results:")
print(f"  R² Score: {gb_r2:.4f}")
print(f"  MAE: ${gb_mae:,.0f}")
print(f"  RMSE: ${gb_rmse:,.0f}")

cv_scores_gb = cross_val_score(gb, X, y_log, cv=5, scoring='r2')
print(f"  CV R² Score: {cv_scores_gb.mean():.4f} (+/- {cv_scores_gb.std():.4f})")
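
Because the rows are country-year records, random folds can place the same country in both the training and validation parts, which may inflate the CV score. A possible check is grouped cross-validation with `GroupKFold`. The sketch below uses synthetic data; the `groups` array stands in for something like `df_model['ISO3']`, which is an assumption about how one would apply it here.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic panel: 12 "countries" x 5 "years" (illustrative only).
rng = np.random.default_rng(0)
n_countries, n_years = 12, 5
groups = np.repeat(np.arange(n_countries), n_years)  # stand-in for ISO3 codes
X_toy = rng.normal(size=(n_countries * n_years, 4))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.5, size=len(X_toy))

model = RandomForestRegressor(n_estimators=30, random_state=42)
grouped_cv = GroupKFold(n_splits=4)

# Each fold holds out whole countries, never individual years
grouped_scores = cross_val_score(
    model, X_toy, y_toy, cv=grouped_cv, groups=groups, scoring="r2"
)
print(f"Grouped CV R²: {grouped_scores.mean():.3f} (+/- {grouped_scores.std():.3f})")

# Sanity check: no country appears on both sides of any split
folds_disjoint = all(
    set(groups[train_idx]).isdisjoint(groups[test_idx])
    for train_idx, test_idx in grouped_cv.split(X_toy, y_toy, groups)
)
print(folds_disjoint)
```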

In [None]:
# Feature Importance
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Importance': rf.feature_importances_
}).sort_values('Importance', ascending=False)

print("\n=== Top 15 Features Predicting Funding ===")
print(feature_importance.head(15).to_string(index=False))
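
The importances above are the Random Forest's impurity-based values, which are computed on training data and can favor high-cardinality or correlated features. Permutation importance on held-out data is a common cross-check. The following is a sketch on synthetic data where only the first column carries signal; it is not a result from this dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only column 0 drives the target (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42
)
model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and measure the score drop
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=42)
ranking = np.argsort(result.importances_mean)[::-1]

for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"(+/- {result.importances_std[idx]:.3f})")
```

If the two rankings disagree substantially on the real data, the impurity-based percentages should be read with some caution.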

---
# Part 4: Generate Predictions & Identify Funding Gaps

In [None]:
# Prepare full dataset for prediction
# Impute with the medians of the modeling subset so inputs match what the model saw in training
train_medians = df_model[available_numeric].median()
X_full_numeric = df[available_numeric].fillna(train_medians)

X_full_categorical = pd.DataFrame()
for cat_col in categorical_features:
    if cat_col in df.columns:
        dummies = pd.get_dummies(df[cat_col], prefix=cat_col, drop_first=True)
        X_full_categorical = pd.concat([X_full_categorical, dummies], axis=1)

X_full = pd.concat([X_full_numeric.reset_index(drop=True), X_full_categorical.reset_index(drop=True)], axis=1)

# Ensure columns match training data: add missing dummies as 0, drop extras, fix order
X_full = X_full.reindex(columns=X.columns, fill_value=0)

# Generate predictions
y_pred_full_log = rf.predict(X_full)
y_pred_full = np.expm1(y_pred_full_log)

df['Predicted_Funding'] = y_pred_full
df['Actual_Funding'] = df['FTS_Funding'].fillna(0)
df['Model_Funding_Gap'] = df['Predicted_Funding'] - df['Actual_Funding']

print("Predictions generated for all crises")
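
One caveat: most rows scored above were also used to fit the Random Forest, so their predictions sit closer to actual funding than they would for unseen crises, which shrinks the apparent gap. Out-of-fold predictions from `cross_val_predict` are one way to get a prediction for every row from a model that did not see it. This is a sketch on synthetic data, not a change to the notebook's results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic log-scale target (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy_log = 15 + X_toy[:, 0] + rng.normal(scale=0.3, size=100)

model = RandomForestRegressor(n_estimators=30, random_state=42)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Each row is predicted by a model trained on the other four folds
oof_log = cross_val_predict(model, X_toy, y_toy_log, cv=cv)
oof_usd = np.expm1(oof_log)  # back to dollars, as in the notebook

# Compare with in-sample predictions from a model fit on all rows
in_sample_log = model.fit(X_toy, y_toy_log).predict(X_toy)
in_sample_err = np.abs(in_sample_log - y_toy_log).mean()
oof_err = np.abs(oof_log - y_toy_log).mean()

print(f"In-sample MAE (log): {in_sample_err:.3f}")
print(f"Out-of-fold MAE (log): {oof_err:.3f}")
```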

In [None]:
# Categorize funding status based on model predictions
def categorize_funding(row):
    if row['Actual_Funding'] == 0:
        return 'No Funding Data'
    gap_pct = (row['Model_Funding_Gap'] / row['Predicted_Funding']) * 100 if row['Predicted_Funding'] > 0 else 0
    if gap_pct > 50: return 'Severely Underfunded'
    elif gap_pct > 20: return 'Underfunded'
    elif gap_pct > -20: return 'Adequately Funded'
    else: return 'Well Funded'

df['Funding_Status'] = df.apply(categorize_funding, axis=1)

print("\n=== Funding Status Distribution ===")
print(df['Funding_Status'].value_counts())
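
The row-wise thresholds above can also be written without `apply`, using `pd.cut` with right-closed bins. The sketch below runs on a small made-up frame and is meant to reproduce the same category boundaries (50, 20, -20); the `toy` data is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each category (illustrative only).
toy = pd.DataFrame({
    "Actual_Funding":    [0.0, 40.0, 70.0, 100.0, 130.0],
    "Predicted_Funding": [100.0, 100.0, 100.0, 100.0, 100.0],
})

pred = toy["Predicted_Funding"]
gap = pred - toy["Actual_Funding"]

# Gap as a % of predicted funding; 0 where the prediction is not positive
gap_pct = (100 * gap / pred).where(pred > 0, 0.0)

# Right-closed bins: (-inf, -20], (-20, 20], (20, 50], (50, inf]
status = pd.cut(
    gap_pct,
    bins=[-np.inf, -20, 20, 50, np.inf],
    labels=["Well Funded", "Adequately Funded", "Underfunded", "Severely Underfunded"],
).astype(object)

# Rows without recorded funding are labeled separately, as in the function above
toy["Funding_Status"] = status.where(toy["Actual_Funding"] != 0, "No Funding Data")
print(toy)
```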

In [None]:
# Largest model-estimated gaps among funded crises with INFORM_Mean >= 3.0
print("\n=== Top 10 Underfunded High-Severity Crises (INFORM >= 3.0) ===")
underfunded = df[
    (df['Actual_Funding'] > 0) & 
    (df['INFORM_Mean'] >= 3.0) &
    (df['Model_Funding_Gap'] > 0)
].nlargest(10, 'Model_Funding_Gap')

display_cols = ['Country', 'Year', 'INFORM_Mean', 'Actual_Funding', 'Predicted_Funding', 'Model_Funding_Gap', 'Funding_Status']
underfunded[display_cols]

---
# Part 5: Visualizations

In [None]:
# 1. Model Performance Comparison
fig, axes = plt.subplots(1, 3, figsize=(14, 5))

models = ['Random Forest', 'Gradient Boosting']
r2_scores = [rf_r2, gb_r2]
mae_scores = [rf_mae/1e6, gb_mae/1e6]
rmse_scores = [rf_rmse/1e6, gb_rmse/1e6]

colors = ['#2ecc71', '#3498db']

bars1 = axes[0].bar(models, r2_scores, color=colors, edgecolor='white', linewidth=2)
axes[0].set_ylabel('R² Score', fontsize=12)
axes[0].set_title('Model Accuracy (R²)', fontsize=14, fontweight='bold')
axes[0].set_ylim(0, 1)
for bar, val in zip(bars1, r2_scores):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
                 f'{val:.3f}', ha='center', fontsize=11, fontweight='bold')

bars2 = axes[1].bar(models, mae_scores, color=colors, edgecolor='white', linewidth=2)
axes[1].set_ylabel('MAE (Millions USD)', fontsize=12)
axes[1].set_title('Mean Absolute Error', fontsize=14, fontweight='bold')
for bar, val in zip(bars2, mae_scores):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 2, 
                 f'${val:.0f}M', ha='center', fontsize=11, fontweight='bold')

bars3 = axes[2].bar(models, rmse_scores, color=colors, edgecolor='white', linewidth=2)
axes[2].set_ylabel('RMSE (Millions USD)', fontsize=12)
axes[2].set_title('Root Mean Squared Error', fontsize=14, fontweight='bold')
for bar, val in zip(bars3, rmse_scores):
    axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 5, 
                 f'${val:.0f}M', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# 2. Feature Importance
fig, ax = plt.subplots(figsize=(10, 8))

top_features = feature_importance.head(12)
colors = plt.cm.viridis(np.linspace(0.2, 0.8, len(top_features)))

bars = ax.barh(range(len(top_features)), top_features['Importance'], color=colors)
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['Feature'], fontsize=11)
ax.invert_yaxis()
ax.set_xlabel('Importance Score', fontsize=12)
ax.set_title('Top 12 Features Predicting Funding Needs', fontsize=14, fontweight='bold')

for bar, val in zip(bars, top_features['Importance']):
    ax.text(bar.get_width() + 0.005, bar.get_y() + bar.get_height()/2, 
            f'{val:.1%}', ha='left', va='center', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# 3. Funding Status Distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Pie chart
status_counts = df['Funding_Status'].value_counts()
colors_status = {'Well Funded': '#27ae60', 'Adequately Funded': '#3498db', 
                 'Underfunded': '#f39c12', 'Severely Underfunded': '#e74c3c',
                 'No Funding Data': '#95a5a6'}
pie_colors = [colors_status.get(s, '#95a5a6') for s in status_counts.index]

wedges, texts, autotexts = axes[0].pie(status_counts.values, labels=status_counts.index, 
                                        autopct='%1.1f%%', colors=pie_colors, startangle=90,
                                        explode=[0.05 if 'Under' in s else 0 for s in status_counts.index])
axes[0].set_title('Crisis Funding Status Distribution', fontsize=14, fontweight='bold')

# Effectiveness score histogram
axes[1].hist(df['Effectiveness_Score'].dropna(), bins=25, color='#3498db', edgecolor='white', alpha=0.8)
axes[1].axvline(x=45, color='#e74c3c', linestyle='--', linewidth=2, label='Good Crisis Threshold (45)')
axes[1].axvline(x=df['Effectiveness_Score'].mean(), color='#27ae60', linestyle='-', linewidth=2, 
                label=f'Mean ({df["Effectiveness_Score"].mean():.1f})')
axes[1].set_xlabel('Effectiveness Score', fontsize=12)
axes[1].set_ylabel('Frequency', fontsize=12)
axes[1].set_title('Effectiveness Score Distribution', fontsize=14, fontweight='bold')
axes[1].legend()

plt.tight_layout()
plt.show()

In [None]:
# 4. Predictions vs Actual
fig, ax = plt.subplots(figsize=(10, 8))

plot_data = df[df['Actual_Funding'] > 0].copy()

status_colors = {'Well Funded': '#27ae60', 'Adequately Funded': '#3498db', 
                 'Underfunded': '#f39c12', 'Severely Underfunded': '#e74c3c'}

for status, color in status_colors.items():
    mask = plot_data['Funding_Status'] == status
    ax.scatter(plot_data.loc[mask, 'Actual_Funding']/1e9, 
               plot_data.loc[mask, 'Predicted_Funding']/1e9,
               c=color, label=status, alpha=0.7, s=60, edgecolors='white', linewidth=0.5)

max_val = max(plot_data['Actual_Funding'].max(), plot_data['Predicted_Funding'].max()) / 1e9
ax.plot([0, max_val], [0, max_val], 'k--', alpha=0.5, label='Perfect Prediction', linewidth=2)

ax.set_xlabel('Actual Funding (Billions USD)', fontsize=12)
ax.set_ylabel('Predicted Funding (Billions USD)', fontsize=12)
ax.set_title('Model Predictions vs Actual Funding', fontsize=14, fontweight='bold')
ax.legend(loc='upper left')

plt.tight_layout()
plt.show()

In [None]:
# 5. Funding Trends by Year
fig, ax = plt.subplots(figsize=(12, 6))

yearly = df.groupby('Year').agg({
    'FTS_Funding': 'sum',
    'FTS_Requirements': 'sum',
    'FTS_Funding_Gap': 'sum'
}).dropna()

x = yearly.index.astype(int)
width = 0.35

ax.bar(x - width/2, yearly['FTS_Requirements']/1e9, width, label='Requirements', color='#3498db', edgecolor='white')
ax.bar(x + width/2, yearly['FTS_Funding']/1e9, width, label='Actual Funding', color='#27ae60', edgecolor='white')

ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Amount (Billions USD)', fontsize=12)
ax.set_title('Humanitarian Funding: Requirements vs Reality', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.legend()

# Add % funded annotation
for i, (req, fund) in enumerate(zip(yearly['FTS_Requirements']/1e9, yearly['FTS_Funding']/1e9)):
    pct = (fund/req)*100 if req > 0 else 0
    ax.annotate(f'{pct:.0f}%', xy=(x[i], fund), xytext=(x[i], fund + 2),
                ha='center', fontsize=9, color='#27ae60', fontweight='bold')

plt.tight_layout()
plt.show()

In [None]:
# 6. Top Underfunded Crises
fig, ax = plt.subplots(figsize=(12, 8))

underfunded_viz = df[
    (df['Actual_Funding'] > 0) & 
    (df['INFORM_Mean'] >= 3.0) &
    (df['Model_Funding_Gap'] > 0)
].nlargest(15, 'Model_Funding_Gap').copy()

underfunded_viz['Label'] = underfunded_viz['Country'] + ' (' + underfunded_viz['Year'].astype(int).astype(str) + ')'
underfunded_viz = underfunded_viz.sort_values('Model_Funding_Gap')

colors = plt.cm.Reds(np.linspace(0.3, 0.9, len(underfunded_viz)))
bars = ax.barh(range(len(underfunded_viz)), underfunded_viz['Model_Funding_Gap']/1e6, color=colors)
ax.set_yticks(range(len(underfunded_viz)))
ax.set_yticklabels(underfunded_viz['Label'], fontsize=11)
ax.set_xlabel('Funding Gap (Millions USD)', fontsize=12)
ax.set_title('Top 15 Underfunded High-Severity Crises', fontsize=14, fontweight='bold')

for bar, val in zip(bars, underfunded_viz['Model_Funding_Gap']/1e6):
    ax.text(bar.get_width() + 5, bar.get_y() + bar.get_height()/2, 
            f'${val:.0f}M', ha='left', va='center', fontsize=10)

plt.tight_layout()
plt.show()

---
# Summary & Key Findings

## Model Performance
- **Best Model**: Gradient Boosting with R² ≈ 0.74 (test set, log-transformed funding)
- **Key Predictors** (Random Forest feature importance): INFORM severity (28%), People in Need (17%), INFORM Max (15%)

## Effectiveness Scoring (Outcome-First: 20/20/40/20)
- **Coverage**: 20% weight - % of requirements funded
- **Efficiency**: 20% weight - $ per person in need
- **Outcome**: 40% weight - INFORM severity improvement
- **Gap**: 20% weight - Funding gap severity (inverted)

## Key Insights
1. **$96 billion funding gap** over 2020-2025
2. **71% average funding coverage** - on average, a crisis receives about 71% of the funding it requests
3. **Top underfunded**: Afghanistan, Yemen, Mali, DRC, Haiti
4. **Model identifies funding gaps** where actual funding falls below the predicted level, which we treat as a proxy for "optimal"
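
A note on reading the coverage figure: the average of per-crisis percentages is not the same as the share of total requested dollars that was funded, because large appeals count the same as small ones in a simple mean. The two-row example below uses made-up numbers to show how far the two can diverge.

```python
import pandas as pd

# Two made-up appeals: one small and well funded, one large and poorly funded.
toy = pd.DataFrame({
    "FTS_Requirements": [100.0, 1000.0],
    "FTS_Funding":      [90.0, 300.0],
})
toy["FTS_Percent_Funded"] = 100 * toy["FTS_Funding"] / toy["FTS_Requirements"]

# Average of per-crisis percentages (what .mean() on the column reports)
mean_of_pcts = toy["FTS_Percent_Funded"].mean()

# Share of all requested dollars that was actually funded
aggregate_pct = 100 * toy["FTS_Funding"].sum() / toy["FTS_Requirements"].sum()

print(f"Mean of per-crisis %: {mean_of_pcts:.1f}%")
print(f"Aggregate % funded:   {aggregate_pct:.1f}%")
```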

## Recommendations
1. Prioritize severely underfunded high-severity crises
2. Use model predictions to guide resource allocation
3. Focus on outcome improvement, not just funding coverage

In [None]:
# Final Summary
print("=" * 60)
print("FINAL SUMMARY")
print("=" * 60)
print(f"\nDataset: {len(df)} country-year records, {df['ISO3'].nunique()} countries")
print(f"Year range: {df['Year'].min():.0f} - {df['Year'].max():.0f}")

print(f"\n--- Funding ---")
print(f"Total Requirements: ${df['FTS_Requirements'].sum():,.0f}")
print(f"Total Funding: ${df['FTS_Funding'].sum():,.0f}")
print(f"Total Gap: ${df['FTS_Funding_Gap'].sum():,.0f}")
print(f"Average % Funded: {df['FTS_Percent_Funded'].mean():.1f}%")

print(f"\n--- Model Performance ---")
print(f"Random Forest R²: {rf_r2:.4f}")
print(f"Gradient Boosting R²: {gb_r2:.4f}")
print(f"Best Model: {'Gradient Boosting' if gb_r2 > rf_r2 else 'Random Forest'}")

print(f"\n--- Effectiveness Scoring ---")
print(f"Good Crises (score >= 45): {df['Is_Good_Crisis'].sum()} ({100*df['Is_Good_Crisis'].mean():.1f}%)")
print(f"Mean Effectiveness Score: {df['Effectiveness_Score'].mean():.1f}")

print(f"\n--- Funding Status ---")
for status in ['Severely Underfunded', 'Underfunded', 'Adequately Funded', 'Well Funded']:
    count = (df['Funding_Status'] == status).sum()
    print(f"  {status}: {count}")

In [None]:
# Save predictions
output_cols = ['ISO3', 'Year', 'Country', 'INFORM_Mean', 'Crisis_Type', 'UN_Region',
               'FTS_Funding', 'Predicted_Funding', 'Model_Funding_Gap', 'Funding_Status',
               'Effectiveness_Score', 'Effectiveness_Category', 'Is_Good_Crisis']
df[output_cols].to_csv('final_predictions.csv', index=False)
print("Saved: final_predictions.csv")