<a href="https://www.kaggle.com/code/adamdandi/wids-global-datathon-2026?scriptVersionId=298552926" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

### **Introduction: The Race Against Time**

Managing wildfires is a high-stakes race against the clock. When a new fire ignites, emergency managers must answer urgent questions immediately with very limited information. They need to decide which communities to warn, which roads to close, and where to send scarce resources like planes and crews. This project focuses on building a data-driven tool to help these commanders make life-saving decisions during the chaotic early stages of a fire.

### **The Context: The First Five Hours**

The scenario focuses strictly on the earliest phase of disaster response, the wildfire counterpart of the "golden hour." The analysis uses data collected only from the **first five hours** after a fire is detected.

During this short window, sensors record three main types of behavior:

* **Growth:** How quickly the fire's size is increasing.
* **Movement:** The speed and direction the fire travels across the land.
* **Position:** How close the fire is to safety zones and whether it is accelerating toward them.

The objective is to take these early signals and forecast the future. The model must predict if and when the fire will cross a 5-kilometer safety line near a populated area.

### **The Problem: Scarcity and Uncertainty**

Predicting these outcomes is difficult because the data is both small and "censored."

* **Data Scarcity:** Real-world disaster data is rare. This dataset contains only 316 historical fire events, of which 221 are available for training and 95 are held out as the test set. That is very little data for machine learning, which makes it easy to accidentally "memorize" the past instead of learning the rules of fire behavior.
* **Censored Data (The "Non-Events"):** In many historical cases, the fire never reached the town within the 3-day (72-hour) window. It might have burned out or moved away. In statistics, such observations are called "censored." The model cannot simply treat them as errors or ignore them; it must learn from the fires that *didn't* hit just as much as from the ones that did.
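To make the censoring idea concrete, here is a minimal sketch of how a yes/no label for a single horizon can be built from an event flag and a time value. The function name `horizon_labels` and the toy values are illustrative assumptions, not part of the competition data; the rule itself (drop fires whose observation ended before the horizon, keep everything else) mirrors the filtering used in the code cells of this notebook.

```python
import numpy as np

def horizon_labels(event, time_hours, horizon):
    """Return (keep_mask, labels) for one horizon.

    event:      1 if the fire reached the safety line, 0 if observation ended first.
    time_hours: time of the hit, or the time at which observation stopped.
    """
    event = np.asarray(event)
    time_hours = np.asarray(time_hours, dtype=float)

    # A censored fire last seen before the horizon has an unknown outcome: drop it.
    unknown = (event == 0) & (time_hours < horizon)
    keep = ~unknown

    # Among the kept rows, the label is 1 only for hits at or before the horizon.
    labels = ((event == 1) & (time_hours <= horizon))[keep].astype(int)
    return keep, labels

# Toy data (hypothetical): two hits and two censored fires.
event = [1, 1, 0, 0]
time_hours = [6.0, 30.0, 20.0, 60.0]

keep, labels = horizon_labels(event, time_hours, horizon=24)
print(keep.tolist())    # [True, True, False, True]
print(labels.tolist())  # [1, 0, 0]
```

The fire censored at 20 hours is excluded for the 24-hour horizon because nobody knows what it did between hour 20 and hour 24, while the fire censored at 60 hours is a confirmed "no hit by 24h."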

A simple "Yes" or "No" prediction is not useful here. A "Yes" is meaningless if the fire arrives in 12 hours but the model predicts it will take 72. Responders need to know *when* the danger peaks.

### **The Expectation: A Timeline of Risk**

The goal is to generate a calibrated probability forecast, similar to a weather report, rather than a single guess.

For every fire event, the output must provide four specific risk probabilities:

1. Chance of threat within **12 hours**.
2. Chance of threat within **24 hours**.
3. Chance of threat within **48 hours**.
4. Chance of threat within **72 hours**.

**Success is measured by two standards:**

* **Ranking (Triage):** Can the model correctly list fires in order of urgency? The fire ranked as "most dangerous" should be the one that actually hits first.
* **Calibration (Trust):** Are the percentages realistic? If the model predicts an 80% chance of danger, the event should actually happen 80% of the time. This reliability allows commanders to trust the numbers when lives are at risk.
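The two standards can be illustrated with a small sketch. It assumes ranking is summarized by ROC AUC (the metric used later in this notebook) and calibration by the Brier score; the Brier score is an assumption chosen for illustration, not necessarily the official competition metric. The helper names and toy values are hypothetical.

```python
import numpy as np

def pairwise_auc(y_true, y_prob):
    """Fraction of (hit, non-hit) pairs where the hit gets the higher score."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    pos = y_prob[y_true == 1]
    neg = y_prob[y_true == 0]
    # Compare every hit with every non-hit; ties count as half a win.
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def brier_score(y_true, y_prob):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    return float(np.mean((y_prob - y_true) ** 2))

# Toy predictions (hypothetical): three hits, two non-hits.
y_true = [1, 1, 0, 0, 1]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.8]

auc = pairwise_auc(y_true, y_prob)
brier = brier_score(y_true, y_prob)
print(round(auc, 4))    # 0.8333
print(round(brier, 4))  # 0.172
```

A model can rank well and still be poorly calibrated (for example, if every probability is shifted toward 1.0), which is why both views are checked.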

### Step 1: Data Loading and Feature Engineering

Objective:

1. Load the raw data.
2. Apply the "Physics-Based" engineering we proved works earlier.
3. Prepare the feature matrix X and target y for the tournament.

The Physics Logic:

- **Time to Contact:**  
  Distance / Speed.  
  (How long until it hits?)

- **Growth Intensity:**  
  Growth / Initial Size.  
  (Is it exploding or stable?)

- **Momentum:**  
  Speed * Acceleration.  
  (Is it speeding up towards town?)
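As a quick worked example of the three formulas, the sketch below evaluates them on two made-up fires. The function name, the units, and the sign convention (positive closing speed means the fire is approaching) are assumptions for illustration; the small `eps` plays the same role as the `0.001` used in the code cell that follows.

```python
def physics_features(distance_m, closing_speed_m_per_h, accel_m_per_h2,
                     area_growth_ha, area_first_ha, eps=1e-3):
    """Compute the three derived features for a single fire."""
    return {
        # Distance / Speed: rough hours until contact.
        "est_time_to_contact": distance_m / (closing_speed_m_per_h + eps),
        # Growth / Initial Size: relative growth during the observation window.
        "growth_intensity": area_growth_ha / (area_first_ha + eps),
        # Speed * Acceleration.
        "threat_momentum": closing_speed_m_per_h * accel_m_per_h2,
    }

# Hypothetical fire approaching at 500 m/h from 6 km away.
approaching = physics_features(6000.0, 500.0, 20.0, 30.0, 10.0)

# Hypothetical fire moving away (negative speed and negative acceleration).
receding = physics_features(6000.0, -200.0, -5.0, 30.0, 10.0)

print(round(approaching["est_time_to_contact"], 2))  # 12.0
print(round(receding["est_time_to_contact"], 2))     # -30.0
print(receding["threat_momentum"])                   # 1000.0
```

Two things are worth noticing: a receding fire gets a negative "time to contact," and the momentum product is positive whenever speed and acceleration share a sign, including when both are negative. Tree-based models can still split on these values, but the features should not be read as literal physical quantities in every case.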

In [1]:
import pandas as pd
import numpy as np
import warnings
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# Import the Gladiators (Models)
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

# Try importing External Boosters (XGBoost & LightGBM)
try:
    from xgboost import XGBClassifier
    XGB_AVAILABLE = True
    print("✅ XGBoost Library Found.")
except ImportError:
    XGB_AVAILABLE = False
    print("⚠️ XGBoost not found (Skipping).")

try:
    from lightgbm import LGBMClassifier
    LGBM_AVAILABLE = True
    print("✅ LightGBM Library Found.")
except ImportError:
    LGBM_AVAILABLE = False
    print("⚠️ LightGBM not found (Skipping).")

warnings.filterwarnings('ignore')

# ==========================================
# 1. LOAD DATA & ENGINEERING
# ==========================================
print("\n" + "="*50)
print(" 1. LOADING & PREPARING DATA")
print("="*50)

# Load Data
train_df = pd.read_csv('/kaggle/input/WiDSWorldWide_GlobalDathon26/train.csv')
test_df = pd.read_csv('/kaggle/input/WiDSWorldWide_GlobalDathon26/test.csv')

# Feature Engineering Function (Physics Based)
def engineering_pipeline(df):
    df_eng = df.copy()
    # 1. Time to Contact (Distance / Speed)
    # Adding 0.001 avoids division by zero when the closing speed is exactly 0.
    # Negative closing speeds still produce negative "times".
    df_eng['est_time_to_contact'] = df_eng['dist_min_ci_0_5h'] / (df_eng['closing_speed_m_per_h'] + 0.001)
    
    # 2. Growth Intensity (Growth / Size)
    df_eng['growth_intensity'] = df_eng['area_growth_abs_0_5h'] / (df_eng['area_first_ha'] + 0.001)
    
    # 3. Momentum (Speed * Acceleration)
    df_eng['threat_momentum'] = df_eng['closing_speed_m_per_h'] * df_eng['dist_accel_m_per_h2']
    
    # Clean Infinite values (caused by 0 speed)
    df_eng.replace([np.inf, -np.inf], 0, inplace=True)
    df_eng.fillna(0, inplace=True)
    return df_eng

# Apply Engineering
print("Applying Feature Engineering...")
X = engineering_pipeline(train_df.drop(columns=['event_id', 'event', 'time_to_hit_hours']))
y_event = train_df['event']
X_test_final = engineering_pipeline(test_df.drop(columns=['event_id']))

print("Data Loaded Successfully.")
print(f"Training Shape: {X.shape}")
print(f"Test Shape:     {X_test_final.shape}")

✅ XGBoost Library Found.
✅ LightGBM Library Found.

 1. LOADING & PREPARING DATA
Applying Feature Engineering...
Data Loaded Successfully.
Training Shape: (221, 37)
Test Shape:     (95, 37)


### Step 2: The Super Tournament (6 Models)

Objective:

Now we run the Battle Royale. We will test 6 different models on the exact same data splits to see which one is the true champion for each time horizon (12h, 24h, 48h, 72h).

The Contenders:

1. **RF:** Random Forest (Baseline).
2. **ET:** Extra Trees (Good for small, noisy data).
3. **GB:** Gradient Boosting (Standard).
4. **LR:** Logistic Regression (Linear baseline).
5. **XGB:** XGBoost (High performance).
6. **LGBM:** LightGBM (Fast, leaf-wise growth).

In [2]:
# ==========================================
# 2. THE SUPER TOURNAMENT (6 MODELS)
# ==========================================
print("\n" + "="*50)
print(" 🥊 ROUND 2: SUPER BATTLE ROYALE (6 MODELS)")
print("="*50)

# Setup 5-Fold Cross Validation
kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
horizons = [12, 24, 48, 72]

# Models to Test
# Note: We use pipelines for LR to handle scaling automatically
model_roster = ['RF', 'ET', 'GB', 'LR']
if XGB_AVAILABLE: model_roster.append('XGB')
if LGBM_AVAILABLE: model_roster.append('LGBM')

scores = {name: {h: [] for h in horizons} for name in model_roster}

# Print Header
header = f"{'Horizon':<8} | " + " | ".join([f"{name:<6}" for name in model_roster]) + " | Winner"
print(header)
print("-" * len(header))

for fold, (train_idx, val_idx) in enumerate(kf.split(X, y_event)):
    # Prepare Fold Data
    X_tr_raw, X_val_raw = X.iloc[train_idx], X.iloc[val_idx]
    
    for h in horizons:
        # --- PREPARE TARGETS ---
        # Train Target
        tr_mask = ~((train_df.iloc[train_idx]['event'] == 0) & (train_df.iloc[train_idx]['time_to_hit_hours'] < h))
        X_tr_h = X_tr_raw[tr_mask]
        y_tr_h = (train_df.iloc[train_idx][tr_mask]['event'] == 1) & (train_df.iloc[train_idx][tr_mask]['time_to_hit_hours'] <= h)
        
        # Validation Target
        val_mask = ~((train_df.iloc[val_idx]['event'] == 0) & (train_df.iloc[val_idx]['time_to_hit_hours'] < h))
        X_val_h = X_val_raw[val_mask]
        y_val_h = (train_df.iloc[val_idx][val_mask]['event'] == 1) & (train_df.iloc[val_idx][val_mask]['time_to_hit_hours'] <= h)
        
        # Skip invalid folds (Single Class)
        if len(np.unique(y_tr_h)) < 2 or len(np.unique(y_val_h)) < 2:
            continue

        # --- DEFINE & TRAIN MODELS ---
        models = {}
        
        # 1. Random Forest
        models['RF'] = RandomForestClassifier(n_estimators=100, max_depth=5, class_weight='balanced', random_state=42)
        
        # 2. Extra Trees (Great for small data)
        models['ET'] = ExtraTreesClassifier(n_estimators=100, max_depth=5, class_weight='balanced', random_state=42)
        
        # 3. Gradient Boosting (Standard)
        models['GB'] = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=42)
        
        # 4. Logistic Regression (Baseline - Needs Scaling)
        models['LR'] = make_pipeline(StandardScaler(), LogisticRegression(class_weight='balanced', solver='liblinear'))

        # 5. XGBoost (If available)
        if XGB_AVAILABLE:
            ratio = float(np.sum(y_tr_h == 0)) / np.sum(y_tr_h == 1)
            models['XGB'] = XGBClassifier(n_estimators=100, max_depth=3, scale_pos_weight=ratio, 
                                          eval_metric='logloss', use_label_encoder=False, random_state=42)
        
        # 6. LightGBM (If available)
        if LGBM_AVAILABLE:
            # LightGBM handles unbalanced data via scale_pos_weight
            ratio = float(np.sum(y_tr_h == 0)) / np.sum(y_tr_h == 1)
            models['LGBM'] = LGBMClassifier(n_estimators=100, max_depth=3, scale_pos_weight=ratio, 
                                            random_state=42, verbose=-1)

        # --- BATTLE LOOP ---
        for name in model_roster:
            try:
                model = models[name]
                model.fit(X_tr_h, y_tr_h)
                
                # Handle pipeline vs standard model prediction
                if hasattr(model, "predict_proba"):
                    preds = model.predict_proba(X_val_h)[:, 1]
                else:
                    preds = model.predict(X_val_h) # Fallback
                
                score = roc_auc_score(y_val_h, preds)
                scores[name][h].append(score)
            except Exception as e:
                # print(f"Error {name}: {e}") # Debug only
                scores[name][h].append(0)

# --- FINAL SCOREBOARD ---
print("\n" + "="*50)
print(" 🏆 CHAMPIONSHIP RESULTS (Average AUC)")
print("="*50)

# Print Header Again
print(header)
print("-" * len(header))

final_winners = {}

for h in horizons:
    means = {}
    for name in model_roster:
        # Calculate mean score, ignore 0s (failed folds)
        valid_scores = [s for s in scores[name][h] if s > 0]
        means[name] = np.mean(valid_scores) if valid_scores else 0.0
            
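    # Find the max score
    # Note: if every mean is 0.0 (no scorable folds), max() returns the first
    # model in the roster, so the reported "winner" is arbitrary for that horizon.
    best_model = max(means, key=means.get)
    final_winners[h] = best_model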
    
    # Formatted Print
    row_str = f"{h:<8} | "
    for name in model_roster:
        row_str += f"{means[name]:.4f}   | "
    row_str += f"{best_model} 🥇"
    
    print(row_str)

print("-" * len(header))
print("NOTE: 0.0000 means the horizon was too easy (100% hits) or had no data to score.")


 🥊 ROUND 2: SUPER BATTLE ROYALE (6 MODELS)
Horizon  | RF     | ET     | GB     | LR     | XGB    | LGBM   | Winner
-----------------------------------------------------------------------

 🏆 CHAMPIONSHIP RESULTS (Average AUC)
Horizon  | RF     | ET     | GB     | LR     | XGB    | LGBM   | Winner
-----------------------------------------------------------------------
12       | 0.9607   | 0.9108   | 0.9665   | 0.8950   | 0.9647   | 0.9623   | GB 🥇
24       | 0.9819   | 0.9353   | 0.9853   | 0.9115   | 0.9778   | 0.9718   | GB 🥇
48       | 0.9975   | 0.9453   | 0.9936   | 0.9206   | 0.9879   | 0.9911   | RF 🥇
72       | 0.0000   | 0.0000   | 0.0000   | 0.0000   | 0.0000   | 0.0000   | RF 🥇
-----------------------------------------------------------------------
NOTE: 0.0000 means the horizon was too easy (100% hits) or had no data to score.


### Analysis of Tournament Results

The results suggest a pattern, although the top two models are separated by less than 0.005 AUC at every scored horizon, so the ranking should be read with caution:

- **12h and 24h (The Sprint):**  
  Gradient Boosting (GB) wins both horizons, narrowly ahead of the runner-up (XGBoost at 12h, Random Forest at 24h). One possible explanation is that for early, fast-moving fires the outcome depends on sharp thresholds in features such as speed and acceleration, which boosting is good at isolating.

- **48h and 72h (The Marathon):**  
  Random Forest (RF) leads at 48h (0.9975 vs. 0.9936 for GB), possibly because averaging many trees is more robust to noise at longer horizons. At 72h no model could be scored: every entry is 0.0000, and RF is listed as the winner only because it comes first in the roster.
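The 72h row of the scoreboard deserves a closer look, because every model scored 0.0000 and RF is still shown as the winner. The sketch below reproduces that behavior with plain Python: when all values tie, `max()` returns the first key it encounters. The `pick_winner` helper is a hypothetical alternative, not part of the notebook, that reports `None` when no model has a valid score.

```python
# All models tie at 0.0, as in the 72h row of the scoreboard.
means_72h = {"RF": 0.0, "ET": 0.0, "GB": 0.0, "LR": 0.0, "XGB": 0.0, "LGBM": 0.0}

# max() keeps the first maximal item, and dicts preserve insertion order.
default_winner = max(means_72h, key=means_72h.get)
print(default_winner)  # RF

def pick_winner(means):
    """Return the best-scoring model, or None if nothing was scored."""
    scored = {name: m for name, m in means.items() if m > 0}
    return max(scored, key=scored.get) if scored else None

print(pick_winner(means_72h))                       # None
print(pick_winner({"RF": 0.9975, "GB": 0.9936}))    # RF
```

In other words, the 48h result is a measured win for RF, while the 72h "win" only reflects the order in which models were listed.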

### The Strategy: The Hybrid Approach

We will not stick to just one model. We will build a Hybrid Ensemble:

- For **12h and 24h**, we use the Gradient Boosting expert.
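- For **48h**, we use the Random Forest expert. **72h** is also assigned to Random Forest, but only as a default, since no model could be scored at that horizon.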

This gives us the best of both worlds.

### Step 3: Training the Hybrid Champions

Objective:

We will loop through each time horizon. Based on the tournament results, we will initialize the specific winning model (GB or RF), train it on the full dataset, and generate predictions for the test set.

In [3]:
# ==========================================
# 3. TRAINING FINAL HYBRID MODELS
# ==========================================
print("\n" + "="*50)
print(" 🚀 STEP 3: TRAINING HYBRID CHAMPIONS")
print("="*50)

# 1. Define the Champions for each Horizon
# Based on the Tournament Results:
champion_map = {
    12: 'GB',  # Gradient Boosting won 12h
    24: 'GB',  # Gradient Boosting won 24h
    48: 'RF',  # Random Forest won 48h
    72: 'RF'   # Random Forest won 72h (Default)
}

predictions = {'event_id': test_df['event_id']}

for h in [12, 24, 48, 72]:
    winner_name = champion_map[h]
    print(f"Processing {h}h Horizon (Champion: {winner_name})...", end=" ")
    
    # 2. Prepare Valid Training Data for this Horizon
    # Exclude fires censored BEFORE this horizon (Ambiguous data)
    valid_rows = ~((train_df['event'] == 0) & (train_df['time_to_hit_hours'] < h))
    X_train_h = X[valid_rows]
    y_train_h = (train_df.loc[valid_rows, 'event'] == 1) & (train_df.loc[valid_rows, 'time_to_hit_hours'] <= h)
    
    # 3. Safety Check: Single Class Logic
    # If all remaining fires are Hits (or all Misses), we can't train a model.
    if y_train_h.nunique() <= 1:
        default_prob = 1.0 if y_train_h.iloc[0] else 0.0
        predictions[f'prob_{h}h'] = np.full(len(X_test_final), default_prob)
        print(f"Single Class Found (Defaulting to {default_prob}).")
        continue

    # 4. Configure the Specific Champion Model
    # We raise n_estimators to 200 for the final fit. The tournament used 100,
    # so this setting was not itself cross-validated.
    if winner_name == 'GB':
        model = GradientBoostingClassifier(n_estimators=200, max_depth=3, random_state=42)
    elif winner_name == 'RF':
        model = RandomForestClassifier(n_estimators=200, max_depth=5, class_weight='balanced', random_state=42)
        
    # 5. Train & Predict
    model.fit(X_train_h, y_train_h)
    
    # Get Probabilities
    if hasattr(model, "predict_proba"):
        probs = model.predict_proba(X_test_final)[:, 1]
    else:
        probs = model.predict(X_test_final)
        
    predictions[f'prob_{h}h'] = probs
    print("Done. ✅")

print("-" * 50)
print("All models trained. Predictions ready for post-processing.")


 🚀 STEP 3: TRAINING HYBRID CHAMPIONS
Processing 12h Horizon (Champion: GB)... Done. ✅
Processing 24h Horizon (Champion: GB)... Done. ✅
Processing 48h Horizon (Champion: RF)... Done. ✅
Processing 72h Horizon (Champion: RF)... Single Class Found (Defaulting to 1.0).
--------------------------------------------------
All models trained. Predictions ready for post-processing.


### Step 4: Post Processing and Submission

Objective:

We have the raw predictions from our hybrid champions. Now we need to polish them before submitting.

1. **Enforce Logic (Monotonicity):**  
   A fire cannot be less likely to hit in 48 hours than in 24 hours. Risk must accumulate over time. We will fix any mathematical contradictions.

2. **Safety Clipping:**  
   No 72 hour model was actually trained: only one class remained after filtering, so Step 3 fell back to a constant 100 percent risk (1.0). Under penalties such as log loss, being 100 percent sure is dangerous. If even one fire misses, the score will crash. We will clip this to 99 percent (0.99) to be safe.

3. **Save:**  
   Generate the final submission.csv file.
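The first two steps can be sketched on a tiny array before applying them to the real predictions. This is an illustrative example with made-up numbers; it uses `np.maximum.accumulate`, which is equivalent to the column-by-column loop in the cell below. The penalty calculation assumes a log-loss-style metric, which is an assumption about scoring rather than something stated by the competition text quoted here.

```python
import numpy as np

# Hypothetical raw predictions for two fires at 12h, 24h, 48h, 72h.
raw = np.array([
    [0.30, 0.20, 0.60, 1.00],   # 24h dips below 12h: a contradiction
    [0.00, 0.10, 0.05, 1.00],   # 48h dips below 24h: a contradiction
])

# Step 1: running maximum along each row, so risk never decreases with time.
monotone = np.maximum.accumulate(raw, axis=1)

# Step 2: keep every probability inside [0.001, 0.99].
clipped = np.clip(monotone, 0.001, 0.99)
print(clipped)

# Under log loss, a wrong prediction of 0.99 costs -ln(1 - 0.99),
# whereas a wrong prediction of exactly 1.0 would cost infinity.
penalty_if_wrong = -np.log(1.0 - clipped.max())
print(round(float(penalty_if_wrong), 3))  # 4.605
```

Clipping after the running maximum is safe because clipping preserves order, so the rows stay non-decreasing. Note that equal values across horizons are allowed, so the result is non-decreasing rather than strictly increasing.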

In [4]:
# ==========================================
# 4. POST-PROCESSING & SAVING
# ==========================================
print("\n" + "="*50)
print(" 🔧 STEP 4: APPLYING LOGIC & SAVING")
print("="*50)

# 1. Create DataFrame from the predictions dictionary
sub_df = pd.DataFrame(predictions)

# Define the columns in time order
cols = ['prob_12h', 'prob_24h', 'prob_48h', 'prob_72h']

print("Applying Logic Checks...")

# 2. Enforce Monotonicity
# Logic: Risk at a later horizon cannot be lower than risk at an earlier one.
# Code: We loop through columns. If Col[i] < Col[i-1], we raise Col[i] to match.
# Note: np.maximum leaves ties in place, so the result is non-decreasing
# rather than strictly increasing.
for i in range(1, len(cols)):
    sub_df[cols[i]] = np.maximum(sub_df[cols[i]], sub_df[cols[i-1]])
    
print("✅ Monotonicity Enforced (Risk now strictly increases over time).")

# 3. Action A: Safety Clipping
# We clip predictions to be between 0.1% and 99%.
# This prevents the "Infinite Error" penalty if a 100% prediction turns out wrong.
sub_df[cols] = sub_df[cols].clip(lower=0.001, upper=0.99)

print("✅ Safety Clipping Applied (max risk capped at 99%).")

# 4. Save to CSV
filename = 'submission.csv'
sub_df.to_csv(filename, index=False)

print("\n" + "="*50)
print(f"🎉 DONE! File saved as '{filename}'")
print(f"Shape: {sub_df.shape}")
print("-" * 50)
print("Preview of final values (First 3 rows):")
print(sub_df.head(3))
print("="*50)


 🔧 STEP 4: APPLYING LOGIC & SAVING
Applying Logic Checks...
✅ Monotonicity Enforced (Risk now strictly increases over time).
✅ Safety Clipping Applied (max risk capped at 99%).

🎉 DONE! File saved as 'submission.csv'
Shape: (95, 5)
--------------------------------------------------
Preview of final values (First 3 rows):
   event_id  prob_12h  prob_24h  prob_48h  prob_72h
0  10662602  0.001000     0.001  0.051367      0.99
1  13353600  0.946538     0.990  0.990000      0.99
2  13942327  0.001000     0.001  0.051392      0.99
