# Disaster Prediction & Early Warning System (ML + GIS) — Without IoT

**End-to-end Jupyter Notebook** that you can run directly. It focuses on:
- Multi-hazard risk modeling (Flood & Drought)
- Robust preprocessing & feature engineering
- High-accuracy models with cross-validation + hyperparameter search
- Explainability (Permutation Importance)
- GIS risk mapping with Folium
- Early warning rule engine + notifications (simulated)

**Tip:** Replace the synthetic data section with your real CSV files when ready.


## 1) Setup
Install/Import libraries. If a library is missing, run the pip cell below.

In [10]:

!pip install numpy pandas scikit-learn matplotlib folium joblib shapely geopandas pyproj



In [11]:
import numpy as np
import pandas as pd
from pathlib import Path
from datetime import datetime, timedelta
from sklearn.model_selection import train_test_split, TimeSeriesSplit, StratifiedKFold, RandomizedSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score, classification_report, mean_absolute_error, r2_score
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.impute import SimpleImputer
import joblib
import folium
import matplotlib.pyplot as plt

pd.set_option('display.max_columns', 200)
np.random.seed(42)


## 2) Data
We use **synthetic-yet-realistic** geospatial time-series data so that the notebook runs anywhere. To switch to real datasets later, replace the synthetic data cells with `pd.read_csv(...)`.

In [14]:
# ---- Synthetic Data Generator (Flood classification + Drought regression) ----
# Geography box (e.g., a region in Maharashtra)
lat_min, lat_max = 18.5, 20.0
lon_min, lon_max = 73.0, 76.0

n_points = 2500  # number of observations; each gets a random location and a random date
start_date = datetime(2020, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(365)]

# Sample lat/lon and assign each a random date (simulate daily observations)
lats = np.random.uniform(lat_min, lat_max, n_points)
lons = np.random.uniform(lon_min, lon_max, n_points)
date_idx = np.random.randint(0, len(dates), n_points)
date_series = np.array([dates[i] for i in date_idx])

# Weather-like features
rain_mm = np.clip(np.random.gamma(2.0, 15.0, n_points) - 5 + 25*np.random.binomial(1, 0.2, n_points), 0, None)
temp_c = np.random.normal(28, 4, n_points) - 0.05*(rain_mm > 40)  # rainy days slightly cooler
wind_kph = np.random.normal(15, 6, n_points) + 0.2*(rain_mm > 60) * np.random.normal(8, 3, n_points)
humidity = np.clip(np.random.normal(60, 15, n_points) + 0.2*rain_mm, 10, 100)
river_level_m = np.clip(1.2 + 0.02*rain_mm + np.random.normal(0, 0.3, n_points), 0, None)
soil_moisture = np.clip(20 + 0.6*rain_mm - 0.3*temp_c + np.random.normal(0, 5, n_points), 5, 100)
evap_mm = np.clip(5 + 0.2*temp_c - 0.05*humidity + np.random.normal(0, 1.5, n_points), 0, None)

# Flood label (classification): high rainfall + river level + soil saturation
flood_risk_score = 0.03*rain_mm + 1.5*river_level_m + 0.02*soil_moisture + 0.01*wind_kph - 0.02*temp_c
flood_threshold = np.percentile(flood_risk_score, 80)  # top 20% risk = flood event
flood_event = (flood_risk_score >= flood_threshold).astype(int)

# Drought index (regression target): lower rainfall, high evap, low soil moisture -> higher drought index
drought_index = np.clip(80 - 0.35*rain_mm + 0.6*evap_mm - 0.3*soil_moisture + np.random.normal(0, 3, n_points), 0, 100)

df = pd.DataFrame({
    'date': date_series,
    'lat': lats,
    'lon': lons,
    'rain_mm': rain_mm,
    'temp_c': temp_c,
    'wind_kph': wind_kph,
    'humidity': humidity,
    'river_level_m': river_level_m,
    'soil_moisture': soil_moisture,
    'evap_mm': evap_mm,
    'flood_event': flood_event,
    'drought_index': drought_index
})

df.sort_values('date', inplace=True)
df.reset_index(drop=True, inplace=True)

print('Synthetic dataset shape:', df.shape)
df.head()

Synthetic dataset shape: (2500, 12)


|   | date | lat | lon | rain_mm | temp_c | wind_kph | humidity | river_level_m | soil_moisture | evap_mm | flood_event | drought_index |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-01-01 | 18.708241 | 74.350763 | 27.172204 | 26.853746 | 17.415451 | 66.403607 | 1.496472 | 30.674063 | 6.343199 | 0 | 67.65048 |
| 1 | 2020-01-01 | 19.033959 | 74.44267 | 69.529848 | 29.004376 | 13.874468 | 76.906338 | 3.319376 | 50.550377 | 8.022703 | 1 | 49.862756 |
| 2 | 2020-01-01 | 18.550919 | 73.632379 | 34.844697 | 27.977871 | 17.720272 | 62.174314 | 1.714159 | 32.428313 | 8.671048 | 0 | 65.907568 |
| 3 | 2020-01-01 | 19.954868 | 75.271849 | 7.013067 | 27.257663 | 18.653639 | 65.679849 | 1.604903 | 10.968427 | 10.18876 | 0 | 79.710447 |
| 4 | 2020-01-02 | 19.769678 | 73.029936 | 31.734303 | 22.191026 | 11.732328 | 60.767464 | 1.442845 | 26.598245 | 5.338447 | 0 | 67.748704 |


### Optional: Save the synthetic dataset to CSV (for reuse)

In [35]:
data_path = Path('synthetic_disaster_dataset.csv')  # relative path: written to the current working directory
df.to_csv(data_path, index=False)
print('Saved to:', data_path)

Saved to: synthetic_disaster_dataset.csv


## 3) Feature Engineering
Create rolling features and lags per approximate spatial buckets (lat/lon rounding) to emulate time-series locality.
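
To see what per-bucket rolling means and lags look like before they are applied to the full dataset, here is a minimal sketch on toy data. The `toy` frame and its `bucket` column are assumptions made for illustration and are not part of the notebook's dataset.

```python
import pandas as pd

# Toy data (hypothetical): two spatial buckets with a few daily observations each.
toy = pd.DataFrame({
    "bucket": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03",
                            "2020-01-01", "2020-01-02"]),
    "rain_mm": [10.0, 20.0, 30.0, 5.0, 15.0],
})
toy = toy.sort_values(["bucket", "date"]).reset_index(drop=True)

# Rolling mean over the last 2 observations within each bucket.
toy["rain_roll2"] = toy.groupby("bucket")["rain_mm"].transform(
    lambda s: s.rolling(2, min_periods=1).mean()
)

# Previous observation within the same bucket; the first row of a bucket has no lag.
toy["rain_lag1"] = toy.groupby("bucket")["rain_mm"].shift(1)

print(toy)
```

The first row of every bucket has a NaN lag, so a bucket holding a single observation contributes no usable rows once NaNs are dropped. Bucket size therefore decides how much data survives this step.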

In [38]:
# Create spatial buckets (approximate grid) to compute rolling stats by area.
# Rounding to 1 decimal (cells of roughly 11 km) keeps several observations per
# bucket. With 2 decimals almost every synthetic point sits alone in its bucket,
# so its lag is NaN and dropna() discards nearly all rows.
df['lat_bucket'] = df['lat'].round(1)
df['lon_bucket'] = df['lon'].round(1)

df = df.sort_values(['lat_bucket','lon_bucket','date']).copy()

def add_group_rolls(data, group_cols, feat, windows=(3,7,14)):
    # Rolling mean over the last w observations (rows, not calendar days) per bucket
    for w in windows:
        col = f'{feat}_roll{w}'
        data[col] = data.groupby(group_cols)[feat].transform(lambda x: x.rolling(w, min_periods=1).mean())
    return data

for feat in ['rain_mm','temp_c','wind_kph','humidity','soil_moisture','evap_mm','river_level_m']:
    df = add_group_rolls(df, ['lat_bucket','lon_bucket'], feat)

# Example lags (previous observation in the same bucket)
lag_cols = []
for feat in ['rain_mm','river_level_m','soil_moisture']:
    df[f'{feat}_lag1'] = df.groupby(['lat_bucket','lon_bucket'])[feat].shift(1)
    lag_cols.append(f'{feat}_lag1')

df.dropna(subset=lag_cols, inplace=True)  # drop the first observation of each bucket, which has no lag
df.reset_index(drop=True, inplace=True)
print('After FE:', df.shape)
df.head()


## 4) Train/Test Split
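We split by time, training on the earlier dates and testing on the most recent 20%, so that neither the flood classifier nor the drought regressor is evaluated on data older than what it was trained on.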
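
As a small illustration of a quantile-based time split, the sketch below cuts a toy frame at the 80th percentile of its `date` column. The `toy` frame is an assumption for illustration; the check at the end confirms that every training date precedes every test date.

```python
import pandas as pd

# Toy data (hypothetical): ten consecutive daily observations.
toy = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=10, freq="D"),
    "value": range(10),
})

# Cut at the 80th percentile of the date column:
# earlier rows go to train, later rows go to test.
split_time = toy["date"].quantile(0.8)
train = toy[toy["date"] <= split_time]
test = toy[toy["date"] > split_time]

print("split at:", split_time)
print("train rows:", len(train), "test rows:", len(test))
print("no overlap in time:", train["date"].max() < test["date"].min())
```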

In [41]:
# Time-based split (last 20% dates for test)
# Rows are sorted by date because TimeSeriesSplit in the modeling cells assumes time order
split_time = df['date'].quantile(0.8)
train_df = df[df['date'] <= split_time].sort_values('date').copy()
test_df  = df[df['date'] >  split_time].sort_values('date').copy()

print('Train size:', train_df.shape, 'Test size:', test_df.shape)
# Fail early on a near-empty split; TimeSeriesSplit(n_splits=5) needs at least 6 samples
assert len(train_df) > 5 and len(test_df) > 0, 'Split is too small: check the feature engineering output'

# Common features
base_feats = [c for c in df.columns if c not in ['date','flood_event','drought_index','lat_bucket','lon_bucket']]
target_flood = 'flood_event'
target_drought = 'drought_index'

X_train_flood, y_train_flood = train_df[base_feats], train_df[target_flood]
X_test_flood,  y_test_flood  = test_df[base_feats],  test_df[target_flood]

X_train_drought, y_train_drought = train_df[base_feats], train_df[target_drought]
X_test_drought,  y_test_drought  = test_df[base_feats],  test_df[target_drought]

num_features = base_feats  # all are numeric here


## 5) Modeling — Flood Classification (High Accuracy with CV + Randomized Search)
We use **GradientBoostingClassifier** (robust on tabular data), tuned with a 40-candidate randomized search and scored by ROC-AUC under 5-split time-series cross-validation (`TimeSeriesSplit`).
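
To make the cross-validation scheme concrete, the sketch below prints the folds that `TimeSeriesSplit(n_splits=5)` produces for 12 samples. The array `X` is a stand-in that is assumed to be sorted by time already.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 12 placeholder samples, assumed to be in chronological order.
X = np.arange(12).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=5)

folds = list(tscv.split(X))
for i, (train_idx, test_idx) in enumerate(folds):
    # Each fold trains on an expanding window and tests on the block that follows it.
    print(f"fold {i}: train={train_idx.tolist()} test={test_idx.tolist()}")
```

Every test block lies strictly after its training window, which is why the rows passed to the search must be in time order. The splitter also needs at least `n_splits + 1` samples (6 here) and raises a `ValueError` otherwise.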

In [44]:
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler(with_mean=False))  # tree models don't require scaling, but harmless
])

flood_clf = Pipeline(steps=[
    ('prep', ColumnTransformer([('num', numeric_transformer, num_features)], remainder='drop')),
    ('model', GradientBoostingClassifier(random_state=42))
])

param_distributions = {
    'model__n_estimators': [150, 250, 350, 450],
    'model__learning_rate': np.linspace(0.02, 0.2, 10),
    'model__max_depth': [2, 3, 4],
    'model__min_samples_split': [2, 5, 10, 20],
    'model__min_samples_leaf': [1, 2, 4, 8],
    'model__subsample': [0.7, 0.85, 1.0]
}

tscv = TimeSeriesSplit(n_splits=5)
search_flood = RandomizedSearchCV(
    flood_clf, param_distributions=param_distributions, n_iter=40,
    scoring='roc_auc', cv=tscv, n_jobs=-1, random_state=42, verbose=1
)

search_flood.fit(X_train_flood, y_train_flood)
print('Best Flood params:', search_flood.best_params_)


Fitting 5 folds for each of 40 candidates, totalling 200 fits



In [None]:
# Evaluate
best_flood = search_flood.best_estimator_
proba = best_flood.predict_proba(X_test_flood)[:,1]
preds = (proba >= 0.5).astype(int)

print('ROC-AUC:', roc_auc_score(y_test_flood, proba))
print('Accuracy:', accuracy_score(y_test_flood, preds))
print('F1:', f1_score(y_test_flood, preds))
print('\nClassification Report:\n', classification_report(y_test_flood, preds))

# Permutation importance (on a sampled subset for speed)
idx = np.random.choice(len(X_test_flood), size=min(300, len(X_test_flood)), replace=False)
r = permutation_importance(best_flood, X_test_flood.iloc[idx], y_test_flood.iloc[idx], n_repeats=5, random_state=42, scoring='roc_auc')
imp_df = pd.DataFrame({'feature': num_features, 'importance': r.importances_mean}).sort_values('importance', ascending=False).head(15)
imp_df.reset_index(drop=True, inplace=True)
imp_df


## 6) Modeling — Drought Index Regression (High Accuracy with CV + Randomized Search)
We use **GradientBoostingRegressor** with the same time-series cross-validation, tuned by randomized search on mean absolute error (`neg_mean_absolute_error`).
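
scikit-learn maximizes scores, so it reports mean absolute error as a negative number. The sketch below shows how to read it back as an MAE. The data is a made-up stand-in loosely shaped like the drought target, not the notebook's dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Hypothetical data: two features on a 0-100 scale and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 100, size=(200, 2))
y = 80 - 0.35 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(0, 3, 200)

scores = cross_val_score(
    GradientBoostingRegressor(n_estimators=50, random_state=0),
    X, y,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_absolute_error",
)

# Scores are negated errors: flip the sign to get MAE per fold.
mae_per_fold = -scores
print("MAE per fold:", mae_per_fold.round(2))
```

The same sign convention applies to `best_score_` on a fitted `RandomizedSearchCV` that uses this scorer.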

In [None]:
drought_reg = Pipeline(steps=[
    ('prep', ColumnTransformer([('num', numeric_transformer, num_features)], remainder='drop')),
    ('model', GradientBoostingRegressor(random_state=42))
])

param_dist_reg = {
    'model__n_estimators': [200, 300, 400, 600],
    'model__learning_rate': np.linspace(0.02, 0.2, 10),
    'model__max_depth': [2, 3, 4],
    'model__min_samples_split': [2, 5, 10, 20],
    'model__min_samples_leaf': [1, 2, 4, 8],
    'model__subsample': [0.7, 0.85, 1.0]
}

search_drought = RandomizedSearchCV(
    drought_reg, param_distributions=param_dist_reg, n_iter=40,
    scoring='neg_mean_absolute_error', cv=tscv, n_jobs=-1, random_state=42, verbose=1
)
search_drought.fit(X_train_drought, y_train_drought)
print('Best Drought params:', search_drought.best_params_)


In [None]:
# Evaluate
best_drought = search_drought.best_estimator_
drought_pred = best_drought.predict(X_test_drought)
mae = mean_absolute_error(y_test_drought, drought_pred)
r2 = r2_score(y_test_drought, drought_pred)
print('MAE:', mae)
print('R2:', r2)

# Importance (permutation)
idx2 = np.random.choice(len(X_test_drought), size=min(300, len(X_test_drought)), replace=False)
r2imp = permutation_importance(best_drought, X_test_drought.iloc[idx2], y_test_drought.iloc[idx2], n_repeats=5, random_state=42, scoring='neg_mean_absolute_error')
imp_reg_df = pd.DataFrame({'feature': num_features, 'importance': r2imp.importances_mean}).sort_values('importance', ascending=False).head(15)
imp_reg_df.reset_index(drop=True, inplace=True)
imp_reg_df


## 7) Explainability — Top Features
Bar charts of permutation importance.
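
As a quick intuition for what the bars measure, the sketch below fits a classifier on one informative feature and one pure-noise feature, then compares their permutation importances. All data and names here are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance

# Hypothetical data: the label depends only on `signal`; `noise` is irrelevant.
rng = np.random.default_rng(0)
n = 400
signal = rng.normal(size=n)
noise = rng.normal(size=n)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

# Fit on the first 300 rows, measure importance on the held-out 100.
model = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X[:300], y[:300])
result = permutation_importance(
    model, X[300:], y[300:], n_repeats=5, random_state=0, scoring="roc_auc"
)

for name, imp in zip(["signal", "noise"], result.importances_mean):
    # Importance = drop in ROC-AUC when that column is shuffled.
    print(f"{name}: {imp:.3f}")
```

Shuffling the informative column destroys the model's ranking ability, while shuffling the noise column changes almost nothing.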

In [None]:
# Flood importance plot
plt.figure(figsize=(8,5))
plt.barh(imp_df['feature'], imp_df['importance'])
plt.gca().invert_yaxis()
plt.title('Flood Model — Top Features (Permutation Importance)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()

# Drought importance plot
plt.figure(figsize=(8,5))
plt.barh(imp_reg_df['feature'], imp_reg_df['importance'])
plt.gca().invert_yaxis()
plt.title('Drought Model — Top Features (Permutation Importance)')
plt.xlabel('Importance')
plt.tight_layout()
plt.show()


## 8) Risk Scoring
Combine flood probability and drought index into a unified **risk score** (0–100) and classify severity bands.
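
The scoring rule used in this section is 50 × flood probability plus 0.5 × the drought prediction clipped to 0–100, cut into four right-inclusive bands at 25, 50 and 75. For example, a flood probability of 0.8 and a predicted drought index of 40 give 50·0.8 + 0.5·40 = 60, which falls in the High band. The pure-Python sketch below restates that rule. The function names are hypothetical helpers, not part of the notebook.

```python
def composite_risk(flood_prob: float, drought_pred: float) -> float:
    """50 * flood probability (0-1) plus 0.5 * drought prediction clipped to 0-100."""
    drought = min(max(drought_pred, 0.0), 100.0)
    return 50.0 * flood_prob + 0.5 * drought


def severity_band(score: float) -> str:
    """Right-inclusive bands: [0, 25], (25, 50], (50, 75], (75, 100]."""
    if score <= 25:
        return "Low"
    if score <= 50:
        return "Moderate"
    if score <= 75:
        return "High"
    return "Severe"


for p, d in [(0.1, 20.0), (0.8, 40.0), (1.0, 120.0)]:
    score = composite_risk(p, d)
    print(f"flood_prob={p}, drought_pred={d} -> {score:.1f} ({severity_band(score)})")
```

Both hazards carry equal weight in this rule, so a location can reach the top band only when flood probability and drought prediction are both high.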

In [None]:
# Flood probability on the test set (0-1); the drought prediction is already on a 0-100 scale
flood_prob_test = best_flood.predict_proba(X_test_flood)[:,1]

# Align indices (they come from the same test_df order)
risk = pd.DataFrame({
    'date': test_df['date'].values,
    'lat': test_df['lat'].values,
    'lon': test_df['lon'].values,
    'flood_prob': flood_prob_test,
    'drought_index': y_test_drought.values,  # ground truth
    'drought_pred': drought_pred
})
# Composite risk: higher flood prob or higher drought_pred -> higher risk
risk['comp_risk_0_100'] = 50*risk['flood_prob'] + 0.5*np.clip(risk['drought_pred'], 0, 100)
# Severity bands
bins = [0, 25, 50, 75, 100]
labels = ['Low','Moderate','High','Severe']
risk['severity'] = pd.cut(risk['comp_risk_0_100'], bins=bins, labels=labels, include_lowest=True)

risk.head()


## 9) GIS — Risk Map (Folium)
Interactive map with markers sized by risk and colored by severity.
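
The marker styling can be read as a small mapping from a risk row to a radius and a color. The sketch below isolates that mapping in plain Python so it can be checked without rendering a map. `marker_style` and `SEVERITY_COLORS` are hypothetical names introduced for illustration.

```python
SEVERITY_COLORS = {
    "Low": "green",
    "Moderate": "orange",
    "High": "red",
    "Severe": "darkred",
}


def marker_style(comp_risk: float, severity) -> dict:
    """Radius grows linearly from 3 (risk 0) to 8 (risk 100); unknown bands fall back to blue."""
    return {
        "radius": 3 + comp_risk / 20,
        "color": SEVERITY_COLORS.get(str(severity), "blue"),
    }


for risk_score, band in [(10.0, "Low"), (60.0, "High"), (100.0, "Severe")]:
    print(risk_score, band, marker_style(risk_score, band))
```

The blue fallback is what a row would get if its severity were missing, for instance when a score falls outside the 0–100 bins.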

In [None]:
# Color map for severity
def sev_color(s):
    return {'Low':'green','Moderate':'orange','High':'red','Severe':'darkred'}.get(str(s),'blue')

center_lat, center_lon = float(risk['lat'].mean()), float(risk['lon'].mean())
m = folium.Map(location=[center_lat, center_lon], zoom_start=7)

sample = risk.sample(min(500, len(risk)), random_state=42)  # limit markers for performance

for _, row in sample.iterrows():
    folium.CircleMarker(
        location=[row['lat'], row['lon']],
        radius=3 + (row['comp_risk_0_100']/20),
        color=sev_color(row['severity']),
        fill=True,
        fill_opacity=0.6,
        popup=f"Date: {row['date'].date()}<br>Flood prob: {row['flood_prob']:.2f}<br>Drought pred: {row['drought_pred']:.1f}<br>Risk: {row['comp_risk_0_100']:.1f} ({row['severity']})"
    ).add_to(m)

m


## 10) Early Warning — Rule Engine
Generates human-friendly alerts and suggested actions.
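
The rules are evaluated in priority order: a flood probability of at least 0.7 wins first, then a composite risk of at least 75, then a composite risk of at least 50. The sketch below captures only that ordering. The function name and the returned labels are hypothetical.

```python
from typing import Optional


def alert_level(flood_prob: float, comp_risk: float) -> Optional[str]:
    """Return the first matching alert label, or None when no rule fires."""
    if flood_prob >= 0.7:
        return "flood_high"
    if comp_risk >= 75:
        return "severe_composite"
    if comp_risk >= 50:
        return "high_composite"
    return None


cases = [(0.9, 90.0), (0.2, 80.0), (0.2, 55.0), (0.2, 30.0)]
for p, r in cases:
    print(f"flood_prob={p}, comp_risk={r} -> {alert_level(p, r)}")
```

Because the flood rule is checked first, a row with both a high flood probability and a severe composite score produces the flood message only.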

In [None]:
def make_alerts(r):
    alerts = []
    for _, row in r.iterrows():
        msg = None
        if row['flood_prob'] >= 0.7:
            msg = f"⚠️ Flood risk HIGH ({row['flood_prob']:.2f}). Move to higher ground and prepare for evacuation."
        elif row['comp_risk_0_100'] >= 75:
            msg = f"⚠️ Severe composite risk ({row['comp_risk_0_100']:.1f}). Stay alert and follow local advisories."
        elif row['comp_risk_0_100'] >= 50:
            msg = f"🔶 High composite risk ({row['comp_risk_0_100']:.1f}). Review safety plans."
        if msg:
            alerts.append({
                'date': row['date'],
                'lat': row['lat'],
                'lon': row['lon'],
                'message': msg
            })
    return pd.DataFrame(alerts)

alerts_df = make_alerts(risk)
print('Total alerts:', len(alerts_df))
alerts_df.head(10)


## 11) Save Models & Artifacts

In [None]:
art_dir = Path('models')  # relative to the current working directory
art_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(best_flood, art_dir / 'flood_model_gb.pkl')
joblib.dump(best_drought, art_dir / 'drought_model_gb.pkl')
risk.to_csv(art_dir / 'latest_risk_scores.csv', index=False)
alerts_df.to_csv(art_dir / 'latest_alerts.csv', index=False)
print('Saved models and CSVs to:', art_dir)
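
A saved model is only useful if it reloads and predicts identically. The sketch below does a round trip with `joblib` on a small stand-in model in a temporary directory. The data, the model and the file name are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical stand-in for a trained model.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] > 0).astype(int)
model = GradientBoostingClassifier(n_estimators=20, random_state=0).fit(X, y)

# Save to a temporary directory, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_model.pkl"
    joblib.dump(model, path)
    restored = joblib.load(path)

same = np.array_equal(model.predict_proba(X), restored.predict_proba(X))
print("Predictions identical after reload:", same)
```

Pickled estimators are tied to the library versions that wrote them, so record the scikit-learn version alongside the file, and only load pickle files from sources you trust.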

## 12) How to Use With Real Data
1. Replace the **synthetic data generator** with your CSV(s).
2. Ensure the following columns (or map your columns accordingly):
   - `date` (YYYY-MM-DD), `lat`, `lon`
   - `rain_mm`, `temp_c`, `wind_kph`, `humidity`, `river_level_m`, `soil_moisture`, `evap_mm`
   - `flood_event` (0/1), `drought_index` (0–100)
3. Re-run from **Feature Engineering** onward.
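
The steps above can be sketched as a small loader that parses `date`, checks the required columns and returns rows in time order. `load_observations` is a hypothetical helper, and the two CSV rows are made-up sample values. In practice you would pass a file path instead of the in-memory text.

```python
import io

import pandas as pd

REQUIRED = [
    "date", "lat", "lon", "rain_mm", "temp_c", "wind_kph", "humidity",
    "river_level_m", "soil_moisture", "evap_mm", "flood_event", "drought_index",
]


def load_observations(source) -> pd.DataFrame:
    """Read a CSV, parse dates, verify the expected columns, and sort by date."""
    data = pd.read_csv(source, parse_dates=["date"])
    missing = [c for c in REQUIRED if c not in data.columns]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    return data.sort_values("date").reset_index(drop=True)


# Made-up sample rows, deliberately out of date order.
csv_text = """date,lat,lon,rain_mm,temp_c,wind_kph,humidity,river_level_m,soil_moisture,evap_mm,flood_event,drought_index
2020-01-02,19.77,73.03,31.7,22.2,11.7,60.8,1.44,26.6,5.3,0,67.7
2020-01-01,18.71,74.35,27.2,26.9,17.4,66.4,1.50,30.7,6.3,0,67.7
"""

df_real = load_observations(io.StringIO(csv_text))
print(df_real[["date", "lat", "lon", "flood_event"]])
```

If your column names differ, rename them with `DataFrame.rename(columns=...)` before the check rather than editing the downstream cells.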

### Tips for Higher Accuracy on Real Data
- Use more data (in the synthetic generator this is `n_points`).
- Try `HistGradientBoostingClassifier`/`HistGradientBoostingRegressor` or CatBoost (both support categorical features and monotonic constraints).
- Add domain features: SPI (Standardized Precipitation Index), soil saturation days, river basin IDs.
- Use **spatial cross-validation** (e.g., GroupKFold by basin) to avoid overfitting by location.
- Calibrate probabilities with `CalibratedClassifierCV` for better alert thresholds.
- Tune thresholds using precision-recall tradeoffs for imbalanced events.
- Add seasonal features: month, monsoon flag, ENSO indices if available.
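
As one example from the list above, spatial cross-validation can be sketched with `GroupKFold`, which keeps all observations from a group in the same fold. The `basin_id` labels and the random data here are assumptions for illustration. With real data the groups would come from a basin or grid-cell column.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical data: 60 observations spread evenly over 6 basins.
rng = np.random.default_rng(0)
n = 60
basin_id = np.arange(n) % 6
X = rng.normal(size=(n, 3))
y = rng.integers(0, 2, size=n)

gkf = GroupKFold(n_splits=3)
splits = list(gkf.split(X, y, groups=basin_id))

for train_idx, test_idx in splits:
    held_out = sorted(set(basin_id[test_idx].tolist()))
    seen = set(basin_id[train_idx].tolist())
    # No held-out basin ever appears in the training indices of the same fold.
    print("held-out basins:", held_out, "| overlap with train:", bool(seen & set(held_out)))
```

Passing `cv=gkf` together with `groups=...` to a search's `fit` call evaluates each candidate on basins it has not seen, at the cost of ignoring time order within a fold.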
