**Computer Science 250 Project**

**Name**: **Dayyan Fundi** 
    
**Partner Name**: **Gavela Maculuve**
    
**Topic**: **Predicting Disaster Response Time Using Machine Learning: A Random Forest Approach Using the Global Disaster Response Dataset (2018–2024)**

**Research Question**: **Can we predict how long it will take for responding agencies to arrive after a disaster, based on event characteristics such as type, severity, and location?**

# Section 1: Project outline

# Predicting Disaster Response Time Using Machine Learning  
### Global Disaster Response Dataset (2018–2024)

---

## 1. Abstract  
This project analyzes global disaster events from 2018–2024 and builds a machine learning model to **predict disaster response time** using Random Forest Regression. The model uses disaster type, severity, casualties, economic loss, country, and the date of the event to estimate how long it takes responders to arrive at a disaster site.  
We also build:  
- A **Linear Regression model** as a baseline  
- A **K-Means Clustering model** to reveal patterns in disasters  

Random Forest provides the best performance and identifies the most important predictors of response delay.

---

## 2. Introduction  
Disasters vary in severity, type, and location, all of which influence how quickly governments and organizations respond. Predicting response time can help emergency systems allocate resources more efficiently.  

This project uses machine learning to:  
- Predict disaster response times  
- Identify factors influencing delays  
- Discover natural groups (clusters) in global disaster events  

---

## 3. Research Question  
**Can we accurately predict how long it will take for responders to arrive after a disaster, based on event characteristics?**

---

## 4. Dataset Summary  
The dataset contains 12 columns:  
- *date*  
- *country*  
- *disaster_type*  
- *severity_index*  
- *casualties*  
- *economic_loss_usd*  
- *response_time_hours* (target variable)  
- *aid_amount_usd*  
- *response_efficiency_score*  
- *recovery_days*  
- *latitude*  
- *longitude*  

---

## 5. Methodology Overview  
### **Models used:**  
- Linear Regression (baseline)  
- Random Forest Regressor (main model)  
- K-Means clustering (unsupervised pattern discovery)  

### **Steps:**  
1. Import dataset  
2. Clean missing data  
3. Feature engineering  
4. Build baseline model  
5. Build Random Forest model  
6. Evaluate models  
7. Build clustering model  
8. Visualize clusters  
9. Interpret results  
10. Conclusion  

---

## 6. Expected Outputs  
- MAE, RMSE, R² for both models  
- Feature importance graph  
- Cluster visualization  
- Final insights & summary  
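
The three error metrics listed above can be computed directly from a set of predictions. The following is a minimal sketch on made-up response times; the helper name `regression_metrics` and the toy values are illustrative assumptions and are not part of the project code, which relies on `sklearn.metrics` instead.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    mae = np.mean(np.abs(errors))                    # mean absolute error
    rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
    ss_res = np.sum(errors ** 2)                     # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination
    return mae, rmse, r2

# Made-up response times in hours, for illustration only
toy_true = [10.0, 20.0, 30.0, 40.0]
toy_pred = [12.0, 18.0, 33.0, 37.0]

mae, rmse, r2 = regression_metrics(toy_true, toy_pred)
print(f"MAE={mae:.3f}  RMSE={rmse:.3f}  R2={r2:.3f}")
```

MAE and RMSE are in the same unit as the target (hours), while R² is unitless and compares the model against always predicting the mean.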


# Section 2: Python code

**2.1. Importing important libraries**

In [67]:
# ---------------------------------------------
# 1. IMPORT LIBRARIES
# ---------------------------------------------

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

plt.style.use("ggplot")


**2.2. Loading dataset**

In [68]:
df = pd.read_csv("global_disaster_response_2018_2024.csv")
df.head()


| | date | country | disaster_type | severity_index | casualties | economic_loss_usd | response_time_hours | aid_amount_usd | response_efficiency_score | recovery_days | latitude | longitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-31 | Brazil | Earthquake | 5.99 | 111 | 7934365.71 | 15.62 | 271603.79 | 83.21 | 67 | -30.613 | -122.557 |
| 1 | 2018-12-23 | Brazil | Extreme Heat | 6.53 | 100 | 8307648.99 | 5.03 | 265873.81 | 96.18 | 55 | 10.859 | -159.194 |
| 2 | 2020-08-10 | India | Hurricane | 1.55 | 22 | 765136.99 | 32.54 | 49356.49 | 60.4 | 22 | 0.643 | -160.978 |
| 3 | 2022-09-15 | Indonesia | Extreme Heat | 4.55 | 94 | 1308251.31 | 7.83 | 237512.88 | 86.41 | 47 | -33.547 | 30.35 |
| 4 | 2022-09-28 | United States | Wildfire | 3.8 | 64 | 2655864.36 | 21.9 | 188910.69 | 72.81 | 42 | -19.17 | -117.137 |


In [69]:
# ---------------------------------------------
# 2.1 DATASET OVERVIEW
# ---------------------------------------------

print("Number of rows:", df.shape[0])
print("Number of columns:", df.shape[1])
print("\nDataset Shape:", df.shape)

print("\nColumn Names:")
print(list(df.columns))

print("\nBasic Info:")
df.info()


Number of rows: 50000
Number of columns: 12

Dataset Shape: (50000, 12)

Column Names:
['date', 'country', 'disaster_type', 'severity_index', 'casualties', 'economic_loss_usd', 'response_time_hours', 'aid_amount_usd', 'response_efficiency_score', 'recovery_days', 'latitude', 'longitude']

Basic Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 12 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   date                       50000 non-null  object 
 1   country                    50000 non-null  object 
 2   disaster_type              50000 non-null  object 
 3   severity_index             50000 non-null  float64
 4   casualties                 50000 non-null  int64  
 5   economic_loss_usd          50000 non-null  float64
 6   response_time_hours        50000 non-null  float64
 7   aid_amount_usd             50000 non-null  float64
 8   response_efficiency_scor

# Section 3: Data Cleaning and Preprocessing

**3.1. Cleaning Missing Data**

In [70]:
# Drop rows missing the target variable (copy so later column
# assignments act on an independent DataFrame, not a view)
df = df.dropna(subset=["response_time_hours"]).copy()

# Fill numeric missing values
num_cols = df.select_dtypes(include=["int64","float64"]).columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Fill categorical missing values
cat_cols = df.select_dtypes(include=["object"]).columns
df[cat_cols] = df[cat_cols].fillna(df[cat_cols].mode().iloc[0])
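
To see what the cleaning cell above does, here is a small sketch of the same steps on a hand-made frame. The values are invented for illustration, and `exclude="number"` is used in place of the `object` filter only so the example does not depend on how a given pandas version types string columns.

```python
import numpy as np
import pandas as pd

# Invented rows, for illustration only
toy = pd.DataFrame({
    "casualties": [10.0, 20.0, 30.0, np.nan],
    "country": ["Brazil", None, "India", "Brazil"],
    "response_time_hours": [5.0, 7.5, np.nan, 12.0],
})

# 1. Drop rows with no target value (row 2 is removed)
toy = toy.dropna(subset=["response_time_hours"]).copy()

# 2. Numeric gaps are filled with each column's median
num_cols = toy.select_dtypes(include="number").columns
toy[num_cols] = toy[num_cols].fillna(toy[num_cols].median())

# 3. Non-numeric gaps are filled with each column's most frequent value
cat_cols = toy.select_dtypes(exclude="number").columns
toy[cat_cols] = toy[cat_cols].fillna(toy[cat_cols].mode().iloc[0])

print(toy)
```

Note that the medians and modes are computed after the target-less rows are dropped, so the dropped rows do not influence the fill values.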


**3.2. Feature Engineering**

In [71]:
# Convert date into components
df["date"] = pd.to_datetime(df["date"])
df["year"] = df["date"].dt.year
df["month"] = df["date"].dt.month
df["day_of_week"] = df["date"].dt.weekday

# Choose features
features = [
    "disaster_type", "country", "severity_index",
    "casualties", "economic_loss_usd",
    "year", "month", "day_of_week"
]

X = df[features]
y = df["response_time_hours"]
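
As a quick check of how the date components are encoded, the sketch below applies the same accessors to three dates taken from the `df.head()` output. The frame name `toy_dates` is a placeholder introduced only for this example.

```python
import pandas as pd

# Three dates copied from the first rows of the dataset
toy_dates = pd.DataFrame({"date": ["2021-01-31", "2018-12-23", "2022-09-15"]})
toy_dates["date"] = pd.to_datetime(toy_dates["date"])

toy_dates["year"] = toy_dates["date"].dt.year
toy_dates["month"] = toy_dates["date"].dt.month
# .dt.weekday counts from Monday=0 to Sunday=6
toy_dates["day_of_week"] = toy_dates["date"].dt.weekday

print(toy_dates)
```

Because `day_of_week` and `month` are stored as plain integers, a linear model treats them as ordered quantities; tree-based models are less sensitive to that choice.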


**3.3. Train/Test Split**

In [72]:
# ---------------------------------------------
# 3.3 TRAIN/TEST SPLIT
# ---------------------------------------------

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


**3.4. Preprocessing Pipeline**

In [80]:
# ---------------------------------------------
# 3.4 PREPROCESSING PIPELINE
# ---------------------------------------------

# Feature lists matching the dataset's actual column names
numeric_features = [
    "casualties",
    "economic_loss_usd",
    "year",
    "month",
    "day_of_week"
]

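categorical_features = [
    "disaster_type",
    "country",
    # NOTE: severity_index is a continuous float. One-hot encoding it creates
    # one column per distinct value; listing it under numeric_features instead
    # is the more conventional choice, but would change the reported metrics.
    "severity_index"
]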

# Build the preprocessing transformer
preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), numeric_features),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_features)
    ]
)


# Section 4: Modelling

**4.1 Linear Regression Model**

In [81]:
# ---------------------------------------------
# 4.1 LINEAR REGRESSION (BASELINE)
# ---------------------------------------------

linreg_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", LinearRegression())
])

linreg_model.fit(X_train, y_train)
y_pred_lr = linreg_model.predict(X_test)

print("Linear Regression MAE:", mean_absolute_error(y_test, y_pred_lr))
print("Linear Regression RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_lr)))
print("Linear Regression R2:", r2_score(y_test, y_pred_lr))


Linear Regression MAE: 3.865416366913522
Linear Regression RMSE: 4.763779131483152
Linear Regression R2: 0.7261584928806271
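
An R² value is easier to interpret next to a model that ignores the features entirely. The following sketch compares `DummyRegressor` (which always predicts the training mean) with `LinearRegression` on synthetic data; the data-generating formula and all variable names are assumptions for illustration and say nothing about the real dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X_syn = rng.uniform(1, 10, size=(500, 1))              # made-up "severity" values
y_syn = 40 - 3 * X_syn[:, 0] + rng.normal(0, 2, 500)   # made-up response times

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42
)

# Baseline that predicts the training mean for every row
dummy = DummyRegressor(strategy="mean").fit(Xs_train, ys_train)
linear = LinearRegression().fit(Xs_train, ys_train)

dummy_r2 = r2_score(ys_test, dummy.predict(Xs_test))
linear_r2 = r2_score(ys_test, linear.predict(Xs_test))

print("Dummy R2: ", dummy_r2)
print("Linear R2:", linear_r2)
```

A constant prediction can never score above zero on R², so any positive value measures how much of the variance the features actually explain.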


**4.2 Random Forest Regression**

In [None]:
# ---------------------------------------------
# 4.2 RANDOM FOREST REGRESSOR (MAIN MODEL)
# ---------------------------------------------

rf_model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestRegressor(
        n_estimators=200,
        random_state=42,
        n_jobs=-1  # train trees on all CPU cores; results are unchanged
    ))
])

rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

print("Random Forest MAE:", mean_absolute_error(y_test, y_pred_rf))
print("Random Forest RMSE:", np.sqrt(mean_squared_error(y_test, y_pred_rf)))
print("Random Forest R2:", r2_score(y_test, y_pred_rf))



**4.3 Feature importance**

In [None]:
# ---------------------------------------------
# 4.3 FEATURE IMPORTANCE
# ---------------------------------------------

# Get fitted OneHotEncoder
encoder = rf_model.named_steps["preprocessor"].named_transformers_["cat"]

# Get the encoded categorical names
onehot_names = encoder.get_feature_names_out(categorical_features)

# Combine numeric + categorical names
feature_names = numeric_features + list(onehot_names)

# Extract random forest importances
importances = rf_model.named_steps["model"].feature_importances_

# Create DataFrame
fi_df = pd.DataFrame({
    "feature": feature_names,
    "importance": importances
}).sort_values("importance", ascending=False).head(20)

# Plot the 20 most important features
plt.figure(figsize=(10, 6))
sns.barplot(data=fi_df, x="importance", y="feature")
plt.title("Top 20 Feature Importances (Random Forest)")
plt.tight_layout()
plt.show()
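
Impurity-based importances such as `feature_importances_` can favor features with many distinct values. One commonly used cross-check is permutation importance, sketched below on synthetic data with `sklearn.inspection.permutation_importance`. The column names `severity` and `noise` and the formula for `target` are invented for this example; this is a possible complement to the plot above, not the method used in the project.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400
syn = pd.DataFrame({
    "severity": rng.uniform(1, 10, n),  # drives the synthetic target
    "noise": rng.normal(0, 1, n),       # unrelated to the target
})
target = 40 - 3 * syn["severity"] + rng.normal(0, 1, n)

Xp_train, Xp_test, yp_train, yp_test = train_test_split(
    syn, target, test_size=0.25, random_state=0
)
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(Xp_train, yp_train)

# Shuffle one column at a time on held-out data and measure the score drop
result = permutation_importance(
    forest, Xp_test, yp_test, n_repeats=5, random_state=0
)
perm_df = pd.DataFrame({
    "feature": syn.columns,
    "importance": result.importances_mean,
}).sort_values("importance", ascending=False)

print(perm_df)
```

Because the score drop is measured on held-out rows, a feature the forest merely memorized during training receives little credit here.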


# Section 5: Unsupervised Clustering