# Day 2: Earthquake & Tsunami Risk Assessment

## Overview
This notebook builds a machine learning model that predicts tsunami occurrence from earthquake characteristics. It covers feature engineering, Random Forest classification, and model evaluation on seismic data.

## Dataset
- **Source**: Global Earthquake & Tsunami Risk Assessment Dataset
- **Features**: Magnitude, CDI, MMI, Significance, NST, Dmin, Gap, Depth, Coordinates, Year, Month, Tsunami (target)
- **Target**: Binary classification (0 = No Tsunami, 1 = Tsunami)

## Import Required Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)

print("ML Notebook Loaded | Setup Complete")

## Load and Explore Dataset

In [None]:
et_data = pd.read_csv('../data/earthquake_data_tsunami.csv')

display(et_data.head())
print("\n")
print("-" * 40)
et_data.info()  # info() prints its summary directly and returns None, so display() is not needed
print("\n")
print("-" * 40)
display(et_data.describe())
print("\n")
print("-" * 40)
display(et_data.isnull().sum())
print("\n")
print("-" * 40)

## Feature Engineering

Creating additional features to improve model performance:

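1. **Energy**: Magnitude squared, a simple nonlinear proxy for seismic energy release (radiated energy actually grows roughly as 10^(1.5 × magnitude))
2. **Felt vs Measured**: Difference between the MMI and CDI intensity values
3. **Distance Metrics**: Convert dmin (distance to the nearest station, in degrees) to kilometers and derive a proximity score
4. **Gap Normalization**: Rescale azimuthal gap from 0-360 degrees to a 0-1 range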
5. **Depth Categories**: Classify earthquakes by depth (shallow/intermediate/deep)
6. **Seasonal Patterns**: Encode month into seasonal categories
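
The seasonal encoding in item 6 can be checked with a few lines of plain Python. This is a minimal sketch, assuming the `month % 12 // 3 + 1` formula used in the feature engineering cell; the `season_of` helper and the season names are illustrative and not part of the dataset.

```python
def season_of(month: int) -> int:
    """Map a calendar month (1-12) to a season code (1-4)."""
    return month % 12 // 3 + 1

# Illustrative labels only; the notebook stores just the numeric code.
SEASON_NAMES = {1: "Dec-Feb", 2: "Mar-May", 3: "Jun-Aug", 4: "Sep-Nov"}

season_by_month = {m: season_of(m) for m in range(1, 13)}
for month, code in season_by_month.items():
    print(f"month {month:2d} -> season {code} ({SEASON_NAMES[code]})")
```

December wraps around to code 1 because `12 % 12` is 0, which keeps the three winter months in one group.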

In [None]:
# Create engineered features
original_columns = set(et_data.columns)  # remembered so the new columns can be listed below

et_data['energy'] = et_data['magnitude'] ** 2
et_data['felt_vs_measured'] = et_data['mmi'] - et_data['cdi']
et_data['dmin_km'] = et_data['dmin'] * 111  # Convert degrees to km (1 degree is roughly 111 km)
et_data['proximity_score'] = 1 / (et_data['dmin_km'] + 1)
et_data['gap_norm'] = et_data['gap'] / 360

# Depth categorization
# include_lowest=True keeps depth == 0 in the 'shallow' bin;
# depths above 700 km fall outside the bins and become NaN
et_data['depth_category'] = pd.cut(et_data['depth'],
                                   bins=[0, 70, 300, 700],
                                   labels=['shallow', 'intermediate', 'deep'],
                                   include_lowest=True)

# Seasonal patterns: 1 = Dec-Feb, 2 = Mar-May, 3 = Jun-Aug, 4 = Sep-Nov
et_data['season'] = et_data['Month'] % 12 // 3 + 1

# One-hot encode categorical variables
et_data = pd.get_dummies(et_data, columns=['depth_category', 'season'], drop_first=True)

print("Feature Engineering Complete!")
print(f"\nNew shape: {et_data.shape}")
print(f"\nNew columns added: {set(et_data.columns) - original_columns}")
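
How `pd.cut` treats the bin edges matters for the depth categories. By default its intervals are closed on the right only, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. The sketch below illustrates this on a few made-up depths; the values are assumptions for demonstration, not rows from the dataset.

```python
import pandas as pd

# Made-up depths (km) chosen to sit on or beyond the bin edges.
depths = pd.Series([0.0, 10.0, 70.0, 150.0, 650.0, 720.0])
bins = [0, 70, 300, 700]
labels = ["shallow", "intermediate", "deep"]

default_cut = pd.cut(depths, bins=bins, labels=labels)
inclusive_cut = pd.cut(depths, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "depth_km": depths,
    "default": default_cut,
    "include_lowest": inclusive_cut,
})
print(comparison)
```

In both variants a depth above the last edge (720 km here) is left as NaN, which after one-hot encoding with `drop_first=True` looks the same as the dropped baseline category.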

## Correlation Analysis

Visualize relationships between features and tsunami occurrence

In [None]:
plt.figure(figsize=(14, 12))
sns.heatmap(et_data.corr(), cmap='coolwarm', center=0, annot=False, fmt='.2f')
plt.title('Feature Correlation Heatmap', fontsize=16)
plt.tight_layout()
plt.savefig('../viz/correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("Correlation with Tsunami:")
print(et_data.corr()['tsunami'].sort_values(ascending=False))

## Model Training - Random Forest Classifier

### Split Data and Train Initial Model

In [None]:
# Prepare features and target
X = et_data.drop(columns=['tsunami'])
y = et_data['tsunami']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
print(f"\nClass distribution in training set:")
print(y_train.value_counts())
print(f"\nClass distribution in test set:")
print(y_test.value_counts())
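
When the classes are imbalanced, a plain random split can leave the training and test sets with different class ratios. One common option is the `stratify` argument of `train_test_split`. The sketch below uses made-up labels (`X_demo`, `y_demo` are hypothetical) to show that the ratio is preserved in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 negatives and 20 positives.
y_demo = pd.Series([0] * 80 + [1] * 20)
X_demo = pd.DataFrame({"feature": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive share in train: {y_tr.mean():.2f}")
print(f"Positive share in test:  {y_te.mean():.2f}")
```

Applied to this notebook, that would mean passing `stratify=y` in the split above; whether it changes the results noticeably depends on how imbalanced the tsunami label is.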

In [None]:
# Train Random Forest model
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Make predictions
y_pred = model.predict(X_test)

# Evaluation
print("=" * 60)
print("RANDOM FOREST CLASSIFIER - INITIAL MODEL (200 estimators)")
print("=" * 60)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

accuracy = (y_pred == y_test).mean()
print(f"\nOverall Accuracy: {accuracy:.2%}")

## Feature Importance Analysis

In [None]:
# Get feature importances
importances = pd.Series(model.feature_importances_, index=X.columns)
top_10_features = importances.sort_values(ascending=False).head(10)

print("Top 10 Most Important Features:")
print(top_10_features)

# Visualize feature importance
plt.figure(figsize=(10, 6))
importances.sort_values(ascending=True).tail(10).plot(kind='barh', color='steelblue')
plt.title('Top 10 Feature Importances for Tsunami Prediction', fontsize=14)
plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.tight_layout()
plt.savefig('../viz/feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n⚠️ NOTE: 'Year' of occurrence ranks among the most important features.")
print("This more likely reflects how tsunami flags were recorded over time than any physical trend;")
print("feature importance alone does not establish causation.")
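
Impurity-based scores such as `feature_importances_` are computed on the training data and can overstate features with many distinct values, so a ranking like this is worth cross-checking. One option is scikit-learn's `permutation_importance` on held-out data. The sketch below runs on synthetic data from `make_classification`; all `*_demo` names and the feature labels are assumptions for illustration, not the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the earthquake dataset.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)

# Shuffle each feature on the held-out set and measure the drop in score.
result = permutation_importance(rf, X_te, y_te, n_repeats=5, random_state=0)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: impurity={rf.feature_importances_[idx]:.3f}  "
          f"permutation={result.importances_mean[idx]:.3f}")
```

A feature that ranks high on impurity but near zero on permutation importance is a candidate for closer inspection before drawing conclusions from it.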

## Enhanced Model with Feature Scaling

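Standardize the features with StandardScaler and retrain with 300 estimators. Random Forests split on per-feature thresholds and are largely insensitive to feature scaling, so any change in performance here is more likely due to the larger number of trees than to the scaling itself.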

In [None]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train enhanced model
model_scaled = RandomForestClassifier(n_estimators=300, random_state=42)
model_scaled.fit(X_train_scaled, y_train)

# Make predictions
y_pred_scaled = model_scaled.predict(X_test_scaled)

# Evaluation
print("=" * 60)
print("RANDOM FOREST CLASSIFIER - SCALED MODEL (300 estimators)")
print("=" * 60)
print("\nClassification Report:")
print(classification_report(y_test, y_pred_scaled))

accuracy_scaled = (y_pred_scaled == y_test).mean()
print(f"\nOverall Accuracy: {accuracy_scaled:.2%}")
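
To see how much scaling alone matters for a Random Forest, the two variants can be trained with the same number of trees and the same seed. The sketch below does this on synthetic data; the `*_demo` names and the inflated first column are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one feature on a much larger scale.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_demo[:, 0] *= 1000.0
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

scaler_demo = StandardScaler().fit(X_tr)

# Same hyperparameters and seed; only the input scaling differs.
rf_raw = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
rf_scaled = RandomForestClassifier(n_estimators=100, random_state=0).fit(
    scaler_demo.transform(X_tr), y_tr
)

pred_raw = rf_raw.predict(X_te)
pred_scaled = rf_scaled.predict(scaler_demo.transform(X_te))
agreement = np.mean(pred_raw == pred_scaled)
print(f"Share of identical predictions: {agreement:.2%}")
```

Keeping the scaler is still harmless, and it becomes useful if the pipeline is later extended with scale-sensitive models such as logistic regression or k-nearest neighbours.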

## Confusion Matrix Visualization

In [None]:
# Generate confusion matrix
cm = confusion_matrix(y_test, y_pred_scaled)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", cbar=True,
            xticklabels=['No Tsunami', 'Tsunami'],
            yticklabels=['No Tsunami', 'Tsunami'])
plt.xlabel("Predicted", fontsize=12)
plt.ylabel("Actual", fontsize=12)
plt.title('Confusion Matrix - Tsunami Prediction Model', fontsize=14)
plt.tight_layout()
plt.savefig('../viz/confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

# Calculate additional metrics
tn, fp, fn, tp = cm.ravel()
print(f"\nConfusion Matrix Breakdown:")
print(f"True Negatives (Correctly predicted No Tsunami): {tn}")
print(f"False Positives (Incorrectly predicted Tsunami): {fp}")
print(f"False Negatives (Missed Tsunami): {fn}")
print(f"True Positives (Correctly predicted Tsunami): {tp}")
print(f"\nFalse Positive Rate: {fp/(fp+tn):.2%}")
print(f"False Negative Rate: {fn/(fn+tp):.2%}")
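
The rates printed above follow directly from the four confusion matrix counts. As a worked example, the sketch below computes them from hypothetical counts; the numbers are made up for illustration and are not results from this notebook, and `rates_from_counts` is an illustrative helper.

```python
def rates_from_counts(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive common rates from binary confusion matrix counts.

    Assumes at least one actual positive, one actual negative,
    and one predicted positive, so no denominator is zero.
    """
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }

# Hypothetical counts for illustration only.
metrics = rates_from_counts(tn=80, fp=10, fn=5, tp=45)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Recall and the false negative rate always sum to 1. For a warning-style use case the false negative rate (missed tsunamis) is usually the number to watch most closely.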

## Model Testing with New Data

Test the model on a hypothetical earthquake scenario

In [None]:
# Create test scenario
new_data = pd.DataFrame([{
    'magnitude': 7.2,
    'cdi': 5,
    'mmi': 6,
    'sig': 700,
    'nst': 120,
    'dmin': 0.3,
    'gap': 100,
    'depth': 50,
    'latitude': 38.322,
    'longitude': 142.369,
    'Year': 2024,
    'Month': 3,
    'energy': 7.2**2,
    'felt_vs_measured': 6 - 5,
    'dmin_km': 0.3 * 111,
    'proximity_score': 1 / (0.3*111 + 1),
    'gap_norm': 100 / 360,
    'depth_category_intermediate': 0,
    'depth_category_deep': 0,
    'season_2': 1,  # Month 3 maps to season 2 (3 % 12 // 3 + 1)
    'season_3': 0,
    'season_4': 0
}])

# Align columns with training data
new_data = new_data.reindex(columns=X.columns, fill_value=0)

# Scale the new data
new_data_scaled = scaler.transform(new_data)

# Make prediction
pred = model_scaled.predict(new_data_scaled)[0]
pred_proba = model_scaled.predict_proba(new_data_scaled)[0]

print("="*60)
print("TSUNAMI RISK PREDICTION FOR TEST SCENARIO")
print("="*60)
print("\nEarthquake Parameters:")
print(f"  Magnitude: 7.2")
print(f"  Depth: 50 km (Shallow)")
print(f"  Location: 38.322°N, 142.369°E (Near Japan)")
print(f"  Year: 2024, Month: March")
print(f"  Intensity (MMI): 6, Felt (CDI): 5")
print(f"\nPrediction: {'⚠️ TSUNAMI LIKELY!' if pred == 1 else '✓ No tsunami expected.'}")
print(f"\nPrediction Probabilities:")
print(f"  No Tsunami: {pred_proba[0]:.2%}")
print(f"  Tsunami: {pred_proba[1]:.2%}")
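
Typing derived features by hand is error-prone, because each one has to match the training-time formula exactly. A small helper can derive them from the raw inputs instead. The sketch below is one possible shape for such a helper; `build_feature_row` is a hypothetical name, and it assumes the same formulas, depth bins, and dummy column names as the feature engineering cell.

```python
def build_feature_row(magnitude, cdi, mmi, sig, nst, dmin, gap, depth,
                      latitude, longitude, year, month):
    """Build one model input row, deriving engineered features from raw values."""
    season = month % 12 // 3 + 1
    dmin_km = dmin * 111  # degrees to km (approx)
    row = {
        'magnitude': magnitude, 'cdi': cdi, 'mmi': mmi, 'sig': sig, 'nst': nst,
        'dmin': dmin, 'gap': gap, 'depth': depth,
        'latitude': latitude, 'longitude': longitude,
        'Year': year, 'Month': month,
        'energy': magnitude ** 2,
        'felt_vs_measured': mmi - cdi,
        'dmin_km': dmin_km,
        'proximity_score': 1 / (dmin_km + 1),
        'gap_norm': gap / 360,
        # 'shallow' is the dropped baseline category, so it has no column.
        'depth_category_intermediate': int(70 < depth <= 300),
        'depth_category_deep': int(300 < depth <= 700),
    }
    # Season 1 is the dropped baseline category.
    for code in (2, 3, 4):
        row[f'season_{code}'] = int(season == code)
    return row

row = build_feature_row(magnitude=7.2, cdi=5, mmi=6, sig=700, nst=120,
                        dmin=0.3, gap=100, depth=50,
                        latitude=38.322, longitude=142.369,
                        year=2024, month=3)
print({k: v for k, v in row.items() if k.startswith(('season_', 'depth_category_'))})
```

The resulting dict can be wrapped in `pd.DataFrame([row])` and reindexed against `X.columns` as in the cell above. For this scenario, month 3 maps to season code 2 under the training formula.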

## Model Persistence

Save the trained model and scaler for future use

In [None]:
import joblib

# Save model and scaler
joblib.dump(model_scaled, '../models/tsunami_rf_model.joblib')
joblib.dump(scaler, '../models/scaler.joblib')

print("Model and scaler saved successfully!")
print("  - ../models/tsunami_rf_model.joblib")
print("  - ../models/scaler.joblib")
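
Saved artifacts are only useful if they reload and predict the same way. The sketch below shows a round trip with `joblib.dump` and `joblib.load` in a temporary directory. It uses a small synthetic model (`model_demo`, `scaler_demo` are stand-ins), not the files written above.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Small synthetic stand-ins for the trained scaler and model.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
scaler_demo = StandardScaler().fit(X_demo)
model_demo = RandomForestClassifier(n_estimators=20, random_state=0).fit(
    scaler_demo.transform(X_demo), y_demo
)

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.joblib"
    scaler_path = Path(tmp) / "scaler.joblib"
    joblib.dump(model_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)

    # Reload both artifacts; the scaler must be applied before the model.
    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

same = np.array_equal(
    model_demo.predict(scaler_demo.transform(X_demo)),
    loaded_model.predict(loaded_scaler.transform(X_demo)),
)
print(f"Reloaded model reproduces predictions: {same}")
```

Files written by joblib are pickles, so they should be loaded only from trusted sources and ideally with the same scikit-learn version that produced them.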

## Summary

### Model Performance
- **Algorithm**: Random Forest Classifier (300 estimators)
- **Accuracy**: roughly 80-90% on the held-out test set (the exact figure is printed by the evaluation cells)
- **Key Features**: Year, Magnitude, Energy, Depth, Proximity

### Key Findings
1. **Temporal Trends**: Year is a significant predictor, which more likely reflects changes in how events were recorded over time than a physical cause such as climate change
2. **Magnitude Matters**: Higher magnitude earthquakes pose greater tsunami risk
3. **Depth Factor**: Shallow earthquakes more likely to trigger tsunamis
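4. **Proximity**: The proximity score contributes to the prediction; note that it is derived from dmin, the distance to the nearest station, not the distance to the coast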

### Model Applications
- Early warning systems
- Risk assessment for coastal communities
- Disaster preparedness planning
- Insurance risk modeling