In [1]:
import os

# Data analysis libraries
import pandas as pd
import numpy as np

# Machine learning libraries
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib
from xgboost import XGBClassifier

# Model Evaluation
from sklearn.metrics import mean_squared_error, accuracy_score, classification_report, confusion_matrix

In [2]:
MODEL_FILE = "earthquake_tsunami_XGBClassifier.pkl"
PIPELINE_FILE = "pipeline.pkl"

In [3]:
def earthquake_tsunami_data_pipeline(
    train_data: str, valid_data: str, test_data: str
) -> tuple[pd.DataFrame, pd.Series, pd.DataFrame, pd.Series]:
    """
    Load the train, validation, and test CSVs and split them into features and labels.

    The training and validation sets are concatenated so the final model is
    trained on both; the test set is kept separate for evaluation.

    Parameters:
    train_data (str): Path to the training set CSV.
    valid_data (str): Path to the validation set CSV.
    test_data (str): Path to the test set CSV.

    Returns:
    tuple: (training_features, training_labels, testing_features, testing_labels),
        where the features are DataFrames without the "Tsunami" column and the
        labels are the "Tsunami" Series.
    """
    # Load datasets
    train_df = pd.read_csv(train_data)
    valid_df = pd.read_csv(valid_data)
    test_df = pd.read_csv(test_data)

    # Combine training and validation data to train the model
    training_data = pd.concat([train_df, valid_df], ignore_index=True)

    # Separate features and labels for the combined training data
    training_features = training_data.drop(columns=["Tsunami"])
    training_labels = training_data["Tsunami"].copy()

    # Separate features and labels for the test data
    testing_features = test_df.drop(columns=["Tsunami"])
    testing_labels = test_df["Tsunami"].copy()

    return training_features, training_labels, testing_features, testing_labels

In [4]:
# Load data
train_features, train_labels, test_features, test_labels = earthquake_tsunami_data_pipeline(
    '../data/processed/train_set.csv',
    '../data/processed/valid_set.csv',
    '../data/processed/test_set.csv',
)

In [5]:
# Create the models folder if it does not exist yet
os.makedirs('../models/', exist_ok=True)

# Train and save the model only if no saved model exists; otherwise load it
if not os.path.exists(f'../models/{MODEL_FILE}'):

    # Define the XGBoost Classifier with tuned hyperparameters
    xgb_classifier = XGBClassifier(
        eval_metric='logloss',
        learning_rate=0.01,
        max_depth=3,
        n_estimators=200,
        subsample=0.8,
    )

    # Train the model on the combined training and validation data
    xgb_classifier.fit(train_features, train_labels)

    # Save the trained model
    joblib.dump(xgb_classifier, f'../models/{MODEL_FILE}')
else:
    # Load the previously saved model
    xgb_classifier = joblib.load(f'../models/{MODEL_FILE}')

In [6]:
# Make predictions on the test set
xgb_clf_prediction = xgb_classifier.predict(test_features)

In [7]:
# Evaluate the model on the held-out test set
print("Classification Report for XGBoost Classifier on Test Set:")
print(classification_report(test_labels, xgb_clf_prediction))
print("=" * 60)

print("\nConfusion Matrix for XGBoost Classifier:")
print(confusion_matrix(test_labels, xgb_clf_prediction))
print("=" * 60)

accuracy = accuracy_score(test_labels, xgb_clf_prediction)
print(f"\nTest Accuracy for XGBoost Classifier: {accuracy:.4f}")
print("=" * 60)

# With 0/1 labels and 0/1 predictions, MSE equals the misclassification rate (1 - accuracy)
xgb_mse = mean_squared_error(test_labels, xgb_clf_prediction)
print(f"\nXGBoost MSE: {xgb_mse:.4f}")
print("=" * 60)

Classification Report for XGBoost Classifier on Test Set:
              precision    recall  f1-score   support

           0       1.00      0.90      0.95        96
           1       0.86      1.00      0.92        61

    accuracy                           0.94       157
   macro avg       0.93      0.95      0.93       157
weighted avg       0.95      0.94      0.94       157


Confusion Matrix for XGBoost Classifier:
[[86 10]
 [ 0 61]]

Test Accuracy for XGBoost Classifier: 0.9363

XGBoost MSE: 0.0637
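
The figures in the report can be re-derived by hand from the confusion matrix printed above, which is a quick way to confirm how the matrix is read (in scikit-learn's convention, rows are true classes and columns are predicted classes). The sketch below simply hard-codes the four counts from that output; the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above.
# scikit-learn convention: rows = true class, columns = predicted class.
tn, fp = 86, 10   # true class 0: predicted 0, predicted 1
fn, tp = 0, 61    # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision_1 = tp / (tp + fp)   # of the predicted tsunamis, how many were real
recall_1 = tp / (tp + fn)      # of the real tsunamis, how many were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# For 0/1 labels and 0/1 predictions each squared error is 0 or 1,
# so the MSE is just the misclassification rate.
mse = (fp + fn) / total

print(f"accuracy={accuracy:.4f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f} mse={mse:.4f}")
```

All ten errors are false positives (non-tsunami events flagged as tsunamis); no actual tsunami in the test set was missed.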


#### **I. Executive Summary**
---
This project developed a machine learning model to predict whether a tsunami follows a strong earthquake. Using a dataset of 782 seismic events, an XGBoost Classifier was trained and tuned, achieving **about 94% accuracy** (0.9363) on unseen test data. The analysis and the final model both indicated that the most critical predictors are not just magnitude, but more specifically the earthquake's **depth**, **year of occurrence**, and **geographic location (offshore vs. land-based)**. This work provides a strong proof of concept for an automated tool to aid early-warning systems.

#### **II. Problem Statement**
---
The objective was to analyze historical earthquake data to identify the key factors that lead to tsunamis and build a reliable classification model to predict the `Tsunami` outcome (1 or 0) for a given seismic event. The primary challenge was to move beyond the general knowledge that "strong earthquakes cause tsunamis" and identify more nuanced, predictive patterns in the data.

#### **III. Key Findings from Exploratory Data Analysis (EDA)**
---
The initial EDA was crucial and revealed several key insights that guided the modeling process:
* **Imbalance:** The dataset was imbalanced, with only 39% of earthquakes resulting in a tsunami.
* **Location is Key:** Tsunami-generating quakes were found to cluster in specific **offshore, oceanic regions near subduction zones**, while non-tsunami events were more broadly distributed.
* **Depth Over Magnitude:** While all events in the dataset were strong (magnitude 6.5+), the most significant differentiator was **shallow depth**.
* **Temporal Trend:** A clear increase in recorded tsunami-triggering events was observed **after 2012**.
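
The class balance and the depth contrast described above are quick to check with pandas. The following is a minimal sketch, assuming the data sits in a DataFrame with the `Tsunami` and `Depth_(km)` columns used elsewhere in this notebook. The toy values are invented purely so the snippet runs; they are not taken from the project's dataset.

```python
import pandas as pd

# Toy stand-in for the real dataset (values are invented for illustration only).
toy = pd.DataFrame({
    "Tsunami":    [1, 0, 0, 1, 0, 0, 1, 0, 0, 0],
    "Depth_(km)": [10.0, 35.0, 120.0, 15.0, 60.0, 250.0, 20.0, 45.0, 33.0, 500.0],
})

# Share of events labelled as tsunamis (the mean of a 0/1 column is the positive rate).
tsunami_share = toy["Tsunami"].mean()

# Median depth per class; the median is used because depth is highly skewed.
median_depth = toy.groupby("Tsunami")["Depth_(km)"].median()

print(f"tsunami share: {tsunami_share:.0%}")
print(median_depth)
```

Run against the real training data, the same two lines would show the class share and whether tsunami events are in fact shallower at the median.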

#### **IV. Modeling and Evaluation**
---
1.  **Data Preparation:** A robust **stratified sampling strategy** was employed, creating training, validation, and test sets that preserved the distribution of both the imbalanced `Tsunami` target and the highly skewed `Depth_(km)` feature.
2.  **Model Selection:** An **XGBoost Classifier** was chosen for its high performance on tabular data and its ability to capture complex, non-linear interactions between features.
3.  **Training:** The final model was trained on a combined set of the training and validation data to maximize learning before the final evaluation.
4.  **Performance:** On the unseen test set (157 events), the model performed well:
    * **Accuracy:** 93.6%
    * **F1-Score (Tsunami=1):** 0.92
    * **Recall (Tsunami=1):** 1.00 (all 61 actual tsunamis were identified)
    * **Precision (Tsunami=1):** 0.86 (61 of the 71 predicted tsunamis were correct; all 10 errors were false alarms)
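
Stratifying on a class label and a continuous feature at the same time, as described in step 1, is commonly done by binning the continuous feature and stratifying on a combined key. The sketch below illustrates that idea and is not the project's actual split code: the helper name `stratified_three_way_split`, the depth bin edges, the 60/20/20 proportions, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_three_way_split(df, target="Tsunami", depth="Depth_(km)",
                               depth_bins=(0, 30, 70, np.inf), seed=42):
    """Split df into train/valid/test (60/20/20), stratified on target and binned depth."""
    # Combine the class label and a coarse depth bin into one stratification key.
    depth_bin = pd.cut(df[depth], bins=list(depth_bins), labels=False, include_lowest=True)
    key = df[target].astype(str) + "_" + depth_bin.astype(str)

    # Carve out the 20% test set first, then split the remainder 75/25 (60/20 overall).
    rest, test = train_test_split(df, test_size=0.20, stratify=key, random_state=seed)
    train, valid = train_test_split(rest, test_size=0.25, stratify=key.loc[rest.index],
                                    random_state=seed)
    return train, valid, test


# Synthetic data (NOT the earthquake dataset): skewed depths and roughly 39% positives.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Depth_(km)": rng.exponential(scale=60.0, size=1000),
    "Tsunami": rng.binomial(1, 0.39, size=1000),
})

train, valid, test = stratified_three_way_split(demo)
print(len(train), len(valid), len(test))
print(demo["Tsunami"].mean(), test["Tsunami"].mean())
```

Each combined key needs enough rows to appear in every split, so very fine depth bins can make `train_test_split` raise an error on small datasets; coarser bins avoid that.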

#### **V. Final Conclusion and Future Work**
---
This project demonstrates that machine learning can effectively predict tsunami risk from earthquake data. The final XGBoost model is both accurate and reliable, confirming that the combination of an earthquake's depth, location, and recency is a stronger predictor than magnitude alone.

**Future Work:**
* **Deployment:** The saved model (`.pkl` file) is ready to be deployed as part of a larger application or API for real-time predictions.
* **Feature Engineering:** Incorporate additional data, such as distance to the nearest coastline or tectonic plate boundary data, to potentially improve performance further.
* **Explainability:** Use tools like SHAP (SHapley Additive exPlanations) to provide clear, interpretable reasons for each individual prediction, which is critical for trust in real-world warning systems.