# Machine Learning Earthquake Alert Prediction

## Introduction / Background

Earthquake early warning systems are critical for reducing the impact of destructive seismic events. These systems provide time-sensitive alerts that give governments and communities a short but important window to prepare and respond. Traditional alert systems often rely on simple magnitude thresholds or basic seismic network rules, which can miss important context about how strongly an earthquake will be felt.

Recent research has shown that using richer seismic features - such as shaking intensity, depth, and community-reported effects - can significantly improve the accuracy of early warning systems. Machine learning methods, including ensemble models and neural networks, have been successfully applied to earthquake records and have been shown to outperform simple threshold-based approaches by better capturing the complex relationships among seismic features.

## Problem Definition

Earthquakes often strike without warning and can cause severe damage with little time to react. Communities currently rely on early warning systems to issue alerts, but many of these systems depend on simple magnitude thresholds or limited seismic indicators. These approaches can lead to inaccurate alerts because they do not capture the full complexity of earthquake behavior.

In this project, we aim to build machine learning models that classify earthquake alert levels into four categories - Green, Yellow, Orange, Red - using a richer set of seismic features:

- Magnitude  
- Depth  
- CDI (community-reported intensity)  
- MMI (instrumental intensity)  
- Significance score (combined impact measure)

Our goal is to design models that improve the accuracy and reliability of earthquake alerts compared to simple magnitude-based rules. We define several quantitative performance targets for the models:

- Macro F1 ≥ 0.75  
- Balanced Accuracy ≥ 0.80  
- Macro ROC-AUC ≥ 0.85  
- Brier Score ≤ 0.12  
- Recall for the Red class ≥ 0.80 (most critical for life safety)
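
The targets above can be turned into a small pass/fail check. The sketch below is illustrative only: the helper name `check_targets`, the dictionary keys, and the example metric values are assumptions made for this example, not part of the project code.

```python
# Project targets from the list above; the key names are illustrative.
TARGETS = {
    "macro_f1": (">=", 0.75),
    "balanced_accuracy": (">=", 0.80),
    "macro_roc_auc": (">=", 0.85),
    "brier": ("<=", 0.12),
    "red_recall": (">=", 0.80),
}


def check_targets(metrics):
    """Return {metric_name: bool} saying whether each target is met."""
    results = {}
    for name, (direction, threshold) in TARGETS.items():
        value = metrics[name]
        if direction == ">=":
            results[name] = value >= threshold
        else:  # "<=" means lower is better (Brier score)
            results[name] = value <= threshold
    return results


# Hypothetical metric values, used only to exercise the helper.
example_metrics = {
    "macro_f1": 0.78,
    "balanced_accuracy": 0.79,
    "macro_roc_auc": 0.90,
    "brier": 0.10,
    "red_recall": 0.85,
}
status = check_targets(example_metrics)
print(status)
```

In this made-up example every target passes except balanced accuracy, which falls just short of 0.80.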

## Methods

### Dataset

We use the *Earthquake Alert Prediction* dataset from Kaggle, which contains 1,300 earthquake records. Each record includes:

- magnitude - event magnitude (Richter scale)  
- depth - hypocenter depth in kilometers  
- cdi - maximum reported community intensity  
- mmi - maximum estimated instrumental intensity  
- sig - “significance” score reflecting magnitude, intensity, reports, and impact  
- alert - target alert level: Green, Yellow, Orange, or Red

This dataset is already balanced across the four alert categories.

### Data Preprocessing

Our preprocessing pipeline standardizes and stabilizes the feature distributions before training:

1. Distribution stabilization (PowerTransformer)  
   - cdi and mmi are heavily skewed. We apply PowerTransformer to reduce skew and make these features more Gaussian-like, which helps linear models and distance-based methods learn better boundaries.

2. Scale normalization (StandardScaler)  
   - We standardize all features so that magnitude, depth, and significance share a common scale. This prevents any single feature from dominating the loss function.

3. Class balance (RandomOverSampler)  
   - Although our dataset is already balanced in counts, we demonstrate the use of RandomOverSampler and later include sampling inside model pipelines to make sure each alert level is treated fairly during training.
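
As a rough illustration of steps 1 and 2, the sketch below applies PowerTransformer and StandardScaler to a synthetic right-skewed feature. The lognormal data, the `sample_skew` helper, and the `_demo` variable names are assumptions for this example; it fits both transformers on the training split only, which is one common way to keep test data out of the fitted statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PowerTransformer, StandardScaler


def sample_skew(values):
    """Sample skewness: mean of the cubed z-scores."""
    z = (values - values.mean()) / values.std()
    return float(np.mean(z ** 3))


rng = np.random.default_rng(0)
# Synthetic right-skewed feature standing in for cdi/mmi (not the real data).
x_demo = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

x_train_demo, x_test_demo = train_test_split(x_demo, test_size=0.2, random_state=42)

pt_demo = PowerTransformer()      # Yeo-Johnson by default
scaler_demo = StandardScaler()

# Fit on the training split, then reuse the fitted parameters on the test split.
x_train_t = scaler_demo.fit_transform(pt_demo.fit_transform(x_train_demo))
x_test_t = scaler_demo.transform(pt_demo.transform(x_test_demo))

skew_before = sample_skew(x_train_demo[:, 0])
skew_after = sample_skew(x_train_t[:, 0])
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```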

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import StratifiedKFold, GridSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PowerTransformer, StandardScaler, PolynomialFeatures, LabelEncoder, label_binarize
from sklearn.metrics import ConfusionMatrixDisplay, classification_report, confusion_matrix, f1_score, balanced_accuracy_score, roc_auc_score, brier_score_loss, roc_curve, auc, precision_recall_curve, average_precision_score
from xgboost import XGBClassifier
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import RandomOverSampler
from pathlib import Path

In [None]:
for p in ["../data/sample/dataset_sample.csv",
          "../data/raw/earthquake_alert_balanced_dataset.csv"]:
    if Path(p).exists():
        df = pd.read_csv(p)
        break
else:
    raise FileNotFoundError("Place sample in data/sample/ or full file in data/raw/")

print(df.shape)
df.head()

In [None]:
df.info()
df.describe()
df.isnull().sum()

In [None]:
X = df[['magnitude', 'depth', 'cdi', 'mmi', 'sig']]
y = df['alert']

In [None]:
pt = PowerTransformer()
df[['cdi', 'mmi']] = pt.fit_transform(df[['cdi', 'mmi']])

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(df['cdi'], ax=axes[0], kde=True)
axes[0].set_title('CDI after PowerTransform')
sns.histplot(df['mmi'], ax=axes[1], kde=True)
axes[1].set_title('MMI after PowerTransform')
plt.show()

In [None]:
df_original = pd.read_csv("../data/earthquake_alert_balanced_dataset.csv")

fig, axes = plt.subplots(2, 2, figsize=(10, 8))

sns.histplot(df_original['cdi'], ax=axes[0, 0], kde=True)
axes[0, 0].set_title('CDI Before PowerTransform')

sns.histplot(df_original['mmi'], ax=axes[0, 1], kde=True)
axes[0, 1].set_title('MMI Before PowerTransform')

sns.histplot(df['cdi'], ax=axes[1, 0], kde=True)
axes[1, 0].set_title('CDI After PowerTransform')

sns.histplot(df['mmi'], ax=axes[1, 1], kde=True)
axes[1, 1].set_title('MMI After PowerTransform')

plt.tight_layout()
plt.show()

In [None]:
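# Rebuild X from df so the scaler sees the power-transformed cdi/mmi columns
X = df[['magnitude', 'depth', 'cdi', 'mmi', 'sig']]
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)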

In [None]:
pd.DataFrame(X_scaled, columns=X.columns).head()

In [None]:
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X_scaled, y)

print("Before oversampling:\n", y.value_counts(), "\n")
print("After oversampling:\n", pd.Series(y_resampled).value_counts())

### Models

#### Logistic Regression

Our baseline model is multinomial logistic regression, which assumes linear decision boundaries between alert levels. Logistic regression is:

- Interpretable - coefficients directly show how each feature influences the log-odds of each alert category.
- Efficient - fast to train and evaluate, making it a strong starting point before more complex methods.
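
To make the interpretability point concrete, here is a minimal sketch of how the fitted coefficients can be laid out as a class-by-feature table. It uses toy data from `make_classification`, with the project's feature names borrowed purely as column labels, so the numbers it prints say nothing about real earthquakes.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

feature_names = ["magnitude", "depth", "cdi", "mmi", "sig"]

# Toy 4-class problem with 5 features (synthetic, for illustration only).
X_toy, y_toy = make_classification(
    n_samples=400,
    n_features=5,
    n_informative=4,
    n_redundant=1,
    n_classes=4,
    n_clusters_per_class=1,
    random_state=0,
)
X_toy = StandardScaler().fit_transform(X_toy)

clf_demo = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# One row per class, one column per feature: each entry is the change in that
# class's log-odds score per one-standard-deviation increase in the feature.
coef_table = pd.DataFrame(clf_demo.coef_, index=clf_demo.classes_, columns=feature_names)
print(coef_table.round(2))
```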

We use a pipeline that includes:
- PowerTransformer on cdi and mmi
- Optional polynomial features (degree 2) to capture pairwise interactions
- Standardized features
- Optional RandomOverSampler for balancing
- GridSearchCV over regularization strength C and class_weight to optimize macro F1.

In [None]:
# Training/Testing split
df = pd.read_csv("../data/earthquake_alert_balanced_dataset.csv")
N = df[['magnitude', 'depth', 'cdi', 'mmi', 'sig']]
D = df['alert']

NTrain, NTest, DTrain, DTest = train_test_split(
    N, D, test_size=0.20, stratify=D, random_state=42
)

# Building the Pipeline
features = ['magnitude', 'depth', 'cdi', 'mmi', 'sig']

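# Note: "pass" forwards all five raw features, so cdi and mmi reach the model
# twice (power-transformed and raw).
preprocess = ColumnTransformer(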
    transformers=[
        ("power", PowerTransformer(), ['cdi', 'mmi']),
        ("pass", 'passthrough', features),
    ],
    remainder='drop',
)

pipeline = Pipeline(steps=[
    ("preprocess", preprocess),
    ("poly", "passthrough"),
    ("scale", StandardScaler()),
    ("sampler", "passthrough"),
    ("model", LogisticRegression(solver="lbfgs", max_iter=1000)),
])

# Training and Evaluating
parameters = [{
    "poly": ["passthrough", PolynomialFeatures(degree=2, include_bias=False)],
    "sampler": ["passthrough", RandomOverSampler(random_state=42)],
    "model__C": [0.25, 0.5, 1.0, 2.0, 4.0],
    "model__class_weight": [None, "balanced"],
}]

cross = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    estimator=pipeline,
    param_grid=parameters,
    scoring="f1_macro",
    cv=cross,
    n_jobs=-1,
    verbose=1,
)

grid.fit(NTrain, DTrain)
best_model = grid.best_estimator_

y_pred = best_model.predict(NTest)

print("\nClassification report:")
print(classification_report(DTest, y_pred, digits=3))

#### Random Forest

Our second model is a Random Forest classifier, which aggregates many decision trees trained on bootstrapped samples. Random Forests:

- Capture non-linear relationships between features such as magnitude, depth, and significance.
- Are robust to noise and reduce overfitting through averaging.

We reuse the same preprocessing idea (PowerTransformer on cdi/mmi, StandardScaler on all features), then perform a grid search over:

- Number of trees (n_estimators)
- Maximum depth
- Split and leaf sizes
- Optional class_weight settings

using 5-fold StratifiedKFold and GridSearchCV, optimizing macro F1.
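
A quick way to gauge the cost of this search is to count the candidate settings. The sketch below assumes the grid values used in the Random Forest training cell and uses scikit-learn's `ParameterGrid` to enumerate them; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same values as the Random Forest grid in the training cell.
rf_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "class_weight": [None, "balanced"],
}

n_candidates = len(ParameterGrid(rf_grid))   # 3 * 4 * 2 * 2 * 2
n_folds = 5
n_fits = n_candidates * n_folds              # one fit per candidate per fold

print(n_candidates, n_fits)  # 96 480
```

GridSearchCV then refits the best candidate once more on the full training split, since `refit=True` is its default.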

In [None]:
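df = pd.read_csv("../data/earthquake_alert_balanced_dataset.csv")
X = df[['magnitude', 'depth', 'cdi', 'mmi', 'sig']].copy()
y = df['alert']

# Split first so the transformers are fit on training data only (no test leakage)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_test = X_train.copy(), X_test.copy()

# Power transform CDI/MMI using parameters learned from the training split
pt = PowerTransformer()
X_train[['cdi', 'mmi']] = pt.fit_transform(X_train[['cdi', 'mmi']])
X_test[['cdi', 'mmi']] = pt.transform(X_test[['cdi', 'mmi']])

# Scale all features, again fitting on the training split only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)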

rf = RandomForestClassifier(random_state=42)

params = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 5, 10, 20],
    "min_samples_split": [2, 5],
    "min_samples_leaf": [1, 2],
    "class_weight": [None, "balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_rf = GridSearchCV(
    rf,
    params,
    scoring="f1_macro",
    cv=cv,
    n_jobs=-1,
    verbose=1,
)

grid_rf.fit(X_train, y_train)
best_rf = grid_rf.best_estimator_

y_pred_rf = best_rf.predict(X_test)

print("Best params:", grid_rf.best_params_)
print("\nClassification Report:\n", classification_report(y_test, y_pred_rf, digits=3))

#### Gradient Boosting (XGBoost)

Our third model is a gradient boosting ensemble implemented with XGBoost. It builds many shallow trees sequentially, where each tree focuses on correcting errors made by previous trees. This approach is well-suited to structured tabular data and can model complex interactions between seismic features.
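
The idea of each tree correcting its predecessors' errors can be sketched in a few lines. The example below is a deliberately simplified squared-error regression booster built from scikit-learn decision trees on synthetic data. It is not XGBoost's algorithm (which adds regularization, second-order gradients, and a multiclass objective), and all names in it are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
# Synthetic 1-D regression problem (illustration only).
X_toy = rng.uniform(-3, 3, size=(200, 1))
y_toy = np.sin(X_toy[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_toy, y_toy.mean())   # start from a constant model
mse_history = [float(np.mean((y_toy - prediction) ** 2))]

for _ in range(50):
    residuals = y_toy - prediction               # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_toy, residuals)                   # next tree targets those errors
    prediction += learning_rate * tree.predict(X_toy)
    mse_history.append(float(np.mean((y_toy - prediction) ** 2)))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

The learning rate shrinks each tree's contribution, which is the same role `learning_rate=0.1` plays in the XGBoost configuration.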

We use the same preprocessing pipeline as logistic regression:
- PowerTransformer on cdi and mmi
- Polynomial features (degree 2) to model pairwise feature interactions
- StandardScaler on all features
- RandomOverSampler to keep alert classes balanced

We train an XGBoost classifier with:
- 200 trees
- Maximum depth 4
- Learning rate 0.1
- multi:softprob objective for 4 alert classes

and evaluate it on the same train-test split as the logistic regression model.
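
Because the XGBoost scikit-learn wrapper expects integer class labels, the string alert levels are encoded first. The sketch below shows the mapping LabelEncoder produces, assuming the labels are capitalized as written in this report. The `_demo` names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Assumed label spelling; the dataset's actual strings may differ in case.
alert_levels = ["Green", "Yellow", "Orange", "Red"]

le_demo = LabelEncoder().fit(alert_levels)

# LabelEncoder sorts classes alphabetically, not by severity.
mapping = {
    label: int(code)
    for label, code in zip(le_demo.classes_, le_demo.transform(le_demo.classes_))
}
print(mapping)  # {'Green': 0, 'Orange': 1, 'Red': 2, 'Yellow': 3}
```

The alphabetical order explains why per-class results are listed as Green, Orange, Red, Yellow rather than in order of severity.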

In [None]:
# Encode string labels into integers for XGBoost
le = LabelEncoder()
DTrain_enc = le.fit_transform(DTrain)
DTest_enc = le.transform(DTest)

features = ['magnitude', 'depth', 'cdi', 'mmi', 'sig']

preprocess_gb = ColumnTransformer(
    transformers=[
        ("power", PowerTransformer(), ['cdi', 'mmi']),
        ("pass", "passthrough", features),
    ],
    remainder='drop',
)

gb_model = Pipeline(steps=[
    ("preprocess", preprocess_gb),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler()),
    ("sampler", RandomOverSampler(random_state=42)),
    ("model", XGBClassifier(
        objective="multi:softprob",
        num_class=4,
        eval_metric="mlogloss",
        tree_method="hist",
        random_state=42,
        n_estimators=200,
        max_depth=4,
        learning_rate=0.1,
        subsample=0.9,
        colsample_bytree=0.9,
        n_jobs=1,
    )),
])

gb_model.fit(NTrain, DTrain_enc)

y_pred_enc = gb_model.predict(NTest)
y_prob_gb = gb_model.predict_proba(NTest)

classes_gb = le.classes_
y_pred_gb = le.inverse_transform(y_pred_enc)

print("Classification report:")
print(classification_report(DTest, y_pred_gb, target_names=classes_gb, digits=3))

## Results & Discussion

In this section, we evaluate our three models - Logistic Regression, Random Forest, and Gradient Boosting (XGBoost) - using quantitative metrics and visualizations. We then interpret each model’s performance in terms of our project goals.

### Logistic Regression

#### Quantitative Metrics

In [None]:
# Predictions & probabilities
y_pred = best_model.predict(NTest)
y_prob = best_model.predict_proba(NTest)
classes = best_model.classes_
print("Classes:", list(classes))

# Compute quantitative metrics
macro_f1 = f1_score(DTest, y_pred, average='macro')
balanced_acc = balanced_accuracy_score(DTest, y_pred)

Y_true_ohe = pd.get_dummies(DTest).reindex(columns=classes, fill_value=0)
macro_roc_auc = roc_auc_score(
    Y_true_ohe, y_prob, multi_class='ovr', average='macro'
)

brier_macro = np.mean([
    brier_score_loss((np.array(DTest) == c).astype(int), y_prob[:, i])
    for i, c in enumerate(classes)
])

print(f"Macro F1: {macro_f1:.3f}")
print(f"Balanced Accuracy: {balanced_acc:.3f}")
print(f"Macro ROC-AUC: {macro_roc_auc:.3f}")
print(f"Macro Brier (↓): {brier_macro:.3f}")

# Per-class performance table
report = classification_report(
    DTest, y_pred, target_names=classes, output_dict=True
)
pd.DataFrame(report).T

#### Visualizations

In [None]:
# Confusion Matrix (Counts)
cm = confusion_matrix(DTest, y_pred, labels=classes)
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay(cm, display_labels=classes).plot(
    ax=ax, cmap="Blues", colorbar=False
)
ax.set_title("Confusion Matrix - Logistic Regression")
plt.tight_layout()
plt.show()

# Confusion Matrix (Normalized by row / recall)
cm_norm = confusion_matrix(DTest, y_pred, labels=classes, normalize="true")
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay(cm_norm, display_labels=classes).plot(
    ax=ax, cmap="Blues", colorbar=True
)
ax.set_title("Normalized Confusion Matrix (per-class recall)")
plt.tight_layout()
plt.show()

In [None]:
# ROC curves (One-vs-Rest)
Y_true_bin = label_binarize(DTest, classes=classes)

plt.figure(figsize=(7, 6))
for i, c in enumerate(classes):
    fpr, tpr, _ = roc_curve(Y_true_bin[:, i], y_prob[:, i])
    auc_i = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{c} (AUC = {auc_i:.2f})")

plt.plot([0, 1], [0, 1], 'k--')
plt.title("ROC Curves - Logistic Regression (One-vs-Rest)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Precision-Recall curves
plt.figure(figsize=(7, 6))
for i, c in enumerate(classes):
    precision, recall, _ = precision_recall_curve(Y_true_bin[:, i], y_prob[:, i])
    ap = average_precision_score(Y_true_bin[:, i], y_prob[:, i])
    plt.plot(recall, precision, label=f"{c} (AP = {ap:.2f})")

plt.title("Precision-Recall Curves - Logistic Regression")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.tight_layout()
plt.show()

#### Discussion

The Logistic Regression model achieved strong and balanced performance across the four alert levels. Based on the quantitative metrics:

- Macro F1 Score (~0.80) shows that the model performs consistently across all four alert levels (Green, Yellow, Orange, Red), since macro averaging gives each class equal weight.
- Balanced Accuracy (~0.80) is the mean of the per-class recalls, so it shows that no alert level is being ignored in favor of the others.
- Macro ROC-AUC (~0.93) shows the model can reliably distinguish between the different alert levels using probability scores.
- Brier Score (~0.08) shows that the predicted probabilities sit close to the observed outcomes, which is consistent with reasonable calibration and limited overconfidence.
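
For reference, the macro Brier score used here is the one-vs-rest Brier score averaged over classes. A minimal NumPy sketch follows; the function name `macro_brier` and the toy inputs are assumptions for illustration.

```python
import numpy as np


def macro_brier(y_true, y_prob, classes):
    """Average the one-vs-rest Brier score over all classes."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    per_class = [
        np.mean(((y_true == c).astype(float) - y_prob[:, i]) ** 2)
        for i, c in enumerate(classes)
    ]
    return float(np.mean(per_class))


classes_demo = ["Green", "Orange", "Red", "Yellow"]
y_demo = ["Green", "Orange", "Red", "Yellow"]   # one toy sample per class

perfect = np.eye(4)               # probability 1.0 on the true class
uniform = np.full((4, 4), 0.25)   # no information: 0.25 for every class

print(macro_brier(y_demo, perfect, classes_demo))  # 0.0
print(macro_brier(y_demo, uniform, classes_demo))  # 0.1875
```

The uniform case gives a useful reference point: a four-class model that always outputs 0.25 scores 0.1875 on balanced data, so lower values indicate informative probabilities.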

The per-class metrics show that:
- “Green” and “Orange” alerts are predicted with high precision (0.96 and 0.78 respectively), meaning few false alarms.
- “Red” alerts have very high recall (0.92), meaning the model correctly catches most dangerous cases - this is ideal for an early-warning system where missing severe events is catastrophic.

Overall, these metrics indicate the model captures meaningful patterns between seismic features (magnitude, depth, CDI, MMI, significance) and alert level while maintaining balanced predictive performance.

The confusion matrix shows that most predictions fall along the diagonal, meaning the model correctly classifies the majority of earthquakes. A few misclassifications occur between Orange and Red alerts - this makes sense since these alert levels are adjacent in severity and share similar ranges of magnitude and intensity features. The normalized confusion matrix also shows that recall (the proportion of correctly identified alerts per class) remains high for all alert levels.
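
The link between the confusion matrix and per-class recall can be sketched directly. The counts below are hypothetical, chosen only to make the arithmetic easy to follow, and are not the project's results.

```python
import numpy as np

labels_demo = ["Green", "Orange", "Red", "Yellow"]

# Hypothetical counts: rows are true classes, columns are predicted classes.
cm_demo = np.array([
    [8, 0, 0, 2],
    [0, 7, 2, 1],
    [0, 1, 9, 0],
    [2, 2, 0, 6],
])

# Recall: correct predictions divided by the number of true samples (row sums).
recall_demo = cm_demo.diagonal() / cm_demo.sum(axis=1)
# Precision: correct predictions divided by the number of predictions (column sums).
precision_demo = cm_demo.diagonal() / cm_demo.sum(axis=0)

for name, r, p in zip(labels_demo, recall_demo, precision_demo):
    print(f"{name}: recall={r:.2f}, precision={p:.2f}")
```

Row-normalizing the matrix, as `normalize="true"` does, puts these recall values on the diagonal.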

The ROC curves show the tradeoff between the true positive rate and false positive rate for each alert class. Each curve lies well above the diagonal “random guess” line, showing the model’s ability to separate each alert level effectively. The area under the curves (AUC) averages about 0.93, which matches the macro ROC-AUC reported above.

The precision-recall curves reinforce the same conclusion: the model maintains good precision for Green and Orange, while achieving strong recall for Red, which is the most safety-critical class.

Our group aims to build a machine learning model that classifies earthquake alert levels (Green, Yellow, Orange, Red) using a range of seismic features (magnitude, depth, CDI, MMI, and significance) instead of relying on magnitude alone. As a baseline, we implemented a logistic regression model, following the plan in our project proposal. The proposal set several key performance goals: macro F1 ≥ 0.75, balanced accuracy ≥ 0.80, macro ROC-AUC ≥ 0.85, Brier score ≤ 0.12, and Red recall ≥ 0.80, with Red recall being the most critical goal because it matters most for life safety.

Our logistic regression model successfully achieved or exceeded most of these with values of:
* Macro F1: 0.804
* Macro ROC-AUC: 0.930
* Macro Brier: 0.084

Our model’s recall for the Red class was 0.923; as the confusion matrix shows, the model correctly identifies 60 of the 65 Red events in the test set, which far surpasses our minimum safety target. However, our balanced accuracy came in at 0.804, only just above our 0.80 goal. Balanced accuracy is the average of the per-class recalls, which were:

* Green Recall: 0.800  
* Orange Recall: 0.800  
* Red Recall: 0.923  
* Yellow Recall: 0.692  
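
The arithmetic behind these figures can be checked in a few lines, using only the recalls and Red counts reported above:

```python
# Per-class recalls reported for logistic regression.
recalls = {"Green": 0.800, "Orange": 0.800, "Red": 0.923, "Yellow": 0.692}

# Balanced accuracy is the unweighted mean of the per-class recalls.
balanced_accuracy = sum(recalls.values()) / len(recalls)

# Red recall from the confusion-matrix counts: 60 of 65 Red events caught.
red_recall = 60 / 65

print(round(balanced_accuracy, 3))  # 0.804
print(round(red_recall, 3))         # 0.923
```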

While this model performed well on Red alerts and solidly on Green and Orange alerts, Yellow’s lower recall pulled the average down. The confusion matrix confirms this: many Yellow events were misclassified as Green or Orange, suggesting that the boundaries between the Green, Yellow, and Orange classes are more complex than this model can represent.

Overall, we believe that this model’s strong performance is due to our preprocessing pipeline. The PowerTransformer normalized the skewed cdi and mmi features, while the StandardScaler ensured that features with large ranges, such as depth and sig, did not dominate the model. GridSearchCV also selected PolynomialFeatures(degree=2), which suggests that interactions between features help separate the classes.

### Random Forest

#### Quantitative Metrics

In [None]:
# Probabilities for metrics/curves
y_prob_rf = best_rf.predict_proba(X_test)
classes_rf = best_rf.classes_
print("Classes:", list(classes_rf))

macro_f1_rf = f1_score(y_test, y_pred_rf, average='macro')
balanced_acc_rf = balanced_accuracy_score(y_test, y_pred_rf)

Y_true_ohe_rf = pd.get_dummies(y_test).reindex(columns=classes_rf, fill_value=0)
macro_roc_auc_rf = roc_auc_score(
    Y_true_ohe_rf, y_prob_rf, multi_class='ovr', average='macro'
)

brier_macro_rf = np.mean([
    brier_score_loss((np.array(y_test) == c).astype(int), y_prob_rf[:, i])
    for i, c in enumerate(classes_rf)
])

print(f"Macro F1: {macro_f1_rf:.3f}")
print(f"Balanced Accuracy: {balanced_acc_rf:.3f}")
print(f"Macro ROC-AUC: {macro_roc_auc_rf:.3f}")
print(f"Macro Brier (↓): {brier_macro_rf:.3f}")

report_rf = classification_report(
    y_test, y_pred_rf, target_names=classes_rf, output_dict=True
)
pd.DataFrame(report_rf).T

#### Visualizations

In [None]:
# Confusion Matrix (Counts)
labels_rf = classes_rf  # or sorted(y.unique())
cm_rf = confusion_matrix(y_test, y_pred_rf, labels=labels_rf)
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay(cm_rf, display_labels=labels_rf).plot(
    ax=ax, cmap="Blues", colorbar=False
)
ax.set_title("Random Forest Confusion Matrix")
plt.tight_layout()
plt.show()

# Confusion Matrix (Normalized)
cm_norm_rf = confusion_matrix(
    y_test, y_pred_rf, labels=labels_rf, normalize="true"
)
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay(cm_norm_rf, display_labels=labels_rf).plot(
    ax=ax, cmap="Blues", colorbar=True
)
ax.set_title("Normalized Confusion Matrix - Random Forest")
plt.tight_layout()
plt.show()

In [None]:
# ROC Curves (One-vs-Rest)
Y_bin_rf = label_binarize(y_test, classes=labels_rf)

plt.figure(figsize=(7, 6))
for i, c in enumerate(labels_rf):
    fpr, tpr, _ = roc_curve(Y_bin_rf[:, i], y_prob_rf[:, i])
    auc_i = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{c} (AUC = {auc_i:.2f})")

plt.plot([0, 1], [0, 1], 'k--')
plt.title("ROC Curves - Random Forest (One-vs-Rest)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Precision-Recall Curves
plt.figure(figsize=(7, 6))
for i, c in enumerate(labels_rf):
    precision, recall, _ = precision_recall_curve(Y_bin_rf[:, i], y_prob_rf[:, i])
    ap = average_precision_score(Y_bin_rf[:, i], y_prob_rf[:, i])
    plt.plot(recall, precision, label=f"{c} (AP = {ap:.2f})")

plt.title("Precision-Recall Curves - Random Forest")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.tight_layout()
plt.show()

#### Discussion

Our second model builds on our baseline by using a Random Forest classifier. Unlike logistic regression, which relies on linear decision boundaries, Random Forests can pick up more complex patterns between the seismic features (magnitude, depth, CDI, MMI, and significance). This made it a strong next step for our project since the relationships between alert levels are unlikely to be strictly linear.

We used the same preprocessing steps as before - the PowerTransformer for CDI and MMI, and StandardScaler for all features. We then ran a 5-fold GridSearchCV over several parameters, including the number of trees, maximum depth, and split rules. The best model from this search used 100 trees with no limit on depth.

The Random Forest achieved strong and balanced performance with the following overall results:
* Accuracy: 0.923  
* Macro F1: 0.924  
* Macro Recall: 0.923  

Per-class recall also showed strong performance across all alert levels:
* Green Recall: 0.846  
* Orange Recall: 0.954  
* Red Recall: 0.969  
* Yellow Recall: 0.923  

These results show that the model identifies nearly all Red alerts, which remains our most important safety goal. It also performed especially well on Yellow and Orange alerts, suggesting that the nonlinear structure of Random Forests allows the model to pick up on subtle feature interactions that the logistic regression model struggled to capture.

The confusion matrix supports this, with most predictions falling along the diagonal and relatively few alerts being confused with neighboring levels. Misclassifications that do occur tend to be between adjacent severity levels, which is expected given their shared feature ranges.

Overall, the Random Forest results indicate that a more flexible model can better capture the relationships in our dataset. Its strong recall across all four alert classes, especially for Red and Yellow alerts, suggests that this approach may be better suited to representing the underlying structure of the seismic features.

### Gradient Boosting (XGBoost)

#### Quantitative Metrics

In [None]:
# Macro metrics
macro_f1_gb = f1_score(DTest, y_pred_gb, average='macro')
balanced_acc_gb = balanced_accuracy_score(DTest, y_pred_gb)

Y_true_ohe_gb = pd.get_dummies(DTest).reindex(columns=classes_gb, fill_value=0)
macro_roc_auc_gb = roc_auc_score(
    Y_true_ohe_gb, y_prob_gb, multi_class='ovr', average='macro'
)

macro_brier_gb = np.mean([
    brier_score_loss((np.array(DTest) == c).astype(int), y_prob_gb[:, i])
    for i, c in enumerate(classes_gb)
])

print(f"Macro F1: {macro_f1_gb:.3f}")
print(f"Balanced Accuracy: {balanced_acc_gb:.3f}")
print(f"Macro ROC-AUC: {macro_roc_auc_gb:.3f}")
print(f"Macro Brier (↓): {macro_brier_gb:.3f}")

# Per-class performance table
report_gb = classification_report(
    DTest, y_pred_gb, target_names=classes_gb, output_dict=True
)
pd.DataFrame(report_gb).T

#### Visualizations

In [None]:
# Confusion Matrix (Counts)
cm_gb = confusion_matrix(DTest, y_pred_gb, labels=classes_gb)
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay(cm_gb, display_labels=classes_gb).plot(
    ax=ax, cmap="Blues", colorbar=False, values_format='d'
)
ax.set_title("Confusion Matrix - Gradient Boosting (XGBoost)")
plt.tight_layout()
plt.show()

# Confusion Matrix (Normalized)
cm_norm_gb = confusion_matrix(
    DTest, y_pred_gb, labels=classes_gb, normalize="true"
)
fig, ax = plt.subplots(figsize=(6, 6))
ConfusionMatrixDisplay(cm_norm_gb, display_labels=classes_gb).plot(
    ax=ax, cmap="Blues", colorbar=True
)
ax.set_title("Normalized Confusion Matrix - Gradient Boosting")
plt.tight_layout()
plt.show()

In [None]:
# ROC Curves (One-vs-Rest)
Y_bin_gb = label_binarize(DTest, classes=classes_gb)

plt.figure(figsize=(7, 6))
for i, c in enumerate(classes_gb):
    fpr, tpr, _ = roc_curve(Y_bin_gb[:, i], y_prob_gb[:, i])
    auc_i = auc(fpr, tpr)
    plt.plot(fpr, tpr, label=f"{c} (AUC = {auc_i:.2f})")

plt.plot([0, 1], [0, 1], 'k--')
plt.title("ROC Curves - Gradient Boosting (One-vs-Rest)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.tight_layout()
plt.show()

In [None]:
# Precision-Recall Curves
plt.figure(figsize=(7, 6))
for i, c in enumerate(classes_gb):
    precision, recall, _ = precision_recall_curve(Y_bin_gb[:, i], y_prob_gb[:, i])
    ap = average_precision_score(Y_bin_gb[:, i], y_prob_gb[:, i])
    plt.plot(recall, precision, label=f"{c} (AP = {ap:.2f})")

plt.title("Precision-Recall Curves - Gradient Boosting")
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend()
plt.tight_layout()
plt.show()

#### Discussion

With the XGBoost pipeline above:

* Macro F1: 0.923
* Balanced accuracy: 0.923
* Macro ROC-AUC: 0.986
* Macro Brier: 0.031
* Per-class recall: Green 0.892, Orange 0.969, Red 0.923, Yellow 0.908

Gradient boosting:
- Meets or exceeds all project targets (macro F1 ≥ .75, balanced accuracy ≥ .80, macro ROC-AUC ≥ .85, Brier ≤ .12, Red recall ≥ .80).
- Maintains high recall for the critical Red alert level.
- Produces strong classification performance for the Yellow alert level.
- Achieves a very low Brier score (≈ 0.03), which indicates accurate, well-calibrated probability estimates.

To capture non-linear interactions between seismic features, we trained a gradient boosting model using XGBoost with the same preprocessing steps used for logistic regression (PowerTransformer on CDI/MMI, degree-2 polynomial features, StandardScaler, and RandomOverSampler for class balancing). The final model used 200 boosted trees with max depth 4 and learning rate 0.1.

On the held-out test set, XGBoost achieved a macro F1 of 0.923, a balanced accuracy of 0.923, a macro ROC–AUC of 0.986, and a macro Brier score of 0.031. These values satisfy all of our predefined project goals.

The per-class performance shows strong predictive ability across all alert levels. Recall for Green and Orange alerts reached 0.892 and 0.969, respectively, while Red recall remained high at 0.923. The Yellow class also showed strong performance, with recall of 0.908 and a corresponding F1 score near 0.89. These values indicate that the boosted trees are able to capture the complex boundaries between seismic patterns associated with different alert levels.

The confusion matrix for XGBoost shows that most examples lie on the diagonal, with only a small number of cross-class confusions. The normalized matrix highlights the consistently high recall across all four alert levels. ROC and precision–recall curves for each class lie well above the random baseline, with areas under the ROC curves close to 1.0, indicating strong separability of the seismic patterns that correspond to each alert level.

Overall, gradient boosting provides highly accurate and reliable earthquake alert classification across all severity levels. The model maintains high recall for the most severe Red alerts while also achieving strong performance for intermediate alert levels, making it a strong candidate for a practical early warning system.

### Model Comparison

We built three machine learning models: logistic regression, Random Forest, and gradient boosting. Based on the results we achieved, the gradient boosting model stood out as the best choice of the three because it provides the best balance of classification performance and reliability. The logistic regression model met all of the minimum project targets but, because of its simplicity, could not capture non-linear relationships, which led to a low Yellow recall (0.692). Random Forest improved on logistic regression by capturing these non-linear interactions, achieving strong macro F1 (0.924) and Red recall (0.969). However, the gradient boosting model was the best fit of the three since it scored better on macro ROC-AUC and Brier score. Based on these metrics, we conclude that the gradient boosting model offers classification performance on par with Random Forest together with the most trustworthy probability estimates.

## References

[1] T. T. Trugman, E. A. Casarotti, and P. M. Shearer, “Machine learning for earthquake early warning and ground motion prediction,” Seismological Research Letters, vol. 91, no. 5, pp. 2362–2376, 2020.

[2] S. Mousavi and G. C. Beroza, “A machine-learning approach for earthquake magnitude estimation,” Geophysical Research Letters, vol. 47, no. 1, 2020.

[3] X. Wang, Z. Wang, J. Wang, P. Miao, H. Dang, and Z. Li, “Machine learning based ground motion site amplification prediction,” Frontiers in Earth Science, vol. 11, 2023.