# 3 Model Training and Prediction

This notebook trains and evaluates a predictive model using the imputed feature matrix with depth-2 interaction terms. It includes model selection, training, prediction, evaluation, and export of results.

## Contents

- **3.1 Load Transformed Dataset**
- **3.2 Define Target and Features**
- **3.3 Train-Test Split**
- **3.4 Model Training**
- **3.5 Prediction and Evaluation**
- **3.6 Export Predictions**
- **3.7 Save Model Artifact**

Load the packages used for data handling, train-test splitting, model training, and evaluation.

In [None]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

## 3.1 Load Transformed Dataset

Load the imputed and raw feature matrices, both with depth-2 interactions, from the export stage.

In [6]:
# Load Transformed Dataset
imputed_df = pd.read_csv('../data/interaction/earthquake_imputed_2way.csv')
features_imputed = ['dmin', 'Year', 'cdi', 'dmin:Year']
raw_df = pd.read_csv('../data/interaction/earthquake_raw_2way.csv')
features_raw = ['Year', 'nst', 'sig', 'magnitude', 'Year:magnitude', 'depth']
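The feature lists defined above are not applied anywhere later in this notebook; the split in 3.3 uses every non-target column. As a minimal sketch, assuming one wanted to confirm that the listed columns exist (or restrict the matrix to them), a check might look like this. `toy_df` and its values are hypothetical stand-ins for the loaded CSV.

```python
import pandas as pd

# Hypothetical stand-in for the loaded frame; values are made up for illustration.
toy_df = pd.DataFrame({
    "dmin": [0.5, 1.2, 0.9],
    "Year": [2001, 2010, 2015],
    "cdi": [3, 7, 5],
    "dmin:Year": [1000.5, 2412.0, 1813.5],
    "tsunami": [0, 1, 0],
})
features = ["dmin", "Year", "cdi", "dmin:Year"]

# Columns named in the list but absent from the frame
missing = [col for col in features if col not in toy_df.columns]
print("Missing columns:", missing)

# Optional: restrict the matrix to the listed features
X_subset = toy_df[features]
print("Subset shape:", X_subset.shape)
```

An empty `missing` list means every listed feature is available for selection.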

## 3.2 Define Target and Features

Specify the target variable for prediction. The outcome column (`tsunami`) is separated from the remaining columns when the data is split in 3.3, which produces the inputs for model training.

- Target variable: `tsunami` (binary classification)
- Feature matrix: all other columns from the transformed dataset
- No feature pruning or filtering is applied at this stage
- Class distribution is printed for diagnostic clarity

In [7]:
# Define target column
target = 'tsunami'  # Replace with actual target if different
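The class-distribution check listed above can be sketched as follows. This is a minimal example on a small synthetic frame (`toy_df` is hypothetical); with the real data, the same calls would apply to `imputed_df[target]`.

```python
import pandas as pd

# Synthetic labels for illustration only: 5 negatives, 3 positives
toy_df = pd.DataFrame({"tsunami": [0, 0, 0, 1, 1, 0, 1, 0]})
target = "tsunami"

# Absolute counts and relative shares per class
counts = toy_df[target].value_counts().sort_index()
shares = toy_df[target].value_counts(normalize=True).sort_index()

print("Counts:\n", counts)
print("Shares:\n", shares.round(3))
```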

## 3.3 Train-Test Split

Split the dataset into training and test sets using stratified sampling to preserve class balance. This ensures that the model is trained and evaluated on representative distributions of the target variable.

- Split ratio: 80% train / 20% test
- Stratification: enabled to preserve class proportions
- Random seed: 42 for reproducibility

In [9]:
# Stratified split to preserve class distribution
def split_data(df: pd.DataFrame, target_col: str = "target", test_size: float = 0.2, random_state: int = 42):
    """Return X_train, X_test, y_train, y_test, stratified on the target column."""
    # All columns except the target are used as features
    X = df.drop(columns=[target_col])
    y = df[target_col]

    return train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)

X_train_imputed, X_test_imputed, y_train_imputed, y_test_imputed = split_data(imputed_df, target)
X_train_raw, X_test_raw, y_train_raw, y_test_raw = split_data(raw_df, target)

# Confirm shapes
print("X_train:", X_train_imputed.shape)
print("X_test:", X_test_imputed.shape)
print("y_train:", y_train_imputed.shape)
print("y_test:", y_test_imputed.shape)

X_train: (625, 15)
X_test: (157, 15)
y_train: (625,)
y_test: (157,)
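To confirm that stratification preserved class proportions, the positive rate can be compared across the two splits. A minimal sketch on synthetic data (the frame and the 30% positive rate are assumptions for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30 positives and 70 negatives
toy_df = pd.DataFrame({
    "x": range(100),
    "tsunami": [1] * 30 + [0] * 70,
})
X = toy_df.drop(columns=["tsunami"])
y = toy_df["tsunami"]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, the positive rate should match in both splits
print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

With the notebook's data, the same comparison would use `y_train_imputed.mean()` and `y_test_imputed.mean()`.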


## 3.4 Model Training

Train a tree-based classifier using the training set. Random Forest is selected for its robustness and invariance to feature scaling. Class imbalance is addressed using `class_weight='balanced'`, which weights each class inversely to its frequency in the training labels.

- Model: `RandomForestClassifier`
- Parameters: 100 trees, max depth 8, balanced class weights
- Input: imputed feature matrix (`X_train_imputed`, `y_train_imputed`)
- Optional: raw matrix training for comparison

In [16]:
# Initialize model
model_imputed = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    random_state=42,
    class_weight='balanced'  # handles class imbalance
)

# Fit model on imputed feature matrix
model_imputed.fit(X_train_imputed, y_train_imputed)

In [15]:
model_raw = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    random_state=42,
    class_weight='balanced'  # handles class imbalance
)

# Fit model on raw feature matrix
model_raw.fit(X_train_raw, y_train_raw)
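For reference, scikit-learn documents the `'balanced'` weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below computes them by hand on a small synthetic label vector (`y_toy` is hypothetical) and compares the result with `compute_class_weight`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels for illustration: 6 negatives, 2 positives
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1])
classes = np.unique(y_toy)

# Manual formula: n_samples / (n_classes * count_per_class)
manual = len(y_toy) / (len(classes) * np.bincount(y_toy))

# Same weights from scikit-learn's helper
from_sklearn = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

print("Manual:", manual)
print("scikit-learn:", from_sklearn)
```

The minority class receives the larger weight (here 2.0 versus about 0.67).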

## 3.5 Prediction and Evaluation

Generate predictions on the test set and evaluate model performance using standard classification metrics.

- Predictions: binary labels and class probabilities
- Metrics: confusion matrix, classification report, ROC AUC
- Input: `X_test_imputed`, `y_test_imputed`
- Optional: evaluation on raw matrix for comparison

In [None]:
# Predict labels and probabilities
y_pred = model_imputed.predict(X_test_imputed)
y_proba = model_imputed.predict_proba(X_test_imputed)[:, 1]

# Evaluation metrics
print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba))

Confusion Matrix:
 [[86 10]
 [ 1 60]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.90      0.94        96
           1       0.86      0.98      0.92        61

    accuracy                           0.93       157
   macro avg       0.92      0.94      0.93       157
weighted avg       0.94      0.93      0.93       157


ROC AUC Score: 0.9629439890710382
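The figures in the classification report follow directly from the confusion matrix above. As a worked check, the sketch below recomputes them from the four cell counts, using scikit-learn's layout (rows are true labels, columns are predicted labels).

```python
import numpy as np

# Confusion matrix from the output above: rows = true class, columns = predicted class
cm = np.array([[86, 10],
               [1, 60]])
tn, fp, fn, tp = cm.ravel()

precision_0 = tn / (tn + fn)      # 86 / 87
recall_0 = tn / (tn + fp)         # 86 / 96
precision_1 = tp / (tp + fp)      # 60 / 70
recall_1 = tp / (tp + fn)         # 60 / 61
accuracy = (tp + tn) / cm.sum()   # 146 / 157

print(f"class 0: precision={precision_0:.2f}, recall={recall_0:.2f}")
print(f"class 1: precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"accuracy={accuracy:.2f}")
```

Only 1 of the 61 tsunami events is missed, while 10 of the 96 non-tsunami events are flagged as positive.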


In [18]:
# Predict labels and probabilities
y_pred_raw = model_raw.predict(X_test_raw)
y_proba_raw = model_raw.predict_proba(X_test_raw)[:, 1]
# Evaluation metrics
print("Raw ROC AUC:", roc_auc_score(y_test_raw, y_proba_raw))

Raw ROC AUC: 0.9588456284153005
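When comparing the two scores, it can help to recall what ROC AUC measures: the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half. A minimal sketch on synthetic scores (`y_true` and `y_score` are made up for illustration) checks this against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and predicted probabilities for illustration
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Compare every positive score against every negative score
diff = pos[:, None] - neg[None, :]
pairwise_auc = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

print("Pairwise AUC:", pairwise_auc)
print("roc_auc_score:", roc_auc_score(y_true, y_score))
```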
