# 3 Model Training and Prediction

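This notebook trains and evaluates predictive models for the binary `tsunami` target using the imputed feature matrix with depth-2 (two-way) interaction terms, with the raw matrix as a comparison. It covers data loading, a stratified train-test split, Random Forest training and evaluation, re-training on a selected feature subset, and ensemble modeling.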

## Contents

- **3.1 Load Transformed Dataset**
- **3.2 Define Target and Features**
- **3.3 Train-Test Split**
- **3.4 Model Training**
- **3.5 Prediction and Evaluation**
- **3.6 Apply Selected Features**
- **3.7 Ensemble Modeling**

Load the packages needed for data handling, model training, and evaluation.

In [1]:
# Import required libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score

## 3.1 Load Transformed Dataset

Load the imputed and raw feature matrices with depth-2 interactions from the export stage, and define the feature subset selected for each.

In [2]:
# Load Transformed Dataset
imputed_df = pd.read_csv('../data/interaction/earthquake_imputed_2way.csv')
features_imputed = ['dmin', 'Year', 'cdi', 'dmin:Year']
raw_df = pd.read_csv('../data/interaction/earthquake_raw_2way.csv')
features_raw = ['Year', 'nst', 'sig', 'magnitude', 'Year:magnitude', 'depth']

## 3.2 Define Target and Features

Specify the target variable for prediction and construct the feature matrix. This step isolates the outcome column (`tsunami`) from the rest of the dataset, preparing inputs for model training.

- Target variable: `tsunami` (binary classification)
- Feature matrix: all other columns from the transformed dataset
- No feature pruning or filtering is applied at this stage
- Class distribution is printed for diagnostic clarity

In [3]:
# Define target column
target = 'tsunami'  # Replace with actual target if different
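
The cell above only defines the target name. A minimal sketch of the class-distribution check listed in this section might look like the following. `class_distribution` and `y_demo` are hypothetical names used for illustration; in the notebook one would pass `imputed_df[target]` instead of the toy labels.

```python
import pandas as pd

def class_distribution(y: pd.Series) -> pd.DataFrame:
    """Return per-class counts and proportions for a label column."""
    counts = y.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "proportion": counts / len(y)})

# Stand-in labels for illustration only (not the earthquake data)
y_demo = pd.Series([0, 0, 0, 1, 1], name="tsunami")
dist = class_distribution(y_demo)
print(dist)
```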

## 3.3 Train-Test Split

Split the dataset into training and test sets using stratified sampling to preserve class balance. This ensures that the model is trained and evaluated on representative distributions of the target variable.

- Split ratio: 80% train / 20% test
- Stratification: enabled to preserve class proportions
- Random seed: 42 for reproducibility

In [4]:
# Stratified split to preserve class distribution
def split_data(df: pd.DataFrame, target_col: str = "target", test_size: float = 0.2, random_state: int = 42):
    X = df.drop(columns=[target_col])
    y = df[target_col]
    
    return train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)

X_train_imputed, X_test_imputed, y_train_imputed, y_test_imputed = split_data(imputed_df, target)
X_train_raw, X_test_raw, y_train_raw, y_test_raw = split_data(raw_df, target)

# Confirm shapes
print("X_train:", X_train_imputed.shape)
print("X_test:", X_test_imputed.shape)
print("y_train:", y_train_imputed.shape)
print("y_test:", y_test_imputed.shape)

X_train: (625, 15)
X_test: (157, 15)
y_train: (625,)
y_test: (157,)
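
To see what stratification does, one can compare the positive-class rate in each partition. The sketch below uses synthetic labels (`df_demo` is a hypothetical stand-in for `imputed_df`), so the numbers are illustrative only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 30% positives (not the earthquake data)
df_demo = pd.DataFrame({
    "x": range(100),
    "tsunami": [1] * 30 + [0] * 70,
})

X = df_demo.drop(columns=["tsunami"])
y = df_demo["tsunami"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, both partitions keep the positive rate of the full data
print("full :", y.mean())
print("train:", y_tr.mean())
print("test :", y_te.mean())
```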


## 3.4 Model Training

Train a tree-based classifier using the training set. Random Forest is selected for its robustness and invariance to feature scaling. Class imbalance is addressed using `class_weight='balanced'`.

- Model: `RandomForestClassifier`
- Parameters: 100 trees, max depth 8, balanced class weights
- Input: imputed feature matrix (`X_train_imputed`, `y_train_imputed`)
- Optional: raw matrix training for comparison

In [5]:
# Initialize model
model_imputed = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    random_state=42,
    class_weight='balanced'  # handles class imbalance
)

# Fit model on imputed feature matrix
model_imputed.fit(X_train_imputed, y_train_imputed)

In [6]:
model_raw = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    random_state=42,
    class_weight='balanced'  # handles class imbalance
)

# Fit model on raw feature matrix
model_raw.fit(X_train_raw, y_train_raw)
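
For reference, scikit-learn documents the `'balanced'` heuristic as `n_samples / (n_classes * np.bincount(y))`. A small sketch with toy labels (`y_demo` is a stand-in, not the notebook's target) shows the weights this yields.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 6 negatives, 2 positives
y_demo = np.array([0] * 6 + [1] * 2)

# Weights as scikit-learn computes them for class_weight='balanced'
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

# Same formula written out by hand
manual = len(y_demo) / (2 * np.bincount(y_demo))

print(weights)  # the minority class receives the larger weight
```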

## 3.5 Prediction and Evaluation

Generate predictions on the test set and evaluate model performance using standard classification metrics.

- Predictions: binary labels and class probabilities
- Metrics: confusion matrix, classification report, ROC AUC
- Input: `X_test_imputed`, `y_test_imputed`
- Optional: evaluation on raw matrix for comparison

In [7]:
# Predict labels and probabilities
y_pred = model_imputed.predict(X_test_imputed)
y_proba = model_imputed.predict_proba(X_test_imputed)[:, 1]

# Evaluation metrics
print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba))

Confusion Matrix:
 [[87  9]
 [ 1 60]]

Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.91      0.95        96
           1       0.87      0.98      0.92        61

    accuracy                           0.94       157
   macro avg       0.93      0.94      0.93       157
weighted avg       0.94      0.94      0.94       157


ROC AUC Score: 0.9617486338797815
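
As a sanity check, the report figures can be recomputed directly from the confusion matrix printed above. In scikit-learn's layout, rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class)
cm = np.array([[87, 9],
               [1, 60]])
tn, fp, fn, tp = cm.ravel()

precision_1 = tp / (tp + fp)      # 60 / 69
recall_1 = tp / (tp + fn)         # 60 / 61
accuracy = (tp + tn) / cm.sum()   # 147 / 157

print(f"precision(1)={precision_1:.2f} recall(1)={recall_1:.2f} accuracy={accuracy:.2f}")
# → precision(1)=0.87 recall(1)=0.98 accuracy=0.94
```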


In [8]:
# Predict labels and probabilities
y_pred_raw = model_raw.predict(X_test_raw)
y_proba_raw = model_raw.predict_proba(X_test_raw)[:, 1]
# Evaluation metrics
print("Raw ROC AUC:", roc_auc_score(y_test_raw, y_proba_raw))

Raw ROC AUC: 0.9588456284153005
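
ROC AUC, used here to compare the imputed and raw models, can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, with ties counting one half. A small sketch with made-up scores (`y_true` and `scores` are illustrative, not notebook output) checks that reading against `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count 0.5
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(pairwise_auc, roc_auc_score(y_true, scores))
```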


## 3.6 Apply Selected Features

Apply the previously selected feature subsets to both the imputed and raw matrices, then re-train and re-evaluate the Random Forest on the reduced inputs. Because the subsets are taken from the existing split, the full and reduced models are compared on the same training and test rows.

- Source: `features_imputed` and `features_raw` lists defined in 3.1
- Targets: `X_train`, `X_test` for both imputed and raw matrices
- Output: reduced feature matrices for modeling and evaluation

In [9]:
# Apply selected features
X_train_imputed_selected = X_train_imputed[features_imputed]
X_test_imputed_selected = X_test_imputed[features_imputed]

# Apply to raw matrix for comparison
X_train_raw_selected = X_train_raw[features_raw]
X_test_raw_selected = X_test_raw[features_raw]

# Confirm shape
print("Selected feature matrix shape:", X_train_imputed_selected.shape)

Selected feature matrix shape: (625, 4)
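
Selecting columns by name raises a `KeyError` if any listed feature is absent, for instance if an interaction column such as `dmin:Year` was named differently at export. A small guard can make that failure easier to read. This is a sketch: `select_features` is a hypothetical helper, and the toy frame stands in for `X_train_imputed`.

```python
import pandas as pd

def select_features(X: pd.DataFrame, features: list[str]) -> pd.DataFrame:
    """Return X restricted to `features`, failing early with a clear message."""
    missing = [col for col in features if col not in X.columns]
    if missing:
        raise KeyError(f"Columns not found in feature matrix: {missing}")
    return X[features]

# Toy frame for illustration only
X_demo = pd.DataFrame({
    "dmin": [0.1, 0.2],
    "Year": [2001, 2002],
    "dmin:Year": [200.1, 400.4],
})
subset = select_features(X_demo, ["dmin", "dmin:Year"])
print(subset.shape)
```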


In [10]:
# Initialize model
model_imputed_selected = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    random_state=42,
    class_weight='balanced'  # handles class imbalance
)

# Fit model on the selected imputed features
model_imputed_selected.fit(X_train_imputed_selected, y_train_imputed)

In [11]:
model_raw_selected = RandomForestClassifier(
    n_estimators=100,
    max_depth=8,
    random_state=42,
    class_weight='balanced'  # handles class imbalance
)

# Fit model on the selected raw features
model_raw_selected.fit(X_train_raw_selected, y_train_raw)

In [12]:
# Predict labels and probabilities using selected features
y_pred = model_imputed_selected.predict(X_test_imputed_selected)
y_proba = model_imputed_selected.predict_proba(X_test_imputed_selected)[:, 1]

# Evaluation metrics
print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba))

Confusion Matrix:
 [[79 17]
 [ 6 55]]

Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.82      0.87        96
           1       0.76      0.90      0.83        61

    accuracy                           0.85       157
   macro avg       0.85      0.86      0.85       157
weighted avg       0.87      0.85      0.86       157


ROC AUC Score: 0.9140198087431695


In [13]:
# Predict labels and probabilities
y_pred_raw = model_raw_selected.predict(X_test_raw_selected)
y_proba_raw = model_raw_selected.predict_proba(X_test_raw_selected)[:, 1]
# Evaluation metrics
print("Raw ROC AUC:", roc_auc_score(y_test_raw, y_proba_raw))

Raw ROC AUC: 0.9465505464480874


## 3.7 Ensemble Modeling

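Train AdaBoost and XGBoost classifiers, then combine them with a Random Forest in a soft-voting ensemble. Each model is evaluated twice: trained on the full imputed matrix (3.7.1) and on the imputed-selected matrix (3.7.2). Note that `VotingClassifier.fit` refits clones of its member estimators, so the Random Forest passed in is retrained on the ensemble's training data rather than reused as previously fitted.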

- Models: AdaBoost, XGBoost, Random Forest
- Ensemble: `VotingClassifier` with soft voting
- Evaluation: confusion matrix, classification report, ROC AUC
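
Soft voting averages the class probabilities predicted by each member and picks the class with the highest mean. The following sketch uses invented probabilities for three models on two samples (the arrays are illustrative, not notebook output) and assumes equal weights.

```python
import numpy as np

# Each array: rows = samples, columns = P(class 0), P(class 1)
proba_ada = np.array([[0.60, 0.40], [0.45, 0.55]])
proba_xgb = np.array([[0.30, 0.70], [0.20, 0.80]])
proba_rf = np.array([[0.55, 0.45], [0.40, 0.60]])

# Unweighted soft vote: mean probability per class, then argmax
mean_proba = np.mean([proba_ada, proba_xgb, proba_rf], axis=0)
soft_vote = mean_proba.argmax(axis=1)

print(mean_proba)
print(soft_vote)
```

On the first sample two of the three models favour class 0, yet the averaged probability favours class 1, which is how soft voting can differ from a simple majority vote.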

### 3.7.1 Training on `X_train_imputed`

In [14]:
from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier(
    n_estimators=100,
    random_state=42
)

ada_model.fit(X_train_imputed, y_train_imputed)

In [15]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

xgb_model.fit(X_train_imputed, y_train_imputed)



In [16]:
from sklearn.ensemble import VotingClassifier

ensemble_model = VotingClassifier(
    estimators=[
        ('ada', ada_model),
        ('xgb', xgb_model),
        ('rf', model_imputed_selected)  # Random Forest settings; VotingClassifier.fit refits a clone
    ],
    voting='soft'  # uses predicted probabilities
)

ensemble_model.fit(X_train_imputed, y_train_imputed)
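
One detail worth checking: `VotingClassifier.fit` fits clones of the estimators it is given, so a model trained beforehand is not reused as-is. The sketch below uses synthetic data from `make_classification` and scikit-learn models only (all names are illustrative). It suggests that the fitted member is a new object trained on whatever matrix was passed to the ensemble.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    RandomForestClassifier,
    VotingClassifier,
)

# Synthetic data: 6 features in total
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Pre-train a forest on only the first 3 columns
rf_pre = RandomForestClassifier(n_estimators=10, random_state=42)
rf_pre.fit(X_demo[:, :3], y_demo)

ensemble = VotingClassifier(
    estimators=[
        ("ada", AdaBoostClassifier(n_estimators=10, random_state=42)),
        ("rf", rf_pre),
    ],
    voting="soft",
)
ensemble.fit(X_demo, y_demo)  # refits clones on all 6 columns

rf_fitted = ensemble.named_estimators_["rf"]
print(rf_fitted is rf_pre)       # False: the ensemble holds a clone
print(rf_pre.n_features_in_)     # 3: the original is untouched
print(rf_fitted.n_features_in_)  # 6: the clone saw the ensemble's matrix
```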



In [17]:
y_pred_ada = ada_model.predict(X_test_imputed)
y_proba_ada = ada_model.predict_proba(X_test_imputed)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred_ada))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred_ada))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba_ada))

Confusion Matrix:
 [[89  7]
 [ 9 52]]

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.93      0.92        96
           1       0.88      0.85      0.87        61

    accuracy                           0.90       157
   macro avg       0.89      0.89      0.89       157
weighted avg       0.90      0.90      0.90       157


ROC AUC Score: 0.9593579234972678


In [18]:
y_pred_xgb = xgb_model.predict(X_test_imputed)
y_proba_xgb = xgb_model.predict_proba(X_test_imputed)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred_xgb))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba_xgb))

Confusion Matrix:
 [[89  7]
 [ 3 58]]

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.93      0.95        96
           1       0.89      0.95      0.92        61

    accuracy                           0.94       157
   macro avg       0.93      0.94      0.93       157
weighted avg       0.94      0.94      0.94       157


ROC AUC Score: 0.9663592896174864




In [19]:
y_pred_ensemble = ensemble_model.predict(X_test_imputed)
y_proba_ensemble = ensemble_model.predict_proba(X_test_imputed)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred_ensemble))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred_ensemble))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba_ensemble))

Confusion Matrix:
 [[89  7]
 [ 3 58]]

Classification Report:
               precision    recall  f1-score   support

           0       0.97      0.93      0.95        96
           1       0.89      0.95      0.92        61

    accuracy                           0.94       157
   macro avg       0.93      0.94      0.93       157
weighted avg       0.94      0.94      0.94       157


ROC AUC Score: 0.966188524590164




### 3.7.2 Training on `X_train_imputed_selected`

In [20]:
from sklearn.ensemble import AdaBoostClassifier

ada_model = AdaBoostClassifier(
    n_estimators=100,
    random_state=42
)

ada_model.fit(X_train_imputed_selected, y_train_imputed)

In [21]:
from xgboost import XGBClassifier

xgb_model = XGBClassifier(
    n_estimators=100,
    max_depth=4,
    learning_rate=0.1,
    use_label_encoder=False,
    eval_metric='logloss',
    random_state=42
)

xgb_model.fit(X_train_imputed_selected, y_train_imputed)



In [22]:
from sklearn.ensemble import VotingClassifier

ensemble_model = VotingClassifier(
    estimators=[
        ('ada', ada_model),
        ('xgb', xgb_model),
        ('rf', model_imputed_selected)  # Random Forest settings; VotingClassifier.fit refits a clone
    ],
    voting='soft'  # uses predicted probabilities
)

ensemble_model.fit(X_train_imputed_selected, y_train_imputed)



In [23]:
y_pred_ada = ada_model.predict(X_test_imputed_selected)
y_proba_ada = ada_model.predict_proba(X_test_imputed_selected)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred_ada))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred_ada))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba_ada))

Confusion Matrix:
 [[77 19]
 [ 8 53]]

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.80      0.85        96
           1       0.74      0.87      0.80        61

    accuracy                           0.83       157
   macro avg       0.82      0.84      0.82       157
weighted avg       0.84      0.83      0.83       157


ROC AUC Score: 0.9082137978142076


In [24]:
y_pred_xgb = xgb_model.predict(X_test_imputed_selected)
y_proba_xgb = xgb_model.predict_proba(X_test_imputed_selected)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred_xgb))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred_xgb))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba_xgb))

Confusion Matrix:
 [[79 17]
 [ 8 53]]

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.82      0.86        96
           1       0.76      0.87      0.81        61

    accuracy                           0.84       157
   macro avg       0.83      0.85      0.84       157
weighted avg       0.85      0.84      0.84       157


ROC AUC Score: 0.9162397540983607




In [25]:
y_pred_ensemble = ensemble_model.predict(X_test_imputed_selected)
y_proba_ensemble = ensemble_model.predict_proba(X_test_imputed_selected)[:, 1]

print("Confusion Matrix:\n", confusion_matrix(y_test_imputed, y_pred_ensemble))
print("\nClassification Report:\n", classification_report(y_test_imputed, y_pred_ensemble))
print("\nROC AUC Score:", roc_auc_score(y_test_imputed, y_proba_ensemble))

Confusion Matrix:
 [[79 17]
 [ 8 53]]

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.82      0.86        96
           1       0.76      0.87      0.81        61

    accuracy                           0.84       157
   macro avg       0.83      0.85      0.84       157
weighted avg       0.85      0.84      0.84       157


ROC AUC Score: 0.9158982240437159




### 3.7.3 Training Summary

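The best model was the XGBoost classifier trained on the imputed dataset with all 12 base features plus the 2-way interactions among the 3 features `dmin`, `Year`, and `cdi` (15 columns in total). The reported ROC AUC for this model is 0.9687; the XGBoost run shown in 3.7.1 printed 0.9664, so the reported figure appears to come from a different run.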
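
To make the comparison easier to scan, the ROC AUC values printed in the evaluation cells above can be collected into one table. The figures below are copied from those outputs and rounded to four decimals; the model and feature labels are shorthand introduced here.

```python
import pandas as pd

# ROC AUC values as printed in the evaluation cells above (rounded)
results = pd.DataFrame(
    [
        ("Random Forest", "imputed, all features", 0.9617),
        ("Random Forest", "raw, all features", 0.9588),
        ("Random Forest", "imputed, selected", 0.9140),
        ("Random Forest", "raw, selected", 0.9466),
        ("AdaBoost", "imputed, all features", 0.9594),
        ("XGBoost", "imputed, all features", 0.9664),
        ("Soft-voting ensemble", "imputed, all features", 0.9662),
        ("AdaBoost", "imputed, selected", 0.9082),
        ("XGBoost", "imputed, selected", 0.9162),
        ("Soft-voting ensemble", "imputed, selected", 0.9159),
    ],
    columns=["model", "features", "roc_auc"],
)

# Rank all runs from best to worst ROC AUC
ranked = results.sort_values("roc_auc", ascending=False).reset_index(drop=True)
print(ranked.to_string(index=False))
```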