# 2.2 ETL Feature Engineering Notebook

Import pandas and NumPy for data loading and manipulation.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np

## Contents

1. Load Datasets: Raw vs. Imputed  
1.1 Drop Redundant or Constant Features  
1.2 Check Feature Distributions  

2. Feature Engineering  
2.1 Interaction Terms (e.g., cdi × nst)  
2.2 Binning and Scaling  
2.3 Correlation and VIF Analysis  

3. Export Transformed Dataset

## 1. Load Datasets: Raw vs. Imputed

We import two versions of the earthquake dataset to support dual-path benchmarking:

- **Raw Dataset** (`earthquake_data_tsunami.csv`):  
  Contains original features with missing values. Used for missingness analysis, imputation strategy comparison, and robustness testing.

- **Imputed Dataset** (`earthquake_imputed.csv`):  
  Contains preprocessed features with missing values filled. Used for feature engineering, model training, and performance benchmarking.

This dual import allows us to:
- Compare model performance with vs. without imputation
- Audit the impact of preprocessing on feature distributions and interactions
- Apply validator-grade logic to assess parsimony, drift, and attribution stability
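
As a rough illustration of the distribution audit described above, the sketch below compares the mean and standard deviation of shared numeric columns before and after imputation. The helper `compare_distributions`, the toy frames, and the median fill are assumptions made for this example only; the notebook's actual imputation method is not shown here. The column names `cdi` and `nst` are borrowed from the interaction-term example in the contents.

```python
import numpy as np
import pandas as pd


def compare_distributions(raw: pd.DataFrame, imputed: pd.DataFrame) -> pd.DataFrame:
    """Summarise mean/std of shared numeric columns before and after imputation.

    Hypothetical helper, not part of the original notebook.
    """
    # Only compare numeric columns that exist in both frames
    shared = [c for c in raw.select_dtypes(include="number").columns
              if c in imputed.columns]
    summary = pd.DataFrame({
        "raw_mean": raw[shared].mean(),          # NaNs are skipped by default
        "imputed_mean": imputed[shared].mean(),
        "raw_std": raw[shared].std(),
        "imputed_std": imputed[shared].std(),
    })
    # Positive shift means imputation pulled the mean upward
    summary["mean_shift"] = summary["imputed_mean"] - summary["raw_mean"]
    return summary


# Toy data standing in for the real datasets (values are made up)
toy_raw = pd.DataFrame({
    "cdi": [2.0, np.nan, 6.0, 8.0],
    "nst": [10.0, 20.0, np.nan, 60.0],
})
# Median fill is an assumption used only to make the example concrete
toy_imputed = toy_raw.fillna(toy_raw.median())

audit = compare_distributions(toy_raw, toy_imputed)
print(audit)
```

A large `mean_shift` or a noticeably smaller `imputed_std` for a column may indicate that imputation has compressed its distribution, which is worth checking before building interaction terms on it.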

In [None]:
# Load imputed dataset (used for feature engineering and ML benchmarking)
imputed_df = pd.read_csv("../data/processed/earthquake_imputed.csv")

# Load raw dataset (used for missingness analysis, imputation benchmarking, and model robustness testing)
raw_df = pd.read_csv("../data/raw/earthquake_data_tsunami.csv")
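
After loading, a quick sanity check can confirm that the two frames line up and that the imputed version has no remaining gaps. The following is a minimal sketch, assuming both files share the same columns; `missingness_report` and the toy frames are hypothetical and stand in for `raw_df` and `imputed_df`.

```python
import numpy as np
import pandas as pd


def missingness_report(raw: pd.DataFrame, imputed: pd.DataFrame) -> pd.DataFrame:
    """Count missing values per column in each frame (hypothetical helper)."""
    report = pd.DataFrame({
        "raw_missing": raw.isna().sum(),
        "imputed_missing": imputed.isna().sum(),
    })
    # Share of rows missing in the raw data, as a percentage
    report["raw_missing_pct"] = 100 * report["raw_missing"] / len(raw)
    return report.sort_values("raw_missing", ascending=False)


# Toy frames standing in for raw_df and imputed_df (values are made up)
toy_raw = pd.DataFrame({
    "cdi": [2.0, np.nan, 6.0, 8.0],
    "nst": [np.nan, 20.0, np.nan, 60.0],
})
toy_imputed = toy_raw.fillna(0.0)  # placeholder fill, for illustration only

report = missingness_report(toy_raw, toy_imputed)
print(report)
```

With the real data, the same call would be `missingness_report(raw_df, imputed_df)`; any nonzero `imputed_missing` value would suggest the imputation step missed a column.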