# 2.2 ETL Feature Engineering Notebook

Load essential packages for data access, manipulation, and file handling.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np

## Contents

1. Load Datasets: Raw vs. Imputed  
1.1 Drop Redundant or Constant Features  
1.2 Check Feature Distributions  

2. Feature Engineering  
2.1 Interaction Terms (e.g., cdi × nst)  
2.2 Binning and Scaling  
2.3 Correlation and VIF Analysis  

3. Export Transformed Dataset

## 1. Load Datasets: Raw vs. Imputed

We import two versions of the earthquake dataset to support dual-path benchmarking:

- **Raw Dataset** (`earthquake_data_tsunami.csv`):  
  Contains original features with missing values. Used for missingness analysis, imputation strategy comparison, and robustness testing.

- **Imputed Dataset** (`earthquake_imputed.csv`):  
  Contains preprocessed features with missing values filled. Used for feature engineering, model training, and performance benchmarking.

This dual import allows us to:
- Compare model performance with vs. without imputation
- Audit the impact of preprocessing on feature distributions and interactions
- Apply validator-grade logic to assess parsimony, drift, and attribution stability

In [2]:
# Load imputed dataset (used for feature engineering and ML benchmarking)
imputed_df = pd.read_csv("../data/processed/earthquake_imputed.csv")

# Load raw dataset (used for missingness analysis, imputation benchmarking, and model robustness testing)
raw_df = pd.read_csv("../data/raw/earthquake_data_tsunami.csv")
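
A quick check after loading is to confirm that imputation actually removed the gaps present in the raw data. The sketch below is illustrative only: `compare_missingness` is a hypothetical helper, it assumes the two files share at least some column names, and the toy frames (with made-up values) stand in for `raw_df` and `imputed_df`.

```python
import numpy as np
import pandas as pd

def compare_missingness(raw, imputed):
    """Return per-column missing-value counts for the raw and imputed frames."""
    # Only compare columns present in both frames (assumed to share names)
    shared = [c for c in raw.columns if c in imputed.columns]
    return pd.DataFrame({
        "raw_missing": raw[shared].isna().sum(),
        "imputed_missing": imputed[shared].isna().sum(),
    })

# Toy stand-ins for raw_df and imputed_df (values are made up)
toy_raw = pd.DataFrame({"cdi": [1.0, np.nan, 3.0], "nst": [10.0, 20.0, np.nan]})
toy_imputed = toy_raw.fillna(toy_raw.median())  # median fill is an assumption

report = compare_missingness(toy_raw, toy_imputed)
print(report)
```

On the real data, the call would be `compare_missingness(raw_df, imputed_df)`; any non-zero entry in `imputed_missing` would point to a column the imputation step missed.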

## 1.1 Drop Redundant or Constant Features

We perform a diagnostic audit to identify features that may be redundant, constant, or quasi-constant. This step supports the principle of parsimony and prepares for downstream correlation and VIF analysis.

No features are dropped at this stage; we only log candidates for potential removal based on:

- **Zero variance**: Features with identical values across all rows
- **Quasi-constant**: Features dominated by a single value (e.g., >99% frequency)
- **Duplicate columns**: Identical feature vectors
- **Manual flags**: ID-like or metadata columns with no predictive value

This audit is applied to both the raw and imputed datasets to support dual-path benchmarking.
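
The manual-flag criterion relies on judgment, but a simple heuristic can surface ID-like candidates for review. The following is a minimal sketch under stated assumptions: `flag_id_like` and its `unique_ratio` parameter are hypothetical, the column names in the toy frame are invented, and float columns are skipped because continuous measurements are often all-unique without being identifiers.

```python
import pandas as pd

def flag_id_like(df, unique_ratio=1.0):
    """List non-float columns whose share of distinct values reaches unique_ratio."""
    candidates = []
    for col in df.columns:
        if pd.api.types.is_float_dtype(df[col]):
            continue  # skip continuous measurements
        if len(df) > 0 and df[col].nunique() / len(df) >= unique_ratio:
            candidates.append(col)
    return candidates

# Toy frame with hypothetical column names and made-up values
toy = pd.DataFrame({
    "event_id": [101, 102, 103, 104],    # unique per row, ID-like
    "magnitude": [6.5, 7.0, 6.8, 7.2],   # float, skipped
    "tsunami": [0, 1, 0, 1],             # repeated values
})
id_candidates = flag_id_like(toy)
print(id_candidates)
```

The result is a list of candidates to inspect by hand, not a list of columns to drop automatically.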

In [3]:
# Define audit function for constant, quasi-constant, and duplicate features
def audit_redundant_features(df, name="dataset", quasi_thresh=0.99):
    print(f"\n🔍 Auditing {name} for redundant features...")

    # Constant features (zero variance)
    constant_cols = [col for col in df.columns if df[col].nunique() == 1]
    print(f"🧱 Constant features: {constant_cols}")

    # Quasi-constant features
    quasi_cols = []
    for col in df.columns:
        top_freq = df[col].value_counts(normalize=True, dropna=False).max()
        if top_freq > quasi_thresh and col not in constant_cols:
            quasi_cols.append(col)
    print(f"🧊 Quasi-constant features (>{quasi_thresh*100:.0f}% dominance): {quasi_cols}")

    # Duplicate columns
    dup_cols = df.T[df.T.duplicated()].index.tolist()
    print(f"📎 Duplicate columns: {dup_cols}")

    # Summary
    flagged = set(constant_cols + quasi_cols + dup_cols)
    print(f"⚠️ Total flagged features in {name}: {len(flagged)}")
    return flagged

# Run audits on both datasets
flagged_raw = audit_redundant_features(raw_df, name="Raw Dataset")
flagged_imputed = audit_redundant_features(imputed_df, name="Imputed Dataset")


🔍 Auditing Raw Dataset for redundant features...
🧱 Constant features: []
🧊 Quasi-constant features (>99% dominance): []
📎 Duplicate columns: []
⚠️ Total flagged features in Raw Dataset: 0

🔍 Auditing Imputed Dataset for redundant features...
🧱 Constant features: []
🧊 Quasi-constant features (>99% dominance): []
📎 Duplicate columns: []
⚠️ Total flagged features in Imputed Dataset: 0


### Audit Summary: No Redundant Features Found

The audit identified **no constant, quasi-constant, or duplicate features** in either the raw or imputed dataset.

This means:
- No feature is constant, dominated by a single value at the 99% threshold, or an exact copy of another column
- There are no immediate candidates for removal on grounds of parsimony or redundancy
- Feature pruning is deferred to the downstream correlation and VIF analysis (Section 2.3)

The audit only detects exact duplicates and near-constant columns. Strongly correlated features can still be present, which is why the correlation and VIF checks remain necessary.
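
An empty audit result is easier to trust once the same checks are shown to fire on data with known problems. The sketch below applies the three criteria to a small synthetic frame. `redundancy_flags` is a hypothetical, print-free condensation of the audit logic, and every column in the frame is invented for the test.

```python
import pandas as pd

def redundancy_flags(df, quasi_thresh=0.99):
    """Return constant, quasi-constant, and duplicate column names."""
    constant = [c for c in df.columns if df[c].nunique() == 1]
    quasi = [
        c for c in df.columns
        if c not in constant
        and df[c].value_counts(normalize=True, dropna=False).max() > quasi_thresh
    ]
    # A column is a duplicate if an earlier column holds identical values
    duplicate = df.T[df.T.duplicated()].index.tolist()
    return {"constant": constant, "quasi_constant": quasi, "duplicate": duplicate}

n = 200
synthetic = pd.DataFrame({
    "varied": range(n),                    # all values distinct
    "all_same": [1] * n,                   # zero variance
    "mostly_zero": [0] * (n - 1) + [1],    # 99.5% zeros
})
synthetic["varied_copy"] = synthetic["varied"]  # exact duplicate column

flags = redundancy_flags(synthetic)
print(flags)
```

Each planted defect should appear under exactly one key, while `varied` should not be flagged at all.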