# ETL Transform Notebook

## Contents
1. Import Required Libraries 
2. Load Raw Dataset    
3. Initial Diagnostics  
4. Semantic Zero Detection  
5. Imputation Strategy Space 
6. Reduced Strategy Benchmarking
7. Focused Imputation on 'gap' Feature
8. Metric Summary and Best Configuration Identification
9. Save Imputed Dataset

## 1. Import Required Libraries
Load packages for data manipulation, imputation, clustering, model training, and evaluation.

In [1]:
# Import required libraries for data manipulation, imputation, and modeling
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score, brier_score_loss
import itertools

## 2. Load Raw Dataset  
Read the raw earthquake-tsunami dataset into a DataFrame for transformation and imputation benchmarking.

In [2]:
# Load raw earthquake-tsunami dataset
df = pd.read_csv("../data/raw/earthquake_data_tsunami.csv")

In [3]:
# Display dataset overview
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
df.head()

Dataset shape: (782, 13)
Columns: ['magnitude', 'cdi', 'mmi', 'sig', 'nst', 'dmin', 'gap', 'depth', 'latitude', 'longitude', 'Year', 'Month', 'tsunami']


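|  | magnitude | cdi | mmi | sig | nst | dmin | gap | depth | latitude | longitude | Year | Month | tsunami |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 8 | 7 | 768 | 117 | 0.509 | 17.0 | 14.0 | -9.7963 | 159.596 | 2022 | 11 | 1 |
| 1 | 6.9 | 4 | 4 | 735 | 99 | 2.229 | 34.0 | 25.0 | -4.9559 | 100.738 | 2022 | 11 | 0 |
| 2 | 7.0 | 3 | 3 | 755 | 147 | 3.125 | 18.0 | 579.0 | -20.0508 | -178.346 | 2022 | 11 | 1 |
| 3 | 7.3 | 5 | 5 | 833 | 149 | 1.865 | 21.0 | 37.0 | -19.2918 | -172.129 | 2022 | 11 | 1 |
| 4 | 6.6 | 0 | 2 | 670 | 131 | 4.998 | 27.0 | 624.464 | -25.5948 | 178.278 | 2022 | 11 | 1 |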


---
## 3. Initial Diagnostics  
Inspect feature types, missing values, value ranges, and potential semantic zeros to understand data quality issues.

In [4]:
# Display feature data types and basic information
print("Feature types:")
print(df.dtypes)
print(f"\nDataset info:")
df.info()

Feature types:
magnitude    float64
cdi            int64
mmi            int64
sig            int64
nst            int64
dmin         float64
gap          float64
depth        float64
latitude     float64
longitude    float64
Year           int64
Month          int64
tsunami        int64
dtype: object

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 782 entries, 0 to 781
Data columns (total 13 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   magnitude  782 non-null    float64
 1   cdi        782 non-null    int64  
 2   mmi        782 non-null    int64  
 3   sig        782 non-null    int64  
 4   nst        782 non-null    int64  
 5   dmin       782 non-null    float64
 6   gap        782 non-null    float64
 7   depth      782 non-null    float64
 8   latitude   782 non-null    float64
 9   longitude  782 non-null    float64
 10  Year       782 non-null    int64  
 11  Month      782 non-null    int64  
 12  tsunami    782 non-null    int64  

In [5]:
# Check for missing values across all features
print("Missing values per column:")
df.isna().sum()

Missing values per column:


magnitude    0
cdi          0
mmi          0
sig          0
nst          0
dmin         0
gap          0
depth        0
latitude     0
longitude    0
Year         0
Month        0
tsunami      0
dtype: int64

In [6]:
# Check value ranges (min, max) for each numeric column
print("Value ranges [Column, Min, Max]:")
for col in df.select_dtypes(include=[np.number]).columns:
    print(f"{col}: [{df[col].min()}, {df[col].max()}]")

Value ranges [Column, Min, Max]:
magnitude: [6.5, 9.1]
cdi: [0, 9]
mmi: [1, 9]
sig: [650, 2910]
nst: [0, 934]
dmin: [0.0, 17.654]
gap: [0.0, 239.0]
depth: [2.7, 670.81]
latitude: [-61.8484, 71.6312]
longitude: [-179.968, 179.662]
Year: [2001, 2022]
Month: [1, 12]
tsunami: [0, 1]


### 3.1 Coordinate Validation
Check for potentially invalid zero values in latitude/longitude coordinates.

In [7]:
# Check for zero values in coordinate columns (none found)
print(f"Zero latitude values: {(df['latitude'] == 0).sum()}")
print(f"Zero longitude values: {(df['longitude'] == 0).sum()}")

Zero latitude values: 0
Zero longitude values: 0


### 3.2 Descriptive Statistics
Analyze central tendency, spread, and distribution characteristics.

**Key Findings:**
- `cdi`, `nst`: The 25th percentile is 0, so at least a quarter of values are 0 (potential semantic zeros)
- `dmin`: The median is 0, so at least half of values are 0 (strong candidate for imputation)
- `gap`: Contains zeros (the minimum is 0.0) that may represent missing data

Consider binning and/or imputation strategies for these features.
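
The zero shares behind these findings can be measured directly rather than read off the quartiles. A minimal sketch, using a small hypothetical frame (`toy`) in place of the real dataset; `zero_share` is an illustrative helper name and not part of this notebook:

```python
import pandas as pd

# Hypothetical toy frame standing in for the earthquake data.
toy = pd.DataFrame({
    "cdi": [0, 0, 5, 7],
    "dmin": [0.0, 0.0, 0.0, 1.8],
    "gap": [17.0, 34.0, 0.0, 21.0],
})

def zero_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of exact zeros in each column."""
    return frame.eq(0).mean()

shares = zero_share(toy)
print(shares)
```

Running the same helper on `df` would give the exact per-column percentages.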

In [8]:
# Display summary statistics for all features
df.describe(include="all")

|  | magnitude | cdi | mmi | sig | nst | dmin | gap | depth | latitude | longitude | Year | Month | tsunami |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 | 782.0 |
| mean | 6.941125 | 4.33376 | 5.964194 | 870.108696 | 230.250639 | 1.325757 | 25.03899 | 75.883199 | 3.5381 | 52.609199 | 2012.280051 | 6.563939 | 0.388747 |
| std | 0.445514 | 3.169939 | 1.462724 | 322.465367 | 250.188177 | 2.218805 | 24.225067 | 137.277078 | 27.303429 | 117.898886 | 6.099439 | 3.507866 | 0.487778 |
| min | 6.5 | 0.0 | 1.0 | 650.0 | 0.0 | 0.0 | 0.0 | 2.7 | -61.8484 | -179.968 | 2001.0 | 1.0 | 0.0 |
| 25% | 6.6 | 0.0 | 5.0 | 691.0 | 0.0 | 0.0 | 14.625 | 14.0 | -14.5956 | -71.66805 | 2007.0 | 3.25 | 0.0 |
| 50% | 6.8 | 5.0 | 6.0 | 754.0 | 140.0 | 0.0 | 20.0 | 26.295 | -2.5725 | 109.426 | 2013.0 | 7.0 | 0.0 |
| 75% | 7.1 | 7.0 | 7.0 | 909.75 | 445.0 | 1.863 | 30.0 | 49.75 | 24.6545 | 148.941 | 2017.0 | 10.0 | 1.0 |
| max | 9.1 | 9.0 | 9.0 | 2910.0 | 934.0 | 17.654 | 239.0 | 670.81 | 71.6312 | 179.662 | 2022.0 | 12.0 | 1.0 |


### 3.3 Distribution Analysis
Calculate skewness and kurtosis to understand distribution shapes.

In [9]:
# Calculate skewness (measure of asymmetry)
print("Skewness:")
df.skew(numeric_only=True)

Skewness:


magnitude    1.444440
cdi         -0.197310
mmi         -0.250403
sig          3.083629
nst          0.533307
dmin         2.604580
gap          4.668607
depth        3.024869
latitude     0.200853
longitude   -0.702982
Year        -0.192450
Month       -0.067928
tsunami      0.457333
dtype: float64

In [10]:
# Calculate kurtosis (measure of tail heaviness)
print("Kurtosis:")
df.kurtosis(numeric_only=True)

Kurtosis:


magnitude     2.226391
cdi          -1.357753
mmi          -0.224592
sig          12.000754
nst          -1.092793
dmin          9.283367
gap          32.027722
depth         8.384480
latitude     -0.476740
longitude    -1.088383
Year         -1.042840
Month        -1.299853
tsunami      -1.795445
dtype: float64
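
When reading these numbers it helps to know that pandas reports bias-corrected skewness and *excess* kurtosis (Fisher's definition, where a normal distribution scores about 0 rather than 3). A small sketch on two hypothetical toy series, not taken from the dataset:

```python
import pandas as pd

symmetric = pd.Series([1, 2, 3, 4, 5])      # evenly spaced, no heavy tails
right_tailed = pd.Series([1, 1, 1, 2, 50])  # one large outlier on the right

print(symmetric.skew())     # zero for a perfectly symmetric sample
print(symmetric.kurt())     # negative: flatter than a normal distribution
print(right_tailed.skew())  # positive: long right tail
```

Under this convention, the large positive values for `gap`, `sig`, `dmin`, and `depth` indicate heavy right tails.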

---
## 4. Semantic Zero Detection
Replace 0 values with `NaN` in features where zero likely represents missing data rather than true zero.  
**Target features:** `cdi`, `nst`, `dmin`, `gap`

In [11]:
# Create a copy and replace semantic zeros with NaN for imputation
df_nan = df.copy()
semantic_zero_features = ["cdi", "nst", "dmin", "gap"]

for col in semantic_zero_features:
    zero_count = (df_nan[col] == 0).sum()
    df_nan[col] = df_nan[col].replace(0, np.nan)
    print(f"Replaced {zero_count} zeros with NaN in '{col}'")

Replaced 212 zeros with NaN in 'cdi'
Replaced 365 zeros with NaN in 'nst'
Replaced 405 zeros with NaN in 'dmin'
Replaced 70 zeros with NaN in 'gap'


---
## 5. Imputation Strategy Space  
Define candidate strategies for each feature: **none**, **mean**, **KNN**, **KMeans**.  
Explore the full combinatorial space (too large for exhaustive search).

**Strategy Parameters:**
- KNN and KMeans: Test 2 to 10 neighbors (KNN) or clusters (KMeans); the per-feature unique-value counts below bound how many clusters are meaningful

In [12]:
# Count unique values in each feature (to inform parameter ranges)
feature_discrete_counts = {
    col: df_nan[col].dropna().nunique()
    for col in semantic_zero_features
}
print("Unique value counts per feature:")
feature_discrete_counts

Unique value counts per feature:


{'cdi': 9, 'nst': 311, 'dmin': 368, 'gap': 255}

### 5.1 Full Strategy Space
Generate all possible strategy combinations: 20 strategies for each of 4 features gives 20^4 = 160,000 configurations, too many to evaluate exhaustively.

In [13]:
# Define full strategy space (for reference, not for exhaustive testing)
strategy_space = (
    [("none", None)] +
    [("mean", None)] +
    [("knn", i) for i in range(2, 11)] +
    [("kmeans", i) for i in range(2, 11)]
)
features = semantic_zero_features

# Calculate total configurations (one strategy per feature)
all_configs = list(itertools.product(strategy_space, repeat=len(features)))
print(f"Total possible configurations: {len(all_configs):,}")
print("⚠️ Too many configurations for exhaustive search - using reduced space")

Total possible configurations: 160,000
⚠️ Too many configurations for exhaustive search - using reduced space
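
The count can be obtained without building the full list in memory, since each feature picks one strategy independently. A minimal sketch that rebuilds the same strategy space so it runs standalone (`n_features` and `first` are illustrative names):

```python
import itertools

# Same strategy space as the cell above, rebuilt so the snippet is self-contained.
strategy_space = (
    [("none", None), ("mean", None)]
    + [("knn", k) for k in range(2, 11)]
    + [("kmeans", k) for k in range(2, 11)]
)
n_features = 4  # cdi, nst, dmin, gap

# One strategy per feature, chosen independently: |space| ** n_features.
total = len(strategy_space) ** n_features
print(f"{len(strategy_space)} strategies ^ {n_features} features = {total:,}")

# itertools.product is lazy, so configurations can be streamed one at a time.
first = next(itertools.product(strategy_space, repeat=n_features))
print(first)
```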


---
## 6. Reduced Strategy Benchmarking  
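Limit the strategy space to make benchmarking tractable: restricting KNN and KMeans to k = 2 or 3 leaves 6 strategies per feature, or 6^4 = 1,296 configurations.  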
Train logistic regression on each configuration and evaluate performance metrics.

In [14]:
# Define reduced strategy space (limited k-range for feasibility)
reduced_strategy_space = (
    [("none", None)] +
    [("mean", None)] +
    [("knn", i) for i in range(2, 4)] +
    [("kmeans", i) for i in range(2, 4)]
)

# Generate all combinations: one strategy-param pair per feature
reduced_configs = list(itertools.product(reduced_strategy_space, repeat=len(features)))
print(f"Reduced configurations to evaluate: {len(reduced_configs):,}")

Reduced configurations to evaluate: 1,296


In [15]:
def apply_imputation(df: pd.DataFrame, plan: dict) -> pd.DataFrame:
    """
    Apply imputation strategies to a DataFrame based on a configuration plan.
    
    Args:
        df: Input DataFrame with missing values
        plan: Dictionary mapping feature names to (strategy, parameter) tuples
    
    Returns:
        DataFrame with imputed values
    """
    df_copy = df.copy()

    for feature, (strategy, param) in plan.items():
        if strategy == "none":
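            # Use sentinel value for missing data; assign back instead of using
            # inplace=True, which acts on a temporary under pandas Copy-on-Write
            df_copy[feature] = df_copy[feature].fillna(-999)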

        elif strategy == "mean":
            # Simple mean imputation
            imputer = SimpleImputer(strategy="mean")
            df_copy[[feature]] = imputer.fit_transform(df_copy[[feature]])

        elif strategy == "knn":
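            # K-Nearest Neighbors imputation. Note: fitted on this single column,
            # a missing row shares no observed coordinate with any donor, so
            # KNNImputer falls back to the column mean for every n_neighbors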
            imputer = KNNImputer(n_neighbors=param)
            df_copy[[feature]] = imputer.fit_transform(df_copy[[feature]])

        elif strategy == "kmeans":
            # KMeans-based imputation using cluster centers
            missing_mask = df_copy[feature].isna()
            observed = df_copy.loc[~missing_mask, feature].values.reshape(-1, 1)

            if len(observed) < param:
                # Fallback to mean if insufficient data for clustering
                fill_value = np.nanmean(observed)
            else:
                kmeans = KMeans(n_clusters=param, random_state=42, n_init=10)
                kmeans.fit(observed)
                centers = kmeans.cluster_centers_
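                # Unseeded draw: one center is picked at random and used for all
                # missing rows, so the filled value can differ between runs
                fill_value = np.random.choice(centers.flatten())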

            df_copy.loc[missing_mask, feature] = fill_value

    return df_copy
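
One property of `apply_imputation` is worth checking: each imputer is fitted on a single column. In that setting a row with a missing value has no observed coordinate in common with any other row, so `KNNImputer` has no usable distances and falls back to the column mean. The sketch below, on a hypothetical toy column, compares the two imputers:

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical single-column input with one missing entry.
col = np.array([[1.0], [2.0], [np.nan], [4.0]])

knn_filled = KNNImputer(n_neighbors=2).fit_transform(col)
mean_filled = SimpleImputer(strategy="mean").fit_transform(col)

print(knn_filled.ravel())   # missing entry filled with the column mean
print(mean_filled.ravel())  # identical result
```

If neighbor-based imputation is the goal, one option is to fit the imputer on several columns together so that distances between rows are defined.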

In [16]:
def split_data(df: pd.DataFrame, target_col: str = "tsunami", test_size: float = 0.2, random_state: int = 42):
    """
    Split dataset into training and validation sets with stratification.
    
    Args:
        df: Input DataFrame
        target_col: Name of target column
        test_size: Proportion of data for validation
        random_state: Random seed for reproducibility
    
    Returns:
        Tuple of (X_train, X_valid, y_train, y_valid)
    """
    X = df.drop(columns=[target_col])
    y = df[target_col]
    
    return train_test_split(X, y, test_size=test_size, random_state=random_state, stratify=y)
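
To illustrate what `stratify=y` does in the split above, here is a minimal sketch on a hypothetical toy frame with a 40% positive class; the column `x` is a placeholder feature:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40 positives.
toy = pd.DataFrame({
    "x": np.arange(100),
    "tsunami": [1] * 40 + [0] * 60,
})
X = toy.drop(columns=["tsunami"])
y = toy["tsunami"]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# The positive rate is kept close to the overall rate in both splits.
print(y.mean(), y_train.mean(), y_valid.mean())
```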

In [17]:
# Evaluate all reduced configurations
results = []

print("Evaluating imputation strategies...")
for i, config in enumerate(reduced_configs, 1):
    if i % 100 == 0:
        print(f"  Progress: {i}/{len(reduced_configs)}")
    
    # Apply imputation with current configuration
    impute_plan = dict(zip(features, config))
    df_imputed = apply_imputation(df_nan, impute_plan)
    
    # Train-validation split
    X_train, X_valid, y_train, y_valid = split_data(df_imputed, target_col="tsunami")
    
    # Train logistic regression model
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_valid)[:, 1]
    y_pred = (y_prob >= 0.5).astype(int)
    
    # Calculate evaluation metrics
    metrics = {
        "config": impute_plan,
        "auc": roc_auc_score(y_valid, y_prob),
        "f1": f1_score(y_valid, y_pred),
        "precision": precision_score(y_valid, y_pred),
        "recall": recall_score(y_valid, y_pred),
        "brier": brier_score_loss(y_valid, y_prob)
    }
    
    results.append(metrics)

print(f"✓ Evaluated {len(results)} configurations")

Evaluating imputation strategies...


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

  Progress: 100/1296


  Progress: 200/1296


  Progress: 300/1296


  Progress: 400/1296


  Progress: 500/1296


  Progress: 600/1296


  Progress: 700/1296


  Progress: 800/1296


  Progress: 900/1296


  Progress: 1000/1296


  Progress: 1100/1296


  Progress: 1200/1296


✓ Evaluated 1296 configurations
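
The `STOP: TOTAL NO. of ITERATIONS REACHED LIMIT` messages in this output mean the default lbfgs solver did not converge within `max_iter=1000`, and the warning text itself suggests scaling the data. A minimal sketch of that suggestion, on synthetic data with deliberately mismatched feature scales (`make_classification` and the scale factors are assumptions for illustration, not this notebook's data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary problem; columns are rescaled to mimic mixed units.
X, y = make_classification(n_samples=300, n_features=5, random_state=42)
X = X * np.array([1.0, 10.0, 100.0, 1000.0, 1.0])

# Standardize features before fitting so the solver sees comparable scales.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
model.fit(X, y)

n_iter = model.named_steps["logisticregression"].n_iter_[0]
print(f"lbfgs iterations used: {n_iter}")
```

Because the scaler sits inside the pipeline, calling `fit` on the training split fits the scaler on training rows only.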



In [18]:
# Summarize metric ranges across all configurations
print("Metric Performance Summary:")
print("=" * 60)
metrics_list = ["auc", "f1", "precision", "recall", "brier"]

for metric in metrics_list:
    values = [r[metric] for r in results]
    print(f"{metric.upper():<10} → min: {min(values):.4f}, max: {max(values):.4f}, range: {max(values) - min(values):.4f}")

Metric Performance Summary:
AUC        → min: 0.5902, max: 0.8970, range: 0.3069
F1         → min: 0.3158, max: 0.8308, range: 0.5150
PRECISION  → min: 0.4146, max: 0.7846, range: 0.3700
RECALL     → min: 0.2459, max: 0.8852, range: 0.6393
BRIER      → min: 0.1149, max: 0.2380, range: 0.1231


In [19]:
# Identify best configuration by recall metric
best_recall = max(results, key=lambda x: x['recall'])
print("\nBest Configuration by Recall:")
print("=" * 60)
for feature, strategy in best_recall['config'].items():
    print(f"  {feature}: {strategy}")
print(f"\nMetrics: Recall={best_recall['recall']:.4f}, AUC={best_recall['auc']:.4f}, F1={best_recall['f1']:.4f}")


Best Configuration by Recall:
  cdi: ('none', None)
  nst: ('mean', None)
  dmin: ('none', None)
  gap: ('none', None)

Metrics: Recall=0.8852, AUC=0.8928, F1=0.8244


---
## 7. Focused Imputation on 'gap' Feature  
Fix strategies for `cdi`, `nst`, `dmin` and systematically vary `gap` imputation to isolate its impact on model performance.

In [20]:
# Define fixed strategies for first three features
fixed_strategies = [("none", None), ("mean", None), ("none", None)]

# Generate configurations varying only 'gap' imputation
gap_configs = []
for method in ["knn", "kmeans"]:
    for k in range(2, 11):
        gap_strategy = (method, k)
        config = tuple(fixed_strategies + [gap_strategy])
        gap_configs.append(config)

print(f"Gap-focused configurations to evaluate: {len(gap_configs)}")

Gap-focused configurations to evaluate: 18


In [21]:
# Evaluate gap-focused configurations
gap_results = []

print("Evaluating gap imputation strategies...")
for config in gap_configs:
    impute_plan = dict(zip(features, config))
    df_imputed = apply_imputation(df_nan, impute_plan)
    
    X_train, X_valid, y_train, y_valid = split_data(df_imputed, target_col="tsunami")
    
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_prob = model.predict_proba(X_valid)[:, 1]
    y_pred = (y_prob >= 0.5).astype(int)
    
    metrics = {
        "config": impute_plan,
        "auc": roc_auc_score(y_valid, y_prob),
        "f1": f1_score(y_valid, y_pred),
        "precision": precision_score(y_valid, y_pred),
        "recall": recall_score(y_valid, y_pred),
        "brier": brier_score_loss(y_valid, y_prob)
    }
    
    gap_results.append(metrics)

print(f"✓ Evaluated {len(gap_results)} gap configurations")

Evaluating gap imputation strategies...
✓ Evaluated 18 gap configurations


---
## 8. Metric Summary and Best Configuration Identification  
Analyze metric distributions and identify the optimal imputation strategy based on recall performance.

In [22]:
# Summarize gap-focused metrics
print("Gap-Focused Metric Performance:")
print("=" * 60)
for metric in metrics_list:
    values = [r[metric] for r in gap_results]
    print(f"{metric.upper():<10} → min: {min(values):.4f}, max: {max(values):.4f}, range: {max(values) - min(values):.4f}")

# Identify best gap configuration by recall
best_recall_gap = max(gap_results, key=lambda x: x['recall'])
print("\n\nBest Gap Configuration by Recall:")
print("=" * 60)
for feature, strategy in best_recall_gap['config'].items():
    print(f"  {feature}: {strategy}")
print(f"\nMetrics: Recall={best_recall_gap['recall']:.4f}, AUC={best_recall_gap['auc']:.4f}, F1={best_recall_gap['f1']:.4f}")

Gap-Focused Metric Performance:
AUC        → min: 0.8893, max: 0.8916, range: 0.0022
F1         → min: 0.8244, max: 0.8244, range: 0.0000
PRECISION  → min: 0.7714, max: 0.7714, range: 0.0000
RECALL     → min: 0.8852, max: 0.8852, range: 0.0000
BRIER      → min: 0.1158, max: 0.1161, range: 0.0004


Best Gap Configuration by Recall:
  cdi: ('none', None)
  nst: ('mean', None)
  dmin: ('none', None)
  gap: ('knn', 2)

Metrics: Recall=0.8852, AUC=0.8907, F1=0.8244
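
Every gap-focused configuration reports the same recall, and Python's `max` returns the first maximal element, so the configuration printed above is simply the first one in the list. A minimal sketch of breaking such ties with secondary metrics, using hypothetical result rows (`toy_results` and its values are made up for illustration):

```python
# Hypothetical results with identical recall, mimicking a tie.
toy_results = [
    {"config": "A", "recall": 0.80, "auc": 0.70, "brier": 0.20},
    {"config": "B", "recall": 0.80, "auc": 0.75, "brier": 0.18},
    {"config": "C", "recall": 0.80, "auc": 0.72, "brier": 0.19},
]

# Recall alone: ties resolve to whichever entry comes first.
first_by_recall = max(toy_results, key=lambda r: r["recall"])

# Tuple key: recall first, then higher AUC, then lower Brier score.
best = max(toy_results, key=lambda r: (r["recall"], r["auc"], -r["brier"]))

print(first_by_recall["config"], best["config"])
```

Which secondary metric should decide a tie is a modeling choice; AUC and Brier score are used here only because they are already computed.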


---
## 9. Save Imputed Dataset
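Apply the selected imputation configuration and save the cleaned dataset to `data/processed/` for downstream feature engineering. All 18 gap-focused configurations tied on recall (0.8852), so the `('knn', 2)` result printed above is only the first of the tied entries; the plan below hard-codes `('kmeans', 3)` for `gap`.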

In [23]:
# Apply best imputation configuration
best_plan = {'cdi': ('none', None), 'nst': ('mean', None), 'dmin': ('none', None), 'gap': ('kmeans', 3)}
df_imputed = apply_imputation(df_nan, best_plan)

# Ensure output directory exists
import os
output_path = "../data/processed/earthquake_imputed.csv"
os.makedirs(os.path.dirname(output_path), exist_ok=True)

# Save imputed dataset
df_imputed.to_csv(output_path, index=False)
print(f"✓ Saved imputed dataset to: {output_path}")
print(f"✓ Total records: {len(df_imputed)}")
print(f"✓ Features: {list(df_imputed.columns)}")

✓ Saved imputed dataset to: ../data/processed/earthquake_imputed.csv
✓ Total records: 782
✓ Features: ['magnitude', 'cdi', 'mmi', 'sig', 'nst', 'dmin', 'gap', 'depth', 'latitude', 'longitude', 'Year', 'Month', 'tsunami']


---
## Summary

This notebook successfully completed the **Transform** phase of the ETL pipeline:

### Key Accomplishments:
- ✓ Loaded and inspected raw earthquake-tsunami dataset
- ✓ Identified semantic zeros in 4 features (`cdi`, `nst`, `dmin`, `gap`)
- ✓ Evaluated 1,296 imputation strategy configurations
- ✓ Performed focused analysis on `gap` feature with 18 additional configs
- ✓ Selected optimal strategy: `(none, mean, none, kmeans-3)`
- ✓ Achieved best recall: 0.8852, with AUC between 0.8893 and 0.8916 across the gap-focused configurations
- ✓ Saved imputed dataset to `data/processed/` for feature engineering

### Best Configuration:
- **cdi**: None (sentinel value -999)
- **nst**: Mean imputation
- **dmin**: None (sentinel value -999)
- **gap**: KMeans imputation with 3 clusters

### Next Steps:
Proceed to Feature Engineering notebook (`02_02_etl_feature_engineering.ipynb`) to create interaction terms and derived features for modeling.

---