# 001 — Feature Engineering & Selection

Comprehensive, plot-heavy feature engineering and selection pipeline.
Given an input parquet dataset and a declared target column, this notebook
produces curated feature recommendations, quality metrics, engineered
features, and a full suite of diagnostic plots.

**Lifecycle stage:** seedling (model-garden)

All code is self-contained in this notebook — no external library imports
from a shared `src/` package.

## Feature selection philosophy

1. **Relevance** — keep features that carry signal about the target.
2. **Stability** — prefer features whose importance is consistent across
   CV folds and time periods.
3. **Leakage avoidance** — flag and remove features that would not be
   available at prediction time or that encode the target directly.
4. **Parsimony** — fewer, stronger features beat a large, noisy set.
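
The stability criterion can be made concrete by measuring how much a feature's importance varies from fold to fold. The sketch below is illustrative only: `stability_report`, the coefficient-of-variation cutoff of 0.5, and the per-fold importances are hypothetical and are not taken from this notebook's cells.

```python
from statistics import mean, pstdev


def stability_report(fold_importances: dict[str, list[float]],
                     max_cv: float = 0.5) -> dict[str, dict]:
    """Summarize per-feature importance across CV folds.

    A feature is labelled stable when its coefficient of variation
    (std / |mean|) is at most `max_cv`. The cutoff is an assumption.
    """
    out = {}
    for feat, vals in fold_importances.items():
        mu = mean(vals)
        sd = pstdev(vals)
        cv = sd / abs(mu) if mu != 0 else float("inf")
        out[feat] = {"mean": mu, "cv": cv, "stable": cv <= max_cv}
    return out


# Hypothetical importances from three folds
demo = stability_report({
    "num_1": [0.30, 0.32, 0.28],   # consistent across folds
    "num_9": [0.40, 0.01, 0.02],   # dominated by a single fold
})
for feat, summary in demo.items():
    print(feat, summary)
```

A feature like `num_9` has a respectable mean importance but would be flagged, because most of that importance comes from one fold.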

## Outputs produced

| Artifact | Path |
|---|---|
| Metrics JSON | `outputs/metrics/feature_report.json` |
| Plots | `outputs/plots/*.png` |
| Transformed features parquet | `outputs/features/features.parquet` |
| Executed notebook | `outputs/runs/<timestamp>_executed.ipynb` |

## Running with papermill

```bash
uv run papermill notebooks/001_feature_engineering_selection.ipynb out.ipynb \
    -p target_col "target" \
    -p task_type "binary"
```
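
The same run can be launched from Python through `papermill.execute_notebook`. This is a minimal sketch: the helper names and the timestamp format are assumptions chosen to match the `outputs/runs/<timestamp>_executed.ipynb` pattern in the table above, not something the notebook itself defines.

```python
from datetime import datetime, timezone
from pathlib import Path


def executed_notebook_path(output_dir: str = "outputs") -> str:
    """Build a timestamped output path (the timestamp format is an assumption)."""
    ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{output_dir}/runs/{ts}_executed.ipynb"


def run_notebook(parameters: dict, output_dir: str = "outputs") -> str:
    """Execute the notebook with the given papermill parameters."""
    import papermill as pm  # imported lazily so the path helper has no dependency

    out_path = executed_notebook_path(output_dir)
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    pm.execute_notebook(
        "notebooks/001_feature_engineering_selection.ipynb",
        out_path,
        parameters=parameters,
    )
    return out_path


params = {"target_col": "target", "task_type": "binary"}
# run_notebook(params)  # requires papermill and the notebook file on disk
```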

In [None]:
# ---------------------------------------------------------------------------
# Papermill parameters  (this cell is tagged "parameters")
# ---------------------------------------------------------------------------

# Data + schema
input_parquet_paths: list[str] = []       # local or gs:// URIs; empty -> synthetic
output_dir: str = "outputs"
plots_dir: str = "outputs/plots"
metrics_json_path: str = "outputs/metrics/feature_report.json"
output_features_parquet_path: str | None = "outputs/features/features.parquet"
executed_notebook_path: str | None = None
target_col: str = "target"
id_cols: list[str] = []                   # entity IDs / keys
time_col: str | None = None              # datetime column for time-aware features
group_cols: list[str] = []               # entity grouping keys for aggregation
categorical_cols: list[str] | None = None  # None -> infer via dtype / unique count
numeric_cols: list[str] | None = None      # None -> infer
text_cols: list[str] | None = None         # optional lightweight text features
drop_cols: list[str] = []                  # columns to always drop
max_rows_for_eda: int = 200_000            # sample for heavy plots
sample_seed: int = 42

# Missingness + outliers
missingness_drop_threshold: float = 0.98
high_cardinality_threshold: int = 500
rare_category_min_count: int = 20
winsorize_limits: list[float] = [0.01, 0.99]
enable_outlier_clipping: bool = True

# Feature generation toggles
enable_interactions: bool = True
enable_polynomial: bool = False
enable_group_aggregations: bool = True
enable_time_features: bool = True
enable_target_encoding: bool = False
enable_mutual_info: bool = True
enable_permutation_importance: bool = True
enable_shap: bool = False
enable_stability_checks: bool = True

# Selection + evaluation
task_type: str = "binary"                  # binary | multiclass | regression
test_size: float = 0.2
random_state: int = 42
stratify: bool = True
baseline_model: str = "logreg"             # logreg | catboost_if_available
cv_folds: int = 3
scoring_metric: str = "f1"                 # f1 | f1_macro | rmse

In [None]:
# ---------------------------------------------------------------------------
# Imports & setup
# ---------------------------------------------------------------------------
import json
import math
import os
import random
import warnings
from datetime import date, datetime, timedelta, timezone
from pathlib import Path

warnings.filterwarnings("ignore")
os.environ["PYTHONWARNINGS"] = "ignore"

import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import polars as pl

# sklearn / scipy are used ONLY at model / statistical-test boundaries.
# Every data-wrangling step uses Polars.
import numpy as np  # only for sklearn interop and matplotlib arrays
from scipy import stats as sp_stats
from sklearn.feature_selection import (
    chi2,
    f_classif,
    f_regression,
    mutual_info_classif,
    mutual_info_regression,
)
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.metrics import (
    f1_score,
    roc_auc_score,
    mean_squared_error,
)
from sklearn.model_selection import (
    StratifiedKFold,
    KFold,
    train_test_split,
)
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

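# Ensure output dirs exist (derived from the parameters so papermill overrides are honored)
_out_dirs = {f"{output_dir}/runs", plots_dir, str(Path(metrics_json_path).parent)}
if output_features_parquet_path:
    _out_dirs.add(str(Path(output_features_parquet_path).parent))
for d in _out_dirs:
    Path(d).mkdir(parents=True, exist_ok=True)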

RUN_TS = datetime.now(timezone.utc).isoformat()
print(f"Run started at {RUN_TS}")

# Matplotlib defaults
plt.rcParams.update({
    "figure.figsize": (10, 5),
    "figure.dpi": 120,
    "axes.titlesize": 13,
    "axes.labelsize": 11,
    "xtick.labelsize": 9,
    "ytick.labelsize": 9,
    "figure.facecolor": "white",
})

In [None]:
# ---------------------------------------------------------------------------
# Helper utilities (Polars-first)
# ---------------------------------------------------------------------------

def save_plot(fig, name: str) -> str:
    path = f"{plots_dir}/{name}"
    fig.savefig(path, bbox_inches="tight", facecolor="white")
    plt.show()
    plt.close(fig)
    print(f"  -> saved {path}")
    return path


def safe_sample(df: pl.DataFrame, n: int, seed: int = 42) -> pl.DataFrame:
    if len(df) <= n:
        return df
    return df.sample(n=n, seed=seed)


def polars_col_to_list(df: pl.DataFrame, col: str) -> list:
    """Extract a column as a Python list (for matplotlib)."""
    return df[col].to_list()


def polars_to_numpy_for_sklearn(
    df: pl.DataFrame,
    feature_cols: list[str],
    target: str,
    cat_cols: list[str] | None = None,
) -> tuple:
    """Convert Polars DataFrame to numpy arrays ONLY for sklearn.

    This is the single boundary where we leave Polars. All data prep
    (fill_null, casting, encoding) happens in Polars first.
    """
    # Fill nulls and cast in Polars before conversion
    exprs = []
    for c in feature_cols:
        dt = df[c].dtype
        if dt in (pl.Utf8, pl.String, pl.Categorical):
            exprs.append(pl.col(c).fill_null("__MISSING__"))
        elif dt.is_numeric():
            exprs.append(pl.col(c).fill_null(pl.col(c).median()))
        else:
            exprs.append(pl.col(c).fill_null(pl.lit(0)))
    sub = df.select(exprs + [pl.col(target)])

    # Ordinal-encode string columns in Polars
    actual_cats = [c for c in (cat_cols or []) if c in feature_cols and sub[c].dtype in (pl.Utf8, pl.String)]
    if actual_cats:
        for c in actual_cats:
            # Map each unique value to an integer
            uniques = sub[c].unique().sort()
            mapping = {v: i for i, v in enumerate(uniques.to_list())}
            sub = sub.with_columns(
                pl.col(c).replace_strict(mapping, default=-1).cast(pl.Float64).alias(c)
            )

    # All columns should now be numeric — cast to Float64 and convert
    num_exprs = [pl.col(c).cast(pl.Float64) for c in feature_cols]
    X = sub.select(num_exprs).to_numpy()
    y = sub[target].to_numpy()
    # Replace any remaining NaN/inf (safety net)
    X = np.nan_to_num(X, nan=0.0, posinf=0.0, neginf=0.0)
    return X, y


# Tracking containers
report: dict = {
    "run_metadata": {
        "timestamp": RUN_TS,
        "task_type": task_type,
        "target_col": target_col,
    },
    "column_audit": [],
    "dropped_features": [],
    "univariate_scores": {},
    "consensus_ranking": [],
    "model_based_importance": {},
    "selection_frequency": {},
    "top_k_performance": [],
    "recommended_features": [],
    "engineered_features_created": [],
    "high_risk_features": [],
}

---
## B — Load Data (Polars-first)

Load parquet file(s) from `input_parquet_paths`. Supports local paths and
`gs://` URIs via `gcsfs`. If no paths are provided, generate a synthetic
dataset for demonstration.
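
Before reading anything, it can help to classify the configured URIs so that a typo in a local path fails early rather than midway through loading. The sketch below uses only the standard library. `preflight_paths` and the example URIs are hypothetical, and `gs://` entries are only labelled, since checking them would need `gcsfs`.

```python
from pathlib import Path


def preflight_paths(paths: list[str]) -> dict[str, list[str]]:
    """Split input URIs into remote, existing local, and missing local paths."""
    result: dict[str, list[str]] = {"remote": [], "local": [], "missing": []}
    for p in paths:
        if p.startswith("gs://"):
            result["remote"].append(p)      # not verified here
        elif Path(p).is_file():
            result["local"].append(p)
        else:
            result["missing"].append(p)
    return result


check = preflight_paths(["gs://bucket/part-0.parquet", "no_such_file.parquet"])
# Mirror the notebook's rule: an empty path list means "use synthetic data"
use_synthetic = not (check["remote"] or check["local"])
print(check, use_synthetic)
```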

In [None]:
# ---------------------------------------------------------------------------
# B — Data Loading (pure Polars)
# ---------------------------------------------------------------------------

def load_parquet_polars(paths: list[str] | str) -> pl.DataFrame:
    if isinstance(paths, str):
        paths = [paths]
    frames = []
    for p in paths:
        if p.startswith("gs://"):
            import gcsfs
            fs = gcsfs.GCSFileSystem()
            with fs.open(p, "rb") as f:
                frames.append(pl.read_parquet(f))
        else:
            frames.append(pl.read_parquet(p))
    return pl.concat(frames) if len(frames) > 1 else frames[0]


def generate_synthetic(task: str, n: int = 5000, seed: int = 42) -> pl.DataFrame:
    """Create a synthetic dataset entirely in Polars."""
    random.seed(seed)

    # Build numeric features via sklearn (returns numpy), wrap once
    from sklearn.datasets import make_classification, make_regression
    if task == "regression":
        X, y = make_regression(n_samples=n, n_features=20, n_informative=10,
                               noise=10.0, random_state=seed)
    else:
        n_classes = 2 if task == "binary" else 4
        X, y = make_classification(n_samples=n, n_features=20, n_informative=10,
                                   n_redundant=3, n_classes=n_classes,
                                   weights=[0.7, 0.3] if task == "binary" else None,
                                   random_state=seed)

    # Wrap numeric features into a Polars DataFrame in one shot
    df = pl.DataFrame(
        {f"num_{i}": pl.Series(X[:, i]) for i in range(X.shape[1])}
    ).with_columns(
        pl.Series("target", y.astype(float) if task == "regression" else y),
    )

    # Categorical columns — built as plain Python lists
    df = df.with_columns([
        pl.Series("cat_region",  [random.choice(["US", "EU", "APAC", "LATAM", "MEA"]) for _ in range(n)]),
        pl.Series("cat_channel", [random.choice(["web", "mobile", "api", "partner"]) for _ in range(n)]),
        pl.Series("cat_tier",    [random.choice(["free", "basic", "premium"]) for _ in range(n)]),
    ])

    # Rare-category column (v0 is dominant)
    rare_choices = [f"v{i}" for i in range(50)]
    rare_weights = [0.5] + [0.5 / 49] * 49
    df = df.with_columns(
        pl.Series("cat_rare", random.choices(rare_choices, weights=rare_weights, k=n)),
    )

    # Inject nulls into some numeric columns using Polars
    df = df.with_columns([
        pl.when(pl.int_range(pl.len()).mod(7) == 0)
          .then(None)
          .otherwise(pl.col("num_0"))
          .alias("num_0"),
        pl.when(pl.int_range(pl.len()).mod(6) == 0)
          .then(None)
          .otherwise(pl.col("num_3"))
          .alias("num_3"),
        pl.when(pl.int_range(pl.len()).mod(8) == 0)
          .then(None)
          .otherwise(pl.col("num_7"))
          .alias("num_7"),
    ])

    # High-null column (99% null) — Polars native
    df = df.with_columns(
        pl.when(pl.int_range(pl.len()).mod(100) == 0)
          .then(pl.lit(0.5))
          .otherwise(None)
          .cast(pl.Float64)
          .alias("num_mostly_null"),
    )

    # Text column
    words = ["great", "bad", "ok", "excellent", "poor", "fine", "amazing", "terrible"]
    df = df.with_columns(
        pl.Series("text_feedback", [
            " ".join(random.choices(words, k=random.randint(3, 14)))
            for _ in range(n)
        ]),
    )

    # Time column — pure Python dates wrapped in Polars
    base_date = date(2023, 1, 1)
    df = df.with_columns(
        pl.Series("event_time", [
            base_date + timedelta(days=random.randint(0, 364))
            for _ in range(n)
        ]).cast(pl.Date),
    )

    # Entity ID
    df = df.with_columns(
        pl.Series("entity_id", [f"ent_{i % 500}" for i in range(n)]),
    )

    return df


# Load or generate data
if input_parquet_paths:
    df_raw = load_parquet_polars(input_parquet_paths)
    print(f"Loaded {len(df_raw):,} rows from {len(input_parquet_paths)} file(s)")
else:
    print("No input_parquet_paths provided — generating synthetic dataset")
    df_raw = generate_synthetic(task_type, n=5000, seed=sample_seed)
    if time_col is None and "event_time" in df_raw.columns:
        time_col = "event_time"
    if not id_cols and "entity_id" in df_raw.columns:
        id_cols = ["entity_id"]
    if text_cols is None and "text_feedback" in df_raw.columns:
        text_cols = ["text_feedback"]
    print(f"Synthetic dataset: {df_raw.shape}")

assert target_col in df_raw.columns, f"target_col='{target_col}' not found in data"
print(f"\nTarget column: {target_col}")
print(f"Shape: {df_raw.shape}")
df_raw.head(5)

In [None]:
# ---------------------------------------------------------------------------
# Drop specified columns and separate feature candidates
# ---------------------------------------------------------------------------
exclude_cols = set(drop_cols) | set(id_cols) | {target_col}
if time_col:
    exclude_cols.add(time_col)
text_set = set(text_cols or [])
exclude_cols |= text_set

feature_candidates = [c for c in df_raw.columns if c not in exclude_cols]
print(f"Feature candidates: {len(feature_candidates)}")
print(f"Excluded columns: {sorted(exclude_cols)}")

# Create EDA sample
if len(df_raw) > max_rows_for_eda:
    df_eda = df_raw.sample(n=max_rows_for_eda, seed=sample_seed)
    print(f"EDA sample: {len(df_eda):,} rows (sampled from {len(df_raw):,})")
else:
    df_eda = df_raw
    print(f"EDA sample: full dataset ({len(df_eda):,} rows)")

---
## C — Type Inference & Column Categorization

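Infer which columns are numeric or categorical from dtype and cardinality.
String, categorical, and boolean columns are categorical; integer columns
with at most 20 unique values are also treated as categorical. Text columns
come from `text_cols` and are not inferred.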
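
The inference rule can be illustrated on plain Python lists. This is a simplified sketch, not the Polars implementation: `infer_role` is a hypothetical helper, it inspects Python value types instead of dtypes, and it ignores nulls when counting unique values. It uses the same `<=` comparison against the cutoff as the code cell.

```python
MAX_INT_UNIQUE_FOR_CAT = 20  # same cutoff as the notebook's code cell


def infer_role(values: list) -> str:
    """Return 'categorical', 'numeric', or 'other' for a column of values."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return "other"
    if all(isinstance(v, (str, bool)) for v in non_null):
        return "categorical"
    # bool is a subclass of int in Python, so exclude it explicitly
    if all(isinstance(v, int) and not isinstance(v, bool) for v in non_null):
        few = len(set(non_null)) <= MAX_INT_UNIQUE_FOR_CAT
        return "categorical" if few else "numeric"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null):
        return "numeric"
    return "other"


print(infer_role(["web", "api", None]))   # string column
print(infer_role(list(range(20))))        # 20 distinct integers
print(infer_role(list(range(21))))        # 21 distinct integers
print(infer_role([0.5, 1.5, 2.5]))        # floats
```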

In [None]:
# ---------------------------------------------------------------------------
# C — Type Inference (pure Polars)
# ---------------------------------------------------------------------------
MAX_INT_UNIQUE_FOR_CAT = 20

def infer_column_types(
    df: pl.DataFrame,
    candidates: list[str],
    explicit_num: list[str] | None,
    explicit_cat: list[str] | None,
) -> tuple[list[str], list[str]]:
    if explicit_num is not None and explicit_cat is not None:
        return (
            [c for c in explicit_num if c in candidates],
            [c for c in explicit_cat if c in candidates],
        )
    num, cat = [], []
    for c in candidates:
        dtype = df[c].dtype
        if dtype in (pl.Utf8, pl.Categorical, pl.Boolean, pl.String):
            cat.append(c)
        elif dtype.is_numeric():
            nunique = df[c].n_unique()
            if dtype.is_integer() and nunique <= MAX_INT_UNIQUE_FOR_CAT:
                cat.append(c)
            else:
                num.append(c)
    if explicit_num is not None:
        num = [c for c in explicit_num if c in candidates]
    if explicit_cat is not None:
        cat = [c for c in explicit_cat if c in candidates]
    return num, cat


inferred_num, inferred_cat = infer_column_types(
    df_raw, feature_candidates, numeric_cols, categorical_cols
)
print(f"Numeric features:     {len(inferred_num)}")
print(f"Categorical features: {len(inferred_cat)}")
print(f"Text features:        {len(text_cols or [])}")

# Build audit table entirely in Polars
audit_rows = []
for c in feature_candidates:
    col = df_raw[c]
    null_frac = col.null_count() / len(df_raw)
    nunique = col.n_unique()
    role = "numeric" if c in inferred_num else ("categorical" if c in inferred_cat else "other")
    sample_vals = str(col.drop_nulls().head(3).to_list())[:80]
    audit_rows.append({
        "column": c, "dtype": str(col.dtype),
        "null_pct": round(null_frac * 100, 2), "n_unique": nunique,
        "role": role, "sample": sample_vals,
    })

audit_df = pl.DataFrame(audit_rows)
report["column_audit"] = audit_rows
print("\nColumn audit:")
audit_df

---
## D — Data Quality Audit

A thorough, plot-heavy audit of data quality covering missingness,
cardinality, distributions, outliers, and the target variable.

### D.1 — Missingness

Columns whose null fraction exceeds `missingness_drop_threshold` (default
0.98) are dropped. We plot the top columns by null fraction and show a
missingness matrix on a sample of rows.
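
The drop rule itself is small enough to show on toy data. The sketch below is standard-library Python rather than Polars, and `null_fractions` plus the toy rows are hypothetical. It uses the same strict `>` comparison as the code cell, so a column sitting exactly at the threshold is kept.

```python
def null_fractions(rows: list[dict], columns: list[str]) -> dict[str, float]:
    """Fraction of rows in which each column is None."""
    n = len(rows)
    return {c: sum(r.get(c) is None for r in rows) / n for c in columns}


# Toy data: "a" is non-null in 1 row out of 100, "b" is never null
rows = [{"a": 1.0 if i % 100 == 0 else None, "b": float(i)} for i in range(1000)]

threshold = 0.98
fracs = null_fractions(rows, ["a", "b"])
to_drop = [c for c, f in fracs.items() if f > threshold]
print(fracs, to_drop)
```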

In [None]:
# ---------------------------------------------------------------------------
# D.1 — Missingness analysis (Polars)
# ---------------------------------------------------------------------------
# Compute null fractions in one Polars expression
null_stats = (
    df_raw.select(feature_candidates)
    .null_count()
    .unpivot(variable_name="feature", value_name="null_count")
    .with_columns(
        (pl.col("null_count") / len(df_raw)).alias("null_frac")
    )
    .sort("null_frac", descending=True)
)

# Columns to drop for high missingness
high_null_cols = null_stats.filter(
    pl.col("null_frac") > missingness_drop_threshold
)["feature"].to_list()

for c in high_null_cols:
    frac = null_stats.filter(pl.col("feature") == c)["null_frac"].item()
    report["dropped_features"].append({
        "feature": c,
        "reason": f"null_frac={frac:.3f} > {missingness_drop_threshold}",
    })
print(f"Dropping {len(high_null_cols)} columns for high missingness: {high_null_cols}")

# --- Plot: top 50 columns by null fraction ---
top_null = null_stats.head(50)
feats = top_null["feature"].to_list()
fracs = top_null["null_frac"].to_list()
fig, ax = plt.subplots(figsize=(12, max(5, len(feats) * 0.25)))
ax.barh(range(len(feats)), fracs, color="#e74c3c", alpha=0.8)
ax.set_yticks(range(len(feats)))
ax.set_yticklabels(feats, fontsize=8)
ax.set_xlabel("Null Fraction")
ax.set_title("Top 50 Columns by Null Fraction")
ax.axvline(missingness_drop_threshold, color="black", ls="--", lw=1,
           label=f"Drop threshold ({missingness_drop_threshold})")
ax.legend(fontsize=9)
ax.invert_yaxis()
fig.tight_layout()
save_plot(fig, "d1_null_fraction_bar.png")

**Missingness matrix** — each row is an observation, each column is a feature.
White pixels indicate null values. Correlated missingness patterns suggest
structural reasons (e.g., entire form sections skipped).
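
Correlated missingness can also be quantified rather than only read off the plot. One option, shown here as a hedged sketch, is the Jaccard overlap between the sets of rows where two columns are null. The helper `co_missing_jaccard` and the column names are hypothetical.

```python
from itertools import combinations


def co_missing_jaccard(rows: list[dict], columns: list[str]) -> dict[tuple, float]:
    """Jaccard overlap of null-row sets for every pair of columns."""
    masks = {
        c: {i for i, r in enumerate(rows) if r.get(c) is None}
        for c in columns
    }
    out = {}
    for a, b in combinations(columns, 2):
        union = masks[a] | masks[b]
        out[(a, b)] = len(masks[a] & masks[b]) / len(union) if union else 0.0
    return out


# Address fields go missing together; age goes missing independently
rows = [
    {"addr_city": None,   "addr_zip": None,    "age": 30},
    {"addr_city": "Oslo", "addr_zip": "0150",  "age": None},
    {"addr_city": None,   "addr_zip": None,    "age": 41},
    {"addr_city": "Lima", "addr_zip": "15001", "age": 25},
]
overlap = co_missing_jaccard(rows, ["addr_city", "addr_zip", "age"])
print(overlap)
```

A score near 1.0 for a pair suggests the two columns are skipped together, which is the structural pattern described above.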

In [None]:
# --- Plot: missingness matrix (Polars -> single to_numpy at the end) ---
miss_cols = null_stats.filter(pl.col("null_frac") > 0)["feature"].to_list()[:40]
if miss_cols:
    sample_for_matrix = safe_sample(df_eda, 500, sample_seed)
    # Build boolean null matrix entirely in Polars, then convert once
    miss_df = sample_for_matrix.select([
        pl.col(c).is_null().cast(pl.Float32).alias(c)
        for c in miss_cols
    ])
    # Single .to_numpy() call at the matplotlib boundary
    miss_matrix = miss_df.to_numpy()

    fig, ax = plt.subplots(figsize=(max(6, len(miss_cols) * 0.3), 6))
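    # "gray" maps 1.0 (null) to white, matching the plot title; fix the range so an all-equal matrix renders consistently
    ax.imshow(miss_matrix, aspect="auto", cmap="gray", vmin=0, vmax=1, interpolation="nearest")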
    ax.set_xticks(range(len(miss_cols)))
    ax.set_xticklabels(miss_cols, rotation=90, fontsize=7)
    ax.set_ylabel("Row index (sample)")
    ax.set_title("Missingness Matrix (white = null)")
    fig.tight_layout()
    save_plot(fig, "d1_missingness_matrix.png")
else:
    print("No columns with nulls — skipping missingness matrix.")

# Remove high-null columns from feature lists
inferred_num = [c for c in inferred_num if c not in high_null_cols]
inferred_cat = [c for c in inferred_cat if c not in high_null_cols]
feature_candidates = [c for c in feature_candidates if c not in high_null_cols]
report["missingness_summary"] = {
    "total_features": len(feature_candidates) + len(high_null_cols),
    "dropped_high_null": high_null_cols,
    "remaining": len(feature_candidates),
}
print(f"Remaining features after missingness drop: {len(feature_candidates)}")

### D.2 — Cardinality & Category Health

High-cardinality categoricals (> `high_cardinality_threshold` unique values)
may cause memory issues with one-hot encoding and are flagged. We also show
frequency distributions for top categoricals.
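
A common mitigation for long-tailed categoricals is to lump infrequent levels into a single bucket before encoding. The sketch below is an assumption about how a parameter like `rare_category_min_count` could be applied. It is not necessarily what later cells of this notebook do, and `lump_rare` and the example values are hypothetical.

```python
from collections import Counter


def lump_rare(values: list[str], min_count: int = 20,
              other: str = "__OTHER__") -> list[str]:
    """Replace categories seen fewer than `min_count` times with `other`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


raw = ["web"] * 50 + ["api"] * 25 + ["fax"] * 3 + ["kiosk"] * 2
lumped = lump_rare(raw)
print(Counter(lumped))
```

In a real pipeline the category counts should come from the training split only, so that the mapping does not depend on held-out rows.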

In [None]:
# ---------------------------------------------------------------------------
# D.2 — Cardinality analysis (Polars)
# ---------------------------------------------------------------------------
if inferred_cat:
    card_data = pl.DataFrame({
        "feature": inferred_cat,
        "n_unique": [df_raw[c].n_unique() for c in inferred_cat],
    }).sort("n_unique", descending=True)

    high_card = card_data.filter(
        pl.col("n_unique") > high_cardinality_threshold
    )["feature"].to_list()
    if high_card:
        print(f"High-cardinality categoricals (>{high_cardinality_threshold} unique): {high_card}")
        for c in high_card:
            report["high_risk_features"].append({"feature": c, "reason": "high_cardinality"})
    else:
        print("No high-cardinality categoricals flagged.")

    # --- Plot: n_unique (log scale) ---
    feats_c = card_data["feature"].to_list()
    nuniques_c = card_data["n_unique"].to_list()
    fig, ax = plt.subplots(figsize=(10, max(4, len(feats_c) * 0.3)))
    ax.barh(range(len(feats_c)), nuniques_c, color="#3498db", alpha=0.8)
    ax.set_yticks(range(len(feats_c)))
    ax.set_yticklabels(feats_c, fontsize=8)
    ax.set_xscale("log")
    ax.set_xlabel("Number of Unique Values (log scale)")
    ax.set_title("Categorical Feature Cardinality")
    ax.axvline(high_cardinality_threshold, color="red", ls="--", lw=1,
               label=f"Threshold ({high_cardinality_threshold})")
    ax.legend(fontsize=9)
    ax.invert_yaxis()
    fig.tight_layout()
    save_plot(fig, "d2_cardinality.png")

    # --- Plot: frequency bar charts for the first six categoricals (in inferred order) ---
    top_cats_to_show = inferred_cat[:6]
    if top_cats_to_show:
        n_show = len(top_cats_to_show)
        ncols = min(3, n_show)
        nrows = (n_show + ncols - 1) // ncols
        fig, axes = plt.subplots(nrows, ncols, figsize=(5 * ncols, 4 * nrows))
        axes_flat = [axes] if n_show == 1 else list(np.array(axes).flatten())
        for i, c in enumerate(top_cats_to_show):
            vc = (
                df_eda.select(pl.col(c).cast(pl.Utf8).fill_null("__NULL__"))
                .to_series()
                .value_counts()
                .sort("count", descending=True)
            )
            top_vals = vc.head(15)
            other_count = vc["count"].sum() - top_vals["count"].sum()
            labels = top_vals[c].to_list()
            counts = top_vals["count"].to_list()
            if other_count > 0:
                labels.append("__OTHER__")
                counts.append(other_count)
            axes_flat[i].barh(range(len(labels)), counts, color="#2ecc71", alpha=0.8)
            axes_flat[i].set_yticks(range(len(labels)))
            axes_flat[i].set_yticklabels(labels, fontsize=7)
            axes_flat[i].set_title(c, fontsize=10)
            axes_flat[i].invert_yaxis()
        for j in range(i + 1, len(axes_flat)):
            axes_flat[j].set_visible(False)
        fig.suptitle("Category Frequency Distributions (top 15 + OTHER)", fontsize=12, y=1.01)
        fig.tight_layout()
        save_plot(fig, "d2_category_frequencies.png")
else:
    print("No categorical features to analyze.")

### D.3 — Numeric Distributions & Outliers

Histograms and boxplots for numeric features. Skewness and kurtosis are
computed in Polars. If `enable_outlier_clipping` is True, we show
before/after histograms for winsorized features.
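
Winsorization clips values to a pair of quantile bounds instead of removing rows. The sketch below shows the idea in plain Python under stated assumptions: `quantile_nearest` is a simple nearest-index quantile, so its bounds may differ slightly from what `Series.quantile` returns in Polars, and the toy data is hypothetical.

```python
def quantile_nearest(sorted_vals: list[float], q: float) -> float:
    """Nearest-index quantile on pre-sorted values (an approximation)."""
    idx = round(q * (len(sorted_vals) - 1))
    return sorted_vals[idx]


def winsorize(values: list[float], lower: float = 0.01,
              upper: float = 0.99) -> tuple[list[float], float, float]:
    """Clip values to the [lower, upper] quantile range; order is preserved."""
    s = sorted(values)
    lo = quantile_nearest(s, lower)
    hi = quantile_nearest(s, upper)
    return [min(max(v, lo), hi) for v in values], lo, hi


# 99 ordinary values plus one extreme outlier
data = [float(v) for v in range(1, 100)] + [10_000.0]
clipped, lo, hi = winsorize(data)
print(lo, hi, max(clipped))
```

The row count is unchanged after clipping; only the extreme values move to the bounds.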

In [None]:
# ---------------------------------------------------------------------------
# D.3 — Numeric distributions (Polars + matplotlib)
# ---------------------------------------------------------------------------
if inferred_num:
    # --- Histograms ---
    num_to_plot = inferred_num[:12]
    ncols = min(4, len(num_to_plot))
    nrows = (len(num_to_plot) + ncols - 1) // ncols
    fig, axes = plt.subplots(nrows, ncols, figsize=(4 * ncols, 3.5 * nrows))
    axes_flat = [axes] if len(num_to_plot) == 1 else list(np.array(axes).flatten())
    for i, c in enumerate(num_to_plot):
        vals = df_eda[c].drop_nulls().to_list()
        if not vals:
            axes_flat[i].set_title(f"{c} (all null)")
            continue
        axes_flat[i].hist(vals, bins=50, color="#9b59b6", alpha=0.7, edgecolor="white", linewidth=0.3)
        axes_flat[i].set_title(c, fontsize=9)
        axes_flat[i].tick_params(labelsize=7)
    for j in range(i + 1, len(axes_flat)):
        axes_flat[j].set_visible(False)
    fig.suptitle("Numeric Feature Distributions", fontsize=12, y=1.01)
    fig.tight_layout()
    save_plot(fig, "d3_numeric_histograms.png")

    # --- Boxplots ---
    box_cols = inferred_num[:10]
    box_data = [df_eda[c].drop_nulls().to_list() for c in box_cols]
    fig, ax = plt.subplots(figsize=(max(6, len(box_cols) * 0.8), 5))
    ax.boxplot(box_data, vert=True, patch_artist=True,
               boxprops=dict(facecolor="#3498db", alpha=0.6))
    ax.set_xticks(range(1, len(box_cols) + 1))
    ax.set_xticklabels(box_cols, rotation=45, ha="right", fontsize=8)
    ax.set_title("Numeric Feature Boxplots")
    fig.tight_layout()
    save_plot(fig, "d3_numeric_boxplots.png")
else:
    print("No numeric features.")

In [None]:
# ---------------------------------------------------------------------------
# Skewness & kurtosis — computed in Polars
# ---------------------------------------------------------------------------
if inferred_num:
    skew_rows = []
    for c in inferred_num:
        col = df_eda[c].drop_nulls()
        n_vals = col.len()
        if n_vals < 4:
            continue
        sk = col.skew()
        ku = col.kurtosis()
        skew_rows.append({
            "feature": c,
            "skewness": round(sk, 3) if sk is not None else 0.0,
            "kurtosis": round(ku, 3) if ku is not None else 0.0,
        })

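    skew_df = pl.DataFrame(skew_rows).sort("skewness", descending=True)
    print("|Skewness| > 1 suggests a strongly asymmetric distribution; consider a log transform.")
    # Series.kurtosis() defaults to Fisher's definition (excess kurtosis, normal = 0)
    print("Excess kurtosis > 0 indicates a leptokurtic / outlier-prone distribution.\n")
    # A bare expression inside an `if` block is not rendered, so print explicitly
    print(skew_df)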

In [None]:
# ---------------------------------------------------------------------------
# Outlier clipping before/after — Polars quantile + clip
# ---------------------------------------------------------------------------
if enable_outlier_clipping and inferred_num:
    clip_demo_cols = inferred_num[:3]
    fig, axes = plt.subplots(len(clip_demo_cols), 2, figsize=(10, 3.5 * len(clip_demo_cols)))
    if len(clip_demo_cols) == 1:
        axes = [axes]  # plt.subplots(1, 2) returns a 1-D array; wrap it once as a single row

    for i, c in enumerate(clip_demo_cols):
        col = df_eda[c].drop_nulls()
        if col.len() == 0:
            continue
        lo = col.quantile(winsorize_limits[0])
        hi = col.quantile(winsorize_limits[1])
        before = col.to_list()
        after = col.clip(lo, hi).to_list()

        ax_row = axes[i] if len(clip_demo_cols) > 1 else axes[0]
        ax_row[0].hist(before, bins=50, color="#e74c3c", alpha=0.7, edgecolor="white", linewidth=0.3)
        ax_row[0].set_title(f"{c} — before clipping", fontsize=9)
        ax_row[1].hist(after, bins=50, color="#2ecc71", alpha=0.7, edgecolor="white", linewidth=0.3)
        ax_row[1].set_title(f"{c} — after clipping [{lo:.2f}, {hi:.2f}]", fontsize=9)

    fig.suptitle("Outlier Clipping (Winsorization) Before / After", fontsize=12, y=1.01)
    fig.tight_layout()
    save_plot(fig, "d3_outlier_clipping.png")
    print(f"Winsorize limits: {winsorize_limits}")
elif not enable_outlier_clipping:
    print("Outlier clipping disabled.")

### D.4 — Target Exploration

Distribution of the target variable. For classification: class balance bar
chart. For regression: histogram. If a `time_col` is available, we plot
the target rate (or mean) over time. If `group_cols` are provided, we show
target rates by group (purely descriptive — no leakage into features).
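
The monthly aggregation can be sketched without Polars by truncating each date to the first of its month and averaging the target per bucket. `monthly_target_rate` and the toy events are hypothetical. Note that the mean is a rate only for a 0/1 target; for a multiclass target the mean of class labels has no direct interpretation.

```python
from collections import defaultdict
from datetime import date


def monthly_target_rate(events: list[tuple[date, float]]) -> list[tuple[date, float, int]]:
    """Return (month_start, mean_target, n_rows) sorted by month."""
    buckets: dict[date, list[float]] = defaultdict(list)
    for day, y in events:
        buckets[day.replace(day=1)].append(y)   # truncate to month start
    return [(m, sum(v) / len(v), len(v)) for m, v in sorted(buckets.items())]


events = [
    (date(2023, 1, 5), 1), (date(2023, 1, 20), 0),
    (date(2023, 2, 3), 1), (date(2023, 2, 14), 1),
]
rates = monthly_target_rate(events)
for month, rate, n in rates:
    print(month, rate, n)
```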

In [None]:
# ---------------------------------------------------------------------------
# D.4 — Target exploration (Polars)
# ---------------------------------------------------------------------------
fig, ax = plt.subplots(figsize=(8, 4))
if task_type in ("binary", "multiclass"):
    vc = df_raw[target_col].value_counts().sort(target_col)
    labels = vc[target_col].cast(pl.Utf8).to_list()
    counts = vc["count"].to_list()
    total = sum(counts)
    colors = plt.cm.Set2([i / max(len(labels), 1) for i in range(len(labels))])
    ax.bar(labels, counts, color=colors, edgecolor="white")
    ax.set_xlabel("Class")
    ax.set_ylabel("Count")
    ax.set_title(f"Target Distribution ({target_col})")
    for j, (lbl, cnt) in enumerate(zip(labels, counts)):
        ax.text(j, cnt, f"{cnt:,}\n({cnt/total:.1%})", ha="center", va="bottom", fontsize=9)
else:
    vals = df_raw[target_col].drop_nulls().to_list()
    ax.hist(vals, bins=60, color="#e67e22", alpha=0.8, edgecolor="white", linewidth=0.3)
    ax.set_xlabel(target_col)
    ax.set_ylabel("Count")
    ax.set_title(f"Target Distribution ({target_col}) — Regression")
fig.tight_layout()
save_plot(fig, "d4_target_distribution.png")

In [None]:
# --- Target over time ---
if time_col and time_col in df_raw.columns:
    time_agg = (
        df_raw.select([time_col, target_col])
        .drop_nulls()
        .with_columns(pl.col(time_col).cast(pl.Date).dt.truncate("1mo").alias("month"))
        .group_by("month")
        .agg([
            pl.col(target_col).mean().alias("target_rate"),
            pl.col(target_col).count().alias("n"),
        ])
        .sort("month")
    )
    months = time_agg["month"].to_list()
    rates = time_agg["target_rate"].to_list()
    ylabel = "Target Rate (mean)" if task_type in ("binary", "multiclass") else "Target Mean"

    fig, ax = plt.subplots(figsize=(10, 4))
    ax.plot(months, rates, marker="o", color="#2980b9", linewidth=2, markersize=4)
    ax.set_xlabel("Month")
    ax.set_ylabel(ylabel)
    ax.set_title("Target Over Time (monthly)")
    ax.tick_params(axis="x", rotation=45)
    fig.tight_layout()
    save_plot(fig, "d4_target_over_time.png")
else:
    print("No time_col — skipping target-over-time plot.")

# --- Target by group ---
if group_cols:
    for gc in group_cols[:2]:
        if gc not in df_raw.columns:
            continue
        grp = (
            df_raw.group_by(gc)
            .agg([
                pl.col(target_col).mean().alias("target_rate"),
                pl.col(target_col).count().alias("n"),
            ])
            .sort("target_rate", descending=True)
            .head(20)
        )
        fig, ax = plt.subplots(figsize=(10, max(4, len(grp) * 0.3)))
        ax.barh(range(len(grp)), grp["target_rate"].to_list(), color="#1abc9c", alpha=0.8)
        ax.set_yticks(range(len(grp)))
        ax.set_yticklabels(grp[gc].cast(pl.Utf8).to_list(), fontsize=8)
        ax.set_xlabel("Target Rate / Mean")
        ax.set_title(f"Target by {gc} (top 20)")
        ax.invert_yaxis()
        fig.tight_layout()
        save_plot(fig, f"d4_target_by_{gc}.png")
else:
    print("No group_cols — skipping target-by-group plot.")

---
## E — Baseline Feature Scoring

Multiple independent univariate "feature quality" scores. We compute
rankings per method and then build a **consensus score** that combines
all available rankings.

> **Note:** sklearn functions require numpy arrays. We prepare data in
> Polars (fill nulls, cast dtypes) then convert once at the sklearn
> boundary.

| Method | Applies to | What it measures |
|---|---|---|
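| Point-biserial / Spearman correlation | Numeric | Linear association with the class label (classification) / monotonic association (regression) |
| ANOVA F-value (`f_classif`) / F-regression (`f_regression`) | Numeric | Between-class variance (classification) / linear fit strength (regression) |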
| Mutual information | Both | Any dependency (nonlinear included) |
| Univariate ROC-AUC | Numeric (binary) | Discriminative power as a score |
| Chi-squared | Categorical | Association with target |
| Near-zero variance | Numeric | Whether feature is constant |
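
One simple way to combine methods whose scores live on different scales is to convert each method's scores to ranks and average the ranks per feature. The sketch below assumes that approach. The consensus formula the notebook actually uses is not shown in this section, and `rank_scores`, `consensus`, and the score values are hypothetical.

```python
def rank_scores(scores: dict[str, float]) -> dict[str, int]:
    """Rank features by score, 1 = best (highest score)."""
    ordered = sorted(scores, key=lambda f: scores[f], reverse=True)
    return {feat: rank for rank, feat in enumerate(ordered, start=1)}


def consensus(rankings: dict[str, dict[str, float]]) -> list[tuple[str, float]]:
    """Average each feature's rank over the methods that scored it."""
    per_feature: dict[str, list[int]] = {}
    for method_scores in rankings.values():
        for feat, rank in rank_scores(method_scores).items():
            per_feature.setdefault(feat, []).append(rank)
    avg = {f: sum(r) / len(r) for f, r in per_feature.items()}
    return sorted(avg.items(), key=lambda kv: kv[1])   # lowest mean rank first


score_demo = {
    "correlation": {"num_1": 0.6, "num_2": 0.2, "num_3": 0.4},
    "anova_f":     {"num_1": 120.0, "num_2": 5.0, "num_3": 80.0},
    "mutual_info": {"num_1": 0.05, "num_3": 0.09},   # num_2 not scored here
}
result = consensus(score_demo)
print(result)
```

Averaging only over the methods that scored a feature keeps categorical-only and numeric-only methods from penalizing features they do not apply to.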

In [None]:
# ---------------------------------------------------------------------------
# E — Baseline feature scoring: numeric features
# ---------------------------------------------------------------------------
score_rankings: dict[str, dict[str, float]] = {}

# Prepare data in Polars: fill nulls with median, drop target nulls
df_score_sample = safe_sample(df_raw, 50_000, sample_seed).drop_nulls(subset=[target_col])

if inferred_num:
    # Fill nulls with median in Polars
    df_num_filled = df_score_sample.select(
        [pl.col(c).fill_null(pl.col(c).median()) for c in inferred_num]
        + [pl.col(target_col)]
    )

    # Single conversion to numpy at the sklearn boundary
    X_num = df_num_filled.select(inferred_num).to_numpy().astype(np.float64)
    y_score = df_num_filled[target_col].to_numpy()
    X_num = np.nan_to_num(X_num, nan=0.0, posinf=0.0, neginf=0.0)

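    # 1) Correlation with target (via scipy; no Polars equivalent for point-biserial).
    #    Point-biserial assumes a binary target; for multiclass it reduces to Pearson r
    #    against the numeric class labels, so treat that score as a rough heuristic.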
    corr_scores = {}
    for i, c in enumerate(inferred_num):
        if task_type == "regression":
            r, _ = sp_stats.spearmanr(X_num[:, i], y_score, nan_policy="omit")
        else:
            r, _ = sp_stats.pointbiserialr(y_score, X_num[:, i])
        corr_scores[c] = abs(float(r)) if not (r is None or math.isnan(r)) else 0.0
    score_rankings["correlation"] = corr_scores
    print(f"Correlation scores computed for {len(corr_scores)} numeric features.")

    # 2) F-statistic: ANOVA F (classification) or univariate linear-regression F (regression)
    if task_type in ("binary", "multiclass"):
        f_vals, _ = f_classif(X_num, y_score)
    else:
        f_vals, _ = f_regression(X_num, y_score)
    anova_scores = {}
    for i, c in enumerate(inferred_num):
        v = float(f_vals[i])
        anova_scores[c] = v if not math.isnan(v) else 0.0
    score_rankings["anova_f"] = anova_scores
    print("ANOVA F-scores computed.")

    # 3) Mutual information
    if enable_mutual_info:
        if task_type in ("binary", "multiclass"):
            mi = mutual_info_classif(X_num, y_score, random_state=random_state, n_neighbors=5)
        else:
            mi = mutual_info_regression(X_num, y_score, random_state=random_state, n_neighbors=5)
        score_rankings["mutual_info"] = {c: float(mi[i]) for i, c in enumerate(inferred_num)}
        print("Mutual information scores computed.")

    # 4) Univariate ROC-AUC (binary only)
    if task_type == "binary":
        auc_scores = {}
        for i, c in enumerate(inferred_num):
            try:
                auc = roc_auc_score(y_score, X_num[:, i])
                auc_scores[c] = max(auc, 1 - auc)
            except Exception:
                auc_scores[c] = 0.5
        score_rankings["univariate_auc"] = auc_scores
        print("Univariate AUC scores computed.")

    # 5) Variance (Polars-native)
    var_scores = {}
    for c in inferred_num:
        v = df_score_sample[c].var()
        var_scores[c] = float(v) if v is not None else 0.0
    near_zero_var = [c for c, v in var_scores.items() if v < 1e-10]
    if near_zero_var:
        print(f"Near-zero variance features: {near_zero_var}")
        for c in near_zero_var:
            report["dropped_features"].append({"feature": c, "reason": "near_zero_variance"})
    score_rankings["variance"] = var_scores

print(f"\nScoring methods computed: {list(score_rankings.keys())}")

In [None]:
# ---------------------------------------------------------------------------
# E — Baseline feature scoring: categorical features
# ---------------------------------------------------------------------------
if inferred_cat:
    df_cat_sample = df_score_sample.select(inferred_cat + [target_col])
    y_cat = df_cat_sample[target_col].to_numpy()

    # Ordinal-encode in Polars (no pandas needed)
    df_cat_encoded = df_cat_sample.select(inferred_cat)
    for c in inferred_cat:
        uniques = df_cat_encoded[c].cast(pl.Utf8).fill_null("__MISSING__").unique().sort()
        mapping = {v: float(i) for i, v in enumerate(uniques.to_list())}
        df_cat_encoded = df_cat_encoded.with_columns(
            pl.col(c).cast(pl.Utf8).fill_null("__MISSING__")
            .replace_strict(mapping, default=-1.0)
            .cast(pl.Float64)
            .alias(c)
        )
    X_cat_enc = df_cat_encoded.to_numpy()

    # Chi-squared (classification only)
    if task_type in ("binary", "multiclass"):
        X_cat_nonneg = X_cat_enc - X_cat_enc.min(axis=0)
        chi2_vals, _ = chi2(X_cat_nonneg, y_cat)
        chi2_scores = {}
        for i, c in enumerate(inferred_cat):
            v = float(chi2_vals[i])
            chi2_scores[c] = v if not math.isnan(v) else 0.0
        score_rankings["chi2"] = chi2_scores
        print(f"Chi-squared scores computed for {len(chi2_scores)} categorical features.")

    # Mutual information for categoricals
    if enable_mutual_info:
        if task_type in ("binary", "multiclass"):
            mi_cat = mutual_info_classif(X_cat_enc, y_cat, discrete_features=True, random_state=random_state)
        else:
            mi_cat = mutual_info_regression(X_cat_enc, y_cat, discrete_features=True, random_state=random_state)
        score_rankings["mutual_info_cat"] = {c: float(mi_cat[i]) for i, c in enumerate(inferred_cat)}
        print("Mutual info (categorical) computed.")

    # Target rate per category plots (classification)
    if task_type in ("binary", "multiclass"):
        cats_to_plot = inferred_cat[:4]
        if cats_to_plot:
            ncols_p = min(2, len(cats_to_plot))
            nrows_p = (len(cats_to_plot) + ncols_p - 1) // ncols_p
            fig, axes = plt.subplots(nrows_p, ncols_p, figsize=(6 * ncols_p, 4 * nrows_p))
            axes_flat = [axes] if len(cats_to_plot) == 1 else list(np.array(axes).flatten())
            for idx, c in enumerate(cats_to_plot):
                grp = (
                    df_score_sample.select([c, target_col])
                    .with_columns(pl.col(c).cast(pl.Utf8).fill_null("__NULL__"))
                    .group_by(c)
                    .agg([
                        pl.col(target_col).mean().alias("target_rate"),
                        pl.col(target_col).count().alias("n"),
                    ])
                    .filter(pl.col("n") >= rare_category_min_count)
                    .sort("target_rate", descending=True)
                    .head(15)
                )
                axes_flat[idx].barh(
                    range(len(grp)), grp["target_rate"].to_list(),
                    color="#e67e22", alpha=0.8,
                )
                axes_flat[idx].set_yticks(range(len(grp)))
                axes_flat[idx].set_yticklabels(grp[c].to_list(), fontsize=7)
                axes_flat[idx].set_xlabel("Target Rate")
                axes_flat[idx].set_title(f"Target Rate by {c}", fontsize=10)
                axes_flat[idx].invert_yaxis()
            for j in range(idx + 1, len(axes_flat)):
                axes_flat[j].set_visible(False)
            fig.suptitle("Target Rate per Category (min count filter applied)", fontsize=12, y=1.01)
            fig.tight_layout()
            save_plot(fig, "e_target_rate_by_category.png")
else:
    print("No categorical features to score.")

### E — Univariate Score Plots & Consensus Ranking

For each scoring method, we plot the top 30 features. Then we compute a
**consensus score**: each method's ranking is converted to a normalized
rank in (0, 1], the normalized ranks are averaged across methods, and the
mean is subtracted from 1. Features that consistently rank high across
methods are the most robust.
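
As a worked illustration of the consensus idea, the sketch below ranks three hypothetical features under three hypothetical scoring methods and averages the normalized ranks. The function name `consensus_scores` and the toy scores are assumptions for this example; every method here scores every feature, so the question of how to treat unscored features does not arise.

```python
def consensus_scores(score_rankings):
    """Return {feature: 1 - mean(normalized rank)} across scoring methods."""
    normalized_ranks = {}
    for scores in score_rankings.values():
        ordered = sorted(scores, key=scores.get, reverse=True)  # best score first
        n = len(ordered)
        for rank, feature in enumerate(ordered, start=1):
            normalized_ranks.setdefault(feature, []).append(rank / n)
    return {f: 1.0 - sum(r) / len(r) for f, r in normalized_ranks.items()}


# Hypothetical scores: higher is better within each method.
toy_rankings = {
    "correlation": {"age": 0.60, "income": 0.30, "noise": 0.01},
    "mutual_info": {"age": 0.20, "income": 0.25, "noise": 0.00},
    "anova_f": {"age": 10.0, "income": 4.0, "noise": 0.1},
}

toy_consensus = consensus_scores(toy_rankings)
for feature, score in sorted(toy_consensus.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{feature}: {score:.3f}")
```

Because only ranks enter the average, methods with very different score scales (an F-statistic versus a correlation) contribute equally.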

In [None]:
# ---------------------------------------------------------------------------
# E — Plot top features per scoring method
# ---------------------------------------------------------------------------
all_scored_methods = [m for m in score_rankings if m != "variance"]

if all_scored_methods:
    n_methods = len(all_scored_methods)
    fig, axes = plt.subplots(n_methods, 1, figsize=(10, 4 * n_methods))
    if n_methods == 1:
        axes = [axes]
    colors = ["#e74c3c", "#3498db", "#2ecc71", "#9b59b6", "#e67e22", "#1abc9c"]

    for i, method in enumerate(all_scored_methods):
        scores = score_rankings[method]
        sorted_feats = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:30]
        names = [f[0] for f in sorted_feats]
        vals = [f[1] for f in sorted_feats]
        axes[i].barh(range(len(names)), vals, color=colors[i % len(colors)], alpha=0.8)
        axes[i].set_yticks(range(len(names)))
        axes[i].set_yticklabels(names, fontsize=7)
        axes[i].set_title(f"Top 30 by {method}", fontsize=10)
        axes[i].invert_yaxis()
    fig.suptitle("Univariate Feature Scores by Method", fontsize=13, y=1.0)
    fig.tight_layout()
    save_plot(fig, "e_univariate_scores.png")

for method, scores in score_rankings.items():
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    report["univariate_scores"][method] = [{"feature": f, "score": round(s, 6)} for f, s in ranked]

In [None]:
# ---------------------------------------------------------------------------
# E — Consensus ranking (pure Python, no numpy)
# ---------------------------------------------------------------------------
all_features_scored = sorted({f for scores in score_rankings.values() for f in scores})
methods_for_consensus = [m for m in score_rankings if m != "variance"]

if methods_for_consensus and all_features_scored:
    n_f = len(all_features_scored)
    rank_matrix: dict[str, list[float]] = {f: [] for f in all_features_scored}

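    for method in methods_for_consensus:
        scores = score_rankings[method]
        sorted_feats = sorted(scores.items(), key=lambda x: x[1], reverse=True)
        n_m = len(sorted_feats)
        # Rank only the features this method scored, normalized by the method's
        # own feature count. Assigning the worst rank to unscored features would
        # penalize categorical features under numeric-only methods (and vice versa).
        for rank, (f, _) in enumerate(sorted_feats, start=1):
            rank_matrix[f].append(rank / n_m)

    # Consensus = 1 - mean(normalized_rank) over the methods that scored the feature
    consensus = {
        f: 1.0 - sum(ranks) / len(ranks) for f, ranks in rank_matrix.items() if ranks
    }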
    consensus_sorted = sorted(consensus.items(), key=lambda x: x[1], reverse=True)

    report["consensus_ranking"] = [
        {"feature": f, "consensus_score": round(s, 4)} for f, s in consensus_sorted
    ]

    # --- Plot: consensus top 30 ---
    top_consensus = consensus_sorted[:30]
    fig, ax = plt.subplots(figsize=(10, max(5, len(top_consensus) * 0.25)))
    ax.barh(range(len(top_consensus)), [s for _, s in top_consensus], color="#2c3e50", alpha=0.85)
    ax.set_yticks(range(len(top_consensus)))
    ax.set_yticklabels([f for f, _ in top_consensus], fontsize=8)
    ax.set_xlabel("Consensus Score (higher = more consistently important)")
    ax.set_title("Top 30 Features — Consensus Ranking Across Methods")
    ax.invert_yaxis()
    fig.tight_layout()
    save_plot(fig, "e_consensus_ranking.png")

    # --- Rank agreement: how many methods put feature in top-k ---
    top_k = 15
    method_top_sets = {}
    for method in methods_for_consensus:
        scores = score_rankings[method]
        top_feats = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
        method_top_sets[method] = {f for f, _ in top_feats}

    agreement_count: dict[str, int] = {}
    for fset in method_top_sets.values():
        for f in fset:
            agreement_count[f] = agreement_count.get(f, 0) + 1
    agreement_sorted = sorted(agreement_count.items(), key=lambda x: x[1], reverse=True)

    fig, ax = plt.subplots(figsize=(10, max(4, len(agreement_sorted) * 0.22)))
    ax.barh(range(len(agreement_sorted)), [c for _, c in agreement_sorted], color="#16a085", alpha=0.8)
    ax.set_yticks(range(len(agreement_sorted)))
    ax.set_yticklabels([f for f, _ in agreement_sorted], fontsize=7)
    ax.set_xlabel(f"# Methods placing feature in top-{top_k}")
    ax.set_title(f"Rank Agreement: Features in Top-{top_k} Across {len(methods_for_consensus)} Methods")
    ax.invert_yaxis()
    fig.tight_layout()
    save_plot(fig, "e_rank_agreement.png")
    print(f"Consensus ranking computed across {len(methods_for_consensus)} methods.")
else:
    consensus_sorted = []
    print("Not enough scoring data for consensus ranking.")

---
## F — Feature Engineering

All transformations are done in Polars. Features are added to the working
dataframe and tracked in `engineered_features_created`.

### F.1 — Missingness Indicators

For columns with meaningful missingness (more than 1% null, but below
`missingness_drop_threshold`), we create binary `is_null` flags. These can
carry signal if missingness is informative.
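
The thresholding rule can be sketched in plain Python as follows. This is an illustration rather than the notebook's Polars code; `missingness_indicator` is a hypothetical helper, and the `high=0.5` default is an assumed stand-in for `missingness_drop_threshold`.

```python
def missingness_indicator(values, low=0.01, high=0.5):
    """Return a 0/1 is-null flag list, or None if the null fraction is out of range.

    `high` is an assumed placeholder for the notebook's missingness_drop_threshold.
    """
    null_frac = sum(v is None for v in values) / len(values)
    if low < null_frac < high:
        return [int(v is None) for v in values]
    return None  # too few nulls to matter, or so many that the column is dropped


partly_missing = [1.0, None, 3.0, None, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
fully_observed = [1.0, 2.0, 3.0, 4.0]

print(missingness_indicator(partly_missing))
print(missingness_indicator(fully_observed))
```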

In [None]:
# ---------------------------------------------------------------------------
# F.1 — Missingness indicators (Polars)
# ---------------------------------------------------------------------------
df_eng = df_raw.clone()
miss_indicator_cols = []

null_frac_lookup = dict(zip(
    null_stats["feature"].to_list(),
    null_stats["null_frac"].to_list(),
))

new_cols = []
for c in inferred_num + inferred_cat:
    nf = null_frac_lookup.get(c, 0.0)
    if 0.01 < nf < missingness_drop_threshold:
        new_col = f"{c}_is_null"
        new_cols.append(pl.col(c).is_null().cast(pl.Int8).alias(new_col))
        miss_indicator_cols.append(new_col)

if new_cols:
    df_eng = df_eng.with_columns(new_cols)
report["engineered_features_created"].extend(miss_indicator_cols)
print(f"Created {len(miss_indicator_cols)} missingness indicator features.")

# --- Plot: missingness indicator vs target ---
if miss_indicator_cols and task_type in ("binary", "multiclass"):
    n_to_plot = min(6, len(miss_indicator_cols))
    fig, axes = plt.subplots(1, n_to_plot, figsize=(4 * n_to_plot, 4))
    if n_to_plot == 1:
        axes = [axes]
    for i, mc in enumerate(miss_indicator_cols[:n_to_plot]):
        grp = (
            df_eng.group_by(mc)
            .agg(pl.col(target_col).mean().alias("target_rate"))
            .sort(mc)
        )
        labels = grp[mc].cast(pl.Utf8).to_list()
        rates = grp["target_rate"].to_list()
        axes[i].bar(labels, rates, color=["#3498db", "#e74c3c"][:len(labels)], alpha=0.8)
        axes[i].set_title(mc.replace("_is_null", ""), fontsize=9)
        axes[i].set_xlabel("Is Null")
        axes[i].set_ylabel("Target Rate")
    fig.suptitle("Missingness Indicators vs Target Rate", fontsize=12, y=1.02)
    fig.tight_layout()
    save_plot(fig, "f1_missingness_indicators.png")

### F.2 — Numeric Transforms

For heavily-skewed non-negative features (|skewness| > 1), we apply `log1p`.
We also show quantile-binned target rates for selected features.
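
To show why `log1p` helps, the sketch below computes the biased (population) skewness of a hypothetical right-skewed sample before and after the transform. It is a pure-Python illustration under the same gating rules the code cell applies (at least 10 non-null values, minimum of zero or more, |skewness| above 1); the helper names are assumptions.

```python
import math


def skewness(values):
    """Biased (population) skewness: m3 / m2**1.5."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


def maybe_log1p(values, threshold=1.0):
    """Apply log1p only to non-negative, heavily skewed columns; nulls become 0."""
    observed = [v for v in values if v is not None]
    if len(observed) < 10 or min(observed) < 0 or abs(skewness(observed)) <= threshold:
        return None
    return [math.log1p(0.0 if v is None else v) for v in values]


# Hypothetical right-skewed sample with a long tail.
raw = [0.0, 1.0, 1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 50.0, 200.0]
logged = maybe_log1p(raw)

print(f"skewness before: {skewness(raw):.2f}")
print(f"skewness after:  {skewness(logged):.2f}")
```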

In [None]:
# ---------------------------------------------------------------------------
# F.2 — Numeric transforms (Polars)
# ---------------------------------------------------------------------------
log_cols_created = []

if inferred_num:
    log_exprs = []
    for c in inferred_num:
        col = df_eng[c].drop_nulls()
        if col.len() < 10:
            continue
        sk = col.skew()
        mn = col.min()
        if sk is not None and abs(sk) > 1.0 and mn is not None and mn >= 0:
            new_col = f"{c}_log1p"
            log_exprs.append(pl.col(c).fill_null(0).log1p().alias(new_col))
            log_cols_created.append(new_col)

    if log_exprs:
        df_eng = df_eng.with_columns(log_exprs)
    report["engineered_features_created"].extend(log_cols_created)
    print(f"Created {len(log_cols_created)} log1p-transformed features.")

    # --- Quantile binning target rate plot (Polars — no .qcut) ---
    bin_demo_cols = inferred_num[:4]
    if bin_demo_cols and task_type in ("binary", "multiclass"):
        ncols_p = min(2, len(bin_demo_cols))
        nrows_p = (len(bin_demo_cols) + ncols_p - 1) // ncols_p
        fig, axes = plt.subplots(nrows_p, ncols_p, figsize=(6 * ncols_p, 4 * nrows_p))
        axes_flat = [axes] if len(bin_demo_cols) == 1 else list(np.array(axes).flatten())

        for idx, c in enumerate(bin_demo_cols):
            try:
                # Compute decile boundaries in Polars
                col_clean = df_eng.select([c, target_col]).drop_nulls()
                boundaries = [col_clean[c].quantile(q / 10) for q in range(1, 10)]
                boundaries = sorted(set(b for b in boundaries if b is not None))

                if len(boundaries) < 2:
                    axes_flat[idx].set_title(f"{c} — too few unique quantiles")
                    continue

                # Use .cut() with precomputed boundaries (no labels= arg)
                binned = (
                    col_clean.with_columns(
                        pl.col(c).cut(boundaries).alias("bin")
                    )
                    .group_by("bin")
                    .agg([
                        pl.col(target_col).mean().alias("target_rate"),
                        pl.col(target_col).count().alias("n"),
                    ])
                    .sort("bin")
                )
                bin_labels = binned["bin"].cast(pl.Utf8).to_list()
                bin_rates = binned["target_rate"].to_list()
                axes_flat[idx].bar(range(len(bin_labels)), bin_rates, color="#8e44ad", alpha=0.8)
                axes_flat[idx].set_xticks(range(len(bin_labels)))
                axes_flat[idx].set_xticklabels(bin_labels, fontsize=6, rotation=45, ha="right")
                axes_flat[idx].set_title(f"Target Rate by Bin: {c}", fontsize=9)
                axes_flat[idx].set_ylabel("Target Rate")
            except Exception as e:
                axes_flat[idx].set_title(f"{c} — error: {e}")
        for j in range(idx + 1, len(axes_flat)):
            axes_flat[j].set_visible(False)
        fig.suptitle("Target Rate by Quantile Bins (numeric features)", fontsize=12, y=1.01)
        fig.tight_layout()
        save_plot(fig, "f2_quantile_bin_target_rate.png")

### F.3 — Interaction Features

For the top five numeric features (by consensus), we create pairwise ratio
and product interactions, capped at 10 pairs to prevent combinatorial
explosion.
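
A minimal sketch of the pairwise construction, in plain Python on a single hypothetical row. The helper name `interaction_features` is an assumption; the naming scheme (`_div_`, `_x_`), null-to-zero filling, and the `|b| + eps` denominator mirror the code cell.

```python
from itertools import combinations, islice


def interaction_features(row, top_features, max_pairs=10, eps=1e-8):
    """Build ratio and product features for the first `max_pairs` feature pairs."""
    out = {}
    for a, b in islice(combinations(top_features, 2), max_pairs):
        x = row.get(a) or 0.0  # treat missing values as 0
        y = row.get(b) or 0.0
        out[f"{a}_div_{b}"] = x / (abs(y) + eps)  # eps guards against division by zero
        out[f"{a}_x_{b}"] = x * y
    return out


toy_row = {"a": 6.0, "b": 3.0, "c": None}
feats = interaction_features(toy_row, ["a", "b", "c"])
print(sorted(feats))
```

With five input features there are exactly 10 unordered pairs, so the cap only matters if the top-feature list is lengthened.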

In [None]:
# ---------------------------------------------------------------------------
# F.3 — Interaction features (Polars)
# ---------------------------------------------------------------------------
interaction_cols_created = []

if enable_interactions and inferred_num and consensus_sorted:
    top_num_for_interact = [f for f, _ in consensus_sorted if f in inferred_num][:5]

    if len(top_num_for_interact) >= 2:
        interact_exprs = []
        pairs_done = 0
        max_pairs = 10
        for i in range(len(top_num_for_interact)):
            for j in range(i + 1, len(top_num_for_interact)):
                if pairs_done >= max_pairs:
                    break
                a, b = top_num_for_interact[i], top_num_for_interact[j]
                ratio_col = f"{a}_div_{b}"
                prod_col = f"{a}_x_{b}"
                # Ratio divides by |b| + eps to avoid division by zero; this drops the sign of b.
                interact_exprs.extend([
                    (pl.col(a).fill_null(0) / (pl.col(b).fill_null(0).abs() + 1e-8)).alias(ratio_col),
                    (pl.col(a).fill_null(0) * pl.col(b).fill_null(0)).alias(prod_col),
                ])
                interaction_cols_created.extend([ratio_col, prod_col])
                pairs_done += 1

        if interact_exprs:
            df_eng = df_eng.with_columns(interact_exprs)
        report["engineered_features_created"].extend(interaction_cols_created)
        print(f"Created {len(interaction_cols_created)} interaction features.")

        # --- Plot: 2D scatter for first 3 pairs (binary only) ---
        if task_type == "binary" and len(top_num_for_interact) >= 2:
            pairs_to_plot = min(3, len(top_num_for_interact) * (len(top_num_for_interact) - 1) // 2)
            fig, axes = plt.subplots(1, pairs_to_plot, figsize=(5 * pairs_to_plot, 4))
            if pairs_to_plot == 1:
                axes = [axes]
            df_scatter = safe_sample(df_eng, 2000, sample_seed)
            pair_idx = 0
            for i in range(len(top_num_for_interact)):
                for j in range(i + 1, len(top_num_for_interact)):
                    if pair_idx >= pairs_to_plot:
                        break
                    a, b = top_num_for_interact[i], top_num_for_interact[j]
                    scatter_df = df_scatter.select([a, b, target_col]).drop_nulls()
                    axes[pair_idx].scatter(
                        scatter_df[a].to_list(), scatter_df[b].to_list(),
                        c=scatter_df[target_col].to_list(), cmap="coolwarm", alpha=0.3, s=8,
                    )
                    axes[pair_idx].set_xlabel(a, fontsize=8)
                    axes[pair_idx].set_ylabel(b, fontsize=8)
                    axes[pair_idx].set_title(f"{a} vs {b}", fontsize=9)
                    pair_idx += 1
            fig.suptitle("Interaction Pairs — 2D Scatter Colored by Target", fontsize=12, y=1.02)
            fig.tight_layout()
            save_plot(fig, "f3_interaction_scatter.png")
    else:
        print("Not enough numeric features for interactions.")
elif enable_polynomial:
    print("WARNING: enable_polynomial=True can create feature explosion.")
if not enable_interactions:
    print("Interaction features disabled.")

### F.4 — Group Aggregation Features

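When `group_cols` are provided, we compute per-group mean and std for the
top five numeric features (by consensus), plus a per-group row count, in
Polars. The statistics are joined back onto every row of the group.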
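
The aggregate-then-broadcast pattern can be sketched without Polars as follows. This is an illustration on hypothetical rows; `add_group_stats` is an assumed helper name, and the column naming follows the code cell. The sample standard deviation is undefined for single-row groups, so it is left as `None` there.

```python
from collections import defaultdict
from statistics import mean, stdev


def add_group_stats(rows, group_key, value_key):
    """Attach per-group mean, sample std, and row count to every row."""
    values_by_group = defaultdict(list)
    counts = defaultdict(int)
    for row in rows:
        counts[row[group_key]] += 1
        if row[value_key] is not None:
            values_by_group[row[group_key]].append(row[value_key])

    enriched_rows = []
    for row in rows:
        vals = values_by_group[row[group_key]]
        enriched = dict(row)
        enriched[f"{value_key}_grp_{group_key}_mean"] = mean(vals) if vals else None
        # Sample std needs at least two observations.
        enriched[f"{value_key}_grp_{group_key}_std"] = stdev(vals) if len(vals) >= 2 else None
        enriched[f"grp_{group_key}_count"] = counts[row[group_key]]
        enriched_rows.append(enriched)
    return enriched_rows


# Hypothetical data: two stores, one with a single row.
toy_rows = [
    {"store": "x", "sales": 1.0},
    {"store": "x", "sales": 3.0},
    {"store": "y", "sales": 10.0},
]
enriched = add_group_stats(toy_rows, "store", "sales")
print(enriched[0])
```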

In [None]:
# ---------------------------------------------------------------------------
# F.4 — Group aggregation features (Polars)
# ---------------------------------------------------------------------------
group_agg_cols_created = []

if enable_group_aggregations and group_cols:
    top_num_for_agg = [f for f, _ in consensus_sorted if f in inferred_num][:5]

    for gc in group_cols:
        if gc not in df_eng.columns:
            continue
        agg_exprs = []
        new_names = []
        for nc in top_num_for_agg:
            mean_col = f"{nc}_grp_{gc}_mean"
            std_col = f"{nc}_grp_{gc}_std"
            agg_exprs.extend([
                pl.col(nc).mean().alias(mean_col),
                pl.col(nc).std().alias(std_col),
            ])
            new_names.extend([mean_col, std_col])

        count_col = f"grp_{gc}_count"
        agg_exprs.append(pl.len().alias(count_col))
        new_names.append(count_col)

        group_stats = df_eng.group_by(gc).agg(agg_exprs)
        df_eng = df_eng.join(group_stats, on=gc, how="left")
        group_agg_cols_created.extend(new_names)

    # Deduplicate
    group_agg_cols_created = list(dict.fromkeys(group_agg_cols_created))
    report["engineered_features_created"].extend(group_agg_cols_created)
    print(f"Created {len(group_agg_cols_created)} group aggregation features.")

    # --- Plot ---
    agg_to_plot = [c for c in group_agg_cols_created if c in df_eng.columns][:4]
    if agg_to_plot:
        fig, axes = plt.subplots(1, len(agg_to_plot), figsize=(5 * len(agg_to_plot), 4))
        if len(agg_to_plot) == 1:
            axes = [axes]
        for idx, ac in enumerate(agg_to_plot):
            vals = df_eng[ac].drop_nulls().to_list()
            axes[idx].hist(vals, bins=40, color="#27ae60", alpha=0.7, edgecolor="white", linewidth=0.3)
            axes[idx].set_title(ac, fontsize=9)
        fig.suptitle("Group Aggregation Feature Distributions", fontsize=12, y=1.02)
        fig.tight_layout()
        save_plot(fig, "f4_group_agg_distributions.png")
else:
    if not group_cols:
        print("No group_cols — skipping group aggregations.")
    else:
        print("Group aggregations disabled.")

### F.5 — Time-Based Features

When a `time_col` is provided, we extract calendar components and their
cyclic (sin/cos) encodings in Polars.
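
The point of the sin/cos encoding is that the end of a cycle sits next to its start. The standard-library sketch below uses ISO weekday numbering (1 = Monday ... 7 = Sunday) and checks that Sunday is as close to Monday as Monday is to Tuesday. The helper names are hypothetical.

```python
import math
from datetime import date


def calendar_features(d):
    """Calendar components plus cyclic encodings for a single date."""
    dow = d.isoweekday()  # ISO numbering: 1 = Monday ... 7 = Sunday
    two_pi = 2 * math.pi
    return {
        "dow": dow,
        "month_num": d.month,
        "dow_sin": math.sin(two_pi * dow / 7),
        "dow_cos": math.cos(two_pi * dow / 7),
        "month_sin": math.sin(two_pi * d.month / 12),
        "month_cos": math.cos(two_pi * d.month / 12),
        "is_weekend": int(dow >= 6),
    }


def dow_distance(d1, d2):
    """Euclidean distance between two dates in (dow_sin, dow_cos) space."""
    f1, f2 = calendar_features(d1), calendar_features(d2)
    return math.hypot(f1["dow_sin"] - f2["dow_sin"], f1["dow_cos"] - f2["dow_cos"])


sunday, monday, tuesday = date(2024, 1, 7), date(2024, 1, 8), date(2024, 1, 9)
print(dow_distance(sunday, monday), dow_distance(monday, tuesday))
```

With the raw integer `dow`, Sunday (7) and Monday (1) would look six units apart; in the encoded space every pair of adjacent days is equally close.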

In [None]:
# ---------------------------------------------------------------------------
# F.5 — Time features (Polars)
# ---------------------------------------------------------------------------
time_cols_created = []

if enable_time_features and time_col and time_col in df_eng.columns:
    if df_eng[time_col].dtype not in (pl.Date, pl.Datetime):
        df_eng = df_eng.with_columns(pl.col(time_col).cast(pl.Date))

    pi2 = 2 * math.pi
    df_eng = df_eng.with_columns([
        pl.col(time_col).dt.weekday().alias("dow"),
        pl.col(time_col).dt.month().alias("month_num"),
        (pi2 * pl.col(time_col).dt.weekday() / 7).sin().alias("dow_sin"),
        (pi2 * pl.col(time_col).dt.weekday() / 7).cos().alias("dow_cos"),
        (pi2 * pl.col(time_col).dt.month() / 12).sin().alias("month_sin"),
        (pi2 * pl.col(time_col).dt.month() / 12).cos().alias("month_cos"),
        (pl.col(time_col).dt.weekday() >= 6).cast(pl.Int8).alias("is_weekend"),
    ])

    time_cols_created = ["dow", "month_num", "dow_sin", "dow_cos",
                         "month_sin", "month_cos", "is_weekend"]
    report["engineered_features_created"].extend(time_cols_created)
    print(f"Created {len(time_cols_created)} time features from {time_col}.")

    # --- Plot ---
    if task_type in ("binary", "multiclass"):
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
        dow_agg = (
            df_eng.group_by("dow")
            .agg(pl.col(target_col).mean().alias("target_rate"))
            .sort("dow")
        )
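        dow_labels = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
        # dt.weekday() is ISO-numbered (1 = Mon ... 7 = Sun); label by value so a
        # missing weekday does not shift the remaining labels.
        d_labels = [dow_labels[d - 1] if d is not None and 1 <= d <= 7 else str(d)
                    for d in dow_agg["dow"].to_list()]
        ax1.bar(d_labels, dow_agg["target_rate"].to_list(), color="#3498db", alpha=0.8)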
        ax1.set_title("Target Rate by Day of Week")
        ax1.set_ylabel("Target Rate")

        month_agg = (
            df_eng.group_by("month_num")
            .agg(pl.col(target_col).mean().alias("target_rate"))
            .sort("month_num")
        )
        month_labels = ["Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec"]
        m_indices = month_agg["month_num"].to_list()
        # Guard against a null month (rows with a missing timestamp).
        m_labels = [month_labels[m - 1] if m is not None and 1 <= m <= 12 else str(m) for m in m_indices]
        ax2.bar(m_labels, month_agg["target_rate"].to_list(), color="#e67e22", alpha=0.8)
        ax2.set_title("Target Rate by Month")
        ax2.set_ylabel("Target Rate")
        ax2.tick_params(axis="x", rotation=45)
        fig.suptitle("Time-Based Target Patterns", fontsize=12, y=1.02)
        fig.tight_layout()
        save_plot(fig, "f5_time_target_patterns.png")
else:
    if not time_col:
        print("No time_col — skipping time features.")
    else:
        print("Time features disabled.")

### F.6 — Categorical Handling

1. **Rare category bucketing** — merge rare categories into `__RARE__`.
2. **Frequency encoding** — add a `log1p(count)` column for each categorical.
3. **Target encoding** (optional) — cross-validated to prevent leakage.

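The leakage argument behind cross-validated target encoding is easiest to see on a tiny example. The pure-Python sketch below assigns folds by row index, computes category means on the other folds only, and falls back to the global mean for categories unseen in training. `oof_target_encode` and the toy data are hypothetical.

```python
def oof_target_encode(categories, targets, n_folds=3):
    """Out-of-fold target encoding: each row is encoded without its own fold."""
    n = len(categories)
    global_mean = sum(targets) / n
    folds = [i % n_folds for i in range(n)]  # fold by row index
    encoded = [global_mean] * n  # fallback for categories unseen in training folds

    for fold in range(n_folds):
        sums, counts = {}, {}
        for cat, y, f in zip(categories, targets, folds):
            if f != fold:  # training rows only
                sums[cat] = sums.get(cat, 0.0) + y
                counts[cat] = counts.get(cat, 0) + 1
        for i, (cat, f) in enumerate(zip(categories, folds)):
            if f == fold and cat in counts:
                encoded[i] = sums[cat] / counts[cat]
    return encoded


toy_cats = ["a", "a", "a", "b", "b", "b"]
toy_targets = [1, 1, 0, 0, 0, 1]
toy_encoded = oof_target_encode(toy_cats, toy_targets)
print(toy_encoded)
```

The third row has target 0 but is encoded as 1.0, because only the other two `"a"` rows (both with target 1) contribute. A naive in-sample mean would have let the row's own label pull the value toward 0.
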
In [None]:
# ---------------------------------------------------------------------------
# F.6 — Categorical handling (Polars)
# ---------------------------------------------------------------------------
freq_enc_cols = []

if inferred_cat:
    # Rare category bucketing in Polars
    for c in inferred_cat:
        vc = df_eng[c].cast(pl.Utf8).fill_null("__NULL__").value_counts()
        rare_vals = vc.filter(pl.col("count") < rare_category_min_count)[c].to_list()
        if rare_vals:
            df_eng = df_eng.with_columns(
                pl.when(pl.col(c).cast(pl.Utf8).is_in(rare_vals))
                .then(pl.lit("__RARE__"))
                .otherwise(pl.col(c).cast(pl.Utf8))
                .alias(c)
            )
    print(f"Rare category bucketing applied (threshold={rare_category_min_count}).")

    # Frequency encoding in Polars (log of count)
    for c in inferred_cat:
        freq_col = f"{c}_freq_enc"
        freq_map = df_eng.group_by(c).agg(pl.len().alias("_freq"))
        df_eng = df_eng.join(freq_map, on=c, how="left")
        df_eng = df_eng.with_columns(
            pl.col("_freq").cast(pl.Float64).log1p().alias(freq_col)
        ).drop("_freq")
        freq_enc_cols.append(freq_col)

    report["engineered_features_created"].extend(freq_enc_cols)
    print(f"Created {len(freq_enc_cols)} frequency-encoded features.")

In [None]:
# ---------------------------------------------------------------------------
# F.6b — Cross-validated target encoding (Polars-first)
# ---------------------------------------------------------------------------
target_enc_cols = []

if enable_target_encoding and inferred_cat:
    print("Target encoding enabled — using K-fold cross-validation to avoid leakage.")
    print("IMPORTANT: Without cross-validation, target encoding leaks the target")
    print("into features and inflates model performance.\n")

    global_mean = df_eng[target_col].mean()

    for c in inferred_cat[:10]:
        te_col = f"{c}_target_enc"
        # Add a fold column for CV
        n_rows = len(df_eng)
        fold_assignments = [(i % cv_folds) for i in range(n_rows)]
        df_eng = df_eng.with_columns(pl.Series("_fold", fold_assignments))

        # Initialize target encoding column with global mean
        df_eng = df_eng.with_columns(pl.lit(global_mean).alias(te_col))

        cat_col_str = df_eng[c].cast(pl.Utf8).fill_null("__NULL__")
        df_eng = df_eng.with_columns(cat_col_str.alias("_cat_str"))

        for fold in range(cv_folds):
            # Compute target means on training folds
            train_means = (
                df_eng.filter(pl.col("_fold") != fold)
                .group_by("_cat_str")
                .agg(pl.col(target_col).mean().alias("_te_mean"))
            )
            # Apply to validation fold via join
            df_eng = df_eng.join(train_means, on="_cat_str", how="left")
            df_eng = df_eng.with_columns(
                pl.when(pl.col("_fold") == fold)
                .then(pl.col("_te_mean").fill_null(global_mean))
                .otherwise(pl.col(te_col))
                .alias(te_col)
            ).drop("_te_mean")

        df_eng = df_eng.drop(["_fold", "_cat_str"])
        target_enc_cols.append(te_col)

    report["engineered_features_created"].extend(target_enc_cols)
    print(f"Created {len(target_enc_cols)} target-encoded features (CV'd).")

    # --- Plot ---
    if target_enc_cols and task_type in ("binary", "multiclass"):
        n_to_show = min(3, len(target_enc_cols))
        fig, axes = plt.subplots(1, n_to_show, figsize=(5 * n_to_show, 4))
        if n_to_show == 1:
            axes = [axes]
        df_te_sample = safe_sample(df_eng, 3000, sample_seed)
        for i, tc in enumerate(target_enc_cols[:n_to_show]):
            te_vals = df_te_sample[tc].to_list()
            t_vals = df_te_sample[target_col].to_list()
            # Add jitter via Python random
            t_jittered = [tv + random.gauss(0, 0.05) for tv in t_vals]
            axes[i].scatter(te_vals, t_jittered, alpha=0.2, s=5, color="#e74c3c")
            axes[i].set_xlabel(tc, fontsize=8)
            axes[i].set_ylabel("Target (jittered)")
            axes[i].set_title(tc.replace("_target_enc", ""), fontsize=9)
        fig.suptitle("Target Encoding vs Actual Target", fontsize=12, y=1.02)
        fig.tight_layout()
        save_plot(fig, "f6_target_encoding.png")
elif enable_target_encoding:
    print("Target encoding enabled but no categorical features found.")
else:
    print("Target encoding disabled.")

### F.7 — Text Features (lightweight)

If `text_cols` are provided, we extract simple statistics in Polars:
character length, word count, digit count.
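
These three statistics can be sketched with the standard library alone. The helper name `text_stats` is hypothetical; words are counted as runs of non-whitespace characters, so empty or missing text yields zero words.

```python
import re


def text_stats(text):
    """Character length, word count, and digit count for one text value."""
    s = text or ""  # treat None as empty text
    return {
        "len": len(s),
        "word_count": len(re.findall(r"\S+", s)),  # whitespace-delimited tokens
        "digit_count": len(re.findall(r"\d", s)),
    }


print(text_stats("Order 66 shipped"))
print(text_stats(None))
```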

In [None]:
# ---------------------------------------------------------------------------
# F.7 — Text features (Polars)
# ---------------------------------------------------------------------------
text_feat_cols = []

if text_cols:
    text_exprs = []
    for tc in text_cols:
        if tc not in df_eng.columns:
            continue
        len_col = f"{tc}_len"
        wc_col = f"{tc}_word_count"
        digit_col = f"{tc}_digit_count"
        text_exprs.extend([
            pl.col(tc).cast(pl.Utf8).fill_null("").str.len_chars().alias(len_col),
            # Count whitespace-delimited tokens; split(" ") would report 1 word for empty/null text.
            pl.col(tc).cast(pl.Utf8).fill_null("").str.count_matches(r"\S+").alias(wc_col),
            pl.col(tc).cast(pl.Utf8).fill_null("").str.count_matches(r"\d").alias(digit_col),
        ])
        text_feat_cols.extend([len_col, wc_col, digit_col])

    if text_exprs:
        df_eng = df_eng.with_columns(text_exprs)
    report["engineered_features_created"].extend(text_feat_cols)
    print(f"Created {len(text_feat_cols)} text features from {len(text_cols)} text columns.")

    # --- Plot: text length vs target ---
    if text_feat_cols and task_type in ("binary", "multiclass"):
        len_cols_to_plot = [c for c in text_feat_cols if c.endswith("_len")][:3]
        if len_cols_to_plot:
            fig, axes = plt.subplots(1, len(len_cols_to_plot), figsize=(5 * len(len_cols_to_plot), 4))
            if len(len_cols_to_plot) == 1:
                axes = [axes]
            target_classes = sorted(df_eng[target_col].unique().to_list())
            for i, lc in enumerate(len_cols_to_plot):
                for cls_val in target_classes:
                    subset = df_eng.filter(pl.col(target_col) == cls_val)
                    vals = subset[lc].drop_nulls().to_list()
                    axes[i].hist(vals, bins=30, alpha=0.5, label=f"class={cls_val}",
                               edgecolor="white", linewidth=0.3)
                axes[i].set_title(lc, fontsize=9)
                axes[i].set_xlabel("Length")
                axes[i].legend(fontsize=7)
            fig.suptitle("Text Length Distribution by Target Class", fontsize=12, y=1.02)
            fig.tight_layout()
            save_plot(fig, "f7_text_length_by_target.png")
else:
    print("No text columns — skipping text features.")

---
## G — Model-Based Feature Selection

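We fit simple linear baselines (L1-regularized logistic regression or Lasso for
selection, plain logistic regression or Ridge for scoring) purely to rank and
select features; they are not intended as final models.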

> **sklearn boundary:** Data is prepared in Polars (fill nulls, cast types),
> then converted to numpy once via `polars_to_numpy_for_sklearn()`.

1. **L1-regularized selection** — drives unimportant coefficients to zero
2. **Permutation importance** — model-agnostic, measures actual predictive lift
3. **Stability selection** — consistency across CV folds
4. **Performance vs number of features** curve

In [None]:
# ---------------------------------------------------------------------------
# G — Prepare train/test split (Polars → numpy at sklearn boundary)
# ---------------------------------------------------------------------------

# Collect all numeric feature columns (original + engineered)
all_eng_features = list(dict.fromkeys(
    inferred_num
    + freq_enc_cols
    + log_cols_created
    + interaction_cols_created
    + group_agg_cols_created
    + time_cols_created
    + text_feat_cols
    + target_enc_cols
    + miss_indicator_cols
))
# Filter to columns that exist and are numeric
all_eng_features = [
    c for c in all_eng_features
    if c in df_eng.columns and df_eng[c].dtype.is_numeric()
]
print(f"Total numeric features for model-based selection: {len(all_eng_features)}")

# Convert to numpy at sklearn boundary
X_all, y_all = polars_to_numpy_for_sklearn(df_eng, all_eng_features, target_col)

# Train/test split
if task_type in ("binary", "multiclass") and stratify:
    X_train, X_test, y_train, y_test = train_test_split(
        X_all, y_all, test_size=test_size, random_state=random_state, stratify=y_all,
    )
else:
    X_train, X_test, y_train, y_test = train_test_split(
        X_all, y_all, test_size=test_size, random_state=random_state,
    )
print(f"Train shape: {X_train.shape}, Test shape: {X_test.shape}")

scaler = StandardScaler()
X_train_sc = scaler.fit_transform(X_train)
X_test_sc = scaler.transform(X_test)

### G.1 — L1-Regularized Feature Selection

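L1 (Lasso) regularization shrinks the coefficients of uninformative features to
exactly zero. A feature counts as "selected" when its absolute coefficient on the
standardized training data exceeds `1e-6`; for multiclass targets the absolute
coefficients are averaged across classes first. Classification tasks use
`LogisticRegression(penalty="l1", C=1.0)` and regression uses `Lasso(alpha=0.01)`.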
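
To see why an L1 penalty produces exact zeros, consider the simplified case of orthonormal features, where the L1 solution is the least-squares coefficient passed through a soft-threshold. The sketch below assumes that special case; the feature names, coefficient values, and `lam` are made up for illustration and are not tied to the `C` or `alpha` settings used in this notebook.

```python
def soft_threshold(value, lam):
    """Shrink value toward zero by lam; anything within [-lam, lam] becomes exactly 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0


# Hypothetical least-squares coefficients on standardized, orthonormal features
ols_coefs = {"age": 0.80, "income": -0.35, "noise_a": 0.04, "noise_b": -0.02}
lam = 0.10  # illustrative penalty strength

l1_coefs = {name: soft_threshold(c, lam) for name, c in ols_coefs.items()}

# Same selection rule as the notebook: keep features with |coef| > 1e-6
selected = [name for name, c in l1_coefs.items() if abs(c) > 1e-6]
print(selected)
```

Raising `lam` removes more features; with correlated real-world features the solution no longer has this closed form, which is why the notebook relies on a solver instead.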

In [None]:
# ---------------------------------------------------------------------------
# G.1 — L1-regularized selection
# ---------------------------------------------------------------------------
l1_selected = []

if task_type in ("binary", "multiclass"):
    l1_model = LogisticRegression(
        penalty="l1", solver="liblinear", C=1.0,
        max_iter=1000, random_state=random_state,
    )
    l1_model.fit(X_train_sc, y_train)
    coef = abs(l1_model.coef_).mean(axis=0) if l1_model.coef_.ndim > 1 else abs(l1_model.coef_[0])
else:
    from sklearn.linear_model import Lasso
    l1_model = Lasso(alpha=0.01, max_iter=2000, random_state=random_state)
    l1_model.fit(X_train_sc, y_train)
    coef = abs(l1_model.coef_)

l1_importance = {all_eng_features[i]: float(coef[i]) for i in range(len(all_eng_features))}
l1_selected = [f for f, v in l1_importance.items() if v > 1e-6]
print(f"L1 selection: {len(l1_selected)} / {len(all_eng_features)} features with non-zero coefficient.")

report["model_based_importance"]["l1_coefficients"] = sorted(
    l1_importance.items(), key=lambda x: x[1], reverse=True
)

### G.2 — Permutation Importance

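Permutation importance measures how much a fitted model's score drops on the
held-out test set when one feature's values are shuffled, averaged over 10
shuffles. The score is F1 for binary tasks, macro F1 for multiclass, and negative
RMSE for regression. Values at or below zero mean the baseline model does not
rely on that feature.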
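
The mechanism can be sketched without any libraries. The example below is an illustration only, not the scikit-learn routine used in the next cell: `permutation_importance_sketch`, `neg_mse`, and the toy "model" are hypothetical names, and the data is synthetic. The toy model reads only feature 0, so shuffling feature 1 cannot change the score.

```python
import random


def neg_mse(y_true, y_pred):
    """Negative mean squared error, so that higher is better."""
    return -sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def permutation_importance_sketch(predict, X, y, n_repeats=10, seed=0):
    """Mean score drop per feature after shuffling that feature's column."""
    rng = random.Random(seed)
    baseline = neg_mse(y, [predict(row) for row in X])
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j replaced by its shuffled values
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            permuted = neg_mse(y, [predict(row) for row in X_perm])
            drops.append(baseline - permuted)
        importances.append(sum(drops) / n_repeats)
    return importances


def predict(row):
    """Toy 'fitted model' that uses feature 0 and ignores feature 1."""
    return 3.0 * row[0]


X = [[float(i), float((i * 7) % 5)] for i in range(20)]
y = [3.0 * row[0] for row in X]

imp = permutation_importance_sketch(predict, X, y)
print(imp)
```

The first importance is large and positive, the second is zero, which is the pattern to look for in the bar chart produced below.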

In [None]:
# ---------------------------------------------------------------------------
# G.2 — Permutation importance
# ---------------------------------------------------------------------------
perm_importance_dict = {}

if enable_permutation_importance:
    if task_type in ("binary", "multiclass"):
        base_model = LogisticRegression(max_iter=1000, random_state=random_state, solver="lbfgs")
        scoring_str = "f1_macro" if task_type == "multiclass" else "f1"
    else:
        base_model = Ridge(alpha=1.0, random_state=random_state)
        scoring_str = "neg_root_mean_squared_error"

    base_model.fit(X_train_sc, y_train)
    perm_result = permutation_importance(
        base_model, X_test_sc, y_test,
        n_repeats=10, random_state=random_state,
        scoring=scoring_str, n_jobs=-1,
    )
    perm_importance_dict = {
        all_eng_features[i]: float(perm_result.importances_mean[i])
        for i in range(len(all_eng_features))
    }
    perm_sorted = sorted(perm_importance_dict.items(), key=lambda x: x[1], reverse=True)
    report["model_based_importance"]["permutation_importance"] = [
        {"feature": f, "importance": round(v, 6)} for f, v in perm_sorted
    ]

    # --- Plot ---
    top_perm = perm_sorted[:30]
    fig, ax = plt.subplots(figsize=(10, max(5, len(top_perm) * 0.25)))
    ax.barh(range(len(top_perm)), [v for _, v in top_perm], color="#c0392b", alpha=0.8)
    ax.set_yticks(range(len(top_perm)))
    ax.set_yticklabels([f for f, _ in top_perm], fontsize=8)
    ax.set_xlabel("Mean Permutation Importance")
    ax.set_title("Top 30 Features — Permutation Importance")
    ax.invert_yaxis()
    fig.tight_layout()
    save_plot(fig, "g2_permutation_importance.png")
    print(f"Permutation importance computed for {len(perm_importance_dict)} features.")
else:
    print("Permutation importance disabled.")

### G.3 — Stability Selection (Selection Frequency Across CV Folds)

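How consistently is each feature selected across different folds? We refit the
L1 model from G.1 on the training portion of each of the `cv_folds` folds and
count the folds in which a feature's absolute coefficient exceeds `1e-6`.
Features chosen in every fold are the most stable.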
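
The counting logic can be sketched in plain Python. To keep the example dependency-free, a simple correlation threshold stands in for the L1 model, so this is a stand-in selector rather than the method used in the cell below. `selection_frequency`, `pearson`, the threshold, and the synthetic data are all assumptions made for illustration.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0


def selection_frequency(columns, y, n_folds=5, threshold=0.5, seed=0):
    """Count in how many folds each feature passes the stand-in selector."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    counts = {name: 0 for name in columns}
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        y_tr = [y[i] for i in train]
        for name, col in columns.items():
            # Stand-in for "coefficient is non-zero": strong correlation on this fold
            if abs(pearson([col[i] for i in train], y_tr)) > threshold:
                counts[name] += 1
    return counts


rng = random.Random(1)
signal = [rng.gauss(0, 1) for _ in range(200)]
noise = [rng.gauss(0, 1) for _ in range(200)]
y = [s + 0.1 * rng.gauss(0, 1) for s in signal]

counts = selection_frequency({"signal": signal, "noise": noise}, y)
print(counts)
```

A feature that carries real signal is picked in every fold, while a pure-noise feature is picked rarely or never; the bar chart below shows the same counts for the actual L1 model.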

In [None]:
# ---------------------------------------------------------------------------
# G.3 — Stability selection
# ---------------------------------------------------------------------------
selection_freq: dict[str, int] = {f: 0 for f in all_eng_features}

if task_type in ("binary", "multiclass"):
    kf = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=random_state)
    split_iter = list(kf.split(X_train_sc, y_train))
else:
    kf = KFold(n_splits=cv_folds, shuffle=True, random_state=random_state)
    split_iter = list(kf.split(X_train_sc))

for fold_idx, (tr_idx, va_idx) in enumerate(split_iter):
    X_fold_tr, y_fold_tr = X_train_sc[tr_idx], y_train[tr_idx]
    if task_type in ("binary", "multiclass"):
        fold_model = LogisticRegression(
            penalty="l1", solver="liblinear", C=1.0,
            max_iter=1000, random_state=random_state,
        )
    else:
        from sklearn.linear_model import Lasso
        fold_model = Lasso(alpha=0.01, max_iter=2000, random_state=random_state)
    fold_model.fit(X_fold_tr, y_fold_tr)

    if hasattr(fold_model, "coef_"):
        fc = abs(fold_model.coef_)
        if fc.ndim > 1:
            fc = fc.mean(axis=0)
        for i, f in enumerate(all_eng_features):
            if fc[i] > 1e-6:
                selection_freq[f] += 1

freq_sorted = sorted(selection_freq.items(), key=lambda x: x[1], reverse=True)
report["selection_frequency"] = {f: c for f, c in freq_sorted}

top_freq = [(f, c) for f, c in freq_sorted if c > 0][:40]
if top_freq:
    fig, ax = plt.subplots(figsize=(10, max(5, len(top_freq) * 0.25)))
    ax.barh(range(len(top_freq)), [c for _, c in top_freq], color="#2980b9", alpha=0.8)
    ax.set_yticks(range(len(top_freq)))
    ax.set_yticklabels([f for f, _ in top_freq], fontsize=7)
    ax.set_xlabel(f"Selection Count (out of {cv_folds} folds)")
    ax.set_title(f"Feature Selection Frequency Across {cv_folds} CV Folds")
    ax.set_xlim(0, cv_folds + 0.5)
    ax.invert_yaxis()
    fig.tight_layout()
    save_plot(fig, "g3_selection_frequency.png")
    print(f"Features in all {cv_folds} folds: {sum(1 for _, c in freq_sorted if c == cv_folds)}")
else:
    print("No features selected in any fold.")

### G.4 — Performance vs Number of Features

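Evaluate the baseline model on the top-k features for increasing k, ranked by
permutation importance (falling back to the consensus ranking, then to the
original column order). The resulting "elbow plot" shows where adding features
stops improving the score. Because the permutation ranking and these scores are
both computed on the same test set, treat the curve as optimistic.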

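The cell below annotates the k with the highest score. One optional way to read the elbow programmatically, in line with the parsimony principle, is to take the smallest k whose score is within a small tolerance of the best. This is a sketch of that alternative rule, not something the notebook does; `smallest_k_within_tolerance`, the tolerance, and the scores are illustrative assumptions, not results.

```python
def smallest_k_within_tolerance(results, tolerance=0.005):
    """Smallest k whose score is within `tolerance` of the best (higher is better)."""
    best = max(r["score"] for r in results)
    eligible = [r["k"] for r in results if r["score"] >= best - tolerance]
    return min(eligible)


# Made-up scores in the same shape as the notebook's topk_results entries
results = [
    {"k": 5, "score": 0.712},
    {"k": 10, "score": 0.781},
    {"k": 15, "score": 0.803},
    {"k": 20, "score": 0.806},
    {"k": 30, "score": 0.807},
]

print(smallest_k_within_tolerance(results))
```

The rule also works for the negative-RMSE scores used in regression, since those follow the same higher-is-better convention.
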
In [None]:
# ---------------------------------------------------------------------------
# G.4 — Performance vs number of features
# ---------------------------------------------------------------------------
if perm_importance_dict:
    ranked_feats = [f for f, _ in sorted(perm_importance_dict.items(), key=lambda x: x[1], reverse=True)]
elif consensus_sorted:
    ranked_feats = [f for f, _ in consensus_sorted if f in all_eng_features]
else:
    ranked_feats = all_eng_features

k_values = sorted({k for k in [5, 10, 15, 20, 30, 50, 75, 100, len(ranked_feats)]
                    if 0 < k <= len(ranked_feats)})

topk_results = []
for k in k_values:
    top_k_feats = ranked_feats[:k]
    feat_indices = [all_eng_features.index(f) for f in top_k_feats if f in all_eng_features]
    if not feat_indices:
        continue
    X_tr_k = X_train_sc[:, feat_indices]
    X_te_k = X_test_sc[:, feat_indices]

    if task_type in ("binary", "multiclass"):
        m = LogisticRegression(max_iter=1000, random_state=random_state, solver="lbfgs")
    else:
        m = Ridge(alpha=1.0, random_state=random_state)
    m.fit(X_tr_k, y_train)
    y_pred = m.predict(X_te_k)

    if task_type == "binary":
        score = f1_score(y_test, y_pred, zero_division=0)
        metric_name = "F1"
    elif task_type == "multiclass":
        score = f1_score(y_test, y_pred, average="macro", zero_division=0)
        metric_name = "F1 (macro)"
    else:
        score = -math.sqrt(mean_squared_error(y_test, y_pred))
        metric_name = "Neg RMSE"

    topk_results.append({"k": k, "metric": metric_name, "score": round(float(score), 4)})

report["top_k_performance"] = topk_results

if topk_results:
    ks = [r["k"] for r in topk_results]
    scores = [r["score"] for r in topk_results]
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.plot(ks, scores, marker="o", color="#8e44ad", linewidth=2, markersize=6)
    ax.fill_between(ks, scores, alpha=0.1, color="#8e44ad")
    ax.set_xlabel("Number of Features (k)")
    ax.set_ylabel(topk_results[0]["metric"])
    ax.set_title(f"Performance vs Number of Features ({topk_results[0]['metric']})")
    ax.grid(True, alpha=0.3)

    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    ax.annotate(
        f"k={ks[best_idx]}, score={scores[best_idx]:.4f}",
        xy=(ks[best_idx], scores[best_idx]),
        xytext=(ks[best_idx] + 2, scores[best_idx] - 0.01),
        fontsize=9, color="#c0392b",
        arrowprops=dict(arrowstyle="->", color="#c0392b"),
    )
    fig.tight_layout()
    save_plot(fig, "g4_performance_vs_features.png")
    print(f"Best performance at k={ks[best_idx]} features: {scores[best_idx]:.4f}")

---
## H — Leakage & Drift Checks

### Leakage heuristics
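- Near-perfect univariate AUC (>0.95, binary tasks only), which may mean the feature encodes the target directly.
- High cardinality (above `high_cardinality_threshold`) combined with correlation or mutual information above 0.5, which suggests an ID-like column disguised as a feature.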
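
For intuition about the AUC heuristic, the sketch below computes a univariate AUC as the fraction of positive/negative pairs in which the positive row has the larger feature value (ties count as half). How the notebook's own `univariate_auc` scores were computed is defined in an earlier section; `univariate_auc` here is a hypothetical stand-alone helper and the data is made up.

```python
def univariate_auc(values, labels):
    """AUC of a single feature used directly as a score for a 0/1 target."""
    pos = [v for v, l in zip(values, labels) if l == 1]
    neg = [v for v, l in zip(values, labels) if l == 0]
    # Each positive/negative pair scores 1 for a win, 0.5 for a tie, 0 for a loss
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [0, 0, 0, 1, 1, 1]
leaky = [0.1, 0.2, 0.3, 0.9, 0.8, 0.7]  # separates the classes perfectly
weak = [0.1, 0.8, 0.3, 0.2, 0.9, 0.4]   # only loosely related to the label

for name, feature in [("leaky", leaky), ("weak", weak)]:
    auc = univariate_auc(feature, labels)
    flag = "SUSPECT" if auc > 0.95 else "ok"
    print(f"{name}: AUC={auc:.3f} {flag}")
```

A single raw feature that separates the classes this cleanly is more often a leak (for example, a field populated after the outcome is known) than a genuinely strong predictor, so it deserves a manual check.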

### Drift checks (if `time_col` exists)
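- Two-sample KS test between the early and late halves of the data (split at the median of `time_col`) for the first 20 numeric features; p < 0.01 is flagged as drift.
- Compare a chronological split (train on the early half, test on the late half) against the random-split score.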
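
The two-sample KS statistic is the largest gap between the empirical cumulative distribution functions of the two samples. The sketch below computes only that statistic in plain Python; the notebook uses scipy's `ks_2samp`, which also returns the p-value used for the drift flag. `ks_statistic` is a hypothetical helper and the samples are made up.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The gap can only change at observed values, so checking those is enough
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


early = [1.0, 2.0, 3.0, 4.0, 5.0]
late = [4.0, 5.0, 6.0, 7.0, 8.0]  # same shape, shifted upward

print(ks_statistic(early, late))   # large gap: the distribution moved
print(ks_statistic(early, early))  # identical samples: no gap
```

A statistic near 0 means the two periods look alike; values approaching 1 mean the periods barely overlap.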

In [None]:
# ---------------------------------------------------------------------------
# H — Leakage heuristics
# ---------------------------------------------------------------------------
leakage_suspects = []

if task_type == "binary" and "univariate_auc" in score_rankings:
    for f, auc in score_rankings["univariate_auc"].items():
        if auc > 0.95:
            leakage_suspects.append({"feature": f, "reason": f"univariate_AUC={auc:.3f} (>0.95)"})

for f in feature_candidates:
    nunique = df_raw[f].n_unique()
    corr_val = score_rankings.get("correlation", {}).get(f, 0)
    mi_val = score_rankings.get("mutual_info", {}).get(f, 0)
    if nunique > high_cardinality_threshold and (corr_val > 0.5 or mi_val > 0.5):
        leakage_suspects.append({
            "feature": f,
            "reason": f"high_cardinality({nunique}) + high_signal(corr={corr_val:.2f}, MI={mi_val:.2f})"
        })

report["high_risk_features"].extend(leakage_suspects)
if leakage_suspects:
    print("Leakage suspects:")
    for s in leakage_suspects:
        print(f"  {s['feature']}: {s['reason']}")
else:
    print("No leakage suspects identified.")

# --- Plot: n_unique vs correlation/MI bubble ---
if inferred_num and "correlation" in score_rankings:
    plot_feats = inferred_num[:50]
    n_uniques = [df_raw[f].n_unique() for f in plot_feats]
    corrs = [score_rankings["correlation"].get(f, 0) for f in plot_feats]
    mi_vals = [score_rankings.get("mutual_info", {}).get(f, 0) for f in plot_feats]

    fig, ax = plt.subplots(figsize=(10, 6))
    scatter = ax.scatter(
        n_uniques, corrs,
        s=[max(10, v * 500) for v in mi_vals],
        c=mi_vals, cmap="YlOrRd", alpha=0.7, edgecolors="black", linewidth=0.5,
    )
    cbar = plt.colorbar(scatter, ax=ax)
    cbar.set_label("Mutual Information", fontsize=9)
    ax.set_xlabel("Number of Unique Values")
    ax.set_ylabel("|Correlation with Target|")
    ax.set_title("Leakage Detection: n_unique vs Correlation (bubble = MI)")
    ax.set_xscale("log")
    for s in leakage_suspects:
        f = s["feature"]
        if f in plot_feats:
            idx = plot_feats.index(f)
            ax.annotate(f, (n_uniques[idx], corrs[idx]), fontsize=7, color="red")
    fig.tight_layout()
    save_plot(fig, "h_leakage_bubble.png")

In [None]:
# ---------------------------------------------------------------------------
# H — Drift checks (Polars + scipy at KS-test boundary)
# ---------------------------------------------------------------------------
if time_col and time_col in df_raw.columns and enable_stability_checks:
    print("Running feature drift checks across time periods...\n")

    df_with_time = df_raw.filter(pl.col(time_col).is_not_null())
    median_time = df_with_time[time_col].cast(pl.Date).median()
    df_early = df_with_time.filter(pl.col(time_col).cast(pl.Date) <= median_time)
    df_late = df_with_time.filter(pl.col(time_col).cast(pl.Date) > median_time)
    print(f"Early period: {len(df_early):,} rows, Late period: {len(df_late):,} rows")

    # KS-test for the first 20 numeric features (scipy; no Polars equivalent)
    drift_results = []
    for f in inferred_num[:20]:
        early_vals = df_early[f].drop_nulls().to_list()
        late_vals = df_late[f].drop_nulls().to_list()
        if len(early_vals) < 10 or len(late_vals) < 10:
            continue
        ks_stat, ks_pval = sp_stats.ks_2samp(early_vals, late_vals)
        drift_results.append({
            "feature": f,
            "ks_statistic": round(float(ks_stat), 4),
            "ks_pvalue": round(float(ks_pval), 6),
            # Cast to a plain Python bool so Polars and JSON both handle it cleanly
            "drifted": bool(ks_pval < 0.01),
        })

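    if drift_results:
        drift_df = pl.DataFrame(drift_results).sort("ks_statistic", descending=True)
        print("Feature drift (KS-test, early vs late period):")
        print(drift_df)
    else:
        # An empty frame has no "ks_statistic" column, so sorting it would raise
        print("No feature had at least 10 non-null values in both periods; KS-test skipped.")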

    drifted_feats = [r["feature"] for r in drift_results if r["drifted"]]
    for f in drifted_feats:
        report["high_risk_features"].append({"feature": f, "reason": "temporal_drift (KS p<0.01)"})

    # --- Plot: KS statistic bar chart ---
    dr_sorted = sorted(drift_results, key=lambda x: x["ks_statistic"], reverse=True)
    feats_dr = [r["feature"] for r in dr_sorted]
    ks_vals = [r["ks_statistic"] for r in dr_sorted]
    colors_dr = ["#e74c3c" if r["drifted"] else "#3498db" for r in dr_sorted]
    fig, ax = plt.subplots(figsize=(10, max(4, len(feats_dr) * 0.25)))
    ax.barh(range(len(feats_dr)), ks_vals, color=colors_dr, alpha=0.8)
    ax.set_yticks(range(len(feats_dr)))
    ax.set_yticklabels(feats_dr, fontsize=8)
    ax.set_xlabel("KS Statistic")
    ax.set_title("Feature Drift: KS-Test (red = significant drift at p<0.01)")
    ax.invert_yaxis()
    fig.tight_layout()
    save_plot(fig, "h_feature_drift_ks.png")

    # --- Chronological split performance comparison ---
    if all_eng_features and len(df_early) > 100 and len(df_late) > 100:
        # Only use features that exist in df_early (original columns, not engineered)
        early_cols = set(df_early.columns)
        top_feats_chrono = [f for f in ranked_feats if f in early_cols][:min(30, len(ranked_feats))]
        if not top_feats_chrono:
            top_feats_chrono = [f for f in inferred_num if f in early_cols][:20]
        X_early, y_early = polars_to_numpy_for_sklearn(df_early, top_feats_chrono, target_col)
        X_late, y_late = polars_to_numpy_for_sklearn(df_late, top_feats_chrono, target_col)
        sc2 = StandardScaler()
        X_early_sc = sc2.fit_transform(X_early)
        X_late_sc = sc2.transform(X_late)

        if task_type in ("binary", "multiclass"):
            chrono_model = LogisticRegression(max_iter=1000, random_state=random_state)
        else:
            chrono_model = Ridge(alpha=1.0, random_state=random_state)
        chrono_model.fit(X_early_sc, y_early)
        y_pred_chrono = chrono_model.predict(X_late_sc)

        if task_type == "binary":
            chrono_score = f1_score(y_late, y_pred_chrono, zero_division=0)
            print(f"\nChronological split F1: {chrono_score:.4f}")
        elif task_type == "multiclass":
            chrono_score = f1_score(y_late, y_pred_chrono, average="macro", zero_division=0)
            print(f"\nChronological split F1 (macro): {chrono_score:.4f}")
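        else:
            # Negate RMSE so higher is better, matching the "Neg RMSE" scores in
            # topk_results that this value is compared against below
            chrono_score = -math.sqrt(mean_squared_error(y_late, y_pred_chrono))
            print(f"\nChronological split Neg RMSE: {chrono_score:.4f}")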

        if topk_results:
            random_score = max(r["score"] for r in topk_results)
            print(f"Random split best score: {random_score:.4f}")
            fig, ax = plt.subplots(figsize=(6, 4))
            ax.bar(["Random Split", "Chronological Split"],
                   [random_score, chrono_score], color=["#3498db", "#e67e22"], alpha=0.8)
            ax.set_ylabel("Score")
            ax.set_title("Random vs Chronological Split Performance")
            for i, v in enumerate([random_score, chrono_score]):
                ax.text(i, v, f"{v:.4f}", ha="center", va="bottom", fontsize=10)
            fig.tight_layout()
            save_plot(fig, "h_chrono_vs_random.png")
else:
    if not time_col:
        print("No time_col — skipping drift checks.")
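    elif time_col not in df_raw.columns:
        print(f"time_col '{time_col}' not found in data; skipping drift checks.")
    else:
        print("Stability checks disabled.")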

---
## I — Final Recommendations & Exports

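The final set combines the signals above. A feature is recommended when it is
selected in at least `cv_folds - 1` folds (G.3) and, when permutation importance
is enabled, has a mean importance above 0.001 (G.2). If fewer than five features
qualify, we fall back to the top 20 features of the consensus ranking. Leakage
suspects from section H are removed in either case.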
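
As a small worked example of how these signals combine, the sketch below intersects "stable" and "useful" features and then removes leakage suspects. It mirrors the thresholds used in the next cell but omits the consensus fallback; `recommend` is a hypothetical helper and all values are made up.

```python
def recommend(selection_freq, perm_importance, leakage_suspects, cv_folds=5, min_perm=0.001):
    """Features that are stable across folds, useful to the model, and not leak suspects."""
    stable = {f for f, c in selection_freq.items() if c >= max(1, cv_folds - 1)}
    useful = {f for f, v in perm_importance.items() if v > min_perm}
    return sorted((stable & useful) - set(leakage_suspects))


# Illustrative inputs: fold counts out of 5, mean permutation importances, one suspect
selection_freq = {"age": 5, "income": 4, "zip_hash": 5, "noise": 1}
perm_importance = {"age": 0.04, "income": 0.012, "zip_hash": 0.2, "noise": 0.0}
leakage_suspects = ["zip_hash"]

print(recommend(selection_freq, perm_importance, leakage_suspects))
```

Here `zip_hash` scores well on both signals yet is excluded, which is the intended behavior: a strong but suspicious feature should be reviewed by hand before it is used.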

In [None]:
# ---------------------------------------------------------------------------
# I — Final recommended feature set
# ---------------------------------------------------------------------------
stable_features = [f for f, c in selection_freq.items() if c >= max(1, cv_folds - 1)]
if perm_importance_dict:
    perm_positive = [f for f, v in perm_importance_dict.items() if v > 0.001]
else:
    perm_positive = all_eng_features

recommended = sorted(set(stable_features) & set(perm_positive))

if len(recommended) < 5 and consensus_sorted:
    recommended = [f for f, _ in consensus_sorted if f in all_eng_features][:20]
    print(f"Fallback: using top {len(recommended)} features by consensus.")

leak_set = {s["feature"] for s in leakage_suspects}
recommended = [f for f in recommended if f not in leak_set]

report["recommended_features"] = recommended
print(f"\nRecommended features ({len(recommended)}):")
for f in recommended:
    print(f"  - {f}")

print(f"\nEngineered features created ({len(report['engineered_features_created'])}):")
for f in report["engineered_features_created"][:20]:
    print(f"  + {f}")
if len(report["engineered_features_created"]) > 20:
    print(f"  ... and {len(report['engineered_features_created']) - 20} more")

print(f"\nDropped features ({len(report['dropped_features'])}):")
for d in report["dropped_features"]:
    print(f"  x {d['feature']}: {d['reason']}")

print(f"\nHigh-risk features ({len(report['high_risk_features'])}):")
for h in report["high_risk_features"]:
    if isinstance(h, dict):
        print(f"  ! {h.get('feature', h)}: {h.get('reason', '')}")

In [None]:
# ---------------------------------------------------------------------------
# I — Export transformed features parquet (Polars)
# ---------------------------------------------------------------------------
if output_features_parquet_path:
    export_cols = [target_col] + recommended
    for c in id_cols:
        if c in df_eng.columns and c not in export_cols:
            export_cols.insert(0, c)
    if time_col and time_col in df_eng.columns and time_col not in export_cols:
        export_cols.insert(0, time_col)
    export_cols = [c for c in export_cols if c in df_eng.columns]
    df_export = df_eng.select(export_cols)
    Path(output_features_parquet_path).parent.mkdir(parents=True, exist_ok=True)
    df_export.write_parquet(output_features_parquet_path)
    print(f"\nExported {df_export.shape} to {output_features_parquet_path}")
else:
    print("\nNo output_features_parquet_path — skipping parquet export.")

In [None]:
# ---------------------------------------------------------------------------
# I — Save metrics JSON
# ---------------------------------------------------------------------------
report["run_metadata"]["data_shape"] = list(df_raw.shape)
report["run_metadata"]["n_features_original"] = len(feature_candidates) + len(high_null_cols)
report["run_metadata"]["n_features_engineered"] = len(report["engineered_features_created"])
report["run_metadata"]["n_features_recommended"] = len(recommended)
report["run_metadata"]["parameters"] = {
    "input_parquet_paths": input_parquet_paths,
    "target_col": target_col,
    "task_type": task_type,
    "test_size": test_size,
    "cv_folds": cv_folds,
    "missingness_drop_threshold": missingness_drop_threshold,
    "high_cardinality_threshold": high_cardinality_threshold,
    "rare_category_min_count": rare_category_min_count,
    "enable_interactions": enable_interactions,
    "enable_group_aggregations": enable_group_aggregations,
    "enable_time_features": enable_time_features,
    "enable_target_encoding": enable_target_encoding,
    "enable_mutual_info": enable_mutual_info,
    "enable_permutation_importance": enable_permutation_importance,
    "enable_stability_checks": enable_stability_checks,
}

def json_safe(obj):
    """Handle non-JSON-serializable types from sklearn results."""
    if hasattr(obj, "item"):  # numpy scalar
        return obj.item()
    if hasattr(obj, "tolist"):  # numpy array
        return obj.tolist()
    return str(obj)

Path(metrics_json_path).parent.mkdir(parents=True, exist_ok=True)
with open(metrics_json_path, "w") as f:
    json.dump(report, f, indent=2, default=json_safe)
print(f"Metrics JSON saved to {metrics_json_path}")

---
## Summary

This notebook completed a full feature engineering and selection pipeline:

1. **Data audit** — missingness, cardinality, distributions, outliers
2. **Univariate scoring** — correlation, ANOVA, mutual info, AUC, chi-squared
3. **Consensus ranking** — cross-method agreement
4. **Feature engineering** — missingness indicators, log transforms, interactions,
   group aggregations, time features, frequency encoding, optional target encoding, text features
5. **Model-based selection** — L1 regularization, permutation importance, stability selection
6. **Leakage & drift checks** — temporal drift, suspicious features
7. **Final recommendations** — curated feature list with explanations

In [None]:
# ---------------------------------------------------------------------------
# Final summary
# ---------------------------------------------------------------------------
print("=" * 60)
print("FEATURE ENGINEERING & SELECTION — COMPLETE")
print("=" * 60)
print(f"  Task type:              {task_type}")
print(f"  Original features:      {len(feature_candidates) + len(high_null_cols)}")
print(f"  Engineered features:    {len(report['engineered_features_created'])}")
print(f"  Recommended features:   {len(recommended)}")
print(f"  Dropped features:       {len(report['dropped_features'])}")
print(f"  High-risk features:     {len(report['high_risk_features'])}")
print(f"  Metrics JSON:           {metrics_json_path}")
if output_features_parquet_path:
    print(f"  Features parquet:       {output_features_parquet_path}")
print(f"  Plots directory:        {plots_dir}")
print("=" * 60)