# 04 – Data Preprocessing (pandas / scikit-learn)

This notebook prepares the fused data for machine learning.  We encode categorical features (label encoding for low-cardinality columns, smoothed target encoding for high-cardinality ones), impute and scale numeric features, and split the data into training and test sets, by time when a timestamp column is available.  Class imbalance is handled with a `scale_pos_weight` ratio computed on the training set rather than by oversampling.  Our goal is a well-structured, leakage-checked dataset for model training.
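
For high-cardinality categorical columns, the preprocessing cell in this notebook uses a smoothed target encoding: each category is replaced by `(count * mean + smoothing * global_mean) / (count + smoothing)`, and categories not seen during fitting fall back to the global mean. The following is a minimal, dependency-free sketch of that formula on made-up toy data. The function names and the toy values are illustrative assumptions and are not part of the notebook.

```python
from collections import defaultdict


def fit_smoothed_target_encoding(categories, labels, smoothing=50.0):
    """Learn a category -> encoded value map for one column (illustrative helper)."""
    if len(categories) != len(labels):
        raise ValueError("categories and labels must have the same length")
    if not labels:
        raise ValueError("cannot fit on empty data")

    global_mean = sum(labels) / len(labels)

    counts = defaultdict(int)
    sums = defaultdict(float)
    for cat, y in zip(categories, labels):
        counts[cat] += 1
        sums[cat] += y

    # count * mean equals the per-category label sum, so the numerator
    # below is the same quantity as (count * mean + smoothing * global_mean).
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean


def apply_target_encoding(categories, mapping, global_mean):
    """Encode a column; unseen categories fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]


# Toy data, invented for this example only.
brands = ["a", "a", "a", "b"]
clicks = [1, 1, 0, 0]

mapping, gm = fit_smoothed_target_encoding(brands, clicks, smoothing=2.0)
encoded = apply_target_encoding(["a", "b", "unseen"], mapping, gm)
print(mapping, encoded)
```

With a larger `smoothing` value, rare categories are pulled more strongly toward the global mean. Note that in the cell below the map is fitted on all training rows at once (the `n_splits_te` parameter is accepted but not used), so each training row's encoding includes its own label; an out-of-fold variant would avoid that.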


In [2]:
import os
import numpy as np
import pandas as pd

from pathlib import Path
from scipy import sparse
from joblib import dump

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.metrics import roc_auc_score

from xgboost import XGBClassifier


# ============================================================
# Custom sklearn-compatible transformer
# ============================================================
class CTRPreprocessor(BaseEstimator, TransformerMixin):
    """
    A scikit-learn-compatible transformer that applies the CTR preprocessing steps:
    - drops target/user/time from features
    - adds missing indicators for key cols
    - log1p for pv/cart/fav/buy
    - low-card label encoding
    - high-card smoothed target encoding (fit-time mapping, transform-time mapping)
    - numeric coercion -> impute(median) -> scale(StandardScaler)
    - aligns columns and returns only top_features in the exact order
    """

    def __init__(
        self,
        target_col: str,
        user_col: str,
        time_col: str | None,
        force_categorical: list[str],
        high_card_threshold: int = 30,
        smoothing: float = 50.0,
        n_splits_te: int = 5,
        random_state: int = 42,
        key_missing_cols: list[str] | None = None,
        top_features: list[str] | None = None,
        leakage_dropped_features: list[str] | None = None,
    ):
        self.target_col = target_col
        self.user_col = user_col
        self.time_col = time_col

        self.force_categorical = force_categorical
        self.high_card_threshold = high_card_threshold
        self.smoothing = smoothing
        self.n_splits_te = n_splits_te
        self.random_state = random_state

        self.key_missing_cols = key_missing_cols or []
        self.top_features = top_features  # can be set after fit
        self.leakage_dropped_features = leakage_dropped_features or []

        # learned during fit
        self.categorical_cols_ = None
        self.numeric_cols_ = None
        self.low_card_cols_ = None
        self.high_card_cols_ = None
        self.low_card_mappings_ = {}
        self.high_card_mappings_ = {}
        self.global_mean_ = None

        self.imputer_ = SimpleImputer(strategy="median")
        self.scaler_ = StandardScaler()

        self.feature_names_in_ = None  # final fitted feature columns (after enc + flags, before top selection)

    def _drop_meta_cols(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        # drop target
        if self.target_col in X.columns:
            X = X.drop(columns=[self.target_col], errors="ignore")
        # time only for splitting, not as feature
        if self.time_col is not None and self.time_col in X.columns:
            X = X.drop(columns=[self.time_col], errors="ignore")
        # user removed
        if self.user_col in X.columns:
            X = X.drop(columns=[self.user_col], errors="ignore")
        return X

    def _add_missing_flags(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        for c in self.key_missing_cols:
            if c in X.columns:
                X[f"is_missing_{c}"] = X[c].isna().astype(int)
        return X

    def _apply_log1p(self, X: pd.DataFrame) -> pd.DataFrame:
        X = X.copy()
        for c in ["pv", "cart", "fav", "buy"]:
            if c in X.columns:
                X[c] = pd.to_numeric(X[c], errors="coerce")
                X[c] = np.log1p(X[c])
        return X

    def _infer_types(self, X: pd.DataFrame):
        # force categorical + object/category
        force_cat = [c for c in self.force_categorical if c in X.columns]
        cat_cols = list(X.select_dtypes(include=["object", "category"]).columns)
        for c in force_cat:
            if c not in cat_cols:
                cat_cols.append(c)
        num_cols = [c for c in X.columns if c not in cat_cols]
        return cat_cols, num_cols

    def fit(self, X: pd.DataFrame, y: pd.Series):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("CTRPreprocessor.fit expects a pandas DataFrame X")

        y = pd.Series(y).astype(int).reset_index(drop=True)
        X0 = self._drop_meta_cols(X)
        X0 = self._add_missing_flags(X0)
        X0 = self._apply_log1p(X0)

        self.global_mean_ = float(y.mean())

        # type inference
        cat_cols, num_cols = self._infer_types(X0)
        self.categorical_cols_ = cat_cols

        # cast categorical to string
        X_enc = X0.copy()
        for c in self.categorical_cols_:
            if c in X_enc.columns:
                X_enc[c] = X_enc[c].astype("string")

        # split high/low card
        high_card, low_card = [], []
        for c in self.categorical_cols_:
            if c not in X_enc.columns:
                continue
            nunq = X_enc[c].nunique(dropna=True)
            (high_card if nunq > self.high_card_threshold else low_card).append(c)
        self.high_card_cols_ = high_card
        self.low_card_cols_ = low_card

        # low-card label encoding mapping
        self.low_card_mappings_ = {}
        for c in self.low_card_cols_:
            cats = X_enc[c].dropna().unique().tolist()
            cats = sorted([str(x) for x in cats])
            mapping = {cat: i for i, cat in enumerate(cats)}
            self.low_card_mappings_[c] = mapping
            X_enc[c] = X_enc[c].map(mapping).fillna(-1).astype(int)

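        # high-card target encoding mapping (fit-time full mapping)
        # NOTE: statistics come from all training rows (n_splits_te is not used here),
        # so each training row's encoding includes its own label.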
        self.high_card_mappings_ = {}
        SMOOTH = float(self.smoothing)
        gm = float(self.global_mean_)

        for c in self.high_card_cols_:
            stats_full = pd.DataFrame(
                {c: X0[c].astype("string"), "y": y.values}
            ).groupby(c)["y"].agg(["count", "mean"])

            final_map = {}
            for cat, row in stats_full.iterrows():
                cnt = float(row["count"])
                mu = float(row["mean"])
                enc_val = (cnt * mu + SMOOTH * gm) / (cnt + SMOOTH)
                final_map[str(cat)] = enc_val

            self.high_card_mappings_[c] = final_map
            X_enc[c] = X0[c].astype("string").map(final_map).fillna(gm).astype(float)

        # now all categorical columns become numeric too
        for c in self.categorical_cols_:
            if c not in num_cols:
                num_cols.append(c)

        # include missing flags in numeric
        for c in self.key_missing_cols:
            flag = f"is_missing_{c}"
            if flag in X_enc.columns and flag not in num_cols:
                num_cols.append(flag)

        self.numeric_cols_ = [c for c in num_cols if c in X_enc.columns]

        # numeric coercion
        X_enc[self.numeric_cols_] = X_enc[self.numeric_cols_].apply(pd.to_numeric, errors="coerce").astype(np.float64)

        # fit imputer + scaler
        X_imp = self.imputer_.fit_transform(X_enc[self.numeric_cols_])
        X_scaled = self.scaler_.fit_transform(X_imp)

        X_enc.loc[:, self.numeric_cols_] = X_scaled.astype(np.float64)

        # leakage dropped features are removed
        if self.leakage_dropped_features:
            X_enc = X_enc.drop(columns=self.leakage_dropped_features, errors="ignore")

        # final fitted column set (before selecting top_features)
        self.feature_names_in_ = list(X_enc.columns)

        return self

    def transform(self, X: pd.DataFrame):
        if not isinstance(X, pd.DataFrame):
            raise TypeError("CTRPreprocessor.transform expects a pandas DataFrame X")

        X0 = self._drop_meta_cols(X)
        X0 = self._add_missing_flags(X0)
        X0 = self._apply_log1p(X0)

        X_enc = X0.copy()

        # categorical cast
        for c in (self.categorical_cols_ or []):
            if c in X_enc.columns:
                X_enc[c] = X_enc[c].astype("string")

        # low-card apply
        for c, mapping in (self.low_card_mappings_ or {}).items():
            if c in X_enc.columns:
                X_enc[c] = X_enc[c].map(mapping).fillna(-1).astype(int)

        # high-card apply
        gm = float(self.global_mean_) if self.global_mean_ is not None else 0.0
        for c, mapping in (self.high_card_mappings_ or {}).items():
            if c in X_enc.columns:
                X_enc[c] = X_enc[c].astype("string").map(mapping).fillna(gm).astype(float)

        # ensure numeric cols exist
        for c in (self.numeric_cols_ or []):
            if c not in X_enc.columns:
                X_enc[c] = 0.0

        # numeric coercion
        X_enc[self.numeric_cols_] = X_enc[self.numeric_cols_].apply(pd.to_numeric, errors="coerce").astype(np.float64)

        # impute + scale using fitted objects
        X_imp = self.imputer_.transform(X_enc[self.numeric_cols_])
        X_scaled = self.scaler_.transform(X_imp)
        X_enc.loc[:, self.numeric_cols_] = X_scaled.astype(np.float64)

        # remove leakage features
        if self.leakage_dropped_features:
            X_enc = X_enc.drop(columns=self.leakage_dropped_features, errors="ignore")

        # align columns to fitted schema
        # important: add missing cols and order to match training
        for col in self.feature_names_in_:
            if col not in X_enc.columns:
                X_enc[col] = 0.0
        X_enc = X_enc.reindex(columns=self.feature_names_in_, fill_value=0.0)

        # select top features in exact order
        if self.top_features is not None:
            missing = [c for c in self.top_features if c not in X_enc.columns]
            if missing:
                # keep robust: add missing as zeros
                for c in missing:
                    X_enc[c] = 0.0
            X_enc = X_enc.reindex(columns=self.top_features, fill_value=0.0)

        return sparse.csr_matrix(X_enc.values)


# ============================================================
# Main preprocessing script
# ============================================================
project_root = os.path.abspath(os.path.join(os.getcwd(), ".."))
processed_dir = os.path.join(project_root, "data", "processed")
Path(processed_dir).mkdir(parents=True, exist_ok=True)

fused_path = os.path.join(processed_dir, "fused_data.csv")
if not os.path.exists(fused_path):
    raise FileNotFoundError(f"Missing: {fused_path}\nRun fusion notebook first.")

df = pd.read_csv(fused_path)
print("Loaded fused:", df.shape)

target_col = "label" if "label" in df.columns else ("clk" if "clk" in df.columns else None)
if target_col is None:
    raise ValueError('No target column found ("label" or "clk").')

user_col = next((c for c in ["user", "userid", "nick"] if c in df.columns), None)
if user_col is None:
    raise ValueError('No user ID column found ("user" or "userid" or "nick").')

time_col = next((c for c in ["time_stamp", "timestamp", "date_time", "datetime", "time"] if c in df.columns), None)
print("target:", target_col, "| user:", user_col, "| time:", time_col)

# cap rows
MAX_ROWS = 200_000
if time_col is not None:
    df = df.sort_values(time_col).reset_index(drop=True)
    if len(df) > MAX_ROWS:
        df = df.iloc[-MAX_ROWS:].copy()
        print(f"Temporal tail kept: {MAX_ROWS}")
else:
    if len(df) > MAX_ROWS:
        df = df.sample(n=MAX_ROWS, random_state=42).copy()
        print(f"Random sample kept: {MAX_ROWS}")

print("Working df:", df.shape, "pos_rate:", df[target_col].mean())

# leakage name blacklist
name_blacklist_substrings = [
    "label", "clk", "click", "nonclk", "noclk", "impression", "ctr", "conversion",
    "time_stamp_str", "is_click", "clicked"
]
removed = []
for c in list(df.columns):
    if c == target_col:
        continue
    low = c.lower()
    if any(s in low for s in name_blacklist_substrings):
        if time_col is not None and c == time_col:
            continue
        if c == user_col:
            continue
        df.drop(columns=[c], inplace=True)
        removed.append(c)
if removed:
    print("Removed by name blacklist:", removed)

# split
if time_col is not None:
    df = df.sort_values(time_col).reset_index(drop=True)
    split_idx = int(len(df) * 0.8)
    train_df = df.iloc[:split_idx].copy()
    test_df = df.iloc[split_idx:].copy()
    print("Temporal split cutoff:", df.iloc[split_idx][time_col])
else:
    train_df, test_df = train_test_split(
        df, test_size=0.2, stratify=df[target_col], random_state=42
    )
    print("Stratified split")

print("train:", train_df.shape, "pos:", train_df[target_col].mean())
print("test :", test_df.shape, "pos:", test_df[target_col].mean())

# build X/y (RAW)
def build_xy(sub_df: pd.DataFrame):
    y = sub_df[target_col].astype(int)
    X = sub_df.drop(columns=[target_col], errors="ignore")
    return X, y

X_train_raw, y_train = build_xy(train_df)
X_test_raw, y_test = build_xy(test_df)

# choose forced categorical
force_categorical = [
    "adgroup_id", "pid", "campaign_id", "customer", "cate_id", "brand",
    "cms_segid", "cms_group_id", "final_gender_code", "age_level",
    "shopping_level", "occupation", "pvalue_level", "new_user_class_level",
]
force_categorical = [c for c in force_categorical if c in X_train_raw.columns]

key_missing_cols = [c for c in ["price", "age_level", "final_gender_code", "shopping_level", "pvalue_level", "cms_segid"] if c in X_train_raw.columns]

# The leakage guard runs in the processed numeric space, so first fit a temporary
# transformer with no top_features selection and no leakage drops.
tmp_pre = CTRPreprocessor(
    target_col=target_col,
    user_col=user_col,
    time_col=time_col,
    force_categorical=force_categorical,
    high_card_threshold=30,
    smoothing=50,
    n_splits_te=5,
    random_state=42,
    key_missing_cols=key_missing_cols,
    top_features=None,
    leakage_dropped_features=[]
).fit(pd.concat([X_train_raw, y_train.rename(target_col)], axis=1), y_train)

# Transform the training rows and densify the result so each encoded/scaled column
# can be checked for leakage individually.
X_train_tmp = tmp_pre.transform(pd.concat([X_train_raw, y_train.rename(target_col)], axis=1))
X_train_tmp = X_train_tmp.toarray()
tmp_cols = [f"f{i}" for i in range(X_train_tmp.shape[1])]
X_train_tmp_df = pd.DataFrame(X_train_tmp, columns=tmp_cols)

# LEAKAGE GUARD: flag columns that equal y exactly (or inverted) after rounding,
# or that reach a single-feature AUC above 0.999 on a held-out slice of train.
leak_cols_exact = []
y_arr = y_train.values.astype(int)

for j, c in enumerate(X_train_tmp_df.columns):
    col = X_train_tmp_df[c].values.astype(float)
    if np.all(np.isfinite(col)):
        col_bin = np.round(col).astype(int)
        if col_bin.shape[0] == y_arr.shape[0]:
            if np.array_equal(col_bin, y_arr) or np.array_equal(1 - col_bin, y_arr):
                leak_cols_exact.append(c)

Xtr_sub, Xva_sub, ytr_sub, yva_sub = train_test_split(
    X_train_tmp_df, y_train, test_size=0.2, stratify=y_train, random_state=42
)

leak_auc_rows, leak_cols_auc = [], []
for c in X_train_tmp_df.columns:
    x = Xva_sub[c].values.astype(float)
    if np.nanstd(x) < 1e-12:
        continue
    try:
        auc = roc_auc_score(yva_sub, x)
        auc = max(auc, 1.0 - auc)
        leak_auc_rows.append((c, float(auc)))
        if auc > 0.999:
            leak_cols_auc.append(c)
    except Exception:
        continue

leakage_report = pd.DataFrame(leak_auc_rows, columns=["feature", "single_feature_auc"]).sort_values("single_feature_auc", ascending=False)
leak_report_path = os.path.join(processed_dir, "leakage_report.csv")
leakage_report.to_csv(leak_report_path, index=False)
print("Saved leakage report:", leak_report_path)

leak_all = sorted(set(leak_cols_exact + leak_cols_auc))
print("[LEAKAGE GUARD] tmp feature drops:", len(leak_all))

# IMPORTANT:
# The leak feature names f0..fN are column positions in the temporary matrix, not original column names.
# final_pre below is fitted with the same settings on the same data, so its output has the same column order.
# We therefore drop the flagged positions AFTER transform, identically for train and test.
leak_idx = sorted([int(c.replace("f", "")) for c in leak_all])
print("[LEAKAGE GUARD] drop indices:", leak_idx[:20], "..." if len(leak_idx) > 20 else "")

# Fit final transformer again (clean) — this is the one we will save
final_pre = CTRPreprocessor(
    target_col=target_col,
    user_col=user_col,
    time_col=time_col,
    force_categorical=force_categorical,
    high_card_threshold=30,
    smoothing=50,
    n_splits_te=5,
    random_state=42,
    key_missing_cols=key_missing_cols,
    top_features=None,
    leakage_dropped_features=[]
)
final_pre.fit(pd.concat([X_train_raw, y_train.rename(target_col)], axis=1), y_train)

# Transform train/test
X_train_mat = final_pre.transform(pd.concat([X_train_raw, y_train.rename(target_col)], axis=1))
X_test_mat = final_pre.transform(pd.concat([X_test_raw, y_test.rename(target_col)], axis=1))

# apply leakage index drops
if leak_idx:
    leak_set = set(leak_idx)  # build the set once instead of once per column
    keep_idx = [i for i in range(X_train_mat.shape[1]) if i not in leak_set]
    X_train_mat = X_train_mat[:, keep_idx]
    X_test_mat = X_test_mat[:, keep_idx]

print("After leakage guard matrices:", X_train_mat.shape, X_test_mat.shape)

# feature importance, computed on the matrix columns (named f0..fN by position)
pos = int(y_train.sum())
neg = int(len(y_train) - pos)
scale_pos_weight = (neg / pos) if pos > 0 else 1.0
print("scale_pos_weight:", scale_pos_weight)

importance_model = XGBClassifier(
    n_estimators=600,
    learning_rate=0.05,
    max_depth=6,
    subsample=0.8,
    colsample_bytree=0.8,
    eval_metric="logloss",
    random_state=42,
    n_jobs=-1,
    scale_pos_weight=scale_pos_weight,
    tree_method="hist",
)
importance_model.fit(X_train_mat, y_train)

imp = importance_model.feature_importances_
imp_df = pd.DataFrame({"feature": [f"f{i}" for i in range(X_train_mat.shape[1])], "importance": imp}).sort_values("importance", ascending=False)
imp_path = os.path.join(processed_dir, "feature_importances.csv")
imp_df.to_csv(imp_path, index=False)
print("Saved importances:", imp_path)

top_features = imp_df["feature"].tolist()

# reorder columns by descending importance; all features are kept, none are dropped
order_idx = [int(f.replace("f", "")) for f in top_features]
X_train_final = X_train_mat[:, order_idx]
X_test_final = X_test_mat[:, order_idx]

# Record how to turn final_pre.transform() output into the saved matrices:
# first drop the leakage columns, then reorder the remaining columns.
# transform() itself does neither, so downstream code (e.g. the Streamlit app)
# must apply both steps, in this order, using the attributes below.
final_pre.top_features = None
final_pre.leak_drop_idx_ = leak_idx       # column indices removed by the leakage guard
final_pre.feature_order_idx_ = order_idx  # importance order, indexed after the leakage drop

# ============================================================
# Save outputs
# ============================================================
X_train_path = os.path.join(processed_dir, "X_train_processed.npz")
X_test_path = os.path.join(processed_dir, "X_test_processed.npz")
y_train_path = os.path.join(processed_dir, "y_train.csv")
y_test_path = os.path.join(processed_dir, "y_test.csv")
top_features_path = os.path.join(processed_dir, "top_features.csv")
scale_pos_weight_path = os.path.join(processed_dir, "scale_pos_weight.txt")
preprocessor_path = os.path.join(processed_dir, "preprocessor.joblib")

sparse.save_npz(X_train_path, X_train_final.tocsr())
sparse.save_npz(X_test_path, X_test_final.tocsr())

pd.Series(y_train, name=target_col).to_csv(y_train_path, index=False)
pd.Series(y_test, name=target_col).to_csv(y_test_path, index=False)

with open(scale_pos_weight_path, "w") as f:
    f.write(str(scale_pos_weight))

pd.DataFrame({"feature": [f"f{i}" for i in range(X_train_final.shape[1])] }).to_csv(top_features_path, index=False)

# Save the fitted transformer so the Streamlit app can preprocess raw rows
dump(final_pre, preprocessor_path)

# Optional: save raw test rows so Streamlit can pick random raw row and compare
raw_test_path = os.path.join(processed_dir, "X_test_raw.csv")
X_test_raw_with_target = X_test_raw.copy()
X_test_raw_with_target[target_col] = y_test.values
X_test_raw_with_target.to_csv(raw_test_path, index=False)

print("\nSaved:")
print(" -", X_train_path)
print(" -", X_test_path)
print(" -", y_train_path)
print(" -", y_test_path)
print(" -", top_features_path)
print(" -", scale_pos_weight_path)
print(" -", preprocessor_path, " (Transformer ✅)")
print(" -", raw_test_path, " (Optional raw test rows ✅)")
print("\n[Preprocessing done with transformer + leakage guard indices.]")


Loaded fused: (50000, 20)
target: label | user: user | time: time_stamp
Working df: (50000, 20) pos_rate: 0.47594
Removed by name blacklist: ['clk', 'time_stamp_str']
Temporal split cutoff: 1498031035
train: (40000, 18) pos: 0.493625
test : (10000, 18) pos: 0.4052
Saved leakage report: d:\projects\Ai\project_fusion_ecu\data\processed\leakage_report.csv
[LEAKAGE GUARD] tmp feature drops: 1
[LEAKAGE GUARD] drop indices: [1] 
After leakage guard matrices: (40000, 14) (10000, 14)
scale_pos_weight: 1.0258293238794631
Saved importances: d:\projects\Ai\project_fusion_ecu\data\processed\feature_importances.csv

Saved:
 - d:\projects\Ai\project_fusion_ecu\data\processed\X_train_processed.npz
 - d:\projects\Ai\project_fusion_ecu\data\processed\X_test_processed.npz
 - d:\projects\Ai\project_fusion_ecu\data\processed\y_train.csv
 - d:\projects\Ai\project_fusion_ecu\data\processed\y_test.csv
 - d:\projects\Ai\project_fusion_ecu\data\processed\top_features.csv
 - d:\projects\Ai\project_fusion_ecu\da

### Modifications Summary
This notebook has been updated to remove SMOTE-based oversampling and One-Hot encoding.
Instead, class imbalance is handled by computing `scale_pos_weight` (negative/positive ratio) on the training data.
Categorical features are encoded using **Target Encoding** for high-cardinality columns and **Label Encoding** for low-cardinality columns.
Numeric features are imputed with the median and standardised.
Feature importances are computed with an XGBoost model that uses the calculated `scale_pos_weight`; all features are kept and reordered by descending importance (the code above applies no top-N cut).
The processed matrices, label files, feature order, feature importances, leakage report, `scale_pos_weight` value, raw test rows, and the fitted preprocessor (`preprocessor.joblib`) are saved to the `data/processed` directory for subsequent notebooks.
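
As a small worked example of the `scale_pos_weight` ratio described above, the sketch below computes negatives divided by positives with the same fallback to 1.0 that the notebook uses when there are no positives. The helper name is an illustrative assumption. The second call rebuilds the training label counts implied by the logged output (40000 training rows at a positive rate of 0.493625, i.e. 19745 positives) to show where the logged weight comes from.

```python
def compute_scale_pos_weight(labels):
    """Return neg/pos for binary 0/1 labels, or 1.0 if there are no positives."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    return neg / pos if pos > 0 else 1.0


# Toy labels, invented for this example: 2 positives, 6 negatives.
toy_weight = compute_scale_pos_weight([1, 0, 0, 0, 1, 0, 0, 0])

# Label counts implied by the training log: 19745 positives, 20255 negatives.
logged_weight = compute_scale_pos_weight([1] * 19745 + [0] * 20255)

# Degenerate case: no positives at all.
fallback_weight = compute_scale_pos_weight([0, 0, 0])

print(toy_weight, logged_weight, fallback_weight)
```

Because the training set here is close to balanced, the weight stays near 1 and has little effect; on a heavily imbalanced set it would up-weight the positive class proportionally.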
