# 🔎 Model Selection (CatBoost + Comparators)

**Purpose:** Build a *stable*, **reproducible**, and **config-driven** model selection notebook that is safe to promote toward production.

> This notebook runs on SageMaker or locally. It loads prepared features (e.g., from **Redshift**, **Postgres**, or **S3 Parquet** via `data_io.py`),
applies deterministic splits, trains a small **model zoo** (CatBoost baseline + LightGBM + XGBoost + Logistic Regression), and
selects the **best model by Recall** (with a target recall threshold), using PR-AUC and ROC-AUC as tie-breakers.


## 📦 What You Get
- Config-first data loader (`data_io.load_data()` if present; fallback to **SQL** via SQLAlchemy or **synthetic demo**)  
- Cleaning & minimal feature handling (categoricals + datetime)  
- **Leakage-aware**, deterministic **time split** (or stratified split)  
- **Model zoo**: CatBoost (baseline), LightGBM, XGBoost, Logistic Regression  
- **Threshold tuning** for target recall (e.g., 80%) and comparison table  
- Export **artifacts**: `models/<name>_best.(cbm|joblib)`, `metrics.json`, `thresholds.json`  
- Optional **MLflow** run logging


## 🧰 Prerequisites
- Python 3.9+
- Packages: `pandas`, `numpy`, `pyarrow`, `scikit-learn`, `catboost`, `lightgbm`, `xgboost`, `sqlalchemy`, `psycopg` (Postgres driver used by the SQL fallback), `redshift_connector`, `s3fs`, `mlflow` (optional)
- Optional: a `data_io.py` next to this notebook with a `load_data(...)` API.

```python
# If running on a fresh environment (SageMaker usually has most of these):
# %pip install pandas numpy pyarrow scikit-learn catboost lightgbm xgboost mlflow sqlalchemy psycopg redshift_connector s3fs
```


## ♻️ Reproducibility & Environment Capture
- **Fixed seeds** for determinism
- Save **package versions** and run metadata
- Unique **artifact run folder**
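
The cell below records only the `pandas` and `numpy` versions. As a minimal sketch of capturing the rest of the stack, the helper here reads installed versions through the standard library's `importlib.metadata`. The function name `collect_package_versions` and the `PACKAGES` list are illustrative assumptions, not part of the notebook.

```python
from importlib import metadata


def collect_package_versions(names):
    """Return {distribution name: version string, or None if not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


# Hypothetical list; adjust to the packages the run actually imports.
PACKAGES = ["pandas", "numpy", "scikit-learn", "catboost", "lightgbm", "xgboost", "mlflow"]
package_versions = collect_package_versions(PACKAGES)
print(package_versions)
```

The resulting dictionary could be stored under `env_info["packages"]` before `env_info.json` is written.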


In [None]:
import os, sys, json, hashlib, platform, random
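from datetime import datetime, timezone
import numpy as np
import pandas as pd

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp instead
RUN_TS = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")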
RUN_ID = hashlib.sha1(f"{RUN_TS}-{SEED}".encode()).hexdigest()[:10]

ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", f"artifacts/run_{RUN_TS}_{RUN_ID}")
os.makedirs(ARTIFACT_DIR, exist_ok=True)

env_info = {
    "python": sys.version,
    "platform": platform.platform(),
    "timestamp_utc": RUN_TS,
    "seed": SEED,
    "packages": {
        "pandas": pd.__version__,
        "numpy": np.__version__,
    },
}
with open(os.path.join(ARTIFACT_DIR, "env_info.json"), "w") as f:
    json.dump(env_info, f, indent=2)
env_info

## ⚙️ Configuration
Edit **only here** to switch sources and behavior.


In [None]:
from pathlib import Path

CONFIG = {
    "data": {
        # choose one of: 'redshift' | 'parquet' | 'postgres' | 'synthetic'
        "source": os.environ.get("SOURCE", "postgres"),
        "parquet_uri": os.environ.get("PARQUET_URI", "s3://your-bucket/path/to/data/*.parquet"),
        "sql": os.environ.get("SQL", "SELECT * FROM public.final_feature_snapshot"),
        # Redshift
        "redshift_kwargs": {
            "host": os.environ.get("REDSHIFT_HOST", "example.redshift.amazonaws.com"),
            "database": os.environ.get("REDSHIFT_DB", "dev"),
            "user": os.environ.get("REDSHIFT_USER", "username"),
            "password": os.environ.get("REDSHIFT_PASSWORD", "password"),
            "port": int(os.environ.get("REDSHIFT_PORT", "5439")),
        },
        # Postgres (psycopg3, SQLAlchemy URL)
        "pg": {
            "user": os.getenv("PGUSER", "postgres"),
            "password": os.getenv("PGPASSWORD", "postgres"),
            "host": os.getenv("PGHOST", "localhost"),
            "port": os.getenv("PGPORT", "5432"),
            "db": os.getenv("PGDATABASE", "testdb"),
        },
        "row_limit": int(os.environ.get("ROW_LIMIT", "0")) or None,
    },
    "columns": {
        "target": os.environ.get("TARGET", "churn"),
        "primary_key": os.environ.get("PRIMARY_KEY", "codigocontaservico"),
        # date column for time-based split (falls back to stratified)
        "time_col": os.environ.get("TIME_COL", "iddim_date_inicio"),
        # columns known to be leakage or pure identifiers
        "drop_cols": [
            "idconsumo","id_contaservico","codigocontaservico","idconta","iddim_cliente",
            "idcliente","codigocliente","iddim_conta","codigoconta",
            "idgrupo_dim_contadimensao","iddim_contaservico_dth","idcontaservico"
        ],
    },
    "split": {
        "test_size": 0.2,
        "val_size": 0.1,  # not used by the model zoo (we’ll use train/valid split only)
        "recall_target": float(os.environ.get("RECALL_TARGET", "0.80")),
        "stratify": True,
    },
    "models": {
        "catboost": {"iterations": 2000, "early_stopping_rounds": 200, "depth": 6, "learning_rate": 0.08, "l2_leaf_reg": 3.0, "task_type": os.environ.get("CAT_TASK_TYPE", "CPU"), "auto_class_weights": "Balanced", "verbose": 200},
        "lightgbm": {"n_estimators": 2000, "learning_rate": 0.05, "num_leaves": 64, "min_child_samples": 40, "subsample": 0.8, "colsample_bytree": 0.8, "reg_lambda": 1.0, "random_state": SEED},
        "xgboost":  {"n_estimators": 2000, "learning_rate": 0.05, "max_depth": 6, "subsample": 0.8, "colsample_bytree": 0.8, "reg_lambda": 1.0, "random_state": SEED, "eval_metric": "auc"},
        "logreg":   {"max_iter": 1000, "class_weight": "balanced", "solver": "liblinear", "random_state": SEED},
    },
    "output": {
        "artifact_dir": ARTIFACT_DIR,
        "processed_parquet_path": str(Path(ARTIFACT_DIR) / "processed" / "dataset.parquet"),
        "metrics_path": str(Path(ARTIFACT_DIR) / "metrics.json"),
        "thresholds_path": str(Path(ARTIFACT_DIR) / "thresholds.json"),
        "model_dir": str(Path(ARTIFACT_DIR) / "models"),
    },
    "mlflow": {
        "enabled": False,
        "tracking_uri": os.environ.get("MLFLOW_TRACKING_URI", ""),
        "experiment_name": os.environ.get("MLFLOW_EXPERIMENT", "model-selection"),
    },
}
CONFIG
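
Most settings above read environment variables, so overrides have to be set before the configuration cell runs. The sketch below isolates the `int(...) or None` idiom used for `row_limit`, where `0` or an unset variable means "no limit". The helper `env_int_or_none` and the variable name `DEMO_ROW_LIMIT` are made up for the demonstration.

```python
import os


def env_int_or_none(name, default="0"):
    """Read an integer environment variable; 0 (or unset) becomes None."""
    return int(os.environ.get(name, default)) or None


# DEMO_ROW_LIMIT is a throwaway name so the real ROW_LIMIT is left untouched.
os.environ.pop("DEMO_ROW_LIMIT", None)
unset_limit = env_int_or_none("DEMO_ROW_LIMIT")  # variable absent -> None

os.environ["DEMO_ROW_LIMIT"] = "5000"
set_limit = env_int_or_none("DEMO_ROW_LIMIT")    # variable present -> 5000

os.environ.pop("DEMO_ROW_LIMIT", None)           # leave the environment clean
print(unset_limit, set_limit)
```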

## 📥 Load Data
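Attempts `data_io.load_data(...)` first and, when it imports, uses it for every source. If `data_io` is not importable, only **Postgres** is loaded directly (from `CONFIG['data']['pg']` and `CONFIG['data']['sql']`).
Any other source, or any load error, falls back to a **synthetic** churn-like dataset so the rest of the notebook stays testable. Check the printed messages to confirm which path actually ran.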
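
The notebook does not include `data_io.py`. The sketch below is one hypothetical shape for it, matching only the keyword arguments passed in the next cell (`source`, `uri`, `sql`, `redshift_kwargs`). The dispatch logic and error handling are assumptions, not the project's actual helper.

```python
import pandas as pd


def load_data(source, uri=None, sql=None, redshift_kwargs=None):
    """Hypothetical loader: dispatch on `source` and return a DataFrame."""
    source = source.lower()
    if source == "parquet":
        # pandas delegates to pyarrow; s3:// URIs additionally need s3fs
        return pd.read_parquet(uri)
    if source == "redshift":
        import redshift_connector  # lazy import so other sources work without it

        conn = redshift_connector.connect(**(redshift_kwargs or {}))
        try:
            cursor = conn.cursor()
            cursor.execute(sql)
            return cursor.fetch_dataframe()
        finally:
            conn.close()
    raise ValueError(f"Unsupported source: {source!r}")
```

With a helper like this in place, `SOURCE=postgres` would raise `ValueError` and end in the synthetic fallback, because the direct Postgres branch only runs when `data_io` cannot be imported. A real helper would need its own Postgres branch.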


In [None]:
load_data = None
try:
    from data_io import load_data  # optional helper
except Exception as e:
    print("⚠️ data_io.load_data not available:", repr(e))

def _demo_dataset(n=20000, seed=SEED):
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "codigocontaservico": np.arange(1, n+1),
        "iddim_date_inicio": pd.to_datetime("2023-01-01") + pd.to_timedelta(rng.integers(0, 650, size=n), unit="D"),
        "tenure_days": rng.integers(30, 800, size=n),
        "expiry_month": rng.integers(1, 13, size=n),
        "expiry_dow": rng.integers(0, 7, size=n),
        "tipo_produto_atual": rng.choice(["normal","premium"], size=n, p=[0.85,0.15]),
        "topup_total_value": rng.gamma(2.0, 30.0, size=n).round(2),
        "municipio": rng.choice(["luanda","lubango","viana"], size=n),
        "past_churns": rng.poisson(0.2, size=n),
        "n_prev_contracts": rng.integers(0, 5, size=n),
        "target_proxy": rng.normal(0,1,size=n)
    })
    logits = -1.2 + 0.003*(df["tenure_days"]) + 0.4*(df["tipo_produto_atual"]=="premium").astype(int) - 0.0008*df["topup_total_value"] + 0.35*df["past_churns"]
    p = 1/(1+np.exp(-logits))
    df["churn"] = (rng.random(size=n) < p).astype(int)
    return df

source = CONFIG["data"]["source"].lower()
df_raw = None

try:
    if load_data is not None:
        print(f"Loading via data_io.load_data(source={source}) …")
        df_raw = load_data(
            source=source,
            uri=CONFIG["data"]["parquet_uri"],
            sql=CONFIG["data"]["sql"],
            redshift_kwargs=CONFIG["data"]["redshift_kwargs"],
        )
    elif source == "postgres":
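        from sqlalchemy import create_engine, text
        from sqlalchemy.engine import URL
        pg = CONFIG["data"]["pg"]
        # URL.create escapes special characters in credentials; the "psycopg" (v3)
        # driver name requires SQLAlchemy 2.0+
        url = URL.create("postgresql+psycopg", username=pg["user"], password=pg["password"],
                         host=pg["host"], port=int(pg["port"]), database=pg["db"])
        engine = create_engine(url)
        # Read-only query, so a plain connection is enough (no transaction block needed)
        with engine.connect() as conn:
            df_raw = pd.read_sql_query(text(CONFIG["data"]["sql"]), conn)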
            
    if df_raw is None:
        print("Using synthetic demo dataset …")
        df_raw = _demo_dataset()
except Exception as e:
    print("⚠️ Failed to load from configured source:", repr(e))
    print("Using synthetic demo dataset …")
    df_raw = _demo_dataset()

row_limit = CONFIG["data"]["row_limit"]
if row_limit and len(df_raw) > row_limit:
    pk = CONFIG["columns"]["primary_key"]
    if pk in df_raw.columns:
        df_raw = df_raw.sort_values(pk).head(row_limit).reset_index(drop=True)
    else:
        df_raw = df_raw.sample(n=row_limit, random_state=SEED).reset_index(drop=True)

df_raw.head(3), df_raw.shape

## 🔎 Quick Profile

In [None]:
pd.DataFrame({
    "column": df_raw.columns,
    "dtype": df_raw.dtypes.astype(str).values,
    "nulls": [df_raw[c].isna().sum() for c in df_raw.columns],
    "non_nulls": [df_raw[c].notna().sum() for c in df_raw.columns],
}).head(30)

## 🧼 Minimal Cleaning & Feature Handling

In [None]:
target_col = CONFIG["columns"]["target"]
primary_key = CONFIG["columns"]["primary_key"]
time_col = CONFIG["columns"]["time_col"]
drop_cols = list(set(CONFIG["columns"]["drop_cols"] + [target_col]))

df = df_raw.copy()

# Convert pandas timedeltas (if any) to days
for c in df.columns:
    if str(df[c].dtype).startswith("timedelta"):
        df[c] = df[c].dt.total_seconds() / 86400

# Ensure time column is datetime if present
if time_col in df.columns:
    df[time_col] = pd.to_datetime(df[time_col], errors="coerce")

# Basic NaN handling for categoricals
cat_cols = [c for c in df.columns if df[c].dtype == "object" or str(df[c].dtype) == "category"]
for c in cat_cols:
    df[c] = df[c].astype(object).where(~pd.isna(df[c]), "<MISSING>")

df.head(3)
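
The two transformations in this cell are easiest to check on a toy frame. The sketch below applies the same expressions to made-up data (the column names `gap` and `plan` are placeholders): timedeltas become fractional days and missing categoricals become the literal `"<MISSING>"`.

```python
import pandas as pd

# Toy frame with one timedelta column and one categorical column, each with a gap.
toy = pd.DataFrame({
    "gap": pd.to_timedelta(["1 days 12:00:00", None]),
    "plan": ["premium", None],
})

# Same expression as the cleaning cell: seconds -> days (NaT becomes NaN).
toy["gap"] = toy["gap"].dt.total_seconds() / 86400

# Same expression as the cleaning cell: fill missing categoricals with a sentinel.
toy["plan"] = toy["plan"].astype(object).where(~pd.isna(toy["plan"]), "<MISSING>")

print(toy)
```

Numeric gaps are left as `NaN` on purpose; the tree models tolerate them, while the linear baseline needs its own handling.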

## ✂️ Deterministic Split (Time-based preferred, else stratified)

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit, train_test_split

y = df[target_col].astype(int).values if target_col in df.columns else None
X = df.drop(columns=[c for c in drop_cols if c in df.columns], errors="ignore")

def build_split_masks(df, y, time_col):
    test_size = CONFIG['split']['test_size']
    if time_col in df.columns:
        dates = pd.to_datetime(df[time_col], errors="coerce")
        if dates.notna().mean() > 0.8:
            # Cutoff follows CONFIG['split']['test_size'] instead of a hard-coded 0.8
            cutoff = dates.quantile(1 - test_size)
            print("cutoff_date:", cutoff)
            # NaT compares as False, so rows without a parseable date land in validation
            train_mask = dates < cutoff
            valid_mask = ~train_mask
            return train_mask, valid_mask
    # fallback: stratified split (time column missing or mostly unparseable)
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=SEED)
    train_idx, valid_idx = next(splitter.split(df, y))
    train_mask = pd.Series(False, index=df.index)
    train_mask.iloc[train_idx] = True
    valid_mask = ~train_mask
    return train_mask, valid_mask

train_mask, valid_mask = build_split_masks(df, y, time_col)
sum(train_mask), sum(valid_mask)
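
A time-based split is only leakage-aware if no training row is dated on or after the earliest validation row. The sketch below reproduces the quantile cutoff on ten synthetic daily dates and asserts that property. The helper name `time_split_masks` and the toy dates are illustrative assumptions.

```python
import pandas as pd


def time_split_masks(dates, train_fraction=0.8):
    """Split by a date quantile: rows strictly before the cutoff go to train."""
    dates = pd.to_datetime(dates, errors="coerce")
    cutoff = dates.quantile(train_fraction)
    train_mask = dates < cutoff
    return train_mask, ~train_mask, cutoff


# Ten consecutive days; no missing dates in this toy example.
dates = pd.Series(pd.date_range("2023-01-01", periods=10, freq="D"))
train_mask, valid_mask, cutoff = time_split_masks(dates)

# Leakage check: every training date precedes every validation date.
assert dates[train_mask].max() < dates[valid_mask].min()
print("cutoff:", cutoff, "| train rows:", int(train_mask.sum()), "| valid rows:", int(valid_mask.sum()))
```

The same assertion can be run on the real masks whenever the time column is present.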

## 🤖 Model Zoo & Recall-First Evaluation

In [None]:
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve, classification_report, confusion_matrix

def best_f1_threshold(y_true, y_scores):
    prec, rec, thr = precision_recall_curve(y_true, y_scores)
    f1 = (2 * prec[:-1] * rec[:-1]) / (prec[:-1] + rec[:-1] + 1e-12)
    i = int(np.argmax(f1))
    return float(thr[i]), float(f1[i]), float(prec[i]), float(rec[i])

def recall_target_threshold(y_true, y_scores, target=0.80):
    prec, rec, thr = precision_recall_curve(y_true, y_scores)
    idx = np.where(rec[:-1] >= target)[0]
    if len(idx):
        i = int(idx[-1])  # highest threshold with >= target recall
        return float(thr[i]), float(prec[i]), float(rec[i])
    return 0.0, float(prec[0]), float(rec[0])

def prep_data_for_tree_models(X):
    # Drop datetime columns; cast categoricals to str (CatBoost consumes strings directly)
    X2 = X.copy()
    # is_datetime64_any_dtype also covers tz-aware columns and, unlike np.issubdtype,
    # does not raise on pandas extension dtypes
    dt_cols = [c for c in X2.columns if pd.api.types.is_datetime64_any_dtype(X2[c])]
    if dt_cols:
        X2 = X2.drop(columns=dt_cols)
    for c in X2.columns:
        if X2[c].dtype == "object" or str(X2[c].dtype) == "category":
            X2[c] = X2[c].astype(str)
    return X2

X2 = prep_data_for_tree_models(X)
X_train, X_valid = X2.loc[train_mask], X2.loc[valid_mask]
y_train, y_valid = y[train_mask], y[valid_mask]

results = []
thresholds = {}
model_objects = {}

# --- CatBoost ---
try:
    from catboost import CatBoostClassifier, Pool
    cat_cols = [c for c in X_train.columns if X_train[c].dtype == 'object']
    cat_idx = X_train.columns.get_indexer(cat_cols).tolist()
    train_pool = Pool(X_train, y_train, cat_features=cat_idx)
    valid_pool = Pool(X_valid, y_valid, cat_features=cat_idx)
    params = CONFIG['models']['catboost']
    cb = CatBoostClassifier(
        iterations=params['iterations'],
        learning_rate=params['learning_rate'],
        depth=params['depth'],
        l2_leaf_reg=params['l2_leaf_reg'],
        loss_function='Logloss',
        eval_metric='AUC',
        random_seed=SEED,
        auto_class_weights=params.get('auto_class_weights','Balanced'),
        task_type=params.get('task_type','CPU'),
        verbose=params.get('verbose', 200)
    )
    cb.fit(train_pool, eval_set=valid_pool, use_best_model=True, early_stopping_rounds=params['early_stopping_rounds'])
    valid_proba = cb.predict_proba(valid_pool)[:,1]
    roc = roc_auc_score(y_valid, valid_proba)
    pr = average_precision_score(y_valid, valid_proba)
    thr_f1, best_f1, p_f1, r_f1 = best_f1_threshold(y_valid, valid_proba)
    thr_rec, p80, r80 = recall_target_threshold(y_valid, valid_proba, CONFIG['split']['recall_target'])
    results.append({"model":"catboost","roc_auc":roc,"pr_auc":pr,"best_f1":best_f1,"precision_at_best_f1":p_f1,"recall_at_best_f1":r_f1,"precision_at_recall_target":p80,"recall_at_recall_target":r80})
    thresholds['catboost'] = {"best_f1_threshold":thr_f1, "recall_target_threshold":thr_rec}
    model_objects['catboost'] = cb
except Exception as e:
    print("⚠️ CatBoost unavailable:", repr(e))

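# --- Shared encoding for LightGBM / XGBoost ---
# Both libraries reject object (string) columns, so string categoricals are
# ordinal-encoded. The codes are treated as ordered numbers, a simplification
# compared with CatBoost's native categorical handling.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

obj_cols = [c for c in X_train.columns if not pd.api.types.is_numeric_dtype(X_train[c])]
gbm_encoder = ColumnTransformer(
    [("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), obj_cols)],
    remainder="passthrough", verbose_feature_names_out=False,
).set_output(transform="pandas")  # set_output requires scikit-learn >= 1.2
Xt_train = gbm_encoder.fit_transform(X_train)
Xt_valid = gbm_encoder.transform(X_valid)

# --- LightGBM ---
try:
    import lightgbm as lgb
    lgbm = lgb.LGBMClassifier(**CONFIG['models']['lightgbm'])
    # fit(verbose=...) was removed in LightGBM 4.0; a callback silences logging instead
    lgbm.fit(Xt_train, y_train, eval_set=[(Xt_valid, y_valid)], eval_metric='auc',
             callbacks=[lgb.log_evaluation(period=0)])
    lgbm_pipe = make_pipeline(gbm_encoder, lgbm)  # both steps are already fitted; accepts raw X
    valid_proba = lgbm_pipe.predict_proba(X_valid)[:,1]
    roc = roc_auc_score(y_valid, valid_proba)
    pr = average_precision_score(y_valid, valid_proba)
    thr_f1, best_f1, p_f1, r_f1 = best_f1_threshold(y_valid, valid_proba)
    thr_rec, p80, r80 = recall_target_threshold(y_valid, valid_proba, CONFIG['split']['recall_target'])
    results.append({"model":"lightgbm","roc_auc":roc,"pr_auc":pr,"best_f1":best_f1,"precision_at_best_f1":p_f1,"recall_at_best_f1":r_f1,"precision_at_recall_target":p80,"recall_at_recall_target":r80})
    thresholds['lightgbm'] = {"best_f1_threshold":thr_f1, "recall_target_threshold":thr_rec}
    model_objects['lightgbm'] = lgbm_pipe
except Exception as e:
    print("⚠️ LightGBM unavailable:", repr(e))

# --- XGBoost ---
try:
    from xgboost import XGBClassifier
    xgb = XGBClassifier(**CONFIG['models']['xgboost'])
    xgb.fit(Xt_train, y_train, eval_set=[(Xt_valid, y_valid)], verbose=False)
    xgb_pipe = make_pipeline(gbm_encoder, xgb)  # both steps are already fitted; accepts raw X
    valid_proba = xgb_pipe.predict_proba(X_valid)[:,1]
    roc = roc_auc_score(y_valid, valid_proba)
    pr = average_precision_score(y_valid, valid_proba)
    thr_f1, best_f1, p_f1, r_f1 = best_f1_threshold(y_valid, valid_proba)
    thr_rec, p80, r80 = recall_target_threshold(y_valid, valid_proba, CONFIG['split']['recall_target'])
    results.append({"model":"xgboost","roc_auc":roc,"pr_auc":pr,"best_f1":best_f1,"precision_at_best_f1":p_f1,"recall_at_best_f1":r_f1,"precision_at_recall_target":p80,"recall_at_recall_target":r80})
    thresholds['xgboost'] = {"best_f1_threshold":thr_f1, "recall_target_threshold":thr_rec}
    model_objects['xgboost'] = xgb_pipe
except Exception as e:
    print("⚠️ XGBoost unavailable:", repr(e))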

# --- Logistic Regression (baseline linear) ---
try:
    from sklearn.preprocessing import OneHotEncoder, StandardScaler
    from sklearn.impute import SimpleImputer
    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    num_cols = [c for c in X_train.columns if pd.api.types.is_numeric_dtype(X_train[c])]
    cat_cols = [c for c in X_train.columns if c not in num_cols]
    # LogisticRegression rejects NaN and converges poorly on unscaled inputs,
    # so numeric columns are median-imputed and standardized
    num_pipe = Pipeline([("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())])
    pre = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", num_pipe, num_cols),
    ])
    logreg = Pipeline([
        ("pre", pre),
        ("clf", LogisticRegression(**CONFIG['models']['logreg']))
    ])
    logreg.fit(X_train, y_train)
    valid_proba = logreg.predict_proba(X_valid)[:,1]
    roc = roc_auc_score(y_valid, valid_proba)
    pr = average_precision_score(y_valid, valid_proba)
    thr_f1, best_f1, p_f1, r_f1 = best_f1_threshold(y_valid, valid_proba)
    thr_rec, p80, r80 = recall_target_threshold(y_valid, valid_proba, CONFIG['split']['recall_target'])
    results.append({"model":"logreg","roc_auc":roc,"pr_auc":pr,"best_f1":best_f1,"precision_at_best_f1":p_f1,"recall_at_best_f1":r_f1,"precision_at_recall_target":p80,"recall_at_recall_target":r80})
    thresholds['logreg'] = {"best_f1_threshold":thr_f1, "recall_target_threshold":thr_rec}
    model_objects['logreg'] = logreg
except Exception as e:
    print("⚠️ Logistic Regression unavailable:", repr(e))

pd.DataFrame(results).sort_values(["recall_at_recall_target","pr_auc","roc_auc"], ascending=[False, False, False]).reset_index(drop=True)
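
The recall-target rule used in this cell picks the highest threshold whose recall still meets the target. The sketch below repeats that function on ten made-up scores so the choice can be checked by hand; the toy labels and scores are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def recall_target_threshold(y_true, y_scores, target=0.80):
    """Highest threshold whose recall is still >= target (same rule as above)."""
    prec, rec, thr = precision_recall_curve(y_true, y_scores)
    idx = np.where(rec[:-1] >= target)[0]
    if len(idx):
        i = int(idx[-1])
        return float(thr[i]), float(prec[i]), float(rec[i])
    return 0.0, float(prec[0]), float(rec[0])


# Six positives; catching five of them (recall 5/6) is the least that meets 0.80.
y_true = np.array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1])
y_scores = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])

thr, prec, rec = recall_target_threshold(y_true, y_scores, target=0.80)
y_pred = (y_scores >= thr).astype(int)
print(f"threshold={thr:.2f} precision={prec:.2f} recall={rec:.3f}")
```

Here the rule selects 0.5: raising the threshold to 0.6 would drop recall to 4/6. Because each model is tuned this way on the same validation set, the recall column sits at or just above the target for all of them, and the values differ mainly by how finely each model's scores allow the target to be approached.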

## 🏅 Select Best Model (Recall-first) & Export Artifacts

In [None]:
from pathlib import Path
Path(CONFIG['output']['model_dir']).mkdir(parents=True, exist_ok=True)

res_df = pd.DataFrame(results)
if len(res_df)==0:
    raise RuntimeError("No models trained successfully. Check dependencies.")

# Sort by recall@target desc, then PR-AUC, then ROC-AUC
res_df_sorted = res_df.sort_values(["recall_at_recall_target","pr_auc","roc_auc"], ascending=[False, False, False]).reset_index(drop=True)
best_row = res_df_sorted.iloc[0]
best_model_name = best_row["model"]
best_model = model_objects[best_model_name]
best_thresholds = thresholds[best_model_name]

print("Best model (Recall-first):", best_model_name)
display(res_df_sorted)

# Save metrics & thresholds
with open(CONFIG['output']['metrics_path'], 'w') as f:
    json.dump(res_df_sorted.to_dict(orient='records'), f, indent=2)
with open(CONFIG['output']['thresholds_path'], 'w') as f:
    json.dump(thresholds, f, indent=2)

# Persist model (CatBoost has native save; others via joblib)
model_path = None
try:
    if best_model_name == 'catboost':
        model_path = str(Path(CONFIG['output']['model_dir']) / 'catboost_best.cbm')
        best_model.save_model(model_path)
    else:
        import joblib
        model_path = str(Path(CONFIG['output']['model_dir']) / f"{best_model_name}_best.joblib")
        joblib.dump(best_model, model_path)
except Exception as e:
    print("⚠️ Failed to persist model:", repr(e))

{
    "best_model": best_model_name,
    "model_path": model_path,
    "metrics_json": CONFIG['output']['metrics_path'],
    "thresholds_json": CONFIG['output']['thresholds_path']
}
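
As a rough sketch of how a downstream job might consume `thresholds.json`, the example below writes a file in the same shape as the export above, reloads it, and turns scores into labels. The threshold values are placeholders, not results from this notebook, and `apply_threshold` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

import numpy as np


def apply_threshold(scores, threshold):
    """Label a score as positive (1) when it is at or above the threshold."""
    return (np.asarray(scores) >= threshold).astype(int)


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "thresholds.json"
    # Placeholder numbers in the same {model: {threshold_name: value}} layout.
    placeholder = {"catboost": {"best_f1_threshold": 0.42, "recall_target_threshold": 0.31}}
    path.write_text(json.dumps(placeholder, indent=2))
    loaded = json.loads(path.read_text())

thr = loaded["catboost"]["recall_target_threshold"]
labels = apply_threshold([0.10, 0.31, 0.90], thr)
print(labels.tolist())
```

Keeping the comparison as `>=` matches the sanity-check cell, so a score exactly at the threshold counts as positive in both places.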

## 📈 (Optional) MLflow Trace

In [None]:
if CONFIG['mlflow']['enabled']:
    import mlflow
    # Path.as_uri() builds a valid file:// URI on every OS (string concatenation breaks on Windows)
    mlflow.set_tracking_uri(CONFIG['mlflow']['tracking_uri'] or Path(ARTIFACT_DIR).absolute().as_uri())
    mlflow.set_experiment(CONFIG['mlflow']['experiment_name'])
    with mlflow.start_run(run_name=f"model-selection-{RUN_TS}") as run:
        mlflow.log_params({
            "seed": SEED,
            "source": CONFIG['data']['source'],
            "time_col": CONFIG['columns']['time_col'],
            "recall_target": CONFIG['split']['recall_target']
        })
        mlflow.log_artifact(CONFIG['output']['metrics_path'])
        mlflow.log_artifact(CONFIG['output']['thresholds_path'])
        if 'model_path' in locals() and model_path:
            mlflow.log_artifact(model_path)
        print("MLflow run:", run.info.run_id)

## 🔬 Sanity Check: Train vs Valid (Best Model)

In [None]:
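from sklearn.metrics import recall_score, confusion_matrix
best = best_model
if best_model_name=='catboost':
    train_scores = best.predict_proba(train_pool)[:,1]
    val_scores = best.predict_proba(valid_pool)[:,1]
else:
    train_scores = best.predict_proba(X_train)[:,1]
    val_scores = best.predict_proba(X_valid)[:,1]

thr = best_thresholds['recall_target_threshold']
y_pred_train = (train_scores >= thr).astype(int)
y_pred_val = (val_scores >= thr).astype(int)
# A large gap between the two recalls suggests overfitting. The threshold was tuned
# on the validation set, so validation recall is optimistic by construction.
print("Train recall at target threshold:", recall_score(y_train, y_pred_train))
print("Validation recall at target threshold:", recall_score(y_valid, y_pred_val))
print("Confusion matrix (validation):")
print(confusion_matrix(y_valid, y_pred_val))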
