# 📊 Model Evaluation — **Exercise** (Add your changes where marked)

Welcome! This notebook mirrors our MLOps evaluation workflow but leaves key choices to **you**.

- Look for **`# <- TODO ✏️`** comments and edit those lines.
- Keep runs **reproducible** and **config-driven**.
- Save artifacts so they can be traced in CI/CD and in SageMaker Model Registry.
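
One possible way to make saved artifacts traceable is to store a fingerprint of the configuration next to them. The sketch below is only an illustration: `config_fingerprint` and the three example dictionaries are hypothetical and not part of the project helpers. It assumes the config is JSON-serializable (anything else is stringified).

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """Return a short, deterministic hash of a config dictionary (hypothetical helper)."""
    # sort_keys makes the serialization independent of dict insertion order;
    # default=str stringifies values that json cannot encode natively
    payload = json.dumps(config, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Made-up configs: cfg_a and cfg_b have the same content in a different key order
cfg_a = {"seed": 42, "model": {"family": "catboost", "depth": 6}}
cfg_b = {"model": {"depth": 6, "family": "catboost"}, "seed": 42}
cfg_c = {"seed": 43, "model": {"family": "catboost", "depth": 6}}

fp_a, fp_b, fp_c = (config_fingerprint(c) for c in (cfg_a, cfg_b, cfg_c))
print("same content, same fingerprint:", fp_a == fp_b)
print("different seed, same fingerprint:", fp_a == fp_c)
```

Writing such a fingerprint into each report file would let CI link a metric back to the exact settings that produced it.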

### Why evaluation matters
- **Ensures Production Readiness:** confirm performance meets SLAs before/after deployment.
- **Drives Model Improvement:** pinpoint failure modes → guide feature and retraining work.
- **Maintains Model Health:** watch for bias, data drift, and concept drift post-deployment.


## 🧰 Prereqs

If needed, install packages below (SageMaker kernels usually have most of these preinstalled).

In [None]:
# %pip install pandas numpy scikit-learn mlflow catboost xgboost lightgbm boto3 sagemaker s3fs pyarrow sqlalchemy redshift_connector

In [None]:
# ♻️ Reproducibility & environment capture
import os, sys, json, hashlib, random, platform
from datetime import datetime, timezone
import numpy as np
import pandas as pd

SEED = 42   # <- TODO ✏️ choose a single seed for reproducibility across all splits & models
random.seed(SEED); np.random.seed(SEED)

# Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
RUN_TS = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
RUN_ID = hashlib.sha1(f"{RUN_TS}-{SEED}".encode()).hexdigest()[:10]

ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", f"artifacts/eval_{RUN_TS}_{RUN_ID}")
os.makedirs(ARTIFACT_DIR, exist_ok=True)

env_info = {
    "python": sys.version,
    "platform": platform.platform(),
    "timestamp_utc": RUN_TS,
    "seed": SEED,
}
with open(os.path.join(ARTIFACT_DIR, "env_eval_info.json"), "w") as f:
    json.dump(env_info, f, indent=2)

env_info

## ⚙️ Configuration (edit here only)

Choose **data source**, **target column**, **ID/leakage columns**, **model family**, and **CV strategy**. 

In [None]:
CONFIG = {
    "data": {
        # <- TODO ✏️ choose: "parquet" or "redshift"
        "source": os.environ.get("SOURCE", "parquet"),
        # <- TODO ✏️ if parquet, point to your dataset (S3 or local path with wildcards)
        "parquet_uri": os.environ.get("PARQUET_URI", "s3://YOUR-BUCKET/YOUR-PATH/*.parquet"),
        # <- TODO ✏️ if redshift, provide a deterministic SQL query (ORDER BY for stability)
        "redshift_sql": os.environ.get("SQL", "SELECT * FROM your_schema.your_table ORDER BY id"),
        # <- TODO ✏️ set via env vars or instance role; change only if testing locally
        "redshift_kwargs": {
            "host": os.environ.get("REDSHIFT_HOST", "example.redshift.amazonaws.com"),
            "database": os.environ.get("REDSHIFT_DB", "dev"),
            "user": os.environ.get("REDSHIFT_USER", "username"),
            "password": os.environ.get("REDSHIFT_PASSWORD", "password"),
            "port": int(os.environ.get("REDSHIFT_PORT", "5439")),
        },
    },
    "features": {
        # <- TODO ✏️ list identifiers/leakage columns to DROP before training/eval
        "id_features": ["customer_id","contract_id","account_id"]
    },
    "model": {
        # <- TODO ✏️ set your target column
        "target_col": os.environ.get("TARGET", "churned"),
        "random_seed": SEED,
        # Choose your model here (implementations below):
        # options: "catboost", "xgboost", "lightgbm", "sklearn_rf", "sklearn_logreg"
        "family": os.environ.get("MODEL_FAMILY","catboost"),  # <- TODO ✏️ pick a model family

        # Baseline hyperparameters (tune as you wish)
        "params": {
            # CatBoost-like defaults (ignored by other models unless mapped)
            "n_estimators": 1200,   # <- TODO ✏️
            "learning_rate": 0.08,  # <- TODO ✏️
            "depth": 6,             # <- TODO ✏️
            "l2_leaf_reg": 3.0,     # <- TODO ✏️
            "auto_class_weights": "Balanced",
        }
    },
    "evaluation": {
        "cv_folds": 5,                     # <- TODO ✏️
        "cv_strategy": "stratified",       # <- TODO ✏️ stratified | timeseries
        "opt_target_recall": 0.80,         # <- TODO ✏️ threshold target
        "use_mlflow": False,               # set True if you configured MLflow
        # Optional: temporal column to force chronological splits in holdout/backtest
        "time_col": "signup_ts",           # <- TODO ✏️ set to None if not available
    },
    "output": {
        "artifact_dir": ARTIFACT_DIR,
        "cv_summary_path": os.path.join(ARTIFACT_DIR, "cv_summary.json"),
        "holdout_report_path": os.path.join(ARTIFACT_DIR, "holdout_report.json"),
    }
}

CONFIG

## 📥 Load Data

We try `data_io.load_data()` first; if the import fails, we fall back to a synthetic dataset so you can run the notebook end-to-end right away.

In [None]:
# Try loading data via project helper
load_data = None
try:
    from data_io import load_data  # expects load_data(source, uri, sql, redshift_kwargs)
except Exception as e:
    print("ℹ️ data_io.load_data not found. Using synthetic demo. Error:", repr(e))

def _demo_dataset(n=12000, seed=SEED):
    rng = np.random.default_rng(seed)
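    df = pd.DataFrame({
        "customer_id": np.arange(1, n+1),
        # float dtype so that missing values can be injected below without an int -> float upcast
        "age": rng.integers(18, 85, size=n).astype(float),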
        "tenure_months": rng.integers(0, 120, size=n),
        "monthly_charges": rng.normal(45, 15, size=n).round(2),
        "contract_type": rng.choice(["month-to-month","one-year","two-year"], size=n, p=[0.6,0.25,0.15]),
        "country": rng.choice(["PT","ES","FR","DE"], size=n, p=[0.5,0.2,0.2,0.1]),
        "signup_ts": pd.to_datetime("2022-01-01") + pd.to_timedelta(rng.integers(0, 900, size=n), unit="D"),
        "churned": rng.choice([0,1], size=n, p=[0.78,0.22]).astype(int),
    })
    # anomalies to exercise cleaning
    df.loc[rng.choice(df.index, 40, replace=False), "monthly_charges"] = -5.0
    df.loc[rng.choice(df.index, 60, replace=False), "age"] = None
    return df

if load_data:
    if CONFIG["data"]["source"] == "parquet":
        # <- TODO ✏️ ensure parquet_uri points to your dataset
        df = load_data(source="parquet", uri=CONFIG["data"]["parquet_uri"], sql=None, redshift_kwargs=None)
    else:
        # <- TODO ✏️ ensure SQL/credentials resolve to your Redshift view/table
        df = load_data(source="redshift", uri=None, sql=CONFIG["data"]["redshift_sql"], redshift_kwargs=CONFIG["data"]["redshift_kwargs"])
else:
    df = _demo_dataset()

print("Shape:", df.shape)
df.head()

## 🧪 Utilities & Minimal Preprocessing

We’ll try to import your project’s `training_utils`. If unavailable, we define **simple fallbacks** you can customize. 
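
When customizing the fallback preprocessing, one thing to watch is where imputation statistics come from: a median computed on the full dataset also sees the validation rows. A minimal sketch of the fold-safe alternative, assuming scikit-learn's `SimpleImputer` and two tiny made-up frames (`train` and `valid` are illustrative names, not notebook variables):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up training and validation frames with missing ages
train = pd.DataFrame({"age": [20.0, 30.0, np.nan, 40.0]})
valid = pd.DataFrame({"age": [np.nan, 90.0]})

imputer = SimpleImputer(strategy="median")
train_imp = imputer.fit_transform(train)  # median is learned from the training rows only
valid_imp = imputer.transform(valid)      # validation rows reuse the training median

print("learned median:", imputer.statistics_[0])
print("imputed validation ages:", valid_imp.ravel())
```

The validation value of 90 never influences the statistic, which is the property a full-dataset `fillna(median)` does not have.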

In [None]:
# Attempt to use your project helpers first
try:
    from training_utils import (
        setup_mlflow_tracking, preprocess_data, create_catboost_model, train_and_evaluate_model
    )
    HAVE_TRAINING_UTILS = True
    print("Using project training_utils.")
except Exception as e:
    print("training_utils not found. Using exercise fallbacks. Error:", repr(e))
    HAVE_TRAINING_UTILS = False

from sklearn.metrics import (
    roc_auc_score, average_precision_score, precision_recall_curve,
    precision_score, recall_score
)

# ---- TODO-friendly preprocessing ----
def preprocess_data_exercise(df, config):
    df = df.copy()
    target = config["model"]["target_col"]

    # <- TODO ✏️ add deterministic cleaning rules for your dataset
    # Example: fix invalid negatives
    if "monthly_charges" in df.columns:
        df.loc[df["monthly_charges"] < 0, "monthly_charges"] = np.nan

    # <- TODO ✏️ choose your missing value policy
    for c in df.columns:
        if c == target: 
            continue
        if df[c].dtype == object:
            df[c] = df[c].fillna("__MISSING__").astype(str)   # <- TODO ✏️ categorical policy
        elif pd.api.types.is_numeric_dtype(df[c]):
            df[c] = df[c].fillna(df[c].median())              # <- TODO ✏️ numerical imputer
        elif str(df[c].dtype).startswith("datetime"):
            df[c] = pd.to_datetime(df[c], errors="coerce")

    # <- TODO ✏️ add light feature engineering (must be deterministic)
    if {"tenure_months","monthly_charges"}.issubset(df.columns):
        df["est_ltv"] = (df["tenure_months"] * df["monthly_charges"]).round(2)

    return df

def build_model_from_family(config):
    fam = config["model"]["family"]
    params = config["model"]["params"]

    if fam == "catboost":
        try:
            from catboost import CatBoostClassifier
            return CatBoostClassifier(
                iterations=params.get("n_estimators", 800),
                learning_rate=params.get("learning_rate", 0.08),
                depth=params.get("depth", 6),
                l2_leaf_reg=params.get("l2_leaf_reg", 3.0),
                loss_function="Logloss",
                eval_metric="AUC",
                random_seed=config["model"]["random_seed"],
                auto_class_weights=params.get("auto_class_weights", "Balanced"),
                verbose=False
            )
        except Exception as e:
            print("CatBoost not available, fallback to RandomForest.", e)

    if fam == "xgboost":
        from xgboost import XGBClassifier
        return XGBClassifier(
            n_estimators=params.get("n_estimators", 600),
            learning_rate=params.get("learning_rate", 0.1),
            max_depth=params.get("depth", 6),
            subsample=0.8, colsample_bytree=0.8,
            reg_lambda=params.get("l2_leaf_reg", 1.0),
            random_state=config["model"]["random_seed"],
            tree_method="hist", eval_metric="auc"
        )

    if fam == "lightgbm":
        import lightgbm as lgb
        return lgb.LGBMClassifier(
            n_estimators=params.get("n_estimators", 800),
            learning_rate=params.get("learning_rate", 0.08),
            max_depth=params.get("depth", -1),
            reg_lambda=params.get("l2_leaf_reg", 0.0),
            class_weight="balanced",
            random_state=config["model"]["random_seed"]
        )

    if fam == "sklearn_logreg":
        from sklearn.linear_model import LogisticRegression
        return LogisticRegression(max_iter=1000, class_weight="balanced", random_state=config["model"]["random_seed"])

    # default fallback
    from sklearn.ensemble import RandomForestClassifier
    return RandomForestClassifier(n_estimators=400, class_weight="balanced", random_state=config["model"]["random_seed"])

def train_and_eval_generic(model, X_tr, y_tr, X_va, y_va, target_recall):
    # Simple encoding for object columns (joint mapping to avoid unseen category issues)
    X_tr_enc, X_va_enc = X_tr.copy(), X_va.copy()
    for c in X_tr_enc.columns:
        if X_tr_enc[c].dtype == object:
            vals = pd.concat([X_tr_enc[c], X_va_enc[c]], axis=0).astype(str)
            mapping = {v:i for i, v in enumerate(pd.Series(vals).unique())}
            X_tr_enc[c] = X_tr_enc[c].map(mapping).fillna(-1).astype(int)
            X_va_enc[c] = X_va_enc[c].map(mapping).fillna(-1).astype(int)

    model.fit(X_tr_enc, y_tr)
    proba = model.predict_proba(X_va_enc)[:,1] if hasattr(model, "predict_proba") else model.decision_function(X_va_enc)

    roc = roc_auc_score(y_va, proba)
    pr_auc = average_precision_score(y_va, proba)
    prec, rec, thr = precision_recall_curve(y_va, proba)

    # best F1
    f1s = (2*prec[:-1]*rec[:-1])/(prec[:-1]+rec[:-1]+1e-12)
    i_best = int(np.argmax(f1s))
    thr_best = float(thr[i_best])

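    # target recall: highest threshold whose recall still meets the target
    idx = np.where(rec[:-1] >= target_recall)[0]
    i_t = int(idx[-1]) if len(idx) else 0   # fall back to the lowest threshold
    thr_rec = float(thr[i_t])               # keeps the threshold consistent with prec[i_t] / rec[i_t]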

    y_hat_best = (proba >= thr_best).astype(int)
    y_hat_rec  = (proba >= thr_rec).astype(int)

    metrics = {
        "roc_auc": float(roc),
        "pr_auc": float(pr_auc),
        "best_f1": float((2*prec[i_best]*rec[i_best])/(prec[i_best]+rec[i_best]+1e-12)),
        "best_f1_threshold": thr_best,
        "precision_at_best_f1": float(precision_score(y_va, y_hat_best, zero_division=0)),
        "recall_at_best_f1": float(recall_score(y_va, y_hat_best, zero_division=0)),
        "threshold_at_target_recall": thr_rec,
        "precision_at_target_recall": float(prec[i_t]),
        "recall_at_target_recall": float(rec[i_t]),
    }
    return metrics, proba, {"thr_best": thr_best, "thr_rec": thr_rec}

# Bind final functions depending on availability
if HAVE_TRAINING_UTILS:
    _preprocess = preprocess_data                    # <- uses your project logic
    _make_model = lambda cfg: create_catboost_model(cfg)  # project helper is used as-is; CONFIG["model"]["family"] is not consulted on this path
    _train_eval  = lambda m, Xtr, ytr, Xva, yva, tr: train_and_evaluate_model(m, None, None, CONFIG, Xtr, Xva, ytr, yva)
else:
    _preprocess = preprocess_data_exercise           # <- TODO ✏️ edit this function above
    _make_model = build_model_from_family            # <- TODO ✏️ choose model family in CONFIG
    _train_eval  = lambda m, Xtr, ytr, Xva, yva, tr: train_and_eval_generic(m, Xtr, ytr, Xva, yva, tr)

## 🔁 Cross-Validation

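Pick **StratifiedKFold** for classification without a time ordering, or **TimeSeriesSplit** when chronology matters (rows must then be in chronological order). In each fold, the decision threshold is chosen on the validation split at your **target recall**.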
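
The target-recall thresholding mentioned above can be sketched on a toy example. This is a minimal illustration, assuming scikit-learn's `precision_recall_curve`; `threshold_at_recall` is a hypothetical helper and the labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_at_recall(y_true, scores, target_recall):
    """Return (threshold, precision, recall) at the highest threshold meeting target_recall."""
    prec, rec, thr = precision_recall_curve(y_true, scores)
    # rec[:-1] lines up with thr; recall does not increase as the threshold grows,
    # so the last qualifying index is the largest threshold that still meets the target.
    ok = np.where(rec[:-1] >= target_recall)[0]
    i = int(ok[-1])  # non-empty for any target_recall <= 1, because the curve starts at recall 1.0
    return float(thr[i]), float(prec[i]), float(rec[i])


# Toy data: 4 positives among 8 rows
y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
s = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

thr, p, r = threshold_at_recall(y, s, target_recall=0.75)
y_hat = (s >= thr).astype(int)
print(f"threshold={thr:.2f} precision={p:.2f} recall={r:.2f}")
```

Because the threshold is picked on the same rows it is scored on, recall meets the target by construction here; precision at that threshold is the number that varies between models.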

In [None]:
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

def cross_validate_model(df, config):
    TARGET_COL = config['model']['target_col']
    id_features = config['features']['id_features']

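    # Drop identifiers/leakage
    df_clean = df.drop(columns=[c for c in id_features if c in df.columns]).copy()

    # TimeSeriesSplit assumes chronological row order, so sort by the time column before it is dropped
    time_col = config['evaluation'].get('time_col')
    if config['evaluation']['cv_strategy'] == 'timeseries' and time_col in df_clean.columns:
        df_clean = df_clean.sort_values(time_col, kind="stable")

    # Drop datetime columns (the generic models here cannot consume them directly; we keep it simple)
    dt_cols = [c for c in df_clean.columns if pd.api.types.is_datetime64_any_dtype(df_clean[c])]
    if dt_cols:
        df_clean = df_clean.drop(columns=dt_cols)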

    # Preprocess
    df_clean = _preprocess(df_clean, config)

    # Target/Features
    y = df_clean[TARGET_COL].astype(int).values
    X = df_clean.drop(columns=[TARGET_COL])

    # CV strategy
    if config['evaluation']['cv_strategy'] == 'timeseries':
        splitter = TimeSeriesSplit(n_splits=config['evaluation']['cv_folds'])
    else:
        splitter = StratifiedKFold(n_splits=config['evaluation']['cv_folds'], shuffle=True, random_state=config['model']['random_seed'])

    results = {"fold_results": []}
    print(f"Starting {config['evaluation']['cv_folds']}-fold CV using {config['evaluation']['cv_strategy']} strategy")

    for fold, (tr_idx, va_idx) in enumerate(splitter.split(X, y), 1):
        X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
        y_tr, y_va = y[tr_idx], y[va_idx]

        model = _make_model(config)
        metrics, proba, thr = _train_eval(model, X_tr, y_tr, X_va, y_va, config['evaluation']['opt_target_recall'])

        results["fold_results"].append({
            "fold": fold,
            **metrics,
            "train_samples": int(len(X_tr)),
            "val_samples": int(len(X_va))
        })

        print(f"Fold {fold}: ROC-AUC={metrics['roc_auc']:.4f} | F1={metrics['best_f1']:.4f} | "
              f"Recall@target={metrics['recall_at_target_recall']:.4f} | "
              f"Precision@target={metrics['precision_at_target_recall']:.4f}")

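    # Aggregate (uses the module-level numpy import; re-importing np here would make it
    # a function-local name and break the earlier np.* calls in this function)
    agg = {}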
    for key in ["roc_auc","pr_auc","best_f1","precision_at_target_recall","recall_at_target_recall"]:
        vals = [r[key] for r in results["fold_results"]]
        agg[f"mean_{key}"] = float(np.mean(vals))
        agg[f"std_{key}"]  = float(np.std(vals))
    results.update(agg)

    return results, df_clean

cv_results, df_clean = cross_validate_model(df, CONFIG)
cv_results

In [None]:
# Save CV summary
import json, os, pprint
pprint.pprint(cv_results)
with open(CONFIG["output"]["cv_summary_path"], "w") as f:
    json.dump(cv_results, f, indent=2)
CONFIG["output"]["cv_summary_path"]

## 🧪 Holdout / Temporal Backtest (Optional)

Simulate *future* data with a strict time split (or a stratified random split if you lack a timestamp).
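
A strict time split can be sketched in isolation as follows. This is a minimal illustration on synthetic data: `temporal_holdout`, `event_ts`, and the 80/20 cutoff are assumptions chosen for the example, not project conventions.

```python
import numpy as np
import pandas as pd


def temporal_holdout(df, time_col, train_frac=0.8):
    """Split so that every training row is strictly older than every test row."""
    ts = pd.to_datetime(df[time_col], errors="coerce")
    cutoff = ts.quantile(train_frac)
    # Rows whose timestamp could not be parsed (NaT) compare False on both sides and are dropped
    return df[ts < cutoff], df[ts >= cutoff], cutoff


# Synthetic frame: 100 daily rows, deliberately shuffled
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "event_ts": pd.date_range("2023-01-01", periods=100, freq="D"),
    "y": rng.integers(0, 2, size=100),
}).sample(frac=1.0, random_state=0)

train, test, cutoff = temporal_holdout(demo, "event_ts")
print(len(train), len(test), cutoff)
```

The split depends only on the timestamps, so shuffling the rows beforehand does not change which rows land in each part.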

In [None]:
from sklearn.model_selection import train_test_split

TARGET_COL = CONFIG["model"]["target_col"]
time_col = CONFIG["evaluation"]["time_col"]  # <- TODO ✏️ set this field in CONFIG if you have a timestamp

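# df_clean no longer contains datetime columns (they are dropped during CV), so read the
# timestamps from the raw frame; this assumes preprocessing preserved the row index.
if time_col and time_col in df.columns:
    ts = pd.to_datetime(df.loc[df_clean.index, time_col], errors="coerce")
    cutoff = ts.quantile(0.8)  # <- TODO ✏️ adjust split policy if needed
    tr = df_clean[ts < cutoff].copy()
    te = df_clean[ts >= cutoff].copy()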
else:
    tr, te = train_test_split(df_clean, test_size=0.2, random_state=CONFIG["model"]["random_seed"], stratify=df_clean[TARGET_COL])

X_tr, y_tr = tr.drop(columns=[TARGET_COL]), tr[TARGET_COL].values
X_te, y_te = te.drop(columns=[TARGET_COL]), te[TARGET_COL].values

mdl = _make_model(CONFIG)
metrics_te, _, thr = _train_eval(mdl, X_tr, y_tr, X_te, y_te, CONFIG['evaluation']['opt_target_recall'])

holdout_report = {
    "n_train": int(len(tr)),
    "n_test": int(len(te)),
    **metrics_te
}

with open(CONFIG["output"]["holdout_report_path"], "w") as f:
    json.dump(holdout_report, f, indent=2)

holdout_report

## 🧾 Browse SageMaker Experiments & Model Registry (Optional)

Use this section to find **previous experiments, training jobs, and model packages** to compare with your results.

> **NOTE:** Requires AWS credentials and permissions. Safe to run — cells will no-op without credentials. 

In [None]:
# <- TODO ✏️ optionally filter by names you use in your project
EXPERIMENT_NAME_CONTAINS = ""   # e.g., "churn"
MODEL_GROUP_CONTAINS = ""       # e.g., "churn-model-group"

try:
    import boto3  # the sagemaker SDK is not needed for these list calls
    sm = boto3.client("sagemaker")
    print("AWS Region:", boto3.Session().region_name)

    # Experiments
    print("\nRecent Experiments:")
    res = sm.list_experiments(SortBy="CreationTime", SortOrder="Descending", MaxResults=10)
    for e in res.get("ExperimentSummaries", []):
        if EXPERIMENT_NAME_CONTAINS.lower() in e["ExperimentName"].lower():
            print(" -", e["ExperimentName"], "| Created:", e["CreationTime"])

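    # Model Registry: list groups first, because list_model_packages takes an exact ModelPackageGroupName
    print("\nRecent Model Packages:")
    group_kwargs = {"SortBy": "CreationTime", "SortOrder": "Descending", "MaxResults": 10}
    if MODEL_GROUP_CONTAINS:
        group_kwargs["NameContains"] = MODEL_GROUP_CONTAINS
    for g in sm.list_model_package_groups(**group_kwargs).get("ModelPackageGroupSummaryList", []):
        mres = sm.list_model_packages(ModelPackageGroupName=g["ModelPackageGroupName"],
                                      SortBy="CreationTime", SortOrder="Descending", MaxResults=10)
        for mp in mres.get("ModelPackageSummaryList", []):
            print(" - Group:", mp.get("ModelPackageGroupName"), "| Status:", mp.get("ModelApprovalStatus"),
                  "| Created:", mp.get("CreationTime"))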

    # Training Jobs
    print("\nRecent Training Jobs:")
    tres = sm.list_training_jobs(SortBy="CreationTime", SortOrder="Descending", MaxResults=10)
    for tj in tres.get("TrainingJobSummaries", []):
        print(" -", tj["TrainingJobName"], "| Status:", tj["TrainingJobStatus"], "| Created:", tj["CreationTime"])

except Exception as e:
    print("Skipping browse (missing AWS creds or permissions):", e)

## ✅ Promotion Guidance

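- Set **quality gates** (e.g., `mean_precision_at_target_recall` and `mean_roc_auc` above agreed floors, with low `std_*` across folds). In the fallback evaluator, recall at the target threshold is met by construction on each validation fold, so it says little on its own.
- Compare against the **approved** model in the Registry and block promotion if the candidate is worse.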
- If drift/bias suspected: open an issue, review data pipeline, schedule retraining.
- Log and ship: `cv_summary.json`, `holdout_report.json`, model artifacts, and config to MLflow/Registry.
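
The gate and comparison steps above could be automated along these lines. This is a sketch under stated assumptions: `check_quality_gates`, the floors, the `max_std` tolerance, and all numbers are hypothetical. Only the `mean_*` / `std_*` key layout follows the CV summary produced earlier in this notebook.

```python
def check_quality_gates(cv_summary, floors, max_std=0.05, baseline=None):
    """Return (passed, failures) for a CV summary that has mean_*/std_* keys."""
    failures = []
    for metric, floor in floors.items():
        mean = cv_summary[f"mean_{metric}"]
        std = cv_summary[f"std_{metric}"]
        if mean < floor:
            failures.append(f"{metric}: mean {mean:.3f} below floor {floor:.3f}")
        if std > max_std:
            failures.append(f"{metric}: std {std:.3f} above {max_std:.3f}")
        # Optional comparison with the approved model's metric (hypothetical input)
        if baseline is not None and mean < baseline.get(metric, float("-inf")):
            failures.append(f"{metric}: mean {mean:.3f} worse than approved {baseline[metric]:.3f}")
    return not failures, failures


# Made-up numbers, purely for illustration
cv_summary = {
    "mean_roc_auc": 0.84, "std_roc_auc": 0.01,
    "mean_precision_at_target_recall": 0.41, "std_precision_at_target_recall": 0.08,
}
floors = {"roc_auc": 0.80, "precision_at_target_recall": 0.35}
approved = {"roc_auc": 0.85}

passed, failures = check_quality_gates(cv_summary, floors, baseline=approved)
print("promote:", passed)
for msg in failures:
    print(" -", msg)
```

In CI, a non-empty `failures` list would typically fail the job, with the messages logged next to `cv_summary.json`.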
