# 📊 Model Evaluation — MLOps Pipeline (Example)

**Goal:** provide a *production-ready* evaluation workflow you can run locally or in SageMaker to validate model quality **before** promotion and **after** deployment.

### Why evaluation matters (theory, briefly)
- **Ensures Production Readiness:** confirms predictive performance meets SLAs before and after deployment.
- **Drives Model Improvement:** highlights where performance drops (e.g., low recall on minority cohorts) → informs retraining or feature work.
- **Maintains Model Health:** periodic checks catch **data drift**, **concept drift**, and **bias** that can emerge post-deployment.
- **Traceability & Governance:** metrics and artifacts are captured with **MLflow** and can be associated with **SageMaker Model Registry** entries.
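
To make the data-drift check in the list above concrete, here is a minimal sketch of a population stability index (PSI) between a reference sample and a recent sample of one numeric feature. It is an illustration under assumptions, not part of this notebook's pipeline: the helper name `population_stability_index`, the decile binning, and the `eps` clipping are all choices made for this example.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a reference sample and a recent sample (hypothetical helper)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from quantiles of the reference sample; duplicate edges are dropped
    edges = np.unique(np.quantile(expected, np.linspace(0, 1, n_bins + 1)))
    interior = edges[1:-1]  # outer bins are open-ended, so only interior cut points are used
    k = len(interior) + 1
    e_idx = np.searchsorted(interior, expected, side="right")
    a_idx = np.searchsorted(interior, actual, side="right")
    e_frac = np.bincount(e_idx, minlength=k) / len(expected)
    a_frac = np.bincount(a_idx, minlength=k) / len(actual)
    # Clip to avoid log(0) when a bin is empty in either sample
    e_frac = np.clip(e_frac, eps, None)
    a_frac = np.clip(a_frac, eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
reference = rng.normal(45, 15, size=5000)
psi_same = population_stability_index(reference, reference)
psi_shifted = population_stability_index(reference, reference + 15)  # one-sigma shift
print(f"PSI same={psi_same:.4f}  shifted={psi_shifted:.4f}")
```

Identical samples give a PSI of zero, and the value grows as the recent sample moves away from the reference. Any alerting cutoff is a project decision and should be calibrated on your own features.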


## 🧰 Prereqs
- Python 3.9+
- Packages: `pandas`, `numpy`, `scikit-learn`, `mlflow` (optional), `catboost` (or fallback to `sklearn`), `sagemaker`, `boto3`
- A `data_io.py` (for loading from Redshift or S3 Parquet). If not present, a **synthetic dataset** will be used.

> If your environment is fresh, uncomment the `%pip install` line in the cell below.


In [None]:
# %pip install pandas numpy scikit-learn mlflow catboost boto3 sagemaker s3fs pyarrow sqlalchemy redshift_connector

In [None]:
# ♻️ Reproducibility & environment capture
import os, sys, json, hashlib, random, platform
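from datetime import datetime, timezone
import numpy as np
import pandas as pd

SEED = 42
random.seed(SEED); np.random.seed(SEED)

# datetime.utcnow() is deprecated since Python 3.12; use an explicit UTC-aware timestamp
RUN_TS = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")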
RUN_ID = hashlib.sha1(f"{RUN_TS}-{SEED}".encode()).hexdigest()[:10]

ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", f"artifacts/eval_{RUN_TS}_{RUN_ID}")
os.makedirs(ARTIFACT_DIR, exist_ok=True)

env_info = {
    "python": sys.version,
    "platform": platform.platform(),
    "timestamp_utc": RUN_TS,
    "seed": SEED,
}
with open(os.path.join(ARTIFACT_DIR, "env_eval_info.json"), "w") as f:
    json.dump(env_info, f, indent=2)

env_info

## ⚙️ Configuration
Set all evaluation knobs here (data source, target column, model params, CV strategy, thresholds, etc.).

In [None]:
CONFIG = {
    "data": {
        # choose: "parquet" or "redshift"
        "source": os.environ.get("SOURCE", "parquet"),
        "parquet_uri": os.environ.get("PARQUET_URI", "s3://your-bucket/path/to/processed/*.parquet"),  # edit if using S3
        "redshift_sql": os.environ.get("SQL", "SELECT * FROM your_schema.your_table ORDER BY id"),
        "redshift_kwargs": {
            "host": os.environ.get("REDSHIFT_HOST", "example.redshift.amazonaws.com"),
            "database": os.environ.get("REDSHIFT_DB", "dev"),
            "user": os.environ.get("REDSHIFT_USER", "username"),
            "password": os.environ.get("REDSHIFT_PASSWORD", "password"),
            "port": int(os.environ.get("REDSHIFT_PORT", "5439")),
        },
    },
    "features": {
        "id_features": [
            # Put identifiers or known leakage columns here to drop before training/eval
            "customer_id", "contract_id", "account_id"
        ]
    },
    "model": {
        "target_col": os.environ.get("TARGET", "churned"),
        "random_seed": SEED,  # reuse the global seed so both stay in sync
        # CatBoost-style params (used if CatBoost available; otherwise a sklearn fallback is used)
        "n_estimators": 1200,
        "learning_rate": 0.08,
        "depth": 6,
        "l2_leaf_reg": 3.0,
        "auto_class_weights": "Balanced",
    },
    "evaluation": {
        "cv_folds": 5,
        "cv_strategy": "stratified",  # stratified | timeseries
        "opt_target_recall": 0.80,    # threshold selection target
        "use_mlflow": False,          # toggle MLflow logging on/off
    },
    "output": {
        "artifact_dir": ARTIFACT_DIR,
        "cv_summary_path": os.path.join(ARTIFACT_DIR, "cv_summary.json"),
        "holdout_report_path": os.path.join(ARTIFACT_DIR, "holdout_report.json"),
    }
}

CONFIG
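
Because every later cell reads from `CONFIG`, a quick sanity check can catch typos before any training starts. The sketch below is one possible shape for such a check; `validate_eval_config` is a hypothetical helper, and the rules it applies are assumptions derived from the comments in the config cell (allowed `source` and `cv_strategy` values, a recall target between 0 and 1).

```python
def validate_eval_config(cfg):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []
    ev = cfg.get("evaluation", {})
    if ev.get("cv_strategy") not in {"stratified", "timeseries"}:
        problems.append("evaluation.cv_strategy must be 'stratified' or 'timeseries'")
    folds = ev.get("cv_folds")
    if not isinstance(folds, int) or folds < 2:
        problems.append("evaluation.cv_folds must be an integer >= 2")
    recall = ev.get("opt_target_recall")
    if not isinstance(recall, (int, float)) or not 0.0 < recall <= 1.0:
        problems.append("evaluation.opt_target_recall must be in (0, 1]")
    if cfg.get("data", {}).get("source") not in {"parquet", "redshift"}:
        problems.append("data.source must be 'parquet' or 'redshift'")
    if not cfg.get("model", {}).get("target_col"):
        problems.append("model.target_col must be a non-empty column name")
    return problems

# Small stand-in configs for illustration (same layout as CONFIG, fewer keys)
good_cfg = {
    "data": {"source": "parquet"},
    "model": {"target_col": "churned"},
    "evaluation": {"cv_folds": 5, "cv_strategy": "stratified", "opt_target_recall": 0.80},
}
bad_cfg = {
    "data": {"source": "parquet"},
    "model": {"target_col": "churned"},
    "evaluation": {"cv_folds": 1, "cv_strategy": "kfold", "opt_target_recall": 0.80},
}
print(validate_eval_config(good_cfg))
print(validate_eval_config(bad_cfg))
```

In the notebook you could call `validate_eval_config(CONFIG)` right after the config cell and stop if the returned list is not empty.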

## 📥 Load Data
Uses `load_data` from `data_io.py` if available; otherwise a **synthetic churn** dataset is generated so this notebook remains runnable.
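
If you do not have a `data_io.py` yet, the sketch below shows one way a loader with the expected signature `load_data(source, uri, sql, redshift_kwargs)` could look. It is an assumption about the helper, not the project's actual implementation: it reads Parquet with `pandas.read_parquet` and queries Redshift through `redshift_connector`, and both libraries are imported lazily so that argument errors surface before any heavy dependency is needed.

```python
def load_data(source, uri=None, sql=None, redshift_kwargs=None):
    """Hypothetical data_io.load_data: return a pandas DataFrame from Parquet or Redshift."""
    if source == "parquet":
        if not uri:
            raise ValueError("source='parquet' requires a uri")
        import pandas as pd
        # s3:// URIs need s3fs installed; whether glob patterns work depends on the engine
        return pd.read_parquet(uri)
    if source == "redshift":
        if not sql or not redshift_kwargs:
            raise ValueError("source='redshift' requires sql and redshift_kwargs")
        import redshift_connector
        # redshift_kwargs carries host, database, user, password, port (as in CONFIG)
        conn = redshift_connector.connect(**redshift_kwargs)
        try:
            cursor = conn.cursor()
            cursor.execute(sql)
            return cursor.fetch_dataframe()
        finally:
            conn.close()
    raise ValueError(f"Unsupported source: {source!r}")

# Argument errors are raised before any I/O happens
try:
    load_data("csv")
except ValueError as err:
    print("rejected:", err)
```

Saving this function as `data_io.py` next to the notebook would make the import in the next cell succeed. Credentials should come from environment variables or a secrets manager rather than from the defaults shown in `CONFIG`.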

In [None]:
# Try loading utility
load_data = None
try:
    from data_io import load_data  # expects a function load_data(source, uri, sql, redshift_kwargs)
except Exception as e:
    print("⚠️ data_io.load_data not found. Falling back to synthetic demo. Error:", repr(e))

def _demo_dataset(n=12000, seed=SEED):
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "customer_id": np.arange(1, n+1),
        "age": rng.integers(18, 85, size=n).astype(float),  # float so missing values can be injected below
        "tenure_months": rng.integers(0, 120, size=n),
        "monthly_charges": rng.normal(45, 15, size=n).round(2),
        "contract_type": rng.choice(["month-to-month","one-year","two-year"], size=n, p=[0.6,0.25,0.15]),
        "country": rng.choice(["PT","ES","FR","DE"], size=n, p=[0.5,0.2,0.2,0.1]),
        "signup_ts": pd.to_datetime("2022-01-01") + pd.to_timedelta(rng.integers(0, 900, size=n), unit="D"),
        "churned": rng.choice([0,1], size=n, p=[0.78,0.22]).astype(int),
    })
    # add a few anomalies
    df.loc[rng.choice(df.index, 40, replace=False), "monthly_charges"] = -5.0
    df.loc[rng.choice(df.index, 60, replace=False), "age"] = None
    return df

if load_data:
    if CONFIG["data"]["source"] == "parquet":
        df = load_data(source="parquet", uri=CONFIG["data"]["parquet_uri"], sql=None, redshift_kwargs=None)
    else:
        df = load_data(source="redshift", uri=None, sql=CONFIG["data"]["redshift_sql"], redshift_kwargs=CONFIG["data"]["redshift_kwargs"])
else:
    df = _demo_dataset()

print(df.shape)
df.head()

## 🧪 Training Utils (fallbacks)
We try to import `training_utils` (your project helpers). If not present, we define minimal fallbacks here to keep the notebook runnable.

In [None]:
# Attempt to import project utilities; else define minimal versions
try:
    from training_utils import (
        setup_mlflow_tracking, preprocess_data, create_catboost_model, train_and_evaluate_model
    )
    HAVE_TRAINING_UTILS = True
except Exception as e:
    print("ℹ️ training_utils not found; using minimal fallbacks. Error:", repr(e))
    HAVE_TRAINING_UTILS = False

import warnings
from sklearn.metrics import (
    roc_auc_score, average_precision_score, precision_recall_curve,
    classification_report, confusion_matrix, precision_score, recall_score, f1_score
)

def _simple_preprocess(df, config):
    # Clean simple issues; preserve determinism
    target = config["model"]["target_col"]
    df = df.copy()
    # replace invalid negatives in monthly_charges
    if "monthly_charges" in df.columns:
        df.loc[df["monthly_charges"] < 0, "monthly_charges"] = np.nan
        df["monthly_charges"] = df["monthly_charges"].fillna(df["monthly_charges"].median())
    # fill simple categoricals
    for c in df.columns:
        if c == target: continue
        if df[c].dtype == object:
            df[c] = df[c].fillna("__MISSING__").astype(str)
        elif pd.api.types.is_numeric_dtype(df[c]):
            df[c] = df[c].fillna(df[c].median())
        elif hasattr(df[c], "dtype") and str(df[c].dtype).startswith("datetime"):
            df[c] = pd.to_datetime(df[c], errors="coerce")
    return df

def _make_model(config):
    # Try CatBoost; else fallback to sklearn RandomForest
    try:
        from catboost import CatBoostClassifier
        model = CatBoostClassifier(
            iterations=config["model"]["n_estimators"],
            learning_rate=config["model"]["learning_rate"],
            depth=config["model"]["depth"],
            l2_leaf_reg=config["model"]["l2_leaf_reg"],
            loss_function="Logloss",
            eval_metric="AUC",
            random_seed=config["model"]["random_seed"],
            auto_class_weights=config["model"]["auto_class_weights"],
            verbose=False
        )
        return model, "catboost"
    except Exception:
        from sklearn.ensemble import RandomForestClassifier
        model = RandomForestClassifier(
            n_estimators=300, random_state=config["model"]["random_seed"], class_weight="balanced"
        )
        return model, "sklearn-rf"

def _train_eval(model, X_train, y_train, X_val, y_val, target_recall=0.80, cat_idx=None):
    # Fit and evaluate (np and pd come from the notebook-level imports)
    try:
        # Encode object columns deterministically
        if any(getattr(X_train[c], "dtype", None) == object for c in X_train.columns):
            X_train_enc = X_train.copy()
            X_val_enc = X_val.copy()
            for c in X_train_enc.columns:
                if X_train_enc[c].dtype == object:
                    # Cast to str on both sides so lookups match the mapping keys
                    vals = pd.concat([X_train_enc[c], X_val_enc[c]], axis=0).astype(str)
                    mapping = {v: i for i, v in enumerate(vals.unique())}
                    X_train_enc[c] = X_train_enc[c].astype(str).map(mapping).fillna(-1).astype(int)
                    X_val_enc[c] = X_val_enc[c].astype(str).map(mapping).fillna(-1).astype(int)
        else:
            X_train_enc, X_val_enc = X_train, X_val
        model.fit(X_train_enc, y_train)
        if hasattr(model, "predict_proba"):
            proba = model.predict_proba(X_val_enc)[:,1]
        else:
            scores = model.decision_function(X_val_enc)
            proba = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    except Exception as e:
        import warnings
        warnings.warn(f"Training failed: {e}")
        raise

    from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve, precision_score, recall_score
    roc = roc_auc_score(y_val, proba)
    pr_auc = average_precision_score(y_val, proba)
    prec, rec, thr = precision_recall_curve(y_val, proba)  # len(thr)=len(prec)-1

    # best F1
    f1s = (2*prec[:-1]*rec[:-1])/(prec[:-1]+rec[:-1]+1e-12)
    i_best = int(np.argmax(f1s))
    thr_best = float(thr[i_best])
    y_hat_best = (proba >= thr_best).astype(int)

    # threshold at target recall (choose highest threshold reaching recall >= target)
    idx = np.where(rec[:-1] >= target_recall)[0]
    if len(idx):
        i_t = int(idx[-1])
        thr_rec = float(thr[i_t])
    else:
        thr_rec = 0.0
        i_t = 0
    y_hat_rec = (proba >= thr_rec).astype(int)

    metrics = {
        "roc_auc": float(roc),
        "pr_auc": float(pr_auc),
        "best_f1": float(f1s[i_best]),
        "best_f1_threshold": thr_best,
        "precision_at_best_f1": float(precision_score(y_val, y_hat_best, zero_division=0)),
        "recall_at_best_f1": float(recall_score(y_val, y_hat_best, zero_division=0)),
        "threshold_at_target_recall": thr_rec,
        "precision_at_target_recall": float(prec[i_t]),
        "recall_at_target_recall": float(rec[i_t]),
    }
    return metrics, proba, {"thr_best": thr_best, "thr_rec": thr_rec}

# Bind fallbacks if needed
if not HAVE_TRAINING_UTILS:
    setup_mlflow_tracking = lambda cfg: None
    preprocess_data = _simple_preprocess
    create_catboost_model = lambda cfg: _make_model(cfg)[0]
    def train_and_evaluate_model(model, train_pool, val_pool, config, X_train, X_val, y_train, y_val):
        # Read the recall target from the config that was passed in, not from the global CONFIG
        return _train_eval(model, X_train, y_train, X_val, y_val, target_recall=config["evaluation"]["opt_target_recall"])
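
The threshold rule inside `_train_eval` (take the highest threshold whose recall is still at or above the target) is easier to follow on a toy example. The sketch below isolates that rule; `threshold_at_recall` is a hypothetical name used only here, and the eight labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_at_recall(y_true, scores, target_recall):
    """Highest threshold whose recall is still >= target_recall, plus its precision and recall."""
    prec, rec, thr = precision_recall_curve(y_true, scores)
    # prec/rec have one more entry than thr; recall never increases as the threshold rises
    ok = np.where(rec[:-1] >= target_recall)[0]
    i = int(ok[-1]) if len(ok) else 0  # fall back to the lowest threshold
    return float(thr[i]), float(prec[i]), float(rec[i])

y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])
thr, p, r = threshold_at_recall(y_true, scores, target_recall=0.75)
print(f"threshold={thr}  precision={p}  recall={r}")
```

With these toy scores the selected threshold is 0.6: three of the four positives score at least 0.6 (recall 0.75), and three of the four predictions at that cut are correct (precision 0.75). Raising the threshold to 0.7 would drop recall to 0.5, below the target.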


## 🔁 Cross-Validation
Evaluate generalization with **StratifiedKFold** (classification) or **TimeSeriesSplit** (temporal). Thresholds are chosen at **target recall** to reflect production priorities (e.g., catching churners).
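
The two splitters behave quite differently, which the following small sketch tries to show on synthetic labels. The data are made up; only `StratifiedKFold` and `TimeSeriesSplit` come from scikit-learn. Note that `TimeSeriesSplit` splits by row position, so it only respects time if the rows are already in chronological order.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit

y = np.array([1] * 20 + [0] * 80)   # 20% positives
X = np.zeros((len(y), 1))           # features are irrelevant for showing the split behaviour

# Stratified folds keep the class ratio of y in every validation fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
positives_per_fold = [int(y[va].sum()) for _, va in skf.split(X, y)]

# Time-series folds always train on earlier row positions and validate on later ones
tss = TimeSeriesSplit(n_splits=5)
always_past_to_future = all(tr.max() < va.min() for tr, va in tss.split(X))

print("positives per stratified validation fold:", positives_per_fold)
print("time-series folds train only on earlier rows:", always_past_to_future)
```

Each stratified validation fold holds 20 rows with 4 positives, matching the overall 20% rate. For the time-series strategy, sort the frame by its event timestamp before the datetime column is dropped.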

In [None]:
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
import numpy as np

def cross_validate_model(df, config, cv_folds=5, cv_strategy='stratified', use_mlflow=False):
    TARGET_COL = config['model']['target_col']
    results = {"fold_results": []}

    id_features = config['features']['id_features']
    cols_to_drop = [c for c in id_features if c in df.columns]
    df_clean = df.drop(columns=cols_to_drop).copy()

    # Remove datetime columns (raw timestamps are not usable as features here).
    # select_dtypes also copes with tz-aware and extension dtypes, where np.issubdtype raises.
    dt_cols = list(df_clean.select_dtypes(include=["datetime", "datetimetz"]).columns)
    if dt_cols:
        df_clean = df_clean.drop(columns=dt_cols)

    df_clean = preprocess_data(df_clean, config)

    y = df_clean[TARGET_COL].astype(int).values
    X = df_clean.drop(columns=[TARGET_COL])

    if cv_strategy == "timeseries":
        # TimeSeriesSplit splits by row position: df must already be sorted chronologically
        splitter = TimeSeriesSplit(n_splits=cv_folds)
    else:
        splitter = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=config['model']['random_seed'])

    print(f"Starting {cv_folds}-fold CV with strategy: {cv_strategy}")
    for fold, (tr_idx, va_idx) in enumerate(splitter.split(X, y), 1):
        X_tr, X_va = X.iloc[tr_idx], X.iloc[va_idx]
        y_tr, y_va = y[tr_idx], y[va_idx]

        model = create_catboost_model(config)
        fold_metrics, proba, thr = train_and_evaluate_model(model, None, None, config, X_tr, X_va, y_tr, y_va)
        results["fold_results"].append({
            "fold": fold,
            **fold_metrics,
            "train_samples": int(len(X_tr)),
            "val_samples": int(len(X_va)),
        })

        print(f"Fold {fold}: ROC-AUC={fold_metrics['roc_auc']:.4f}  F1={fold_metrics['best_f1']:.4f}  "
              f"Recall@target={fold_metrics['recall_at_target_recall']:.4f}  "
              f"Prec@target={fold_metrics['precision_at_target_recall']:.4f}")

    # Aggregate
    agg = {}
    for key in ["roc_auc","pr_auc","best_f1","precision_at_target_recall","recall_at_target_recall"]:
        vals = [r[key] for r in results["fold_results"]]
        agg[f"mean_{key}"] = float(np.mean(vals))
        agg[f"std_{key}"]  = float(np.std(vals))
    results.update(agg)
    return results, df_clean

cv_results, df_clean = cross_validate_model(
    df, CONFIG, cv_folds=CONFIG["evaluation"]["cv_folds"],
    cv_strategy=CONFIG["evaluation"]["cv_strategy"],
    use_mlflow=CONFIG["evaluation"]["use_mlflow"]
)
cv_results

### 📜 CV Summary

In [None]:
import json, pprint, os
pprint.pprint(cv_results)
with open(CONFIG["output"]["cv_summary_path"], "w") as f:
    json.dump(cv_results, f, indent=2)
CONFIG["output"]["cv_summary_path"]
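
The `use_mlflow` flag in `CONFIG` is passed around but the cells above never log anything. If you want the summary in MLflow, a minimal sketch could look like the following. Both helper names are hypothetical; the MLflow calls used are `mlflow.start_run`, `mlflow.log_metrics`, and `mlflow.log_artifact`, and tracking URI or experiment setup is assumed to be handled elsewhere.

```python
def flatten_cv_metrics(cv_results):
    """Keep only the scalar aggregates (mean_*/std_*); per-fold rows stay in the JSON artifact."""
    return {
        k: float(v)
        for k, v in cv_results.items()
        if k.startswith(("mean_", "std_")) and isinstance(v, (int, float))
    }

def log_cv_to_mlflow(cv_results, summary_path, run_name="cv-eval"):
    """Log aggregates as metrics and the JSON summary as an artifact (requires mlflow)."""
    import mlflow  # optional dependency, imported only when logging is requested
    with mlflow.start_run(run_name=run_name):
        mlflow.log_metrics(flatten_cv_metrics(cv_results))
        mlflow.log_artifact(summary_path)

# Stand-in for cv_results with the same key layout
example = {
    "fold_results": [{"fold": 1, "roc_auc": 0.81}],
    "mean_roc_auc": 0.81,
    "std_roc_auc": 0.0,
}
print(flatten_cv_metrics(example))
```

In the notebook this would be called as `log_cv_to_mlflow(cv_results, CONFIG["output"]["cv_summary_path"])`, guarded by `if CONFIG["evaluation"]["use_mlflow"]:`.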

## 🧪 Extra: Holdout / Temporal Backtest (Optional)
Simulate **pre-prod** evaluation with a strict temporal split to approximate future data.
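
The idea of a strict temporal split can be shown on a ten-row toy frame: everything before the 80th-percentile timestamp trains the model, everything from that point on tests it. `temporal_split` is a hypothetical helper written for this illustration; the column names mirror the synthetic dataset.

```python
import pandas as pd

def temporal_split(frame, ts_col, train_frac=0.8):
    """Rows before the train_frac quantile of ts_col go to train; the rest go to test."""
    ts = pd.to_datetime(frame[ts_col], errors="coerce")
    cutoff = ts.quantile(train_frac)
    # Rows whose timestamp failed to parse (NaT) match neither mask and are left out
    return frame[ts < cutoff], frame[ts >= cutoff], cutoff

toy = pd.DataFrame({
    "signup_ts": pd.date_range("2022-01-01", periods=10, freq="D"),
    "churned": [0, 1, 0, 0, 1, 0, 0, 1, 0, 1],
})
train, test, cutoff = temporal_split(toy, "signup_ts")
print(len(train), len(test), cutoff)
```

The first eight days land in the training set and the last two in the test set, so the model is never evaluated on rows older than anything it was trained on.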

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve

TARGET_COL = CONFIG["model"]["target_col"]

# Temporal split on signup_ts when the raw frame has it. cross_validate_model drops
# datetime columns from df_clean, so the timestamps are read from df and aligned by index.
# Otherwise fall back to a deterministic stratified random split.
if "signup_ts" in df.columns:
    ts = pd.to_datetime(df.loc[df_clean.index, "signup_ts"], errors="coerce")
    cutoff = ts.quantile(0.8)
    tr = df_clean[ts < cutoff].copy()
    te = df_clean[ts >= cutoff].copy()
else:
    tr, te = train_test_split(df_clean, test_size=0.2, random_state=CONFIG["model"]["random_seed"], stratify=df_clean[TARGET_COL])

X_tr, y_tr = tr.drop(columns=[TARGET_COL]), tr[TARGET_COL].values
X_te, y_te = te.drop(columns=[TARGET_COL]), te[TARGET_COL].values

mdl = create_catboost_model(CONFIG)
metrics_tr, _, thr = train_and_evaluate_model(mdl, None, None, CONFIG, X_tr, X_te, y_tr, y_te)

holdout_report = {
    "n_train": int(len(tr)),
    "n_test": int(len(te)),
    **metrics_tr
}

import json, os
with open(CONFIG["output"]["holdout_report_path"], "w") as f:
    json.dump(holdout_report, f, indent=2)

holdout_report

## 🧾 SageMaker Model Registry & Experiments (Browse)
Use this section to **discover past experiments, tuning jobs, and registered models** to compare against your current run.

> You may need AWS credentials & permissions. These cells are **safe** to run locally: without credentials they print a skip message and do nothing else.

In [None]:
import os, datetime as dt
try:
    import boto3, sagemaker
    from sagemaker.analytics import ExperimentAnalytics
    sm = boto3.client("sagemaker")
    sess = sagemaker.Session()
    print("AWS Region:", boto3.Session().region_name)

    # --- Experiments (adjust filters as needed) ---
    print("\nRecent Experiments (last 30 days):")
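    res = sm.list_experiments(
        # Apply the 30-day window that the heading above promises
        CreatedAfter=dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=30),
        SortBy="CreationTime", SortOrder="Descending",
        MaxResults=10
    )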
    for e in res.get("ExperimentSummaries", []):
        print(" -", e["ExperimentName"], "| Created:", e["CreationTime"])

    # --- Models in Registry ---
    print("\nModel Packages in Registry (latest 10):")
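    # list_model_packages has no "ModelPackageGroupNameContains" argument; versioned packages
    # are the ones that belong to a registry group. Pass ModelPackageGroupName to list one group.
    mres = sm.list_model_packages(
        ModelPackageType="Versioned",
        SortBy="CreationTime", SortOrder="Descending", MaxResults=10
    )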
    for mp in mres.get("ModelPackageSummaryList", []):
        print(" - Group:", mp.get("ModelPackageGroupName"), "| Status:", mp.get("ModelApprovalStatus"),
              "| Created:", mp.get("CreationTime"))

    # --- Training jobs (latest) ---
    print("\nRecent Training Jobs (latest 10):")
    tres = sm.list_training_jobs(SortBy="CreationTime", SortOrder="Descending", MaxResults=10)
    for tj in tres.get("TrainingJobSummaries", []):
        print(" -", tj["TrainingJobName"], "| Status:", tj["TrainingJobStatus"], "| Created:", tj["CreationTime"])

except Exception as e:
    print("ℹ️ Skipping registry/experiments browse (likely missing AWS creds):", e)
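
To compare the current run against the approved model, you first need to pick that model out of the listing. The sketch below does this on plain dictionaries shaped like the entries of `ModelPackageSummaryList`, so it runs without AWS access. `latest_approved` is a hypothetical helper and the ARNs are made up; fetching the approved model's stored metrics is out of scope here.

```python
from datetime import datetime, timezone

def latest_approved(package_summaries):
    """Return the newest summary whose ModelApprovalStatus is 'Approved', or None."""
    approved = [s for s in package_summaries if s.get("ModelApprovalStatus") == "Approved"]
    return max(approved, key=lambda s: s["CreationTime"], default=None)

# Hand-written stand-ins for ModelPackageSummaryList entries (ARNs are invented)
summaries = [
    {"ModelPackageArn": "arn:example:churn/1", "ModelApprovalStatus": "Approved",
     "CreationTime": datetime(2024, 1, 10, tzinfo=timezone.utc)},
    {"ModelPackageArn": "arn:example:churn/2", "ModelApprovalStatus": "Approved",
     "CreationTime": datetime(2024, 3, 5, tzinfo=timezone.utc)},
    {"ModelPackageArn": "arn:example:churn/3", "ModelApprovalStatus": "PendingManualApproval",
     "CreationTime": datetime(2024, 4, 1, tzinfo=timezone.utc)},
]
baseline = latest_approved(summaries)
print(baseline["ModelPackageArn"] if baseline else "no approved model yet")
```

In the cell above, the same function could be applied to `mres.get("ModelPackageSummaryList", [])`.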

## ✅ What to do with these results
- **Gate for promotion:** enforce thresholds (e.g., mean Recall ≥ 0.80 at the chosen threshold, with fold-to-fold standard deviation inside an agreed tolerance).
- **Compare to registry:** if current run underperforms the **approved** model, block promotion.
- **Open issues:** if drift/bias suspected, trigger data pipeline review and schedule retraining.
- **Log everything:** ship `cv_summary.json` and `holdout_report.json` to your artifact store/MLflow and link them to Model Registry entries.
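
The promotion gate from the first bullet can be expressed as a small function over the aggregated CV results. This is a sketch under assumptions: `promotion_gate` is a hypothetical helper, and the default limits (`min_recall`, `min_roc_auc`, `max_recall_std`) are placeholders to be replaced with your own SLAs. The dictionary keys match what `cross_validate_model` writes into `cv_results`.

```python
def promotion_gate(cv_results, min_recall=0.80, min_roc_auc=0.75, max_recall_std=0.05):
    """Return (passed, per-check results) for the aggregated CV metrics."""
    checks = {
        "recall": cv_results["mean_recall_at_target_recall"] >= min_recall,
        "roc_auc": cv_results["mean_roc_auc"] >= min_roc_auc,
        "stability": cv_results["std_recall_at_target_recall"] <= max_recall_std,
    }
    return all(checks.values()), checks

# Made-up aggregates for illustration only
candidate = {
    "mean_recall_at_target_recall": 0.82,
    "std_recall_at_target_recall": 0.02,
    "mean_roc_auc": 0.71,
}
passed, checks = promotion_gate(candidate)
print("promote" if passed else "block", checks)
```

Here the candidate is blocked because its mean ROC-AUC falls below the placeholder limit, even though recall and stability pass. Returning the per-check dictionary makes it easy to log why a run was rejected.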
