# 🛡️ Model Validation — **Example** (SageMaker‑ready)

Purpose: **decide if a candidate model is promotable**. This notebook validates a newly trained model against **quality gates**, compares it to the **approved champion** in the SageMaker **Model Registry**, and (optionally) **registers** the model if it passes and has all required artifacts.

### Why validation?
- **Ensures Production Readiness** — confirm predictive performance meets SLAs **before & after** deployment.
- **Drives Model Improvement** — expose failure modes and gaps that guide **feature & retraining** work.
- **Maintains Model Health** — detect **bias**, **data drift**, **concept drift**; block risky promotions.


## 🧰 Prerequisites

Uncomment to install any missing deps (Studio kernels usually have most of these):

In [None]:
# %pip install pandas numpy scikit-learn boto3 sagemaker mlflow s3fs pyarrow catboost sqlalchemy redshift_connector

## 🚪 Studio Bootstrap (safe to run locally too)

In [None]:
import os, boto3
try:
    import sagemaker
    sm_sess = sagemaker.Session()
    _region = boto3.Session().region_name
    try:
        _role = sagemaker.get_execution_role()
    except Exception:
        _role = "unknown-role"
    _bucket = sm_sess.default_bucket()
    print("✅ SageMaker context")
    print(" Region:", _region)
    print(" Role:  ", _role)
    print(" Bucket:", _bucket)
    os.environ.setdefault("AWS_REGION", _region or "")
    os.environ.setdefault("SM_DEFAULT_BUCKET", _bucket or "")
except Exception as e:
    print("ℹ️ Running without SageMaker context. Reason:", e)

## ♻️ Reproducibility & Environment Capture

In [None]:
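import os, sys, json, hashlib, random, platform
from datetime import datetime, timezone
import numpy as np
import pandas as pd
from pathlib import Path

SEED = 42
random.seed(SEED); np.random.seed(SEED)
# datetime.utcnow() is deprecated since Python 3.12; build the UTC timestamp from an aware datetime
RUN_TS = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")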
RUN_ID = hashlib.sha1(f"val-{RUN_TS}-{SEED}".encode()).hexdigest()[:10]

ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", f"artifacts/validation_{RUN_TS}_{RUN_ID}")
Path(ARTIFACT_DIR).mkdir(parents=True, exist_ok=True)

env_info = {
    "python": sys.version,
    "platform": platform.platform(),
    "timestamp_utc": RUN_TS,
    "seed": SEED,
}
with open(Path(ARTIFACT_DIR)/"env_validation_info.json", "w") as f:
    json.dump(env_info, f, indent=2)

env_info

## ⚙️ Configuration

Edit this cell to point to your data, thresholds, and Registry settings. This notebook can compute fresh CV metrics or read precomputed results.

In [None]:
CONFIG = {
    "data": {
        "source": os.environ.get("SOURCE", "parquet"),  # "parquet" | "redshift"
        "parquet_uri": os.environ.get("PARQUET_URI", "s3://YOUR-BUCKET/path/*.parquet"),
        "redshift_sql": os.environ.get("SQL", "SELECT * FROM your_schema.your_table ORDER BY id"),
        "redshift_kwargs": {
            "host": os.environ.get("REDSHIFT_HOST", "example.redshift.amazonaws.com"),
            "database": os.environ.get("REDSHIFT_DB", "dev"),
            "user": os.environ.get("REDSHIFT_USER", "username"),
            "password": os.environ.get("REDSHIFT_PASSWORD", "password"),
            "port": int(os.environ.get("REDSHIFT_PORT", "5439")),
        },
        "target_col": os.environ.get("TARGET", "churned"),
        "id_features": ["customer_id","contract_id","account_id"],
        "time_col": os.environ.get("TIME_COL", "signup_ts"),  # set to None if not available
    },
    "evaluation": {
        "cv_folds": int(os.environ.get("CV_FOLDS","5")),
        "cv_strategy": os.environ.get("CV_STRATEGY","stratified"),  # "stratified" | "timeseries"
        "target_recall": float(os.environ.get("TARGET_RECALL","0.80")),
        "stability_std_max": float(os.environ.get("STABILITY_STD_MAX","0.03")),  # std of recall@target
        "min_fold_recall": float(os.environ.get("MIN_FOLD_RECALL","0.75")),      # safety floor per fold
        "better_than_champion_margin": float(os.environ.get("BETTER_MARGIN","0.0")), # require strictly better by margin
        "read_cv_from": os.environ.get("CV_JSON",""),   # optional: path to existing cv_summary.json
    },
    "registry": {
        "package_group": os.environ.get("SM_MODEL_PACKAGE_GROUP","churn-model-group"),
        "candidate": {
            # local artifacts you plan to register
            "model_tar_path": os.environ.get("MODEL_TAR","model.tar.gz"),    # created by training pipeline
            "inference_script": os.environ.get("INFERENCE_SCRIPT","inference.py"),
            "requirements": os.environ.get("REQUIREMENTS","requirements.txt"),
            "schema_json": os.environ.get("SCHEMA_JSON", str(Path(ARTIFACT_DIR)/"feature_schema.json")),
            "validation_report": str(Path(ARTIFACT_DIR)/"validation_report.json"),
        },
        "s3_prefix": os.environ.get("ARTIFACTS_S3_PREFIX", f"s3://{os.environ.get('SM_DEFAULT_BUCKET','')}/model-validation/{RUN_TS}_{RUN_ID}"),
        "register_if_pass": os.environ.get("REGISTER_IF_PASS","false").lower() == "true",
        "model_approval_status": os.environ.get("APPROVAL_STATUS","PendingManualApproval"),
        "container_image_uri": os.environ.get("CONTAINER_IMAGE",""),  # optional: override
    },
    "paths": {
        "artifact_dir": ARTIFACT_DIR,
        "cv_summary_path": str(Path(ARTIFACT_DIR)/"cv_summary.json"),
        "validation_report_path": str(Path(ARTIFACT_DIR)/"validation_report.json"),
    }
}

CONFIG
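
Because the thresholds above arrive as environment variables, a typo can silently produce a gate that never fails or never passes. The following is a minimal sketch of a sanity check, assuming the `CONFIG["evaluation"]` layout shown above; `check_evaluation_config` is a hypothetical helper, not part of this notebook or of any library.

```python
def check_evaluation_config(evaluation):
    """Return a list of problems found in an evaluation config dict (empty if none)."""
    problems = []
    for key in ("target_recall", "min_fold_recall"):
        if not 0.0 <= evaluation[key] <= 1.0:
            problems.append(f"{key}={evaluation[key]} must lie in [0, 1]")
    if evaluation["stability_std_max"] < 0:
        problems.append("stability_std_max must be non-negative")
    if evaluation["cv_folds"] < 2:
        problems.append("cv_folds must be at least 2")
    if evaluation["cv_strategy"] not in ("stratified", "timeseries"):
        problems.append(f"unknown cv_strategy {evaluation['cv_strategy']!r}")
    return problems


# Same keys and default values as CONFIG["evaluation"] above
defaults = {
    "cv_folds": 5,
    "cv_strategy": "stratified",
    "target_recall": 0.80,
    "stability_std_max": 0.03,
    "min_fold_recall": 0.75,
}
# Two deliberate mistakes: a percentage instead of a fraction, and a misspelled strategy
typo = dict(defaults, target_recall=80.0, cv_strategy="stratifed")

print(check_evaluation_config(defaults))
print(check_evaluation_config(typo))
```

Calling something like this right after the configuration cell would surface a bad value before any cross-validation time is spent.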

## 📥 Load Data (Redshift or S3 Parquet)

Tries `data_io.load_data()` first; falls back to a synthetic dataset so you can run end‑to‑end.

In [None]:
load_data = None
try:
    from data_io import load_data  # expects load_data(source, uri, sql, redshift_kwargs)
except Exception as e:
    print("ℹ️ data_io.load_data not found. Using synthetic demo. Error:", repr(e))

def _demo_dataset(n=12000, seed=SEED):
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "customer_id": np.arange(1, n+1),
        "age": rng.integers(18, 85, size=n),
        "tenure_months": rng.integers(0, 120, size=n),
        "monthly_charges": rng.normal(45, 15, size=n).round(2),
        "contract_type": rng.choice(["month-to-month","one-year","two-year"], size=n, p=[0.6,0.25,0.15]),
        "country": rng.choice(["PT","ES","FR","DE"], size=n, p=[0.5,0.2,0.2,0.1]),
        "signup_ts": pd.to_datetime("2022-01-01") + pd.to_timedelta(rng.integers(0, 900, size=n), unit="D"),
        "churned": rng.choice([0,1], size=n, p=[0.78,0.22]).astype(int),
    })
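    # anomalies
    df.loc[rng.choice(df.index, 40, replace=False), "monthly_charges"] = -5.0
    # cast to float first: writing missing values into an integer column is deprecated in recent pandas
    df["age"] = df["age"].astype(float)
    df.loc[rng.choice(df.index, 60, replace=False), "age"] = np.nan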
    return df

if load_data:
    if CONFIG["data"]["source"] == "parquet":
        df = load_data(source="parquet", uri=CONFIG["data"]["parquet_uri"], sql=None, redshift_kwargs=None)
    else:
        df = load_data(source="redshift", uri=None, sql=CONFIG["data"]["redshift_sql"], redshift_kwargs=CONFIG["data"]["redshift_kwargs"])
else:
    df = _demo_dataset()

print("Shape:", df.shape)
df.head()
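
The `data_io` module is not included in this notebook, so its real implementation is unknown. As a rough sketch of what a compatible `load_data` could look like, assuming only the signature in the import comment above, `pandas.read_parquet` for the Parquet branch, and the `redshift_connector` package for the Redshift branch:

```python
def load_data(source, uri=None, sql=None, redshift_kwargs=None):
    """Hypothetical stand-in for data_io.load_data; returns a pandas DataFrame."""
    if source == "parquet":
        if not uri:
            raise ValueError("uri is required when source='parquet'")
        import pandas as pd  # imported lazily so the Redshift branch does not need pyarrow

        # Assumption: uri is a file or directory that read_parquet can open;
        # glob patterns such as *.parquet may need to be expanded first.
        return pd.read_parquet(uri)
    if source == "redshift":
        if not sql:
            raise ValueError("sql is required when source='redshift'")
        import redshift_connector  # optional dependency, only needed for this branch

        conn = redshift_connector.connect(**(redshift_kwargs or {}))
        try:
            cursor = conn.cursor()
            cursor.execute(sql)
            return cursor.fetch_dataframe()
        finally:
            conn.close()
    raise ValueError(f"unknown source {source!r}; expected 'parquet' or 'redshift'")
```

Saved as `data_io.py` next to the notebook, a function of this shape would be picked up by the import at the top of the cell above instead of the synthetic fallback.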

## 🧼 Minimal Deterministic Preprocessing (replace with your project helpers if you prefer)

In [None]:
from sklearn.metrics import (
    roc_auc_score, average_precision_score, precision_recall_curve,
    precision_score, recall_score
)

def preprocess_minimal(df, target):
    df = df.copy()
    if "monthly_charges" in df.columns:
        df.loc[df["monthly_charges"] < 0, "monthly_charges"] = np.nan
    for c in df.columns:
        if c == target: 
            continue
        if df[c].dtype == object:
            df[c] = df[c].fillna("__MISSING__").astype(str)
        elif pd.api.types.is_numeric_dtype(df[c]):
            df[c] = df[c].fillna(df[c].median())
        elif str(df[c].dtype).startswith("datetime"):
            df[c] = pd.to_datetime(df[c], errors="coerce")
    if {"tenure_months","monthly_charges"}.issubset(df.columns):
        df["est_ltv"] = (df["tenure_months"] * df["monthly_charges"]).round(2)
    return df

## 🔁 Cross‑Validation (compute candidate metrics)

In [None]:
from sklearn.model_selection import StratifiedKFold, TimeSeriesSplit
from sklearn.ensemble import RandomForestClassifier

def encode_objects_joint(Xtr, Xva):
    Xt, Xv = Xtr.copy(), Xva.copy()
    for c in Xt.columns:
        if Xt[c].dtype == object:
            vals = pd.concat([Xt[c], Xv[c]], axis=0).astype(str)
            mapping = {v:i for i,v in enumerate(pd.Series(vals).unique())}
            Xt[c] = Xt[c].map(mapping).fillna(-1).astype(int)
            Xv[c] = Xv[c].map(mapping).fillna(-1).astype(int)
    return Xt, Xv

def evaluate_candidate_via_cv(df, cfg):
    target = cfg["data"]["target_col"]
    ids = cfg["data"]["id_features"]
    time_col = cfg["data"]["time_col"]
    target_recall = cfg["evaluation"]["target_recall"]
    folds = cfg["evaluation"]["cv_folds"]
    strategy = cfg["evaluation"]["cv_strategy"]

    df = df.drop(columns=[c for c in ids if c in df.columns]).copy()
    df = preprocess_minimal(df, target)

    # TimeSeriesSplit assumes chronological row order, so decide on the strategy and
    # sort by the time column before the datetime columns are dropped below.
    use_timeseries = strategy == "timeseries" and time_col in df.columns
    if use_timeseries:
        df = df.sort_values(time_col).reset_index(drop=True)

    # Remove pure datetime columns for the simple baseline model here
    dt_cols = [c for c in df.columns if pd.api.types.is_datetime64_any_dtype(df[c])]
    if dt_cols:
        df = df.drop(columns=dt_cols)

    y = df[target].astype(int).values
    X = df.drop(columns=[target])

    if use_timeseries:
        splitter = TimeSeriesSplit(n_splits=folds)
    else:
        splitter = StratifiedKFold(n_splits=folds, shuffle=True, random_state=SEED)
    split_iter = splitter.split(X, y)

    fold_rows = []
    for k, (tr_idx, va_idx) in enumerate(split_iter, 1):
        Xtr, Xva = X.iloc[tr_idx], X.iloc[va_idx]
        ytr, yva = y[tr_idx], y[va_idx]

        Xtr_enc, Xva_enc = encode_objects_joint(Xtr, Xva)

        model = RandomForestClassifier(
            n_estimators=400, max_depth=None, random_state=SEED, class_weight="balanced"
        )
        model.fit(Xtr_enc, ytr)
        proba = model.predict_proba(Xva_enc)[:,1]

        roc = roc_auc_score(yva, proba)
        pr_auc = average_precision_score(yva, proba)
        prec, rec, thr = precision_recall_curve(yva, proba)
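        # find highest threshold meeting recall target
        # (the threshold is chosen on this same fold, so the recall reported below is >= target
        #  by construction; precision at that threshold is the more informative number)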
        idx = np.where(rec[:-1] >= target_recall)[0]
        i = int(idx[-1]) if len(idx) else 0
        thr_rec = float(thr[i]) if len(idx) else 0.0
        yhat_rec = (proba >= thr_rec).astype(int)

        row = {
            "fold": k,
            "roc_auc": float(roc),
            "pr_auc": float(pr_auc),
            "recall_at_target": float(recall_score(yva, yhat_rec, zero_division=0)),
            "precision_at_target": float(precision_score(yva, yhat_rec, zero_division=0)),
            "threshold_at_target": thr_rec,
            "n_train": int(len(tr_idx)),
            "n_val": int(len(va_idx)),
        }
        fold_rows.append(row)

        print(f"Fold {k}: ROC-AUC={row['roc_auc']:.4f} | PR-AUC={row['pr_auc']:.4f} | "
              f"Recall@target={row['recall_at_target']:.4f} | Precision@target={row['precision_at_target']:.4f}")

    df_cv = pd.DataFrame(fold_rows)
    agg = {
        "mean_recall_at_target": float(df_cv["recall_at_target"].mean()),
        "std_recall_at_target": float(df_cv["recall_at_target"].std()),
        "mean_pr_auc": float(df_cv["pr_auc"].mean()),
        "mean_roc_auc": float(df_cv["roc_auc"].mean()),
    }
    results = {"folds": fold_rows, **agg}
    return results

# Compute or read CV summary
if CONFIG["evaluation"]["read_cv_from"] and Path(CONFIG["evaluation"]["read_cv_from"]).exists():
    with open(CONFIG["evaluation"]["read_cv_from"], "r") as f:
        cv_results = json.load(f)
    print("Loaded CV from:", CONFIG["evaluation"]["read_cv_from"])
else:
    cv_results = evaluate_candidate_via_cv(df, CONFIG)

with open(CONFIG["paths"]["cv_summary_path"], "w") as f:
    json.dump(cv_results, f, indent=2)

cv_results
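
The threshold step inside each fold picks the highest score cut-off whose recall still meets `target_recall`. The sketch below reproduces that selection in plain Python on a toy fold so the logic can be followed by hand; `highest_threshold_for_recall` and the labels and scores are illustrative assumptions, not output from this notebook.

```python
def highest_threshold_for_recall(y_true, scores, target_recall):
    """Return (threshold, recall, precision) for the highest cut-off meeting the recall target."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Try each distinct score as a threshold, from the strictest to the most lenient
    for thr in sorted(set(scores), reverse=True):
        flagged = [s >= thr for s in scores]
        tp = sum(1 for f, y in zip(flagged, y_true) if f and y == 1)
        fp = sum(1 for f, y in zip(flagged, y_true) if f and y == 0)
        recall = tp / n_pos
        if recall >= target_recall:
            return thr, recall, tp / (tp + fp)
    raise ValueError("target_recall cannot exceed 1.0")


# Toy validation fold: 5 positives among 10 rows (made-up values)
y_true = [0, 0, 1, 1, 0, 1, 1, 0, 1, 0]
scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.70, 0.75, 0.80, 0.05]

thr, rec, prec = highest_threshold_for_recall(y_true, scores, target_recall=0.80)
print(f"threshold={thr:.2f} recall={rec:.2f} precision={prec:.2f}")
```

Here the cut-off lands at 0.40: four of the five positives are flagged along with two negatives. Since the cut-off is tuned on the fold being scored, the recall is at least the target whenever the fold contains positives, and the cost of reaching it shows up in the precision.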

## ✅ Quality Gates

We gate on **mean recall@target**, **stability** (std across folds), and **per‑fold safety floor**. 

In [None]:
def apply_quality_gates(cv_results, cfg):
    target_recall = cfg["evaluation"]["target_recall"]
    stability_std_max = cfg["evaluation"]["stability_std_max"]
    min_fold_recall = cfg["evaluation"]["min_fold_recall"]

    mean_rec = cv_results["mean_recall_at_target"]
    std_rec  = cv_results["std_recall_at_target"]
    fold_recalls = [f["recall_at_target"] for f in cv_results["folds"]]
    min_rec = float(min(fold_recalls)) if fold_recalls else 0.0

    gates = {
        "gate_mean_recall": mean_rec >= target_recall,
        "gate_stability": std_rec <= stability_std_max,
        "gate_min_fold": min_rec >= min_fold_recall,
    }
    decision = all(gates.values())

    report = {
        "target_recall_threshold": target_recall,
        "stability_std_max": stability_std_max,
        "min_fold_recall": min_fold_recall,
        "observed": {
            "mean_recall_at_target": mean_rec,
            "std_recall_at_target": std_rec,
            "min_fold_recall": min_rec
        },
        "gates": gates,
        "quality_gates_pass": decision
    }
    return report

quality_report = apply_quality_gates(cv_results, CONFIG)
quality_report
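
To make the three gates concrete, here is a small worked example on hypothetical per-fold recalls, using the default thresholds from the configuration. The fold values are invented for illustration; the point is that a healthy mean can hide one weak fold.

```python
from statistics import mean, stdev

fold_recalls = [0.84, 0.82, 0.83, 0.74, 0.85]  # illustrative values, not real results
target_recall, stability_std_max, min_fold_recall = 0.80, 0.03, 0.75

mean_rec = mean(fold_recalls)
std_rec = stdev(fold_recalls)  # sample std (ddof=1), matching pandas' default .std()
min_rec = min(fold_recalls)

gates = {
    "gate_mean_recall": mean_rec >= target_recall,
    "gate_stability": std_rec <= stability_std_max,
    "gate_min_fold": min_rec >= min_fold_recall,
}
print(f"mean={mean_rec:.3f} std={std_rec:.3f} min={min_rec:.2f}")
print(gates, "->", "PASS" if all(gates.values()) else "FAIL")
```

The mean (about 0.816) clears the 0.80 target, but the fourth fold drags the standard deviation to roughly 0.044 and falls below the 0.75 floor, so two of the three gates fail and the candidate is blocked.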

## 🏆 Compare to Champion in SageMaker Model Registry

We fetch the **latest Approved** model in the configured **Model Package Group** and read its stored metrics. If metrics are not present, we fall back to gates only.

In [None]:
def get_champion_metrics_from_registry(package_group, metric_key="mean_recall_at_target"):
    try:
        import boto3
        sm = boto3.client("sagemaker")

        # Ask the Registry for Approved packages only, most recent first.
        # Filtering server-side avoids missing the champion when many newer,
        # non-approved versions sit in front of it.
        res = sm.list_model_packages(
            ModelPackageGroupName=package_group,
            ModelApprovalStatus="Approved",
            SortBy="CreationTime",
            SortOrder="Descending",
            MaxResults=1
        )
        for mp in res.get("ModelPackageSummaryList", []):
            desc = sm.describe_model_package(ModelPackageName=mp["ModelPackageArn"])
            # Prefer CustomerMetadataProperties for simple scalar metrics
            meta = desc.get("CustomerMetadataProperties") or {}
            if metric_key in meta:
                return {"metric_key": metric_key, "value": float(meta[metric_key]), "arn": mp["ModelPackageArn"]}
            # ModelMetrics only holds S3 URIs to JSON files; fetching them is skipped here.
            # If your pipeline writes a compact metrics JSON under ModelQuality -> Statistics, add the S3 read here.
            return {"metric_key": metric_key, "value": None, "arn": mp["ModelPackageArn"]}
        return None
    except Exception as e:
        print("Registry lookup failed:", e)
        return None

champion = get_champion_metrics_from_registry(CONFIG["registry"]["package_group"])
champion
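
The lookup above leaves `value` as `None` when the metric is missing from `CustomerMetadataProperties`. One possible way to fill that gap, sketched under the assumption that the champion was registered with a flat JSON file (such as `cv_summary.json`) referenced by `ModelMetrics -> ModelQuality -> Statistics -> S3Uri`, is to fetch that file and read the key from it. `read_metric_from_s3` is a hypothetical helper; the stub client exists only so the example runs without AWS access.

```python
import io
import json
from urllib.parse import urlparse


def read_metric_from_s3(s3_uri, metric_key, s3_client=None):
    """Fetch a JSON document from S3 and return metric_key as a float, or None if absent."""
    parsed = urlparse(s3_uri)
    key = parsed.path.lstrip("/")
    if parsed.scheme != "s3" or not parsed.netloc or not key:
        raise ValueError(f"not a usable S3 object URI: {s3_uri!r}")
    if s3_client is None:
        import boto3  # only needed when no client is injected

        s3_client = boto3.client("s3")
    obj = s3_client.get_object(Bucket=parsed.netloc, Key=key)
    payload = json.loads(obj["Body"].read())
    value = payload.get(metric_key)
    return None if value is None else float(value)


class StubS3:
    """Stand-in for a boto3 S3 client, returning a canned metrics file."""

    def get_object(self, Bucket, Key):
        return {"Body": io.BytesIO(b'{"mean_recall_at_target": 0.83}')}


demo_value = read_metric_from_s3(
    "s3://example-bucket/model-validation/cv_summary.json",  # placeholder URI
    "mean_recall_at_target",
    s3_client=StubS3(),
)
print(demo_value)
```

Inside the registry function, the URI would come from `desc["ModelMetrics"]["ModelQuality"]["Statistics"]["S3Uri"]` when those keys are present.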

## 🚦 Promotion Decision (candidate vs champion)

In [None]:
def decide_promotion(quality_report, cv_results, champion, cfg):
    # Must pass quality gates
    if not quality_report["quality_gates_pass"]:
        return {"promote": False, "reason": "Failed quality gates", "compare": None}

    margin = cfg["evaluation"]["better_than_champion_margin"]
    candidate_rec = cv_results["mean_recall_at_target"]

    # If no champion or no metric found, promote based on gates alone
    if not champion or champion.get("value") is None:
        return {"promote": True, "reason": "Passed gates; no champion metric available", "compare": None}

    champion_rec = champion["value"]
    # strictly better than champion + margin, so a tie with margin 0.0 does not promote
    better = (candidate_rec > champion_rec + margin)
    reason = f"candidate {candidate_rec:.4f} vs champion {champion_rec:.4f} (margin {margin:.4f})"

    return {"promote": bool(better), "reason": reason, "compare": {"candidate": candidate_rec, "champion": champion_rec, "margin": margin}}

promotion = decide_promotion(quality_report, cv_results, champion, CONFIG)
promotion
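
The effect of `better_than_champion_margin` is easiest to see with numbers. The sketch below isolates the comparison on made-up recall values; `beats_champion` is a hypothetical one-liner, and ties count as not better in this sketch.

```python
def beats_champion(candidate, champion, margin=0.0):
    """True when the candidate metric exceeds the champion metric by more than margin."""
    return candidate > champion + margin


# (candidate, champion, margin) with illustrative recall values
cases = [
    (0.83, 0.82, 0.00),  # small uplift, no margin required
    (0.83, 0.82, 0.02),  # same uplift, but a 2-point margin is required
    (0.86, 0.82, 0.02),  # uplift large enough to clear the margin
]
verdicts = [beats_champion(c, ch, m) for c, ch, m in cases]
for (c, ch, m), ok in zip(cases, verdicts):
    print(f"candidate={c:.2f} champion={ch:.2f} margin={m:.2f} -> {'promote' if ok else 'hold'}")
```

A non-zero margin is a simple guard against promoting a model whose improvement is within cross-validation noise.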

## 📦 Artifact Checklist

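Before registering, confirm that the required files exist locally. This check only looks at local paths; it does not inspect the archive, so verify separately that the inference script and requirements are packaged in `model.tar.gz` if your container expects them there.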

In [None]:
from pathlib import Path

REQUIRED = {
    "model_tar_path": CONFIG["registry"]["candidate"]["model_tar_path"],
    "inference_script": CONFIG["registry"]["candidate"]["inference_script"],
    "requirements": CONFIG["registry"]["candidate"]["requirements"],
    "schema_json": CONFIG["registry"]["candidate"]["schema_json"],
}

missing = {k:v for k,v in REQUIRED.items() if not Path(v).exists()}
artifact_ok = len(missing) == 0

artifact_report = {
    "required": REQUIRED,
    "missing": missing,
    "artifact_ok": artifact_ok
}
artifact_report
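
The checklist above confirms that files exist on disk, not that they made it into the archive. A minimal sketch of an archive-level check using only the standard library follows; `missing_from_tar` is a hypothetical helper, and the temporary archive is built purely for demonstration.

```python
import tarfile
import tempfile
from pathlib import Path, PurePosixPath


def missing_from_tar(tar_path, required_names):
    """Return the required file names that appear nowhere inside the archive."""
    with tarfile.open(tar_path, "r:*") as tar:
        # Compare on base names so code/inference.py satisfies "inference.py"
        present = {PurePosixPath(name).name for name in tar.getnames()}
    return sorted(set(required_names) - present)


# Build a throwaway archive that contains a model file and an inference script only
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "model.pkl").write_bytes(b"stub")
    (tmp / "inference.py").write_text("# stub\n")
    tar_path = tmp / "model.tar.gz"
    with tarfile.open(tar_path, "w:gz") as tar:
        tar.add(tmp / "model.pkl", arcname="model.pkl")
        tar.add(tmp / "inference.py", arcname="code/inference.py")
    demo_missing = missing_from_tar(tar_path, ["inference.py", "requirements.txt"])

print(demo_missing)
```

Whether the script and requirements belong inside the archive depends on the serving container, so treat the list of required names as something to adapt.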

## 🧾 Validation Report & (Optional) Registration

If **gates pass**, **artifacts are OK**, and the **candidate beats the champion** (or no champion metric is available), we can register to the **Model Registry**. Registration is disabled by default; set `CONFIG['registry']['register_if_pass']=True` to enable.

In [None]:
# Assemble validation report
validation_report = {
    "run_id": RUN_ID,
    "timestamp_utc": RUN_TS,
    "quality_report": quality_report,
    "cv_results": cv_results,
    "champion": champion,
    "promotion_decision": promotion,
    "artifact_report": artifact_report
}

with open(CONFIG["paths"]["validation_report_path"], "w") as f:
    json.dump(validation_report, f, indent=2)

print("Saved report:", CONFIG["paths"]["validation_report_path"])

should_register = CONFIG["registry"]["register_if_pass"] and promotion["promote"] and artifact_report["artifact_ok"]

print("\nDecision:")
print(" Gates pass:      ", quality_report["quality_gates_pass"])
print(" Artifact ready:  ", artifact_report["artifact_ok"])
print(" Promotable:      ", promotion["promote"])  # also True when no champion metric exists
print(" Will register:   ", should_register)

### 🧩 Register candidate to SageMaker Model Registry (only if allowed by config)

> This section uploads artifacts to S3 (under your prefix) and creates a **Model Package** entry in your **Model Package Group**, setting `CustomerMetadataProperties` for quick metric comparisons. Safe to run: guarded by `should_register` and AWS try/except.

In [None]:
if should_register:
    try:
        import boto3, sagemaker  # unused imports removed so a missing s3fs cannot break registration
        sm = boto3.client("sagemaker")
        s3 = boto3.client("s3")

        # Parse S3 prefix
        s3_prefix = CONFIG["registry"]["s3_prefix"]
        if not s3_prefix.startswith("s3://"):
            raise ValueError("s3_prefix must be an s3:// URL")
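        s3_bucket, _, s3_key_prefix = s3_prefix[len("s3://"):].partition("/")
        s3_key_prefix = s3_key_prefix.strip("/")
        if not s3_bucket or not s3_key_prefix:
            # e.g. SM_DEFAULT_BUCKET was empty, which yields "s3:///model-validation/..."
            raise ValueError("s3_prefix must include a bucket and a key prefix, e.g. s3://bucket/path")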

        # Upload artifacts
        uploads = {}
        for key, local in CONFIG["registry"]["candidate"].items():
            if key == "validation_report":
                continue  # uploaded separately after this loop
            if Path(local).exists():
                dest_key = f"{s3_key_prefix}/{Path(local).name}"
                s3.upload_file(str(local), s3_bucket, dest_key)
                uploads[key] = f"s3://{s3_bucket}/{dest_key}"
        # Also upload cv summary
        cv_key = f"{s3_key_prefix}/cv_summary.json"
        s3.upload_file(CONFIG["paths"]["cv_summary_path"], s3_bucket, cv_key)
        uploads["cv_summary"] = f"s3://{s3_bucket}/{cv_key}"
        # Upload validation report
        vr_key = f"{s3_key_prefix}/validation_report.json"
        s3.upload_file(CONFIG["paths"]["validation_report_path"], s3_bucket, vr_key)
        uploads["validation_report"] = f"s3://{s3_bucket}/{vr_key}"

        # Build ModelMetrics (optional pointers to S3 JSONs)
        model_metrics = {
            "ModelQuality": {
                "Statistics": {"S3Uri": uploads["cv_summary"], "ContentType": "application/json"},
                "Constraints": {"S3Uri": uploads["validation_report"], "ContentType": "application/json"}
            }
        }

        # Inference spec (script mode or pre-built container). Set CONTAINER_IMAGE to the image that
        # matches your model's framework; the PyTorch CPU image below is only a placeholder default.
        image_uri = CONFIG["registry"]["container_image_uri"] or sagemaker.image_uris.retrieve(
            "pytorch", boto3.Session().region_name, version="2.0",
            image_scope="inference",
            instance_type="ml.m5.large",  # lets the SDK choose between the CPU and GPU image variants
        )
        primary_container = {
            "Image": image_uri,
            "ModelDataUrl": uploads.get("model_tar_path"),
            "Environment": {
                "SAGEMAKER_PROGRAM": Path(CONFIG["registry"]["candidate"]["inference_script"]).name,
                # point at the uploaded archive in S3, not the local file path
                "SAGEMAKER_SUBMIT_DIRECTORY": uploads["model_tar_path"],
                "SAGEMAKER_REQUIREMENTS": Path(CONFIG["registry"]["candidate"]["requirements"]).name,
            }
        }

        # Ensure group exists (idempotent)
        try:
            sm.describe_model_package_group(ModelPackageGroupName=CONFIG["registry"]["package_group"])
        except sm.exceptions.ClientError:
            sm.create_model_package_group(
                ModelPackageGroupName=CONFIG["registry"]["package_group"],
                ModelPackageGroupDescription="Model group created by validation notebook"
            )

        # Create Model Package
        resp = sm.create_model_package(
            ModelPackageGroupName=CONFIG["registry"]["package_group"],
            ModelPackageDescription="Registered via validation notebook",
            InferenceSpecification={
                "Containers": [primary_container],
                "SupportedContentTypes": ["application/json", "text/csv"],
                "SupportedResponseMIMETypes": ["application/json"]
            },
            ModelMetrics=model_metrics,
            ModelApprovalStatus=CONFIG["registry"]["model_approval_status"],
            CustomerMetadataProperties={
                "mean_recall_at_target": str(cv_results["mean_recall_at_target"]),
                "std_recall_at_target": str(cv_results["std_recall_at_target"]),
                "mean_pr_auc": str(cv_results["mean_pr_auc"]),
                "mean_roc_auc": str(cv_results["mean_roc_auc"]),
                "validation_report": uploads["validation_report"],
            }
        )
        print("✅ Registered model package:", resp["ModelPackageArn"])
    except Exception as e:
        print("❌ Registration failed:", e)
else:
    print("Skipping registration (conditions not met or disabled).")
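
The default `s3_prefix` is built from `SM_DEFAULT_BUCKET`, which is empty when the bootstrap cell ran without a SageMaker context, giving a URI like `s3:///model-validation/...`. A small sketch of stricter URI handling with the standard library is shown below; `split_s3_uri` and `join_key` are hypothetical helpers and the bucket names are placeholders.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split an s3:// URI into (bucket, key_prefix); the prefix may be empty."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://bucket[/prefix], got {uri!r}")
    return parsed.netloc, parsed.path.strip("/")


def join_key(key_prefix, filename):
    """Build an object key without a leading slash when the prefix is empty."""
    return f"{key_prefix}/{filename}" if key_prefix else filename


bucket, prefix = split_s3_uri("s3://example-bucket/model-validation/run-1/")
print(bucket, prefix, join_key(prefix, "cv_summary.json"))

try:
    split_s3_uri("s3:///model-validation/run-1")  # what an empty default bucket produces
except ValueError as err:
    print("rejected:", err)
```

Failing early on a malformed prefix gives a clearer message than the parameter validation error that an empty bucket name would trigger at upload time.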

## 📌 Summary

- **Quality gates:** evaluated and recorded in `validation_report.json`  
- **Champion comparison:** used the latest **Approved** package in the **Model Package Group** (if any)  
- **Artifacts check:** ensures `model.tar.gz`, `inference.py`, `requirements.txt`, and `feature_schema.json` exist before registration  
- **Registration:** guarded by gates + champion comparison + artifacts presence

> Tip: wire this notebook into your CI/CD as a **pre‑promotion** job. Keep thresholds versioned with infra-as-code (e.g., in your pipeline repo).
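
As one way to act on that tip, a CI step could turn `validation_report.json` into a process exit code. The sketch below assumes the report layout written by this notebook (`quality_report`, `promotion_decision`, `artifact_report`); `ci_exit_code` is a hypothetical helper and the sample report is hand-written for illustration.

```python
def ci_exit_code(report):
    """Return 0 when the candidate is promotable and its artifacts are present, else 1."""
    gates_ok = report["quality_report"]["quality_gates_pass"]
    promote = report["promotion_decision"]["promote"]
    artifacts_ok = report["artifact_report"]["artifact_ok"]
    return 0 if (gates_ok and promote and artifacts_ok) else 1


# Hand-written sample with the same top-level keys as validation_report.json
sample_report = {
    "quality_report": {"quality_gates_pass": True},
    "promotion_decision": {"promote": True, "reason": "illustrative"},
    "artifact_report": {"artifact_ok": False, "missing": {"requirements": "requirements.txt"}},
}
print(ci_exit_code(sample_report))

# In a pipeline step, something along these lines would fail the job on a non-zero code:
#   import json, sys
#   from pathlib import Path
#   sys.exit(ci_exit_code(json.loads(Path("validation_report.json").read_text())))
```

Keeping the decision in the report and the exit code in a thin wrapper means the pipeline never re-implements the gate logic.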