# 🧪 Model Selection (Choose Your Data & Models)

**Goal:** Build a *reproducible*, **config‑driven** model selection pipeline. You will choose the **data source**, define the **target**, and select **models** to compare. The notebook tunes each model's decision threshold to reach the **recall target** (default 0.80), then ranks models by the recall achieved at that threshold, using PR‑AUC and ROC‑AUC as tie‑breakers.

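> This template runs on **SageMaker** or locally. You can load data from **CSV/Parquet (local/S3)**, **Postgres**, **Redshift**, or a custom `data_io.load_data()` you provide. The code is defensive: if a model's library is missing or its training raises an error, that model is skipped with a warning, and if the configured data source fails to load, the notebook falls back to a synthetic demo dataset. Check the printed warnings before trusting the results.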


## ✅ What You’ll Do
1) Set up reproducibility and a run folder for artifacts  
2) Choose a data source (CSV/Parquet/S3, Postgres, Redshift, or `data_io.py`)  
3) Define target column and (optional) time column  
4) Perform a **deterministic** train/validation split (time‑based preferred)  
5) Train a **model zoo** you choose (e.g., CatBoost, LightGBM, XGBoost, Logistic Regression)  
6) Tune decision thresholds to reach the **recall target** and compare models  
7) Export artifacts: `metrics.json`, `thresholds.json`, and the winning model  
8) (Optional) Log the run to **MLflow**


## 📝 Student Checklist
- [ ] Pick a data source and update the **CONFIG** cell
- [ ] Set the **target column** (classification: 0/1)
- [ ] (Optional) Set a **time column** for a robust time‑based split
- [ ] Choose which **models** to enable in `CONFIG['models']['enabled']`
- [ ] (Optional) Add your own model to the registry block
- [ ] Run the notebook end‑to‑end and inspect metrics
- [ ] Explain which model won and why


## 🧰 Prerequisites
- Python 3.9+
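- Packages: `pandas`, `numpy`, `pyarrow`, `scikit-learn`, `sqlalchemy`, `s3fs`,
  and optionally `catboost`, `lightgbm`, `xgboost`, `mlflow`
- Database sources only: the `postgresql+psycopg` URL needs the `psycopg` (v3) driver, and the `redshift+redshift_connector` URL needs `redshift_connector` plus the `sqlalchemy-redshift` dialect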

```python
# If running on a fresh environment (uncomment as needed):
# %pip install pandas numpy pyarrow scikit-learn sqlalchemy s3fs
# %pip install catboost lightgbm xgboost mlflow
# %pip install "psycopg[binary]" redshift_connector sqlalchemy-redshift  # only for the Postgres/Redshift sources
```


## ♻️ Reproducibility & Run Folder

In [None]:
import os, sys, json, hashlib, platform, random
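from datetime import datetime, timezone
import numpy as np
import pandas as pd

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

# datetime.utcnow() is deprecated since Python 3.12; use an explicit UTC-aware timestamp
RUN_TS = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")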
RUN_ID = hashlib.sha1(f"{RUN_TS}-{SEED}".encode()).hexdigest()[:10]
ARTIFACT_DIR = os.environ.get("ARTIFACT_DIR", f"artifacts/run_{RUN_TS}_{RUN_ID}")
os.makedirs(ARTIFACT_DIR, exist_ok=True)

env_info = {
    "python": sys.version,
    "platform": platform.platform(),
    "timestamp_utc": RUN_TS,
    "seed": SEED,
    "packages": {"pandas": pd.__version__, "numpy": np.__version__},
}
with open(os.path.join(ARTIFACT_DIR, "env_info.json"), "w") as f:
    json.dump(env_info, f, indent=2)
env_info
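
The `env_info` dictionary above records only the pandas and NumPy versions. A minimal sketch of also recording whichever optional libraries happen to be installed, using the standard library's `importlib.metadata`; the helper name `installed_versions` and the package list are illustrative assumptions, not part of the notebook.

```python
from importlib import metadata

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed; the matching model would be skipped
    return versions

# Hypothetical list of optional dependencies to record alongside env_info
optional = ["scikit-learn", "catboost", "lightgbm", "xgboost", "mlflow"]
optional_versions = installed_versions(optional)
print(optional_versions)
```

The resulting dictionary could be merged into `env_info["packages"]` before the JSON file is written, so that each run folder documents the library versions behind its metrics.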

## ⚙️ CONFIG — ✏️ Edit this cell

In [None]:
from pathlib import Path

CONFIG = {
    "data": {
        # Choose one: 'csv' | 'parquet' | 's3-parquet' | 'postgres' | 'redshift' | 'data_io' | 'synthetic'
        "source": os.environ.get("SOURCE", "synthetic"),
        # Local paths
        "csv_path": "./your_data.csv",     # used if source == 'csv'
        "parquet_path": "./your_data.parquet",  # used if source == 'parquet'
        # S3 (requires s3fs creds)
        "s3_parquet_uri": "s3://bucket/path/to/*.parquet",
        # SQL query and connection details
        "sql": "SELECT * FROM public.final_feature_snapshot",
        "pg": {"user": os.getenv("PGUSER","postgres"), "password": os.getenv("PGPASSWORD","postgres"), "host": os.getenv("PGHOST","localhost"), "port": os.getenv("PGPORT","5432"), "db": os.getenv("PGDATABASE","testdb")},
        "redshift": {"host": os.getenv("REDSHIFT_HOST","example.redshift.amazonaws.com"), "database": os.getenv("REDSHIFT_DB","dev"), "user": os.getenv("REDSHIFT_USER","user"), "password": os.getenv("REDSHIFT_PASSWORD","password"), "port": int(os.getenv("REDSHIFT_PORT","5439"))},
        "row_limit": int(os.environ.get("ROW_LIMIT","0")) or None,
    },
    "columns": {
        "target": "churn",                   # <- ✏️ set your binary target name (0/1)
        "primary_key": "codigocontaservico",  # optional but recommended
        "time_col": "iddim_date_inicio",      # optional, for time-based split
        # known id/leakage cols (edit/remove as needed)
        "drop_cols": [
            "idconsumo","id_contaservico","codigocontaservico","idconta","iddim_cliente",
            "idcliente","codigocliente","iddim_conta","codigoconta",
            "idgrupo_dim_contadimensao","iddim_contaservico_dth","idcontaservico"
        ],
    },
    "split": {
        "test_size": 0.2,
        "recall_target": 0.80,  # <- ✏️ recall goal for selection
        "prefer_time_split": True,
    },
    "models": {
        # Enable/disable models here by name
        "enabled": ["catboost", "lightgbm", "xgboost", "logreg"],
        "catboost": {"iterations": 2000, "early_stopping_rounds": 200, "depth": 6, "learning_rate": 0.08, "l2_leaf_reg": 3.0, "task_type": os.environ.get("CAT_TASK_TYPE","CPU"), "auto_class_weights": "Balanced", "verbose": 200},
        "lightgbm": {"n_estimators": 1500, "learning_rate": 0.05, "num_leaves": 64, "min_child_samples": 40, "subsample": 0.8, "colsample_bytree": 0.8, "reg_lambda": 1.0, "random_state": 42},
        "xgboost":  {"n_estimators": 1500, "learning_rate": 0.05, "max_depth": 6, "subsample": 0.8, "colsample_bytree": 0.8, "reg_lambda": 1.0, "random_state": 42, "eval_metric": "auc"},
        "logreg":   {"max_iter": 1000, "class_weight": "balanced", "solver": "liblinear", "random_state": 42},
    },
    "output": {
        "artifact_dir": ARTIFACT_DIR,
        "metrics_path": str(Path(ARTIFACT_DIR)/"metrics.json"),
        "thresholds_path": str(Path(ARTIFACT_DIR)/"thresholds.json"),
        "model_dir": str(Path(ARTIFACT_DIR)/"models"),
    },
    "mlflow": {"enabled": False, "tracking_uri": os.getenv("MLFLOW_TRACKING_URI",""), "experiment_name": "student-model-selection"}
}
CONFIG
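
Because every later cell reads from `CONFIG`, a quick sanity check right after editing it can catch typos early. The sketch below is one possible validator, assuming the key layout shown above; `validate_config` and `ALLOWED_SOURCES` are hypothetical names introduced for illustration.

```python
ALLOWED_SOURCES = {"csv", "parquet", "s3-parquet", "postgres", "redshift", "data_io", "synthetic"}

def validate_config(cfg):
    """Return a list of human-readable problems; an empty list means the config looks sane."""
    problems = []
    if cfg["data"]["source"] not in ALLOWED_SOURCES:
        problems.append(f"unknown data source: {cfg['data']['source']!r}")
    if not cfg["columns"].get("target"):
        problems.append("columns.target must be set")
    if not 0.0 < cfg["split"]["test_size"] < 1.0:
        problems.append("split.test_size must be strictly between 0 and 1")
    if not 0.0 < cfg["split"]["recall_target"] <= 1.0:
        problems.append("split.recall_target must be in (0, 1]")
    enabled = cfg["models"]["enabled"]
    if not enabled or not all(isinstance(name, str) for name in enabled):
        problems.append("models.enabled must be a non-empty list of model names")
    return problems

# Two tiny example configs (only the keys the validator reads)
good_cfg = {
    "data": {"source": "synthetic"},
    "columns": {"target": "churn"},
    "split": {"test_size": 0.2, "recall_target": 0.80},
    "models": {"enabled": ["logreg"]},
}
bad_cfg = {
    "data": {"source": "excel"},
    "columns": {"target": "churn"},
    "split": {"test_size": 1.5, "recall_target": 0.80},
    "models": {"enabled": ["logreg"]},
}
print(validate_config(good_cfg))
print(validate_config(bad_cfg))
```

In the notebook this could be called as `validate_config(CONFIG)`, raising an error if the returned list is non-empty.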

## 📥 Load Your Data

In [None]:
load_data = None
try:
    from data_io import load_data  # optional, if you provide one
except Exception as e:
    print("ℹ️ No data_io.load_data found (that's fine):", repr(e))

def _demo_dataset(n=12000, seed=SEED):
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "codigocontaservico": np.arange(1, n+1),
        "iddim_date_inicio": pd.to_datetime("2023-01-01") + pd.to_timedelta(rng.integers(0, 650, size=n), unit="D"),
        "tenure_days": rng.integers(30, 800, size=n),
        "expiry_month": rng.integers(1, 13, size=n),
        "expiry_dow": rng.integers(0, 7, size=n),
        "tipo_produto_atual": rng.choice(["normal","premium"], size=n, p=[0.85,0.15]),
        "topup_total_value": rng.gamma(2.0, 30.0, size=n).round(2),
        "municipio": rng.choice(["luanda","lubango","viana"], size=n),
        "past_churns": rng.poisson(0.2, size=n),
        "n_prev_contracts": rng.integers(0, 5, size=n),
    })
    logits = -1.1 + 0.0025*df["tenure_days"] + 0.5*(df["tipo_produto_atual"]=="premium").astype(int) - 0.0007*df["topup_total_value"] + 0.35*df["past_churns"]
    p = 1/(1+np.exp(-logits))
    df["churn"] = (rng.random(size=n) < p).astype(int)
    return df

source = CONFIG['data']['source']
df_raw = None

try:
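    if source == 'data_io':
        if load_data is None:
            raise ImportError("source='data_io' requires a data_io.py that defines load_data()")
        # Assumes your load_data accepts these keyword arguments; adapt the call to your own signature
        df_raw = load_data(source='parquet', uri=CONFIG['data'].get('parquet_path') or CONFIG['data'].get('s3_parquet_uri'), sql=CONFIG['data']['sql'], redshift_kwargs=CONFIG['data']['redshift'])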
    elif source == 'csv':
        df_raw = pd.read_csv(CONFIG['data']['csv_path'])
    elif source == 'parquet':
        df_raw = pd.read_parquet(CONFIG['data']['parquet_path'])
    elif source == 's3-parquet':
        df_raw = pd.read_parquet(CONFIG['data']['s3_parquet_uri'])
    elif source == 'postgres':
        # Needs the psycopg (v3) driver; URL.create escapes special characters in credentials
        from sqlalchemy import create_engine, text
        from sqlalchemy.engine import URL
        pg = CONFIG['data']['pg']
        url = URL.create("postgresql+psycopg", username=pg['user'], password=pg['password'], host=pg['host'], port=int(pg['port']), database=pg['db'])
        engine = create_engine(url)
        with engine.connect() as conn:  # read-only query, no transaction block needed
            df_raw = pd.read_sql_query(text(CONFIG['data']['sql']), conn)
    elif source == 'redshift':
        # The redshift+redshift_connector dialect comes from the sqlalchemy-redshift package
        from sqlalchemy import create_engine, text
        from sqlalchemy.engine import URL
        rs = CONFIG['data']['redshift']
        url = URL.create("redshift+redshift_connector", username=rs['user'], password=rs['password'], host=rs['host'], port=int(rs['port']), database=rs['database'])
        engine = create_engine(url)
        with engine.connect() as conn:
            df_raw = pd.read_sql_query(text(CONFIG['data']['sql']), conn)
    elif source == 'synthetic':
        df_raw = _demo_dataset()
    else:
        raise ValueError("Unsupported source. Edit CONFIG.")
except Exception as e:
    print("⚠️ Failed to load from configured source:", repr(e))
    print("Using synthetic demo dataset …")
    df_raw = _demo_dataset()

row_limit = CONFIG['data']['row_limit']
if row_limit and len(df_raw) > row_limit:
    pk = CONFIG['columns']['primary_key']
    if pk in df_raw.columns:
        df_raw = df_raw.sort_values(pk).head(row_limit).reset_index(drop=True)
    else:
        df_raw = df_raw.sample(n=row_limit, random_state=SEED).reset_index(drop=True)

df_raw.head(3), df_raw.shape
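
The loading cell calls `load_data(source=..., uri=..., sql=..., redshift_kwargs=...)` when `source == 'data_io'`, but the notebook does not show what that function looks like. A minimal sketch of a compatible `data_io.py`, assuming only file-based loading is needed; everything in it beyond the keyword names used in the call above is hypothetical.

```python
# data_io.py (hypothetical): a minimal loader matching the keyword arguments the notebook passes
import os
import tempfile

import pandas as pd

def load_data(source="parquet", uri=None, sql=None, redshift_kwargs=None):
    """Load a DataFrame from a file; `sql` and `redshift_kwargs` are accepted but unused here."""
    if uri is None:
        raise ValueError("uri is required for file-based sources")
    if source == "csv":
        return pd.read_csv(uri)
    if source == "parquet":
        return pd.read_parquet(uri)
    raise ValueError(f"unsupported source: {source!r}")

# Tiny round-trip demo with a temporary CSV file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    pd.DataFrame({"churn": [0, 1], "tenure_days": [30, 400]}).to_csv(path, index=False)
    demo = load_data(source="csv", uri=path)
print(demo.shape)  # (2, 2)
```

Note that the notebook's call always passes `source='parquet'`, so a custom loader that needs another format has to decide that from the `uri` or from its own settings.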

## 🔎 Quick Profile

In [None]:
pd.DataFrame({
    "column": df_raw.columns,
    "dtype": df_raw.dtypes.astype(str).values,
    "nulls": [df_raw[c].isna().sum() for c in df_raw.columns],
    "non_nulls": [df_raw[c].notna().sum() for c in df_raw.columns],
})

## 🧼 Minimal Cleaning (extend as needed)

In [None]:
target_col = CONFIG['columns']['target']
primary_key = CONFIG['columns']['primary_key']
time_col = CONFIG['columns']['time_col']
drop_cols = list(set(CONFIG['columns']['drop_cols'] + [target_col]))

df = df_raw.copy()

# Convert timedeltas to days if any
for c in df.columns:
    if str(df[c].dtype).startswith('timedelta'):
        df[c] = df[c].dt.total_seconds() / 86400

# Ensure datetime for time split
if time_col in df.columns:
    df[time_col] = pd.to_datetime(df[time_col], errors='coerce')

# Fill missing categoricals with a token
cat_cols = [c for c in df.columns if df[c].dtype=='object' or str(df[c].dtype)=='category']
for c in cat_cols:
    df[c] = df[c].astype(object).where(~pd.isna(df[c]), '<MISSING>')

df.head(3)
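
The split cell later casts the target with `astype(int)`, which fails on missing values and silently accepts labels other than 0/1. A small pre-check along the following lines may help; `check_binary_target` is a hypothetical helper, not something the notebook defines.

```python
import numpy as np

def check_binary_target(values):
    """Validate a 0/1 target and return its positive rate."""
    arr = np.asarray(values, dtype=float)
    if np.isnan(arr).any():
        raise ValueError("target contains missing values; drop or impute those rows first")
    uniq = set(np.unique(arr).tolist())
    if not uniq <= {0.0, 1.0}:
        raise ValueError(f"target must be 0/1, found values: {sorted(uniq)}")
    if len(uniq) < 2:
        raise ValueError("target has a single class; ROC-AUC and PR-AUC are undefined")
    return float(arr.mean())

positive_rate = check_binary_target([0, 1, 0, 0, 1])
print(positive_rate)  # 0.4
```

In the notebook this could be called as `check_binary_target(df[target_col])`; the returned positive rate is also a useful baseline when reading PR‑AUC.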

## ✂️ Deterministic Split (Time‑based preferred)

In [None]:
from sklearn.model_selection import StratifiedShuffleSplit

y = df[target_col].astype(int).values
X = df.drop(columns=[c for c in drop_cols if c in df.columns], errors='ignore')

def build_split_masks(df, y, time_col, prefer_time=True):
    test_size = CONFIG['split']['test_size']
    if prefer_time and time_col in df.columns:
        dates = pd.to_datetime(df[time_col], errors='coerce')
        if dates.notna().mean() > 0.8:
            # Derive the cutoff from CONFIG['split']['test_size'] rather than a hard-coded 0.8
            cutoff = dates.quantile(1 - test_size)
            print('cutoff_date:', cutoff)
            # Rows with a missing date compare False here, so they land in the validation set
            train_mask = dates < cutoff
            valid_mask = ~train_mask
            return train_mask, valid_mask
    # Fallback: seeded stratified split (uses the df argument, not the global X)
    splitter = StratifiedShuffleSplit(n_splits=1, test_size=test_size, random_state=SEED)
    train_idx, valid_idx = next(splitter.split(df, y))
    train_mask = pd.Series(False, index=df.index); train_mask.iloc[train_idx] = True
    valid_mask = ~train_mask
    return train_mask, valid_mask

train_mask, valid_mask = build_split_masks(df, y, time_col, CONFIG['split']['prefer_time_split'])
sum(train_mask), sum(valid_mask)
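
To make the time-based rule concrete, here is a small NumPy-only sketch of the same idea on ten made-up dates: the oldest 80% of rows train the model and the most recent 20% validate it. The function name `time_split_masks` is hypothetical, and the sketch assumes there are no missing dates.

```python
import numpy as np

def time_split_masks(dates, test_size=0.2):
    """Boolean train/valid masks where every training date precedes the cutoff."""
    dates = np.asarray(dates, dtype="datetime64[D]")
    ordered = np.sort(dates)
    n_train = int(round(len(ordered) * (1 - test_size)))
    n_train = min(max(n_train, 1), len(ordered) - 1)  # keep both sides non-empty
    cutoff = ordered[n_train]                          # first date assigned to validation
    train_mask = dates < cutoff
    return train_mask, ~train_mask, cutoff

# Made-up dates, deliberately unsorted
dates = np.array(
    ["2023-01-01", "2023-03-15", "2023-02-10", "2023-06-01", "2023-05-20",
     "2023-04-02", "2023-07-09", "2023-01-20", "2023-08-30", "2023-09-12"],
    dtype="datetime64[D]",
)
train_mask, valid_mask, cutoff = time_split_masks(dates, test_size=0.2)
print(cutoff, int(train_mask.sum()), int(valid_mask.sum()))
```

Rows that share the cutoff date all go to validation, so the realised validation share can be slightly larger than `test_size` when dates repeat.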

## 🤖 Model Zoo (You Choose)

In [None]:
from sklearn.metrics import roc_auc_score, average_precision_score, precision_recall_curve, confusion_matrix, classification_report
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

def best_f1_threshold(y_true, y_scores):
    p, r, t = precision_recall_curve(y_true, y_scores)
    f1 = (2*p[:-1]*r[:-1])/(p[:-1]+r[:-1]+1e-12)
    i = int(np.argmax(f1))
    return float(t[i]), float(f1[i]), float(p[i]), float(r[i])

def recall_target_threshold(y_true, y_scores, target=0.80):
    p, r, t = precision_recall_curve(y_true, y_scores)
    idx = np.where(r[:-1] >= target)[0]
    if len(idx):
        i = int(idx[-1])
        return float(t[i]), float(p[i]), float(r[i])
    return 0.0, float(p[0]), float(r[0])

def prep_tree_inputs(X):
    X2 = X.copy()
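    # drop datetime columns for tree libs; is_datetime64_any_dtype also handles
    # tz-aware and other pandas extension dtypes that np.issubdtype cannot interpret
    dt_cols = [c for c in X2.columns if pd.api.types.is_datetime64_any_dtype(X2[c])]
    if dt_cols: X2 = X2.drop(columns=dt_cols)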
    for c in X2.columns:
        if X2[c].dtype=='object' or str(X2[c].dtype)=='category':
            X2[c] = X2[c].astype(str)
    return X2

X2 = prep_tree_inputs(X)
X_train, X_valid = X2.loc[train_mask], X2.loc[valid_mask]
y_train, y_valid = y[train_mask], y[valid_mask]

enabled = set(CONFIG['models']['enabled'])
results, thresholds, models = [], {}, {}

# --- CatBoost ---
if 'catboost' in enabled:
    try:
        from catboost import CatBoostClassifier, Pool
        cat_cols = [c for c in X_train.columns if X_train[c].dtype=='object']
        cat_idx = X_train.columns.get_indexer(cat_cols).tolist()
        tr_pool = Pool(X_train, y_train, cat_features=cat_idx)
        va_pool = Pool(X_valid, y_valid, cat_features=cat_idx)
        p = CONFIG['models']['catboost']
        m = CatBoostClassifier(iterations=p['iterations'], learning_rate=p['learning_rate'], depth=p['depth'], l2_leaf_reg=p['l2_leaf_reg'], loss_function='Logloss', eval_metric='AUC', random_seed=SEED, auto_class_weights=p.get('auto_class_weights','Balanced'), task_type=p.get('task_type','CPU'), verbose=p.get('verbose',200))
        m.fit(tr_pool, eval_set=va_pool, use_best_model=True, early_stopping_rounds=p['early_stopping_rounds'])
        proba = m.predict_proba(va_pool)[:,1]
        roc = roc_auc_score(y_valid, proba); pr = average_precision_score(y_valid, proba)
        thr_f1, best_f1, p_f1, r_f1 = best_f1_threshold(y_valid, proba)
        thr_rec, p80, r80 = recall_target_threshold(y_valid, proba, CONFIG['split']['recall_target'])
        results.append({"model":"catboost","roc_auc":roc,"pr_auc":pr,"best_f1":best_f1,"precision_at_best_f1":p_f1,"recall_at_best_f1":r_f1,"precision_at_recall_target":p80,"recall_at_recall_target":r80})
        thresholds['catboost'] = {"best_f1_threshold":thr_f1, "recall_target_threshold":thr_rec}
        models['catboost'] = m
    except Exception as e:
        print("⚠️ Skipping CatBoost:", repr(e))

# --- LightGBM ---
if 'lightgbm' in enabled:
    try:
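        import lightgbm as lgb
        p = CONFIG['models']['lightgbm']
        # LightGBM rejects object columns: encode them as pandas categoricals,
        # using the categories seen in training for both splits
        Xtr_lgb, Xva_lgb = X_train.copy(), X_valid.copy()
        for c in [c for c in Xtr_lgb.columns if Xtr_lgb[c].dtype == 'object']:
            cats = pd.Categorical(Xtr_lgb[c]).categories
            Xtr_lgb[c] = pd.Categorical(Xtr_lgb[c], categories=cats)
            Xva_lgb[c] = pd.Categorical(Xva_lgb[c], categories=cats)
        m = lgb.LGBMClassifier(**p)
        # fit(verbose=...) was removed in LightGBM 4.0; silence evaluation logs with a callback
        m.fit(Xtr_lgb, y_train, eval_set=[(Xva_lgb, y_valid)], eval_metric='auc', callbacks=[lgb.log_evaluation(period=0)])
        proba = m.predict_proba(Xva_lgb)[:,1]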
        roc = roc_auc_score(y_valid, proba); pr = average_precision_score(y_valid, proba)
        thr_f1, best_f1, p_f1, r_f1 = best_f1_threshold(y_valid, proba)
        thr_rec, p80, r80 = recall_target_threshold(y_valid, proba, CONFIG['split']['recall_target'])
        results.append({"model":"lightgbm","roc_auc":roc,"pr_auc":pr,"best_f1":best_f1,"precision_at_best_f1":p_f1,"recall_at_best_f1":r_f1,"precision_at_recall_target":p80,"recall_at_recall_target":r80})
        thresholds['lightgbm'] = {"best_f1_threshold":thr_f1, "recall_target_threshold":thr_rec}
        models['lightgbm'] = m
    except Exception as e:
        print("⚠️ Skipping LightGBM:", repr(e))

# --- XGBoost ---
if 'xgboost' in enabled:
    try:
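        from xgboost import XGBClassifier
        p = CONFIG['models']['xgboost']
        # XGBoost also rejects object columns: pass pandas categoricals and enable its
        # native categorical handling (assumes a recent XGBoost with the 'hist' tree method)
        Xtr_xgb, Xva_xgb = X_train.copy(), X_valid.copy()
        for c in [c for c in Xtr_xgb.columns if Xtr_xgb[c].dtype == 'object']:
            cats = pd.Categorical(Xtr_xgb[c]).categories
            Xtr_xgb[c] = pd.Categorical(Xtr_xgb[c], categories=cats)
            Xva_xgb[c] = pd.Categorical(Xva_xgb[c], categories=cats)
        m = XGBClassifier(**{"tree_method": "hist", "enable_categorical": True, **p})
        m.fit(Xtr_xgb, y_train, eval_set=[(Xva_xgb, y_valid)], verbose=False)
        proba = m.predict_proba(Xva_xgb)[:,1]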
        roc = roc_auc_score(y_valid, proba); pr = average_precision_score(y_valid, proba)
        thr_f1, best_f1, p_f1, r_f1 = best_f1_threshold(y_valid, proba)
        thr_rec, p80, r80 = recall_target_threshold(y_valid, proba, CONFIG['split']['recall_target'])
        results.append({"model":"xgboost","roc_auc":roc,"pr_auc":pr,"best_f1":best_f1,"precision_at_best_f1":p_f1,"recall_at_best_f1":r_f1,"precision_at_recall_target":p80,"recall_at_recall_target":r80})
        thresholds['xgboost'] = {"best_f1_threshold":thr_f1, "recall_target_threshold":thr_rec}
        models['xgboost'] = m
    except Exception as e:
        print("⚠️ Skipping XGBoost:", repr(e))

# --- Logistic Regression ---
if 'logreg' in enabled:
    try:
        from sklearn.linear_model import LogisticRegression
        num_cols = [c for c in X_train.columns if pd.api.types.is_numeric_dtype(X_train[c])]
        cat_cols = [c for c in X_train.columns if c not in num_cols]
        pre = ColumnTransformer([("cat", OneHotEncoder(handle_unknown='ignore'), cat_cols)], remainder='passthrough')
        p = CONFIG['models']['logreg']
        # Pipeline is already imported at the top of this cell
        m = Pipeline([('pre', pre), ('clf', LogisticRegression(**p))])
        m.fit(X_train, y_train)
        proba = m.predict_proba(X_valid)[:,1]
        roc = roc_auc_score(y_valid, proba); pr = average_precision_score(y_valid, proba)
        thr_f1, best_f1, p_f1, r_f1 = best_f1_threshold(y_valid, proba)
        thr_rec, p80, r80 = recall_target_threshold(y_valid, proba, CONFIG['split']['recall_target'])
        results.append({"model":"logreg","roc_auc":roc,"pr_auc":pr,"best_f1":best_f1,"precision_at_best_f1":p_f1,"recall_at_best_f1":r_f1,"precision_at_recall_target":p80,"recall_at_recall_target":r80})
        thresholds['logreg'] = {"best_f1_threshold":thr_f1, "recall_target_threshold":thr_rec}
        models['logreg'] = m
    except Exception as e:
        print("⚠️ Skipping Logistic Regression:", repr(e))

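# Guard against an empty results list (sort_values would raise KeyError on a frame with no columns)
res_preview = pd.DataFrame(results)
if not res_preview.empty:
    res_preview = res_preview.sort_values(["recall_at_recall_target","pr_auc","roc_auc"], ascending=[False, False, False]).reset_index(drop=True)
res_preview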
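
The threshold chosen by `recall_target_threshold` is the highest score cut-off at which recall still meets the target. A NumPy-only sketch on ten made-up scores may make that concrete; `threshold_for_recall` is a hypothetical re-implementation intended to mirror the notebook's rule (predict positive when `score >= threshold`), not the notebook's own function.

```python
import math
import numpy as np

def threshold_for_recall(y_true, scores, target=0.80):
    """Highest threshold whose recall is still >= target, with its precision and recall."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos_scores = np.sort(scores[y_true])[::-1]                # positive-class scores, high to low
    k = max(1, math.ceil(target * len(pos_scores) - 1e-12))   # positives that must be captured
    thr = pos_scores[k - 1]
    pred = scores >= thr
    tp = int((pred & y_true).sum())
    return float(thr), tp / int(pred.sum()), tp / int(y_true.sum())

# Made-up labels and scores: 5 positives, 5 negatives
y_true = [0, 0, 1, 0, 1, 1, 0, 1, 1, 0]
scores = [0.10, 0.20, 0.35, 0.40, 0.50, 0.60, 0.65, 0.70, 0.80, 0.90]
thr, precision, recall = threshold_for_recall(y_true, scores, target=0.80)
print(thr, round(precision, 3), recall)  # 0.5 0.667 0.8
```

Capturing 4 of the 5 positives requires lowering the threshold to 0.50, at which point 6 rows are flagged and 4 of them are true positives. That precision cost is what `precision_at_recall_target` reports for each model.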

## 🏅 Pick Winner by Recall (with tie‑breakers) & Save Artifacts

In [None]:
from pathlib import Path
Path(CONFIG['output']['model_dir']).mkdir(parents=True, exist_ok=True)

res_df = pd.DataFrame(results)
if res_df.empty:
    raise RuntimeError("No models trained. Ensure packages are installed and enabled in CONFIG.")

res_sorted = res_df.sort_values(["recall_at_recall_target","pr_auc","roc_auc"], ascending=[False, False, False]).reset_index(drop=True)
best_name = res_sorted.loc[0, 'model']
best_model = models[best_name]
best_thr = thresholds[best_name]

display(res_sorted)
print("Best model:", best_name)

with open(CONFIG['output']['metrics_path'], 'w') as f:
    json.dump(res_sorted.to_dict(orient='records'), f, indent=2)
with open(CONFIG['output']['thresholds_path'], 'w') as f:
    json.dump(thresholds, f, indent=2)

model_path = None
try:
    if best_name == 'catboost':
        model_path = str(Path(CONFIG['output']['model_dir'])/"catboost_best.cbm")
        best_model.save_model(model_path)
    else:
        import joblib
        model_path = str(Path(CONFIG['output']['model_dir'])/f"{best_name}_best.joblib")
        joblib.dump(best_model, model_path)
except Exception as e:
    print("⚠️ Could not persist model:", repr(e))

{
  'best_model': best_name,
  'model_path': model_path,
  'metrics_json': CONFIG['output']['metrics_path'],
  'thresholds_json': CONFIG['output']['thresholds_path']
}
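
The ranking used above is a lexicographic sort: recall at the target decides first, and PR‑AUC and ROC‑AUC only matter when the earlier keys are exactly equal. The sketch below shows the same ordering with plain Python on made-up numbers; the model names and metric values are purely illustrative, not results.

```python
RANK_KEYS = ("recall_at_recall_target", "pr_auc", "roc_auc")

def rank_models(rows, keys=RANK_KEYS):
    """Sort result rows from best to worst, comparing the keys in order."""
    return sorted(rows, key=lambda r: tuple(r[k] for k in keys), reverse=True)

# Illustrative values only
rows = [
    {"model": "model_a", "recall_at_recall_target": 0.80, "pr_auc": 0.61, "roc_auc": 0.74},
    {"model": "model_b", "recall_at_recall_target": 0.80, "pr_auc": 0.64, "roc_auc": 0.73},
    {"model": "model_c", "recall_at_recall_target": 0.82, "pr_auc": 0.55, "roc_auc": 0.70},
]
ranked = rank_models(rows)
print([r["model"] for r in ranked])  # ['model_c', 'model_b', 'model_a']
```

Here `model_c` wins on recall despite the weakest PR‑AUC, and PR‑AUC only separates `model_a` from `model_b` because their recall values tie. Since each threshold is chosen to just clear the recall target, recall values tend to be close, so it may be worth reading `precision_at_recall_target` alongside the ranking.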

## 📈 (Optional) MLflow Logging

In [None]:
if CONFIG['mlflow']['enabled']:
    import mlflow
    # as_uri() builds a valid file:// URI on every platform (plain string concatenation breaks on Windows)
    mlflow.set_tracking_uri(CONFIG['mlflow']['tracking_uri'] or Path(ARTIFACT_DIR).absolute().as_uri())
    mlflow.set_experiment(CONFIG['mlflow']['experiment_name'])
    with mlflow.start_run(run_name=f"student-model-selection-{RUN_TS}") as run:
        mlflow.log_params({
            'seed': SEED,
            'source': CONFIG['data']['source'],
            'time_col': CONFIG['columns']['time_col'],
            'recall_target': CONFIG['split']['recall_target']
        })
        mlflow.log_artifact(CONFIG['output']['metrics_path'])
        mlflow.log_artifact(CONFIG['output']['thresholds_path'])
        if 'model_path' in locals() and model_path:
            mlflow.log_artifact(model_path)
        print("MLflow run:", run.info.run_id)
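
The cell above logs parameters and artifact files but no numeric metrics, so runs cannot be compared by metric in the MLflow UI. One possible approach is to flatten the per-model rows into a single dictionary and pass it to `mlflow.log_metrics`; the helper `flatten_metrics` and the `<model>.<metric>` naming scheme are assumptions made for this sketch.

```python
def flatten_metrics(rows):
    """Turn per-model result rows into a flat {"<model>.<metric>": value} dict."""
    flat = {}
    for row in rows:
        name = row["model"]
        for key, value in row.items():
            if key != "model" and isinstance(value, (int, float)):
                flat[f"{name}.{key}"] = float(value)
    return flat

rows = [{"model": "logreg", "roc_auc": 0.71, "pr_auc": 0.58}]  # made-up numbers
flat = flatten_metrics(rows)
print(flat)  # {'logreg.roc_auc': 0.71, 'logreg.pr_auc': 0.58}

# Inside the `with mlflow.start_run(...)` block one could then call:
# mlflow.log_metrics(flatten_metrics(res_sorted.to_dict(orient="records")))
```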

## 🧩 Add Your Own Model
1. Install the library (if needed)
2. Create it inside the **Model Zoo** cell:
```python
if 'my_model' in enabled:
    try:
        from mylib import MyModel
        m = MyModel(**your_params)
        m.fit(X_train, y_train)
        proba = m.predict_proba(X_valid)[:,1]
        # compute metrics as above, append to results, thresholds, models
    except Exception as e:
        print('⚠️ Skipping my_model:', repr(e))
```
3. Add `'my_model'` to `CONFIG['models']['enabled']`
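
The "compute metrics as above" step repeats the same dozen lines for every model. A sketch of a helper that builds the metrics row and thresholds entry in one call, assuming scikit-learn is installed; `summarize_scores` is a hypothetical name, and the demo labels and scores are made up.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve, roc_auc_score

def summarize_scores(name, y_true, proba, recall_target=0.80):
    """Build the metrics row and thresholds entry the notebook stores per model."""
    p, r, t = precision_recall_curve(y_true, proba)
    f1 = 2 * p[:-1] * r[:-1] / (p[:-1] + r[:-1] + 1e-12)
    i_f1 = int(np.argmax(f1))
    ok = np.where(r[:-1] >= recall_target)[0]
    i_rec = int(ok[-1]) if len(ok) else 0  # highest threshold that still meets the target
    row = {
        "model": name,
        "roc_auc": float(roc_auc_score(y_true, proba)),
        "pr_auc": float(average_precision_score(y_true, proba)),
        "best_f1": float(f1[i_f1]),
        "precision_at_best_f1": float(p[i_f1]),
        "recall_at_best_f1": float(r[i_f1]),
        "precision_at_recall_target": float(p[i_rec]),
        "recall_at_recall_target": float(r[i_rec]),
    }
    thr = {"best_f1_threshold": float(t[i_f1]), "recall_target_threshold": float(t[i_rec])}
    return row, thr

# Made-up validation labels and predicted probabilities
y_demo = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])
proba_demo = np.array([0.10, 0.20, 0.35, 0.40, 0.50, 0.60, 0.65, 0.70, 0.80, 0.90])
row, thr = summarize_scores("my_model", y_demo, proba_demo)
print(row["recall_at_recall_target"], thr["recall_target_threshold"])

# In the Model Zoo cell this would replace the repeated lines:
# row, thr = summarize_scores("my_model", y_valid, proba, CONFIG["split"]["recall_target"])
# results.append(row); thresholds["my_model"] = thr; models["my_model"] = m
```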


## 🧠 Reflection (Short Answer)
- Which model won by **Recall at target**? Provide the metrics table and explain any trade‑offs in **precision**.
- Did time‑based splitting change your results compared with a stratified split? Why might that happen?
- What would you try next to improve recall (feature engineering, thresholds, class weights, cost‑sensitive training)?
