# SEMMA — Model

Train and compare candidate models, then choose one by hold-out AUC and expected profit.
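
AUC only measures how well a model ranks cases, while profit depends on where the contact threshold is set. As a minimal sketch of the profit side, the block below scores a threshold on labelled hold-out data and scans for the most profitable one. The function names, the toy labels and scores, and the revenue and cost figures are all hypothetical placeholders, not values from this project.

```python
from typing import Sequence, Tuple


def campaign_profit(y_true: Sequence[int], scores: Sequence[float], threshold: float,
                    revenue_per_conversion: float, cost_per_contact: float) -> float:
    """Profit from contacting every case whose score is at or above the threshold."""
    profit = 0.0
    for label, score in zip(y_true, scores):
        if score >= threshold:
            profit -= cost_per_contact            # every contact costs money
            if label == 1:
                profit += revenue_per_conversion  # only true positives earn revenue
    return profit


def best_threshold(y_true: Sequence[int], scores: Sequence[float],
                   revenue_per_conversion: float, cost_per_contact: float) -> Tuple[float, float]:
    """Try each observed score as a threshold; return (threshold, profit) of the best one."""
    candidates = sorted(set(scores))
    profits = {t: campaign_profit(y_true, scores, t, revenue_per_conversion, cost_per_contact)
               for t in candidates}
    best = max(profits, key=profits.get)
    return best, profits[best]


# Toy hold-out labels and model scores (hypothetical).
y_true = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

# Assumed economics: 10.0 revenue per conversion, 4.0 cost per contact.
threshold, profit = best_threshold(y_true, scores, revenue_per_conversion=10.0, cost_per_contact=4.0)
print(threshold, profit)  # 0.7 8.0
```

Choosing the threshold on the same data used to report profit is optimistic, so in practice the scan would run on a validation split and the profit would be reported on a separate test split.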


In [None]:
# Logistic / Tree / GBM comparison
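
The comparison this cell points to could look roughly like the sketch below. It uses synthetic data from `make_classification` as a stand-in (an assumption, so the block runs without the project's data loader), and the tree depth and other settings are illustrative defaults rather than tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (assumption): swap in the real features and target.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=5,
                           weights=[0.88, 0.12], random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# One candidate per model family; hyperparameters here are illustrative only.
candidates = {
    'logistic': LogisticRegression(max_iter=1000),
    'tree': DecisionTreeClassifier(max_depth=5, random_state=42),
    'gbm': GradientBoostingClassifier(random_state=42),
}

# Fit each candidate on the same training split and score it on the same hold-out split.
results = {}
for name, clf in candidates.items():
    clf.fit(Xtr, ytr)
    results[name] = roc_auc_score(yte, clf.predict_proba(Xte)[:, 1])

# Report candidates from best to worst hold-out AUC.
for name, auc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f'{name:10s} AUC = {auc:.4f}')
```

Keeping the split and seed identical across candidates makes the AUC values directly comparable. With a single hold-out split, small AUC differences may be noise, so cross-validation is worth considering before picking a winner.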


## TODO (phase gates)

- [ ] Baselines
- [ ] Advanced models
- [ ] Hyperparameters logged
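
For the "Hyperparameters logged" gate, one lightweight option is an append-only JSON Lines file with one record per training run. The sketch below is an assumption about how that could work: `log_run`, `read_runs`, and the `runs.jsonl` filename are hypothetical, and the metric is left empty rather than filled with an invented value.

```python
import json
import tempfile
from pathlib import Path


def log_run(log_path: Path, model_name: str, params: dict, metrics: dict) -> None:
    """Append one run as a single JSON line."""
    record = {'model': model_name, 'params': params, 'metrics': metrics}
    with log_path.open('a', encoding='utf-8') as fh:
        # default=str keeps the log writable even if a parameter is not JSON-serializable.
        fh.write(json.dumps(record, sort_keys=True, default=str) + '\n')


def read_runs(log_path: Path) -> list:
    """Load every logged run back as a list of dicts."""
    with log_path.open(encoding='utf-8') as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Write to a temporary directory here; a real project would pick a tracked location.
log_file = Path(tempfile.mkdtemp()) / 'runs.jsonl'
log_run(
    log_file,
    model_name='logreg',
    params={'max_iter': 300, 'test_size': 0.2, 'random_state': 42},
    metrics={'auc': None},  # placeholder: fill in the measured hold-out AUC
)
runs = read_runs(log_file)
print(runs[0]['params'])
```

An append-only file keeps earlier runs intact, which makes it easy to check later which settings produced a saved model.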


## Critic Review Prompt

```
You are Dr. A. Renati, a world-renowned authority on <CRISP-DM|SEMMA|KDD>, author of award-winning books, and a keynote speaker at KDD/Strata.
Goal: ruthlessly critique ONLY the phase just completed, assuming an industry deployment.
Constraints:
- Be specific, actionable, and testable. No platitudes.
- Enforce methodology rigor: required subtasks, artifacts, risks, and acceptance gates.
- Flag data leakage, evaluation pitfalls, and governance/compliance gaps.
- Propose 3–5 experiments that could falsify my current conclusions.
- Rewrite my acceptance criteria to be business-measurable.
Return:
1) Red flags (bulleted, severity-tagged)
2) Missing artifacts (with exact filenames to add)
3) Experiments (name, hypothesis, design, success metric)
4) What to cut/simplify (to ship this week)
5) Final Go/No-Go recommendation for this phase

```

In [None]:
from pathlib import Path
import sys

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Make src/ importable; it is expected in the parent of the current working directory.
sys.path.append(str(Path.cwd().parents[0] / 'src'))
from data_loader import load_bank_data

df = load_bank_data()

# Binary target: 1 where y == 'yes', 0 otherwise.
y = (df['y'] == 'yes').astype(int)
X = df.drop(columns=['y'])

# Treat object-dtype columns as categorical and everything else as numeric.
cat = [c for c in X.columns if X[c].dtype == 'object']
num = [c for c in X.columns if c not in cat]

# Keep imputation and encoding inside the pipeline so they are fit on training data only.
pre = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), num),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('oh', OneHotEncoder(handle_unknown='ignore'))]), cat),
])
model = Pipeline([('pre', pre), ('clf', LogisticRegression(max_iter=300))])

# Stratified 80/20 hold-out split with a fixed seed for reproducibility.
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
model.fit(Xtr, ytr)

# Score the hold-out set using the predicted probability of the positive class.
auc = roc_auc_score(yte, model.predict_proba(Xte)[:, 1])
print('Baseline logistic AUC:', round(auc, 4))

# Persist the whole fitted pipeline (preprocessing plus classifier).
out_dir = Path('../data/processed')
out_dir.mkdir(parents=True, exist_ok=True)
joblib.dump(model, out_dir / 'bank_logreg.pkl')
print('Saved model to ../data/processed/bank_logreg.pkl')
