# SI Opportunity Scoring — SFDR Pref Only (v12)
**Change requested:** remove SFDR opportunity and use **SFDR_PREF** (not SFDR_ACTUAL) as the SFDR confirmation signal.  
**Goal:** rank **IDs with `si_offering = 0`** by likelihood of SI interest.

**Business logic**
- If `MIFID = 0` → score uses **only** `SI_CONSIDERATION`
- If `MIFID = 1` → score = **α · SI + (1−α) · Confirmations**
  - Confirmations = weighted sum of **SFDR_PREF, PAI, Taxonomy** (weights sum to 1)
  - Start with **α=0.80** as a stakeholder-friendly baseline
  - Learn a better α (bounded) for the data-driven weighted rule
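
The gating logic above can be sketched as a small function. This is a minimal illustration rather than the notebook's scoring code: the name `gated_score` and the example inputs are assumptions, and `confirm` is taken to be an already-computed confirmation value in [0, 1].

```python
import numpy as np

def gated_score(mifid, si, confirm, alpha=0.80):
    """Hypothetical helper: SI-only score when MIFID=0, blended score when MIFID=1."""
    mifid = np.asarray(mifid)
    si = np.asarray(si, dtype=float)
    confirm = np.asarray(confirm, dtype=float)
    blended = alpha * si + (1 - alpha) * confirm
    return np.where(mifid == 1, blended, si)

# Two clients with identical inputs; only the MiFID flag differs.
scores = gated_score(mifid=[0, 1], si=[0.5, 0.5], confirm=[1.0, 1.0])
print(scores)
```

The first client keeps the plain SI score (0.5). The second is pulled up by its confirmations: 0.8 · 0.5 + 0.2 · 1.0 = 0.6.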

**Methods compared on the same validation split**
1) Fixed rule (transparent baseline)  
2) Weighted rule (learn confirmation weights + learn α within bounds)  
3) ML (Calibrated Logistic Regression; challenger)

**Last updated:** 2026-02-27

---
## 0) Setup

In [None]:
import numpy as np
import pandas as pd
from dataclasses import dataclass
from pathlib import Path

from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

import matplotlib.pyplot as plt
pd.set_option("display.max_columns", 200)

---
## 1) Load raw data + shuffle rows

We shuffle to avoid ordering artifacts in operational extracts.

In [None]:
DATA_PATH = Path("data.csv")  # <-- change to your real file path

def make_synthetic_data(n=9000, seed=42):
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "ID": rng.integers(1, n//2 + 1, size=n),
        "IO_TYPE": rng.choice(["normal", "zombie"], size=n, p=[0.97, 0.03]),
        "LIFE_CYCLE": rng.choice(["open", "closed"], size=n, p=[0.9, 0.1]),
        "OFFERING_NAME": rng.choice(
            ["Core", "Standard", "ESG Plus", "SI Focus", "Core SI", "Income", "SI Sustainable", None],
            size=n, p=[0.23,0.23,0.14,0.14,0.08,0.07,0.06,0.05]
        ),
        "SI_CONSIDERATION_CD": rng.choice(["S1","S2","S3", None], size=n, p=[0.35,0.35,0.2,0.1]),
        "SFDR_PREF": rng.choice(["F1","F2","F3", None], size=n, p=[0.4,0.35,0.2,0.05]),
        "SFDR_ACTUAL": rng.choice(["F1","F2","F3", None], size=n, p=[0.45,0.35,0.15,0.05]),  # kept as raw, not used in scoring
        "PAI_PREF": rng.choice(["PAI Selected", "Yes", "No", None], size=n, p=[0.25,0.1,0.05,0.6]),
        "MIFID": rng.choice(["Yes","No", None], size=n, p=[0.55,0.4,0.05]),
        "TAXONOMYPREF": rng.choice(["A1","A2","A3", None], size=n, p=[0.5,0.35,0.1,0.05]),
        "GHG": rng.choice(["Yes","No","--", None], size=n, p=[0.25,0.65,0.05,0.05]),
        "Biodiversity": rng.choice(["Yes","No","--", None], size=n, p=[0.2,0.7,0.05,0.05]),
        "Water": rng.choice(["Yes","No","--", None], size=n, p=[0.22,0.68,0.05,0.05]),
        "Waste": rng.choice(["Yes","No","--", None], size=n, p=[0.18,0.72,0.05,0.05]),
        "Social": rng.choice(["Yes","No","--", None], size=n, p=[0.28,0.62,0.05,0.05]),
    })

if DATA_PATH.exists():
    df_raw = pd.read_csv(DATA_PATH)
    print(f"Loaded: {DATA_PATH}  shape={df_raw.shape}")
else:
    df_raw = make_synthetic_data()
    print("DATA_PATH not found; using synthetic demo dataset.")
    print(f"shape={df_raw.shape}")

df_raw = df_raw.sample(frac=1, random_state=42).reset_index(drop=True)
df_raw.head()

---
## 2) Cleaning & filtering (ONLY)

Hygiene step only: filtering + missing normalization. No category-to-number mapping here.

In [None]:
REQUIRED_RAW = [
    "ID","IO_TYPE","LIFE_CYCLE","OFFERING_NAME",
    "SI_CONSIDERATION_CD","SFDR_PREF","PAI_PREF","MIFID","TAXONOMYPREF",
    "GHG","Biodiversity","Water","Waste","Social"
]
missing_cols = [c for c in REQUIRED_RAW if c not in df_raw.columns]
if missing_cols:
    raise ValueError(f"Missing required raw columns: {missing_cols}")

def clean_filter_only(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for c in df.columns:
        if df[c].dtype == "object":
            df[c] = df[c].apply(lambda x: x.strip() if isinstance(x, str) else x)
    df = df.replace({"--": np.nan, "": np.nan})

    before = len(df)
    df = df[df["IO_TYPE"].fillna("").str.lower() != "zombie"]
    after_zombie = len(df)
    df = df[df["LIFE_CYCLE"].fillna("").str.lower() == "open"]
    after_open = len(df)

    df.attrs["cleaning_summary"] = {
        "before": before,
        "after_remove_zombie": after_zombie,
        "after_keep_open": after_open,
        "removed_zombie": before - after_zombie,
        "removed_closed": after_zombie - after_open
    }
    return df

df_clean = clean_filter_only(df_raw)
pd.DataFrame([df_clean.attrs["cleaning_summary"]])

In [None]:
s = df_clean.attrs["cleaning_summary"]
plt.figure(figsize=(8,4))
plt.bar(["Before","After zombie","After open"], [s["before"], s["after_remove_zombie"], s["after_keep_open"]])
plt.title("Cleaning impact: rows remaining after filters")
plt.ylabel("Rows")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

key_cols = ["SI_CONSIDERATION_CD","SFDR_PREF","PAI_PREF","MIFID","TAXONOMYPREF",
            "GHG","Biodiversity","Water","Waste","Social","OFFERING_NAME"]
miss = df_clean[key_cols].isna().mean().sort_values(ascending=False)
display(miss.to_frame("missing_rate").head(12))

---
## 3) Derive proxy label + aggregate to ID-level

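We recommend **clients (IDs)**, not individual rows, so we aggregate to one row per ID: the most frequent value for each categorical field, and `si_offering = 1` if any row of the ID has an SI offering.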
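
As a rough illustration of this aggregation, the toy frame below collapses rows to one row per ID. The data values and the helper name `most_frequent` are made up for the example; only the pattern (most frequent value per field, maximum of the row-level flag) follows the text.

```python
import pandas as pd

rows = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "SFDR_PREF": ["F2", "F2", "F3", None, "F1"],
    "si_offering_row": [0, 1, 0, 0, 0],
})

def most_frequent(s: pd.Series):
    """Most frequent non-missing value; None if the group has no values."""
    s = s.dropna()
    return s.mode().iloc[0] if len(s) else None

per_id = rows.groupby("ID", as_index=False).agg(
    SFDR_PREF=("SFDR_PREF", most_frequent),
    si_offering=("si_offering_row", "max"),
)
print(per_id)
```

Taking the maximum of the row-level flag means an ID counts as an SI holder as soon as one of its rows carries an SI offering.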

In [None]:
df = df_clean.copy()
df["si_offering_row"] = df["OFFERING_NAME"].astype(str).str.contains(r"\bSI\b", case=False, na=False).astype(int)

def mode_or_first(s: pd.Series):
    s2 = s.dropna()
    if len(s2) == 0:
        return np.nan
    m = s2.mode()
    return m.iloc[0] if len(m) else s2.iloc[0]

agg = {
    "OFFERING_NAME": mode_or_first,
    "SI_CONSIDERATION_CD": mode_or_first,
    "SFDR_PREF": mode_or_first,
    "PAI_PREF": mode_or_first,
    "MIFID": mode_or_first,
    "TAXONOMYPREF": mode_or_first,
    "GHG": mode_or_first,
    "Biodiversity": mode_or_first,
    "Water": mode_or_first,
    "Waste": mode_or_first,
    "Social": mode_or_first,
    "si_offering_row": "max",
}
df_id = df.groupby("ID", as_index=False).agg(agg).rename(columns={"si_offering_row":"si_offering"})
df_id["si_offering"] = df_id["si_offering"].astype(int)

sizes = pd.DataFrame({
    "level":["row-level (cleaned)","ID-level"],
    "rows":[len(df), len(df_id)],
    "si_offering_rate":[df["si_offering_row"].mean(), df_id["si_offering"].mean()]
})
display(sizes)

---
## 4) Feature mappings (simple stakeholder table)

**SFDR change:** we use `SFDR_PREF` only (no SFDR opportunity; no SFDR_ACTUAL).

In [None]:
feature_map = pd.DataFrame([
    ["SI_CONSIDERATION_CD", "S1/S2/S3", "si_norm", "S1=0, S2=0.5, S3=1"],
    ["MIFID", "Yes/No", "MIFID", "Yes=1 else 0"],
    ["SFDR_PREF", "F1/F2/F3", "sfdr_pref_norm", "F1=0, F2=0.5, F3=1"],
    ["PAI_PREF + topics", "PAI selected + ESG topics", "pai_block", "0 if no PAI else 0.5 + 0.5*topics_norm"],
    ["TAXONOMYPREF", "A1/A2/A3", "tax_norm", "A1=0, A2=0.5, A3=1"],
], columns=["Raw field(s)","Raw values","Engineered feature","Definition"])
display(feature_map)

---
## 5) Encoding & engineered signals

We encode values into small, interpretable 0..1 signals.

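**Note:** we keep missing flags for diagnostics and for ML. The rule score uses only the primary signals, where a missing SI, SFDR or Taxonomy value is imputed to the lowest level (signal = 0) and a missing PAI preference counts as not selected.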
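
A small worked example of the 0..1 encoding may help when reading the feature table. It is a sketch under the definitions given in this notebook (three ordinal levels mapped to 0, 0.5, 1; `pai_block` = 0.5 + 0.5 × share of ESG topics when PAI is selected); the function names are hypothetical.

```python
LEVELS = {"S1": 1, "S2": 2, "S3": 3}  # the same ordinal idea applies to F1..F3 and A1..A3

def ordinal_to_unit(code, levels=LEVELS):
    """Map an ordinal code to 0..1; unknown or missing codes fall back to the lowest level."""
    rank = levels.get(code, 1)
    return (rank - 1) / (len(levels) - 1)

def pai_block(pai_selected, topics_yes, n_topics=5):
    """0 without PAI; otherwise 0.5 plus up to 0.5 scaled by the share of ESG topics chosen."""
    if not pai_selected:
        return 0.0
    return 0.5 + 0.5 * (topics_yes / n_topics)

print(ordinal_to_unit("S2"), ordinal_to_unit(None))
print(pai_block(True, 2), pai_block(False, 5))
```

Note that ESG topics alone contribute nothing: without a PAI selection the block is 0 even if all five topics are ticked.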

In [None]:
MAP_SI = {"S1":1, "S2":2, "S3":3}
MAP_SFDR = {"F1":1, "F2":2, "F3":3}
MAP_TAX = {"A1":1, "A2":2, "A3":3}

def parse_yes_no(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return np.nan
    if isinstance(x, (int, np.integer)):
        return 1 if x == 1 else 0
    if isinstance(x, (float, np.floating)):
        return 1 if x > 0.5 else 0
    if isinstance(x, str):
        t = x.strip().lower()
        if t in {"yes","y","true","1","selected"}:
            return 1
        if t in {"no","n","false","0"}:
            return 0
    return np.nan

def parse_pai_selected(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return 0
    if isinstance(x, (int, np.integer)):
        return 1 if x == 1 else 0
    if isinstance(x, str):
        t = x.strip().lower()
        # robust: catches "PAI Selected", "pai: yes", "selected", etc.
        if "pai" in t and ("select" in t or "yes" in t or "true" in t or t.endswith("1")):
            return 1
        if t in {"pai selected","selected","yes","true","1"}:
            return 1
    return 0

def parse_sfdr_pref(x):
    if x is None or (isinstance(x, float) and np.isnan(x)):
        return np.nan
    if isinstance(x, str):
        t = x.strip().upper()
        for k in ["F1","F2","F3"]:
            if k in t:
                return MAP_SFDR[k]
    return np.nan

def encode(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Topics
    for c in ["GHG","Biodiversity","Water","Waste","Social"]:
        yn = df[c].apply(parse_yes_no)
        df[c] = yn.fillna(0).astype(int)

    # MiFID
    df["MIFID"] = df["MIFID"].apply(parse_yes_no).fillna(0).astype(int)

    # SI
    df["SI_CONSIDERATION_num"] = df["SI_CONSIDERATION_CD"].map(MAP_SI)
    df["si_missing"] = df["SI_CONSIDERATION_num"].isna().astype(int)
    df["SI_CONSIDERATION_num"] = df["SI_CONSIDERATION_num"].fillna(1).astype(int)
    df["si_norm"] = np.clip((df["SI_CONSIDERATION_num"] - 1)/2, 0, 1)

    # SFDR pref
    df["SFDR_PREF_num"] = df["SFDR_PREF"].apply(parse_sfdr_pref)
    df["sfdr_pref_missing"] = df["SFDR_PREF_num"].isna().astype(int)
    pref_filled = df["SFDR_PREF_num"].fillna(1)
    df["sfdr_pref_norm"] = np.clip((pref_filled - 1)/2, 0, 1)  # F1=0, F2=0.5, F3=1

    # Taxonomy
    df["TAXONOMYPREF_num"] = df["TAXONOMYPREF"].map(MAP_TAX)
    df["tax_missing"] = df["TAXONOMYPREF_num"].isna().astype(int)
    df["TAXONOMYPREF_num"] = df["TAXONOMYPREF_num"].fillna(1).astype(int)
    df["tax_norm"] = np.clip((df["TAXONOMYPREF_num"] - 1)/2, 0, 1)

    # PAI block
    df["pai_selected"] = df["PAI_PREF"].apply(parse_pai_selected).astype(int)
    topic_cols = ["GHG","Biodiversity","Water","Waste","Social"]
    df["esg_topics_yes_cnt"] = df[topic_cols].sum(axis=1)
    df["topics_norm"] = df["esg_topics_yes_cnt"] / len(topic_cols)
    df["pai_block"] = np.where(df["pai_selected"]==1, 0.5 + 0.5*df["topics_norm"], 0.0)

    return df

df_feat = encode(df_id)

diag = pd.DataFrame({
    "feature": ["si_norm","sfdr_pref_norm","pai_block","tax_norm","sfdr_pref_missing","tax_missing"],
    "mean": [df_feat[c].mean() for c in ["si_norm","sfdr_pref_norm","pai_block","tax_norm","sfdr_pref_missing","tax_missing"]],
    "nonzero_rate": [(df_feat[c]!=0).mean() for c in ["si_norm","sfdr_pref_norm","pai_block","tax_norm","sfdr_pref_missing","tax_missing"]],
}).round(4)
display(diag)

plt.figure(figsize=(7,4))
plt.hist(df_feat["sfdr_pref_norm"], bins=10)
plt.title("sfdr_pref_norm distribution (F1=0, F2=0.5, F3=1)")
plt.xlabel("sfdr_pref_norm")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## 6) Train/validation split
We evaluate all three approaches on the same held-out validation split.

In [None]:
y = df_feat["si_offering"].astype(int).copy()

BASE_RULE = ["MIFID","si_norm","sfdr_pref_norm","pai_block","tax_norm"]
X = df_feat[BASE_RULE].copy()

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
idx_train, idx_val = X_train.index, X_val.index

print("Train:", len(idx_train), "Val:", len(idx_val))
print("Train si rate:", y_train.mean().round(4), "Val si rate:", y_val.mean().round(4))

---
## 7) Evaluation helpers (stakeholder metrics)

We prioritize top-bucket performance (precision/lift at top 10% and 20%), plus lift-by-decile and calibration.
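
To make the top-bucket metrics concrete, here is a toy calculation of precision and lift at the top 20%. The outcome and score arrays are invented for illustration.

```python
import numpy as np

y = np.array([1, 0, 1, 0, 0, 0, 1, 0, 0, 0])                        # toy outcomes
p = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05])   # toy scores

frac = 0.20
k = max(1, int(np.ceil(frac * len(p))))      # size of the top bucket
top = y[np.argsort(-p, kind="stable")][:k]   # outcomes of the k highest-scored rows

precision = top.mean()   # hit rate inside the bucket
baseline = y.mean()      # hit rate when picking at random
lift = precision / baseline
print(k, precision, baseline, lift)
```

A lift above 1 means the top bucket contains proportionally more positives than the population as a whole; here one of the two top-scored rows is positive against a 30% base rate.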

In [None]:
def eval_global(y_true, p, label):
    return {
        "model": label,
        "auc": roc_auc_score(y_true, p),
        "avg_precision": average_precision_score(y_true, p),
        "brier": brier_score_loss(y_true, p)
    }

def precision_lift_at_frac(y_true, p, frac=0.10):
    n = len(p)
    k = max(1, int(np.ceil(frac*n)))
    order = np.argsort(-p, kind="stable")  # stable sort: tied scores keep their input order
    top = y_true[order][:k]
    baseline = y_true.mean()
    prec = float(top.mean())
    lift = float(prec / baseline) if baseline > 0 else np.nan
    return {"frac": frac, "k": k, "precision": prec, "lift": lift, "baseline": float(baseline)}

def lift_by_decile(y_true, p, n_bins=10):
    tmp = pd.DataFrame({"y": y_true, "p": p})
    tmp["decile"] = pd.qcut(tmp["p"], n_bins, labels=False, duplicates="drop") + 1
    out = tmp.groupby("decile")["y"].agg(["mean","count"]).rename(columns={"mean":"si_rate"})
    return out

def plot_lift_curve(tab, title):
    plt.figure(figsize=(8,4))
    plt.plot(tab.index, tab["si_rate"].values, marker="o")
    plt.title(title)
    plt.xlabel("Decile (1=lowest; fewer than 10 bins if scores tie)")
    plt.ylabel("si_offering rate")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

def plot_calibration(y_true, p, title):
    prob_true, prob_pred = calibration_curve(y_true, p, n_bins=10, strategy="quantile")
    plt.figure(figsize=(6,6))
    plt.plot(prob_pred, prob_true, marker="o")
    plt.plot([0,1],[0,1], linestyle="--")
    plt.title(title)
    plt.xlabel("Predicted")
    plt.ylabel("Observed rate")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

---
## 8) Method 1 — Fixed rule (baseline)
Parameters (stakeholder-friendly default):
- alpha = 0.80 (SI anchor)
- confirmation weights: SFDR_PREF 50%, PAI 30%, Taxonomy 20%

**Scoring**
- If MIFID=0 → p = si_norm
- If MIFID=1 → p = alpha*si_norm + (1-alpha)*confirm
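
The arithmetic for a single `MIFID=1` client can be worked through by hand. The client below is hypothetical (S2, F3, PAI selected with 2 of 5 topics, A1); the parameters are the fixed-rule defaults listed above.

```python
ALPHA = 0.80
WEIGHTS = {"sfdr_pref_norm": 0.50, "pai_block": 0.30, "tax_norm": 0.20}

# Hypothetical MIFID=1 client: S2, F3, PAI selected with 2 of 5 topics, A1.
client = {"si_norm": 0.5, "sfdr_pref_norm": 1.0, "pai_block": 0.7, "tax_norm": 0.0}

# Additive contributions: the SI anchor plus each weighted confirmation.
contrib = {"si": ALPHA * client["si_norm"]}
for name, weight in WEIGHTS.items():
    contrib[name] = (1 - ALPHA) * weight * client[name]

score = sum(contrib.values())
for name, value in contrib.items():
    print(f"{name:>15}: {value:.3f}")
print(f"{'score':>15}: {score:.3f}")
```

Because the score is a plain sum of these terms, each contribution can be shown to stakeholders as the reason a client ranks where it does.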

In [None]:
@dataclass
class FixedCfg:
    alpha: float = 0.80
    w_sfdr: float = 0.50
    w_pai: float = 0.30
    w_tax: float = 0.20

cfg = FixedCfg()

def score_fixed(df: pd.DataFrame, cfg: FixedCfg) -> pd.DataFrame:
    df = df.copy()
    confirm = cfg.w_sfdr*df["sfdr_pref_norm"] + cfg.w_pai*df["pai_block"] + cfg.w_tax*df["tax_norm"]
    m1 = cfg.alpha*df["si_norm"] + (1-cfg.alpha)*confirm
    m0 = df["si_norm"]
    df["p_fixed"] = np.clip(np.where(df["MIFID"]==1, m1, m0), 0, 1)

    # Explainability contributions
    df["why_si"] = np.where(df["MIFID"]==1, cfg.alpha*df["si_norm"], df["si_norm"])
    df["why_sfdr_pref"] = np.where(df["MIFID"]==1, (1-cfg.alpha)*cfg.w_sfdr*df["sfdr_pref_norm"], 0.0)
    df["why_pai"] = np.where(df["MIFID"]==1, (1-cfg.alpha)*cfg.w_pai*df["pai_block"], 0.0)
    df["why_tax"] = np.where(df["MIFID"]==1, (1-cfg.alpha)*cfg.w_tax*df["tax_norm"], 0.0)
    return df

df_scored = score_fixed(df_feat, cfg)

p_fixed = df_scored.loc[idx_val, "p_fixed"].values
fixed_global = eval_global(y_val.values, p_fixed, "Fixed rule (SFDR_PREF only)")
fixed_top10 = precision_lift_at_frac(y_val.values, p_fixed, 0.10)
fixed_top20 = precision_lift_at_frac(y_val.values, p_fixed, 0.20)

display(pd.DataFrame([fixed_global]))
display(pd.DataFrame([fixed_top10, fixed_top20]))

tab = lift_by_decile(y_val.values, p_fixed)
display(tab)
plot_lift_curve(tab, "Lift by decile: Fixed rule (SFDR_PREF only)")
plot_calibration(y_val.values, p_fixed, "Calibration: Fixed rule (SFDR_PREF only)")

---
## 9) Method 2 — Weighted rule (data-driven scorecard; recommended champion)
We keep the same scorecard structure, but learn from training data:
1) confirmation weights (SFDR_PREF / PAI / Taxonomy)
2) alpha split within conservative bounds (0.60–0.90)

**Governance constraints**
- Non-negative confirmation weights (supportive evidence)
- Weights normalized to sum to 1
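
A toy version of the two governance constraints, assuming the raw inputs are logistic-regression coefficients. The helper name `to_governed_weights` and the coefficient values are illustrative only.

```python
import numpy as np

def to_governed_weights(raw_coef):
    """Clip negative coefficients to zero and rescale so the weights sum to 1.

    Falls back to equal weights if nothing positive remains.
    """
    pos = np.maximum(np.asarray(raw_coef, dtype=float), 0.0)
    if pos.sum() == 0:
        pos = np.ones_like(pos)
    return pos / pos.sum()

print(to_governed_weights([1.2, -0.4, 0.3]))    # negative coefficient is dropped
print(to_governed_weights([-0.1, -0.2, -0.3]))  # no positive signal: equal weights
```

Clipping keeps every confirmation as supportive evidence only: a signal can be ignored (weight 0) but can never lower a client's score.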

In [None]:
train_df = df_scored.loc[idx_train].copy()
train_m1 = train_df[train_df["MIFID"]==1].copy()

CONF = ["sfdr_pref_norm","pai_block","tax_norm"]

if train_m1["si_offering"].nunique() < 2:
    print("Warning: MIFID=1 train subset has one class; fallback to fixed weights.")
    w = pd.Series([cfg.w_sfdr, cfg.w_pai, cfg.w_tax], index=CONF)
else:
    lr_w = LogisticRegression(max_iter=8000, class_weight="balanced")
    lr_w.fit(train_m1[CONF], train_m1["si_offering"])
    raw_coef = pd.Series(lr_w.coef_[0], index=CONF)
    display(raw_coef.to_frame("raw_coef (train, MIFID=1)"))

    pos = np.maximum(raw_coef.values, 0)
    if pos.sum() == 0:
        pos = np.ones_like(pos)
    w = pd.Series(pos/pos.sum(), index=CONF)

display(w.to_frame("learned_weight (sum=1)"))
plt.figure(figsize=(7,4))
plt.bar(w.index, w.values)
plt.title("Learned confirmation weights (sum=1)")
plt.ylabel("weight")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

def score_prob(df: pd.DataFrame, alpha: float, w: pd.Series) -> np.ndarray:
    confirm = w["sfdr_pref_norm"]*df["sfdr_pref_norm"] + w["pai_block"]*df["pai_block"] + w["tax_norm"]*df["tax_norm"]
    m1 = alpha*df["si_norm"].values + (1-alpha)*confirm.values
    m0 = df["si_norm"].values
    return np.clip(np.where(df["MIFID"].values==1, m1, m0), 0, 1)

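# Choose alpha by mean Average Precision over stratified folds of the training data.
# Note: w was fit once on the full training set and alpha has no fitted parameters,
# so only the held-out part of each fold (te_i) is scored; tr_i is unused.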
alpha_grid = np.round(np.arange(0.60, 0.91, 0.05), 2)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

train_idx = idx_train.values
rows=[]
for a in alpha_grid:
    aps=[]
    for tr_i, te_i in skf.split(train_idx, y_train.values):
        te_ix = train_idx[te_i]
        df_te = df_scored.loc[te_ix]
        p = score_prob(df_te, a, w)
        aps.append(average_precision_score(df_te["si_offering"].values, p))
    rows.append({"alpha": a, "cv_ap_mean": float(np.mean(aps))})

alpha_perf = pd.DataFrame(rows).sort_values("cv_ap_mean", ascending=False)
display(alpha_perf)

alpha_best = float(alpha_perf.iloc[0]["alpha"])
print("Selected alpha:", alpha_best)

plt.figure(figsize=(8,4))
tmp = alpha_perf.sort_values("alpha")
plt.plot(tmp["alpha"], tmp["cv_ap_mean"], marker="o")
plt.title("Selecting alpha by CV Average Precision (train)")
plt.xlabel("alpha (weight on SI)")
plt.ylabel("CV AP")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

df_scored["p_weighted"] = score_prob(df_scored, alpha_best, w)

# Why columns for weighted rule
df_scored["whyW_si"] = np.where(df_scored["MIFID"]==1, alpha_best*df_scored["si_norm"], df_scored["si_norm"])
df_scored["whyW_sfdr_pref"] = np.where(df_scored["MIFID"]==1, (1-alpha_best)*w["sfdr_pref_norm"]*df_scored["sfdr_pref_norm"], 0.0)
df_scored["whyW_pai"] = np.where(df_scored["MIFID"]==1, (1-alpha_best)*w["pai_block"]*df_scored["pai_block"], 0.0)
df_scored["whyW_tax"] = np.where(df_scored["MIFID"]==1, (1-alpha_best)*w["tax_norm"]*df_scored["tax_norm"], 0.0)

p_w = df_scored.loc[idx_val, "p_weighted"].values
w_global = eval_global(y_val.values, p_w, f"Weighted rule (alpha={alpha_best})")
w_top10 = precision_lift_at_frac(y_val.values, p_w, 0.10)
w_top20 = precision_lift_at_frac(y_val.values, p_w, 0.20)

display(pd.DataFrame([w_global]))
display(pd.DataFrame([w_top10, w_top20]))

tab = lift_by_decile(y_val.values, p_w)
display(tab)
plot_lift_curve(tab, "Lift by decile: Weighted rule (SFDR_PREF only)")
plot_calibration(y_val.values, p_w, "Calibration: Weighted rule (SFDR_PREF only)")

---
## 10) Method 3 — ML challenger: Calibrated Logistic Regression
We keep ML controlled and explainable:
- Logistic Regression for interpretability
- Isotonic calibration for usable probabilities
- MiFID interactions: confirmations enter only as `MIFID × signal` terms, so they contribute only when `MIFID=1`
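
To show what isotonic calibration does, the sketch below fits scikit-learn's `IsotonicRegression` directly on toy scores and outcomes. This is a simplified stand-in: `CalibratedClassifierCV(method="isotonic")` fits a monotone map of this kind on held-out fold predictions internally, and the data here is invented.

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

raw_score = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])  # toy model scores
outcome = np.array([0, 0, 1, 0, 0, 1, 1, 1])                     # toy labels

# Fit a non-decreasing step function from raw score to observed rate.
iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
iso.fit(raw_score, outcome)
calibrated = iso.predict(raw_score)
print(np.round(calibrated, 3))
```

The mapping preserves the ranking up to ties, since it never decreases, but replaces raw scores with pooled observed rates. That is why calibration mainly affects Brier score and calibration plots rather than AUC.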

In [None]:
def ml_matrix(df: pd.DataFrame) -> pd.DataFrame:
    X = pd.DataFrame(index=df.index)
    X["si_norm"] = df["si_norm"]
    X["MIFID"] = df["MIFID"]
    # interactions (only matter when MIFID=1)
    X["m1_sfdr_pref"] = df["MIFID"] * df["sfdr_pref_norm"]
    X["m1_pai"] = df["MIFID"] * df["pai_block"]
    X["m1_tax"] = df["MIFID"] * df["tax_norm"]
    # optional missing flags (often help with real data)
    X["sfdr_pref_missing"] = df["sfdr_pref_missing"]
    X["tax_missing"] = df["tax_missing"]
    X["si_missing"] = df["si_missing"]
    return X

Xtr = ml_matrix(df_scored.loc[idx_train])
Xva = ml_matrix(df_scored.loc[idx_val])

base_lr = LogisticRegression(max_iter=12000, class_weight="balanced")
cal = CalibratedClassifierCV(base_lr, method="isotonic", cv=5)
cal.fit(Xtr, y_train)

p_ml = cal.predict_proba(Xva)[:,1]
ml_global = eval_global(y_val.values, p_ml, "ML: Calibrated LR (SFDR_PREF only)")
ml_top10 = precision_lift_at_frac(y_val.values, p_ml, 0.10)
ml_top20 = precision_lift_at_frac(y_val.values, p_ml, 0.20)

display(pd.DataFrame([ml_global]))
display(pd.DataFrame([ml_top10, ml_top20]))

tab = lift_by_decile(y_val.values, p_ml)
display(tab)
plot_lift_curve(tab, "Lift by decile: ML (SFDR_PREF only)")
plot_calibration(y_val.values, p_ml, "Calibration: ML (SFDR_PREF only)")

# Coefficient sanity-check (uncalibrated LR)
plain = LogisticRegression(max_iter=12000, class_weight="balanced")
plain.fit(Xtr, y_train)
coef = pd.Series(plain.coef_[0], index=Xtr.columns).sort_values(key=np.abs, ascending=False)
display(coef.to_frame("coef"))

plt.figure(figsize=(10,4))
plt.bar(coef.index[:12], coef.values[:12])
plt.title("Top coefficients (uncalibrated LR; direction sanity-check)")
plt.ylabel("coef")
plt.xticks(rotation=45, ha="right")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

---
## 11) Compare validation performance (Fixed vs Weighted vs ML)

We select a champion based on top-bucket lift/precision and calibration stability.

In [None]:
comparison = pd.DataFrame([fixed_global, w_global, ml_global]).round(4)
display(comparison)

topk = pd.DataFrame([
    {"model":"Fixed", **fixed_top10},
    {"model":"Fixed", **fixed_top20},
    {"model":"Weighted", **w_top10},
    {"model":"Weighted", **w_top20},
    {"model":"ML", **ml_top10},
    {"model":"ML", **ml_top20},
]).round(4)
display(topk)

for metric in ["auc","avg_precision","brier"]:
    plt.figure(figsize=(8,4))
    plt.bar(comparison["model"], comparison[metric])
    plt.title(f"Validation comparison: {metric}")
    plt.ylabel(metric)
    plt.xticks(rotation=20, ha="right")
    plt.grid(True, axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()

---
## 12) Operational output: Top 20 IDs with `si_offering=0` + buckets + why columns

Default ranking uses the **weighted rule**.
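
The percentile buckets can be illustrated on ten toy scores. The cut points (50 and 80) match the ones used in this section; the scores themselves are made up.

```python
import pandas as pd

scores = pd.Series([0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00])

# Percentile rank of each score within the population, on a 0..100 scale.
percentile = (scores.rank(pct=True) * 100).round(2)

# Bottom 50% = Low, next 30% = Average, top 20% = High.
bucket = pd.cut(percentile, bins=[-0.01, 50, 80, 100], labels=["Low", "Average", "High"])
print(bucket.value_counts().sort_index())
```

With distinct scores the split is exactly 50/30/20. Rule scores take only a few distinct values, and `rank` assigns tied scores their average rank, so real bucket sizes can deviate from those shares.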

In [None]:
df_out = df_scored.copy()
df_out["rank_prob"] = df_out["p_weighted"]
df_out["score_percentile"] = (df_out["rank_prob"].rank(pct=True) * 100).round(2)
df_out["bucket_3"] = pd.cut(df_out["score_percentile"], bins=[-0.01,50,80,100], labels=["Low","Average","High"])

targets = df_out[df_out["si_offering"]==0].sort_values("rank_prob", ascending=False).head(20)

cols = [
    "ID","rank_prob","score_percentile","bucket_3",
    "MIFID","SI_CONSIDERATION_num","SFDR_PREF_num","pai_selected","TAXONOMYPREF_num","esg_topics_yes_cnt",
    "whyW_si","whyW_sfdr_pref","whyW_pai","whyW_tax"
]
targets[cols]

---
## 13) Risks & governance (what to tell stakeholders)
- **Proxy label bias:** `si_offering` (membership) is not true intent.
- **Leakage control:** do not use OFFERING_NAME as a feature (we only use it for the proxy label).
- **Missing data:** missing answers are scored as the lowest level, which can under-score clients; monitor missingness and the missing flags.
- **Drift:** questionnaire/process/product naming changes → monitor score distributions and lift monthly.

**Pilot plan:** run outreach A/B (top bucket vs control) to collect true outcomes and improve labels.