# SI Opportunity Scoring — Train/Validation Comparison (v3: +Taxonomy)
**Fixed rule → Data-driven rule → ML (Calibrated Logistic Regression)**  
**Last updated:** 2026-02-24

## Target (proxy) definition
We derive **`si_offering`** from `OFFERING_NAME`:
- `si_offering_row = 1` if OFFERING_NAME contains token **SI** (case-insensitive)
- `si_offering (per ID) = max(si_offering_row)` across rows for that client
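
A minimal, standard-library sketch of the two rules above, assuming hypothetical sample rows (the IDs and offering names below are illustrative, not real data). It shows that `SI` must appear as a standalone token, so a name such as `Basis` does not match, and that the per-ID label is the maximum of the row-level flags.

```python
import re

# Hypothetical sample rows: (client ID, OFFERING_NAME)
rows = [
    (1, "Core"),
    (1, "Core SI"),
    (2, "ESG Plus"),
    (3, "si focus"),
    (4, "Basis"),  # contains the letters "si", but not as a standalone token
]

SI_TOKEN = re.compile(r"\bSI\b", flags=re.IGNORECASE)

def si_offering_row(name):
    """Return 1 if the offering name contains SI as a standalone token, else 0."""
    return int(bool(SI_TOKEN.search(name or "")))

# Per-ID label: maximum of the row-level flags for that ID
si_offering = {}
for client_id, name in rows:
    si_offering[client_id] = max(si_offering.get(client_id, 0), si_offering_row(name))

print(si_offering)
```

ID 1 is labelled 1 because at least one of its rows is an SI offering, even though another row is not.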

## Objective
Rank **IDs with `si_offering = 0`** (not currently in an SI offering) by predicted SI alignment from preference fields.

## New requirement
For the branch with SFDR (MiFID=1), we also include **Taxonomy preference**:
- `TAXONOMYPREF`: A1, A2, A3 (mapped to 1,2,3)

## Model comparison on the same validation set
We compare three approaches:
1) **Fixed-weight rule** (transparent business logic)  
2) **Data-driven rule** (learn branch-B weights statistically on train, apply to validation)  
3) **ML**: **Calibrated Logistic Regression** (train → validate)

> Note: `si_offering` is a proxy label (membership ≠ true interest). Use the pilot plan to create true outcome labels.

**v4 update:** Adds stakeholder narrative + additional charts (distributions, bucket rates, calibration, coefficients, metric comparison).

---
## 0) Setup

### How to read this notebook (stakeholder-friendly)
Each section follows the same pattern:

1) **Question** we need to answer  
2) **Logic** (short, business terms)  
3) **Evidence** (a chart/table)  
4) **Decision / next step**

Code exists only to generate the artifact that supports the decision.

In [None]:
import numpy as np
import pandas as pd

from dataclasses import dataclass
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

import matplotlib.pyplot as plt
pd.set_option("display.max_columns", 200)

---
## 1) Load data (fallback demo) + shuffle raw rows

**Expected columns**:
- `ID`, `IO_TYPE`, `LIFE_CYCLE`, `OFFERING_NAME`
- `SI_CONSIDERATION_CD`, `SFDR_PREF`, `SFDR_ACTUAL`, `PAI_PREF`, `MIFID`, `TAXONOMYPREF`
- ESG topics: `GHG`, `Biodiversity`, `Water`, `Waste`, `Social`

### Why we shuffle raw rows
Many operational datasets arrive **sorted** (by ID, by date, by system order).  
Shuffling avoids accidental ordering artifacts and makes the pipeline more robust.

In [None]:
DATA_PATH = Path("data.csv")  # <-- change to your real file path

def make_synthetic_data(n=9000, seed=42):
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "ID": rng.integers(1, n//2 + 1, size=n),  # multiple rows per ID
        "IO_TYPE": rng.choice(["normal", "zombie"], size=n, p=[0.97, 0.03]),
        "LIFE_CYCLE": rng.choice(["open", "closed"], size=n, p=[0.9, 0.1]),

        "SI_CONSIDERATION_CD": rng.choice(["S1","S2","S3", None], size=n, p=[0.35,0.35,0.2,0.1]),
        "SFDR_PREF": rng.choice(["F1","F2","F3", None], size=n, p=[0.4,0.35,0.2,0.05]),
        "SFDR_ACTUAL": rng.choice(["F1","F2","F3", None], size=n, p=[0.45,0.35,0.15,0.05]),
        "PAI_PREF": rng.choice(["PAI Selected", None], size=n, p=[0.3,0.7]),
        "MIFID": rng.choice(["Yes","No", None], size=n, p=[0.55,0.4,0.05]),
        "TAXONOMYPREF": rng.choice(["A1","A2","A3", None], size=n, p=[0.5,0.35,0.1,0.05]),

        "GHG": rng.choice(["Yes","No","--", None], size=n, p=[0.25,0.65,0.05,0.05]),
        "Biodiversity": rng.choice(["Yes","No","--", None], size=n, p=[0.2,0.7,0.05,0.05]),
        "Water": rng.choice(["Yes","No","--", None], size=n, p=[0.22,0.68,0.05,0.05]),
        "Waste": rng.choice(["Yes","No","--", None], size=n, p=[0.18,0.72,0.05,0.05]),
        "Social": rng.choice(["Yes","No","--", None], size=n, p=[0.28,0.62,0.05,0.05]),
    })

    df["OFFERING_NAME"] = rng.choice(
        ["Core", "Standard", "ESG Plus", "SI Focus", "Core SI", "Income", "SI Sustainable"],
        size=n, p=[0.25,0.25,0.15,0.15,0.08,0.07,0.05]
    )
    return df

if DATA_PATH.exists():
    df_raw = pd.read_csv(DATA_PATH)
    print(f"Loaded: {DATA_PATH}  shape={df_raw.shape}")
else:
    df_raw = make_synthetic_data()
    print("DATA_PATH not found; using synthetic demo dataset.")
    print(f"shape={df_raw.shape}")

# Reproducible shuffle of raw rows (avoids ordering artifacts)
df_raw = df_raw.sample(frac=1, random_state=42).reset_index(drop=True)

df_raw.head()

---
## 2) Derive `si_offering` from name and aggregate to ID-level

We operate at **ID-level** to avoid recommending the same client multiple times.

**Aggregation rule (transparent defaults):**
- `si_offering`: max across rows
- preference fields: most frequent non-missing value (on a tie, `Series.mode` returns the modes in sorted order and we take the first)

### Why aggregate to ID-level
We recommend **clients/IDs**, not rows. If an ID appears multiple times (multiple records),
row-level scoring can inflate evidence and produce duplicates.

So we aggregate to a **single record per ID** before scoring and validation.
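
As an illustration of the aggregation rule, here is a plain-Python sketch for a single ID. The helper name `mode_or_none` and the sample values are assumptions made for this example; the notebook itself uses pandas for the same step.

```python
from collections import Counter

def mode_or_none(values):
    """Most frequent non-missing value; ties go to the smallest value in sort order."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    counts = Counter(present)
    top = max(counts.values())
    return min(v for v, c in counts.items() if c == top)

# Hypothetical rows belonging to one ID
sfdr_pref_rows = ["F2", "F1", "F2", None]
si_rows = [0, 1, 0]

record = {
    "SFDR_PREF": mode_or_none(sfdr_pref_rows),  # preference field: mode
    "si_offering": max(si_rows),                # label: max across rows
}
print(record)
```

Three source rows collapse into one record, so the ID is scored and recommended once.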

In [None]:
df = df_raw.copy()
df["si_offering_row"] = df["OFFERING_NAME"].astype(str).str.contains(r"\bSI\b", case=False, na=False).astype(int)

def mode_or_first(s: pd.Series):
    s2 = s.dropna()
    if len(s2) == 0:
        return np.nan
    # Series.mode returns the modes in sorted order; take the first one
    m = s2.mode()
    if len(m) > 0:
        return m.iloc[0]
    return s2.iloc[0]

agg_dict = {
    "IO_TYPE": mode_or_first,
    "LIFE_CYCLE": mode_or_first,
    "SI_CONSIDERATION_CD": mode_or_first,
    "SFDR_PREF": mode_or_first,
    "SFDR_ACTUAL": mode_or_first,
    "PAI_PREF": mode_or_first,
    "MIFID": mode_or_first,
    "TAXONOMYPREF": mode_or_first,
    "GHG": mode_or_first,
    "Biodiversity": mode_or_first,
    "Water": mode_or_first,
    "Waste": mode_or_first,
    "Social": mode_or_first,
    "si_offering_row": "max",
}

df_id = df.groupby("ID", as_index=False).agg(agg_dict).rename(columns={"si_offering_row":"si_offering"})

print("Row-level rows:", len(df))
print("ID-level rows :", len(df_id))
print("si_offering rate (ID-level):", df_id["si_offering"].mean().round(4))
df_id.head()

In [None]:
# Chart: row-level vs ID-level size + proxy label prevalence (before cleaning)
# Runs after the aggregation cell because it needs both df and df_id.
sizes = pd.DataFrame({
    "level": ["row-level", "ID-level"],
    "rows": [len(df), len(df_id)],
    "si_offering_rate": [df["si_offering_row"].mean(), df_id["si_offering"].mean()]
})
display(sizes)

plt.figure(figsize=(7,4))
plt.bar(sizes["level"], sizes["rows"])
plt.title("Dataset size: row-level vs ID-level")
plt.ylabel("Number of records")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

plt.figure(figsize=(7,4))
plt.bar(sizes["level"], sizes["si_offering_rate"])
plt.title("Proxy label prevalence: si_offering rate")
plt.ylabel("Rate")
plt.ylim(0, 1)
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

---
## 3) Cleaning rules

- Remove `IO_TYPE='zombie'`
- Keep `LIFE_CYCLE='open'`
- Convert topics/MiFID/PAI to binary
- Encode missing ordinal/categorical values as the string `'nan'`; feature engineering later defaults them to the lowest level (1), the conservative choice

### Why cleaning is critical
- Removes **non-actionable** IDs (zombie / closed lifecycle)
- Standardizes messy values (`--`, nulls)
- Prevents unstable scoring due to inconsistent encodings

We also inspect missingness to make assumptions explicit.

In [None]:
# Missingness report (ID-level, before cleaning) for key fields.
# Cleaning replaces every missing value, so the same report on the cleaned
# frame would show 0% everywhere; we measure on df_id instead.
key_cols = ["SI_CONSIDERATION_CD","SFDR_PREF","SFDR_ACTUAL","TAXONOMYPREF","MIFID","PAI_PREF",
            "GHG","Biodiversity","Water","Waste","Social"]
miss = df_id[key_cols].isna().mean().sort_values(ascending=False)
display(miss.to_frame("missing_rate").head(10))

plt.figure(figsize=(9,4))
plt.bar(miss.index[:10], miss.values[:10])
plt.title("Top missing rates before cleaning (ID-level): key fields")
plt.ylabel("Missing rate")
plt.xticks(rotation=45, ha="right")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
def yes_to_1(x):
    if pd.isna(x): return 0
    x = str(x).strip()
    if x == "--": return 0
    return 1 if x.lower() == "yes" else 0

def clean_id_level(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df[df["IO_TYPE"].fillna("").str.lower() != "zombie"]
    df = df[df["LIFE_CYCLE"].fillna("").str.lower() == "open"]

    for c in ["SI_CONSIDERATION_CD","SFDR_PREF","SFDR_ACTUAL","TAXONOMYPREF"]:
        df[c] = df[c].astype("object").where(df[c].notna(), "nan")

    for c in ["GHG","Biodiversity","Water","Waste","Social"]:
        df[c] = df[c].apply(yes_to_1).astype(int)

    df["MIFID"] = df["MIFID"].apply(yes_to_1).astype(int)
    df["PAI_PREF"] = (df["PAI_PREF"].astype(str).str.lower() == "pai selected").astype(int)
    df["si_offering"] = df["si_offering"].astype(int)
    return df

df_clean = clean_id_level(df_id)

pd.DataFrame({
    "stage": ["before", "after"],
    "rows": [len(df_id), len(df_clean)],
    "si_offering_rate": [df_id["si_offering"].mean(), df_clean["si_offering"].mean()]
})

---
## 4) Feature engineering (including SFDR gap + Taxonomy mapping)

We avoid `OFFERING_NAME`-derived features: the proxy label itself is derived from `OFFERING_NAME`, so any such feature would leak the target.

### Sanity-check engineered signals
Before we score anything, we check that key engineered signals are populated and reasonable:
- SFDR gap distribution
- Taxonomy distribution and normalized scale
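
To make these signals concrete, here is a small plain-Python sketch of how the SFDR gap and the normalized Taxonomy scale can be computed for one ID. The function name `engineered` is an assumption for this example; the level mappings (F1..F3 and A1..A3 to 1..3, missing to the lowest level) follow the notebook's stated defaults.

```python
SFDR_LEVEL = {"F1": 1, "F2": 2, "F3": 3}
TAX_LEVEL = {"A1": 1, "A2": 2, "A3": 3}

def engineered(sfdr_pref, sfdr_actual, taxonomy_pref):
    """Return the SFDR gap and the 0..1 normalized signals for one ID."""
    pref = SFDR_LEVEL.get(sfdr_pref, 1)      # missing or unknown -> lowest level
    actual = SFDR_LEVEL.get(sfdr_actual, 1)
    tax = TAX_LEVEL.get(taxonomy_pref, 1)
    gap = max(-2, min(2, pref - actual))     # preference minus actual, clipped to -2..2
    opp = max(gap, 0)                        # only an unmet preference counts as opportunity
    return {"sfdr_gap": gap, "sfdr_norm": opp / 2, "tax_norm": (tax - 1) / 2}

print(engineered("F3", "F1", "A2"))  # prefers more than currently held
print(engineered("F1", "F3", None))  # already holds more than preferred
```

A negative gap produces `sfdr_norm = 0`: an ID that already holds more than it asked for is not treated as an opportunity.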

In [None]:
MAP_SI = {"S1":1, "S2":2, "S3":3, "nan": np.nan}
MAP_SFDR = {"F1":1, "F2":2, "F3":3, "nan": np.nan}
MAP_TAX = {"A1":1, "A2":2, "A3":3, "nan": np.nan}

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Missing or unmapped values default to the lowest level (1)
    df["SI_CONSIDERATION_num"] = df["SI_CONSIDERATION_CD"].map(MAP_SI).fillna(1).astype(int)
    df["SFDR_PREF_num"] = df["SFDR_PREF"].map(MAP_SFDR).fillna(1).astype(int)
    df["SFDR_ACTUAL_num"] = df["SFDR_ACTUAL"].map(MAP_SFDR).fillna(1).astype(int)
    df["TAXONOMYPREF_num"] = df["TAXONOMYPREF"].map(MAP_TAX).fillna(1).astype(int)

    df["sfdr_gap"] = np.clip(df["SFDR_PREF_num"] - df["SFDR_ACTUAL_num"], -2, 2)
    df["sfdr_opp"] = np.maximum(df["sfdr_gap"], 0)  # 0..2

    topic_cols = ["GHG","Biodiversity","Water","Waste","Social"]
    df["esg_topics_yes_cnt"] = df[topic_cols].sum(axis=1)
    df["esg_topics_yes_share"] = df["esg_topics_yes_cnt"] / len(topic_cols)

    df["si_norm"] = np.clip((df["SI_CONSIDERATION_num"] - 1)/2, 0, 1)
    df["sfdr_norm"] = np.clip(df["sfdr_opp"]/2, 0, 1)
    df["topics_norm"] = np.clip(df["esg_topics_yes_share"], 0, 1)
    df["topics_if_pai"] = df["topics_norm"] * df["PAI_PREF"]
    df["tax_norm"] = np.clip((df["TAXONOMYPREF_num"] - 1)/2, 0, 1)

    return df

df_feat = engineer_features(df_clean)
df_feat[["ID","si_offering","MIFID","sfdr_gap","PAI_PREF","TAXONOMYPREF_num","tax_norm"]].head()

In [None]:
# Sanity-check charts; run after df_feat has been built above.
plt.figure(figsize=(7,4))
plt.hist(df_feat["sfdr_gap"], bins=5)
plt.title("Distribution of SFDR gap (clipped)")
plt.xlabel("sfdr_gap")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

plt.figure(figsize=(7,4))
plt.hist(df_feat["TAXONOMYPREF_num"], bins=3)
plt.title("Distribution of Taxonomy preference (A1/A2/A3 mapped to 1/2/3)")
plt.xlabel("TAXONOMYPREF_num")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

plt.figure(figsize=(7,4))
plt.hist(df_feat["tax_norm"], bins=10)
plt.title("Distribution of tax_norm (0..1)")
plt.xlabel("tax_norm")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

---
## 5) Fixed-weight rule score (branching, includes Taxonomy in SFDR branch)

**Logic:**
- If `MIFID=0`: SI-only score, capped at 80 (S1=20, S2=50, S3=80)
- If `MIFID=1`: combine **SFDR opportunity + PAI block + Taxonomy**

**Default weights (Branch B):** 60% SFDR, 25% PAI block, 15% Taxonomy (sum=100%)
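
A worked example of the branching logic for a single ID, written as a standalone sketch. The function name `fixed_score` and the input values are assumptions for illustration; the weights, the SI mapping (20/50/80) and the PAI block (0.5 plus half the topic share when PAI is selected) follow the notebook's defaults.

```python
def fixed_score(mifid, si_level, sfdr_norm, pai_pref, topics_norm, tax_norm):
    """Fixed-weight rule for one ID; all *_norm inputs are on a 0..1 scale."""
    if mifid == 0:
        # Branch A: SI-only score, capped at 80
        return {1: 20.0, 2: 50.0, 3: 80.0}[si_level]
    # Branch B: 60% SFDR opportunity + 25% PAI block + 15% Taxonomy
    pai_block = 0.5 + 0.5 * topics_norm if pai_pref == 1 else 0.0
    return 100 * (0.60 * sfdr_norm + 0.25 * pai_block + 0.15 * tax_norm)

# Hypothetical MiFID=1 client: one-level SFDR gap, PAI selected,
# 2 of 5 topics ticked, Taxonomy A3
score_b = fixed_score(mifid=1, si_level=1, sfdr_norm=0.5, pai_pref=1, topics_norm=0.4, tax_norm=1.0)

# Hypothetical MiFID=0 client with SI consideration S3
score_a = fixed_score(mifid=0, si_level=3, sfdr_norm=0.0, pai_pref=0, topics_norm=0.0, tax_norm=0.0)

print(round(score_b, 2), score_a)
```

For the Branch-B client the three contributions are 30 (SFDR), 17.5 (PAI block) and 15 (Taxonomy), giving 62.5 out of 100.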

### Fixed-weight rule: what it represents
This is the most transparent method: it encodes the stakeholder logic above directly, with no parameters learned from data.

Next we validate it out-of-sample.

In [None]:
@dataclass
class RuleConfig:
    si_score_s1: float = 20
    si_score_s2: float = 50
    si_score_s3: float = 80
    w_sfdr: float = 0.60
    w_pai_block: float = 0.25
    w_tax: float = 0.15

cfg = RuleConfig()

def score_fixed_rule(df: pd.DataFrame, cfg: RuleConfig) -> pd.Series:
    si_score = df["SI_CONSIDERATION_num"].map({1: cfg.si_score_s1, 2: cfg.si_score_s2, 3: cfg.si_score_s3}).astype(float)
    pai_block = np.where(df["PAI_PREF"] == 1, 0.5 + 0.5*df["topics_norm"], 0.0)  # 0..1

    score_B = 100 * (
        cfg.w_sfdr * df["sfdr_norm"] +
        cfg.w_pai_block * pai_block +
        cfg.w_tax * df["tax_norm"]
    )

    score = np.where(df["MIFID"]==1, score_B, si_score)
    return pd.Series(np.clip(score, 0, 100), index=df.index)

df_feat["score_fixed"] = score_fixed_rule(df_feat, cfg)
df_feat["score_fixed"].describe()

In [None]:
# Score distribution for fixed rule (run after score_fixed has been computed)
plt.figure(figsize=(7,4))
plt.hist(df_feat["score_fixed"], bins=30)
plt.title("Score distribution: Fixed-weight rule")
plt.xlabel("score_fixed")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Proxy success by broad buckets (overall, before split).
# score_fixed is discrete with many ties, so quantile edges can coincide and
# qcut with fixed labels would fail. Ranking with method="first" guarantees
# three equal-sized buckets, at the cost of splitting tied scores arbitrarily.
tmp = df_feat.copy()
tmp["bucket"] = pd.qcut(tmp["score_fixed"].rank(method="first"), 3, labels=["Low","Average","High"])
bucket_rate = tmp.groupby("bucket", observed=True)["si_offering"].agg(["mean","count"]).rename(columns={"mean":"si_rate"})
display(bucket_rate)

plt.figure(figsize=(7,4))
plt.bar(bucket_rate.index.astype(str), bucket_rate["si_rate"].values)
plt.title("Proxy evidence: si_offering rate by Fixed-rule bucket")
plt.ylabel("si_offering rate")
plt.ylim(0, 1)
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

---
## 6) Train/Validation split (ID-level)

We use a stratified split on `si_offering` (proxy label).

### Why validation matters
All three methods are judged on the **same validation set**:

- Fixed-weight rule
- Data-driven rule (learn on train → apply on validation)
- ML (train → validation)

We report:
- **AUC** (ranking)
- **Average Precision** (ranking for imbalanced targets)
- **Brier** (probability quality)

We also show **lift-by-decile** charts.
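
For readers who want to see what two of these metrics measure, here is a dependency-free sketch of AUC and the Brier score on a four-ID toy example. The helper names and toy values are assumptions for illustration; the notebook itself uses the scikit-learn implementations.

```python
def auc_pairwise(y_true, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(y_true, probs):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)

# Toy data: two IDs without an SI offering, two with one
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]

auc_value = auc_pairwise(y, p)
brier_value = brier(y, p)
print(auc_value, round(brier_value, 6))
```

AUC only depends on the ordering of scores, so it is the relevant metric for ranking. The Brier score also penalizes scores that are on the wrong scale, which matters when a rule score divided by 100 is read as a probability.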

In [None]:
FEATURES_ALL = [
    "si_norm","sfdr_norm","PAI_PREF","topics_if_pai","tax_norm",
    "esg_topics_yes_cnt","sfdr_gap","MIFID","SI_CONSIDERATION_num","TAXONOMYPREF_num"
]

X = df_feat[FEATURES_ALL].copy()
y = df_feat["si_offering"].copy()

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
idx_train, idx_val = X_train.index, X_val.index

print("Train size:", len(idx_train), "Val size:", len(idx_val))
print("Train si_offering rate:", y_train.mean().round(4), "Val si_offering rate:", y_val.mean().round(4))

---
## 7) Evaluation helpers

In [None]:
def eval_scores(y_true, p, label):
    return {
        "model": label,
        "auc": roc_auc_score(y_true, p),
        "avg_precision": average_precision_score(y_true, p),
        "brier": brier_score_loss(y_true, p)
    }

def lift_table(y_true, p, n_bins=10):
    tmp = pd.DataFrame({"y": y_true, "p": p})
    tmp["bin"] = pd.qcut(tmp["p"], n_bins, labels=False, duplicates="drop") + 1
    return tmp.groupby("bin")["y"].agg(["mean","count"]).rename(columns={"mean":"si_rate"})

def plot_lift(tab, title):
    plt.figure(figsize=(8,4))
    plt.plot(tab.index, tab["si_rate"].values, marker="o")
    plt.title(title)
    plt.xlabel("Decile (1=lowest, 10=highest)")
    plt.ylabel("si_offering rate")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

---
## 8) Fixed-weight rule — validation performance

In [None]:
p_fixed_val = df_feat.loc[idx_val, "score_fixed"].values / 100.0
fixed_metrics = eval_scores(y_val.values, p_fixed_val, "Fixed-weight rule (+Taxonomy)")
fixed_metrics

### Calibration (proxy): Fixed-weight rule

In [None]:
from sklearn.calibration import calibration_curve
prob_true, prob_pred = calibration_curve(y_val.values, p_fixed_val, n_bins=10, strategy="quantile")
plt.figure(figsize=(6,6))
plt.plot(prob_pred, prob_true, marker="o")
plt.plot([0,1],[0,1], linestyle="--")
plt.title("Calibration (proxy): Fixed-weight rule")
plt.xlabel("Predicted (score_fixed / 100)")
plt.ylabel("Observed si_offering rate")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
lift_fixed = lift_table(y_val.values, p_fixed_val)
plot_lift(lift_fixed, "Lift (proxy): Fixed-weight rule (+Taxonomy)")
lift_fixed

---
## 9) Data-driven rule — learn weights on train (branch B) and apply on validation

We keep stakeholder interpretability by learning weights for the **SFDR branch only**.

- Branch A (MiFID=0): keep the capped SI mapping
- Branch B (MiFID=1): fit a logistic regression on the train subset using `sfdr_norm`, `PAI_PREF`, `topics_if_pai`, `tax_norm`, clip negative coefficients to zero, and rescale the remaining ones into weights that sum to 100
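
The coefficient-to-weight conversion can be sketched in a few lines of plain Python. The coefficient values below are hypothetical and chosen only to show the effect of a negative coefficient; `coefs_to_weights` is an illustrative name, not the notebook's function.

```python
def coefs_to_weights(coefs):
    """Clip negative coefficients to zero, then rescale so the weights sum to 100."""
    positive = {name: max(value, 0.0) for name, value in coefs.items()}
    total = sum(positive.values())
    if total == 0:
        # No positive signal at all: fall back to equal weights
        positive = {name: 1.0 for name in coefs}
        total = float(len(coefs))
    return {name: 100 * value / total for name, value in positive.items()}

# Hypothetical fitted coefficients for the four Branch-B features
coefs = {"sfdr_norm": 1.2, "PAI_PREF": 0.3, "topics_if_pai": -0.4, "tax_norm": 0.5}
weights = coefs_to_weights(coefs)
print({name: round(w, 1) for name, w in weights.items()})
```

With these inputs `topics_if_pai` receives weight 0, so a feature whose learned direction contradicts the business logic drops out of the rule instead of lowering the score.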

In [None]:
train_B = df_feat.loc[idx_train]
train_B = train_B[train_B["MIFID"] == 1].copy()

DD_B_FEATURES = ["sfdr_norm","PAI_PREF","topics_if_pai","tax_norm"]

lr_dd_B = LogisticRegression(max_iter=4000, class_weight="balanced")
lr_dd_B.fit(train_B[DD_B_FEATURES], train_B["si_offering"])

coef_B = pd.Series(lr_dd_B.coef_[0], index=DD_B_FEATURES).sort_values(key=np.abs, ascending=False)
coef_B

In [None]:
def coef_to_100_weights(coef_series):
    pos = np.maximum(coef_series.values, 0)
    if pos.sum() == 0:
        pos = np.ones_like(pos)
    w = 100 * pos / pos.sum()
    return pd.Series(w, index=coef_series.index).sort_values(ascending=False)

w_dd_B = coef_to_100_weights(coef_B)
w_dd_B

In [None]:
plt.figure(figsize=(7,4))
plt.bar(w_dd_B.index, w_dd_B.values)
plt.title("Learned Branch-B weights (sum=100)")
plt.ylabel("weight")
plt.xticks(rotation=45, ha="right")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
df_feat["score_datadriven"] = np.nan

# Branch A (MIFID=0): fixed SI mapping
si_score = df_feat["SI_CONSIDERATION_num"].map({1: cfg.si_score_s1, 2: cfg.si_score_s2, 3: cfg.si_score_s3}).astype(float)
df_feat.loc[df_feat["MIFID"]==0, "score_datadriven"] = si_score[df_feat["MIFID"]==0]

# Branch B (MIFID=1): learned weights
scoreB = (
    w_dd_B["sfdr_norm"] * df_feat["sfdr_norm"] +
    w_dd_B["PAI_PREF"] * df_feat["PAI_PREF"] +
    w_dd_B["topics_if_pai"] * df_feat["topics_if_pai"] +
    w_dd_B["tax_norm"] * df_feat["tax_norm"]
)

df_feat.loc[df_feat["MIFID"]==1, "score_datadriven"] = scoreB[df_feat["MIFID"]==1]
df_feat["score_datadriven"] = df_feat["score_datadriven"].clip(0,100)

p_dd_val = df_feat.loc[idx_val, "score_datadriven"].values / 100.0
dd_metrics = eval_scores(y_val.values, p_dd_val, "Data-driven rule (learned Branch-B weights)")
dd_metrics

### Calibration (proxy): Data-driven rule

In [None]:
prob_true, prob_pred = calibration_curve(y_val.values, p_dd_val, n_bins=10, strategy="quantile")
plt.figure(figsize=(6,6))
plt.plot(prob_pred, prob_true, marker="o")
plt.plot([0,1],[0,1], linestyle="--")
plt.title("Calibration (proxy): Data-driven rule")
plt.xlabel("Predicted (score_datadriven / 100)")
plt.ylabel("Observed si_offering rate")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
lift_dd = lift_table(y_val.values, p_dd_val)
plot_lift(lift_dd, "Lift (proxy): Data-driven rule (+Taxonomy)")
lift_dd

---
## 10) ML — Calibrated Logistic Regression (train → validate)

We train a calibrated logistic regression (isotonic calibration with 5-fold cross-validation) on the full train split using engineered features (including Taxonomy).

In [None]:
ML_FEATURES = [
    "si_norm","sfdr_norm","PAI_PREF","topics_if_pai","tax_norm",
    "esg_topics_yes_cnt","sfdr_gap","MIFID","SI_CONSIDERATION_num","TAXONOMYPREF_num"
]

Xtr = X_train[ML_FEATURES].copy()
Xva = X_val[ML_FEATURES].copy()

lr = LogisticRegression(max_iter=5000, class_weight="balanced")
cal_lr = CalibratedClassifierCV(lr, method="isotonic", cv=5)
cal_lr.fit(Xtr, y_train)

p_lr_val = cal_lr.predict_proba(Xva)[:,1]
ml_metrics = eval_scores(y_val.values, p_lr_val, "ML: Calibrated Logistic Regression")
ml_metrics

### ML interpretability: coefficient overview (trained on train split)

In [None]:
# Fit uncalibrated LR on train for coefficient inspection (calibration wraps multiple CV estimators)
lr_plain = LogisticRegression(max_iter=5000, class_weight="balanced")
lr_plain.fit(Xtr, y_train)

coef = pd.Series(lr_plain.coef_[0], index=ML_FEATURES).sort_values(key=np.abs, ascending=False)
display(coef.to_frame("coef"))

plt.figure(figsize=(9,4))
plt.bar(coef.index[:10], coef.values[:10])
plt.title("Top coefficients (uncalibrated LR; sign indicates direction)")
plt.ylabel("coefficient")
plt.xticks(rotation=45, ha="right")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

### Calibration (proxy): ML calibrated logistic regression

In [None]:
prob_true, prob_pred = calibration_curve(y_val.values, p_lr_val, n_bins=10, strategy="quantile")
plt.figure(figsize=(6,6))
plt.plot(prob_pred, prob_true, marker="o")
plt.plot([0,1],[0,1], linestyle="--")
plt.title("Calibration (proxy): ML Calibrated Logistic Regression")
plt.xlabel("Predicted probability")
plt.ylabel("Observed si_offering rate")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
lift_lr = lift_table(y_val.values, p_lr_val)
plot_lift(lift_lr, "Lift (proxy): ML Calibrated Logistic Regression (+Taxonomy)")
lift_lr

---
## 11) Validation comparison (Fixed vs Data-driven vs ML)

In [None]:
pd.DataFrame([fixed_metrics, dd_metrics, ml_metrics]).sort_values("auc", ascending=False).round(4)

### Visual comparison of validation metrics
A quick stakeholder view of which approach wins on:
- ranking (AUC / Average Precision)
- probability quality (Brier; lower is better)

In [None]:
comparison = pd.DataFrame([fixed_metrics, dd_metrics, ml_metrics])
for metric in ["auc", "avg_precision", "brier"]:
    plt.figure(figsize=(8,4))
    plt.bar(comparison["model"], comparison[metric])
    plt.title(f"Validation comparison: {metric}")
    plt.ylabel(metric)
    plt.xticks(rotation=30, ha="right")
    plt.grid(True, axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()

---
## 12) Operational output: target IDs (`si_offering=0`) ranked by chosen method

Operationally, we recommend ranking by a score available for **all IDs**:
- `score_fixed` or `score_datadriven`

We also assign percentile buckets (Low/Average/High).
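
The bucket assignment can be illustrated without pandas. This sketch assumes ten hypothetical scores, average ranks for ties, and the 50/80 percentile cut-offs used in the cell below; the helper names are illustrative.

```python
def percentile_ranks(scores):
    """Percentile rank in (0, 100]; tied scores share their average rank."""
    n = len(scores)
    ranks = []
    for s in scores:
        below = sum(1 for x in scores if x < s)
        equal = sum(1 for x in scores if x == s)
        avg_rank = below + (equal + 1) / 2
        ranks.append(100 * avg_rank / n)
    return ranks

def bucket(pct):
    """Low: up to the 50th percentile, Average: up to the 80th, High: above."""
    if pct <= 50:
        return "Low"
    if pct <= 80:
        return "Average"
    return "High"

# Hypothetical scores for ten IDs
scores = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
buckets = [bucket(p) for p in percentile_ranks(scores)]
print(buckets)
```

With no ties the split is 50% Low, 30% Average and 20% High. Rule scores are discrete, so heavy ties can make the real bucket sizes deviate from these shares.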

In [None]:
RANK_METHOD = "score_datadriven"  # options: "score_fixed" or "score_datadriven"

# Restrict to the eligible pool first, so percentiles and buckets are relative
# to IDs with si_offering=0 (consistent with "top 20% of si_offering=0" in the pilot plan).
targets = df_feat[df_feat["si_offering"]==0].copy()
targets["rank_score"] = targets[RANK_METHOD]
targets["score_percentile"] = (targets["rank_score"].rank(pct=True) * 100).round(2)
targets["bucket_3"] = pd.cut(targets["score_percentile"], bins=[-0.01, 50, 80, 100], labels=["Low","Average","High"])
targets = targets.sort_values("rank_score", ascending=False)

cols = ["ID","rank_score","score_percentile","bucket_3",
        "MIFID","SI_CONSIDERATION_num","sfdr_gap","PAI_PREF","TAXONOMYPREF_num","esg_topics_yes_cnt"]
targets[cols].head(20)

---
## 13) Pilot plan (create true labels)

Because `si_offering` is a proxy, validate impact with a pilot:
- Treatment: High bucket (top 20% of `si_offering=0`)
- Control: randomized sample from eligible pool (or next bucket)
- Outcomes: response, meeting booked, adoption, pipeline created

After pilot: retrain and compare Fixed vs Data-driven vs ML using true outcomes.
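
One possible way to draw the two pilot groups is sketched below. It assumes the control group is a seeded random sample from the eligible IDs that are not in the treatment group, which is one reading of "randomized sample from eligible pool". The function name, parameters and ID values are hypothetical.

```python
import random

def assign_pilot_groups(ranked_ids, treat_share=0.20, n_control=None, seed=42):
    """Split eligible IDs (sorted by score, best first) into treatment and control."""
    n_treat = int(len(ranked_ids) * treat_share)
    treatment = ranked_ids[:n_treat]          # High bucket: top share by score
    remaining = ranked_ids[n_treat:]
    rng = random.Random(seed)                 # fixed seed keeps the draw reproducible
    k = min(n_control or n_treat, len(remaining))
    control = rng.sample(remaining, k)        # random draw, without replacement
    return treatment, control

# Hypothetical eligible pool of 100 IDs, already ranked best first
ranked_ids = list(range(100))
treatment, control = assign_pilot_groups(ranked_ids)
print(len(treatment), len(control))
```

Recording the seed and the ranked list used for the draw makes the assignment auditable when the pilot outcomes are later joined back as true labels.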