# SI Opportunity Scoring — Train/Validation Evaluation
**Fixed rule → Data-driven rule → ML**  
**Last updated:** 2026-02-24

## Target (proxy) definition
We derive **`si_offering`** from `OFFERING_NAME`:
- `si_offering_row = 1` if OFFERING_NAME contains the token **SI** (case-insensitive)
- `si_offering (per ID) = max(si_offering_row)` across rows for that client
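
As a small illustration of the token rule above, the sketch below applies a word-boundary match to a few made-up offering names and then takes the per-ID maximum. The toy rows and names are hypothetical; only the column names come from this notebook.

```python
import pandas as pd

# Hypothetical toy rows (illustrative names, not real offerings).
toy = pd.DataFrame({
    "ID": [1, 1, 2, 3],
    "OFFERING_NAME": ["Core", "Core SI", "Basic Silver", "si focus"],
})

# \bSI\b matches "SI" only as a standalone token, so "Silver" and "Basic" do not match.
toy["si_offering_row"] = (
    toy["OFFERING_NAME"].str.contains(r"\bSI\b", case=False, na=False).astype(int)
)

# ID-level label: 1 if any row for that ID is an SI offering.
si_offering = toy.groupby("ID")["si_offering_row"].max()
print(si_offering)
```

IDs 1 and 3 are flagged (one matching row is enough), while ID 2 is not, because "Silver" only contains the letters "si" inside a longer word.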

### Objective
Rank **IDs with `si_offering = 0`** (not currently in an SI offering) by **likelihood of SI interest** using preference signals.

> Important: `si_offering` is a *proxy* label (membership ≠ true interest).  
> This notebook evaluates models **against the proxy** and ends with a pilot plan for true labels.

**Update:** Raw rows are now shuffled reproducibly to avoid ordering artifacts, and the ML benchmark is limited to Calibrated Logistic Regression.

---
## 0) Setup

In [None]:
import numpy as np
import pandas as pd

from dataclasses import dataclass
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

import matplotlib.pyplot as plt
pd.set_option("display.max_columns", 200)

---
## 1) Load data (fallback demo)

Expected columns:
- `ID`, `IO_TYPE`, `LIFE_CYCLE`, `OFFERING_NAME`
- `SI_CONSIDERATION_CD`, `SFDR_PREF`, `SFDR_ACTUAL`, `PAI_PREF`, `MIFID`
- ESG topics: `GHG`, `Biodiversity`, `Water`, `Waste`, `Social`

In [None]:
DATA_PATH = Path("data.csv")  # <-- change to your real file path

def make_synthetic_data(n=8000, seed=42):
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "ID": rng.integers(1, n//2 + 1, size=n),  # multiple rows per ID
        "IO_TYPE": rng.choice(["normal", "zombie"], size=n, p=[0.97, 0.03]),
        "LIFE_CYCLE": rng.choice(["open", "closed"], size=n, p=[0.9, 0.1]),

        "SI_CONSIDERATION_CD": rng.choice(["S1","S2","S3", None], size=n, p=[0.35,0.35,0.2,0.1]),
        "SFDR_PREF": rng.choice(["F1","F2","F3", None], size=n, p=[0.4,0.35,0.2,0.05]),
        "SFDR_ACTUAL": rng.choice(["F1","F2","F3", None], size=n, p=[0.45,0.35,0.15,0.05]),
        "PAI_PREF": rng.choice(["PAI Selected", None], size=n, p=[0.3,0.7]),
        "MIFID": rng.choice(["Yes","No", None], size=n, p=[0.55,0.4,0.05]),

        "GHG": rng.choice(["Yes","No","--", None], size=n, p=[0.25,0.65,0.05,0.05]),
        "Biodiversity": rng.choice(["Yes","No","--", None], size=n, p=[0.2,0.7,0.05,0.05]),
        "Water": rng.choice(["Yes","No","--", None], size=n, p=[0.22,0.68,0.05,0.05]),
        "Waste": rng.choice(["Yes","No","--", None], size=n, p=[0.18,0.72,0.05,0.05]),
        "Social": rng.choice(["Yes","No","--", None], size=n, p=[0.28,0.62,0.05,0.05]),
    })

    df["OFFERING_NAME"] = rng.choice(
        ["Core", "Standard", "ESG Plus", "SI Focus", "Core SI", "Income", "SI Sustainable"],
        size=n, p=[0.25,0.25,0.15,0.15,0.08,0.07,0.05]
    )
    return df

if DATA_PATH.exists():
    df_raw = pd.read_csv(DATA_PATH)
    print(f"Loaded: {DATA_PATH}  shape={df_raw.shape}")
else:
    df_raw = make_synthetic_data()
    print("DATA_PATH not found; using synthetic demo dataset.")
    print(f"shape={df_raw.shape}")

df_raw.head()

In [None]:
# Reproducible shuffle of raw rows (helps avoid ordering artifacts)
df_raw = df_raw.sample(frac=1, random_state=42).reset_index(drop=True)
df_raw.head()

---
## 2) Derive `si_offering` from name and aggregate to ID-level

We must operate at **ID-level** to avoid recommending the same client multiple times.

In [None]:
df = df_raw.copy()

# Row-level SI membership flag derived from OFFERING_NAME
df["si_offering_row"] = df["OFFERING_NAME"].astype(str).str.contains(r"\bSI\b", case=False, na=False).astype(int)

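def mode_or_first(s: pd.Series):
    """Most frequent non-null value in the group; NaN if the group has none."""
    s2 = s.dropna()
    if len(s2) == 0:
        return np.nan
    m = s2.mode()  # modes come back sorted, so ties resolve to the smallest value
    if len(m) > 0:
        return m.iloc[0]
    return s2.iloc[0]  # defensive fallback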

agg_dict = {
    "IO_TYPE": mode_or_first,
    "LIFE_CYCLE": mode_or_first,
    "SI_CONSIDERATION_CD": mode_or_first,
    "SFDR_PREF": mode_or_first,
    "SFDR_ACTUAL": mode_or_first,
    "PAI_PREF": mode_or_first,
    "MIFID": mode_or_first,
    "GHG": mode_or_first,
    "Biodiversity": mode_or_first,
    "Water": mode_or_first,
    "Waste": mode_or_first,
    "Social": mode_or_first,
    "si_offering_row": "max",
}

df_id = df.groupby("ID", as_index=False).agg(agg_dict).rename(columns={"si_offering_row":"si_offering"})

print("Row-level rows:", len(df))
print("ID-level rows :", len(df_id))
print("si_offering rate (ID-level):", df_id["si_offering"].mean().round(4))
df_id.head()

---
## 3) Cleaning rules & impact

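- Remove IDs with `IO_TYPE='zombie'`
- Keep only IDs with `LIFE_CYCLE='open'`
- Standardize missing values in `SI_CONSIDERATION_CD`, `SFDR_PREF`, and `SFDR_ACTUAL` to the string `nan`
- Convert the ESG topics, `MIFID`, and `PAI_PREF` to binary flags (anything other than an affirmative value, including missing and `--`, becomes 0)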

In [None]:
def yes_to_1(x):
    """Map "yes" (any case) to 1; everything else, including missing and "--", to 0."""
    if pd.isna(x):
        return 0
    return 1 if str(x).strip().lower() == "yes" else 0

def clean_id_level(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df = df[df["IO_TYPE"].fillna("").str.lower() != "zombie"]
    df = df[df["LIFE_CYCLE"].fillna("").str.lower() == "open"]

    for c in ["SI_CONSIDERATION_CD","SFDR_PREF","SFDR_ACTUAL"]:
        df[c] = df[c].astype("object").where(df[c].notna(), "nan")

    for c in ["GHG","Biodiversity","Water","Waste","Social"]:
        df[c] = df[c].apply(yes_to_1).astype(int)

    df["MIFID"] = df["MIFID"].apply(yes_to_1).astype(int)
    df["PAI_PREF"] = (df["PAI_PREF"].astype(str).str.lower() == "pai selected").astype(int)

    df["si_offering"] = df["si_offering"].astype(int)
    return df

df_clean = clean_id_level(df_id)

impact = pd.DataFrame({
    "stage": ["before", "after"],
    "rows": [len(df_id), len(df_clean)],
    "si_offering_rate": [df_id["si_offering"].mean(), df_clean["si_offering"].mean()]
})
impact

---
## 4) Feature engineering (including sfdr_gap)

We explicitly avoid any feature derived from OFFERING_NAME beyond the label `si_offering`.

In [None]:
MAP_SI = {"S1":1, "S2":2, "S3":3, "nan": np.nan}
MAP_SFDR = {"F1":1, "F2":2, "F3":3, "nan": np.nan}

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Missing codes fall back to level 1 (the lowest), so they add no opportunity signal
    df["SI_CONSIDERATION_num"] = df["SI_CONSIDERATION_CD"].map(MAP_SI).fillna(1).astype(int)
    df["SFDR_PREF_num"] = df["SFDR_PREF"].map(MAP_SFDR).fillna(1).astype(int)
    df["SFDR_ACTUAL_num"] = df["SFDR_ACTUAL"].map(MAP_SFDR).fillna(1).astype(int)

    df["sfdr_gap"] = np.clip(df["SFDR_PREF_num"] - df["SFDR_ACTUAL_num"], -2, 2)
    df["sfdr_opp"] = np.maximum(df["sfdr_gap"], 0)  # 0..2

    topic_cols = ["GHG","Biodiversity","Water","Waste","Social"]
    df["esg_topics_yes_cnt"] = df[topic_cols].sum(axis=1)
    df["esg_topics_yes_share"] = df["esg_topics_yes_cnt"] / len(topic_cols)

    # normalized (0..1)
    df["si_norm"] = np.clip((df["SI_CONSIDERATION_num"] - 1)/2, 0, 1)
    df["sfdr_norm"] = np.clip(df["sfdr_opp"]/2, 0, 1)
    df["topics_norm"] = np.clip(df["esg_topics_yes_share"], 0, 1)
    df["topics_if_pai"] = df["topics_norm"] * df["PAI_PREF"]  # topics only matter if PAI=1

    return df

df_feat = engineer_features(df_clean)
df_feat[["ID","si_offering","MIFID","SI_CONSIDERATION_num","sfdr_gap","PAI_PREF","esg_topics_yes_cnt"]].head()
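
To make `sfdr_gap` concrete, here is a minimal sketch on four hypothetical clients. It repeats the mapping and clipping logic from the cell above on toy values; the rows themselves are invented for illustration.

```python
import numpy as np
import pandas as pd

MAP_SFDR = {"F1": 1, "F2": 2, "F3": 3}

# Hypothetical clients: preferred vs. actual SFDR codes.
toy = pd.DataFrame({
    "SFDR_PREF":   ["F3", "F2", "F1", None],
    "SFDR_ACTUAL": ["F1", "F2", "F3", "F2"],
})
pref = toy["SFDR_PREF"].map(MAP_SFDR).fillna(1).astype(int)      # missing -> 1, as in the notebook
actual = toy["SFDR_ACTUAL"].map(MAP_SFDR).fillna(1).astype(int)

toy["sfdr_gap"] = np.clip(pref - actual, -2, 2)    # signed gap: preference minus actual
toy["sfdr_opp"] = np.maximum(toy["sfdr_gap"], 0)   # only an unmet preference counts
toy["sfdr_norm"] = toy["sfdr_opp"] / 2             # scaled to 0..1
print(toy)
```

Only the first client (prefers F3, holds F1) has a positive gap and therefore a non-zero `sfdr_norm`. Clients whose actual level meets or exceeds their preference, and the client with a missing preference, contribute 0.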

---
## 5) Fixed-weight rule score (branching)

**Logic:**
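- If `MIFID=0`: the score comes from the SI consideration level only (S1/S2/S3 map to 20/50/80, capped below 100 to avoid overconfidence)
- If `MIFID=1`: 70% SFDR opportunity + 30% PAI block (the PAI block is 0 unless `PAI_PREF=1`; with PAI selected it is 0.5 plus 0.5 × the share of ESG topics marked "Yes")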
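
As a worked example of the two branches, the sketch below scores one hypothetical client per branch. It uses the default values of `RuleConfig` from the cell that follows; the helper name `fixed_rule_score` and the two example clients are assumptions made for illustration.

```python
def fixed_rule_score(mifid, si_level, sfdr_norm, pai_pref, topics_norm,
                     w_sfdr=0.70, w_pai_block=0.30):
    """Score one client on a 0..100 scale using the branching rule (illustrative helper)."""
    if mifid == 0:
        # Branch A: SI consideration level only, capped at 80.
        return {1: 20.0, 2: 50.0, 3: 80.0}[si_level]
    # Branch B: the PAI block is 0 without a PAI preference, else 0.5..1.0 depending on topics.
    pai_block = 0.5 + 0.5 * topics_norm if pai_pref == 1 else 0.0
    return 100 * (w_sfdr * sfdr_norm + w_pai_block * pai_block)

# Hypothetical MiFID client: SFDR opportunity of 1 step out of 2, PAI selected, 2 of 5 topics.
score_b = fixed_rule_score(mifid=1, si_level=2, sfdr_norm=0.5, pai_pref=1, topics_norm=0.4)
# Hypothetical non-MiFID client with SI consideration level S3.
score_a = fixed_rule_score(mifid=0, si_level=3, sfdr_norm=0.0, pai_pref=0, topics_norm=0.0)
print(round(score_b, 2), score_a)
```

For the MiFID client the arithmetic is 100 × (0.70 × 0.5 + 0.30 × 0.7), which is about 56. The non-MiFID client at S3 receives the capped value of 80.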

In [None]:
@dataclass
class RuleConfig:
    # Branch A (MIFID=0): capped SI mapping to avoid overconfidence
    si_score_s1: float = 20
    si_score_s2: float = 50
    si_score_s3: float = 80
    # Branch B (MIFID=1)
    w_sfdr: float = 0.70
    w_pai_block: float = 0.30

cfg = RuleConfig()

def score_fixed_rule(df: pd.DataFrame, cfg: RuleConfig) -> pd.Series:
    si_score = df["SI_CONSIDERATION_num"].map({1: cfg.si_score_s1, 2: cfg.si_score_s2, 3: cfg.si_score_s3}).astype(float)
    pai_block = np.where(df["PAI_PREF"] == 1, 0.5 + 0.5*df["topics_norm"], 0.0)  # 0..1
    score_B = 100 * (cfg.w_sfdr * df["sfdr_norm"] + cfg.w_pai_block * pai_block)
    score = np.where(df["MIFID"]==1, score_B, si_score)
    return pd.Series(np.clip(score, 0, 100), index=df.index)

df_feat["score_fixed"] = score_fixed_rule(df_feat, cfg)
df_feat[["score_fixed"]].describe()

---
## 6) Train/Validation split (ID-level)

We use a stratified split on the proxy label `si_offering`.
The rule scores need no training, but we still evaluate them on the held-out validation set so that every approach is compared on the same IDs.

In [None]:
FEATURES = ["si_norm","sfdr_norm","PAI_PREF","topics_if_pai","esg_topics_yes_cnt","sfdr_gap","MIFID","SI_CONSIDERATION_num"]

X = df_feat[FEATURES].copy()
y = df_feat["si_offering"].copy()

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

idx_train = X_train.index
idx_val = X_val.index

print("Train size:", len(idx_train), "Val size:", len(idx_val))
print("Train si_offering rate:", y_train.mean().round(4), "Val si_offering rate:", y_val.mean().round(4))

---
## 7) Evaluation helpers (AUC/AP/Brier + lift-by-decile)

In [None]:
def eval_scores(y_true, p, label):
    out = {
        "model": label,
        "auc": roc_auc_score(y_true, p),
        "avg_precision": average_precision_score(y_true, p),
        "brier": brier_score_loss(y_true, p)
    }
    return out

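def lift_table(y_true, p, n_bins=10):
    """Proxy rate and count per score bin (bin 1 = lowest scores)."""
    tmp = pd.DataFrame({"y": y_true, "p": p})
    # duplicates="drop" merges tied bin edges, so heavily tied scores can yield fewer than n_bins bins
    tmp["bin"] = pd.qcut(tmp["p"], n_bins, labels=False, duplicates="drop") + 1
    tab = tmp.groupby("bin")["y"].agg(["mean","count"]).rename(columns={"mean":"si_rate"})
    return tab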

def plot_lift(tab, title):
    plt.figure(figsize=(8,4))
    plt.plot(tab.index, tab["si_rate"].values, marker="o")
    plt.title(title)
    plt.xlabel("Decile (1=lowest score, 10=highest)")
    plt.ylabel("si_offering rate")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
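
The table above reports the raw `si_offering` rate per bin. If a lift ratio is easier to communicate, one option is to divide each bin's rate by the overall rate. The sketch below does this on hypothetical labels and scores; the `lift` column is an addition for illustration, not part of `lift_table`.

```python
import numpy as np
import pandas as pd

# Hypothetical labels and scores, for illustration only.
y = np.array([0, 0, 0, 0, 0, 1, 0, 1, 1, 1])
p = np.linspace(0.05, 0.95, 10)

tmp = pd.DataFrame({"y": y, "p": p})
tmp["bin"] = pd.qcut(tmp["p"], 5, labels=False, duplicates="drop") + 1
tab = tmp.groupby("bin")["y"].agg(["mean", "count"]).rename(columns={"mean": "si_rate"})

base_rate = tmp["y"].mean()                # overall proxy rate
tab["lift"] = tab["si_rate"] / base_rate   # above 1 means the bin beats the average
print(tab)
```

A lift above 1 in the top bins and below 1 in the bottom bins indicates that the score ranks the proxy label in the intended direction.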

---
## 8) Fixed-weight rule — validation performance

In [None]:
p_fixed_val = (df_feat.loc[idx_val, "score_fixed"].values / 100.0)
fixed_metrics = eval_scores(y_val.values, p_fixed_val, "Fixed-weight rule")
fixed_metrics

In [None]:
lift_fixed = lift_table(y_val.values, p_fixed_val, n_bins=10)
lift_fixed

In [None]:
plot_lift(lift_fixed, "Lift (proxy): Fixed-weight rule")

---
## 9) Data-driven rule (learn weights on train, apply on validation)

We learn weights with a regularized logistic regression on the train split, then convert the coefficients to weights that sum to 100: negative coefficients are set to 0 and the remaining ones are rescaled.
This keeps the scoring **rule-like** and stakeholder-friendly.
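
The coefficient-to-weight conversion can be sketched on made-up numbers as follows. The coefficient values are hypothetical and were not fitted on any data; only the feature names come from this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients (not fitted values) for the four rule features.
coef = pd.Series({"si_norm": 2.0, "sfdr_norm": 1.0, "PAI_PREF": -0.5, "topics_if_pai": 1.0})

pos = np.maximum(coef.values, 0)   # negative coefficients contribute zero weight
weights = pd.Series(100 * pos / pos.sum(), index=coef.index)
print(weights)
```

Here the weights come out as 50, 25, 0, and 25. A feature with a negative coefficient is dropped from the rule rather than subtracting from the score, which is a simplification worth flagging to stakeholders.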

In [None]:
# Data-driven rule features (keep aligned to your branching logic)
DD_FEATURES = ["si_norm","sfdr_norm","PAI_PREF","topics_if_pai"]

lr_dd = LogisticRegression(max_iter=2000, class_weight="balanced")
lr_dd.fit(X_train[DD_FEATURES], y_train)

coef = pd.Series(lr_dd.coef_[0], index=DD_FEATURES).sort_values(key=np.abs, ascending=False)
coef

In [None]:
def coef_to_100_weights(coef_series):
    pos = np.maximum(coef_series.values, 0)
    if pos.sum() == 0:
        pos = np.ones_like(pos)
    w = 100 * pos / pos.sum()
    return pd.Series(w, index=coef_series.index).sort_values(ascending=False)

w_dd = coef_to_100_weights(coef)
w_dd

In [None]:
# Rule-like score: weights sum to 100, features are 0..1, so score is 0..100
df_feat["score_datadriven"] = (
    w_dd["si_norm"] * df_feat["si_norm"] +
    w_dd["sfdr_norm"] * df_feat["sfdr_norm"] +
    w_dd["PAI_PREF"] * df_feat["PAI_PREF"] +
    w_dd["topics_if_pai"] * df_feat["topics_if_pai"]
).clip(0,100)

p_dd_val = df_feat.loc[idx_val, "score_datadriven"].values / 100.0
dd_metrics = eval_scores(y_val.values, p_dd_val, "Data-driven rule (LR weights)")
dd_metrics

In [None]:
lift_dd = lift_table(y_val.values, p_dd_val, n_bins=10)
plot_lift(lift_dd, "Lift (proxy): Data-driven rule")
lift_dd

---
## 10) ML model (validation): Calibrated Logistic Regression

We use **Calibrated Logistic Regression** as the ML benchmark because it is:
- Strong and stable for tabular preference data
- Interpretable (coefficients can be inspected)
- Calibrated, so its probabilities support thresholding and bucket definitions

We fit on the **train** split and evaluate on the **validation** split.

In [None]:
# Feature set for ML
ML_FEATURES = ["si_norm","sfdr_norm","PAI_PREF","topics_if_pai","esg_topics_yes_cnt","sfdr_gap","MIFID","SI_CONSIDERATION_num"]

Xtr = X_train[ML_FEATURES].copy()
Xva = X_val[ML_FEATURES].copy()

# Calibrated Logistic Regression (isotonic calibration)
lr = LogisticRegression(max_iter=3000, class_weight="balanced")
cal_lr = CalibratedClassifierCV(lr, method="isotonic", cv=5)
cal_lr.fit(Xtr, y_train)

p_lr = cal_lr.predict_proba(Xva)[:,1]

results = [
    fixed_metrics,
    dd_metrics,
    eval_scores(y_val.values, p_lr, "ML: Calibrated Logistic Regression"),
]

pd.DataFrame(results).sort_values("auc", ascending=False).round(4)
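
The Brier score summarizes calibration in one number. A reliability check with `calibration_curve` (imported in Setup) can show where predicted and observed rates diverge. The sketch below runs on synthetic stand-in data from `make_classification`; in this notebook you would pass `y_val` and `p_lr` instead.

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the real inputs would be Xtr, Xva, y_train, y_val.
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.7, 0.3], random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

model = CalibratedClassifierCV(
    LogisticRegression(max_iter=3000, class_weight="balanced"), method="isotonic", cv=5
)
model.fit(X_tr, y_tr)
p_va = model.predict_proba(X_va)[:, 1]

# Observed positive rate vs. mean predicted probability per quantile bin.
prob_true, prob_pred = calibration_curve(y_va, p_va, n_bins=10, strategy="quantile")
max_gap = np.abs(prob_true - prob_pred).max()
print(f"bins={len(prob_true)}  max |observed - predicted| = {max_gap:.3f}")
```

Plotting `prob_pred` against `prob_true` alongside the diagonal gives the usual reliability diagram. Large gaps in the top bins matter most here, since those bins define the High bucket.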

---
## 11) Recommendation

For this use case, the recommended ML option is **Calibrated Logistic Regression** (the only ML model benchmarked here):
- It performs well on tabular preference data
- It stays explainable (coefficients can be reviewed)
- Calibration provides probability-like outputs suitable for percentile buckets

Operationally, you can deploy with the **Fixed-weight rule** or the **Data-driven rule** and treat ML as an enhancement only if it improves validation lift over them.

---
## 12) Operational output: target IDs (si_offering=0) ranked by chosen score

Choose a score for ranking:
- `score_fixed` (fixed rule)
- `score_datadriven` (data-driven rule)
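- `p_lr` (Calibrated Logistic Regression probability; computed only for validation rows in this notebook, so it is not a `df_feat` column and cannot be used as `RANK_SCORE` until all IDs are scored)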

In [None]:
# For production, you would refit the chosen model on the full dataset.
# Here, we demonstrate ranking using rule scores (available for all rows).

RANK_SCORE = "score_datadriven"  # change to "score_fixed" if you prefer

df_out = df_feat.copy()

# Restrict to the eligible pool first, so percentiles and buckets are relative to si_offering=0 IDs
targets = df_out[df_out["si_offering"]==0].copy()
targets["score_percentile"] = (targets[RANK_SCORE].rank(pct=True) * 100).round(2)
targets["bucket_3"] = pd.cut(targets["score_percentile"], bins=[-0.01, 50, 80, 100], labels=["Low","Average","High"])
targets = targets.sort_values(RANK_SCORE, ascending=False)

cols = ["ID", RANK_SCORE, "score_percentile", "bucket_3", "MIFID","SI_CONSIDERATION_num","sfdr_gap","PAI_PREF","esg_topics_yes_cnt"]
targets[cols].head(20)
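
If you want to rank by the ML probability rather than a rule score, every ID needs a prediction, not only the validation rows. One possible approach, sketched below on synthetic stand-in data, is to use out-of-fold predictions from `cross_val_predict` so that no ID is scored by a model fitted on it. The column name `p_lr_oof` is hypothetical, and this is an alternative to the full refit mentioned in the cell above, not a replacement for it.

```python
import numpy as np
import pandas as pd
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for df_feat[ML_FEATURES] and df_feat["si_offering"].
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.7, 0.3], random_state=0)
toy = pd.DataFrame({"ID": np.arange(len(y)), "si_offering": y})

model = CalibratedClassifierCV(
    LogisticRegression(max_iter=3000, class_weight="balanced"), method="isotonic", cv=5
)
# Each ID is scored by a model that did not see it during fitting.
toy["p_lr_oof"] = cross_val_predict(model, X, y, cv=5, method="predict_proba")[:, 1]

# Rank only the eligible pool (si_offering = 0).
targets = toy[toy["si_offering"] == 0].sort_values("p_lr_oof", ascending=False)
print(targets.head())
```

Because the label is still the proxy, a high out-of-fold probability means "looks like current SI members", which is the same caveat that applies to the rule scores.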

---
## 13) Pilot plan (create true labels)

Because `si_offering` is a proxy, validate value with a pilot:
- Treatment: High bucket (top 20% of `si_offering=0`)
- Control: random sample from eligible pool (or next bucket)
- Outcomes: response, meeting booked, adoption, pipeline created
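
A minimal sketch of the group assignment, assuming the control group is a random sample of the same size drawn from the eligible IDs outside the treatment group. The pool, scores, and the `pilot_group` column are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical eligible pool (si_offering = 0) with a ranking score.
pool = pd.DataFrame({"ID": np.arange(1, 1001), "score": rng.uniform(0, 100, size=1000)})

# Treatment: top 20% of the eligible pool by score.
cutoff = pool["score"].quantile(0.80)
treatment = pool[pool["score"] > cutoff]

# Control: random sample of the same size from the rest of the pool (an assumption).
control = pool.drop(treatment.index).sample(n=len(treatment), random_state=42)

pool["pilot_group"] = "none"
pool.loc[treatment.index, "pilot_group"] = "treatment"
pool.loc[control.index, "pilot_group"] = "control"
print(pool["pilot_group"].value_counts())
```

Fixing the random seed keeps the assignment reproducible, which makes it easier to audit the pilot once outcomes are collected.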

After pilot: retrain on true outcomes and re-evaluate Fixed vs Data-driven vs ML.