# SI Opportunity Scoring — Stakeholder Notebook (v7)
**Three methods, one evaluation framework:**  
1) **Fixed rule** (explicit business logic) →  
2) **Weighted rule** (same logic, weights learned from data) →  
3) **ML** (Calibrated Logistic Regression, controlled enhancement)

**Purpose:** rank clients with **`si_offering = 0`** by likelihood of SI interest.

---

## How to read this notebook
Each section follows the same decision-making template:

1) **Question** — what are we trying to decide?  
2) **Approach** — what logic do we apply?  
3) **Evidence** — charts/tables we use to verify it works.  
4) **Decision** — what we do next based on the evidence.

The notebook is structured so that the **results lead to the next step**; code exists only to generate those results.

---

## Proxy target (benchmark only)
We create a proxy label `si_offering` from `OFFERING_NAME` containing the token **SI**.

> This is *not* the final truth label (membership ≠ interest).  
> We use it only to compare methods consistently until a pilot creates true outcomes (responses/adoption).
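
As a minimal sketch of the token match, assuming a word-boundary, case-insensitive regular expression like the one in the labeling cell of section 3: the helper name `has_si_token` is hypothetical, and "Basis" is a made-up offering name included only to show a non-match.

```python
import re

# Word-boundary match: "SI" must stand alone as a token, not sit inside another word.
SI_TOKEN = re.compile(r"\bSI\b", flags=re.IGNORECASE)

def has_si_token(offering_name):
    """Return 1 if the offering name contains SI as a standalone token, else 0."""
    if not isinstance(offering_name, str):
        return 0  # missing names count as "no SI membership"
    return int(bool(SI_TOKEN.search(offering_name)))

names = ["Core SI", "SI Focus", "SI Sustainable", "ESG Plus", "Basis", None]
labels = {str(name): has_si_token(name) for name in names}
print(labels)
```

Because the match is case-insensitive, a lowercase standalone "si" would also count as a hit; that may or may not be desirable on real offering names.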

---

## Business logic (requested)
- **If `MIFID = 0`:** use **only** `SI_CONSIDERATION` (S1=no, S2=some, S3=high).  
- **If `MIFID = 1`:** use SFDR/PAI/Taxonomy **only when** `SI_CONSIDERATION` is **high (S3)**.  
  Otherwise keep a conservative baseline.

We implement that structure for Fixed + Weighted rules, and we respect the same structure in ML using interaction (“gate”) features.

**Last updated:** 2026-02-27

## 0) Setup

### What this section does
- Imports the libraries used throughout (pandas/numpy for data handling, sklearn for evaluation and ML).
- Sets display options for readable tables.

### Decision checkpoint
No decisions here—this is only to ensure the notebook runs reproducibly.

In [None]:
import numpy as np
import pandas as pd
from dataclasses import dataclass
from pathlib import Path

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import CalibratedClassifierCV, calibration_curve
from sklearn.metrics import roc_auc_score, average_precision_score, brier_score_loss

import matplotlib.pyplot as plt
pd.set_option("display.max_columns", 200)

## 1) Load raw data + shuffle rows

### Question
Do we have the raw fields we need, and is the dataset order influencing anything?

### Approach
- Load the raw dataset (or a synthetic demo if the file is not found).
- **Shuffle rows** (`sample(frac=1, random_state=42)`) to remove accidental ordering artifacts.
  - Many operational extracts are sorted by ID or date; shuffling makes splits and charts more robust.

### Evidence
We display the head of the dataframe to confirm columns and rough values.

### Decision
Proceed to cleaning/filtering once required raw columns are present.

In [None]:
DATA_PATH = Path("data.csv")  # <-- change to your real file path

def make_synthetic_data(n=9000, seed=42):
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "ID": rng.integers(1, n//2 + 1, size=n),
        "IO_TYPE": rng.choice(["normal", "zombie"], size=n, p=[0.97, 0.03]),
        "LIFE_CYCLE": rng.choice(["open", "closed"], size=n, p=[0.9, 0.1]),
        "OFFERING_NAME": rng.choice(
            ["Core", "Standard", "ESG Plus", "SI Focus", "Core SI", "Income", "SI Sustainable", None],
            size=n, p=[0.23,0.23,0.14,0.14,0.08,0.07,0.06,0.05]
        ),
        "SI_CONSIDERATION_CD": rng.choice(["S1","S2","S3", None], size=n, p=[0.35,0.35,0.2,0.1]),
        "SFDR_PREF": rng.choice(["F1","F2","F3", None], size=n, p=[0.4,0.35,0.2,0.05]),
        "SFDR_ACTUAL": rng.choice(["F1","F2","F3", None], size=n, p=[0.45,0.35,0.15,0.05]),
        "PAI_PREF": rng.choice(["PAI Selected", None], size=n, p=[0.3,0.7]),
        "MIFID": rng.choice(["Yes","No", None], size=n, p=[0.55,0.4,0.05]),
        "TAXONOMYPREF": rng.choice(["A1","A2","A3", None], size=n, p=[0.5,0.35,0.1,0.05]),
        "GHG": rng.choice(["Yes","No","--", None], size=n, p=[0.25,0.65,0.05,0.05]),
        "Biodiversity": rng.choice(["Yes","No","--", None], size=n, p=[0.2,0.7,0.05,0.05]),
        "Water": rng.choice(["Yes","No","--", None], size=n, p=[0.22,0.68,0.05,0.05]),
        "Waste": rng.choice(["Yes","No","--", None], size=n, p=[0.18,0.72,0.05,0.05]),
        "Social": rng.choice(["Yes","No","--", None], size=n, p=[0.28,0.62,0.05,0.05]),
    })

if DATA_PATH.exists():
    df_raw = pd.read_csv(DATA_PATH)
    print(f"Loaded: {DATA_PATH}  shape={df_raw.shape}")
else:
    df_raw = make_synthetic_data()
    print("DATA_PATH not found; using synthetic demo dataset.")
    print(f"shape={df_raw.shape}")

df_raw = df_raw.sample(frac=1, random_state=42).reset_index(drop=True)
df_raw.head()

## 2) Cleaning & filtering (ONLY)

### Question
Which rows are actionable, and what is the effect of removing non-actionable records?

### Approach (data hygiene only)
We **only**:
- Trim whitespace in text fields
- Convert placeholder missing values (`"--"`, empty string) to nulls
- Filter out:
  - `IO_TYPE = zombie`
  - `LIFE_CYCLE != open`

**Important:**  
We do **not** map categories to numbers here (no Yes/No→1/0, no A1/A2/A3→1/2/3).  
That separation makes the process easier to review and approve.
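
The three hygiene steps can be illustrated on a toy frame. This is a sketch under the assumption that text columns hold plain strings; the four rows are invented for illustration and the variable names (`toy`, `kept`) are not part of the notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "IO_TYPE": ["normal", " zombie ", "normal", "normal"],
    "LIFE_CYCLE": ["open", "open", "closed", "open"],
    "GHG": ["Yes", "No", "--", ""],
})

# 1) Trim whitespace in text columns only.
text_cols = ["IO_TYPE", "LIFE_CYCLE", "GHG"]
for col in text_cols:
    toy[col] = toy[col].str.strip()

# 2) Placeholders become nulls; real values stay as raw text (no numeric mapping yet).
toy[text_cols] = toy[text_cols].replace({"--": np.nan, "": np.nan})

# 3) Keep actionable rows only: not zombie, lifecycle open.
kept = toy[(toy["IO_TYPE"].str.lower() != "zombie")
           & (toy["LIFE_CYCLE"].str.lower() == "open")]
print(kept)
```

IDs 1 and 4 survive; ID 4 keeps a null `GHG`, which stays visible in the missingness snapshot instead of being silently encoded.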

### Evidence
- A small summary table: rows before/after each filter
- A bar chart showing the impact of filters
- Missingness snapshot (still in raw form)

### Decision checkpoint
If cleaning removes too many rows (or shifts the proxy target rate heavily), revisit filters or upstream data quality.

**How to interpret the outputs below**
- The row-count chart should show a reasonable drop from removing zombies/closed lifecycle.
- If the drop is extreme, we may be filtering too aggressively (or upstream data needs fixes).
- Missingness helps decide which fields can safely influence scoring; high-missingness fields should carry lower weight (or require a missing-safe design).

In [None]:
REQUIRED_RAW = [
    "ID","IO_TYPE","LIFE_CYCLE","OFFERING_NAME",
    "SI_CONSIDERATION_CD","SFDR_PREF","SFDR_ACTUAL","PAI_PREF","MIFID","TAXONOMYPREF",
    "GHG","Biodiversity","Water","Waste","Social"
]
missing_cols = [c for c in REQUIRED_RAW if c not in df_raw.columns]
if missing_cols:
    raise ValueError(f"Missing required raw columns: {missing_cols}")

def clean_filter_only(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # strip whitespace for object cols
    for c in df.columns:
        if df[c].dtype == "object":
            df[c] = df[c].apply(lambda x: x.strip() if isinstance(x, str) else x)

    # standardize placeholder missing
    df = df.replace({"--": np.nan, "": np.nan})

    before = len(df)
    df = df[df["IO_TYPE"].fillna("").str.lower() != "zombie"]
    after_zombie = len(df)
    df = df[df["LIFE_CYCLE"].fillna("").str.lower() == "open"]
    after_open = len(df)

    df.attrs["cleaning_summary"] = {
        "before": before,
        "after_remove_zombie": after_zombie,
        "after_keep_open": after_open,
        "removed_zombie": before - after_zombie,
        "removed_closed": after_zombie - after_open
    }
    return df

df_clean_rows = clean_filter_only(df_raw)
pd.DataFrame([df_clean_rows.attrs["cleaning_summary"]])

In [None]:
# Charts: cleaning impact + missingness snapshot
s = df_clean_rows.attrs["cleaning_summary"]
plt.figure(figsize=(8,4))
plt.bar(["Before", "After zombie", "After open"], [s["before"], s["after_remove_zombie"], s["after_keep_open"]])
plt.title("Cleaning impact: rows remaining after filters")
plt.ylabel("Rows")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

key_cols = ["SI_CONSIDERATION_CD","SFDR_PREF","SFDR_ACTUAL","TAXONOMYPREF","MIFID","PAI_PREF",
            "GHG","Biodiversity","Water","Waste","Social","OFFERING_NAME"]
miss = df_clean_rows[key_cols].isna().mean().sort_values(ascending=False)
display(miss.to_frame("missing_rate").head(12))

plt.figure(figsize=(10,4))
plt.bar(miss.index[:12], miss.values[:12])
plt.title("Top missing rates after cleaning (raw values, before encoding)")
plt.ylabel("Missing rate")
plt.xticks(rotation=45, ha="right")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

## 3) Derive proxy label and aggregate to ID-level

### Question
What is the unit of action—rows or clients—and how do we create a benchmark label?

### Approach
1) Create `si_offering_row` from `OFFERING_NAME` containing token **SI**.
2) Aggregate to a **single record per ID** (because we recommend clients, not rows).
   - `si_offering = max(si_offering_row)` per ID.
   - For preference fields, we use the **mode** (most frequent value) for transparency; ties resolve to the first tied value in sorted order.
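
A small sketch of the row-to-ID aggregation on invented rows. The helper `mode_value` is a hypothetical stand-in for the notebook's own aggregation helper; it relies on `Series.mode()` returning tied values in sorted order.

```python
import pandas as pd

rows = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "si_offering_row": [0, 1, 0, 0, 0],
    "SFDR_PREF": ["F2", "F3", "F2", "F3", "F1"],
})

def mode_value(s):
    """Most frequent non-null value; ties go to the first value in sorted order."""
    m = s.dropna().mode()
    return m.iloc[0] if len(m) else None

per_id = rows.groupby("ID", as_index=False).agg(
    si_offering=("si_offering_row", "max"),   # any SI row makes the client SI
    SFDR_PREF=("SFDR_PREF", mode_value),
)
print(per_id)
```

ID 1 gets `si_offering = 1` because one of its rows is an SI offering. ID 2 has a tie between F1 and F3, which resolves to F1, so a tie pulls the preference toward the lower tier.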

### Evidence
- Size comparison: cleaned row-level vs ID-level
- Proxy label prevalence (`si_offering` rate)

### Decision checkpoint
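If the proxy label rate is very low, AUC can look strong even when the top of the ranking is weak, so AP and top-decile lift become more informative than AUC. If the rate is very high, there is little headroom for lift (maximum lift is 1 / prevalence).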

In [None]:
df = df_clean_rows.copy()

# Row-level SI membership derived from OFFERING_NAME
df["si_offering_row"] = df["OFFERING_NAME"].astype(str).str.contains(r"\bSI\b", case=False, na=False).astype(int)

def mode_or_first(s: pd.Series):
    s2 = s.dropna()
    if len(s2) == 0:
        return np.nan
    m = s2.mode()
    return m.iloc[0] if len(m) else s2.iloc[0]

agg_dict = {
    "IO_TYPE": mode_or_first,
    "LIFE_CYCLE": mode_or_first,
    "OFFERING_NAME": mode_or_first,  # transparency only (not used as feature)
    "SI_CONSIDERATION_CD": mode_or_first,
    "SFDR_PREF": mode_or_first,
    "SFDR_ACTUAL": mode_or_first,
    "PAI_PREF": mode_or_first,
    "MIFID": mode_or_first,
    "TAXONOMYPREF": mode_or_first,
    "GHG": mode_or_first,
    "Biodiversity": mode_or_first,
    "Water": mode_or_first,
    "Waste": mode_or_first,
    "Social": mode_or_first,
    "si_offering_row": "max",
}

df_id = df.groupby("ID", as_index=False).agg(agg_dict).rename(columns={"si_offering_row":"si_offering"})

sizes = pd.DataFrame({
    "level": ["row-level (cleaned)", "ID-level"],
    "rows": [len(df), len(df_id)],
    "si_offering_rate": [df["si_offering_row"].mean(), df_id["si_offering"].mean()]
})
display(sizes)

plt.figure(figsize=(7,4))
plt.bar(sizes["level"], sizes["rows"])
plt.title("Size: cleaned rows vs ID-level")
plt.ylabel("Count")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

plt.figure(figsize=(7,4))
plt.bar(sizes["level"], sizes["si_offering_rate"])
plt.title("Proxy label prevalence: si_offering rate")
plt.ylabel("Rate")
plt.ylim(0, 1)
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

df_id.head()

## 4) Feature encoding & engineering (numbers happen here)

### Question
How do we convert preference fields into stable, explainable numeric signals?

### Approach
We transform raw fields into a small set of model-ready, interpretable signals:

**Encodings**
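- `SI_CONSIDERATION_CD`: S1/S2/S3 → 1/2/3 (missing → 1, the lowest tier)
- `SFDR_PREF`, `SFDR_ACTUAL`: F1/F2/F3 → 1/2/3 (missing → 1)
- `TAXONOMYPREF`: A1/A2/A3 → 1/2/3 (missing → 1)
- Topics (`GHG`, `Biodiversity`, `Water`, `Waste`, `Social`): Yes → 1; No or missing → 0
- `MIFID`: Yes → 1; No or missing → 0
- `PAI_PREF`: “PAI Selected” → 1 else 0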

**Engineered features**
- `sfdr_gap = clip(SFDR_PREF - SFDR_ACTUAL, -2, 2)`
- `sfdr_opp = max(sfdr_gap, 0)`  
  Only positive gap is treated as opportunity (preference exceeds current state).
- `pai_block` (0..1):  
  0 if no PAI, else `0.5 + 0.5*topics_norm`  
  (PAI selection plus intensity of ESG topic choices)
- `tax_norm` scales A1/A2/A3 → 0..1
- `si_is_s3` is a helper flag for the requested gate
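
The formulas above can be worked through for a single client. This is a plain-Python sketch that takes already-encoded inputs; the function name `engineer` and its argument names are illustrative, not part of the notebook.

```python
def engineer(sfdr_pref, sfdr_actual, tax_pref, pai_selected, topics_yes, n_topics=5):
    """Compute the engineered signals for one client from already-encoded inputs."""
    sfdr_gap = max(-2, min(2, sfdr_pref - sfdr_actual))  # clip to [-2, 2]
    sfdr_opp = max(sfdr_gap, 0)                          # only a positive gap counts
    topics_norm = topics_yes / n_topics
    return {
        "sfdr_gap": sfdr_gap,
        "sfdr_norm": sfdr_opp / 2,                       # 0, 0.5 or 1
        "tax_norm": (tax_pref - 1) / 2,                  # A1/A2/A3 -> 0, 0.5, 1
        "pai_block": 0.5 + 0.5 * topics_norm if pai_selected else 0.0,
    }

# Prefers F3, currently F1, taxonomy A2, PAI selected with 3 of 5 topics.
high = engineer(3, 1, 2, True, 3)
# Preference below current state: the negative gap is not treated as opportunity.
low = engineer(1, 3, 1, False, 4)
print(high)
print(low)
```

The second client shows two design consequences: a negative gap contributes nothing, and topic selections are ignored entirely when PAI is not selected.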

### Evidence
We plot distributions (e.g., sfdr_gap and taxonomy) to confirm:
- values are in expected ranges
- signals are not mostly missing/zero

### Decision checkpoint
If a key signal is almost always missing/zero, it should not carry much weight (rule or ML).

In [None]:
MAP_SI = {"S1":1, "S2":2, "S3":3}
MAP_SFDR = {"F1":1, "F2":2, "F3":3}
MAP_TAX = {"A1":1, "A2":2, "A3":3}

def yes_to_1(x):
    return 1 if isinstance(x, str) and x.strip().lower() == "yes" else 0

def encode_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()

    # Binary fields
    for c in ["GHG","Biodiversity","Water","Waste","Social"]:
        df[c] = df[c].apply(yes_to_1).astype(int)

    df["MIFID"] = df["MIFID"].apply(yes_to_1).astype(int)
    df["PAI_PREF"] = (df["PAI_PREF"].astype(str).str.lower() == "pai selected").astype(int)

    # Ordinals (conservative default to lowest tier)
    df["SI_CONSIDERATION_num"] = df["SI_CONSIDERATION_CD"].map(MAP_SI).fillna(1).astype(int)
    df["SFDR_PREF_num"] = df["SFDR_PREF"].map(MAP_SFDR).fillna(1).astype(int)
    df["SFDR_ACTUAL_num"] = df["SFDR_ACTUAL"].map(MAP_SFDR).fillna(1).astype(int)
    df["TAXONOMYPREF_num"] = df["TAXONOMYPREF"].map(MAP_TAX).fillna(1).astype(int)

    # SFDR engineered
    df["sfdr_gap"] = np.clip(df["SFDR_PREF_num"] - df["SFDR_ACTUAL_num"], -2, 2)
    df["sfdr_opp"] = np.maximum(df["sfdr_gap"], 0)  # 0..2

    # Topics aggregate
    topic_cols = ["GHG","Biodiversity","Water","Waste","Social"]
    df["esg_topics_yes_cnt"] = df[topic_cols].sum(axis=1)
    df["topics_norm"] = df["esg_topics_yes_cnt"] / len(topic_cols)

    # Normalized signals (0..1)
    df["si_norm"] = np.clip((df["SI_CONSIDERATION_num"] - 1)/2, 0, 1)
    df["sfdr_norm"] = np.clip(df["sfdr_opp"]/2, 0, 1)
    df["tax_norm"] = np.clip((df["TAXONOMYPREF_num"] - 1)/2, 0, 1)

    # PAI block (0..1)
    df["pai_block"] = np.where(df["PAI_PREF"]==1, 0.5 + 0.5*df["topics_norm"], 0.0)

    # Business logic helper
    df["si_is_s3"] = (df["SI_CONSIDERATION_num"] == 3).astype(int)

    df["si_offering"] = df["si_offering"].astype(int)
    return df

df_feat = encode_features(df_id)

# Sanity-check distributions
plt.figure(figsize=(7,4))
plt.hist(df_feat["sfdr_gap"], bins=5)
plt.title("Distribution: sfdr_gap (clipped)")
plt.xlabel("sfdr_gap")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

plt.figure(figsize=(7,4))
plt.hist(df_feat["TAXONOMYPREF_num"], bins=3)
plt.title("Distribution: TAXONOMYPREF_num (A1/A2/A3 -> 1/2/3)")
plt.xlabel("TAXONOMYPREF_num")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

df_feat[["ID","si_offering","MIFID","SI_CONSIDERATION_num","si_is_s3","sfdr_gap","PAI_PREF","TAXONOMYPREF_num","esg_topics_yes_cnt"]].head()

## 5) Train/Validation split (ID-level)

### Question
How do we evaluate whether a method will generalize to new clients?

### Approach
We create a held-out **validation** split (stratified by `si_offering`) so each method is judged fairly:
- Train: used to learn data-driven weights and fit ML
- Validation: used only for evaluation

### Evidence
We print:
- split sizes
- proxy label rates in train vs validation (should be similar)

### Decision checkpoint
If label rates differ strongly, the split may not be stratified correctly or the dataset is too small.

In [None]:
FEATURES_BASE = ["MIFID","SI_CONSIDERATION_num","si_is_s3","sfdr_norm","pai_block","tax_norm"]

X = df_feat[FEATURES_BASE].copy()
y = df_feat["si_offering"].copy()

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
idx_train, idx_val = X_train.index, X_val.index

print("Train:", len(idx_train), "Val:", len(idx_val))
print("Train si_offering rate:", y_train.mean().round(4))
print("Val   si_offering rate:", y_val.mean().round(4))

## 6) Evaluation helpers (metrics + lift + calibration)

### What we report (stakeholder meaning)
- **AUC:** ranking quality overall (higher is better)
- **Average Precision (AP):** ranking quality when positives are rare (often more relevant than AUC)
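- **Brier score:** mean squared error of the predicted probability against the 0/1 outcome (lower is better). For the rule scores we use `score / 100` as a stand-in probability, so their Brier values are indicative only.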

### What we visualize
- **Lift by decile:** “Does the top decile have a meaningfully higher SI rate than the bottom?”
- **Calibration curve:** “If we treat a score as a probability, is it reliable?”

These artifacts support a key stakeholder question:
> “If we work the top bucket, do we see a higher hit rate than baseline?”
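
To make the two headline quantities concrete, here is a hand-rolled sketch of the Brier score and of top-bucket lift on invented toy data. The function names are hypothetical, and the notebook itself uses scikit-learn and a quantile-binned lift table rather than these helpers.

```python
def brier(y_true, p):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y_true, p)) / len(y_true)

def top_bucket_lift(y_true, p, top_frac=0.1):
    """Hit rate in the top-scored fraction divided by the overall hit rate."""
    ranked = sorted(zip(p, y_true), key=lambda pair: pair[0], reverse=True)
    k = max(1, round(len(ranked) * top_frac))
    top_rate = sum(y for _, y in ranked[:k]) / k
    base_rate = sum(y_true) / len(y_true)
    return top_rate / base_rate

# Toy data: 10 clients, 3 positives, scores loosely aligned with the label.
y = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
p = [0.9, 0.8, 0.7, 0.4, 0.3, 0.3, 0.6, 0.2, 0.1, 0.1]

brier_value = brier(y, p)
lift_value = top_bucket_lift(y, p, top_frac=0.2)
print("Brier:", round(brier_value, 4))
print("Top-20% lift:", round(lift_value, 2))
```

A lift above 1 means the top bucket has a higher hit rate than the population as a whole. The sketch does not guard against an overall hit rate of zero.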

In [None]:
def eval_scores(y_true, p, label):
    return {
        "model": label,
        "auc": roc_auc_score(y_true, p),
        "avg_precision": average_precision_score(y_true, p),
        "brier": brier_score_loss(y_true, p)
    }

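def lift_table(y_true, p, n_bins=10):
    # SI rate and row count per score-quantile bin (bin 1 = lowest scores).
    # duplicates="drop" merges bins with identical edges, so heavily tied scores
    # (typical for rule-based scores) can yield fewer than n_bins bins.
    tmp = pd.DataFrame({"y": y_true, "p": p})
    tmp["bin"] = pd.qcut(tmp["p"], n_bins, labels=False, duplicates="drop") + 1
    return tmp.groupby("bin")["y"].agg(["mean","count"]).rename(columns={"mean":"si_rate"})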

def plot_lift(tab, title):
    plt.figure(figsize=(8,4))
    plt.plot(tab.index, tab["si_rate"].values, marker="o")
    plt.title(title)
    plt.xlabel("Score bin (1=lowest score; 10 bins unless ties merge bins)")
    plt.ylabel("si_offering rate")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

def plot_calibration(y_true, p, title):
    prob_true, prob_pred = calibration_curve(y_true, p, n_bins=10, strategy="quantile")
    plt.figure(figsize=(6,6))
    plt.plot(prob_pred, prob_true, marker="o")
    plt.plot([0,1],[0,1], linestyle="--")
    plt.title(title)
    plt.xlabel("Predicted")
    plt.ylabel("Observed si_offering rate")
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()

## 7) Method 1 — Fixed rule (stakeholder weights)

### Question
Can a transparent rule-based score create meaningful lift without any learning?

### Approach (your requested logic)
- If `MIFID = 0`: score depends **only** on `SI_CONSIDERATION`
- If `MIFID = 1`: start from an SI baseline, and **only when `SI = S3`** add confirmation from:
  - SFDR opportunity (`sfdr_norm`)
  - PAI confirmation (`pai_block`)
  - Taxonomy preference (`tax_norm`)

We also create **“why columns”** showing component contributions (base, SFDR, PAI, tax).
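
The gate and the “why columns” can be traced for one client at a time. The sketch below mirrors the default values of `FixedRuleConfig` from the code cell in this section; the function name `fixed_score` and the dictionary layout are illustrative only.

```python
# Values mirror the FixedRuleConfig defaults in the code cell of this section.
SCORE_A = {1: 0.0, 2: 45.0, 3: 85.0}       # MIFID = 0: SI consideration only
BASELINE_B = {1: 0.0, 2: 30.0, 3: 50.0}    # MIFID = 1: conservative baseline
CONFIRM_MAX = 50.0                         # max points added when the gate is open
WEIGHTS = {"sfdr": 0.50, "pai": 0.30, "tax": 0.20}

def fixed_score(mifid, si_level, sfdr_norm, pai_block, tax_norm):
    """Return the additive 'why' components and the total score for one client."""
    signals = {"sfdr": sfdr_norm, "pai": pai_block, "tax": tax_norm}
    gate_open = (mifid == 1) and (si_level == 3)
    why = {"base": SCORE_A[si_level] if mifid == 0 else BASELINE_B[si_level]}
    for name, value in signals.items():
        why[name] = CONFIRM_MAX * WEIGHTS[name] * value if gate_open else 0.0
    why["total"] = min(100.0, max(0.0, sum(why.values())))
    return why

gate_open_client = fixed_score(1, 3, sfdr_norm=1.0, pai_block=0.8, tax_norm=0.5)
gate_closed_client = fixed_score(1, 2, sfdr_norm=1.0, pai_block=0.8, tax_norm=0.5)
non_mifid_client = fixed_score(0, 3, sfdr_norm=1.0, pai_block=0.8, tax_norm=0.5)
print(gate_open_client)
print(gate_closed_client)
print(non_mifid_client)
```

With identical confirmation signals, the S3 client under MIFID scores about 92 (50 base + 25 SFDR + 12 PAI + 5 taxonomy), the S2 client stays at its baseline of 30, and the non-MIFID S3 client scores 85 from SI consideration alone.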

### Evidence
On validation we show:
- AUC/AP/Brier
- Lift by decile
- Calibration curve

### Decision checkpoint
If lift is weak, we do not jump to ML immediately.
Next we try a **data-driven weighted rule** that keeps the same structure but learns weights.

**What to look for in the evaluation**
- **Lift:** the top decile should have a meaningfully higher SI rate than baseline.
- **Calibration:** points near the diagonal indicate “probability-like” behavior.
- If lift is weak: we improve the same logic by learning weights (Method 2) before considering ML.

In [None]:
@dataclass
class FixedRuleConfig:
    # MIFID=0 branch: only SI
    score_A_S1: float = 0
    score_A_S2: float = 45
    score_A_S3: float = 85

    # MIFID=1 branch:
    baseline_B_S1: float = 0
    baseline_B_S2: float = 30
    baseline_B_S3: float = 50

    confirm_max: float = 50  # max added if SI is high (S3)
    w_sfdr: float = 0.50
    w_pai: float = 0.30
    w_tax: float = 0.20

cfg_fixed = FixedRuleConfig()

def score_fixed_rule(df: pd.DataFrame, cfg: FixedRuleConfig) -> pd.DataFrame:
    df = df.copy()

    # Baselines
    base_A = df["SI_CONSIDERATION_num"].map({1: cfg.score_A_S1, 2: cfg.score_A_S2, 3: cfg.score_A_S3}).astype(float)
    base_B = df["SI_CONSIDERATION_num"].map({1: cfg.baseline_B_S1, 2: cfg.baseline_B_S2, 3: cfg.baseline_B_S3}).astype(float)

    confirm = cfg.confirm_max * (
        cfg.w_sfdr * df["sfdr_norm"] +
        cfg.w_pai * df["pai_block"] +
        cfg.w_tax * df["tax_norm"]
    )

    # Apply requested gate: only add confirm when MIFID=1 and SI=S3
    score = np.where(df["MIFID"]==0, base_A,
                     np.where(df["si_is_s3"]==1, base_B + confirm, base_B))

    df["score_fixed"] = np.clip(score, 0, 100)

    # Why columns
    df["why_base"] = np.where(df["MIFID"]==0, base_A, base_B)
    df["why_sfdr"] = np.where((df["MIFID"]==1) & (df["si_is_s3"]==1), cfg.confirm_max*cfg.w_sfdr*df["sfdr_norm"], 0.0)
    df["why_pai"]  = np.where((df["MIFID"]==1) & (df["si_is_s3"]==1), cfg.confirm_max*cfg.w_pai*df["pai_block"], 0.0)
    df["why_tax"]  = np.where((df["MIFID"]==1) & (df["si_is_s3"]==1), cfg.confirm_max*cfg.w_tax*df["tax_norm"], 0.0)

    return df

df_scored = score_fixed_rule(df_feat, cfg_fixed)

plt.figure(figsize=(7,4))
plt.hist(df_scored["score_fixed"], bins=30)
plt.title("Score distribution: Fixed rule")
plt.xlabel("score_fixed")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

df_scored[["score_fixed","why_base","why_sfdr","why_pai","why_tax"]].head()

In [None]:
# Validation performance (Fixed rule)
p_fixed = df_scored.loc[idx_val, "score_fixed"].values / 100.0
fixed_metrics = eval_scores(y_val.values, p_fixed, "Fixed rule")
display(fixed_metrics)

lift_fixed = lift_table(y_val.values, p_fixed)
plot_lift(lift_fixed, "Lift (proxy): Fixed rule")
display(lift_fixed)

plot_calibration(y_val.values, p_fixed, "Calibration (proxy): Fixed rule")

## 8) Method 2 — Weighted rule (data-driven, still rule-like)

### Question
Can we improve the rule by learning weights from data, while keeping the same structure?

### Approach
We keep the **same gate** and baselines, but learn the confirmation weights:
- Train on the subset where confirmation is allowed by design: `MIFID=1` and `SI=S3`
- Fit logistic regression using only:
  - `sfdr_norm`, `pai_block`, `tax_norm`
- Clip negative coefficients to zero and rescale the remaining ones into weights that sum to 1 (easy to explain and audit); if no coefficient is positive, fall back to equal weights
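
The coefficient-to-weight conversion is simple enough to sketch in plain Python. The function name is hypothetical and the coefficients in the example are made up for illustration; they are not results from this dataset.

```python
def coefficients_to_weights(coefs):
    """Clip negative coefficients to zero and rescale the rest to sum to 1."""
    clipped = {name: max(value, 0.0) for name, value in coefs.items()}
    total = sum(clipped.values())
    if total == 0:
        # No positive signal at all: fall back to equal weights.
        return {name: 1.0 / len(coefs) for name in coefs}
    return {name: value / total for name, value in clipped.items()}

# Made-up coefficients for illustration only.
example = {"sfdr_norm": 0.9, "pai_block": 0.3, "tax_norm": -0.2}
weights = coefficients_to_weights(example)
all_negative = coefficients_to_weights({"sfdr_norm": -1.0, "pai_block": -0.5, "tax_norm": -0.1})
print(weights)
print(all_negative)
```

Clipping means a signal that correlates negatively with SI membership gets weight 0 instead of subtracting points, which keeps every confirmation signal monotonic in the final score.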

### Evidence
- Bar chart of learned weights
- Validation metrics + lift + calibration

### Decision checkpoint
If weighted rule improves lift and stays stable, it is often the best operational default:
- explainable
- easy governance
- less drift-sensitive than complex models

**Why this step is persuasive**
This step preserves the stakeholder rule structure but uses data to answer:
> “Which confirmation signal (SFDR vs PAI vs Taxonomy) actually correlates more with SI membership in our data?”

If the learned weights are stable over time, this becomes a strong operational default.

In [None]:
train_df = df_scored.loc[idx_train].copy()
train_gate = train_df[(train_df["MIFID"]==1) & (train_df["si_is_s3"]==1)].copy()

WEIGHT_FEATURES = ["sfdr_norm","pai_block","tax_norm"]
if train_gate["si_offering"].nunique() < 2:
    print("Warning: gated training subset has only one class; falling back to fixed weights.")
    w_learned = pd.Series([cfg_fixed.w_sfdr, cfg_fixed.w_pai, cfg_fixed.w_tax], index=WEIGHT_FEATURES)
else:
    lr_w = LogisticRegression(max_iter=6000, class_weight="balanced")
    lr_w.fit(train_gate[WEIGHT_FEATURES], train_gate["si_offering"])
    coef = pd.Series(lr_w.coef_[0], index=WEIGHT_FEATURES)

    pos = np.maximum(coef.values, 0)  # stakeholder-friendly monotonic assumption
    if pos.sum() == 0:
        pos = np.ones_like(pos)
    w_learned = pd.Series(pos / pos.sum(), index=WEIGHT_FEATURES)

display(w_learned.to_frame("learned_weight"))

plt.figure(figsize=(7,4))
plt.bar(w_learned.index, w_learned.values)
plt.title("Weighted rule: learned confirmation weights (sum=1)")
plt.ylabel("weight")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
@dataclass
class WeightedRuleConfig:
    score_A_S1: float = cfg_fixed.score_A_S1
    score_A_S2: float = cfg_fixed.score_A_S2
    score_A_S3: float = cfg_fixed.score_A_S3

    baseline_B_S1: float = cfg_fixed.baseline_B_S1
    baseline_B_S2: float = cfg_fixed.baseline_B_S2
    baseline_B_S3: float = cfg_fixed.baseline_B_S3

    confirm_max: float = cfg_fixed.confirm_max

cfg_weighted = WeightedRuleConfig()

def score_weighted_rule(df: pd.DataFrame, cfg: WeightedRuleConfig, w: pd.Series) -> pd.DataFrame:
    df = df.copy()
    base_A = df["SI_CONSIDERATION_num"].map({1: cfg.score_A_S1, 2: cfg.score_A_S2, 3: cfg.score_A_S3}).astype(float)
    base_B = df["SI_CONSIDERATION_num"].map({1: cfg.baseline_B_S1, 2: cfg.baseline_B_S2, 3: cfg.baseline_B_S3}).astype(float)

    confirm = cfg.confirm_max * (
        w["sfdr_norm"] * df["sfdr_norm"] +
        w["pai_block"] * df["pai_block"] +
        w["tax_norm"] * df["tax_norm"]
    )

    score = np.where(df["MIFID"]==0, base_A,
                     np.where(df["si_is_s3"]==1, base_B + confirm, base_B))

    df["score_weighted"] = np.clip(score, 0, 100)

    df["whyW_base"] = np.where(df["MIFID"]==0, base_A, base_B)
    df["whyW_sfdr"] = np.where((df["MIFID"]==1) & (df["si_is_s3"]==1), cfg.confirm_max*w["sfdr_norm"]*df["sfdr_norm"], 0.0)
    df["whyW_pai"]  = np.where((df["MIFID"]==1) & (df["si_is_s3"]==1), cfg.confirm_max*w["pai_block"]*df["pai_block"], 0.0)
    df["whyW_tax"]  = np.where((df["MIFID"]==1) & (df["si_is_s3"]==1), cfg.confirm_max*w["tax_norm"]*df["tax_norm"], 0.0)
    return df

df_scored = score_weighted_rule(df_scored, cfg_weighted, w_learned)

plt.figure(figsize=(7,4))
plt.hist(df_scored["score_weighted"], bins=30)
plt.title("Score distribution: Weighted rule (data-driven)")
plt.xlabel("score_weighted")
plt.ylabel("count")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

df_scored[["score_weighted","whyW_base","whyW_sfdr","whyW_pai","whyW_tax"]].head()

In [None]:
# Validation performance (Weighted rule)
p_weighted = df_scored.loc[idx_val, "score_weighted"].values / 100.0
weighted_metrics = eval_scores(y_val.values, p_weighted, "Weighted rule (data-driven)")
display(weighted_metrics)

lift_weighted = lift_table(y_val.values, p_weighted)
plot_lift(lift_weighted, "Lift (proxy): Weighted rule")
display(lift_weighted)

plot_calibration(y_val.values, p_weighted, "Calibration (proxy): Weighted rule")

## 9) Method 3 — ML: Calibrated Logistic Regression (controlled enhancement)

### Question
Can ML outperform the rule approaches while remaining explainable and calibrated?

### Approach
We use **Calibrated Logistic Regression** because it is:
- a strong, well-understood baseline for tabular data
- interpretable (coefficients)
- produces calibrated probabilities for bucket/threshold decisions

**Respecting the business structure**
We add interaction (“gate”) features:
- `gate = MIFID * si_is_s3`
- `gate_sfdr = gate * sfdr_norm`
- `gate_pai  = gate * pai_block`
- `gate_tax  = gate * tax_norm`

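This allows ML to learn signal strength where it is logically relevant, aligned with the rule design. The ungated `sfdr_norm`, `pai_block`, and `tax_norm` stay in the feature set as well, so in ML the gate is a soft structure rather than the hard gate used by the rules.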
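
A row-level sketch of the interaction features, using a plain dictionary in place of a dataframe row. The function name `add_gate_features` is hypothetical and the feature values are invented.

```python
def add_gate_features(row):
    """Return a copy of the feature dict with the gate and gated interaction terms."""
    gate = row["MIFID"] * row["si_is_s3"]   # 1 only when MIFID = 1 and SI = S3
    out = dict(row)
    out["gate"] = gate
    for gated_name, source in [("gate_sfdr", "sfdr_norm"),
                               ("gate_pai", "pai_block"),
                               ("gate_tax", "tax_norm")]:
        out[gated_name] = gate * row[source]
    return out

open_row = {"MIFID": 1, "si_is_s3": 1, "sfdr_norm": 1.0, "pai_block": 0.8, "tax_norm": 0.5}
closed_row = {**open_row, "si_is_s3": 0}

gated_open = add_gate_features(open_row)
gated_closed = add_gate_features(closed_row)
print(gated_open)
print(gated_closed)
```

When the gate is closed, all three interaction terms are zero, so any coefficient the model learns on them only affects clients with `MIFID = 1` and `SI = S3`.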

### Evidence
- Validation metrics + lift + calibration
- Coefficients from an uncalibrated LR for interpretability

### Decision checkpoint
Use ML only if it improves lift materially **and** stays stable/calibrated. Otherwise prefer Weighted rule.

**How to interpret ML outputs**
- If ML improves AUC/AP but has worse calibration, it can still be useful for ranking—but bucket thresholds should be conservative.
- Coefficients are a sanity check: are directions plausible (e.g., higher SFDR opportunity increases score)?
- Adopt ML only if it materially improves lift in the highest bucket where ROI concentrates.

In [None]:
def make_ml_matrix(df: pd.DataFrame) -> pd.DataFrame:
    X = df[FEATURES_BASE].copy()
    X["gate"] = X["MIFID"] * X["si_is_s3"]
    X["gate_sfdr"] = X["gate"] * X["sfdr_norm"]
    X["gate_pai"]  = X["gate"] * X["pai_block"]
    X["gate_tax"]  = X["gate"] * X["tax_norm"]
    return X

Xtr = make_ml_matrix(df_scored.loc[idx_train])
Xva = make_ml_matrix(df_scored.loc[idx_val])

lr = LogisticRegression(max_iter=8000, class_weight="balanced")
cal_lr = CalibratedClassifierCV(lr, method="isotonic", cv=5)
cal_lr.fit(Xtr, y_train)

p_ml = cal_lr.predict_proba(Xva)[:,1]
ml_metrics = eval_scores(y_val.values, p_ml, "ML: Calibrated Logistic Regression")
display(ml_metrics)

lift_ml = lift_table(y_val.values, p_ml)
plot_lift(lift_ml, "Lift (proxy): ML Calibrated Logistic Regression")
display(lift_ml)

plot_calibration(y_val.values, p_ml, "Calibration (proxy): ML Calibrated Logistic Regression")

# Interpretability: coefficients from uncalibrated LR on same features
lr_plain = LogisticRegression(max_iter=8000, class_weight="balanced")
lr_plain.fit(Xtr, y_train)
coef = pd.Series(lr_plain.coef_[0], index=Xtr.columns).sort_values(key=np.abs, ascending=False)
display(coef.to_frame("coef"))

plt.figure(figsize=(10,4))
plt.bar(coef.index[:12], coef.values[:12])
plt.title("Top coefficients (uncalibrated LR; sign indicates direction)")
plt.ylabel("coef")
plt.xticks(rotation=45, ha="right")
plt.grid(True, axis="y", alpha=0.3)
plt.tight_layout()
plt.show()

## 10) Validation comparison (Fixed vs Weighted vs ML)

### Question
Which method is best for prioritization on held-out data?

### Approach
We compare all three methods on the **same validation set**:
- Fixed rule
- Weighted rule
- Calibrated LR (ML)

### Evidence
- Table: AUC / AP / Brier for all methods
- Bar charts for the three metrics

### Decision
Choose the method that offers the best trade-off between:
- uplift (business value)
- explainability (governance)
- stability (risk)

In [None]:
comparison = pd.DataFrame([fixed_metrics, weighted_metrics, ml_metrics]).round(4)
display(comparison.sort_values("auc", ascending=False))

for metric in ["auc","avg_precision","brier"]:
    plt.figure(figsize=(8,4))
    plt.bar(comparison["model"], comparison[metric])
    plt.title(f"Validation comparison: {metric}")
    plt.ylabel(metric)
    plt.xticks(rotation=30, ha="right")
    plt.grid(True, axis="y", alpha=0.3)
    plt.tight_layout()
    plt.show()

## 11) Operational output: Top 20 IDs with `si_offering=0` + “why” columns

### Question
How do we turn scores into an actionable list?

### Approach
- Filter to `si_offering = 0`
- Rank by selected method (Fixed or Weighted are available for all IDs)
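- Add percentiles and 3 buckets: Low (up to the 50th percentile), Average (50th to 80th), High (above the 80th). Percentiles are computed over all scored IDs, before filtering to `si_offering = 0`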
- Include “why columns” so the front office understands *why* a client is prioritized
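
The percentile and bucket step can be sketched without pandas. This assumes average-rank handling of ties (the default behavior of the pandas `rank` call used in the cell below) and cut-offs at the 50th and 80th percentiles; the function names and the ten scores are invented.

```python
def percentile_ranks(scores):
    """Percentile rank (0-100) per score; tied scores share their average rank."""
    n = len(scores)
    ranks = []
    for s in scores:
        below = sum(1 for v in scores if v < s)
        equal = sum(1 for v in scores if v == s)
        ranks.append(100 * (below + (equal + 1) / 2) / n)
    return ranks

def bucket(percentile):
    """Map a 0-100 score percentile to the three outreach buckets."""
    if percentile <= 50:
        return "Low"
    if percentile <= 80:
        return "Average"
    return "High"

scores = [92, 30, 85, 45, 0, 50, 30, 85, 45, 0]
pcts = percentile_ranks(scores)
buckets = [bucket(p) for p in pcts]
for score, pct, name in zip(scores, pcts, buckets):
    print(score, pct, name)
```

Rule-based scores take few distinct values, so many clients share a percentile; in practice the High bucket may hold more or fewer than 20% of IDs depending on where the ties fall.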

### Evidence
A top-20 table that can be exported to CRM or used directly for outreach planning.

In [None]:
RANK_METHOD = "score_weighted"  # "score_fixed" or "score_weighted"

df_out = df_scored.copy()
df_out["rank_score"] = df_out[RANK_METHOD]
df_out["score_percentile"] = (df_out["rank_score"].rank(pct=True) * 100).round(2)
df_out["bucket_3"] = pd.cut(df_out["score_percentile"], bins=[-0.01, 50, 80, 100], labels=["Low","Average","High"])

targets = df_out[df_out["si_offering"]==0].sort_values("rank_score", ascending=False)

if RANK_METHOD == "score_fixed":
    why_cols = ["why_base","why_sfdr","why_pai","why_tax"]
else:
    why_cols = ["whyW_base","whyW_sfdr","whyW_pai","whyW_tax"]

cols = ["ID","rank_score","score_percentile","bucket_3",
        "MIFID","SI_CONSIDERATION_num","sfdr_gap","PAI_PREF","TAXONOMYPREF_num","esg_topics_yes_cnt"] + why_cols

targets[cols].head(20)

## 12) Risks & governance

### Core risks
- **Proxy label bias:** `si_offering` is membership, not true interest.
- **Hard-gate coverage risk:** SFDR/PAI/Tax only influence the rule scores when `MIFID = 1` and `SI = S3`. This can miss “emerging” interest.
- **Missing data:** defaulting to lowest tier can under-score clients with incomplete questionnaires.

### Governance controls
- Track missingness and score distributions monthly
- Monitor lift-by-decile on recent data
- Refresh learned weights on a fixed cadence (e.g., quarterly) or after major process changes
- Keep a human-review step for the highest-risk edge cases

## 13) Pilot plan (to create true labels)

### Why we need a pilot
The proxy label is useful for benchmarking methods, but stakeholders ultimately want:
> “Does targeting the high bucket produce more SI adoption or pipeline?”

### Recommended pilot design
- Population: `si_offering = 0`
- Treatment: top bucket (e.g., top 20% by Weighted rule)
- Control: randomized sample from the remaining eligible population (or next bucket)
- Outcomes to capture:
  - outreach response
  - meeting booked
  - SI adoption / mandate change
  - pipeline created

### Decision
After the pilot produces true outcomes, retrain and re-compare:
Fixed vs Weighted vs ML on **true labels**, not proxy membership.