# Feature Importance Methods for Scientific Inference — SOLUTIONS

This notebook contains the **full solutions** (code and interpretations) for all four exercises.

---

## Setup

Run the cells below to set everything up.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

plt.rcParams.update({
    'font.size':        14,
    'axes.titlesize':   16,
    'axes.labelsize':   14,
    'xtick.labelsize':  13,
    'ytick.labelsize':  13,
    'legend.fontsize':  13,
})

In [None]:
def generate_data(n=1500, seed=83):
    """Generate the dataset. The DGP is hidden for now."""
    rng = np.random.RandomState(seed)
    x1 = rng.normal(0, 1, n)
    x2 = 0.999 * x1 + np.sqrt(1 - 0.999**2) * rng.normal(0, 1, n)
    x3 = rng.normal(0, 1, n)
    x4 = 0.999 * x3 + np.sqrt(1 - 0.999**2) * rng.normal(0, 1, n)
    y = 5 * x1 + rng.normal(0, 1, n)
    x5 = rng.normal(0, 1, n)
    X = np.column_stack([x1, x2, x3, x4, x5])
    return X, y

X, y = generate_data()
feature_names = ["X1", "X2", "X3", "X4", "X5"]

print(f"Dataset shape: {X.shape}")
print(f"Feature names: {feature_names}")
print(f"\nFirst 5 rows of X:")
print(np.round(X[:5], 2))
print(f"\nFirst 5 values of Y: {np.round(y[:5], 2)}")

In [None]:
# Split into train and test
n_train = 1000
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Train a Linear Regression (unregularized)
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = model.score(X_test, y_test)
print(f"Test MSE: {mse:.3f}")
print(f"Test R\u00b2:  {r2:.3f}")
print(f"\nFitted coefficients: {np.round(model.coef_, 2)}")

In [None]:
def permutation_sampler(X, feature_idx, rng=None):
    """
    Permutation (marginal) sampler.
    Returns a copy of X where column `feature_idx` is randomly permuted,
    effectively sampling X_j from its marginal distribution independently
    of all other features and Y.
    """
    if rng is None:
        rng = np.random.RandomState(0)
    X_perm = X.copy()
    X_perm[:, feature_idx] = rng.permutation(X[:, feature_idx])
    return X_perm


def compute_pfi(model, X, y, feature_idx, sampler, n_repeats=50, seed=42):
    """
    Compute Permutation Feature Importance for feature `feature_idx`.

    PFI_j = E[L(Y, f(X_tilde_j, X_{-j}))] - E[L(Y, f(X))]

    Parameters
    ----------
    model : fitted sklearn model
    X : np.ndarray, shape (n, p) - test features
    y : np.ndarray, shape (n,) - test target
    feature_idx : int - index of the feature to permute
    sampler : callable - function(X, feature_idx, rng) -> X_perturbed
    n_repeats : int - number of repetitions to average over
    seed : int - random seed

    Returns
    -------
    pfi_mean : float - mean PFI score across repeats
    pfi_se   : float - standard error of the mean across repeats
    """
    rng = np.random.RandomState(seed)
    baseline_mse = mean_squared_error(y, model.predict(X))

    perturbed_mses = []
    for _ in range(n_repeats):
        X_perturbed = sampler(X, feature_idx, rng=rng)
        y_pred_perturbed = model.predict(X_perturbed)
        perturbed_mses.append(mean_squared_error(y, y_pred_perturbed))

    pfi_values = np.array(perturbed_mses) - baseline_mse
    pfi_mean = np.mean(pfi_values)
    pfi_se = np.std(pfi_values) / np.sqrt(n_repeats)
    return pfi_mean, pfi_se

### Pre-computed PFI Results

The plot below is what participants see before Exercise 1.

In [None]:
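# Pre-computed PFI (same as shown to participants)
# compute_pfi returns (mean, standard error); unpack both so that the plot
# below, the check in Solution 2 and the PFI/CFI comparison in Solution 3
# all work with plain lists of floats.
pfi_results = [compute_pfi(model, X_test, y_test, feature_idx=j,
                           sampler=permutation_sampler)
               for j in range(X_test.shape[1])]
pfi_scores = [pfi_mean for pfi_mean, _ in pfi_results]
pfi_ses = [pfi_se for _, pfi_se in pfi_results]
pfi_scores_setup = pfi_scores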

plt.figure(figsize=(6, 4))
plt.barh(feature_names[::-1], pfi_scores_setup[::-1],
         color='grey', edgecolor='black', linewidth=0.5)
plt.xlabel("PFI (increase in MSE)")
plt.title("Permutation Feature Importance")
plt.tight_layout()
plt.show()

---

# Solution 1: Interpret the PFI Plot

**Which features does the model rely on?**

PFI is clearly non-zero for $X_1$ through $X_4$: permuting any of them increases the prediction error, so the model relies on each of them. The score for $X_5$ is close to zero, so the model barely uses it.

**Which features appear important in the data?**

Based on PFI alone, one would conclude that $X_1$ through $X_4$ are all associated with $Y$, with $X_1$ appearing most important and $X_5$ negligible.

**What does PFI measure exactly?**

PFI measures **model reliance**: how much the model's predictions degrade when the association between $X_j$ and all other variables is broken. A non-zero PFI means the model uses that feature — but this does **not** necessarily mean the feature is causally or even statistically associated with $Y$.

> This corresponds to *model reliance* in Ewald et al. (2024), Section 5.1.

---

# Solution 2: Why is PFI misleading?

**The true DGP:**

$$X_1 \sim \mathcal{N}(0,1), \quad X_3 \sim \mathcal{N}(0,1), \quad X_5 \sim \mathcal{N}(0,1) \quad \text{(mutually independent)}$$

$$X_2 = 0.999 \cdot X_1 + \sqrt{1 - 0.999^2} \cdot \varepsilon_2, \quad X_4 = 0.999 \cdot X_3 + \sqrt{1 - 0.999^2} \cdot \varepsilon_4, \quad \varepsilon_2, \varepsilon_4 \sim \mathcal{N}(0, 1)$$

$$Y = 5 X_1 + \varepsilon_Y, \quad \varepsilon_Y \sim \mathcal{N}(0, 1)$$

$Y$ depends **only** on $X_1$. $X_3$, $X_4$, and $X_5$ are **completely independent** of $Y$.
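
The independence claims can be checked empirically. The sketch below regenerates the data with the same calls and seed as `generate_data` and prints the Pearson correlation of each feature with $Y$. The names `X_check` and `corr_with_y` are introduced here for illustration only. Correlation captures linear association only, which is sufficient in this Gaussian setting.

```python
import numpy as np

# Regenerate the data with the same calls and seed as generate_data(n=1500, seed=83).
rng = np.random.RandomState(83)
n = 1500
x1 = rng.normal(0, 1, n)
x2 = 0.999 * x1 + np.sqrt(1 - 0.999**2) * rng.normal(0, 1, n)
x3 = rng.normal(0, 1, n)
x4 = 0.999 * x3 + np.sqrt(1 - 0.999**2) * rng.normal(0, 1, n)
y = 5 * x1 + rng.normal(0, 1, n)
x5 = rng.normal(0, 1, n)
X_check = np.column_stack([x1, x2, x3, x4, x5])

# Pearson correlation of each feature with Y.
corr_with_y = np.array([np.corrcoef(X_check[:, j], y)[0, 1] for j in range(5)])
for j, c in enumerate(corr_with_y):
    print(f"corr(X{j+1}, Y) = {c:+.3f}")
```

$X_1$ and $X_2$ should both show a correlation close to 1, while $X_3$, $X_4$ and $X_5$ should stay near 0 up to sampling noise.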

In [None]:
# Task 1: Implement PFI from scratch
def my_pfi(model, X, y, feature_idx, n_repeats=50, seed=42):
    rng = np.random.RandomState(seed)
    baseline_mse = mean_squared_error(y, model.predict(X))
    perturbed_mses = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, feature_idx] = rng.permutation(X[:, feature_idx])
        perturbed_mses.append(mean_squared_error(y, model.predict(X_perm)))
    return np.mean(perturbed_mses) - baseline_mse

# Verify — should match the pre-computed pfi_scores_setup
for j in range(X_test.shape[1]):
    mine     = my_pfi(model, X_test, y_test, feature_idx=j)
    provided = pfi_scores_setup[j]
    print(f"X{j+1}: mine={mine:.4f}  provided={provided:.4f}")

In [None]:
# Create permuted versions
rng = np.random.RandomState(42)
X_test_perm_x2 = permutation_sampler(X_test, feature_idx=1, rng=rng)
rng = np.random.RandomState(42)
X_test_perm_x3 = permutation_sampler(X_test, feature_idx=2, rng=rng)

# Scatterplots: original vs permuted for both pairs
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Row 1: (X1, X2)
axes[0, 0].scatter(X_test[:, 0], X_test[:, 1], alpha=0.3, s=10, color='#2196F3')
axes[0, 0].set_xlabel("$X_1$"); axes[0, 0].set_ylabel("$X_2$")
axes[0, 0].set_title("Original: $(X_1, X_2)$")
axes[0, 0].set_xlim(-4, 4); axes[0, 0].set_ylim(-4, 4)

axes[0, 1].scatter(X_test_perm_x2[:, 0], X_test_perm_x2[:, 1], alpha=0.3, s=10, color='#FF9800')
axes[0, 1].set_xlabel("$X_1$"); axes[0, 1].set_ylabel(r"$\tilde{X}_2$ (permuted)")
axes[0, 1].set_title("After permuting $X_2$")
axes[0, 1].set_xlim(-4, 4); axes[0, 1].set_ylim(-4, 4)

# Row 2: (X3, X4)
axes[1, 0].scatter(X_test[:, 2], X_test[:, 3], alpha=0.3, s=10, color='#E91E63')
axes[1, 0].set_xlabel("$X_3$"); axes[1, 0].set_ylabel("$X_4$")
axes[1, 0].set_title("Original: $(X_3, X_4)$")
axes[1, 0].set_xlim(-4, 4); axes[1, 0].set_ylim(-4, 4)

axes[1, 1].scatter(X_test_perm_x3[:, 2], X_test_perm_x3[:, 3], alpha=0.3, s=10, color='#9C27B0')
axes[1, 1].set_xlabel(r"$\tilde{X}_3$ (permuted)"); axes[1, 1].set_ylabel("$X_4$")
axes[1, 1].set_title("After permuting $X_3$")
axes[1, 1].set_xlim(-4, 4); axes[1, 1].set_ylim(-4, 4)

plt.tight_layout()
plt.show()

print(f"\nFitted coefficients: {np.round(model.coef_, 2)}")

### Explanation (Solution 2)

**What do you notice in the scatterplots?**

- **Left column (original):** Both $(X_1, X_2)$ and $(X_3, X_4)$ show very tight ellipses along the diagonal — the features within each pair are nearly identical ($\rho \approx 0.999$).
- **Right column (permuted):** After permuting one feature in each pair, the correlation is destroyed. The data forms circular clouds with many **data points that never occurred in the training data**.

**How do the model coefficients explain the PFI scores?**

The fitted coefficients for $X_1$ to $X_4$ are approximately `[3.11, 1.88, -2.11, 2.17]`. Due to the extreme multicollinearity ($\rho = 0.999$), OLS cannot reliably separate the two features within a correlated pair; only the *sum* of their coefficients is well determined:
- The $X_1$/$X_2$ pair: coefficients ~3.1 and ~1.9 (the true effect of 5 is split between them)
- The $X_3$/$X_4$ pair: coefficients ~-2.1 and ~2.2 (these are **pure overfitting**: they approximately cancel each other out, but each is large individually)

When you permute one feature in a pair, this balance breaks: for $X_3$/$X_4$ the near-perfect cancellation is lost, and for $X_1$/$X_2$ the two contributions no longer add up to roughly $5 X_1$. Either way, the prediction error increases sharply.
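
The effect can be checked numerically on the $X_3$/$X_4$ pair. The sketch below is a toy setup, not the notebook's fitted model: it simulates a fresh collinear pair, plugs in the rounded coefficients quoted above, and compares the variance of the pair's joint contribution to the prediction before and after permuting $X_3$. The names `contrib_original` and `contrib_permuted` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
rho = 0.999

# A collinear pair built like (X3, X4) in the DGP.
x3 = rng.normal(size=n)
x4 = rho * x3 + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Rounded coefficients for X3 and X4 as quoted in the text (assumed, not refitted).
beta3, beta4 = -2.11, 2.17

# Joint contribution of the pair to the prediction, before and after permuting X3.
contrib_original = beta3 * x3 + beta4 * x4
contrib_permuted = beta3 * rng.permutation(x3) + beta4 * x4

var_original = contrib_original.var()
var_permuted = contrib_permuted.var()
print(f"Var(contribution), original data: {var_original:.3f}")
print(f"Var(contribution), X3 permuted  : {var_permuted:.3f}")
```

On the original data the two terms almost cancel, so the pair contributes next to nothing to the prediction. After the permutation the contribution has a variance of several units, and all of it ends up as additional prediction error.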

**Why does PFI assign high importance to $X_3$ and $X_4$?**

1. OLS assigns large but opposing coefficients to $X_3$ and $X_4$ (because they are nearly collinear and the coefficient estimates are unstable)
2. In the original data, $X_3 \approx X_4$, so the contributions of $\hat{\beta}_3 X_3$ and $\hat{\beta}_4 X_4$ nearly cancel
3. When the permutation sampler breaks the $X_3$-$X_4$ correlation, this cancellation is destroyed, and the model's predictions are wildly wrong
4. PFI detects this as "importance" — but it reflects the model's reliance on the $X_3$-$X_4$ correlation, **not** any association between these features and $Y$

This is an even more dramatic failure than the $X_2$ case: $X_3$ and $X_4$ are **completely independent** of $Y$, yet PFI assigns them substantial importance.
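
A short derivation makes the four steps above concrete. It is a sketch under two simplifying assumptions: the model is linear, $f(X) = \beta_0 + \beta^\top X$, and the permuted copy $\tilde{X}_j$ is treated as an independent draw from the marginal distribution of $X_j$. Then $f(\tilde{X}) = f(X) + \beta_j(\tilde{X}_j - X_j)$, and with the residual $R = Y - f(X)$:

$$\text{PFI}_j = \mathbb{E}\!\left[\left(R - \beta_j(\tilde{X}_j - X_j)\right)^2\right] - \mathbb{E}\!\left[R^2\right] = 2\beta_j^2\,\text{Var}(X_j) + 2\beta_j\,\text{Cov}(R, X_j)$$

For a least-squares fit, the residual is approximately uncorrelated with each feature, so $\text{PFI}_j \approx 2\beta_j^2\,\text{Var}(X_j)$. With $\hat{\beta}_3 \approx -2.11$ and $\text{Var}(X_3) = 1$, this gives roughly $2 \cdot 2.11^2 \approx 8.9$. The score is driven by the size of the coefficient, not by any association between $X_3$ and $Y$.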

> This illustrates the problem described in *Negative Result 5.1.2* of Ewald et al. (2024): non-zero PFI does not necessarily imply any association with $Y$.

---

# Solution 3: Conditional Feature Importance (CFI)

In [None]:
def conditional_sampler(X, feature_idx, rng=None, mean=None, cov=None):
    """
    Conditional sampler for multivariate normal data.
    Samples X_j from X_j | X_{-j} using the closed-form Gaussian conditional.

    Parameters
    ----------
    X : np.ndarray, shape (n, p)
    feature_idx : int - the feature to resample
    rng : np.random.RandomState
    mean : np.ndarray, shape (p,) - mean of the joint distribution
    cov : np.ndarray, shape (p, p) - covariance of the joint distribution

    Returns
    -------
    X_cond : np.ndarray - copy of X with column `feature_idx` resampled conditionally
    """
    if rng is None:
        rng = np.random.RandomState(0)

    n, p = X.shape
    j = feature_idx

    # Indices of all other features
    others = [i for i in range(p) if i != j]

    # Extract sub-matrices from the covariance
    sigma_jj = cov[j, j]                              # scalar
    sigma_j_others = cov[j, others]                    # (p-1,)
    sigma_others_others = cov[np.ix_(others, others)]  # (p-1, p-1)

    # Conditional parameters
    sigma_others_inv = np.linalg.inv(sigma_others_others)
    beta = sigma_j_others @ sigma_others_inv            # regression coefficients
    cond_var = sigma_jj - sigma_j_others @ sigma_others_inv @ sigma_j_others  # conditional variance

    # Conditional mean for each observation: mu_j + beta @ (x_{-j} - mu_{-j})
    x_others = X[:, others]
    cond_means = mean[j] + (x_others - mean[others]) @ beta  # (n,)

    # Sample from conditional distribution
    X_cond = X.copy()
    X_cond[:, j] = cond_means + rng.normal(0, np.sqrt(max(cond_var, 0)), n)

    return X_cond

In [None]:
# Estimate mean and covariance from training data
estimated_mean = np.mean(X_train, axis=0)
estimated_cov = np.cov(X_train, rowvar=False)

print("Estimated covariance matrix:")
print(np.round(estimated_cov, 3))
print("\nNotice: X1-X2 are highly correlated, X3-X4 are highly correlated,")
print("but the two pairs are independent of each other.")


def conditional_sampler_wrapper(X, feature_idx, rng=None):
    """Wrapper so conditional_sampler has the same signature as permutation_sampler."""
    return conditional_sampler(X, feature_idx, rng=rng,
                               mean=estimated_mean, cov=estimated_cov)

In [None]:
# Compute CFI for all features (50 conditional resamples each)
cfi_scores = []
cfi_ses = []
for j in range(X_test.shape[1]):
    cfi_mean, cfi_se = compute_pfi(model, X_test, y_test, feature_idx=j,
                                   sampler=conditional_sampler_wrapper)
    cfi_scores.append(cfi_mean)
    cfi_ses.append(cfi_se)
    print(f"CFI({feature_names[j]}): {cfi_mean:.4f} ± {cfi_se:.4f}")

# Side-by-side horizontal bar charts with SE (X1 on top → reverse order for barh)
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharex=False)

err_kw = dict(ecolor='black', capsize=4, capthick=1.5, elinewidth=1.5)
names_r = feature_names[::-1]

axes[0].barh(names_r, pfi_scores[::-1], xerr=pfi_ses[::-1], color='grey',
             edgecolor='black', linewidth=0.5, error_kw=err_kw)
axes[0].set_xlabel("Importance (increase in MSE)")
axes[0].set_title("PFI (Permutation / Marginal Sampler)")

axes[1].barh(names_r, cfi_scores[::-1], xerr=cfi_ses[::-1], color='grey',
             edgecolor='black', linewidth=0.5, error_kw=err_kw)
axes[1].set_xlabel("Importance (increase in MSE)")
axes[1].set_title("CFI (Conditional Sampler)")

plt.tight_layout()
plt.show()

In [None]:
# Optional: Scatterplot comparison for both pairs
rng = np.random.RandomState(42)
X_test_cond_x2 = conditional_sampler_wrapper(X_test, feature_idx=1, rng=rng)
rng = np.random.RandomState(42)
X_test_cond_x3 = conditional_sampler_wrapper(X_test, feature_idx=2, rng=rng)

fig, axes = plt.subplots(2, 3, figsize=(16, 9))

# Row 1: (X1, X2) — original, permuted, conditional
for ax in axes[0]: ax.set_xlim(-4, 4); ax.set_ylim(-4, 4)
axes[0, 0].scatter(X_test[:, 0], X_test[:, 1], alpha=0.3, s=10, color='#2196F3')
axes[0, 0].set_xlabel("$X_1$"); axes[0, 0].set_ylabel("$X_2$")
axes[0, 0].set_title("Original data")

axes[0, 1].scatter(X_test_perm_x2[:, 0], X_test_perm_x2[:, 1], alpha=0.3, s=10, color='#FF9800')
axes[0, 1].set_xlabel("$X_1$"); axes[0, 1].set_ylabel(r"$\tilde{X}_2$")
axes[0, 1].set_title("Permutation sampler")

axes[0, 2].scatter(X_test_cond_x2[:, 0], X_test_cond_x2[:, 1], alpha=0.3, s=10, color='#4CAF50')
axes[0, 2].set_xlabel("$X_1$"); axes[0, 2].set_ylabel(r"$\tilde{X}_2$")
axes[0, 2].set_title("Conditional sampler")

# Row 2: (X3, X4) — original, permuted, conditional
for ax in axes[1]: ax.set_xlim(-4, 4); ax.set_ylim(-4, 4)
axes[1, 0].scatter(X_test[:, 2], X_test[:, 3], alpha=0.3, s=10, color='#E91E63')
axes[1, 0].set_xlabel("$X_3$"); axes[1, 0].set_ylabel("$X_4$")
axes[1, 0].set_title("Original data")

axes[1, 1].scatter(X_test_perm_x3[:, 2], X_test_perm_x3[:, 3], alpha=0.3, s=10, color='#9C27B0')
axes[1, 1].set_xlabel(r"$\tilde{X}_3$"); axes[1, 1].set_ylabel("$X_4$")
axes[1, 1].set_title("Permutation sampler")

axes[1, 2].scatter(X_test_cond_x3[:, 2], X_test_cond_x3[:, 3], alpha=0.3, s=10, color='#4CAF50')
axes[1, 2].set_xlabel(r"$\tilde{X}_3$"); axes[1, 2].set_ylabel("$X_4$")
axes[1, 2].set_title("Conditional sampler")

plt.tight_layout()
plt.show()

### Interpretation (Solution 3)

**How do PFI and CFI differ?**

- **PFI** assigns large importance to $X_1$ through $X_4$, including $X_3$ and $X_4$, which are completely independent of $Y$. Only $X_5$ receives a PFI close to zero.
- **CFI** assigns a non-zero importance only to $X_1$. All other features have CFI $\approx 0$.

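The conditional sampler preserves the correlation within each pair. Since the resampled $X_2$ is still realistic given $X_1$ (and the resampled $X_4$ given $X_3$), the contributions within each pair combine as they did on the original data. In particular, the large opposing coefficients of $X_3$ and $X_4$ still cancel out, and the model's predictions remain accurate. For $X_5$, which is uncorrelated with everything, the conditional distribution equals the marginal one, so both samplers draw from (approximately) the same distribution. PFI and CFI therefore nearly coincide for $X_5$, and both are close to zero because its fitted coefficient is close to zero.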
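
The difference between the two samplers can be seen on a single correlated pair. The sketch below is a simplified stand-in for `conditional_sampler`: it assumes a standardised Gaussian pair with known correlation `rho`, so the conditional distribution is $\mathcal{N}(\rho\, a,\ 1 - \rho^2)$, whereas the notebook estimates mean and covariance from the training data. The names `a`, `b`, `b_perm` and `b_cond` are introduced here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
rho = 0.999

# Standardised Gaussian pair with correlation rho, like (X1, X2) in the DGP.
a = rng.normal(size=n)
b = rho * a + np.sqrt(1 - rho**2) * rng.normal(size=n)

# Marginal (permutation) sampler: shuffle b, ignoring a.
b_perm = rng.permutation(b)

# Conditional sampler: for unit variances, b | a ~ N(rho * a, 1 - rho**2).
b_cond = rho * a + np.sqrt(1 - rho**2) * rng.normal(size=n)

corr_perm = np.corrcoef(a, b_perm)[0, 1]
corr_cond = np.corrcoef(a, b_cond)[0, 1]
print(f"corr(a, b) after permutation         : {corr_perm:+.3f}")
print(f"corr(a, b) after conditional resample: {corr_cond:+.3f}")
```

The permuted column is essentially uncorrelated with its partner, while the conditionally resampled column keeps a correlation close to `rho`. This is why the model only sees realistic feature combinations under the conditional sampler.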

**Three types of features and how CFI treats them:**

Our DGP contains three distinct types of features. CFI correctly identifies $X_1$ as the only important one, but it is worth understanding *why* each of the other features receives a CFI of zero — the reasons are different.

| Feature | Role in DGP | Unconditionally associated with $Y$? | Conditionally associated with $Y$ (given all others)? | PFI | CFI |
|---|---|---|---|---|---|
| $X_1$ | **Directly relevant** — appears in the equation for $Y$ | Yes | Yes | High | **Non-zero** |
| $X_2$ | **Indirectly relevant** — correlated with $X_1$, which causes $Y$ | Yes (through $X_1$) | **No** — once $X_1$ is known, $X_2$ adds nothing | High | $\approx 0$ |
| $X_3, X_4$ | **Collinear irrelevant** — no connection to $Y$, but mutually correlated | **No** | **No** | High | $\approx 0$ |
| $X_5$ | **Purely irrelevant** — independent of $Y$ and of all other features | **No** | **No** | Low | $\approx 0$ |

In more detail:

1. **$X_1$ — directly relevant (conditionally associated with $Y$):** $X_1$ appears in the equation $Y = 5 X_1 + \varepsilon$. Even after conditioning on all other features, $X_1$ still provides unique information about $Y$ that cannot be recovered from the others. CFI is non-zero.

2. **$X_2$ — indirectly relevant (unconditionally associated, but conditionally independent):** $X_2$ is correlated with $Y$ — but *only* because it is a noisy copy of $X_1$. In statistical terms, $X_2$ is *unconditionally associated* with $Y$ (i.e., $X_2 \not\perp\!\!\!\perp Y$), but *conditionally independent* of $Y$ given $X_1$ (i.e., $X_2 \perp\!\!\!\perp Y \mid X_1$). Once you know $X_1$, knowing $X_2$ tells you nothing new about $Y$. The conditional sampler preserves the $X_1$-$X_2$ relationship, so the model's predictions are unaffected. CFI $\approx 0$.

3. **$X_3, X_4$ — collinear irrelevant (unconditionally independent of $Y$):** $X_3$ and $X_4$ have no association with $Y$ at all — neither unconditionally nor conditionally. The conditional sampler preserves the $X_3$-$X_4$ correlation, so the opposing coefficients continue to cancel, and predictions are unaffected. CFI $\approx 0$.

4. **$X_5$ — purely irrelevant (independent of everything):** $X_5$ is independent of $Y$ and of all other features. OLS assigns a near-zero coefficient to it, so even the permutation sampler produces little error increase. Both PFI and CFI are close to zero — but $X_5$ serves as a useful **baseline**: any method that assigns large importance to $X_5$ is clearly misbehaving.

Note that CFI gives $\approx 0$ for $X_2$, $X_3$/$X_4$, and $X_5$, but for fundamentally different reasons:
- $X_2$ is zero because it is **redundant** — its information about $Y$ is already captured by $X_1$
- $X_3$/$X_4$ are zero because they are **collinearly irrelevant** — no information about $Y$, but the model's large canceling coefficients would fool PFI
- $X_5$ is zero for both methods because it is **purely irrelevant** with no collinear structure to exploit
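
The redundancy of $X_2$ can be illustrated with a partial correlation. The sketch below simulates only $X_1$, $X_2$ and $Y$ from the DGP, removes the linear effect of $X_1$ from both $X_2$ and $Y$, and correlates what is left. The helper `residualise` is hypothetical and introduced here for illustration; for Gaussian data, a zero partial correlation corresponds to conditional independence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)
x2 = 0.999 * x1 + np.sqrt(1 - 0.999**2) * rng.normal(size=n)
y = 5 * x1 + rng.normal(size=n)

def residualise(target, given):
    """Residual of a least-squares regression of `target` on `given` (with intercept)."""
    design = np.column_stack([np.ones_like(given), given])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

# Unconditional association: plain correlation between X2 and Y.
marginal_corr = np.corrcoef(x2, y)[0, 1]
# Conditional association: correlation after removing the linear effect of X1.
partial_corr = np.corrcoef(residualise(x2, x1), residualise(y, x1))[0, 1]

print(f"corr(X2, Y)      = {marginal_corr:+.3f}")
print(f"corr(X2, Y | X1) = {partial_corr:+.3f}")
```

The marginal correlation is close to 1, while the partial correlation is close to 0: $X_2$ is associated with $Y$ only through $X_1$.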

> This distinction corresponds to the difference between *unconditional association* (A1) and *conditional association* (A2) in Ewald et al. (2024), Section 4. CFI specifically tests for conditional association: does $X_j$ provide information about $Y$ *beyond* what the other features already provide? Only $X_1$ passes this test.

**What does this mean for scientific inference?**

PFI and CFI answer fundamentally different questions:

- **PFI** answers: *"Which features does my model rely on?"* This is a question about the model, not about the data. It can be misleading for scientific conclusions because it conflates true feature-target associations with broken feature-feature correlations.

- **CFI** answers: *"Which features have a unique, direct association with $Y$ that is not already captured by the other features?"* This is more appropriate for scientific inference because it filters out both redundant features (like $X_2$) and completely irrelevant features (like $X_3$, $X_4$, $X_5$).

Neither method is universally "better" — the right choice depends on what you want to learn.

> See Ewald et al. (2024), Section 9 for practical recommendations on choosing between these approaches.

---

# Solution 4: Leave-One-Covariate-Out (LOCO)

LOCO asks: *how much does the model's performance drop if we remove feature $j$ entirely, marginalising it out with the conditional distribution?*

## The conditional SAGE value function

For a coalition $S \subseteq \{1,\ldots,p\}$, define:

$$v(S) = \underbrace{\mathbb{E}\!\left[(Y - \mathbb{E}[f(X)])^2\right]}_{\text{baseline MSE}} - \underbrace{\mathbb{E}\!\left[(Y - \mathbb{E}[f(X)\mid X_S])^2\right]}_{\text{MSE using only } X_S}$$

For a **linear model** $f(X) = \beta_0 + \beta^\top X$ with **Gaussian features**:

$$\mathbb{E}[f(X) \mid X_S] = \beta_0 + \beta_S^\top X_S + \beta_{S^c}^\top\!\left(\mu_{S^c} + \Sigma_{S^c,S}\,\Sigma_{S,S}^{-1}(X_S - \mu_S)\right)$$
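
The closed form can be sanity-checked by simulation. The sketch below uses a made-up three-feature Gaussian and made-up coefficients (`mu`, `Sigma`, `beta0` and `beta` are assumptions for illustration, unrelated to the notebook's data). It compares the slope and intercept implied by the formula with a least-squares regression of $f(X)$ on $X_S$, which recovers the conditional expectation for Gaussian features.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-feature Gaussian with known mean and covariance.
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[1.0, 0.8, 0.3],
                  [0.8, 1.0, 0.2],
                  [0.3, 0.2, 1.0]])
X_sim = rng.multivariate_normal(mu, Sigma, size=200_000)

# Hypothetical linear model f(X) = beta0 + beta @ X.
beta0, beta = 0.5, np.array([2.0, -1.0, 3.0])
f_full = beta0 + X_sim @ beta

S, Sc = [0], [1, 2]  # condition on the first feature only

# Closed form: E[f(X) | X_S] is linear in X_S with this slope and intercept.
A = Sigma[np.ix_(Sc, S)] @ np.linalg.inv(Sigma[np.ix_(S, S)])  # shape (|Sc|, |S|)
slope_closed = beta[S] + A.T @ beta[Sc]
intercept_closed = beta0 + beta[Sc] @ (mu[Sc] - A @ mu[S])

# Monte Carlo check: regress f(X) on X_S (intercept first, then slope).
design = np.column_stack([np.ones(len(X_sim)), X_sim[:, S]])
coef_mc, *_ = np.linalg.lstsq(design, f_full, rcond=None)

print(f"closed form : intercept={intercept_closed:.3f}, slope={slope_closed[0]:.3f}")
print(f"Monte Carlo : intercept={coef_mc[0]:.3f}, slope={coef_mc[1]:.3f}")
```

Both lines should agree up to simulation noise. The slope is $\beta_S$ plus the part of $\beta_{S^c}$ that is routed through the correlation between $X_{S^c}$ and $X_S$.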

## LOCO

$$\text{LOCO}_j = v(\{1,\ldots,p\}) - v(\{1,\ldots,p\} \setminus \{j\})$$
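
Expanding the definition shows what this difference amounts to. Since $\mathbb{E}[f(X) \mid X_N] = f(X)$ for the full coalition $N = \{1,\ldots,p\}$, the baseline MSE cancels:

$$\text{LOCO}_j = v(N) - v(N \setminus \{j\}) = \mathbb{E}\!\left[\left(Y - \mathbb{E}[f(X) \mid X_{N \setminus \{j\}}]\right)^2\right] - \mathbb{E}\!\left[\left(Y - f(X)\right)^2\right]$$

For the linear model above, $\text{LOCO}_j$ is therefore the increase in MSE when $X_j$ is replaced by its conditional expectation given the remaining features. The baseline only matters when the scores are normalised by $v(N)$.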

In [None]:
from itertools import chain, combinations
from math import factorial

def powerset(iterable):
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))


def conditional_sage_value(S, model, X, y, mean, cov):
    """
    Conditional SAGE value function v(S).
    v(S) = MSE_base - E[(Y - E[f(X)|X_S])^2]
    Uses closed-form Gaussian conditional for linear models.
    """
    p = X.shape[1]
    S = list(S)
    Sc = [j for j in range(p) if j not in S]
    beta  = model.coef_
    beta0 = model.intercept_

    # Baseline: constant prediction E[f(X)] = beta0 + beta @ mean
    f_base   = beta0 + beta @ mean
    mse_base = np.mean((y - f_base) ** 2)

    if len(S) == 0:
        return 0.0

    if len(Sc) == 0:
        mse_full = np.mean((y - model.predict(X)) ** 2)
        return mse_base - mse_full

    # E[X_Sc | X_S] = mean_Sc + Sigma_{Sc,S} Sigma_{S,S}^{-1} (X_S - mean_S)
    Sigma_Sc_S = cov[np.ix_(Sc, S)]
    Sigma_S_S  = cov[np.ix_(S,  S)]
    A          = Sigma_Sc_S @ np.linalg.inv(Sigma_S_S)
    X_S        = X[:, S]
    cond_mean_Sc = mean[Sc] + (X_S - mean[S]) @ A.T

    f_S   = beta0 + X_S @ beta[S] + cond_mean_Sc @ beta[Sc]
    mse_S = np.mean((y - f_S) ** 2)
    return mse_base - mse_S

In [None]:
p = X_test.shape[1]
N = tuple(range(p))
v_total = conditional_sage_value(N, model, X_test, y_test, estimated_mean, estimated_cov)

# ── LOCO: v(N) - v(N \ {j}) ───────────────────────────────────────────────
loco_scores = []
for j in range(p):
    N_minus_j = tuple(i for i in range(p) if i != j)
    loco_j = v_total - conditional_sage_value(N_minus_j, model, X_test, y_test,
                                               estimated_mean, estimated_cov)
    loco_scores.append(loco_j)
    print(f"LOCO({feature_names[j]}): {loco_j:.4f}  ({100*loco_j/v_total:.1f}% of explained var.)")

In [None]:
# Plot LOCO scores as share of explained variance
loco_pct = [100 * s / v_total for s in loco_scores]

plt.figure(figsize=(6, 4))
plt.barh(feature_names[::-1], loco_pct[::-1],
         color='grey', edgecolor='black', linewidth=0.5)
plt.xlabel("Share of explained variance (%)")
plt.title("LOCO (conditional marginalization)")
plt.axvline(0, color='black', linewidth=0.8)
plt.tight_layout()
plt.show()

### Interpretation (Solution 4)

**What does LOCO measure?**

$\text{LOCO}_j = v(N) - v(N \setminus \{j\})$ is the **drop in explained variance** when feature $j$ is removed from the full model, with the missing feature marginalised out using the conditional distribution.

For the L2 loss and an optimal model, it equals $\text{Var}(\mathbb{E}[Y \mid X_N]) - \text{Var}(\mathbb{E}[Y \mid X_{N \setminus \{j\}}])$, i.e. the reduction in explained variance, where $N = \{1,\ldots,p\}$ is the full feature set.

**Results for our DGP:**

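- **$X_1$**: LOCO is positive, but only a small share of the explained variance. Removing $X_1$ and marginalising it out via the conditional distribution costs little, because $X_2$ is a near-copy of $X_1$: $\mathbb{E}[X_1 \mid X_{-1}] \approx 0.999\, X_2$, so $\mathbb{E}[f(X) \mid X_{-1}]$ reconstructs the full prediction almost exactly. For an optimal model, the explained variance drops from $25$ to $25 \cdot 0.999^2 \approx 24.95$, i.e. by about $0.2\%$. $X_1$ is still the only feature with a systematically positive LOCO, as expected, since it is the only true cause of $Y$.
- **$X_2$**: LOCO $\approx 0$. Once $X_1$ is available, $X_2$ adds nothing: $\mathbb{E}[f(X) \mid X_{-2}]$ can reconstruct the full prediction because $X_2 \approx X_1$.
- **$X_3, X_4$**: LOCO $\approx 0$. Their large opposing coefficients still cancel when the conditional distribution is used.
- **$X_5$**: LOCO $\approx 0$. Purely irrelevant; removing it changes nothing.

Values slightly below zero can occur for the last three bullets, because dropping a coefficient that only fits noise can marginally reduce the test error.

**Comparison with CFI:**

| | CFI | LOCO |
|---|---|---|
| Question | Does $X_j$ improve predictions beyond the other features? | How much explained variance is lost when $X_j$ is removed? |
| $X_1$ | Non-zero | Positive, but a small share |
| $X_2$ | $\approx 0$ | $\approx 0$ |
| $X_3, X_4, X_5$ | $\approx 0$ | $\approx 0$ |
| Magnitude | Increase in MSE | Fraction of $R^2$ |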

Both CFI and LOCO correctly identify $X_1$ as the only relevant feature. LOCO additionally gives an **interpretable magnitude**: the fraction of the model's $R^2$ attributable to each feature.

> LOCO $\neq 0$ implies conditional dependence ($X_j \not\perp\!\!\!\perp Y \mid X_{-j}$) and provides an interpretable measure of explained variance. See Ewald et al. (2024), Section 9.