# Feature Importance Methods for Scientific Inference — SOLUTIONS

---

## Setup

Run all cells in this section first.

In [None]:
!pip install fippy -q

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

plt.rcParams.update({
    'font.size':        14,
    'axes.titlesize':   16,
    'axes.labelsize':   14,
    'xtick.labelsize':  13,
    'ytick.labelsize':  13,
})

### Data Generating Process

$$X_1 \sim \mathcal{N}(0,1), \quad X_3 \sim \mathcal{N}(0,1), \quad X_5 \sim \mathcal{N}(0,1) \quad \text{(mutually independent)}$$
$$X_2 = 0.999\,X_1 + \sqrt{1-0.999^2}\,\varepsilon_2, \qquad X_4 = 0.999\,X_3 + \sqrt{1-0.999^2}\,\varepsilon_4$$
$$Y = 5\,X_1 + \varepsilon_Y, \quad \varepsilon_Y \sim \mathcal{N}(0,1)$$

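Here $\varepsilon_2, \varepsilon_4 \sim \mathcal{N}(0,1)$ are independent noise terms, so $X_2$ and $X_4$ also have unit variance. Only $X_1$ causes $Y$. $X_2$ is a noisy copy of $X_1$ ($\rho = 0.999$). $X_3$ and $X_4$ form a correlated but irrelevant pair. $X_5$ is purely irrelevant.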
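
A quick empirical check of the data generating process can make these statements concrete. The sketch below is self-contained and uses its own seed and a larger sample size (both chosen only for this illustration, not taken from the exercise). It confirms that $X_2$ has unit variance, is almost perfectly correlated with $X_1$, and is therefore strongly associated with $Y$ even though it does not cause $Y$.

```python
import numpy as np

rng = np.random.RandomState(0)   # illustrative seed, not the one used in the exercise
n = 100_000                      # large n so the estimates are close to the population values
rho = 0.999

x1 = rng.normal(0, 1, n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(0, 1, n)
y = 5 * x1 + rng.normal(0, 1, n)

var_x2 = x2.var()
corr_x1_x2 = np.corrcoef(x1, x2)[0, 1]
corr_x2_y = np.corrcoef(x2, y)[0, 1]

print(f"Var(X2)      = {var_x2:.3f}")      # theory: 1
print(f"corr(X1, X2) = {corr_x1_x2:.4f}")  # theory: 0.999
print(f"corr(X2, Y)  = {corr_x2_y:.3f}")   # theory: 5 * 0.999 / sqrt(26)
```

The last line is the reason pairwise association alone cannot separate $X_1$ from $X_2$.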

In [None]:
def generate_data(n=1500, seed=83):
    rng = np.random.RandomState(seed)
    x1 = rng.normal(0, 1, n)
    x2 = 0.999 * x1 + np.sqrt(1 - 0.999**2) * rng.normal(0, 1, n)
    x3 = rng.normal(0, 1, n)
    x4 = 0.999 * x3 + np.sqrt(1 - 0.999**2) * rng.normal(0, 1, n)
    y  = 5 * x1 + rng.normal(0, 1, n)
    x5 = rng.normal(0, 1, n)
    X  = np.column_stack([x1, x2, x3, x4, x5])
    return X, y

X, y = generate_data()
feature_names = ["X1", "X2", "X3", "X4", "X5"]

### Model

OLS on 1000 training observations, evaluated on 500 test observations.

In [None]:
n_train = 1000
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

model = LinearRegression().fit(X_train, y_train)
print(f"Test R\u00b2: {model.score(X_test, y_test):.3f}")
print(f"Coefficients: {np.round(model.coef_, 2)}")

### fippy setup

The **Gaussian sampler** estimates $P(X_j \mid X_{-j})$ in closed form under the multivariate normal assumption.
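
To illustrate what "closed form" means here, the sketch below applies the textbook conditional-Gaussian formula to a two-feature example with $\rho = 0.999$. It is not fippy's implementation; `gaussian_conditional` is a hypothetical helper written only for this illustration.

```python
import numpy as np

def gaussian_conditional(mu, Sigma, j, x_rest):
    """Mean and variance of X_j | X_{-j} = x_rest for X ~ N(mu, Sigma)."""
    rest = [k for k in range(len(mu)) if k != j]
    S_jr = Sigma[j, rest]                    # Cov(X_j, X_{-j})
    S_rr = Sigma[np.ix_(rest, rest)]         # Cov(X_{-j})
    w = np.linalg.solve(S_rr, S_jr)          # regression weights of X_j on X_{-j}
    cond_mean = mu[j] + w @ (x_rest - mu[rest])
    cond_var = Sigma[j, j] - w @ S_jr        # Schur complement
    return cond_mean, cond_var

rho = 0.999
mu = np.zeros(2)
Sigma = np.array([[1.0, rho],
                  [rho, 1.0]])

# Distribution of the first feature given that the second one equals 2.0
cond_mean, cond_var = gaussian_conditional(mu, Sigma, j=0, x_rest=np.array([2.0]))
print(f"conditional mean = {cond_mean:.3f}, conditional variance = {cond_var:.6f}")
```

With such a small conditional variance, a conditionally resampled value stays very close to what the partner feature dictates.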

In [None]:
from fippy.explainers import Explainer
from fippy.samplers import GaussianSampler

X_train_df = pd.DataFrame(X_train, columns=feature_names)
X_test_df  = pd.DataFrame(X_test,  columns=feature_names)
y_train_s  = pd.Series(y_train, name='y')
y_test_s   = pd.Series(y_test,  name='y')

sampler   = GaussianSampler(X_train_df)
explainer = Explainer(model.predict, X_train_df,
                      loss=mean_squared_error, sampler=sampler)

---

# Solution 2: PFI — Implementation and Why It Misleads

**PFI** permutes feature $j$, breaking its association with all other variables, and measures the
resulting increase in loss:

$$\text{PFI}_j = \mathbb{E}[L(Y, \hat{f}(\tilde{X}_j, X_{-j}))] - \mathbb{E}[L(Y, \hat{f}(X))], \quad \tilde{X}_j \perp X_{-j}$$

In [None]:
def my_pfi(model, X, y, feature_idx, n_repeats=50, seed=42):
    """Permutation Feature Importance for a single feature."""
    rng = np.random.RandomState(seed)
    baseline_mse = mean_squared_error(y, model.predict(X))

    perturbed_mses = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, feature_idx] = rng.permutation(X[:, feature_idx])
        perturbed_mses.append(mean_squared_error(y, model.predict(X_perm)))

    return np.mean(perturbed_mses) - baseline_mse


pfi_scores = [my_pfi(model, X_test, y_test, j) for j in range(len(feature_names))]
for name, score in zip(feature_names, pfi_scores):
    print(f"PFI({name}): {score:.4f}")

plt.figure(figsize=(6, 4))
plt.barh(feature_names[::-1], pfi_scores[::-1], color='grey', edgecolor='black', linewidth=0.5)
plt.xlabel("PFI (increase in MSE)")
plt.title("Permutation Feature Importance")
plt.tight_layout()
plt.show()

In [None]:
# Scatterplot: (X3, X4) before and after permuting X3
rng = np.random.RandomState(42)
X_perm = X_test.copy()
X_perm[:, 2] = rng.permutation(X_test[:, 2])

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(X_test[:, 2], X_test[:, 3], alpha=0.3, s=10, color='grey')
axes[0].set(xlabel="$X_3$", ylabel="$X_4$", title="Original: $(X_3, X_4)$",
            xlim=(-4,4), ylim=(-4,4))
axes[1].scatter(X_perm[:, 2], X_perm[:, 3], alpha=0.3, s=10, color='grey')
axes[1].set(xlabel=r"$\tilde{X}_3$ (permuted)", ylabel="$X_4$",
            title="After permuting $X_3$", xlim=(-4,4), ylim=(-4,4))
plt.tight_layout()
plt.show()

### Interpretation

**PFI values are high for all features except $X_5$** — including $X_3$ and $X_4$, which are
completely independent of $Y$.

**Why?** The fitted coefficients are $\hat{\beta} \approx [3.11, 1.88, -2.11, 2.17, 0.02]$.
Because $X_3 \approx X_4$ ($\rho = 0.999$), OLS assigns them large opposing coefficients that
nearly cancel in the original data. Permuting $X_3$ destroys this cancellation — the model's
predictions blow up, and PFI records a large error increase.

The scatterplot makes this visible: the original $(X_3, X_4)$ data lies on a tight diagonal.
After permuting $X_3$, the cloud becomes circular — these out-of-distribution combinations
never appeared during training.
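
The cancellation argument can be checked with a few lines of arithmetic. The sketch below takes the coefficients for $X_3$ and $X_4$ quoted above and assumes unit variances and $\rho = 0.999$ as in the data generating process. It compares the variance of the combined contribution $\hat{\beta}_3 X_3 + \hat{\beta}_4 X_4$ before and after the dependence between the two features is destroyed.

```python
beta3, beta4 = -2.11, 2.17   # fitted coefficients quoted in the text
rho = 0.999                  # corr(X3, X4); both features have unit variance

# Var(b3*X3 + b4*X4) when X3 and X4 keep their original correlation
var_original = beta3**2 + beta4**2 + 2 * beta3 * beta4 * rho

# After permuting X3 it is independent of X4, so the covariance term vanishes
var_permuted = beta3**2 + beta4**2

print(f"variance with correlation intact: {var_original:.4f}")
print(f"variance after permuting X3:      {var_permuted:.4f}")
print(f"ratio: {var_permuted / var_original:.0f}x")
```

On the original data the two terms almost cancel, so the pair adds next to nothing to the prediction. After permutation the same coefficients inject variance of the same order as the signal $\text{Var}(5 X_1) = 25$, which is what PFI picks up.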

**Key takeaway:** PFI measures *model reliance*, not association with $Y$.

| Conclusion | PFI $\neq$ 0 | PFI $= 0$ |
|---|---|---|
| Model uses feature | ✅ | ❓ |
| Feature is predictive of $Y$ | ❓ | ❓ |
| Feature is causally relevant | ❓ | ❓ |

---

# Solution 3: Conditional Feature Importance (CFI)

CFI replaces the permutation sampler with sampling from $P(X_j \mid X_{-j})$, preserving
the correlations between features while still breaking the direct $X_j$–$Y$ link:

$$\text{CFI}_j = \mathbb{E}[L(Y, \hat{f}(\tilde{X}_j, X_{-j}))] - \mathbb{E}[L(Y, \hat{f}(X))], \quad \tilde{X}_j \sim P(X_j \mid X_{-j})$$

The Gaussian sampler estimates this conditional distribution in closed form.

In [None]:
ex_cfi = explainer.cfi(X_test_df, y_test_s, nr_runs=10)
ex_cfi.hbarplot()
plt.show()

means, stds = ex_cfi.fi_means_stds()
print("CFI scores:")
for feat, m, s in zip(feature_names, means, stds):
    print(f"  {feat}: {m:.4f} ± {s:.4f}")

### Interpretation

**Only $X_1$ receives a non-zero CFI score.** All others are $\approx 0$.

Why the conditional sampler fixes the problem:
- **$X_2$**: $P(X_2 \mid X_1, X_3, X_4, X_5)$ is concentrated around $0.999\,X_1$, so $\tilde{X}_2 \approx X_2$ and the model's
  prediction barely changes. CFI $\approx 0$: $X_2$ carries no information about $Y$ *beyond
  what $X_1$ already provides*.
- **$X_3, X_4$**: $\tilde{X}_3 \sim P(X_3 \mid X_4, \ldots)$ preserves $\rho(X_3, X_4) = 0.999$,
  so the large opposing coefficients still cancel. CFI $\approx 0$ — they are irrelevant to $Y$.
- **$X_5$**: independent of everything; resampling changes nothing. CFI $\approx 0$.
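- **$X_1$**: resampling from $P(X_1 \mid X_{-1})$ replaces the part of $X_1$ that $X_2$ cannot
  reconstruct, and that part still drives $Y$. CFI is positive. Its absolute size is limited by the
  small conditional variance $1 - 0.999^2 \approx 0.002$, so judge it against the other CFI scores
  rather than against the PFI values.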
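
The per-feature reasoning above can be reproduced without fippy. The following is a minimal NumPy-only sketch, assuming the same data generating process, an OLS fit via least squares, and a Gaussian conditional sampler estimated from the training data. The helpers `make_data`, `sample_conditional` and `my_cfi` are hypothetical names for this illustration, and the seed and test size differ from the exercise.

```python
import numpy as np

def make_data(n, rng, rho=0.999):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    x3 = rng.normal(size=n)
    x4 = rho * x3 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    x5 = rng.normal(size=n)
    y = 5 * x1 + rng.normal(size=n)
    return np.column_stack([x1, x2, x3, x4, x5]), y

def sample_conditional(X, j, mu, Sigma, rng):
    """Draw X_j from the Gaussian conditional given the other columns of X."""
    rest = [k for k in range(X.shape[1]) if k != j]
    w = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])
    cond_mean = mu[j] + (X[:, rest] - mu[rest]) @ w
    cond_var = Sigma[j, j] - w @ Sigma[rest, j]
    return cond_mean + np.sqrt(cond_var) * rng.normal(size=len(X))

def my_cfi(predict, X, y, j, mu, Sigma, rng, n_repeats=20):
    """Mean increase in MSE when X_j is replaced by a conditional draw."""
    baseline = np.mean((y - predict(X)) ** 2)
    losses = []
    for _ in range(n_repeats):
        X_tilde = X.copy()
        X_tilde[:, j] = sample_conditional(X, j, mu, Sigma, rng)
        losses.append(np.mean((y - predict(X_tilde)) ** 2))
    return np.mean(losses) - baseline

rng = np.random.RandomState(0)          # illustrative seed
X_tr, y_tr = make_data(1000, rng)
X_te, y_te = make_data(5000, rng)

# OLS with intercept via least squares
A = np.column_stack([np.ones(len(X_tr)), X_tr])
beta = np.linalg.lstsq(A, y_tr, rcond=None)[0]
predict = lambda X: beta[0] + X @ beta[1:]

# Gaussian sampler parameters estimated on the training data
mu, Sigma = X_tr.mean(axis=0), np.cov(X_tr, rowvar=False)

cfi_scores = [my_cfi(predict, X_te, y_te, j, mu, Sigma, rng) for j in range(5)]
for name, score in zip(["X1", "X2", "X3", "X4", "X5"], cfi_scores):
    print(f"CFI({name}): {score:.4f}")
```

Under these assumptions the score for $X_1$ should stand out from the rest, which scatter around zero. The exact values depend on the seed and on the fitted coefficients.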

**CFI $\neq 0$ implies conditional dependence:** $X_j \not\perp\!\!\!\perp Y \mid X_{-j}$.

| Conclusion | CFI $\neq$ 0 | CFI $= 0$ |
|---|---|---|
| Model uses feature | ✅ | ❓ |
| Pairwise association with $Y$ | ❓ | ❓ |
| Conditional association with $Y$ | ✅ | ❓ |

---

# Solution 4: Leave-One-Covariate-Out (LOCO)

LOCO measures the **drop in explained variance** when a feature is removed from the full model,
with the missing feature marginalised out via its conditional distribution:

$$\text{LOCO}_j = v(\{1,\ldots,p\}) - v(\{1,\ldots,p\} \setminus \{j\})$$

where the conditional SAGE value function is:

$$v(S) = \mathbb{E}[(Y - \mathbb{E}[f(X)])^2] - \mathbb{E}[(Y - \mathbb{E}[f(X)\mid X_S])^2]$$

With $N = \{1,\ldots,p\}$ denoting the full feature set, $v(N) \approx \text{Var}(Y) \cdot R^2$, so expressing LOCO as a fraction of $v(N)$ gives
each feature's **share of the model's $R^2$**.
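
For a linear model and jointly Gaussian features, the restricted prediction $\mathbb{E}[f(X) \mid X_S]$ has a closed form, which makes the value function easy to sketch by hand. The code below is a NumPy-only illustration under those assumptions, not fippy's sampling-based implementation. `restricted_prediction` and `value_function` are hypothetical helper names, and the seed and test size are chosen for this example.

```python
import numpy as np

def make_data(n, rng, rho=0.999):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    x3 = rng.normal(size=n)
    x4 = rho * x3 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    x5 = rng.normal(size=n)
    y = 5 * x1 + rng.normal(size=n)
    return np.column_stack([x1, x2, x3, x4, x5]), y

def restricted_prediction(X, S, intercept, coef, mu, Sigma):
    """E[f(X) | X_S] for linear f and Gaussian X: impute X_C by its conditional mean."""
    C = [k for k in range(X.shape[1]) if k not in S]   # features to marginalise out
    X_hat = X.copy()
    if C and S:
        W = np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[np.ix_(S, C)])
        X_hat[:, C] = mu[C] + (X[:, S] - mu[S]) @ W
    elif C:
        X_hat[:, C] = mu[C]                            # nothing to condition on
    return intercept + X_hat @ coef

def value_function(X, y, S, intercept, coef, mu, Sigma):
    """v(S): MSE of the constant prediction minus MSE of the restricted prediction."""
    f_empty = restricted_prediction(X, [], intercept, coef, mu, Sigma)
    f_S = restricted_prediction(X, S, intercept, coef, mu, Sigma)
    return np.mean((y - f_empty) ** 2) - np.mean((y - f_S) ** 2)

rng = np.random.RandomState(0)          # illustrative seed
X_tr, y_tr = make_data(1000, rng)
X_te, y_te = make_data(5000, rng)

A = np.column_stack([np.ones(len(X_tr)), X_tr])
beta = np.linalg.lstsq(A, y_tr, rcond=None)[0]
intercept, coef = beta[0], beta[1:]
mu, Sigma = X_tr.mean(axis=0), np.cov(X_tr, rowvar=False)

full = list(range(X_te.shape[1]))
v_N = value_function(X_te, y_te, full, intercept, coef, mu, Sigma)
loco = [v_N - value_function(X_te, y_te, [k for k in full if k != j],
                             intercept, coef, mu, Sigma) for j in full]

mse = np.mean((y_te - (intercept + X_te @ coef)) ** 2)
r2 = 1 - mse / y_te.var()
print(f"v(N) = {v_N:.3f}, Var(Y) * R^2 = {y_te.var() * r2:.3f}")
for j, score in enumerate(loco):
    print(f"LOCO(X{j + 1}): {score:.4f}  ({100 * score / v_N:.2f}% of v(N))")
```

Dividing each LOCO score by `v_N` gives the share of explained variance. The printed values depend on the seed and on the fitted coefficients.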

In [None]:
N = feature_names

# v(N) – explained variance of the full model
ex_vN = explainer.csagevf(S=list(N), X_eval=X_test_df, y_eval=y_test_s)
means_N, _ = ex_vN.fi_means_stds()
v_N = float(np.array(means_N).flatten()[0])
print(f"v(all features) = {v_N:.4f}  [≈ Var(Y)·R²]\n")

# LOCO = v(N) - v(N\{j}) for every j, computed in one call
ex_loco = explainer.csagevfs(X_test_df, y_test_s, C='remainder')
loco_means, loco_stds = ex_loco.fi_means_stds()
loco_scores = dict(zip(feature_names, loco_means))
for feat, m in loco_scores.items():
    print(f"LOCO({feat}): {m:.4f}  ({100 * m / v_N:.1f}% of explained variance)")

pct = [100 * loco_scores[f] / v_N for f in N]
plt.figure(figsize=(6, 4))
plt.barh(N[::-1], pct[::-1], color='grey', edgecolor='black', linewidth=0.5)
plt.xlabel("Share of explained variance (%)")
plt.title("LOCO (conditional marginalization)")
plt.axvline(0, color='black', linewidth=0.8)
plt.tight_layout()
plt.show()

### Interpretation

**$X_1$ is the only feature with a clearly positive LOCO score.**
All other features contribute $\approx 0\%$.

The logic for each feature:
- **$X_1$**: removing it (and marginalising via $P(X_1 \mid X_{-1})$) replaces $X_1$ by its conditional
  mean, which is $\approx 0.999\,X_2$. Only the part of $X_1$ that $X_2$ cannot reconstruct is lost
  (conditional variance $1 - 0.999^2 \approx 0.002$), so LOCO is positive but well below $R^2 \cdot \text{Var}(Y)$.
- **$X_2$**: $P(X_2 \mid X_1, X_3, X_4, X_5)$ is concentrated around $0.999\,X_1$, so the model's restricted prediction
  $\mathbb{E}[f(X) \mid X_{-2}]$ is virtually identical to the full prediction. LOCO $\approx 0$.
- **$X_3, X_4$**: same argument — conditional distribution preserves the correlation, so the
  opposing coefficients still cancel in the restricted model. LOCO $\approx 0$.
- **$X_5$**: near-zero coefficient, purely irrelevant. LOCO $\approx 0$.

**Comparison of the three methods:**

| | PFI | CFI | LOCO |
|---|---|---|---|
| Model uses feature | ✅ if $\neq 0$ | ✅ if $\neq 0$ | ✅ if $\neq 0$ |
| Pairwise assoc. $X_j \not\perp\!\!\!\perp Y$ | ❓ | ❓ | ❓ |
| Conditional assoc. $X_j \not\perp\!\!\!\perp Y \mid X_{-j}$ | ❓ | ✅ if $\neq 0$ | ✅ if $\neq 0$ |
| Magnitude = share of explained variance | ❌ | ❌ | ✅ |

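LOCO is the only method here that gives an **interpretable magnitude**: for a model that approximates
$\mathbb{E}[Y \mid X]$ well, its value equals the reduction in explained variance,
$\text{Var}(\mathbb{E}[Y \mid X_N]) - \text{Var}(\mathbb{E}[Y \mid X_{N \setminus \{j\}}])$, when feature $j$ is removed.
This is the standard definition of a feature's contribution to explained variance.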