# Feature Importance Methods for Scientific Inference

In these exercises you will:

1. Compute and interpret **Permutation Feature Importance (PFI)** scores
2. Discover **why PFI can be misleading** when features are correlated
3. Use a **conditional sampler** to compute Conditional Feature Importance (CFI) and compare it with PFI

---

## Setup

Run the cells below to set everything up. **You do not need to modify any code in this section.**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import warnings
warnings.filterwarnings('ignore')

### Data Generation

We generate a dataset with 4 features and a continuous target variable $Y$.

**You do not know the data generating process (DGP) yet.** You only observe the features $X_1, X_2, X_3, X_4$ and the target $Y$.

In [None]:
def generate_data(n=1500, seed=83):
    """Generate the dataset. The DGP is hidden for now."""
    rng = np.random.RandomState(seed)
    x1 = rng.normal(0, 1, n)
    x2 = 0.999 * x1 + np.sqrt(1 - 0.999**2) * rng.normal(0, 1, n)
    x3 = rng.normal(0, 1, n)
    x4 = 0.999 * x3 + np.sqrt(1 - 0.999**2) * rng.normal(0, 1, n)
    y = 5 * x1 + rng.normal(0, 1, n)
    X = np.column_stack([x1, x2, x3, x4])
    return X, y

X, y = generate_data()
feature_names = ["X1", "X2", "X3", "X4"]

print(f"Dataset shape: {X.shape}")
print(f"Feature names: {feature_names}")
print(f"\nFirst 5 rows of X:")
print(np.round(X[:5], 2))
print(f"\nFirst 5 values of Y: {np.round(y[:5], 2)}")

### Train a model

We train a Linear Regression on the data.

In [None]:
# Split into train and test
n_train = 1000
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]

# Train a Linear Regression (unregularized)
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r2 = model.score(X_test, y_test)
print(f"Test MSE: {mse:.3f}")
print(f"Test R²:  {r2:.3f}")
print(f"\nFitted coefficients: {np.round(model.coef_, 2)}")

### Provided functions: Permutation Sampler and PFI

Below we provide:
- A **permutation sampler**: for a given feature $j$, it randomly shuffles (permutes) the values of $X_j$ across observations, breaking the association between $X_j$ and all other variables.
- A **PFI function**: computes the Permutation Feature Importance:

$$\text{PFI}_j = \mathbb{E}[L(Y, \hat{f}(\tilde{X}_j, X_{-j}))] - \mathbb{E}[L(Y, \hat{f}(X))]$$

where $\tilde{X}_j$ is drawn independently from the marginal distribution of $X_j$ (i.e., permuted), and $L$ is the loss function (here: MSE).
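
As a sanity check on the definition above, the following minimal sketch computes PFI by hand on a toy dataset with two independent features. The names `X_toy`, `y_toy`, `beta_hat`, and `toy_mse` are illustrative and not part of the provided code, and the fit uses plain least squares via NumPy rather than the `LinearRegression` model trained above. For a linear model whose residuals are uncorrelated with $X_j$, the PFI under MSE should be close to $2\,\hat{\beta}_j^2\,\mathrm{Var}(X_j)$; the sketch compares the permutation estimate against that value.

```python
import numpy as np

rng_toy = np.random.RandomState(0)
n_toy = 20000

# Toy data (assumption): two independent standard-normal features,
# only the first one enters the target.
X_toy = rng_toy.normal(0, 1, size=(n_toy, 2))
y_toy = 3.0 * X_toy[:, 0] + rng_toy.normal(0, 1, n_toy)

# Least-squares fit without intercept (all variables have mean zero by construction)
beta_hat, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)


def toy_mse(y_true, y_hat):
    """Mean squared error."""
    return np.mean((y_true - y_hat) ** 2)


baseline_toy = toy_mse(y_toy, X_toy @ beta_hat)

pfi_toy = []
for j in range(X_toy.shape[1]):
    losses = []
    for _ in range(20):
        X_perm = X_toy.copy()
        X_perm[:, j] = rng_toy.permutation(X_toy[:, j])  # marginal resampling of column j
        losses.append(toy_mse(y_toy, X_perm @ beta_hat))
    pfi_toy.append(np.mean(losses) - baseline_toy)

# Reference value for a linear model under MSE: 2 * beta_j^2 * Var(X_j)
closed_form_toy = 2 * beta_hat**2 * X_toy.var(axis=0)

print("PFI (permutation):", np.round(pfi_toy, 3))
print("2 * beta^2 * Var: ", np.round(closed_form_toy, 3))
```

The two printed rows should roughly agree. Note that the features here are independent, so this sketch says nothing yet about what happens under correlation.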

In [None]:
def permutation_sampler(X, feature_idx, rng=None):
    """
    Permutation (marginal) sampler.
    Returns a copy of X where column `feature_idx` is randomly permuted,
    effectively sampling X_j from its marginal distribution independently
    of all other features and Y.
    """
    if rng is None:
        rng = np.random.RandomState(0)
    X_perm = X.copy()
    X_perm[:, feature_idx] = rng.permutation(X[:, feature_idx])
    return X_perm


def compute_pfi(model, X, y, feature_idx, sampler, n_repeats=25, seed=42):
    """
    Compute Permutation Feature Importance for feature `feature_idx`.

    PFI_j = E[L(Y, f(X_tilde_j, X_{-j}))] - E[L(Y, f(X))]

    Parameters
    ----------
    model : fitted sklearn model
    X : np.ndarray, shape (n, p) - test features
    y : np.ndarray, shape (n,) - test target
    feature_idx : int - index of the feature to perturb (permute or resample)
    sampler : callable - function(X, feature_idx, rng) -> X_perturbed
    n_repeats : int - number of repetitions to average over
    seed : int - random seed

    Returns
    -------
    pfi : float - the PFI score
    """
    rng = np.random.RandomState(seed)
    baseline_mse = mean_squared_error(y, model.predict(X))

    perturbed_mses = []
    for _ in range(n_repeats):
        X_perturbed = sampler(X, feature_idx, rng=rng)
        y_pred_perturbed = model.predict(X_perturbed)
        perturbed_mses.append(mean_squared_error(y, y_pred_perturbed))

    pfi = np.mean(perturbed_mses) - baseline_mse
    return pfi

---

# Exercise 1: Compute and Interpret PFI

Use the provided `compute_pfi` function with the `permutation_sampler` to compute PFI for all four features.

**Tasks:**

1. Compute PFI for each feature ($X_1, X_2, X_3, X_4$) and display the results as a bar chart.
2. Answer the following questions:
   - Which features does the **model** rely on for its predictions?
   - Based on PFI alone, which features would you conclude are **important in the data** (i.e., associated with $Y$)?
   - What exactly does PFI measure?

In [None]:
# YOUR CODE HERE
# Step 1: Compute PFI for all features



# Step 2: Create a bar chart of PFI values



**Your interpretation (double-click to edit):**

- Which features does the model rely on?
  - *Your answer here*

- Which features appear important in the data?
  - *Your answer here*

- What does PFI measure exactly?
  - *Your answer here*

---

# Exercise 2: Why is PFI misleading here?

Now we reveal the **true data generating process (DGP)**:

$$X_1 \sim \mathcal{N}(0,1), \quad X_3 \sim \mathcal{N}(0,1) \quad \text{(independent of each other)}$$

$$X_2 = 0.999 \cdot X_1 + \sqrt{1 - 0.999^2} \cdot \varepsilon_2, \quad \varepsilon_2 \sim \mathcal{N}(0,1)$$

$$X_4 = 0.999 \cdot X_3 + \sqrt{1 - 0.999^2} \cdot \varepsilon_4, \quad \varepsilon_4 \sim \mathcal{N}(0,1)$$

$$Y = 5 X_1 + \varepsilon_Y, \quad \varepsilon_Y \sim \mathcal{N}(0, 1)$$
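
The factor $\sqrt{1 - 0.999^2}$ in the equations for $X_2$ and $X_4$ keeps their variance at $0.999^2 + (1 - 0.999^2) = 1$ and makes the correlation with $X_1$ (respectively $X_3$) equal to $0.999$. A quick empirical check of this construction, using illustrative names (`z1`, `z2`, `rho_check`) and an arbitrary seed that are not part of the notebook:

```python
import numpy as np

rho_check = 0.999
rng_check = np.random.RandomState(1)
n_check = 200000

z1 = rng_check.normal(0, 1, n_check)
# Same construction as for X2 and X4 in the DGP
z2 = rho_check * z1 + np.sqrt(1 - rho_check**2) * rng_check.normal(0, 1, n_check)

var_z2 = z2.var()
corr_z1_z2 = np.corrcoef(z1, z2)[0, 1]

print(f"Var(z2)      = {var_z2:.4f}  (theory: 1)")
print(f"Corr(z1, z2) = {corr_z1_z2:.4f}  (theory: {rho_check})")
```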

**Key observations about the DGP:**
- $Y$ depends **only** on $X_1$
- $X_2$ is **not a direct cause** of $Y$; it is only correlated with $Y$ through $X_1$
- $X_3$ and $X_4$ are **completely independent** of $Y$; they are only correlated with each other

Yet, PFI assigns substantial importance scores to **all four features**!

**Tasks:**

1. Create scatterplots comparing the **original** data vs. the **permuted** data for the pairs $(X_1, X_2)$ and $(X_3, X_4)$. What do you notice?
2. Look at the fitted model coefficients (printed in the setup). How do they relate to the PFI scores?
3. Explain in your own words **why** PFI gives $X_3$ and $X_4$ high importance scores, even though they are completely independent of $Y$.

In [None]:
# YOUR CODE HERE
# Step 1: Create permuted versions of the test data
#   - Permute X2 (feature index 1)
#   - Permute X3 (feature index 2)



# Step 2: Create side-by-side scatterplots
#   - (X1, X2) in original vs. permuted data
#   - (X3, X4) in original vs. permuted data



**Your explanation (double-click to edit):**

- What do you notice in the scatterplots?
  - *Your answer here*

- How do the model coefficients explain the PFI scores?
  - *Your answer here*

- Why does PFI assign high importance to $X_3$ and $X_4$?
  - *Your answer here*

---

# Exercise 3: Conditional Feature Importance (CFI)

To address the problem from Exercise 2, we can use a **conditional sampler** instead of the permutation sampler.

Instead of sampling $\tilde{X}_j$ from the marginal distribution (breaking all dependencies), we sample from the **conditional distribution**:

$$\tilde{X}_j \sim F_{X_j | X_{-j}}$$

This preserves the dependence structure among the features. The only association it breaks is the one between $X_j$ and $Y$ that remains after conditioning on $X_{-j}$.

For multivariate normal data, the conditional distribution has a closed-form solution:

$$X_j \mid X_{-j} = x_{-j} \sim \mathcal{N}\left(\mu_{j|-j},\; \sigma^2_{j|-j}\right)$$

where:
- $\mu_{j|-j} = \mu_j + \Sigma_{j,-j} \Sigma_{-j,-j}^{-1}(x_{-j} - \mu_{-j})$
- $\sigma^2_{j|-j} = \Sigma_{j,j} - \Sigma_{j,-j} \Sigma_{-j,-j}^{-1} \Sigma_{-j,j}$
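
To make these formulas concrete, here is a minimal sketch for a bivariate normal with unit variances and an illustrative correlation of $0.9$ (an assumption chosen for readability, not the value used in this notebook's data). In that case the formulas reduce to $\mu_{j|-j} = \rho\, x_{-j}$ and $\sigma^2_{j|-j} = 1 - \rho^2$. The sketch also contrasts the two ways of perturbing a feature: conditional resampling should keep the correlation with the other feature near $\rho$, while permutation should drive it towards zero. All names ending in `_demo` are hypothetical.

```python
import numpy as np

rho_demo = 0.9                                   # illustrative correlation (assumption)
mu_demo = np.zeros(2)
Sigma_demo = np.array([[1.0, rho_demo], [rho_demo, 1.0]])

rng_demo = np.random.RandomState(0)
Z_demo = rng_demo.multivariate_normal(mu_demo, Sigma_demo, size=50000)

j, others = 1, [0]
Sigma_j_o = Sigma_demo[j, others]                # Sigma_{j,-j}
Sigma_oo = Sigma_demo[np.ix_(others, others)]    # Sigma_{-j,-j}

# Sigma_{-j,-j}^{-1} Sigma_{-j,j}, obtained by solving a linear system
beta_demo = np.linalg.solve(Sigma_oo, Sigma_j_o)
cond_var_demo = Sigma_demo[j, j] - Sigma_j_o @ beta_demo   # equals 1 - rho^2 here

# Conditional mean per observation, then one conditional draw per observation
cond_mean_demo = mu_demo[j] + (Z_demo[:, others] - mu_demo[others]) @ beta_demo
z_cond = cond_mean_demo + rng_demo.normal(0, np.sqrt(cond_var_demo), len(Z_demo))

# Marginal (permutation) draw for comparison
z_perm = rng_demo.permutation(Z_demo[:, j])

corr_cond = np.corrcoef(Z_demo[:, 0], z_cond)[0, 1]
corr_perm = np.corrcoef(Z_demo[:, 0], z_perm)[0, 1]

print(f"Conditional variance:          {cond_var_demo:.3f}")
print(f"Correlation after conditional: {corr_cond:.3f}")
print(f"Correlation after permutation: {corr_perm:.3f}")
```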

The conditional sampler is provided below.

In [None]:
def conditional_sampler(X, feature_idx, rng=None, mean=None, cov=None):
    """
    Conditional sampler for multivariate normal data.
    Samples X_j from X_j | X_{-j} using the closed-form Gaussian conditional.

    Parameters
    ----------
    X : np.ndarray, shape (n, p)
    feature_idx : int - the feature to resample
    rng : np.random.RandomState
    mean : np.ndarray, shape (p,) - mean of the joint distribution
    cov : np.ndarray, shape (p, p) - covariance of the joint distribution

    Returns
    -------
    X_cond : np.ndarray - copy of X with column `feature_idx` resampled conditionally
    """
    if rng is None:
        rng = np.random.RandomState(0)

    n, p = X.shape
    j = feature_idx

    # Indices of all other features
    others = [i for i in range(p) if i != j]

    # Extract sub-matrices from the covariance
    sigma_jj = cov[j, j]                              # scalar
    sigma_j_others = cov[j, others]                    # (p-1,)
    sigma_others_others = cov[np.ix_(others, others)]  # (p-1, p-1)

    # Conditional parameters (solve the linear system rather than forming an explicit inverse,
    # which is more stable when the other features are nearly collinear)
    beta = np.linalg.solve(sigma_others_others, sigma_j_others)  # regression coefficients
    cond_var = sigma_jj - sigma_j_others @ beta                   # conditional variance

    # Conditional mean for each observation: mu_j + beta @ (x_{-j} - mu_{-j})
    x_others = X[:, others]
    cond_means = mean[j] + (x_others - mean[others]) @ beta  # (n,)

    # Sample from conditional distribution
    X_cond = X.copy()
    X_cond[:, j] = cond_means + rng.normal(0, np.sqrt(max(cond_var, 0)), n)

    return X_cond

To use the conditional sampler with `compute_pfi`, we need to wrap it so it has the same signature. We estimate the mean and covariance from the training data.

In [None]:
# Estimate mean and covariance from training data
estimated_mean = np.mean(X_train, axis=0)
estimated_cov = np.cov(X_train, rowvar=False)

print("Estimated covariance matrix:")
print(np.round(estimated_cov, 3))
print("\nNotice: X1-X2 are highly correlated, X3-X4 are highly correlated,")
print("but the two pairs are independent of each other.")


def conditional_sampler_wrapper(X, feature_idx, rng=None):
    """Wrapper so conditional_sampler has the same signature as permutation_sampler."""
    return conditional_sampler(X, feature_idx, rng=rng,
                               mean=estimated_mean, cov=estimated_cov)

**Tasks:**

1. Compute **Conditional Feature Importance (CFI)** for all features using `compute_pfi` with the `conditional_sampler_wrapper`.
2. Create a side-by-side bar chart comparing PFI and CFI for all features.
3. (Optional) Create scatterplots of $(X_1, X_2)$ and $(X_3, X_4)$ after conditional resampling and compare to the permutation scatterplots from Exercise 2.
4. Interpret the results:
   - How do PFI and CFI differ? Why?
   - Recall the three types of features in the DGP: $X_1$ (directly relevant), $X_2$ (indirectly relevant), $X_3$/$X_4$ (completely irrelevant). How does CFI treat each of these?
   - What does this mean for drawing scientific conclusions from feature importance scores?

In [None]:
# YOUR CODE HERE
# Step 1: Compute CFI for all features



# Step 2: Create a side-by-side comparison bar chart (PFI vs CFI)



In [None]:
# YOUR CODE HERE (optional)
# Step 3: Scatterplots of (X1, X2) and (X3, X4) after conditional resampling



**Your interpretation (double-click to edit):**

- How do PFI and CFI differ? Why?
  - *Your answer here*

- How does CFI treat each type of feature?
  - *Your answer here*

- What does this mean for scientific inference?
  - *Your answer here*