# Interpretable Machine Learning
## Exercise Sheet 5: Feature Importance
### Presentation date: 04.12.2023

# Exercise 1: Permutation feature importance (PFI)

Permutation Feature Importance is one of the oldest and most widely used IML techniques. It is defined as
\begin{align}
\widehat{P F I}_S = \frac{1}{m} \sum_{k=1}^m \mathcal{R}_{\text{emp}} (\hat{f},\tilde{\mathcal{D}}_{(k)}^S)-
\mathcal{R}_{\text{emp}} (\hat{f},\mathcal{D})
\end{align}

where $\tilde{\mathcal{D}}_{(k)}^S$ is the dataset in which the features $S$ were replaced with a perturbed version that preserves the variables' marginal distribution $P(X_S)$. We can approximate sampling from the marginal distribution by randomly permuting
the original feature's observations.

**a)** PFI has been criticized for evaluating the model on unrealistic observations.

**1.** Describe in a few words why this extrapolation happens. 

**2.** Think of an illustrative example.

**3.** Under a (seldom realistic) assumption PFI does not suffer from the extrapolation issue. What is that
assumption? Briefly explain why.

**b)** As in the previous exercise sheet 4, we use the [students' performance](https://archive.ics.uci.edu/dataset/320/student+performance) dataset from the UCI Machine Learning Repository.

**1.** As in the **SHAP values** exercise, load `student-mat.csv` and fit a random forest classifier. (You can copy your solution to a) and b) from the last exercise.)

**2.** Calculate the permutation feature importance (PFI) using an existing Python package, for example `sklearn.inspection`.

**3.** Visualize the PFI results, i.e. plot the mean variable importances and the corresponding standard deviations.
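For b) 2. and 3., a minimal sketch of how `permutation_importance` from `sklearn.inspection` could be used is given below. The synthetic data, the column names `x0` to `x3` and the helper `plot_pfi` are assumptions for illustration only; they stand in for the student data and your fitted model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the student data (assumption): only x0 and x1 drive y.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(400, 4)), columns=["x0", "x1", "x2", "x3"])
y = (X["x0"] + 0.5 * X["x1"] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# n_repeats plays the role of m in the definition above.
# Without a scoring argument, the estimator's default score (accuracy) is used.
result = permutation_importance(model, X_test, y_test, n_repeats=20, random_state=0)

# Collect mean and standard deviation per feature, most important first.
pfi = pd.DataFrame(
    {"mean": result.importances_mean, "std": result.importances_std},
    index=X_test.columns,
).sort_values("mean", ascending=False)
print(pfi)


def plot_pfi(pfi, top_k=10):
    """Horizontal bar plot of the top_k mean importances with std error bars."""
    import matplotlib.pyplot as plt  # imported lazily so the rest runs headless

    top = pfi.head(top_k).iloc[::-1]  # reverse so the largest bar is on top
    fig, ax = plt.subplots()
    ax.barh(top.index, top["mean"], xerr=top["std"])
    ax.set_xlabel("Decrease in accuracy after permutation")
    fig.tight_layout()
    return fig
```

Calling `plot_pfi(pfi)` returns a matplotlib figure. Note that scikit-learn reports the drop in score (baseline minus permuted), which matches the increase in risk in the definition when the score is "higher is better".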

**c)** Interpret PFI.

**1.** Interpret the PFI result from b). What insight into model and data do we gain?

- Which features are (mechanistically) used by the model for its prediction?
- Which features are (in)dependent of $Y$?
- Which features are (in)dependent of the remaining covariates?
- Which features are dependent on $Y$, given all covariates?

**2.** Compare your results with the SHAP bar plot from *g)* of the last exercise. For a better visualisation, sort the features by importance and plot only the 10 most important ones. Do both methods detect the same important features? What is the big difference between SHAP feature importance and PFI?

**d)** Implement PFI yourself and apply it to the `student-mat.csv` dataset. Compare your results with the plot from b). Since we have so many features, restrict your plot to the 10 most important features.

In order to make your code reusable for the upcoming exercises, break down the implementation into three functions:

- `pfi_fname` which returns the PFI for a feature `fname`
- `fi` a function that computes the importances for all features using a single-feature importance function
such as `pfi_fname`
- `n_times` a function that repeats the computation $n$ times and returns mean and standard deviation of
the importance values
 
Hint: By passing the single-feature importance function as an argument, you can reuse `fi` and `n_times` later
on for other feature importance methods and only have to replace the single-feature function (here `pfi_fname`). In order to allow for
different function signatures you may use `f(*args, **kwargs)` in Python (more info [here](https://realpython.com/python-kwargs-and-args/)) and `f(...)` in
R (more info [here](https://stackoverflow.com/questions/8165837/how-to-pass-a-function-and-its-arguments-through-a-wrapper-function-in-r-simila)).
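The three-function design and the argument forwarding described above could look roughly like the following sketch. It is one possible solution, not the reference implementation. Assumptions: `score` is a loss of the form `score(y_true, y_pred)` where lower is better, the `rng` keyword is added here for reproducibility, the repetitions run sequentially instead of in parallel, and the toy linear model and data are made up for the demonstration.

```python
import numpy as np
import pandas as pd


def pfi_fname(fname, predict, score, X_test, y_test, *args, rng=None, **kwargs):
    """PFI of one feature: loss on permuted data minus loss on original data."""
    rng = np.random.default_rng() if rng is None else rng
    X_perm = X_test.copy()
    # Permuting the column keeps its marginal distribution but breaks
    # its dependence on the other features and on y.
    X_perm[fname] = rng.permutation(X_test[fname].to_numpy())
    return score(y_test, predict(X_perm)) - score(y_test, predict(X_test))


def fi(fi_fname, predict, score, X_test, y_test, *args, **kwargs):
    """Apply a single-feature importance function to every column of X_test."""
    values = [
        fi_fname(f, predict, score, X_test, y_test, *args, **kwargs)
        for f in X_test.columns
    ]
    return pd.Series(values, index=X_test.columns)


def n_times(n, method, *args, return_raw=False, **kwargs):
    """Repeat method n times and return mean and std (and optionally raw runs)."""
    # All extra arguments are forwarded unchanged to method.
    raw = pd.DataFrame([method(*args, **kwargs) for _ in range(n)])
    if return_raw:
        return raw.mean(), raw.std(), raw
    return raw.mean(), raw.std()


# Toy demonstration (assumption): linear model fitted by least squares.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(500, 3)), columns=["x0", "x1", "x2"])
y = 2.0 * X["x0"] + 0.5 * X["x1"] + rng.normal(scale=0.1, size=500)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

coef, *_ = np.linalg.lstsq(X_train.to_numpy(), y_train.to_numpy(), rcond=None)
predict = lambda data: data.to_numpy() @ coef
mse = lambda y_true, y_pred: float(np.mean((np.asarray(y_true) - y_pred) ** 2))

mean_fi, std_fi = n_times(10, fi, pfi_fname, predict, mse, X_test, y_test, rng=rng)
print(mean_fi.sort_values(ascending=False))
```

Because `n_times` and `fi` only forward their arguments, another single-feature method with a different signature can be plugged in without touching either of them.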

In [1]:
def pfi_fname(fname, predict, score, X_test, y_test, *args):
    """Function that returns the pfi for a single feature.
    Args:
    fname: feature of interest name
    predict: prediction function
    score: performance metric
    X_test: data for the evaluation
    y_test: respective labels
    *args: further arguments (which are ignored)
    Returns:
    performance: performance metric
    """

In [2]:
def fi_naive(fi_fname, predict, score, X_test, y_test, *args, **kwargs):
    """Naive feature importance implementation.
    Args:
    fi_fname: function that returns the importance of a single feature (e.g. pfi_fname)
    predict: prediction function
    score: performance metric
    X_test: test data for the evaluation
    y_test: respective labels
    *args: further arguments, e.g. training data (can be ignored here)
    Returns:
    results: relevance for each feature (in the order of X_test.columns)
    """

In [3]:
import multiprocess as mp
def n_times(n, method, *args, return_raw=False, **kwargs):
    """Parallelized implementation for the repeated evaluation of fi.
    Args:
    n: number of repetitions
    method: feature importance method.
    args: all further arguments that are required for the method
    return_raw: whether only the aggregation (mean, std) or also the raw results are returned
    Returns:
    mean_fi, std_fi, (raw results)
    """

**e)** 
**1.** Plot the correlation structure of the data. What insight into the relationship of the features with $y$ do we gain by looking at the correlation structure of the covariates in addition to the PFI? 

**2.** In which two variables is the extrapolation issue most prominent in this dataset?
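For e) 1., the correlation structure can be computed with `DataFrame.corr` and shown as a heatmap. The sketch below uses made-up data and a hypothetical helper `plot_corr`; for the student data, categorical columns would first have to be encoded numerically (an assumption about your preprocessing).

```python
import numpy as np
import pandas as pd

# Synthetic stand-in (assumption): x1 is a noisy copy of x0, x2 is independent.
rng = np.random.default_rng(0)
x0 = rng.normal(size=300)
df = pd.DataFrame({
    "x0": x0,
    "x1": x0 + rng.normal(scale=0.3, size=300),
    "x2": rng.normal(size=300),
})
df["y"] = df["x0"] + rng.normal(scale=0.5, size=300)

# Pairwise Pearson correlations of features and target.
corr = df.corr()
print(corr.round(2))

# Rank feature pairs by absolute correlation. Strongly dependent pairs are
# where permuting one feature creates the most unrealistic combinations.
features = corr.drop(index="y", columns="y")
upper = np.triu(np.ones(features.shape, dtype=bool), k=1)  # keep each pair once
pairs = features.where(upper).stack().dropna().abs().sort_values(ascending=False)
print(pairs.head())


def plot_corr(corr):
    """Heatmap of a correlation matrix with a fixed colour scale from -1 to 1."""
    import matplotlib.pyplot as plt  # imported lazily so the rest runs headless

    fig, ax = plt.subplots()
    image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
    ax.set_xticks(range(len(corr.columns)))
    ax.set_xticklabels(corr.columns, rotation=90)
    ax.set_yticks(range(len(corr.index)))
    ax.set_yticklabels(corr.index)
    fig.colorbar(image, ax=ax)
    fig.tight_layout()
    return fig
```

Pearson correlation only captures linear dependence, so a low value in this matrix does not guarantee independence.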

# Exercise 2: Conditional sampling based feature importance techniques

Conditional Feature Importance has been suggested as an alternative to Permutation Feature Importance.

**a)** Implement a linear Gaussian conditional sampler. For conditional feature importance the sampler must be
able to learn Gaussian conditionals with a multivariate conditioning set and a univariate target.
Hint: For multivariate Gaussian data, the conditional distributions can be derived analytically from the mean
vector and the covariance matrix, see [here](https://en.wikipedia.org/wiki/Multivariate_normal_distribution#Conditional_distributions).

**1.** Given the decomposition of the covariance matrix as

$\begin{align}
\Sigma= \begin{bmatrix}
\Sigma_{11} & \Sigma_{12}\\
\Sigma_{21} & \Sigma_{22}
\end{bmatrix} \qquad \text{with sizes} \qquad 
\begin{bmatrix}
q \times q & q \times (N-q)\\
(N-q) \times q & (N-q) \times (N-q)
\end{bmatrix}
\end{align}$

the distribution of $X_1$ conditional on $X_2 = a$ is the multivariate normal $\mathcal{N} (\bar{\mu}, \bar{\Sigma})$ with

$\begin{align}
\bar{\mu} &= \mu_1 + \Sigma_{12} \Sigma_{22}^{-1}(a-\mu_2)\\
\bar{\Sigma} &= \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}
\end{align}$

As the target here is univariate, $q = 1$ holds. Write a function that returns the conditional mean and
covariance structure given specific values for the conditioning set.

**2.** Then write a function that takes the conditional mean and covariance structure and samples
from the respective (multivariate) Gaussian.
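A minimal sketch of such a sampler for the univariate-target case ($q = 1$) is shown below. The function names `fit_conditional_gaussian` and `sample_conditional`, their signatures and the toy data are assumptions for illustration; mean vector and covariance matrix are simply estimated from the data.

```python
import numpy as np


def fit_conditional_gaussian(X, target, cond):
    """Estimate mu and Sigma from X and return a function a -> (mean, variance).

    X: array of shape (n, d); target: column index of the univariate target;
    cond: list of column indices forming the conditioning set.
    """
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    mu1, mu2 = mu[target], mu[cond]
    S11 = Sigma[target, target]
    S12 = Sigma[target, cond]            # shape (len(cond),)
    S22 = Sigma[np.ix_(cond, cond)]
    # w solves S22 w = S21, so (a - mu2) @ w equals S12 S22^{-1} (a - mu2).
    w = np.linalg.solve(S22, S12)
    cond_var = S11 - S12 @ w             # does not depend on a

    def conditional(a):
        a = np.atleast_2d(a)             # one row per observation
        return mu1 + (a - mu2) @ w, cond_var

    return conditional


def sample_conditional(cond_mean, cond_var, rng):
    """Draw one sample per conditional mean from N(cond_mean, cond_var)."""
    return rng.normal(loc=cond_mean, scale=np.sqrt(cond_var))


# Toy check (assumption): x1 = 0.8 * x0 + noise with standard deviation 0.6,
# so x1 | x0 = a should be roughly N(0.8 * a, 0.36).
rng = np.random.default_rng(0)
x0 = rng.normal(size=5000)
x1 = 0.8 * x0 + rng.normal(scale=0.6, size=5000)
X = np.column_stack([x0, x1])

conditional = fit_conditional_gaussian(X, target=1, cond=[0])
cond_mean, cond_var = conditional(X[:, [0]])
x1_tilde = sample_conditional(cond_mean, cond_var, rng)
print(conditional([[1.0]]))
```

Solving the linear system with `np.linalg.solve` avoids forming $\Sigma_{22}^{-1}$ explicitly, which is usually the numerically safer choice.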

**b)** Using your sampler, write a function that computes CFI. You may assume that the data is multivariate
Gaussian.

**c)** Apply CFI to the dataset and model from Exercise 1. Interpret the result: which insights into model and
data are possible? Compare the result with PFI.
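To make the difference between CFI and PFI concrete, here is a self-contained sketch on synthetic data. Everything in it is an assumption for illustration: the helper names, the hand-specified prediction function standing in for a fitted model, the use of mean squared error as the loss, and fitting the sampler on the same data it is applied to.

```python
import numpy as np


def conditional_sample(X, j, rng):
    """Resample column j from its linear Gaussian conditional given all others."""
    rest = [k for k in range(X.shape[1]) if k != j]
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    w = np.linalg.solve(Sigma[np.ix_(rest, rest)], Sigma[rest, j])
    cond_mean = mu[j] + (X[:, rest] - mu[rest]) @ w
    cond_var = Sigma[j, j] - Sigma[j, rest] @ w
    return rng.normal(cond_mean, np.sqrt(max(cond_var, 0.0)))


def cfi(predict, loss, X, y, rng, n_repeats=10):
    """Mean loss increase when each feature is replaced by a conditional sample."""
    base = loss(y, predict(X))
    out = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_tilde = X.copy()
            X_tilde[:, j] = conditional_sample(X, j, rng)
            out[j] += loss(y, predict(X_tilde)) - base
    return out / n_repeats


# Synthetic data (assumption): x1 is almost a copy of x0, x2 is independent.
rng = np.random.default_rng(0)
n = 1000
x0 = rng.normal(size=n)
x1 = x0 + rng.normal(scale=0.1, size=n)
x2 = rng.normal(size=n)
X = np.column_stack([x0, x1, x2])
y = x0 + x2 + rng.normal(scale=0.1, size=n)

# Hand-specified "model" that uses x1 and x2 but never x0.
predict = lambda data: data[:, 1] + data[:, 2]
mse = lambda y_true, y_pred: float(np.mean((y_true - y_pred) ** 2))

cfi_values = cfi(predict, mse, X, y, rng)

# PFI of x1 for comparison: marginal permutation instead of conditional sampling.
X_perm = X.copy()
X_perm[:, 1] = rng.permutation(X[:, 1])
pfi_x1 = mse(y, predict(X_perm)) - mse(y, predict(X))

print("CFI:", cfi_values.round(3))
print("PFI of x1:", round(pfi_x1, 3))
```

In this setup the model relies on `x1`, so its PFI is large, while its CFI stays close to zero because `x1` carries almost no information beyond `x0`.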

# Exercise 3: LOCO

We can also assess the importance of a feature by refitting the model with and without access to the feature of
interest and comparing the respective predictive performances. This method is referred to as
leave-one-covariate-out (LOCO) importance.

**a)** Implement LOCO.

**b)** Apply LOCO to the dataset from Exercise 1 (use a random forest model again).

**c)** Interpret the result (insight into model and data). Compare the result to PFI and CFI.
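One way LOCO could be implemented with scikit-learn is sketched below. The function name `loco`, its signature, the synthetic data and the choice of the misclassification rate (`zero_one_loss`) as loss are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import zero_one_loss


def loco(model, loss, X_train, y_train, X_test, y_test):
    """Loss of a model refitted without feature f minus loss of the full model."""
    full = clone(model).fit(X_train, y_train)
    base = loss(y_test, full.predict(X_test))
    result = {}
    for f in X_train.columns:
        # Refit an unfitted copy of the model on all features except f.
        reduced = clone(model).fit(X_train.drop(columns=f), y_train)
        reduced_pred = reduced.predict(X_test.drop(columns=f))
        result[f] = loss(y_test, reduced_pred) - base
    return pd.Series(result)


# Synthetic stand-in for the student data (assumption): x2 is pure noise.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(600, 3)), columns=["x0", "x1", "x2"])
y = (X["x0"] + 0.5 * X["x1"] > 0).astype(int)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

model = RandomForestClassifier(n_estimators=50, random_state=0)
loco_fi = loco(model, zero_one_loss, X_train, y_train, X_test, y_test)
print(loco_fi.sort_values(ascending=False))
```

Unlike PFI and CFI, every LOCO value requires a full refit, so the cost grows with the number of features.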