# Risk Invariance Tests

In this tutorial, we demonstrate how to use `tofi` to implement risk invariance tests (RITs) for feature importance. These tests include leave-one-covariate-out (LOCO) and plug-in estimation (PIE).

To install `tofi` via `pip`, please use:
```
pip install pytofi
```
The current tutorial also uses `plotly` for plotting. It can be installed via
```
pip install plotly
```

## `Heart` Example

The `heart` data set from the book [ISLP](https://www.statlearning.com/) will be used for demonstration. The data set lets us build a classifier for `AHD` based on 13 features. Note that the `heart` data set contains categorical features and missing values.

In [1]:
import pandas as pd
heart = pd.read_csv(
    "https://trevorhastie.github.io/ISLR/Heart.csv", 
    index_col = 0)
heart.info()

<class 'pandas.core.frame.DataFrame'>
Index: 303 entries, 1 to 303
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Age        303 non-null    int64  
 1   Sex        303 non-null    int64  
 2   ChestPain  303 non-null    object 
 3   RestBP     303 non-null    int64  
 4   Chol       303 non-null    int64  
 5   Fbs        303 non-null    int64  
 6   RestECG    303 non-null    int64  
 7   MaxHR      303 non-null    int64  
 8   ExAng      303 non-null    int64  
 9   Oldpeak    303 non-null    float64
 10  Slope      303 non-null    int64  
 11  Ca         299 non-null    float64
 12  Thal       301 non-null    object 
 13  AHD        303 non-null    object 
dtypes: float64(2), int64(9), object(3)
memory usage: 35.5+ KB


To prepare the data for learning, we convert the categorical features to the `category` type and split the data set into a training set and a test set.

In [2]:
import numpy as np
from sklearn.model_selection import train_test_split
heart["ChestPain"] = heart["ChestPain"].astype("category")
heart["Thal"] = heart["Thal"].astype("category")
X = heart.drop("AHD", axis = 1)
y = heart["AHD"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size = .5,
    random_state = 48)

We train a classifier using `HistGradientBoostingClassifier`, which can handle missing values and categorical features natively. An optimal `max_iter` is chosen via `GridSearchCV` from `[100, 200, 300, 400]`. The best model yields about 80% correct predictions on the test data set.

In [3]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score
learner = GridSearchCV(
    estimator = HistGradientBoostingClassifier(
        categorical_features = "from_dtype"), 
    param_grid = {
        "max_iter": [100, 200, 300, 400]})
_ = learner.fit(
    X_train, y_train)
print("Accuracy:",
      accuracy_score(
          y_test, 
          learner.predict(X_test)))

Accuracy: 0.8026315789473685


To see which feature is useful in prediction, we calculate the permutation feature importances. The importance values show that `"Ca"` seems to be the most important. 

In [4]:
from sklearn.inspection import permutation_importance
pfi = permutation_importance(
    learner, 
    X_test, 
    y_test, 
    scoring = "neg_log_loss",
    n_repeats = 50)

In [5]:
import plotly.express as px
fig = px.bar(
    x = X.columns, 
    y = pfi.importances_mean,
    template = "simple_white",
    width=600, 
    height=300)
fig.show()
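
To make the quantity behind the bar chart concrete, here is a minimal sketch of what a permutation importance measures under the log loss: the average increase in loss after one column is shuffled. The toy model `toy_predict_proba`, the synthetic data, and the helper names are assumptions made for illustration; they are not part of `tofi` or of the `heart` analysis.

```python
import numpy as np

def log_loss_mean(y, p, eps=1e-12):
    """Average binary log loss for labels y in {0, 1} and probabilities p."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def toy_predict_proba(X):
    """Hypothetical fitted model: uses column 0 only and ignores column 1."""
    return 1.0 / (1.0 + np.exp(-2.0 * X[:, 0]))

def permutation_importance_sketch(predict_proba, X, y, n_repeats=50, seed=0):
    """Mean increase in log loss after shuffling each column in turn."""
    rng = np.random.default_rng(seed)
    baseline = log_loss_mean(y, predict_proba(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its link with y and the other columns
            X_perm[:, j] = rng.permutation(X_perm[:, j])
            increases.append(log_loss_mean(y, predict_proba(X_perm)) - baseline)
        importances[j] = np.mean(increases)
    return importances

# Synthetic data in which only column 0 carries information about y
rng = np.random.default_rng(48)
X_toy = rng.normal(size=(500, 2))
y_toy = (rng.random(500) < toy_predict_proba(X_toy)).astype(int)
pfi_toy = permutation_importance_sketch(toy_predict_proba, X_toy, y_toy)
print(pfi_toy)
```

Column 1 receives an importance of exactly zero here because the toy model never reads it, while shuffling column 0 clearly raises the loss. Note that a permutation importance is a descriptive score; it does not come with a significance test, which is the gap the RITs below address.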

## RIT Overview

Let $\widehat{f}$ denote a learner that predicts $y$ based on $x$. Its risk is defined as
$$
R(f_0) = \text{E}[D(f_0;Z)],
$$
where $f_0$ is the population counterpart of $\widehat{f}$, $D$ is the estimation criterion used to train $\widehat{f}$, and $Z$ is a random sample that is independent of $\widehat{f}$. The risk evaluates the predictive error of $f_0$ in the long run. To evaluate the importance of $x_j$, we consider a competitor, denoted by $\widehat{g}_j$, that predicts $y$ based only on $x_{-j}$, the features other than $x_j$. In most cases, $\widehat{g}_j$ is trained with the same ML algorithm as $\widehat{f}$. The significance of $x_j$ in predicting $y$ can then be tested by checking the risk invariance (RI) of $f_0$ and $g_{0,j}$, where $g_{0,j}$ is the population counterpart of $\widehat{g}_j$:
$$
\begin{aligned}
&H_0:R(f_0) = R(g_{0,j}),\\
&H_A:R(f_0) < R(g_{0,j}).
\end{aligned}
$$
It can be shown that the conditional independence (CI) of $x_j$ and $y$ given $x_{-j}$ implies risk invariance. The converse is also true if the estimation criterion is proportional to the negative log-likelihood (Dai, Shen, & Pan, 2022). Although CI and RI are generally not equivalent, RI can still provide useful information about the relationship between $x_j$ and $y$.
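
The two hypotheses can be illustrated with a small simulation in which the population predictors are known. This is only a sketch under stated assumptions: the data-generating model (a logistic model in `x1` with coefficient 2, and an irrelevant `x2`) and all names are invented for illustration. Under the log loss, the population predictor is the conditional probability of $y$ given the features it may use.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def log_loss_mean(y, p):
    """Average binary log loss; a Monte Carlo estimate of the risk."""
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

rng = np.random.default_rng(0)
n = 100_000
x1 = rng.normal(size=n)   # informative feature
x2 = rng.normal(size=n)   # independent of y given x1
p_true = sigmoid(2.0 * x1)
y = (rng.random(n) < p_true).astype(int)

# f_0 uses (x1, x2): P(y = 1 | x1, x2) = sigmoid(2 * x1)
risk_f = log_loss_mean(y, p_true)

# g_{0,2} drops x2: P(y = 1 | x1) is the same function, so H_0 holds for x2
risk_g2 = log_loss_mean(y, p_true)

# g_{0,1} drops x1: P(y = 1 | x2) = 0.5 by symmetry, so H_A holds for x1
risk_g1 = log_loss_mean(y, np.full(n, 0.5))

print(risk_f, risk_g2, risk_g1)
```

Dropping `x2` leaves the risk unchanged, whereas dropping `x1` raises it to $\log 2$, the risk of an uninformative predictor.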


## LOCO

The leave-one-covariate-out (LOCO; Lei et al., 2018) method directly compares the empirical risks of $\widehat{f}$ and $\widehat{g}_j$ on a test set $Z_\text{test}$. In particular, the LOCO method considers the test statistic:
$$
t_j^\text{LOCO} = D(\widehat{g}_j; Z_\text{test}) - D(\widehat{f}; Z_\text{test}).
$$
When the estimation criterion is constructed as a sum of loss values (i.e., $L[y_n,\widehat{f}(x_n)]$), the test statistic can be regarded as a sum of paired differences (i.e., $L[y_n,\widehat{g}_j(x_n)] - L[y_n,\widehat{f}(x_n)]$). So we can use the paired-sample t-test to make inferences.
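
The paired-difference idea can be sketched in a few lines. The code below is an illustration, not the `tofi` implementation: the function names, the synthetic predictions, and the use of a one-sided normal approximation to the paired $t$ statistic (close to the $t$ distribution for samples of this size) are all assumptions.

```python
import numpy as np
from statistics import NormalDist

def per_sample_log_loss(y, p, eps=1e-12):
    """Log loss of each observation for labels y in {0, 1}."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def paired_difference_test(loss_competitor, loss_learner):
    """One-sided test of H_0: equal risks, based on paired loss differences."""
    d = loss_competitor - loss_learner
    estimate = float(d.mean())
    std_error = float(d.std(ddof=1) / np.sqrt(d.size))
    z = estimate / std_error
    # Large positive z supports H_A: the competitor has the larger risk
    p_value = 1.0 - NormalDist().cdf(z)
    return estimate, std_error, p_value

# Synthetic test set: the learner uses an informative feature x,
# the competitor (which has lost x) can only predict 0.5
rng = np.random.default_rng(1)
n = 300
x = rng.normal(size=n)
p_learner = 1.0 / (1.0 + np.exp(-2.0 * x))
y = (rng.random(n) < p_learner).astype(int)
p_competitor = np.full(n, 0.5)

estimate, std_error, p_value = paired_difference_test(
    per_sample_log_loss(y, p_competitor),
    per_sample_log_loss(y, p_learner))
print(estimate, std_error, p_value)
```

Working with the mean rather than the sum of the differences only rescales the statistic; the standardized value and the $p$-value are unaffected.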

To implement LOCO, a competitor must be trained first. In this example, we leave out the feature with the largest PFI value, the `Ca` feature:

In [6]:
removal = "Ca"
competitor = GridSearchCV(
    estimator = HistGradientBoostingClassifier(
        categorical_features = "from_dtype"), 
    param_grid = {
        "max_iter": [100, 200, 300, 400]})
_ = competitor.fit(
    X_train.drop(removal, axis = 1), 
    y_train)

Now `RIT` can be used to construct an inferer that conducts LOCO. To initialize a `RIT`, we need to specify `learner`, `competitor`, `removal`, and `method`. In this classification example, the loss function is `log_loss` by default. Because the specified `removal` is a column name, the argument `X` of the `infer` method must be a `DataFrame`. After inferring, the result can be summarized via the `summarize` method:

In [7]:
from tofi import RIT
loco = RIT(
    learner, 
    competitor = competitor,
    removal = removal,
    method = "LOCO")
_ = loco.infer(X_test, y_test)
loco.summarize()

Inferer Summary (cross_fit=False, combine=False)
 + Method: LOCO (double_split=False, perturb_size=None)
 + Null Distribution: Normality (n_copies=None, n_permutations=None)
 + Loss Function: Log Loss (reverse=False)


| removal | size | estimate | std_error | p_value |
|---|---|---|---|---|
| Ca | 152 | 0.341922 | 0.09251 | 0.000109 |
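
As a quick sanity check, the reported $p$-value can be recovered from the reported `estimate` and `std_error`. The sketch below assumes that the `Normality` option corresponds to a one-sided standard normal approximation of the ratio `estimate / std_error`; this is inferred from the numbers in the summary, not taken from the `tofi` source.

```python
from statistics import NormalDist

# Values copied from the summary for the "Ca" feature
estimate, std_error = 0.341922, 0.09251

z = estimate / std_error
# Upper-tail probability, matching H_A: R(f_0) < R(g_{0,j})
p_value = 1.0 - NormalDist().cdf(z)
print(round(z, 3), p_value)
```

The result agrees with the reported value up to the rounding of the inputs.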


By default, the LOCO method uses `normality` to approximate the null distribution. Another choice here is `permutation`, which randomly shuffles the order within each paired sample to obtain a null value. We can see that the two approaches yield nearly the same test results:

In [8]:
loco = RIT(
    learner, 
    competitor,
    removal,
    "LOCO",
    null_dist = "permutation")
_ = loco.infer(X_test, y_test)
loco.summarize()

Inferer Summary (cross_fit=False, combine=False)
 + Method: LOCO (double_split=False, perturb_size=None)
 + Null Distribution: Permutation (n_copies=None, n_permutations=2000)
 + Loss Function: Log Loss (reverse=False)


| removal | size | estimate | std_error | p_value |
|---|---|---|---|---|
| Ca | 152 | 0.341922 | 0.099461 | 0.0 |
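
Swapping the order of a pair of losses flips the sign of their difference, so a permutation null for paired data can be sketched as a random sign-flip test. The code below is a minimal illustration with invented names and synthetic differences; it is not the `tofi` implementation. In particular, the add-one correction is a choice made here so that the $p$-value is never exactly zero, whereas the summary above reports `0.0`.

```python
import numpy as np

def sign_flip_p_value(diffs, n_permutations=2000, seed=0):
    """One-sided p-value from randomly flipping the sign of each paired difference."""
    rng = np.random.default_rng(seed)
    observed = diffs.mean()
    # Each row is one permutation: every pair keeps or swaps its order
    signs = rng.choice([-1.0, 1.0], size=(n_permutations, diffs.size))
    null_values = (signs * diffs).mean(axis=1)
    # Add-one correction (an assumption of this sketch)
    return float((1 + np.sum(null_values >= observed)) / (1 + n_permutations))

rng = np.random.default_rng(2)

# Differences with a clearly positive mean: the removed feature matters
d_pos = rng.normal(loc=0.5, scale=1.0, size=200)
p_pos = sign_flip_p_value(d_pos)

# Differences that are symmetric around zero: no evidence against H_0
half = rng.normal(size=100)
d_null = np.concatenate([half, -half])
p_null = sign_flip_p_value(d_null)

print(p_pos, p_null)
```

With 2000 permutations, the smallest attainable value under this correction is 1/2001, which is a reminder that a permutation $p$-value of zero really means "smaller than the resolution of the permutation sample".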


Unlike conditional independence tests (CITs), which can only remove one feature, RITs can remove several features at a time. For example, the following code removes `"Sex"`, `"ChestPain"`, and `"Thal"` and tests the joint significance of the three features.

In [9]:
removal = ["Sex", "ChestPain", "Thal"]
competitor = GridSearchCV(
    estimator = HistGradientBoostingClassifier(
        categorical_features = "from_dtype"), 
    param_grid = {
        "max_iter": [100, 200, 300, 400]})
_ = competitor.fit(
    X_train.drop(removal, axis = 1), 
    y_train)
loco = RIT(
    learner, 
    competitor = competitor,
    removal = removal,
    method = "LOCO")
_ = loco.infer(X_test, y_test)
loco.summarize()

Inferer Summary (cross_fit=False, combine=False)
 + Method: LOCO (double_split=False, perturb_size=None)
 + Null Distribution: Normality (n_copies=None, n_permutations=None)
 + Loss Function: Log Loss (reverse=False)


| removal | size | estimate | std_error | p_value |
|---|---|---|---|---|
| Sex + ChestPain + Thal | 152 | 0.075074 | 0.043099 | 0.040762 |


## PIE

The LOCO test is a conceptually simple method for statistical inference. However, the normality approximation might not be correct under the null hypothesis (see Dai, Shen, & Pan, 2022; Williamson et al., 2023). Two strategies are available to handle this problem. The first is to add a random error to $t_j^\text{LOCO}$, which can be done by setting the `perturb_size` argument. The second is to use double splitting. By default, the plug-in estimation (PIE; Williamson et al., 2023) test adopts the second approach by setting `double_split = True`. PIE is implemented in the same way as LOCO, except that double splitting further divides the test set into set A and set B: the former is used to estimate the risk of $\widehat{g}_j$ and the latter is used to estimate the risk of $\widehat{f}$. The resulting test statistic is:
$$
t_j^\text{PIE} = D(\widehat{g}_j; Z_\text{test}^A) - D(\widehat{f}; Z_\text{test}^B).
$$
Because set A and set B are disjoint, an independent two-sample $t$-test can be used to compute a $p$-value.
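
The double-splitting idea can be sketched as follows. The function name, the half-and-half split, the synthetic losses, and the one-sided normal approximation are assumptions made for illustration; the actual splitting scheme used by `tofi` may differ.

```python
import numpy as np
from statistics import NormalDist

def double_split_test(loss_competitor, loss_learner, seed=0):
    """Compare the competitor's risk on set A with the learner's risk on set B."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(loss_learner.size)
    half = idx.size // 2
    set_a, set_b = idx[:half], idx[half:]   # disjoint halves of the test set
    g_a = loss_competitor[set_a]
    f_b = loss_learner[set_b]
    estimate = float(g_a.mean() - f_b.mean())
    # Independent samples: the variances of the two means add up
    std_error = float(np.sqrt(g_a.var(ddof=1) / g_a.size
                              + f_b.var(ddof=1) / f_b.size))
    p_value = 1.0 - NormalDist().cdf(estimate / std_error)
    return estimate, std_error, p_value

# Synthetic per-sample losses: the competitor is slightly worse on every case
rng = np.random.default_rng(3)
n = 152
loss_f = rng.exponential(scale=0.5, size=n)
loss_g = loss_f + rng.normal(loc=0.05, scale=0.05, size=n)

estimate, std_error, p_value = double_split_test(loss_g, loss_f)

# Standard error of the paired (LOCO-style) comparison, for reference
d = loss_g - loss_f
se_paired = float(d.std(ddof=1) / np.sqrt(n))
print(std_error, se_paired)
```

Because the two halves no longer share observations, the strong positive correlation between the two sets of losses cannot cancel out, so the standard error is typically much larger than in the paired comparison.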

In the following code, we again use the competitor that removes `"Sex"`, `"ChestPain"`, and `"Thal"` for demonstration. The result shows different values of `estimate` and `std_error`. In general, PIE yields a larger `std_error` than LOCO because each risk is estimated from only part of the test data. Hence, PIE is less powerful in detecting significant features.

In [10]:
pie = RIT(
    learner, 
    competitor,
    removal,
    "PIE")
_ = pie.infer(X_test, y_test)
pie.summarize()

Inferer Summary (cross_fit=False, combine=False)
 + Method: PIE (double_split=True, perturb_size=None)
 + Null Distribution: Normality (n_copies=None, n_permutations=None)
 + Loss Function: Log Loss (reverse=False)


| removal | size | estimate | std_error | p_value |
|---|---|---|---|---|
| Sex + ChestPain + Thal | 152 | 0.137485 | 0.12635 | 0.138268 |


## References

Dai, B., Shen, X., & Pan, W. (2022). Significance tests of feature relevance for a black-box learner. IEEE Transactions on Neural Networks and Learning Systems, PP, 1-14. doi: 10.1109/TNNLS.2022.3185742

Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., & Wasserman, L. (2018). Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523), 1094–1111. doi: 10.1080/01621459.2017.1307116

Williamson, B. D., Gilbert, P. B., Simon, N. R., & Carone, M. (2023). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 118(543), 1645–1658. doi: 10.1080/01621459.2021.2003200