# Tutorial 1: Conditional Independence Tests

In this tutorial, we demonstrate how to use the `sial` package to conduct conditional independence tests (CITs) for feature importance. These CITs include the holdout randomization test (HRT), the residual permutation test (RPT), and the conditional predictive impact (CPI).

To install `sial` via `pip`, please use:
```
pip install sial-pkg
```

## `Diabetes` Example

We use the `diabetes` data set as an example. Details of the data set can be found [here](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset).

In [1]:
import numpy as np

In [2]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
diabetes = datasets.load_diabetes(scaled = False)
X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size = .5)

We train a random forest (RF) regressor on the training set and evaluate its performance with mean squared error (MSE) on the test set.

In [3]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
learner = RandomForestRegressor()
_ = learner.fit(
    X_train, y_train)
print("MSE:",
      mean_squared_error(
          y_test, 
          learner.predict(X_test)))

MSE: 3460.9536624434395


## CIT Overview

Let $f(x)$ denote a learner that predicts the value of $y$ based on $x=(x_1,x_2,\ldots,x_P)$. To test the importance of the $j^{th}$ feature $x_j$, conditional independence tests (CITs) consider the following hypothesis testing problem:
$$
\begin{aligned}
&H_0:y \perp x_j|x_{-j}, \\ 
&H_A:y \not\perp x_j|x_{-j},
\end{aligned}
$$
where $x_{-j}$ is the feature vector without $x_j$. The basic idea of CITs is to compare the predictive errors of $f(x)$ and $f(x^*)$, where $x^*$ is a resampled feature vector in which $x_j$ is replaced by $x_j^* \sim \text{Pr}(x_j|x_{-j})$. Because $x_j^*$ is independent of $y$ given $x_{-j}$, if $x_j$ is indeed important in predicting $y$, $f(x^*)$ will result in larger predictive errors than $f(x)$.
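
The comparison described above can be illustrated with a small toy sketch. Everything in it is an assumption made for illustration: the simulated data, the stand-in learner `f` (taken to be the true regression function), and the known Gaussian conditional distribution used to draw $x_2^*$. It does not use `sial`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000

# Toy data: x2 depends on x1, and y depends on both features.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = x1 + 2.0 * x2 + rng.normal(size=n)

def f(x1, x2):
    """Stand-in for a trained learner (here, the true regression function)."""
    return x1 + 2.0 * x2

# Resample x2 from its conditional distribution given x1 (known in this toy setup).
x2_star = 0.5 * x1 + rng.normal(size=n)

# Compare the predictive error with the original and the resampled feature.
mse_original = np.mean((y - f(x1, x2)) ** 2)
mse_resampled = np.mean((y - f(x1, x2_star)) ** 2)
print("MSE with x2: ", mse_original)
print("MSE with x2*:", mse_resampled)
```

Because $x_2$ matters for predicting $y$ here, the MSE with the resampled feature is clearly larger than the MSE with the original one.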


## HRT and RPT

Let $Z=\{(x_n,y_n)\}_{n=1}^N$ denote a random sample. The holdout randomization test (HRT; Tansey et al., 2022) splits $Z$ into a training set and a test set, say $Z_\text{train}$ and $Z_\text{test}$. The training set is used to train $\widehat{f}$ by minimizing an estimation criterion $D(f; Z_\text{train})$. On the other hand, the test set is used to test the significance of $x_j$ by comparing $D(\widehat{f}; Z_\text{test})$ and $D(\widehat{f}; Z_{\text{test},j}^*)$ many times, where $Z_{\text{test},j}^*$ is a resampled version of $Z_{\text{test}}$.
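
One way to turn the repeated comparisons into a $p$-value is sketched below. The helper `resampling_p_value` is hypothetical, the loss values are made up for illustration, and the `+1` correction is one common Monte Carlo convention rather than a statement about how `sial` computes its $p$-value.

```python
import numpy as np

def resampling_p_value(loss_observed, losses_resampled):
    """One-sided Monte Carlo p-value: the share of resampled losses that are
    no larger than the observed loss, with the common +1 correction."""
    losses_resampled = np.asarray(losses_resampled, dtype=float)
    n_extreme = np.sum(losses_resampled <= loss_observed)
    return (1 + n_extreme) / (1 + losses_resampled.size)

# Hypothetical losses, for illustration only.
loss_observed = 3.0  # D(f_hat; Z_test)
losses_resampled = [3.4, 2.9, 3.8, 4.1, 3.6, 3.2, 3.9, 4.4, 3.1]  # D(f_hat; Z*_test,j)

p_value = resampling_p_value(loss_observed, losses_resampled)
print(p_value)  # (1 + 1) / (1 + 9) = 0.2
```

A small $p$-value means that the loss on the original test set is rarely matched by the losses on resampled test sets, which is evidence against $H_0$.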

HRT can be conducted provided that a sampler is available, that is, a full probabilistic model for $x_j$ given $x_{-j}$ (i.e., $\widehat{\text{Pr}}(x_j|x_{-j})$). However, most ML algorithms only estimate a mean function for $x_j$ given $x_{-j}$ (i.e., $\widehat{\text{E}}(x_j|x_{-j})$), which is not sufficient for generating $x_j^*$ if $x_j$ is numeric. Hence, for a numeric $x_j$, the residual permutation test (RPT; Huang, 2024) considers an auxiliary model $x_j=h_j(x_{-j})+\delta_j$ and generates each $x_{nj}^*$ by
$$
x_{nj}^* = \widehat{h}_j(x_{n,-j}) + \text{permute}(\widehat{\delta}_{nj}),
$$
where $\widehat{h}_j$ is trained on $Z_\text{train}$, $\widehat{\delta}_{nj}$ is an empirical residual, and $\text{permute}(\cdot)$ is a permutation operator. 
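
The resampling step in this formula can be sketched with plain NumPy. The toy data, the least-squares fit used as a stand-in for $\widehat{h}_j$, and the choice to compute residuals on the test set are all assumptions made for illustration; `sial` performs this step internally and may differ in detail.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, j = 200, 100, 1

def make_x(n):
    """Toy features: column 1 depends linearly on column 0."""
    x0 = rng.normal(size=n)
    return np.column_stack([x0, 0.8 * x0 + rng.normal(scale=0.5, size=n)])

X_tr, X_te = make_x(n_train), make_x(n_test)

def design(X):
    """Intercept plus all columns except column j."""
    return np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])

# Fit h_j on the training set (least squares as a stand-in for any regressor).
coef, *_ = np.linalg.lstsq(design(X_tr), X_tr[:, j], rcond=None)

# Empirical residuals on the test set, then permute them across rows.
fitted = design(X_te) @ coef
residuals = X_te[:, j] - fitted
x_j_star = fitted + rng.permutation(residuals)

# Resampled test features: only column j is replaced.
X_star = X_te.copy()
X_star[:, j] = x_j_star
```

Permuting the residuals keeps their empirical distribution intact while breaking any link between $x_{nj}^*$ and $y_n$ beyond what $x_{n,-j}$ already explains.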

Now, we use the diabetes data set to demonstrate how to conduct RPT. To evaluate the significance of `bmi`, we train a `sampler` for `bmi` given the other features. Note that the `sampler` need not be trained with the same algorithm as the learner. However, its quality might influence the performance of RPT. Our experience shows that RF is a good choice (Huang, 2024).

In [4]:
removal = 2
sampler = RandomForestRegressor()
_ = sampler.fit(
    np.delete(X_train, removal, axis = 1), 
    X_train[:, removal])

Note that `removal = 2` is the column index of `bmi`. Because `X` is a NumPy array here, `removal` must be the column index of the feature being tested. In other examples, we will see that `removal` is the column name of $x_j$ when `X` is a `DataFrame`. The correspondence between column indices and column names can be shown as:

In [5]:
{index:name for index, name in enumerate(diabetes.feature_names)}

{0: 'age',
 1: 'sex',
 2: 'bmi',
 3: 'bp',
 4: 's1',
 5: 's2',
 6: 's3',
 7: 's4',
 8: 's5',
 9: 's6'}

In `sial`, the `CIT` object can be used to conduct RPT if we set `method` to `"RPT"`. By default, `CIT` uses MSE as the loss function to evaluate the predictive error for regression tasks. The loss function can be specified explicitly via the `loss_func` argument. In the current version, only `"mean_squared_error"` and `"mean_absolute_error"` are supported for regression tasks.

In [6]:
from sial import CIT
rpt = CIT(
    learner, 
    sampler = sampler,
    removal = removal,
    method = "RPT")

Finally, we use the `infer` method to conduct the hypothesis test. Both HRT and RPT construct the null distribution by resampling, which can be requested explicitly with `null_dist = "resampling"`. The number of resampled copies is set via the `n_copies` argument (`n_copies = 2000` by default). The result can then be summarized with the `summarize` method. The `estimate` in the output table can be interpreted as a conditional permutation feature importance (CPFI) for `bmi`: a larger CPFI means that `bmi` contributes more to reducing the MSE. The significance of `bmi` is assessed by the $p$-value.

In [7]:
_ = rpt.infer(X_test, y_test)
rpt.summarize()

Inferer Summary (cross_fit=False, combine=False)
 + Method: RPT (double_split=None, perturb_size=None)
 + Null Distribution: Resampling (n_copies=2000, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | size | estimate | std_error | p_value |
|---|---|---|---|---|
| 2 | 221 | 1025.975402 | 275.858725 | 0.0 |


## CPI

The conditional predictive impact (CPI; Watson & Wright, 2021) is another CIT. The CPI was proposed under the knockoff framework (Barber & Candès, 2015). When only the significance of a single $x_j$ is of interest, $x_j^* \sim \text{Pr}(x_j|x_{-j})$ can be regarded as a knockoff of $x_j$. Like HRT, standard CPI evaluates the difference between $D(\widehat{f}; Z_\text{test})$ and $D(\widehat{f}; Z_{\text{test},j}^*)$, but only once.

Now, we use CPI to evaluate the significance of `sex` in the diabetes data set. Because `sex` is a categorical feature, we train a sampler for it by using an RF classifier. Then $x_j^*$ can be sampled easily by using the `predict_proba` method (which is done inside the `infer` method).

In [8]:
from sklearn.ensemble import RandomForestClassifier
removal = 1
sampler = RandomForestClassifier()
_ = sampler.fit(
    np.delete(X_train, removal, axis = 1), 
    X_train[:, removal])

An inferer for CPI can be constructed by using `CIT` with `method = "CPI"`. Unlike HRT and RPT, CPI by default generates only one copy of $x_j^*$ (`n_copies = 1`) and uses a normal approximation for the null distribution (`null_dist = "normality"`). As a result, CPI is much faster than HRT and RPT. Alternatively, the null distribution can be constructed by permutation (`null_dist = "permutation"`), with the number of permutations specified via `n_permutations`.

In [9]:
cpi = CIT(
    learner,
    sampler = sampler,
    removal = removal,
    method = "CPI")
_ = cpi.infer(X_test, y_test)
cpi.summarize()

Inferer Summary (cross_fit=False, combine=False)
 + Method: CPI (double_split=None, perturb_size=None)
 + Null Distribution: Normality (n_copies=1, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | size | estimate | std_error | p_value |
|---|---|---|---|---|
| 1 | 221 | 12.936254 | 19.243884 | 0.25072 |


By default, CPI generates only one copy of $x_j^*$ to calculate its test statistic. Hence, it might be unstable under small sample sizes. A way to improve its stability is to generate more copies for approximation:

In [10]:
cpi = CIT(
    learner,
    sampler = sampler,
    removal = removal,
    method = "CPI",
    n_copies = 100)
_ = cpi.infer(X_test, y_test)
cpi.summarize()

Inferer Summary (cross_fit=False, combine=False)
 + Method: CPI (double_split=None, perturb_size=None)
 + Null Distribution: Normality (n_copies=100, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | size | estimate | std_error | p_value |
|---|---|---|---|---|
| 1 | 221 | 9.185257 | 11.816376 | 0.218481 |


The smaller `std_error` suggests that increasing the number of copies is helpful. However, using more copies requires longer computation time, so there is a tradeoff between stability and computational cost. For a relatively small data set, we recommend using more copies to improve stability.
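
As a quick check on how `estimate`, `std_error`, and `p_value` relate in the two CPI tables above, the sketch below assumes a one-sided z-test, $p = 1 - \Phi(\text{estimate} / \text{std\_error})$. The helper `normal_p_value` is hypothetical and this formula is an assumption about what the normal approximation does, not a description of the `sial` source code; it does, however, reproduce the reported $p$-values to about four decimal places.

```python
from math import erfc, sqrt

def normal_p_value(estimate, std_error):
    """One-sided p-value P(Z >= estimate / std_error) for a standard normal Z."""
    z = estimate / std_error
    return 0.5 * erfc(z / sqrt(2.0))

# Values copied from the two CPI output tables in this tutorial.
p_one_copy = normal_p_value(12.936254, 19.243884)       # n_copies = 1
p_hundred_copies = normal_p_value(9.185257, 11.816376)  # n_copies = 100

print(p_one_copy, p_hundred_copies)
```

Under this reading, a smaller `std_error` at a similar `estimate` yields a larger z-statistic, which is why averaging over more copies can sharpen the test.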

## References

Barber, R. F., & Candès, E. J. (2015). Controlling the false discovery rate via knockoffs. The Annals of Statistics, 43(5), 2055–2085. doi: 10.1214/15-AOS1337

Huang, P.-H. (2024). Residual Permutation Tests for Feature Importance in Machine Learning. [Manuscript submitted for publication].

Tansey, W., Veitch, V., Zhang, H., Rabadan, R., & Blei, D. M. (2022). The holdout randomization test for feature selection in black box models. Journal of Computational and Graphical Statistics, 31(1), 151–162. doi: 10.1080/10618600.2021.1923520

Watson, D., & Wright, M. (2021). Testing conditional independence in supervised learning algorithms. Machine Learning, 110, 1–23. doi: 10.1007/s10994-021-06030-6