# Cross-Fitting and $p$-Value Combination

In this tutorial, we demonstrate how to implement cross-fitting and $p$-value combination with `Crosser`. The `Crosser` class provides a user-friendly interface for nested cross-validation (CV), which yields more stable results when the sample size is small.

To install `sial` via `pip`, please use:
```
pip install sial-pkg
```

We will again use the `diabetes` data set. The following code prepares the data for the tutorial.

In [1]:
from sklearn import datasets
from sklearn.model_selection import train_test_split
X, y = datasets.load_diabetes(
    return_X_y = True,
    as_frame = True,
    scaled = False)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size = .5)

## `Crosser` Overview

Briefly speaking, `Crosser` is an object that conducts nested cross-validation (CV) to improve the stability of the learning results. Suppose we have split a data set $Z$ into 5 folds, say $Z^{(1)}, Z^{(2)}, Z^{(3)}, Z^{(4)}, Z^{(5)}$. For each $k \in \{1,2,3,4,5\}$, we set $Z_\text{train}^{(k)} = \{Z^{(l)}\}_{l \neq k}$ and $Z_\text{test}^{(k)} = \{Z^{(k)}\}$. Then $Z_\text{train}^{(k)}$ is used to train $\widehat{f}^{(k)}$ and $Z_\text{test}^{(k)}$ is used to evaluate the performance of $\widehat{f}^{(k)}$. Note that an inner CV loop may be used when training $\widehat{f}^{(k)}$.
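
The fold construction above can be sketched in plain Python. This is only an illustration of the index bookkeeping, not how `sial` or `scikit-learn` implement it; the helper names `k_fold_indices` and `train_test_pairs` are hypothetical, and the folds are taken as contiguous blocks without shuffling.

```python
def k_fold_indices(n_samples, n_folds):
    """Partition range(n_samples) into n_folds contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for k in range(n_folds):
        # The first `extra` folds receive one additional sample.
        size = base + (1 if k < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def train_test_pairs(folds):
    """For each k, the test set is fold k and the training set is every other fold."""
    pairs = []
    for k, test_idx in enumerate(folds):
        train_idx = [i for l, fold in enumerate(folds) if l != k for i in fold]
        pairs.append((train_idx, test_idx))
    return pairs


# 442 is the number of rows in the diabetes data used in this tutorial.
folds = k_fold_indices(n_samples=442, n_folds=5)
pairs = train_test_pairs(folds)
for k, (train_idx, test_idx) in enumerate(pairs):
    print(k, len(train_idx), len(test_idx))
```

Every sample appears in exactly one test set, so each $\widehat{f}^{(k)}$ is always evaluated on data it has not seen during training.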

To understand `Crosser` more concretely, let us create a cross-validator via `KFold`. The cross-validator `cv` generates the training and test indices for 5-fold CV. Importantly, a specific `random_state` must be given. Otherwise, `sial` may yield wrong results when making statistical inferences.

In [2]:
from sklearn.model_selection import KFold
cv = KFold(
    n_splits = 5,
    shuffle = True,
    random_state = 1)
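
One plausible reason a fixed `random_state` matters is that separately fitted objects need to see identical splits. The following plain-Python sketch mimics a shuffled 5-fold split with a seeded generator; `shuffled_folds` is a hypothetical helper and does not reproduce the exact shuffling used by `KFold`.

```python
import random


def shuffled_folds(n_samples, n_folds, seed):
    """Shuffle indices with a seeded generator, then deal them into folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices round-robin so fold sizes differ by at most one.
    return [sorted(indices[k::n_folds]) for k in range(n_folds)]


# Two independent calls with the same seed yield the same folds.
first_call = shuffled_folds(20, 5, seed=1)
second_call = shuffled_folds(20, 5, seed=1)
print(first_call == second_call)  # True
```

Without a seed, each call would shuffle differently, and two objects fitted one after the other would no longer share training and test sets.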

Now we initialize a `Crosser` object. When initializing, an estimator and a cross-validator must be given. The estimator can be a usual `scikit-learn` estimator or a `GridSearchCV` object, as in this example. The `GridSearchCV` object runs an inner CV loop to find an optimal tuning parameter value for each $\widehat{f}^{(k)}$. The `fit` method trains $\widehat{f}^{(1)}, \widehat{f}^{(2)}, \ldots, \widehat{f}^{(5)}$ on their corresponding training sets.

In [3]:
from sial import Crosser
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor
learner = Crosser(
    GridSearchCV(
        estimator = RandomForestRegressor(), 
        param_grid = {
            "max_features": [3, 6, 9]}),
    cv = cv)
_ = learner.fit(X, y)

To extract the trained estimators, we can use the `estimators_` attribute:

In [4]:
len(learner.estimators_)

5

The overall performance of these trained estimators can be shown by the `summarize` method. Three types of scores are reported. The `val_score` is the validation score extracted from the `GridSearchCV`. The `train_score` and `test_score` are calculated by `Crosser`. By default, regression tasks use the `r2` scorer and classification tasks use `accuracy`. The scoring method can be explicitly specified via the `scoring` argument when initializing a `Crosser`. Available scoring methods can be found [here](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values).

In [5]:
learner.summarize()

Crosser Summary (cross_fit=True, combine=False)
 + Estimator: GridSearchCV
 + Cross-Validator: KFold (n_folds=5, n_repeats=1)
 + Train/Test Scorer: R2 (reverse=False)


| repeat | val_score | train_score | test_score |
|---|---|---|---|
| 0 | 0.435116 | 0.922457 | 0.40038 |
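
The default `r2` scorer presumably corresponds to the usual coefficient of determination, $R^2 = 1 - SS_\text{res} / SS_\text{tot}$, as in `scikit-learn`. A minimal plain-Python sketch follows; the function name `r_squared` and the toy data are ours and are not part of `sial`.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0, 9.0]
perfect = r_squared(y_true, y_true)         # predictions equal to the targets
baseline = r_squared(y_true, [6.0] * 4)     # always predicting the target mean
print(perfect, baseline)  # 1.0 0.0
```

A value of 1 means perfect prediction, while 0 means the model does no better than predicting the mean of the targets.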


By default, the `summarize` method averages the scores across all folds. It is also possible to show the scores for each fold by setting `cross_fit = False`. We can see that the test scores differ considerably across the folds.

In [6]:
learner.summarize(
    cross_fit = False)

Crosser Summary (cross_fit=False, combine=False)
 + Estimator: GridSearchCV
 + Cross-Validator: KFold (n_folds=5, n_repeats=1)
 + Train/Test Scorer: R2 (reverse=False)


| split | repeat | fold | val_score | train_score | test_score |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.472267 | 0.923259 | 0.306461 |
| 1 | 0 | 1 | 0.461485 | 0.928113 | 0.212241 |
| 2 | 0 | 2 | 0.420104 | 0.923915 | 0.450129 |
| 3 | 0 | 3 | 0.413691 | 0.918957 | 0.474996 |
| 4 | 0 | 4 | 0.408034 | 0.918042 | 0.558074 |
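
As a quick check, the per-fold scores above can be averaged by hand. The sketch below assumes an equal-weight mean over folds, which is our reading of the output rather than a documented detail of `summarize`; it also reports the range of the test scores.

```python
from statistics import mean

# Per-fold scores copied from the table above.
fold_scores = {
    "val_score": [0.472267, 0.461485, 0.420104, 0.413691, 0.408034],
    "train_score": [0.923259, 0.928113, 0.923915, 0.918957, 0.918042],
    "test_score": [0.306461, 0.212241, 0.450129, 0.474996, 0.558074],
}

# Equal-weight average of each score over the five folds.
averaged = {name: round(mean(scores), 6) for name, scores in fold_scores.items()}
# Gap between the best and the worst test score.
spread = round(max(fold_scores["test_score"]) - min(fold_scores["test_score"]), 6)
print(averaged)
print(spread)
```

The averages agree with the single row reported when `cross_fit = True`, and the gap of roughly 0.35 between the best and worst fold illustrates how unstable a single split can be.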


The `Crosser` also has a `predict` method for prediction. For a regressor, by default, it makes predictions by averaging the predictions made by all the trained estimators:

In [7]:
learner.predict(
    X.iloc[:5,:])

array([179.698,  82.326, 152.412, 191.108, 116.05 ])

When the estimator is a classifier, the prediction is based on majority voting. To make predictions with the model trained on a specific split, the `split` argument can be used:

In [8]:
learner.predict(
    X.iloc[:5,:],
    split = 0)

array([175.94,  82.67, 144.97, 194.7 ,  98.88])
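
The two aggregation rules described above, averaging for regressors and majority voting for classifiers, can be sketched as follows. The helper names and toy predictions are hypothetical, and how `Crosser` breaks ties in a vote is not stated in this tutorial; the sketch simply keeps the label seen first.

```python
from collections import Counter
from statistics import mean


def average_predictions(per_model_preds):
    """Average regression predictions sample by sample across models."""
    return [mean(sample_preds) for sample_preds in zip(*per_model_preds)]


def majority_vote(per_model_labels):
    """Pick the most frequent label per sample; ties go to the label seen first."""
    return [
        Counter(sample_labels).most_common(1)[0][0]
        for sample_labels in zip(*per_model_labels)
    ]


# Three hypothetical fitted models, each predicting two samples.
reg_preds = [[170.0, 80.0], [180.0, 82.0], [190.0, 84.0]]
clf_preds = [[1, 2], [1, 1], [2, 1]]
averaged_preds = average_predictions(reg_preds)
voted_labels = majority_vote(clf_preds)
print(averaged_preds)  # [180.0, 82.0]
print(voted_labels)    # [1, 1]
```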

## Inference with Cross-Fitting via `Crosser`

Now we demonstrate how to use the cross-fitting strategy, with the help of `Crosser`, to improve both the statistical power and the stability of a test. In brief, the cross-fitting strategy integrates the test results obtained on the individual folds:

+ When the null distribution is constructed by `null_dist = resampling`, the integration is based on Algorithm 4 of Tansey et al. (2022);
+ When the null distribution is constructed by `null_dist = normality` or `null_dist = permutation`, the integration is based on Algorithm 3 of Williamson et al. (2023).

In this example, the conditional predictive impact (CPI) is used to test the significance of `sex`. Hence, a sampler for `sex` is necessary. Note that the sampler must be trained with the same cross-validator as we specified for training the learner. Otherwise, `sial` cannot perform cross-fitting correctly.

In [9]:
from sklearn.ensemble import RandomForestClassifier
removal = "sex"
sampler = Crosser(
    GridSearchCV(
        estimator = RandomForestClassifier(), 
        param_grid = {
            "max_features": [3, 6, 9]}),
    cv = cv)
_ = sampler.fit(
    X.drop(removal, axis = 1), 
    X[removal])

To implement cross-fitting, we replace the usual estimators with their corresponding `Crosser` objects when initializing the `Inferer`. After inferring, the `summarize` method integrates the results via cross-fitting by default:

In [10]:
from sial.inferer import CIT
cpi = CIT(
    learner, 
    sampler,
    removal,
    "CPI",
    n_copies = 100)
_ = cpi.infer(X, y)
cpi.summarize()

Inferer Summary (cross_fit=True, combine=False)
 + Method: CPI (double_split=None, perturb_size=None)
 + Null Distribution: Normality (n_copies=100, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | size | estimate | std_error | p_value |
|---|---|---|---|---|
| sex | 442 | 56.265094 | 18.057748 | 0.000917 |


To inspect the result on each fold, we can set `cross_fit = False`. We can see that these results are less stable and less powerful because of the small per-fold sample sizes of the `diabetes` data.

In [11]:
cpi.summarize(
    cross_fit = False)

Inferer Summary (cross_fit=False, combine=False)
 + Method: CPI (double_split=None, perturb_size=None)
 + Null Distribution: Normality (n_copies=100, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | split | repeat | fold | size | estimate | std_error | p_value |
|---|---|---|---|---|---|---|---|
| sex | 0 | 0 | 0 | 89 | -5.246309 | 40.803878 | 0.551153 |
| sex | 1 | 0 | 1 | 89 | 119.704975 | 33.805048 | 0.000199 |
| sex | 2 | 0 | 2 | 88 | 50.167522 | 39.032044 | 0.099345 |
| sex | 3 | 0 | 3 | 88 | 29.278031 | 39.064681 | 0.226785 |
| sex | 4 | 0 | 4 | 88 | 87.421249 | 49.186114 | 0.037755 |
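
The reported numbers appear consistent with two simple relations: the cross-fitted estimate equals the unweighted mean of the fold estimates, and each $p$-value is the upper tail of a standard normal evaluated at `estimate / std_error`. Both are inferred from the output above rather than taken from the `sial` documentation, and the sketch does not attempt to reproduce how the cross-fitted standard error is pooled.

```python
from statistics import NormalDist, mean


def one_sided_p_value(estimate, std_error):
    """Upper-tail p-value of z = estimate / std_error under a standard normal."""
    z = estimate / std_error
    return 1.0 - NormalDist().cdf(z)


# Fold-level estimates copied from the table above.
fold_estimates = [-5.246309, 119.704975, 50.167522, 29.278031, 87.421249]
cross_fit_estimate = mean(fold_estimates)

# Cross-fitted row and first fold row, using the reported standard errors.
p_cross_fit = one_sided_p_value(56.265094, 18.057748)
p_fold_0 = one_sided_p_value(-5.246309, 40.803878)
print(round(cross_fit_estimate, 6))
print(p_cross_fit, p_fold_0)
```

The two $p$-values come out close to the tabulated 0.000917 and 0.551153. A negative estimate, as in fold 0, gives a $p$-value above 0.5 because the test is one-sided.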


## Inference with Both Cross-Fitting and $p$-Value Combination

Another way to improve the stability of testing is to use $p$-value combination. In this example, we use cross-fitting and $p$-value combination together. We plan to conduct 5-fold cross-validation 4 times. Hence, `RepeatedKFold` is used to initialize a cross-validator:

In [12]:
from sklearn.model_selection import RepeatedKFold
cv = RepeatedKFold(
    n_splits = 5,
    n_repeats = 4,
    random_state = 1)

In `sial`, `n_splits` denotes the total number of splits, and the number of folds is represented by another variable called `n_folds`. Hence, specifying `n_splits = 5` and `n_repeats = 4` for `RepeatedKFold` in this example results in `n_splits = 20`, `n_folds = 5`, and `n_repeats = 4` being stored in both `Crosser` and `Inferer`.
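
The bookkeeping can be sketched with integer division. The mapping from a split index to its `(repeat, fold)` pair below assumes that splits are enumerated repeat by repeat, which is how `RepeatedKFold` yields them; `split_to_repeat_fold` is a hypothetical helper, not a `sial` function.

```python
n_folds, n_repeats = 5, 4
n_splits = n_folds * n_repeats  # total number of train/test splits


def split_to_repeat_fold(split, n_folds):
    """Map a global split index to its (repeat, fold) pair."""
    return divmod(split, n_folds)


for split in (0, 4, 5, 9, 19):
    repeat, fold = split_to_repeat_fold(split, n_folds)
    print(split, repeat, fold)
```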

Then we train a learner and a sampler with the `RepeatedKFold` cross-validator and conduct CPI for `sex`. By default, the `summarize` method applies several $p$-value combination methods that aggregate the $p$-values obtained in the 4 replications.

In [13]:
learner = Crosser(
    GridSearchCV(
        estimator = RandomForestRegressor(), 
        param_grid = {
            "max_features": [3, 6, 9]}),
    cv = cv)
sampler = Crosser(
    GridSearchCV(
        estimator = RandomForestClassifier(), 
        param_grid = {
            "max_features": [3, 6, 9]}),
    cv = cv)
_ = learner.fit(X, y)
_ = sampler.fit(
    X.drop(removal, axis = 1), 
    X[removal])

In [14]:
cpi = CIT(
    learner, 
    sampler,
    removal,
    "CPI",
    n_copies = 100)
_ = cpi.infer(X, y)
cpi.summarize()

Inferer Summary (cross_fit=True, combine=True)
 + Method: CPI (double_split=None, perturb_size=None)
 + Null Distribution: Normality (n_copies=100, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | method | size | estimate | std_error | p_value |
|---|---|---|---|---|---|
| sex | gmean | 442.0 | 49.631483 | 17.880264 | 0.007043 |
| sex | median | 442.0 | 49.631483 | 17.880264 | 0.006286 |
| sex | q1 | 442.0 | 49.631483 | 17.880264 | 0.002032 |
| sex | min | 442.0 | 49.631483 | 17.880264 | 0.003808 |
| sex | hmean | 442.0 | 49.631483 | 17.880264 | 0.006449 |
| sex | hommel | 442.0 | 49.631483 | 17.880264 | 0.004234 |
| sex | cauchy | 442.0 | 49.631483 | 17.880264 | 0.001711 |


To see the $p$-value in each replication, we can set `combine = False` in the `summarize` method. Note that these $p$-values are still calculated based on cross-fitting. Even after cross-fitting, the $p$-values can still differ across replications. That is why $p$-value combination can be helpful for enhancing stability.

In [15]:
cpi.summarize(
    combine = False)

Inferer Summary (cross_fit=True, combine=False)
 + Method: CPI (double_split=None, perturb_size=None)
 + Null Distribution: Normality (n_copies=100, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | repeat | size | estimate | std_error | p_value |
|---|---|---|---|---|---|
| sex | 0 | 442 | 43.351123 | 16.949814 | 0.00527 |
| sex | 1 | 442 | 51.239335 | 16.503058 | 0.000952 |
| sex | 2 | 442 | 58.969712 | 19.112014 | 0.001016 |
| sex | 3 | 442 | 44.965763 | 18.95617 | 0.008844 |
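
To make the combination step concrete, the sketch below applies five common rules to the four per-repeat $p$-values above. The multipliers (the number of $p$-values $K$ for the minimum, 2 for the median, $e$ for the geometric mean, $e \ln K$ for the harmonic mean) are our reading of the combined values reported earlier, not a statement of how `sial` implements them; the `q1` and `hommel` rules are left out, and `combine_p_values` is a hypothetical helper.

```python
import math
from statistics import harmonic_mean, median


def combine_p_values(p_values):
    """Combine p-values with several rules; each result is capped at 1."""
    k = len(p_values)
    log_mean = sum(math.log(p) for p in p_values) / k
    # Cauchy combination: average the Cauchy-transformed p-values.
    cauchy_stat = sum(math.tan((0.5 - p) * math.pi) for p in p_values) / k
    combined = {
        "min": k * min(p_values),                     # Bonferroni-type rule
        "median": 2.0 * median(p_values),             # twice the median
        "gmean": math.e * math.exp(log_mean),         # e times the geometric mean
        "hmean": math.e * math.log(k) * harmonic_mean(p_values),
        "cauchy": 0.5 - math.atan(cauchy_stat) / math.pi,
    }
    return {name: min(1.0, value) for name, value in combined.items()}


# Per-repeat p-values copied from the table above.
repeat_p_values = [0.00527, 0.000952, 0.001016, 0.008844]
combined = combine_p_values(repeat_p_values)
for name, value in combined.items():
    print(name, round(value, 6))
```

Up to rounding of the inputs, these values agree with the `min`, `median`, `gmean`, `hmean`, and `cauchy` rows of the combined summary.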


If users are interested in the $p$-value on each split, `combine = False` and `cross_fit = False` can be used. We see that the split-level results are indeed unstable.

In [16]:
cpi.summarize(
    cross_fit = False,
    combine = False)

Inferer Summary (cross_fit=False, combine=False)
 + Method: CPI (double_split=None, perturb_size=None)
 + Null Distribution: Normality (n_copies=100, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | split | repeat | fold | size | estimate | std_error | p_value |
|---|---|---|---|---|---|---|---|
| sex | 0 | 0 | 0 | 89 | -2.541062 | 43.365655 | 0.523363 |
| sex | 1 | 0 | 1 | 89 | 109.953799 | 33.618844 | 0.000537 |
| sex | 2 | 0 | 2 | 88 | 44.26181 | 36.981536 | 0.11568 |
| sex | 3 | 0 | 3 | 88 | 25.350981 | 35.376511 | 0.236809 |
| sex | 4 | 0 | 4 | 88 | 39.730088 | 40.162131 | 0.161272 |
| sex | 5 | 1 | 0 | 89 | 32.947555 | 34.141925 | 0.167268 |
| sex | 6 | 1 | 1 | 89 | 72.037018 | 35.774047 | 0.022022 |
| sex | 7 | 1 | 2 | 88 | 81.073104 | 36.318819 | 0.012799 |
| sex | 8 | 1 | 3 | 88 | 49.780208 | 40.981997 | 0.112243 |
| sex | 9 | 1 | 4 | 88 | 20.358789 | 37.293011 | 0.292562 |


Briefly speaking, `cross_fit` controls aggregating results over `n_folds` within a given repeat, and `combine` controls combining results over `n_splits`. If both `cross_fit` and `combine` are `True`, the `summarize` method first aggregates the fold-level results within each repeat and then performs $p$-value combination on the results of that first step.

Finally, it is possible to combine all $p$-values without doing cross-fitting first by setting `combine = True` and `cross_fit = False`. This approach might give conservative results for some $p$-value combination methods. However, so far it is hard to say which method is more reliable. The relative performance of these combination methods requires further empirical evaluation.

In [17]:
cpi.summarize(
    cross_fit = False,
    combine = True)

Inferer Summary (cross_fit=False, combine=True)
 + Method: CPI (double_split=None, perturb_size=None)
 + Null Distribution: Normality (n_copies=100, n_permutations=None)
 + Loss Function: Mean Squared Error (reverse=False)


| removal | method | size | estimate | std_error | p_value |
|---|---|---|---|---|---|
| sex | gmean | 88.4 | 49.631483 | 39.981486 | 0.187431 |
| sex | median | 88.4 | 49.631483 | 39.981486 | 0.240288 |
| sex | q1 | 88.4 | 49.631483 | 39.981486 | 0.014352 |
| sex | min | 88.4 | 49.631483 | 39.981486 | 0.010732 |
| sex | hmean | 88.4 | 49.631483 | 39.981486 | 0.055667 |
| sex | hommel | 88.4 | 49.631483 | 39.981486 | 0.038611 |
| sex | cauchy | 88.4 | 49.631483 | 39.981486 | 0.006862 |


## References

Tansey, W., Veitch, V., Zhang, H., Rabadan, R., & Blei, D. M. (2022). The holdout randomization test for feature selection in black box models. Journal of Computational and Graphical Statistics, 31(1), 151–162. doi: 10.1080/10618600.2021.1923520

Williamson, B. D., Gilbert, P. B., Simon, N. R., & Carone, M. (2023). A general framework for inference on algorithm-agnostic variable importance. Journal of the American Statistical Association, 118(543), 1645–1658. doi: 10.1080/01621459.2021.2003200