## Multiple Split Cross-Validation Data Leak

In our article [Cross-Methods are a Leak/Variance Trade-Off](https://github.com/WinVector/pyvtreat/blob/master/Examples/CrossVal/LeakTradeOff/CrossFrameExample.ipynb) we left out an interesting data leak: what happens if you use a different cross-validation plan on each step of a multi-stage or multi-column problem?

This example comes from our students, who have asked us this very question. We have found our students are very smart, and their questions are often very fundamental. So here is a concrete example of one of the risks of using many different cross-validation plans in the same project: each plan leaks different information, and these leaks can combine into a larger leak. The example here is extreme, but it makes the point.

So our advice is: unless you are coordinating the many plans in some way (such as 2-way independence or some sort of combinatorial design) it is generally better to use one plan. That way minor information leaks at each stage explore less of the output variations, and don’t combine into worse leaks.

Let's take a look at what happens when we have a lot of copies of a constant column, each encoded under a different cross-validation plan. For this example we will use 3-fold cross-validation. That leaks less, but as we will see, averaging a lot of copies of the constant column, each treated in the same way but under its own plan, still leaks information in a non-productive manner. For a 3-fold cross-validation plan the bias from any one column is small; however, many columns together will leak information. We will demonstrate this next.
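
Before the full demonstration, it may help to see what a single plan does to a constant column. The following is a minimal NumPy-only sketch, not the code used in this article: it assumes that target encoding a single-level column reduces to the mean of `y` over the rows used for fitting, and the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2020)  # illustrative seed, not the article's
y = rng.normal(size=100)

# One 3-fold plan: shuffle the row indices and cut them into three folds.
folds = np.array_split(rng.permutation(len(y)), 3)

# Assumption: encoding a constant column gives the training mean of y,
# so the out-of-fold code for a row is the mean of y over the two folds
# that the row is NOT in.
encoded = np.empty(len(y))
for test_idx in folds:
    train_mask = np.ones(len(y), dtype=bool)
    train_mask[test_idx] = False
    encoded[test_idx] = y[train_mask].mean()

print(np.unique(encoded))              # one distinct code per fold
print(np.corrcoef(encoded, y)[0, 1])   # association with y from one plan
```

Under one plan the encoded column can only take three distinct values, one per fold, which is why a single column carries little information about any individual row.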

First we import our required packages.

In [1]:
# https://numpy.org
import numpy

# https://pandas.pydata.org
import pandas

# https://scikit-learn.org/
import sklearn.metrics
import sklearn.linear_model
import sklearn.model_selection
# https://contrib.scikit-learn.org/categorical-encoding/targetencoder.html
import category_encoders

# https://www.statsmodels.org/
import statsmodels.api

# https://github.com/WinVector/pyvtreat/blob/master/Examples/CrossVal/LeakTradeOff/break_cross_val.py
from break_cross_val import TransformerAdapter, Container

Now we construct a data frame in which every entry is the same constant value.



In [2]:
numpy.random.seed(2020)
prng = numpy.random.RandomState(numpy.random.randint(2**32))

In [3]:
y_example = numpy.random.normal(size=100)

In [4]:
wide_const_frame = pandas.DataFrame()
for j in range(1000):
    wide_const_frame[str(j)] = ['a'] * len(y_example)

wide_const_frame

,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a
1,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a
2,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a
3,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a
4,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a
96,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a
97,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a
98,a,a,a,a,a,a,a,a,a,a,...,a,a,a,a,a,a,a,a,a,a


## Using a different cross-validation plan for each column/step (not a good idea)


Now we re-code the frame using a different cross-validation plan for each column (an inadvisable idea).


In [5]:
# cross encode each column with a different plan 
# (not a good idea)
wide_coded_frame = pandas.DataFrame()

for c in wide_const_frame.columns:
    # http://www.win-vector.com/blog/2020/03/python-data-science-tip-dont-use-default-cross-validation-settings/
    cvstrat = sklearn.model_selection.KFold(
        shuffle=True, n_splits=3,
        random_state=prng)
    te_wide = TransformerAdapter(category_encoders.target_encoder.TargetEncoder())
    colf = sklearn.model_selection.cross_val_predict(
        te_wide, 
        wide_const_frame[[c]], 
        y_example, 
        cv=cvstrat)
    wide_coded_frame[c] = colf[:, 0]

wide_coded_frame

,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,0.055096,0.063151,0.123506,0.073977,0.155730,0.125445,0.118004,-0.011715,-0.035786,0.152003,...,0.054028,0.024162,0.089872,0.197621,0.012531,0.216629,0.079401,0.172212,0.091442,0.095523
1,0.053041,0.063151,0.056868,0.073977,0.155730,0.125445,0.153948,0.018624,0.210023,0.098883,...,0.032855,0.198822,0.022123,0.159744,0.012531,0.008650,0.141128,0.172212,0.102955,0.095523
2,0.055096,0.104923,0.123506,0.115078,0.155730,0.170804,-0.002372,0.264864,0.210023,0.019202,...,0.032855,0.198822,0.022123,-0.087162,0.199564,0.008650,0.048475,0.038054,0.074787,0.027020
3,0.160505,0.101311,0.089289,0.073977,0.005925,0.170804,0.153948,0.018624,0.210023,0.152003,...,0.054028,0.024162,0.157165,0.197621,0.199564,0.216629,0.141128,0.172212,0.091442,0.027020
4,0.160505,0.101311,0.089289,0.115078,0.005925,0.125445,0.118004,0.018624,-0.035786,0.019202,...,0.054028,0.024162,0.157165,0.197621,0.012531,0.008650,0.048475,0.172212,0.091442,0.027020
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.160505,0.063151,0.089289,0.073977,0.005925,-0.026558,-0.002372,0.264864,0.210023,0.019202,...,0.032855,0.024162,0.089872,0.197621,0.199564,0.043185,0.141128,0.038054,0.074787,0.145680
96,0.053041,0.104923,0.089289,0.079868,0.106253,0.125445,0.118004,0.264864,-0.035786,0.098883,...,0.181426,0.198822,0.089872,0.197621,0.056569,0.216629,0.048475,0.038054,0.091442,0.145680
97,0.160505,0.101311,0.089289,0.073977,0.106253,0.170804,0.153948,-0.011715,0.210023,0.098883,...,0.181426,0.045514,0.089872,0.197621,0.199564,0.216629,0.079401,0.172212,0.074787,0.095523
98,0.055096,0.063151,0.089289,0.079868,0.106253,-0.026558,0.118004,0.018624,0.210023,0.098883,...,0.032855,0.045514,0.157165,-0.087162,0.012531,0.043185,0.141128,0.058426,0.091442,0.027020



## Single column is uninformative

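The cross-validation is working as intended to the degree that any single encoded column is uninformative about the outcome. We check this by regressing `y_example` on column `5` alone.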

In [6]:
c5_frame = pandas.DataFrame({
    'x': wide_coded_frame['5'], 
    'const': 1})

fit_c5 = statsmodels.api.OLS(
    y_example, 
    c5_frame)
res_c5 = fit_c5.fit()

r2 = sklearn.metrics.r2_score(
    y_true=y_example, 
    y_pred=res_c5.predict(c5_frame))

assert r2 < 0.1

res_c5.summary()

Dep. Variable:,y,R-squared:,0.025
Model:,OLS,Adj. R-squared:,0.015
Method:,Least Squares,F-statistic:,2.533
Date:,"Sat, 14 Mar 2020",Prob (F-statistic):,0.115
Time:,12:23:42,Log-Likelihood:,-147.6
No. Observations:,100,AIC:,299.2
Df Residuals:,98,BIC:,304.4
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x,-2.0249,1.272,-1.591,0.115,-4.550,0.500
const,0.2725,0.157,1.736,0.086,-0.039,0.584

Omnibus:,0.921,Durbin-Watson:,2.04
Prob(Omnibus):,0.631,Jarque-Bera (JB):,0.447
Skew:,0.004,Prob(JB):,0.8
Kurtosis:,3.327,Cond. No.,12.0




### Mean encoding

The mean of these encodings carries information about every row nearly uniformly, except the row itself. This is much like the leave-one-out pattern, and it is a data leak.
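
A short calculation suggests why the mean behaves like leave-one-out. It rests on an assumption not verified against the encoder's source here: that target encoding a single-level column reduces to the training-fold mean of `y`. Write $n$ for the number of rows, $S = \sum_j y_j$ for the total of the outcome, and $c_i^{(p)}$ for the code row $i$ receives under plan $p$ (all of these symbols are introduced for this sketch). Row $i$ is never in its own training set, and under a shuffled plan each of the other rows is equally likely to be in it, so

$$
\mathbb{E}\left[c_i^{(p)}\right] = \frac{1}{n-1} \sum_{j \ne i} y_j = \frac{S - y_i}{n-1},
\qquad
\bar{c}_i = \frac{1}{P} \sum_{p=1}^{P} c_i^{(p)} \;\longrightarrow\; \frac{S - y_i}{n-1}
\quad \text{as } P \to \infty .
$$

Solving for the outcome gives $y_i \approx S - (n-1)\,\bar{c}_i$. In other words, the averaged column is close to a linear function of the row's own outcome, with slope near $-(n-1)$. With $n = 100$ that is a slope near $-99$; with a finite number of plans, the noise remaining in $\bar{c}_i$ would be expected to pull a fitted slope somewhat toward zero.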


In [7]:
mean_coded_noise = wide_coded_frame.mean(axis = 1) 

wide_fit = pandas.DataFrame({
    'mean_noise': mean_coded_noise,
    'const_col': 1
})

wide_fit


,mean_noise,const_col
0,0.091302,1
1,0.099572,1
2,0.089368,1
3,0.103004,1
4,0.070237,1
...,...,...
95,0.090376,1
96,0.095393,1
97,0.111182,1
98,0.058898,1


The leak is demonstrated by showing that the average column can be used to estimate the outcome `y_example` (the dependent variable), even though every input is the same constant.


In [8]:
overfit_model_wide = statsmodels.api.OLS(
    y_example, 
    wide_fit)
overfit_result_wide = overfit_model_wide.fit()

r2 = sklearn.metrics.r2_score(
    y_true=y_example, 
    y_pred=overfit_result_wide.predict(wide_fit))

assert r2 > 0.8

overfit_result_wide.summary()


Dep. Variable:,y,R-squared:,0.952
Model:,OLS,Adj. R-squared:,0.951
Method:,Least Squares,F-statistic:,1925.0
Date:,"Sat, 14 Mar 2020",Prob (F-statistic):,3.08e-66
Time:,12:23:42,Log-Likelihood:,2.5024
No. Observations:,100,AIC:,-1.005
Df Residuals:,98,BIC:,4.206
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
mean_noise,-95.9396,2.186,-43.878,0.000,-100.279,-91.601
const_col,8.6952,0.198,44.012,0.000,8.303,9.087

Omnibus:,1.031,Durbin-Watson:,1.861
Prob(Omnibus):,0.597,Jarque-Bera (JB):,1.039
Skew:,-0.116,Prob(JB):,0.595
Kurtosis:,2.557,Cond. No.,92.5


Notice that the data leak has essentially memorized the outcome, and done so non-productively. Again, the issue is that this fit won't carry over to out-of-sample data.
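
The out-of-sample failure can be sketched without `category_encoders`. The code below is an illustration under stated assumptions, not the article's pipeline: it assumes that encoding a constant column gives the training-fold mean of `y`, that new rows are encoded with the single mean of all training outcomes, and it uses an illustrative seed and a hypothetical helper named `out_of_fold_mean`.

```python
import numpy as np

def out_of_fold_mean(y, rng, n_splits=3):
    """Out-of-fold mean of y under one freshly shuffled k-fold plan."""
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    encoded = np.empty(len(y))
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False
        encoded[test_idx] = y[train_mask].mean()
    return encoded

rng = np.random.default_rng(2020)  # illustrative seed
y_train = rng.normal(size=100)

# Average 1000 encodings, each built under a different 3-fold plan.
mean_code = np.mean(
    [out_of_fold_mean(y_train, rng) for _ in range(1000)], axis=0)

# Fit y ~ slope * mean_code + intercept on the training rows.
slope, intercept = np.polyfit(mean_code, y_train, deg=1)
r2_in = np.corrcoef(mean_code, y_train)[0, 1] ** 2

# Assumption: out of sample there is no cross-validation, so the constant
# column is encoded as one number and every new row gets the same prediction.
y_new = rng.normal(size=100)
new_code = np.full(len(y_new), y_train.mean())
pred_new = slope * new_code + intercept
r2_out = 1.0 - (np.sum((y_new - pred_new) ** 2)
                / np.sum((y_new - y_new.mean()) ** 2))

print(f"in-sample R^2: {r2_in:.3f}, slope: {slope:.1f}")
print(f"out-of-sample R^2: {r2_out:.3f}")
```

Because the out-of-sample prediction is a single constant, its R-squared against fresh outcomes cannot exceed zero, however good the in-sample fit looks.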


## Using the same cross-validation plan for each column/step (the recommendation)

Now let's see how this works when we use the more standard recommended method of using the same cross-validation plan throughout.


In [9]:
# cross encode each column with the same plan 
# (the standard advice)
class PreStoredCrossVal:
    def __init__(self, plan, nrow):
        self.n_splits = plan.get_n_splits()
        self.nrow = nrow
        self.plan = [(train, test) for train, test in plan.split(pandas.DataFrame({'x': range(nrow)}))]
        
    def get_n_splits(self, X=None, y=None, groups=None):
        return self.n_splits
    
    def split(self, X, y=None, groups=None):
        if X.shape[0] != self.nrow:
            raise ValueError("number of rows must be " + str(self.nrow))
        return [(train, test) for train, test in self.plan]

    
cvstratc = PreStoredCrossVal(
    sklearn.model_selection.KFold(
        shuffle=True, n_splits=3,
        random_state=prng),
    nrow=wide_const_frame.shape[0])


wide_coded_framec = pandas.DataFrame()


for c in wide_const_frame.columns:
    # http://www.win-vector.com/blog/2020/03/python-data-science-tip-dont-use-default-cross-validation-settings/
    te_wide = TransformerAdapter(category_encoders.target_encoder.TargetEncoder())
    colf = sklearn.model_selection.cross_val_predict(
        te_wide, 
        wide_const_frame[[c]], 
        y_example, 
        cv=cvstratc)
    wide_coded_framec[c] = colf[:, 0]

wide_coded_framec

,0,1,2,3,4,5,6,7,8,9,...,990,991,992,993,994,995,996,997,998,999
0,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,...,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471
1,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,...,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682
2,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,...,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471
3,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,...,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682
4,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,...,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,...,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471
96,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,...,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682,0.026682
97,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,...,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471,0.124471
98,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,...,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434,0.118434


We see that all of the columns are identical, so using any one of them is the same as using all of them.

In [10]:
c_frame = pandas.DataFrame({
    'x': wide_coded_framec['0'], 
    'const': 1})

fit_c = statsmodels.api.OLS(
    y_example, 
    c_frame)
res_c = fit_c.fit()

r2 = sklearn.metrics.r2_score(
    y_true=y_example, 
    y_pred=res_c.predict(c_frame))

assert r2 < 0.1

res_c.summary()

Dep. Variable:,y,R-squared:,0.007
Model:,OLS,Adj. R-squared:,-0.003
Method:,Least Squares,F-statistic:,0.6956
Date:,"Sat, 14 Mar 2020",Prob (F-statistic):,0.406
Time:,12:25:05,Log-Likelihood:,-148.52
No. Observations:,100,AIC:,301.0
Df Residuals:,98,BIC:,306.3
Df Model:,1,,
Covariance Type:,nonrobust,,

,coef,std err,t,P>|t|,[0.025,0.975]
x,-2.0179,2.420,-0.834,0.406,-6.820,2.784
const,0.2716,0.243,1.116,0.267,-0.211,0.755

Omnibus:,1.881,Durbin-Watson:,2.002
Prob(Omnibus):,0.39,Jarque-Bera (JB):,1.427
Skew:,-0.04,Prob(JB):,0.49
Kurtosis:,3.58,Cond. No.,22.6
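
A lighter-weight way to share one plan is to materialize the splits once and pass the resulting list as `cv`, since scikit-learn accepts an iterable of (train, test) index arrays there. The sketch below is an alternative illustration, not the method used above: it substitutes `sklearn.dummy.DummyRegressor` for the target encoder, on the assumption that encoding a constant column amounts to predicting the training mean of `y`, and the seed is illustrative.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import KFold, cross_val_predict

rng = np.random.default_rng(2020)  # illustrative seed
y = rng.normal(size=100)
X = np.zeros((len(y), 1))  # stand-in for one constant column

# Materialize the plan once; the same list of (train, test) index pairs
# is then reused for every column.
kfold = KFold(n_splits=3, shuffle=True, random_state=2020)
splits = list(kfold.split(X))

# DummyRegressor(strategy="mean") predicts the training mean of y, used
# here as a stand-in for target encoding a constant column.
columns = [
    cross_val_predict(DummyRegressor(strategy="mean"), X, y, cv=splits)
    for _ in range(5)
]

print(all(np.array_equal(columns[0], col) for col in columns))
print(len(np.unique(columns[0])))
```

With a shared plan every encoded column is the same vector, so adding more copies adds no new information about the outcome.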


## Why `TransformerAdapter`

As a side note, the reason we are using `TransformerAdapter` is that `category_encoders.target_encoder.TargetEncoder` doesn't (at version `2.1.0`) implement all of the interface needed by `sklearn.model_selection.cross_val_predict`.

In [11]:
category_encoders.__version__

'2.1.0'

In [12]:
try:
    colf = sklearn.model_selection.cross_val_predict(
        category_encoders.target_encoder.TargetEncoder(), 
        wide_const_frame[[c]], 
        y_example, 
        cv=cvstratc)
except AttributeError as e:
    print(str(e))

'TargetEncoder' object has no attribute 'predict'
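
An adapter of this kind can be quite small. The following is a hypothetical sketch and not the contents of `break_cross_val.py`: the names `MeanEncoder` and `PredictAdapter` are invented for illustration, and the idea is simply to expose a transformer's `transform()` under the name `predict()` so that `cross_val_predict` can call it.

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.model_selection import KFold, cross_val_predict


class MeanEncoder:
    """Hypothetical stand-in transformer with fit/transform but no predict."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def transform(self, X):
        # Encode every row as the training mean of y (one output column).
        return np.full((len(X), 1), self.mean_)


class PredictAdapter(BaseEstimator):
    """Hypothetical adapter: exposes the wrapped transform() as predict()."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y):
        self.model.fit(X, y)
        return self

    def predict(self, X):
        return self.model.transform(X)


rng = np.random.default_rng(2020)  # illustrative seed
y = rng.normal(size=30)
X = np.zeros((len(y), 1))  # one constant column

coded = cross_val_predict(
    PredictAdapter(MeanEncoder()), X, y,
    cv=KFold(n_splits=3, shuffle=True, random_state=0))
print(coded.shape)
```

Inheriting from `BaseEstimator` supplies `get_params`, which `cross_val_predict` needs in order to clone the estimator for each fold.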


Returning to the single-plan experiment: because all of the encoded columns are identical, a single column being uninformative means the columns are also uninformative jointly.

## Conclusion

And here we have a fairly simple demonstration of a data leak.