# [`vtreat`](https://github.com/WinVector/pyvtreat) Nested Model Bias Warning

For quite a while we have been teaching that estimating variable re-encodings on the exact same data
that is later *naively* used to train a model leads to an undesirable nested model bias.  The `vtreat`
package (both the [`R` version](https://github.com/WinVector/vtreat) and the
[`Python` version](https://github.com/WinVector/pyvtreat)) incorporates a cross-frame method
that allows one to use all of the training data both to learn variable re-encodings and to correctly train a subsequent model (for an example please see our recent [PyData LA talk](http://www.win-vector.com/blog/2019/12/pydata-los-angeles-2019-talk-preparing-messy-real-world-data-for-supervised-machine-learning/)).
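
To make the nested model bias concrete, here is a minimal sketch in plain `pandas`/`numpy`. It is not `vtreat`'s actual implementation: it assumes a hypothetical pure-noise categorical variable `xc` with 200 levels, and uses a simple 5-fold out-of-fold mean encoding as a stand-in for the cross-frame idea. The names `naive_code`, `cross_code`, and `fold` are illustrative only.

```python
import numpy
import pandas

rng = numpy.random.default_rng(2019)
n = 1000

# Hypothetical data: xc is a high-cardinality variable with NO relation to y.
d = pandas.DataFrame({
    'xc': rng.choice(['lev_' + str(i) for i in range(200)], size=n),
    'y': rng.normal(size=n),
})

# Naive encoding: per-level mean of y, estimated and applied on the same rows.
naive_code = d.groupby('xc')['y'].transform('mean') - d['y'].mean()

# Out-of-fold encoding: each row is coded using only rows from the other folds.
n_folds = 5
fold = rng.permutation(n) % n_folds
cross_code = numpy.zeros(n)
for k in range(n_folds):
    is_held_out = fold == k
    fit_rows = d.loc[~is_held_out]
    level_means = fit_rows.groupby('xc')['y'].mean() - fit_rows['y'].mean()
    # Levels not seen in the fitting folds get a neutral code of 0.
    cross_code[is_held_out] = (
        d.loc[is_held_out, 'xc'].map(level_means).fillna(0.0).to_numpy()
    )

# Correlation of each encoding with the outcome on the training rows.
naive_corr = numpy.corrcoef(naive_code, d['y'])[0, 1]
cross_corr = numpy.corrcoef(cross_code, d['y'])[0, 1]
print('naive:', naive_corr, 'out-of-fold:', cross_corr)
```

On data like this, the naive encoding appears correlated with `y` on the training rows even though `xc` is pure noise, while the out-of-fold encoding shows little such spurious correlation. A downstream model trained on the naive encoding would be misled in exactly this way.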

The next version of `vtreat` will warn the user if they have improperly used the same data for both `vtreat` impact code inference and downstream modeling.  So in addition to us warning you not to do this, the package now also checks and warns against this situation.


## Set up the Example


This example is copied from [some of our classification documentation](https://github.com/WinVector/pyvtreat/blob/master/Examples/Classification/Classification.md).


Load modules/packages.

In [1]:
import pkg_resources
import pandas
import numpy
import numpy.random
import vtreat
import vtreat.util

numpy.random.seed(2019)

Generate example data. 

* `y` is a noisy sinusoidal function of the variable `x`
* `yc` is the output to be predicted: whether `y` is > 0.5.
* Input `xc` is a categorical variable that represents a discretization of `y`, along with some `NaN`s
* Input `x2` is a pure noise variable with no relationship to the output

In [2]:
def make_data(nrows):
    d = pandas.DataFrame({'x': 5*numpy.random.normal(size=nrows)})
    d['y'] = numpy.sin(d['x']) + 0.1*numpy.random.normal(size=nrows)
    d.loc[numpy.arange(3, 10), 'x'] = numpy.nan  # introduce missing values in x
    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]
    d['x2'] = numpy.random.normal(size=nrows)
    d.loc[d['xc']=='level_-1.0', 'xc'] = numpy.nan  # introduce a nan level
    d['yc'] = d['y']>0.5
    return d

training_data = make_data(500)

training_data.head()

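|   | x | y | xc | x2 | yc |
|---|---|---|---|---|---|
| 0 | -1.088395 | -0.956311 | NaN | -1.424184 | False |
| 1 | 4.107277 | -0.671564 | level_-0.5 | 0.42736 | False |
| 2 | 7.406389 | 0.906303 | level_1.0 | 0.668849 | True |
| 3 | NaN | 0.222792 | level_0.0 | -0.015787 | False |
| 4 | NaN | -0.975431 | NaN | -0.491017 | False |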


In [3]:
outcome_name = 'yc'    # outcome variable / column
outcome_target = True  # value we consider positive

## Demonstrate the Warning

Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or `NA`s.

First create the data treatment transform design object, in this case a treatment for a binomial classification problem.

We use the training data `training_data` to fit the transform and then return a treated training set: completely numeric, with no missing values.

In [4]:
treatment = vtreat.BinomialOutcomeTreatment(
    outcome_name=outcome_name,      # outcome variable
    outcome_target=outcome_target,  # outcome of interest
    cols_to_copy=['y'],  # columns to "carry along" but not treat as input variables
)  

In [5]:
train_prepared = treatment.fit_transform(training_data, training_data['yc'])

`train_prepared` is prepared in the correct way: because we used `.fit_transform()` instead of `.fit().transform()`, the same training data can safely be used both for inferring the impact-coded variables and for fitting a downstream model.

We prepare new test or application data as follows.

In [6]:
test_data = make_data(100)

test_prepared = treatment.transform(test_data)

The issue is: for training data we should not call `.transform()`, but instead use the value returned by `.fit_transform()`. That is, we should not do the following:

In [7]:
train_prepared_wrong = treatment.transform(training_data)

  "possibly called transform on same data used to fit (this causes over-fit, please use fit_transform() instead)")



Notice we now get a warning that we should not have done this, and that by doing so we may have introduced a nested model bias data leak.

And that is the new nested model bias warning feature.
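
If you would rather have a pipeline stop than merely print such a warning, Python's standard `warnings` module can record warnings or escalate them to errors. The sketch below uses a hypothetical stand-in function, `transform_stand_in`, rather than `vtreat` itself. The identity check it uses and the `UserWarning` category are assumptions made for illustration, not a description of how `vtreat` detects or reports the situation.

```python
import warnings


def transform_stand_in(data, fit_data):
    """Hypothetical stand-in: warns when applied to the object it was fit on."""
    if data is fit_data:
        warnings.warn("possibly called transform on same data used to fit")
    return data


fit_data = [1, 2, 3]
new_data = [4, 5, 6]

# Record warnings so they can be inspected programmatically.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    transform_stand_in(new_data, fit_data)   # new data: no warning expected
    n_warnings_new = len(caught)
    transform_stand_in(fit_data, fit_data)   # fit data again: warning expected
    n_warnings_refit = len(caught) - n_warnings_new

# Escalate warnings to exceptions so the mistake cannot go unnoticed.
raised = False
with warnings.catch_warnings():
    warnings.simplefilter("error")
    try:
        transform_stand_in(fit_data, fit_data)
    except UserWarning:
        raised = True

print(n_warnings_new, n_warnings_refit, raised)
```

The same `catch_warnings` / `simplefilter("error")` pattern can be wrapped around a real `.transform()` call, provided you first confirm which warning category the package emits.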

The `R`-version of this document can be found [here](https://github.com/WinVector/vtreat/blob/master/Examples/Classification/ClassificationWarningExample.md).