
Nina Zumel, John Mount October 2019

These are notes on controlling the cross-validation plan in the Python version of vtreat; for notes on the R version of vtreat, please see here.

Using Custom Cross-Validation Plans with vtreat

By default, Python vtreat uses a randomized k-way cross validation when creating and evaluating complex synthetic variables. This will work well for the majority of applications. However, there may be times when you need a more specialized cross validation scheme for your modeling projects. In this document, we'll show how to replace the cross validation scheme in vtreat.

import pandas
import numpy
import numpy.random

import vtreat
import vtreat.cross_plan

Example: Highly Unbalanced Class Outcomes

As an example, suppose you have data where the target class of interest is relatively rare; in this case about 5%:

n_row = 1000

numpy.random.seed(2019)

d = pandas.DataFrame({
    'x': numpy.random.normal(size=n_row),
    'y': numpy.random.binomial(size=n_row, p=0.05, n=1)
})

d.describe()
                 x           y
count  1000.000000  1000.00000
mean      0.005766     0.05800
std       1.024104     0.23386
min      -3.205040     0.00000
25%      -0.689752     0.00000
50%       0.012250     0.00000
75%       0.702009     0.00000
max       2.928164     1.00000

First, try preparing this data using vtreat.

By default, Python vtreat uses a y-stratified randomized k-way cross validation when creating and evaluating complex synthetic variables.

To see why stratification matters for a rare outcome, we start by explicitly passing in a simple (unstratified) k-way cross validation plan.

#
# create the treatment plan
#

k = 5 # number of cross-val folds (actually, the default)
treatment_unstratified = vtreat.BinomialOutcomeTreatment(
    var_list=['x'],
    outcome_name='y',
    outcome_target=1,
    params=vtreat.vtreat_parameters({
        'cross_validation_plan': vtreat.cross_plan.KWayCrossPlan(),
        'cross_validation_k': k
    })
)

# prepare the training data
prepared_unstratified = treatment_unstratified.fit_transform(d, d['y'])

Let's look at the distribution of the target outcome in each of the cross-validation groups:

# convenience function to mark the cross-validation group of each row
def label_rows(d, cross_plan, *, label_column = 'group'):
    d[label_column] = 0
    for i in range(len(cross_plan)):
        app = cross_plan[i]['app']
        d.loc[app, label_column] = i
            
# label the rows            
label_rows(prepared_unstratified, treatment_unstratified.cross_plan_)
# print(prepared_unstratified.head())

# get some summary statistics on the data
unstratified_summary = prepared_unstratified.groupby(['group']).agg({'y': ['sum', 'mean', 'count']})
unstratified_summary.columns = unstratified_summary.columns.get_level_values(1)
unstratified_summary
       sum  mean  count
group
0       14  0.07    200
1       18  0.09    200
2        8  0.04    200
3       12  0.06    200
4        6  0.03    200

# standard deviation of target prevalence per cross-val fold
std_unstratified = numpy.std(unstratified_summary['mean'])
std_unstratified 
0.02135415650406262

The target prevalence in the cross validation groups can vary fairly widely with respect to the "true" prevalence of 0.05; this may adversely affect the resulting synthetic variables in the treated data. For situations like this where the target outcome is rare, you may want to stratify the cross-validation sampling to preserve the target prevalence as much as possible.

Passing in a Stratified Sampler

In this situation, vtreat has an alternative cross-validation sampler called KWayCrossPlanYStratified that can be passed in as follows:

# create the treatment plan
treatment_stratified = vtreat.BinomialOutcomeTreatment(
    var_list=['x'],
    outcome_name='y',
    outcome_target=1,
    params=vtreat.vtreat_parameters({
        'cross_validation_plan': vtreat.cross_plan.KWayCrossPlanYStratified(),
        'cross_validation_k': k
    })
)

# prepare the training data
prepared_stratified = treatment_stratified.fit_transform(d, d['y'])

# examine the target prevalence
label_rows(prepared_stratified, treatment_stratified.cross_plan_)

stratified_summary = prepared_stratified.groupby(['group']).agg({'y': ['sum', 'mean', 'count']})
stratified_summary.columns = stratified_summary.columns.get_level_values(1)
stratified_summary
       sum   mean  count
group
0       13  0.065    200
1       11  0.055    200
2       12  0.060    200
3       12  0.060    200
4       10  0.050    200

# standard deviation of target prevalence
std_stratified = numpy.std(stratified_summary['mean'])
std_stratified
0.005099019513592784

The target prevalence in the stratified cross-validation groups is much closer to the true target prevalence, and the variation (standard deviation) of the target prevalence across groups has been substantially reduced.

std_unstratified/std_stratified
4.187894642712677

Other cross-validation schemes

If you want to cross-validate under another scheme--for example, stratifying on the prevalence of an input class--you can write your own custom cross-validation scheme and pass it into vtreat in the same fashion as above. Your cross-validation scheme must extend vtreat's CrossValidationPlan class, as sketched below.
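
For illustration only, here is a minimal sketch of what such a custom plan might look like: a k-way plan that stratifies the folds on the values of an input column. The class name InputStratifiedKWayCrossPlan is hypothetical, and the method name and signature (split_plan with n_rows, k_folds, data, and y keyword arguments, returning a list of fold dictionaries with 'train' and 'app' index lists, matching the 'app' usage in label_rows above) are assumptions; check the vtreat.cross_plan source of your installed version before relying on them.

# hypothetical custom cross-validation plan: k-way folds stratified on an input column
# (a sketch, not part of vtreat; method name/signature assumed to match vtreat.cross_plan)
import numpy
import numpy.random
import vtreat.cross_plan

class InputStratifiedKWayCrossPlan(vtreat.cross_plan.CrossValidationPlan):
    """k-way cross plan that stratifies folds on a chosen input column (illustrative sketch)"""

    def __init__(self, stratify_column):
        vtreat.cross_plan.CrossValidationPlan.__init__(self)
        self.stratify_column = stratify_column

    def split_plan(self, *, n_rows=None, k_folds=None, data=None, y=None):
        # shuffle, then stable-sort row indices by the stratification column,
        # then deal rows round-robin into k folds so each fold gets a similar mix
        order = numpy.random.permutation(n_rows)
        strat_values = numpy.asarray(data[self.stratify_column])[order]
        order = order[numpy.argsort(strat_values, kind='stable')]
        folds = [order[i::k_folds] for i in range(k_folds)]
        return [
            {'train': [int(i) for i in numpy.concatenate(folds[:j] + folds[j + 1:])],
             'app': [int(i) for i in folds[j]]}
            for j in range(k_folds)
        ]

# hypothetical usage: stratify the folds on the input column 'x'
# params=vtreat.vtreat_parameters({
#     'cross_validation_plan': InputStratifiedKWayCrossPlan('x'),
#     'cross_validation_k': 5})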

Another benefit of explicit cross-validation plans is that one can use the same cross-validation plan for both the variable design and later modeling steps. This can limit data leaks across the cross-validation folds.
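
As a sketch of that reuse (assuming scikit-learn is installed, and that each stored fold dictionary carries a 'train' index list alongside the 'app' list used above; neither is guaranteed by this document), the folds vtreat recorded on the treatment plan can be handed to scikit-learn as a pre-computed cross-validation split, so a downstream model is scored on the same partition used for variable design:

# sketch: reuse vtreat's stored folds (cross_plan_, indexed as in label_rows above)
# as a pre-computed cv split for scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

folds = [(fold['train'], fold['app']) for fold in treatment_stratified.cross_plan_]
model_vars = [c for c in prepared_stratified.columns if c not in ('y', 'group')]

scores = cross_val_score(
    LogisticRegression(),
    prepared_stratified[model_vars],
    prepared_stratified['y'],
    cv=folds,  # the same folds vtreat used during variable design
    scoring='neg_log_loss')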

Other predefined cross-validation schemes

In addition to the y-stratified cross validation, vtreat also defines a time-oriented cross validation scheme (OrderedCrossPlan). The ordered cross plan treats time as the grouping variable. For each fold, all the rows in the application set (the rows that the model will be applied to) come from the same time period. All the rows in the training set come from one side of the application set; that is, all the training data will be either earlier or later than the data in the application set. Ordered cross plans are useful when modeling time-oriented data.
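
To make that fold structure concrete, here is a small conceptual sketch (not vtreat's actual OrderedCrossPlan implementation, which may differ; for instance, it may also allow training data later than the application block). It assumes the rows are already sorted by time and uses the same 'train'/'app' fold-dictionary convention as above:

# conceptual sketch of time-ordered folds (not vtreat's implementation)
import numpy

def ordered_splits_sketch(n_rows, k_folds):
    # cut the time-sorted rows into k contiguous blocks; each block in turn
    # becomes the application set, and the training set is everything earlier
    cuts = numpy.linspace(0, n_rows, k_folds + 1).astype(int)
    plan = []
    for j in range(1, k_folds):  # skip the first block: it has no earlier data
        plan.append({
            'train': list(range(0, cuts[j])),
            'app': list(range(cuts[j], cuts[j + 1])),
        })
    return plan

ordered_splits_sketch(10, 5)
# [{'train': [0, 1], 'app': [2, 3]},
#  {'train': [0, 1, 2, 3], 'app': [4, 5]}, ...]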

Note: it is important not to use leave-one-out cross-validation when using nested or stacked modeling concepts (such as those used in vtreat); we have some notes on this here.
