CustomizedCrossPlan

Nina Zumel, John Mount October 2019

These are notes on controlling the cross-validation plan in the R version of vtreat. For notes on the Python version of vtreat, please see here.

Using Custom Cross-Validation Plans with vtreat

By default, R vtreat uses a y-stratified randomized k-way cross validation when creating and evaluating complex synthetic variables.

Here we start with a simple k-way cross validation plan. This will work well for the majority of applications. However, there may be times when you need a more specialized cross validation scheme for your modeling projects. In this document, we’ll show how to replace the cross validation scheme in vtreat.
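To make the idea of a cross-validation plan concrete, here is a minimal base R sketch of a plain randomized k-way plan. The helper name `simple_k_way_plan` is hypothetical and is not part of vtreat; the sketch assumes only that a plan is a list with one element per fold, each holding `train` and `app` row indices, which is the structure the helper code later in this document reads.

```r
# Hypothetical helper (not a vtreat function): build a plain randomized
# k-way cross-validation plan by hand, using base R only.
simple_k_way_plan <- function(n_rows, n_splits) {
  # assign each row to one of n_splits folds, in random order
  fold_id <- sample(rep(seq_len(n_splits), length.out = n_rows))
  lapply(seq_len(n_splits), function(i) {
    list(
      train = which(fold_id != i),  # rows used to fit
      app   = which(fold_id == i)   # held-out rows the fit is applied to
    )
  })
}

set.seed(2019)
plan <- simple_k_way_plan(n_rows = 10, n_splits = 5)

# inspect the first fold: two integer vectors of row indices
str(plan[[1]])
```

Every row appears in exactly one `app` set, and a fold's `train` and `app` sets never overlap.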

library(wrapr)
library(rqdatatable)
library(vtreat)

Example: Highly Unbalanced Class Outcomes

As an example, suppose you have data where the target class of interest is relatively rare; in this case about 5%:

n_row <- 1000

set.seed(2019)

d <- data.frame(
  x = rnorm(n = n_row),
  y = rbinom(n = n_row, size = 1, prob = 0.05)
)

summary(d)
##        x                  y
##  Min.   :-3.23608   Min.   :0.000
##  1st Qu.:-0.72730   1st Qu.:0.000
##  Median :-0.13212   Median :0.000
##  Mean   :-0.07818   Mean   :0.047
##  3rd Qu.: 0.59856   3rd Qu.:0.000
##  Max.   : 3.54146   Max.   :1.000

First, try preparing this data using vtreat.

#
# create the treatment plan
#

k <- 5 # number of cross-val folds
treatment_unstratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayCrossValidation,
  verbose = FALSE)

# prepare the training data
prepared_unstratified <- treatment_unstratified$crossFrame

Let’s look at the distribution of the target outcome in each of the cross-validation groups:

# convenience function to mark the cross-validation group of each row
label_rows <- function(d, cross_plan, label_column = 'group') {
  d[label_column] <- 0
  for(i in seq_along(cross_plan)) {
    # 'app' holds the indices of the rows held out in fold i
    app <- cross_plan[[i]][['app']]
    d[app, label_column] <- i
  }
  return(d)
}

# label the rows
prepared_unstratified <- label_rows(prepared_unstratified, treatment_unstratified$evalSets)

# get some summary statistics on the data
summarize_by_group <- local_td(prepared_unstratified) %.>%
  project(.,
          sum %:=% sum(y),
          mean %:=% mean(y),
          size %:=% n(),
          groupby = 'group')

unstratified_summary <- prepared_unstratified %.>% summarize_by_group
unstratified_summary <- as.data.frame(unstratified_summary)

knitr::kable(unstratified_summary)

| group | sum |  mean | size |
|------:|----:|------:|-----:|
|     2 |   9 | 0.045 |  200 |
|     3 |  13 | 0.065 |  200 |
|     4 |   7 | 0.035 |  200 |
|     1 |  12 | 0.060 |  200 |
|     5 |   6 | 0.030 |  200 |

# standard deviation of target prevalence per cross-val fold
std_unstratified = sd(unstratified_summary[['mean']])
std_unstratified
## [1] 0.01524795

The target prevalence in the cross validation groups can vary fairly widely with respect to the “true” prevalence of 0.05; this may adversely affect the resulting synthetic variables in the treated data. For situations like this where the target outcome is rare, you may want to stratify the cross-validation sampling to preserve the target prevalence as much as possible.
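As a rough sanity check on the spread above, the following sketch treats each fold as an independent binomial sample. This is an approximation introduced here for illustration: it ignores the fact that the folds partition one fixed data set. It uses the overall prevalence reported by `summary(d)` and the fold size of 200.

```r
# Rough estimate of how much fold prevalence varies under unstratified sampling.
p <- 0.047        # observed overall prevalence, from summary(d)
fold_size <- 200  # rows per fold: 1000 rows / 5 folds

# binomial standard error of a proportion estimated from fold_size rows
expected_sd <- sqrt(p * (1 - p) / fold_size)
expected_sd  # roughly 0.015
```

The estimate is close to the observed standard deviation, which suggests the variation seen in the unstratified folds is about what random assignment alone produces at this prevalence and fold size.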

Passing in a Stratified Sampler

In this situation, vtreat has an alternative cross-validation sampler called kWayStratifiedY that can be passed in as follows:

treatment_stratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayStratifiedY,
  verbose = FALSE)

# prepare the training data
prepared_stratified <- treatment_stratified$crossFrame

# examine the target prevalence
prepared_stratified <- label_rows(prepared_stratified, treatment_stratified$evalSets)

stratified_summary <- prepared_stratified %.>% summarize_by_group
stratified_summary <- as.data.frame(stratified_summary)

knitr::kable(stratified_summary)

| group | sum |  mean | size |
|------:|----:|------:|-----:|
|     5 |   9 | 0.045 |  200 |
|     1 |  10 | 0.050 |  200 |
|     3 |  10 | 0.050 |  200 |
|     4 |   9 | 0.045 |  200 |
|     2 |   9 | 0.045 |  200 |

# standard deviation of target prevalence
std_stratified = sd(stratified_summary[['mean']])
std_stratified
## [1] 0.002738613

The target prevalence in each of the stratified cross-validation groups is much closer to the true target prevalence, and the variation (standard deviation) of the target prevalence across groups is reduced by more than a factor of five:

std_unstratified/std_stratified
## [1] 5.567764

Other cross-validation schemes

If you want to cross-validate under another scheme (for example, stratifying on the prevalence of an input class), you can write your own custom cross-validation function and pass it into vtreat in the same way as above. Your function must have the same signature as vtreat’s kWayCrossValidation: it takes the arguments nRows, nSplits, dframe, and y, and returns the cross-validation plan.
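A custom scheme of this kind might be sketched as follows, using base R only. The factory name `make_stratified_by_column`, the demo data, and the column name `g` are all hypothetical. The sketch assumes that a function with the argument list `(nRows, nSplits, dframe, y)`, returning a list of `train`/`app` index lists, is what vtreat expects from a split function.

```r
# Hypothetical factory (not part of vtreat): returns a split function that
# stratifies folds on the levels of one input column.
make_stratified_by_column <- function(column_name) {
  force(column_name)
  function(nRows, nSplits, dframe, y) {
    strata <- as.character(dframe[[column_name]])
    strata[is.na(strata)] <- "_NA_"  # treat missing values as their own level
    fold_id <- integer(nRows)
    offset <- 0
    # deal fold labels out within each stratum, continuing the deal across
    # strata so that overall fold sizes stay balanced
    for (idx in split(seq_len(nRows), strata)) {
      labels <- (offset + seq_along(idx) - 1) %% nSplits + 1
      fold_id[idx] <- labels[sample.int(length(idx))]
      offset <- offset + length(idx)
    }
    lapply(seq_len(nSplits), function(i) {
      list(train = which(fold_id != i), app = which(fold_id == i))
    })
  }
}

# demo data with a rare input level "b" (10% of rows)
set.seed(2019)
demo <- data.frame(
  x = rnorm(100),
  g = rep(c("a", "b"), times = c(90, 10)),
  y = rbinom(100, size = 1, prob = 0.5)
)

split_by_g <- make_stratified_by_column("g")
plan <- split_by_g(nrow(demo), 5, demo, demo$y)

# number of rows with the rare level "b" held out in each fold
sapply(plan, function(fold) sum(demo$g[fold$app] == "b"))
```

Under the assumption above, `split_by_g` would be supplied as `splitFunction = split_by_g` in the `mkCrossFrameCExperiment` call, in place of `kWayStratifiedY`. A closure is used because the split function itself receives only the data frame, not the name of the column to stratify on.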

Another benefit of explicit cross-validation plans is that one can use the same cross-validation plan for both the variable design and later modeling steps. This can limit data leaks across the cross-validation folds.
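One way to share a plan across steps is to build it once and hand the same object to every step. The sketch below uses base R only. The wrapper name `use_fixed_plan` and the demo data are hypothetical, and the sketch assumes vtreat will accept a split function that ignores its arguments and returns a precomputed plan covering the same rows.

```r
# Hypothetical wrapper (not part of vtreat): a split function that ignores
# its arguments and always returns one precomputed plan.
use_fixed_plan <- function(plan) {
  force(plan)
  function(nRows, nSplits, dframe, y) {
    plan
  }
}

set.seed(2019)
n <- 200
d_demo <- data.frame(x = rnorm(n))
d_demo$y <- rbinom(n, size = 1, prob = plogis(d_demo$x))

# build one 5-fold plan up front (plain random assignment, for illustration)
fold_id <- sample(rep(1:5, length.out = n))
shared_plan <- lapply(1:5, function(i) {
  list(train = which(fold_id != i), app = which(fold_id == i))
})

# the variable design step would receive:
#   splitFunction = use_fixed_plan(shared_plan)

# later modeling step: out-of-fold predictions over the same folds
oof_pred <- numeric(n)
for (fold in shared_plan) {
  model <- glm(y ~ x, data = d_demo[fold$train, , drop = FALSE],
               family = binomial())
  oof_pred[fold$app] <- predict(model,
                                newdata = d_demo[fold$app, , drop = FALSE],
                                type = "response")
}
```

In a real project the model would be fit on the treated variables rather than on raw `x`; the point here is only that both steps hold out the same rows together.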

Other predefined cross-validation schemes

More notes on controlling vtreat cross-validation can be found here.

Note: it is important not to use leave-one-out cross-validation with nested or stacked modeling methods (such as those used in vtreat). We have some notes on this here.
