
CustomizedCrossPlan

Nina Zumel, John Mount October 2019

These are notes on controlling the cross-validation plan in the R version of vtreat; for notes on the Python version of vtreat, please see here.

Using Custom Cross-Validation Plans with vtreat


By default, R vtreat uses a y-stratified randomized k-way cross-validation when creating and evaluating complex synthetic variables.

Here we start with a simple k-way cross-validation plan, which works well for the majority of applications. However, there may be times when you need a more specialized cross-validation scheme for your modeling projects. In this document, we show how to replace the cross-validation scheme in vtreat.

library(wrapr)
library(rqdatatable)
## Loading required package: rquery
library(vtreat)
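
Before we begin, it helps to know what a vtreat cross-validation plan looks like: a plan is a list of splits, each holding disjoint train and app (application) row-index vectors. As a quick sketch, you can call the default plan builder directly to inspect one (the exact indices will vary with the random seed):

# build a small 3-fold plan over 10 rows and inspect its structure
small_plan <- kWayCrossValidation(nRows = 10, nSplits = 3, NULL, NULL)
str(small_plan, max.level = 2)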

Example: Highly Unbalanced Class Outcomes

As an example, suppose you have data where the target class of interest is relatively rare; in this case about 5%:

n_row <- 1000

set.seed(2019)

d <- data.frame(
    x = rnorm(n = n_row),
    y = rbinom(n = n_row, size = 1, prob = 0.05)
)

summary(d)
##        x                  y        
##  Min.   :-3.23608   Min.   :0.000  
##  1st Qu.:-0.72730   1st Qu.:0.000  
##  Median :-0.13212   Median :0.000  
##  Mean   :-0.07818   Mean   :0.047  
##  3rd Qu.: 0.59856   3rd Qu.:0.000  
##  Max.   : 3.54146   Max.   :1.000

First, try preparing this data using vtreat.

#
# create the treatment plan
#

k <- 5 # number of cross-val folds
treatment_unstratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayCrossValidation, # simple (unstratified) k-way splitting
  verbose = FALSE)

# prepare the training data
prepared_unstratified <- treatment_unstratified$crossFrame

Let’s look at the distribution of the target outcome in each of the cross-validation groups:

# convenience function to mark the cross-validation group of each row
label_rows <- function(d, cross_plan, label_column = 'group') {
    d[label_column] <- 0
    for(i in seq_along(cross_plan)) {
        app <- cross_plan[[i]][['app']]
        d[app, label_column] <- i
    }
    return(d)
}
            
# label the rows            
prepared_unstratified <- label_rows(prepared_unstratified, treatment_unstratified$evalSets)
# print(head(prepared_unstratified))

# get some summary statistics on the data
summarize_by_group <- local_td(prepared_unstratified) %.>%
    project(.,
      sum %:=% sum(y),
      mean %:=% mean(y),
      size %:=% n(),
    groupby='group')

unstratified_summary <- prepared_unstratified %.>% summarize_by_group
unstratified_summary <- as.data.frame(unstratified_summary)

knitr::kable(unstratified_summary)
| group| sum|  mean| size|
|-----:|---:|-----:|----:|
|     2|   9| 0.045|  200|
|     3|  13| 0.065|  200|
|     4|   7| 0.035|  200|
|     1|  12| 0.060|  200|
|     5|   6| 0.030|  200|
# standard deviation of target prevalence per cross-val fold
std_unstratified <- sd(unstratified_summary[['mean']])
std_unstratified
## [1] 0.01524795

The target prevalence in the cross-validation groups can vary fairly widely with respect to the “true” prevalence of 0.05; this may adversely affect the resulting synthetic variables in the treated data. For situations like this, where the target outcome is rare, you may want to stratify the cross-validation sampling to preserve the target prevalence as much as possible.

Passing in a Stratified Sampler

In this situation, vtreat has an alternative cross-validation sampler called kWayStratifiedY that can be passed in as follows:

treatment_stratified <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = kWayStratifiedY, # y-stratified k-way splitting
  verbose = FALSE)

# prepare the training data
prepared_stratified <- treatment_stratified$crossFrame

# examine the target prevalence
prepared_stratified <- label_rows(prepared_stratified, treatment_stratified$evalSets)

stratified_summary <- prepared_stratified %.>% summarize_by_group
stratified_summary <- as.data.frame(stratified_summary)

knitr::kable(stratified_summary)
| group| sum|  mean| size|
|-----:|---:|-----:|----:|
|     5|   9| 0.045|  200|
|     1|  10| 0.050|  200|
|     3|  10| 0.050|  200|
|     4|   9| 0.045|  200|
|     2|   9| 0.045|  200|
# standard deviation of target prevalence
std_stratified <- sd(stratified_summary[['mean']])
std_stratified
## [1] 0.002738613

The target prevalence in the stratified cross-validation groups is much closer to the true target prevalence, and the variation (standard deviation) of the target prevalence across groups has been substantially reduced.

std_unstratified/std_stratified
## [1] 5.567764

Other cross-validation schemes

If you want to cross-validate under another scheme, for example stratifying on the prevalence of an input class, you can write your own custom cross-validation scheme and pass it into vtreat in the same fashion as above. Your cross-validation scheme must have the same signature as vtreat’s kWayCrossValidation.
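
For concreteness, here is a minimal sketch of such a custom scheme (the name myCustomSplitPlan is ours; vtreat expects a function of nRows, nSplits, dframe, and y that returns a list of splits, each with disjoint train and app row-index vectors):

# a hypothetical custom split function: plain randomized k-way splitting,
# written out by hand to show the required shape of the return value
myCustomSplitPlan <- function(nRows, nSplits, dframe, y) {
  # randomly assign each row to one of nSplits folds
  fold <- sample(rep(seq_len(nSplits), length.out = nRows))
  lapply(seq_len(nSplits), function(i) {
    list(train = which(fold != i),  # rows used to fit the fold's sub-model
         app = which(fold == i))    # rows the sub-model is applied to
  })
}

# passed in just like the predefined plans:
treatment_custom <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = myCustomSplitPlan,
  verbose = FALSE)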

Another benefit of explicit cross-validation plans is that one can use the same cross-validation plan for both the variable design and later modeling steps. This can limit data leaks across the cross-validation folds.
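
For example, here is a sketch of reusing the fold definitions stored in treatment_stratified$evalSets to compute out-of-fold predictions for a downstream model; the logistic regression is purely illustrative, standing in for whatever model you fit later:

# reuse the same splits vtreat used when building the cross frame
plan <- treatment_stratified$evalSets
oof_pred <- numeric(nrow(d))
for (split in plan) {
  # fit on the fold's training rows, score the held-out application rows
  model_i <- glm(y ~ x, family = binomial,
                 data = d[split$train, , drop = FALSE])
  oof_pred[split$app] <- predict(model_i,
                                 newdata = d[split$app, , drop = FALSE],
                                 type = 'response')
}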

Other predefined cross-validation schemes

More notes on controlling vtreat cross-validation can be found here.
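
For instance, vtreat also supplies oneWayHoldout (leave-one-out; see the caution below) and makekWayCrossValidationGroupedByColumn(), which builds a split function that keeps all rows sharing a grouping column in the same fold. A sketch, using a hypothetical grouping column gp that we add to d purely for illustration:

# hypothetical grouping column, added only to demonstrate grouped splitting
d$gp <- sample(c('a', 'b', 'c', 'd', 'e'), size = n_row, replace = TRUE)

splitFn <- makekWayCrossValidationGroupedByColumn('gp')
treatment_grouped <- mkCrossFrameCExperiment(
  d,
  varlist = 'x',
  outcomename = 'y',
  outcometarget = 1,
  ncross = k,
  splitFunction = splitFn,
  verbose = FALSE)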

Note: it is important not to use leave-one-out cross-validation when using nested or stacked modeling concepts (such as those used in vtreat); we have some notes on this here.
