vtreat: prepare data
Suppose we wish to work with some data. Our example task is to train a classification model for credit approval using the ranger implementation of the random forests method. We will take our data from John Ross Quinlan's re-processed "credit approval" dataset hosted at Lichman, M. (2013). UCI Machine Learning Repository, http://archive.ics.uci.edu/ml; Irvine, CA: University of California, School of Information and Computer Science.
For convenience we have copied the data to our working directory here. We start by loading the data, identifying the outcome, and splitting the data into training and evaluation portions:

    # load data
    d <- read.table(
      'crx.data.txt',
      header = FALSE,
      sep = ',',
      stringsAsFactors = FALSE,
      na.strings = '?'
    )

    # prepare outcome column and level
    outcome <- 'V16'
    positive <- '+'
    d[[outcome]] <- as.factor(d[[outcome]])

    # identify variables
    vars <- setdiff(colnames(d), outcome)

    # split into train and test/evaluation
    set.seed(25325)
    isTrain <- runif(nrow(d)) <= 0.8
    dTrain <- d[isTrain, , drop = FALSE]
    dTest <- d[!isTrain, , drop = FALSE]
    rm(list = 'd')

We could try to model directly on the original variables without vtreat as below.

    library("ranger")

    f <- paste(outcome, paste(vars, collapse = ' + '), sep = ' ~ ')
    model <- ranger(as.formula(f), probability = TRUE, data = dTrain)
    # Error: Missing data in columns: V1, V2, V4, V5, V6, V7, V14.

ranger signaled that it did not want to work with this data because some variables have missing values in the training set. This is only one of the many potentially analysis-ruining issues that can be lurking in real-world data (and as always, detected or signaled errors are better news than undetected errors!).
We name just a few of the issues that could be lurking in real world data:
- Missing values.
- Categorical variable levels that occur in the evaluation set, but were not in the training set (bad luck).
- Categorical variables with large sets of levels.
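As a rough illustration, here is a minimal base-R sketch of how one might check for these issues by hand before modeling. The toy frames `train` and `test` and their columns are made up for this example; they are not the credit approval data.

```r
# Toy data (hypothetical; not the credit approval data)
train <- data.frame(
  x = c(1.5, NA, 3.2, 4.1),
  g = c("a", "b", "a", NA),
  stringsAsFactors = FALSE
)
test <- data.frame(
  x = c(2.0, 5.5),
  g = c("a", "c"),
  stringsAsFactors = FALSE
)

# Count missing values per column
naCounts <- colSums(is.na(train))

# Levels seen at evaluation time that never appeared in training
novelLevels <- setdiff(unique(test$g), unique(train$g))

# Number of distinct non-missing levels in the categorical column
nLevels <- length(unique(na.omit(train$g)))

print(naCounts)
print(novelLevels)
print(nLevels)
```

Checks like these only detect the problems; they do not repair them.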
vtreat is designed to prepare your data for analysis in a statistically sound manner. After vtreat processing all columns are numeric, there are no missing values, and the information in the original data is preserved (modulo user discretion in selecting variables and categorical variable levels). vtreat::prepare() is roughly a powered-up replacement for stats::model.matrix() (which itself is implicit in many R modeling work-flows).
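To see why a replacement can be useful, here is a small base-R sketch (toy data, made up for this example) showing that stats::model.matrix(), under R's default `na.action` setting of `na.omit`, silently drops rows that contain missing values.

```r
# Toy data with one missing numeric value (hypothetical)
d <- data.frame(
  x = c(1, NA, 3),
  g = factor(c("a", "b", "a"))
)

# With the default na.action (na.omit), the row with the NA is dropped
mm <- stats::model.matrix(~ x + g, data = d)

print(nrow(d))   # rows going in
print(dim(mm))   # rows and columns coming out
print(colnames(mm))
```

Losing rows this way may be acceptable in some analyses, but it is easy to miss, and it leaves nothing to score for rows with missing values at application time.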
Let's finish the modeling task with vtreat.
Cross frame experiment
First we run a "cross frame experiment." A cross frame experiment is a sophisticated modeling step that:
- Collects statistics on the relations between your original modeling variables and the training outcome.
- Introduces proposed new transformed modeling variables.
- Produces a "simulated out of sample" training frame that is a version of your model training data prepared in such a way as to simulate having been produced without having looked at that same data during the design of the variable transformations. This is an attempt to eliminate any nested model bias introduced by the transform design and application step. This step is performed using cross-validation inspired methods.
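The "simulated out of sample" idea can be sketched by hand in base R. The following is only an illustration of the cross-validation flavor of the technique on made-up data, not vtreat's actual procedure: each row's categorical level is re-encoded using outcome rates estimated only from rows in the other folds, so no row's encoding has seen its own outcome.

```r
set.seed(2020)

# Toy data: one categorical variable and a binary outcome (hypothetical)
n <- 200
d <- data.frame(
  g = sample(c("a", "b", "c"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$y <- as.numeric(runif(n) < ifelse(d$g == "a", 0.7, 0.3))

# Assign each row to one of k folds
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold, estimate per-level outcome rates on the OTHER folds only,
# then use those estimates to encode the held-out rows
d$g_code <- NA_real_
for (i in seq_len(k)) {
  held <- fold == i
  rates <- tapply(d$y[!held], d$g[!held], mean)
  d$g_code[held] <- rates[d$g[held]]
}

print(head(d))
```

Encoding every row from rates computed on the full training set would instead let each row's own outcome leak into its encoding, which is the nested model bias the cross frame is meant to avoid.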
In this step vtreat thinks very hard about your data (this is called the design phase). We call vtreat and pull out the promised results as follows.

    library("vtreat")

    # Run a "cross frame experiment"
    cfe <- vtreat::mkCrossFrameCExperiment(dTrain, vars, outcome, positive)

    # get the "treatment plan" or mapping from original variables
    # to derived variables.
    plan <- cfe$treatments

    # get the performance statistics on the derived variables.
    sf <- plan$scoreFrame

    # get the simulated out of sample transformed training data.frame
    treatedTrain <- cfe$crossFrame

Training a model
scoreFrame collects estimated effect sizes and significances on the new modeling variables. The analyst can use this to evaluate and choose modeling variables. In this example we will use all the derived variables that have a training significance below 1/NumberOfVariables (which is a fun way to try and avoid multiple comparison bias in picking modeling variables).

    newVars <- sf$varName[sf$sig < 1/nrow(sf)]

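The reasoning behind the threshold: if every candidate variable were pure noise with uniformly distributed significances, then with n candidates we would expect about n * (1/n) = 1 of them to pass by chance. A small sketch on a made-up score frame (`sfToy` and its values are hypothetical) shows the filter in action.

```r
# Hypothetical score frame with four candidate variables
sfToy <- data.frame(
  varName = c("v1_clean", "v2_clean", "v3_catB", "v4_isBAD"),
  sig = c(0.001, 0.30, 0.20, 0.26),
  stringsAsFactors = FALSE
)

# With 4 candidates the threshold is 1/4 = 0.25
threshold <- 1 / nrow(sfToy)
keep <- sfToy$varName[sfToy$sig < threshold]

print(threshold)
print(keep)
```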
We now build our model using our transformed training data (the cross frame) and our chosen variables.

    f <- paste(outcome, paste(newVars, collapse = ' + '), sep = ' ~ ')
    model <- ranger(as.formula(f), probability = TRUE, data = treatedTrain)

Evaluating a model
We now prepare a transformed version of the evaluation/test frame using the treatment plan. This is done directly using vtreat::prepare. We could have converted our training data the same way, but as we said it is more rigorous to use the supplied cross frame (though the nested model bias we are trying to avoid is strongest on high-cardinality categorical variables, which are not present in this example). In real-world applications you would keep the plan data structure around to treat or prepare any future application data for model application.

    treatedTest <- vtreat::prepare(plan, dTest,
                                   pruneSig = NULL,
                                   varRestriction = newVars)
    pred <- predict(model, data = treatedTest, type = 'response')
    treatedTest$pred <- pred$predictions[, positive]

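One simple way to keep the plan around is to serialize it with base R's saveRDS() and restore it with readRDS() at application time. The sketch below uses a stand-in list (`planToy`, hypothetical) in place of a real treatment plan, and assumes the real plan object serializes like any ordinary R object.

```r
# Stand-in for a treatment plan (hypothetical; a real plan comes from vtreat)
planToy <- list(
  note = "stand-in for a treatment plan",
  vars = c("V2", "V3")
)

# Save at training time
planFile <- tempfile(fileext = ".rds")
saveRDS(planToy, planFile)

# Restore later, at application time
planRestored <- readRDS(planFile)
print(identical(planToy, planRestored))
```

In practice you would likely store the fitted model and the list of chosen variables alongside the plan, so that all three stay in sync.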
And we are now ready to examine the model's out of sample performance. In this case we are going to use ROC/AUC as our evaluation.

    library("WVPlots")

    WVPlots::ROCPlot(treatedTest, 'pred', outcome, positive, 'test performance')

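For readers who want the AUC as a number without a plotting package, here is a minimal base-R sketch using the rank-sum (Mann-Whitney) identity: the AUC is the probability that a randomly chosen positive example scores above a randomly chosen negative one. The helper `aucByRank` and the toy scores are made up for this example and are not part of WVPlots.

```r
# Toy scores and labels (hypothetical)
pred  <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
truth <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)

# AUC via the rank-sum identity (hypothetical helper, not a WVPlots function)
aucByRank <- function(pred, truth) {
  r <- rank(pred)            # average ranks handle tied scores
  nPos <- sum(truth)
  nNeg <- sum(!truth)
  (sum(r[truth]) - nPos * (nPos + 1) / 2) / (nPos * nNeg)
}

auc <- aucByRank(pred, truth)
print(auc)
```

Here 8 of the 9 positive/negative pairs are ordered correctly (only the positive scored 0.35 falls below the negative scored 0.6), so the AUC is 8/9.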
And that's it, we have fit and evaluated a model!
Let's take a moment and look at the vtreat supplied variable statistics. These are roughly the predictive power of each derived column treated as a single variable model. Obviously this isn't always going to correlate with the performance of the variable as part of a joint model; but frankly in real world problems this measure is in fact a useful heuristic.
vtreat reports both an effect size (in this case rsq, which for categorization is a pseudo-R-squared, or portion of deviance explained) and a significance estimate. vtreat also reports the new variable name (varName), the original column the variable was derived from (origName), and the transformation performed (code).
Below we add an indicator ("take") showing if we used the variable in our model and exhibit a few rows of the score frame.

    sf$take <- sf$varName %in% newVars
    head(sf[, c('varName', 'rsq', 'sig', 'origName', 'code', 'take')])
    #      varName          rsq          sig origName  code  take
    # 1 V1_lev_x.a 7.475340e-04 0.4485378512       V1   lev FALSE
    # 2 V1_lev_x.b 1.694203e-04 0.7182571422       V1   lev FALSE
    # 3    V1_catP 1.878477e-05 0.9043753424       V1  catP FALSE
    # 4    V1_catB 4.921671e-04 0.5385998340       V1  catB FALSE
    # 5   V2_clean 2.191721e-02 0.0000406801       V2 clean  TRUE
    # 6   V2_isBAD 9.136120e-03 0.0080629087       V2 isBAD  TRUE

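To make "portion of deviance explained" concrete, here is a base-R sketch that computes a pseudo-R-squared and a significance for a single-variable logistic model on made-up data. It illustrates the general idea only; vtreat's exact computation may differ.

```r
set.seed(42)

# Toy data: a numeric variable with a real effect on a binary outcome
n <- 500
x <- rnorm(n)
y <- runif(n) < plogis(x)

# Single-variable logistic regression
m <- glm(y ~ x, family = binomial())

# Pseudo R-squared: fraction of the null deviance explained by the variable
pseudoRsq <- 1 - m$deviance / m$null.deviance

# Significance: chi-squared test on the deviance improvement
sig <- pchisq(
  m$null.deviance - m$deviance,
  df = m$df.null - m$df.residual,
  lower.tail = FALSE
)

print(c(rsq = pseudoRsq, sig = sig))
```

A variable with no relation to the outcome would tend to give a pseudo-R-squared near zero and a significance spread roughly uniformly between 0 and 1.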
One thing I would like to call out is that categorical variables (such as "V1") are eligible for a great number of transformations including:
- Retaining non-negligible levels as dummy or indicator variables (code: "lev").
- Re-encoding the entire column as an effect code or impact code (code: "catB"). vtreat can also take a user supplied per-level significance filter to control the formation of this encoding.
- Other statistics, such as per-level prevalence (allows pooling of rare or common events) (code: "catP").
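Two of these encodings are easy to sketch by hand in base R. The example below builds an indicator column and a prevalence column for a made-up categorical vector; the column names are chosen for illustration and this is not vtreat's implementation.

```r
# Toy categorical column (hypothetical)
g <- c("a", "b", "a", "c", "a", "b")

# Indicator (dummy) variable for level "a"
g_lev_a <- as.numeric(g == "a")

# Prevalence: the fraction of training rows that share each row's level
prev <- table(g) / length(g)
g_prev <- as.numeric(prev[g])

print(data.frame(g, g_lev_a, g_prev))
```

The prevalence column gives a model a single numeric handle on "rare versus common" levels, which is what allows rare levels to be pooled.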
And that is it: vtreat data preparation.
vtreat data preparation is sound, very powerful, and can greatly improve the quality of your predictive models. The package is available from CRAN here and includes a large number of worked vignettes. We have a formal write-up of the technique here, many articles and tutorials on the methodology here, and a good central resource is here.
I strongly advise adding vtreat to your data science or predictive analytic work-flows.
And a thanks to Dmitry Larko of h2o for his generous advocacy.