Nina Zumel and John Mount updated February 2020
Note: this is a description of the `R` version of `vtreat`; the same example for the `Python` version of `vtreat` can be found here.
Load modules/packages.
library(vtreat)
## Loading required package: wrapr
packageVersion('vtreat')
## [1] '1.6.0'
suppressPackageStartupMessages(library(ggplot2))
library(WVPlots)
library(rqdatatable)
## Loading required package: rquery
##
## Attaching package: 'rquery'
## The following object is masked from 'package:ggplot2':
##
## arrow
Generate example data.
- `y` is a noisy sinusoidal plus linear function of the variable `x`
- Input `xc` is a categorical variable that represents a discretization of `y`, along with some `NA`s
- Input `x2` is a pure noise variable with no relationship to the output
- Input `x3` is a constant variable
set.seed(2020)
make_data <- function(nrows) {
  d <- data.frame(x = 5*rnorm(nrows))
  d['y'] = sin(d[['x']]) + 0.01*d[['x']] + 0.1*rnorm(n = nrows)
  d[4:10, 'x'] = NA  # introduce NAs
  d['xc'] = paste0('level_', 5*round(d$y/5, 1))
  d['x2'] = rnorm(n = nrows)
  d['x3'] = 1
  d[d['xc']=='level_-1', 'xc'] = NA  # introduce a NA level
  return(d)
}
d = make_data(500)
d %.>%
head(.) %.>%
knitr::kable(.)
x | y | xc | x2 | x3 |
---|---|---|---|---|
1.884861 | 1.0906132 | level_1 | 0.0046504 | 1 |
1.507742 | 1.0108804 | level_1 | -1.2287497 | 1 |
-5.490116 | 0.7766693 | level_1 | -0.1405980 | 1 |
NA | 0.5442452 | level_0.5 | -0.2073270 | 1 |
NA | -0.9738103 | NA | -0.9215306 | 1 |
NA | -0.4968719 | level_-0.5 | 0.3604742 | 1 |
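To see where the `xc` labels in the table above come from, the short sketch below reruns the discretization expression from `make_data()` on three of the displayed `y` values. It uses only base R; the vector names `y_sample` and `labels` are ours and not part of the original example.

```r
# Three y values copied from the table above (rows 1, 4, and 5).
y_sample <- c(1.0906132, 0.5442452, -0.9738103)

# Same expression as in make_data(): round y/5 to one decimal place,
# then scale back up, which snaps y to the nearest multiple of 0.5.
labels <- paste0('level_', 5 * round(y_sample / 5, 1))
print(labels)
## [1] "level_1"   "level_0.5" "level_-1"
```

`make_data()` then replaces `level_-1` with `NA`, which is why the fifth row of the table shows `NA` for `xc`.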
Check how many levels `xc` has, and their distribution (including `NA`).
unique(d['xc'])
## xc
## 1 level_1
## 4 level_0.5
## 5 <NA>
## 6 level_-0.5
## 13 level_0
## 91 level_-1.5
table(d$xc, useNA = 'always')
##
## level_-0.5 level_-1.5    level_0  level_0.5    level_1       <NA>
##         91          2         92         91        106        118
The `vtreat` package is primarily intended for data treatment prior to supervised learning, as detailed in the Classification and Regression examples. In these situations, `vtreat` specifically uses the relationship between the inputs and the outcomes in the training data to create certain types of synthetic variables. We call these more complex synthetic variables *y-aware variables*.
However, you may also want to use `vtreat` for basic data treatment for unsupervised problems, when there is no outcome variable. Or, you may not want to create any y-aware variables when preparing the data for supervised modeling. For these applications, `vtreat` is a convenient alternative to `model.matrix()` that keeps information about the levels of factor variables observed in the data, and can manage novel levels that appear in future data.
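The contrast with `model.matrix()` can be seen in a small base R sketch. The toy frames `train` and `new_data` are illustrative assumptions, not part of the original example. The point is only that `model.matrix()` derives its columns from whatever levels appear in the data it is handed, and that rows with missing values are dropped under the default `na.action`.

```r
# Toy training data with one missing value, and new data with a novel level 'c'.
train <- data.frame(xc = c('a', 'b', 'a', NA), stringsAsFactors = FALSE)
new_data <- data.frame(xc = c('a', 'c'), stringsAsFactors = FALSE)

# Indicator encodings built independently for each frame.
mm_train <- model.matrix(~ 0 + xc, data = train)
mm_new <- model.matrix(~ 0 + xc, data = new_data)

colnames(mm_train)  # columns for levels a and b
colnames(mm_new)    # columns for levels a and c: a different layout
nrow(mm_train)      # 3: the NA row is gone (default na.action is na.omit)
```

A model fit on `mm_train` cannot be applied directly to `mm_new`, because the column sets disagree; a stored treatment plan avoids this by fixing the column layout at design time.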
In any case, we still want training data where all the input variables are numeric and have no missing values or `NaN`s.
First create the data treatment transform object, in this case a treatment for an unsupervised problem.
transform = vtreat::designTreatmentsZ(
  dframe = d,                              # data to learn transform from
  varlist = setdiff(colnames(d), c('y'))   # columns to transform
)
## [1] "vtreat 1.6.0 inspecting inputs Wed Mar 11 16:29:39 2020"
## [1] "designing treatments Wed Mar 11 16:29:39 2020"
## [1] " have initial level statistics Wed Mar 11 16:29:39 2020"
## [1] " scoring treatments Wed Mar 11 16:29:39 2020"
## [1] "have treatment plan Wed Mar 11 16:29:39 2020"
Use the training data `d` to fit the transform and then return a treated training set: completely numeric, with no missing values.
d_prepared = prepare(transform, d)
d_prepared$y = d$y # copy y to the prepared data
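The claim of a completely numeric result with no missing values can be checked directly. The sketch below does so on a toy frame so that it does not disturb `d_prepared`. The frames `toy_train` and `toy_new`, and the novel level `'c'`, are assumptions for illustration; `verbose = FALSE` is an argument of `designTreatmentsZ()` that only silences the progress messages.

```r
library(vtreat)

# Toy training data with a missing number and a missing category.
toy_train <- data.frame(
  x = c(1, NA, 3, 4, 5, 6, 7, 8),
  xc = c('a', 'b', NA, 'a', 'b', 'a', 'b', 'a'),
  stringsAsFactors = FALSE
)
plan <- designTreatmentsZ(toy_train, c('x', 'xc'), verbose = FALSE)
toy_train_treated <- prepare(plan, toy_train)

# New data containing a missing value and a level never seen in training.
toy_new <- data.frame(x = c(NA, 2), xc = c('c', 'a'), stringsAsFactors = FALSE)
toy_new_treated <- prepare(plan, toy_new)

# Both treated frames should be all numeric, free of NA, and share columns.
all(sapply(toy_train_treated, is.numeric))
anyNA(toy_train_treated)
anyNA(toy_new_treated)
identical(colnames(toy_train_treated), colnames(toy_new_treated))
```

The same checks can be run on `d_prepared` itself before copying `y` into it.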
Now examine the score frame, which gives information about each new variable, including its type and which original variable it is derived from. Some of the columns of the score frame (`rsq`, `sig`) are not relevant to the unsupervised case; those columns are used by the Regression and Classification transforms.
score_frame = transform$scoreFrame
knitr::kable(score_frame)
varName | varMoves | rsq | sig | needsSplit | extraModelDegrees | origName | code |
---|---|---|---|---|---|---|---|
x | TRUE | 0 | 1 | FALSE | 0 | x | clean |
x_isBAD | TRUE | 0 | 1 | FALSE | 0 | x | isBAD |
xc_catP | TRUE | 0 | 1 | TRUE | 5 | xc | catP |
x2 | TRUE | 0 | 1 | FALSE | 0 | x2 | clean |
xc_lev_NA | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_minus_0_5 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_minus_1_5 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_0 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_0_5 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_1 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
Notice that the variable `xc` has been converted to multiple variables:
- an indicator variable for each possible level, including `NA` or missing (`xc_lev*`)
- a variable that returns how prevalent this particular value of `xc` is in the training data (`xc_catP`)
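The two kinds of derived variable can be mimicked by hand in base R. This is a sketch of the idea only, not `vtreat`'s implementation; the toy vector `xc`, the helper `key`, and the output names are assumptions for illustration.

```r
# Toy categorical variable with missing values.
xc <- c('level_1', NA, 'level_0', 'level_1', NA)

# Treat NA as its own level. The literal label 'NA' is an illustrative
# stand-in and would collide with a genuine level of that name.
key <- ifelse(is.na(xc), 'NA', xc)

# Prevalence of each level, looked up row by row (the catP idea).
prevalence <- table(key) / length(key)
xc_catP <- as.numeric(prevalence[key])

# One 0/1 indicator per level (the lev idea); two of them shown here.
xc_lev_NA <- as.numeric(is.na(xc))
xc_lev_level_1 <- as.numeric(!is.na(xc) & xc == 'level_1')

data.frame(xc, xc_catP, xc_lev_NA, xc_lev_level_1)
```

In the real data the same arithmetic gives the `xc_catP` values in the prepared frame: `level_1` appears in 106 of 500 rows, hence 0.212.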
The numeric variable `x` has been converted to two variables:
- a clean version of `x` that has no `NaN`s or missing values
- a variable indicating when `x` was `NaN` or `NA` in the original data (`x_isBAD`)
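A hand-rolled version of this pair of variables, assuming the default mean imputation, might look like the following. It is a base R sketch of the idea rather than `vtreat`'s own code, and the names `x_clean` and `x_isBAD` simply mirror the score frame.

```r
# Toy numeric variable with both NA and NaN.
x <- c(1.5, NA, -2.0, NaN, 4.0)

# Indicator of missingness; is.na() is TRUE for NaN as well as NA.
x_isBAD <- as.numeric(is.na(x))

# Replace missing entries with the mean of the observed values.
x_clean <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

data.frame(x, x_clean, x_isBAD)
```

Keeping the indicator alongside the imputed value lets a downstream model distinguish a genuinely average `x` from one that was filled in.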
Any or all of these new variables are available for downstream modeling.

Also note that the variable `x3` does not appear in the score frame (or in the treated data), as it had no range (didn’t vary), so the unsupervised treatment dropped it.
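A column with no range is easy to spot by hand. The base R sketch below flags which columns of a toy frame vary; the frame `toy` and the helper vector `varies` are illustrative assumptions, and this is not the test `vtreat` uses internally.

```r
# Toy frame: x2 varies, x3 is constant.
toy <- data.frame(x2 = c(0.3, -1.2, 0.8), x3 = c(1, 1, 1))

# A column "moves" if it has more than one distinct non-missing value.
varies <- sapply(toy, function(col) length(unique(col[!is.na(col)])) > 1)
print(varies)
##    x2    x3
##  TRUE FALSE
```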
Let’s look at the top of `d_prepared`, which includes all the new variables, plus `y` (and excluding `x3`).
d_prepared %.>%
head(.) %.>%
knitr::kable(.)
x | x_isBAD | xc_catP | x2 | xc_lev_NA | xc_lev_x_level_minus_0_5 | xc_lev_x_level_minus_1_5 | xc_lev_x_level_0 | xc_lev_x_level_0_5 | xc_lev_x_level_1 | y |
---|---|---|---|---|---|---|---|---|---|---|
1.8848606 | 0 | 0.212 | 0.0046504 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0906132 |
1.5077419 | 0 | 0.212 | -1.2287497 | 0 | 0 | 0 | 0 | 0 | 1 | 1.0108804 |
-5.4901159 | 0 | 0.212 | -0.1405980 | 0 | 0 | 0 | 0 | 0 | 1 | 0.7766693 |
-0.2704873 | 1 | 0.182 | -0.2073270 | 0 | 0 | 0 | 0 | 1 | 0 | 0.5442452 |
-0.2704873 | 1 | 0.236 | -0.9215306 | 1 | 0 | 0 | 0 | 0 | 0 | -0.9738103 |
-0.2704873 | 1 | 0.182 | 0.3604742 | 0 | 1 | 0 | 0 | 0 | 0 | -0.4968719 |
Of course, what we really want to do with the prepared training data is to model.
Let’s start with an unsupervised analysis: clustering.
# don't use y to cluster
not_variables <- c('y')
model_vars <- setdiff(colnames(d_prepared), not_variables)
clusters = kmeans(d_prepared[, model_vars, drop = FALSE], centers = 5)
d_prepared['clusterID'] <- clusters$cluster
head(d_prepared$clusterID)
## [1] 1 1 2 1 1 1
ggplot(data = d_prepared, aes(x=x, y=y, color=as.character(clusterID))) +
  geom_point() +
  ggtitle('y as a function of x, points colored by (unsupervised) clusterID') +
  theme(legend.position="none") +
  scale_colour_brewer(palette = "Dark2")
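One practical point when clustering treated data: `kmeans()` works on Euclidean distance, so a wide-ranging numeric column can dominate 0/1 indicator columns. The sketch below, on a toy frame of our own invention, shows one common option (standardizing with `scale()`) together with `nstart` to reduce sensitivity to the random starting centers. Whether scaling is appropriate depends on the application; this is an illustration, not a recommendation from the original example.

```r
set.seed(2020)

# Toy frame: one wide numeric column and one 0/1 indicator column.
toy <- data.frame(
  wide = rnorm(100, mean = 0, sd = 5),
  flag = rep(c(0, 1), each = 50)
)

# Cluster on the raw columns and on standardized columns.
raw_fit <- kmeans(toy, centers = 2, nstart = 10)
scaled_fit <- kmeans(scale(toy), centers = 2, nstart = 10)

# Compare how each clustering lines up with the indicator column.
table(flag = toy$flag, cluster = raw_fit$cluster)
table(flag = toy$flag, cluster = scaled_fit$cluster)
```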
Since in this case we have an outcome variable, `y`, we can try fitting a linear regression model to `d_prepared`.
f <- wrapr::mk_formula('y', model_vars)
model = lm(f, data = d_prepared)
# now predict
d_prepared['prediction'] = predict(
model,
newdata = d_prepared)
## Warning in predict.lm(model, newdata = d_prepared): prediction from a rank-
## deficient fit may be misleading
# look at the fit (on the training data)
WVPlots::ScatterHist(
d_prepared,
xvar = 'prediction',
yvar = 'y',
smoothmethod = 'identity',
estimate_sig = TRUE,
title = 'Relationship between prediction and y')
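The rank-deficiency warning from `predict()` is expected with this variable set: every row has exactly one `xc_lev_*` indicator set (missing values get their own indicator), so the indicators sum to 1 and are collinear with the intercept. The base R sketch below reproduces the effect on a toy frame; the names `g`, `toy`, and `fit` are illustrative.

```r
# Three-level toy category with a complete set of 0/1 indicators.
g <- rep(c('a', 'b', 'c'), times = 10)
toy <- data.frame(
  lev_a = as.numeric(g == 'a'),
  lev_b = as.numeric(g == 'b'),
  lev_c = as.numeric(g == 'c')
)
set.seed(2020)
toy$y <- rnorm(nrow(toy))

# lev_a + lev_b + lev_c equals the intercept column, so one term is aliased.
fit <- lm(y ~ lev_a + lev_b + lev_c, data = toy)
coef(fit)   # the coefficient for lev_c is NA
fit$rank    # 3, although 4 coefficients were requested
```

Predictions from such a fit are still usable here; dropping one indicator per categorical variable, or using a regularized model, would remove the warning.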
Now apply the model to new data.
# create the new data
dtest <- make_data(450)
# prepare the new data with vtreat
dtest_prepared = prepare(transform, dtest)
# dtest %.>% transform is an alias for prepare(transform, dtest)
dtest_prepared$y = dtest$y
# apply the model to the prepared data
dtest_prepared['prediction'] = predict(
model,
newdata = dtest_prepared)
## Warning in predict.lm(model, newdata = dtest_prepared): prediction from a rank-
## deficient fit may be misleading
# compare the predictions to the outcome (on the test data)
WVPlots::ScatterHist(
dtest_prepared,
xvar = 'prediction',
yvar = 'y',
smoothmethod = 'identity',
estimate_sig = TRUE,
title = 'Relationship between prediction and y')
# get r-squared
sigr::wrapFTest(dtest_prepared,
predictionColumnName = 'prediction',
yColumnName = 'y',
nParameters = length(model_vars) + 1)
## [1] "F Test summary: (R2=0.9683, F(11,438)=1218, p<1e-05)."
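For readers who want to see what the R-squared summarizes, the sketch below computes the usual definition, one minus the ratio of residual to total sum of squares, in base R. The helper `r_squared()` and the vector `y_toy` are our own, and we are assuming (not asserting) that this matches the definition `sigr` reports.

```r
# R-squared as 1 - SSE/SST.
r_squared <- function(y, prediction) {
  1 - sum((y - prediction)^2) / sum((y - mean(y))^2)
}

y_toy <- c(1, 2, 3, 4)
r_squared(y_toy, y_toy)                               # 1: perfect predictions
r_squared(y_toy, rep(mean(y_toy), length(y_toy)))     # 0: no better than the mean
```

Applied to the test data, this would be `r_squared(dtest_prepared$y, dtest_prepared$prediction)`.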
We’ve tried to set the defaults for all parameters so that `vtreat` is usable out of the box for most applications.
suppressPackageStartupMessages(library(printr))
args("designTreatmentsZ")
## function (dframe, varlist, ..., minFraction = 0, weights = c(),
## rareCount = 0, collarProb = 0, codeRestriction = NULL, customCoders = NULL,
## verbose = TRUE, parallelCluster = NULL, use_parallel = TRUE,
## missingness_imputation = NULL, imputation_map = NULL)
## NULL
Some parameters of note include:
**codeRestriction**: The types of synthetic variables that `vtreat` will (potentially) produce. By default, all possible applicable types will be produced. See Types of prepared variables below.

**minFraction** (default: 0): For categorical variables, indicator variables (type `lev`) are only produced for levels that are present at least `minFraction` of the time. A consequence of this is that 1/`minFraction` is the maximum number of indicators that will be produced for a given categorical variable. By default, all possible indicator variables are produced.

**missingness_imputation**: The function or value that `vtreat` uses to impute or “fill in” missing numerical values. The default is `mean`. To change the imputation function or use different functions/values for different columns, see the Imputation example.

**customCoders**: For passing in user-defined transforms for custom data preparation. Won’t be needed in most situations, but see here for an example of applying a GAM transform to input variables.
# calculate the prevalence of each level of xc by hand, including NA
table(d$xc, useNA = "ifany")/nrow(d)
level_-0.5 | level_-1.5 | level_0 | level_0.5 | level_1 | NA |
---|---|---|---|---|---|
0.182 | 0.004 | 0.184 | 0.182 | 0.212 | 0.236 |
transform_common = designTreatmentsZ(
dframe = d, # data to learn transform from
varlist = setdiff(colnames(d), c('y')), # columns to transform
minFraction = 0.2 # only make indicators for levels that appear more than 20% of the time
)
## [1] "vtreat 1.6.0 inspecting inputs Wed Mar 11 16:29:43 2020"
## [1] "designing treatments Wed Mar 11 16:29:43 2020"
## [1] " have initial level statistics Wed Mar 11 16:29:43 2020"
## [1] " scoring treatments Wed Mar 11 16:29:43 2020"
## [1] "have treatment plan Wed Mar 11 16:29:43 2020"
d_prepared = prepare(transform_common, d) # fit the transform
knitr::kable(transform_common$scoreFrame) # examine the score frame
varName | varMoves | rsq | sig | needsSplit | extraModelDegrees | origName | code |
---|---|---|---|---|---|---|---|
x | TRUE | 0 | 1 | FALSE | 0 | x | clean |
x_isBAD | TRUE | 0 | 1 | FALSE | 0 | x | isBAD |
xc_catP | TRUE | 0 | 1 | TRUE | 5 | xc | catP |
x2 | TRUE | 0 | 1 | FALSE | 0 | x2 | clean |
xc_lev_NA | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_1 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
In this case, the unsupervised treatment only created indicators for the two most common levels, `level_1` and `NA`, which are both present more than 20% of the time.
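The selection can be reproduced by hand from the prevalence table computed earlier. The sketch below hard-codes those prevalences (copied from the table in this document) and applies the threshold; the names `prevalence` and `min_fraction` are ours.

```r
# Prevalence of each level of xc, copied from the table above.
prevalence <- c(
  'level_-0.5' = 0.182, 'level_-1.5' = 0.004, 'level_0' = 0.184,
  'level_0.5' = 0.182, 'level_1' = 0.212, 'NA' = 0.236
)
min_fraction <- 0.2

# Levels present at least min_fraction of the time get an indicator.
kept <- names(prevalence)[prevalence >= min_fraction]
print(kept)
## [1] "level_1" "NA"

# Upper bound on the number of indicators for any one variable.
floor(1 / min_fraction)
## [1] 5
```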
In unsupervised situations, this may only be desirable when there is an unworkably large number of possible levels (for example, when using ZIP code as a variable). It is more useful in conjunction with the y-aware variables produced by `designTreatmentsN`/`mkCrossFrameNExperiment` (regression), `designTreatmentsC`/`mkCrossFrameCExperiment` (binary classification), or `designTreatmentsM`/`mkCrossFrameMExperiment` (multiclass classification).
Types of prepared variables produced by the unsupervised treatment:
- `clean`: Produced from numerical variables: a clean numerical variable with no `NaN`s or missing values
- `lev`: Produced from categorical variables, one for each level: for each level of the variable, indicates if that level was “on”
- `catP`: Produced from categorical variables: indicates how often each level of the variable was “on” (its prevalence)
- `isBAD`: Produced for numerical variables: an indicator variable that marks when the original variable was missing or `NaN`
In this example, suppose you only want to use indicators and continuous variables in your model; in other words, you only want to use variables of types `clean`, `isBAD`, and `lev`, and no `catP` variables.
transform_thin = vtreat::designTreatmentsZ(
dframe = d, # data to learn transform from
varlist = setdiff(colnames(d), c('y')), # columns to transform
codeRestriction = c('clean', 'lev', 'isBAD'))
## [1] "vtreat 1.6.0 inspecting inputs Wed Mar 11 16:29:43 2020"
## [1] "designing treatments Wed Mar 11 16:29:43 2020"
## [1] " have initial level statistics Wed Mar 11 16:29:43 2020"
## [1] " scoring treatments Wed Mar 11 16:29:43 2020"
## [1] "have treatment plan Wed Mar 11 16:29:43 2020"
score_frame_thin = transform_thin$scoreFrame
knitr::kable(score_frame_thin)
varName | varMoves | rsq | sig | needsSplit | extraModelDegrees | origName | code |
---|---|---|---|---|---|---|---|
x | TRUE | 0 | 1 | FALSE | 0 | x | clean |
x_isBAD | TRUE | 0 | 1 | FALSE | 0 | x | isBAD |
x2 | TRUE | 0 | 1 | FALSE | 0 | x2 | clean |
xc_lev_NA | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_minus_0_5 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_minus_1_5 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_0 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_0_5 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
xc_lev_x_level_1 | TRUE | 0 | 1 | FALSE | 0 | xc | lev |
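The same selection can also be made after the fact by filtering a score frame on its `code` column. The sketch below uses a small hand-built stand-in, `score_frame_toy`, with a few rows copied from the score frames shown in this document, so it runs without `vtreat`; the names `score_frame_toy` and `keep` are assumptions for illustration.

```r
# Stand-in for a score frame: variable names and their type codes.
score_frame_toy <- data.frame(
  varName = c('x', 'x_isBAD', 'xc_catP', 'x2', 'xc_lev_NA', 'xc_lev_x_level_1'),
  code = c('clean', 'isBAD', 'catP', 'clean', 'lev', 'lev'),
  stringsAsFactors = FALSE
)

# Keep only the variable types wanted in the model.
wanted <- c('clean', 'lev', 'isBAD')
keep <- score_frame_toy$varName[score_frame_toy$code %in% wanted]
keep  # x, x_isBAD, x2, xc_lev_NA, xc_lev_x_level_1 (xc_catP is dropped)
```

With the real objects, the same filter applied to `transform$scoreFrame` yields column names that can be used to subset a prepared frame, for example `d_prepared[, keep, drop = FALSE]`.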
In all cases (classification, regression, unsupervised, and multinomial classification) the intent is that `vtreat` transforms are essentially one-liners.
The preparation commands are organized as follows:
- **Regression**: `R` regression example, fit/prepare interface; `R` regression example, design/prepare/experiment interface; `Python` regression example.
- **Classification**: `R` classification example, fit/prepare interface; `R` classification example, design/prepare/experiment interface; `Python` classification example.
- **Unsupervised tasks**: `R` unsupervised example, fit/prepare interface; `R` unsupervised example, design/prepare/experiment interface; `Python` unsupervised example.
- **Multinomial classification**: `R` multinomial classification example, fit/prepare interface; `R` multinomial classification example, design/prepare/experiment interface; `Python` multinomial classification example.
The current revisions of these examples are designed to be small yet complete. As a set they overlap somewhat, but the user can rely mostly on a single example for a single task type.