
# Using vtreat with Classification Problems

Nina Zumel and John Mount February 2020

This article documents `vtreat`’s “fit_prepare” variation for classification problems. This API was inspired by the `pyvtreat` API, which was in turn based on the `.fit()`, `.transform()`, `.fit_transform()` workflow of `scikit-learn` in `Python`.

The same example in the original `R` `vtreat` notation can be found here.

The same example in the `Python` version of `vtreat` can be found here.

## Preliminaries

```r
library(rqdatatable)
```

```
## Loading required package: wrapr
```

```r
library(vtreat)
packageVersion('vtreat')
```

```
##  '1.6.0'
```

```r
suppressPackageStartupMessages(library(ggplot2))
library(WVPlots)
```

Generate example data.

• `y` is a noisy sinusoidal function of the variable `x`
• `yc` is the output to be predicted: whether `y` is > 0.5.
• Input `xc` is a categorical variable that represents a discretization of `y`, along with some `NA`s
• Input `x2` is a pure noise variable with no relationship to the output
```r
set.seed(2020)

make_data <- function(nrows) {
    d <- data.frame(x = 5*rnorm(nrows))
    d['y'] = sin(d['x']) + 0.1*rnorm(n = nrows)
    d[4:10, 'x'] = NA                  # introduce NAs
    d['xc'] = paste0('level_', 5*round(d$y/5, 1))
    d['x2'] = rnorm(n = nrows)
    d[d['xc']=='level_-1', 'xc'] = NA  # introduce a NA level
    d['yc'] = d[['y']]>0.5
    return(d)
}

d = make_data(500)

d %.>%
    knitr::kable(.)
```
| x | y | xc | x2 | yc |
|---:|---:|:---|---:|:---|
| 1.884861 | 1.0717646 | level_1 | 0.0046504 | TRUE |
| 1.507742 | 0.9958029 | level_1 | -1.2287497 | TRUE |
| -5.490116 | 0.8315705 | level_1 | -0.1405980 | TRUE |
| NA | 0.6007655 | level_0.5 | -0.2073270 | TRUE |
| NA | -0.8339836 | NA | -0.9215306 | FALSE |
| NA | -0.5329006 | level_-0.5 | 0.3604742 | FALSE |

### Some quick data exploration

Check how many levels `xc` has, and their distribution (including `NA`)

```r
unique(d['xc'])
```

```
##             xc
## 1      level_1
## 4    level_0.5
## 5         <NA>
## 6   level_-0.5
## 27     level_0
## 269 level_-1.5
```

```r
table(d$xc, useNA = 'always')
```

```
## 
## level_-0.5 level_-1.5    level_0  level_0.5    level_1       <NA> 
##         94          1         85         98        103        119
```

Find the mean value of `yc`

```r
mean(d[['yc']])
```

```
##  0.324
```

Plot of `yc` versus `x`.

```r
ggplot(d, aes(x=x, y=as.numeric(yc))) +
    geom_line()
```

```
## Warning: Removed 7 rows containing missing values (geom_path).
```

## Build a transform appropriate for classification problems.

Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or `NA`s.

First create the data treatment object, in this case a treatment for a binomial classification problem.

```r
transform_spec <- vtreat::BinomialOutcomeTreatment(
    var_list = setdiff(colnames(d), c('y', 'yc')),  # columns to transform
    outcome_name = 'yc',                            # outcome variable
    outcome_target = TRUE                           # outcome of interest
)
```

Now call the `fit_prepare()` function with the training data `d` to fit the transform and also return a treated training set. The `fit_prepare()` function returns the fitted data treatment object (as `treatments`) and a statistically correct treated training set (as `cross_frame`) for training the model. The `cross_frame` is guaranteed to be completely numeric, with no missing values.

```r
# the unpack notation is a multiassignment operator
# see https://winvector.github.io/wrapr/articles/unpack_multiple_assignment.html
# for more details
unpack[
    treatment_plan = treatments,
    d_prepared = cross_frame
    ] <- fit_prepare(transform_spec, d)

# list the derived variables
get_feature_names(treatment_plan)
```

```
##   "x"                        "x_isBAD"
##   "xc_catP"                  "xc_catB"
##   "x2"                       "xc_lev_NA"
##   "xc_lev_x_level_minus_0_5" "xc_lev_x_level_0"
##   "xc_lev_x_level_0_5"       "xc_lev_x_level_1"
```

Notice that `d_prepared` only includes derived variables and the outcome `yc`:

```r
d_prepared %.>%
    knitr::kable(.)
```

| x | x_isBAD | xc_catP | xc_catB | x2 | xc_lev_NA | xc_lev_x_level_minus_0_5 | xc_lev_x_level_0 | xc_lev_x_level_0_5 | xc_lev_x_level_1 | yc |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 1.8848606 | 0 | 0.2102102 | 14.206543 | 0.0046504 | 0 | 0 | 0 | 0 | 1 | TRUE |
| 1.5077419 | 0 | 0.2005988 | 14.139786 | -1.2287497 | 0 | 0 | 0 | 0 | 1 | TRUE |
| -5.4901159 | 0 | 0.2005988 | 14.139786 | -0.1405980 | 0 | 0 | 0 | 0 | 1 | TRUE |
| -0.1276897 | 1 | 0.1891892 | 1.219475 | -0.2073270 | 0 | 0 | 0 | 1 | 0 | TRUE |
| -0.3929879 | 1 | 0.2402402 | -12.844663 | -0.9215306 | 1 | 0 | 0 | 0 | 0 | FALSE |
| -0.2908461 | 1 | 0.1766467 | -12.563128 | 0.3604742 | 0 | 1 | 0 | 0 | 0 | FALSE |
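
A quick base `R` check of the earlier claim that the prepared cross frame contains no missing values:

```r
# every column of the prepared cross frame should be free of NA values
sapply(d_prepared, function(column) sum(is.na(column)))
```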

As we will see below, the `prepare()` function applies the fitted data treatments to future data, prior to calling your model on the data. Note that for the training data `d`: `fit_prepare(transform_spec, d)` is not the same as `fit(transform_spec, d) %.>% prepare(., d)`; the second call can lead to nested model bias in some situations, and is not recommended. In other words, it is a bad idea to call `prepare()` on your original training data.

For future application data `df` that is not seen during transform design, `prepare(treatment_plan, df)` is the appropriate step.
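
For example, a minimal sketch (re-using `make_data()` here simply to stand in for genuinely new application data):

```r
# sketch: apply the already-fitted treatment plan to new data
d_application <- make_data(100)   # stand-in for future application data
d_application_prepared <- prepare(treatment_plan, d_application)
```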

`vtreat` version `1.5.1` and newer issue a warning if you call the incorrect transform pattern on your original training data:

```r
d_prepared_wrong <- prepare(treatment_plan, d)
```

```
## Warning in treatmentplan$transform(dframe = dframe, ...): possibly called
## transform() on same data frame as fit(), this can lead to over-fit. To avoid
```

## The Score Frame

Now examine the score frame, which gives information about each new variable, including its type, which original variable it is derived from, its (cross-validated) significance as a one-variable linear model for the outcome, and the (cross-validated) R-squared of its corresponding linear model.

```r
# get statistics on the variables
score_frame <- get_score_frame(treatment_plan)

# only print a subset of the columns
cols = c("varName", "origName", "code", "rsq", "sig", "varMoves", "default_threshold", "recommended")
knitr::kable(score_frame[, cols])
```
| varName | origName | code | rsq | sig | varMoves | default_threshold | recommended |
|:---|:---|:---|---:|---:|:---|---:|:---|
| x | x | clean | 0.0005756 | 0.5470919 | TRUE | 0.10 | FALSE |
| xc_catP | xc | catP | 0.0008468 | 0.4652101 | TRUE | 0.20 | FALSE |
| xc_catB | xc | catB | 0.7883578 | 0.0000000 | TRUE | 0.20 | TRUE |
| x2 | x2 | clean | 0.0026075 | 0.2000083 | TRUE | 0.10 | FALSE |
| xc_lev_NA | xc | lev | 0.1750095 | 0.0000000 | TRUE | 0.04 | TRUE |
| xc_lev_x_level_minus_0_5 | xc | lev | 0.1328708 | 0.0000000 | TRUE | 0.04 | TRUE |
| xc_lev_x_level_0 | xc | lev | 0.1185254 | 0.0000000 | TRUE | 0.04 | TRUE |
| xc_lev_x_level_0_5 | xc | lev | 0.0644178 | 0.0000000 | TRUE | 0.04 | TRUE |
| xc_lev_x_level_1 | xc | lev | 0.4701626 | 0.0000000 | TRUE | 0.04 | TRUE |

Note that the variable `xc` has been converted to multiple variables:

• an indicator variable for each possible level, plus `NA` (`xc_lev_*`)
• the value of a (cross-validated) one-variable model for `yc` as a function of `xc` (`xc_catB`)
• a variable that returns how prevalent this particular value of `xc` is in the training data (`xc_catP`)

The variable `x` has been converted to two new variables:

• a clean version of `x` that has no missing values or `NaN`s
• a variable indicating when `x` was `NA` in the original data (`x_isBAD`).

Any or all of these new variables are available for downstream modeling.
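
The `origName` column of the score frame records which original column each derived variable came from. For example, one way to list the derived variables that trace back to `xc`:

```r
# derived variables that come from the original column 'xc'
score_frame$varName[score_frame$origName == 'xc']
```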

The `recommended` column indicates which variables are non-constant (`varMoves` == TRUE) and have a significance value (`sig`) smaller than `default_threshold`. See the section Deriving the Default Thresholds below for the reasoning behind the default thresholds. Recommended columns are intended as advice about which variables appear to be most likely to be useful in a downstream model. This advice attempts to be conservative, to reduce the possibility of mistakenly eliminating variables that may in fact be useful (although, obviously, it can still mistakenly eliminate variables that have a real but non-linear relationship to the output, as is the case with `x` in our example).
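
A minimal sketch of that rule, re-derived directly from the score frame columns described above:

```r
# recommended = the variable moves and its significance beats the per-type threshold
with(score_frame, varMoves & (sig < default_threshold))
```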

Let’s look at the variables that are and are not recommended:

```r
# recommended variables
score_frame[score_frame[['recommended']], 'varName', drop = FALSE]  %.>%
    knitr::kable(.)
```

|    | varName |
|:---|:---|
| 4  | xc_catB |
| 6  | xc_lev_NA |
| 7  | xc_lev_x_level_minus_0_5 |
| 8  | xc_lev_x_level_0 |
| 9  | xc_lev_x_level_0_5 |
| 10 | xc_lev_x_level_1 |

```r
# not recommended variables
score_frame[!score_frame[['recommended']], 'varName', drop = FALSE] %.>%
    knitr::kable(.)
```

|   | varName |
|:--|:---|
| 1 | x |
| 3 | xc_catP |
| 5 | x2 |

## A Closer Look at `catB` variables

Variables of type `catB` are the outputs of a one-variable regularized logistic regression of a categorical variable (in our example, `xc`) against the centered output on the (cross-validated) treated training data.

Let’s see whether `xc_catB` makes a good one-variable model for `yc`. It has a large AUC:

```r
WVPlots::ROCPlot(
    frame = d_prepared,
    xvar = 'xc_catB',
    truthVar = 'yc',
    truthTarget = TRUE,
    title = 'performance of xc_catB variable')
```

This indicates that `xc_catB` is strongly predictive of the outcome. Negative values of `xc_catB` correspond strongly to negative outcomes, and positive values correspond strongly to positive outcomes.

```r
WVPlots::DoubleDensityPlot(
    frame = d_prepared,
    xvar = 'xc_catB',
    truthVar = 'yc',
    title = 'performance of xc_catB variable')
```

The values of `xc_catB` are in “link space”.
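
Roughly speaking, “link space” means the log-odds (logit) scale of the underlying one-variable logistic model. Ignoring the regularization and cross-validation that `vtreat` applies, the `catB` score for a level is approximately the logit of the outcome rate within that level minus the logit of the overall outcome rate; a rough sketch:

```r
# rough illustration of the catB idea (ignores vtreat's smoothing and cross-validation)
logit <- function(p) log(p / (1 - p))
level_rate   <- mean(d$yc[d$xc %in% 'level_0.5'])   # outcome rate within one level
overall_rate <- mean(d$yc)                          # overall outcome rate
logit(level_rate) - logit(overall_rate)             # roughly the catB score for 'level_0.5'
```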

Variables of type `catB` are useful when dealing with categorical variables with a very large number of possible levels. For example, a categorical variable with 10,000 possible values potentially converts to 10,000 indicator variables, which may be unwieldy for some modeling methods. Using a single numerical variable of type `catB` may be a preferable alternative.

## Using the Prepared Data in a Model

Of course, what we really want to do with the prepared training data is to fit a model jointly with all the (recommended) variables. Let’s try fitting a logistic regression model to `d_prepared`.

```r
# only use the recommended variables to fit the model
model_vars <- score_frame$varName[score_frame$recommended]

# to use all the variables:
# model_vars <- score_frame$varName

f <- wrapr::mk_formula('yc', model_vars, outcome_target = TRUE)

model = glm(f, data = d_prepared, family = binomial)  # logistic regression

# now predict
d_prepared['prediction'] = predict(
    model,
    newdata = d_prepared,
    type = 'response')

# look at the ROC curve (on the training data)
WVPlots::ROCPlot(
    frame = d_prepared,
    xvar = 'prediction',
    truthVar = 'yc',
    truthTarget = TRUE,
    title = 'Performance of logistic regression model on training data')
```

Now apply the model to new data.

```r
# create the new data
dtest <- make_data(450)

# prepare the new data with vtreat
dtest_prepared = prepare(treatment_plan, dtest)

# apply the model to the prepared data
dtest_prepared['prediction'] = predict(
    model,
    newdata = dtest_prepared,
    type = 'response')

WVPlots::ROCPlot(
    frame = dtest_prepared,
    xvar = 'prediction',
    truthVar = 'yc',
    truthTarget = TRUE,
    title = 'Performance of logistic regression model on test data')
```

## Parameters for `BinomialOutcomeTreatment`

We’ve tried to set the defaults for all parameters so that `vtreat` is usable out of the box for most applications.

```r
classification_parameters()
```

```
## $minFraction
##  0.02
## 
## $smFactor
##  0
## 
## $rareCount
##  0
## 
## $rareSig
## NULL
## 
## $collarProb
##  0
## 
## $codeRestriction
## NULL
## 
## $customCoders
## NULL
## 
## $splitFunction
## NULL
## 
## $ncross
##  3
## 
## $forceSplit
##  FALSE
## 
## $catScaling
##  TRUE
## 
## $verbose
##  FALSE
## 
## $use_parallel
##  TRUE
## 
## $missingness_imputation
## NULL
## 
## $pruneSig
## NULL
## 
## $scale
##  FALSE
## 
## $doCollar
##  FALSE
## 
## $varRestriction
## NULL
## 
## $trackedValues
## NULL
## 
## $check_for_duplicate_frames
##  TRUE
## 
## attr(,"class")
##  "classification_parameters"
```

Some parameters of note include:

• `codeRestriction` (default: `NULL`): the types of synthetic variables that `vtreat` will (potentially) produce. See Types of prepared variables below. By default, `vtreat` produces all applicable variable types.
• `minFraction` (default: 0.02): For categorical variables, indicator variables (type `lev`) are only produced for levels that are present at least `minFraction` of the time (by default, 2% of the time). A consequence of this is that 1/`minFraction` is the maximum number of indicators that will be produced for a given categorical variable. To make sure that all possible indicator variables are produced, set `minFraction = 0` (see the sketch after this list).
• `splitFunction`: The cross-validation method used by `vtreat`. Most people won’t have to change this.
• `ncross` (default: 3): The number of folds to use for cross-validation.
• `missingness_imputation`: The function or value that `vtreat` uses to impute or “fill in” missing numerical values. The default is `mean`. To change the imputation function, or to use different functions/values for different columns, see the Imputation example.
• `customCoders`: For passing in user-defined transforms for custom data preparation. Won’t be needed in most situations, but see here for an example of applying a GAM transform to input variables.
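
As a small illustration of overriding these defaults (a sketch only; these settings are not used elsewhere in this article), the following builds a parameter object that keeps an indicator for every observed level and uses more cross-validation folds:

```r
# sketch: overriding minFraction and ncross (hypothetical settings)
more_folds_params <- classification_parameters(
    list(
        minFraction = 0,  # produce an indicator for every observed level
        ncross = 5        # use 5 cross-validation folds instead of the default 3
    )
)
```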

## Types of prepared variables

• `clean`: Produced from numerical variables: a clean numerical variable with no `NA`s or missing values.
• `lev`: Produced from categorical variables, one for each (common) level: a 0/1 indicator variable that indicates whether that level was “on”.
• `catP`: Produced from categorical variables: indicates how often each level of the variable was “on” (its prevalence).
• `catB`: Produced from categorical variables: the score from a one-dimensional model of the centered output as a function of the variable.
• `isBAD`: Produced from numerical variables: an indicator variable that marks when the original variable was missing or `NaN`.

More on the coding types can be found here.
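
The `code` column of the score frame records which of these types each derived variable is, so variables can also be filtered by type after treatment design. For example, using the score frame from the main example above:

```r
# list just the indicator ('lev') variables
score_frame$varName[score_frame$code == 'lev']
```

The next example takes the complementary approach: restricting which variable types `vtreat` produces in the first place.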

### Example: Produce only a subset of variable types

In this example, suppose you only want to use indicators and continuous variables in your model; in other words, you only want to use variables of types (`clean`, `isBAD`, and `lev`), and no `catB` or `catP` variables.

```r
# create a new set of parameters, overriding
# the default for codeRestriction
newparams = classification_parameters(
    list(codeRestriction = c('lev', 'clean', 'isBAD'))
)

thin_spec <- vtreat::BinomialOutcomeTreatment(
    var_list = setdiff(colnames(d), c('y', 'yc')),  # columns to transform
    outcome_name = 'yc',                            # outcome variable
    outcome_target = TRUE,                          # outcome of interest
    params = newparams                              # set the parameters
)

unpack[
    thin_plan = treatments,
    thin_frame = cross_frame
    ] <- fit_prepare(thin_spec, d)

# examine the new prepared training data (no catB or catP variables)
thin_frame %.>%
    knitr::kable(.)
```
| x | x_isBAD | x2 | xc_lev_NA | xc_lev_x_level_minus_0_5 | xc_lev_x_level_0 | xc_lev_x_level_0_5 | xc_lev_x_level_1 | yc |
|---:|---:|---:|---:|---:|---:|---:|---:|:---|
| 1.8848606 | 0 | 0.0046504 | 0 | 0 | 0 | 0 | 1 | TRUE |
| 1.5077419 | 0 | -1.2287497 | 0 | 0 | 0 | 0 | 1 | TRUE |
| -5.4901159 | 0 | -0.1405980 | 0 | 0 | 0 | 0 | 1 | TRUE |
| -0.0453530 | 1 | -0.2073270 | 0 | 0 | 0 | 1 | 0 | TRUE |
| -0.0453530 | 1 | -0.9215306 | 1 | 0 | 0 | 0 | 0 | FALSE |
| -0.4926751 | 1 | 0.3604742 | 0 | 1 | 0 | 0 | 0 | FALSE |
```r
# examine the score frame for the new treatment plan
# no catB or catP
knitr::kable(get_score_frame(thin_plan)[, cols])
```

| varName | origName | code | rsq | sig | varMoves | default_threshold | recommended |
|:---|:---|:---|---:|---:|:---|---:|:---|
| x | x | clean | 0.0005756 | 0.5470919 | TRUE | 0.16666667 | FALSE |
| x2 | x2 | clean | 0.0026075 | 0.2000083 | TRUE | 0.16666667 | FALSE |
| xc_lev_NA | xc | lev | 0.1750095 | 0.0000000 | TRUE | 0.06666667 | TRUE |
| xc_lev_x_level_minus_0_5 | xc | lev | 0.1328708 | 0.0000000 | TRUE | 0.06666667 | TRUE |
| xc_lev_x_level_0 | xc | lev | 0.1185254 | 0.0000000 | TRUE | 0.06666667 | TRUE |
| xc_lev_x_level_0_5 | xc | lev | 0.0644178 | 0.0000000 | TRUE | 0.06666667 | TRUE |
| xc_lev_x_level_1 | xc | lev | 0.4701626 | 0.0000000 | TRUE | 0.06666667 | TRUE |

## Deriving the Default Thresholds

While machine learning algorithms are generally tolerant to a reasonable number of irrelevant or noise variables, too many irrelevant variables can lead to serious overfit; see this article for an extreme example, one we call “Bad Bayes”. The default threshold is an attempt to eliminate obviously irrelevant variables early.

Imagine that you have a pure noise dataset, where none of the n inputs are related to the output. If you treat each variable as a one-variable model for the output, and look at the significances of each model, these significance values will be uniformly distributed in the range [0, 1]. You want to pick the weakest possible significance threshold that eliminates as many noise variables as possible. A moment’s thought should convince you that a threshold of 1/n allows only one variable through, in expectation.

This leads to the general-case heuristic that a significance threshold of 1/n on your variables should allow only one irrelevant variable through, in expectation (along with all the relevant variables). Hence, 1/n used to be our recommended threshold, when we originally developed the R version of `vtreat`.

We noticed, however, that this biases the filtering against numerical variables, since there are at most two derived variables (of types `clean` and `isBAD`) for every numerical variable in the original data. Categorical variables, on the other hand, are expanded to many derived variables: several indicators (one for every common level), plus a `catB` and a `catP`. So we now reweight the thresholds.

Suppose you have a (treated) data set with `ntreat` different types of `vtreat` variables (`clean`, `lev`, etc.), and `nT` variables of type T. Then the default threshold for all the variables of type T is `1/(ntreat * nT)`. This reweighting helps to reduce the bias against any particular type of variable. The heuristic is still that the set of recommended variables will allow at most one noise variable into the set of candidate variables, in expectation.
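
As a small worked check against the score frame in the main example above (where five variable types are in play: `clean`, `isBAD`, `lev`, `catP`, and `catB`, so `ntreat = 5`):

```r
# worked check of the threshold heuristic, assuming ntreat = 5 variable types
ntreat <- 5
1 / (ntreat * 5)   # five 'lev' variables  -> 0.04
1 / (ntreat * 2)   # two 'clean' variables -> 0.10
1 / (ntreat * 1)   # one 'catB' variable   -> 0.20
```

These values match the `default_threshold` column shown earlier.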

As noted above, because `vtreat` estimates variable significances using linear methods by default, some variables with a non-linear relationship to the output may fail to pass the threshold. In this case, you may not wish to filter the variables to be used in the models to only recommended variables (as we did in the main example above), but instead use all the variables, or select the variables to use by your own criteria.

## Conclusion

In all cases (classification, regression, unsupervised, and multinomial classification) the intent is that `vtreat` transforms are essentially one-liners.

The preparation commands are organized by task type (regression, classification, multinomial classification, and unsupervised).

These current revisions of the examples are designed to be small, yet complete. So as a set they have some overlap, but the user can rely mostly on a single example for a single task type.
