
# Using vtreat with Multinomial Classification Problems

Nina Zumel and John Mount September 2019

Note: this is a description of the `Python` version of `vtreat`; the same example for the `R` version of `vtreat` can be found here.

## Preliminaries

```python
import pkg_resources
import pandas
import numpy
import numpy.random
import seaborn
import matplotlib.pyplot as plt
import vtreat
import vtreat.util
import wvpy.util
```

Generate example data.

• `y` is a noisy sinusoidal function of the variable `x`
• `yc` is the multiple class output to be predicted: `y`'s quantized value as 'large', 'liminal', or 'small'.
• Input `xc` is a categorical variable that represents a discretization of `y`, along with some `NaN`s
• Input `x2` is a pure noise variable with no relationship to the output

```python
def make_data(nrows):
    d = pandas.DataFrame({'x': 5*numpy.random.normal(size=nrows)})
    d['y'] = numpy.sin(d['x']) + 0.1*numpy.random.normal(size=nrows)
    d.loc[numpy.arange(3, 10), 'x'] = numpy.nan                           # introduce missing values in x
    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]
    d['x2'] = numpy.random.normal(size=nrows)
    d.loc[d['xc']=='level_-1.0', 'xc'] = numpy.nan  # introduce a nan level
    d['yc'] = numpy.where(d['y']>0.5, 'large', numpy.where(d['y']<-0.5, 'small', 'liminal'))
    return d

d = make_data(500)

d.head()
```

| | x | y | xc | x2 | yc |
|---|---|---|---|---|---|
| 0 | 3.074277 | -0.039133 | level_-0.0 | 0.379897 | liminal |
| 1 | 6.632459 | 0.252459 | level_0.5 | -1.204097 | liminal |
| 2 | 5.129012 | -0.889266 | NaN | -0.048225 | small |
| 3 | NaN | 0.816912 | level_1.0 | -0.767512 | large |
| 4 | NaN | -0.145631 | level_-0.0 | 1.211516 | liminal |

### Some quick data exploration

Check how many levels `xc` has, and their distribution (including `NaN`)

```python
d['xc'].unique()
```

```
array(['level_-0.0', 'level_0.5', nan, 'level_1.0', 'level_0.0',
       'level_-0.5'], dtype=object)
```

```python
d['xc'].value_counts(dropna=False)
```

```
level_1.0     123
NaN           107
level_-0.5    105
level_0.5      91
level_0.0      40
level_-0.0     34
Name: xc, dtype: int64
```

Show the distribution of `yc`

```python
d['yc'].value_counts(dropna=False)
```

```
small      169
large      166
liminal    165
Name: yc, dtype: int64
```

## Build a transform appropriate for classification problems.

Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or `NaN`s.

First create the data treatment transform object, in this case a treatment for a multinomial classification problem.

```python
transform = vtreat.MultinomialOutcomeTreatment(
    outcome_name='yc',    # outcome variable
    cols_to_copy=['y'],   # columns to "carry along" but not treat as input variables
)
```

Use the training data `d` to fit the transform and return a treated training set: completely numeric, with no missing values. Note that for the training data `d`, `transform.fit_transform()` is not the same as `transform.fit().transform()`; the second call can lead to nested model bias in some situations, and is not recommended. For other, later data not seen during transform design, `transform.transform(o)` is an appropriate step.

```python
d_prepared = transform.fit_transform(d, d['yc'])
```
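To make the nested model bias concern concrete, here is a minimal sketch that does not use `vtreat` and is not its actual implementation. It assumes a toy frame with a high-cardinality categorical `xc` that is pure noise with respect to a numeric `y`; the names `toy`, `naive_code`, and `cross_code`, and the simple modulo fold assignment, are all illustrative assumptions. Coding each level by its in-sample mean outcome makes the noise variable look predictive on the training data, while coding each row only from the other folds does not.

```python
import numpy
import pandas

rng = numpy.random.default_rng(2019)
n = 500

# toy data (assumption): xc has many levels and no real relation to y
toy = pandas.DataFrame({
    'xc': ['lev_' + str(i) for i in rng.integers(0, 250, size=n)],
    'y': rng.normal(size=n),
})

# naive coding: each row is coded with its level's mean of y, computed on
# the same rows that will later be used to fit the model
naive_code = toy.groupby('xc')['y'].transform('mean')

# cross-validated coding: each row is coded using only rows from other folds
k = 5
fold = numpy.arange(n) % k
cross_code = pandas.Series(0.0, index=toy.index)
for f in range(k):
    in_fold = fold == f
    train = toy[~in_fold]
    level_means = train.groupby('xc')['y'].mean()
    # levels unseen in the other folds fall back to the overall training mean
    coded = toy.loc[in_fold, 'xc'].map(level_means).fillna(train['y'].mean())
    cross_code.loc[in_fold] = coded.to_numpy()

naive_corr = numpy.corrcoef(naive_code, toy['y'])[0, 1]
cross_corr = numpy.corrcoef(cross_code, toy['y'])[0, 1]
print('in-sample coding correlation with y:      ', naive_corr)
print('cross-validated coding correlation with y:', cross_corr)
```

The in-sample coding reports a strong correlation that is entirely an artifact of reusing the same rows; this is the kind of leakage that calling `fit_transform()` on the training data is meant to avoid.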

Now examine the score frame, which gives information about each new variable, including its type, which original variable it is derived from, its (cross-validated) correlation with the outcome, and its (cross-validated) significance as a one-variable linear model for the outcome.

```python
transform.score_frame_
```

| | variable | orig_variable | treatment | y_aware | has_range | PearsonR | significance | vcount | default_threshold | recommended | outcome_target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x_is_bad | x | missing_indicator | False | True | 0.024435 | 5.856828e-01 | 2.0 | 0.100000 | False | large |
| 1 | xc_is_bad | xc | missing_indicator | False | True | -0.367855 | 1.817194e-17 | 2.0 | 0.100000 | True | large |
| 2 | x | x | clean_copy | False | True | -0.019163 | 6.690388e-01 | 2.0 | 0.100000 | False | large |
| 3 | x2 | x2 | clean_copy | False | True | -0.072579 | 1.050180e-01 | 2.0 | 0.100000 | False | large |
| 4 | xc_logit_code_large | xc | logit_code | True | True | 0.839481 | 5.175149e-134 | 3.0 | 0.066667 | True | large |
| 5 | xc_logit_code_small | xc | logit_code | True | True | -0.606377 | 1.584985e-51 | 3.0 | 0.066667 | False | large |
| 6 | xc_logit_code_liminal | xc | logit_code | True | True | -0.401019 | 9.682139e-21 | 3.0 | 0.066667 | False | large |
| 7 | xc_prevalence_code | xc | prevalence_code | False | True | 0.452524 | 1.305602e-26 | 1.0 | 0.200000 | True | large |
| 8 | xc_lev_level_1.0 | xc | indicator_code | False | True | 0.810216 | 1.274440e-117 | 4.0 | 0.050000 | True | large |
| 9 | xc_lev__NA_ | xc | indicator_code | False | True | -0.367855 | 1.817194e-17 | 4.0 | 0.050000 | True | large |
| 10 | xc_lev_level_-0.5 | xc | indicator_code | False | True | -0.363477 | 4.614443e-17 | 4.0 | 0.050000 | True | large |
| 11 | xc_lev_level_0.5 | xc | indicator_code | False | True | 0.140755 | 1.603795e-03 | 4.0 | 0.050000 | True | large |
| 12 | x_is_bad | x | missing_indicator | False | True | 0.061181 | 1.719657e-01 | 2.0 | 0.100000 | False | liminal |
| 13 | xc_is_bad | xc | missing_indicator | False | True | -0.366197 | 2.590356e-17 | 2.0 | 0.100000 | True | liminal |
| 14 | x | x | clean_copy | False | True | 0.013960 | 7.555017e-01 | 2.0 | 0.100000 | False | liminal |
| 15 | x2 | x2 | clean_copy | False | True | 0.094836 | 3.399997e-02 | 2.0 | 0.100000 | True | liminal |
| 16 | xc_logit_code_large | xc | logit_code | True | True | -0.218402 | 8.178872e-07 | 3.0 | 0.066667 | False | liminal |
| 17 | xc_logit_code_small | xc | logit_code | True | True | -0.243506 | 3.496733e-08 | 3.0 | 0.066667 | False | liminal |
| 18 | xc_logit_code_liminal | xc | logit_code | True | True | 0.675089 | 8.656947e-68 | 3.0 | 0.066667 | True | liminal |
| 19 | xc_prevalence_code | xc | prevalence_code | False | True | -0.691087 | 3.126704e-72 | 1.0 | 0.200000 | True | liminal |
| 20 | xc_lev_level_1.0 | xc | indicator_code | False | True | -0.400868 | 1.003896e-20 | 4.0 | 0.050000 | True | liminal |
| 21 | xc_lev__NA_ | xc | indicator_code | False | True | -0.366197 | 2.590356e-17 | 4.0 | 0.050000 | True | liminal |
| 22 | xc_lev_level_-0.5 | xc | indicator_code | False | True | 0.087196 | 5.134260e-02 | 4.0 | 0.050000 | False | liminal |
| 23 | xc_lev_level_0.5 | xc | indicator_code | False | True | 0.198094 | 8.092185e-06 | 4.0 | 0.050000 | True | liminal |
| 24 | x_is_bad | x | missing_indicator | False | True | -0.085144 | 5.709636e-02 | 2.0 | 0.100000 | True | small |
| 25 | xc_is_bad | xc | missing_indicator | False | True | 0.730241 | 1.951693e-84 | 2.0 | 0.100000 | True | small |
| 26 | x | x | clean_copy | False | True | 0.005201 | 9.076452e-01 | 2.0 | 0.100000 | False | small |
| 27 | x2 | x2 | clean_copy | False | True | -0.022014 | 6.233712e-01 | 2.0 | 0.100000 | False | small |
| 28 | xc_logit_code_large | xc | logit_code | True | True | -0.618656 | 3.868754e-54 | 3.0 | 0.066667 | False | small |
| 29 | xc_logit_code_small | xc | logit_code | True | True | 0.845744 | 5.948699e-138 | 3.0 | 0.066667 | True | small |
| 30 | xc_logit_code_liminal | xc | logit_code | True | True | -0.271830 | 6.416587e-10 | 3.0 | 0.066667 | False | small |
| 31 | xc_prevalence_code | xc | prevalence_code | False | True | 0.236456 | 8.788362e-08 | 1.0 | 0.200000 | True | small |
| 32 | xc_lev_level_1.0 | xc | indicator_code | False | True | -0.408142 | 1.710758e-21 | 4.0 | 0.050000 | True | small |
| 33 | xc_lev__NA_ | xc | indicator_code | False | True | 0.730241 | 1.951693e-84 | 4.0 | 0.050000 | True | small |
| 34 | xc_lev_level_-0.5 | xc | indicator_code | False | True | 0.275188 | 3.868707e-10 | 4.0 | 0.050000 | True | small |
| 35 | xc_lev_level_0.5 | xc | indicator_code | False | True | -0.337045 | 9.522317e-15 | 4.0 | 0.050000 | True | small |

Note that the variable `xc` has been converted to multiple variables:

• an indicator variable for each possible level (`xc_lev_level_*`)
• the value of a (cross-validated) one-variable "one versus rest" model for `yc` as a function of `xc`; one per possible outcome class (`xc_logit_code_*`)
• a variable that returns how prevalent this particular value of `xc` is in the training data (`xc_prevalence_code`)
• a variable indicating when `xc` was `NaN` in the original data (`xc_is_bad`)

Any or all of these new variables are available for downstream modeling.
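As a rough illustration of two of these derived variables, the sketch below recomputes a missing-value indicator and a prevalence code with plain `pandas`, using the level counts of `xc` reported earlier. It is an assumption-laden approximation of what the variables mean, not `vtreat`'s own code; the names `level_counts`, `is_bad`, `filled`, and `prevalence` are hypothetical.

```python
import pandas

# level counts of xc as reported by value_counts() above; None stands in for NaN
level_counts = [('level_1.0', 123), (None, 107), ('level_-0.5', 105),
                ('level_0.5', 91), ('level_0.0', 40), ('level_-0.0', 34)]
xc = pandas.Series([lev for lev, k in level_counts for _ in range(k)], dtype=object)

# missing indicator: 1.0 where the original value was missing
is_bad = xc.isna().astype(float)

# prevalence code: the fraction of training rows that share this row's level,
# with missing values treated as their own level
filled = xc.fillna('_NA_')
prevalence = filled.map(filled.value_counts(normalize=True))

print(is_bad.sum())
print(prevalence.groupby(filled).first())
```

Under these assumptions, the prevalence of `level_-0.0` comes out as 34/500, which agrees with the 0.068 shown for `xc_prevalence_code` in the prepared data.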

Variables of type `logit_code_*` are useful when dealing with categorical variables with a very large number of possible levels. For example, a categorical variable with 10,000 possible values potentially converts to 10,000 indicator variables, which may be unwieldy for some modeling methods. Using one numerical variable of type `logit_code_*` per outcome target may be a preferable alternative.

Unlike the other `vtreat` treatments (Numeric, Binomial, Unsupervised), the score frame here has more rows than created variables, because the significance of each variable is evaluated against each possible outcome target.

The `recommended` column indicates which variables are non-constant (`has_range` == True) and have a significance value smaller than `default_threshold` with respect to a particular outcome target. See the section Deriving the Default Thresholds below for the reasoning behind the default thresholds. Recommended columns are intended as advice about which variables appear most likely to be useful in a downstream model. This advice attempts to be conservative, to reduce the possibility of mistakenly eliminating variables that may in fact be useful (although, obviously, it can still mistakenly eliminate variables that have a real but non-linear relationship to the output, as is the case with `x` in our example). Since each variable has multiple recommendations, one can consider a variable to be recommended if it is recommended for any of the outcome targets: an OR of all the recommendations.
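The rule and the OR across outcome targets can be sketched on a few rows copied from the score frame above. This is only an illustration of the stated rule for the `clean_copy` variables `x` and `x2`; the frame `rows` and the name `recommended_any` are assumptions. Note that in the score frame above some `logit_code` rows with very small significance are not recommended, so the actual flag for y-aware variables evidently depends on more than this comparison.

```python
import pandas

# rows for the clean_copy variables x and x2, copied from the score frame above
rows = pandas.DataFrame({
    'variable': ['x', 'x2', 'x', 'x2', 'x', 'x2'],
    'outcome_target': ['large', 'large', 'liminal', 'liminal', 'small', 'small'],
    'has_range': [True] * 6,
    'significance': [6.690388e-01, 1.050180e-01, 7.555017e-01,
                     3.399997e-02, 9.076452e-01, 6.233712e-01],
    'default_threshold': [0.1] * 6,
})

# stated rule: non-constant and more significant than the default threshold
rows['recommended'] = rows['has_range'] & (rows['significance'] < rows['default_threshold'])

# a variable counts as recommended if it is recommended for any outcome target
recommended_any = rows.groupby('variable')['recommended'].any()
print(recommended_any)
```

Here `x2` is kept only because it passes the threshold for the 'liminal' target, while `x` passes for none of the three.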

## Examining variables

To select variables, we can make our selection in terms of new variables as follows.

```python
score_frame = transform.score_frame_
good_new_variables = score_frame.variable[score_frame.recommended].unique()
good_new_variables
```

```
array(['xc_is_bad', 'xc_logit_code_large', 'xc_prevalence_code',
       'xc_lev_level_1.0', 'xc_lev__NA_', 'xc_lev_level_-0.5',
       'xc_logit_code_small'], dtype=object)
```

Or in terms of original variables as follows.

```python
good_original_variables = score_frame.orig_variable[score_frame.recommended].unique()
good_original_variables
```

```
array(['xc', 'x2', 'x'], dtype=object)
```

Notice that in each case we must call `unique()`, as each variable (derived or original) is potentially qualified against each possible outcome.

Notice that, by default, `d_prepared` only includes recommended variables (along with `y` and `yc`):

```python
d_prepared.head()
```

| | y | yc | x_is_bad | xc_is_bad | x2 | xc_logit_code_large | xc_logit_code_small | xc_logit_code_liminal | xc_prevalence_code | xc_lev_level_1.0 | xc_lev__NA_ | xc_lev_level_-0.5 | xc_lev_level_0.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.039133 | liminal | 0.0 | 0.0 | 0.379897 | -5.766408 | -5.691954 | 1.084094 | 0.068 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.252459 | liminal | 0.0 | 0.0 | -1.204097 | 0.386183 | -5.775971 | 0.399366 | 0.182 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | -0.889266 | small | 0.0 | 1.0 | -0.048225 | -5.757316 | 1.069527 | -5.797946 | 0.214 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.816912 | large | 1.0 | 0.0 | -0.767512 | 1.084192 | -5.820619 | -5.755436 | 0.246 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | -0.145631 | liminal | 1.0 | 0.0 | 1.211516 | -5.717225 | -5.741858 | 1.055005 | 0.068 | 0.0 | 0.0 | 0.0 | 0.0 |

This is `vtreat`'s default behavior; to include all variables in the prepared data, set the parameter `filter_to_recommended` to False, as we show later in the Parameters for `MultinomialOutcomeTreatment` section below.

## Using the Prepared Data in a Model

Of course, what we really want to do with the prepared training data is to fit a model jointly with all the (recommended) variables. Let's try fitting a logistic regression model to `d_prepared`.

```python
import sklearn.linear_model
import seaborn

not_variables = ['y', 'yc', 'prediction', 'prob_on_predicted_class', 'predict', 'large', 'liminal', 'small', 'prob_on_correct_class']
model_vars = [v for v in d_prepared.columns if v not in set(not_variables)]

fitter = sklearn.linear_model.LogisticRegression(
    solver = 'saga',
    penalty = 'l2',
    C = 1,
    max_iter = 1000,
    multi_class = 'multinomial')
fitter.fit(d_prepared[model_vars], d_prepared['yc'])
```

```
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=1000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)
```

```python
# convenience functions for predicting and adding predictions to original data frame
# (function names and signatures are reconstructed from how each body uses its arguments)

def add_predictions(d_prepared, model_vars, fitter):
    # add one probability column per class, plus the most probable class and its probability
    pred = fitter.predict_proba(d_prepared[model_vars])
    classes = fitter.classes_
    d_prepared['prob_on_predicted_class'] = 0.0
    d_prepared['predict'] = None
    for i in range(len(classes)):
        cl = classes[i]
        d_prepared[cl] = pred[:, i]
        improved = d_prepared[cl] > d_prepared['prob_on_predicted_class']
        d_prepared.loc[improved, 'predict'] = cl
        d_prepared.loc[improved, 'prob_on_predicted_class'] = d_prepared.loc[improved, cl]
    return d_prepared

def add_value_by_column(d_prepared, name_column, new_column):
    # for each row, copy the value from the column named by that row's name_column entry
    vals = d_prepared[name_column].unique()
    d_prepared[new_column] = None
    for v in vals:
        matches = d_prepared[name_column]==v
        d_prepared.loc[matches, new_column] = d_prepared.loc[matches, v]
    return d_prepared
```

```python
# now predict
d_prepared = add_predictions(d_prepared, model_vars, fitter)
d_prepared = add_value_by_column(d_prepared, 'yc', 'prob_on_correct_class')
to_print=['yc', 'predict','large','liminal','small', 'prob_on_predicted_class','prob_on_correct_class']
d_prepared[to_print].head()
```

| | yc | predict | large | liminal | small | prob_on_predicted_class | prob_on_correct_class |
|---|---|---|---|---|---|---|---|
| 0 | liminal | liminal | 0.000695 | 0.996943 | 0.002362 | 0.996943 | 0.996943 |
| 1 | liminal | large | 0.601422 | 0.398002 | 0.000576 | 0.601422 | 0.398002 |
| 2 | small | small | 0.000579 | 0.002239 | 0.997182 | 0.997182 | 0.997182 |
| 3 | large | large | 0.997975 | 0.001939 | 0.000086 | 0.997975 | 0.997975 |
| 4 | liminal | liminal | 0.000241 | 0.998576 | 0.001183 | 0.998576 | 0.998576 |

Here, the columns `large`, `liminal` and `small` give the predicted probability of each target outcome and `predict` gives the predicted (most probable) class. The column `prob_on_predicted_class` returns the predicted probability of the predicted class, and `prob_on_correct_class` returns the predicted probability of the actual class.

We can compare the predictions to actual outcomes with a confusion matrix:

```python
import sklearn.metrics

print(fitter.classes_)
sklearn.metrics.confusion_matrix(d_prepared.yc, d_prepared.predict, labels=fitter.classes_)
```

```
['large' 'liminal' 'small']

array([[142,  24,   0],
       [ 15, 113,  37],
       [  0,   8, 161]])
```

In the above confusion matrix, the entry `[row, column]` gives the number of true items of `class[row]` that also have a prediction of `class[column]`. In other words, with zero-based indexing, the entry `[0, 1]` (here 24) gives the number of 'large' items predicted to be 'liminal'.

Now apply the model to new data.

```python
# create the new data
dtest = make_data(450)

# prepare the new data with vtreat
dtest_prepared = transform.transform(dtest)

# apply the model to the prepared data
dtest_prepared = add_predictions(dtest_prepared, model_vars, fitter)
dtest_prepared = add_value_by_column(dtest_prepared, 'yc', 'prob_on_correct_class')
dtest_prepared[to_print].head()
```

| | yc | predict | large | liminal | small | prob_on_predicted_class | prob_on_correct_class |
|---|---|---|---|---|---|---|---|
| 0 | liminal | small | 0.000281 | 0.314538 | 0.685181 | 0.685181 | 0.314538 |
| 1 | small | small | 0.000560 | 0.002010 | 0.997430 | 0.997430 | 0.99743 |
| 2 | liminal | liminal | 0.000482 | 0.997910 | 0.001607 | 0.997910 | 0.99791 |
| 3 | small | small | 0.000358 | 0.003263 | 0.996379 | 0.996379 | 0.996379 |
| 4 | small | small | 0.000335 | 0.004111 | 0.995554 | 0.995554 | 0.995554 |

```python
print(fitter.classes_)
sklearn.metrics.confusion_matrix(dtest_prepared.yc, dtest_prepared.predict, labels=fitter.classes_)
```

```
['large' 'liminal' 'small']

array([[126,  25,   0],
       [ 17, 104,  32],
       [  0,   2, 144]])
```
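
The two confusion matrices can be summarized numerically. The sketch below re-enters the counts printed above by hand and computes overall accuracy and per-class recall with `numpy`; the helper `summarize` and the variable names are assumptions for illustration, not part of `vtreat` or `sklearn`.

```python
import numpy

classes = ['large', 'liminal', 'small']

# rows are true classes, columns are predicted classes (counts copied from above)
train_cm = numpy.array([[142, 24, 0], [15, 113, 37], [0, 8, 161]])
test_cm = numpy.array([[126, 25, 0], [17, 104, 32], [0, 2, 144]])

def summarize(cm):
    # accuracy: correctly classified items (the diagonal) over all items
    accuracy = numpy.trace(cm) / cm.sum()
    # recall per class: diagonal entry over the row total of true items
    recall = numpy.diag(cm) / cm.sum(axis=1)
    return accuracy, dict(zip(classes, recall))

train_accuracy, train_recall = summarize(train_cm)
test_accuracy, test_recall = summarize(test_cm)
print('training accuracy:', train_accuracy, train_recall)
print('test accuracy:    ', test_accuracy, test_recall)
```

The accuracy on the new data is close to the training accuracy, and in both matrices the 'liminal' class, which borders both other classes, is the one most often confused.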

## Parameters for `MultinomialOutcomeTreatment`

We've tried to set the defaults for all parameters so that `vtreat` is usable out of the box for most applications.

```python
vtreat.vtreat_parameters()
```

```
{'use_hierarchical_estimate': True,
 'coders': {'clean_copy',
  'deviation_code',
  'impact_code',
  'indicator_code',
  'logit_code',
  'missing_indicator',
  'prevalence_code'},
 'filter_to_recommended': True,
 'indicator_min_fraction': 0.1,
 'cross_validation_plan': <vtreat.cross_plan.KWayCrossPlan at 0x1a193c5dd8>,
 'cross_validation_k': 5,
 'user_transforms': [],
 'sparse_indicators': True}
```

use_hierarchical_estimate: When True, uses hierarchical smoothing when estimating `logit_code` variables; when False, uses unsmoothed logistic regression.

coders: The types of synthetic variables that `vtreat` will (potentially) produce. See Types of prepared variables below.

filter_to_recommended: When True, prepared data only includes variables marked as "recommended" in score frame. When False, prepared data includes all variables. See the Example below.

indicator_min_fraction: For categorical variables, indicator variables (type `indicator_code`) are only produced for levels that are present at least `indicator_min_fraction` of the time. A consequence of this is that 1/`indicator_min_fraction` is the maximum number of indicators that will be produced for a given categorical variable. To make sure that all possible indicator variables are produced, set `indicator_min_fraction = 0`.
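
The effect of this parameter can be checked against the example data with a few lines of plain Python. The sketch below applies the "at least `indicator_min_fraction` of the time" rule to the level counts of `xc` reported earlier, treating missing values as the level `_NA_`; it is a paraphrase of the rule as described here, not `vtreat`'s code, and the names `level_counts` and `kept` are assumptions.

```python
# level counts of xc from the value_counts() output above (NaN written as '_NA_')
level_counts = {
    'level_1.0': 123, '_NA_': 107, 'level_-0.5': 105,
    'level_0.5': 91, 'level_0.0': 40, 'level_-0.0': 34,
}
n = sum(level_counts.values())
indicator_min_fraction = 0.1

# a level gets an indicator only if it covers at least this fraction of rows
kept = [lev for lev, k in level_counts.items() if k / n >= indicator_min_fraction]
print(kept)

# no more than 1/indicator_min_fraction levels can each cover that fraction
max_indicators = int(1 / indicator_min_fraction)
print(len(kept), '<=', max_indicators)
```

The four levels that survive are the same four that appear as `xc_lev_*` variables in the score frame; `level_0.0` and `level_-0.0`, at 8% and 6.8% of rows, fall below the cutoff.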

cross_validation_plan: The cross validation method used by `vtreat`. Most people won't have to change this.

cross_validation_k: The number of folds to use for cross-validation.

user_transforms: For passing in user-defined transforms for custom data preparation. Won't be needed in most situations, but see here for an example of applying a GAM transform to input variables.

sparse_indicators: When True, use a (Pandas) sparse representation for indicator variables. This representation is compatible with `sklearn`; however, it may not be compatible with other modeling packages. When False, use a dense representation.

### Example: Use all variables to model, not just recommended

```python
transform_all = vtreat.MultinomialOutcomeTreatment(
    outcome_name='yc',    # outcome variable
    cols_to_copy=['y'],   # columns to "carry along" but not treat as input variables
    params = vtreat.vtreat_parameters({
        'filter_to_recommended': False
    })
)

# the variable columns in the transformed data
omit = ['x', 'y','yc']
columns = transform_all.fit_transform(d, d['yc']).columns
the_vars = list(set(columns)-set(omit))
the_vars.sort()
the_vars
```

```
['x2',
 'xc_lev__NA_',
 'xc_lev_level_-0.5',
 'xc_lev_level_0.5',
 'xc_lev_level_1.0',
 'xc_logit_code_large',
 'xc_logit_code_liminal',
 'xc_logit_code_small',
 'xc_prevalence_code']
```

```python
# the variables marked "recommended" by the transform
score_frame = transform_all.score_frame_
recommended = list(score_frame.variable[score_frame.recommended].unique())
recommended.sort()
recommended
```

```
['x2',
 'xc_lev__NA_',
 'xc_lev_level_-0.5',
 'xc_lev_level_0.5',
 'xc_lev_level_1.0',
 'xc_logit_code_large',
 'xc_logit_code_liminal',
 'xc_logit_code_small',
 'xc_prevalence_code']
```

Note that the prepared data produced by `fit_transform()` includes all the variables, including those that were not marked as "recommended" (if any).

## Types of prepared variables

clean_copy: Produced from numerical variables: a clean numerical variable with no `NaN`s or missing values

indicator_code: Produced from categorical variables, one for each (common) level: for each level of the variable, indicates if that level was "on"

prevalence_code: Produced from categorical variables: indicates how often each level of the variable was "on"

logit_code: Produced from categorical variables: score from a one-dimensional "one versus rest" model of the centered output as a function of the variable. One `logit_code` variable is produced for each target class.

missing_indicator: Produced for both numerical and categorical variables: an indicator variable that marks when the original variable was missing or `NaN`

deviation_code: not used by `MultinomialOutcomeTreatment`

impact_code: not used by `MultinomialOutcomeTreatment`

### Example: Produce only a subset of variable types

In this example, suppose you only want to use indicators and continuous variables in your model; in other words, you only want to use variables of types (`clean_copy`, `missing_indicator`, and `indicator_code`), and no `logit_code` or `prevalence_code` variables.

```python
transform_thin = vtreat.MultinomialOutcomeTreatment(
    outcome_name='yc',    # outcome variable
    cols_to_copy=['y'],   # columns to "carry along" but not treat as input variables
    params = vtreat.vtreat_parameters({
        'filter_to_recommended': False,
        'coders': {'clean_copy',
                   'missing_indicator',
                   'indicator_code',
                  }
    })
)

transform_thin.fit_transform(d, d['yc']).head()
```

| | y | yc | x_is_bad | xc_is_bad | x | x2 | xc_lev_level_1.0 | xc_lev__NA_ | xc_lev_level_-0.5 | xc_lev_level_0.5 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.039133 | liminal | 0.0 | 0.0 | 3.074277 | 0.379897 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.252459 | liminal | 0.0 | 0.0 | 6.632459 | -1.204097 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | -0.889266 | small | 0.0 | 1.0 | 5.129012 | -0.048225 | 0.0 | 1.0 | 0.0 | 0.0 |
| 3 | 0.816912 | large | 1.0 | 0.0 | 0.582386 | -0.767512 | 1.0 | 0.0 | 0.0 | 0.0 |
| 4 | -0.145631 | liminal | 1.0 | 0.0 | 0.582386 | 1.211516 | 0.0 | 0.0 | 0.0 | 0.0 |

```python
transform_thin.score_frame_
```

| | variable | orig_variable | treatment | y_aware | has_range | PearsonR | significance | vcount | default_threshold | recommended | outcome_target |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x_is_bad | x | missing_indicator | False | True | 0.024435 | 5.856828e-01 | 2.0 | 0.166667 | False | large |
| 1 | xc_is_bad | xc | missing_indicator | False | True | -0.367855 | 1.817194e-17 | 2.0 | 0.166667 | True | large |
| 2 | x | x | clean_copy | False | True | -0.019163 | 6.690388e-01 | 2.0 | 0.166667 | False | large |
| 3 | x2 | x2 | clean_copy | False | True | -0.072579 | 1.050180e-01 | 2.0 | 0.166667 | True | large |
| 4 | xc_lev_level_1.0 | xc | indicator_code | False | True | 0.810216 | 1.274440e-117 | 4.0 | 0.083333 | True | large |
| 5 | xc_lev__NA_ | xc | indicator_code | False | True | -0.367855 | 1.817194e-17 | 4.0 | 0.083333 | True | large |
| 6 | xc_lev_level_-0.5 | xc | indicator_code | False | True | -0.363477 | 4.614443e-17 | 4.0 | 0.083333 | True | large |
| 7 | xc_lev_level_0.5 | xc | indicator_code | False | True | 0.140755 | 1.603795e-03 | 4.0 | 0.083333 | True | large |
| 8 | x_is_bad | x | missing_indicator | False | True | 0.061181 | 1.719657e-01 | 2.0 | 0.166667 | False | liminal |
| 9 | xc_is_bad | xc | missing_indicator | False | True | -0.366197 | 2.590356e-17 | 2.0 | 0.166667 | True | liminal |
| 10 | x | x | clean_copy | False | True | 0.013960 | 7.555017e-01 | 2.0 | 0.166667 | False | liminal |
| 11 | x2 | x2 | clean_copy | False | True | 0.094836 | 3.399997e-02 | 2.0 | 0.166667 | True | liminal |
| 12 | xc_lev_level_1.0 | xc | indicator_code | False | True | -0.400868 | 1.003896e-20 | 4.0 | 0.083333 | True | liminal |
| 13 | xc_lev__NA_ | xc | indicator_code | False | True | -0.366197 | 2.590356e-17 | 4.0 | 0.083333 | True | liminal |
| 14 | xc_lev_level_-0.5 | xc | indicator_code | False | True | 0.087196 | 5.134260e-02 | 4.0 | 0.083333 | True | liminal |
| 15 | xc_lev_level_0.5 | xc | indicator_code | False | True | 0.198094 | 8.092185e-06 | 4.0 | 0.083333 | True | liminal |
| 16 | x_is_bad | x | missing_indicator | False | True | -0.085144 | 5.709636e-02 | 2.0 | 0.166667 | True | small |
| 17 | xc_is_bad | xc | missing_indicator | False | True | 0.730241 | 1.951693e-84 | 2.0 | 0.166667 | True | small |
| 18 | x | x | clean_copy | False | True | 0.005201 | 9.076452e-01 | 2.0 | 0.166667 | False | small |
| 19 | x2 | x2 | clean_copy | False | True | -0.022014 | 6.233712e-01 | 2.0 | 0.166667 | False | small |
| 20 | xc_lev_level_1.0 | xc | indicator_code | False | True | -0.408142 | 1.710758e-21 | 4.0 | 0.083333 | True | small |
| 21 | xc_lev__NA_ | xc | indicator_code | False | True | 0.730241 | 1.951693e-84 | 4.0 | 0.083333 | True | small |
| 22 | xc_lev_level_-0.5 | xc | indicator_code | False | True | 0.275188 | 3.868707e-10 | 4.0 | 0.083333 | True | small |
| 23 | xc_lev_level_0.5 | xc | indicator_code | False | True | -0.337045 | 9.522317e-15 | 4.0 | 0.083333 | True | small |

## Deriving the Default Thresholds

While machine learning algorithms are generally tolerant to a reasonable number of irrelevant or noise variables, too many irrelevant variables can lead to serious overfit; see this article for an extreme example, one we call "Bad Bayes". The default threshold is an attempt to eliminate obviously irrelevant variables early.

Imagine that you have a pure noise dataset, where none of the n inputs are related to the output. If you treat each variable as a one-variable model for the output, and look at the significances of each model, these significance values will be uniformly distributed in the range [0, 1]. You want to pick the weakest possible significance threshold that eliminates as many noise variables as possible. A moment's thought should convince you that a threshold of 1/n allows only one variable through, in expectation.

This leads to the general-case heuristic that a significance threshold of 1/n on your variables should allow only one irrelevant variable through, in expectation (along with all the relevant variables). Hence, 1/n used to be our recommended threshold, when we developed the R version of `vtreat`.
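
The expectation argument is easy to check by simulation. The sketch below takes the uniformity of noise significances as given (it draws the significance values directly rather than fitting models) and counts how many of `n_vars` noise variables pass a 1/n threshold; `n_vars`, `n_trials`, and the seed are arbitrary choices made for illustration.

```python
import numpy

rng = numpy.random.default_rng(0)
n_vars = 100       # number of pure-noise variables (assumed)
n_trials = 10000   # number of simulated datasets (assumed)

# under the noise assumption, each variable's significance is uniform on [0, 1]
p_values = rng.uniform(size=(n_trials, n_vars))

# count the noise variables that pass a threshold of 1/n in each dataset
passed = (p_values < 1.0 / n_vars).sum(axis=1)
mean_passed = passed.mean()
print('average number of noise variables passing:', mean_passed)
```

The average is close to one per dataset, although any single dataset may let through zero, one, or a few noise variables.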

We noticed, however, that this biases the filtering against numerical variables, since there are at most two derived variables (of types `clean_copy` and `missing_indicator`) for every numerical variable in the original data. Categorical variables, on the other hand, are expanded to many derived variables: several indicators (one for every common level), plus a `logit_code` and a `prevalence_code`. So we now reweight the thresholds.

Suppose you have a (treated) data set with `ntreat` different types of `vtreat` variables (`clean_copy`, `indicator_code`, etc.), and that there are `nT` variables of type `T`. Then the default threshold for all the variables of type `T` is `1/(ntreat * nT)`. This reweighting helps to reduce the bias against any particular type of variable. The heuristic is still that the set of recommended variables will allow at most one noise variable into the set of candidate variables.
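
As a worked check, the formula can be applied to the variable types of the first score frame in this example: two `missing_indicator`, two `clean_copy`, three `logit_code`, one `prevalence_code`, and four `indicator_code` variables. The sketch below uses only the standard library; the names `n_per_type`, `n_treat`, and `thresholds` are illustrative.

```python
from collections import Counter

# treatment type of each derived variable in the first score frame above
treatments = (['missing_indicator'] * 2 + ['clean_copy'] * 2 +
              ['logit_code'] * 3 + ['prevalence_code'] * 1 +
              ['indicator_code'] * 4)

n_per_type = Counter(treatments)   # number of variables of each type
n_treat = len(n_per_type)          # number of distinct types

# default threshold for every variable of a type: 1 / (n_treat * n_type)
thresholds = {t: 1.0 / (n_treat * k) for t, k in n_per_type.items()}
for t, thr in thresholds.items():
    print(t, round(thr, 6))
```

These values agree with the `default_threshold` column of that score frame (0.1, 0.1, 0.066667, 0.2, and 0.05). With only three coder types, as in the `transform_thin` example, the same formula gives 1/6 and 1/12.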

As noted above, because `vtreat` estimates variable significances using linear methods by default, some variables with a non-linear relationship to the output may fail to pass the threshold. Setting the `filter_to_recommended` parameter to False will keep all derived variables in the treated frame, for the data scientist to filter (or not) as they will.
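
The example variable `x` shows how this can happen. The sketch below regenerates data in the same way as `make_data` (with a fixed seed and a larger sample, both assumptions made here) and compares the linear correlation of `y` with `x` against its correlation with `sin(x)`.

```python
import numpy

rng = numpy.random.default_rng(1)
n = 5000

# same construction as in make_data: y is a noisy sinusoidal function of x
x = 5 * rng.normal(size=n)
y = numpy.sin(x) + 0.1 * rng.normal(size=n)

# a linear (Pearson) view of x sees almost nothing
linear_corr = numpy.corrcoef(x, y)[0, 1]

# the same information, passed through the right non-linear map, is highly predictive
sin_corr = numpy.corrcoef(numpy.sin(x), y)[0, 1]

print('correlation of y with x:     ', linear_corr)
print('correlation of y with sin(x):', sin_corr)
```

A significance test based on the linear fit would therefore tend to reject `x`, even though a model able to represent the sinusoid could use it well.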

## Conclusion

In all cases (classification, regression, unsupervised, and multinomial classification) the intent is that `vtreat` transforms are essentially one liners.

The preparation commands are organized as follows:

These current revisions of the examples are designed to be small, yet complete. So as a set they have some overlap, but the user can rely mostly on a single example for a single task type.
