Nina Zumel and John Mount November 2019
Note: this is a description of the Python version of vtreat; the same example for the R version of vtreat can be found here.
Load modules/packages.
import pkg_resources
import pandas
import numpy
import numpy.random
import seaborn
import matplotlib.pyplot as plt
import matplotlib.pyplot
import vtreat
import vtreat.util
import wvpy.util
numpy.random.seed(2019)
Generate example data.
- y is a noisy sinusoidal plus linear function of the variable x, and is the output to be predicted
- Input xc is a categorical variable that represents a discretization of y, along with some NaNs
- Input x2 is a pure noise variable with no relationship to the output
def make_data(nrows):
    d = pandas.DataFrame({'x': 5*numpy.random.normal(size=nrows)})
    d['y'] = numpy.sin(d['x']) + 0.01*d['x'] + 0.1*numpy.random.normal(size=nrows)
    d.loc[numpy.arange(3, 10), 'x'] = numpy.nan  # introduce missing values in x
    d['xc'] = ['level_' + str(5*numpy.round(yi/5, 1)) for yi in d['y']]
    d['x2'] = numpy.random.normal(size=nrows)
    d.loc[d['xc']=='level_-1.0', 'xc'] = numpy.nan  # introduce a nan level
    return d
d = make_data(500)
d.head()
| | x | y | xc | x2 |
---|---|---|---|---|
0 | -1.088395 | -0.967195 | NaN | -1.424184 |
1 | 4.107277 | -0.630491 | level_-0.5 | 0.427360 |
2 | 7.406389 | 0.980367 | level_1.0 | 0.668849 |
3 | NaN | 0.289385 | level_0.5 | -0.015787 |
4 | NaN | -0.993524 | NaN | -0.491017 |
Check how many levels xc has, and their distribution (including NaN):
d['xc'].unique()
array([nan, 'level_-0.5', 'level_1.0', 'level_0.5', 'level_-0.0',
'level_0.0', 'level_1.5'], dtype=object)
d['xc'].value_counts(dropna=False)
level_1.0 133
NaN 105
level_-0.5 104
level_0.5 80
level_-0.0 40
level_0.0 36
level_1.5 2
Name: xc, dtype: int64
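The output above lists level_-0.0 and level_0.0 as separate levels. This appears to follow from the labeling expression in make_data(): rounding a small negative y gives a floating-point negative zero, which prints as -0.0. A minimal sketch, assuming the same expression (the helper name level_label is ours, introduced only for illustration):

```python
import numpy


def level_label(yi):
    """Build a level name the same way make_data() does for the xc column."""
    return 'level_' + str(5 * numpy.round(yi / 5, 1))


# Small negative and small positive outcomes land in differently named levels,
# because the rounded value keeps its sign (negative zero vs. positive zero).
labels = [level_label(v) for v in [-0.04, 0.04, 0.3, 0.97]]
print(labels)
```

If a single zero level is preferred, adding 0.0 to the rounded value before formatting is one possible way to normalize the sign.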
Find the mean value of y
numpy.mean(d['y'])
0.03137290675104681
Plot of y versus x.
seaborn.lineplot(x='x', y='y', data=d)
Now that we have the data, we want to treat it prior to modeling: we want training data where all the input variables are numeric and have no missing values or NaNs.
First create the data treatment transform object, in this case a treatment for a regression problem.
transform = vtreat.NumericOutcomeTreatment(
    outcome_name='y',  # outcome variable
)
Notice that for the training data d, transform.fit_transform(d, d['y']) is not the same as transform.fit(d, d['y']).transform(d); the second call can lead to nested model bias in some situations, and is not recommended.
For other, later data o, not seen during transform design, transform.transform(o) is an appropriate step.
Use the training data d to fit the transform and then return a treated training set: completely numeric, with no missing values.
d_prepared = transform.fit_transform(d, d['y'])
Now examine the score frame, which gives information about each new variable, including its type, which original variable it is derived from, its (cross-validated) correlation with the outcome, and its (cross-validated) significance as a one-variable linear model for the outcome.
transform.score_frame_
| | variable | orig_variable | treatment | y_aware | has_range | PearsonR | R2 | significance | vcount | default_threshold | recommended |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | x_is_bad | x | missing_indicator | False | True | -0.042128 | 0.001775 | 3.471792e-01 | 2.0 | 0.083333 | False |
1 | xc_is_bad | xc | missing_indicator | False | True | -0.668393 | 0.446749 | 5.183376e-66 | 2.0 | 0.083333 | True |
2 | x | x | clean_copy | False | True | 0.098345 | 0.009672 | 2.788603e-02 | 2.0 | 0.083333 | True |
3 | x2 | x2 | clean_copy | False | True | 0.097028 | 0.009414 | 3.005973e-02 | 2.0 | 0.083333 | True |
4 | xc_impact_code | xc | impact_code | True | True | 0.980039 | 0.960476 | 0.000000e+00 | 1.0 | 0.166667 | True |
5 | xc_deviation_code | xc | deviation_code | True | True | 0.037635 | 0.001416 | 4.010596e-01 | 1.0 | 0.166667 | False |
6 | xc_prevalence_code | xc | prevalence_code | False | True | 0.217891 | 0.047476 | 8.689113e-07 | 1.0 | 0.166667 | True |
7 | xc_lev_level_1_0 | xc | indicator_code | False | True | 0.750882 | 0.563824 | 8.969185e-92 | 4.0 | 0.041667 | True |
8 | xc_lev__NA_ | xc | indicator_code | False | True | -0.668393 | 0.446749 | 5.183376e-66 | 4.0 | 0.041667 | True |
9 | xc_lev_level_-0_5 | xc | indicator_code | False | True | -0.392501 | 0.154057 | 7.287692e-20 | 4.0 | 0.041667 | True |
10 | xc_lev_level_0_5 | xc | indicator_code | False | True | 0.282261 | 0.079671 | 1.302347e-10 | 4.0 | 0.041667 | True |
Note that the variable xc has been converted to multiple variables:
- an indicator variable for each common possible level (xc_lev_level_*)
- the value of a (cross-validated) one-variable model for y as a function of xc (xc_impact_code)
- a variable indicating when xc was NaN in the original data (xc_is_bad)
- a variable that returns how prevalent this particular value of xc is in the training data (xc_prevalence_code)
- a variable that returns the standard deviation of y conditioned on xc (xc_deviation_code)
Any or all of these new variables are available for downstream modeling.
The recommended column indicates which variables are non-constant (has_range == True) and have a significance value smaller than default_threshold. See the section Deriving the Default Thresholds below for the reasoning behind the default thresholds. Recommended columns are intended as advice about which variables appear most likely to be useful in a downstream model. This advice attempts to be conservative, to reduce the possibility of mistakenly eliminating variables that may in fact be useful (although it can still mistakenly eliminate variables that have a real but non-linear relationship to the output).
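The rule described above can be checked by hand. The sketch below copies a few rows from the score frame shown earlier into a small pandas data frame and recomputes the flag; the column name recommended_check is ours, and this is only an illustration of the stated rule, not vtreat's internal code.

```python
import pandas

# A few rows copied from the score frame shown above.
score_frame = pandas.DataFrame({
    'variable': ['x_is_bad', 'x', 'x2', 'xc_impact_code', 'xc_deviation_code'],
    'has_range': [True, True, True, True, True],
    'significance': [3.471792e-01, 2.788603e-02, 3.005973e-02, 0.0, 4.010596e-01],
    'default_threshold': [0.083333, 0.083333, 0.083333, 0.166667, 0.166667],
})

# The rule as stated in the text: non-constant, and significance below the threshold.
score_frame['recommended_check'] = (
    score_frame['has_range']
    & (score_frame['significance'] < score_frame['default_threshold'])
)

recommended_vars = list(score_frame.loc[score_frame['recommended_check'], 'variable'])
print(recommended_vars)
```

Note that the pure-noise variable x2 passes the filter in this run: the rule is deliberately permissive, so an occasional noise variable getting through is expected.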
Let's look at the recommended and not recommended variables:
# recommended variables
transform.score_frame_['variable'][transform.score_frame_['recommended']]
1 xc_is_bad
2 x
3 x2
4 xc_impact_code
6 xc_prevalence_code
7 xc_lev_level_1_0
8 xc_lev__NA_
9 xc_lev_level_-0_5
10 xc_lev_level_0_5
Name: variable, dtype: object
# not recommended variables
transform.score_frame_['variable'][transform.score_frame_['recommended']==False]
0 x_is_bad
5 xc_deviation_code
Name: variable, dtype: object
Let's look at the top of d_prepared. Notice that the new treated data frame includes only recommended variables (along with y).
d_prepared.head()
| | y | xc_is_bad | x | x2 | xc_impact_code | xc_prevalence_code | xc_lev_level_1_0 | xc_lev__NA_ | xc_lev_level_-0_5 | xc_lev_level_0_5 |
---|---|---|---|---|---|---|---|---|---|---|
0 | -0.967195 | 1.0 | -1.088395 | -1.424184 | -0.949807 | 0.210 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | -0.630491 | 0.0 | 4.107277 | 0.427360 | -0.560172 | 0.208 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.980367 | 0.0 | 7.406389 | 0.668849 | 0.921087 | 0.266 | 1.0 | 0.0 | 0.0 | 0.0 |
3 | 0.289385 | 0.0 | -0.057044 | -0.015787 | 0.495057 | 0.160 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | -0.993524 | 1.0 | -0.057044 | -0.491017 | -0.951141 | 0.210 | 0.0 | 1.0 | 0.0 | 0.0 |
This is vtreat's default behavior; to include all variables in the prepared data, set the parameter filter_to_recommended to False, as we show later, in the Parameters for NumericOutcomeTreatment section below.
Variables of type impact_code are the outputs of a one-variable hierarchical linear regression of a categorical variable (in our example, xc) against the centered output on the (cross-validated) treated training data.
Let's look at the relationship between xc_impact_code and y (actually y_centered, a centered version of y).
d_prepared['y_centered'] = d_prepared.y - d_prepared.y.mean()
g = seaborn.jointplot(x="xc_impact_code", y="y_centered", data=d_prepared, kind="scatter")
# add the line "x = y"
g.ax_joint.plot(d_prepared.xc_impact_code, d_prepared.xc_impact_code, ':k')
plt.title('Relationship between xc_impact_code and y', fontsize = 16)
plt.show()
This indicates that xc_impact_code is strongly predictive of the outcome. Note that the score frame also reported the Pearson correlation between xc_impact_code and y, which is fairly large.
transform.score_frame_.PearsonR[transform.score_frame_.variable=='xc_impact_code']
4 0.980039
Name: PearsonR, dtype: float64
Note also that the impact code values are jittered; this is because d_prepared is a "cross-frame": that is, the result of a cross-validated estimation process. Hence, the impact coding of xc is a function of both the value of xc and the cross-validation fold of the datum's row. When transform is applied to new data, there will be only one value of impact code for each (common) level of xc. We can see this by applying the transform to the data frame d as if it were new data.
dtmp = transform.transform(d)
dtmp['y_centered'] = dtmp.y - dtmp.y.mean()
ax = seaborn.scatterplot(x = 'xc_impact_code', y = 'y_centered', data = dtmp)
# add the line "x = y"
matplotlib.pyplot.plot(dtmp.xc_impact_code, dtmp.xc_impact_code, color="darkgray")
ax.set_title('Relationship between xc_impact_code and y, on improperly prepared training data')
plt.show()
UserWarning: possibly called transform on same data used to fit
(this causes over-fit, please use fit_transform() instead)
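The difference between a cross-frame and a plain transform can be illustrated without vtreat. The sketch below is a simplified stand-in, not vtreat's implementation: it uses unsmoothed per-level means of the centered outcome and a simple deterministic fold assignment, whereas vtreat by default uses hierarchical smoothing and its own cross-validation plan. The function names are ours.

```python
import numpy
import pandas


def level_means(train, cat_col, y_col):
    """Mean of the centered outcome for each level, estimated on `train` only."""
    centered = train[y_col] - train[y_col].mean()
    return centered.groupby(train[cat_col]).mean()


def naive_code(frame, cat_col, y_col):
    """One value per level, estimated from all rows (what later data would receive)."""
    return frame[cat_col].map(level_means(frame, cat_col, y_col))


def cross_code(frame, cat_col, y_col, k=3):
    """Code each row using level means estimated without that row's fold."""
    fold = numpy.arange(len(frame)) % k  # simple deterministic fold assignment
    out = pandas.Series(numpy.nan, index=frame.index)
    for f in range(k):
        held_out = fold == f
        means = level_means(frame[~held_out], cat_col, y_col)
        out.loc[held_out] = frame.loc[held_out, cat_col].map(means).to_numpy()
    return out


toy = pandas.DataFrame({'xc': ['a', 'a', 'a', 'b', 'b', 'b'],
                        'y': [1.0, 2.0, 3.0, 7.0, 9.0, 14.0]})
naive = naive_code(toy, 'xc', 'y')
cross = cross_code(toy, 'xc', 'y')
print(naive.tolist())
print(cross.tolist())
```

The naive code has exactly one value per level, while the cross-validated code varies from row to row within a level. Each row's code is estimated without using that row's own outcome, which is how the cross-frame guards against nested model bias.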
Variables of type impact_code are useful when dealing with categorical variables with a very large number of possible levels. For example, a categorical variable with 10,000 possible values potentially converts to 10,000 indicator variables, which may be unwieldy for some modeling methods. Using a single numerical variable of type impact_code may be a preferable alternative.
Of course, what we really want to do with the prepared training data is to fit a model jointly with all the (recommended) variables.
Let's try fitting a linear regression model to d_prepared.
import sklearn.linear_model
import seaborn
import sklearn.metrics
not_variables = ['y', 'y_centered', 'prediction']
model_vars = [v for v in d_prepared.columns if v not in set(not_variables)]
fitter = sklearn.linear_model.LinearRegression()
fitter.fit(d_prepared[model_vars], d_prepared['y'])
# now predict
d_prepared['prediction'] = fitter.predict(d_prepared[model_vars])
# get R-squared
r2 = sklearn.metrics.r2_score(y_true=d_prepared.y, y_pred=d_prepared.prediction)
title = 'Prediction vs. outcome (training data); R-sqr = {:04.2f}'.format(r2)
# compare the predictions to the outcome (on the training data)
ax = seaborn.scatterplot(x='prediction', y='y', data=d_prepared)
matplotlib.pyplot.plot(d_prepared.prediction, d_prepared.prediction, color="darkgray")
ax.set_title(title)
plt.show()
Now apply the model to new data.
# create the new data
dtest = make_data(450)
# prepare the new data with vtreat
dtest_prepared = transform.transform(dtest)
# apply the model to the prepared data
dtest_prepared['prediction'] = fitter.predict(dtest_prepared[model_vars])
# get R-squared
r2 = sklearn.metrics.r2_score(y_true=dtest_prepared.y, y_pred=dtest_prepared.prediction)
title = 'Prediction vs. outcome (test data); R-sqr = {:04.2f}'.format(r2)
# compare the predictions to the outcome (on the test data)
ax = seaborn.scatterplot(x='prediction', y='y', data=dtest_prepared)
matplotlib.pyplot.plot(dtest_prepared.prediction, dtest_prepared.prediction, color="darkgray")
ax.set_title(title)
plt.show()
We've tried to set the defaults for all parameters so that vtreat is usable out of the box for most applications.
vtreat.vtreat_parameters()
{'use_hierarchical_estimate': True,
'coders': {'clean_copy',
'deviation_code',
'impact_code',
'indicator_code',
'logit_code',
'missing_indicator',
'prevalence_code'},
'filter_to_recommended': True,
'indicator_min_fraction': 0.1,
'cross_validation_plan': vtreat.cross_plan.KWayCrossPlanYStratified(),
'cross_validation_k': 5,
'user_transforms': [],
'sparse_indicators': True,
'missingness_imputation': <function numpy.mean(a, axis=None, dtype=None, out=None, keepdims=<no value>)>,
'check_for_duplicate_frames': True,
'retain_cross_plan': False}
use_hierarchical_estimate: When True, uses hierarchical smoothing when estimating impact_code variables; when False, uses unsmoothed linear regression.
coders: The types of synthetic variables that vtreat will (potentially) produce. See Types of prepared variables below.
filter_to_recommended: When True, prepared data only includes variables marked as "recommended" in the score frame. When False, prepared data includes all variables. See the Example below.
indicator_min_fraction: For categorical variables, indicator variables (type indicator_code) are only produced for levels that are present at least indicator_min_fraction of the time. A consequence of this is that 1/indicator_min_fraction is the maximum number of indicators that will be produced for a given categorical variable. To make sure that all possible indicator variables are produced, set indicator_min_fraction = 0.
cross_validation_plan: The cross validation method used by vtreat. Most people won't have to change this.
cross_validation_k: The number of folds to use for cross-validation.
user_transforms: For passing in user-defined transforms for custom data preparation. Won't be needed in most situations, but see here for an example of applying a GAM transform to input variables.
sparse_indicators: When True, use a (Pandas) sparse representation for indicator variables. This representation is compatible with sklearn; however, it may not be compatible with other modeling packages. When False, use a dense representation.
missingness_imputation: The function or value that vtreat uses to impute or "fill in" missing numerical values. The default is numpy.mean(). To change the imputation function or use different functions/values for different columns, see the Imputation example.
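To make the effect of indicator_min_fraction concrete, the sketch below applies the default value of 0.1 to the level counts of xc reported earlier in this example. It is plain Python, not vtreat code, and it assumes the comparison is "at least" as the description above states; the variable names are ours.

```python
# Level counts copied from the value_counts() output earlier in this example.
level_counts = {
    'level_1.0': 133, 'NaN': 105, 'level_-0.5': 104, 'level_0.5': 80,
    'level_-0.0': 40, 'level_0.0': 36, 'level_1.5': 2,
}
indicator_min_fraction = 0.1

n_rows = sum(level_counts.values())
# Keep only the levels present at least indicator_min_fraction of the time.
kept_levels = [lev for lev, cnt in level_counts.items()
               if cnt / n_rows >= indicator_min_fraction]
# At most 1/indicator_min_fraction levels can clear the bar at the same time.
max_indicators = round(1 / indicator_min_fraction)

print(kept_levels)
print(max_indicators)
```

The four surviving levels match the four indicator_code variables in the score frame (xc_lev_level_1_0, xc_lev__NA_, xc_lev_level_-0_5, xc_lev_level_0_5); the three rarer levels receive no indicator of their own.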
transform_all = vtreat.NumericOutcomeTreatment(
    outcome_name='y',  # outcome variable
    params = vtreat.vtreat_parameters({
        'filter_to_recommended': False
    })
)
transform_all.fit_transform(d, d['y']).columns
Index(['y', 'x_is_bad', 'xc_is_bad', 'x', 'x2', 'xc_impact_code',
'xc_deviation_code', 'xc_prevalence_code', 'xc_lev_level_1_0',
'xc_lev__NA_', 'xc_lev_level_-0_5', 'xc_lev_level_0_5'],
dtype='object')
transform_all.score_frame_
| | variable | orig_variable | treatment | y_aware | has_range | PearsonR | R2 | significance | vcount | default_threshold | recommended |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | x_is_bad | x | missing_indicator | False | True | -0.042128 | 0.001775 | 3.471792e-01 | 2.0 | 0.083333 | False |
1 | xc_is_bad | xc | missing_indicator | False | True | -0.668393 | 0.446749 | 5.183376e-66 | 2.0 | 0.083333 | True |
2 | x | x | clean_copy | False | True | 0.098345 | 0.009672 | 2.788603e-02 | 2.0 | 0.083333 | True |
3 | x2 | x2 | clean_copy | False | True | 0.097028 | 0.009414 | 3.005973e-02 | 2.0 | 0.083333 | True |
4 | xc_impact_code | xc | impact_code | True | True | 0.980056 | 0.960511 | 0.000000e+00 | 1.0 | 0.166667 | True |
5 | xc_deviation_code | xc | deviation_code | True | True | 0.040940 | 0.001676 | 3.609607e-01 | 1.0 | 0.166667 | False |
6 | xc_prevalence_code | xc | prevalence_code | False | True | 0.217891 | 0.047476 | 8.689113e-07 | 1.0 | 0.166667 | True |
7 | xc_lev_level_1_0 | xc | indicator_code | False | True | 0.750882 | 0.563824 | 8.969185e-92 | 4.0 | 0.041667 | True |
8 | xc_lev__NA_ | xc | indicator_code | False | True | -0.668393 | 0.446749 | 5.183376e-66 | 4.0 | 0.041667 | True |
9 | xc_lev_level_-0_5 | xc | indicator_code | False | True | -0.392501 | 0.154057 | 7.287692e-20 | 4.0 | 0.041667 | True |
10 | xc_lev_level_0_5 | xc | indicator_code | False | True | 0.282261 | 0.079671 | 1.302347e-10 | 4.0 | 0.041667 | True |
Note that the prepared data produced by fit_transform()
includes all the variables, including those that were not marked as "recommended".
clean_copy: Produced from numerical variables: a clean numerical variable with no NaNs or missing values
indicator_code: Produced from categorical variables, one for each (common) level: for each level of the variable, indicates if that level was "on"
prevalence_code: Produced from categorical variables: indicates how often each level of the variable was "on"
deviation_code: Produced from categorical variables: standard deviation of outcome conditioned on levels of the variable
impact_code: Produced from categorical variables: score from a one-dimensional model of the output as a function of the variable
missing_indicator: Produced for both numerical and categorical variables: an indicator variable that marks when the original variable was missing or NaN
logit_code: not used by NumericOutcomeTreatment
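Several of the variable types above that do not depend on the outcome can be imitated with a few lines of pandas. The sketch below is a rough illustration on a toy frame, not vtreat's implementation; the toy data and the output names are ours, and the missing-level name _NA_ is borrowed from the xc_lev__NA_ column seen earlier.

```python
import numpy
import pandas

toy = pandas.DataFrame({
    'x': [1.0, numpy.nan, 3.0, 4.0],
    'xc': ['a', 'a', None, 'b'],
})

# missing_indicator: 1.0 where the original value was missing, else 0.0
x_is_bad = toy['x'].isna().astype(float)

# clean_copy: missing values filled in, here with the column mean (the stated default)
x_clean = toy['x'].fillna(toy['x'].mean())

# Treat "missing" as its own categorical level before coding.
xc_filled = toy['xc'].fillna('_NA_')

# indicator_code: a 0/1 column marking when a given level is "on"
xc_lev_a = (xc_filled == 'a').astype(float)

# prevalence_code: fraction of training rows that share each row's level
xc_prevalence = xc_filled.map(xc_filled.value_counts(normalize=True))

print(pandas.DataFrame({
    'x_is_bad': x_is_bad, 'x': x_clean,
    'xc_lev_a': xc_lev_a, 'xc_prevalence_code': xc_prevalence,
}))
```

This mirrors what the prepared frames in this example show: missing x values are replaced by the training mean, and xc_prevalence_code for the NaN level is 105/500 = 0.210.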
In this example, suppose you only want to use indicators and continuous variables in your model; in other words, you only want to use variables of types clean_copy, missing_indicator, and indicator_code, and no impact_code, deviation_code, or prevalence_code variables.
transform_thin = vtreat.NumericOutcomeTreatment(
    outcome_name='y',  # outcome variable
    params = vtreat.vtreat_parameters({
        'filter_to_recommended': False,
        'coders': {'clean_copy',
                   'missing_indicator',
                   'indicator_code',
                  }
    })
)
transform_thin.fit_transform(d, d['y']).head()
| | y | x_is_bad | xc_is_bad | x | x2 | xc_lev_level_1_0 | xc_lev__NA_ | xc_lev_level_-0_5 | xc_lev_level_0_5 |
---|---|---|---|---|---|---|---|---|---|
0 | -0.967195 | 0.0 | 1.0 | -1.088395 | -1.424184 | 0.0 | 1.0 | 0.0 | 0.0 |
1 | -0.630491 | 0.0 | 0.0 | 4.107277 | 0.427360 | 0.0 | 0.0 | 1.0 | 0.0 |
2 | 0.980367 | 0.0 | 0.0 | 7.406389 | 0.668849 | 1.0 | 0.0 | 0.0 | 0.0 |
3 | 0.289385 | 1.0 | 0.0 | -0.057044 | -0.015787 | 0.0 | 0.0 | 0.0 | 1.0 |
4 | -0.993524 | 1.0 | 1.0 | -0.057044 | -0.491017 | 0.0 | 1.0 | 0.0 | 0.0 |
transform_thin.score_frame_
| | variable | orig_variable | treatment | y_aware | has_range | PearsonR | R2 | significance | vcount | default_threshold | recommended |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | x_is_bad | x | missing_indicator | False | True | -0.042128 | 0.001775 | 3.471792e-01 | 2.0 | 0.166667 | False |
1 | xc_is_bad | xc | missing_indicator | False | True | -0.668393 | 0.446749 | 5.183376e-66 | 2.0 | 0.166667 | True |
2 | x | x | clean_copy | False | True | 0.098345 | 0.009672 | 2.788603e-02 | 2.0 | 0.166667 | True |
3 | x2 | x2 | clean_copy | False | True | 0.097028 | 0.009414 | 3.005973e-02 | 2.0 | 0.166667 | True |
4 | xc_lev_level_1_0 | xc | indicator_code | False | True | 0.750882 | 0.563824 | 8.969185e-92 | 4.0 | 0.083333 | True |
5 | xc_lev__NA_ | xc | indicator_code | False | True | -0.668393 | 0.446749 | 5.183376e-66 | 4.0 | 0.083333 | True |
6 | xc_lev_level_-0_5 | xc | indicator_code | False | True | -0.392501 | 0.154057 | 7.287692e-20 | 4.0 | 0.083333 | True |
7 | xc_lev_level_0_5 | xc | indicator_code | False | True | 0.282261 | 0.079671 | 1.302347e-10 | 4.0 | 0.083333 | True |
While machine learning algorithms are generally tolerant to a reasonable number of irrelevant or noise variables, too many irrelevant variables can lead to serious overfit; see this article for an extreme example, one we call "Bad Bayes". The default threshold is an attempt to eliminate obviously irrelevant variables early.
Imagine that you have a pure noise dataset, where none of the n inputs are related to the output. If you treat each variable as a one-variable model for the output, and look at the significances of each model, these significance values will be uniformly distributed in the range [0, 1]. You want to pick the weakest possible significance threshold that eliminates as many noise variables as possible. A moment's thought should convince you that a threshold of 1/n allows only one variable through, in expectation: each of the n noise variables passes with probability 1/n, so the expected number that pass is n * (1/n) = 1.
This leads to the general-case heuristic that a significance threshold of 1/n on your variables should allow only one irrelevant variable through, in expectation (along with all the relevant variables). Hence, 1/n used to be our recommended threshold, when we developed the R version of vtreat.
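The 1/n argument can be checked with a small simulation. The sketch below does not fit any models; it simply assumes, as the text does, that under pure noise the significances behave like independent Uniform(0, 1) draws, and counts how many fall below 1/n. The sizes and names are arbitrary choices of ours.

```python
import numpy

rng = numpy.random.default_rng(2019)
n_vars = 50      # number of pure-noise variables (n in the text)
n_trials = 2000  # number of simulated data sets

# Assumption: noise-variable significances are independent Uniform(0, 1) draws.
p_values = rng.uniform(size=(n_trials, n_vars))

# Count how many noise variables pass a 1/n threshold in each simulated data set.
passed_per_trial = (p_values < 1.0 / n_vars).sum(axis=1)
mean_passed = passed_per_trial.mean()

print(mean_passed)
```

The average number of noise variables passing should come out close to one, although any single data set may let through zero, one, or several.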
We noticed, however, that this biases the filtering against numerical variables, since there are at most two derived variables (of types clean_copy and missing_indicator) for every numerical variable in the original data. Categorical variables, on the other hand, are expanded to many derived variables: several indicators (one for every common level), plus a logit_code and a prevalence_code. So we now reweight the thresholds.
Suppose you have a (treated) data set with ntreat different types of vtreat variables (clean_copy, indicator_code, etc.). There are nT variables of type T. Then the default threshold for all the variables of type T is 1/(ntreat * nT). This reweighting helps to reduce the bias against any particular type of variable. The heuristic is still that the set of recommended variables will allow at most one noise variable, in expectation, into the set of candidate variables.
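The reweighted thresholds can be recomputed from the treatment column of the first score frame in this example. The sketch below is our own reconstruction from the formula in the text, not vtreat's code; it lists the eleven treatment types in the order they appear in that frame.

```python
import pandas

# Treatment type of each derived variable, in the order of the first score frame.
treatments = pandas.Series([
    'missing_indicator', 'missing_indicator', 'clean_copy', 'clean_copy',
    'impact_code', 'deviation_code', 'prevalence_code',
    'indicator_code', 'indicator_code', 'indicator_code', 'indicator_code',
])

n_treat = treatments.nunique()                   # ntreat: number of distinct types
n_T = treatments.map(treatments.value_counts())  # nT: number of variables of each row's type
default_threshold = 1.0 / (n_treat * n_T)

print(pandas.DataFrame({'treatment': treatments,
                        'default_threshold': default_threshold.round(6)}))
```

With six types, this gives 1/12 for the two missing indicators and the two clean copies, 1/6 for the single impact, deviation, and prevalence codes, and 1/24 for the four indicators, in agreement with the default_threshold column shown earlier. In the transform_thin example only three types remain, so the same formula yields 1/6 and 1/12.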
As noted above, because vtreat estimates variable significances using linear methods by default, some variables with a non-linear relationship to the output may fail to pass the threshold. Setting the filter_to_recommended parameter to False will keep all derived variables in the treated frame, for the data scientist to filter (or not) as they will.
In all cases (classification, regression, unsupervised, and multinomial classification) the intent is that vtreat transforms are essentially one-liners.
The preparation commands are organized as follows:
- Regression: Python regression example; R regression example, fit/prepare interface; R regression example, design/prepare/experiment interface.
- Classification: Python classification example; R classification example, fit/prepare interface; R classification example, design/prepare/experiment interface.
- Unsupervised tasks: Python unsupervised example; R unsupervised example, fit/prepare interface; R unsupervised example, design/prepare/experiment interface.
- Multinomial classification: Python multinomial classification example; R multinomial classification example, fit/prepare interface; R multinomial classification example, design/prepare/experiment interface.
Some vtreat common capabilities are documented here:
- Score Frame: score_frame_, using the score_frame_ information.
- Cross Validation: Customized Cross Plans, controlling the cross validation plan.
These current revisions of the examples are designed to be small, yet complete. So as a set they have some overlap, but the user can rely mostly on a single example for a single task type.