In [1]:
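import vtreat
import numpy
import numpy.random
import pandas as pd
from sklearn.datasets import load_iris

# fix the random seed for reproducibility
numpy.random.seed(2019)

iris = load_iris()

# four numeric features (columns 0..3) and a three-class outcome
X, y = pd.DataFrame(iris['data']), iris['target']

# fit the treatment plan and prepare the training frame in one step
plan = vtreat.MultinomialOutcomeTreatment()
X_new = plan.fit_transform(X, y)
# one row per (variable, outcome_target) pair
score_frame = plan.score_frame_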

In [2]:
score_frame

Unnamed: 0,variable,orig_variable,treatment,y_aware,has_range,PearsonR,significance,vcount,default_threshold,recommended,outcome_target
0,0,0,clean_copy,False,True,-0.717416,5.288768e-25,4.0,0.25,True,0
1,1,1,clean_copy,False,True,0.603348,3.054699e-16,4.0,0.25,True,0
2,2,2,clean_copy,False,True,-0.922765,3.623379e-63,4.0,0.25,True,0
3,3,3,clean_copy,False,True,-0.887344,1.288504e-51,4.0,0.25,True,0
4,0,0,clean_copy,False,True,0.079396,0.3341524,4.0,0.25,False,1
5,1,1,clean_copy,False,True,-0.467703,1.595624e-09,4.0,0.25,True,1
6,2,2,clean_copy,False,True,0.201754,0.01329302,4.0,0.25,True,1
7,3,3,clean_copy,False,True,0.117899,0.1507473,4.0,0.25,True,1
8,0,0,clean_copy,False,True,0.63802,1.619533e-18,4.0,0.25,True,2
9,1,1,clean_copy,False,True,-0.135645,0.0979117,4.0,0.25,True,2


Note that in the multinomial case the score frame is keyed by `orig_variable` plus `outcome_target` (not just `orig_variable`), so each variable appears once per outcome class.  To decide which variables to include in a model, we must therefore aggregate across the outcome targets.

The recommended new variables are:

In [3]:
score_frame.variable[score_frame.recommended].unique()

array([0, 1, 2, 3])

And the recommended original variables are:

In [4]:
score_frame.orig_variable[score_frame.recommended].unique()

array([0, 1, 2, 3])

In this example the new and original variable names coincide, because the only variable treatment applied was `clean_copy`.
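The two `.unique()` calls above aggregate implicitly: a variable is kept if it is recommended for at least one outcome target. A more explicit per-variable summary can be sketched with a `groupby`. To stay self-contained, this sketch rebuilds a small frame from the ten score-frame rows displayed earlier rather than calling `vtreat`; the column names `any_recommended` and `min_significance` are illustrative choices, not part of the `vtreat` API.

```python
import pandas as pd

# Hand-copied subset of the score frame rows displayed above
# (only the columns needed for the aggregation).
score_frame = pd.DataFrame({
    "variable":       [0, 1, 2, 3, 0, 1, 2, 3, 0, 1],
    "outcome_target": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2],
    "significance": [
        5.288768e-25, 3.054699e-16, 3.623379e-63, 1.288504e-51,
        0.3341524, 1.595624e-09, 0.01329302, 0.1507473,
        1.619533e-18, 0.0979117,
    ],
    "recommended": [
        True, True, True, True,
        False, True, True, True,
        True, True,
    ],
})

# Collapse the (variable, outcome_target) rows to one row per variable.
# A variable is kept if it is recommended for any outcome target.
per_variable = (
    score_frame.groupby("variable")
    .agg(
        any_recommended=("recommended", "any"),
        min_significance=("significance", "min"),
    )
    .reset_index()
)

keep = per_variable.loc[per_variable["any_recommended"], "variable"].tolist()
print(per_variable)
print(keep)
```

Using "any" as the aggregation rule matches what the `.unique()` calls do; a stricter rule (for example, requiring a recommendation for every outcome target) would be a different modeling choice.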

Let's take a look at the transformed frame.

In [5]:
X_new.head()

Unnamed: 0,0,1,2,3
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


This does not apply to `clean_copy` variables, but in general `.fit_transform()` values are a function of the incoming value *plus* the cross-validation fold (not just a function of the incoming value!).  This is how the cross-frame methodology helps fight the over-fit driven by nested model bias for complex variables such as impact codes. `.transform()` values are, as one would expect, functions of just the input values (independent of the cross-validation fold).
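The distinction between the two kinds of values can be illustrated with a hand-rolled impact code. This is a minimal sketch and not `vtreat`'s implementation: the helper names `impact_codes` and `cross_frame`, the toy data, and the simple "conditional mean minus grand mean" code are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def impact_codes(x, y):
    """Map each level of x to mean(y | level) - mean(y) on the rows given."""
    return y.groupby(x).mean() - y.mean()

def cross_frame(x, y, n_folds=3, seed=2019):
    """Code each row using only the rows outside that row's fold."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(x)) % n_folds
    out = pd.Series(np.nan, index=x.index)
    for k in range(n_folds):
        held = folds == k
        codes = impact_codes(x[~held], y[~held])
        # levels unseen in the training folds get the neutral code 0
        out.loc[held] = x[held].map(codes).fillna(0.0).to_numpy()
    return out, pd.Series(folds, index=x.index, name="fold")

# toy categorical variable and numeric outcome (hypothetical data)
x = pd.Series(["a", "b"] * 6, name="level")
y = pd.Series([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0])

# analogue of .transform(): codes fit on all rows, one value per level
applied = x.map(impact_codes(x, y))

# analogue of .fit_transform(): value depends on the level and the fold
cross, folds = cross_frame(x, y)

print(pd.DataFrame({"level": x, "fold": folds,
                    "applied": applied, "cross": cross}))
```

In the printed frame the `applied` column is constant within each level, while the `cross` column is only constant within each (level, fold) pair, because each row's code was estimated without that row's own outcome.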