# Changing the missing value imputation in `vtreat`

This example uses `UnsupervisedTreatment`, but the same parameters work with the other treatment plans as well.

## A simple data example

Here we create a simple data set where the inputs have missing values.

In [1]:
import pandas
import numpy
import vtreat  # https://github.com/WinVector/pyvtreat
import vtreat.util



d = pandas.DataFrame({
    "x": [0, 1, 1000, None],
    "w": [3, 6, None, 100],
    "y": [0, 0, 1, 1],
})

d

Unnamed: 0,x,w,y
0,0.0,3.0,0
1,1.0,6.0,0
2,1000.0,,1
3,,100.0,1


Here are some summary statistics of `d`. We are primarily interested in the inputs `x` and `w`.

In [2]:
d.describe().loc[['mean', '50%', 'min', 'max'], :]

Unnamed: 0,x,w,y
mean,333.666667,36.333333,0.5
50%,1.0,6.0,0.5
min,0.0,3.0,0.0
max,1000.0,100.0,1.0


## The default missing value imputation

By default, `vtreat` fills in missing values with the mean value of the column, and adds an advisory `*_is_bad` column to mark the location of the original missing values.

In [3]:
transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=["y"],
)
d_treated = transform.fit_transform(d)

# put the treated frame in an order similar to the original frame
d_treated.loc[:, ['x','x_is_bad', 'w', 'w_is_bad', 'y']]

Unnamed: 0,x,x_is_bad,w,w_is_bad,y
0,0.0,0.0,3.0,0.0,0
1,1.0,0.0,6.0,0.0,0
2,1000.0,0.0,36.333333,1.0,1
3,333.666667,1.0,100.0,0.0,1
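
To make the default behavior concrete, here is a minimal plain-Python sketch of mean imputation with an indicator column. It only illustrates the idea and is not `vtreat`'s implementation; the helper name `impute_with_indicator` is hypothetical, and the sketch assumes the fill statistic is computed from the observed (non-missing) values only.

```python
import math
from statistics import mean


def impute_with_indicator(values, fill_fn=mean):
    """Hypothetical helper: fill missing entries and flag where they were.

    Returns (filled_values, is_bad), where is_bad is 1.0 at positions that
    were missing (None or NaN) in the input and 0.0 elsewhere.
    """
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    # Assumption: the statistic is computed on observed values only.
    observed = [float(v) for v in values if not is_missing(v)]
    fill = fill_fn(observed)
    filled = [fill if is_missing(v) else float(v) for v in values]
    is_bad = [1.0 if is_missing(v) else 0.0 for v in values]
    return filled, is_bad


# The same columns as in the example frame `d`.
x_filled, x_is_bad = impute_with_indicator([0, 1, 1000, None])
w_filled, w_is_bad = impute_with_indicator([3, 6, None, 100])
print(x_filled, x_is_bad)
print(w_filled, w_is_bad)
```

The filled entries agree with the treated frame above: the missing `x` becomes the mean of 0, 1, and 1000, and the missing `w` becomes the mean of 3, 6, and 100.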


## Changing the imputation strategy

If you do not want to use the mean to fill in missing values, you can change the imputation function using the parameter `missingness_imputation`. Here, we fill in missing values with the median.

In [4]:
transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=["y"],
    params=vtreat.unsupervised_parameters({
        "missingness_imputation": numpy.median,
    }))
d_treated = transform.fit_transform(d)
d_treated.loc[:, ['x','x_is_bad', 'w', 'w_is_bad', 'y']]

Unnamed: 0,x,x_is_bad,w,w_is_bad,y
0,0.0,0.0,3.0,0.0,0
1,1.0,0.0,6.0,0.0,0
2,1000.0,0.0,6.0,1.0,1
3,1.0,1.0,100.0,0.0,1


You can also use a constant value instead of a function. Here we replace missing values with the value -1.

In [5]:
transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=["y"],
    params=vtreat.unsupervised_parameters({
        "missingness_imputation": -1,
    })
)
d_treated = transform.fit_transform(d)
d_treated.loc[:, ['x','x_is_bad', 'w', 'w_is_bad', 'y']]

Unnamed: 0,x,x_is_bad,w,w_is_bad,y
0,0.0,0.0,3.0,0.0,0
1,1.0,0.0,6.0,0.0,0
2,1000.0,0.0,-1.0,1.0,1
3,-1.0,1.0,100.0,0.0,1
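
The two kinds of setting shown so far, a function or a constant, can be thought of as two ways of producing one fill value per column. The following plain-Python sketch shows that idea under the assumption that a callable is applied to the observed values and anything else is used as-is. The helper `resolve_fill_value` is hypothetical and is not part of `vtreat`; `statistics.median` stands in for `numpy.median`.

```python
from statistics import median


def resolve_fill_value(strategy, observed):
    """Hypothetical helper: turn an imputation strategy into one fill value.

    A callable strategy is applied to the observed (non-missing) values;
    anything else is treated as a constant and used as-is.
    """
    if callable(strategy):
        return float(strategy(observed))
    return float(strategy)


w_observed = [3.0, 6.0, 100.0]  # the non-missing entries of column `w`
fill_from_function = resolve_fill_value(median, w_observed)
fill_from_constant = resolve_fill_value(-1, w_observed)
print(fill_from_function, fill_from_constant)
```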


## Changing the imputation strategy per column

You can set the imputation strategy per column with the `imputation_map` argument, a dictionary from column name to either a function or a constant. Any column not named in the imputation map uses the strategy given by the `missingness_imputation` parameter (`numpy.mean` by default).

Here we use the maximum value to fill in the missing values for `x` and the value 0 to fill in the missing values for `w`. Because both columns are named in the map, the `missingness_imputation` setting of -1 is not used.

In [6]:
transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=["y"],
    params=vtreat.unsupervised_parameters({
        "missingness_imputation": -1,
    }),
    imputation_map={
        "x": numpy.max,
        "w": 0,
    },
)
d_treated = transform.fit_transform(d)

d_treated.loc[:, ['x','x_is_bad', 'w', 'w_is_bad', 'y']]

Unnamed: 0,x,x_is_bad,w,w_is_bad,y
0,0.0,0.0,3.0,0.0,0
1,1.0,0.0,6.0,0.0,0
2,1000.0,0.0,0.0,1.0,1
3,1000.0,1.0,100.0,0.0,1


For any column not named in `imputation_map`, `vtreat` falls back to `missingness_imputation` (in this case, -1).

In [7]:
transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=["y"],
    params=vtreat.unsupervised_parameters({
        "missingness_imputation": -1,
    }),
    imputation_map={
        "x": numpy.max,
    },
)
d_treated = transform.fit_transform(d)

d_treated.loc[:, ['x','x_is_bad', 'w', 'w_is_bad', 'y']]

Unnamed: 0,x,x_is_bad,w,w_is_bad,y
0,0.0,0.0,3.0,0.0,0
1,1.0,0.0,6.0,0.0,0
2,1000.0,0.0,-1.0,1.0,1
3,1000.0,1.0,100.0,0.0,1


If `missingness_imputation` is not specified either, `vtreat` uses the default, `numpy.mean`.

In [8]:
transform = vtreat.UnsupervisedTreatment(
    cols_to_copy=["y"],
    imputation_map={
        "x": numpy.max,
    },
)
d_treated = transform.fit_transform(d)

d_treated.loc[:, ['x','x_is_bad', 'w', 'w_is_bad', 'y']]

Unnamed: 0,x,x_is_bad,w,w_is_bad,y
0,0.0,0.0,3.0,0.0,0
1,1.0,0.0,6.0,0.0,0
2,1000.0,0.0,36.333333,1.0,1
3,1000.0,1.0,100.0,0.0,1
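
The lookup order described in this section (per-column map first, then `missingness_imputation`, then the mean) can be summarized in a short plain-Python sketch. The function `pick_strategy` is hypothetical and only mirrors the behavior shown in the three cells above; it is not how `vtreat` is implemented. The built-in `max` and `statistics.mean` stand in for `numpy.max` and `numpy.mean`.

```python
from statistics import mean


def pick_strategy(column, imputation_map=None, missingness_imputation=None):
    """Hypothetical helper mirroring the lookup order described above."""
    if imputation_map is not None and column in imputation_map:
        return imputation_map[column]   # 1. per-column override
    if missingness_imputation is not None:
        return missingness_imputation   # 2. global setting
    return mean                         # 3. default: the mean


# The same three configurations as the cells above.
both_mapped = {c: pick_strategy(c, {"x": max, "w": 0}, -1) for c in ("x", "w")}
only_x_mapped = {c: pick_strategy(c, {"x": max}, -1) for c in ("x", "w")}
no_global = {c: pick_strategy(c, {"x": max}) for c in ("x", "w")}
print(both_mapped)
print(only_x_mapped)
print(no_global)
```

In each configuration `x` resolves to the maximum, while `w` resolves to 0, -1, and the mean respectively, matching the treated frames shown above.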
