# MeanResponseTransformer
This notebook demonstrates the `MeanResponseTransformer` class. The transformer applies mean response encoding: each level of a categorical column is replaced by the mean of the response (target) over the rows that have that level.
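
To make the idea concrete before using the transformer, here is a minimal pandas-only sketch of mean response encoding on a toy frame. The frame and the names `toy`, `colour`, `response` and `level_means` are made up for illustration, and the code is not taken from tubular's implementation.

```python
import pandas as pd

# Toy data (illustrative only): one categorical column and a numeric response.
toy = pd.DataFrame(
    {
        "colour": ["red", "red", "blue", "blue", "blue"],
        "response": [10.0, 20.0, 1.0, 2.0, 3.0],
    }
)

# Mean of the response within each level of the categorical column.
level_means = toy.groupby("colour")["response"].mean().to_dict()

# Replace each level with its mean response.
toy["colour_encoded"] = toy["colour"].map(level_means)

print(level_means)
```

Here `red` maps to 15.0 and `blue` maps to 2.0, so every row ends up carrying the mean response of its level.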

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.datasets import load_diabetes

In [2]:
import tubular
from tubular.nominal import MeanResponseTransformer

In [3]:
tubular.__version__

'0.3.0'

## Load diabetes dataset from sklearn
We also create a categorical column by cutting `bmi` into 20 equal-width bins, and treat it as unordered for demonstration purposes in this notebook.

In [4]:
diabetes, target = load_diabetes(return_X_y=True, as_frame=True)

In [5]:
diabetes['bmi_cut'] = pd.cut(diabetes['bmi'], bins = 20)

In [6]:
diabetes['target'] = target

In [7]:
diabetes.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | bmi_cut | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | (0.0532, 0.0662] | 151.0 |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | (-0.0642, -0.0512] | 75.0 |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | (0.0401, 0.0532] | 141.0 |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | (-0.012, 0.00102] | 206.0 |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | (-0.0381, -0.0251] | 135.0 |


## Simple usage

### Initialising MeanResponseTransformer
The `response_column` argument specifies the response column that the `fit` method will use. <br>
The response column must not contain nulls; otherwise an exception is raised.

In [8]:
mre_1 = MeanResponseTransformer(
    columns = 'bmi_cut', 
    response_column = 'target',
    copy = True, 
    verbose = True
)

BaseTransformer.__init__() called


### MeanResponseTransformer fit
The `fit` method calculates the mean of the response column for each level; it must be run before the `transform` method. <br>
The mappings are stored in an attribute called `mappings`.

In [9]:
mre_1.fit(diabetes)

BaseTransformer.fit() called


MeanResponseTransformer(columns=['bmi_cut'], response_column='target')

In [10]:
pprint(mre_1.mappings)

{'bmi_cut': {Interval(-0.0905, -0.0772, closed='right'): 95.1,
             Interval(-0.0772, -0.0642, closed='right'): 92.9090909090909,
             Interval(-0.0642, -0.0512, closed='right'): 96.39285714285714,
             Interval(-0.0512, -0.0381, closed='right'): 108.52631578947368,
             Interval(-0.0381, -0.0251, closed='right'): 117.28571428571429,
             Interval(-0.0251, -0.012, closed='right'): 127.38775510204081,
             Interval(-0.012, 0.00102, closed='right'): 142.82692307692307,
             Interval(0.00102, 0.0141, closed='right'): 154.6315789473684,
             Interval(0.0141, 0.0271, closed='right'): 194.63888888888889,
             Interval(0.0271, 0.0401, closed='right'): 191.0,
             Interval(0.0401, 0.0532, closed='right'): 184.1818181818182,
             Interval(0.0532, 0.0662, closed='right'): 195.07142857142858,
             Interval(0.0662, 0.0793, closed='right'): 215.75,
             Interval(0.0793, 0.0923, closed='right'): 2

### MeanResponseTransformer transform

In [11]:
diabetes_2 = mre_1.transform(diabetes)

BaseTransformer.transform() called


In [12]:
diabetes_2['bmi_cut'].value_counts(dropna = False)

142.826923    52
117.285714    49
127.387755    49
108.526316    38
154.631579    38
194.638889    36
195.071429    28
96.392857     28
191.000000    28
184.181818    22
92.909091     22
215.750000    16
95.100000     10
234.888889     9
265.285714     7
297.250000     4
277.000000     3
294.000000     2
233.000000     1
Name: bmi_cut, dtype: int64

## Transform with nulls
The `MeanResponseTransformer` does not encode null values: as the example below shows, `transform` raises an exception when the column contains nulls. Other transformers in the package can be used to impute nulls first.

In [13]:
diabetes['bmi_cut_str'] = diabetes['bmi_cut'].astype(str)

In [14]:
diabetes.loc[0, 'bmi_cut_str'] = np.nan

In [15]:
diabetes['bmi_cut_str'].isnull().sum()

1

In [16]:
mre_2 = MeanResponseTransformer(
    columns = ['bmi_cut_str'], 
    response_column = 'target',
    copy = True, 
    verbose = True
)

BaseTransformer.__init__() called


In [17]:
mre_2.fit(diabetes)

BaseTransformer.fit() called


MeanResponseTransformer(columns=['bmi_cut_str'], response_column='target')

In [18]:
try:
    mre_2.transform(diabetes)
except Exception as err:
    print(type(err), err)

<class 'ValueError'> nulls would be introduced into column bmi_cut_str from levels not present in mapping
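
One way to avoid this kind of error is to give nulls their own level before encoding. The sketch below uses plain pandas as a stand-in for an imputation step; it does not use tubular's imputers, and the toy frame and the placeholder label `"missing"` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only) with one null in the categorical column.
toy = pd.DataFrame(
    {
        "level": ["a", "a", "b", np.nan],
        "response": [1.0, 3.0, 10.0, 5.0],
    }
)

# Treat nulls as their own level; "missing" is a hypothetical placeholder label.
toy["level"] = toy["level"].fillna("missing")

# Every level, including the placeholder, now gets a mean response.
level_means = toy.groupby("level")["response"].mean()
toy["level_encoded"] = toy["level"].map(level_means)

print(toy["level_encoded"].isnull().sum())
```

Whether a separate "missing" level is appropriate depends on the problem; imputing the most frequent level is another common choice.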


## Weights column
A weights column can be specified with the `weights_column` argument when initialising the transformer. <br>
In that case `fit` calculates a weighted mean.

In [19]:
diabetes['weights'] = diabetes['bp'].abs()

In [20]:
mre_3 = MeanResponseTransformer(
    columns = 'bmi_cut', 
    response_column = 'target',
    weights_column = 'weights'
)

In [21]:
mre_3.fit(diabetes)

MeanResponseTransformer(columns=['bmi_cut'], response_column='target',
                        weights_column='weights')

In [22]:
diabetes_4 = mre_3.transform(diabetes)

In [23]:
diabetes_4['bmi_cut'].value_counts(dropna = False)

4511.329099     52
3327.244238     49
3639.034524     49
2831.898667     38
4348.862302     38
5592.770648     36
4426.864694     28
4216.804119     28
2616.549737     28
4412.170409     22
1744.338119     22
4936.309915     16
2224.438961     10
4011.676461      9
4607.730150      7
4348.868698      4
4919.978247      3
9488.206085      2
14563.207557     1
Name: bmi_cut, dtype: int64
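
For reference, the textbook weighted mean of the response within a level is the sum of `weight * response` divided by the sum of `weight`. The pandas sketch below illustrates that definition on a made-up frame. It is not taken from tubular's source, so the fitted values shown in this notebook remain the reference for what the library version used here returns.

```python
import pandas as pd

# Toy data (illustrative only): a level, a response and a weight per row.
toy = pd.DataFrame(
    {
        "level": ["a", "a", "b", "b"],
        "response": [10.0, 20.0, 1.0, 3.0],
        "weights": [1.0, 3.0, 1.0, 1.0],
    }
)

# Textbook weighted mean per level: sum(w * y) / sum(w).
weighted_response = toy["response"] * toy["weights"]
numerator = weighted_response.groupby(toy["level"]).sum()
denominator = toy["weights"].groupby(toy["level"]).sum()
weighted_means = (numerator / denominator).to_dict()

print(weighted_means)
```

In this toy frame level `a` gets (10 * 1 + 20 * 3) / 4 = 17.5, compared with an unweighted mean of 15.0, because the heavier row pulls the mean towards its response.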