# GroupRareLevelsTransformer
This notebook demonstrates the `GroupRareLevelsTransformer` class. This transformer groups infrequently occurring levels of a variable into a single new level, labelled 'rare' by default. <br>
Rare levels are determined by either the percentage of rows or the percentage of weight that falls into each level.
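
As a rough formalisation (an assumption inferred from the outputs later in this notebook, not taken from the library source), let $x_i$ be the level of row $i$, $w_i$ its weight (with $w_i = 1$ for every row when no weight column is given), and $c$ the value of `cut_off_percent`. The share of level $\ell$ and the rarity rule can then be written as:

$$
s_\ell = \frac{\sum_{i=1}^{N} w_i \, \mathbf{1}[x_i = \ell]}{\sum_{i=1}^{N} w_i},
\qquad
\ell \text{ is rare} \iff s_\ell < c .
$$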

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint
from sklearn.datasets import load_diabetes

In [2]:
import tubular
from tubular.nominal import GroupRareLevelsTransformer

In [3]:
tubular.__version__

'0.3.0'

## Load diabetes dataset from sklearn
We also create a categorical column from `bmi` and treat it as unordered for demonstration purposes in this notebook.

In [4]:
diabetes = load_diabetes(return_X_y=False, as_frame=True)['data']

In [5]:
diabetes['bmi_cut'] = pd.cut(diabetes['bmi'], bins = 20)

In [6]:
diabetes.head()

| | age | sex | bmi | bp | s1 | s2 | s3 | s4 | s5 | s6 | bmi_cut |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.038076 | 0.05068 | 0.061696 | 0.021872 | -0.044223 | -0.034821 | -0.043401 | -0.002592 | 0.019908 | -0.017646 | (0.0532, 0.0662] |
| 1 | -0.001882 | -0.044642 | -0.051474 | -0.026328 | -0.008449 | -0.019163 | 0.074412 | -0.039493 | -0.06833 | -0.092204 | (-0.0642, -0.0512] |
| 2 | 0.085299 | 0.05068 | 0.044451 | -0.005671 | -0.045599 | -0.034194 | -0.032356 | -0.002592 | 0.002864 | -0.02593 | (0.0401, 0.0532] |
| 3 | -0.089063 | -0.044642 | -0.011595 | -0.036656 | 0.012191 | 0.024991 | -0.036038 | 0.034309 | 0.022692 | -0.009362 | (-0.012, 0.00102] |
| 4 | 0.005383 | -0.044642 | -0.036385 | 0.021872 | 0.003935 | 0.015596 | 0.008142 | -0.002592 | -0.031991 | -0.046641 | (-0.0381, -0.0251] |


In [7]:
diabetes['bmi_cut'].value_counts(dropna=False) / diabetes.shape[0]

(-0.012, 0.00102]     0.117647
(-0.0381, -0.0251]    0.110860
(-0.0251, -0.012]     0.110860
(-0.0512, -0.0381]    0.085973
(0.00102, 0.0141]     0.085973
(0.0141, 0.0271]      0.081448
(-0.0642, -0.0512]    0.063348
(0.0271, 0.0401]      0.063348
(0.0532, 0.0662]      0.063348
(0.0401, 0.0532]      0.049774
(-0.0772, -0.0642]    0.049774
(0.0662, 0.0793]      0.036199
(-0.0905, -0.0772]    0.022624
(0.0923, 0.105]       0.020362
(0.0793, 0.0923]      0.015837
(0.118, 0.131]        0.009050
(0.105, 0.118]        0.006787
(0.158, 0.171]        0.004525
(0.131, 0.144]        0.002262
(0.144, 0.158]        0.000000
Name: bmi_cut, dtype: float64

## Simple usage

### Initialising GroupRareLevelsTransformer
The user must set `cut_off_percent`, expressed as a proportion (for example, `0.10` for 10%), to determine which levels are rare. <br>
Note that multiple columns can be specified in the `columns` argument; they all share the same cut-off.

In [8]:
grp_1 = GroupRareLevelsTransformer(
    columns = ['bmi_cut'], 
    cut_off_percent = 0.10, 
    copy = True, 
    verbose = True
)

BaseTransformer.__init__() called


### GroupRareLevelsTransformer fit
The `fit` method determines the 'non-rare' levels from the input data. It must be run before the `transform` method. <br>
The 'non-rare' levels are stored in the `mapping_` attribute.

In [9]:
grp_1.fit(diabetes)

BaseTransformer.fit() called


GroupRareLevelsTransformer(columns=['bmi_cut'], cut_off_percent=0.1)

In [10]:
pprint(grp_1.mapping_)

{'bmi_cut': [Interval(-0.012, 0.00102, closed='right'),
             Interval(-0.0251, -0.012, closed='right'),
             Interval(-0.0381, -0.0251, closed='right')]}
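
The three levels in `mapping_` are exactly those with a row share of at least 0.10 in the `value_counts` output above (0.117647, 0.110860 and 0.110860). The selection can be sketched in plain Python as follows. This is only an illustration of the idea, not tubular's implementation: `find_non_rare_levels` is a hypothetical helper, and the `>=` comparison is an assumption that happens to agree with the output shown here.

```python
from collections import Counter


def find_non_rare_levels(values, cut_off_percent):
    """Return the levels whose share of rows is at least cut_off_percent.

    Hypothetical helper for illustration; the >= comparison is an assumption.
    """
    counts = Counter(values)
    total = len(values)
    return sorted(
        level for level, n in counts.items() if n / total >= cut_off_percent
    )


# Hypothetical toy column: 'a' covers 60% of rows, 'b' 30%, 'c' 10%.
toy_values = ["a"] * 6 + ["b"] * 3 + ["c"]
non_rare = find_non_rare_levels(toy_values, cut_off_percent=0.2)
print(non_rare)  # ['a', 'b']
```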


### GroupRareLevelsTransformer transform
The `transform` method maps any levels that are not present in the `mapping_` dict to 'rare'.

In [11]:
diabetes_2 = grp_1.transform(diabetes)

BaseTransformer.transform() called


In [12]:
diabetes_2['bmi_cut'].value_counts(dropna = False)

rare                  292
(-0.012, 0.00102]      52
(-0.0381, -0.0251]     49
(-0.0251, -0.012]      49
(0.0662, 0.0793]        0
(0.158, 0.171]          0
(0.144, 0.158]          0
(0.131, 0.144]          0
(0.118, 0.131]          0
(0.105, 0.118]          0
(0.0923, 0.105]         0
(0.0793, 0.0923]        0
(-0.0905, -0.0772]      0
(0.0532, 0.0662]        0
(-0.0772, -0.0642]      0
(0.0271, 0.0401]        0
(0.0141, 0.0271]        0
(0.00102, 0.0141]       0
(-0.0512, -0.0381]      0
(-0.0642, -0.0512]      0
(0.0401, 0.0532]        0
Name: bmi_cut, dtype: int64
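
Conceptually, the transform step is a lookup against the fitted list of non-rare levels. A minimal sketch of that idea, using a hypothetical `group_rare_levels` helper on a toy list rather than tubular's own code:

```python
def group_rare_levels(values, non_rare_levels, rare_level_name="rare"):
    """Replace every value that is not a non-rare level with rare_level_name.

    Hypothetical helper for illustration only.
    """
    keep = set(non_rare_levels)
    return [v if v in keep else rare_level_name for v in values]


# Hypothetical toy data: only 'a' and 'b' were found to be non-rare.
toy = ["a", "a", "b", "c", "a", "d"]
grouped = group_rare_levels(toy, non_rare_levels=["a", "b"])
print(grouped)  # ['a', 'a', 'b', 'rare', 'a', 'rare']
```

The zero-count rows in the `value_counts` output above are expected: `bmi_cut` is a categorical column, and pandas lists every category of a categorical column, including those that no longer have any rows.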

## Changing the rare level label
The name of the rare level can be changed by using the `rare_level_name` argument when initialising the GroupRareLevelsTransformer object.

In [13]:
grp_2 = GroupRareLevelsTransformer(
    columns = ['bmi_cut'], 
    rare_level_name = 'zzz',
    cut_off_percent = 0.10, 
    copy = True, 
    verbose = False
)

In [14]:
grp_2.fit(diabetes)

GroupRareLevelsTransformer(columns=['bmi_cut'], cut_off_percent=0.1,
                           rare_level_name='zzz')

In [15]:
diabetes_3 = grp_2.transform(diabetes)

In [16]:
diabetes_3['bmi_cut'].value_counts(dropna = False)

zzz                   292
(-0.012, 0.00102]      52
(-0.0381, -0.0251]     49
(-0.0251, -0.012]      49
(0.0662, 0.0793]        0
(0.158, 0.171]          0
(0.144, 0.158]          0
(0.131, 0.144]          0
(0.118, 0.131]          0
(0.105, 0.118]          0
(0.0923, 0.105]         0
(0.0793, 0.0923]        0
(-0.0905, -0.0772]      0
(0.0532, 0.0662]        0
(-0.0772, -0.0642]      0
(0.0271, 0.0401]        0
(0.0141, 0.0271]        0
(0.00102, 0.0141]       0
(-0.0512, -0.0381]      0
(-0.0642, -0.0512]      0
(0.0401, 0.0532]        0
Name: bmi_cut, dtype: int64

## Recording rare levels
By default, the levels identified as 'rare' (those that fall below the `cut_off_percent` value) are not recorded, because there could be a large number of them. This can be changed by setting the `record_rare_levels` argument to `True` when initialising the GroupRareLevelsTransformer object. <br>
In that case, the rare levels are recorded in a dict in the `rare_levels_record_` attribute of the transformer. <br>
Take care when doing this on columns with many levels, as it can result in a large transformer object.

In [17]:
grp_3 = GroupRareLevelsTransformer(
    columns = ['bmi_cut'], 
    record_rare_levels = True,
    cut_off_percent = 0.10, 
    copy = True, 
    verbose = False
)

In [18]:
grp_3.fit(diabetes)

GroupRareLevelsTransformer(columns=['bmi_cut'], cut_off_percent=0.1)

In [19]:
pprint(grp_3.rare_levels_record_)

{'bmi_cut': [Interval(-0.0512, -0.0381, closed='right'),
             Interval(-0.0642, -0.0512, closed='right'),
             Interval(-0.0772, -0.0642, closed='right'),
             Interval(-0.0905, -0.0772, closed='right'),
             Interval(0.00102, 0.0141, closed='right'),
             Interval(0.0141, 0.0271, closed='right'),
             Interval(0.0271, 0.0401, closed='right'),
             Interval(0.0401, 0.0532, closed='right'),
             Interval(0.0532, 0.0662, closed='right'),
             Interval(0.0662, 0.0793, closed='right'),
             Interval(0.0793, 0.0923, closed='right'),
             Interval(0.0923, 0.105, closed='right'),
             Interval(0.105, 0.118, closed='right'),
             Interval(0.118, 0.131, closed='right'),
             Interval(0.131, 0.144, closed='right'),
             Interval(0.144, 0.158, closed='right'),
             Interval(0.158, 0.171, closed='right')]}
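
Note that `(0.144, 0.158]` appears in the record even though it has no rows in the data (its share is 0.000000 in the earlier `value_counts` output). This suggests that unobserved categories of a categorical column are also treated as rare. The sketch below reproduces that behaviour on toy data. `record_rare_levels` here is a hypothetical helper that takes an explicit list of categories, not the library's implementation.

```python
from collections import Counter


def record_rare_levels(values, categories, cut_off_percent):
    """Return the categories whose share of rows is below cut_off_percent.

    Hypothetical helper for illustration. Categories that never occur in
    `values` have a share of 0 and are therefore returned as rare.
    """
    counts = Counter(values)  # missing keys count as 0
    total = len(values)
    return [c for c in categories if counts[c] / total < cut_off_percent]


toy_values = ["a"] * 6 + ["b"] * 3 + ["c"]
toy_categories = ["a", "b", "c", "d"]  # 'd' is declared but never observed
rare = record_rare_levels(toy_values, toy_categories, cut_off_percent=0.2)
print(rare)  # ['c', 'd']
```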


## Using row by row weights to identify rare levels
If records in the data do not have equal weight, the user can set the `weight` argument when initialising the GroupRareLevelsTransformer object so that `cut_off_percent` applies to the sum of weights rather than the number of rows. <br>
In this example we create a dummy weights column and give rows with negative `bmi` values a high weight relative to rows with positive `bmi` values (note that the columns in this dataset have been standardised).

In [20]:
diabetes['weights'] = 1
diabetes.loc[diabetes['bmi'] < 0, 'weights'] = 10

In [21]:
grp_4 = GroupRareLevelsTransformer(
    columns = ['bmi_cut'], 
    weight = 'weights',
    cut_off_percent = 0.10, 
    copy = True, 
    verbose = False
)

In [22]:
grp_4.fit(diabetes)

GroupRareLevelsTransformer(columns=['bmi_cut'], cut_off_percent=0.1,
                           weight='weights')

In [23]:
pprint(grp_4.mapping_)

{'bmi_cut': [Interval(-0.012, 0.00102, closed='right'),
             Interval(-0.0251, -0.012, closed='right'),
             Interval(-0.0381, -0.0251, closed='right'),
             Interval(-0.0512, -0.0381, closed='right'),
             Interval(-0.0642, -0.0512, closed='right')]}


In [24]:
diabetes.groupby('bmi_cut')['weights'].sum() / diabetes['weights'].sum()

bmi_cut
(-0.0905, -0.0772]    0.037523
(-0.0772, -0.0642]    0.082552
(-0.0642, -0.0512]    0.105066
(-0.0512, -0.0381]    0.142589
(-0.0381, -0.0251]    0.183865
(-0.0251, -0.012]     0.183865
(-0.012, 0.00102]     0.191745
(0.00102, 0.0141]     0.014259
(0.0141, 0.0271]      0.013508
(0.0271, 0.0401]      0.010507
(0.0401, 0.0532]      0.008255
(0.0532, 0.0662]      0.010507
(0.0662, 0.0793]      0.006004
(0.0793, 0.0923]      0.002627
(0.0923, 0.105]       0.003377
(0.105, 0.118]        0.001126
(0.118, 0.131]        0.001501
(0.131, 0.144]        0.000375
(0.144, 0.158]        0.000000
(0.158, 0.171]        0.000750
Name: weights, dtype: float64
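
The five levels in `grp_4.mapping_` are the ones whose weight share above is at least 0.10. The effect of weighting can be seen on a smaller scale with the following sketch, which uses a hypothetical `weighted_level_shares` helper and toy data (neither is part of tubular). A level that dominates by row count can become rare once weights are taken into account, and vice versa.

```python
from collections import defaultdict


def weighted_level_shares(values, weights):
    """Return each level's share of the total weight.

    Hypothetical helper for illustration only.
    """
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    grand_total = sum(weights)
    return {level: total / grand_total for level, total in totals.items()}


# Hypothetical toy data: 'a' has 4 of 5 rows, but 'b' carries most of the weight.
toy_values = ["a", "a", "a", "a", "b"]
toy_weights = [1, 1, 1, 1, 16]

shares = weighted_level_shares(toy_values, toy_weights)
non_rare_by_weight = sorted(l for l, s in shares.items() if s >= 0.25)
print(non_rare_by_weight)  # ['b']
```

By row count, 'a' has a share of 0.8 and 'b' a share of 0.2, so with a cut-off of 0.25 the unweighted result would be the opposite: 'a' non-rare and 'b' rare.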