# GroupRareLevelsTransformer
This notebook demonstrates the GroupRareLevelsTransformer class. This transformer groups infrequently occurring levels of a variable into a single new level, labelled 'rare' by default. <br>
Rare levels are determined by either the percentage of rows or the percentage of weight that falls into each level.

In [1]:
import pandas as pd
import numpy as np
from pprint import pprint

In [2]:
import tubular
from tubular.nominal import GroupRareLevelsTransformer

In [3]:
tubular.__version__

'0.2.14'

## Load Boston house price dataset from sklearn
Note that the loading function used below (prepare_boston_df) modifies the original Boston dataset to include null values and pandas categorical dtypes.

In [4]:
boston_df = tubular.testing.test_data.prepare_boston_df()
boston_df.shape

(506, 17)

In [5]:
boston_df.head()

Unnamed: 0,CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,target,ZN_cat,CHAS_cat,RAD_cat
0,0.00632,18.0,2.31,0.0,0.538,6.575,,4.09,,296.0,15.3,396.9,4.98,24.0,18.0,0.0,
1,0.02731,,7.07,0.0,0.469,6.421,78.9,4.9671,2.0,242.0,17.8,396.9,9.14,21.6,,0.0,2.0
2,0.02729,0.0,7.07,0.0,0.469,7.185,61.1,4.9671,2.0,,17.8,392.83,4.03,34.7,0.0,0.0,2.0
3,,,2.18,0.0,0.458,,45.8,6.0622,3.0,222.0,18.7,,,33.4,,0.0,3.0
4,0.06905,0.0,2.18,0.0,0.458,,,6.0622,3.0,222.0,18.7,396.9,5.33,36.2,0.0,0.0,3.0


In [6]:
boston_df.dtypes

CRIM         float64
ZN            object
INDUS        float64
CHAS          object
NOX          float64
RM           float64
AGE          float64
DIS          float64
RAD           object
TAX          float64
PTRATIO      float64
B            float64
LSTAT        float64
target       float64
ZN_cat      category
CHAS_cat    category
RAD_cat     category
dtype: object

## Simple usage

### Initialising GroupRareLevelsTransformer
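The user must set cut_off_percent, the threshold used to identify rare levels. It is expressed as a proportion, so 0.10 means 10%; levels that fall below it are treated as rare.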

In [7]:
grp_1 = GroupRareLevelsTransformer(
    columns = ['ZN', 'RAD'], 
    cut_off_percent = 0.10, 
    copy = True, 
    verbose = True
)

BaseTransformer.__init__() called


### GroupRareLevelsTransformer fit
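The fit method must be run before the transform method. It determines the 'non-rare' levels of each column from the input data.
These levels are stored, per column, in an attribute called 'mapping_'. Nulls are counted as a level in their own right, as the output below shows.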

In [8]:
grp_1.fit(boston_df)

BaseTransformer.fit() called


GroupRareLevelsTransformer(columns=['ZN', 'RAD'], cut_off_percent=0.1)

In [9]:
pprint(grp_1.mapping_)

{'RAD': ['24.0', '4.0', '5.0', nan], 'ZN': ['0.0', nan]}
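To make the fit step more concrete, here is a minimal pandas-only sketch of how non-rare levels could be derived from row shares. It is not tubular's implementation: the toy column, the variable names and the use of `>=` at the boundary are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy column (made-up data, not the Boston dataset): 10 rows, 2 of them null.
toy = pd.Series(["a"] * 5 + ["b"] * 2 + ["c"] + [np.nan] * 2, name="level")

cut_off_percent = 0.15  # a proportion, so 15% of rows

# Share of rows per level; dropna=False keeps nulls as a level of their own.
shares = toy.value_counts(normalize=True, dropna=False)

# Levels at or above the cut-off are kept (the boundary rule is assumed here).
non_rare_levels = shares[shares >= cut_off_percent].index.tolist()
print(non_rare_levels)
```

Here 'a' (50%), 'b' (20%) and the null level (20%) are kept, while 'c' (10%) falls below the cut-off.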


### GroupRareLevelsTransformer transform
The transform method replaces any level that is not listed for that column in the mapping_ dict with the rare label ('rare' by default).

In [10]:
boston_df['RAD'].value_counts(normalize = True, dropna = False)

24.0    0.245059
5.0     0.203557
4.0     0.173913
NaN     0.122530
3.0     0.069170
6.0     0.043478
8.0     0.041502
2.0     0.039526
1.0     0.035573
7.0     0.025692
Name: RAD, dtype: float64

In [11]:
boston_df_2 = grp_1.transform(boston_df)

BaseTransformer.transform() called


In [12]:
boston_df_2['RAD'].value_counts(dropna = False)

rare    129
24.0    124
5.0     103
4.0      88
NaN      62
Name: RAD, dtype: int64
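As a quick cross-check of the two RAD outputs above, the proportions below 0.10 in the first output should add up to the 129 rows labelled 'rare' in the second. The snippet below copies the printed proportions by hand; the variable names are illustrative only.

```python
# Proportions copied from the value_counts(normalize=True) output for RAD.
rad_shares = {
    "24.0": 0.245059, "5.0": 0.203557, "4.0": 0.173913, "NaN": 0.122530,
    "3.0": 0.069170, "6.0": 0.043478, "8.0": 0.041502, "2.0": 0.039526,
    "1.0": 0.035573, "7.0": 0.025692,
}
n_rows = 506            # from boston_df.shape
cut_off_percent = 0.10

# Add up the shares of every level that falls below the cut-off.
rare_share = sum(s for s in rad_shares.values() if s < cut_off_percent)

# Convert the combined share back into a row count.
rare_rows = round(rare_share * n_rows)
print(rare_rows)  # 129
```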

In [13]:
boston_df_2['ZN'].value_counts(dropna = False)

0.0     330
rare    114
NaN      62
Name: ZN, dtype: int64
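The replacement step itself can be sketched in plain pandas. This is an illustration under stated assumptions rather than the library's code: the toy column and the hand-written list of kept levels are made up, and nulls are left out to keep the example small.

```python
import pandas as pd

# Made-up column and a made-up list of levels kept by an earlier fit step.
toy = pd.Series(["a", "a", "b", "c", "a"], name="level")
non_rare_levels = ["a"]
rare_level_name = "rare"

# Keep values found in the list; replace everything else with the rare label.
grouped = toy.where(toy.isin(non_rare_levels), rare_level_name)
print(grouped.tolist())  # ['a', 'a', 'rare', 'rare', 'a']
```

Because the replacement only looks at the stored list, a level that never appeared in the data used for fitting would also be mapped to the rare label in this sketch.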

## Changing the rare level label
The name of the rare level can be changed by using the 'rare_level_name' argument when initialising the GroupRareLevelsTransformer object.

In [21]:
grp_2 = GroupRareLevelsTransformer(
    columns = ['ZN'], 
    rare_level_name = 'zzz',
    cut_off_percent = 0.10, 
    copy = True, 
    verbose = False
)

In [22]:
grp_2.fit(boston_df)

GroupRareLevelsTransformer(columns=['ZN'], cut_off_percent=0.1,
                           rare_level_name='zzz')

In [23]:
boston_df_3 = grp_2.transform(boston_df)

In [24]:
boston_df_3['ZN'].value_counts(dropna = False)

0.0    330
zzz    114
NaN     62
Name: ZN, dtype: int64

## Recording rare levels
By default, the levels identified as rare (those that fall below the cut_off_percent value) are not recorded. This can be changed by setting the record_rare_levels argument to True when initialising the GroupRareLevelsTransformer object. <br>
The rare levels are then recorded in a dict in the 'rare_levels_record_' attribute on the object. <br>
Take care when doing this on columns with many levels, as it can result in a large transformer object.

In [25]:
grp_3 = GroupRareLevelsTransformer(
    columns = ['ZN'], 
    record_rare_levels = True,
    cut_off_percent = 0.10, 
    copy = True, 
    verbose = False
)

In [26]:
grp_3.fit(boston_df)

GroupRareLevelsTransformer(columns=['ZN'], cut_off_percent=0.1)

In [27]:
pprint(grp_3.rare_levels_record_)

{'ZN': ['100.0',
        '12.5',
        '17.5',
        '18.0',
        '20.0',
        '21.0',
        '22.0',
        '25.0',
        '28.0',
        '30.0',
        '33.0',
        '34.0',
        '35.0',
        '40.0',
        '45.0',
        '52.5',
        '55.0',
        '60.0',
        '70.0',
        '75.0',
        '80.0',
        '85.0',
        '90.0',
        '95.0']}
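If the rare levels were not recorded at fit time, a similar dict can be rebuilt afterwards as the complement of the kept levels, provided the training data is still available. The sketch below assumes a dict of kept levels shaped like mapping_; the toy data and all names are hypothetical.

```python
import pandas as pd

# Made-up training data and a made-up result of an earlier fit step.
toy_df = pd.DataFrame({"level": ["a", "a", "a", "b", "c", "a"]})
non_rare = {"level": ["a"]}

# For each column, the rare levels are the observed (non-null) levels
# that are missing from the list of kept levels.
rare_levels_by_column = {
    col: sorted(set(toy_df[col].dropna().unique()) - set(kept))
    for col, kept in non_rare.items()
}
print(rare_levels_by_column)  # {'level': ['b', 'c']}
```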


## Using row by row weights to identify rare levels
If records in the data do not carry equal weight, the user can set the 'weight' argument when initialising the GroupRareLevelsTransformer object so that cut_off_percent applies to the sum of weights rather than the count of rows. <br>
In this example we create a dummy weights column and give a very rarely occurring level of the 'ZN' column a large weight compared to all other levels. Based on these (dummy) weights, this should be the only level kept as non-rare.

In [28]:
(boston_df['ZN'] == '100.0').sum() / boston_df.shape[0] 

0.001976284584980237

In [29]:
boston_df['weights'] = 1

In [30]:
boston_df.loc[boston_df['ZN'] == '100.0', 'weights'] = 1000000

In [31]:
grp_4 = GroupRareLevelsTransformer(
    columns = ['ZN'], 
    weight = 'weights',
    cut_off_percent = 0.10, 
    copy = True, 
    verbose = False
)

In [32]:
grp_4.fit(boston_df)

GroupRareLevelsTransformer(columns=['ZN'], cut_off_percent=0.1,
                           weight='weights')

In [33]:
grp_4.mapping_

{'ZN': ['100.0']}
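The weighted variant can be sketched the same way, swapping row counts for summed weights. As before, this is a hedged illustration with made-up data and names, not the transformer's actual code, and the `>=` comparison at the boundary is an assumption.

```python
import pandas as pd

# Made-up data: 'b' appears once but carries almost all of the weight.
toy_df = pd.DataFrame({
    "level": ["a", "a", "a", "b"],
    "weights": [1, 1, 1, 1000],
})
cut_off_percent = 0.10

# Total weight per level, then each level's share of the overall weight.
weight_by_level = toy_df.groupby("level")["weights"].sum()
weight_share = weight_by_level / weight_by_level.sum()

# Keep the levels whose weight share reaches the cut-off.
non_rare_levels = weight_share[weight_share >= cut_off_percent].index.tolist()
print(non_rare_levels)  # ['b']
```

The same arithmetic explains the result above: the single '100.0' row carries 1,000,000 of the 1,000,505 total weight (about 99.95%), so every other level, including '0.0' with 330 rows, falls far below the 10% cut-off.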