## WoERatioCategoricalEncoder (weight of evidence)

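This encoder replaces each category with the weight of evidence, with the ratio of target probabilities, or with the log of that ratio.
#### It only works for binary classification.

The target probability ratio is given by: p(1) / p(0), where p(1) and p(0) are the proportions of observations with target = 1 and target = 0 within the category.

The weight of evidence is given by: log( P(X=x<sub>j</sub>|Y=1) / P(X=x<sub>j</sub>|Y=0) )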

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from feature_engine.categorical_encoders import WoERatioCategoricalEncoder

from feature_engine.categorical_encoders import RareLabelCategoricalEncoder #to reduce cardinality

In [2]:
# Load titanic dataset from OpenML

def load_titanic():
    data = pd.read_csv('https://www.openml.org/data/get_csv/16826755/phpMYEkMl')
    data = data.replace('?', np.nan)
    data['cabin'] = data['cabin'].astype(str).str[0]
    data['pclass'] = data['pclass'].astype('O')
    data['age'] = data['age'].astype('float')
    data['fare'] = data['fare'].astype('float')
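    # assign the result back: in-place fillna on a selected column may not modify data in recent pandas
    data['embarked'] = data['embarked'].fillna('C')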
    data.drop(labels=['boat', 'body', 'home.dest'], axis=1, inplace=True)
    return data

In [3]:
data = load_titanic()
data.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B,S
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C,S
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C,S
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C,S
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C,S


In [4]:
X = data.drop(['survived', 'name', 'ticket'], axis=1)
y = data.survived

In [5]:
# we will encode the below variables, they have no missing values
X[['cabin', 'pclass', 'embarked']].isnull().sum()

cabin       0
pclass      0
embarked    0
dtype: int64

In [6]:
X[['cabin', 'pclass', 'embarked']].dtypes

cabin       object
pclass      object
embarked    object
dtype: object

In [7]:
# let's separate into training and testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

X_train.shape, X_test.shape

((916, 8), (393, 8))

In [9]:
## Rare value encoder first to reduce the cardinality
# see RareLabelCategoricalEncoder jupyter notebook for more details on this encoder
rare_encoder = RareLabelCategoricalEncoder(tol=0.03,
                                           n_categories=2, 
                                           variables=['cabin', 'pclass', 'embarked'])

rare_encoder.fit(X_train)

# transform
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)
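
The rare-label step above can be sketched in plain Python. This is a simplified illustration, not the feature_engine implementation: the helper names `fit_rare_labels` and `group_rare_labels` are hypothetical, the toy column is made up, and the `n_categories` check is ignored. It only shows the idea of learning the frequent categories on the training set (relative frequency of at least `tol`) and mapping everything else to `'Rare'`.

```python
from collections import Counter

def fit_rare_labels(values, tol=0.03):
    """Return the set of categories whose relative frequency is >= tol."""
    counts = Counter(values)
    n = len(values)
    return {cat for cat, c in counts.items() if c / n >= tol}

def group_rare_labels(values, frequent, rare_label="Rare"):
    """Replace every category not in `frequent` with `rare_label`."""
    return [v if v in frequent else rare_label for v in values]

# toy training column: 'T' appears in 1 of 50 rows (2%), below tol=0.03
train_cabin = ["n"] * 40 + ["C"] * 6 + ["B"] * 3 + ["T"]
frequent = fit_rare_labels(train_cabin, tol=0.03)

# categories that are rare ('T') or unseen in training ('G') are both grouped
test_cabin = ["n", "T", "G", "B"]
grouped = group_rare_labels(test_cabin, frequent)

print(sorted(frequent))  # ['B', 'C', 'n']
print(grouped)           # ['n', 'Rare', 'Rare', 'B']
```

Grouping infrequent categories first makes it less likely that a category contains only one target class, which matters for the encodings described next.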

The WoERatioCategoricalEncoder() replaces categories by the weight of evidence,
by the ratio between the probability of the target = 1 and the probability
of the target = 0, or by the log of that ratio.

The weight of evidence is given by: log(P(X=x<sub>j</sub>|Y = 1)/P(X=x<sub>j</sub>|Y=0))

The target probability ratio is given by: p(1) / p(0)

And the log of the target probability ratio is: np.log( p(1) / p(0) )

Here p(1) and p(0) are the proportions of observations with target = 1 and
target = 0 within each category.

Note: This categorical encoding is exclusive to binary classification.

For example, in the variable colour, if among the observations with colour = blue
the proportion with target = 1 is 0.8 and the proportion with target = 0 is 0.2,
blue will be replaced by np.log(0.8/0.2) = 1.386 if log_ratio is selected.
Alternatively, blue will be replaced by 0.8 / 0.2 = 4 if ratio is selected.
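
As a minimal sketch of the three formulas, the snippet below computes them from raw counts for a single category. The function `encode_category` and all counts are hypothetical and chosen only to reproduce the colour = blue example (8 positives and 2 negatives within blue); the overall totals of 20 positives and 40 negatives are an extra assumption needed for the weight of evidence.

```python
import math

def encode_category(pos, neg, total_pos, total_neg):
    """Return (ratio, log_ratio, woe) for one category.

    pos, neg: rows of this category with target = 1 and target = 0.
    total_pos, total_neg: rows with target = 1 and target = 0 in the whole training set.
    """
    p1 = pos / (pos + neg)          # p(1) within the category
    p0 = neg / (pos + neg)          # p(0) within the category
    ratio = p1 / p0
    log_ratio = math.log(ratio)
    # WoE compares the share of all positives and of all negatives that fall in this category
    woe = math.log((pos / total_pos) / (neg / total_neg))
    return ratio, log_ratio, woe

ratio, log_ratio, woe = encode_category(pos=8, neg=2, total_pos=20, total_neg=40)

print(round(ratio, 3))      # 4.0
print(round(log_ratio, 3))  # 1.386
print(round(woe, 3))        # 2.079
```

The weight of evidence differs from the log ratio here because the assumed classes are unbalanced; with equal numbers of positives and negatives overall, the two values would coincide.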

#### Note: 
Division by 0 and log(0) are not defined.
Thus, if p(0) = 0 for any category when using ratio, or if either p(0) = 0 or
p(1) = 0 for any category when using woe or log_ratio, the encoder will return an error.
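
One way to anticipate this error is to check, before fitting, whether any category lacks one of the two target classes. The helper below is a hypothetical sketch in plain Python (it is not part of feature_engine) and assumes the target takes only the values 0 and 1.

```python
from collections import defaultdict

def categories_with_zero_probability(categories, target):
    """Return {category: missing_class} for categories that never occur with one target class."""
    seen = defaultdict(set)
    for cat, y in zip(categories, target):
        seen[cat].add(y)
    return {
        cat: ({0, 1} - classes).pop()
        for cat, classes in seen.items()
        if classes != {0, 1}
    }

# toy data: 'Q' only occurs with target = 1, so p(0) = 0 for 'Q'
embarked = ["S", "S", "C", "C", "Q"]
survived = [0, 1, 1, 0, 1]

problems = categories_with_zero_probability(embarked, survived)
print(problems)  # {'Q': 0}
```

Categories flagged this way are often infrequent ones, so grouping rare labels beforehand, as done earlier in this notebook, tends to reduce the problem.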
   
The encoder will encode only categorical variables (type 'object'). A list
of variables can be passed as an argument. If no variables are passed, the
encoder will find and encode all categorical variables (object type).

For details on the calculation of the weight of evidence visit:
https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

### Weight of evidence

In [10]:
woe_enc = WoERatioCategoricalEncoder(encoding_method='woe',
                                     variables=['cabin', 'pclass', 'embarked'])

# to fit you need to pass the target y
woe_enc.fit(train_t, y_train)

WoERatioCategoricalEncoder(variables=['cabin', 'pclass', 'embarked'])

In [11]:
woe_enc.encoder_dict_

{'cabin': {'B': 1.6299623810120747,
  'C': 0.7217038208351837,
  'D': 1.405081209799324,
  'E': 1.405081209799324,
  'Rare': 0.7387452866900354,
  'n': -0.35752781962490193},
 'pclass': {1: 0.9453018143294478,
  2: 0.21009172435857942,
  3: -0.5841726684724614},
 'embarked': {'C': 0.6999054533737715,
  'Q': -0.05044494288988759,
  'S': -0.20113381737960143}}

In [12]:
# transform and visualise the data

train_t = woe_enc.transform(train_t)
test_t = woe_enc.transform(test_t)

test_t.sample(5)

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
539,0.210092,male,27.0,0,0,15.0333,-0.357528,0.699905
64,0.945302,male,27.0,1,0,53.1,1.405081,-0.201134
948,-0.584173,male,,0,0,7.75,-0.357528,-0.050445
319,0.945302,female,31.0,0,0,134.5,1.405081,0.699905
641,-0.584173,male,3.0,4,2,31.3875,-0.357528,-0.201134
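
Conceptually, the transform step is a per-variable dictionary lookup using the mappings stored in `encoder_dict_`. The sketch below uses the `embarked` mapping printed above; the helper `apply_mapping` is hypothetical, and returning `None` for a category not seen during fit is a choice made for this illustration, not a statement about how feature_engine handles that case.

```python
# mapping copied from the 'embarked' entry of woe_enc.encoder_dict_ shown above
embarked_woe = {
    "C": 0.6999054533737715,
    "Q": -0.05044494288988759,
    "S": -0.20113381737960143,
}

def apply_mapping(values, mapping):
    """Replace each category by its learned number; unseen categories become None."""
    return [mapping.get(v) for v in values]

encoded = apply_mapping(["C", "S", "Q", "X"], embarked_woe)
print([None if v is None else round(v, 6) for v in encoded])
# [0.699905, -0.201134, -0.050445, None]
```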


### Ratio

In [26]:
# As before, it is recommended to group rare labels and reduce cardinality before using this encoder.
train_t = rare_encoder.transform(X_train)
test_t = rare_encoder.transform(X_test)

In [23]:
ratio_enc = WoERatioCategoricalEncoder(encoding_method='ratio',
                                       variables=['cabin', 'pclass', 'embarked'])

# to fit we need to pass the target y
ratio_enc.fit(train_t, y_train)

WoERatioCategoricalEncoder(encoding_method='ratio',
                           variables=['cabin', 'pclass', 'embarked'])

In [15]:
ratio_enc.encoder_dict_

{'cabin': {'B': 3.1999999999999993,
  'C': 1.2903225806451615,
  'D': 2.5555555555555554,
  'E': 2.5555555555555554,
  'Rare': 1.3124999999999998,
  'n': 0.4385245901639344},
 'pclass': {1: 1.6136363636363635,
  2: 0.7735849056603774,
  3: 0.34959349593495936},
 'embarked': {'C': 1.2625000000000002,
  'Q': 0.5961538461538461,
  'S': 0.5127610208816704}}
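
The ratio and weight-of-evidence mappings are closely related. Writing both in terms of counts, WoE = log(ratio) - log(N1 / N0), where N1 and N0 are the total numbers of training rows with target = 1 and target = 0, so the two encodings should differ by one constant on the log scale. The check below is a sketch that copies the `pclass` and `embarked` values printed in this notebook and confirms that the difference is the same for every category.

```python
import math

# values copied from woe_enc.encoder_dict_ and ratio_enc.encoder_dict_ in this notebook
woe = {
    "pclass": {1: 0.9453018143294478, 2: 0.21009172435857942, 3: -0.5841726684724614},
    "embarked": {"C": 0.6999054533737715, "Q": -0.05044494288988759, "S": -0.20113381737960143},
}
ratio = {
    "pclass": {1: 1.6136363636363635, 2: 0.7735849056603774, 3: 0.34959349593495936},
    "embarked": {"C": 1.2625000000000002, "Q": 0.5961538461538461, "S": 0.5127610208816704},
}

# difference between WoE and log(ratio) for every category of every variable
offsets = [
    woe[var][cat] - math.log(ratio[var][cat])
    for var in woe
    for cat in woe[var]
]

print([round(o, 4) for o in offsets])
# [0.4668, 0.4668, 0.4668, 0.4668, 0.4668, 0.4668]
```

Because the offset is constant, the two encodings rank categories identically; they differ only in scale and origin.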

In [16]:
# transform and visualise the data

train_t = ratio_enc.transform(train_t)
test_t = ratio_enc.transform(test_t)

test_t.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
1139,0.349593,male,38.0,0,0,7.8958,0.438525,0.512761
533,0.773585,female,21.0,0,1,21.0,0.438525,0.512761
459,0.773585,male,42.0,1,0,27.0,0.438525,0.512761
1150,0.349593,male,,0,0,14.5,0.438525,0.512761
393,0.773585,male,25.0,0,0,31.5,0.438525,0.512761


### Automatically select the variables

When no variables are specified, the encoder selects and encodes all categorical variables (object type).

In [24]:
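# restart from the rare-label-encoded data, so the variables are still categorical
train_t, test_t = rare_encoder.transform(X_train), rare_encoder.transform(X_test)

ratio_enc = WoERatioCategoricalEncoder(encoding_method='ratio')

# to fit we need to pass the target y
ratio_enc.fit(train_t, y_train)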

WoERatioCategoricalEncoder(encoding_method='ratio',
                           variables=['pclass', 'sex', 'cabin', 'embarked'])

In [25]:
# transform and visualise the data

train_t = ratio_enc.transform(train_t)
test_t = ratio_enc.transform(test_t)

test_t.head()

Unnamed: 0,pclass,sex,age,sibsp,parch,fare,cabin,embarked
1139,0.349593,0.230932,38.0,0,0,7.8958,0.438525,0.512761
533,0.773585,2.681319,21.0,0,1,21.0,0.438525,0.512761
459,0.773585,0.230932,42.0,1,0,27.0,0.438525,0.512761
1150,0.349593,0.230932,,0,0,14.5,0.438525,0.512761
393,0.773585,0.230932,25.0,0,0,31.5,0.438525,0.512761
