# Encoders

In [3]:
import pandas as pd
import numpy as np
from sleepmind.preprocessing import SumEncoder

For the df defined below, we use the following encoders to encode Region, the categorical variable. Salary is the dependent variable.

In [4]:
df = pd.DataFrame({'Region': ['NY','SF','NY','NY','SF', 'CT'], 'Salary': [100,120,150,130,140, 90]})
X = pd.DataFrame(df.Region)
y = pd.DataFrame(df.Salary)

List of encoders:

- Sleepmind sum encoder
- One hot Encoding
- Label Encoding
- Categorical Encoders Library
    - BackwardDifferenceEncoder
    - BinaryEncoder
    - HelmertEncoder
    - OneHotEncoder
    - OrdinalEncoder
    - SumEncoder
    - PolynomialEncoder
    - BaseNEncoder
    - TargetEncoder
    - LeaveOneOutEncoder
- Bayesian Encoders
    - Xam_BayesianTargetEncoder
    - hcc_BayesEncoding
    - hcc_BayesEncodingKfold
    - hcc_LOOEncoding
    - hcc_LOOEncodingKfold

## Sleepmind Sum Encoder

https://github.com/IamGianluca/sleepmind/blob/master/sleepmind/preprocessing/sum_encoding.py

Logic: the mean of the dependent variable for the class under observation, divided by the mean of the dependent variable over all the other classes.

Implementation:
<br>NY:  ((100+150+130)/3) / ((120+140+90)/3) = 1.0857
<br>SF:  ((120+140)/2) / ((100+150+130+90)/4) = 1.1064
<br>CT:  (90) / ((100+150+130+120+140)/5) = 0.7031
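
The ratio described above can be reproduced with plain pandas. This is a minimal sketch of the arithmetic only, not the sleepmind implementation; the helper name ratio_encode is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'Region': ['NY', 'SF', 'NY', 'NY', 'SF', 'CT'],
                   'Salary': [100, 120, 150, 130, 140, 90]})

def ratio_encode(frame, feature, target):
    """Mean of the target inside each class divided by the mean outside it."""
    ratios = {}
    for level in frame[feature].unique():
        inside = frame.loc[frame[feature] == level, target].mean()
        outside = frame.loc[frame[feature] != level, target].mean()
        ratios[level] = inside / outside
    return ratios

ratios = ratio_encode(df, 'Region', 'Salary')
print(ratios)
```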

In [5]:
from sleepmind.preprocessing import SumEncoder
sum_encoder = SumEncoder()
encoded_X = sum_encoder.fit_transform(df.Region,df.Salary)

for region,value in zip(X.values,encoded_X):
    print (region,value)

['NY'] [1.08571429]
['SF'] [1.10638298]
['NY'] [1.08571429]
['NY'] [1.08571429]
['SF'] [1.10638298]
['CT'] [0.703125]


## One Hot Encoding

Logic: It creates k new variables, where k is the cardinality of the nominal variable. Each new variable is 1 for the rows where the original variable takes that value, and 0 otherwise.

Implementation: Region is split into 3 columns, Region_CT, Region_NY and Region_SF, with the values [0, 0, 0, 0, 0, 1], [1, 0, 1, 1, 0, 0] and [0, 1, 0, 0, 1, 0] respectively.

In [94]:
df_ohe = pd.get_dummies(df, columns=["Region"])

In [95]:
df_ohe

Unnamed: 0,Salary,Region_CT,Region_NY,Region_SF
0,100,0,1,0
1,120,0,0,1
2,150,0,1,0
3,130,0,1,0
4,140,0,0,1
5,90,1,0,0


## Label Encoding

Logic: It creates a single new column in which each of the k classes is replaced by an integer from 0 to k-1, where k is the cardinality of the variable.
Because the integers imply an order, this is generally used for ordinal variables.

Implementation: Region has k = 3 classes, so it is mapped to the integers 0 to 2: [1, 2, 1, 1, 2, 0], where 0 represents CT, 1 represents NY, and 2 represents SF.
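
The mapping above can be sketched without scikit-learn, assuming the integer labels follow the sorted order of the class names (which is consistent with CT = 0, NY = 1, SF = 2):

```python
import numpy as np

regions = np.array(['NY', 'SF', 'NY', 'NY', 'SF', 'CT'])

# np.unique returns the sorted classes and, for each row, the index of its class
classes, codes = np.unique(regions, return_inverse=True)

print(classes)
print(codes)
```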


In [96]:
from sklearn.preprocessing import LabelEncoder

In [106]:
lb = LabelEncoder()
lb.fit(X["Region"])

LabelEncoder()

In [107]:
lb.transform(X.Region)

array([1, 2, 1, 1, 2, 0])

## Categorical Encoders Library

In [1]:
import category_encoders as ce

### BackwardDifferenceEncoder

Logic : the mean of the dependent variable for a level is compared with the mean of the dependent variable for the prior level. This type of coding may be useful for a **nominal or an ordinal variable**

Implementation:
First, the states need to be mapped to ordinal values such as 1, 2, 3.
The encoder returns k-1 variables, which is 3-1, i.e. 2 in our case.

Consider the levels NY = 1, SF = 2, CT = 3.

NY = -(k-level_1)/k, -(k-level_2)/k = -(3-1)/3, -(3-2)/3 = -2/3, -1/3
SF = level_1/k, -(k-level_2)/k = 1/3, -1/3
CT = level_1/k, level_2/k = 1/3, 2/3
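
The three rows above follow one pattern: in column j, levels 1 to j get -(k-j)/k and the later levels get j/k. A minimal sketch of that contrast matrix (the helper name is hypothetical, and this is not the category_encoders source):

```python
import numpy as np

def backward_difference_matrix(k):
    """Rows are levels 1..k, columns are the k-1 encoded variables."""
    matrix = np.empty((k, k - 1))
    for j in range(1, k):
        matrix[:j, j - 1] = -(k - j) / k   # levels up to j
        matrix[j:, j - 1] = j / k          # levels after j
    return matrix

contrasts = backward_difference_matrix(3)
print(contrasts)  # rows: NY, SF, CT
```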


Reference:
https://stats.idre.ucla.edu/stata/webbooks/reg/chapter5/regression-with-statachapter-5-additional-coding-systems-for-categorical-variables-in-regressionanalysis/

In [109]:
bde = ce.BackwardDifferenceEncoder(cols=['Region'])
bde.fit(X)

BackwardDifferenceEncoder(cols=['Region'], drop_invariant=False,
             handle_unknown='impute', impute_missing=True,
             mapping=[{'col': 'Region', 'mapping':       [D.1]     [D.2]
1 -0.666667 -0.333333
2  0.333333 -0.333333
3  0.333333  0.666667
0  0.000000  0.000000}],
             return_df=True, verbose=0)

In [110]:
bde.transform(X)

Unnamed: 0,intercept,Region_0,Region_1
0,1,-0.666667,-0.333333
1,1,0.333333,-0.333333
2,1,-0.666667,-0.333333
3,1,-0.666667,-0.333333
4,1,0.333333,-0.333333
5,1,0.333333,0.666667


### Binary Encoder

Logic :  first the categories are encoded as ordinal, then those integers are converted into binary code, then the digits from that binary string are split into separate columns.  This encodes the data in fewer dimensions than one-hot, but with some distortion of the distances.

Implementation:
First, the states need to be mapped to ordinal values such as 1, 2, 3.

Consider the levels NY = 1, SF = 2, CT = 3.
Since the cardinality is 3 and the binary code for 3 is 1 1, two binary digits are enough. The output below has three columns, of which the leading column Region_0 is all zeros.

Binary for 1 i.e. NY - 0 1
Binary for 2 i.e. SF - 1 0
Binary for 3 i.e. CT - 1 1
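
The ordinal-to-binary step can be sketched in a few lines. The helper name and the choice of the minimal digit width are assumptions made for illustration:

```python
def to_binary_digits(ordinal, width):
    """Split the binary representation of an ordinal value into a list of digits."""
    return [int(bit) for bit in format(ordinal, '0{}b'.format(width))]

levels = {'NY': 1, 'SF': 2, 'CT': 3}
width = max(levels.values()).bit_length()  # 2 digits are enough for the value 3

encoded = {name: to_binary_digits(value, width) for name, value in levels.items()}
print(encoded)
```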

In [111]:
bin_e = ce.BinaryEncoder(cols=['Region'])
bin_e.fit(X)
bin_e.fit_transform(X)

Unnamed: 0,Region_0,Region_1,Region_2
0,0,0,1
1,0,1,0
2,0,0,1
3,0,0,1
4,0,1,0
5,0,1,1


### Helmert Encoder

Logic : the mean of the dependent variable for a level is compared with the mean of the dependent variable over all previous levels. This type of coding may be useful for **an ordinal variable**

Implementation:
First, the states need to be mapped to ordinal values such as 1, 2, 3.
The encoder returns k-1 variables, which is 3-1, i.e. 2 in our case.

Consider the levels NY = 1, SF = 2, CT = 3.

In column j, levels 1 to j get -1, level j+1 gets j, and later levels get 0:

NY = -1, -1
SF = 1, -1
CT = 0, 2
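
The encoder output shown further down in this section (NY = -1, -1; SF = 1, -1; CT = 0, 2) matches a contrast matrix in which column j holds -1 for levels 1 to j, j for level j+1, and 0 afterwards. A minimal sketch under that reading (the helper name is hypothetical):

```python
import numpy as np

def helmert_matrix(k):
    """Rows are levels 1..k, columns are the k-1 encoded variables."""
    matrix = np.zeros((k, k - 1))
    for j in range(1, k):
        matrix[:j, j - 1] = -1   # all previous levels
        matrix[j, j - 1] = j     # the level being compared
    return matrix

contrasts = helmert_matrix(3)
print(contrasts)  # rows: NY, SF, CT
```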


Reference:
https://stats.idre.ucla.edu/stata/webbooks/reg/chapter5/regression-with-statachapter-5-additional-coding-systems-for-categorical-variables-in-regressionanalysis/

In [116]:
helmert_encoder = ce.HelmertEncoder(cols=['Region'])
helmert_encoder.fit(X)
helmert_encoder.transform(X)

Unnamed: 0,intercept,Region_0,Region_1
0,1,-1.0,-1.0
1,1,1.0,-1.0
2,1,-1.0,-1.0
3,1,-1.0,-1.0
4,1,1.0,-1.0
5,1,0.0,2.0


### One hot Encoder

Logic: It creates k new variables, where k is the cardinality of the nominal variable. Each new variable is 1 for the rows where the original variable takes that value, and 0 otherwise.

Implementation: Region is split into one column per level, named by the ordinal index of the level: Region_1 (NY), Region_2 (SF) and Region_3 (CT), with the values [1, 0, 1, 1, 0, 0], [0, 1, 0, 0, 1, 0] and [0, 0, 0, 0, 0, 1] respectively. The output below also contains a Region_-1 column, which is all zeros here.

In [117]:
ohe = ce.OneHotEncoder(cols=['Region'])
ohe.fit_transform(X)
ohe.transform(X)

Unnamed: 0,Region_1,Region_2,Region_3,Region_-1
0,1,0,0,0
1,0,1,0,0
2,1,0,0,0
3,1,0,0,0
4,0,1,0,0
5,0,0,1,0


### Ordinal Encoder

Logic: It creates a single new column with k distinct integer values, where k is the cardinality of the ordinal variable.
Hence, this is generally used for ordinal variables.

Implementation: Region has k = 3 levels, so it is mapped to 3 values: [1, 2, 1, 1, 2, 3], where 1 represents NY, 2 represents SF, and 3 represents CT.


In [118]:
oe = ce.OrdinalEncoder(cols=['Region'])
oe.fit(X)
oe.transform(X)

Unnamed: 0,Region
0,1
1,2
2,1
3,1
4,2
5,3


### Sum Encoder

Logic: compares the mean of the dependent variable for a given level to the overall mean of the dependent variable over all the levels. That is, it uses contrasts between each of the first k-1 levels and level k. In this example, level 1 is compared to all the others, level 2 to all the others, and level 3 to all the others. In the output below, the last level (CT) is coded as -1 in both columns.
This coding does not depend on the order of the levels, so it is not limited to ordinal variables.
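
The output below corresponds to a contrast matrix made of a (k-1) x (k-1) identity block for the first k-1 levels and a row of -1 for the last level. A minimal sketch (the helper name is hypothetical):

```python
import numpy as np

def sum_contrast_matrix(k):
    """Identity rows for levels 1..k-1, a row of -1 for level k."""
    return np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])

contrasts = sum_contrast_matrix(3)
print(contrasts)  # rows: NY, SF, CT
```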

In [119]:
se = ce.SumEncoder(cols=['Region'])
se.fit(X)
se.transform(X)

Unnamed: 0,intercept,Region_0,Region_1
0,1,1.0,0.0
1,1,0.0,1.0
2,1,1.0,0.0
3,1,1.0,0.0
4,1,0.0,1.0
5,1,-1.0,-1.0


### Polynomial Encoding

Logic:  The coefficients taken on by polynomial coding for k=4 levels are the linear, quadratic, and cubic trends in the categorical variable. In this example k=3, so the two encoded columns are the linear and quadratic trends. The categorical variable here is assumed to be represented by an underlying, equally spaced numeric variable. Therefore, this type of encoding is used only for ordered categorical variables with equal spacing.

Implementation:
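
For k = 3 equally spaced levels, the linear and quadratic columns can be reproduced with an orthogonal polynomial construction. The sketch below assumes centered integer scores and a QR decomposition; the helper name is hypothetical and this is not taken from the category_encoders source.

```python
import numpy as np

def polynomial_contrast_matrix(k):
    """Orthonormal linear, quadratic, ... trends for k equally spaced levels."""
    scores = np.arange(k, dtype=float)
    scores -= scores.mean()                        # center the scores
    raw = np.vander(scores, k, increasing=True)    # columns: scores**0, scores**1, ...
    q, r = np.linalg.qr(raw)
    q = q * np.diag(r)                             # make the column signs deterministic
    q = q / np.sqrt((q ** 2).sum(axis=0))          # rescale columns to unit length
    return q[:, 1:]                                # drop the constant column

contrasts = polynomial_contrast_matrix(3)
print(contrasts)  # rows: NY, SF, CT; columns: linear, quadratic
```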

In [120]:
pe = ce.PolynomialEncoder(cols=['Region'])
pe.fit(X)
pe.transform(X)

Unnamed: 0,intercept,Region_0,Region_1
0,1,-0.7071068,0.408248
1,1,-5.5511150000000004e-17,-0.816497
2,1,-0.7071068,0.408248
3,1,-0.7071068,0.408248
4,1,-5.5511150000000004e-17,-0.816497
5,1,0.7071068,0.408248


### BaseN Encoding

Logic :  first the categories are encoded as ordinal, then those integers are converted into base-N code (binary for base 2, which is what the output below corresponds to), then the digits from that string are split into separate columns.  This encodes the data in fewer dimensions than one-hot, but with some distortion of the distances.

Implementation:
First, the states need to be mapped to ordinal values such as 1, 2, 3.

Consider the levels NY = 1, SF = 2, CT = 3.
Since the cardinality is 3 and the binary code for 3 is 1 1, two binary digits are enough. The output below has three columns, of which the leading column Region_0 is all zeros.

Binary for 1 i.e. NY - 0 1
Binary for 2 i.e. SF - 1 0
Binary for 3 i.e. CT - 1 1
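
The same idea generalizes to any base. A minimal sketch of the digit split, with a hypothetical helper name; with base 2 it gives the binary digits listed above:

```python
def to_base_n_digits(value, base, width):
    """Return the digits of value in the given base, most significant first."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, base)
        digits.append(remainder)
    return digits[::-1]

levels = {'NY': 1, 'SF': 2, 'CT': 3}
encoded = {name: to_base_n_digits(value, base=2, width=2)
           for name, value in levels.items()}
print(encoded)
```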

In [122]:
base_ne = ce.BaseNEncoder(cols=['Region'])
base_ne.fit(X)
base_ne.transform(X)

Unnamed: 0,Region_0,Region_1,Region_2
0,0,0,1
1,0,1,0
2,0,0,1
3,0,0,1
4,0,1,0
5,0,1,1


### Target Encoder

Logic : The mean of the dependent variable per level is smoothed according to the implementation shown below. TargetEncoder returns a weighted average of p(y|x) and p(y).


Implementation:

<br>prior = np.mean(df.Salary)
<br>Consider the levels NY = 1, SF = 2, CT = 3
<br>tmp = {'CT': {'sum': 90, 'count': 1, 'mean': 90.0}, 'NY': {'sum': 380, 'count': 3, 'mean': 126.66666666666667}, 'SF': {'sum': 260, 'count': 2, 'mean': 130.0}}
<br>min_samples_leaf = 1
<br>smoothing = 1 / (1 + np.exp(-(tmp[val]["count"] - min_samples_leaf) / smoothing))
<br>cust_smoothing = prior * (1 - smoothing) + tmp[val]['mean'] * smoothing
<br>tmp[val]['smoothing'] = cust_smoothing
<br>if count = 1, value = np.mean(df['Salary'])
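
The steps above can be reproduced with pandas and NumPy. This is a sketch of the arithmetic only, using min_samples_leaf = 1 and smoothing = 1.0; the variable names weight and encoded are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Region': ['NY', 'SF', 'NY', 'NY', 'SF', 'CT'],
                   'Salary': [100, 120, 150, 130, 140, 90]})

min_samples_leaf = 1
smoothing = 1.0

prior = df['Salary'].mean()
stats = df.groupby('Region')['Salary'].agg(['count', 'mean'])

# Sigmoid weight: more rows in a level means more trust in the level mean
weight = 1 / (1 + np.exp(-(stats['count'] - min_samples_leaf) / smoothing))
encoded = prior * (1 - weight) + stats['mean'] * weight

# Levels seen only once fall back to the prior
encoded[stats['count'] == 1] = prior

print(encoded)
```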

http://dx.doi.org/10.1145/507533.507538

In [126]:
te = ce.TargetEncoder(cols=['Region'])
te.fit(X, y)

TargetEncoder(cols=['Region'], drop_invariant=False, handle_unknown='impute',
       impute_missing=True, min_samples_leaf=1, return_df=True,
       smoothing=1.0, verbose=0)

In [127]:
te.transform(X)

Unnamed: 0,Region
0,126.070652
1,127.758821
2,126.070652
3,126.070652
4,127.758821
5,121.666667


### Leave One Out Encoder

Logic :
 LeaveOneOut does not calculate a weighted average with the prior - it just returns an estimate of p(y|x).
 LeaveOneOut performs leave-one-out estimation of p(y|x) - it excludes the current row from the estimate. TargetEncoder does not do that - it uses even the current row.
 Hence it transforms the training and test sets differently. For training data, it removes the current row from the calculation; for test data it does not.


Implementation: In the transform(X) call below, which is given no target, the mean of the salary per region is the encoded value for the categorical variable.
As we can see, the encoded value for:
NY = (100+150+130)/3
SF = (140+120)/2
CT = (90)/1

https://pkghosh.wordpress.com/2018/06/18/leave-one-out-encoding-for-categorical-feature-variables-on-spark/

In [128]:
loo = ce.LeaveOneOutEncoder(cols=['Region'])
loo.fit(X, y)

LeaveOneOutEncoder(cols=['Region'], drop_invariant=False,
          handle_unknown='impute', impute_missing=True, random_state=None,
          randomized=False, return_df=True, sigma=0.05, verbose=0)

In [130]:
loo.transform(X)

Unnamed: 0,Region
0,126.666667
1,130.0
2,126.666667
3,126.666667
4,130.0
5,90.0


## Bayesian Encoders
### Xam_BayesianTargetEncoder

https://github.com/MaxHalford/xam/blob/master/xam/feature_extraction/encoding/bayesian_target.py

Logic: It calculates the posteriors using the priors as shown below: 
        <br>prior_ = y.mean()
        <br>result = (pw * prior_ + counts * means) / (pw + counts)
        where 
        y = dependent variable
        pw = prior weights (needs to be optimised)
        counts = count for the dependent variable (y) variable for that categorical value
        means = mean for the dependent variable (y) variable for that categorical value

Implementation:<br>
* prior_ = (100+120+150+130+140+90)/6 = 121.66666666666667
* pw = 3
* counts : <br>NY:3 <br>SF:2 <br>CT:1
  
* means :  <br>NY:126.666667 <br>SF:130 <br>CT:90
    
* result: <br>NY:(3*121.66666666666667 + 3*126.666667)/(3+3) = 124.16666683333334 <br>SF:(3*121.66666666666667 + 2*130)/(5) = 125<br> CT:  (3*121.66666666666667 + 1*90)/(4) = 113.75
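
The posterior formula above can be checked with pandas. This is a sketch of the arithmetic with prior weight 3, not the xam implementation:

```python
import pandas as pd

df = pd.DataFrame({'Region': ['NY', 'SF', 'NY', 'NY', 'SF', 'CT'],
                   'Salary': [100, 120, 150, 130, 140, 90]})

prior_weight = 3
prior = df['Salary'].mean()
stats = df.groupby('Region')['Salary'].agg(['count', 'mean'])

# Weighted average of the global mean and the per-level mean
posterior = ((prior_weight * prior + stats['count'] * stats['mean'])
             / (prior_weight + stats['count']))
print(posterior)
```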

In [6]:
X_xam = X.copy()

In [7]:
from xam.feature_extraction import BayesianTargetEncoder
xam_encoder = BayesianTargetEncoder(
            columns=['Region'],
            prior_weight=3,
            suffix='')
encoded_X = xam_encoder.fit_transform(X_xam,y.Salary)
for region,value in zip(X.values,encoded_X.values):
    print (region,value)

['NY'] [124.16666667]
['SF'] [125.]
['NY'] [124.16666667]
['NY'] [124.16666667]
['SF'] [125.]
['CT'] [113.75]


### hcc_BayesEncoding

https://github.com/Robin888/hccEncoding-project/blob/master/hccEncoding/EncoderForRegression.py

Logic: It calculates the posteriors using the priors as shown below: 
    <br>count = count()
    <br>prior = y.mean()
    <br>means = Sample means grouped by categorical variable
    <br>B=1/(1+np.exp(-1*(count-k)/f))
    <br>result = mean * B + (1-B) * prior
    <br>where 
    <br> y = dependent variable
    <br> k [default=5] - parameter for BayesEncoding and BayesEncodingKfold; determines half of the minimal sample size for which we completely ‘trust’ the estimate based on the category's own sample
    <br> f [default=1] - parameter for BayesEncoding and BayesEncodingKfold; controls how quickly the weight changes from the prior to the posterior as the size of the group increases
    
Implementation:<br>
* counts : <br>NY:3 <br>SF:2 <br>CT:1
  
* means :  <br>NY:126.666667 <br>SF:130 <br>CT:90
* prior = (100+120+150+130+140+90)/6 = 121.66666666666667
* f : 1
* k : 5
<br>
* B :<br>NY: 1/(1+np.exp(-1*(3-5)/1)) = 0.11920292202211755
     <br>SF: 1/(1+np.exp(-1*(2-5)/1)) = 0.04742587317756678 
     <br>CT: 1/(1+np.exp(-1*(1-5)/1)) = 0.01798620996209156
   
* result:
     <br>NY: 0.11920292202211755*126.666667 + (1-0.11920292202211755)*121.66666666666667 = 122.26268131651156
     <br>SF: 0.04742587317756678*130 + (1-0.04742587317756678)*121.66666666666667 = 122.06188227647972
     <br>CT: 0.01798620996209156*90 + (1-0.01798620996209156)*121.66666666666667 = 121.097103
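
The B weights and the blended results can be reproduced with pandas and NumPy. This is a sketch of the formula with k = 5 and f = 1, not the hccEncoding source; the variable names are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Region': ['NY', 'SF', 'NY', 'NY', 'SF', 'CT'],
                   'Salary': [100, 120, 150, 130, 140, 90]})

k, f = 5, 1
prior = df['Salary'].mean()
stats = df.groupby('Region')['Salary'].agg(['count', 'mean'])

# B grows from 0 towards 1 as the group size passes k
B = 1 / (1 + np.exp(-(stats['count'] - k) / f))
result = stats['mean'] * B + (1 - B) * prior
print(result)
```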

In [8]:
from hccEncoding.EncoderForRegression import BayesEncoding

X_train, X_test=BayesEncoding(train=df,test=df,target='Salary', feature='Region', drop_origin_feature=False, noise=0)
X_train


Unnamed: 0,Region,Salary,bayes_Region
0,NY,100,122.262681
1,SF,120,122.061882
2,NY,150,122.262681
3,NY,130,122.262681
4,SF,140,122.061882
5,CT,90,121.097103


### hcc_BayesEncodingKfold

https://github.com/Robin888/hccEncoding-project/blob/master/hccEncoding/EncoderForRegression.py

Logic: It calculates the encoded value using BayesEncoding as shown above, combined with k-fold cross-validation.
   

In [41]:
from hccEncoding.EncoderForRegression import BayesEncodingKfold

X_train, X_test=BayesEncodingKfold(train=df,test=df,target='Salary', feature='Region', drop_origin_feature=False, noise=0)
X_train

Unnamed: 0,Region,Salary,bayes_Region
0,NY,100,128.092823
1,SF,120,127.724828
2,NY,150,115.952574
3,NY,130,120.237129
4,SF,140,118.035972
5,CT,90,128.0


### hcc_LOOEncoding

https://github.com/Robin888/hccEncoding-project/blob/master/hccEncoding/EncoderForRegression.py

Logic: It calculates the new variable as (sum - target) / (count - 1), i.e. the mean of the target over the other rows of the same category, where
* sum = sum of target grouped by the categorical variable we want to encode
* count = count of target grouped by the categorical variable we want to encode
* target = dependent variable (y) of the current row

Implementation:
* sum:<br> NY: 380 <br> SF: 260 <br> CT: 90
* counts : <br>NY:3 <br>SF:2 <br>CT:1<br>
* results : 
   <br>NY: (380 - 100)/(3-1) = 140
   <br>SF: (260 - 120)/(2-1) = 140
   <br>NY: (380 - 150)/(3-1) = 115
   <br>NY: (380 - 130)/(3-1) = 125
   <br>SF: (260 - 140)/(2-1) = 120
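   <br>CT: count is 1, so (90 - 90)/(1-1) is undefined; the output below shows 90.0 for this row
 
For the test dataset, it takes the average of the new column per category.
This is the same as the plain per-category average of the target, e.g. NY: (140 + 115 + 125)/3 = 126.67 = 380/3.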
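
The leave-one-out arithmetic for the training rows, and the per-category average used for test rows, can be sketched with pandas. Falling back to the category mean when a category has a single row is an assumption made here to match the 90.0 shown for CT in the output; it is not taken from the hccEncoding source.

```python
import pandas as pd

df = pd.DataFrame({'Region': ['NY', 'SF', 'NY', 'NY', 'SF', 'CT'],
                   'Salary': [100, 120, 150, 130, 140, 90]})

group = df.groupby('Region')['Salary']
total = group.transform('sum')
count = group.transform('count')
level_mean = group.transform('mean')

# Training rows: mean of the other rows in the same category.
# Single-row categories give 0/0, so they fall back to the category mean (assumption).
loo = ((total - df['Salary']) / (count - 1)).where(count > 1, level_mean)
train = df.assign(loo_Region=loo)

# Test rows: average of the training encodings per category
test_values = train.groupby('Region')['loo_Region'].mean()

print(train)
print(test_values)
```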

In [4]:
from hccEncoding.EncoderForRegression import LOOEncoding

X_train, X_test=LOOEncoding(train=df,test=df,target='Salary', feature='Region', drop_origin_feature=False, noise=0)
X_train

Unnamed: 0,Region,Salary,loo_Region
0,NY,100,140.0
1,SF,120,140.0
2,NY,150,115.0
3,NY,130,125.0
4,SF,140,120.0
5,CT,90,90.0


In [46]:
X_test

Unnamed: 0,Region,Salary,loo_Region
0,NY,100,126.666667
1,SF,120,130.0
2,NY,150,126.666667
3,NY,130,126.666667
4,SF,140,130.0
5,CT,90,90.0


### hcc_LOOEncodingKfold

https://github.com/Robin888/hccEncoding-project/blob/master/hccEncoding/EncoderForRegression.py

Logic: It calculates the encoded value using LOOEncoding as shown above, combined with k-fold cross-validation.
 

In [10]:
from hccEncoding.EncoderForRegression import LOOEncodingKfold

X_train, X_test=LOOEncodingKfold(train=df,test=df,target='Salary', feature='Region', drop_origin_feature=False, noise=0)
X_train

Unnamed: 0,Region,Salary,loo_Region
0,NY,100,140.0
1,SF,120,140.0
2,NY,150,115.0
3,NY,130,125.0
4,SF,140,120.0
5,CT,90,128.0


In [11]:
X_test

Unnamed: 0,Region,Salary,loo_Region
0,NY,100,126.666667
1,SF,120,130.0
2,NY,150,126.666667
3,NY,130,126.666667
4,SF,140,130.0
5,CT,90,90.0
