### Dealing with categorical variables

One hot encode
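- each category becomes its own binary indicator column: 1 if the row belongs to that category, 0 otherwise
- adds one column per distinct category, so dimensionality grows linearly with cardinality; for high-cardinality features this feeds the curse of dimensionality
- the feature is no longer a single column: its information is spread across many sparse columns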
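
A minimal sketch of the column growth described above, assuming a made-up `colors` series (not the cars data used later). It shows only that one-hot encoding yields one indicator column per distinct category.

```python
import pandas as pd

# Hypothetical toy column, not taken from the cars dataset
colors = pd.Series(["Black", "Silver", "Grey", "Black", "Red"], name="color")

# One indicator column per distinct category
one_hot = pd.get_dummies(colors, prefix="color")

n_categories = colors.nunique()   # distinct values in the original column
n_columns = one_hot.shape[1]      # columns after encoding

# Each row activates exactly one indicator
row_sums = one_hot.sum(axis=1)
```

Here the single `color` column turns into as many columns as there are distinct colors, which is why very high-cardinality features are usually encoded some other way.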

Label encode
- each category is mapped to an integer: `0, 1, 2, 3`
- is an ordinal encoding: it implies an order between categories even if the feature is not ordinal
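
The implied order can be seen in a small sketch. The `sizes` series and its category order are made up for illustration, and pandas categoricals stand in for a label encoder here. Default integer codes follow sorted order, which only matches the real order of the feature by accident.

```python
import pandas as pd

# Hypothetical toy feature with a genuine order: small < medium < large
sizes = pd.Series(["medium", "small", "large", "small"])

# Default codes follow sorted (alphabetical) order: large=0, medium=1, small=2
alphabetical_codes = sizes.astype("category").cat.codes

# Supplying the order explicitly gives codes that respect it
ordered = pd.Categorical(sizes, categories=["small", "medium", "large"], ordered=True)
ordinal_codes = pd.Series(ordered.codes, index=sizes.index)
```

If the feature has no natural order at all (such as color), neither set of integers carries real meaning, and a model may still treat them as ordered.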

Mean encode
- replace each category with the average of the target over the training rows in that category
- could also use other statistics like median, quantiles or variance
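
A short sketch of swapping in other statistics, assuming a made-up `train` frame with a `store` category and a `Sales` target. The per-category statistics are computed once with `groupby(...).agg(...)` and then mapped back onto the rows.

```python
import pandas as pd

# Hypothetical training data; one store has an outlier to separate mean and median
train = pd.DataFrame({
    "store": ["Alpha", "Alpha", "Alpha", "Beta", "Beta", "Beta"],
    "Sales": [1000, 2000, 6000, 100, 200, 300],
})

# Per-category statistics of the target
stats = train.groupby("store")["Sales"].agg(["mean", "median", "var"])

# Map each statistic back onto the rows as a separate encoded feature
encoded = train.assign(
    store_mean=train["store"].map(stats["mean"]),
    store_median=train["store"].map(stats["median"]),
    store_var=train["store"].map(stats["var"]),
)
```

The median is less sensitive to the outlier in `Alpha` than the mean, which is one reason to consider it for skewed targets.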

Target encode
- each category is replaced with a blend of the posterior probability of the target given that category and the prior probability of the target over all the training data.
- the posterior probability is p(t = 1 | x = ci), where t denotes the target, x is the input and ci is the i-th category.
- the encodings are estimated from the training data only; they are not re-estimated on the test data.
- We usually save the target encodings obtained from the training data set and use the same encodings to encode features in the test data set.
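
The blend and the train/test split can be sketched as follows. The weighting `n / (n + m)` is one common smoothing choice and is an assumption here, not necessarily the formula `category_encoders` uses. The helper names, the parameter `m`, and the toy data are all hypothetical.

```python
import pandas as pd

def fit_target_encoding(categories, target, m=2.0):
    """Learn smoothed per-category encodings from training data only."""
    prior = target.mean()                                  # p(t = 1) over all training rows
    stats = target.groupby(categories).agg(["mean", "count"])
    weight = stats["count"] / (stats["count"] + m)         # more rows -> trust the posterior more
    encoding = weight * stats["mean"] + (1 - weight) * prior
    return encoding, prior

def apply_target_encoding(categories, encoding, prior):
    """Reuse the saved encodings; unseen categories fall back to the prior."""
    return categories.map(encoding).fillna(prior)

# Hypothetical binary target
train = pd.DataFrame({
    "seat": ["Leather", "Leather", "Leather", "Cloth"],
    "target": [1, 1, 0, 0],
})
test = pd.DataFrame({"seat": ["Cloth", "Leather", "Vinyl"]})

encoding, prior = fit_target_encoding(train["seat"], train["target"], m=2.0)
test_encoded = apply_target_encoding(test["seat"], encoding, prior)
```

`Cloth` appears only once in training, so its encoding is pulled strongly towards the prior, while `Vinyl`, which never appears in training, receives the prior itself.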

BaseN Encoding
- In binary encoding, we convert the integers into binary, i.e. base 2.
  - First, the categories are encoded as ordinal, then those integers are converted into binary code, then the digits from that binary string are split into separate columns.
  - The idea is related to hashing, in that many categories share a small number of columns. Read more here: https://ieeexplore.ieee.org/document/5474379
- A base of 1 is equivalent to one-hot encoding, a base of 2 is equivalent to binary encoding.
- BaseN allows us to convert the integers with any value of the base.
- well suited to high-cardinality columns (columns with many distinct categories)
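
The three steps (ordinal codes, base-N digits, one column per digit) can be written out by hand. This is a sketch of the idea only, with hypothetical helper names. The number of columns it produces may differ from what `category_encoders.BaseNEncoder` emits, and it assumes `base >= 2`.

```python
import pandas as pd

def base_n_digits(value, base, width):
    """Digits of `value` in the given base, most significant first, zero-padded to `width`."""
    digits = []
    for _ in range(width):
        digits.append(value % base)
        value //= base
    return digits[::-1]

def base_n_encode(series, base):
    if base < 2:
        raise ValueError("this sketch only handles base >= 2")
    # Step 1: ordinal codes 1, 2, 3, ... in order of first appearance
    codes = pd.factorize(series)[0] + 1
    # Step 2: number of digits needed to write the largest code
    width = 1
    while base ** width <= codes.max():
        width += 1
    # Step 3: one column per digit
    rows = [base_n_digits(int(code), base, width) for code in codes]
    columns = [f"{series.name}_{i}" for i in range(width)]
    return pd.DataFrame(rows, columns=columns, index=series.index)

classes = pd.Series(["a", "b", "a", "d", "e"], name="class")
encoded = base_n_encode(classes, base=2)
```

Four distinct categories fit into three binary columns here, and the saving over one-hot encoding grows as the number of categories or the base increases.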

In [1]:
import category_encoders as ce
import pandas as pd
import numpy as np

#import warnings
#warnings.filterwarnings("ignore")


data = pd.read_csv('./data/cars/cars.csv',index_col=0)
data.head()

Unnamed: 0,Foreign/Local Used,color,wheel drive,Automation,seat-make,price,description,make-year,manufacturer
0,Foreign Used,Black,4,Automatic,Leather,17500000,2014 Lexus LX,2014,Lexus
1,Foreign Used,Black,4,Automatic,Leather,13000000,2012 Toyota Sequoia,2012,Toyota
2,Foreign Used,Blue,4,Automatic,Cloth,6500000,2007 Toyota FJ CRUISER,2007,Toyota
3,Foreign Used,Black,4,Automatic,Leather,4700000,2005 Lexus GX,2005,Lexus
4,Foreign Used,Grey,4,Automatic,Leather,3800000,2005 Toyota 4-Runner,2008,Toyota


In [2]:
data.color.value_counts()

color
Black         390
Silver        243
Grey          131
Red            84
White          81
Blue           76
Gold           57
Maroon         47
Dark Grey      32
Dark Blue      25
Dark Green     13
Green           8
Other           3
Name: count, dtype: int64

In [3]:
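#Label Encoding the color column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()  #instantiate the Label Encoder
data['color'] = le.fit_transform(data['color'])  #classes are numbered in sorted order

#instantiate a new LabelEncoder for each column, so every mapping can be inverted later
#(scikit-learn intends LabelEncoder for targets; OrdinalEncoder is its counterpart for features)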

In [4]:
data.head()

Unnamed: 0,Foreign/Local Used,color,wheel drive,Automation,seat-make,price,description,make-year,manufacturer
0,Foreign Used,0,4,Automatic,Leather,17500000,2014 Lexus LX,2014,Lexus
1,Foreign Used,0,4,Automatic,Leather,13000000,2012 Toyota Sequoia,2012,Toyota
2,Foreign Used,1,4,Automatic,Cloth,6500000,2007 Toyota FJ CRUISER,2007,Toyota
3,Foreign Used,0,4,Automatic,Leather,4700000,2005 Lexus GX,2005,Lexus
4,Foreign Used,7,4,Automatic,Leather,3800000,2005 Toyota 4-Runner,2008,Toyota


In [5]:
data['seat-make'].value_counts()

seat-make
Leather    920
Cloth      270
Name: count, dtype: int64

In [6]:
data.dtypes

Foreign/Local Used    object
color                  int64
wheel drive            int64
Automation            object
seat-make             object
price                  int64
description           object
make-year              int64
manufacturer          object
dtype: object

In [7]:
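#one hot encoding for the Foreign/Local Used and Automation columns
#(category_encoders is already imported as ce in the first cell)

# create an object of the OneHotEncoder
ce_one = ce.OneHotEncoder(cols=['Foreign/Local Used','Automation'])

ce_one.fit_transform(data)

AttributeError: module 'pandas.api.types' has no attribute 'is_categorical'

Note: `pandas.api.types.is_categorical` is no longer available in recent pandas releases, so this error points to an installed category_encoders version that predates that change. Upgrading category_encoders should resolve it.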
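
Where the `category_encoders` call is unavailable, the same kind of indicator columns can be sketched with pandas alone. The `toy` frame below is made up (its rows are not from the cars data), and only the two column names are borrowed from the cell above.

```python
import pandas as pd

# Hypothetical rows; only the column names mirror the cars dataset
toy = pd.DataFrame({
    "Foreign/Local Used": ["Foreign Used", "Locally Used", "Foreign Used"],
    "Automation": ["Automatic", "Manual", "Automatic"],
    "price": [17500000, 1200000, 6500000],
})

# Encode several columns at once; untouched columns are kept as they are
encoded = pd.get_dummies(
    toy,
    columns=["Foreign/Local Used", "Automation"],
    dtype=int,  # 0/1 integers instead of booleans
)
```

Unlike a fitted encoder, `pd.get_dummies` does not remember the categories it saw, so train and test frames need to be aligned by hand if their categories differ.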

In [8]:
#get_dummies: convert a categorical variable into dummy/indicator variables
pd.get_dummies(data['Foreign/Local Used']).head(30)

Unnamed: 0,Foreign Used,Locally Used
0,True,False
1,True,False
2,True,False
3,True,False
4,True,False
5,True,False
6,True,False
7,True,False
8,True,False
9,True,False


In [10]:
#Target encoding
#create an object of the TargetEncoder
ce_te = ce.TargetEncoder(cols=['seat-make'])

#column to encode (kept as a DataFrame) and the target
X = data[['seat-make']]
y = data['color']  #label-encoded color from In [3], used here only as a demo target

#fit on the training data, then transform
ce_te.fit(X,y)

ce_te.transform(X).head()

AttributeError: module 'pandas.api.types' has no attribute 'is_categorical'

In [10]:
# make some data
example_df = pd.DataFrame({
 'class' : ['a', 'b', 'a', 'b', 'd', 'e', 'd', 'f', 'g', 'h', 'h', 'k', 'h', 'i', 's', 'p', 'z']})
example_df

Unnamed: 0,class
0,a
1,b
2,a
3,b
4,d
5,e
6,d
7,f
8,g
9,h


In [11]:
# create an object of the BaseNEncoder
ce_baseN4 = ce.BaseNEncoder(cols=['class'],base=4)
# fit and transform and you will get the encoded data
ce_baseN4.fit_transform(example_df).head(12)

Unnamed: 0,class_0,class_1,class_2
0,0,0,1
1,0,0,2
2,0,0,1
3,0,0,2
4,0,0,3
5,0,1,0
6,0,0,3
7,0,1,1
8,0,1,2
9,0,1,3


In [12]:
#mean encode
def mean_encode(data, col, on):
    #mean of the target column `on` for each category in `col`
    #(selecting `on` first avoids averaging non-numeric columns)
    mapper = data.groupby(col)[on].mean()

    data = data.copy()  #leave the caller's dataframe untouched
    data[col] = data[col].map(mapper)
    #rows whose category is missing get the mean of the encoded column
    data[col] = data[col].fillna(data[col].mean())

    return data

In [13]:
#example dataframe_1
store1 = pd.DataFrame({'store': ['Alpha'] * 3,
         'Sales': [1000, 2000, 3000],
         'noise': [0, 0, 0]})

#example dataframe_2
store2 = pd.DataFrame(
        {'store': ['Beta'] * 3,
         'Sales': [100, 200, 300],
         'noise': [0, 0, 0]})

data = pd.concat([store1, store2], axis=0)  #concat dataframe
#expected after mean encoding 'store' on 'Sales':
#np.testing.assert_array_equal(mean_encode(data, 'store', 'Sales')['store'], np.array([2000, 2000, 2000, 200, 200, 200]))

In [14]:
data

Unnamed: 0,store,Sales,noise
0,Alpha,1000,0
1,Alpha,2000,0
2,Alpha,3000,0
0,Beta,100,0
1,Beta,200,0
2,Beta,300,0


In [15]:
mean_encode(data, col='store', on='Sales')

Unnamed: 0,store,Sales,noise
0,2000,1000,0
1,2000,2000,0
2,2000,3000,0
0,200,100,0
1,200,200,0
2,200,300,0


In [16]:
#median encode
def median_encoder(df, col, on):
    #median of the target column `on` for each category in `col`
    mapper = df.groupby(col)[on].median()

    df = df.copy()  #leave the caller's dataframe untouched
    df[col] = df[col].map(mapper)
    #rows whose category is missing get the median of the encoded column
    df[col] = df[col].fillna(df[col].median())

    return df