Code from "Smarter Ways to Encode Categorical Data for Machine Learning: Exploring Category Encoders" by Jeff Hale
https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

In [2]:
# An example of using the binary encoder

# import the packages
import numpy as np
import pandas as pd
import category_encoders as ce

# make some data
df = pd.DataFrame({
 'color':["a", "b", "a", "c"], 
 'outcome':[1, 2, 3, 2]})

# split into X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)

# instantiate an encoder - here we use BinaryEncoder
ce_binary = ce.BinaryEncoder(cols = ['color'])

# fit and transform and presto, you've got encoded data
ce_binary.fit_transform(X, y)

   color_0  color_1
0        0        1
1        1        0
2        0        1
3        1        1


In [3]:
print(df)

  color  outcome
0     a        1
1     b        2
2     a        3
3     c        2


The following examples, along with more of interest, are from https://www.kaggle.com/code/discdiver/category-encoders-examples/notebook

In [4]:
import numpy as np
import pandas as pd              # version 0.23.4
import category_encoders as ce   # version 1.2.8
from sklearn.preprocessing import LabelEncoder

pd.options.display.float_format = '{:.2f}'.format # to make legible

# make some data
df = pd.DataFrame({
    'color':["a", "c", "a", "a", "b", "b"], 
    'outcome':[1, 2, 0, 0, 0, 1]})

# set up X and y
X = df.drop('outcome', axis = 1)
y = df.drop('color', axis = 1)

In [5]:
X

  color
0     a
1     c
2     a
3     a
4     b
5     b


OrdinalEncoder maps each string category to an integer

In [7]:
ce_ord = ce.OrdinalEncoder(cols=['color'])
ce_ord.fit_transform(X,y['outcome'])

   color
0      1
1      2
2      1
3      1
4      3
5      3
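
The integers in the output above follow the order in which each category first appears (a, then c, then b). A minimal pure-Python sketch of that idea, assuming codes start at 1; `ordinal_encode` is a hypothetical helper written for illustration, not part of category_encoders:

```python
def ordinal_encode(values):
    """Map each distinct value to an integer, numbered from 1 in order of first appearance."""
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1  # next unused integer
        codes.append(mapping[v])
    return codes, mapping

colors = ["a", "c", "a", "a", "b", "b"]
codes, mapping = ordinal_encode(colors)
print(codes)    # [1, 2, 1, 1, 3, 3]
print(mapping)  # {'a': 1, 'c': 2, 'b': 3}
```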


OneHotEncoder creates one 0/1 column per category

In [9]:
ce_one_hot = ce.OneHotEncoder(cols=['color'])
ce_one_hot.fit_transform(X,y)

   color_1  color_2  color_3
0        1        0        0
1        0        1        0
2        1        0        0
3        1        0        0
4        0        0        1
5        0        0        1
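
Each row of the one-hot output has a single 1 in the column of its category. The sketch below reproduces that layout in plain Python, assuming columns are ordered by first appearance of each category; `one_hot_encode` is a hypothetical helper, not the library's implementation:

```python
def one_hot_encode(values):
    """Return one 0/1 row per value, with one column per distinct category."""
    categories = list(dict.fromkeys(values))  # distinct values, first-appearance order
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories

colors = ["a", "c", "a", "a", "b", "b"]
rows, categories = one_hot_encode(colors)
print(categories)  # ['a', 'c', 'b']
for row in rows:
    print(row)
```

With k categories this produces k columns, which is why one-hot encoding gets wide quickly for high-cardinality features.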


BinaryEncoder writes each category's integer code in binary, one column per bit

In [11]:
ce_bin = ce.BinaryEncoder(cols=['color'])
ce_bin.fit_transform(X,y)

   color_0  color_1
0        0        1
1        1        0
2        0        1
3        0        1
4        1        1
5        1        1
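
The binary output can be read as the ordinal code of each category (a=1, c=2, b=3) written in base 2, one bit per column. A rough sketch of that reading, assuming first-appearance ordinal codes starting at 1; `binary_encode` is a hypothetical helper and may differ from what the library does internally:

```python
def binary_encode(values):
    """Ordinal-encode the values, then split each code into its binary digits."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping) + 1)  # codes 1, 2, 3, ... by first appearance
    width = max(mapping.values()).bit_length()   # number of bit columns needed
    return [[int(bit) for bit in format(mapping[v], f"0{width}b")] for v in values]

colors = ["a", "c", "a", "a", "b", "b"]
bits = binary_encode(colors)
for row in bits:
    print(row)
# [0, 1]
# [1, 0]
# [0, 1]
# [0, 1]
# [1, 1]
# [1, 1]
```

Three categories fit in two columns here, where one-hot encoding needed three.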


BaseNEncoder generalizes binary encoding to an arbitrary base, though it seems to be rarely used. The default base is 2, which makes it equivalent to BinaryEncoder.

In [12]:
ce_basen = ce.BaseNEncoder(cols=['color'])
ce_basen.fit_transform(X,y)

   color_0  color_1
0        0        1
1        1        0
2        0        1
3        0        1
4        1        1
5        1        1
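
To see what a different base would change, the sketch below writes the same ordinal codes in base N. It is an illustration only, assuming first-appearance codes starting at 1; `to_base_n` and `base_n_encode` are hypothetical helpers, not the library's code:

```python
def to_base_n(number, base, width):
    """Return the digits of number in the given base, zero-padded to width."""
    digits = []
    for _ in range(width):
        number, remainder = divmod(number, base)
        digits.append(remainder)
    return digits[::-1]  # most significant digit first

def base_n_encode(values, base=2):
    """Ordinal-encode the values, then write each code as base-N digits."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping) + 1)
    width = 1
    while base ** width <= len(mapping):  # grow until the largest code fits
        width += 1
    return [to_base_n(mapping[v], base, width) for v in values]

colors = ["a", "c", "a", "a", "b", "b"]
base2 = base_n_encode(colors, base=2)
base3 = base_n_encode(colors, base=3)
print(base2)  # [[0, 1], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]
print(base3)  # [[0, 1], [0, 2], [0, 1], [0, 1], [1, 0], [1, 0]]
```

With base 2 the digits match the binary encoding shown earlier; with base 3 the columns can hold the values 0, 1 and 2.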


HashingEncoder hashes each category into a fixed number of columns (8 here)

In [13]:
ce_hash = ce.HashingEncoder(cols=['color'])
ce_hash.fit_transform(X,y)

   col_0  col_1  col_2  col_3  col_4  col_5  col_6  col_7
0      0      1      0      0      0      0      0      0
1      0      0      0      1      0      0      0      0
2      0      1      0      0      0      0      0      0
3      0      1      0      0      0      0      0      0
4      0      0      0      0      0      0      0      1
5      0      0      0      0      0      0      0      1
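
The general idea behind this output is the hashing trick: hash each value, take the result modulo the number of columns, and set that column to 1. The sketch below assumes MD5 and 8 columns; `hash_encode` is a hypothetical helper, and its column indices are not guaranteed to match the library output above:

```python
import hashlib

def hash_encode(values, n_components=8):
    """Place each value in one of n_components columns chosen by its hash."""
    rows = []
    for v in values:
        digest = hashlib.md5(str(v).encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_components  # column chosen by the hash
        row = [0] * n_components
        row[index] = 1
        rows.append(row)
    return rows

colors = ["a", "c", "a", "a", "b", "b"]
hashed = hash_encode(colors)
for row in hashed:
    print(row)
```

No mapping is stored, so unseen categories can still be encoded, but two different categories may collide in the same column.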
