# Categorical Transformer

1. Feature Encoding
2. Label Encoding

## OneHotEncoder

Encodes categorical features as a one-hot numeric array: it creates one binary column for each of the k unique values of a feature.

In [9]:
import numpy as np
X = np.array([[1],[2],[3],[1],[2]])
print(X)

[[1]
 [2]
 [3]
 [1]
 [2]]


In [11]:
# there are k = 3 unique values here,
# so the transformed matrix will have three columns

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2 use sparse=False
ohe.fit_transform(X)

array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
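To see which column stands for which value, the fitted encoder can be inspected. The following is a minimal sketch rather than part of the original notebook: it uses the default sparse output and densifies it with `.toarray()` so that it does not depend on the `sparse`/`sparse_output` argument name, and the variable names `encoded`, `categories` and `decoded` are chosen here for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([[1], [2], [3], [1], [2]])

ohe = OneHotEncoder()                      # default output is a SciPy sparse matrix
encoded = ohe.fit_transform(X).toarray()   # densify only for inspection

# categories_ holds one array of learned categories per input feature;
# the position of a value in that array is the index of its column
categories = ohe.categories_[0]
print(categories)

# inverse_transform maps one-hot rows back to the original values
decoded = ohe.inverse_transform(encoded)
print(decoded.ravel())
```

Keeping the sparse representation is usually preferable when k is large, since most entries of a one-hot matrix are zero.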

## Label Encoder

Encodes the target variable with values between 0 and k-1, where k is the number of distinct values.

In [14]:
import numpy as np
mat = np.array([[1],[2],[6],[1],[8],[6]])
print(mat)

# here k is 4: the distinct values are 1, 2, 6, 8

[[1]
 [2]
 [6]
 [1]
 [8]
 [6]]


In [17]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit_transform(mat.ravel())  # LabelEncoder expects a 1D array of labels

# here 1 is encoded as 0 | 2 is encoded as 1 | 6 is encoded as 2 | 8 is encoded as 3

array([0, 1, 2, 0, 3, 2])
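The mapping between labels and codes is stored on the fitted encoder. A short sketch of how it could be read back, assuming the same label values as above passed as a 1D array (the names `y`, `codes` and `original` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

y = np.array([1, 2, 6, 1, 8, 6])

le = LabelEncoder()
codes = le.fit_transform(y)

# classes_ lists the distinct labels in sorted order;
# the index of a label in classes_ is its integer code
print(le.classes_)

# inverse_transform turns integer codes back into the original labels
original = le.inverse_transform(codes)
print(original)
```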

## Ordinal Encoder

Encodes categorical features with values between 0 and k-1, where k is the number of distinct values in each feature.

In [18]:
import numpy as np
mat = np.array([[1,'male'],
                [2,'female'],
                [6,'female'],
                [1,'male'],
                [8,'male'],
                [6,'female']])
print(mat)

[['1' 'male']
 ['2' 'female']
 ['6' 'female']
 ['1' 'male']
 ['8' 'male']
 ['6' 'female']]


In [20]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit_transform(mat)

array([[0., 1.],
       [1., 0.],
       [2., 0.],
       [0., 1.],
       [3., 1.],
       [2., 0.]])

OrdinalEncoder | LabelEncoder
----|----
Can operate on multidimensional data (a 2D feature matrix) | Operates only on 1D data (the target)
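The difference in the table can be checked directly. This is a small sketch under the assumption that `LabelEncoder` rejects a two-column array with a `ValueError`; the flag `label_encoder_failed` is a hypothetical name used only for this demonstration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# two categorical features per row
X = np.array([['1', 'male'],
              ['2', 'female'],
              ['6', 'female']])

# OrdinalEncoder encodes every column independently
oe_out = OrdinalEncoder().fit_transform(X)
print(oe_out)

# LabelEncoder is meant for a single 1D target and refuses 2D input
try:
    LabelEncoder().fit_transform(X)
    label_encoder_failed = False
except ValueError as err:
    label_encoder_failed = True
    print("LabelEncoder error:", err)
```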

## LabelBinarizer

Several regression and binary classification algorithms can be extended to the multiclass setting in a one-vs-all fashion.

This involves training a single regressor or classifier per class.

For this we need to convert multiclass labels to binary labels, and LabelBinarizer performs this task.

In [21]:
import numpy as np
mat = np.array([[1],[2],[6],[1],[8],[6]])
print(mat)

[[1]
 [2]
 [6]
 [1]
 [8]
 [6]]


In [24]:
from sklearn.preprocessing import LabelBinarizer
lb = LabelBinarizer()
lb.fit_transform(mat)

# each column corresponds to one unique value
# column 1 -> 1 | column 2 -> 2 | column 3 -> 6 | column 4 -> 8

array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]])

In [25]:
# if the estimator supports the multiclass setup natively, this step is not needed
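In a one-vs-all setup, each column of the binarized matrix serves as the binary target for one class. The sketch below only prints those per-class targets and does not train any estimator; it assumes the same labels as above in 1D form, and the names `Y` and `recovered` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

y = np.array([1, 2, 6, 1, 8, 6])

lb = LabelBinarizer()
Y = lb.fit_transform(y)

# one binary target vector per class: 1 where the sample belongs
# to that class, 0 for every other class ("the rest")
for cls, column in zip(lb.classes_, Y.T):
    print(f"class {cls} vs rest: {column}")

# inverse_transform collapses the binary columns back to class labels
recovered = lb.inverse_transform(Y)
print(recovered)
```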

## Multilabel Binarizer

In [28]:
movie_genre = [{'action','comedy'},
               {'comedy'},
               {'action','thriller'},
               {'science-fiction','action','thriller'}]
mat = np.array(movie_genre)
print(mat)

[{'action', 'comedy'} {'comedy'} {'action', 'thriller'}
 {'science-fiction', 'action', 'thriller'}]


In [29]:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
mlb.fit_transform(mat)

array([[1, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1],
       [1, 0, 1, 1]])
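The output columns follow the sorted order of the labels seen during fitting. A brief sketch of how the columns could be matched to genres and decoded again, passing the list of sets directly instead of wrapping it in a NumPy array (the names `Y` and `recovered` are illustrative):

```python
from sklearn.preprocessing import MultiLabelBinarizer

movie_genre = [{'action', 'comedy'},
               {'comedy'},
               {'action', 'thriller'},
               {'science-fiction', 'action', 'thriller'}]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(movie_genre)

# classes_ gives the label behind each output column, in column order
print(mlb.classes_)

# inverse_transform returns one tuple of labels per row
recovered = mlb.inverse_transform(Y)
print(recovered)
```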

# Add Dummy Feature

In [40]:
import numpy as np
mat = np.array([[1],[2],[6],[1],[8],[6]])
print(mat)

#here k is 4: 1 2 6 8

[[1]
 [2]
 [6]
 [1]
 [8]
 [6]]


In [42]:
from sklearn.preprocessing import add_dummy_feature
add_dummy_feature(mat)

# prepends a new column of all ones (the dummy feature)

array([[1., 1.],
       [1., 2.],
       [1., 6.],
       [1., 1.],
       [1., 8.],
       [1., 6.]])
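For a dense array, the same result can be reproduced by hand, which makes explicit what the helper does. A minimal sketch, with `manual` as an illustrative name for the hand-built matrix:

```python
import numpy as np
from sklearn.preprocessing import add_dummy_feature

mat = np.array([[1], [2], [6], [1], [8], [6]])

# library helper: prepends a column filled with 1.0 by default
with_dummy = add_dummy_feature(mat)

# manual equivalent: stack a column of ones to the left of the data
manual = np.hstack([np.ones((mat.shape[0], 1)), mat])

print(np.array_equal(with_dummy, manual))
```

Such a constant column is commonly used so that a linear model can learn an intercept term as an ordinary weight.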