# Categorical Transformers
* Feature Encoding
* Label Encoding

### OneHotEncoding
* Encodes a categorical feature or label as a one-hot numeric array.
* Creates one binary column for each of the K unique values.
* For each sample, exactly one column is 1 and the rest are 0.

What does it look like?

From this:

| wine_color |
|------------|
| red        |
| white      |
| rosé       |
| red        |

To this:

| red | white | rosé |
|-----|-------|------|
| 1   | 0     | 0    |
| 0   | 1     | 0    |
| 0   | 0     | 1    |
| 1   | 0     | 0    |

Here too, we can call `fit_transform` on OneHotEncoder: `fit` scans each column for all of its unique entries, and `transform` builds a binary vector for each sample, with one position per unique category.

By default it outputs a SciPy sparse matrix rather than a NumPy array, which saves space when the number of categories is large.

We can use `toarray` to convert it to a NumPy array, and the list of categories can be viewed through the `categories_` attribute.

In [1]:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array([["red"], ["white"], ["rose"], ["red"]])

ohe = OneHotEncoder()
ohe.fit_transform(data).toarray()

array([[1., 0., 0.],
       [0., 0., 1.],
       [0., 1., 0.],
       [1., 0., 0.]])

In [None]:
ohe.categories_

[array(['red', 'rose', 'white'], dtype='<U5')]
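
As a small sketch of the points above, the snippet below checks that the default output is sparse, that each row contains exactly one 1, and which column belongs to which category. The variable names (`encoded`, `column_of`) are illustrative and not part of the original notebook.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

data = np.array([["red"], ["white"], ["rose"], ["red"]])

ohe = OneHotEncoder()
encoded = ohe.fit_transform(data)

# With default settings the result is a SciPy sparse matrix
is_sparse = sparse.issparse(encoded)

# Convert to a dense NumPy array and confirm each row has exactly one 1
dense = encoded.toarray()
row_sums = dense.sum(axis=1)

# Columns follow the sorted order of the categories found during fit
column_of = {str(cat): idx for idx, cat in enumerate(ohe.categories_[0])}

print(is_sparse)
print(row_sums)
print(column_of)
```

Because the categories are sorted, the column order here is red, rose, white, which is why the second sample ("white") has its 1 in the last column.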

### Label Encoding
* Encodes target labels with values between 0 and K-1, where K is the number of distinct values.


In [None]:
from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a 1D array of labels; a column vector triggers a conversion warning
data = np.array([1, 2, 6, 1, 8, 6])

le = LabelEncoder()
le.fit_transform(data)

array([0, 1, 2, 0, 3, 2])

### Ordinal Encoder
* Encodes categorical features as an integer array: the values of each feature are mapped to integers between 0 and K-1, where K is the number of distinct values in that feature.


In [None]:
from sklearn.preprocessing import OrdinalEncoder

data = np.array([[1, "male"], [2, "female"], [6, "female"], [1, "male"], [8, "male"], [6, "female"]])

oe = OrdinalEncoder()
oe.fit_transform(data)

array([[0., 1.],
       [1., 0.],
       [2., 0.],
       [0., 1.],
       [3., 1.],
       [2., 0.]])

* OrdinalEncoder can operate on multi-dimensional data (one column per feature), while LabelEncoder can transform only 1D data.
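
The difference in input shape can be sketched side by side. The data below is made up for illustration; the point is only that LabelEncoder takes a 1D array of labels, while OrdinalEncoder takes a 2D array and encodes each column independently.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical 1D target labels
labels = np.array(["cat", "dog", "bird", "cat"])
le = LabelEncoder()
encoded_labels = le.fit_transform(labels)  # classes are sorted: bird, cat, dog

# Hypothetical 2D feature matrix: one column per categorical feature
features = np.array([["small", "male"],
                     ["large", "female"],
                     ["medium", "female"]])
oe = OrdinalEncoder()
encoded_features = oe.fit_transform(features)  # each column is encoded separately

print(encoded_labels)
print(encoded_features)
print(oe.categories_)
```

Note that the integer codes follow the sorted order of the values ("large" < "medium" < "small" alphabetically), not any semantic ordering of the categories.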

### LabelBinarizer
* Several regression and binary classification algorithms can be extended to a multi-class setup in a one-vs-all fashion.
* This involves training a single regressor or classifier per class.
* For this, we need to convert multi-class labels to binary labels, and LabelBinarizer helps us do this.


In [4]:
from sklearn.preprocessing import LabelBinarizer

data = np.array([[1], [2], [6], [1], [8], [6]])

lb = LabelBinarizer()
lb.fit_transform(data)

array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [1, 0, 0, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]])

* If the estimator supports multi-class data, there is no need for LabelBinarizer.
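
To make the one-vs-all idea concrete, here is a minimal sketch that pulls out one binary target vector per class from the LabelBinarizer output. The dictionary `per_class_targets` is an illustrative name, not something from the original notebook.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

labels = np.array([1, 2, 6, 1, 8, 6])

lb = LabelBinarizer()
binary = lb.fit_transform(labels)

# Column i is the binary target for class lb.classes_[i]:
# 1 where the sample belongs to that class, 0 otherwise
per_class_targets = {int(c): binary[:, i] for i, c in enumerate(lb.classes_)}

# inverse_transform maps the binary matrix back to the original labels
recovered = lb.inverse_transform(binary)

print(lb.classes_)
print(per_class_targets[6])
print(recovered)
```

Each entry of `per_class_targets` could then serve as the target for one binary classifier in a one-vs-all setup.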

### MultiLabelBinarizer
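* Encodes a collection of label sets as a binary indicator matrix with K columns, where K is the number of classes. A 1 in a column means the sample carries that label, so a row may contain several 1s.
* In the given example, $K = 4$ since the number of movie genres is 4.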

In [6]:
from sklearn.preprocessing import MultiLabelBinarizer

movie_genres = [
    {"action", "comedy"},
    {"comedy"},
    {"action", "thriller"},
    {"sci-fi", "action", "thriller"}]

mlb = MultiLabelBinarizer()
mlb.fit_transform(movie_genres)

array([[1, 1, 0, 0],
       [0, 1, 0, 0],
       [1, 0, 0, 1],
       [1, 0, 1, 1]])

### add_dummy_feature
* Augments the dataset with an extra column vector in which every value is 1; the new column is placed first.

In [7]:
from sklearn.preprocessing import add_dummy_feature

data = np.array([[7, 1], [1, 8], [2, 0], [9, 6]])

add_dummy_feature(data)

array([[1., 7., 1.],
       [1., 1., 8.],
       [1., 2., 0.],
       [1., 9., 6.]])