# Encoding categorical features and target values

## OrdinalEncoder

Encode categorical features as an integer array.

This results in a single column of integers (0 to n_categories - 1) per feature.

In [26]:
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
X = [['Male', 1], ['Female', 3], ['Female', 2], ['Female', np.nan]]
enc.fit(X)

In [20]:
enc.categories_

[array(['Female', 'Male'], dtype=object), array([1, 2, 3, nan], dtype=object)]

In [25]:
enc.transform([['Female', 3], ['Male', 1]])

array([[0., 2.],
       [1., 0.]])

In [22]:
enc.inverse_transform([[1, 0], [0, 1]])

array([['Male', 1],
       ['Female', 2]], dtype=object)

In [23]:
enc.transform(X)

array([[ 1.,  0.],
       [ 0.,  2.],
       [ 0.,  1.],
       [ 0., nan]])
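The integer codes above follow the sorted order of the categories, which is not always the order you want. The sketch below shows one way to impose an explicit order and to give unseen values a sentinel code. The `sizes` data and the `-1` sentinel are made up for illustration, and `handle_unknown='use_encoded_value'` assumes scikit-learn 0.24 or later.

```python
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: one feature with a natural order.
sizes = [['medium'], ['low'], ['high']]

# Pass the categories explicitly so that low < medium < high,
# and map anything not in the list to -1 instead of raising an error.
ordered_enc = OrdinalEncoder(
    categories=[['low', 'medium', 'high']],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)

codes = ordered_enc.fit_transform(sizes)
unknown_codes = ordered_enc.transform([['huge']])

print(codes)          # codes follow the explicit order, not alphabetical order
print(unknown_codes)  # the unseen value gets the sentinel
```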

## OneHotEncoder

Encode categorical features as a one-hot numeric array.

The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme.

This creates a binary column for each category and returns a sparse matrix or a dense array, depending on the `sparse` parameter (renamed `sparse_output` in scikit-learn 1.2).

In [27]:
from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X)

In [28]:
enc.categories_

[array(['Female', 'Male'], dtype=object), array([1, 2, 3, nan], dtype=object)]

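By default, `transform` returns a sparse matrix. Because `handle_unknown='ignore'` is set, the unknown value 4 is encoded as all zeros in that feature's columns.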

In [38]:
enc.transform([['Female', 1], ['Male', 4]])

<2x6 sparse matrix of type '<class 'numpy.float64'>'
	with 3 stored elements in Compressed Sparse Row format>

In [39]:
enc.transform([['Female', 1], ['Male', 4]]).toarray()

array([[1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.]])

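### Dealing with missing values

Missing values such as `np.nan` or `None` become categories of their own, but only if they were present in the data passed to `fit`. Here `np.nan` was in `X`, so it gets its own column; `None` was not, so it is treated as unknown and encoded as all zeros.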

In [42]:
enc.transform([['Female', np.nan], ['Male', None]]).toarray()  # None was not seen during fit, so it is ignored

array([[1., 0., 0., 0., 0., 1.],
       [0., 1., 0., 0., 0., 0.]])

You can also specify the categories explicitly. With `handle_unknown="ignore"`, every other value, including missing ones, is then encoded as all zeros.

In [45]:
enc = OneHotEncoder(categories=[["Female", "Male"], [1, 2, 3]], handle_unknown="ignore")
enc.fit(X)
enc.transform([['Female', np.nan], ['Male', None]]).toarray()  # both are ignored

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.]])

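### Dummy variables

Dummy coding is one-hot encoding with one column dropped per feature. Note that with `handle_unknown='ignore'`, an unknown value (such as 4 below) is encoded as all zeros, which is the same encoding as the dropped category.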

In [35]:
drop_enc = OneHotEncoder(drop='first', handle_unknown='ignore').fit(X)

In [36]:
drop_enc.transform([['Female', 1], ['Male', 4], ['Female', np.nan]]).toarray()



array([[0., 0., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.]])
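Dropping the first column of every feature is not the only choice. As a rough sketch, `drop='if_binary'` drops a column only for features with exactly two categories and leaves the others fully one-hot encoded. The `X_bin` data is hypothetical.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: a binary feature and a three-category feature.
X_bin = [['Male', 'a'], ['Female', 'b'], ['Female', 'c']]

# Only the binary feature loses a column ('Female' is dropped);
# the second feature keeps all three columns.
binary_enc = OneHotEncoder(drop='if_binary').fit(X_bin)

row = binary_enc.transform([['Male', 'b']]).toarray()
print(row)  # columns: Male, a, b, c
```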

But if your data has missing values, is one-hot encoding really what you need?

You could drop every record that is missing a value before encoding, I guess...
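A minimal sketch of that drop-then-encode idea, assuming the data lives in a pandas DataFrame. The column names `sex` and `level` are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Same values as X above, with hypothetical column names.
df = pd.DataFrame({
    'sex': ['Male', 'Female', 'Female', 'Female'],
    'level': [1, 3, 2, np.nan],
})

# Remove every row that has at least one missing value.
clean = df.dropna()

# With no missing values left, no extra nan column is created.
clean_enc = OneHotEncoder().fit(clean)
encoded = clean_enc.transform(clean).toarray()

print(clean_enc.categories_)
print(encoded)  # columns: Female, Male, 1.0, 2.0, 3.0
```

Whether dropping rows is acceptable depends on how much data is lost; here a quarter of the records disappear.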

## Encoding target values: LabelEncoder

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

`OrdinalEncoder` fits data of shape `(n_samples, n_features)`, while `LabelEncoder` only fits data of shape `(n_samples,)`. See https://datascience.stackexchange.com/questions/39317/difference-between-ordinalencoder-and-labelencoder
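The shape difference can be seen by encoding the same values both ways. This is only an illustrative sketch with made-up labels.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical labels, stored as a 1-D array of shape (4,).
y = np.array(['cat', 'dog', 'cat', 'bird'])

# LabelEncoder takes the 1-D array as is and returns 1-D integer codes.
y_codes = LabelEncoder().fit_transform(y)

# OrdinalEncoder needs a 2-D array, so reshape to a single column first.
X_col = y.reshape(-1, 1)
X_codes = OrdinalEncoder().fit_transform(X_col)

print(y_codes.shape, y_codes)
print(X_codes.shape, X_codes.ravel())
```

Both encoders assign codes in sorted category order, so the values agree; only the shape and dtype differ.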

In [46]:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit([1, 2, 2, 6])
le.classes_

array([1, 2, 6])

In [47]:
le.transform([1, 1, 2, 6])

array([0, 0, 1, 2], dtype=int64)

In [48]:
le.inverse_transform([0, 0, 1, 2])

array([1, 1, 2, 6])