# Understanding OneHotEncoder

Let's begin with a minimal example.

In [6]:
from sklearn.preprocessing import OneHotEncoder

In [7]:
# The input must always be a 2D array: each row is a sample, each column is a feature
states  = [['ON', 'OFF']]
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(states)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)

In [8]:
encoder.categories_

[array(['ON'], dtype=object), array(['OFF'], dtype=object)]

In [13]:
encoded = encoder.transform([['OFF', 'ON']])
encoded.toarray()

array([[0., 0.]])

In [14]:
encoder.inverse_transform(encoded)

array([[None, None]], dtype=object)

What went wrong here?
    - Why has [['OFF', 'ON']] been encoded as [[0, 0]]?

Let's investigate further by changing the shape of the input.

In [15]:
states  = [['ON'], ['OFF']]
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(states)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)

In [16]:
encoder.categories_

[array(['OFF', 'ON'], dtype=object)]

In [17]:
encoded = encoder.transform([['OFF'], ['ON']])
encoded.toarray()

array([[1., 0.],
       [0., 1.]])

In [18]:
encoder.inverse_transform(encoded)

array([['OFF'],
       ['ON']], dtype=object)

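This works fine. The reason is the shape of the input passed to fit:

    - [['ON', 'OFF']] is a single sample with two features, so each feature has exactly one known category ('ON' for the first, 'OFF' for the second). Transforming [['OFF', 'ON']] therefore hits an unknown category in both columns, and handle_unknown='ignore' encodes each of them as all zeros.
    - [['ON'], ['OFF']] is two samples of a single feature, so that feature has two known categories ('OFF' and 'ON').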
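
The difference between the two inputs can be checked directly from their shapes. The following is a minimal sketch, assuming NumPy is available; the variable names (`one_row`, `one_col`, `row_encoder`, `col_encoder`) are illustrative and do not appear in the cells above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One sample with two features: shape (1, 2)
one_row = np.array([['ON', 'OFF']])

# Two samples with one feature: shape (2, 1).
# reshape(-1, 1) turns a flat list of values into a single column.
one_col = np.array(['ON', 'OFF']).reshape(-1, 1)

row_encoder = OneHotEncoder(handle_unknown='ignore').fit(one_row)
col_encoder = OneHotEncoder(handle_unknown='ignore').fit(one_col)

# categories_ holds one array per column (feature) of the fitted data
row_categories = [cats.tolist() for cats in row_encoder.categories_]
col_categories = [cats.tolist() for cats in col_encoder.categories_]

print(one_row.shape, row_categories)
print(one_col.shape, col_categories)
```

If the states start out as a flat list, reshaping to a single column before calling fit is usually what is intended.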
    
The rest of the encoding works the same way as in the label binarizer and the other binarizers.

Now let's look at a slightly more advanced case.

Let's consider the following example:

    - A 2D array
    - Each sub-array holds the gender as its first element and the roll number as its second
    - We need to encode both columns

In [21]:
data = [['Male', 1], ['Female', 3], ['Male', 2], ['Female', 5]]
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(data)

OneHotEncoder(categorical_features=None, categories=None,
       dtype=<class 'numpy.float64'>, handle_unknown='ignore',
       n_values=None, sparse=True)

In [22]:
encoder.categories_

[array(['Female', 'Male'], dtype=object), array([1, 2, 3, 5], dtype=object)]

In [23]:
encoded = encoder.transform([['Female', 1], ['Male', 4]])
encoded.toarray()

array([[1., 0., 1., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0.]])
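
Each encoded row has six columns because the encoder lays out the categories of the first feature (two genders) followed by the categories of the second feature (four roll numbers). The sketch below rebuilds that layout from `categories_`. It assumes the same data as above, wrapped in an object array so the roll numbers stay integers; `column_labels` is an illustrative name.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# dtype=object keeps the roll numbers as integers next to the strings
data = np.array(
    [['Male', 1], ['Female', 3], ['Male', 2], ['Female', 5]],
    dtype=object,
)
encoder = OneHotEncoder(handle_unknown='ignore').fit(data)

# One (feature index, category) pair per output column, in output order
column_labels = [
    (feature_index, category)
    for feature_index, cats in enumerate(encoder.categories_)
    for category in cats.tolist()
]
print(column_labels)
```

Read this way, the second row of the output has a 1 in the 'Male' column and zeros in every roll-number column, since 4 is not among the known roll numbers.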

In [24]:
encoder.inverse_transform(encoded)

array([['Female', 1],
       ['Male', None]], dtype=object)

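Roll number 4 could not be encoded because it was not in the data passed to fit. With handle_unknown='ignore', an unknown category is encoded as all zeros for that feature, and inverse_transform returns None for it.

'Female', 'Male' and roll number 1 were all seen during fit, so they are encoded and decoded correctly.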
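
There are two common ways to deal with an unseen value like this, sketched below under some assumptions. The first uses `handle_unknown='error'` so that the transform fails loudly instead of silently producing zeros. The second passes the full set of expected values through the `categories` parameter. The range 1 to 5 for roll numbers is an assumption made for illustration, and the names `strict`, `declared` and `unseen` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data = np.array(
    [['Male', 1], ['Female', 3], ['Male', 2], ['Female', 5]],
    dtype=object,
)
unseen = np.array([['Male', 4]], dtype=object)

# Option 1: raise an error on unknown categories instead of ignoring them
strict = OneHotEncoder(handle_unknown='error').fit(data)
try:
    strict.transform(unseen)
    raised = False
except ValueError as exc:
    raised = True
    print('transform failed:', exc)

# Option 2: declare every expected category up front, one list per feature.
# The roll-number range 1..5 is assumed here, not taken from the data.
declared = OneHotEncoder(
    categories=[['Female', 'Male'], [1, 2, 3, 4, 5]],
    handle_unknown='ignore',
).fit(data)
row = declared.transform(unseen).toarray()
print(row)
print(declared.inverse_transform(row))
```

With the categories declared, the output has seven columns (two genders plus five roll numbers), and roll number 4 gets its own column even though it never appeared in the fitted data.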