In [14]:
from sklearn import preprocessing
import numpy as np

'''
Features are often given as categories rather than continuous values. For example, a person
could have the features:

["male", "female"]
["from Europe", "from US", "from Asia"]
["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]

Such features can be coded as integers, for instance:

["male", "from US", "uses Internet Explorer"] as [0, 1, 3]
["female", "from Asia", "uses Chrome"] as [1, 2, 1]

This integer representation cannot be used directly with scikit-learn estimators, because they
expect continuous input and would interpret the categories as ordered, which is often not desired.

One way to convert categorical features into features that scikit-learn estimators can use is
one-of-K (one-hot) encoding, which is implemented in OneHotEncoder.

This estimator transforms each categorical feature with m possible values into m binary features,
of which exactly one is active.
'''

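# A single categorical feature that takes the values 0..3
X = [[0], [1], [1], [2], [3]]

enc = preprocessing.OneHotEncoder()
enc.fit(X)

# transform() returns a sparse matrix; toarray() makes it dense for printing
print(enc.transform([[0]]).toarray())
print(enc.transform([[1]]).toarray())
print(enc.transform([[2]]).toarray())
print(enc.transform([[3]]).toarray())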

[[ 1.  0.  0.  0.]]
[[ 0.  1.  0.  0.]]
[[ 0.  0.  1.  0.]]
[[ 0.  0.  0.  1.]]
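
The mapping "m possible values become m binary features, with only one active" can also be reproduced by hand. The following is a minimal NumPy sketch, not how OneHotEncoder is implemented. The helper name `one_hot` is hypothetical, and it assumes the categories are already coded as integers 0..m-1.

```python
import numpy as np

def one_hot(codes, m):
    """Turn integer category codes in [0, m) into rows with a single 1."""
    codes = np.asarray(codes)
    # Row i of the m x m identity matrix is the one-hot vector for code i
    return np.eye(m)[codes]

codes = [0, 1, 1, 2, 3]
encoded = one_hot(codes, 4)
print(encoded)
```

Each output row has exactly one nonzero entry, at the column index equal to the category code.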


In [15]:
# Three categorical features: column 0 takes 2 values, column 1 takes 3, column 2 takes 4
X = [[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]]

enc = preprocessing.OneHotEncoder()
enc.fit(X)

enc.transform([[0, 1, 3]]).toarray()

array([[ 1.,  0.,  0.,  1.,  0.,  0.,  0.,  0.,  1.]])
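
To see why the row above has nine columns, it may help to rebuild it by hand. The sketch below is an illustration only, using plain NumPy rather than the encoder. It assumes each column's codes are the contiguous integers 0..m-1, and the variable names (`n_distinct`, `offsets`, `sample`) are chosen here for illustration.

```python
import numpy as np

X = np.array([[0, 0, 3], [1, 1, 0], [0, 2, 1], [1, 0, 2]])

# Number of distinct values observed in each column
n_distinct = [len(np.unique(X[:, j])) for j in range(X.shape[1])]

# Index of the first output column belonging to each feature
offsets = np.concatenate(([0], np.cumsum(n_distinct)[:-1]))

# Encode one sample: set a single 1 inside each feature's block
sample = np.array([0, 1, 3])
row = np.zeros(sum(n_distinct))
row[offsets + sample] = 1.0

print(n_distinct)
print(offsets)
print(row)
```

The three features contribute blocks of 2, 3 and 4 columns, so the encoded row has 2 + 3 + 4 = 9 entries with one active entry per block.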

In [None]:
'''
By default, the number of values each feature can take is inferred automatically from the dataset.
It can also be specified explicitly with the n_values parameter, which is useful when the training
data may not contain every possible value of a feature. (In newer scikit-learn releases, n_values
has been replaced by the categories parameter.)

Our dataset has two genders, three possible continents and four web browsers, so we pass
n_values=[2, 3, 4], fit the estimator, and transform a data point.

In the result, the first two numbers encode the gender, the next three the continent, and the
last four the web browser.
'''

In [16]:
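# Declare the number of possible values per feature: 2 genders, 3 continents, 4 browsers.
# n_values exists only in older scikit-learn releases; newer ones use categories instead.
enc = preprocessing.OneHotEncoder(n_values=[2, 3, 4])

# The training data does not cover every value (continents 0 and 1 never appear)
enc.fit([[1, 2, 3], [0, 2, 0]])

print(enc.transform([[1, 0, 0]]).toarray())

print(enc.transform([[0, 1, 0]]).toarray())

print(enc.transform([[1, 2, 3]]).toarray())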

[[ 0.  1.  1.  0.  0.  1.  0.  0.  0.]]
[[ 1.  0.  0.  1.  0.  1.  0.  0.  0.]]
[[ 0.  1.  0.  0.  1.  0.  0.  0.  1.]]
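
For readers on a recent scikit-learn, the same idea can be sketched with the `categories` parameter, which lists the allowed values of each feature explicitly. This assumes scikit-learn 0.20 or later. The explicit value lists below are written out for this example and are not part of the original notebook.

```python
from sklearn.preprocessing import OneHotEncoder

# Allowed values per feature: 2 genders, 3 continents, 4 browsers
categories = [[0, 1], [0, 1, 2], [0, 1, 2, 3]]

enc = OneHotEncoder(categories=categories)
# The training data need not contain every listed value
enc.fit([[1, 2, 3], [0, 2, 0]])

encoded = enc.transform([[1, 0, 0]]).toarray()
print(encoded)
```

Because the categories are declared up front, the encoded row still has nine columns even though the fitted data never contains continents 0 or 1.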
