## Problem

Features often come in text or categorical form, for example sizes ("Small", "Medium", "Large") or ratings ("Good", "Average", "Bad"). However, most machine learning models operate only on numerical values.

## Solution
Map each category or word to a number in a way that retains its meaning. For ordered categories such as sizes, this means the numbers should preserve the order.

### Categorical encoding example

In [3]:
# An array of sizes
sizes = ["S", "M", "L", "S", "M", "S", "M", "L", "M", "L"]

# A dictionary containing the mapping
mapping = {"S": 1, "M": 2, "L": 3}

# Look up every size in the mapping
numerical_sizes = [mapping[size] for size in sizes]
print("Numerical sizes are")
print(numerical_sizes)

Numerical sizes are
[1, 2, 3, 1, 2, 1, 2, 3, 2, 3]


#### Problem with this approach
Manually mapping each category to a number can be tedious, especially when the number of categories is large. Instead, we could assign a new index to each new category, and map the input automatically. 
Let's see how: 

In [18]:
colors = ["Red", "Yellow", "Green", "Green", "Purple", "Orange", "Blue", "White", "Red", "Black", "Yellow"]

unique_colors = list(set(colors))  # A set keeps only unique colors; its order is arbitrary

# The indices depend on the set's order, so they may differ between runs
colors_numerical = [unique_colors.index(color) for color in colors]
print("Numerical colors are: ")
print(colors_numerical)

Numerical colors are: 
[5, 6, 3, 3, 0, 2, 4, 1, 5, 7, 6]
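
Because a set has no guaranteed order, the indices above can change from run to run. Here is a minimal sketch of a reproducible alternative, assuming it is acceptable to number the categories in order of first appearance. The names `color_to_index` and `colors_first_seen` are illustrative and not part of the original notebook.

```python
colors = ["Red", "Yellow", "Green", "Green", "Purple", "Orange", "Blue", "White", "Red", "Black", "Yellow"]

# dict.fromkeys keeps the first occurrence of each color, in insertion order
color_to_index = {color: index for index, color in enumerate(dict.fromkeys(colors))}

# A dictionary lookup replaces the repeated list.index() scan
colors_first_seen = [color_to_index[color] for color in colors]
print(colors_first_seen)  # [0, 1, 2, 2, 3, 4, 5, 6, 0, 7, 1]
```

The dictionary lookup also avoids rescanning the list of unique colors for every element, which matters once the number of categories grows.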


We can skip the hassle of finding the unique categories ourselves by using scikit-learn's LabelEncoder, which assigns indices according to the sorted order of the labels.

In [11]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
colors_numerical = le.fit_transform(colors)
print(colors_numerical)

[5 7 2 2 4 3 1 6 5 0 7]
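
As a hedged sketch of how the encoding can be inspected and reversed: a fitted `LabelEncoder` stores the sorted category names in `classes_`, and `inverse_transform` maps indices back to names. The variable names `encoded` and `decoded` are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

colors = ["Red", "Yellow", "Green", "Green", "Purple", "Orange", "Blue", "White", "Red", "Black", "Yellow"]

le = LabelEncoder()
encoded = le.fit_transform(colors)

# classes_ holds the unique labels in sorted order; a label's position is its index
print(list(le.classes_))

# inverse_transform recovers the original strings from the indices
decoded = list(le.inverse_transform(encoded))
print(decoded == colors)  # True
```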


### Will this always work?

Sometimes, directly mapping a word to a number can cause problems.
Let's look at the colors example above. Colors have no natural order, yet a model may treat the integers as magnitudes, so a color mapped to a larger number carries more numerical weight than one mapped to a smaller number. This ordering is arbitrary and can lead to poorly trained models.
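
To make the concern concrete, here is a small sketch that uses the sorted-order indices printed by LabelEncoder above, hard-coded in a hypothetical `encoding` dictionary so the snippet runs on its own. A model that reads the integers as magnitudes would see some colors as "closer" than others, even though the gaps come purely from alphabetical order.

```python
# Indices as assigned by sorted order of the color names
encoding = {"Black": 0, "Blue": 1, "Green": 2, "Orange": 3,
            "Purple": 4, "Red": 5, "White": 6, "Yellow": 7}

def encoded_distance(first, second):
    """Absolute difference between the integer codes of two colors."""
    return abs(encoding[first] - encoding[second])

print(encoded_distance("Black", "Blue"))    # 1
print(encoded_distance("Black", "Yellow"))  # 7
```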

#### One-Hot Encoding
What we could do instead is convert each value to an array whose length equals the number of categories, with the entry at the category's index set to 1 and all the others set to 0 (hence the name "One-Hot"). Since exactly one entry is 1 for every category, no category outweighs another, which eliminates the problem discussed above.

For instance, if we had Yes->1, No->2, Maybe->3, we could convert them to arrays as follows:
- Yes -> [1,0,0]
- No -> [0,1,0]
- Maybe -> [0,0,1]
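
The mapping above can be sketched in plain Python, assuming zero-based positions (Yes is position 0, No is 1, Maybe is 2). The helper `one_hot` is hypothetical and written only for illustration.

```python
def one_hot(index, num_categories):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * num_categories
    vector[index] = 1
    return vector

answers = ["Yes", "No", "Maybe"]
for position, answer in enumerate(answers):
    print(answer, "->", one_hot(position, len(answers)))
```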

#### Example

In [17]:
# We will be using the Keras library here
from keras.utils import to_categorical

one_hot_colors = to_categorical(colors_numerical)

print("One hot encoded values of colors are: ")
for one_hot_color in one_hot_colors:
    print(one_hot_color)

One hot encoded values of colors are: 
[ 0.  0.  0.  0.  0.  1.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  1.]
[ 0.  0.  1.  0.  0.  0.  0.  0.]
[ 0.  0.  1.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  1.  0.  0.  0.]
[ 0.  0.  0.  1.  0.  0.  0.  0.]
[ 0.  1.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  1.  0.]
[ 0.  0.  0.  0.  0.  1.  0.  0.]
[ 1.  0.  0.  0.  0.  0.  0.  0.]
[ 0.  0.  0.  0.  0.  0.  0.  1.]
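
If Keras is not available, the same matrix can be built with NumPy alone. This is a sketch of an equivalent approach, not the Keras implementation: indexing the identity matrix `np.eye(n)` with the integer labels selects one row per sample, and `argmax` along each row recovers the labels. The integer labels are hard-coded from the LabelEncoder output so the snippet runs on its own.

```python
import numpy as np

# Integer labels as printed by LabelEncoder earlier
colors_numerical = np.array([5, 7, 2, 2, 4, 3, 1, 6, 5, 0, 7])
num_categories = int(colors_numerical.max()) + 1

# Row i of the identity matrix is the one-hot vector for label i
one_hot_colors = np.eye(num_categories)[colors_numerical]
print(one_hot_colors.shape)  # (11, 8)

# The position of the single 1 in each row is the original label
recovered = one_hot_colors.argmax(axis=1)
print(np.array_equal(recovered, colors_numerical))  # True
```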
