# **1. Handling Categorical data**

When we talk about categorical data, we have to further distinguish between
ordinal and nominal features. Ordinal features can be understood as categorical
values that can be ordered or sorted; for example, shirt size is ordinal because
XL > L > M. In contrast, nominal features such as color
do not imply any order.

For more from a practitioner's perspective, see [this article](https://www.analyticsvidhya.com/blog/2020/08/types-of-categorical-data-encoding/). It discusses label/ordinal, one-hot, dummy, effect, binary, base-N, hash, and target encoding of categorical attributes.


In [None]:
import pandas as pd

df = pd.DataFrame([['green', 'M',  10.1, 'classA'],
                   ['red',   'L',  13.5, 'classB'],
                   ['blue',  'XL', 15.3, 'classA']])

df.columns = ['color', 'size', 'price', 'class_label']
df


Unnamed: 0,color,size,price,class_label
0,green,M,10.1,classA
1,red,L,13.5,classB
2,blue,XL,15.3,classA


## **1.1 Mapping ordinal features: Size**

To ensure that the learning algorithm interprets the ordinal features correctly,
we need to convert the categorical string values into integers. Unfortunately, there is no convenient function that can automatically derive the correct order of the labels of our size feature, so we have to define the mapping manually. In the following simple example, let's assume that we know the numerical difference between the feature values, for example, XL = L + 1 = M + 2, with M = 1:

In [None]:
size_mapping = {'XL': 3,
                'L' : 2,
                'M' : 1}

df['size'] = df['size'].map(size_mapping)
df

Unnamed: 0,color,size,price,class_label
0,green,1,10.1,classA
1,red,2,13.5,classB
2,blue,3,15.3,classA


If we want to transform the integer values back to the original string representation
at a later stage, we can define a **reverse-mapping** dictionary,
`inv_size_mapping = {v: k for k, v in size_mapping.items()}`. Like the
size_mapping dictionary that we used previously, it can be passed to the pandas
map method on the transformed feature column. We can use it as follows:

In [None]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
# print(inv_size_mapping)
df['size'].map(inv_size_mapping)


0     M
1     L
2    XL
Name: size, dtype: object
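
As an alternative to a hand-written dictionary, pandas can store the order itself in an ordered categorical. The following is only a minimal sketch: the order still has to be supplied manually, and the names `sizes`, `ordered_sizes`, and `size_codes` are illustrative and not part of the notebook above. The integer codes start at 0, so 1 is added to reproduce the M = 1 convention used here.

```python
import pandas as pd

# Illustrative copy of the size column (hypothetical variable names)
sizes = pd.Series(['M', 'L', 'XL'], name='size')

# The order is still defined by hand: M < L < XL
ordered_sizes = pd.Categorical(sizes, categories=['M', 'L', 'XL'], ordered=True)

# .codes gives the position of each value in `categories`, starting at 0;
# adding 1 matches the size_mapping convention (M=1, L=2, XL=3)
size_codes = pd.Series(ordered_sizes.codes + 1, name='size')
print(size_codes.tolist())

# Ordered categoricals also support comparisons against a category
larger_than_m = ordered_sizes > 'M'
print(larger_than_m.tolist())
```

Whether this is preferable to a plain dictionary depends on the use case; the dictionary makes the numeric values explicit, while the categorical keeps the original labels and their order together.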

## **1.2 Encoding Class labels**

To encode the class labels, we can use an
approach similar to the mapping of ordinal features discussed previously. We need
to remember that class labels are not ordinal, and it doesn't matter which integer
number we assign to a particular string label. Thus, we can simply enumerate
the class labels, starting at 0:

In [None]:
import numpy as np

# create a mapping dict
# to convert class labels from strings to integers
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['class_label']))}

print(class_mapping)

# Use the mapping dictionary to transform the class labels into integers
df['class_label'] = df['class_label'].map(class_mapping)
df

{'classA': 0, 'classB': 1}


Unnamed: 0,color,size,price,class_label
0,green,1,10.1,0
1,red,2,13.5,1
2,blue,3,15.3,0


In [None]:
# reverse the class label mapping, going back
inv_class_mapping = {v: k for k, v in class_mapping.items()}

print(inv_class_mapping)

df['class_label'] = df['class_label'].map(inv_class_mapping)
df

{0: 'classA', 1: 'classB'}


Unnamed: 0,color,size,price,class_label
0,green,1,10.1,classA
1,red,2,13.5,classB
2,blue,3,15.3,classA


In [None]:
# Alternatively, there is a convenient LabelEncoder class implemented in scikit-learn to achieve this:

from sklearn.preprocessing import LabelEncoder

# Label encoding with sklearn's LabelEncoder
class_le = LabelEncoder()

y = class_le.fit_transform(df['class_label'].values)

print(y)

# reverse mapping
y = class_le.inverse_transform(y)
print(y)

[0 1 0]
['classA' 'classB' 'classA']
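
To see which integer a fitted LabelEncoder assigned to each label, one option is to inspect its `classes_` attribute, where the position of a label is its integer code. The snippet below is a small sketch on a stand-alone copy of the labels; the names `labels`, `le`, and `learned_mapping` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Stand-alone copy of the class labels from the example DataFrame
labels = np.array(['classA', 'classB', 'classA'])

le = LabelEncoder()
encoded = le.fit_transform(labels)

# classes_ holds the sorted unique labels; the index of a label is its code
learned_mapping = {str(label): idx for idx, label in enumerate(le.classes_)}
print(learned_mapping)
print(encoded.tolist())
```

The resulting dictionary plays the same role as the class_mapping dictionary built by hand earlier.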


## **1.3 One-hot encoding**



In [None]:
X = df[['color', 'size', 'price']].values

color_le = LabelEncoder()

X[:, 0] = color_le.fit_transform(X[:, 0])

X

array([[1, 1, 10.1],
       [2, 2, 13.5],
       [0, 3, 15.3]], dtype=object)

After executing the preceding code, the first column of the NumPy array,
X, now holds the new color values, which are encoded as follows:

• blue = 0

• green = 1

• red = 2


If we stop at this point and feed the array to our classifier, we will make one of the
most common mistakes in dealing with categorical data. 
Although the color values don't come in any particular order, a learning algorithm
will now assume that green is larger than blue, and red is larger than green.
Although this assumption is incorrect, the algorithm could still produce useful
results. However, those results would not be optimal.
A common workaround for this problem is to use a technique called **one-hot
encoding**. The idea behind this approach is to create a new dummy feature for each
unique value in the nominal feature column. Here, we would convert the color
feature into three new features: blue, green, and red. Binary values can then be used
to indicate the particular color of an example; for example, a blue example can be
encoded as blue=1, green=0, red=0. To perform this transformation, we can use the
OneHotEncoder that is implemented in scikit-learn's preprocessing module:

In [None]:
from sklearn.preprocessing import OneHotEncoder

X = df[['color', 'size', 'price']].values

color_ohe = OneHotEncoder()

# reshape(-1, 1) turns the 1-D color column into a 2-D array with a single column;
# -1 lets NumPy infer the number of rows
print(color_ohe.fit_transform(X[:, 0].reshape(-1, 1)).toarray())

from sklearn.compose import ColumnTransformer

X = df[['color', 'size', 'price']].values

# ColumnTransformer accepts a list of (name, transformer, column(s)) tuples
c_transf = ColumnTransformer([('onehot', OneHotEncoder(), [0]),
                              ('nothing', 'passthrough', [1, 2])])

c_transf.fit_transform(X).astype(float)

[[0. 1. 0.]
 [0. 0. 1.]
 [1. 0. 0.]]


array([[ 0. ,  1. ,  0. ,  1. , 10.1],
       [ 0. ,  0. ,  1. ,  2. , 13.5],
       [ 1. ,  0. ,  0. ,  3. , 15.3]])
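
To make the idea behind one-hot encoding concrete, the same matrix can be built by hand with NumPy. This is only an illustrative sketch, not how OneHotEncoder is implemented; the names `colors`, `categories`, `codes`, and `one_hot` are assumptions made for this example.

```python
import numpy as np

# Stand-alone copy of the color column
colors = np.array(['green', 'red', 'blue'])

# Sorted unique values and, for each row, the index of its value
categories, codes = np.unique(colors, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i,
# so indexing with `codes` picks the right vector for every example
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot)
```

Each row contains exactly one 1, in the column of that example's color (blue, green, red in sorted order), so no ordering between colors is implied.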

## **1.4 get_dummies()** -- arguably the most convenient way to create dummies

In [None]:
# An even more convenient way to create those dummy features via one-hot encoding
# is to use the get_dummies function implemented in pandas. Applied to a DataFrame,
# get_dummies will only convert string columns and leave all other
# columns unchanged:

pd.get_dummies(df[['price', 'color', 'size','class_label']])


Unnamed: 0,price,size,color_blue,color_green,color_red,class_label_classA,class_label_classB
0,10.1,1,0,1,0,1,0
1,13.5,2,0,0,1,0,1
2,15.3,3,1,0,0,1,0
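
Two optional arguments of get_dummies are worth knowing. The sketch below uses a stand-alone DataFrame named `df_demo` (an illustrative name) and assumes a reasonably recent pandas. Passing `columns=` restricts the encoding to the listed columns, and `dtype=int` requests 0/1 integers; in pandas 2.0 and later the dummy columns are boolean by default, so the output would otherwise show True/False. `drop_first=True` drops one dummy column per feature, which corresponds to the "dummy encoding" variant mentioned in the article linked at the top.

```python
import pandas as pd

# Stand-alone copy of the example data with size already mapped to integers
df_demo = pd.DataFrame({'color': ['green', 'red', 'blue'],
                        'size': [1, 2, 3],
                        'price': [10.1, 13.5, 15.3]})

# Encode only the color column and return 0/1 integers
dummies = pd.get_dummies(df_demo, columns=['color'], dtype=int)
print(dummies)

# Drop the first dummy column (color_blue); a row with zeros in both
# remaining color columns is then implicitly blue
reduced = pd.get_dummies(df_demo, columns=['color'], drop_first=True, dtype=int)
print(reduced)
```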


## Assignment - Handling Categorical Data

Work through the steps in **1. Handling Categorical data**
for the following dataframe.

1. Map the ordinal feature shirt_size to numbers
2. One-hot encode class_label
3. Use get_dummies to encode shirt_size

```
df = pd.DataFrame([['yellow','XL', 1085.07, 'classC'],
                   ['blue',  'L',  339.61,  'classB'],
                   ['green', 'L',  400.0,   'classB'],
                   ['green', 'M',  238,     'classB'],
                   ['grey',  'S',  52.99,   'classA']])

df.columns = ['color', 'shirt_size', 'price', 'class_label']
```