# Dealing With Categorical Data

* So far, we have only dealt with numerical data. Often, though, features are categorical: they are qualitative rather than quantitative.
* There are two kinds of categorical features: ordinal and nominal.
* Ordinal features can be sorted or ordered, e.g. T-shirt sizes.
* Nominal features have no meaningful order, e.g. T-shirt color.
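
To make the distinction concrete, here is a minimal sketch using pandas' `Categorical` type. The values are made up for illustration, and using `Categorical` at all is an assumption of this sketch rather than something the rest of this section relies on.

```python
import pandas as pd

# Ordinal: declare an explicit order so sorting and comparisons are meaningful.
sizes = pd.Series(pd.Categorical(['M', 'XL', 'L'],
                                 categories=['M', 'L', 'XL'],
                                 ordered=True))
sorted_sizes = sizes.sort_values().tolist()   # ordered by category, not alphabetically
larger_than_m = (sizes > 'M').tolist()        # element-wise comparison against 'M'

# Nominal: no order is declared, so only equality checks make sense.
colors = pd.Series(pd.Categorical(['green', 'red', 'blue']))
is_red = (colors == 'red').tolist()
```

Asking whether one color is "greater than" another raises an error on the unordered categorical, which is exactly the property that separates nominal from ordinal features.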


In [46]:
#example dataset with categorical data
import pandas as pd

df = pd.DataFrame([['green', 'M', 10.1, 'class1'],
                   ['red', 'L', 13.5, 'class2'],
                   ['blue', 'XL', 15.3, 'class1']])

df.columns = ['color', 'size', 'price', 'classlabel']
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class1
1,red,L,13.5,class2
2,blue,XL,15.3,class1


* Most learning algorithms in sklearn expect numerical input and can't work with string categories directly, so these features have to be encoded as numbers first.
* Likewise, we'll have to encode the text class labels numerically.

## Mapping Ordinal Features
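Because ordinal values have an order, we map them to integers that preserve it. The mapping is a hand-written dictionary, since the order (M < L < XL) can't be inferred from the strings themselves.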

In [47]:
size_mapping = {'XL': 3,
                'L': 2,
                'M': 1}

df['size'] = df['size'].map(size_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,1,10.1,class1
1,red,2,13.5,class2
2,blue,3,15.3,class1
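
One thing to watch for with dictionary mapping: any value missing from the dictionary silently becomes `NaN`. The sketch below shows one way to check for that; the size `'S'` is a hypothetical value that does not appear in the example dataset.

```python
import pandas as pd

size_mapping = {'XL': 3, 'L': 2, 'M': 1}

# 'S' is a hypothetical size that is missing from the mapping.
sizes = pd.Series(['M', 'S', 'XL'])
encoded = sizes.map(size_mapping)

# Unmapped values become NaN, so look for them before training a model.
unmapped = sizes[encoded.isna()].tolist()
```

If `unmapped` is not empty, the mapping dictionary probably needs to be extended.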


Likewise, we can reverse the mapping to recover the original labels:

In [48]:
inv_size_mapping = {v: k for k, v in size_mapping.items()}
df['size'].map(inv_size_mapping)

0     M
1     L
2    XL
Name: size, dtype: object

## Mapping Nominal Features

We could similarly map nominal features to integers, but that would be wrong: a learning algorithm would then treat something like blue = 0 < green = 1 < red = 2 as a real ordering, which colors don't have.

Instead, we should use one-hot encoding. Implemented below, one-hot encoding takes a nominal feature and expands it into several binary features, one per distinct value. This way, rather than having a single 'color' feature with string values, you have 'color_blue', 'color_red', etc., each holding either 0 or 1.

In [49]:
# the easy way with pandas
# dtype=int gives 0/1 columns; pandas >= 2.0 would otherwise return True/False
df = pd.get_dummies(df, columns=['color'], dtype=int)
df
# you can also do it with sklearn, but the result is the same with way more work...

Unnamed: 0,size,price,classlabel,color_blue,color_green,color_red
0,1,10.1,class1,0,1,0
1,2,13.5,class2,0,0,1
2,3,15.3,class1,1,0,0
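
For comparison, here is a rough sketch of the sklearn route using `OneHotEncoder`. It rebuilds a small `color` column rather than reusing `df`, and the `color_` column prefix is added by hand to mimic the pandas output; both are choices made for this sketch.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

colors = pd.DataFrame({'color': ['green', 'red', 'blue']})

encoder = OneHotEncoder()                  # returns a SciPy sparse matrix by default
onehot = encoder.fit_transform(colors[['color']])  # input must be 2-D, hence [['color']]
dense = onehot.toarray()

# Categories are sorted alphabetically, which fixes the column order.
columns = ['color_' + c for c in encoder.categories_[0]]
onehot_df = pd.DataFrame(dense, columns=columns, index=colors.index).astype(int)
```

The extra work can pay off when the same encoding has to be reapplied to new data, since the fitted `encoder` remembers the categories it saw.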


## Encoding Class Labels

Many ML algorithms require class labels to be encoded as integers. Since class labels are not ordinal, it doesn't matter which integer each label gets, so we can simply enumerate the unique labels starting from 0.

In [50]:
import numpy as np
class_mapping = {label: idx for idx, label in
                 enumerate(np.unique(df['classlabel']))}
class_mapping

{'class1': 0, 'class2': 1}

In [51]:
df['classlabel'] = df['classlabel'].map(class_mapping)
df

Unnamed: 0,size,price,classlabel,color_blue,color_green,color_red
0,1,10.1,0,0,1,0
1,2,13.5,1,0,0,1
2,3,15.3,0,1,0,0
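
sklearn also ships a `LabelEncoder` that does the same job and can undo it. A minimal sketch, using a standalone list of labels that mirrors the `classlabel` column rather than the DataFrame itself:

```python
from sklearn.preprocessing import LabelEncoder

labels = ['class1', 'class2', 'class1']

le = LabelEncoder()
y = le.fit_transform(labels)          # classes are sorted, then numbered from 0
original = le.inverse_transform(y)    # recover the string labels
```

Here `le.classes_` holds the sorted unique labels, so the position of each label in that array is its integer code.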
