In [2]:
import numpy as np
import pandas as pd

When we are talking about categorical data, we have to further distinguish between ordinal and nominal features. Ordinal features can be understood as categorical values that can be sorted or ordered. For example, t-shirt size would be an ordinal feature, because we can define an order: XL > L > M. In contrast, nominal features don’t imply any order; to continue with the previous example, we could think of t-shirt color as a nominal feature since it typically doesn’t make sense to say that, for example, red is larger than blue.
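
As a small illustration of the distinction, pandas can record whether a categorical column is ordered. This is only a sketch, not part of the walkthrough that follows; the variable names are made up for the example, and the order M < L < XL is the one assumed in the text.

```python
import pandas as pd

# Ordinal feature: declare the order explicitly (M < L < XL).
sizes = pd.Series(pd.Categorical(['M', 'L', 'XL'],
                                 categories=['M', 'L', 'XL'],
                                 ordered=True))
larger_than_m = sizes > 'M'   # comparisons are defined for ordered categoricals
size_codes = sizes.cat.codes  # position of each value in the declared order

# Nominal feature: no order is declared, so only equality checks are meaningful.
colors = pd.Series(pd.Categorical(['green', 'red', 'blue']))
is_red = colors == 'red'

print(larger_than_m.tolist(), size_codes.tolist(), is_red.tolist())
```

Trying `colors > 'red'` on the unordered column raises a `TypeError`, which mirrors the point that nominal values cannot be ranked.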

In [36]:
df = pd.DataFrame(
    [['green', 'M', 10.1, 'class2'],
     ['red', 'L', 13.5, 'class1'],
     ['blue', 'XL', 15.3, 'class2']],
    columns=['color', 'size', 'price', 'classlabel'])

In [37]:
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


## Mapping ordinal features

Let’s assume that we know the numerical difference between the sizes, for example, XL = L + 1 = M + 2. We can then define the mapping manually with a dictionary:

In [6]:
size_map = {
            'M' : 0,
            'L': 1,
            'XL': 2
}

In [7]:
df['size'] = df['size'].map(size_map)
df

Unnamed: 0,color,size,price,classlabel
0,green,0,10.1,class2
1,red,1,13.5,class1
2,blue,2,15.3,class2
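
One thing to watch for with a hand-written dictionary: `Series.map` returns `NaN` for any value that is not a key in the mapping. The sketch below uses a hypothetical extra label `'S'` that does not occur in the example data, purely to show the behavior.

```python
import pandas as pd

size_map = {'M': 0, 'L': 1, 'XL': 2}
sizes = pd.Series(['M', 'L', 'XL', 'S'])  # 'S' is a hypothetical label missing from size_map

mapped = sizes.map(size_map)
unmapped = sizes[mapped.isna()]  # labels that were not found in the dictionary

print(mapped.tolist())
print(unmapped.tolist())
```

Checking for missing values after the mapping is a cheap way to catch labels that were forgotten in the dictionary.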


In [8]:
# Invert the mapping to recover the original string labels
inv_size_map = { v: k for k, v in size_map.items()}

df['size'] = df['size'].map(inv_size_map)
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


### Encoding class labels

Many machine learning libraries require that class labels are encoded as integer values. Although most estimators for classification in scikit-learn convert class labels to integers internally, it is considered good practice to provide class labels as integer arrays to avoid technical glitches.

In [9]:
# We can do this with a mapping dictionary, as with the size feature above
class_mapping = {
    'class1': 0,
    'class2': 1
}
df['classlabel'] = df['classlabel'].map(class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,1
1,red,L,13.5,0
2,blue,XL,15.3,1
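
Since class labels are not ordinal, it does not matter which integer goes with which label, so the dictionary can also be built programmatically. The following is a sketch of one way to do that with `np.unique`; the names `labels` and `auto_class_mapping` are illustrative only.

```python
import numpy as np
import pandas as pd

labels = pd.Series(['class2', 'class1', 'class2'])

# np.unique returns the sorted distinct labels; enumerate assigns 0, 1, ...
auto_class_mapping = {label: idx for idx, label in enumerate(np.unique(labels))}
encoded = labels.map(auto_class_mapping)

print(auto_class_mapping)
print(encoded.tolist())
```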


In [11]:
inv_class_mapping = { v: k for k,v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


Alternatively, there is a convenient LabelEncoder class directly implemented in scikit-learn to achieve this:

In [12]:
from sklearn.preprocessing import LabelEncoder
class_le = LabelEncoder()
class_le.fit(df['classlabel'].values)
df['classlabel'] = class_le.transform(df['classlabel'])
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,1
1,red,L,13.5,0
2,blue,XL,15.3,1


In [13]:
df['classlabel'] = class_le.inverse_transform(df['classlabel'])
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


### Performing one-hot encoding on nominal features

Encoding a nominal feature such as color with integers in the same way (for example, blue=0, green=1, red=2) would be a mistake. Although the color values don’t come in any particular order, common classification models would then assume that green is larger than blue, and red is larger than green. Although this assumption is incorrect, a classifier could still produce useful results. However, those results would not be optimal.
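
To see where such an artificial ordering comes from, here is a sketch that integer-encodes the color values with `LabelEncoder`. It is shown only to illustrate the problem and is not a recommended encoding for nominal features.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

colors = np.array(['green', 'red', 'blue'])
color_le = LabelEncoder()
color_codes = color_le.fit_transform(colors)

# classes_ holds the sorted labels; each label's index is its integer code,
# so the alphabetical order silently becomes a numerical order.
print(list(color_le.classes_))
print(color_codes.tolist())
```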

A common workaround for this problem is one-hot encoding. The idea behind this approach is to create a new dummy feature for each unique value in the nominal feature column. Here, we would convert the color feature into three new features: blue, green, and red. Binary values can then be used to indicate the particular color of an example; for example, a blue example can be encoded as blue=1, green=0, red=0. To perform this transformation, we can use the OneHotEncoder that is implemented in scikit-learn’s preprocessing module:

In [14]:
df['color'].values

array(['green', 'red', 'blue'], dtype=object)

In [19]:
from sklearn.preprocessing import OneHotEncoder
color_ohe = OneHotEncoder()
color_ohe.fit_transform(df['color'].values.reshape(-1, 1)).toarray()  # fit_transform combines fit and transform in one call


array([[0., 1., 0.],
       [0., 0., 1.],
       [1., 0., 0.]])
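
The output array has no column labels, so it can help to check which column stands for which color. A minimal sketch, using the encoder's fitted `categories_` attribute on a standalone copy of the color values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array(['green', 'red', 'blue']).reshape(-1, 1)
ohe = OneHotEncoder()
encoded = ohe.fit_transform(colors).toarray()

# categories_ is a list with one array per input column, sorted alphabetically;
# column j of `encoded` corresponds to categories_[0][j].
column_labels = list(ohe.categories_[0])

print(column_labels)
print(encoded)
```

Here the columns are blue, green, and red, in that order, which is why the first row (green) has its 1 in the second position.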

Note that we applied the OneHotEncoder to only a single column, df['color'].values.reshape(-1, 1), to avoid modifying the other columns as well. If we want to selectively transform columns in a multi-feature array, we can use the ColumnTransformer, which accepts a list of (name, transformer, column(s)) tuples as follows:

In [20]:
from sklearn.compose import ColumnTransformer
X = df[['color', 'size', 'price']].values
c_trx = ColumnTransformer([
    ('ohe', OneHotEncoder(), [0]),
    ('nochange', 'passthrough', [1,2])
])
c_trx.fit_transform(X)

array([[0.0, 1.0, 0.0, 'M', 10.1],
       [0.0, 0.0, 1.0, 'L', 13.5],
       [1.0, 0.0, 0.0, 'XL', 15.3]], dtype=object)
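
When the input is a DataFrame rather than a NumPy array, the columns can also be selected by name, and `remainder='passthrough'` keeps every column that is not listed. The sketch below assumes a small demo frame with an already integer-mapped size column; `df_demo` and `ct_by_name` are illustrative names, and `sparse_threshold=0` is set so that the result is always a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df_demo = pd.DataFrame({'color': ['green', 'red', 'blue'],
                        'size': [0, 1, 2],
                        'price': [10.1, 13.5, 15.3]})

ct_by_name = ColumnTransformer(
    [('ohe', OneHotEncoder(), ['color'])],  # select the column by name
    remainder='passthrough',                # keep size and price unchanged
    sparse_threshold=0,                     # always return a dense array
)
result = ct_by_name.fit_transform(df_demo)

print(result)
```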

An even more convenient way to create those dummy features via one-hot encoding is to use the get_dummies function implemented in pandas. Applied to a DataFrame, get_dummies will only convert string columns and leave all other columns unchanged:

In [38]:
df['size'] = df['size'].map(size_map)
df

Unnamed: 0,color,size,price,classlabel
0,green,0,10.1,class2
1,red,1,13.5,class1
2,blue,2,15.3,class2


In [41]:
df_new = pd.get_dummies(df[['color','size', 'price']], dtype=int)

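When we use one-hot encoded datasets, we have to keep in mind that the dummy columns are perfectly correlated with each other (they always sum to 1). This multicollinearity can be an issue for certain methods, for instance, methods that require matrix inversion. To reduce the correlation among the variables, we can simply remove one feature column from the one-hot encoded array. Note that we do not lose any important information by removing a feature column; for example, if we remove the column color_blue, the feature information is still preserved, since if we observe color_green=0 and color_red=0, it implies that the observation must be blue.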

In [43]:
df_new = pd.get_dummies(df[['color','size', 'price']], dtype=int, drop_first=True)
df_new

Unnamed: 0,size,price,color_green,color_red
0,0,10.1,1,0
1,1,13.5,0,1
2,2,15.3,0,0
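
The redundancy of the full set of dummy columns can be checked numerically. The sketch below is not part of the original walkthrough: it uses a hypothetical six-row sample of colors, adds an intercept column of ones, and compares the rank of the design matrix with and without `drop_first=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with each color appearing twice.
colors = pd.DataFrame({'color': ['green', 'red', 'blue', 'red', 'green', 'blue']})
full = pd.get_dummies(colors, dtype=int)                      # 3 dummy columns
reduced = pd.get_dummies(colors, dtype=int, drop_first=True)  # 2 dummy columns

# Every row of the full encoding sums to 1, duplicating an intercept column.
row_sums = full.sum(axis=1)

intercept = np.ones((len(colors), 1))
full_design = np.hstack([intercept, full.to_numpy()])
reduced_design = np.hstack([intercept, reduced.to_numpy()])

rank_full = np.linalg.matrix_rank(full_design)        # fewer than its 4 columns
rank_reduced = np.linalg.matrix_rank(reduced_design)  # equal to its 3 columns

print(full_design.shape, rank_full)
print(reduced_design.shape, rank_reduced)
```

The full design matrix is rank deficient, while dropping one dummy column restores full column rank without losing information.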


To drop a redundant column via the OneHotEncoder, we can set drop='first'. The categories='auto' argument is the default in current scikit-learn releases and is passed explicitly here:

In [44]:
color_ohe = OneHotEncoder(drop='first', categories='auto')
c_trx = ColumnTransformer([('ohe', color_ohe, [0]), ('nochange', 'passthrough', [1,2])])
c_trx.fit_transform(X)

array([[1.0, 0.0, 'M', 10.1],
       [0.0, 1.0, 'L', 13.5],
       [0.0, 0.0, 'XL', 15.3]], dtype=object)

## Encoding ordinal features

If we are unsure about the numerical differences between the categories of ordinal features, or the difference between two ordinal values is not defined, we can also encode them using a threshold encoding with 0/1 values. For example, we can split the feature size with values M, L, and XL into two new features, x > M and x > L. Let’s consider the original DataFrame:

In [46]:
df['size'] = df['size'].map(inv_size_map)
df

Unnamed: 0,color,size,price,classlabel
0,green,M,10.1,class2
1,red,L,13.5,class1
2,blue,XL,15.3,class2


In [48]:
df['x>M'] = df['size'].apply(lambda x: 1 if x in ['L','XL'] else 0)
df['x>L'] = df['size'].apply(lambda x: 1 if x in ['XL'] else 0)

del df['size']
df

Unnamed: 0,color,price,classlabel,x>M,x>L
0,green,10.1,class2,0,0
1,red,13.5,class1,1,0
2,blue,15.3,class2,1,1
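
The two lambda expressions above can be generalized to any ordered list of levels. The helper below is a hypothetical sketch (the name `threshold_encode` is not from pandas or scikit-learn): it creates one 0/1 column per threshold for every level except the highest.

```python
import pandas as pd

def threshold_encode(values, ordered_levels):
    """Return one 0/1 column 'x>level' for every level except the highest."""
    rank = {level: i for i, level in enumerate(ordered_levels)}
    ranks = values.map(rank)  # position of each value in the given order
    return pd.DataFrame({
        f'x>{level}': (ranks > rank[level]).astype(int)
        for level in ordered_levels[:-1]
    })

sizes = pd.Series(['M', 'L', 'XL'])
encoded_sizes = threshold_encode(sizes, ['M', 'L', 'XL'])

print(encoded_sizes)
```

For the three sizes in the example, this reproduces the x>M and x>L columns shown above.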
