Categorical variables need preprocessing before they can be used in machine learning models. 

Categorical variables take one of a limited number of possible values. For example, a dataset column named 'colour' could store entries such as 'red', 'blue' and 'purple', and a column named 'size' could hold the values 'small', 'medium' and 'large'. 

Although the data in both examples is categorical, it may or may not be ordered: 'colour' has no natural order, whereas 'size' does. 

- Data which is not ordered can be described as nominal features. 
- Ordered data is referred to as ordinal features. 
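
The distinction can be made explicit in pandas. The sketch below is a minimal illustration on hypothetical toy data (the frame `df` and the list `size_order` are assumptions made for this example, not part of the penguins dataset used later): a nominal column is stored as an unordered categorical, while an ordinal column is given an explicit order.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the examples in the text.
df = pd.DataFrame({
    "colour": ["red", "blue", "purple", "blue"],
    "size": ["small", "large", "medium", "small"],
})

# Nominal feature: the categories have no inherent order.
df["colour"] = pd.Categorical(df["colour"])

# Ordinal feature: state the order explicitly, smallest to largest.
size_order = ["small", "medium", "large"]
df["size"] = pd.Categorical(df["size"], categories=size_order, ordered=True)

# The integer codes of the ordinal column follow the declared order.
size_codes = df["size"].cat.codes.tolist()
print(df["colour"].cat.ordered, df["size"].cat.ordered)
print(size_codes)
```

Declaring the order up front means the integer codes of 'size' carry real meaning, which is not the case for 'colour'.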

Categorical data can be encoded in several ways; three of them are discussed in this notebook.

## Method 1- replace

- Useful for binary replacement, e.g. transforming pass or fail entries to 0 and 1.

In [1]:
#%load imports.py
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
penguins=sns.load_dataset('penguins').dropna()
penguins.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE


In [3]:
Replace_example=penguins.copy()
Replace_example=Replace_example.replace({'sex':{'MALE':0, 'FEMALE':1}})
Replace_example.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,1
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,1
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,1
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,0


## Method 2- Label Encoding
- Used to convert each value in a column to a number. With pandas, the column must first be cast to the category dtype. However, the resulting integer codes imply an order that may not exist in the data.

In [4]:
LE_example=penguins.copy()
LE_example['species']=LE_example['species'].astype('category')
LE_example['species_cat']=LE_example['species'].cat.codes

In [5]:
LE_example.head()

Unnamed: 0,species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,species_cat
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE,0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE,0
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE,0
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE,0
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE,0


Alternatively, the same encoding can be performed in scikit-learn using LabelEncoder:

In [6]:
from sklearn.preprocessing import LabelEncoder
LE2_example=penguins.copy()

LE = LabelEncoder()
LE2_example['species encoded']= LE.fit_transform(LE2_example['species'])
LE2_example[['species encoded','species']].head()

Unnamed: 0,species encoded,species
0,0,Adelie
1,0,Adelie
2,0,Adelie
4,0,Adelie
5,0,Adelie
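
Since the first rows are all Adelie, the output above only ever shows the code 0. As a small sketch on a hypothetical sample (the series `species` below is made up for illustration), the fitted encoder's `classes_` attribute shows which label each code stands for, and `inverse_transform` recovers the original labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample containing all three species.
species = pd.Series(["Gentoo", "Adelie", "Chinstrap", "Adelie"])

encoder = LabelEncoder()
codes = encoder.fit_transform(species)

# classes_ holds the labels in sorted order; a label's position is its code.
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original labels.
decoded = encoder.inverse_transform(codes)
print(list(decoded))
```

The codes are assigned alphabetically, so they say nothing about any real ordering between species.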


## Method 3- One Hot Encoding
Label encoding can produce values which are misinterpreted by an algorithm. E.g. a value of 1 may be treated as 'less than' a value of 4, even though the categories have no such order.

Instead, one hot encoding creates one new column for each category in the original column. 

get_dummies is used to create dummy variables of 1 or 0. It returns the full dataframe, so any remaining object columns still need to be encoded or filtered out.


In [7]:
OHE_example=penguins.copy()
OHE_example=pd.get_dummies(OHE_example, columns=['island'], prefix=['island_cat'])

In [8]:
OHE_example.head()

Unnamed: 0,species,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,island_cat_Biscoe,island_cat_Dream,island_cat_Torgersen
0,Adelie,39.1,18.7,181.0,3750.0,MALE,0,0,1
1,Adelie,39.5,17.4,186.0,3800.0,FEMALE,0,0,1
2,Adelie,40.3,18.0,195.0,3250.0,FEMALE,0,0,1
4,Adelie,36.7,19.3,193.0,3450.0,FEMALE,0,0,1
5,Adelie,39.3,20.6,190.0,3650.0,MALE,0,0,1


This can also be done using scikit-learn's LabelBinarizer and OneHotEncoder:

In [9]:
from sklearn.preprocessing import LabelBinarizer
LB=LabelBinarizer()
LB_example=penguins.copy()
lb_results=LB.fit_transform(LB_example['species'])
lb_results_df=pd.DataFrame(lb_results, columns=LB.classes_)
print(lb_results_df.head())

   Adelie  Chinstrap  Gentoo
0       1          0       0
1       1          0       0
2       1          0       0
3       1          0       0
4       1          0       0


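Older versions of scikit-learn's OneHotEncoder (before 0.20) only accepted numerical categorical values, so any string values had to be label encoded before being one hot encoded. 
Current versions accept string columns directly; the example below still uses the label encoded column from Method 2.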


In [10]:
from sklearn.preprocessing import OneHotEncoder

OHE_SK=OneHotEncoder(handle_unknown='ignore')

OHE_SK_results=pd.DataFrame(OHE_SK.fit_transform(LE2_example[['species encoded']]).toarray())

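# dropna() left gaps in the index of LE2_example, so reuse that index before joining;
# otherwise join() aligns the encoded rows by label and pairs them with the wrong penguins
df=LE2_example.join(OHE_SK_results.set_index(LE2_example.index))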
print(df.head())

print(OHE_SK_results.head())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
0  Adelie  Torgersen            39.1           18.7              181.0   
1  Adelie  Torgersen            39.5           17.4              186.0   
2  Adelie  Torgersen            40.3           18.0              195.0   
4  Adelie  Torgersen            36.7           19.3              193.0   
5  Adelie  Torgersen            39.3           20.6              190.0   

   body_mass_g     sex  species encoded    0    1    2  
0       3750.0    MALE                0  1.0  0.0  0.0  
1       3800.0  FEMALE                0  1.0  0.0  0.0  
2       3250.0  FEMALE                0  1.0  0.0  0.0  
4       3450.0  FEMALE                0  1.0  0.0  0.0  
5       3650.0    MALE                0  1.0  0.0  0.0  
     0    1    2
0  1.0  0.0  0.0
1  1.0  0.0  0.0
2  1.0  0.0  0.0
3  1.0  0.0  0.0
4  1.0  0.0  0.0
