# One Hot Encoding

An investigation into one hot encoding in machine learning, using the iris dataset with a new, randomly generated categorical column. Many machine learning algorithms cannot handle categorical variables directly, so these columns should be transformed to numeric form during pre-processing, before modelling. One hot encoding is one method for doing so.

In [1]:
import numpy as np
import pandas as pd
import random

from sklearn.datasets import load_iris

In [2]:
# Load Data
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['target'] = pd.Series(iris.target)

# Map target from numeric back to categorical
map_dict = {0:'setosa',
            1:'versicolor',
            2:'virginica'
            }
df['species'] = df["target"].map(map_dict)
df.drop('target', axis=1, inplace=True)

# Add new categorical variable
colour = ['blue','red','yellow','green']
np.random.seed(42)
df['colour'] = np.random.choice(colour, size=len(df))

df.head(10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species,colour
0,5.1,3.5,1.4,0.2,setosa,yellow
1,4.9,3.0,1.4,0.2,setosa,green
2,4.7,3.2,1.3,0.2,setosa,blue
3,4.6,3.1,1.5,0.2,setosa,yellow
4,5.0,3.6,1.4,0.2,setosa,yellow
5,5.4,3.9,1.7,0.4,setosa,green
6,4.6,3.4,1.4,0.3,setosa,blue
7,5.0,3.4,1.5,0.2,setosa,blue
8,4.4,2.9,1.4,0.2,setosa,yellow
9,4.9,3.1,1.5,0.1,setosa,red


In [3]:
df['colour'].value_counts()

green     46
red       36
yellow    34
blue      34
Name: colour, dtype: int64

Categorical variables in this data include:
* `species` with 3 levels (`setosa`, `versicolor` and `virginica`) equally distributed
* `colour` with 4 levels (`blue`, `green`, `red` and `yellow`), with `green` as the mode

We can one hot encode using `pandas.get_dummies()`:

In [4]:
cat_cols = ['species','colour']
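# dtype=int keeps the 0/1 output shown below (pandas >= 2.0 returns bool columns by default)
df = pd.get_dummies(df, columns=cat_cols, prefix_sep='_', dtype=int)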
df.head(10)

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species_setosa,species_versicolor,species_virginica,colour_blue,colour_green,colour_red,colour_yellow
0,5.1,3.5,1.4,0.2,1,0,0,0,0,0,1
1,4.9,3.0,1.4,0.2,1,0,0,0,1,0,0
2,4.7,3.2,1.3,0.2,1,0,0,1,0,0,0
3,4.6,3.1,1.5,0.2,1,0,0,0,0,0,1
4,5.0,3.6,1.4,0.2,1,0,0,0,0,0,1
5,5.4,3.9,1.7,0.4,1,0,0,0,1,0,0
6,4.6,3.4,1.4,0.3,1,0,0,1,0,0,0
7,5.0,3.4,1.5,0.2,1,0,0,1,0,0,0
8,4.4,2.9,1.4,0.2,1,0,0,0,0,0,1
9,4.9,3.1,1.5,0.1,1,0,0,0,0,1,0


Each one hot encoded column acts as an indicator for one level of the category. Some characteristics of one hot encoded columns:
* Binary
* Each category has exactly one level assigned a `1` (the levels of a categorical variable are mutually exclusive by definition). The rest are `0`s.
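
These two properties can be checked directly. The sketch below is one possible way to do so, not part of the original notebook: the helper name `one_hot_violations` and the toy frame are illustrative assumptions. It flags rows where the indicator columns of one categorical variable are not strictly 0/1 or do not sum to exactly 1.

```python
import pandas as pd


def one_hot_violations(data, prefix, separator="_"):
    """Return a boolean Series that is True for rows whose indicator
    columns for `prefix` are not a valid one hot encoding.

    Hypothetical helper, for illustration only.
    """
    # Select the indicator columns belonging to this categorical variable
    cols = [c for c in data.columns if c.startswith(prefix + separator)]
    block = data[cols].astype(float)

    # Property 1: every value is binary (0 or 1)
    is_binary = block.isin([0, 1]).all(axis=1)
    # Property 2: exactly one level is switched on
    sums_to_one = block.sum(axis=1) == 1

    return ~(is_binary & sums_to_one)


# Toy data (made up): row 0 is valid, rows 1-3 each break a property
toy = pd.DataFrame({
    "colour_blue":  [1, 0, 1, 0.5],
    "colour_green": [0, 0, 1, 0.5],
})
bad_rows = one_hot_violations(toy, "colour")
print(bad_rows.tolist())
```

In the toy frame, only the first row satisfies both properties; the others are all-zero, doubly assigned, or fractional.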

## Validating and Correcting One Hot Encoded Columns

After processing the data, there may be instances where the above characteristics no longer hold, e.g. after SMOTE oversampling (note: `SMOTE-NC` should handle categorical variables without issue). The question then becomes how to validate and correct these one hot encoded columns. To start, let's mess up our data to see an example:

In [5]:
# Manually mess up one hot encoded columns - TESTING only
df.loc[0,'species_setosa'] = 0
df.loc[0,'colour_yellow'] = 0
df.loc[1,'species_versicolor'] = 1
df.loc[1,'colour_red'] = 1
df.loc[2,'species_virginica'] = 1
df.loc[2,'colour_yellow'] = 1
df.loc[3,'species_versicolor'] = 1
df.loc[3,'species_virginica'] = 1
df.loc[3,'colour_green'] = 1
df.loc[3,'colour_red'] = 1

df.head()

Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species_setosa,species_versicolor,species_virginica,colour_blue,colour_green,colour_red,colour_yellow
0,5.1,3.5,1.4,0.2,0,0,0,0,0,0,0
1,4.9,3.0,1.4,0.2,1,1,0,0,1,1,0
2,4.7,3.2,1.3,0.2,1,0,1,1,0,0,1
3,4.6,3.1,1.5,0.2,1,1,1,0,1,1,1
4,5.0,3.6,1.4,0.2,1,0,0,0,0,0,1


Here the first 4 rows have been incorrectly encoded. Suppose there is no way to identify the original value of the categorical variable. Here are some logical ways to treat the columns:
* If a category has `0`s for all levels (e.g. row 0), assign the most frequent level (the mode) as its value. If there are multiple modes, randomly assign one of them.
* If a category has `1`s assigned to more than one of its levels (e.g. rows 1-3), randomly select one of these levels as its value.

The function `one_hot_fix` below implements the above rules to correct the columns. Note the `separator` argument should be the same as `prefix_sep` in `get_dummies()`.

In [6]:
def one_hot_fix(data, separator='_'):
    """ 
        Validates and corrects one hot encoding in a specified dataframe.
        Note the algorithm assumes the separator string does not appear in numeric column names.
        As such, making the separator string unique is recommended.

        INPUT
            data: A pandas dataframe with numeric and one hot encoded categorical columns
            separator: A string that separates categorical variable names from their levels
        OUTPUT
            Dataframe data with adjusted one hot encoded columns (data is also modified in place)
    """
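    # Find the unique categorical variable names (the text before the separator),
    # keeping their original order. A plain substring test avoids treating the
    # separator as a regular expression.
    cat_cols = list(dict.fromkeys(col.partition(separator)[0] for col in data.columns if separator in col))
    
    for col in cat_cols:
        print(f'Validating categorical variable {col}...')
        # Select the indicator columns for this variable by exact prefix
        level_cols = [c for c in data.columns if c.startswith(col + separator)]
        grouped_df = data[level_cols]
        grouped_df = grouped_df.assign(ohe_ind = grouped_df.sum(axis=1))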
        
        # Validate and Fix
        data_to_fix = grouped_df.query('ohe_ind != 1 and ohe_ind != 0')
        col_list = data_to_fix.columns[:-1]
        rows_to_fix = data_to_fix.index
        for row in rows_to_fix:
            levels_to_fix = []
            # Identify which columns to fix
            for level in col_list:
                if data_to_fix.loc[row, level] != 0:
                    levels_to_fix.append(level)
            chosen_level = random.choice(levels_to_fix)
            data.loc[row, levels_to_fix] = 0
            data.loc[row, chosen_level] = 1
        
        zeroes_to_fix = grouped_df.query('ohe_ind == 0')
        zero_rows_to_fix = zeroes_to_fix.index
        
        # Find most popular column and populate. If multiple, randomly choose
        grouped_df.drop(columns='ohe_ind', inplace=True)
        max_value = np.max(grouped_df.sum())
        most_pop_col = [level for level in grouped_df if grouped_df[level].sum() == max_value]
        
        if len(most_pop_col) == 1:
            data.loc[zero_rows_to_fix, most_pop_col] = 1
        else:
            for row in zero_rows_to_fix:
                chosen_level = random.choice(most_pop_col)
                data.loc[row, chosen_level] = 1
    print('Validation complete.')            
    return data

In [7]:
one_hot_fix(df, separator='_')
df.head()

Validating categorical variable species...
Validating categorical variable colour...
Validation complete.


Unnamed: 0,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),species_setosa,species_versicolor,species_virginica,colour_blue,colour_green,colour_red,colour_yellow
0,5.1,3.5,1.4,0.2,0,0,1,0,1,0,0
1,4.9,3.0,1.4,0.2,1,0,0,0,1,0,0
2,4.7,3.2,1.3,0.2,0,0,1,0,0,0,1
3,4.6,3.1,1.5,0.2,0,1,0,0,1,0,0
4,5.0,3.6,1.4,0.2,1,0,0,0,0,0,1
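
Because `one_hot_fix` relies on the unseeded `random` module, repeated runs may repair rows 0-3 differently. As a rough sketch of a reproducible variant, the code below applies the same two rules to the indicator columns of a single categorical variable, drawing from a seeded `numpy.random.Generator`. The helper name `fix_group`, the toy data and the seed are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def fix_group(block, rng):
    """Apply the two repair rules to the indicator columns of ONE
    categorical variable and return a corrected copy.

    Hypothetical helper; assumes `block` holds non-negative indicators.
    """
    values = block.to_numpy(dtype=float)
    active = values != 0                  # levels currently switched on
    n_active = active.sum(axis=1)

    # Column totals on the data as received, used to find the mode level(s)
    totals = values.sum(axis=0)
    modes = np.flatnonzero(totals == totals.max())

    chosen = np.empty(len(block), dtype=int)
    for i in range(len(block)):
        if n_active[i] == 0:
            # Rule 1: no level set, so use the mode (random pick on ties)
            chosen[i] = rng.choice(modes)
        else:
            # Rule 2: pick one of the active levels at random
            # (a valid row has a single candidate, so it is unchanged)
            chosen[i] = rng.choice(np.flatnonzero(active[i]))

    # Rebuild the block with exactly one 1 per row
    fixed = np.zeros(values.shape, dtype=int)
    fixed[np.arange(len(block)), chosen] = 1
    return pd.DataFrame(fixed, index=block.index, columns=block.columns)


# Toy data (made up): row 0 is all zeros, rows 1 and 3 have two 1s, row 2 is valid
toy = pd.DataFrame({
    "colour_blue":  [0, 1, 1, 0],
    "colour_green": [0, 1, 0, 1],
    "colour_red":   [0, 0, 0, 1],
})
rng = np.random.default_rng(42)  # arbitrary seed, for repeatable draws
fixed = fix_group(toy, rng)
print(fixed)
```

Unlike `one_hot_fix`, this sketch returns a new frame rather than modifying its input, which makes it easier to compare the data before and after the repair.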


## Findings

Summary:
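* `green` was the most frequent colour, while `versicolor` and `virginica` tied as the most frequent species (the counts are taken on the corrupted data), so one of the two was picked at random. This leads to the output in row 0
* Rows 1-3 were each assigned one randomly selected level from those previously encoded as `1`
* Note the `ohe_ind` variable within the function checks that the values across the levels of each category sum to 1, i.e. that the category is correctly encoded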

Assumptions & Limitations of algorithm:
* The treatment rules for incorrectly encoded rows and columns are hard-coded in the function.
* The function requires a separator string that uniquely identifies columns derived from a categorical variable. Numeric column names should not contain this string.
* One hot encoded values are assumed to be non-negative
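
One further case worth keeping in mind is fractional values that still sum to 1. Plain SMOTE synthesises samples by linear interpolation between a row and one of its neighbours, so two valid one hot encoded rows can produce such values. The sketch below illustrates this with made-up numbers; the variable names and the interpolation weight are assumptions for illustration only.

```python
import numpy as np

# Two valid one hot encoded rows for a 3-level category
row_a = np.array([1.0, 0.0, 0.0])
row_b = np.array([0.0, 1.0, 0.0])

# Linear interpolation between the two rows (weight chosen arbitrarily)
weight = 0.25
synthetic = row_a + weight * (row_b - row_a)

# The row total is unchanged by interpolation...
row_sum = synthetic.sum()
# ...but the values are no longer restricted to 0 and 1
is_binary = bool(np.isin(synthetic, [0, 1]).all())

print(synthetic, row_sum, is_binary)
```

A row like this sums to 1 and would therefore pass a check based only on `ohe_ind`, so it may be worth testing the binary property separately before calling `one_hot_fix`.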