Link to Medium blog post: https://towardsdatascience.com/learning-one-hot-encoding-in-python-the-easy-way-665010457ad9

# Learning One-Hot Encoding in Python the Easy Way

## Creating the dataset from scratch

In [1]:
import pandas as pd

# Creating four lists of equal length, one per column
studentID = [1000, 1001, 1002, 1003, 1004, 1005, 1006]
color = ['Red', 'Orange', 'Yellow', 'Green', 'Yellow', 'Purple', 'Blue']
DaysOfTheWeek = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
Attitude = ['Best', 'Decent', 'Better', 'Excellent', 'Excellent', 'Good', 'Best']

Now that we have the lists, let's convert them into a data frame. To do this, we zip the lists together so that each tuple becomes one row, and then pass the result to pd.DataFrame.

In [2]:
# Converting the lists into a data frame and naming the columns at the same time
df = pd.DataFrame(
    list(zip(studentID, color, DaysOfTheWeek, Attitude)),
    columns=['Student ID', 'Favourite Color', 'Favourite Day', 'Attitude'],
)
print(df)

   Student ID Favourite Color Favourite Day   Attitude
0        1000             Red        Monday       Best
1        1001          Orange       Tuesday     Decent
2        1002          Yellow     Wednesday     Better
3        1003           Green      Thursday  Excellent
4        1004          Yellow        Friday  Excellent
5        1005          Purple      Saturday       Good
6        1006            Blue        Sunday       Best
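
As an aside, the zip step is not the only way to build the frame. A minimal sketch, assuming you are happy to name the columns through dictionary keys (the names `student_id` and `df_small` are only illustrative, and the data is shortened to three rows):

```python
import pandas as pd

# Same kind of values as above, shortened to three rows for brevity
student_id = [1000, 1001, 1002]
color = ['Red', 'Orange', 'Yellow']

# Dictionary keys become the column names, so no zip step is needed
df_small = pd.DataFrame({'Student ID': student_id, 'Favourite Color': color})
print(df_small)
```

Both approaches should give the same kind of frame; the dictionary form simply keeps each column name next to its values.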


## Converting the object type data into the categorical type

In some data sets these columns already arrive as categorical data. Here, however, all three text columns shown above are of the object type. If that is the case for you too, you need to convert them to the categorical type manually.

In [3]:
# Converting the object type data into categorical data column
for col in ['Favourite Color','Favourite Day', 'Attitude']:
    df[col] = df[col].astype('category')
print(df.dtypes)

Student ID            int64
Favourite Color    category
Favourite Day      category
Attitude           category
dtype: object
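
To see what the conversion actually stores, it can help to look at the `.cat` accessor of a categorical column. The sketch below rebuilds the Attitude values as a standalone Series (the names `attitude` and `attitude_cat` are illustrative) and prints the unique labels and the integer code pandas keeps for each row:

```python
import pandas as pd

attitude = pd.Series(['Best', 'Decent', 'Better', 'Excellent', 'Excellent', 'Good', 'Best'])
attitude_cat = attitude.astype('category')

# Unique labels; pandas sorts them alphabetically by default
categories = list(attitude_cat.cat.categories)
# Integer code of each row, i.e. its position in `categories`
codes = attitude_cat.cat.codes.tolist()

print(categories)  # ['Best', 'Better', 'Decent', 'Excellent', 'Good']
print(codes)       # [0, 2, 1, 3, 3, 4, 0]
```

Note that the codes follow alphabetical order of the labels, not any ranking of the attitudes.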


## Assigning the binary codes to the categorical values

We will transform only the Favourite Color and Favourite Day columns into binary (one-hot) columns, one per category. Rather than doing this manually, we can use the pandas get_dummies function.

In [4]:
# Assigning the binary values for Favourite Day and Favourite Color columns
df = pd.get_dummies(data=df, columns=['Favourite Color', 'Favourite Day'])
print(df)

   Student ID   Attitude  Favourite Color_Blue  Favourite Color_Green  \
0        1000       Best                 False                  False   
1        1001     Decent                 False                  False   
2        1002     Better                 False                  False   
3        1003  Excellent                 False                   True   
4        1004  Excellent                 False                  False   
5        1005       Good                 False                  False   
6        1006       Best                  True                  False   

   Favourite Color_Orange  Favourite Color_Purple  Favourite Color_Red  \
0                   False                   False                 True   
1                    True                   False                False   
2                   False                   False                False   
3                   False                   False                False   
4                   False                   F

Doing so increases the number of columns in your data set, but many learning algorithms often handle the data better this way, because the categories no longer look like arbitrary numbers with an implied order.
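
If the extra columns are a concern, get_dummies has two optional parameters that may help: `dtype` to get 0/1 integers instead of True/False, and `drop_first` to drop one redundant column per encoded feature. A small sketch on a four-row frame (the names `colors` and `encoded` are illustrative, not part of the walkthrough above):

```python
import pandas as pd

colors = pd.DataFrame({'Favourite Color': ['Red', 'Orange', 'Yellow', 'Green']})

# dtype=int gives 0/1 instead of True/False.
# drop_first=True drops the first category in sorted order ('Green' here),
# which is then implied by a row where all remaining columns are 0.
encoded = pd.get_dummies(colors, columns=['Favourite Color'], dtype=int, drop_first=True)
print(encoded)
```

Whether dropping a column is a good idea depends on the model you plan to train, so treat this as an option rather than a default.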

## Assigning orders to the categorical column called “Attitude”

There are two ways you can do this:

- Manually assigning values using a dictionary.
- Using the LabelEncoder class from sklearn.

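Option 1 becomes impractical when a column has many unique values (say, more than 1000), because you would have to build the mapping by hand or with a loop and make your life complicated. It's 2020: think smart and use the sklearn library instead. Keep in mind that LabelEncoder assigns the integer codes in alphabetical order of the labels, so the numbers identify each category but do not reflect a ranking such as Good < Better < Best.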
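
For a column with only a handful of values, the dictionary approach from the first bullet can be sketched in a few lines. The ranking in `attitude_rank` below is a hypothetical one chosen for illustration; the source does not define an order for these labels, so adjust it to whatever fits your data:

```python
import pandas as pd

attitude = pd.Series(['Best', 'Decent', 'Better', 'Excellent', 'Excellent', 'Good', 'Best'])

# Hypothetical ranking from lowest to highest (an assumption, not from the data)
attitude_rank = {'Decent': 0, 'Good': 1, 'Better': 2, 'Best': 3, 'Excellent': 4}

# map() looks up each label in the dictionary; labels missing from it become NaN
ranked = attitude.map(attitude_rank)
print(ranked.tolist())  # [3, 0, 2, 4, 4, 1, 3]
```

Unlike an automatic encoder, the dictionary lets you decide which label gets which number, at the cost of writing the mapping yourself.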

In [5]:
# Assigning order to the categorical column 
from sklearn.preprocessing import LabelEncoder
# Initializing an object of class LabelEncoder
labelencoder = LabelEncoder() 
df['Attitude'] = labelencoder.fit_transform(df['Attitude'])
print(df)

   Student ID  Attitude  Favourite Color_Blue  Favourite Color_Green  \
0        1000         0                 False                  False   
1        1001         2                 False                  False   
2        1002         1                 False                  False   
3        1003         3                 False                   True   
4        1004         3                 False                  False   
5        1005         4                 False                  False   
6        1006         0                  True                  False   

   Favourite Color_Orange  Favourite Color_Purple  Favourite Color_Red  \
0                   False                   False                 True   
1                    True                   False                False   
2                   False                   False                False   
3                   False                   False                False   
4                   False                   False    