# One Hot Encoding

## A simple demonstration of One Hot Encoding using Pandas and Sklearn.

## Importing the Required Packages

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer

## Reading the Dataset

In [3]:
car = pd.read_csv('car.csv')
car

|   | Car Model | Mileage | Sell Price($) | Age(yrs) |
|---|---|---|---|---|
| 0 | BMW X5 | 69000 | 18000 | 6 |
| 1 | BMW X5 | 35000 | 34000 | 3 |
| 2 | BMW X5 | 57000 | 26100 | 5 |
| 3 | BMW X5 | 22500 | 40000 | 2 |
| 4 | BMW X5 | 46000 | 31500 | 4 |
| 5 | Audi A5 | 59000 | 29400 | 5 |
| 6 | Audi A5 | 52000 | 32000 | 5 |
| 7 | Audi A5 | 72000 | 19300 | 6 |
| 8 | Audi A5 | 91000 | 12000 | 8 |
| 9 | Mercedez Benz C class | 67000 | 22000 | 6 |


## Using Pandas for creating One Hot Vectors

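Using pandas' get_dummies() function, the one hot vector representation of the three categories of the Car Model attribute is computed. Each category becomes its own column, holding 1 in the rows that belong to that category and 0 elsewhere.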

In [4]:
car_dummies = pd.get_dummies(car['Car Model'])
car_dummies

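|   | Audi A5 | BMW X5 | Mercedez Benz C class |
|---|---|---|---|
| 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| 5 | 1 | 0 | 0 |
| 6 | 1 | 0 | 0 |
| 7 | 1 | 0 | 0 |
| 8 | 1 | 0 | 0 |
| 9 | 0 | 0 | 1 |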
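
To make concrete what get_dummies() computes, the following is a minimal sketch that builds the same indicator columns by hand with NumPy and compares them with the pandas result. The names `models`, `categories`, `manual` and `from_pandas` are illustrative, and the four-row sample is a cut-down stand-in for the Car Model column, not the full dataset.

```python
import numpy as np
import pandas as pd

# A few car models in the style of the dataset above (illustrative sample).
models = pd.Series(["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5"])

# Sorted unique categories; get_dummies orders its columns the same way.
categories = np.array(sorted(set(models)))
values = np.array(models.tolist())

# Compare every value against every category:
# one row per car, one column per category, 1 where they match.
one_hot = (values[:, None] == categories[None, :]).astype(int)
manual = pd.DataFrame(one_hot, columns=categories)

# dtype=int keeps the output as 0/1 instead of booleans on newer pandas versions.
from_pandas = pd.get_dummies(models, dtype=int)

print(manual)
print((manual.to_numpy() == from_pandas.to_numpy()).all())
```

Each row contains exactly one 1, which is where the name "one hot" comes from.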


The one hot vectors are merged with the original dataframe.

In [5]:
car_encoded = pd.concat([car, car_dummies], axis='columns')
car_encoded

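|   | Car Model | Mileage | Sell Price($) | Age(yrs) | Audi A5 | BMW X5 | Mercedez Benz C class |
|---|---|---|---|---|---|---|---|
| 0 | BMW X5 | 69000 | 18000 | 6 | 0 | 1 | 0 |
| 1 | BMW X5 | 35000 | 34000 | 3 | 0 | 1 | 0 |
| 2 | BMW X5 | 57000 | 26100 | 5 | 0 | 1 | 0 |
| 3 | BMW X5 | 22500 | 40000 | 2 | 0 | 1 | 0 |
| 4 | BMW X5 | 46000 | 31500 | 4 | 0 | 1 | 0 |
| 5 | Audi A5 | 59000 | 29400 | 5 | 1 | 0 | 0 |
| 6 | Audi A5 | 52000 | 32000 | 5 | 1 | 0 | 0 |
| 7 | Audi A5 | 72000 | 19300 | 6 | 1 | 0 | 0 |
| 8 | Audi A5 | 91000 | 12000 | 8 | 1 | 0 | 0 |
| 9 | Mercedez Benz C class | 67000 | 22000 | 6 | 0 | 0 | 1 |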


Now that the Car Model attribute is represented by its one hot vectors, the original column can be safely dropped. To avoid the Dummy Variable Trap (the three indicator columns always sum to 1, so any one of them is fully determined by the other two), the Audi A5 column is dropped as well.

In [6]:
car_final = car_encoded.drop(['Car Model', 'Audi A5'], axis='columns')
car_final

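|   | Mileage | Sell Price($) | Age(yrs) | BMW X5 | Mercedez Benz C class |
|---|---|---|---|---|---|
| 0 | 69000 | 18000 | 6 | 1 | 0 |
| 1 | 35000 | 34000 | 3 | 1 | 0 |
| 2 | 57000 | 26100 | 5 | 1 | 0 |
| 3 | 22500 | 40000 | 2 | 1 | 0 |
| 4 | 46000 | 31500 | 4 | 1 | 0 |
| 5 | 59000 | 29400 | 5 | 0 | 0 |
| 6 | 52000 | 32000 | 5 | 0 | 0 |
| 7 | 72000 | 19300 | 6 | 0 | 0 |
| 8 | 91000 | 12000 | 8 | 0 | 0 |
| 9 | 67000 | 22000 | 6 | 0 | 1 |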


## Splitting Dataset

The dataset is split into two parts: the independent variables (the features) and the dependent variable (the sell price to be predicted).

In [7]:
# Independent Variables
X = car_final.drop(['Sell Price($)'], axis='columns')

# Dependent Variable (the target to be predicted)
y = car_final['Sell Price($)']

## Initializing Linear Regression Model

In [8]:
linReg = LinearRegression()

## Fitting the model

In [9]:
linReg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## Computing the R² score of the model

In [10]:
linReg.score(X, y)

0.9417050937281083
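
For a regressor such as LinearRegression, score() returns the coefficient of determination R² rather than a classification accuracy, and here it is computed on the same rows the model was fitted on. The sketch below reproduces the value by hand on synthetic data that is invented purely for illustration (it is not the car dataset); `X_demo`, `y_demo` and `r2_manual` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data, made up for this example only.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(30, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(0, 1, size=30)

model = LinearRegression().fit(X_demo, y_demo)
predictions = model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - predictions) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, model.score(X_demo, y_demo)))
```

Because the score is measured on the training rows, it says how well the line fits the data it has seen, not how well it would predict prices for unseen cars.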

## Using Sklearn for creating One Hot Vectors

Using sklearn's LabelEncoder class, the three categories of Car Model are represented as three numeric values:

- Audi A5: 0
- BMW X5: 1
- Mercedez Benz C class: 2

In [12]:
label_encoder = LabelEncoder()
car_label_encoder = car.copy()  # work on a copy so the original dataframe is not modified
car_label_encoder['Car Model'] = label_encoder.fit_transform(car['Car Model'])
car_label_encoder

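|   | Car Model | Mileage | Sell Price($) | Age(yrs) |
|---|---|---|---|---|
| 0 | 1 | 69000 | 18000 | 6 |
| 1 | 1 | 35000 | 34000 | 3 |
| 2 | 1 | 57000 | 26100 | 5 |
| 3 | 1 | 22500 | 40000 | 2 |
| 4 | 1 | 46000 | 31500 | 4 |
| 5 | 0 | 59000 | 29400 | 5 |
| 6 | 0 | 52000 | 32000 | 5 |
| 7 | 0 | 72000 | 19300 | 6 |
| 8 | 0 | 91000 | 12000 | 8 |
| 9 | 2 | 67000 | 22000 | 6 |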
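
The integer codes follow the sorted order of the labels, which the fitted encoder stores in its `classes_` attribute. The short sketch below, on an illustrative four-item list, prints that mapping and uses `inverse_transform` to recover the original strings; the names `models`, `mapping` and `restored` are hypothetical.

```python
from sklearn.preprocessing import LabelEncoder

models = ["BMW X5", "Audi A5", "Mercedez Benz C class", "BMW X5"]

encoder = LabelEncoder()
codes = encoder.fit_transform(models)

# classes_ holds the labels in sorted order; each label's position is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform maps the integer codes back to the original strings.
restored = [str(label) for label in encoder.inverse_transform(codes)]
print(codes.tolist(), restored)
```

These codes are only identifiers. Fed directly into a linear model they would imply an ordering (Audi < BMW < Mercedez) that does not exist, which is why the one hot step follows.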


Calculating the index of the column for which one hot vectors need to be generated.

In [14]:
Car_Model_idx = car_label_encoder.columns.get_loc('Car Model')

Sklearn's ColumnTransformer class is used to encode the categories stored in the column at index Car_Model_idx, with sklearn's OneHotEncoder as the transformer.

In [16]:
transformer = ColumnTransformer(
    transformers=[
        ("OneHot",        # Just a name
         OneHotEncoder(), # The transformer class
         [Car_Model_idx]  # The column(s) to be applied on.
         )
    ],
    remainder='passthrough' # do not apply anything to the remaining columns
)
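
As a variation on the transformer above, the following sketch assumes a reasonably recent scikit-learn, where OneHotEncoder accepts string categories directly and offers a `drop` parameter. Under that assumption the label-encoding step and the manual column drop may not be needed. The frame `demo` and the name `demo_transformer` are illustrative, and `sparse_threshold=0` is set only to make sure a dense NumPy array is returned.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Small illustrative frame in the shape of the car dataset.
demo = pd.DataFrame({
    "Car Model": ["BMW X5", "Audi A5", "Mercedez Benz C class", "Audi A5"],
    "Mileage": [69000, 59000, 67000, 52000],
})

demo_transformer = ColumnTransformer(
    transformers=[
        # drop="first" omits the first category (Audi A5) to avoid the dummy variable trap.
        ("OneHot", OneHotEncoder(drop="first"), ["Car Model"]),
    ],
    remainder="passthrough",  # leave Mileage untouched
    sparse_threshold=0,       # always return a dense array
)

encoded = demo_transformer.fit_transform(demo)
print(encoded)
```

The encoded columns come first (BMW X5, then Mercedez Benz C class), followed by the passed-through Mileage column.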

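The label-encoded dataframe is passed to this transformer to generate a numpy array with the required encodings. The one hot columns come first, followed by the columns that were passed through unchanged.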

In [17]:
car_one_hot_encoded = transformer.fit_transform(car_label_encoder)

In [18]:
car_one_hot_encoded

array([[0.00e+00, 1.00e+00, 0.00e+00, 6.90e+04, 1.80e+04, 6.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 3.50e+04, 3.40e+04, 3.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 5.70e+04, 2.61e+04, 5.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 2.25e+04, 4.00e+04, 2.00e+00],
       [0.00e+00, 1.00e+00, 0.00e+00, 4.60e+04, 3.15e+04, 4.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 5.90e+04, 2.94e+04, 5.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 5.20e+04, 3.20e+04, 5.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 7.20e+04, 1.93e+04, 6.00e+00],
       [1.00e+00, 0.00e+00, 0.00e+00, 9.10e+04, 1.20e+04, 8.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 6.70e+04, 2.20e+04, 6.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 8.30e+04, 2.00e+04, 7.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 7.90e+04, 2.10e+04, 7.00e+00],
       [0.00e+00, 0.00e+00, 1.00e+00, 5.90e+04, 3.30e+04, 5.00e+00]])

The encoded numpy array is loaded into a pandas dataframe for better readability. To avoid the Dummy Variable Trap, the Audi A5 column is dropped.

In [19]:
# Column order matches the transformer output: one hot columns first, then the passthrough columns
column_names = ['Audi A5', 'BMW X5', 'Mercedez Benz C class', 'Mileage', 'Sell Price($)', 'Age(yrs)']
car_final = pd.DataFrame(car_one_hot_encoded, columns=column_names)
car_final = car_final.drop(['Audi A5'], axis='columns')
car_final

|   | BMW X5 | Mercedez Benz C class | Mileage | Sell Price($) | Age(yrs) |
|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 69000.0 | 18000.0 | 6.0 |
| 1 | 1.0 | 0.0 | 35000.0 | 34000.0 | 3.0 |
| 2 | 1.0 | 0.0 | 57000.0 | 26100.0 | 5.0 |
| 3 | 1.0 | 0.0 | 22500.0 | 40000.0 | 2.0 |
| 4 | 1.0 | 0.0 | 46000.0 | 31500.0 | 4.0 |
| 5 | 0.0 | 0.0 | 59000.0 | 29400.0 | 5.0 |
| 6 | 0.0 | 0.0 | 52000.0 | 32000.0 | 5.0 |
| 7 | 0.0 | 0.0 | 72000.0 | 19300.0 | 6.0 |
| 8 | 0.0 | 0.0 | 91000.0 | 12000.0 | 8.0 |
| 9 | 0.0 | 1.0 | 67000.0 | 22000.0 | 6.0 |


## Splitting Dataset

The dataset is split into two parts: the independent variables (the features) and the dependent variable (the sell price to be predicted).

In [20]:
# Independent Variables
X = car_final.drop(['Sell Price($)'], axis='columns')

# Dependent Variables
y = car_final['Sell Price($)']

## Initializing Linear Regression Model

In [21]:
linReg = LinearRegression()

## Fitting the model

In [22]:
linReg.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

## Computing the R² score of the model

In [23]:
linReg.score(X, y)

0.9417050937281083
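
Both routes produce the same score because they encode the same information. The sketch below checks this on the ten rows displayed in the tables above, and wraps the sklearn route in a Pipeline so that encoding and regression are fitted in one step. It assumes a scikit-learn version in which OneHotEncoder accepts strings and supports `drop="first"`; the names `cars`, `features`, `target` and `pipeline` are illustrative. Since only the displayed rows are used, the printed scores need not match the value reported above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# The ten rows displayed in the tables above.
cars = pd.DataFrame({
    "Car Model": ["BMW X5"] * 5 + ["Audi A5"] * 4 + ["Mercedez Benz C class"],
    "Mileage": [69000, 35000, 57000, 22500, 46000, 59000, 52000, 72000, 91000, 67000],
    "Sell Price($)": [18000, 34000, 26100, 40000, 31500, 29400, 32000, 19300, 12000, 22000],
    "Age(yrs)": [6, 3, 5, 2, 4, 5, 5, 6, 8, 6],
})
features = cars.drop(columns=["Sell Price($)"])
target = cars["Sell Price($)"]

# Route 1: pandas dummies, dropping the first category (Audi A5).
X_pandas = pd.get_dummies(features, columns=["Car Model"], drop_first=True, dtype=int)
score_pandas = LinearRegression().fit(X_pandas, target).score(X_pandas, target)

# Route 2: sklearn encoder and regression chained in a single Pipeline.
pipeline = Pipeline([
    ("encode", ColumnTransformer(
        [("OneHot", OneHotEncoder(drop="first"), ["Car Model"])],
        remainder="passthrough",
        sparse_threshold=0,  # always return a dense array
    )),
    ("regress", LinearRegression()),
])
score_sklearn = pipeline.fit(features, target).score(features, target)

print(score_pandas, score_sklearn)
print(np.isclose(score_pandas, score_sklearn))
```

A Pipeline also means that new, raw rows (with the car model still as text) can be passed straight to `pipeline.predict`, since the encoding is applied automatically.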