# One Hot Encoding

## Introduction
One-hot encoding is a way of preprocessing categorical data so that it can be passed to a machine learning algorithm, which often improves predictions. Each categorical column is replaced by a set of new binary columns, one per category, holding 1 where a row belongs to that category and 0 otherwise. These new columns are commonly called dummy variables.
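To make the idea concrete, here is a minimal sketch in plain Python. The helper `one_hot` is a hypothetical name used only for illustration, and the sample region values are taken from the dataset used later in this notebook. It is not how pandas or scikit-learn implement the encoding.

```python
def one_hot(values):
    """Return (categories, rows); each row holds a single 1 marking its category."""
    # Sorting gives a fixed, reproducible column order.
    categories = sorted(set(values))
    rows = [
        [1 if value == category else 0 for category in categories]
        for value in values
    ]
    return categories, rows


regions = ["southwest", "southeast", "southeast", "northwest"]
categories, encoded = one_hot(regions)

print(categories)
for region, row in zip(regions, encoded):
    print(region, row)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.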

### What are categorical variables?
Categorical variables hold labels rather than numeric values, and they take a limited number of possible values.
Some examples of categorical variables are "sex", "gender", "race", etc.

### Why categorical values need handling
Many machine learning algorithms require input and output variables to be numeric. Categorical values are non-numeric, so we must encode them as numbers before we can use them to fit and evaluate a model.

One such method for handling categorical variables is one-hot encoding.

Here we will discuss two approaches.

In [None]:
# Import the required libraries

import pandas as pd
from sklearn import preprocessing

In [None]:
# Load the dataset

df = pd.read_csv("/content/insurance.csv")
df.head()

,age,sex,bmi,children,smoker,region,charges
0,19,female,27.9,0,yes,southwest,16884.924
1,18,male,33.77,1,no,southeast,1725.5523
2,28,male,33.0,3,no,southeast,4449.462
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.88,0,no,northwest,3866.8552


### Understanding the dataset

In [None]:
# Inspect the column dtypes and non-null counts

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB


In [None]:
# Check whether any column has null values

df.isnull().sum()

age         0
sex         0
bmi         0
children    0
smoker      0
region      0
charges     0
dtype: int64

In [None]:
# Create a new DataFrame consisting of only the categorical variables

df2 = df.select_dtypes(include=[object])
df2.head()

,sex,smoker,region
0,female,yes,southwest
1,male,no,southeast
2,male,no,southeast
3,male,no,northwest
4,male,no,northwest


In [None]:
# Unique values of each categorical variable

print(df2['sex'].unique())
print(df2['smoker'].unique())
print(df2['region'].unique())

['female' 'male']
['yes' 'no']
['southwest' 'southeast' 'northwest' 'northeast']


In [None]:
# Count of each value of each categorical variable
print(df2['sex'].value_counts())
print(df2['smoker'].value_counts())
print(df2['region'].value_counts())


male      676
female    662
Name: sex, dtype: int64
no     1064
yes     274
Name: smoker, dtype: int64
southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64


### Approach 1: Using dummy variables

In [None]:
# In this approach we use the pandas get_dummies() function

hot_enc = pd.get_dummies(df2, columns = ['region', 'sex','smoker'])
hot_enc

,region_northeast,region_northwest,region_southeast,region_southwest,sex_female,sex_male,smoker_no,smoker_yes
0,0,0,0,1,1,0,0,1
1,0,0,1,0,0,1,1,0
2,0,0,1,0,0,1,1,0
3,0,1,0,0,0,1,1,0
4,0,1,0,0,0,1,1,0
...,...,...,...,...,...,...,...,...
1333,0,1,0,0,0,1,1,0
1334,1,0,0,0,1,0,1,0
1335,0,0,1,0,1,0,1,0
1336,0,0,0,1,1,0,1,0
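The encoded columns usually need to be recombined with the numeric columns before a model can be fitted. The sketch below shows one way to do that, assuming a small stand-in DataFrame named `sample` built from a few of the rows shown by `df.head()`. The names `sample`, `categorical_cols`, and `model_ready` are illustrative and not part of the original notebook. Passing `dtype=int` is a choice made here because recent pandas versions return boolean dummy columns by default.

```python
import pandas as pd

# Small stand-in for the insurance data (values copied from the head shown above).
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
    "charges": [16884.924, 1725.5523, 4449.462],
})

categorical_cols = ["region", "sex", "smoker"]

# dtype=int keeps the 0/1 representation regardless of the pandas version.
dummies = pd.get_dummies(sample[categorical_cols], dtype=int)

# Replace the original text columns with their encoded counterparts.
model_ready = pd.concat([sample.drop(columns=categorical_cols), dummies], axis=1)

print(model_ready.columns.tolist())
```

Only the categories present in the data get a column, so this three-row sample produces two region columns rather than the four seen in the full dataset.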


### Approach 2: Using the scikit-learn library

In [None]:
# In this approach we use the OneHotEncoder class from scikit-learn

hot_enc_2 = preprocessing.OneHotEncoder()
hot_enc_2.fit(df2)
onehotlabels = hot_enc_2.transform(df2).toarray()
onehotlabels

array([[1., 0., 0., ..., 0., 0., 1.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [1., 0., 1., ..., 0., 1., 0.],
       [1., 0., 1., ..., 0., 0., 1.],
       [1., 0., 0., ..., 1., 0., 0.]])

In [None]:
onehotlabels.shape

(1338, 8)