# <center>One-Hot Encoding</center>
One-hot encoding converts categorical variables into a numeric format that machine learning algorithms can use directly. The basic idea is to create one new variable per category, each taking the value 0 or 1 to indicate whether a row belongs to that category. Let's walk through an example.

In [1]:
# importing package
import pandas as pd

### Create the Data

In [2]:
# create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})
df

| | team | points |
|---|---|---|
| 0 | A | 25 |
| 1 | A | 12 |
| 2 | B | 15 |
| 3 | B | 14 |
| 4 | B | 19 |
| 5 | B | 23 |
| 6 | C | 25 |
| 7 | C | 29 |


### Perform One-Hot Encoding

#### Option I

In [3]:
# create categorical variables list
catvars = ['team']
catvars

['team']

In [4]:
# create numeric variables list
numvars = ['points']
numvars

['points']

With only two features, these steps may seem redundant. In real-world datasets with many columns, however, keeping separate lists of categorical and numeric variables usually pays off.
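In a wider dataset the two lists need not be typed by hand. The following is a minimal sketch, assuming every non-numeric column should be treated as categorical (an assumption that will not hold for free-text or identifier columns, for example):

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})

# numeric columns are passed through unchanged
numvars = df.select_dtypes(include='number').columns.tolist()

# assumption: everything that is not numeric is categorical
catvars = df.select_dtypes(exclude='number').columns.tolist()

print(catvars)
print(numvars)
```

For this small DataFrame the result matches the hand-written lists; on real data it is worth reviewing the detected columns before encoding them.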

In [5]:
# create dummy variables (dtype=int gives 0/1; recent pandas versions return True/False by default)
dummyvars = pd.get_dummies(df[catvars], dtype=int)
dummyvars

| | team_A | team_B | team_C |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 1 | 0 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 1 | 0 |
| 4 | 0 | 1 | 0 |
| 5 | 0 | 1 | 0 |
| 6 | 0 | 0 | 1 |
| 7 | 0 | 0 | 1 |


In [6]:
# create final DataFrame
df_final = pd.concat([df[numvars], dummyvars], axis = 1)
df_final

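| | points | team_A | team_B | team_C |
|---|---|---|---|---|
| 0 | 25 | 1 | 0 | 0 |
| 1 | 12 | 1 | 0 | 0 |
| 2 | 15 | 0 | 1 | 0 |
| 3 | 14 | 0 | 1 | 0 |
| 4 | 19 | 0 | 1 | 0 |
| 5 | 23 | 0 | 1 | 0 |
| 6 | 25 | 0 | 0 | 1 |
| 7 | 29 | 0 | 0 | 1 |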
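The split-encode-concatenate steps of Option I can also be collapsed into a single call. This is a sketch rather than a replacement for the steps shown: it relies on the `columns` parameter of `pd.get_dummies`, which encodes only the listed columns and leaves the rest in place. The name `df_onestep` is introduced here for illustration, and `dtype=int` is passed so the indicators are 0/1 integers rather than booleans.

```python
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})

# encode only 'team'; 'points' is kept and the dummy columns are appended after it
df_onestep = pd.get_dummies(df, columns=['team'], dtype=int)

print(df_onestep.columns.tolist())
```

Keeping explicit `catvars` and `numvars` lists, as above, is still convenient when the same column groups are reused elsewhere in a pipeline.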


#### Option II
- Import the `OneHotEncoder` class from scikit-learn (`sklearn.preprocessing`) and use it to one-hot encode the `team` column of the pandas DataFrame.

In [7]:
# create DataFrame
df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})
df

| | team | points |
|---|---|---|
| 0 | A | 25 |
| 1 | A | 12 |
| 2 | B | 15 |
| 3 | B | 14 |
| 4 | B | 19 |
| 5 | B | 23 |
| 6 | C | 25 |
| 7 | C | 29 |


In [8]:
from sklearn.preprocessing import OneHotEncoder

# creating instance of one-hot-encoder
encoder = OneHotEncoder(handle_unknown='ignore')

# perform one-hot encoding on 'team' column
# (fit_transform returns a sparse matrix; .toarray() converts it to a dense array)
encoder_df = pd.DataFrame(encoder.fit_transform(df[['team']]).toarray())

# merge one-hot encoded columns back with original DataFrame
final_df = df.join(encoder_df)

# view final df
final_df

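| | team | points | 0 | 1 | 2 |
|---|---|---|---|---|---|
| 0 | A | 25 | 1.0 | 0.0 | 0.0 |
| 1 | A | 12 | 1.0 | 0.0 | 0.0 |
| 2 | B | 15 | 0.0 | 1.0 | 0.0 |
| 3 | B | 14 | 0.0 | 1.0 | 0.0 |
| 4 | B | 19 | 0.0 | 1.0 | 0.0 |
| 5 | B | 23 | 0.0 | 1.0 | 0.0 |
| 6 | C | 25 | 0.0 | 0.0 | 1.0 |
| 7 | C | 29 | 0.0 | 0.0 | 1.0 |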


Notice that three new columns were added to the DataFrame since the original ‘team’ column contained three unique values.
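The link between unique values and new columns can be checked on the fitted encoder itself. The sketch below refits a fresh encoder so that it is self-contained; `team_categories` and `encoded` are names chosen here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(df[['team']])  # sparse matrix, one column per category

# categories_ holds one array of learned categories per input column
team_categories = encoder.categories_[0].tolist()

print(team_categories)
print(encoded.shape)
```

The position of each category in `categories_` corresponds to the position of its column in the encoded output, which is why the columns labelled 0, 1 and 2 stand for teams A, B and C.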

### Drop the Original Categorical Variable

In [9]:
# drop 'team' column
final_df.drop('team', axis=1, inplace=True)

# view final df
final_df

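| | points | 0 | 1 | 2 |
|---|---|---|---|---|
| 0 | 25 | 1.0 | 0.0 | 0.0 |
| 1 | 12 | 1.0 | 0.0 | 0.0 |
| 2 | 15 | 0.0 | 1.0 | 0.0 |
| 3 | 14 | 0.0 | 1.0 | 0.0 |
| 4 | 19 | 0.0 | 1.0 | 0.0 |
| 5 | 23 | 0.0 | 1.0 | 0.0 |
| 6 | 25 | 0.0 | 0.0 | 1.0 |
| 7 | 29 | 0.0 | 0.0 | 1.0 |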


We could also rename the columns of the final DataFrame to make them easier to read.

In [10]:
# rename columns
final_df.columns = ['points', 'team_A', 'team_B', 'team_C']

# view final df
final_df

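| | points | team_A | team_B | team_C |
|---|---|---|---|---|
| 0 | 25 | 1.0 | 0.0 | 0.0 |
| 1 | 12 | 1.0 | 0.0 | 0.0 |
| 2 | 15 | 0.0 | 1.0 | 0.0 |
| 3 | 14 | 0.0 | 1.0 | 0.0 |
| 4 | 19 | 0.0 | 1.0 | 0.0 |
| 5 | 23 | 0.0 | 1.0 | 0.0 |
| 6 | 25 | 0.0 | 0.0 | 1.0 |
| 7 | 29 | 0.0 | 0.0 | 1.0 |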


The one-hot encoding is complete. The DataFrame now contains only numeric columns, so it can be passed to a machine learning algorithm that expects numeric input.
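When new data is encoded later, the already fitted encoder should be reused so the columns line up with the training data. The sketch below illustrates what `handle_unknown='ignore'` does in that situation; the team `'D'` and the names `new_df` and `new_encoded` are hypothetical and not part of the original data.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
                   'points': [25, 12, 15, 14, 19, 23, 25, 29]})

encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(df[['team']])

# hypothetical new rows: 'A' was seen during fit, 'D' was not
new_df = pd.DataFrame({'team': ['A', 'D']})

# transform only; do not refit on the new data
new_encoded = encoder.transform(new_df[['team']]).toarray()

print(new_encoded)
```

With `handle_unknown='ignore'`, the unseen category is encoded as a row of zeros instead of raising an error; with the default `handle_unknown='error'`, the `transform` call would fail on `'D'`.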