### One Hot Encoding

One hot encoding consists of replacing a categorical variable with a set of boolean variables, each taking the value 0 or 1 to indicate whether a given category (label) of the variable is present for that observation.

Each of these boolean variables is also known as a dummy variable or binary variable.

For example, from the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes the value 1 if the person is female and 0 otherwise. We can also generate the variable "male", which takes the value 1 if the person is male and 0 otherwise.
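
As a minimal sketch of this idea, the two indicator columns can be built by hand. The small `gender` Series below is made up for illustration and is not taken from the Titanic file:

```python
import pandas as pd

# Toy data (hypothetical), standing in for a "Gender" column
gender = pd.Series(['female', 'male', 'male', 'female'], name='Gender')

# One 0/1 variable per label: 1 if the label is present, 0 otherwise
dummies = pd.DataFrame({
    'female': (gender == 'female').astype(int),
    'male': (gender == 'male').astype(int),
})
print(dummies)
```

Every row has exactly one 1 across the two columns, which is where the name "one hot" comes from.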

In [7]:
import pandas as pd
data = pd.read_csv('titanic.csv', usecols=['Sex'])  # Let's consider only one column to understand one hot encoding
data.head()

Unnamed: 0,Sex
0,male
1,female
2,female
3,female
4,male


In [9]:
data.isnull().sum()  # Check whether there are any missing values in the data

Sex    0
dtype: int64

In [14]:
data['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [15]:
# Let's use dummy variables to convert the categorical data to binary/boolean data
pd.get_dummies(data).head()

Unnamed: 0,Sex_female,Sex_male
0,0,1
1,1,0
2,1,0
3,1,0
4,0,1
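
Depending on the pandas version, `get_dummies` may return boolean True/False columns rather than 0/1 (this became the default in pandas 2.0). A small sketch, on made-up data, of requesting integer columns explicitly through the `dtype` argument:

```python
import pandas as pd

# Toy stand-in for the 'Sex' column (illustrative values only)
toy = pd.DataFrame({'Sex': ['male', 'female', 'female', 'male']})

# dtype=int asks for 0/1 integer columns regardless of the pandas default
encoded = pd.get_dummies(toy, dtype=int)
print(encoded)
```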


In [18]:
pd.concat([data,pd.get_dummies(data)],axis=1).head()

Unnamed: 0,Sex,Sex_female,Sex_male
0,male,0,1
1,female,1,0
2,female,1,0
3,female,1,0
4,male,0,1


As you may have noticed, we only need 1 of the 2 dummy variables to represent the original categorical variable Sex. Any of the 2 will suffice, and it doesn't matter which one we select, since they are equivalent.

Therefore, to encode a categorical variable with 2 labels, we need 1 dummy variable.

To extend this concept: to encode a categorical variable with k labels, we need k-1 dummy variables.

How can we get this using pandas?

In [19]:
pd.concat([data,pd.get_dummies(data,drop_first=True)],axis=1).head()
#After this encoding you can drop the Sex column from the original data set

Unnamed: 0,Sex,Sex_male
0,male,1
1,female,0
2,female,0
3,female,0
4,male,1


In [22]:
# Let's now look at an example with more than 2 labels: the Embarked column
df = pd.read_csv('Titanic.csv',usecols=['Embarked'])
df['Embarked'].isnull().sum()  # Count the missing values

2

In [23]:
df['Embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [24]:
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [26]:
# Let's fill the NaN values with the most frequently occurring label, 'S'
df['Embarked'] = df['Embarked'].fillna('S')
df.isnull().sum()

Embarked    0
dtype: int64
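
Rather than hard-coding the label, the most frequent value can also be looked up with `mode()`. A minimal sketch on a made-up column (the `toy` frame and its values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Embarked' column (illustrative values only)
toy = pd.DataFrame({'Embarked': ['S', 'C', np.nan, 'S', 'Q']})

# mode() ignores NaN by default; take the first entry in case of ties
most_frequent = toy['Embarked'].mode()[0]

# Assign the filled column back to the DataFrame
toy['Embarked'] = toy['Embarked'].fillna(most_frequent)
print(most_frequent, toy['Embarked'].isnull().sum())
```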

In [29]:
# Let's apply one hot encoding (dummy variables) to the Embarked feature
pd.get_dummies(df,drop_first=True).head()
# Since the 'Embarked' feature has 3 labels, it has been replaced by 2 columns

Unnamed: 0,Embarked_Q,Embarked_S
0,0,1
1,0,0
2,0,1
3,0,1
4,0,1


One hot encoding into k-1 binary variables takes into account that we can use 1 less dimension and still represent the whole information: if the observation is 0 in all the binary variables, then it must be 1 in the final (removed) binary variable. As an example, for the variable gender encoded into male, if the observation is 0, then it has to be female. We do not need the additional female variable to explain that.
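
To make this concrete, here is a small sketch, on made-up data, showing that the removed dummy can be reconstructed from the remaining k-1 columns:

```python
import pandas as pd

# Toy stand-in for the 'Embarked' column (illustrative values only)
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', 'C']})

full = pd.get_dummies(toy, dtype=int)                      # k = 3 columns
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # k-1 = 2 columns

# The dropped dummy is 1 exactly when all remaining dummies are 0
recovered_C = 1 - reduced.sum(axis=1)
print((recovered_C == full['Embarked_C']).all())
```

Here `drop_first=True` removes the first label in sorted order ('C'), and the recovered column matches the full encoding row by row.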

One hot encoding into k-1 binary variables should be used in linear regression to keep the correct number of degrees of freedom (k-1). With an intercept in the model, all k dummy variables would sum to 1 for every observation and would therefore be perfectly collinear with it. Linear regression has access to all of the features while it is being trained, and therefore examines the whole set of dummy variables together. This means that k-1 binary variables carry the complete information about the original categorical variable.
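
The degrees-of-freedom argument can be checked numerically. The sketch below, on made-up data, builds a design matrix with an intercept column (the helper `design_matrix` is hypothetical, written only for this illustration) and compares its rank with its number of columns:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Embarked' column (illustrative values only)
toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S', 'C', 'Q']})

def design_matrix(dummies):
    """Prepend an intercept column of ones to the dummy columns."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])

X_k = design_matrix(pd.get_dummies(toy))                           # intercept + k dummies
X_k_minus_1 = design_matrix(pd.get_dummies(toy, drop_first=True))  # intercept + k-1 dummies

# With all k dummies, the dummy columns sum to the intercept column,
# so the rank is lower than the number of columns
print(X_k.shape[1], np.linalg.matrix_rank(X_k))
print(X_k_minus_1.shape[1], np.linalg.matrix_rank(X_k_minus_1))
```

With k dummies the matrix has 4 columns but rank 3; with k-1 dummies it has 3 columns and full rank.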

The same is true for all machine learning algorithms that look at ALL the features at the same time during training, for example support vector machines, neural networks, and clustering algorithms.

In [30]:
# Let us take the Titanic dataset and use its numerical variables plus the encoded categorical variables to predict whether a passenger survived
data = pd.read_csv('Titanic.csv')
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [33]:
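# Keep the numerical variables and add the one hot encoded Sex and Embarked columns
# Note: the 2 missing Embarked values are not imputed here; get_dummies encodes them as all zeros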

data_OHE = pd.concat([data[['Survived','Pclass','Age','SibSp','Parch']],
                      pd.get_dummies(data.Sex,drop_first=True),
                      pd.get_dummies(data.Embarked,drop_first=True)], axis=1)

In [34]:
data_OHE.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,male,Q,S
0,0,3,22.0,1,0,1,0,1
1,1,1,38.0,1,0,0,0,0
2,1,3,26.0,0,0,0,0,1
3,1,1,35.0,1,0,0,0,1
4,0,3,35.0,0,0,1,0,1


In [41]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

In [37]:
X_train, X_test, y_train, y_test = train_test_split(data_OHE[['Pclass', 'Age', 'SibSp',
                                                              'Parch', 'male', 'Q', 'S']].fillna(0),
                                                    data_OHE.Survived,
                                                    test_size=0.3,
                                                    random_state=0)
X_train.shape, X_test.shape

((623, 7), (268, 7))

In [42]:
# Applying logistic regression
logit = LogisticRegression(random_state=44)
logit.fit(X_train, y_train)
print('Train set')
pred = logit.predict_proba(X_train)
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_train, pred[:,1])))
print('Test set')
pred = logit.predict_proba(X_test)
print('Logistic Regression roc-auc: {}'.format(roc_auc_score(y_test, pred[:,1])))

Train set
Logistic Regression roc-auc: 0.8482679334504674
Test set
Logistic Regression roc-auc: 0.8444642857142857
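
One practical caveat when `get_dummies` is applied to the train and test sets separately (which is not the case above, where encoding happens before the split): a label missing from one set yields different columns. A minimal sketch, on made-up data, of aligning the test columns to the training columns with `reindex`:

```python
import pandas as pd

# Toy train/test frames (illustrative values only)
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
test = pd.DataFrame({'Embarked': ['S', 'S', 'C']})  # 'Q' never appears here

# Decide the k-1 columns on the training set only
train_ohe = pd.get_dummies(train, drop_first=True, dtype=int)

# Encode the test set with all its labels, then keep exactly the training
# columns, filling columns for labels absent from the test set with 0
test_ohe = pd.get_dummies(test, dtype=int).reindex(
    columns=train_ohe.columns, fill_value=0)
print(test_ohe)
```

The test set is deliberately encoded without `drop_first`, so that the dropped label is determined by the training set alone.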


#### Advantages

- Straightforward to implement
- Makes no assumptions about the distribution of the categories
- Keeps all the information of the categorical variable

#### Disadvantages

- Does not add any information that may make the variable more predictive
- If the variable has loads of categories, then OHE increases the feature space dramatically


In [43]:
data = pd.read_csv('Titanic.csv', usecols = ['Cabin'])
data.head()

Unnamed: 0,Cabin
0,
1,C85
2,
3,C123
4,


In [44]:
# let's inspect the number of different labels in Cabin (note that unique() also counts NaN as a value)

print('number of labels: {}'.format(len(data.Cabin.unique())))

number of labels: 148


In [45]:
# let's see how many features we would create if we did OHE for Cabin

Cabin_OHE = pd.get_dummies(data.Cabin)
Cabin_OHE.shape

(891, 147)
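
The gap between 148 and 147 comes from the missing values, not from dropping a column. A small sketch on made-up data of how `unique()`, `nunique()`, and `get_dummies` each treat NaN:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Cabin' column (illustrative values only)
toy = pd.DataFrame({'Cabin': [np.nan, 'C85', np.nan, 'C123', 'C85']})

n_with_nan = len(toy['Cabin'].unique())            # NaN counted as a value
n_labels = toy['Cabin'].nunique()                  # NaN excluded by default
n_columns = pd.get_dummies(toy['Cabin']).shape[1]  # NaN gets no column
n_columns_na = pd.get_dummies(toy['Cabin'], dummy_na=True).shape[1]  # extra NaN column

print(n_with_nan, n_labels, n_columns, n_columns_na)
```

Passing `dummy_na=True` is one way to keep missingness as its own indicator column if that information matters.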

In [46]:
Cabin_OHE.head()

Unnamed: 0,A10,A14,A16,A19,A20,A23,A24,A26,A31,A32,...,E8,F E69,F G63,F G73,F2,F33,F38,F4,G6,T
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


The variable Cabin contains 147 different labels (148 distinct values if we count NaN, which `get_dummies` ignores), so performing OHE on it produces 147 variables where originally there was one. With a few categorical variables like this, we would end up with huge datasets. Therefore, OHE is not always the best option to encode categorical variables.
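
One possible way to limit this growth, sketched here on made-up data and not part of the analysis above, is to create dummy variables only for the most frequent labels. The cut-off `top_n` is a hypothetical parameter that would need tuning for real data:

```python
import pandas as pd

# Toy stand-in for the 'Cabin' column (illustrative values only)
toy = pd.DataFrame({'Cabin': ['C85', 'C85', 'C85', 'B20', 'B20', 'E8', 'G6', 'T']})

top_n = 2  # hypothetical cut-off
top_labels = toy['Cabin'].value_counts().head(top_n).index

# One 0/1 column per frequent label; rows with any other label stay all zeros
for label in top_labels:
    toy['Cabin_' + label] = (toy['Cabin'] == label).astype(int)

print(toy)
```

This keeps the feature space at `top_n` extra columns, at the cost of lumping all the rare labels together.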