## One Hot Encoding

One hot encoding consists of encoding each categorical variable with a set of boolean variables (also called dummy variables), one per category, which take the value 0 or 1 to indicate whether that category is present in an observation.
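
As a minimal sketch of the idea, the toy example below encodes a made-up `colour` variable (hypothetical data, not part of the Titanic dataset). `dtype=int` is passed so that the dummies show as 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Toy data, invented for illustration only
colours = pd.Series(['red', 'blue', 'green', 'blue'], name='colour')

# One dummy column per category; each row has exactly one column set to 1
dummies = pd.get_dummies(colours, dtype=int)

# Show the original variable next to its dummy variables
print(pd.concat([colours, dummies], axis=1))
```

Each observation has a 1 only in the column that matches its category.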



### Advantages of one hot encoding

- Straightforward to implement
- Makes no assumption about the distribution or categories of the categorical variable
- Keeps all the information of the categorical variable
- Suitable for linear models

### Limitations

- Expands the feature space
- Does not add extra information while encoding
- Many dummy variables may be identical, introducing redundant information
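
To make the first limitation concrete, the sketch below counts how many columns a small, made-up data frame has before and after encoding. The data and variable names are assumptions for illustration; the point is only that the number of dummies equals the total number of distinct labels.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female'],
    'cabin': ['B', 'C', 'E', 'C'],
})

# Expected number of dummies: one per distinct label in each variable
n_expected = int(df.nunique().sum())

encoded = pd.get_dummies(df, dtype=int)

print(df.shape[1], 'columns before,', encoded.shape[1], 'columns after')
```

With high cardinality variables this growth can become substantial.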



In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
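# note: this import path belongs to older Feature-engine releases; newer
# releases expose the encoder as feature_engine.encoding.OneHotEncoder
from feature_engine.categorical_encoders import OneHotCategoricalEncoder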

In [2]:
# load the Titanic dataset

data = pd.read_csv('titanic.csv',
                   usecols=['sex', 'embarked', 'cabin', 'survived'])
data.head()

Unnamed: 0,survived,sex,cabin,embarked
0,1,female,B5,S
1,1,male,C22,S
2,0,female,C22,S
3,0,male,C22,S
4,0,female,C22,S


In [3]:
# let's capture only the first letter of the cabin for this demonstration

data['cabin'] = data['cabin'].str[0]

data.head()

Unnamed: 0,survived,sex,cabin,embarked
0,1,female,B,S
1,1,male,C,S
2,0,female,C,S
3,0,male,C,S
4,0,female,C,S


In [4]:
# let's separate the data into training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    data[['sex', 'embarked', 'cabin']],  # predictors
    data['survived'],  # target
    test_size=0.3,  # fraction of observations in the test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((916, 3), (393, 3))

### Explore the cardinality

In [5]:
# sex has 2 labels

X_train['sex'].unique()

array(['female', 'male'], dtype=object)

In [6]:
# embarked has 3 labels and missing data

X_train['embarked'].unique()

array(['S', 'C', 'Q', nan], dtype=object)

In [7]:
# cabin has 8 labels and missing data

X_train['cabin'].unique()

array([nan, 'E', 'C', 'D', 'B', 'A', 'F', 'T', 'G'], dtype=object)
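
Rather than inspecting each variable separately, the cardinality and the amount of missing data can be summarised in one table. The sketch below uses a small made-up data frame (an assumption for illustration) in place of `X_train`.

```python
import numpy as np
import pandas as pd

# Toy data standing in for X_train, invented for illustration only
toy = pd.DataFrame({
    'sex': ['female', 'male', 'male', 'female'],
    'embarked': ['S', 'C', np.nan, 'Q'],
    'cabin': [np.nan, 'E', 'C', np.nan],
})

summary = pd.DataFrame({
    'n_labels': toy.nunique(),         # distinct labels, NaN not counted
    'n_missing': toy.isnull().sum(),   # number of missing values
})

print(summary)
```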

In [8]:
# we can create dummy variables with the built-in pandas function get_dummies

tmp = pd.get_dummies(X_train['sex'])

tmp.head()

Unnamed: 0,female,male
501,1,0
588,1,0
402,1,0
1193,0,1
686,1,0


In [9]:
# for better visualisation let's put the dummies next to the original variable

pd.concat([X_train['sex'],
           pd.get_dummies(X_train['sex'])], axis=1).head()

Unnamed: 0,sex,female,male
501,female,1,0
588,female,1,0
402,female,1,0
1193,male,0,1
686,female,1,0


In [10]:
# and now let's repeat for embarked

tmp = pd.get_dummies(X_train['embarked'])

tmp.head()

Unnamed: 0,C,Q,S
501,0,0,1
588,0,0,1
402,1,0,0
1193,0,1,0
686,0,1,0


In [11]:
# for better visualisation

pd.concat([X_train['embarked'],
           pd.get_dummies(X_train['embarked'])], axis=1).head()

Unnamed: 0,embarked,C,Q,S
501,S,0,0,1
588,S,0,0,1
402,C,1,0,0
1193,Q,0,1,0
686,Q,0,1,0


In [12]:
# and for cabin (the rows shown are all zeros because cabin is missing for these passengers)

tmp = pd.get_dummies(X_train['cabin'])

tmp.head()

Unnamed: 0,A,B,C,D,E,F,G,T
501,0,0,0,0,0,0,0,0
588,0,0,0,0,0,0,0,0
402,0,0,0,0,0,0,0,0
1193,0,0,0,0,0,0,0,0
686,0,0,0,0,0,0,0,0


In [13]:
# and now for all variables together: train set

tmp = pd.get_dummies(X_train)

print(tmp.shape)

tmp.head()

(916, 13)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,1,0,0,0,1,0,0,0,0,0,0,0,0
588,1,0,0,0,1,0,0,0,0,0,0,0,0
402,1,0,1,0,0,0,0,0,0,0,0,0,0
1193,0,1,0,1,0,0,0,0,0,0,0,0,0
686,1,0,0,1,0,0,0,0,0,0,0,0,0


In [14]:
# and now for all variables together: test set

tmp = pd.get_dummies(X_test)

print(tmp.shape)

tmp.head()

(393, 12)


Unnamed: 0,sex_female,sex_male,embarked_C,embarked_Q,embarked_S,cabin_A,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G
1139,0,1,0,0,1,0,0,0,0,0,0,0
533,1,0,0,0,1,0,0,0,0,0,0,0
459,0,1,0,0,1,0,0,0,0,0,0,0
1150,0,1,0,0,1,0,0,0,0,0,0,0
393,0,1,0,0,1,0,0,0,0,0,0,0
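
Note that the train set produces 13 dummy variables while the test set produces 12, because cabin T does not appear in the test set. One possible way to reconcile the two, sketched below on made-up data, is to reindex the encoded test set on the columns of the encoded train set. The data frames and variable names here are assumptions for illustration.

```python
import pandas as pd

# Toy data, invented for illustration: label 'T' appears only in train
train = pd.DataFrame({'cabin': ['A', 'B', 'T', 'B']})
test = pd.DataFrame({'cabin': ['A', 'B', 'A']})

train_enc = pd.get_dummies(train, dtype=int)
test_enc = pd.get_dummies(test, dtype=int)

# Keep exactly the train columns: missing ones are added and filled with 0,
# and any column that exists only in test is dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```

After reindexing, both sets have the same columns in the same order, which is what a model trained on the train set expects.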


In [15]:
# obtaining k-1 labels: we need to tell get_dummies to drop the first binary variable

tmp = pd.get_dummies(X_train['sex'], drop_first=True)

tmp.head()

Unnamed: 0,male
501,0
588,0
402,0
1193,1
686,0


In [16]:
# obtaining k-1 labels: we need to tell get_dummies to drop the first binary variable

tmp = pd.get_dummies(X_train['embarked'], drop_first=True)

tmp.head()

Unnamed: 0,Q,S
501,0,1
588,0,1
402,0,0
1193,1,0
686,1,0


For embarked, if an observation shows 0 for both Q and S, its value must be C, the remaining category.

Caveat: this variable has missing data. Unless we encode missing data as well, an observation with 0 in both Q and S could also be a missing value, so not all the information contained in the variable is captured.
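
The ambiguity can be seen on a small made-up series (an assumption for illustration, not the Titanic data): with k-1 dummies, the dropped category and the missing value receive the same encoding.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
embarked = pd.Series(['C', 'Q', 'S', np.nan], name='embarked')

# k-1 encoding: the first label, 'C', is dropped
k_minus_1 = pd.get_dummies(embarked, drop_first=True, dtype=int)
print(k_minus_1)

# Rows encoded as all zeros: both 'C' and the missing value
ambiguous = k_minus_1.sum(axis=1).eq(0)
print(embarked[ambiguous])
```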

In [17]:
# altogether: train set

tmp = pd.get_dummies(X_train, drop_first=True)

print(tmp.shape)

tmp.head()

(916, 10)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G,cabin_T
501,0,0,1,0,0,0,0,0,0,0
588,0,0,1,0,0,0,0,0,0,0
402,0,0,0,0,0,0,0,0,0,0
1193,1,1,0,0,0,0,0,0,0,0
686,0,1,0,0,0,0,0,0,0,0


In [18]:
# altogether: test set

tmp = pd.get_dummies(X_test, drop_first=True)

print(tmp.shape)

tmp.head()

(393, 9)


Unnamed: 0,sex_male,embarked_Q,embarked_S,cabin_B,cabin_C,cabin_D,cabin_E,cabin_F,cabin_G
1139,1,0,1,0,0,0,0,0,0
533,0,0,1,0,0,0,0,0,0
459,1,0,1,0,0,0,0,0,0
1150,1,0,1,0,0,0,0,0,0
393,1,0,1,0,0,0,0,0,0
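
Here again the test set has one column fewer than the train set (9 versus 10), because cabin T is absent from it. An alternative to reindexing afterwards, sketched below on made-up data, is to fix the set of categories from the train set with a categorical dtype before calling `get_dummies`. The variable names are hypothetical.

```python
import pandas as pd

# Toy data, invented for illustration: label 'T' appears only in train
train = pd.DataFrame({'cabin': ['A', 'B', 'T', 'B']})
test = pd.DataFrame({'cabin': ['A', 'B', 'A']})

# Learn the categories from the train set only
cabin_dtype = pd.CategoricalDtype(
    categories=sorted(train['cabin'].dropna().unique()))

# get_dummies creates one column per declared category, observed or not
train_enc = pd.get_dummies(
    train.astype({'cabin': cabin_dtype}), drop_first=True, dtype=int)
test_enc = pd.get_dummies(
    test.astype({'cabin': cabin_dtype}), drop_first=True, dtype=int)

print(train_enc.columns.tolist())
print(test_enc.columns.tolist())
```

With this approach, a label that appears only in the test set becomes a missing value when cast, so it is encoded as all zeros.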


In [19]:
# we can add an additional dummy variable to indicate missing data

pd.get_dummies(X_train['embarked'], drop_first=True, dummy_na=True).head()

Unnamed: 0,Q,S,NaN
501,0,1,0
588,0,1,0
402,0,0,0
1193,1,0,0
686,1,0,0
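
The rows displayed above contain no missing values, so the NaN indicator is 0 throughout. A small made-up series (an assumption for illustration) shows that, with the extra indicator, a missing value is no longer confused with the dropped category.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
embarked = pd.Series(['C', 'Q', 'S', np.nan], name='embarked')

# k-1 dummies plus an indicator for missing data (added as the last column)
enc = pd.get_dummies(embarked, drop_first=True, dummy_na=True, dtype=int)
print(enc)

# 'C' (row 0) is all zeros; the missing value (row 3) is flagged
nan_indicator = enc.iloc[:, -1]
print(nan_indicator.tolist())
```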
