## One Hot Encoding

One hot encoding consists of encoding each categorical variable with a set of boolean variables (also called dummy variables) that take the value 0 or 1, indicating whether a category is present in an observation.

For example, for the categorical variable "Gender", with labels 'female' and 'male', we can generate the boolean variable "female", which takes 1 if the person is 'female' or 0 otherwise, or we can generate the variable "male", which takes 1 if the person is 'male' and 0 otherwise.

For the categorical variable "colour" with values 'red', 'blue' and 'green', we can create 3 new variables called "red", "blue" and "green". Each of these variables takes the value 1 if the observation is of that colour and 0 otherwise.
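As a minimal sketch of the colour example, the snippet below builds one binary column per category with pandas. The toy series is hypothetical, and `dtype=int` is passed only so the output shows 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Hypothetical toy data: one categorical variable with k = 3 categories
colour = pd.Series(["red", "blue", "green", "red"], name="colour")

# One binary column per category (k = 3 columns in total)
dummies = pd.get_dummies(colour, dtype=int)

# Show the original variable next to its dummy variables
print(pd.concat([colour, dummies], axis=1))
```

Each row has exactly one dummy set to 1, the one matching its colour.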

### Encoding into k-1 dummy variables

Note however, that for the variable "colour", by creating 2 binary variables, say "red" and "blue", we already encode **ALL** the information:

- if the observation is red, it will be captured by the variable "red" (red = 1, blue = 0)
- if the observation is blue, it will be captured by the variable "blue" (red = 0, blue = 1)
- if the observation is green, it will be captured by the combination of "red" and "blue" (red = 0, blue = 0)

We do not need to add a third variable "green" to capture that the observation is green.

More generally, a categorical variable should be encoded by creating k-1 binary variables, where k is the number of distinct categories. In the case of gender, k=2 (male / female), therefore we need to create only 1 (k - 1 = 1) binary variable. In the case of colour, which has 3 different categories (k=3), we need to create 2 (k - 1 = 2) binary variables to capture all the information.

One hot encoding into k-1 binary variables takes into account that we can use 1 less dimension and still represent the whole information: if the observation is 0 in all the binary variables, then it must be 1 in the final (not present) binary variable.
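The following sketch illustrates this on hypothetical toy data: only the "red" and "blue" indicators are kept, and the omitted "green" indicator is recovered from them. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data with k = 3 categories
colour = pd.Series(["red", "blue", "green", "red"], name="colour")

# Keep only the "red" and "blue" indicators (k - 1 = 2 columns)
k_minus_1 = pd.get_dummies(colour, dtype=int)[["red", "blue"]]

# The omitted "green" indicator is 1 exactly when all kept columns are 0
green_recovered = 1 - k_minus_1.sum(axis=1)

print(k_minus_1.assign(green_recovered=green_recovered))
```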

**When one hot encoding categorical variables, we create k - 1 binary variables**


Most machine learning algorithms consider the entire data set while being fit. Therefore, encoding categorical variables into k - 1 binary variables is preferable, as it avoids introducing redundant information.
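One way to make the redundancy concrete, assuming a linear model with an intercept column (an assumption made for this sketch, not something stated above), is to compare the rank of the design matrix with its number of columns.

```python
import numpy as np

# Hypothetical toy data. Rows: red, blue, green, red observations;
# columns: the "red", "blue" and "green" dummy variables
k_dummies = np.array([[1, 0, 0],
                      [0, 1, 0],
                      [0, 0, 1],
                      [1, 0, 0]])
intercept = np.ones((k_dummies.shape[0], 1))

# Intercept + k dummies: 4 columns, but the dummies always sum to the
# intercept column, so one column carries no new information
full = np.hstack([intercept, k_dummies])

# Intercept + (k - 1) dummies: every column is needed
reduced = np.hstack([intercept, k_dummies[:, :2]])

print(full.shape[1], np.linalg.matrix_rank(full))
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

With k dummies the rank is lower than the number of columns; with k - 1 dummies the two match.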


In [1]:
import pandas as pd

# to split the datasets
from sklearn.model_selection import train_test_split

# for one hot encoding with sklearn
from sklearn.preprocessing import OneHotEncoder

# for one hot encoding with feature-engine
from feature_engine.encoding import OneHotEncoder as fe_OneHotEncoder

In [2]:
# load titanic dataset

data = pd.read_csv('tested.csv',
                   usecols=['Sex', 'Embarked', 'Cabin', 'Survived'])
data.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked
0,0,male,,Q
1,1,female,,S
2,0,male,,Q
3,0,male,,S
4,1,female,,S


In [3]:
# let's capture only the first letter of the 
# cabin for this demonstration

data['Cabin'] = data['Cabin'].str[0]

data.head()

Unnamed: 0,Survived,Sex,Cabin,Embarked
0,0,male,,Q
1,1,female,,S
2,0,male,,Q
3,0,male,,S
4,1,female,,S


In [4]:
# let's separate into training and testing set

X_train, X_test, y_train, y_test = train_test_split(
    data[['Sex', 'Embarked', 'Cabin']],  # predictors
    data['Survived'],  # target
    test_size=0.3,  # percentage of obs in test set
    random_state=0)  # seed to ensure reproducibility

X_train.shape, X_test.shape

((292, 3), (126, 3))

## Let's explore the cardinality

In [5]:
# sex has 2 labels

X_train['Sex'].unique()

array(['female', 'male'], dtype=object)

In [6]:
# embarked has 3 labels and no missing data in the train set

X_train['Embarked'].unique()

array(['S', 'Q', 'C'], dtype=object)

In [7]:
# cabin has 7 labels and missing data

X_train['Cabin'].unique()

array(['C', nan, 'B', 'A', 'E', 'F', 'D', 'G'], dtype=object)

## One hot encoding with pandas

### Advantages

- quick
- returns pandas dataframe
- returns feature names for the dummy variables

### Limitations

- it does not learn the categories from the train data, so it cannot apply the same encoding to the test data


-----

By default, the pandas function get_dummies() creates as many binary variables as there are categories in the variable it is given:

If the variable colour has 3 categories in the train data, it will create 3 dummy variables. However, if the variable colour has 5 categories in the test data, it will create 5 binary variables. The train and test sets will then end up with a different number of features and will be incompatible for training and scoring with Scikit-learn.

In practice, we shouldn't use get_dummies in our machine learning pipelines. It is, however, useful for quick data exploration. Let's look at this with examples.
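A minimal sketch of this mismatch on hypothetical toy data, where the test set holds two colours that never appear in the train set. With its default settings, get_dummies creates one column per category found in whichever dataframe it receives.

```python
import pandas as pd

# Hypothetical toy data: the test set holds categories unseen in train
train = pd.DataFrame({"colour": ["red", "blue", "green"]})
test = pd.DataFrame({"colour": ["red", "blue", "green", "yellow", "black"]})

# Each call only sees the categories of the dataframe it is given
train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

print(train_dummies.shape)
print(test_dummies.shape)

# Columns that exist in the test set only
print(sorted(set(test_dummies.columns) - set(train_dummies.columns)))
```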

### Into k dummy variables

In [8]:
# we can create dummy variables with the built-in
# pandas function get_dummies

tmp = pd.get_dummies(X_train['Sex'])

tmp.head()

Unnamed: 0,female,male
96,1,0
381,0,1
89,0,1
233,0,1
191,0,1


In [9]:
# for better visualisation let's put the dummies next
# to the original variable

pd.concat([X_train['Sex'],
           pd.get_dummies(X_train['Sex'])], axis=1).head()

Unnamed: 0,Sex,female,male
96,female,1,0
381,male,0,1
89,male,0,1
233,male,0,1
191,male,0,1


In [10]:
# and now let's repeat for embarked

tmp = pd.get_dummies(X_train['Embarked'])

tmp.head()

Unnamed: 0,C,Q,S
96,0,0,1
381,0,1,0
89,0,0,1
233,0,1,0
191,0,0,1


In [11]:
# for better visualisation

pd.concat([X_train['Embarked'],
           pd.get_dummies(X_train['Embarked'])], axis=1).head()

Unnamed: 0,Embarked,C,Q,S
96,S,0,0,1
381,Q,0,1,0
89,S,0,0,1
233,Q,0,1,0
191,S,0,0,1


In [12]:
# and now for cabin

tmp = pd.get_dummies(X_train['Cabin'])

tmp.head()

Unnamed: 0,A,B,C,D,E,F,G
96,0,0,1,0,0,0,0
381,0,0,0,0,0,0,0
89,0,0,0,0,0,0,0
233,0,0,0,0,0,0,0
191,0,0,0,0,0,0,0


In [13]:
# and now for all variables together: train set

tmp = pd.get_dummies(X_train)

print(tmp.shape)

tmp.head()

(292, 12)


Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G
96,1,0,0,0,1,0,0,1,0,0,0,0
381,0,1,0,1,0,0,0,0,0,0,0,0
89,0,1,0,0,1,0,0,0,0,0,0,0
233,0,1,0,1,0,0,0,0,0,0,0,0
191,0,1,0,0,1,0,0,0,0,0,0,0


In [14]:
# and now for all variables together: test set

tmp = pd.get_dummies(X_test)

print(tmp.shape)

tmp.head()

(126, 10)


Unnamed: 0,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S,Cabin_A,Cabin_B,Cabin_C,Cabin_D,Cabin_E
360,0,1,0,0,1,0,0,0,0,0
170,0,1,0,0,1,0,0,0,0,0
224,1,0,1,0,0,0,0,0,0,0
358,0,1,0,1,0,0,0,0,0,0
309,1,0,0,0,1,0,0,0,0,0


Notice the positives of pandas get_dummies:
- dataframe returned with feature names

**And the limitations:**

The train set contains 12 dummy features, whereas the test set contains only 10. This occurred because the Cabin categories F and G are not present in the test set.

This will cause problems when training and scoring models with Scikit-learn, because predictors require the train and test sets to have the same number of features.
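If get_dummies has to be used anyway, one common workaround (not part of the original walkthrough) is to align the test columns to the train columns with `reindex`. The toy dataframes below are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: F and G appear only in train, T only in test
train = pd.DataFrame({"Cabin": ["A", "B", "F", "G"]})
test = pd.DataFrame({"Cabin": ["A", "B", "T"]})

train_dummies = pd.get_dummies(train, dtype=int)

# Align the test columns to the train columns:
# - columns missing from test (Cabin_F, Cabin_G) are added and filled with 0
# - columns unseen in train (Cabin_T) are dropped
test_dummies = pd.get_dummies(test, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0)

print(test_dummies)
```

The observation with cabin T ends up with 0 in every dummy. Whether that is acceptable depends on the model, which is one reason the encoders shown later are usually a better fit for pipelines.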

### Into k-1 dummy variables

In [15]:
# obtaining k-1 labels: we need to tell get_dummies
# to drop the first binary variable

tmp = pd.get_dummies(X_train['Sex'], drop_first=True)

tmp.head()

Unnamed: 0,male
96,0
381,1
89,1
233,1
191,1


In [16]:
# obtaining k-1 labels: we need to tell get_dummies
# to drop the first binary variable

tmp = pd.get_dummies(X_train['Embarked'], drop_first=True)

tmp.head()

Unnamed: 0,Q,S
96,0,1
381,1,0
89,0,1
233,1,0
191,0,1


For embarked, if an observation shows 0 for Q and S, then its value must be C, the remaining category.

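Caveat: this only holds when the variable has no missing data, as is the case for Embarked in this train set. When values are missing, an observation with a missing value also shows 0 for Q and S, so unless we encode missing data as well, not all the information contained in the variable is captured.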
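A minimal sketch of this caveat on a hypothetical toy series that contains one missing value. With k-1 dummies, the missing value and the dropped category look identical unless an extra indicator is added with `dummy_na=True`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
embarked = pd.Series(["C", "Q", "S", np.nan], name="Embarked")

# k-1 dummies: "C" is dropped, so row 0 ("C") and row 3 (NaN)
# are both encoded as Q = 0, S = 0
k_minus_1 = pd.get_dummies(embarked, drop_first=True, dtype=int)
print(k_minus_1)

# Adding a missing-data indicator separates the two cases
with_na = pd.get_dummies(embarked, drop_first=True, dummy_na=True, dtype=int)
print(with_na)
```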

In [17]:
# altogether: train set

tmp = pd.get_dummies(X_train, drop_first=True)

print(tmp.shape)

tmp.head()

(292, 9)


Unnamed: 0,Sex_male,Embarked_Q,Embarked_S,Cabin_B,Cabin_C,Cabin_D,Cabin_E,Cabin_F,Cabin_G
96,0,0,1,0,1,0,0,0,0
381,1,1,0,0,0,0,0,0,0
89,1,0,1,0,0,0,0,0,0
233,1,1,0,0,0,0,0,0,0
191,1,0,1,0,0,0,0,0,0


In [18]:
# altogether: test set

tmp = pd.get_dummies(X_test, drop_first=True)

print(tmp.shape)

tmp.head()

(126, 7)


Unnamed: 0,Sex_male,Embarked_Q,Embarked_S,Cabin_B,Cabin_C,Cabin_D,Cabin_E
360,1,0,1,0,0,0,0
170,1,0,1,0,0,0,0
224,0,0,0,0,0,0,0
358,1,1,0,0,0,0,0
309,0,0,1,0,0,0,0


### Bonus: get_dummies() can handle missing values

In [19]:
# we can add an additional dummy variable to indicate
# missing data

pd.get_dummies(X_train['Embarked'], drop_first=True, dummy_na=True).head()

Unnamed: 0,Q,S,NaN
96,0,1,0
381,1,0,0
89,0,1,0
233,1,0,0
191,0,1,0


## One hot encoding with Scikit-learn

### Advantages

- quick
- Creates the same number of features in train and test set

### Limitations

- it returns a numpy array instead of a pandas dataframe
- the returned array does not carry the variable names, which is inconvenient for variable exploration (recent Scikit-learn versions expose them through `get_feature_names_out()`)

In [20]:
# we create and train the encoder

encoder = OneHotEncoder(categories='auto',
                       drop='first', # to return k-1, use drop=None to return k dummies
                       sparse=False, # return a dense array (renamed sparse_output in scikit-learn >= 1.2)
                       handle_unknown='error') # raise an error if transform finds an unseen category

encoder.fit(X_train.fillna('Missing'))

OneHotEncoder(drop='first', sparse=False)

In [21]:
# we observe the learned categories

encoder.categories_

[array(['female', 'male'], dtype=object),
 array(['C', 'Q', 'S'], dtype=object),
 array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Missing'], dtype=object)]

In [22]:
# transform the train set

tmp = encoder.transform(X_train.fillna('Missing'))

pd.DataFrame(tmp).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0



## One hot encoding with Feature-Engine

### Advantages
- quick
- returns dataframe
- returns feature names
- allows selecting which features to encode


In [23]:
ohe_enc = fe_OneHotEncoder(
    top_categories=None,
    variables=['Sex', 'Embarked'],  # we can select which variables to encode
    drop_last=True)  # True returns k-1 dummies, False returns k


ohe_enc.fit(X_train.fillna('Missing'))

OneHotEncoder(drop_last=True, variables=['Sex', 'Embarked'])

In [24]:
tmp = ohe_enc.transform(X_train.fillna('Missing'))

tmp.head()

Unnamed: 0,Cabin,Sex_female,Embarked_S,Embarked_Q
96,C,1,1,0
381,Missing,0,0,1
89,Missing,0,1,0
233,Missing,0,0,1
191,Missing,0,1,0


# LabelEncoder from Sklearn

Label encoding converts each category label into an integer, putting the variable into a numeric, machine-readable form that machine learning algorithms can operate on. It is an important pre-processing step for structured datasets in supervised learning. Note that the integers follow the sorted order of the labels, and that the Scikit-learn documentation describes LabelEncoder as intended for the target variable rather than for input features.

In [25]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [26]:
dfle = data.copy()  # copy, so the original dataframe is left untouched
dfle['Sex'] = le.fit_transform(dfle['Sex'])
dfle

Unnamed: 0,Survived,Sex,Cabin,Embarked
0,0,1,,Q
1,1,0,,S
2,0,1,,Q
3,0,1,,S
4,1,0,,S
...,...,...,...,...
413,0,1,,S
414,1,0,C,C
415,0,1,,S
416,0,1,,S
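To see which integer was assigned to which label, the fitted encoder can be inspected. The sketch below uses a small hypothetical list instead of the Titanic data.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy labels
sex = ["male", "female", "male", "male", "female"]

le = LabelEncoder()
codes = le.fit_transform(sex)

# classes_ holds the sorted labels; the position of a label is its code
print(le.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original labels
print(le.inverse_transform(codes))
```

Because the labels are sorted, 'female' maps to 0 and 'male' to 1, which matches the Sex column in the output above.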
