# Variable encoding

Using pandas, we will look at two common ways of encoding categorical variables: label encoding and one-hot encoding.

In [6]:
# make sure the required packages are installed
%pip install pandas seaborn --quiet 

Note: you may need to restart the kernel to use updated packages.


## Dataset

We will use the Titanic dataset, which contains information about the passengers of the Titanic. The dataset is available through the seaborn library.

In [7]:
# get the titanic dataset from seaborn
import seaborn as sns
titanic_df = sns.load_dataset('titanic')
titanic_df.head(10)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False


## ✨ Questions ✨

1. What type of variable is the `sex` column?
2. What type of variable is the `pclass` column?
3. What type of variable is the `embark_town` column?
4. What type of variable is the `fare` column?
5. What type of variable is the `survived` column?
6. What type of variable is the `age` column?
7. What type of variable is the `class` column?

### Answers:

*Write your answers here.*



## Label encoding

Label encoding is a way to encode categorical variables: it assigns a unique integer to each category. For example, the `class` column is an ordinal variable with three categories: `First`, `Second` and `Third`. We can encode them as `1`, `2` and `3`, which preserves their order.

In [8]:
# label encoding with pandas
print(f"Existing classes: {titanic_df['class'].unique()}.")  # to check the unique values of the column 'class'
class_codes = {'First': 1, 'Second': 2, 'Third': 3}
titanic_df['encoded_class'] = titanic_df['class'].map(class_codes) 
titanic_df.head(10)

Existing classes: ['Third', 'First', 'Second']
Categories (3, object): ['First', 'Second', 'Third'].


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone,encoded_class
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False,3
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False,1
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True,3
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False,1
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True,3
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,Queenstown,no,True,3
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,Southampton,no,True,1
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,Southampton,no,False,3
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,Southampton,yes,False,3
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,Cherbourg,yes,False,2
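
As an alternative to writing the dictionary by hand, the same integers can be derived from an ordered categorical. This is only a minimal sketch: `sample_df` is a small hand-made frame (hypothetical rows, not the real Titanic data), and the category order is stated explicitly as an assumption that `First < Second < Third`.

```python
import pandas as pd

# Small hand-made sample (hypothetical rows, not the real Titanic data)
sample_df = pd.DataFrame({"class": ["Third", "First", "Third", "Second"]})

# Declare the order explicitly so the codes follow First < Second < Third
ordered_class = pd.Categorical(
    sample_df["class"],
    categories=["First", "Second", "Third"],
    ordered=True,
)

# .codes starts at 0, so add 1 to match the 1/2/3 mapping used in the cell above
sample_df["encoded_class"] = ordered_class.codes + 1
print(sample_df)
```

Values that are not listed in `categories` get the code `-1`, so it is worth checking the unique values first, as the cell above does.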


## One-hot encoding

The `class` variable is ordinal, so we used label encoding: the categories have a natural order that the integers preserve. But what if the variable is nominal (e.g., `embark_town`)? In that case the `<` and `>` operators are not meaningful, so label encoding is not appropriate. For this scenario, we have one-hot encoding.

One-hot encoding creates a new column for each category and assigns `1` to the column matching the row's category and `0` to the others. The purpose is to prevent the model from interpreting the categories as ordinal. That is, the `==` and `!=` operators are valid, but `<` and `>` are not.

In [9]:
# one-hot encoding with pandas
import pandas as pd
print(f"Existing embarkation towns: {titanic_df['embark_town'].unique()}.")  # to check the unique values of the column 'embark_town'
# get_dummies creates the one-hot encoding, replacing the original column
titanic_with_dummies_df = pd.get_dummies(titanic_df, 
                                         columns=['embark_town'],  # column to encode 
                                         prefix='embark',  # prefix for the new columns
                                         dtype=int)  # generate 0 and 1 instead of True and False
titanic_with_dummies_df.head(10)

Existing embarkation towns: ['Southampton' 'Cherbourg' 'Queenstown' nan].


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,alive,alone,encoded_class,embark_Cherbourg,embark_Queenstown,embark_Southampton
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,no,False,3,0,0,1
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,yes,False,1,1,0,0
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,yes,True,3,0,0,1
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,yes,False,1,0,0,1
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,no,True,3,0,0,1
5,0,3,male,,0,0,8.4583,Q,Third,man,True,,no,True,3,0,1,0
6,0,1,male,54.0,0,0,51.8625,S,First,man,True,E,no,True,1,0,0,1
7,0,3,male,2.0,3,1,21.075,S,Third,child,False,,no,False,3,0,0,1
8,1,3,female,27.0,0,2,11.1333,S,Third,woman,False,,yes,False,3,0,0,1
9,1,2,female,14.0,1,0,30.0708,C,Second,child,False,,yes,False,2,1,0,0
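
The unique values printed above include `nan`. With the default settings, `get_dummies` gives such rows `0` in every new column. The sketch below uses a small hand-made frame (`sample_df`, hypothetical rows rather than the real dataset) to compare that default with `dummy_na=True`, which adds a separate indicator column for missing values. Whether a missing-value column is wanted depends on the analysis, so treat this as an illustration rather than a recommendation.

```python
import numpy as np
import pandas as pd

# Small hand-made sample with one missing town (hypothetical rows)
sample_df = pd.DataFrame(
    {"embark_town": ["Southampton", "Cherbourg", np.nan, "Queenstown"]}
)

# Default behaviour: the row with the missing value is all zeros
default_dummies = pd.get_dummies(
    sample_df, columns=["embark_town"], prefix="embark", dtype=int
)

# dummy_na=True adds one extra indicator column for missing values
with_nan_dummies = pd.get_dummies(
    sample_df, columns=["embark_town"], prefix="embark", dummy_na=True, dtype=int
)

print(default_dummies)
print(with_nan_dummies)
```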


## ✨ Questions ✨

8. What is the best way to encode the `sex` column?
9. Why?
10. Apply that encoding to the `sex` column.

### Answers:

*Write your answers here.*



In [10]:
# Answer to Question 10:

# add your code here

