# Working with Categorical Variables

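This notebook demonstrates how to handle categorical variables in a dataset, using the Titanic training data: imputing missing values, one-hot encoding, and label encoding.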

## Importing and loading data

In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

print(pd.__version__)
print(np.__version__)

2.1.3
1.26.1


In [2]:
# Loading the data
data = pd.read_csv('datasets/titanic_train.csv')

# Checking the data
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


## Handling missing values

In [3]:
# Missing values in the data
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

### Imputing missing values

#### Age

In [3]:
# finding mean value of 'Age'
mean_age = data['Age'].mean()
print('Mean Age:', mean_age.round(3))

Mean Age: 29.699


In [4]:
# Making a copy
cleaned_data = data.copy() 

# Imputing missing values
cleaned_data['Age'] = data['Age'].fillna(value = mean_age)
cleaned_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

#### Embarked

In [5]:
# Checking the frequency of every value
data['Embarked'].value_counts()

Embarked
S    644
C    168
Q     77
Name: count, dtype: int64

In [6]:
mode_emb = data['Embarked'].mode()[0]
print('Embarked mode value:', mode_emb)

Embarked mode value: S


In [7]:
# Imputing missing values
cleaned_data['Embarked'] = data['Embarked'].fillna(value = mode_emb)
cleaned_data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age              0
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         0
dtype: int64
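
The two imputation steps above can also be written as a single `fillna` call that takes a per-column dictionary. The following is a minimal sketch on a small hypothetical frame; the `toy` data and the `fill_values` name are illustrative assumptions, not part of the Titanic file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the 'Age' and 'Embarked' columns
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 36.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill with the mean; categorical column: fill with the mode
fill_values = {
    'Age': toy['Age'].mean(),
    'Embarked': toy['Embarked'].mode()[0],
}
toy_filled = toy.fillna(value=fill_values)

print(toy_filled)
print(toy_filled.isnull().sum())
```

Columns that are not listed in the dictionary are left untouched, which is how `Cabin` would keep its missing values in the full dataset.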

## Handling categorical variables

In [8]:
# Categorical variables in the data
print(data.dtypes) 

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object


In [11]:
# Selecting the columns whose dtype is 'object'
cat_cols = data.select_dtypes(include = ['object']).columns
print(cat_cols)

Index(['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'], dtype='object')


In [12]:
# Number of unique values
data[cat_cols].nunique()

Name        891
Sex           2
Ticket      681
Cabin       147
Embarked      3
dtype: int64

### One-Hot encoding

- One-hot encoding represents a categorical variable as a set of binary (0/1) columns, one per category.
- It allows categorical variables to be used in models that require numerical input.
- It can improve model performance because the model sees each category as its own feature rather than as an arbitrary number.
- Use the pandas function **`get_dummies()`** to perform one-hot encoding, and pass the **`dtype = int`** parameter to get 1s and 0s instead of booleans.

In [14]:
# Use the get_dummies() to perform one-hot encoding
# Use the dtype = int for getting 0s and 1s
pd.get_dummies(data['Embarked'], dtype = int).head()

Unnamed: 0,C,Q,S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [15]:
# Drop the irrelevant columns
cleaned_data = cleaned_data.drop(['Name','Ticket','Cabin'], axis = 1)

# Perform one-hot encoding on cleaned data
cleaned_data = pd.get_dummies(cleaned_data, dtype = int)

# Check the data
cleaned_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_female,Sex_male,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,22.0,1,0,7.25,0,1,0,0,1
1,2,1,1,38.0,1,0,71.2833,1,0,1,0,0
2,3,1,3,26.0,0,0,7.925,1,0,0,0,1
3,4,1,1,35.0,1,0,53.1,1,0,0,0,1
4,5,0,3,35.0,0,0,8.05,0,1,0,0,1


'SibSp' and 'Parch' hold a small number of discrete integer values, so they can also be one-hot encoded into separate columns.
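
One possible way to do this is sketched below. Because `get_dummies()` only encodes object, string, and category columns by default, numeric columns have to be named explicitly through the `columns` parameter. The `toy` frame is a hypothetical stand-in that reuses the Titanic column names.

```python
import pandas as pd

# Hypothetical toy rows; the column names mirror the Titanic data
toy = pd.DataFrame({
    'SibSp': [1, 0, 0, 3],
    'Parch': [0, 0, 2, 1],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})

# Numeric columns are skipped by default, so list them in `columns`
encoded = pd.get_dummies(toy, columns=['SibSp', 'Parch'], dtype=int)

print(encoded.columns.tolist())
print(encoded.head())
```

Each distinct value becomes its own column (for example `SibSp_3`), so this only makes sense when the number of distinct values is small.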

### Label encoding

- Label encoding converts a categorical column into a numerical one by assigning each category an integer code.
- It is an important pre-processing step in a machine-learning project because most models require numerical data for training.

In [16]:
# Checking the data
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [17]:
# Map function
data['Embarked'].map({'Q': 0, 'S': 1, 'C':2}).head()

0    1.0
1    2.0
2    1.0
3    1.0
4    1.0
Name: Embarked, dtype: float64
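
The result is `float64` rather than `int64` because `map` leaves the missing 'Embarked' entries as `NaN`, and `NaN` cannot be stored in a plain integer column. The sketch below reproduces this on a small hypothetical series. The cast to the nullable `Int64` dtype is one optional way to keep whole numbers; it is not something the notebook itself does.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing port of embarkation
ports = pd.Series(['S', 'C', np.nan, 'Q'], name='Embarked')
codes = ports.map({'Q': 0, 'S': 1, 'C': 2})

print(codes.dtype)   # float64, because the missing entry stays NaN

# Optional: the nullable integer dtype keeps integer codes and marks the gap as <NA>
codes_int = codes.astype('Int64')
print(codes_int)
```

Mapping the imputed column (`cleaned_data['Embarked']`) instead would avoid the issue, since it no longer contains missing values.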

In [18]:
# Map the values of 'Embarked' to integer codes
data['Embarked'] = data['Embarked'].map({'Q': 0, 'S': 1, 'C':2})

# Check the data
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,2.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,1.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,1.0


We can also perform label encoding with scikit-learn's **`LabelEncoder`** class.

In [19]:
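# Import the label encoder
from sklearn import preprocessing
LE = preprocessing.LabelEncoder()

# Encode labels in column 'Embarked'.
# Note: 'Embarked' was already mapped to numbers in the previous cell,
# so the encoder is fitted on those codes rather than on the original letters.
data['Embarked'] = LE.fit_transform(data['Embarked'])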

data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,1
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,1
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,1
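
For comparison, here is a minimal sketch of `LabelEncoder` applied directly to string labels. The `ports` series is hypothetical and contains no missing values. The encoder assigns codes in sorted order of the labels, so the result differs from the manual `{'Q': 0, 'S': 1, 'C': 2}` mapping used earlier.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical port labels, with no missing values
ports = pd.Series(['S', 'C', 'S', 'Q'])

encoder = LabelEncoder()
codes = encoder.fit_transform(ports)

print(encoder.classes_)                  # learned labels, stored in sorted order
print(codes)                             # integer code for each row
print(encoder.inverse_transform(codes))  # recover the original labels
```

Keeping the fitted `encoder` object around makes it possible to decode predictions or apply the same codes to new data with `transform`.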
