# 📝 Feature Engineering

## Introduction 

Feature engineering is a crucial step in the data preprocessing pipeline, aimed at enhancing the predictive power of machine learning models. For the Titanic dataset, this involves creating new features and modifying existing ones to better capture the underlying patterns that influence passenger survival.

The Titanic dataset includes various columns such as 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', and 'Embarked'. Each of these features holds potential insights into the survival outcomes, but they often require transformation and enrichment to become more effective for predictive modeling.

Feature engineering is an iterative process: we try a transformation, evaluate its effect on model performance, and keep it only if it helps. Carefully chosen features can improve the accuracy and robustness of predictive models on the Titanic dataset.

## Feature Engineering


Many further transformations are possible at this stage, particularly ones that combine several columns. For practical purposes, we limit ourselves to the following steps:

* We will remove the 'Name' and 'Ticket' columns, which we do not use in this first version of the model.
* For the 'Age' variable, we will fill the missing values with the mean age of the training set, rounded to the nearest integer.
* For the 'Cabin' column, we will replace the missing values with the placeholder 'N' and keep only the first letter of each cabin code.
* We will convert the 'Pclass', 'SibSp', and 'Parch' variables from integer to string so that they are treated as categorical.
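The four steps above can be sketched as a single function applied to a small, made-up DataFrame. This is only an illustration: the helper name `engineer_features`, its `age_fill` parameter, and the toy rows are assumptions for the example, not part of the notebook.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame, age_fill: float) -> pd.DataFrame:
    """Hypothetical helper that applies the four listed steps to a copy of df."""
    out = df.drop(columns=["Name", "Ticket"])        # 1. drop unused columns
    out["Age"] = out["Age"].fillna(age_fill)          # 2. fill missing ages
    out["Cabin"] = out["Cabin"].fillna("N").str[0]    # 3. placeholder + first letter
    cols = ["Pclass", "SibSp", "Parch"]
    out[cols] = out[cols].astype(str)                 # 4. integers -> strings
    return out


# Toy rows, invented for the example
toy = pd.DataFrame({
    "Pclass": [3, 1],
    "Name": ["A", "B"],
    "Age": [22.0, None],
    "SibSp": [1, 0],
    "Parch": [0, 0],
    "Ticket": ["t1", "t2"],
    "Cabin": [None, "C85"],
})

result = engineer_features(toy, age_fill=30)
print(result)
```

Wrapping the steps in one function makes it easier to apply exactly the same transformation to both the training and the test set.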

In [1]:
# Libraries
from loguru import logger
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [2]:
logger.info("Read Data")

# Paths
path_raw = "../../data/raw/"
path_processed = "../../data/processed/"
path_final = "../../data/final/"

# Read data
train = pd.read_csv(path_raw + "train.csv")
test = pd.read_csv(path_raw + "test.csv")

2024-06-09 18:03:20.901 | INFO     | __main__:<module>:1 - Read Data


In [3]:
# Get column names by data types
target = 'Survived'

float_columns = [x for x in list(train.select_dtypes(include=['float64']).columns) if x != target]
integer_columns = [x for x in list(train.select_dtypes(include=['int32', 'int64']).columns) if x != target]
object_columns = [x for x in list(train.select_dtypes(include=['object']).columns) if x != target]
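To make the grouping by dtype concrete, here is a minimal sketch on a made-up frame. The helper `columns_of` and the toy data are assumptions for illustration. The 'Sex' column is created with an explicit `object` dtype so that the result does not depend on the default string dtype of the installed pandas version.

```python
import pandas as pd

# Toy frame, invented for the example
toy = pd.DataFrame({
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Sex": pd.Series(["male", "female"], dtype=object),
    "Fare": [7.25, 71.2833],
})
target = "Survived"


def columns_of(df: pd.DataFrame, dtypes: list) -> list:
    """Hypothetical helper: names of columns with the given dtypes, excluding the target."""
    return [c for c in df.select_dtypes(include=dtypes).columns if c != target]


float_columns = columns_of(toy, ["float64"])
integer_columns = columns_of(toy, ["int32", "int64"])
object_columns = columns_of(toy, ["object"])

print(float_columns, integer_columns, object_columns)
```

Lists built this way describe the dtypes at the moment they are computed, so they would need to be recomputed after any later type conversion.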

In [4]:
logger.info("Remove variables: 'Name' and 'Ticket'")

cols_delete = ['Name', 'Ticket']

train = train.drop(cols_delete, axis=1)
test = test.drop(cols_delete, axis=1)

2024-06-09 18:03:20.933 | INFO     | __main__:<module>:1 - Remove variables: 'Name' and 'Ticket'


In [5]:
logger.info("Fill 'Age' with the mean")
# Mean age of the training set, rounded to the nearest integer; reused for the test set
age_mean = round(train['Age'].mean())

train['Age'] = train['Age'].fillna(age_mean)
test['Age'] = test['Age'].fillna(age_mean)

2024-06-09 18:03:20.949 | INFO     | __main__:<module>:1 - Fill 'Age' with the mean

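A small sketch of the imputation above, using made-up ages. The names `train_toy` and `test_toy` are assumptions for the example. It shows that the fill value is computed from the training rows only and then reused for the test rows, so no information from the test set enters the statistic.

```python
import pandas as pd

# Toy data, invented for the example
train_toy = pd.DataFrame({"Age": [20.0, 31.0, 40.0, None]})
test_toy = pd.DataFrame({"Age": [None, 50.0]})

# mean() skips missing values: (20 + 31 + 40) / 3 = 30.33..., rounded to 30
age_mean = round(train_toy["Age"].mean())

train_toy["Age"] = train_toy["Age"].fillna(age_mean)
test_toy["Age"] = test_toy["Age"].fillna(age_mean)

print(age_mean)
print(test_toy)
```

Mean imputation keeps every row but pulls the filled values toward the center of the distribution. The median or a group-wise statistic are common alternatives to compare against.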

In [6]:
logger.info("Modify and fill missing values in 'Cabin'")
# Missing cabins become the placeholder 'N'; known cabins keep only their first letter
train['Cabin'] = train['Cabin'].fillna('N').str[0]
test['Cabin'] = test['Cabin'].fillna('N').str[0]

2024-06-09 18:03:20.965 | INFO     | __main__:<module>:1 - Modify and fill missing values in 'Cabin'

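The 'Cabin' transformation can be illustrated on a short, made-up series. The cabin codes below are invented for the example. Note that 'N' is a placeholder that marks a missing cabin, which is different from imputing the most frequent observed letter.

```python
import pandas as pd

# Toy cabin codes, invented for the example
cabin = pd.Series(["C85", None, "C123", None, "E46", None], name="Cabin")

# Same expression as in the notebook: placeholder for missing, then first letter
deck = cabin.fillna("N").str[0]

# Most frequent letter among the observed (non-missing) cabins, for comparison
observed_mode = cabin.dropna().str[0].mode().iloc[0]

print(deck.tolist())
print(observed_mode)
```

Keeping a separate 'N' category lets a model learn from the fact that the cabin was not recorded, instead of hiding that fact behind an imputed letter.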

In [7]:
logger.info("Change data type: 'Pclass', 'SibSp', and 'Parch'")

columns_to_convert = ['Pclass', 'SibSp', 'Parch']
train[columns_to_convert] = train[columns_to_convert].astype(str)
test[columns_to_convert] = test[columns_to_convert].astype(str)

2024-06-09 18:03:20.981 | INFO     | __main__:<module>:1 - Change data type: 'Pclass', 'SibSp', and 'Parch'

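One way to see why the type conversion matters is to look at how a later encoding step might treat the column. The sketch below assumes one-hot encoding with `pd.get_dummies`, which this notebook does not itself use, and the toy values are invented.

```python
import pandas as pd

# Toy data, invented for the example
toy = pd.DataFrame({"Pclass": [3, 1, 2], "Fare": [7.25, 71.28, 13.0]})

# With an integer Pclass, get_dummies leaves the column as a single numeric feature
numeric_view = pd.get_dummies(toy)

# After converting to string, Pclass is expanded into one indicator column per class
toy["Pclass"] = toy["Pclass"].astype(str)
categorical_view = pd.get_dummies(toy)

print(list(numeric_view.columns))
print(list(categorical_view.columns))
```

Treating 'Pclass' as categorical removes the assumption that the step from class 1 to 2 has the same effect as the step from 2 to 3. Whether that helps is something to check against model performance.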

In [8]:
# Display the processed training data
logger.info("New train data")
train.head()

2024-06-09 18:03:20.996 | INFO     | __main__:<module>:2 - New train data


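|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | N | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.283 | C | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | N | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | C | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | N | S |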


In [9]:
logger.info("New test data")
test.head()

2024-06-09 18:03:21.011 | INFO     | __main__:<module>:1 - New test data


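|   | PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | male | 34.5 | 0 | 0 | 7.829 | N | Q |
| 1 | 893 | 3 | female | 47.0 | 1 | 0 | 7.0 | N | S |
| 2 | 894 | 2 | male | 62.0 | 0 | 0 | 9.688 | N | Q |
| 3 | 895 | 3 | male | 27.0 | 0 | 0 | 8.662 | N | S |
| 4 | 896 | 3 | female | 22.0 | 1 | 1 | 12.287 | N | S |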


In [10]:
logger.info("Save Results")

train.to_csv(path_processed + 'train.csv', sep=',', index=False)
test.to_csv(path_processed + 'test.csv', sep=',', index=False)

2024-06-09 18:03:21.031 | INFO     | __main__:<module>:1 - Save Results

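One caveat when saving to CSV is that the format stores no dtype information, so string columns that contain only digits are inferred as integers when the file is read back. The sketch below shows this with an in-memory buffer and made-up rows. The variable names are assumptions for the example.

```python
import io

import pandas as pd

# Toy frame with 'Pclass' already converted to string, as in the step above
toy = pd.DataFrame({"Pclass": ["3", "1"], "Fare": [7.25, 71.28]})

buffer = io.StringIO()
toy.to_csv(buffer, sep=",", index=False)

# Default read: 'Pclass' is inferred as an integer column again
buffer.seek(0)
reloaded_default = pd.read_csv(buffer)

# Passing dtype restores the string type for the chosen columns
buffer.seek(0)
reloaded_typed = pd.read_csv(buffer, dtype={"Pclass": str})

print(reloaded_default.dtypes)
print(reloaded_typed.dtypes)
```

A later notebook that reads the processed files would therefore need to pass `dtype` for 'Pclass', 'SibSp', and 'Parch', or repeat the conversion after loading.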

## Conclusion


In our feature engineering process for the Titanic dataset, we undertook several steps to prepare the data for effective modeling:

1. **Removal of Unused Columns:** We removed the 'Name' and 'Ticket' columns, which we chose not to use in this first version of the model.
2. **Handling Missing Values:**
   - For the 'Age' column, missing values were filled with the rounded mean age of the training set, so that rows with an unknown age are kept.
   - For the 'Cabin' column, missing values were replaced with the placeholder 'N', and only the first letter of each cabin code was retained to simplify the data.
3. **Data Type Conversion:** The columns 'Pclass', 'SibSp', and 'Parch' were converted from integer to string type so that they are treated as categorical rather than numeric.
4. **Data Saving:** The processed training and test datasets were saved for future modeling and analysis.

These steps leave the dataset easier to use in subsequent analysis and machine learning tasks: the 'Age' and 'Cabin' columns no longer contain missing values, the cabin information is reduced to a single letter, and unused columns are gone. Whether these choices improve the prediction of passenger survival still has to be confirmed by evaluating model performance.