# Data Preprocessing

In this notebook, we'll work through the steps needed to prepare our Titanic dataset for model training: handling missing data, transforming certain features, and encoding categorical variables.

In [50]:
# Import necessary libraries
import numpy as np
import pandas as pd

# Load the Titanic dataset
df = pd.read_csv('../data/raw/train.csv')

# Print the first few rows of the dataframe
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Missing Data

Let's start by examining how much missing data we have and decide how to handle it.

In [51]:
# Identify missing values
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

We can see that Age, Cabin, and Embarked have missing values. Let's handle each of them.

In [52]:
# Fill missing Age values with the median
median_age = df['Age'].median()
print(f'Median of "Age" column: {median_age}')
df['Age'] = df['Age'].fillna(median_age)

# Fill missing Embarked values with the mode (the most frequent port)
mode_embarked = df['Embarked'].mode()[0]
print(f'Mode of "Embarked" column: {mode_embarked}')
df['Embarked'] = df['Embarked'].fillna(mode_embarked)

# Cabin is missing in 687 of 891 rows, so drop the column
df = df.drop('Cabin', axis=1)

Median of "Age" column: 28.0
Mode of "Embarked" column: S
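
The decision to drop Cabin while imputing Age and Embarked can be backed by the share of missing values per column. The following is a minimal sketch on a small made-up frame; the `toy` data, the `missing_share` and `to_drop` names, and the 0.5 threshold are illustrative assumptions, not part of the notebook. On the real data, the counts printed above put Cabin at 687 of 891 rows missing, roughly 77%.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up for illustration)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, 35.0],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
})

# Fraction of missing values per column, highest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Hypothetical rule: flag columns that are more than half empty as drop candidates
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()
print(to_drop)
```

A threshold like this is only a starting point; whether a sparse column is worth keeping also depends on what the non-missing values encode.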


## Feature Transformation

Let's transform some of the features to better suit our modeling needs.

In [53]:
# Combine SibSp and Parch into a single feature FamilySize (relatives aboard, not counting the passenger)
df['FamilySize'] = df['SibSp'] + df['Parch']
df = df.drop(['SibSp', 'Parch'], axis=1)

# Transform Fare to log scale due to high skewness
df['Fare'] = df['Fare'].apply(lambda x: np.log(x) if x > 0 else 0)
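
The log rule used for Fare can be checked on a handful of values. This is a small sketch, assuming a made-up `fares` series (the first three values are fares from the rows shown earlier, the zero is added for illustration). It also shows `np.log1p` as one possible alternative that is defined at zero; the notebook itself does not use it.

```python
import numpy as np
import pandas as pd

# Three fares from the sample rows plus a zero fare added for illustration
fares = pd.Series([7.25, 71.2833, 7.925, 0.0])

# Same rule as the notebook: natural log for positive fares, 0 otherwise
log_fare = fares.apply(lambda x: np.log(x) if x > 0 else 0)

# Alternative (an assumption, not the notebook's choice): log(1 + x)
log1p_fare = np.log1p(fares)

print(pd.DataFrame({'fare': fares, 'log': log_fare, 'log1p': log1p_fare}))
```

One thing to keep in mind with the notebook's rule: any fare between 0 and 1 would map to a negative number, while a fare of exactly 0 maps to 0, so the ordering of very small fares is not preserved. `np.log1p` avoids this at the cost of slightly different values.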


## Categorical Encoding

Next, let's handle our categorical variables. We'll one-hot encode the Embarked and Pclass columns and encode the Sex column as a binary 0/1 value.

In [54]:
# One-hot encoding for Embarked and Pclass
df = pd.get_dummies(df, columns=['Embarked', 'Pclass'])

# Binary encoding for Sex
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})
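
The two encodings can be tried on a toy frame first. This is a minimal sketch with made-up rows; `toy` and `encoded` are illustrative names. It passes `dtype=int` to `pd.get_dummies` because pandas 2.0 and later return boolean dummy columns by default, whereas the output shown in this notebook uses `uint8`; an explicit `dtype` keeps the result 0/1 either way.

```python
import pandas as pd

# Toy rows (made up for illustration)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'S'],
    'Pclass': [3, 1, 3],
})

# One indicator column per category, named <column>_<value>
encoded = pd.get_dummies(toy, columns=['Embarked', 'Pclass'], dtype=int)

# .map leaves NaN for any label missing from the dictionary, so verify none slipped through
encoded['Sex'] = encoded['Sex'].map({'male': 0, 'female': 1})
assert encoded['Sex'].notna().all()

print(encoded)
```

Only categories present in the data get a column, so a frame encoded separately (for example a test set) may end up with a different set of columns and need realigning.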

We also extract each passenger's title from the Name column, since titles may indicate social status.

In [55]:
# Extract titles from name
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df = df.drop(['Name'], axis=1)

# One-hot encoding for Title
df = pd.get_dummies(df, columns=['Title'])
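
To see what the title pattern captures, it can be run on a few of the names shown earlier. The sketch below also includes a hypothetical extra step that the notebook does not perform: folding infrequent titles into a single `'Rare'` label before one-hot encoding, which is one way to limit the number of dummy columns. The names `min_count` and `grouped` and the cutoff of 2 are assumptions for illustration.

```python
import pandas as pd

# Names taken from the sample rows shown at the top of the notebook
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
    'Allen, Mr. William Henry',
])

# A space, then letters, then a literal dot: captures 'Mr', 'Miss', 'Mrs', ...
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())

# Hypothetical step: replace titles seen fewer than min_count times with 'Rare'
min_count = 2
counts = titles.value_counts()
grouped = titles.where(titles.map(counts) >= min_count, 'Rare')
print(grouped.tolist())
```

On the full dataset a larger cutoff would be more sensible; the value here is small only because the sample has four rows.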

The Ticket field holds each passenger's ticket number. At first glance it looks like an arbitrary string with little information, but a closer look suggests a few features that could be extracted:

- Ticket Prefix: Some tickets carry a prefix that may denote a classification, possibly related to the cabin, embarkation point, or passenger type. We can extract these prefixes and check whether they add value to the model.
- Shared Tickets: Passengers traveling together often have the same ticket number, which could yield a feature representing groups or families traveling together.
- Ticket Length: The length of the ticket number (including the prefix) could also hold some information.
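
Only the prefix idea is implemented in this notebook. As a rough sketch of the other two, the shared-ticket and ticket-length features could be derived as follows, before the Ticket column is dropped. The column names `TicketGroupSize` and `TicketLength` are hypothetical, and the toy frame reuses ticket strings from the sample rows with repeats added for illustration.

```python
import pandas as pd

# Ticket strings from the sample rows, with duplicates added for illustration
toy = pd.DataFrame({
    'Ticket': ['A/5 21171', 'PC 17599', '113803', '113803', 'PC 17599', '373450'],
})

# Number of passengers sharing each ticket string
toy['TicketGroupSize'] = toy.groupby('Ticket')['Ticket'].transform('count')

# Character length of the ticket string, prefix and space included
toy['TicketLength'] = toy['Ticket'].str.len()

print(toy)
```

Group sizes computed this way only count passengers present in the frame, so a group split across training and test files would be undercounted.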



In [56]:
# Extracting ticket prefix
df['Ticket_Prefix'] = df['Ticket'].apply(lambda x: x.split()[0] if not x.split()[0].isdigit() else 'NoPrefix')
df = df.drop(['Ticket'], axis=1)

# One-hot encoding for Ticket_Prefix
df = pd.get_dummies(df, columns=['Ticket_Prefix'])
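
The resulting columns include near-identical prefixes such as `SOTON/O.Q.` and `SOTON/OQ`, or `W./C.` and `W/C`. Assuming these are spelling variants of the same prefix (which the data alone does not confirm), they could be merged by normalizing punctuation before encoding. The helper `ticket_prefix` below is a hypothetical sketch, and the ticket numbers in the example are made up.

```python
import pandas as pd

def ticket_prefix(ticket: str) -> str:
    """Return a normalized prefix, or 'NoPrefix' for purely numeric tickets."""
    first = ticket.split()[0]
    if first.isdigit():
        return 'NoPrefix'
    # Drop dots and slashes and upper-case so spelling variants collapse
    return first.replace('.', '').replace('/', '').upper()

# Made-up ticket strings using prefixes that appear in the encoded columns
tickets = pd.Series(['SOTON/O.Q. 1234', 'SOTON/OQ 5678', 'W./C. 1111', 'W/C 2222', '113803'])
prefixes = tickets.apply(ticket_prefix)
print(prefixes.tolist())
```

Merging variants reduces the number of sparse dummy columns, but it also discards any real distinction between the spellings, so it is worth comparing model performance with and without it.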


Let's take a look at our processed dataframe.

In [57]:
df.head()

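| | PassengerId | Survived | Sex | Age | Fare | FamilySize | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | ... | Ticket_Prefix_SOTON/O.Q. | Ticket_Prefix_SOTON/O2 | Ticket_Prefix_SOTON/OQ | Ticket_Prefix_STON/O | Ticket_Prefix_STON/O2. | Ticket_Prefix_SW/PP | Ticket_Prefix_W./C. | Ticket_Prefix_W.E.P. | Ticket_Prefix_W/C | Ticket_Prefix_WE/P |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 22.0 | 1.981001 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 1 | 1 | 38.0 | 4.266662 | 1 | 1 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | 1 | 1 | 26.0 | 2.070022 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 1 | 1 | 35.0 | 3.972177 | 1 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 0 | 0 | 35.0 | 2.085672 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |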


In [58]:
df.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 73 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   PassengerId               891 non-null    int64  
 1   Survived                  891 non-null    int64  
 2   Sex                       891 non-null    int64  
 3   Age                       891 non-null    float64
 4   Fare                      891 non-null    float64
 5   FamilySize                891 non-null    int64  
 6   Embarked_C                891 non-null    uint8  
 7   Embarked_Q                891 non-null    uint8  
 8   Embarked_S                891 non-null    uint8  
 9   Pclass_1                  891 non-null    uint8  
 10  Pclass_2                  891 non-null    uint8  
 11  Pclass_3                  891 non-null    uint8  
 12  Title_Capt                891 non-null    uint8  
 13  Title_Col                 891 non-null    uint8  
 14  Title_Coun

In [59]:
df.to_csv('../data/processed/train_processed.csv', index=False)

Our data is now ready for model training!
