In [77]:
# Load all of the libraries
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, StandardScaler
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import StratifiedKFold

import string
import warnings
warnings.filterwarnings('ignore')

SEED = 42

In [78]:
df_raw = pd.read_csv(r'../Dataset/titanic.csv')

In [79]:
print('Titanic data raw X Shape = {}'.format(df_raw.shape))
print('Titanic data raw y Shape = {}\n'.format(df_raw['Survived'].shape[0]))

Titanic data raw X Shape = (891, 12)
Titanic data raw y Shape = 891



In [80]:
# Preview train dataset
df_raw.sample(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
813,814,0,3,"Andersson, Miss. Ebba Iris Alfrida",female,6.0,4,2,347082,31.275,,S
865,866,1,2,"Bystrom, Mrs. (Karolina)",female,42.0,0,0,236852,13.0,,S
642,643,0,3,"Skoog, Miss. Margit Elizabeth",female,2.0,3,2,347088,27.9,,S
155,156,0,1,"Williams, Mr. Charles Duane",male,51.0,0,1,PC 17597,61.3792,,C
232,233,0,2,"Sjostedt, Mr. Ernst Adolf",male,59.0,0,0,237442,13.5,,S


### **1.1 Overview**
* `PassengerId` is the unique id of the row and it does not have any effect on target
* `Survived` is the target variable we are trying to predict (**0** or **1**):
    - **1 = Survived**
    - **0 = Not Survived**
* `Pclass` (Passenger Class) is the socio-economic status of the passenger and it is a categorical ordinal feature which has **3** unique values (**1**,  **2** or **3**):
    - **1 = Upper Class**
    - **2 = Middle Class**
    - **3 = Lower Class**
* `Name`, `Sex` and `Age` are self-explanatory
* `SibSp` is the number of siblings and spouses the passenger has aboard
* `Parch` is the number of parents and children the passenger has aboard
* `Ticket` is the ticket number of the passenger
* `Fare` is the passenger fare
* `Cabin` is the cabin number of the passenger
* `Embarked` is port of embarkation and it is a categorical feature which has **3** unique values (**C**, **Q** or **S**):
    - **C = Cherbourg**
    - **Q = Queenstown**
    - **S = Southampton**

In [81]:
df_raw.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


**The info() output shows 891 rows in total. Three columns contain missing values: Age (714 non-null), Cabin (204 non-null), and Embarked (889 non-null). Survived, the target, is complete.**

### 2. Feature Engineering

### **2.1 Data Cleaning and Preprocessing**

#### **2.1.1 Missing values identification and treatment**

In [82]:
# Display the number of missing values in each column of the dataset
def display_missing(df):    
    for col in df.columns.tolist():          
        print('{} column missing values: {}'.format(col, df[col].isnull().sum()))
    print('\n')
    

display_missing(df_raw)

PassengerId column missing values: 0
Survived column missing values: 0
Pclass column missing values: 0
Name column missing values: 0
Sex column missing values: 0
Age column missing values: 177
SibSp column missing values: 0
Parch column missing values: 0
Ticket column missing values: 0
Fare column missing values: 0
Cabin column missing values: 687
Embarked column missing values: 2




#### Three columns contain missing values, and we will address each of them individually, filling the missing values using relevant relationships below.
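
The per-column counts above can also be produced in one vectorized step, together with the share of rows affected. The following is a minimal sketch on a toy frame; the helper name `missing_summary` and the toy values are illustrative assumptions, not part of the lab.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values, for columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with gaps, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})
summary = missing_summary(toy)
print(summary)
```

Applied to `df_raw`, the counts printed above correspond to roughly 77.1% missing for Cabin (687 of 891), 19.9% for Age (177 of 891), and 0.2% for Embarked (2 of 891).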

#### A. Age

In [83]:
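# Check the pairwise correlations among the numeric columns
# (numeric_only=True is needed on pandas 2.0+, because df_raw still contains string columns)
df_raw_corr = df_raw.corr(numeric_only=True).abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()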

In [84]:
df_raw_corr

Unnamed: 0,level_0,level_1,0
0,PassengerId,PassengerId,1.0
1,Survived,Survived,1.0
2,Parch,Parch,1.0
3,SibSp,SibSp,1.0
4,Pclass,Pclass,1.0
5,Age,Age,1.0
6,Fare,Fare,1.0
7,Fare,Pclass,0.5495
8,Pclass,Fare,0.5495
9,SibSp,Parch,0.414838


According to the task specifications, we may either fill in or drop the records that lack Age. Dropping is not attractive: 177 of the 891 rows (about 20%) have no Age, and removing them would discard a fifth of an already small training set. The gaps can instead be filled with the mean, the median, or another value chosen from domain knowledge. Here we use the median, computed within each Pclass group rather than over the whole column. Pclass has the strongest correlation with Age among the numeric columns (an absolute coefficient of 0.369226, which is moderate rather than high), and grouping ages by passenger class is also easier to justify than grouping by the other features.
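
To make the difference between a group-wise and a global median concrete, here is a small sketch on made-up data. The toy values and the column names `Age_filled` and `Age_global` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data (made up): first-class passengers are older than third-class ones
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Median of the observed ages within each class, broadcast back to every row
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the gaps; observed ages are left untouched
toy["Age_filled"] = toy["Age"].fillna(class_median)

# For comparison: a single global median ignores the class difference
toy["Age_global"] = toy["Age"].fillna(toy["Age"].median())
print(toy)
```

In this toy frame the missing first-class age becomes 45 and the missing third-class age becomes 22, whereas the global median would assign 32 to both.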

In [85]:
df_raw_corr = df_raw.corr(numeric_only=True).abs().unstack().sort_values(kind="quicksort", ascending=False).reset_index()
df_raw_corr.rename(columns={"level_0": "Feature 1", "level_1": "Feature 2", 0: 'Correlation Coefficient'}, inplace=True)
df_raw_corr[df_raw_corr['Feature 1'] == 'Age']

Unnamed: 0,Feature 1,Feature 2,Correlation Coefficient
5,Age,Age,1.0
12,Age,Pclass,0.369226
16,Age,SibSp,0.308247
21,Age,Parch,0.189119
26,Age,Fare,0.096067
32,Age,Survived,0.077221
36,Age,PassengerId,0.036847


In [86]:
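# Fill the missing Age values with the median age of each passenger's Pclass group
# (transform keeps the original row index, so the result aligns with df_raw on any pandas version)
df_raw['Age'] = df_raw['Age'].fillna(df_raw.groupby('Pclass')['Age'].transform('median'))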

In [87]:
# Confirm that no missing values remain in Age
df_raw['Age'].isna().sum()

0

#### B. Cabin
##### We exclude the Cabin column because about 77% of its values (687 of 891) are missing; with so little observed data, it would add more noise than signal to the model.

In [88]:
df_raw.drop(columns=['Cabin'], inplace=True)

#### C. Embarked

In [89]:
# Inspect the two passengers with a missing Embarked value and look for a relationship between them
df_raw[df_raw['Embarked'].isnull()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,


Embarked is a categorical feature with only 2 missing values in the whole data set. Both passengers are female, upper class, and share the same ticket number, which suggests that they travelled together and embarked from the same port. The mode Embarked value for an upper class female passenger is **C (Cherbourg)**, so we fill **C** into the Embarked column for both passengers.
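
The choice of fill value rests on the most frequent port within a subgroup, which is easy to check directly. A minimal sketch follows; the helper `most_common_port` and the toy rows are hypothetical and only illustrate the lookup.

```python
import pandas as pd

def most_common_port(df: pd.DataFrame, pclass: int, sex: str) -> str:
    """Most frequent Embarked value among passengers of the given class and sex."""
    mask = (df["Pclass"] == pclass) & (df["Sex"] == sex)
    # mode() ignores NaN; on a tie it returns the values sorted, so iloc[0] picks the first
    return df.loc[mask, "Embarked"].mode().iloc[0]

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 1, 3, 3],
    "Sex": ["female", "female", "female", "male", "female", "male"],
    "Embarked": ["C", "C", "S", "S", "Q", "S"],
})
port = most_common_port(toy, pclass=1, sex="female")
print(port)
```

Running `most_common_port(df_raw, 1, "female")` before the fill, or inspecting `value_counts()` on the same subset, is a quick way to confirm the fill value and to see how close the runner-up port is.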

In [90]:
df_raw['Embarked'] = df_raw['Embarked'].fillna('C')

### 2.2 Categorical feature encoding

Most scikit-learn estimators, including the RandomForestClassifier imported above, require numerical input. Categorical features such as Sex and Embarked therefore have to be converted into numerical representations.
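
One-hot encoding replaces each categorical column with one indicator column per category. As a quick illustration of the idea, here is a sketch using `pd.get_dummies` on made-up rows; this is an alternative shown for intuition and is not the encoder used in this lab.

```python
import pandas as pd

# Toy data (made up for illustration)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# One indicator column per category; exactly one of them is 1.0 per original column
dummies = pd.get_dummies(toy, columns=["Sex", "Embarked"], dtype=float)
print(dummies)
```

`get_dummies` derives the categories from whatever data it is given, so the output columns can change between data sets. A fitted `OneHotEncoder` with an explicit `categories` list keeps the same columns in the same order when it is reused on new data.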

In [91]:
# Encode categorical features (e.g., Sex, Embarked)
categorical_features = ["Sex", "Embarked"]
categories = [["male", "female"], ["C", "Q", "S"]]

# sparse_output replaces the deprecated sparse argument (scikit-learn 1.2 and later)
encoder = OneHotEncoder(sparse_output=False, categories=categories)

# Fit and transform the categorical features
encoded_features = encoder.fit_transform(df_raw[categorical_features])

# Get the column names for the transformed features
column_names = encoder.get_feature_names_out(categorical_features)

# Create a DataFrame with the transformed features and appropriate column names
df_raw_encoded = pd.concat([df_raw.drop(categorical_features, axis=1), pd.DataFrame(encoded_features, columns=column_names)], axis=1)

In [92]:
df_raw_encoded.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Sex_male,Sex_female,Embarked_C,Embarked_Q,Embarked_S
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,1.0,0.0,0.0,0.0,1.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,0.0,1.0,1.0,0.0,0.0
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,0.0,1.0,0.0,0.0,1.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,0.0,1.0,0.0,0.0,1.0
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,1.0,0.0,0.0,0.0,1.0


### 2.3 Deriving New Features:

**Combining existing features can create more informative features.**
- Family size: This captures the total number of family members traveling together.
- Fare per person: This approximates the ticket cost per individual by spreading the fare over the family members travelling together.

In [94]:
# Family size: SibSp (siblings/spouses aboard) + Parch (parents/children aboard) + 1 for the passenger themselves
df_raw_encoded["FamilySize"] = df_raw_encoded["Parch"] + df_raw_encoded["SibSp"] + 1

# Fare per person: Divide Fare by FamilySize
df_raw_encoded["Fare_Per_Person"] = df_raw_encoded["Fare"] / df_raw_encoded["FamilySize"]

In [95]:
df_raw_encoded.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Age,SibSp,Parch,Ticket,Fare,Sex_male,Sex_female,Embarked_C,Embarked_Q,Embarked_S,FamilySize,Fare_Per_Person
0,1,0,3,"Braund, Mr. Owen Harris",22.0,1,0,A/5 21171,7.25,1.0,0.0,0.0,0.0,1.0,2,3.625
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0,1,0,PC 17599,71.2833,0.0,1.0,1.0,0.0,0.0,2,35.64165
2,3,1,3,"Heikkinen, Miss. Laina",26.0,0,0,STON/O2. 3101282,7.925,0.0,1.0,0.0,0.0,1.0,1,7.925
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0,1,0,113803,53.1,0.0,1.0,0.0,0.0,1.0,2,26.55
4,5,0,3,"Allen, Mr. William Henry",35.0,0,0,373450,8.05,1.0,0.0,0.0,0.0,1.0,1,8.05


### 2.4 Feature Scaling (if necessary)

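**Scaling features can improve performance for distance-based and gradient-based algorithms (for example k-nearest neighbours, SVMs, and logistic regression). Tree-based models such as random forests are largely insensitive to it.**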
- We use StandardScaler to standardize numerical features (mean=0, standard deviation=1).
- This ensures all features are on a similar scale.
- We combine the scaled features with the rest of the processed data.
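
Standardization maps each value x to z = (x - mean) / std, computed per column. The sketch below reproduces that arithmetic with NumPy on the five `Fare` values shown in the `head()` output earlier. Because the mean and standard deviation here come from only five rows, the resulting numbers will not match the scaled values of the full data set.

```python
import numpy as np

# Fare values of the first five passengers (from the head() output above)
fares = np.array([7.25, 71.2833, 7.925, 53.1, 8.05])

mean = fares.mean()
std = fares.std()  # NumPy default ddof=0 (population std), the same convention StandardScaler uses

z = (fares - mean) / std
print(z)
print(z.mean(), z.std())
```

After the transformation the column has mean 0 and standard deviation 1 up to floating-point error, and the ordering of the values is unchanged.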

In [96]:
# Standardize numerical features
scaler = StandardScaler()
numerical_features = ["Age", "Fare", "Fare_Per_Person"]
scaled_values = scaler.fit_transform(df_raw_encoded[numerical_features])
scaled_df = pd.DataFrame(scaled_values, columns=[f"{feature}_scaled" for feature in numerical_features])

# Combine scaled features with the rest of the data and delete previously used columns for ["Age", "Fare", "Fare_Per_Person"]
df_transformed = pd.concat([df_raw_encoded.drop(numerical_features, axis=1), scaled_df], axis=1)


In [97]:
df_transformed

Unnamed: 0,PassengerId,Survived,Pclass,Name,SibSp,Parch,Ticket,Sex_male,Sex_female,Embarked_C,Embarked_Q,Embarked_S,FamilySize,Age_scaled,Fare_scaled,Fare_Per_Person_scaled
0,1,0,3,"Braund, Mr. Owen Harris",1,0,A/5 21171,1.0,0.0,0.0,0.0,1.0,2,-0.533834,-0.502445,-0.454798
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,0,PC 17599,0.0,1.0,1.0,0.0,0.0,2,0.674891,0.786845,0.438994
2,3,1,3,"Heikkinen, Miss. Laina",0,0,STON/O2. 3101282,0.0,1.0,0.0,0.0,1.0,1,-0.231653,-0.488854,-0.334757
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,0,113803,0.0,1.0,0.0,0.0,1.0,2,0.448255,0.420730,0.185187
4,5,0,3,"Allen, Mr. William Henry",0,0,373450,1.0,0.0,0.0,0.0,1.0,1,0.448255,-0.486337,-0.331267
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",0,0,211536,1.0,0.0,0.0,0.0,1.0,1,-0.156107,-0.386671,-0.193081
887,888,1,1,"Graham, Miss. Margaret Edith",0,0,112053,0.0,1.0,0.0,0.0,1.0,1,-0.760469,-0.044381,0.281499
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",1,2,W./C. 6607,0.0,1.0,0.0,0.0,1.0,4,-0.382743,-0.176263,-0.392335
889,890,1,1,"Behr, Mr. Karl Howell",0,0,111369,1.0,0.0,1.0,0.0,0.0,1,-0.231653,-0.044381,0.281499


### 2.5 Feature Selection (Optional):

**This stage can be carried out once you've examined the significance of features in your selected machine learning model. We'll delve into this further in the upcoming chapters of the lab.**
- Techniques like correlation analysis or feature importance scores can help identify the most informative features for your specific task.

- Remember that feature engineering is an iterative process. You might need to revisit and refine your steps based on your analysis goals and the performance of your machine learning model.
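
As a first pass at correlation analysis, features can be ranked by the absolute value of their correlation with the target. The sketch below uses a small made-up frame, so the numbers carry no meaning for the real data. It also only captures linear, one-feature-at-a-time relationships, which is why model-based importance scores are a useful complement.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the Titanic set)
toy = pd.DataFrame({
    "Survived":   [0, 1, 1, 1, 0, 0, 1, 0],
    "Sex_female": [0, 1, 1, 1, 0, 0, 0, 0],
    "Pclass":     [3, 1, 3, 1, 3, 3, 2, 2],
    "FamilySize": [2, 2, 1, 2, 1, 1, 3, 1],
})

# Absolute Pearson correlation of every feature with the target, strongest first
ranking = (
    toy.corr()["Survived"]
    .drop("Survived")
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```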