# Aer Lingus Data and Analytics Pre-Interview Task - Titanic Dataset
Louise Anderson 20/01/2023

## Introduction
This notebook describes a machine learning classification approach for estimating the likelihood of passenger survival from the known passenger data.

In [88]:
#Imports
import pandas as pd
import plotly.express as px
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

In [2]:
#Load dataset
df = pd.read_csv('train.csv')
#Show first five rows
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Data Cleaning - Handling of null-values

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


There are 177 missing values for the variable 'Age', 687 missing values for the variable 'Cabin', and 2 missing values for the variable 'Embarked'.
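The missing-value counts quoted above can also be read off directly instead of being derived from `df.info()`. The following is a minimal sketch; the frame `toy` is a hypothetical stand-in for `train.csv` with made-up values, so only the pattern carries over.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
})

# Count missing values per column and express them as a share of all rows
missing_count = toy.isna().sum()
missing_share = toy.isna().mean()

print(pd.DataFrame({"missing": missing_count, "share": missing_share}))
```

Applied to the real dataframe, the same two calls would give the per-column counts and the fraction of rows affected, which makes the decision to drop or impute a column easier to justify.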

The missing values for 'Age' can be replaced with the average passenger age.

Given that 687 of the 891 values for 'Cabin' (about 77%) are missing, it is unlikely that meaningful insights can be inferred from this variable, so it will be removed.

The missing values for 'Embarked' can be replaced with the mode (the most frequent value), since this is a categorical variable and a median is not defined for it.

In [11]:
# replace 'Age' null values with average Age
df['Age'] = df['Age'].fillna(df['Age'].mean())

In [12]:
# remove variable 'Cabin' from dataframe
df = df.drop(columns='Cabin')

In [24]:
# find the most frequent value (mode) for 'Embarked'
df['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [25]:
# replace 'Embarked' null values with the most frequent value 'S'
df['Embarked'] = df['Embarked'].fillna('S')
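The fill value 'S' above is hard-coded after inspecting `value_counts()`. As a hedged alternative, the most frequent value could be looked up programmatically. The sketch below uses a hypothetical toy column (`embarked`) rather than the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column standing in for df['Embarked']
embarked = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# Series.mode() ignores NaN by default and returns a Series (ties are possible),
# so the first entry is taken as the fill value
most_frequent = embarked.mode()[0]
embarked_filled = embarked.fillna(most_frequent)

print(most_frequent)
print(embarked_filled.isna().sum())
```

This avoids a silent mismatch if the underlying data changes and 'S' is no longer the most common port.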

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(4)
memory usage: 76.7+ KB


The dataframe now contains 0 null values.

In [27]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | S |


## Exploratory Data Analysis

In [101]:
df['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64

### Survival rate per Passenger Class

In [53]:
df_Pclass = df.groupby(['Pclass']).agg(no_survived=('Survived','sum'),
                                       no_passengers=('Survived','count'),
                                      )
df_Pclass['survival_rate'] = df_Pclass['no_survived'] / df_Pclass['no_passengers']
df_Pclass = df_Pclass.reset_index()
df_Pclass

| | Pclass | no_survived | no_passengers | survival_rate |
|---|---|---|---|---|
| 0 | 1 | 136 | 216 | 0.62963 |
| 1 | 2 | 87 | 184 | 0.472826 |
| 2 | 3 | 119 | 491 | 0.242363 |


In [54]:
# Visualize Survival Rate of Passengers per Class
fig = px.bar(df_Pclass, x="Pclass", y='survival_rate', color='Pclass', title= "Survival Rate of Passengers Per Class")
fig.show()

From this bar chart it is clear that passenger class is strongly associated with survival rate: first-class passengers have a survival rate of 0.6296 versus 0.2424 for third-class passengers.
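Because 'Survived' is coded as 0/1, the survival rate of a group is simply the group mean, so the rate can be computed in the same aggregation step. A minimal sketch on a hypothetical toy frame (`toy`, made-up rows):

```python
import pandas as pd

# Hypothetical toy data; the real notebook uses the full training set
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# For a 0/1 column, sum / count is the same as the mean
by_class = toy.groupby("Pclass")["Survived"].agg(
    no_survived="sum", no_passengers="count", survival_rate="mean"
)
print(by_class)

# Sanity check of the first-class rate reported in the table above
first_class_rate = 136 / 216
print(round(first_class_rate, 5))
```

The last two lines only re-derive the first-class figure from the counts in the notebook's table.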

### Survival rate by Gender

In [55]:
df_Gender = df.groupby(['Sex']).agg(no_survived=('Survived','sum'),
                                    no_passengers=('Survived','count'),
                                    )
df_Gender['survival_rate'] = df_Gender['no_survived'] / df_Gender['no_passengers']
df_Gender = df_Gender.reset_index()
df_Gender

| | Sex | no_survived | no_passengers | survival_rate |
|---|---|---|---|---|
| 0 | female | 233 | 314 | 0.742038 |
| 1 | male | 109 | 577 | 0.188908 |


In [57]:
# Visualize Survival Rate of Passengers by Gender
fig = px.bar(df_Gender, x="Sex", y='survival_rate', color='Sex', title= "Survival Rate of Passengers By Gender")
fig.show()

Gender is strongly associated with survival rate, with 74.20% of females surviving vs 18.89% of males.

### Distribution of Passenger Age Range

In [64]:
fig = px.histogram(df, x="Age", color='Survived', pattern_shape="Sex")
fig.show()

There is a high number of passengers in their late twenties/early thirties. Part of this peak is likely an artefact of replacing the 177 missing ages with the mean age. Among the survivors in all age groups, most appear to be women. Unsurprisingly, most children survived regardless of gender. The chart suggests that age has an influence on survival rate.
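The visual impression from the histogram could be backed up by a survival rate per age band. The sketch below is one way to do this with `pd.cut`; the bin edges, the labels, and the toy rows are illustrative assumptions, not values taken from the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for df[['Age', 'Survived']]
toy = pd.DataFrame({
    "Age":      [2, 9, 15, 24, 29, 31, 45, 52, 63, 71],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0, 0, 0],
})

# Bin edges and labels are illustrative assumptions
bins = [0, 12, 18, 40, 60, 120]
labels = ["child", "teen", "adult", "middle-aged", "senior"]
toy["age_band"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Mean of a 0/1 column per band = survival rate per band
rate_by_band = toy.groupby("age_band", observed=False)["Survived"].mean()
print(rate_by_band)
```

With the real data, the rows whose age was imputed would all fall into a single band, which is worth keeping in mind when reading the result.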

### Correlation

In [66]:
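# numeric_only=True restricts the correlation to numeric columns;
# pandas 2.0+ raises an error on the text columns without it
df.corr(numeric_only=True)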





| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| PassengerId | 1.0 | -0.005007 | -0.035144 | 0.033207 | -0.057527 | -0.001652 | 0.012658 |
| Survived | -0.005007 | 1.0 | -0.338481 | -0.069809 | -0.035322 | 0.081629 | 0.257307 |
| Pclass | -0.035144 | -0.338481 | 1.0 | -0.331339 | 0.083081 | 0.018443 | -0.5495 |
| Age | 0.033207 | -0.069809 | -0.331339 | 1.0 | -0.232625 | -0.179191 | 0.091566 |
| SibSp | -0.057527 | -0.035322 | 0.083081 | -0.232625 | 1.0 | 0.414838 | 0.159651 |
| Parch | -0.001652 | 0.081629 | 0.018443 | -0.179191 | 0.414838 | 1.0 | 0.216225 |
| Fare | 0.012658 | 0.257307 | -0.5495 | 0.091566 | 0.159651 | 0.216225 | 1.0 |


Among the numerical variables, passenger class has the strongest correlation with the target variable 'Survived', a negative correlation of -0.3385.
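To make the ranking explicit, the 'Survived' row of the correlation table above can be sorted by absolute value, so that strong negative correlations are not overlooked. The values below are copied from that table; the variable names are only illustrative.

```python
import pandas as pd

# Correlations with 'Survived', copied from the correlation table above
corr_with_target = pd.Series({
    "PassengerId": -0.005007,
    "Pclass": -0.338481,
    "Age": -0.069809,
    "SibSp": -0.035322,
    "Parch": 0.081629,
    "Fare": 0.257307,
})

# Rank by absolute value, but keep the sign in the displayed result
order = corr_with_target.abs().sort_values(ascending=False).index
ranked = corr_with_target.reindex(order)
print(ranked)
```

Note that 'Sex' and 'Embarked' are absent here because they are still text columns at this point in the notebook.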

## Pre-processing

In [77]:
# Encode categorical variables using Label Encoder
label_encoder = LabelEncoder()
df['Sex'] = label_encoder.fit_transform(df['Sex'])
df['Embarked'] = label_encoder.fit_transform(df['Embarked'])

In [78]:
df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | 2 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 0 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | 2 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1 | 2 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.05 | 2 |
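The integer codes in the table above follow from `LabelEncoder` sorting the distinct labels before numbering them. The sketch below makes that mapping visible; `encoding_map` is a hypothetical helper introduced for illustration, and the input lists are toy values.

```python
from sklearn.preprocessing import LabelEncoder


def encoding_map(values):
    """Return {original label: integer code} as assigned by LabelEncoder."""
    encoder = LabelEncoder().fit(values)
    # classes_ holds the sorted distinct labels; the position is the code
    return {str(label): code for code, label in enumerate(encoder.classes_)}


sex_map = encoding_map(["male", "female", "female", "male"])
embarked_map = encoding_map(["S", "C", "S", "Q"])
print(sex_map)
print(embarked_map)
```

Using a separate encoder per column, as this helper does, keeps each column's `classes_` available for `inverse_transform`; a single encoder that is refitted only remembers the last column it saw.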


In [79]:
# Split the dataframe into train and test sets. PassengerId, Name, and Ticket are
# dropped as identifier-like columns; 'Survived' is removed from X because it is the target.
X = df.drop(columns=['PassengerId', 'Name', 'Survived', 'Ticket'])
y = df['Survived'].values
X_train, X_test, y_train, y_test = train_test_split(X, y)
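The split above uses the defaults of `train_test_split`, so the rows that land in each subset change from run to run. A hedged sketch of a repeatable, stratified variant follows; the seed value, the use of `stratify`, and the toy arrays are assumptions for illustration and are not what the notebook ran.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 891 rows with the same class counts as the notebook (549 / 342)
X_toy = np.arange(891).reshape(-1, 1)
y_toy = np.array([0] * 549 + [1] * 342)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.25,    # the default when neither test_size nor train_size is given
    random_state=1,    # illustrative seed so the split is repeatable
    stratify=y_toy,    # keep the class ratio similar in both subsets
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```

A fixed seed makes reported metrics reproducible, and stratification keeps the share of survivors comparable between the training and test sets.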

## ML Classification using RandomForest Classifier

In [82]:
# Create and Train Classifier
clf = RandomForestClassifier(criterion='gini',
                             n_estimators=700,
                             min_samples_split=10,
                             min_samples_leaf=1,
                             # 'auto' was deprecated in scikit-learn 1.1 and removed in 1.3;
                             # 'sqrt' keeps the same behaviour and is also the default
                             max_features='sqrt',
                             oob_score=True,
                             random_state=1,
                             n_jobs=-1)

clf.fit(X_train, y_train)



In [83]:
#Execute prediction
y_pred=clf.predict(X_test)
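The classifier is created with `oob_score=True`, but the out-of-bag estimate is never read. A minimal sketch of how it could be inspected is shown below; it uses synthetic data from `make_classification` and a smaller forest, both of which are assumptions made to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic features (illustrative only)
X_toy, y_toy = make_classification(n_samples=300, n_features=7, random_state=1)

forest = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
forest.fit(X_toy, y_toy)

# Accuracy estimated on the samples each tree did not see during bootstrapping
print("OOB accuracy:", forest.oob_score_)
```

On the notebook's model the equivalent attribute would be `clf.oob_score_`, which gives a validation-style estimate without touching the test set.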

In [85]:
#Display feature importance
# Take the feature names from the training frame instead of hard-coding them,
# so the labels cannot drift out of order with the columns
feature_imp = pd.Series(clf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_imp

Sex         0.343838
Fare        0.223527
Age         0.190790
Pclass      0.112021
SibSp       0.054755
Embarked    0.038562
Parch       0.036507
dtype: float64

'Sex' has the strongest feature importance followed by 'Fare' and 'Age'.
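The ranking above is the impurity-based importance that the forest computes on the training data, which can favour features with many distinct values such as 'Fare' and 'Age'. One possible cross-check is permutation importance on held-out data. The sketch below uses synthetic data and hypothetical feature names (`f0` to `f4`), so it illustrates the call pattern only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with hypothetical feature names
X_toy, y_toy = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=1
)
names = [f"f{i}" for i in range(5)]

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1)
forest = RandomForestClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out set and measure the score drop
result = permutation_importance(forest, X_te, y_te, n_repeats=10, random_state=1)
perm_imp = pd.Series(result.importances_mean, index=names).sort_values(ascending=False)
print(perm_imp)
```

If both rankings agree on the real data, that would strengthen the conclusion about 'Sex', 'Fare', and 'Age'.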

## Results

In [89]:
# Measure the accuracy of the model
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8340807174887892
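An accuracy figure is easier to judge against a naive baseline. The sketch below compares it with always predicting the majority class, using the class counts from `df['Survived'].value_counts()` earlier in the notebook. The baseline is computed on the full dataset, so the figure for the actual test split may differ slightly.

```python
# Class counts from df['Survived'].value_counts() earlier in the notebook
not_survived, survived = 549, 342

# Accuracy of a model that always predicts the majority class
baseline_accuracy = max(not_survived, survived) / (not_survived + survived)

# Accuracy printed by the notebook
model_accuracy = 0.8340807174887892

print(round(baseline_accuracy, 4))
print(round(model_accuracy - baseline_accuracy, 4))
```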


In [96]:
# Rows are the true classes, columns are the predicted classes
cm = metrics.confusion_matrix(y_test, y_pred)
fig = px.imshow(cm)
fig.show()

In [97]:
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.85      0.90      0.87       140
           1       0.81      0.72      0.76        83

    accuracy                           0.83       223
   macro avg       0.83      0.81      0.82       223
weighted avg       0.83      0.83      0.83       223
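The figures in the report all derive from the confusion matrix. The notebook only plots the matrix, so the counts in the sketch below are an assumption: they are inferred from the rounded report (support of 140 and 83) together with the printed accuracy, and they reproduce those numbers.

```python
import numpy as np

# ASSUMPTION: counts inferred from the rounded classification report and the
# printed accuracy; the notebook itself does not print the matrix.
# Rows are true classes, columns are predicted classes (scikit-learn convention).
cm = np.array([[126, 14],
               [23, 60]])

recall = cm.diagonal() / cm.sum(axis=1)      # correct / all samples of the true class
precision = cm.diagonal() / cm.sum(axis=0)   # correct / all samples predicted as the class
f1 = 2 * precision * recall / (precision + recall)
accuracy = cm.diagonal().sum() / cm.sum()

print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
print(accuracy)
```

Read this way, 23 of the 83 survivors in the test set would be misclassified as non-survivors, which is where the lower recall for class 1 comes from.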



Recall is higher for passengers who did not survive (0.90) than for passengers who did survive (0.72). This may be because the target variable is imbalanced, with 549 passengers not surviving vs 342 surviving. The model's performance on survivors could be improved by training on a more balanced dataset, for example through resampling or class weighting.
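One way to counter the imbalance without changing the data is to weight the classes inversely to their frequency. The sketch below computes such weights for the class counts quoted above. It uses the full-dataset counts rather than the training split, which is a simplification.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels with the same class counts as the notebook (549 not survived, 342 survived)
y_toy = np.array([0] * 549 + [1] * 342)

# 'balanced' weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_toy)
print(dict(zip([0, 1], np.round(weights, 4))))
```

`RandomForestClassifier` accepts `class_weight='balanced'` and applies the same formula internally. Whether it actually improves recall for survivors here would need to be checked on the real data.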

Further improvement could also be made by performing hyperparameter tuning using grid search, further feature engineering, and more study of feature importance.
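A grid search of the kind mentioned above could look like the following minimal sketch. The parameter grid, the number of folds, and the synthetic data are all illustrative assumptions and not tuned values for the Titanic dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and labels
X_toy, y_toy = make_classification(n_samples=300, n_features=7, random_state=1)

# Small illustrative grid; a real search would cover more values
param_grid = {
    "n_estimators": [50, 100],
    "min_samples_split": [2, 10],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

In the notebook this would be fitted on `X_train` and `y_train` only, leaving the test set untouched for the final evaluation.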

### Thank You