### **Introduction**

Kaggle's **Titanic - Machine Learning from Disaster** is a classic introductory problem for getting familiar with the fundamentals of machine learning.
In this notebook...


### **Where to start?**

https://www.kaggle.com/code/alexisbcook/titanic-tutorial/notebook

Alexis provides a brief introduction to making a submission to Kaggle, along with sample code for this challenge. Her example uses a random forest classifier as the model.

In [164]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
import os

PATH = "../input/titanic/" # file path to the datasets

for dirname, _, filenames in os.walk(PATH):  # walk the PATH variable, not the literal string 'PATH'
    for filename in filenames:
        print(os.path.join(dirname, filename))

### Load Datasets

In [165]:
train_data = pd.read_csv(PATH + "train.csv")
test_data = pd.read_csv(PATH + "test.csv")

In [166]:

y = train_data["Survived"]

features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

model_orignial = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model_orignial.fit(X, y)
predictions = model_orignial.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")

Your submission was successfully saved!


### Results
Submitting this model resulted in a score of 0.77511.

### **Contribution**
<!-- We can start with this as our baseline and change a few things to see if we get a better result. The first and easiest thing to do is tune the hyperparameters for the Random Forest Classifier and compare the results with the original submission. -->
### Part 1
We can start by looking for entries in the dataset to drop or modify in order to improve the model's performance. `PassengerId` holds a unique, non-null value for every row, so there are no duplicate passengers to drop. Other columns do have missing values, which we can explore.
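The uniqueness and missing-value checks described above can be sketched on a small made-up frame. Here `toy` and its values are hypothetical stand-ins for `train_data`, not rows from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_data; the values are invented for illustration.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
})

# True when every ID appears exactly once and none is missing.
ids_ok = bool(toy["PassengerId"].is_unique and toy["PassengerId"].notna().all())

# Number of rows that are exact copies of an earlier row (all columns equal).
n_duplicates = int(toy.duplicated().sum())

# Missing-value count per column.
missing_counts = toy.isna().sum()

print(ids_ok, n_duplicates)
print(missing_counts)
```

Running the same three checks against `train_data` and `test_data` would be one way to confirm the statement about `PassengerId` before moving on to the other columns.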

In [167]:
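# Features only: the training set without its label, plus the test set.
data_list = [train_data.drop("Survived", axis=1), test_data]
# Encode Sex numerically: male -> 0, anything else -> 1.
for i, dl in enumerate(data_list):
    data_list[i].Sex = dl.Sex.apply(lambda sex: 0 if sex == "male" else 1)
# Stack both sets so missing values are counted across all passengers.
all_data = pd.concat(data_list)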

missing_vals = [all_data[col].isnull().sum() for col in all_data.columns.to_list()]
labels = all_data.columns.to_list()
ser = pd.Series(data=missing_vals, index=labels, name="by amount")
ser_missing = ser[ser > 0]

percentages = ser_missing.apply(lambda x: "%.2f " % (x * 100 / all_data.shape[0]))
percentages.name = "by percent"
print(f"Total number of rows: {all_data.shape[0]}\n\n{ser_missing} \
        \n\n{percentages}")
plt.show()


Total number of rows: 1309

Age          263
Fare           1
Cabin       1014
Embarked       2
Name: by amount, dtype: int64         

Age         20.09 
Fare         0.08 
Cabin       77.46 
Embarked     0.15 
Name: by percent, dtype: object


Since the number of missing values in the `Fare` and `Embarked` columns is negligible, the simplest option is to drop those rows from the dataset. However, a considerable number of values are missing from the `Age` and `Cabin` columns. A naive approach for `Age` would be to fill the gaps with the column's mean or median. A better approach is to look at the relationships between `Age` and the other columns and then choose the fill value accordingly.
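For reference, the naive fill mentioned above could look like the following minimal sketch. The `ages` series is hypothetical, and the median is used here only as one of the two options named in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical ages; NaN marks a missing entry.
ages = pd.Series([22.0, np.nan, 30.0, 40.0, np.nan], name="Age")

# Median of the observed values only (NaN is skipped by default).
median_age = ages.median()

# Every gap receives the same value, regardless of the passenger's other attributes.
ages_filled = ages.fillna(median_age)

print(median_age)
print(ages_filled.tolist())
```

The limitation is visible even in this toy case: each missing passenger gets the same age, whatever their class or family size.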

In [175]:
for i, dl in enumerate(data_list):
    data_list[i] = dl[dl["Fare"].notna() & dl["Embarked"].notna()]
all_data = all_data[all_data["Fare"].notna() & all_data["Embarked"].notna()]

# corr = corr.iloc[1:, 1:]
# corr.style.background_gradient(cmap='coolwarm', axis=None).format(precision=2)
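# numeric_only=True (pandas >= 1.5) skips text columns such as Name and Ticket,
# which would otherwise raise an error on pandas >= 2.0.
corr = all_data.corr(numeric_only=True).loc["Age"]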
corr

PassengerId    0.026757
Pclass        -0.409082
Sex           -0.066006
Age            1.000000
SibSp         -0.242345
Parch         -0.149311
Fare           0.177206
Name: Age, dtype: float64

`Pclass` has the strongest correlation with `Age` (about -0.41), while `Sex` has virtually none (about -0.07). This suggests estimating missing ages separately for each `Pclass`, which should suit the missing entries better than a one-size-fits-all approximation.
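One possible way to apply this is a per-class fill, sketched below on invented data. The frame `toy`, the column `Age_filled`, and the choice of the median as the per-class statistic are all assumptions made for illustration; the text above does not fix which statistic to use.

```python
import numpy as np
import pandas as pd

# Hypothetical passengers; one age is missing in each class.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [40.0, 50.0, np.nan, 20.0, 24.0, np.nan],
})

# Median age within each class, broadcast back to the original row order.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Fill only the gaps; observed ages are left untouched.
toy["Age_filled"] = toy["Age"].fillna(class_median)

print(toy)
```

In this toy frame the first-class gap is filled from the first-class ages and the third-class gap from the third-class ages, so the two missing passengers receive different estimates instead of one shared value.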