# Titanic Survival Prediction

The following cell includes the base dependencies for the rest of this notebook.

We make heavy use of pandas DataFrames as well as NumPy's algorithms.

In [1]:
import pandas as pd
import numpy as np

We start with data preprocessing, since the raw data contains missing values and free-text columns that statistical and
machine learning models cannot consume directly.

During preprocessing, we discard the `Name`, `Cabin`, and `Ticket` columns since we currently have no way to
encode and use them when training our models and predicting results.

We also fill *NaN* values in `Age` and `Fare` with the mean of the corresponding column in the training data, for both
the training and the test set.  This is possible since these two columns are continuous, and we assume that using the
mean will not have a significant impact on the prediction.

In [2]:
train_df = pd.read_csv("./train.csv")
test_df = pd.read_csv("./test.csv")

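# np.mean on a Series delegates to Series.mean, which skips NaN by default,
# so these are the means of the observed (non-missing) training values.
train_mean_age = np.mean(train_df["Age"])
train_mean_fare = np.mean(train_df["Fare"])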


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy(deep=True)

    df.fillna({"Age": train_mean_age, "Fare": train_mean_fare}, inplace=True)

    return df.drop(columns=["Name", "Cabin", "Ticket"])


train_df = preprocess(train_df)
test_df = preprocess(test_df)

train_df

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,male,22.000000,1,0,7.2500,S
1,2,1,1,female,38.000000,1,0,71.2833,C
2,3,1,3,female,26.000000,0,0,7.9250,S
3,4,1,1,female,35.000000,1,0,53.1000,S
4,5,0,3,male,35.000000,0,0,8.0500,S
...,...,...,...,...,...,...,...,...,...
886,887,0,2,male,27.000000,0,0,13.0000,S
887,888,1,1,female,19.000000,0,0,30.0000,S
888,889,0,3,female,29.699118,1,2,23.4500,S
889,890,1,1,male,26.000000,0,0,30.0000,C
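
As a minimal sketch of the imputation rule used above, the following toy example fills missing values in two small frames with means computed from the training frame only. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins, not rows from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train.csv / test.csv.
toy_train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 20.0, 30.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0], "Fare": [np.nan, 5.0]})

# Series.mean() skips NaN by default, so only observed training values count.
fill_values = {"Age": toy_train["Age"].mean(), "Fare": toy_train["Fare"].mean()}

# The same training-derived values fill both frames, so no statistic
# is ever computed from the test set.
toy_train_filled = toy_train.fillna(fill_values)
toy_test_filled = toy_test.fillna(fill_values)

print(toy_test_filled)
```

Reusing the training means for the test frame keeps the two sets on the same footing: the test data never influences the values used to fill its own gaps.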


Now that our data is preprocessed, we have to transform it by normalizing the rest of the columns.

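A z-score is applied to `Age`, `Fare`, `SibSp`, and `Parch` since these columns are numeric (`SibSp` and `Parch` are
integer counts, which we treat as continuous here). A one-hot encoding is applied to `Pclass`, `Sex`, and `Embarked`
since these values are categorical. Missing `Embarked` values are not imputed, so the encoder gives them their own
`Embarked_nan` column.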

In [3]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

transformer = make_column_transformer(
    (StandardScaler(), ["Age", "Fare", "SibSp", "Parch"]),
    (OneHotEncoder(sparse_output=False), ["Pclass", "Sex", "Embarked"]),
).set_output(transform="pandas")

train_x = transformer.fit_transform(train_df)
train_y = train_df["Survived"]

test_x = transformer.transform(test_df)

train_x

Unnamed: 0,standardscaler__Age,standardscaler__Fare,standardscaler__SibSp,standardscaler__Parch,onehotencoder__Pclass_1,onehotencoder__Pclass_2,onehotencoder__Pclass_3,onehotencoder__Sex_female,onehotencoder__Sex_male,onehotencoder__Embarked_C,onehotencoder__Embarked_Q,onehotencoder__Embarked_S,onehotencoder__Embarked_nan
0,-0.592481,-0.502445,0.432793,-0.473674,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
1,0.638789,0.786845,0.432793,-0.473674,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0
2,-0.284663,-0.488854,-0.474545,-0.473674,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
3,0.407926,0.420730,0.432793,-0.473674,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
4,0.407926,-0.486337,-0.474545,-0.473674,0.0,0.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,-0.207709,-0.386671,-0.474545,-0.473674,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0
887,-0.823344,-0.044381,-0.474545,-0.473674,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
888,0.000000,-0.176263,0.432793,2.008933,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
889,-0.284663,-0.044381,-0.474545,-0.473674,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0


For our first model, we use a simple logistic regression from *sklearn*.

A logistic regression is an inexpensive linear model used in classification problems.  While it does not give us the
best possible answer to our problem, it does allow us to see if there is something to learn from the provided data.
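
To make "linear model" concrete, the sketch below fits a logistic regression on synthetic data (generated with `make_classification`, an assumption made only so the example is self-contained) and reproduces its predictions by hand: a weighted sum of the features, passed through a sigmoid, then thresholded at 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data; a stand-in for the real features.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
clf = LogisticRegression().fit(X, y)

# Linear score for each row: w . x + b
scores = X @ clf.coef_.ravel() + clf.intercept_[0]

# The sigmoid maps each score to a probability for the positive class.
probs = 1.0 / (1.0 + np.exp(-scores))

# A probability above 0.5 is the same as a score above 0.
manual_pred = (scores > 0).astype(int)

print(np.allclose(probs, clf.predict_proba(X)[:, 1]))
print(np.array_equal(manual_pred, clf.predict(X)))
```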

In [11]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, accuracy_score, f1_score, recall_score

model = LogisticRegression()
model.fit(train_x, train_y)

pred_y = model.predict(train_x)

print(f"precision: {precision_score(train_y, pred_y):.4f}")
print(f"accuracy: {accuracy_score(train_y, pred_y):.4f}")
print(f"f1: {f1_score(train_y, pred_y):.4f}")
print(f"recall: {recall_score(train_y, pred_y):.4f}")

precision: 0.7692
accuracy: 0.8047
f1: 0.7339
recall: 0.7018


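As seen in the output above, our logistic regression model already reaches decent precision and accuracy. Note that
these scores are computed on the same data the model was trained on, so they likely overestimate its performance on
unseen passengers.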

We can then use this model to predict survival on the test dataset:

In [6]:
test_y = model.predict(test_x)
submission = test_df.assign(Survived=test_y)[["PassengerId", "Survived"]]
submission.to_csv("./submission.csv", index=False)
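
Since the metrics reported earlier are measured on the training rows themselves, a held-out estimate can be a useful complement. The following is a minimal sketch of 5-fold cross-validation with `cross_validate`; it uses synthetic data from `make_classification` as a stand-in so that it runs on its own, and the sample count, `max_iter=1000`, and `cv=5` are illustrative assumptions rather than choices made in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for (train_x, train_y); swap in the real data in the notebook.
X, y = make_classification(n_samples=500, n_features=13, random_state=0)

metric_names = ["precision", "accuracy", "f1", "recall"]

# Each fold trains on 4/5 of the rows and scores on the remaining 1/5.
cv_results = cross_validate(
    LogisticRegression(max_iter=1000),
    X,
    y,
    cv=5,
    scoring=metric_names,
)

for name in metric_names:
    fold_scores = cv_results[f"test_{name}"]
    print(f"{name}: {fold_scores.mean():.4f} (std {fold_scores.std():.4f})")
```

In the notebook, `train_x` and `train_y` would take the place of `X` and `y`. Because the scaler was fitted on the full training set beforehand, the fold scores would still be slightly optimistic; wrapping the transformer and the model in a single scikit-learn `Pipeline` would avoid that.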