# Titanic Competition

This notebook explains how I predicted the fate of the 418 passengers in the Titanic test set by training several models on the 891 passengers in the training set from the same voyage.

The goal is to fill the Survived column with as many correct values as possible.  First I explain how I prepared the data.  Then I try a few models that keep my validation accuracy above 80%.  We will look at **<span style="color: red">CatBoost</span>**, a tree-based ensemble, **<span style="color: blue">Sequential</span>** from **<span style="color: black">tensorflow</span>**, and **<span style="color: green">Logistic Regression</span>** from **<span style="color: black">sklearn.linear_model</span>**.

## 1. Check and Clean the Data

Let's set ourselves up for a quick wash of the Titanic data.

In [1]:
import pandas as pd
import numpy as np

train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train.head(10)

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 |  | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male |  | 0 | 0 | 330877 | 8.4583 |  | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 |  | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 |  | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 |  | C |


Here is the function we will use to set up our data for modeling.  Titles and decks get encoded as categories later, so we set thresholds to keep the number of categories down.  A title is kept only if it appears at least 10 times (title_thresh), and a deck only if it appears at least 20 times (deck_thresh).  Anything rarer is grouped into a single "Other" category.

In [2]:
def cleaner(df, title_thresh=10, deck_thresh=20):
    # work on a copy so the caller's DataFrame is not modified in place
    df = df.copy()

    # build a deck category from "cabin" by taking the first letter
    df["Cabin"] = df["Cabin"].fillna("Unknown")
    df["Deck"]  = df["Cabin"].str[0]
    
    # group decks seen fewer than deck_thresh times into "Other"
    deck_counts   = df["Deck"].value_counts()
    common_decks  = deck_counts[ deck_counts >= deck_thresh ].index
    df["Deck"]    = df["Deck"].where(df["Deck"].isin(common_decks), "Other")

    # build a title category from "Name" (the text between the comma and the first period)
    df["Title"] = df["Name"].str.extract(r',\s*([^\.]+)\.', expand=False)

    # group titles seen fewer than title_thresh times into "Other"
    title_counts  = df["Title"].value_counts()
    common_titles = title_counts[ title_counts >= title_thresh ].index
    df["Title"]   = df["Title"].where(df["Title"].isin(common_titles), "Other")

    # replace missing age and fare with values from grouping and stats
    age_meds = df.groupby(["Pclass","Title"])["Age"].median()
    df["Age"] = df.apply(
        lambda r: age_meds.loc[(r.Pclass, r.Title)] if pd.isna(r.Age) else r.Age,
        axis=1
    )
    fare_meds = df.groupby(["Pclass","Embarked"])["Fare"].median()
    df["Fare"] = df.apply(
        lambda r: fare_meds.loc[(r.Pclass, r.Embarked)] if pd.isna(r.Fare) else r.Fare,
        axis=1
    )

    # I know it's 3 or 4 lines, but we also need to fill in the missing Embarked values :-D
    emb_modes = df.groupby("Pclass")["Embarked"].agg(lambda x: x.mode().iloc[0])
    df["Embarked"] = df.apply(
        lambda r: emb_modes.loc[r.Pclass] if pd.isna(r.Embarked) else r.Embarked,
        axis=1
    )

    # check to see who was lone wolfing the Titanic adventure
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"]    = (df["FamilySize"] == 1).astype(int)

    # create bins for age groups
    df['AgeGroup'] = pd.cut(
        df['Age'],
        bins=[0, 12, 18, 35, 60, np.inf],
        labels=['Child','Teen','Adult','MidAge','Senior']
    )

    # create quartiles for fare
    df['FareBand'] = pd.qcut(
        df['Fare'],
        q=4,
        labels=['Low','Med','High','VeryHigh']
    )
    
    # get rid of useless columns
    drop_cols = ["PassengerId","Name","Ticket","Cabin"]
    df = df.drop(drop_cols, axis=1)

    return df
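
To make the title step concrete, here is a minimal sketch of the same extract-then-consolidate logic on a toy list of names.  The last two passengers and the threshold of 2 are made up for illustration; `cleaner` itself uses a threshold of 10 for titles.

```python
import pandas as pd

# Toy names in the same "Last, Title. First" layout as the Titanic data.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Allen, Mr. William Henry",
    "Moran, Mr. James",
    "Heikkinen, Miss. Laina",
    "Smith, Miss. Anna",   # hypothetical passenger
    "Jones, Dr. Arthur",   # hypothetical passenger
])

# Same pattern as in cleaner(): the text between the comma and the first period.
titles = names.str.extract(r',\s*([^\.]+)\.', expand=False)

# Keep titles seen at least `thresh` times; everything else becomes "Other".
thresh = 2  # toy value, chosen only so the small example shows both outcomes
counts = titles.value_counts()
common = counts[counts >= thresh].index
consolidated = titles.where(titles.isin(common), "Other")

print(consolidated.tolist())
```

"Mr" and "Miss" each clear the threshold and are kept, while the single "Dr" falls into "Other".  The deck column goes through the same counting step, only with the first letter of the cabin instead of a regex.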

Now we will start looking at the models.  Let's begin with a tree-based ensemble.

## 2. CatBoost

**<span style="color: red">CatBoost</span>** has a lot of benefits that give it an edge over other models.  It encodes categorical features on its own, so no manual one-hot encoding is needed.  It internally learns efficient “combinatorial” features, which helped with the unknown values in the data, and it uses ordered boosting to reduce overfitting.  It can be a bit too much for a small set like this one (under 1,000 rows).

In [5]:
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split

train_clean = cleaner(train)
test_clean = cleaner(test)

# categorical columns
cat_cols = ['Pclass','Sex','Embarked','Title','Deck','AgeGroup','FareBand']

for df in (train_clean, test_clean):
    for c in cat_cols:
        df[c] = df[c].astype(object).fillna("Missing").astype(str)

# train, validate
X = train_clean.drop("Survived", axis=1)
y = train_clean["Survived"]

# sklearn's train_test_split
X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=15
)

model = CatBoostClassifier(
    iterations=500,
    depth=6,
    learning_rate=0.1,
    random_seed=15,
    verbose=100
)
model.fit(
    X_train, y_train,
    cat_features=cat_cols,
    eval_set=(X_val, y_val),
    early_stopping_rounds=20
)

preds = model.predict(test_clean)

0:	learn: 0.6428827	test: 0.6383958	best: 0.6383958 (0)	total: 2.54ms	remaining: 1.27s
100:	learn: 0.2752394	test: 0.3241218	best: 0.3214858 (96)	total: 280ms	remaining: 1.1s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.3185820174
bestIteration = 116

Shrink model to first 117 iterations.


Here we check how well **<span style="color: red">CatBoost</span>** did with its predictions using **<span style="color: black">sklearn.metrics</span>**.  This is our basic little toolbox for measuring the fitness of the model.

In [7]:
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    roc_auc_score
)

y_pred_prob = model.predict_proba(X_val)[:, 1]
y_pred      = (y_pred_prob >= 0.5).astype(int)

print(f"Validation Accuracy : {accuracy_score(y_val, y_pred):.4f}")
print(f"Validation ROC AUC  : {roc_auc_score  (y_val, y_pred_prob):.4f}\n")

print("Classification Report:")
print(classification_report(y_val, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_val, y_pred))

Validation Accuracy : 0.8771
Validation ROC AUC  : 0.9255

Classification Report:
              precision    recall  f1-score   support

           0       0.89      0.92      0.90       110
           1       0.86      0.81      0.84        69

    accuracy                           0.88       179
   macro avg       0.87      0.86      0.87       179
weighted avg       0.88      0.88      0.88       179

Confusion Matrix:
[[101   9]
 [ 13  56]]


This is a very solid base model with high overall accuracy and AUC.  It struggles a tad more on the survivor side: for class 1, precision is 86% and recall is 81%, so about 19% of actual survivors are missed (false negatives), while only about 8% of non-survivors are misclassified.

An accuracy of 87.7% is a good prediction rate, but what stands out is the ROC AUC of nearly 93%, which shows that the model does a good job of distinguishing who is a survivor and who isn't.
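
As a quick sanity check, the headline numbers can be recomputed by hand from the CatBoost confusion matrix printed above.  This is only arithmetic on the four reported counts, not a re-run of the model.

```python
# Validation confusion matrix reported above: rows = actual, columns = predicted.
tn, fp = 101, 9    # actual non-survivors
fn, tp = 13, 56    # actual survivors

total = tn + fp + fn + tp

accuracy  = (tp + tn) / total          # share of all passengers classified correctly
precision = tp / (tp + fp)             # of predicted survivors, how many really survived
recall    = tp / (tp + fn)             # of actual survivors, how many the model found
miss_rate = fn / (tp + fn)             # of actual survivors, how many the model missed
f1        = 2 * precision * recall / (precision + recall)

print(f"accuracy  : {accuracy:.4f}")
print(f"precision : {precision:.2f}")
print(f"recall    : {recall:.2f}")
print(f"miss rate : {miss_rate:.2f}")
print(f"f1        : {f1:.2f}")
```

The values line up with the classification report: 157 of 179 passengers are classified correctly, and 13 of the 69 actual survivors are missed.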

## 3. Using TensorFlow's "Sequential" Neural Network

In this section we will use **<span style="color: blue">Sequential</span>** from **<span style="color: black">tensorflow.keras</span>** to build a neural network.  **<span style="color: blue">Sequential</span>** is one of the simplest ways to build a neural network in **<span style="color: black">tensorflow's</span>** high-level **<span style="color: black">keras</span>** API. It represents a linear stack of layers, where you feed the output of one layer directly into the next.
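
To illustrate what "a linear stack of layers" means, here is a small pure-Python sketch in which each layer is a function and the output of one is passed straight to the next.  It is not how Keras is implemented; the `dense` and `forward` helpers and all the weights are hypothetical, chosen only to show the data flow.

```python
import math

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def dense(weights, biases, activation):
    """Build a layer function computing activation(W x + b).

    `weights` holds one row of input weights per output unit.
    """
    def layer(x):
        z = [sum(w * xi for w, xi in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        return [activation(v) for v in z]
    return layer

def forward(x, layers):
    """Feed x through the layers in order, like a Sequential stack."""
    for layer in layers:
        x = layer(x)
    return x

# Hypothetical tiny network: 3 inputs -> 2 hidden units (ReLU) -> 1 output (sigmoid).
layers = [
    dense([[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]], [0.0, 0.1], relu),
    dense([[1.0, -1.5]], [0.2], sigmoid),
]

prob = forward([1.0, 2.0, 3.0], layers)[0]
print(f"predicted probability: {prob:.4f}")
```

The final sigmoid squeezes the output into the range 0 to 1, which is why the network below ends with a single sigmoid unit for the survived / did-not-survive decision.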

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Dropout

# We like to recycle here
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train_clean = cleaner(train)
test_clean = cleaner(test)

# you would think i got rid of all the nans....
cat_cols = ['Pclass','Sex','Embarked','Title','Deck','AgeGroup','FareBand']
for df in (train_clean, test_clean):
    for c in cat_cols:
        df[c] = df[c].astype(object)
        df[c] = df[c].fillna('Missing').astype(str)

# the training begins now
X = train_clean.drop("Survived", axis=1)
y = train_clean["Survived"].values

X_train, X_val, y_train, y_val = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

# we have to define what is numerical and what is categorical for the preprocessing pipelines
num_cols = ["Age","SibSp","Parch","Fare","FamilySize","IsAlone"]

# build pipelines
num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler())
])
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
    ("onehot",  OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocessor = ColumnTransformer([
    ("num", num_pipe, num_cols),
    ("cat", cat_pipe, cat_cols)
])

# prepare the fitting
X_train_proc = preprocessor.fit_transform(X_train)
X_val_proc   = preprocessor.transform(X_val)

# build our Sequential multilayer perceptron (MLP) and compile it
model = Sequential([
    Dense(128, activation="relu", input_shape=(X_train_proc.shape[1],)),
    Dropout(0.5),
    Dense(64,  activation="relu"),
    Dropout(0.3),
    Dense(1,   activation="sigmoid")
])
model.compile(
    optimizer="adam",
    loss="binary_crossentropy",
    metrics=["accuracy"]
)

# train the model and track validation metrics after each epoch
history = model.fit(
    X_train_proc, y_train,
    validation_data=(X_val_proc, y_val),
    epochs=50,
    batch_size=32,
    verbose=2
)

Epoch 1/50
23/23 - 1s - 48ms/step - accuracy: 0.6362 - loss: 0.6339 - val_accuracy: 0.6536 - val_loss: 0.5790
Epoch 2/50
23/23 - 0s - 3ms/step - accuracy: 0.7135 - loss: 0.5714 - val_accuracy: 0.7263 - val_loss: 0.5178
Epoch 3/50
23/23 - 0s - 3ms/step - accuracy: 0.7556 - loss: 0.4979 - val_accuracy: 0.7877 - val_loss: 0.4829
Epoch 4/50
23/23 - 0s - 3ms/step - accuracy: 0.7949 - loss: 0.4776 - val_accuracy: 0.7877 - val_loss: 0.4558
Epoch 5/50
23/23 - 0s - 4ms/step - accuracy: 0.8076 - loss: 0.4321 - val_accuracy: 0.8045 - val_loss: 0.4406
Epoch 6/50
23/23 - 0s - 3ms/step - accuracy: 0.8062 - loss: 0.4442 - val_accuracy: 0.8212 - val_loss: 0.4328
Epoch 7/50
23/23 - 0s - 3ms/step - accuracy: 0.8146 - loss: 0.4334 - val_accuracy: 0.8268 - val_loss: 0.4258
Epoch 8/50
23/23 - 0s - 3ms/step - accuracy: 0.8160 - loss: 0.4306 - val_accuracy: 0.8268 - val_loss: 0.4231
Epoch 9/50
23/23 - 0s - 3ms/step - accuracy: 0.8076 - loss: 0.4332 - val_accuracy: 0.8547 - val_loss: 0.4233
Epoch 10/50
23/23 - 0s - 3ms/

And the validation results are below.

In [10]:

val_loss, val_acc = model.evaluate(X_val_proc, y_val, verbose=0)
print(f"\nValidation loss: {val_loss:.4f}")
print(f"Validation accuracy: {val_acc:.4f}")

y_val_prob = model.predict(X_val_proc).ravel()
y_val_pred = (y_val_prob >= 0.5).astype(int)

print(f"ROC AUC: {roc_auc_score(y_val, y_val_prob):.4f}\n")
print("Classification Report:")
print(classification_report(y_val, y_val_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_val, y_val_pred))



Validation loss: 0.4486
Validation accuracy: 0.8101
6/6 ━━━━━━━━━━━━━━━━━━━━ 0s 7ms/step
ROC AUC: 0.8498

Classification Report:
              precision    recall  f1-score   support

           0       0.82      0.89      0.85       110
           1       0.80      0.68      0.73        69

    accuracy                           0.81       179
   macro avg       0.81      0.79      0.79       179
weighted avg       0.81      0.81      0.81       179

Confusion Matrix:
[[98 12]
 [22 47]]


The results for this model fall behind those of **<span style="color: red">CatBoost</span>**, with 81% accuracy and a ROC AUC of 85%.  The confusion matrix shows that the model catches most non-survivors (89% recall) but misses nearly one-third of survivors (68% recall), which comes to 22 false negatives.  Keep in mind that this split uses random_state=42 rather than the 15 used for CatBoost, so the two models are not scored on exactly the same validation rows.  MLPs normally do very well at modeling nonlinear relationships, but because the Titanic data is so small, the network can't quite wrap its net around the numbers.  MLPs need a certain amount of depth and training data to get the most out of them.

## 4. SKLearn's Logistic Regression

**<span style="color: green">LogisticRegression's</span>** coefficients can tell you exactly how each standardized feature affects the log-odds of survival, which is invaluable for understanding passenger risk factors and makes it a good choice to use on the Titanic data.  
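
As a rough illustration of how a coefficient reads on the log-odds scale, here is a small sketch with made-up numbers.  The intercept and coefficient are hypothetical and are not taken from the fitted model; the point is only that a coefficient of b on a standardized feature multiplies the odds of survival by exp(b) for every one-standard-deviation increase.

```python
import math

def sigmoid(z):
    """Convert log-odds into a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values, for illustration only.
intercept = -0.5   # log-odds of survival for a passenger who is average on every feature
coef      = 0.4    # coefficient on one standardized feature

# A standardized feature is 0 at its mean and 1 at one standard deviation above it.
log_odds_mean  = intercept + coef * 0.0
log_odds_plus1 = intercept + coef * 1.0

odds_ratio = math.exp(coef)   # multiplicative change in the odds per one std increase
p_mean     = sigmoid(log_odds_mean)
p_plus1    = sigmoid(log_odds_plus1)

print(f"odds ratio per +1 std : {odds_ratio:.3f}")
print(f"P(survive) at mean    : {p_mean:.3f}")
print(f"P(survive) at +1 std  : {p_plus1:.3f}")
```

A positive coefficient raises the odds, a negative one lowers them, and a coefficient of zero leaves them unchanged.  Once the pipeline below is fitted, the real coefficients are available through `pipe.named_steps['clf'].coef_`.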

In [11]:
from sklearn.linear_model import LogisticRegression

# one more time
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

train_clean = cleaner(train)
test_clean = cleaner(test)

# just in case
cat_cols = ['Pclass','Sex','Embarked','Title','Deck','AgeGroup','FareBand']
for df in (train_clean, test_clean):
    for c in cat_cols:
        df[c] = df[c].astype(object)
        df[c] = df[c].fillna('Missing').astype(str)

# train test split
X = train_clean.drop("Survived", axis=1)
y = train_clean["Survived"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=15
    )

# behold, our pipelines
num_cols = ["Age","SibSp","Parch","Fare","FamilySize","IsAlone"]
num_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale',  StandardScaler())
])
cat_pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    # sparse_output replaces the older sparse argument (scikit-learn >= 1.2)
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])
preprocessor = ColumnTransformer([
    ('num', num_pipe, num_cols),
    ('cat', cat_pipe, cat_cols)
])

# our logistic regression pipeline
pipe = Pipeline([
    ('prep', preprocessor),
    ('clf',  LogisticRegression(solver='lbfgs', max_iter=1000))
])

# now we train
pipe.fit(X_train, y_train)



Now let's check the results.

In [12]:
y_pred = pipe.predict(X_val)
y_prob = pipe.predict_proba(X_val)[:, 1]

print("Validation Accuracy:", accuracy_score(y_val, y_pred))
print("ROC AUC:          ", roc_auc_score(y_val, y_prob))
print("\nClassification Report:\n", classification_report(y_val, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_val, y_pred))

Validation Accuracy: 0.8659217877094972
ROC AUC:           0.9127799736495388

Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.88      0.89       110
           1       0.82      0.84      0.83        69

    accuracy                           0.87       179
   macro avg       0.86      0.86      0.86       179
weighted avg       0.87      0.87      0.87       179

Confusion Matrix:
 [[97 13]
 [11 58]]


The results from **<span style="color: green">LogisticRegression</span>** are comparable with those of **<span style="color: red">CatBoost</span>**.  Precision/recall is 0.90/0.88 for non-survivors and 0.82/0.84 for survivors.  Its errors are nearly balanced (13 false positives against 11 false negatives), so it misses far fewer survivors than the MLP did.

## 5. Future Projects

- Pairing models to push accuracy past 95%
- Experimenting with other tree-based ensembles
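
One possible way to pair models, sketched here as an assumption rather than something this notebook has tried, is soft voting: average the survival probabilities from several models and threshold the average.  The `soft_vote` helper and the probabilities below are hypothetical.

```python
def soft_vote(prob_lists, threshold=0.5):
    """Average per-passenger probabilities across models, then apply a threshold."""
    n_models = len(prob_lists)
    averaged = [sum(ps) / n_models for ps in zip(*prob_lists)]
    labels = [int(p >= threshold) for p in averaged]
    return averaged, labels

# Hypothetical survival probabilities for four passengers from three models.
catboost_probs = [0.90, 0.20, 0.55, 0.40]
mlp_probs      = [0.80, 0.10, 0.35, 0.45]
logreg_probs   = [0.85, 0.30, 0.45, 0.70]

averaged, labels = soft_vote([catboost_probs, mlp_probs, logreg_probs])

for avg, label in zip(averaged, labels):
    print(f"mean probability {avg:.3f} -> predicted Survived = {label}")
```

In the third and fourth rows the models disagree, and the average settles the call.  Whether this actually beats the best single model would have to be checked on a validation split shared by all three models.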