# Titanic Survival with Random Forest

***How likely would you have been to survive such a catastrophe?*** What affects your chance of survival?

This notebook shows a quick application of a random forest, so it uses the scikit-learn (`sklearn`) library, which has a random forest implementation built in.

The problem and data are taken from the Kaggle competition [*Titanic - Machine Learning from Disaster*](https://www.kaggle.com/c/titanic/overview).

### Imports and preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

In [None]:
# Preprocess function
label_encoders = {}

def preprocess(df, encoded=False):
    # Drop the free-text and sparse columns, then drop any rows that still have missing values
    df = df.drop(columns=["Name", "Ticket", "Cabin"]).dropna()
    for attribute in ["Sex", "Embarked"]:

        # Training data: fit a new label encoder and store it for later reuse
        if not encoded:
            le = LabelEncoder()
            df[attribute] = le.fit_transform(df[attribute])
            label_encoders[attribute] = le

        # Test data: reuse the label encoders already fitted on the training data
        else:
            df[attribute] = label_encoders[attribute].transform(df[attribute])


    # Split the data into x (the features/attributes) and y (whether the passenger survived)
    if not encoded:
        return df.iloc[:, 2:], df.iloc[:, 1:2]
    # Test data only has x
    else:
        return df.iloc[:, 1:]
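
The fit-then-reuse pattern in `preprocess` can be seen in isolation with a minimal sketch. The port lists below are toy values standing in for the `Embarked` column, not the real data. Note also that scikit-learn documents `LabelEncoder` as intended for target labels; `OrdinalEncoder` is the feature-oriented alternative, which this notebook does not use.

```python
from sklearn.preprocessing import LabelEncoder

# Toy values standing in for the "Embarked" column (assumption: not the real data)
train_ports = ["S", "C", "Q", "S"]
test_ports = ["Q", "S"]

encoder = LabelEncoder()
train_codes = encoder.fit_transform(train_ports)  # learns the classes in sorted order
test_codes = encoder.transform(test_ports)        # reuses the mapping learned above

# Show which integer each category was assigned
mapping = {str(label): int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(train_codes.tolist(), test_codes.tolist())
```

Because the mapping is learned once and reused, the same category always gets the same integer in both the training and the test data.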

In [None]:
train_data = pd.read_csv("./data/train.csv")
test_data = pd.read_csv("./data/test.csv")

if "Cabin" in train_data.columns:
    train_data_x, train_data_y = preprocess(train_data)
    test_data_x = preprocess(test_data, encoded=True)

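# Load the true labels and keep only the rows whose test rows were not removed by dropna() in preprocess
test_data_y = pd.read_csv("./data/solution.csv")
unmatching_rows = [i for i in test_data_y.index if i not in test_data_x.index]
test_data_y = test_data_y.drop(unmatching_rows).drop(columns=["PassengerId"])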



### Training

In [None]:
# We can customise the number of trees in the random forest
number_of_trees = 10
random_forest = RandomForestClassifier(n_estimators=number_of_trees, criterion="gini", random_state = 15)
random_forest.fit(train_data_x.values, train_data_y.values.ravel())
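
One way to approach the opening question of what affects the chance of survival is the fitted forest's `feature_importances_` attribute. The sketch below is only an illustration on synthetic data: the arrays and the feature names are made up, and only the first column actually determines the label.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption for illustration): only the first column drives the label
rng = np.random.default_rng(15)
X = rng.normal(size=(300, 3))
y = (X[:, 0] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=10, criterion="gini", random_state=15)
forest.fit(X, y)

# Impurity-based importances: one value per feature, summing to 1
feature_names = ["informative", "noise_a", "noise_b"]  # hypothetical names
importances = dict(zip(feature_names, forest.feature_importances_))
for name, value in sorted(importances.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {value:.3f}")
```

On the Titanic data, the same attribute could be paired with `train_data_x.columns` to rank the passenger attributes, keeping in mind that impurity-based importances are only a rough guide.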

In [None]:
# Let's try a prediction; feel free to change the values here
customised = {
    "Pclass": 1, # Integer: 1/2/3
    "Sex": "female", # String: male/female
    "Age": 19, # Float
    "SibSp": 1, # Integer
    "Parch": 2, # Integer
    "Fare": 520, # Float
    "Embarked": "Q" # String: Q/C/S
}

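# Build the feature vector in the same column order used for training
predict = []

for key in customised:
    if key in label_encoders:
        # transform() returns an array; take its single element so the row stays flat
        predict.append(label_encoders[key].transform([customised[key]])[0])
    else:
        predict.append(customised[key])

prediction = random_forest.predict([predict])[0]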

s = """Would someone
 - with a {} class ticket
 - whose sex is {}
 - who is {}
 - with {} siblings/spouses and {} parents/children aboard
 - with a ticket fare of {}
 - who embarked from {}
 HAVE SURVIVED?
 Random Forest's answer is {}.""".format(
 ("first" if customised["Pclass"] == 1 else ("second" if customised["Pclass"] == 2 else "third")),
 customised["Sex"],
 (str(int(customised["Age"])) + " years old") if int(customised["Age"])>1 else (("1 year old") if int(customised["Age"]) == 1 else "a baby less than a year old"),
 str(customised["SibSp"]),
 str(customised["Parch"]),
 str(round(customised["Fare"], 2)),
 "Cherbourg" if customised["Embarked"] == "C" else ("Queenstown" if (customised["Embarked"] == "Q") else "Southampton"),
 "YES" if prediction == 1 else "no")

print(s)

### Testing and Scoring

In [None]:
# Note that we haven't done any hyperparameter tuning, so that we can have a direct comparison in performance
score = random_forest.score(test_data_x.values, test_data_y.values.ravel())
print("The random forest predicts the result correctly in {}% of cases".format(round(score * 100)))

In [None]:
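# Confusion matrix: confusion_matrix expects (y_true, y_pred), so rows are actual and columns are predicted classes
mat = confusion_matrix(test_data_y.values.ravel(), random_forest.predict(test_data_x.values))

plt.figure(figsize=(16, 10))
sns.heatmap(mat, annot=True, annot_kws={'size': 15}, square=True, fmt=".3g")
plt.xlabel("Predicted", size=15)
plt.ylabel("Actual", size=15)
plt.xticks(size=15)
plt.yticks(size=15)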
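
To make the heatmap easier to read, here is a small sketch of how the four cells of a binary confusion matrix relate to accuracy, precision and recall. The labels are toy values, not the Titanic results; scikit-learn's `confusion_matrix` takes the true labels first, so rows are actual classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Toy labels (assumption: not the Titanic results); 1 = survived, 0 = did not survive
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

accuracy = (tp + tn) / matrix.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)           # share of predicted survivors who survived
recall = tp / (tp + fn)              # share of actual survivors who were found
print(matrix)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```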

In [None]:
# Plot any tree from index 0 to number_of_trees - 1 (inclusive) in the random forest
i = 0
plt.figure(figsize=(25, 20))
_ = plot_tree(random_forest.estimators_[i], feature_names=list(train_data_x.columns), class_names=["No", "Yes"], filled=True)
plt.savefig("random_forest_tree_" + str(i) + ".svg")

### The random forest predicts survival correctly in *75%* of cases

Compared with the previous decision tree model, the decision tree was *7%* more accurate. Note that:
- The dataset is relatively *small*, with fewer than 1000 rows. In addition, we dropped the rows with missing values, making it even smaller.
  - We could fill in the missing values instead of dropping those rows
- We didn't do any hyperparameter tuning (e.g. `n_estimators`, `max_depth`, `min_samples_split`)
  - We could tune these with cross-validation
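
Both suggestions can be sketched together. The example below is a hedged illustration, not something this notebook ran: it uses a synthetic stand-in for the features, fills missing ages with the column median through `SimpleImputer`, and searches a small, arbitrarily chosen grid with 5-fold cross-validation through `GridSearchCV`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the Titanic features (assumption: not the real data)
rng = np.random.default_rng(15)
n = 200
features = pd.DataFrame({
    "Age": rng.uniform(1, 80, size=n),
    "Fare": rng.uniform(5, 500, size=n),
})
labels = (features["Fare"] > 100).astype(int)
# Simulate 40 missing ages
features.loc[rng.choice(n, size=40, replace=False), "Age"] = np.nan

# Fill missing values with the column median instead of dropping the rows
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("forest", RandomForestClassifier(random_state=15)),
])

# Small, arbitrary grid; each combination is scored with 5-fold cross-validation
param_grid = {
    "forest__n_estimators": [10, 50],
    "forest__max_depth": [3, None],
    "forest__min_samples_split": [2, 10],
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(features, labels)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Putting the imputer inside the pipeline means the median is recomputed on each training fold, so no information from the validation fold leaks into the preprocessing.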

We conclude that a random forest has the potential to improve on a single decision tree, although without tuning it did not do so on this dataset.