# Titanic Survival with Decision Tree

***How likely would you have been to survive such a catastrophe?*** What affects your chance of survival?

This notebook shows a quick application of a decision tree, so it uses the sklearn library, which has decision trees built in. If you are interested in how to build one from scratch, there is a notebook in this folder with the complete code.

This problem and its data are taken from the Kaggle competition [*Titanic - Machine Learning from Disaster*](https://www.kaggle.com/c/titanic/overview).

### Imports and preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import LabelEncoder

In [None]:
# Read in the training data (the train and test data come in separate files)
# SibSp is the number of siblings/spouses aboard, Parch the number of parents/children aboard,
# Embarked is the first letter of the port of embarkation
train_data = pd.read_csv("./data/train.csv")
train_data

In [None]:
# Getting information about the data
train_data.info()

In [None]:
# Looking at statistical summary
train_data.describe()

In [None]:
# Preprocess the data, build the label encoders, and split into features and target
label_encoders = {}

def preprocess(df, encoded=False):
    
    # Drop the Cabin column because it has too many missing values
    # Drop the Name and Ticket columns as they are not suitable features in this case
    # Drop the remaining rows with missing values (e.g. Age, Embarked)
    df = df.drop(columns=["Name", "Ticket", "Cabin"]).dropna()


    # Preprocessing string values into labels
    for attribute in ["Sex", "Embarked"]:

        # Training data: fit a new label encoder and keep it for later
        if not encoded:
            le = LabelEncoder()
            df[attribute] = le.fit_transform(df[attribute])
            label_encoders[attribute] = le

        # Test data: reuse the label encoders fitted on the training data
        else:
            df[attribute] = label_encoders[attribute].transform(df[attribute])


    # Split the data into x (the features/attributes) and y (whether the passenger survived)
    if not encoded:
        return df.iloc[:, 2:], df.iloc[:, 1:2]
    # Test data only has x
    else:
        return df.iloc[:, 1:]
        


if "Cabin" in train_data.columns:
    train_data_x, train_data_y = preprocess(train_data)
train_data_x


In [None]:
train_data_x.info()
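
To see what the label encoding in `preprocess` does to the string columns, here is a minimal sketch on a made-up `Sex` column (the toy values are an assumption, not rows from the dataset). `LabelEncoder` sorts the distinct values and numbers them from 0, and the fitted encoder should be reused on any later data so that the codes stay consistent.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column standing in for train_data["Sex"] (made-up values)
toy_sex = pd.Series(["male", "female", "female", "male"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(toy_sex)

# classes_ holds the sorted distinct values; the position of a value is its code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

# Reuse the already fitted encoder on new values
new_codes = encoder.transform(["female", "male"])
print(new_codes)
```

Note that the integer codes impose an arbitrary order on the categories. A tree can still split on them, but the order itself carries no meaning.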

### Training

In [None]:
# Create a decision tree that chooses its splits by Gini impurity
decision_tree = DecisionTreeClassifier(criterion="gini", random_state=49)


# Train it on the training data; .values passes plain arrays without the column headers
decision_tree.fit(train_data_x.values, train_data_y.values)
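
For intuition about `criterion="gini"`: the Gini impurity of a node is 1 minus the sum of the squared class proportions in that node, and a split is scored by the impurity of its children weighted by their sizes. The sketch below computes both by hand on made-up labels. `gini_impurity` and `split_impurity` are hypothetical helper names for illustration, not sklearn functions.

```python
import numpy as np

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a 1-D collection of class labels."""
    labels = np.asarray(labels)
    if labels.size == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / labels.size
    return float(1.0 - np.sum(proportions ** 2))

def split_impurity(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return (len(left) * gini_impurity(left) + len(right) * gini_impurity(right)) / n

# Made-up node with 4 survivors (1) and 4 non-survivors (0)
parent = [1, 1, 1, 1, 0, 0, 0, 0]
# Made-up candidate split of that node
left, right = [1, 1, 1, 0], [1, 0, 0, 0]

print(gini_impurity(parent))
print(split_impurity(left, right))
```

A split is useful when the weighted impurity of the children is lower than the impurity of the parent. A pure node, where every label is the same, has impurity 0.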

In [None]:
# Quick prediction given made up conditions
customised = {
    "Pclass": 3, # Integer: 1/2/3
    "Sex": "female", # String: male/female
    "Age": 20, # Float
    "SibSp": 5, # Integer
    "Parch": 2, # Integer
    "Fare": 250, # Float
    "Embarked": "C" # String: Q/C/S
}

predict = []

# Encode the string attributes with the fitted label encoders
for key in customised:
    if key in label_encoders.keys():
        # transform returns an array, so take its single element
        predict.append(label_encoders[key].transform([customised[key]])[0])
    else:
        predict.append(customised[key])

# Make a prediction
prediction = decision_tree.predict([predict])[0]

s = """Would someone 
 - with a {} class ticket
 - sex is {}
 - is {}
 - has {} siblings and {} parents/children aboard
 - with a ticket fare of {}
 - embarked from {}
 HAVE SURVIVED?
 Decision Tree's answer is {}.""".format(
 ("first" if customised["Pclass"] == 1 else ("second" if customised["Pclass"] == 2 else "third")),
 customised["Sex"],
 (str(int(customised["Age"])) + " years old") if int(customised["Age"])>1 else (("1 year old") if int(customised["Age"]) == 1 else "a baby less than a year old"),
 str(customised["SibSp"]),
 str(customised["Parch"]),
 str(round(customised["Fare"], 2)),
 "Cherbourg" if customised["Embarked"] == "C" else ("Queenstown" if (customised["Embarked"] == "Q") else "Southampton"),
 "YES" if prediction == 1 else "no")

print(s)






### Testing and Scoring

In [None]:
# Same preprocessing with test data
test_data = pd.read_csv("./data/test.csv")
test_data_x = preprocess(test_data, encoded=True)

# Read in solution data and drop unmatching rows
test_data_y = pd.read_csv("./data/solution.csv")
unmatching_rows = [i for i in test_data_y.index if i not in test_data_x.index]
test_data_y = test_data_y.drop(unmatching_rows).drop(columns=["PassengerId"])


score = decision_tree.score(test_data_x.values, test_data_y)
print("The decision tree predicts the result correctly in {:.0f}% of cases".format(score * 100))

In [None]:
# Confusion matrix: sklearn expects (y_true, y_pred), so rows are true labels and columns are predictions
mat = confusion_matrix(test_data_y.values, decision_tree.predict(test_data_x.values))

plt.figure(figsize = (16,10))
sns.heatmap(mat, annot=True, annot_kws={'size': 15}, square = True, fmt=".3g")
plt.xticks(size = 15)
plt.yticks(size = 15)
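
The counts in a confusion matrix can also be turned into summary metrics. Here is a minimal sketch on made-up labels (the two arrays are assumptions, not this notebook's predictions), using sklearn's convention of true labels on the rows and predicted labels on the columns:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted survival labels (1 = survived)
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 1])

# For binary labels, ravel() yields the four cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)  # of those predicted to survive, the share who did
recall = tp / (tp + fn)     # of the actual survivors, the share the model found

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
```

Accuracy alone can hide an imbalance between the two kinds of error, which is why precision and recall are worth reporting alongside it.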


In [None]:
# Most important features, ranked by the tree's impurity-based feature importance
pd.concat((pd.DataFrame(train_data_x.columns, columns = ['variable']), 
           pd.DataFrame(decision_tree.feature_importances_, columns = ['importance'])), 
          axis = 1).sort_values(by='importance', ascending = False)

In [None]:
train_data["Survived"].groupby(train_data["Sex"]).mean()

### It seems the phrase ***"Women and children first"*** was not just a saying. Our most sincere respects to them.

In [None]:
plt.figure(figsize=(25, 20))
_ = plot_tree(decision_tree, feature_names=list(train_data_x.columns), class_names=["No", "Yes"], filled=True)
plt.savefig("decision_tree.svg")

### The tree predicts survival correctly in *68%* of cases,

which is not bad. However, this could be **better**. Note that:
- The dataset is relatively *small*, with fewer than 1000 rows. In addition, we dropped the rows with missing values, making it even smaller.
  - Maybe we can fill in the missing values instead
- We didn't do any hyperparameter tuning (e.g. max_depth, min_samples_split)
  - Try cross-validation
- Decision trees often overfit the training data. Is there any solution?
  - Ensemble methods, e.g. Random Forest
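
The three ideas above could be tried along these lines. This is only a sketch under stated assumptions: it runs on a synthetic stand-in for the preprocessed features (random values with a made-up survival rule, not the Kaggle data), and the parameter grid is an arbitrary starting point. It fills missing ages with the median via `SimpleImputer`, tunes the tree with `GridSearchCV`, and scores a `RandomForestClassifier` with the same 5-fold cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed features (NOT the Kaggle data)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Age": rng.uniform(1, 70, n),
    "Fare": rng.uniform(5, 250, n),
})
# Made-up survival rule plus 10% label noise, so there is something to learn
y = ((X["Sex"] == 0) | (X["Age"] < 10)).astype(int).to_numpy()
flip = rng.random(n) < 0.1
y = np.where(flip, 1 - y, y)

# Blank out about 20% of the ages to mimic the missing values
X.loc[rng.random(n) < 0.2, "Age"] = np.nan

# 1) Fill missing values with the median instead of dropping rows
# 2) Tune the tree with a 5-fold cross-validated grid search
tree = make_pipeline(SimpleImputer(strategy="median"),
                     DecisionTreeClassifier(random_state=49))
grid = GridSearchCV(
    tree,
    param_grid={
        "decisiontreeclassifier__max_depth": [2, 3, 5, None],
        "decisiontreeclassifier__min_samples_split": [2, 10, 30],
    },
    cv=5,
)
grid.fit(X, y)
print("best params:", grid.best_params_)
print("tuned tree CV accuracy:", round(grid.best_score_, 3))

# 3) Random forest as an ensemble alternative, scored the same way
forest = make_pipeline(SimpleImputer(strategy="median"),
                       RandomForestClassifier(n_estimators=100, random_state=49))
forest_scores = cross_val_score(forest, X, y, cv=5)
print("random forest CV accuracy:", round(forest_scores.mean(), 3))
```

Putting the imputer inside the pipeline means the median is recomputed on each training fold, so no information from the validation fold leaks into preprocessing.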