## Introduction

This notebook is a breakdown of the Kaggle Titanic competition. It is a chance for me to analyse the data and then add a machine learning model that estimates whether each passenger survived, based on the passenger data provided.

#  -------------------------------------------------------------------------------------------------------------

## Declaring Imports & Bringing In Data

In [None]:
%matplotlib inline

import time, random, datetime

# Importing these for manipulating data 
import numpy, pandas

# And importing these for visualising
import seaborn, matplotlib.pyplot as ploter, missingno

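# Use seaborn's white-grid theme (the 'seaborn-whitegrid' matplotlib style name was removed in newer matplotlib releases)
seaborn.set_style('whitegrid')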

# SKLearn preprocessing
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, label_binarize

# Importing some Machine learning modules
import catboost
from sklearn.model_selection import train_test_split
from sklearn import model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from catboost import CatBoostClassifier, Pool, cv

In [None]:
# Bringing in the data and storing each file as a pandas DataFrame
train = pandas.read_csv('./Data/train.csv')
test = pandas.read_csv('./Data/test.csv')
gender_submission = pandas.read_csv('./Data/gender_submission.csv')

## A Quick Look
A brief look at the existing data.

In [None]:
# Calling to view the head (first 5 rows of data) of the training data
train.head()

In [None]:
# Same as training data -- looking at first 5 rows of data.
test.head()

In [None]:
# Now having a look at the provided example submission dataframe
gender_submission.head()

In [None]:
# Displays a handful of useful summary statistics.
# E.g. the oldest passenger was 80 years old, and passengers had on average about half a sibling/spouse aboard.
# We also get a glimpse that not every passenger has an age recorded, so there is missing data that may affect the model's design.
train.describe()

## Missing Data
After a quick look, I found holes in the data. Time to see exactly what is missing and to take a deeper look at the data.
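Alongside a visual check, the gaps can also be counted per column. The sketch below is only an illustration: `toy` is a made-up stand-in for the training data, and its values are not taken from the competition files.

```python
import numpy
import pandas

# Toy stand-in for the training data (values are made up for illustration)
toy = pandas.DataFrame({
    "Age": [22.0, numpy.nan, 26.0, numpy.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()

# Share of rows missing in each column, as a percentage
missing_percent = (toy.isnull().mean() * 100).round(1)

print(missing_counts)
print(missing_percent)
```

The same two calls should work on `train` directly, which gives a quick ranking of the columns that need the most attention.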

In [None]:
# Plot a matrix of the missing values to better see which data is missing.
missingno.matrix(train, figsize = (30,10))

In [None]:
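# Two sub dataframes to fill in as each variable is explored
df_bin = pandas.DataFrame() # for discretised (binned) variables
df_con = pandas.DataFrame() # for continuous variables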

In [None]:
# Displays the data type of each column I'm working with.
train.dtypes

In [None]:
# Graphing how many people survived and how many unfortunately didn't.
fig = ploter.figure(figsize=(20,1))
seaborn.countplot(y='Survived', data=train);
print(train.Survived.value_counts())

## Gender/Sex
A look at the data with regard to the sex of the passengers, to see whether there's a connection to their survival rate.

In [None]:
df_bin['Survived'] = train['Survived']
df_con['Survived'] = train['Survived']
df_bin['Pclass'] = train['Pclass']
df_con['Pclass'] = train['Pclass']
df_bin['Sex'] = train['Sex']
df_bin['Sex'] = numpy.where(df_bin['Sex'] == 'female', 1, 0)
df_con['Sex'] = train['Sex']

In [None]:
# Checking how the sex/gender variable compares with survival.
fig = ploter.figure(figsize=(10, 10))
seaborn.distplot(df_bin.loc[df_bin['Survived'] == 1]['Sex'], kde_kws={'label': 'Survived'});
seaborn.distplot(df_bin.loc[df_bin['Survived'] == 0]['Sex'], kde_kws={'label': 'Did not survive'});

The plot shows me that fewer passengers survived than died, and that most of the survivors were female.
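To put a number on that kind of observation, the survival rate can be computed per group. This is a minimal sketch on a made-up `toy` dataframe, not the real training data; it relies on the fact that the mean of a 0/1 column is the share of 1s.

```python
import pandas

# Made-up passengers for illustration only
toy = pandas.DataFrame({
    "Sex": ["female", "male", "female", "male", "male", "female"],
    "Survived": [1, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate per group
survival_rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(survival_rate_by_sex)
```

Swapping `toy` for `train` should give the actual rates for each sex.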

In [None]:
def plot_count_dist(data, bin_df, label_column, target_column, figsize=(20, 5), use_bin_df=False):
    """Plot the counts of target_column next to its distribution split by label_column."""
    # The count plot uses the binned dataframe when requested, otherwise the raw data
    count_data = bin_df if use_bin_df else data
    fig = ploter.figure(figsize=figsize)
    ploter.subplot(1, 2, 1)
    seaborn.countplot(y=target_column, data=count_data);
    ploter.subplot(1, 2, 2)
    # The distribution plots always use the raw data
    seaborn.distplot(data.loc[data[label_column] == 1][target_column],
                     kde_kws={"label": "Survived"});
    seaborn.distplot(data.loc[data[label_column] == 0][target_column],
                     kde_kws={"label": "Did not survive"});

## Siblings & Spouses
Looking at the available Siblings & Spouses data, comparing it with survival and analysing it.

In [None]:
# How many missing values does SibSp have?
train.SibSp.isnull().sum()

In [None]:
# How many siblings and/or spouses did passengers have aboard?
train.SibSp.value_counts()
# I.e. we now see that 608 passengers travelled without any siblings or a spouse aboard.

In [None]:
# Add SibSp to subset dataframes
df_bin['SibSp'] = train['SibSp']
df_con['SibSp'] = train['SibSp']

In [None]:
# Comparing survival against the number of siblings/spouses aboard.
plot_count_dist(train, 
                bin_df=df_bin, 
                label_column='Survived', 
                target_column='SibSp', 
                figsize=(20, 10))

Looking at the plots above, passengers travelling with one sibling or spouse appear to have had a better chance of surviving than those travelling with none.

It's this kind of pattern that can help train our ML model to predict who survived.

## Parents & Children
Looking at the number of Parents & Children aboard the Titanic.

I'll be running a very similar analysis to the one for the Siblings & Spouses data.

In [None]:
# Checking for missing values for Parch...
train.Parch.isnull().sum()

In [None]:
train.Parch.value_counts()

In [None]:
df_bin['Parch'] = train['Parch']
df_con['Parch'] = train['Parch']

In [None]:
plot_count_dist(train, 
                bin_df=df_bin,
                label_column='Survived', 
                target_column='Parch', 
                figsize=(20, 10))

## Tickets
Looking at the ticket data: the number of ticket types, etc.

In [None]:
# Any missing data?
train.Ticket.isnull().sum()

In [None]:
# How many kinds of ticket are there?
seaborn.countplot(y="Ticket", data=train);

## Fare
~ Cost of tickets

Having a look at the fare data and trying to draw a connection to the survival rate.
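Fare is continuous, so comparing it with survival usually means grouping it into bins first. The sketch below uses made-up fares (not the real data) to contrast two ways of binning: equal-width bins with `pandas.cut` and equal-count bins with `pandas.qcut`. The `qcut` variant is shown only for comparison.

```python
import pandas

# Made-up fares: mostly cheap tickets plus one very expensive one
fares = pandas.Series([7.25, 7.9, 8.05, 13.0, 26.0, 30.0, 71.28, 512.33])

# Equal-width bins: each bin spans the same range of fares
equal_width = pandas.cut(fares, bins=5)

# Quantile bins: each bin holds roughly the same number of passengers
equal_count = pandas.qcut(fares, q=4)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

With a skewed variable, equal-width bins can end up with nearly every passenger in the lowest bin, which is worth keeping in mind when reading the bin counts.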

In [None]:
# How many missing values are there in the Fare variable?
train.Fare.isnull().sum()

In [None]:
# How many different values of Fare are there?
seaborn.countplot(y="Fare", data=train);

In [None]:
# Add Fare to sub dataframes
df_con['Fare'] = train['Fare'] 
df_bin['Fare'] = pandas.cut(train['Fare'], bins=5) # discretised

In [None]:
# What do our Fare bins look like?
df_bin.Fare.value_counts()

In [None]:
# Visualise the Fare bin counts as well as the Fare distribution versus Survived.
plot_count_dist(data=train,
                bin_df=df_bin,
                label_column='Survived', 
                target_column='Fare', 
                figsize=(20,10), 
                use_bin_df=True)

## Embarked
Where passengers boarded the Titanic.

<b>Legend:</b>

S -> Southampton

C -> Cherbourg

Q -> Queenstown


In [None]:
# How many missing values does Embarked have?
train.Embarked.isnull().sum()

In [None]:
# What kind of values are in Embarked?
train.Embarked.value_counts()

In [None]:
# What do the counts look like?
seaborn.countplot(y='Embarked', data=train);

In [None]:
# Add Embarked to sub dataframes
df_bin['Embarked'] = train['Embarked']
df_con['Embarked'] = train['Embarked']

In [None]:
# Remove Embarked rows which are missing values
print(len(df_con))
df_con = df_con.dropna(subset=['Embarked'])
df_bin = df_bin.dropna(subset=['Embarked'])
print(len(df_con))

## Encoding features
Now we have our two sub dataframes ready. We can encode the features so they're ready to be used with our machine learning models.
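As a small illustration of what encoding means here, the sketch below one-hot encodes a single categorical column with `pandas.get_dummies`. The `toy` dataframe is made up for the example.

```python
import pandas

# Made-up rows for illustration only
toy = pandas.DataFrame({
    "Survived": [1, 0, 1],
    "Embarked": ["S", "C", "Q"],
})

# Replace the Embarked column with one indicator column per category
encoded = pandas.get_dummies(toy, columns=["Embarked"], prefix="embarked")
print(encoded.columns.tolist())
```

Each row ends up with exactly one of the `embarked_*` columns switched on, so the models see numbers rather than letters.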

In [None]:
df_bin.head()

In [None]:
# One-hot encode binned variables
one_hot_cols = df_bin.columns.tolist()
one_hot_cols.remove('Survived')
df_bin_enc = pandas.get_dummies(df_bin, columns=one_hot_cols)

df_bin_enc.head()

In [None]:
df_con.head(10)

In [None]:
# One hot encode the categorical columns
df_embarked_one_hot = pandas.get_dummies(df_con['Embarked'], 
                                     prefix='embarked')

df_sex_one_hot = pandas.get_dummies(df_con['Sex'], 
                                prefix='sex')

df_plcass_one_hot = pandas.get_dummies(df_con['Pclass'], 
                                   prefix='pclass')

In [None]:
# Combine the one hot encoded columns with df_con
df_con_enc = pandas.concat([df_con, 
                        df_embarked_one_hot, 
                        df_sex_one_hot, 
                        df_plcass_one_hot], axis=1)

# Drop the original categorical columns (because now they've been one hot encoded)
df_con_enc = df_con_enc.drop(['Pclass', 'Sex', 'Embarked'], axis=1)

In [None]:
df_con_enc.head(20)

## Building Machine Learning Models
Now that our data has been manipulated and converted to numbers, we can run a series of different machine learning algorithms over it to find which yields the best results.

First, let's separate the data!
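Separating the data means splitting the label column from the feature columns. The sketch below shows that on a made-up `toy` dataframe, and then, as an optional extra that this notebook does not use, holds back part of the rows with `train_test_split`. The names `X_fit` and `X_holdout` are hypothetical.

```python
import pandas
from sklearn.model_selection import train_test_split

# Made-up rows for illustration only
toy = pandas.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "Fare": [71.3, 7.3, 53.1, 8.1, 30.0, 8.5, 26.6, 7.9, 80.0, 13.0],
    "SibSp": [1, 0, 1, 0, 0, 0, 1, 0, 0, 1],
})

# Features are every column except the label
X = toy.drop("Survived", axis=1)
y = toy["Survived"]

# Optional: keep 20% of the rows aside, preserving the survived/died ratio
X_fit, X_holdout, y_fit, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_fit.shape, X_holdout.shape)
```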

In [None]:
# Selecting the dataframe I'll use for my first predictions
selected_df = df_con_enc

In [None]:
selected_df.head()

In [None]:
# Split the dataframe into data and labels
X_train = selected_df.drop('Survived', axis=1) # data
y_train = selected_df.Survived # labels

In [None]:
X_train.shape # without the labels

In [None]:
X_train.head()

In [None]:
y_train.shape

## Machine Learning Algorithms
Creating a function that will check the accuracy of the different algorithms that get passed through.
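It helps to see why two accuracy figures are worth reporting. The sketch below uses synthetic data from `make_classification` as a stand-in for the Titanic features (an assumption for this example) and compares accuracy on the fitted rows with cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the Titanic features (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

model = DecisionTreeClassifier(random_state=0)

# Accuracy on the same rows the model was fitted on (optimistic)
train_acc = accuracy_score(y, model.fit(X, y).predict(X))

# Each row is predicted by a model that never saw it during fitting
cv_pred = cross_val_predict(model, X, y, cv=10)
cv_acc = accuracy_score(y, cv_pred)

print(round(train_acc * 100, 2), round(cv_acc * 100, 2))
```

A large gap between the two numbers is a hint that the model is memorising the training rows, so the cross-validated figure is usually the one to compare across algorithms.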

In [None]:
# Function that runs the requested algorithm and returns the accuracy metrics
def check_accuracy(algo, X_train, y_train, cv):
    
    # One Pass
    model = algo.fit(X_train, y_train)
    acc = round(model.score(X_train, y_train) * 100, 2)
    
    # Cross Validation 
    train_pred = model_selection.cross_val_predict(algo, 
                                                  X_train, 
                                                  y_train, 
                                                  cv=cv, 
                                                  n_jobs = -1)
    # Cross-validation accuracy metric
    acc_cv = round(metrics.accuracy_score(y_train, train_pred) * 100, 2)
    
    return train_pred, acc, acc_cv

In [None]:
# Logistic Regression
start_time = time.time()
train_pred_log, acc_log, acc_cv_log = check_accuracy(LogisticRegression(), 
                                                     X_train, 
                                                     y_train, 
                                                     10)
log_time = (time.time() - start_time)
print("Accuracy: %s" % acc_log)
print("Accuracy CV 10-Fold: %s" % acc_cv_log)
print("Running Time: %s" % datetime.timedelta(seconds=log_time))

In [None]:
# k-Nearest Neighbours
start_time = time.time()
train_pred_knn, acc_knn, acc_cv_knn = check_accuracy(KNeighborsClassifier(), 
                                                  X_train, 
                                                  y_train, 
                                                  10)
knn_time = (time.time() - start_time)
print("Accuracy: %s" % acc_knn)
print("Accuracy CV 10-Fold: %s" % acc_cv_knn)
print("Running Time: %s" % datetime.timedelta(seconds=knn_time))

In [None]:
# Gaussian Naive Bayes
start_time = time.time()
train_pred_gaussian, acc_gaussian, acc_cv_gaussian = check_accuracy(GaussianNB(), 
                                                                    X_train, 
                                                                    y_train, 
                                                                    10)
gaussian_time = (time.time() - start_time)
print("Accuracy: %s" % acc_gaussian)
print("Accuracy CV 10-Fold: %s" % acc_cv_gaussian)
print("Running Time: %s" % datetime.timedelta(seconds=gaussian_time))

In [None]:
# Linear SVC
start_time = time.time()
train_pred_svc, acc_linear_svc, acc_cv_linear_svc = check_accuracy(LinearSVC(),
                                                                X_train, 
                                                                y_train, 
                                                                10)
linear_svc_time = (time.time() - start_time)
print("Accuracy: %s" % acc_linear_svc)
print("Accuracy CV 10-Fold: %s" % acc_cv_linear_svc)
print("Running Time: %s" % datetime.timedelta(seconds=linear_svc_time))

In [None]:
# Stochastic Gradient Descent
start_time = time.time()
train_pred_sgd, acc_sgd, acc_cv_sgd = check_accuracy(SGDClassifier(), 
                                                  X_train, 
                                                  y_train,
                                                  10)
sgd_time = (time.time() - start_time)
print("Accuracy: %s" % acc_sgd)
print("Accuracy CV 10-Fold: %s" % acc_cv_sgd)
print("Running Time: %s" % datetime.timedelta(seconds=sgd_time))

In [None]:
# Decision Tree Classifier
start_time = time.time()
train_pred_dt, acc_dt, acc_cv_dt = check_accuracy(DecisionTreeClassifier(), 
                                                                X_train, 
                                                                y_train,
                                                                10)
dt_time = (time.time() - start_time)
print("Accuracy: %s" % acc_dt)
print("Accuracy CV 10-Fold: %s" % acc_cv_dt)
print("Running Time: %s" % datetime.timedelta(seconds=dt_time))

In [None]:
# Gradient Boosting Trees
start_time = time.time()
train_pred_gbt, acc_gbt, acc_cv_gbt = check_accuracy(GradientBoostingClassifier(), 
                                                                       X_train, 
                                                                       y_train,
                                                                       10)
gbt_time = (time.time() - start_time)
print("Accuracy: %s" % acc_gbt)
print("Accuracy CV 10-Fold: %s" % acc_cv_gbt)
print("Running Time: %s" % datetime.timedelta(seconds=gbt_time))

## CatBoost Algorithm
CatBoost is a state-of-the-art, open-source library for gradient boosting on decision trees.