# Titanic Kaggle competition: classification problem

Andre Moreira, 2023


Based on materials from IBM's Data Science Professional Certificate (Pratiksha Verma)

## Background

This notebook is kept as simple as possible to help beginners navigate some of the simpler classification models, use hyperparameter tuning, read a confusion matrix, etc.

The predictions usually reach 75% accuracy or higher. The best result so far from this notebook was 78.7% accuracy in predicting survival.

## Preamble


In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats
from scipy.stats import norm

# Preprocessing allows us to standardize our data
from sklearn import preprocessing
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
# Allows us to test parameters of classification algorithms and find the best one
from sklearn.model_selection import GridSearchCV
# Logistic Regression classification algorithm
from sklearn.linear_model import LogisticRegression
# Support Vector Machine classification algorithm
from sklearn.svm import SVC
# Decision Tree classification algorithm
from sklearn.tree import DecisionTreeClassifier
# K Nearest Neighbors classification algorithm
from sklearn.neighbors import KNeighborsClassifier
# Multilayer Perceptron
from sklearn.neural_network import MLPClassifier

from sklearn.metrics import confusion_matrix

import sklearn.metrics as metrics

In [2]:
# Setting this option will print all columns of a dataframe
pd.set_option('display.max_columns', None)
# Setting this option will print all of the data in a feature
pd.set_option('display.max_colwidth', None)

pd.set_option('display.max_rows', None)

This function is to plot the confusion matrix.


In [3]:
def plot_confusion_matrix(y,y_predict):
    "this function plots the confusion matrix"
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax= plt.subplot()
    sns.heatmap(cm, cmap ='Greens', annot=True, linewidths = 0.5, 
                linecolor='black',ax = ax); #annot=True to annotate cells
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix'); 
    ax.xaxis.set_ticklabels(['0', '1']); ax.yaxis.set_ticklabels(['0', '1']) 
    plt.show() 

In [4]:
# function to get the four counts (TP, TN, FP, FN) from the confusion matrix
def cm_res(Y_t, Y_hat):
    cm_a = confusion_matrix(Y_t, Y_hat)
    tp = cm_a[1][1]
    tn = cm_a[0][0]
    fp = cm_a[0][1]
    fn = cm_a[1][0]
    return tp, tn, fp, fn
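
The indexing in `cm_res` relies on scikit-learn's layout for binary labels: rows are true labels, columns are predicted labels. A minimal sketch with made-up labels (`y_true` and `y_pred` are illustrative, not taken from the Titanic data):

```python
from sklearn.metrics import confusion_matrix

# Made-up labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# Rows are true labels, columns are predicted labels:
# [[TN, FP],
#  [FN, TP]]
tn, fp, fn, tp = cm.ravel()
print(cm)
print("tp =", tp, "tn =", tn, "fp =", fp, "fn =", fn)
```

So `cm[1][1]` holds the true positives and `cm[0][1]` the false positives, as used above.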

In [5]:
# this prepares for the overview, shows the confusion matrix
def overw(X, Y, Y_hat, fitter):
    plot_confusion_matrix(Y,Y_hat)
    a = fitter.score(X, Y)
    b = metrics.f1_score(Y, Y_hat)
    return a,b
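
For a classifier, `score` returns the accuracy, so the two numbers returned by `overw` can be reproduced by hand from the four confusion-matrix counts: accuracy is (TP + TN) / total and F1 is 2·TP / (2·TP + FP + FN). A small sketch with made-up labels, checking the hand calculation against scikit-learn:

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels, for illustration only
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0, 1]

# Count the four outcomes by hand
pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

# Both pairs of numbers should agree
print(accuracy, accuracy_score(y_true, y_pred))
print(f1, f1_score(y_true, y_pred))
```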

## Load and prepare the data


### Training / test data

In [6]:
# This is the training data
data = pd.read_csv("train.csv")

In [7]:
data.shape

In [8]:
data.head()

In this dataset, the variable we want to predict is "Survived", so that is our "Y"; the rest (which variables exactly is still to be decided) forms our "X", whose use we will see later.

Further below, we create a NumPy array from the column <code>Survived</code> in <code>data</code> by applying the method <code>to_numpy()</code> and assign it to the variable <code>Y</code>. Selecting the column with only one bracket (df\['name of column']) gives a Pandas series, which <code>to_numpy()</code> then converts.

First, we compare how a single variable is distributed among survivors and non-survivors.


In [9]:
r1 = "SibSp"
disc = True
b = 20

df_w = data[data["Survived"]==1]
df_w2 = data[data["Survived"]==0]

fig, axa = plt.subplots(ncols = 2, figsize = (8,3), sharey = True)
fig.tight_layout(pad=2.0)
    
sns.histplot(data = df_w, x = r1, stat = 'percent',
            kde = False, bins = b, discrete = disc, fill = False,
            ax=axa[0])

sns.histplot(data = df_w2, x = r1, stat = 'percent',
            kde = False, bins = b, discrete = disc, fill = False,
            ax=axa[1])

axa[0].set_title("Survived")
axa[1].set_title("Not survived")

plt.show()

Note: by examining the different variables, it becomes clear that some have a strong correlation with the outcome, while others have a less clear correlation.

- Strong: Pclass, Sex, Embarked 
- Weak: SibSp, Parch, Fare
- Inconclusive: Age

We will then work with the data accordingly.
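
One hedged way to put numbers on the visual impression from the histograms is to compare the survival rate within each category. The sketch below uses a tiny made-up frame (`toy`) with the same column names as the Titanic data; on the real data the same `groupby` call would be applied to `data`.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Survived": [0, 1, 1, 0, 0, 1],
})

# Mean of a 0/1 column within each group = survival rate of that group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

A large gap between the groups suggests a strong relationship with the outcome; similar rates suggest a weak one.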

After several trials, I observed that removing incomplete rows from the train/test set had a better effect on the prediction than trying to create "synthetic data" to fill the gaps. However, for the set we want to predict, we will fill the gaps with slightly more sophisticated synthetic data.

In [10]:
# variables
variables = ["Pclass", "Sex", "Embarked", "SibSp", "Parch", "Fare", "Age"]

In [11]:
# define a DF with the subset of variables that we care about - we will clean this up and use to feed the model
data_w = pd.DataFrame(data, columns = ["Survived"] + variables)
data_w.head(10)

In [12]:
data_w.shape

In [13]:
# Note that "Age" has several NaN as value -- we need to sort this out
data_w.dropna(axis = 0).shape

Most of the NaNs (but not all) come from the column "Age". We will simply drop the NaN outright and train the model with the resulting set.
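
To see which columns the NaNs come from before dropping anything, one option is to count them per column. A minimal sketch on a made-up frame (`toy` is illustrative; on the real data the same calls would be applied to `data_w`):

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 53.10],
})

nan_counts = toy.isna().sum()            # number of missing values per column
rows_kept = toy.dropna(axis=0).shape[0]  # rows with no missing value at all
print(nan_counts)
print("rows kept:", rows_kept)
```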

In [14]:
data_w = data_w.dropna(axis = 0)

In [15]:
data_w.shape

### Prediction data

In [16]:
# This is the data for prediction
data_p = pd.read_csv("test.csv")   # we want to predict ("p") based on this dataset
data_p.shape

In [17]:
# many NaN in this set
data_p.dropna().shape

In [18]:
data_p.head()

In [19]:
data_w_p = pd.DataFrame(data_p, columns = variables)
data_w_p.head()

In [20]:
data_w_p.shape

In [21]:
data_w_p.dropna().shape

The simplest way to clean up the NaNs is to get the means and fill the NaNs with them.

This probably limits the maximum accuracy, but we have to live with it.
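
A small sketch of what the mean-filling does, on a made-up frame (`toy` is illustrative). Note that a dictionary of numeric means only covers numeric columns, so any gap in a non-numeric column would remain:

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0],
    "Fare": [10.0, 30.0, np.nan],
    "Embarked": ["S", None, "C"],
})

means = toy.mean(numeric_only=True).to_dict()  # means of the numeric columns only
filled = toy.fillna(value=means)               # each column is filled with its own mean
print(means)
print(filled)
```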

In [22]:
data_w_p_means = data_w_p.mean(numeric_only = True).to_dict()
data_w_p_means

In [23]:
data_w_p = data_w_p.fillna(axis = 0, value = data_w_p_means)

At this point, both train/test and prediction sets are ready to be used (free of NaN)

## How different / similar are the train/test set and the prediction set?

In [24]:
data_w[variables].describe()

In [25]:
data_w_p.describe()

The distributions are all very similar (from simple inspection of mean, std, etc.) - however, it stands out that SibSp and Parch show some marked difference between the 2 datasets.

This could be a moment to use KL Divergence to quantify the "closeness" of these distributions. However, we will not embark on this more complex work here.
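
For readers who want to try it anyway, here is a rough sketch of a discrete KL divergence between the histograms of two samples. Everything in it is an assumption for illustration: the helper name `kl_divergence`, the small smoothing constant `eps` that avoids division by zero, and the synthetic integer samples standing in for a column such as SibSp.

```python
import numpy as np

def kl_divergence(sample_p, sample_q, bins, eps=1e-9):
    """Discrete KL divergence D(P || Q) between the histograms of two samples."""
    p, _ = np.histogram(sample_p, bins=bins)
    q, _ = np.histogram(sample_q, bins=bins)
    # Smooth and normalize the counts so both are valid probability vectors
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
a = rng.integers(0, 4, size=500)  # synthetic sample, values 0..3
b = rng.integers(0, 4, size=500)  # same distribution as a
c = rng.integers(0, 8, size=500)  # wider distribution, values 0..7
bins = np.arange(0, 10) - 0.5     # one bin per integer value 0..8

print("similar samples:  ", kl_divergence(a, b, bins))
print("different samples:", kl_divergence(a, c, bins))
```

Values near zero indicate similar distributions; larger values indicate a bigger mismatch. The measure is not symmetric, so D(P || Q) and D(Q || P) generally differ.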

## Modelling

Since we are dealing with a classification problem (survive / does not survive) we will compare the following models from scikit-learn:

- Logistic Regression
- Support Vector Machine
- Decision Tree
- K Nearest Neighbors
- Multilayer Perceptron (neural network)

We will use a grid search to fine-tune the hyper-parameters.

### First we scale the data and prepare it to be used directly in the algorithms

In [26]:
Y = data_w["Survived"].to_numpy(copy=True)

In [27]:
# in this dataset, what is the probability of survival? 
Y.mean()

In [28]:
Y.shape

In [29]:
# create a dataframe that we will use as variables
X_1 = data_w[variables]

In [30]:
X_1.shape

Note that the number of rows (first value in shape) in Y and X_1 must match

In [31]:
# prepare the input DF that convert category variables into booleans
X_2 = pd.get_dummies(X_1)

In [32]:
X_2.head()
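
As a small illustration of what `pd.get_dummies` does, the sketch below encodes a made-up frame (`toy`): numeric columns pass through unchanged, while each text column is replaced by one indicator column per category.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

encoded = pd.get_dummies(toy)
print(list(encoded.columns))
print(encoded.astype(float))  # cast to float, as done for the real data
```

Only categories that actually occur in the frame get a column, which matters when two datasets are encoded separately.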

Standardize the data in <code>X</code> then reassign it to the variable  <code>X</code> using the transform provided below.


In [33]:
# make sure all variables are floats
X = X_2.astype(float) 
X.head()

In [34]:
# final step for X: center and scale it so we can feed it into the models without worrying about unit variance.
# keep it in another variable to check it works before assigning it to X
X_new = preprocessing.StandardScaler().fit(X).transform(X)
X_new[0:2]

In [35]:
X = X_new.copy()
type(X)

In [36]:
X.shape

We split the data into training and testing data using the function <code>train_test_split</code>. The training data is then further divided by cross-validation into training and validation folds; the models are trained and the hyperparameters are selected using the function <code>GridSearchCV</code>.

The training data and test data should be assigned to the following labels:

<code>X_train, X_test, Y_train, Y_test</code>


In [37]:
# Here we split 80/20, use as global variables when calling functions
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=2)
print ('Train set:', X_train.shape,  Y_train.shape)
print ('Test set:', X_test.shape,  Y_test.shape)

In [38]:
# random_state in train_test_split was chosen so that the average survival rate in each part is roughly the same as in the full dataset
print(Y.mean())
print(Y_train.mean())
print(Y_test.mean())
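
Instead of picking a `random_state` that happens to balance the classes, `train_test_split` also accepts a `stratify` argument that preserves the class proportions by construction. A minimal sketch on made-up data (`X_toy` and `Y_toy` are illustrative, not the Titanic arrays):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% positives
X_toy = np.arange(200).reshape(100, 2)
Y_toy = np.array([1] * 40 + [0] * 60)

X_tr, X_te, Y_tr, Y_te = train_test_split(
    X_toy, Y_toy, test_size=0.2, random_state=2, stratify=Y_toy
)

# The positive rate is the same in the full set and in both parts
print(Y_toy.mean(), Y_tr.mean(), Y_te.mean())
```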

In [39]:
# define a function that we will use to fit the models to the data
def fitting(model, params):
    model_cv = GridSearchCV(model, params, cv = 10, verbose = 1)
    model_cv.fit(X_train, Y_train)
    print("tuned hyperparameters :(best parameters) ",model_cv.best_params_)
    print("accuracy :",model_cv.best_score_)
    return(model_cv)
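
This notebook scales `X` once, before splitting. A common alternative, sketched here on synthetic data from `make_classification` and not what this notebook does, is to put the scaler inside a `Pipeline`, so that each cross-validation fold fits the scaler only on its own training part. The step names `scale` and `clf` are arbitrary choices for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
X_toy, Y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>"
param_grid = {"clf__C": [0.05, 1.0, 1.5]}

search = GridSearchCV(pipe, param_grid, cv=10)
search.fit(X_toy, Y_toy)
print(search.best_params_, search.best_score_)
```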

At this point, we feed Y and X into different models, then train and test them. In the end, we compare the models to choose the one that we will use to submit our predictions.

### Logistic Regression

In [40]:
lr = LogisticRegression()
parameters = {"C": [0.05, 1.0, 1.5], 'penalty': ["l2"], 'solver': ['lbfgs']}  # l1 = lasso, l2 = ridge

logreg_cv = fitting(lr, parameters)

Calculate the accuracy on the test data using the method <code>score</code>:


In [41]:
logreg_cv_scr, logreg_cv_F1 = overw(X_test, Y_test, logreg_cv.predict(X_test), logreg_cv)
print("score = ", logreg_cv_scr, "  F1 = ", logreg_cv_F1)

### Support Vector Machine


In [42]:
svm = SVC()
parameters = {'C': np.array([25, 33, 38]),  # need to play with it to get to best results
              'gamma':np.array([0.01, 0.04, 0.06])}
# use the default rbf kernel

svm_cv = fitting(svm, parameters)

In [43]:
svm_cv_scr, svm_cv_F1 = overw(X_test, Y_test, svm_cv.predict(X_test), svm_cv)
print("score = ", svm_cv_scr, "  F1 = ", svm_cv_F1)

### Decision Tree


In [44]:
tree = DecisionTreeClassifier()

parameters = {'criterion': ['gini', 'entropy'],
     'splitter': ['best', 'random'],
     'max_depth': [2*n for n in range(1,10)],
     'max_features': ['sqrt'],  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
     'min_samples_leaf': [1, 2, 4],
     'min_samples_split': [2, 5, 10]}

tree_cv = fitting(tree, parameters)

In [45]:
tree_cv_scr, tree_cv_F1 = overw(X_test, Y_test, tree_cv.predict(X_test), tree_cv)
print("score = ", tree_cv_scr, "  F1 = ", tree_cv_F1)

### K nearest neighbor


In [46]:
KNN = KNeighborsClassifier()

parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1,2]}

knn_cv = fitting(KNN, parameters)
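
To get a feeling for how much work a grid search does, the number of candidate combinations can be counted with `ParameterGrid`. The sketch below repeats the KNN grid from above; the factor 10 is the `cv = 10` used in `fitting`.

```python
from sklearn.model_selection import ParameterGrid

parameters = {'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'p': [1, 2]}

n_candidates = len(ParameterGrid(parameters))  # 10 * 4 * 2 combinations
n_fits = n_candidates * 10                     # one fit per candidate and fold
print(n_candidates, "candidates,", n_fits, "cross-validation fits")
```

The count grows multiplicatively with every parameter added to the grid, which is why larger grids quickly become slow.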

In [47]:
knn_cv_scr, knn_cv_F1 = overw(X_test, Y_test, knn_cv.predict(X_test), knn_cv)
print("score = ", knn_cv_scr, "  F1 = ", knn_cv_F1)

### Neural Network (perceptron)

In [48]:
nn = MLPClassifier()
parameters ={'solver':['lbfgs', 'adam']}

nn_cv = fitting(nn, parameters)

Calculate the accuracy on the test data using the method <code>score</code>:


In [49]:
nn_cv_scr, nn_cv_F1 = overw(X_test, Y_test, nn_cv.predict(X_test), nn_cv)
print("score = ", nn_cv_scr, "  F1 = ", nn_cv_F1)

## Compare the models


In [50]:
metr_dict = {'Model':['KNN', 'Tree', 'LR', 'SVM', 'NeuralN'],
            'Best score':[knn_cv.best_score_, tree_cv.best_score_, logreg_cv.best_score_, svm_cv.best_score_, nn_cv.best_score_],
             'Score': [knn_cv_scr, tree_cv_scr, logreg_cv_scr, svm_cv_scr, nn_cv_scr],
             'F1' : [knn_cv_F1, tree_cv_F1, logreg_cv_F1, svm_cv_F1, nn_cv_F1]
            }

In [51]:
Report = pd.DataFrame() # ensure it is clear
Report = pd.DataFrame.from_dict(metr_dict)
Report

In [52]:
fig, axa = plt.subplots(ncols = 2, figsize = (8,3), sharey = True)
fig.tight_layout(pad=2.0)

sns.barplot(data=Report, x="Model", y="F1", ax = axa[1])

sns.barplot(data=Report, x="Model", y="Score", ax = axa[0])

axa[0].set_ylim(0.5, 1)

plt.show()

## Time to predict...

In [53]:
X_1_p = data_w_p
X_2_p = pd.get_dummies(X_1_p)
X_p = X_2_p.astype(float) # make sure all numbers are float
X_p.head()

In [54]:
X_p.shape

In [55]:
# as before: center and scale it so we can feed it into the models without worrying about unit variance.
X_p_new = preprocessing.StandardScaler().fit(X_p).transform(X_p)
X_p_new[0:2]

In [56]:
X_p = X_p_new.copy()
type(X_p)
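
Here the prediction set is scaled with its own mean and standard deviation. An alternative worth knowing, sketched below on made-up frames and not what this notebook does, is to reuse the scaler fitted on the training data and to align the dummy columns, so that both sets go through exactly the same transformation. The names `train_toy` and `pred_toy` are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up frames, for illustration only
train_toy = pd.DataFrame({"Fare": [10.0, 20.0, 30.0], "Embarked": ["S", "C", "Q"]})
pred_toy = pd.DataFrame({"Fare": [20.0, 40.0], "Embarked": ["S", "S"]})

X_train_toy = pd.get_dummies(train_toy).astype(float)

# Give the prediction frame exactly the training columns, in the same order;
# categories absent from pred_toy become all-zero columns
X_pred_toy = (pd.get_dummies(pred_toy)
              .reindex(columns=X_train_toy.columns, fill_value=0)
              .astype(float))

scaler = StandardScaler().fit(X_train_toy)    # statistics come from the training data only
X_pred_scaled = scaler.transform(X_pred_toy)  # same shift and scale applied to the new data
print(X_pred_scaled.shape)
```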

### Predictions from the models

In [57]:
# query a model and get its prediction ready to save for upload
def pred(mod, data_p, X_p):
    Y_p = mod.predict(X_p)
    # prepare the DF with the predictions
    subm = pd.concat([pd.DataFrame(data_p, columns=["PassengerId"]), pd.DataFrame({'Survived': Y_p})], axis=1)
    print("Survival rates:")
    print("Total ensemble = ", Y.mean())
    print("Train = ", Y_train.mean())
    print("Test = ", Y_test.mean())
    print("Prediction = ", Y_p.mean())
    return(subm)

In [58]:
subm_logreg = pred(logreg_cv, data_p, X_p)

In [59]:
subm_svm = pred(svm_cv, data_p, X_p)

# NB: this one scored the highest accuracy with 78.8% (as reported by Kaggle after submission)

In [60]:
subm_tree = pred(tree_cv, data_p, X_p)

In [61]:
subm_knn = pred(knn_cv, data_p, X_p)

In [62]:
subm_NeuralN = pred(nn_cv, data_p, X_p)

In [63]:
# choose a model to generate a submission

subm = subm_svm  # model
number = 1 # number of this "trial", e.g. 20

subm.to_csv(f"submission_{number}.csv", header = True, index = False)
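
Before uploading, it can be worth reading the file back and checking its shape. The sketch below does this on a made-up three-row frame written to an in-memory buffer. It assumes that a submission should contain exactly the two columns `PassengerId` and `Survived`, as produced by `pred` above.

```python
import io
import pandas as pd

# Made-up submission, for illustration only
subm_toy = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})

# Write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
subm_toy.to_csv(buffer, header=True, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

columns_ok = list(check.columns) == ["PassengerId", "Survived"]
no_gaps = not check.isna().any().any()
labels_ok = set(check["Survived"].unique()) <= {0, 1}
print(columns_ok, no_gaps, labels_ok, len(check))
```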