# Titanic ML Competition

## Introduction

In this notebook we will be exploring the data for the Titanic machine learning competition from [Kaggle](https://www.kaggle.com/c/titanic/overview). The goals of the notebook are to:
* Better understand the data.
* See data relationships.
* Determine if there are patterns within the data.
* See what sorts of people were more likely to survive the disaster.

After this, we will create ML models to make predictions.

*Introduction from Kaggle*
```
The sinking of the Titanic is one of the most infamous shipwrecks in history.

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (ie name, age, gender, socio-economic class, etc).
```
<center>
    <img src='https://miro.medium.com/max/2000/1*fBkTkunRJ88FdEXEcGU_fg.jpeg' height=400 width=400>
</center>

## Imports

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import squarify

In [None]:
from math import pi

In [None]:
from preprocessing import encodeDataset
from preprocessing import encodeAndNormalizeData

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import GridSearchCV

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

## Importing the dataset

In [None]:
titanic_df = pd.read_csv("Datasets/train.csv")
titanic_df.head()

In [None]:
titanic_df.info()

From the output above we can see the data types that pandas assigned to each column and that some columns have missing values. Before working with the data, we have to clean it.

### Renaming of the columns & changing index

Here we rename some columns whose names are not very meaningful (for instance, SibSp or Parch). We also make the passenger ID the index, since it uniquely identifies each row.

In [None]:
renamed_columns = {"Pclass":"Economic status","SibSp":"Number of siblings/spouses","Parch":"Number of parents/children"}
titanic_df.rename(columns=renamed_columns,inplace = True)

In [None]:
# Passing the column name moves PassengerId into the index and drops the column in one step
titanic_df.set_index("PassengerId", inplace=True)

### Data types

Now we will convert some columns to more appropriate data types, which will make them easier to work with later. Additionally, this reduces the memory usage of the dataset.

The columns with object data type are candidates to be of categorical type. For this we check the cardinality they have.

In [None]:
titanic_df.select_dtypes(include=['object']).nunique()

In [None]:
titanic_df["Sex"] = titanic_df["Sex"].astype("category")
titanic_df["Embarked"] = titanic_df["Embarked"].astype("category")

Another column that could be categorical is the one representing the economic status.

In [None]:
titanic_df["Economic status"] = titanic_df["Economic status"].astype("category")

Conversions are not only about categorical data types: we can also convert numerical columns that use 64 bits to smaller sizes (such as 8, 16, or 32 bits). By doing this we can reduce the memory usage of the dataset even further.

Before converting SibSp/Parch to a smaller integer type, we have to check the maximum value they hold. *(The maximum value of a signed int8 is 127.)*

In [None]:
titanic_df[["Number of siblings/spouses","Number of parents/children"]].max()

Both maxima fit in an int8, so we can make the conversions.

In [None]:
titanic_df["Number of siblings/spouses"] = titanic_df["Number of siblings/spouses"].astype("int8")
titanic_df["Number of parents/children"] = titanic_df["Number of parents/children"].astype("int8")

In [None]:
titanic_df["Survived"] = titanic_df["Survived"].astype("int8")
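
The range check and the conversions above can also be written with `np.iinfo` and `pd.to_numeric(..., downcast=...)`. The following is only a sketch on a small made-up frame: `toy_df` and its values are assumptions for illustration, not the competition data.

```python
import numpy as np
import pandas as pd

# Made-up data standing in for the SibSp/Parch columns
toy_df = pd.DataFrame({"SibSp": [1, 0, 3, 8], "Parch": [0, 0, 2, 6]}, dtype="int64")

# Representable range of a signed 8-bit integer
int8_limits = np.iinfo(np.int8)
fits_int8 = (toy_df.max() <= int8_limits.max) & (toy_df.min() >= int8_limits.min)

# downcast="integer" picks the smallest integer dtype that can hold the values
downcast_df = toy_df.apply(pd.to_numeric, downcast="integer")

# Compare the memory used by the columns before and after the conversion
bytes_before = toy_df.memory_usage(index=False).sum()
bytes_after = downcast_df.memory_usage(index=False).sum()
print(fits_int8.all(), downcast_df.dtypes.tolist(), bytes_before, bytes_after)
```

Letting pandas choose the dtype avoids hard-coding `int8`, at the cost of being less explicit about the resulting type.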

### Missing values

Handling missing values is one of the most important steps in data analysis. Let's see what we have.

In [None]:
titanic_df.isnull().sum()

Let's start with the missing Embarked values. There are only two, and the missing information can be found online with some research, so we can fill them in with the real values. If there were more, we could use the mode instead.

In [None]:
embarked_is_null = titanic_df["Embarked"].isnull()
titanic_df.loc[embarked_is_null]

In [None]:
titanic_df.loc[62,"Embarked"] = "S"
titanic_df.loc[830,"Embarked"] = "S"
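
For reference, the mode-based fallback mentioned above could look like the sketch below. The `ports` series is made up for illustration and is not the real Embarked column.

```python
import pandas as pd

# Made-up port codes with two gaps (None becomes NaN)
ports = pd.Series(["S", "C", None, "S", "Q", None, "S"], dtype="category")

# mode() returns a Series because ties are possible; take the first entry
most_common_port = ports.mode().iloc[0]

# The fill value is already one of the categories, so the dtype is preserved
ports_filled = ports.fillna(most_common_port)
print(most_common_port, ports_filled.isnull().sum())
```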

For the cabin we have a lot of missing values, so for now we are going to mark them with a '-'.

In [None]:
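# Assign the result back: calling fillna(..., inplace=True) on a selected column is unreliable in recent pandas
titanic_df["Cabin"] = titanic_df["Cabin"].fillna("-")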

For the age we can use the title to predict it, so we split the name and use the mean age for that title.

In [None]:
titanic_df['Surname'] = titanic_df['Name'].str.split(', ', expand=True)[0]
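# A one-character separator is matched literally; '. ' would be read as a regex (any character followed by a space)
titanic_df['Title'] = titanic_df['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]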
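
The surname and title can also be pulled out in one pass with a regular expression. This is a minimal sketch, assuming every name follows the "Surname, Title. Given names" pattern; the sample names are written in the dataset's format and `parts` is a hypothetical variable name.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Surname: everything before the comma. Title: everything up to the first period.
parts = names.str.extract(r"^(?P<Surname>[^,]+),\s*(?P<Title>[^.]+)\.")
print(parts)
```

Named groups become the column names of the result, so the two columns can be assigned to the DataFrame directly.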

Before filling them in, we flag the rows with a missing age so that we can give them less weight later if we want.

In [None]:
titanic_df["Completed age"] = titanic_df["Age"].isnull()

In [None]:
title_count = titanic_df["Title"].value_counts()
title_count

We can see that there are some uncommon titles. Let's check whether any passenger with one of these titles has a missing age.

In [None]:
uncommon_titles = title_count[titanic_df["Title"]] < 8
uncommon_titles.index = titanic_df.index

In [None]:
null_age = titanic_df["Age"].isnull()

In [None]:
titanic_df.loc[uncommon_titles & null_age,["Title","Age"]]

Only one passenger with an uncommon title (Dr.) has a missing age, and there are other passengers with the same title whose ages we can use to estimate it.

In [None]:
age_by_title = titanic_df.groupby(by="Title")["Age"].agg("mean")

In [None]:
age_for_nan = age_by_title[titanic_df.loc[null_age,"Title"]]
age_for_nan.index = titanic_df[null_age].index

In [None]:
titanic_df.loc[null_age,"Age"] = age_for_nan
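
The same imputation can be written more compactly with `groupby(...).transform`, which returns the group mean already aligned with the original rows. A small sketch on made-up data (`toy_df` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "Title": ["Mr", "Mr", "Mr", "Miss", "Miss", "Dr", "Dr"],
    "Age": [30.0, 40.0, np.nan, 10.0, np.nan, 50.0, np.nan],
})

# Mean age of each title, broadcast back to every row of that title
mean_age_per_row = toy_df.groupby("Title")["Age"].transform("mean")

# Only the missing ages are replaced; known ages stay as they are
toy_df["Age"] = toy_df["Age"].fillna(mean_age_per_row)
print(toy_df)
```

Because the result keeps the original index, no manual re-indexing is needed before assigning it.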

Now we group the uncommon titles under a single "Other" category, which helps prevent overfitting later.

In [None]:
titanic_df.loc[uncommon_titles,"Title"] = "Other"
titanic_df['Title'] = titanic_df['Title'].astype("category")

Finally, we check whether any missing values remain.

In [None]:
titanic_df.isnull().any().any()

### Feature Engineering

*Feature engineering is the process of using domain knowledge to extract features from raw data via data mining techniques.*

Now that we have explored many of the features of the dataset, we are going to create new features that may prove useful to solve the problem. For example, we can categorize the age or create a new variable with the total number of family members.

In [None]:
titanic_df["Family size"] = titanic_df["Number of parents/children"] + titanic_df["Number of siblings/spouses"] + 1 # To count the passenger

In [None]:
titanic_df["Discrete age"] = pd.cut(titanic_df["Age"],bins=range(0,85,5))

In [None]:
def categorizeAge(age):
    if age < 13:
        return "Child"
    if age < 24:
        return "Youth"
    if age < 64:
        return "Adult"
    return "Senior"

In [None]:
titanic_df["Categorized age"] = titanic_df["Age"].apply(categorizeAge)
titanic_df["Categorized age"] = titanic_df["Categorized age"].astype("category")
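
A vectorized alternative to applying `categorizeAge` row by row is `pd.cut` with explicit labels. The sketch below uses the same thresholds on a made-up `ages` series:

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 12.0, 13.0, 23.9, 24.0, 63.0, 64.0, 80.0])

# right=False makes each bin closed on the left, matching the "age < limit" checks
age_groups = pd.cut(
    ages,
    bins=[0, 13, 24, 64, np.inf],
    labels=["Child", "Youth", "Adult", "Senior"],
    right=False,
)
print(age_groups.tolist())
```

One difference to keep in mind: `pd.cut` leaves a missing age as missing, whereas `categorizeAge` would return "Senior" for a NaN because every comparison with NaN is false.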

In [None]:
titanic_df.head(3)

As the fare is given per ticket, and a ticket may be shared by a group, we can create a new variable that estimates the price paid per person.

In [None]:
ticket_values = titanic_df["Ticket"].value_counts()
def ticketCount(ticket):
    return ticket_values[ticket]

In [None]:
titanic_df["Ticket size"] = titanic_df["Ticket"].apply(ticketCount)

In [None]:
titanic_df["Individual fare"] = titanic_df["Fare"] / titanic_df["Ticket size"]
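
The per-person fare computation can be checked on a tiny made-up example. Here `Series.map` with the value counts replaces the helper function; `toy_df` and its values are assumptions for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Ticket": ["A1", "A1", "B2", "C3", "C3", "C3"],
    "Fare": [20.0, 20.0, 7.5, 30.0, 30.0, 30.0],
})

# How many rows share each ticket, looked up for every row
ticket_sizes = toy_df["Ticket"].value_counts()
toy_df["Ticket size"] = toy_df["Ticket"].map(ticket_sizes)

# Split the ticket fare evenly among the passengers on that ticket
toy_df["Individual fare"] = toy_df["Fare"] / toy_df["Ticket size"]
print(toy_df)
```

Note that the count only sees passengers present in the loaded file, so a group split between the training and test files would be undercounted.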

In [None]:
titanic_df.tail(3)

### Data conversions

Here we change some of the texts the data has, so that they are more meaningful.

In [None]:
new_economic_status_names = {1:"Upper",2:"Middle",3:"Lower"}
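# rename_categories returns a new categorical; the inplace argument was removed in pandas 2.0
titanic_df["Economic status"] = titanic_df["Economic status"].cat.rename_categories(new_economic_status_names)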

In [None]:
new_port_names = {"C":"Cherbourg","Q":"Queenstown","S":"Southampton"}
titanic_df["Embarked"] = titanic_df["Embarked"].cat.rename_categories(new_port_names)

In [None]:
new_sex = {"male":"Male","female":"Female"}
titanic_df["Sex"] = titanic_df["Sex"].cat.rename_categories(new_sex)

### Reordering columns

Finally, we order the dataset in a more relevant way.

In [None]:
personal_info = ["Surname","Title","Name","Sex","Age","Completed age","Discrete age","Categorized age"]
economic_status = ["Economic status","Fare","Individual fare"]
family = ["Number of siblings/spouses","Number of parents/children","Family size"]
journey = ["Cabin","Embarked","Ticket","Survived"]
new_order = personal_info + economic_status + family + journey
titanic_df = titanic_df.reindex(columns = new_order)

### Dataset after cleaning

In [None]:
titanic_df.info()

In [None]:
titanic_df.head()

## Exploratory Data Analysis

First, let's start by asking some simple questions that will get us closer to the question that matters. What sorts of people were more likely to survive?
* How many survived?
* How much does the sex determine the chances of survival?
* What about the age?
* Does the economic status help to determine it?

### How many survived the disaster?

In [None]:
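# sort_index() guarantees the order 0 (died), 1 (survived) so that it matches the labels below
survival_values = titanic_df['Survived'].value_counts().sort_index()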
names = ['Died',"Survived"]
plt.figure(figsize=(10, 5), dpi=100)
 
plt.subplot2grid(shape=(1,2),loc=(0,0))
plt.bar(x=survival_values.index,height=survival_values.values,color=['lightcoral', 'lightgreen'])
plt.xticks(survival_values.index,names)
plt.title("Amount of people that survived")

plt.subplot2grid(shape=(1,2),loc=(0,1))
plt.pie(survival_values, labels=names,colors=['lightcoral', 'lightgreen'], autopct='%1.0f%%')
plt.title("Proportion of people that survived")

plt.suptitle('Survival numbers')
plt.show()

We can see that about 38% of the passengers in this dataset survived.

### How much does the sex affect the chances of survival?

Let's begin by looking at the proportion of passengers of each sex.

In [None]:
sex_proportions = titanic_df["Sex"].value_counts()
circle=plt.Circle( (0,0), 0.7, color='white')
plt.figure(dpi=80)
plt.pie(sex_proportions.values, labels=sex_proportions.index, colors=['goldenrod','salmon'],autopct='%1.0f%%')
p=plt.gcf()
p.gca().add_artist(circle)
plt.suptitle('Proportion of passengers')
plt.show()

Now let's see the survival rate.

In [None]:
survival_by_sex = titanic_df.groupby(by="Sex")["Survived"].agg("mean")
survival_by_sex

In [None]:
plt.figure(figsize=(4, 4), dpi=100)
plt.bar(x=survival_by_sex.index,height=survival_by_sex.values,color=['palevioletred', 'cadetblue'])
plt.ylim(top=1)
plt.title("Proportions of people that survived by sex")

We can clearly see that women proportionally had a greater survival rate than men.

### What about the age?

Can we see a pattern by exploring the age? Let's try to see if the children were more likely to survive.

In [None]:
# histplot replaces the deprecated distplot; it draws a plain histogram by default
sns.histplot(data=titanic_df, x="Age")
plt.title("Age distribution")
plt.show()

We can see that most of the people on the ship were adults, followed by children and then by the elderly.

In [None]:
plt.figure(figsize=(15, 5), dpi=100)
plt.subplot2grid(shape=(1,2),loc=(0,0))
g = sns.barplot(x='Discrete age', y='Survived', data=titanic_df)
g.tick_params('x',labelrotation=35)
plt.title("Survival by grouped ages")
    
plt.subplot2grid(shape=(1,2),loc=(0,1))
sns.violinplot(x=titanic_df["Survived"], y=titanic_df["Age"])
plt.title("Age distribution and survival")
 
plt.show()

In both graphs we can see that children were more likely to survive.

In [None]:
children = titanic_df["Age"] < 13
survived = titanic_df["Survived"] == 1
survived_age = titanic_df.loc[survived & children,"Age"]
died_age = titanic_df.loc[(~survived) & children,"Age"]

In [None]:
survived_age.count()

In [None]:
died_age.count()

### Does the economic status help to determine it?

In [None]:
total_class_members = titanic_df["Economic status"].value_counts(normalize=True).round(2)

In [None]:
survival_by_status = titanic_df.groupby(by="Economic status")["Survived"].value_counts(normalize=True).round(2)
survival_by_status = survival_by_status.unstack().reset_index()
survival_by_status.index = survival_by_status["Economic status"]
survival_by_status = survival_by_status.drop(columns="Economic status")

In [None]:
renamed_columns = {0:"Died",1:"Survived"}
survival_by_status = survival_by_status.rename(columns=renamed_columns)
survival_by_status.columns.name = None
survival_by_status
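
A table of this shape can also be built in a single call with `pd.crosstab` and `normalize="index"`. The sketch below runs on made-up rows (`toy_df` is an assumption), not on the real dataset:

```python
import pandas as pd

toy_df = pd.DataFrame({
    "Economic status": ["Upper", "Upper", "Middle", "Middle", "Lower", "Lower", "Lower", "Lower"],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 1],
})

# normalize="index" turns each row into proportions that sum to 1
survival_table = pd.crosstab(
    toy_df["Economic status"], toy_df["Survived"], normalize="index"
).rename(columns={0: "Died", 1: "Survived"})
survival_table.columns.name = None
print(survival_table)
```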

In [None]:
plt.figure(figsize=(10, 5), dpi=100)
 
plt.subplot2grid(shape=(1,2),loc=(0,0))
squarify.plot(sizes=total_class_members,value=total_class_members, label=total_class_members.index, alpha=.8,color=['sienna', 'gold','lawngreen'])
plt.axis('off')  
plt.title("% of people according to economic status")   
    
plt.subplot2grid(shape=(1,2),loc=(0,1))
plt.bar(x=survival_by_status.index,height=survival_by_status["Survived"].values,color=['gold', 'lawngreen','sienna'])
plt.ylim(top=1)
plt.title("% of people that survived by class")

plt.show()

In the graphs we can see that lower-class passengers were more likely to die than upper-class passengers.

### What happened to the families?

Here we will see if people with families had more chances of survival.

The dataset includes 2 attributes which count the number of members of each group.
* Number of siblings/spouses (sibsp):
    * Sibling = brother, sister, stepbrother, stepsister
    * Spouse = husband, wife (mistresses and fiancés were ignored)

* Number of parents/children (parch):
    * Parent = mother, father
    * Child = daughter, son, stepdaughter, stepson
    
*Note: Some children travelled only with a nanny, therefore parch=0 for them.*

In [None]:
titanic_df["Number of siblings/spouses"].value_counts()

In [None]:
titanic_df["Number of parents/children"].value_counts()

We can see that the majority of the passengers traveled alone.

In [None]:
new_name = {"Number of siblings/spouses":"Total"}
sibsp = titanic_df.groupby(by="Number of siblings/spouses")["Survived"].value_counts(normalize=True).to_frame().unstack()
sibsp = sibsp.fillna(0)
sibsp.columns = sibsp.columns.get_level_values(1)
sibsp.columns.name = None
sibsp = sibsp.rename(columns={0:"Died",1:"Survived"}) 
sibsp

In [None]:
bar_width = 0.3
start_first_bar = sibsp.index-bar_width/2
start_second_bar = sibsp.index+bar_width/2

plt.figure(figsize=(10, 4), dpi=100)
plt.bar(start_first_bar, sibsp["Died"], width = bar_width, color = 'lightcoral', edgecolor = 'black', label='Died')
plt.bar(start_second_bar, sibsp["Survived"], width = bar_width, color = 'lightgreen', edgecolor = 'black',  label='Survived')
plt.xticks(sibsp.index)
plt.ylabel('%')
plt.legend()
plt.title("Survival by number of siblings/spouses")
plt.show()

In the graph we can see that those with one or two siblings/spouses seem more likely to survive. Let's see now with the parents and children.

In [None]:
new_name = {"Number of parents/children":"Total"}
parch = titanic_df.groupby(by="Number of parents/children")["Survived"].value_counts(normalize=True).to_frame().unstack()
parch = parch.fillna(0)
parch.columns = parch.columns.get_level_values(1)
parch.columns.name = None
parch = parch.rename(columns={0:"Died",1:"Survived"}) 
parch

In [None]:
bar_width = 0.3
start_first_bar = parch.index-bar_width/2
start_second_bar = parch.index+bar_width/2

plt.figure(figsize=(10, 4), dpi=100)
plt.bar(start_first_bar, parch["Died"], width = bar_width, color = 'lightcoral', edgecolor = 'black', label='Died')
plt.bar(start_second_bar, parch["Survived"], width = bar_width, color = 'lightgreen', edgecolor = 'black',  label='Survived')
plt.xticks(parch.index)
plt.ylabel('%')
plt.legend()
plt.title("Survival by number of parents/children")
plt.show()

In this case we can see again that those with a family were more likely to survive than those who went alone.

### Is there any relationship with the embarkation port?

In this section we will see whether people who embarked at a certain port were more likely to survive than those from the other ports.

In [None]:
embark_ports = titanic_df["Embarked"].value_counts(normalize=True).round(2)

In [None]:
survival_by_ports = titanic_df.groupby(by="Embarked")["Survived"].agg("mean")

In [None]:
plt.figure(figsize=(10, 4), dpi=100)
plt.subplot2grid(shape=(1,2),loc=(0,0))
squarify.plot(sizes=embark_ports,value=embark_ports, label=embark_ports.index, alpha=.8,color=['darkslategrey', 'slategrey','cornflowerblue'])
plt.axis('off')  
plt.title("% of people from each port")   

plt.subplot2grid(shape=(1,2),loc=(0,1))
plt.bar(survival_by_ports.index, survival_by_ports.values, color = ['slategrey', 'cornflowerblue','darkslategrey'], alpha = 0.6)
plt.ylim(top=1)
plt.ylabel('%')
plt.title("Survival by port of embarkation")
plt.show()

We can see that those who embarked in Cherbourg had slightly better chances, though we have to take into account that Southampton was the port from which most of the people embarked.

### Is there any relationship between age/fare and the survival?

Now that we have seen each individual variable on its own, let's start to look for correlations between them.

In [None]:
sns.lmplot( x='Age', y='Fare',data=titanic_df,fit_reg=False, hue='Survived', legend=True,height=7,aspect=1.5,markers=["x","o"],palette=['red','green'])
plt.title("Age/Fare with survival")
plt.show()

In this graph we can see that those who paid a higher fare were more likely to survive. Let's see if we can get a better view by making a graph for each sex.

In [None]:
columns = ["Age","Fare","Survived"]
is_male = titanic_df["Sex"] == "Male"
fare_limit = titanic_df["Fare"] < 300 # So that we can see better the lower values
age_fare_male = titanic_df.loc[is_male & fare_limit,columns]
age_fare_female = titanic_df.loc[(~is_male) & fare_limit,columns]

In [None]:
sns.lmplot( x='Age', y='Fare',data=age_fare_male,fit_reg=False, hue='Survived', legend=True,height=7,aspect=1.5,markers=["x","o"],palette=['red','green'])
plt.title("Age/Fare with survival, male")
plt.show()

In [None]:
sns.lmplot( x='Age', y='Fare',data=age_fare_female,fit_reg=False, hue='Survived', legend=True,height=7,aspect=1.5,markers=["x","o"],palette=['red','green'])
plt.title("Age/Fare with survival, female")
plt.show()

These graphs further confirm what we saw before: women were more likely to survive. The same holds for the children.

Let's try to draw the boundaries of the socio-economic status by looking at different numerical values.

In [None]:
titanic_df.groupby(by="Economic status")["Fare"].agg(["min","mean","max","std"]).round(2)

Seeing that the minimum fare was 0 for every economic status, I decided to take a look at these particular cases.

In [None]:
titanic_df.loc[titanic_df["Fare"] == 0]

By investigating a little, it turned out that these people boarded the ship without paying a fare for different reasons. For example, Andrews, Mr. Thomas Jr. was the naval architect for the ship, while other passengers were part of the ["guarantee group"](https://www.encyclopedia-titanica.org/titanic-guarantee-group/).


Let's continue without taking into account these cases.

In [None]:
passengers_with_fare = titanic_df.loc[titanic_df["Fare"]>0]
fare_range_by_status = passengers_with_fare.groupby(by="Economic status")["Fare"].agg(["min","mean","max","std"]).round(2)
fare_range_by_status

By taking into account the mean and standard deviation, we can see that some groups partially overlap, which blurs the boundaries. Of course, we have to keep in mind that the fare here includes the fares of everyone in a family.

In [None]:
min_fare_std = fare_range_by_status["mean"] - fare_range_by_status["std"]
max_fare_std = fare_range_by_status["mean"] + fare_range_by_status["std"]
colors = ['gold','lawngreen','sienna']
plt.figure(figsize=(10, 3), dpi=120)
plt.hlines(y=fare_range_by_status.index, xmin=min_fare_std, xmax=max_fare_std, color=colors, alpha=1)
plt.scatter(min_fare_std, fare_range_by_status.index, color=colors, alpha=1, label='Min fare (std)')
plt.scatter(max_fare_std, fare_range_by_status.index, color=colors, alpha=1 , linewidths=1,edgecolor="black",label='Max fare (std)')
plt.legend()
 
plt.show()
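
The overlap that the plot shows can also be checked numerically: two intervals overlap when each one starts before the other ends. The sketch below uses made-up summary statistics (`fare_stats` and its numbers are assumptions, not the values computed above).

```python
import pandas as pd

# Made-up statistics in the same layout as fare_range_by_status
fare_stats = pd.DataFrame(
    {"mean": [85.0, 21.0, 14.0], "std": [78.0, 13.0, 12.0]},
    index=["Upper", "Middle", "Lower"],
)
low = fare_stats["mean"] - fare_stats["std"]
high = fare_stats["mean"] + fare_stats["std"]

def intervals_overlap(a, b):
    """Return True when the mean +/- std ranges of classes a and b intersect."""
    return bool(low[a] <= high[b] and low[b] <= high[a])

pairs = [("Upper", "Middle"), ("Middle", "Lower"), ("Upper", "Lower")]
overlaps = {pair: intervals_overlap(*pair) for pair in pairs}
print(overlaps)
```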

### How is the survival distributed by dividing sex and the socio-economic status?

In [None]:
female = titanic_df["Sex"] == "Female"
female_passengers = titanic_df.loc[female]
survival_female_class = female_passengers.groupby(by="Economic status")["Survived"].value_counts(normalize=True).round(2).unstack()
survival_female_class = survival_female_class.rename(columns={0:"Died",1:"Survived"})
survival_female_class.columns.name = None
survival_female_class

In [None]:
male_passengers = titanic_df.loc[~female]
survival_male_class = male_passengers.groupby(by="Economic status")["Survived"].value_counts(normalize=True).round(2).unstack()
survival_male_class = survival_male_class.rename(columns={0:"Died",1:"Survived"})
survival_male_class.columns.name = None
survival_male_class

In [None]:
plt.figure(figsize=(10, 5), dpi=100)
 
plt.subplot2grid(shape=(1,2),loc=(0,0))
h1 = sns.heatmap(survival_female_class,annot=True)
plt.title("Female survival divided by class")
    
plt.subplot2grid(shape=(1,2),loc=(0,1))
h2 = sns.heatmap(survival_male_class,annot=True)
plt.title("Male survival divided by class")

plt.show()

Here we can clearly see that women of middle and upper status had much higher chances of survival, while for those of lower status the chances drop by around 40 percentage points.

Regarding the men, we can see that those from upper class have slightly better chances than those in middle and lower classes.

### Expansion on the families

Let's try to see if we can get anything new by viewing the sum of the number of siblings/spouses and parents/children.

In [None]:
family_size = titanic_df["Family size"].value_counts(normalize=True).round(2)
colors=["tab:blue","tab:orange","tab:green","tab:red","tab:purple","tab:brown","tab:pink","tab:gray","tab:olive"]
plt.figure(figsize=(10, 4), dpi=100)
plt.subplot2grid(shape=(1,2),loc=(0,0))
squarify.plot(sizes=family_size,label=family_size.index,color=colors, alpha=.8)
plt.axis('off')  
plt.title("Family size")   

plt.subplot2grid(shape=(1,2),loc=(0,1))
plt.bar(family_size.index, titanic_df.groupby(by="Family size")["Survived"].agg("mean"), alpha = 0.6)
plt.ylim(top=1)
plt.xticks(ticks=family_size.index)
plt.ylabel('%')
plt.title("Survival by family size")
plt.show()

Here we can see more clearly that families that have 2 to 4 members have higher survival chances than bigger families or passengers alone.

In [None]:
table = pd.crosstab(titanic_df['Family size'], titanic_df['Sex'])
print('\n', table)
table_fractions = table.div(table.sum(1).astype(float), axis=0)
g = table_fractions.plot(kind="bar", stacked=True)
plt.xticks(rotation=0)
plt.xlabel('Family size', weight='bold')
plt.ylabel('%')
plt.tight_layout()
leg = plt.legend(title='Sex', loc=9, bbox_to_anchor=(1.05, 1.0))

Here we can see that the proportion of males is greatest for family size 1 (males traveling alone), and drops for family size of 2, 3, and 4 members (more females). This can explain the survival chances of those groups.

## Summary of the EDA
    
* Women are more likely to have survived.
* Many of the children survived.
* Families (2 to 4 members) have better chances of survival.
* There are clear relationships between survival and sex, age, and socio-economic status.

## Models

In [None]:
y = np.array(titanic_df["Survived"]).ravel()

In [None]:
titanic_df.columns

In [None]:
best_model = None

In [None]:
subset_categorical_columns = ['Sex','Categorized age','Economic status','Embarked']

In [None]:
subset_numerical_columns = ['Age','Fare','Family size']

### KNN

In [None]:
params = {
    'n_neighbors':[2,3,4,5,6,7,8,9,10,15,20,25,30],
    'weights':['uniform', 'distance'],
    'metric':['minkowski','cosine','chebyshev','correlation']
}

In [None]:
X = encodeDataset(titanic_df,subset_categorical_columns,subset_numerical_columns)
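
`encodeDataset` and `encodeAndNormalizeData` come from a local `preprocessing` module that is not shown in this notebook. As a rough idea of what helpers like these might do, and not their actual implementation, the sketch below one-hot encodes the categorical columns and optionally standardizes the numerical ones. The function name `encode_dataset_sketch`, its `normalize` flag, and `toy_df` are all hypothetical.

```python
import pandas as pd

def encode_dataset_sketch(df, categorical_columns, numerical_columns, normalize=False):
    """Hypothetical stand-in: one-hot encode categoricals, optionally scale numericals."""
    encoded = pd.get_dummies(df[categorical_columns], dtype=float)
    numerical = df[numerical_columns].astype(float)
    if normalize:
        # Standardize each column to zero mean and unit (population) variance
        numerical = (numerical - numerical.mean()) / numerical.std(ddof=0)
    return pd.concat([encoded, numerical], axis=1)

toy_df = pd.DataFrame({
    "Sex": ["Male", "Female", "Female"],
    "Embarked": ["S", "C", "S"],
    "Age": [22.0, 38.0, 26.0],
})
X_toy = encode_dataset_sketch(toy_df, ["Sex", "Embarked"], ["Age"], normalize=True)
print(X_toy)
```

In a real pipeline the scaling statistics would normally be computed on the training split only and then reused for the test split.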

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
knn = GridSearchCV(KNeighborsClassifier(), params, scoring = 'accuracy', n_jobs = -1)

In [None]:
knn.fit(X_train, y_train)

In [None]:
y_pred = knn.predict(X_test)

In [None]:
knn.best_params_

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

In [None]:
best_model = knn

### SVM

In [None]:
params = {
    'C':[0.01,0.1,0.5,1,5,10,15,20,25,30],
    'kernel':["poly", "rbf", "linear"],
}

In [None]:
X = encodeAndNormalizeData(titanic_df,subset_categorical_columns,subset_numerical_columns)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
svm = GridSearchCV(SVC(), params, scoring = 'accuracy', n_jobs = -1)

In [None]:
svm.fit(X_train, y_train)

In [None]:
y_pred = svm.predict(X_test)

In [None]:
svm.best_params_

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

In [None]:
best_model = svm

### Decision Tree Classifier

In [None]:
params = {
    'criterion':['gini', 'entropy'],
    'max_depth':[1,2,3,4,5,6,7,8,9,10],
}

In [None]:
X = encodeDataset(titanic_df,subset_categorical_columns,subset_numerical_columns)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
dtc = GridSearchCV(DecisionTreeClassifier(), params, scoring = 'accuracy', n_jobs = -1)

In [None]:
dtc.fit(X_train, y_train)

In [None]:
y_pred = dtc.predict(X_test)

In [None]:
dtc.best_params_

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

### Logistic Regression

In [None]:
X = encodeAndNormalizeData(titanic_df,subset_categorical_columns,subset_numerical_columns)

In [None]:
params = {
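    # "elasticnet" is left out because it also needs l1_ratio; None (no penalty) requires scikit-learn >= 1.2 (older versions use "none")
    'penalty':["l2", "l1", None],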
    'C':[0.001,0.01,0.1,0.2,0.5,0.7,1.0,1.5,2,2.5,3,3.5,4,5],
}

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

In [None]:
lr = GridSearchCV(LogisticRegression(solver='saga',random_state=0,max_iter=700), params, scoring = 'accuracy', n_jobs = -1)

In [None]:
lr.fit(X_train, y_train)

In [None]:
y_pred = lr.predict(X_test)

In [None]:
lr.best_params_

In [None]:
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test, y_pred))

In [None]:
best_model = lr
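
`best_model` is reassigned after the KNN, SVM, and logistic regression searches regardless of how they scored. One possible alternative is to compare the cross-validated `best_score_` of each fitted search and keep the highest. The sketch below illustrates the idea on synthetic data from `make_classification` with deliberately small grids; `searches`, `X_demo`, and `y_demo` are assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

searches = {
    "dtc": GridSearchCV(DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4]}, scoring="accuracy"),
    "lr": GridSearchCV(LogisticRegression(max_iter=700), {"C": [0.1, 1.0]}, scoring="accuracy"),
}
for search in searches.values():
    search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated score of the best parameter combination
best_name = max(searches, key=lambda name: searches[name].best_score_)
best_model = searches[best_name]
print(best_name, round(best_model.best_score_, 3))
```

Selecting on the cross-validated score keeps the held-out test split untouched for the final evaluation.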

### Naive Bayes

### Boosting

### Random Forest

### Voting

### Stacking

## References


* [A Tour of Machine Learning in Python](https://rpmarchildon.com/ai-titanic/)
* [Encyclopedia Titanica](https://www.encyclopedia-titanica.org/)