## Can A.I. save Jack from the Titanic?

**Want to read the friendly article based on this code?** (with pictures!):
https://github.com/antoinedme/titanic-dataset-ml


Author: *Antoine de Marassé* https://www.linkedin.com/in/hiantoine/

Based on: Microsoft Azure (data science masterclass), Kaggle, pandas, scikit-learn and Seaborn


Hi there, welcome to this fun Machine Learning introduction example. In this Python notebook I play around with the RMS Titanic passenger dataset and apply a bit of this so-called "Artificial Intelligence". The different sections are:
- [Exploring the Passengers dataset entries](#Exploring-the-Passengers-dataset-entries)
- [Checking and preparing the data](#Checking-and-preparing-the-data)
- [Visual analysis with Seaborn](#Visual-Analysis-with-Seaborn)
- [Split the dataset (training and testing)](#Split-the-training-and-test-data)
- [Apply logistic regression](#Applying-Logistic-Regression)
- [Evaluate the model](#Evaluate-the-model)
- [Apply decision tree](#One-more-turn-applying-Decision-Tree)

## RMS Titanic story

RMS Titanic was a British passenger liner operated by the White Star Line that sank in the North Atlantic Ocean in the early morning hours of April 15, 1912, after striking an iceberg during her maiden voyage from Southampton to New York City. Of the estimated 2,224 passengers and crew aboard, more than 1,500 died, making the sinking one of modern history's deadliest peacetime commercial marine disasters. 
    
After leaving Southampton on 10 April 1912, Titanic called at Cherbourg in France and Queenstown (now Cobh) in Ireland, before heading west to New York. On 14 April, four days into the crossing and about 375 miles (600 km) south of Newfoundland, she hit an iceberg at 11:40 p.m. ship's time. The collision caused the hull plates to buckle inwards along her starboard (right) side and opened five of her sixteen watertight compartments to the sea; she could only survive four flooding. Meanwhile, passengers and some crew members were evacuated in lifeboats, many of which were launched only partially loaded. A disproportionate number of men were left aboard because of a "women and children first" protocol for loading lifeboats. At 2:20 a.m., she broke apart and foundered with well over one thousand people still aboard. Just under two hours after Titanic sank, the Cunard liner RMS Carpathia arrived and brought aboard an estimated 705 survivors. 


<img style="" src="https://raw.githubusercontent.com/antoinedme/titanic-dataset-ml/master/img/opening-image.png"> 

## Exploring the Passengers dataset entries

Install the pandas package: ```conda install pandas```

In [4]:
# Import classic numpy library, and pandas, open source data analysis and manipulation tool
import pandas as pd
import numpy as np

In [5]:
# Load the dataset file into "titanic" object
titanic = pd.read_csv("data/titanic-1309-rows-biostatvanderbilt.csv")
# Let's have a look at the dataset attributes
titanic.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [6]:
# Let's check the dataset info, variables types and null counts 
# (pclass, survived, name, sex, sibsp, parch, ticket are complete)
# (fare misses one value, embarked misses two, age is incomplete for 263 passengers)
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [7]:
# Let's get a few statistics from the dataset (most notably the mean of survived: 38.2% of passengers survived)
titanic.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881138,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.413493,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.17,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


In [8]:
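# Mean of the numeric parameters, grouped by survival status (0 = died, 1 = survived)
# numeric_only=True skips text columns such as name, which recent pandas versions refuse to average
titanic.groupby('survived').mean(numeric_only=True)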

Unnamed: 0_level_0,pclass,age,sibsp,parch,fare,body
survived,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,2.500618,30.545363,0.521632,0.328801,23.353831,160.809917
1,1.962,28.918244,0.462,0.476,49.361184,


### So, what did we understand from the dataset exploration?

The dataset has 1309 entries and 14 columns.
It describes the survival status of individual passengers on the Titanic. The 10 variables of interest are:
- `survived`: 0 = No, 1 = Yes. **(As the `survived` mean in the table above shows, 38.2% of passengers survived)**
- `pclass`: Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- Demographics: `sex`, `age`
- `sibsp`, `parch`: Number of siblings or spouses aboard, number of parents or children aboard
- `ticket`: Passenger ticket number
- `fare`: Passenger fare
- `cabin`: Cabin number
- `embarked`: Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
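
As a complement to `info()`, the missing values per column can also be counted directly. The snippet below is a minimal sketch on a small made-up table (the `toy` frame and its values are assumptions for illustration, not rows from the Titanic file):

```python
import numpy as np
import pandas as pd

# Hypothetical toy table, not the real Titanic file
toy = pd.DataFrame({
    "pclass": [1, 2, 3, 3],
    "age": [29.0, np.nan, 2.0, np.nan],
    "fare": [211.34, 15.05, np.nan, 8.05],
})

# Number of missing values per column
missing_counts = toy.isnull().sum()
# Share of missing values per column, as a percentage
missing_share = (toy.isnull().mean() * 100).round(1)

print(missing_counts)
print(missing_share)
```

On the real table, the same two lines applied to `titanic` should give the counts that can be read off the `info()` output above (for example 263 missing ages).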

## Checking and preparing the data

Dropping a few columns and checking the data values.

In [9]:
# Removing some variables that might not be used (to be checked later), which also anonymizes the data
titanic.drop(['name','body','boat','cabin','ticket','embarked','home.dest'],axis=1,inplace=True)
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   sex       1309 non-null   object 
 3   age       1046 non-null   float64
 4   sibsp     1309 non-null   int64  
 5   parch     1309 non-null   int64  
 6   fare      1308 non-null   float64
dtypes: float64(2), int64(4), object(1)
memory usage: 71.7+ KB


In [11]:
# As we can see, the age column has quite a few missing values; let's count how many:
titanic['age'].isnull().value_counts()

False    1046
True      263
Name: age, dtype: int64

In [13]:
# Let's fill the missing 'age' values with the median age of each sex
# (transform keeps the original row index, so the result aligns with the table)
titanic['age'] = titanic['age'].fillna(titanic.groupby('sex')['age'].transform('median'))

In [17]:
# Now we will compute the median price (fare) paid for each class (results: first class costs 60, while third only 8)
titanic.groupby('pclass')['fare'].median()

pclass
1    60.0000
2    15.0458
3     8.0500
Name: fare, dtype: float64

In [18]:
# Let's fill the missing 'fare' value with the same technique as above, but using the median value of each class
titanic['fare'] = titanic['fare'].fillna(titanic.groupby('pclass')['fare'].transform('median'))

In [20]:
# And check our results once again (results: all 1309 values are present for each parameter, the table is complete)
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   int64  
 1   survived  1309 non-null   int64  
 2   sex       1309 non-null   object 
 3   age       1309 non-null   float64
 4   sibsp     1309 non-null   int64  
 5   parch     1309 non-null   int64  
 6   fare      1309 non-null   float64
dtypes: float64(2), int64(4), object(1)
memory usage: 71.7+ KB
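
The group-wise median fill used in this section can be sketched on a tiny made-up table. This is only an illustration under assumptions: the `toy` frame, its values and the `age_filled` column name are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table with one missing age per sex
toy = pd.DataFrame({
    "sex": ["female", "female", "female", "male", "male", "male"],
    "age": [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# transform('median') returns one value per row, aligned with the original index:
# each row gets the median age of its own group
group_median = toy.groupby("sex")["age"].transform("median")

# Only the missing entries are replaced, known ages are kept
toy["age_filled"] = toy["age"].fillna(group_median)
print(toy)
```

Here the missing female age is filled with 25 (median of 20 and 30) and the missing male age with 45 (median of 40 and 50).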


In [21]:
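# Let's also keep a second name for our table for later, we will call it "iceberg"
# (note: this is an alias, not a copy, so later in-place changes to the table are visible through it too)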
iceberg = titanic
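
A plain assignment such as the one above gives the same DataFrame a second name; it does not duplicate the data. The sketch below shows the difference with `.copy()` on a made-up two-row table (the names `toy`, `alias` and `snapshot` are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({"age": [29, 2], "sibsp": [0, 1]})

alias = toy            # second name for the same DataFrame object
snapshot = toy.copy()  # independent copy of the data

# An in-place change is visible through every name bound to the object
toy.drop("sibsp", axis=1, inplace=True)

print(alias is toy)            # both names point to the same object
print(list(alias.columns))     # the alias lost the column too
print(list(snapshot.columns))  # the copy still has both columns
```

In this notebook the aliasing matters: the in-place cleaning done in the visual analysis section is what later gives `iceberg` its numeric `sex_is_male` column.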

## Visual Analysis with Seaborn

Install Seaborn, a statistical data visualization library based on matplotlib: ```conda install seaborn```

In this section, we will plot some of the information we already got from exploring the data, and also use Seaborn to plot the correlation between parameters. (In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data.)
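
To make the notion of correlation a bit more concrete, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with pandas. The `toy` values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: higher class number, lower fare
toy = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "fare": [80.0, 60.0, 15.0, 8.0, 7.0],
})

x = toy["pclass"].to_numpy(dtype=float)
y = toy["fare"].to_numpy()

# Pearson r = covariance(x, y) / (std(x) * std(y)), using population formulas throughout
manual_r = ((x - x.mean()) * (y - y.mean())).mean() / (x.std() * y.std())

# pandas uses the Pearson coefficient by default
pandas_r = toy["pclass"].corr(toy["fare"])

print(manual_r, pandas_r)
```

Both values agree and are negative here, since the fare goes down as the class number goes up. A coefficient close to 0 would indicate no linear relationship.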

In [None]:
# Another name for the same table (an alias, not a copy): the in-place cleaning below also applies to titanic and iceberg
exploratory = titanic
exploratory.drop('sibsp',axis=1,inplace=True)

In [None]:
exploratory.info()

In [None]:
# Converting the sex variable into an integer (female = 0, male = 1)
exploratory['sex_is_male'] = (exploratory['sex'] == 'male').astype(int)
exploratory.drop('sex',axis=1,inplace=True)

# re-order the columns
exploratory = exploratory[['sex_is_male','age','parch','fare','pclass','survived']]

exploratory.head()

In [None]:
exploratory.groupby(['sex_is_male','pclass'])['survived'].mean()

In [None]:
exploratory.groupby(['pclass'])['fare'].mean()

In 1st class, the survival probability is 96.5% for women compared to only 34.1% for men.
In 3rd class, it drops to 49.1% for women and 15.2% for men. We might want to plot those stats.
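
Because `survived` is coded as 0 or 1, the mean of a group is directly its survival rate. The following sketch shows the mechanics on a few made-up rows (the `toy` frame is an assumption, the resulting rates are not those of the real dataset):

```python
import pandas as pd

# Hypothetical toy table
toy = pd.DataFrame({
    "sex_is_male": [0, 0, 0, 1, 1, 1, 1, 0],
    "pclass":      [1, 1, 3, 1, 3, 3, 3, 3],
    "survived":    [1, 1, 0, 1, 0, 0, 1, 1],
})

# Mean of a 0/1 column = share of ones = survival rate of each (sex, class) group
rates = toy.groupby(["sex_is_male", "pclass"])["survived"].mean()

# Pivot the classes into columns for an easier side-by-side reading
table = rates.unstack("pclass")
print(table)
```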

In [None]:
# Use Seaborn to visualize survived male and female per class in grouped barplots
import seaborn as sns
sns.set(style="whitegrid")

# Draw a nested barplot to show survival for class and sex
#g = sns.catplot(x="pclass", y="survived", hue="sex", data=titanic, height=5, kind="bar", palette="muted")
graph = sns.catplot(x="sex_is_male", y="survived", hue="pclass", kind="bar", palette="muted", data=exploratory)
graph.set_ylabels("survival probability")

In [None]:
exploratory.corr()

In [None]:
# Draw scatterplots for joint relationships and histograms for univariate distributions:
# different levels of a categorical variable by the color of plot elements
sns.set(style="darkgrid")  
sns.pairplot(exploratory, dropna=True, hue="survived", corner=True)
#sns.pairplot(exploratory, dropna=True, hue="survived", corner=True, kind="reg")

## Split the training and test data

Install scikit-learn (Machine Learning in Python): ```conda install -c intel scikit-learn```
- Simple and efficient tools for predictive data analysis
- Built on NumPy, SciPy, and matplotlib
Link: https://scikit-learn.org/stable/
 
#### Predicting a continuous-valued attribute associated with an object.
For this part, I will first refer to the linear regression model applied to the scikit-learn `diabetes` dataset, which illustrates this regression technique with a two-dimensional plot.
https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py

In that example's plot, the straight line shows how linear regression attempts to minimize the residual sum of squares between the observed responses in the dataset and the responses predicted by the linear approximation. The coefficients, the residual sum of squares and the coefficient of determination are also calculated. Survival, by contrast, is a yes/no class rather than a continuous value, which is why the model applied below is logistic regression.

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
%matplotlib inline

In [None]:
# Check that no null values remain (the NaN values in 'age' and 'fare' were filled earlier)
iceberg.info()

In [None]:
# Splitting the table into the list of all parameters X and the value we want to predict, y ('survived')
X = iceberg.drop(['survived'],axis=1)
y = iceberg['survived']

X.info()
y.head()

In [None]:
# Split the data into training/testing sets
# Using train_test_split function
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=67)

We chose a test size of 30% of the 1309 entries, which splits our set into 916 training and 393 testing entries.
<img style="padding-left:10px; width: 250px" src="https://miro.medium.com/max/2272/1*-8_kogvwmL1H6ooN1A1tsQ.png"> 
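
The 916/393 split can be reproduced without the data file. The sketch below uses placeholder arrays with the same number of rows as the dataset (`X_demo` and `y_demo` are assumptions, their values carry no meaning):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 1309  # same row count as the dataset; the values themselves are placeholders
X_demo = np.arange(n_rows).reshape(-1, 1)
y_demo = np.arange(n_rows) % 2

# Same settings as in the notebook: 30% of the rows go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=67
)

# 0.3 * 1309 = 392.7, which scikit-learn rounds up for the test set
print(len(X_tr), len(X_te))
```

Fixing `random_state` makes the split reproducible from one run to the next. If the class balance between the two sets matters, `train_test_split` also accepts a `stratify` argument, which is not used in this notebook.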

In [None]:
print('Lets have a look at the training dataset: ')
print('Size of parameters table followed by info: ', X_train.shape)
X_train.info()
print('Size of survival rate followed by info: ', y_train.shape)
y_train

In [None]:
print('-----------------------------------------------------')
print('Lets have a look at the testing dataset: ')
print('Size of parameters table followed by info: ', X_test.shape)
X_test.info()
print('Size of survival rate followed by info: ', y_test.shape)
y_test

## Applying Logistic Regression

In [None]:
from sklearn import linear_model

In [None]:
# Create a logistic regression object: LogisticRegression fits a linear model with coefficients
# w = (w1, …, wp) and passes the linear combination through the logistic function
# to estimate the probability of each class (here: survived or not).

# https://scikit-learn.org/stable/modules/linear_model.html

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()

In [None]:
# Fit the logistic regression model with fit(X, y[, sample_weight])
lr.fit(X_train,y_train)

In [None]:
# Predict class labels with predict(X), output is a numpy array
predictions = lr.predict(X_test)

In [None]:
predictions
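
Behind each 0/1 prediction, the logistic regression model estimates a probability. The sketch below illustrates this on synthetic data (the arrays `X_demo` and `y_demo` are randomly generated assumptions, not Titanic data), using `predict_proba` next to `predict`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical data: one feature, class 1 is more likely for larger values
X_demo = rng.normal(size=(200, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LogisticRegression().fit(X_demo, y_demo)

# One row per sample, one column per class (in the order of model.classes_)
proba = model.predict_proba(X_demo)
labels = model.predict(X_demo)

# For a binary model, predict() corresponds to a 0.5 threshold on the class-1 probability
manual = (proba[:, 1] > 0.5).astype(int)
print(model.classes_, (labels == manual).all())
```

On the notebook's model, `lr.predict_proba(X_test)` would give the estimated survival probability of each test passenger in its second column.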

## Evaluate the model

In contrast to linear regression, logistic regression does not produce an $R^2$ score by which we can assess the accuracy of our model. In order to evaluate that, we will use a classification report, a confusion matrix, and the accuracy score.
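
To see how the confusion matrix and the accuracy score relate, here is a small sketch on made-up labels (`y_true` and `y_pred` are assumptions for illustration). In scikit-learn, the rows of the confusion matrix are the true classes and the columns are the predicted classes.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels: 0 = not survived, 1 = survived
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

# Rows = true classes, columns = predicted classes, both in the order given by labels
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
cm_table = pd.DataFrame(
    cm,
    index=["true not survived", "true survived"],
    columns=["predicted not survived", "predicted survived"],
)
print(cm_table)

# Accuracy is the share of correct predictions: the diagonal over the total
tn, fp, fn, tp = cm.ravel()
accuracy = (tp + tn) / cm.sum()
print(accuracy, accuracy_score(y_true, y_pred))
```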

In [None]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [None]:
import matplotlib.pyplot as plt

In [None]:
print(classification_report(y_test,predictions))

In [None]:
print(confusion_matrix(y_test,predictions))

In [None]:
# In scikit-learn, rows are the true classes and columns the predicted classes (0 first, then 1)
pd.DataFrame(confusion_matrix(y_test, predictions), 
             columns=['predicted not survived', 'predicted survived'], 
             index=['true not survived', 'true survived'])

In [None]:
print(accuracy_score(y_test,predictions))

In [None]:
X_test

In [None]:
# Creating Jack and Rose from the movie
jack = [3, 27, 0, 8,  1]
rose = [1, 22, 1, 60, 0]
people = pd.DataFrame(np.array([jack, rose]), columns=['pclass', 'age', 'parch', 'fare','sex_is_male'])
people

In [None]:
# Predict survival of Jack and Rose
will_they_live = lr.predict(people)

In [None]:
will_they_live

In [None]:
if will_they_live[0] == 0 : 
    print("Jack just died again.")
else: 
    print("Awesome, Jack did swim and survived!")             

In [None]:
if will_they_live[1] == 0 :
    print("Omg Rose did drown too.")
else :
    print("No worries, Rose still alive and well!")

## One more turn applying Decision Tree

In [None]:
from sklearn import tree

In [None]:
tr = tree.DecisionTreeClassifier()

In [None]:
tr.fit(X_train, y_train)

In [None]:
tr_predictions = tr.predict(X_test)

In [None]:
print(accuracy_score(y_test,tr_predictions))

In [None]:
import graphviz 

dot_data = tree.export_graphviz(tr, out_file=None) 
graph = graphviz.Source(dot_data) 


In [None]:

graph

# from IPython.display import Image  
#Image(graph.create_png())

In [None]:
# tree.plot_tree(clf.fit(iris.data, iris.target)) 



dot_file = tree.export_graphviz(tr, out_file=None, 
                                feature_names=list(X.columns), 
                                # one name per class, in ascending order of the class labels (0, 1)
                                class_names=['not survived', 'survived'],  
                                filled=True,rounded=True)  
graph = graphviz.Source(dot_file)  
graph


In [None]:
tr_predictions

In [None]:
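# export_text is imported from sklearn.tree (the sklearn.tree.export path was removed in recent scikit-learn versions)
from sklearn.tree import export_text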

print(export_text(tr))
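
The text export is easier to read when the real column names are passed in. Here is a minimal sketch on a made-up table where survival depends only on `sex_is_male` (the data and the names `X_demo`, `y_demo`, `small_tree` are assumptions for illustration):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy table: the label depends only on sex_is_male
X_demo = pd.DataFrame({
    "sex_is_male": [0, 0, 0, 1, 1, 1],
    "pclass":      [1, 2, 3, 1, 2, 3],
})
y_demo = [1, 1, 1, 0, 0, 0]

# A shallow tree keeps the printed rules short
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

# feature_names replaces the generic feature_0, feature_1 labels in the output
rules = export_text(small_tree, feature_names=list(X_demo.columns))
print(rules)
```

For the notebook's tree, the equivalent call would be `export_text(tr, feature_names=list(X.columns))`. Limiting `max_depth` is a choice made for this sketch only; the tree trained above uses the default settings.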

In [None]:
# Predict survival of Jack and Rose again, this time with the decision tree
will_they_live_tr = tr.predict(people)
will_they_live_tr

In [None]:
if will_they_live_tr[0] == 0 : 
    print("Jack just died again.")
else: 
    print("Awesome, Jack did swim and survived!")

In [None]:
if will_they_live_tr[1] == 0 :
    print("Omg Rose did drown too.")
else :
    print("No worries, Rose still alive and well!")

At 11:40 p.m. (ship's time) on 14 April, lookout Frederick Fleet spotted an iceberg immediately ahead of Titanic and alerted the bridge. First Officer William Murdoch ordered the ship to be steered around the obstacle and the engines to be stopped, but it was too late; the starboard side of Titanic struck the iceberg, creating a series of holes below the waterline. It soon became clear that the ship was doomed, as she could not survive more than four compartments being flooded. Titanic began sinking bow-first, with water spilling from compartment to compartment as her angle in the water became steeper. Third-class passengers were largely left to fend for themselves, causing many of them to become trapped below decks as the ship filled with water. The "women and children first" protocol was generally followed when loading the lifeboats, and most of the male passengers and crew were left aboard.

It looks like both logistic regression and decision tree methods can't save our Jack! The vertical stern of the ship plunges down shrieking and groaning, with bodies falling hundreds of feet down toward churning water. Some fans will never let go of the possibility that there was room enough for both Jack and Rose on that door at the end of Titanic. Director James Cameron has an explanation for them that doesn’t involve physics, but rather art. “Had he lived, the ending of the film would have been meaningless,” he said in a recent Vanity Fair interview. “The film is about death and separation; he had to die.”

Unless...

[Click this link for an alternate Titanic ending](https://github.com/antoinedme/titanic-dataset-ml/blob/master/README.md#final-remarks)


Unnamed: 0_level_0,pclass,survived,age,sibsp,parch,fare
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
female,466,466,388,466,466,466
male,843,843,658,843,843,842
