# Applying Bayes' Theorem to Classification on the Titanic Data

We predict whether a passenger survived using __Bayes' theorem__. For this purpose we use the __GaussianNB__ class from __scikit-learn__.

In [1]:
import numpy as np
import pandas as pd

from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

In [2]:
titanic_df = pd.read_csv('Data/titanic.csv')
titanic_df.sample(10)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Survived
285,286,3,"Stankovic, Mr. Ivan",male,33.0,0,0,349239,8.6625,,C,0
222,223,3,"Green, Mr. George Henry",male,51.0,0,0,21440,8.05,,S,0
145,146,2,"Nicholls, Mr. Joseph Charles",male,19.0,1,1,C.A. 33112,36.75,,S,0
585,586,1,"Taussig, Miss. Ruth",female,18.0,0,2,110413,79.65,E68,S,1
595,596,3,"Van Impe, Mr. Jean Baptiste",male,36.0,1,1,345773,24.15,,S,0
377,378,1,"Widener, Mr. Harry Elkins",male,27.0,0,2,113503,211.5,C82,C,0
797,798,3,"Osman, Mrs. Mara",female,31.0,0,0,349244,8.6833,,S,1
65,66,3,"Moubarek, Master. Gerios",male,,1,1,2661,15.2458,,C,1
818,819,3,"Holm, Mr. John Fredrik Alexander",male,43.0,0,0,C 7075,6.45,,S,0
361,362,2,"del Carlo, Mr. Sebastiano",male,29.0,1,0,SC/PARIS 2167,27.7208,,C,0


In [3]:
titanic_df.shape

(891, 12)

In [4]:
# We use only the 'Sex' and 'Survived' columns
# .copy() avoids a SettingWithCopyWarning when 'Sex' is reassigned later
titanic_df = titanic_df[['Sex', 'Survived']].copy()
titanic_df.head()

Unnamed: 0,Sex,Survived
0,male,0
1,female,1
2,female,1
3,female,1
4,male,0


In [5]:
type(titanic_df['Sex'])

pandas.core.series.Series

In [6]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
Sex         891 non-null object
Survived    891 non-null int64
dtypes: int64(1), object(1)
memory usage: 14.0+ KB


In [7]:
# Encode 'Sex' as integer category codes (female -> 0, male -> 1)
titanic_df['Sex'] = titanic_df['Sex'].astype('category').cat.codes
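
The integer codes produced by `.cat.codes` follow the sorted order of the categories, which is why `female` maps to 0 and `male` maps to 1. A small sketch on a toy Series (the values are made up for illustration and are not read from the dataset):

```python
import pandas as pd

# Toy Series for illustration only; not read from titanic.csv
sex = pd.Series(['male', 'female', 'female', 'male'])
sex_cat = sex.astype('category')

# Categories are sorted alphabetically: code 0 -> 'female', code 1 -> 'male'
categories = list(sex_cat.cat.categories)
codes = list(sex_cat.cat.codes)

print(categories)
print(codes)
```

Keeping this mapping in mind matters later, because the manual computation filters on `Sex == 0` for women and `Sex == 1` for men.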

In [8]:
type(titanic_df['Sex'])

pandas.core.series.Series

In [9]:
titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 2 columns):
Sex         891 non-null int8
Survived    891 non-null int64
dtypes: int64(1), int8(1)
memory usage: 8.0 KB


In [10]:
titanic_df.head()

Unnamed: 0,Sex,Survived
0,1,0
1,0,1
2,0,1
3,0,1
4,1,0


In [11]:
# Let us check for null values, if any
titanic_df.isnull().any()

Sex         False
Survived    False
dtype: bool

In [12]:
# Drop any rows with null values (a no-op here, since the check above found none)
titanic_df = titanic_df.dropna()

In [13]:
# Let us check the dimensions
titanic_df.shape

(891, 2)

In [14]:
# Separate the data into the feature ('Sex') and the label ('Survived')
# The label is not part of the feature set; the model estimates the prior probabilities from the labels passed to fit()
features = titanic_df[['Sex']]
label = titanic_df['Survived']

In [15]:
from sklearn.model_selection import train_test_split
# No random_state is set, so the split and the accuracy below vary between runs
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size=0.2)

In [16]:
X_train.shape, X_test.shape

((712, 1), (179, 1))

In [17]:
model = GaussianNB()

In [18]:
model.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [19]:
y_pred = model.predict(X_test)

In [20]:
accuracy_score(y_test, y_pred)

0.7597765363128491
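
To make the model less of a black box, here is a minimal sketch of a Gaussian naive Bayes decision on the single encoded feature, written in plain Python. It is not scikit-learn's implementation: it omits `var_smoothing`, and it assumes the class-conditional counts of the full dataset (the notebook fits on a random 80% split, so the fitted numbers differ slightly). The names `gaussian_params`, `log_joint` and `predict` are hypothetical helpers introduced for this illustration.

```python
import math

# counts[label][sex_code]; Sex is encoded as 0 = female, 1 = male.
# Assumption: full-dataset counts, not the counts of the 80% training split.
counts = {
    1: {0: 233, 1: 109},  # Survived = 1: 233 women, 109 men
    0: {0: 81, 1: 468},   # Survived = 0: 81 women, 468 men
}
total = sum(sum(by_sex.values()) for by_sex in counts.values())

def gaussian_params(value_counts):
    """Return (n, mean, population variance) for a value -> count mapping."""
    n = sum(value_counts.values())
    mean = sum(value * k for value, k in value_counts.items()) / n
    var = sum(k * (value - mean) ** 2 for value, k in value_counts.items()) / n
    return n, mean, var

def log_joint(x, label):
    """log P(label) + log N(x; mean_label, var_label)."""
    n, mean, var = gaussian_params(counts[label])
    log_prior = math.log(n / total)
    log_likelihood = -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)
    return log_prior + log_likelihood

def predict(x):
    """Return the label with the larger log joint probability."""
    return max(counts, key=lambda label: log_joint(x, label))

predictions = {x: predict(x) for x in (0, 1)}

# Accuracy of these two decisions measured on the same full-dataset counts
correct = counts[predictions[0]][0] + counts[predictions[1]][1]
accuracy = correct / total

print(predictions, accuracy)
```

With only one binary feature the model can make just one decision per sex, so its accuracy is determined entirely by how many passengers fall on the majority side of each group.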

__Manual Computation of Probabilities using Bayes' Theorem__

__Problem__:

Given passenger data (PassengerId, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked) together with Survived,
find out whether there is a __pattern or relationship__ between __Sex__ and __Survived__.

In [77]:
titanic_df[['Sex', 'Survived']]

Unnamed: 0,Sex,Survived
0,1,0
1,0,1
2,0,1
3,0,1
4,1,0
...,...,...
886,1,0
887,0,1
888,0,0
889,1,1


In [82]:
# Women (Sex == 0): total = 314, survived = 233, did not survive = 81
titanic_df['Survived'][titanic_df['Sex'] == 0].groupby(titanic_df['Survived']).count()

Survived
0     81
1    233
Name: Survived, dtype: int64

In [83]:
# Men (Sex == 1): total = 577, survived = 109, did not survive = 468
titanic_df['Survived'][titanic_df['Sex'] == 1].groupby(titanic_df['Survived']).count()

Survived
0    468
1    109
Name: Survived, dtype: int64

In [84]:
# P(Survived|Man) = P(Man|Survived) * P(Survived) / P(Man)
# 342 passengers survived and 549 did not, out of 891
(109/342)*(342/891) / ((109/342 * 342/891) + (468/549 * 549/891))

0.18890814558058927

In [85]:
# P(Survived|Woman) = P(Woman|Survived) * P(Survived) / P(Woman)
# 342 passengers survived and 549 did not, out of 891
(233/342)*(342/891) / ((233/342 * 342/891) + (81/549 * 549/891))

0.7420382165605094
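
Written out, the two cells above apply Bayes' theorem with 342 survivors and 549 non-survivors out of 891 passengers. The symbols $S$ (survived), $\bar{S}$ (did not survive), $M$ (man) and $W$ (woman) are shorthand introduced here, not notation from the notebook:

$$
P(S \mid M) = \frac{P(M \mid S)\,P(S)}{P(M \mid S)\,P(S) + P(M \mid \bar{S})\,P(\bar{S})}
= \frac{\frac{109}{342}\cdot\frac{342}{891}}{\frac{109}{342}\cdot\frac{342}{891} + \frac{468}{549}\cdot\frac{549}{891}}
= \frac{109}{577} \approx 0.189
$$

$$
P(S \mid W) = \frac{P(W \mid S)\,P(S)}{P(W \mid S)\,P(S) + P(W \mid \bar{S})\,P(\bar{S})}
= \frac{\frac{233}{342}\cdot\frac{342}{891}}{\frac{233}{342}\cdot\frac{342}{891} + \frac{81}{549}\cdot\frac{549}{891}}
= \frac{233}{314} \approx 0.742
$$

The class totals cancel, so each posterior reduces to the share of survivors within the group: 109 of 577 men and 233 of 314 women.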

From the results above, since __P(Survived|Man) < P(Survived|Woman)__, we conclude that __women had a higher chance of survival than men__.

Given the feature vector of a passenger whose survival is unknown, we apply the same computation to each feature in the vector and combine the resulting probabilities. Based on the result we predict whether that __passenger__ survived.
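
As a rough sketch of that prediction step, the code below turns the counts from the groupby cells into a posterior and a predicted class for a new passenger. It covers only the `Sex` feature, since that is the only feature tabulated here; with several features, naive Bayes would multiply the per-feature likelihoods under a conditional independence assumption. The function names `posterior_survived` and `predict_survived`, and the 0.5 decision threshold, are assumptions made for this illustration.

```python
# counts[sex][survived], taken from the groupby cells above
counts = {
    'female': {1: 233, 0: 81},
    'male': {1: 109, 0: 468},
}

def posterior_survived(sex):
    """P(Survived = 1 | Sex = sex) via Bayes' theorem."""
    total = sum(sum(by_label.values()) for by_label in counts.values())
    n_survived = sum(by_label[1] for by_label in counts.values())
    n_died = total - n_survived

    prior_survived = n_survived / total
    prior_died = n_died / total
    likelihood_survived = counts[sex][1] / n_survived  # P(sex | survived)
    likelihood_died = counts[sex][0] / n_died          # P(sex | did not survive)

    numerator = likelihood_survived * prior_survived
    evidence = numerator + likelihood_died * prior_died
    return numerator / evidence

def predict_survived(sex):
    """Predict 1 when the posterior of survival exceeds 0.5, else 0."""
    return int(posterior_survived(sex) > 0.5)

for sex in counts:
    print(sex, round(posterior_survived(sex), 4), predict_survived(sex))
```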