### Naive Bayes Tutorial: Part 1 - Predicting Titanic Survival Using Naive Bayes

In [157]:
import pandas as pd

In [158]:
from sklearn.datasets import fetch_openml
# fetch the titanic dataset from openml
X, y = fetch_openml('titanic', version=1, return_X_y=True, as_frame=True)

In [159]:
print(fetch_openml('titanic', version=1, as_frame=True).DESCR)

**Author**: Frank E. Harrell Jr., Thomas Cason  
**Source**: [Vanderbilt Biostatistics](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)  
**Please cite**:   

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information from the crew, but it does contain actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variable

In [160]:
X.shape, y.shape

((1309, 13), (1309,))

In [161]:
X.head(3)

,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


#### Drop the columns not needed for predicting survival

In [162]:
X.drop(['name', 'ticket', 'cabin', 'boat', 'sibsp', 'parch', 'body', 'embarked','home.dest'], axis=1, inplace=True)

In [163]:
X.head(3)

,pclass,sex,age,fare
0,1.0,female,29.0,211.3375
1,1.0,male,0.9167,151.55
2,1.0,female,2.0,151.55


In [164]:
y.head(3)

0    1
1    1
2    0
Name: survived, dtype: category
Categories (2, object): ['0', '1']

#### Encode the sex feature 

In [165]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories=[['female', 'male']]).set_output(transform='pandas')
X[['sex']] = oe.fit_transform(X[['sex']])

In [166]:
X.head(3)

,pclass,sex,age,fare
0,1.0,0.0,29.0,211.3375
1,1.0,1.0,0.9167,151.55
2,1.0,0.0,2.0,151.55


In [167]:
# check for missing values
X.isna().sum()

pclass      0
sex         0
age       263
fare        1
dtype: int64

In [168]:
# Impute missing values with each column's mean
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean').set_output(transform='pandas')
# One fit handles both columns; the mean is still computed per column
X[['age', 'fare']] = imputer.fit_transform(X[['age', 'fare']])

In [169]:
X.isna().sum()

pclass    0
sex       0
age       0
fare      0
dtype: int64
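
The imputation above learns the column means from every row, including rows that later end up in the test set. A minimal sketch of one alternative is to put the imputer and the classifier in a single pipeline, so the means are learned from the training rows only. The `demo` frame, `labels`, and `pipe` below are made-up illustrations, not part of the Titanic data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with missing values (not the Titanic dataset)
demo = pd.DataFrame({
    'age':  [22.0, np.nan, 35.0, 54.0, np.nan, 4.0, 28.0, 40.0],
    'fare': [7.25, 71.28, 8.05, np.nan, 8.46, 16.70, 30.00, 13.00],
})
labels = pd.Series([0, 1, 0, 1, 0, 1, 0, 1], name='survived')

X_tr, X_te, y_tr, y_te = train_test_split(
    demo, labels, test_size=0.25, random_state=0, stratify=labels
)

# The imputer is fitted inside the pipeline, on the training rows only
pipe = make_pipeline(SimpleImputer(strategy='mean'), GaussianNB())
pipe.fit(X_tr, y_tr)

# Means learned by the imputer step; they match the training-set column means
imputer_means = pipe.named_steps['simpleimputer'].statistics_
print(imputer_means)
print(pipe.predict(X_te))
```

The same pipeline object can be passed to `cross_val_score`, in which case the imputer is refitted on each training fold.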

In [170]:
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   pclass  1309 non-null   float64
 1   sex     1309 non-null   float64
 2   age     1309 non-null   float64
 3   fare    1309 non-null   float64
dtypes: float64(4)
memory usage: 41.0 KB


### Train Test Split

In [182]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#### Fitting the Naive Bayes Model


In [183]:
from sklearn.naive_bayes import GaussianNB # gaussian naive bayes
model = GaussianNB()
model.fit(X_train, y_train)
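
To make the fitting step less of a black box, here is a minimal sketch of what a Gaussian naive Bayes classifier computes: a prior per class, a mean and variance per class and feature, and a sum of per-feature Gaussian log-likelihoods. The function name `gaussian_nb_log_posterior` and the toy arrays are illustrative assumptions; the variance smoothing mirrors the `var_smoothing` parameter of `GaussianNB`.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def gaussian_nb_log_posterior(X_train, y_train, X_new, var_smoothing=1e-9):
    """Unnormalized log posterior per class, computed by hand."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.asarray(X_new, dtype=float)
    y_train = np.asarray(y_train)
    classes = np.unique(y_train)
    # Small constant added to every variance for numerical stability
    eps = var_smoothing * X_train.var(axis=0).max()
    columns = []
    for c in classes:
        Xc = X_train[y_train == c]
        prior = Xc.shape[0] / X_train.shape[0]
        mean = Xc.mean(axis=0)
        var = Xc.var(axis=0) + eps
        # "Naive" assumption: features are independent given the class,
        # so the per-feature Gaussian log-densities are simply summed
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var).sum(axis=1)
        columns.append(np.log(prior) + log_lik)
    return classes, np.column_stack(columns)

# Toy data: two Gaussian blobs (not the Titanic dataset)
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

classes, log_post = gaussian_nb_log_posterior(X_toy, y_toy, X_toy)
manual_pred = classes[log_post.argmax(axis=1)]
sk_pred = GaussianNB().fit(X_toy, y_toy).predict(X_toy)
print((manual_pred == sk_pred).all())
```

The predicted class is the one with the largest log posterior, which is why the hand-rolled version can be compared directly with `GaussianNB.predict`.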

In [184]:
model.score(X_test, y_test)

0.7286585365853658

In [185]:
pd.concat([X_test[:10], y_test[:10]], axis=1)

,pclass,sex,age,fare,survived
1148,3.0,1.0,35.0,7.125,0
1049,3.0,1.0,20.0,15.7417,1
982,3.0,1.0,29.881135,7.8958,0
808,3.0,1.0,29.881135,8.05,0
1195,3.0,1.0,29.881135,7.75,0
240,1.0,1.0,45.0,26.55,1
1118,3.0,1.0,25.0,7.925,0
596,2.0,1.0,31.0,13.0,1
924,3.0,1.0,34.5,7.8292,0
65,1.0,0.0,33.0,53.1,1


In [186]:
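# Predictions for the same first 10 test rows; this frame gets a fresh
# positional index (0-9), so row i here corresponds to the i-th row above
predicted = pd.DataFrame(model.predict(X_test[:10]), columns=['predicted'])
predicted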


,predicted
0,0
1,0
2,0
3,0
4,0
5,0
6,0
7,0
8,0
9,1
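
Because the predictions are indexed 0 to 9 while the test rows keep their original index, the two tables have to be matched by position. A small sketch of one way to line them up is to build the predictions as a Series that reuses the index of the features. The helper name `compare_predictions` and the toy data are hypothetical.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

def compare_predictions(model, X, y):
    """Return features, true labels, and predictions in one frame, sharing X's index."""
    predicted = pd.Series(model.predict(X), index=X.index, name='predicted')
    return pd.concat([X, y.rename('actual'), predicted], axis=1)

# Toy stand-in for the test split, with a non-sequential index
X_demo = pd.DataFrame(
    {'sex': [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
     'fare': [80.0, 8.0, 60.0, 7.5, 90.0, 10.0]},
    index=[65, 1148, 240, 982, 596, 808],
)
y_demo = pd.Series(['1', '0', '1', '0', '1', '0'], index=X_demo.index, name='survived')

demo_model = GaussianNB().fit(X_demo, y_demo)
comparison = compare_predictions(demo_model, X_demo, y_demo)
print(comparison)
```

With the fitted model from this notebook, the equivalent call would be `compare_predictions(model, X_test[:10], y_test[:10])`.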


### Calculate the score using cross-validation

In [187]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(model, X, y, cv=5)
print(scores)
print(scores.mean())

[0.50381679 0.82824427 0.79770992 0.70992366 0.61685824]
0.6913105788072884
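
The fold scores above range from about 0.50 to 0.83. With an integer `cv` and a classifier, `cross_val_score` uses `StratifiedKFold` without shuffling, so each fold is built from contiguous slices of the rows. If the rows are ordered (the first rows shown earlier are all first class, which suggests ordering by `pclass`), the folds can differ systematically, and that may explain part of the spread. A minimal sketch of shuffled, stratified folds follows; the synthetic arrays stand in for `X` and `y`, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data (not the Titanic dataset)
rng = np.random.default_rng(0)
X_syn = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y_syn = np.array([0] * 100 + [1] * 100)

# Shuffle rows before splitting, keeping the class ratio in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
shuffled_scores = cross_val_score(GaussianNB(), X_syn, y_syn, cv=cv)
print(shuffled_scores)
print(shuffled_scores.mean())
```

In this notebook the same splitter could be passed as `cross_val_score(model, X, y, cv=cv)`.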
