# Titanic Bayes

Using the Titanic dataset, we clean the data (handling missing values by removing or filling them, and converting non-numerical data to numeric values) and then build Gaussian and Bernoulli Naive Bayes models to predict each passenger's survival status (1 = survived, 0 = did not survive).

### \begin{align} \text{probability} = \frac{\text{number of favorable outcomes}}{\text{total number of outcomes}} \end{align}

### \begin{align} P(B|A) = \frac{P(B)\times P(A|B)}{P(A)} \end{align}
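
As a quick check of Bayes' theorem, the sketch below plugs made-up counts into the formula and compares the result with a direct count. The numbers are hypothetical and are not taken from the Titanic data; here A stands for "passenger is female" and B for "passenger survived".

```python
# Hypothetical counts for illustration only (not from the Titanic dataset)
total = 100            # all passengers
survived = 40          # event B: passenger survived
female = 35            # event A: passenger is female
female_survived = 28   # passengers who are both female and survivors

p_b = survived / total                     # P(B)
p_a = female / total                       # P(A)
p_a_given_b = female_survived / survived   # P(A|B)

# Bayes' theorem: P(B|A) = P(B) * P(A|B) / P(A)
p_b_given_a = p_b * p_a_given_b / p_a

# Direct count for comparison: share of females who survived
direct = female_survived / female

print(round(p_b_given_a, 4), round(direct, 4))
```

Both values agree, which is all the theorem claims: the conditional probability can be recovered from the reverse conditional and the two marginals.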

In [41]:
import pandas as pd
import numpy as np

from sklearn.naive_bayes import GaussianNB   #import Gaussian modeling
from sklearn.naive_bayes import BernoulliNB  #import Bernoulli modeling
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [188]:
df = pd.read_excel('titanic.xls')
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


## Handle the missing values in our dataset

In [189]:
# Check for missing data in our dataset
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [190]:
#fill missing ages with the mean age of passengers sharing the same survival status, sex, and passenger class
df['age'] = df['age'].fillna(df.groupby(['survived', 'sex', 'pclass'])['age'].transform('mean'))
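
To make the group-wise fill easier to follow, here is a minimal sketch on a toy DataFrame with hypothetical values. It groups only on `sex` and `pclass`; grouping on `survived` as well, as the notebook does, uses the label itself during imputation, which would not be possible for a passenger whose outcome is unknown. The sketch also assigns the filled Series back to the column rather than calling `fillna(..., inplace=True)` on a column selection, which recent pandas versions warn about.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with two missing ages
toy = pd.DataFrame({
    'sex':    ['female', 'female', 'female', 'male', 'male', 'male'],
    'pclass': [1, 1, 1, 3, 3, 3],
    'age':    [30.0, 40.0, np.nan, 20.0, np.nan, 24.0],
})

# transform('mean') returns a Series aligned with the original rows,
# holding each row's group mean (NaN values are skipped when averaging)
group_mean = toy.groupby(['sex', 'pclass'])['age'].transform('mean')

# Fill only the missing entries and assign the result back to the column
toy['age'] = toy['age'].fillna(group_mean)
print(toy)
```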

In [191]:
#only 2 missing values, so we fill them with the most common embarkation point ('S')
#df['embarked'].value_counts()
df['embarked'] = df['embarked'].fillna('S')

## Reformatting data for analysis:
Convert df['sex'] to a binary value

Create df['family_num'] from df['sibsp'] and df['parch']

Generate a new DataFrame, model_df, holding only the columns we model

Drop df['embarked'] along with the other unused columns

Create model_df['TravelSolo']

In [192]:
#change sex values to binary
#female=0, male=1
df['sex'] = df['sex'].map({'female':0, 'male':1})

In [193]:
df['family_num'] = df['sibsp'] + df['parch']

In [194]:
model_df = df.drop(['name','ticket','fare', 'cabin', 'boat', 'body', 'home.dest', 'sibsp', 'embarked', 'parch'], axis=1)

In [195]:
model_df['TravelSolo'] = np.where((model_df['family_num']>0),0,1)
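
One thing worth checking after a dictionary `map` is whether any value failed to match, because unmatched entries silently become `NaN`. The sketch below shows that check on toy data (hypothetical rows, including a deliberately mismatched `'Male'`), along with a boolean-cast alternative to `np.where` that yields the same `TravelSolo` flag.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'sex':   ['female', 'male', 'Male'],   # 'Male' is a deliberate mismatch
    'sibsp': [1, 0, 0],
    'parch': [2, 0, 1],
})

# Values missing from the mapping dictionary come back as NaN
mapped = toy['sex'].map({'female': 0, 'male': 1})
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)

toy['family_num'] = toy['sibsp'] + toy['parch']

# Two equivalent ways to flag passengers with no family aboard
solo_where = np.where(toy['family_num'] > 0, 0, 1)
solo_bool = (toy['family_num'] == 0).astype(int)
print(solo_where.tolist(), solo_bool.tolist())
```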

In [196]:
# Verify the data count per column and head(), before moving on
model_df.count()

pclass        1309
survived      1309
sex           1309
age           1309
family_num    1309
TravelSolo    1309
dtype: int64

In [197]:
model_df.head()

Unnamed: 0,pclass,survived,sex,age,family_num,TravelSolo
0,1,1,0,29.0,0,1
1,1,1,1,0.9167,3,0
2,1,0,0,2.0,3,0
3,1,0,1,30.0,3,0
4,1,0,0,25.0,3,0


In [198]:
# Showing correlation for the current dataframe before using Gaussian and Bernoulli methods
model_df.corr()

Unnamed: 0,pclass,survived,sex,age,family_num,TravelSolo
pclass,1.0,-0.312469,0.124617,-0.444002,0.050027,0.147393
survived,-0.312469,1.0,-0.528693,-0.060032,0.026876,-0.201719
sex,0.124617,-0.528693,1.0,0.080752,-0.188583,0.284537
age,-0.444002,-0.060032,0.080752,1.0,-0.206087,0.116266
family_num,0.050027,0.026876,-0.188583,-0.206087,1.0,-0.688864
TravelSolo,0.147393,-0.201719,0.284537,0.116266,-0.688864,1.0
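
To read the table at a glance, the sketch below ranks the features by the absolute value of their correlation with `survived`, using the numbers printed above. On the real frame the same Series could be obtained with `model_df.corr()['survived'].drop('survived')`.

```python
import pandas as pd

# Correlations with 'survived', copied from the table above
corr_with_survived = pd.Series({
    'pclass': -0.312469,
    'sex': -0.528693,
    'age': -0.060032,
    'family_num': 0.026876,
    'TravelSolo': -0.201719,
})

# Rank by strength of the linear relationship, ignoring its sign
ranked = corr_with_survived.abs().sort_values(ascending=False)
print(ranked)
```

Sex and passenger class stand out, while age and family size show almost no linear relationship with survival. The table also shows that `family_num` and `TravelSolo` are strongly correlated with each other (-0.69), which is at odds with the feature-independence assumption behind Naive Bayes.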


## Gaussian model

In [199]:
# Target Series: the value we want to predict
y = model_df['survived'].copy()

# Feature DataFrame: the columns used to predict it
X = model_df.drop('survived', axis=1)

In [200]:
# Creating the test and training data, with the default 75% train and 25% test
# Random state is arbitrary at 77
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=77)
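
Because the two classes are not equally frequent, one optional variation is a stratified split, which keeps the class ratio the same in the training and test partitions. The sketch below demonstrates the `stratify` argument of `train_test_split` on hypothetical toy data; it is not what the notebook ran, and using it would change the scores reported below.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 40% positive class
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 40 + [0] * 60)

# stratify=y_toy asks for the same class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=77, stratify=y_toy
)

print(len(y_tr), len(y_te))        # partition sizes
print(y_tr.mean(), y_te.mean())    # share of the positive class in each
```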

In [201]:
gnb = GaussianNB()

In [202]:
#train the model to learn trends
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)
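
For intuition about what the fit step learns, here is a simplified NumPy sketch of Gaussian Naive Bayes on a tiny hypothetical dataset: it stores a prior, a mean, and a variance per class, then picks the class with the highest log prior plus summed log Gaussian likelihood. The function names are illustrative, and the sketch leaves out the `var_smoothing` term that scikit-learn adds to the variances, so it is not a drop-in replacement for `GaussianNB`.

```python
import numpy as np

# Tiny hypothetical training set: one feature, two classes
X = np.array([[1.0], [2.0], [3.0], [6.0], [7.0], [8.0]])
y = np.array([0, 0, 0, 1, 1, 1])

def fit_gaussian_nb(X, y):
    """Return classes, priors, and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Pick the class with the highest log prior + summed log likelihood."""
    scores = []
    for prior, mu, var in zip(priors, means, variances):
        # Log of the Gaussian density, evaluated per feature
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
        # Naive assumption: features are independent, so log likelihoods add
        scores.append(np.log(prior) + log_lik.sum(axis=1))
    return classes[np.argmax(np.array(scores), axis=0)]

params = fit_gaussian_nb(X, y)
pred = predict_gaussian_nb(np.array([[2.5], [6.5]]), *params)
print(pred)
```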

In [203]:
#predictive score of the model on the training data
gscoreTrain = gnb.score(X_train, y_train)
gscoreTrain

0.7889908256880734

In [204]:
#test the model on unseen data
#store the predicted values in a variable
y_pred = gnb.predict(X_test)

In [205]:
#Confusion matrix shows which values the model predicted correctly vs incorrectly
gcm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Deceased', 'Predicted Survived'],
    index=['True Deceased', 'True Survived']
)

In [206]:
#number of deceased (0) and surviving (1) passengers in the test dataset
y_test.value_counts()

0    197
1    131
Name: survived, dtype: int64

In [207]:
#predictive score of the model on the test data
gscoreTest = gnb.score(X_test, y_test)
gscoreTest

0.7865853658536586

## Bernoulli model

In [208]:
#initialize Bernoulli Naïve Bayes function to a variable
bnb = BernoulliNB()

In [209]:
#build the model with training data
bnb.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
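
The `binarize=0.0` setting in the output above matters for these features: before fitting, `BernoulliNB` maps every value greater than the threshold to 1 and everything else to 0. The NumPy sketch below applies that rule by hand to the feature rows shown earlier in `model_df.head()`; it imitates the thresholding step only and is not the library's internal code.

```python
import numpy as np

# Feature rows copied from model_df.head():
# pclass, sex, age, family_num, TravelSolo
rows = np.array([
    [1, 0, 29.0,   0, 1],
    [1, 1, 0.9167, 3, 0],
    [1, 0, 2.0,    3, 0],
    [1, 1, 30.0,   3, 0],
    [1, 0, 25.0,   3, 0],
])

threshold = 0.0  # the binarize value shown in the BernoulliNB output
binarized = (rows > threshold).astype(int)
print(binarized)
```

Since `pclass` is always 1, 2, or 3, and assuming every age is positive, both columns become a constant 1 after thresholding and carry no information for the Bernoulli model. `family_num` collapses to "has family aboard", the exact inverse of `TravelSolo`, so the model is effectively working from `sex` and a single family flag.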

In [210]:
#model's predictive score on the training data
bscoreTrain = bnb.score(X_train, y_train)
bscoreTrain

0.7757390417940877

In [211]:
#test the model on unseen data
#store the predicted values in a variable (this overwrites the Gaussian predictions)
y_pred = bnb.predict(X_test)

In [212]:
#Confusion matrix shows which values the model predicted correctly vs incorrectly
bcm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Deceased', 'Predicted Survived'],
    index=['True Deceased', 'True Survived']
)

In [213]:
bscoreTest = bnb.score(X_test, y_test)
bscoreTest

0.7926829268292683

## Conclusion

Did one model perform better than the other? 

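Setup:
- Train data: 75%
- Test data: 25%
- Random state: 77
- Test accuracy: about 0.79 for both models

Survival recall (survivors correctly identified): Gaussian (67.9%) vs Bernoulli (64.9%).

Deceased recall (deceased correctly identified): Bernoulli (88.8%) vs Gaussian (85.8%).

Neither model was clearly better: Bernoulli scored 0.7927 on the test set against 0.7866 for Gaussian, a gap of two passengers out of 328. Both models identified the deceased far more reliably than the survivors, missing roughly a third of the passengers who actually survived.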
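
The per-class figures can be recomputed directly from the two confusion matrices printed at the end of the notebook. The short sketch below does this in plain Python; the `summarize` helper is illustrative and not part of scikit-learn.

```python
# Confusion matrices copied from the notebook output
# rows = true class, columns = predicted class, order = [deceased, survived]
gaussian = [[169, 28], [42, 89]]
bernoulli = [[175, 22], [46, 85]]

def summarize(cm):
    """Return (deceased recall, survived recall, overall accuracy)."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return tn / (tn + fp), tp / (fn + tp), (tn + tp) / total

for name, cm in [('Gaussian', gaussian), ('Bernoulli', bernoulli)]:
    dec, surv, acc = summarize(cm)
    print(f'{name}: deceased {dec:.1%}, survived {surv:.1%}, accuracy {acc:.1%}')
# Gaussian: deceased 85.8%, survived 67.9%, accuracy 78.7%
# Bernoulli: deceased 88.8%, survived 64.9%, accuracy 79.3%
```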

In [214]:
#per-class precision, recall, and f1 for the Bernoulli model (y_pred was last set by bnb.predict)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.79      0.89      0.84       197
           1       0.79      0.65      0.71       131

    accuracy                           0.79       328
   macro avg       0.79      0.77      0.78       328
weighted avg       0.79      0.79      0.79       328
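
The precision, recall, and f1 values in this report follow from the Bernoulli confusion matrix shown at the end of the notebook. As a cross-check, the sketch below derives them by hand; `class_metrics` is an illustrative helper, not a scikit-learn function.

```python
# Bernoulli confusion matrix from the notebook output
# rows = true class, columns = predicted class, order = [0, 1]
cm = [[175, 22], [46, 85]]

def class_metrics(cm, k):
    """Precision, recall and F1 for class index k of a square confusion matrix."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)   # column sum: everything predicted as k
    actual_k = sum(cm[k])                     # row sum: everything truly k (support)
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for k in (0, 1):
    p, r, f = class_metrics(cm, k)
    print(k, round(p, 2), round(r, 2), round(f, 2))
# 0 0.79 0.89 0.84
# 1 0.79 0.65 0.71
```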



In [215]:
print(f'Gaussian Test: {round(gscoreTest, 4)}, Bernoulli Test: {round(bscoreTest, 4)}')
print(f'Difference: {round(gscoreTest - bscoreTest, 4)}')
print(f'Gaussian Train: {round(gscoreTrain, 4)}, Bernoulli Train: {round(bscoreTrain, 4)}')
print(f'Difference: {round(gscoreTrain - bscoreTrain, 4)}')

Gaussian Test: 0.7866, Bernoulli Test: 0.7927
Difference: -0.0061
Gaussian Train: 0.789, Bernoulli Train: 0.7757
Difference: 0.0133


In [216]:
print('Gaussian')
gcm

Gaussian


Unnamed: 0,Predicted Deceased,Predicted Survived
True Deceased,169,28
True Survived,42,89


In [217]:
print('Bernoulli')
bcm

Bernoulli


Unnamed: 0,Predicted Deceased,Predicted Survived
True Deceased,175,22
True Survived,46,85
