# Exercise: Titanic Bayes

### Using the Titanic dataset, clean up the data (handle missing values either by removal or filling, and transform non-numerical data into numerical values) and then build Gaussian and Bernoulli Naive Bayes models to predict Titanic passengers' survival status (1=survived, 0=did not survive). Compare the two models against each other. Did one model perform better than the other? How does the performance of these two models compare to the other classification algorithms, logistic regression and decision trees?

For a bonus challenge, try different methods of preparing your data (cleaning, choosing rows/columns) to see if that affects your results.

*To see an example of the predictive output of logistic regression and decision trees, run the code in the Lv 1 Module 8: Logistic Regression and Module 9: Decision Trees notebooks.

Upload your Jupyter notebook to Github and submit the URL to turn in this assignment.



# TODO
## 1. Clean up the data (handle missing values either by removal or filling)
## 2. Transform non-numerical data into numerical values
## 3. Build Gaussian and Bernoulli Naive Bayes models to predict Titanic passengers' survival status (1=survived, 0=did not survive)
## 4. Compare the two models against each other

In [1]:
import pandas as pd
import numpy as np

In [2]:
#load data 

filename = "titanic.xls"
df = pd.read_excel(filename) 


df.head() #first 5 rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
pclass       1309 non-null int64
survived     1309 non-null int64
name         1309 non-null object
sex          1309 non-null object
age          1046 non-null float64
sibsp        1309 non-null int64
parch        1309 non-null int64
ticket       1309 non-null object
fare         1308 non-null float64
cabin        295 non-null object
embarked     1307 non-null object
boat         486 non-null object
body         121 non-null float64
home.dest    745 non-null object
dtypes: float64(3), int64(4), object(7)
memory usage: 143.2+ KB


In [4]:
np.logical_not(df.isnull()).sum()

pclass       1309
survived     1309
name         1309
sex          1309
age          1046
sibsp        1309
parch        1309
ticket       1309
fare         1308
cabin         295
embarked     1307
boat          486
body          121
home.dest     745
dtype: int64

In [5]:
#find columns that have missing values
df.isnull().sum()

#np.logical_not(df.isnull()).sum()

pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [7]:
#rows where the age is missing
missing_age = df.loc[df['age'].isnull()]
missing_age.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
15,1,0,"Baumann, Mr. John D",male,,0,0,PC 17318,25.925,,S,,,"New York, NY"
37,1,1,"Bradley, Mr. George (""George Arthur Brayton"")",male,,0,0,111427,26.55,,S,9.0,,"Los Angeles, CA"
40,1,0,"Brewe, Dr. Arthur Jackson",male,,0,0,112379,39.6,,C,,,"Philadelphia, PA"
46,1,0,"Cairns, Mr. Alexander",male,,0,0,113798,31.0,,S,,,
59,1,1,"Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev...",female,,0,0,17770,27.7208,,C,5.0,,"New York, NY"


In [8]:
#get index numbers of missing rows - we'll use this later
mals = list(missing_age.index)

In [9]:
#table of avg age of passenger by survival status, sex, and passenger class
df.groupby(['survived', 'sex', 'pclass'])['age'].mean()

survived  sex     pclass
0         female  1         35.200000
                  2         34.090909
                  3         23.418750
          male    1         43.658163
                  2         33.092593
                  3         26.679598
1         female  1         37.109375
                  2         26.711051
                  3         20.814815
          male    1         36.168240
                  2         17.449274
                  3         22.436441
Name: age, dtype: float64

In [10]:
#fill missing values for age based on survival status, sex, and passenger class
df['age'] = df['age'].fillna(df.groupby(['survived', 'sex', 'pclass'])['age'].transform('mean'))
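
The group-mean fill above can be hard to picture, so here is a minimal sketch on a toy frame. The values are hypothetical (not taken from titanic.xls), and for brevity it groups on `sex` and `pclass` only.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from titanic.xls)
toy = pd.DataFrame({
    "sex":    ["male", "male", "male", "female", "female", "female"],
    "pclass": [1, 1, 1, 3, 3, 3],
    "age":    [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# transform('mean') returns a Series aligned with toy's index,
# holding the mean age of each row's (sex, pclass) group
group_mean = toy.groupby(["sex", "pclass"])["age"].transform("mean")

# fillna only touches the rows where age is missing
toy["age"] = toy["age"].fillna(group_mean)
print(toy)
```

Because `transform` keeps the original index, the group means line up row by row with the missing values, so no merge or loop is needed.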

In [11]:
#verify filled missing values (mals holds index labels, so select with .loc)
df.loc[mals].head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
15,1,0,"Baumann, Mr. John D",male,43.658163,0,0,PC 17318,25.925,,S,,,"New York, NY"
37,1,1,"Bradley, Mr. George (""George Arthur Brayton"")",male,36.16824,0,0,111427,26.55,,S,9.0,,"Los Angeles, CA"
40,1,0,"Brewe, Dr. Arthur Jackson",male,43.658163,0,0,112379,39.6,,C,,,"Philadelphia, PA"
46,1,0,"Cairns, Mr. Alexander",male,43.658163,0,0,113798,31.0,,S,,,
59,1,1,"Cassebeer, Mrs. Henry Arthur Jr (Eleanor Genev...",female,37.109375,0,0,17770,27.7208,,C,5.0,,"New York, NY"


In [12]:
#verify there are no more missing age values
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

In [13]:
#missing values for 'embarked'
embark = df.loc[df['embarked'].isnull()]
embark

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
168,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,,6,,
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,,6,,"Cincinatti, OH"


In [14]:
#save index for missing values to verify later
embarkls = list(embark.index)

In [15]:
#only 2 missing values so we'll fill with most common embarkation point
df['embarked'].value_counts()

S    914
C    270
Q    123
Name: embarked, dtype: int64

In [16]:
#fill missing values
df['embarked'] = df['embarked'].fillna('S')
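
Instead of reading the most common port off `value_counts()` and typing it in by hand, the same fill could be done with `Series.mode()`. A small sketch on a hypothetical Series:

```python
import numpy as np
import pandas as pd

# Hypothetical embarkation codes with two missing entries
ports = pd.Series(["S", "C", "S", np.nan, "Q", "S", np.nan])

# mode() ignores missing values by default and returns a Series,
# so [0] picks the (first) most frequent value
most_common = ports.mode()[0]

filled = ports.fillna(most_common)
print(most_common)
print(filled.tolist())
```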

In [17]:
#check that they're filled
df.loc[embarkls]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
168,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,S,6,,
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,S,6,,"Cincinatti, OH"


In [18]:
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        0
boat          823
body         1188
home.dest     564
dtype: int64

In [19]:
modeldf = df.drop(['name','ticket','fare', 'cabin', 'boat', 'body', 'home.dest'], axis=1)

In [20]:
#columns left in our dataframe
modeldf.columns

Index(['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch', 'embarked'], dtype='object')

In [21]:
#dummy variables for passenger class and embarkation port
#get_dummies drops the original columns that the dummies were created from
modeldf = pd.get_dummies(data=modeldf, columns=['pclass','embarked'])
modeldf.head()

Unnamed: 0,survived,sex,age,sibsp,parch,pclass_1,pclass_2,pclass_3,embarked_C,embarked_Q,embarked_S
0,1,female,29.0,0,0,1,0,0,0,0,1
1,1,male,0.9167,1,2,1,0,0,0,0,1
2,0,female,2.0,1,2,1,0,0,0,0,1
3,0,male,30.0,1,2,1,0,0,0,0,1
4,0,female,25.0,1,2,1,0,0,0,0,1
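
A small sketch of what `get_dummies` does, on a hypothetical three-row frame. Note that pandas 2.0 and later return boolean dummy columns by default; passing `dtype=int` (an addition here, not in the cell above) reproduces the 0/1 values shown in the output.

```python
import pandas as pd

# Hypothetical rows, for illustration only
toy = pd.DataFrame({
    "pclass":   [1, 3, 2],
    "embarked": ["S", "C", "Q"],
    "age":      [29.0, 2.0, 30.0],
})

# One indicator column per category; the source columns are dropped
encoded = pd.get_dummies(data=toy, columns=["pclass", "embarked"], dtype=int)
print(encoded.columns.tolist())
print(encoded)
```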


In [22]:
#change sex values to binary
#female=0, male=1
modeldf['sex'] = modeldf['sex'].map({'female':0, 'male':1})
modeldf.head()

Unnamed: 0,survived,sex,age,sibsp,parch,pclass_1,pclass_2,pclass_3,embarked_C,embarked_Q,embarked_S
0,1,0,29.0,0,0,1,0,0,0,0,1
1,1,1,0.9167,1,2,1,0,0,0,0,1
2,0,0,2.0,1,2,1,0,0,0,0,1
3,0,1,30.0,1,2,1,0,0,0,0,1
4,0,0,25.0,1,2,1,0,0,0,0,1


In [23]:
#create new column based on number of family members
#drop sibsp and parch columns
modeldf['family_num'] = modeldf['sibsp'] + modeldf['parch']
modeldf.drop(['sibsp', 'parch'], axis=1, inplace=True)
modeldf.head(15)

Unnamed: 0,survived,sex,age,pclass_1,pclass_2,pclass_3,embarked_C,embarked_Q,embarked_S,family_num
0,1,0,29.0,1,0,0,0,0,1,0
1,1,1,0.9167,1,0,0,0,0,1,3
2,0,0,2.0,1,0,0,0,0,1,3
3,0,1,30.0,1,0,0,0,0,1,3
4,0,0,25.0,1,0,0,0,0,1,3
5,1,1,48.0,1,0,0,0,0,1,0
6,1,0,63.0,1,0,0,0,0,1,1
7,0,1,39.0,1,0,0,0,0,1,0
8,1,0,53.0,1,0,0,0,0,1,2
9,0,1,71.0,1,0,0,1,0,0,0


### 3  Build Gaussian and Bernoulli Naive Bayes models to predict Titanic passengers' survival status (1=survived, 0=did not survive). 

In [24]:
#descriptive statistics
modeldf.describe()

Unnamed: 0,survived,sex,age,pclass_1,pclass_2,pclass_3,embarked_C,embarked_Q,embarked_S,family_num
count,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0,1309.0
mean,0.381971,0.644003,29.409509,0.246753,0.211612,0.541635,0.206264,0.093965,0.699771,0.883881
std,0.486055,0.478997,13.208523,0.431287,0.408607,0.498454,0.404777,0.291891,0.458533,1.583639
min,0.0,0.0,0.1667,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,22.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,1.0,26.679598,0.0,0.0,1.0,0.0,0.0,1.0,0.0
75%,1.0,1.0,36.16824,0.0,0.0,1.0,0.0,0.0,1.0,1.0
max,1.0,1.0,80.0,1.0,1.0,1.0,1.0,1.0,1.0,10.0


In [25]:
#total number of passengers
total = len(modeldf)

In [26]:
total

1309

In [27]:
#rows of passengers that survived the accident
df_totalSurv = modeldf[modeldf['survived'] > 0]

#number of passengers that survived 
survivors = len(df_totalSurv)

In [28]:
survivors

500

In [29]:
#probability of surviving
#number of passengers that survived divided by total number of passengers
P_pass = survivors/total
P_pass

0.3819709702062643

In [30]:
#rows of passengers that were male
df_male = modeldf[modeldf['sex'] > 0]
men = len(df_male)
men
#number of MALE passengers that survived
maleSurv= df_male[df_male['survived']>0]

In [31]:
TotalMaleSurv = len(maleSurv)

In [32]:
TotalMaleSurv

161

In [33]:
#probability of a male passenger surviving
#number of males that survived divided by total number of male passengers
P_male = TotalMaleSurv/men
P_male

0.19098457888493475
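
Filtering and counting by hand works, but the same conditional probabilities can be read off a normalized cross-tabulation. A sketch on hypothetical data, using the same coding as above (sex: 0=female, 1=male):

```python
import pandas as pd

# Hypothetical passengers: 4 men (1 survived) and 4 women (3 survived)
toy = pd.DataFrame({
    "sex":      [1, 1, 1, 1, 0, 0, 0, 0],
    "survived": [0, 0, 0, 1, 1, 1, 1, 0],
})

# normalize='index' divides each row by its row total,
# so each row holds P(survived = k | sex)
rates = pd.crosstab(toy["sex"], toy["survived"], normalize="index")
print(rates)

p_surv_given_male = rates.loc[1, 1]
p_surv_given_female = rates.loc[0, 1]
```

On the real data, `pd.crosstab(modeldf['sex'], modeldf['survived'], normalize='index')` should give the male survival rate in row 1, column 1.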

In [59]:
#rows of passengers aged 18 or over
df_all_adult = modeldf[modeldf['age'] >= 18]
adults = len(df_all_adult)
#rows of ADULT passengers that survived
adultSurv = df_all_adult[df_all_adult['survived'] > 0]

In [60]:
#probability of an adult surviving
#number of adults that survived divided by total number of adult passengers
P_adults = len(adultSurv)/adults
P_adults

In [61]:
#SOLUTION: probability of surviving if you are MALE, using Bayes' theorem
#P(survived | male) = P(survived) * P(male | survived) / P(male)
P_male_given_surv = TotalMaleSurv/survivors   #share of survivors that were male
P_is_male = men/total                         #share of all passengers that were male
P_surv_given_male = (P_pass * P_male_given_surv)/P_is_male
P_surv_given_male   #should match P_male from In [33] up to floating-point rounding



***

## Naïve Bayes using Scikit-Learn

Let's use the same dataset above and build a Naïve Bayes classification model to predict passenger survival.

### Gaussian Naïve Bayes

There are different types of Naïve Bayes classifiers; in the examples below, we will use Gaussian Naïve Bayes to build the predictive model. Gaussian Naïve Bayes applies Bayes' theorem under the assumption that each continuous feature is normally distributed within each class.
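
To make the idea concrete, here is a minimal hand-rolled sketch of Gaussian Naïve Bayes for a single feature. It is not scikit-learn's implementation: the data are hypothetical and the helper names `fit_gaussian_nb` and `predict_one` are made up for illustration. With several features, the per-feature log likelihoods would simply be added together.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical one-feature data: class 0 centered at 20, class 1 centered at 40
X = np.concatenate([rng.normal(20, 3, 50), rng.normal(40, 3, 50)])
y = np.array([0] * 50 + [1] * 50)

def fit_gaussian_nb(X, y):
    """Return {class: (prior, mean, variance)} estimated from the training data."""
    params = {}
    for c in np.unique(y):
        xc = X[y == c]
        params[c] = (len(xc) / len(X), xc.mean(), xc.var())
    return params

def predict_one(x, params):
    """Pick the class with the largest log prior + Gaussian log likelihood."""
    scores = {}
    for c, (prior, mu, var) in params.items():
        log_lik = -0.5 * np.log(2 * np.pi * var) - (x - mu) ** 2 / (2 * var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

params = fit_gaussian_nb(X, y)
print(predict_one(18.0, params), predict_one(42.0, params))
```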

In [39]:
from sklearn.naive_bayes import GaussianNB   #import Gaussian Bayes modeling function
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [42]:
#inspect the prepared data before modeling (missing values were handled above)
modeldf

Unnamed: 0,survived,sex,age,pclass_1,pclass_2,pclass_3,embarked_C,embarked_Q,embarked_S,family_num
0,1,0,29.000000,1,0,0,0,0,1,0
1,1,1,0.916700,1,0,0,0,0,1,3
2,0,0,2.000000,1,0,0,0,0,1,3
3,0,1,30.000000,1,0,0,0,0,1,3
4,0,0,25.000000,1,0,0,0,0,1,3
5,1,1,48.000000,1,0,0,0,0,1,0
6,1,0,63.000000,1,0,0,0,0,1,1
7,0,1,39.000000,1,0,0,0,0,1,0
8,1,0,53.000000,1,0,0,0,0,1,2
9,0,1,71.000000,1,0,0,1,0,0,0


In [41]:
df.dtypes

pclass         int64
survived       int64
name          object
sex           object
age          float64
sibsp          int64
parch          int64
ticket        object
fare         float64
cabin         object
embarked      object
boat          object
body         float64
home.dest     object
dtype: object

In [43]:
#dataframe with predicting features
X = modeldf.drop('survived', axis=1)

#column of predictive target values
y = modeldf['survived']

In [44]:
#create training and test data
#will leave test size at default (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=109)
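
As a quick check on what the default split does, the sketch below applies `train_test_split` to placeholder arrays with the same row count (1309) and survivor count (500) as above. The `stratify` argument is an optional extra not used in this notebook; it keeps the class balance of the test set close to that of the full data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the Titanic set (values are not real features)
X_demo = np.arange(1309).reshape(-1, 1)
y_demo = np.array([1] * 500 + [0] * 809)

# Default test_size is 0.25; stratify keeps the survivor share similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=109, stratify=y_demo
)

print(len(X_tr), len(X_te))   # row counts of the training and test sets
print(y_te.mean())            # share of survivors in the test set
```

The default 25% test size on 1309 rows yields 328 test rows, which matches the 200 + 128 rows counted in `y_test` further down.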

In [45]:
#initialize Gaussian Bayes classifier
gnb = GaussianNB()

In [46]:
#train the model to learn trends
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [47]:
#predictive score of the model on the training data
gnb.score(X_train, y_train)

0.7512742099898063

In [48]:
#test the model on unseen data
#store predicted values in a variable
y_pred = gnb.predict(X_test)

In [49]:
#Confusion matrix shows which values model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Died', 'Predicted Survived'],
    index=['True Died', 'True Survived']
)

cm

Unnamed: 0,Predicted Died,Predicted Survived
True Died,166,34
True Survived,46,82


In [50]:
#frequency of non-survivors (0) and survivors (1) in the test dataset
y_test.value_counts()

0    200
1    128
Name: survived, dtype: int64

In [51]:
#predictive score of the model on the test data
gnb.score(X_test, y_test)

0.7560975609756098

In [52]:
#predictive score of the model for each predictive category
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.83      0.81       200
           1       0.71      0.64      0.67       128

   micro avg       0.76      0.76      0.76       328
   macro avg       0.74      0.74      0.74       328
weighted avg       0.75      0.76      0.75       328
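
The numbers in the classification report follow directly from the Gaussian model's confusion matrix above. A short sketch that recomputes accuracy, and precision, recall, and F1 for class 1 (survived), from those four counts:

```python
import numpy as np

# Counts from the Gaussian model's confusion matrix (rows = true, columns = predicted)
cm_values = np.array([[166, 34],
                      [46, 82]])

# ravel() flattens row by row: true negatives, false positives, false negatives, true positives
tn, fp, fn, tp = cm_values.ravel()

accuracy = (tp + tn) / cm_values.sum()      # share of all predictions that were correct
precision_1 = tp / (tp + fp)                # of predicted survivors, how many survived
recall_1 = tp / (tp + fn)                   # of actual survivors, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(round(accuracy, 4), round(precision_1, 2), round(recall_1, 2), round(f1_1, 2))
```

The rounded values agree with the test score and with the row for class 1 in the report.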



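### Bernoulli Naïve Bayes

The Bernoulli Naïve Bayes classifier is designed for features that are binary (Boolean; True/False or 1/0 values), rather than for a binary target. Let's try this method on the dataset from the previous example. Note that with the default `binarize=0.0`, non-binary features such as `age` and `family_num` are thresholded at 0 before the model is fit.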

In [53]:
#import Bernoulli Naïve Bayes function from scikit-learn library
from sklearn.naive_bayes import BernoulliNB

In [54]:
#initialize Bernoulli Naïve Bayes function to a variable
bnb = BernoulliNB()

In [55]:
#build the model with training data
bnb.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
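
The repr above lists `binarize=0.0`, which means every feature value greater than 0 is turned into 1 and everything else into 0 before fitting. The sketch below imitates that thresholding with plain NumPy on a few hypothetical rows (columns: age, family_num) to show what it does to non-binary columns. The cut-off of 18 in the last step is an arbitrary illustrative choice, not something the notebook uses.

```python
import numpy as np

# Hypothetical rows: columns are age and family_num
features = np.array([[29.0, 0],
                     [0.9167, 3],
                     [30.0, 3],
                     [48.0, 0]])

# Same rule as binarize=0.0: values strictly greater than the threshold become 1
threshold = 0.0
binarized = (features > threshold).astype(int)
print(binarized)
# age becomes 1 for every row (no information left),
# family_num becomes a "has family aboard" flag

# Alternative: binarize age by hand at a meaningful cut-off before fitting
is_adult = (features[:, 0] >= 18).astype(int)
print(is_adult)
```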

In [56]:
#model's predictive score on the training data
bnb.score(X_train, y_train)

0.7757390417940877

In [57]:
#test the model on unseen data
#store predicted values in a variable
y_pred = bnb.predict(X_test)

In [None]:
#Confusion matrix shows which values model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Died', 'Predicted Survived'],
    index=['True Died', 'True Survived']
)

cm

In [58]:
#predictive score of the model on the test data
bnb.score(X_test, y_test)

0.7652439024390244

Doing it manually was not very accurate: I could only consider a few variables at a time, it took me a long time, and I personally couldn't make sense of the result. On the test data, the Gaussian model had a predictive score of 0.7560975609756098 and the Bernoulli model had a predictive score of 0.7652439024390244 (0.7757390417940877 was its score on the training data), so both models gave similar predictions and both were better than the manual calculations.
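
To keep training and test scores from getting mixed up when comparing models, both classifiers can be fit in one loop and their scores collected side by side. This is only a sketch on synthetic data (the features and target below are made up), but the same loop should work with the `X_train`, `X_test`, `y_train`, and `y_test` defined earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, GaussianNB

rng = np.random.default_rng(109)
n = 400
# Hypothetical stand-in features: a binary flag and a continuous value
flag = rng.integers(0, 2, n)
value = rng.normal(30, 10, n)
X_demo = np.column_stack([flag, value])
# Hypothetical target that depends mostly on the binary flag
y_demo = (flag + rng.normal(0, 0.5, n) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=109)

# Fit each model and record accuracy on both the training and the test split
results = {}
for name, model in [("GaussianNB", GaussianNB()), ("BernoulliNB", BernoulliNB())]:
    model.fit(X_tr, y_tr)
    results[name] = {"train": model.score(X_tr, y_tr), "test": model.score(X_te, y_te)}

for name, scores in results.items():
    print(f"{name}: train={scores['train']:.3f} test={scores['test']:.3f}")
```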