Note: A few commands in this file have been updated from the original, so use this version instead of the one in Canvas.

# Naïve Bayes Classifier

Probability is a way to figure out how likely something is to happen. It is calculated by taking the number of ways something can happen and dividing it by the total number of possible outcomes. For example, when flipping a coin there are 2 possible outcomes. The probability of getting heads is 50% (1 chance to get heads, with 2 possible outcomes). The formula would look like:

### \begin{align} \text{probability} = \frac{\text{number of chances}}{\text{total outcomes}} \end{align}

The Naïve Bayes classification model is an algorithm based on Bayes' Theorem, which gives the probability of an event given that another, related event is already known to have occurred. It is represented by the following formula:

### \begin{align} P(B|A) = \frac{P(B)\times P(A|B)}{P(A)} \end{align}

Here, the probability of B given that A happened is equal to the probability of B times the probability of A given that B happened, divided by the probability of A. For example, in a bag of 2 blue marbles and 3 red marbles, the probability of drawing a blue marble first is 2/5. If a blue marble is drawn and not put back, the probability of getting another blue marble is affected by that first draw: only 1 of the 4 remaining marbles is blue, so the probability drops to 1/4.

<center>![Marbles Probability](https://notebooks.azure.com/priesterkc/projects/testdb/raw/marbles.png "Probability using marbles")</center>
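
As a minimal sketch of Bayes' Theorem on the marble example, the code below enumerates every ordered way to draw two marbles without replacement and plugs the resulting probabilities into the formula. The variable names and the `prob` helper are illustrative assumptions and are not part of the original notebook.

```python
from fractions import Fraction
from itertools import permutations

bag = ["blue", "blue", "red", "red", "red"]

# Every ordered pair of distinct marbles (first draw, second draw): 5 * 4 = 20
draws = list(permutations(range(len(bag)), 2))

def prob(event):
    """Fraction of equally likely draws for which event(draw) is True."""
    return Fraction(sum(1 for d in draws if event(d)), len(draws))

def first_blue(d):
    return bag[d[0]] == "blue"

def second_blue(d):
    return bag[d[1]] == "blue"

p_a = prob(first_blue)     # P(A): first marble is blue
p_b = prob(second_blue)    # P(B): second marble is blue
p_a_and_b = prob(lambda d: first_blue(d) and second_blue(d))
p_a_given_b = p_a_and_b / p_b   # P(A|B)

# Bayes' Theorem: P(B|A) = P(B) * P(A|B) / P(A)
p_b_given_a = p_b * p_a_given_b / p_a
print(p_a, p_b_given_a)  # 2/5 1/4
```

Using `Fraction` keeps the arithmetic exact, so the result matches the hand calculation rather than a rounded decimal.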

## Naïve Bayes Probability Calculation

As an example, suppose we have a dataset of student test results. Let's find the probability of a student passing a test (60% or higher) given that they studied 5 hours or less. Here are the things we'll need to know:

- the total number of students
- the number of students that passed the test
- the number of students that studied 5 hours or less
- the number of students that studied 5 hours or less among those who passed

Using those values, we can then calculate:

- the probability of passing the test
- the probability of studying 5 hours or less
- the probability of studying 5 hours or less, given already passing the test
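
The calculation described by the lists above can be sketched as follows. All of the counts are hypothetical, made up only to show how the pieces fit into Bayes' Theorem; they do not come from any dataset in this notebook.

```python
from fractions import Fraction

# Hypothetical counts, chosen only for illustration
total_students = 20
passed = 15
studied_5_or_less = 8
studied_5_or_less_and_passed = 4   # passed AND studied 5 hours or less

p_pass = Fraction(passed, total_students)                # P(pass)
p_study = Fraction(studied_5_or_less, total_students)    # P(studied <= 5 hours)
p_study_given_pass = Fraction(studied_5_or_less_and_passed, passed)  # P(studied <= 5 | pass)

# Bayes' Theorem: P(pass | studied <= 5) = P(pass) * P(studied <= 5 | pass) / P(studied <= 5)
p_pass_given_study = p_pass * p_study_given_pass / p_study
print(p_pass_given_study)  # 1/2
```

As a sanity check, 4 of the 8 students who studied 5 hours or less passed in this made-up example, which is the same 1/2.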

## Assignment

Using the Titanic dataset, clean up the data (handle missing values either by removal or filling, and transform non-numerical data into numeric values) and then build Gaussian and Bernoulli Naïve Bayes models to predict Titanic passengers' survival status (1=survived, 0=did not survive). Compare the two models against each other. Did one model perform better than the other? How does the performance of these two models compare to the other classification algorithms, logistic regression and decision trees?

For a bonus challenge, try different methods of preparing your data (cleaning, choosing rows/columns) to see if that affects your results.

*To see an example of predictive output of the logistic regression and decision trees, run the code in the notebooks for the Lv 1 Module 8: Logistic Regression and Module 9: Decision Trees notebooks.

In [1]:
import pandas as pd
import numpy as np

In [2]:
#load data
filename = "titanic.xls"
df = pd.read_excel(filename)

df.head() #first 5 rows

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"


In [3]:
#descriptive statistics
df.describe()

Unnamed: 0,pclass,survived,age,sibsp,parch,fare,body
count,1309.0,1309.0,1046.0,1309.0,1309.0,1308.0,121.0
mean,2.294882,0.381971,29.881135,0.498854,0.385027,33.295479,160.809917
std,0.837836,0.486055,14.4135,1.041658,0.86556,51.758668,97.696922
min,1.0,0.0,0.1667,0.0,0.0,0.0,1.0
25%,2.0,0.0,21.0,0.0,0.0,7.8958,72.0
50%,3.0,0.0,28.0,0.0,0.0,14.4542,155.0
75%,3.0,1.0,39.0,1.0,0.0,31.275,256.0
max,3.0,1.0,80.0,8.0,9.0,512.3292,328.0


In [13]:
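#find columns that have missing values
#note: this output appears to have been captured after the fill cells below had already been run,
#which is why age, fare, and embarked show 0 here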
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            0
cabin        1014
embarked        0
boat          823
body         1188
home.dest     564
dtype: int64

In [14]:
#for the 263 missing values for age, we fill them with the mean age
df['age'] = df['age'].fillna(value = df['age'].mean())
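
The bonus challenge suggests trying different ways of preparing the data. As one possible variation, the sketch below compares filling missing values with the mean against filling with the median, which is less sensitive to a few very large values. The `ages` Series is made up for illustration and is not taken from the Titanic data.

```python
import pandas as pd

# Made-up ages with two missing values (not taken from the Titanic data)
ages = pd.Series([22.0, None, 30.0, 80.0, None])

filled_mean = ages.fillna(ages.mean())      # mean of the known ages: 44.0
filled_median = ages.fillna(ages.median())  # median of the known ages: 30.0

print(filled_mean.tolist())    # [22.0, 44.0, 30.0, 80.0, 44.0]
print(filled_median.tolist())  # [22.0, 30.0, 30.0, 80.0, 30.0]
```

Whether the mean or the median works better for the Titanic `age` column is something to test by comparing model scores, not something this sketch settles.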

In [15]:
#check that all the blanks in age have been replaced by the mean age and none remain
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            0
cabin        1014
embarked        0
boat          823
body         1188
home.dest     564
dtype: int64

In [16]:
#fill the missing fare with the average/mean fare and check that there are no more blanks
df['fare'] = df['fare'].fillna(value = df['fare'].mean())
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            0
cabin        1014
embarked        0
boat          823
body         1188
home.dest     564
dtype: int64

In [17]:
#only 2 missing values so we'll fill with most common embarkation point
df['embarked'].value_counts()

S    916
C    270
Q    123
Name: embarked, dtype: int64

In [18]:
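#fill missing values with the most common value, which is S, and then check that there are no blanks
#assign the result back instead of using inplace=True on a single column, which newer pandas versions warn about
df['embarked'] = df['embarked'].fillna('S')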
df.isnull().sum()

pclass          0
survived        0
name            0
sex             0
age             0
sibsp           0
parch           0
ticket          0
fare            0
cabin        1014
embarked        0
boat          823
body         1188
home.dest     564
dtype: int64

In [37]:
#keep only the columns that will be used for modeling; drop the rest
modeldf = df.drop(['name','ticket','fare', 'cabin', 'boat', 'body', 'home.dest','embarked'], axis=1)
modeldf

Unnamed: 0,pclass,survived,sex,age,sibsp,parch
0,1,1,female,29.000000,0,0
1,1,1,male,0.916700,1,2
2,1,0,female,2.000000,1,2
3,1,0,male,30.000000,1,2
4,1,0,female,25.000000,1,2
5,1,1,male,48.000000,0,0
6,1,1,female,63.000000,1,0
7,1,0,male,39.000000,0,0
8,1,1,female,53.000000,2,0
9,1,0,male,71.000000,0,0


***

## Naïve Bayes using Scikit-Learn

Let's use the cleaned Titanic dataset from above and build a Naïve Bayes classification model to predict whether a passenger survived.

### Gaussian Naïve Bayes

There are different types of Naïve Bayes classifiers, and in the examples below we will first use Gaussian Naïve Bayes to build the predictive model. Gaussian Naïve Bayes applies conditional probability under the assumption that, within each class, the values of each feature are normally distributed.
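
To make the idea concrete, here is a minimal hand-rolled sketch of Gaussian Naïve Bayes with a single feature. It is not scikit-learn's implementation: the toy ages, the `gaussian_pdf` and `posterior` helpers, and the use of the population variance are all assumptions made for illustration.

```python
import math
from statistics import mean, pvariance

# Made-up ages grouped by class label (0 = did not survive, 1 = survived)
ages_by_class = {
    0: [22.0, 35.0, 40.0, 54.0],
    1: [2.0, 14.0, 26.0, 30.0],
}

def gaussian_pdf(x, mu, var):
    """Normal density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x):
    """P(class | x) for each class: prior * likelihood, then normalized."""
    n_total = sum(len(values) for values in ages_by_class.values())
    scores = {}
    for label, values in ages_by_class.items():
        prior = len(values) / n_total
        likelihood = gaussian_pdf(x, mean(values), pvariance(values))
        scores[label] = prior * likelihood
    evidence = sum(scores.values())
    return {label: score / evidence for label, score in scores.items()}

post = posterior(10.0)
predicted = max(post, key=post.get)
print(post, predicted)
```

With more than one feature, the "naïve" part is that the per-feature likelihoods are multiplied together as if the features were independent within each class.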

In [38]:
from sklearn.naive_bayes import GaussianNB   #import Gaussian Bayes modeling function
from sklearn.model_selection import train_test_split 
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

In [39]:
#check to see if there are any missing values
modeldf.count()

pclass      1309
survived    1309
sex         1309
age         1309
sibsp       1309
parch       1309
dtype: int64

In [40]:
modeldf.dtypes

pclass        int64
survived      int64
sex          object
age         float64
sibsp         int64
parch         int64
dtype: object

In [41]:
modeldf.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch
0,1,1,female,29.0,0,0
1,1,1,male,0.9167,1,2
2,1,0,female,2.0,1,2
3,1,0,male,30.0,1,2
4,1,0,female,25.0,1,2


In [42]:
#transform gender column to binary values (0,1)
modeldf['sex'] = modeldf['sex'].map({'female': 0, 'male': 1})
modeldf.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch
0,1,1,0,29.0,0,0
1,1,1,1,0.9167,1,2
2,1,0,0,2.0,1,2
3,1,0,1,30.0,1,2
4,1,0,0,25.0,1,2


In [43]:
#see which features are correlated with each other; values closer to 1 or -1 indicate a stronger linear relationship
#sex is coded female=0, male=1, so its negative correlation with survived (-0.53) means males were less likely to survive
modeldf.corr()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch
pclass,1.0,-0.312469,0.124617,-0.36637,0.060832,0.018322
survived,-0.312469,1.0,-0.528693,-0.050199,-0.027825,0.08266
sex,0.124617,-0.528693,1.0,0.057398,-0.109609,-0.213125
age,-0.36637,-0.050199,0.057398,1.0,-0.190747,-0.130872
sibsp,0.060832,-0.027825,-0.109609,-0.190747,1.0,0.373587
parch,0.018322,0.08266,-0.213125,-0.130872,0.373587,1.0


In [44]:
#dataframe of predicting features: every column except the target 'survived'
X = modeldf.drop('survived', axis=1)

#column of target values that the model will learn to predict
y = modeldf['survived']

In [45]:
modeldf.columns

Index(['pclass', 'survived', 'sex', 'age', 'sibsp', 'parch'], dtype='object')

In [46]:
#create training and test data
#will leave test size at default (25%)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=109)
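
As a quick check on what the default split does to this dataset, the sketch below works out the row counts by hand. It assumes the default test fraction of 25% noted in the comment above and that the test count is rounded up, which is consistent with the 328 test rows that appear in the outputs further down.

```python
import math

n_rows = 1309          # number of passengers in modeldf
test_fraction = 0.25   # default used when test_size is not given

# Assumption: the test count is rounded up and the rest goes to training
n_test = math.ceil(n_rows * test_fraction)
n_train = n_rows - n_test

print(n_train, n_test)  # 981 328
```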

In [47]:
#initialize Gaussian Bayes classifier
gnb = GaussianNB()

In [48]:
#train the model to learn trends
gnb.fit(X_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [49]:
#predictive score of the model on the training data
gnb.score(X_train, y_train)

0.7777777777777778

In [50]:
#test the model on unseen data
#store predicted values in a variable
y_pred = gnb.predict(X_test)

In [51]:
#Confusion matrix shows which values the model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Did Not Survive', 'Predicted Survived'],
    index=['True Did Not Survive', 'True Survived']
)

cm

Unnamed: 0,Predicted Did Not Survive,Predicted Survived
True Did Not Survive,165,35
True Survived,47,81


In [52]:
#frequency of passengers who did not survive (0) vs survived (1) in the test dataset
y_test.value_counts()

0    200
1    128
Name: survived, dtype: int64

In [53]:
#predictive score of the model on the test data
gnb.score(X_test, y_test)

0.75

In [54]:
#predictive score of the model for each predictive category
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.78      0.82      0.80       200
           1       0.70      0.63      0.66       128

   micro avg       0.75      0.75      0.75       328
   macro avg       0.74      0.73      0.73       328
weighted avg       0.75      0.75      0.75       328
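
The figures in the report can be reproduced by hand from the Gaussian Naïve Bayes confusion matrix shown earlier. The sketch below does this for class 1 (survived); the variable names are illustrative.

```python
# Counts copied from the Gaussian Naïve Bayes confusion matrix above
tn, fp = 165, 35   # true class 0: predicted 0, predicted 1
fn, tp = 47, 81    # true class 1: predicted 0, predicted 1

accuracy = (tn + tp) / (tn + fp + fn + tp)   # share of all predictions that are correct
precision_1 = tp / (tp + fp)                 # of those predicted 1, how many are truly 1
recall_1 = tp / (tp + fn)                    # of those truly 1, how many were predicted 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
# accuracy=0.75 precision=0.70 recall=0.63 f1=0.66
```

Swapping the roles of the two classes gives the class 0 row of the report in the same way.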



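### Bernoulli Naïve Bayes

The Bernoulli Naïve Bayes classifier is designed for features that are binary (Boolean; True/False or 1/0 values). Scikit-learn's `BernoulliNB` converts each feature to 0 or 1 using a threshold (`binarize=0.0` by default), so any value greater than 0 is treated as 1. Let's try this method on the dataset from the previous example.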

In [55]:
#import Bernoulli Naïve Bayes function from scikit-learn library
from sklearn.naive_bayes import BernoulliNB

In [56]:
#initialize Bernoulli Naïve Bayes function to a variable
bnb = BernoulliNB()

In [57]:
#build the model with training data
bnb.fit(X_train, y_train)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
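
The `binarize=0.0` setting in the output above means each feature is thresholded at 0 before the model uses it. The sketch below applies the same rule by hand to the first five feature rows shown earlier. It is a plain-Python illustration of the thresholding step only, not scikit-learn's code.

```python
# First five rows of the feature columns shown earlier:
# pclass, sex, age, sibsp, parch
rows = [
    [1, 0, 29.0, 0, 0],
    [1, 1, 0.9167, 1, 2],
    [1, 0, 2.0, 1, 2],
    [1, 1, 30.0, 1, 2],
    [1, 0, 25.0, 1, 2],
]

threshold = 0.0  # matches binarize=0.0 in the BernoulliNB output above

# A value becomes 1 if it is strictly greater than the threshold, else 0
binarized = [[int(value > threshold) for value in row] for row in rows]

for row in binarized:
    print(row)
```

Because `pclass` and `age` are always greater than 0 in these rows, they turn into a constant 1 and stop distinguishing passengers, while `sibsp` and `parch` reduce to "has any" flags. Encoding such columns differently before fitting (for example, as separate 0/1 indicator columns) may be worth trying for the bonus challenge.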

In [58]:
#model's predictive score on the training data
bnb.score(X_train, y_train)

0.7849133537206932

In [59]:
#test the model on unseen data
#store predicted values in a variable
y_pred = bnb.predict(X_test)

In [60]:
#Confusion matrix shows which values the model predicted correctly vs incorrectly

cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Did Not Survive', 'Predicted Survived'],
    index=['True Did Not Survive', 'True Survived']
)

cm

Unnamed: 0,Predicted Did Not Survive,Predicted Survived
True Did Not Survive,170,30
True Survived,47,81


In [61]:
#predictive score of the model on the test data
bnb.score(X_test, y_test)

0.7652439024390244

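Overall, both models are better at identifying passengers who did not survive than passengers who survived. On the test data, Gaussian Naïve Bayes correctly identified 165 of the 200 non-survivors (82.5%) but only 81 of the 128 survivors (about 63%). Bernoulli Naïve Bayes identified 170 of the 200 non-survivors (85%) and the same 81 survivors, which gives it a slightly higher test accuracy (about 76.5% versus 75%). One likely reason is that the data contains fewer survivors than non-survivors, so the models have fewer data points from which to learn the trends for survivors. One way to try to improve the results would be to rebalance the training data, for example by reducing the number of non-survivor rows, so that data points for survivors carry more weight. This dataset is also small (1,309 passengers), so the models have limited data to learn from. Lastly, it could just be that Naïve Bayes isn't the best model to use for the data and we should compare its results to other predictive classification models, such as logistic regression and decision trees.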
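
One way to shrink the training data so that the less common class carries more weight is to undersample the more common class. The sketch below shows the idea on a made-up DataFrame. The `train` data is hypothetical, and `groupby(...).sample(...)` assumes pandas 1.1 or newer.

```python
import pandas as pd

# Made-up training rows: six non-survivors (0) and four survivors (1)
train = pd.DataFrame({
    "age": [22.0, 35.0, 40.0, 54.0, 61.0, 19.0, 2.0, 14.0, 26.0, 30.0],
    "survived": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
})

# Size of the smaller class
n_minority = train["survived"].value_counts().min()

# Randomly keep n_minority rows from each class so both classes are the same size
balanced = train.groupby("survived").sample(n=n_minority, random_state=109)

print(balanced["survived"].value_counts().to_dict())
```

Note that undersampling also changes the class proportions that Naïve Bayes uses as its prior probabilities, so it is worth checking the confusion matrix again afterwards to see whether the trade-off actually helps.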