# Bayesian Classifier on Titanic Dataset
##### Dataset can be obtained from Kaggle at: https://www.kaggle.com/competitions/titanic/overview
##### Requires the titanic_unscaled dataset produced in ML Lab 2 (KNN)
##### This notebook applies the following tools and methodologies:
##### 1. ETL the Titanic dataset
##### 2. Apply data discretization using binning (qcut, cut)
##### &ensp;&nbsp; The difference between cut and qcut is the binning strategy.
##### &ensp;&nbsp; qcut chooses bin edges at quantiles, so each bin holds roughly the same number of points
##### &ensp;&nbsp; cut splits the value range into the requested number of equal-width bins (or uses explicit edges), regardless of how many points fall in each bin
##### 3. Apply One Hot Encoding
##### 4. Apply Bayesian classifiers to the dataset and compare prediction accuracy
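To make the contrast between the two binning strategies concrete, here is a minimal sketch on a small, made-up skewed sample (the `values` series is hypothetical and not taken from the Titanic data). It bins the same numbers once with `qcut` and once with `cut` and counts the points per bin.

```python
import pandas as pd

# Hypothetical skewed sample: most values are small, one is large
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# qcut: quantile-based edges, so each bin receives about the same number of points
by_quantile = pd.qcut(values, q=4)

# cut: equal-width bins over the range of the data, so counts can be very uneven
by_width = pd.cut(values, bins=4)

# Count points per bin, listed in bin order (empty bins are reported as 0)
qcut_counts = by_quantile.value_counts().sort_index().tolist()
cut_counts = by_width.value_counts().sort_index().tolist()

print("qcut counts per bin:", qcut_counts)
print("cut counts per bin: ", cut_counts)
```

With a skewed sample like this one, `qcut` spreads the points evenly while `cut` leaves the middle bins empty, which mirrors the uneven age counts that `cut` produces later in this notebook.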

In [2]:
import pandas as pd
import os

# # Show current working directory - the directory where all your files are saved by default
# os.getcwd()

# Set path for new working directory
path = "C:/Users/Sarah/Faris Stuff/USM Data Science Masters Files/CDS503/Week 5 - 16 Mar/Data"
os.chdir(path) 

# # Check to see if current directory has changed
# os.getcwd()

# Read data from CSV to a data frame named df
df = pd.read_csv('titanic_unscaled.csv') 
# Display the data
df

Unnamed: 0,survived,pclass,sex,age,sibspouse,parchild,fare
0,0,3,1,22.0,1,0,7.2500
1,1,1,0,38.0,1,0,71.2833
2,1,3,0,26.0,0,0,7.9250
3,1,1,0,35.0,1,0,53.1000
4,0,3,1,35.0,0,0,8.0500
...,...,...,...,...,...,...,...
882,0,2,1,27.0,0,0,13.0000
883,1,1,0,19.0,0,0,30.0000
884,0,3,0,7.0,1,2,23.4500
885,1,1,1,26.0,0,0,30.0000


#### Data Discretization (Binning) 
##### Bin age data into 4 different buckets using "qcut" method, let pandas figure out how to distribute data (somewhat) equally

In [3]:
# Work on a copy so that changes to df_new do not modify df
df_new = df.copy()

# Divide the data into 4 bins having roughly equal number of instances
# precision sets the number of decimal places used to store and display the bin labels
df_new['age_cat4a'] = pd.qcut(df_new['age'], q = 4, precision = 0)

# View the count of each bin
df_new['age_cat4a'].value_counts()

age_cat4a
(20.0, 28.0]    243
(-1.0, 20.0]    222
(38.0, 80.0]    214
(28.0, 38.0]    208
Name: count, dtype: int64

In [5]:
# Display the newly binned data
# Each age is now assigned to its bin in the age_cat4a column
# Small problem: the interval labels are not very descriptive
df_new

Unnamed: 0,survived,pclass,sex,age,sibspouse,parchild,fare,age_cat4a
0,0,3,1,22.0,1,0,7.2500,"(20.0, 28.0]"
1,1,1,0,38.0,1,0,71.2833,"(28.0, 38.0]"
2,1,3,0,26.0,0,0,7.9250,"(20.0, 28.0]"
3,1,1,0,35.0,1,0,53.1000,"(28.0, 38.0]"
4,0,3,1,35.0,0,0,8.0500,"(28.0, 38.0]"
...,...,...,...,...,...,...,...,...
882,0,2,1,27.0,0,0,13.0000,"(20.0, 28.0]"
883,1,1,0,19.0,0,0,30.0000,"(-1.0, 20.0]"
884,0,3,0,7.0,1,2,23.4500,"(-1.0, 20.0]"
885,1,1,1,26.0,0,0,30.0000,"(20.0, 28.0]"


In [7]:
# Use the labels parameter to name the bins so it is easier to see what each bin represents
# This is the same qcut call as before with labels added, so the labels are shown instead of the
#   points at which the data is cut
# The call can be split over several lines because it sits inside parentheses, so no "\" is needed
df_new['age_cat4b'] = pd.qcut(df_new['age'], q = 4,
                              precision = 0, labels = ['Junior', 'Youth', 'Adult', 'Senior'])
# View the count of each bin
df_new['age_cat4b'].value_counts()

age_cat4b
Youth     243
Junior    222
Senior    214
Adult     208
Name: count, dtype: int64

In [8]:
df_new

Unnamed: 0,survived,pclass,sex,age,sibspouse,parchild,fare,age_cat4a,age_cat4b
0,0,3,1,22.0,1,0,7.2500,"(20.0, 28.0]",Youth
1,1,1,0,38.0,1,0,71.2833,"(28.0, 38.0]",Adult
2,1,3,0,26.0,0,0,7.9250,"(20.0, 28.0]",Youth
3,1,1,0,35.0,1,0,53.1000,"(28.0, 38.0]",Adult
4,0,3,1,35.0,0,0,8.0500,"(28.0, 38.0]",Adult
...,...,...,...,...,...,...,...,...,...
882,0,2,1,27.0,0,0,13.0000,"(20.0, 28.0]",Youth
883,1,1,0,19.0,0,0,30.0000,"(-1.0, 20.0]",Junior
884,0,3,0,7.0,1,2,23.4500,"(-1.0, 20.0]",Junior
885,1,1,1,26.0,0,0,30.0000,"(20.0, 28.0]",Youth


##### Bin age data into 4 different buckets using "cut" method, pandas will split the age range into the number of equal-width bins specified
##### and assign each value to its bin, regardless of how many values end up in each bin

In [10]:
# Divide the data into 4 bins with equal range automatically
df_new['age_cat4c'] = pd.cut(df_new['age'], bins = 4)

# View the count of each bin
df_new['age_cat4c'].value_counts()

age_cat4c
(20.315, 40.21]    487
(0.34, 20.315]     222
(40.21, 60.105]    152
(60.105, 80.0]      26
Name: count, dtype: int64

In [11]:
# cut also allows users to explicitly define the bin boundaries
# Define the name of each bin
cut_labels_4 = ['Junior', 'Youth', 'Adult', 'Senior']

# Define the boundaries of the 4 bins: (0, 20], (20, 30], (30, 50], (50, 80]
# Min age is 0.42 and max age is 80
cut_bins = [0, 20, 30, 50, 80]

# Divide the data into 4 bins with the manually-defined range
df_new['age_cat4d'] = pd.cut(df_new['age'], bins = cut_bins, labels = cut_labels_4)

# View the count of each bin
df_new['age_cat4d'].value_counts()

age_cat4d
Youth     303
Adult     290
Junior    222
Senior     72
Name: count, dtype: int64
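As a minimal sketch of how explicit edges behave, the example below reuses the same bin edges and labels on a few made-up ages that sit on or near the boundaries (the `ages` series is hypothetical). By default `pd.cut` uses right-closed intervals, so an age of exactly 20 falls in the first bin and a value equal to the lowest edge is left unassigned unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical ages chosen to sit on or near the bin edges
ages = pd.Series([0.42, 20, 20.5, 30, 50, 80])

labels = ['Junior', 'Youth', 'Adult', 'Senior']
edges = [0, 20, 30, 50, 80]

# Default right=True: bins are (0, 20], (20, 30], (30, 50], (50, 80]
binned = pd.cut(ages, bins=edges, labels=labels)
result = binned.astype(str).tolist()
print(result)

# A value equal to the lowest edge falls outside (0, 20] and becomes NaN ...
lowest_missing = bool(pd.cut(pd.Series([0]), bins=edges, labels=labels).isna().iloc[0])

# ... unless include_lowest=True closes the first interval on the left
lowest_kept = str(pd.cut(pd.Series([0]), bins=edges, labels=labels,
                         include_lowest=True).iloc[0])
print(lowest_missing, lowest_kept)
```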

In [12]:
df_new

Unnamed: 0,survived,pclass,sex,age,sibspouse,parchild,fare,age_cat4a,age_cat4b,age_cat4c,age_cat4d
0,0,3,1,22.0,1,0,7.2500,"(20.0, 28.0]",Youth,"(20.315, 40.21]",Youth
1,1,1,0,38.0,1,0,71.2833,"(28.0, 38.0]",Adult,"(20.315, 40.21]",Adult
2,1,3,0,26.0,0,0,7.9250,"(20.0, 28.0]",Youth,"(20.315, 40.21]",Youth
3,1,1,0,35.0,1,0,53.1000,"(28.0, 38.0]",Adult,"(20.315, 40.21]",Adult
4,0,3,1,35.0,0,0,8.0500,"(28.0, 38.0]",Adult,"(20.315, 40.21]",Adult
...,...,...,...,...,...,...,...,...,...,...,...
882,0,2,1,27.0,0,0,13.0000,"(20.0, 28.0]",Youth,"(20.315, 40.21]",Youth
883,1,1,0,19.0,0,0,30.0000,"(-1.0, 20.0]",Junior,"(0.34, 20.315]",Junior
884,0,3,0,7.0,1,2,23.4500,"(-1.0, 20.0]",Junior,"(0.34, 20.315]",Junior
885,1,1,1,26.0,0,0,30.0000,"(20.0, 28.0]",Youth,"(20.315, 40.21]",Youth


#### One Hot Encoding
###### One Hot Encoding splits a categorical column into multiple columns
###### If a column (C) contains 1s and 0s for example, it will split the column into 2 columns C_0 and C_1
###### Column C_0 will contain true where C was 0, and false where C was 1
###### Column C_1 will contain true where C was 1, and false where C was 0
###### Apply One Hot Encoding to Sex column

In [13]:
# Use get_dummies() function to perform one hot encoding on sex column
df_enc = pd.get_dummies(df_new, prefix = ['sex'], columns = ['sex'])
df_enc.head()

Unnamed: 0,survived,pclass,age,sibspouse,parchild,fare,age_cat4a,age_cat4b,age_cat4c,age_cat4d,sex_0,sex_1
0,0,3,22.0,1,0,7.25,"(20.0, 28.0]",Youth,"(20.315, 40.21]",Youth,False,True
1,1,1,38.0,1,0,71.2833,"(28.0, 38.0]",Adult,"(20.315, 40.21]",Adult,True,False
2,1,3,26.0,0,0,7.925,"(20.0, 28.0]",Youth,"(20.315, 40.21]",Youth,True,False
3,1,1,35.0,1,0,53.1,"(28.0, 38.0]",Adult,"(20.315, 40.21]",Adult,True,False
4,0,3,35.0,0,0,8.05,"(28.0, 38.0]",Adult,"(20.315, 40.21]",Adult,False,True
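The column splitting described above can be checked on a toy column. This is a minimal sketch using a made-up data frame with a single column `C` (a hypothetical name); it confirms that `get_dummies` produces `C_0` and `C_1` and that exactly one of them is set in every row.

```python
import pandas as pd

# Hypothetical column containing only 0s and 1s
toy = pd.DataFrame({'C': [0, 1, 1, 0]})

# Same call pattern as used for the sex column
encoded = pd.get_dummies(toy, prefix=['C'], columns=['C'])
cols = list(encoded.columns)

# Cast to int so the result reads the same whether pandas returns booleans or 0/1
c0 = encoded['C_0'].astype(int).tolist()
c1 = encoded['C_1'].astype(int).tolist()

# Exactly one indicator column is set per row
row_sums = encoded.astype(int).sum(axis=1).tolist()

print(cols)
print(c0, c1, row_sums)
```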


### Train and Validate Naive Bayes Classifier

In [14]:
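# Indicate the target column
target = df['survived']

# The feature columns are everything except the target
# Note: this uses the original df, so the binned and one-hot encoded columns created above are
#   not used here, and the 'Unnamed: 0' index column is kept as a feature
features = df.drop('survived', axis = 1)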

# Import train_test_split function
from sklearn.model_selection import train_test_split

# Split the dataset into training and test set, test size is 20%, training size is 80%
x_train, x_test, y_train, y_test = train_test_split(features, target, test_size = 0.2, random_state = 0)

#### Gaussian Naive Bayes
##### Gaussian Naive Bayes is useful when working with features containing continuous values, whose per-class distributions
##### can be modeled using a Gaussian distribution (normal distribution)
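To illustrate what modeling each feature with a per-class Gaussian means, here is a minimal sketch that computes the class scores by hand and compares the resulting labels with `GaussianNB`. The data is synthetic and `manual_predict` is a hypothetical helper written for this illustration; it is not how scikit-learn is implemented internally, and it ignores the small variance smoothing that `GaussianNB` applies.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical data: two classes, two continuous features
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
               rng.normal(3.0, 1.5, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

def manual_predict(X_train, y_train, X_new):
    """Pick the class with the highest log prior + sum of per-feature Gaussian log densities."""
    classes = np.unique(y_train)
    scores = []
    for c in classes:
        Xc = X_train[y_train == c]
        mean = Xc.mean(axis=0)          # per-feature mean for this class
        var = Xc.var(axis=0)            # per-feature variance for this class
        log_prior = np.log(len(Xc) / len(X_train))
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1)
        scores.append(log_prior + log_lik)
    return classes[np.argmax(np.column_stack(scores), axis=1)]

manual = manual_predict(X, y, X)
library = GaussianNB().fit(X, y).predict(X)

# Fraction of points where both approaches agree
agreement = float(np.mean(manual == library))
print("Agreement with GaussianNB:", agreement)
```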

In [16]:
# Import function for k-fold cross validation
from sklearn.model_selection import cross_val_score
# Import the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB

# Create a Gaussian Naive Bayes classifier with default parameters
gnb = GaussianNB()

# Use 10-fold cross validation to perform training and validation on the training set
# Parameter scoring = 'accuracy' will compute accuracy
scores = cross_val_score(gnb, x_train, y_train, cv = 10, scoring = 'accuracy')

# Display the array containing accuracy from 10 folds or iterations
scores

array([0.84507042, 0.8028169 , 0.87323944, 0.84507042, 0.71830986,
       0.74647887, 0.77464789, 0.71830986, 0.8028169 , 0.81428571])

In [17]:
# Print the mean accuracy score
print('Accuracy (Validation) =', scores.mean())

Accuracy (Validation) = 0.7941046277665996
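As a rough sketch of what the 10-fold call does, the loop below reproduces it by hand on synthetic data (generated with `make_classification`, not the Titanic data). It assumes that an integer `cv` with a classifier corresponds to unshuffled stratified folds, which is scikit-learn's documented default: each fold is held out once for validation while a fresh copy of the model is trained on the other nine.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for the training set
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = GaussianNB()

manual_scores = []
for train_idx, val_idx in StratifiedKFold(n_splits=10).split(X, y):
    # Train a fresh copy on 9 folds, then score accuracy on the held-out fold
    fold_model = clone(model).fit(X[train_idx], y[train_idx])
    manual_scores.append(fold_model.score(X[val_idx], y[val_idx]))
manual_scores = np.array(manual_scores)

auto_scores = cross_val_score(model, X, y, cv=10, scoring='accuracy')

print("Manual mean:", manual_scores.mean())
print("cross_val_score mean:", auto_scores.mean())
```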


In [19]:
# Import scikit-learn metrics module for accuracy calculation
from sklearn import metrics

# Fit the classifier on the full training set
gnb.fit(x_train, y_train)

# Predict the target for the test dataset
test_predict = gnb.predict(x_test)

# Compute the model accuracy on the test set: How often is the classifier correct?
print("Accuracy (Test): ", metrics.accuracy_score(y_test, test_predict))

Accuracy (Test):  0.7752808988764045


In [21]:
# The mean validation accuracy is about 79%, which suggests that assuming the continuous features follow a normal
# distribution works reasonably well for this dataset. The Gaussian Naive Bayes classifier reaches a test accuracy
# of about 78%, which is close to the validation accuracy, so there is no sign of overfitting.
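One common way to look for overfitting is to compare accuracy on the data the model was trained on with accuracy on held-out data. The sketch below shows that comparison on synthetic data (an assumption for illustration; the notebook itself compares validation and test accuracy), where a large positive gap would point to overfitting.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data, split 80/20 like the notebook
X, y = make_classification(n_samples=500, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

model = GaussianNB().fit(X_tr, y_tr)

# Accuracy on data the model has seen versus data it has not seen
train_acc = accuracy_score(y_tr, model.predict(X_tr))
test_acc = accuracy_score(y_te, model.predict(X_te))
gap = train_acc - test_acc

print("Train accuracy:", train_acc)
print("Test accuracy: ", test_acc)
print("Gap:", gap)
```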

#### Bernoulli Naive Bayes
##### Bernoulli Naive Bayes is suitable when features are binary (represented by 0 or 1), which are modeled using a Bernoulli distribution.
##### As our titanic dataset contains continuous values, we first transform all the features to binary values using the binarize parameter.
##### A few important points about Bernoulli Naive Bayes:
##### * Suitable for discrete data
##### * Designed for binary/boolean features
##### * If data is not binary, it is binarized internally: values above the binarize threshold become 1, all others become 0
##### * Can deal with negative numbers, because they are thresholded like any other value
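The internal binarization can be illustrated with a minimal sketch on synthetic data that includes negative values. It fits one `BernoulliNB` with `binarize=0.0` on the raw features and another with `binarize=None` on features thresholded by hand, then checks that both give the same predictions.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)

# Hypothetical continuous features, including negative values
X = rng.normal(size=(100, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

# Option 1: let the classifier threshold the features at 0.0 internally
internal = BernoulliNB(binarize=0.0).fit(X, y)

# Option 2: threshold by hand, then tell the classifier the input is already binary
X_bin = (X > 0.0).astype(int)
external = BernoulliNB(binarize=None).fit(X_bin, y)

same = bool(np.array_equal(internal.predict(X), external.predict(X_bin)))
print("Same predictions:", same)
```

Note that with a threshold of 0.0, any strictly positive feature is mapped to 1, so a column whose values are all positive carries no information after binarization.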


In [22]:
# Import the Bernoulli Naive Bayes classifier
from sklearn.naive_bayes import BernoulliNB

# Create a Bernoulli Naive Bayes classifier; binarize = 0.0 (the default) maps values above 0 to 1 and all others to 0
bnb = BernoulliNB(binarize = 0.0)

# Use 10-fold cross validation to perform training and validation on the training set
scores = cross_val_score(bnb, x_train, y_train, cv = 10, scoring = 'accuracy')

# Display the array containing accuracy from 10 folds or iterations
scores

array([0.83098592, 0.77464789, 0.88732394, 0.83098592, 0.76056338,
       0.73239437, 0.78873239, 0.71830986, 0.81690141, 0.8       ])

In [23]:
# Print the mean accuracy score
print('Accuracy =', scores.mean())

Accuracy = 0.7940845070422535


In [24]:
# Fit the classifier on the full training set
bnb.fit(x_train, y_train)

# Predict the target for the test dataset
test_predict = bnb.predict(x_test)

# Compute the model accuracy on the test set: How often is the classifier correct?
print("Accuracy (Test): ", metrics.accuracy_score(y_test, test_predict))

Accuracy (Test):  0.7528089887640449


In [25]:
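# The mean validation accuracy is again about 79%, so thresholding the features at 0 and modeling the resulting
# binary values with a Bernoulli distribution also works reasonably well here, even though every positive value
# (such as every age) is mapped to 1. The test accuracy of about 75% is somewhat lower than the validation
# accuracy but still comparable, so there is no strong sign of overfitting to the training data.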

#### Multinomial Naive Bayes
##### A multinomial distribution is useful to model feature vectors where each value represents, for example, a number of occurrences or a frequency count.
##### A few important points about Multinomial Naive Bayes:
##### * Suited for classification of data with discrete features (count data)
##### * Very useful in text processing
##### * Each text unit is converted to a vector of word counts
##### * Cannot deal with negative numbers
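The points above can be sketched with a tiny text example. The documents and labels below are made up for illustration: each document is turned into a vector of word counts with `CountVectorizer`, a `MultinomialNB` is trained on those counts, and a final check shows that negative feature values are rejected.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini corpus: 1 = spam, 0 = not spam
docs = ["cheap pills buy now", "buy cheap watches now",
        "meeting agenda for monday", "monday project meeting notes"]
labels = [1, 1, 0, 0]

# Convert each document to a vector of word counts
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

mnb = MultinomialNB().fit(counts, labels)
pred = int(mnb.predict(vectorizer.transform(["cheap watches now"]))[0])
print("Predicted label:", pred)

# Negative feature values are not valid counts, so fitting raises an error
try:
    MultinomialNB().fit(np.array([[1.0, -2.0], [0.0, 3.0]]), [0, 1])
    negative_rejected = False
except ValueError:
    negative_rejected = True
print("Negative values rejected:", negative_rejected)
```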


In [26]:
# Import the Multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

# Create a Multinomial Naive Bayes classifier with default parameters
mnb = MultinomialNB()

# Use 10-fold cross validation to perform training and validation on the training set
scores = cross_val_score(mnb, x_train, y_train, cv = 10, scoring = 'accuracy')

# Display the array containing accuracy from 10 folds or iterations
scores

array([0.73239437, 0.71830986, 0.74647887, 0.77464789, 0.57746479,
       0.64788732, 0.5915493 , 0.63380282, 0.81690141, 0.64285714])

In [None]:
# Print the mean accuracy score
print('Accuracy =', scores.mean())

In [27]:
# Fit the classifier on the full training set
mnb.fit(x_train, y_train)
# Predict the target for the test dataset
test_predict = mnb.predict(x_test)
# Compute the model accuracy on the test set: How often is the classifier correct?
print("Accuracy (Test): ", metrics.accuracy_score(y_test, test_predict))

Accuracy (Test):  0.6685393258426966


In [28]:
# We observe a significant drop in mean validation accuracy to only about 69% using Multinomial Naive Bayes.
# The features in the dataset are not represented by counts, so it makes sense that Multinomial Naive Bayes is
# not a suitable classifier for this dataset. The test accuracy of about 67% is close to the validation
# accuracy, so there is again no sign of overfitting.

# We refer to the test accuracy to determine the best performing model. In this case, the Gaussian Naive
# Bayes classifier yields the highest test accuracy (about 78%, versus 75% for Bernoulli and 67% for Multinomial).