# **Naïve Bayes Classification Implementation Using Scikit-Learn**

*By Carlos Santiago Bañón*

* **Year:** 2020
* **Technologies:** Python, Scikit-Learn, Pandas, NumPy
* **Areas:** Machine Learning, Classification, Bayesian Learning
* **Keywords:** `bayesian-learning`, `classification`, `machine-learning`, `naïve-bayes`, `naïve-bayes-classification`
* **Description:** This notebook presents an implementation of naïve Bayes classification using the Scikit-Learn library. The data used is a preprocessed version of the Kaggle Titanic dataset hosted in the GitHub repository for this notebook.

## 1. Import Statements
---

In [1]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

## 2. Load the Data
---

First, we load the preprocessed Kaggle Titanic dataset hosted in the GitHub repository for this notebook: the training set, the test set, and the test labels.

In [2]:
# Import the data into Pandas DataFrames.
train_df = pd.read_csv('https://bit.ly/39AQRJj')
test_df = pd.read_csv('https://bit.ly/3aoJzHG')
y_test_df = pd.read_csv('https://bit.ly/2YxfKzi')

In [3]:
# Show the training set.
train_df

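| Unnamed: 0 | Age | Survived | Pclass | SibSp | Fare | Gender |
|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 3 | 1 | 0 | 0 |
| 1 | 3 | 1 | 1 | 1 | 3 | 1 |
| 2 | 2 | 1 | 3 | 0 | 1 | 1 |
| 3 | 3 | 1 | 1 | 1 | 3 | 1 |
| 4 | 3 | 0 | 3 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 886 | 2 | 0 | 2 | 0 | 1 | 0 |
| 887 | 2 | 1 | 1 | 0 | 2 | 1 |
| 888 | 2 | 0 | 3 | 1 | 2 | 1 |
| 889 | 2 | 1 | 1 | 0 | 2 | 0 |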


In [4]:
# Show the test set.
test_df

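| Unnamed: 0 | Age | Pclass | SibSp | Fare | Gender |
|---|---|---|---|---|---|
| 0 | 3 | 3 | 0 | 0 | 0 |
| 1 | 4 | 3 | 1 | 0 | 1 |
| 2 | 5 | 2 | 0 | 1 | 0 |
| 3 | 2 | 3 | 0 | 1 | 0 |
| 4 | 2 | 3 | 1 | 1 | 1 |
| ... | ... | ... | ... | ... | ... |
| 413 | 3 | 3 | 0 | 1 | 0 |
| 414 | 3 | 1 | 0 | 3 | 1 |
| 415 | 3 | 3 | 0 | 0 | 0 |
| 416 | 3 | 3 | 0 | 1 | 0 |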


In [5]:
# Set up the learning matrices.
X_train = train_df.drop('Survived', axis=1).to_numpy()
y_train = train_df[['Survived']].to_numpy()
X_test = test_df.to_numpy()
y_test = y_test_df.drop('PassengerId', axis=1).to_numpy()
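
The feature matrices built above keep the `Unnamed: 0` column, which appears to be a row index saved along with the CSV rather than a passenger attribute. The following is a minimal sketch, on a hypothetical three-row CSV that only mimics the layout of the training file, of how `index_col=0` in `pd.read_csv` could keep such a column out of the features. The names `csv_text`, `with_index`, and `without_index` are illustrative assumptions, and the metrics reported in this notebook were produced with the column included.

```python
import io

import pandas as pd

# Hypothetical three-row CSV mimicking the layout of the training file,
# including the unnamed leading column that a saved row index produces.
csv_text = """,Age,Survived,Pclass,SibSp,Fare,Gender
0,2,0,3,1,0,0
1,3,1,1,1,3,1
2,2,1,3,0,1,1
"""

# Default read: the leading column becomes a regular column named 'Unnamed: 0'.
with_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0 turns that column into the DataFrame index instead.
without_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Build the feature matrices both ways to compare their shapes.
X_with = with_index.drop('Survived', axis=1).to_numpy()
X_without = without_index.drop('Survived', axis=1).to_numpy()
y_demo = without_index['Survived'].to_numpy()

print("Columns (default read):", list(with_index.columns))
print("Feature matrix shapes:", X_with.shape, X_without.shape)
```

If this approach were adopted, the test set would need the same treatment so that both matrices keep matching columns.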

## 3. Naïve Bayes Classification
---

### 3.1. Define and Fit the Model

In [6]:
# Define the Gaussian naïve Bayes classifier.
gnb = GaussianNB()
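
To make the classifier less of a black box, here is a rough sketch of the decision rule that Gaussian naïve Bayes applies: for each class, add the log prior to the sum of per-feature Gaussian log-densities, then pick the class with the largest total. The helper `gaussian_nb_predict` and the toy data are hypothetical and written only for illustration. The variance smoothing term mirrors the `var_smoothing` parameter of `GaussianNB`, so the two sets of predictions should agree on data like this.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data: two features, two well-separated classes.
rng = np.random.default_rng(0)
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                   rng.normal(3.0, 1.0, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)


def gaussian_nb_predict(X_fit, y_fit, X_new, var_smoothing=1e-9):
    """Predict labels with a hand-written Gaussian naive Bayes rule."""
    classes = np.unique(y_fit)
    # A small fraction of the largest feature variance is added to every
    # variance for numerical stability.
    eps = var_smoothing * X_fit.var(axis=0).max()
    log_joint = []
    for c in classes:
        X_c = X_fit[y_fit == c]
        log_prior = np.log(X_c.shape[0] / X_fit.shape[0])
        mean = X_c.mean(axis=0)
        var = X_c.var(axis=0) + eps
        # Features are treated as independent given the class, so the
        # per-feature Gaussian log-densities are summed.
        log_likelihood = -0.5 * np.sum(
            np.log(2.0 * np.pi * var) + (X_new - mean) ** 2 / var, axis=1
        )
        log_joint.append(log_prior + log_likelihood)
    # Choose the class with the highest log joint probability per sample.
    return classes[np.argmax(np.column_stack(log_joint), axis=1)]


manual_pred = gaussian_nb_predict(X_toy, y_toy, X_toy)
sklearn_pred = GaussianNB().fit(X_toy, y_toy).predict(X_toy)
print("Agreement with GaussianNB:", np.mean(manual_pred == sklearn_pred))
```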

In [7]:
# Fit the Gaussian naïve Bayes classifier and predict on the test set.
gnb.fit(X_train, y_train.ravel())
y_pred_gnb = gnb.predict(X_test)

### 3.2. Get the Cross-Validation Metrics

In [8]:
# Calculate the mean evaluation metrics using 5-fold cross-validation.
# The result names end in "_cv" so they do not shadow the imported metric functions.
f1_cv = cross_val_score(gnb, X_train, y_train.ravel(), cv=5, scoring='f1').mean()
accuracy_cv = cross_val_score(gnb, X_train, y_train.ravel(), cv=5, scoring='accuracy').mean()
precision_cv = cross_val_score(gnb, X_train, y_train.ravel(), cv=5, scoring='precision').mean()
recall_cv = cross_val_score(gnb, X_train, y_train.ravel(), cv=5, scoring='recall').mean()

In [9]:
# Show the evaluation metrics.
print("F1 Score:", f1_cv)
print("Accuracy Score:", accuracy_cv)
print("Precision Score:", precision_cv)
print("Recall Score:", recall_cv)

F1 Score: 0.7282419532419533
Accuracy Score: 0.7710752620676669
Precision Score: 0.6705436720142602
Recall Score: 0.7982949701619779
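
The four metrics above each come from a separate cross-validation run. As a possible alternative, `cross_validate` from `sklearn.model_selection` accepts a list of scorers and evaluates them all on the same folds in one pass. The sketch below uses hypothetical synthetic data (`X_demo`, `y_demo`) in place of `X_train` and `y_train`, so its numbers are not comparable to the results above.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Hypothetical synthetic binary data standing in for X_train and y_train.
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(60, 4)),
                    rng.normal(1.5, 1.0, size=(60, 4))])
y_demo = np.array([0] * 60 + [1] * 60)

# Score all four metrics on the same five folds.
metrics = ['f1', 'accuracy', 'precision', 'recall']
cv_results = cross_validate(GaussianNB(), X_demo, y_demo, cv=5, scoring=metrics)

# Each metric is stored under the key 'test_<name>' with one value per fold.
mean_scores = {name: cv_results[f'test_{name}'].mean() for name in metrics}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

For a classifier and an integer `cv`, both `cross_val_score` and `cross_validate` use stratified folds, so each fold keeps roughly the same class balance as the full training set.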


### 3.3. Show the Prediction Results

In [10]:
# Show the prediction results.
print(y_pred_gnb)

[0 1 0 0 1 0 1 0 1 0 0 1 1 0 1 1 0 0 1 1 1 0 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1
 1 0 0 0 1 0 1 1 0 1 0 1 1 1 0 1 1 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 1 1 0 1 1
 1 1 0 1 0 1 0 1 1 0 0 0 1 1 1 0 1 0 1 0 1 0 1 0 1 0 1 0 0 0 1 0 0 0 0 0 0
 1 1 1 1 0 0 1 1 1 1 0 1 0 0 1 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 1 0
 1 0 1 0 0 1 0 0 1 1 1 1 1 0 1 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1
 0 1 0 0 0 0 1 0 0 0 0 1 1 0 1 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0
 1 0 1 1 0 1 0 0 0 1 0 0 1 0 1 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 0 1
 0 0 0 1 1 0 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 1 0 1 1 1 0 0 1 0 0 1 1 0 1 0 0
 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1 1 0 1 1 0 0 1 1 0
 1 0 1 0 0 0 0 0 0 0 1 0 1 0 1 0 1 1 0 0 0 1 1 1 0 0 1 0 1 1 0 1 1 0 1 1 0
 0 1 0 0 1 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 1 0 1 0 0 1 0 1 1 1 0 0
 1 1 1 1 1 1 0 1 0 0 0]
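
The notebook loads `y_test` but does not compare it with `y_pred_gnb`. A minimal sketch of how such a comparison could look with the metric functions from `sklearn.metrics` follows. The two short arrays are hypothetical stand-ins for `y_test.ravel()` and `y_pred_gnb`, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical stand-ins for y_test.ravel() and y_pred_gnb.
y_true_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1])
y_pred_demo = np.array([0, 1, 0, 1, 1, 0, 0, 1])

# Each function takes the true labels first and the predictions second.
test_metrics = {
    'F1 Score': f1_score(y_true_demo, y_pred_demo),
    'Accuracy Score': accuracy_score(y_true_demo, y_pred_demo),
    'Precision Score': precision_score(y_true_demo, y_pred_demo),
    'Recall Score': recall_score(y_true_demo, y_pred_demo),
}
for name, value in test_metrics.items():
    print(f"{name}: {value}")
```

These stand-in arrays contain three true positives, three true negatives, one false positive, and one false negative, so all four metrics come out to 0.75.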
