# Naive Bayes algorithm

Naive Bayes classifiers are a family of classification algorithms based on Bayes' theorem. They all share one principle: every feature is assumed to be conditionally independent of every other feature, given the class. Naive Bayes is one of the simplest classification algorithms and is often effective in practice; its models are quick to train and quick to predict.
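
Written out, with $y$ for the class and $x_1, \dots, x_n$ for the features (notation introduced here for illustration, not taken from the original text), Bayes' theorem and the independence assumption combine roughly as follows:

$$
P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)}
$$

$$
P(x_1, \dots, x_n \mid y) \approx \prod_{i=1}^{n} P(x_i \mid y)
\quad\Longrightarrow\quad
\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

The denominator is the same for every class, so it can be dropped when only the most probable class is needed. The variants below differ in how they model $P(x_i \mid y)$.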

The Naive Bayes algorithm is used for classification problems.
- It is widely used in text classification, where the data is high-dimensional because each word represents one feature.
- It is used in spam filtering, sentiment detection, rating classification, etc.

The main advantage of Naive Bayes is speed: training and prediction stay fast even on high-dimensional data.

## Import libraries

In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

import warnings
warnings.filterwarnings('ignore')

## Types of Naive Bayes algorithms

### Gaussian Naive Bayes

Gaussian Naive Bayes handles continuous attributes by assuming that, within each class, every feature follows a **Gaussian (normal) distribution**. The model estimates a mean and a variance for each feature and class from the training data.
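
To make the Gaussian assumption concrete, here is a minimal from-scratch sketch using only the standard library. The function names (`fit_gaussian_nb`, `predict_gaussian_nb`), the toy data, and the small variance floor are assumptions made for illustration; this is not how scikit-learn implements `GaussianNB` internally.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(rows, labels):
    """Estimate a prior plus a per-feature mean and variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    params = {}
    for label, members in grouped.items():
        n = len(members)
        columns = list(zip(*members))
        means = [sum(col) / n for col in columns]
        # The tiny constant keeps every variance strictly positive
        # (an assumption of this sketch).
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(columns, means)
        ]
        params[label] = (n / len(rows), means, variances)
    return params

def log_posterior(row, prior, means, variances):
    """Unnormalised log P(class) + sum of log N(x_i | mean_i, var_i)."""
    total = math.log(prior)
    for x, m, v in zip(row, means, variances):
        total += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
    return total

def predict_gaussian_nb(params, row):
    """Return the class with the highest unnormalised log posterior."""
    return max(params, key=lambda label: log_posterior(row, *params[label]))

# Made-up data: two continuous features, two classes.
gnb_rows = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
gnb_labels = ["a", "a", "a", "b", "b", "b"]
gnb_params = fit_gaussian_nb(gnb_rows, gnb_labels)

print(predict_gaussian_nb(gnb_params, [1.1, 2.0]))
print(predict_gaussian_nb(gnb_params, [3.1, 4.0]))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.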

#### Loading data

In [3]:
# Reading the data
iris_data = pd.read_csv('datasets/Iris.csv')

# Check the data
iris_data.head()

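|   | Id | SepalLengthCm | SepalWidthCm | PetalLengthCm | PetalWidthCm | Species |
|---|----|---------------|--------------|---------------|--------------|---------|
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |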


#### Separating independent and dependent variables

In [4]:
# Separate features and target variable
x = iris_data.drop(['Id', 'Species'], axis = 1)
y = iris_data['Species']

print(x.shape, y.shape)

(150, 4) (150,)


#### Create training and testing sets

In [5]:
# Divide into train and test sets
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.3, random_state = 56)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(105, 4) (105,)
(45, 4) (45,)


#### Build the Gaussian Naive Bayes model

In [9]:
from sklearn.naive_bayes import GaussianNB

# Creating the NB instance
model = GaussianNB()

# Train the model and make predictions
model.fit(train_x, train_y)
predictions = model.predict(test_x)

In [14]:
# Calculate accuracy
from sklearn.metrics import accuracy_score
print('Accuracy:', round(accuracy_score(test_y, predictions), 3))

Accuracy: 0.933
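
Accuracy is a single number, so it does not show which classes get confused with each other. The sketch below computes accuracy by hand and also counts (true label, predicted label) pairs. The helper names and the labels are hypothetical and are not taken from the Iris run above.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

# Hypothetical labels for illustration only.
acc_true = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]
acc_pred = ["setosa", "setosa", "versicolor", "virginica", "virginica", "virginica"]

print("Accuracy:", round(accuracy(acc_true, acc_pred), 3))
for (true_label, pred_label), count in sorted(confusion_counts(acc_true, acc_pred).items()):
    print(true_label, "->", pred_label, ":", count)
```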


### Multinomial Naive Bayes

Multinomial Naive Bayes models features as discrete counts, such as how often each word occurs in a document. For each class it estimates a multinomial distribution over the features, which makes it well suited to text data in natural language processing (NLP) tasks.
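
As a rough illustration of how word counts turn into class scores, the following standard-library sketch estimates Laplace-smoothed word probabilities per class. The function names, the `alpha` smoothing parameter as used here, and the toy documents are assumptions for illustration, not the scikit-learn implementation.

```python
import math
from collections import Counter, defaultdict

def fit_multinomial_nb(docs, labels, alpha=1.0):
    """Estimate log priors and Laplace-smoothed log word probabilities per class."""
    vocab = sorted({word for doc in docs for word in doc})
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for label, n_docs in class_counts.items():
        total = sum(word_counts[label].values())
        log_prior = math.log(n_docs / len(docs))
        # Adding alpha to every count keeps unseen words from getting probability 0.
        log_probs = {
            word: math.log((word_counts[label][word] + alpha) / (total + alpha * len(vocab)))
            for word in vocab
        }
        model[label] = (log_prior, log_probs)
    return model

def predict_multinomial_nb(model, doc):
    """Pick the class with the highest log prior + sum of word log-probabilities."""
    def score(label):
        log_prior, log_probs = model[label]
        # Words outside the training vocabulary are ignored.
        return log_prior + sum(log_probs[w] for w in doc if w in log_probs)
    return max(model, key=score)

# Made-up, pre-tokenised documents.
mnb_docs = [
    ["love", "this", "phone"],
    ["great", "phone"],
    ["hate", "this", "phone"],
    ["terrible", "service"],
]
mnb_labels = ["pos", "pos", "neg", "neg"]
mnb_model = fit_multinomial_nb(mnb_docs, mnb_labels)

print(predict_multinomial_nb(mnb_model, ["love", "great"]))
print(predict_multinomial_nb(mnb_model, ["terrible", "service"]))
```

A repeated word contributes its log-probability once per occurrence, which is what distinguishes the multinomial model from one that only tracks presence or absence.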

#### Loading data

In [15]:
# Reading the data
data = pd.read_csv('datasets/tweets.csv')

# Check the data
data.head()

|   | id | label | tweet |
|---|----|-------|-------|
| 0 | 1 | 0 | #fingerprint #Pregnancy Test https://goo.gl/h1... |
| 1 | 2 | 0 | Finally a transparant silicon case ^^ Thanks t... |
| 2 | 3 | 0 | We love this! Would you go? #talk #makememorie... |
| 3 | 4 | 0 | I'm wired I know I'm George I was made that wa... |
| 4 | 5 | 1 | What amazing service! Apple won't even talk to... |


#### Separating independent and dependent variables

In [16]:
# Separate features and target variable
x = data['tweet']
y = data['label']

print(x.shape, y.shape)

(7920,) (7920,)


#### Create training and testing sets

In [17]:
# Divide into train and test sets
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size = 0.3, random_state = 56)

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)

(5544,) (5544,)
(2376,) (2376,)


#### Create bag-of-words

In [18]:
from sklearn.feature_extraction.text import CountVectorizer

count_vector = CountVectorizer(stop_words = 'english')
print(count_vector)

CountVectorizer(stop_words='english')


In [19]:
# Fit the training data
training_data = count_vector.fit_transform(train_x)

# Transform testing data
testing_data = count_vector.transform(test_x)

print(training_data.shape, testing_data.shape)

(5544, 17709) (2376, 17709)
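
The idea behind the bag-of-words step can be sketched in a few lines of plain Python. This is a simplified stand-in, assuming lowercase whitespace tokenisation and no stop-word removal, so it will not reproduce `CountVectorizer` output exactly; the helper names and sentences are made up.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct lowercase token in the training docs to a column index."""
    tokens = sorted({tok for doc in docs for tok in doc.lower().split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def to_count_vector(doc, vocabulary):
    """Count occurrences of each vocabulary token; unseen tokens are dropped."""
    counts = Counter(doc.lower().split())
    # Dicts keep insertion order, so this follows the column indices.
    return [counts.get(tok, 0) for tok in vocabulary]

bow_train = ["good phone good price", "bad phone"]
bow_vocab = build_vocabulary(bow_train)        # learned from training data only
bow_matrix = [to_count_vector(doc, bow_vocab) for doc in bow_train]
bow_test_row = to_count_vector("good camera", bow_vocab)

print(bow_vocab)
print(bow_matrix)
print(bow_test_row)
```

As in the cell above, the vocabulary is learned from the training text only, and the test text is mapped onto those same columns. That is why both matrices end up with the same number of columns.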


#### Build the Multinomial Naive Bayes model

In [22]:
from sklearn.naive_bayes import MultinomialNB

# Creating the NB instance
model = MultinomialNB()

# Train the model and make predictions
model.fit(training_data, train_y)
predictions = model.predict(testing_data)

In [23]:
# Calculate accuracy
from sklearn.metrics import accuracy_score
print('Accuracy:', round(accuracy_score(test_y, predictions), 3))

Accuracy: 0.891


### Bernoulli Naive Bayes

Bernoulli Naive Bayes is a variant of the Naive Bayes algorithm for binary features, such as 'Yes' or 'No', '1' or '0', 'True' or 'False'. As in the other variants, the features are assumed to be conditionally independent given the class. It is commonly used for spam detection, text classification, and sentiment analysis, where each feature records whether a certain word is present in a document.
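
A small standard-library sketch may help show what sets the Bernoulli variant apart: a feature that is absent also contributes to the score, through the factor 1 - p. The function names, the smoothing choice, and the toy rows are assumptions for illustration, not the scikit-learn implementation.

```python
import math

def fit_bernoulli_nb(rows, labels, alpha=1.0):
    """Estimate P(class) and smoothed P(feature = 1 | class) from 0/1 rows."""
    model = {}
    for label in sorted(set(labels)):
        members = [row for row, lab in zip(rows, labels) if lab == label]
        n = len(members)
        # Smoothing over the two outcomes (0 and 1) keeps p strictly between 0 and 1.
        probs = [(sum(col) + alpha) / (n + 2 * alpha) for col in zip(*members)]
        model[label] = (n / len(rows), probs)
    return model

def predict_bernoulli_nb(model, row):
    """Pick the class with the highest log posterior over binary features."""
    def score(label):
        prior, probs = model[label]
        total = math.log(prior)
        for x, p in zip(row, probs):
            # Present features contribute log p; absent ones contribute log(1 - p).
            total += math.log(p) if x else math.log(1 - p)
        return total
    return max(model, key=score)

# Hypothetical binary columns: contains "free", contains "meeting".
bnb_rows = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1]]
bnb_labels = ["spam", "spam", "ham", "ham", "ham"]
bnb_model = fit_bernoulli_nb(bnb_rows, bnb_labels)

print(predict_bernoulli_nb(bnb_model, [1, 0]))
print(predict_bernoulli_nb(bnb_model, [0, 1]))
```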

#### Build the Bernoulli Naive Bayes model

In [24]:
from sklearn.naive_bayes import BernoulliNB

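# Creating the NB instance; with the default binarize=0.0, BernoulliNB
# converts the word counts into presence/absence features
model = BernoulliNB()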

# Train the model and make predictions
model.fit(training_data, train_y)
predictions = model.predict(testing_data)

In [25]:
# Calculate accuracy
from sklearn.metrics import accuracy_score
print('Accuracy:', round(accuracy_score(test_y, predictions), 3))

Accuracy: 0.88
