### Bernoulli Naive Bayes:
- Bernoulli Naive Bayes is a variant of the Naive Bayes algorithm.
- It is typically used when the features are binary, and it models the occurrence of each feature with a Bernoulli distribution.
- It is used for classification with binary features, such as 'Yes' or 'No', 'True' or 'False', '1' or '0'.
- Note that the features are assumed to be conditionally independent of one another given the class.
### Mathematics Behind Bernoulli Naive Bayes:
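- In the Bernoulli Naive Bayes model, we assume that each feature is conditionally independent of the others given the class $y$.
- $p(x_i \mid y) = p(i \mid y)\, x_i + (1 - p(i \mid y))(1 - x_i)$
- Here, $p(x_i \mid y)$ is the conditional probability of observing the value $x_i$ for feature $i$, given that the class is $y$.
- $p(i \mid y)$ is the probability that feature $i$ is present (equal to 1) in a sample of class $y$.
- $i$ indexes the feature (the event whose occurrence is being modelled).
- $x_i$ holds a binary value, either 0 or 1.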
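
As a quick illustration of the formula above, the following sketch evaluates the per-feature likelihood and multiplies the terms together under the independence assumption. The function name `bernoulli_likelihood` and the probabilities in `p_given_y` are made up for the example; they are not taken from the dataset used later.

```python
import math


def bernoulli_likelihood(x, p_given_y):
    """Product over features of p_i * x_i + (1 - p_i) * (1 - x_i)."""
    terms = [p * xi + (1.0 - p) * (1.0 - xi) for xi, p in zip(x, p_given_y)]
    return math.prod(terms)


# Hypothetical values of p(i | y) for three features in one class
p_given_y = [0.8, 0.1, 0.5]

# A sample in which features 0 and 2 are present and feature 1 is absent
x = [1, 0, 1]

likelihood = bernoulli_likelihood(x, p_given_y)
print(likelihood)  # 0.8 * (1 - 0.1) * 0.5, up to floating-point rounding
```

Note that an absent feature still contributes a factor of (1 - p), which is what distinguishes the Bernoulli model from one that only looks at the features that occur.
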
### Bernoulli Distribution:
- The Bernoulli distribution is a discrete probability distribution over a single trial with two outcomes: success or failure.
- The random variable takes the value 1 with probability $p$ and the value 0 with probability $1 - p$.
- $f(x) = \begin{cases} p^x (1 - p)^{1 - x} & \text{if } x \in \{0, 1\} \\ 0 & \text{otherwise} \end{cases}$
  - When $x = 0$, the value of $f(x)$ is $1 - p$.
  - When $x = 1$, the value of $f(x)$ is $p$.
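
The probability mass function above can be checked with a few lines of code. This is only a minimal sketch; the function name `bernoulli_pmf` and the value `p = 0.3` are chosen for illustration.

```python
def bernoulli_pmf(x, p):
    """Bernoulli probability mass function: p^x * (1 - p)^(1 - x) for x in {0, 1}."""
    if x in (0, 1):
        return p ** x * (1 - p) ** (1 - x)
    return 0.0  # any other value has zero probability


p = 0.3  # hypothetical probability of success
success = bernoulli_pmf(1, p)   # equals p
failure = bernoulli_pmf(0, p)   # equals 1 - p
invalid = bernoulli_pmf(2, p)   # outside {0, 1}
print(success, failure, invalid)
```

The two valid outcomes always sum to 1, which is a quick sanity check for any choice of `p` between 0 and 1.
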

### Implementation of Bernoulli Naive Bayes:
#### Importing Libraries:


In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report

#### Data Analysis:
- Here we take a quick look at the data: reading the CSV file, printing its shape and column names, and dropping the unneeded `Unnamed: 0` index column.

In [12]:
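# Load the spam/ham email dataset
df = pd.read_csv("spam_ham_dataset.csv")
print(df.shape)
print(df.columns)
# Drop the leftover index column, which carries no information for the model
df = df.drop(['Unnamed: 0'], axis=1)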

(5171, 4)
Index(['Unnamed: 0', 'label', 'text', 'label_num'], dtype='object')


#### Count Vectorizer:
- Since the classifier is trained on text, we first convert the text into a numeric matrix of word counts using Count Vectorizer, because the model cannot work on raw strings.

In [13]:
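# The raw email text is the input and label_num is the target
x = df["text"].values
y = df["label_num"].values

# Turn each email into a sparse vector of word counts
cv = CountVectorizer()
x = cv.fit_transform(x)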
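
To get a feel for what the vectorization step produces, here is a simplified pure-Python stand-in for a count matrix. It is only a sketch: it splits on whitespace, whereas `CountVectorizer` applies its own tokenization rules, and the helper name `count_matrix` and the two sample documents are invented for illustration. The last step also shows the presence/absence view that a Bernoulli model works with.

```python
from collections import Counter


def count_matrix(docs):
    """Build a sorted vocabulary and one row of word counts per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts[word] for word in vocab])
    return vocab, rows


# Two made-up documents, not taken from the dataset
docs = ["free prize free", "meeting at noon"]
vocab, counts = count_matrix(docs)

# Presence/absence view: any count greater than 0 becomes 1
binary = [[1 if c > 0 else 0 for c in row] for row in counts]

print(vocab)
print(counts)
print(binary)
```

Each column corresponds to one vocabulary word, so a row records how often (or, after thresholding, whether) each word appears in a document.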

#### Data Splitting, Model Training and Prediction:

In [14]:
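# Hold out 20% of the emails for testing
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=0)

# binarize=0.0 maps every count greater than 0 to 1, giving binary features
bnb = BernoulliNB(binarize=0.0)
model = bnb.fit(X_train, y_train)
y_pred = bnb.predict(X_test)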

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.84      0.98      0.91       732
           1       0.92      0.56      0.70       303

    accuracy                           0.86      1035
   macro avg       0.88      0.77      0.80      1035
weighted avg       0.87      0.86      0.84      1035
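
To tie the fitted classifier back to the formula in the mathematics section, the sketch below trains and applies a tiny Bernoulli Naive Bayes model from scratch. It assumes Laplace smoothing with `alpha = 1` and uses invented toy data; the function names are hypothetical, and it is meant as an illustration of the idea rather than a description of how `BernoulliNB` is implemented internally.

```python
import math


def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate class priors and smoothed per-feature probabilities p(i | y)."""
    n_features = len(X[0])
    priors, probs = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        priors[c] = len(rows) / len(X)
        # Laplace smoothing keeps every probability strictly between 0 and 1
        probs[c] = [
            (sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for i in range(n_features)
        ]
    return priors, probs


def predict_one(x, priors, probs):
    """Return the class with the highest log prior plus log likelihood."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for xi, p in zip(x, probs[c]):
            # p if the feature is present, (1 - p) if it is absent
            score += math.log(p if xi else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)


# Toy binary data: four samples, three features, two classes (all made up)
X = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
y = [1, 1, 0, 0]

priors, probs = fit_bernoulli_nb(X, y)
print(predict_one([1, 0, 0], priors, probs))
print(predict_one([0, 1, 1], priors, probs))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters for text data with thousands of vocabulary features.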

