Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?
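
As worded, the question already states the conditional probability: 40% of the employees who use the plan are smokers, so P(smoker | uses plan) = 0.40. The short check below is only a sketch (variable names are illustrative) that rebuilds the joint probability and recovers the same value from the definition P(A | B) = P(A and B) / P(B).

```python
# Figures taken from the question statement
p_plan = 0.70                # P(uses plan)
p_smoker_given_plan = 0.40   # P(smoker | uses plan), stated directly in the question

# Joint probability that an employee both uses the plan and smokes
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Recover the conditional from the definition P(A | B) = P(A and B) / P(B)
p_conditional = p_smoker_and_plan / p_plan

print(f"P(smoker and plan) = {p_smoker_and_plan:.2f}")
print(f"P(smoker | plan)   = {p_conditional:.2f}")
```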

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?
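
One way to see the difference is a toy example. Bernoulli Naive Bayes models each feature as present or absent, while Multinomial Naive Bayes models how often each feature occurs. The sketch below uses a small made-up word-count matrix (the data and variable names are assumptions for illustration). It checks whether each model's predicted probabilities change when the counts are replaced by 0/1 presence indicators.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (hypothetical data): rows are documents, columns are words
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 1, 2]])
y_toy = np.array([0, 0, 1, 1])

# The same documents reduced to presence/absence of each word
X_binary = (X_counts > 0).astype(int)

bnb_toy = BernoulliNB(binarize=0.0).fit(X_counts, y_toy)
mnb_toy = MultinomialNB().fit(X_counts, y_toy)

# BernoulliNB thresholds its input at `binarize`, so the counts themselves are ignored
bernoulli_same = np.allclose(bnb_toy.predict_proba(X_counts),
                             bnb_toy.predict_proba(X_binary))

# MultinomialNB weights each word by its count, so the counts matter
multinomial_same = np.allclose(mnb_toy.predict_proba(X_counts),
                               mnb_toy.predict_proba(X_binary))

print("Bernoulli unchanged by binarizing:  ", bernoulli_same)
print("Multinomial unchanged by binarizing:", multinomial_same)
```

BernoulliNB also penalizes the absence of a feature explicitly, whereas MultinomialNB simply contributes nothing for a zero count.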

Q3. How does Bernoulli Naive Bayes handle missing values?
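
In scikit-learn, `BernoulliNB` does not appear to handle missing values on its own: input containing NaN is rejected during validation. A common workaround, sketched below on made-up data, is to impute before fitting. The choice of `SimpleImputer(strategy="most_frequent")` is an assumption for illustration, not the only reasonable option.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with two missing entries
X_missing = np.array([[1.0, 0.0],
                      [np.nan, 1.0],
                      [0.0, 1.0],
                      [1.0, np.nan]])
y_missing = np.array([1, 0, 0, 1])

# Fitting directly on data with NaN is expected to fail input validation
try:
    BernoulliNB().fit(X_missing, y_missing)
    raised = False
except ValueError:
    raised = True
print("Direct fit raised ValueError:", raised)

# Impute each column with its most frequent value, then fit the classifier
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_missing)
preds = model.predict(X_missing)
print(preds)
```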

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

In [3]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

iris = load_iris()
X = iris.data
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')

Accuracy: 0.9777777777777777
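
The iris result above already suggests the answer to Q4 is yes, since the target has three classes. As a small additional check (variable names here are illustrative), the fitted model exposes one entry in `classes_` per class and `predict_proba` returns one probability column per class.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris_demo = load_iris()
gnb_demo = GaussianNB().fit(iris_demo.data, iris_demo.target)

# One posterior probability per class for the first five samples
proba = gnb_demo.predict_proba(iris_demo.data[:5])

print(gnb_demo.classes_)   # class labels learned during fit
print(proba.shape)         # (n_samples, n_classes)
```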


Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains numeric features extracted from email messages (mainly word and character frequencies), and the goal is to predict whether a message is spam or not based on these input features.

In [4]:
import pandas as pd

In [5]:
%pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.7-py3-none-any.whl (8.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.7
Note: you may need to restart the kernel to use updated packages.


In [2]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
spambase = fetch_ucirepo(id=94) 
  
# data (as pandas dataframes) 
X = spambase.data.features 
y = spambase.data.targets

In [5]:
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,y,test_size=0.3,random_state=42)

## Bernoulli NB Model

In [56]:
import warnings
warnings.filterwarnings('ignore')

In [10]:
from sklearn.naive_bayes import BernoulliNB

In [11]:
BNB = BernoulliNB()

In [12]:
# Y_train is a one-column DataFrame; flatten it to 1-D so scikit-learn
# does not emit a DataConversionWarning
BNB.fit(X_train, Y_train.values.ravel())


In [13]:
Y1 = BNB.predict(X_test)

In [29]:
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,recall_score,precision_score,f1_score

In [30]:
accuracy_score(Y_test,Y1)

0.8790731354091238

In [31]:
print(classification_report(Y_test,Y1))

              precision    recall  f1-score   support

           0       0.87      0.93      0.90       804
           1       0.89      0.81      0.85       577

    accuracy                           0.88      1381
   macro avg       0.88      0.87      0.87      1381
weighted avg       0.88      0.88      0.88      1381

In [32]:
confusion_matrix(Y_test,Y1)

array([[745,  59],
       [108, 469]])
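
The individual scores reported for the Bernoulli model can be read directly off this confusion matrix. The sketch below recomputes them by hand from the four cell counts, assuming scikit-learn's layout of rows as true labels and columns as predicted labels.

```python
import numpy as np

# Confusion matrix from the Bernoulli NB model (rows: true, columns: predicted)
cm = np.array([[745, 59],
               [108, 469]])

# For a binary problem, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:  {accuracy:.4f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
```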

In [33]:
recall_score(Y_test,Y1)

0.8128249566724437

In [34]:
precision_score(Y_test,Y1)

0.8882575757575758

In [35]:
f1_score(Y_test,Y1)

0.848868778280543

In [36]:
# Hyperparameter Tuning of Bernoulli NB Model

In [37]:
from sklearn.model_selection import GridSearchCV

In [38]:
parameters = {
    'alpha':[1.0],
    'force_alpha':[True],
    'binarize':[0.0],
    'class_prior':[None]
}

In [39]:
GRID = GridSearchCV(estimator=BNB,param_grid=parameters,cv=10,verbose=1,refit=True)

In [57]:
GRID.fit(X_train,Y_train)

Fitting 10 folds for each of 1 candidates, totalling 10 fits


In [58]:
GRID.best_params_

{'alpha': 1.0, 'binarize': 0.0, 'class_prior': None, 'force_alpha': True}
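
Because the grid above lists a single value per parameter, the search can only return the defaults. A grid with several candidates might look like the sketch below. The candidate values are assumptions chosen for illustration, and synthetic data from `make_classification` stands in for Spambase so the cell runs without a download; no result for the real dataset is implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (not Spambase)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Several candidate values per parameter, so the search has something to compare
bnb_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],   # additive smoothing strength
    "binarize": [0.0, 0.5, 1.0],       # threshold for mapping features to 0/1
}

bnb_search = GridSearchCV(BernoulliNB(), bnb_grid, cv=5, scoring="accuracy")
bnb_search.fit(X_demo, y_demo)

print(bnb_search.best_params_)
print(round(bnb_search.best_score_, 3))
```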

## Gaussian NB Model

In [59]:
from sklearn.naive_bayes import GaussianNB

In [60]:
GNB = GaussianNB()

In [61]:
GNB.fit(X_train,Y_train)

In [62]:
Y2 = GNB.predict(X_test)

In [78]:
# Performance metrics

In [64]:
accuracy_score(Y_test,Y2)

0.8247646632874729

In [65]:
precision_score(Y_test,Y2)

0.7206851119894598

In [66]:
recall_score(Y_test,Y2)

0.9480069324090121

In [67]:
f1_score(Y_test,Y2)

0.8188622754491017

In [84]:
# Hyperparameter Tuning of Gaussian NB Model

In [69]:
param = {
    'priors':[None],
    'var_smoothing':[1e-09]
}

In [70]:
Grid_model = GridSearchCV(estimator=GNB,param_grid=param,cv=10)

In [71]:
Grid_model.fit(X_train,Y_train)

In [72]:
Grid_model.best_params_

{'priors': None, 'var_smoothing': 1e-09}
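
For Gaussian NB the main tunable parameter is `var_smoothing`, and a single-value grid cannot improve on the default. The sketch below searches a log-spaced range instead. The range is an assumption for illustration, and the iris data is used only so the cell is self-contained.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import GaussianNB

X_iris, y_iris = load_iris(return_X_y=True)

# Eleven log-spaced candidates from 1e-11 to 1e-1
gnb_grid = {"var_smoothing": np.logspace(-11, -1, 11)}

gnb_search = GridSearchCV(GaussianNB(), gnb_grid, cv=5, scoring="accuracy")
gnb_search.fit(X_iris, y_iris)

print(gnb_search.best_params_)
print(round(gnb_search.best_score_, 3))
```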

## Multinomial NB Model

In [74]:
from sklearn.naive_bayes import MultinomialNB

In [75]:
MNB = MultinomialNB()

In [76]:
MNB.fit(X_train,Y_train)

In [77]:
Y3 = MNB.predict(X_test)

In [79]:
# Performance metrics

In [80]:
accuracy_score(Y_test,Y3)

0.782041998551774

In [81]:
precision_score(Y_test,Y3)

0.7623574144486692

In [82]:
recall_score(Y_test,Y3)

0.6949740034662045

In [83]:
f1_score(Y_test,Y3)

0.7271078875793291

In [85]:
# Hyperparameter Tuning of Multinomial NB Model

In [86]:
para = {
    'alpha':[1.0],
    'force_alpha':[True]
}

In [87]:
Gridding = GridSearchCV(estimator=MNB,param_grid=para,cv=10)

In [88]:
Gridding.fit(X_train,Y_train)

In [89]:
Gridding.best_params_

{'alpha': 1.0, 'force_alpha': True}

In [94]:
# CONCLUSION:

# Let's compare the overall accuracy of all three Naive Bayes models:
print(accuracy_score(Y_test,Y3)) # Multinomial
print(accuracy_score(Y_test,Y2)) # Gaussian
print(accuracy_score(Y_test,Y1)) # Bernoulli

0.782041998551774
0.8247646632874729
0.8790731354091238
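
Accuracy alone hides the precision/recall trade-off seen earlier, so it can help to collect all four metrics per model in one place. The helper below is a hypothetical convenience function (`summarize_metrics` is not part of scikit-learn), shown on a tiny hand-made example rather than the Spambase predictions.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarize_metrics(y_true, y_pred):
    """Return the four binary classification metrics used in this notebook."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
    }


# Tiny hand-made example: one positive sample is missed
example = summarize_metrics([0, 1, 1, 0], [0, 1, 0, 0])
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

In the notebook, the same function could be called as `summarize_metrics(Y_test, Y1)`, `summarize_metrics(Y_test, Y2)`, and `summarize_metrics(Y_test, Y3)` to line the three models up side by side.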


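The Bernoulli NB gave us an accuracy score of 87.9%

The Gaussian NB gave us an accuracy score of 82.5%

The Multinomial NB gave us an accuracy score of 78.2%

Bernoulli NB is therefore the most accurate of the three models on this train/test split. Gaussian NB has the highest recall (94.8%) but the lowest precision (72.1%).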