Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Q3. How does Bernoulli Naive Bayes handle missing values?

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to
predict whether a message is spam or not based on several input features.
Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.
Results:
Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?
Conclusion:
Summarise your findings and provide some suggestions for future work.


Note: This dataset contains a binary classification problem with multiple features. The dataset is
relatively small, but it can be used to demonstrate the performance of the different variants of Naive
Bayes on a real-world problem.

Answer 1...

The question asks for P(S|H), where S is the event that an employee is a smoker and H is the event that an employee uses the health insurance plan.

We are given that 70% of the employees use the plan, so P(H) = 0.7. We are also told that 40% of the employees who use the plan are smokers. That statement is already conditioned on plan use, so it is P(S|H) itself:

P(S|H) = 0.4

Bayes' theorem, P(S|H) = P(H|S) * P(S) / P(H), is not needed here, and it could not be applied anyway because neither P(S) nor P(H|S) is given. The definition of conditional probability confirms the result. The fraction of all employees who both use the plan and smoke is:

P(S ∩ H) = P(S|H) * P(H) = 0.4 * 0.7 = 0.28

Dividing by P(H) recovers the conditional probability:

P(S|H) = P(S ∩ H) / P(H) = 0.28 / 0.7 = 0.4

So the probability that an employee who uses the health insurance plan is a smoker is 0.4, or 40%.
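
As a quick numeric check, the 40% figure can be read as the conditional probability P(S|H), since it is stated for employees who use the plan, and the definition of conditional probability can then be evaluated directly. The variable names below are illustrative only.

```python
# Quantities stated in the question.
p_h = 0.7          # P(H): employee uses the health insurance plan
p_s_given_h = 0.4  # P(S|H): employee is a smoker, among plan users

# Joint probability of using the plan and being a smoker.
p_s_and_h = p_s_given_h * p_h

# Recover the conditional probability from the joint probability and P(H).
recovered = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S|H)     = {recovered:.2f}")
```

The 70% figure scales the joint probability but cancels out again in the conditional one.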

Answer 2...

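The main difference between Bernoulli Naive Bayes and Multinomial Naive Bayes is the distribution each one assumes for the features. Bernoulli Naive Bayes treats every feature as binary (e.g., a word is either present or absent in a document), so only presence matters and the absence of a feature counts explicitly as evidence. Multinomial Naive Bayes treats the features as counts (e.g., the frequency of a word in a document), so how often a feature occurs changes the class likelihood. In scikit-learn, BernoulliNB binarizes its input at a threshold (binarize=0.0 by default), while MultinomialNB expects non-negative values.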
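
The difference can be seen on a toy example. The sketch below, which assumes scikit-learn and a made-up word-count matrix, fits each model once on raw counts and once on the same data reduced to presence/absence. With its default binarize=0.0, BernoulliNB should learn the same feature probabilities either way, while MultinomialNB should not.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix (rows: documents, columns: words); the values are made up.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 3],
])
y = np.array([0, 0, 1, 1])

# The same data reduced to presence (1) / absence (0).
X_binary = (X_counts > 0).astype(int)

# BernoulliNB thresholds its input internally, so counts and presence flags
# lead to the same fitted feature probabilities.
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)
same_bernoulli = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)

# MultinomialNB uses the magnitudes, so discarding the counts changes the model.
mnb_counts = MultinomialNB().fit(X_counts, y)
mnb_binary = MultinomialNB().fit(X_binary, y)
same_multinomial = np.allclose(mnb_counts.feature_log_prob_, mnb_binary.feature_log_prob_)

print("BernoulliNB unchanged by binarizing:  ", same_bernoulli)
print("MultinomialNB unchanged by binarizing:", same_multinomial)
```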

Answer 3...

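Bernoulli Naive Bayes has no built-in mechanism for missing values. scikit-learn's BernoulliNB, for example, rejects input that contains NaN, so missing entries must be handled before fitting. A common convention for binary presence/absence features is to treat a missing value as the feature being absent (0): if a document has a missing feature, the feature is assumed not to be present in the document. Alternatives are imputing the most frequent value or dropping the affected rows.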
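
One way to apply the "missing means absent" convention is to impute zeros before the classifier sees the data. The following is a minimal sketch, assuming scikit-learn's SimpleImputer and a made-up feature matrix; it is one possible preprocessing choice, not the only valid one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up binary features with some missing entries (np.nan).
X = np.array([
    [1.0, 0.0, np.nan],
    [1.0, np.nan, 0.0],
    [0.0, 1.0, 1.0],
    [np.nan, 1.0, 1.0],
])
y = np.array([0, 0, 1, 1])

# Replace every missing value with 0 ("feature absent"), then fit BernoulliNB.
model = make_pipeline(
    SimpleImputer(strategy="constant", fill_value=0),
    BernoulliNB(),
)
model.fit(X, y)
preds = model.predict(X)
print(preds)
```

Putting the imputer inside the pipeline means the same rule is applied at prediction time and within each cross-validation fold.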

Answer 4...

Yes. Gaussian Naive Bayes handles multi-class classification natively, so there is no need to train a separate binary classifier per class. For each class, the model estimates a prior probability and a mean and variance for every feature. For example, if there are three classes, A, B, and C, it learns three sets of Gaussian parameters. To make a prediction for a new instance, it computes the posterior probability of each class and selects the class with the highest one.
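
A short sketch of Gaussian Naive Bayes on a three-class problem is shown below. The Iris dataset bundled with scikit-learn is used purely as an illustrative assumption; it is not part of the assignment.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four numeric features.
X, y = load_iris(return_X_y=True)

# A single GaussianNB model covers all three classes.
gnb = GaussianNB().fit(X, y)

# One row of per-feature means is learned for each class.
print("classes:", gnb.classes_)
print("shape of per-class means:", gnb.theta_.shape)

# predict_proba returns one posterior probability per class; each row sums to 1.
proba = gnb.predict_proba(X[:5])
print(np.round(proba, 3))
```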





Answer 5...

For this assignment, we will be using the Spambase dataset from the UCI Machine Learning Repository to implement and compare the performance of three variants of Naive Bayes classifiers.

First, we need to download and prepare the dataset:

In [1]:
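import pandas as pd
from sklearn.model_selection import train_test_split

# Load the raw Spambase file; it has no header row.
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
data = pd.read_csv(url, header=None)

# The last column is the label (1 = spam, 0 = not spam); the rest are features.
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

# Hold out 20% of the rows for testing, with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)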


We split the dataset into train and test sets, with 20% of the data used for testing.

Next, we will implement the three variants of Naive Bayes classifiers with their default hyperparameters and evaluate their accuracy using 10-fold cross-validation on the training split:

In [2]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score

# Bernoulli Naive Bayes
bnb = BernoulliNB()
bnb_scores = cross_val_score(bnb, X_train, y_train, cv=10)

# Multinomial Naive Bayes
mnb = MultinomialNB()
mnb_scores = cross_val_score(mnb, X_train, y_train, cv=10)

# Gaussian Naive Bayes
gnb = GaussianNB()
gnb_scores = cross_val_score(gnb, X_train, y_train, cv=10)

# Print the mean accuracy and a +/- two-standard-deviation spread for each classifier
print("Bernoulli Naive Bayes Accuracy: %0.2f (+/- %0.2f)" % (bnb_scores.mean(), bnb_scores.std() * 2))
print("Multinomial Naive Bayes Accuracy: %0.2f (+/- %0.2f)" % (mnb_scores.mean(), mnb_scores.std() * 2))
print("Gaussian Naive Bayes Accuracy: %0.2f (+/- %0.2f)" % (gnb_scores.mean(), gnb_scores.std() * 2))


Bernoulli Naive Bayes Accuracy: 0.89 (+/- 0.04)
Multinomial Naive Bayes Accuracy: 0.79 (+/- 0.03)
Gaussian Naive Bayes Accuracy: 0.82 (+/- 0.03)


The above code prints the mean accuracy of each classifier across the 10 folds, together with a spread of plus or minus two standard deviations. We can obtain the remaining metrics (precision, recall, and F1 score) on the held-out test set using scikit-learn's classification_report function:

In [3]:
from sklearn.metrics import classification_report

# Bernoulli Naive Bayes
bnb.fit(X_train, y_train)
bnb_preds = bnb.predict(X_test)
print("Bernoulli Naive Bayes Performance:")
print(classification_report(y_test, bnb_preds))

# Multinomial Naive Bayes
mnb.fit(X_train, y_train)
mnb_preds = mnb.predict(X_test)
print("Multinomial Naive Bayes Performance:")
print(classification_report(y_test, mnb_preds))

# Gaussian Naive Bayes
gnb.fit(X_train, y_train)
gnb_preds = gnb.predict(X_test)
print("Gaussian Naive Bayes Performance:")
print(classification_report(y_test, gnb_preds))


Bernoulli Naive Bayes Performance:
              precision    recall  f1-score   support

           0       0.86      0.94      0.90       531
           1       0.91      0.80      0.85       390

    accuracy                           0.88       921
   macro avg       0.89      0.87      0.88       921
weighted avg       0.88      0.88      0.88       921

Multinomial Naive Bayes Performance:
              precision    recall  f1-score   support

           0       0.80      0.84      0.82       531
           1       0.76      0.72      0.74       390

    accuracy                           0.79       921
   macro avg       0.78      0.78      0.78       921
weighted avg       0.79      0.79      0.79       921

Gaussian Naive Bayes Performance:
              precision    recall  f1-score   support

           0       0.95      0.73      0.82       531
           1       0.72      0.95      0.82       390

    accuracy                           0.82       921
   macro avg       0.8

The above code prints the precision, recall, f1-score, and support for each class for each classifier.
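
The code above takes accuracy from cross-validation and the other metrics from a single test split. As an alternative, all four metrics could be computed under 10-fold cross-validation in one pass with scikit-learn's cross_validate. The sketch below is one way to do that; `evaluate_nb_variants` is a hypothetical helper name, and the demo data is synthetic (non-negative counts generated here so that the block runs without a download), so its numbers are not comparable to the results reported above.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB


def evaluate_nb_variants(X, y, cv=10):
    """Return the mean cross-validated metrics for each Naive Bayes variant."""
    models = {
        "Bernoulli": BernoulliNB(),
        "Multinomial": MultinomialNB(),
        "Gaussian": GaussianNB(),
    }
    scoring = ["accuracy", "precision", "recall", "f1"]
    results = {}
    for name, model in models.items():
        # cross_validate returns one array per metric under the key "test_<metric>".
        scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
        results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    return results


# Synthetic stand-in data (an assumption for this demo): Poisson counts whose
# rate depends on the class, which keeps every feature non-negative as
# MultinomialNB requires.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
rates = np.repeat(1.0 + 2.0 * y_demo[:, None], 5, axis=1)
X_demo = rng.poisson(rates)

demo_results = evaluate_nb_variants(X_demo, y_demo)
for name, metrics in demo_results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

With the Spambase data loaded as in the first cell, the call would be `evaluate_nb_variants(X, y)`, which evaluates on the whole dataset instead of on the training split.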