# Naive Bayes-2

### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?
The question states the answer directly: 40% of the employees who use the plan are smokers, so P(smoker | uses plan) = 0.40. This is a conditional probability: the probability of one event (being a smoker) given that another event has occurred (using the health insurance plan). The 70% figure is P(uses plan); it is not needed for this answer, but it gives the joint probability P(smoker and uses plan) = 0.40 × 0.70 = 0.28. Bayes' theorem would be needed only for the reverse quantity, P(uses plan | smoker), which also requires the overall proportion of smokers.
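
As a quick check of the arithmetic, the sketch below works with exact fractions. The conditional probability is the 40% figure given in the question; the variable names are illustrative only.

```python
from fractions import Fraction

# Figures stated in the question.
p_plan = Fraction(70, 100)               # P(uses plan)
p_smoker_given_plan = Fraction(40, 100)  # P(smoker | uses plan), given directly

# Product rule: P(smoker and uses plan).
p_smoker_and_plan = p_smoker_given_plan * p_plan

# Recover the conditional from its definition as a consistency check.
recovered = p_smoker_and_plan / p_plan

print(float(p_smoker_and_plan))  # 0.28
print(float(recovered))          # 0.4
```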

### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?
Both are variants of Naive Bayes, a probabilistic classification algorithm, and they differ in how they model the likelihood of the features given the class label. Bernoulli Naive Bayes is used for binary data, where each feature is 0 or 1 (for example, whether a word appears in a document). It models each feature as a Bernoulli variable, so the absence of a feature also contributes to the likelihood. Multinomial Naive Bayes is used for count data, where each feature is the number of times an item occurs (for example, word counts). It models the counts with a multinomial distribution, so features that do not occur contribute nothing to the likelihood.
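
The two likelihoods can be sketched in plain Python. This is a minimal illustration, not scikit-learn's implementation: the function names, the toy counts, and the probability values are all assumptions, and smoothing is left out.

```python
import math

def bernoulli_log_likelihood(present, p):
    """Log P(x | class) when each feature is a 0/1 presence flag.

    p[i] is P(feature i is present | class). An absent feature contributes
    log(1 - p[i]), so absence counts as evidence too.
    """
    return sum(math.log(pi if xi else 1.0 - pi) for xi, pi in zip(present, p))

def multinomial_log_likelihood(counts, theta):
    """Log P(x | class) for count features, up to a class-independent constant.

    theta[i] is the probability that a single draw is feature i (theta sums to 1).
    A feature with a zero count contributes nothing.
    """
    return sum(c * math.log(t) for c, t in zip(counts, theta))

counts = [3, 0, 1]                             # toy word counts for one document
present = [1 if c > 0 else 0 for c in counts]  # the same document as presence flags

b = bernoulli_log_likelihood(present, p=[0.5, 0.5, 0.5])
m = multinomial_log_likelihood(counts, theta=[0.5, 0.25, 0.25])
print(b, m)
```

Here the Bernoulli score uses one factor per feature, including the absent one, while the multinomial score uses one factor per occurrence.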

### Q3. How does Bernoulli Naive Bayes handle missing values?
Bernoulli Naive Bayes has no built-in mechanism for missing values; scikit-learn's `BernoulliNB`, for instance, raises an error on NaN input. In practice a missing value is usually imputed before training, most often as 0, i.e. the absence of the feature. Note that a genuinely absent feature is not a missing value: if an email message does not contain the word "discount", the feature for "discount" is a known 0. Imputing unknown values as 0 can bias the estimates if the values are not missing at random, so the handling of missing values needs careful consideration.
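
To make the trade-off concrete, the sketch below compares two ways of scoring a sample with an unknown feature: imputing it as 0, or leaving its factor out of the product (which, under the independence assumption, is the same as marginalising it out). The second option is an alternative introduced here for illustration; the function name and probabilities are hypothetical.

```python
import math

def bernoulli_log_likelihood(x, p, skip_missing=True):
    """Bernoulli log-likelihood where x[i] is 1, 0, or None (unknown).

    skip_missing=True drops the factor of an unknown feature.
    skip_missing=False imputes the unknown feature as 0 (absent).
    """
    total = 0.0
    for xi, pi in zip(x, p):
        if xi is None:
            if skip_missing:
                continue
            xi = 0
        total += math.log(pi if xi else 1.0 - pi)
    return total

p_spam = [0.8, 0.6]  # hypothetical P(word present | spam)
x = [1, None]        # the second feature was not observed

skipped = bernoulli_log_likelihood(x, p_spam, skip_missing=True)   # log(0.8)
imputed = bernoulli_log_likelihood(x, p_spam, skip_missing=False)  # log(0.8) + log(0.4)
print(skipped, imputed)
```

Imputing as 0 lowers the score for any class in which the word is common, which is the source of the bias described above.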

### Q4. Can Gaussian Naive Bayes be used for multi-class classification?
Yes. Gaussian Naive Bayes is typically used for continuous data, where each feature is assumed to be normally distributed within each class. It handles multi-class problems natively: it estimates a prior and a per-feature mean and variance for each class, computes the posterior probability of every class for a new sample, and selects the class with the highest probability. The number of classes does not determine which variant is appropriate; that choice depends on the feature type (continuous, count, or binary).
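
A minimal from-scratch sketch of the multi-class case, assuming a toy one-feature dataset with three classes. The helper names and the small variance floor are choices made for this illustration, not scikit-learn's code.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a prior plus a per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # The small constant keeps each variance strictly positive (sketch assumption).
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, row):
    """Return the class with the highest log posterior."""
    def log_posterior(params):
        prior, means, variances = params
        return math.log(prior) + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(row, means, variances))
    return max(model, key=lambda label: log_posterior(model[label]))

X = [[1.0], [1.2], [5.0], [5.3], [9.0], [9.4]]  # toy data
y = ["a", "a", "b", "b", "c", "c"]
model = fit_gaussian_nb(X, y)
print(predict(model, [5.1]))  # b
```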

### Q5. Assignment:
Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message
is spam or not based on several input features.

Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:
Report the following performance metrics for each classifier:
1. Accuracy
2. Precision
3. Recall
4. F1 score

Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:
Summarise your findings and provide some suggestions for future work.

In [1]:
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

In [14]:
# Load the Spambase dataset
data = pd.read_csv('spambase.csv')

In [15]:
data.head()

Unnamed: 0,0,0.64,0.64.1,0.1,0.32,0.2,0.3,0.4,0.5,0.6,...,0.41,0.42,0.43,0.778,0.44,0.45,3.756,61,278,1
0,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
1,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
2,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,1.85,0.0,0.0,1.85,0.0,0.0,...,0.0,0.223,0.0,0.0,0.0,0.0,3.0,15,54,1
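
The column names in this preview (`0`, `0.64`, `0.64.1`, ...) look like data values, which suggests that the first record was read as a header row, and `Unnamed: 0` looks like a saved row index rather than a feature. The raw UCI file `spambase.data` has no header row, so one hedged way to load such a file is sketched below. The function name `load_headerless_csv`, the generated column names, and the two-row sample are hypothetical.

```python
import io
import pandas as pd

def load_headerless_csv(source, n_features):
    """Read a CSV that has no header row; the last column is the label."""
    names = [f"f{i}" for i in range(n_features)] + ["label"]
    return pd.read_csv(source, header=None, names=names)

# Tiny stand-in for the real file (3 features + label), for illustration only.
sample = io.StringIO("0,0.64,0.64,1\n0.21,0.28,0.5,1\n")
df = load_headerless_csv(sample, n_features=3)

X_demo = df.drop(columns="label")
y_demo = df["label"]
print(df.shape)  # (2, 4)
```

For the raw Spambase file, which has 57 features followed by the label, the call would be `load_headerless_csv("spambase.data", n_features=57)`, assuming the file is in the working directory.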


In [16]:
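# Split the dataset into features (all columns but the last) and target (last column).
# Note: ':-1' also keeps the first column, 'Unnamed: 0', which looks like a saved
# row index rather than a real feature.
X = data.iloc[:, :-1]
y = data.iloc[:, -1]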

In [17]:
# Initialize the classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

In [18]:
# Evaluate the classifiers using 10-fold cross-validation
scoring = ['accuracy', 'precision', 'recall', 'f1']
bernoulli_scores = cross_validate(bernoulli_nb, X, y, cv=10, scoring=scoring)
multinomial_scores = cross_validate(multinomial_nb, X, y, cv=10, scoring=scoring)
gaussian_scores = cross_validate(gaussian_nb, X, y, cv=10, scoring=scoring)
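
When `cv` is an integer and the estimator is a classifier, scikit-learn uses stratified folds without shuffling. The sketch below makes the splitter explicit, with shuffling and a fixed seed for reproducibility. It runs on synthetic stand-in data, not Spambase, and the `_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data, not Spambase.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Explicit, shuffled, reproducible folds that preserve the class ratio.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
demo_scores = cross_validate(
    GaussianNB(), X_demo, y_demo, cv=folds,
    scoring=["accuracy", "precision", "recall", "f1"],
)

# One score per fold for each metric.
print(demo_scores["test_accuracy"].shape)  # (10,)
```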

In [19]:
# Print the mean of each metric across the 10 folds for each classifier
results = {
    "Bernoulli Naive Bayes": bernoulli_scores,
    "Multinomial Naive Bayes": multinomial_scores,
    "Gaussian Naive Bayes": gaussian_scores,
}
labels = {"accuracy": "Accuracy", "precision": "Precision",
          "recall": "Recall", "f1": "F1 score"}

for name, scores in results.items():
    print(f"{name}:")
    for metric in scoring:
        print(f"{labels[metric]}: {scores[f'test_{metric}'].mean():.4f}")
    print()

Bernoulli Naive Bayes:
Accuracy: 0.8839
Precision: 0.8869
Recall: 0.8151
F1 score: 0.8481

Multinomial Naive Bayes:
Accuracy: 0.7861
Precision: 0.7390
Recall: 0.7208
F1 score: 0.7278

Gaussian Naive Bayes:
Accuracy: 0.8217
Precision: 0.7103
Recall: 0.9569
F1 score: 0.8130
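
For reference, the four reported metrics can be derived from confusion-matrix counts, with spam as the positive class. The sketch below uses made-up counts purely for illustration; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Made-up counts, not taken from the Spambase runs.
acc, prec, rec, f1 = classification_metrics(tp=80, fp=20, fn=10, tn=90)
print(f"{acc:.4f} {prec:.4f} {rec:.4f} {f1:.4f}")
```

Because the values above are means over 10 folds, the reported F1 score need not equal the F1 computed from the mean precision and mean recall.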



### Discussion
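From the results, Bernoulli Naive Bayes performed the best overall, with the highest accuracy (0.8839), precision (0.8869), and F1 score (0.8481). Gaussian Naive Bayes came second on accuracy (0.8217) and F1 score (0.8130) and had the highest recall of the three (0.9569), but also the lowest precision (0.7103), meaning it flags many legitimate messages as spam. Multinomial Naive Bayes performed the worst on accuracy, recall, and F1 score.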

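A likely explanation lies in the nature of the Spambase dataset. Its features are continuous: word and character frequencies expressed as percentages, plus statistics on runs of capital letters. Bernoulli Naive Bayes binarizes each feature (by default, any value greater than zero becomes 1), so it effectively models whether a word or character appears at all, which is a strong signal for spam. Multinomial Naive Bayes expects counts, and percentage-valued frequencies mixed with large run-length values do not fit that model well. Gaussian Naive Bayes assumes each feature is normally distributed within a class, whereas these features are heavily skewed, with many zeros.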
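
scikit-learn's `BernoulliNB` thresholds each feature at `binarize=0.0` by default, so any value greater than zero becomes 1. A small sketch of that transformation on an illustrative row (the helper name and the values are assumptions for this example):

```python
def binarize(row, threshold=0.0):
    """Map each value to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in row]

row = [0.0, 0.64, 0.0, 3.756, 61]  # illustrative frequency and run-length values
print(binarize(row))  # [0, 1, 0, 1, 1]
```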

### Conclusion
In this assignment, we implemented Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python and evaluated their performance on the Spambase dataset using 10-fold cross-validation. We found that Bernoulli Naive Bayes performed the best overall, with Gaussian Naive Bayes coming in second and Multinomial Naive Bayes performing the worst.

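Overall, Naive Bayes is a simple yet effective classification algorithm that works well for certain types of datasets. However, it does have some limitations, such as its assumption that features are independent given the class, which makes it overconfident when features are strongly correlated, and its dependence on how well the data match the distribution each variant assumes.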
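
The effect of the independence assumption can be sketched with a single piece of evidence that is counted more than once, as happens when features are duplicated or strongly correlated. The function name, the likelihood ratio, and the equal prior are hypothetical values chosen for the illustration.

```python
def posterior_spam(likelihood_ratio, copies, prior_spam=0.5):
    """P(spam | evidence) when the same feature is counted `copies` times.

    likelihood_ratio is P(feature | spam) / P(feature | ham). Naive Bayes
    multiplies the ratio once per copy, as if the copies were independent.
    """
    odds = prior_spam / (1.0 - prior_spam) * likelihood_ratio ** copies
    return odds / (1.0 + odds)

once = posterior_spam(3.0, copies=1)    # 0.75
thrice = posterior_spam(3.0, copies=3)  # 27/28, about 0.964
print(once, thrice)
```

The evidence is the same in both calls, yet the posterior moves from 0.75 to about 0.964 purely because the feature was repeated.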

Note: Some of these are very short answer questions, so I have kept those answers brief. If more detail is needed, please give me feedback with the particular question number.