# Naïve Bayes-2

### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

### Ans:-
To find the probability that an employee is a smoker given that he/she uses the health insurance plan, you can use conditional probability. You want to calculate P(Smoker∣Uses Insurance Plan).

Let's break down the information given:
- P(Uses Insurance Plan) is the probability that an employee uses the company's health insurance plan, which is 70% or 0.70.

- P(Smoker∣Uses Insurance Plan) is the probability that an employee is a smoker given that they use the health insurance plan. The survey states this directly: 40% of the employees who use the plan are smokers, so it is 0.40.

- P(Smoker∩Uses Insurance Plan) is the joint probability that an employee is both a smoker and uses the health insurance plan. By the multiplication rule, this is P(Smoker∣Uses Insurance Plan) × P(Uses Insurance Plan) = 0.40 × 0.70 = 0.28.

**Now, you can use conditional probability:**

**P(Smoker∣Uses Insurance Plan) = P(Smoker∩Uses Insurance Plan) / P(Uses Insurance Plan)**

**Plug in the values:**

P(Smoker∣Uses Insurance Plan) = 0.28/0.70 = 4/10 = 0.4

So, the probability that an employee is a smoker given that he/she uses the health insurance plan is 0.4 or 40%.
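
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation; the variable names are illustrative and not part of the original question.

```python
# Values taken from the question.
p_uses_plan = 0.70          # P(Uses Insurance Plan)
p_smoker_given_plan = 0.40  # P(Smoker | Uses Insurance Plan), stated in the survey

# Multiplication rule: P(Smoker ∩ Uses Plan) = P(Smoker | Uses Plan) * P(Uses Plan)
p_joint = p_smoker_given_plan * p_uses_plan

# Definition of conditional probability recovers the stated 40%.
p_recovered = p_joint / p_uses_plan

print(f"P(Smoker and Uses Plan) = {p_joint:.2f}")
print(f"P(Smoker | Uses Plan)   = {p_recovered:.2f}")
```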

### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

### Ans:-
Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes classifier that are suited for different types of data and problems. 

**The key difference between them lies in the type of data they are designed to handle:**

1. Bernoulli Naive Bayes:

- Type of Data: Bernoulli Naive Bayes is designed for binary data, where features represent either presence (1) or absence (0) of certain characteristics. It's well-suited for situations where you have binary or Boolean data, such as document classification where you want to know if specific words are present in a document or not.

- Use Cases: Common applications include text classification (e.g., spam detection), sentiment analysis, and problems where you want to model the presence or absence of certain features.

- Probability Model: It models the probability of each feature being either 0 (absent) or 1 (present) within each class.

2. Multinomial Naive Bayes:

- Type of Data: Multinomial Naive Bayes is primarily used for discrete data where features represent counts or frequencies of events. It's particularly suitable for text data, where features often correspond to word counts or term frequencies within documents.

- Use Cases: It's commonly applied to text classification tasks, document categorization, and problems where features are counts of occurrences (e.g., how many times a word appears in a document).

- Probability Model: It models the probability distribution of feature counts within each class. It assumes that the features follow a multinomial distribution.
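
To make the contrast concrete, the sketch below computes the class-conditional log-likelihood of a document under each model by hand. The function names, the three-word vocabulary, and the probability values are all made up for illustration; they are not taken from any dataset or library.

```python
import math

def bernoulli_log_likelihood(counts, p_present):
    """Bernoulli model: only presence/absence matters.

    p_present[i] is the assumed P(word i appears at least once | class).
    Absent words contribute log(1 - p), so absence is evidence too.
    """
    total = 0.0
    for c, p in zip(counts, p_present):
        total += math.log(p) if c > 0 else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(counts, p_word):
    """Multinomial model: every occurrence of a word counts.

    p_word[i] is the assumed P(a token is word i | class) and sums to 1.
    The multinomial coefficient is omitted because it is the same for
    every class and does not affect the comparison between classes.
    """
    return sum(c * math.log(p) for c, p in zip(counts, p_word))

# Two documents with the same words present but different counts.
doc_a = [1, 0, 1]
doc_b = [5, 0, 1]

p_present = [0.8, 0.3, 0.5]  # hypothetical Bernoulli parameters for one class
p_word = [0.6, 0.1, 0.3]     # hypothetical multinomial parameters for one class

print("Bernoulli:  ", bernoulli_log_likelihood(doc_a, p_present),
      bernoulli_log_likelihood(doc_b, p_present))
print("Multinomial:", multinomial_log_likelihood(doc_a, p_word),
      multinomial_log_likelihood(doc_b, p_word))
```

The Bernoulli model scores both documents identically because the same words are present, while the multinomial model scores them differently because the first word occurs five times in `doc_b`.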

### Q3. How does Bernoulli Naive Bayes handle missing values?

### Ans:-
Bernoulli Naive Bayes, like other variants of the Naive Bayes classifier, generally requires complete data with no missing values to perform accurate classification. This is because it relies on the presence or absence of binary features (0 or 1) to calculate probabilities for each class. If a feature has missing values, it disrupts the binary nature of the data, which can affect the classifier's performance.

Handling missing values in Bernoulli Naive Bayes typically involves preprocessing the data to either impute the missing values or make reasonable assumptions about their absence or presence. 

**Here are some common approaches:**
1. Imputation:

- You can impute (fill in) missing values with a specific value, such as 0 or 1, depending on your assumptions or domain knowledge.
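- Alternatively, you can impute missing values with the mode (most frequent value) of the available data for that feature. The mean is usually not a good choice here, because it is generally a fraction rather than 0 or 1 and would break the binary encoding.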

2. Assume Missing Values Are Absent:

- In some cases, it may be reasonable to assume that missing values should be treated as if they were absent (i.e., assigned a value of 0). This can be appropriate when the absence of information implies the absence of the feature.
- However, this assumption may not always hold true, so it should be made cautiously based on the context of your data.

3. Use Special Encoding for Missing Values:

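- You can create a special encoding for missing values so that the information that data was missing is preserved. Because Bernoulli Naive Bayes expects binary features, one way to do this is to add an extra 0/1 indicator feature that marks whether the original value was missing.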

4. Data Preprocessing:

- Depending on the software or libraries you are using, you might need to preprocess your data to handle missing values before applying Bernoulli Naive Bayes. Libraries like scikit-learn in Python provide tools for handling missing values in a preprocessing pipeline.
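
As a rough illustration of approaches 1 and 3, the sketch below fills missing entries of a binary column with the mode and also builds a missing-value indicator. It is plain Python, the helper name `impute_binary_column` is hypothetical, and `None` is assumed to mark a missing value. In scikit-learn, `SimpleImputer(strategy="most_frequent")` covers the imputation step.

```python
from collections import Counter

def impute_binary_column(values):
    """Fill missing entries (None) in a 0/1 column with the mode.

    Returns the filled column and a parallel 0/1 indicator column that
    records which entries were originally missing.
    """
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    filled = [mode if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing

column = [1, None, 0, 1, None, 1]
filled, was_missing = impute_binary_column(column)
print(filled)
print(was_missing)
```

Both returned columns are binary, so they can be passed to Bernoulli Naive Bayes as two separate features.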

### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

### Ans:-
Yes, Gaussian Naive Bayes can be used for multi-class classification tasks. Gaussian Naive Bayes is a variant of the Naive Bayes classifier that is particularly well-suited for continuous or real-valued features. It handles any number of classes natively: a single model estimates a prior and a set of per-feature Gaussian parameters for every class, so no one-vs-rest wrapper or separate per-class classifier is needed.

**Here's how Gaussian Naive Bayes works for multi-class classification:**

1. Modeling Each Class:

- For multi-class classification, you would have multiple classes (more than two). Let's say you have K classes.
- A single Gaussian Naive Bayes model stores K sets of parameters, one for each class: the class prior P(Y=k) and a mean and variance for every feature.

2. Training:

- For each class, you calculate the mean and variance of each feature (assuming Gaussian distribution) using the training data for that specific class.
- These mean and variance values represent the parameters of the Gaussian distribution for each feature within each class.

3. Prediction:

- To classify a new data point, you calculate the likelihood of the data point's features under each class's Gaussian distribution using the class-specific mean and variance values.
- You then apply Bayes' theorem to compute the posterior probabilities for each class.
- The class with the highest posterior probability is predicted as the class for the data point.

4. Decision Rule:

- The decision rule for multi-class classification using Gaussian Naive Bayes is to select the class with the maximum posterior probability, which can be expressed as:

$$\hat{y} = \arg\max_{k} P(Y = k \mid X_1 = x_1, X_2 = x_2, \dots, X_n = x_n)$$

where $\hat{y}$ is the predicted class, $k$ ranges over all classes, and $x_1, x_2, \dots, x_n$ are the values of the features $X_1, X_2, \dots, X_n$ for the new data point.

Gaussian Naive Bayes can effectively handle multi-class classification problems, but it assumes that the features within each class are normally distributed. If this assumption holds reasonably well for your data, Gaussian Naive Bayes can be a simple yet effective classifier for multi-class tasks, especially when you have continuous feature data. However, if your data doesn't closely follow a Gaussian distribution, other classifiers like Multinomial Naive Bayes or Decision Trees may be more appropriate.
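
The training and prediction steps described above can be sketched from scratch for a three-class toy problem. This is a minimal illustration and not scikit-learn's implementation: the function names, the toy data, and the `var_floor` parameter (a small constant added to each variance to avoid division by zero) are assumptions made for the example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a log-prior plus per-feature mean and variance for each class."""
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)

    model = {}
    for label, rows in rows_by_class.items():
        columns = list(zip(*rows))
        means = [sum(col) / len(col) for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / len(col) + var_floor
            for col, m in zip(columns, means)
        ]
        log_prior = math.log(len(rows) / len(X))
        model[label] = (log_prior, means, variances)
    return model

def predict(model, x):
    """Return the class with the highest (unnormalised) log-posterior."""
    def log_posterior(params):
        log_prior, means, variances = params
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, means, variances)
        )
        return log_prior + log_likelihood
    return max(model, key=lambda label: log_posterior(model[label]))

# Toy data: three well-separated classes with two continuous features.
X = [[1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
     [5.0, 5.2], [5.1, 4.9], [4.8, 5.0],
     [9.0, 1.0], [9.2, 1.1], [8.9, 0.9]]
y = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]

model = fit_gaussian_nb(X, y)
print(predict(model, [1.0, 1.0]), predict(model, [5.0, 5.0]), predict(model, [9.0, 1.0]))
```

Working in log space avoids numerical underflow when many feature likelihoods are multiplied, and the evidence term of Bayes' theorem is dropped because it is identical for every class.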

### Q5. Assignment:

Data preparation:-

Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.


Implementation:-

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.


Results:

Report the following performance metrics for each classifier:

Accuracy

Precision

Recall

F1 score


Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?


Conclusion:
Summarise your findings and provide some suggestions for future work.

Note: Create your assignment in a Jupyter notebook, upload it to GitHub, and share the GitHub repository link through your dashboard. Make sure the repository is public.

Note:  This dataset contains a binary classification problem with multiple features. The dataset is relatively small, but it can be used to demonstrate the performance of the different variants of Naive Bayes on a real-world problem.

### Ans:-

In [1]:
import numpy as np
import pandas as pd

# Download the Spambase dataset
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", header=None)

# Split the dataset into features and target
X = df.drop(columns=57)
y = df[57]

# Split the dataset into training and testing sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [2]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Implement Bernoulli Naive Bayes
bnb = BernoulliNB()

# Implement Multinomial Naive Bayes
mnb = MultinomialNB()

# Implement Gaussian Naive Bayes
gnb = GaussianNB()

In [3]:
from sklearn.model_selection import cross_val_score

# Evaluate the performance of each classifier using 10-fold cross-validation

# Bernoulli Naive Bayes
bnb_cv_scores = cross_val_score(bnb, X_train, y_train, cv=10)

# Multinomial Naive Bayes
mnb_cv_scores = cross_val_score(mnb, X_train, y_train, cv=10)

# Gaussian Naive Bayes
gnb_cv_scores = cross_val_score(gnb, X_train, y_train, cv=10)

In [4]:
# Report the cross-validated accuracy of each classifier.
# cross_val_score was called without a `scoring` argument, so the arrays
# computed above hold accuracy only. Precision, recall, and F1 score each need
# their own scoring run (for example scoring="precision") and must not be
# read off these arrays.

# Bernoulli Naive Bayes
print("Bernoulli Naive Bayes")
print("Accuracy:", bnb_cv_scores.mean())

# Multinomial Naive Bayes
print("Multinomial Naive Bayes")
print("Accuracy:", mnb_cv_scores.mean())

# Gaussian Naive Bayes
print("Gaussian Naive Bayes")
print("Accuracy:", gnb_cv_scores.mean())

Bernoulli Naive Bayes
Accuracy: 0.883768115942029
Multinomial Naive Bayes
Accuracy: 0.7878260869565218
Gaussian Naive Bayes
Accuracy: 0.8159420289855073
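
For these classifiers, `cross_val_score` without a `scoring` argument reports accuracy, so the `*_cv_scores` arrays cannot supply precision, recall, or F1. One possible way to get all four metrics in a single 10-fold run is scikit-learn's `cross_validate` with a list of scorers, sketched below. The helper name `report_cv_metrics` and the small synthetic count dataset are assumptions made so the block runs on its own; they stand in for the Spambase features and are not the notebook's data, so no results are shown.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

SCORING = ["accuracy", "precision", "recall", "f1"]

def report_cv_metrics(models, X, y, cv=10):
    """Return the mean of each metric over the CV folds for every model."""
    results = {}
    for name, model in models.items():
        scores = cross_validate(model, X, y, cv=cv, scoring=SCORING)
        # cross_validate stores each metric under the key "test_<metric>".
        results[name] = {m: float(scores[f"test_{m}"].mean()) for m in SCORING}
    return results

# Synthetic stand-in data: non-negative counts, so MultinomialNB accepts it.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
rate = 1.0 + 2.0 * y_demo[:, None] * np.array([1, 0, 1, 0, 1])
X_demo = rng.poisson(lam=rate, size=(200, 5))

models = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB(),
}
results = report_cv_metrics(models, X_demo, y_demo, cv=10)
for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
```

To apply this to the assignment, call `report_cv_metrics(models, X_train, y_train)` with the Spambase split from the first cell, or pass the full `X` and `y` if the cross-validation should cover the whole dataset.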
