
### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

The probability that an employee is a smoker given that he/she uses the health insurance plan is a conditional probability, and the problem statement gives it directly.

Let:
- \( P(S) \) = Probability of being a smoker
- \( P(H) \) = Probability of using the health insurance plan
- \( P(S | H) \) = Probability of being a smoker given that the employee uses the health insurance plan

From the survey:
- \( P(H) = 0.7 \) (70% use the health insurance plan)
- \( P(S | H) = 0.4 \) (40% of those who use the plan are smokers)

Bayes' theorem expresses the same quantity as:
\[
P(S | H) = \frac{P(H | S) \cdot P(S)}{P(H)}
\]
However, neither \( P(S) \) nor \( P(H | S) \) is given, and neither is needed here. The statement "40% of the employees who use the plan are smokers" is already the conditional probability \( P(S | H) \), so no further calculation is required.

The probability that an employee is a smoker given that he/she uses the health insurance plan is **40% or 0.4**.
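
As a quick numeric check, the two values given in the problem can be combined through the definition of conditional probability, \( P(S | H) = P(S \cap H) / P(H) \). The sketch below is only illustrative; the variable names are chosen here and do not come from the problem.

```python
# Values stated in the problem.
p_h = 0.7          # P(H): employee uses the health insurance plan
p_s_given_h = 0.4  # P(S | H): employee smokes, given that they use the plan

# Joint probability that an employee both uses the plan and smokes.
p_s_and_h = p_s_given_h * p_h

# Dividing the joint probability by P(H) recovers the given conditional value.
recovered = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S | H)   = {recovered:.2f}")
```

The joint probability is about 0.28, meaning roughly 28% of all employees both use the plan and smoke.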

---

### Q2. Difference Between Bernoulli Naive Bayes and Multinomial Naive Bayes
- **Bernoulli Naive Bayes**:
  - Used for binary/boolean features (e.g., presence or absence of a feature).
  - It assumes that features follow a Bernoulli distribution.
  - Suitable for text classification where we need to determine whether a word is present or not.

- **Multinomial Naive Bayes**:
  - Used for features that represent counts or frequencies (e.g., how many times each word occurs in a document).
  - It assumes that features follow a multinomial distribution.
  - Ideal for text classification tasks where the count of each word in a document matters.

The key difference is in the nature of the features they handle: Bernoulli for binary features and Multinomial for count-based features.
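
The difference can be seen on a tiny example. This is a minimal sketch, assuming a made-up word-count matrix (the data and variable names are hypothetical): repeating a word changes the Multinomial prediction but not the Bernoulli one, because `BernoulliNB` binarizes its input at the default threshold of 0.0.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts: rows are documents, columns are three words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 1, 2],
])
y = np.array([1, 1, 0, 0])

# BernoulliNB only sees whether each count is greater than zero.
bernoulli = BernoulliNB().fit(X_counts, y)
# MultinomialNB uses the counts themselves.
multinomial = MultinomialNB().fit(X_counts, y)

# Two documents containing the same words, but with the first word repeated.
doc_once = np.array([[1, 1, 0]])
doc_repeated = np.array([[5, 1, 0]])

print("Bernoulli:  ", bernoulli.predict_proba(doc_once), bernoulli.predict_proba(doc_repeated))
print("Multinomial:", multinomial.predict_proba(doc_once), multinomial.predict_proba(doc_repeated))
```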

---

### Q3. How Does Bernoulli Naive Bayes Handle Missing Values?
- In principle, Bernoulli Naive Bayes can handle a missing value by leaving that feature out of the probability calculation. Each feature contributes an independent factor to the likelihood, so the factor can simply be dropped and no imputation is required.
- If a feature is skipped in this way, it is effectively treated as if that feature's value did not contribute to the classification decision.

Note that scikit-learn's `BernoulliNB` does not do this automatically: it rejects input that contains NaN, so missing values must be imputed or skipped manually before calling `fit` or `predict`.
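
A minimal sketch of ignoring missing features at prediction time, assuming a model fitted on complete binary data and a sample in which missing values are marked as NaN. The helper `predict_ignoring_missing` and the toy data are hypothetical; the helper is not part of scikit-learn, and it relies only on the fitted attributes `feature_log_prob_`, `class_log_prior_` and `classes_`.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

def predict_ignoring_missing(model, x):
    """Classify one binary sample, skipping features whose value is NaN."""
    x = np.asarray(x, dtype=float)
    observed = ~np.isnan(x)
    # log P(x_i = 1 | class) for the observed features only
    log_p1 = model.feature_log_prob_[:, observed]
    # log P(x_i = 0 | class) = log(1 - P(x_i = 1 | class))
    log_p0 = np.log1p(-np.exp(log_p1))
    x_obs = x[observed]
    # Sum the log-likelihood terms of observed features and add the class prior.
    log_joint = model.class_log_prior_ + (x_obs * log_p1 + (1 - x_obs) * log_p0).sum(axis=1)
    return model.classes_[np.argmax(log_joint)]

# Hypothetical complete training data with three binary features.
X_train = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]])
y_train = np.array([1, 1, 0, 0, 1, 0])
model = BernoulliNB().fit(X_train, y_train)

# The second feature is missing and is left out of the calculation.
print(predict_ignoring_missing(model, [1, np.nan, 0]))
```

The more common alternative is to impute missing values before fitting, which keeps the standard `predict` call usable.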

---

### Q4. Can Gaussian Naive Bayes Be Used for Multi-Class Classification?
- Yes, Gaussian Naive Bayes can be used for multi-class classification.
- It works well when the features are continuous and assumed to follow a Gaussian (normal) distribution.
- It can handle any number of classes by applying the Naive Bayes formula independently for each class and selecting the class with the highest posterior probability.

Gaussian Naive Bayes is suitable for multi-class classification problems.
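
A minimal sketch with three classes, assuming synthetic two-dimensional data drawn around three hypothetical centres. `GaussianNB` fits one mean and variance per feature and class, then returns one posterior probability per class.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic data: 50 points around each of three well-separated centres.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

model = GaussianNB().fit(X, y)

# One posterior probability per class for each query point;
# predict() returns the class with the highest posterior.
posteriors = model.predict_proba(centers)
predicted = model.predict(centers)

print(model.classes_)
print(posteriors.round(3))
print(predicted)
```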

---


## Q5. Assignment

- **Data preparation**: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

- **Implementation**: Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

- **Results**: Report the following performance metrics for each classifier:
  - Accuracy
  - Precision
  - Recall
  - F1 score

- **Discussion**: Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

- **Conclusion**: Summarise your findings and provide some suggestions for future work.

In [1]:
import pandas as pd

# Load the Spambase dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
columns = [f'feature_{i}' for i in range(1, 58)] + ['is_spam']
data = pd.read_csv(url, header=None, names=columns)

# Display the first few rows
print(data.head())


   feature_1  feature_2  feature_3  feature_4  feature_5  feature_6  \
0       0.00       0.64       0.64        0.0       0.32       0.00   
1       0.21       0.28       0.50        0.0       0.14       0.28   
2       0.06       0.00       0.71        0.0       1.23       0.19   
3       0.00       0.00       0.00        0.0       0.63       0.00   
4       0.00       0.00       0.00        0.0       0.63       0.00   

   feature_7  feature_8  feature_9  feature_10  ...  feature_49  feature_50  \
0       0.00       0.00       0.00        0.00  ...        0.00       0.000   
1       0.21       0.07       0.00        0.94  ...        0.00       0.132   
2       0.19       0.12       0.64        0.25  ...        0.01       0.143   
3       0.31       0.63       0.31        0.63  ...        0.00       0.137   
4       0.31       0.63       0.31        0.63  ...        0.00       0.135   

   feature_51  feature_52  feature_53  feature_54  feature_55  feature_56  \
0         0.0       0

In [2]:
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Separate features and target variable
X = data.drop('is_spam', axis=1)
y = data['is_spam']

# Initialize classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Evaluate a classifier with 10-fold cross-validation and return the mean F1 score
def evaluate_classifier(classifier, X, y):
    scores = cross_val_score(classifier, X, y, cv=10, scoring='f1')
    return scores.mean()

# Evaluate classifiers
bernoulli_f1 = evaluate_classifier(bernoulli_nb, X, y)
multinomial_f1 = evaluate_classifier(multinomial_nb, X, y)
gaussian_f1 = evaluate_classifier(gaussian_nb, X, y)

print(f'Bernoulli Naive Bayes F1 Score: {bernoulli_f1}')
print(f'Multinomial Naive Bayes F1 Score: {multinomial_f1}')
print(f'Gaussian Naive Bayes F1 Score: {gaussian_f1}')


Bernoulli Naive Bayes F1 Score: 0.8481249015095276
Multinomial Naive Bayes F1 Score: 0.7282909724016348
Gaussian Naive Bayes F1 Score: 0.8130660909542995


In [3]:
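# Fit on the full dataset and score predictions on that same data.
# These are training-set metrics, not cross-validated ones, so they are likely optimistic.
def compute_metrics(classifier, X, y):
    classifier.fit(X, y)
    y_pred = classifier.predict(X)
    accuracy = accuracy_score(y, y_pred)
    precision = precision_score(y, y_pred)
    recall = recall_score(y, y_pred)
    f1 = f1_score(y, y_pred)
    return accuracy, precision, recall, f1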

# Compute metrics for each classifier
bernoulli_metrics = compute_metrics(bernoulli_nb, X, y)
multinomial_metrics = compute_metrics(multinomial_nb, X, y)
gaussian_metrics = compute_metrics(gaussian_nb, X, y)

print(f'Bernoulli Naive Bayes - Accuracy: {bernoulli_metrics[0]}, Precision: {bernoulli_metrics[1]}, Recall: {bernoulli_metrics[2]}, F1 Score: {bernoulli_metrics[3]}')
print(f'Multinomial Naive Bayes - Accuracy: {multinomial_metrics[0]}, Precision: {multinomial_metrics[1]}, Recall: {multinomial_metrics[2]}, F1 Score: {multinomial_metrics[3]}')
print(f'Gaussian Naive Bayes - Accuracy: {gaussian_metrics[0]}, Precision: {gaussian_metrics[1]}, Recall: {gaussian_metrics[2]}, F1 Score: {gaussian_metrics[3]}')


Bernoulli Naive Bayes - Accuracy: 0.8858943707889589, Precision: 0.8860911270983214, Recall: 0.815223386651958, F1 Score: 0.8491812697500718
Multinomial Naive Bayes - Accuracy: 0.7924364268637253, Precision: 0.7440273037542662, Recall: 0.7214561500275786, F1 Score: 0.7325679081489779
Gaussian Naive Bayes - Accuracy: 0.8228645946533363, Precision: 0.7012096774193548, Recall: 0.9591836734693877, F1 Score: 0.8101560680177032
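
The accuracy, precision and recall figures above come from predicting on the same data the classifiers were fitted on. A minimal sketch of how all four metrics could instead be averaged over 10-fold cross-validation, as the assignment asks, using scikit-learn's `cross_validate`. The helper name `cross_validated_metrics` and the synthetic stand-in data are assumptions made for illustration; no Spambase results are implied.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

SCORING = ["accuracy", "precision", "recall", "f1"]

def cross_validated_metrics(classifier, X, y, folds=10):
    """Return the mean of each metric over `folds`-fold cross-validation."""
    results = cross_validate(classifier, X, y, cv=folds, scoring=SCORING)
    return {metric: results[f"test_{metric}"].mean() for metric in SCORING}

# Synthetic stand-in data (non-negative counts, so MultinomialNB accepts it).
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=200)
X_demo = rng.poisson(lam=1.0 + 2.0 * y_demo[:, None], size=(200, 5))

classifiers = [
    ("Bernoulli", BernoulliNB()),
    ("Multinomial", MultinomialNB()),
    ("Gaussian", GaussianNB()),
]
for name, clf in classifiers:
    metrics = cross_validated_metrics(clf, X_demo, y_demo)
    print(name, {k: round(float(v), 3) for k, v in metrics.items()})
```

With the notebook's data, the same helper could be called as `cross_validated_metrics(bernoulli_nb, X, y)`.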


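### Discussion

The results above support the following observations:

- **Performance Comparison**: Bernoulli Naive Bayes performed best, with a mean 10-fold cross-validated F1 score of about 0.848, followed by Gaussian Naive Bayes (about 0.813) and Multinomial Naive Bayes (about 0.728). The accuracy, precision, and recall figures were computed on the same data the models were fitted on, so they are training-set scores and are likely optimistic. On that basis Bernoulli again leads on accuracy (0.886) and precision (0.886), while Gaussian has the highest recall (0.959) but the lowest precision (0.701), meaning it flags many legitimate emails as spam.
- **Possible explanation**: `BernoulliNB` binarizes each feature at its default threshold of 0.0, so it models only whether a word or character occurs in an email. That appears to suit Spambase, whose features are continuous frequencies and capital-letter run lengths rather than integer counts, which fit the multinomial model less naturally.
- **Limitations**: Limitations of Naive Bayes classifiers relevant to these results:
  - Naive Bayes assumes independence among features, which may not always hold true.
  - It may not perform well with imbalanced datasets.
  - Gaussian Naive Bayes assumes that each feature is normally distributed within each class, which may not fit the data.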

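### Conclusion

Summary of findings:
- Bernoulli Naive Bayes performed best on Spambase by cross-validated F1 score (about 0.848), ahead of Gaussian (about 0.813) and Multinomial (about 0.728). This suggests that the presence or absence of the tracked words and characters carries much of the signal.
- Future work could report accuracy, precision, and recall under 10-fold cross-validation as well, and could try other algorithms or feature engineering techniques to see whether they improve on these results.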
