# Assignment (10th April) : Naive Bayes - 2

### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

**ANS:** Let's define the following probabilities based on the information given:

- P(Insurance)=0.70: Probability that an employee uses the health insurance plan.
- P(Smoker∣Insurance)=0.40: Probability that an employee is a smoker given that they use the insurance.

The question asks for exactly this conditional probability, and it is stated directly in the survey results, so no further calculation is needed:

- P(Smoker∣Insurance)=0.40

The 70% figure would only be needed for the joint probability, P(Smoker ∩ Insurance) = 0.40 × 0.70 = 0.28. Thus, the probability that an employee is a smoker, given that they use the health insurance plan, is `0.40 or 40%`.
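As a quick check, the relationship between the joint and conditional probabilities can be worked through in Python. This is only an illustrative sketch: the variable names are made up for this example, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction

# Values taken from the question
p_insurance = Fraction(70, 100)               # P(Insurance)
p_smoker_given_insurance = Fraction(40, 100)  # P(Smoker | Insurance)

# Product rule: P(Smoker and Insurance) = P(Smoker | Insurance) * P(Insurance)
p_smoker_and_insurance = p_smoker_given_insurance * p_insurance

# Definition of conditional probability recovers the original 0.40
recovered = p_smoker_and_insurance / p_insurance

print(float(p_smoker_and_insurance))  # 0.28
print(float(recovered))               # 0.4
```

Dividing the joint probability by P(Insurance) returns the conditional probability that the question already supplied.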



### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

**ANS:** 

- `Bernoulli Naive Bayes`: This classifier is designed for binary/boolean features (e.g., word presence or absence in text data). Each feature records only whether something occurs, not how often, and the model explicitly penalizes the absence of a feature that is typical of a class.

- `Multinomial Naive Bayes`: This classifier is suited for multinomially distributed data, commonly used for text data with term frequencies or counts (e.g., word counts in documents). It captures the frequency or counts of features and is better for tasks involving frequency-based data.
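The difference can be seen on a tiny example. The sketch below uses a made-up count matrix (4 documents, 3 words) and assumes scikit-learn's default settings, where `BernoulliNB` binarizes inputs at 0.0 and `MultinomialNB` uses the counts as given.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word counts: 4 documents x 3 vocabulary words
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 4, 1],
                     [0, 1, 2]])
y = np.array([0, 0, 1, 1])

mnb = MultinomialNB().fit(X_counts, y)
bnb = BernoulliNB().fit(X_counts, y)  # default binarize=0.0 maps counts > 0 to 1

# MultinomialNB accumulates total counts per class and feature
print(mnb.feature_count_)
# BernoulliNB accumulates the number of documents containing each feature
print(bnb.feature_count_)

# Fitting BernoulliNB on explicitly binarized data gives the same probabilities
X_binary = (X_counts > 0).astype(int)
bnb_binary = BernoulliNB().fit(X_binary, y)
same = np.allclose(bnb.predict_proba(X_counts), bnb_binary.predict_proba(X_binary))
print(same)
```

The multinomial model sees that word 0 occurred five times in class 0, while the Bernoulli model only sees that it appeared in two documents.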

### Q3. How does Bernoulli Naive Bayes handle missing values?

**ANS:** Bernoulli Naive Bayes has no built-in mechanism for missing values, so they must be handled in preprocessing before training, for example by:

- Imputing missing values with `zeros` (treating a missing value as feature absence).
- Replacing them with the `mode` (most frequent value) of the feature; the mean is generally not a valid binary value unless it is thresholded afterwards.
- Using `feature engineering` techniques, such as adding a binary indicator column that marks which values were missing.
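One way to apply the mode-imputation option above is to chain an imputer and the classifier in a scikit-learn pipeline. This is a minimal sketch on made-up binary data; the choice of `SimpleImputer` with `strategy="most_frequent"` is an assumption for illustration, not the only reasonable approach.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with missing entries marked as NaN
X = np.array([[1, 0, np.nan],
              [1, np.nan, 0],
              [0, 1, 1],
              [np.nan, 1, 1]])
y = np.array([0, 0, 1, 1])

# Show what the imputer does on its own: NaN -> most frequent value per column
X_imputed = SimpleImputer(strategy="most_frequent").fit_transform(X)
print(X_imputed)

# Impute first, then fit the classifier on the completed data
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
predictions = model.predict(X)
print(predictions)
```

Keeping the imputer inside the pipeline means that, during cross-validation, the fill values are learned from the training folds only.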

### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

**ANS:** `Yes`, Gaussian Naive Bayes can be used for multi-class classification. For each class, it models every feature as a Gaussian distribution with a class-specific mean and variance, combines these likelihoods with the class prior to obtain a posterior probability for each class, and assigns the class with the highest posterior.
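A short sketch of the multi-class case, using the three-class Iris dataset bundled with scikit-learn (chosen here only as a convenient example; it is not part of the assignment):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has 3 classes and 4 continuous features
X, y = load_iris(return_X_y=True)
clf = GaussianNB().fit(X, y)

print(clf.classes_)      # the class labels seen during training
print(clf.theta_.shape)  # one mean per (class, feature) pair

# Posterior probabilities over all classes for the first five samples
proba = clf.predict_proba(X[:5])
print(proba.round(3))

# The predicted class is the one with the highest posterior
print(clf.classes_[proba.argmax(axis=1)])
```

No special configuration is needed: the classifier fits one set of per-feature Gaussians for every class it finds in `y`.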

### Q5. Assignment:

- Data preparation:
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

- Implementation:
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

- Results:
Report the following performance metrics for each classifier:
1. Accuracy
2. Precision
3. Recall
4. F1 score

- Discussion:
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

- Conclusion:
Summarise your findings and provide some suggestions for future work.

**ANS:**

```python
# Import libraries
import pandas as pd
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load dataset (the file has no header row; the last column is the label)
data = pd.read_csv("spambase.data", header=None)  # Adjust path as necessary
X = data.iloc[:, :-1]  # Features
y = data.iloc[:, -1]   # Labels (1 = spam, 0 = not spam)

# Metrics to report (built-in scikit-learn scorer names)
scoring_metrics = ["accuracy", "precision", "recall", "f1"]

# Define classifiers with default hyperparameters
classifiers = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB()
}

# Cross-validation setup: 10 stratified folds, shuffled with a fixed seed
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Evaluate each classifier; cross_validate computes all metrics in one pass
results = {}
for clf_name, clf in classifiers.items():
    cv_results = cross_validate(clf, X, y, cv=kfold, scoring=scoring_metrics)
    results[clf_name] = {
        metric: cv_results[f"test_{metric}"].mean() for metric in scoring_metrics
    }

# Display results
for clf_name, scores in results.items():
    print(f"\n{clf_name} Results:")
    for metric_name, score in scores.items():
        print(f"{metric_name.capitalize()}: {score:.4f}")
```


```text
BernoulliNB Results:
Accuracy: 0.8857
Precision: 0.8855
Recall: 0.8158
F1: 0.8490

MultinomialNB Results:
Accuracy: 0.7903
Precision: 0.7407
Recall: 0.7215
F1: 0.7306

GaussianNB Results:
Accuracy: 0.8203
Precision: 0.6989
Recall: 0.9575
F1: 0.8078
```


**Discussion**:

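- `Best-performing model`: Bernoulli Naive Bayes, with the highest accuracy (0.8857), precision (0.8855), and F1 score (0.8490). By default `BernoulliNB` binarizes each feature at 0.0, so it effectively models whether a word or character appears in an email, which suits Spambase's sparse and heavily skewed frequency features. Gaussian Naive Bayes achieved the highest recall (0.9575) but the lowest precision (0.6989), meaning it flags many legitimate emails as spam. Multinomial Naive Bayes had the lowest accuracy (0.7903) and F1 score (0.7306), probably because the features are percentages and capital-letter run lengths on very different scales rather than raw word counts.

- `Limitations`: Naive Bayes assumes the features are conditionally independent given the class, which can limit its performance when features (words) in text data are correlated. Each variant also assumes a particular feature distribution, and the results above show how much performance depends on how well that assumption fits the data.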
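The scores reported above can be tabulated to check which classifier leads on each metric. This sketch simply copies the printed numbers into a DataFrame (the variable names are illustrative) and does not re-run the experiment.

```python
import pandas as pd

# Scores copied from the cross-validation output reported in this assignment
reported = pd.DataFrame(
    {
        "accuracy": [0.8857, 0.7903, 0.8203],
        "precision": [0.8855, 0.7407, 0.6989],
        "recall": [0.8158, 0.7215, 0.9575],
        "f1": [0.8490, 0.7306, 0.8078],
    },
    index=["BernoulliNB", "MultinomialNB", "GaussianNB"],
)

# For each metric (column), find the classifier (row label) with the top score
best_per_metric = reported.idxmax()
print(best_per_metric)
```

BernoulliNB leads on accuracy, precision, and F1, while GaussianNB leads only on recall.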
  

**Conclusion**:

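- `Summary`: Bernoulli Naive Bayes gave the best overall results on this dataset (accuracy 0.8857, F1 0.8490), Gaussian Naive Bayes gave the highest recall at the cost of precision, and Multinomial Naive Bayes performed worst.
- `Suggestions for future work`: Consider experimenting with feature engineering (e.g., n-grams or TF-IDF, which would require the raw email text because Spambase provides only precomputed features), tuning the hyperparameters that were left at their defaults, or applying other algorithms like SVM or Decision Trees if Naive Bayes underperforms.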