## Q1. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

We are given:
- \( P(Insurance) = 0.7 \)
- \( P(Smoker | Insurance) = 0.4 \)

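The quantity asked for, \( P(Smoker | Insurance) \), is exactly the conditional probability we are given, so no further calculation is required. As a consistency check, the definition of conditional probability is:

$$ P(Smoker | Insurance) = \frac{P(Smoker \cap Insurance)}{P(Insurance)} $$

The joint probability is \( P(Smoker \cap Insurance) = P(Smoker | Insurance) \times P(Insurance) = 0.4 \times 0.7 = 0.28 \), so:

$$ P(Smoker | Insurance) = \frac{0.28}{0.7} = 0.4 $$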

Thus, the probability that an employee is a smoker given that they use the health insurance plan is **0.4**.
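
The arithmetic can be checked with a few lines of Python. This is only a sanity check of the numbers given in the question; the variable names are chosen here for illustration.

```python
# Values given in the question
p_insurance = 0.7                # P(Insurance)
p_smoker_given_insurance = 0.4   # P(Smoker | Insurance)

# Joint probability from the multiplication rule
p_joint = p_smoker_given_insurance * p_insurance

# Dividing by P(Insurance) recovers the conditional probability
p_check = p_joint / p_insurance

print(f"P(Smoker and Insurance) = {p_joint:.2f}")
print(f"P(Smoker | Insurance)   = {p_check:.2f}")
```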

---

## Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

- **Bernoulli Naive Bayes** is used for binary features (0 or 1), indicating the presence or absence of a feature.
- **Multinomial Naive Bayes** is used for count-based features, such as word frequencies in text classification.
- Bernoulli NB is more suitable for binary text representations (e.g., word presence in spam filtering), whereas Multinomial NB works better for frequency-based features.
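
The difference can be seen on a tiny example. The sketch below uses a made-up document-term count matrix (the words and counts are hypothetical). With its default `binarize=0.0`, `BernoulliNB` only looks at whether a count is above zero, while `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy counts; columns are the words ["free", "meeting", "offer"]
X_counts = np.array([
    [3, 0, 2],
    [5, 0, 1],
    [0, 2, 0],
    [0, 4, 1],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = not spam

# BernoulliNB binarizes at 0 by default, so counts and presence flags give the same model
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit((X_counts > 0).astype(int), y)
same_model = np.allclose(bnb_counts.feature_log_prob_, bnb_binary.feature_log_prob_)

mnb = MultinomialNB().fit(X_counts, y)

# Two new documents that differ only in how often "free" appears
doc_once = np.array([[1, 0, 0]])
doc_many = np.array([[10, 0, 0]])

bnb_once = bnb_counts.predict_proba(doc_once)[0, 1]
bnb_many = bnb_counts.predict_proba(doc_many)[0, 1]
mnb_once = mnb.predict_proba(doc_once)[0, 1]
mnb_many = mnb.predict_proba(doc_many)[0, 1]

print("BernoulliNB   P(spam):", bnb_once, bnb_many)   # identical: only presence matters
print("MultinomialNB P(spam):", mnb_once, mnb_many)   # grows with the count
```

In this toy setup the Bernoulli model gives the same spam probability for both documents, whereas the Multinomial model becomes more confident as the word is repeated.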

---

## Q3. How does Bernoulli Naive Bayes handle missing values?

- Bernoulli Naive Bayes does not inherently handle missing values, so they must be dealt with in a preprocessing step before fitting.
- Missing values can be treated as absent (0) when a missing entry plausibly means the feature did not occur.
- Alternatively, they can be imputed with the most common value (mode) of each feature. Mean imputation is a poor fit for binary features because it produces fractional values.
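
As a rough sketch of the "most common value" approach, the example below chains scikit-learn's `SimpleImputer` with `BernoulliNB` in a pipeline. The small binary matrix is invented for illustration, and the `try`/`except` only demonstrates that the classifier rejects `NaN` input when used on its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries
X = np.array([
    [1, 0, np.nan],
    [1, np.nan, 1],
    [0, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
    [np.nan, 1, 0],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Fitting directly on data containing NaN fails
try:
    BernoulliNB().fit(X, y)
    raised = False
except ValueError:
    raised = True

# Fill each missing entry with the most frequent value of its column, then classify
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
predictions = model.predict(X)

print("Direct fit raised an error:", raised)
print("Predictions after imputation:", predictions)
```

Putting the imputer inside the pipeline means it is re-fit on the training portion of each fold if the model is later cross-validated.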

---

## Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes supports multi-class classification. It estimates a prior and per-feature Gaussian parameters for every class, uses Bayes' theorem to compute the posterior probability of each class, and selects the class with the highest posterior.
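
A minimal illustration, assuming the three-class Iris dataset bundled with scikit-learn as a stand-in (it is not part of this assignment):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes, so this is a multi-class problem
X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# One posterior probability per class for each sample
posteriors = model.predict_proba(X[:5])
predicted = model.predict(X[:5])

# The predicted label is the class with the highest posterior
argmax_labels = model.classes_[np.argmax(posteriors, axis=1)]

print("Classes:", model.classes_)
print("Posterior shape:", posteriors.shape)
print("Predictions match argmax of posteriors:", np.array_equal(predicted, argmax_labels))
```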

---

## Q5. Assignment:

### Data Preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository:
[Spambase Dataset](https://archive.ics.uci.edu/ml/datasets/Spambase)

This dataset consists of email messages where the goal is to predict whether an email is spam or not based on various features.

---

### Implementation:

Implement the following classifiers using the **scikit-learn** library in Python:
- Bernoulli Naive Bayes
- Multinomial Naive Bayes
- Gaussian Naive Bayes

Perform **10-fold cross-validation** to evaluate each classifier's performance. Use the default hyperparameters for each model.

---

### Results:

For each classifier, report the following performance metrics:
- **Accuracy**
- **Precision**
- **Recall**
- **F1 Score**

---

### Discussion:

- Compare the performance of the three Naive Bayes classifiers.
- Which classifier performed the best? Why do you think that is the case?
- Identify any limitations of Naive Bayes that you observed during experimentation.

---

### Conclusion:

Summarize your findings and suggest potential future improvements, such as:
- Using feature selection techniques.
- Experimenting with different smoothing parameters.
- Applying ensemble methods to improve classification performance.

This dataset provides a practical example of Naive Bayes in a real-world problem, highlighting its strengths and weaknesses in text classification.


In [1]:
import pandas as pd
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.preprocessing import StandardScaler

# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
data = pd.read_csv(url, header=None)

# Split dataset into features and labels
X = data.iloc[:, :-1]
y = data.iloc[:, -1]

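# Standardize features for GaussianNB only (the other two models use the raw, non-negative values).
# Note: the scaler is fit on the full dataset before cross-validation, so fold statistics leak slightly.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)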

# Define models
models = {
    "BernoulliNB": BernoulliNB(),
    "MultinomialNB": MultinomialNB(),
    "GaussianNB": GaussianNB()
}

# Cross-validation setup
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)

# Evaluate models: mean score over the same 10 folds for each metric
scoring = {"Accuracy": "accuracy", "Precision": "precision", "Recall": "recall", "F1 Score": "f1"}
results = {}
for name, model in models.items():
    # GaussianNB gets the standardized features; the other models get the raw features
    features = X_scaled if name == "GaussianNB" else X
    results[name] = {
        label: cross_val_score(model, features, y, cv=kf, scoring=metric).mean()
        for label, metric in scoring.items()
    }

# Display results
results_df = pd.DataFrame(results).T
print(results_df)


               Accuracy  Precision    Recall  F1 Score
BernoulliNB    0.885676   0.885503  0.815785  0.849037
MultinomialNB  0.790261   0.740695  0.721480  0.730591
GaussianNB     0.816560   0.695104  0.954226  0.804103


## Discussion

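From the results, **Bernoulli Naive Bayes** performs best on this dataset, with the highest accuracy (0.886), precision (0.886), and F1 score (0.849). **Gaussian Naive Bayes** has the highest recall (0.954) but the lowest precision (0.695), and **Multinomial Naive Bayes** has the lowest accuracy (0.790) and F1 score (0.731).

A plausible explanation for the Bernoulli result is that `BernoulliNB` binarizes every feature at 0 by default, so it only uses whether a word or character appears in an email. That presence signal is unaffected by the very different scales of the Spambase features.

**Multinomial Naive Bayes** is designed for counts, but the Spambase features are not raw counts: the word and character features are percentages, and the three capital-run-length features take much larger values. This mismatch may let the run-length features dominate the likelihood, which would explain the weaker scores.

**Gaussian Naive Bayes** assumes each feature is normally distributed within a class. The Spambase features are continuous but heavily skewed with many zeros, so that assumption fits poorly, and standardizing the data does not change the shape of the distributions. In these results the model catches most spam (high recall) at the cost of many false positives (low precision).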

### Limitations of Naive Bayes:
- **Feature independence assumption**: Naive Bayes assumes that features are independent, which is rarely true in real-world data. This can limit its effectiveness compared to more complex models.
- **Sensitivity to feature distribution**: Gaussian Naive Bayes assumes a normal distribution of features, which may not always be the case.
- **Not ideal for highly correlated features**: If features are highly correlated, Naive Bayes may struggle to make accurate predictions.
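
The effect of correlated features can be illustrated with a small synthetic sketch (the data below is randomly generated and is not the Spambase dataset). Copying every feature three times adds no new information, yet Naive Bayes counts the same evidence three times and becomes more confident in its predictions.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Balanced synthetic two-class data: class 1 is shifted by one unit in both features
n = 200
y = np.repeat([0, 1], n // 2)
X = rng.normal(loc=y[:, None] * 1.0, scale=1.0, size=(n, 2))

# Perfectly correlated copies of the same two features
X_dup = np.hstack([X, X, X])

# Confidence = posterior probability of the predicted class
conf_plain = GaussianNB().fit(X, y).predict_proba(X).max(axis=1)
conf_dup = GaussianNB().fit(X_dup, y).predict_proba(X_dup).max(axis=1)

print(f"Mean confidence, original features:   {conf_plain.mean():.3f}")
print(f"Mean confidence, duplicated features: {conf_dup.mean():.3f}")
```

The predicted labels stay the same in this balanced setup, but the probabilities move toward 0 or 1, which is why Naive Bayes posteriors are often poorly calibrated when features are correlated.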

## Conclusion

In summary, **Bernoulli Naive Bayes** gave the best overall results on this problem (accuracy 0.886, F1 score 0.849), which suggests that the presence or absence of a word carries most of the useful signal in Spambase. **Multinomial Naive Bayes** is intended for count data and scored lowest on accuracy and F1, while **Gaussian Naive Bayes** achieved the highest recall but the lowest precision.

For future improvements, potential enhancements could include:

- **Feature Engineering**: Experimenting with different feature extraction techniques, such as **TF-IDF**, could improve classification accuracy.
- **Hybrid Models**: Combining Naive Bayes with other classifiers (e.g., **SVM** or **Random Forest**) could address some of its limitations.
- **Hyperparameter Tuning**: Adjusting parameters such as smoothing factors could optimize model performance.
- **Handling Correlated Features**: Using dimensionality reduction techniques like **PCA** could help improve results.
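
As one possible starting point for the hyperparameter tuning idea, the sketch below searches over the `alpha` smoothing parameter of `BernoulliNB` with `GridSearchCV`. It runs on synthetic data from `make_classification` so that it is self-contained; the grid values are arbitrary choices for illustration, and the same pattern could be applied to the Spambase features.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data (not Spambase)
X, y = make_classification(n_samples=300, n_features=20, n_informative=8, random_state=0)

# Candidate smoothing values; chosen arbitrarily for this sketch
param_grid = {"alpha": [0.01, 0.1, 0.5, 1.0, 2.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Pick the alpha with the best mean cross-validated F1 score
search = GridSearchCV(BernoulliNB(), param_grid, cv=cv, scoring="f1")
search.fit(X, y)

best_alpha = search.best_params_["alpha"]
print("Best alpha:", best_alpha)
print(f"Best mean F1: {search.best_score_:.3f}")
```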

Overall, **Naive Bayes remains a strong baseline model** for text classification tasks due to its **simplicity, efficiency, and interpretability**.
