## Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

The quantity asked for, \( P(\text{Smoker} | \text{Uses insurance}) \), is stated directly in the problem: 40% of the employees who use the plan are smokers.

Given:

- \( P(\text{Uses insurance}) = 0.70 \) (probability that an employee uses the insurance plan)
- \( P(\text{Smoker | Uses insurance}) = 0.40 \) (probability that a user of the insurance plan is a smoker)

Therefore:

\[ P(\text{Smoker} | \text{Uses insurance}) = 0.40 \]

Bayes' theorem is not needed here. It would express the same quantity as

\[ P(\text{Smoker} | \text{Uses insurance}) = \frac{P(\text{Uses insurance | Smoker}) \cdot P(\text{Smoker})}{P(\text{Uses insurance})} \]

but neither \( P(\text{Uses insurance | Smoker}) \) nor \( P(\text{Smoker}) \) is given, so this route cannot be evaluated, and it is unnecessary because the conditional probability is already known.

As a consistency check, the definition of conditional probability gives the joint probability:

\[ P(\text{Smoker} \cap \text{Uses insurance}) = P(\text{Smoker | Uses insurance}) \cdot P(\text{Uses insurance}) = 0.40 \cdot 0.70 = 0.28 \]

So 28% of all employees both use the plan and smoke, and \( 0.28 / 0.70 = 0.40 \) recovers the answer. The overall smoking rate \( P(\text{Smoker}) \) cannot be determined from the given data, because the smoking rate among employees who do not use the plan is unknown.
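The conditional probability asked for in Q1 is the 40% figure given in the problem statement. A minimal counting sketch makes this concrete, assuming a hypothetical workforce of 1,000 employees (the head count and variable names are illustrative and not part of the problem):

```python
from fractions import Fraction

# Values given in the problem statement
p_uses = Fraction(70, 100)               # P(Uses insurance)
p_smoker_given_uses = Fraction(40, 100)  # P(Smoker | Uses insurance)

# Joint probability from the definition of conditional probability
p_smoker_and_uses = p_smoker_given_uses * p_uses

# Dividing the joint probability by P(Uses insurance) recovers the conditional
recovered = p_smoker_and_uses / p_uses

# The same reasoning with head counts (1,000 employees is an assumption)
employees = 1000
users = int(p_uses * employees)                   # 700 plan users
smoking_users = int(p_smoker_given_uses * users)  # 280 smokers among them

print(float(p_smoker_and_uses))  # 0.28
print(float(recovered))          # 0.4
print(smoking_users / users)     # 0.4
```

The overall smoking rate does not appear anywhere in the calculation, which is why it is not needed to answer the question.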

## Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?
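
Bernoulli Naive Bayes models each feature as a binary indicator (present or absent) and also accounts for the absence of a feature, whereas Multinomial Naive Bayes models how often each feature occurs, so it suits count data such as word frequencies. The following is a minimal sketch of that difference using scikit-learn. The tiny word-count matrix and labels are hypothetical and chosen only for illustration:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are words
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 0],
])
y = np.array([1, 1, 0, 0])

# Multinomial NB uses the counts themselves
multinomial = MultinomialNB().fit(X_counts, y)

# Bernoulli NB binarizes first: with binarize=0.0, any count > 0 becomes 1
bernoulli = BernoulliNB(binarize=0.0).fit(X_counts, y)

# Fitting on an explicitly binarized matrix gives the same Bernoulli model,
# which shows that the count magnitudes are discarded
X_binary = (X_counts > 0).astype(int)
bernoulli_on_binary = BernoulliNB(binarize=None).fit(X_binary, y)
same_model = np.allclose(
    bernoulli.feature_log_prob_, bernoulli_on_binary.feature_log_prob_
)

new_doc = np.array([[5, 0, 0]])
print(same_model)
print(multinomial.predict(new_doc), bernoulli.predict(new_doc))
```

As a rule of thumb, Bernoulli Naive Bayes fits data where only presence matters, while Multinomial Naive Bayes fits data where repeated occurrences carry information.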

In [1]:
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

# Generate synthetic data
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create an SVM classifier with a polynomial kernel
svm_classifier = SVC(kernel='poly', degree=3)  # 'poly' indicates polynomial kernel, degree=3 specifies the degree of the polynomial

# Train the classifier on the training data
svm_classifier.fit(X_train_scaled, y_train)

# Make predictions on the testing data
y_pred = svm_classifier.predict(X_test_scaled)

# Evaluate the accuracy of the classifier
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.825


In this code:

- We first import the necessary libraries from Scikit-learn.
- We generate synthetic data using the make_classification function. Replace this with your actual dataset.
- Next, we split the data into training and testing sets using train_test_split.
- We standardize the features using StandardScaler.
- Then, we create an SVM classifier with a polynomial kernel by specifying kernel='poly' and, optionally, the degree of the polynomial (e.g., degree=3).
- We train the classifier on the standardized training data using the fit method.
- We make predictions on the standardized testing data using the predict method.
- Finally, we evaluate the accuracy of the classifier using the accuracy_score function from Scikit-learn.

Adjust the parameters of the SVC constructor, such as degree, C, and gamma, as needed for your specific problem.

## Q3. How does Bernoulli Naive Bayes handle missing values?

The Bernoulli Naive Bayes model has no built-in mechanism for missing values, so their handling depends on the specific implementation or library being used. For example, scikit-learn's BernoulliNB rejects inputs that contain NaN, so missing values must be dealt with before fitting. In general, missing values can be approached in the following ways:

1. **Ignoring Missing Values**:
   - Some implementations of Bernoulli Naive Bayes simply ignore missing values during training and prediction.
   - In this approach, a missing value is treated as if it was never observed: because the features are assumed to be conditionally independent given the class, the corresponding feature's term is left out of the product of probabilities for that instance.

2. **Imputation**:
   - Another approach is to impute missing values before training the Bernoulli Naive Bayes model.
   - Missing values can be replaced with a default value (e.g., 0 or 1) or with the most frequent value (mode) of the feature. Mean or median imputation is less suitable here, because the features are binary and the imputed value may be neither 0 nor 1.
   - After imputation, the dataset can be used to train the Bernoulli Naive Bayes model as usual.

3. **Model Extension**:
   - Some implementations of Bernoulli Naive Bayes extend the model to explicitly handle missing values.
   - This may involve introducing a separate category or state to represent missing values in the feature space.
   - During training, the model learns the probability of observing missing values for each feature.
   - During prediction, the model accounts for missing values by incorporating the learned probabilities into the classification process.

4. **Custom Handling**:
   - Depending on the specific requirements of the problem, custom handling of missing values may be implemented.
   - This could involve domain-specific strategies for handling missing data or incorporating external knowledge about the missingness mechanism.

The choice of how to handle missing values in Bernoulli Naive Bayes depends on factors such as the characteristics of the dataset, the amount of missing data, and the specific goals of the analysis. It is important to consider the implications of each approach and to evaluate its impact on the performance of the classifier. Additionally, an imputation step should be fitted on the training data only and then applied unchanged to the testing data, so that no information leaks from the test set and both sets are treated consistently.
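
The imputation approach can be sketched with scikit-learn as follows. This is a minimal example under stated assumptions: the small binary dataset is made up, and most-frequent-value imputation is just one reasonable choice for binary features, not the only option:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with some missing entries (np.nan)
X = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 1, 1],
    [np.nan, 1, 1],
    [1, 0, 0],
    [0, 1, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Fitting directly on data with NaN is rejected by scikit-learn
try:
    BernoulliNB().fit(X, y)
    raw_fit_failed = False
except ValueError:
    raw_fit_failed = True

# Impute each feature with its most frequent value, then fit Bernoulli NB.
# The pipeline learns the imputation values during fit and reuses them
# unchanged at prediction time.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

X_new = np.array([[1, 0, np.nan], [0, 1, 1]])
predictions = model.predict(X_new)
print(raw_fit_failed, predictions)
```

Wrapping the imputer and the classifier in one pipeline keeps the preprocessing consistent between training and prediction.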

## Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification tasks. The term "Gaussian" refers to the distribution assumed for the features, not to the number of classes, so the algorithm is not limited to binary classification: it estimates one set of parameters per class, for any number of classes.

In Gaussian Naive Bayes, each feature is assumed to follow a Gaussian (normal) distribution within each class. When applied to multi-class classification, the algorithm calculates the likelihood of observing the features for each class and then applies Bayes' theorem to determine the most probable class for a given instance.

Here's how Gaussian Naive Bayes works for multi-class classification:

1. **Model Training**:
   - For each class, Gaussian Naive Bayes estimates the mean and variance of each feature using the training data belonging to that class.
   - It assumes that the features are independent given the class label, which means that the covariance matrix is diagonal (i.e., off-diagonal elements are zero).

2. **Prediction**:
   - Given a new instance with feature values \( x_1, x_2, \ldots, x_n \), Gaussian Naive Bayes calculates the conditional probability of each class given the observed feature values.
   - It computes the likelihood of observing the feature values given each class using the Gaussian probability density function.
   - The class with the highest posterior probability, computed using Bayes' theorem, is predicted as the label for the new instance.

3. **Decision Rule**:
   - The decision rule for Gaussian Naive Bayes is to choose the class with the highest posterior probability:
     \[ \text{predicted class} = \arg\max_{c \in \text{classes}} P(C=c | x_1, x_2, \ldots, x_n) \]

4. **Handling Continuous Features**:
   - Gaussian Naive Bayes assumes that the features are continuous and normally distributed within each class. If the features are not normally distributed, preprocessing techniques such as feature transformation or discretization may be applied.

In summary, Gaussian Naive Bayes can handle multi-class classification problems by extending the basic Naive Bayes algorithm to estimate Gaussian distributions for each class's feature values. It is a simple and efficient algorithm suitable for a wide range of classification tasks.
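
The training, prediction, and decision-rule steps above can be sketched from scratch with NumPy for three classes. This is an illustrative sketch, not a reference implementation: the function names, the synthetic two-feature data, and the small variance floor of 1e-9 are all assumptions made for the example:

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small floor keeps the variances strictly positive
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(model, X):
    """Return the class with the highest log-posterior for each row of X."""
    classes, priors, means, variances = model
    diff = X[:, None, :] - means[None, :, :]
    # Sum of per-feature Gaussian log-densities, shape (n_samples, n_classes)
    log_likelihood = -0.5 * (
        np.log(2 * np.pi * variances)[None, :, :] + diff**2 / variances[None, :, :]
    ).sum(axis=2)
    log_posterior = np.log(priors)[None, :] + log_likelihood
    return classes[np.argmax(log_posterior, axis=1)]

# Synthetic three-class data: one well-separated cluster per class
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers])
y = np.repeat([0, 1, 2], 50)

model = fit_gaussian_nb(X, y)
pred = predict_gaussian_nb(model, X)
accuracy = np.mean(pred == y)
print(accuracy)
```

Nothing in the procedure depends on there being exactly two classes: the argmax simply runs over however many classes appear in the training labels.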

## Q5. Assignment:

**Data preparation:**
Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

**Implementation:**
Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

**Results:**
Report the following performance metrics for each classifier:

- Accuracy
- Precision
- Recall
- F1 score

**Discussion:**
Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

**Conclusion:**
Summarise your findings and provide some suggestions for future work.

In [3]:
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB

# Load the Spambase dataset from OpenML
data = fetch_openml(name='spambase', version=1)

X = data.data  # Features
# The OpenML target holds the class labels as the strings '0' and '1'.
# Cast to integers so that the binary precision/recall/F1 scorers,
# whose default positive label is 1, can find the spam class.
y = data.target.astype(int)  # Target variable (1 = spam, 0 = not spam)

# Define classifiers (default hyperparameters)
classifiers = {
    'Bernoulli Naive Bayes': BernoulliNB(),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Gaussian Naive Bayes': GaussianNB()
}

metrics = ['accuracy', 'precision', 'recall', 'f1']

# Evaluate each classifier using 10-fold cross-validation,
# computing all four metrics in a single pass per classifier
for name, classifier in classifiers.items():
    scores = cross_validate(classifier, X, y, cv=10, scoring=metrics)

    # Print performance metrics averaged over the 10 folds
    print(f'Classifier: {name}')
    print(f"Accuracy: {np.mean(scores['test_accuracy'])}")
    print(f"Precision: {np.mean(scores['test_precision'])}")
    print(f"Recall: {np.mean(scores['test_recall'])}")
    print(f"F1 score: {np.mean(scores['test_f1'])}")
    print()


Accuracy values from the recorded run are shown below. In that run the target was not cast to integers, and precision, recall, and F1 all came out as nan (each with a truncated traceback ending inside sklearn/metrics/_classification.py), most likely because the scorers' default pos_label=1 did not match the string labels '0' and '1'.

Classifier: Bernoulli Naive Bayes
Accuracy: 0.8839380364047911

Classifier: Multinomial Naive Bayes
Accuracy: 0.7863496180326323

Classifier: Gaussian Naive Bayes
Accuracy: 0.8217730830896915

In this code:

- We load the Spambase dataset using fetch_openml from scikit-learn.
- We define three Naive Bayes classifiers: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes.
- We use 10-fold cross-validation to evaluate the performance of each classifier on the dataset.
- For each classifier, we calculate and print the accuracy, precision, recall, and F1 score averaged over the 10 folds.

In the recorded run, Bernoulli Naive Bayes reached the highest mean accuracy (about 0.884), followed by Gaussian Naive Bayes (about 0.822) and Multinomial Naive Bayes (about 0.786). These metrics can then be analyzed to determine which variant of Naive Bayes performed the best and to discuss any observed limitations of Naive Bayes classifiers.
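
The binary precision, recall, and F1 scorers in scikit-learn treat the label 1 as the positive class by default, so string labels such as '0' and '1' need either a cast to integers or an explicit pos_label. A minimal sketch on a hypothetical four-message example (the labels and predictions are made up for illustration):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical string labels: '1' = spam, '0' = not spam
y_true = ['0', '1', '1', '0']
y_pred = ['0', '1', '0', '0']

# With string labels, the default positive label (the integer 1) is not
# among the labels, so the default call is expected to raise ValueError
try:
    precision_score(y_true, y_pred)
    default_call_worked = True
except ValueError:
    default_call_worked = False

# Naming the positive class explicitly resolves the mismatch
precision = precision_score(y_true, y_pred, pos_label='1')
recall = recall_score(y_true, y_pred, pos_label='1')
f1 = f1_score(y_true, y_pred, pos_label='1')

print(default_call_worked)
print(precision, recall, f1)
```

For cross-validation, the same idea can be applied by building a scorer with make_scorer(precision_score, pos_label='1') and passing it as the scoring argument, as an alternative to converting the target to integers.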






