Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?

Let S be the event that an employee is a smoker, and H be the event that an employee uses the health insurance plan. The survey gives P(H) = 0.7 (since 70% of employees use the plan) and P(S|H) = 0.4 (since 40% of the employees who use the plan are smokers). The quantity asked for, P(S|H), is therefore stated directly in the question, and Bayes' theorem is not needed to obtain it.

Note that 0.4 is the conditional probability P(S|H), not the marginal probability P(S). The overall proportion of smokers among all employees is not given, so P(H|S) cannot be computed from this information. What the data do determine is the joint probability that an employee both smokes and uses the plan:

P(S ∩ H) = P(S|H) * P(H) = 0.4 * 0.7 = 0.28

As a consistency check, the definition of conditional probability recovers the same value:

P(S|H) = P(S ∩ H) / P(H) = 0.28 / 0.7 = 0.4

Therefore, the probability of an employee being a smoker given that they use the health insurance plan is 0.4.
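
The arithmetic for Q1 can be checked in a few lines of Python. This is only a minimal sketch: the variable names are illustrative, and it uses nothing beyond the two values stated in the question, P(H) and P(S|H).

```python
import math

# Values stated in the question
p_h = 0.7          # P(H): employee uses the health insurance plan
p_s_given_h = 0.4  # P(S|H): employee is a smoker, given plan use

# Multiplication rule: P(S ∩ H) = P(S|H) * P(H)
p_s_and_h = p_s_given_h * p_h

# Definition of conditional probability: P(S|H) = P(S ∩ H) / P(H)
recovered = p_s_and_h / p_h

print(f"P(S and H) = {p_s_and_h:.2f}")
print(f"P(S|H)     = {recovered:.2f}")
```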

Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

Bernoulli Naive Bayes and Multinomial Naive Bayes are both variants of the Naive Bayes algorithm used for text classification. The main difference between them is in the way they represent the text data.

Bernoulli Naive Bayes assumes that the text data is binary, i.e., each feature (word or token) is either present or absent in the document. It models the probability of each feature given each class using the Bernoulli distribution, which is a discrete distribution with two possible outcomes (0 or 1). This means that Bernoulli Naive Bayes only considers whether a feature is present or not, and ignores its frequency in the document.

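On the other hand, Multinomial Naive Bayes assumes that the text data is represented as word counts, i.e., each feature is the frequency of a word or token in the document. It models the document as a sequence of draws from a class-specific Multinomial distribution over the vocabulary, so each feature value is a non-negative integer count (0, 1, 2, ...) and the counts sum to the document length. This means that Multinomial Naive Bayes considers how often each feature occurs, not just whether it occurs. A second difference follows from this: in Bernoulli Naive Bayes the absence of a word contributes to the likelihood (through one minus its presence probability), whereas in Multinomial Naive Bayes a word with a count of zero contributes nothing.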
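
The difference can be illustrated with a small plain-Python sketch. The vocabulary, the per-class probabilities, and the two helper functions below are hypothetical and chosen only for illustration; they are not taken from any dataset or library.

```python
import math

# Hypothetical toy vocabulary and parameters for a single class
vocab = ["free", "money", "meeting"]
p_present = [0.8, 0.6, 0.1]  # Bernoulli: P(word appears in a document | class)
p_token = [0.5, 0.4, 0.1]    # Multinomial: P(a token is this word | class), sums to 1

def bernoulli_log_likelihood(counts, p_present):
    """Uses presence/absence only; an absent word contributes log(1 - p)."""
    total = 0.0
    for c, p in zip(counts, p_present):
        total += math.log(p) if c > 0 else math.log(1.0 - p)
    return total

def multinomial_log_likelihood(counts, p_token):
    """Uses counts; a word with count 0 contributes nothing.
    The multinomial coefficient is omitted because it is identical for every class."""
    return sum(c * math.log(p) for c, p in zip(counts, p_token))

doc_a = [3, 0, 1]  # "free" appears three times
doc_b = [1, 0, 1]  # "free" appears once

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b)]:
    print(name,
          round(bernoulli_log_likelihood(doc, p_present), 3),
          round(multinomial_log_likelihood(doc, p_token), 3))
```

The two documents differ only in how often "free" occurs, so the Bernoulli score is the same for both while the Multinomial score changes.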

Q3. How does Bernoulli Naive Bayes handle missing values?

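Bernoulli Naive Bayes has no built-in mechanism for missing values; it expects every feature to be 0 or 1. A common convention in text classification is to treat a missing feature as absent, i.e., to encode it as 0 before training and prediction. Note that an absent feature is not ignored: Bernoulli Naive Bayes explicitly uses the probability that the feature does not occur in the class, so encoding a missing value as 0 counts as evidence of absence. If missing really means "unknown", the alternatives are to impute the value during preprocessing (for example with the most frequent value) or to leave that feature out of the likelihood for the affected document. Implementations differ here; scikit-learn's BernoulliNB, for instance, does not accept NaN inputs, so missing values have to be handled before the data is passed to the classifier.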
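
One way to make the "missing means absent" convention explicit is to apply it while binarizing the data. The sketch below is plain Python; `to_binary` is a hypothetical helper written for this illustration, not a library function, and treating missing entries as 0 is a modelling assumption.

```python
import math

def to_binary(row, threshold=0.0):
    """Map raw feature values to 0/1 presence indicators.
    Missing entries (None or NaN) are encoded as 0, i.e. treated as absent."""
    out = []
    for v in row:
        missing = v is None or (isinstance(v, float) and math.isnan(v))
        out.append(0 if missing else int(v > threshold))
    return out

row = [0.64, None, 0.0, float("nan"), 1.2]
print(to_binary(row))  # [1, 0, 0, 0, 1]
```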

Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes. Gaussian Naive Bayes is inherently a multi-class classifier and needs no extension to handle more than two classes. The likelihood of each feature given a class is modeled as a Gaussian distribution whose mean and variance are estimated from the training samples of that class. With K classes, the model estimates one prior and one set of per-feature means and variances for each class, then uses Bayes' theorem to compute the posterior probability of every class given the feature vector. The class with the highest posterior probability is predicted as the output. This is a single direct comparison across all classes, not a "One-vs-All" ("One-vs-Rest") or "One-vs-One" scheme; those decomposition strategies are meant for classifiers that are inherently binary and are not required here. Gaussian Naive Bayes tends to work well for multi-class classification when the features are continuous and approximately Gaussian within each class.
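
The per-class estimation and the final argmax over posteriors can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the function names, the small variance floor `var_floor`, and the three-class toy data are all assumptions made for the example.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate a prior plus per-feature mean and variance for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        cols = list(zip(*rows))
        means = [sum(col) / n for col in cols]
        # var_floor avoids division by zero for constant features
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(cols, means)]
        model[label] = (n / len(X), means, variances)
    return model

def predict(model, x):
    """Return the class with the highest log-posterior (up to a shared constant)."""
    def log_posterior(prior, means, variances):
        total = math.log(prior)
        for v, m, s2 in zip(x, means, variances):
            total += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return total
    return max(model, key=lambda c: log_posterior(*model[c]))

# Hypothetical toy data: three classes, two continuous features
X = [[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9], [9.0, 0.9], [9.2, 1.1]]
y = ["a", "a", "b", "b", "c", "c"]

model = fit_gaussian_nb(X, y)
print(predict(model, [5.0, 5.0]))  # b
```

Nothing in `predict` depends on the number of classes: adding a fourth class only adds one more entry to `model`.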

Q5. Assignment:
    
Data preparation:

Download the "Spambase Data Set" from the UCI Machine Learning Repository
(https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is
to predict whether a message is spam or not based on several input features.

Implementation:

Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the
scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the
dataset. You should use the default hyperparameters for each classifier.

Results:

Report the following performance metrics for each classifier:
Accuracy
Precision
Recall
F1 score
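
For reference, the four metrics can all be computed from the counts in a binary confusion matrix, with spam as the positive class. The sketch below is plain Python; `classification_metrics` is a hypothetical helper and the counts passed to it are made up for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.
    tp/fp/fn/tn = true positives, false positives, false negatives, true negatives."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up counts: 80 spam caught, 10 false alarms, 20 spam missed, 90 ham kept
metrics = classification_metrics(tp=80, fp=10, fn=20, tn=90)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision penalises false alarms (legitimate mail flagged as spam), recall penalises missed spam, and F1 is the harmonic mean of the two.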

Discussion:

Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is
the case? Are there any limitations of Naive Bayes that you observed?

Conclusion:

Summarise your findings and provide some suggestions for future work.

In [1]:
import pandas as pd

In [2]:
spam_data = pd.read_csv('spambase.data',header=None)

In [3]:
column_names = ['word_freq_make', 'word_freq_address', 'word_freq_all', 'word_freq_3d', 'word_freq_our',
                'word_freq_over', 'word_freq_remove', 'word_freq_internet', 'word_freq_order', 'word_freq_mail',
                'word_freq_receive', 'word_freq_will', 'word_freq_people', 'word_freq_report', 'word_freq_addresses',
                'word_freq_free', 'word_freq_business', 'word_freq_email', 'word_freq_you', 'word_freq_credit',
                'word_freq_your', 'word_freq_font', 'word_freq_000', 'word_freq_money', 'word_freq_hp', 'word_freq_hpl',
                'word_freq_george', 'word_freq_650', 'word_freq_lab', 'word_freq_labs', 'word_freq_telnet',
                'word_freq_857', 'word_freq_data', 'word_freq_415', 'word_freq_85', 'word_freq_technology',
                'word_freq_1999', 'word_freq_parts', 'word_freq_pm', 'word_freq_direct', 'word_freq_cs',
                'word_freq_meeting', 'word_freq_original', 'word_freq_project', 'word_freq_re', 'word_freq_edu',
                'word_freq_table', 'word_freq_conference', 'char_freq_;', 'char_freq_(', 'char_freq_[', 'char_freq_!',
                'char_freq_$', 'char_freq_#', 'capital_run_length_average', 'capital_run_length_longest',
                'capital_run_length_total', 'is_spam']

In [4]:
spam_data.columns = column_names

In [5]:
spam_data.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,is_spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


Implementation:

In [6]:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_validate
import pandas as pd


# Split the dataset into features and target
X = spam_data.iloc[:, :-1]
y = spam_data.iloc[:, -1]

# Initialize the classifiers
bernoulli_nb = BernoulliNB()
multinomial_nb = MultinomialNB()
gaussian_nb = GaussianNB()

# Define the performance metrics to be evaluated
scoring = ['accuracy', 'precision', 'recall', 'f1']

# Perform 10-fold cross-validation and compute performance metrics for each classifier
for clf in [bernoulli_nb, multinomial_nb, gaussian_nb]:
    scores = cross_validate(clf, X, y, cv=10, scoring=scoring)
    print(f"{clf.__class__.__name__}:")
    print(f"  Accuracy: {scores['test_accuracy'].mean():.3f}")
    print(f"  Precision: {scores['test_precision'].mean():.3f}")
    print(f"  Recall: {scores['test_recall'].mean():.3f}")
    print(f"  F1 score: {scores['test_f1'].mean():.3f}")


BernoulliNB:
  Accuracy: 0.884
  Precision: 0.887
  Recall: 0.815
  F1 score: 0.848
MultinomialNB:
  Accuracy: 0.786
  Precision: 0.739
  Recall: 0.721
  F1 score: 0.728
GaussianNB:
  Accuracy: 0.822
  Precision: 0.710
  Recall: 0.957
  F1 score: 0.813


Discussion:

Based on the 10-fold cross-validation results above, the Bernoulli Naive Bayes classifier performed the best overall on the Spambase dataset. It achieved the highest accuracy (0.884), precision (0.887), and F1 score (0.848), with a recall of 0.815. The Gaussian Naive Bayes classifier came second on accuracy (0.822) and F1 score (0.813); it achieved the highest recall (0.957) but the lowest precision (0.710), meaning that it catches almost all spam at the cost of flagging many legitimate messages. The Multinomial Naive Bayes classifier was the weakest on accuracy (0.786), recall (0.721), and F1 score (0.728), with a precision of 0.739.

A plausible reason why Bernoulli Naive Bayes performed the best is that, with its default settings in scikit-learn (binarize=0.0), it converts every feature into an indicator of whether the value is greater than zero. For spam detection, whether a word or character occurs at all in a message appears to be a stronger and more robust signal than how often it occurs. Gaussian Naive Bayes assumes that each feature is normally distributed within each class; although the Spambase features are continuous, they are non-negative, often zero, and right-skewed, so this assumption probably fits poorly. Multinomial Naive Bayes is designed for count data, whereas the Spambase word and character features are percentages and the capital_run_length features are on a much larger scale (values in the hundreds or thousands in the sample rows above), which may let those few features dominate the likelihood.

One limitation of Naive Bayes is that it assumes independence between features given the class, which is unlikely to hold here (for example, the three capital_run_length features are related to one another). Violations of this assumption tend to make the predicted probabilities overconfident and can lead to incorrect predictions. Additionally, Naive Bayes can be sensitive to imbalanced datasets, where the number of instances for one class is significantly larger than the other. In such cases, the class prior may bias the model towards the majority class and reduce performance on the minority class. A further limitation visible in these results is that each variant depends strongly on how well its distributional assumption matches the features, as the gap between the three classifiers shows.

Conclusion:

In summary, the Bernoulli Naive Bayes classifier performed the best on the Spambase dataset in terms of accuracy, precision, and F1 score, followed by Gaussian Naive Bayes, which had the highest recall, and then Multinomial Naive Bayes. Naive Bayes classifiers are simple and efficient models for classification tasks, but their accuracy may be affected by the independence assumption, by mismatches between the assumed and actual feature distributions, and by imbalanced datasets. Future work can involve exploring more advanced models that can handle non-independent features and imbalanced datasets to improve the performance of the classification task. Additionally, more feature engineering and selection, such as transforming or scaling the skewed features, can be explored to improve the performance of the models on the dataset.