In [9]:
import logging
logging.basicConfig(filename="10AprInfo.log", level=logging.INFO, format="%(asctime)s %(name)s %(message)s")

# answer 1
A: employee uses health insurance plan
B: employee is smoker

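We need the probability that an employee is a smoker given that he/she uses the health insurance plan, i.e. P(B|A).

Given:
- 70% of the employees use the health insurance plan: P(A) = 0.7.
- 40% of the employees who use the plan are smokers: P(B|A) = 0.4.

The second statement is already the conditional probability we are asked for, so no further calculation is needed.

Bayes' theorem, **P(B|A) = P(A|B) * P(B) / P(A)**, would only be required if we were given P(A|B) and P(B) instead of P(B|A).

The law of total probability, **P(B) = P(B|A) * P(A) + P(B|A') * P(A')**, would give the overall share of smokers P(B), but it needs P(B|A'), which is not given here.

Hence, the probability that an employee is a smoker given that he/she uses the health insurance plan is 40%.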
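
As a quick sanity check, the definition of conditional probability, P(B|A) = P(A and B) / P(A), can be evaluated with exact fractions. This is only an illustrative sketch; the variable names are my own and the two input values are the ones stated in the question.

```python
from fractions import Fraction

# Values stated in the problem.
p_a = Fraction(7, 10)          # P(A): employee uses the health insurance plan
p_b_given_a = Fraction(4, 10)  # P(B|A): smoker, given the employee uses the plan

# Product rule: P(A and B) = P(B|A) * P(A).
p_a_and_b = p_b_given_a * p_a

# Dividing the joint probability by P(A) recovers the conditional probability.
recovered = p_a_and_b / p_a

print("P(A and B) =", float(p_a_and_b))
print("P(B|A)     =", float(recovered))
```

The joint probability says that 28% of all employees both use the plan and smoke, which is a different quantity from the 40% asked for.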

# answer 2
Bernoulli Naive Bayes and Multinomial Naive Bayes are two commonly used variants of the Naive Bayes algorithm, which is a probabilistic machine learning algorithm used for classification tasks.

The main difference between the two lies in the type of data they are designed to handle.

Bernoulli Naive Bayes is typically used when the input data is binary (i.e., each feature takes one of two values, such as 0 or 1). The algorithm calculates the probability of a class from the presence or absence of each feature in the input, so the absence of a feature counts as evidence in its own right.

On the other hand, Multinomial Naive Bayes is typically used when the input data consists of frequency counts of events (i.e., when each feature represents the count of a particular word or token in a text document, for example). In this case, the algorithm calculates the probability of a particular class given the frequency counts of each feature in the input.
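
The difference can be sketched on a tiny, made-up word-count matrix (the data, labels, and variable names below are hypothetical). `MultinomialNB` works with the counts themselves, while scikit-learn's `BernoulliNB` first thresholds every feature with its `binarize` parameter (default `0.0`), so it only sees whether a word occurs at all.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are words.
X_counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [4, 1, 0],
    [0, 2, 3],
    [0, 1, 5],
    [1, 3, 2],
])
y = np.array([0, 0, 0, 1, 1, 1])

# MultinomialNB models how often each word occurs.
mnb = MultinomialNB().fit(X_counts, y)

# BernoulliNB models presence/absence: with binarize=0.0, every value
# greater than 0 is mapped to 1 before fitting and before predicting.
bnb_on_counts = BernoulliNB(binarize=0.0).fit(X_counts, y)

# The same model fitted on an explicitly binarized copy of the data.
X_binary = (X_counts > 0).astype(int)
bnb_on_binary = BernoulliNB(binarize=None).fit(X_binary, y)

new_doc = np.array([[5, 0, 1]])
print("Multinomial P(class):", mnb.predict_proba(new_doc))
print("Bernoulli   P(class):", bnb_on_counts.predict_proba(new_doc))
```

For the Bernoulli model, a document containing a word five times is indistinguishable from one containing it once; the multinomial model treats the two differently.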


# answer 3
Bernoulli Naive Bayes assumes that all features are binary, encoding the presence or absence of a feature as 1 or 0, respectively. The algorithm itself does not define how to treat missing values, so they have to be handled before training; scikit-learn's `BernoulliNB`, for example, raises an error when the input contains NaN.

One common approach is to impute the missing values with either 0 or 1, depending on the data and the problem at hand. For binary features, mode (most frequent value) imputation is the natural choice, because mean imputation generally produces a value that is neither 0 nor 1. Imputation can be combined with a separate missing-value indicator feature, which takes the value 1 when a feature was missing and 0 otherwise.

Another approach is to simply ignore the missing values and exclude the corresponding samples from the training or testing dataset. This approach can work well if the missing values are relatively rare and the dataset is large enough that removing a few samples does not significantly affect the performance of the model.

In any case, it is important to carefully consider the handling of missing values in Bernoulli Naive Bayes and other machine learning algorithms, as the choice of approach can have a significant impact on the performance and accuracy of the model.
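
A minimal sketch of the imputation approach, assuming scikit-learn and a small made-up binary dataset (the data and variable names are hypothetical): `SimpleImputer` fills each missing entry with the most frequent value of its column, and `add_indicator=True` appends one 0/1 column for every feature that had missing values during fitting.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Hypothetical binary features with two missing entries (np.nan).
X = np.array([
    [1, 0, np.nan],
    [0, 1, 1],
    [1, np.nan, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 1, 1],
])
y = np.array([1, 0, 1, 0, 1, 0])

# Mode imputation keeps the features binary; the indicator columns let the
# classifier use "this value was missing" as a feature of its own.
imputer = SimpleImputer(strategy="most_frequent", add_indicator=True)
model = make_pipeline(imputer, BernoulliNB())
model.fit(X, y)

# Inspect what the classifier actually receives after imputation.
X_imputed = model.named_steps["simpleimputer"].transform(X)
print("Shape after imputation:", X_imputed.shape)
print("Predictions:", model.predict(X))
```

Putting the imputer inside the pipeline means that, under cross-validation, the most frequent values are learned from the training folds only.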

# answer 4
Yes, Gaussian Naive Bayes can be used for multi-class classification. The algorithm assumes that, within each class, every feature follows a Gaussian (normal) distribution. It estimates a mean and variance per feature and class, calculates the posterior probability of each class given the input features, and predicts the class with the highest probability.

Because a posterior is computed for every class directly, Naive Bayes handles any number of classes without modification. Decomposition strategies such as one-vs-all (also known as one-vs-rest) or one-vs-one are designed for classifiers that are inherently binary; they can be wrapped around Gaussian Naive Bayes, but they are not required.

In the one-vs-all strategy, a separate binary classifier is trained for each class, with the samples of that class labeled as positive and the samples of all other classes labeled as negative. The final prediction is made by selecting the class with the highest probability output from all the classifiers.

In the one-vs-one strategy, a separate binary classifier is trained for each pair of classes. The final prediction is made by counting the number of votes for each class and selecting the class with the most votes.

Overall, Gaussian Naive Bayes is a popular and effective algorithm for multi-class classification, especially when the input features are continuous and normally distributed. However, it may not perform as well as other algorithms, such as logistic regression or support vector machines, for more complex or high-dimensional datasets.
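
As an illustration, the sketch below fits scikit-learn's `GaussianNB` on the three-class iris dataset that ships with the library. The choice of dataset is mine, not part of the question; no one-vs-rest or one-vs-one wrapper is involved.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes and four continuous features.
X, y = load_iris(return_X_y=True)

# A single GaussianNB model covers all three classes.
gnb = GaussianNB().fit(X, y)
print("Classes seen during fit:", gnb.classes_)

# predict_proba returns one posterior probability per class and sample.
proba = gnb.predict_proba(X[:5])
print("Posterior shape:", proba.shape)
print(np.round(proba, 3))

# predict returns the class with the highest posterior.
print("Predicted classes:", gnb.predict(X[:5]))
```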

# answer 5

In [11]:
with open('spambase.DOCUMENTATION', 'r') as f:
    text = f.read()
    print(text[0:4000])

1. Title:  SPAM E-mail Database

2. Sources:
   (a) Creators: Mark Hopkins, Erik Reeber, George Forman, Jaap Suermondt
        Hewlett-Packard Labs, 1501 Page Mill Rd., Palo Alto, CA 94304
   (b) Donor: George Forman (gforman at nospam hpl.hp.com)  650-857-7835
   (c) Generated: June-July 1999

3. Past Usage:
   (a) Hewlett-Packard Internal-only Technical Report. External forthcoming.
   (b) Determine whether a given email is spam or not.
   (c) ~7% misclassification error.
       False positives (marking good mail as spam) are very undesirable.
       If we insist on zero false positives in the training/testing set,
       20-25% of the spam passed through the filter.

4. Relevant Information:
        The "spam" concept is diverse: advertisements for products/web
        sites, make money fast schemes, chain letters, pornography...
	Our collection of spam e-mails came from our postmaster and 
	individuals who had filed spam.  Our collection of non-spam 
	e-mails came from filed work a

In [18]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# load the dataset
data = np.loadtxt('spambase.data', delimiter=',')

# split the dataset into features (X) and labels (y); the last column is the label
X = data[:, :-1]
y = data[:, -1]

# the three Naive Bayes variants to compare, all with default parameters
classifiers = {
    "Bernoulli": BernoulliNB(),
    "Multinomial": MultinomialNB(),
    "Gaussian": GaussianNB(),
}

# evaluate each classifier on the out-of-fold predictions of 10-fold cross-validation
for name, clf in classifiers.items():
    y_pred = cross_val_predict(clf, X, y, cv=10)
    print(f"{name} Naive Bayes performance metrics:")
    print("Accuracy:", accuracy_score(y, y_pred))
    print("Precision:", precision_score(y, y_pred))
    print("Recall:", recall_score(y, y_pred))
    print("F1 score:", f1_score(y, y_pred))

Bernoulli Naive Bayes performance metrics:
Accuracy: 0.8839382742881983
Precision: 0.8813357185450209
Recall: 0.815223386651958
F1 score: 0.8469914040114614
Multinomial Naive Bayes performance metrics:
Accuracy: 0.786350793305803
Precision: 0.7323628219484882
Recall: 0.7214561500275786
F1 score: 0.7268685746040567
Gaussian Naive Bayes performance metrics:
Accuracy: 0.8217778743751358
Precision: 0.7004440855874041
Recall: 0.9569773855488142
F1 score: 0.8088578088578089


# Discussion:
The performance metrics reported above show the performance of three different types of Naive Bayes classifiers (Bernoulli, Multinomial, and Gaussian) on the spambase dataset.

- The Bernoulli Naive Bayes classifier achieved the highest accuracy, 0.8839, i.e. it correctly classified 88.39% of the instances in the dataset. It also had the highest precision (0.8813) and F1 score (0.8470), so most of the emails it flagged as spam really were spam. Its recall of 0.8152 is lower than its precision, meaning it missed about 18% of the actual spam.

- The Multinomial Naive Bayes classifier achieved an accuracy of 0.7863, with precision (0.7324), recall (0.7215), and F1 score (0.7269) all lower than the Bernoulli classifier's. It is the weakest of the three on accuracy, recall, and F1 score.

- The Gaussian Naive Bayes classifier achieved an accuracy of 0.8218, which lies between the Bernoulli and Multinomial classifiers. Its recall of 0.9570 is the highest of the three, meaning it caught almost all spam. Its precision of 0.7004 is the lowest, however: about 30% of the emails it flagged as spam were actually legitimate.
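
To make the link between these metrics and false positives concrete, the sketch below derives precision, recall, and the false positive rate from a confusion matrix. The labels are a small hypothetical example (1 = spam, 0 = not spam), not the spambase predictions.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical ground truth and predictions: 1 = spam, 0 = not spam.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)            # share of flagged emails that are spam
recall = tp / (tp + fn)               # share of spam that was caught
false_positive_rate = fp / (fp + tn)  # share of good mail flagged as spam

print("Precision:", precision)
print("Recall:", recall)
print("False positive rate:", false_positive_rate)
```

Precision and the false positive rate both depend on the number of false positives, but they are normalized differently, so a lower precision does not translate one-to-one into a higher false positive rate.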

# Conclusion:
The Bernoulli Naive Bayes classifier is the most effective of the three on this dataset, achieving the highest accuracy, precision, and F1 score. The Multinomial classifier trails on accuracy, recall, and F1 score. The Gaussian classifier achieved by far the highest recall (0.9570), so it would let the least spam through, but its low precision (0.7004) means it flags many more legitimate emails as spam. Since the dataset documentation notes that false positives are very undesirable, the Bernoulli classifier is the better choice here.