#Q1
This question is a direct application of conditional probability, so no Naive Bayes model is needed. The survey gives P(uses the health insurance plan) = 0.7 and P(smoker | uses the plan) = 0.4, and the multiplication rule combines the two.

In [1]:
#1
# Define the probabilities from the survey
p_health_insurance = 0.7  # Probability that an employee uses the health insurance plan
p_smoker_given_health_insurance = 0.4  # Probability that an employee is a smoker given that he/she uses the health insurance plan

# Multiplication rule: P(uses plan and smoker) = P(uses plan) * P(smoker | uses plan)
p_smoker_and_health_insurance = p_health_insurance * p_smoker_given_health_insurance

# Display the result
print("Probability that an employee uses the plan and is a smoker:", p_smoker_and_health_insurance)

Probability that an employee uses the plan and is a smoker: 0.27999999999999997


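#Note:
The multiplication rule P(A and B) = P(A) * P(B | A) holds whether or not the events "using the health insurance plan" and "being a smoker" are independent, so this calculation makes no independence assumption.

The value 0.28 is the joint probability that an employee both uses the plan and smokes. The overall probability that an employee is a smoker would also require the smoking rate among employees who do not use the plan, which the survey figures above do not include.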
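
The product P(plan) * P(smoker | plan) is a joint probability, not the overall smoking rate. The sketch below shows how the two differ using the law of total probability. The value of `p_smoker_given_no_plan` is a made-up assumption for illustration only; it does not come from the survey.

```python
# Figures taken from the survey
p_plan = 0.7               # P(uses the health insurance plan)
p_smoker_given_plan = 0.4  # P(smoker | uses the plan)

# ASSUMPTION: hypothetical value, NOT from the survey
p_smoker_given_no_plan = 0.2  # P(smoker | does not use the plan)

# Multiplication rule: P(plan and smoker)
p_plan_and_smoker = p_plan * p_smoker_given_plan

# Law of total probability: P(smoker) sums over both groups
p_smoker = p_plan_and_smoker + (1 - p_plan) * p_smoker_given_no_plan

# Bayes' theorem: P(plan | smoker)
p_plan_given_smoker = p_plan_and_smoker / p_smoker

print(f"P(plan and smoker) = {p_plan_and_smoker:.2f}")
print(f"P(smoker)          = {p_smoker:.2f}")
print(f"P(plan | smoker)   = {p_plan_given_smoker:.2f}")
```

With a different assumed smoking rate among non-users, P(smoker) changes while the joint probability stays the same.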

#Q2
Bernoulli Naive Bayes and Multinomial Naive Bayes are two variants of the Naive Bayes algorithm that are commonly used for text classification problems, such as spam detection or sentiment analysis. The key difference lies in the type of features they are designed to work with.

Bernoulli Naive Bayes:

Assumes that features are binary variables (0 or 1), indicating the absence or presence of a particular feature.
Often used for document classification tasks, where each term's presence or absence in a document is considered.
Suitable for situations where the occurrence of a feature matters more than its frequency.

Multinomial Naive Bayes:

Assumes that features are discrete and represent counts or frequencies (non-negative integers).
Commonly used for text classification tasks where features are word counts (bag-of-words model).
Suitable for situations where the frequency of a feature is important, such as in natural language processing tasks.
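
To make the difference in feature types concrete, the following minimal sketch builds both representations for two toy documents (the documents are invented for illustration). `CountVectorizer()` yields the word counts that Multinomial Naive Bayes expects, while `CountVectorizer(binary=True)` yields the presence/absence flags that match the Bernoulli model.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents, invented for illustration
docs = ["spam spam spam offer", "meeting notes offer"]

# Word counts: the natural input for MultinomialNB
count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs).toarray()

# Presence/absence flags: the natural input for BernoulliNB
binary_vec = CountVectorizer(binary=True)
presence = binary_vec.fit_transform(docs).toarray()

# Column index of the word "spam" in each matrix
spam_col = count_vec.vocabulary_["spam"]
spam_col_binary = binary_vec.vocabulary_["spam"]

print("count of 'spam' in doc 0:   ", counts[0, spam_col])
print("presence of 'spam' in doc 0:", presence[0, spam_col_binary])
```

The count matrix keeps the fact that "spam" occurs three times in the first document; the binary matrix only records that it occurs at all.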

In [2]:
#2
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Sample data
documents = ["I like natural language processing.",
             "Naive Bayes is a simple algorithm.",
             "Spam emails are annoying.",
             "I dislike spam emails."]

labels = [1, 1, 0, 0]  # 1 for positive (non-spam), 0 for negative (spam)

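# Split the data into training and testing sets
# (with only 4 documents, test_size=0.25 leaves a single test document,
# so the accuracy below can only be 0.0 or 1.0 and is not a meaningful estimate)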
X_train, X_test, y_train, y_test = train_test_split(documents, labels, test_size=0.25, random_state=42)

# Vectorize the documents using the bag-of-words model
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(X_train)
X_test_bow = vectorizer.transform(X_test)

# Bernoulli Naive Bayes
bernoulli_nb = BernoulliNB()
bernoulli_nb.fit(X_train_bow, y_train)
y_pred_bernoulli = bernoulli_nb.predict(X_test_bow)
accuracy_bernoulli = accuracy_score(y_test, y_pred_bernoulli)
print("Bernoulli Naive Bayes Accuracy:", accuracy_bernoulli)

# Multinomial Naive Bayes
multinomial_nb = MultinomialNB()
multinomial_nb.fit(X_train_bow, y_train)
y_pred_multinomial = multinomial_nb.predict(X_test_bow)
accuracy_multinomial = accuracy_score(y_test, y_pred_multinomial)
print("Multinomial Naive Bayes Accuracy:", accuracy_multinomial)

Bernoulli Naive Bayes Accuracy: 0.0
Multinomial Naive Bayes Accuracy: 0.0


#Q3
Bernoulli Naive Bayes models each feature as a binary variable indicating the presence or absence of something, and it has no built-in mechanism for missing values. If missing entries are encoded as 0 before training, the model treats them exactly like an absent feature, which conflates "not observed" with "not present". scikit-learn's BernoulliNB rejects NaN inputs with an error, so missing values must be imputed or encoded (for example with an extra "missing" indicator feature) beforehand. If missing values are common and informative in your dataset, another model may be a better choice.
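
As a minimal sketch of handling missing values before Bernoulli Naive Bayes, the example below first checks that fitting on data containing NaN raises a `ValueError`, then fits a pipeline that imputes each column's most frequent value. The tiny dataset and the choice of `most_frequent` imputation are assumptions made for illustration, not a recommendation for every dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Toy binary features with two missing entries (invented for illustration)
X_missing = np.array([[1, 0, 1],
                      [0, np.nan, 1],
                      [1, 1, 0],
                      [0, 0, np.nan]], dtype=float)
y_toy = np.array([1, 0, 1, 0])

# Fitting directly on NaN values is expected to fail
try:
    BernoulliNB().fit(X_missing, y_toy)
    nan_rejected = False
except ValueError:
    nan_rejected = True
print("NaN rejected by BernoulliNB:", nan_rejected)

# Impute the most frequent value per column, then fit the classifier
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X_missing, y_toy)
pred_toy = model.predict(X_missing)
print("Predictions after imputation:", pred_toy)
```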

In [13]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# Load Iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='target')

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create a RandomForest classifier
rf_classifier = RandomForestClassifier(random_state=42)

# Create a Bernoulli Naive Bayes classifier (for illustration only: with the default
# binarize=0.0, every Iris measurement is positive and becomes 1, so it cannot separate the classes)
bnb_classifier = BernoulliNB()

# Create an ensemble model with a Voting Classifier
ensemble_model = VotingClassifier(estimators=[
    ('random_forest', rf_classifier),
    ('bernoulli_naive_bayes', bnb_classifier)
], voting='soft')

# Train the ensemble model
ensemble_model.fit(X_train, y_train)

# Evaluate the ensemble model
y_pred = ensemble_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Ensemble Model Accuracy: {accuracy}')

Ensemble Model Accuracy: 1.0


#Q4
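Yes, Gaussian Naive Bayes can be used for multi-class classification. It is a variant of the Naive Bayes algorithm that assumes that, within each class, every feature follows a Gaussian (normal) distribution. It estimates a prior plus a per-feature mean and variance for each class and predicts the class with the highest posterior, so any number of classes works. scikit-learn provides the GaussianNB class for Gaussian Naive Bayes, and it supports multi-class classification out of the box.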
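
To show why the number of classes does not matter, here is a minimal from-scratch sketch of the Gaussian Naive Bayes decision rule in NumPy. It is not scikit-learn's implementation: the helper names, the synthetic three-class data, and the small constant added to the variances are all assumptions made for illustration.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors and per-class feature means and variances."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Small constant avoids division by zero (illustrative choice)
    variances = np.array([X[y == c].var(axis=0) for c in classes]) + 1e-9
    return classes, priors, means, variances

def predict_gaussian_nb(params, X):
    """Pick the class maximizing log prior + summed Gaussian log-likelihoods."""
    classes, priors, means, variances = params
    diff = X[:, None, :] - means[None, :, :]          # (n_samples, n_classes, n_features)
    log_lik = -0.5 * (np.log(2 * np.pi * variances)[None] + diff**2 / variances[None])
    scores = np.log(priors)[None, :] + log_lik.sum(axis=2)
    return classes[np.argmax(scores, axis=1)]

# Synthetic, well-separated three-class data (invented for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X_syn = np.vstack([rng.normal(c, 0.5, size=(30, 2)) for c in centers])
y_syn = np.repeat([0, 1, 2], 30)

params = fit_gaussian_nb(X_syn, y_syn)
pred_syn = predict_gaussian_nb(params, X_syn)
print("Training accuracy:", (pred_syn == y_syn).mean())
```

The argmax runs over one score per class, so adding a class only adds a column to `scores`.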

In [14]:
#4
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Load the Iris dataset (a well-known dataset for multi-class classification)
iris = load_iris()
X = iris.data
y = iris.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Create and fit the Gaussian Naive Bayes model
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Make predictions on the test set
y_pred = gnb.predict(X_test)

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Display classification report
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Accuracy: 1.0

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        15
           1       1.00      1.00      1.00        11
           2       1.00      1.00      1.00        12

    accuracy                           1.00        38
   macro avg       1.00      1.00      1.00        38
weighted avg       1.00      1.00      1.00        38



In [15]:
#Q5
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import Binarizer

# Step 1: Load the Spambase dataset
# Read the dataset directly from the UCI Machine Learning Repository
# (replace the URL with a local file path if you have downloaded the file)
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"
column_names = [f'feature_{i}' for i in range(57)] + ['spam']
df = pd.read_csv(url, header=None, names=column_names)

# Step 2: Split the data into features (X) and target (y)
X = df.drop('spam', axis=1)
y = df['spam']

# Step 3: Implement and evaluate Bernoulli Naive Bayes
# (with the default binarize=0.0, every feature value greater than 0 becomes 1)
print("\nBernoulli Naive Bayes:")
bnb = BernoulliNB()
bnb_scores = cross_val_score(bnb, X, y, cv=10, scoring=make_scorer(accuracy_score))
print("Accuracy:", bnb_scores.mean())

# Step 4: Implement and evaluate Multinomial Naive Bayes
print("\nMultinomial Naive Bayes:")
mnb = MultinomialNB()
mnb_scores = cross_val_score(mnb, X, y, cv=10, scoring=make_scorer(accuracy_score))
print("Accuracy:", mnb_scores.mean())

# Step 5: Implement and evaluate Gaussian Naive Bayes
print("\nGaussian Naive Bayes:")
gnb = GaussianNB()
gnb_scores = cross_val_score(gnb, X, y, cv=10, scoring=make_scorer(accuracy_score))
print("Accuracy:", gnb_scores.mean())

# Step 6: Report additional metrics for each classifier
def evaluate_classifier(classifier, X, y):
    accuracy = cross_val_score(classifier, X, y, cv=10, scoring=make_scorer(accuracy_score)).mean()
    precision = cross_val_score(classifier, X, y, cv=10, scoring=make_scorer(precision_score)).mean()
    recall = cross_val_score(classifier, X, y, cv=10, scoring=make_scorer(recall_score)).mean()
    f1 = cross_val_score(classifier, X, y, cv=10, scoring=make_scorer(f1_score)).mean()
    return accuracy, precision, recall, f1

print("\nAdditional Metrics:")

bnb_metrics = evaluate_classifier(BernoulliNB(), X, y)
print("Bernoulli Naive Bayes - Accuracy:", bnb_metrics[0], "Precision:", bnb_metrics[1], "Recall:", bnb_metrics[2], "F1 Score:", bnb_metrics[3])

mnb_metrics = evaluate_classifier(MultinomialNB(), X, y)
print("Multinomial Naive Bayes - Accuracy:", mnb_metrics[0], "Precision:", mnb_metrics[1], "Recall:", mnb_metrics[2], "F1 Score:", mnb_metrics[3])

gnb_metrics = evaluate_classifier(GaussianNB(), X, y)
print("Gaussian Naive Bayes - Accuracy:", gnb_metrics[0], "Precision:", gnb_metrics[1], "Recall:", gnb_metrics[2], "F1 Score:", gnb_metrics[3])


Bernoulli Naive Bayes:
Accuracy: 0.8839380364047911

Multinomial Naive Bayes:
Accuracy: 0.7863496180326323

Gaussian Naive Bayes:
Accuracy: 0.8217730830896915

Additional Metrics:
Bernoulli Naive Bayes - Accuracy: 0.8839380364047911 Precision: 0.8869617393737383 Recall: 0.8152389047416673 F1 Score: 0.8481249015095276
Multinomial Naive Bayes - Accuracy: 0.7863496180326323 Precision: 0.7393175533565436 Recall: 0.7214983911116508 F1 Score: 0.7282909724016348
Gaussian Naive Bayes - Accuracy: 0.8217730830896915 Precision: 0.7103733928118492 Recall: 0.9569516119239877 F1 Score: 0.8130660909542995
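
The evaluation above reruns 10-fold cross-validation once per metric. As a possible alternative, scikit-learn's `cross_validate` accepts several scorers and computes them all in a single pass over the folds. The sketch below uses synthetic data from `make_classification` as a stand-in so that it runs without downloading Spambase; the dataset sizes are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a binary classification dataset (not Spambase)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Built-in scorer names understood by scikit-learn
scoring = ["accuracy", "precision", "recall", "f1"]

# One cross-validation run evaluates every metric on the same folds
results = cross_validate(GaussianNB(), X_demo, y_demo, cv=10, scoring=scoring)

# Average each metric over the 10 folds
summary = {metric: results[f"test_{metric}"].mean() for metric in scoring}
for metric, value in summary.items():
    print(f"{metric}: {value:.3f}")
```

Passing `X` and `y` from the Spambase DataFrame in place of `X_demo` and `y_demo` would apply the same pattern to each of the three classifiers.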
