# Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that they use the health insurance plan, we can use conditional probability.

Let's denote the event "an employee uses the health insurance plan" as A, and the event "an employee is a smoker" as B. We want to find P(B|A), which represents the probability of event B occurring given that event A has occurred.

According to the information given:
P(A) = 0.70 (70% of employees use the health insurance plan)
P(B|A) = 0.40 (40% of employees who use the plan are smokers)

The formula for conditional probability is:
P(B|A) = P(A and B) / P(A)

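Note that the quantity we are asked for, P(B|A), is stated directly in the problem: 40% of the employees who use the plan are smokers. No further calculation is needed to answer the question.

The formula can still be rearranged to find the joint probability that an employee both uses the plan and is a smoker:
P(A and B) = P(B|A) * P(A)
= 0.40 * 0.70
= 0.28

Therefore, the probability that an employee is a smoker given that they use the health insurance plan is P(B|A) = 0.40 or 40%. The value 0.28 (28%) is the joint probability that a randomly chosen employee uses the plan and is a smoker; it is not the conditional probability.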
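
The relationship between the joint and the conditional probability can be checked numerically. The short sketch below uses illustrative variable names that are not part of the original question. It shows that 0.28 is the joint probability P(A and B), and that dividing it by P(A) returns the conditional probability P(B|A) = 0.40.

```python
# Probabilities taken from the problem statement.
p_uses_plan = 0.70            # P(A)
p_smoker_given_plan = 0.40    # P(B|A), stated directly in the question

# Joint probability: P(A and B) = P(B|A) * P(A)
p_plan_and_smoker = p_smoker_given_plan * p_uses_plan

# Recover the conditional probability from its definition.
p_recovered = p_plan_and_smoker / p_uses_plan

print(f"P(A and B) = {p_plan_and_smoker:.2f}")
print(f"P(B|A)     = {p_recovered:.2f}")
```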






# Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?

The difference between Bernoulli Naive Bayes and Multinomial Naive Bayes lies in the distribution each one assumes for the features.

1-Bernoulli Naive Bayes is suitable for binary features, where each feature records the presence or absence of a particular attribute (0 or 1).
It models each feature, conditioned on the class, as a Bernoulli variable, so the absence of a feature also counts as evidence.
It calculates the likelihood of the observed presence/absence pattern under each class and combines it with the class prior to make predictions.

2-Multinomial Naive Bayes is suitable for features that represent discrete counts or frequencies, such as word counts in document classification.
It models the feature counts, conditioned on the class, as draws from a multinomial distribution.
It calculates the likelihood of the observed counts under each class, so a feature that occurs more often contributes more, and combines it with the class prior to make predictions.
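
The practical difference can be seen on a small example. The sketch below uses a made-up document-term count matrix (the data and variable names are assumptions for illustration only). It relies on scikit-learn's `BernoulliNB` default of `binarize=0.0`, which maps every value greater than 0 to 1 before fitting, so the classifier gives the same probabilities whether it is fed raw counts or explicit 0/1 indicators. `MultinomialNB` uses the counts themselves.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy document-term counts (rows: documents, columns: words); made-up data.
X_counts = np.array([
    [3, 0, 1, 0],
    [2, 0, 0, 1],
    [4, 1, 0, 0],
    [0, 2, 0, 3],
    [0, 1, 1, 4],
    [1, 3, 0, 2],
])
y = np.array([0, 0, 0, 1, 1, 1])
X_binary = (X_counts > 0).astype(int)

# MultinomialNB models how often each word occurs.
mnb = MultinomialNB().fit(X_counts, y)

# BernoulliNB (default binarize=0.0) only sees presence/absence,
# so fitting on counts or on 0/1 indicators is equivalent.
bnb_counts = BernoulliNB().fit(X_counts, y)
bnb_binary = BernoulliNB().fit(X_binary, y)

same = np.allclose(bnb_counts.predict_proba(X_counts),
                   bnb_binary.predict_proba(X_binary))
print("BernoulliNB ignores count magnitudes:", same)
print("MultinomialNB predictions:", mnb.predict(X_counts))
```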

# Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes assumes that the features follow a Bernoulli distribution, where each feature is binary, taking values of 0 or 1. When dealing with missing values in Bernoulli Naive Bayes, there are a few common approaches:

1-Ignoring missing values: One simple approach is to ignore the missing values and only consider the available features. This means treating the missing values as if they were not observed during the classification process. However, this approach discards potentially useful information and may lead to biased results if the missingness is not random.

2-Imputation: Another approach is to impute the missing values with some value that represents the absence or presence of the feature. For Bernoulli Naive Bayes, the most common imputation technique is to use the mode (the most frequent value) of the observed instances for that feature. This means replacing the missing values with either 0 or 1, based on the mode of the observed instances.

3-Special category: Alternatively, you can assign a special category or label to represent the missing values. You can treat the missing value as a separate category and include it as a distinct feature value. This approach allows the missingness itself to be used as information during the classification process.
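
As an illustration of the imputation approach, the sketch below fills missing binary values with the most frequent value of each column before fitting the classifier. It assumes missing entries are encoded as `NaN` and uses scikit-learn's `SimpleImputer`; the data and variable names are made up for the example. This is one possible setup, not the only way to handle missing values.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Made-up binary features with missing entries encoded as NaN.
X = np.array([
    [1, 0, np.nan],
    [0, 1, 1],
    [1, np.nan, 1],
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
], dtype=float)
y = np.array([1, 0, 1, 0, 1, 0])

# Replace each NaN with the most frequent value (mode) of its column.
imputer = SimpleImputer(strategy="most_frequent")
X_imputed = imputer.fit_transform(X)

# Chaining the imputer and the classifier keeps the imputation inside the
# model, so the mode is re-estimated whenever the pipeline is re-fitted.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)
predictions = model.predict(X)

print(X_imputed)
print(predictions)
```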

# Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. Gaussian Naive Bayes is a variant of the Naive Bayes algorithm that assumes each feature follows a Gaussian (normal) distribution within each class. It is commonly used for continuous or numeric features.

Naive Bayes is inherently multi-class: Bayes' rule is applied to every class in the same way, so no special extension is needed when there are more than two classes. The calculations for binary classification simply repeat for each additional class.

During the training phase, the algorithm estimates a prior probability for each class and the mean and variance of each feature within each class. During the prediction phase, it uses the Gaussian probability density function to calculate the likelihood of the observed feature values under each class and multiplies it by the class prior. The instance is assigned to the class with the highest posterior probability.

Therefore, Gaussian Naive Bayes handles multi-class classification naturally, making it a popular choice for problems with continuous or numeric features and multiple classes.
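
A minimal sketch of multi-class Gaussian Naive Bayes is shown below. It uses the three-class Iris data set bundled with scikit-learn, chosen here only as a convenient example and not part of the original question. The fitted model stores one row of per-feature means per class, and `predict` returns the class with the largest posterior from `predict_proba`.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Iris: 150 samples, 4 continuous features, 3 classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

clf = GaussianNB().fit(X_train, y_train)

# One row of per-feature means for each class.
print("classes:", clf.classes_)
print("means shape:", clf.theta_.shape)

# predict_proba returns one posterior per class; predict picks the largest.
posteriors = clf.predict_proba(X_test)
predictions = clf.predict(X_test)
print("test accuracy:", clf.score(X_test, y_test))
```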

# Q5. Assignment:
Data preparation: Download the "Spambase Data Set" from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Spambase). This dataset contains email messages, where the goal is to predict whether a message is spam or not based on several input features.

Implementation: Implement Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers using the scikit-learn library in Python. Use 10-fold cross-validation to evaluate the performance of each classifier on the dataset. You should use the default hyperparameters for each classifier.

Results: Report the following performance metrics for each classifier: Accuracy, Precision, Recall & F1 score.

Discussion: Discuss the results you obtained. Which variant of Naive Bayes performed the best? Why do you think that is the case? Are there any limitations of Naive Bayes that you observed?

Conclusion: Summarise your findings and provide some suggestions for the future work.

PLEASE NOTE: This dataset contains a binary classification problem with multiple features. The dataset is relatively small, but it can be used to demonstrate the performance of the different variants of Naive Bayes on a real-world problem.

Introduction: In this assignment, we will implement and compare the performance of three variants of Naive Bayes classifiers: Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes on the "Spambase Data Set" from the UCI Machine Learning Repository. We will use the scikit-learn library in Python for implementation and 10-fold cross-validation for evaluation.

Data Preparation: First, we need to download the Spambase Data Set from the UCI Machine Learning Repository. The dataset contains 4601 email messages, where the goal is to predict whether a message is spam or not based on several input features. The features include the frequency of various words, characters, and punctuation marks, as well as information about the length of the message and the number of capital letters in the message.

Implementation: We will now implement the three variants of Naive Bayes classifiers using the scikit-learn library in Python. The implementation is straightforward, and we will use the default hyperparameters for each classifier.

In [1]:
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB, GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import StandardScaler

In [2]:
# Load data (header=None: the first row of the file is data, not column names)
import pandas as pd
data = pd.read_csv('spambase.csv', header=None)
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,48,49,50,51,52,53,54,55,56,57
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [3]:
# Name the feature columns f1..f57 and the last column 'target'
n_cols = data.shape[1]
features = [f'f{i + 1}' for i in range(n_cols - 1)] + ['target']

In [4]:
data.columns = features
data.head()

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f49,f50,f51,f52,f53,f54,f55,f56,f57,target
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [5]:
data['target'].value_counts()

0    2788
1    1813
Name: target, dtype: int64

In [6]:
# checking null values
data.isnull().sum()

f1        0
f2        0
f3        0
f4        0
f5        0
f6        0
f7        0
f8        0
f9        0
f10       0
f11       0
f12       0
f13       0
f14       0
f15       0
f16       0
f17       0
f18       0
f19       0
f20       0
f21       0
f22       0
f23       0
f24       0
f25       0
f26       0
f27       0
f28       0
f29       0
f30       0
f31       0
f32       0
f33       0
f34       0
f35       0
f36       0
f37       0
f38       0
f39       0
f40       0
f41       0
f42       0
f43       0
f44       0
f45       0
f46       0
f47       0
f48       0
f49       0
f50       0
f51       0
f52       0
f53       0
f54       0
f55       0
f56       0
f57       0
target    0
dtype: int64


In [7]:
# Separating X and Y variables
X = data.drop(labels=['target'],axis=1)
Y = data[['target']]

In [8]:
X

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f48,f49,f50,f51,f52,f53,f54,f55,f56,f57
0,0.00,0.64,0.64,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.0,0.778,0.000,0.000,3.756,61,278
1,0.21,0.28,0.50,0.0,0.14,0.28,0.21,0.07,0.00,0.94,...,0.0,0.000,0.132,0.0,0.372,0.180,0.048,5.114,101,1028
2,0.06,0.00,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.010,0.143,0.0,0.276,0.184,0.010,9.821,485,2259
3,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.0,0.000,0.137,0.0,0.137,0.000,0.000,3.537,40,191
4,0.00,0.00,0.00,0.0,0.63,0.00,0.31,0.63,0.31,0.63,...,0.0,0.000,0.135,0.0,0.135,0.000,0.000,3.537,40,191
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4596,0.31,0.00,0.62,0.0,0.00,0.31,0.00,0.00,0.00,0.00,...,0.0,0.000,0.232,0.0,0.000,0.000,0.000,1.142,3,88
4597,0.00,0.00,0.00,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.000,0.0,0.353,0.000,0.000,1.555,4,14
4598,0.30,0.00,0.30,0.0,0.00,0.00,0.00,0.00,0.00,0.00,...,0.0,0.102,0.718,0.0,0.000,0.000,0.000,1.404,6,118
4599,0.96,0.00,0.00,0.0,0.32,0.00,0.00,0.00,0.00,0.00,...,0.0,0.000,0.057,0.0,0.000,0.000,0.000,1.147,5,78


In [9]:
Y

Unnamed: 0,target
0,1
1,1
2,1
3,1
4,1
...,...
4596,0
4597,0
4598,0
4599,0


In [10]:
# Train Test Split 
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(X,Y,test_size=0.3,random_state=42,stratify=Y)

# Gaussian NB

In [11]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(xtrain,ytrain.values.flatten())

In [12]:
from sklearn.model_selection import StratifiedKFold
skf =  StratifiedKFold(n_splits=10,shuffle=True,random_state=42)

In [13]:
from sklearn.model_selection import cross_val_score
scores_gnb = cross_val_score(GaussianNB(),xtrain,ytrain.values.flatten(),cv=skf,scoring='f1')
scores_gnb

array([0.77564103, 0.82191781, 0.80267559, 0.802589  , 0.78064516,
       0.81081081, 0.82876712, 0.82033898, 0.80130293, 0.8125    ])

In [14]:
import numpy as np
mean_score_gnb = np.mean(scores_gnb)
print('Results for Gaussian Naive Bayes')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_gnb:.4f}')

Results for Gaussian Naive Bayes
Mean 10 fold cross validation f1 score is : 0.8057


# Bernoulli Naive Bayes

In [15]:
from sklearn.naive_bayes import BernoulliNB
bnb = BernoulliNB()
bnb.fit(xtrain,ytrain.values.flatten())

In [16]:
scores_bnb = cross_val_score(BernoulliNB(),xtrain,ytrain.values.flatten(),cv=skf,scoring='f1')
print(scores_bnb)
mean_score_bnb = np.mean(scores_bnb)
print('Results for BernoulliNB :')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_bnb:.4f}')

[0.84897959 0.84677419 0.84120172 0.8515625  0.85258964 0.81512605
 0.8879668  0.85232068 0.85483871 0.84081633]
Results for BernoulliNB :
Mean 10 fold cross validation f1 score is : 0.8492


# Multinomial Naive Bayes

In [17]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(xtrain,ytrain.values.flatten())
scores_mnb = cross_val_score(MultinomialNB(),xtrain,ytrain.values.flatten(),cv=skf,scoring='f1')
print(scores_mnb)
mean_score_mnb = np.mean(scores_mnb)
print('Results for MultinomialNB :')
print(f'Mean 10 fold cross validation f1 score is : {mean_score_mnb:.4f}')

[0.70817121 0.68907563 0.74509804 0.71604938 0.67741935 0.72131148
 0.76       0.712      0.703125   0.7768595 ]
Results for MultinomialNB :
Mean 10 fold cross validation f1 score is : 0.7209
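
The assignment asks for accuracy, precision, recall and F1, whereas the cells above report only the cross-validated F1 score. One possible way to collect all four metrics in a single pass is scikit-learn's `cross_validate` with a list of scorers. The sketch below runs on synthetic Poisson counts, an assumption made purely so that the example is self-contained; `X_demo`, `y_demo`, `skf_demo` and `summary` are illustrative names, and the resulting numbers say nothing about Spambase.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic non-negative stand-in data (NOT Spambase): the positive class
# has larger counts on average.
rng = np.random.default_rng(42)
y_demo = rng.integers(0, 2, size=400)
X_demo = rng.poisson(lam=0.5 + 1.5 * y_demo[:, None], size=(400, 10))

skf_demo = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
metrics = ["accuracy", "precision", "recall", "f1"]

summary = {}
for name, model in [("GaussianNB", GaussianNB()),
                    ("BernoulliNB", BernoulliNB()),
                    ("MultinomialNB", MultinomialNB())]:
    # cross_validate returns one array of fold scores per metric,
    # stored under the key "test_<metric>".
    cv = cross_validate(model, X_demo, y_demo, cv=skf_demo, scoring=metrics)
    summary[name] = {m: cv[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(float(v), 4) for m, v in summary[name].items()})
```

Replacing `X_demo` and `y_demo` with the notebook's `xtrain` and `ytrain.values.flatten()` should give the mean 10-fold value of each metric for the real data.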


In [21]:
# Define a function that computes, prints and returns accuracy, precision, recall and F1
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def evaluate_model(x,y,model):
    ypred = model.predict(x)
    acc = accuracy_score(y,ypred)
    pre = precision_score(y,ypred)
    rec = recall_score(y,ypred)
    f1 = f1_score(y,ypred)
    print(f'Accuracy  : {acc:.4f}')
    print(f'Precision : {pre:.4f}')
    print(f'Recall    : {rec:.4f}')
    print(f'F1 Score  : {f1:.4f}')
    return acc, pre, rec, f1

In [22]:
print('Gaussian Naive Bayes Results : \n')
acc_gnb, pre_gnb, rec_gnb, f1_gnb = evaluate_model(xtest,ytest.values.flatten(),gnb)

print('Bernoulli Naive Bayes Results : \n')
acc_bnb, pre_bnb, rec_bnb, f1_bnb = evaluate_model(xtest,ytest.values.flatten(),bnb)

print('Multinomial Naive Bayes Results : \n')
acc_mnb, pre_mnb, rec_mnb, f1_mnb = evaluate_model(xtest,ytest.values.flatten(),mnb)

Gaussian Naive Bayes Results : 

Accuracy  : 0.8240
Precision : 0.7048
Recall    : 0.9522
F1 Score  : 0.8100
Bernoulli Naive Bayes Results : 

Accuracy  : 0.8870
Precision : 0.8865
Recall    : 0.8180
F1 Score  : 0.8509
Multinomial Naive Bayes Results : 

Accuracy  : 0.7697
Precision : 0.7190
Recall    : 0.6820
F1 Score  : 0.7000


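Best model for the above data: Bernoulli Naive Bayes

Bernoulli Naive Bayes is the best model for the following reasons:

1. BernoulliNB has the highest test F1 score (0.8509, versus 0.8100 for GaussianNB and 0.7000 for MultinomialNB).
2. BernoulliNB has the highest test accuracy (0.8870).
3. BernoulliNB has the highest mean 10-fold cross-validation F1 score (0.8492).

Note that GaussianNB achieves the highest recall (0.9522), but with the lowest precision (0.7048), meaning that it flags many legitimate emails as spam.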

Although the Naive Bayes algorithm is simple, fast and widely used, it also has some limitations, including:

1. The assumption of class-conditional independence: Naive Bayes assumes that the features are independent of each other given the class. In real-world data this rarely holds exactly. Word frequencies in emails, for example, are often correlated, so the same evidence is effectively counted more than once.

2. Sensitivity to the feature representation: Each variant assumes a particular distribution for the features, so the results depend strongly on how the inputs are encoded. On this data set the test F1 score ranged from 0.7000 (MultinomialNB) to 0.8509 (BernoulliNB) for the same inputs.

3. Lack of tuning parameters: Naive Bayes has few hyperparameters (mainly the smoothing strength) that can be adjusted to improve its performance.

4. Data sparsity problem: Naive Bayes estimates a probability for every feature within every class. If a feature occurs rarely or never with a class in the training data, its probability is estimated poorly, and without smoothing it can become zero.

5. Imbalanced class distribution: Naive Bayes estimates the class priors from the training data. When the class distribution is heavily imbalanced, predictions can be biased toward the majority class.

6. Distributional assumptions: Each variant assumes a specific form for the features (Gaussian for GaussianNB, counts for MultinomialNB, binary values for BernoulliNB). When the data do not match the assumed form, performance can suffer.

Limitations:

Naive Bayes classifiers assume that the features are independent of each other given the class, which may not always be the case. In addition, Gaussian Naive Bayes assumes that the features are normally distributed within each class, which may not hold for all data sets. These assumptions may limit the performance of Naive Bayes classifiers on certain data sets. Another limitation is that Naive Bayes does not learn explicit feature weights or interactions, so it cannot down-weight redundant or less informative features.
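
The data-sparsity limitation, and the smoothing that mitigates it, can be illustrated with a small sketch. The count matrix below is made up for the example: feature 2 never occurs in class 0, so an unsmoothed estimate of its probability would be zero. With `alpha=1.0` (the scikit-learn default for `MultinomialNB`), one pseudo-count is added per feature and the estimate stays positive.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up counts: feature 2 never occurs in class 0.
X = np.array([
    [2, 1, 0],
    [3, 0, 0],
    [0, 1, 4],
    [1, 0, 3],
])
y = np.array([0, 0, 1, 1])

# alpha=1.0 adds one pseudo-count to every feature within every class.
clf = MultinomialNB(alpha=1.0).fit(X, y)

# Class 0 has feature counts [5, 1, 0] (total 6). With 3 features the
# smoothed estimate for feature 2 is (0 + 1) / (6 + 3 * 1) = 1/9, not 0.
p_unseen = np.exp(clf.feature_log_prob_[0, 2])
print(f"P(feature 2 | class 0) = {p_unseen:.4f}")
```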

Conclusion:

In conclusion, the implementation of Bernoulli Naive Bayes, Multinomial Naive Bayes, and Gaussian Naive Bayes classifiers on the "Spambase Data Set" showed that the Bernoulli Naive Bayes classifier performed the best (test accuracy 0.8870, test F1 score 0.8509). The Spambase features are continuous frequencies and run lengths rather than binary values, but BernoulliNB binarizes its inputs (by default at a threshold of 0), so it effectively models whether each word or character appears at all. A plausible explanation for the result is that this presence/absence signal suits the data better than the Gaussian or multinomial assumptions applied to the raw values. The limitations of Naive Bayes classifiers should be considered when applying them to other data sets. Future work could involve exploring other classification algorithms that do not make these assumptions, or transforming the features so that they better match the assumptions of each Naive Bayes variant.