#### Q1. A company conducted a survey of its employees and found that 70% of the employees use the company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the probability that an employee is a smoker given that he/she uses the health insurance plan?

To find the probability that an employee is a smoker given that they use the health insurance plan, we can use the information provided.

Let:

- H represent the event that an employee uses the health insurance plan.
- S represent the event that an employee is a smoker.

We are given:

- P(H)=0.70 (70% of employees use the health insurance plan).
- P(S∣H)=0.40 (40% of employees who use the health insurance plan are smokers).

The probability we need is P(S∣H), which the problem states directly:

- P(S∣H)=0.40

P(H)=0.70 is not needed for this calculation.

Answer:
The probability that an employee is a smoker, given that they use the health insurance plan, is 0.40 or 40%.
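
As a quick sanity check, the two given numbers can be combined into the joint probability P(S∩H) and then divided back out to recover the conditional. This is only an illustrative sketch; the variable names are made up for the example, and exact fractions are used to avoid floating-point noise.

```python
from fractions import Fraction

p_h = Fraction(70, 100)          # P(H): employee uses the health insurance plan
p_s_given_h = Fraction(40, 100)  # P(S | H): smoker, given plan use

# Joint probability that an employee uses the plan AND smokes.
p_s_and_h = p_s_given_h * p_h

# Definition of conditional probability: P(S | H) = P(S ∩ H) / P(H).
recovered = p_s_and_h / p_h

print(float(p_s_and_h), float(recovered))
```

The joint probability comes out to 0.28, and dividing by P(H) returns the 0.40 stated in the problem.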


#### Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?


Bernoulli Naive Bayes:

- Used for binary/boolean data.
- Assumes features are binary (presence or absence).
- Example: Classifying documents based on whether specific words are present (1) or absent (0).

Multinomial Naive Bayes:

- Used for count data.
- Assumes features represent frequencies or counts.
- Example: Classifying documents based on word frequency counts in text data.
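
The difference can be illustrated on a tiny, made-up word-count matrix. In this sketch the data, labels, and variable names are hypothetical; `binarize=0.0` is scikit-learn's default for `BernoulliNB` and is written out only to make the thresholding explicit.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy word-count matrix: rows are documents, columns are three hypothetical words.
counts = np.array([
    [3, 0, 1],
    [2, 0, 0],
    [0, 4, 1],
    [0, 2, 0],
])
labels = np.array([1, 1, 0, 0])

# What BernoulliNB effectively works with: presence (1) or absence (0) of each word.
presence = (counts > 0).astype(int)

# MultinomialNB uses the counts themselves; BernoulliNB thresholds them at 0.
mnb_demo = MultinomialNB().fit(counts, labels)
bnb_demo = BernoulliNB(binarize=0.0).fit(counts, labels)

# A new document that repeats the first word five times.
new_doc = np.array([[5, 0, 0]])
mnb_pred = mnb_demo.predict(new_doc)
bnb_pred = bnb_demo.predict(new_doc)
print(presence)
print(mnb_pred, bnb_pred)
```

For the Multinomial model the five repetitions each add evidence, whereas the Bernoulli model only registers that the word occurred at all.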


#### Q3. How does Bernoulli Naive Bayes handle missing values?

Bernoulli Naive Bayes does not inherently handle missing values, as it expects binary inputs (0 or 1) indicating the presence or absence of a feature. Here are common approaches to manage missing values in this model:

- Constant Imputation: Replace missing values with 0 (assuming absence) or 1 (assuming presence), based on the context.
- Exclude Features: If certain features frequently have missing values, they can be excluded if they're not critical to classification.
- Mode Imputation: Replace missing values with the most frequent value (0 or 1) of the feature across the dataset. The mean of a binary feature is generally a fraction between 0 and 1 rather than a valid binary value, so it would have to be thresholded before use.
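
A minimal sketch of constant and most-frequent-value (mode) imputation, assuming scikit-learn's `SimpleImputer` is used for the preprocessing step. The feature matrix, labels, and variable names are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical binary feature matrix; np.nan marks a missing value.
X_missing = np.array([
    [1, 0, np.nan],
    [1, np.nan, 0],
    [0, 0, 1],
    [1, 1, 0],
], dtype=float)
y_toy = np.array([1, 1, 0, 0])

# Option 1: treat every missing value as "absent" (0).
X_zero = SimpleImputer(strategy="constant", fill_value=0).fit_transform(X_missing)

# Option 2: fill each column with its most frequent observed value (the mode).
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X_missing)

# Once no NaN remains, the model can be fitted as usual.
model = BernoulliNB().fit(X_mode, y_toy)
predictions = model.predict(X_mode)
print(X_mode)
```

Which option is appropriate depends on what a missing entry means in the data; treating "unknown" as "absent" is itself a modelling assumption.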


#### Q4. Can Gaussian Naive Bayes be used for multi-class classification?

Yes, Gaussian Naive Bayes can be used for multi-class classification. It extends naturally to handle multiple classes by calculating the probability of each class given a set of features and then selecting the class with the highest probability.

For each class, Gaussian Naive Bayes assumes each feature follows a normal (Gaussian) distribution, estimates that feature's mean and variance from the training examples of the class, and uses them to compute the likelihood of the features given the class.
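
As an illustration, the sketch below fits `GaussianNB` on the three-class Iris dataset that ships with scikit-learn. Iris is not part of this assignment; it is used here only because it has more than two classes.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Iris has three classes (0, 1, 2) and four numeric features.
X_iris, y_iris = load_iris(return_X_y=True)
gnb_demo = GaussianNB().fit(X_iris, y_iris)

print(gnb_demo.classes_)      # the class labels the model learned
print(gnb_demo.theta_.shape)  # per-class feature means: (n_classes, n_features)

# One probability per class for each sample; the prediction is the most probable class.
probs = gnb_demo.predict_proba(X_iris[:5])
preds = gnb_demo.predict(X_iris[:5])
print(probs.shape, preds)
```

No special configuration is needed: the same estimator handles two classes or many.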


### Q5. Assignment:

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

In [47]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score

In [3]:
from ucimlrepo import fetch_ucirepo

In [4]:
# fetch dataset 
spambase = fetch_ucirepo(id=94) 

In [5]:
# data (as pandas dataframes) 
X = spambase.data.features

In [6]:
X.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191


In [7]:
y = spambase.data.targets

In [8]:
y.head()

Unnamed: 0,Class
0,1
1,1
2,1
3,1
4,1


In [9]:
df = pd.concat([X, y], axis=1)

In [10]:
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,Class
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [11]:
df = df.rename(columns={"Class": "Spam"})

In [12]:
df.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,Spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [13]:
# Check whether the dataset contains duplicate rows

# df.duplicated() returns a boolean mask of repeated rows; sum() counts the True values
duplicated_rows = df.duplicated().sum()

In [14]:
print(f"Number of duplicate rows: {duplicated_rows}")

Number of duplicate rows: 391


In [15]:
# Removing duplicate rows
df = df.drop_duplicates()

In [16]:
df.shape

(4210, 58)
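
For readers unfamiliar with these two pandas calls, here is a small sketch on a made-up frame (the column names `a` and `b` are hypothetical). It shows that `duplicated()` flags only the second and later copies of a row, and that `drop_duplicates()` keeps the first copy.

```python
import pandas as pd

# Toy frame in which the second row repeats the first.
toy = pd.DataFrame({"a": [1, 1, 2, 2], "b": [0.5, 0.5, 0.7, 0.9]})

mask = toy.duplicated()          # True for rows that repeat an earlier row
n_duplicates = int(mask.sum())   # number of repeated rows
deduplicated = toy.drop_duplicates()

print(mask.tolist(), n_duplicates, deduplicated.shape)
```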

In [22]:
# Set Spam to 0 wherever the "word_freq_george" or "word_freq_650" column is greater than 0.0
# Reason: the dataset documentation (spambase.DOCUMENTATION) describes "george" and "650" as indicators of non-spam
df.loc[(df['word_freq_george'] > 0) | (df['word_freq_650'] > 0), 'Spam'] = 0

In [23]:
columns_to_display = ['word_freq_george', 'word_freq_650', 'Spam']
columns_to_display

['word_freq_george', 'word_freq_650', 'Spam']

In [24]:
# Let's have a look at the dataset and check whether it's updated
columns_to_display = ['word_freq_george', 'word_freq_650', 'Spam']
df_subset = df.loc[:, columns_to_display]
df_subset

Unnamed: 0,word_freq_george,word_freq_650,Spam
0,0.0,0.0,1
1,0.0,0.0,1
2,0.0,0.0,1
3,0.0,0.0,1
4,0.0,0.0,1
...,...,...,...
4596,0.0,0.0,0
4597,0.0,0.0,0
4598,0.0,0.0,0
4599,0.0,0.0,0


In conclusion, wherever the value in the "word_freq_george" or "word_freq_650" column is greater than 0.0, the corresponding value in the "Spam" column has been set to 0. This relabeling treats the occurrence of the words "george" or "650" in an email as a strong indicator that the email is not spam.
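
The rows displayed above all have 0.0 in both columns, so they do not exercise the rule. A more direct check could look like the sketch below, shown on a made-up frame that reuses the same column names; the values and the helper variable names are hypothetical.

```python
import pandas as pd

# Toy frame with the same column names as the notebook's DataFrame.
toy = pd.DataFrame({
    "word_freq_george": [0.0, 1.2, 0.0, 0.0],
    "word_freq_650": [0.0, 0.0, 0.4, 0.0],
    "Spam": [1, 1, 1, 0],
})

# Same rule as in the notebook: either word present -> label as not spam.
mask = (toy["word_freq_george"] > 0) | (toy["word_freq_650"] > 0)
toy.loc[mask, "Spam"] = 0

# Verify the rule on exactly the rows it applies to.
n_relabelled_rows = int(mask.sum())
all_zero = bool((toy.loc[mask, "Spam"] == 0).all())
print(n_relabelled_rows, all_zero)
```

Running the same two checks on `df` would confirm the update on the real data.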

In [25]:
# Save the modified dataframe to a new csv file
df.to_csv('modified_dataset.csv', index=False)

In [26]:
# Reload the updated dataset from the saved CSV file
df_new = pd.read_csv('modified_dataset.csv', header=0)
df_new.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total,Spam
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278,1
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028,1
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259,1
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191,1
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191,1


In [27]:
X = df_new.drop(columns=['Spam'])
y = df_new['Spam']

In [28]:
X.head()

Unnamed: 0,word_freq_make,word_freq_address,word_freq_all,word_freq_3d,word_freq_our,word_freq_over,word_freq_remove,word_freq_internet,word_freq_order,word_freq_mail,...,word_freq_conference,char_freq_;,char_freq_(,char_freq_[,char_freq_!,char_freq_$,char_freq_#,capital_run_length_average,capital_run_length_longest,capital_run_length_total
0,0.0,0.64,0.64,0.0,0.32,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.778,0.0,0.0,3.756,61,278
1,0.21,0.28,0.5,0.0,0.14,0.28,0.21,0.07,0.0,0.94,...,0.0,0.0,0.132,0.0,0.372,0.18,0.048,5.114,101,1028
2,0.06,0.0,0.71,0.0,1.23,0.19,0.19,0.12,0.64,0.25,...,0.0,0.01,0.143,0.0,0.276,0.184,0.01,9.821,485,2259
3,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.137,0.0,0.137,0.0,0.0,3.537,40,191
4,0.0,0.0,0.0,0.0,0.63,0.0,0.31,0.63,0.31,0.63,...,0.0,0.0,0.135,0.0,0.135,0.0,0.0,3.537,40,191


In [29]:
y.head()

0    1
1    1
2    1
3    1
4    1
Name: Spam, dtype: int64

In [30]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)

In [31]:
X_train.shape, X_test.shape

((3157, 57), (1053, 57))

## GaussianNB

In [32]:
gnb = GaussianNB()

In [50]:
gaussian_scores = cross_val_score(gnb, X, y, cv=10, scoring='accuracy')

# Print the average accuracy for Gaussian Naive Bayes
print("Gaussian Naive Bayes Average Accuracy:", np.mean(gaussian_scores))

Gaussian Naive Bayes Average Accuracy: 0.8358669833729216


In [33]:
gnb.fit(X_train,y_train)

In [36]:
y_pred_gnb = gnb.predict(X_test)

In [38]:
print(confusion_matrix(y_test,y_pred_gnb))
print(classification_report(y_test,y_pred_gnb))
print(accuracy_score(y_test,y_pred_gnb))

[[467 158]
 [  8 420]]
              precision    recall  f1-score   support

           0       0.98      0.75      0.85       625
           1       0.73      0.98      0.83       428

    accuracy                           0.84      1053
   macro avg       0.85      0.86      0.84      1053
weighted avg       0.88      0.84      0.84      1053

0.842355175688509
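
To make the report easier to read, the sketch below recomputes the accuracy and the spam-class (label 1) precision, recall, and F1 score directly from the confusion matrix printed above. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Confusion matrix copied from the GaussianNB output above.
cm = np.array([[467, 158],
               [8, 420]])

# For binary labels [0, 1], ravel() yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()
precision_spam = tp / (tp + fp)   # of the emails predicted spam, how many are spam
recall_spam = tp / (tp + fn)      # of the actual spam emails, how many were caught
f1_spam = 2 * precision_spam * recall_spam / (precision_spam + recall_spam)

print(accuracy, precision_spam, recall_spam, f1_spam)
```

The 158 false positives explain the pattern in the report: the model catches almost all spam (high recall) but flags many legitimate emails as spam (lower precision).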


## MultinomialNB

In [39]:
mnb = MultinomialNB()

In [49]:
multinomial_scores = cross_val_score(mnb, X, y, cv=10, scoring='accuracy')

# Print the average accuracy for Multinomial Naive Bayes
print("Multinomial Naive Bayes Average Accuracy:", np.mean(multinomial_scores))

Multinomial Naive Bayes Average Accuracy: 0.7992874109263658


In [None]:
mnb.fit(X_train,y_train)

In [40]:
y_pred_mnb = mnb.predict(X_test)

In [41]:
print(confusion_matrix(y_test,y_pred_mnb))
print(classification_report(y_test,y_pred_mnb))
print(accuracy_score(y_test,y_pred_mnb))

[[524 101]
 [134 294]]
              precision    recall  f1-score   support

           0       0.80      0.84      0.82       625
           1       0.74      0.69      0.71       428

    accuracy                           0.78      1053
   macro avg       0.77      0.76      0.77      1053
weighted avg       0.78      0.78      0.78      1053

0.7768281101614435


## BernoulliNB

In [42]:
bnb = BernoulliNB()

In [48]:
# Perform 10-fold cross-validation
bernoulli_scores = cross_val_score(bnb, X, y, cv=10, scoring='accuracy')

# Print the average accuracy for Bernoulli Naive Bayes
print("Bernoulli Naive Bayes Average Accuracy:", np.mean(bernoulli_scores))

Bernoulli Naive Bayes Average Accuracy: 0.8893111638954867


In [43]:
bnb.fit(X_train,y_train)

In [44]:
y_pred_bnb = bnb.predict(X_test)

In [45]:
print(confusion_matrix(y_test,y_pred_bnb))
print(classification_report(y_test,y_pred_bnb))
print(accuracy_score(y_test,y_pred_bnb))

[[579  46]
 [ 59 369]]
              precision    recall  f1-score   support

           0       0.91      0.93      0.92       625
           1       0.89      0.86      0.88       428

    accuracy                           0.90      1053
   macro avg       0.90      0.89      0.90      1053
weighted avg       0.90      0.90      0.90      1053

0.9002849002849003


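With a test accuracy of about 90% (and a 10-fold cross-validation accuracy of about 89%), the Bernoulli Naive Bayes model performed best among the Naive Bayes variants on our dataset, ahead of Gaussian Naive Bayes (about 84%) and Multinomial Naive Bayes (about 78%).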
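
The three cross-validation runs could also be written as a single loop, which keeps the comparison in one place. The sketch below uses synthetic Poisson counts, not the Spambase data, so its scores say nothing about the results reported here; the names `X_demo`, `y_demo`, and `results` are made up for the example.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB

# Synthetic, non-negative count features (MultinomialNB requires non-negative input).
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1], 100)
lam = np.where(y_demo[:, None] == 1, 3.0, 1.0) * np.ones((200, 5))
X_demo = rng.poisson(lam).astype(float)

models = {
    "GaussianNB": GaussianNB(),
    "MultinomialNB": MultinomialNB(),
    "BernoulliNB": BernoulliNB(),
}

# Mean 5-fold cross-validation accuracy for each variant.
results = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}

best = max(results, key=results.get)
print(results, best)
```

Swapping in the notebook's `X` and `y` with `cv=10` would mirror the cells above.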