# Naive Bayes 2

**Q1. A company conducted a survey of its employees and found that 70% of the employees use the
company's health insurance plan, while 40% of the employees who use the plan are smokers. What is the
probability that an employee is a smoker given that he/she uses the health insurance plan?**

**Ans:**  

**Determine the Probability \( P(S \mid I) \)**

**Define the events:**
- \( I \) = the employee uses the health insurance plan, with \( P(I) = 0.70 \).
- \( S \) = the employee is a smoker.

**Read the given information:**

The statement "40% of the employees who use the plan are smokers" is already a conditional probability. It restricts attention to plan users and gives the fraction of them who smoke:

$$
P(S \mid I) = 0.40
$$

Note that this is \( P(S \mid I) \), not \( P(I \mid S) \). The survey does not say what fraction of smokers use the plan.

**Check with the definition of conditional probability:**

$$
P(S \mid I) = \frac{P(S \cap I)}{P(I)}
$$

The fraction of all employees who both use the plan and smoke is $P(S \cap I) = 0.40 \times 0.70 = 0.28$, so $P(S \mid I) = 0.28 / 0.70 = 0.40$.

**Why Bayes' theorem is not needed here:**

Bayes' theorem states:

$$
P(S \mid I) = \frac{P(I \mid S) \cdot P(S)}{P(I)}
$$

where:
- \( P(S \mid I) \) is the probability that an employee is a smoker given that they use the health insurance plan.
- \( P(I \mid S) \) is the probability that an employee uses the health insurance plan given that they are a smoker.
- \( P(S) \) is the overall probability that an employee is a smoker.
- \( P(I) \) is the overall probability that an employee uses the health insurance plan.

It would be required only if the question supplied $P(I \mid S)$ and $P(S)$. Neither is given, and neither is needed, because the quantity asked for is stated directly.

**Conclusion:**

Hence, $P(S \mid I) = \boxed{0.40}$ .
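
As a quick numerical check, the same conditional probability can be computed from head-counts. This is a minimal sketch: the total of 1000 employees is a hypothetical number chosen only to match the percentages in the question.

```python
# Hypothetical head-count, chosen only to match the percentages in the question.
total_employees = 1000
plan_users = round(0.70 * total_employees)      # 70% use the plan
smoking_plan_users = round(0.40 * plan_users)   # 40% of plan users smoke

p_i = plan_users / total_employees                 # P(I)
p_s_and_i = smoking_plan_users / total_employees   # P(S and I)
p_s_given_i = p_s_and_i / p_i                      # P(S | I) = P(S and I) / P(I)

print(f"P(I)       = {p_i:.2f}")
print(f"P(S and I) = {p_s_and_i:.2f}")
print(f"P(S | I)   = {p_s_given_i:.2f}")
```

The last line prints 0.40, the conditional probability the question asks for.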


**Q2. What is the difference between Bernoulli Naive Bayes and Multinomial Naive Bayes?**

**Ans:**  

**Difference Between Bernoulli Naive Bayes and Multinomial Naive Bayes**

**1. Data Type:**

- **Bernoulli Naive Bayes:**
  - Suitable for binary/boolean features.
  - Assumes that each feature is binary (e.g., 0 or 1, true or false).
  - Works well for text classification problems where the presence or absence of words is more important than their frequency. For example, in spam detection, it can work with whether or not a specific word appears in an email.

- **Multinomial Naive Bayes:**
  - Suitable for discrete features whose values are non-negative counts or frequencies.
  - Assumes that the feature vector of each class is generated by a multinomial distribution over events (e.g., word occurrences).
  - Works well for text classification problems where the frequency of word occurrence matters, such as document classification, where the number of times a word appears is important.

**2. Probability Model:**

- **Bernoulli Naive Bayes:**
  - Models the probability of a feature being present or absent.
  - The feature is modeled as a Bernoulli random variable, i.e., it takes on values 0 or 1.
  - The likelihood of a feature given a class is modeled as a Bernoulli distribution:
    $$
    P(x_i \mid y) = p_i^{x_i} (1 - p_i)^{1 - x_i}
    $$
    where $p_i$ is the probability of feature $x_i$ being present for class y. An absent feature ($x_i = 0$) therefore contributes the factor $1 - p_i$.

- **Multinomial Naive Bayes:**
  - Models the probability of feature counts.
  - The whole count vector is modeled as a draw from a multinomial distribution, with one set of probabilities per class.
  - The likelihood of the count vector $x = (x_1, \ldots, x_n)$ given a class is:
    $$
    P(x \mid y) = \frac{\left(\sum_{i=1}^n x_i\right)!}{\prod_{i=1}^n x_i!} \prod_{i=1}^n p_i^{x_i}
    $$
    where $x_i$ is the count of feature i, $p_i$ is the probability of feature i given class y, and $\sum_{i=1}^n p_i = 1$.

**3. Use Case Examples:**

- **Bernoulli Naive Bayes:**
  - Text classification where the presence/absence of words is important, but not their frequency. Example: spam email classification where the presence of specific keywords determines spam classification.

- **Multinomial Naive Bayes:**
  - Text classification where word frequencies matter. Example: sentiment analysis where the number of times certain words appear can be indicative of sentiment.

**4. Feature Representation:**

- **Bernoulli Naive Bayes:**
  - Features are binary, typically represented as a binary vector indicating the presence (1) or absence (0) of a word or feature.

- **Multinomial Naive Bayes:**
  - Features are counts or frequencies, represented as vectors where each value corresponds to the count of occurrences of a word or feature in a document.
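
The two representations can be produced from the same text. The sketch below uses scikit-learn's `CountVectorizer` on two made-up documents (the documents are purely illustrative): the default setting yields counts, and `binary=True` yields presence/absence indicators.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration.
docs = ["free free free offer", "meeting offer today"]

count_vec = CountVectorizer()               # word counts (MultinomialNB-style input)
binary_vec = CountVectorizer(binary=True)   # presence/absence (BernoulliNB-style input)

X_counts = count_vec.fit_transform(docs).toarray()
X_binary = binary_vec.fit_transform(docs).toarray()

# CountVectorizer orders its columns alphabetically by token.
vocab = sorted(count_vec.vocabulary_)

print(vocab)
print(X_counts)   # "free" appears 3 times in the first document
print(X_binary)   # the same cell becomes 1
```

The only cell that differs between the two matrices is the repeated word, which is exactly the information Bernoulli Naive Bayes discards.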

**5. Mathematical Formulation:**

- **Bernoulli Naive Bayes:**
  - For a binary feature vector \( x \) and class \( y \):
    $$
    P(y \mid x) \propto P(y) \prod_{i=1}^n P(x_i \mid y)
    $$
    where $x_i$ is 0 or 1 and $P(x_i \mid y) = p_i^{x_i} (1 - p_i)^{1 - x_i}$, with $p_i$ the probability of feature i being present in class y. Absent features contribute to the product as well.

- **Multinomial Naive Bayes:**
  - For a feature vector \( x \) with counts and class \( y \):
    $$
    P(y \mid x) \propto P(y) \prod_{i=1}^n p_i^{x_i}
    $$
    where $x_i$ is the count of feature i and $p_i$ is the probability of feature i given class y. The multinomial coefficient $\left(\sum_i x_i\right)! / \prod_i x_i!$ does not depend on y, so it drops out of the proportionality.

In summary, the key difference lies in the nature of the features they handle: Bernoulli Naive Bayes works with binary features indicating presence or absence, while Multinomial Naive Bayes handles features represented as counts or frequencies.
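
To make the contrast concrete, the sketch below evaluates both class-conditional likelihoods for one toy document. It assumes the standard forms of the two models (a product of Bernoulli terms, and the multinomial probability mass function); the probabilities and counts are hypothetical values picked for illustration.

```python
import math

# Hypothetical per-class parameters for a 3-word vocabulary.
p_bernoulli = [0.8, 0.1, 0.5]    # P(word i present | y), each in (0, 1)
p_multinomial = [0.6, 0.1, 0.3]  # P(a token is word i | y), sums to 1

counts = [3, 0, 1]                              # word counts in one document
present = [1 if c > 0 else 0 for c in counts]   # binarised version of the same document


def bernoulli_log_likelihood(x, p):
    """log of prod_i p_i^x_i * (1 - p_i)^(1 - x_i); absent words contribute too."""
    return sum(xi * math.log(pi) + (1 - xi) * math.log(1 - pi) for xi, pi in zip(x, p))


def multinomial_log_likelihood(x, p):
    """log of the multinomial pmf: (sum x_i)! / prod x_i! * prod p_i^x_i."""
    n = sum(x)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(xi + 1) for xi in x)
    return log_coef + sum(xi * math.log(pi) for xi, pi in zip(x, p) if xi > 0)


bern = bernoulli_log_likelihood(present, p_bernoulli)
multi = multinomial_log_likelihood(counts, p_multinomial)
print(f"Bernoulli likelihood:   {math.exp(bern):.4f}")
print(f"Multinomial likelihood: {math.exp(multi):.4f}")
```

The Bernoulli value includes a factor for the absent second word, while the multinomial value depends on how often the first word is repeated.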


**Q3. How does Bernoulli Naive Bayes handle missing values?**

**Ans:**  

**Handling Missing Values in Bernoulli Naive Bayes**

Handling missing values is an important aspect of data preprocessing in machine learning. Bernoulli Naive Bayes expects every feature to be 0 or 1 and has no built-in notion of a missing value, so missing entries must be dealt with explicitly. The usual options are:

**1. Ignoring Missing Values:**
- **Skipping Missing Features:** One straightforward approach is to ignore missing values during training and prediction. Because Naive Bayes multiplies independent per-feature likelihoods, the factor for a missing feature can simply be left out of the product, and the classifier uses only the features with available values. This requires a custom implementation; scikit-learn's `BernoulliNB`, for example, raises an error when the input contains NaN.
- **Impact:** This approach is generally effective if the proportion of missing values is small. However, if missing values are common, this can lead to a loss of valuable information and potentially biased results.

**2. Imputation:**
- **Replacing Missing Values:** Another approach is to impute the missing values before applying the Bernoulli Naive Bayes algorithm. Imputation can be done using several methods:
  - **Mode Imputation:** For binary features, you can replace missing values with the most common value (mode) of the feature in the training data.
  - **Random Imputation:** Replace missing values with randomly sampled values from the feature’s distribution.
  - **Model-Based Imputation:** Use another model to predict the missing values based on other available features.
- **Impact:** Imputation can help in making full use of available data but may introduce noise if not done carefully.

**3. Modifying the Algorithm:**
- **Augmented Bernoulli Naive Bayes:** Some variations of Naive Bayes can be adapted to handle missing values directly. For example, you could modify the Bernoulli Naive Bayes algorithm to include a special category for missing values, although this is less common.
- **Impact:** This approach may require custom implementation and may not be as straightforward as the standard Bernoulli Naive Bayes approach.
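
The mode-imputation option can be sketched with scikit-learn by placing a `SimpleImputer` in front of `BernoulliNB`. This is one possible setup rather than the only one, and the tiny data set is made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline

# Tiny made-up binary data set; np.nan marks a missing entry.
X = np.array([
    [1, 0, np.nan],
    [1, 1, 0],
    [0, np.nan, 1],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
])
y = np.array([1, 1, 0, 0, 1, 0])

# Replace each missing value with the most frequent value of its column,
# then fit Bernoulli Naive Bayes on the completed matrix.
model = make_pipeline(SimpleImputer(strategy="most_frequent"), BernoulliNB())
model.fit(X, y)

# The same imputation is applied automatically at prediction time.
X_new = np.array([[1, np.nan, 0]])
pred = model.predict(X_new)
print(pred)
```

Keeping the imputer inside the pipeline means the column modes are learned from the training data only and reused unchanged for new samples.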

**Q4. Can Gaussian Naive Bayes be used for multi-class classification?**

**Ans:**  
  
Yes, Gaussian Naive Bayes can indeed be used for multi-class classification. The Gaussian Naive Bayes classifier is well-suited for problems where the features are continuous and are assumed to follow a Gaussian (normal) distribution. It can handle multiple classes effectively. Here’s how it works for multi-class classification:

**1. Gaussian Naive Bayes Basics:**
- **Assumption:** Each feature is assumed to be normally distributed within each class.
- **Probability Density Function:** For a continuous feature $x_i$ in class y, the probability density function of the Gaussian distribution is:
  $$
  P(x_i \mid y) = \frac{1}{\sqrt{2 \pi \sigma_{iy}^2}} \exp \left( -\frac{(x_i - \mu_{iy})^2}{2 \sigma_{iy}^2} \right)
  $$
  where $\mu_{iy}$ is the mean and $\sigma_{iy}^2$ is the variance of feature i in class y. Each (feature, class) pair has its own mean and variance.

**2. Multi-Class Classification:**
- **Class Probabilities:** Gaussian Naive Bayes calculates the posterior probability for each class using Bayes' theorem. Given a feature vector  $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ , the classifier computes:
  $$
  P(y \mid \mathbf{x}) \propto P(y) \prod_{i=1}^n P(x_i \mid y)
  $$
  where $P(y)$  is the prior probability of class y, and $P(x_i \mid y)$ is the likelihood of feature $x_i$ given class $y$, modeled as a Gaussian distribution.
  
- **Classification Decision:** The class with the highest posterior probability is chosen as the predicted class:
  $$
  \hat{y} = \arg \max_y P(y \mid \mathbf{x})
  $$

**3. Handling Multiple Classes:**
- **Training:** During training, the model estimates the parameters $\mu$ and $\sigma^2$ for each feature and each class. This involves calculating the mean and variance of each feature for each class from the training data.
- **Prediction:** During prediction, the model evaluates the posterior probability for each class based on the Gaussian distribution of each feature and selects the class with the highest probability.

**4. Advantages for Multi-Class Classification:**
- **Scalability:** Gaussian Naive Bayes is computationally efficient and scales well with the number of features and classes.
- **Simplicity:** The model is simple and easy to interpret, making it a good choice for many practical applications.

This makes Gaussian Naive Bayes a versatile tool in machine learning, capable of handling multiple classes effectively.
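
A short sketch of the multi-class case, using the three-class Iris data set that ships with scikit-learn (the data set and the split settings are choices made for this illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)  # 3 classes, 4 continuous features
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

clf = GaussianNB().fit(X_train, y_train)

print(clf.classes_)       # the three class labels
print(clf.theta_.shape)   # (n_classes, n_features): one mean per class and feature

# One posterior probability per class; the prediction is the arg max.
proba = clf.predict_proba(X_test[:1])
pred = clf.predict(X_test[:1])
print(proba, pred)
```

No change to the estimator is needed for more than two classes: the model simply stores one row of means (and variances) per class and compares the resulting posteriors.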


**Q5. Assignment:**

In [1]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import BernoulliNB,MultinomialNB,GaussianNB
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_validate
from sklearn.metrics import accuracy_score,precision_score,recall_score,f1_score
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
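# Note: spambase.data has no header row, so with the default settings the first
# record is consumed as the column names (hence the numeric-looking labels and
# the 4600 rows below). Passing header=None would keep all 4601 records.
data = pd.read_csv("spambase.data",sep=",")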

In [3]:
data.columns

Index(['0', '0.64', '0.64.1', '0.1', '0.32', '0.2', '0.3', '0.4', '0.5', '0.6',
       '0.7', '0.64.2', '0.8', '0.9', '0.10', '0.32.1', '0.11', '1.29', '1.93',
       '0.12', '0.96', '0.13', '0.14', '0.15', '0.16', '0.17', '0.18', '0.19',
       '0.20', '0.21', '0.22', '0.23', '0.24', '0.25', '0.26', '0.27', '0.28',
       '0.29', '0.30', '0.31', '0.33', '0.34', '0.35', '0.36', '0.37', '0.38',
       '0.39', '0.40', '0.41', '0.42', '0.43', '0.778', '0.44', '0.45',
       '3.756', '61', '278', '1'],
      dtype='object')

In [4]:
data.head()

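|   | 0 | 0.64 | 0.64.1 | 0.1 | 0.32 | 0.2 | 0.3 | 0.4 | 0.5 | 0.6 | ... | 0.41 | 0.42 | 0.43 | 0.778 | 0.44 | 0.45 | 3.756 | 61 | 278 | 1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.21 | 0.28 | 0.5 | 0.0 | 0.14 | 0.28 | 0.21 | 0.07 | 0.0 | 0.94 | ... | 0.0 | 0.132 | 0.0 | 0.372 | 0.18 | 0.048 | 5.114 | 101 | 1028 | 1 |
| 1 | 0.06 | 0.0 | 0.71 | 0.0 | 1.23 | 0.19 | 0.19 | 0.12 | 0.64 | 0.25 | ... | 0.01 | 0.143 | 0.0 | 0.276 | 0.184 | 0.01 | 9.821 | 485 | 2259 | 1 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.137 | 0.0 | 0.137 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.63 | 0.0 | 0.31 | 0.63 | 0.31 | 0.63 | ... | 0.0 | 0.135 | 0.0 | 0.135 | 0.0 | 0.0 | 3.537 | 40 | 191 | 1 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | 1.85 | 0.0 | 0.0 | ... | 0.0 | 0.223 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 15 | 54 | 1 |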


In [5]:
data.shape

(4600, 58)

In [6]:
data["1"].unique()

array([1, 0], dtype=int64)

In [7]:
data["1"].value_counts()

1
0    2788
1    1812
Name: count, dtype: int64

In [8]:
X = data.drop("1",axis = 1)
y = data["1"]

In [9]:
print(f"shape of X:{X.shape}")
print(f"shape of Y:{y.shape}")

shape of X:(4600, 57)
shape of Y:(4600,)


In [10]:
models = {'BernoulliNB' : BernoulliNB(),
            'MultinomialNB' : MultinomialNB(),
            'GaussianNB' : GaussianNB()}

In [11]:
results = {}

In [12]:
# Metrics reported for every model (weighted averages over the two classes).
scoring = ['accuracy', 'precision_weighted', 'recall_weighted', 'f1_weighted']

for name, model in models.items():
    # MultinomialNB rejects negative inputs, so it gets MinMaxScaler (range [0, 1]);
    # BernoulliNB and GaussianNB use StandardScaler.
    scaler = MinMaxScaler() if name == "MultinomialNB" else StandardScaler()
    # Caveat: the scaler is fitted on all rows before cross-validation, so the
    # held-out folds influence the scaling statistics.
    X_scaled = scaler.fit_transform(X)
    cv_results = cross_validate(model, X_scaled, y, cv=10, scoring=scoring)

    # Store the per-fold scores plus their mean and standard deviation.
    results[name] = {}
    for metric in ['accuracy', 'precision', 'recall', 'f1']:
        key = 'test_accuracy' if metric == 'accuracy' else f'test_{metric}_weighted'
        scores = cv_results[key]
        results[name][metric] = scores
        results[name][f'mean_{metric}'] = scores.mean()
        results[name][f'std_{metric}'] = scores.std()

In [13]:
# Output the results
for name, result in results.items():
    print(f"Model: {name}")
    print(f"Cross-validation accuracy scores: {result['accuracy']}")
    print(f"Mean accuracy: {result['mean_accuracy']:.2f}")
    print(f"Standard deviation of accuracy: {result['std_accuracy']:.2f}")
    print(f"Cross-validation precision scores: {result['precision']}")
    print(f"Mean precision: {result['mean_precision']:.2f}")
    print(f"Standard deviation of precision: {result['std_precision']:.2f}")
    print(f"Cross-validation recall scores: {result['recall']}")
    print(f"Mean recall: {result['mean_recall']:.2f}")
    print(f"Standard deviation of recall: {result['std_recall']:.2f}")
    print(f"Cross-validation F1 scores: {result['f1']}")
    print(f"Mean F1 score: {result['mean_f1']:.2f}")
    print(f"Standard deviation of F1 score: {result['std_f1']:.2f}")
    print()
    print("** **"*15)  # Separator line between models
    print()

Model: BernoulliNB
Cross-validation accuracy scores: [0.90217391 0.92391304 0.90217391 0.91956522 0.90434783 0.92608696
 0.95       0.91304348 0.84347826 0.83043478]
Mean accuracy: 0.90
Standard deviation of accuracy: 0.04
Cross-validation precision scores: [0.90569379 0.92440437 0.90377174 0.92028816 0.90462648 0.92605263
 0.95132161 0.91589505 0.8462241  0.82937495]
Mean precision: 0.90
Standard deviation of precision: 0.04
Cross-validation recall scores: [0.90217391 0.92391304 0.90217391 0.91956522 0.90434783 0.92608696
 0.95       0.91304348 0.84347826 0.83043478]
Mean recall: 0.90
Standard deviation of recall: 0.04
Cross-validation F1 scores: [0.90052033 0.92337321 0.90093572 0.91887952 0.90359436 0.92577413
 0.94957376 0.91174918 0.84426929 0.82931285]
Mean F1 score: 0.90
Standard deviation of F1 score: 0.04

** **** **** **** **** **** **** **** **** **** **** **** **** **** **** **

Model: MultinomialNB
Cross-validation accuracy scores: [0.85434783 0.89565217 0.85652174 0.91086

**Discussion:**  
1. Bernoulli Naive Bayes performed better than the Multinomial and Gaussian Naive Bayes base models on all metrics: accuracy, precision, recall and F1 score.
2. Conventionally, either Multinomial or Gaussian would be expected to work better than Bernoulli Naive Bayes, as Multinomial works better on word frequencies and Gaussian works better with continuous data. I guess they underperformed because we did not set or tune any hyperparameters; as they are more complex in nature, hyperparameter tuning is essential for them to work effectively.
3. The limitations of Naive Bayes algorithms are:
   - Assumption of feature independence.
   - Poor performance with correlated features.
   - Zero probability problem.
   - Assumption of normality in Gaussian Naive Bayes.
   - Difficulty in capturing complex relationships.
   - Sensitivity to feature representation.
   - Challenges with large feature spaces.
d to consider alternative models if necessary.


**Conclusion**:  

1. For the current situation, Bernoulli Naive Bayes will be used as the final model.
2. If tuning other parameters is allowed in the future, we can find the best parameters with hyperparameter tuning techniques such as `GridSearchCV` and `RandomizedSearchCV`.
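
A minimal sketch of what such tuning could look like, assuming a `Pipeline` so that the scaler is refitted inside every fold. The synthetic data from `make_classification` is a stand-in for the real `X` and `y`, and the parameter grid values are illustrative guesses, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; replace with the real feature matrix and labels.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

# Scaling lives inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([("scale", StandardScaler()), ("nb", BernoulliNB())])

# Illustrative grid: smoothing strength and the threshold used to binarise features.
param_grid = {
    "nb__alpha": [0.01, 0.1, 1.0],
    "nb__binarize": [-0.5, 0.0, 0.5],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="f1_weighted")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

`RandomizedSearchCV` accepts the same pipeline and samples from the grid instead of trying every combination, which helps when the grid is large.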
