In [None]:
!pip install tabulate scikit-learn pandas scipy

# Naive Bayes Classifier

This notebook introduces the basics of the Naive Bayes algorithm for classification tasks. It includes the following content:

- Brief overview of the Naive Bayes (NB) Classifier
- An example exercise of performing inference with NB


## What is a classifier?

A classifier is a machine learning model that discriminates between different objects based on certain features. Given sample data $X$, a classifier predicts the class $y$ it belongs to.

## What is Naive Bayes Classifier?

A Naive Bayes classifier is a probabilistic machine learning model for classification tasks. It is based on Bayes' theorem and imposes a strong assumption of feature independence.

## Bayes Theorem

$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$

Bayes' theorem lets us compute the probability of event A given that event B has occurred. Event B is the evidence and event A is the hypothesis. The assumption made by Naive Bayes is that the features are conditionally independent given the class, i.e. once the class is known, the value of one feature carries no information about another. This assumption rarely holds exactly, which is why the method is called naive.
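
As a quick numeric illustration of the theorem, the sketch below plugs made-up probabilities into the formula. The function name `bayes_posterior` and all numbers are assumptions chosen for illustration; they are not part of the exercise. The evidence $P(B)$ is obtained from the law of total probability.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A | B) = P(B | A) * P(A) / P(B)."""
    if evidence <= 0:
        raise ValueError("P(B) must be positive")
    return likelihood * prior / evidence

# Hypothetical numbers (not taken from the notebook's dataset)
p_b_given_a = 0.9       # P(B | A)
p_a = 0.2               # P(A)
p_b_given_not_a = 0.3   # P(B | not A)

# Law of total probability: P(B) = P(B | A) P(A) + P(B | not A) P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

p_a_given_b = bayes_posterior(p_b_given_a, p_a, p_b)
print(f"P(B) = {p_b:.2f}, P(A | B) = {p_a_given_b:.4f}")
```

Even though the likelihood $P(B \mid A)$ is high, the posterior stays below one half here because the prior $P(A)$ is small.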

In the context of classification, the classifier predicts the class $y$ given the observation $X$. With $y$ and $X$ replacing $A$ and $B$, Bayes' theorem becomes

$$ P(y \mid X) = \frac{P(X \mid y) \, P(y)}{P(X)} $$

The formula consists of four components:

- $P(y \mid X)$: The posterior probability, which is the probability of class $y$ given the observation $X$.

- $P(y)$: The prior probability, which is the initial belief about class $y$ before seeing the observation.

- $P(X \mid y)$: The likelihood, which is the probability of observation $X$ given class $y$.

- $P(X)$: The evidence, which is the probability of observation $X$.

In classification tasks, the variable $y$ is the class label. The variable $X$ represents the features and usually has multiple dimensions:

$$ X = (x_1, x_2, x_3, ..., x_n) $$

where $x_1, x_2, ..., x_n$ are the features, which NB assumes to be conditionally independent given the class, i.e. $(x_i \perp x_j \mid y)$ for all $i \neq j$ with $i, j \in \{1, 2, ..., n\}$. Expanding the likelihood with the chain rule and applying this assumption, we obtain:

$$ P(y \mid x_1, x_2, ..., x_n) = \frac{P(x_1, x_2, ..., x_n \mid y) \, P(y)}{P(x_1, x_2, ..., x_n)} = \frac{P(x_1 \mid y) P(x_2 \mid y) \cdots P(x_n \mid y) \, P(y)}{P(x_1, x_2, ..., x_n)} $$

Note that the denominator $P(X) = P(x_1, x_2, ..., x_n)$ does not factorize into $P(x_1) \cdots P(x_n)$, because the features are only assumed to be independent within each class.

The denominator $P(X)$ of Bayes' rule is the same for all classes, so it only acts as a normalization term and can be dropped when performing inference. Combining this with the conditional independence assumption, the NB formula becomes:

$$ P(y \mid x_1, x_2, ..., x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) $$

In binary classification the class variable $y$ has two outcomes; in general it can have more. In either case we predict the class with the maximum posterior probability, i.e. $ \hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y) $.
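
The decision rule above can be sketched in a few lines. The helper name `nb_predict` and the prior and likelihood values are hypothetical, chosen only to illustrate the arithmetic. The sketch sums logarithms instead of multiplying probabilities, a common way to avoid numerical underflow, and it assumes every likelihood is strictly positive.

```python
import math

def nb_predict(priors, likelihoods):
    """Pick the class maximizing log P(y) + sum_i log P(x_i | y).

    priors:      dict mapping class label -> P(y)
    likelihoods: dict mapping class label -> list of P(x_i | y) for the observed x_i
    """
    scores = {}
    for label, prior in priors.items():
        # Summing logs is equivalent to multiplying the probabilities
        scores[label] = math.log(prior) + sum(math.log(p) for p in likelihoods[label])
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical values for a two-class problem with three features
priors = {"Yes": 0.4, "No": 0.6}
likelihoods = {"Yes": [0.5, 0.7, 0.2], "No": [0.3, 0.4, 0.6]}

label, scores = nb_predict(priors, likelihoods)
print("Predicted class:", label)
print("Unnormalized scores:", {k: math.exp(v) for k, v in scores.items()})
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product $P(y) \prod_i P(x_i \mid y)$.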

## An example exercise of performing inference with NB

We will use the following example to strengthen our understanding of NB. The toy dataset is for classifying whether a person owns a pet. Each observation $X$ contains three features, two categorical ("Gender" and "Education") and one numerical ("Income"), and the class label $y$ (i.e. "Has_pet") indicates whether the person owns a pet.

In [None]:
from IPython.display import HTML, display
import tabulate
tab_cat = [["Gender", "Education", "Income", "Has_pet"],
          ["Female", "University", 103000,   "Yes"],
          ["Female", "HighSchool", 90500,   "No"],
          ["Female", "HighSchool", 114000,   "No"],
          ["Male",   "University", 102000,   "No"],
          ["Male",   "University", 75000,   "Yes"],
          ["Male",   "HighSchool", 90000,   "No"],
          ["Male",   "HighSchool", 85000,   "Yes"],
          ["Male",   "University", 86000,   "No"]]
display(HTML(tabulate.tabulate(tab_cat, tablefmt='html')))

<div class='alert alert-block alert-success' style="font-weight:bolder">

### Task 2a - Compute the likelihood table of having a pet for each categorical feature, as well as the marginal probability.

- $P(Gender|Has\_pet)$: $P(Male|Yes)$, $P(Female|Yes)$, $P(Male|No)$, $P(Female|No)$
    
- $P(Education|Has\_pet)$: $P(University|Yes)$, $P(HighSchool|Yes)$, $P(University|No)$, $P(HighSchool|No)$
    
</div>

In [None]:
import pandas as pd
from IPython.display import HTML, display
import tabulate

# Build the DataFrame from the toy table defined above (first row is the header)
df = pd.DataFrame(tab_cat[1:], columns=tab_cat[0])

for feature in ["Gender", "Education"]:
    # Likelihood P(feature | Has_pet): normalize='columns' makes each class column sum to 1
    likelihood = pd.crosstab(df[feature], df['Has_pet'], normalize='columns')

    # Convert the DataFrame to a format suitable for tabulate
    table_data = [(index, *row) for index, row in likelihood.iterrows()]
    headers = [feature, f"P({feature} | No)", f"P({feature} | Yes)"]

    # Display the likelihood table
    display(HTML(tabulate.tabulate(table_data, headers=headers, tablefmt='html')))

    # Marginal probability P(feature), ignoring the class
    print(df[feature].value_counts(normalize=True))

<div class='alert alert-block alert-success' style="font-weight:bolder">

### Task 2b - Compute posterior probability

- $P(\text{No}|\text{Male})$, $P(\text{Yes}|\text{Female})$
    
- $P(\text{Yes}|\text{University})$, $P(\text{No}|\text{HighSchool})$

</div>


In [None]:
import pandas as pd

# Build the DataFrame from the toy table defined above (first row is the header)
df = pd.DataFrame(tab_cat[1:], columns=tab_cat[0])

# Prior probabilities P(Has_pet)
prior = df['Has_pet'].value_counts(normalize=True)

def posterior(feature, value, class_label):
    """Bayes' rule for one feature: P(class | value) = P(value | class) P(class) / P(value)."""
    # Likelihood P(value | class): normalize within each class column (from Task 2a)
    likelihood = pd.crosstab(df[feature], df['Has_pet'], normalize='columns').at[value, class_label]
    # Marginal (evidence) P(value)
    marginal = df[feature].value_counts(normalize=True)[value]
    return likelihood * prior[class_label] / marginal

print("P(No | Male):", posterior('Gender', 'Male', 'No'))
print("P(Yes | Female):", posterior('Gender', 'Female', 'Yes'))
print("P(Yes | University):", posterior('Education', 'University', 'Yes'))
print("P(No | HighSchool):", posterior('Education', 'HighSchool', 'No'))


<div class='alert alert-block alert-success' style="font-weight:bolder">

### Task 2c - Compute the likelihood of having a pet using the mean, standard deviation, and normal distribution function:

- Mean: $ \mu = \frac{1}{n} \sum^{n}_{i=1}{x_i} $
    
- Standard Deviation: $ \sigma = \left[ \frac{1}{n-1} \sum^{n}_{i=1}{(x_i-\mu)^2} \right]^\frac{1}{2}  $
    
- Normal Distribution: $f(x)=\dfrac{1}{\sigma\sqrt{2\pi}}\,e^{-\dfrac{(x-\mu)^2}{2\sigma^2}}$
    
Compute $P( \text{Income}=90000 \mid \text{Yes})$, $P( \text{Income}=90000 \mid \text{No})$

</div>
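
The three formulas in the task can also be evaluated by hand with the standard library only. The sketch below does this for the incomes of the three "Yes" rows in the toy table. The helper names `sample_mean`, `sample_std`, and `normal_pdf` are illustrative assumptions. Note that the result is a probability density, not a probability, so it is only meaningful when compared across classes.

```python
import math

def sample_mean(values):
    """Mean: (1/n) * sum(x_i)."""
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation with n - 1 in the denominator, as in the task."""
    mu = sample_mean(values)
    return math.sqrt(sum((v - mu) ** 2 for v in values) / (len(values) - 1))

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# Incomes of the rows with Has_pet == "Yes" in the toy table
incomes_yes = [103000, 75000, 85000]

mu = sample_mean(incomes_yes)
sigma = sample_std(incomes_yes)
density = normal_pdf(90000, mu, sigma)

print("mean:", mu)
print("std:", sigma)
print("f(90000 | Yes):", density)
```

The `n - 1` denominator matches the default of pandas' `Series.std()`, so these values should agree with a pandas and `scipy.stats.norm.pdf` based computation.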

In [None]:
import pandas as pd
from scipy.stats import norm

# Build the DataFrame from the toy table defined above (first row is the header)
df = pd.DataFrame(tab_cat[1:], columns=tab_cat[0])

# Function to calculate likelihood using Normal Distribution
def calculate_likelihood(df, feature, target_value, class_label):
    # Filter the DataFrame for the given class
    class_df = df[df['Has_pet'] == class_label]
    
    # Calculate mean and standard deviation for the feature
    mean = class_df[feature].mean()
    std = class_df[feature].std()

    # Calculate the probability density for the target value
    probability_density = norm.pdf(target_value, mean, std)
    return probability_density

# Compute likelihood for Income = 90000 given 'Yes' and 'No'
likelihood_income_yes = calculate_likelihood(df, 'Income', 90000, 'Yes')
likelihood_income_no = calculate_likelihood(df, 'Income', 90000, 'No')

print("P(Income = 90000 | Yes):", likelihood_income_yes)
print("P(Income = 90000 | No):", likelihood_income_no)

<div class='alert alert-block alert-success' style="font-weight:bolder">

### Task 2d - Making inference / casting predictions

- $X=(Education=University, Gender=Female, Income=100000)$
    
- $X=(Education=HighSchool, Gender=Male, Income=92000)$

</div>



In [None]:
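import pandas as pd
from scipy.stats import norm

# Build the DataFrame from the toy table defined above (first row is the header)
df = pd.DataFrame(tab_cat[1:], columns=tab_cat[0])

def calculate_likelihood(df, feature, target_value, class_label):
    # Filter the DataFrame for the given class
    class_df = df[df['Has_pet'] == class_label]
    
    # Calculate mean and standard deviation for the feature
    mean = class_df[feature].mean()
    std = class_df[feature].std()

    # Calculate the probability density for the target value
    probability_density = norm.pdf(target_value, mean, std)
    return probability_density

def calculate_posterior(df, feature_values, class_label):
    # Start from the prior probability P(class)
    posterior = df['Has_pet'].value_counts(normalize=True)[class_label]
    class_df = df[df['Has_pet'] == class_label]

    # Multiply by the likelihood of each feature
    for feature, value in feature_values.items():
        if pd.api.types.is_numeric_dtype(df[feature]):
            # Numerical feature: Gaussian density (Task 2c)
            likelihood = calculate_likelihood(df, feature, value, class_label)
        else:
            # Categorical feature: relative frequency within the class (Task 2a)
            likelihood = (class_df[feature] == value).mean()
        posterior *= likelihood

    # Unnormalized: proportional to P(class | features)
    return posterior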

# Feature values for the predictions
features_1 = {'Education': 'University', 'Gender': 'Female', 'Income': 100000}
features_2 = {'Education': 'HighSchool', 'Gender': 'Male', 'Income': 92000}

# Calculate posterior probabilities for 'Yes' and 'No' for the first scenario
posterior_yes_1 = calculate_posterior(df, features_1, 'Yes')
posterior_no_1 = calculate_posterior(df, features_1, 'No')

# Normalize probabilities
total_1 = posterior_yes_1 + posterior_no_1
posterior_yes_1 /= total_1
posterior_no_1 /= total_1

# Calculate posterior probabilities for 'Yes' and 'No' for the second scenario
posterior_yes_2 = calculate_posterior(df, features_2, 'Yes')
posterior_no_2 = calculate_posterior(df, features_2, 'No')

# Normalize probabilities
total_2 = posterior_yes_2 + posterior_no_2
posterior_yes_2 /= total_2
posterior_no_2 /= total_2

print("Prediction for (Education=University, Gender=Female, Income=100000):")
print("P(Yes):", posterior_yes_1, "P(No):", posterior_no_1)

print("\nPrediction for (Education=HighSchool, Gender=Male, Income=92000):")
print("P(Yes):", posterior_yes_2, "P(No):", posterior_no_2)

<div class='alert alert-block alert-success' style="font-weight:bolder">

### Task 2e (Extra Credit) Implementing a Naive Bayes Classifier and performing classification on the Iris dataset. Note that the Iris dataset only contains numerical features.

</div>
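
For readers who want to see what a Gaussian Naive Bayes classifier does internally, the following is a minimal from-scratch sketch using only the standard library. The class name `TinyGaussianNB` and the small training set are hypothetical and are not the Iris data. The sketch uses the population variance plus a small constant for numerical stability, which is an assumption of this example; library implementations apply their own smoothing, so their numbers may differ slightly.

```python
import math
from collections import defaultdict

class TinyGaussianNB:
    """Minimal Gaussian Naive Bayes for numerical features (illustrative only)."""

    def fit(self, X, y):
        # Group the training rows by class label
        rows_by_class = defaultdict(list)
        for row, label in zip(X, y):
            rows_by_class[label].append(row)

        n_total = len(X)
        self.log_priors_ = {}
        self.params_ = {}  # label -> list of (mean, variance), one pair per feature
        for label, rows in rows_by_class.items():
            self.log_priors_[label] = math.log(len(rows) / n_total)
            stats = []
            for column in zip(*rows):
                mean = sum(column) / len(column)
                # Population variance plus a small constant to avoid division by zero
                var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
                stats.append((mean, var))
            self.params_[label] = stats
        return self

    def _log_score(self, row, label):
        # log P(y) + sum_i log N(x_i; mean_i, var_i)
        score = self.log_priors_[label]
        for value, (mean, var) in zip(row, self.params_[label]):
            score += -0.5 * math.log(2.0 * math.pi * var) - (value - mean) ** 2 / (2.0 * var)
        return score

    def predict(self, X):
        # For each row, return the label with the highest log score
        return [max(self.params_, key=lambda label: self._log_score(row, label)) for row in X]

# Hypothetical, well-separated two-feature training data
X_train = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y_train = ["a", "a", "a", "b", "b", "b"]

model = TinyGaussianNB().fit(X_train, y_train)
predictions = model.predict([[1.1, 2.0], [3.1, 4.0]])
print(predictions)
```

The same class should accept any list of numerical rows, for example the Iris features converted with `iris.data.tolist()`, although that usage is not exercised here.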




In [None]:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Gaussian Naive Bayes classifier
gnb = GaussianNB()

# Train the model
gnb.fit(X_train, y_train)

# Make predictions
y_pred = gnb.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)

print(f"Model Accuracy: {accuracy}")
print("\nClassification Report:\n", report)
print("\nConfusion Matrix:\n", conf_matrix)