
$\large{\text{Naive Bayes Classifier}}$

Consider the data set as given below.



In [None]:
#First, we import the required packages
import pandas as pd #the pandas library is useful for data processing 
import matplotlib.pyplot as plt #the matplotlib library is useful for plotting purposes

#Get the data from the csv file 
sample_data = pd.read_csv('dataset.csv', index_col=False)

#print the data 
sample_data


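| | Temp gt 100 | Travel to foreign country | Cough | Antibodies in blood | Disease presence |
|---|---|---|---|---|---|
| 0 | Yes | Yes | No | Yes | Yes |
| 1 | No | Yes | Yes | Yes | Yes |
| 2 | Yes | No | Yes | No | Yes |
| 3 | Yes | No | Yes | Yes | Yes |
| 4 | No | No | No | Yes | No |
| 5 | No | Yes | Yes | No | No |
| 6 | No | Yes | No | Yes | No |
| 7 | Yes | Yes | No | No | No |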


Now suppose we have the observation $x$ given by  

$(\text{Temp gt 100=Yes, Travel to foreign country=Yes, Cough = Yes, Antibodies in blood= Yes})$, 

then how do we classify $x$ into the Disease presence category? 

Consider $X$ to be a random variable of which $x$ is an observation. 

Let $Y$ denote the random variable associated with the Disease presence category. Classifying $x$ can then be cast as finding the conditional probabilities $P(Y=\text{Yes}|X=x)$ and $P(Y=\text{No}|X=x)$ and comparing them. 

To find these conditional probabilities, we shall use $\textbf{Bayes' theorem}$.

$\large{\text{Bayes' Theorem}}$ 

The conditional probability $P(Y|X)$ can be written as:

$\begin{align}
P(Y|X) = \frac{P(Y,X)}{P(X)}.
\end{align}
$

Since $P(X|Y) = \frac{P(X,Y)}{P(Y)}$, and since the $\textbf{joint probability}$ $P(X,Y)$ is the same as $P(Y,X)$, we can write: 

$P(Y,X) = P(X|Y) P(Y)$. 

Substituting $P(Y,X)$ in the previous conditional probability $P(Y|X)$, we have:

$\begin{align}
P(Y|X) = \frac{P(Y,X)}{P(X)} = \frac{P(X|Y)P(Y)}{P(X)}. 
\end{align}
$

Thus the $\textbf{posterior probability}$ $P(Y|X)$ of observing $Y$ after receiving $X$ can be seen to be proportional to the product of the $\textbf{likelihood}$ term $P(X|Y)$ and the $\textbf{prior probability}$ $P(Y)$ of $Y$. 

$P(X)$ is a $\textbf{normalization}$ factor: it does not depend on $Y$, so for a fixed observation it is the same constant for every class label. 

Hence we may write $P(Y|X) \propto P(X|Y)P(Y)$. 


Hence using Bayes' Theorem we can write: 

$P(Y=\text{Yes}|X=x) \propto P(X=x|Y=\text{Yes}) P(Y=\text{Yes})$. 

and

$P(Y=\text{No}|X=x) \propto P(X=x|Y=\text{No}) P(Y=\text{No})$.

$\textbf{Question:}$ How do we compute the prior probabilities $P(Y=\text{Yes})$, $P(Y=\text{No})$ and the likelihood terms $P(X=x|Y=\text{Yes})$, $P(X=x|Y=\text{No})$?


$\textbf{Idea:}$

Use the training data to get the required probabilities.

$\textbf{Computing the prior probabilities}$ 

Let us look at the labels in the data again. We can estimate the prior probabilities $P(Y=\text{Yes})$ and $P(Y=\text{No})$ by the relative frequency of each label. 

In [None]:
#print the labels
sample_data['Disease presence']


0    Yes
1    Yes
2    Yes
3    Yes
4     No
5     No
6     No
7     No
Name: Disease presence, dtype: object

We note that out of the $8$ samples, $Y=\text{Yes}$ appears $4$ times and $Y=\text{No}$ appears $4$ times. 

Hence we can write: 

$P(Y=\text{Yes}) = \frac{4}{8} = 0.5$ and $P(Y=\text{No}) = \frac{4}{8} = 0.5$. 
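The frequency count above can be reproduced with a few lines of pandas. This is only a sketch: the label column is retyped by hand from the printed output so that the snippet runs without `dataset.csv`, and the variable names `labels` and `priors` are chosen here for illustration.

```python
import pandas as pd

# Labels copied from the printed 'Disease presence' column above.
labels = pd.Series(['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
                   name='Disease presence')

# Relative frequency of each label, used as the estimate of the prior P(Y=y).
priors = labels.value_counts(normalize=True)

print(priors['Yes'], priors['No'])
```

With the notebook's own data frame, the same call on `sample_data['Disease presence']` should give the same result.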


$\textbf{How to compute } P(X=x|Y=y)\textbf{?}$

Note that in our case $X$ is a multi-dimensional attribute of the form $X=(X_1,X_2,X_3,X_4)$ where:


*   $X_1$ is the random variable associated with $\texttt{Temp gt 100}$
*   $X_2$ is the random variable associated with $\texttt{Travel to foreign country}$
*   $X_3$ is the random variable associated with $\texttt{Cough}$
*   $X_4$ is the random variable associated with $\texttt{Antibodies in blood}$

Similarly, an observation $x$ of $X$ is also multi-variate and is of the form $x=(x_1,x_2,x_3,x_4)$. 

Thus $P(X=x|Y=y)$ can be equivalently written as: $P( (X_1,X_2,X_3,X_4) = (x_1,x_2,x_3,x_4)|Y=y)$. 

Hence given an observation $x=(x_1,x_2,x_3,x_4)$ of $X=(X_1,X_2,X_3,X_4)$ how do we find the conditional probability $P((X_1,X_2,X_3,X_4)=(x_1,x_2,x_3,x_4)|Y=y)$? 


$\textbf{Conditional Independence Assumption}$ 

In the Naive Bayes classifier, we make an important assumption about the data: the attributes (also called covariates) are conditionally independent given the class label. 

This is called the conditional independence assumption. 

Using the conditional independence assumption, we can write: 

$
\begin{align}
P( (X_1,X_2,X_3,X_4) = (x_1,x_2,x_3,x_4) | Y=y) = \prod_{i=1}^{4} P(X_i=x_i|Y=y).
\end{align}
$

Now finding $P( (X_1,X_2,X_3,X_4) = (x_1,x_2,x_3,x_4) | Y=y)$ boils down to finding $P(X_i=x_i|Y=y)$ for $i=1,2,3,4$. 

$\textbf{Question:}$ How do we find $P(X_i=x_i|Y=\text{Yes})$ and $P(X_i=x_i|Y=\text{No})$?

$\textbf{Idea:}$ 

Use the training data.  

$\textbf{Note:}$ Each $X_i$ in the data is discrete-valued, so these probabilities can be estimated simply by counting. 

$\textbf{Recall:}$

The observation $x$ is given by  

$(\text{Temp gt 100=Yes, Travel to foreign country=Yes, Cough = Yes, Antibodies in blood= Yes})$. 

Let us now compute $P(X_1 = \text{Yes}|Y=\text{Yes})$. 

We shall look into the data. 

In [None]:
#print the column 'Temp gt 100' along with label 'Yes'
(sample_data.loc[sample_data['Disease presence'] == 'Yes'])[['Temp gt 100', 'Disease presence']]

| | Temp gt 100 | Disease presence |
|---|---|---|
| 0 | Yes | Yes |
| 1 | No | Yes |
| 2 | Yes | Yes |
| 3 | Yes | Yes |


Of the $4$ samples with $Y=\text{Yes}$, $3$ have $\texttt{Temp gt 100}=\text{Yes}$. Thus we can compute: 

$P(X_1=\text{Yes}|Y=\text{Yes}) = \frac{3}{4} = 0.75$.  
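The same count can be written as a small helper. This is a sketch only: the function name `conditional_prob` is hypothetical, and the two columns are retyped from the tables above so that the snippet runs on its own.

```python
import pandas as pd

# Two columns copied from the data set shown at the top of the notebook.
sample_data = pd.DataFrame({
    'Temp gt 100':      ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes'],
    'Disease presence': ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
})

def conditional_prob(df, feature, value, label, label_col='Disease presence'):
    """Fraction of the rows with class `label` whose `feature` equals `value`."""
    rows = df[df[label_col] == label]       # keep only samples of the given class
    return (rows[feature] == value).mean()  # relative frequency within that class

p_x1 = conditional_prob(sample_data, 'Temp gt 100', 'Yes', 'Yes')
print(p_x1)  # 0.75
```

The helper applies unchanged to the other attributes and to the other class label.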

Let us now compute $P(X_2 = \text{Yes}|Y=\text{Yes})$. 

We shall look into the data. 

In [None]:
#print the column 'Travel to foreign country' along with label 'Yes'
(sample_data.loc[sample_data['Disease presence'] == 'Yes'])[['Travel to foreign country', 'Disease presence']]

| | Travel to foreign country | Disease presence |
|---|---|---|
| 0 | Yes | Yes |
| 1 | Yes | Yes |
| 2 | No | Yes |
| 3 | No | Yes |


Of the $4$ samples with $Y=\text{Yes}$, $2$ have $\texttt{Travel to foreign country}=\text{Yes}$. Thus we can compute: 

$P(X_2=\text{Yes}|Y=\text{Yes}) = \frac{2}{4} = 0.5$.  

Let us now compute $P(X_3 = \text{Yes}|Y=\text{Yes})$. 

We shall look into the data. 

In [None]:
#print the column 'Cough' along with label 'Yes'
(sample_data.loc[sample_data['Disease presence'] == 'Yes'])[['Cough', 'Disease presence']]

| | Cough | Disease presence |
|---|---|---|
| 0 | No | Yes |
| 1 | Yes | Yes |
| 2 | Yes | Yes |
| 3 | Yes | Yes |


Of the $4$ samples with $Y=\text{Yes}$, $3$ have $\texttt{Cough}=\text{Yes}$. Thus we can compute: 

$P(X_3=\text{Yes}|Y=\text{Yes}) = \frac{3}{4} = 0.75$. 

Let us now compute $P(X_4 = \text{Yes}|Y=\text{Yes})$. 

We shall look into the data. 

In [None]:
#print the column 'Antibodies in blood' along with label 'Yes'
(sample_data.loc[sample_data['Disease presence'] == 'Yes'])[['Antibodies in blood', 'Disease presence']]

| | Antibodies in blood | Disease presence |
|---|---|---|
| 0 | Yes | Yes |
| 1 | Yes | Yes |
| 2 | No | Yes |
| 3 | Yes | Yes |


Of the $4$ samples with $Y=\text{Yes}$, $3$ have $\texttt{Antibodies in blood}=\text{Yes}$. Thus we can compute: 

$P(X_4=\text{Yes}|Y=\text{Yes}) = \frac{3}{4} = 0.75$. 


Now that we have computed all relevant probabilities, we can compute 
$P((X_1,X_2,X_3,X_4)=(\text{Yes,Yes,Yes,Yes})|Y=\text{Yes}) = \prod_{i=1}^{4} P(X_i = \text{Yes}|Y=\text{Yes})$. 

Thus $P((X_1,X_2,X_3,X_4)=(\text{Yes,Yes,Yes,Yes})|Y=\text{Yes}) = 0.75 \times 0.5 \times 0.75 \times 0.75 = 0.2109375$. 
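The product for the $Y=\text{Yes}$ class can be checked numerically. The sketch below rebuilds the data set by hand from the table at the top of the notebook; the names `x`, `factors`, `likelihood_yes` and `score_yes` are illustrative choices, not part of the original notebook.

```python
import math
import pandas as pd

# Data set retyped from the table shown at the top of the notebook.
sample_data = pd.DataFrame({
    'Temp gt 100':               ['Yes', 'No', 'Yes', 'Yes', 'No', 'No', 'No', 'Yes'],
    'Travel to foreign country': ['Yes', 'Yes', 'No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
    'Cough':                     ['No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'No'],
    'Antibodies in blood':       ['Yes', 'Yes', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No'],
    'Disease presence':          ['Yes', 'Yes', 'Yes', 'Yes', 'No', 'No', 'No', 'No'],
})

# The observation x to be classified.
x = {'Temp gt 100': 'Yes', 'Travel to foreign country': 'Yes',
     'Cough': 'Yes', 'Antibodies in blood': 'Yes'}

# Restrict to the samples with Y = Yes and estimate each factor by counting.
yes_rows = sample_data[sample_data['Disease presence'] == 'Yes']
factors = [float((yes_rows[col] == val).mean()) for col, val in x.items()]

# Conditional independence: the likelihood is the product of the factors.
likelihood_yes = math.prod(factors)

# Multiply by the prior P(Y = Yes) to get the unnormalized posterior score.
prior_yes = float((sample_data['Disease presence'] == 'Yes').mean())
score_yes = likelihood_yes * prior_yes

print(likelihood_yes)  # 0.2109375
```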

Similarly, we can compute $P((X_1,X_2,X_3,X_4)=(\text{Yes,Yes,Yes,Yes})|Y=\text{No}) = \prod_{i=1}^{4} P(X_i = \text{Yes}|Y=\text{No})$. 

$\textbf{Exercise:}$ Find  $P((X_1,X_2,X_3,X_4)=(\text{Yes,Yes,Yes,Yes})|Y=\text{No})$. 

Now we can compute $P(Y=\text{Yes}|X=x)$ as a quantity proportional to $P(X=x|Y=\text{Yes}) P(Y=\text{Yes}) = 0.2109375 \times 0.5 = 0.10546875$. 

Similarly, we can compute $P(Y=\text{No}|X=x)$ as a quantity proportional to $P(X=x|Y=\text{No}) P(Y=\text{No})$. 

$\textbf{Exercise:}$ Compute $P(Y=\text{No}|X=x)$. 

Having computed the posterior probabilities for the $Y=\text{Yes}$ and $Y=\text{No}$ labels, we can compare them and assign the label with the maximum posterior probability. 
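The comparison step can be sketched as follows. The function name `predict`, the class names, and the scores are all hypothetical and are not the values for the data set above. Dividing each score by their sum recovers proper posterior probabilities, since the sum over all classes of $P(X=x|Y=y)P(Y=y)$ equals $P(X=x)$.

```python
def predict(scores):
    """Pick the class with the largest unnormalized score P(X=x|Y=y) * P(Y=y).

    `scores` maps each class label to its unnormalized score.
    Returns the winning label and the normalized posterior probabilities.
    """
    total = sum(scores.values())  # plays the role of P(X=x)
    posteriors = {label: s / total for label, s in scores.items()}
    best = max(scores, key=scores.get)  # normalizing does not change the argmax
    return best, posteriors

# Hypothetical scores, for illustration only.
label, posteriors = predict({'class A': 0.03, 'class B': 0.01})
print(label, posteriors)
```

Because the normalization factor is shared by all classes, the predicted label can be read off the unnormalized scores directly.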

$\textbf{Exercise:}$ What is the label predicted for observation $x$? 


$\textbf{Question:}$ How do we deal with continuous attributes? 

$\textbf{Idea:}$ Assume that, within each class, the continuous attribute follows a Gaussian distribution. 

Hence $P(X_i=x_i|Y=y_j)$ can be characterized by the density $f(x_i; \mu_{ij},\sigma^2_{ij}) = \frac{1}{\sqrt{2\pi}\sigma_{ij}}  e^{-\frac{(x_i-\mu_{ij})^2}{2\sigma_{ij}^2}}$. 

The mean $\mu_{ij}$ and variance $\sigma^2_{ij}$ are estimated from the values of attribute $X_i$ in the samples with $Y=y_j$.

Strictly speaking, for a continuous attribute we would compute the probability of a small interval, $P(x_i \leq X_i \leq x_i + \epsilon|Y=y_j)$. 

For small $\epsilon$ this yields $P(x_i \leq X_i \leq x_i + \epsilon|Y=y_j) \approx \epsilon f(x_i; \mu_{ij},\sigma^2_{ij})$. 

Since $\epsilon$ is the same for every class, it cancels when the posterior probabilities are compared and can therefore be ignored. 
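A minimal sketch of the Gaussian likelihood term is given below. The function name `gaussian_pdf` and the readings in `values_in_class` are hypothetical. The variance is estimated by dividing by $n$ (the maximum-likelihood estimate); this is an assumption made here, and dividing by $n-1$ is an equally common choice.

```python
import math
import statistics

def gaussian_pdf(x, mean, variance):
    """Density of a normal distribution with the given mean and variance at x."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)

# Hypothetical values of one continuous attribute X_i for the samples of one class y_j.
values_in_class = [1.0, 2.0, 3.0, 4.0]

mu = statistics.mean(values_in_class)        # estimate of mu_ij
var = statistics.pvariance(values_in_class)  # estimate of sigma_ij^2 (divides by n)

# This density replaces the counted probability P(X_i = x_i | Y = y_j) in the product.
density = gaussian_pdf(2.0, mu, var)
print(mu, var, density)
```

The density is evaluated once per class, each time with that class's own mean and variance, and then multiplied with the other factors exactly as in the discrete case.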

$\large{\text{Some notable features of Naive Bayes Classifier}}$



1.   $\textbf{Smoothing of noisy attributes:}$ Can the Naive Bayes classifier deal with noise in the attributes? 
2.   $\textbf{Dealing with irrelevant attributes:}$ For an attribute $X$ that is not relevant to predicting the class label $Y$, what will $P(Y|X)$ look like? What impact does it have?
3.   $\textbf{Dealing with correlated attributes:}$ Can the Naive Bayes classifier handle correlated attributes? 





$\large{\text{Exercise}}$

Suppose the attribute column $\texttt{Temp gt 100}$ is replaced with $\texttt{Temperature}$ attribute column with the values $(100.3, 98.6, 100.5, 99.5, 101.01, 98.3, 99.5, 100.2)$. Find the posterior probabilities 



*   $P(Y=\text{Yes}|X=(\text{Yes, No, No, Yes}))$ and $P(Y=\text{No}|X=(\text{Yes, No, No, Yes}))$. 
*   $P(Y=\text{Yes}|X=(\text{No, Yes, No, No}))$ and $P(Y=\text{No}|X=(\text{No, Yes, No, No}))$. 
