### **Objective**
This week, we will implement the Naive Bayes (NB) classifier from scratch.

### **Naive Bayes Classifier**

* The Naive Bayes classifier is a **generative classifier**.

* It **estimates the probability** of a sample belonging to a class using **Bayes' theorem**:
\begin{align}
\text{posterior} &= \frac{\text{prior} \times \text{likelihood}}{\text{evidence}} \\
p(y|\mathbf x) &= \frac{p(y) \times p(\mathbf x|y)}{p(\mathbf x)}
\end{align}


![Image source](https://editor.analyticsvidhya.com/uploads/374484.png)

* It simplifies the calculation of the **likelihood** with the **conditional independence assumption**, i.e. NB assumes that the features are conditionally independent given the class label.

    * The **likelihood** can be expressed as:
    \begin{align}
    p(\mathbf x| y) &= p(x_1,x_2,\ldots,x_m|y) \\
    &= p(x_1|y) \ p(x_2|y) \ldots\ p(x_m|y) \\
    &= \prod_{j=1}^{m}p(x_j|y)
    \end{align}

    * Substituting likelihood in the Bayes theorem gives us the following formula :
    \begin{equation}
    p(y=y_c|\mathbf x)= \frac{p(y_c)\prod_{j=1}^{m}p(x_j|y_c)}{\sum_{r=1}^kp(y_r) \prod_{j=1}^{m}p(x_j|y_r)}
    \end{equation}

* The algorithm is called **Naive** because it assumes that the features are conditionally independent given the class, even when they may not be. In real-world data, features are rarely independent of one another.
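As a minimal sketch of the posterior formula above, the snippet below works through a toy problem with $k=2$ classes and $m=3$ binary features. The arrays `prior` and `theta` are made-up numbers used only for illustration, and `theta[c, j]` is assumed to hold $p(x_j = 1|y_c)$.

```python
import numpy as np

# Hypothetical parameters (made up for illustration): k = 2 classes, m = 3 binary features.
prior = np.array([0.6, 0.4])            # p(y_c)
theta = np.array([[0.8, 0.1, 0.5],      # theta[c, j] = p(x_j = 1 | y_c)
                  [0.3, 0.7, 0.5]])

x = np.array([1, 0, 1])                 # sample to classify

# Per-feature likelihoods p(x_j | y_c): theta where x_j = 1, otherwise 1 - theta.
per_feature = np.where(x == 1, theta, 1 - theta)    # shape (k, m)

# Conditional independence: the class likelihood is the product over features.
likelihood = per_feature.prod(axis=1)               # p(x | y_c)

numerator = prior * likelihood                      # p(y_c) * p(x | y_c)
posterior = numerator / numerator.sum()             # divide by the evidence p(x)

print(posterior)
```

The denominator is simply the sum of the numerators over all classes, so the resulting posteriors sum to one.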

### **Inference**
* We perform this calculation in **log-space** to avoid issues with underflow due to multiplication of small numbers.

* The label that results in the **highest value of the numerator**, i.e. $\text{prior} \times \text{likelihood}$, is assigned to the given example.

**Note** : *The evidence is fixed for all labels and acts as a normalizing constant.* 

\begin{equation}
\hat{y} = \text{argmax}_{y} \left[ \log p(y) + \sum_{j=1}^{m}\log p(x_j|y) \right]
\end{equation}
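The following sketch illustrates why the argmax is computed in log space. The per-feature likelihoods are random, made-up values for a hypothetical problem with $k=2$ classes and $m=1000$ features. Multiplying them directly underflows to zero, while the sum of logs stays finite.

```python
import numpy as np

# Hypothetical per-feature likelihoods p(x_j | y_c): k = 2 classes, m = 1000 features.
rng = np.random.default_rng(0)
per_feature = rng.uniform(0.01, 0.2, size=(2, 1000))
prior = np.array([0.5, 0.5])

# Direct product of 1000 small numbers underflows to 0.0 for every class,
# so comparing these values cannot separate the classes.
direct = prior * per_feature.prod(axis=1)

# Log-space score: log p(y) + sum_j log p(x_j | y).
log_score = np.log(prior) + np.log(per_feature).sum(axis=1)
y_hat = int(np.argmax(log_score))

print(direct)
print(log_score, y_hat)
```

Since the logarithm is monotonically increasing, the label that maximizes the log score is the same one that would maximize the product if it could be represented exactly.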

* Computing the posterior probability, however, requires the full Bayes formula:

    * We first compute the product of the likelihood and the prior for each label in log space:
    \begin{equation} 
    \log p(\mathbf x, y_r) =\log p(y_r) + \sum_{j=1}^{m} 
    \log p(x_j|y_r) 
    \end{equation} 
    
        and convert it to a probability by exponentiating:

    \begin{equation} 
    p(\mathbf x, y_r)= {\exp (\log p(\mathbf x,y_r))}
    \end{equation}

    * We then sum these probabilities to obtain the evidence, i.e. the denominator of the formula:
    \begin{equation} 
    p(\mathbf x) = \sum_{r=1}^{k} p(\mathbf x,y_r)=\sum_{r=1}^{k} \exp (\log p(\mathbf x,y_r)) 
    \end{equation}

    * Substituting these values, we obtain the posterior probability:
    \begin{equation} 
    p(y_r|\mathbf x) = \frac{p(\mathbf x, y_r)}{p(\mathbf x)}
    \end{equation}
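The steps above can be sketched as a small helper. The function name `posterior_from_log_joint` and the input values are hypothetical. One detail goes beyond the steps as written: the maximum log-joint value is subtracted before exponentiating (the usual log-sum-exp shift). The shift cancels in the ratio, so the posterior is unchanged, but it prevents `exp` from underflowing to zero for every class.

```python
import numpy as np

def posterior_from_log_joint(log_joint):
    """Turn log p(x, y_r) for r = 1..k into posteriors p(y_r | x)."""
    log_joint = np.asarray(log_joint, dtype=float)
    # Shift by the maximum: the constant cancels between numerator and
    # denominator, but keeps exp() away from underflow.
    shifted = log_joint - log_joint.max()
    joint = np.exp(shifted)        # proportional to p(x, y_r)
    evidence = joint.sum()         # proportional to p(x)
    return joint / evidence        # p(y_r | x)

# Hypothetical log-joint values; exp() of these is 0.0 in double precision.
log_joint = np.array([-1050.0, -1052.0, -1049.0])
posterior = posterior_from_log_joint(log_joint)
print(posterior)
```

Without the shift, every `exp(log_joint)` here would be zero and the division would yield `nan`.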
    
The NB classifier is used in applications such as **document classification** and **spam filtering**.


### **Learning problem**

* $k$ prior probabilities to be estimated : $\{p(y_1),p(y_2),\ldots,p(y_k)\}$

* $k \times m$ class conditional probabilities : $\{p(x_1|y_1),\ldots,p(x_m|y_1),p(x_1|y_2),\ldots,p(x_m|y_2),\ldots,p(x_1|y_k),\ldots,p(x_m|y_k)\}$
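To make the size of the learning problem concrete, here is a small sketch with made-up labels. It estimates the priors by relative class frequency, which is one common choice and not necessarily the estimator used in the later sections, and counts the class conditional probabilities that remain to be estimated. The label vector `y` and the feature count `m` are assumptions for illustration.

```python
import numpy as np

# Hypothetical training labels for n = 8 examples, each with m = 3 features.
y = np.array([0, 1, 1, 2, 1, 0, 2, 1])
m = 3

classes, counts = np.unique(y, return_counts=True)
k = len(classes)

# Relative-frequency estimate of the k priors p(y_1), ..., p(y_k).
priors = counts / counts.sum()

print(k, priors)   # number of classes and the estimated priors
print(k * m)       # number of class conditional probabilities p(x_j | y_c)
```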

The class conditional densities depend on the **nature of features**.

The following are some popular class conditional densities used in NB classifier: 

* **Bernoulli distribution**: When $x_j$ is a **binary feature**, we use the Bernoulli distribution to model the class conditional density : $p(x_j|y_c)$.


* **Categorical distribution**: When $x_j$ is a **categorical feature**, i.e. it takes one of $e \gt 2$ discrete values \[e.g. {red, green, blue} or the roll of a die\], we use the categorical distribution to model the class conditional density : $p(x_j|y_c)$.

* **Multinomial distribution**: When $\mathbf x$ is a **count vector**, i.e. each component $x_j$ is the number of times feature $j$ appears in the object being represented and $\sum_j x_j=l$ is the length of the object, we use the multinomial distribution to model $p(\mathbf x|y_c)$.

* **Gaussian distribution**: When $x_j$ is a **continuous feature**, i.e. it takes a real value, we use the Gaussian (or normal) distribution to model the class conditional density $p(x_j|y_c)$.
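As a rough illustration of the four densities listed above, the sketch below evaluates each one in log space for a single class $y_c$. The function names and all parameter values are hypothetical and chosen only for this example. Estimating the parameters from data is left to the respective sections.

```python
import math
import numpy as np

def bernoulli_logpmf(x_j, theta):
    """log p(x_j | y_c) for binary x_j, with theta = p(x_j = 1 | y_c)."""
    return x_j * math.log(theta) + (1 - x_j) * math.log(1 - theta)

def categorical_logpmf(x_j, probs):
    """log p(x_j | y_c) where x_j is a category index and probs sums to 1."""
    return math.log(probs[x_j])

def multinomial_logpmf(x, probs):
    """log p(x | y_c) for a count vector x with sum(x) = l (assumes probs > 0)."""
    x = np.asarray(x)
    probs = np.asarray(probs, dtype=float)
    # log of the multinomial coefficient l! / (x_1! ... x_m!) via log-gamma.
    log_coef = math.lgamma(x.sum() + 1) - sum(math.lgamma(c + 1) for c in x)
    return log_coef + float(np.sum(x * np.log(probs)))

def gaussian_logpdf(x_j, mu, sigma):
    """log p(x_j | y_c) for real-valued x_j with mean mu and std sigma."""
    return -0.5 * math.log(2 * math.pi * sigma**2) - (x_j - mu) ** 2 / (2 * sigma**2)

# Made-up parameters, for illustration only.
print(math.exp(bernoulli_logpmf(1, 0.7)))
print(math.exp(categorical_logpmf(2, [0.2, 0.3, 0.5])))
print(math.exp(multinomial_logpmf([2, 1, 0], [0.5, 0.3, 0.2])))
print(math.exp(gaussian_logpdf(0.0, 0.0, 1.0)))
```

Working with log densities matches the inference step, where the per-feature terms are summed rather than multiplied.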




We will implement these different class conditional densities in different NB implementations. 

We will discuss parameter estimation in detail in the respective sections.