## Why naïve?
A naïve Bayes classifier is a probabilistic machine learning model used for classification tasks. At its core, the classifier is based on Bayes' theorem.

The classifier is called naïve because of its assumptions:
1. all predictor variables in the dataset are independent of each other given the class, i.e. not correlated with one another
2. all the predictors have an equal effect on the outcome

## Types of Naive Bayes Classifier:

1. Multinomial Naive Bayes: Used when the predictors are discrete counts, such as word frequencies in document classification.
2. Bernoulli Naive Bayes: Similar to multinomial Naive Bayes, except that the predictors are boolean variables.
3. Gaussian Naive Bayes: Used when the predictors are continuous valued. We assume that the predictor values are sampled from a Gaussian distribution.

## Bayesian Inference

TBD

## Advantages

1. Needs less training data.
2. This example shows a binary outcome. However, the algorithm also performs well in multi-class prediction (TODO).
3. When the predictors really are independent, a Naive Bayes classifier can outperform models such as logistic regression, even with less training data.
4. It performs well with categorical input variables compared to numerical variables.

For numerical variables, a normal distribution is assumed (this example uses numerical predictor variables).

## Disadvantages
1. Zero frequency problem: If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a zero probability, which zeroes out the whole product and prevents a meaningful prediction. Smoothing techniques address this; one of the simplest is Laplace estimation (https://www.quora.com/How-does-Laplacian-add-1-smoothing-work-for-a-Naive-Bayes-classfier-algorithm). TODO: a notebook on this later.
2. It is almost impossible to have completely independent predictors in real life, and the classifier may perform poorly when the predictors are strongly dependent.
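
The zero frequency problem and its Laplace fix can be sketched in a few lines. This is a minimal illustration, not part of the notebook's pipeline: the function `category_likelihoods`, the parameter `alpha`, and the category values are all hypothetical.

```python
from collections import Counter

def category_likelihoods(values, categories, alpha=1.0):
    """Estimate P(category | class) from the values observed for one class.

    alpha is the smoothing pseudo-count: 0 gives raw frequencies,
    1 gives Laplace (add-one) smoothing.
    """
    counts = Counter(values)
    total = len(values) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical training values of one categorical predictor for a single class.
observed = ["low", "low", "medium", "low"]
categories = ["low", "medium", "high"]  # "high" never appears in training

raw = category_likelihoods(observed, categories, alpha=0.0)
smoothed = category_likelihoods(observed, categories, alpha=1.0)

print(raw["high"])       # 0.0, which would zero out the whole product
print(smoothed["high"])  # 1/7, small but non-zero
```

With smoothing, the estimates still sum to 1 over the categories, but no category is ever assigned exactly zero.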

This notebook is meant to demonstrate the technique nonetheless.

## Dataset

The dataset used is from https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

The dataset consists of medical predictor variables and one target variable, Outcome.
Predictor variables:
1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)

Outcome: Class variable (0 or 1)

In [None]:
import numpy as np
import pandas as pd

from IPython.display import Image, HTML
%matplotlib inline

In [None]:
column = ["Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin",
          "BMI","DiabetesPedigreeFunction","Age","Outcome"]

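# Load the Pima Indians diabetes data described above.
# The column names are supplied explicitly through `names`.
# To read a local copy instead: pd.read_csv('pima-indians-diabetes.data.csv', names=column)
data = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv",
                   names=column)
data.head()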

From Bayes' theorem:

\begin{equation}
P(y \mid x) =  \dfrac{P(x \mid y)\,P(y)}{P(x)}
\end{equation}

where,

1. P(y|x) is the posterior probability of class y given the predictors (features) x.
2. P(y) is the prior probability of the class.
3. P(x|y) is the likelihood, i.e. the probability of the predictors given the class.
4. P(x) is the marginal probability of the predictors (the evidence).
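
As a small worked example of the formula, the sketch below computes a posterior for a binary class. The function name `posterior` and all the probabilities are made up for illustration; they are not estimated from the dataset.

```python
def posterior(prior_y, likelihood_x_given_y, likelihood_x_given_not_y):
    """Return P(y | x) for a binary class using Bayes' theorem."""
    # P(x) expands over both classes: P(x|y)P(y) + P(x|not y)P(not y)
    evidence = (likelihood_x_given_y * prior_y
                + likelihood_x_given_not_y * (1.0 - prior_y))
    return likelihood_x_given_y * prior_y / evidence

# Hypothetical values: P(y) = 0.35, P(x|y) = 0.6, P(x|not y) = 0.2
p = posterior(prior_y=0.35, likelihood_x_given_y=0.6, likelihood_x_given_not_y=0.2)
print(round(p, 4))  # 0.6176
```

Here the evidence is 0.6 × 0.35 + 0.2 × 0.65 = 0.34, so the posterior is 0.21 / 0.34.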

Or, more precisely,

\begin{equation}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) =  \dfrac{P(x_{1}, x_{2}, ... , x_{n} \mid y)\,P(y)}{P(x_{1}, x_{2}, ... , x_{n})}
\end{equation}

Under the naïve assumption that the predictors are conditionally independent given the class, the numerator factorizes:

\begin{equation}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) =  \dfrac{P(x_{1} \mid y) \, P(x_{2} \mid y) \, ... \, P(x_{n} \mid y) \, P(y)}{P(x_{1}, x_{2}, ... , x_{n})}
\end{equation}

For a given observation, the denominator is the same for every class y; it does not depend on y. Therefore, the denominator can be dropped and a proportionality introduced:

\begin{equation}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) \, \propto  \, P(x_{1} \mid y) \, P(x_{2} \mid y) \, ... \, P(x_{n} \mid y) \, P(y)
\end{equation}

Rewriting in short form

\begin{equation}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) \, \propto P(y) \, \prod_{i=1}^n \,P(x_{i} \mid y)
\end{equation}

In our case, the class variable (y) has only two outcomes, 1 or 0. There could be cases with more than two classes. Either way, we need to find the class y with the maximum probability:


\begin{equation}
\begin{aligned}
y = \arg\max_{y} \, P(y) \, \prod_{i=1}^n \,P(x_{i} \mid y)
\end{aligned}
\tag{Equation 1}
\end{equation}
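
The decision rule can be sketched directly: compute the score $ P(y) \prod_{i} P(x_{i} \mid y) $ from the proportionality above for each class and keep the largest. The function `predict_class` and the numbers below are hypothetical and chosen only to illustrate the rule.

```python
import math

def predict_class(priors, likelihoods):
    """Pick the class with the largest P(y) * prod_i P(x_i | y).

    priors:      {class: P(y)}
    likelihoods: {class: [P(x_1 | y), ..., P(x_n | y)]} for one observation
    """
    scores = {y: priors[y] * math.prod(likelihoods[y]) for y in priors}
    total = sum(scores.values())
    # Dividing by the total restores the dropped denominator, so the values sum to 1.
    posteriors = {y: s / total for y, s in scores.items()}
    return max(scores, key=scores.get), posteriors

# Hypothetical priors and per-feature likelihoods for a single observation.
priors = {0: 0.65, 1: 0.35}
likelihoods = {0: [0.10, 0.30], 1: [0.40, 0.50]}

label, posteriors = predict_class(priors, likelihoods)
print(label)  # 1
```

Class 0 scores 0.65 × 0.03 = 0.0195 and class 1 scores 0.35 × 0.20 = 0.07, so class 1 wins even though its prior is smaller.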

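Because y takes only a few discrete values, the maximum over classes is found by evaluating the right-hand side for each class and picking the largest. Derivatives come in when estimating the parameters of each likelihood $ P(x_{i} \mid y) $ from the training data: we maximize the likelihood by setting its derivative with respect to the parameters to 0.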

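Assuming a Gaussian distribution for each of the numerical predictor variables in the dataset, each $ P(x_{i} \mid y) $ takes the form below, where $ \mu_{y} $ and $ \sigma_{y} $ are the mean and standard deviation of predictor $ x_{i} $ within class y: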

\begin{equation}
\begin{aligned}
P(x_{i} \mid y) = \dfrac{1}{\sigma_{y}\sqrt{2\pi}} \, \exp\left( -\dfrac{1}{2} \left( \dfrac{x_{i} \, - \, \mu_{y}}{\sigma_{y}} \right)^{2} \right)
\end{aligned}
\tag{Equation 2}
\end{equation}
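
The Gaussian likelihood translates into a short helper. This is a sketch: the name `gaussian_likelihood` and the class statistics in the usage line are hypothetical, not values computed from the dataset.

```python
import math

def gaussian_likelihood(x, mean, std):
    """Gaussian density of x for a class with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

# Hypothetical class statistics for a single predictor such as Glucose.
print(gaussian_likelihood(120.0, mean=110.0, std=25.0))
```

Note that the result is a density rather than a probability, so it can exceed 1 when the standard deviation is small. That is harmless here, because only the comparison between classes matters.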

Calculating the derivative of a long product like the one above is tedious. Instead, we take the log of both sides and differentiate that; the properties of the log turn the product into a sum, which simplifies the algebra. We can maximize either the function or its log because the log is monotonically increasing, so the function and its log peak at the same point.
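
The claim that a function and its log peak at the same point can be checked numerically. The sketch below scans candidate means on a grid instead of taking derivatives; the data, the fixed `sigma=1.0`, and the function names are assumptions made for illustration.

```python
import math

def likelihood(mu, data, sigma=1.0):
    """Product of Gaussian densities of the data for a candidate mean mu."""
    return math.prod(
        math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
        for x in data
    )

def log_likelihood(mu, data, sigma=1.0):
    """Sum of Gaussian log-densities; the log turns the product into a sum."""
    return sum(
        -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        for x in data
    )

data = [4.0, 5.0, 6.0, 9.0]                # hypothetical observations, sample mean 6.0
candidates = [i / 10 for i in range(121)]  # candidate means 0.0, 0.1, ..., 12.0

best_by_likelihood = max(candidates, key=lambda m: likelihood(m, data))
best_by_log = max(candidates, key=lambda m: log_likelihood(m, data))
print(best_by_likelihood, best_by_log)  # 6.0 6.0
```

Both searches land on 6.0, which is also the sample mean of the data.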

At the point where the maximum is attained, $ \mu_{y} $ equals the mean of all the $ x_{i} $ values observed for class y, and $ \sigma_{y} $ equals their standard deviation. This is the maximum likelihood estimation (MLE) technique.
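
Putting the pieces together, a Gaussian naïve Bayes classifier only needs the per-class prior, mean, and standard deviation of each predictor. The following is a minimal sketch under stated assumptions: `fit`, `predict`, and the toy data are hypothetical, and scores are summed in log space rather than multiplied.

```python
import math
from statistics import fmean, pstdev

def fit(rows, labels):
    """Estimate the prior, and each predictor's mean and standard deviation, per class."""
    model = {}
    for y in set(labels):
        class_rows = [r for r, label in zip(rows, labels) if label == y]
        columns = list(zip(*class_rows))
        model[y] = {
            "prior": len(class_rows) / len(rows),
            "mean": [fmean(c) for c in columns],
            "std": [pstdev(c) for c in columns],  # the MLE divides by n, not n - 1
        }
    return model

def predict(model, row):
    """Return the class with the largest log P(y) + sum_i log P(x_i | y)."""
    def log_score(params):
        score = math.log(params["prior"])
        for x, mu, sigma in zip(row, params["mean"], params["std"]):
            score += -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))
        return score
    return max(model, key=lambda y: log_score(model[y]))

# Hypothetical toy data: two predictors (think Glucose and BMI) and a binary outcome.
rows = [(85, 22.0), (90, 24.5), (95, 23.0), (150, 33.0), (160, 35.5), (170, 31.0)]
labels = [0, 0, 0, 1, 1, 1]

model = fit(rows, labels)
print(predict(model, (155, 34.0)))  # 1
print(predict(model, (88, 23.0)))   # 0
```

A predictor that is constant within a class would give a standard deviation of zero and break the division, so a real implementation would need to guard against that case.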

Details of the derivation and some intuition for MLE can be found in these StatQuest videos:
1. https://www.youtube.com/watch?v=XepXtl9YKwc
2. https://www.youtube.com/watch?v=Dn6b9fCIUpM

## Acknowledgements
1. https://github.com/2796gaurav/Naive-bayes-explained