## Why naïve?
A naïve Bayes classifier is a probabilistic machine learning model used for classification tasks. The classifier is built on Bayes' theorem.

The classifier is called naïve because of the assumptions it makes:
1. all predictor variables in the dataset are independent of each other given the class, i.e. not correlated with each other
2. all the predictors have an equal effect on the outcome

## Types of Naive Bayes Classifier:

1. Multinomial Naive Bayes: Used when the predictors are discrete counts, e.g. word counts in document classification.
2. Bernoulli Naive Bayes: Similar to the multinomial naive Bayes except that the predictors are boolean variables.
3. Gaussian Naive Bayes: Used when the predictors are continuous valued. We assume that the predictor values are sampled from a Gaussian distribution.

## Advantages

1. Needs less training data than many other models.
2. This example shows a binary outcome. However, the algorithm also performs well in multi-class prediction (TODO).
3. When the independence assumption holds, a Naive Bayes classifier can perform better than models like logistic regression, especially when little training data is available.
4. It performs well with categorical input variables compared to numerical variables.

For numerical variables, a normal distribution is assumed (this example uses numerical predictor variables).

## Disadvantages
1. Zero frequency problem: If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a zero probability. This zeroes out the whole product of probabilities, so no meaningful prediction can be made. Smoothing techniques address this. One of the simplest is called Laplace estimation (https://www.quora.com/How-does-Laplacian-add-1-smoothing-work-for-a-Naive-Bayes-classfier-algorithm). TODO: a notebook on this later.
2. It is almost impossible to have completely independent predictors in real life, and the classifier can perform poorly when the predictors are strongly dependent.

This notebook is meant to demonstrate the technique nonetheless.
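The zero frequency problem from item 1 above, and the add-one fix, can be sketched in a few lines. This is only an illustration with made-up category counts. The function name `smoothed_likelihood` and its `alpha` parameter are assumptions for this sketch and are not used elsewhere in the notebook.

```python
from collections import Counter

def smoothed_likelihood(value, observed_values, categories, alpha=1.0):
    """Estimate P(value | class) with additive (Laplace) smoothing.

    alpha=0 gives the plain relative frequency; alpha=1 is add-one smoothing.
    """
    counts = Counter(observed_values)
    return (counts[value] + alpha) / (len(observed_values) + alpha * len(categories))

# Hypothetical values of one categorical predictor, observed within one class
categories = ["red", "green", "blue"]
seen_in_class = ["red", "red", "green", "red"]  # "blue" was never observed

unsmoothed = smoothed_likelihood("blue", seen_in_class, categories, alpha=0.0)
smoothed = smoothed_likelihood("blue", seen_in_class, categories, alpha=1.0)

print(unsmoothed)  # zero: this would wipe out the whole product of likelihoods
print(smoothed)    # small but non-zero
```

With smoothing, the estimates over all categories still sum to 1, so they remain a valid probability distribution.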

## Bayesian Inference

1. Statistical inference is the process of deducing properties of a population (and hence of its probability distribution) from data. For example, using the standard Maximum Likelihood Estimation technique, we can determine the maximum likelihood estimate of a mean from a set of observed data points.
2. Bayesian inference is the same process of deducing properties of a population or probability distribution from data, carried out using Bayes’ theorem.

NOTE: Probability and likelihood are different things in technical terms. https://www.youtube.com/watch?v=pYxNSUDSFH4

## Using Bayes’ theorem with distributions

### Standard Bayes' theorem reproduced below:

\begin{equation}
\begin{aligned}
P(A \mid B) =  \dfrac{P(B \mid A)\,P(A)}{P(B)}
\end{aligned}
\tag{Equation 1}
\end{equation}

As a running example, take A to be “has cancer” and B to be “is a smoker”.

1. P(A|B) is called the posterior; this is what we are trying to estimate. In the example, this would be the “probability of having cancer given that the person is a smoker”.
2. P(A) is called the prior; this is the probability of our hypothesis before taking any evidence into account. In the example, this would be the “probability of having cancer”.
3. P(B|A) is called the likelihood; this is the probability of observing the new evidence, given our initial hypothesis. In the example, this would be the “probability of being a smoker given that the person has cancer”.
4. P(B) is called the marginal likelihood; this is the total probability of observing the evidence. In the example, this would be the “probability of being a smoker”. In many applications of Bayes' rule this term is ignored, as it mainly serves as a normalization constant.
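A small numeric sketch may make the four terms concrete for the cancer/smoker example. The probabilities below are made up for illustration only and are not real medical statistics. The helper name `posterior` is an assumption of this sketch.

```python
def posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A) using Bayes' theorem."""
    # Marginal likelihood P(B), via the law of total probability
    evidence = likelihood_b_given_a * prior_a + likelihood_b_given_not_a * (1 - prior_a)
    return likelihood_b_given_a * prior_a / evidence

# Made-up illustrative numbers
p_cancer = 0.01                # prior P(A)
p_smoker_given_cancer = 0.60   # likelihood P(B|A)
p_smoker_given_healthy = 0.20  # P(B|not A), needed to compute P(B)

p_cancer_given_smoker = posterior(p_cancer, p_smoker_given_cancer, p_smoker_given_healthy)
print(round(p_cancer_given_smoker, 4))
```

Under these assumed numbers the posterior is larger than the prior: observing the evidence (being a smoker) raises the estimated probability of the hypothesis.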

### Modifications

Bayes' theorem can be used with whole distributions rather than single numbers after two substitutions. To get the new form of the equation, we replace:

1. B with data
2. A with $ \theta $ (the predictors)

\begin{equation}
\begin{aligned}
P(\theta \mid data) =  \dfrac{P(data \mid \theta)\,P(\theta)}{P(data)}
\end{aligned}
\tag{Equation 2}
\end{equation}

Here $\theta $ can represent a single predictor variable or a vector of predictor variables. The following techniques are used for each term on the right-hand side:

1. Using a distribution for $ P(\theta) $
2. Using maximum likelihood techniques for $ P(data \mid \theta) $
3. Ignoring $ P(data) $ (it is very hard to calculate, and it can be ignored as explained below)

We now have new names for each of the terms in Bayes' theorem:
1. $ P(\theta \mid data) $ is called the posterior distribution
2. $ P(\theta) $ is called the prior distribution. Generally this would be the distribution of each of the predictor variables over the entire training set
3. $ P(data \mid \theta) $ is called the likelihood distribution. Sometimes it is also written as $\mathcal{L}(data \mid \theta) $

### Expanding on Bayes' theorem adaptation to distribution

Here I am using x(s) for Theta.

From Bayes theorem

\begin{equation}
P(y \mid x) =  \dfrac{P(x \mid y)\,P(y)}{P(x)}
\end{equation}

where,

1. P(y|x) is the posterior probability of class y given the predictors (aka features).
2. P(y) is the prior probability of the class.
3. P(x|y) is the likelihood, which is the probability of the predictors given the class.
4. P(x) is the marginal probability of the predictors (the evidence).

Or more precisely

\begin{equation}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) =  \dfrac{P(x_{1}, x_{2}, ... , x_{n} \mid y)\,P(y)}{P(x_{1}, x_{2}, ... , x_{n})}
\end{equation}

Under the naïve assumption that the predictors are independent of each other given the class, the numerator factorizes and this can be written as

\begin{equation}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) =  \dfrac{P(x_{1} \mid y) \, P(x_{2} \mid y) \, ... \, P(x_{n} \mid y) \, P(y)}{P(x_{1}, x_{2}, ... , x_{n})}
\end{equation}

For a given record, the denominator does not depend on y, so it is the same for every class. Therefore, the denominator can be removed and a proportionality can be introduced

\begin{equation}
\begin{aligned}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) \, \propto  \, P(x_{1} \mid y) \, P(x_{2} \mid y) \, ... \, P(x_{n} \mid y) \, P(y)
\end{aligned}
\tag{Equation 3}
\end{equation}


Rewriting in short form

\begin{equation}
\begin{aligned}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) \, \propto P(y) \, \prod_{i=1}^n \,P(x_{i} \mid y)
\end{aligned}
\tag{Equation 4}
\end{equation}

The two terms in Equation 4 are analyzed separately:
1. $ P(y) $ is obtained by looking at the distribution of the classes in the training data.
2. $ \prod_{i=1}^n \,P(x_{i} \mid y) $ is analyzed with maximum likelihood techniques.
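Equation 4 can be sketched directly in code. The priors and per-predictor likelihoods below are hypothetical numbers chosen for illustration, and the function names are assumptions of this sketch. The log-space variant is included because a product of many small probabilities can underflow to zero in floating point.

```python
import math

def unnormalized_posterior(prior, likelihoods):
    """Equation 4: P(y) times the product of P(x_i | y)."""
    return prior * math.prod(likelihoods)

def log_unnormalized_posterior(prior, likelihoods):
    """The same quantity in log space: log P(y) + sum of log P(x_i | y)."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# Hypothetical inputs per class: prior P(y) and P(x_i | y) for three predictors
scores = {
    0: unnormalized_posterior(0.6, [0.5, 0.2, 0.1]),
    1: unnormalized_posterior(0.4, [0.3, 0.4, 0.5]),
}

# Dividing by the sum over classes restores the dropped denominator
total = sum(scores.values())
posteriors = {y: s / total for y, s in scores.items()}

# The predicted class is the one with the largest score
predicted = max(scores, key=scores.get)
print(posteriors, predicted)
```

Normalizing is optional for prediction, since dividing every score by the same constant does not change which class is largest.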

But first let us look at the dataset to which we will apply these

## Dataset

The dataset used is from here - https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv

The dataset consists of medical predictor variables and one target variable, Outcome.
Predictor variables:
1. Pregnancies: Number of times pregnant
2. Glucose: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
3. BloodPressure: Diastolic blood pressure (mm Hg)
4. SkinThickness: Triceps skin fold thickness (mm)
5. Insulin: 2-Hour serum insulin (mu U/ml)
6. BMI: Body mass index (weight in kg/(height in m)^2)
7. DiabetesPedigreeFunction: Diabetes pedigree function
8. Age: Age (years)

Outcome: Class variable (0 or 1)

In [None]:
import numpy as np
import pandas as pd

from IPython.display import Image, HTML
%matplotlib inline

In [None]:
column = ["Pregnancies","Glucose","BloodPressure","SkinThickness","Insulin",
          "BMI","DiabetesPedigreeFunction","Age","Outcome"]

#data = pd.read_csv('pima-indians-diabetes.data.csv',names=column)
git_file_path = "https://raw.githubusercontent.com/datavector-io/datascience/main/Bayesian/pima-indians-diabetes.data.csv"
data = pd.read_csv(git_file_path, names=column)
data.head()

We now apply Equation 3.

The predictor variables "Pregnancies", "Glucose", "BloodPressure", etc. can be thought of as $ x_{1}, x_{2}, ... , x_{n} $.


\begin{equation}
P(y \mid x_{1}, x_{2}, ... , x_{n} ) \, \propto  \, P(x_{1} \mid y) \, P(x_{2} \mid y) \, ... \, P(x_{n} \mid y)\, P(y) 
\end{equation}

i.e. $ P(Outcome=0 \mid Data ) \, $ is proportional to the following product of probabilities:

\begin{equation}
P(Outcome=0) \,\, P(Pregnancies \mid Outcome=0) \, P(Glucose \mid Outcome=0) \, P(BloodPressure \mid Outcome=0) \, P(SkinThickness \mid Outcome=0) \, P(Insulin \mid Outcome=0) \, P(BMI \mid Outcome=0) \, P(DiabetesPedigreeFunction \mid Outcome=0) \, P(Age \mid Outcome=0) 
\end{equation}

1. P(Outcome=0) is simply the proportion of records with Outcome=0 in the existing dataset
2. P(Pregnancies | Outcome=0), P(Glucose | Outcome=0), etc. are the likelihoods of each predictor value under that predictor's distribution for Outcome=0

<b>NOTE: The predictors behind each likelihood term are not really independent. But that is the naive assumption we make in this algorithm</b>

But first we need to split the data into train and test sets. Then we can do the likelihood calculation for each term.

## Shuffle and Split data into train and test

In [None]:
# shuffle dataset with sample
data = data.sample(frac=1, random_state=1).reset_index(drop=True)
print(data.shape)
data

In [None]:
# 70-30 split
total_record_count = len(data.index)
train_record_count = round(0.7 * total_record_count)
test_record_count = total_record_count - train_record_count

train_records = data.iloc[:train_record_count,:]
test_records = data.iloc[train_record_count:,:]
print("Shape of new dataframes - {} , {}".format(train_records.shape, test_records.shape))

## Step 1: Calculate Prior Probability

In [None]:
train_outcome_0_num = train_records['Outcome'][train_records['Outcome'] == 0].count()
train_outcome_1_num = train_records['Outcome'][train_records['Outcome'] == 1].count()

# Total people
train_total_num = train_records['Outcome'].count()
print("Train Total = {} , Outcome 0 Sum = {}, Outcome 1 Sum = {}".format(train_total_num, train_outcome_0_num, train_outcome_1_num))

train_p_outcome_0 = train_outcome_0_num/train_total_num
train_p_outcome_1 = train_outcome_1_num/train_total_num
print("P(Outcome=0) = {} , P(Outcome=1) = {}".format(train_p_outcome_0, train_p_outcome_1))

## Step 2: Calculate Likelihood

Since we assume a Gaussian distribution for all predictor variables, it can be shown that the likelihood of the training data is maximized when the mean and standard deviation of the distribution equal the sample mean and standard deviation of that predictor within each class. (Links below in Maximum Likelihood Estimation.)

Hence we now need to calculate the mean and std dev, substitute them into the Gaussian distribution function, and get the likelihood value

\begin{equation}
\begin{aligned}
P(x_{i} \mid y) = \dfrac{1}{\sigma_{y}\sqrt{2\pi}} \, e^{-\dfrac{1}{2} \left(\dfrac{x_{i} \, - \, \mu_{y}}{\sigma_{y}}\right)^{2}}
\end{aligned}
\end{equation}

For that purpose we need the mean and standard deviation of each predictor variable, computed separately for Outcome=0 and Outcome=1.

In [None]:
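# Calculate the mean of each predictor per outcome, using training records only
# (using the full dataset here would leak test data into the model)
data_means = train_records.groupby('Outcome').mean()
data_means

In [None]:
# Calculate the std dev of each predictor per outcome, using training records only
data_stddev = train_records.groupby('Outcome').std()
data_stddev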

In [None]:
def gaussian_likelihood(x, mean_y, stddev_y):
    p = 1/(stddev_y*np.sqrt(2*np.pi)) * np.exp((-(x-mean_y)**2)/(2*(stddev_y**2)))
    return p
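A few quick checks can build confidence in the density formula. The sketch below restates `gaussian_likelihood` for scalars using only the standard library (the name `gaussian_pdf` and the sample mean and standard deviation are assumptions for illustration). It also shows that the returned value is a density rather than a probability, so it can exceed 1 when the standard deviation is small.

```python
import math

def gaussian_pdf(x, mean, stddev):
    """Scalar Gaussian density; mirrors gaussian_likelihood above."""
    return math.exp(-((x - mean) ** 2) / (2 * stddev ** 2)) / (stddev * math.sqrt(2 * math.pi))

mean, stddev = 120.0, 30.0  # hypothetical glucose-like values

# The density is highest at the mean and symmetric around it
peak = gaussian_pdf(mean, mean, stddev)
left = gaussian_pdf(mean - 15, mean, stddev)
right = gaussian_pdf(mean + 15, mean, stddev)

# A Riemann sum over +/- 6 standard deviations should be close to 1
step = 0.1
area = sum(
    gaussian_pdf(mean - 6 * stddev + i * step, mean, stddev) * step
    for i in range(int(12 * stddev / step) + 1)
)

# With a small std dev the density at the mean is greater than 1
narrow_peak = gaussian_pdf(0.0, 0.0, 0.1)
print(peak, left, right, area, narrow_peak)
```

Because only the relative size of the scores across classes matters for prediction, working with densities instead of probabilities does not affect which class is chosen.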

We need these probabilities for outcome=0 to calculate posterior probabilities for outcome = 0
\begin{equation}
P(Pregnancies \mid Outcome=0) \, P(Glucose \mid Outcome=0) \, P(BloodPressure \mid Outcome=0) \, P(SkinThickness \mid Outcome=0) \, P(Insulin \mid Outcome=0) \, P(BMI \mid Outcome=0) \, P(DiabetesPedigreeFunction \mid Outcome=0) \, P(Age \mid Outcome=0) 
\end{equation}

and we need these probabilities for outcome=1 to calculate posterior probabilities for outcome = 1
\begin{equation}
P(Pregnancies \mid Outcome=1) \, P(Glucose \mid Outcome=1) \, P(BloodPressure \mid Outcome=1) \, P(SkinThickness \mid Outcome=1) \, P(Insulin \mid Outcome=1) \, P(BMI \mid Outcome=1) \, P(DiabetesPedigreeFunction \mid Outcome=1) \, P(Age \mid Outcome=1) 
\end{equation}

In [None]:
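# For every test record:
# 1. fetch the mean and std dev of each predictor for outcome = 0
# 2. multiply the Gaussian likelihoods of the record's predictor values by the prior P(Outcome=0)
# 3. repeat steps 1 and 2 for outcome = 1
# 4. compare the two scores; whichever is greater is the predicted outcome
predictors = column[:-1]
X_test = test_records[predictors]
score_0 = train_p_outcome_0 * gaussian_likelihood(X_test, data_means.loc[0], data_stddev.loc[0]).prod(axis=1)
score_1 = train_p_outcome_1 * gaussian_likelihood(X_test, data_means.loc[1], data_stddev.loc[1]).prod(axis=1)
predicted_outcome = (score_1 > score_0).astype(int)
predicted_outcome.head()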
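The steps listed in the comments above can also be sketched end to end without pandas. The version below is a minimal illustration on a made-up toy dataset with two predictors, not the notebook's own implementation. The names `fit`, `log_gaussian` and `predict` are assumptions of this sketch. It works in log space to avoid underflow and uses the sample standard deviation (dividing by n-1), which is also the pandas default for `.std()`.

```python
import math
from statistics import mean, stdev

def fit(rows, labels):
    """Per class: the prior, and (mean, std dev) for each predictor column."""
    model = {}
    for y in set(labels):
        class_rows = [r for r, label in zip(rows, labels) if label == y]
        stats = [(mean(col), stdev(col)) for col in zip(*class_rows)]
        model[y] = (len(class_rows) / len(rows), stats)
    return model

def log_gaussian(x, mu, sigma):
    """Log of the Gaussian density."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - (x - mu) ** 2 / (2 * sigma ** 2)

def predict(model, row):
    """Return the class with the largest log prior + summed log likelihoods."""
    scores = {
        y: math.log(prior) + sum(log_gaussian(x, mu, sigma) for x, (mu, sigma) in zip(row, stats))
        for y, (prior, stats) in model.items()
    }
    return max(scores, key=scores.get)

# Made-up toy data: (glucose-like, BMI-like) predictors with a binary outcome
train_rows = [(85, 22.0), (90, 24.5), (95, 23.0), (150, 33.0), (160, 35.5), (170, 31.0)]
train_labels = [0, 0, 0, 1, 1, 1]
test_rows = [(88, 23.5), (165, 34.0)]
test_labels = [0, 1]

model = fit(train_rows, train_labels)
predictions = [predict(model, row) for row in test_rows]

# Fraction of test records predicted correctly
accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / len(test_labels)
print(predictions, accuracy)
```

The toy classes are deliberately well separated, so the accuracy here says nothing about how the classifier performs on the diabetes dataset.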

## Maximum Likelihood estimation
In our case, the class variable (y) has only two outcomes, 1 or 0. In other problems there could be more than two classes. In either case, prediction means finding the class y for which the product in Equation 3 is largest:

\begin{equation}
\begin{aligned}
\hat{y} = \operatorname{argmax}_{y} \, P(y) \, \prod_{i=1}^n \,P(x_{i} \mid y)
\end{aligned}
\tag{Equation 5}
\end{equation}

Because y takes only a few discrete values, this argmax is found by evaluating the product for each class and comparing the results. Calculus is needed one step earlier, when estimating the parameters of each $ P(x_{i} \mid y) $ from the training data. There we choose the parameter values that maximize the likelihood of the observed data, which means taking the derivative of the likelihood with respect to each parameter and equating it to 0.

Assuming a Gaussian distribution for each of the numerical predictor variables in the dataset, each $ P(x_{i} \mid y) $ takes the form

\begin{equation}
\begin{aligned}
P(x_{i} \mid y) = \dfrac{1}{\sigma_{y}\sqrt{2\pi}} \, e^{-\dfrac{1}{2} \left(\dfrac{x_{i} \, - \, \mu_{y}}{\sigma_{y}}\right)^{2}}
\end{aligned}
\tag{Equation 6}
\end{equation}

where $ \mu_{y} $ and $ \sigma_{y} $ are the mean and standard deviation of predictor $ x_{i} $ within class y.

The likelihood of the training data is a long product of such terms, and calculating its derivative directly is tedious. Instead we take the log of the likelihood and differentiate that, using several properties of log to simplify the expression. This is valid because log is a monotonically increasing function, so the likelihood and its log peak at the same point.

At that maximum, $ \mu_{y} $ is the mean of the observed $ x_{i} $ values in class y, and $ \sigma_{y} $ is their standard deviation.
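The claim that the maximum occurs at the mean can be checked with a short derivation. As notation assumed only for this sketch, let $ x^{(1)}, ... , x^{(m)} $ be the m training values of one predictor within one class, and let $ \mu $ and $ \sigma $ be the parameters of its Gaussian. The log-likelihood of those values is

\begin{equation}
\log \mathcal{L}(\mu, \sigma) = -m \, \log\left(\sigma\sqrt{2\pi}\right) \, - \, \dfrac{1}{2\sigma^{2}} \sum_{j=1}^{m} \left(x^{(j)} - \mu\right)^{2}
\end{equation}

Setting the derivative with respect to $ \mu $ to 0 gives the sample mean:

\begin{equation}
\dfrac{\partial \log \mathcal{L}}{\partial \mu} = \dfrac{1}{\sigma^{2}} \sum_{j=1}^{m} \left(x^{(j)} - \mu\right) = 0 \quad \Rightarrow \quad \hat{\mu} = \dfrac{1}{m} \sum_{j=1}^{m} x^{(j)}
\end{equation}

Setting the derivative with respect to $ \sigma $ to 0 gives the variance around that mean:

\begin{equation}
\dfrac{\partial \log \mathcal{L}}{\partial \sigma} = -\dfrac{m}{\sigma} + \dfrac{1}{\sigma^{3}} \sum_{j=1}^{m} \left(x^{(j)} - \hat{\mu}\right)^{2} = 0 \quad \Rightarrow \quad \hat{\sigma}^{2} = \dfrac{1}{m} \sum_{j=1}^{m} \left(x^{(j)} - \hat{\mu}\right)^{2}
\end{equation}

Note that the maximum likelihood estimate of the variance divides by m, while the pandas `.std()` method divides by m-1 by default. With a few hundred training records per class the difference is small.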

A intuitive understanding of maximum likelihood estimation (MLE) techniques and details of derivation can be found in this stat quest: 
1. https://www.youtube.com/watch?v=XepXtl9YKwc
2. https://www.youtube.com/watch?v=Dn6b9fCIUpM

NOTE: Maximum likelihood estimation cannot always be solved exactly. Setting the derivative of the log-likelihood function to zero may give an equation that is too hard or impossible to solve analytically. In such cases, iterative methods like Expectation-Maximization algorithms are used to find numerical solutions for the parameter estimates.

https://machinelearningmastery.com/expectation-maximization-em-algorithm/

https://www.youtube.com/watch?v=93fPFOf547Q&list=RDCMUCjknLK_siVSCY14qfDu-f-w&start_radio=1

# Acknowledgements
1. https://github.com/2796gaurav/Naive-bayes-explained
2. https://chrisalbon.com/code/machine_learning/naive_bayes/naive_bayes_classifier_from_scratch/

# More resources
1. Implementation from total scratch (not even using Pandas) on multi class prediction for Iris Dataset https://machinelearningmastery.com/naive-bayes-classifier-scratch-python/ 
2. Other implementation https://towardsdatascience.com/implementing-naive-bayes-algorithm-from-scratch-python-c6880cfc9c41 (Once the concept is understood, custom Classifier classes can be written like here for maximum reuse of the technique) Github repo link also in article
3. https://towardsdatascience.com/algorithms-from-scratch-naive-bayes-classifier-8006cc691493 Github repo link also in article
4. https://ijcsmc.com/docs/papers/April2020/V9I4202015.pdf
5. https://www.kaggle.com/vinayshaw/iris-species-100-accuracy-using-naive-bayes
6. https://towardsdatascience.com/machine-learning-basics-naive-bayes-classification-964af6f2a965
7. https://blog.floydhub.com/naive-bayes-for-machine-learning/
8. https://medium.com/machine-learning-101/chapter-1-supervised-learning-and-naive-bayes-classification-part-1-theory-8b9e361897d5
9. https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1
10. Towards MAP as generalization of MLE https://towardsdatascience.com/probability-concepts-explained-bayesian-inference-for-parameter-estimation-90e8930e5348
11. https://towardsdatascience.com/name-classification-with-naive-bayes-7c5e1415788a

NOTE: Some of these resources are light on implementing from the ground up and instead use the Gaussian classifier from sklearn. Use at your own risk.

# Books

1. Think Bayes https://learning.oreilly.com/library/view/think-bayes-2nd/9781492089452/
2. Bayesian Methods for Hackers

# TODO
1. Consider building a multi-class predictor from scratch. Consider using an OO approach
2. Consider building a measure for the accuracy of the Bayesian classifier's predictions
3. Use a confusion matrix to visualize the performance