# Programming for Data Science and Artificial Intelligence

## Supervised Learning - Classification - Naive Bayesian - Gaussian

### Readings: 
- [VANDER] Ch5
- [HASTIE] Ch6

```python
import numpy as np
import matplotlib.pyplot as plt
```

## Gaussian Naive Bayes Classification

In Bayesian classification, we are interested in finding the probability of a label given some observed features, which we can write as $P(y \mid x)$ (also known as the **posterior**).
Bayes' theorem tells us how to express this in terms of quantities that we can compute more directly:

$$
P(y|x) = \frac{P(x|y)P(y)}{P(x)}
$$

The proof is as follows:

- The probability of two events $x$ and $y$ both happening, $P(x \cap y)$, is the probability of $x$, $P(x)$, times the probability of $y$ given that $x$ has occurred, $P(y \mid x)$:

$$ P(x \cap y) = P(x)P(y \mid x)$$

- On the other hand, the probability of $x$ and $y$ both happening is also equal to the probability of $y$ times the probability of $x$ given $y$:

$$ P(x \cap y) = P(y)P(x \mid y)$$

- Equating the two yields:

$$ P(x)P(y \mid x) = P(y)P(x \mid y)$$

- Thus

$$ P(y \mid x) = \frac{P(y)P(x \mid y)}{P(x)}$$
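
As a quick numerical sanity check of the derivation above, the sketch below builds a small joint distribution over two binary events and confirms that both sides of Bayes' theorem agree. The table values are made up for illustration; they are not data from this notebook.

```python
import numpy as np

# Hypothetical joint distribution P(x, y): rows index x in {0, 1}, columns index y in {0, 1}
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])

p_x = joint.sum(axis=1)   # marginal P(x)
p_y = joint.sum(axis=0)   # marginal P(y)

# Conditionals read directly from the joint table
p_y_given_x = joint / p_x[:, None]   # entry [x, y] is P(y | x)
p_x_given_y = joint / p_y[None, :]   # entry [x, y] is P(x | y)

# Bayes' theorem: P(y | x) = P(x | y) P(y) / P(x)
bayes = p_x_given_y * p_y[None, :] / p_x[:, None]

print(np.allclose(bayes, p_y_given_x))
```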

-----


Thus, if we know the three terms on the right, we can find the posterior $P(y \mid x)$.  For classification, however, we only need to compare the posteriors of the different classes for the same $x$.  Since the denominator $P(x)$ is the same for every class, it is enough to compare the numerators, so we only need two terms: the prior $P(y)$ and the likelihood (or class-conditional probability) $P(x \mid y)$.

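$P(y)$ (also known as the **prior**) is simply the fraction of the $m$ training samples that belong to each class, where $1(\cdot)$ is the indicator function:

$$P(y = 1) = \frac{\sum_{i=1}^m 1(y_i=1)}{m}$$

$$P(y = 0) = \frac{\sum_{i=1}^m 1(y_i=0)}{m}$$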



$P(x \mid y)$ (also known as the **likelihood** or **class-conditional probability**) is a little trickier to estimate, but if we are willing to make a "naive" assumption, we can find a rough approximation of the generative model for each class and then proceed with the Bayesian classification.  Perhaps the easiest naive Bayes classifier to understand is Gaussian naive Bayes.  In this classifier, the assumption is that *the data from each label is drawn from a simple Gaussian distribution*, as follows:

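$$ P(x_j \mid y=1 ; \mu_{1j}, \sigma_{1j}^{2}) = \frac{1}{\sqrt{2\pi\sigma_{1j}^{2}}}e ^{-\frac{(x_j-\mu_{1j})^{2}}{2\sigma_{1j}^{2}}}$$
$$ P(x_j \mid y=0 ; \mu_{0j}, \sigma_{0j}^{2}) = \frac{1}{\sqrt{2\pi\sigma_{0j}^{2}}}e ^{-\frac{(x_j-\mu_{0j})^{2}}{2\sigma_{0j}^{2}}}$$

where $x_j$ is the value of feature $j$, and $\mu_{cj}$ and $\sigma_{cj}^{2}$ are the mean and variance of feature $j$ among the samples of class $c$.

For example, the mean and variance of feature $j$ when $y=0$ are computed only from the samples with $y_i = 0$:

$$\mu_{0j} = \frac{\sum_{i=1}^m 1(y_i=0)\, x_{ij}}{\sum_{i=1}^m 1(y_i=0)} \qquad \sigma_{0j}^{2} = \frac{\sum_{i=1}^m 1(y_i=0)\, (x_{ij} - \mu_{0j})^{2}}{\sum_{i=1}^m 1(y_i=0)}$$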

This is what the normal distribution looks like:

<center><img src="../../../figures/normal.png" width=400/></center>



Naive Bayes assumes that all features are conditionally independent given the class, so the total likelihood is just the product of the per-feature likelihoods:
$$P(x \mid y) = \prod_{j=1}^n P( x_j \mid y )$$

Finally, compute $P(y)P(x \mid y)$ for each class and predict the class for which this product is largest:

$$\hat{y} = \arg\max_{c} \; P(y=c)\, P(x \mid y=c)$$

### Putting everything together

1. Prepare your data
    - $\mathbf{X}$ and $\mathbf{y}$ in the right shape
        - $\mathbf{X}$ -> $(m, n)$
        - $\mathbf{y}$ -> $(m,  )$
        - Note that theta is not needed.  Why?
    - train-test split
    - feature scale
    - clean out any missing data
    - (optional) feature engineering
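2. Calculate the mean and std of each feature for each class (from `X_train`), using only the samples of that class.  For example, the mean of feature $j$ for class 0 is
    $$\mu_{0j} = \frac{\sum_{i=1}^m 1(y_i=0)\, x_{ij}}{\sum_{i=1}^m 1(y_i=0)} $$
   The shape of your mean and std will be $(k, n)$, where $k$ is the number of classes.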
3. Calculate the **likelihoods** of each feature of each sample (for `X_test`) using

    $$ P(x_j \mid y=1 ; \mu_{1j}, \sigma_{1j}^{2}) = \frac{1}{\sqrt{2\pi\sigma_{1j}^{2}}}e ^{-\frac{(x_j-\mu_{1j})^{2}}{2\sigma_{1j}^{2}}}$$
    $$ P(x_j \mid y=0 ; \mu_{0j}, \sigma_{0j}^{2}) = \frac{1}{\sqrt{2\pi\sigma_{0j}^{2}}}e ^{-\frac{(x_j-\mu_{0j})^{2}}{2\sigma_{0j}^{2}}}$$
    
    - The shape of likelihood for class 0 will be $(m, n)$
    - Total likelihood is the product as follows:
    
    $$p(x \mid y) = \prod_{j=1}^n p(x_j \mid y)$$
    
    - The shape of this total likelihood for class 0 will be $(m, )$
    
4. Find the **priors** $P(y)$

    $$P(y = 1) = \frac{\sum_{i=1}^m 1(y_i=1)}{m}$$
    $$P(y = 0) = \frac{\sum_{i=1}^m 1(y_i=0)}{m}$$

    - The shape of priors for class 0 will be simply a scalar

5. Multiply $P(y)P(x \mid y)$ for each class, which gives a quantity proportional to $P(y \mid x)$ (the **posteriors**, up to the shared factor $1/P(x)$)
    
    - For each class, this is simply a multiplication between a scalar and an array of shape $(m, )$, resulting in a shape of $(m, )$, and you will have $k$ such results.

6. Simply compare $P(y)P(x \mid y)$ across classes; whichever is bigger wins.  Note that we can ignore $P(x)$ because it is the same for every class and therefore does not change the comparison.

#### 1. Prepare your data

#### 1.1 Get your X and y in the right shape
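
A minimal sketch of this step, assuming a synthetic two-class dataset from `sklearn.datasets.make_blobs` (the dataset and its parameters are illustrative choices, not necessarily the data the original notebook used):

```python
from sklearn.datasets import make_blobs

# Assumed toy data: 500 samples, 2 features, 2 classes
X, y = make_blobs(n_samples=500, n_features=2, centers=2, random_state=42)

print(X.shape)  # (m, n) = (500, 2)
print(y.shape)  # (m,)  = (500,)
```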

#### 1.2 Feature scale your data (optional here, since Gaussian naive Bayes is fit in closed form)
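
One way to standardize features is `sklearn.preprocessing.StandardScaler`. The sketch below uses a small made-up array so that it runs on its own.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical raw features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)   # subtract the per-feature mean, divide by the per-feature std

print(X_scaled.mean(axis=0))  # approximately 0 for every feature
print(X_scaled.std(axis=0))   # approximately 1 for every feature
```

In a full pipeline, a common practice is to fit the scaler on the training split only and then apply `scaler.transform` to the test split, so that no test statistics leak into training.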

#### 1.3 Train test split your data
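
A sketch of the split using `sklearn.model_selection.train_test_split`. The stand-in data, the 70/30 ratio, and the use of `stratify` are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples, 2 features, balanced binary labels
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0, 1] * 5)

# stratify=y keeps the class proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```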

#### 2. Calculate the mean and std for each feature for each class
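
The per-class statistics can be computed by masking the training samples of each class. The helper name `class_mean_std` and the toy data below are hypothetical, chosen so the cell runs on its own.

```python
import numpy as np

def class_mean_std(X, y):
    """Return per-class feature means and stds, each of shape (k, n)."""
    classes = np.unique(y)
    mean = np.array([X[y == c].mean(axis=0) for c in classes])
    std = np.array([X[y == c].std(axis=0) for c in classes])
    return mean, std

# Hypothetical training data: 4 samples, 2 features, 2 classes
X_train = np.array([[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]])
y_train = np.array([0, 0, 1, 1])

mean, std = class_mean_std(X_train, y_train)
print(mean)  # row c holds the feature means of class c
print(std)   # row c holds the feature stds of class c
```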

#### 3. Define the probability density function so we can later calculate $p(x \mid y)$
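
A direct translation of the Gaussian density formula into NumPy might look like the sketch below. The function name `gaussian_pdf` is a hypothetical choice; thanks to broadcasting, `X` can be an $(m, n)$ array while `mean` and `std` have shape $(n,)$.

```python
import numpy as np

def gaussian_pdf(X, mean, std):
    """Gaussian density of each entry of X under N(mean, std**2), with broadcasting."""
    coef = 1.0 / np.sqrt(2.0 * np.pi * std ** 2)
    exponent = -((X - mean) ** 2) / (2.0 * std ** 2)
    return coef * np.exp(exponent)

# Density of the standard normal at its mean, which equals 1 / sqrt(2 * pi)
print(gaussian_pdf(0.0, 0.0, 1.0))
```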

#### 3. Calculate the likelihood by calculating the probability density of each class $p(x \mid y)$
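
The sketch below evaluates the density of every feature of every test sample under each class. The test samples and the per-class parameters are made up so that the cell runs on its own, and `gaussian_pdf` is the same hypothetical helper as in the previous step.

```python
import numpy as np

def gaussian_pdf(X, mean, std):
    return np.exp(-((X - mean) ** 2) / (2.0 * std ** 2)) / np.sqrt(2.0 * np.pi * std ** 2)

# Hypothetical test samples (m=3, n=2) and per-class parameters (k=2, n=2)
X_test = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
mean = np.array([[0.0, 0.0], [4.0, 4.0]])
std = np.ones((2, 2))

# likelihood[c] has shape (m, n): density of every feature of every sample under class c
likelihood = np.array([gaussian_pdf(X_test, mean[c], std[c]) for c in range(len(mean))])
print(likelihood.shape)  # (k, m, n)
```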

#### 3.1 Calculate the total likelihood as the product $p(x \mid y) = \prod_{j=1}^n p(x_j \mid y)$
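
Given an $(m, n)$ array of per-feature likelihoods for one class, the total likelihood is the product across the feature axis. The numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical per-feature likelihoods for one class: m=3 samples, n=2 features
likelihood_c = np.array([[0.40, 0.30],
                         [0.20, 0.10],
                         [0.05, 0.50]])

total_likelihood_c = np.prod(likelihood_c, axis=1)   # multiply across features
print(total_likelihood_c.shape)  # (m,)
```

With many features the product can underflow to zero, so a common alternative is to sum log-likelihoods instead; the product is kept here to match the formula above.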

#### 4. Calculate the prior $p(y)$
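
The prior of each class is the fraction of training labels equal to that class. The labels below are a toy assumption.

```python
import numpy as np

# Hypothetical training labels: three samples of class 0, one of class 1
y_train = np.array([0, 0, 1, 0])

classes = np.unique(y_train)
priors = np.array([np.mean(y_train == c) for c in classes])   # fraction of samples per class
print(priors)
```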

#### 5. Calculate the posterior $p(x \mid y)p(y)$ for each class
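
Multiplying each class prior by that class's total likelihood gives the unnormalized posterior; the prediction is the class with the largest value. The priors and likelihoods below are made-up numbers so that the cell runs on its own.

```python
import numpy as np

# Hypothetical priors (k=2) and total likelihoods (k=2, m=3)
priors = np.array([0.75, 0.25])
total_likelihood = np.array([[0.12, 0.02, 0.001],
                             [0.01, 0.03, 0.200]])

# Unnormalized posterior for every class and sample, shape (k, m)
posterior = priors[:, None] * total_likelihood

yhat = np.argmax(posterior, axis=0)   # predicted class index per sample
print(yhat)
```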

#### 6. Calculate accuracy
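
Accuracy is the fraction of test samples whose predicted label matches the true label. To have real predictions to score, the sketch below chains the previous steps on an assumed synthetic dataset (`make_blobs` with two hand-picked, well-separated centers); the data and the helper `gaussian_pdf` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

def gaussian_pdf(X, mean, std):
    return np.exp(-((X - mean) ** 2) / (2.0 * std ** 2)) / np.sqrt(2.0 * np.pi * std ** 2)

# Assumed toy data, chosen only so the cell runs on its own
X, y = make_blobs(n_samples=500, n_features=2, centers=[[-3, -3], [3, 3]], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

classes = np.unique(y_train)
mean = np.array([X_train[y_train == c].mean(axis=0) for c in classes])   # (k, n)
std = np.array([X_train[y_train == c].std(axis=0) for c in classes])     # (k, n)
priors = np.array([np.mean(y_train == c) for c in classes])              # (k,)

# Unnormalized posteriors, shape (k, m_test)
posterior = np.array([
    priors[i] * np.prod(gaussian_pdf(X_test, mean[i], std[i]), axis=1)
    for i in range(len(classes))
])
yhat = classes[np.argmax(posterior, axis=0)]

accuracy = np.mean(yhat == y_test)
print(f"Accuracy: {accuracy:.3f}")
```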

### Sklearn

Now that we can code the classifier from scratch, we can turn to sklearn so that we don't need to implement it ourselves from now on.  Gaussian naive Bayes is implemented in Scikit-Learn's ``sklearn.naive_bayes.GaussianNB`` estimator.
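
A minimal usage sketch of `GaussianNB`, assuming a synthetic dataset from `make_blobs` with two hand-picked centers (the data is an illustrative assumption, not the notebook's original dataset):

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Assumed toy data: two well-separated Gaussian blobs
X, y = make_blobs(n_samples=500, n_features=2, centers=[[-3, -3], [3, 3]], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = GaussianNB()
model.fit(X_train, y_train)      # estimates per-class means, variances and priors
yhat = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, yhat))
```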

We can also use `predict_proba` to get the predicted probability of each class.
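
For example, on an assumed toy dataset, `predict_proba` returns one row per sample and one column per class, in the order given by `model.classes_`:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Assumed toy data, chosen only so the cell runs on its own
X, y = make_blobs(n_samples=200, n_features=2, centers=[[-3, -3], [3, 3]], random_state=42)

model = GaussianNB().fit(X, y)

proba = model.predict_proba(X[:5])   # shape (5, k); each row sums to 1
print(model.classes_)
print(proba.round(3))
```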

### ===Task===

Generate a two-class dataset using sklearn and apply Gaussian naive Bayes classification to it.  Wrap your implementation in a class and calculate the accuracy.