## Topics covered in this notebook:
1. Types of Classifiers - Generative vs. Discriminative.
2. Bayes Rule.
3. Bayes Classifiers.
4. Naive Bayes.
5. Naive Bayes vs. KNN.
6. Revisit Bayes Classifier.
7. References.

## 1. Types of Classifiers:
Given inputs **x** and classes **y** we can do classification in several ways:
1. **Discriminative Classifiers**: These classifiers estimate the parameters of the decision boundary/class separator from labeled examples.
    1. Learn mappings directly from space of inputs X to class labels {0, 1, 2,....., K}. For example:
        1. Linear Regression as a classifier.
        2. Neural Networks.
    2. Learn p(y|x) directly. For example:
        1. Logistic Regression.
2. **Generative Classifiers**: Model the distribution of inputs characteristic of the class.
    1. These classifiers try to model p(x|y).
    2. Apply Bayes rule for classification. Thus called **Bayes Classifiers**.

In [1]:
%%javascript
MathJax.Hub.Config({
    TeX: { equationNumbers: { autoNumber: "AMS" } }
});


## 2. Bayes Rule:
Given **x**, we want to compute the class (C) probabilities using Bayes rule:
<br>
<br>
\begin{equation}
 \large p(C \,|\, x) \,=\, \frac {p(x\,|\,C)p(C)}{p(x)}
\end{equation}
<br>
<br>
In words:
<br>
<br>
\begin{equation}
 \large \text{posterior} \,=\, \frac{\text{class likelihood} \times \text{prior}}{\text{evidence}}
\end{equation}
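
As a small worked example of Bayes rule, the sketch below computes the posterior for two classes. The prior and likelihood values are hypothetical numbers chosen only for illustration; they do not come from any dataset.

```python
# Hypothetical values, chosen only to illustrate the formula.
prior = {"spam": 0.3, "not spam": 0.7}         # p(C)
likelihood = {"spam": 0.4, "not spam": 0.05}   # p(x | C), x = "message contains 'money'"

# Evidence: p(x) = sum over classes of p(x | C) * p(C)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: p(C | x) = p(x | C) * p(C) / p(x)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

for c, p in posterior.items():
    print(f"p({c} | x) = {p:.3f}")
```

The posteriors sum to 1 because the evidence term normalizes over all classes.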

## 3. Bayes Classifier:

1. Determine if an email is spam:
    1. Can look at words like **'free','pills','money'**, etc.
    2. Find:
        1. p(money|spam).
        2. p(money|not spam).
        
How do we find these?

Discrete probabilities can be estimated from counts. For example,

$$ \large p(\text{money}\,|\,\text{spam}) = \frac{\text{count}(\text{spam messages containing 'money'})}{\text{count}(\text{spam messages})} $$

Similarly, we can calculate p(money|not spam). These class-conditional probabilities are the likelihood terms in Bayes rule.
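
The counting estimate can be sketched in a few lines. The messages below are made up for this example, and `word_given_class` is a hypothetical helper name, not part of any library.

```python
# Toy labeled messages, invented for illustration.
messages = [
    ("win free money now", "spam"),
    ("cheap pills free shipping", "spam"),
    ("send money to claim prize", "spam"),
    ("meeting moved to monday", "not spam"),
    ("can you lend me some money", "not spam"),
    ("lunch tomorrow?", "not spam"),
    ("project report attached", "not spam"),
]


def word_given_class(word, label, data):
    """Estimate p(word | label) as a ratio of message counts."""
    in_class = [text for text, y in data if y == label]
    containing = [text for text in in_class if word in text.split()]
    return len(containing) / len(in_class)


p_money_spam = word_given_class("money", "spam", messages)
p_money_not_spam = word_given_class("money", "not spam", messages)
print(p_money_spam, p_money_not_spam)
```

With so few messages the estimates are very coarse; a real corpus would also need to handle words that never appear in a class.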

### What makes this Bayesian?
Let's consider a spam classifier as an example.
1. We want p(spam|X):
    1. Apply Bayes rule.
    2. **p(spam|X) = p(X|spam) * p(spam) / p(X)**, where p(spam) is the class prior, p(X|spam) is the likelihood, and p(spam|X) is the posterior.
    3. Similarly, we calculate p(not spam|X).
        1. We classify based on whichever posterior is larger.
        2. p(spam|X) > p(not spam|X) -> Spam!
        3. p(spam|X) < p(not spam|X) -> Not spam!
    4. Y = argmax_C{p(C|X)} = argmax_C{p(X|C) * p(C)}, where p(X) can be ignored because it does not depend on C.
        1. The prior p(C) is estimated from class frequencies. For example, with 10 spam emails and 20 not spam emails:
            1. p(spam) = 1/3.
            2. p(not spam) = 2/3.
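
The decision rule above can be sketched as follows. The priors come from the 10/20 email counts in the example; the likelihood values p(X|C) are hypothetical placeholders for a single email X.

```python
# Class priors from the email counts in the example above.
counts = {"spam": 10, "not spam": 20}
total = sum(counts.values())
prior = {c: n / total for c, n in counts.items()}   # p(C)

# Hypothetical likelihoods p(X | C) for one email X (assumed values).
likelihood = {"spam": 0.02, "not spam": 0.004}

# Unnormalized scores p(X | C) * p(C); p(X) is the same for every class.
scores = {c: likelihood[c] * prior[c] for c in prior}
prediction = max(scores, key=scores.get)

# Dividing by p(X) rescales every score equally, so the argmax is unchanged.
evidence = sum(scores.values())
posterior = {c: s / evidence for c, s in scores.items()}
assert max(posterior, key=posterior.get) == prediction

print(prediction)
```

Here the larger likelihood under spam outweighs the smaller spam prior, so the email is classified as spam.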
            
Before looking further into Bayes classifiers, let's look at Naive Bayes.

## 4. Naive Bayes(NB):

Consider **p(cash|spam)**. Is the occurrence of 'cash' in spam correlated with the occurrence of 'money', i.e. with **p(money|spam)**?
Probably. But if we assume the words are independent given the class, the model is called '**Naive Bayes**'.<br>


### Modeling P(X|C)

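P(X|C) = P(words|C). All words are assumed to be conditionally independent given the class, hence we can multiply the per-word probabilities.<br>

$$ P(\text{words}\,|\,C) = \prod_{w} P(w\,|\,C)^{x_w}\,\bigl(1 - P(w\,|\,C)\bigr)^{1 - x_w} $$

where w ranges over every word in the vocabulary, and x_w = 1 if w appears in the document and 0 otherwise. A word that is present contributes P(w|C); a word that is absent contributes (1 - P(w|C)).<br>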
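
A minimal sketch of this product, assuming a Bernoulli model in which each vocabulary word contributes P(w|C) when it is present in the document and (1 - P(w|C)) when it is absent. The vocabulary and per-word probabilities are hypothetical, and `bernoulli_likelihood` is an illustrative helper name.

```python
# Hypothetical per-word probabilities p(w | C) for a three-word vocabulary.
p_word = {
    "spam":     {"free": 0.6, "money": 0.5, "meeting": 0.05},
    "not spam": {"free": 0.1, "money": 0.1, "meeting": 0.4},
}


def bernoulli_likelihood(words_present, label):
    """p(words | label) under the naive independence assumption."""
    prob = 1.0
    for w, p in p_word[label].items():
        # Present words contribute p, absent words contribute (1 - p).
        prob *= p if w in words_present else (1.0 - p)
    return prob


doc = {"free", "money"}   # words observed in one document
lik_spam = bernoulli_likelihood(doc, "spam")
lik_not_spam = bernoulli_likelihood(doc, "not spam")
print(lik_spam, lik_not_spam)
```

For long documents this product becomes very small, which is one reason to sum log probabilities instead.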

### NB Implementation - Use Gaussian Distribution.

1. We won't use the full covariance matrix, since all dimensions are assumed independent in NB.<br>

2. Cov(i,j) = E[ (x_i - mu_i)(x_j - mu_j) ] ( =0 if x_i is independent of x_j).<br>

3. Cov(i,i) = var(x_i) = sigma_i^2.
    1. This is called an axis-aligned elliptical covariance.
    2. Instead of a DxD covariance matrix, store a D-sized vector of variances.
    3. Scipy allows us to pass in either.
    
4. Effectively still doing:
    1. p(X|C) = p(x1|C) p(x2|C)......p(xn|C)
    2. p(X|C) = N(x1; mu1, sigma1^2) N(x2; mu2, sigma2^2).....N(xn; mun, sigman^2) -> multivariate Gaussian with a diagonal covariance.
    
5. Computing exponentials is slow, and multiplying many small probabilities can underflow. Hence use log probabilities. Scipy has a function to calculate log probabilities too.
    1. Prediction = argmax{p(X|C) p(C)} --> argmax{log p(X|C) + log p(C)}.
6. Smoothing:
    1. Singular covariance problem - the matrix equivalent of division by zero.
    2. Add smoothing.
        1. Covariance estimate = transpose((X - mu)) * (X - mu)/(N-1).
        2. Smoothed estimate = transpose((X - mu)) * (X - mu)/(N-1) + a * I, where a is a very small number (e.g. 0.0001). -> Adds numerical stability.
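
The steps above can be sketched as a small classifier. This is one possible implementation under stated assumptions, not a definitive one: `GaussianNaiveBayes` is a hypothetical class name, the data is synthetic, and the smoothing constant 1e-4 follows the example value above. It relies on `scipy.stats.multivariate_normal.logpdf`, which accepts a 1-D `cov` argument and treats it as the diagonal of the covariance matrix.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm


class GaussianNaiveBayes:
    """Illustrative Gaussian Naive Bayes (hypothetical class, not a library API)."""

    def fit(self, X, y, smoothing=1e-4):
        self.classes_ = np.unique(y)
        self.means_, self.vars_, self.log_priors_ = [], [], []
        for c in self.classes_:
            Xc = X[y == c]
            self.means_.append(Xc.mean(axis=0))
            # D-sized variance vector instead of a DxD matrix; smoothing keeps it > 0.
            self.vars_.append(Xc.var(axis=0, ddof=1) + smoothing)
            self.log_priors_.append(np.log(len(Xc) / len(X)))
        return self

    def log_scores(self, X):
        # One column per class: log p(X|C) + log p(C).
        return np.column_stack([
            multivariate_normal.logpdf(X, mean=m, cov=v) + lp
            for m, v, lp in zip(self.means_, self.vars_, self.log_priors_)
        ])

    def predict(self, X):
        return self.classes_[np.argmax(self.log_scores(X), axis=1)]


# Synthetic two-class data, generated only for this example.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=[1.0, 0.5], size=(100, 2)),
    rng.normal(loc=[5.0, 5.0], scale=[0.5, 1.0], size=(100, 2)),
])
y = np.array([0] * 100 + [1] * 100)

model = GaussianNaiveBayes().fit(X, y)
accuracy = np.mean(model.predict(X) == y)
print("training accuracy:", accuracy)

# A variance vector and the matching diagonal matrix give the same log density,
# which also equals the sum of the per-dimension univariate log densities.
m, v = model.means_[0], model.vars_[0]
lp_vector = multivariate_normal.logpdf(X[0], mean=m, cov=v)
lp_matrix = multivariate_normal.logpdf(X[0], mean=m, cov=np.diag(v))
lp_factored = norm.logpdf(X[0], loc=m, scale=np.sqrt(v)).sum()
```

The last three lines check the factorization p(X|C) = p(x1|C) p(x2|C)...p(xn|C) in log space: the diagonal multivariate Gaussian and the sum of univariate Gaussians agree.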
                        

## 5. Naive Bayes vs. KNN:

1. The concept is almost the opposite of KNN.
    1. KNN: We directly approximate some function f(words in document) -> spam/not spam.
    2. NB: We assume the data is produced by (arises from) the target label.
        1. Spam -> Spammy document -> model p(document|spam).

## 6. Revisit Bayes Classifier:

In section 4, where the NB implementation is discussed, we assumed that the off-diagonal elements of the covariance matrix are zero. If they are allowed to be non-zero, i.e. a full covariance matrix is used, the model is referred to as a **Bayes Classifier**.
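
To illustrate the difference, the sketch below fits both a diagonal (Naive Bayes) and a full covariance Gaussian to correlated data from a single class and compares their log-likelihoods. The data is synthetic, and the correlation of 0.8 and the smoothing constant 1e-4 are assumptions made for this example.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Synthetic, strongly correlated 2-D data for one class (assumed for illustration).
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=500)

mu = X.mean(axis=0)

# Full covariance with smoothing: off-diagonal entries are kept.
full_cov = np.cov(X, rowvar=False) + 1e-4 * np.eye(2)

# Naive Bayes: only the per-dimension variances are kept.
diag_var = X.var(axis=0, ddof=1) + 1e-4

# Total log-likelihood of the data under each model of p(x | C).
full_ll = multivariate_normal.logpdf(X, mean=mu, cov=full_cov).sum()
naive_ll = multivariate_normal.logpdf(X, mean=mu, cov=diag_var).sum()

print("estimated off-diagonal covariance:", full_cov[0, 1])
print("log-likelihood, full covariance:  ", full_ll)
print("log-likelihood, diagonal (naive): ", naive_ll)
```

When features are correlated, the full covariance model fits the class density better, at the cost of estimating DxD parameters per class instead of D.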

### Advantages:
1. Grounded in probability, which can be powerful.
2. Each variable is modeled explicitly; if the results are poor, we can change the model of p(x|C).
    1. You know exactly how each variable affects results.


### Disadvantages:
1. Historically, discriminative models have worked better, e.g. deep learning.

## 7. References:
1. An Introduction to Statistical Learning Textbook by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani.
2. University of Michigan EECS 445 - Machine Learning Course (https://github.com/eecs445-f16/umich-eecs445-f16).<br>
3. University of Toronto CSC 411 - Intro. to Machine Learning (http://www.cs.toronto.edu/~urtasun/courses/CSC411_Fall16/CSC411_Fall16.html).<br>
4. Stanford CS109 - Intro. to Probability for Computer Scientists (https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/). <br>
5. A few online courses on Udemy, Coursera, etc.