# Bayesian Classification

Ref: 
* https://www.youtube.com/playlist?list=PLBv09BD7ez_6CxkuiFTbL3jsn2Qd1IU7B
* https://www.youtube.com/watch?v=ivBSZZyaRHY&feature=youtu.be

Goal: learn a function `f(x) -> y`

Where
* `y` is one of k classes (e.g. spam/ham, digit 0-9)
* x = x<sub>1</sub>...x<sub>d</sub> - values of attributes (numeric or categorical)

Naive Bayes Classifier
* Part of Bayesian Classifiers
    * Part of Probabilistic Classifiers - assign a class based on the computed probability of that class. The classifier first calculates the probability of each class, then selects the class with the highest probability given the observation
    
`y_hat = arg max_y P(y|x)`

y_hat is the predicted class

* Good for multi-class prediction

Bayesian Probability of a Class:

![Bayesian Classification](images/bayesian_classification.png)

* Probability of `y (spam)` given `x (email)` equals
    * Probability of the class `P(y)` (say, spam) - `prior` (this is independent of the email)
    * Multiplied by the probability of that particular email `x` occurring given the class `y`, `P(x|y)` - `class model` (assuming the email is spam, which combination of words should we expect to occur and not occur)
    * Divided by the probability of `x`, `P(x)` - `normalizer`

* Further, consider an example. Given symptoms (cough, temperature, skin paleness etc.), we want to predict whether a patient has Ebola
    * In this case, `P(y)` is the prior - i.e. without knowing anything about the patient, what is the probability that they have Ebola
        * Encodes which classes are common and which are rare
        * A priori, a patient is much more likely to have some other disease than Ebola
    * `P(x|y)` - class model - given that the patient has Ebola, what is the likelihood that they exhibit symptoms `x`
    * `normalizer` `P(x)` - usually left out because it doesn't affect the selection of the class - for a fixed observation it is the same constant for every class
        * Normalizes probabilities across observations
        * It matters if your question is: who is most at risk amongst all the patients? Dividing by `P(x)` brings the posterior probabilities of different patients onto the same scale
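
The Ebola reasoning above can be worked through numerically. The sketch below uses made-up values for the prior and the class model (the numbers and the names `prior`, `likelihood`, `joint` and `evidence` are assumptions for illustration, not clinical figures). It suggests why the normalizer can be dropped when only the predicted class is needed.

```python
# All numbers are hypothetical, chosen only to illustrate the arithmetic.
prior = {"ebola": 0.001, "other": 0.999}      # P(y)
likelihood = {"ebola": 0.9, "other": 0.05}    # P(x|y) for the observed symptoms x

# Unnormalized posterior: P(x|y) * P(y)
joint = {y: likelihood[y] * prior[y] for y in prior}

# Normalizer: P(x) = sum over classes of P(x|y) * P(y)
evidence = sum(joint.values())
posterior = {y: joint[y] / evidence for y in joint}

# The arg max is the same with or without the normalizer.
y_hat_unnormalized = max(joint, key=joint.get)
y_hat = max(posterior, key=posterior.get)

print(posterior, y_hat)
```

Even with a strong class model for Ebola, the small prior keeps its posterior low in this toy setting.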

* Naive Bayes is a generative model
* A complete probability distribution for each class
    * Defines likelihood for any point x
    * P(y|x) proportional to P(x|y) * P(y)
    * P(class) via P(observation)
    * Can _generate_ synthetic observations
* All generative classifiers are probabilistic, but not all probabilistic classifiers are generative
    * i.e. in some cases one can compute the probability `P(y|x)` directly without building a model for each class. E.g. logistic regression
    

![Bayes Formula](images/bayes_formula.png)
![Bayes Classifier](images/bayes_classifier.png)
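
To illustrate what "generative" means in the notes above, here is a minimal sketch that fits a prior and a per-feature Gaussian for each class, then draws synthetic observations from the fitted model. The toy data, the Gaussian choice and the helper name `sample` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data (hypothetical): two classes, two continuous features.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

# Fit the generative model: a prior P(y) and a per-feature Gaussian P(x_i|y).
classes = np.unique(y)
priors = np.array([(y == c).mean() for c in classes])
means = np.array([X[y == c].mean(axis=0) for c in classes])
stds = np.array([X[y == c].std(axis=0) for c in classes])

def sample(n):
    """Draw n synthetic (x, y) pairs: first y ~ P(y), then each x_i ~ P(x_i|y)."""
    ys = rng.choice(classes, size=n, p=priors)
    # Labels are 0..k-1 here, so they double as row indices into means/stds.
    xs = rng.normal(means[ys], stds[ys])
    return xs, ys

X_new, y_new = sample(5)
print(X_new, y_new)
```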

## Naive Bayes Classifier

* Generative Model - classifies by first building a model of each class
* The full Bayesian approach does not scale when the number of features is large
* E.g. if each observation is characterised by 3 binary features, then the total number of possible distinct observations is 2<sup>3</sup> = 8, and it doubles with every added feature
* Computationally intensive, and the model gets complex, which leads to overfitting

Naive:

* Instead of modelling every combination of feature values jointly (`2^d` combinations for `d` binary features), assume that the features are independent of each other (given the class). As a result:
    * `P(a, b) = P(a) * P(b)`

* The independence assumption is what makes it naive
* Without it, we would have to compute `P(x1...xd|y)` for every observation x1...xd
    * class-conditional `counts`, based on training data
    * problem: may not have seen every x1...xd for every y
        * digits: 2<sup>400</sup> possible black/white patterns (20x20)
        * spam: every possible combination of words: 2<sup>10000</sup>
    * but we often do have observations of each individual x<sub>i</sub> for every class

Independence Assumption:

![Independence Assumption](images/independence_assumption.png)
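
The factorization can be sketched for binary features as follows. Under the assumption, each class needs only one estimate of `P(x_i = 1 | y)` per feature, taken from class-conditional counts, so an observation that never appeared in training can still be scored. The training set and the helper names `fit` and `joint_score` are hypothetical.

```python
import numpy as np

# Hypothetical training set: 3 binary features, binary class label.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([1, 1, 1, 0, 0, 0])

def fit(X, y):
    """Estimate P(y) and P(x_i = 1 | y) from class-conditional counts."""
    classes = np.unique(y).tolist()
    priors = {c: float(np.mean(y == c)) for c in classes}
    p_one = {c: X[y == c].mean(axis=0) for c in classes}
    return priors, p_one

def joint_score(x, priors, p_one):
    """Return P(y) * prod_i P(x_i | y) for each class."""
    scores = {}
    for c in priors:
        per_feature = np.where(x == 1, p_one[c], 1 - p_one[c])
        scores[c] = priors[c] * float(per_feature.prod())
    return scores

priors, p_one = fit(X, y)
x_new = np.array([1, 0, 0])   # this exact combination never appears in X
scores = joint_score(x_new, priors, p_one)
y_hat = max(scores, key=scores.get)
print(scores, y_hat)
```

This sketch does no smoothing, so a feature value never seen with a class would zero out that class's score.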

Conditional Independence:
* `P(Beach, Stroke) > P(Beach) P(Stroke)`
* But if we argue that it's the heat which causes the stroke:

    `P(Beach, Stroke|Heat) == P(Beach|Heat) P(Stroke|Heat)`
* In Classification: Class Value explains all the dependence between attributes
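
A small numeric check of the beach/stroke example, using invented probabilities: the joint distribution is built so that Beach and Stroke are independent given Heat, yet they come out dependent once Heat is marginalized away. All values and variable names are assumptions for illustration.

```python
import numpy as np

# Hypothetical probabilities, indexed by Heat = 0 (cool) and Heat = 1 (hot).
p_heat = np.array([0.7, 0.3])                 # P(Heat)
p_beach_given_heat = np.array([0.1, 0.6])     # P(Beach=1 | Heat)
p_stroke_given_heat = np.array([0.01, 0.1])   # P(Stroke=1 | Heat)

# By construction: P(Beach, Stroke | Heat) = P(Beach | Heat) * P(Stroke | Heat).
p_beach_stroke = np.sum(p_heat * p_beach_given_heat * p_stroke_given_heat)
p_beach = np.sum(p_heat * p_beach_given_heat)
p_stroke = np.sum(p_heat * p_stroke_given_heat)

# Marginally, the joint probability exceeds the product of the marginals.
print(p_beach_stroke, p_beach * p_stroke)
```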

* We use Classical Naive Bayes when the features are categorical
* We use Gaussian Naive Bayes when the features are continuous
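
As a rough illustration of that choice in scikit-learn, the sketch below fits `CategoricalNB` on integer-encoded categorical features and `GaussianNB` on continuous ones. The synthetic data is an assumption made for this example, and mapping "Classical Naive Bayes" to `CategoricalNB` is one reasonable reading rather than the only one.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)

# Categorical features encoded as non-negative integers (hypothetical data).
X_cat = np.column_stack([
    (y + rng.integers(0, 2, size=200)) % 3,   # loosely tied to the class
    rng.integers(0, 4, size=200),             # unrelated to the class
])
cat_clf = CategoricalNB().fit(X_cat, y)

# Continuous features (hypothetical data): the class shifts the mean.
X_num = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(200, 2))
num_clf = GaussianNB().fit(X_num, y)

print(cat_clf.score(X_cat, y), num_clf.score(X_num, y))
```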

Where can NB go wrong?
* If the classes have the same per-feature mean and variance, there is no way to differentiate them
    * One could have taken advantage of correlation between features, but NB cannot model correlation
* NB assumes features occur independently, hence it can misclassify at times
* Zero frequency problem
    * An attribute value never seen with a class in training gets probability 0, which zeroes out the whole product for that class
    * Zipf's law: will happen with roughly half of your attributes
* Missing data (easy with Naive Bayes):
    * Ignore attribute in instance where value is missing
    * Compute likelihood based on observed attributes
    * no need to fill in or explicitly model missing values
    * based on conditional independence between attributes
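
Two of the points above can be sketched in a few lines: Laplace (add-one) smoothing is one common remedy for the zero frequency problem, and a missing attribute can simply be left out of the product. The helper names `smoothed_likelihood` and `log_score`, and all counts and probabilities, are hypothetical.

```python
import math

def smoothed_likelihood(count, class_total, n_values, alpha=1.0):
    """Laplace-smoothed estimate of P(x_i = v | y); it is never exactly zero."""
    return (count + alpha) / (class_total + alpha * n_values)

def log_score(x, log_prior, feature_log_probs):
    """log P(y) + sum_i log P(x_i | y), skipping attributes whose value is None."""
    total = log_prior
    for i, value in enumerate(x):
        if value is None:   # missing attribute: leave it out of the product
            continue
        total += feature_log_probs[i][value]
    return total

# Hypothetical counts: a word never seen in 10 training emails of some class.
p_unseen = smoothed_likelihood(count=0, class_total=10, n_values=2)

# Hypothetical per-feature log-probabilities for one class, two binary features.
feature_log_probs = [
    {0: math.log(0.5), 1: math.log(0.5)},
    {0: math.log(0.9), 1: math.log(0.1)},
]
score = log_score([1, None], math.log(0.3), feature_log_probs)
print(p_unseen, score)
```

Skipping a missing attribute amounts to summing `P(x_i|y)` over all its possible values, which equals 1 under the conditional independence assumption.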


In [None]:
%matplotlib inline

import numpy as np
from sklearn import datasets
import matplotlib.pyplot as plt


X, y = datasets.make_classification(n_features=2, n_redundant=0, n_informative=2,
                           random_state=1, n_clusters_per_class=1)

In [None]:
X.shape

In [None]:
y.shape

In [None]:
# Split the two feature columns for plotting
X1 = X[:, 0]
X2 = X[:, 1]

In [None]:
fig = plt.figure()
ax1 = fig.add_subplot(111)

ax1.scatter(X1, X2, c=y)

In [None]:
from sklearn import naive_bayes
from sklearn.model_selection import train_test_split

clf = naive_bayes.GaussianNB()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf.fit(X_train, y_train)
clf.score(X_test, y_test)

In [None]:
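# Evaluate the classifier on a grid covering the data range
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 200),
                     np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 200))
cm = plt.cm.RdBu
if hasattr(clf, "decision_function"):
    Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
else:
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
Z = Z.reshape(xx.shape)
fig, ax1 = plt.subplots()  # fresh figure so this cell renders its own plot
ax1.contourf(xx, yy, Z, cmap=cm, alpha=.8)
ax1.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")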
