# Naive Bayes

---

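Naive Bayes applies Bayes' theorem with the 'naive' assumption that the features are conditionally independent of one another given the class.

It is used for classification problems in which the prior probability of each class is known or can be estimated from labelled data (a supervised learning algorithm).

It works well for very high-dimensional problems: because of its strong independence assumption, each feature's distribution is estimated separately, so it needs only a small amount of training data and is less affected by the curse of dimensionality.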



## Bayesian Classification

Problem statement: 
* Given features $x_1, x_2, \ldots, x_n $  
* Predict a label $Y$  

Given a joint distribution over $x_1, \ldots, x_n$ and $Y$, predict the most probable label:

 $ \underset{Y}{\arg\max}\  P(Y \ |\  x_1, \ldots, x_n ) $
 


#### Bayes Rule:

<img src="bayes_rule.png">

For every possible label $Y$, we compute its posterior probability and choose the label with the highest probability as our prediction.

For example, suppose there are 2 class labels, $Y = 5$ and $Y = 6$. We compute the posterior probability for each class label.

$$
\begin{align}
P(Y = 5 \ |\  x_1, \ldots, x_n ) & = \frac{P( x_1, \ldots, x_n \ |\ Y = 5) \ P(Y=5)}{P( x_1, \ldots, x_n \ |\ Y = 5) \ P(Y=5) + P( x_1, \ldots, x_n \ |\ Y = 6) \ P(Y=6)} \\ \\
P(Y = 6 \ |\  x_1, \ldots, x_n ) & = \frac{P( x_1, \ldots, x_n \ |\ Y = 6) \ P(Y=6)}{P( x_1, \ldots, x_n \ |\ Y = 5) \ P(Y=5) + P( x_1, \ldots, x_n \ |\ Y = 6) \ P(Y=6)}
\end{align}
$$

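Then pick the class with the higher probability. Both posteriors share the same denominator, so comparing the numerators is enough.

If $P(Y = 5 \ |\  x_1, \ldots, x_n )$ is higher than $P(Y = 6 \ |\  x_1, \ldots, x_n )$, our prediction is class label 5 instead of 6.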
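
The two-class computation above can be sketched in a few lines of Python. This is only an illustration: the function name `posterior` and all of the numbers are hypothetical, and the likelihoods $P(x_1, \ldots, x_n \ |\ Y)$ are assumed to be given rather than estimated.

```python
def posterior(likelihoods, priors):
    """Return P(Y = y | x) for every label y using Bayes' rule."""
    # Numerator of Bayes' rule for each label: P(x | Y = y) * P(Y = y)
    joint = {y: likelihoods[y] * priors[y] for y in priors}
    # Denominator: the evidence P(x), summed over all labels
    evidence = sum(joint.values())
    return {y: joint[y] / evidence for y in joint}


# Hypothetical values, chosen only for illustration
likelihoods = {5: 0.02, 6: 0.01}  # P(x_1, ..., x_n | Y = y)
priors = {5: 0.3, 6: 0.7}         # P(Y = y)

post = posterior(likelihoods, priors)
prediction = max(post, key=post.get)  # label with the highest posterior
print(post, prediction)
```

With these numbers the features are twice as likely under class 5, yet the larger prior of class 6 tips the prediction to 6, which shows why the prior cannot be ignored.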

## Naive Bayes

Naive Bayes uses Bayes' rule together with a simplifying assumption about the likelihood term.

<img src="naive_bayes_intuition.png">

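Most of the top 10 classification algorithms are discriminative (K-NN, CART, C4.5, SVM, AdaBoost). Naive Bayes, by contrast, is a generative model: it models $P(X \ |\ Y)$ and $P(Y)$ and obtains $P(Y \ |\ X)$ through Bayes' rule.

Assumption: features are conditionally independent given the class. So,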
    

$$ 
\begin{align}
P(X_1, \ldots, X_n  \ |\  Y ) &= P(X_1 \ |\  Y ) P(X_2 \ |\  Y ) \ldots P(X_n \ |\  Y ) \\
 &= \prod_{i=1}^{n} P(X_i \ |\  Y )
\end{align}
$$

Instead of learning a joint distribution over all features, we learn $P(X_i\ |\ Y)$ separately for each feature $X_i$.
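
To see how much the factorization saves, the sketch below counts the free parameters needed with and without the independence assumption. It assumes every feature is binary, which the text does not state; the function names are hypothetical.

```python
def joint_param_count(n_features, n_classes):
    # A full table over n binary features has 2**n outcomes per class;
    # one is fixed by normalisation, leaving 2**n - 1 free parameters.
    return n_classes * (2 ** n_features - 1)


def naive_param_count(n_features, n_classes):
    # Under conditional independence, each binary feature needs only
    # one parameter per class: P(X_i = 1 | Y = y).
    return n_classes * n_features


for n in (5, 10, 30):
    print(n, joint_param_count(n, 2), naive_param_count(n, 2))
```

The joint count grows exponentially in the number of features while the naive count grows linearly, which is one way to read the claim that Naive Bayes needs comparatively little training data.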

## Naive Bayes Classifier

Given :
* Prior $P(Y)$
* $n$ features $X_1, \ldots, X_n$ that are conditionally independent given the class $Y$
* Likelihood $P(X_i \ | \ Y)$ for each $X_i$

Decision Rule:
    

$$  
\begin{align}
y^* = h_{NB} (x) & = \underset{y}{\arg\max}\ P(y) \ P(x_1, \ldots, x_n \ | \ y) \\
& = \underset{y}{\arg\max}\ P(y)  \prod_{i=1}^{n} P(x_i \ |\  y )
\end{align}
$$
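
A minimal sketch of the decision rule follows. It sums log-probabilities instead of multiplying probabilities; because the logarithm is monotonic this gives the same arg max, and it avoids numerical underflow when $n$ is large. The `predict` function, the spam/ham labels, and every number are hypothetical, and the sketch assumes all probabilities are strictly positive (`math.log(0)` would raise an error).

```python
import math


def predict(x, priors, likelihoods):
    """Return the label maximising log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for y, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(x):
            # likelihoods[y][i][value] stores P(X_i = value | Y = y)
            score += math.log(likelihoods[y][i][value])
        scores[y] = score
    return max(scores, key=scores.get), scores


# Hypothetical model with two binary features
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [{0: 0.2, 1: 0.8}, {0: 0.7, 1: 0.3}],
    "ham": [{0: 0.9, 1: 0.1}, {0: 0.5, 1: 0.5}],
}

label, scores = predict((1, 0), priors, likelihoods)
print(label, scores)
```

For the input `(1, 0)` the unnormalised scores correspond to $0.4 \times 0.8 \times 0.7 = 0.224$ for spam and $0.6 \times 0.1 \times 0.5 = 0.03$ for ham, so the rule selects spam.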

Potential problem:

Many of the estimated conditional probabilities can be 0, because the dimensionality of the data is very high compared to the amount of training data. This is a problem because if even one $P(X_i \ | \ Y)$ is zero, the whole product on the right-hand side is zero. In other words, if no training examples from class “spam” contain the word “tomato”, we would never classify a test example containing the word “tomato” as spam!
    


Solution to the problem:

Laplace smoothing (or “Bayesian shrinkage estimate”): add one to every count so that no estimate is exactly zero.

$$   P(X_i = x \ | \ Y = y ) = \frac{n_{ik} + 1}{n_{k}+K} $$

where

* $n_{ik} = \#$ of training examples with $Y = y$ and $X_i = x$
* $n_k = \#$ of training examples with $Y = y$
* $K = \#$ of possible values of $X_i$
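
The smoothed estimate can be sketched for a single categorical feature as follows. The function `estimate_likelihood`, its `smoothing` parameter, and the toy data are hypothetical, and the sketch assumes the set of possible values of $X_i$ is exactly the set of values seen in training, which may not hold in practice.

```python
from collections import Counter


def estimate_likelihood(values, labels, smoothing=1):
    """Estimate P(X_i = x | Y = y) for one categorical feature.

    smoothing=0 gives raw relative frequencies; smoothing=1 gives the
    add-one (Laplace) estimate (n_ik + 1) / (n_k + K).
    """
    domain = sorted(set(values))         # assumed: the K possible values
    K = len(domain)
    n_k = Counter(labels)                # examples per class
    n_ik = Counter(zip(values, labels))  # examples per (value, class)
    return {
        (x, y): (n_ik[(x, y)] + smoothing) / (n_k[y] + smoothing * K)
        for y in n_k
        for x in domain
    }


# Hypothetical toy data: does the message contain the word "tomato"?
values = ["yes", "no", "no", "no", "no", "no"]
labels = ["ham", "ham", "ham", "spam", "spam", "spam"]

raw = estimate_likelihood(values, labels, smoothing=0)
smoothed = estimate_likelihood(values, labels, smoothing=1)
print(raw[("yes", "spam")], smoothed[("yes", "spam")])
```

Without smoothing, the estimate for "tomato" given spam is $0/3 = 0$; with add-one smoothing it becomes $(0 + 1)/(3 + 2) = 0.2$, and the estimates for each class still sum to 1.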

> Naive Bayes is not necessarily the best algorithm, but is a good first thing to try, and performs surprisingly well given its simplicity!