# Lecture 2: Naive Bayes
***

<img src="figs/bayes.jpg" width=1201 height=50> 

<!---
![my_image](files/figs/bayes.jpg)
-->


<a id='prob1'></a>

<a id='prob2'></a>

### Problem 1: Naive Bayes on Symbols
***

> This problem was adapted from [Naive Bayes and Text Classification I: Introduction and Theory](https://arxiv.org/abs/1410.5329) by Sebastian Raschka.

Consider the following training set of 12 symbols which have been labeled as either + or -: 

<br>

<img src="figs/shapes.png" width=500>


Answer the following questions: 


**A**: What are the general features associated with each training example? 

**Answer:**

Shape: {circle, square}  
Color: {green, red, blue}  
Symbol (the class label, not a feature): {+, -}

In the next part, we'll use Naive Bayes to classify the following test example: 

<img src="figs/bluesquare.png" width=200>

OK, so this symbol actually appears in the training set, but let's pretend that it doesn't.  

The decision rule can be defined as 

>Classify ${\bf x}$ as + if <br>
>$p(+ ~|~ {\bf x} = [blue,~ square]) \geq p(- ~|~ {\bf x} = [blue, ~square])$ <br>
>else classify sample as -

**B**: What are the Maximum Likelihood Estimates of the priors $p(+)$ and $p(-)$? 


**Answer:**

$$
p(+) = \frac{\#~instances~of~positive}{\#~total~instances} = \frac{7}{12}
$$

$$
p(-) = \frac{\#~instances~of~negative}{\#~total~instances} = \frac{5}{12}
$$
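
As a quick check, the priors can be computed by counting labels. The sketch below is only illustrative: the `labels` list is seven `+` and five `-` entries, matching the counts used above, and the variable names are assumptions rather than part of the original notebook.

```python
from collections import Counter
from fractions import Fraction

# Labels of the 12 training symbols: 7 positive, 5 negative (counts from the answer above).
labels = ["+"] * 7 + ["-"] * 5

counts = Counter(labels)
total = len(labels)

# MLE prior: fraction of training examples carrying each label.
priors = {label: Fraction(n, total) for label, n in counts.items()}

print(priors["+"], priors["-"])  # 7/12 5/12
```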

**C**: Identify and compute estimates of the class-conditional probabilities required to predict the class of ${\bf x} = [blue,~square]$.

**Answer:**
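The Naive Bayes approach assumes that the features are **CONDITIONALLY INDEPENDENT** given the class.

A bit of notation first: $$p([blue, square]) = p(blue \cap square)$$  
Conditional independence means that, within each class, the joint probability factors:
$$
p([blue, square] ~|~ class) = p(blue ~|~ class)*p(square ~|~ class)
$$
The denominator $p([blue, square])$ is the same for both classes, so it does not affect the prediction; the answers below approximate it as $p(blue)*p(square)$.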

The best way to think about it is to break the training set down into a table by class (R, G, B = red, green, blue; sq = square; cir = circle):

| +      | -      |
|--------|--------|
| R, sq  | B, sq  |
| B, cir | B, sq  |
| G, sq  | R, cir |
| B, cir | B, cir |
| R, sq  | R, sq  |
| G, sq  |        |
| B, sq  |        |




Relating this to the class-conditional probabilities via Bayes' rule:

$$
p(+ ~|~ {\bf x} = [blue,~ square])~=~\frac{p(~[blue, square]~|~+)*p(~+)}{p(~[blue, square]~)}
$$

$$
p(~[blue, square]~|~+) = p(~blue~|~+)*p(~square~|~+)
$$

$$
p(~blue~|~+)~=~\frac{3}{7}
$$

$$
p(~square~|~+)~=~\frac{5}{7}
$$

$$
p(~blue~|~-)~=~\frac{3}{5}
$$

$$
p(~square~|~-)~=~\frac{3}{5}
$$

$$
p(~blue~)~=~\frac{6}{12}
$$

$$
p(~square~)~=~\frac{8}{12}
$$
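
The class-conditional estimates above can be reproduced by counting within each class. This is a minimal sketch, assuming the training set is transcribed from the table above as `(color, shape)` pairs; the `train` dictionary and the `cond_prob` helper are illustrative names, not part of the original notebook.

```python
from fractions import Fraction

# Training set transcribed from the table above as (color, shape) pairs per class.
train = {
    "+": [("R", "sq"), ("B", "cir"), ("G", "sq"), ("B", "cir"),
          ("R", "sq"), ("G", "sq"), ("B", "sq")],
    "-": [("B", "sq"), ("B", "sq"), ("R", "cir"), ("B", "cir"), ("R", "sq")],
}

def cond_prob(value, position, label):
    """MLE of p(value | label); position 0 is color, position 1 is shape."""
    rows = train[label]
    matches = sum(1 for row in rows if row[position] == value)
    return Fraction(matches, len(rows))

# Print p(blue | class) and p(square | class) for each class.
for label in train:
    print(label, cond_prob("B", 0, label), cond_prob("sq", 1, label))
```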



**D**: Using the estimates computed above, compute the **posterior** scores for each label, and find the Naive Bayes prediction of the label for ${\bf x} = [blue,~square]$. 

**Answer:**  
$$
\frac{p(~[blue, square]~|~+)*p(~+)}{p(~[blue, square]~)}=~\frac{\frac{3}{7}~*~\frac{5}{7}~*~\frac{7}{12}}{\frac{6}{12}~*~\frac{8}{12}}~=~0.536
$$

$$
\frac{p(~[blue, square]~|~-)*p(~-)}{p(~[blue, square]~)}=~\frac{\frac{3}{5}~*~\frac{3}{5}~*~\frac{5}{12}}{\frac{6}{12}~*~\frac{8}{12}}~=~0.450
$$

Since $0.536 > 0.450$, the Naive Bayes prediction for ${\bf x} = [blue,~square]$ is ${\bf+}$.
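
The arithmetic in this answer can be checked with exact fractions. The sketch below plugs in the estimates from parts B and C and uses the same denominator, $p(blue)*p(square)$, as the formulas above; the dictionary names are illustrative assumptions.

```python
from fractions import Fraction as F

# Estimates taken from the answers above.
prior = {"+": F(7, 12), "-": F(5, 12)}
p_blue = {"+": F(3, 7), "-": F(3, 5)}
p_square = {"+": F(5, 7), "-": F(3, 5)}

# The denominator used above: p(blue) * p(square).
evidence = F(6, 12) * F(8, 12)

# Score for each label: likelihood times prior, divided by the shared denominator.
score = {c: p_blue[c] * p_square[c] * prior[c] / evidence for c in prior}

# The denominator is identical for both labels, so it cannot change the winner.
prediction = max(score, key=score.get)

print({c: round(float(s), 3) for c, s in score.items()}, prediction)
```

Because both scores share one denominator, comparing the numerators alone would give the same prediction.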

**E**: If you haven't already, compute the class-conditional probability scores $\hat{p}({\bf x} = [blue,~square] ~|~ +)$ and $\hat{p}({\bf x} = [blue,~square] ~|~ -)$ under the Naive Bayes assumption.  How can you reconcile these values with the final prediction that would be made? 

**Answer:**
$$
\hat{p}({\bf x} = [blue,~square] ~|~ +) = p(blue ~|~ +) * p(square ~|~ +) = \frac{3}{7}*\frac{5}{7} = 0.31
$$

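$$
\hat{p}({\bf x} = [blue,~square] ~|~ -) = p(blue ~|~ -) * p(square ~|~ -) = \frac{3}{5}*\frac{3}{5} = 0.36
$$

The class-conditional score is higher for $-$ ($0.36$ vs. $0.31$), but the prior favors $+$ ($\frac{7}{12}$ vs. $\frac{5}{12}$). Multiplying each likelihood by its prior gives $\frac{15}{49} * \frac{7}{12} \approx 0.179$ for $+$ and $\frac{9}{25} * \frac{5}{12} = 0.150$ for $-$, so the final prediction is still $+$.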

<a id='prob3'></a>

### Problem 2: Laplace Smoothing 
***

Consider the same training set from Problem 1, but suppose we see the following test example: 
    
<img src="figs/greencircle.png" width=200>

Before you get too far into trying to predict the label of the green circle, look carefully at the training set.  Notice that there are no green shapes labeled - in the training set, so when we try to compute the class-conditional probability $p(green ~|~ -)$ we'll get a zero probability.  To fix this, you'll implement Laplace smoothing. Notice that this is a little different than the SPAM vs HAM example shown in the video.  We actually have two very different features in shapes and colors. We'll apply Laplace Smoothing to the shape and color class-conditional probabilities separately. 

**A**: What would the general formula for the estimate of $p(shape ~|~ class)$ with Laplace Smoothing look like for the given training set?  What is the *vocabulary* in the shape case?  

**Answer:**  
Vocabulary: {square, circle}  
General formula: $p(shape ~|~ class) = \frac{\#(shape ~and~ class)+1}{\#class + 1*2}$
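
The shape formula can be sketched in a few lines. This is an illustration under stated assumptions: the shape counts are read off the training table in Problem 1, and `shape_counts`, `vocabulary`, and `smoothed` are hypothetical names introduced here.

```python
from fractions import Fraction

# Shape counts per class, read off the training table in Problem 1.
shape_counts = {
    "+": {"square": 5, "circle": 2},
    "-": {"square": 3, "circle": 2},
}
vocabulary = ["square", "circle"]

def smoothed(shape, label, alpha=1):
    """Laplace-smoothed estimate of p(shape | label) with pseudo-count alpha."""
    counts = shape_counts[label]
    class_total = sum(counts.values())
    return Fraction(counts[shape] + alpha, class_total + alpha * len(vocabulary))

print(smoothed("circle", "+"), smoothed("circle", "-"))  # 1/3 3/7
```

Adding the vocabulary size to the denominator keeps the smoothed estimates for each class summing to one.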

**B**: What would the general formula for the estimate of $p(color ~|~ class)$ with Laplace Smoothing look like for the given training set?  What is the *vocabulary* in the color case?  

**Answer:**


**C**: Predict the label for the green circle using the Laplace-smoothed class-conditional probability formulas.  Don't forget to apply Laplace Smoothing to the priors as well! 

**Answer:**


<a id='prob4'></a>

### Problem 3: Unknown Features
***

Once again consider the training set from Problem 1, but suppose we see the following test example: 
    
<img src="figs/yellowsquare.png" width=200>

OK, this is a weird one.  Up until this point, we've never seen the color *yellow*, and thus don't include it in the color vocabulary.  One way that we could handle this is to add *yellow* to the color vocabulary, and then recompute the class-conditional probabilities with *yellow* included in the vocabulary. 

But what happens when on the next test example we see a *pink* circle (or worse, a triangle)? We'd rather not continue to modify our probability estimates whenever we see a shape or color that we haven't seen before.  One solution to this is to just assume we'll see weird things in the future and combine all of the possibilities into an UNK feature. If we do this, then our class-conditional probabilities become 

$$
p(feature ~|~ class) = \frac{\#~instances~of~feature~in~class + 1}{\#~total~symbols~in~class + |V| + 1}
$$

where the vocabulary $V$ is the same vocabulary defined by the training set. 
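
The UNK formula can be sketched for the color feature as follows. This is a minimal illustration, assuming the color counts are read off the training table in Problem 1; `color_counts` and `smoothed_with_unk` are hypothetical names introduced here.

```python
from fractions import Fraction

# Color counts per class, read off the training table in Problem 1.
color_counts = {
    "+": {"red": 2, "green": 2, "blue": 3},
    "-": {"red": 2, "green": 0, "blue": 3},
}

def smoothed_with_unk(color, label):
    """p(color | label) with add-one smoothing and one extra UNK slot."""
    counts = color_counts[label]
    class_total = sum(counts.values())
    # Any color outside the training vocabulary falls into UNK, whose count is 0.
    seen = counts.get(color, 0)
    # Denominator: class size + |V| + 1, where |V| is the training vocabulary size.
    return Fraction(seen + 1, class_total + len(counts) + 1)

print(smoothed_with_unk("yellow", "+"), smoothed_with_unk("yellow", "-"))  # 1/11 1/9
```

The same estimate is returned for any unseen color, so a later *pink* example needs no recomputation.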

**A**: Predict the label of the yellow square.  

***
<br><br><br><br>
<br><br><br><br>
<br><br><br><br>
<br><br><br><br>

<a id='prob1ans'></a>

<br><br><br><br>
<br><br><br><br>
<br><br><br><br>
<br><br><br><br>
### Helper Functions 
***

In [2]:
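# HTML injects the style rule below; Image is available for displaying figures.
from IPython.display import HTML, Image

# Zero the left border width on MathJax spans.
# Kept as the last expression so the notebook renders the style.
HTML("""
<style>
.MathJax nobr>span.math>span{border-left-width:0 !important}
</style>
""")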