$\LaTeX \text{ commands here}
\newcommand{\R}{\mathbb{R}}
\newcommand{\im}{\text{im}\,}
\newcommand{\norm}[1]{||#1||}
\newcommand{\inner}[1]{\langle #1 \rangle}
\newcommand{\span}{\mathrm{span}}
\newcommand{\proj}{\mathrm{proj}}
\newcommand{\OPT}{\mathrm{OPT}}
\newcommand{\vx}{\vec{x}}
\newcommand{\I}{\mathbb{I}}
$


<hr style="border: 5px solid black">

**Georgia Tech, CS 4540**

# Lecture 18: Naive Bayes Algorithm, and the Beta-Binomial Model

Naveen Kodali and Jacob Abernethy
*Date:  Thursday, November 1, 2018*

## Outline
- Basics
    - Review of MAP (*maximum a posteriori*) estimation
    - MAP estimation for linear regression
- Naive Bayes Classifier
    - Independence Assumption
    - MLE Estimation

## Maximum Likelihood: Simple Linear Regression

One of the most basic techniques for estimating a real number $y$ given an observed vector $x \in \R^d$ is known as *linear regression*, aka *least squares*. The key idea is to assume that the data $x_1, \ldots, x_n$ are fixed vectors, and the labels $y_1, \ldots, y_n$ are *independent* random real numbers whose probability distribution is *Gaussian*. For some parameter $\theta \in \R^d$ we have
$$P( y_i | x_i, \theta) = \mathcal{N}(y_i | x_i^\top \theta, \sigma^2)= 
\frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(y_i - x_i^\top \theta)^2}{2\sigma^2}\right)$$
**Problem**: What is the MLE for $\theta$? (You don't necessarily have to find a closed-form solution)

Useful fact: by independence we have $P(y_1, \ldots, y_n | x_1, \ldots, x_n, \theta) = P(y_1 | x_1, \theta) \cdots P(y_n | x_n, \theta)$

Want:
$$
\begin{align*}
\arg\max_\theta & \; \sum_{i=1}^n \log\left( \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}\right)\right) \\
= \arg\max_\theta & \; \sum_{i=1}^n \log \exp\left(-\frac{(y_i - \theta^\top x_i)^2}{2\sigma^2}\right)
\\
= \arg\min_\theta & \; \sum_{i=1}^n (y_i - \theta^\top x_i)^2\\
\nabla = 0 \implies & \;  \sum_{i=1}^n (y_i - \theta^\top x_i) x_i^\top = 0\\
\implies & \;  (X^\top X)\, \theta = X^\top Y
\end{align*}
$$
where $X$ is the matrix whose rows are $x_1^\top, \ldots, x_n^\top$ and $Y$ is the vector with entries $y_1, \ldots, y_n$. Now we are solving a linear system!
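
As a minimal numerical sketch, the linear system $(X^\top X)\theta = X^\top Y$ can be solved directly with NumPy. The data below are synthetic, and the sizes `n`, `d`, the noise level `sigma`, and `theta_true` are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical sizes and parameters, for illustration only).
n, d = 200, 3
theta_true = np.array([1.5, -2.0, 0.5])
sigma = 0.1

X = rng.normal(size=(n, d))                        # rows are x_1^T, ..., x_n^T
Y = X @ theta_true + sigma * rng.normal(size=n)    # y_i = x_i^T theta + Gaussian noise

# MLE: solve the normal equations (X^T X) theta = X^T Y.
theta_mle = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's built-in least-squares solver.
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(theta_mle)
```

Solving the system with `np.linalg.solve` avoids forming the explicit inverse of $X^\top X$; it assumes $X^\top X$ is invertible, which holds here because the columns of `X` are linearly independent.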

## Maximum A Posteriori Estimation

So far we have been looking at the maximum likelihood estimator. A more advanced estimator involves a *prior* on the parameters. This means that we "prefer", a priori, certain parameters over others: we think some parameters are more likely to be true than others. You describe this preference using a prior distribution $P(\theta)$. Given data $x_1, \ldots, x_n$, the Maximum A Posteriori (MAP) estimate for $\theta$ is then
$$\arg\max_\theta P(\theta) \prod_{i=1}^n P(x_i | \theta) \quad \text{ equiv. to } \quad 
\arg\max_\theta \log P(\theta) +  \sum_{i=1}^n \log P(x_i | \theta) 
$$


## Ridge Regression

For the linear regression model above, we can add a prior to the parameter vector $\theta$. The standard prior is a zero-mean, isotropic "multivariate Gaussian" with a single *fixed* scale parameter $\nu$, i.e.
$$P(\theta) = \frac{1}{(2\pi \nu^2)^{d/2}} \exp\left(\frac{-\|\theta\|^2}{2\nu^2}\right)$$

The likelihood function is still the same:
$P( y_i | x_i, \theta) =
\frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left(-\frac{(y_i - x_i^\top \theta)^2}{2\sigma^2}\right)$.

**Problem A**: What is the MAP estimate for this model *before* we have data?
$$\arg\max_\theta \log P(\theta)$$

**Problem B**: What is the MAP estimate for this model *after* we have data?
$$\arg\max_\theta \log P(\theta) +  \sum_{i=1}^n \log P(y_i | x_i, \theta)$$

## Observation

* We use the term *posterior distribution* to refer to the probability of the parameters $\theta$ given the dataset $(x_1, y_1), \ldots, (x_n,y_n)$. You can see from Bayes rule that
$$P(\theta | data) \propto P(data | \theta) P(\theta) = P(\theta) \prod_i P(y_i | x_i, \theta)$$
* If you work this out for the ridge regression model discussed above, you find that
$$P(\theta | data) \propto \exp\left( - \frac{1}{2} (\theta - \mu)^\top \Lambda (\theta-\mu)\right)$$
where $\Lambda = \frac{1}{\nu^2} I + \frac{1}{\sigma^2} X^\top X$ and $\mu = \frac{1}{\sigma^2} \Lambda^{-1} X^\top Y$ (for $\sigma = \nu = 1$ this is simply $\Lambda = I + X^\top X$ and $\mu = \Lambda^{-1} X^\top Y$). In other words, the distribution on $\theta$ given the data is again a multivariate Gaussian $\mathcal{N}(\mu, \Lambda^{-1})$ with mean $\mu$ and covariance matrix $\Lambda^{-1}$.
* When the posterior distribution is *in the same distribution family* as the prior distribution, we say the prior is *conjugate to* the likelihood. This is a lucky special case: most prior/likelihood pairs do not have this property.
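
The following sketch checks the Gaussian posterior numerically. For general $\sigma$ and $\nu$ it uses the precision $\Lambda = I/\nu^2 + X^\top X/\sigma^2$ and mean $\mu = \Lambda^{-1} X^\top Y/\sigma^2$ (which reduce to $\Lambda = I + X^\top X$ and $\mu = \Lambda^{-1} X^\top Y$ when $\sigma = \nu = 1$), and compares $\mu$ with the usual ridge regression solution with penalty $\lambda = \sigma^2/\nu^2$. The data and the values of `sigma` and `nu` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 50, 4
sigma, nu = 0.5, 2.0            # noise std and prior std (hypothetical values)
theta_true = rng.normal(scale=nu, size=d)
X = rng.normal(size=(n, d))
Y = X @ theta_true + sigma * rng.normal(size=n)

# Posterior precision and mean of theta given the data.
Lam = np.eye(d) / nu**2 + X.T @ X / sigma**2
mu = np.linalg.solve(Lam, X.T @ Y / sigma**2)

# The same point written as ridge regression with penalty lam = sigma^2 / nu^2.
lam = sigma**2 / nu**2
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def neg_log_posterior(theta):
    """Negative log posterior of theta, up to an additive constant."""
    return np.sum((Y - X @ theta) ** 2) / (2 * sigma**2) + theta @ theta / (2 * nu**2)

print(mu, theta_ridge)
```

Because the posterior is Gaussian, its mean and its mode coincide, so `mu` is also the MAP estimate.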



## Classification

- We are given data $x_1, \ldots, x_n \in \mathcal{X}$ and their labels $y_1, \ldots, y_n \in \{1, \ldots, K\}$. The goal of classification is to find a function $f$ that maps each input $x \in \mathcal{X}$ to the class probabilities $P(Y=k|X=x)$, fits this data, and also performs well on unseen examples.

## Naive Bayes Classifier

### Naive Bayes:  Problem

- We will use **Naive Bayes** to solve the following classification problem:
    - **Categorical** feature vector $\vx = (x_1, x_2, \dots, x_D)$ with length $D$
        - Each feature $x_d \in \{0,1\}$, $\forall d = 1, \dots, D$
        - Note: you can also allow non-binary features that take one of $M$ values, e.g. $x_d \in \{1, \ldots, M\}$
    - Predict discrete class label $y \in \{1, 2, \dots, C \}$

- For example, in **Spam Mail Classification**,
    - Predict whether an email is `SPAM` ($y=1$) or `HAM` ($y=0$)
    - Use words / metadata in the email as features
    - For simplicity, we can use **bag-of-words** features,
        - Assume fixed vocabulary $V$ of size $|V| = D$
        - Feature $x_d$, for $d \in \{1, 2, \dots, D \}$, indicates whether the $d\text{th}$ word of the vocabulary appears in the email
        - E.g. $x_d = 1$ if the $d\text{th}$ word is in the email; $x_d = 0$ otherwise
        - In this case $M=2$

### Naive Bayes:  Independence Assumption and Full model

- The essence of Naive Bayes is the **conditional independence assumption**
    $$
    P(\vx | y = c) = \prod_{d=1}^D P(x_d | y=c)
    $$
    i.e., given the label, all features are independent.
    
- The **full generative** model of Naive Bayes is:
    $$
    \begin{align}
    P(y = c ) & = \pi_c \quad \forall\, c=0,1 \\
    P(x_d = 1 | y = c ) &= \theta_{cd} \quad \forall\, d = 1,\dots,D
    \end{align}
    $$
- The parameters $\pi$ and $\theta$ are learned from training data.

>**Remark**
> - **NOTE**: some definitions and derivations in this lecture use the more general case $x_d \in \{1, \dots ,M \}$ with $M>2$. The spam email classification example and the derivation in the textbook use binary features, i.e. $M=2$. So don't get confused!

> - When $M=2$, $x_d | y=c$ follows a Bernoulli distribution.
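
To make the generative story concrete, here is a small sampling sketch for the binary-feature model: first draw $y$ from $\pi$, then draw each $x_d$ independently from a Bernoulli with parameter $\theta_{yd}$. The parameter values and the helper name `sample_naive_bayes` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical parameters for C = 2 classes and D = 3 binary features.
pi = np.array([0.7, 0.3])                  # pi[c] = P(y = c)
theta = np.array([[0.1, 0.5, 0.9],         # theta[c, d] = P(x_d = 1 | y = c)
                  [0.8, 0.5, 0.2]])

def sample_naive_bayes(n, pi, theta, rng):
    """Draw n (x, y) pairs: first y ~ pi, then each x_d ~ Bernoulli(theta[y, d])."""
    y = rng.choice(len(pi), size=n, p=pi)
    x = (rng.random((n, theta.shape[1])) < theta[y]).astype(int)
    return x, y

X, y = sample_naive_bayes(20000, pi, theta, rng)
print(X[:5], y[:5])
```

With this many samples, the empirical class frequencies and the per-class feature frequencies should land close to `pi` and `theta`.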

### Naive Bayes: Prediction

- Given the independence assumption and full model, for some new data $\vx^{\text{new}} = (x_1^{\text{new}}, \dots, x_D^{\text{new}})$ we will classify based on
    $$
    \begin{align}
    y
    &=\underset{c \in \{0,1\}}{\arg \max} P(y=c|\vx = \vx^{\text{new}}) \\
    &=\underset{c \in \{0,1\}}{\arg \max} P(\vx = \vx^{\text{new}} | y=c) P(y=c) \\
    &=\underset{c \in \{0,1\}}{\arg \max} P(y=c) \prod \nolimits_{d=1}^{D} P(x_d = x_d^{\text{new}} | y=c) \\
    &=\boxed{\underset{c \in \{0,1\}}{\arg \max} \pi_c \prod \nolimits_{d=1}^{D} \theta_{cd}^{x_d^{\text{new}}} (1-\theta_{cd})^{1-x_d^{\text{new}}}} \\
    \end{align}
    $$
    
- So once we have learned the parameters $\pi$ and $\theta$, we can classify.

> **Remark**

> - Indicator function
    $$
    \mathbb{I}(m=x_d^{\text{new}}) = 
    \begin{cases}
    1 & \text{ if } m=x_d^{\text{new}}\\ 
    0 & \text{ otherwise}
    \end{cases}
    $$
    
> - In the general $M$-valued case, the factor for feature $d$ is the product $\prod \nolimits_{m=1}^{M} \theta_{cdm}^{\mathbb{I}(m=x_d^{\text{new}})}$, where $\theta_{cdm} = P(x_d = m | y = c)$. Only $\theta_{cdx_d^{\text{new}}}$ survives: every other factor is raised to the power 0 and therefore equals 1.
    
> - One thing to note is that the above classification criterion is a product of many numbers smaller than 1, so it can become extremely small and underflow in floating-point arithmetic. A better way is to take the **logarithm**, which turns the product into a sum, and then compare the sums.
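
The log-space version of the prediction rule can be sketched as follows for binary features. The function name `predict_log_space` and all parameter values are assumptions for illustration; `theta` is kept strictly between 0 and 1 so that every logarithm is finite.

```python
import numpy as np

def predict_log_space(x_new, pi, theta):
    """Return argmax_c [log pi_c + sum_d log P(x_d | y = c)] and the per-class scores."""
    x_new = np.asarray(x_new)
    log_lik = x_new * np.log(theta) + (1 - x_new) * np.log(1 - theta)   # shape (C, D)
    scores = np.log(pi) + log_lik.sum(axis=1)
    return int(np.argmax(scores)), scores

# Hypothetical parameters: 2 classes, D = 2000 binary features.
rng = np.random.default_rng(3)
D = 2000
pi = np.array([0.5, 0.5])
theta = rng.uniform(0.05, 0.95, size=(2, D))
x_new = (rng.random(D) < theta[1]).astype(int)    # an example drawn from class 1

label, scores = predict_log_space(x_new, pi, theta)

# The direct product underflows to 0.0 for both classes, so it cannot rank them.
direct = pi * np.prod(theta**x_new * (1 - theta)**(1 - x_new), axis=1)
print(label, scores, direct)
```

Since the logarithm is monotone, the arg max over the log scores is the same class that the product rule would pick in exact arithmetic.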

### Naive Bayes:  Parameter Estimation

- **Goal:** Given training data $\mathcal{D} = \{ (\vec{x}_1, y_1), \dots, (\vec{x}_N, y_N) \}$, estimate **class-conditional probabilities** $\theta$ and **class priors** $\pi$.


- We will discuss the **MLE** and **MAP** parameter estimates.    

### Naive Bayes:  Maximum Likelihood

- The **likelihood** for a single data case $(\vec{x}_n, y_n=c)$ is
    $$
    \begin{align}
    & P((\vec{x}_n, y_n) | \pi, \theta) \\
    &= P(y_n) \prod \nolimits_{d=1}^D P(x_{nd}|y_n) \\
    &= \prod \nolimits_{c=1}^C P(y_n=c)^{\I(y_n=c)} \cdot \prod \nolimits_{c=1}^C \prod \nolimits_{d=1}^D \prod \nolimits_{m=1}^M P(x_{nd}=m|y_n=c)^{\I(x_{nd}=m) \I(y_n=c)}\\
    &= \prod \nolimits_{c=1}^C \pi_c^{\I(y_n=c)} \cdot \prod \nolimits_{c=1}^C \prod \nolimits_{d=1}^D \prod \nolimits_{m=1}^M \theta_{cdm}^{\I(x_{nd}=m) \I(y_n=c)}\\
    \end{align}
    $$

- Therefore, the **log-likelihood** is
    $$
    \begin{split}
    & \log P((\vec{x}_n, y_n) | \pi, \theta) \\
    & = \sum \nolimits_{c=1}^C \I(y_n=c) \log \pi_c + \sum \nolimits_{c=1}^C \sum \nolimits_{d=1}^D \sum \nolimits_{m=1}^M \I(x_{nd}=m) \I(y_n=c) \log \theta_{cdm}
    \end{split}
    $$
    
- The **log-likelihood** for all training data $\mathcal{D} = \{ (\vec{x}_n, y_n) \}_{n=1}^N $ is
    $$
    \begin{align}
    & \log P(\mathcal{D}| \pi, \theta)\\
    &= \log \prod \nolimits_{n=1}^N P((\vec{x}_n, y_n) | \pi, \theta) = \sum \nolimits_{n=1}^N \log P((\vec{x}_n, y_n) | \pi, \theta) \\
    &= \boxed{\sum \nolimits_{n=1}^N \sum \nolimits_{c=1}^C \I(y_n=c) \log \pi_c + \sum \nolimits_{n=1}^N \sum \nolimits_{c=1}^C \sum \nolimits_{d=1}^D \sum \nolimits_{m=1}^M \I(x_{nd}=m) \I(y_n=c) \log \theta_{cdm}}
    \end{align}
    $$
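
For binary features coded as $\{0,1\}$, the inner sum over $m$ collapses to $x_{nd} \log \theta_{cd} + (1 - x_{nd}) \log (1-\theta_{cd})$. A minimal sketch of the resulting log-likelihood, on a tiny made-up dataset with classes indexed from 0, might look like this (the function name `log_likelihood` is hypothetical):

```python
import numpy as np

def log_likelihood(X, y, pi, theta):
    """Log-likelihood of binary data X (N x D) with labels y (N,) under naive Bayes.

    pi[c] = P(y = c); theta[c, d] = P(x_d = 1 | y = c).
    """
    th = theta[y]                                          # (N, D): row n is theta[y_n]
    per_feature = X * np.log(th) + (1 - X) * np.log(1 - th)
    return np.sum(np.log(pi[y])) + np.sum(per_feature)

# Tiny hypothetical dataset: N = 4 examples, D = 2 features, C = 2 classes.
X = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])
y = np.array([0, 0, 1, 1])
pi = np.array([0.5, 0.5])
theta = np.array([[0.9, 0.5], [0.1, 0.5]])

ll = log_likelihood(X, y, pi, theta)
print(ll)
```

Indexing `theta[y]` plays the role of the indicator $\I(y_n=c)$: it selects, for each example, the row of parameters belonging to its class.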

### Naive Bayes:  Maximum Likelihood


- We have already solved the MLE for the multinomial distribution (categorical variable)! We observed that:
    $$
    \hat{\pi}_c = \frac{N_c}{N} \quad \hat{\theta}_{cd} = \frac{N_{cd}}{N_c}
    $$
    where
    - $N = $ Number of examples in $\mathcal{D}$
    - $N_c = $ Number of examples in class $c$ in $\mathcal{D}$
    - $N_{cd} = $ Number of examples in class $c$ with $x_d = 1$
    
- Intuitive Interpretation
    - The class prior $\pi_c$ is the fraction of examples in $\mathcal{D}$ that belong to class $c \in \{1, \dots, C\}$
    - The class-conditional probability $\theta_{cd}$ is the fraction of examples in class $c$ that have $x_d = 1$
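
The counting estimates are short to write down in code. This sketch assumes binary features, classes indexed from 0, and at least one training example per class (otherwise $N_c = 0$ and the ratio is undefined); the function name `fit_mle` and the toy data are hypothetical.

```python
import numpy as np

def fit_mle(X, y, num_classes):
    """MLE for naive Bayes with binary features: pi_c = N_c / N, theta_cd = N_cd / N_c."""
    N = len(y)
    N_c = np.array([np.sum(y == c) for c in range(num_classes)])           # class counts
    N_cd = np.array([X[y == c].sum(axis=0) for c in range(num_classes)])   # shape (C, D)
    return N_c / N, N_cd / N_c[:, None]

# Toy dataset: 5 examples, 2 binary features, 2 classes.
X = np.array([[1, 0], [1, 1], [0, 1], [0, 1], [1, 1]])
y = np.array([0, 0, 1, 1, 1])

pi_hat, theta_hat = fit_mle(X, y, num_classes=2)
print(pi_hat)
print(theta_hat)
```

Note that some entries of `theta_hat` come out as exactly 0 or 1 on such a small dataset, which is the weakness of the MLE when counts are small.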

| Example No. | Color | Type | Origin | Stolen? |
|---|---|---|---|---|
| 1 | Red | Sports | Domestic | Yes |
| 2 | Red | Sports | Domestic | No |
| 3 | Red | Sports | Domestic | Yes |
| 4 | Yellow | Sports | Domestic | No |
| 5 | Yellow | Sports | Imported | Yes |
| 6 | Yellow | SUV | Imported | No |
| 7 | Yellow | SUV | Imported | Yes |
| 8 | Yellow | SUV | Domestic | No |
| 9 | Red | SUV | Imported | No |
| 10 | Red | Sports | Imported | Yes |

### Question: 
We want to classify a Red Domestic SUV. Note that there is no example of a Red Domestic SUV in our data set.

### Exercise 4.a (Estimation of parameters)

We need <br>
P(Red|Yes), P(SUV|Yes), P(Domestic|Yes) <br>
P(Red|No) , P(SUV|No), and P(Domestic|No) <br>
P(Yes) and P(No)

### Exercise 4.b (Classification)

Compute P(Yes | Red Domestic SUV) and P(No | Red Domestic SUV)
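
One way to check your answers to both exercises is to let a short script do the counting. The sketch below uses exact fractions and the MLE estimates (no smoothing); the variable names are arbitrary.

```python
from fractions import Fraction

# Rows of the table above: (color, type, origin, stolen).
data = [
    ("Red", "Sports", "Domestic", "Yes"),
    ("Red", "Sports", "Domestic", "No"),
    ("Red", "Sports", "Domestic", "Yes"),
    ("Yellow", "Sports", "Domestic", "No"),
    ("Yellow", "Sports", "Imported", "Yes"),
    ("Yellow", "SUV", "Imported", "No"),
    ("Yellow", "SUV", "Imported", "Yes"),
    ("Yellow", "SUV", "Domestic", "No"),
    ("Red", "SUV", "Imported", "No"),
    ("Red", "Sports", "Imported", "Yes"),
]
query = ("Red", "SUV", "Domestic")

scores = {}
for label in ("Yes", "No"):
    rows = [r for r in data if r[3] == label]
    score = Fraction(len(rows), len(data))                # P(label)
    for j, value in enumerate(query):
        n_match = sum(1 for r in rows if r[j] == value)
        score *= Fraction(n_match, len(rows))             # MLE of P(value | label)
    scores[label] = score

# Normalize the two unnormalized scores to get posterior probabilities.
total = sum(scores.values())
posterior = {label: s / total for label, s in scores.items()}
print(scores, posterior)
```

With these counts the unnormalized scores are $3/125$ for Yes and $9/125$ for No, so the posterior is $1/4$ versus $3/4$ and the car is classified as not stolen.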

## MAP Estimation for Naive Bayes with Beta Prior

In the above example, what if we never see a red car that is stolen (perhaps because we didn't have much data)? What will be $P(\text{Stolen} | \text{Red Imported Sports})$? The predicted probability will be 0, because the MLE of $P(\text{Red} | \text{Stolen})$ is 0 and it zeroes out the whole product! This is not desirable, since it would essentially be "overfitting" to the data.

This is where we want a prior distribution. As we discussed previously, it's best to use a conjugate prior if you can, because the calculations are very convenient. The conjugate distribution to the binomial model is the *beta distribution*, parameterized by $\alpha, \beta > 0$:
$$P(\theta | \alpha, \beta) := \frac{\theta^{\alpha - 1}(1-\theta)^{\beta - 1}}{B(\alpha, \beta)}$$
where the normalization term $B$ is defined in terms of the [gamma function](https://en.wikipedia.org/wiki/Gamma_function), $B(\alpha, \beta) := \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha + \beta)}$. 
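
As a quick sanity check of the definition, the sketch below evaluates the density using `math.gamma` for the normalization term and confirms numerically that it integrates to 1. The values of `alpha` and `beta` and the grid size `K` are arbitrary choices, and the function name `beta_pdf` is hypothetical.

```python
import math
import numpy as np

def beta_pdf(theta, alpha, beta):
    """Density of Beta(alpha, beta) at theta, with B computed from gamma functions."""
    B = math.gamma(alpha) * math.gamma(beta) / math.gamma(alpha + beta)
    return theta ** (alpha - 1) * (1 - theta) ** (beta - 1) / B

# Midpoint-rule check that the density integrates to 1 (hypothetical alpha, beta).
alpha, beta = 2.0, 5.0
K = 100_000
grid = (np.arange(K) + 0.5) / K              # midpoints of K equal cells in [0, 1]
integral = beta_pdf(grid, alpha, beta).mean()

# For alpha, beta > 1 the density peaks at (alpha - 1) / (alpha + beta - 2).
mode = grid[np.argmax(beta_pdf(grid, alpha, beta))]
print(integral, mode)
```

Larger $\alpha$ pulls the mass toward 1 and larger $\beta$ pulls it toward 0, which is how the prior encodes a preference over $\theta$.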

### Naive Bayes:  Maximum a Posteriori


- We have already solved the MLE for Naive Bayes:
    $$
    \hat{\pi}_c = \frac{N_c}{N} \quad \hat{\theta}_{cd} = \frac{N_{cd}}{N_c}
    $$
    where $N = $ #examples in the dataset, $N_c = $ #examples in class $c$ in dataset, $N_{cd} = $ #examples in class $c$ with $x_d = 1$
    
*Problem*: What is the MAP estimate of the parameters $\theta_{cd}$ for this model, when we assume the prior on every $\theta_{cd}$ is (independently) distributed according to $\text{Beta}(\alpha,\beta)$?
    


#### Answer

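You get a "smoothed" version of the counts. The posterior on $\theta_{cd}$ is $\text{Beta}(N_{cd} + \alpha,\; N_c - N_{cd} + \beta)$, and for $\alpha, \beta > 1$ its mode is
    $$
     \hat{\theta}_{cd}^{\text{MAP}(\alpha,\beta)} = \frac{N_{cd} + \alpha - 1}{N_c + \alpha + \beta - 2}
    $$
The closely related posterior *mean* is $\frac{N_{cd} + \alpha}{N_c + \alpha + \beta}$. Either way, the estimate is no longer 0 when $N_{cd} = 0$.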
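
To see the smoothing effect numerically, note that with a $\text{Beta}(\alpha, \beta)$ prior and $N_{cd}$ successes out of $N_c$ examples, the posterior on $\theta_{cd}$ is $\text{Beta}(N_{cd} + \alpha,\; N_c - N_{cd} + \beta)$. The sketch below computes both the mode of that posterior, $\frac{N_{cd} + \alpha - 1}{N_c + \alpha + \beta - 2}$, and its mean, $\frac{N_{cd} + \alpha}{N_c + \alpha + \beta}$, for made-up counts; the function name is hypothetical.

```python
import numpy as np

def beta_posterior_estimates(N_cd, N_c, alpha, beta):
    """Posterior Beta(N_cd + alpha, N_c - N_cd + beta): return its mode and mean.

    The mode formula assumes alpha > 1 and beta > 1.
    """
    a = N_cd + alpha
    b = N_c - N_cd + beta
    mode = (a - 1) / (a + b - 2)
    mean = a / (a + b)
    return mode, mean

# Hypothetical counts: the feature never appears in a class with 5 examples.
N_cd, N_c = 0, 5
mle = N_cd / N_c
mode, mean = beta_posterior_estimates(N_cd, N_c, alpha=2.0, beta=2.0)
print(mle, mode, mean)
```

With $\alpha = \beta = 2$ the mode equals $(N_{cd} + 1)/(N_c + 2)$, i.e. add-one smoothing, so an unseen feature value gets a small positive probability instead of 0.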