# Generative Classification

### Probabilistic Classification

\only<presentation>{
  \includegraphics[height=8cm]<1>{./figures/fig-classification-generative-1}
  \includegraphics[height=8cm]<2>{./figures/fig-classification-generative-2}
  \includegraphics[height=8cm]<3>{./figures/fig-classification-generative-3}
}
\includegraphics[height=8cm]<4->{./figures/fig-classification-generative-4}


### Generative Classification Problem Statement
Given is a data set $D = \{(x_1,y_1),\dotsc,(x_N,y_N)\}$
  - inputs $x_n \in \Re^D$ are called **features**.
  - outputs $y_n \in \{\mathcal{C}_1,\ldots,\mathcal{C}_K\}$; the **discrete** targets $\mathcal{C}_k$, with $k=1,\ldots,K$, are called **classes**.

We will again use the 1-of-$K$ notation for the discrete classes. Define the binary **class selection variable**
$$
y_{nk} = \begin{cases} 1 & \text{if $y_n$ in class $\mathcal{C}_k$}\\
        0 & \text{otherwise} \end{cases}
$$
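
As a minimal sketch of the 1-of-$K$ notation, the snippet below builds the selection variables $y_{nk}$ from class indices. The helper name `one_hot`, the 0-based class indices, and the example labels are assumptions made for illustration only.

```python
def one_hot(labels, num_classes):
    """Return the N x K list of selection variables y_nk.

    Assumes labels are 0-based class indices (0 .. K-1).
    """
    return [[1 if y == k else 0 for k in range(num_classes)] for y in labels]


labels = [0, 2, 1, 2]   # hypothetical class indices y_n, with K = 3
Y = one_hot(labels, 3)

# Each row contains exactly one 1; the column sums are the class counts N_k.
N_k = [sum(row[k] for row in Y) for k in range(3)]
```

Summing a column of `Y` counts how many samples fall in that class.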

- Supervised learning goal: build a model for the joint distribution $$p(x,y)= p(x|y)p(y)$$
-  The plan for generative classification: specify a model for the joint pdf $p(x|y)p(y)$ and use Bayes rule to infer the posterior class probabilities
$$
p(y|x) = \frac{p(x|y) p(y)}{\sum_{y'} p(x|y') p(y')}
$$
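
The Bayes rule step can be sketched numerically as follows. The function name `posterior` and the numbers are hypothetical; the likelihood values stand in for $p(x|y)$ evaluated at a single observed $x$.

```python
def posterior(likelihoods, priors):
    """Bayes rule: p(y|x) = p(x|y) p(y) / sum_y' p(x|y') p(y')."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # the denominator, p(x)
    return [j / evidence for j in joint]


# Hypothetical values: class 1 explains x three times better than class 0,
# but class 0 is three times more probable a priori.
post = posterior(likelihoods=[0.2, 0.6], priors=[0.75, 0.25])
```

In this example the prior exactly offsets the likelihood ratio, so both classes end up with (approximately) equal posterior probability.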


###  Model specification

-  **Likelihood**. Assume Gaussian **class-conditional distributions** with **constant covariance matrix** across the classes,
 $$
 p(x|\mathcal{C}_k,\theta) = \mathcal{N}(x|\mu_k,\Sigma)
 $$

with notational shorthand: $\mathcal{C}_k \equiv (y_k=1)$

- **Prior** on class labels $y_k$ is multinomial $$p(\mathcal{C}_k|\pi) = \pi_k$$

- This leads to
$$
 p(x,\mathcal{C}_k) =  \pi_k \cdot \mathcal{N}(x|\mu_k,\Sigma)
$$ 
- As usual, the rest (inference for parameters and model prediction) follows from straight probability theory.
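
To illustrate why this model is called generative, here is a minimal sketch that draws samples from the joint $p(x,\mathcal{C}_k)$ by first drawing a class from $\pi$ and then drawing $x$ from the class-conditional Gaussian. For brevity it assumes scalar features, so the shared covariance reduces to a single standard deviation `sigma`; the function name and parameter values are hypothetical.

```python
import random


def sample(pi, mu, sigma, n, seed=0):
    """Draw n pairs (x, k) from p(x, C_k) = pi_k * N(x | mu_k, sigma^2).

    Assumes 1-D features and a standard deviation sigma shared by all classes.
    """
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        k = rng.choices(range(len(pi)), weights=pi)[0]  # class from the prior
        x = rng.gauss(mu[k], sigma)                     # feature given the class
        data.append((x, k))
    return data


data = sample(pi=[0.3, 0.7], mu=[-2.0, 2.0], sigma=1.0, n=5)
```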




###  Parameter Inference for Classification

-  Goal: ML estimation of $\theta = \{ \pi_k, \mu_k, \Sigma \}$ from data $D$

-  We will maximize the _joint_ log-likelihood $\sum_n \log p(x_n,y_n|\theta)$,

\begin{align}
\log p(D|\theta) &= \sum_n \log \prod_k p(x_n,y_{nk}|\theta)^{y_{nk}} \\
   &=  \sum_{n,k} y_{nk} \log  p(x_n,y_{nk}|\theta)\\
   &=  \underbrace{ \sum_{n,k} y_{nk} \log\mathcal{N}(x_n|\mu_k,\Sigma) }_{ \text{Gaussian dens. est.} } + \underbrace{ \sum_{n,k} y_{nk} \log \pi_k }_{ \text{multinom. est.} }
\end{align}

-  Problem breaks down into
  -  **Multinomial density estimation** for class priors $\pi_k$
  -  **Gaussian density estimation** for parameters $\mu_k, \Sigma$




### ML Estimation for Generative Classification

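The class prior estimate is the ML estimate for a multinomial (done this before!), where $N_k = \sum_n y_{nk}$ is the number of samples in class $\mathcal{C}_k$: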
\begin{align}   
\hat \pi_k = N_k/N \tag{class prior}
\end{align}

Now group the data into separate classes and do multivariate Gaussian (MVG) ML estimation for the class-conditional parameters (done this as well).
\begin{align}
 \hat \mu_k &= \frac{ \sum_n y_{nk} x_n} { \sum_n y_{nk} } = \frac{1}{N_k} \sum_n y_{nk} x_n \tag{class-cond. mean}\\
 \hat \Sigma  &= \frac{1}{N} \sum_{n,k} y_{nk} (x_n-\hat \mu_k)(x_n-\hat \mu_k)^T \tag{covariance}\\
  &= \sum_k \hat \pi_k \cdot \underbrace{ \left( \frac{1}{N_k} \sum_{n} y_{nk} (x_n-\hat \mu_k)(x_n-\hat \mu_k)^T  \right) }_{ \text{class-cond. covariance} }
\end{align}

Note that the selection variable $y_{nk}$ picks out the data points that belong to class $\mathcal{C}_k$.
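
The ML estimates above can be sketched in plain Python as follows. The function name `fit`, the use of 0-based class indices instead of the one-hot $y_{nk}$, and the toy data are assumptions for illustration; the sketch also assumes that every class occurs at least once in the data.

```python
def fit(X, labels, K):
    """ML estimates (pi, mu, Sigma) for Gaussian classes with shared covariance.

    X is a list of N feature vectors, labels a list of N class indices (0 .. K-1).
    """
    N, D = len(X), len(X[0])
    N_k = [labels.count(k) for k in range(K)]
    pi = [n / N for n in N_k]                       # class priors N_k / N
    mu = [[sum(x[d] for x, y in zip(X, labels) if y == k) / N_k[k]
           for d in range(D)] for k in range(K)]    # class-conditional means
    Sigma = [[0.0] * D for _ in range(D)]
    for x, y in zip(X, labels):
        r = [x[d] - mu[y][d] for d in range(D)]     # residual w.r.t. own class mean
        for i in range(D):
            for j in range(D):
                Sigma[i][j] += r[i] * r[j] / N      # pooled over all classes
    return pi, mu, Sigma


# Hypothetical toy data: two classes in a 2-D feature space.
X = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 2.0]]
labels = [0, 0, 1, 1]
pi_hat, mu_hat, Sigma_hat = fit(X, labels, K=2)
```

Each residual is taken with respect to the mean of the sample's own class, which is what the factor $y_{nk}$ does in the covariance formula.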





###  Model Prediction for Generative Classification

-  Given a new input $x$, use Bayes rule to get the posterior class probability

\begin{align}
 p(\mathcal{C}_k|x,\theta ) &= \frac{{p(x|\mathcal{C}_k ,\theta )p(\mathcal{C}_k |\pi)}}{{\sum_j {p(x|\mathcal{C}_j ,\theta )p(\mathcal{C}_j |\pi)} }}  \\
  &= \frac{{\pi_k \exp \left\{ { - {\frac{1}{2}}(x - \mu_k )^T \Sigma^{ - 1} (x - \mu_k )} \right\}}}{{\sum_j {\pi_j \exp \left\{ { - {\frac{1}{2}}(x - \mu_j )^T \Sigma^{ - 1} (x - \mu_j )} \right\}} }} \\
  &= \frac{{\exp \left\{ {\mu_k^T \Sigma^{ - 1} x - {\frac{1}{2}}\mu_k^T \Sigma^{ - 1} \mu_k  + \log \pi_k } \right\}}}{{\sum_j {\exp \left\{ {\mu_j^T \Sigma^{ - 1} x - {\frac{1}{2}}\mu_j^T \Sigma^{ - 1} \mu_j  + \log \pi_j } \right\}} }}  \\
  &=  \exp\{\beta_k^T x + \gamma_k\}/Z
\end{align}

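where $\beta_k = \Sigma^{-1} \mu_k$, $\gamma_k = - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k  + \log \pi_k$, and $Z=\sum_j \mathrm{exp}\{\beta_j^T x + \gamma_j\}$ is a normalizing factor. In the second line, the Gaussian normalization constants cancel because all classes share the same $\Sigma$; in the third line, the quadratic term $-\frac{1}{2} x^T \Sigma^{-1} x$ is the same for every class and cancels between numerator and denominator.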
%-  Note how the class priors are represented in the posterior class probabilities.

Note that the **softmax** function $\phi(a_k) = \frac{\mathrm{exp}\{a_k\}}{\sum_j \mathrm{exp}\{a_j\}}$ is by construction normalized, $\sum_k \phi(a_k)=1$.
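
A minimal sketch of the softmax function is given below. Subtracting the largest activation before exponentiating is a common numerical safeguard that is not part of the derivation above; it leaves the result unchanged because the common factor cancels in the ratio.

```python
import math


def softmax(a):
    """Map activations a_k to probabilities exp(a_k) / sum_j exp(a_j)."""
    m = max(a)                            # shift to avoid overflow in exp
    e = [math.exp(v - m) for v in a]
    s = sum(e)
    return [v / s for v in e]


probs = softmax([2.0, 1.0, -1.0])         # hypothetical activations
```

The entries of `probs` sum to one up to floating-point rounding, and adding the same constant to all activations does not change them.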



###  Discrimination Boundaries

-  The log class posterior $\log p(\mathcal{C}_k|x,\theta)= \beta_k^T x + \gamma_k - \log Z$ is a linear function of the input features, up to the term $\log Z$, which depends on $x$ but is the same for all classes.

-  Thus, the contours where two classes $\mathcal{C}_k$ and $\mathcal{C}_j$ have equal posterior probability (**discriminant functions**) are lines (hyperplanes) in feature space
$$
\log \frac{{p(\mathcal{C}_k|x,\theta )}}{{p(\mathcal{C}_j|x,\theta )}} = \beta_{kj}^T x + \gamma_{kj} = 0
$$
where we defined $\beta_{kj}=\beta_k - \beta_j$ and similarly for $\gamma_{kj}$.

-  (homework). What happens if we had not assumed class-independent covariance matrices $\Sigma_k=\Sigma$? Are the discriminant functions still linear? Quadratic?

-  How to apply a trained classifier to a classification problem? E.g., choose class with
maximum posterior class probability
\begin{align}
k^* &= \arg\max_k p(\mathcal{C}_k|x_{new},\theta) \\
  &= \arg\max_k \left( \beta _k^T x_{new} + \gamma_k \right)
\end{align}
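
The decision rule can be sketched as follows. To stay within the standard library, the sketch assumes 2-D features so that $\Sigma^{-1}$ can be written out explicitly; the helper names `inv2x2`, `linear_params`, and `classify` and the parameter values are hypothetical.

```python
import math


def inv2x2(S):
    """Inverse of a 2x2 matrix (assumes a nonzero determinant)."""
    (a, b), (c, d) = S
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def linear_params(pi, mu, Sigma):
    """beta_k = Sigma^-1 mu_k and gamma_k = -1/2 mu_k^T Sigma^-1 mu_k + log pi_k."""
    P = inv2x2(Sigma)
    betas, gammas = [], []
    for p, m in zip(pi, mu):
        beta = [P[0][0] * m[0] + P[0][1] * m[1],
                P[1][0] * m[0] + P[1][1] * m[1]]
        gamma = -0.5 * (m[0] * beta[0] + m[1] * beta[1]) + math.log(p)
        betas.append(beta)
        gammas.append(gamma)
    return betas, gammas


def classify(x, betas, gammas):
    """Return argmax_k (beta_k^T x + gamma_k); log Z is not needed."""
    scores = [b[0] * x[0] + b[1] * x[1] + g for b, g in zip(betas, gammas)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical trained parameters for K = 2 classes.
betas, gammas = linear_params(pi=[0.5, 0.5],
                              mu=[[1.0, 1.0], [5.0, 3.0]],
                              Sigma=[[1.0, 0.0], [0.0, 1.0]])
k_star = classify([1.0, 1.0], betas, gammas)
```

Because $Z$ is common to all classes, the maximization only needs the linear scores, not the normalized posterior.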



### Example: Binary Classification

Special case example: binary classification ($y \in \{0,1\}$)
\begin{align}
p(y=1|x,\theta) &= \frac{\mathrm{exp}\{\beta_1^T x + \gamma_1\}} {\mathrm{exp}\{\beta_1^T x + \gamma_1\} + \mathrm{exp}\{\beta_0^T x + \gamma_0\}} \\
  & = \frac{1}{1+\mathrm{exp}\{-\left(\beta_{10}^T x + \gamma_{10}\right)\}} \\
 p(y=0|x,\theta) &= 1- p(y=1|x,\theta) = \frac{1}{1+\mathrm{exp}\{-\left(\beta_{01}^T x + \gamma_{01}\right)\}}
\end{align}

-   The function $\phi(a) \equiv 1/(1+e^{-a})$ is called the **logistic** function, which is the binary case of the **softmax** function.
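
As a quick numerical check of this relation, the sketch below compares the logistic function of the activation difference with a two-class softmax. The function names and the activation values are hypothetical.

```python
import math


def logistic(a):
    """Logistic function 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def softmax2(a1, a0):
    """Probability of class 1 under a two-class softmax with activations a1, a0."""
    m = max(a1, a0)
    e1, e0 = math.exp(a1 - m), math.exp(a0 - m)
    return e1 / (e1 + e0)


a1, a0 = 2.0, -1.0                            # hypothetical activations
gap = logistic(a1 - a0) - softmax2(a1, a0)    # zero up to rounding
```

Only the difference $a_1 - a_0$ matters, which mirrors the role of $\beta_{10}^T x + \gamma_{10}$ in the binary posterior.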

### The Logistic Function as Posterior Probability
\begin{center}
\includegraphics[height=8cm]{./figures/fig-logistic}
\end{center}



###  Recap Generative Classification

- Model spec. $p(x,\mathcal{C}_k|\,\theta) = \pi_k \cdot \mathcal{N}(x|\mu_k,\Sigma)$
- If the class-conditional distributions are Gaussian with equal covariance matrices across classes ($\Sigma_k = \Sigma$), then
    the discriminant functions are hyperplanes in feature space.
- ML estimation for $\{\pi_k,\mu_k,\Sigma\}$ breaks down into simple density estimation for a Gaussian and a multinomial.
- Posterior class probability is a softmax (or logistic) function
$$ p(\mathcal{C}_k|x,\theta ) = \mathrm{exp}\{\beta_k^T x + \gamma_k\}/Z$$
where $\beta _k= \Sigma^{-1} \mu_k$ and $\gamma_k=- \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k  + \log \pi_k$.
