# Discriminative Classification

###  Difficult Class-Conditional Densities

Sometimes the class-conditional densities are far from Gaussian, yet a linear discrimination boundary still looks easy to draw:
 
\begin{center}
\includegraphics[height=6cm]{./figures/fig-classification-not-gaussian}
\end{center}

###  Discriminative Linear Classification

-  Sometimes, the precise assumptions of the generative model $$p(x,\mathcal{C}_k|\theta) =  \pi_k \cdot \mathcal{N}(x|\mu_k,\Sigma)$$ are not met.
-  Alternative approach: model the posterior $p(\mathcal{C}_k|x)$ directly, without any assumptions on the class densities.
-  [Q.] What model should we use for $p(\mathcal{C}_k|x)$?

-  [A.] Get inspiration from the generative approach: choose the familiar softmax structure for the posterior class probability
$$
p(\mathcal{C}_k|x,\theta) = \frac{e^{\theta_k^T x}}{\sum_j e^{\theta_j^T x}}
$$
but **do not impose a Gaussian structure on the classes**.
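The softmax posterior above can be sketched in a few lines of plain Python. This is only an illustration: the function name `softmax_posterior`, the weight vectors and the input are made up for the example, and the shift by the largest activation is a standard numerical safeguard that is not part of the formula itself.

```python
import math

def softmax_posterior(thetas, x):
    """Return the list [p(C_k | x, theta)] for the weight vectors in `thetas`."""
    # Activations z_k = theta_k^T x
    z = [sum(t_i * x_i for t_i, x_i in zip(theta_k, x)) for theta_k in thetas]
    # Subtracting max(z) leaves the ratios unchanged but avoids overflow in exp
    z_max = max(z)
    exp_z = [math.exp(z_k - z_max) for z_k in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

# Hypothetical example: three classes, two-dimensional input
thetas = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
posterior = softmax_posterior(thetas, [2.0, 0.5])
```

The returned values are positive and sum to one, so they can be read as class probabilities; no Gaussian assumption enters anywhere.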




###  Discriminative vs. Generative Classification

**Two key differences** for the discriminative approach:

1. The parameter $\theta_k$ is **not** structured into $\{\mu_k,\Sigma,\pi_k \}$. This gives the discriminative approach more flexibility.
2. ML learning proceeds by optimizing the _conditional_ likelihood $\prod_n p(t_n|x_n,\theta)$ rather than the _joint_ likelihood $\prod_n p(t_n,x_n|\theta)$.

As we will see, ML estimation for discriminative classification is more complex than for the linear Gaussian generative approach.

-  Discriminative model-based prediction for a new input $x$
is easy: substitute the ML estimate $\hat\theta$ into the model to get
$$p(\mathcal{C}_k|\, x,\hat\theta) = \frac{ \mathrm{exp}\left( \hat \theta_k^T x\right) }{ \sum_j \mathrm{exp}\left(\hat \theta_j^T x\right)} \propto \mathrm{exp}\left(\hat \theta_k^T x\right) $$



###  ML Estimation for Discriminative Classification
 
-  Work out the conditional log-likelihood ($t_{nk}$ is the target, while $y_{nk}=p(\mathcal{C}_k|x_n,\theta)$ is the model prediction).
     $$
    \mathrm{L}(\theta) = \log \prod_n \prod_k {\underbrace{p(\mathcal{C}_k|x_n,\theta)}_{y_{nk}}}^{t_{nk}} = \sum_{n,k} t_{nk} \log y_{nk}
     $$
     -  Note that the softmax $\phi_k=e^{z_k}/{\sum_i e^{z_i}}$ has an analytical derivative,
 \begin{align}
 \frac{\partial \phi_k}{\partial z_j} &= \frac{(\sum_i e^{z_i})e^{z_k}\delta_{kj}-e^{z_j}e^{z_k}}{(\sum_i e^{z_i})^2} = \frac{e^{z_k}}{\sum_i e^{z_i}}\delta_{kj} - \frac{e^{z_j}}{\sum_i e^{z_i}} \frac{e^{z_k}}{\sum_i e^{z_i}}\\
     &= \phi_k \cdot(\delta_{kj}-\phi_j)
 \end{align}
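As a sanity check, the identity $\partial \phi_k / \partial z_j = \phi_k(\delta_{kj}-\phi_j)$ can be compared against central finite differences. The sketch below uses only the Python standard library; the helper names and the test vector `z` are arbitrary choices for illustration.

```python
import math

def softmax(z):
    z_max = max(z)  # shift for numerical stability
    exp_z = [math.exp(v - z_max) for v in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

def softmax_jacobian(z):
    """Analytical Jacobian J[k][j] = phi_k * (delta_kj - phi_j)."""
    phi = softmax(z)
    K = len(z)
    return [[phi[k] * ((1.0 if k == j else 0.0) - phi[j]) for j in range(K)]
            for k in range(K)]

def numerical_jacobian(z, eps=1e-6):
    """Central finite differences of softmax with respect to each z_j."""
    K = len(z)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        z_plus = list(z)
        z_minus = list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        phi_plus, phi_minus = softmax(z_plus), softmax(z_minus)
        for k in range(K):
            J[k][j] = (phi_plus[k] - phi_minus[k]) / (2 * eps)
    return J

z = [0.3, -1.2, 2.0]  # arbitrary test point
J_exact = softmax_jacobian(z)
J_approx = numerical_jacobian(z)
max_error = max(abs(J_exact[k][j] - J_approx[k][j])
                for k in range(3) for j in range(3))
```

Here `max_error` should be tiny compared with the Jacobian entries themselves. Each column of the Jacobian also sums to zero, which reflects the constraint $\sum_k \phi_k = 1$.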


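 -  Taking the derivative, with $z_{nj} = \theta_j^T x_n$ and $\mathrm{L}_{nk} = t_{nk} \log y_{nk}$ (or: how to spend an hour ...)
\begin{align} 
\nabla_{\theta_j} \mathrm{L}(\theta) &= \sum_{n,k} \frac{\partial \mathrm{L}_{nk}}{\partial y_{nk}}\,\frac{\partial y_{nk}}{\partial z_{nj}}\,\frac{\partial z_{nj}}{\partial \theta_j} = \sum_{n,k} \frac{t_{nk}}{y_{nk}}\,y_{nk} (\delta_{kj}-y_{nj})\, x_n \\
  &= \sum_n \left( t_{nj} (1-y_{nj}) -\sum_{k\neq j} t_{nk} y_{nj} \right) x_n \\
  &= \sum_n \left( t_{nj} - y_{nj} \sum_k t_{nk} \right) x_n \\
  &= \sum_n (t_{nj} - y_{nj})\, x_n
\end{align}
where the last step uses $\sum_k t_{nk} = 1$. The negative log-likelihood $-\mathrm{L}(\theta)$, which serves as the cost function in gradient descent, therefore has gradient $-\sum_n (t_{nj} - y_{nj})\, x_n$.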
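The result of this derivation can be checked numerically: the gradient of the log-likelihood $\mathrm{L}(\theta)=\sum_{n,k} t_{nk}\log y_{nk}$ with respect to $\theta_j$ equals $\sum_n (t_{nj}-y_{nj})\,x_n$, and its negative is the gradient of the cost $-\mathrm{L}(\theta)$. The following sketch compares that expression with finite differences. The data set, the weights and all function names are invented for the example.

```python
import math

def softmax(z):
    z_max = max(z)  # shift for numerical stability
    exp_z = [math.exp(v - z_max) for v in z]
    total = sum(exp_z)
    return [e / total for e in exp_z]

def predict(thetas, x):
    """Model outputs y_k = softmax_k(theta_k^T x) for one input x."""
    return softmax([sum(a * b for a, b in zip(theta_k, x)) for theta_k in thetas])

def log_likelihood(thetas, X, T):
    """L(theta) = sum_{n,k} t_nk * log y_nk."""
    return sum(t * math.log(y_k)
               for x, t_row in zip(X, T)
               for t, y_k in zip(t_row, predict(thetas, x)))

def log_likelihood_gradient(thetas, X, T):
    """Analytical gradient: grad[j] = sum_n (t_nj - y_nj) * x_n."""
    grad = [[0.0] * len(X[0]) for _ in thetas]
    for x, t_row in zip(X, T):
        y = predict(thetas, x)
        for j in range(len(thetas)):
            for d in range(len(x)):
                grad[j][d] += (t_row[j] - y[j]) * x[d]
    return grad

def numerical_gradient(thetas, X, T, eps=1e-6):
    """Central finite differences of L with respect to every weight."""
    grad = [[0.0] * len(row) for row in thetas]
    for j in range(len(thetas)):
        for d in range(len(thetas[j])):
            plus = [list(row) for row in thetas]
            minus = [list(row) for row in thetas]
            plus[j][d] += eps
            minus[j][d] -= eps
            grad[j][d] = (log_likelihood(plus, X, T)
                          - log_likelihood(minus, X, T)) / (2 * eps)
    return grad

# Made-up toy data: 4 samples, 2 features, 3 classes (one-hot targets)
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 0.1]]
T = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
thetas = [[0.2, -0.1], [0.0, 0.4], [-0.3, 0.1]]

g_exact = log_likelihood_gradient(thetas, X, T)
g_approx = numerical_gradient(thetas, X, T)
max_error = max(abs(a - b)
                for ra, rb in zip(g_exact, g_approx)
                for a, b in zip(ra, rb))
```

The two gradients should agree up to finite-difference error. Summing `g_exact` over the classes gives zero for each feature, because both the targets and the predictions sum to one for every sample.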




### Linear vs. Logistic Regression

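The parameter vector $\theta$ for logistic regression can be estimated through gradient descent on the cost function $\mathrm{L}(\theta)$, which here is the negative conditional log-likelihood. For two classes with targets $t_n \in \{0,1\}$, step size $\eta$ and iteration index $i$, e.g.,

\begin{align}
        \theta^{(i+1)} &= \theta^{(i)} - \eta \cdot \nabla_\theta \mathrm{L}(\theta^{(i)}) \\
        &= \theta^{(i)} + \eta \sum_n \left( t_n - \frac{1}{1+e^{-(\theta^{(i)})^T x_n}} \right)x_n
\end{align}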
    
Run \texttt{demo\_classification.m} in MATLAB.
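For readers without MATLAB, the update rule can also be sketched in plain Python. This is not a port of the demo script, only a minimal batch gradient descent on the negative log-likelihood for two classes. The data set (a bias feature plus one input, with deliberately overlapping classes so that the ML estimate stays finite), the step size `eta` and the iteration count are assumptions made for the example.

```python
import math

def sigmoid(a):
    """Logistic function 1 / (1 + exp(-a)), written to avoid overflow."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)

def nll_gradient(theta, X, t):
    """Gradient of the negative log-likelihood: -sum_n (t_n - y_n) * x_n."""
    grad = [0.0] * len(theta)
    for x_n, t_n in zip(X, t):
        y_n = sigmoid(sum(th * x for th, x in zip(theta, x_n)))
        for d in range(len(theta)):
            grad[d] -= (t_n - y_n) * x_n[d]
    return grad

def fit_logistic_regression(X, t, eta=0.1, n_iter=2000):
    """Batch gradient descent: theta <- theta - eta * gradient."""
    theta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = nll_gradient(theta, X, t)
        theta = [th - eta * g for th, g in zip(theta, grad)]
    return theta

# Made-up data: first feature is a constant bias term, classes overlap
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, -0.5], [1.0, 1.0], [1.0, 2.0]]
t = [0, 0, 0, 1, 1, 1]

theta_hat = fit_logistic_regression(X, t)
final_gradient = nll_gradient(theta_hat, X, t)
```

At convergence `final_gradient` is close to zero, which is the ML condition $\sum_n (t_n - y_n)\,x_n = 0$. A fixed step size is the simplest choice; it has to be small enough for the iteration to remain stable on the data at hand.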

    
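Compare the gradients of the cost functions for linear regression (sum-of-squares error $\tfrac{1}{2}\sum_n (t_n - \theta^T x_n)^2$) and logistic regression (negative log-likelihood),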

\begin{align}
\nabla_\theta \mathrm{L}(\theta) &= - \sum_n \left(t_n - \theta^T x_n \right) x_n  \tag{linear regression} \\
\nabla_\theta \mathrm{L}(\theta) &= - \sum_n \left(t_n - \frac{1}{1+e^{-\theta^Tx_n}} \right) x_n
 \tag{logistic regression}
\end{align}