### Generative Classification

- **[1]** You have a machine that measures property $x$, the "orangeness" of liquids. You wish to discriminate between $C_1 = \text{Fanta}$ and $C_2 = \text{Orangina}$. It is known that

$$\begin{align*}
p(x|C_1) &= \begin{cases} 10 & 1.0 \leq x \leq 1.1\\
    0 & \text{otherwise}
    \end{cases}\\
p(x|C_2) &= \begin{cases} 200(x - 1) & 1.0 \leq x \leq 1.1\\
0 & \text{otherwise}
\end{cases}
\end{align*}$$

The prior probabilities $p(C_1) = 0.6$ and $p(C_2) = 0.4$ are also known from experience.       
  (a) (##) A "Bayes Classifier" is given by
  
$$ \text{Decision} = \begin{cases} C_1 & \text{if } p(C_1|x)>p(C_2|x) \\
                               C_2 & \text{otherwise}
                 \end{cases}
$$

Derive the optimal Bayes classifier.      
  (b) (###) The probability of making the wrong decision, given $x$, is
  
$$
p(\text{error}|x)= \begin{cases} p(C_1|x) & \text{if we decide $C_2$}\\
    p(C_2|x) & \text{if we decide $C_1$}
\end{cases}
$$

Compute the **total** error probability  $p(\text{error})$ for the Bayes classifier in this example.

> (a) We choose $C_1$ if $p(C_1|x)/p(C_2|x) > 1$. This condition can be worked out as
$$
\frac{p(C_1|x)}{p(C_2|x)} = \frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)} = \frac{10 \times 0.6}{200(x-1)\times 0.4}>1
$$
which evaluates to choosing
$$\begin{align*}
C_1 &\quad \text{ if $1.0\leq x < 1.075$}\\ 
C_2 &\quad \text{ if $1.075 \leq x \leq 1.1$ }
\end{align*}$$
The probability that $x$ falls outside the interval $[1.0,1.1]$ is zero.       
> (b) The total probability of error is $p(\text{error})=\int_x p(\text{error}|x)p(x) \mathrm{d}{x}$. We can work this out as

$$\begin{align*}
p(\text{error}) &= \int_x p(\text{error}|x)p(x)\mathrm{d}{x}\\
&= \int_{1.0}^{1.075} p(C_2|x)p(x) \mathrm{d}{x} + \int_{1.075}^{1.1} p(C_1|x)p(x) \mathrm{d}{x}\\
&= \int_{1.0}^{1.075} p(x|C_2)p(C_2) \mathrm{d}{x} + \int_{1.075}^{1.1} p(x|C_1)p(C_1) \mathrm{d}{x}\\
&= \int_{1.0}^{1.075}0.4\cdot 200(x-1) \mathrm{d}{x} + \int_{1.075}^{1.1} 0.6\cdot 10 \mathrm{d}{x}\\
&=80\cdot[x^2/2-x]_{1.0}^{1.075} + 6\cdot[x]_{1.075}^{1.1}\\
&=0.225 + 0.15\\
&=0.375
\end{align*}$$
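
The threshold and the total error above can be checked numerically. The following is a minimal sketch, assuming a simple midpoint-rule integration; the function names (`lik_c1`, `lik_c2`, `decide`, `error_probability`) are illustrative and not part of the exercise. It uses the fact that the Bayes classifier always errs with the smaller of the two joint densities, so $p(\text{error}) = \int \min\{p(x|C_1)p(C_1),\, p(x|C_2)p(C_2)\}\,\mathrm{d}x$.

```python
# Numerical check of the Fanta/Orangina Bayes classifier (illustrative sketch).

P_C1, P_C2 = 0.6, 0.4   # priors from the exercise
LO, HI = 1.0, 1.1       # support of both class-conditional densities


def lik_c1(x):
    """p(x|C1): uniform density on [1.0, 1.1]."""
    return 10.0 if LO <= x <= HI else 0.0


def lik_c2(x):
    """p(x|C2): linearly increasing density on [1.0, 1.1]."""
    return 200.0 * (x - 1.0) if LO <= x <= HI else 0.0


def decide(x):
    """Bayes decision: pick the class with the larger joint p(x|C)p(C)."""
    return "C1" if lik_c1(x) * P_C1 > lik_c2(x) * P_C2 else "C2"


def error_probability(n=100_000):
    """Midpoint-rule estimate of the integral of the smaller joint density."""
    h = (HI - LO) / n
    total = 0.0
    for i in range(n):
        x = LO + (i + 0.5) * h
        total += min(lik_c1(x) * P_C1, lik_c2(x) * P_C2) * h
    return total


# Decision boundary: solve 10 * 0.6 = 200 * (x - 1) * 0.4 for x.
threshold = 1.0 + (10.0 * P_C1) / (200.0 * P_C2)
p_error = error_probability()
print(threshold, p_error)
```

Up to floating-point tolerance, the printed values should agree with the analytical threshold $1.075$ and error probability $0.375$.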



- **[2]** (#) (see Bishop exercise 4.8): Using (4.57) and (4.58) (from Bishop's book), derive the result (4.65) for the posterior class probability in the two-class generative model with Gaussian densities, and verify the results (4.66) and (4.67) for the parameters $w$ and $w_0$. 

> Substitute (4.64) into (4.58) to get  
$$\begin{align*}
a &= \log \left( \frac{ \frac{1}{(2\pi)^{D/2}} \cdot \frac{1}{|\Sigma|^{1/2}} \cdot \exp\left( -\frac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1)\right) \cdot p(C_1)}{\frac{1}{(2\pi)^{D/2}} \cdot \frac{1}{|\Sigma|^{1/2}}\cdot  \exp\left( -\frac{1}{2}(x-\mu_2)^T \Sigma^{-1} (x-\mu_2)\right) \cdot p(C_2)}\right) \\
&= \log \left(  \exp\left(-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1) + \frac{1}{2}(x-\mu_2)^T \Sigma^{-1} (x-\mu_2) \right) \right) + \log \frac{p(C_1)}{p(C_2)} \\
&= ... \\
&=( \mu_1-\mu_2)^T\Sigma^{-1}x - 0.5\left(\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1} \mu_2\right)+ \log \frac{p(C_1)}{p(C_2)} 
\end{align*}$$ 
Substituting this into the right-most form of (4.57) we obtain (4.65), with $w$ and $w_0$ given by (4.66) and (4.67), respectively.
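
The step elided in the derivation above can be filled in by expanding both quadratic forms. This is a sketch that assumes only that $\Sigma$ (and hence $\Sigma^{-1}$) is symmetric, so that $x^T\Sigma^{-1}\mu_k = \mu_k^T\Sigma^{-1}x$; the terms that are quadratic in $x$ cancel because both classes share the same covariance:

$$\begin{align*}
&-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1} (x-\mu_1) + \frac{1}{2}(x-\mu_2)^T \Sigma^{-1} (x-\mu_2) \\
&\quad= -\frac{1}{2}x^T\Sigma^{-1}x + \mu_1^T\Sigma^{-1}x - \frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1 + \frac{1}{2}x^T\Sigma^{-1}x - \mu_2^T\Sigma^{-1}x + \frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2 \\
&\quad= (\mu_1-\mu_2)^T\Sigma^{-1}x - \frac{1}{2}\left(\mu_1^T\Sigma^{-1}\mu_1 - \mu_2^T\Sigma^{-1}\mu_2\right)
\end{align*}$$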

- **[3]** (###) (see Bishop exercise 4.9). 
> The log-likelihood is given by
$$\log p(\{\phi_n,t_n\} | \{\pi_k\}) = \sum_n \sum_k t_{nk}\left(\log p(\phi_n|C_k) + \log \pi_k\right)\,.$$
Using the method of Lagrange multipliers (Bishop, Appendix E), we augment the log-likelihood with the constraint $\sum_k \pi_k = 1$ and obtain the Lagrangian
$$\log p(\{\phi_n,t_n\} | \{\pi_k\})+\lambda \left(\sum_k \pi_k -1  \right)\,.$$
To maximize, we set the derivative with respect to $\pi_k$ equal to zero and obtain 
$$\begin{align*}
\sum_n \frac{t_{nk}}{\pi_k}+\lambda &=0 \\
-\pi_k\lambda &=\sum_n t_{nk} = N_k \\
-\lambda \sum_k \pi_k &= \sum_k \sum_n t_{nk} \\
\lambda &= -N \\
\pi_k &= -\frac{N_k}{\lambda} = \frac{N_k}{N}
\end{align*}$$
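
The derivation above implies that the maximum-likelihood class priors are the class fractions $\pi_k = N_k/N$. The following is a small sketch of that estimate; the labels are made up for illustration, and `ml_class_priors` and `prior_log_lik` are hypothetical helper names. As a sanity check, it compares the prior-dependent part of the log-likelihood, $\sum_n\sum_k t_{nk}\log\pi_k$, at the estimate and at a uniform prior.

```python
import math
from collections import Counter


def ml_class_priors(labels):
    """Maximum-likelihood priors pi_k = N_k / N from a list of class labels."""
    counts = Counter(labels)
    n = len(labels)
    return {k: n_k / n for k, n_k in counts.items()}


def prior_log_lik(priors, labels):
    """Prior-dependent part of the log-likelihood: sum_n log pi_{label_n}."""
    return sum(math.log(priors[t]) for t in labels)


# Made-up labels for illustration only.
labels = ["C1", "C2", "C1", "C3", "C1", "C2"]
priors = ml_class_priors(labels)
uniform = {k: 1.0 / len(priors) for k in priors}

print(priors)
# The ML estimate should score at least as high as any other valid prior.
print(prior_log_lik(priors, labels) >= prior_log_lik(uniform, labels))
```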

- **[4]** (##) (see Bishop exercise 4.10).    
> We can write the log-likelihood as
$$\begin{align*}
\log p(\{\phi_n,t_n\}|\{\pi_k\}) = -0.5\sum_n\sum_k t_{nk}\left(\log |\Sigma|+(\phi_n-\mu_k)^T\Sigma^{-1}(\phi_n-\mu_k)\right) + \text{const}
\end{align*}$$
The derivatives of the log-likelihood with respect to the means $\mu_k$ and the shared covariance $\Sigma$ are, respectively,
$$\begin{align*}
\nabla_{\mu_k}\log p(\{\phi_n,t_n\}|\{\pi_k\}) &= \sum_n t_{nk}\Sigma^{-1}\left(\phi_n-\mu_k\right) = 0 \\
\sum_n t_{nk}\left(\phi_n-\mu_k\right)&=0 \\
\mu_k &= \frac{1}{N_k}\sum_n t_{nk}\phi_n \\
\nabla_{\Sigma}\log p(\{\phi_n,t_n\}|\{\pi_k\})&=-0.5\sum_n\sum_k t_{nk}\left(\Sigma^{-1} - \Sigma^{-1}(\phi_n-\mu_k)(\phi_n-\mu_k)^T\Sigma^{-1}\right) = 0 \\
\sum_n\sum_k t_{nk}\Sigma &= \sum_n\sum_k t_{nk}(\phi_n-\mu_k)(\phi_n-\mu_k)^T \\
\Sigma &=  \frac{1}{N}\sum_k\sum_n t_{nk}(\phi_n-\mu_k)(\phi_n-\mu_k)^T
\end{align*}$$
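
The two estimators above translate directly into code. The following is a minimal pure-Python sketch, assuming features are given as lists of floats and labels as integers; the function name `fit_shared_gaussian` and the toy data are illustrative assumptions, not part of the exercise.

```python
def fit_shared_gaussian(phis, labels):
    """ML class means and shared covariance for a Gaussian generative model."""
    n = len(phis)
    d = len(phis[0])
    classes = sorted(set(labels))

    # mu_k = (1 / N_k) * sum_n t_nk * phi_n
    means = {}
    for k in classes:
        members = [p for p, t in zip(phis, labels) if t == k]
        means[k] = [sum(p[i] for p in members) / len(members) for i in range(d)]

    # Sigma = (1 / N) * sum_k sum_n t_nk (phi_n - mu_k)(phi_n - mu_k)^T
    sigma = [[0.0] * d for _ in range(d)]
    for p, t in zip(phis, labels):
        diff = [p[i] - means[t][i] for i in range(d)]
        for i in range(d):
            for j in range(d):
                sigma[i][j] += diff[i] * diff[j] / n
    return means, sigma


# Toy data for illustration only: two points per class in two dimensions.
phis = [[0.0, 0.0], [2.0, 0.0], [4.0, 1.0], [4.0, 3.0]]
labels = [0, 0, 1, 1]
means, sigma = fit_shared_gaussian(phis, labels)
print(means)
print(sigma)
```

Note that the covariance is normalized by the total number of samples $N$, not by the per-class counts $N_k$, because all classes share a single $\Sigma$.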

<!----
- **[2]** Consider a given data set $D=\{(x_1,y_1),\ldots,(x_N,y_N)\}$ where $x_n \in \mathbb{R}^M$ are input features and $y_n \in \{0,1\}$ are given class labels.    
  (a) (#) Write down the generative model for this classification task using a Gaussian likelihood with Bernoulli priors for class labels.  
  (b) How do you compute the posterior distribution for class labels, given a new input $x_\bullet$, ie, how do you compute $p(y_\bullet|x_\bullet,D)$?      
  (c) (##) Work out the likelihood function for the parameters.     
  (d) (###) Derive the Maximum Likelihood estimates for the parameters by setting the gradient of the likelihood to zero.
  
- **[3]** Refer to [Bishop's PRML book](https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf) for the following exercises. Chapter 4: 4.8, 4.9, 4.10.
--->