# Errata for Lecture 13

In the lecture on Oct 10, I presented the notion of an optimal classifier in the sense of minimizing
risk when provided imperfect features.  Perfect features would have no region of ambiguity.
Ambiguity means that a given feature vector could occur with non-zero probability in
samples belonging to more than one class.

Let $\alpha_i$ denote action $i$.

Let $\lambda(\alpha_i | \omega_j) = \lambda_{ij}$ be the loss incurred by taking action $\alpha_i$
when the true class is $\omega_j$.

The risk $R$ for action $\alpha_i$ given vector $\bar{x}$ is the expected loss of that action,
taken over the $K$ possible true classes and weighted by their posterior probabilities:

$$
   R(\alpha_i | \bar{x}) = \sum_{j=1}^K \lambda_{ij} P(\omega_j| \bar{x}) \hspace{1in} (1)
$$
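As a minimal sketch of the risk in (1), the following computes the expected loss of each action by summing over the true classes $\omega_j$. The function name `conditional_risk` and the numbers are hypothetical, chosen only for illustration; the posteriors are assumed to be already known and to sum to 1.

```python
def conditional_risk(loss, posteriors):
    """Return the list [R(alpha_i | x)] with one entry per action i.

    loss[i][j] is lambda_ij, the loss for action i when the true class is j.
    posteriors[j] is P(omega_j | x).
    """
    return [
        sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors))
        for row in loss
    ]


# Hypothetical values, not taken from the lecture.
loss = [
    [0.0, 2.0],  # action 1: no loss if class 1 is true, loss 2 if class 2 is true
    [1.0, 0.0],  # action 2: loss 1 if class 1 is true, no loss if class 2 is true
]
posteriors = [0.7, 0.3]

risks = conditional_risk(loss, posteriors)
print(risks)  # approximately [0.6, 0.7]
```

Note that the sum runs over the class index $j$, so each action gets one risk value.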

$P(\omega_j | \bar{x})$ is the *posterior probability* of class $\omega_j$ given a feature
vector $\bar{x}$.  We can rewrite the *posterior probability* in terms of its likelihood
and prior using Bayes' Law:

$$
   P(\omega_j | \bar{x}) = \frac{P(\bar{x} | \omega_j) P(\omega_j)}{P(\bar{x})} \hspace{1in} (2)
$$

where $P(\bar{x} | \omega_j)$ is called the likelihood of $\bar{x}$ given class $\omega_j$,
and $P(\omega_j)$ is the prior probability of $\omega_j$, meaning the probability
of encountering an instance of $\omega_j$ without any conditions (i.e., if I were to randomly
gather samples from the environment in which the classifier would be employed).
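Bayes' Law in (2) can be sketched numerically as follows. This is an illustration under stated assumptions: the classes are taken to be mutually exclusive and exhaustive, so that $P(\bar{x})$ can be obtained as $\sum_j P(\bar{x} | \omega_j) P(\omega_j)$ (the law of total probability, which the text above does not spell out). The function name and the numbers are hypothetical.

```python
def posteriors_from_bayes(likelihoods, priors):
    """Return [P(omega_j | x)] from likelihoods P(x | omega_j) and priors P(omega_j)."""
    # Numerators of (2), one per class.
    joint = [lik * prior for lik, prior in zip(likelihoods, priors)]
    # Assumption: classes are exclusive and exhaustive, so P(x) is the sum of the numerators.
    evidence = sum(joint)
    if evidence <= 0:
        raise ValueError("P(x) must be positive for the posterior to be defined")
    return [value / evidence for value in joint]


# Hypothetical values, not taken from the lecture.
likelihoods = [0.2, 0.6]   # P(x | omega_1), P(x | omega_2)
priors = [0.75, 0.25]      # P(omega_1), P(omega_2)

post = posteriors_from_bayes(likelihoods, priors)
print(post)  # both entries are approximately 0.5
```

In this example the rarer class has the larger likelihood, and the two effects cancel so that the posteriors come out equal.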

Substituting (2) into (1), we can rewrite the risk as

$$
   R(\alpha_i | \bar{x}) = \frac{1}{P(\bar{x})} \sum_{j=1}^K \lambda_{ij} P(\bar{x} | \omega_j) P(\omega_j) \hspace{1in} (3)
$$

A minimum risk classifier uses the following decision rule: given an input with
an ambiguous feature vector $\bar{x}$, it takes the action $\alpha_i$ (that is, chooses the class)
with the smallest risk $R(\alpha_i | \bar{x})$.
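The decision rule can be sketched as an argmin over the per-action risks. The function name `min_risk_action`, the zero-based action indices, and the numbers are assumptions made for this illustration.

```python
def min_risk_action(loss, posteriors):
    """Return (index of the minimum-risk action, list of all risks).

    loss[i][j] is lambda_ij; posteriors[j] is P(omega_j | x).
    Indices are zero-based, so index 0 corresponds to alpha_1.
    """
    risks = [
        sum(l_ij * p_j for l_ij, p_j in zip(row, posteriors))
        for row in loss
    ]
    best = min(range(len(risks)), key=risks.__getitem__)
    return best, risks


# Hypothetical values: wrongly choosing class 1 when class 2 is true is costly.
loss = [
    [0.0, 10.0],  # action 1 (choose omega_1)
    [1.0, 0.0],   # action 2 (choose omega_2)
]
posteriors = [0.8, 0.2]

best, risks = min_risk_action(loss, posteriors)
print(best)  # 1, i.e. choose omega_2
```

With these numbers the classifier picks $\omega_2$ even though $\omega_1$ has the larger posterior, because the loss for the opposite mistake is ten times higher.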



I then restated this for the binary classification case.  The classifier $h$ decides


\begin{align}
 \omega_1 \hspace{0.5in} & 
     \text{if } \frac{\lambda_{11} P(\bar{x} | \omega_1) P(\omega_1) + \lambda_{12} P(\bar{x} | \omega_2) P(\omega_2)}
                    {P(\bar{x})} 
     <
       \frac{\lambda_{21} P(\bar{x} | \omega_1) P(\omega_1) + \lambda_{22} P(\bar{x} | \omega_2) P(\omega_2)}
            {P(\bar{x})} 
       \hspace{1in} (4)  \\
 \omega_2  \hspace{0.5in} & \text{ otherwise}
\end{align}

Because $P(\bar{x})$ is the same positive quantity on both sides, we can drop it and rewrite this as

\begin{align}
 \omega_1 \hspace{0.5in} & 
     \text{if } \lambda_{11} P(\bar{x} | \omega_1) P(\omega_1) + \lambda_{12} P(\bar{x} | \omega_2) P(\omega_2)  
     <
       \lambda_{21} P(\bar{x} | \omega_1) P(\omega_1) + \lambda_{22} P(\bar{x} | \omega_2) P(\omega_2)           
       \hspace{1in} (5)  \\
 \omega_2  \hspace{0.5in} & \text{ otherwise}
\end{align}

Grouping terms in (5) yields

\begin{align}
 \omega_1 \hspace{0.5in} & 
     \text{if } (\lambda_{11} - \lambda_{21}) P(\bar{x} | \omega_1) P(\omega_1)  
     <
       (\lambda_{22} - \lambda_{12}) P(\bar{x} | \omega_2) P(\omega_2)           
       \hspace{1in} (6)  \\
 \omega_2  \hspace{0.5in} & \text{ otherwise}
\end{align}
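As a sanity check on the algebra, the sketch below evaluates the comparisons in (5) and (6) and confirms that they give the same decision. The function names and test values are hypothetical; the inputs are fractions with power-of-two denominators so that the floating-point arithmetic is exact and ties are handled identically by both forms.

```python
def decide_eq5(lam, lik, prior):
    """Binary decision using the comparison in (5). Returns 1 or 2."""
    (l11, l12), (l21, l22) = lam
    a = lik[0] * prior[0]  # P(x | omega_1) P(omega_1)
    b = lik[1] * prior[1]  # P(x | omega_2) P(omega_2)
    return 1 if l11 * a + l12 * b < l21 * a + l22 * b else 2


def decide_eq6(lam, lik, prior):
    """Binary decision using the grouped comparison in (6). Returns 1 or 2."""
    (l11, l12), (l21, l22) = lam
    a = lik[0] * prior[0]
    b = lik[1] * prior[1]
    return 1 if (l11 - l21) * a < (l22 - l12) * b else 2


# Hypothetical loss matrix and priors, not taken from the lecture.
lam = [[0.0, 2.0], [1.0, 0.0]]
prior = [0.5, 0.5]

for lik in ([0.25, 0.5], [0.75, 0.25]):
    d5 = decide_eq5(lam, lik, prior)
    d6 = decide_eq6(lam, lik, prior)
    print(lik, d5, d6)  # the two decisions agree on each line
```

Only subtraction of equal terms from both sides is used to get from (5) to (6), so the direction of the inequality is unaffected.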


<!-- 
Reasons for using log likelihoods as a loss function:

1. **Numerical stability:** Likelihood values are often products of many probabilities, especially in
models involving multiple observations.  The product of many small probabilities
is an extremely small number. These small values can cause underflow
in floating-point arithmetic, making the calculations unreliable. By taking
the logarithm of the likelihood, we convert these products into sums, which
are more numerically stable and cheaper to compute.

2. **Simplified gradient computation:** When we take the log of a product, multiplications
turn into additions.  When we compute a derivative, we can compute it separately
for each summand, simplifying the gradient calculation of the loss function.
This makes gradient descent easier and more efficient to compute.
-->



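Dividing both sides of (6) by $(\lambda_{11} - \lambda_{21}) P(\bar{x} | \omega_2) P(\omega_1)$, we can
present this as a ratio.  The direction of the inequality is preserved as long as that divisor is
positive; dividing by a negative quantity would reverse it.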

\begin{align}
 \omega_1 \hspace{0.5in} & 
     \text{if } \frac{P(\bar{x} | \omega_1)}{P(\bar{x} | \omega_2)}
       < \frac{\lambda_{22} - \lambda_{12}}{\lambda_{11} - \lambda_{21}} \frac{P(\omega_2)}{P(\omega_1)}
       \hspace{1in} (7)  \\
 \omega_2  \hspace{0.5in} & \text{ otherwise}
\end{align}

I should have left it there, but I added this next erroneous inequality.

\begin{align}
 \omega_1 \hspace{0.5in} & 
     \text{if } \frac{P(\bar{x} | \omega_1)}{P(\bar{x} | \omega_2)}
       > \frac{\lambda_{12}- \lambda_{22}}{\lambda_{21} - \lambda_{11}} \frac{P(\omega_2)}{P(\omega_1)}
       \hspace{1in} (8)  \\
 \omega_2  \hspace{0.5in} & \text{ otherwise}
\end{align}

I should not have flipped the less than symbol to a greater than symbol.  It would've been better
to just stop at (7).
