# Information Quantities for Probabilistic Classifiers

**CS5483 Data Warehousing and Data Mining**
___

This notebook will introduce the information quantities often used for training probabilistic classifiers.

As an example, the following handwritten digit classifier is [trained by deep learning](https://www.cs.cityu.edu.hk/~ccha23/deepbook/divedeep.html) using cross entropy loss:
1. Handwrite a digit from 0, ..., 9.
1. Click predict to see if the app can recognize the digit.

::::{card}
:header: Open in [new tab](https://www.cs.cityu.edu.hk/~ccha23/mnist)
:::{iframe} https://www.cs.cityu.edu.hk/~ccha23/mnist
:::
::::

## Information Divergence

A fundamental property of mutual information is that:

::::{prf:theorem} Positivity of mutual information
:label: MI-positivity

$I(X;Y)\geq 0$ with equality iff $X$ and $Y$ are independent.

::::

To show this, we think of the mutual information as a statistical distance called the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence):

::::{prf:definition} 

The information/KL divergence between two probability measures $P$ and $Q$ over $\mathcal{Z}$ is defined as

$$
D(P\|Q) := \int_{\mathcal{Z}} (dP) \log \frac{dP}{dQ}.
$$ (D)

The conditional divergence is defined as 

$$
D(V\|W|U):=D(U\times V\| U\times W).
$$

::::
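For distributions over a finite alphabet, the integral in {eq}`D` reduces to a sum, which can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the helper names `kl_divergence` and `conditional_kl` are hypothetical, the pmfs are made-up examples, logarithms are natural (nats), and the conditional divergence is read as the $P_U$-weighted average of the per-$u$ divergences, which assumes $V$ and $W$ are conditional pmfs given $U$.

```python
import math

def kl_divergence(p, q):
    """D(P||Q) in nats for two pmfs given as equal-length sequences."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # convention: 0 log 0 = 0
        if qi == 0:
            return math.inf  # P puts mass where Q has none
        total += pi * math.log(pi / qi)
    return total

def conditional_kl(p_u, v_given_u, w_given_u):
    """D(V||W|U) as the p_U-weighted average of D(V(.|u) || W(.|u))."""
    return sum(
        pu * kl_divergence(v, w)
        for pu, v, w in zip(p_u, v_given_u, w_given_u)
        if pu > 0  # skip zero-probability u to avoid 0 * inf
    )

# Illustrative pmfs over a binary alphabet (assumed values).
p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q))  # divergence of P from Q
print(kl_divergence(q, p))  # generally different: D is not symmetric
print(conditional_kl([0.5, 0.5], [p, p], [q, p]))
```

Swapping the arguments changes the value, which is one reason the divergence is not a metric even though it behaves like a distance.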

For the information divergence to be called a [divergence](https://en.wikipedia.org/wiki/Divergence_(statistics)), it has to satisfy the following property:

::::{prf:lemma} Positivity of divergence
:label: D-positivity

$D(P\|Q) \geq 0$, with equality iff $P=Q$ almost everywhere.

::::

::::{prf:proof} 

Without loss of generality, we can rewrite the divergence as the expectation:

$$
\begin{align}
D(P_{Z}\|P_{Z'}) &= E\left[ \frac{p_{Z}(Z')}{p_{Z'}(Z')} \log\frac{p_{Z}(Z')}{p_{Z'}(Z')} \right]\\
&\geq E\left[ \frac{p_{Z}(Z')}{p_{Z'}(Z')}\right] \log \underbrace{E\left[\frac{p_{Z}(Z')}{p_{Z'}(Z')} \right]}_{=1} = 0,
\end{align}
$$

where the inequality follows from [Jensen's inequality](https://en.wikipedia.org/wiki/Jensen%27s_inequality) and the convexity of $r \mapsto r \log r$. Since $r \mapsto r \log r$ is strictly convex, equality holds iff $\frac{p_{Z}(Z')}{p_{Z'}(Z')}=1$ almost surely, i.e., $P_{Z}=P_{Z'}$ almost everywhere.

::::
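The two facts used in the proof can be checked numerically on randomly drawn pmfs: the density ratio has expectation $1$ under $Z'$, and the resulting divergence is never negative. The sketch below is only a sanity check under assumed settings (alphabet size 5, 1000 trials, a fixed seed, strictly positive pmfs); the helper `random_pmf` is hypothetical.

```python
import math
import random

def random_pmf(size, rng):
    """Draw a strictly positive pmf by normalizing random weights."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)  # fixed seed so the check is reproducible
mean_ratios, divergences = [], []
for _ in range(1000):
    p, q = random_pmf(5, rng), random_pmf(5, rng)
    ratios = [pi / qi for pi, qi in zip(p, q)]  # p_Z(z) / p_Z'(z)
    # Expectations are taken under Z' ~ q, as in the proof.
    mean_ratios.append(sum(qi * r for qi, r in zip(q, ratios)))
    divergences.append(sum(qi * r * math.log(r) for qi, r in zip(q, ratios)))

print(max(abs(m - 1) for m in mean_ratios))  # deviation of E[ratio] from 1
print(min(divergences))                      # smallest divergence observed
```

A numerical check of this kind does not replace the proof; it only confirms that the expectation form of the divergence behaves as the lemma states on the sampled cases.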

::::{exercise}
:label: ex:1
Prove the positivity of mutual information ({prf:ref}`MI-positivity`).
::::

YOUR ANSWER HERE

## Cross Entropy

A probabilistic classifier returns a conditional probability estimate $\hat{P}_{Y|X}$ as a function of the training data $S$, which consists of i.i.d. samples drawn from $P_{X,Y}$ and is independent of $(X,Y)$.

A sensible choice of the loss function is

$$\ell(\hat{P}_{Y|X}(\cdot|x), P_{Y|X}(\cdot|x)):=D(\hat{P}_{Y|X}(\cdot|x)\|P_{Y|X}(\cdot|x))$$ (D-loss)

because, by the positivity of divergence ({prf:ref}`D-positivity`), the above loss is non-negative, and it equals $0$ for $P_X$-almost every $x$ iff $P_X\times \hat{P}_{Y|X}=P_X \times P_{Y|X}$. Using this loss function, we have a simple bias-variance trade-off:

::::{prf:theorem} Bias-variance trade-off

The expected loss (risk) for the loss function {eq}`D-loss` is

$$
\begin{align}
E[D(\hat{P}_{Y|X} \| P_{Y|X}|P_X)] = \overbrace{I(S;\hat{Y}|X)}^{\text{Variance}} + \overbrace{D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)}^{\text{Bias}}
\end{align}
$$ (Bias-Variance)

where $\hat{Y}$ is distributed according to

$$
\begin{align}
P_{X,Y,S,\hat{Y}}&=P_{X,Y}\times P_{S} \times P_{\hat{Y}|X,S} && \text{where}\\
P_{\hat{Y}|X,S}(y|x,s) &= \hat{P}_{Y|X}(y|x) && \text{for }(x,y,s)\in \mathcal{X}\times \mathcal{Y}\times \mathcal{S}.
\end{align}
$$ (Yhat)

::::

- The variance $I(S;\hat{Y}|X)$ (also $I(S;X,\hat{Y})$ as $I(S;X)=0$) reflects the level of overfitting as it measures how much the estimate depends on the training data.
- The bias $D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X)$ reflects the level of underfitting as it measures how much the expected estimate $E[\hat{P}_{Y|X}]$ deviates from the true conditional distribution $P_{Y|X}$.

::::{prf:proof} 

$$
\begin{align}
E[D(\hat{P}_{Y|X} \| P_{Y|X}|P_X)] 
&= E\left[\log \frac{\hat{p}_{Y|X}(\hat{Y}|X)}{p_{Y|X}(\hat{Y}|X)}\right]\\
&= \underbrace{E\left[\log \frac{\hat{p}_{Y|X}(\hat{Y}|X)}{E[\hat{p}_{Y|X}](\hat{Y}|X)}\right]}_{\text{(i)}} + \underbrace{E\left[\log \frac{E[\hat{p}_{Y|X}](\hat{Y}|X)}{p_{Y|X}(\hat{Y}|X)}\right]}_{=D(E[\hat{P}_{Y|X}]\|P_{Y|X}|P_X) \text{ (bias)}}
\end{align}
$$

It remains to show (i) is the variance. By {eq}`Yhat`,

$$
\begin{align}
E[\hat{p}_{Y|X}](y|x)
&= E[p_{\hat{Y}|X,S}(y|x,S)|X=x]\\
&= p_{\hat{Y}|X}(y|x).
\end{align}
$$

Substituting {eq}`Yhat` and the above into (i), we have

$$
\begin{align}
\text{(i)} &= E\left[\log \frac{p_{\hat{Y}|X,S}(\hat{Y}|X,S)}{p_{\hat{Y}|X}(\hat{Y}|X)}\right]\\
&= I(S;\hat{Y}|X),
\end{align}
$$

which completes the proof.

::::
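The decomposition in {eq}`Bias-Variance` can be verified numerically on a toy problem. The sketch below assumes, purely for illustration, that the training data $S$ takes only three values with known probabilities and that each value leads to a fixed estimate $\hat{p}_{Y|X}$; all numbers and variable names are made up. The variance is computed as $I(S;\hat{Y}|X)$ by averaging the divergence of each estimate from the mean estimate.

```python
import math

def kl(p, q):
    """D(P||Q) in nats for pmfs with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Assumed toy model: X in {0, 1}, Y in {0, 1, 2}.
p_x = [0.4, 0.6]
p_y_given_x = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

# Assumed distribution of S and the estimate obtained for each value of S.
p_s = [0.5, 0.3, 0.2]
estimates = [  # estimates[s][x] is the estimated pmf of Y given X = x
    [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]],
    [[0.8, 0.1, 0.1], [0.1, 0.2, 0.7]],
    [[0.5, 0.25, 0.25], [0.3, 0.3, 0.4]],
]

# Expected estimate E[p_hat](y|x), averaged over S.
mean_estimate = [
    [sum(ps * est[x][y] for ps, est in zip(p_s, estimates)) for y in range(3)]
    for x in range(len(p_x))
]

pairs = [(ps, est) for ps, est in zip(p_s, estimates)]
risk = sum(ps * px * kl(est[x], p_y_given_x[x])
           for ps, est in pairs for x, px in enumerate(p_x))
variance = sum(ps * px * kl(est[x], mean_estimate[x])
               for ps, est in pairs for x, px in enumerate(p_x))
bias = sum(px * kl(mean_estimate[x], p_y_given_x[x])
           for x, px in enumerate(p_x))

print(risk, variance + bias)  # the two values should agree up to rounding
```

Making the three estimates more similar to each other shrinks the variance term, while moving their average closer to `p_y_given_x` shrinks the bias term.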

The loss in {eq}`D-loss`, however, cannot be evaluated on $S$ for training because the true conditional distribution $P_{Y|X}(\cdot|x_i)$ at each sample input $x_i$ in $S$ is not known. Instead, we often use the [cross entropy](https://en.wikipedia.org/wiki/Cross_entropy) loss

$$
\ell(\hat{P}_{Y|X}(\cdot |x),y) := \log \frac{1}{\hat{p}_{Y|X}(y|x)}.
$$ (CE-loss)
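Unlike {eq}`D-loss`, the loss in {eq}`CE-loss` needs only the predicted probabilities and the observed label, so it can be averaged over the training samples. A minimal sketch follows; the function name `cross_entropy_loss`, the predictions, and the labels are hypothetical, and the logarithm is natural.

```python
import math

def cross_entropy_loss(p_hat, y):
    """log(1 / p_hat[y]) in nats, for a predicted pmf p_hat and a label index y."""
    return -math.log(p_hat[y]) if p_hat[y] > 0 else math.inf

# Assumed predicted pmfs over three classes and the observed labels.
predictions = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
labels = [0, 1, 0]

losses = [cross_entropy_loss(p_hat, y) for p_hat, y in zip(predictions, labels)]
empirical_risk = sum(losses) / len(losses)  # average loss over the samples
print(losses)
print(empirical_risk)
```

The loss is small when the classifier assigns high probability to the observed label and grows without bound as that probability approaches $0$.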

::::{prf:theorem} Cross entropy

The risk for the loss in {eq}`CE-loss` is

$$
\begin{align}
E\left[\log \frac1{\hat{p}_{Y|X}(Y|X)}\right] 
&= H(Y|X) + E[D(P_{Y|X}\| \hat{P}_{Y|X}|P_X)] \\
&\geq H(Y|X) 
\end{align}
$$

with equality iff $P_{X}\times P_{Y|X}=P_{X}\times \hat{P}_{Y|X}$ almost everywhere.

::::
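The identity in the theorem can be checked numerically before attempting a proof. The sketch below makes a simplifying assumption: the estimate $\hat{p}_{Y|X}$ is fixed (as if $S$ were deterministic), so the outer expectation over $S$ is trivial. The distributions and the helper `expect` are made up for illustration.

```python
import math

# Assumed toy model: X in {0, 1}, Y in {0, 1, 2}.
p_x = [0.4, 0.6]
p_y_given_x = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
p_hat = [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]  # a fixed, assumed estimate

def expect(f):
    """E[f(X, Y)] under the true joint distribution p_X(x) p_{Y|X}(y|x)."""
    return sum(
        px * p_y_given_x[x][y] * f(x, y)
        for x, px in enumerate(p_x)
        for y in range(len(p_y_given_x[x]))
    )

ce_risk = expect(lambda x, y: math.log(1 / p_hat[x][y]))
cond_entropy = expect(lambda x, y: math.log(1 / p_y_given_x[x][y]))  # H(Y|X)
divergence = expect(lambda x, y: math.log(p_y_given_x[x][y] / p_hat[x][y]))

print(ce_risk, cond_entropy + divergence)  # should agree up to rounding
print(ce_risk - cond_entropy)              # the gap equals the divergence
```

Setting `p_hat` equal to `p_y_given_x` makes the gap vanish, matching the equality condition stated in the theorem.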

::::{exercise}
:label: ex:2
Prove the above result.
::::

YOUR ANSWER HERE