# $\S$ 7.8. Minimum Description Length

The minimum description length (MDL) approach gives a selection criterion formally identical to the $\text{BIC}$ approach, but is motivated from an optimal coding viewpoint.

We first review the theory of coding for data compression, and then apply it to model selection.

### Data compression

We think of
* our datum $z$ as a message that we want to encode and send to someone else (the "receiver"),
* our model as a way of encoding the datum,

and we will choose the most parsimonious model, that is the shortest code, for the transmission.

Suppose
* the possible messages we might want to transmit are $z_1,z_2,\cdots,z_m$, and
* our code uses a finite alphabet of size $A$: e.g., we might use a binary code $\{0,1\}$ with $A=2$.

Here is an example with four possible messages and a binary coding:

<table>
    <tr>
        <td>Message</td>
        <td>$z_1$</td>
        <td>$z_2$</td>
        <td>$z_3$</td>
        <td>$z_4$</td>
    </tr>
    <tr>
        <td>Code</td>
        <td>$0$</td>
        <td>$10$</td>
        <td>$110$</td>
        <td>$111$</td>
    </tr>
</table>

This code is known as an instantaneous prefix code: no code is the prefix of any other, and the receiver (who knows all of the possible codes) knows exactly when the message has been completely sent. We restrict our discussion to such instantaneous prefix codes.
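The prefix property can be illustrated with a minimal Python sketch. The names `CODE`, `is_prefix_free`, `encode`, and `decode` are hypothetical helpers introduced here for illustration; only the code table itself comes from the example above.

```python
# Code table from the example above (message name -> binary codeword).
CODE = {"z1": "0", "z2": "10", "z3": "110", "z4": "111"}

def is_prefix_free(code):
    """True if no codeword is a prefix of a different codeword."""
    words = list(code.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

def encode(messages, code=CODE):
    """Concatenate the codewords with no separators between them."""
    return "".join(code[m] for m in messages)

def decode(bits, code=CODE):
    """Read bits left to right, emitting a message as soon as a codeword matches."""
    inverse = {w: m for m, w in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:  # prefix-free, so the first match is unambiguous
            out.append(inverse[buf])
            buf = ""
    if buf:
        raise ValueError("bit stream ended in the middle of a codeword")
    return out

stream = encode(["z2", "z1", "z4", "z3"])
print(stream)          # 100111110
print(decode(stream))  # ['z2', 'z1', 'z4', 'z3']
```

Because no codeword is a prefix of another, the decoder never needs to look ahead: it knows a message is complete the moment the buffer matches a codeword.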

We could use the coding in the above table, or we could permute the codes, e.g., use codes $110$, $10$, $111$, $0$ for $z_1,z_2,z_3,z_4$. How do we decide which to use? It depends on how often we will be sending each of the messages: the message sent most often should get the shortest code, $0$. Using this kind of strategy (shorter codes for more frequent messages), the average message length will be shorter.
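To make the comparison concrete, here is a small sketch that computes the average message length of the two assignments. The message probabilities $1/2, 1/4, 1/8, 1/8$ are an assumption chosen for illustration, and `expected_length` is a hypothetical helper.

```python
from fractions import Fraction

# Assumed message frequencies (illustrative only).
probs = {
    "z1": Fraction(1, 2),
    "z2": Fraction(1, 4),
    "z3": Fraction(1, 8),
    "z4": Fraction(1, 8),
}

# The coding from the table, and the permuted coding mentioned in the text.
table_code = {"z1": "0", "z2": "10", "z3": "110", "z4": "111"}
permuted_code = {"z1": "110", "z2": "10", "z3": "111", "z4": "0"}

def expected_length(code, probs):
    """Average number of code symbols sent per message."""
    return sum(probs[m] * len(code[m]) for m in probs)

print(expected_length(table_code, probs))     # 7/4
print(expected_length(permuted_code, probs))  # 5/2
```

Under these assumed frequencies the table coding, which gives the shortest code to the most frequent message, needs 1.75 bits per message on average, against 2.5 bits for the permuted coding.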

### Entropy

In general, if messages are sent with probabilities $\text{Pr}(z_i)$, $i=1,2,3,4$, a famous theorem due to Shannon says we should use code lengths

\begin{equation}
l_i = -\log_2 \text{Pr}(z_i)
\end{equation}

and the average message length satisfies

\begin{equation}
\text{E}(\text{length}) \ge -\sum \text{Pr}(z_i) \log_2 \text{Pr}(z_i).
\end{equation}

The RHS above is also called the entropy of the distribution $\text{Pr}(z_i)$.

The inequality is an equality when the probabilities satisfy $\text{Pr}(z_i) = A^{-l_i}$, where $A$ is the size of the code alphabet ($A=2$ for a binary code).

In our example, if $\text{Pr}(z_i) = 1/2, 1/4, 1/8, 1/8$, respectively, then the coding shown in the above table is optimal and achieves the entropy lower bound.
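This claim can be checked numerically. The sketch below computes the Shannon code lengths $-\log_2 \text{Pr}(z_i)$ and the entropy for the probabilities above, and compares them with the codeword lengths from the table; the variable names are hypothetical.

```python
import math

probs = [1 / 2, 1 / 4, 1 / 8, 1 / 8]
table_lengths = [1, 2, 3, 3]  # lengths of the codewords 0, 10, 110, 111

# Code lengths suggested by Shannon's theorem.
shannon_lengths = [-math.log2(p) for p in probs]

# Entropy (lower bound) and the average length of the table coding.
entropy = -sum(p * math.log2(p) for p in probs)
mean_length = sum(p * l for p, l in zip(probs, table_lengths))

print(shannon_lengths)       # [1.0, 2.0, 3.0, 3.0]
print(entropy, mean_length)  # 1.75 1.75
```

The Shannon lengths coincide with the table's codeword lengths, so the average length equals the entropy of 1.75 bits.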

In general the lower bound cannot be achieved, but procedures like the Huffman coding scheme can get close to the bound. Note that with an infinite set of messages, the entropy is replaced by

\begin{equation}
-\int \text{Pr}(z) \log_2 \text{Pr}(z) dz.
\end{equation}
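For a finite message set whose probabilities are not powers of $1/2$, the Huffman scheme mentioned above can be sketched as follows. This is a minimal illustration of binary Huffman coding that tracks only codeword lengths; `huffman_lengths` is a hypothetical helper and the probabilities $0.4, 0.3, 0.2, 0.1$ are assumed for the example.

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword length assigned to each symbol by binary Huffman coding."""
    # Heap items: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        # Merge the two least probable subtrees.
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for i in s1 + s2:
            lengths[i] += 1  # every symbol below the merge gains one bit
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [0.4, 0.3, 0.2, 0.1]  # assumed, not dyadic
lengths = huffman_lengths(probs)
entropy = -sum(p * math.log2(p) for p in probs)
mean_length = sum(p * l for p, l in zip(probs, lengths))

print(lengths)  # [1, 2, 3, 3]
print(entropy <= mean_length < entropy + 1)  # True
```

Here the average length exceeds the entropy because the bound cannot be met exactly, but it stays within one bit of it.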

From this result we glean the following:

> To transmit a random variable $z$ having probability density function $\text{Pr}(z)$, we require about $-\log_2 \text{Pr}(z)$ bits of information.

We henceforth change notation from $\log_2 \text{Pr}(z)$ to $\log \text{Pr}(z) = \log_e \text{Pr}(z)$; this is for convenience, and just introduces an unimportant multiplicative constant.

### Model selection

Now we apply this result to the problem of model selection. We have
* a model $M$ with parameters $\theta$,
* data $\mathbf{Z} = (\mathbf{X}, \mathbf{y})$ consisting of both inputs and outputs, and
* $\text{Pr}(\mathbf{y}|\theta,M,\mathbf{X})$, the (conditional) probability of the outputs under the model.

We assume the receiver knows all of the inputs, and we wish to transmit the outputs.

Then the message length required to transmit the outputs is

\begin{equation}
\text{length} = -\log \text{Pr}(\mathbf{y}|\theta,M,\mathbf{X}) - \log\text{Pr}(\theta|M),
\end{equation}

that is, the negative log-probability of the target values given the inputs, plus the cost of describing the parameters:
* the second term is the average code length for transmitting the model parameters $\theta$,
* while the first term is the average code length for transmitting the discrepancy between the model and actual target values.

For example, suppose we have a single target $y \sim N(\theta, \sigma^2)$, parameter $\theta \sim N(0,1)$, and no input (for simplicity). Then the message length is

\begin{equation}
\text{length} = \text{constant} + \log\sigma + \frac{(y-\theta)^2}{2\sigma^2} + \frac{\theta^2}{2}
\end{equation}

Note that the smaller $\sigma$ is, the shorter on average is the message length, since $y$ is more concentrated around $\theta$.
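The Gaussian example can be verified with a short sketch. It evaluates the two negative log-densities directly using the standard library's `statistics.NormalDist` and compares the sum with the closed form, in which the constant works out to $\log 2\pi$. The function names and the numerical values of $y$, $\theta$, $\sigma$ are assumptions for illustration.

```python
import math
from statistics import NormalDist

def message_length(y, theta, sigma):
    """-log Pr(y | theta) - log Pr(theta), evaluated from the densities."""
    data_cost = -math.log(NormalDist(theta, sigma).pdf(y))
    param_cost = -math.log(NormalDist(0.0, 1.0).pdf(theta))
    return data_cost + param_cost

def message_length_formula(y, theta, sigma):
    """Closed form: constant + log(sigma) + (y - theta)^2 / (2 sigma^2) + theta^2 / 2."""
    constant = math.log(2 * math.pi)
    return (constant + math.log(sigma)
            + (y - theta) ** 2 / (2 * sigma ** 2) + theta ** 2 / 2)

def expected_message_length(theta, sigma):
    """Average over y ~ N(theta, sigma^2): the quadratic term averages to 1/2."""
    return math.log(2 * math.pi) + math.log(sigma) + 0.5 + theta ** 2 / 2

y, theta, sigma = 1.3, 0.4, 0.7  # arbitrary illustrative values
print(math.isclose(message_length(y, theta, sigma),
                   message_length_formula(y, theta, sigma)))  # True
```

Averaging over $y$, the term $(y-\theta)^2/(2\sigma^2)$ contributes $1/2$ regardless of $\sigma$, so the expected length depends on $\sigma$ only through $\log\sigma$, which is why it shrinks as $\sigma$ decreases.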

### Summary

The MDL principle says that we should choose the model that minimizes

\begin{equation}
\text{length} = -\log \text{Pr}(\mathbf{y}|\theta,M,\mathbf{X}) - \log\text{Pr}(\theta|M).
\end{equation}

We recognize this as the (negative) log-posterior distribution, and hence minimizing description length is equivalent to maximizing posterior probability. Hence the $\text{BIC}$ criterion, derived as an approximation to log-posterior probability, can also be viewed as a device for (approximate) model choice by MDL.
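As a small illustration of this equivalence, consider again the single-target Gaussian example with $y \sim N(\theta, \sigma^2)$ and $\theta \sim N(0,1)$. Setting the derivative of the message length with respect to $\theta$ to zero gives $\theta = y/(1+\sigma^2)$, which is also the posterior mode. The sketch below checks this against a brute-force grid search; the helper names and the grid settings are assumptions.

```python
import math

def length(theta, y, sigma):
    """Message length up to an additive constant that does not depend on theta."""
    return math.log(sigma) + (y - theta) ** 2 / (2 * sigma ** 2) + theta ** 2 / 2

def map_estimate(y, sigma):
    """Closed-form minimizer of the length, equal to the posterior mode."""
    return y / (1.0 + sigma ** 2)

def grid_minimizer(y, sigma, lo=-5.0, hi=5.0, n=100001):
    """Brute-force minimizer of the length over an evenly spaced grid."""
    step = (hi - lo) / (n - 1)
    grid = (lo + k * step for k in range(n))
    return min(grid, key=lambda t: length(t, y, sigma))

y, sigma = 1.5, 0.5  # arbitrary illustrative values
print(abs(grid_minimizer(y, sigma) - map_estimate(y, sigma)) < 1e-3)  # True
```

The shortest description is obtained at the same $\theta$ that maximizes the posterior, shrunk toward the prior mean $0$ by the factor $1/(1+\sigma^2)$.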

### For continuous variables

Note that we have ignored the precision with which a random variable $z$ is coded. With a finite code length we cannot code a continuous variable exactly.

However, if we code $z$ within a tolerance $\delta z$, the message length needed is the negative log of the probability in the interval $[z, z+\delta z]$, and that probability is well approximated by $\delta z \text{Pr}(z)$ if $\delta z$ is small. Since

\begin{equation}
\log \left[\delta z \, \text{Pr}(z)\right] = \log \delta z + \log \text{Pr}(z),
\end{equation}

this means we can just ignore the constant $\log \delta z$ and use $-\log\text{Pr}(z)$ as our measure of message length, as we did above.
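A quick numerical check of this approximation, assuming a standard normal density for $z$ and a tolerance of $\delta z = 10^{-3}$ (both chosen only for illustration), is sketched below. The helper names are hypothetical.

```python
import math
from statistics import NormalDist

def exact_length(z, dz, dist):
    """-log of the probability that the variable falls in [z, z + dz]."""
    return -math.log(dist.cdf(z + dz) - dist.cdf(z))

def approx_length(z, dz, dist):
    """Small-dz approximation: -log(dz) - log(density at z)."""
    return -math.log(dz) - math.log(dist.pdf(z))

dist = NormalDist(0.0, 1.0)  # assumed density for z
dz = 1e-3                    # assumed coding tolerance

# The approximation is accurate for a small tolerance.
print(abs(exact_length(0.7, dz, dist) - approx_length(0.7, dz, dist)) < 1e-3)  # True

# The -log(dz) term is the same for every z, so it cancels when lengths are compared.
exact_diff = exact_length(0.7, dz, dist) - exact_length(-0.2, dz, dist)
density_diff = -math.log(dist.pdf(0.7)) + math.log(dist.pdf(-0.2))
print(abs(exact_diff - density_diff) < 2e-3)  # True
```

Since the tolerance term is common to all values of $z$, comparing message lengths gives essentially the same answer as comparing negative log-densities.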

The preceding view of MDL for model selection says that we should choose the model with highest posterior probability. However, many Bayesians would instead do inference by sampling from the posterior distribution.