# $\S$ 7.7. The Bayesian Approach and $\text{BIC}$

The Bayesian information criterion ($\text{BIC}$), like $\text{AIC}$, is applicable in settings where the fitting is carried out by maximization of a log-likelihood.

The generic form of $\text{BIC}$ is

\begin{equation}
\text{BIC} = -2\cdot\text{loglik} + (\log N)\cdot d.
\end{equation}

The $\text{BIC}$ statistic (times $1/2$) is also known as the Schwarz criterion (Schwarz, 1978).
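As a minimal sketch of the generic form, the snippet below computes $\text{BIC}$ next to $\text{AIC}$ for two fitted models. The log-likelihoods, parameter counts, and sample size are hypothetical numbers chosen for illustration, and the function names `bic` and `aic` are not from any particular library.

```python
import math

def bic(loglik: float, n_obs: int, n_params: int) -> float:
    """Generic BIC: -2 * loglik + (log N) * d."""
    return -2.0 * loglik + math.log(n_obs) * n_params

def aic(loglik: float, n_params: int) -> float:
    """Generic AIC: -2 * loglik + 2 * d, shown for comparison."""
    return -2.0 * loglik + 2.0 * n_params

# Hypothetical fits: the larger model has a higher log-likelihood
# but uses five more parameters.
N = 200
loglik_small, d_small = -520.0, 3
loglik_large, d_large = -512.0, 8

bic_small = bic(loglik_small, N, d_small)
bic_large = bic(loglik_large, N, d_large)
aic_small = aic(loglik_small, d_small)
aic_large = aic(loglik_large, d_large)

print("BIC prefers:", "small" if bic_small < bic_large else "large")
print("AIC prefers:", "small" if aic_small < aic_large else "large")
```

With these made-up numbers the two criteria disagree: the gain in log-likelihood is enough to offset a penalty of $2$ per parameter but not a penalty of $\log 200 \approx 5.3$ per parameter.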

### $\text{BIC}$ and $\text{AIC}$

Under the Gaussian model, assuming the variance $\sigma_\epsilon^2$ is known, $-2\cdot\text{loglik}$ equals (up to a constant)

\begin{equation}
\frac{\sum_i (y_i - \hat{f}(x_i))^2}{\sigma_\epsilon^2},
\end{equation}

which is

\begin{equation}
\frac{N\cdot\overline{\text{err}}}{\sigma_\epsilon^2}
\end{equation}

for squared error loss.

Hence we can write

\begin{equation}
\text{BIC} = \frac{N}{\sigma_\epsilon^2} \left[ \overline{\text{err}} + \frac{d\sigma_\epsilon^2}{N}\log N \right].
\end{equation}

Therefore $\text{BIC}$ is proportional to $\text{AIC}$ ($C_p$), with the factor $2$ replaced by $\log N$.
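The identity above can be checked numerically. The sketch below assumes a known noise variance and two sets of fitted values, all hypothetical, and confirms that the log-likelihood form and the bracketed form of $\text{BIC}$ differ only by the constant $N \log(2\pi\sigma_\epsilon^2)$, which is the same for every model.

```python
import math

def rss(y, y_hat):
    """Residual sum of squares."""
    return sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))

def bic_from_loglik(y, y_hat, sigma2, d):
    """-2 * loglik + (log N) * d with the full Gaussian log-likelihood."""
    n = len(y)
    loglik = -0.5 * n * math.log(2 * math.pi * sigma2) - rss(y, y_hat) / (2 * sigma2)
    return -2.0 * loglik + math.log(n) * d

def bic_bracket_form(y, y_hat, sigma2, d):
    """(N / sigma2) * [err_bar + (log N) * d * sigma2 / N]."""
    n = len(y)
    err_bar = rss(y, y_hat) / n
    return (n / sigma2) * (err_bar + math.log(n) * d * sigma2 / n)

# Hypothetical data and fitted values (assumed, not from the text).
y = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.2, 8.1]
sigma2 = 0.25
fits = {
    "constant": ([4.55] * 8, 1),                               # d = 1
    "line": ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0], 2),     # d = 2
}

# The model-independent constant dropped in the text.
const_term = len(y) * math.log(2 * math.pi * sigma2)

gaps = {}
for name, (y_hat, d) in fits.items():
    gaps[name] = bic_from_loglik(y, y_hat, sigma2, d) - bic_bracket_form(y, y_hat, sigma2, d)
    print(name, "gap minus constant:", round(gaps[name] - const_term, 9))
```

Because the gap is identical for both fits, ranking models by either form gives the same order.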

### Prefer simpler

Assuming $N > e^2 \approx 7.4$, so that $\log N > 2$, $\text{BIC}$ penalizes complex models more heavily than $\text{AIC}$, giving preference to simpler models in selection. As with $\text{AIC}$, $\sigma_\epsilon^2$ is typically estimated by the mean squared error of a low-bias model.

For classification problems, use of the multinomial log-likelihood leads to a similar relationship with the $\text{AIC}$, using cross-entropy as the error measure.

Note however that the misclassification error measure does not arise in the $\text{BIC}$ context, since it does not correspond to the log-likelihood of the data under any probability model.

### Bayesian motivation

Despite its similarity with $\text{AIC}$, $\text{BIC}$ is motivated in quite a different way. It arises in the Bayesian approach to model selection.

Suppose
* we have a set of candidate models $\mathcal{M}_m$, $m=1,\ldots,M$, with
* corresponding model parameters $\theta_m$,
* a prior distribution $\text{Pr}(\theta_m|\mathcal{M}_m)$ for the parameters of each model $\mathcal{M}_m$, and
* we wish to choose a best model from among them.

Then the posterior probability of a given model is

\begin{align}
\text{Pr}(\mathcal{M}_m | \mathbf{Z}) &\propto \text{Pr}(\mathcal{M}_m) \cdot \text{Pr}(\mathbf{Z} | \mathcal{M}_m) \\
&\propto \text{Pr}(\mathcal{M}_m) \cdot \int \text{Pr} (\mathbf{Z} | \theta_m,\mathcal{M}_m) \text{Pr}(\theta_m | \mathcal{M}_m) d\theta_m,
\end{align}

where $\mathbf{Z} = \{x_i,y_i\}_1^N$ represents the training data.

To compare two models $\mathcal{M}_m$ and $\mathcal{M}_l$, we form the posterior odds

\begin{equation}
\frac{\text{Pr}(\mathcal{M}_m | \mathbf{Z})}{\text{Pr}(\mathcal{M}_l | \mathbf{Z})} = \frac{\text{Pr}(\mathcal{M}_m)}{\text{Pr}(\mathcal{M}_l)} \cdot \frac{\text{Pr}( \mathbf{Z}|\mathcal{M}_m)}{\text{Pr}(\mathbf{Z}|\mathcal{M}_l)} = \frac{\text{Pr}(\mathcal{M}_m)}{\text{Pr}(\mathcal{M}_l)} \cdot \text{BF}(\mathbf{Z}).
\end{equation}

If the odds are greater than one we choose model $m$, otherwise we choose model $l$.

The rightmost quantity $\text{BF}(\mathbf{Z})$ is called the _Bayes factor_, the contribution of the data toward the posterior odds.
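To make the posterior odds concrete, here is a small sketch for a case where the marginal likelihood integral has a closed form. The setup is an assumption for illustration, not part of the text: $\mathbf{Z}$ is a specific sequence of $n$ coin flips with $k$ heads, model 1 fixes the heads probability at $1/2$, and model 2 puts a uniform prior on it, so that $\int \theta^k (1-\theta)^{n-k} d\theta = B(k+1, n-k+1)$.

```python
import math

def log_marginal_fair(n: int, k: int) -> float:
    """log Pr(Z | M_1): theta is fixed at 1/2, nothing to integrate."""
    return n * math.log(0.5)

def log_marginal_uniform(n: int, k: int) -> float:
    """log Pr(Z | M_2): theta ~ Uniform(0, 1), integrated out.

    The integral equals the Beta function B(k + 1, n - k + 1),
    evaluated here through log-gamma for numerical stability.
    """
    return math.lgamma(k + 1) + math.lgamma(n - k + 1) - math.lgamma(n + 2)

def posterior_odds(n: int, k: int, prior_odds: float = 1.0) -> float:
    """Posterior odds of M_1 against M_2: prior odds times the Bayes factor."""
    bayes_factor = math.exp(log_marginal_fair(n, k) - log_marginal_uniform(n, k))
    return prior_odds * bayes_factor

balanced = posterior_odds(10, 5)    # 5 heads in 10 flips
lopsided = posterior_odds(10, 10)   # 10 heads in 10 flips
print(balanced, lopsided)
```

With a uniform prior over the two models, the balanced sequence gives odds above one (favoring the fixed-probability model) and the all-heads sequence gives odds well below one.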

Typically we assume that the prior over models is uniform, so that $\text{Pr}(\mathcal{M}_m)$ is constant.

We need some way of approximating $\text{Pr}(\mathbf{Z}|\mathcal{M}_m)$. A so-called Laplace approximation to the above integral, followed by some other simplifications (Ripley, 1996, page 64), gives

\begin{equation}
\log \text{Pr}(\mathbf{Z}|\mathcal{M}_m) = \log \text{Pr}(\mathbf{Z}|\hat\theta_m,\mathcal{M}_m) - \frac{d_m}{2} \log N + O(1).
\end{equation}

Here $\hat\theta_m$ is a maximum likelihood estimate and $d_m$ is the number of free parameters in model $\mathcal{M}_m$.
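The following is a sketch of where the $\frac{d_m}{2}\log N$ term comes from, under assumptions not spelled out above: the $N$ observations are i.i.d., the prior is smooth near $\hat\theta_m$, and the log-likelihood is well approximated by a quadratic around $\hat\theta_m$. The symbols $\mathbf{H}$ and $\mathbf{I}_1$ are introduced here for illustration. A second-order expansion of the log-likelihood around $\hat\theta_m$ turns the integral into a Gaussian integral, giving

\begin{equation}
\log \text{Pr}(\mathbf{Z}|\mathcal{M}_m) \approx \log \text{Pr}(\mathbf{Z}|\hat\theta_m,\mathcal{M}_m) + \log \text{Pr}(\hat\theta_m|\mathcal{M}_m) + \frac{d_m}{2}\log 2\pi - \frac{1}{2}\log |\mathbf{H}|,
\end{equation}

where $\mathbf{H}$ is the negative Hessian of the log-likelihood at $\hat\theta_m$. For i.i.d. data $\mathbf{H} \approx N \cdot \mathbf{I}_1(\hat\theta_m)$, with $\mathbf{I}_1$ the per-observation information matrix, so

\begin{equation}
\log |\mathbf{H}| \approx d_m \log N + \log |\mathbf{I}_1(\hat\theta_m)|.
\end{equation}

The only terms that grow with $N$ are the maximized log-likelihood and $-\frac{d_m}{2}\log N$; the prior term, the $2\pi$ term and $\log|\mathbf{I}_1|$ are absorbed into $O(1)$.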

If we define our loss function to be $-2\log \text{Pr}(\mathbf{Z}|\hat\theta_m,\mathcal{M}_m)$, this is equivalent to the $\text{BIC}$ criterion given at the start of this section.

> Therefore, choosing the model with minimum $\text{BIC}$ is equivalent to choosing the model with largest (approximate) posterior probability.

### Bonus

This framework gives us more.

If we compute the $\text{BIC}$ criterion for a set of $M$ models, giving $\text{BIC}_m$, then we can estimate the posterior probability of each model $\mathcal{M}_m$ as

\begin{equation}
\frac{\exp\left( -\frac{1}{2} \text{BIC}_m \right)}{\sum_{l=1}^M \exp\left( -\frac{1}{2} \text{BIC}_l \right)}.
\end{equation}

Thus we can estimate not only the best model, but also assess the relative merits of the models considered.
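A minimal sketch of this computation follows. The $\text{BIC}$ values are hypothetical, and the function name `bic_posterior_probs` is made up for this example. Subtracting the smallest $\text{BIC}$ before exponentiating leaves the ratios unchanged and avoids underflow when the values are large.

```python
import math

def bic_posterior_probs(bic_values):
    """Approximate posterior model probabilities from BIC values.

    Computes exp(-BIC_m / 2) normalized over all models. The minimum BIC
    is subtracted first; the shift cancels in the ratio but keeps the
    exponentials from underflowing to zero.
    """
    best = min(bic_values)
    weights = [math.exp(-0.5 * (b - best)) for b in bic_values]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical BIC values for three candidate models.
bics = [1055.9, 1066.4, 1058.0]
probs = bic_posterior_probs(bics)
for b, p in zip(bics, probs):
    print(f"BIC = {b:7.1f}  approx. posterior probability = {p:.4f}")
```

Only differences in $\text{BIC}$ matter here: adding the same constant to every $\text{BIC}_m$ leaves the estimated probabilities unchanged.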

### Meaning and comparison

For model selection purposes, there is no clear choice between $\text{AIC}$ and $\text{BIC}$.

$\text{BIC}$ is asymptotically consistent as a selection criterion. What this means is that given a family of models, including the true model, the probability that $\text{BIC}$ will select the correct model approaches one as the sample size $N \rightarrow \infty$.

This is not the case for $\text{AIC}$, which tends to choose models which are too complex as $N \rightarrow \infty$. On the other hand, for finite samples, $\text{BIC}$ often chooses models that are too simple, because of its heavy penalty on complexity.