# Model Selection



Suppose that we have a $K\times 1$ estimator $\hat{\boldsymbol{\theta}}$ which has mean $\boldsymbol{\theta}$ and variance-covariance matrix $\boldsymbol{V}$. An alternative feasible estimator is $\tilde{\boldsymbol{\theta}}=\boldsymbol{0}$. The latter may seem like a silly estimator, but it captures the feature that model selection typically takes the form of exclusion restrictions setting coefficients to 0. In this context we can compare the accuracy of the two estimators by their weighted mean-squared error (WMSE). For a given weight matrix $\boldsymbol{W}$ define

$$
\text{WMSE}(\hat{\boldsymbol{\theta}})=\text{tr}\left(\mathrm{E}\left((\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})^{\prime}\right) \boldsymbol{W}\right)=\mathrm{E}\left((\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})^{\prime} \boldsymbol{W}(\hat{\boldsymbol{\theta}}-\boldsymbol{\theta})\right).
$$

The calculations simplify by setting $\boldsymbol{W}=\boldsymbol{V}^{-1},$ which we do for our remaining calculations.

For our two estimators we calculate that
$$
\begin{array}{l}{\text { WMSE }(\widehat{\boldsymbol{\theta}})=K} \\ {\text {WMSE }(\tilde{\boldsymbol{\theta}})=\boldsymbol{\theta}^{\prime} \boldsymbol{V}^{-1} \boldsymbol{\theta} \stackrel{\text {def}}{=} \lambda.}\end{array}
$$

The WMSE of $\widehat{\boldsymbol{\theta}}$ is smaller if $K<\lambda$ and the WMSE of $\tilde{\boldsymbol{\theta}}$ is smaller if $K>\lambda$. One insight from this simple analysis is that we should prefer smaller (simpler) models when potentially omitted variables have small coefficients relative to estimation variance, and should prefer larger (more complicated) models when these variables have large coefficients relative to estimation variance.
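
The comparison between $K$ and $\lambda$ can be checked numerically. The following is a minimal sketch: the matrix `V`, the two coefficient vectors, and the function names are illustrative assumptions, and the simulation draws $\widehat{\boldsymbol{\theta}}$ from a normal distribution, which the text does not require (only the mean and variance matter for the WMSE).

```python
import numpy as np

def wmse_pair(theta, V):
    """Exact WMSE (with W = V^{-1}) of the unrestricted and the zero estimator."""
    theta = np.asarray(theta, dtype=float)
    V = np.asarray(V, dtype=float)
    K = float(theta.size)                            # tr(V V^{-1}) = K
    lam = float(theta @ np.linalg.solve(V, theta))   # theta' V^{-1} theta
    return K, lam

def simulated_wmse(theta, V, n_draws=200_000, seed=0):
    """Monte Carlo WMSE of an unbiased estimator, drawn here as N(theta, V)."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta, dtype=float)
    err = rng.multivariate_normal(theta, V, size=n_draws) - theta
    quad = np.einsum("ij,jk,ik->i", err, np.linalg.inv(V), err)
    return float(quad.mean())

# Hypothetical variance matrix and coefficient vectors, for illustration only.
V = np.array([[1.0, 0.3], [0.3, 2.0]])
K, lam_small = wmse_pair([0.5, 0.5], V)    # small coefficients relative to V
_, lam_large = wmse_pair([3.0, -2.0], V)   # large coefficients relative to V
wmse_sim = simulated_wmse([0.5, 0.5], V)

print(f"K = {K}, lambda (small) = {lam_small:.3f}, lambda (large) = {lam_large:.3f}")
print(f"simulated WMSE of theta_hat = {wmse_sim:.3f}")
```

With the small coefficient vector $\lambda<K$, so the zero estimator has the lower WMSE; with the large one $\lambda>K$, so the unrestricted estimator does.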

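Now consider a somewhat broader comparison. Suppose $\hat{\boldsymbol{\theta}}$ is $\bar{K} \times 1$ with mean $\boldsymbol{\theta}$ and variance matrix $\boldsymbol{V}$. For some $\bar{K} \times(\bar{K}-K)$ full-rank matrix $\boldsymbol{R}$ consider the alternative estimator $\tilde{\boldsymbol{\theta}}=\widehat{\boldsymbol{\theta}}-\boldsymbol{V} \boldsymbol{R}\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \widehat{\boldsymbol{\theta}}$, which satisfies the restriction $\boldsymbol{R}^{\prime} \tilde{\boldsymbol{\theta}}=\boldsymbol{0}$. Its WMSE is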

$$
\begin{aligned} \text { WMSE }(\tilde{\boldsymbol{\theta}}) &=E\left((\tilde{\boldsymbol{\theta}}-\boldsymbol{\theta})^{\prime} \boldsymbol{V}^{-1}(\tilde{\boldsymbol{\theta}}-\boldsymbol{\theta})\right) \\ &=\boldsymbol{\theta}^{\prime} \boldsymbol{R}\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \boldsymbol{\theta}+K \end{aligned}
$$

The first term is the squared bias and the second is the weighted variance. This simple expression illustrates the basic bias-variance trade-off: increasing $K$ increases the estimation variance but decreases the squared bias, because $\boldsymbol{R}$ then has fewer columns (its rank $\bar{K}-K$ falls).
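
The two terms can be computed exactly for the restricted estimator $\tilde{\boldsymbol{\theta}}=\widehat{\boldsymbol{\theta}}-\boldsymbol{V} \boldsymbol{R}\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \widehat{\boldsymbol{\theta}}$. The sketch below uses a hypothetical $3\times 3$ matrix `V`, a hypothetical `theta`, and a restriction that sets the third coefficient to zero (so $\bar{K}=3$ and $K=2$); the function name is illustrative as well.

```python
import numpy as np

def restricted_wmse_terms(theta, V, R):
    """Squared-bias and weighted-variance terms of the restricted estimator's WMSE."""
    theta = np.asarray(theta, dtype=float)
    V = np.asarray(V, dtype=float)
    R = np.asarray(R, dtype=float)
    A = np.linalg.inv(R.T @ V @ R)
    M = np.eye(theta.size) - V @ R @ A @ R.T       # theta_tilde = M @ theta_hat
    sq_bias = float(theta @ R @ A @ R.T @ theta)   # theta' R (R'VR)^{-1} R' theta
    # Weighted variance: tr(V^{-1} Var(theta_tilde)), with Var(theta_tilde) = M V M'.
    variance = float(np.trace(np.linalg.solve(V, M @ V @ M.T)))
    return sq_bias, variance

# Illustrative inputs (assumptions, not taken from the text).
V = np.array([[1.0, 0.2, 0.1],
              [0.2, 1.5, 0.3],
              [0.1, 0.3, 2.0]])
theta = np.array([1.0, 0.5, 0.2])
R = np.array([[0.0], [0.0], [1.0]])   # restriction: third coefficient is zero

sq_bias, variance = restricted_wmse_terms(theta, V, R)
print(f"squared bias = {sq_bias:.4f}, weighted variance = {variance:.4f}")
```

The weighted variance equals $K=2$ regardless of the values in `V`, while the squared bias depends on how large the excluded coefficient is relative to its variance.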

The squared bias can be estimated by replacing $\boldsymbol{\theta}$ with $\hat{\boldsymbol{\theta}}$. This squared bias estimate is itself biased, since

$$
E\left[\widehat{\boldsymbol{\theta}}^{\prime} \boldsymbol{R}\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \widehat{\boldsymbol{\theta}}\right]=\boldsymbol{\theta}^{\prime} \boldsymbol{R}\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \boldsymbol{\theta}+\bar{K}-K.
$$
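
The correction term $\bar{K}-K$ follows from the standard formula for the expectation of a quadratic form, which needs only the mean and variance of $\widehat{\boldsymbol{\theta}}$. As a short worked step, writing $\boldsymbol{A}=\boldsymbol{R}\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime}$ (a shorthand introduced only for this derivation):

$$
\begin{aligned}
E\left[\widehat{\boldsymbol{\theta}}^{\prime} \boldsymbol{A} \widehat{\boldsymbol{\theta}}\right] &=\boldsymbol{\theta}^{\prime} \boldsymbol{A} \boldsymbol{\theta}+\text{tr}\left(\boldsymbol{A} \boldsymbol{V}\right), \\
\text{tr}\left(\boldsymbol{A} \boldsymbol{V}\right) &=\text{tr}\left(\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)=\text{tr}\left(\boldsymbol{I}_{\bar{K}-K}\right)=\bar{K}-K.
\end{aligned}
$$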

Putting these calculations together we see that an unbiased estimator for the weighted MSE is

$$
\begin{aligned} M_{K} &=\widehat{\boldsymbol{\theta}}^{\prime} \boldsymbol{R}\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \widehat{\boldsymbol{\theta}}+2 K-\bar{K} \\ &=(\widehat{\boldsymbol{\theta}}-\widetilde{\boldsymbol{\theta}})^{\prime} \boldsymbol{V}^{-1}(\widehat{\boldsymbol{\theta}}-\widetilde{\boldsymbol{\theta}})+2 K-\bar{K}. \end{aligned}
$$

<font color='blue'>**Theorem:** If $\widehat{\boldsymbol{\theta}}$ has mean $\boldsymbol{\theta}$ and variance $\boldsymbol{V}$ and $\tilde{\boldsymbol{\theta}}=\widehat{\boldsymbol{\theta}}-\boldsymbol{V} \boldsymbol{R}\left(\boldsymbol{R}^{\prime} \boldsymbol{V} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \hat{\boldsymbol{\theta}}$
then $E\left(M_{K}\right)=\text { WMSE }(\tilde{\boldsymbol{\theta}})$.</font>

The term $\bar{K}$ in $M_{K}$ is constant across models, so it can be omitted for the purposes of model comparison.

In practice $V$ is unknown. It can be replaced with a consistent estimator and we arrive at the <font color='red'>MSE Selection Criterion</font>
$$
\begin{aligned} M_{K} &=\widehat{\boldsymbol{\theta}}^{\prime} \boldsymbol{R}\left(\boldsymbol{R}^{\prime} \widehat{\boldsymbol{V}} \boldsymbol{R}\right)^{-1} \boldsymbol{R}^{\prime} \hat{\boldsymbol{\theta}}+2 K \\ &=(\widehat{\boldsymbol{\theta}}-\tilde{\boldsymbol{\theta}})^{\prime} \widehat{\boldsymbol{V}}^{-1}(\widehat{\boldsymbol{\theta}}-\tilde{\boldsymbol{\theta}})+2 K. \end{aligned}
$$

MSE selection picks the model for which the estimated WMSE $M_{K}$ is smallest. For implementation, a set of candidate models is estimated, $M_{K}$ is calculated for each, and the model with the smallest $M_{K}$ is selected.
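
The implementation steps can be sketched as follows for the common case of exclusion restrictions, where $\boldsymbol{R}$ selects the coefficients that a candidate model sets to zero. The estimate `theta_hat`, the matrix `V_hat`, and the candidate list are made-up inputs for illustration, and `mse_criterion` is a hypothetical helper name.

```python
import numpy as np

def mse_criterion(theta_hat, V_hat, excluded):
    """M_K for the model that sets the coefficients indexed by `excluded` to zero."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    V_hat = np.asarray(V_hat, dtype=float)
    excluded = list(excluded)
    K_bar = theta_hat.size
    K = K_bar - len(excluded)            # number of coefficients kept
    if not excluded:                     # unrestricted model: no distance term
        return 2.0 * K
    R = np.eye(K_bar)[:, excluded]       # K_bar x (K_bar - K) selector matrix
    r = R.T @ theta_hat
    distance = float(r @ np.linalg.solve(R.T @ V_hat @ R, r))
    return distance + 2.0 * K

# Illustrative estimates: one large coefficient and two small ones.
theta_hat = np.array([2.0, 0.1, -0.05])
V_hat = 0.04 * np.eye(3)

candidates = [(), (2,), (1, 2), (0, 1, 2)]   # index sets of excluded coefficients
scores = {c: mse_criterion(theta_hat, V_hat, c) for c in candidates}
best = min(scores, key=scores.get)

for c, m in scores.items():
    print(f"excluded = {c}: M_K = {m:.4f}")
print("selected exclusion set:", best)
```

In this example the criterion drops the two coefficients that are small relative to their standard errors and keeps the large one.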

<ins>Note</ins>: The MSE selection criterion described here is not a common model selection tool, but we have presented it as it is the simplest to derive and understand. Furthermore, it turns out to be quite similar to several popular methods, as we show later.

## Selection Criteria: MLR

We first list selection criteria for the linear regression model $y_{i}=x_{i}^{\prime} \boldsymbol{\beta}+e_{i}$ with $\sigma^{2}=E\left(e_{i}^{2}\right)$ and a $(k+1)\times 1$ coefficient vector $\boldsymbol{\beta}$. Let $\widehat{\boldsymbol{\beta}}$ be the OLS estimator, $\widehat{e}_{i}$ the OLS residual, and $\widehat{\sigma}^{2}=n^{-1} \sum_{i=1}^{n} \widehat{e}_{i}^{2}$ be the variance estimator. The number of estimated parameters ( $\boldsymbol{\beta}$ and $\sigma^{2}$ ) is $K=k+2$.

**_Adjusted $\bar{R}^2$_**
$$
\bar{R}^{2}=1-\left(1-R^{2}\right) \frac{n-1}{n-k-1},
$$
where $R^2$ is the standard regression coefficient of determination and $k+1$ is the number of regression coefficients.

**_Bayesian Information Criterion_**
$$
\mathrm{BIC}=n+n \log \left(2 \pi \widehat{\sigma}^{2}\right)+K \log (n).
$$
**_Akaike Information Criterion_**
$$
\mathrm{AIC}=n+n \log \left(2 \pi \widehat{\sigma}^{2}\right)+2 K.
$$

**_Mallows' $C_p$_**
$$
C_{p}=n \widehat{\sigma}^{2}+2 K \widetilde{\sigma}^{2},
$$
where $\widetilde{\sigma}^{2}$ is a preliminary estimator of $\sigma^{2}$ (typically based on fitting a large model, i.e., the one containing all the predictors).

**_Shibata_**
$$
\text{Shibata}=\widehat{\sigma}^{2}\left(1+\frac{2 K}{n}\right).
$$

**_Final Prediction Error_**
$$
\mathrm{FPE}=\widehat{\sigma}^{2}\left(\frac{1+K / n}{1-K / n}\right).
$$

**_Cross-Validation_**
$$
\mathrm{CV}=\frac{1}{n}\sum_{i=1}^{n} \widetilde{e}_{i}^{2},
$$
where $\widetilde{e}_{i}$ are the least squares leave-one-out prediction errors.

<ins>Prediction errors</ins>: We define the leave-one-out estimator as that obtained by applying an estimation formula to the sample omitting the $i$th observation, i.e.,

$$
\widehat{\boldsymbol{\beta}}_{(-i)}=\widehat{\boldsymbol{\beta}}-\frac{1}{\left(1-h_{i i}\right)}\left(\boldsymbol{X}^{\prime} \boldsymbol{X}\right)^{-1} \boldsymbol{x}_{i} \widehat{e}_{i},
$$

where $\widehat{e}_{i}$ are the least squares residuals and $h_{ii}$ are the [leverage](https://en.wikipedia.org/wiki/Leverage_(statistics)) values. We also define the leave-one-out residual or prediction error as that obtained using the leave-one-out regression estimator, thus

$$
\tilde{e}_{i}=y_{i}-x_{i}^{\prime} \widehat{\boldsymbol{\beta}}_{(-i)}=\left(1-h_{i i}\right)^{-1} \widehat{e}_{i}.
$$

We define the out-of-sample mean squared error as
$$
\tilde{\sigma}^{2}=\frac{1}{n} \sum_{i=1}^{n} \widetilde{e}_{i}^{2}=\frac{1}{n} \sum_{i=1}^{n}\left(1-h_{i i}\right)^{-2} \widehat{e}_{i}^{2}
$$
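
The leverage shortcut means the leave-one-out prediction errors need only one regression fit. The sketch below computes them with the formula $\tilde{e}_{i}=\left(1-h_{i i}\right)^{-1} \widehat{e}_{i}$ and checks the result against refitting the regression $n$ times. The simulated data and the function names are assumptions made for illustration.

```python
import numpy as np

def loo_prediction_errors(X, y):
    """Leave-one-out prediction errors from a single OLS fit via the leverage values."""
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)           # OLS residuals
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage h_ii = x_i'(X'X)^{-1} x_i
    return resid / (1.0 - h)

def loo_prediction_errors_bruteforce(X, y):
    """Same quantity, obtained by refitting with each observation dropped in turn."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ beta_i
    return out

# Simulated data, for illustration only.
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

e_tilde = loo_prediction_errors(X, y)
e_check = loo_prediction_errors_bruteforce(X, y)
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_hat = float(np.mean(resid**2))       # in-sample variance estimate
sigma2_tilde = float(np.mean(e_tilde**2))   # out-of-sample mean squared error

print("shortcut matches brute force:", np.allclose(e_tilde, e_check))
```

Because every $\left(1-h_{i i}\right)^{-2}$ is at least one, the out-of-sample mean squared error is never smaller than the in-sample $\widehat{\sigma}^{2}$.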

**_Generalized Cross-Validation_**
$$
\mathrm{GCV}=\frac{n \widehat{\sigma}^{2}}{(n-K)^{2}}.
$$
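
The criteria above that are minimized can be computed together from one OLS fit. This is a sketch under stated assumptions: the data are simulated, the candidate models are nested (the first $m$ columns of a hypothetical design matrix), the preliminary estimator $\widetilde{\sigma}^{2}$ in Mallows' $C_p$ is taken from the largest model, and `selection_criteria` is an illustrative name. Here $K$ counts the regression coefficients plus $\sigma^{2}$.

```python
import numpy as np

def selection_criteria(X, y, sigma2_prelim):
    """Selection criteria for an OLS fit of y on X, following the formulas above."""
    n, p = X.shape
    K = p + 1                                     # coefficients plus sigma^2
    XtX_inv = np.linalg.inv(X.T @ X)
    resid = y - X @ (XtX_inv @ X.T @ y)
    sigma2 = float(resid @ resid) / n
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverage values
    base = n + n * np.log(2 * np.pi * sigma2)
    return {
        "K": K,
        "AIC": base + 2 * K,
        "BIC": base + K * np.log(n),
        "Cp": n * sigma2 + 2 * K * sigma2_prelim,
        "Shibata": sigma2 * (1 + 2 * K / n),
        "FPE": sigma2 * (1 + K / n) / (1 - K / n),
        "CV": float(np.mean((resid / (1 - h)) ** 2)),
        "GCV": n * sigma2 / (n - K) ** 2,
    }

# Simulated data: only the intercept and the first two regressors matter.
rng = np.random.default_rng(1)
n = 200
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 4))])
y = X_full[:, :3] @ np.array([1.0, 2.0, -1.5]) + rng.normal(size=n)

# Preliminary variance estimate from the largest model (used by Mallows' Cp).
full_resid = y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]
sigma2_prelim = float(full_resid @ full_resid) / n

# Nested candidates: model m uses the first m columns of X_full.
table = {m: selection_criteria(X_full[:, :m], y, sigma2_prelim) for m in range(1, 6)}
names = ["AIC", "BIC", "Cp", "Shibata", "FPE", "CV", "GCV"]
best = {name: min(table, key=lambda m: table[m][name]) for name in names}
print(best)
```

For a given model, BIC and AIC differ only in the penalty, by $K(\log n-2)$, so BIC penalizes extra parameters more heavily whenever $n>e^{2}\approx 7.4$.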