# Chapter 4: Model Adequacy Checking


The major assumptions that we have made thus far in our study of regression analysis are as follows:
1. The relationship between the response $y$ and the regressors is linear, at least approximately.
2. The error term $\epsilon$ has zero mean.
3. The error term $\epsilon$ has constant variance $\sigma^2$.
4. The errors are uncorrelated.
5. The errors are normally distributed.

Taken together, assumptions 4 and 5 imply that the errors are independent random variables. Assumption 5 is required for hypothesis testing and interval estimation.
We should always consider the validity of these assumptions to be doubtful and conduct analyses to examine the adequacy of the model we have tentatively entertained. The types of model inadequacies discussed here have potentially serious consequences. Gross violations of the assumptions may yield an unstable model in the sense that a different sample could lead to a totally different model with opposite conclusions. We usually cannot detect departures from the underlying assumptions by examination of the standard summary statistics, such as the $t$ or $F$ statistics, or $R^2$. These are “global” model properties, and as such they do not ensure model adequacy.

In this chapter we present several methods useful for diagnosing violations of the basic regression assumptions. These diagnostic methods are primarily based on study of the model residuals.


## 4.2 Residual Analysis

We have defined residuals as:
$$e_i = y_i  -\hat{y}_i, \quad i=1,2,...,n.$$

Since a residual may be viewed as the deviation between the data and the fit, it is also a measure of the variability in the response variable not explained by the regression model. It is also convenient to think of the residuals as the realized or observed values of the model errors. Thus, any departures from the assumptions on the errors should show up in the residuals. 

The residuals have several important properties. They have zero mean, and their approximate average variance is estimated by:
$$MS_{Res} = \frac{SS_{Res}}{n-p}.$$
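As a minimal sketch of these definitions, the following computes the ordinary residuals and $MS_{Res}$ with NumPy. The data are simulated purely for illustration (the coefficients, sample size, and variable names such as `beta_hat` and `ms_res` are assumptions, not part of the text), and $p$ counts the intercept.

```python
import numpy as np

# Simulated data (an assumption for illustration): n = 20 points, two regressors.
rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.uniform(0, 10, n), rng.uniform(0, 5, n)])
y = X @ np.array([1.0, 2.0, -1.5]) + rng.normal(0, 1.0, n)

p = X.shape[1]                                     # number of parameters, intercept included
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]    # least-squares estimates
e = y - X @ beta_hat                               # ordinary residuals e_i = y_i - yhat_i
ms_res = (e @ e) / (n - p)                         # MS_Res = SS_Res / (n - p)

print("mean residual:", e.mean())
print("MS_Res:", ms_res)
```

Because the model contains an intercept, the residuals sum to zero up to floating-point error, which is the zero-mean property mentioned above.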

### 4.2.2 Methods for scaling residuals

Sometimes it is useful to work with scaled residuals. In this section we introduce four popular methods for scaling residuals. These scaled residuals are helpful in finding observations that are outliers, or extreme values.

#### Standardized residuals

Since the approximate average variance of a residual is estimated by $MS_{Res}$, a logical scaling for the residuals would be the standardized residuals:
$$d_i = \frac{e_i}{\sqrt{MS_{Res}}}.$$

#### Studentized residuals

Using $MS_{Res}$ as the variance of the $i$-th residual $e_i$ is only an approximation. We can improve the residual scaling by dividing $e_i$ by the exact standard deviation of the $i$-th residual. Recall that we may write the vector of residuals as:
$$e = (I-H)y.$$

It can be shown that: 
$$e = (I-H) \epsilon,$$

so the residuals are the same linear transformation of the observations $y$ and the errors $\epsilon$. The covariance matrix of residuals is:
$$Var(e) = \sigma^2 (I-H),$$
so the variance of the $i$-th residual is:
$$Var(e_i) = \sigma^2(1-h_{ii}).$$

The covariance between $e_i, e_j$ is $Cov(e_i, e_j) = -\sigma^2h_{ij}$.
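The matrix identities above can be checked numerically. The sketch below uses simulated data, so the "true" errors `eps` are known (an assumption that never holds with real data), and confirms that $(I-H)y$ and $(I-H)\epsilon$ coincide, that $I-H$ is idempotent, and that the leverages $h_{ii}$ lie between 0 and 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta = np.array([2.0, 1.0, -0.5])        # known only because the data are simulated
eps = rng.normal(0, 1.0, n)              # model errors
y = X @ beta + eps

H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix H = X (X'X)^{-1} X'
M = np.eye(n) - H                        # I - H
e_from_y = M @ y                         # residuals as a transformation of y
e_from_eps = M @ eps                     # the same transformation applied to the errors
h = np.diag(H)                           # leverages h_ii

print("max |(I-H)y - (I-H)eps|:", np.abs(e_from_y - e_from_eps).max())
print("min h_ii, max h_ii:", h.min(), h.max())
```

Since $I-H$ is symmetric and idempotent, $Var(e) = (I-H)\,\sigma^2 I\,(I-H)' = \sigma^2(I-H)$, which is where the $(1-h_{ii})$ factor in $Var(e_i)$ comes from.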

Now since $0 \leq h_{ii} \leq 1$, using the residual mean square $MS_{Res}$ to estimate the variance of the residuals actually overestimates $Var(e_i)$. Furthermore, since $h_{ii}$ is a measure of the location of the $i$-th point in $x$ space (recall the discussion of hidden extrapolation in Section 3.7), the variance of $e_i$ depends on where the point $x_i$ lies. Generally residuals near the center of the $x$ space have larger variance (poorer least-squares fit) than residuals at more remote locations. Violations of model assumptions are more likely at remote points, and these violations may be hard to detect from inspection of the ordinary residuals $e_i$ (or the standardized residuals $d_i$) because their residuals will usually be smaller.

Let $y_n$ be the observed response for the $n$-th data point, let $x_n$ be the specific values for the regressors for this data point, let $y^*_n$ be the predicted value for the response based on the other $n - 1$ data points, and let $\delta = y_n - y^*_n$ be the difference between the actually observed value for this response compared to the predicted value based on the other values. In other words, $y_n = y^*_n + \delta$. If a data point is remote in terms of the regressor values and $|\delta|$ is large, then we have an influential point. Finally, let $\hat{y}_n$ be the predicted value for the $n$-th response using all the data. It can be shown that:
$$\hat{y}_n = y^*_n + h_{nn}\delta,$$
where $h_{nn}$ is the $n$-th diagonal element of the hat matrix. If the $n$-th data point is remote in terms of the space defined by the data values for the regressors, then $h_{nn}$ approaches 1, and $\hat{y}_n$ approaches $y_n$. The remote data value "drags" the prediction to itself. 

This point is easier to see within a simple linear regression example. Let $\overline{x}^*$ be the average value of the regressor over the other $n - 1$ data points. It can be shown that:
$$\hat{y}_n = y^*_n + \bigg( \frac{1}{n} + (\frac{n-1}{n})^2\frac{(x_n - \overline{x}^*)^2}{S_{xx}} \bigg) \delta.$$
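The "dragging" effect can be illustrated with a small simulation. This is a sketch under assumed data: the last point is deliberately placed far from the others in $x$ and shifted in $y$, and the helper `fit` is a hypothetical convenience wrapper. It checks both the general identity $\hat{y}_n = y^*_n + h_{nn}\delta$ and the simple linear regression expression for $h_{nn}$, with $S_{xx}$ computed from all $n$ points.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 12
x = rng.uniform(0, 10, n)
x[-1] = 30.0                                   # make the last point remote in x space
y = 1.0 + 0.5 * x + rng.normal(0, 1.0, n)
y[-1] += 8.0                                   # and unusual in y
X = np.column_stack([np.ones(n), x])

def fit(Xm, yv):
    """Least-squares coefficients (hypothetical helper for this sketch)."""
    return np.linalg.lstsq(Xm, yv, rcond=None)[0]

y_hat_n = X[-1] @ fit(X, y)                    # prediction using all n points
y_star_n = X[-1] @ fit(X[:-1], y[:-1])         # prediction from the other n - 1 points
delta = y[-1] - y_star_n

H = X @ np.linalg.solve(X.T @ X, X.T)
h_nn = H[-1, -1]

Sxx = np.sum((x - x.mean()) ** 2)              # computed from all n points
x_bar_star = x[:-1].mean()                     # mean of the other n - 1 x values
h_nn_slr = 1.0 / n + ((n - 1) / n) ** 2 * (x[-1] - x_bar_star) ** 2 / Sxx

print("h_nn:", h_nn)
print("y_hat_n:", y_hat_n, " y*_n + h_nn*delta:", y_star_n + h_nn * delta)
```

With the remote point placed this far out, $h_{nn}$ is well above 0.5, so most of $\delta$ is absorbed into the fitted value rather than showing up in the residual.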

The bottom line is two-fold. As we discussed in Sections 2.4 and 3.4, the prediction variance for data points that are remote in terms of the regressors is large. However, these data points do draw the prediction equation to themselves. As a result, the variance of the residuals for these points is small. This combination presents complications for doing proper residual analysis.

A logical procedure, then, is to examine the studentized residuals:

$$r_i = \frac{e_i}{\sqrt{MS_{Res}(1-h_{ii})}}.$$

Standardized and studentized residuals often convey equivalent information. However, since any point with a large residual and a large $h_{ii}$ is potentially highly influential on the least-squares fit, examination of the studentized residuals is generally recommended.

Some of these points are very easy to see by examining the studentized residuals for a simple linear regression model. If there is only one regressor, it is easy to show that the studentized residuals are

$$r_i = \frac{e_i}{\sqrt{MS_{Res}( 1-(1/n + (x_i - \overline{x})^2/S_{xx}) )}}, i=1,...,n.$$
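A short sketch of both scalings for a single regressor follows. The data are simulated for illustration, and the names `d`, `r`, and `h` simply mirror the symbols $d_i$, $r_i$, and $h_{ii}$ in the text.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
x = rng.uniform(0, 10, n)
y = 3.0 + 1.2 * x + rng.normal(0, 1.0, n)      # simulated data (an assumption)
X = np.column_stack([np.ones(n), x])
p = X.shape[1]

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
ms_res = (e @ e) / (n - p)

Sxx = np.sum((x - x.mean()) ** 2)
h = 1.0 / n + (x - x.mean()) ** 2 / Sxx        # leverage for simple linear regression

d = e / np.sqrt(ms_res)                        # standardized residuals
r = e / np.sqrt(ms_res * (1.0 - h))            # (internally) studentized residuals

# The ratio |r_i| / |d_i| = 1 / sqrt(1 - h_ii) grows with distance from x-bar.
far = np.argmax(h)
print("largest leverage:", h[far], " ratio r/d there:", r[far] / d[far])
```

Because $1 - h_{ii} < 1$, every studentized residual is at least as large in magnitude as the corresponding standardized residual, and the inflation is greatest for points far from $\overline{x}$.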


#### PRESS

The standardized and studentized residuals are effective in detecting outliers. Another approach to making residuals useful in finding outliers is to examine the quantity that is computed from $y_i - \hat{y}_{(i)}$, where $\hat{y}_{(i)}$ is the fitted value of the $i$th response based on all observations except the $i$th one. 

The logic behind this is that if the $i$th observation $y_i$ is really unusual, the regression model based on all observations may be overly influenced by this observation. This could produce a fitted value $\hat{y}_i$ that is very similar to the observed value $y_i$, and consequently, the ordinary residual $e_i$ will be small. Therefore, it will be hard to detect the outlier. However, if the $i$th observation is deleted, then $\hat{y}_{(i)}$ cannot be influenced by that observation, so the resulting residual should be likely to indicate the presence of the outlier.

If we delete the $i$th observation, fit the regression model to the remaining $n − 1$ observations, and calculate the predicted value of $y_i$ corresponding to the deleted observation, the corresponding prediction error is:
$$e_{(i)} = y_i - \hat{y}_{(i)}.$$

This prediction error calculation is repeated for each observation $i=1,...,n$. These prediction errors are usually called PRESS residuals (because of their use in computing the prediction error sum of squares, discussed in Section 4.3). Some authors call the $e_{(i)}$ deleted residuals.

It turns out that the $i$th PRESS residual is:
$$e_{(i)} = \frac{e_i}{1 - h_{ii}}, i=1,...,n.$$
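The practical value of this result is that no refitting is needed. The sketch below, on simulated data (an assumption for illustration), compares the shortcut $e_i/(1-h_{ii})$ with a brute-force loop that deletes each observation in turn and refits.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 18
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 1.0, n)

# Shortcut: one fit to all n observations.
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
press_shortcut = e / (1.0 - h)

# Brute force: n separate fits, each with one observation deleted.
press_refit = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    press_refit[i] = y[i] - X[i] @ b_i         # e_(i) = y_i - yhat_(i)

print("max abs difference:", np.abs(press_shortcut - press_refit).max())
```

The two vectors agree to floating-point precision, so a single fit suffices to obtain all $n$ PRESS residuals.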

It is easy to see that the PRESS residual is just the ordinary residual weighted according to the diagonal elements of the hat matrix $h_{ii}$. Residuals associated with points for which $h_{ii}$ is large will have large PRESS residuals. These points will generally be high influence points. Generally, a large difference between the ordinary residual and the PRESS residual will indicate a point where the model fits the data well, but a model built without that point predicts poorly. In Chapter 6, we discuss some other measures of influential observations.

Finally, the variance of the $i$-th PRESS residual is:
$$Var(e_{(i)}) = \frac{\sigma^2}{1-h_{ii}},$$

so that a standardized PRESS residual is:
$$\frac{e_i}{\sqrt{\sigma^2(1-h_{ii})}},$$

which, if we use $MS_{Res}$ to estimate $\sigma^2$, is just the studentized residual discussed previously. 

#### R-student

The studentized residual $r_i$ discussed above is often considered an outlier diagnostic. It is customary to use $MS_{Res}$ as an estimate of $\sigma^2$ in computing $r_i$. This is referred to as internal scaling of the residual because $MS_{Res}$ is an internally generated estimate of $\sigma^2$ obtained from fitting the model to all $n$ observations.
Another approach would be to use an estimate of $\sigma^2$ based on a data set with the $i$th observation removed. Denote the estimate of $\sigma^2$ so obtained by $S_{(i)}^2.$ We can show that:
$$S_{(i)}^2 = \frac{(n-p)MS_{Res} - e_i^2/(1-h_{ii})}{n-p-1}.$$

This estimate of $\sigma^2$ is used instead of $MS_{Res}$ to produce an externally studentized residual, usually called R-student, given by:

$$t_i = \frac{e_i}{\sqrt{S_{(i)}^2(1-h_{ii})}}, \quad i=1,...,n.$$

In many situations $t_i$ will differ little from the studentized residual $r_i$. However, if the $i$th observation is influential, then $S^2_{(i)}$ can differ significantly from $MS_{Res}$, and thus the R-student statistic will be more sensitive to this point.
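The following sketch computes $S^2_{(i)}$ and R-student from a single fit. The data are simulated, and the first observation is deliberately contaminated (both are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 16
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 2.0]) + rng.normal(0, 1.0, n)
y[0] += 6.0                                    # contaminate the first observation
p = X.shape[1]

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))
ms_res = (e @ e) / (n - p)

# Deleted variance estimate S_(i)^2, obtained without refitting.
s2_del = ((n - p) * ms_res - e**2 / (1.0 - h)) / (n - p - 1)

r = e / np.sqrt(ms_res * (1.0 - h))            # internally studentized residuals
t = e / np.sqrt(s2_del * (1.0 - h))            # R-student (externally studentized)

print("r[0] =", r[0], " t[0] =", t[0])
```

Substituting the expression for $S^2_{(i)}$ gives the equivalent form $t_i = r_i\sqrt{(n-p-1)/(n-p-r_i^2)}$, which shows that $|t_i| > |r_i|$ exactly when $|r_i| > 1$. R-student therefore exaggerates the residuals that are already large.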


### 4.2.3 Residual plots

#### Normality plots

Small departures from the normality assumption do not affect the model greatly, but gross nonnormality is potentially more serious as the $t$ or $F$ statistics and confidence and prediction intervals depend on the normality assumption. Furthermore, if the errors come from a distribution with thicker or heavier tails than the normal, the least-squares fit may be sensitive to a small subset of the data. Heavy-tailed error distributions often generate outliers that “pull” the least-squares fit too much in their direction.

A very simple method of checking the normality assumption is to construct a normal probability plot of the residuals. This is a graph designed so that the cumulative normal distribution will plot as a straight line.

Let $t_{[1]} < t_{[2]} < \cdots < t_{[n]}$ be the externally studentized residuals ranked in increasing order. If we plot $t_{[i]}$ against the cumulative probability $P_i = (i - 1/2)/n$, $i = 1, 2, \ldots, n$, on the normal probability plot, the resulting points should lie approximately on a straight line.

Substantial departures from a straight line indicate that the distribution is not normal. Sometimes normal probability plots are constructed by plotting the ranked residual $t_{[i]}$ against the “expected normal value”,
$\Phi^{-1}[(i-1/2)/n],$ where $\Phi$ denotes the standard normal cumulative distribution function. This follows from the fact that:
$E[t_{[i]}] \approx \Phi^{-1}[(i-1/2)/n].$
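The plotting coordinates can be computed with the standard library alone. In this sketch the array `t` is a stand-in for a vector of R-student residuals (drawn here from a standard normal purely for illustration), and `NormalDist().inv_cdf` plays the role of $\Phi^{-1}$.

```python
from statistics import NormalDist
import numpy as np

rng = np.random.default_rng(6)
t = rng.standard_normal(30)                    # stand-in for R-student residuals (assumption)
n = t.size

t_sorted = np.sort(t)                          # t_[1] <= ... <= t_[n]
P = (np.arange(1, n + 1) - 0.5) / n            # cumulative probabilities (i - 1/2)/n
z = np.array([NormalDist().inv_cdf(float(pi)) for pi in P])   # expected normal values

# Points (z[i], t_sorted[i]) are what a normal probability plot displays.
# Their correlation is a rough numerical summary of straightness.
corr = np.corrcoef(z, t_sorted)[0, 1]
print("correlation between expected and ranked values:", corr)
```

Plotting `t_sorted` against `z` (for example with a scatter plot) gives the normal probability plot; a correlation close to 1 corresponds to points lying near a straight line.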

Figure 4.3a displays an “idealized” normal probability plot. Notice that the points lie approximately along a straight line. Panels b–e present other typical problems. Panel b shows a sharp upward and downward curve at both extremes, indicating that the tails of this distribution are too light for it to be considered normal. Conversely, panel c shows flattening at the extremes, which is a pattern typical of samples from a distribution with heavier tails than the normal. Panels d and e exhibit patterns associated with positive and negative skew, respectively.

<img src='images/4_3.png' style='height: 500px;'>


Small sample sizes ($n \leq 16 $) often produce normal probability plots that deviate substantially from linearity. For larger sample sizes ($n \geq 32 $) the plots are much better behaved. Usually about 20 points are required to produce normal probability plots that are stable enough to be easily interpreted.

Andrews [1979] and Gnanadesikan [1977] note that normal probability plots often exhibit no unusual behavior even if the errors $\epsilon_i$ are not normally distributed. This problem occurs because the residuals are not a simple random sample; they are the remnants of a parameter estimation process. The residuals are actually linear combinations of the model errors (the $\epsilon_i$). Thus, fitting the parameters tends to destroy the evidence of nonnormality in the residuals, and consequently we cannot always rely on the normal probability plot to detect departures from normality.

A common defect that shows up on the normal probability plot is the occurrence of one or two large residuals. Sometimes this is an indication that the corresponding observations are outliers.


#### Plot of residuals against the fitted values

A plot of the residuals (preferably the externally studentized residuals, $t_i$) versus the corresponding fitted values $\hat{y}_i$ is useful for detecting several common types of model inadequacies. If this plot resembles Figure 4.5a, which indicates that the residuals can be contained in a horizontal band, then there are no obvious model defects. Plots of $t_i$ versus $\hat{y}_i$ that resemble any of the patterns in panels b–d are symptomatic of model deficiencies.

The patterns in panels $b$ and $c$ indicate that the variance of the errors is not constant. The outward-opening funnel pattern in panel $b$ implies that the variance is an increasing function of $y$ [an inward-opening funnel is also possible, indicating that $Var(\epsilon)$ increases as $y$ decreases]. The double-bow pattern in panel $c$ often occurs when $y$ is a proportion between 0 and 1, because the variance of a binomial proportion near 0.5 is greater than one near 0 or 1. The usual approach for dealing with inequality of variance is to apply a suitable transformation to either the regressor or the response variable (see Sections 5.2 and 5.3) or to use the method of weighted least squares (Section 5.5). In practice, transformations on the response are generally employed to stabilize variance.

<img src='images/4_5.png' style='height: 500px;'>

A curved plot such as in panel $d$ indicates nonlinearity. This could mean that other regressor variables are needed in the model. For example, a squared term may be necessary. Transformations on the regressor and/or the response variable may also be helpful in these cases.

#### Plot of Residuals against the Regressor

Plotting the residuals against the corresponding values of each regressor variable can also be helpful. These plots often exhibit patterns such as those in Figure 4.5, except that the horizontal scale is $x_{ij}$ for the $j$th regressor rather than $\hat{y}_i$. Once again an impression of a horizontal band containing the residuals is desirable. The funnel and double-bow patterns in panels $b$ and $c$ indicate nonconstant variance. The curved band in panel $d$ or a nonlinear pattern in general implies that the assumed relationship between $y$ and the regressor $x_j$ is not correct. Thus, either higher order terms in $x_j$ (such as $x^2_j$ ) or a transformation should be considered.

#### Plot of Residuals in Time Sequence

If the time sequence in which the data were collected is known, it is a good idea to plot the residuals against time order. Ideally, this plot will resemble Figure 4.5a; that is, a horizontal band will enclose all of the residuals, and the residuals will fluctuate in a more or less random fashion within this band. However, if this plot resembles the patterns in Figures 4.5b–d, this may indicate that the variance is changing with time or that linear or quadratic terms in time should be added to the model.

The time sequence plot of residuals may indicate that the errors at one time period are correlated with those at other time periods. The correlation between model errors at different time periods is called autocorrelation. A plot such as Figure 4.8a indicates positive autocorrelation, while Figure 4.8b is typical of negative autocorrelation. The presence of autocorrelation is a potentially serious violation of the basic regression assumptions. Methods for detecting autocorrelation and remedial measures are discussed in Chapter 14.

### 4.2.4 Partial Regression and Partial Residual Plots

We noted in Section 4.2.3 that a plot of residuals versus a regressor variable is useful in determining whether a curvature effect for that regressor is needed in the model. A limitation of these plots is that they may not show the complete marginal effect of a regressor, given the other regressors in the model. A partial regression plot is a variation of the plot of residuals versus the predictor that offers an enhanced way to study the marginal relationship of a regressor given the other variables that are in the model. This plot can be very useful in evaluating whether we have specified the relationship between the response and the regressor variables correctly. Sometimes the partial regression plot is called the added-variable plot or the adjusted-variable plot. Partial regression plots can also be used to provide information about the marginal usefulness of a variable that is not currently in the model.

Partial regression plots consider the marginal role of the regressor $x_j$ given other regressors that are already in the model. In this plot, the response variable $y$ and the regressor $x_j$ are both regressed against the other regressors in the model and the residuals obtained for each regression. The plot of these residuals against each other provides information about the nature of the marginal relationship for regressor $x_j$ under consideration.

To illustrate, suppose we are considering a first-order multiple regression model with two regressor variables, that is, $y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon$. We are concerned about the nature of the marginal relationship for regressor $x_1$; in other words, is the relationship between $y$ and $x_1$ correctly specified? First we would regress $y$ on $x_2$ and obtain the fitted values and residuals:

$$\hat{y}_i(x_2) = \hat{\theta}_0 + \hat{\theta}_1 x_{i2}$$
$$e_i(y | x_2) = y_i - \hat{y}_i(x_2), \quad i=1,...,n.$$

Now regress $x_1$ on $x_2$ and calculate the residuals:
$$\hat{x}_{i1}(x_2) = \hat{\alpha}_0 + \hat{\alpha}_1 x_{i2},$$
$$e_i(x_1 | x_2) = x_{i1} - \hat{x}_{i1}(x_2), \quad i=1,...,n.$$

The partial regression plot for regressor variable $x_1$ is obtained by plotting the $y$ residuals $e_i(y|x_2)$ against the $x_1$ residuals $e_i(x_1 | x_2)$. If the regressor $x_1$ enters the model linearly, then the partial regression plot should show a linear relationship, that is, the partial residuals will fall along a straight line with a nonzero slope. The slope of this line will be the regression coefficient of $x_1$ in the multiple linear regression model. If the partial regression plot shows a curvilinear band, then higher order terms in $x_1$ or a transformation may be helpful. When $x_1$ is a candidate variable being considered for inclusion in the model, a horizontal band on the partial regression plot indicates that there is no additional useful information in $x_1$ for predicting $y$.
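The claim about the slope can be verified numerically. In the sketch below the data are simulated with correlated regressors (an assumption for illustration), and `residuals` is a hypothetical helper that regresses one variable on another with an intercept.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)             # x1 is correlated with x2
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(0, 0.5, n)

def residuals(target, regressor):
    """Residuals from regressing `target` on `regressor` plus an intercept."""
    Z = np.column_stack([np.ones(len(regressor)), regressor])
    coef = np.linalg.lstsq(Z, target, rcond=None)[0]
    return target - Z @ coef

e_y = residuals(y, x2)                         # e_i(y | x2)
e_x1 = residuals(x1, x2)                       # e_i(x1 | x2)

# Both residual vectors have mean zero, so the fitted line passes through the origin.
slope = (e_x1 @ e_y) / (e_x1 @ e_x1)

# Coefficient of x1 in the full multiple regression, for comparison.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

print("partial regression slope:", slope)
print("beta_hat_1 from full fit:", beta_hat[1])
```

Plotting `e_y` against `e_x1` gives the partial regression plot for $x_1$; the two printed values agree up to floating-point error.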

## 4.3 PRESS statistic

The PRESS residuals are defined as:
$$e_{(i)} = y_i - \hat{y}_{(i)},$$
where $\hat{y}_{(i)}$ is the predicted value of the $i$-th observed response based on a model fit to the remaining $n-1$ sample points. We noted that large PRESS residuals are potentially useful in identifying observations where the model does not fit the data well or observations for which the model is likely to provide poor future predictions.

The PRESS statistic is:
$$PRESS = \sum_{i=1}^{n} (y_i - \hat{y}_{(i)})^2 = \sum_{i=1}^{n} \bigg( \frac{e_i}{1-h_{ii}} \bigg)^2.$$

PRESS is generally regarded as a measure of how well a regression model will perform in predicting new data. A model with a small value of PRESS is desired.

The PRESS statistic can be used to compute an $R^2$-like statistic for prediction, say:
$$R^2_{prediction} = 1-\frac{PRESS}{SS_T}.$$
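A minimal sketch of both quantities, computed from a single fit on simulated data (the data and variable names are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = X @ np.array([2.0, 1.5, -0.7]) + rng.normal(0, 1.0, n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ beta_hat
h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))

press = np.sum((e / (1.0 - h)) ** 2)           # PRESS = sum of squared PRESS residuals
ss_res = e @ e                                 # residual sum of squares
ss_t = np.sum((y - y.mean()) ** 2)             # total (corrected) sum of squares

r2 = 1.0 - ss_res / ss_t                       # ordinary R^2
r2_pred = 1.0 - press / ss_t                   # R^2 for prediction

print("R^2:", r2, " R^2_prediction:", r2_pred)
```

Since each PRESS residual is at least as large in magnitude as the corresponding ordinary residual, PRESS is never smaller than $SS_{Res}$, and $R^2_{prediction}$ never exceeds the ordinary $R^2$. A large gap between the two suggests that the fit depends heavily on a few high-leverage points.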