# Specification: Choosing the Independent Variables

## Omitted Variables

Say you forget to include one of the relevant independent variables when you first specify an equation or you cannot get data for one of the variables that you *do* think of. These are examples of situations with an **omitted variable**, defined as an important explanatory variable that has been left out of a regression equation.

The bias caused by leaving a variable out of an equation is called **omitted variable bias** (or, more generally, **specification bias**). In an equation, the coefficient $\beta_k$ represents the change in the dependent variable $Y$ associated with a one-unit increase in the independent variable $X_k$, holding constant all the other independent variables. If a variable is omitted, it is not included as an independent variable, so it is not held constant in the calculation and interpretation of $\hat{\beta_k}$. The omission can cause bias: it can push the expected value of the estimated coefficient away from the true value of the population coefficient.


## The Consequences of an Omitted Variable

The major consequence of omitting a relevant independent variable from an equation is to cause bias in the regression coefficients that remain in the equation. Say the true regression model is 
$$
\begin{equation}
Y_i = \beta_0 + \beta_1X_{1i} + \beta_2X_{2i} + \epsilon_i
\end{equation}
$$

where $\epsilon_i$ is a classical error term. If you omit $X_2$ from the equation then it becomes 
$$
\begin{equation}
Y_i = \beta_0 + \beta_1X_{1i} + \epsilon_i^* \label{eg6_2}
\end{equation}
$$

where 
$$
\begin{equation}
\epsilon_i^* = \epsilon_i + \beta_2X_{2i} \label{eg6_3}
\end{equation}
$$

because the stochastic term includes the effects of any omitted variables. From these two equations, it might seem as though we could get unbiased estimates of $\beta_0$ and $\beta_1$ even if we left $X_2$ out of the equation. Unfortunately, this is not the case, because the included coefficients almost surely pick up some of the effect of the omitted variable and therefore will change, causing bias.

If we leave an important variable out of an equation, we violate Classical Assumption III, which says that the explanatory variables are uncorrelated with the error term, unless the omitted variable is uncorrelated with **all** the included independent variables. In general, when one of the Classical Assumptions is violated, the Gauss-Markov Theorem does not hold, and the OLS estimates are not BLUE: they are biased, or no longer minimum variance, or both.

An omitted variable causes Classical Assumption III to be violated in a way that causes bias. Estimating the equation without $X_2$ when the equation with $X_2$ is the truth will cause bias. This means that
$$
\begin{equation}
\mathbb{E}(\hat{\beta_1}) \neq \beta_1
\end{equation}
$$

Instead of having an expected value equal to the true $\beta_1$, the estimate will compensate for the fact that $X_2$ is missing from the equation. If $X_1$ and $X_2$ are correlated and $X_2$ is omitted, the OLS estimation procedure will attribute to $X_1$ variations in $Y$ actually caused by $X_2$, and a biased estimate will result.


To see this, consider a production function that states that output, $Y$, depends on the amount of labour, $X_1$, and capital, $X_2$. What would happen if data on capital were unavailable for some reason and $X_2$ was omitted from the equation? We would leave the impact of capital out of the model. This omission would almost surely bias the estimate of the coefficient of labour because it is likely that capital and labour are positively correlated. The OLS program would attribute to labour the increase in output actually caused by capital, to the extent that labour and capital were correlated. Thus the bias would be a function of the *impact of capital on output and the correlation between capital and labour*.


To generalize for a model with two independent variables, the expected value of the estimated coefficient of an included variable, $X_1$, when a *relevant* variable, $X_2$, is omitted from the equation equals
$$
\begin{equation}
\mathbb{E}(\hat{\beta_1}) = \beta_1 + \beta_2 \cdot \alpha_1 
\end{equation}
$$

where $\alpha_1$ is the slope coefficient of the secondary regression that relates $X_2$ to $X_1$:
$$
\begin{equation}
X_{2i} = \alpha_0 + \alpha_1X_{1i} + u_i
\end{equation}
$$


where $u_i$ is a classical error term. $\alpha_1$ can be expressed as a function of the correlation between $X_1$ and $X_2$, the included and excluded variables, or $f(r_{12})$. The expected-value equation above therefore states that the expected value of the included variable's coefficient is equal to its true value plus the omitted variable's true coefficient times a function of the correlation between the included (in) and the omitted (om) variables. Since the expected value of an unbiased estimate is the true value, the right-hand term measures the omitted variable bias in the equation:

$$
\begin{equation}
\text{Bias} = \beta_2\alpha_1 \quad \text{or} 
\quad \text{Bias} = \beta_{\text{om}} \cdot f(r_{\text{in, om}})
\end{equation}
$$


In general terms, the bias thus equals $\beta_{\text{om}}$, the coefficient of the omitted variable, times $f(r_{\text{in, om}})$, a function of the correlation between the included and omitted variables. The bias exists unless:
- the true coefficient of the omitted variable equals zero, or
- the included and omitted variables are uncorrelated.

The term $\beta_{\text{om}}f(r_{\text{in, om}})$ is the amount of specification bias introduced into the estimate of the coefficient of the included variable by leaving out the omitted variable.
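The bias formula can be checked numerically. The following is a minimal sketch on simulated data, assuming NumPy is available; the sample size, the coefficient values and the helper `ols` are illustrative assumptions, not part of the original example. In any given sample, the coefficient from the equation that omits $X_2$ equals $\hat{\beta_1} + \hat{\beta_2}\hat{\alpha_1}$ from the full and secondary regressions, which is the sample counterpart of the expected-value equation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data-generating process (all values are illustrative assumptions)
n = 500
beta0, beta1, beta2 = 1.0, 2.0, 3.0
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)   # X2 is positively correlated with X1
y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) from a least-squares fit."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_long = ols(y, x1, x2)   # correctly specified equation
b_short = ols(y, x1)      # X2 omitted
a = ols(x2, x1)           # secondary regression of X2 on X1

alpha1_hat = a[1]
implied = b_long[1] + b_long[2] * alpha1_hat

print(f"beta1_hat, X2 included           : {b_long[1]:.3f}")
print(f"beta1_hat, X2 omitted            : {b_short[1]:.3f}")
print(f"beta1_hat + beta2_hat*alpha1_hat : {implied:.3f}")
```

Because $\beta_2$ and $\alpha_1$ are both positive in this setup, the coefficient from the short regression is pushed upward, which is the positive bias that the formula predicts.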

## An Example of Specification Bias

Consider the following equation for the annual consumption of chicken in the US.
$$
\begin{align*}
\label{eg6_8}
\hat{Y_t} &= 27.6 - {0.61PC_t} + {0.09PB_t} + {0.24YD_t}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.16) 
\>\>\>\>\>\>\>\>\>\> (0.04)
\>\>\>\>\>\>\>\>\>\> (0.011)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> -3.86 
\>\>\>\>\>\>\>\>\> +2.31
\>\>\>\>\>\>\> +22.07\\
\bar{R^2}&= 0.990 \quad\quad N=40 \text{  (Annual 1960 -1999)}
\end{align*}
$$

where
- $Y_t$ is the per capita consumption (in pounds) in year $t$
- $PC_t$ is the price of chicken (in cents per pound) in year $t$
- $PB_t$ is the price of beef (in cents per pound) in year $t$
- $YD_t$ is the per capita disposable income (in hundreds of dollars) in year $t$

This is a simple demand-for-chicken model that includes the price of the good, the price of a close substitute, and an income variable. If we estimate the equation without the price of the substitute, we get

$$
\begin{align*}
\label{eg6_9}
\hat{Y_t} &= 27.5 - {0.42PC_t} + {0.27YD_t}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.14) 
\>\>\>\>\>\>\>\>\>\> (0.005)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\>\> -2.95 
\>\>\>\>\>\>\>\>\> +55.00\\
\bar{R^2}&= 0.988 \quad\quad N=40 \text{  (Annual 1960 -1999)}
\end{align*}
$$

Compare the two estimated equations to see whether dropping the beef price variable had an impact. In terms of overall fit, $\bar{R^2}$ fell from $0.990$ to $0.988$ (by $0.002$) when $PB$ was dropped, exactly what we would expect when a relevant variable is omitted.


Dropping $PB$ caused $\hat{\beta}_{PC}$ to become more positive, moving from $-0.61$ to $-0.42$. Similarly, $\hat{\beta}_{YD}$ shifted in the same direction, from $0.24$ to $0.27$. The direction of this bias, by the way, is considered *positive* because the biased coefficient of $PC$, $-0.42$, is more positive than the suspected unbiased one, $-0.61$, and the biased coefficient of $YD$, $0.27$, is more positive than the suspected unbiased one, $0.24$.

The fact that the bias is positive could have been guessed before any regressions were run by using the expected bias equation, $\text{Bias} = \beta_{\text{om}} \cdot f(r_{\text{in, om}})$. The specification bias caused by omitting $PB$ is expected to be positive because the expected sign of the coefficient of $PB$ is positive and the expected correlation between the prices of beef and chicken is positive.
$$
\begin{align*}
\text{Expected bias in } \hat{\beta}_{PC} &= \beta_{PB} \cdot f(r_{PC, PB}) = (+) \cdot (+) = (+) \quad \text{and}\\
\text{Expected bias in } \hat{\beta}_{YD} &= \beta_{PB} \cdot f(r_{YD, PB}) = (+) \cdot (+) = (+)
\end{align*}
$$

Both correlation coefficients are anticipated to be positive. Economic theory suggests that the prices of substitutes such as chicken and beef tend to move together, and that rising income tends to be accompanied by a rising price of beef.

To sum up, if a relevant variable is left out of a regression equation,
- there is no longer an estimate of the coefficient of that variable in the equation
- the coefficients of the remaining variables are likely to be biased


## Correcting for an Omitted Variable

The solution to a specification error seems easy - add the omitted variable to the equation. However, it is easier said than done.

First, omitted variable bias is hard to detect. The amount of bias might be small, especially when there is no reason to believe that you have misspecified the model. Some indications of specification bias are obvious, but others are not. The best indicators of an omitted relevant variable are the theoretical underpinnings of the model itself. The best way to avoid omitting an important variable is to think through the equation and model before entering anything into the computer.

Second, there is the problem of choosing which variable to add to an equation once you have decided that it is suffering from omitted variable bias. Some beginning researchers will add all the possible relevant variables to the equation at once, but this leads to less precise estimates. Others will test a number of different variables and keep the one in the equation that does the best statistical job of appearing to reduce the bias. This technique is invalid because the variable that best corrects a case of specification bias might do so only by chance rather than by being the true solution to the problem. It might give superb statistical results but fail to describe the characteristics of the true population.

If the estimated coefficient is significantly different from our expectations in sign or magnitude, then it is extremely likely that some sort of specification bias exists in our model. If an unexpected result leads you to believe that you have an omitted variable, one way to decide which variable to add to the equation is to use expected bias analysis. **Expected bias** is the likely bias that omitting a particular variable would have caused in the estimated coefficient of one of the included variables. It can be estimated with:

$$
\begin{equation*}
\text{Bias} = \beta_{\text{om}} \cdot f(r_{\text{in, om}}) 
\end{equation*}
$$



If the sign of the expected bias is the same as the sign of your unexpected result then the variable might be the source of the apparent bias. If the sign of the expected bias is *not* the same as the sign of your unexpected result then the variable is extremely unlikely to have caused your unexpected result. Expected bias analysis should only be used when you are choosing between theoretically sound potential variables.

As an example, return to the chicken demand equation estimated without $PB$. Assume you expected the coefficient of $PC$ to be around $-1.0$ and that you were surprised by the unexpectedly positive (that is, less negative) estimate of $-0.42$. This unexpectedly positive result could have been caused by an omitted variable with positive expected bias. One such variable is the price of beef. The expected bias in $\hat{\beta}_{PC}$ due to leaving out $PB$ is positive since both the expected coefficient of $PB$ and the expected correlation between $PC$ and $PB$ are positive. Hence, the price of beef is a reasonable candidate to be the omitted variable in that equation.
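The sign rule used in expected bias analysis is simple enough to write down as a tiny helper. This is only an illustrative sketch: the function name `expected_bias_sign` is hypothetical, and it assumes that $f(r_{\text{in, om}})$ has the same sign as the correlation itself.

```python
def expected_bias_sign(sign_beta_om: int, sign_corr: int) -> int:
    """Sign of beta_om * f(r_in,om), given the two expected signs (+1, 0 or -1).

    Assumes f(r) has the same sign as the correlation r.
    """
    return sign_beta_om * sign_corr


# Chicken demand example: PB omitted, PC included.
# Expected sign of the coefficient of PB: +; expected correlation of PC and PB: +.
bias_sign = expected_bias_sign(+1, +1)

# The estimated coefficient of PC was more positive than expected.
surprise_sign = +1

print("PB is a candidate omitted variable:", bias_sign == surprise_sign)
```

A variable whose expected bias has the opposite sign to the surprise would be ruled out as the likely source of the unexpected result.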

## Irrelevant Variables

When you include a variable in an equation that does not belong there, you are adding an **irrelevant variable**. This is the converse of an omitted variable and can be analysed using the same model as the omitted variable case. The addition of a variable to an equation where it does not belong does not cause bias, but it increases the variances of the estimated coefficients of the included variables. 

## Impact of Irrelevant Variables

If the true regression specification is
$$
\begin{equation}
Y_i = \beta_0 + \beta_1X_{1i} + \epsilon_i 
\end{equation}
$$

but the researcher for some reason includes an extra variable,
$$
\begin{equation}
Y_i = \beta_0 + \beta_1X_{1i} + \beta_{2}X_{2i} + \epsilon_i^{**} 
\end{equation}
$$

where the misspecified error term is
$$
\begin{equation}
\epsilon_i^{**} = \epsilon_i - \beta_{2}X_{2i}
\end{equation}
$$

Such a mistake would not cause bias if the true coefficient of the extra variable is zero. That is, $\hat{\beta}_1$ in the equation with the extra variable is unbiased when $\beta_{2} = 0$. However, the inclusion of an irrelevant variable will increase the variance of the estimated coefficients, and this increased variance will tend to decrease the absolute value of the $t$-scores. An irrelevant variable will also usually decrease $\bar{R^2}$ (though not $R^2$).
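A small Monte Carlo experiment can illustrate both claims. The sketch below assumes NumPy is available; the design (sample size, number of replications, coefficient values, and an irrelevant $X_2$ that is strongly correlated with $X_1$) is entirely hypothetical. It re-draws the error term many times and compares the estimates of $\beta_1$ with and without the irrelevant variable.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: true model is Y = 1 + 2*X1 + e, and X2 is irrelevant
n, reps = 50, 4000
beta0, beta1 = 1.0, 2.0
x1 = rng.normal(size=n)
x2 = x1 + 0.5 * rng.normal(size=n)   # irrelevant, but correlated with X1

X_true = np.column_stack([np.ones(n), x1])        # correct specification
X_extra = np.column_stack([np.ones(n), x1, x2])   # irrelevant X2 included

b1_true = np.empty(reps)
b1_extra = np.empty(reps)
for r in range(reps):
    y = beta0 + beta1 * x1 + rng.normal(size=n)
    b1_true[r] = np.linalg.lstsq(X_true, y, rcond=None)[0][1]
    b1_extra[r] = np.linalg.lstsq(X_extra, y, rcond=None)[0][1]

print(f"mean of beta1_hat without X2: {b1_true.mean():.3f}")
print(f"mean of beta1_hat with X2   : {b1_extra.mean():.3f}")
print(f"variance without X2         : {b1_true.var():.4f}")
print(f"variance with X2            : {b1_extra.var():.4f}")
```

Both averages should sit close to the true value of 2, while the variance of $\hat{\beta}_1$ is noticeably larger when the irrelevant variable is included.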

## An Example of an Irrelevant Variable

Return to the annual consumption of chicken example. Recall that
$$
\begin{align*}
\begin{split}
\hat{Y_t} &= 27.6 - {0.61PC_t} + {0.09PB_t} + {0.24YD_t}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.16) 
\>\>\>\>\>\>\>\>\>\> (0.04)
\>\>\>\>\>\>\>\>\>\> (0.01)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> -3.86 
\>\>\>\>\>\>\>\>\> +2.31
\>\>\>\>\>\>\> +22.07\\
\bar{R^2}&= 0.990 \quad\quad N=40 \text{  (Annual 1960 -1999)}
\end{split}
\end{align*}
$$

Suppose you hypothesize that the demand for chicken also depends on $IR$, the interest rate. With the new, but irrelevant, variable, the equation becomes
$$
\begin{align}
\label{eg6_13}
\begin{split}
\hat{Y_t} &= 27.6 - {0.58PC_t} + {0.12PB_t} + {0.24YD_t} - {0.14IR_t}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.16) 
\>\>\>\>\>\>\>\>\>\> (0.05)
\>\>\>\>\>\>\>\>\>\> (0.01)
\>\>\>\>\>\>\>\>\>\> (0.18)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> -3.64 
\>\>\>\>\>\>\>\>\> +2.33
\>\>\>\>\>\>\> +18.72
\>\>\>\>\>\> -0.81\\
\bar{R^2}&= 0.989 \quad\quad N=40 \text{  (Annual 1960 -1999)}
\end{split}
\end{align}
$$

A comparison of the equations with and without $IR$ illustrates the theory above. $\bar{R^2}$ has fallen, indicating a reduction in fit adjusted for degrees of freedom. None of the estimated regression coefficients changed significantly. The standard errors increased or remained the same. The $t$-score for the interest rate is very small, indicating that its coefficient is not significantly different from zero. Given the theoretical shakiness of the new variable, these results indicate that it is irrelevant and never should have been included in the regression. 

## Four Important Specification Criteria
The four criteria used to decide whether a given variable belongs in the equation are:
- **Theory** - is the variable's place in the equation theoretically sound and unambiguous?
- **$t$-test** - is the variable's estimated coefficient significant in the expected direction?
- **$\bar{R^2}$** - does the overall fit of the equation (adjusted for degrees of freedom) improve when the variable is added to the equation?
- **Bias** - Do other variables' coefficients change significantly when the variable is added into the equation?

If all of these conditions hold then the variable belongs in the equation. If none of them hold then it does not and can be safely excluded from the equation. 

## An Illustration of the Misuse of Specification Criteria

Since economic theory is the most important test for including a variable, consider an example of why a variable should *not* be dropped from an equation simply because it has an insignificant $t$-score.

Suppose you believe that the demand for Brazilian coffee in the US is a negative function of the real price of Brazilian coffee, $P_{bc}$, and a positive function of both the real price of tea, $P_t$, and real disposable income in the US, $Y_d$. Suppose further that you obtain data, run the implied regression and observe the following results
$$
\begin{align}
\widehat{COFFEE} &= 9.1 + {7.8P_{bc}} + {2.4P_t} + {0.0035Y_d}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (15.6) 
\>\>\>\>\> (1.2)
\>\>\>\>\>\>\>(0.0010)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\> +0.5 
\>\>\>\>\> +2.0
\>\>\>\>\>\>\>\>+3.5\\
\bar{R^2}&= 0.60 \quad\quad N=25
\end{align}
$$

The coefficients of $P_t$ and $Y_d$ appear to be fairly significant in the direction you hypothesized. But $P_{bc}$ appears to have an insignificant coefficient with an unexpected sign. If you think that there is a possibility that the demand for Brazilian coffee is perfectly price inelastic (coefficient is zero) you run the same equation without the price variable, obtaining
$$
\begin{align}
\widehat{COFFEE} &= 9.3 + {2.6P_t} + {0.0036Y_d}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (1.0) 
\>\>\>\>\>\> (0.0009)\\
t&=\>\>\>\>\>\>\>\>\>\>\> +2.6 
\>\>\>\>\>\>\> +4.0\\
\bar{R^2}&= 0.61 \quad\quad N=25
\end{align}
$$

By comparing the equations with and without $P_{bc}$, we can apply the four specification criteria introduced earlier:
- Since the demand for coffee could be perfectly price inelastic, the theory behind dropping the variable might be plausible
- The $t$-score of the possibly irrelevant variable is 0.5, insignificant at any level
- The $\bar{R^2}$ increased when the variable was dropped, indicating that the variable is irrelevant
- The coefficients changed little when $P_{bc}$ was dropped, suggesting that there was little, if any, bias caused by dropping the variable

Based on this analysis you could conclude that the demand for Brazilian coffee is indeed price inelastic and that the variable is therefore irrelevant. However, this is not the end of the story.

Although the elasticity of demand for coffee is generally low, it is hard to believe that Brazilian coffee is immune to price competition from other kinds of coffee. Indeed, one would expect a bit of sensitivity in the demand for Brazilian coffee with respect to the price of Colombian coffee, a substitute. To test this hypothesis, the price of Colombian coffee, $P_{cc}$, was added to the original equation:
$$
\begin{align}
\widehat{COFFEE} &= 10.0 + {8.0P_{cc}}  -{5.6P_{bc}} + {2.6P_t} + {0.0030Y_d}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (4.0) 
\>\>\>\>\>\>\>\> (2.0)
\>\>\>\>\>\>\>(1.3)
\>\>\>\>\>\>\> (0.0010)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\> +2.0 
\>\>\>\>\> -2.8
\>\>\>\>\> +2.0
\>\>\>\>\>\>\>\> +3.0\\
\bar{R^2}&= 0.65 \quad\quad N=25
\end{align}
$$

Comparing the original equation with this new one, once again using the four specification criteria:
- Both prices should have been included in the model; the theoretical justification is strong
- The $t$-score of the new variable is 2.0, significant at most levels
- The value of $\bar{R^2}$ increased with the addition of the new variable, indicating that the variable was an omitted variable
- Two of the coefficients did not change significantly, indicating that the correlations between these variables and the price of Colombian coffee are low. However, the coefficient for the price of Brazilian coffee changed significantly, indicating bias in the original result.

Since the expected sign of the coefficient of the omitted variable, $P_{cc}$, is positive and the simple correlation coefficient between the two competitive prices, $r_{P_{bc}, P_{cc}}$, is also positive, the direction of the expected bias in $\hat{\beta}_{P_{bc}}$ is positive. This positive bias is visible in the estimates: with $P_{cc}$ omitted, the coefficient of $P_{bc}$ was $+7.8$, whereas with $P_{cc}$ included it is $-5.6$.

## Specification Searches

One of the weaknesses of econometrics is that a researcher could potentially manipulate a data set to produce almost *any* result by specifying different regressions until estimates with the desired properties are obtained. The subject of how to search for the best specification is quite controversial among econometricians. This section does not aim to settle the controversy but to provide some guidance and insight for beginning researchers.

## Best Practices in Specification Searches

The recommendations for specification searches include

- Rely on theory rather than statistical fit as much as possible when choosing variables, functional forms and the like
- Minimize the number of equations estimated
- Reveal, in a footnote or appendix, all alternative specifications estimated

If theory, not $\bar{R^2}$ or $t$-scores, is the most important criterion for the inclusion of a variable in a regression equation, then it follows that most of the work of specifying a model should be done before you attempt to estimate the equation. 

## Sequential Specification Searches

The **sequential specification search** technique allows a researcher to estimate an undisclosed number of regressions and then present a final choice as if it were the only specification estimated. Such a method misstates the statistical validity of the regression results for two reasons:

- The statistical significance of the results is overestimated because the estimations of the previous regressions are ignored
- The expectations used by the researcher to choose between various regression results rarely, if ever, are disclosed. Thus the reader has no way of knowing whether or not all the other regression results had opposite signs or insignificant coefficients for the important variables

Unfortunately, there is no universally accepted way of conducting sequential searches, primarily because the appropriate test at one stage in the process depends on which tests previously were conducted, and also because the tests have been very difficult to invent. One possibility is to reduce the degrees of freedom in the "final" equation by one for each alternative specification attempted. This procedure is far from exact, but it does impose an explicit penalty for specification searches.

Instead, we recommend keeping the number of regressions estimated as low as possible, focusing on theoretical considerations when choosing variables or functional forms, and documenting all the various specifications investigated. That is, we recommend combining parsimony (using theory and analysis to limit the number of specifications estimated) with disclosure (reporting all the equations estimated).

## Bias Caused by Relying on the $t$-Test to Choose Variables

We stated in the previous section that sequential specification searches are likely to mislead researchers about the statistical properties of their results. In particular, the practice of dropping a potential independent variable simply because its $t$-score indicates that its estimated coefficient is insignificantly different from zero will cause systematic bias in the estimated coefficients (and their $t$-scores) of the remaining variables.

Consider the hypothesized model:
$$
\begin{equation}
Y_i = \beta_0 + \beta_{1}X_{1i} + \beta_{2}X_{2i} + \epsilon_i
\end{equation}
$$

Assume further that, on the basis of theory, we are certain that $X_1$ belongs in the equation but we are not certain that $X_2$ belongs. Even though we have stressed our four criteria for determining whether $X_2$ should be included, many researchers just use the $t$-test on $\hat{\beta_2}$ to decide. If this preliminary $t$-test indicates that $\hat{\beta_2}$ is not significantly different from zero, then these researchers drop $X_2$ from the equation and consider $Y$ to be a function of $X_1$ alone.

Two kinds of mistakes could be made with such a system. First, $X_2$ can sometimes be left in the equation when it does not belong there, but this does not change the expected value of $\hat{\beta}_1$. This is a Type I Error.

Second, $X_2$ sometimes can be dropped from the equation when it belongs. This is a Type II Error. In this second case, the estimated coefficient of $X_1$ will be biased by the value of the true $\beta_2$ to the extent that $X_1$ and $X_2$ are correlated.
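The systematic bias created by this "drop it if the $t$-score is low" rule can be illustrated with a simulation. The following is a rough sketch assuming NumPy is available; the design values and the rule-of-thumb critical value of 2.0 are hypothetical choices, not taken from the text. In every replication $X_2$ truly belongs in the equation, but it is dropped whenever its $t$-score falls below the critical value.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative assumptions: X2 truly belongs (beta2 = 0.5) and is correlated with X1
n, reps = 40, 4000
beta0, beta1, beta2 = 0.0, 1.0, 0.5
t_crit = 2.0   # rough rule-of-thumb critical value, not an exact table value

x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)
X_long = np.column_stack([np.ones(n), x1, x2])
X_short = X_long[:, :2]
XtX_inv = np.linalg.inv(X_long.T @ X_long)

b1_always_long = np.empty(reps)   # X2 always kept
b1_pretest = np.empty(reps)       # X2 kept only if its t-score is "significant"
for r in range(reps):
    y = beta0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)
    b = XtX_inv @ X_long.T @ y
    resid = y - X_long @ b
    s2 = resid @ resid / (n - 3)               # estimated error variance
    t2 = b[2] / np.sqrt(s2 * XtX_inv[2, 2])    # t-score of beta2_hat
    b1_always_long[r] = b[1]
    if abs(t2) < t_crit:
        # X2 fails the preliminary t-test: drop it and re-estimate
        b1_pretest[r] = np.linalg.lstsq(X_short, y, rcond=None)[0][1]
    else:
        b1_pretest[r] = b[1]

print(f"true beta1                       : {beta1:.3f}")
print(f"mean beta1_hat, X2 always kept   : {b1_always_long.mean():.3f}")
print(f"mean beta1_hat, t-test selection : {b1_pretest.mean():.3f}")
```

Keeping $X_2$ in every replication gives an average close to the true $\beta_1$, while the $t$-test selection rule shifts the average of $\hat{\beta}_1$ upward, in the direction implied by the positive $\beta_2$ and the positive correlation between $X_1$ and $X_2$.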

## Sensitivity Analysis

**Sensitivity Analysis** consists of purposely running a number of alternative specifications to determine whether particular results are *robust* (not statistical flukes). In essence, we are trying to determine how sensitive a potential "best" equation is to a change in specification because the true specification is *not* known. Researchers who use sensitivity analysis run (and report on) a number of different reasonable specifications and tend to discount a result that appears significant in some specifications and insignificant in others. Indeed, the whole purpose of sensitivity analysis is to gain confidence that a particular result is significant in a variety of alternative specifications, functional forms, variable definitions and/or subsets of the data. 
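As a rough illustration of what such an exercise might look like in practice, the sketch below estimates every combination of two optional control variables and reports the coefficient and $t$-score of the variable of interest in each one. It assumes NumPy is available; the data are simulated, and the names `controls` and `coef_and_t` are hypothetical.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)

# Hypothetical data: X1 is the variable of interest; X2 and X3 are optional controls
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)
controls = {"X2": x2, "X3": x3}


def coef_and_t(y, X, k):
    """OLS coefficient and t-score for column k of the regressor matrix X."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = resid @ resid / (X.shape[0] - X.shape[1])
    return b[k], b[k] / np.sqrt(s2 * XtX_inv[k, k])


# Estimate every specification and keep all of the results for reporting
results = {}
for size in range(len(controls) + 1):
    for names in combinations(controls, size):
        X = np.column_stack([np.ones(n), x1] + [controls[c] for c in names])
        results[names] = coef_and_t(y, X, 1)

for names, (b1, t1) in results.items():
    label = " + ".join(("X1",) + names)
    print(f"{label:<14} beta1_hat = {b1:6.3f}   t = {t1:6.2f}")
```

Reporting the full table, rather than a single preferred row, is in the spirit of the disclosure recommended earlier: the reader can see whether the result for $X_1$ holds up across the specifications.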

## Data Mining

**Data Mining** involves exploring a data set not for the purpose of testing hypotheses or finding a specification, but for the purpose of uncovering empirical regularities that can inform economic theory. 

*However*, if you develop a hypothesis using data mining techniques, you *must* test that hypothesis on a *different* data set. A new data set must be used because our typical statistical tests have little meaning if the new hypothesis is tested on the data set that was used to generate it. After all, the researcher already knows, ahead of time, what the results will be! The use of dual data sets is easiest when there is a plethora of data.

In essence, improper data mining to obtain desired statistics for the final regression equation is a potentially unethical empirical research method. Whether the improper data mining is accomplished by estimating one equation at a time or by estimating batches of equations or by techniques like stepwise regression procedures, the conclusion is the same. Hypotheses developed by data mining should always be tested on a data set different from the one that was used to develop the hypothesis. Otherwise the researcher has not found any scientific evidence to support the hypothesis; rather a specification has been chosen in a way that is essentially misleading.

## An Example of Choosing Independent Variables

Suppose a friend surveys all 25 members of her econometrics class, collects data on the variables listed below, and asks for your help in choosing a specification:

- $GPA_i$ is the cumulative college GPA of the $i$-th student on a four point scale
- $HGPA_i$ is the cumulative high school GPA of the $i$-th student on a four point scale
- $MSAT_i$ is the highest score obtained by the $i$-th student on the mathematics section of the SAT test, maximum 800
- $VSAT_i$ is the highest score obtained by the $i$-th student on the verbal section of the SAT test, maximum 800
- $SAT_i$ = $MSAT_i + VSAT_i$
- $GREK_i$ is a dummy variable equal to 1 if the $i$-th student is a member of a fraternity or sorority, 0 otherwise
- $HRS_i$ is the $i$-th student's estimate of the average number of hours spent studying per course per week in college
- $PRIV_i$ is a dummy variable equal to 1 if the $i$-th student graduated from a private high school, 0 otherwise
- $JOCK_i$ is a dummy variable equal to 1 if the $i$-th student is or was a member of a varsity intercollegiate athletic team for at least one season, 0 otherwise
- $\ln EX_i$ is the natural log of the number of full courses that the $i$-th student has completed in college

Letting $GPA_i$ be the dependent variable, which independent variables would you choose? After weighing the theoretical arguments and running the regressions, the author arrived at the following OLS model:
$$
\begin{split}
\widehat{GPA_i} &= -0.26 + {0.49HGPA_i} + {0.06HRS_i} + {0.42\ln EX_i}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.21) 
\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.02)
\>\>\>\>\>\>\>\>\>\>\>\>\> (0.14)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> +2.33 
\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>+3.00
\>\>\>\>\>\>\>\>\>\>\> +3.00\\
\bar{R^2}&= 0.585 \quad\quad N=25
\end{split}
$$

Since the overall fit seems reasonable and since each coefficient meets our expectations in terms of sign, size and significance, we consider this an acceptable equation. If we suspect that we might have omitted SAT scores, we consider

$$
\begin{align}
\label{eg6_19}
\begin{split}
\widehat{GPA_i} &= -0.92 + {0.47HGPA_i} + {0.05HRS_i} + {0.44\ln EX_i} + 0.00060 SAT_i\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.22) 
\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.02)
\>\>\>\>\>\>\>\>\>\>\>\>\> (0.14)
\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.00064)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> +2.12 
\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>+2.50
\>\>\>\>\>\>\>\>\>\>\>\> +3.12
\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> +0.93\\
\bar{R^2}&= 0.583 \quad\quad N=25
\end{split}
\end{align}
$$

Using the four specification criteria to compare the equations without and with $SAT$:
- **Theory** - the theoretical validity of SAT scores is controversial, but they are the most widely accepted way of testing academic potential
- **$t$-test** - the coefficient of $SAT$ is positive but not significantly different from 0
- **$\bar{R^2}$** - $\bar{R^2}$ decreases when SAT scores are added
- **Bias** - none of the estimated slopes changed significantly when $SAT$ was added, though some of the $t$-scores did change because of the increase in the $SE(\hat{\beta})$s caused by the addition of $SAT$

Thus, the statistical criteria support our theoretical contention that $SAT$ is irrelevant.

## Additional Specification Criteria

We shall describe three of the most popular specification criteria.

### Ramsey's Regression Specification Error Test (RESET)
The **Ramsey RESET test** is a general test that determines the likelihood of an omitted variable or some other specification error by measuring whether the fit of a given equation can be significantly improved by the addition of $\hat{Y}^2$, $\hat{Y}^3$ and $\hat{Y}^4$ terms. The additional terms act as proxies for any possible unknown omitted variables or incorrect functional forms. If the proxies can be shown by the $F$-test to have improved the overall fit of the original equation, then we have evidence that there is some sort of specification error in our equation.

The Ramsey RESET test has three steps:

**Step 1.** Estimate the equation to be tested using OLS:
$$
\begin{equation}
\hat{Y_i} = \hat{\beta_{0}} + \hat{\beta_1}X_{1i} + \hat{\beta_2}X_{2i} \label{eg6_26}
\end{equation}
$$

**Step 2.** Take the $\hat{Y}_i$ values from Step 1 and create $\hat{Y_i}^2$, $\hat{Y_i}^3$ and $\hat{Y_i}^4$ terms. Add these terms to the original equation as additional explanatory variables and estimate the new equation with OLS:

$$
\begin{equation}
Y_i = \beta_{0} + \beta_1X_{1i} + \beta_2X_{2i} + \beta_3\hat{Y_i}^2 + \beta_4\hat{Y_i}^3 + \beta_5\hat{Y_i}^4 + \epsilon_i \label{eg6_27}
\end{equation}
$$

**Step 3.** Compare the fits of the original and augmented equations using the $F$-test. If the two equations are significantly different in overall fit, then we can conclude that the original equation is misspecified.
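The three steps can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions rather than a reference one: the helper names `rss` and `reset_f` are hypothetical, the regressor matrix is assumed to already contain a column of ones, and the data are simulated with a deliberately wrong (linear) functional form so that the test has something to detect.

```python
import numpy as np


def rss(y, X):
    """Residual sum of squares and fitted values from an OLS fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ b
    resid = y - fitted
    return resid @ resid, fitted


def reset_f(y, X):
    """Ramsey RESET F-statistic using Y-hat^2, Y-hat^3 and Y-hat^4 (M = 3).

    X must already contain a column of ones for the intercept.
    """
    n = len(y)
    rss_m, y_hat = rss(y, X)                                    # step 1
    X_aug = np.column_stack([X, y_hat**2, y_hat**3, y_hat**4])
    rss_aug, _ = rss(y, X_aug)                                  # step 2
    m = 3
    df = n - X_aug.shape[1]    # N - K - 1, where K counts the added terms too
    return ((rss_m - rss_aug) / m) / (rss_aug / df)             # step 3


# Hypothetical misspecified example: the true relationship is nonlinear,
# but the equation being tested is linear in X1 and X2.
rng = np.random.default_rng(4)
n = 200
x1 = rng.uniform(0, 2, size=n)
x2 = rng.uniform(0, 2, size=n)
y = (1 + x1 + x2) ** 2 + rng.normal(scale=0.5, size=n)
X = np.column_stack([np.ones(n), x1, x2])

print(f"RESET F = {reset_f(y, X):.1f}")
```

The resulting statistic would then be compared with a critical $F$-value with 3 and $N-K-1$ degrees of freedom taken from a table or a statistics package.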

While the Ramsey RESET test is fairly easy to use, it does little more than signal *when* a major specification error might exist. If you encounter a significant Ramsey RESET test result then you face the daunting task of figuring out what the error is.

As an example of the Ramsey RESET test, consider the chicken demand model to see whether RESET can detect the known specification error. In step 1, run the equation with $PB$ omitted.

$$
\begin{align*}
\begin{split}
\hat{Y_t} &= 27.5 - {0.42PC_t} + {0.27YD_t}\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.14) 
\>\>\>\>\>\>\>\>\>\> (0.005)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\>\> -2.95 
\>\>\>\>\>\>\>\>\> +55.00\\
\bar{R^2}&= 0.988 \quad\quad N=40 \text{  (Annual 1960 -1999)}
\end{split}
\end{align*}
$$

In step 2, take $\hat{Y}_t$ from this equation, calculate $\hat{Y_t}^2$, $\hat{Y_t}^3$ and $\hat{Y_t}^4$, and re-estimate the equation with the new terms added:
$$
\begin{align}
\label{eg6_28}
\begin{split}
Y_t &= 243.8 - {6.30PC_t} + {4.20YD_t} - 0.41\hat{Y_t}^2 + 0.005\hat{Y_t}^3 - 0.00002\hat{Y_t}^4 + e_t\\
&\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\>\> (0.98) 
\>\>\>\>\>\>\>\>\>\> (0.66)
\>\>\>\>\>\>\>\>\>\> (0.07)
\>\>\>\>\>\>\>\> (0.0009)
\>\>\>\>\>\> (0.000004)\\
t&=\>\>\>\>\>\>\>\>\>\>\>\>\>\> -6.39 
\>\>\>\>\>\>\>\>\> +6.35
\>\>\>\>\>\>\>\> -5.84
\>\>\>\>\>\>\>\>\> +5.73
\>\>\>\>\>\>\>\>\> -5.61\\
\bar{R^2}&= 0.994 \quad\quad N=40 \text{  (Annual 1960 -1999)}\\
\text{RSS} &= 79.27
\end{split}
\end{align}
$$

In step 3, we compare the fits of the two equations by using the $F$-test. Specifically, we test the hypothesis that the coefficients of all three added terms are equal to zero:
$$
\begin{align*}
 \text{H}_0 & \text{: }\beta_3 = \beta_4 = \beta_5=0\\ \text{H}_\text{A} & \text{: } \text{otherwise}
\end{align*}
$$

The appropriate $F$-statistic to use is:
$$
\begin{equation}
F = \frac{\frac{(\text{RSS}_M-\text{RSS})}{M}}{\frac{\text{RSS}}{N-K-1}} = 12.16
\end{equation}
$$

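Here $\text{RSS}_M$ is the residual sum of squares of the original equation, $\text{RSS}$ is that of the augmented equation, $M$ is the number of added terms, and $K$ is the number of independent variables in the augmented equation. With $M=3$ and $N-K-1=34$ degrees of freedom, the 5-percent critical $F$-value is approximately $2.89$. Since $12.16>2.89$, we can reject the null hypothesis that the coefficients of the added variables are jointly zero, concluding that there is a specification error in the equation that omits $PB$. True enough, the price of beef was not included in the equation. Remember that the RESET test only tells us that there is a specification error; it does not tell us the details of the error.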
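The value of the $F$-statistic can be reproduced with a few lines of arithmetic. The sketch below assumes that $\text{RSS}_M$ is the RSS of $164.31$ that the AIC and SC calculation uses for the equation without $PB$, and it takes $\text{RSS} = 79.27$, $M = 3$, $N = 40$ and $K = 5$ from the augmented equation.

```python
# Inputs taken from the chicken demand example
rss_m = 164.31   # RSS of the original equation (PB omitted)
rss = 79.27      # RSS of the augmented equation (with the three Y-hat terms)
m = 3            # number of added terms
n, k = 40, 5     # observations and independent variables in the augmented equation

f_stat = ((rss_m - rss) / m) / (rss / (n - k - 1))
print(round(f_stat, 2))   # 12.16
```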


### Akaike's Information Criterion and Schwarz Criterion

**Akaike's Information Criterion (AIC)** and the **Schwarz Criterion (SC)** are methods of comparing alternative specifications by adjusting RSS for the sample size, $N$, and the number of independent variables, $K$. These criteria can be used to augment our four basic specification criteria when we try to decide whether the improved fit caused by an additional variable is worth the decreased degrees of freedom and increased complexity caused by its addition. Their equations are:
$$
\begin{equation}
\text{AIC} = \log \frac{\text{RSS}}{N} + \frac{2(K+1)}{N} \label{eg6_30}
\end{equation}
$$

$$
\begin{equation}
\text{SC} = \log \frac{\text{RSS}}{N} + \frac{(K+1)\log N}{N} \label{eg6_31}
\end{equation}
$$

To use AIC and SC, estimate two alternative specifications and calculate AIC and SC for each equation. The lower the AIC or SC, the better the specification. Both criteria penalize the addition of another explanatory variable more than $\bar{R^2}$ does. Applying AIC and SC to the chicken demand example, we need to calculate AIC and SC for the equations with and without the price of beef. The equation with the lower AIC and SC values will, all else being equal, be our preferred specification. Plugging the numbers for the equation that includes $PB$ (RSS $= 143.07$, $N = 40$, $K = 3$) into the AIC and SC formulas gives

$$
\text{AIC} = \log \frac{\text{143.07}}{40} + \frac{2(4)}{40} = 1.47
$$


$$
\text{SC} = \log \frac{\text{143.07}}{40} + \frac{(4)\log 40}{40} = 1.64
$$

For the equation that omits the price of beef (RSS $= 164.31$, $K = 2$), the values are

$$
\text{AIC} = \log \frac{\text{164.31}}{40} + \frac{2(3)}{40} = 1.56
$$
$$
\text{SC} = \log \frac{\text{164.31}}{40} + \frac{(3)\log 40}{40} = 1.69
$$

Since AIC and SC are both lower with the inclusion of the $PB$ variable, AIC and SC provide evidence that the equation including the price of beef is preferable to the one omitting it.
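The four values can be recomputed directly from the formulas. The sketch below uses only the Python standard library; the function names `aic` and `sc` are hypothetical, and the logarithm is taken to be the natural log, which is what reproduces the figures in the text.

```python
import math


def aic(rss, n, k):
    """Akaike's Information Criterion: log(RSS/N) + 2(K+1)/N."""
    return math.log(rss / n) + 2 * (k + 1) / n


def sc(rss, n, k):
    """Schwarz Criterion: log(RSS/N) + (K+1) log(N) / N."""
    return math.log(rss / n) + (k + 1) * math.log(n) / n


n = 40
with_pb = {"rss": 143.07, "k": 3}      # PC, PB and YD included
without_pb = {"rss": 164.31, "k": 2}   # PB omitted

for label, spec in [("with PB", with_pb), ("without PB", without_pb)]:
    print(f"{label:<11} AIC = {aic(spec['rss'], n, spec['k']):.2f}   "
          f"SC = {sc(spec['rss'], n, spec['k']):.2f}")
```

Both criteria come out lower for the specification that includes $PB$, matching the hand calculation.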

As it turns out, all three new specification criteria indicate the presence of a specification error when we leave the price of beef out of the equation.

RESET is most useful as a general test of the existence of a specification error, while AIC and SC are more useful as a means of comparing two or more alternative specifications.