# Multiple Regression Analysis: OLS Asymptotics

Beyond the finite sample properties covered in the previous chapters, we also need to know the ***asymptotic properties*** (or ***large sample properties***) of estimators and test statistics. Fortunately, under the assumptions we have made, OLS has satisfactory large sample properties.

$Review\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\plim}{plim}
\newcommand{\using}[1]{\stackrel{\mathrm{#1}}{=}}
\newcommand{\ffrac}{\displaystyle \frac}
\newcommand{\space}{\text{ }}
\newcommand{\bspace}{\;\;\;\;}
\newcommand{\QQQ}{\boxed{?\:}}
\newcommand{\void}{\left.\right.}
\newcommand{\CB}[1]{\left\{ #1 \right\}}
\newcommand{\SB}[1]{\left[ #1 \right]}
\newcommand{\P}[1]{\left( #1 \right)}
\newcommand{\abs}[1]{\left| #1 \right|}
\newcommand{\norm}[1]{\left\| #1 \right\|}
\newcommand{\dd}{\mathrm{d}}
\newcommand{\Tran}[1]{{#1}^{\mathrm{T}}}
\newcommand{\d}[1]{\displaystyle{#1}}
\newcommand{\RR}{\mathbb{R}}
\newcommand{\EE}{\mathbb{E}}
\newcommand{\NN}{\mathbb{N}}
\newcommand{\ZZ}{\mathbb{Z}}
\newcommand{\QQ}{\mathbb{Q}}
\newcommand{\AcA}{\mathscr{A}}
\newcommand{\FcF}{\mathscr{F}}
\newcommand{\Exp}{\mathrm{E}}
\newcommand{\Var}[2][\,\!]{\mathrm{Var}_{#1}\left[#2\right]}
\newcommand{\Cov}[2][\,\!]{\mathrm{Cov}_{#1}\left(#2\right)}
\newcommand{\Corr}[2][\,\!]{\mathrm{Corr}_{#1}\left(#2\right)}
\newcommand{\I}[1]{\mathrm{I}\left( #1 \right)}
\newcommand{\N}[1]{\mathrm{N} \left( #1 \right)}
\newcommand{\ow}{\text{otherwise}}$

1. Expected values and unbiasedness: $\text{MLR}.1 \sim \text{MLR}.4$
2. Variance formulas: $\text{MLR}.1 \sim \text{MLR}.5$
3. Gauss-Markov Theorem: $\text{MLR}.1 \sim \text{MLR}.5$
4. Exact sampling distributions/tests: $\text{MLR}.1 \sim \text{MLR}.6$

## Consistency

In practice, regressions on time series data often fail to be unbiased, and only **consistency** remains.

$Def$

>Let $W_n$ be an estimator of $\theta$ based on a sample $Y_1,Y_2,\dots,Y_n$ of size $n$. Then, $W_n$ is a consistent estimator of $\theta$ if for every $\varepsilon > 0$,
>
>$$P\CB{\abs{W_n - \theta} > \varepsilon} \to 0 \text{ as } n \to \infty  $$
>
>Or alternatively, for arbitrary $\epsilon > 0$ and $n \to \infty$, we have $P\CB{\abs{W_n - \theta}< \epsilon} \to 1$

We can also write this as $\text{plim}\P{W_n} = \theta$

$Remark$

>In practice we never have infinitely many observations, so this property is a thought experiment about what would happen as the sample size gets *large*.
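
The definition can be made concrete with a small simulation. The sketch below is only an illustration under assumed settings that are not part of the notes: $W_n$ is taken to be the sample mean of Exponential(1) draws (so $\theta = 1$), $\varepsilon = 0.1$, and the probability is approximated by 500 Monte Carlo replications per sample size.

```python
import random

def exceed_prob(n, eps=0.1, reps=500, seed=0):
    """Monte Carlo estimate of P(|W_n - theta| > eps), with W_n = sample mean."""
    rng = random.Random(seed)
    theta = 1.0  # population mean of an Exponential(1) draw (assumed example)
    misses = 0
    for _ in range(reps):
        w_n = sum(rng.expovariate(1.0) for _ in range(n)) / n
        if abs(w_n - theta) > eps:
            misses += 1
    return misses / reps

# The estimated probability of missing theta by more than eps should fall with n.
probs = {n: exceed_prob(n) for n in (10, 100, 1000)}
for n, p in probs.items():
    print(f"n = {n:5d}   P(|W_n - theta| > 0.1) ~ {p:.3f}")
```

The printed frequencies should shrink toward zero as $n$ grows, which is what $\text{plim}\P{W_n} = \theta$ asserts.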

$Theorem.1$

Under assumptions $MLR.1$ through $MLR.4$, the OLS estimator $\hat\beta_j$ is consistent for $\beta_j$, for all $j = 0,1,\dots,k$, meaning that $\plim\P{\hat\beta_j} = \beta_j$.

$Proof$

>$$
\hat\beta_1 = \ffrac{\d{\sum_{i=1}^{n} \P{x_{i1} - \bar{x}_1} y_i}} {\d{\sum_{i=1}^{n} \P{x_{i1} - \bar{x}_1}^2}}
\\[0.6em]$$
>
><center>since $y_i = \beta_0 + \beta_1 x_{i1} + u_i$</center>

>$$\begin{align}
\hat\beta_1&= \beta_1 + \ffrac{\d{\sum_{i=1}^{n} \P{x_{i1} - \bar{x}_1} u_i}} {\d{\sum_{i=1}^{n} \P{x_{i1} - \bar{x}_1}^2}}\\
&= \beta_1 + \ffrac{\d{\ffrac{1} {n}\sum_{i=1}^{n} \P{x_{i1} - \bar{x}_1} u_i}} {\d{\ffrac{1} {n}\sum_{i=1}^{n} \P{x_{i1} - \bar{x}_1}^2}} \\[0.5em]
\end{align}\\[0.6em]$$
>
><center> by the **law of large numbers**, the numerator converges in probability to $\Cov{x_1,u}$ and the denominator to $\Var{x_1}$</center>
>
>$$\begin{align}
\plim\P{\hat\beta_1}&= \beta_1 + \ffrac{\Cov{x_1,u}} {\Var{x_1}}\\
&= \beta_1 + \ffrac{0} {\Var{x_1}} = \beta_1
\end{align}$$
***
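
The convergence in the proof can also be checked numerically. The following is a minimal sketch, assuming a data generating process chosen only for illustration ($\beta_0 = 1$, $\beta_1 = 2$, standard normal $x_1$, and a skewed error $u$ equal to an Exponential(1) draw minus one, generated independently of $x_1$). It reports the average absolute estimation error of the slope over repeated samples.

```python
import random

def ols_slope(x, y):
    """Simple regression slope: sum((x_i - x_bar) * y_i) / sum((x_i - x_bar)^2)."""
    n = len(x)
    x_bar = sum(x) / n
    num = sum((xi - x_bar) * yi for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den

def mean_abs_error(n, reps=200, beta0=1.0, beta1=2.0, seed=1):
    """Average |beta1_hat - beta1| over `reps` simulated samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Skewed, non-normal error with E[u] = 0, drawn independently of x
        u = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
        total += abs(ols_slope(x, y) - beta1)
    return total / reps

errors = {n: mean_abs_error(n) for n in (20, 200, 2000)}
for n, err in errors.items():
    print(f"n = {n:5d}   mean |beta1_hat - beta1| = {err:.4f}")
```

Normality of $u$ is not used anywhere: the error only needs zero mean and zero covariance with $x_1$ for the estimation error to shrink.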

$Assumption.4'$ $MLR.4'$ **Zero Mean and Zero Correlation**

$\Exp\SB{u} = 0$ and $\Cov{x_j, u} = 0$, for $j = 1,2,\dots,k$

$Remark$

>The original assumption $MLR.4$ is **Zero conditional mean**, that is, $\Exp\SB{u \mid x_1,x_2,\dots,x_k} = 0$. $MLR.4$ is stronger than $MLR.4'$.
>
>Also, $MLR.4'$ guarantees only consistency, not unbiasedness.

### Deriving the inconsistency in OLS

> If the error $u$ is correlated with any of the independent variables $x_j$, then OLS is biased and inconsistent.

In the simple regression case, the ***inconsistency*** in $\hat\beta_1$ (or loosely called the ***asymptotic bias***) is $\plim\P{\hat\beta_1} - \beta_1 = \ffrac{\Cov{x_1,u}} {\Var{x_1}}$. It is positive if $x_1$ and $u$ are positively correlated, and negative if they are negatively correlated.

This formula helps us find the asymptotic analog of the omitted variable bias (ref **Chap_03.3**). Suppose the true model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + v$ and that it satisfies the **first four Gauss-Markov assumptions**, so that the OLS estimators $\hat\beta_0$, $\hat\beta_1$, and $\hat\beta_2$ are all **consistent**. If we omit $x_2$ and run the simple regression of $y$ on $x_1$, the model becomes $y = \beta_0 + \beta_1 x_1 + u$ with $u = \beta_2 x_2 + v$. Let $\tilde\beta_1$ denote the simple regression slope estimator. Then

$$\plim \tilde\beta_1 = \beta_1 + \beta_2 \ffrac{\Cov{x_1,x_2}} {\Var{x_1}} = \beta_1 + \beta_2 \delta_1$$

If $x_1$ and $x_2$ are *uncorrelated* (in the population), then $\delta_1 = 0$, and $\tilde\beta_1$ is a consistent estimator of $\beta_1$ (although not necessarily unbiased). However, if $x_2$ has a positive partial effect on $y$, so that $\beta_2 > 0$, and $x_1$ and $x_2$ are positively correlated, so that $\Cov{x_1,x_2}>0$ and hence $\delta_1> 0$, then the inconsistency in $\tilde\beta_1$ is positive.
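
As a numerical check of $\plim \tilde\beta_1 = \beta_1 + \beta_2 \delta_1$, here is a small sketch with assumed, purely illustrative values: $\beta_1 = 1$, $\beta_2 = 2$, and $x_2 = 0.5\,x_1 + e$ with $e$ independent of $x_1$, so that $\delta_1 = 0.5$ and the short regression slope should settle near $2$ rather than $1$.

```python
import random

rng = random.Random(2)
n = 20_000
beta0, beta1, beta2 = 1.0, 1.0, 2.0   # assumed population parameters
delta1 = 0.5  # Cov(x1, x2) / Var(x1) implied by how x2 is built below

x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [delta1 * a + rng.gauss(0.0, 1.0) for a in x1]
y = [beta0 + beta1 * a + beta2 * b + rng.gauss(0.0, 1.0)
     for a, b in zip(x1, x2)]

# Simple regression of y on x1 only (x2 is omitted)
x1_bar = sum(x1) / n
beta1_tilde = (sum((a - x1_bar) * c for a, c in zip(x1, y))
               / sum((a - x1_bar) ** 2 for a in x1))

print("short-regression slope :", round(beta1_tilde, 3))
print("beta1 + beta2 * delta1 :", beta1 + beta2 * delta1)
```

Increasing `n` does not move the slope toward $\beta_1$; it only tightens it around $\beta_1 + \beta_2\delta_1$, which is the sense in which the estimator is inconsistent.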

## Asymptotic Normality and Large Sample Inference

>$\text{MLR}.6$ is equivalent to saying that the distribution of $y$ given $x_1,x_2,\dots,x_k$ (which is determined by the distribution of $u$) is normal. Normality plays no role in unbiasedness, but exact statistical inference relies on it. Fortunately, by the **central limit theorem**, even when the $y_i$ are not drawn from a normal distribution, the OLS estimators still satisfy ***asymptotic normality***, which means they are approximately normally distributed in large enough samples.

$Theorem.2$ Asymptotic Normality of OLS

Under the Gauss-Markov Assumptions, $\text{MLR}.1$ through $\text{MLR}.5$, 

- $\sqrt{n}\P{\hat\beta_j - \beta_j} \newcommand{\asim}{\overset{\text{a}}{\sim}}\asim \N{0, \ffrac{\sigma^2} {a_j^2}}$, where $\ffrac{\sigma^2} {a_j^2}$ is the ***asymptotic variance*** of $\sqrt{n}\P{\hat\beta_j - \beta_j}$; for the slope coefficients, $a_j^2 = \plim \P{\ffrac{1} {n} \sum_{i=1}^{n} \hat r_{ij}^{2}}$, where the $\hat r_{ij}$ are the residuals from regressing $x_j$ on the other independent variables. We say that $\hat\beta_j$ is *asymptotically normally distributed*;
- $\hat\sigma^2$ is a consistent estimator of $\sigma^2 = \Var{u}$;
- For each $j$, $\ffrac{\hat\beta_j - \beta_j} {\text{sd}\P{\hat\beta_j}}\asim \N{0,1}$; $\ffrac{\hat\beta_j - \beta_j} {\text{se}\P{\hat\beta_j}}\asim \N{0,1}$ where $\text{se}\P{\hat\beta_j} = \sqrt{\widehat{\Var{\hat\beta_j}}} = \sqrt{\ffrac{\hat\sigma^2} {\text{SST}_j \P{1-R^2_j}}}$ is the usual OLS standard error. 

$Remark$

>Here we have dropped assumption $\text{MLR}.6$; the only restriction that remains on the distribution of the error is that it has finite variance.
>
>Also note that the population distribution of the error term, $u$, is immutable and has nothing to do with the sample size. The theorem only says that, regardless of the population distribution of $u$, the OLS estimators, when properly standardized, have approximate standard normal distributions.
>
>$\text{sd}\P{\hat\beta_j}$ depends on $\sigma$ and is not observable, while $\text{se}\P{\hat\beta_j}$ depends on $\hat\sigma$ and can be computed. In the previous chapter we've already seen that: under **CLM**, $\text{MLR}.1$ through $\text{MLR}.6$, we have $\ffrac{\hat\beta_j - \beta_j} {\text{sd}\P{\hat\beta_j}}\sim \N{0,1}$ and $\ffrac{\hat\beta_j - \beta_j} {\text{se}\P{\hat\beta_j}}\sim t_{n-k-1} = t_{df}$.
>
>In large samples, the $t$-distribution is close to the $\N{0,1}$ distribution, and thus $t$ tests are approximately valid in large samples *without* $\text{MLR}.6$. But we still need $\text{MLR}.1$ to $\text{MLR}.5$.
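
To see the theorem at work without $\text{MLR}.6$, the sketch below simulates the $t$ statistic for the slope of a simple regression when the error is clearly non-normal. Everything about the design is an assumption made for illustration: $n = 200$, $1000$ replications, standard normal $x$, and $u$ equal to an Exponential(1) draw minus one. If the normal approximation is good, about 5% of the $t$ statistics should exceed $1.96$ in absolute value.

```python
import math
import random

def t_stat(n, rng, beta0=1.0, beta1=2.0):
    """t statistic for the (true) null H0: slope = beta1 in one simulated sample."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    u = [rng.expovariate(1.0) - 1.0 for _ in range(n)]  # skewed, E[u]=0, Var[u]=1
    y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sst_x
    b0 = y_bar - b1 * x_bar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma2_hat = ssr / (n - 2)             # df = n - k - 1 with k = 1
    se_b1 = math.sqrt(sigma2_hat / sst_x)  # R_j^2 = 0 with a single regressor
    return (b1 - beta1) / se_b1

rng = random.Random(3)
t_values = [t_stat(200, rng) for _ in range(1000)]
reject_rate = sum(abs(t) > 1.96 for t in t_values) / len(t_values)
mean_t = sum(t_values) / len(t_values)

print("share of |t| > 1.96:", reject_rate)
print("mean of t          :", round(mean_t, 3))
```

The rejection frequency is a Monte Carlo estimate, so it will not equal 5% exactly, but it should be close even though $u$ is far from normal.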

Since $\hat\sigma^2$ is a consistent estimator of $\sigma^2$, let's take a closer look at ${\widehat{\Var{\hat\beta_j}}} = {\ffrac{\hat\sigma^2} {\text{SST}_j \P{1-R^2_j}}}$, where $\text{SST}_j$ is the total sum of squares of $x_j$ in the sample and $R^2_j$ is the $R$-squared from regressing $x_j$ on all of the other independent variables. As the **sample size** *grows*, $\hat\sigma^2$ converges in probability to the constant $\sigma^2$. Further, $R^2_j$ approaches a number strictly between $0$ and $1$, so $1 - R^2_j$ converges to a positive constant. As for the rate, the sample variance of $x_j$ is $\ffrac{\text{SST}_j} {n}$, which converges to $\Var{x_j}$ as the sample size grows, meaning that $\text{SST}_j \approx n\sigma_j^2$, where $\sigma_j^2$ is the population variance of $x_j$. Combining all these facts:

$\bspace \widehat{\Var{\hat\beta_j}}$ shrinks to zero at the rate of $1/n$, and $\text{se}\P{\hat\beta_j}$ shrinks to zero at the rate of $1/\sqrt{n}$. This is why larger samples give more precise estimates.
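
The rate can be seen directly by computing the usual standard error at several sample sizes. This is a sketch under assumed settings (a simple regression with standard normal $x$ and a uniform error scaled to have variance $1$, so that $\sigma/\sigma_x = 1$); the helper name `slope_se` is hypothetical.

```python
import math
import random

def slope_se(n, rng):
    """Usual OLS standard error of the slope in one simulated simple regression."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    half_width = math.sqrt(3.0)  # Uniform(-sqrt(3), sqrt(3)) has variance 1
    y = [1.0 + 2.0 * xi + rng.uniform(-half_width, half_width) for xi in x]
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sst_x
    b0 = y_bar - b1 * x_bar
    sigma2_hat = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return math.sqrt(sigma2_hat / sst_x)

rng = random.Random(4)
sizes = (1000, 4000, 16000)
se = {n: slope_se(n, rng) for n in sizes}
scaled = {n: se[n] * math.sqrt(n) for n in sizes}  # should stay roughly constant

for n in sizes:
    print(f"n = {n:6d}   se = {se[n]:.5f}   sqrt(n) * se = {scaled[n]:.3f}")
```

Each fourfold increase in $n$ should roughly halve the standard error, while $\sqrt{n}\,\text{se}\P{\hat\beta_1}$ stays near $\sigma/\sigma_x$.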

When $u$ is not normally distributed, $\sqrt{\widehat{\Var{\hat\beta_j}}} = \sqrt{\ffrac{\hat\sigma^2} {\text{SST}_j \P{1-R^2_j}}}$ is called the **asymptotic standard error** and $t$ statistics are called **asymptotic $t$ statistics**. Likewise, the usual confidence intervals are called **asymptotic confidence intervals**.

### Other Large Sample Tests: The Lagrange Multiplier Statistic

The ***Lagrange multiplier (LM) statistic*** is another way to test multiple exclusion restrictions in large samples. Consider the model: $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$. The null hypothesis is $H_0:\beta_{k-q+1} = \beta_{k-q+2} = \cdots = \beta_k = 0$, which puts $q$ exclusion restrictions on the last $q$ parameters of the model. The $LM$ statistic requires estimation of the restricted model only. Thus, assume that we have run the regression: $y = \tilde\beta_0 + \tilde\beta_1 x_1 + \cdots + \tilde\beta_{k-q} x_{k-q} + \tilde u$, where $\tilde\void$ indicates that the estimates are from the restricted model.

It turns out that, to get a usable test statistic, we must include *all* of the independent variables in the regression, so we regress $\tilde u$ on $x_1, x_2,\dots,x_k$. This is called an ***auxiliary regression***, a regression that is used to compute a test statistic but whose coefficients are not of direct interest.

Then under the null hypothesis, the sample size $n$, multiplied by the usual $R$-squared from the auxiliary regression is distributed asymptotically as a $\chi^2$ $r.v.$ with $q$ degrees of freedom. Here's the overall procedure for testing the joint significance of a set of $q$ independent variables using this method.

***

<center>Lagrange Multiplier Statistic for $q$ exclusion restrictions</center>

1. Regress $y$ on the *restricted set* of independent variables, $x_1,\dots, x_{k-q}$, and save the residuals, $\tilde u$
2. Regress $\tilde u$ on *all* of the independent variables and obtain the $R$-squared, $R^2_u$ (the subscript distinguishes it from the $R$-squared obtained by regressing $y$ on them)
3.  Compute **Lagrange multiplier statistic**: $LM = nR_u^2$
4.  Compare $LM$ to the appropriate critical value $c$, in a $\chi_q^2$ distribution; if $LM > c$ then the null hypothesis is *rejected*. Even better, obtain the $p$-value as the probability that a $\chi_q^2$ $r.v.$ exceeds the value of the test statistic. If the $p$-value is less than the desired significance level, then $H_0$ is rejected. If not, we fail to reject $H_0$. The rejection rule is essentially the same as for $F$ testing.
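
The four steps can be sketched in dependency-free Python as follows. The simulated data, the coefficient values, and the helper names `ols_residuals` and `lm_statistic` are all assumptions made for illustration. The example tests $q = 2$ restrictions, which allows a closed-form $p$-value: for a $\chi^2_2$ variable, $P\CB{\chi^2_2 > c} = e^{-c/2}$. For other values of $q$ a chi-square routine would be needed instead (for example `scipy.stats.chi2.sf`).

```python
import math
import random

def ols_residuals(columns, y):
    """Residuals from regressing y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(row[a] * row[b] for row in X) for b in range(k)]
         + [sum(row[a] * yi for row, yi in zip(X, y))] for a in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [ar - f * ac for ar, ac in zip(A[r], A[c])]
    b = [A[j][k] / A[j][j] for j in range(k)]
    return [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]

def lm_statistic(y, restricted, excluded):
    """LM = n * R^2 from regressing restricted-model residuals on all regressors."""
    n = len(y)
    u_tilde = ols_residuals(restricted, y)               # step 1
    e = ols_residuals(restricted + excluded, u_tilde)    # step 2
    u_bar = sum(u_tilde) / n
    sst = sum((u - u_bar) ** 2 for u in u_tilde)
    r2_u = 1.0 - sum(ei ** 2 for ei in e) / sst
    return n * r2_u                                      # step 3

def simulate(n, beta2, beta3, rng):
    """Assumed DGP: y = 1 + x1 + beta2*x2 + beta3*x3 + u with a skewed error u."""
    x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x2 = [0.5 * a + rng.gauss(0.0, 1.0) for a in x1]
    x3 = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [1.0 + a + beta2 * b + beta3 * c + rng.expovariate(1.0) - 1.0
         for a, b, c in zip(x1, x2, x3)]
    return y, [x1], [x2, x3]

rng = random.Random(5)
lm_null = lm_statistic(*simulate(500, 0.0, 0.0, rng))   # H0 true
lm_alt = lm_statistic(*simulate(500, 0.5, -0.5, rng))   # H0 false
p_null = math.exp(-lm_null / 2.0)                       # step 4, valid for q = 2 only
p_alt = math.exp(-lm_alt / 2.0)

print("H0 true : LM =", round(lm_null, 3), " p-value =", round(p_null, 4))
print("H0 false: LM =", round(lm_alt, 3), " p-value =", p_alt)
```

The 5% critical value of $\chi^2_2$ is $-2\ln 0.05 \approx 5.99$, so the second data set should reject $H_0$ comfortably, while the first rejects only by chance.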

## Asymptotic Efficiency of OLS

In the general $k$-regressor case, a class of consistent estimators is obtained by generalizing the OLS first order conditions:

$$\sum_{i=1}^n g_j\P{\mathbf{x}_i} \P{y_i - \tilde\beta_0 - \tilde\beta_1 x_{i1} - \cdots - \tilde\beta_k x_{ik}} = 0, \bspace j = 0,1,\dots,k$$

where $g_j\P{\mathbf{x}_i}$ denotes any function of all explanatory variables for observation $i$. The choices $g_0\P{\mathbf{x}_i} = 1$ and $g_j\P{\mathbf{x}_i} = x_{ij}$ for $j=1,2,\dots,k$ give the OLS first order conditions and hence the OLS estimators.

Here's the theorem:

$Theorem.3$ Asymptotic Efficiency of OLS

Under the Gauss-Markov assumptions, let $\tilde\beta_j$ denote estimators that solve equations of the form:

$$\sum_{i=1}^n g_j\P{\mathbf{x}_i} \P{y_i - \tilde\beta_0 - \tilde\beta_1 x_{i1} - \cdots - \tilde\beta_k x_{ik}} = 0, \bspace j = 0,1,\dots,k$$

and let $\hat\beta_j\newcommand{\Avar}[2][\,\!]{\mathrm{Avar}_{#1}\left[#2\right]}$ denote the OLS estimators. Then for $j=0,1,\dots,k$, the OLS estimators have the smallest asymptotic variances: $\Avar{\sqrt{n} \P{\hat\beta_j - \beta_j}} \leq \Avar{\sqrt{n} \P{\tilde\beta_j - \beta_j}}$
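
A small simulation can illustrate the inequality for one particular competitor. The sketch assumes a simple regression with a homoskedastic, skewed error and compares OLS ($g_1\P{x} = x$) with the estimator obtained from the illustrative choice $g_1\P{x} = x^3$; both use $g_0 = 1$. Solving the two moment conditions gives the slope $\sum_i \P{g_i - \bar g} y_i \big/ \sum_i \P{g_i - \bar g} x_i$, which is what `solve` computes.

```python
import random

def slopes(n, rng, beta0=1.0, beta1=2.0):
    """Return (OLS slope, slope from g_1(x) = x^3) for one simulated sample."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # Homoskedastic, non-normal error independent of x
    y = [beta0 + beta1 * xi + rng.expovariate(1.0) - 1.0 for xi in x]

    def solve(g):
        # Slope solving sum(y - b0 - b1*x) = 0 and sum(g * (y - b0 - b1*x)) = 0
        g_bar = sum(g) / n
        num = sum((gi - g_bar) * yi for gi, yi in zip(g, y))
        den = sum((gi - g_bar) * xi for gi, xi in zip(g, x))
        return num / den

    return solve(x), solve([xi ** 3 for xi in x])

def variance(values):
    """Sample variance with the n - 1 divisor."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

rng = random.Random(6)
n, reps = 200, 1000
draws = [slopes(n, rng) for _ in range(reps)]
avar_ols = n * variance([d[0] for d in draws])  # approximates Avar of sqrt(n)(b - beta)
avar_alt = n * variance([d[1] for d in draws])

print("n * Var(OLS slope)        :", round(avar_ols, 3))
print("n * Var(g_1 = x^3 slope)  :", round(avar_alt, 3))
```

Both estimators are consistent here, but the simulated variance of the $x^3$-based estimator should come out larger, in line with the theorem. This checks one competitor only and is not a proof.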



***