# Lab 3: Asymptotic Properties of Best Linear Predictor and Partialling Out

To see how partialling out functions both at the population and at the sample level, we will first review the asymptotic properties of the BLP that we defined in previous classes.

Recall the usual linear model. We have a collection of random variables $(Y, X) \in \mathbb{R}\times\mathbb{R}^p$ with some joint distribution $\mathcal{D}$. We can build a linear approximation of $Y$ given $X$ at the population level by finding a $\beta$ such that

\begin{gather*}
\beta \in \arg\min_{b\in\mathbb{R}^p}E\left[\left(Y-Xb\right)^2\right]
\end{gather*}

However, when working with a sample of size $n$, we replace the expectation with the sample average $\mathbb{E}_n$ and estimate $\beta$ by building $\hat{\beta}$ such that

\begin{gather*}
\hat{\beta} = \arg\min_{b\in\mathbb{R}^p}\mathbb{E}_n\left[\left(Y-Xb\right)^2\right]
\end{gather*}
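As a minimal sketch of what this sample minimization computes: when $X'X$ is invertible, the least-squares problem has the closed-form solution $\hat{\beta} = (X'X)^{-1}X'Y$. The snippet below uses a small made-up data set (the names `n_demo`, `X_demo`, `y_demo` and the coefficients `c(2, -1)` are illustrative assumptions, not part of the lab) and checks that solving the normal equations agrees with `lm()`.

```r
set.seed(1)
n_demo <- 500
X_demo <- cbind(x1 = runif(n_demo), x2 = rnorm(n_demo))
y_demo <- drop(X_demo %*% c(2, -1)) + rnorm(n_demo)

# Solve the normal equations (X'X) b = X'y
beta_hat_manual <- drop(solve(crossprod(X_demo), crossprod(X_demo, y_demo)))

# Same minimization through lm(), without an intercept
beta_hat_lm <- coef(lm(y_demo ~ 0 + X_demo))

# Both routes should agree up to numerical tolerance
print(all.equal(unname(beta_hat_manual), unname(beta_hat_lm)))
```

In practice `lm()` uses a QR decomposition rather than the normal equations, which is numerically safer when the regressors are highly correlated.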

As in the previous lab, we will approximate $Y$ drawn from the following data generating process:

\begin{align*}
Y &= e^{4X} + \epsilon_Y\\
X &= \epsilon_X\\
\epsilon_Y &\sim\mathcal{N}(0, 1)\\
\epsilon_X&\sim\mathcal{U}(0, 1)
\end{align*}

In [2]:
N <- 10000000
X <- runif(N, 0, 1)
epsilon_y <- rnorm(N, 0, 1)
y <- exp(4 * X) + epsilon_y

Remember that our initial goal is to approximate $Y$ through some linear function, which in this case means we want to approximate $g(X) = e^{4X}$. To do this, we will use a third-degree polynomial as in the last lab, meaning we will define $Z = (X, X^2, X^3)$ and use the least squares method to find $m(Z)\approx g(X)$. With this approximation we have that

\begin{gather*}
Y = m(Z) + \varepsilon\\
m(Z) = Z\gamma_{YZ}
\end{gather*}

where $\gamma_{YZ}$ is the vector of coefficients. This time, we will take a closer look at these coefficients.

In [4]:
data <- data.frame(y, X, X^2, X^3)

In [17]:
model <- lm(y ~ ., data = data)
summary(model)


Call:
lm(formula = y ~ ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4841 -0.7465 -0.0069  0.7381  5.8592 

Coefficients:
              Estimate Std. Error  t value Pr(>|t|)    
(Intercept)  -0.101520   0.001391   -72.97   <2e-16 ***
X            24.014522   0.012048  1993.31   <2e-16 ***
X.2         -69.031696   0.027995 -2465.85   <2e-16 ***
X.3          98.015596   0.018402  5326.25   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 9999996 degrees of freedom
Multiple R-squared:  0.9938,	Adjusted R-squared:  0.9938 
F-statistic: 5.31e+08 on 3 and 9999996 DF,  p-value: < 2.2e-16


We now pick three samples: a small one (about 50 observations), a medium one (about 250 observations) and a large one (close to 10 000 observations).

In [13]:
sampling_values <- runif(N, 0, 1)
small_sample <- sampling_values < .000005
medium_sample <- sampling_values < .000025
large_sample <- sampling_values < .001

print(paste("Small sample size:", sum(small_sample)))
print(paste("Medium sample size:", sum(medium_sample)))
print(paste("Large sample size:", sum(large_sample)))

[1] "Small sample size: 52"
[1] "Medium sample size: 248"
[1] "Large sample size: 10067"


Our estimate $\hat{\beta}$ should approximate $\beta$ more closely as $n$ grows. We can see this by comparing the model fitted on the whole population with models fitted on larger and larger samples, because for large $n$ our estimator $\hat{\beta}$ is approximately distributed as

\begin{gather*}
\hat{\beta}\overset{a}{\sim}\mathcal{N}(\beta, \mathbf{V}/n)
\end{gather*}

Notice that the variance shrinks with the sample size. In the following results, pay attention to what happens to the standard errors as the sample size grows.

In [20]:
summary(lm(y ~ ., data = data[small_sample,]))


Call:
lm(formula = y ~ ., data = data[small_sample, ])

Residuals:
     Min       1Q   Median       3Q      Max 
-3.14883 -0.63070 -0.00062  0.61648  1.79872 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -2.1391     0.8141  -2.628   0.0115 *  
X            36.6691     6.5529   5.596 1.03e-06 ***
X.2         -91.5972    14.4192  -6.352 7.24e-08 ***
X.3         110.3891     9.1534  12.060 3.90e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.022 on 48 degrees of freedom
Multiple R-squared:  0.9954,	Adjusted R-squared:  0.9952 
F-statistic:  3497 on 3 and 48 DF,  p-value: < 2.2e-16


In [27]:
summary(lm(y ~ 0 + ., data = data[medium_sample,]))


Call:
lm(formula = y ~ 0 + ., data = data[medium_sample, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-3.7283 -0.6859 -0.0361  0.7006  3.6030 

Coefficients:
     Estimate Std. Error t value Pr(>|t|)    
X    23.56222    1.56460   15.06   <2e-16 ***
X.2 -69.37591    4.36998  -15.88   <2e-16 ***
X.3  98.86186    3.15340   31.35   <2e-16 ***
D     2.98915    0.04596   65.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.12 on 244 degrees of freedom
Multiple R-squared:  0.9979,	Adjusted R-squared:  0.9979 
F-statistic: 2.948e+04 on 4 and 244 DF,  p-value: < 2.2e-16


In [22]:
summary(lm(y ~ ., data = data[large_sample,]))


Call:
lm(formula = y ~ ., data = data[large_sample, ])

Residuals:
    Min      1Q  Median      3Q     Max 
-4.1958 -0.7730 -0.0002  0.7451  4.7971 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.04591    0.04401  -1.043    0.297    
X            23.64670    0.38212  61.883   <2e-16 ***
X.2         -68.53592    0.89005 -77.003   <2e-16 ***
X.3          97.83299    0.58645 166.823   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.116 on 10063 degrees of freedom
Multiple R-squared:  0.9935,	Adjusted R-squared:  0.9935 
F-statistic: 5.156e+05 on 3 and 10063 DF,  p-value: < 2.2e-16
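
The shrinking standard errors in these fits follow the $1/\sqrt{n}$ rate implied by the variance $\mathbf{V}/n$. The sketch below illustrates this on a deliberately simple, made-up regression (the helper `slope_se` and the model `y = 1 + 2x + noise` are assumptions for illustration only): if the rate holds, the standard error multiplied by $\sqrt{n}$ should stay roughly constant across sample sizes.

```r
set.seed(2)

# Hypothetical helper: standard error of the slope in a simple regression of size n
slope_se <- function(n) {
  x <- runif(n)
  y <- 1 + 2 * x + rnorm(n)
  summary(lm(y ~ x))$coefficients["x", "Std. Error"]
}

sizes <- c(50, 250, 10000)
ses <- sapply(sizes, slope_se)

# se falls with n, while se * sqrt(n) stays roughly stable
print(data.frame(n = sizes, se = ses, se_times_sqrt_n = ses * sqrt(sizes)))
```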


Now that we have reviewed this, we can add another variable to the mix. This variable will be jointly distributed with X and with Y. We will define the data generating process as:

\begin{align*}
Y &= \theta D + e^{4X} + \epsilon_Y\\
D &= e^X - \log{(X)} + \epsilon_D\\
X &= \epsilon_X\\
\epsilon_Y, \epsilon_D&\sim\text{i.i.d.}\mathcal{N}(0, 1)\\
\epsilon_X&\sim\mathcal{U}(0, 1)
\end{align*}

In [24]:
epsilon_d <- rnorm(N, 0, 1)
D <- exp(X) - log(X) + epsilon_d
y <- 3 * D + exp(4 * X) + epsilon_y
data = data.frame(y, X, X^2, X^3, D)

We will use this example to understand partialling out, going through the process one step at a time to show how each result is obtained. First, let's take a look at the coefficients for our regressors when they are jointly estimated, meaning that we find the BLP of the form

\begin{gather*}
Y = \theta D + Z\gamma + \varepsilon \quad\quad(1)
\end{gather*}

In [26]:
base_model_results <- lm(y ~ 0 + ., data = data)

summary(base_model_results)


Call:
lm(formula = y ~ 0 + ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5025 -0.7561 -0.0163  0.7286  5.8844 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
X    2.301e+01  7.805e-03    2948   <2e-16 ***
X.2 -6.697e+01  2.167e-02   -3090   <2e-16 ***
X.3  9.679e+01  1.541e-02    6280   <2e-16 ***
D    3.011e+00  2.168e-04   13887   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 9999996 degrees of freedom
Multiple R-squared:  0.9982,	Adjusted R-squared:  0.9982 
F-statistic: 1.354e+09 on 4 and 9999996 DF,  p-value: < 2.2e-16


Here we have a linear approximation that uses all the variables at the same time, and all coefficients are estimated jointly. As before, this means that

\begin{gather*}
Y = \hat{\theta}D + Z\hat{\gamma} + \varepsilon
\end{gather*}

We will save the residuals $\varepsilon$ of this regression for later

In [28]:
y_residuals <- base_model_results$residuals

As we saw in class, partialling out consists of computing the residuals $\tilde{V}$ of each variable $V$ in our regression after predicting it with some collection of regressors $W$:

\begin{gather*}
\tilde{V}=V-\hat{V}
\end{gather*}

where $\hat{V}=m(W)$ is a prediction rule for $V$ with $W$. The vectors $\tilde{V}$ are called the "residuals of $V$ on $W$." In the following examples, we will focus on the best linear predictor built with $Z = (X, X^2, X^3)$. This means we estimate

\begin{align*}
Y &= Z\gamma_{ZY} + \varepsilon_{ZY}\\
D &= Z\gamma_{ZD} + \varepsilon_{ZD}\\
&\text{etc...}
\end{align*}

After estimating these linear predictors, we build the residuals

\begin{align*}
\tilde{Y} &= Y - Z\hat{\gamma}_{ZY}\\
\tilde{D} &= D - Z\hat{\gamma}_{ZD}\\
&\text{etc...}
\end{align*}

We will build these residuals for all the variables in our regression on our collection of constructed regressors $Z$, starting with the residuals of $Y$ on $Z$, $\tilde{Y}$:

In [32]:
Z_form <- "~ 0 + X + X.2 + X.3"

In [33]:
y_Z_model <- lm(paste("y", Z_form), data = data)
y_Z_residuals = y_Z_model$residuals
summary(y_Z_model)


Call:
lm(formula = paste("y", Z_form), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-18.208  -2.163   0.310   3.052  60.610 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
X     91.88538    0.02714    3386   <2e-16 ***
X.2 -219.88240    0.08408   -2615   <2e-16 ***
X.3  191.95954    0.06218    3087   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.954 on 9999997 degrees of freedom
Multiple R-squared:  0.9626,	Adjusted R-squared:  0.9626 
F-statistic: 8.584e+07 on 3 and 9999997 DF,  p-value: < 2.2e-16


$\tilde{D}$:

In [34]:
D_Z_model <- lm(paste("D", Z_form), data = data)
D_Z_residuals = D_Z_model$residuals
summary(D_Z_model)


Call:
lm(formula = paste("D", Z_form), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4848 -0.6987  0.1118  1.0153 19.1000 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
X    22.877377   0.008789    2603   <2e-16 ***
X.2 -50.791034   0.027231   -1865   <2e-16 ***
X.3  31.610655   0.020137    1570   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.604 on 9999997 degrees of freedom
Multiple R-squared:  0.7085,	Adjusted R-squared:  0.7085 
F-statistic: 8.101e+06 on 3 and 9999997 DF,  p-value: < 2.2e-16


$\tilde{X}$:

In [37]:
X1_Z_model <- lm(paste("X", Z_form, "+ X_self"), data = cbind(data, X_self = data$X))
X1_Z_residuals = X1_Z_model$residuals
summary(X1_Z_model)

"the response appeared on the right-hand side and was dropped"
"problem with term 1 in model.matrix: no columns are assigned"



Call:
lm(formula = paste("X", Z_form, "+ X_self"), data = cbind(data, 
    X_self = data$X))

Residuals:
       Min         1Q     Median         3Q        Max 
-2.004e-11  0.000e+00  0.000e+00  0.000e+00  5.000e-15 

Coefficients:
         Estimate Std. Error    t value Pr(>|t|)    
X.2     3.294e-13  1.076e-16  3.062e+03   <2e-16 ***
X.3    -2.306e-13  7.955e-17 -2.898e+03   <2e-16 ***
X_self  1.000e+00  3.472e-17  2.880e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.338e-15 on 9999997 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 2.767e+34 on 3 and 9999997 DF,  p-value: < 2.2e-16


$\tilde{X^2}$:

In [40]:
X2_Z_model <- lm(paste("X.2", Z_form, "+ X_self"), data = cbind(data, X_self = data$X.2))
X2_Z_residuals = X2_Z_model$residuals
summary(X2_Z_model)

"the response appeared on the right-hand side and was dropped"
"problem with term 2 in model.matrix: no columns are assigned"



Call:
lm(formula = paste("X.2", Z_form, "+ X_self"), data = cbind(data, 
    X_self = data$X.2))

Residuals:
       Min         1Q     Median         3Q        Max 
-7.000e-16  0.000e+00  0.000e+00  0.000e+00  4.331e-12 

Coefficients:
        Estimate Std. Error   t value Pr(>|t|)    
X      2.298e-14  7.502e-18 3.063e+03   <2e-16 ***
X.3    5.363e-14  1.719e-17 3.120e+03   <2e-16 ***
X_self 1.000e+00  2.324e-17 4.302e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.369e-15 on 9999997 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 3.556e+35 on 3 and 9999997 DF,  p-value: < 2.2e-16


$\tilde{X^3}$:

In [41]:
X3_Z_model <- lm(paste("X.3", Z_form, "+ X_self"), data = cbind(data, X_self = data$X.3))
X3_Z_residuals = X3_Z_model$residuals
summary(X3_Z_model)

"the response appeared on the right-hand side and was dropped"
"problem with term 3 in model.matrix: no columns are assigned"



Call:
lm(formula = paste("X.3", Z_form, "+ X_self"), data = cbind(data, 
    X_self = data$X.3))

Residuals:
       Min         1Q     Median         3Q        Max 
-1.000e-15  0.000e+00  0.000e+00  0.000e+00  6.139e-12 

Coefficients:
         Estimate Std. Error    t value Pr(>|t|)    
X       3.099e-14  1.064e-17  2.914e+03   <2e-16 ***
X.2    -1.031e-13  3.295e-17 -3.128e+03   <2e-16 ***
X_self  1.000e+00  2.437e-17  4.104e+16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.941e-15 on 9999997 degrees of freedom
Multiple R-squared:      1,	Adjusted R-squared:      1 
F-statistic: 1.264e+35 on 3 and 9999997 DF,  p-value: < 2.2e-16


$\tilde{\varepsilon}$ (the residuals of the joint regression, partialled out on $Z$):

In [42]:
residuals_Z_model <- lm(paste("residuals", Z_form), data = cbind(data, residuals = y_residuals))
residuals_Z_residuals = residuals_Z_model$residuals
summary(residuals_Z_model)


Call:
lm(formula = paste("residuals", Z_form), data = cbind(data, residuals = y_residuals))

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5025 -0.7561 -0.0163  0.7286  5.8844 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)
X   -8.765e-17  6.026e-03       0        1
X.2  3.475e-16  1.867e-02       0        1
X.3 -5.880e-16  1.381e-02       0        1

Residual standard error: 1.1 on 9999997 degrees of freedom
Multiple R-squared:  1.11e-32,	Adjusted R-squared:  -3e-07 
F-statistic: 3.698e-26 on 3 and 9999997 DF,  p-value: 1


Now let us turn our attention to the relationship between the base model and a model constructed with the partialled-out variables. The partialling out operation is linear, meaning that

\begin{align*}
A = bB + cC \Rightarrow \tilde{A} = b\tilde{B} + c\tilde{C}
\end{align*}
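
This linearity can be checked numerically. The following is a minimal sketch on made-up data (the names `W_demo`, `A_demo`, `B_demo`, `C_demo`, the helper `partial_out`, and the constants $b = 2$, $c = -5$ are all illustrative assumptions): it partials the same regressors out of each variable with a no-intercept regression and compares both sides of the implication.

```r
set.seed(3)
n_demo <- 1000
W_demo <- cbind(w1 = runif(n_demo), w2 = rnorm(n_demo))  # regressors to partial out
B_demo <- rnorm(n_demo) + W_demo[, "w1"]
C_demo <- rexp(n_demo) - W_demo[, "w2"]
A_demo <- 2 * B_demo - 5 * C_demo                        # A = bB + cC with b = 2, c = -5

# Hypothetical helper: residuals of v after a no-intercept regression on W
partial_out <- function(v, W) resid(lm(v ~ 0 + W))

A_tilde <- partial_out(A_demo, W_demo)
B_tilde <- partial_out(B_demo, W_demo)
C_tilde <- partial_out(C_demo, W_demo)

# Residualizing A directly matches combining the residuals of B and C
print(all.equal(A_tilde, 2 * B_tilde - 5 * C_tilde))
```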

Applying this to our initial BLP equation (1), we get

\begin{gather*}
\tilde{Y} = \theta \tilde{D} + \tilde{Z}\gamma + \tilde{\varepsilon} \quad\quad(2)
\end{gather*}

We can verify this by computing the right side of this equation:




In [44]:
residuals_data = data.frame(y_Z_residuals, D_Z_residuals, X1_Z_residuals, X2_Z_residuals, X3_Z_residuals)
right_side_residuals <- lm(y_Z_residuals ~ 0 + ., data = residuals_data)$fitted.values + residuals_Z_residuals
print(right_side_residuals)

            1             2             3             4             5 
-3.528963e+00  5.340305e+00  1.970987e+00  4.539399e+00  1.081923e+00 
            6             7             8             9            10 
-2.963575e-01 -8.908738e+00 -1.025743e+00  2.733898e+00 -9.870433e-01 
           11            12            13            14            15 
 1.452224e+01 -3.925711e+00  9.328851e+00  1.951845e+00 -3.208234e+00 
           16            17            18            19            20 
-3.201170e+00 -1.957280e+00 -5.119784e-01 -1.294875e+00  1.460243e+00 
           21            22            23            24            25 
-8.885301e-02  6.507022e-02 -1.067400e+00 -6.199894e+00  1.214276e+00 
           26            27            28            29            30 
 4.550435e+00  8.128839e-02  1.738922e-01  2.521467e+00 -5.264314e+00 
           31            32            33            34            35 
-9.103418e+00 -3.405928e-01 -5.877821e+00  3.655517e+00 -3.680380e+00 
      

Now we compute the left side, which is just the residuals of $Y$ on $Z$

In [45]:
print(y_Z_residuals)

            1             2             3             4             5 
-3.520243e+00  5.236916e+00  2.307596e+00  4.539460e+00  1.081935e+00 
            6             7             8             9            10 
-2.965684e-01 -8.908770e+00 -1.025536e+00  2.734003e+00 -9.877901e-01 
           11            12            13            14            15 
 1.452222e+01 -3.925445e+00  9.329237e+00  1.951725e+00 -3.208270e+00 
           16            17            18            19            20 
-3.201155e+00 -1.957353e+00 -5.119406e-01 -1.294462e+00  1.460534e+00 
           21            22            23            24            25 
-8.870799e-02  6.532304e-02 -1.067427e+00 -6.200134e+00  1.214046e+00 
           26            27            28            29            30 
 4.550367e+00  8.132150e-02  1.738674e-01  2.520707e+00 -5.264531e+00 
           31            32            33            34            35 
-9.103346e+00 -3.403951e-01 -5.878740e+00  3.655297e+00 -3.681157e+00 
      

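If we compute the difference between the left side and the right side, most of the displayed values are below $10^{-3}$ in absolute value, although the first three are larger (up to about $0.34$). A likely cause is that the computed $\tilde{Z}$ columns are floating-point noise rather than exact zeros, which makes the regression that includes them numerically unstable.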

In [46]:
print(y_Z_residuals - right_side_residuals)

            1             2             3             4             5 
 8.720524e-03 -1.033887e-01  3.366094e-01  6.129037e-05  1.207352e-05 
            6             7             8             9            10 
-2.109124e-04 -3.242151e-05  2.065491e-04  1.046738e-04 -7.468066e-04 
           11            12            13            14            15 
-1.509522e-05  2.657269e-04  3.857797e-04 -1.200024e-04 -3.634812e-05 
           16            17            18            19            20 
 1.480070e-05 -7.285243e-05  3.777143e-05  4.136909e-04  2.913932e-04 
           21            22            23            24            25 
 1.450196e-04  2.528227e-04 -2.647915e-05 -2.393517e-04 -2.299526e-04 
           26            27            28            29            30 
-6.864363e-05  3.310734e-05 -2.486949e-05 -7.597159e-04 -2.178251e-04 
           31            32            33            34            35 
 7.178921e-05  1.976710e-04 -9.184836e-04 -2.203168e-04 -7.772056e-04 
      

Why is this notable? If we take a look at (2), we should realize that
1. $\tilde{Z}=0$, as $Z$ can perfectly "predict" its own value; and
2. $\tilde{\varepsilon}=\varepsilon$, as $\varepsilon$ is orthogonal to $Z$ by construction of the BLP, and therefore cannot be linearly predicted with it.
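
Both points can be illustrated on a small simulated sample. The sketch below is an assumption-laden illustration, not the lab's own code: the names `Z_demo`, `d_demo`, `y_demo` are made up, the process for `d_demo` is simplified, and $\tilde{Z}$ is computed with `qr.resid()` instead of regressing each column on itself through `lm()`.

```r
set.seed(4)
n_demo <- 2000
x <- runif(n_demo)
Z_demo <- cbind(x1 = x, x2 = x^2, x3 = x^3)
d_demo <- exp(x) + rnorm(n_demo)
y_demo <- 3 * d_demo + exp(4 * x) + rnorm(n_demo)

# Residuals of the jointly estimated (no-intercept) regression
eps_hat <- resid(lm(y_demo ~ 0 + d_demo + Z_demo))

# 1. Residuals of Z on Z: the projection reproduces Z, so nothing is left over
Z_tilde <- qr.resid(qr(Z_demo), Z_demo)
print(max(abs(Z_tilde)))  # numerically zero, not exactly zero

# 2. Partialling Z out of eps_hat leaves it unchanged
eps_tilde <- resid(lm(eps_hat ~ 0 + Z_demo))
print(all.equal(eps_tilde, eps_hat))
```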

This means that we can simplify (2) to

\begin{gather*}
\tilde{Y} = \theta \tilde{D} + \varepsilon\text{,}\quad\quad(3)
\end{gather*}

meaning that we can regress $\tilde{Y}$ on $\tilde{D}$ and we will find the same result as with the jointly estimated predictor. To test this, first let's take a look again at the jointly estimated predictor.

In [47]:
summary(base_model_results)


Call:
lm(formula = y ~ 0 + ., data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.5025 -0.7561 -0.0163  0.7286  5.8844 

Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
X    2.301e+01  7.805e-03    2948   <2e-16 ***
X.2 -6.697e+01  2.167e-02   -3090   <2e-16 ***
X.3  9.679e+01  1.541e-02    6280   <2e-16 ***
D    3.011e+00  2.168e-04   13887   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 9999996 degrees of freedom
Multiple R-squared:  0.9982,	Adjusted R-squared:  0.9982 
F-statistic: 1.354e+09 on 4 and 9999996 DF,  p-value: < 2.2e-16


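Now, according to (3), regressing $\tilde{Y}$ on $\tilde{D}$ should give the same coefficient $\theta$ for $D$ as above. Note that the call below also fits an intercept, which the regressions on $Z$ did not include, so a small discrepancy is possible: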

In [48]:
summary(lm(y_Z_residuals ~ D_Z_residuals, data = residuals_data))


Call:
lm(formula = y_Z_residuals ~ D_Z_residuals, data = residuals_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.4937 -0.7464 -0.0067  0.7383  5.8945 

Coefficients:
                Estimate Std. Error  t value Pr(>|t|)    
(Intercept)   -0.0100031  0.0003544   -28.23   <2e-16 ***
D_Z_residuals  3.0119042  0.0002209 13635.39   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.1 on 9999998 degrees of freedom
Multiple R-squared:  0.949,	Adjusted R-squared:  0.949 
F-statistic: 1.859e+08 on 1 and 9999998 DF,  p-value: < 2.2e-16
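
The equivalence stated in (3) can also be checked on a smaller simulated sample. This is a minimal sketch under stated assumptions: the data frame `sim` and the names `theta_joint`, `theta_partial` are made up for illustration, the sample size is reduced to 5000, and every regression omits the intercept so that the same regressors $Z$ are used throughout.

```r
set.seed(5)
n_demo <- 5000
x <- runif(n_demo)
sim <- data.frame(X = x, X.2 = x^2, X.3 = x^3)
sim$D <- exp(x) - log(x) + rnorm(n_demo)
sim$y <- 3 * sim$D + exp(4 * x) + rnorm(n_demo)

# Coefficient on D from the jointly estimated regression
theta_joint <- coef(lm(y ~ 0 + D + X + X.2 + X.3, data = sim))["D"]

# Partial Z out of y and D, then regress residual on residual
y_tilde <- resid(lm(y ~ 0 + X + X.2 + X.3, data = sim))
D_tilde <- resid(lm(D ~ 0 + X + X.2 + X.3, data = sim))
theta_partial <- coef(lm(y_tilde ~ 0 + D_tilde))["D_tilde"]

# The two estimates agree up to numerical tolerance
print(all.equal(unname(theta_joint), unname(theta_partial)))
```

If the final regression included an intercept while $Z$ did not, the two estimates would be close but need not match exactly.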
