# Lab 3: Asymptotic Properties of Best Linear Predictor and Partialling Out

To see how partialling out functions both at the population and at the sample level, we will first review the asymptotic properties of the BLP that we defined in previous classes.

First we load the libraries.

In [1]:
import numpy as np
import statsmodels.api as sms

Recall the usual linear model. We have a collection of random variables $(Y, X) \in \mathbb{R}\times\mathbb{R}^p$ with some joint distribution $\mathcal{D}$. We can build a linear approximation of $Y$ given $X$ at the population level by finding a $\beta$ such that

\begin{gather*}
\beta \in \arg\min_{b\in\mathbb{R}^p}E\left[\left(Y-Xb\right)^2\right]
\end{gather*}

However, when working with samples, we estimate $\beta$ by building $\hat{\beta}$ such that

\begin{gather*}
\hat{\beta} = \arg\min_{b\in\mathbb{R}^p}\mathbb{E}_n\left[\left(Y-Xb\right)^2\right]
\end{gather*}
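When the sample Gram matrix $\mathbb{E}_n[X'X]$ is invertible, this minimizer has the closed form $\hat{\beta} = \mathbb{E}_n[X'X]^{-1}\mathbb{E}_n[X'Y]$. The following is a minimal sketch on a small synthetic sample; the data, the seed, and the names `X_demo`, `y_demo` and `b_true` are illustrative assumptions and not part of the lab. It compares the closed form with a generic least squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)          # arbitrary seed, for reproducibility
n, p = 500, 3
X_demo = rng.normal(size=(n, p))        # hypothetical regressors
b_true = np.array([1.0, -2.0, 0.5])     # hypothetical coefficients
y_demo = X_demo @ b_true + rng.normal(size=n)

# Normal equations: E_n[X'X] b = E_n[X'Y]
gram = X_demo.T @ X_demo / n
moment = X_demo.T @ y_demo / n
beta_hat_closed = np.linalg.solve(gram, moment)

# The same minimizer obtained from a least squares solver
beta_hat_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(beta_hat_closed)
print(beta_hat_lstsq)
```

Both vectors should agree up to floating point error, since they solve the same minimization problem.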

Like in the previous lab, we will approximate $Y$ drawn from the following data generating process

\begin{align*}
Y &= e^{4X} + \epsilon_Y\\
X &= \epsilon_X\\
\epsilon_Y &\sim\mathcal{N}(0, 1)\\
\epsilon_X&\sim\mathcal{U}(0, 1)
\end{align*}

In [2]:
N = 10_000_000
X = np.random.uniform(0, 1, (N, 1))
epsilon_y = np.random.normal(0, 1, (N, 1))
y = np.exp(4 * X) + epsilon_y

Remember that our initial goal is to approximate $Y$ through some linear function, which in this case means we want to approximate $g(X) = e^{4X}$. To do this, we will use a third-degree polynomial as in the last lab: we define $Z = (X, X^2, X^3)$ and use the least squares method to find $m(Z)\approx g(X)$. With this approximation we have that

\begin{gather*}
Y = m(Z) + \varepsilon\\
m(Z) = Z\gamma_{YZ}
\end{gather*}

where $\gamma_{YZ}$ is the vector of coefficients. This time, we will take a closer look at these coefficients.
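To get a feel for how close $m(Z)$ can get to $g(X)$, the sketch below fits the cubic directly to the noiseless $g(X) = e^{4X}$ on a modest simulated sample and reports the root mean squared approximation error. The sample size, the seed, and the names `Z_demo`, `gamma` and `rmse` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)               # arbitrary seed
x = rng.uniform(0, 1, size=20_000)
g = np.exp(4 * x)                            # target function g(X), without noise
Z_demo = np.column_stack((x, x**2, x**3))    # no intercept, as in the lab

gamma, *_ = np.linalg.lstsq(Z_demo, g, rcond=None)
m = Z_demo @ gamma                           # linear approximation m(Z)
rmse = np.sqrt(np.mean((g - m) ** 2))        # approximation error of the cubic

print("gamma:", gamma)
print("RMSE of m(Z) against g(X):", rmse)
```

The error is not zero because a cubic without an intercept cannot reproduce $e^{4X}$ exactly; for instance, $m(Z) = 0$ at $X = 0$ while $g(0) = 1$.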

In [3]:
Z = np.hstack((X, X ** 2, X ** 3))

In [4]:
model = sms.OLS(y, Z)
print(model.fit().summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.997
Model:                            OLS   Adj. R-squared (uncentered):              0.997
Method:                 Least Squares   F-statistic:                          1.026e+09
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:44:20   Log-Likelihood:                     -1.5139e+07
No. Observations:            10000000   AIC:                                  3.028e+07
Df Residuals:                 9999997   BIC:                                  3.028e+07
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

We now pick three samples: a small one (about 50 observations), a medium one (about 250 observations), and a large one (close to 10 000 observations).

In [7]:
sampling_values = np.random.uniform(0, 1, (N, 1))
small_sample = (sampling_values < .000005).reshape(-1)
medium_sample = (sampling_values < .000025).reshape(-1)
large_sample = (sampling_values < .001).reshape(-1)

print("Small sample size:", small_sample.sum())
print("Medium sample size:", medium_sample.sum())
print("Large sample size:", large_sample.sum())

Small sample size: 58
Medium sample size: 275
Large sample size: 10014
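Each observation is kept when its uniform draw falls below a threshold $p$, so the sample size is random with expected value $Np$ (here $10^7 \times 0.000005 = 50$, $10^7 \times 0.000025 = 250$ and $10^7 \times 0.001 = 10\,000$). Because the three masks reuse the same draws, the samples are also nested. A small sketch of both points, using a hypothetical smaller population `N_demo` and rescaled thresholds:

```python
import numpy as np

rng = np.random.default_rng(4)           # arbitrary seed
N_demo = 1_000_000                       # hypothetical, smaller than the lab's N
draws = rng.uniform(0, 1, N_demo)

# Thresholds rescaled so the expected sizes stay near 50, 250 and 10 000
thresholds = {"small": 0.00005, "medium": 0.00025, "large": 0.01}
masks = {name: draws < p for name, p in thresholds.items()}

for name, p in thresholds.items():
    print(name, "expected:", N_demo * p, "realized:", masks[name].sum())

# Every observation in a smaller sample also belongs to the larger ones
nested = bool(
    np.all(masks["medium"][masks["small"]])
    and np.all(masks["large"][masks["medium"]])
)
print("nested:", nested)
```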


Our $\hat{\beta}$ should approximate $\beta$ more closely as $n$ grows. We can see this by fitting the model on the whole population and comparing it with fits on larger and larger samples, because for large $n$ our estimator $\hat{\beta}$ is approximately distributed as

\begin{gather*}
\hat{\beta}\overset{a}{\sim}\mathcal{N}(\beta, \mathbf{V}/n)
\end{gather*}

Notice that the variance shrinks with the sample size. In the following results, pay attention to what happens to the confidence intervals as the sample size grows.
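The $1/n$ scaling of the variance can also be checked by simulation. The sketch below is a simplified illustration with a single regressor and a hypothetical true slope of 2 (neither is taken from the lab): it redraws many samples of size 100 and of size 1600 and compares the spread of the resulting estimates. If the standard deviation scales as $1/\sqrt{n}$, the ratio should be close to $\sqrt{1600/100} = 4$.

```python
import numpy as np

rng = np.random.default_rng(2)   # arbitrary seed

def beta_hat_draws(n, reps=2000):
    """Return `reps` OLS slope estimates (no intercept), each from a fresh sample of size n."""
    x = rng.uniform(0, 1, size=(reps, n))
    y = 2.0 * x + rng.normal(size=(reps, n))      # hypothetical true slope of 2
    return (x * y).sum(axis=1) / (x ** 2).sum(axis=1)

sd_small = beta_hat_draws(100).std()
sd_large = beta_hat_draws(1600).std()
ratio = sd_small / sd_large

print("sd with n=100: ", sd_small)
print("sd with n=1600:", sd_large)
print("ratio:", ratio)
```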

In [8]:
print(sms.OLS(y[small_sample], Z[small_sample, :]).fit().summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.997
Model:                            OLS   Adj. R-squared (uncentered):              0.997
Method:                 Least Squares   F-statistic:                              6456.
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                    5.11e-70
Time:                        11:45:58   Log-Likelihood:                         -86.138
No. Observations:                  58   AIC:                                      178.3
Df Residuals:                      55   BIC:                                      184.5
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [9]:
print(sms.OLS(y[medium_sample], Z[medium_sample]).fit().summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.997
Model:                            OLS   Adj. R-squared (uncentered):              0.997
Method:                 Least Squares   F-statistic:                          3.195e+04
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:46:00   Log-Likelihood:                         -417.14
No. Observations:                 275   AIC:                                      840.3
Df Residuals:                     272   BIC:                                      851.1
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

In [10]:
print(sms.OLS(y[large_sample], Z[large_sample]).fit().summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.997
Model:                            OLS   Adj. R-squared (uncentered):              0.997
Method:                 Least Squares   F-statistic:                          1.068e+06
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:46:05   Log-Likelihood:                         -15073.
No. Observations:               10014   AIC:                                  3.015e+04
Df Residuals:                   10011   BIC:                                  3.017e+04
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

Now that we have reviewed this, we can add another variable to the mix. This variable will be jointly distributed with $X$ and with $Y$. We will define the data generating process as follows (the simulation below sets $\theta = 3$):

\begin{align*}
Y &= \theta D + e^{4X} + \epsilon_Y\\
D &= e^X - \log{(X)} + \epsilon_D\\
X &= \epsilon_X\\
\epsilon_Y, \epsilon_D&\sim\text{i.i.d.}\mathcal{N}(0, 1)\\
\epsilon_X&\sim\mathcal{U}(0, 1)
\end{align*}

In [11]:
epsilon_d = np.random.normal(0, 1, (N, 1))
D = np.exp(X) - np.log(X) + epsilon_d
y = 3 * D + np.exp(4 * X) + epsilon_y

We will focus on understanding partialling out with this example, going step by step through the process and how we get the results. First, let's take a look at the coefficients for our regressors when they are jointly estimated, meaning that we find the BLP of the form

\begin{gather*}
Y = \theta D + Z\gamma + \varepsilon \quad\quad(1)
\end{gather*}

In [12]:
base_model_results = sms.OLS(y, np.hstack((D, Z))).fit()
print(base_model_results.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.998
Model:                            OLS   Adj. R-squared (uncentered):              0.998
Method:                 Least Squares   F-statistic:                          1.355e+09
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:46:17   Log-Likelihood:                     -1.5137e+07
No. Observations:            10000000   AIC:                                  3.027e+07
Df Residuals:                 9999996   BIC:                                  3.027e+07
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

Here we have a linear approximation that uses all the variables at the same time, and all coefficients are estimated jointly. As before, this means that

\begin{gather*}
Y = \hat{\theta}D + Z\hat{\gamma} + \varepsilon
\end{gather*}

We will save the residuals $\varepsilon$ of this regression for later.

In [13]:
y_residuals = base_model_results.resid.reshape((-1, 1))

As we saw in class, partialling out consists of computing the residuals $\tilde{V}$ for each variable $V$ in our regression with some collection of regressors $W$.

\begin{gather*}
\tilde{V}=V-\hat{V}
\end{gather*}

where $\hat{V}=m(W)$ is a prediction rule for $V$ with $W$. The vectors $\tilde{V}$ are called the "residuals of $V$ on $W$." In the following examples, we will focus on the best linear predictor built with $Z = (X, X^2, X^3)$. This means we estimate

\begin{align*}
Y =& Z\gamma_{ZY} + \varepsilon_{ZY}\\
D =& Z\gamma_{ZD} + \varepsilon_{ZD}\\
&\text{etc...}
\end{align*}

After estimating these linear predictors, we build the residuals

\begin{align*}
\tilde{Y} =& Y - Z\hat{\gamma}_{ZY}\\
\tilde{D} =& D - Z\hat{\gamma}_{ZD}\\
&\text{etc...}
\end{align*}
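The residual construction above can be wrapped in a small helper. The sketch below uses plain NumPy rather than `statsmodels`, and the function name `residualize`, the toy data, and the seed are assumptions made for illustration. It also checks the defining property of least squares residuals: they are orthogonal to every column of the regressors.

```python
import numpy as np

def residualize(V, W):
    """Residuals of V after a least squares projection on the columns of W."""
    coef, *_ = np.linalg.lstsq(W, V, rcond=None)
    return V - W @ coef

rng = np.random.default_rng(3)                 # arbitrary seed
n = 1_000
x = rng.uniform(0, 1, size=(n, 1))
W_demo = np.hstack((x, x**2, x**3))            # same construction as Z in the lab
V_demo = np.exp(x) + rng.normal(size=(n, 1))   # hypothetical variable to partial out

V_tilde = residualize(V_demo, W_demo)
orth = W_demo.T @ V_tilde / n                  # sample moments E_n[W' V_tilde]

print(orth.ravel())                            # numerically zero
```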

We will build these residuals for all the variables in our regression on our collection of constructed regressors $Z$, starting with the residuals of $Y$ on $Z$, $\tilde{Y}$:

In [14]:
y_Z_model = sms.OLS(y, Z).fit()
y_Z_residuals = y_Z_model.resid.reshape((-1, 1))
print(y_Z_model.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.963
Model:                            OLS   Adj. R-squared (uncentered):              0.963
Method:                 Least Squares   F-statistic:                          8.587e+07
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:46:28   Log-Likelihood:                     -3.0188e+07
No. Observations:            10000000   AIC:                                  6.038e+07
Df Residuals:                 9999997   BIC:                                  6.038e+07
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

$\tilde{D}$:

In [15]:
D_Z_model = sms.OLS(D, Z).fit()
D_Z_residuals = D_Z_model.resid.reshape((-1, 1))
print(D_Z_model.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.709
Model:                            OLS   Adj. R-squared (uncentered):              0.709
Method:                 Least Squares   F-statistic:                          8.106e+06
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:46:41   Log-Likelihood:                     -1.8914e+07
No. Observations:            10000000   AIC:                                  3.783e+07
Df Residuals:                 9999997   BIC:                                  3.783e+07
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

$\tilde{X}$:

In [16]:
X1_Z_model = sms.OLS(Z[:, 0], Z).fit()
X1_Z_residuals= X1_Z_model.resid.reshape((-1, 1))
print(X1_Z_model.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          4.519e+34
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:46:51   Log-Likelihood:                      3.1519e+08
No. Observations:            10000000   AIC:                                 -6.304e+08
Df Residuals:                 9999997   BIC:                                 -6.304e+08
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

$\tilde{X^2}$:

In [17]:
X2_Z_model = sms.OLS(Z[:, 1], Z).fit()
X2_Z_residuals= X2_Z_model.resid.reshape((-1, 1))
print(X2_Z_model.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          1.252e+34
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:46:59   Log-Likelihood:                      3.1132e+08
No. Observations:            10000000   AIC:                                 -6.226e+08
Df Residuals:                 9999997   BIC:                                 -6.226e+08
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

$\tilde{X^3}$:

In [18]:
X3_Z_model = sms.OLS(Z[:, 2], Z).fit()
X3_Z_residuals= X3_Z_model.resid.reshape((-1, 1))
print(X3_Z_model.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   1.000
Model:                            OLS   Adj. R-squared (uncentered):              1.000
Method:                 Least Squares   F-statistic:                          1.235e+35
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:47:10   Log-Likelihood:                      3.2445e+08
No. Observations:            10000000   AIC:                                 -6.489e+08
Df Residuals:                 9999997   BIC:                                 -6.489e+08
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

$\tilde{\varepsilon}$, the residuals of the jointly estimated model partialled out on $Z$:

In [19]:
residuals_Z_model = sms.OLS(y_residuals, Z).fit()
residuals_Z_residuals = residuals_Z_model.resid.reshape((-1, 1))
print(residuals_Z_model.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.000
Model:                            OLS   Adj. R-squared (uncentered):             -0.000
Method:                 Least Squares   F-statistic:                          5.136e-10
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        1.00
Time:                        11:47:21   Log-Likelihood:                     -1.5137e+07
No. Observations:            10000000   AIC:                                  3.027e+07
Df Residuals:                 9999997   BIC:                                  3.027e+07
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

Now let us turn our attention to the relationship between the base model and a model constructed with the partialled out variables. The partialling out operation is linear, meaning that

\begin{align*}
A = bB + cC \Rightarrow \tilde{A} = b\tilde{B} + c\tilde{C}
\end{align*}
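Linearity follows because the residual of a least squares projection is a linear function of the variable being projected. A quick numerical check on toy data is sketched below; the helper `residualize`, the variables `A`, `B`, `C`, and the constants `b`, `c` are illustrative assumptions.

```python
import numpy as np

def residualize(V, W):
    """Residuals of V after a least squares projection on the columns of W."""
    coef, *_ = np.linalg.lstsq(W, V, rcond=None)
    return V - W @ coef

rng = np.random.default_rng(6)            # arbitrary seed
n = 1_000
x = rng.uniform(0, 1, size=(n, 1))
W_demo = np.hstack((x, x**2, x**3))
B = x + rng.normal(size=(n, 1))           # hypothetical variables
C = np.exp(x) + rng.normal(size=(n, 1))
b, c = 2.0, -0.5                          # hypothetical constants
A = b * B + c * C

lhs = residualize(A, W_demo)                                    # A tilde
rhs = b * residualize(B, W_demo) + c * residualize(C, W_demo)   # b B tilde + c C tilde
max_gap = np.abs(lhs - rhs).max()

print("largest absolute gap:", max_gap)
```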

If we apply it to our initial BLP equation (1), we get

\begin{gather*}
\tilde{Y} = \theta \tilde{D} + \tilde{Z}\gamma + \tilde{\varepsilon} \quad\quad(2)
\end{gather*}

We can verify this by computing the right side of this equation:




In [20]:
partialled_regressors = np.hstack((D_Z_residuals, X1_Z_residuals, X2_Z_residuals, X3_Z_residuals))
fitted = sms.OLS(y_Z_residuals, partialled_regressors).fit().fittedvalues.reshape((-1, 1))
right_side_residuals = fitted + residuals_Z_residuals
right_side_residuals

array([[-5.90340289],
       [ 4.14541129],
       [ 3.68388298],
       ...,
       [-0.13436055],
       [ 0.41223384],
       [-0.33203731]], shape=(10000000, 1))

Now we compute the left side, which is just the residuals of $Y$ on $Z$.

In [21]:
y_Z_residuals

array([[-5.90274871],
       [ 4.14594263],
       [ 3.68433286],
       ...,
       [-0.13432159],
       [ 0.41278974],
       [-0.33181887]], shape=(10000000, 1))

If we compute the difference between the left side and the right side, we get small values (on the order of $10^{-4}$, against residuals of order 1), which are likely due to floating point errors from our system.

In [22]:
y_Z_residuals - right_side_residuals

array([[6.54184905e-04],
       [5.31336378e-04],
       [4.49884625e-04],
       ...,
       [3.89607594e-05],
       [5.55901212e-04],
       [2.18442100e-04]], shape=(10000000, 1))

Why is this notable? If we take a look at (2), we should realize that
1. $\tilde{Z}=0$, as $Z$ can perfectly "predict" its own value; and
2. $\tilde{\varepsilon}=\varepsilon$, as $\varepsilon$ is orthogonal to $Z$ (the BLP residual is uncorrelated with every regressor), and therefore cannot be linearly predicted with it.
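Both facts can be checked numerically on a small simulated sample. The sketch below reuses the data generating process of this lab with a hypothetical sample size of 2000 and a NumPy helper `residualize` (an assumption, not part of the lab code).

```python
import numpy as np

def residualize(V, W):
    """Residuals of V after a least squares projection on the columns of W."""
    coef, *_ = np.linalg.lstsq(W, V, rcond=None)
    return V - W @ coef

rng = np.random.default_rng(5)            # arbitrary seed
n = 2_000
x = rng.uniform(0, 1, size=(n, 1))
Z_demo = np.hstack((x, x**2, x**3))
d = np.exp(x) - np.log(x) + rng.normal(size=(n, 1))
y_demo = 3 * d + np.exp(4 * x) + rng.normal(size=(n, 1))

# Residual of the joint regression of Y on (D, Z)
eps_hat = residualize(y_demo, np.hstack((d, Z_demo)))

Z_tilde = residualize(Z_demo, Z_demo)       # fact 1: numerically zero
eps_tilde = residualize(eps_hat, Z_demo)    # fact 2: equal to eps_hat

print("max |Z tilde|:", np.abs(Z_tilde).max())
print("max |eps tilde - eps hat|:", np.abs(eps_tilde - eps_hat).max())
```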

This means that we can simplify (2) to

\begin{gather*}
\tilde{Y} = \theta \tilde{D} + \varepsilon\text{,}\quad\quad(3)
\end{gather*}

meaning that we can regress $\tilde{Y}$ on $\tilde{D}$ and we will find the same result as with the jointly estimated predictor. To test this, first let's take a look again at the jointly estimated predictor.

In [23]:
print(base_model_results.summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.998
Model:                            OLS   Adj. R-squared (uncentered):              0.998
Method:                 Least Squares   F-statistic:                          1.355e+09
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:48:20   Log-Likelihood:                     -1.5137e+07
No. Observations:            10000000   AIC:                                  3.027e+07
Df Residuals:                 9999996   BIC:                                  3.027e+07
Df Model:                           4                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------

Now, according to (3), regressing the residuals $\tilde{Y}$ on $\tilde{D}$ should give the same coefficient $\theta$ for $D$ as above.

In [24]:
print(sms.OLS(y_Z_residuals, D_Z_residuals).fit().summary())

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.951
Model:                            OLS   Adj. R-squared (uncentered):              0.951
Method:                 Least Squares   F-statistic:                          1.929e+08
Date:                Wed, 06 Aug 2025   Prob (F-statistic):                        0.00
Time:                        11:48:24   Log-Likelihood:                     -1.5137e+07
No. Observations:            10000000   AIC:                                  3.027e+07
Df Residuals:                 9999999   BIC:                                  3.027e+07
Df Model:                           1                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
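This equality between the jointly estimated coefficient and the coefficient from the partialled out regression is usually known as the Frisch-Waugh-Lovell theorem. Since the coefficient tables are long, a compact way to compare the two estimates is sketched below on a small simulated sample. It uses NumPy only; the sample size, the seed, and the names `residualize`, `theta_joint` and `theta_fwl` are assumptions for illustration.

```python
import numpy as np

def residualize(V, W):
    """Residuals of V after a least squares projection on the columns of W."""
    coef, *_ = np.linalg.lstsq(W, V, rcond=None)
    return V - W @ coef

rng = np.random.default_rng(7)            # arbitrary seed
n = 2_000
x = rng.uniform(0, 1, size=(n, 1))
Z_demo = np.hstack((x, x**2, x**3))
d = np.exp(x) - np.log(x) + rng.normal(size=(n, 1))
y_demo = 3 * d + np.exp(4 * x) + rng.normal(size=(n, 1))

# Joint regression of Y on (D, Z): the first coefficient is theta hat
coef_joint, *_ = np.linalg.lstsq(np.hstack((d, Z_demo)), y_demo, rcond=None)
theta_joint = coef_joint[0, 0]

# Partialled out regression of Y tilde on D tilde (single regressor, no intercept)
y_tilde = residualize(y_demo, Z_demo)
d_tilde = residualize(d, Z_demo)
theta_fwl = (d_tilde.T @ y_tilde).item() / (d_tilde.T @ d_tilde).item()

print("theta from joint regression:      ", theta_joint)
print("theta from partialled out version:", theta_fwl)
```

The two numbers should coincide up to floating point error, which is an exact algebraic property of least squares and does not depend on the sample size.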