# Measurement Error and Attenuation Bias
### Econ 21410 Lecture Notes
#### Ari Boyarsky (aboyarsky@uchicago.edu)

Suppose $(y_i, z_i)$ are random variables with $y_i = \beta z_i + \epsilon_i$. However, we do not observe $z_i$; instead we observe

$$x_i = z_i + u_i$$

where $u_i$ is independent of $\epsilon_i$ and $z_i$. This is called measurement error; when $\mathbb{E}[u_i] = 0$ it is called classical measurement error. Substituting $z_i = x_i - u_i$ gives

$$y_i = z_i\beta + \epsilon_i = (x_i - u_i)\beta + \epsilon_i = x_i\beta + v_i$$

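where $v_i = \epsilon_i - u_i\beta$. Then $\mathbb{E}[x_iv_i]=\mathbb{E}[(z_i + u_i)(\epsilon_i - u_i\beta)] = -\mathbb{E}[u_i^2]\beta$, since the cross terms vanish under classical measurement error when $\mathbb{E}[z_i\epsilon_i] = 0$. So if $\beta \neq 0$ and $\mathbb{E}[u_i^2] > 0$, the regressor $x_i$ is correlated with the error $v_i$, and OLS of $y_i$ on $x_i$ is biased and inconsistent for $\beta$.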

We can go further and characterize this bias. Specifically, with a single regressor, OLS converges to

$$\beta^\star = \beta + \frac{\mathbb{E}[x_iv_i]}{\mathbb{E}[x_i^2]} = \beta\left(1-\frac{\mathbb{E}[u_i^2]}{\mathbb{E}[x_i^2]}\right)$$

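Because $\mathbb{E}[x_i^2] = \mathbb{E}[z_i^2] + \mathbb{E}[u_i^2]$ under classical measurement error, we have $0 < \frac{\mathbb{E}[u_i^2]}{\mathbb{E}[x_i^2]} < 1$ whenever $z_i$ and $u_i$ both have positive variance, so $\beta^\star$ is shrunk toward 0 relative to $\beta$. We call this attenuation bias.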
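The attenuation formula can be checked numerically. The sketch below is illustrative only: the helper `attenuation_factor`, the sample size, and the choice of unit variances and $\beta = 5$ are assumptions made for this example. It uses the fact that $1 - \mathbb{E}[u_i^2]/\mathbb{E}[x_i^2] = \mathbb{E}[z_i^2]/(\mathbb{E}[z_i^2] + \mathbb{E}[u_i^2])$ when $z_i$ and $u_i$ are uncorrelated and mean zero.

```python
import numpy as np

def attenuation_factor(var_z, var_u):
    """Hypothetical helper: the factor multiplying beta in the OLS limit,
    E[z^2] / (E[z^2] + E[u^2]), for mean-zero, uncorrelated z and u."""
    return var_z / (var_z + var_u)

# Assumed design for this illustration: unit variances and beta = 5
rng = np.random.default_rng(0)
n = 200_000
beta = 5.0
z = rng.standard_normal(n)    # true regressor
u = rng.standard_normal(n)    # classical measurement error
eps = rng.standard_normal(n)  # structural error
y = beta * z + eps
x = z + u                     # mismeasured regressor

# No-intercept OLS slope of y on x: sum(x*y) / sum(x*x)
beta_ols = (x @ y) / (x @ x)
beta_star = beta * attenuation_factor(var_z=1.0, var_u=1.0)

print(beta_star, beta_ols)
```

With equal variances for $z_i$ and $u_i$ the factor is one half, so the slope on the mismeasured regressor should land near $\beta/2$ rather than $\beta$.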

To solve this problem we often turn to instrumental variables. A good choice of instrument is another measurement $w_i$ of $z_i$ whose own measurement error is independent of $u_i$ and $\epsilon_i$. In particular,

$$\hat\beta_{IV} = \frac{Cov(y_i,w_i)}{Cov(x_i,w_i)} = \frac{Cov(\beta z_i + \epsilon_i,w_i)}{Cov(z_i + u_i,w_i)}$$

And so,

$$\underset{n\to\infty}{\text{plim}}\;\; \hat\beta_{IV} = \beta$$
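The step from the covariance ratio to this limit can be written out explicitly. This is a sketch under added assumptions: the instrument is a second noisy measurement $w_i = z_i + \eta_i$, where $\eta_i$ (a symbol introduced here for illustration) is independent of $z_i$, $u_i$, and $\epsilon_i$; $\epsilon_i$ is uncorrelated with $z_i$; and $Var(z_i) > 0$. Then $Cov(\epsilon_i, w_i) = Cov(u_i, w_i) = 0$ and $Cov(z_i, w_i) = Var(z_i)$, so

$$\frac{Cov(\beta z_i + \epsilon_i, w_i)}{Cov(z_i + u_i, w_i)} = \frac{\beta\, Cov(z_i, w_i) + Cov(\epsilon_i, w_i)}{Cov(z_i, w_i) + Cov(u_i, w_i)} = \frac{\beta\, Var(z_i)}{Var(z_i)} = \beta$$

The measurement error $u_i$ drops out of the denominator because it is uncorrelated with the instrument, which is exactly what OLS on $x_i$ fails to achieve.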

Now, with the theory out of the way, let us turn to a Monte Carlo simulation.

In [62]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.sandbox.regression.gmm import IV2SLS
import linearmodels

In [78]:
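def simulate_data_ex1(N=50, seed=65594):
    # Seed only when a seed is supplied, so callers can manage the RNG themselves
    if seed is not None:
        np.random.seed(seed)
    beta = 5
    # Naming note: X is the true regressor (z_i in the notes above), while the
    # columns 'z' and 'w' are two noisy measurements of it ('z' plays the role of x_i)
    X = np.random.randn(N)
    eps = np.random.randn(N)  # structural error
    u = np.random.randn(N)    # measurement error in 'z'
    v = np.random.randn(N)    # measurement error in 'w', independent of u
    df = pd.DataFrame(eps, columns=['eps'])
    df['u'] = u
    df['v'] = v
    df['y'] = beta*X + df.eps
    df['z'] = X + df.u
    df['w'] = X + df.v
    df = df[['y', 'z', 'w']]
    return df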

In [79]:
df = simulate_data_ex1(N=100, seed=75100)
df.head()
reg = smf.ols('y ~ z', df).fit()

In [85]:
np.random.seed(100)
M = 1000
results = pd.DataFrame(index=range(M), columns=reg.params.index, dtype=float)
for m in range(M):
    df = simulate_data_ex1(seed=None, N=100)
    reg = smf.ols('y ~ z', df).fit()
    results.loc[m, :] = reg.params

In [86]:
results.head()

|   | z |
|---|----------|
| 0 | 2.183365 |
| 1 | 2.210971 |
| 2 | 2.791391 |
| 3 | 2.780626 |
| 4 | 2.4249 |


In [87]:
np.mean(results.z)

2.5055386361206273

In [88]:
np.random.seed(100)
M = 1000
ivs = list()
for m in range(M):
    df = simulate_data_ex1(seed=None, N=100)
    reg = linearmodels.IV2SLS.from_formula('y ~ [z ~ w]', data=df).fit()
    ivs.append(reg.params["z"])

In [89]:
np.mean(ivs)

5.12501128890838
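To judge whether a Monte Carlo average such as the one above differs from the true $\beta$ by more than simulation noise, it can help to report the spread of the draws alongside the mean. The helper below, `mc_summary`, is a hypothetical name introduced for this sketch and is not part of the original notebook. It also reports the median, since a just-identified IV estimator can have heavy tails in finite samples, which makes the mean a noisy summary.

```python
import numpy as np

def mc_summary(estimates):
    """Hypothetical helper: summarize a collection of Monte Carlo estimates."""
    est = np.asarray(estimates, dtype=float)
    sd = est.std(ddof=1)  # sample standard deviation across replications
    return {
        "mean": est.mean(),
        "median": np.median(est),
        "sd": sd,
        "mc_se": sd / np.sqrt(est.size),  # standard error of the Monte Carlo mean
    }

# Toy input to show the output shape; not results from the simulation above
demo = mc_summary([4.0, 5.0, 6.0])
print(demo)
```

In the notebook this could be called as `mc_summary(ivs)` and `mc_summary(results.z)` to compare the IV and OLS draws side by side.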