In [1]:
import numpy as np
from numpy import linalg as la
from scipy.stats import chi2
from tabulate import tabulate
import LinearModelsWeek3_post as lm
%load_ext autoreload
%autoreload 2

In [2]:
y, x, T, year, label_y, label_x = lm.load_example_data()

# Problem set introduction
On Tuesday we briefly discussed how the presence of fixed effects can cause the pooled OLS estimator to be biased. To recap, consider the following model,

$$ y_{it} = \boldsymbol{x}_{it}\boldsymbol{\beta} + c_i + u_{it}, \quad i=1, \dotsc, N \quad t=1, \dotsc, T \tag{1} $$

where $c_i$ is an unobservable individual specific component which is constant across time. We consider two different scenarios: 

* **Part 1:** If $c_i$ is systematically related to one or more of the observed variables in the sense that $E[c_{i}\boldsymbol{x}_{it}] \neq \boldsymbol{0}$, then the POLS estimator is _not_ consistent for $\boldsymbol{\beta}$.
* **Part 2:** If $c_i$ is uncorrelated with the regressors such that $E[c_i\boldsymbol{x}_{it}]=\boldsymbol{0}$ for all $t$, then $\boldsymbol{\beta}$ can be consistently estimated by pooled OLS (POLS) and random effects (RE).
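
To make the two scenarios concrete, here is a minimal simulation sketch. It is not part of the problem set: the data are simulated, the true slope is set to 1, and the helper `pols_slope` is a hypothetical name introduced only for this illustration. When $x_{it}$ loads on $c_i$, the pooled slope should drift away from 1; when it does not, the pooled slope should stay close to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 2000, 5, 1.0

# Individual effect c_i, repeated for each of the T periods.
c = np.repeat(rng.normal(size=N), T)

def pols_slope(x, y):
    """OLS slope of y on x with an intercept (hypothetical helper)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Scenario of Part 1: x is correlated with c.
x_corr = 0.8 * c + rng.normal(size=N * T)
y_corr = beta * x_corr + c + rng.normal(size=N * T)

# Scenario of Part 2: x is independent of c.
x_ind = rng.normal(size=N * T)
y_ind = beta * x_ind + c + rng.normal(size=N * T)

b_corr = pols_slope(x_corr, y_corr)
b_ind = pols_slope(x_ind, y_ind)
print(f"POLS slope, x correlated with c:   {b_corr:.3f}")
print(f"POLS slope, x uncorrelated with c: {b_ind:.3f}")
```

In this design the probability limit of the first slope is $1 + \mathrm{Cov}(x_{it}, c_i)/\mathrm{Var}(x_{it}) = 1 + 0.8/1.64$, so the upward bias does not vanish as $N$ grows.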

### Example
Let's take a look at a proper example. We are interested in the effect of unionization on wages, which could be modelled as follows.

$$
\ln(wage_{it}) = \beta_0 + \beta_1\textit{union}_{it} + c_{i} + u_{it} \tag{2}
$$

What could be in $c_i$ that may be correlated with unionizing? Let us first calculate the average union participation by checking the mean of the union variable.

In [3]:
mean_union = x[:, -1].mean()
print(f'About {mean_union * 100:.2f}% of our sample is in a union.')

About 24.40% of our sample is in a union.


There are some fixed effects that we could control for. For example, we may believe that African Americans are more or less prone to unionizing because of socioeconomic factors. We can look at the conditional mean of union membership for African Americans and Hispanics.

In [4]:
black_union = x[x[:, 1] == 1, -1].mean()
hispanic_union = x[x[:, 2] == 1, -1].mean()
print(f'Union membership by subpopulation: African American {black_union * 100:.2f}%, Hispanic {hispanic_union * 100:.2f}%.')

Union membership by subpopulation: African American 37.10%, Hispanic 27.35%.


Ethnicity may therefore be a fixed effect which is systematically related to $\textit{union}$ (again, most likely ethnicity does not affect union, but it might be a proxy for some socio-economic factors that affect union membership). In our data, this is something which we can control for, by including it in our regression.

We therefore consider the somewhat more elaborate model from Tuesday,

$$
\begin{align}
\ln\left(wage_{it}\right) & =\beta_{0}+\beta_{1}\textit{exper}_{it}+\beta_{2}\textit{exper}_{it}^{2}+\beta_{3}\textit{union}_{it}+\beta_{4}\textit{married}_{it}\nonumber \\
 & \quad+\beta_{5}\textit{educ}_{i}+\beta_{6}\textit{hisp}_{i}+\beta_{7}\textit{black}_{i}+c_{i}+u_{it}. \tag{3}
\end{align}
$$

This should solve some of our problems compared to eq. (2), but we still have an issue if, for example, people select into union or non-union jobs based on which sector rewards their innate characteristics best. In that case $E[union_{it}c_i]\neq0$.

## Part 1: Compare POLS to FE/FD
### Question 1:

Start by estimating eq. (3) by POLS. You should already have all the data and code that you need. Print the results in a nice table. Is the unionization coefficient statistically significant?

In [5]:
## FILL IN
# First, estimate y on x without any transformations. Store the resulting dictionary.
# Then, print the resulting dictionary using the provided print_table() function. The labels should have been provided to you.

In [6]:
ols_result = lm.estimate(y, x)
lm.print_table(
    (label_y, label_x), ols_result, title="Pooled OLS", floatfmt='.4f'
)

Pooled OLS
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Constant        -0.0347  0.0646     -0.5375
Black           -0.1438  0.0236     -6.1055
Hispanic         0.0157  0.0208      0.7543
Education        0.0994  0.0047     21.2476
Experience       0.0892  0.0101      8.8200
Experience sqr  -0.0028  0.0007     -4.0272
Married          0.1077  0.0157      6.8592
Union            0.1801  0.0171     10.5179
R² = 0.187
σ² = 0.231


You should get a table that looks like this:

Pooled OLS <br>
Dependent variable: Log wage <br>

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Constant       | -0.0347 | 0.0646 |    -0.5375 |
| Black          | -0.1438 | 0.0236 |    -6.1055 |
| Hispanic       |  0.0157 | 0.0208 |     0.7543 |
| Education      |  0.0994 | 0.0047 |    21.2476 |
| Experience     |  0.0892 | 0.0101 |     8.8200 |
| Experience sqr | -0.0028 | 0.0007 |    -4.0272 |
| Married        |  0.1077 | 0.0157 |     6.8592 |
| Union          |  0.1801 | 0.0171 |    10.5179 |
R² = 0.187 <br>
σ² = 0.231

### Short introduction to fixed effects

One way to control for fixed effects is to "demean" the data. We need to calculate the mean within each person, so we define $\bar{y}_{i}=T^{-1}\sum_{t=1}^{T}y_{it}$, $\mathbf{\bar{x}}_{i}=T^{-1}\sum_{t=1}^{T}\mathbf{x}_{it}$, $\bar{u}_{i}=T^{-1}\sum_{t=1}^{T}u_{it}$, and note that $c_{i} = T^{-1}\sum_{t=1}^{T}c_{i}$ because $c_i$ is constant over time.

Subtracting these means from eq. (1) we are able to demean away the fixed effects,

$$
\begin{align}
y_{it}-\bar{y}_{i} & =\left(\mathbf{x}_{it}-\mathbf{\bar{x}}_{i}\right)\mathbf{\beta}+(\color{red}{c_{i}-c_{i}})+\left(u_{it}-\bar{u}_{i}\right) \\
\Leftrightarrow\ddot{y}_{it} & =\ddot{\mathbf{x}}_{it}\mathbf{\beta} + \ddot{u}_{it}. \tag{4}
\end{align}
$$

Subtracting the mean within each person is not immediately easy. But you are provided with a `perm` function that takes a "transformation matrix" Q and uses it to transform some vector or matrix A, person by person.

In order to demean the data, we need to give this `perm` function the following transformation matrix:

\begin{equation}
\mathbf{Q}_{T}:=\mathbf{I}_{T}-\left(\begin{array}{ccc}
1/T & \ldots & 1/T\\
\vdots & \ddots & \vdots\\
1/T & \ldots & 1/T
\end{array}\right)_{T\times T}.
\end{equation}
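
To illustrate what applying $\mathbf{Q}_T$ person by person amounts to, here is a minimal sketch. The function `apply_blockwise` is a hypothetical stand-in written for this illustration, not the provided `perm`; it assumes the rows are sorted by individual and then by time, with exactly $T$ rows per individual.

```python
import numpy as np

def apply_blockwise(Q, A):
    """Apply Q to each individual's block of T rows in A.

    Hypothetical stand-in for the provided `perm`; assumes A is sorted by
    individual and then by time, with exactly T rows per individual.
    """
    M, T = Q.shape
    N = A.shape[0] // T
    blocks = A.reshape(N, T, -1)   # shape (N, T, K)
    out = Q @ blocks               # Q is applied to every block: (N, M, K)
    return out.reshape(N * M, -1)

T = 3
Q_T = np.eye(T) - np.full((T, T), 1 / T)

# Two individuals, three periods, one variable.
a = np.array([[1.0], [2.0], [6.0], [10.0], [10.0], [13.0]])
a_demeaned = apply_blockwise(Q_T, a)
print(a_demeaned.ravel())
```

The first person has mean 3 and the second has mean 11, so each block of the result holds deviations from that person's own mean and sums to zero.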

### Question 2:
Estimate eq. (3) by fixed effects. You need to perform the following steps:
* Create the demeaning matrix Q.
* Demean x and y using the `perm` function and Q.
* Remove the columns in the demeaned x that are only zeroes (remember to shorten the `label_x` as well).
* Estimate y on x using the demeaned arrays.
* Print it out in a nice table.

In [7]:
## FILL IN
# The steps are outlined in question 2 above.

In [8]:
def demeaning_matrix(T):
    Q_T = np.eye(T) - np.tile(1/T, (T, T))
    return Q_T

# Demean the matrices
Q_T = demeaning_matrix(T)
y_demean = lm.perm(Q_T, y)
x_demean = lm.perm(Q_T, x)

Bonus: check where your linearly dependent columns are.

Note: this is not part of the curriculum, so don't worry if it doesn't make sense. It is meant as a help (use at your own risk).

In [9]:
null_cols = ~x_demean.any(axis=0)

In [10]:
x_demean[:,~null_cols]

array([[ -3.5  , -24.5  ,   0.   ,  -0.125],
       [ -2.5  , -21.5  ,   0.   ,   0.875],
       [ -1.5  , -16.5  ,   0.   ,  -0.125],
       ...,
       [  1.5  ,  22.5  ,   0.375,  -0.375],
       [  2.5  ,  43.5  ,   0.375,   0.625],
       [  3.5  ,  66.5  ,   0.375,   0.625]])

In [11]:
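def null(a):
    """Report linearly dependent columns of `a`; return the indices of the independent ones."""
    rank = la.matrix_rank(a)
    if rank == a.shape[1]:
        print('full rank, yay!')
        indep_cols = None
    else:
        # The rows of v from index `rank` onwards span the null space of a.
        u, s, v = la.svd(a)
        col_loc = np.arange(0, a.shape[1])
        # Heuristic: a column with a non-zero entry in the null-space basis is flagged as dependent.
        locate_right_singular_vec = np.sum(v[rank:], axis=0).round(0)
        print(f'Columns {col_loc[locate_right_singular_vec != 0]} are linearly dependent')
        indep_cols = col_loc[locate_right_singular_vec == 0]
    return indep_cols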

In [12]:
linindepcols =null(x_demean)

Columns [0 1 2 3] are linearly dependent


In [13]:
x_demean = x_demean[:,linindepcols]
#x_demean = x_demean[:, 4:]
label_x_fe = label_x[4:]
null(x_demean)

full rank, yay!


In [14]:
# Estimate using OLS and print the results
fe_result = lm.estimate(
    y_demean, x_demean, transform='fe', T=T
)
lm.print_table(
    (label_y, label_x_fe), 
    fe_result, title='FE regression', floatfmt='.4f'
)

FE regression
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Experience       0.1168  0.0084     13.8778
Experience sqr  -0.0043  0.0006     -7.1057
Married          0.0453  0.0183      2.4743
Union            0.0821  0.0193      4.2553
R² = 0.178
σ² = 0.123


You should get a table that looks like this:

FE regression<br>
Dependent variable: Log wage

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Experience     |  0.1168 | 0.0084 |    13.8778 |
| Experience sqr | -0.0043 | 0.0006 |    -7.1057 |
| Married        |  0.0453 | 0.0183 |     2.4743 |
| Union          |  0.0821 | 0.0193 |     4.2553 |
R² = 0.178 <br>
σ² = 0.123

### Short introduction to first differences

The within transformation is one particular transformation that enables us to get rid of $c_{i}$. An alternative is the first-difference transformation. To see how it works, lag eq. (1) one period and subtract it from eq. (1) such that

\begin{equation}
\Delta y_{it}=\Delta\mathbf{x}_{it}\mathbf{\beta}+\Delta u_{it},\quad t=\color{red}{2},\dotsc,T, \tag{5}
\end{equation}

where $\Delta y_{it}:=y_{it}-y_{it-1}$, $\Delta\mathbf{x}_{it}:=\mathbf{x}_{it}-\mathbf{x}_{it-1}$ and $\Delta u_{it}:=u_{it}-u_{it-1}$. As was the case for the within transformation, first differencing eliminates the time invariant component $c_{i}$. Note, however, that one time period is lost when differencing.

In order to first difference the data, we can pass the following transformation matrix to the `perm` function,

\begin{equation}
\mathbf{D}:=\left(\begin{array}{cccccc}
-1 & 1 & 0 & \ldots & 0 & 0\\
0 & -1 & 1 &  & 0 & 0\\
\vdots &  &  & \ddots &  & \vdots\\
0 & 0 & 0 & \ldots & -1 & 1
\end{array}\right)_{(T - 1)\times T}.
\end{equation}
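
As a small check of the matrix above, the following sketch builds $\mathbf{D}$ for $T=4$ and applies it to one person's observations. The numbers are made up for illustration.

```python
import numpy as np

T = 4
# Drop the first row of (I - lower shift): each remaining row has -1 in
# column t-1 and 1 in column t.
D = (np.eye(T) - np.eye(T, k=-1))[1:]

y_i = np.array([1.0, 3.0, 6.0, 10.0])   # one individual's T observations
c_i = 5.0                                # a time-constant effect

print(D)
print(D @ y_i)            # first differences of y
print(D @ (y_i + c_i))    # adding c_i to every period changes nothing
```

Every row of $\mathbf{D}$ sums to zero, which is why any time-constant component, including $c_i$ and the intercept, drops out.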

### Question 3:
Estimate eq. (3) by first differences. You need to perform the following steps:
* Create the first difference matrix D.
* First difference x and y using the `perm` function and D.
* Remove the columns in the first differenced x that are only zeroes (remember to shorten the `label_x` as well).
* Estimate y on x using the first differenced arrays.
* Print it out in a nice table.

In [15]:
## FILL IN
# The steps are outlined in question 3 above.

In [16]:
# First difference the matrices
def fd_matrix(T):
    D_T = np.eye(T) - np.eye(T, k=-1)
    D_T = D_T[1:]
    return D_T
D_T = fd_matrix(T)
y_diff = lm.perm(D_T, y)
x_diff = lm.perm(D_T, x)
x_diff = x_diff[:, 4:]

# Estimate using OLS and print the results
fd_result = lm.estimate(y_diff, x_diff, transform='fd')
lm.print_table(
    (label_y, label_x_fe), 
    fd_result, title='FD regression', floatfmt='.4f'
)

FD regression
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Experience       0.1158  0.0196      5.9096
Experience sqr  -0.0039  0.0014     -2.8005
Married          0.0381  0.0229      1.6633
Union            0.0428  0.0197      2.1767
R² = 0.004
σ² = 0.196


You should get a table that looks like this:

FD regression <br>
Dependent variable: Log wage

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Experience     |  0.1158 | 0.0196 |     5.9096 |
| Experience sqr | -0.0039 | 0.0014 |    -2.8005 |
| Married        |  0.0381 | 0.0229 |     1.6633 |
| Union          |  0.0428 | 0.0197 |     2.1767 |
R² = 0.004 <br>
σ² = 0.196

## Summing up questions 1, 2, and 3
Compare the results from your POLS, FE and FD estimations. We were mainly interested in the effect of $\textit{union}$ on wages. Did the POLS estimation give a correct conclusion on this? Is the effect greater or smaller than we first thought? Is the effect still statistically significant?

# Part 2: The random effects (RE) estimator.
In part 1 we assumed that $E[union_{it}c_i]\neq0$, and used two methods to remove the fixed effects for each person. Now, what if $E[union_{it}c_i] = 0$? Then POLS is consistent but not efficient, since it does not use the panel structure of the data (the common $c_i$ makes the errors of each person correlated over time). We can therefore do better with the RE estimator.

## A short introduction to the RE estimator
The FE and FD estimators are both computed by OLS after transforming the data in a specific way. We do the same now, but the aim is no longer to transform away the fixed effects. Instead we estimate the following model,

$$\check{y}_{it} = \mathbf{\check{x}}_{it}\mathbf{\beta} + \check{v}_{it},\tag{6}$$ 

where $\check{y}_{it} = y_{it} - \hat{\lambda}\bar{y}_{i}$, $\mathbf{\check{x}}_{it} = \mathbf{x}_{it} - \hat{\lambda}\mathbf{\bar{x}}_{i}$, and $\check{v}_{it} = v_{it} - \hat{\lambda}\bar{v}_{i}$, and where we have gathered the errors in $v_{it} = c_i + u_{it}$. We are *"quasi-demeaning"* the variables by multiplying the means by $\hat{\lambda}$ before subtracting them.

Our challenge is thus to estimate $\lambda$, which we can do as follows,

$$\hat{\lambda} = 1 - \sqrt{\frac{\widehat{\sigma}_{u}^{2}}{(\widehat{\sigma}_{u}^{2} + T\widehat{\sigma}_{c}^{2})}}, $$

where $\widehat{\sigma}_{u}^{2}$ is estimated from the fixed effects regression, and $\hat{\sigma}_{c}^{2} = \hat{\sigma}_{w}^{2} - \frac{1}{T}\hat{\sigma}_{u}^{2}$. Finally, what is $\hat{\sigma}_{w}^{2}$? It is the error variance from the between estimator,


$$
\hat{\sigma}_{w}^{2} = \frac{1}{N-K}\left(\bar{\mathbf{y}} - \mathbf{\bar{X}}\hat{\mathbf{\beta}}_{BE}\right)^{\prime}\left(\bar{\mathbf{y}} - \mathbf{\bar{X}}\hat{\mathbf{\beta}}_{BE}\right),
$$

where $\hat{\mathbf{\beta}}_{BE}$ are the between estimator coefficients. The between-groups estimator is not something we have introduced before, but it is obtained by regressing the time-averaged outcomes $\overline{y}_i$ on the time-averaged regressors $\overline{\mathbf{x}}_i$, $i=1,2,\dotsc,N$.
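
To get a feel for how $\hat{\lambda}$ behaves, here is a minimal sketch of the formula as a function. The name `re_lambda` and the input values are assumptions chosen for illustration; they are not estimates from the data.

```python
import numpy as np

def re_lambda(sigma2_u, sigma2_w, T):
    """Quasi-demeaning weight from the FE and BE error variances (hypothetical helper)."""
    sigma2_c = sigma2_w - sigma2_u / T
    return 1 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_c))

T = 8
lam_none = re_lambda(sigma2_u=0.12, sigma2_w=0.12 / T, T=T)   # sigma2_c = 0
lam_mid = re_lambda(sigma2_u=0.12, sigma2_w=0.12, T=T)        # moderate sigma2_c
lam_large = re_lambda(sigma2_u=0.12, sigma2_w=100.0, T=T)     # sigma2_c dominates
print(lam_none, lam_mid, lam_large)
```

When $\hat{\sigma}_c^2 = 0$ the weight is 0 and no demeaning takes place, so RE coincides with POLS. As $T\hat{\sigma}_c^2$ grows relative to $\hat{\sigma}_u^2$ the weight approaches 1, and RE approaches the full within transformation used by FE.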

### Question 1: The Between Estimator
Estimate the between-groups model, which is simply eq. (1) averaged over time within each individual,

$$
\bar{y}_{i} = \boldsymbol{\bar{x}}_{i}\boldsymbol{\beta} + c_i + \bar{u}_{i}.
$$

So instead of demeaning, like we did in FE, we just calculate the mean with the following transformation *vector* $\mathbf{P}_T$,

\begin{equation} 
\mathbf{P}_T \equiv \left( \frac{1}{T}, \frac{1}{T}, ..., \frac{1}{T} \right)_{1 \times T}
\end{equation}
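
The following sketch contrasts the averaging vector with the demeaning matrix from FE for a single person. The numbers are made up; the point is only that the between part and the within part add back up to the original observations.

```python
import numpy as np

T = 4
P_T = np.full((1, T), 1 / T)               # 1 x T averaging row vector
Q_T = np.eye(T) - np.full((T, T), 1 / T)   # T x T demeaning matrix

y_i = np.array([2.0, 4.0, 4.0, 6.0])       # one individual's observations

y_bar = P_T @ y_i      # between transformation: one number per person
y_ddot = Q_T @ y_i     # within transformation: T deviations from the mean

print(y_bar, y_ddot)
print(y_ddot + y_bar)  # recovers y_i
```

Because $\mathbf{P}_T$ has a single row, each person contributes one row to the transformed data, which is why the between regression has $N$ rows rather than $NT$.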

To estimate eq. (3) with the between estimator, you need to perform the following steps:
* Create the mean matrix `P`.
* Average `x` and `y` within each person using the `perm` function and `P`.
* Regress `y_mean` on `x_mean`. Note that there are $N$ rows in each, not $NT$. 
* Print it out in a nice table.

In [17]:
## FILL IN
# The steps are outlined in question 1 above.

In [18]:
def mean_matrix(T):
    return np.tile(1/T, (1, T))
P_T = mean_matrix(T)
y_mean = lm.perm(P_T, y)
x_mean = lm.perm(P_T, x)
be_result = lm.estimate(
    y_mean, x_mean, transform='be')
lm.print_table(
    labels=(label_y, label_x), results=be_result, 
    title='BE',
    floatfmt=['', '.4f', '.4f', '.2f']
)

BE
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Constant         0.4923  0.2210        2.23
Black           -0.1388  0.0489       -2.84
Hispanic         0.0048  0.0427        0.11
Education        0.0946  0.0109        8.68
Experience      -0.0504  0.0503       -1.00
Experience sqr   0.0051  0.0032        1.60
Married          0.1437  0.0412        3.49
Union            0.2707  0.0466        5.81
R² = 0.219
σ² = 0.121


You should get a table that looks like this:

BE <br>
Dependent variable: Log wage

|                |   Beta |     Se |   t-values |
|----------------|--------|--------|------------|
| Constant        |  0.4923 | 0.2210 |  2.23 | 
| Black           | -0.1388 | 0.0489 | -2.84 | 
| Hispanic        |  0.0048 | 0.0427 |  0.11 | 
| Education       |  0.0946 | 0.0109 |  8.68 | 
| Experience      | -0.0504 | 0.0503 | -1.00 | 
| Experience sqr  |  0.0051 | 0.0032 |  1.60 | 
| Married         |  0.1437 | 0.0412 |  3.49 | 
| Union           |  0.2707 | 0.0466 |  5.81 | 
R² = 0.219 <br>
σ² = 0.121

### Question 2
You should now have all the error variances that you need to calculate,

$$\hat{\lambda} = 1 - \sqrt{\frac{\widehat{\sigma}_{u}^{2}}{(\widehat{\sigma}_{u}^{2} + T\widehat{\sigma}_{c}^{2})}}, $$

In [19]:
# FILL IN

In [20]:
sigma_u = fe_result['sigma2']
sigma_c = be_result['sigma2'] - sigma_u/T
_lambda = 1 - np.sqrt(sigma_u/(sigma_u + T*sigma_c))
print(_lambda)

[[0.64264094]]


### Question 3
Now we are finally ready to estimate eq. (3) with random effects. Since we have to use $\hat{\lambda}$ to quasi-demean within each individual, we again use the `perm` function. This time, we pass it the following transformation matrix,

$$
\mathbf{C}_{T}:=\mathbf{I}_{T} - \hat{\lambda}\mathbf{P}_{T},
$$

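where $\mathbf{P}_{T}$ is the $1 \times T$ averaging vector we used earlier to calculate the mean of each person. For the dimensions to match, read $\hat{\lambda}\mathbf{P}_{T}$ here as that row repeated $T$ times, i.e. a $T \times T$ matrix with every entry equal to $\hat{\lambda}/T$.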
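
A small numerical check of this transformation matrix, using an illustrative $\hat{\lambda}=0.6$ (an assumed value, not an estimate). It builds $\mathbf{C}_T$ once with an explicit $T \times T$ matrix of $1/T$ and once by letting NumPy broadcast the $1 \times T$ row.

```python
import numpy as np

T = 4
lam = 0.6                          # illustrative value, not an estimate
P_row = np.full((1, T), 1 / T)     # 1 x T averaging vector
J_T = np.full((T, T), 1 / T)       # T x T matrix with every entry 1/T

C_explicit = np.eye(T) - lam * J_T
C_broadcast = np.eye(T) - lam * P_row   # NumPy repeats the row T times

y_i = np.array([2.0, 4.0, 4.0, 6.0])
print(C_explicit @ y_i)            # equals y_it - lam * ybar_i
```

Both constructions give the same matrix, and applying it to one person's data subtracts $\hat{\lambda}$ times that person's mean from every observation.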

In [21]:
# FILL IN
# Create first the transformation matrix C
# Use the perm function to "quasi-demean" x and y using C
# Estimate RE using OLS and print a nice table

In [22]:
C_t = np.eye(T) - _lambda*mean_matrix(T)
x_re = lm.perm(C_t, x)
y_re = lm.perm(C_t, y)

re_result = lm.estimate(
    y_re, x_re, transform='re', T=T
)
lm.print_table(
    labels=(label_y, label_x), results=re_result, _lambda=_lambda,
    title='RE',
    floatfmt=['', '.3f', '.4f', '.2f']
)

RE
Dependent variable: Log wage

                  Beta      Se    t-values
--------------  ------  ------  ----------
Constant        -0.107  0.1107       -0.97
Black           -0.144  0.0476       -3.03
Hispanic         0.020  0.0426        0.47
Education        0.101  0.0089       11.36
Experience       0.112  0.0083       13.57
Experience sqr  -0.004  0.0006       -6.88
Married          0.063  0.0168        3.74
Union            0.107  0.0178        6.02
R² = 0.178
σ² = 0.124
λ = 0.643


RE <br>
Dependent variable: Log wage <br>

|                |   Beta |     Se |   t-values |
|----------------|--------|--------|------------|
| Constant       | -0.107 | 0.1107 |      -0.97 |
| Black          | -0.144 | 0.0476 |      -3.03 |
| Hispanic       |  0.020 | 0.0426 |       0.47 |
| Education      |  0.101 | 0.0089 |      11.36 |
| Experience     |  0.112 | 0.0083 |      13.57 |
| Experience sqr | -0.004 | 0.0006 |      -6.88 |
| Married        |  0.063 | 0.0168 |       3.74 |
| Union          |  0.107 | 0.0178 |       6.02 |
R² = 0.178 <br>
σ² = 0.124 <br>
λ = 0.643

## Short introduction to Hausman test

It is evident from the previous question that RE has an advantage over FE, in that time-invariant variables are not demeaned away. But if $E[c_{i}\boldsymbol{x}_{it}] \neq \boldsymbol{0}$, then the RE estimator is inconsistent, whereas the FE estimator remains consistent (but inefficient).

We can use the results from the FE and RE estimations to test whether RE is consistent by calculating the following test statistic,

$$
H := (\hat{\mathbf{\beta}}_{FE} - \hat{\mathbf{\beta}}_{RE})'[\widehat{\mathrm{avar}}(\hat{\mathbf{\beta}}_{FE}) - \widehat{\mathrm{avar}}(\hat{\mathbf{\beta}}_{RE})]^{-1}(\hat{\mathbf{\beta}}_{FE}-\hat{\mathbf{\beta}}_{RE})\overset{d}{\to}\chi_{M}^{2}, \tag{7}
$$

*Note:* The time-invariant variables are excluded from $\hat{\mathbf{\beta}}_{RE}$, as these are not present in $\hat{\mathbf{\beta}}_{FE}$, so $M$ is the number of time-varying regressors. <br>
*Note2:* $\widehat{\mathrm{avar}}(\hat{\mathbf{\beta}}_{RE})$ is the estimated RE covariance matrix (again, we only keep the rows and columns that belong to the time-varying variables).
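
The quadratic form in eq. (7) can be sketched as a small function. The name `hausman_statistic` and all input numbers below are made up for illustration; they are not estimates from the data, and the inputs are assumed to already be restricted to the time-varying regressors.

```python
import numpy as np

def hausman_statistic(b_fe, b_re, cov_fe, cov_re):
    """Quadratic form in eq. (7); returns the statistic and its degrees of freedom."""
    diff = (np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)).reshape(-1, 1)
    cov_diff = np.asarray(cov_fe, dtype=float) - np.asarray(cov_re, dtype=float)
    # Solve a linear system rather than forming the inverse explicitly.
    H = (diff.T @ np.linalg.solve(cov_diff, diff)).item()
    return H, diff.size

# Made-up numbers for two regressors.
b_fe = [0.10, 0.05]
b_re = [0.12, 0.02]
cov_fe = np.diag([0.0004, 0.0009])
cov_re = np.diag([0.0003, 0.0005])

H, df = hausman_statistic(b_fe, b_re, cov_fe, cov_re)
print(H, df)
```

With diagonal covariances the statistic is just the sum of squared coefficient differences, each divided by the difference in variances. In finite samples the difference of the two estimated covariance matrices need not be positive definite, in which case the statistic can turn out negative and should not be interpreted.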

### Question 4: Comparing FE and RE
Use the results from the FE and RE estimations to calculate eq. (7).

* Start by calculating the differences in the FE and RE coefficients $\hat{\mathbf{\beta}}_{FE} - \hat{\mathbf{\beta}}_{RE}$ (again, remember to remove the time-invariant rows from RE)
* Then calculate the differences in the covariances $\widehat{\mathrm{avar}}(\hat{\mathbf{\beta}}_{FE}) - \widehat{\mathrm{avar}}(\hat{\mathbf{\beta}}_{RE})$ (you need to keep the "lower right" part of the RE covariance)
* You now have the parts to calculate the H statistic in eq. (7)

In [23]:
# FILL IN
# Follow the steps in the question

b_re = re_result['b_hat']
b_re = b_re[4:]

hat_diff = fe_result['b_hat'] - b_re  # The differences in beta hat

cov_re = re_result['cov']
cov_re = cov_re[4:,4:]


cov_diff = fe_result['cov'] - cov_re  # The difference in covariances
H = hat_diff.T@la.inv(cov_diff)@hat_diff  # The Hausman test value

# This calculates the p-value of the Hausman test.
p_val = chi2.sf(H.item(), 4)

In [24]:
# First calculate the difference in coefficients and the inverse of the difference in covariances.
# Remember to remove the time-invariant regressors from RE.
hat_diff = fe_result['b_hat'] - re_result['b_hat'][4:]
cov_diff_inv = la.inv(fe_result['cov'] - re_result['cov'][4:, 4:])
H = hat_diff.T@(cov_diff_inv@hat_diff)
# chi2.sf takes the test statistic first, then the degrees of freedom.
p_val = chi2.sf(H.item(), hat_diff.size)

In [25]:
# This code takes the results that you have made, and prints a nice looking table.
def print_h_test(fe_result, re_result, hat_diff, H, p_val):
    table = []
    for i in range(len(hat_diff)):
        row = [
            fe_result['b_hat'][i], re_result['b_hat'][4:][i], hat_diff[i]
        ]
        table.append(row)

    print(tabulate(
        table, headers=['b_fe', 'b_re', 'b_diff'], floatfmt='.4f'
        ))
    print(f'The Hausman test statistic is: {H.item():.2f}, with p-value: {p_val:.2f}.')
print_h_test(fe_result, re_result, hat_diff, H, p_val)

   b_fe     b_re    b_diff
-------  -------  --------
 0.1168   0.1121    0.0047
-0.0043  -0.0041   -0.0002
 0.0453   0.0628   -0.0175
 0.0821   0.1074   -0.0253
The Hausman test statistic is: 31.45, with p-value: 0.00.


Your table should look like this:

| b_fe    |  b_re    | b_diff |
| ------- |  ------- |  -------- |
 | 0.1168  |  0.1121  |   0.0047 |
| -0.0043 |  -0.0041 |   -0.0002 |
 | 0.0453  |  0.0628  |  -0.0175 |
 | 0.0821  |  0.1074  |  -0.0253 |

 The Hausman test statistic is: 31.45, with p-value: 0.00.
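
The printed p-value of 0.00 is a rounding effect. As a rough check that does not rely on `scipy`, the sketch below uses the closed form of the chi-square survival function that holds for an even number of degrees of freedom, $P(\chi^2_{2m} > x) = e^{-x/2}\sum_{k=0}^{m-1}(x/2)^k/k!$. The function name is hypothetical, and 31.45 is the rounded statistic from the output above.

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function of a chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form needs a positive, even df")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))

p = chi2_sf_even_df(31.45, 4)
print(f"{p:.2e}")
```

The result is on the order of $10^{-6}$, so the null hypothesis that RE is consistent is rejected at any conventional significance level.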

# Robust Standard Errors
Please run the above estimations (RE, FE and FD) but this time with robust standard errors. The function for robust standard errors is provided in the LinearModelsPS3_ex_post.py file but you need to fill out the missing parts for the code below to work. Compare your results to the ones you got above (without allowing for heteroskedasticity). How do they differ?
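
For orientation, here is a minimal sketch of a panel-robust ("sandwich") covariance estimator, $(\mathbf{X}'\mathbf{X})^{-1}\left(\sum_i \mathbf{X}_i'\hat{\mathbf{u}}_i\hat{\mathbf{u}}_i'\mathbf{X}_i\right)(\mathbf{X}'\mathbf{X})^{-1}$. This is an illustrative assumption about what a robust variance can look like, not the implementation in the course module. The function name and the simulated data are hypothetical, and rows are assumed to be sorted by individual with $T$ rows each.

```python
import numpy as np

def panel_robust_cov(x, residuals, T):
    """Sandwich covariance clustered by individual (illustrative sketch)."""
    K = x.shape[1]
    N = x.shape[0] // T
    xtx_inv = np.linalg.inv(x.T @ x)
    x_blocks = x.reshape(N, T, K)
    u_blocks = residuals.reshape(N, T, 1)
    # Row i holds X_i' u_i, the score of individual i.
    scores = (x_blocks * u_blocks).sum(axis=1)
    meat = scores.T @ scores           # sum over i of X_i' u_i u_i' X_i
    return xtx_inv @ meat @ xtx_inv

rng = np.random.default_rng(1)
N, T = 200, 4
x = rng.normal(size=(N * T, 2))
# The error spread grows with |x|, so the errors are heteroskedastic.
u = rng.normal(size=N * T) * (1 + np.abs(x[:, 0]))
y = x @ np.array([1.0, -0.5]) + u

b_hat = np.linalg.solve(x.T @ x, x.T @ y)
cov_robust = panel_robust_cov(x, y - x @ b_hat, T)
se_robust = np.sqrt(np.diag(cov_robust))
print(se_robust)
```

The coefficients are unchanged by this choice; only the covariance matrix, and hence the standard errors and t-values, differ from the non-robust output.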

In [26]:
# RE robust (NB does this make sense??)
re_result_robust = lm.estimate(
    y_re, x_re, transform='re', T=T,robust_se=True
)
lm.print_table(
    labels=(label_y, label_x), results=re_result_robust, _lambda=_lambda,
    title='RE regression, robust SE',
    floatfmt=['', '.3f', '.4f', '.2f']
)
lm.print_table(
    labels=(label_y, label_x), results=re_result, _lambda=_lambda,
    title='RE regression',
    floatfmt=['', '.3f', '.4f', '.2f']
)

RE regression, robust SE
Dependent variable: Log wage

                  Beta      Se    t-values
--------------  ------  ------  ----------
Constant        -0.107  0.1150       -0.93
Black           -0.144  0.0503       -2.87
Hispanic         0.020  0.0399        0.51
Education        0.101  0.0089       11.41
Experience       0.112  0.0105       10.65
Experience sqr  -0.004  0.0007       -6.04
Married          0.063  0.0190        3.31
Union            0.107  0.0209        5.15
R² = 0.178
σ² = 0.124
λ = 0.643
RE regression
Dependent variable: Log wage

                  Beta      Se    t-values
--------------  ------  ------  ----------
Constant        -0.107  0.1107       -0.97
Black           -0.144  0.0476       -3.03
Hispanic         0.020  0.0426        0.47
Education        0.101  0.0089       11.36
Experience       0.112  0.0083       13.57
Experience sqr  -0.004  0.0006       -6.88
Married          0.063  0.0168        3.74
Union            0.107  0.0178        6.02
R² = 0.178
σ² = 0.124
λ = 0.643

RE regression, robust se <br>
Dependent variable: Log wage <br>

|                |   Beta |     Se |   t-values |
|----------------|--------|--------|------------|
| Constant       | -0.107 | 0.1150 |      -0.93 |
| Black          | -0.144 | 0.0503 |      -2.87 |
| Hispanic       |  0.020 | 0.0399 |       0.51 |
| Education      |  0.101 | 0.0089 |      11.41 |
| Experience     |  0.112 | 0.0105 |      10.65 |
| Experience sqr | -0.004 | 0.0007 |      -6.04 |
| Married        |  0.063 | 0.0190 |       3.31 |
| Union          |  0.107 | 0.0209 |       5.15 |
R² = 0.178 <br>
σ² = 0.124 <br>
λ = 0.643

In [27]:
# FE robust
fe_result_robust = lm.estimate(
    y_demean, x_demean, transform='fe', T=T,robust_se=True
)
lm.print_table(
    (label_y, label_x_fe), 
    fe_result_robust, title='FE regression, robust SE', floatfmt='.4f'
)

lm.print_table(
    (label_y, label_x_fe), 
    fe_result, title='FE regression', floatfmt='.4f'
)


FE regression, robust SE
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Experience       0.1168  0.0107     10.9221
Experience sqr  -0.0043  0.0007     -6.2773
Married          0.0453  0.0210      2.1598
Union            0.0821  0.0228      3.6011
R² = 0.178
σ² = 0.123
FE regression
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Experience       0.1168  0.0084     13.8778
Experience sqr  -0.0043  0.0006     -7.1057
Married          0.0453  0.0183      2.4743
Union            0.0821  0.0193      4.2553
R² = 0.178
σ² = 0.123


You should get a table that looks like this:

FE regression, robust SE<br>
Dependent variable: Log wage

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Experience     |  0.1168 | 0.0107 |    10.9221 |
| Experience sqr | -0.0043 | 0.0007 |    -6.2773 |
| Married        |  0.0453 | 0.0210 |     2.1598 |
| Union          |  0.0821 | 0.0228 |     3.6011 |
R² = 0.178 <br>
σ² = 0.123

In [28]:
# FD Robust 
fd_result_robust = lm.estimate(y_diff, x_diff, transform='fd',robust_se=True)
lm.print_table(
    (label_y, label_x_fe), 
    fd_result_robust, title='FD regression, robust SE', floatfmt='.4f'
)
lm.print_table(
    (label_y, label_x_fe), 
    fd_result, title='FD regression', floatfmt='.4f'
)

FD regression, robust SE
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Experience       0.1158  0.0211      5.4779
Experience sqr  -0.0039  0.0014     -2.7964
Married          0.0381  0.0233      1.6387
Union            0.0428  0.0198      2.1598
R² = 0.004
σ² = 0.196
FD regression
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Experience       0.1158  0.0196      5.9096
Experience sqr  -0.0039  0.0014     -2.8005
Married          0.0381  0.0229      1.6633
Union            0.0428  0.0197      2.1767
R² = 0.004
σ² = 0.196


You should get a table that looks like this:

FD regression, robust SE <br>
Dependent variable: Log wage

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Experience     |  0.1158 | 0.0211 |     5.4779 |
| Experience sqr | -0.0039 | 0.0014 |    -2.7964 |
| Married        |  0.0381 | 0.0233 |     1.6387 |
| Union          |  0.0428 | 0.0198 |     2.1598 |
R² = 0.004 <br>
σ² = 0.196

# If you have time: Comparing FE and FD and exogeneity test.

There are some important tests that you need to know, which we outline here.

### Question 1: Comparing FE and FD

One of the assumptions we make when first-differencing is that the errors $e_{it} = \Delta u_{it}$ are serially uncorrelated. Under the fixed-effects assumption that $u_{it}$ itself is serially uncorrelated, the first-order autocorrelation of $e_{it}$ is instead $-0.5$.

We can easily test this assumption given the OLS residuals from eq. (5). Run the regression (note that you will lose data for the *two* first periods)
\begin{equation}
\hat{e}_{it}=\rho\hat{e}_{it-1}+error_{it},\quad t=\color{red}{3},\dotsc,T,\quad i=1,\dotsc,N
\end{equation}

Do you find any evidence of serial correlation? Which of FD and FE seems most appropriate?
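
The value $-0.5$ follows from a short calculation: if $u_{it}$ is serially uncorrelated with variance $\sigma^2$, then $\mathrm{Cov}(\Delta u_{it}, \Delta u_{it-1}) = -\sigma^2$ and $\mathrm{Var}(\Delta u_{it}) = 2\sigma^2$. The simulation below is a minimal sketch of the two benchmark cases on one long simulated series. It is not the panel regression asked for in the question, and the helper name is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

def first_order_autocorr(e):
    """Slope from regressing e_t on e_{t-1} without an intercept."""
    return (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])

# Case 1: u_t is serially uncorrelated (the assumption behind FE).
u_iid = rng.normal(size=n)
rho_iid = first_order_autocorr(np.diff(u_iid))

# Case 2: u_t is a random walk, so its differences are uncorrelated (the assumption behind FD).
u_rw = np.cumsum(rng.normal(size=n))
rho_rw = first_order_autocorr(np.diff(u_rw))

print(f"rho when u is i.i.d.:        {rho_iid:.3f}")
print(f"rho when u is a random walk: {rho_rw:.3f}")
```

An estimate of $\rho$ near $-0.5$ therefore points towards FE, while an estimate near $0$ points towards FD.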

*Note:* To lag an array, use the following permutation matrix

$$
\mathbf{P}_T :=
\underset{(T - 1)\times T}{
\begin{pmatrix}
1 & 0 & 0 & \cdots & 0 & 0\\
0 & 1 & 0 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & 1 & 0
\end{pmatrix}
}
$$

*Note2:* You should end up with two columns: one with the residuals and one with the lagged residuals. To make these the same length, you need to remove the first observation for each person in the non-lagged residuals. You can use the `year` variable for boolean indexing of the numpy array. The first year (1980) was already lost when first differencing, so it must be removed from `year` first; this is done for you.
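
As a small illustration of the lag matrix, and of the analogous lead matrix, the sketch below applies both to one person's series with $T=4$. The variable names and numbers are made up for this example.

```python
import numpy as np

T = 4
lag = np.eye(T, k=-1)[1:]    # (T-1) x T: picks periods 1..T-1, to pair with periods 2..T
lead = np.eye(T, k=1)[:-1]   # (T-1) x T: picks periods 2..T, to pair with periods 1..T-1

v = np.array([10.0, 20.0, 30.0, 40.0])   # one individual's series
print(lag @ v)
print(lead @ v)
```

The lagged series drops the last period and the lead series drops the first, so in both cases the untransformed variables must be trimmed by one period per person before they can be stacked next to the result.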

In [29]:
# FILL IN
# You should have stored the results from the first differencing, use the beta hat to calculate the residuals e.
# Then create the lag matrix P, and use the perm function with P to lag the residual e.
# Remove the first observation for each person in the non-lagged e.
# Estimate by ols e on its lag.
# Print out in a nice table.

In [30]:
reduced_year = year[year != 1980]
def serial_corr(y, x, T, year):
    # Residuals from the first-difference regression.
    b_hat = lm.est_ols(y, x)
    e = y - x@b_hat

    # Lag matrix: selects the first T - 1 periods within each person.
    L_T = np.eye(T, k=-1)
    L_T = L_T[1:]
    e_l = lm.perm(L_T, e)

    print(e.shape)
    # Drop each person's first remaining period so that e lines up with its lag.
    e = e[year != 1981]
    print(e.shape)

    return lm.estimate(e, e_l)

In [31]:
corr_result = serial_corr(y_diff, x_diff, T-1, reduced_year)

label_ye = 'OLS residual, e\u1d62\u209c'
label_e = ['e\u1d62\u209c\u208B\u2081']
title = 'Serial Correlation'
lm.print_table(
    (label_ye, label_e), corr_result, 
    title='Serial Correlation', floatfmt='.4f'
)

(3815, 1)
(3270, 1)
Serial Correlation
Dependent variable: OLS residual, eᵢₜ

          Beta      Se    t-values
-----  -------  ------  ----------
eᵢₜ₋₁  -0.3961  0.0147    -27.0185
R² = 0.182
σ² = 0.143


You should get a table that looks like this:

Serial Correlation <br>
Dependent variable: OLS residual, eᵢₜ

|       |    Beta |     Se |   t-values |
|-------|---------|--------|------------|
| eᵢₜ₋₁ | -0.3961 | 0.0147 |   -27.0185 |
R² = 0.182 <br>
σ² = 0.143

### Question 2: Test for strict exogeneity

Add a lead of the union variable, $union_{it+1}$, to eq. (3) (note that you will lose data from period $T$, 1987) and estimate the model with *fixed effects* (i.e., you have to demean $union_{it+1}$ along with all the other variables and throw out time-constant variables). Is $union_{it+1}$ significant? What does this imply for the strict exogeneity assumption?

*Hint:* To lead a variable, think along the same lines as in the previous question.

In [32]:
# FILL IN
# Create first a lead matrix, and use this to transform the union variable only. 
# Remove the last observation for each person using the year column as a boolean index. Then you can hstack the lead union variable onto your x-matrix. 
# Now within transform the data, and estimate using FE as you have done before. 
# Is the lead variable significant/insignificant? What does that mean?

In [33]:
def exogeneity_test(x, y, T, year):
    # Create lead
    F_T = np.eye(T, k=1)
    F_T = F_T[:-1]

    # Lead union
    union_lead = lm.perm(F_T, x[:, 7].reshape(-1, 1))

    # Collect variables to test for exogeneity
    x_exo = x[year != 1987]
    x_exo = np.hstack((x_exo, union_lead))
    y_exo = y[year != 1987]

    # Within transform the data
    Q_T = demeaning_matrix(T - 1)

    yw_exo = lm.perm(Q_T, y_exo)
    xw_exo = lm.perm(Q_T, x_exo)
    xw_exo = xw_exo[:, 4:]

    label_exo = label_x_fe + ['Union lead']
    # Estimate model
    exo_test = lm.estimate(
        yw_exo, xw_exo, T=T - 1, transform='fe'
    )

    lm.print_table(
        (label_y, label_exo), 
        exo_test, title='Exogeneity test', floatfmt='.4f'
    )
exogeneity_test(x, y, T, year)

Exogeneity test
Dependent variable: Log wage

                   Beta      Se    t-values
--------------  -------  ------  ----------
Experience       0.1213  0.0100     12.1001
Experience sqr  -0.0050  0.0008     -6.3579
Married          0.0436  0.0209      2.0898
Union            0.0757  0.0218      3.4784
Union lead       0.0515  0.0223      2.3063
R² = 0.146
σ² = 0.128


The table should look something like this:

Exogeneity test <br>
Dependent variable: Log wage

|                |    Beta |     Se |   t-values |
|----------------|---------|--------|------------|
| Experience     |  0.1213 | 0.0100 |    12.1001 |
| Experience sqr | -0.0050 | 0.0008 |    -6.3579 |
| Married        |  0.0436 | 0.0209 |     2.0898 |
| Union          |  0.0757 | 0.0218 |     3.4784 |
| Union lead     |  0.0515 | 0.0223 |     2.3063 |
R² = 0.146<br>
σ² = 0.128