# Problem set 8: Binary response models for panel data: Monte Carlo evidence


## 1. A dynamic binary response model for panel data
The overall purpose of this exercise is to use Monte Carlo simulation to investigate the properties of estimators for binary response models for panel data. 

For concreteness, consider the dynamic panel data model specified below

\begin{align}
y_{it} &= \mathbb{1}(\delta z_{it} + \rho y_{it-1} + c_i + e_{it} > 0) \\
y_{i0} &= \mathbb{1}(\eta_i > 0) \\
c_i &= \phi_0 + \phi_{y0} y_{i0} + a_i \\
a_i &\sim \text{iid } N(0, \sigma_a^2) \\
e_{it} &\sim \text{iid } N(0, 1) \\
\eta_{i} &\sim \text{iid } N(0, 1)
\end{align}

where the scalar random variable $z_{it}$ is assumed to be iid standard normally distributed. Depending on the parameters, this model can be either dynamic (if $\rho \ne 0$) or static (if $\rho = 0$) and may or may not contain unobserved effects depending on the parameters that determine $c_i$. 
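To make the data generating process concrete, here is a minimal NumPy sketch of the model above. It is not the course's `simulate` routine: the function name `simulate_panel`, its arguments, and the returned dictionary are assumptions made for illustration only.

```python
import numpy as np

def simulate_panel(n, nT, delta=1.0, rho=0.0, phi_0=0.0, phi_y0=0.0, sigma_a=0.0, rng=None):
    """Simulate the dynamic probit model of Section 1 (hypothetical helper).

    Returns a dict of arrays: y, z, y_lag have shape (n, nT); y0 and c have shape (n,).
    """
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal((n, nT))                   # z_it ~ iid N(0, 1)
    e = rng.standard_normal((n, nT))                   # e_it ~ iid N(0, 1)
    y0 = (rng.standard_normal(n) > 0).astype(float)    # y_i0 = 1(eta_i > 0)
    a = sigma_a * rng.standard_normal(n)               # a_i ~ iid N(0, sigma_a^2)
    c = phi_0 + phi_y0 * y0 + a                        # unobserved effect c_i
    y = np.empty((n, nT))
    y_lag = np.empty((n, nT))
    prev = y0
    for t in range(nT):
        y_lag[:, t] = prev                             # y_{i,t-1}
        y[:, t] = (delta * z[:, t] + rho * prev + c + e[:, t] > 0).astype(float)
        prev = y[:, t]
    return {"y": y, "z": z, "y_lag": y_lag, "y0": y0, "c": c}

sim = simulate_panel(n=1000, nT=10, delta=1.0, rho=0.1, phi_0=0.2, phi_y0=0.2,
                     sigma_a=1.0, rng=np.random.default_rng(43))
print(sim["y"].shape, sim["y"].mean())
```

Setting `rho=0` gives the static model, and setting `phi_y0=0` and `sigma_a=0` removes the unobserved effect apart from the constant `phi_0`.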

We are specifically interested in estimating the effect of $z_{it}$ and $y_{it-1}$ on $P(y_{it}=1|z_{it}, y_{it-1}, c_i)$, that is, the effect of changing $z_{it}$ or $y_{it-1}$ while holding other variables, including $c_i$, constant. 
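
For reference, the objects of interest can be written out explicitly. The expressions below follow from the model in Section 1 under the assumption that $e_{it}$ is independent of $(z_{it}, y_{it-1}, c_i)$. Here $\Phi$ and $\phi$ (without subscript) denote the standard normal cdf and pdf, not to be confused with the parameters $\phi_0$ and $\phi_{y0}$.

\begin{align}
P(y_{it}=1|z_{it}, y_{it-1}, c_i) &= \Phi(\delta z_{it} + \rho y_{it-1} + c_i) \\
\frac{\partial P(y_{it}=1|z_{it}, y_{it-1}, c_i)}{\partial z_{it}} &= \delta \, \phi(\delta z_{it} + \rho y_{it-1} + c_i) \\
P(y_{it}=1|z_{it}, y_{it-1}=1, c_i) - P(y_{it}=1|z_{it}, y_{it-1}=0, c_i) &= \Phi(\delta z_{it} + \rho + c_i) - \Phi(\delta z_{it} + c_i)
\end{align}

The second line is the partial effect of $z_{it}$ and the third is the effect of the lagged outcome (state dependence). Both depend on the unobserved $c_i$. The corresponding average partial effects (APEs) are obtained by averaging these expressions over the distribution of $c_i$.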

## 2. Research questions
During the lectures on "Binary response models for panel data" we considered a battery of different models and estimators. Each method may or may not be appropriate depending on the nature of the data and the object of interest. Sometimes we are satisfied with estimating the direction of an effect, which may be determined by the sign of the coefficient on a single explanatory variable. Sometimes the interest is in the average partial effect (APE), and other times we need to make inference about the whole distribution of choice probabilities and the corresponding partial effects. 

Each estimator is derived under different assumptions, is appropriate in different contexts, and has different asymptotic properties. This may, for example, depend on 
1. whether there are **unobserved effects** in the data and whether they are **correlated with explanatory variables**
1. whether the model is **static** or **dynamic**, for example because it contains a lagged dependent variable 
1. whether you consider a **pooled analysis** or specify a model for **the conditional distribution for the entire sequence** of observed binary outcomes 
1. how modest your ambitions are in terms of the **object of interest**

This motivates a menu of research questions: 
1. Is it possible to identify the parameter of interest? 
1. What is the appropriate estimator? 
1. Is the estimator $\sqrt{N}$-consistent and asymptotically normal? 
1. Is it unbiased?
1. Is it efficient?
1. Are asymptotic results a good approximation for the sample size under consideration? 
1. Are the usual standard errors and test statistics valid for inference, or do I need to compute robust standard errors? 
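
Several of these questions (bias, precision, validity of standard errors) can be answered with a few summary statistics computed across Monte Carlo replications. The sketch below shows one way to do this. The function name `mc_summary` is hypothetical, and the "estimates" fed into it are synthetic normal draws standing in for the output of a real estimator.

```python
import numpy as np

def mc_summary(estimates, std_errors, true_value, crit=1.96):
    """Summarize S Monte Carlo replications of a scalar estimator (hypothetical helper)."""
    estimates = np.asarray(estimates, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    covered = np.abs(estimates - true_value) <= crit * std_errors
    return {
        "bias": estimates.mean() - true_value,                   # mean estimate minus truth
        "mc_sd": estimates.std(ddof=1),                          # spread across replications
        "rmse": np.sqrt(np.mean((estimates - true_value) ** 2)),
        "mean_se": std_errors.mean(),                            # compare with mc_sd
        "coverage": covered.mean(),                              # share of 95% CIs covering truth
    }

# Synthetic stand-in for S = 5000 replications of an estimator of delta = 1
rng = np.random.default_rng(1)
true_delta = 1.0
fake_estimates = true_delta + 0.1 * rng.standard_normal(5000)
fake_ses = np.full(5000, 0.1)
summary = mc_summary(fake_estimates, fake_ses, true_delta)
for name, value in summary.items():
    print(f"{name:>9s}: {value: .4f}")
```

If the reported standard errors are valid, `mean_se` should be close to `mc_sd` and `coverage` close to the nominal level. A large gap between the two suggests that the standard errors are not appropriate for the design at hand.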

## 3. Models to compare
To analyze the questions above, you are asked to perform a series of Monte Carlo experiments where you change the sampling scheme using the model specified above and compare the performance of different estimators derived from different models such as: 

1. **Linear probability models with unobserved effects** estimated using least squares methods such as Pooled OLS, Fixed Effects, First Differencing IV methods, etc. 
1. **Pooled index models** estimated using partial MLE, such as Pooled Probit or Pooled Logit 
1. **Index models with unobserved effects under strict exogeneity** such as the Random effects Probit, Chamberlain's Correlated Random effects Probit or Fixed Effects Logit. 
1. **Dynamic unobserved effects models** such as the Dynamic version of the Correlated Random effects Probit 

As a testbed for different estimators we use the dynamic binary probit model with unobserved effects described in Section 1. Use the model to generate data sets for different values of parameters and sample size, and investigate the properties of different estimators. 

**Your analysis does not have to be exhaustive**, but you need to consider *at least one method from at least three out of the four classes mentioned above*. Divide your analysis into static models and dynamic models. In the latter case, special focus is on the causal effect of the lagged dependent variable (state dependence). Here you need to discriminate between true and *spurious state dependence* and analyze the importance of accounting for unobserved heterogeneity and the initial conditions problem. 

The same rule applies regarding your analysis of the properties of the estimators you apply. Think of the list of research questions above as a menu of opportunities. In the next section I give more examples of specific questions to analyze. 

## 4. Some advice on Monte Carlo: Analyzing properties of estimators
I recommend you start simple. Generate data from a model where you know you should be able to recover the underlying parameters by estimating a well specified model that is consistent with how you generated the data. Then you can set up more sophisticated experiments such as the ones we discussed during lectures: 

#### Example experiments - static models
1. No heterogeneity 
    - does LPM give a good approximation of APE?
    - does pooled probit estimate true parameters?
    
1. Neglected heterogeneity - Pooled probit or LPM
   - does pooled probit estimate true parameters?
   - do pooled OLS and probit still estimate the APE?

1. Can RE probit account for heterogeneity and uncover the true parameters?
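
As an illustration of the neglected heterogeneity experiment, the sketch below simulates a static model with $c_i = a_i \sim N(0, \sigma_a^2)$ independent of $z_{it}$ and fits a pooled probit. In that design, $P(y_{it}=1|z_{it}) = \Phi(\delta z_{it}/\sqrt{1+\sigma_a^2})$, so the pooled probit slope should be close to the scaled coefficient $\delta/\sqrt{1+\sigma_a^2}$ rather than $\delta$. The estimator is a bare-bones Fisher scoring routine written for this example, not the `pooled` function from `indexmodels.py`.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)

def norm_cdf(v):
    return 0.5 * (1.0 + _erf(v / math.sqrt(2.0)))

def norm_pdf(v):
    return np.exp(-0.5 * v**2) / math.sqrt(2.0 * math.pi)

def pooled_probit(y, X, max_iter=50, tol=1e-8):
    """Probit MLE by Fisher scoring (hypothetical helper, no standard errors)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        xb = X @ beta
        Phi = np.clip(norm_cdf(xb), 1e-10, 1 - 1e-10)
        phi = norm_pdf(xb)
        grad = X.T @ (phi * (y - Phi) / (Phi * (1 - Phi)))   # score
        weights = phi**2 / (Phi * (1 - Phi))
        info = (X * weights[:, None]).T @ X                  # expected information
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

rng = np.random.default_rng(43)
n, nT, delta, sigma_a = 4000, 5, 0.7, 2.0
a = np.repeat(sigma_a * rng.standard_normal(n), nT)          # a_i repeated over periods
z = rng.standard_normal(n * nT)
y = (delta * z + a + rng.standard_normal(n * nT) > 0).astype(float)
X = np.column_stack((np.ones(n * nT), z))

b_hat = pooled_probit(y, X)
ape_hat = b_hat[1] * norm_pdf(X @ b_hat).mean()              # estimated APE of z
print("pooled probit slope        :", b_hat[1])
print("delta / sqrt(1 + sigma_a^2):", delta / math.sqrt(1 + sigma_a**2))
print("estimated APE of z         :", ape_hat)
```

Because the scaled coefficient is what enters the APE in this design, a pooled probit can miss $\delta$ and still be informative about the APE. Repeating the experiment over many samples shows whether this holds up in finite samples.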

#### Example experiments - dynamic models   
1. Neglecting heterogeneity and initial conditions
   - do pooled OLS and probit consistently estimate the APE of lagged y (state dependence)?
   - what about other parameters?
1. Accounting for heterogeneity and initial conditions 
    - is LPM-FE valid for dynamic models?
    - can RE probit estimate APE of lagged y (state dependence)?
    - what if explanatory variables are correlated with $c_i$?
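
To see what spurious state dependence looks like, the sketch below sets $\rho = 0$ (no true state dependence) and compares the pooled LPM coefficient on lagged $y$ with and without unobserved heterogeneity. The helper `pooled_lpm_lag_coef` is hypothetical and uses a simplified design with $\phi_0 = \phi_{y0} = 0$.

```python
import numpy as np

def pooled_lpm_lag_coef(sigma_a, rho=0.0, delta=1.0, n=2000, nT=8, seed=0):
    """Pooled OLS coefficient on lagged y in a regression of y on (1, z, y_lag)."""
    rng = np.random.default_rng(seed)
    c = sigma_a * rng.standard_normal(n)                  # c_i = a_i in this sketch
    y_prev = (rng.standard_normal(n) > 0).astype(float)   # y_i0 = 1(eta_i > 0)
    blocks = []
    for _ in range(nT):
        z = rng.standard_normal(n)
        e = rng.standard_normal(n)
        y = (delta * z + rho * y_prev + c + e > 0).astype(float)
        blocks.append(np.column_stack((np.ones(n), z, y_prev, y)))
        y_prev = y
    data = np.vstack(blocks)
    coef, *_ = np.linalg.lstsq(data[:, :3], data[:, 3], rcond=None)
    return coef[2]

no_het = pooled_lpm_lag_coef(sigma_a=0.0)    # rho = 0 and no heterogeneity
with_het = pooled_lpm_lag_coef(sigma_a=1.5)  # rho = 0 but persistent c_i
print("lag coefficient, sigma_a = 0.0:", no_het)
print("lag coefficient, sigma_a = 1.5:", with_het)
```

With heterogeneity, the persistent $c_i$ makes $y_{it}$ and $y_{it-1}$ positively correlated even though $\rho = 0$, and a pooled estimator attributes this correlation to the lagged outcome.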

In the simulation exercise in the lectures on binary response models for panel data, we essentially did Monte Carlo with only a single simulated sample. To properly analyze the distribution of an estimator, test statistic, etc., you need to generate many samples by repeatedly simulating data and estimating the parameter of interest. The collection of estimates across samples then approximates the sampling distribution of the estimator. As an example, the notebook [clogit_simulations.ipynb](https://github.com/bschjerning/EconometricsB/blob/main/lectures/14_multinomial_response/clogit_simulations.ipynb) provides a simple example in the context of the Conditional Logit.
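
The replication loop itself can be very short. The sketch below is a stand-alone illustration (not taken from the course code) for the simplest design: a static model without heterogeneity, where the LPM slope on $z_{it}$ is computed in each of $S$ independent samples. Because $z_{it}$ is standard normal in this design, the population OLS slope equals the APE of $z$, which is $\delta/\sqrt{2\pi(1+\delta^2)}$. The function name `lpm_slope` and the design constants are assumptions.

```python
import numpy as np

def lpm_slope(n, nT, delta, rng):
    """Simulate one static sample without heterogeneity and return the pooled OLS slope on z."""
    z = rng.standard_normal(n * nT)
    y = (delta * z + rng.standard_normal(n * nT) > 0).astype(float)
    X = np.column_stack((np.ones(n * nT), z))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]

S, n, nT, delta = 200, 500, 5, 1.0
# One independent random stream per replication, all derived from a single seed
streams = np.random.SeedSequence(2024).spawn(S)
slopes = np.array([lpm_slope(n, nT, delta, np.random.default_rng(s)) for s in streams])

true_ape = delta / np.sqrt(2 * np.pi * (1 + delta**2))
print("true APE of z        :", true_ape)
print("mean of LPM slopes   :", slopes.mean())
print("MC std. of LPM slopes:", slopes.std(ddof=1))
```

Rerunning the loop for increasing `n` is a simple way to check whether the Monte Carlo standard deviation shrinks at the $\sqrt{N}$ rate.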

## 5. More resources
The most relevant sections in Wooldridge's textbook are (in order of priority)
- Section 15.8 on Binary Response models for Panel data
- Section 12.8.1 on Monte Carlo Simulation 
- Sections 13.8 and 13.9 on Partial/Pooled MLE and likelihood based panel data methods with unobserved effects. 
- Sections 12.1-12.3, and 12.5.1 on the properties of M-Estimators

Most of the relevant estimation methods and econometric models are reviewed in lectures 12 and 13 on "Binary response models for panel data" and implemented in the notebook [binary_choice_panel.ipynb](https://github.com/bschjerning/EconometricsB/blob/main/lectures/12_binary_response_panel/binary_choice_panel.ipynb). Here you will also find demonstrations of the code that implements a selection of panel data methods for binary response. For convenience, I have placed the current notebook in the same directory [12_binary_response_panel](https://github.com/bschjerning/EconometricsB/tree/main/lectures/12_binary_response_panel), so that you can easily access all resources. 

For Monte Carlo, you may get some inspiration from the first exercise set, where we do a simple Monte Carlo experiment for the linear model. You may also want to look at the last lecture on the simulation and maximum likelihood estimation of the conditional logit model. As mentioned above, the notebook [clogit_simulations.ipynb](https://github.com/bschjerning/EconometricsB/blob/main/lectures/14_multinomial_response/clogit_simulations.ipynb) briefly illustrates how the properties of the maximum likelihood estimator for the conditional logit can be analyzed through a simple Monte Carlo experiment. 

You may also need the code used during the exercise classes, or even take up the challenge of writing code for estimators I have not implemented, such as the Fixed Effects Logit model. You are encouraged to use all available resources and modify them to fit your application. If you take up this challenge, start simple: consider first the model for only two periods, and move on from there.
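
For the two-period case, the Fixed Effects Logit estimator reduces to an ordinary logit on differenced data. Among units whose outcome changes ($y_{i1} + y_{i2} = 1$), $P(y_{i2}=1|y_{i1}+y_{i2}=1, x_i, c_i) = \Lambda((x_{i2}-x_{i1})\beta)$, which does not depend on $c_i$. The sketch below is one possible starting point and is not part of the course code. The function name `fe_logit_two_periods` and the simulated design (logistic errors, $x_{it}$ correlated with $c_i$) are assumptions.

```python
import numpy as np

def fe_logit_two_periods(y1, y2, x1, x2, max_iter=50, tol=1e-10):
    """Conditional (fixed effects) logit for T = 2 via Newton-Raphson (hypothetical helper).

    y1, y2: arrays of shape (n,); x1, x2: arrays of shape (n, k).
    """
    switch = (y1 + y2) == 1            # only units that change outcome contribute
    w = y2[switch]                     # 1 if the unit moved from 0 to 1
    dx = (x2 - x1)[switch]             # differencing removes c_i; no constant is included
    beta = np.zeros(dx.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(dx @ beta)))
        grad = dx.T @ (w - p)
        hess = (dx * (p * (1 - p))[:, None]).T @ dx
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Static two-period logit design with x correlated with c_i (assumed for illustration)
rng = np.random.default_rng(5)
n, beta_true = 5000, 1.0
c = rng.standard_normal(n)
x1 = 0.5 * c + rng.standard_normal(n)
x2 = 0.5 * c + rng.standard_normal(n)
y1 = (beta_true * x1 + c + rng.logistic(size=n) > 0).astype(float)
y2 = (beta_true * x2 + c + rng.logistic(size=n) > 0).astype(float)

beta_hat = fe_logit_two_periods(y1, y2, x1[:, None], x2[:, None])
print("true beta:", beta_true, " FE logit estimate:", beta_hat[0])
```

Note that the conditioning argument relies on logistic errors, so data for this estimator should be generated from a logit rather than a probit model.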

You may also want to modify the sampling scheme. For example, in the model above (and the associated simulation code) $z_{it}$ is assumed to be uncorrelated with $c_i$. So if you are interested in analyzing the consequences of correlated unobserved effects, you are welcome to extend the simulation framework to allow, for example, for correlation between the unobserved effects and the explanatory variables.  
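
One simple way to introduce such correlation is to let $c_i$ depend on the time average of $z_{it}$, in the spirit of the Mundlak/Chamberlain device. The sketch below does this for a static model. The parameter `kappa` is a hypothetical addition and is not part of the model in Section 1 or of the course's simulation code.

```python
import numpy as np

rng = np.random.default_rng(7)
n, nT = 1000, 10
delta, phi_0, sigma_a = 1.0, 0.2, 1.0
kappa = 2.0   # hypothetical parameter: strength of the dependence of c_i on the mean of z

z = rng.standard_normal((n, nT))
z_bar = z.mean(axis=1)                       # time average of z for each unit
a = sigma_a * rng.standard_normal(n)
c = phi_0 + kappa * z_bar + a                # c_i is now correlated with z_i1, ..., z_iT
e = rng.standard_normal((n, nT))
y = (delta * z + c[:, None] + e > 0).astype(float)   # static model (rho = 0)

corr_c_zbar = np.corrcoef(c, z_bar)[0, 1]
print("corr(c_i, z_bar_i):", corr_c_zbar)
```

Setting `kappa = 0` recovers the uncorrelated case, so the same code can be used to compare estimators with and without correlated unobserved effects.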

## 6. Getting started
To make sure we are all on the same page, let's do a few initial steps here. 

### Initial setup
Before we begin, let's import some modules used throughout.

In [11]:
# initial setup
%reset -f
%load_ext autoreload
%autoreload 2
import pandas as pd
import numpy as np
from numpy import linalg as la
from tabulate import tabulate

### Routine for simulation of model
The model outlined above is implemented in the function `simulate` in `indexmodels.py`. 


In [31]:
# Simulate data from dynamic model
from indexmodels import *
df_sim=simulate(n=1000, nT=10, model='probit', rho=.1, delta=1, phi_0=0.2, phi_y0=0.2, sigma_a=1)
df_sim

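| Unnamed: 0 | group | period | y | z | y0 | const | l1.y |
|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 1.0 | 0.0 | -1.271649 | 1.0 | 1.0 | 1.0 |
| 2 | 0.0 | 2.0 | 0.0 | -1.343861 | 1.0 | 1.0 | 0.0 |
| 3 | 0.0 | 3.0 | 1.0 | -0.571454 | 1.0 | 1.0 | 0.0 |
| 4 | 0.0 | 4.0 | 1.0 | 0.015299 | 1.0 | 1.0 | 1.0 |
| 5 | 0.0 | 5.0 | 1.0 | 0.354558 | 1.0 | 1.0 | 1.0 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 10995 | 999.0 | 6.0 | 1.0 | -0.154665 | 0.0 | 1.0 | 1.0 |
| 10996 | 999.0 | 7.0 | 1.0 | 0.093586 | 0.0 | 1.0 | 1.0 |
| 10997 | 999.0 | 8.0 | 1.0 | 0.505401 | 0.0 | 1.0 | 1.0 |
| 10998 | 999.0 | 9.0 | 1.0 | -1.066231 | 0.0 | 1.0 | 1.0 |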


### Routines for estimators and estimation methods
The file `indexmodels.py` also contains implementations of pooled probit/logit and random effects probit/logit models. It relies on `mestim.py`, which contains a routine implementing the general class of M-estimators known from Chapter 12, including MLE (if you specify the objective function appropriately).  
The file `linearpaneldata.py` includes simple routines for pooled OLS and fixed effects regressions. Here you may want to supplement with the code from the exercise classes on static and dynamic linear panel data models.  

As a starting point, we could for example use these routines to simulate data from a static unobserved effects model with one strictly exogenous explanatory variable, $z_{it}$, and estimate the effect of $z_{it}$ using a variety of methods. You may get something like this: 

In [81]:
import linearpaneldata as lpd   # simple routines to do linear FE and Pooled OLS regressions
from indexmodels import *       # objective functions etc. for estimation of panel data binary response models
import mestim as M              # routines for M-estimation given general sample objective functions

# Simulate data
df_sim = simulate(n=1000, nT=20, delta=0.7, rho=0.3, psi=0, phi_0=0.1, phi_y0=0.1, sigma_a=2,
                  model='probit', rng=np.random.default_rng(seed=43))  # seeded generator for reproducibility

# Estimate models
lpm_ols=lpd.estim(df_sim, 'y', xvar=['const','z'], groupvar='group', method='pols', cov_type='robust')
lpm_fe=lpd.estim(df_sim, 'y',  xvar=['z'], groupvar='group', method='fe', cov_type='robust')
res_pp=pooled(df_sim, 'y', xvar =['z', 'const'] , groupvar='group', model='probit', cov_type='sandwich')
res_rep_prob=rand_effect(df_sim, 'y', xvar =['z', 'const'] , groupvar='group', model='probit', cov_type='Binv')
res_rep_logit=rand_effect(df_sim, 'y', xvar =['z', 'const'] , groupvar='group', model='logit', cov_type='Binv')



Specification: Pooled OLS Panel Regression
Dep. var. : y 

parnames         b_hat          se    t-values
----------  ----------  ----------  ----------
const           0.5340      0.0119     44.8462
z               0.1136      0.0040     28.0835
# of groups:       1000
# of observations: 20000 


Specification: Linear Fixed Effects Regression
Dep. var. : y 

parnames         b_hat          se    t-values
----------  ----------  ----------  ----------
z               0.1130      0.0034     33.1253
# of groups:       1000
# of observations: 20000 


In [78]:
print_output(res_pp)

Dep. var. : y 

parnames      theta_hat          se    t-values         jac         APE
----------  -----------  ----------  ----------  ----------  ----------
z               0.29791     0.01155    25.80431     0.00000     0.11348
const           0.08889     0.03125     2.84448     0.00000     0.03386

# of groups:      : 1000
# of observations : 20000
# log-likelihood. : -13284.088933705729 

Iteration info: 3 iterations, 4 evaluations of objective, and 4 evaluations of gradients
Elapsed time: 0.1172 seconds



In [82]:
print_output(res_rep_logit)

Dep. var. : y 

parnames      theta_hat          se    t-values         jac         APE
----------  -----------  ----------  ----------  ----------  ----------
z               1.25844     0.03085    40.79400     0.00000     0.07609
const           0.34367     0.13082     2.62710     0.00000     0.02078
sigma_a         3.90555     0.12661    30.84646    -0.00000     0.23615

# of groups:      : 1000
# of observations : 20000
# log-likelihood. : -7151.954356032121 

Iteration info: 15 iterations, 16 evaluations of objective, and 16 evaluations of gradients
Elapsed time: 0.9314 seconds



In [80]:
print_output(res_rep_prob)

Dep. var. : y 

parnames      theta_hat          se    t-values         jac         APE
----------  -----------  ----------  ----------  ----------  ----------
z               0.70866     0.01620    43.74846     0.00001     0.11287
const           0.18998     0.07267     2.61423    -0.00000     0.03026
sigma_a         2.17579     0.07010    31.03940    -0.00000     0.34656

# of groups:      : 1000
# of observations : 20000
# log-likelihood. : -7147.5681691048085 

Iteration info: 11 iterations, 12 evaluations of objective, and 12 evaluations of gradients
Elapsed time: 1.0432 seconds



In [49]:
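## From df to values: stack the simulated panel into NumPy arrays

y = df_sim['y'].values                        # binary outcome, one row per (group, period)
t = int(max(df_sim['period'].values))         # number of periods (assumes periods are numbered 1..t)
n = int(max(df_sim['group'].values) + 1)      # number of groups (group ids start at 0)
T = np.tile(t, n)                             # number of periods for each group (balanced panel)
cons = np.ones(n*t).reshape(-1,1)             # constant term
z = df_sim['z'].values.reshape(-1,1)          # exogenous regressor
y_lag = df_sim['l1.y'].values.reshape(-1,1)   # lagged dependent variable
x = np.column_stack((cons, z, y_lag))         # regressors: constant, z and lagged y (not y itself)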

array([[ 1.        ,  0.67817832,  1.        ],
       [ 1.        , -0.58552938,  1.        ],
       [ 1.        , -0.90867312,  0.        ],
       ...,
       [ 1.        ,  1.07829422,  1.        ],
       [ 1.        ,  0.75045829,  1.        ],
       [ 1.        ,  2.05202139,  1.        ]])