# The abond command
In this tutorial, we illustrate the functions of pydynpd with examples. The first data set is from Arellano and Bond (1991). It is an unbalanced panel of 140 firms over 9 years (1976-1984). You can download the data (data.csv) from the /benchmark/code folder. We first consider the following basic model:

$$
\begin{align}
n_{i,t}=\alpha_1n_{i,t-1}+\alpha_2n_{i,t-2}+\beta_1w_{i,t}+\gamma_1k_{i,t}+u_{i}+\epsilon_{i,t}
\end{align}
$$

In the model above, the variables $n$, $w$, and $k$ are the natural logarithms of employment, wage, and capital, respectively. $u_{i}$ is the unobserved fixed effect and $\epsilon_{i,t}$ is the idiosyncratic error.

Assumptions:

$w$ is a predetermined variable <br>
$k$ is strictly exogenous

To estimate the model, we first load the data into a pandas data frame:



In [4]:
import pandas as pd
from pydynpd import regression

df = pd.read_csv("data.csv")

Then we construct a command string that describes the model. A command string has two or three parts, separated by the pipe character (|).
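
As a quick illustration of this layout, the sketch below splits a command string into its parts with plain Python. The string is the one assembled later in this tutorial; the splitting logic is only an illustrative assumption about how the parts can be read, not pydynpd's own parser.

```python
command_str = 'n L1.n L2.n w k  | gmm(n, 2:3) gmm(w, 1:.) iv(k)'

# Split on the pipe and trim whitespace around each part.
parts = [p.strip() for p in command_str.split('|')]
model_part, instrument_part = parts[0], parts[1]
options_part = parts[2] if len(parts) > 2 else ''   # part 3 is optional

# The first token of part 1 is the dependent variable.
dependent, *regressors = model_part.split()
print(dependent)        # n
print(regressors)       # ['L1.n', 'L2.n', 'w', 'k']
print(instrument_part)  # gmm(n, 2:3) gmm(w, 1:.) iv(k)
```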

## Part 1
Part 1 is a list that starts with the dependent variable, followed by the independent variables (excluding time dummies). Given the model, part 1 is:

```
n L1.n L2.n w k
```
In the command above, L is the lag operator: L1.n is the first lag of n (i.e., $n_{i,t-1}$) and L2.n is the second lag of n (i.e., $n_{i,t-2}$).
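
To make the lag operator concrete, the following sketch computes the first and second lags of a variable within each firm of a tiny made-up panel. The `lag` helper and the numbers are illustrative assumptions and are not part of pydynpd, which builds these lags internally from the command string.

```python
# Toy panel keyed by (firm id, year); values are made up for illustration.
panel = {
    (1, 1976): 5.0, (1, 1977): 5.2, (1, 1978): 5.1,
    (2, 1977): 3.0, (2, 1978): 3.4,
}

def lag(panel, firm, year, k):
    """Return the value k years earlier for the same firm, or None if missing."""
    return panel.get((firm, year - k))

# Print each observation next to its L1 and L2 values.
for (firm, year), n in sorted(panel.items()):
    print(firm, year, n, lag(panel, firm, year, 1), lag(panel, firm, year, 2))
```

Lags never cross firm boundaries, and observations whose lag is missing cannot be used, which is why adding deeper lags to the model shrinks the usable sample.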

## Part 2

Part 2 indicates how instruments are created. First, suppose we want to use the second and third lags of the dependent variable n (i.e., L2.n and L3.n) as instruments. Then we include the following gmm list:

```
gmm(n, 2:3)
```
Next, suppose we believe that w is a predetermined variable and want to use its first and deeper lags (i.e., L1.w, L2.w, ...) as instruments. Then we include a second gmm list:

```
gmm(w, 1:.)
```
The dot (.) above means there is no restriction on the maximum lag of $w$; in other words, all available lags are used.
Next, suppose $k$ is a strictly exogenous variable. In that case we use an iv() list:

```
iv(k)
```
This tells pydynpd to use $k$ itself as an instrument.
Finally, we put all gmm and iv lists together to form part 2:
```
gmm(n, 2:3) gmm(w, 1:.) iv(k)
```
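
As a rough illustration of how a lag range such as `2:3` or `1:.` translates into instruments, the sketch below lists, for each period of a single panel unit, which lags fall inside the range and are actually observed. The function `available_lags` is hypothetical and only mimics the range notation; it makes no claim about how pydynpd builds its instrument matrix internally.

```python
def available_lags(min_lag, max_lag, t):
    """Lags usable at period index t (0 = first observed period).

    max_lag=None plays the role of the dot in 'gmm(w, 1:.)',
    i.e. no upper bound on the lag.
    """
    upper = t if max_lag is None else min(max_lag, t)
    return list(range(min_lag, upper + 1))

# Compare the bounded range 2:3 with the open-ended range 1:.
for t in range(5):
    print(t, available_lags(2, 3, t), available_lags(1, None, t))
```

With the open-ended range, the list grows by one lag every period, which is why the number of instruments can rise quickly as the panel gets longer.
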
### Example 1
Suppose our command has only the two parts above. We join them with the pipe character:

In [2]:
command_str='n L1.n L2.n w k  | gmm(n, 2:3) gmm(w, 1:.) iv(k)'


Finally, we use the abond function to estimate the model. It takes three arguments: the command string discussed above, the data frame, and a list of two variable names that identify the individual firm and the year, respectively.

In [3]:
mydpd = regression.abond(command_str, df, ['id', 'year'])

Dynamic panel-data estimation, two-step system GMM
 Group variable: id             Number of obs = 751     
 Time variable: year            Min obs per group: 5    
 Number of instruments = 61     Max obs per group: 7    
 Number of groups = 140         Avg obs per group: 5.36 
+------+------------+---------------------+------------+-----------+-----+
|  n   |   coef.    | Corrected Std. Err. |     z      |   P>|z|   |     |
+------+------------+---------------------+------------+-----------+-----+
| L1.n | 0.9419692  |      0.1520193      | 6.1963777  | 0.0000000 | *** |
| L2.n | -0.0641474 |      0.1106139      | -0.5799217 | 0.5619674 |     |
|  w   | -0.5047742 |      0.1657322      | -3.0457210 | 0.0023212 |  ** |
|  k   | 0.1087513  |      0.0513892      | 2.1162284  | 0.0343254 |  *  |
| _con | 1.7119125  |      0.5527024      | 3.0973494  | 0.0019526 |  ** |
+------+------------+---------------------+------------+-----------+-----+
Hansen test of overid. restrictions: chi(56) =

We can also combine "L1.n L2.n" in the model above into "L(1:2).n" and get the same result:

In [18]:
command_str='n L(1:2).n w k  | gmm(n, 2:3) gmm(w, 1:.) iv(k)'
mydpd = regression.abond(command_str, df, ['id', 'year'])

Dynamic panel-data estimation, two-step system GMM
 Group variable: id             Number of obs = 751     
 Time variable: year            Min obs per group: 5    
 Number of instruments = 61     Max obs per group: 7    
 Number of groups = 140         Avg obs per group: 5.36 
+------+------------+---------------------+------------+-----------+-----+
|  n   |   coef.    | Corrected Std. Err. |     z      |   P>|z|   |     |
+------+------------+---------------------+------------+-----------+-----+
| L1.n | 0.9419692  |      0.1520193      | 6.1963777  | 0.0000000 | *** |
| L2.n | -0.0641474 |      0.1106139      | -0.5799217 | 0.5619674 |     |
|  w   | -0.5047742 |      0.1657322      | -3.0457210 | 0.0023212 |  ** |
|  k   | 0.1087513  |      0.0513892      | 2.1162284  | 0.0343254 |  *  |
| _con | 1.7119125  |      0.5527024      | 3.0973494  | 0.0019526 |  ** |
+------+------------+---------------------+------------+-----------+-----+
Hansen test of overid. restrictions: chi(56) =

The output shows that the regression is a two-step system GMM, which is the default because we didn't include part 3 in the command string. There are 140 firms in the unbalanced sample, observed over at most 7 (= 9 - 2) years because the model includes the second lag of the dependent variable (i.e., L2.n). The Hansen over-identification test is significant, which suggests that the instruments are not jointly exogenous. Finally, the Arellano-Bond test for AR(2) is not significant, which supports using the second lag of the dependent variable as an instrument.

Because the regression doesn't pass Hansen over-identification test, we change our assumptions to:

Both $w$ and $k$ are predetermined variables

### Example 2
We then modify our code as follows. Note that "gmm(w, 1:.) iv(k)" is changed to "gmm(w k, 1:.)" to reflect the new assumption. We also remove L2.n from the model because it was not significant in the previous regression. This has a side benefit: with one fewer lag, more observations can be used in the regression (891 instead of 751).

In [8]:
command_str='n L1.n w k  | gmm(n, 2:3) gmm(w k, 1:.)'
mydpd = regression.abond(command_str, df, ['id', 'year'])

Dynamic panel-data estimation, two-step system GMM
 Group variable: id              Number of obs = 891     
 Time variable: year             Min obs per group: 6    
 Number of instruments = 114     Max obs per group: 8    
 Number of groups = 140          Avg obs per group: 6.36 
+------+------------+---------------------+------------+-----------+-----+
|  n   |   coef.    | Corrected Std. Err. |     z      |   P>|z|   |     |
+------+------------+---------------------+------------+-----------+-----+
| L1.n | 0.7935989  |      0.0565875      | 14.0242815 | 0.0000000 | *** |
|  w   | -0.4136271 |      0.1085985      | -3.8087729 | 0.0001397 | *** |
|  k   | 0.1725401  |      0.0429983      | 4.0127198  | 0.0000600 | *** |
| _con | 1.5551505  |      0.3717905      | 4.1828675  | 0.0000288 | *** |
+------+------------+---------------------+------------+-----------+-----+
Hansen test of overid. restrictions: chi(110) = 126.302 Prob > Chi2 = 0.137
Arellano-Bond test for AR(1) in first dif

As you can see, the model now passes both the Hansen and Arellano-Bond tests.

### Example 3
Apart from gmm() and iv(), we can also use endo() and pred() in part 2 for convenience. endo(list of variables) is equivalent to gmm(list of variables, 2:.), while pred(list of variables) is equivalent to gmm(list of variables, 1:.).
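
The equivalence can be spelled out as a simple text rewrite. The helper `expand_shortcuts` below is a hypothetical illustration built on Python's `re` module; it is not a pydynpd function and only shows which gmm() list each shortcut stands for.

```python
import re

def expand_shortcuts(part2):
    """Rewrite endo(...) and pred(...) into the equivalent gmm(...) lists."""
    part2 = re.sub(r'endo\(([^)]*)\)', r'gmm(\1, 2:.)', part2)
    part2 = re.sub(r'pred\(([^)]*)\)', r'gmm(\1, 1:.)', part2)
    return part2

print(expand_shortcuts('gmm(n, 2:3) pred(w k)'))  # gmm(n, 2:3) gmm(w k, 1:.)
print(expand_shortcuts('endo(n) pred(w k)'))      # gmm(n, 2:.) gmm(w k, 1:.)
```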

For example, the command for the second model can be rewritten as follows, and we get the same result.

In [9]:
command_str='n L1.n w k  | gmm(n, 2:3) pred(w k)'
mydpd = regression.abond(command_str, df, ['id', 'year'])

Dynamic panel-data estimation, two-step system GMM
 Group variable: id              Number of obs = 891     
 Time variable: year             Min obs per group: 6    
 Number of instruments = 114     Max obs per group: 8    
 Number of groups = 140          Avg obs per group: 6.36 
+------+------------+---------------------+------------+-----------+-----+
|  n   |   coef.    | Corrected Std. Err. |     z      |   P>|z|   |     |
+------+------------+---------------------+------------+-----------+-----+
| L1.n | 0.7935989  |      0.0565875      | 14.0242815 | 0.0000000 | *** |
|  w   | -0.4136271 |      0.1085985      | -3.8087729 | 0.0001397 | *** |
|  k   | 0.1725401  |      0.0429983      | 4.0127198  | 0.0000600 | *** |
| _con | 1.5551505  |      0.3717905      | 4.1828675  | 0.0000288 | *** |
+------+------------+---------------------+------------+-----------+-----+
Hansen test of overid. restrictions: chi(110) = 126.302 Prob > Chi2 = 0.137
Arellano-Bond test for AR(1) in first dif

## Part 3
We can change the default settings in part 3 of the command string. Part 3 accepts the following options:
- onestep: perform one-step GMM estimation rather than the default two-step GMM estimation
- nolevel: perform difference GMM only (the default is system GMM)
- timedumm: automatically include time dummies in part 1 and the corresponding iv() statement in part 2
- collapse: collapse instruments to reduce the problem of too many instruments
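
The way these options map to the estimator named in the output header can be sketched as follows. The `describe` helper and its error handling are assumptions made for illustration; only the four option names and the resulting header wording come from this tutorial.

```python
KNOWN_OPTIONS = {'onestep', 'nolevel', 'timedumm', 'collapse'}

def describe(part3=''):
    """Turn a part 3 string into the estimator description it implies."""
    options = set(part3.split())
    unknown = options - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f'unknown option(s): {sorted(unknown)}')
    steps = 'one-step' if 'onestep' in options else 'two-step'
    kind = 'difference' if 'nolevel' in options else 'system'
    return f'{steps} {kind} GMM'

print(describe())                   # two-step system GMM
print(describe('onestep nolevel'))  # one-step difference GMM
```

Leaving part 3 empty therefore corresponds to the two-step system GMM reported in the earlier examples.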

### Example 4
For example, we can switch to a one-step difference GMM by adding part 3 to the command string. Suppose we also want to use more lags of the dependent variable as instruments; that is, all available lags of $n$ rather than just L2.n and L3.n as in the previous models. To do so, we change gmm(n, 2:3) to gmm(n, 2:.), which can be shortened to endo(n).

In [15]:
command_str='n L1.n w k  | endo(n) pred(w k) | onestep nolevel'
mydpd = regression.abond(command_str, df, ['id', 'year'])

Dynamic panel-data estimation, one-step difference GMM
 Group variable: id              Number of obs = 751     
 Time variable: year             Min obs per group: 6    
 Number of instruments = 105     Max obs per group: 8    
 Number of groups = 140          Avg obs per group: 6.36 
+------+------------+---------------------+-------------+-----------+-----+
|  n   |   coef.    | Corrected Std. Err. |      z      |   P>|z|   |     |
+------+------------+---------------------+-------------+-----------+-----+
| L1.n | 0.4145541  |      0.0708631      |  5.8500695  | 0.0000000 | *** |
|  w   | -0.9256463 |      0.0891816      | -10.3793415 | 0.0000000 | *** |
|  k   | 0.3773193  |      0.0437854      |  8.6174729  | 0.0000000 | *** |
+------+------------+---------------------+-------------+-----------+-----+
Hansen test of overid. restrictions: chi(102) = 115.193 Prob > Chi2 = 0.175
Arellano-Bond test for AR(1) in first differences: z = -3.99 Pr > z =0.000
Arellano-Bond test for AR(2) i

### Example 5
We can also add time dummy variables to the previous model:

In [16]:
command_str='n L1.n w k  | endo(n) pred(w k) | onestep nolevel timedumm'
mydpd = regression.abond(command_str, df, ['id', 'year'])

Dynamic panel-data estimation, one-step difference GMM
 Group variable: id              Number of obs = 751     
 Time variable: year             Min obs per group: 6    
 Number of instruments = 112     Max obs per group: 8    
 Number of groups = 140          Avg obs per group: 6.36 
+-----------+------------+---------------------+------------+-----------+-----+
|     n     |   coef.    | Corrected Std. Err. |     z      |   P>|z|   |     |
+-----------+------------+---------------------+------------+-----------+-----+
|    L1.n   | 0.4874768  |      0.0858172      | 5.6804061  | 0.0000000 | *** |
|     w     | -0.6282987 |      0.1413220      | -4.4458657 | 0.0000088 | *** |
|     k     | 0.3321728  |      0.0477929      | 6.9502543  | 0.0000000 | *** |
| year_1978 | -0.0263750 |      0.0129610      | -2.0349499 | 0.0418559 |  *  |
| year_1979 | -0.0322775 |      0.0158671      | -2.0342438 | 0.0419270 |  *  |
| year_1980 | -0.0603802 |      0.0168346      | -3.5866738 | 0.0003349 |

Unless economic theory indicates which model to choose, it is the researcher's job to decide how many lags of each variable to include on the right-hand side. An innovative feature of pydynpd is that it can try all possible lags and return the models that pass the Hansen and AR(2) tests.

For example, suppose we want the system to try models with different numbers of lags of the dependent variable $n$. We request this by writing "L(1:?).n" in part 1. The question mark asks the system to try L(1:1).n, then L(1:2).n, then L(1:3).n, and so on. Each attempt is a candidate model. The system automatically filters the candidates and shows only those that satisfy all of the conditions below:

- The model passes the Hansen over-identification test (i.e., its p-value is greater than 5%).
- The model passes the AR(2) test.
- The p-value of the Hansen over-identification test is not too high; that is, it is less than 99.99%. A p-value too close to 1 indicates a potential too-many-instruments problem.
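
The screening rule above can be written as a small predicate. This is a minimal sketch, not pydynpd's code: the function name `keep_model` is hypothetical, and the 5% threshold for the AR(2) test is an assumption, since the list only states the thresholds for the Hansen test. The AR(2) p-values in the calls are made up.

```python
def keep_model(hansen_p, ar2_p, alpha=0.05, hansen_upper=0.9999):
    """Apply the three screening conditions to one candidate model."""
    passes_hansen = hansen_p > alpha          # instruments not rejected
    passes_ar2 = ar2_p > alpha                # assumed 5% level for AR(2)
    not_suspicious = hansen_p < hansen_upper  # guard against too many instruments
    return passes_hansen and passes_ar2 and not_suspicious

print(keep_model(0.137, 0.30))  # True
print(keep_model(0.02, 0.30))   # False: Hansen test rejects
print(keep_model(1.0, 0.30))    # False: Hansen p-value too close to 1
```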

### Example 6
In this example, we let the system try different models by increasing the number of lags of the dependent variable one at a time:

In [17]:
command_str='n L(1:?).n w k  | gmm(n, 2:3) pred(w k)'
mydpd = regression.abond(command_str, df, ['id', 'year'])

model 1
 n  L1.n  w  k | gmm(n, 2:3) pred(w k)
Dynamic panel-data estimation, two-step system GMM
 Group variable: id              Number of obs = 891     
 Time variable: year             Min obs per group: 6    
 Number of instruments = 114     Max obs per group: 8    
 Number of groups = 140          Avg obs per group: 6.36 
+------+------------+---------------------+------------+-----------+-----+
|  n   |   coef.    | Corrected Std. Err. |     z      |   P>|z|   |     |
+------+------------+---------------------+------------+-----------+-----+
| L1.n | 0.7935989  |      0.0565875      | 14.0242815 | 0.0000000 | *** |
|  w   | -0.4136271 |      0.1085985      | -3.8087729 | 0.0001397 | *** |
|  k   | 0.1725401  |      0.0429983      | 4.0127198  | 0.0000600 | *** |
| _con | 1.5551505  |      0.3717905      | 4.1828675  | 0.0000288 | *** |
+------+------------+---------------------+------------+-----------+-----+
Hansen test of overid. restrictions: chi(110) = 126.302 Prob > Chi2 = 

As shown above, the system reports 5 models (model 1 to model 5). The results of three of them are displayed because they satisfy all of the conditions listed above. Model 3 and model 4 have no output because each of them fails at least one of the conditions. The system actually tried more than five models. For example, it tried model 6 with the command string "n  L1.n  L2.n  L3.n  L4.n  L5.n  L6.n w  k | gmm(n, 2:3) pred(w k)". However, that model does not have enough observations to complete the regression (as shown above, from model 1 to model 5 the number of observations falls from 891 to 331). Therefore, the system doesn't count it as a model.

Also, in the output above the command string of each candidate model is displayed under its model name. This allows users to replicate the output for any model finally chosen.
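
The enumeration of candidates can be pictured as a string substitution on the question mark. The sketch below is an assumption about the idea rather than pydynpd's implementation: `candidate_commands` is a hypothetical helper, and the upper bound of 5 is fixed by hand to match the five models reported above, whereas pydynpd stops when the data no longer support the regression.

```python
def candidate_commands(template, max_lag):
    """Yield one command string per lag depth for a single 'L(1:?)' term."""
    for depth in range(1, max_lag + 1):
        yield template.replace('L(1:?)', f'L(1:{depth})')

template = 'n L(1:?).n w k | gmm(n, 2:3) pred(w k)'
for i, cmd in enumerate(candidate_commands(template, 5), start=1):
    print(f'model {i}: {cmd}')
```

Any of the generated strings can be passed to regression.abond on its own to re-estimate that single model.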

### Example 7
The question mark is not restricted to lags of the dependent variable $n$. For example, we can try the following model with two question marks. The system will then try all combinations of lags of $n$ and lags of $w$, resulting in many more candidate models.

In [19]:
command_str='n L(1:?).n L(0:?).w k  | gmm(n, 2:3) pred(w k)'
mydpd = regression.abond(command_str, df, ['id', 'year'])

model 1
 n  L1.n  w  k | gmm(n, 2:3) pred(w k)
Dynamic panel-data estimation, two-step system GMM
 Group variable: id              Number of obs = 891     
 Time variable: year             Min obs per group: 6    
 Number of instruments = 114     Max obs per group: 8    
 Number of groups = 140          Avg obs per group: 6.36 
+------+------------+---------------------+------------+-----------+-----+
|  n   |   coef.    | Corrected Std. Err. |     z      |   P>|z|   |     |
+------+------------+---------------------+------------+-----------+-----+
| L1.n | 0.7935989  |      0.0565875      | 14.0242815 | 0.0000000 | *** |
|  w   | -0.4136271 |      0.1085985      | -3.8087729 | 0.0001397 | *** |
|  k   | 0.1725401  |      0.0429983      | 4.0127198  | 0.0000600 | *** |
| _con | 1.5551505  |      0.3717905      | 4.1828675  | 0.0000288 | *** |
+------+------------+---------------------+------------+-----------+-----+
Hansen test of overid. restrictions: chi(110) = 126.302 Prob > Chi2 = 
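
To see why two question marks produce many more candidates, the sketch below counts the combinations with `itertools.product`. The upper bounds are assumed values chosen for illustration, and the generated strings only mimic the range notation; pydynpd determines the feasible depths from the data, which this sketch does not attempt to replicate.

```python
from itertools import product

# Assumed upper bounds for illustration only.
max_n_lag, max_w_lag = 3, 2

# One candidate per (lag depth of n, lag depth of w) pair.
candidates = [
    f'n L(1:{p}).n L(0:{q}).w k | gmm(n, 2:3) pred(w k)'
    for p, q in product(range(1, max_n_lag + 1), range(0, max_w_lag + 1))
]
print(len(candidates))  # 9
print(candidates[0])    # n L(1:1).n L(0:0).w k | gmm(n, 2:3) pred(w k)
```

The count is the product of the two ranges, so each additional question mark multiplies the number of regressions to run.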