# Estimation Tutorial

In this section, we dive into the topic of model estimation using **pydsge**.

Now, for this tutorial we will assume a folder set-up of the form

```
analysis
|   README.md
|-- src/
|  |   estimation.py or .ipynb
|  |   model.yaml
|-- data/
|  |   example_data
|-- output/
```

In [15]:
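# Just for the tutorial: set up the example folder structure
import tempfile
import os
from pathlib import Path  # needed here, since this cell comes before the main imports

# Temporary output folder
output_path = Path(tempfile.gettempdir(), 'analysis/output')
os.makedirs(output_path, exist_ok=True)  # do not fail if the folder already exists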

# Parsing and loading the model

Let us first load the relevant packages.

In [1]:
from pathlib import Path # For Windows/Unix compatibility
import pandas as pd
import numpy as np
import emcee # For sampling from posterior distribution

from pydsge import DSGE

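Next, we parse the model from its `*.yaml` file, give it a name and a description, set the output path, and load the data.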

In [20]:
yaml = "pydsge_doc/rank.yaml"
# TODO 1: use example model provided with package. Potentially adjust if necessary

mod = DSGE.read(yaml)  

mod.name = 'rank_test'
mod.description = 'RANK, crisis sample'

# mod.path = Path("pydsge_doc/npz")
mod.path = output_path  # temporary output folder created above

d0 = pd.read_csv(
    Path("pydsge_doc/data.csv"), sep=";", index_col="date", parse_dates=True
).dropna()
# TODO 2: use example data provided with package instead (contains only three time series and no confidential information)

# Adjust for the effective lower bound (ELB): floor the policy rate at `elb_level`
zlb = mod.get_par('elb_level')
rate = d0['FFR']
d0['FFR'] = np.maximum(rate, zlb)

mod.load_data(d0, start='1998Q1')

| date | GDP | Cons_JPT | Inv_JPT | Wage | Lab | Infl | FFR | BAASpread |
|---|---|---|---|---|---|---|---|---|
| 1998-01-01 | 0.705224 | 0.720178 | 2.392249 | 1.531065 | 5.689687 | 0.145569 | 1.380000 | 0.415587 |
| 1998-04-01 | 0.675096 | 1.104010 | 0.127271 | 1.013521 | 5.517304 | 0.235932 | 1.375000 | 0.415079 |
| 1998-07-01 | 0.985439 | 0.848967 | 1.783101 | 1.189956 | 5.493090 | 0.429114 | 1.383333 | 0.479831 |
| 1998-10-01 | 1.336121 | 0.642785 | 3.165516 | 0.285096 | 5.872790 | 0.275362 | 1.215000 | 0.647016 |
| 1999-01-01 | 0.655486 | 0.853668 | 1.259925 | 1.385310 | 5.785575 | 0.376072 | 1.183333 | 0.600628 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2019-10-01 | 0.433952 | 0.334165 | -1.010784 | 0.443767 | -2.202829 | 0.371650 | 0.410833 | 0.530349 |
| 2020-01-01 | -1.342152 | -1.671383 | -2.669948 | 1.798675 | -2.745063 | 0.410841 | 0.315000 | 0.631250 |
| 2020-04-01 | -9.543759 | -11.205801 | -11.176176 | 5.957645 | -15.705174 | -0.528662 | 0.050000 | 0.804762 |
| 2020-07-01 | 7.081864 | 7.456729 | 15.610632 | -1.461654 | -9.158974 | 0.902692 | 0.050000 | 0.665677 |
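
To illustrate the ELB adjustment applied before loading the data, here is a minimal, self-contained sketch. The rate values and the bound `elb_level = 0.05` are made up for illustration and are not taken from the example data set; in the tutorial, the bound comes from `mod.get_par('elb_level')`.

```python
import numpy as np

# Hypothetical quarterly policy rate; the values are illustrative only
rate = np.array([0.80, 0.50, 0.12, 0.04, 0.03])

elb_level = 0.05  # assumed lower bound, not the value from the model file

# Element-wise maximum: every observation below the bound is set to the bound
rate_adj = np.maximum(rate, elb_level)
print(rate_adj)
```

Observations above the bound are left unchanged, so only the periods in which the rate falls below the bound are affected.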


# Preparing the estimation

In [21]:
# Crucial command: prepares the estimation and adds the parameters to the prior (see the output below)
mod.prep_estim(N=350, seed=0, verbose=True)
# Hint: consider starting with the linear estimation, which is faster. See the other script provided.

# Set the observation covariance matrix R of the filter and reduce the entry for FFR by a factor of 10
mod.filter.R = mod.create_obs_cov(1e-1)
ind = mod.observables.index('FFR')
mod.filter.R[ind, ind] /= 1e1

[estimation:]   Model operational. 55 states, 8 observables, 8 shocks, 92 data points.
Adding parameters to the prior distribution...
  parameter sig_c as normal (1.5, 0.375). Init @ 1.2312, with bounds (0.25, 3)...
  parameter sig_l as normal (2, 0.75). Init @ 2.8401, with bounds (-0.5, 6)...
  parameter tpr_beta as gamma (0.25, 0.1). Init @ 0.742, with bounds (0.01, 1.0)...
  parameter h as beta (0.7, 0.1). Init @ 0.7205, with bounds (0.3, 0.95)...
  parameter phiss as normal (4, 1.5). Init @ 6.3325, with bounds (2, 12)...
  parameter i_p as beta (0.5, 0.15). Init @ 0.3291, with bounds (0.01, 0.9)...
  parameter i_w as beta (0.5, 0.15). Init @ 0.4425, with bounds (0.01, 0.9)...
  parameter alpha as normal (0.3, 0.05). Init @ 0.24, with bounds (0.05, 0.4)...
  parameter zeta_p as beta (0.5, 0.1). Init @ 0.7813, with bounds (0.4, 0.99)...
  parameter zeta_w as beta (0.5, 0.1). Init @ 0.7937, with bounds (0.4, 0.99)...
  parameter Phi_p as normal (1.25, 0.125). Init @ 1.4672, with bound
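
The adjustment of `mod.filter.R` above can be pictured with a small stand-alone sketch. It assumes that the observation covariance is a diagonal matrix with one entry per observable; how `create_obs_cov` actually constructs the matrix is not shown in this tutorial, so the variances and the scaling below are purely illustrative.

```python
import numpy as np

observables = ["GDP", "Infl", "FFR"]  # shortened, illustrative list

# Assumption: diagonal covariance, here a made-up fraction of made-up data variances
data_var = np.array([1.2, 0.3, 0.8])
R = np.diag(0.1 * data_var)

# Shrink the entry of the interest rate by a factor of 10
ind = observables.index("FFR")
R[ind, ind] /= 1e1

print(np.diag(R))
```

A smaller diagonal entry tells the filter to treat that observable as measured more precisely than the others.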

# Running the estimation

Now that we have all the variables and have defined the type of estimation to perform, we can turn to estimating the model. To be able to deal with very high-dimensional models, `pydsge` uses *Markov Chain Monte Carlo* (MCMC) integration to sample from the posterior distribution. For further information on MCMC, please refer to the `emcee` [website](https://emcee.readthedocs.io/en/stable/) and the additional resources provided there. We recommend running a **Tempered Ensemble MCMC** first, using the `tmcmc` method. This is particularly valuable for high-dimensional problems, because it provides good initial states for the walkers in the parameter space, which in turn improves the subsequent sampling. Because it is efficient, we also use it for small models such as the one we are dealing with here.
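
To give some intuition for ensemble sampling, the following is a minimal sketch of the stretch move of Goodman and Weare, the default move in `emcee`, written in plain numpy. It is not the implementation used by `pydsge` or `emcee`; the target density (a standard normal), the function name `stretch_move_sampler` and all tuning values are assumptions chosen for illustration.

```python
import numpy as np

def log_prob(x):
    # Toy target: standard normal log-density (up to a constant)
    return -0.5 * np.sum(x**2)

def stretch_move_sampler(log_prob, p0, nsteps, a=2.0, seed=0):
    """Serial stretch-move ensemble sampler (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    walkers = np.array(p0, dtype=float)
    nwalkers, ndim = walkers.shape
    lp = np.array([log_prob(w) for w in walkers])
    chain = np.empty((nsteps, nwalkers, ndim))
    for t in range(nsteps):
        for k in range(nwalkers):
            # Pick a different walker j as the complementary point
            j = rng.integers(nwalkers - 1)
            if j >= k:
                j += 1
            # Draw the stretch factor z with density proportional to 1/sqrt(z) on [1/a, a]
            z = ((a - 1.0) * rng.random() + 1.0) ** 2 / a
            proposal = walkers[j] + z * (walkers[k] - walkers[j])
            lp_new = log_prob(proposal)
            # Accept with probability min(1, z^(ndim-1) * p(proposal) / p(current))
            log_accept = (ndim - 1) * np.log(z) + lp_new - lp[k]
            if np.log(rng.random()) < log_accept:
                walkers[k], lp[k] = proposal, lp_new
        chain[t] = walkers
    return chain

rng = np.random.default_rng(1)
p0 = rng.normal(size=(20, 2))          # 20 walkers in 2 dimensions
chain = stretch_move_sampler(log_prob, p0, nsteps=2000)
samples = chain[500:].reshape(-1, 2)   # drop burn-in, pool all walkers
print(samples.mean(axis=0), samples.std(axis=0))
```

Each walker is updated using the position of another walker, which is why the initial spread of the ensemble matters for how well the sampler explores the parameter space.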

For our ensemble sampling, we can specify a variety of options. Note that `tmcmc` always requires the first four arguments: i) the number of steps, ii) the number of walkers, iii) the number of temperatures, and iv) a temperature target. Here we do not want to set a target, so we set `fmax = None`. Moreover, we can choose different "moves", i.e. coordinate-updating algorithms for the walkers. As a wrapper for much of the `emcee` functionality, `tmcmc` can work with many different moves; for a list and implementation details, please consult the `emcee` documentation. To use them here, specify them as a list of tuples, each containing the move and its "weight". If no move is specified, `StretchMove` is used. For seed setting, the user can choose between three options; here we use the standard numpy seed. Finally, the resulting states are stored in `p0` as a numpy array so that we can later pass them to our main sampling process.
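
The role of the move weights can be sketched as follows: at each step, one move is picked at random with probability proportional to its weight. The snippet below only mimics this selection with the standard library; the move names are plain strings rather than `emcee` objects, and the number of steps is arbitrary.

```python
import random
from collections import Counter

# (name, weight) pairs mirroring the structure of the `moves` list
moves = [("DEMove", 0.8), ("DESnookerMove", 0.2)]
names, weights = zip(*moves)

random.seed(0)
# Draw one move per step, with probability proportional to its weight
picked = random.choices(names, weights=weights, k=10_000)
shares = {name: count / len(picked) for name, count in Counter(picked).items()}
print(shares)
```

With weights of 0.8 and 0.2, roughly four out of five steps use the first move.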

In [22]:
fmax = None

moves = [(emcee.moves.DEMove(), 0.8), 
         (emcee.moves.DESnookerMove(), 0.2),]

p0 = mod.tmcmc(200, 200, 0, fmax, moves=moves, update_freq=100, lprob_seed='set')
mod.save()

[create_pool:]  Could not import package `threadpoolctl` to limit numpy multithreading. This might reduce multiprocessing performance.


100%|██████████| 200/200 [01:31<00:00,  2.18it/s]


[prior_sample:] Sampling done. 3.85% of the prior is either indetermined or explosive.


0temp(s) [00:00, ?temp(s)/s]

[save_meta:]    Metadata saved as 'C:\Users\Philipp\AppData\Local\Temp\analysis\output\rank_test_meta'





As we can see, the output provides us with various important details. In particular, we learn that `mod.save()` saved the metadata of our model in the directory that we specified earlier in `mod.path`. This information is stored as an `.npz` file so that it is available even in the event of a crash and can be loaded at any time using `numpy.load()`.
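
As a hedged illustration of the `.npz` format, the sketch below writes a small archive to a temporary folder and reads it back. The file name `demo_meta.npz` and the stored arrays are invented for this example and do not reflect the actual contents of the metadata file written by `mod.save()`.

```python
import tempfile
from pathlib import Path

import numpy as np

tmp_dir = Path(tempfile.mkdtemp())
file_path = tmp_dir / "demo_meta.npz"  # hypothetical file name

# Store a few named arrays in one archive
np.savez(file_path, par_init=np.array([1.5, 2.0, 0.25]), seed=np.array(0))

# Load the archive again; entries are accessed by name, like a dictionary
with np.load(file_path) as archive:
    keys = sorted(archive.files)
    par_init = archive["par_init"]

print(keys, par_init)
```

If an archive contains pickled Python objects rather than plain arrays, `np.load` additionally needs `allow_pickle=True`.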

We now use the initial states derived above to conduct our full Bayesian estimation. Initial states do not have to be specified, though: unless `mcmc` can identify previous runs or estimations, the initial values from the "prior" section of the `*.yaml` file are used. The default number of sampling steps is 3000, so it makes sense to let this run in parallel. If you want to avoid parallel execution, simply set `debug=True`. As before, setting the seed is essential for reproducible results.

[*What is the purpose of `tune`, `update_freq`, and `append`?*]

In [22]:
mod.mcmc(p0,
         moves=moves,
         # nsteps=3000,
         nsteps=20,
         tune=500,
         update_freq=500,
         lprob_seed='set',
         append=True,
         debug=True)
mod.save()

[create_pool:]  Could not import package `threadpoolctl` to limit numpy multithreading. This might reduce multiprocessing performance.


  0%|          | 0/20 [00:00<?, ?sample(s)/s]

In [7]:
mod.__dict__

{'func_file': 'pydsge_doc/rank_funcs.py',
 'const_obs': FFR,
 'pcompile': <function pydsge.parser.DSGE.get_matrices.<locals>.compile(px)>,
 'parafunc': (['sprd',
   'PI',
   'gamma',
   'beta',
   'betabar',
   'RR',
   'RK',
   'RKstar',
   'lamb_p',
   'W',
   'K',
   'k_1',
   'K_Y',
   'Y',
   'I',
   'C',
   'c_2',
   'kappa',
   'kappa_w',
   'x_bar'],
  <function _lambdifygenerated(_Dummy_33)>),
 'psi': <function _lambdifygenerated(_Dummy_33)>,
 'PSI': <function _lambdifygenerated(_Dummy_27)>,
 'ZZ0': <function _lambdifygenerated(_Dummy_25)>,
 'ZZ1': <function _lambdifygenerated(_Dummy_26)>,
 'AA': <function _lambdifygenerated(_Dummy_28)>,
 'BB': <function _lambdifygenerated(_Dummy_29)>,
 'CC': <function _lambdifygenerated(_Dummy_30)>,
 'bb': <function _lambdifygenerated(_Dummy_31)>,
 'bb_PSI': <function _lambdifygenerated(_Dummy_32)>,
 'QQ': <function _lambdifygenerated(_Dummy_34)>,
 'HH': <function _lambdifygenerated(_Dummy_35)>,
 'par_fix': array([ 0.5  ,  0.05 ,  0.   , 10. 

So where are our estimates? Remember that, so far, we have only drawn samples from the posterior distribution. The converged (burnt-in) MCMC samples are currently stored in the `rank_test_sampler.h5` file created by `mcmc`. To get our parameter estimates, we still need to draw a sample from the MCMC object.

In [5]:
pars = mod.get_par('posterior', nsamples=250, full=True)

NameError: A backend file named `pydsge_doc\npz\rank_test_sampler.h5` could not be found.
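
Conceptually, drawing from the posterior amounts to picking random rows from the pooled, burnt-in chain. The sketch below shows this with a synthetic chain; the array shape `(nsteps, nwalkers, ndim)`, the burn-in length and all numbers are assumptions for illustration, and this is not how `get_par` is implemented.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic chain: 300 steps, 40 walkers, 3 parameters (illustrative only)
chain = rng.normal(loc=[1.5, 2.0, 0.25], scale=0.1, size=(300, 40, 3))

burn_in = 100
flat = chain[burn_in:].reshape(-1, chain.shape[-1])  # pool walkers after burn-in

# Draw 250 parameter vectors without replacement
idx = rng.choice(flat.shape[0], size=250, replace=False)
pars = flat[idx]
print(pars.shape, pars.mean(axis=0))
```

Each row of `pars` is one complete parameter vector, so the joint dependence between parameters is preserved in the draws.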

Now, let's have a look at the estimated shocks. We can do this with `extract()`, which gives us the smoothed shocks. This method takes a variety of arguments, all of which have sensible default values. For example, here we set the number of parameter draws in each verification sample to 1. [*Is that correct?*]

Note also that the default seed is 0, which we simply use here.

In [None]:
epsd0 = mod.extract(pars, nsamples=1)
mod.save_rdict(epsd0)

In [17]:
mod.mode_summary()

Empty DataFrame
Columns: []
Index: [sig_c, sig_l, tpr_beta, h, phiss, i_p, i_w, alpha, zeta_p, zeta_w, Phi_p, psi, phi_pi, phi_y, phi_dy, rho, rho_r, rho_g, rho_z, rho_u, rho_p, rho_w, rho_i, mu_p, mu_w, rho_gz, sig_g, sig_u, sig_z, sig_r, sig_p, sig_w, sig_i, rho_fin, sig_fin, trend, mean_l, mean_Pi, mean_spread, loglike]




In [18]:
mod.mcmc_summary()

                 distribution  pst_mean  sd/df   mean     sd   mode  hpd_5  \
sig_c                  normal     1.500  0.375  1.472  0.350  1.085  0.921   
sig_l                  normal     2.000  0.750  1.997  0.709  1.695  0.777   
tpr_beta                gamma     0.250  0.100  0.240  0.084  0.236  0.102   
h                        beta     0.700  0.100  0.686  0.099  0.607  0.549   
phiss                  normal     4.000  1.500  4.102  1.425  5.618  2.140   
i_p                      beta     0.500  0.150  0.508  0.142  0.691  0.308   
i_w                      beta     0.500  0.150  0.508  0.149  0.607  0.226   
alpha                  normal     0.300  0.050  0.300  0.048  0.306  0.210   
zeta_p                   beta     0.500  0.100  0.500  0.098  0.587  0.348   
zeta_w                   beta     0.500  0.100  0.496  0.097  0.676  0.345   
Phi_p                  normal     1.250  0.125  1.244  0.109  1.190  1.063   
psi                      beta     0.500  0.150  0.503  0.132  0.

| parameter | distribution | pst_mean | sd/df | mean | sd | mode | hpd_5 | hpd_95 | mc_error |
|---|---|---|---|---|---|---|---|---|---|
| sig_c | normal | 1.5 | 0.375 | 1.472199 | 0.350116 | 1.084867 | 0.920994 | 2.039454 | 0.350116 |
| sig_l | normal | 2.0 | 0.75 | 1.997245 | 0.708958 | 1.694885 | 0.77681 | 3.027678 | 0.708958 |
| tpr_beta | gamma | 0.25 | 0.1 | 0.240099 | 0.083557 | 0.235566 | 0.102281 | 0.36647 | 0.083557 |
| h | beta | 0.7 | 0.1 | 0.6856 | 0.09937 | 0.607165 | 0.54904 | 0.836168 | 0.09937 |
| phiss | normal | 4.0 | 1.5 | 4.101564 | 1.424912 | 5.618352 | 2.140412 | 6.845431 | 1.424912 |
| i_p | beta | 0.5 | 0.15 | 0.507595 | 0.141779 | 0.691318 | 0.308223 | 0.765521 | 0.141779 |
| i_w | beta | 0.5 | 0.15 | 0.507723 | 0.148623 | 0.606684 | 0.226049 | 0.720866 | 0.148623 |
| alpha | normal | 0.3 | 0.05 | 0.30021 | 0.048007 | 0.306209 | 0.209916 | 0.365633 | 0.048007 |
| zeta_p | beta | 0.5 | 0.1 | 0.50015 | 0.098266 | 0.586983 | 0.348059 | 0.64683 | 0.098266 |
| zeta_w | beta | 0.5 | 0.1 | 0.495658 | 0.097199 | 0.676203 | 0.344723 | 0.659264 | 0.097199 |
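
For readers who want to see how such summary statistics can be obtained from posterior draws, here is a minimal sketch that computes the mean, the standard deviation and a 90% highest-posterior-density (HPD) interval from samples. The function name `hpd_interval`, the interval mass of 0.9 and the synthetic draws are assumptions; `mcmc_summary` may compute its columns differently.

```python
import numpy as np

def hpd_interval(samples, mass=0.9):
    """Shortest interval containing `mass` of the samples (unimodal case)."""
    x = np.sort(np.asarray(samples))
    n = len(x)
    n_in = int(np.ceil(mass * n))          # number of draws inside the interval
    widths = x[n_in - 1:] - x[: n - n_in + 1]
    start = int(np.argmin(widths))         # left end of the narrowest window
    return x[start], x[start + n_in - 1]

rng = np.random.default_rng(0)
draws = rng.normal(loc=1.5, scale=0.375, size=20_000)  # synthetic posterior draws

lower, upper = hpd_interval(draws)
print(round(draws.mean(), 3), round(draws.std(), 3), round(lower, 3), round(upper, 3))
```

For a symmetric, unimodal distribution such as this one, the HPD interval is close to the interval between the 5% and 95% quantiles; for skewed posteriors the two can differ noticeably.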
