# Estimation Tutorial

In this section, we dive into the topic of model estimation using **pydsge**. 

Let us, just for the sake of this tutorial, set up a temporary directory structure:

In [1]:
# Just for the tutorial: Setting up example structure
import tempfile
import os
import shutil # For clean-up of temporary directory
from pathlib import Path # For Windows/Unix compatibility

# Temporary output folder
output_path = Path(tempfile.gettempdir(), 'output')
if not os.path.isdir(output_path):
    os.makedirs(output_path)
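
As a variant of the setup above, here is a minimal sketch that uses `tempfile.mkdtemp` to create a uniquely named directory. This avoids clashes if a folder called `output` already exists in the system's temporary directory. The names `tmp_dir` and `sketch_path` and the prefix `pydsge_tutorial_` are illustrative assumptions and are not used elsewhere in this tutorial.

```python
import shutil
import tempfile
from pathlib import Path

# Create a uniquely named temporary directory (illustrative prefix)
tmp_dir = Path(tempfile.mkdtemp(prefix="pydsge_tutorial_"))

# Create an 'output' subfolder; exist_ok avoids an error if it already exists
sketch_path = tmp_dir / "output"
sketch_path.mkdir(parents=True, exist_ok=True)
created = sketch_path.is_dir()

# Clean up again, mirroring the clean-up at the end of the tutorial
shutil.rmtree(tmp_dir)
removed = not tmp_dir.exists()
```

The rest of the tutorial keeps using `output_path` as defined above.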

## Parsing and loading the model

Let us first load the relevant packages. Besides the DSGE class we already know from [*getting started*](https://pydsge.readthedocs.io/en/latest/getting_started.html), we also want to import the `emcee` package. This will allow us to later specify the desired updating algorithms for sampling from the posterior distribution - we explain this in more detail below.

In [2]:
import pandas as pd
import numpy as np
import emcee # For specifying updating moves

from pydsge import DSGE, example

In this tutorial, we continue to use the example provided in `pydsge`. As before, we specify the file paths of the model and the data. Feel free to check out both files; from the previous tutorial you might remember that we are dealing with a five-equation New Keynesian model and US quarterly data from 1995 to 2018.

In [3]:
yaml_file, data_file = example

We again parse the model and load the data. Importantly, we also specify a location where the (intermediate) output is stored. Here we assign the output folder that we set up at the beginning. Note also that we can name the model and write a short description, which is very useful when working with several models.

In [4]:
# Parse the model
mod = DSGE.read(yaml_file)  

# Give it a name
mod.name = 'Rank_tutorial'
mod.description = 'RANK, estimation tutorial'

# Storage location for output
mod.path = output_path

# Load data
df = pd.read_csv(data_file, parse_dates=['date'], index_col=['date'])
df.index.freq = 'Q' # let pandas know that this is quarterly data

Remember that since the Great Recession, the Federal Funds Rate has at times been below the lower bound assumed in our model (the parameter `elb_level`). That is why, as in [*getting started*](https://pydsge.readthedocs.io/en/latest/getting_started.html), we adjust the observed interest rate so that the data is "within reach" of our model.

In [5]:
# adjust elb
zlb = mod.get_par('elb_level')
rate = df['FFR']
df['FFR'] = np.maximum(rate,zlb)

mod.load_data(df, start='1998Q1')

date,GDP,Infl,FFR
1998-03-31,0.77834,0.14386,1.38
1998-06-30,0.69635,0.22873,1.38
1998-09-30,1.03077,0.36109,1.38
1998-12-31,1.37921,0.26145,1.22
1999-03-31,0.54307,0.37393,1.18
...,...,...,...
2017-03-31,0.41475,0.49969,0.18
2017-06-30,0.54594,0.25245,0.24
2017-09-30,0.54391,0.51972,0.29
2017-12-31,0.48458,0.57830,0.30
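
The adjustment of the interest rate before loading the data relies on `numpy.maximum`, which takes the element-wise maximum of a series and a scalar. The following minimal sketch illustrates this with made-up numbers; the values in `rates` and the value of `bound` are assumptions for illustration and are not taken from the example data set or the model.

```python
import numpy as np

# Hypothetical quarterly interest rates (made-up values) and a hypothetical bound
rates = np.array([1.38, 0.50, 0.02, -0.10, 0.30])
bound = 0.05

# Element-wise maximum: values below the bound are replaced by the bound
clipped = np.maximum(rates, bound)
print(clipped)
```

Observations above the bound are left unchanged, so only the periods at or below the bound are affected.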


## Preparing the estimation

After importing the packages and loading the data, we still need to tell `pydsge` how to carry out the estimation of our model. The `prep_estim` method can be used to accomplish this. It can be called without any arguments and sets up a non-linear model by default. However, the defaults are not always a good choice, and to showcase some of this functionality, we specify several arguments here.

To perform the estimation, `pydsge` uses a Transposed-Ensemble Kalman Filter (TEnKF). For general information on its implementation, see the [EconSieve documentation](https://econsieve.readthedocs.io/en/latest/), and for more details on running the filter in `pydsge`, check out the [*getting started tutorial*](https://pydsge.readthedocs.io/en/latest/getting_started.html). Again, the default filter is non-linear, but we can opt for a linear one by setting the argument `linear` to `True`. To choose a custom number of ensemble members for the TEnKF, set `N` to a particular number (the default is 300; for a medium-scale model, for example, 400-500 is a good choice). We can also set a specific random seed with the argument `seed` (the default seed is `0`). To get additional information on the estimation process, we can set `verbose` to `True`. Conveniently, this information includes an overview of the parameters’ distributions, their means and standard deviations. Finally, if we already specified the covariance matrix of the measurement errors or want to reuse a previous result, we can load it into the `prep_estim` method by setting `load_R` to `True`.

If you run into problems, you can turn parallelization off by setting `debug=True`.

In [6]:
mod.prep_estim(N=350, seed=0, verbose=True)

[estimation:]   Model operational. 12 states, 3 observables, 3 shocks, 81 data points.
Adding parameters to the prior distribution...
   - theta as beta (0.5, 0.1). Init @ 0.7813, with bounds (0.2, 0.95)
   - sigma as normal (1.5, 0.375). Init @ 1.2312, with bounds (0.25, 3)
   - phi_pi as normal (1.5, 0.25). Init @ 1.7985, with bounds (1.0, 3)
   - phi_y as normal (0.125, 0.05). Init @ 0.0893, with bounds (0.001, 0.5)
   - rho_u as beta (0.5, 0.2). Init @ 0.7, with bounds (0.01, 0.9999)
   - rho_r as beta (0.5, 0.2). Init @ 0.7, with bounds (0.01, 0.9999)
   - rho_z as beta (0.5, 0.2). Init @ 0.7, with bounds (0.01, 0.9999)
   - rho as beta (0.75, 0.1). Init @ 0.8, with bounds (0.5, 0.975)
   - sig_u as inv_gamma_dynare (0.1, 2). Init @ 0.5, with bounds (0.025, 5)
   - sig_r as inv_gamma_dynare (0.1, 2). Init @ 0.5, with bounds (0.01, 3)
   - sig_z as inv_gamma_dynare (0.1, 2). Init @ 0.5, with bounds (0.01, 3)
[estimation:]   11 priors detected. Adding parameters to the prior distrib

As in the filtering tutorial, we set the covariance of the measurement errors to correspond to the variances of the data. Additionally, we reduce the measurement error of the Federal Funds Rate by a factor of ten, since it is perfectly observable.

In [7]:
mod.filter.R = mod.create_obs_cov(1e-1)
ind = mod.observables.index('FFR')
mod.filter.R[ind,ind] /= 1e1 
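
To make the structure of this matrix more concrete, here is a minimal sketch with made-up data. It assumes that the measurement error covariance is a diagonal matrix of data variances multiplied by a scale factor; the exact scaling applied inside `create_obs_cov` may differ, and the names `data`, `scale` and `ffr_before` are illustrative.

```python
import numpy as np

# Made-up data: 80 periods of three observables (columns: GDP, Infl, FFR)
rng = np.random.default_rng(0)
data = rng.normal(size=(80, 3))
observables = ["GDP", "Infl", "FFR"]

# Assumed form: diagonal matrix of data variances times a scale factor
scale = 1e-1
R = np.diag(data.var(axis=0) * scale)

# Shrink the measurement error variance of the interest rate by a factor of 10
ind = observables.index("FFR")
ffr_before = R[ind, ind]
R[ind, ind] /= 1e1
```

A smaller diagonal entry tells the filter to trust the corresponding observable more closely.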

## Running the estimation

Let's turn to the actual estimation. For a variety of reasons, `pydsge` uses *Ensemble Markov Chain Monte Carlo* (Ensemble-MCMC) integration to sample from the posterior distribution. For further information on Ensemble-MCMC, please refer to the `emcee` [website](https://emcee.readthedocs.io/en/stable/) and the additional resources provided there.

We first require an initial ensemble, which is provided by `prior_sampler`. All we are interested in right now is to obtain a sample that represents the prior distribution:

In [8]:
p0 = mod.prior_sampler(50, verbose=True) # rule of thumb: number_of_parameters times 4 

100%|██████████| 50/50 [00:01<00:00, 46.38it/s]

(prior_sample:) Sampling done. Check fails for 1.96% of the prior.





The parameter draws are saved in the object `p0` as a numpy array in order to later pass them to our main sampling process.
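
To illustrate what such an initial ensemble looks like, the following sketch draws two of the parameters from their priors with plain `numpy`. It is not the implementation used by `pydsge`: it assumes that the two numbers of the beta prior are its mean and standard deviation, it enforces the bounds by simple redrawing, and it omits any model-specific checks. The helper names `beta_shape_params` and `sample_bounded` are hypothetical.

```python
import numpy as np

def beta_shape_params(mean, std):
    """Convert mean and standard deviation of a beta distribution to (a, b)."""
    nu = mean * (1 - mean) / std**2 - 1
    return mean * nu, (1 - mean) * nu

def sample_bounded(draw, lower, upper, size, max_tries=1000):
    """Draw `size` values via `draw(n)`, redrawing values outside the bounds."""
    out = draw(size)
    for _ in range(max_tries):
        bad = (out < lower) | (out > upper)
        if not bad.any():
            return out
        out[bad] = draw(bad.sum())
    raise RuntimeError("could not satisfy the bounds")

rng = np.random.default_rng(0)
n_walkers = 50

# theta: beta prior, assumed mean 0.5 and std 0.1, bounds (0.2, 0.95)
a, b = beta_shape_params(0.5, 0.1)
theta = sample_bounded(lambda n: rng.beta(a, b, n), 0.2, 0.95, n_walkers)

# phi_pi: normal prior with mean 1.5 and std 0.25, bounds (1.0, 3)
phi_pi = sample_bounded(lambda n: rng.normal(1.5, 0.25, n), 1.0, 3.0, n_walkers)

# One row per ensemble member, one column per parameter
p0_sketch = np.column_stack([theta, phi_pi])
print(p0_sketch.shape)  # (50, 2)
```

Each row of such an array is the starting point of one walker in the ensemble.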

In [9]:
mod.save()

[save_meta:]    Metadata saved as '/tmp/output/Rank_tutorial_meta'


`mod.save()` saved the metadata of our model in the directory which we specified earlier in `mod.path`. This information is stored as an `.npz` file so that it is available even in the event of a crash and can be loaded anytime using `numpy.load()`.
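
As a minimal sketch of how `.npz` files work in general, the following round trip writes a small archive and reads it back. The file name `example_meta.npz` and its contents are made up for illustration and do not reflect the actual structure of the metadata file.

```python
import tempfile
from pathlib import Path

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name and contents, only for illustration
    meta_file = Path(tmp) / "example_meta.npz"
    np.savez(meta_file, name="example", seed=0, init=np.array([0.7, 0.8]))

    # np.load returns a dict-like object; entries are read lazily on access
    with np.load(meta_file) as meta:
        keys = sorted(meta.files)
        name = str(meta["name"])
        init = meta["init"].copy()

print(keys)  # ['init', 'name', 'seed']
```

If an archive contains pickled Python objects, `numpy.load` additionally needs `allow_pickle=True`; whether this applies to the file written by `mod.save()` is not verified here.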

For posterior sampling using `mcmc` we have the option to set different "moves", i.e. coordinate updating algorithms for the walkers. As a wrapper for a lot of `emcee` functionality, `mcmc` can work with many different "moves"; for a list and implementation details, please consult the `emcee` documentation. To use them here, specify them as a list of tuples, each containing the type of move and its "weight". If no move is specified, `StretchMove` is used.

In [10]:
moves = [(emcee.moves.DEMove(), 0.8), 
         (emcee.moves.DESnookerMove(), 0.2),]
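
The weights can be read as selection probabilities: as described in the `emcee` documentation, one move from the list is picked at random for each proposal. The following sketch mimics this selection with the standard library only. It uses plain strings instead of move objects and is an illustration of the idea, not the code that `emcee` runs internally.

```python
import random
from collections import Counter

# Move names as plain strings; weights as in the list of tuples above
weighted_moves = [("DEMove", 0.8), ("DESnookerMove", 0.2)]
names = [name for name, _ in weighted_moves]
weights = [weight for _, weight in weighted_moves]

rng = random.Random(0)
n_steps = 10_000

# Pick one move per step, with probability proportional to its weight
picks = rng.choices(names, weights=weights, k=n_steps)
shares = {name: count / n_steps for name, count in Counter(picks).items()}
print(shares)
```

Over many steps, roughly 80% of the proposals come from the first move and 20% from the second.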

We now use the initial states derived above to conduct our full Bayesian estimation using `mcmc`. Note that, instead of using the specified initial ensemble, `mcmc` can also identify previous runs or estimations, or use the initial values from the "prior" section of the `*.yaml` file.

The default number of sampling steps is 3000, and the sampling is parallelized by default. With `tune` we can determine the size of the Markov chain we wish to retain to represent the posterior, i.e. after burn-in. This is not to be confused with the updating frequency (`update_freq`), which only affects the number of summary statements `pydsge` reports during the estimation.

With the option `lprob_seed` the user can choose how to set the random seed of the likelihood evaluation - here we use the seed specified in `prep_estim`. 

In [None]:
mod.mcmc(p0,
         moves=moves,
         nsteps=3000,
         tune=500,
         update_freq=500,
         ) # this may take some time. Better run on a machine with MANY cores...
mod.save() # be sure to save the internal state!
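
To clarify the role of `tune`, the following sketch shows what retaining the last part of a chain means. It assumes that the chain is available as an array of shape `(nsteps, nwalkers, ndim)`, which is the layout `emcee` uses; the chain itself is filled with random numbers and the sizes are reduced for illustration.

```python
import numpy as np

# Made-up chain with shape (nsteps, nwalkers, ndim); small sizes for illustration
nsteps, nwalkers, ndim = 300, 50, 11
rng = np.random.default_rng(0)
chain = rng.normal(size=(nsteps, nwalkers, ndim))

# Keep only the last `tune` steps and discard everything before as burn-in
tune = 50
kept = chain[-tune:]

# Flatten steps and walkers into one sample of parameter vectors
posterior_sample = kept.reshape(-1, ndim)
print(posterior_sample.shape)  # (2500, 11)
```

With the settings of the estimation call above, the retained sample would consist of the last 500 steps of all walkers.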

Great. So where are our estimates? Our (hopefully) converged MCMC samples are currently stored in the `Rank_tutorial_sampler.h5` file created by `mcmc`.

You can load and use this data using the methods introduced in the [*processing estimation results tutorial*](https://pydsge.readthedocs.io/en/latest/getting_started.html).

In [None]:
# Just for the tutorial: Cleaning the temporary directory
shutil.rmtree(output_path)