<a href="https://colab.research.google.com/github/agi2019/ppi-gci/blob/main/tutorials/02%20-%20model%20calibration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Model calibration</center>

Prepared by Omar A. Guerrero (oguerrero@turing.ac.uk, <a href="https://twitter.com/guerrero_oa">@guerrero_oa</a>)

In this tutorial I calibrate the free parameters of PPI's model. First, I load the data prepared in the previous tutorials. Then, I extract the relevant information and store it in suitable data structures. Finally, I run the calibration function and save the resulting parameter values.

## Importing Python libraries to manipulate data

In [2]:
import pandas as pd
import numpy as np

## Importing PPI functions

In this tutorial, I import the PPI source code directly from its repository. This means that I send a request to GitHub, download the `policy_priority_inference.py` file, and save it locally in the folder where these tutorials are stored. Then, I import PPI. This approach is useful if you want to run this tutorial in a cloud computing service.

An alternative would be to manually copy the `policy_priority_inference.py` file into the folder where this tutorial is located.

In [3]:
import requests # the Python library used to send requests to websites
url = 'https://raw.githubusercontent.com/agi2019/ppi-gci/main/source_code/policy_priority_inference.py'
r = requests.get(url)
r.raise_for_status() # stop here if the download failed, instead of saving an error page
with open('policy_priority_inference.py', 'w') as f:
    f.write(r.text)
import policy_priority_inference as ppi

## Load data

### Indicators

In [5]:
df_indis = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_indicators.csv')

N = len(df_indis) # number of indicators
I0 = df_indis.I0.values # initial values
IF = df_indis.IF.values # final values
success_rates = df_indis.successRates.values # success rates
R = df_indis.instrumental.values # instrumental indicators
qm = df_indis.qm.values # quality of monitoring
rl = df_indis.rl.values # quality of the rule of law
indis_index = dict([(code, i) for i, code in enumerate(df_indis.seriesCode)]) # used to build the network matrix

### Interdependency network

In [6]:
df_net = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_network.csv')

A = np.zeros((N, N)) # adjacency matrix
for index, row in df_net.iterrows():
    i = indis_index[row.origin]
    j = indis_index[row.destination]
    w = row.weight
    A[i,j] = w
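
As a small illustration of the loop above, the following sketch builds the adjacency matrix for a toy network. The indicator codes, links, and weights are hypothetical and only serve to show the convention used here: rows correspond to the origin of a link and columns to its destination.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three indicators and two directed links
toy_codes = ['ind_a', 'ind_b', 'ind_c']
toy_index = {code: i for i, code in enumerate(toy_codes)}
toy_net = pd.DataFrame({
    'origin': ['ind_a', 'ind_c'],
    'destination': ['ind_b', 'ind_a'],
    'weight': [0.5, -0.2],
})

toy_A = np.zeros((len(toy_codes), len(toy_codes)))
for row in toy_net.itertuples(index=False):
    i = toy_index[row.origin]       # row of the matrix: origin of the link
    j = toy_index[row.destination]  # column of the matrix: destination of the link
    toy_A[i, j] = row.weight

print(toy_A)
```

Pairs of indicators that do not appear in the edge list keep a weight of zero.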

### Budget

In [7]:
df_exp = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_expenditure.csv')

Bs = df_exp.values[:,1::] # disbursement schedule (assumes that the expenditure programmes are properly sorted)
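
To make the slicing above concrete, here is a minimal sketch with a toy expenditure table. It assumes that the first column identifies the expenditure programme and that the remaining columns are periods; the column names and amounts are hypothetical.

```python
import pandas as pd

# Hypothetical toy table: one row per expenditure programme, one column per period
toy_exp = pd.DataFrame({
    'programme': [0, 1],  # assumed identifier column, dropped by the slice
    '0': [10.0, 5.0],
    '1': [12.0, 6.0],
    '2': [11.0, 7.0],
})

toy_Bs = toy_exp.values[:, 1:]  # keep every row, skip the first column
toy_T = toy_Bs.shape[1]         # number of periods in the schedule

print(toy_Bs.shape, toy_T)
```

Because the identifier column is discarded, the rows of the resulting matrix are matched to programmes by position only, which is why the table needs to be sorted beforehand.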

### Budget-indicator mapping

In [8]:
df_rela = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_relational_table.csv')

B_dict = {} # PPI needs the relational table in the form of a Python dictionary
for index, row in df_rela.iterrows():
    B_dict[indis_index[row.seriesCode]] = [programme for programme in row.values[1::][row.values[1::].astype(str)!='nan']]
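
The sketch below shows, on a toy relational table, the kind of dictionary that the loop above produces. The column names and programme identifiers are hypothetical, and it uses `dropna()` as an alternative way of discarding the empty cells.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two indicators linked to up to two programmes each
toy_index = {'ind_a': 0, 'ind_b': 1}
toy_rela = pd.DataFrame({
    'seriesCode': ['ind_a', 'ind_b'],
    'prog_1': [0.0, 1.0],
    'prog_2': [1.0, np.nan],  # ind_b is linked to a single programme
})

toy_B_dict = {}
for _, row in toy_rela.iterrows():
    programmes = row.iloc[1:]  # everything after the seriesCode column
    toy_B_dict[toy_index[row.seriesCode]] = programmes.dropna().tolist()

print(toy_B_dict)
```

Each key is the position of an indicator, and each value is the list of programmes that fund it.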

## Calibrate

Now I run the calibration function to show that it works. Before that, let me explain a few new inputs that the user needs to provide:

* <strong>threshold</strong>: How well the model should fit the data.
* <strong>parallel_processes</strong>: The number of processes (workers) to be run in parallel.
* <strong>verbose</strong>: Whether or not to print the outputs as the calibration progresses.
* <strong>low_precision_counts</strong>: The number of iterations that use few Monte Carlo simulations.

The <strong>threshold</strong> parameter sets the minimum acceptable goodness of fit; more specifically, how well the worst-fitted indicator should be fitted. The best possible fit is close to 1, but it cannot be exactly 1 because the model is stochastic. The higher the threshold, the more Monte Carlo simulations are needed and, thus, the more time and computational resources it takes to complete the calibration.

The <strong>parallel_processes</strong> parameter improves efficiency. Because the Monte Carlo simulations are independent of one another, the workload can be distributed across multiple cores or processors. Most personal devices today can handle this distributed load, so here I show how to calibrate the model using 4 parallel processes. You should find out how many cores or processors your machine has and make sure that <strong>parallel_processes</strong> does not exceed that number; otherwise, the calibration may run more slowly than it could.
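
One way to follow this recommendation is to ask Python how many logical CPUs it can see and cap the number of workers accordingly. This is only a sketch using the standard library; the cap of 4 mirrors the value used in this tutorial.

```python
import os

available = os.cpu_count() or 1         # os.cpu_count() may return None
parallel_processes = min(4, available)  # never request more workers than CPUs

print(parallel_processes)
```

Note that `os.cpu_count()` reports logical CPUs, which may be more than the number of physical cores.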

Finally, the <strong>low_precision_counts</strong> parameter helps accelerate the calibration. At the beginning of the calibration, the algorithm proposes a random set of parameters for the model. Because this proposal is unrelated to the true parameters, the errors tend to be large. When errors are large, the goodness of fit can be improved without much precision in each evaluation (i.e., without running many Monte Carlo simulations). Hence, this parameter determines how many low-precision iterations of the algorithm are run before proceeding to the high-precision ones. This accelerates the calibration procedure substantially.
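
The trade-off between precision and the number of Monte Carlo simulations can be illustrated with a toy experiment. The sketch below is not PPI's algorithm: `noisy_estimate` is a hypothetical stand-in for a model evaluation that averages random draws around a true value of 1. It compares how much the estimate varies when it averages 10 draws versus 1000.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_estimate(n_sims):
    """Average n_sims noisy draws centred on 1.0 (toy stand-in for a model evaluation)."""
    return rng.normal(loc=1.0, scale=1.0, size=n_sims).mean()

# Spread of the estimate across 200 repeated evaluations
spread_low = np.std([noisy_estimate(10) for _ in range(200)])
spread_high = np.std([noisy_estimate(1000) for _ in range(200)])

print(spread_low, spread_high)
```

The spread shrinks roughly with the square root of the number of draws, so high precision is expensive. When the error being corrected is much larger than the spread of the cheap estimate, the cheap estimate is already informative enough to guide the next step.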

In [10]:
T = Bs.shape[1]
parallel_processes = 4 # number of cores to use
threshold = 0.8 # the quality of the calibration (I choose a medium quality for illustration purposes)
low_precision_counts = 50 # number of low-precision iterations to accelerate the calibration

parameters = ppi.calibrate(I0, IF, success_rates, A=A, R=R, qm=qm, rl=rl, Bs=Bs, B_dict=B_dict,
                           T=T, threshold=threshold, parallel_processes=parallel_processes, verbose=True,
                           low_precision_counts=low_precision_counts)

Iteration: 1 .    Worst goodness of fit: -16707456178.56341
Iteration: 2 .    Worst goodness of fit: -4635656710.594559
Iteration: 3 .    Worst goodness of fit: -1074155471.9355645
Iteration: 4 .    Worst goodness of fit: -239550926.64713275
Iteration: 5 .    Worst goodness of fit: -58323441.11444951
Iteration: 6 .    Worst goodness of fit: -15935075.566584602
Iteration: 7 .    Worst goodness of fit: -3715396.0976495706
Iteration: 8 .    Worst goodness of fit: -1029124.8748600112
Iteration: 9 .    Worst goodness of fit: -224632.5431805897
Iteration: 10 .    Worst goodness of fit: -55303.96352662644
Iteration: 11 .    Worst goodness of fit: -15339.488819730363
Iteration: 12 .    Worst goodness of fit: -4183.550392874412
Iteration: 13 .    Worst goodness of fit: -827.0528534832639
Iteration: 14 .    Worst goodness of fit: -349.9369398608804
Iteration: 15 .    Worst goodness of fit: -963.2174152676753
Iteration: 16 .    Worst goodness of fit: -625.3913198689885
Iteration: 17 .    Worst go

## Calibration outputs

The output of the calibration function is a matrix with the following columns:

* <strong>alpha</strong>: the parameters related to structural constraints
* <strong>alpha_prime</strong>: the parameters related to structural costs
* <strong>beta</strong>: the parameters related to the probability of success
* <strong>T</strong>: the number of simulation periods
* <strong>error_alpha</strong>: the errors associated with the parameters $\alpha$ and $\alpha'$
* <strong>error_beta</strong>: the errors associated with the parameters $\beta$
* <strong>GoF_alpha</strong>: the goodness of fit associated with the parameters $\alpha$ and $\alpha'$
* <strong>GoF_beta</strong>: the goodness of fit associated with the parameters $\beta$

The top row of this matrix contains the column names, so I just need to transform these data into a DataFrame to export them.

In [11]:
df_params = pd.DataFrame(parameters[1::], columns=parameters[0])
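
The conversion above splits the output into its header row and its data rows. A minimal sketch with a hypothetical two-column output shows the pattern:

```python
import pandas as pd

# Hypothetical toy output: first row holds the column names, the rest hold values
toy_output = [
    ['alpha', 'beta'],
    [0.1, 0.01],
    [0.2, 0.02],
]

toy_df = pd.DataFrame(toy_output[1:], columns=toy_output[0])

print(toy_df)
```

Each remaining row of the resulting DataFrame corresponds to one indicator.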

In [12]:
df_params

| | alpha | alpha_prime | beta | T | error_alpha | error_beta | GoF_alpha | GoF_beta |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.026561445723475314 | 2.2771520996030424e-07 | 5.261650319253358e-09 | 60 | 0.00388691232721039 | 0.007308758209577493 | 0.9934018752031485 | 0.985382483580845 |
| 1 | 0.04661921488772814 | 0.00014196119688736803 | 1.5616570118740753e-08 | | 0.02019465176294799 | 0.02275316282084161 | 0.9793768145126394 | 0.9696624495722111 |
| 2 | 0.03102974153121878 | 6.022682024496768e-05 | 1.1413579876380086e-08 | | 0.0025678636219798046 | -0.020615530866777254 | 0.9974321363780202 | 0.978299441192866 |
| 3 | 0.016927393722950275 | 1.1624038919631803e-07 | 3.967554391433942e-09 | | -0.0028226210273336605 | -0.008066842811805142 | 0.9938340965340037 | 0.9838663143763897 |
| 4 | 4.1328047610827715e-06 | 0.019793483642555256 | 7.38276303136188e-09 | | -0.012120829225240382 | 0.0017695725557291264 | 0.9830641669797835 | 0.9964608548885417 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58 | 0.038622380018015004 | 0.00012076624224617928 | 2.416244058358479e-08 | | -0.006999784528333608 | 0.001823757423919159 | 0.9905426408490138 | 0.9975683234347744 |
| 59 | 0.024083393229467634 | 0.00988399870083348 | 2.7592015056363315 | | -0.005984343974595907 | 0.0038931557978522102 | 0.9940156560254041 | 0.9959019412654188 |
| 60 | 0.03361818830441011 | 0.00046521675859511897 | 9.453415123012292e-09 | | -0.001403074923983283 | 0.006614703826928525 | 0.9985525068529636 | 0.9911803948974286 |
| 61 | 0.01988909345109015 | 1.4028313235455713e-05 | 6.019441941804601e-09 | | -0.010012483171938147 | -0.008540307612066922 | 0.9855676677267502 | 0.9886129231839108 |
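
A quick sanity check on a table like this is to look for the worst-fitted indicator and compare it with the threshold. The sketch below uses a toy DataFrame with hypothetical goodness-of-fit values, and it assumes that the worst fit is the minimum over both the `GoF_alpha` and `GoF_beta` columns.

```python
import pandas as pd

threshold = 0.8

# Hypothetical goodness-of-fit values for three indicators
toy_params = pd.DataFrame({
    'GoF_alpha': [0.99, 0.85, 0.97],
    'GoF_beta': [0.98, 0.96, 0.79],
})

# Worst of the two goodness-of-fit values for each indicator (assumption)
worst_per_indicator = toy_params[['GoF_alpha', 'GoF_beta']].min(axis=1)
worst_fit = worst_per_indicator.min()
below = worst_per_indicator[worst_per_indicator < threshold].index.tolist()

print(worst_fit, below)
```

An empty `below` list would mean that every indicator meets the chosen threshold.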


## Save parameters data

In [13]:
df_params.to_csv('parameters.csv', index=False)