<a href="https://colab.research.google.com/github/agi2019/ppi-gci/blob/main/tutorials/02%20-%20model%20calibration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Model calibration</center>

Prepared by Omar A. Guerrero (oguerrero@turing.ac.uk, <a href="https://twitter.com/guerrero_oa">@guerrero_oa</a>)

In this tutorial, I calibrate the free parameters of PPI's model. First, I load the data prepared in the previous tutorials. Then, I extract the relevant information and arrange it in suitable data structures. Finally, I run the calibration function and save the resulting parameter values.

## Importing Python libraries to manipulate data

In [1]:
import pandas as pd
import numpy as np

## Importing PPI functions

In this tutorial, I will import the PPI source code directly from its repository. This means that I will place a request to GitHub, download the `policy_priority_inference.py` file, and copy it locally into the folder where these tutorials are saved. Then, I will import PPI. This approach is useful if you want to run this tutorial in a cloud computing service.

An alternative would be to manually copy the `policy_priority_inference.py` file into the folder where this tutorial is located.

In [2]:
import requests # the Python library that helps place requests to websites
url = 'https://raw.githubusercontent.com/agi2019/ppi-gci/main/source_code/policy_priority_inference.py'
r = requests.get(url)
r.raise_for_status() # stop here if the download failed, instead of saving an error page
with open('policy_priority_inference.py', 'w') as f:
    f.write(r.text)
import policy_priority_inference as ppi

## Load data

### Indicators

In [3]:
df_indis = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_indicators.csv')

N = len(df_indis) # number of indicators
I0 = df_indis.I0.values # initial values
IF = df_indis.IF.values # final values
success_rates = df_indis.successRates.values # success rates
R = df_indis.instrumental.values # instrumental indicators
qm = df_indis.qm.values # quality of monitoring
rl = df_indis.rl.values # quality of the rule of law
indis_index = {code: i for i, code in enumerate(df_indis.seriesCode)} # used to build the network matrix

### Interdependency network

In [4]:
df_net = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_network.csv')

A = np.zeros((N, N)) # adjacency matrix
for index, row in df_net.iterrows():
    i = indis_index[row.origin]
    j = indis_index[row.destination]
    w = row.weight
    A[i,j] = w

### Budget

In [5]:
df_exp = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_expenditure.csv')

Bs = df_exp.values[:,1::] # disbursement schedule (assumes that the expenditure programmes are properly sorted)
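As an illustration of the slicing above, the following sketch uses a made-up expenditure table. I assume here that the first column holds a programme identifier and that each remaining column is one period; the column names and amounts are hypothetical.

```python
import pandas as pd

# Hypothetical expenditure table: one row per programme, one column per period
toy_exp = pd.DataFrame({
    'programme': [0, 1],  # identifier column (assumed to come first)
    '0': [10.0, 20.0],
    '1': [12.0, 18.0],
    '2': [15.0, 25.0],
})

toy_Bs = toy_exp.values[:, 1:]  # drop the identifier column, keep the amounts
toy_T = toy_Bs.shape[1]         # number of periods covered by the schedule

print(toy_Bs.shape, toy_T)
```

Because the identifier column is discarded, the row order is the only link between each row and its programme, which is why the table needs to be sorted properly beforehand.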

### Budget-indicator mapping

In [6]:
df_rela = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_relational_table.csv')

B_dict = {} # PPI needs the relational table in the form of a Python dictionary
for index, row in df_rela.iterrows():
    B_dict[indis_index[row.seriesCode]] = [programme for programme in row.values[1::][row.values[1::].astype(str)!='nan']]
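The shape of the resulting dictionary is easier to see on a small example. The sketch below uses a made-up relational table in which each indicator is linked to one or two programmes, with missing values padding the shorter rows. The column names other than `seriesCode`, the indicator codes, and the programme identifiers are hypothetical. For readability, it filters missing values with `pd.notna` rather than the string comparison used in the cell above; this is an alternative I am assuming to be equivalent for this kind of table, not the tutorial's own code.

```python
import numpy as np
import pandas as pd

# Hypothetical relational table: indicator code followed by its programmes
toy_rela = pd.DataFrame({
    'seriesCode': ['ind_a', 'ind_b'],
    'programme_1': [0, 1],
    'programme_2': [1, np.nan],  # ind_b is linked to a single programme
})

# Hypothetical code-to-position mapping, built like indis_index
toy_index = {'ind_a': 0, 'ind_b': 1}

toy_B_dict = {}
for _, row in toy_rela.iterrows():
    programmes = row.values[1:]  # everything after the indicator code
    # keep only the cells that actually contain a programme
    toy_B_dict[toy_index[row.seriesCode]] = [p for p in programmes if pd.notna(p)]

print(toy_B_dict)
```

Each key is the position of an indicator and each value is the list of programmes that fund it.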

## Calibrate

Now I run the calibration function to show that it works. Before that, let me explain a few new inputs that the user needs to provide:

* <strong>threshold</strong>: How well the model should fit the data.
* <strong>parallel_processes</strong>: The number of processes (workers) to be run in parallel.
* <strong>verbose</strong>: Whether to print the outputs as the calibration progresses.
* <strong>low_precision_counts</strong>: The number of iterations that use few Monte Carlo simulations.

The <strong>threshold</strong> parameter indicates the required goodness of fit. More specifically, it sets how good the fit of the worst-fitted indicator should be. The best possible fit is close to 1, but cannot be exactly 1 due to the stochasticity of the model. The higher the threshold, the more Monte Carlo simulations are needed and, thus, the more time and computational resources are required to complete the calibration.

Parameter <strong>parallel_processes</strong> is used to enhance efficiency. Since the Monte Carlo simulations are independent of one another, this workload can be distributed across multiple cores or processors. Today, most personal devices can handle this distributed load, so here I show how to calibrate the model using 4 parallel processes. It is recommended that you know how many cores or processors your equipment has, and that <strong>parallel_processes</strong> does not exceed that number. Otherwise, the performance of the calibration may be sub-optimal.
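One way to follow this recommendation is to ask Python how many CPUs it can see and cap the requested number of workers accordingly. This is a minimal sketch using the standard library; the variable names `requested` and `available_cores` are mine. Note that `os.cpu_count()` reports logical CPUs, which may be more than the cores actually available to your session (for example, in some cloud environments).

```python
import os

requested = 4                          # the number of workers I would like to use
available_cores = os.cpu_count() or 1  # os.cpu_count() can return None

# never ask for more workers than the machine reports
parallel_processes = min(requested, available_cores)

print(parallel_processes)
```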

Finally, the <strong>low_precision_counts</strong> parameter helps accelerate the calibration. At the beginning of the calibration, the algorithm proposes a random set of parameters for the model. Because this proposal is unrelated to the true parameters, the errors tend to be large. In the presence of large errors, one can improve the goodness of fit without needing too much precision in each evaluation (i.e., without running too many Monte Carlo simulations). Hence, this parameter determines how many low-precision iterations of the algorithm should be run before proceeding to the high-precision ones. This accelerates the calibration procedure substantially.
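The link between the number of Monte Carlo simulations and the precision of an evaluation can be illustrated with a toy stochastic model. The sketch below is not PPI and does not reproduce its calibration algorithm; the function `noisy_evaluation` and its normal distribution are assumptions chosen only to show that averaging few simulations gives a noisier estimate than averaging many.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_evaluation(n_simulations):
    """Average n_simulations draws from a toy stochastic model with true mean 1.0."""
    return rng.normal(loc=1.0, scale=0.5, size=n_simulations).mean()

# Repeat each kind of evaluation many times to see how much the estimate fluctuates
spread_low = np.std([noisy_evaluation(10) for _ in range(200)])     # few simulations
spread_high = np.std([noisy_evaluation(1000) for _ in range(200)])  # many simulations

print(spread_low, spread_high)
```

When the error to be corrected is much larger than the noise of a cheap evaluation, the cheap evaluation is informative enough, which is the intuition behind running low-precision iterations first.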

In [7]:
T = Bs.shape[1]
parallel_processes = 4 # number of cores to use
threshold = 0.95 # the quality of the calibration (I choose a medium quality for illustration purposes)
low_precision_counts = 100 # number of low-precision iterations to accelerate the calibration

parameters = ppi.calibrate(I0, IF, success_rates, A=A, R=R, qm=qm, rl=rl, Bs=Bs, B_dict=B_dict,
                           T=T, threshold=threshold, parallel_processes=parallel_processes, verbose=True,
                           low_precision_counts=low_precision_counts)

Iteration: 1 .    Worst goodness of fit: -37390038151.7565
Iteration: 2 .    Worst goodness of fit: -8284276671.098168
Iteration: 3 .    Worst goodness of fit: -1859978075.1744826
Iteration: 4 .    Worst goodness of fit: -439350289.44561684
Iteration: 5 .    Worst goodness of fit: -110165943.5455683
Iteration: 6 .    Worst goodness of fit: -30000456.860864367
Iteration: 7 .    Worst goodness of fit: -6533126.090753049
Iteration: 8 .    Worst goodness of fit: -2091775.6007586692
Iteration: 9 .    Worst goodness of fit: -466719.66069446784
Iteration: 10 .    Worst goodness of fit: -97956.60242160223
Iteration: 11 .    Worst goodness of fit: -33097.897945920806
Iteration: 12 .    Worst goodness of fit: -7678.026960369543
Iteration: 13 .    Worst goodness of fit: -1632.9698714579863
Iteration: 14 .    Worst goodness of fit: -2491.9311349789714
Iteration: 15 .    Worst goodness of fit: -917.1917607352106
Iteration: 16 .    Worst goodness of fit: -264.09047590608924
Iteration: 17 .    Worst 

## Calibration outputs

The output of the calibration function is a matrix with the following columns:

* <strong>alpha</strong>: the parameters related to structural constraints
* <strong>alpha_prime</strong>: the parameters related to structural costs
* <strong>beta</strong>: the parameters related to the probability of success
* <strong>T</strong>: the number of simulation periods
* <strong>error_alpha</strong>: the errors associated with the parameters $\alpha$ and $\alpha'$
* <strong>error_beta</strong>: the errors associated with the parameters $\beta$
* <strong>GoF_alpha</strong>: the goodness-of-fit associated with the parameters $\alpha$ and $\alpha'$
* <strong>GoF_beta</strong>: the goodness-of-fit associated with the parameters $\beta$

The top row of this matrix contains the column names, so I just need to transform these data into a DataFrame to export them.

In [8]:
df_params = pd.DataFrame(parameters[1::], columns=parameters[0])
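The same header-first pattern can be seen on a miniature example. The values below are made up and only three of the columns are included; `toy_parameters` and `toy_df` are hypothetical names.

```python
import pandas as pd

# Hypothetical miniature of the calibration output: header row followed by data rows
toy_parameters = [
    ['alpha', 'beta', 'GoF_alpha'],
    [0.01, 0.5, 0.97],
    [0.02, 0.7, 0.99],
]

# the first row supplies the column names, the remaining rows supply the data
toy_df = pd.DataFrame(toy_parameters[1:], columns=toy_parameters[0])

print(toy_df)
```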

In [9]:
df_params

| | alpha | alpha_prime | beta | T | error_alpha | error_beta | GoF_alpha | GoF_beta |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.005969547865276116 | 2.626768178800425e-11 | 6.803724295112542e-09 | 60 | 0.0027700081118997577 | 0.005474645069958539 | 0.9768788575127194 | 0.9890507098600829 |
| 1 | 0.010584773595690784 | 1.9487672534525416e-08 | 1.1810486359313003e-08 | | 0.0026300869254948234 | 0.01077955634373573 | 0.9884307363701748 | 0.9856272582083524 |
| 2 | 0.0098006687164233 | 8.233486099790456e-09 | 1.0118938054457443e-08 | | -0.012460324463863026 | -0.014861771276703184 | 0.9607973706150807 | 0.9843560302350493 |
| 3 | 0.0037972021092014434 | 1.3985638193965271e-12 | 3.4444776654424026e-09 | | -0.000711764509784274 | -0.00631526935191995 | 0.9930959686874188 | 0.9873694612961601 |
| 4 | 4.491646530301139e-09 | 0.005300529558562776 | 6.3107932187003555e-09 | | -5.8978990997826664e-05 | 0.002490909726412416 | 0.9996884825290965 | 0.9950181805471752 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58 | 0.0017951053930286463 | 3.264036598822547e-07 | 1.8468246237960906e-08 | | -0.0004882462021453815 | 0.005396631090272885 | 0.986244396957863 | 0.9928044918796362 |
| 59 | 0.014938125457666198 | 0.0036660530797738037 | 2.958240738284741 | | -0.002957290273885249 | -0.014394743951629874 | 0.995137620220935 | 0.9848476379456528 |
| 60 | 0.015373038338731984 | 3.331043612086757e-06 | 1.1772487952686034e-08 | | 0.006282573958334314 | 0.011942770006254544 | 0.984088151765224 | 0.9840763066583272 |
| 61 | 0.007387050685879244 | 1.980442398168218e-06 | 5.890697021603439e-09 | | -0.004890832935355416 | -0.0010133572744347452 | 0.9810301200502968 | 0.9986488569674203 |


## Save parameters data

In [10]:
df_params.to_csv('parameters.csv', index=False)
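After saving, it can be reassuring to reload the file and confirm that the worst goodness of fit meets the chosen threshold. The sketch below does this on a made-up parameter table written to a temporary folder, so it does not touch the real `parameters.csv`; the numbers and the names `toy_params`, `reloaded`, and `worst_fit` are hypothetical.

```python
import os
import tempfile
import pandas as pd

# Hypothetical parameter table with a subset of the calibration columns
toy_params = pd.DataFrame({
    'alpha': [0.006, 0.011],
    'T': [60, None],  # T is only reported in the first row
    'GoF_alpha': [0.977, 0.988],
    'GoF_beta': [0.989, 0.986],
})
threshold = 0.95

# Write the table and read it back, as a later tutorial would
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'parameters.csv')
    toy_params.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# The worst-fitted indicator across both goodness-of-fit columns
worst_fit = reloaded[['GoF_alpha', 'GoF_beta']].min().min()

print(worst_fit >= threshold)
```

Writing with `index=False` keeps the row index out of the file, so reloading it does not create an extra unnamed column.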