<a href="https://colab.research.google.com/github/agi2019/ppi-gci/blob/main/tutorials/02%20-%20model%20calibration.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Model calibration</center>

Prepared by Omar A. Guerrero (oguerrero@turing.ac.uk, <a href="https://twitter.com/guerrero_oa">@guerrero_oa</a>)

In this tutorial I calibrate the free parameters of PPI's model. First, I load the data prepared in the previous tutorials. Then, I extract the relevant information and store it in suitable data structures. Finally, I run the calibration function and save the resulting parameter values.

## Importing Python libraries to manipulate data

In [1]:
import pandas as pd
import numpy as np

Select Scenario

In [2]:
#scenario = '_scenario1'
scenario = '_scenario2'
#scenario = '_scenario3'

## Importing PPI functions

In this tutorial, I will import the PPI source code directly from its repository. This means that I will place a request to GitHub, download the `policy_priority_inference.py` file, and copy it locally into the folder where these tutorials are saved. Then, I will import PPI. This approach is useful if you want to run this tutorial in a cloud computing service.

An alternative would be to manually copy the `policy_priority_inference.py` file into the folder where this tutorial is located.

In [3]:
import requests # the Python library that helps place requests to websites
url = 'https://raw.githubusercontent.com/agi2019/ppi-gci/main/source_code/policy_priority_inference.py'
r = requests.get(url)
r.raise_for_status() # stop here if the download failed, instead of saving an error page
with open('policy_priority_inference.py', 'w') as f:
    f.write(r.text)
import policy_priority_inference as ppi

## Load data

### Indicators

In [4]:
df_indis = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data'+scenario+'/data_indicators.csv')

N = len(df_indis) # number of indicators
I0 = df_indis.I0.values # initial values
IF = df_indis.IF.values # final values
success_rates = df_indis.successRates.values # success rates
R = df_indis.instrumental.values # instrumental indicators
qm = df_indis.qm.values # quality of monitoring
rl = df_indis.rl.values # quality of the rule of law
indis_index = {code: i for i, code in enumerate(df_indis.seriesCode)} # maps each series code to its row position; used to build the network matrix

### Interdependency network

In [5]:
df_net = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data'+scenario+'/data_network.csv')

A = np.zeros((N, N)) # adjacency matrix
for index, row in df_net.iterrows():
    i = indis_index[row.origin]
    j = indis_index[row.destination]
    w = row.weight
    A[i,j] = w
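
To make the loop above concrete, here is a minimal sketch on made-up data. The indicator codes, edges, and weights are hypothetical and only illustrate how each row of the edge list lands in one cell of the adjacency matrix (row = origin, column = destination).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three indicators and two weighted edges
toy_codes = ['ind_a', 'ind_b', 'ind_c']
toy_index = {code: i for i, code in enumerate(toy_codes)}
toy_net = pd.DataFrame({'origin': ['ind_a', 'ind_c'],
                        'destination': ['ind_b', 'ind_a'],
                        'weight': [0.5, -0.2]})

toy_A = np.zeros((3, 3)) # start with no links between indicators
for _, row in toy_net.iterrows():
    i = toy_index[row.origin]
    j = toy_index[row.destination]
    toy_A[i, j] = row.weight # edge from indicator i to indicator j

print(toy_A)
```

Every pair of indicators that does not appear in the edge list keeps a weight of zero.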

### Budget

In [6]:
df_exp = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data'+scenario+'/data_expenditure.csv')

Bs = df_exp.values[:,1::] # disbursement schedule (assumes that the expenditure programmes are properly sorted)

### Budget-indicator mapping

In [7]:
df_rela = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data'+scenario+'/data_relational_table.csv')

B_dict = {} # PPI needs the relational table in the form of a Python dictionary
for index, row in df_rela.iterrows():
    programmes = row.values[1:] # the columns after seriesCode hold the linked programmes
    B_dict[indis_index[row.seriesCode]] = [programme for programme in programmes if pd.notna(programme)]
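
As an illustration of the structure this produces, the sketch below builds the dictionary from a made-up relational table. The column names and programme labels are hypothetical, and `pd.notna` is used here to skip the empty cells of indicators linked to fewer programmes.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two indicators, linked to two and one programmes respectively
toy_index = {'ind_a': 0, 'ind_b': 1}
toy_rela = pd.DataFrame({'seriesCode': ['ind_a', 'ind_b'],
                         'programme_1': ['prog_x', 'prog_y'],
                         'programme_2': ['prog_y', np.nan]})

toy_B_dict = {}
for _, row in toy_rela.iterrows():
    programmes = row.values[1:] # everything after the seriesCode column
    toy_B_dict[toy_index[row.seriesCode]] = [p for p in programmes if pd.notna(p)]

print(toy_B_dict) # {0: ['prog_x', 'prog_y'], 1: ['prog_y']}
```

Each key is an indicator's position in the data, and each value is the list of expenditure programmes linked to that indicator.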

## Calibrate

Now I run the calibration function to show that it works. Before that, let me explain a couple of new inputs that the user needs to provide:

* <strong>threshold</strong>: How well the model should fit the data.
* <strong>parallel_processes</strong>: The number of processes (workers) to be run in parallel.
* <strong>verbose</strong>: Whether to print the outputs as the calibration progresses.
* <strong>low_precision_counts</strong>: The number of iterations that use few Monte Carlo simulations.

The <strong>threshold</strong> parameter sets the required goodness of fit. More specifically, it sets how well the worst-fitted indicator must be fitted. The best possible fit is close to 1, but cannot be exactly 1 due to the stochasticity of the model. The higher the threshold, the more Monte Carlo simulations are needed and, thus, the more time and computational resources are needed to complete the calibration.

The <strong>parallel_processes</strong> parameter is used to enhance efficiency. Since the Monte Carlo simulations are independent of one another, this workload can be distributed across multiple cores or processors. Today, most personal devices can handle this distributed load, so here I show how to calibrate the model using 4 parallel processes. It is recommended that you know how many cores or processors your equipment has, and that <strong>parallel_processes</strong> does not exceed that number. Otherwise, the performance of the calibration may be sub-optimal.
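
If you are unsure how many cores your machine has, a minimal sketch like the following can cap the number of workers. It uses only Python's standard library. The cap of 4 mirrors the value used in this tutorial and is otherwise an arbitrary choice.

```python
import os

available_cores = os.cpu_count() or 1 # os.cpu_count() can return None on some systems
parallel_processes = min(4, available_cores) # never request more workers than cores
print(parallel_processes)
```

Note that `os.cpu_count()` reports logical cores, which on a shared cloud machine may be more than what is actually available to your session.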

Finally, the <strong>low_precision_counts</strong> parameter helps accelerate the calibration. At the beginning of the calibration, the algorithm proposes a random set of parameters for the model. Because this proposal is unrelated to the true parameters, the errors tend to be large. In the presence of large errors, one can improve the goodness of fit without needing too much precision in each evaluation (i.e., without running too many Monte Carlo simulations). Hence, this parameter determines how many low-precision iterations of the algorithm should be run before proceeding to the high-precision ones. This accelerates the calibration procedure substantially.
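
The trade-off between precision and the number of Monte Carlo simulations can be illustrated with a generic sketch. This is not PPI's algorithm: it simply averages draws from a made-up noisy quantity and measures how much that average varies across repeated evaluations when few or many simulations are used.

```python
import numpy as np

rng = np.random.default_rng(0) # fixed seed so the sketch is reproducible

def estimate_spread(n_simulations, n_repeats=200):
    """Spread of a Monte Carlo average across repeated evaluations."""
    # each row is one evaluation that averages n_simulations noisy draws
    draws = rng.normal(loc=1.0, scale=1.0, size=(n_repeats, n_simulations))
    return draws.mean(axis=1).std()

spread_low = estimate_spread(10) # few simulations: cheap but noisy
spread_high = estimate_spread(1000) # many simulations: costly but precise
print(spread_low, spread_high)
```

The evaluation with few simulations is much noisier, which matters little while the errors are still large, but becomes a limitation once the fit is close to the threshold.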

In [8]:
T = Bs.shape[1] # number of simulation periods (columns of the disbursement schedule)
parallel_processes = 4 # number of cores to use
threshold = 0.95 # the quality of the calibration (I choose a medium quality for illustration purposes)
low_precision_counts = 100 # number of low-quality iterations to accelerate the calibration

parameters = ppi.calibrate(I0, IF, success_rates, A=A, R=R, qm=qm, rl=rl, Bs=Bs, B_dict=B_dict,
              T=T, threshold=threshold, parallel_processes=parallel_processes, verbose=True,
             low_precision_counts=low_precision_counts)

Iteration: 1 .    Worst goodness of fit: -1213077512.039545
Iteration: 2 .    Worst goodness of fit: -233288763.7942655
Iteration: 3 .    Worst goodness of fit: -71108364.15286408
Iteration: 4 .    Worst goodness of fit: -14021016.085346982
Iteration: 5 .    Worst goodness of fit: -4122811.7234277157
Iteration: 6 .    Worst goodness of fit: -849178.4710997158
Iteration: 7 .    Worst goodness of fit: -232432.86467424023
Iteration: 8 .    Worst goodness of fit: -65751.65758314297
Iteration: 9 .    Worst goodness of fit: -11065.57621438256
Iteration: 10 .    Worst goodness of fit: -3437.7697583244526
Iteration: 11 .    Worst goodness of fit: -733.3961207905568
Iteration: 12 .    Worst goodness of fit: -162.98673660241195
Iteration: 13 .    Worst goodness of fit: -35.98379797728263
Iteration: 14 .    Worst goodness of fit: -8.643357741945298
Iteration: 15 .    Worst goodness of fit: -7.64792592582968
Iteration: 16 .    Worst goodness of fit: -12.515353918328643
Iteration: 17 .    Worst goo

## Calibration outputs

The output of the calibration function is a matrix with the following columns:

* <strong>alpha</strong>: the parameters related to structural constraints
* <strong>alpha_prime</strong>: the parameters related to structural costs
* <strong>beta</strong>: the parameters related to the probability of success
* <strong>T</strong>: the number of simulation periods
* <strong>error_alpha</strong>: the errors associated with the parameters $\alpha$ and $\alpha'$
* <strong>error_beta</strong>: the errors associated with the parameters $\beta$
* <strong>GoF_alpha</strong>: the goodness-of-fit associated with the parameters $\alpha$ and $\alpha'$
* <strong>GoF_beta</strong>: the goodness-of-fit associated with the parameters $\beta$

The top row of this matrix contains the column names, so I just need to transform these data into a DataFrame to export it.

In [9]:
df_params = pd.DataFrame(parameters[1::], columns=parameters[0])

In [10]:
df_params

| | alpha | alpha_prime | beta | T | error_alpha | error_beta | GoF_alpha | GoF_beta |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.023010571583765176 | 6.416533198283739e-06 | 3.041915261061742e-09 | 60 | 0.009225903433565108 | 0.010470417541713961 | 0.9861611448536944 | 0.9869119780728576 |
| 1 | 0.022579538350893106 | 2.3378183503233593e-05 | 2.9226040238826823e-09 | | -0.0062179298123749005 | -0.010201284385034537 | 0.990673105284762 | 0.9872483945187068 |
| 2 | 0.036730432551951825 | 3.739738087619254e-05 | 5.000655217777263e-09 | | 0.0014279675290379545 | -0.0014899470770236567 | 0.997858048706443 | 0.9981375661537204 |
| 3 | 0.037770630830913 | 4.4343391079294724e-07 | 3.550626009458645e-08 | | -0.003962971807434101 | -0.0035518992200644917 | 0.9940555422906739 | 0.9955601259749194 |
| 4 | 0.03536126089047218 | 1.7503049173893082e-07 | 7.643495803650228e-08 | | -0.01756787107915392 | -0.01171754757157284 | 0.9736481933924018 | 0.9853530655355339 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 58 | 0.02205642907300448 | 1.903672009326713e-05 | 1.7998901266434721e-09 | | 0.010128113654000015 | 0.004687341647771848 | 0.9848078295263665 | 0.9941408229402852 |
| 59 | 0.01771328563661925 | 0.00943562449544351 | 2.2453228286487006 | | -0.005082560834178351 | -0.005438949503809409 | 0.9923761587512466 | 0.9932013131202382 |
| 60 | 0.03601168330115082 | 0.00015459941072064044 | 3.34729738401626e-09 | | -0.0021503248818616205 | 0.005869570606015673 | 0.9967745126772076 | 0.9926630367424805 |
| 61 | 0.01913252763617104 | 8.930577035222634e-06 | 1.3830867286362258e-09 | | -0.022377734431063057 | -0.01506593027267189 | 0.9664333983534055 | 0.9811675871591601 |
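
A quick sanity check on these outputs is to confirm that the worst-fitted indicator meets the chosen threshold. The sketch below does this on a toy DataFrame whose values are rounded from a few of the rows shown above. With the real `df_params`, the goodness-of-fit columns may first need converting with `pd.to_numeric` if they are stored as objects (an assumption about the output type, not something verified here).

```python
import pandas as pd

# Toy values, rounded from rows 0, 1 and 61 of the table above
toy_params = pd.DataFrame({'GoF_alpha': [0.986, 0.991, 0.966],
                           'GoF_beta': [0.987, 0.987, 0.981]})
threshold = 0.95

# smallest goodness of fit across both columns and all indicators
worst_fit = toy_params[['GoF_alpha', 'GoF_beta']].min().min()
print(worst_fit, worst_fit >= threshold)
```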


## Save parameters data

In [11]:
df_params.to_csv('parameters.csv', index=False)
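
To reuse the calibrated parameters in a later session, the saved file can be read back with pandas. The following is a minimal round-trip sketch on made-up values. It writes to a temporary folder so that it does not overwrite the real `parameters.csv`.

```python
import os
import tempfile
import pandas as pd

# Hypothetical parameter values, for illustration only
toy_params = pd.DataFrame({'alpha': [0.02, 0.03],
                           'alpha_prime': [1e-5, 2e-5],
                           'beta': [3e-9, 5e-9]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'parameters.csv')
    toy_params.to_csv(path, index=False) # same call as in the cell above
    reloaded = pd.read_csv(path)

alphas = reloaded.alpha.values # back to a NumPy array, one value per indicator
print(reloaded)
```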