<a href="https://colab.research.google.com/github/agi2019/ppi-gci/blob/main/tutorials/01b%20-%20data%20preparation%20(interdependency%20network).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Data preparation – interdependency networks</center>

Prepared by Omar A. Guerrero (oguerrero@turing.ac.uk, <a href="https://twitter.com/guerrero_oa">@guerrero_oa</a>). Adapted for the PPI-GCI project on cybersecurity policy prioritisation.

This page demonstrates how to construct the input network of Global Cybersecurity Index (GCI) indicators for Policy Priority Inference (PPI) simulations. For illustrative purposes, and in line with the PPI tutorial, I adopt a simple correlation-based approach to estimate pairwise relationships between cybersecurity indicators over time. Specifically, I will:

1.  Load the pre-processed GCI indicator dataset, spanning multiple years (e.g., 2014, 2017, 2018, 2020, and 2024).
2.  Compute pairwise correlations between the changes in indicators, incorporating lagged values to infer directionality and construct a directed, asymmetric network.
3.  Apply a threshold to filter out weak relationships, retaining only those edges whose correlation magnitude is at least a chosen cutoff (0.5 here).
4.  Convert the resulting matrix into an edge list suitable for use in the PPI model.

⚠️ Note: This method is applied as a temporary simplification to support the initial model setup. While it does not capture the full complexity of indicator interdependencies, it enables early-stage simulations and validation. In the next phase, this correlation-based network will be revisited and enhanced using a more appropriate network-estimation method tailored to the GCI indicators. The future goal is to incorporate a weighted structure that reflects the influence of each cybersecurity indicator on others, aligned with the systemic nature of cybersecurity policy domains and the GCI framework.

Ultimately, the network structure will support more accurate policy prioritisation by embedding realistic interdependencies between cybersecurity indicators into the PPI model.

## Import the necessary Python libraries to manipulate data

In [None]:
import pandas as pd
import numpy as np

## Load the pre-processed GCI indicators

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data/data_indicators.csv')
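
Before building the network, it can help to confirm that the table has the layout the later cells rely on: one row per indicator, a `seriesCode` column, and one column per year whose name is purely numeric. The sketch below runs these checks on a small made-up table. The helper `find_year_columns` and the toy values are illustrative assumptions, not part of the dataset or of the PPI library.

```python
import pandas as pd

def find_year_columns(df):
    """Return the columns whose names are purely numeric (e.g. '2014')."""
    return [c for c in df.columns if str(c).isnumeric()]

# Toy table with the assumed layout (ids and values are invented).
toy = pd.DataFrame({
    'seriesCode': ['legal', 'technical'],
    '2014': [0.2, 0.4],
    '2017': [0.3, 0.4],
    '2018': [0.5, 0.6],
})

year_cols = find_year_columns(toy)                # columns treated as years
has_ids = 'seriesCode' in toy.columns             # ids used for the edge list
n_missing = int(toy[year_cols].isna().sum().sum())  # missing observations
print(year_cols, has_ids, n_missing)
```

Running the same three checks on `data` should flag missing values early. This matters because `np.corrcoef` returns NaN when an input contains NaN, and such entries would not be removed by a magnitude threshold.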

## Construct a matrix with pairwise Pearson correlations

The directionality of the edges is from row to column.

In [None]:
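N = len(data)  # number of indicators (one per row)
M = np.zeros((N, N))  # adjacency matrix; the diagonal stays at 0
# year columns are those whose names are purely numeric
years = [column_name for column_name in data.columns if str(column_name).isnumeric()]

# i and j are row labels; they double as matrix positions because
# read_csv assigns a default 0..N-1 index
for i, rowi in data.iterrows():
    for j, rowj in data.iterrows():
        if i != j:
            # indicator i without its first year, indicator j without its last year
            serie1 = rowi[years].values.astype(float)[1::]
            serie2 = rowj[years].values.astype(float)[0:-1]
            # period-to-period changes of each series
            change_serie1 = serie1[1::] - serie1[0:-1]
            change_serie2 = serie2[1::] - serie2[0:-1]
            # skip pairs where either change series is constant (correlation undefined)
            if not np.all(change_serie1 == change_serie1[0]) and not np.all(change_serie2 == change_serie2[0]):
                M[i,j] = np.corrcoef(change_serie1, change_serie2)[0,1]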
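
To see why the resulting matrix is asymmetric, the body of the loop can be isolated and applied to two short series in both orders. The sketch below is a minimal restatement of that step, assuming the same one-period shift. The function name `lagged_change_corr` and the series values are invented for illustration.

```python
import numpy as np

def lagged_change_corr(x_row, x_col):
    """Correlate changes of x_row (shifted one period ahead) with changes of x_col."""
    s1 = np.asarray(x_row, dtype=float)[1:]   # drop the first observation
    s2 = np.asarray(x_col, dtype=float)[:-1]  # drop the last observation
    d1 = np.diff(s1)  # period-to-period changes
    d2 = np.diff(s2)
    # A constant change series has zero variance, so the correlation is
    # undefined; as in the loop above, the entry is left at 0.
    if np.all(d1 == d1[0]) or np.all(d2 == d2[0]):
        return 0.0
    return float(np.corrcoef(d1, d2)[0, 1])

a = [0, 1, 0, 2, 1]  # invented values
b = [5, 5, 6, 5, 7]

m_ab = lagged_change_corr(a, b)  # entry for row a, column b
m_ba = lagged_change_corr(b, a)  # entry for row b, column a
print(round(m_ab, 3), round(m_ba, 3))
```

Swapping the arguments changes which series is shifted, so the two entries generally differ. With five observations per indicator, each correlation rests on only three pairs of changes, which is one reason to treat the weights as rough.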

## Remove edges whose weight has a magnitude lower than 0.5

In [None]:
M[np.abs(M) < 0.5] = 0
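
The cutoff of 0.5 is a modelling choice, so it can be convenient to wrap the filter in a small function and try it on a toy matrix first. The sketch below assumes the same rule as the cell above (strictly lower magnitudes are zeroed) but returns a copy instead of modifying the input. The name `threshold_network` and the matrix values are hypothetical.

```python
import numpy as np

def threshold_network(matrix, cutoff=0.5):
    """Return a copy with entries of magnitude below `cutoff` set to 0."""
    out = np.array(matrix, dtype=float)  # np.array copies by default
    out[np.abs(out) < cutoff] = 0.0
    return out

toy_M = np.array([
    [0.0,  0.6, -0.3],
    [-0.7, 0.0,  0.49],
    [0.5,  0.2,  0.0],
])

filtered = threshold_network(toy_M, cutoff=0.5)
n_edges = int(np.count_nonzero(filtered))  # edges that survive the filter
print(filtered)
print(n_edges)
```

Two details are worth noticing: strong negative correlations are kept because the test uses the absolute value, and an entry of exactly 0.5 survives because the comparison is strict.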

## Save the network as a list of edges using the indicators' ids

In [None]:
ids = data.seriesCode.values
# positions of the non-zero entries: rows are origins, columns are destinations
rows, cols = np.nonzero(M)
edge_list = [[ids[i], ids[j], M[i, j]] for i, j in zip(rows, cols)]
df = pd.DataFrame(edge_list, columns=['origin', 'destination', 'weight'])
df.to_csv('data_network.csv', index=False)
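
A simple way to check the saved file is a round trip: write an edge list, read it back, and rebuild the adjacency matrix from the ids. The sketch below does this on a toy matrix with an in-memory buffer in place of `data_network.csv`. The ids and weights are invented, and how the PPI model itself consumes the file is not shown here.

```python
import io
import numpy as np
import pandas as pd

ids = ['legal', 'technical', 'capacity']  # hypothetical indicator ids
toy_M = np.array([
    [0.0,  0.6, 0.0],
    [-0.7, 0.0, 0.0],
    [0.5,  0.0, 0.0],
])

# Build the edge list with the same three columns used above.
rows, cols = np.nonzero(toy_M)
edges = pd.DataFrame({
    'origin': [ids[i] for i in rows],
    'destination': [ids[j] for j in cols],
    'weight': toy_M[rows, cols],
})
buffer = io.StringIO()  # stands in for a file on disk
edges.to_csv(buffer, index=False)
buffer.seek(0)

# Read the edge list back and rebuild the matrix from the ids.
loaded = pd.read_csv(buffer)
position = {code: k for k, code in enumerate(ids)}
rebuilt = np.zeros((len(ids), len(ids)))
for origin, destination, weight in loaded.itertuples(index=False):
    rebuilt[position[origin], position[destination]] = weight

round_trip_ok = bool(np.allclose(rebuilt, toy_M))
print(round_trip_ok)
```

If the rebuilt matrix matches the original, the ids and the row-to-column orientation have survived the export.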