<a href="https://colab.research.google.com/github/agi2019/ppi-gci/blob/main/tutorials/01b%20-%20data%20preparation%20(interdependency%20network).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center>Data preparation – interdependency networks</center>

Prepared by Omar A. Guerrero (oguerrero@turing.ac.uk, <a href="https://twitter.com/guerrero_oa">@guerrero_oa</a>). Adapted for the PPI-Global Cybersecurity Index (GCI) project on cybersecurity policy prioritisation.

On this page, I demonstrate how to construct the input network of GCI indicators for PPI simulations. For illustrative purposes, and in line with the PPI tutorial, I adopt a simple correlation-based approach to estimate pairwise relationships between cybersecurity indicators over time. Specifically, I will:

1. Load the pre-processed GCI indicator dataset, which spans multiple years (e.g., 2014, 2017, 2018, 2020, and 2024).
2. Compute pairwise correlations between the changes in indicators, offsetting one series relative to the other to infer directionality and obtain a directed, asymmetric network.
3. Apply a threshold to filter out weak relationships, retaining only edges whose correlation has a magnitude of at least 0.5.
4. Convert the resulting matrix into an edge list suitable for use in the PPI model.

⚠️ Note: This method is applied as a temporary simplification to support the initial model setup. While it does not capture the full complexity of indicator interdependencies, it enables early-stage simulations and validation. In the next phase, this correlation-based network will be revisited and enhanced using a more appropriate network-estimation method tailored to the GCI indicators. The future goal is to incorporate a weighted structure that reflects the influence of each cybersecurity indicator on others, aligned with the systemic nature of cybersecurity policy domains and the GCI framework.

Ultimately, the network structure will support more accurate policy prioritisation by embedding realistic interdependencies between cybersecurity indicators into the PPI model.

## Import the necessary Python libraries to manipulate data

In [1]:
import pandas as pd
import numpy as np
import scipy.stats as stats
import networkx as nx
import plotly.graph_objects as go

## Select the scenario

In [2]:
#scenario = '_scenario1'
scenario = '_scenario2'
#scenario = '_scenario3'

## Import the pre-processed GCI indicators

In [3]:
data = pd.read_csv('https://raw.githubusercontent.com/agi2019/ppi-gci/main/tutorials/clean_data'+scenario+'/data_indicators.csv')

## Check for normality

<i>The usual significance tests for Pearson correlation assume that the data are approximately normally distributed. If this assumption is violated, inference based on the correlation may be unreliable, so it is advisable to run a normality test or to use Spearman correlation instead, particularly when the data are ordinal or non-normal. Note that the cell below applies the Shapiro-Wilk test to each indicator's values in the 2020 to 2024 columns, not to the year-to-year changes.</i>

In [4]:
pd.set_option('display.max_rows', None)

# 1. Reuse the indicator data loaded above
df = data
# 2. Select the year columns from 2020 to 2024
year_columns = [col for col in df.columns if col.isnumeric() and int(col) in range(2020, 2025)]
df_years = df[year_columns]

# 3. Run the Shapiro-Wilk normality test for each row (indicator)
shapiro_results = []

for idx, row in df_years.iterrows():
    values = row.dropna().astype(float)

    if len(values) >= 3:  # Shapiro-Wilk needs at least 3 values
        stat, p_value = stats.shapiro(values)
        conclusion = 'Normal' if p_value > 0.05 else 'Not Normal'
    else:
        stat, p_value = None, None
        conclusion = 'Insufficient Data'

    shapiro_results.append({
        'seriesCode': df.loc[idx, 'seriesCode'],
        'W-Statistic': stat,
        'p-value': p_value,
        'Distribution': conclusion
    })

# 4. Collect the results in a DataFrame
shapiro_df = pd.DataFrame(shapiro_results)

# 5. Display the results
shapiro_df

| | seriesCode | W-Statistic | p-value | Distribution |
|---|---|---|---|---|
| 0 | gci11_Lonline | 0.986762 | 0.967174 | Normal |
| 1 | gci11_Lforgery | 0.986762 | 0.967174 | Normal |
| 2 | gci11_Lolsafety | 0.986762 | 0.967174 | Normal |
| 3 | gci12_Rpdp | 0.986762 | 0.967174 | Normal |
| 4 | gci12_Rprivacy | 0.986762 | 0.967174 | Normal |
| 5 | gci12_Rnotif | 0.986762 | 0.967174 | Normal |
| 6 | gci12_RAuditre | 0.986762 | 0.967174 | Normal |
| 7 | gci12_Rstandard | 0.986762 | 0.967174 | Normal |
| 8 | gci12_Rsign | 0.986762 | 0.967174 | Normal |
| 9 | gci12_Rspam | 0.986762 | 0.967174 | Normal |
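
As a rough sketch of the Spearman alternative mentioned in the note, the snippet below computes both Pearson and Spearman correlations on the changes of two series, and runs Shapiro-Wilk on the changes rather than on the levels. The numbers and the names `toy_a` and `toy_b` are made up for illustration; they are not GCI data.

```python
import numpy as np
import scipy.stats as stats

# Hypothetical levels of two indicators over five observations (not GCI data).
toy_a = np.array([0.10, 0.30, 0.35, 0.60, 0.90])
toy_b = np.array([0.20, 0.25, 0.50, 0.55, 0.95])

# Period-to-period changes (first differences).
change_a = np.diff(toy_a)
change_b = np.diff(toy_b)

# Shapiro-Wilk on the changes of one series (needs at least 3 values).
shapiro_w, shapiro_p = stats.shapiro(change_a)

# Pearson captures linear association; Spearman captures monotonic (rank) association.
pearson_r, pearson_p = stats.pearsonr(change_a, change_b)
spearman_rho, spearman_p = stats.spearmanr(change_a, change_b)

print(f"Shapiro-Wilk on changes: W = {shapiro_w:.3f}, p = {shapiro_p:.3f}")
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

With so few observations per indicator, both the normality test and the correlations have little statistical power, so the outcome is best read as indicative only.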


## Construct a matrix with pairwise Pearson correlations

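Entry `M[i, j]` holds the correlation between the changes in indicator `i`, taken from the second observation onward, and the changes in indicator `j`, taken up to the second-to-last observation, so the two series are offset by one period. Edges are read as pointing from row to column. Pairs in which either series of changes is constant are skipped and keep a weight of zero.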

In [5]:
N = len(data)
M = np.zeros((N, N))
years = [column_name for column_name in data.columns if str(column_name).isnumeric()]

for i, rowi in data.iterrows():
    for j, rowj in data.iterrows():
        if i!=j:
            serie1 = rowi[years].values.astype(float)[1::]
            serie2 = rowj[years].values.astype(float)[0:-1]
            change_serie1 = serie1[1::] - serie1[0:-1]
            change_serie2 = serie2[1::] - serie2[0:-1]
            if not np.all(change_serie1 == change_serie1[0]) and not np.all(change_serie2 == change_serie2[0]):
                M[i,j] = np.corrcoef(change_serie1, change_serie2)[0,1]
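
To make the slicing in the loop concrete, here is a minimal sketch for a single pair of indicators. The values and the names `toy_i` and `toy_j` are hypothetical; only the slicing pattern is taken from the cell above.

```python
import numpy as np

# Hypothetical levels of two indicators over five observations.
toy_i = np.array([1.0, 2.0, 4.0, 7.0, 11.0])
toy_j = np.array([5.0, 6.0, 9.0, 11.0, 15.0])

# Same slicing as in the loop: indicator i drops its first observation and
# indicator j drops its last one, so the series are offset by one period.
serie_i = toy_i[1:]    # [2, 4, 7, 11]
serie_j = toy_j[:-1]   # [5, 6, 9, 11]

# Changes within each shifted series.
change_i = np.diff(serie_i)  # [2, 3, 4]
change_j = np.diff(serie_j)  # [1, 3, 2]

# Pearson correlation between the two change series (about 0.5 here).
weight = np.corrcoef(change_i, change_j)[0, 1]
print(round(weight, 3))
```

Five observations leave only three paired changes after shifting and differencing, so each weight in a dataset of that length would rest on very few points.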

## Filter edges that have a weight of magnitude lower than 0.5

In [6]:
M[np.abs(M) < 0.5] = 0
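
The 0.5 cut-off is a modelling choice, so it can be useful to see how many edges survive under different thresholds. The sketch below does this on a made-up 3x3 matrix; `M_toy` and the helper `filter_edges` are hypothetical names, not part of the tutorial.

```python
import numpy as np

# Hypothetical correlation matrix (rows and columns are indicators).
M_toy = np.array([
    [0.0, 0.8, -0.3],
    [-0.6, 0.0, 0.49],
    [0.5, -0.1, 0.0],
])

def filter_edges(matrix, threshold):
    """Return a copy with entries of magnitude below `threshold` set to zero."""
    filtered = matrix.copy()
    filtered[np.abs(filtered) < threshold] = 0
    return filtered

# Count the surviving edges for a few candidate thresholds.
edge_counts = {t: int(np.count_nonzero(filter_edges(M_toy, t))) for t in (0.3, 0.5, 0.7)}
print(edge_counts)  # {0.3: 5, 0.5: 3, 0.7: 1}
```

The filter compares magnitudes, so strong negative correlations are kept along with strong positive ones, and an entry exactly equal to the threshold survives.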

## Save the network as a list of edges using the indicators' ids

In [7]:
ids = data.seriesCode.values
edge_list = []
for i, j in zip(np.where(M!=0)[0], np.where(M!=0)[1]):
    edge_list.append( [ids[i], ids[j], M[i,j]] )
df = pd.DataFrame(edge_list, columns=['origin', 'destination', 'weight'])
df.to_csv('data_network.csv', index=False)
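
To check a saved edge list, or to pass it to code that expects a matrix, it can be turned back into an adjacency matrix. The sketch below reads an in-memory CSV with the same three columns; the indicator codes (`ind_a`, `ind_b`, `ind_c`) and weights are made up, and how the PPI model itself consumes the file is not shown here.

```python
import io
import numpy as np
import pandas as pd

# Hypothetical edge list in the same layout as data_network.csv.
csv_text = """origin,destination,weight
ind_a,ind_b,0.8
ind_b,ind_a,-0.6
ind_c,ind_a,0.5
"""
toy_edges = pd.read_csv(io.StringIO(csv_text))

# Fix the node order so that rows and columns are reproducible.
toy_ids = ['ind_a', 'ind_b', 'ind_c']
index_of = {code: k for k, code in enumerate(toy_ids)}

# Rows are origins and columns are destinations, matching the edge list.
A = np.zeros((len(toy_ids), len(toy_ids)))
for origin, destination, weight in toy_edges.itertuples(index=False):
    A[index_of[origin], index_of[destination]] = weight

print(A)
```

Indicators without any surviving edge do not appear in the edge list, so the full list of indicator codes has to come from the indicator data rather than from the network file.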

## Visualise the network

In [8]:
# Build a directed graph
G = nx.DiGraph()

# Add the edges from the edge-list DataFrame
for _, row in df.iterrows():
    G.add_edge(row['origin'], row['destination'], weight=row['weight'])

# Compute node positions with a force-directed layout
pos = nx.spring_layout(G, k=0.3, iterations=50)

# Build the edge traces (lines between nodes)
edge_trace = []
for edge in G.edges(data=True):
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    weight = edge[2]['weight']

    # Soft colours (hex colours would also work): green for positive weights, red for negative
    color = 'rgba(0, 200, 0, 0.2)' if weight > 0 else 'rgba(255, 100, 100, 0.2)'
    width = abs(weight) * 3

    edge_trace.append(
        go.Scatter(
            x=[x0, x1], y=[y0, y1],
            line=dict(width=width, color=color),
            hoverinfo='text',
            text=f"{edge[0]} → {edge[1]}<br>Weight: {weight:.3f}",
            mode='lines'
        )
    )

# Build the node trace
node_x = []
node_y = []
node_text = []
for node in G.nodes():
    x, y = pos[node]
    node_x.append(x)
    node_y.append(y)
    node_text.append(node)

node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers+text',
    text=node_text,
    textposition="top center",
    hoverinfo='text',
    marker=dict(
        size=14,
        color='lightblue',
        line=dict(width=2, color='darkblue'),
        opacity=1.0  # keep the nodes clearly visible
    ),
    textfont=dict(
        size=12,
        color='black'  # keep the labels legible
    )
)

# Combine everything into a Plotly figure
fig = go.Figure(
    data=edge_trace + [node_trace],
    layout=go.Layout(
        # Nested title dict replaces the deprecated `titlefont_size` shortcut
        title=dict(text='Indicator Relationship Network', font=dict(size=16)),
        showlegend=False,
        hovermode='closest',
        margin=dict(b=20, l=5, r=5, t=40),
        xaxis=dict(showgrid=False, zeroline=False),
        yaxis=dict(showgrid=False, zeroline=False)
    )
)

fig.show()
