# Introduction to the Dataset

Valeriano, B., & Maness, R. C. (2014). The dynamics of cyber conflict between rival antagonists, 2001–11. Journal of Peace Research, 51(3), 347–360. http://www.jstor.org/stable/24557484

The dataset is dyadic: each observation pairs two states that stand on opposite sides of a cyber incident or campaign.
Dataset: https://drryanmaness.wixsite.com/cyberconflict/cyber-conflict-dataset

- States: A, B
- Cyber incidents: A (initiator) -> B (target)
- Dyadic dataset: A <-> B
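
The difference between the directed view (A -> B) and the dyadic view (A <-> B) can be sketched on a toy incident list. The states `A`, `B`, `C` and the incidents below are hypothetical and are not taken from the dataset; the sketch uses only the standard library so that it runs without the data file.

```python
from collections import Counter

# Hypothetical incident list: one (initiator, target) pair per coded incident.
incidents = [("A", "B"), ("A", "B"), ("B", "A"), ("A", "C")]

# Directed view: count incidents per ordered pair (initiator -> target).
directed_counts = Counter(incidents)

# Dyadic view: sort each pair so that A -> B and B -> A fall into the same dyad.
dyad_counts = Counter(tuple(sorted(pair)) for pair in incidents)

print(dict(directed_counts))
print(dict(dyad_counts))
```

In the directed view, A -> B and B -> A are separate entries, which is what later becomes the edge weights of a directed graph. In the dyadic view they collapse into a single A <-> B count.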

For individual cyber conflicts, the dataset uses the concept of a ‘cyber incident.’ An incident may comprise thousands of events, because accounting for every single intrusion or attack is impossible. This raises the question of how to count campaigns such as the SolarWinds intrusion: should the Russian intrusion into SolarWinds be coded as a single campaign, an active intrusion into 110 organizations, or a latent intrusion into 10,000? In the DCID dataset, SolarWinds is counted as one incident.

Codebook: https://a678132e-4067-4ed4-800a-239c80659fd1.filesusr.com/ugd/4b99a4_ca35bdb6bd55443e890d2dab86910b4c.pdf

# Libraries Used

- pandas
- matplotlib
- networkx
- pyvis
- numpy

In [None]:
import pandas as pd

In [None]:
# Load the dataset from the Excel workbook
file_path = 'data/DCID_2.0_Release_update_February_2023.xlsx'
# Excel truncates sheet names to 31 characters, hence "Februar"
dcid = pd.read_excel(file_path, sheet_name="DCID_2.0_Release_update_Februar")

# Drop duplicate rows so that each Cyberincidentnum appears only once
# FIXME: it is unclear why this sheet contains duplicates, but they must be removed before counting
dcid = dcid.drop_duplicates(subset='Cyberincidentnum', keep='first')

print(dcid)

In [None]:
# TODO: map the Correlates of War (COW) country codes in dcid to ISO codes
cow_countries = pd.read_csv('data/COW-country-codes.csv')
print(cow_countries)

# Check for duplicates in the COW country codes:
# the file lists some CCode values more than once, so keep only the first row per code
cow_countries = cow_countries.drop_duplicates(subset='CCode', keep='first')
print(cow_countries)

In [None]:
# Attach the COW state abbreviation (StateAbb) for the initiator
dcid = dcid.merge(cow_countries, left_on='initiator', right_on='CCode', how='left', suffixes=('', '_y'))

# Attach the COW state abbreviation for the target; the clashing columns get the '_y' suffix
dcid = dcid.merge(cow_countries, left_on='target', right_on='CCode', how='left', suffixes=('', '_y'))

# Rename StateAbb to initiator_iso and StateAbb_y to target_iso
# Note: StateAbb is the COW abbreviation, which is not always identical to the ISO 3166 code
dcid = dcid.rename(columns={'StateAbb': 'initiator_iso', 'StateAbb_y': 'target_iso'})

print(len(dcid))
print(dcid)
# save the file
# dcid.to_csv('export.csv', index=False)

dcid_short = dcid[['initiator_iso', 'target_iso']]
# print(dcid_short)

In [None]:
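# Create the adjacency matrix for dcid:
# rows are initiators, columns are targets, and each cell counts incidents
ct = pd.crosstab(dcid_short['initiator_iso'], dcid_short['target_iso'])
# Use the same labels on both axes so that the matrix is square
idx = ct.columns.union(ct.index)
adjacency_matrix = ct.reindex(index=idx, columns=idx, fill_value=0)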

print(adjacency_matrix)

# Save the adjacency matrix to a csv file
adjacency_matrix.to_csv('adjacency_matrix.csv', index=True)

In [None]:
# Check whether the matrix is symmetric
# (it is only if every dyad has equal incident counts in both directions)
import numpy as np
if np.array_equal(adjacency_matrix.to_numpy(), adjacency_matrix.T.to_numpy()):
    print('Matrix is symmetric')
else:
    print('Matrix is not symmetric')

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

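# Build a directed graph from the adjacency matrix;
# non-zero cells become edges and the cell value is stored as the 'weight' attribute
G = nx.from_pandas_adjacency(adjacency_matrix, create_using=nx.DiGraph)
print(G.edges(data=True))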


In [None]:
# Draw the graph with node labels (edge weights are added in a later cell)
nx.draw(G, with_labels=True)


In [None]:
# Keep only nodes with out-degree > 0, i.e. the states that initiated at least one incident
# (a node's out-degree is the number of edges pointing out of it)
outdeg = G.out_degree()

G2 = G.copy()
for node in G.nodes():
    if outdeg[node] == 0:
        G2.remove_node(node)

print(G2.edges(data=True))
# nx.draw uses a randomly initialized spring layout by default, so the picture changes on every run;
# a seeded layout makes it reproducible
nx.draw(G2, pos=nx.spring_layout(G2, seed=42), with_labels=True)

# Now we can see that some connections are only one way, e.g.
# 365 (Russia) -> 732 (South Korea)

In [None]:

# Show the weights of the edges
pos = nx.spring_layout(G, seed=42)  # fixed seed for a reproducible layout
nx.draw_networkx(G, pos)

# TODO: avoid overlapping weight labels
# TODO: avoid overlapping nodes

labels = nx.get_edge_attributes(G, 'weight')
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)

In [None]:
from pyvis.network import Network

net = Network(notebook=True, cdn_resources='in_line')
net.from_nx(G)
net.show("output/network.html")

# # Open the HTML file in the browser, or
# # read it into a buffer and display it inline
# from IPython.display import HTML
# with open("output/network.html", "r") as file:
#     network_html = file.read()

# HTML(network_html)

In [None]:
# Render the same graph with directed edges (arrows)

net = Network(notebook=True, directed=True, cdn_resources='in_line')
net.from_nx(G)
net.show("output/network-directed.html")

# # Read the file into a buffer and display it inline
# from IPython.display import HTML
# with open("output/network-directed.html", "r") as file:
#     network_html = file.read()

# HTML(network_html)

# Network images
![](./output/network-weights.png)
![](./output/network-directed.png)

In [None]:
# Try curved edges
pos = nx.spring_layout(G, seed=42)
# Pass the same pos to both calls; otherwise nx.draw computes its own layout
# and the edge labels end up in the wrong place
nx.draw(G, pos, with_labels=True, connectionstyle="arc3,rad=0.1")

labels = nx.get_edge_attributes(G, 'weight')

# FIXME: place the labels in a better way
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels, label_pos=0.1)
plt.show()

# Show the data structure of the graph
print(G.edges(data=True))


In [None]:
# Show the network on a map
# https://towardsdatascience.com/from-geojson-to-network-graph-analyzing-world-country-borders-in-python-ab81b5a8ce5a 

# TODO: rework this part

# import geopandas as gpd
# plt.rcParams['font.family'] = 'Arial'

# world_map = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
# world_map = world_map[world_map['continent'] != 'Antarctica']
# world_map.plot()
# plt.show() 

# world_map = world_map.set_index('iso_a3')

# # 1. Geocode the countries
# from geopy.geocoders import Nominatim
# geolocator = Nominatim(user_agent="dcid-sna")

# def geolocate(country):
#     loc = geolocator.geocode(country)
#     return (loc.latitude, loc.longitude)

# print(geolocate('USA'))

# 2. Add the latitude and longitude to the graph

# for node in G.nodes:
#     country = labels[node]
#     lat, lon = geolocate(country)
#     G.nodes[node]['latitude'] = lat
#     G.nodes[node]['longitude'] = lon
# node_positions = {n: (d['longitude'], d['latitude']) for n, d in G.nodes(data=True)}

# 3. Plot the network on the map



In [None]:
# https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.centrality.degree_centrality.html
# The degree centrality for a node v is the fraction of nodes it is connected to.
# For a directed graph this counts both incoming and outgoing edges.
centrality = nx.degree_centrality(G)
centrality = pd.DataFrame(centrality.items(), columns=['Country', 'Centrality'])
centrality = centrality.sort_values(by='Centrality', ascending=False)
print(centrality)


In [None]:
# Out Degree: https://networkx.org/documentation/stable/reference/classes/generated/networkx.DiGraph.out_degree.html
# The node out_degree is the number of edges pointing out of the node. 
out_degree = G.out_degree()
sorted_out_degree = sorted(out_degree, key=lambda x: x[1], reverse=True)
sorted_out_degree = pd.DataFrame(sorted_out_degree, columns=['Country', 'Out Degree'])
print(sorted_out_degree)

# Weighted out-degree
# The sum of the weights of a node's outgoing edges, i.e. the number of incidents a country initiated
weighted_out_degree = G.out_degree(weight='weight')
sorted_weighted_out_degree = sorted(weighted_out_degree, key=lambda x: x[1], reverse=True)
sorted_weighted_out_degree = pd.DataFrame(sorted_weighted_out_degree, columns=['Country', 'Weighted Out Degree'])
print(sorted_weighted_out_degree)

In [None]:
# In Degree: https://networkx.org/documentation/stable/reference/classes/generated/networkx.DiGraph.in_degree.html
# The node in_degree is the number of edges pointing to the node,
# i.e. the number of antagonists attacking a country
in_degree = G.in_degree()
sorted_in_degree = sorted(in_degree, key=lambda x: x[1], reverse=True)
sorted_in_degree = pd.DataFrame(sorted_in_degree, columns=['Country', 'In Degree'])
print(sorted_in_degree)

# Weighted in-degree: the sum of the weights of a node's incoming edges,
# i.e. the number of incidents targeting a country
weighted_in_degree = G.in_degree(weight='weight')
sorted_weighted_in_degree = sorted(weighted_in_degree, key=lambda x: x[1], reverse=True)
sorted_weighted_in_degree = pd.DataFrame(sorted_weighted_in_degree, columns=['Country', 'Weighted In Degree'])
print(sorted_weighted_in_degree)

In [None]:
# Reciprocity: https://networkx.org/documentation/stable/reference/algorithms/reciprocity.html
# The reciprocity of a directed graph is the ratio of the number of edges pointing in both directions
# to the total number of edges in the graph.

reciprocity = nx.overall_reciprocity(G)
print(reciprocity)


In [None]:
# Cliques: https://networkx.org/documentation/stable/reference/algorithms/generated/networkx.algorithms.clique.find_cliques.html
# A clique in an undirected graph is a subset of the nodes such that every two nodes in the subset are adjacent.

# find_cliques only works for an undirected graph, so G would first have to be converted:
# cliques = nx.find_cliques(G.to_undirected())

# Conclusion

Preparing and reshaping the data took the longest. I struggled with messy data (an Excel workbook with multiple worksheets) and with merging in a second file to get country codes. Drawing the graph, including the labels on the edges, also proved more difficult than expected. The graph would look great on a spatial map, but geocoding the data will take more time.


Creating the adjacency matrix for my dataset was easy. Once the graph exists, running network analysis and calculating statistics such as degree centrality or reciprocity is straightforward.
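
As a sanity check on what these statistics measure, the sketch below recomputes them by hand on a hypothetical three-state graph. The states `A`, `B`, `C` and the edge weights are made up for illustration and are not taken from DCID. It follows the definitions behind `nx.overall_reciprocity`, `nx.degree_centrality` and the weighted out-degree used in this notebook, but it is a plain-Python illustration rather than the networkx implementation.

```python
# Hypothetical weighted directed edges: (initiator, target) -> number of incidents.
edges = {("A", "B"): 3, ("B", "A"): 1, ("A", "C"): 2}
nodes = sorted({n for pair in edges for n in pair})

# Reciprocity: share of directed edges whose reverse edge also exists.
reciprocated = sum(1 for (u, v) in edges if (v, u) in edges)
reciprocity = reciprocated / len(edges)

# Degree centrality: (in-degree + out-degree) / (n - 1), ignoring weights.
degree = {n: 0 for n in nodes}
for u, v in edges:
    degree[u] += 1
    degree[v] += 1
centrality = {n: degree[n] / (len(nodes) - 1) for n in nodes}

# Weighted out-degree: total number of incidents initiated by each state.
weighted_out = {n: 0 for n in nodes}
for (u, v), w in edges.items():
    weighted_out[u] += w

print(reciprocity)
print(centrality)
print(weighted_out)
```

Because incoming and outgoing edges are both counted, degree centrality in a directed graph can exceed 1, as it does for `A` here.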

Network analysis is an interesting method, and I'll use it in my research one way or another.
