# GOV.UK structural network adjacency
This notebook works through a tutorial on key network metrics and statistics, based on this [blog post](https://programminghistorian.org/en/lessons/exploring-and-analyzing-network-data-with-python#fnref:import).

We will be using the [GOV.UK structural network adjacency list](https://ckan.publishing.service.gov.uk/dataset/gov-uk-structural-network-adjacency-list?utm_source=ckan) data for this tutorial. This shows the connections (edges) between source and sink pages (nodes).

This tutorial will take us through the basics of network visualisation and analysis using `networkx` and Python visualisation tools. The data itself doesn't matter so much; the aim is to learn the principles of graph theory and the `networkx` functionality.

For further ideas see the [list of algorithms](https://networkx.github.io/documentation/stable/reference/algorithms/index.html) available for use with networkx.

In [None]:
from IPython.core.interactiveshell import InteractiveShell
import community
import networkx as nx
import os
import pandas as pd

from holoviews.element.graphs import layout_nodes
import holoviews as hv
import holoviews.operation.datashader as hd

# Enable multiline outputs in this notebook
InteractiveShell.ast_node_interactivity = "all"

# Enable interactivity for plots
hv.extension("bokeh")

## Reading the data
Rather than read in the data the way the Quaker tutorial (the blog post linked above) does, we will use pandas.

Make sure to [download the data](https://ckan.publishing.service.gov.uk/dataset/gov-uk-structural-network-adjacency-list?utm_source=ckan), and place it in the `data/processed_network` folder at the parent-level of this repository. Alternatively, read in the data directly from the internet.

In [None]:
# Define a relative path to the data
input_dir = os.path.join("..", "..", "data", "processed_network")
data_path = os.path.join(input_dir, "structural_network_adjacency_list_20190301.csv")
data_path

# Read in the data downloaded locally
df_edges = pd.read_csv(data_path)
df_edges

In [None]:
# Define the URL to the data
data_url = "https://assets.publishing.service.gov.uk/media/5cde7160ed915d439fafebef/structural_network_adjacency_list_20190301.csv"

# Read in the data directly from the internet
df_edges = pd.read_csv(data_url)
df_edges
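
The two cells above can be folded into one step. The following is a minimal sketch, assuming you want to prefer the local copy and only fall back to the URL when it is missing; `read_edges` is a hypothetical helper name, not part of pandas or of the original tutorial.

```python
import os

import pandas as pd


def read_edges(local_path, url):
    """Hypothetical helper: read the local CSV if it exists, otherwise read from the URL."""
    # 'pd.read_csv' accepts both file paths and URLs, so only the source string changes
    source = local_path if os.path.exists(local_path) else url
    return pd.read_csv(source)
```

In this notebook it would be called as `df_edges = read_edges(data_path, data_url)`.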

We now have the edges loaded in, but need a unique set of nodes for our network. Just in case there are some nodes listed in `source_base_path` but not `sink_base_path`, we need to take a unique set across both columns. 

This [Stack Overflow answer](https://stackoverflow.com/a/26977495) will help us get an array of unique nodes. We will then create a pandas DataFrame, and arbitrarily set the index as `node_id`. Note that we sort the array of unique nodes so that the main GOV.UK page slug (`/`) has node ID `0`.

In [None]:
# Get an array of unique nodes
unique_nodes = pd.unique(df_edges[["source_base_path", "sink_base_path"]].values.ravel("K"))

# Sort the nodes alphabetically, and artificially create a 'node_id' column using the index
df_nodes = pd.DataFrame(sorted(unique_nodes), columns=["nodes"])
df_nodes.index.name = "node_id"
df_nodes
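
To see what the flatten-and-deduplicate step does, here is a small sketch on a made-up edge list. The `toy_*` names and the slugs are illustrative assumptions; only the column names match the GOV.UK data.

```python
import pandas as pd

# Hypothetical toy edge list using the same column names as the GOV.UK data
toy_edges = pd.DataFrame({
    "source_base_path": ["/", "/", "/a"],
    "sink_base_path": ["/a", "/b", "/c"],
})

# Flatten both columns into one 1-D array, then drop duplicates
toy_unique = pd.unique(toy_edges[["source_base_path", "sink_base_path"]].values.ravel("K"))

# Sorting puts "/" first, so it receives node ID 0
toy_nodes = pd.DataFrame(sorted(toy_unique), columns=["nodes"])
toy_nodes.index.name = "node_id"
print(toy_nodes)
```

Note that `/c` only ever appears as a sink, which is why both columns have to be considered.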

For completeness, let's label the node IDs for both source and sink pages in `df_edges`.

Here, we merge `df_nodes` to `df_edges` for both `source_base_path` and `sink_base_path` columns. The chained `reset_index` and `rename` functions on `df_nodes` ensure we don't create duplicate source/sink slug columns during the merge, and also create unique node ID column names for the source/sink slug columns.

In [None]:
# Add the node IDs back in for each of the sources and sinks
for col in df_edges.columns:
    if "_base_path" in col:
        df_edges = df_edges.merge(df_nodes.reset_index().rename(columns={"nodes": col, "node_id": col+"_id"}),
                                  on=col,
                                  how="left", 
                                  validate="m:1")
df_edges
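
The same merge pattern is easier to follow on a tiny example. This is a sketch with made-up slugs (the `toy_*` frames are assumptions for illustration); `validate="m:1"` makes pandas raise an error if a slug appears more than once in the lookup table.

```python
import pandas as pd

# Hypothetical node table, indexed by 'node_id' like 'df_nodes'
toy_nodes = pd.DataFrame({"nodes": ["/", "/a", "/b"]})
toy_nodes.index.name = "node_id"

# Hypothetical edge table with the same slug columns as 'df_edges'
toy_edges = pd.DataFrame({
    "source_base_path": ["/", "/", "/a"],
    "sink_base_path": ["/a", "/b", "/b"],
})

for col in ["source_base_path", "sink_base_path"]:
    # Rename so the lookup joins on 'col' and contributes a single '<col>_id' column
    lookup = toy_nodes.reset_index().rename(columns={"nodes": col, "node_id": col + "_id"})
    toy_edges = toy_edges.merge(lookup, on=col, how="left", validate="m:1")

print(toy_edges)
```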

## Creating the graph
In NetworkX, we can combine a list/array of nodes and a list/array of edges into a single network object that understands how nodes and edges are related.

This object is called a Graph, one of the common terms for data organized as a network. (N.B. "Graph" here does not refer to any visual representation of the data; it is used purely in a mathematical, network analysis sense.) First you must initialize a Graph object with the following command:

In [None]:
# Instantiate a NetworkX Graph object
G = nx.Graph()

Note that this is an undirected graph. However, the data describes how pages link to one another (hence the source and sink pages!), so a directed graph (digraph) is probably better:

In [None]:
# Instantiate a NetworkX DiGraph object
G = nx.DiGraph()
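
The practical difference between the two classes shows up as soon as you ask about an edge in the reverse direction. A minimal sketch, using made-up slugs:

```python
import networkx as nx

# The same single link, stored in an undirected and a directed graph
undirected = nx.Graph()
undirected.add_edge("/", "/browse")

directed = nx.DiGraph()
directed.add_edge("/", "/browse")

# An undirected edge can be looked up in either direction
print(undirected.has_edge("/browse", "/"))  # True

# A directed edge only exists from source to sink
print(directed.has_edge("/", "/browse"))    # True
print(directed.has_edge("/browse", "/"))    # False
```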

We can add some attribute data to the graph, to help us remember what it's about:

In [None]:
G = nx.DiGraph(name="DiGraph using GOV.UK structural network adjacency list data", 
               query_used="appropriate descriptive name")
G.graph

We can then generate a list of nodes (`node_names`), and a list of edges (`edges`) from our pandas DataFrames:

In [None]:
# Extract the slugs for each node
node_names = df_nodes.nodes.values
node_names

In [None]:
# Get the source and destination ('sink') slugs
edges = df_edges[["source_base_path", "sink_base_path"]].values
edges

And add these to the digraph:

In [None]:
# Add nodes and edges to 'G'
G.add_nodes_from(node_names)
G.add_edges_from(edges)

# Print some info about 'G'
# Note: 'nx.info' was removed in NetworkX 3.0; on newer versions, use 'print(G)' instead
print(nx.info(G))

This is a quick way of getting some general information about your graph, but as you’ll learn in subsequent sections, it is only scratching the surface of what NetworkX can tell you about your data.

Because we specified a directed graph, the summary reports in-degree and out-degree separately rather than a single degree.
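
If you want these summary figures without relying on a particular summary function, they can be computed directly from the graph. The following is a sketch on a made-up three-page digraph; `toy` and its slugs are assumptions for illustration.

```python
import networkx as nx

# Hypothetical toy digraph: "/" links to "/a" and "/b", and "/a" links to "/b"
toy = nx.DiGraph()
toy.add_edges_from([("/", "/a"), ("/", "/b"), ("/a", "/b")])

n_nodes = toy.number_of_nodes()
n_edges = toy.number_of_edges()

# In-degree counts incoming links; out-degree counts outgoing links
avg_in = sum(d for _, d in toy.in_degree()) / n_nodes
avg_out = sum(d for _, d in toy.out_degree()) / n_nodes

print(n_nodes, n_edges, avg_in, avg_out)
print(toy.in_degree("/b"), toy.out_degree("/"))
```

Every edge contributes exactly one incoming and one outgoing link, so the two averages are always equal; the per-node values are where in-degree and out-degree differ.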

## Adding attributes
For NetworkX, a Graph object is one big thing (your network) made up of two kinds of smaller things (your nodes and your edges). So far you’ve added nodes and edges (as pairs of nodes), but NetworkX allows you to add attributes to both nodes and edges, providing more information about each of them.

Later on in this tutorial, you’ll be running metrics and adding some of the results back to the Graph as attributes. For now, let’s make sure your Graph contains all of the attributes that are currently in our CSV.

You’ll want to return to the DataFrames you created at the beginning of this notebook: `df_nodes` and `df_edges`. These contain useful information, such as the node ID and the edge link type, which you'll want to add to the graph.

There are a couple of ways to do this, but NetworkX provides two convenient functions for adding attributes to all of a Graph’s nodes or edges at once: `nx.set_node_attributes()` and `nx.set_edge_attributes()`.

To use these functions, you’ll need your attribute data to be in the form of a Python dictionary, in which the keys are node names (or, for edges, tuples of source and sink node names) and the values are the attributes you want to add. You’ll want to create a dictionary for each one of your attributes, and then add them using the functions above.

In [None]:
# Just to recall
df_nodes.head()

In [None]:
# Create a dictionary from the nodes and node IDs (index of 'df_nodes')
node_id_dict = dict(zip(df_nodes.nodes, df_nodes.index))
node_id_dict["/"]

In [None]:
# To create the edge attributes, the key needs to be a tuple of source and sink slugs
edge_keys = list(zip(df_edges.source_base_path, df_edges.sink_base_path))
edge_keys[0:2]

# Then create the dictionary with link type as the values
edge_link_type_dict = dict(zip(edge_keys, df_edges.link_type))
edge_link_type_dict[("/1619-bursary-fund", "/courses-qualifications")]

Now that we have created the attribute dictionaries, let's add them to the graph:

In [None]:
# Add the node and edge attributes
nx.set_node_attributes(G, node_id_dict, "node_id")
nx.set_edge_attributes(G, edge_link_type_dict, "link_type")

Now all of your nodes and edges have these attributes, and you can access them at any time. For example, you can print out the `node_id` of each node by looping through the nodes and accessing the `node_id` attribute, like this (note we've limited this to the first ten nodes):

In [None]:
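# Print the slug and node ID for the first ten nodes
# ('G.nodes[n]' replaces 'G.node[n]', which was removed in NetworkX 2.4)
for ix, n in enumerate(G.nodes()):
    if ix < 10:
        print("Node: '{0}'; Node ID: '{1}'".format(n, G.nodes[n]["node_id"]))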
    else:
        break
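
Edge attributes can be read back in a similar way. This is a minimal sketch on a made-up graph; the slugs and the `link_type` values are illustrative assumptions, not values taken from the GOV.UK data.

```python
import networkx as nx

# Hypothetical toy digraph with one node attribute and one edge attribute
toy = nx.DiGraph()
toy.add_edges_from([("/", "/a"), ("/a", "/b")])
nx.set_node_attributes(toy, {"/": 0, "/a": 1, "/b": 2}, "node_id")
nx.set_edge_attributes(
    toy,
    {("/", "/a"): "type_one", ("/a", "/b"): "type_two"},  # made-up link types
    "link_type",
)

# A single node or edge is accessed by its key (a slug, or a source/sink pair)
print(toy.nodes["/a"]["node_id"])
print(toy.edges["/", "/a"]["link_type"])

# 'get_edge_attributes' returns a dictionary keyed by (source, sink) tuples
link_types = nx.get_edge_attributes(toy, "link_type")
print(link_types)
```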