# Chapter 1 - Graph Theory

## Requirements

This notebook was tested on Linux and macOS, using a conda environment.

Instructions for building the conda environment can be found here: https://github.com/ftheberge/GraphMiningNotebooks

If you run into problems with the notebook, please open an issue in the GitHub repository.

For the data, make sure the directory set in the second cell below is correct.

In [None]:
## required packages for this Chapter
import igraph as ig
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from collections import Counter
from statsmodels.distributions.empirical_distribution import ECDF as ecdf
import random


In [None]:
## setting the path to the datasets
datadir = "../Datasets/"


## Summary

In this notebook, we look at **simple graph statistics** such as:
- degree distribution
- clustering coefficient
- shortest path length distribution

We illustrate the fact that even with such simple measures, we can see big differences between different types of graphs; hence the importance of exploratory data analysis (**EDA**) as a first step in data mining.

In this notebook we consider two graphs:

- a social-type graph, links between **GitHub developers**, and
- a transportation-type network, namely a **power grid**.

Throughout, we use the terms **vertex** (used in ```igraph```) and **node** interchangeably.


## GitHub Developers Graph

#### Description

A large undirected social network of GitHub developers which was collected from the public API in June 2019. Nodes are developers who have starred at least 10 repositories and edges are mutual follower relationships between them. The vertex features are extracted based on the location, repositories starred, employer and e-mail address. 

The graph has two types of nodes: 
- web developer 
- machine learning developer 

Below, we construct this graph (with igraph); later we will look at the subgraphs of web developers and of machine learning (ml) developers.

The graph is stored in object ```GitHubGraph```; look at the code below to see how we can also store attributes for edges and vertices in igraph objects.

There are several ways to read the edges; here we used a function from the ```pandas``` package.


In [None]:
## read the GitHub edge list as tuples and build undirected graph
## each node index is stored in vertex attribute "id"
df = pd.read_csv(datadir + "GitHubDevelopers/musae_git_edges.csv")
GitHubGraph = ig.Graph.TupleList(
    [tuple(x) for x in df.values], directed=False, vertex_name_attr="id"
)


### Node attributes

Each node has three attributes:
* its index (as used in the edge list)
* the username
* a binary flag indicating ml or web developer (1 = ml)


In [None]:
## read node attributes
GitHub_attr = pd.read_csv(datadir + "GitHubDevelopers/musae_git_target.csv")
GitHub_attr.head()


In [None]:
## build attribute dictionaries
Names = dict(zip(GitHub_attr.id, GitHub_attr.name))
ML = dict(zip(GitHub_attr.id, GitHub_attr.ml_target))


In [None]:
## add name attributes to graph
GitHubGraph.vs["name"] = [Names[i] for i in GitHubGraph.vs["id"]]

## add a class: 'ml' or 'web' depending on attribute 'ml_target'
labels = ["web", "ml"]
GitHubGraph.vs["class"] = [labels[ML[i]] for i in GitHubGraph.vs["id"]]
GitHubGraph.vs[0]


### GitHub subgraphs

We build the two subgraphs below: ```subgraph_ml``` for the machine learning (**ml**) developers and ```subgraph_web``` for the **web** developers.

In [None]:
## build the subgraphs
subgraph_ml = GitHubGraph.subgraph([v for v in GitHubGraph.vs() if v["class"] == "ml"])
subgraph_web = GitHubGraph.subgraph([v for v in GitHubGraph.vs() if v["class"] == "web"])

## there are 9739 ml developers and 27961 web developers
print(
    "GitHub nodes:",
    GitHubGraph.vcount(),
    "\nml developers:",
    subgraph_ml.vcount(),
    "\nweb developers:",
    subgraph_web.vcount(),
)


Note that some ml developers are connected only to web developers, and vice versa.

Therefore, some nodes will end up with no connection (degree 0) in the subgraphs.
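
The effect described above can be reproduced on a toy example. The sketch below uses plain Python rather than igraph; the five-vertex graph and the helper `induced_degrees` are made up for illustration and are not part of the dataset or of any library.

```python
# Toy undirected graph: each vertex carries a class label, edges are pairs.
# (Hypothetical data, only for illustration.)
labels = {"a": "ml", "b": "ml", "c": "web", "d": "web", "e": "ml"}
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("e", "c")]


def induced_degrees(labels, edges, keep):
    """Degrees in the subgraph induced by the vertices whose label is `keep`."""
    kept = {v for v, lab in labels.items() if lab == keep}
    deg = {v: 0 for v in kept}
    for u, v in edges:
        # an edge survives only if both of its endpoints are kept
        if u in kept and v in kept:
            deg[u] += 1
            deg[v] += 1
    return deg


deg_ml = induced_degrees(labels, edges, "ml")
isolated_ml = sorted(v for v, d in deg_ml.items() if d == 0)
print(isolated_ml)  # ['e']: its only neighbour was a 'web' vertex
```

Vertex `e` has degree 1 in the full graph but degree 0 in the induced 'ml' subgraph, which is the situation counted in the next cells.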


In [None]:
## github graph: count 'ml' nodes with edges to 'web' nodes, and vice-versa
count_ml = count_web = 0
for v in GitHubGraph.vs():
    if v["class"] == "ml":
        ## looking at all neighbors as a 'set' for this test:
        if set([GitHubGraph.vs[i]["class"] for i in GitHubGraph.neighbors(v)]) == {"web"}:
            count_ml += 1
    else:
        if set([GitHubGraph.vs[i]["class"] for i in GitHubGraph.neighbors(v)]) == {"ml"}:
            count_web += 1
print(
    count_ml,
    "ml developers connected only to web developers\n",
    count_web,
    "web developers connected only to ml developers",
)


In [None]:
## this should correspond to the number of degree 0 nodes in the subgraphs,
## as there are no degree 0 nodes in the original graph.
print(
    "nodes of degree zero in overall graph:",
    sum([d == 0 for d in GitHubGraph.degree()]),
)
print("nodes of degree zero in ml subgraph:", 
      sum([i == 0 for i in subgraph_ml.degree()]))
print(
    "nodes of degree zero in web subgraph:",
    sum([i == 0 for i in subgraph_web.degree()]),
)


## Europe Electric Grid

A network of the high-voltage power grid in Europe. The vertices are stations and the edges are the lines connecting them.
More details can be found at: https://zenodo.org/record/47317#.Xt6nzy3MxTY.

The graph is stored in object ```Grid```.

The edges have directionality, but in this notebook we consider an undirected version of the graph; after reading the edges, we "simplify" the graph to remove multi-edges.
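
As a rough illustration of what simplifying an undirected edge list means, here is a plain-Python sketch. The function `simplify_edges` is a hypothetical helper written for this example; it mimics our understanding of igraph's default `simplify()` behaviour (collapse multi-edges and drop self-loops) but is not the library's implementation.

```python
def simplify_edges(edges):
    """Return the sorted unique undirected edges, without self-loops."""
    unique = set()
    for u, v in edges:
        if u == v:
            continue  # drop self-loops
        # (u, v) and (v, u) describe the same undirected edge
        unique.add((min(u, v), max(u, v)))
    return sorted(unique)


# made-up edge list with a reversed duplicate, an exact duplicate and a loop
raw = [(1, 2), (2, 1), (2, 3), (2, 3), (4, 4)]
simple = simplify_edges(raw)
print(simple)  # [(1, 2), (2, 3)]
```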

Nodes have different attributes, including longitude and latitude; we will use those to force a graph layout for plotting according to the geographical location.

There are several types of nodes: 'joint', 'merge', 'plant', 'station', 'sub_station', 'substation'.

We show a different way to read in edges and build a graph using ```Read_Ncol```.

In [None]:
## read edge list for the grid network and build undirected simple graph
## by default, node ids are saved in the 'name' attribute as strings
Grid = ig.Graph.Read_Ncol(datadir + "GridEurope/gridkit_europe-highvoltage.edges", directed=False)
Grid = Grid.simplify()
print('number of nodes:',Grid.vcount())


### Node attributes

In [None]:
## read the nodes and their attributes
## notice there are more nodes here ... so there are isolates
Grid_attr = pd.read_csv(datadir + "GridEurope/gridkit_europe-highvoltage.vertices")
print('number of nodes:',Grid_attr.shape[0])
Grid_attr.head()


In [None]:
## add the isolated nodes to the graph
isolates = set(Grid_attr.v_id).difference(set([int(x) for x in Grid.vs['name']]))
Grid.add_vertices([str(x) for x in isolates])
print('number of nodes:',Grid.vcount())


In [None]:
## build attribute dictionaries
Longitude = dict(zip(Grid_attr.v_id, Grid_attr.lon))
Latitude = dict(zip(Grid_attr.v_id, Grid_attr.lat))
Type = dict(zip(Grid_attr.v_id, Grid_attr.typ))


In [None]:
## map names to integers for easier referencing
Grid.vs["id"] = [int(x) for x in Grid.vs["name"]]

## save longitude, latitude and node type in the graph
Grid.vs["longitude"] = [Longitude[i] for i in Grid.vs["id"]]
Grid.vs["latitude"] = [Latitude[i] for i in Grid.vs["id"]]
Grid.vs["type"] = [Type[i] for i in Grid.vs["id"]]


In [None]:
## EDA - min/max latitude and longitude and node types
print("Node types and counts:")
print(Counter(Grid.vs["type"]).most_common())

print("\nLatitude range:", min(Grid.vs["latitude"]), "to", max(Grid.vs["latitude"]))
print("\nLongitude range:", min(Grid.vs["longitude"]), "to", max(Grid.vs["longitude"]))


In [None]:
## plotting the grid - colors w.r.t. node types
## we use negative latitude due to location of the origin in the layout.
Grid.vs["size"] = 4
cls = {
    "joint": "pink",
    "substation": "lightblue",
    "merge": "brown",
    "station": "blue",
    "sub_station": "lightblue",
    "plant": "black",
}
Grid.vs["color"] = [cls[x] for x in Grid.vs["type"]]
Grid.es["color"] = "grey"
Grid.layout = [(v["longitude"], -v["latitude"]) for v in Grid.vs]
ig.plot(Grid, layout=Grid.layout, bbox=(0, 0, 600, 450))


# Graph Features

Below, we compute several basic features for the 4 graphs we have:
- ```GitHubGraph``` (ml+web)
- ```subgraph_ml``` (GitHub ml developers subgraph)
- ```subgraph_web``` (GitHub web developers subgraph)
- ```Grid``` (Europe power grid)

Please refer to the book for details of those features.

```mode='nan'``` means that nodes with fewer than 2 neighbours, for which the local clustering coefficient is undefined, are left out when computing clustering coefficients.
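
To make the effect of this choice concrete, the following plain-Python sketch computes local clustering coefficients on a tiny made-up graph and averages them in two ways: skipping undefined values, or counting them as zero. The helper `local_clustering` is hypothetical and only meant to illustrate the definition; it is not how igraph computes these values.

```python
import math
from itertools import combinations


def local_clustering(adj):
    """Local clustering coefficient per vertex; NaN when the degree is below 2."""
    cc = {}
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cc[v] = math.nan  # undefined: there is no pair of neighbours
            continue
        # number of pairs of neighbours of v that are themselves adjacent
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cc[v] = 2 * links / (k * (k - 1))
    return cc


# triangle 0-1-2 with a pendant vertex 3 attached to vertex 2
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
cc = local_clustering(adj)

defined = [c for c in cc.values() if not math.isnan(c)]
mean_nan_mode = sum(defined) / len(defined)  # undefined vertices are skipped
mean_zero_mode = sum(defined) / len(cc)      # undefined vertices count as 0
print(mean_nan_mode, mean_zero_mode)
```

Here the pendant vertex lowers the average from 7/9 to 7/12 when it is counted as zero, so the two conventions can give noticeably different summaries on graphs with many low-degree nodes.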

Note that running the cell below can take a few minutes.


In [None]:
## compute and store basic statistics in a dataframe
def baseStats(G):
    deg = G.degree()
    return [
        G.vcount(),
        G.ecount(),
        np.min(deg),
        np.mean(deg),
        np.median(deg),
        np.quantile(deg, 0.99),
        np.max(deg),
        G.diameter(),
        np.max(G.connected_components().membership) + 1,
        G.connected_components().giant().vcount(),
        sum([x == 0 for x in deg]),
        G.transitivity_undirected(mode="nan"),
        G.transitivity_avglocal_undirected(mode="nan"),
    ]


In [None]:
%%time
## compute stats for the 4 graphs - this can take a few minutes
S = []
S.append(["GitHub"] + baseStats(GitHubGraph))
S.append(["GitHub (ml)"] + baseStats(subgraph_ml))
S.append(["GitHub (web)"] + baseStats(subgraph_web))
S.append(["Grid"] + baseStats(Grid))


In [None]:
## store in a dataframe
df = pd.DataFrame(
    S,
    columns=[
        "network",
        "# nodes",
        "# edges",
        r"$\delta$",
        r"$\langle k\rangle$",
        r"median degree",
        r"$d_{quant_{99}}$",
        r"$\Delta$",
        "diameter",
        "# components",
        "largest component",
        "# isolates",
        r"$C_{glob}$",
        r"$C_{loc}$",
    ],
).transpose()
df


### Analysis of the results

What do we see in the table above?

First, look at the **degree distribution**; the GitHub graphs have a wide range of values, including some very high degree nodes, while the Grid graph has degrees in the range 0 to 16 only.

The **diameter** (maximum shortest path length) is also quite different; it is common for social networks to have relatively small diameter. On the other hand, a multi-country power grid has large diameter.

Looking at **components**, the GitHub graph is connected (single component), but the two subgraphs are not, and there are even nodes of degree zero, as we already saw. The Grid graph has several components, but most nodes fall in what we call the "giant component".

Finally, we see some differences between the local and global **clustering coefficients** for the GitHub graphs. Why is this so? What happens with very high degree nodes? See the plot below.


In [None]:
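## nodes of degree < 2 have an undefined (NaN) local clustering coefficient,
## so the two filters below drop the same nodes and X, Y stay aligned
X = [x for x in subgraph_ml.degree() if x > 1]
Y = [x for x in subgraph_ml.transitivity_local_undirected() if not np.isnan(x)]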
plt.plot(X, Y, ".", color="grey")
plt.title("GitHub (ml) subgraph", fontsize=16)
plt.xlabel("degree", fontsize=14)
plt.ylabel("local clustering coefficient", fontsize=14)
plt.show()
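
To see how a single high-degree node can pull the global and the average local clustering coefficients apart, here is a small plain-Python sketch. The toy graph (a hub joined to six vertices that form three disjoint edges) is a made-up example, not data from the GitHub graph, and `triple_counts` is a hypothetical helper rather than an igraph function.

```python
from itertools import combinations


def triple_counts(adj):
    """Per vertex: (adjacent pairs of neighbours, all pairs of neighbours)."""
    out = {}
    for v, nbrs in adj.items():
        closed = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
        out[v] = (closed, len(nbrs) * (len(nbrs) - 1) // 2)
    return out


# hub 0 joined to vertices 1..6, which are paired up as 1-2, 3-4, 5-6
edges = [(0, i) for i in range(1, 7)] + [(1, 2), (3, 4), (5, 6)]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

counts = triple_counts(adj)

# global coefficient: closed triples over all triples, pooled over vertices,
# so the hub's 15 pairs of neighbours dominate the denominator
c_glob = sum(c for c, _ in counts.values()) / sum(t for _, t in counts.values())

# average local coefficient: every vertex has equal weight, whatever its degree
c_loc = sum(c / t for c, t in counts.values()) / len(counts)

print(round(c_glob, 4), round(c_loc, 4))  # 0.4286 0.8857
```

The hub has a low local coefficient (3 of its 15 neighbour pairs are adjacent) but contributes most of the triples, so it drags the global value down while barely affecting the average of the local values.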


## Visualize part of the Grid network

For the Grid graph, we select a range of lat/lon that corresponds to the Iberian Peninsula.



In [None]:
## subgraph of Grid -- Iberian Peninsula
V = [
    v
    for v in Grid.vs()
    if v["latitude"] > 36 and v["latitude"] < 44 and v["longitude"] > -10 and v["longitude"] < 4
]
Grid_IP = Grid.subgraph(V)
layout = [(v["longitude"], -v["latitude"]) for v in Grid_IP.vs]
print("showing", Grid_IP.vcount(), "nodes and", Grid_IP.ecount(), "edges")
ig.plot(Grid_IP, layout=layout, bbox=(0, 0, 300, 250), vertex_color="black", vertex_size=3)
# ig.plot(Grid_IP, 'grid_sg.eps', layout=layout, bbox=(0,0,300,250), vertex_color='black', vertex_size=3);


## Visualize part of the GitHub (ml) graph

There is no lat/lon here; in the code below, we take the giant component of the GitHub ml subgraph,
and cut it arbitrarily w.r.t. the computed layout to display a portion of the subgraph.

This is for illustration only; note that this is quite different from the Grid graph, with clumps (dense areas) and less regular edge distribution.


In [None]:
## plot subgraph for github(ml)
random.seed(123)
sg = subgraph_ml.connected_components().giant()
ly = sg.layout_auto()
sg.vs["x"] = [x[0] for x in ly]
sg.vs["y"] = [x[1] for x in ly]

## grow a square window centered at the layout origin until it holds at least 800 nodes
z = 1
V = []
while len(V) < 800:
    V = [v for v in sg.vs() if v["x"] < z and v["x"] > -z and v["y"] < z and v["y"] > -z]
    z += 1
sg = sg.subgraph(V).connected_components().giant()
print("showing", sg.vcount(), "nodes and", sg.ecount(), "edges")
ig.plot(sg, bbox=(0, 0, 300, 250), vertex_size=3, edge_color='grey', vertex_color='black')
# ig.plot(sg, 'github_ml_sg.eps', bbox=(0,0,300,250), vertex_size=3);


## Degree distributions

Degree distribution is one of the fundamental properties of networks. Below, we plot the empirical cumulative distribution functions (cdf) of the degree distribution for the GitHub and Grid graphs.

As we noted before, for the GitHub graph, most nodes have low degree, but a few have very high degree, up to almost 10,000. This is not at all the case with the Grid graph, where almost all nodes have degree less than 10 and the maximum degree observed is only 16.
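
For readers unfamiliar with the empirical cdf, here is a minimal plain-Python sketch of the idea: F(x) is the fraction of observations less than or equal to x. The function `empirical_cdf` and the degree sequence are made up for this illustration; the notebook itself relies on the `ECDF` class from statsmodels.

```python
from bisect import bisect_right


def empirical_cdf(sample):
    """Return a function F where F(x) is the fraction of observations <= x."""
    data = sorted(sample)
    n = len(data)

    def F(x):
        # bisect_right counts how many sorted values are <= x
        return bisect_right(data, x) / n

    return F


degrees = [0, 1, 1, 2, 2, 2, 3, 16]  # hypothetical degree sequence
F = empirical_cdf(degrees)
print(F(0), F(2), F(16))  # 0.125 0.75 1.0
```

In this toy sequence, three quarters of the vertices have degree at most 2, even though the maximum degree is 16; the cdf makes that kind of skew easy to read off.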


In [None]:
## degree distribution - GitHub graph
deg = GitHubGraph.degree()
e = ecdf(deg)  ## empirical cdf
x = np.arange(1, max(deg), 1)
y = [e(i) for i in x]

## plot on log scale for x-axis
plt.semilogx(x, y, "-", color="black", label="GitHub")
plt.xlabel("degree", fontsize=14)
plt.ylabel("empirical cdf", fontsize=14)
plt.title("GitHub graph (ml and web developers)")
#plt.savefig('ecdf_gh.eps')
plt.show()


In [None]:
## degree distribution - Grid graph
## we see much lower degrees in this case
deg = Grid.degree()
e = ecdf(deg)
x = np.arange(1, 30, 1)
y = [e(i) for i in x]

## plot on log scale for x-axis, as for the GitHub graph
plt.semilogx(x, y, "-", color="black", label="Grid")
plt.xlabel("degree", fontsize=14)
plt.ylabel("empirical cdf", fontsize=14)
plt.title("Power Grid Network")
# plt.savefig('ecdf_gr.eps')
plt.show()


## Shortest paths distribution

In the plots below, we consider 100 randomly chosen nodes, and compute the length of the shortest path to reach every other node. We then plot histograms for those values. 

Once again we see very different distributions: for the GitHub graph, most paths are quite short, with common values in the range from 2 to 6.
For the Grid graph, however, the paths are generally longer and spread over a much wider range.
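
The procedure just described (pick some source nodes, compute the distance to every other node, tabulate the lengths) can be sketched in plain Python with a breadth-first search. The path graph and the helper `bfs_distances` below are made up for illustration; the notebook uses igraph's `distances` method instead.

```python
from collections import Counter, deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:  # first visit gives the shortest distance
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


# toy path graph 0-1-2-3-4 (connected, so every vertex is reached)
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
sources = [0, 2]  # fixed here; the notebook samples sources at random

# pool the distances over all sources, dropping the 0 distance to self
lengths = [d for s in sources for d in bfs_distances(adj, s).values() if d > 0]
hist = sorted(Counter(lengths).items())
print(hist)  # [(1, 3), (2, 3), (3, 1), (4, 1)]
```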


In [None]:
## shortest path lengths from a sample of nodes, GitHub graph
## giant component, so every pair of nodes is connected
sg = GitHubGraph.connected_components().giant()
np.random.seed(12345)
V = np.random.choice(sg.vcount(), size=100, replace=False)  ## sample of vertices
sp = [x for y in sg.distances(source=V) for x in y if x > 0]  ## drop 0-distances (distance to self)
c = Counter(sp)
s = sorted(c.items())

## barplot
fig, ax = plt.subplots()
b = ax.bar([x[0] for x in s], [x[1] for x in s], color="darkgrey")
ax.set_yscale("log")
ax.set_xlabel("path length", fontsize=14)
ax.set_ylabel("number of paths (log scale)", fontsize=14)
plt.title("GitHub graph (ml and web developers)")
# plt.savefig('pathlen_github.eps')
plt.show()


In [None]:
## shortest path lengths from a sample of nodes, Grid network
sg = Grid.connected_components().giant()  ## giant component, so every pair of nodes is connected
np.random.seed(12345)
V = np.random.choice(sg.vcount(), size=100, replace=False)  ## sample
sp = [x for y in sg.distances(source=V) for x in y if x > 0]  ## drop 0-distances (distance to self)
c = Counter(sp)
s = sorted(c.items())

## barplot
fig, ax = plt.subplots()
b = ax.bar([x[0] for x in s], [x[1] for x in s], color="darkgrey", width=1)
ax.set_xlabel("path length", fontsize=14)
ax.set_ylabel("number of paths", fontsize=14)
plt.title("Power Grid Network")
# plt.savefig('pathlen_grid.eps')
plt.show()


## Local clustering coefficient

Below, we compare the average local clustering coefficients as a function
of the node degrees for the GitHub graph. We consider degrees from 10 to 1000.

Looking at a log-log plot, we see a power-law relation between those quantities;
we also compute the regression line for comparison.
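
The reasoning behind the regression is that a power law y = c·x^a becomes a straight line on log-log axes, log y = a·log x + log c, so the slope of a least-squares line fitted to the logs estimates the exponent. The plain-Python sketch below checks this on synthetic data; `fit_power_law` is a hypothetical helper, and the values c = 3 and a = -0.8 are arbitrary choices, not estimates from the GitHub graph.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = a*log(x) + b; returns (a, exp(b))."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    # slope = covariance(log x, log y) / variance(log x)
    a = sum((u - mx) * (v - my) for u, v in zip(lx, ly)) / sum(
        (u - mx) ** 2 for u in lx
    )
    b = my - a * mx
    return a, math.exp(b)


degrees = [10, 20, 50, 100, 200, 500, 1000]
coefs = [3.0 * k ** -0.8 for k in degrees]  # exact synthetic power law

a, c = fit_power_law(degrees, coefs)
print(round(a, 6), round(c, 6))  # -0.8 3.0
```

On real data the points scatter around the line, so the fitted slope is only an estimate and depends on the degree range kept.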


In [None]:
## GitHub graph: build dataframe with degrees and local clustering coefficients
## and compute mean values w.r.t. degree.
df = pd.DataFrame(
    np.array(
        [GitHubGraph.transitivity_local_undirected(mode="zero"), GitHubGraph.degree()]
    ).transpose(),
    columns=["ClustCoef", "degree"],
)
## keep rows where degree is in range [10,1000]
mindeg = 10
maxdeg = 1000
df = df[(df.degree >= mindeg) & (df.degree <= maxdeg)]
df = df.groupby(by="degree").mean()
df.head()


In [None]:
## fit a regression (log-log scale) and plot
X = [np.log(i) for i in df.index]  ## degrees
Y = [np.log(i) for i in df.ClustCoef]
regressor = LinearRegression()
regressor.fit(np.array(X).reshape(-1, 1), Y)

## plot on log-log scale
b = regressor.intercept_
a = regressor.coef_[0]
print("power law exponent:", a)
plt.loglog(df.index, df.ClustCoef, ".-", color="grey")
## since log Y = a log X + b, Y = e^b X^a, we plot:
plt.plot([mindeg, maxdeg], [np.exp(b) * mindeg**a, np.exp(b) * maxdeg**a], color="black")
plt.xlabel("log (degree)", fontsize=14)
plt.ylabel("log (mean local clust. coef.)", fontsize=14)
plt.title("GitHub graph (ml and web developers)")
# plt.savefig('localCC.eps')
plt.show()
