<img src="data/images/lecture-notebook-header.png" />

# Graph Mining: Centrality

In the context of graph mining, centrality refers to the measure of importance or influence of a node within a graph. It quantifies the relative significance or prominence of a node based on its position and connections within the network.

Centrality measures help identify nodes that play crucial roles in various aspects of a graph, such as communication flow, information dissemination, and influence propagation. They are widely used in social network analysis, biological networks, transportation networks, citation networks, and various other fields. There are several different centrality measures commonly used in graph mining, including:

* **Degree Centrality:** It is a simple measure that counts the number of direct connections a node has. Nodes with high degree centrality are considered important hubs within the network.

* **Betweenness Centrality:** This measure quantifies the extent to which a node lies on the shortest paths between other nodes in the graph. Nodes with high betweenness centrality act as bridges or intermediaries between different parts of the network.

* **Closeness Centrality:** It measures how close a node is to all other nodes in terms of the shortest path length. Nodes with high closeness centrality have more efficient access to information or resources in the network.

* **Eigenvector Centrality:** This measure considers both the node's direct connections and the centrality of its neighboring nodes. Nodes with high eigenvector centrality are connected to other highly central nodes, indicating their influence and importance.

* **PageRank Centrality:** Originally developed for ranking web pages, PageRank assigns centrality based on the idea that a node is important if it is connected to other important nodes. It considers the entire graph structure to determine the centrality of each node.

These centrality measures provide different perspectives on the importance of nodes within a graph and can be used to identify key entities, influencers, or critical points within a network. The specific choice of centrality measure depends on the characteristics of the graph and the analysis goals.

## Setting up the Notebook

### Specify how Plots Get Rendered

In [None]:
%matplotlib inline

### Make all Required Imports

In [None]:
import numpy as np
import pandas as pd
import networkx as nx

import matplotlib.pyplot as plt

from networkx.algorithms.centrality import in_degree_centrality, closeness_centrality, betweenness_centrality
from networkx.algorithms.link_analysis.pagerank_alg import pagerank

---

## Generate Graph

Throughout this notebook, we use the example graph from the [PageRank Wikipedia page](https://en.wikipedia.org/wiki/PageRank) which we also used in the lecture. It's small but still "interesting" enough to see the differences between different centrality measures.

### Define Graph Using Adjacency Matrix

There are many ways to define a graph with `networkX`. Here, we simply define the graph via its adjacency matrix, which is a basic 2-dimensional `numpy` array. As our example graph is unweighted, every entry in the adjacency matrix is either 0 (no edge) or 1 (edge). An entry `A[i, j] = 1` encodes a directed edge from node $i$ to node $j$.

In [None]:
A = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
], dtype=float)
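
To double-check the matrix, the following sketch lists the edges it encodes. It assumes the convention that a nonzero entry `A[i, j]` stands for an edge from node `i` to node `j`; the variable name `edges` is only illustrative and not used elsewhere in the notebook.

```python
import numpy as np

# Same adjacency matrix as above: row i marks the targets of node i's outgoing edges
A = np.array([
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
], dtype=float)

# Each nonzero entry A[i, j] is read as a directed edge i -> j
edges = [(int(i), int(j)) for i, j in np.argwhere(A > 0)]

print('Number of edges:', len(edges))
print('Edges:', edges)
```

Note that the matrix is not symmetric: there is an edge from Node 3 to Node 0, but none in the opposite direction.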

### Create Directed Graph Using NetworkX

The `networkX` class `DiGraph` defines directed graphs. We can use the adjacency matrix A to create an object of type `DiGraph` to represent our example graph. All the additional lines are only needed to specify the locations of the nodes when drawing the graph. This makes it easier to compare the graphs with each other, as all nodes will always be in the same position.

In [None]:
## Create graph from adjacency Matrix
G = nx.DiGraph(A)

## Define a position for each node
## (this is not needed and only ensures that the graph looks like the one in the lecture)
fixed_positions = { 0:(1,0), 1:(5,0), 2:(10,0), 3:(0.5,-4), 4:(7,-5),
                    5:(11,-4), 6:(1,-7), 7:(3,-8), 8:(5,-9), 9:(9,-7),
                    10:(11,-6)}

fixed_nodes = fixed_positions.keys()

pos = nx.spring_layout(G, pos=fixed_positions, fixed=fixed_nodes)

### Draw the Graph

`networkX` comes with built-in methods to draw graphs.

In [None]:
plt.figure()
plt.axis('off')
plt.tight_layout()
nx.draw_networkx(G, pos, with_labels=True, font_weight='bold', node_color='#80BFFF')
plt.show()

---

## Degree Centrality

Degree centrality is the most basic way to measure the importance of a node: it simply counts the number of edges the node is connected to. For undirected graphs, the degree centrality $c_d(v_i)$ of a node $v_i$ is just the number of its edges, i.e., the sum over row $i$ of the adjacency matrix:

$$
c_d(v_i) = \sum_{v_j\in V} A[i,j]
$$

Note that for undirected graphs the adjacency matrix A is symmetric. In case of directed graphs, we can distinguish between the InDegree centrality $c_{d\_in}(v_i)$ and the OutDegree centrality $c_{d\_out}(v_i)$ of a node $v_i$:

$$
c_{d\_in}(v_i) = \sum_{v_j\in V} A[j,i]
$$
$$
c_{d\_out}(v_i) = \sum_{v_j\in V} A[i,j]
$$
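
The two formulas translate directly into column and row sums of the adjacency matrix. The following is a minimal sketch, assuming the edge convention `A[i, j] = 1` for an edge from `i` to `j`; the edge list and the names `in_degree` and `out_degree` are only used for this illustration.

```python
import numpy as np

# Directed edges (source, target) of the example graph
edges = [(1, 2), (2, 1), (3, 0), (3, 1), (4, 1), (4, 3), (4, 5), (5, 1), (5, 4),
         (6, 1), (6, 4), (7, 1), (7, 4), (8, 1), (8, 4), (9, 4), (10, 4)]

# Build the adjacency matrix with A[source, target] = 1
A = np.zeros((11, 11))
for source, target in edges:
    A[source, target] = 1

in_degree = A.sum(axis=0)   # column sums: sum over j of A[j, i]
out_degree = A.sum(axis=1)  # row sums: sum over j of A[i, j]

for node in range(len(A)):
    print('Node {}: InDegree = {:.0f}, OutDegree = {:.0f}'.format(node, in_degree[node], out_degree[node]))
```

These are the raw (unnormalized) degree counts, so they are integers.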

### Calculating InDegree Scores

`networkX` provides methods for all types of degree centrality. In the following example, we calculate the InDegree centrality for each node in our directed graph G. In many applications, the InDegree reflects the importance of a node better than the OutDegree. For example, in the case of hyperlinks on the Web, a page with many incoming links is considered more important.

In [None]:
indegree_scores = in_degree_centrality(G)

for node, score in indegree_scores.items():
    print('InDegree score of node {}: {:.3}'.format(node, score))

Perhaps surprisingly, the scores are not integer values. This is because the implementation normalizes the scores by dividing by the maximum possible degree in a simple graph, $n-1$, where $n$ is the number of nodes in G. Of course, we can easily reverse this normalization if needed:

In [None]:
max_degree = len(G.nodes) - 1  # normalization factor n-1

for node, score in indegree_scores.items():
    print('InDegree score of node {}: {:.3}'.format(node, score*max_degree))

### Drawing the Graph

We can draw graph G as above but now set the size of each node to reflect its InDegree score in relation to the other nodes. Note that the `10000` is just a scaling factor.

In [None]:
plt.figure()
plt.axis('off')
plt.tight_layout()
node_sizes = [v*10000 for v in indegree_scores.values()]
nx.draw_networkx(G, pos, with_labels=True, node_size=node_sizes, font_size=16, font_weight='bold', node_color='#80BFFF')
plt.show()

Given the simplicity of InDegree centrality, the scores and the graph are very easy to interpret. Nodes with no incoming edges naturally have a score of 0, while the node with the most incoming edges has the highest score.

---

## PageRank

For InDegree, we made the argument that it can be used to measure the importance of a Web page by counting the number of hyperlinks that point to it. And this is not an unreasonable approach. However, it is also very vulnerable to cheaters/spammers. For example, one can set up many random Web pages with a link to a certain target page to quickly increase its InDegree score.

This is where PageRank comes in. PageRank also uses the number of incoming edges but weighs these edges with respect to the score of the source of an edge. That means if a node has a very low PageRank score, an edge from it to another target node will hardly affect the PageRank score of the target node.

The PageRank centrality scores are defined recursively. Writing $c_{pr}$ for the vector that holds the score $c_{pr}(v_i)$ of every node $v_i$, we have:

$$
c_{pr} = \alpha M c_{pr} + (1-\alpha)E
$$

where $M$ is the transition matrix for graph G (see the lecture slides), $E$ is a vector of size $n$ ($n$ being the number of nodes) with each value set to $1/n$, and $\alpha$ is the so-called damping factor.
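
To make the recursion more tangible, here is a minimal power-iteration sketch in plain `numpy`. It rests on two assumptions that are not spelled out above: $M$ is taken to be column-stochastic (column $i$ holds the probabilities of moving from node $i$ to each other node), and a node without outgoing edges is treated as if it linked to every node with equal probability. The lecture slides may define these details differently, and all variable names are illustrative.

```python
import numpy as np

# Directed edges (source, target) of the example graph
edges = [(1, 2), (2, 1), (3, 0), (3, 1), (4, 1), (4, 3), (4, 5), (5, 1), (5, 4),
         (6, 1), (6, 4), (7, 1), (7, 4), (8, 1), (8, 4), (9, 4), (10, 4)]

n = 11
A = np.zeros((n, n))
for source, target in edges:
    A[source, target] = 1

# Assumed transition matrix: M[j, i] is the probability of moving from i to j
out_degree = A.sum(axis=1)
M = np.zeros((n, n))
for i in range(n):
    if out_degree[i] > 0:
        M[:, i] = A[i, :] / out_degree[i]
    else:
        # Assumption: a node without outgoing edges jumps to any node uniformly
        M[:, i] = 1.0 / n

alpha = 0.85
E = np.full(n, 1.0 / n)
c = np.full(n, 1.0 / n)  # start from the uniform distribution

# Apply the recursive definition until the scores stop changing
for iteration in range(1000):
    c_next = alpha * M @ c + (1 - alpha) * E
    converged = np.abs(c_next - c).sum() < 1e-10
    c = c_next
    if converged:
        break

for node, score in enumerate(c):
    print('PageRank score of node {}: {:.3}'.format(node, score))
```

Because every column of this $M$ sums to 1, the scores keep summing to 1 in every iteration, so they can be read as a probability distribution over the nodes.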


### Calculating the PageRank Scores of G

Of course, `networkX` makes calculating the PageRank centrality scores for all nodes in a graph G easy.

In [None]:
pagerank_scores = pagerank(G, alpha=0.85)

for node, score in pagerank_scores.items():
    print('PageRank score of node {}: {:.3}'.format(node, score))

### Drawing the Graph

We again can draw the graph in such a way that the size of the nodes reflects their PageRank scores.

In [None]:
plt.figure()
plt.axis('off')
plt.tight_layout()
node_sizes = [v*20000 for v in pagerank_scores.values()]
nx.draw_networkx(G, pos, with_labels=True, node_size=node_sizes, font_size=16, font_weight='bold', node_color='#80BFFF')
plt.show()

When comparing this graph to the one above for InDegree centrality, we can see various differences:

* Even nodes with no incoming edges now have a centrality score larger than 0.

* The score of Node 2 is now almost as high as the one for Node 1, although Node 2 has only one incoming edge. This is because this one incoming edge comes from a node with a very high score.

* The score of Node 4 is now much smaller compared to the score of Node 1, because most incoming edges of Node 4 come from nodes with a low score.

---

## Closeness

Closeness centrality considers a node important if its distance to all other nodes is small, that is, if there are short paths between it and all other nodes. The basic definition of Closeness assumes unweighted graphs:

$$
c_{cl}(v) = \frac{N}{\sum_{w \in V} d(v,w)}
$$

where $d(v,w)$ is the length of the shortest path from $v$ to $w$, and $N$ is the number of nodes reachable from $v$.

For directed graphs, the Closeness of a node can differ greatly depending on whether incoming or outgoing edges are used for calculating distances. In the following, however, we only consider the basic case of undirected graphs.
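
The definition can be checked by hand with a few lines of code. The sketch below computes the Closeness scores of the undirected version of the example graph directly from shortest-path lengths, assuming that $N$ does not count $v$ itself. It then illustrates the directed case: as far as the `networkX` documentation states, `closeness_centrality` uses incoming distances on a directed graph, and applying it to the reversed graph gives the outgoing variant. The names `manual_closeness`, `inward` and `outward` are only illustrative.

```python
import networkx as nx

# Directed edges (source, target) of the example graph
edges = [(1, 2), (2, 1), (3, 0), (3, 1), (4, 1), (4, 3), (4, 5), (5, 1), (5, 4),
         (6, 1), (6, 4), (7, 1), (7, 4), (8, 1), (8, 4), (9, 4), (10, 4)]

G_directed = nx.DiGraph(edges)
G_simple = G_directed.to_undirected()

manual_closeness = {}
for v in sorted(G_simple.nodes):
    # Shortest-path lengths from v to every reachable node (v itself has distance 0)
    distances = nx.single_source_shortest_path_length(G_simple, v)
    N = len(distances) - 1  # assumption: v itself is not counted as reachable
    manual_closeness[v] = N / sum(distances.values()) if N > 0 else 0.0

for v, score in manual_closeness.items():
    print('Closeness score of node {}: {:.3}'.format(v, score))

# Directed case: distances along incoming vs. outgoing edges
inward = nx.closeness_centrality(G_directed)
outward = nx.closeness_centrality(G_directed.reverse())
print('Node 0, incoming: {:.3}, outgoing: {:.3}'.format(inward[0], outward[0]))
```

Node 0 shows how much the direction matters: several nodes can reach it, but it has no outgoing edges, so its outgoing Closeness is 0.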


### Convert Directed Graph G to Undirected Graph

Converting a directed graph into an undirected graph in `networkX` only requires calling `to_undirected()` on the directed graph G.

In [None]:
G_undirected = G.to_undirected()

### Calculating the Closeness Scores of G

As usual, `networkX` has us covered.

In [None]:
closeness_scores = closeness_centrality(G_undirected)

for node, score in closeness_scores.items():
    print('Closeness score of node {}: {:.3}'.format(node, score))

### Drawing the Graph

We again can draw the graph in such a way that the size of the nodes reflects their Closeness scores.

In [None]:
plt.figure()
plt.axis('off')
plt.gca().set_xlim(-0.5, 12)
plt.gca().set_ylim(-10, 2)
plt.tight_layout()
nx.draw_networkx(G_undirected, pos, with_labels=True, node_size=[ v*2500 for v in closeness_scores.values()], font_size=16, font_weight='bold', node_color='#80BFFF')
plt.show()

Converting our initial graph G to an undirected one makes all the nodes more "equal". Since the graph is small and rather well connected, the Closeness scores of all nodes are relatively similar. However, some nodes (e.g., Nodes 1 and 4) are directly connected to more nodes than others and therefore have the highest Closeness centrality scores.

---

## Betweenness Centrality

The intuition behind Betweenness centrality is that a node $v$ is important if many shortest paths between other nodes pass through $v$. In other words, removing such nodes would cause the most "disruption" in a graph. Betweenness centrality is directly applicable to directed/undirected and weighted/unweighted graphs:

$$
c_{b}(v) = \sum_{s \neq v \neq t} \frac{\sigma_{st}(v)}{\sigma_{st}}
$$

where the sum runs over all pairs of nodes $s$ and $t$ that are both different from $v$, $\sigma_{st}(v)$ is the number of shortest paths from $s$ to $t$ passing through node $v$, and $\sigma_{st}$ is the total number of shortest paths from $s$ to $t$. In the following, we calculate the Betweenness scores for the undirected graph to compare the results with the Closeness scores.
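
For a graph as small as ours, the formula can be evaluated by brute force. The sketch below enumerates all shortest paths between every pair of nodes in the undirected graph and credits each intermediate node with its share $\sigma_{st}(v)/\sigma_{st}$. It assumes that each unordered pair is counted once and that the normalized score divides by the number of such pairs not involving $v$, which is $(n-1)(n-2)/2$. This is far too slow for large graphs and is only meant to illustrate the definition; `manual_betweenness` is an illustrative name.

```python
from itertools import combinations

import networkx as nx

# Directed edges (source, target) of the example graph
edges = [(1, 2), (2, 1), (3, 0), (3, 1), (4, 1), (4, 3), (4, 5), (5, 1), (5, 4),
         (6, 1), (6, 4), (7, 1), (7, 4), (8, 1), (8, 4), (9, 4), (10, 4)]

G_simple = nx.DiGraph(edges).to_undirected()
n = G_simple.number_of_nodes()

manual_betweenness = {v: 0.0 for v in sorted(G_simple.nodes)}
for s, t in combinations(sorted(G_simple.nodes), 2):
    paths = list(nx.all_shortest_paths(G_simple, s, t))
    for path in paths:
        # Only intermediate nodes get credit, not the endpoints s and t
        for v in path[1:-1]:
            manual_betweenness[v] += 1.0 / len(paths)

# Normalize by the number of node pairs that do not include v
pair_count = (n - 1) * (n - 2) / 2
manual_betweenness = {v: score / pair_count for v, score in manual_betweenness.items()}

for v, score in manual_betweenness.items():
    print('Betweenness score of node {}: {:.3}'.format(v, score))
```

When two nodes are connected by several shortest paths, each path only contributes a fraction, which is why the scores are not simple path counts.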

### Calculating the Betweenness Scores of G

As usual, `networkX` has us covered.

In [None]:
betweenness_scores = betweenness_centrality(G_undirected, normalized=True)

for node, score in betweenness_scores.items():
    print('Betweenness score of node {}: {:.3}'.format(node, score))

### Drawing the Graph

We again can draw the graph in such a way that the size of the nodes reflects their Betweenness scores.

In [None]:
plt.figure()
plt.axis('off')
plt.gca().set_xlim(-0.5, 12)
plt.gca().set_ylim(-10, 2)
plt.tight_layout()
nx.draw_networkx(G_undirected, pos, with_labels=True, node_size=[ v*2500 for v in betweenness_scores.values()], font_size=16, font_weight='bold', node_color='#80BFFF')
plt.show()

Unsurprisingly, nodes with only one edge (Nodes 0, 2, 9 and 10) have a Betweenness score of 0, as they can only be the start or end of a path but never lie between two other nodes. Nodes 5, 6, 7 and 8 also have a Betweenness score of 0: each of them is connected only to Nodes 1 and 4, which are directly connected to each other, so no shortest path between two other nodes passes through them. All shortest paths with intermediate nodes go through Nodes 1, 3 or 4, which are therefore the only nodes with a Betweenness score larger than 0.

---

## Summary

Centrality measures play a crucial role in graph mining by quantifying the importance and influence of nodes within a network. These measures provide valuable insights into the structure and dynamics of the graph, allowing us to identify key entities, influencers, and critical points. However, each centrality measure has its own strengths and limitations.

Degree centrality is a simple and straightforward measure that counts the number of direct connections a node has. It is easy to compute and interpret, making it widely used. However, it fails to capture the importance of indirect connections and does not consider the centrality of neighboring nodes.

Betweenness centrality, on the other hand, identifies nodes that act as bridges or intermediaries between different parts of the network. It captures the node's role in information flow or communication. However, computing betweenness centrality for large graphs can be computationally expensive, and because it only considers shortest paths, it may be less suitable for networks where information also flows along longer alternative paths.

Closeness centrality measures how close a node is to all other nodes in terms of the shortest path length. It identifies nodes with efficient access to information or resources. Closeness centrality is straightforward to compute and provides a more nuanced view than degree centrality. However, it may not be suitable for disconnected graphs or graphs with unreachable nodes.

Eigenvector centrality considers both a node's direct connections and the centrality of its neighbors. It captures the influence and importance of nodes that are connected to other highly central nodes. However, calculating eigenvector centrality requires iterative methods and can be computationally intensive.

PageRank centrality, originally designed for ranking web pages, assigns centrality based on a node's connections to other important nodes. It provides a global view of the graph structure but may be influenced by link spamming or manipulation.

In summary, centrality measures offer valuable insights into graph mining, but it is important to choose the appropriate measure based on the specific characteristics of the graph and the analysis goals. Different centrality measures provide different perspectives on node importance, and understanding their pros and cons helps in making informed decisions when analyzing complex networks.