# Chapter 3 - Centrality Measures

In this notebook, we explore various centrality measures on a weighted, directed graph which represents the volume of passengers between US airports in 2008. 

As with the previous notebooks, make sure to set the data directory properly in the next cell.

In [None]:
datadir = '../Datasets/'

In [None]:
import igraph as ig
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
from statistics import mode

In [None]:
## define edge color
cls_edges = 'gainsboro'

## we will consider 3 types of nodes with the following colors and sizes:
cls = ['silver','dimgray','black']
sz = [6,9,12]

## US Airport Graph -  Volume of Passengers

The nodes are represented by the 3-letter airport codes such as LAX (Los Angeles); we also read in the volume of passengers that we use as **edge weights**. The edges are directed.



In [None]:
## read edges and build weighted directed graph
D = pd.read_csv(datadir+'Airports/connections.csv')
g = ig.Graph.TupleList([tuple(x) for x in D.values], directed=True, edge_attrs=['weight'])
D.head() ## look at a few edges

### Read node attributes

We read the node attributes in data frame A:
* lat/lon, which we will use as the graph layout
* state (2-letter code)
* city

In [None]:
## read vertex attributes and add to graph
A = pd.read_csv(datadir+'Airports/airports_loc.csv')
lookup = {k:v for v,k in enumerate(A['airport'])}
l = [lookup[x] for x in g.vs()['name']]
g.vs()['layout'] = [(A['lon'][i],A['lat'][i]) for i in l]
g.vs()['state'] = [A['state'][i] for i in l]
g.vs()['city'] = [A['city'][i] for i in l]
A.head() ## first few rows in A

In [None]:
## add a few more attributes for visualization
g.vs()['size'] = sz[1]
g.vs()['color'] = cls[1]
g.es()['color'] = cls_edges
g.es()['arrow_size'] = 0.33
print(g.vcount(),'nodes and',g.ecount(),'directed edges')

### Check for loops and multiple edges

There are no multi-edges (not surprising, since the edges are weighted), but there are some loops in the raw data,
for example:
``` 
SEA,SEA,69
```
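
As a minimal sketch of what the two checks mean, the snippet below scans a small hand-made edge list for self-loops and for repeated source/destination pairs. Apart from the `SEA,SEA,69` record quoted above, the edges and weights are made up for illustration, and only the standard library is used.

```python
from collections import Counter

# toy (source, destination, weight) records; values other than the SEA loop are made up
edges = [("SEA", "SEA", 69), ("SEA", "LAX", 500), ("LAX", "SEA", 480), ("SEA", "LAX", 20)]

# a loop is an edge whose source and destination coincide
loops = [e for e in edges if e[0] == e[1]]

# a multi-edge is a (source, destination) pair that appears more than once
pair_counts = Counter((s, d) for s, d, _ in edges)
multi = {pair: k for pair, k in pair_counts.items() if k > 1}

print("loops:", loops)
print("repeated pairs:", multi)
```

Note that `SEA -> LAX` and `LAX -> SEA` are different directed edges, so only the repeated `SEA -> LAX` pair counts as a multi-edge here.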

In [None]:
print('number of loop edges:',sum(g.is_loop()))
print('number of multiple edges:',sum(g.is_multiple()))

## Connected components

The graph is weakly connected (that is, ignoring directionality) except for 2 airports, DET and WVL, which are connected to each other by a single directed edge.

With strong connectivity, the giant component has size 425.
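
To illustrate the difference between the two notions, here is a small sketch on a toy directed graph (not the airport data). The helper `reachable` and the node names are hypothetical. The weak component of a node is found by ignoring edge directions, while its strong component contains only the nodes that can both be reached from it and reach it back.

```python
def reachable(adj, start):
    """Return the set of nodes reachable from start by following adj."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

# toy directed graph: a 3-cycle a->b->c->a plus a one-way spur c->d
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
out_adj, in_adj, und_adj = {}, {}, {}
for u, v in edges:
    out_adj.setdefault(u, []).append(v)
    in_adj.setdefault(v, []).append(u)
    und_adj.setdefault(u, []).append(v)
    und_adj.setdefault(v, []).append(u)

weak = reachable(und_adj, "a")                             # ignore direction
strong = reachable(out_adj, "a") & reachable(in_adj, "a")  # both directions required
print(sorted(weak))    # all four nodes
print(sorted(strong))  # only the cycle
```

Node `d` can be reached from the cycle but cannot return to it, so it belongs to the weak component only.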
 

In [None]:
## count the number of nodes in the giant component (weak connectivity)
print(g.connected_components(mode='WEAK').giant().vcount(),'out of',g.vcount(),'are in giant (weak) component')
print(g.connected_components(mode='STRONG').giant().vcount(),'out of',g.vcount(),'are in giant (strong) component')

In [None]:
## which airports are NOT weakly connected to the rest?
cl = g.connected_components(mode='WEAK').membership
giant = mode(cl) ## giant component
for i in range(g.vcount()):
    if cl[i] != giant:
        print(g.vs[i]['name'],'which has in degree',g.degree(i,mode='IN'),'and out degree',g.degree(i,mode='OUT'))   

### A few more statistics

We look at coreness; mode = 'ALL' means that we merge in- and out-edges, so this is undirected coreness.
We see a group of nodes with very high coreness: a group of highly connected hub airports.
There are also several nodes with low coreness: more peripheral airports.

We also plot the degree distribution, again with mode='ALL' (total degree, in and out).
Which airport has maximal degree?
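
For readers who want to see what coreness measures, the sketch below implements the classic peeling procedure on a toy undirected simple graph: repeatedly remove a node of minimum remaining degree and record the largest degree seen so far. The function name and the toy graph are assumptions for illustration. This is not igraph's implementation, and with mode='ALL' on a directed graph a pair of opposite edges counts twice, which this simplified version does not model.

```python
def coreness(adj):
    """Core number of each node of an undirected simple graph {node: set(neighbours)}."""
    adj = {u: set(nb) for u, nb in adj.items()}   # work on a copy
    core, k = {}, 0
    while adj:
        u = min(adj, key=lambda x: len(adj[x]))   # node of minimum remaining degree
        k = max(k, len(adj[u]))                   # the core level never decreases
        core[u] = k
        for v in adj.pop(u):                      # peel u off the graph
            adj[v].discard(u)
    return core

# toy graph: triangle a-b-c with a pendant node d attached to c
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
core = coreness(toy)
print(core)
```

The triangle forms the 2-core, while the pendant node only belongs to the 1-core.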

In [None]:
gc = g.coreness(mode='ALL')
plt.hist(gc);

In [None]:
## print a few airports with maximal coreness:
mc = np.max(gc)
top = 0
for i in range(g.vcount()):
    if gc[i] == mc:
        print(g.vs[i]['name'])
        top += 1
        if top==5:
            break


In [None]:
## degree distribution
gd = g.degree(mode='ALL')
plt.hist(gd, bins=20);

In [None]:
## max degree airport
print('max degree for:',g.vs[np.argmax(gd)]['name'])

## California Subgraph 

We now look at several centrality measures. To speed up the computation and plotting, we consider only the airports in California, and the edges within the state.

You can try other states by changing the first line below.
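
The construction can be sketched without igraph on a toy edge list (the state assignments and routes below are illustrative only): keep the routes with both endpoints in the chosen state, discard loops, then drop airports left without any in-state route. Here loops are removed before the isolated-airport check, which is a simplification and may differ from the order used in the notebook cell.

```python
# toy data: airport -> state, and directed (source, destination) routes
state = {"LAX": "CA", "SFO": "CA", "SMF": "CA", "SEA": "WA"}
routes = [("LAX", "SFO"), ("SFO", "LAX"), ("LAX", "SEA"), ("SEA", "SMF"), ("SFO", "SFO")]

keep = {a for a, s in state.items() if s == "CA"}

# induced subgraph: keep only routes with both endpoints in the state, drop loops
sub = [(u, v) for u, v in routes if u in keep and v in keep and u != v]

# drop airports left without any in-state route
active = {x for e in sub for x in e}
print(sub, sorted(active))
```

In this toy example SMF is in the state but has no in-state route, so it is dropped.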


In [None]:
## Build smaller subgraph 'G' for California
G = g.subgraph([v for v in g.vs() if v['state'] == 'CA'])

## drop isolated vertices (i.e. without in-state connections)
G = G.subgraph([v for v in G.vs() if v.degree()>0])

## remove loops if any
G = G.simplify(multiple=False)
print(G.vcount(),'nodes and',G.ecount(),'directed edges')


In [None]:
## The graph is weakly connected except for 2 airports
## We color those in red for now
cl = G.connected_components(mode='WEAK').membership
giant = mode(cl)
for i in range(G.vcount()):
    if cl[i] != giant:
        print(G.vs[i]['name'],'which has in degree',G.degree(i,mode='IN'),'and out degree',G.degree(i,mode='OUT'))
        G.vs[i]['color'] = 'red'

In [None]:
## plot using lat/lon as layout
ly = ig.Layout(G.vs['layout'])
## y-axis goes top-down thus the inversion
ly.mirror(1)
ig.plot(G, bbox=(0,0,300,300), layout=ly)

## Centrality measures

Most measures defined in Chapter 3 of the book are available directly in igraph.

We compute the following centrality measures for the weighted graph G:
**PageRank**, **Authority** and **Hub**.

For **degree centrality**, we define our own function below (directed degree centrality) and we normalize the weights to get values bounded above by 1.
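
As a worked example of this normalization, the sketch below computes the directed degree centrality by hand on three toy edges with made-up weights. Each weight is divided by the maximum weight, in- and out-strengths are added, and the sum is divided by 2(n-1), the largest possible number of incident edges in a simple directed graph.

```python
# toy weighted directed edges (made-up passenger volumes)
edges = [("A", "B", 100), ("B", "A", 50), ("A", "C", 25)]
nodes = sorted({x for s, d, _ in edges for x in (s, d)})
n = len(nodes)

max_w = max(w for _, _, w in edges)    # the largest weight maps to 1.0
strength = dict.fromkeys(nodes, 0.0)   # in-strength + out-strength
for s, d, w in edges:
    strength[s] += w / max_w
    strength[d] += w / max_w

# at most 2*(n-1) incident edges (in and out), each of normalized weight <= 1
dc = {v: strength[v] / (2 * (n - 1)) for v in nodes}
print(dc)
```

Since every normalized weight is at most 1, no node can score above 1.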

For the distance-based centrality measures **closeness** and **betweenness**, we do not use the edge weights, so the distance between nodes is the number of hops, and not based on the number of passengers. This is a natural choice here, since distance between airports (cities) can be viewed as the number of flights needed to travel between those cities.

We compute the above centrality for every node in the CA subgraph.

#### Warning for disconnected graphs

We get a warning when running closeness centrality, since the graph is not connected. 
Here are the details of what is going on from the help file:

*If the graph is not connected, and there is no path between two
vertices, the number of vertices is used instead the length of
the geodesic. This is always longer than the longest possible
geodesic.*
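
The rule quoted above can be sketched in pure Python as follows. This is an illustration of that rule on a toy graph, not igraph's code, and the igraph version you have installed may handle unreachable vertices differently. The helper names are hypothetical.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def closeness(adj, nodes, v):
    """(n-1) / sum of hop distances, with n standing in for unreachable nodes."""
    n = len(nodes)
    dist = hop_distances(adj, v)
    total = sum(dist.get(u, n) for u in nodes if u != v)
    return (n - 1) / total

nodes = ["a", "b", "c", "d"]
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}   # "d" is isolated
print({v: round(closeness(adj, nodes, v), 3) for v in nodes})
```

Under this convention an isolated node still gets a positive score (here 3/12), which is why closeness values on disconnected graphs should be read with care.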

In [None]:
## compute normalized weights 
mw = np.max(G.es['weight'])
G.es()['normalized_weight'] = [w/mw for w in G.es()['weight']]

## directed degree centrality
def degree_centrality(g, weights=None):
    n = g.vcount()
    if g.is_directed():
        dc = [sum(x)/(2*(n-1)) for x in zip(g.strength(mode='in',weights=weights),\
              g.strength(mode='out',weights=weights))]
    else:
        dc = [x/(n-1) for x in g.strength(weights=weights)]
    return dc

In [None]:
## compute several centrality measures for the CA subgraph G
C = pd.DataFrame({'airport':G.vs()['name'],\
                  'degree':degree_centrality(G,weights='normalized_weight'),\
                  'pagerank':G.pagerank(weights='weight'),'authority':G.authority_score(weights='weight'),\
                  'hub':G.hub_score(weights='weight'),'between':G.betweenness(),\
                  'closeness':G.closeness()})

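## normalize betweenness by 2/((n-1)*(n-2)); this is the usual scaling for undirected
## graphs, so on this directed graph the values can exceed 1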
n = G.vcount()
C['between'] = [2*x/((n-1)*(n-2)) for x in C['between']]

## sort w.r.t. degree centrality, look at top airports
Cs = C.sort_values(by='degree', ascending=False)
Cs.head()


In [None]:
## bottom ones
Cs.tail()


#### Top airports

The above results agree with intuition in terms of the most central airports in California.
Note however that SAN (San Diego) has high values *except* for betweenness, an indication that connecting flights transit mainly via LAX or SFO. 

Below, we plot the CA graph again, highlighting the top-3 airports w.r.t. pagerank: LAX, SFO, SAN.

In [None]:
## reset node colours
G.vs()['color'] = cls[1]

## highlight top-3 airports w.r.t. pagerank
G.vs()['prk'] = C['pagerank']
for x in np.argsort(G.vs()['prk'])[-3:]:
    G.vs()[x]['color'] = cls[2]
    G.vs()[x]['size'] = sz[2]

#ig.plot(G,'California.eps',bbox=(0,0,300,300),layout=ly)
ig.plot(G,bbox=(0,0,300,300),layout=ly)


## Correlation between measures

We use the Kendall tau (rank-based) correlation measure below.

We observe high agreement between all measures.
In particular, degree-centrality, hub and authority measures are very highly correlated,
and so are the distance-based measures (betweenness, closeness).
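
As a reminder of what this coefficient measures, here is a minimal sketch for two score lists without ties: count the pairs of items that are ranked in the same order by both lists (concordant) and those ranked in opposite orders (discordant). The function name and the data are illustrative. The pandas routine used below treats ties differently, so the two only coincide when there are no ties.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall tau for two equal-length score lists without ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        s = (x[i] - x[j]) * (y[i] - y[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

a = [0.9, 0.5, 0.3, 0.1]
b = [0.8, 0.6, 0.1, 0.2]   # same order as a, except that the last two are swapped
print(kendall_tau(a, b))
```

Only one of the six pairs is discordant, giving (5 - 1)/6.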

In [None]:
## rank-based correlation between measures
df = C.corr('kendall', numeric_only=True)
df

## Looking at coreness

We already looked at coreness for the whole US graph, now we look at the CA subgraph, again with mode='ALL'.

Below we show nodes with max coreness as larger black dots, and nodes with minimal coreness as smaller dots.

In [None]:
## plot nodes w.r.t. coreness
G.vs['color'] = cls[1]
G.vs['size'] = sz[1]
G.vs()['core'] = G.coreness()
Mc = np.max(G.vs()['core'])
mc = np.min(G.vs()['core'])
for v in G.vs():
    if v['core'] == Mc:
        v['color'] = cls[2]
        v['size'] = sz[2]
    if v['core'] <= mc+1:
        v['color'] = cls[0]
        v['size'] = sz[0]
#ig.plot(G,"California_coreness.eps",bbox=(0,0,300,300),layout=ly)
ig.plot(G,bbox=(0,0,300,300),layout=ly)

The above uses the geographical layout, so it is not clear what is going on.

Let's use a force-directed layout to make the difference between high and low core number clearer. 

The high coreness nodes are clearly seen, and we also observe a small connected component that was buried in the previous visualization.


In [None]:
## Coreness is clearer here
c = [1 if v['core']==Mc else 2 if v['core']==mc else 0 for v in G.vs()]
ly = G.layout_kamada_kawai()
#ig.plot(G,"California_kamada.eps",bbox=(0,0,300,300),layout=ly)
ig.plot(G,bbox=(0,0,300,300),layout=ly)

In [None]:
## vertices with max coreness (13-core) 
## note that there are fewer than 14 nodes; this is possible because, by default,
## both incoming and outgoing edges are counted for a directed graph.
V = [v['name'] for v in G.vs() if v['core']==Mc]
print(V)

### Looking at closeness centrality

Using the same layout as above (high coreness nodes in the middle), we display the closeness centrality scores.

Recall the warning -- this is normal for disconnected graphs.


In [None]:
## show closeness centralities, same layout
ix = np.round(G.closeness(),decimals=2)
G.vs['size'] = 3
#ig.plot(G,"California_closeness.eps",vertex_label=ix,layout=ly,bbox=(0,0,300,300))
ig.plot(G,vertex_label=ix,layout=ly,bbox=(0,0,300,300))

### Comparing coreness with other centrality measures

We add coreness to the data frame with centrality measures (C).

We then group the data in 3 categories: high coreness (13), low (2 or less) or mid-range, and we compute and plot the mean for every other measure.

We see that for all centrality measures except closeness centrality, the values are clearly higher for nodes with high coreness.

The slightly higher pagerank value for 'low' coreness nodes vs 'mid' ones is due to the two airports that are not part of the giant component.

As expected, nodes with small coreness generally have smaller centrality scores. 
This is why we often remove the small core nodes (for example, keeping only the 2-core) to reduce
the size of large graphs without destroying their main structure.


In [None]:
## group in 3 categories
G.vs()['Core'] = ['low' if v['core']<=2 else 'high' if v['core']==13 else 'mid' for v in G.vs()]
C['Coreness'] = G.vs['Core']
df = C.groupby('Coreness').mean(numeric_only=True)
df.sort_values(by='degree',inplace=True,ascending=False)
df

In [None]:
## grouped barplot
bh = np.array(df.loc[['high']])[0]
bm = np.array(df.loc[['mid']])[0]
bl = np.array(df.loc[['low']])[0]
barWidth = 0.25

# Set position of bar on X axis
r1 = np.arange(len(bh))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]

# Make the plot
plt.bar(r1, bh, color=cls[2], width=barWidth, edgecolor='white', label='high coreness')
plt.bar(r2, bm, color=cls[1], width=barWidth, edgecolor='white', label='mid coreness')
plt.bar(r3, bl, color=cls[0], width=barWidth, edgecolor='white', label='low coreness')
 
# Add xticks on the middle of the group bars
plt.xlabel('measure',fontsize=14)
plt.xticks([r + barWidth for r in range(len(bh))], df.columns, fontsize=10)
plt.ylabel('score',fontsize=14) 

# Create legend & Show graphic
plt.legend(fontsize=12);

# un-comment to save in file
#plt.savefig('California_core_vs_measures.eps',dpi=1200)

## Delta-centrality example

This is the simple pandemic example detailed in the book:

*The pandemic starts at exactly one airport selected uniformly at random from all the airports. Then, the following rules for spreading are applied: (i) in a given airport pandemic lasts only for one round and (ii) in the next round, with probability α, the pandemic spreads independently along the flight routes to the destination airports for all connections starting from this airport. Airports can interact with the pandemic many times, and the process either goes on forever or the pandemic eventually dies out. Our goal is to find the expected number of times a given airport interacted with the pandemic, which amounts to the sum over all airports of the expected number of times this airport has the pandemic.*

We use alpha = 0.1 and plot the (decreasing) delta centrality values in a barplot, using the same 3 colors as with the coreness plot above.
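
The quantity described above can be approximated by summing the expected counts round by round. The sketch below does this on a toy graph with two airports and flights in both directions. The function name and the number of rounds are assumptions. The sum is a truncated version of the series of powers of alpha times the transposed adjacency matrix, which converges to the matrix inverse used in the notebook function when alpha is small enough.

```python
def spread_series(adj, alpha=0.1, rounds=200):
    """Sum the expected number of infected airports over a fixed number of rounds."""
    n = len(adj)
    x = [1.0 / n] * n          # round 0: one uniformly random starting airport
    total = sum(x)
    for _ in range(rounds):
        nxt = [0.0] * n
        for i in range(n):
            for j in adj[i]:   # each route i -> j transmits with probability alpha
                nxt[j] += alpha * x[i]
        x = nxt
        total += sum(x)
    return total

cycle = {0: [1], 1: [0]}       # two airports with flights both ways
print(spread_series(cycle, alpha=0.1))   # close to 1 / (1 - 0.1)
```

For this 2-cycle the expected count in round k is alpha to the power k, so the total is a geometric series with limit 1/(1 - alpha).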

In [None]:
## Delta-centrality with a simple pandemic spread model
def spread(g, alpha=0.1):
    n = g.vcount()
    I = np.diag(np.repeat(1,n))
    A = np.array(g.get_adjacency().data)
    One = np.ones((n,1))
    X = np.linalg.inv(I-alpha*np.transpose(A))
    Y = np.reshape(X.dot(One)/n,n)
    return np.sum(Y)

def spread_delta_centrality(g, alpha=0.1):
    dc = []
    spr = spread(g, alpha=alpha)
    ## print(spr) # P(G) in the book
    for i in g.vs():
        G = g.copy()
        el = g.incident(i, mode='ALL')
        G.delete_edges(el)
        dc.append((spr-spread(G, alpha=alpha))/spr)
    return dc

In [None]:
## compute with alpha = 0.1, show top airports
G.vs['delta'] = spread_delta_centrality(G, alpha=.1)
DC = pd.DataFrame(np.transpose([G.vs['name'],G.vs['delta'],G.vs['color']]),columns=['airport','delta','color'])
DC.sort_values(by='delta',ascending=False, inplace=True)
DC.head()

In [None]:
## plot using the same colors as with coreness plot
heights = [float(x) for x in DC['delta']]
bars = DC['airport']
y_pos = range(len(bars))
plt.bar(y_pos, heights, color=DC['color'] )

# Rotation of the bars names
plt.ylabel('Delta Centrality',fontsize=12)
plt.xticks(y_pos, bars, rotation=90)
plt.yticks();


#plt.savefig('California_delta.eps',dpi=1200)

## Group centrality and centralization

We go back to the full US airports graph, and ask the following questions:

* which states have highest delta centralities w.r.t. efficiency?
* what about centralization for each state subgraph?

Computing efficiency involves the computation of shortest path lengths, which will cause a warning if the graph is disconnected. Warnings can be turned off by un-commenting the next cell.
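
To make the two definitions concrete, here is a small pure-Python sketch on a toy path graph. Efficiency is the average of 1/d(i, j) over ordered pairs of distinct nodes, with unreachable pairs contributing 0, and the delta centrality of a node (or group) is the relative drop in efficiency once all of its incident edges are removed. The function and variable names are illustrative.

```python
from collections import deque

def efficiency(adj, nodes):
    """Average of 1/d(i, j) over ordered pairs i != j; unreachable pairs contribute 0."""
    n = len(nodes)
    total = 0.0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:                       # breadth-first search from s
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(1 / d for d in dist.values() if d > 0)
    return total / (n * (n - 1))

nodes = ["a", "b", "c"]
path = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}   # a - b - c, both directions
no_b = {"a": [], "b": [], "c": []}                   # every edge touching b removed

p_full, p_cut = efficiency(path, nodes), efficiency(no_b, nodes)
print((p_full - p_cut) / p_full)   # delta centrality of b
```

Removing the middle node's edges disconnects everything, so its delta centrality is 1, the largest possible value.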

In [None]:
## efficiency function given g
def efficiency(g):
    n = g.vcount()
    s = 0
    for i in range(n):
        v = g.get_shortest_paths(i)
        s += np.sum([1/(len(x)-1) for x in v if len(x) > 1])
    return s/(n*(n-1))

## group delta centrality -- compute for each state
states = list(set(g.vs()['state']))
eff_us = efficiency(g)
dc = []
for s in states:
    v = [x for x in g.vs() if x['state']==s]
    G = g.copy()
    e = []
    for x in v:
        e.extend(g.incident(x, mode='ALL'))
    G.delete_edges(e)
    dc.append((eff_us-efficiency(G))/eff_us)

## sort and show top-3
DC = pd.DataFrame({'state':states, 'delta_centrality':dc})
DC = DC.sort_values(by='delta_centrality', ascending=False)
DC.head(3)


In [None]:
## ... and bottom 3
DC.tail(3)


For group centralization, we use the PageRank measure.
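
The centralization score used in the next cell is the gap between the largest PageRank value and the mean value in each state subgraph. The sketch below illustrates this on two toy graphs with a simple unweighted power iteration. This is an illustrative implementation, not igraph's. It assumes every node has at least one out-edge and uses a damping factor of 0.85.

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power iteration for PageRank; assumes every node has at least one out-edge."""
    nodes = list(adj)
    n = len(nodes)
    pr = dict.fromkeys(nodes, 1.0 / n)
    for _ in range(iters):
        nxt = dict.fromkeys(nodes, (1 - damping) / n)
        for u in nodes:
            share = damping * pr[u] / len(adj[u])   # split u's score among its out-edges
            for v in adj[u]:
                nxt[v] += share
        pr = nxt
    return pr

def centralization(scores):
    """Gap between the largest score and the average score."""
    vals = list(scores.values())
    return max(vals) - sum(vals) / len(vals)

star = {"hub": ["x", "y", "z"], "x": ["hub"], "y": ["hub"], "z": ["hub"]}
ring = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["a"]}
print(centralization(pagerank(star)), centralization(pagerank(ring)))
```

The star, with a single dominant hub, gets a clearly positive score, while the ring, where all nodes play the same role, scores essentially zero.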

In [None]:
## group centralization (using PageRank) -- by state
states = list(set(g.vs()['state']))
pr = []
st = []
for s in states:
    v = [x for x in g.vs() if x['state']==s]
    if len(v)>5: ## look at states with more than 5 airports only
        G = g.subgraph(v)
        G = G.simplify(multiple=False) ## drop self-loops
        p = G.pagerank(weights='weight')
        pr.append(np.max(p) - np.mean(p))
        st.append(s)

## sort and show top-3
DC = pd.DataFrame({'State':st, 'Pagerank Centralization':pr})
DC = DC.sort_values(by='Pagerank Centralization', ascending=False)
DC.head(3)


We plot the state with highest PageRank centralization (Michigan).

This is a state with one high-degree airport (DTW).

In [None]:
v = [x for x in g.vs() if x['state']=='MI']
G = g.subgraph(v)
G = G.subgraph([v for v in G.vs() if v.degree()>0])
G = G.simplify(multiple=False)
#ig.plot(G, 'central_MI.eps', bbox=(0,0,300,300))
ig.plot(G,bbox=(0,0,300,300))


In [None]:
## one big hub city: Detroit
G.vs['deg'] = G.degree() # overall degree
for v in G.vs:
    print(v['city'],v['name'],'has degree',v['deg'])

We plot the state with lowest PageRank centralization (ND).

This is a state without a high-degree (hub) airport.

In [None]:
## now the bottom 3
DC.tail(3)


In [None]:
v = [x for x in g.vs() if x['state']=='ND']
G = g.subgraph(v)
G = G.subgraph([v for v in G.vs() if v.degree()>0])
G = G.simplify(multiple=False)

#ig.plot(G, 'central_ND.eps', bbox=(0,0,300,300))
ig.plot(G, bbox=(0,0,300,300))

In [None]:
## no big hub city here
G.vs['city']

What should we expect for California? There are hub airports, but several of them. 

In [None]:
## what about California?
DC[DC['State']=='CA']