# Chapter 3 - Centrality Measures

In this notebook, we explore various centrality measures on a **weighted**, **directed** graph representing the volume of passengers between US airports in 2008. The dataset is available at https://www.kaggle.com/flashgordon/usa-airport-dataset#Airports2.csv and is part of the Kaggle Public Datasets: https://www.kaggle.com/datasets

As with the previous notebooks, make sure to set the data directory properly in the next cell.

In [None]:
datadir = "../Datasets/"

In [None]:
using Graphs
using SimpleWeightedGraphs
using DataFrames
using CSV
using PyPlot
using GraphPlot
using LinearAlgebra
using StatsBase
using Random
using CategoricalArrays

In [None]:
ENV["COLUMNS"] = 1000

## US Airport Graph

### Volume of Passengers

The nodes are represented by the 3-letter airport codes such as LAX (Los Angeles);
each row below gives the number of passengers from ```orig_airport``` to ```dest_airport```.
The last column is the volume of passengers, which we use as **edge weights**. Thus we build a weighted, directed graph.


In [None]:
## read edges and build weighted directed graph
D = CSV.read(datadir * "Airports/connections.csv", DataFrame)
first(D, 5)

In [None]:
# normalize weights
max_passengers = maximum(D.total_passengers)
D.total_passengers /= max_passengers
extrema(D.total_passengers)

In [None]:
id2name = sort!(unique(union(D.orig_airport, D.dest_airport)))
name2id = Dict(id2name .=> axes(id2name, 1))
g = SimpleWeightedDiGraph(length(id2name))
for row in eachrow(D)
    from = name2id[row.orig_airport]
    to = name2id[row.dest_airport]
    from == to || add_edge!(g, from, to, row.total_passengers)
end
g

In [None]:
A = CSV.read(datadir * "Airports/airports_loc.csv", DataFrame)
A.id = [name2id[a] for a in A.airport]
@assert A.id == axes(A, 1)
@assert A.airport == id2name
first(A, 5)

### Check for loops and multiple edges

There are no multiedges (not surprising, since edges are weighted here), but there are some loops in the raw data,
i.e. rows with the same origin and destination airport. However, loops are not supported by SimpleWeightedDiGraph and are automatically removed when operating on the graph; the construction loop above also skips them explicitly (```from == to || add_edge!(...)```).

In [None]:
D[D.orig_airport.==D.dest_airport, :]

In [None]:
length([e for e in edges(g) if src(e) == dst(e)])

## Connected components

A (sub)graph is **weakly connected** if there is a path between any pair of nodes when we ignore the edge direction (i.e. treat the directed graph as undirected). The airport graph is weakly connected except for 2 airports, DET and WVL, which are connected to each other by a single directed edge.

A (sub)graph is **strongly connected** if there is a directed path from each node to every other node. The airport graph is not strongly connected. The largest strongly connected component has size 425.
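
To make the two notions concrete, here is a minimal sketch on a hypothetical 4-node toy digraph (the name `toy_cc` and its edges are illustrative assumptions, not part of the airport data). It uses the same Graphs.jl functions as the cell that follows.

```julia
using Graphs

# Hypothetical toy digraph: 1 <-> 2, 2 -> 3, 3 -> 4
toy_cc = SimpleDiGraph(4)
add_edge!(toy_cc, 1, 2)
add_edge!(toy_cc, 2, 1)
add_edge!(toy_cc, 2, 3)
add_edge!(toy_cc, 3, 4)

toy_weak = weakly_connected_components(toy_cc)      # edge directions ignored
toy_strong = strongly_connected_components(toy_cc)  # directed paths needed both ways

println(length(toy_weak), " weak component(s), ", length(toy_strong), " strong component(s)")
println("largest strong component has size ", maximum(length.(toy_strong)))
```

Nodes 3 and 4 can be reached from nodes 1 and 2 but cannot get back, so each of them forms a strong component of its own, while the whole toy graph is a single weak component.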
 

In [None]:
## count the number of nodes in the giant component (strong and weak connectivity)
scomp = strongly_connected_components(g)
println(
    maximum(length.(scomp)),
    " out of ",
    nv(g),
    " are in giant (strong) component",
)
wcomp = weakly_connected_components(g)
println(
    maximum(length.(wcomp)),
    " out of ",
    nv(g),
    " are in giant (weak) component",
)

In [None]:
## which two airports are NOT weakly connected to the rest of the graph?
giant = wcomp[argmax(length.(wcomp))]  ## giant component
println("Disconnected airports:")
for i in 1:nv(g)
    if !(i in giant)
        println(
            A[i, "airport"],
            " has in degree ",
            indegree(g, i),
            " and out degree ",
            outdegree(g, i),
        )
    end
end

### Coreness

We now look at coreness (considering both in and out edges).
We see a group of nodes with very high coreness: highly connected hub airports (such as 'SFO', 'LAX', 'ATL', etc.).
There are also several nodes with low coreness: peripheral airports.
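
As a reminder, the k-core is the largest subgraph in which every node has degree at least k, and the coreness of a node is the largest k for which it belongs to the k-core. The sketch below illustrates this on a hypothetical undirected toy graph (`toy_core` is an assumed name). It is undirected only to keep the example simple, whereas the airport graph counts both in and out edges.

```julia
using Graphs

# Hypothetical toy graph: triangle 1-2-3 with a pendant node 4 attached to node 3
toy_core = SimpleGraph(4)
add_edge!(toy_core, 1, 2)
add_edge!(toy_core, 2, 3)
add_edge!(toy_core, 1, 3)
add_edge!(toy_core, 3, 4)

toy_coreness = core_number(toy_core)   # triangle nodes form the 2-core, node 4 only the 1-core
println(toy_coreness)
```

Node 3 has degree 3, yet its coreness is 2: removing the pendant node leaves it with only two neighbours.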


In [None]:
hist(core_number(g), bins=20, color=:gray, width=3)
xlabel("Coreness", fontsize=14)
ylabel("Frequency", fontsize=14);

In [None]:
max_core = maximum(core_number(g))
max_core_idxs = findall(core_number(g) .== max_core)
println(A[in.(A.id, Ref(max_core_idxs)), :airport])

###  Degree distribution

Below we plot the degree distribution (total degree, in and out).
Which airport has maximal degree?

In [None]:
## degree distribution
air_deg = degree(g)
hist(air_deg, bins=20, width=14, color=:gray)
xlabel("Total degree", fontsize=14)
ylabel("Frequency", fontsize=14);

In [None]:
## this is different than "merging" in/out edges:
maximum(degree(SimpleWeightedGraph(g)))

In [None]:
## max degree airport
degree_max_idx = argmax(air_deg)
println("Airport with maximal degree: ", A.airport[degree_max_idx])

## California Subgraph 

We will look at several **centrality** measures. To speed up the computation and plotting, we consider only the airports in **California**, and the edges within the state.
You can try other states by changing the first line below.
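
A note on the subgraph extraction used in the next cell: `induced_subgraph` returns a pair, the subgraph itself (with vertices renumbered from 1) and a vector mapping the new vertex ids back to the original ones. This is why the airport table is subset with the same index vectors. A minimal sketch on a hypothetical toy path (`toy_path` is an assumed name):

```julia
using Graphs

toy_path = path_digraph(4)                                  # hypothetical toy digraph 1 -> 2 -> 3 -> 4
toy_sub, toy_vmap = induced_subgraph(toy_path, [2, 3, 4])   # keep only nodes 2, 3, 4

println(nv(toy_sub), " nodes and ", ne(toy_sub), " directed edges")
println("new vertex i corresponds to original vertex toy_vmap[i]: ", toy_vmap)
```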

In [None]:
## Build smaller subgraph for California (you can try other states)
## drop isolated vertices (i.e. without in-state connections)

CA = findall(==("CA"), A.state)
G = induced_subgraph(g, CA)[1]
A_CA = A[CA, :]
NZ = findall(>(0), degree(G))
G = induced_subgraph(G, NZ)[1]
A_CANZ = A_CA[NZ, :]
println(nv(G), " nodes and ", ne(G), " directed edges")

In [None]:
## The graph is weakly connected except for 2 airports
wcomp = weakly_connected_components(G)
giant = wcomp[argmax(length.(wcomp))]  ## giant component
println("Nodes outside the giant component:")
for i in 1:nv(G)
    if !(i in giant)
        println(
            A_CANZ[i, "airport"],
            " has in degree ",
            indegree(G, i),
            " and out degree ",
            outdegree(G, i),
        )
    end
end

In [None]:
## plot using lat/lon as layout
gplot(G, A_CANZ.lon, -A_CANZ.lat,
      NODESIZE=0.03, nodefillc="black",
      EDGELINEWIDTH=0.2, edgestrokec="lightgray", arrowlengthfrac=0.05,
      linetype="curve")

In [None]:
## same subgraph using a force directed layout
gplot(G, layout=spring_layout,
      NODESIZE=0.03, nodefillc="black",
      EDGELINEWIDTH=0.2, edgestrokec="lightgray", arrowlengthfrac=0.05,
      linetype="curve")

## Centrality measures

We compute the following centrality measures for the weighted graph ```G```:
**PageRank**, **Authority** and **Hub**.
For **degree centrality**, we define our own function below and we normalize the weights to get values bounded above by 1.

For the distance-based centrality measures **closeness**, **harmonic**, **eccentricity** and **betweenness**, we do not use the edge weights, so the distance between nodes is the number of hops and is not based on the number of passengers. This is a natural choice here, since the distance between airports (cities) can be viewed as the number of flights needed to travel between those cities.

We compute the above centrality measures for every node in the ```G``` subgraph.
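
The PageRank helper defined a few cells below solves a linear system instead of iterating. As a sanity check, the sketch below applies the same closed form to a hypothetical 3-node weight matrix (`W_toy` and the other `_toy` names are assumptions made for illustration) and verifies that the result satisfies the usual PageRank recursion and sums to 1.

```julia
using LinearAlgebra

# Hypothetical weights: W_toy[j, i] is the weight of edge i -> j
# (every node has some outgoing weight, so no column is all zeros)
W_toy = [0.0 2.0 1.0;
         1.0 0.0 0.0;
         3.0 1.0 0.0]
B_toy = W_toy ./ sum(W_toy, dims=1)      # normalize so that each column sums to 1
α_toy = 0.85
n_toy = size(B_toy, 1)

# Closed form: p = (1 - α)/n * (I - αB)^(-1) * 1
p_toy = (1 - α_toy) / n_toy * ((I - α_toy * B_toy) \ ones(n_toy))

# p is a fixed point of p = αBp + (1 - α)/n, and the scores sum to 1
println(p_toy ≈ α_toy * B_toy * p_toy .+ (1 - α_toy) / n_toy)
println(sum(p_toy) ≈ 1)
```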

In [None]:
dir_degree_centrality(G::SimpleWeightedDiGraph) =
    (vec(sum(G.weights, dims=1)) + vec(sum(G.weights, dims=2))) / (2 * (nv(G) - 1))

In [None]:
function pagerank_simple(G::SimpleWeightedDiGraph; α=0.85)
    A = G.weights
    B = A ./ sum(A, dims=1)
    B[findall(isnan, B)] .= 1 / nv(G) # handle 0 out-degree nodes
    return (1 - α) / nv(G) * ((I - α * B) \ ones(nv(G)))
end

In [None]:
function hub_authority_simple(G::SimpleWeightedDiGraph)
    A = Matrix(G.weights)
    e = eigen(transpose(A) * A)
    λ = e.values[end]
    y = e.vectors[:, end]
    if all(<=(eps()), y)
        y .= -y
    end
    @assert all(>=(-eps()), y)
    x = A * y
    y ./= maximum(y)
    x ./= maximum(x)
    return x, y
end

In [None]:
function simple_closeness(G::SimpleGraph)
    c = zeros(nv(G))
    for i in 1:nv(G)
        x = gdistances(G, i)
        x .= min.(x, nv(G))
        c .+= x
    end
    return (nv(G) - 1) ./ c
end

In [None]:
function simple_eccentricity(G::SimpleDiGraph)
    return [replace(gdistances(G, v), typemax(Int) => 0) |> maximum for v in 1:nv(G)]
end

In [None]:
function harmonic_centrality(G::SimpleWeightedDiGraph)
    return [mean(replace(filter(x -> !isinf(x), 1 ./ gdistances(G, v)), 1 / typemax(Int) => 0)) for v in 1:nv(G)]
end

In [None]:
df = DataFrame("airport" => A_CANZ.airport,
    "degree" => dir_degree_centrality(G),
    "pagerank" => pagerank_simple(G),
    (["authority", "hub"] .=> hub_authority_simple(G))...,
    "between" => 2 * betweenness_centrality(SimpleDiGraph(G)),
    "harmonic" => harmonic_centrality(G),
    "closeness" => simple_closeness(SimpleGraph(SimpleDiGraph(G))),
    "eccentricity" => simple_eccentricity(SimpleDiGraph(G))
)
first(sort!(df, :degree, rev=true), 5)

In [None]:
## bottom ones
last(df, 5)

#### Top airports

The above results agree with intuition in terms of the most central airports in California.
Note however that **SAN** (San Diego) has high values *except* for betweenness, an indication that connecting flights transit mainly via LAX or SFO. 

Below, we plot the California graph again, highlighting the top-3 airports w.r.t. **pagerank**: LAX, SFO, SAN.

In [None]:
## highlight top-3 airports w.r.t. pagerank
## plot using lat/lon as layout
gplot(G, A_CANZ.lon, -A_CANZ.lat,
      NODESIZE=0.03, nodefillc=ifelse.(ordinalrank(df.pagerank, rev=true) .<= 3, "red", "black"),
      EDGELINEWIDTH=0.2, edgestrokec="lightgray", arrowlengthfrac=0.05,
      linetype="curve")

## Correlation between measures

We use the rank-based **Kendall-tau** correlation to compare the different centrality measures.

We observe high agreement between all measures. In particular, degree-centrality, hub and authority measures are very highly correlated, and so are the distance-based measures (betweenness, closeness).
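
For intuition, when there are no ties Kendall's tau is the number of concordant pairs minus the number of discordant pairs, divided by the total number of pairs. A minimal sketch with hypothetical scores for four nodes (the vectors `toy_a` and `toy_b` are made up for illustration):

```julia
using StatsBase

# Hypothetical scores for four nodes under two measures (no ties)
toy_a = [1.0, 2.0, 3.0, 4.0]
toy_b = [1.0, 2.0, 4.0, 3.0]   # same ordering, except that the last two nodes are swapped

# 6 pairs in total: 5 concordant and 1 discordant, so tau = (5 - 1) / 6
toy_tau = corkendall(toy_a, toy_b)
println(toy_tau)
```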

In [None]:
## rank-based correlation between measures
DataFrame(corkendall(Matrix(df[:, 2:end])), names(df)[2:end])

### Harmonic vs closeness centrality

By default, closeness centrality is computed **separately** on each **connected component**, which is why we defined our own function earlier, setting the distance equal to the number of nodes when no path exists between two nodes.
This is one advantage of harmonic centrality, which works as is even with disconnected graphs.
We illustrate this below, where we report harmonic centrality, closeness (with our own definition) and eccentricity for 5 airports:

* 3 major airports (LAX, SFO, SAN): all centrality values are high
* 2 disconnected airports (MCE, VIS): we see low values for both harmonic centrality and our closeness definition. With closeness computed separately on each component, their value would instead be maximal (1). This can be misleading!

There is a similar concern when computing **eccentricity** (maximum shortest distance), which is done separately for each connected component. For the California subgraph, all nodes have value 2 or 3, except the two disconnected airports, which have a value of 1, as we see below.
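
The effect is easy to reproduce on a small example. The sketch below uses a hypothetical toy graph made of a 3-node path and a separate 2-node component; `toy_hc` and `toy_scores` are names assumed for illustration, and the three scores follow the definitions discussed above rather than any library default.

```julia
using Graphs

# Hypothetical toy graph: path 1 - 2 - 3 plus a separate 2-node component 4 - 5
toy_hc = SimpleGraph(5)
add_edge!(toy_hc, 1, 2)
add_edge!(toy_hc, 2, 3)
add_edge!(toy_hc, 4, 5)

function toy_scores(g, v)
    n = nv(g)
    d = gdistances(g, v)                           # typemax(Int) marks unreachable nodes
    reach = filter(x -> 0 < x < typemax(Int), d)   # hop counts to reachable nodes only
    return (per_component = length(reach) / sum(reach),  # closeness within v's own component
            capped = (n - 1) / sum(min.(d, n)),          # unreachable distances set to n
            harmonic = sum(1 ./ reach) / (n - 1))        # unreachable nodes contribute 0
end

println("node 1: ", toy_scores(toy_hc, 1))
println("node 4: ", toy_scores(toy_hc, 4))
```

Node 4 sits in the tiny component, yet its per-component closeness is higher than that of node 1; the capped closeness and the harmonic centrality both rank it below node 1.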

In [None]:
## Harmonic vs closeness centrality
look_at = ["LAX", "SFO", "SAN", "MCE", "VIS"]
## "closeness" uses our own definition; the per-component variant is not computed here
df_sub = df[in.(df.airport, Ref(look_at)),
    ["airport", "harmonic", "closeness", "eccentricity"]]
df_sub

## Looking at coreness

We already looked at coreness for the whole airports graph; now we look at the California subgraph, again considering both in and out edges. Below we show nodes with maximal coreness as red dots, and nodes with small coreness as blue dots.

In [None]:
coreness = core_number(G)
mc = minimum(coreness)
Mc = maximum(coreness)
color = [x == Mc ? "red" : x <= mc + 1 ? "blue" : "black" for x in coreness];

In [None]:
## plot nodes w.r.t. coreness
gplot(G, A_CANZ.lon, -A_CANZ.lat,
      NODESIZE=0.03, nodefillc=color,
      EDGELINEWIDTH=0.2, edgestrokec="lightgray", arrowlengthfrac=0.05,
      linetype="curve")

The above uses the geographical layout, so it is not clear what is going on.

Let's use a force directed layout to make the difference between high and low core number clearer. 

The high coreness nodes are clearly seen, and we also observe the small 2-node connected component that was buried in the previous visualization.

In [None]:
## Coreness is more clear here
Random.seed!(12)
gplot(G, layout=spring_layout,
      NODESIZE=0.03, nodefillc=color,
      EDGELINEWIDTH=0.2, edgestrokec="lightgray", arrowlengthfrac=0.05,
      linetype="curve")

In [None]:
## vertices with max coreness (13-core)
## note that there are fewer than 14 nodes; this is possible because
## both in-coming and out-going edges are counted for a directed graph.
println("max core value:", Mc, "\nairports:", df.airport[coreness.==Mc])

### Looking at harmonic centrality

Using the same layout as above (with high coreness nodes in the middle), we display the harmonic centrality scores.
We clearly see higher values for central nodes, and small values for the small 2-node component.


In [None]:
## show harmonic centralities, same layout
Random.seed!(12)
gplot(G, layout=spring_layout,
      nodelabel=round.(harmonic_centrality(G), digits=2),
      nodelabeldist=8, nodelabelangleoffset=π / 4,
      NODESIZE=0.01, nodefillc=color,
      EDGELINEWIDTH=0.2, edgestrokec="lightgray",
      arrowlengthfrac=0.05, linetype="curve")

### Comparing coreness with other centrality measures

We add coreness to the data frame of centrality measures ```df```.
We then group the data into 3 categories: high coreness (value of max_core), low (value of min_core+1 or less) or mid-range, and we compute and plot the mean of every other measure.

We see that for all centrality measures except closeness centrality, the values are much higher for nodes with high coreness. The pagerank value for 'low' coreness nodes (close to 'mid' ones) is due to the two airports that are not part of the giant component.

As expected, nodes with small coreness generally have smaller centrality scores.
This is why we can often remove the small core nodes (for example, keeping only the 2-core) to reduce
the size of large graphs without destroying their main structure.


In [None]:
## group in 3 categories
sort!(df, :airport)
df.coreness = core_number(G)
df.core_grp = categorical([x <= 2 ? "low" : x == 13 ? "high" : "mid" for x in df.coreness])
levels!(df.core_grp, ["low", "mid", "high"])
df_grp = combine(groupby(df, :core_grp, sort=true),
    names(df, Between(:degree, :closeness)) .=> mean,
    renamecols=false)

In [None]:
## grouped barplot
bl, bm, bh = Vector.(eachrow(df_grp[:, 2:end]))
barWidth = 0.25
# Set position of bar on X axis
r1 = 1:length(bh)
r2 = r1 .+ barWidth
r3 = r2 .+ barWidth
# Make the plot
bar(r1, bh, color="black", width=barWidth, edgecolor="white", label="high coreness")
bar(r2, bm, color="gray", width=barWidth, edgecolor="white", label="mid coreness")
bar(r3, bl, color="lightgray", width=barWidth, edgecolor="white", label="low coreness")

# Add xticks on the middle of the group bars
xlabel("measure", fontsize=14)
xticks(r2, names(df_grp, Not(1)), fontsize=10)
ylabel("score", fontsize=14)
# Create legend & Show graphic
legend(fontsize=12);

## Delta-centrality example

This is the simple ''pandemic'' spread model as detailed in the book:

*The ''pandemic'' starts at exactly one airport selected uniformly at random from all the airports. Then, the following rules for spreading are applied: (i) in a given airport pandemic lasts only for one round and (ii) in the next round, with probability $\alpha$, the pandemic spreads independently along the flight routes to the destination airports for all connections starting from this airport. Airports can interact with the pandemic many times, and the process either goes on forever or the pandemic eventually dies out. 
Our goal is to find the sum over all airports of the expected number of times this airport has the pandemic.*

We use $\alpha$ = 0.1 and plot the (decreasing) delta centrality values in a barplot, using the same 3 colors as in the coreness plot above.
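
The `spread` function in the next cell evaluates $\mathbf{1}^T (I - \alpha A^T)^{-1} \mathbf{1} / n$. When $\alpha$ is small enough for the series to converge, $(I - \alpha A^T)^{-1} = \sum_{k \ge 0} (\alpha A^T)^k$, so the $k$-th term accounts for spreading along walks of length $k$. The sketch below checks this on a hypothetical directed 3-cycle (the `_cyc` names are assumptions for illustration), where every airport has exactly one outgoing route and the value should be $1/(1-\alpha)$.

```julia
using LinearAlgebra

# Hypothetical toy network: directed 3-cycle 1 -> 2 -> 3 -> 1
A_cyc = [0 1 0;
         0 0 1;
         1 0 0]
α_cyc = 0.1
n_cyc = size(A_cyc, 1)
M_cyc = α_cyc * permutedims(A_cyc)     # α Aᵀ as a plain matrix

# Closed form, averaged over the starting airport
closed_form = sum((I - M_cyc) \ ones(n_cyc)) / n_cyc

# Same quantity from the truncated series sum_k (α Aᵀ)^k
series = sum(sum(M_cyc^k) for k in 0:50) / n_cyc

println(closed_form, " ", series, " ", 1 / (1 - α_cyc))
```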

In [None]:
## Delta-centrality with a simple pandemic spread model
function spread(A::AbstractMatrix, α=0.1)
    One = ones(size(A, 1))
    X = I - α * transpose(A)
    return transpose(One) * (X \ One) / size(A, 1)
end

function spread_delta_centrality(g::SimpleDiGraph, α=0.1)
    A = Matrix(adjacency_matrix(g))
    dc = Float64[]
    spr = spread(A, α)
    for i in 1:nv(g)
        A′ = copy(A)
        A′[i, :] .= 0
        A′[:, i] .= 0
        push!(dc, (spr - spread(A′, α)) / spr)
    end
    return dc
end

In [None]:
df.delta = spread_delta_centrality(SimpleDiGraph(G))
df2 = sort(df, :delta, rev=true)
first(df2, 5)

In [None]:
heights = df2.delta
bars = df2.airport
y_pos = axes(bars, 1)
bar(y_pos, heights, color=recode(get.(df2.core_grp), "high" => "black", "mid" => "gray", "low" => "lightgray"))
# Rotation of the bars names
ylabel("Delta Centrality", fontsize=12)
xticks(y_pos, bars, rotation=90)
yticks();

## Group centrality and centralization

We go back to the full airports graph, and we ask the following questions:

* which states have highest delta centralities with respect to efficiency?
* what about centralization for each state subgraph?

Computing efficiency involves shortest path lengths. If the graph is disconnected, some pairs of nodes are unreachable; such pairs are skipped in the sum below, so they contribute zero.
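
Written out, the quantities computed in the next two cells are the following, where $d(i,j)$ is the number of hops from $i$ to $j$ and unreachable pairs contribute 0. The symbols $E$, $S$, $G_S$ and $\delta$ are notation introduced here for illustration:

$$E(G) = \frac{1}{n(n-1)} \sum_{i \neq j} \frac{1}{d(i,j)}, \qquad \delta(S) = \frac{E(G) - E(G_S)}{E(G)}$$

Here $G_S$ denotes $G$ with all edges incident to the airports of state $S$ removed. The nodes themselves are kept, so $n$ is the same in both terms.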


In [None]:
## group delta centrality
function efficiency(g::SimpleDiGraph)
    n = nv(g)
    s = 0.0
    for i in 1:n
        v = gdistances(g, i)
        ## skip the node itself (distance 0) and unreachable nodes (typemax)
        s += sum((1 / x for x in v if 0 < x < n); init=0.0)
    end
    return s / (n * (n - 1))
end

In [None]:
sg = SimpleDiGraph(g)
states = unique(A.state)
eff_us = efficiency(sg)
dc = Float64[]
for s in states
    v = findall(==(s), A.state)
    csg = copy(sg)
    for i in 1:nv(csg), j in v
        rem_edge!(csg, i, j)
        rem_edge!(csg, j, i)
    end
    push!(dc, (eff_us - efficiency(csg)) / eff_us)
end
DC = DataFrame(state=states, delta_centrality=dc)
sort!(DC, :delta_centrality, rev=true)
first(DC, 3)

In [None]:
## ... and bottom states
last(DC, 3)

For group centralization, we use the PageRank measure.
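
The centralization score used in the next cell is the gap between the largest PageRank value and the mean value: it is large when one node dominates and close to zero when the scores are evenly spread. A minimal sketch comparing two hypothetical 5-node graphs, a star and a directed ring (`toy_pagerank`, `toy_centralization`, `W_star` and `W_ring` are names assumed for illustration, and the helper assumes every node has at least one outgoing edge):

```julia
using LinearAlgebra
using Statistics

# Closed-form PageRank for a matrix W with W[j, i] = weight of edge i -> j
# (hypothetical helper; assumes no column of W is all zeros)
function toy_pagerank(W; α=0.85)
    n = size(W, 1)
    B = W ./ sum(W, dims=1)
    return (1 - α) / n * ((I - α * B) \ ones(n))
end

toy_centralization(p) = maximum(p) - mean(p)

# Star: node 1 is a hub linked in both directions to nodes 2..5
W_star = zeros(5, 5)
W_star[1, 2:5] .= 1
W_star[2:5, 1] .= 1

# Directed ring 1 -> 2 -> ... -> 5 -> 1: no hub
W_ring = zeros(5, 5)
for i in 1:5
    W_ring[mod1(i + 1, 5), i] = 1
end

println("star: ", toy_centralization(toy_pagerank(W_star)))
println("ring: ", toy_centralization(toy_pagerank(W_ring)))
```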

In [None]:
## group centralization (using PageRank) -- by state
states = unique(A.state)
pr = Float64[]
st = String[]
for s in states
    v = findall(==(s), A.state)
    if length(v) > 5 ## look at states with more than 5 airports only
        G = induced_subgraph(g, v)[1]
        p = pagerank_simple(G)
        push!(pr, maximum(p) - mean(p))
        push!(st, s)
    end
end

DC = DataFrame("State" => st, "Pagerank Centralization" => pr)
first(sort!(DC, 2, rev=true), 5)

We plot the state with highest PageRank centralization (Michigan).

This is a state with one high degree airport (DTW).

In [None]:
Random.seed!(12)
v = findall(==("MI"), A.state)
G = induced_subgraph(g, v)[1]
NZ = findall(>(0), degree(G))
G = induced_subgraph(G, NZ)[1]
gplot(G,
      NODESIZE=0.03, nodefillc=[x == "DTW" ? "red" : "black" for x in A.airport[v[NZ]]],
      EDGELINEWIDTH=0.2, edgestrokec="lightgray", arrowlengthfrac=0.05,
      linetype="curve")

In [None]:
## one big hub airport: DTW (Detroit)
degs = degree(G)
for (i, v) in enumerate(v[NZ])
    println(A.city[v], " ", A.airport[v], " has degree ", degs[i])
end

We plot the state with the lowest PageRank centralization (ND).

This is a state without a high-degree (hub) airport.

In [None]:
## lowest ones
last(DC, 5)

In [None]:
Random.seed!(3)
v = findall(==("ND"), A.state)
G = induced_subgraph(g, v)[1]
NZ = findall(>(0), degree(G))
G = induced_subgraph(G, NZ)[1]
gplot(G,
      NODESIZE=0.03, nodefillc="black",
      EDGELINEWIDTH=0.2, edgestrokec="gray", arrowlengthfrac=0.05,
      linetype="curve")

In [None]:
## no big hub city here
Set(A.city[v])

What should we expect for California? There are hub airports, but several of them.

In [None]:
# what about California
DC[DC.State.=="CA", :]

# Extra material

### Figure 3.1 - empirical tests

The code below can be used to obtain the equivalent of Figure 3.1 in the book for different values of $n$, the number of nodes. Large $n$ values will generate a plot like the one in the book.

In [None]:
## G(n,p) graph and k-cores - fraction of nodes in k-core vs average degree
n = 25000
Random.seed!(123)

## Generate the graphs and store coreness
## Vary average degree (thus, number of edges to generate)
avg_deg = 0:0.5:16.0
n_edges = [round(Int, n * i / 2) for i in avg_deg]

C = []
for m in n_edges
    g_er = erdos_renyi(n, m)  ## separate name, so the airport graph g is not overwritten
    push!(C, core_number(g_er))
end

# Plot
fig, ax = subplots(1)
S = [sum(C[i] .>= 1) / n for i in 1:length(avg_deg)]
X = [avg_deg[i] for i in 1:length(avg_deg) if S[i] >= 0]
Y = [S[i] for i in 1:length(avg_deg) if S[i] >= 0]
ax.plot(X, Y)
ax.text(0.2, 0, "1")

for k in 2:10
    S = [sum(C[i] .>= k) / n for i in 1:length(avg_deg)]
    X = [avg_deg[i] for i in 1:length(avg_deg) if S[i] > 0]
    Y = [S[i] for i in 1:length(avg_deg) if S[i] > 0]
    ax.plot(X, Y)
    ax.text(minimum(X) + 0.2, minimum(Y), string(k))
end

ax.set_xlabel("average degree", fontsize=14)
ax.set_ylabel("fraction of nodes", fontsize=14)
ax.set_title("Order of k-cores with " * string(n) * " nodes", fontsize=14);
