# Chapter 2 - Random Graph Models

In the first part of this notebook, we provide the code required to generate the figures in Chapter 2 of the textbook.

In the second part, we consider the GitHub machine learning (ml) developers graph that we introduced in Chapter 1, and compare various statistics for this graph with the values we get for the random graph models introduced in Chapter 2.

### Requirements

Add the `powerlaw` package using the following commands in a Julia session (you can also copy-paste them into a Jupyter Notebook cell and run them):
```
using PyCall
run(`$(PyCall.python) -m pip install --upgrade cython`)
run(`$(PyCall.python) -m pip install powerlaw`)
```
These commands need to be run only once.

In [None]:
using Graphs
using GraphPlot
using PyCall
using PyPlot
using Statistics
using Optim
using Distributions
using FreqTables
using StatsBase
using Random
using CSV
using DataFrames

As with the previous notebook, make sure to set the data directory properly in the next cell.

In [None]:
datadir = "../Datasets/"

## Part 1 - Generating figures from the book

### Figure 2.1: size of the giant component

We generate several binomial random graphs with $n$ nodes, where we vary the average node degree (and thus the number of edges). We consider $n=100$ below; you can try other values of $n$. Un-comment the second line to run with $n=10,000$ nodes as in the second plot in the book (this will be much slower).

We plot the theoretical giant component size (black line) and the 90% confidence interval from the empirical data in grey, both as a function of the average degree; we see good agreement and we observe the various phases as described in the book. 
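
The theoretical curve comes from the equation $x + e^{-xd} - 1 = 0$ that the code below solves numerically: for average degree $d > 1$, its positive root $x$ is the asymptotic fraction of nodes in the giant component. As a rough illustration, the same root can be approximated by fixed-point iteration on $x = 1 - e^{-xd}$. The helper name `giant_fraction` and its tolerance are our own choices, not part of the book's code.

```julia
## Sketch: approximate the positive root of x = 1 - exp(-d*x) by fixed-point
## iteration (hypothetical helper, not from the book's code).
function giant_fraction(d; tol=1e-12, maxiter=100_000)
    x = 1.0                      # start away from the trivial root x = 0
    for _ in 1:maxiter
        x_new = 1 - exp(-d * x)
        abs(x_new - x) < tol && return x_new
        x = x_new
    end
    return x
end

for d in (0.5, 2.0, 5.0)
    println("d = ", d, ": giant component fraction ≈ ", round(giant_fraction(d), digits=4))
end
```

For $d < 1$ the iteration collapses to the trivial root $x = 0$, which reflects the absence of a giant component in that regime.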

In [None]:
n = 100
# n=10000
gc_avg = Float64[]
gc_std = Float64[]
REP = 1000 ## repeats
ad = 0.1:0.1:10.0
for d in ad
    x = Int[]
    for rep in 1:REP
        p = d / (n - 1)
        g = erdos_renyi(n, p)
        push!(x, maximum(length, connected_components(g)))
    end
    push!(gc_avg, mean(x))
    push!(gc_std, std(x))
end

## theoretical
th = fill(log(n), 10)

fn(x, d) = x + exp(-x * d) - 1

for i in 1.1:0.1:10.0
    push!(th, n * optimize(x -> fn(x, i)^2, 0, 1).minimizer[1])
end

fill_between(ad, [a - 1.645 * s for (a, s) in zip(gc_avg, gc_std)],
    [a + 1.645 * s for (a, s) in zip(gc_avg, gc_std)], color="lightgray")
plot(ad, th, color="black")
suptitle("Random graph with " * string(n) * " nodes", fontsize=14)
title("Theoretical predictions (black) vs empirical results (grey)", fontsize=12)
xlabel("average degree", fontsize=14)
ylabel("giant component size", fontsize=14);

### Figure 2.2: probability that the graph is connected

This experiment is similar to the one above, but this time we look at the probability that the random graph is connected.
We vary some constant $c$ introduced in the book such that the edge probability for the binomial graphs is given by $(\log(n)+c)/n$. Once again we compare theory (black line) and experimental results (in grey) with $n=100$ nodes. Un-comment the second line to run with $n=10,000$ nodes as in the second plot in the book (this will be much slower).

In the cell below, the grey area corresponds to a 90% confidence interval for proportions; for an empirical proportion $x$ obtained from a sample of size $N$ (here, the number of repeats), the interval is $x \pm 1.645 \sqrt{x(1-x)/N}$.

Here also we see good agreement between theory and experimental results.
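
To make the two formulas used in this experiment concrete, the following sketch evaluates the limiting connectivity probability $e^{-e^{-c}}$ and the 90% interval for a proportion. The function names are ours, and the sample size passed to `prop_ci` is assumed to be the number of repeated trials.

```julia
## Sketch (hypothetical helper names): limiting probability and 90% interval.
connect_prob(c) = exp(-exp(-c))          # limit of P(connected) for p = (log(n) + c) / n
edge_prob(n, c) = (log(n) + c) / n       # edge probability used in the experiment

## normal-approximation interval for a proportion x estimated from `trials` samples
function prop_ci(x, trials; z=1.645)
    half = z * sqrt(x * (1 - x) / trials)
    return (x - half, x + half)
end

println(connect_prob(0.0))               # equals exp(-1)
println(edge_prob(100, 0.0))
println(prop_ci(0.5, 1000))
```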

In [None]:
n = 100
# n = 10000
REP = 1000 ## repeats
lo = max(-floor(log(n) * 10) / 10, -10)
C = lo:0.1:10.0
ic_avg = Float64[]
for c in C
    x = Bool[]
    for rep in 1:REP
        p = (c + log(n)) / n
        g = erdos_renyi(n, p)
        push!(x, is_connected(g))
    end
    push!(ic_avg, mean(x))
end

## theoretical
th = [exp(-exp(-c)) for c in C]

## plot (90% confidence band; the sample size is the number of repeats REP)
fill_between(C, [x - 1.645 * sqrt(x * (1 - x) / REP) for x in ic_avg],
    [x + 1.645 * sqrt(x * (1 - x) / REP) for x in ic_avg], color="lightgray")
plot(C, th, color="black")
suptitle("Random graph with " * string(n) * " nodes", fontsize=14)
title("Theoretical predictions (black) vs empirical results (grey)", fontsize=12)
xlabel("constant c", fontsize=14)
ylabel("P(graph is connected)", fontsize=14);

### Figure 2.4: Distribution of shortest path lengths

We consider a series of binomial random graphs with expected average degree 5, where we vary the number of nodes from $n=64$ to $n=2,048$.

We see that each time we double the number of nodes, the average shortest path length (in the giant component) increases only slowly.
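
A common heuristic, stated here as an assumption rather than a result from this notebook, is that typical distances in a sparse random graph with average degree $d$ grow like $\log n / \log d$. Under that heuristic, doubling $n$ adds only about $\log 2 / \log d$ to the path length, which is in line with the slow growth described above.

```julia
## Sketch: heuristic path-length scale log(n) / log(d) for average degree d = 5.
d = 5
N = [64, 128, 256, 512, 1024, 2048]
est = [log(n) / log(d) for n in N]
for (n, e) in zip(N, est)
    println(n, " nodes: about ", round(e, digits=2))
end
## each doubling of n adds the same constant, log(2) / log(d)
println("increment per doubling: ", round(log(2) / log(d), digits=2))
```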

In [None]:
sp = Vector{Int}[]
N = [64, 128, 256, 512, 1024, 2048]
Random.seed!(123)
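for n in N
    p = 5 / (n - 1)
    g = erdos_renyi(n, p)
    ## restrict to the giant component: gdistances returns typemax(Int)
    ## for unreachable nodes, which would otherwise pollute the sample
    cc = connected_components(g)
    giant = induced_subgraph(g, cc[argmax(length.(cc))])[1]
    z = Int[]
    for i in vertices(giant)
        append!(z, filter(>(0), gdistances(giant, i)))
    end
    push!(sp, z)
end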

In [None]:
## Show as histograms (boxplots in the first edition)
bins = 0.5:1:8.5
fig, axs = subplots(2, 3)
suptitle("Shortest path length distribution")
for i in 1:2
    for j in 1:3
        axs[i, j].hist(
            sp[3*(i-1)+j], bins=bins, width=0.9, density=true, color="darkgrey"
        )
        axs[i, j].set_ylim(0, 0.48)
        axs[i, j].set_xticks([1, 3, 5, 7])
        axs[i, j].set_title(string(N[3*(i-1)+j]) * " nodes", fontsize=10)
        axs[i, j].set_xlabel("path length")
        axs[i, j].set_ylabel("proportion")
    end
end
for ax in fig.get_axes()
    ax.label_outer()
end

### Figure 2.5: Poisson vs degree distributions

We plot the degree distribution for a binomial random graph with expected average degree 10 (the black dots), and we compare it with the corresponding Poisson distribution (dotted line).

The cell below uses $n=10,000$ nodes, as in the book. Un-comment the first line to try $n=100$; as $n$ increases, the dots should get closer to the Poisson distribution.
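
The convergence can also be checked without sampling. In a binomial random graph each degree is binomial with parameters $n-1$ and $p$, so the sketch below compares that pmf with the Poisson pmf directly, using simple recurrences instead of the `Distributions` package. The helper names are illustrative.

```julia
## Sketch (hypothetical helpers): Binomial(m, p) vs Poisson(lambda) pmf on 0:kmax.
function binomial_pmf(m, p, kmax)
    pmf = zeros(kmax + 1)
    pmf[1] = (1 - p)^m
    for k in 1:kmax
        pmf[k+1] = pmf[k] * (m - k + 1) / k * p / (1 - p)
    end
    return pmf
end

function poisson_pmf(lambda, kmax)
    pmf = zeros(kmax + 1)
    pmf[1] = exp(-lambda)
    for k in 1:kmax
        pmf[k+1] = pmf[k] * lambda / k
    end
    return pmf
end

## largest pmf gap between the degree law of G(n, p) and Poisson(lambda)
function max_gap(n; lambda=10, kmax=40)
    p = lambda / (n - 1)
    return maximum(abs.(binomial_pmf(n - 1, p, kmax) .- poisson_pmf(lambda, kmax)))
end

println("n = 100:   ", max_gap(100))
println("n = 10000: ", max_gap(10000))
```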

In [None]:
# n = 100
n = 10000
p = 10 / (n - 1)
g = erdos_renyi(n, p)
z = proptable(degree(g))
x = names(z)[1]
pmf = pdf(Poisson(10), x)
plot(x, z, "o", color="black")
plot(x, pmf, ":", color="black")
title("Empirical degree distribution vs Poisson distribution")
xlabel("degree", fontsize=14)
ylabel("frequency/pmf", fontsize=14);

In [None]:
sort(countmap(degree(g)))

### Figure 2.6 - Power law graphs

We generate a random graph with $n=10,000$ nodes whose degrees follow a power law distribution with exponent $\gamma=2.5$.
We do so using the Chung-Lu model described in section 2.5 of the book; we generate simple graphs (no loops or multiedges) and discard 0-degree nodes.

We then fit and plot the degree distribution of the obtained graph using the ```powerlaw``` package, see: https://arxiv.org/pdf/1305.0215.pdf
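
The expected degrees used in the cells below follow the formula $w_i = \delta \left( n / (i - 1 + i_0) \right)^{1/(\gamma-1)}$ with $i_0 = n (\delta/\Delta)^{\gamma-1}$, so that the largest weight equals $\Delta$ and the smallest is close to $\delta$. The following sketch, with a helper name of our own, checks these two endpoints.

```julia
## Sketch (hypothetical helper): power-law expected degrees.
function powerlaw_weights(n, gamma, delta, Delta)
    i0 = n * (delta / Delta)^(gamma - 1)
    return [delta * (n / (i - 1 + i0))^(1 / (gamma - 1)) for i in 1:n]
end

W = powerlaw_weights(10_000, 2.5, 1, sqrt(10_000))
println("largest weight:  ", W[1])     # close to Delta = 100
println("smallest weight: ", W[end])   # close to delta = 1
```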

In [None]:
## fast Chung-Lu: generate m edges w.r.t. distribution d
function fastCL(d, m)
    p = Weights(d)
    n = length(d)
    target = m
    g = SimpleGraph(n)
    while ne(g) < target
        a, b = sample(1:n, p), sample(1:n, p)
        a != b && add_edge!(g, a, b)
    end
    return g
end

In [None]:
## power law graph
gamma = 2.5
n = 10000
delta = 1
Delta = sqrt(n)
W = Float64[]
for i in 1:n
    push!(W, delta * (n / (i - 1 + n / (Delta / delta)^(gamma - 1)))^(1 / (gamma - 1)))
end

deg = round.(Int, W)
m = trunc(Int, mean(deg) * n / 2)
g = fastCL(deg, m)

## number of isolated nodes
print("isolates: ", count(==(0), degree(g)))

g1 = induced_subgraph(g, findall(>(0), degree(g)))[1]

In [None]:
## KS statistic
d = degree(g1)
powerlaw = pyimport("powerlaw");
X = powerlaw.Fit(d)
println("Range of degrees in graph: $(minimum(d)) , $(maximum(d))")
println("Value of l':", X.power_law.xmin)
println("Corresponding value of gamma:", X.power_law.alpha)

### Divergence vs $\ell$

In [None]:
## Plot divergence vs 'l'
x = X.xmins
y = X.Ds
plot(x, y, ".")

## Plot min value with larger dot
## (PyCall returns 1-based Julia arrays, so we read the optimum from the fit itself)
x = X.power_law.xmin
y = X.power_law.D
plot([x], [y], "p")
xlabel(L"$\ell$", fontsize=14)
ylabel("Divergence", fontsize=12);

In [None]:
## plot divergence vs. exponent (alphas here, gamma' in the book)
plot(X.alphas[begin:50], X.Ds[begin:50], ".")

## Plot min value with larger dot (read directly from the fitted power law)
x = X.power_law.alpha
y = X.power_law.D
plot([x], [y], "o")
xlabel(raw"$\gamma$", fontsize=14)
ylabel("Divergence", fontsize=12);

### Figure 2.6 - inverse (cumulative) cdf vs degree and fitted power law

In the first plot, we look at degrees starting from $\ell'$.

In the second plot, we look at the whole range of degrees.

In [None]:
## Figure 2.6 - starting from l'
fig1 = X.power_law.plot_ccdf(color="black", linestyle="-");
fig1 = X.plot_ccdf(ax=fig1, linewidth=2, color="gray", original_data=false, linestyle=":")
xlabel("degree", fontsize=13)
ylabel("inverse cdf", fontsize=13);

In [None]:
## now starting from 1 - need to translate power law line manually
fig = X.plot_ccdf(linewidth=2, color="dimgray", original_data=true, linestyle="--")
xlabel("degree", fontsize=13)
ylabel("inverse cdf", fontsize=13)

## get end points for power law fitted line
x = [Int(X.power_law.xmin), Int(X.data[end])]     ## x-axis: from l' to max value in data
delta_y = X.ccdf(original_data=true)[2][x[1]-1]   ## translation for first point
y = [delta_y, X.power_law.ccdf()[end] * delta_y] ## y-axis values
plt.plot(x, y, "-", linewidth=2, color="black")
print("power law slope:", (log10(y[2]) - log10(y[1])) / (log10(x[2]) - log10(x[1])));

### Figure 2.7: simple $d$-regular graphs

We generate several $d$-regular graphs and count how many are simple graphs.
We consider $d=2$ to $d=10$, with $n=100$ nodes. 
We used $n=10,000$ nodes in the book.

We plot the empirical proportion of simple graphs below (black dots), and we compare with the theoretical values (dotted line). We see good agreement even for a value as small as $n=100$.
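
For reference, the theoretical curve used below is the asymptotic probability $e^{(1-d^2)/4}$ that a random pairing of the stubs produces a simple graph. The sketch tabulates it for the degrees considered (the function name is ours); the values decay quickly with $d$.

```julia
## Sketch: asymptotic probability that a random d-regular pairing is simple.
simple_prob(d) = exp((1 - d^2) / 4)

for d in 2:10
    println("d = ", d, ": ", simple_prob(d))
end
```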

In [None]:
function check_random_regular_simple(n, k)
    stubs = reduce(vcat, [fill(i, k) for i in 1:n])
    shuffle!(stubs)
    existing = Set{Tuple{Int,Int}}()
    for i in 1:2:length(stubs)
        a, b = minmax(stubs[i], stubs[i+1])
        (a == b || (a, b) in existing) && return false
        push!(existing, (a, b))
    end
    return true
end

In [None]:
n = 100
# n = 10000
REP = 100
D = 2:10
simple = [mean(check_random_regular_simple(n, d) for _ in 1:REP) for d in D]
th = [exp(-(d * d - 1) / 4) for d in D]

plot(D, simple, "o", color="black")
plot(D, th, ":", color="black")
xlabel("degree", fontsize=14)
ylabel("P(graph is simple)", fontsize=14);

### Section 2.8 - Random geometric graphs (RGG)

With this model, $n$ nodes are dropped randomly on the $d$-dimensional space $[0,1]^d$.
Two nodes are connected by an edge if their distance is less than some **radius** parameter $r$.
We consider the **unit square**, so we fix $d=2$ in our examples.

For RGG on the unit square, the expected degree of a node in a graph with $n$ nodes is $\pi r^2 (n-1)$, where $r$ is the radius parameter, provided the node is at distance at least $r$ from the boundary of the square. Nodes closer to the boundary have fewer connections, so the formula slightly overestimates the true expected average degree.

We can slightly modify this model by considering a unit **torus**. This will eliminate boundary effects so all nodes have (expected) average degree $\pi r^2 (n-1).$
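
On the torus, distances are measured with wrap-around in each coordinate. The sketch below (helper names are ours, and it does not use `Graphs.jl`) shows the wrap-around distance and the radius that gives a target expected degree via $\pi r^2 (n-1)$.

```julia
## Sketch (hypothetical helpers): torus distance and radius for a target degree.
## wrap-around distance on the unit torus
function torus_dist(p, q)
    return sqrt(sum(min(abs(a - b), 1 - abs(a - b))^2 for (a, b) in zip(p, q)))
end

## plain Euclidean distance on the unit square
square_dist(p, q) = sqrt(sum((a - b)^2 for (a, b) in zip(p, q)))

## radius such that pi * r^2 * (n - 1) equals the target average degree
radius_for_degree(deg, n) = sqrt(deg / (pi * (n - 1)))

p, q = (0.05, 0.5), (0.95, 0.5)
println(square_dist(p, q))   # far apart on the square
println(torus_dist(p, q))    # close on the torus
println(radius_for_degree(5, 100))
```

Two points near opposite edges of the square are neighbours on the torus, which is why the torus model has no boundary effect.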

In all experiments below, you can use either the unit square by setting ```boundary=:open```, or the torus model by setting ```boundary=:periodic```.

### Looking at some RGGs

We plot some geometric random graphs with $n=100$ nodes and varying radius threshold $r$.

In [None]:
## plotting a few random geometric graphs with 100 nodes and varying radius threshold
n = 100
boundary = :periodic ## :periodic gives a torus-based RGG; set to :open for the constrained unit square

## select a value for the radius:
# radius = 0.1
radius = 0.15
# radius = 0.2

g, _, pos = euclidean_graph(n, 2, seed=1234, cutoff=radius, bc=boundary)
print("Geometric random graph with radius=", radius)
gplot(g, pos[1, :], pos[2, :], nodefillc="gray", nodestrokelw=1, nodestrokec="black")

**RGG - size of the giant component**

Next we look at the size of the giant component for geometric random graphs on the unit square as we vary the radius parameter.

In the experiment below, we fix some degree range and compute the corresponding radius parameters $r$.
For each value $r$, we generate 1,000 RGGs and compute the mean and standard deviation for the size of the giant component.

We see a similar shape as with binomial (Erdos-Renyi) random graphs.

In [None]:
boundary = :open ## :open is the constrained unit square; set to :periodic for a torus-based RGG

## number of nodes:
n = 100
#n = 10000

repeats = 1000
average_degree = 0.5:0.5:15
radii = sqrt.(average_degree / (pi * (n - 1)))

df = DataFrame(["gc_avg", "gc_std", "deg_avg"] .=> Ref(Float64[]))
## generate random graphs and gather the sizes of the giant component
for radius in radii
    x = Int[]
    y = Float64[]
    for rep in 1:repeats
        g, _ = euclidean_graph(n, 2, seed=rep, cutoff=radius, bc=boundary)
        push!(x, maximum(length.(connected_components(g))))
        push!(y, mean(degree(g)))
    end
    push!(df, [mean(x), std(x), mean(y)])
end

In [None]:
## plot the above empirical results (confidence intervals and mean values)
fill_between(average_degree, df.gc_avg - 1.645 * df.gc_std,
    min.(n, df.gc_avg + 1.645 * df.gc_std), color="lightgray")
plot(average_degree, df.gc_avg, color="black", linestyle=":")
xlabel("Average degree (approximate)", fontsize=14)
ylabel("giant component size", fontsize=14);

**Degree formula vs empirical results**

If the experiment was conducted on the unit square (with `boundary = :open`), then the empirical degrees are slightly smaller than the ones computed with the formula; this is clearly seen by comparing with a unit slope line. However, if the experiment was conducted on the torus (with `boundary = :periodic`), then the values agree.

In [None]:
plot(average_degree, df.deg_avg, color="black")
plot(df.deg_avg, df.deg_avg, ":", color="dimgray")
ylabel("empirical average degree", fontsize=14)
xlabel("average degree formula", fontsize=14);

**Connectivity of RGGs**

This time we look at the probability that the random graph is connected.
We vary some constant $c$ introduced in the book such that the radius $r$ for the RGGs is given by $n\pi r^2 = \ln n + c$. We compare theory (black line) and experimental results (in grey) with $n=100$ nodes. 
We used $n=10,000$ nodes in the book (this will be much slower).

In the cell below, the grey area corresponds to a 90% confidence interval for proportions; for an empirical proportion $x$ obtained from a sample of size $N$ (here, the number of repeats), the interval is $x \pm 1.645 \sqrt{x(1-x)/N}$.

In this case, if we generate RGGs on a square, boundary nodes will have a higher chance of being isolated, so the convergence to the theoretical result with respect to $n$ is slow, as we observe in the results below.
Working on a torus, we see faster convergence, but even with $n = 10,000$, there are still some small differences.

In [None]:
boundary = :open
n = 100
# n = 10000

repeats = 1000

## set lower bound for the range of values for 'c'
lo = max(-5, -Int(floor(log(n) * 10)) / 10)
c_range = lo:0.1:10
ic_avg = Float64[]

for c in c_range
    x = Int64[]
    r = sqrt((c + log(n)) / (pi * (n)))
    for rep in 1:repeats
        g, _ = euclidean_graph(n, 2, seed=rep, cutoff=r, bc=boundary)
        push!(x, Int(is_connected(g)))
    end
    push!(ic_avg, mean(x))
end
## theoretical values
th = exp.(-exp.(-c_range));

In [None]:
## 90% confidence band; the sample size is the number of repeats
plt.fill_between(c_range, ic_avg - 1.645 * sqrt.(ic_avg .* (1 .- ic_avg) / repeats),
    ic_avg + 1.645 * sqrt.(ic_avg .* (1 .- ic_avg) / repeats), color="lightgrey")
plot(c_range, th, "-", color="black")
xlabel(L"constant $c$", fontsize=14)
ylabel("P(graph is connected)", fontsize=14);

## Part 2 - Experiments section

We use the giant component of the **GitHub machine learning (ml) developers** subgraph that we introduced in Chapter 1. Recall this graph has 7,083 nodes and 19,491 edges. 

We compute several graph statistics for this "base graph", as reported in the first column of **Table 2.8** from the book.

We then generate **random** graphs with the same number of nodes and (expected) number of edges using 4 different models:
* binomial or Erdos-Renyi: only average degree is used
* Chung-Lu (`expected_degree_graph`): expected degree distribution
* Chung-Lu with a fixed number of edges (`fastCL`, defined above): expected degree distribution
* Configuration (`cm_simple`): exact degree distribution; self-loops and multi-edges are rewired so that a simple graph is obtained whenever the rewiring succeeds

See **section 2.8** of the book for a more complete discussion of the results, but as a general observation, more complex models (such as the configuration model) tend to preserve more characteristics of the reference graph.
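
As a reminder of what "expected degree distribution" means for the Chung-Lu model, a standard formulation (assumed here; see the book for the exact variant) joins nodes $i \neq j$ independently with probability $\min(w_i w_j / \sum_k w_k, 1)$. The sketch below computes the resulting expected degrees for a small weight sequence; the helper name is hypothetical.

```julia
## Sketch (hypothetical helper): expected degrees under Chung-Lu edge probabilities,
## assuming p_ij = min(w_i * w_j / sum(w), 1) and no self-loops.
function chung_lu_expected_degrees(w)
    S = sum(w)
    n = length(w)
    return [sum(min(w[i] * w[j] / S, 1.0) for j in 1:n if j != i) for i in 1:n]
end

w = [3, 2, 2, 1]
println(chung_lu_expected_degrees(w))
```

For such a tiny graph the expected degrees fall short of the weights because the self-loop term $w_i^2 / \sum_k w_k$ is excluded; the gap shrinks for large sparse graphs.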

In [None]:
## read the GitHub edge list into a graph
D = CSV.read(datadir * "GitHubDevelopers/musae_git_edges.csv", DataFrame) .+ 1
max_node_id = max(maximum(D.id_1), maximum(D.id_2))
gh = SimpleGraph(max_node_id)
foreach(row -> add_edge!(gh, row...), eachrow(D))

## add some node features; there are
## 2 classes of nodes, 0: web developer (red), 1: ml developer (blue)
X = CSV.read(datadir * "GitHubDevelopers/musae_git_target.csv", DataFrame)
X.id .+= 1
@assert extrema(diff(X.id)) == (1, 1) && extrema(X.id) == (1, length(vertices(gh)))
gh_lbl = ifelse.(X.ml_target .== 0, "web", "ml");

In [None]:
gh_ml = induced_subgraph(gh, findall(==("ml"), gh_lbl))[1]

In [None]:
cc = connected_components(gh_ml)
sg = induced_subgraph(gh_ml, cc[argmax(length.(cc))])[1]

In [None]:
function baseStats(G)
    deg = degree(G)
    cc = connected_components(G)
    return Any[nv(G), ne(G),
        minimum(deg), mean(deg), median(deg), quantile(deg, 0.99), maximum(deg),
        igraph_diameter(G),
        length(cc), maximum(length, cc), count(==(0), deg),
        global_clustering_coefficient(G),
        mean(local_clustering_coefficient(G)[degree(G).>1])]
end

In [None]:
function igraph_diameter(G)
    ccs = connected_components(G)
    ccg = [induced_subgraph(G, cc)[1] for cc in ccs]
    maximum(maximum(maximum(gdistances(g, i)) for i in vertices(g)) for g in ccg)
end

In [None]:
function cm_simple(ds)
    @assert iseven(sum(ds))
    stubs = reduce(vcat, fill(i, ds[i]) for i in 1:length(ds))
    shuffle!(stubs)
    local_edges = Set{Tuple{Int,Int}}()
    recycle = Tuple{Int,Int}[]
    for i in 1:2:length(stubs)
        e = minmax(stubs[i], stubs[i+1])
        if (e[1] == e[2]) || (e in local_edges)
            push!(recycle, e)
        else
            push!(local_edges, e)
        end
    end

    # resolve self-loops and duplicates
    last_recycle = length(recycle)
    recycle_counter = last_recycle
    while !isempty(recycle)
        recycle_counter -= 1
        if recycle_counter < 0
            if length(recycle) < last_recycle
                last_recycle = length(recycle)
                recycle_counter = last_recycle
            else
                break
            end
        end
        p1 = popfirst!(recycle)
        from_recycle = 2 * length(recycle) / length(stubs)
        success = false
        for _ in 1:2:length(stubs)
            p2 = if rand() < from_recycle
                used_recycle = true
                recycle_idx = rand(axes(recycle, 1))
                recycle[recycle_idx]
            else
                used_recycle = false
                rand(local_edges)
            end
            if rand() < 0.5
                newp1 = minmax(p1[1], p2[1])
                newp2 = minmax(p1[2], p2[2])
            else
                newp1 = minmax(p1[1], p2[2])
                newp2 = minmax(p1[2], p2[1])
            end
            if newp1 == newp2
                good_choice = false
            elseif (newp1[1] == newp1[2]) || (newp1 in local_edges)
                good_choice = false
            elseif (newp2[1] == newp2[2]) || (newp2 in local_edges)
                good_choice = false
            else
                good_choice = true
            end
            if good_choice
                if used_recycle
                    recycle[recycle_idx], recycle[end] = recycle[end], recycle[recycle_idx]
                    pop!(recycle)
                else
                    pop!(local_edges, p2)
                end
                success = true
                push!(local_edges, newp1)
                push!(local_edges, newp2)
                break
            end
        end
        success || push!(recycle, p1)
    end
    g = SimpleGraph(length(ds))
    for e in local_edges
        add_edge!(g, e...)
    end
    return g
end

In [None]:
er = erdos_renyi(nv(sg), ne(sg))

In [None]:
cl = expected_degree_graph(degree(sg))

In [None]:
cl2 = fastCL(degree(sg), ne(sg))

In [None]:
cm = cm_simple(degree(sg))

In [None]:
df = DataFrame("statistic" => ["nodes", "edges", "d_min", "d_mean", "d_median", "d_quant_99", "d_max",
        "diameter", "components", "largest", "isolates", "C_glob", "C_loc"],
    "Base Graph" => baseStats(sg),
    "Erdos-Renyi" => baseStats(er),
    "Chung-Lu original" => baseStats(cl),
    "Chung-Lu fixed" => baseStats(cl2),
    "Configuration simple" => baseStats(cm))

### Shortest path length distribution

We compute and compare the shortest path length distribution for several node pairs and for the 5 graphs we have (GitHub ml reference graph, and 4 random graphs). Sampling is used to speed up the process.

We consider the giant component for disconnected graphs.

We see a reasonably high similarity for all graphs, with the binomial random graph having slightly longer path lengths due to the absence of high degree (hub) nodes in that model.

In [None]:
## compute min path length distribution for several node pairs for the 5 graphs (real and 4 random ones)
cc_er = connected_components(er)
er_g = induced_subgraph(er, cc_er[argmax(length.(cc_er))])[1]
cc_cl = connected_components(cl)
cl_g = induced_subgraph(cl, cc_cl[argmax(length.(cc_cl))])[1]
cc_cl2 = connected_components(cl2)
cl2_g = induced_subgraph(cl2, cc_cl2[argmax(length.(cc_cl2))])[1]
cc_cm = connected_components(cm)
cm_g = induced_subgraph(cm, cc_cm[argmax(length.(cc_cm))])[1]

SAMPLE_SIZE = 200

graphs = [sg, er_g, cl_g, cl2_g, cm_g]
Vs = [sample(1:nv(g), SAMPLE_SIZE, replace=false) for g in graphs]

sps = [Int[] for _ in eachindex(graphs)]
for i in 1:SAMPLE_SIZE
    for j in 1:length(sps)
        d = [e for e in gdistances(graphs[j], Vs[j][i]) if e > 0]
        sps[j] = [sps[j]; d]
    end
end

In [None]:
## compare shortest path length distributions
bins = 0.5:1:11.5
fig, axs = plt.subplots(2, 3)

## plot the 5 histograms
axs[1, 1].hist(sps[1], bins=bins, width=0.9, density=true, color="darkgrey")
axs[1, 1].set_title("Base (GitHub ml)", fontsize=10)
axs[1, 2].hist(sps[2], bins=bins, width=0.9, density=true, color="darkgrey")
axs[1, 2].set_title("Binomial", fontsize=10)
axs[1, 3].hist(sps[3], bins=bins, width=0.9, density=true, color="darkgrey")
axs[1, 3].set_title("Chung-Lu original", fontsize=10)
axs[2, 1].hist(sps[4], bins=bins, width=0.9, density=true, color="darkgrey")
axs[2, 1].set_title("Chung-Lu fixed", fontsize=10)
axs[2, 2].hist(sps[5], bins=bins, width=0.9, density=true, color="darkgrey")
axs[2, 2].set_title("Config. simple", fontsize=10)

## set uniform y-range and ticks
for i in 1:2
    for j in 1:3
        axs[i, j].set_ylim(0, 0.5)
        axs[i, j].set_xticks([2, 4, 6, 8, 10])
    end
end

## adjust 3-2 format
axs[2, 3].set_visible(false)
axs[2, 1].set_position([0.24, 0.08, 0.228, 0.343])
axs[2, 2].set_position([0.55, 0.08, 0.228, 0.343])

## labels only on the outer axis
axs[1, 1].set_ylabel("proportion")
axs[2, 1].set_ylabel("proportion")
axs[2, 1].set_xlabel("path length")
axs[2, 2].set_xlabel("path length")
axs[1, 2].get_yaxis().set_ticklabels([])
axs[1, 3].get_yaxis().set_ticklabels([])

## add mean values
axs[1, 1].text(6, 0.42, "mean: " * string(round(mean(sps[1]), digits=2)), fontsize=8)
axs[1, 2].text(6, 0.42, "mean: " * string(round(mean(sps[2]), digits=2)), fontsize=8)
axs[1, 3].text(6, 0.42, "mean: " * string(round(mean(sps[3]), digits=2)), fontsize=8)
axs[2, 1].text(6, 0.42, "mean: " * string(round(mean(sps[4]), digits=2)), fontsize=8)
axs[2, 2].text(6, 0.42, "mean: " * string(round(mean(sps[5]), digits=2)), fontsize=8);

## Extras
### More power law tests - Grid and GitHub graphs

We fit a power law to the degree distributions as we did before, this time for 3 real graphs:
* GitHub ml developers (giant component)
* GitHub web developers (giant component)
* Grid (Europe power grid graph, giant component)

While the first two exhibit a power law degree distribution, this is not the case for the Grid graph.
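
For intuition about what the fit estimates, the sketch below implements the standard continuous maximum-likelihood estimator $\hat\gamma = 1 + k / \sum_i \ln(x_i/\ell')$ over the $k$ observations with $x_i \ge \ell'$. This is only an illustration under that assumption: the `powerlaw` package also searches over $\ell'$ by minimising the KS distance, and its exact numerical procedure may differ.

```julia
## Sketch (hypothetical helper): continuous MLE for the power-law exponent,
## using only the observations at or above the cutoff xmin.
function powerlaw_mle(x, xmin)
    tail = [v for v in x if v >= xmin]
    return 1 + length(tail) / sum(log(v / xmin) for v in tail)
end

x = [1.0, exp(1), exp(2), 0.5]   # the last value is below the cutoff and ignored
println(powerlaw_mle(x, 1.0))
```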

**GitHub ml subgraph**

In [None]:
## estimates for xmin and gamma
d = degree(sg)
X = powerlaw.Fit(d)
println("gamma:", X.power_law.alpha)
println("l':", X.power_law.xmin)
println("KS statistic:", X.power_law.D)

In [None]:
fig1 = X.power_law.plot_ccdf(color="black", linestyle="-");
fig1 = X.plot_ccdf(ax=fig1, linewidth=2, color="gray", original_data=false, linestyle=":")
fig1.set_xlabel("degree", fontsize=13)
fig1.set_ylabel("inverse cdf", fontsize=13);

**GitHub web subgraph**

In [None]:
## github web developers subgraph
gh_web = induced_subgraph(gh, findall(!=("ml"), gh_lbl))[1]
cc = connected_components(gh_web)
sg = induced_subgraph(gh_web, cc[argmax(length.(cc))])[1]
## estimates for xmin and gamma
d = degree(sg)
X = powerlaw.Fit(d)
println("gamma:", X.power_law.alpha)
println("l':", X.power_law.xmin)
println("KS statistic:", X.power_law.D)

In [None]:
fig1 = X.power_law.plot_ccdf(color="black", linestyle="-");
fig1 = X.plot_ccdf(ax=fig1, linewidth=2, color="gray", original_data=false, linestyle=":")
fig1.set_xlabel("degree", fontsize=13)
fig1.set_ylabel("inverse cdf", fontsize=13);

**Grid graph**

In [None]:
edge_list = split.(readlines(datadir * "GridEurope/gridkit_europe-highvoltage.edges"))
vertex_ids = unique(reduce(vcat, edge_list))
vertex_map = Dict(vertex_ids .=> 1:length(vertex_ids))
gr = SimpleGraph(length(vertex_ids))
foreach(((from, to),) -> add_edge!(gr, vertex_map[from], vertex_map[to]), edge_list)

cc = connected_components(gr)
sg = induced_subgraph(gr, cc[argmax(length.(cc))])[1]
## estimates for xmin and gamma
d = degree(sg)
X = powerlaw.Fit(d)
println("gamma:", X.power_law.alpha)
println("l':", X.power_law.xmin)
println("KS statistic:", X.power_law.D)

In [None]:
fig1 = X.power_law.plot_ccdf(color="black", linestyle="-");
fig1 = X.plot_ccdf(ax=fig1, linewidth=2, color="gray", original_data=false, linestyle=":")
fig1.set_xlabel("degree", fontsize=13)
fig1.set_ylabel("inverse cdf", fontsize=13);

**Independent sets**

We illustrate a few functions for finding independent sets (sets of vertices no two of which are adjacent).
The concept was defined in Chapter 1.
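
As a point of comparison for the library routines used below, here is a minimal greedy sketch on a plain adjacency list (not the `Graphs.jl` implementation): it scans vertices by increasing degree and keeps each vertex that has no previously chosen neighbour. The result is maximal but not necessarily maximum.

```julia
## Sketch (hypothetical helpers, plain adjacency lists instead of Graphs.jl).
function greedy_independent_set(adj)
    blocked = falses(length(adj))
    chosen = Int[]
    for u in sortperm(length.(adj))      # low-degree vertices first
        blocked[u] && continue
        push!(chosen, u)
        blocked[u] = true
        for v in adj[u]
            blocked[v] = true            # neighbours can no longer be chosen
        end
    end
    return chosen
end

## check that no two vertices of S are adjacent
function is_independent(adj, S)
    members = Set(S)
    return all(!(v in members) for u in S for v in adj[u])
end

path = [[2], [1, 3], [2, 4], [3, 5], [4]]   # path graph 1-2-3-4-5
S = greedy_independent_set(path)
println(sort(S), " independent: ", is_independent(path, S))
```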

In [None]:
## generate random graph with (at least one) independent set
## n: nodes, s: independent set size, d: avg degree
function indepSet(n, s, d)
    N = n - s
    di = (n * d ÷ 2) - s * d
    ## random graph with N nodes
    g = erdos_renyi(N, di)
    ## extra nodes
    add_vertices!(g, s)
    ## assign remaining degree to extra nodes
    z = rand(N+1:n, s * d)
    cm = countmap(z)
    deg = [get(cm, i, 0) for i in N+1:n] ## 0 if an extra node was never drawn
    for i in 1:length(deg)
        e = sample(1:N, deg[i], replace=false)
        for j in e
            add_edge!(g, j, i + N)
        end
    end
    return induced_subgraph(g, randperm(nv(g)))[1]
end

In [None]:
function random_indset(g, maxiter)
    best = Int[]
    for _ in 1:maxiter
        this = independent_set(g, MaximalIndependentSet())
        length(this) > length(best) && (best = this)
    end
    return best
end

In [None]:
g = indepSet(50, 10, 20)

In [None]:
# algorithm based on vertex degree
independent_set(g, DegreeIndependentSet())

In [None]:
ris = random_indset(g, 1000)

In [None]:
g_colors = fill("gray", nv(g))
for n in ris
    g_colors[n] = "red"
end
gplot(g, nodefillc=g_colors, nodestrokec="black")