# Chapter 7 - Hypergraphs

In this notebook, we introduce hypergraphs, a generalization of graphs in which edges can have arbitrary size (in practice, we usually consider only edges of size 2 or more).

We illustrate a few concepts using hypergraphs including modularity, community detection, simpliciality and transformation into 2-section graphs.


In [None]:
using Combinatorics
using CSV, DataFrames
using DelimitedFiles
using JSON
using PythonCall
using Random
using CondaPkg
using Statistics
using StatsBase
using Serialization

In [None]:
CondaPkg.withenv() do
  run(`python -m ensurepip`)
  run(`python -m pip install fastnode2vec`)
end

In [None]:
warnings = pyimport("warnings")
warnings.filterwarnings("ignore")
hnx = pyimport("hypernetx")
hmod = pyimport("hypernetx.algorithms.hypergraph_modularity")
plt = pyimport("matplotlib.pyplot")
random = pyimport("random")
ig = pyimport("igraph")
pyimport("sys")."path".append("")
spl = pyimport("simpliciality")
hl = pyimport("h_louvain")
AMI = pyimport("sklearn.metrics").adjusted_mutual_info_score
train_test_split = pyimport("sklearn.model_selection").train_test_split
RandomForestClassifier = pyimport("sklearn.ensemble").RandomForestClassifier
confusion_matrix = pyimport("sklearn.metrics").confusion_matrix
umap = pyimport("umap")
sns = pyimport("seaborn")
xgi = pyimport("xgi")
pickle = pyimport("pickle")

In [None]:
## Set this to the data directory
datadir = "../Datasets/"

In [None]:
## to compute degree-size correlation
function h_deg_size_corr(H)
    deg = Dict(pyconvert(String, v) => pyconvert(Int, H.degree(v)) for v in H.nodes)
    X = Int[]
    Y = Int[]

    for e in H.edges
        for v in H.edges[e]
            push!(X, deg[pyconvert(String, v)])
            push!(Y, length(H.edges[e]))
        end
    end

    return (X, Y, cor(X, Y))
end

In [None]:
pyDict(x...) = pydict(Dict(x...))

In [None]:
## read embedding from disk, in node2vec format
function readEmbedding(fn::String="_embed", sort::Bool=true)
    df = CSV.File(fn; delim=' ', header=false, skipto=2) |> DataFrame
    # Keep only the columns that contain no missing values (e.g. drops an empty trailing column)
    df = df[:, all.(!ismissing, eachcol(df))]
    sort && sort!(df, :Column1)
    Y = Matrix(df[:, 2:end])
    return Y
end

# HyperNetX basics with a toy hypergraph

We illustrate a few concepts with a toy hypergraph. 

First, we build the HNX hypergraph from a list of sets (the hyperedges), and we draw the hypergraph as well as its dual (where the roles of nodes and hyperedges are swapped).
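
As a complement, here is a minimal pure-Julia sketch (independent of HyperNetX) of the incidence matrix of the same toy hypergraph. It illustrates that taking the dual amounts to transposing the incidence matrix. The variable names are illustrative assumptions, and edges are identified simply by their position in the list.

```julia
# Toy hyperedges (same sets as in the HNX example); edges are identified by position
E = [["A", "B"], ["A", "C"], ["A", "B", "C"], ["A", "D", "E", "F"],
     ["D", "F"], ["E", "F"], ["B"], ["G", "B"]]
V = sort(unique(vcat(E...)))          # node labels, sorted alphabetically

# Incidence matrix B: rows are nodes, columns are hyperedges
B = [v in e ? 1 : 0 for v in V, e in E]

# The dual swaps the roles of nodes and hyperedges: its incidence matrix is the transpose
B_dual = permutedims(B)

node_degrees = vec(sum(B, dims=2))    # number of hyperedges containing each node
edge_sizes = vec(sum(B, dims=1))      # number of nodes in each hyperedge
println(Dict(zip(V, node_degrees)))
println(edge_sizes)
```

Row sums of the incidence matrix give node degrees and column sums give edge sizes; in the dual, these two roles are exchanged.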


In [None]:
## build a hypergraph from a vector of hyperedges
E = [Set(["A", "B"]),
     Set(["A", "C"]),
     Set(["A", "B", "C"]),
     Set(["A", "D", "E", "F"]),
     Set(["D", "F"]),
     Set(["E", "F"]),
     Set(["B"]),
     Set(["G", "B"])]
## use 0-based edge ids, matching the edge labels used in the text
H = hnx.Hypergraph(pyDict(i - 1 => e for (i, e) in enumerate(E)))
hnx.draw(H, edges_kwargs=pyDict("edgecolors" => "gray"))
plt.gcf()

In [None]:
## dual hypergraph
H_dual = H.dual()
hnx.draw(H_dual, edges_kwargs=pyDict("edgecolors" => "gray"))
plt.gcf()

In [None]:
## bipartite representation
B = ig.Graph.from_networkx(H.bipartite())
B.vs["label"] = B.vs["_nx_name"]
ly = B.layout_bipartite(types="bipartite")
ig.plot(B, bbox=(400, 300), vertex_color="white", layout=ly, vertex_label_size=14, edge_color="black")

In [None]:
## show the nodes and edges
println("shape:", H.shape)
println("nodes:", [x for x in H.nodes()])
println("edges:", pyconvert.(Int, [x for x in H.edges()]))
println("node degrees:", pyconvert(Tuple, [(v, H.degree(v)) for v in H.nodes()]))
println("edge sizes:", [H.size(e) for e in H.edges()])

In [None]:
## incidence dictionary
pyconvert(Dict, H.incidence_dict)

In [None]:
## incidence matrix
pyconvert(Matrix, H.incidence_matrix().toarray())

In [None]:
## 2-section graph
G = hmod.two_section(H)
ig.plot(G, bbox=(400, 300), vertex_label=G.vs["name"],
        vertex_label_size=12, vertex_color="lightblue",
        edge_width=G.es["weight"])

## s-walks and distance-based measures

We illustrate a few concepts with the toy hypergraph defined earlier.

Let $H=(V,E)$ be a hypergraph, and consider its incidence matrix $B$ as defined in section 7.2.
Consider also the dual hypergraph $H^*$, where the roles of nodes and hyperedges are swapped,
namely the edges in $H$ are the nodes in $H^*$,
and there is an edge between two vertices in $H^*$ if the corresponding hyperedges in $H$ have a non-empty intersection.

### s-walks and distances

We define the concept of $s$-walks on a hypergraph as follows. An $s$-**walk** of length $k$ on $H$ is a sequence of edges $e_{i_0}, e_{i_1}, \ldots, e_{i_k}$ in $E$ such that
$|e_{i_{j-1}} \cap e_{i_j}| \ge s$ and $i_{j-1} \ne i_j$ for all $1 \le j \le k$.

The $s$-**distance** $d_s(e_i,e_j)$ between edges $e_i$ and $e_j$ is the length of the shortest $s$-walk between them, if one exists (otherwise the distance is usually taken to be infinite, and its inverse is set to zero).

A subset $E_s \subset E$ is an $s$-**connected component** if it is a maximal subset with an $s$-walk between all $e_i, e_j \in E_s$.
The $s$-**diameter** for $E_s$ is the maximal shortest path length between all $e_i, e_j \in E_s$.

Other concepts can also be defined using $s$-walks. For example, for distinct $e_i, e_j, e_k \in E$, if there is an $s$-walk $e_i, e_j, e_k$, we say that they form an $s$-**wedge**, and if there is an $s$-walk $e_i, e_j, e_k, e_i$, we say that they form an $s$-**triangle**; from those, we can define the $s$-**clustering coefficients** as in section 1.11.

For **nodes**, all definitions above follow by considering the **dual** hypergraph. For example, an $s$-walk is a sequence of adjacent nodes such that each pair of consecutive nodes in the walk shares at least $s$ hyperedges; all other concepts defined above follow directly.
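
The definition of $s$-distance between edges can be sketched in a few lines of plain Julia with a breadth-first search. This is only an illustration under stated assumptions: the function name `s_distances` is hypothetical (it is not part of HyperNetX), and Julia positions 1 to 8 stand for the 0-based edge ids 0 to 7.

```julia
# Toy hyperedges; Julia positions 1..8 correspond to the 0-based edge ids 0..7
E = [Set(["A", "B"]), Set(["A", "C"]), Set(["A", "B", "C"]), Set(["A", "D", "E", "F"]),
     Set(["D", "F"]), Set(["E", "F"]), Set(["B"]), Set(["G", "B"])]

# s-distance from edge `src` to every edge, via breadth-first search;
# two distinct edges are s-adjacent when they share at least s nodes
function s_distances(E, src, s)
    dist = fill(Inf, length(E))       # Inf marks edges not reachable by an s-walk
    dist[src] = 0
    queue = [src]
    while !isempty(queue)
        i = popfirst!(queue)
        for j in eachindex(E)
            if dist[j] == Inf && length(intersect(E[i], E[j])) >= s
                dist[j] = dist[i] + 1
                push!(queue, j)
            end
        end
    end
    return dist
end

d2 = s_distances(E, 2, 2)   # distances from edge id 1 = {A, C}, with s = 2
println(d2)
```

With $s=2$, the edge {A, C} reaches {A, B, C} in one step and {A, B} in two steps, while every other edge is at infinite distance.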

#### toy example

In the toy example above, with $s=2$, the sequence of edges 1-2-0 is an $s$-path since edges 1 and 2 share nodes A and C, and edges 2 and 0 share nodes A and B.

In the dual toy hypergraph, again with $s=2$, the sequence of nodes (edges in the dual) D-F-E is an $s$-path since nodes D and F are both incident to edges 3 and 4, and nodes F and E are both incident to edges 3 and 5.
Another $s$-path (with $s=2$) is C-A-B.

With $s=1$, this corresponds to a walk on the (unweighted 2-section) graph, while for $s \ge 2$, this concept only applies to hypergraphs.

Below, we compute the distances between every pair of nodes (thus, using the $s$-walks on the dual). An infinite distance between a pair of nodes means that there is no $s$-path joining those.

We see the correspondence between the $s=1$ and graph cases; moreover in those cases, we have a single connected component since every pair of nodes is connected by a path.

With $s=2$, we see several disconnected node pairs, so in this case, we have several $s$-connected components. From inspection of the table below, we see that nodes {A,B,C} are connected and
nodes {D,E,F} are connected, while node G is isolated. We verify this claim below (we also do the same for the edges, i.e. using the $s$-walks on $H$ with $s=2$).
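
The node components claimed above can also be checked by hand with a short pure-Julia sketch that works on the dual view (for each node, the set of hyperedges containing it). The helper `node_s_components` is a hypothetical name introduced for illustration; it is not the HyperNetX implementation.

```julia
E = [["A", "B"], ["A", "C"], ["A", "B", "C"], ["A", "D", "E", "F"],
     ["D", "F"], ["E", "F"], ["B"], ["G", "B"]]
V = sort(unique(vcat(E...)))

# Dual view: for each node, the set of hyperedge positions containing it
incident = Dict(v => Set(i for (i, e) in enumerate(E) if v in e) for v in V)

# s-connected components of the nodes: grow each component by repeatedly
# adding nodes that share at least s hyperedges with a node already in it
function node_s_components(V, incident, s)
    seen = Set{String}()
    comps = Vector{Vector{String}}()
    for v in V
        v in seen && continue
        comp = [v]
        push!(seen, v)
        k = 1
        while k <= length(comp)
            u = comp[k]
            for w in V
                if !(w in seen) && length(intersect(incident[u], incident[w])) >= s
                    push!(seen, w)
                    push!(comp, w)
                end
            end
            k += 1
        end
        push!(comps, sort(comp))
    end
    return comps
end

println(node_s_components(V, incident, 2))
```

With $s=1$ the same function returns a single component containing all seven nodes.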

In [None]:
py_E = [Set(["A", "B"]),
    Set(["A", "C"]),
    Set(["A", "B", "C"]),
    Set(["A", "D", "E", "F"]),
    Set(["D", "F"]),
    Set(["E", "F"]),
    Set(["B"]),
    Set(["G", "B"])]
py_H = hnx.Hypergraph(Dict(enumerate(py_E)))
py_G = hmod.two_section(py_H)

In [None]:
## distances with s=1 and s=2 and on the 2-section graph
Nodes = ["A", "B", "C", "D", "E", "F", "G"]
L = []
for i in 1:length(Nodes)-1
    for j in (i+1):length(Nodes)
        push!(L, [Nodes[i], Nodes[j], G.distances(Nodes[i], Nodes[j])[0][0],
            H.distance(Nodes[i], Nodes[j]), H.distance(Nodes[i], Nodes[j], s=2)])
    end
end
df = DataFrame(permutedims(hcat(L...)), ["node1", "node2", "2-section", "s=1", "s=2"])
df

In [None]:
## s=2 components
Edges = [cc for cc in H.s_connected_components(s=2, return_singletons=true)]
Nodes = [cc for cc in H.s_connected_components(s=2, edges=false, return_singletons=true)]
println("s=2, connected components for the nodes:", pyconvert(Vector{Vector{String}}, Nodes))
println("s=2, connected components for the edges:", pyconvert(Vector{Vector{Int}}, Edges))


## Line graph

Below we illustrate the **line graph** for the toy hypergraph and its dual, with $s=2$.

Recall that in a line graph, the nodes are the edges in the original hypergraph, 
and an edge is drawn between two of them if they share at least $s$ nodes in the original hypergraph.

We see the same connected components as listed above.


In [None]:
## linegraph
LG = ig.Graph.from_networkx(H.get_linegraph(s=2))
ig.plot(LG, bbox=(200, 200), vertex_label=LG.vs["_nx_name"], vertex_label_size=9, vertex_color="lightgrey")

In [None]:
## dual's linegraph
DLG = ig.Graph.from_networkx(H.dual().get_linegraph(s=2))
ig.plot(DLG, bbox=(200, 200), vertex_label=DLG.vs["_nx_name"], vertex_label_size=9, vertex_color="lightgrey")

## Centrality measures

For $H=(V,E)$, we define the **$s$-harmonic centrality** for edge $e_i \in E$ as:
$\frac{1}{|E|-1}\sum_{e_j \in E_s; e_i \ne e_j} \frac{1}{d_s(e_i,e_j)}$.
Recall that for $s$-disconnected edges $e_i, e_j$, we set $\frac{1}{d_s(e_i,e_j)} = 0$.

* n.b.: The HyperNetX implementation uses a different normalization, namely $(|E|-1)(|E|-2)/2$.

For nodes, the definition is identical using the dual hypergraph.
For our toy example, with $s=2$, nodes {A,B,C} form a connected component as we saw earlier, as do
nodes {D,E,F}, while node G is an isolated node.

Looking at the table of distances we computed earlier, we see that $d_2(A,B)=d_2(A,C)=1$ and $d_2(B,C)=2$,
so before normalization, the harmonic centrality for A is 2, and for B and C it is 1.5.
Results are comparable for the other connected component, with values of 1.5 for nodes D and E, and 2 for node F.
Node G is isolated and thus has zero harmonic centrality.
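
The arithmetic above can be reproduced with a small pure-Julia sketch. The distances are hard-coded from the discussion of the toy example (they are not recomputed here), and the names `d2`, `dist` and `harmonic` are illustrative assumptions.

```julia
# Pairwise s-distances (s = 2) between nodes of the toy hypergraph, taken from the
# discussion above; pairs not listed are s-disconnected (distance Inf, so 1/d = 0)
d2 = Dict(("A", "B") => 1, ("A", "C") => 1, ("B", "C") => 2,
          ("D", "F") => 1, ("E", "F") => 1, ("D", "E") => 2)
nodes = ["A", "B", "C", "D", "E", "F", "G"]

# symmetric lookup, returning Inf for missing pairs
dist(u, v) = get(d2, (u, v), get(d2, (v, u), Inf))

# unnormalized harmonic centrality: sum of inverse distances to all other nodes
harmonic(u) = sum(1 / dist(u, v) for v in nodes if v != u)

hc = Dict(u => harmonic(u) for u in nodes)
# normalization used in the text: divide by (number of nodes - 1)
hc_norm = Dict(u => h / (length(nodes) - 1) for (u, h) in hc)
println(hc)
```

Since $1/\infty = 0$ in floating-point arithmetic, disconnected pairs contribute nothing to the sum, which matches the convention stated above.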

We can also define $s$-**betweenness centrality** as we did for graphs, namely for edge $e_i \in E$:

$\frac{1}{(|E|-1)(|E|-2)}\sum_{e_j \in E-\{e_i\}} \sum_{e_k \in E-\{e_i, e_j\}} \frac{\ell(e_j,e_k,e_i)}{\ell(e_j,e_k)}$

where: $\ell(e_j,e_k)$ is the number of shortest $s$-paths between $e_j$ and $e_k$, 
and $\ell(e_j,e_k,e_i)$ is the number of shortest $s$-paths between $e_j$ and $e_k$ that include $e_i$.
Again the definition is the same for nodes using the dual hypergraph.

For our toy example, with $s=2$, the only nodes that are on shortest $s$-paths between other nodes are node A (between B and C)
and node F (between D and E), thus the results we see below.

Other distance-based centrality measures can be defined for hypergraphs in the same way, using $s$-distances,
including the measures we covered in Section 3.3. 
In the example below, we also show **closeness centrality**; note that by default, the computation is done separately for each $s$-connected component, thus the results below.

Computing **eccentricity** (the length of the longest shortest path from a vertex to every other vertex in
the s-linegraph) with $s=2$ returns an error since some nodes are not connected, so we show the results for $s=1$.


In [None]:
## eccentricity - this yields an error with s > 1
hnx.algorithms.s_eccentricity(H, edges=false, s=1)

In [None]:
## centralities for "s=2"
s = 2

hc = hnx.algorithms.s_harmonic_centrality(H, edges=false, s=s, normalized=false)
bc = hnx.algorithms.s_betweenness_centrality(H, edges=false, s=s, normalized=false)
cc = hnx.algorithms.s_closeness_centrality(H, edges=false, s=s)

## normalize w.r.t. definition in the book
n = length(H.nodes)
data = [[v,
    hc[v] / (n - 1),
    2 * bc[v] / ((n - 1) * (n - 2)),
    cc[v]] for v in H.nodes]

D = DataFrame(permutedims(hcat(data...)), [:node, :harmonic, :betweenness, :closeness])
sort(D, :harmonic, rev=true)


## hypergraph modularity (qH) and clustering

We compute qH on the toy graph for 4 different partitions, and using different variations for the edge contribution (a.k.a. $\tau$-modularity).

For edges of size $d$ where $c$ is the number of nodes from the part with the most representatives, we consider the following variations for the edge contribution:

* **strict**: edges are considered only if all nodes are from the same part, with unit weight, i.e. $w$ = 1 iff $c == d$ (0 else).
* **cubic**: edges are counted only if more than half the nodes are from the same part, with weights proportional to the cube of the number of nodes in the majority, i.e. $w = (c/d)^3$ iff $c>d/2$ (0 else).
* **quadratic**: edges are counted only if more than half the nodes are from the same part, with weights proportional to the square of the number of nodes in the majority, i.e. $w = (c/d)^2$ iff $c>d/2$ (0 else).
* **linear**: edges are counted only if more than half the nodes are from the same part, with weights proportional to the number of nodes in the majority, i.e. $w = c/d$ iff $c>d/2$ (0 else).
* **majority**: edges are counted only if more than half the nodes are from the same part, with unit weights, i.e. $w$ = 1 iff $c>d/2$ (0 else).

Some of the above are supplied with the `hmod` module; the **qH2** and **qH3** functions are examples of user-supplied choices.

The order above goes from only counting "pure" edges as community edges, gradually giving more weight to edges with $c>d/2$, all the way to giving them all the same weight.
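
To make the five variations concrete, the sketch below writes each weight as a plain Julia function of $d$ and $c$ and evaluates it on a 4-edge. The `w_*` names are hypothetical helpers for illustration only; they are not the functions shipped with `hmod`.

```julia
# d = edge size, c = number of nodes from the most represented part
w_strict(d, c)    = c == d ? 1.0 : 0.0
w_cubic(d, c)     = c > d / 2 ? (c / d)^3 : 0.0
w_quadratic(d, c) = c > d / 2 ? (c / d)^2 : 0.0
w_linear(d, c)    = c > d / 2 ? c / d : 0.0
w_majority(d, c)  = c > d / 2 ? 1.0 : 0.0

# weights for a 4-edge with c = 2, 3, 4 nodes in the largest part
for c in 2:4
    println("c=$c: ", [w(4, c) for w in (w_strict, w_cubic, w_quadratic, w_linear, w_majority)])
end
```

For a fixed edge, the weights are non-decreasing from strict to majority, which is the ordering described above; all five agree (weight 1) on pure edges and (weight 0) when there is no strict majority.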


In [None]:
## these will be included in the next version of hmod
## square modularity weights
function qH2(d, c)
    return c > d / 2 ? (c / d)^2 : 0
end
## cubic modularity weights
function qH3(d, c)
    return c > d / 2 ? (c / d)^3 : 0
end

## compute hypergraph modularity (qH) for the following partitions:
A1 = [Set(["A", "B", "C", "G"]), Set(["D", "E", "F"])]
A2 = [Set(["B", "C"]), Set(["A", "D", "E", "F", "G"])]
A3 = [Set(["A", "B", "C", "D", "E", "F", "G"])]
A4 = [Set(["A"]), Set(["B"]), Set(["C"]), Set(["D"]), Set(["E"]), Set(["F"]), Set(["G"])]

## we compute with different choices of functions for the edge contribution
for fun in [hmod.strict, qH3, qH2, hmod.linear, hmod.majority]
    println("qH(A1): ", hmod.modularity(H, A1, fun),
        "  qH(A2): ", hmod.modularity(H, A2, fun),
        "  qH(A3): ", hmod.modularity(H, A3, fun),
        "  qH(A4): ", hmod.modularity(H, A4, fun))
end

### weighted 2-section graph

We already built the 2-section weighted graph **G** for the above toy hypergraph.

Here we run the Leiden clustering algorithm on this graph, and compare with Kumar's hypergraph clustering algorithm.

We run each algorithm multiple times to show the difference in performance. In general, hypergraph-based algorithms are much slower than graph-based algorithms.
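
For reference, a weighted 2-section graph can be sketched in pure Julia as follows. The weighting is an assumption made for this illustration: each hyperedge of size $d \ge 2$ adds $1/(d-1)$ to every pair of its nodes, and singleton edges are ignored. The function name `two_section_weights` is hypothetical and the sketch does not call `hmod.two_section`.

```julia
E = [["A", "B"], ["A", "C"], ["A", "B", "C"], ["A", "D", "E", "F"],
     ["D", "F"], ["E", "F"], ["B"], ["G", "B"]]

# Accumulate clique weights; each hyperedge of size d >= 2 adds 1/(d-1) to every pair
# (assumed weighting; singleton edges contribute nothing)
function two_section_weights(E)
    W = Dict{Tuple{String,String},Float64}()
    for e in E
        d = length(e)
        d < 2 && continue
        nodes = sort(e)
        for i in 1:d-1, j in i+1:d
            key = (nodes[i], nodes[j])
            W[key] = get(W, key, 0.0) + 1 / (d - 1)
        end
    end
    return W
end

W = two_section_weights(E)
println(W[("A", "B")])   # 1 from {A,B} plus 1/2 from {A,B,C}
```

Under this weighting, the weighted degree of a node in the 2-section graph equals the number of non-singleton hyperedges containing it (for example, 4 for node A).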


In [None]:
## 2-section graph
G.vs["label"] = G.vs["name"]
ig.plot(G, bbox=(0, 0, 250, 250), edge_width=G.es["weight"],
        vertex_color="gainsboro", vertex_label_size=10)

In [None]:
## 2-section clustering with Leiden
@time begin
    for i in 1:100
        G.vs["community"] = G.community_leiden(objective_function="modularity", weights="weight").membership
    end
end
println("clusters:", hmod.dict2part(pyDict(v["name"] => v["community"] for v in G.vs)))


In [None]:
## Kumar clustering
cl = nothing
@time begin
    for i in 1:100
        cl = hmod.kumar(H)
    end
end
println("clusters:", cl)

## Simplicial ratio

We use the same toy graph, but we remove the singleton edge {"B"}.

First, we see a simplicial ratio slightly above 1, and we also see that the two simplicial pairs between 2-edges and 3-edges are more surprising than the two pairs between 2-edges and 4-edges.
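
The counts of simplicial pairs mentioned above can be verified directly. The sketch below is plain Julia and assumes that a simplicial pair is a pair of edges where the smaller one is a proper subset of the larger one; it does not use the `simpliciality` module, and `pair_counts` is an illustrative name.

```julia
# Toy hypergraph without the singleton edge {B}
E = [Set(["A", "B"]), Set(["A", "C"]), Set(["A", "B", "C"]), Set(["A", "D", "E", "F"]),
     Set(["D", "F"]), Set(["E", "F"]), Set(["G", "B"])]

# Count pairs (e1, e2) with e1 a proper subset of e2, keyed by (size of e1, size of e2)
pair_counts = Dict{Tuple{Int,Int},Int}()
for e1 in E, e2 in E
    if length(e1) < length(e2) && issubset(e1, e2)
        key = (length(e1), length(e2))
        pair_counts[key] = get(pair_counts, key, 0) + 1
    end
end
println(pair_counts)
```

This only counts the observed pairs; the simplicial ratio additionally compares these counts with what is expected under a random model, which is why it relies on sampling.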


In [None]:
## toy example without the singleton edge
vertices = pylist([v for v in H.nodes()])
edges = pylist(pyset.([Set(["A", "B"]),
    Set(["A", "C"]),
    Set(["A", "B", "C"]),
    Set(["A", "D", "E", "F"]),
    Set(["D", "F"]),
    Set(["E", "F"]),
    Set(["G", "B"])]))
## simplicial ratio
random.seed(42)
spl.get_simplicial_ratio(vertices, edges, samples=1000)


In [None]:
## simplicial matrix
random.seed(42)
spl.get_simplicial_matrix(vertices, edges, samples=1000)

In [None]:
## number of simplicial pairs
spl.get_simplicial_pairs(vertices, edges, as_matrix=true)

### Other simpliciality measures

* no 3+ edge has downward closure, so the fraction is 0
* edit simpliciality is 7/16, since 9 edges would need to be added to get downward closures
* face edit simpliciality: the two values for maximal edges are 3/4 and 3/11 (keeping the maximal face in the counts) or 2/3 and 2/10 otherwise
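
The values quoted in the list above can be reproduced by hand with the following pure-Julia sketch. It assumes that only faces with at least 2 nodes are counted (which is what yields 16 sets in the downward closure); the helper `faces` and the other names are hypothetical and do not come from the `simpliciality` module.

```julia
E = [["A", "B"], ["A", "C"], ["A", "B", "C"], ["A", "D", "E", "F"],
     ["D", "F"], ["E", "F"], ["G", "B"]]
edges = Set(Set(e) for e in E)

# all subsets of e with at least `minsize` nodes (including e itself), via bitmasks
function faces(e; minsize=2)
    out = Set{Set{String}}()
    for mask in 1:(2^length(e) - 1)
        sub = Set(e[i] for i in eachindex(e) if (mask >> (i - 1)) & 1 == 1)
        length(sub) >= minsize && push!(out, sub)
    end
    return out
end

# downward closure: union of the faces of every edge
closure = union((faces(e) for e in E)...)

# edit simpliciality: fraction of the closure already present as edges
edit_simpliciality = length(edges) // length(closure)

# simplicial fraction: share of edges of size >= 3 whose faces are all present
large_edges = [e for e in E if length(e) >= 3]
simplicial_fraction = count(e -> issubset(faces(e), edges), large_edges) // length(large_edges)

# face edit simpliciality of the two large (maximal) edges, keeping the edge itself
face_edit = [length(intersect(faces(e), edges)) // length(faces(e)) for e in large_edges]

println(edit_simpliciality, " ", simplicial_fraction, " ", face_edit)
```

The closure has 4 sets from {A,B,C}, 11 from {A,D,E,F} and 1 from {G,B}, of which 7 are present, hence 9 missing edges.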
    

In [None]:
println("Simplicial fraction:", spl.get_simplicial_fraction(vertices, edges))
println("Edit simpliciality:", spl.get_edit_simpliciality(vertices, edges))
println("Face edit simpliciality:", spl.get_face_edit_simpliciality(vertices, edges, exclude_self=false))
println("Face edit simpliciality:", spl.get_face_edit_simpliciality(vertices, edges, exclude_self=true))

# h-ABCD Examples

Julia code to generate h-ABCD benchmarks can be found here:
https://github.com/bkamins/ABCDHypergraphGenerator.jl

The first small h-ABCD hypergraph we use next was generated as follows:

`julia --project abcdh.jl -n 100 -d 2.5,3,10 -c 1.5,30,40 -x .2 -q 0,.3,.4,.3 -w :strict -s 123 -o toy_100`

It has 100 nodes and 3 well-defined communities. We will use this example mainly for visualization.

The second one, which is much noisier, was generated as follows:

`julia --project abcdh.jl -n 300 -d 2.5,5,30 -c 1.5,80,120 -x .6 -q 0,0,.1,.9 -w :strict -s 123 -o toy_300`

We will use this example to show that optimizing the appropriate hypergraph modularity function can lead to better clustering in some cases.
    

## 100-node h-ABCD - visualization

In [None]:
## read the edges and build the h-ABCD hypergraph H
Edges = []
for line in eachrow(readdlm(datadir * "ABCD/toy_100_he.txt"))
    push!(Edges, Set(string.(split(line[1], ','))))
end
H = hnx.Hypergraph(pyDict(enumerate(Edges)))
println("distribution of edge sizes:", countmap([length(x) for x in Edges]))


In [None]:
## read the ground-truth communities and assign node colors accordingly
H_comm = Dict(string(k) => v for (k, v) in enumerate(readdlm(datadir * "ABCD/toy_100_assign.txt", Int)))
cls = ["white", "darkgrey", "black"]
node_colors = Dict(zip(string.(collect(H.nodes)), [cls[H_comm[string(i)]] for i in H.nodes]))

## build the 2-section graph and plot (with ground-truth community colors)
g = hmod.two_section(H)
for v in g.vs
    name = string(v["name"])
    v["color"] = node_colors[name]
    v["gt"] = H_comm[name]
end

random.seed(12345)
ly = g.layout_fruchterman_reingold()
g.vs["ly"] = [x for x in ly]
fig, ax = plt.subplots(figsize=(7, 7))
ig.plot(g, target=ax, vertex_size=9, layout=ly, edge_color="darkgrey", edge_width=1)
plt.gcf()

In [None]:
## rubber band plot
H_ly = pyDict(zip(g.vs["name"], pylist.([[x[0], x[1]] for x in g.vs["ly"]])))
fig, ax = plt.subplots(figsize=(7, 7))
hnx.draw(H, with_node_labels=false, with_edge_labels=false, node_radius=0.67,
    nodes_kwargs=pyDict("facecolors" => pydict(node_colors), "edgecolors" => "black"),
    edges_kwargs=pyDict("edgecolors" => "darkgrey"),
    pos=H_ly)
plt.gcf()

In [None]:
### Plot via convex hull with the XGI package
H_nc = pyDict(zip(g.vs["name"], g.vs["color"]))
fig, ax = plt.subplots(figsize=(7, 7))
XH = xgi.Hypergraph(Edges)
xgi.draw(XH, node_fc=H_nc, dyad_color="grey", hull=true, radius=0.15, edge_fc_cmap="Greys_r", alpha=0.2, pos=H_ly, node_size=8, ax=ax, node_labels=false);

### Edge composition

Recall we call a $d$-edge a **community** edge if $c>d/2$ where $c$ is the number of nodes that belong to the **most represented** community.

Below we show the number of edges with all values $d$ and $c$, community edges or not.
We see that given the ground-truth communities, most community edges are *pure* in the sense that $c=d$.

In real examples, we usually do not know the ground-truth communities, or at least not for every node.
We can try some clustering, for example graph clustering on the 2-section graph, or Kumar's algorithm on the hypergraph, to get a sense of edge composition.

The result is quite similar to the ground-truth.
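
The computation of $d$, $c$ and the community-edge flag for a single hyperedge can be sketched as follows. The community labels and the function name `edge_composition` are hypothetical, chosen only to illustrate the definition; they are not taken from the h-ABCD data.

```julia
# hypothetical community labels for six nodes
comm = Dict("1" => 1, "2" => 1, "3" => 1, "4" => 2, "5" => 2, "6" => 3)

# for one hyperedge, return its size d, the count c of nodes in the most
# represented community, and whether it is a community edge (c > d/2)
function edge_composition(e, comm)
    counts = Dict{Int,Int}()
    for v in e
        counts[comm[v]] = get(counts, comm[v], 0) + 1
    end
    d = length(e)
    c = maximum(values(counts))
    return (d=d, c=c, community_edge=c > d / 2)
end

println(edge_composition(["1", "2", "3"], comm))        # pure edge: c = d
println(edge_composition(["1", "2", "4", "5"], comm))   # tie: not a community edge
println(edge_composition(["1", "4", "6"], comm))        # all nodes in different communities
```

Note that a tie ($c = d/2$) does not count as a community edge, since the definition requires a strict majority.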


In [None]:
## edge composition - ground truth
L = []
for e in H.edges
    push!(L, (maximum(values(countmap([H_comm[string(i)] for i in H.edges[e]]))), length(H.edges[e])))
end
X = countmap(L)

L = []
for x in X
    push!(L, [x[1][2], x[1][1], x[1][1] > x[1][2] / 2, x[2]])
end
D = DataFrame(hcat(L...)', ["d", "c", "community edge", "frequency (ground truth)"])
D = sort(D, ["d", "c"])

## edge composition - Leiden on 2-section
g.vs["leiden"] = g.community_leiden(objective_function="modularity", weights="weight").membership
leiden = Dict(zip(g.vs["name"], g.vs["leiden"]))
L = []
for e in H.edges
    push!(L, (maximum(values(countmap([leiden[i] for i in H.edges[e]]))), length(H.edges[e])))
end
X = countmap(L)
L = []
for x in X
    push!(L, [x[1][2], x[1][1], x[1][1] > x[1][2] / 2, x[2]])
end
D2 = DataFrame(hcat(L...)', ["d", "c", "community edge", "frequency (Leiden)"])
D2 = sort(D2, ["d", "c"])

D."frequency (Leiden)" = D2[!, "frequency (Leiden)"]
D = sort(D, "frequency (ground truth)")
D

### simpliciality

We show some measures of simpliciality, namely the number of simplicial pairs, the simpliciality matrix and the simplicial ratio measure.

The simplicial ratio value is around 1.3 (recall it is based on sampling), which indicates that this hypergraph does not exhibit high simpliciality.


In [None]:
E = [Set(H.edges[e]) for e in H.edges]
V = collect(Set([x for y in E for x in y]))
spl.get_simplicial_pairs(V, E, as_matrix=true)

In [None]:
py_V = pylist(V)
py_E = pylist(pyset.(E))
spl.get_simplicial_matrix(py_V, py_E, samples=1000)

In [None]:
spl.get_simplicial_ratio(py_V, py_E, samples=1000)

In [None]:
## other measures of simpliciality
println("Simplicial fraction:", spl.get_simplicial_fraction(V, E))
println("Edit simpliciality:", spl.get_edit_simpliciality(V, E))
println("Face edit simpliciality:", spl.get_face_edit_simpliciality(V, E, exclude_self=true))


# 300-node noisy h-ABCD

This is a noisier hypergraph with $\xi=0.6$, edges mostly of size 4 and some edges of size 3.

In the experiment below, we run each of the following algorithms 30 times and compare AMI with the ground-truth communities.
* Leiden on 2-section (weighted) graph
* Kumar's algorithm
* h-Louvain

We observe that Kumar's algorithm, which does take the hypergraph structure into account, slightly improves on the results with 2-section clustering,
while h-Louvain improves it further, albeit with slower run time.


In [None]:
## read the edges and build the h-ABCD hypergraph H
edgelist = datadir * "ABCD/toy_300_he.txt"
Edges = []
for line in eachrow(readdlm(edgelist))
    push!(Edges, Set(string.(split(line[1], ','))))
end
H = hnx.Hypergraph(pyDict([i - 1 => e for (i, e) in enumerate(Edges)]))

## read the ground-truth communities and assign node colors accordingly
H_comm = Dict(string(k) => v for (k, v) in enumerate(readdlm(datadir * "ABCD/toy_300_assign.txt")))

## build the 2-section graph
g = hmod.two_section(H)
for v in g.vs
    v["gt"] = H_comm[string(v["name"])]
end

In [None]:
## reduce the number of repeats (REP) for a faster run (we used REP=30 for the book)
REP = 15
L = []
random.seed(321)

for s in 1:REP
    g.vs["leiden"] = g.community_leiden(objective_function="modularity", weights="weight").membership
    ami_g = AMI(g.vs["gt"], g.vs["leiden"])
    H_kumar = hmod.kumar(H)
    H_kumar_dict = hmod.part2dict(H_kumar)
    ami_k = AMI([H_comm[string(v)] for v in H.nodes], [H_kumar_dict[string(v)] for v in H.nodes])
    push!(L, [ami_g, ami_k])
end
D = DataFrame(pyconvert.(Float64, hcat(L...))', ["2-section", "Kumar"])
println("mean values:")
println(Dict(zip(names(D), mean.(eachcol(D)))))

### Running h-Louvain with Bayesian Optimization

This is slower as for each repetition, several attempts are made to find a good set of parameters using Bayesian optimization.
Results are saved and can be retrieved for plotting. To re-run the experiment, uncomment the second cell below.

In [None]:
L = deserialize(datadir * "ABCD/toy_300_h-Louvain_jl.ser")
D."h-Louvain" = L[1:15]
plt.figure(figsize=(6, 5))
sns.boxplot(Matrix(D), width=0.5, color="darkgray", linewidth=1.2)
plt.xticks([0, 1, 2], names(D))
plt.ylabel("AMI", fontsize=14)
plt.gcf()

In [None]:
## no simplicial pair in this case
E = pylist(pyset.([Set(H.edges[e]) for e in H.edges]))
V = pylist(collect(Set([x for y in E for x in y])))
spl.get_simplicial_pairs(V, E, as_matrix=true)

In [None]:
## other measures
println("Simplicial fraction:", spl.get_simplicial_fraction(V, E))
println("Edit simpliciality:", spl.get_edit_simpliciality(V, E))
println("Face edit simpliciality:", spl.get_face_edit_simpliciality(V, E, exclude_self=true))

## Embeddings

We fit two embeddings to the h-ABCD hypergraph, namely:
* 2-section node2vec
* bipartite node2vec (where we ignore the edge embeddings)

We fit a classifier where we train on 50% of the points, and test on the rest,
after reducing to 16-dim via UMAP.

We check whether keeping more of the hypergraph structure, as the bipartite representation does, helps.


In [None]:
## 2-section
open("_edgelist", "w") do file
  for e in g.to_tuple_list()
    write(file, "$(e[0]) $(e[1])\n")
  end
end
CondaPkg.withenv() do
  run(`python n2v_to_file.py _edgelist 32 1 1 1`)
end
X_twosec = readEmbedding("_embed")

## 2-section - 2-d visualization
U = umap.UMAP().fit_transform(X_twosec)
df = DataFrame(pyconvert(Matrix{Float64}, U), ["X", "Y"])
plt.figure(figsize=(6, 6))
plt.scatter(df.X, df.Y, c=g.vs["gt"], s=25)
plt.gcf()

In [None]:
## bipartite (edges are in first positions; we ignore the edges)
G = ig.Graph.from_networkx(H.bipartite())
open("_edgelist", "w") do file
  for e in G.to_tuple_list()
    write(file, "$(e[0]) $(e[1])\n")
  end
end
CondaPkg.withenv() do
  run(`python n2v_to_file.py _edgelist 32 1 1 1`)
end
n_edges = length([e for e in H.edges()])
X_bip = readEmbedding("_embed")[n_edges+1:end, :]

## bipartite 2-d viz
U = umap.UMAP().fit_transform(X_bip)
df = DataFrame(pyconvert(Matrix{Float64}, U), ["X", "Y"])
plt.figure(figsize=(6, 6))
plt.scatter(df.X, df.Y, c=g.vs["gt"], s=25);
plt.gcf()

## fit a classifier

We train on half the data chosen at random, which we repeat several times.
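
The accuracy reported for each repeat is the sum of the diagonal of the confusion matrix divided by the total number of test points. A minimal pure-Julia sketch of that computation is given here; the labels are made up for illustration and the helpers `confusion` and `accuracy` are hypothetical (the notebook itself uses scikit-learn's `confusion_matrix`).

```julia
# confusion matrix: rows are true classes, columns are predicted classes
function confusion(y_true, y_pred, k)
    cm = zeros(Int, k, k)
    for (t, p) in zip(y_true, y_pred)
        cm[t, p] += 1
    end
    return cm
end

# accuracy = correctly classified points (the diagonal) over all points
accuracy(cm) = sum(cm[i, i] for i in axes(cm, 1)) / sum(cm)

# hypothetical labels for 8 test points and 3 classes
y_true = [1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 2, 2, 2, 3, 3, 3, 1]
cm = confusion(y_true, y_pred, 3)
println(cm, " accuracy = ", accuracy(cm))
```

Here 5 of the 8 points fall on the diagonal, so the accuracy is 0.625.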



In [None]:
## classifier - with 2-section and bipartite embeddings
acc = []
acc_b = []
y = label = g.vs["gt"]

for seed in 0:10:50 ## we used 30 repeats in textbook which can take a few minutes

    ## 2-section
    X = umap.UMAP(n_components=16, n_jobs=1, random_state=seed).fit_transform(X_twosec)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, bootstrap=true, max_features="sqrt", random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    push!(acc, sum(cm.diagonal()) / sum(sum(cm)))

    ## bipartite - same seed
    X = umap.UMAP(n_components=16, n_jobs=1, random_state=seed).fit_transform(X_bip)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, bootstrap=true, max_features="sqrt", random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # print(cm)
    push!(acc_b, sum(cm.diagonal()) / sum(sum(cm)))
end
println(mean(acc), " ", mean(acc_b))

In [None]:
## compare the results - we see slightly better results with the bipartite representation
plt.figure(figsize=(6, 5))
sns.boxplot([acc acc_b], width=0.5, color="darkgray", linewidth=1.2);
plt.xticks([0, 1], ["2-section", "bipartite"])
plt.grid()
plt.ylabel("Accuracy", fontsize=14)
plt.gcf()

# Game of Thrones scenes hypergraph

The original data can be found here: https://github.com/jeffreylancaster/game-of-thrones.

A pre-processed version is provided, where we consider a hypergraph built from the Game of Thrones scenes with the following elements:

* **Nodes** are named characters in the series
* **Hyperedges** are groups of characters appearing in the same scene(s)
* **Hyperedge weights** are total scene(s) duration in seconds involving each group of characters

We kept hyperedges with at least 2 characters and we discarded characters with degree below 5.

We saved the following:

* *Edges*: list of sets where the nodes are 0-based integers represented as strings: "0", "1", ... "n-1"
* *Names*: dictionary; mapping of nodes to character names
* *Weights*: list; hyperedge weights (in same order as Edges)


In [None]:
## read the data
open(datadir * "GoT/GoT.pkl", "r") do f
    global got_pkl = pickle.load(f)
end
got_edges, got_names, got_weights = got_pkl

## Build the weighted hypergraph 

Use the above to build the weighted hypergraph (GoT).

In [None]:
## Nodes are represented as strings from "0" to "n-1"
GoT = hnx.Hypergraph(pyDict([(k - 1, v) for (k, v) in enumerate(got_edges)]))

In [None]:
## add full names of characters and compute node strength (a.k.a. weighted degree)
I, _node, _edge = GoT.incidence_matrix(index=true)
S = pyconvert(Matrix, I.toarray()) * pyconvert(Vector{Int}, [got_weights[pyconvert(Int, i)] for i in _edge])
Strength = pyDict(i => j for (i, j) in zip(_node, S))
for v in GoT.nodes
    GoT.nodes[v].name = got_names[v]
    GoT.nodes[v].strength = Strength[v]
end
for e in GoT.edges
    GoT.edges[e].weight = got_weights[e]
end

## EDA on the GoT hypergraph

Simple exploratory data analysis (EDA) on this hypergraph. 

In [None]:
## edge sizes (number of characters per scene)
plt.figure(figsize=(6, 4))
plt.hist([GoT.size(e) for e in GoT.edges], bins=25, color="grey")
plt.xlabel("Edge size", fontsize=14)

## max edge size
println("max edge size:", maximum([GoT.size(e) for e in GoT.edges]))
println("median edge size:", median(pyconvert.(Int, [GoT.size(e) for e in GoT.edges])))
plt.gcf()

In [None]:
## edge weights (total scene durations for each group of characters appearing together)
plt.figure(figsize=(6, 4))
plt.hist([got_weights], bins=25, color="grey")
plt.xlabel("Edge weight", fontsize=14);

## max/median edge weight
println("max edge weight:", maximum(got_weights))
println("median edge weight:", median(pyconvert(Vector{Int}, got_weights)))
plt.gcf()

In [None]:
## node degrees
plt.figure(figsize=(6, 4))
plt.hist(hnx.degree_dist(GoT), bins=20, color="grey")
plt.xlabel("Node degree", fontsize=14);

## max degree
println("max node degree:", maximum(hnx.degree_dist(GoT)))
println("median node degree:", median(pyconvert(Vector{Int}, hnx.degree_dist(GoT))))
plt.gcf()

In [None]:
## node strength (total scene appearance)
plt.figure(figsize=(6, 4))
plt.hist([GoT.nodes[n].strength for n in GoT.nodes], bins=20, color="grey")
plt.xlabel("Node strength", fontsize=14);

## max strength
println("max node strength:", maximum([GoT.nodes[n].strength for n in GoT.nodes]))
println("median node strength:", median(pyconvert(Vector{Int}, [GoT.nodes[n].strength for n in GoT.nodes])))
plt.gcf()

In [None]:
## build a dataframe with node characteristics
df = DataFrame([
        pyconvert.(String, [GoT.nodes[v].name for v in GoT.nodes()]),
        pyconvert.(Int, [GoT.degree(v) for v in GoT.nodes()]),
        pyconvert.(Int, [GoT.nodes[v].strength for v in GoT.nodes()]),
    ],
    ["name", "degree", "strength"])
first(sort(df, "strength", rev=true), 12)


### Compute s-centrality and betweenness

We consider $s=1$ and $s=2$ below.

In [None]:
## with s=1
bet = hnx.s_betweenness_centrality(GoT, edges=false)
har = hnx.s_harmonic_centrality(GoT, edges=false, normalized=false)
df."betweenness(s=1)" = pyconvert.(Float64, [bet[v] for v in GoT.nodes()])
n = GoT.shape[0]
df."harmonic(s=1)" = pyconvert.(Float64, [har[v] / (n - 1) for v in GoT.nodes()])

## with s=2
bet = hnx.s_betweenness_centrality(GoT, edges=false, s=2)
har = hnx.s_harmonic_centrality(GoT, edges=false, normalized=false, s=2)
df."betweenness(s=2)" = pyconvert.(Float64, [bet[v] for v in GoT.nodes()])
df."harmonic(s=2)" = pyconvert.(Float64, [har[v] / (n - 1) for v in GoT.nodes()])

#print(df.sort_values(by=["strength"],ascending=false).head(10)[["name","degree","strength","betweenness(s=1)","harmonic(s=1)"]].to_latex(index=false, float_format="{:0.5f}".format))
first(sort(df, "strength", rev=true), 10)

## Build 2-section graph and compute a few centrality measures

We saw several centrality measures for graphs in chapter 3. 

Below, we build the 2-section graph for GoT and compute a few of those. 

**n.b.: Unlike in the first edition of the book, we now ignore edge weights to compare with the hypergraph s-measures.**
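
As a reminder of what the 2-section is, the following sketch builds the unweighted 2-section edge set of a toy hypergraph in pure Julia. The data and the function name `two_section_pairs` are hypothetical. This illustrates the construction only; it is not the `hmod.two_section` implementation.

```julia
# Toy hypergraph (hypothetical data)
hyperedges = [["a", "b", "c"], ["c", "d"], ["a", "b"]]

# 2-section: every pair of nodes sharing a hyperedge becomes a graph edge;
# a Set keeps a single unweighted copy of each pair
function two_section_pairs(hyperedges)
    pair_set = Set{Tuple{String,String}}()
    for e in hyperedges
        vs = sort(e)
        for i in 1:length(vs)-1, j in i+1:length(vs)
            push!(pair_set, (vs[i], vs[j]))
        end
    end
    return pair_set
end

ts_edges = two_section_pairs(hyperedges)
println(sort(collect(ts_edges)))
```

The pair (a, b) appears in two hyperedges but is kept only once, which is what ignoring edge weights amounts to in this sketch.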


In [None]:
## build 2-section
G = hmod.two_section(GoT)

## betweenness
n = G.vcount()
b = G.betweenness(directed=false)
G.vs["bet"] = [2 * x / ((n - 1) * (n - 2)) for x in b]
for v in G.vs
    GoT.nodes[v["name"]].bet = v["bet"]
end
df."betweenness" = pyconvert.(Float64, [GoT.nodes[v].bet for v in GoT.nodes()])

## harmonic
G.vs["hc"] = G.harmonic_centrality(normalized=true)
for v in G.vs
    GoT.nodes[v["name"]].hc = v["hc"]
end
df."harmonic" = pyconvert.(Float64, [GoT.nodes[v].hc for v in GoT.nodes()])

## order w.r.t. harmonic
first(sort(df, "harmonic", rev=true), 5)


In [None]:
## high correlation between centrality measures
corr = cor(Matrix(df[:, 4:end]))
println(names(df)[4:end])
corr

## Hypergraph modularity and clustering

We use $\tau=3$ for the hypergraph ($\tau$) modularity weights below.


In [None]:
##### visualize the 2-section graph
println("nodes:", G.vcount(), " edges:", G.ecount())
G.vs["size"] = 14
G.vs["color"] = "lightgrey"
G.vs["label"] = [parse(Int, pyconvert(String, x)) for x in G.vs["name"]] ## use int(name) as label
G.vs["character"] = [GoT.nodes[n].name for n in G.vs["name"]]
G.vs["label_size"] = 6
random.seed(42)
ly_fr = G.layout_fruchterman_reingold()
ig.plot(G, layout=ly_fr, bbox=(0, 0, 600, 400), edge_color="lightgrey")

In [None]:
## we see a well-separated small clique; it is the Braavosi theater troupe
println([GoT.nodes[e].name for e in string.(166:172)])

### random clustering


In [None]:
## Compute modularity (with qH3 function) on several random partitions with K parts for a range of K's
## This should be close to 0 and can be negative.
h = []
for K in 2:2:21
    for rep in [1] ## 10 for the textbook
        V = collect(GoT.nodes)
        Random.seed!(K * rep)
        p = sample(0:K-1, length(V))
        RandPart = hmod.dict2part(pyDict(V[i] => p[i] for i in 1:length(V)))
        ## drop empty sets if any
        RandPart = [x for x in RandPart if length(x) > 0]
        ## compute qH
        push!(h, hmod.modularity(GoT, RandPart, qH3))
    end
end
println("range for qH:", minimum(h), " to ", maximum(h))


In [None]:
plt.figure(figsize=(5, 4))
sns.boxplot(h, showfliers=false, width=0.5)
plt.gcf()

### 2-section graph clustering

In [None]:
## Cluster the 2-section graph (with Leiden) and compute qH
## We now see qH >> 0
qH_best = -1
for i in 1:100
    G.vs["_leiden"] = G.community_leiden(objective_function="modularity", weights="weight", resolution=1.0).membership
    ML = hmod.dict2part(pyDict(v["name"] => v["_leiden"] for v in G.vs))
    qH = pyconvert(Float64, hmod.modularity(GoT, ML, qH3))
    if qH > qH_best
        qH_best = qH
        G.vs["leiden"] = G.vs["_leiden"]
    end
end
println("qH:", qH_best)
for v in G.vs
    GoT.nodes[v["name"]].leiden = v["leiden"]
end
df."leiden_cluster" = [GoT.nodes[v].leiden for v in GoT.nodes()];


In [None]:
## plot 2-section w.r.t. the resulting clusters
cl = G.vs["leiden"]

## pick greyscale or color plot:
pal = ig.GradientPalette("white", "black", maximum(cl) + 1)
pal = ig.ClusterColoringPalette(maximum(cl) + 1)
G.vs["color"] = [pal[x] for x in cl]

## show labels or not
G.vs["label_size"] = 0

ig.plot(G, layout=ly_fr, bbox=(0, 0, 600, 400), edge_color="gainsboro", vertex_size=8)

### edge composition after clustering

We see that the most frequent edges are small "pure" edges, but there are also several edges with all but one node from the same community.

This suggests an intermediate value for $\tau$, such as $\tau=2$ or $\tau=3$, for the exponent in the modularity.
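
The effect of $\tau$ can be illustrated with a small sketch. It assumes the modularity weight of an edge of size $d$, whose largest community part has $c$ nodes, is $(c/d)^\tau$ when $c > d/2$ and 0 otherwise. The function `edge_weight` is a hypothetical stand-in; the `qH3` function used in this notebook is defined earlier and may differ.

```julia
# Hypothetical weight function: an edge of size d whose largest community part
# has c nodes contributes (c/d)^tau if c is a strict majority, and 0 otherwise
edge_weight(d, c, tau) = c > d / 2 ? (c / d)^tau : 0.0

# Compare a pure edge (c = d) with an edge that has one outsider (c = d - 1)
d = 5
for tau in 1:4
    println("tau=", tau, "  pure: ", edge_weight(d, d, tau),
            "  one outsider: ", round(edge_weight(d, d - 1, tau), digits=3))
end
```

A pure edge always has weight 1, while the weight of the edge with one outsider shrinks as $\tau$ grows. Intermediate values therefore still reward nearly pure edges without treating them as fully pure.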


In [None]:
comm_dict = pyDict(zip(G.vs["name"], G.vs["leiden"]))
L = []
for e in GoT.edges
    push!(L, collect(values(countmap([comm_dict[i] for i in GoT.edges[e]]))))
end
X = countmap(L)
L = []
for x in X
    push!(L, [length(x[1]), sum(x[1]), x[1][1], x[2], x[1][1] > sum(x[1]) / 2])
end
df_cd = DataFrame(hcat(L...)', ["n_comm", "d", "c", "frequency", "community edge"],)
sort!(df_cd, "frequency", rev=true)
df_cd.cum_freq = cumsum(df_cd.frequency) / pyconvert(Int, GoT.shape[1])
first(df_cd, 10)


### Kumar's algorithm

In [None]:
Ku = hmod.kumar(GoT, verbose=false)
println("qH:", hmod.modularity(GoT, Ku, qH3))
dct = hmod.part2dict(Ku)
G.vs["kumar"] = [dct[i] for i in G.vs["name"]]
df."kumar" = [dct[v] for v in G.vs["name"]]
println("AMI vs 2-section partitions:", AMI(G.vs["leiden"], G.vs["kumar"]))

### Looking at one of the lead characters

In [None]:
## ex: high strength nodes in same cluster with Daenerys Targaryen
df.leiden_cluster = pyconvert.(Int, df.leiden_cluster)
dt = df[df.name.=="Daenerys Targaryen", "leiden_cluster"][1]
first(sort(df[df."leiden_cluster".==dt, :], "strength", rev=true), 9)

## Compute the simplicial ratio and other simpliciality measures

We see a simplicial ratio well above 1, suggesting more simplicial pairs than would happen at random.

For the other measures, the simplicial fraction (0.07) and, even more so, the edit simpliciality (7e-5) are small,
which is to be expected as there are several large edges in this dataset.
The face edit simpliciality is a bit higher at 0.26.
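
The sketch below counts simplicial pairs on a toy hypergraph, assuming a simplicial pair is a pair of distinct edges where one is strictly contained in the other. The data and the function name `count_simplicial_pairs` are hypothetical. The `spl` module used in this notebook additionally compares such a count with its expectation under a random model, which is not reproduced here.

```julia
# Toy hypergraph (hypothetical data)
hyperedges = [Set([1, 2]), Set([1, 2, 3]), Set([3, 4]), Set([1, 2, 3, 5])]

# Count pairs of distinct edges where the smaller one is contained in the larger
function count_simplicial_pairs(hyperedges)
    cnt = 0
    for i in 1:length(hyperedges), j in 1:length(hyperedges)
        a, b = hyperedges[i], hyperedges[j]
        if length(a) < length(b) && issubset(a, b)
            cnt += 1
        end
    end
    return cnt
end

n_pairs = count_simplicial_pairs(hyperedges)
println("simplicial pairs: ", n_pairs)
```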


In [None]:
## compute the simplicial ratio measure
E = pylist(pyset.([Set(GoT.edges[e]) for e in GoT.edges]))
V = pylist(Set([x for y in E for x in y]))

## build list of edges incident to each node
edge_dict = spl.get_edge_sets(V, E)

## mapping between node index and character name
node_dict = pyDict(zip([GoT.nodes[v].name for v in GoT.nodes], collect(GoT.nodes)))

## simplicial ratio
spl.get_simplicial_ratio(V, E, samples=100)

In [None]:
println("Simplicial fraction:", spl.get_simplicial_fraction(V, E))
println("Edit simpliciality:", spl.get_edit_simpliciality(V, E))
println("Face edit simpliciality:", spl.get_face_edit_simpliciality(V, E, exclude_self=true))

### Compute the individual simpliciality ratio for each GoT character

We look at the ego-nets of some nodes with high/low simpliciality.

In [None]:
## Compute the individual simpliciality ratio for each character and rank
sm = []
random.seed(123)
for name in df.name
    E = edge_dict[node_dict[name]]
    V = pylist(pyset([x for y in E for x in y]))
    push!(sm, spl.get_simplicial_ratio(V, E, samples=100))
end
df."simpliciality" = sm
sort(df, "simpliciality", rev=true)


In [None]:
## pick high/low simpliciality nodes with low degree for viz below
hs = "Bowen Marsh"
ls = "Ros"

In [None]:
## high simpliciality
edges_kwargs = pyDict("edgecolors" => "grey")
SE = pylist([e for e in edge_dict[node_dict[hs]]])
HG = hnx.Hypergraph(SE)
nc = pylist(fill("grey", length(collect(HG.nodes))))
idx = findall(pyconvert(Vector{Bool}, collect(HG.nodes) .== node_dict[hs]))[1]
nc[idx] = "black"
nr = pyDict(zip(HG.nodes, ones(Int, length(collect(HG.nodes)))))
nr[node_dict[hs]] = 2
nodes_kwargs = pyDict("facecolors" => nc)
print("looking at node:", hs)
plt.subplots(figsize=(7, 7))
hnx.draw(HG, edges_kwargs=edges_kwargs, nodes_kwargs=nodes_kwargs, node_radius=nr,
    with_edge_labels=false, with_node_labels=false)
plt.gcf()

In [None]:
## convex hull view
XH = xgi.Hypergraph(pylist([pylist(HG.edges[e]) for e in HG.edges]))
xgi.draw(XH, node_fc="black", hull=true, node_size=pylist([nr[i] for i in XH.nodes]))
plt.gcf()

In [None]:
## low simpliciality
edges_kwargs = pyDict("edgecolors" => "grey")
SE = pylist([e for e in edge_dict[node_dict[ls]]])
HG = hnx.Hypergraph(SE)
nc = pylist(fill("grey", length(collect(HG.nodes))))
idx = findall(pyconvert(Vector{Bool}, collect(HG.nodes) .== node_dict[ls]))[1]
nc[idx] = "black"
nr = pyDict(zip(HG.nodes, ones(Int, length(collect(HG.nodes)))))
nr[node_dict[ls]] = 2
nodes_kwargs = pyDict("facecolors" => nc)
print("looking at node:", ls)
plt.subplots(figsize=(7, 7))
hnx.draw(HG, edges_kwargs=edges_kwargs, nodes_kwargs=nodes_kwargs, node_radius=nr,
    with_edge_labels=false, with_node_labels=false)
plt.gcf()

In [None]:
## convex hull view
XH = xgi.Hypergraph(pylist([pylist(HG.edges[e]) for e in HG.edges]))
xgi.draw(XH, node_fc="black", hull=true, node_size=pylist([nr[i] for i in XH.nodes]))
plt.gcf()

In [None]:
## 3-d view per edge size
_, ax = plt.subplots(figsize=(10, 10), subplot_kw=pyDict("projection" => "3d"))
xgi.draw_multilayer(XH, ax=ax, node_fc="black", hull=true, node_size=pylist([nr[i] for i in XH.nodes]), sep=1, h_angle=25)
plt.show()

## Degree-size correlation

We see a positive but very small correlation in this case.
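
The quantity being correlated can be sketched in pure Julia on a toy hypergraph (hypothetical data). Each node-edge incidence contributes one observation, which pairs the degree of the node with the size of the edge. This mirrors the idea behind the `h_deg_size_corr` helper defined near the top of the notebook.

```julia
using Statistics

# Toy hypergraph (hypothetical data)
hyperedges = [["a", "b"], ["a", "b", "c"], ["a", "c", "d", "e"]]

# Node degree = number of hyperedges containing the node
deg = Dict{String,Int}()
for e in hyperedges, v in e
    deg[v] = get(deg, v, 0) + 1
end

# One (degree, edge size) observation per node-edge incidence
X = [deg[v] for e in hyperedges for v in e]
Y = [length(e) for e in hyperedges for v in e]

r = cor(X, Y)
println("correlation: ", r)
```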


In [None]:
_x, _y, corr = h_deg_size_corr(GoT)
println("correlation:", corr)
_df = DataFrame(degree=_x, edge_size=_y)
plt.figure(figsize=(5, 4))
sns.boxplot(x=Matrix(_df)[:, 2], y=Matrix(_df)[:, 1], showfliers=false, width=0.5, color="lightblue")
plt.ylabel("degree")
plt.xlabel("edge size")
plt.gcf()

In [None]:
## grouping edge sizes in 3 tiers: up to 8, 9-16 and 17+
_df."edge size range" = [(x - 1) ÷ 8 for x in _df."edge_size"]
plt.figure(figsize=(5, 4))
sns.boxplot(x=_df.var"edge size range", y=_df.degree, showfliers=false, width=0.5, color="lightblue")
plt.xticks([0, 1, 2], ["2-8", "9-16", "17-24"]);
plt.gcf()

### Rich club coefficients - via sampling for computing the denominator

* first, compute number of edges with all nodes having degree >= k for each k: $\phi(k)$
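
A minimal sketch of this first step on a toy hypergraph (hypothetical data): an edge counts toward $\phi(k)$ exactly when the smallest degree among its nodes is at least $k$.

```julia
# Toy hypergraph (hypothetical data)
hyperedges = [["a", "b"], ["a", "b", "c"], ["a", "c", "d"], ["d", "e"]]

# Node degrees
deg = Dict{String,Int}()
for e in hyperedges, v in e
    deg[v] = get(deg, v, 0) + 1
end

# An edge counts toward phi(k) exactly when its minimum node degree is >= k
min_deg = [minimum(deg[v] for v in e) for e in hyperedges]
phi_toy = Dict(k => count(>=(k), min_deg) for k in 1:maximum(values(deg)))
println(phi_toy)
```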


In [None]:
## degrees in GoT graph
threshold = quantile(pyconvert.(Int, [GoT.degree(v) for v in GoT.nodes]), 0.95)
d = sort(collect(Set([GoT.degree(v) for v in GoT.nodes if pyconvert(Int, GoT.degree(v)) < threshold])))
L = []
for e in GoT.edges
    push!(L, minimum([GoT.degree(v) for v in GoT.edges[e]]))
end
## compute phi's
phi = []
L = pyconvert.(Int, L)
for k in pyconvert.(Int, d)
    push!(phi, sum(L .>= k))
end

* now generate random bipartite graphs and compute all $\hat{\phi}(k)$.
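
The randomization can be sketched as follows on a toy hypergraph (hypothetical data and function name `random_edges`). Each node is repeated once per incidence, the resulting list is shuffled, and it is then cut into blocks matching the original edge sizes. This preserves node degrees and edge sizes, although a random edge may contain the same node more than once.

```julia
using Random

# Toy hypergraph (hypothetical data)
hyperedges = [["a", "b"], ["a", "b", "c"], ["a", "c", "d"], ["d", "e"]]

# One stub per node-edge incidence, so each node appears as often as its degree
stubs = [v for e in hyperedges for v in e]
sizes = length.(hyperedges)

# Shuffle the stubs and cut them into consecutive blocks with the original sizes
function random_edges(rng, stubs, sizes)
    s = shuffle(rng, stubs)
    out = Vector{Vector{String}}()
    pos = 0
    for k in sizes
        push!(out, s[pos+1:pos+k])
        pos += k
    end
    return out
end

rng = MersenneTwister(321)
E_rand = random_edges(rng, stubs, sizes)
println(E_rand)
```

Averaging the $\phi(k)$ counts over many such random edge lists gives the denominator $\hat{\phi}(k)$ of the rich club coefficient.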


In [None]:
## number of repeats
REP = 100

## repeat each node w.r.t. its degree
V = []
for v in GoT.nodes
    V = [V; fill(pyconvert(String, v), pyconvert(Int, GoT.degree(v)))]
end

## edge sizes
S = [length(GoT.edges[e]) for e in GoT.edges()]

## initialize
Random.seed!(321) ## seed Julia's RNG, which drives shuffle() below
phi_hat = zeros(length(phi))

for rep in 1:REP
    ## randomize
    V = shuffle(V)
    ## generate the edges
    ctr = 0
    E = []
    for s in S
        push!(E, collect(V[ctr+1:(ctr+s)]))
        ctr += s
    end
    ## min degree seen for each edge
    L = []
    for e in E
        push!(L, minimum([GoT.degree(v) for v in e]))
    end
    L = pyconvert.(Int, L)
    ## compute one instance of phi_hat and add to the sum
    ph = []
    for k in pyconvert.(Int, d)
        push!(ph, sum(L .>= k))
    end
    phi_hat = phi_hat + ph
end

## average the final phi_hat vector
phi_hat = phi_hat ./ REP;


In [None]:
## no strong rich-club phenomenon here
plt.figure(figsize=(6, 6))
rc = [a / b for (a, b) in zip(phi, phi_hat)]
plt.semilogx(d, rc, ".", c="black")
plt.xlabel("degree", fontsize=12)
plt.ylabel("rich club coefficient")
plt.gcf()

## (k,t)-hypercoreness

The $(k,t)$-hypercore is the maximal generalized sub-hypergraph in which every node has degree $k$ or more, and every edge contains at least a proportion $t$ of its original nodes.

We compute the size of this maximal hypercore for values of $5 \le k \le 50$ and $0.6 \le t \le 1$.
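
To illustrate the definition, here is a simplified peeling sketch on a toy hypergraph (hypothetical data and function name `kt_core`). It repeatedly strips nodes of degree below $k$ and drops edges that retain fewer than $\max(2, t \cdot |e|)$ of their original nodes. It is written for readability and is not the faster routine used in the next cell.

```julia
# Simplified peeling sketch of a (k,t)-core
function kt_core(hyperedges, k, t)
    orig = length.(hyperedges)
    E = [Set(e) for e in hyperedges]
    alive = trues(length(E))
    while true
        # node degrees within the surviving edges
        deg = Dict{String,Int}()
        for i in eachindex(E)
            alive[i] || continue
            for v in E[i]
                deg[v] = get(deg, v, 0) + 1
            end
        end
        low = Set{String}(v for (v, d) in deg if d < k)
        changed = false
        for i in eachindex(E)
            alive[i] || continue
            # strip low-degree nodes from the edge
            if !isempty(intersect(E[i], low))
                setdiff!(E[i], low)
                changed = true
            end
            # drop edges that kept too few of their original nodes
            if length(E[i]) < max(2, t * orig[i])
                alive[i] = false
                changed = true
            end
        end
        if !changed
            core = Set{String}()
            for i in eachindex(E)
                alive[i] && union!(core, E[i])
            end
            return core
        end
    end
end

# Toy hypergraph (hypothetical data)
toy = [["a", "b", "c"], ["a", "b", "c", "d"], ["a", "b"], ["d", "e"], ["b", "c"]]
println(kt_core(toy, 2, 0.7))
```

In this toy example node e is removed first, which kills the edge (d, e). Node d then drops to degree 1 and is removed in turn, while the edge (a, b, c, d) survives because it keeps 3 of its 4 nodes and $3 \ge 0.7 \times 4$.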


In [None]:
## From paper - faster
function hypercore(HG, k, t=1, verbatim=false)
    E = [Set(pyconvert(Vector{String}, HG.edges[i])) for i in 0:length(HG.edges)-1]
    D = [max(2, t * length(e)) for e in E]
    V = Set(pyconvert.(String, [v for v in HG.nodes()]))
    deg = countmap(pyconvert.(String, [v for e in E for v in e]))
    if verbatim
        println(length(V))
    end
    R = Set([v for v in V if deg[v] < k])
    while !isempty(R)
        Rp = Set()
        for i in 1:length(E)
            e = E[i]
            if length(e) > 0
                if length(intersect(R, e)) > 0
                    E[i] = setdiff(E[i], R)
                    deg = countmap(pyconvert.(String, [v for e in E for v in e]))
                    if length(E[i]) < D[i]
                        a = Set([v for v in E[i] if deg[v] == k])
                        Rp = union(Rp, a)
                        E[i] = Set()
                    end
                end
            end
        end
        V = setdiff(V, R)
        R = Rp
        if verbatim
            println(length(V))
        end
    end
    return V
end

In [None]:
# T = [.6,.7,.8,.9,1] ## un-comment to try several values for t
T = [0.9]
## compute for range of values for k and t and store
L = []
for k in 5:50
    for t in T
        push!(L, [k, t, length(hypercore(GoT, k, t))])
    end
end
D = DataFrame(hcat(L...)', ["k", "t", "Size"])

In [None]:
## plot the resulting values
plt.figure(figsize=(6, 6))
for t in T
    plt.plot(D[D.t.==t, :].k, D[D.t.==t, :].Size, ".-", label=t)
end
plt.xlabel("value of k", fontsize=14)
plt.ylabel("(k,t)-hypercore size", fontsize=14)
plt.legend(title="value of t", fontsize=12)
plt.gcf()

### looking at a specific (k,t)-hypercore

$k=18$ and $t=0.9$

In [None]:
## map to 2-section and visualize
V = hypercore(GoT, 18, 0.9)
E = [e for e in GoT.edges if length(intersect(V, Set(pyconvert(Vector{String}, GoT.edges[e])))) / length(GoT.edges[e]) >= 0.9]
H = GoT.restrict_to_edges(E)
H = H.restrict_to_nodes(V) ## keep only the hypercore nodes
G = hmod.two_section(H)
G.vs["size"] = 0
G.vs["color"] = "white"
G.vs["label"] = pylist([GoT.nodes[n].name for n in G.vs["name"]])
G.vs["label_size"] = 12
random.seed(321)
G.vs["layout"] = G.layout_fruchterman_reingold()
ig.plot(G, layout=G.vs["layout"], bbox=(500, 500), margin=50, edge_color="lightgrey")

In [None]:
## same hypergraph, different view with XGI
pos = pyDict(zip(G.vs["label"], [[v[0], -v[1]] for v in G.vs["layout"]]))
E = []
for e in H.edges
    push!(E, [got_names[x] for x in H.edges[e]])
end
XH = xgi.Hypergraph(pylist(E))
fig, ax = plt.subplots(figsize=(10, 10))
xgi.draw(XH, pos=pos, dyad_color="grey", hull=true, radius=0.25, edge_fc_cmap="Greys_r", alpha=0.005, node_size=0, ax=ax, node_labels=true);
plt.gcf()

# Contact hypergraphs

We consider two datasets where hyperedges are built when individuals come into close physical contact over some time intervals. The datasets are available from the XGI package directly, see: https://xgi.readthedocs.io/en/stable/xgi-data.html.
For both datasets, we keep a single instance for every edge. 
The data is in directory ```../Datasets/Contacts```.
Some questions at the end of Chapter 7 refer to those datasets.

### Primary school dataset

* 12,704 hyperedges of size 2 to 5 built from 242 nodes.
* the nodes are children belonging to one of 10 classes, and the teachers 
* file ```hyperedges-contact-primary.txt``` contains the edges (1 per line, csv), the nodes are 1-based
* file ```labels-contact-primary.txt``` contains the node labels, 1 to 11 (in numerical order of the nodes)

References in: https://zenodo.org/records/10155810


### High school dataset

* 7,818 hyperedges of size 2 to 5 built from 327 nodes.
* the nodes are students belonging to one of 9 classes
* file ```hyperedges-contact-highschool.txt``` contains the edges (1 per line, csv), the nodes are 1-based
* file ```labels-contact-highschool.txt``` contains the node labels, 1 to 9 (in numerical order of the nodes)

References in: https://zenodo.org/records/10155802


In [None]:
## read the edges and ground-truth communities and build hypergraph H and 2-section graph G

## pick one of the two datasets
#dataset = "primary"
dataset = "highschool"

## read edge list, build H
Lines = readdlm(datadir * "Contacts/hyperedges-contact-" * dataset * ".txt")
E = []
for line in eachrow(Lines)
    push!(E, Set(string.(split(line[1], ','))))
end
H = hnx.Hypergraph(pyDict([i - 1 => e for (i, e) in enumerate(E)]))
println("number of nodes:", length(H.nodes), "  number of edges:", length(H.edges))

## build 2-section graph
G = hmod.two_section(H)

## read ground-truth communities and store in a dictionary
fn = datadir * "Contacts/labels-contact-" * dataset * ".txt"
gt = readdlm(fn)
Communities = pyDict(string(k) => v for (k, v) in enumerate(gt))

## plot the 2-section graph
pal = ig.RainbowPalette(n=maximum(gt) + 1)
G.vs["color"] = [pal[Communities[v["name"]]] for v in G.vs]
ig.plot(G, bbox=(400, 400), vertex_size=5, edge_color="lightgrey")

# Motifs example 

We use the HNX and XGI draw functions to reproduce the patterns from **Figure 7.1** in the book, then count the motifs reported in **Table 7.2**.

Given:
* E2: number of edges of size 2
* G(E2): graph built only with E2
* E3: edges of size 3

Compute:
* H1: number of induced 4-node subgraphs of G(E2) with 5 edges + 6 times the number of 4-cliques in G(E2)
* H2: for each (i,j,k) in E3, count common neighbours in G(E2) for (i,j), (i,k) and (j,k)
* H3: count pairs of edges in E3 with intersection of size 2
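
The H2 and H3 counts can be sketched in pure Julia on a tiny example. The edge lists and the helper `nb` are hypothetical, and the sketch follows the definitions in the list above rather than the exact code of the simulation cell.

```julia
# Toy edge lists (hypothetical data)
E2 = [(1, 4), (2, 4), (3, 5)]
E3 = [(1, 2, 3), (2, 3, 6)]

# Neighbour sets in G(E2)
nbrs = Dict{Int,Set{Int}}()
for (u, v) in E2
    push!(get!(nbrs, u, Set{Int}()), v)
    push!(get!(nbrs, v, Set{Int}()), u)
end
nb(v) = get(nbrs, v, Set{Int}())

# H2: for each 3-edge, common neighbours in G(E2) of each of its three pairs
H2 = sum(length(intersect(nb(i), nb(j))) + length(intersect(nb(i), nb(k))) +
         length(intersect(nb(j), nb(k))) for (i, j, k) in E3)

# H3: pairs of 3-edges that share exactly two nodes
H3 = count(length(intersect(Set(E3[a]), Set(E3[b]))) == 2
           for a in 1:length(E3) for b in a+1:length(E3))

println("H2 = ", H2, "  H3 = ", H3)
```

Here node 4 is the only common neighbour of a pair inside a 3-edge (the pair (1, 2)), and the two 3-edges overlap in nodes 2 and 3.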

Random hypergraphs:
* probability for 2-edges: p2 = c/(n-1)
* probability for 3-edges to maintain expected 2-section graph degree:  p3 = (8-c)/((n-1)*(n-2)) 
* probability for 3-edges to maintain expected H-degree: p3 = (8-c)/((n-1)*(n/2-1))
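
As a quick arithmetic check of these formulas, the sketch below computes the expected degree of a node under each choice of p3. It ignores overlaps between edges, and the function name `expected_degrees` is hypothetical. A given node lies in $n-1$ possible 2-edges and in $(n-1)(n-2)/2$ possible 3-edges.

```julia
# Expected degrees of a node under the two choices of p3 (edge overlaps ignored)
function expected_degrees(n, c)
    p2 = c / (n - 1)
    n_pairs = n - 1                      # possible 2-edges containing a given node
    n_triples = (n - 1) * (n - 2) / 2    # possible 3-edges containing a given node

    # choice 1: each 3-edge brings 2 neighbours in the 2-section graph
    p3_sec = (8 - c) / ((n - 1) * (n - 2))
    deg_2sec = n_pairs * p2 + 2 * n_triples * p3_sec

    # choice 2: each 3-edge adds 1 to the hypergraph degree
    p3_deg = (8 - c) / ((n - 1) * (n / 2 - 1))
    deg_H = n_pairs * p2 + n_triples * p3_deg
    return deg_2sec, deg_H
end

for c in (0, 4, 8)
    println("c=", c, "  ", expected_degrees(500, c))
end
```

Both quantities stay at 8 for every value of c, so varying c only shifts the mix between 2-edges and 3-edges.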


In [None]:
CondaPkg.add_pip("networkx")

In [None]:
nx = pyimport("networkx")

In [None]:
## H1 pattern
ly = pyDict("A" => (0, 1), "B" => (1, 1), "C" => (0, 0), "D" => (1, 0))
E = [Set(["A"]), Set(["B"]), Set(["C"]), Set(["D"])]
HG = hnx.Hypergraph(pyDict(enumerate(E)))
g = nx.Graph()
g.add_edge("B", "A")
g.add_edge("C", "A")
g.add_edge("B", "C")
g.add_edge("B", "D")
g.add_edge("C", "D")
plt.figure(figsize=(3, 3))
hnx.draw(HG, pos=ly, with_edge_labels=false, with_node_labels=false,
        edges_kwargs=pyDict("linewidths" => 0, "edgecolors" => "grey"),
        node_radius=3.0, with_additional_edges=g
)
plt.gcf()

In [None]:
## H2 patterns
E = [Set(["A", "B", "C"]), Set(["D"])]
HG = hnx.Hypergraph(pyDict(enumerate(E)))
g = nx.Graph()
g.add_edge("B", "D")
g.add_edge("C", "D")
plt.figure(figsize=(3, 3))
hnx.draw(HG, pos=ly, with_edge_labels=false, with_node_labels=false,
        edges_kwargs=pyDict("linewidths" => pylist([1.5, 0]), "edgecolors" => "grey"),
        node_radius=3.0, with_additional_edges=g
)
plt.gcf()

In [None]:
## H3 pattern
E = [Set(["A", "B", "C"]), Set(["B", "C", "D"])]
HG = hnx.Hypergraph(pyDict(enumerate(E)))
plt.figure(figsize=(3, 3))
hnx.draw(HG, pos=ly, with_edge_labels=false, with_node_labels=false,
        edges_kwargs=pyDict("linewidths" => 1.5, "edgecolors" => "grey"),
        node_radius=3.0
)
plt.gcf()

In [None]:
## This takes a while to run -- see some results in next cell
choice = "2-section"
Random.seed!(123)

n = 500
V = string.(1:n)

L = []
REP = 16

for c in 0:8
    p2 = c / (n - 1)
    if choice == "2-section"
        p3 = (8 - c) / ((n - 1) * (n - 2))    ## to maintain expected 2-section graph degree
    else
        p3 = (8 - c) / ((n - 1) * (n / 2 - 1))  ## to maintain expected H-degree
    end
    println("running c = ", c)
    for rep in 1:REP
        E2 = []
        E3 = []

        ## generate 2-edges
        r = rand(Int(n * (n - 1) / 2))
        v = combinations(V, 2)
        for (i, j) in enumerate(v)
            if r[i] < p2
                push!(E2, j)
            end
        end
        ## generate 3-edges
        r = rand(Int(n * (n - 1) * (n - 2) / 6))
        v = combinations(V, 3)
        for (i, j) in enumerate(v)
            if r[i] < p3
                push!(E3, j)
            end
        end

        dg = 2 * length(E2) + 3 * length(E3)
        HG = hnx.Hypergraph(pyDict(enumerate([E2; E3])))
        g = hmod.two_section(HG)
        sd = g.ecount()

        ## count motifs in graph G with 2-edges only
        G = ig.Graph.TupleList(E2)
        M = G.motifs_randesu(size=4)
        H1 = M[9] + 6 * M[10] ## exactly as H1 + 6 times 4-clique

        ## H2: for each 3-edge, for each pair within, count common neighbor(s) in G
        H2 = 0
        for e in E3
            if length(intersect(Set(pyconvert(Vector{String}, G.vs["name"])), Set(e))) == 3
                s1 = Set(G.neighbors(G.vs.find(name=e[1])))
                s2 = Set(G.neighbors(G.vs.find(name=e[2])))
                s3 = Set(G.neighbors(G.vs.find(name=e[3])))
                H2 += length(intersect(s1, s2)) + length(intersect(s1, s3)) + length(intersect(s3, s2))
            end
        end
        ## H3: count pairs of 3-edges with intersection of size 2
        H3 = 0
        e = [Set(i) for i in E3]
        l = length(e)
        for i in 1:l
            for j in i+1:l
                if length(intersect(e[i], e[j])) == 2
                    H3 += 1
                end
            end
        end
        push!(L, [c, H1, H2, H3, dg / n, 2 * sd / n])
    end
end
D = DataFrame(pyconvert.(Float64, hcat(L...))', ["c", "H1", "H2", "H3", "H deg", "2-sec deg"])
combine(groupby(D, "c"), Not(:c) .=> mean)