# Chapter 7 - Hypergraphs

In this notebook, we introduce hypergraphs, a generalization of graphs where we allow for arbitrarily sized edges (in practice, we usually consider only edges of size 2 or more).

We illustrate a few concepts using hypergraphs including modularity, community detection, simpliciality and transformation into 2-section graphs.

(If using HNX version **2.4**, remove all instances of: ```{'with_node_counts': False}```)

We also do some visualization with **XGI** (https://xgi.readthedocs.io/en/stable/index.html), which can be installed with pip.


In [None]:
import pandas as pd
import numpy as np
import igraph as ig
import matplotlib.pyplot as plt
%matplotlib inline
import hypernetx as hnx
import hypernetx.algorithms.hypergraph_modularity as hmod 
import xgi 
import pickle
from collections import Counter
import warnings
import random
import networkx as nx
from sklearn.metrics import adjusted_mutual_info_score as AMI
from itertools import combinations
import seaborn as sns
import fastnode2vec as n2v
import umap
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

## Some Python files:

# Functions to compute various simpliciality measures
import simpliciality as spl

# h-Louvain preliminary version
import h_louvain as hl


In [None]:
## Set this to the data directory
datadir='../Datasets/'


In [None]:
## to compute degree-size correlation
def h_deg_size_corr(H):
    deg = {v:H.degree(v) for v in H.nodes}
    X = []
    Y = []
    for e in H.edges:
        for v in H.edges[e]:
            X.append(deg[v])
            Y.append(len(H.edges[e]))
    return(X, Y, np.corrcoef(X,Y)[1,0])
    

# HyperNetX basics with a toy hypergraph

We illustrate a few concepts with a toy hypergraph. 

First, we build the HNX hypergraph from a list of sets (the hyperedges), and we draw the hypergraph as well as its dual (where the roles of nodes and hyperedges are swapped).


In [None]:
## build a hypergraph from a list of sets (the hyperedges)
E = [{'A','B'},{'A','C'},{'A','B','C'},{'A','D','E','F'},{'D','F'},{'E','F'},{'B'},{'G','B'}]
#kwargs = {'layout_kwargs': {'seed': 123}, 'with_node_counts': False}
kwargs = {'layout_kwargs': {'seed': 123}}
## using enumeration, edges will have integer IDs
H = hnx.Hypergraph(dict(enumerate(E)))
for e in H.edges:
    H.edges[e].weight = 1.0
edges_kwargs={'edgecolors':'grey'}
with warnings.catch_warnings(): ## matplotlib warning
    warnings.simplefilter("ignore")
    plt.figure(figsize=(8,5))
    hnx.draw(H, **kwargs, edges_kwargs=edges_kwargs, edge_label_alpha=1,
             node_labels_kwargs={'fontsize': 9},
             edge_labels_kwargs={'fontsize': 7}
            )
#plt.savefig('h_toy_a.pdf', bbox_inches='tight')
plt.show()


In [None]:
## dual hypergraph
H_dual = H.dual()
#kwargs = {'layout_kwargs': {'seed': 123}, 'with_node_counts': False, 'with_edge_labels':True}
kwargs = {'layout_kwargs': {'seed': 123}, 'with_edge_labels':True}
edges_kwargs={'edgecolors':'grey'}
plt.figure(figsize=(8,5))
hnx.draw(H_dual, **kwargs, edges_kwargs=edges_kwargs, edge_label_alpha=1,
         node_labels_kwargs={'fontsize': 9},
         edge_labels_kwargs={'fontsize': 7}
        )
#plt.savefig('h_toy_b.pdf', bbox_inches='tight')
plt.show()


In [None]:
## bipartite representation - HNX exports in networkx format
B = ig.Graph.from_networkx(H.bipartite())
B.vs['label'] = B.vs['_nx_name']
ly = B.layout_bipartite(types='bipartite')
#ig.plot(B, 'h_toy_c.pdf', bbox=(400,300), vertex_color='white', layout=ly, vertex_label_size=14, edge_color='black')
ig.plot(B, bbox=(400,300), vertex_color='white', layout=ly, vertex_label_size=14, edge_color='black')


In [None]:
## show the nodes and edges
print('shape:', H.shape)
print('nodes:', [x for x in H.nodes()])
print('edges:', [x for x in H.edges()])
print('node degrees:', [(v,H.degree(v)) for v in H.nodes()])
print('edge sizes:',[H.size(e) for e in H.edges()])


In [None]:
## incidence dictionary
H.incidence_dict


In [None]:
## incidence matrix
H.incidence_matrix(index=True)


In [None]:
## incidence matrix (in csr format by default - here showing whole array)
M = H.incidence_matrix(index=True)
df = pd.DataFrame(M[0].toarray(), 
                  index=M[1],
                  columns=M[2])
df


In [None]:
## 2-section graph
G = hmod.two_section(H)
ig.plot(G, bbox=(400,300),vertex_label=G.vs['name'], 
        vertex_label_size=12, vertex_color='lightblue',
        edge_width=G.es['weight'])
#ig.plot(G, target="h_toy_d.pdf", bbox=(400,300),vertex_label=G.vs['name'], vertex_label_size=14, vertex_color='white')


## s-walks and distance-based measures

We illustrate a few concepts with the toy hypergraph defined earlier.

Let $H=(V,E)$ be a hypergraph, and consider its incidence matrix $B$ as defined in section 7.2.
Consider also the dual hypergraph $H^*$, where the roles of nodes and hyperedges are swapped,
namely the edges in $H$ are the nodes in $H^*$,
and there is an edge between two vertices in $H^*$ if the corresponding hyperedges in $H$ have a non-empty intersection.
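
As a minimal sketch of this role swap (plain Python, not the HyperNetX implementation), each node of $H$ becomes a hyperedge of $H^*$ that contains the ids of the hyperedges of $H$ it belongs to. The helper name `dual` and the variable names are illustrative only.

```python
from collections import defaultdict

# Toy hyperedges keyed by integer id, as in the toy example above
edges = dict(enumerate([{'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'A', 'D', 'E', 'F'},
                        {'D', 'F'}, {'E', 'F'}, {'B'}, {'G', 'B'}]))

def dual(edges):
    """Swap roles: each node becomes a hyperedge holding the ids of the edges containing it."""
    d = defaultdict(set)
    for e, members in edges.items():
        for v in members:
            d[v].add(e)
    return dict(d)

edges_dual = dual(edges)
print(sorted(edges_dual['F']))  # ids of the hyperedges of H that contain node F
```

Taking the dual twice recovers the original edge dictionary, which is a quick sanity check on the construction.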

### s-walks and distances

We define the concept of $s$-walks on a hypergraph as follows. An $s$-**walk** of length $k$ on $H$ is a sequence of edges $e_{i_0}, e_{i_1}, ..., e_{i_k}$ in $E$ such that 
all $|e_{i_{j-1}} \cap e_{i_j}| \ge s$ for $1 \le j \le k$ and all $i_{j-1} \ne i_j$.

The $s$-**distance** $d_s(e_i,e_j)$ between edges $e_i$ and $e_j$ is the length of the shortest $s$-walk between them, if it exists (else the distance is usually considered as infinity, and its inverse is set to zero).

A subset $E_s \subset E$ is an $s$-**connected component** if it is a maximal subset with an $s$-walk between all $e_i, e_j \in E_s$.
The $s$-**diameter** for $E_s$ is the maximal shortest path length between all $e_i, e_j \in E_s$.

Other concepts can also be defined using $s$-walks. For example, for distinct $e_i, e_j, e_k \in E$, if there is an $s$-walk $e_i, e_j, e_k$, we say that they form an $s$-**wedge**, and if there is an $s$-walk $e_i, e_j, e_k, e_i$, we say that they form an $s$-**triangle**; from those, we can define the $s$-**clustering coefficients** as in section 1.11.

For **nodes**, all definitions above follow by considering the **dual** hypergraph. For example, an $s$-walk is a sequence of adjacent nodes such that each consecutive node pair in the walk shares at least $s$ hyperedges; all other concepts defined above follow directly.
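
The $s$-distance between edges can be sketched with a breadth-first search in which two edges are neighbours when they share at least $s$ nodes. This is a plain-Python illustration on the toy hyperedges, not the HyperNetX code; the function names are made up for the example.

```python
from collections import deque

# Toy hyperedges keyed by integer id
edges = dict(enumerate([{'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'A', 'D', 'E', 'F'},
                        {'D', 'F'}, {'E', 'F'}, {'B'}, {'G', 'B'}]))

def s_neighbors(edges, e, s):
    """Edges other than e that share at least s nodes with e."""
    return [f for f in edges if f != e and len(edges[e] & edges[f]) >= s]

def s_distance(edges, src, dst, s=1):
    """Length of the shortest s-walk from src to dst (infinity if there is none)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        e = queue.popleft()
        if e == dst:
            return dist[e]
        for f in s_neighbors(edges, e, s):
            if f not in dist:
                dist[f] = dist[e] + 1
                queue.append(f)
    return float('inf')

print(s_distance(edges, 1, 0, s=2))  # 2
print(s_distance(edges, 1, 4, s=2))  # inf
```

Applying the same function to the dual dictionary (node mapped to its set of incident edge ids) gives the $s$-distance between nodes.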

#### toy example

In the toy example above, with $s=2$, the sequence of edges 1-2-0 is an $s$-path since edges 1 and 2 share nodes A and C, and edges 2 and 0 share nodes A and B.

In the dual toy hypergraph, again with $s=2$, the sequence of nodes (edges in the dual) D-F-E is an $s$-path since nodes D and F are both incident to edges 3 and 4, and nodes F and E are both incident to edges 3 and 5. 
Another $s$-path (with $s=2$) is C-A-B.

With $s=1$, this corresponds to a walk on the (unweighted 2-section) graph, while for $s \ge 2$, this concept only applies to hypergraphs.

Below, we compute the distances between every pair of nodes (thus, using the $s$-walks on the dual). An infinite distance between a pair of nodes means that there is no $s$-path joining those.

We see the correspondence between the $s=1$ and graph cases; moreover in those cases, we have a single connected component since every pair of nodes is connected by a path.

With $s=2$, we see several disconnected node pairs, so in this case, we have several $s$-connected components. From inspection of the table below, we see that nodes {A,B,C} are connected,
nodes {D,E,F} are connected; node G is then an isolated node. We verify this claim below (we also do the same with the edges, i.e. using the $s$-walk on $H$ with $s=2$).
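
The node components claimed above can also be checked by hand with a small union-find sketch. The `incident` dictionary below is the dual of the toy hypergraph written out explicitly (node mapped to the ids of its hyperedges); the function name is illustrative and this is not how HyperNetX computes components.

```python
from itertools import combinations

# Node -> ids of incident hyperedges (the dual of the toy hypergraph)
incident = {'A': {0, 1, 2, 3}, 'B': {0, 2, 6, 7}, 'C': {1, 2}, 'D': {3, 4},
            'E': {3, 5}, 'F': {3, 4, 5}, 'G': {7}}

def s_components(incident, s):
    """Group keys whose incidence sets are chained by overlaps of size >= s."""
    parent = {v: v for v in incident}

    def find(v):
        # Follow parent links to the representative, compressing the path on the way
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for u, v in combinations(incident, 2):
        if len(incident[u] & incident[v]) >= s:
            parent[find(u)] = find(v)

    groups = {}
    for v in incident:
        groups.setdefault(find(v), set()).add(v)
    return sorted(groups.values(), key=sorted)

print(s_components(incident, s=2))
```

With $s=1$ the same call returns a single component, in line with the 2-section graph being connected.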

In [None]:
## distances with s=1 and s=2 and on the 2-section graph
warnings.filterwarnings('ignore') ## avoid warnings in disconnected case
Nodes = ['A','B','C','D','E','F','G']
L = []
for i in range(len(Nodes)-1):
    for j in np.arange(i+1,len(Nodes)):
        L.append([Nodes[i],Nodes[j],G.distances(Nodes[i],Nodes[j])[0][0],
                  H.distance(Nodes[i],Nodes[j]),H.distance(Nodes[i],Nodes[j],s=2)])
df = pd.DataFrame(L, columns=['node1','node2','2-section','s=1','s=2'])
df


In [None]:
## s=2 components
Edges = [cc for cc in H.s_connected_components(s=2, return_singletons=True)]
Nodes = [cc for cc in H.s_connected_components(s=2, edges=False, return_singletons=True)]
print('s=2, connected components for the nodes:',Nodes)
print('s=2, connected components for the edges:',Edges)


## Line graph

Below we illustrate the **line graph** for the toy hypergraph and its dual, with $s=2$.

Recall that in a line graph, the nodes are the edges in the original hypergraph, 
and an edge is drawn between those if they share at least $s$ nodes in the original hypergraph.

We see the same connected components as listed above.


In [None]:
## linegraph
LG = ig.Graph.from_networkx(H.get_linegraph(s=2))
ig.plot(LG, bbox=(200,200),vertex_label=LG.vs['_nx_name'], vertex_label_size=9, vertex_color='lightgrey')


In [None]:
## dual's linegraph
DLG = ig.Graph.from_networkx(H.dual().get_linegraph(s=2))
ig.plot(DLG, bbox=(200,200),vertex_label=DLG.vs['_nx_name'], vertex_label_size=9, vertex_color='lightgrey')


##  Centrality measures

For $H=(V,E)$, we define the **$s$-harmonic centrality** for edge $e_i \in E$ as:
$\frac{1}{|E|-1}\sum_{e_j \in E_s; e_i \ne e_j} \frac{1}{d_s(e_i,e_j)}$.
Recall that for $s$-disconnected edges $e_i, e_j$, we set $\frac{1}{d_s(e_i,e_j)} = 0$.

* n.b.: The HyperNetX implementation uses a different normalization, namely $(|E|-1)(|E|-2)/2$.

For nodes, the definition is identical using the dual hypergraph.
For our toy example, with $s=2$, nodes {A,B,C} form a connected component as we saw earlier, as do
nodes {D,E,F}, while node G is an isolated node.

Looking at the table of distances we computed earlier, we see that $d_2(A,B)=d_2(A,C)=1$ and $d_2(B,C)=2$,
so before normalization, the harmonic centrality for A is 2, and for B and C it is 1.5.
Results are comparable for the other connected component, with values of 1.5 for nodes D and E, and 2 for node F.
Node G is isolated and thus has zero harmonic centrality.
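
The arithmetic above can be reproduced in a few lines. The distances are the finite $s=2$ values stated in the text (with D-F and E-F at distance 1 and D-E at distance 2, matching the centralities quoted for that component); the normalization by $|V|-1$ follows the definition given in this section, not the HyperNetX default.

```python
# Finite s=2 distances between node pairs; every pair not listed is
# disconnected and contributes 1/d = 0 to the sum.
d2 = {('A', 'B'): 1, ('A', 'C'): 1, ('B', 'C'): 2,
      ('D', 'F'): 1, ('E', 'F'): 1, ('D', 'E'): 2}
nodes = ['A', 'B', 'C', 'D', 'E', 'F', 'G']

# Unnormalized harmonic centrality: sum of inverse distances to the other nodes
raw = {v: 0.0 for v in nodes}
for (u, v), d in d2.items():
    raw[u] += 1 / d
    raw[v] += 1 / d

# Normalize by the number of other nodes, as in the definition above
harmonic = {v: raw[v] / (len(nodes) - 1) for v in nodes}
for v in nodes:
    print(v, raw[v], round(harmonic[v], 4))
```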

We can also define $s$-**betweenness centrality** as we did for graphs, namely for edge $e_i \in E$:

$\frac{1}{(|E|-1)(|E|-2)}\sum_{e_j \in E-\{e_i\}} \sum_{e_k \in E-\{e_i, e_j\}} \frac{\ell(e_j,e_k,e_i)}{\ell(e_j,e_k)}$

where: $\ell(e_j,e_k)$ is the number of shortest $s$-paths between $e_j$ and $e_k$, 
and $\ell(e_j,e_k,e_i)$ is the number of shortest $s$-paths between $e_j$ and $e_k$ that include $e_i$.
Again the definition is the same for nodes using the dual hypergraph.

For our toy example, with $s=2$, the only nodes that are on shortest $s$-paths between other nodes are node A (between B and C)
and node F (between D and E), thus the results we see below.
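
As a worked instance of the formula, applied to nodes (so $|E|$ is replaced by the number of nodes, 7), consider node A. The double sum runs over ordered pairs, so the pair B, C is counted in both directions, and B-A-C is the unique shortest $2$-path between B and C:

$$\frac{1}{(7-1)(7-2)}\left(\frac{\ell(B,C,A)}{\ell(B,C)}+\frac{\ell(C,B,A)}{\ell(C,B)}\right)=\frac{1+1}{30}=\frac{1}{15}\approx 0.0667$$

The same value is obtained for node F (with the pair D, E), and every other node gets zero.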

Other distance-based centrality measures can be defined for hypergraphs in the same way, using $s$-distances,
including the measures we covered in Section 3.3. 
In the example below, we also show **closeness centrality**; note that by default, the computation is done separately for each $s$-connected component, thus the results below.

Computing **eccentricity** (the length of the longest shortest path from a vertex to every other vertex in
the s-linegraph) with $s=2$ returns an error since some nodes are not connected, so we show the results for $s=1$.


In [None]:
## eccentricity - this yields an error with s > 1
hnx.algorithms.s_eccentricity(H, edges=False, s=1)


In [None]:
## centralities for 's=2'
s = 2

hc = hnx.algorithms.s_harmonic_centrality(H, edges=False, s=s, normalized=False)
bc = hnx.algorithms.s_betweenness_centrality(H, edges=False, s=s, normalized=False)
cc = hnx.algorithms.s_closeness_centrality(H, edges=False, s=s)

## normalize w.r.t. definition in the book
D = pd.DataFrame([[v,hc[v]/(H.nodes.dataframe.shape[0]-1),
                   2*bc[v]/((H.nodes.dataframe.shape[0]-1)*(H.nodes.dataframe.shape[0]-2)),
                   cc[v]] for v in H.nodes], 
                   columns=['node','harmonic','betweenness','closeness'])

#print(D.sort_values('harmonic', ascending=False).to_latex())
D.sort_values('harmonic', ascending=False)


## hypergraph modularity (qH) and clustering

We compute qH on the toy graph for 4 different partitions, and using different variations for the edge contribution (a.k.a. $\tau$-modularity).

For edges of size $d$ where $c$ is the number of nodes from the part with the most representatives, we consider  variations as follows for edge contribution:

* **strict**: edges are considered only if all nodes are from the same part, with unit weight, i.e. $w$ = 1 iff $c == d$ (0 else).
* **cubic**: edges are counted only if more than half the nodes are from the same part, with weights proportional to the cube of the number of nodes in the majority, i.e. $w = (c/d)^3$ iff $c>d/2$ (0 else).
* **quadratic**: edges are counted only if more than half the nodes are from the same part, with weights proportional to the square of the number of nodes in the majority, i.e. $w = (c/d)^2$ iff $c>d/2$ (0 else).
* **linear**: edges are counted only if more than half the nodes are from the same part, with weights proportional to the number of nodes in the majority, i.e. $w = c/d$ iff $c>d/2$ (0 else).
* **majority**: edges are counted only if more than half the nodes are from the same part, with unit weights, i.e. $w$ = 1 iff $c>d/2$ (0 else).

Some of the above are supplied with the `hmod` module; the **qH2** and **qH3** functions are examples of user-supplied choices.

The order above goes from only counting "pure" edges as community edges, gradually giving more weight to edges with $c>d/2$, all the way to giving all of them the same weight.
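
To make the five rules concrete, the sketch below computes $(d, c)$ for one toy edge under one partition and evaluates each rule on it. These are illustrative re-implementations written from the definitions in the list above, not the functions shipped with `hmod`, and they cover only the per-edge weight, not the full modularity value.

```python
def edge_composition(edge, partition):
    """Return (d, c): the edge size and the size of its largest overlap with a single part."""
    d = len(edge)
    c = max(len(edge & part) for part in partition)
    return d, c

# Edge-contribution rules as functions of (d, c), following the definitions above
rules = {
    'strict':    lambda d, c: 1.0 if c == d else 0.0,
    'cubic':     lambda d, c: (c / d) ** 3 if c > d / 2 else 0.0,
    'quadratic': lambda d, c: (c / d) ** 2 if c > d / 2 else 0.0,
    'linear':    lambda d, c: c / d if c > d / 2 else 0.0,
    'majority':  lambda d, c: 1.0 if c > d / 2 else 0.0,
}

partition = [{'A', 'B', 'C', 'G'}, {'D', 'E', 'F'}]
edge = {'A', 'D', 'E', 'F'}

d, c = edge_composition(edge, partition)
weights = {name: rule(d, c) for name, rule in rules.items()}
print((d, c), weights)
```

For this edge, $d=4$ and $c=3$, so the strict rule ignores it while the other four rules give it increasing weight, up to 1 for the majority rule.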


In [None]:
## these will be included in the next version of hmod
## square modularity weights
def qH2(d,c):
    return (c/d)**2 if c > d/2 else 0
## cubic modularity weights
def qH3(d,c):
    return (c/d)**3 if c > d/2 else 0

## compute hypergraph modularity (qH) for the following partitions:
A1 = [{'A','B','C','G'},{'D','E','F'}]            ## good clustering, qH should be positive
A2 = [{'B','C'},{'A','D','E','F','G'}]            ## not so good
A3 = [{'A','B','C','D','E','F','G'}]              ## this should yield qH == 0
A4 = [{'A'},{'B'},{'C'},{'D'},{'E'},{'F'},{'G'}]  ## qH should be negative here

## we compute with different choices of functions for the edge contribution

print('strict edge contribution:')
print('qH(A1):',"{:.4f}".format(hmod.modularity(H,A1,hmod.strict)),
      'qH(A2):',"{:.4f}".format(hmod.modularity(H,A2,hmod.strict)),
      'qH(A3):',"{:.4f}".format(hmod.modularity(H,A3,hmod.strict)),
      'qH(A4):',"{:.4f}".format(hmod.modularity(H,A4,hmod.strict)))
print('\ncubic edge contribution:')
print('qH(A1):',"{:.4f}".format(hmod.modularity(H,A1,qH3)),
      'qH(A2):',"{:.4f}".format(hmod.modularity(H,A2,qH3)),
      'qH(A3):',"{:.4f}".format(hmod.modularity(H,A3,qH3)),
      'qH(A4):',"{:.4f}".format(hmod.modularity(H,A4,qH3)))
print('\nquadratic edge contribution:')
print('qH(A1):',"{:.4f}".format(hmod.modularity(H,A1,qH2)),
      'qH(A2):',"{:.4f}".format(hmod.modularity(H,A2,qH2)),
      'qH(A3):',"{:.4f}".format(hmod.modularity(H,A3,qH2)),
      'qH(A4):',"{:.4f}".format(hmod.modularity(H,A4,qH2)))
print('\nlinear edge contribution:')
print('qH(A1):',"{:.4f}".format(hmod.modularity(H,A1,hmod.linear)),
      'qH(A2):',"{:.4f}".format(hmod.modularity(H,A2,hmod.linear)),
      'qH(A3):',"{:.4f}".format(hmod.modularity(H,A3,hmod.linear)),
      'qH(A4):',"{:.4f}".format(hmod.modularity(H,A4,hmod.linear)))
print('\nmajority edge contribution:')
print('qH(A1):',"{:.4f}".format(hmod.modularity(H,A1,hmod.majority)),
      'qH(A2):',"{:.4f}".format(hmod.modularity(H,A2,hmod.majority)),
      'qH(A3):',"{:.4f}".format(hmod.modularity(H,A3,hmod.majority)),
      'qH(A4):',"{:.4f}".format(hmod.modularity(H,A4,hmod.majority)));


### weighted 2-section graph

We already built the 2-section weighted graph **G** for the above toy hypergraph.
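
For readers who want to see what such a weighted 2-section looks like, here is a plain-Python sketch. It assumes that each hyperedge of size $d$ (with unit weight) contributes $1/(d-1)$ to every pair of its nodes; we believe this matches the convention used by `hmod.two_section`, but that is an assumption to check against the library, and the function name below is made up.

```python
from collections import defaultdict
from itertools import combinations

hyperedges = [{'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'A', 'D', 'E', 'F'},
              {'D', 'F'}, {'E', 'F'}, {'B'}, {'G', 'B'}]

def two_section_weights(hyperedges):
    """Pair weights of the 2-section, assuming 1/(d-1) per pair for an edge of size d."""
    w = defaultdict(float)
    for e in hyperedges:
        d = len(e)
        if d < 2:
            continue  # singleton edges create no pairs
        for u, v in combinations(sorted(e), 2):
            w[(u, v)] += 1.0 / (d - 1)
    return dict(w)

W = two_section_weights(hyperedges)
print(W[('A', 'B')])  # 1.5: weight 1 from edge {A,B} plus 1/2 from edge {A,B,C}
```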

Here we run the Leiden clustering algorithm on this graph, and compare with Kumar's hypergraph clustering algorithm.

We run each algorithm multiple times to show the difference in performance. In general, hypergraph-based algorithms are much slower than graph-based algorithms.


In [None]:
## 2-section graph
G.vs['label'] = G.vs['name']
ig.plot(G, bbox=(0,0,250,250), edge_width = 2*np.array(G.es['weight']), 
        vertex_color='gainsboro', vertex_label_size=10)


In [None]:
%%time
## 2-section clustering with Leiden
for i in range(100):
    G.vs['community'] = G.community_leiden(objective_function='modularity', weights='weight').membership
print('clusters:',hmod.dict2part({v['name']:v['community'] for v in G.vs}))


In [None]:
%%time
## Kumar clustering
for i in range(100):
    cl = hmod.kumar(H)
print('clusters:', cl)


## Simplicial ratio

We use the same toy graph, but we remove the singleton edge {'B'}.

First, we see a simplicial ratio slightly above 1, and we also see that the two simplicial pairs between 2-edges and 3-edges are more surprising than the two pairs between 2-edges and 4-edges.
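
The observed pair counts mentioned above can be verified directly: a simplicial pair is a pair of edges where one is strictly contained in the other. The sketch below only counts observed pairs by edge sizes; the simplicial ratio and matrix additionally compare these counts with a sampled null model, which is not reproduced here.

```python
from collections import Counter
from itertools import combinations

edges = [{'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'A', 'D', 'E', 'F'},
         {'D', 'F'}, {'E', 'F'}, {'G', 'B'}]

# Count pairs (small, large) with small a proper subset of large, keyed by their sizes
pairs = Counter()
for e, f in combinations(edges, 2):
    small, large = sorted((e, f), key=len)
    if small < large:  # proper-subset test on sets
        pairs[(len(small), len(large))] += 1

print(dict(pairs))
```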


In [None]:
## toy example without the singleton edge
vertices = [v for v in H.nodes()]
edges = [{'A','B'},{'A','C'},{'A','B','C'},{'A','D','E','F'},{'D','F'},{'E','F'},{'G','B'}]

## simplicial ratio
random.seed(123)
spl.get_simplicial_ratio(vertices, edges, samples=1000)


In [None]:
## simplicial matrix
random.seed(123)
spl.get_simplicial_matrix(vertices, edges, samples=1000)


In [None]:
## number of simplicial pairs
spl.get_simplicial_pairs(vertices, edges, as_matrix=True)


### Other simpliciality measures

* no 3+ edge has downward closure, so the fraction is 0
* edit simpliciality is 7/16, since 9 edges would need to be added to get downward closures
* face edit simpliciality: the two values for maximal edges are 3/4 and 3/11 (keeping the maximal face in the counts) or 2/3 and 2/10 otherwise
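
The three values in the list above can be recomputed by brute force. This sketch assumes faces of size at least 2 and includes each edge among its own faces; it is written from the definitions and is not the code in the `simpliciality` module.

```python
from fractions import Fraction
from itertools import combinations

E = {frozenset(e) for e in ({'A', 'B'}, {'A', 'C'}, {'A', 'B', 'C'}, {'A', 'D', 'E', 'F'},
                            {'D', 'F'}, {'E', 'F'}, {'G', 'B'})}

def faces(e, min_size=2):
    """All subsets of e with at least min_size nodes, including e itself."""
    return {frozenset(c) for k in range(min_size, len(e) + 1)
            for c in combinations(sorted(e), k)}

# Edges of size 3 or more (both happen to be maximal in this example)
big = [e for e in E if len(e) >= 3]

# Simplicial fraction: share of those edges whose faces are all present in E
simplicial_fraction = Fraction(sum(faces(e) <= E for e in big), len(big))

# Edit simpliciality: |E| divided by the size of the downward closure of E
closure = set().union(*(faces(e) for e in E))
edit_simpliciality = Fraction(len(E), len(closure))

# Face edit simpliciality per large maximal edge, keeping the edge itself in the counts
face_edit = {tuple(sorted(e)): Fraction(len(faces(e) & E), len(faces(e))) for e in big}

print(simplicial_fraction, edit_simpliciality, face_edit)
```

The closure has 16 faces against 7 observed edges, which is where the 9 missing edges in the list above come from.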
    

In [None]:
print('Simplicial fraction:',spl.get_simplicial_fraction(vertices, edges))
print('Edit simpliciality:',spl.get_edit_simpliciality(vertices, edges))
print('Face edit simpliciality:',spl.get_face_edit_simpliciality(vertices, edges, exclude_self=False))
print('Face edit simpliciality:',spl.get_face_edit_simpliciality(vertices, edges, exclude_self=True))


# h-ABCD Examples

Julia code to generate h-ABCD benchmarks can be found here:
https://github.com/bkamins/ABCDHypergraphGenerator.jl

The first small h-ABCD hypergraph we use next was generated as follows:

`julia --project abcdh.jl -n 100 -d 2.5,3,10 -c 1.5,30,40 -x .2 -q 0,.3,.4,.3 -w :strict -s 123 -o toy_100`

It has 100 nodes and 3 well-defined communities. We will use this example mainly for visualization.

The second one, which is much more noisy, was generated as follows:

`julia --project abcdh.jl -n 300 -d 2.5,5,30 -c 1.5,80,120 -x .6 -q 0,0,.1,.9 -w :strict -s 123 -o toy_300`

We will use this example to show that optimizing the appropriate hypergraph modularity function can lead to better clustering in some cases.
    

## 100-node h-ABCD - visualization

In [None]:
## read the edges and build the h-ABCD hypergraph H
## use a context manager so the file is closed after reading
with open(datadir+'ABCD/toy_100_he.txt', 'r') as fp:
    Lines = fp.readlines()
Edges = []
for line in Lines:
    #Edges.append(set([int(x)-1 for x in line.strip().split(',')]))
    Edges.append(set([x for x in line.strip().split(',')]))
H = hnx.Hypergraph(dict(enumerate(Edges)))
print('distribution of edge sizes:',Counter([len(x) for x in Edges]))


In [None]:
## read the ground-truth communities and assign node colors accordingly
H_comm = {str(k+1):v for k,v in enumerate(pd.read_csv(datadir+'ABCD/toy_100_assign.txt', header=None)[0].tolist())}
cls = ['white','darkgrey','black']
node_colors = dict(zip(H.nodes, [cls[H_comm[i]-1] for i in H.nodes]))

## build the 2-section graph and plot (with ground-truth community colors)
g = hmod.two_section(H)
for v in g.vs:
    v['color'] = node_colors[v['name']]
    v['gt'] = H_comm[v['name']]
    
random.seed(12345)
ly = g.layout_fruchterman_reingold()
g.vs['ly'] = [x for x in ly]
fig, ax = plt.subplots(figsize=(7,7))
ig.plot(g, target=ax, vertex_size=9, layout=ly, edge_color='darkgrey', edge_width=1)
#fig.savefig('habcd_1.pdf');
plt.show()


In [None]:
## rubber band plot
H_ly = dict(zip(g.vs['name'], [[x[0],x[1]] for x in g.vs['ly']]))
fig, ax = plt.subplots(figsize=(7,7))
hnx.draw(H, with_node_labels=False, with_edge_labels=False, node_radius=.67,
         nodes_kwargs={'facecolors': node_colors, 'edgecolors' : 'black'},
         edges_kwargs={'edgecolors': 'darkgrey'},
         pos=H_ly)
#fig.savefig('habcd_2.pdf');
plt.show()


In [None]:
### Plot via convex hull with the XGI package
H_nc = dict(zip(g.vs['name'], g.vs['color']))
fig, ax = plt.subplots(figsize=(7,7))
XH = xgi.Hypergraph(Edges)
xgi.draw(XH, node_fc=H_nc, dyad_color='grey', hull=True, radius=.15, edge_fc_cmap='Greys_r', alpha=.2, pos=H_ly, node_size=8, ax=ax, node_labels=False )
#fig.savefig('habcd_3.pdf');
plt.show()


### Edge composition

Recall we call a $d$-edge a **community** edge if $c>d/2$ where $c$ is the number of nodes that belong to the **most represented** community.

Below we show the number of edges with all values $d$ and $c$, community edges or not.
We see that given the ground-truth communities, most community edges are *pure* in the sense that $c=d$.

In real examples, we usually do not know the ground-truth communities, or at least not for every node.
We can try some clustering, for example graph clustering on the 2-section graph, or Kumar's algorithm on the hypergraph, to get a sense of edge composition.

The result is quite similar to the ground-truth.


In [None]:
## edge composition - ground truth
L = []
for e in H.edges:
    L.append((Counter([H_comm[i] for i in H.edges[e]]).most_common(1)[0][1],len(H.edges[e])))
X = Counter(L).most_common()

L = []
for x in X:
    L.append([x[0][1], x[0][0], x[0][0]>x[0][1]/2, x[1]])
D = pd.DataFrame(np.array(L), columns=['d','c','community edge','frequency (ground truth)'])
D = D.sort_values(by=['d','c'], ignore_index=True)

## edge composition - Leiden on 2-section
g.vs['leiden'] = g.community_leiden(objective_function='modularity', weights='weight').membership
leiden = dict(zip(g.vs['name'],g.vs['leiden']))
L = []
for e in H.edges:
    L.append((Counter([leiden[i] for i in H.edges[e]]).most_common(1)[0][1],len(H.edges[e])))
X = Counter(L).most_common()
L = []
for x in X:
    L.append([x[0][1], x[0][0], x[0][0]>x[0][1]/2, x[1]])
D2 = pd.DataFrame(np.array(L), columns=['d','c','community edge','frequency (Leiden)'])
D2 = D2.sort_values(by=['d','c'], ignore_index=True)

D['frequency (Leiden)'] = D2['frequency (Leiden)']
D = D.sort_values('frequency (ground truth)', ascending=False)
#print(D[['d','c','frequency (ground truth)','frequency (Leiden)']].to_latex(index=False))
D


### simpliciality

We show some measures of simpliciality, namely the number of simplicial pairs, the simpliciality matrix and the simplicial ratio measure.

The simplicial ratio value is around 1.3 (recall it is based on sampling), which indicates that this hypergraph does not exhibit high simpliciality.


In [None]:
E = [set(H.edges[e]) for e in H.edges]
V = list(set([x for y in E for x in y]))
spl.get_simplicial_pairs(V, E, as_matrix=True)


In [None]:
spl.get_simplicial_matrix(V, E, samples=1000)


In [None]:
spl.get_simplicial_ratio(V, E, samples=1000)


In [None]:
## other measures of simpliciality
print('Simplicial fraction:',spl.get_simplicial_fraction(V,E))
print('Edit simpliciality:',spl.get_edit_simpliciality(V,E))
print('Face edit simpliciality:',spl.get_face_edit_simpliciality(V,E,exclude_self=True))


# 300-node noisy h-ABCD

This is a noisier hypergraph with $\xi=0.6$, edges mostly of size 4 and some edges of size 3.

In the experiment below, we run each of the following algorithms 30 times and compare AMI with the ground-truth communities.
* Leiden on 2-section (weighted) graph
* Kumar's algorithm
* h-Louvain

We observe that Kumar's algorithm, which does take the hypergraph structure into account, slightly improves on the results with 2-section clustering, 
while h-Louvain improves it further, albeit with slower run time.


In [None]:
## read the edges and build the h-ABCD hypergraph H
## use a context manager so the file is closed after reading
with open(datadir+'ABCD/toy_300_he.txt', 'r') as fp:
    Lines = fp.readlines()
Edges = []
for line in Lines:
    Edges.append(set([x for x in line.strip().split(',')]))
H = hnx.Hypergraph(dict(enumerate(Edges)))

## read the ground-truth communities and assign node colors accordingly
H_comm = {str(k+1):v for k,v in enumerate(pd.read_csv(datadir+'ABCD/toy_300_assign.txt', header=None)[0].tolist())}

## build the 2-section graph
g = hmod.two_section(H)
for v in g.vs:
    v['gt'] = H_comm[v['name']]


In [None]:
%%time
## reduce the number of repeats (REP) for a faster run (we used REP=30 for the book)
REP = 30
L = []
random.seed(321)
np.random.seed(321) 

for s in range(REP):
    g.vs['leiden'] = g.community_leiden(objective_function='modularity',weights='weight').membership
    ami_g = AMI(g.vs['gt'], g.vs['leiden'])
    H_kumar = hmod.kumar(H)
    H_kumar_dict = hmod.part2dict(H_kumar)
    ami_k = AMI([H_comm[v] for v in H.nodes], [H_kumar_dict[v] for v in H.nodes])
    #H_ls = hmod.part2dict(hmod.last_step(H, H_kumar, hmod.strict))
    #ami_ls = AMI([H_comm[v] for v in H.nodes], [H_ls[v] for v in H.nodes])
    L.append([ami_g, ami_k])
        
D = pd.DataFrame(L, columns=['2-section', 'Kumar'])
print('mean values:')
print(D.mean())


### Running h-Louvain with Bayesian Optimization

This is slower as for each repetition, several attempts are made to find a good set of parameters using Bayesian optimization.
Results are saved and can be retrieved for plotting. To re-run the experiment, uncomment the cell below.

In [None]:
with open(datadir+'ABCD/toy_300_h-Louvain.pkl','rb') as fn:
    L = pickle.load(fn)
D['h-Louvain'] = L[:D.shape[0]]
plt.figure(figsize=(6,5))
sns.boxplot(D, width=.5, color='darkgray', linewidth=1.2)
plt.ylabel('AMI', fontsize=14)
#plt.savefig('habcd_cluster.eps')   
plt.show()


In [None]:
## no simplicial pair in this case
E = [set(H.edges[e]) for e in H.edges]
V = list(set([x for y in E for x in y]))
spl.get_simplicial_pairs(V, E, as_matrix=True)


In [None]:
## other measures
print('Simplicial fraction:',spl.get_simplicial_fraction(V,E))
print('Edit simpliciality:',spl.get_edit_simpliciality(V,E))
print('Face edit simpliciality:',spl.get_face_edit_simpliciality(V,E,exclude_self=True))


## Embeddings

We fit two embeddings to the h-ABCD graph, namely:
* 2-section node2vec
* bipartite node2vec (where we ignore the edge embeddings)

We fit a classifier where we train on 50% of the points, and test on the rest,
after reducing to 16-dim via UMAP.

We verify if keeping the hypergraph structure helps, as we do with the bipartite representation.


In [None]:
## 2-section
graph = n2v.Graph(g.to_tuple_list(), directed=False, weighted=False)
nv = n2v.Node2Vec(graph, dim=32, p=1, q=1, walk_length=80, window=5, seed=123)
nv.train(epochs=10, verbose=False)
X_twosec = np.array([nv.wv[i] for i in range(len(nv.wv))])

## 2-section - 2-d visualization
U = umap.UMAP().fit_transform(X_twosec)
df = pd.DataFrame(U, columns=['X','Y'])
plt.figure(figsize=(6,6))
plt.scatter(df.X, df.Y, c=g.vs['gt'], s=25)
plt.show()


In [None]:
## bipartite (edges are in first positions; we ignore the edges)
G = ig.Graph.from_networkx(H.bipartite())
graph = n2v.Graph(G.to_tuple_list(), directed=False, weighted=False)
nv = n2v.Node2Vec(graph, dim=32, p=1, q=1, walk_length=80, window=5, seed=123)
nv.train(epochs=10, verbose=False)
n_edges = len([e for e in H.edges()])
X_bip = np.array([nv.wv[i] for i in range(len(nv.wv))])[n_edges:]

## bipartite 2-d viz
U = umap.UMAP().fit_transform(X_bip)
df = pd.DataFrame(U,columns=['X','Y'])
plt.figure(figsize=(6,6))
plt.scatter(df.X, df.Y, c=g.vs['gt'], s=25)
plt.show()


## fit a classifier

We train on half the data chosen at random, which we repeat several times.
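
Accuracy in each repeat is the fraction of test points on the diagonal of the confusion matrix. A tiny sketch with a made-up 3-class matrix (the numbers are hypothetical and not results from this notebook):

```python
# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted class)
cm = [[40, 5, 5],
      [4, 42, 4],
      [6, 3, 41]]

correct = sum(cm[i][i] for i in range(len(cm)))  # correctly classified points
total = sum(sum(row) for row in cm)              # all test points
accuracy = correct / total
print(correct, total, accuracy)
```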



In [None]:
%%time
## classifier - with 2-section and bipartite embeddings
acc = []
acc_b = []
y = label = g.vs['gt']

for seed in np.arange(0,51,10): ## we used 30 repeats in textbook which can take a few minutes
    
    ## 2-section
    X = umap.UMAP(n_components=16, n_jobs=1, random_state=seed).fit_transform(X_twosec)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, bootstrap = True, max_features = 'sqrt', random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # print(cm)
    acc.append(sum(cm.diagonal())/sum(sum(cm)))

    ## bipartite - same seed
    X = umap.UMAP(n_components=16, n_jobs=1, random_state=seed).fit_transform(X_bip)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=seed)
    model = RandomForestClassifier(n_estimators=100, bootstrap = True, max_features = 'sqrt', random_state=seed)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)
    # print(cm)
    acc_b.append(sum(cm.diagonal())/sum(sum(cm)))
    
print(np.mean(acc), np.mean(acc_b))

In [None]:
## compare the results - we see slightly better results with the bipartite representation
D = pd.DataFrame(np.array([acc,acc_b]).transpose(),columns=['2-section','bipartite'])
plt.figure(figsize=(6,5))
sns.boxplot(D, width=.5, color='darkgray', linewidth=1.2);
plt.grid()
plt.ylabel('Accuracy', fontsize=14)
#plt.savefig('habcd_classify.eps')
plt.show()


# Game of Thrones scenes hypergraph

The original data can be found here: https://github.com/jeffreylancaster/game-of-thrones.

A pre-processed version is provided, where we consider a hypergraph from the Game of Thrones scenes with the following elements:

* **Nodes** are named characters in the series
* **Hyperedges** are groups of characters appearing in the same scene(s)
* **Hyperedge weights** are total scene(s) duration in seconds involving each group of characters

We kept hyperedges with at least 2 characters and we discarded characters with degree below 5.

We saved the following:

* *Edges*: list of sets where the nodes are 0-based integers represented as strings: '0', '1', ... 'n-1'
* *Names*: dictionary; mapping of nodes to character names
* *Weights*: list; hyperedge weights (in same order as Edges)


In [None]:
## read the data
with open(datadir+"GoT/GoT.pkl","rb") as f:
    Edges, Names, Weights = pickle.load(f)


## Build the weighted hypergraph 

Use the above to build the weighted hypergraph (GoT).
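
Node strength (weighted degree), which is attached to the nodes below, is the sum of the weights of the hyperedges containing a node. A minimal sketch on made-up data with the same shape as `Edges` and `Weights` (the names `Edges_demo` and `Weights_demo` and all values are hypothetical):

```python
from collections import defaultdict

# Hypothetical miniature data: list of node sets and matching list of edge weights
Edges_demo = [{'0', '1'}, {'0', '1', '2'}, {'1', '2'}]
Weights_demo = [30, 120, 45]  # made-up durations in seconds

# Strength of a node = total weight of the hyperedges it belongs to
strength = defaultdict(float)
for members, w in zip(Edges_demo, Weights_demo):
    for v in members:
        strength[v] += w

print(dict(strength))
```

The notebook obtains the same quantity with a product of the incidence matrix and the weight vector, which is equivalent to this loop.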

In [None]:
## Nodes are represented as strings from '0' to 'n-1'
GoT = hnx.Hypergraph(dict(enumerate(Edges)))


In [None]:
## add full names of characters and compute node strength (a.k.a. weighted degree)
I, _node, _edge = GoT.incidence_matrix(index=True)
S = I * [Weights[int(i)] for i in _edge]
Strength = {i:j for i,j in zip(_node,S)}
for v in GoT.nodes:
    GoT.nodes[v].name = Names[v]
    GoT.nodes[v].strength = Strength[v]
for e in GoT.edges:
    GoT.edges[e].weight = Weights[e]
    

## EDA on the GoT hypergraph

Simple exploratory data analysis (EDA) on this hypergraph. 

In [None]:
## edge sizes (number of characters per scene)
plt.figure(figsize=(6,4))
plt.hist([GoT.size(e) for e in GoT.edges], bins=25, color='grey')
plt.xlabel("Edge size", fontsize=14)
#plt.savefig('got_hist_1.eps')
plt.show()

## max edge size
print('max edge size:', np.max([GoT.size(e) for e in GoT.edges]))
print('median edge size:', np.median([GoT.size(e) for e in GoT.edges]))


In [None]:
## edge weights (total scene durations for each group of characters appearing together)
plt.figure(figsize=(6,4))
plt.hist([Weights], bins=25, color='grey')
plt.xlabel("Edge weight",fontsize=14)
#plt.savefig('got_hist_2.eps')
plt.show()

## max/median edge weight
print('max edge weight:', np.max(Weights))
print('median edge weight:', np.median(Weights))


In [None]:
## node degrees
plt.figure(figsize=(6,4))
plt.hist(hnx.degree_dist(GoT),bins=20, color='grey')
plt.xlabel("Node degree",fontsize=14)
#plt.savefig('got_hist_3.eps')
plt.show()

## max degree
print('max node degree:', np.max(hnx.degree_dist(GoT)))
print('median node degree:', np.median(hnx.degree_dist(GoT)))


In [None]:
## node strength (total scene appearance)
plt.figure(figsize=(6,4))
plt.hist([GoT.nodes[n].strength for n in GoT.nodes], bins=20, color='grey')
plt.xlabel("Node strength",fontsize=14)
#plt.savefig('got_hist_4.eps')
plt.show()

## max strength
print('max node strength:', np.max([GoT.nodes[n].strength for n in GoT.nodes]))
print('median node strength:', np.median([GoT.nodes[n].strength for n in GoT.nodes]))


In [None]:
## build a dataframe with node characteristics
df = pd.DataFrame()
df['name'] = [GoT.nodes[v].name for v in GoT.nodes()]
df['degree'] = [GoT.degree(v) for v in GoT.nodes()]
df['strength'] = [GoT.nodes[v].strength for v in GoT.nodes()]
df.sort_values(by='strength',ascending=False).head(12)


### Compute s-betweenness and s-harmonic centrality

We consider $s=1$ and $s=2$ below.

In [None]:
## with s=1
bet = hnx.s_betweenness_centrality(GoT, edges=False)
har = hnx.s_harmonic_centrality(GoT, edges=False, normalized=False)
df['betweenness(s=1)'] = [bet[v] for v in GoT.nodes()]
n = GoT.shape[0]
df['harmonic(s=1)'] = [har[v]/(n-1) for v in GoT.nodes()]

## with s=2
bet = hnx.s_betweenness_centrality(GoT, edges=False, s=2)
har = hnx.s_harmonic_centrality(GoT, edges=False, normalized=False, s=2)
df['betweenness(s=2)'] = [bet[v] for v in GoT.nodes()]
df['harmonic(s=2)'] = [har[v]/(n-1) for v in GoT.nodes()]

#print(df.sort_values(by=['strength'],ascending=False).head(10)[['name','degree','strength','betweenness(s=1)','harmonic(s=1)']].to_latex(index=False, float_format="{:0.5f}".format))
df.sort_values(by=['strength'],ascending=False).head(10)
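
To make the role of $s$ concrete: with `edges=False`, our understanding is that the s-measures are computed on a graph where two nodes are linked when they share at least $s$ hyperedges. The sketch below builds that adjacency for a toy hypergraph in plain Python. The helper `s_adjacent_pairs` is hypothetical and only illustrates the idea; it is not HNX's implementation.

```python
from itertools import combinations
from collections import Counter

def s_adjacent_pairs(edges, s=1):
    """Return the set of node pairs that appear together in at least s hyperedges."""
    shared = Counter()
    for e in edges:
        for u, v in combinations(sorted(e), 2):
            shared[(u, v)] += 1
    return {pair for pair, cnt in shared.items() if cnt >= s}

toy_edges = [{'a', 'b', 'c'}, {'a', 'b'}, {'c', 'd'}]
pairs_s1 = s_adjacent_pairs(toy_edges, s=1)
pairs_s2 = s_adjacent_pairs(toy_edges, s=2)
print(sorted(pairs_s1))
print(sorted(pairs_s2))
```

Raising $s$ keeps only the stronger co-occurrences, so the resulting graph gets sparser and centrality values can change accordingly.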


## Build 2-section graph and compute a few centrality measures

We saw several centrality measures for graphs in chapter 3. 

Below, we build the 2-section graph for GoT and compute a few of those. 

**n.b.: Unlike in the first edition of the book, we now ignore edge weights to compare with the hypergraph s-measures.**


In [None]:
## build 2-section
G = hmod.two_section(GoT)

## betweenness
n = G.vcount()
b = G.betweenness(directed=False)
G.vs['bet'] = [2*x/((n-1)*(n-2)) for x in b]
for v in G.vs:
    GoT.nodes[v['name']].bet = v['bet']
df['betweenness'] = [GoT.nodes[v].bet for v in GoT.nodes()]

## harmonic
G.vs['hc'] = G.harmonic_centrality(normalized=True)
for v in G.vs:
    GoT.nodes[v['name']].hc = v['hc']
df['harmonic'] = [GoT.nodes[v].hc for v in GoT.nodes()]

## order w.r.t. harmonic
df.sort_values(by='harmonic',ascending=False).head()
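
For reference, the 2-section replaces each hyperedge by a clique on its nodes. A minimal unweighted sketch on a toy hypergraph follows, together with the divisor behind the betweenness normalization used above. The helper `two_section_pairs` is hypothetical; `hmod.two_section` may also attach edge weights, which this sketch ignores.

```python
from itertools import combinations

def two_section_pairs(edges):
    """Unweighted 2-section: every pair of nodes sharing a hyperedge becomes a graph edge."""
    pairs = set()
    for e in edges:
        pairs.update(combinations(sorted(e), 2))
    return pairs

toy_edges = [{'a', 'b', 'c'}, {'c', 'd'}]
toy_pairs = two_section_pairs(toy_edges)
print(sorted(toy_pairs))

# Number of node pairs not involving a given node, (n-1)(n-2)/2:
# dividing undirected betweenness by this value gives the factor 2/((n-1)(n-2))
n = 4
max_pairs = (n - 1) * (n - 2) / 2
print(max_pairs)
```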


In [None]:
## high correlation between centrality measures
corr = df[['betweenness(s=1)','betweenness(s=2)','betweenness','harmonic(s=1)','harmonic(s=2)','harmonic']].corr()
#print(corr[['harmonic','betweenness']].to_latex(index=True,  float_format="{:0.3f}".format))
print(corr[['harmonic','betweenness']])


## Hypergraph modularity and clustering

We use $\tau=3$ for the hypergraph ($\tau$) modularity weights below.
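
The weight function `qH3` used in the cells of this section is assumed to be defined earlier in the notebook. A plausible form, consistent with the "community edge" rule used in this section (more than half of an edge's nodes in one community), is sketched here. The name `tau_weight` and the exact formula are our assumptions, not a quote of the notebook's definition.

```python
def tau_weight(d, c, tau=3):
    """Hypothetical tau-modularity weight for an edge of size d with c nodes in its
    most represented community: (c/d)**tau if c is a strict majority, else 0."""
    return (c / d) ** tau if c > d / 2 else 0.0

# A pure edge counts fully, an edge with one outsider is discounted,
# and an edge without a strict majority does not count at all
w_pure = tau_weight(4, 4)
w_one_out = tau_weight(4, 3)
w_split = tau_weight(4, 2)
print(w_pure, w_one_out, w_split)
```

Larger values of $\tau$ penalize impure edges more heavily, while small values approach a simple majority rule.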


In [None]:
##### visualize the 2-section graph
print('nodes:',G.vcount(),'edges:',G.ecount())
G.vs['size'] = 14
G.vs['color'] = 'lightgrey'
G.vs['label'] = [int(x) for x in G.vs['name']] ## use int(name) as label
G.vs['character'] = [GoT.nodes[n].name for n in G.vs['name']]
G.vs['label_size'] = 6
seed = 42
np.random.seed(seed)
random.seed(seed)
ly_fr = G.layout_fruchterman_reingold()
ig.plot(G, layout=ly_fr, bbox=(0,0,600,400), edge_color='lightgrey')


In [None]:
## we see a well-separated small clique; it is the Braavosi theater troupe
print([GoT.nodes[str(x)].name for x in np.arange(166,173)])


### random clustering


In [None]:
%%time
## Compute modularity (with qH3 function) on several random partitions with K parts for a range of K's
## This should be close to 0 and can be negative.
h = []
for K in np.arange(2,21,2):
    for rep in range(1): ## 10 for the textbook
        V = list(GoT.nodes)
        np.random.seed(K*rep)
        p = np.random.choice(K, size=len(V))
        RandPart = hmod.dict2part({V[i]:p[i] for i in range(len(V))})
        ## drop empty sets if any
        RandPart = [x for x in RandPart if len(x)>0]
        ## compute qH
        h.append(hmod.modularity(GoT, RandPart, qH3))
print('range for qH:',min(h),'to',max(h))


In [None]:
plt.figure(figsize=(5,4))
sns.boxplot(h, showfliers=False, width=.5)
plt.show()


### 2-section graph clustering

In [None]:
%%time
## Cluster the 2-section graph (with Leiden) and compute qH
## We now see qH >> 0
qH_best = -1
for i in range(100):
    G.vs['_leiden'] = G.community_leiden(objective_function='modularity', weights='weight', resolution=1.0).membership
    ML = hmod.dict2part({v['name']:v['_leiden'] for v in G.vs})
    qH = hmod.modularity(GoT, ML, qH3)
    if qH > qH_best:
        qH_best = qH
        G.vs['leiden'] = G.vs['_leiden']
print('qH:',"{:.4f}".format(qH_best))
for v in G.vs:
    GoT.nodes[v['name']].leiden = v['leiden']
df['leiden_cluster'] = [GoT.nodes[v].leiden for v in GoT.nodes()]


In [None]:
## plot 2-section w.r.t. the resulting clusters
cl = G.vs['leiden']

## pick greyscale or color plot:
pal = ig.GradientPalette("white","black",max(cl)+2)
pal = ig.ClusterColoringPalette(max(cl)+2)
G.vs['color'] = [pal[x] for x in cl]

## show labels or not
G.vs['label_size'] = 0

ig.plot(G, layout = ly_fr, bbox=(0,0,600,400), edge_color='gainsboro', vertex_size=8)
#ig.plot(G, target='GoT_clusters.eps', layout = ly_fr, bbox=(0,0,600,400), edge_color='grey')


### edge composition after clustering

We see that the most frequent edges are small "pure" edges, but there are also several edges with all but one node from the same community.

This suggests an intermediate value for $\tau$, such as $\tau$=2 or 3, for the exponent in the modularity.


In [None]:
comm_dict = dict(zip(G.vs['name'],G.vs['leiden']))
L = []
for e in GoT.edges:
    L.append(tuple(x[1] for x in Counter([comm_dict[i] for i in GoT.edges[e]]).most_common()))
X = Counter(L).most_common()
L = []
for x in X:
    L.append([len(x[0]), sum(x[0]), x[0][0], x[1], x[0][0]>sum(x[0])/2])
df_cd = pd.DataFrame(np.array(L), columns=['n_comm','d','c','frequency','community edge'],)
df_cd['cum_freq'] = df_cd.cumsum().frequency / GoT.shape[1]
df_cd.head(10)


### Kumar's algorithm

In [None]:
Ku = hmod.kumar(GoT, verbose=False)
print('qH:',"{:.4f}".format(hmod.modularity(GoT, Ku, qH3)))
dct = hmod.part2dict(Ku)
G.vs['kumar'] = [dct[i] for i in G.vs['name']]
df['kumar'] = [dct[v] for v in G.vs['name']]
print('AMI vs 2-section partitions:',AMI(G.vs['leiden'],G.vs['kumar']))


### h-Louvain

Slower, but often yields higher h-modularity than the other algorithms.

Uncomment the code below to run - this can take a few minutes.


### Looking at one of the lead characters

In [None]:
## ex: high strength nodes in same cluster with Daenerys Targaryen
dt = df[df['name']=='Daenerys Targaryen']['leiden_cluster'].iloc[0]
df[df['leiden_cluster']==dt].sort_values(by='strength',ascending=False).head(9)


## Compute the simplicial ratio and other simpliciality measures

We see a simplicial ratio well above 1, suggesting more simplicial pairs than would happen at random.

For the other measures, the simplicial fraction (0.07) and, even more so, the edit simpliciality (7e-5) are small,
which is to be expected as there are several large edges in this dataset.
The face edit simpliciality is a bit higher at 0.26.


In [None]:
## compute the simplicial ratio measure
E = [set(GoT.edges[e]) for e in GoT.edges]
V = list(set([x for y in E for x in y]))

## build list of edges incident to each node
edge_dict = spl.get_edge_sets(V, E)

## mapping between node index and character name
node_dict = dict(zip([GoT.nodes[v].name for v in GoT.nodes], list(GoT.nodes)))

## simplicial ratio
spl.get_simplicial_ratio(V, E, samples=100)
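
As we understand it, the simplicial ratio compares the number of simplicial pairs (pairs of hyperedges where one is strictly contained in the other) with what a random model would give. Only the counting step is sketched here, on a toy edge list. The helper `count_simplicial_pairs` is hypothetical, and the random baseline is left to `spl.get_simplicial_ratio`, whose internals we do not reproduce.

```python
from itertools import combinations

def count_simplicial_pairs(edges):
    """Count unordered pairs of distinct hyperedges where one strictly contains the other."""
    unique = {frozenset(e) for e in edges}
    return sum(1 for a, b in combinations(unique, 2) if a < b or b < a)

toy_edges = [{'a', 'b'}, {'a', 'b', 'c'}, {'b', 'c'}, {'c', 'd'}]
n_pairs = count_simplicial_pairs(toy_edges)
print(n_pairs)
```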


### Compute the individual simpliciality ratio for each GoT character

We look at the ego-nets for some nodes with high/low simpliciality.

In [None]:
## Compute the individual simpliciality ratio for each character and rank
sm = []
np.random.seed(123)
for name in df.name:
    E = edge_dict[node_dict[name]]
    V = list(set([x for y in E for x in y]))
    sm.append(spl.get_simplicial_ratio(V, E, samples=100))
df['simpliciality'] = sm
df.sort_values(by='simpliciality', ascending=False)


In [None]:
## pick high/low simpliciality nodes with low degree for viz below
hs = 'Bowen Marsh'
ls = 'Ros'


In [None]:
## high simpliciality
kwargs = {'layout_kwargs': {'seed': 123}, 'with_edge_labels':False, 'with_node_labels':False}
edges_kwargs={'edgecolors':'grey'}
SE = [e for e in edge_dict[node_dict[hs]]]
HG = hnx.Hypergraph(SE)
nc = ['grey']*len(list(HG.nodes))
idx = np.where(np.array(HG.nodes)==node_dict[hs])[0][0]
nc[idx] = 'black'
nr = dict(zip(HG.nodes,[1]*len(list(HG.nodes))))
nr[node_dict[hs]] = 2
nodes_kwargs={'facecolors':nc}
print('looking at node:',hs)
plt.subplots(figsize=(7,7))
hnx.draw(HG, **kwargs, edges_kwargs=edges_kwargs, nodes_kwargs=nodes_kwargs,  node_radius=nr)
#plt.savefig('bowen.eps')
plt.show()


In [None]:
## convex hull view
XH = xgi.Hypergraph([list(HG.edges[e]) for e in HG.edges])
xgi.draw(XH, node_fc='black', hull=True, node_size=[nr[i] for i in XH.nodes])
plt.show()


In [None]:
## low simpliciality
kwargs = {'layout_kwargs': {'seed': 123}, 'with_edge_labels':False, 'with_node_labels':False}
edges_kwargs={'edgecolors':'grey'}
SE = [e for e in edge_dict[node_dict[ls]]]
HG = hnx.Hypergraph(SE)
nc = ['grey']*len(list(HG.nodes))
idx = np.where(np.array(HG.nodes)==node_dict[ls])[0][0]
nc[idx] = 'black'
nr = dict(zip(HG.nodes,[1]*len(list(HG.nodes))))
nr[node_dict[ls]] = 2
nodes_kwargs={'facecolors':nc}
print('looking at node:',ls)
plt.subplots(figsize=(7,7))
hnx.draw(HG, **kwargs, edges_kwargs=edges_kwargs, nodes_kwargs=nodes_kwargs, node_radius=nr)
#plt.savefig('ros.eps')
plt.show()


In [None]:
## convex hull view
XH = xgi.Hypergraph([list(HG.edges[e]) for e in HG.edges])
xgi.draw(XH, node_fc='black', hull=True, node_size=[nr[i] for i in XH.nodes])
plt.show()


In [None]:
## 3-d view per edge size
_, ax = plt.subplots(figsize=(10, 10), subplot_kw={"projection": "3d"})
xgi.draw_multilayer(XH, ax=ax, node_fc='black',hull=True, node_size=[nr[i] for i in XH.nodes], sep=1, h_angle=25)
plt.show()


##  degree - size correlation

We see a positive but very small correlation in this case.


In [None]:
_x, _y, corr = h_deg_size_corr(GoT)
print('correlation:',corr)
_df = pd.DataFrame(np.array([_x, _y]).T, columns=['degree','edge size'])
plt.figure(figsize=(5,4))
sns.boxplot(data=_df, x='edge size', y='degree', showfliers=False, width=.5, color='lightblue')
plt.show()


In [None]:
## grouping edge sizes in 3 tiers: up to 8, 9-16 and 17+
_df['edge size range'] = [(x-1)//8 for x in _df['edge size']]
plt.figure(figsize=(5,4))
sns.boxplot(data=_df, x='edge size range', y='degree', showfliers=False, width=.5, color='lightblue')
plt.xticks([0,1,2],['2-8','9-16','17-24'])
plt.show()


### Rich club coefficients - via sampling for computing the denominator

* first, for each $k$, compute the number of edges in which all nodes have degree at least $k$: $\phi(k)$


In [None]:
## degrees in GoT graph
threshold = np.quantile([GoT.degree(v) for v in GoT.nodes],.95)
d = np.sort(list(set([GoT.degree(v) for v in GoT.nodes if GoT.degree(v)<threshold])))
L = []
for e in GoT.edges:
    L.append(np.min([GoT.degree(v) for v in GoT.edges[e]]))
## compute phi's
phi = []
L = np.array(L)
for k in d:
    phi.append(sum(L>=k))
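
The same counting step can be checked by hand on a small case. The sketch below uses a made-up toy hypergraph and plain Python; an edge contributes to $\phi(k)$ exactly when its minimum node degree is at least $k$.

```python
# Toy hypergraph (made up)
toy_edges = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}, {'c', 'd'}]

# Node degree = number of hyperedges containing the node
toy_deg = {}
for e in toy_edges:
    for v in e:
        toy_deg[v] = toy_deg.get(v, 0) + 1

# An edge counts toward phi(k) when its minimum node degree is at least k
min_deg = [min(toy_deg[v] for v in e) for e in toy_edges]
toy_phi = {k: sum(m >= k for m in min_deg) for k in sorted(set(toy_deg.values()))}
print(toy_phi)
```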


* now generate random bipartite graphs and compute the corresponding $\hat{\phi}(k)$ for all $k$.


In [None]:
## number of repeats
REP = 100

## repeat each node w.r.t. its degree
V = []
for v in GoT.nodes:
    V.extend(list(np.repeat(v,GoT.degree(v))))

## edge sizes
S = [len(GoT.edges[e]) for e in GoT.edges()]

## initialize
np.random.seed(321)
phi_hat = np.zeros(len(phi))

for rep in range(REP):
    ## randomize   
    V = np.random.permutation(V)
    ## generate the edges
    ctr = 0
    E = []
    for s in S:
        E.append(list(V[ctr:(ctr+s)]))
        ctr += s
    ## min degree seen for each edge
    L = []
    for e in E:
        L.append(np.min([GoT.degree(v) for v in e]))
    L = np.array(L)
    ## compute one instance of phi_hat and add to the sum
    ph = []
    for k in d:
        ph.append(sum(L>=k))   
    phi_hat = phi_hat + np.array(ph)

## average the final phi_hat vector
phi_hat = phi_hat / REP
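
The randomization above can be read as a stub-matching step: each node is repeated once per unit of degree, the list is shuffled, and consecutive slices are cut according to the original edge sizes. A self-contained toy version is sketched below, using Python's `random` module rather than NumPy (our choice for the illustration). As in the code above, a node may land twice in the same random edge.

```python
import random

toy_edges = [{'a', 'b'}, {'a', 'b', 'c'}, {'a', 'c'}, {'c', 'd'}]
sizes = [len(e) for e in toy_edges]

# One stub per (node, edge) incidence, so each node appears deg(v) times
stubs = [v for e in toy_edges for v in sorted(e)]

rng = random.Random(321)
rng.shuffle(stubs)

# Cut the shuffled stub list into edges with the original sizes
rand_edges, start = [], 0
for s in sizes:
    rand_edges.append(stubs[start:start + s])
    start += s
print(rand_edges)
```

Both the degree sequence and the edge-size sequence are preserved by construction, which is what makes the resulting counts a reasonable baseline for $\phi(k)$.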


In [None]:
## no strong rich-club phenomenon here
plt.figure(figsize=(6,6))
rc = [a/b for a,b in zip(phi,phi_hat)]
plt.semilogx(d, rc, ".", c="black")
plt.xlabel(r"degree $\ell$", fontsize=12)
plt.ylabel(r"rich club coefficient $\rho(\ell)$")
# plt.savefig('rich_club_got.eps')
plt.show()


## (k,t)-hypercoreness

The $(k,t)$-hypercore is the maximal generalized subhypergraph in which every node has degree $k$ or more, and every edge contains at least a proportion $t$ of its original nodes.

We compute the size of this maximal hypercore for values of $5 \le k \le 50$ and $.6 \le t \le 1$.


In [None]:
## From paper - faster
def hypercore(HG, k, t=1, verbatim=False):
    E = [set(HG.edges[e]) for e in HG.edges]
    D = [max(2,t*len(e)) for e in E]
    V = set([v for v in HG.nodes()])
    deg = Counter([v for e in E for v in e])
    if verbatim:
        print(len(V))
    R = set([v for v in V if deg[v] < k])
    while len(R)>0:
        Rp = set()
        for i in range(len(E)):
            e = E[i]
            if len(e)>0:
                if len(R.intersection(e))>0:
                    E[i] = E[i].difference(R)
                    deg = Counter([v for e in E for v in e])
                    if (len(E[i])<D[i]):
                        a = set([v for v in E[i] if deg[v]==k])
                        Rp = Rp.union(a)
                        E[i] = set()

        V = V.difference(R)
        R = Rp
        if verbatim:
            print(len(V))
    return V
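
For intuition, the peeling process can also be written in a slower but more direct way on a plain list of sets. The function `toy_hypercore` below is a hypothetical sketch of the definition and is not the paper's algorithm: it repeatedly drops nodes of degree below $k$ and edges that fall below `max(2, t * original size)`.

```python
def toy_hypercore(edges, k, t=1.0):
    """Naive peeling on a list of sets; returns the surviving node set."""
    current = [set(e) for e in edges]
    need = [max(2, t * len(e)) for e in current]
    nodes = set().union(*current) if current else set()
    changed = True
    while changed:
        changed = False
        # degrees are counted over surviving (non-empty) edges only
        deg = {v: 0 for v in nodes}
        for e in current:
            for v in e:
                deg[v] += 1
        low = {v for v in nodes if deg[v] < k}
        if low:
            nodes -= low
            current = [e - low for e in current]
            changed = True
        # discard edges that lost too many of their original nodes
        for i, e in enumerate(current):
            if e and len(e) < need[i]:
                current[i] = set()
                changed = True
    return nodes

toy_edges = [{'a', 'b', 'c'}, {'a', 'b', 'c', 'd'}, {'a', 'b'}, {'b', 'c'}, {'d', 'e'}]
core_2 = toy_hypercore(toy_edges, k=2, t=1.0)
core_3 = toy_hypercore(toy_edges, k=3, t=1.0)
print(sorted(core_2), sorted(core_3))
```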

In [None]:
%%time
# T = [.6,.7,.8,.9,1] ## un-comment to try several values for t
T = [.9]
## compute for range of values for k and t and store
L = []
for k in np.arange(5,51):
    for t in T:
        L.append([k,t, len(hypercore(GoT,k,t))])
D = pd.DataFrame(L, columns=['k','t','Size'])


In [None]:
## plot the resulting values
plt.figure(figsize=(6,6))
for t in T:
    plt.plot(D[D.t==t].k, D[D.t==t].Size, '.-', label=t)
plt.xlabel('value of k', fontsize=14)
plt.ylabel('(k,t)-hypercore size', fontsize=14)
plt.legend(title='value of t', fontsize=12)
plt.show()


### looking at a specific (k,t)-hypercore

$k=18$ and $t=0.9$

In [None]:
## map to 2-section and visualize
V = hypercore(GoT,18,.9, verbatim=False)
E = [e for e in GoT.edges if len(V.intersection(set(GoT.edges[e]))) / len(GoT.edges[e]) >= .9]
H = GoT.restrict_to_edges(E)
H = H.restrict_to_nodes(V) ## keep only the nodes in the hypercore
G = hmod.two_section(H)
G.vs['size'] = 0
G.vs['color'] = 'white'
#G.vs['label'] = G.vs['name']
G.vs['label'] = [GoT.nodes[n].name for n in G.vs['name']]
G.vs['label_size'] = 12
random.seed(321)
G.vs['layout'] = G.layout_fruchterman_reingold()
ig.plot(G, layout=G.vs['layout'], bbox=(500,500), margin=50, edge_color='lightgrey')


In [None]:
## same hypergraph, different view with XGI
pos = dict(zip(G.vs['label'],[[v[0],-v[1]] for v in G.vs['layout']]))
E = []
for e in H.edges:
    E.append([Names[x] for x in H.edges[e]])
XH = xgi.Hypergraph(E)
fig, ax = plt.subplots(figsize=(10,10))
xgi.draw(XH, pos=pos, dyad_color='grey', hull=True, radius=.25, edge_fc_cmap='Greys_r', alpha=.005, node_size=0, ax=ax, node_labels=True)
plt.show()


# Contact hypergraphs

We consider two datasets where hyperedges are built when individuals come into close physical contact over some time interval. The datasets are available directly from the XGI package, see: https://xgi.readthedocs.io/en/stable/xgi-data.html.
For both datasets, we keep a single instance for every edge. 
The data is in directory ```../Datasets/Contacts```.
Some questions at the end of Chapter 7 refer to those datasets.

### Primary school dataset

* 12,704 hyperedges of size 2 to 5 built from 242 nodes.
* the nodes are children belonging to one of 10 classes, and the teachers 
* file ```hyperedges-contact-primary.txt``` contains the edges (1 per line, csv), the nodes are 1-based
* file ```labels-contact-primary.txt``` contains the node labels, 1 to 11 (in numerical order of the nodes)

References in: https://zenodo.org/records/10155810


### High school dataset

* 7,818 hyperedges of size 2 to 5 built from 327 nodes.
* the nodes are students belonging to one of 9 classes
* file ```hyperedges-contact-highschool.txt``` contains the edges (1 per line, csv), the nodes are 1-based
* file ```labels-contact-highschool.txt``` contains the node labels, 1 to 9 (in numerical order of the nodes)

References in: https://zenodo.org/records/10155802


In [None]:
## read the edges and ground-truth communities and build hypergraph H and 2-section graph G

## pick one of the two datasets
#dataset = 'primary'
dataset = 'highschool'


## read edge list, build H
with open(datadir+'Contacts/hyperedges-contact-'+dataset+'.txt', 'r') as fp:
    Lines = fp.readlines()
E = []
for line in Lines:
    E.append(set(line.strip().split(',')))
H = hnx.Hypergraph(dict(enumerate(E)))
print('number of nodes:',len(H.nodes),'  number of edges:',len(H.edges))

## build 2-section graph
G = hmod.two_section(H)

## read ground-truth communities and store in a dictionary
fn = datadir+'Contacts/labels-contact-'+dataset+'.txt'
gt = pd.read_csv(fn, header=None)[0].tolist()
Communities = {str(k+1):v for k,v in enumerate(gt)}

## plot the 2-section graph
pal = ig.RainbowPalette(n=max(gt)+1)
G.vs['color'] = [pal[Communities[v['name']]] for v in G.vs]
ig.plot(G, bbox=(400,400), vertex_size=5, edge_color='lightgrey')


# Motifs example 

We use the HNX and XGI draw functions to reproduce the patterns from **Figure 7.1** in the book, and we count the motifs reported in **Table 7.2**.

Given:
* E2: edges of size 2
* G(E2): graph built only with E2
* E3: edges of size 3

Compute:
* H1: number of 4-node subgraphs in G(E2) with 5 edges + 6 times the number of 4-cliques in G(E2)
* H2: for each (i,j,k) in E3, count common neighbours in G(E2) for (i,j), (i,k) and (j,k)
* H3: count pairs of edges in E3 with intersection of size 2
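
The three counts listed above can be sketched by brute force on a small hypergraph. The function `count_motifs` is hypothetical and only meant for tiny inputs; for H2 we assume that the common neighbour must lie outside the 3-edge, which the description above does not state explicitly.

```python
from itertools import combinations

def count_motifs(E2, E3):
    """Brute-force counts (H1, H2, H3) given 2-node edges E2 and 3-node edges E3."""
    pairs = {frozenset(e) for e in E2}
    triples = [frozenset(e) for e in E3]
    nodes = set().union(*pairs, *triples)
    nbrs = {v: set() for v in nodes}
    for p in pairs:
        u, v = tuple(p)
        nbrs[u].add(v)
        nbrs[v].add(u)

    # H1: 4-node subsets of G(E2) with exactly 5 edges, plus 6 per 4-clique
    h1 = 0
    for quad in combinations(sorted(nodes), 4):
        m = sum(frozenset(p) in pairs for p in combinations(quad, 2))
        h1 += 1 if m == 5 else 6 if m == 6 else 0

    # H2: common G(E2)-neighbours of each pair inside a 3-edge
    # (assumption: neighbours belonging to the 3-edge itself are excluded)
    h2 = sum(len((nbrs[u] & nbrs[v]) - t)
             for t in triples for u, v in combinations(sorted(t), 2))

    # H3: pairs of 3-edges sharing exactly two nodes
    h3 = sum(1 for a, b in combinations(triples, 2) if len(a & b) == 2)
    return h1, h2, h3

toy_E2 = [('A', 'B'), ('A', 'C'), ('B', 'C'), ('B', 'D'), ('C', 'D')]
toy_E3 = [{'A', 'B', 'C'}, {'B', 'C', 'D'}]
toy_counts = count_motifs(toy_E2, toy_E3)
print(toy_counts)
```

The enumeration over all 4-node subsets is only practical for small node sets; it is meant as a reference to validate faster counting code.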

Random hypergraphs:
* probability for 2-edges: p2 = c/(n-1)
* probability for 3-edges to maintain expected 2-section graph degree:  p3 = (8-c)/((n-1)*(n-2)) 
* probability for 3-edges to maintain expected H-degree: p3 = (8-c)/((n-1)*(n/2-1))
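
Both choices of p3 target an expected degree of 8, the constant appearing in the formulas above: a node can belong to $\binom{n-1}{2}$ possible 3-edges, and each 3-edge adds one to its hypergraph degree but two to its 2-section degree. The sketch below checks this arithmetic; the values of `n` and `c` are arbitrary illustrative choices, and `expected_degrees` is a hypothetical helper.

```python
from math import comb

def expected_degrees(n, c, p3):
    """Expected hypergraph degree and expected 2-section degree (with multiplicity)
    of a node when 2-edges appear with probability c/(n-1) and 3-edges with p3."""
    p2 = c / (n - 1)
    deg2 = p2 * (n - 1)                    # expected number of 2-edges at a node
    deg3 = p3 * comb(n - 1, 2)             # expected number of 3-edges at a node
    h_degree = deg2 + deg3
    two_section_degree = deg2 + 2 * deg3   # each 3-edge adds two neighbours
    return h_degree, two_section_degree

n, c = 100, 4
p3_graph = (8 - c) / ((n - 1) * (n - 2))       # keeps the 2-section degree at 8
p3_hyper = (8 - c) / ((n - 1) * (n / 2 - 1))   # keeps the hypergraph degree at 8
print(expected_degrees(n, c, p3_graph))
print(expected_degrees(n, c, p3_hyper))
```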


In [None]:
## H1 pattern
ly = {'A':(0,1),'B':(1,1),'C':(0,0),'D':(1,0)}
E = [{'A'},{'B'},{'C'},{'D'}]
HG = hnx.Hypergraph(dict(enumerate(E)))
g = nx.Graph()
g.add_edge('B','A')
g.add_edge('C','A')
g.add_edge('B','C')
g.add_edge('B','D')
g.add_edge('C','D')
plt.figure(figsize=(3,3))
hnx.draw(HG, pos=ly, with_edge_labels=False, with_node_labels=False,  
         edges_kwargs={'linewidths': 0, 'edgecolors': 'grey'},
         node_radius=3.0, with_additional_edges=g
        )
#plt.savefig('H1.eps')
plt.show()


In [None]:
## H2 pattern
E = [{'A','B','C'},{'D'}]
HG = hnx.Hypergraph(dict(enumerate(E)))
g = nx.Graph()
g.add_edge('B','D')
g.add_edge('C','D')
plt.figure(figsize=(3,3))
hnx.draw(HG, pos=ly, with_edge_labels=False, with_node_labels=False,  
         edges_kwargs={'linewidths': [1.5,0], 'edgecolors': 'grey'},
         node_radius=3.0, with_additional_edges=g
        )
#plt.savefig('H2.eps')
plt.show()


In [None]:
## H3 pattern
E = [{'A','B','C'},{'B','C','D'}]
HG = hnx.Hypergraph(dict(enumerate(E)))
plt.figure(figsize=(3,3))
hnx.draw(HG, pos=ly, with_edge_labels=False, with_node_labels=False,  
         edges_kwargs={'linewidths': 1.5, 'edgecolors': 'grey'},
         node_radius=3.0
        )
#plt.savefig('H3.eps');
plt.show()
