# Chapter 7 - Hypergraphs

In this notebook, we introduce hypergraphs, a generalization of graphs that allows edges of arbitrary size (in practice, we consider edges of size 2 or more). We illustrate a few concepts using hypergraphs, including modularity, community detection and transformation into 2-section graphs.

**This notebook requires version 1.2 or newer of the HyperNetX package** (https://github.com/pnnl/HyperNetX).

### New required package (HNX version 1.2 or newer):

* pip install hypernetx


In [None]:
## Set this to the data directory
datadir='../Datasets/'

In [None]:
import pandas as pd
import numpy as np
import igraph as ig
import partition_igraph
import hypernetx as hnx
import hypernetx.algorithms.hypergraph_modularity as hmod ## new as of version 1.2
import pickle
import matplotlib.pyplot as plt
%matplotlib inline
from collections import Counter
from functools import reduce
import itertools
from scipy.special import comb


# Toy hypergraph example with HNX

We illustrate a few concepts with a small toy hypergraph. 
First, we build the HNX hypergraph from a list of sets (the edges), and we draw the hypergraph as well as its dual (where the roles of nodes and edges are swapped).
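
To make the dual concrete, here is a minimal pure-Python sketch, independent of HNX, that swaps the roles of nodes and edges for the same toy edge list. The helper name `dual_of` and the dictionary layout are our own illustration and are not part of the HNX API.

```python
from collections import defaultdict

# Toy hyperedges; enumerate() gives them integer IDs 0..5.
toy_E = [{'A','B'},{'A','C'},{'A','B','C'},{'A','D','E','F'},{'D','F'},{'E','F'}]
toy_edges = dict(enumerate(toy_E))

def dual_of(edges):
    """Hypothetical helper: map each node to the set of edge IDs containing it."""
    d = defaultdict(set)
    for edge_id, nodes in edges.items():
        for v in nodes:
            d[v].add(edge_id)
    return dict(d)

toy_dual = dual_of(toy_edges)

# In the dual, each original node is an edge whose members are original edge IDs.
for node in sorted(toy_dual):
    print(node, sorted(toy_dual[node]))
```

In this sketch, node `A` becomes a dual edge of size 4 because it belongs to four of the original hyperedges.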


In [None]:
## build a hypergraph from a list of sets (the hyperedges)
## using 'enumerate', edges will have integer IDs
E = [{'A','B'},{'A','C'},{'A','B','C'},{'A','D','E','F'},{'D','F'},{'E','F'}]
HG = hnx.Hypergraph(dict(enumerate(E)))
hnx.draw(HG)


In [None]:
## dual hypergraph
HD = HG.dual()
hnx.draw(HD)

### pre-computing

HNX hypergraphs have node and edge weights set to 1 by default if no other values are supplied.
The hypergraph modularity code requires a few other quantities that we pre-compute for efficiency: node strength (the sum of the weights of incident edges, which equals the degree when all edge weights are 1) and d-weights (the total weight of edges of size d, for each size d appearing in the hypergraph).
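
As a rough illustration of these two quantities, the sketch below computes them by hand for the toy edge list with unit weights. It only mirrors the definitions given above; the variable names are our own, and it is not how `precompute_attributes` is necessarily implemented.

```python
from collections import Counter

# Toy hyperedges with unit weights (the default assumed in the text above).
toy_E = [{'A','B'},{'A','C'},{'A','B','C'},{'A','D','E','F'},{'D','F'},{'E','F'}]
toy_w = [1] * len(toy_E)

strength = Counter()    # node -> sum of weights of incident edges
d_weights = Counter()   # edge size d -> total weight of edges of size d
for e, w in zip(toy_E, toy_w):
    d_weights[len(e)] += w
    for v in e:
        strength[v] += w

print(dict(sorted(strength.items())))
print(dict(sorted(d_weights.items())))
```

With unit weights, the strength of each node equals its degree, and the d-weights simply count the edges of each size (four of size 2, one of size 3 and one of size 4).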

In [None]:
## compute node strength (add unit weight if none) and a few other quantities useful to quickly compute modularity
HG = hmod.precompute_attributes(HG)

## show the nodes (here strength = degree since all weights are 1 by default)
HG.nodes.elements


In [None]:
## show the edges (unit weights were added by default)
HG.edges.elements


In [None]:
## d-weights distribution; here there are edges of size 2, 3 and 4 only.
HG.d_weights


### hypergraph modularity qH

We compute qH on the toy hypergraph for 4 different partitions, using 3 different variations for the edge contribution.

For edges of size $d$ where $c$ is the number of nodes from the part with the most representatives, we consider three variations as follows for edge contribution:

* **strict**: edges are counted only if all nodes are from the same part, with unit weight, i.e. $w = 1$ iff $c = d$ (0 otherwise).
* **majority**: edges are counted only if more than half the nodes are from the same part, with unit weight, i.e. $w = 1$ iff $c > d/2$ (0 otherwise).
* **linear**: edges are counted only if more than half the nodes are from the same part, with weight proportional to the number of nodes in the majority, i.e. $w = c/d$ iff $c > d/2$ (0 otherwise).
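
The three rules can be written directly from the definitions in the list. The sketch below is plain Python with hypothetical function names (`strict_w`, `majority_w`, `linear_w`, `majority_count`); it shows only the per-edge weight $w$, not the full qH computation, and is not taken from the HNX source.

```python
def strict_w(d, c):
    """Weight 1 only when all d nodes of the edge fall in the same part."""
    return 1 if c == d else 0

def majority_w(d, c):
    """Weight 1 when strictly more than half of the nodes fall in the same part."""
    return 1 if c > d / 2 else 0

def linear_w(d, c):
    """Weight c/d when strictly more than half of the nodes fall in the same part."""
    return c / d if c > d / 2 else 0

def majority_count(edge, partition):
    """c: number of nodes of the edge in the part with the most representatives."""
    return max(len(edge & part) for part in partition)

# Example: edge {A, D, E, F} under the partition {A,B,C} | {D,E,F}.
toy_edge = {'A', 'D', 'E', 'F'}
toy_partition = [{'A', 'B', 'C'}, {'D', 'E', 'F'}]
d_size = len(toy_edge)
c_major = majority_count(toy_edge, toy_partition)

print(d_size, c_major)
print(strict_w(d_size, c_major), majority_w(d_size, c_major), linear_w(d_size, c_major))
```

Here $d = 4$ and $c = 3$, so the edge contributes nothing under the strict rule, 1 under the majority rule and 0.75 under the linear rule. An edge split evenly between two parts ($c = d/2$) contributes nothing under any of the three rules.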


In [None]:
## compute hypergraph modularity (qH) for the following partitions:
A1 = [{'A','B','C'},{'D','E','F'}]           ## good clustering, qH should be positive
A2 = [{'B','C'},{'A','D','E','F'}]           ## not so good
A3 = [{'A','B','C','D','E','F'}]             ## this should yield qH == 0
A4 = [{'A'},{'B'},{'C'},{'D'},{'E'},{'F'}]   ## qH should be negative here

## we compute with 3 different choices of functions for the edge contribution: linear (default), strict and majority

print('linear edge contribution:')
print('qH(A1):',hmod.modularity(HG,A1),
      'qH(A2):',hmod.modularity(HG,A2),
      'qH(A3):',hmod.modularity(HG,A3),
      'qH(A4):',hmod.modularity(HG,A4))
print('strict edge contribution:')
print('qH(A1):',hmod.modularity(HG,A1,hmod.strict),
      'qH(A2):',hmod.modularity(HG,A2,hmod.strict),
      'qH(A3):',hmod.modularity(HG,A3,hmod.strict),
      'qH(A4):',hmod.modularity(HG,A4,hmod.strict))
print('majority edge contribution:')
print('qH(A1):',hmod.modularity(HG,A1,hmod.majority),
      'qH(A2):',hmod.modularity(HG,A2,hmod.majority),
      'qH(A3):',hmod.modularity(HG,A3,hmod.majority),
      'qH(A4):',hmod.modularity(HG,A4,hmod.majority))


### 2-section graph

We build the 2-section graph for the toy hypergraph, and run graph clustering (ECG) on this graph.
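
For intuition, a 2-section graph replaces each hyperedge by a clique on its nodes. The pure-Python sketch below assumes one common weighting, in which an edge of size $d$ and weight $w$ adds $w/(d-1)$ to every pair it contains, so that each node's weighted degree matches its strength. Whether `hmod.two_section` uses exactly this weighting is an assumption worth checking against the HNX documentation; the helper name is ours.

```python
from collections import defaultdict
from itertools import combinations

toy_E = [{'A','B'},{'A','C'},{'A','B','C'},{'A','D','E','F'},{'D','F'},{'E','F'}]

def two_section_weights(edges, weights=None):
    """Hypothetical helper: pair -> accumulated weight, assuming w/(d-1) per pair."""
    if weights is None:
        weights = [1] * len(edges)           # unit weights by default
    pair_w = defaultdict(float)
    for e, w in zip(edges, weights):
        d = len(e)
        if d < 2:
            continue                          # edges of size 1 yield no pairs
        for u, v in combinations(sorted(e), 2):
            pair_w[(u, v)] += w / (d - 1)     # parallel pairs are summed
    return dict(pair_w)

pw = two_section_weights(toy_E)
for pair in sorted(pw):
    print(pair, round(pw[pair], 3))

# Weighted degree of node A in the 2-section graph
deg_A = sum(w for (u, v), w in pw.items() if 'A' in (u, v))
print('weighted degree of A:', round(deg_A, 3))
```

Under this assumption, the pair `('A','B')` gets weight 1 from the edge of size 2 plus 1/2 from the edge of size 3, and the weighted degree of `A` equals its strength in the hypergraph.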


In [None]:
## 2-section graph
G = hmod.two_section(HG)
G.vs['label'] = G.vs['name']
ig.plot(G,bbox=(0,0,250,250))


In [None]:
## 2-section clustering with ECG
G.vs['community'] = G.community_ecg().membership
hmod.dict2part({v['name']:v['community'] for v in G.vs})


# Game of Thrones scenes hypergraph

The original data can be found here: https://github.com/jeffreylancaster/game-of-thrones.
A pre-processed version is provided, where we build a hypergraph from the Game of Thrones scenes with the following elements:

* **Nodes** are named characters in the series
* **Hyperedges** are groups of characters appearing in the same scene(s)
* **Hyperedge weights** are total scene(s) duration in seconds involving each group of characters

We kept hyperedges with at least 2 characters and we discarded characters with degree below 5.

We saved the following:

* *Edges*: list of sets where the nodes are 0-based integers represented as strings: '0', '1', ... 'n-1'
* *Names*: dictionary; mapping of nodes to character names
* *Weights*: list; hyperedge weights (in same order as Edges)


In [None]:
## read the data
with open(datadir+"GoT/GoT.pkl","rb") as f:
    Edges, Names, Weights = pickle.load(f)


## Build weighted hypergraph 

Use the above to build the weighted hypergraph (GoT).

In [None]:
## Nodes are represented as strings from '0' to 'n-1'
GoT = hnx.Hypergraph(dict(enumerate(Edges)))

## add edge weights
for e in GoT.edges:
    GoT.edges[e].weight = Weights[e]

## add full names of characters
for v in GoT.nodes:
    GoT.nodes[v].name = Names[v]

## pre-compute required quantities for modularity and clustering
GoT = hmod.precompute_attributes(GoT)

print(GoT.number_of_nodes(),'nodes and',GoT.number_of_edges(),'edges')

In [None]:
## example of a node (indices are strings)
GoT.nodes['0']

In [None]:
## example of an edge (indices are integers)
GoT.edges[0]

In [None]:
## to get the nodes for a given edge
GoT.edges[0].elements

In [None]:
## or just the keys
GoT.edges[0].elements.keys()

## EDA on GoT hypergraph

Simple exploratory data analysis (EDA) on this hypergraph. 

In [None]:
## edge sizes (number of characters per scene)
plt.hist([GoT.edges[e].size() for e in GoT.edges], bins=25, color='grey')
plt.xlabel("Edge size",fontsize=14);
#plt.savefig('got_hist_1.eps');
## max edge size
print('max = ',max([GoT.edges[e].size() for e in GoT.edges]))

In [None]:
## edge weights (total scene durations for each group of characters appearing together)
plt.hist([GoT.edges[e].weight for e in GoT.edges], bins=25, color='grey')
plt.xlabel("Edge weight",fontsize=14);
#plt.savefig('got_hist_2.eps');
## max edge weight
print('max = ',max([GoT.edges[e].weight for e in GoT.edges]))

In [None]:
## node degrees
plt.hist(hnx.degree_dist(GoT),bins=20, color='grey')
plt.xlabel("Node degree",fontsize=14);
#plt.savefig('got_hist_3.eps');
## max degree
print('max = ',max(hnx.degree_dist(GoT)))

In [None]:
## node strength (total appearance)
plt.hist([GoT.nodes[n].strength for n in GoT.nodes], bins=20, color='grey')
plt.xlabel("Node strength",fontsize=14);
#plt.savefig('got_hist_4.eps');
## max strength
print('max = ',max([GoT.nodes[n].strength for n in GoT.nodes]))

In [None]:
## build a dataframe with node characteristics
dg = [GoT.degree(v) for v in GoT.nodes()]
st = [GoT.nodes[v].strength for v in GoT.nodes()]
nm = [GoT.nodes[v].name for v in GoT.nodes()]
## build the dataframe column by column so that numeric columns keep their dtypes
D = pd.DataFrame({'name':nm,'degree':dg,'strength':st})

## sort w.r.t. strength
D.sort_values(by='strength',ascending=False).head()

In [None]:
## sort w.r.t. degree
D.sort_values(by='degree',ascending=False).head()

In [None]:
## we see clear correlation between degree and strength
plt.plot(D['degree'],D['strength'],'.')
plt.xlabel('degree',fontsize=14)
plt.ylabel('strength',fontsize=14);

## Build 2-section graph and compute a few centrality measures

We saw several centrality measures for graphs in chapter 3. Below, we build the 2-section graph for GoT and compute a few of those.

Node ordering should be preserved and we verify that it is. 
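
For the betweenness values computed further below, the raw scores are rescaled by the number of node pairs that do not contain the node itself. For an undirected graph with $n$ nodes and raw betweenness $b(v)$, this standard normalization reads:

$$
\tilde{b}(v) = \frac{b(v)}{\binom{n-1}{2}} = \frac{2\,b(v)}{(n-1)(n-2)}
$$

so that $\tilde{b}(v)$ lies in $[0,1]$, with the value 1 reached by the center of a star graph.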

In [None]:
## build 2-section
G = hmod.two_section(GoT)

In [None]:
## sanity check -- node ordering is the same in GoT and G

## ordering of nodes in GoT
ord_GoT = list(GoT.nodes.elements.keys())

## ordering of nodes in G
ord_G = [v['name'] for v in G.vs]

ord_GoT == ord_G

In [None]:
b = G.betweenness(directed=False,weights='weight')
n = G.vcount()
D['betweenness'] = [2*x/((n-1)*(n-2)) for x in b]
D['pagerank'] = G.pagerank(directed=False,weights='weight')

## order w.r.t. betweenness
D.sort_values(by='betweenness',ascending=False).head()

In [None]:
## order w.r.t. pagerank
D.sort_values(by='pagerank',ascending=False).head()

## Hypergraph modularity and clustering



In [None]:
## visualize the 2-section graph
print('nodes:',G.vcount(),'edges:',G.ecount())
G.vs['size'] = 10
G.vs['color'] = 'lightgrey'
G.vs['label'] = [int(x) for x in G.vs['name']] ## use int(name) as label
G.vs['character'] = [GoT.nodes[n].name for n in G.vs['name']]
G.vs['label_size'] = 5
ly = G.layout_fruchterman_reingold()
ig.plot(G, layout = ly, bbox=(0,0,600,400))

In [None]:
## we see a well-separated small clique; it is the Braavosi theater troupe
print([GoT.nodes[str(x)].name for x in np.arange(166,173)])


In [None]:
## Compute modularity (qH) on several random partitions with K parts for a range of K's
## This should be close to 0 and can be negative.
h = []
for K in np.arange(2,21):
    for rep in range(10):
        V = list(GoT.nodes)
        p = np.random.choice(K, size=len(V))
        RandPart = hmod.dict2part({V[i]:p[i] for i in range(len(V))})
        ## drop empty sets if any
        RandPart = [x for x in RandPart if len(x)>0]
        ## compute qH
        h.append(hmod.modularity(GoT, RandPart))
print('range for qH:',min(h),'to',max(h))
plt.boxplot(h, showfliers=False);

In [None]:
## Cluster the 2-section graph (with Louvain) and compute qH
## We now see qH >> 0
G.vs['louvain'] = G.community_multilevel(weights='weight').membership
D['cluster'] = G.vs['louvain']
ML = hmod.dict2part({v['name']:v['louvain'] for v in G.vs})
## Compute qH
print(hmod.modularity(GoT, ML))


In [None]:
## plot 2-section w.r.t. the resulting clusters
cl = G.vs['louvain']

## pick greyscale or color plot:
#pal = ig.GradientPalette("white","black",max(cl)+2)
pal = ig.ClusterColoringPalette(max(cl)+1)

G.vs['color'] = [pal[x] for x in cl]
G.vs['label_size'] = 5
ig.plot(G, layout = ly, bbox=(0,0,500,400))
#ig.plot(G, target='GoT_clusters.eps', layout = ly, bbox=(0,0,400,400))

In [None]:
## ex: high strength nodes in same cluster with Daenerys Targaryen
dt = int(D.loc[D['name']=='Daenerys Targaryen','cluster'].iloc[0])  ## .iloc[0] avoids calling int() on a Series
D[D['cluster']==dt].sort_values(by='strength',ascending=False).head(9)

# Extra material


## Experiment with simple random hypergraphs with communities

Note: qH-based heuristics are still very experimental; we only provide this for illustration in **Section 7.4** of the book. Experiment results are stored in files taus_xx.pkl with xx in {00, 05, 10, 15}.

For each experiment, we have results for:

* 16 hypergraphs, each with 1000 nodes and 1400 edges of size 2 to 8 (200 of each size)
* 10 communities with 0%, 5%, 10% or 15% of "noise" edges ($\mu$)
* community edge homogeneity ($\tau$) from 0.5 to 1
* communities obtained via 3 algorithms:
 * qG-based Louvain on the 2-section graph
 * qH-based heuristic clustering algorithm on the hypergraph
 * qH+: same but using true homogeneity ($\tau$)

Recall that AMI = adjusted mutual information.


In [None]:
## load results (here mu = .05) and plot
with open( datadir+"Hypergraph/taus_05.pkl", "rb" ) as f:
    results = pickle.load(f)

R = pd.DataFrame(results,columns=['tau','Graph','Hypergraph','Hypergraph+']).groupby(by='tau').mean()
t = [x for x in np.arange(.501,1,.025)]

## color or greyscale
pal = ig.GradientPalette("grey","black",3)
#pal = ig.GradientPalette("red","blue",3)

## plot
plt.plot(t,R['Graph'],'o-',label='qG-based',color=pal[0])
plt.plot(t,R['Hypergraph'],'o-',label='qH-based',color=pal[1])
plt.plot(t,R['Hypergraph+'],'o-',label='qH-based (tuned)',color=pal[2])
plt.xlabel(r'homogeneity ($\tau$)',fontsize=14)
plt.ylabel('AMI',fontsize=14)
plt.legend();
#plt.savefig('taus_05.eps');

## Community hypergraphs

We provide hyperedge lists and communities for 3 random hypergraphs with communities, namely:

* edges_65, comm_65: hypergraph with $\tau_e = \lceil 0.65\,d \rceil$ for all community edges of size $d$
* edges_85, comm_85: hypergraph with $\tau_e = \lceil 0.85\,d \rceil$ for all community edges of size $d$
* edges_65_unif, comm_65_unif: hypergraph with $\tau_e$ chosen uniformly from $\{\lceil 0.65\,d \rceil,...,d\}$ for all community edges of size $d$

All have 1000 nodes, 1400 edges of size 2 to 8 (200 of each size), 10 communities and noise parameter $\mu=0.1$.
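
Because of the ceiling, the realized fraction $\tau_e/d$ depends on the edge size. The short sketch below tabulates it for sizes 2 to 8, reading $\tau_e$ as the number of nodes of a community edge drawn from its home community (our interpretation of the definitions above). The helper name `min_majority` is hypothetical.

```python
import math

def min_majority(d, tau):
    """ceil(tau * d): assumed number of same-community nodes in an edge of size d."""
    return math.ceil(tau * d)

sizes = range(2, 9)
tau_65 = {d: min_majority(d, 0.65) for d in sizes}
tau_85 = {d: min_majority(d, 0.85) for d in sizes}

print('d  tau_e(0.65)  ratio   tau_e(0.85)  ratio')
for d in sizes:
    print(d, tau_65[d], round(tau_65[d] / d, 3), tau_85[d], round(tau_85[d] / d, 3))
```

For example, with the 0.65 setting an edge of size 2 must lie entirely inside one community (ratio 1), while an edge of size 6 needs only 4 of its 6 nodes (ratio about 0.667). This is one reason the observed homogeneity values are spread out rather than concentrated at 0.65.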

In [None]:
## load the edge lists and communities
with open(datadir+"Hypergraph/hypergraphs.pkl","rb") as f:
    (edges_65, comm_65, edges_85, comm_85, edges_65_unif, comm_65_unif) = pickle.load(f)

In the experiment below, we estimate the homogeneity parameter $\tau$ via clustering on the 2-section graph and compare with the results we get using the true communities.
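
The quantity being estimated can be sketched on a tiny made-up example: the homogeneity of an edge is the largest fraction of its nodes that fall in a single community, and we then look at the share of edges whose homogeneity exceeds a threshold $\tau$. The data and helper names below are hypothetical and only illustrate the computation.

```python
def homogeneity(edge, communities):
    """Largest fraction of the edge's nodes lying in a single community."""
    return max(len(edge & c) for c in communities) / len(edge)

def frac_above(values, tau):
    """Empirical Pr(homogeneity > tau)."""
    return sum(v > tau for v in values) / len(values)

# Hypothetical toy data: two communities and four hyperedges.
toy_comms = [{0, 1, 2, 3}, {4, 5, 6, 7}]
toy_hyperedges = [{0, 1, 2}, {0, 1, 4}, {2, 3, 4, 5}, {0, 4, 5, 6}]

toy_h = [homogeneity(e, toy_comms) for e in toy_hyperedges]
print([round(v, 3) for v in toy_h])
print(frac_above(toy_h, 0.6), frac_above(toy_h, 0.7))
```

The first edge is pure (homogeneity 1), the third is evenly split (0.5), and three of the four edges exceed the threshold 0.6.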

In [None]:
## pick one of the three hypergraphs
comm = comm_65
L = edges_65

## build hypergraph
HG = hnx.Hypergraph(dict(enumerate(L)))

## compute P(homogeneity > $\tau$) using the true communities
x = []
for e in L:
    x.append(max([len(e.intersection(k)) for k in comm])/len(e))
y = []
for t in np.arange(.501,1,.025):
    y.append(sum([i>t for i in x])/len(x))
plt.plot(np.arange(.501,1,.025),y,'.-',color='grey',label='true communities')

## same but using the communities obtained via Louvain algorithm on the 2-section graph
G = hmod.two_section(HG)
G.vs['louvain'] = G.community_multilevel(weights='weight').membership
ML = hmod.dict2part({v['name']:v['louvain'] for v in G.vs})
x = []
for e in L:
    x.append(max([len(e.intersection(k)) for k in ML])/len(e))
y = []
for t in np.arange(.501,1,.025):
    y.append(sum([i>t for i in x])/len(x))
plt.plot(np.arange(.501,1,.025),y,'.-',color='black',label='Louvain')

## add grid and legend
plt.grid()
#plt.title(r'Estimating $\tau$ from data',fontsize=14)
plt.ylabel(r'Pr(homogeneity > $\tau$)',fontsize=14)
plt.xlabel(r'$\tau$',fontsize=14)
plt.legend()
plt.ylim(0,1);
#plt.savefig('tau_65.eps');


In [None]:
## distribution of edge homogeneity with all tau = 0.65
## results vary because of the different edge sizes and some "noise" edges.
x = []
for e in edges_65:
    x.append(max([len(e.intersection(k)) for k in comm_65])/len(e))
plt.hist(x,bins='rice',color='grey');
#plt.savefig('hist_65.eps');


In [None]:
## distribution of edge homogeneity with tau varying from 0.65 to 1
## we see many more pure community edges in this case, as expected
x = []
for e in edges_65_unif:
    x.append(max([len(e.intersection(k)) for k in comm_65_unif])/len(e))
plt.hist(x, bins='rice',color='grey');
#plt.savefig('hist_65_unif.eps');


# Motifs example 

We use the HNX draw function to reproduce the patterns from Figure 7.1 in the book.

In [None]:
## H1 pattern
E = [{'A','B'},{'A','C'},{'A','D'},{'B','D'},{'C','D'}]
HG = hnx.Hypergraph(dict(enumerate(E)))
hnx.draw(HG)

In [None]:
## H2 pattern
E = [{'A','B','C'},{'A','D'},{'C','D'}]
HG = hnx.Hypergraph(dict(enumerate(E)))
hnx.draw(HG)

In [None]:
## H3 pattern
E = [{'A','B','C'},{'B','C','D'}]
HG = hnx.Hypergraph(dict(enumerate(E)))
hnx.draw(HG)


In [None]:
### Counting those patterns -- Table 7.2 in the book