# Chapter 5 - Community Detection

In this notebook, we explore several algorithms to find communities in graphs.

In some cells, we use the ABCD benchmark to generate synthetic graphs with communities. 
ABCD is written in Julia.

### Installing Julia and ABCD

We use the command line interface option to run ABCD below. 
The following steps are required:

* install Julia (we used version 1.4.2) from https://julialang.org/downloads/
* download ABCD from https://github.com/bkamins/ABCDGraphGenerator.jl
* adjust the 'abcd_path' in the next cell to the location of the 'utils' subdirectory of ABCD
* run 'julia abcd_path/install.jl' to install the required packages

Also set the path(s) in the cell below. For Windows, you may need to use "\\" or "\\\\" as delimiters, for example 'C:\ABCD\utils\\\\'

### Directories

* Set the directories accordingly in the next cell


In [None]:
## set those accordingly
datadir = '../Datasets/'
abcd_path = '~/ABCD/utils/'
julia = 'julia' ## you may need the full path here

In [None]:
import igraph as ig
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.linear_model import LinearRegression
from collections import Counter
import os
import umap
import pickle
import partition_igraph
import subprocess
from sklearn.metrics import adjusted_mutual_info_score as AMI

## we used those for the book, but you can change to other colors
cls_edges = 'gainsboro'
cls = ['silver','dimgray','black']

# Zachary (karate) graph

A small graph with 34 nodes and two "ground-truth" communities.
Modularity-based algorithms will typically find 4 or 5 communities.
In the next cells, we look at this small graph from several different angles.


In [None]:
z = ig.Graph.Famous('zachary')
z.vs['size'] = 12
z.vs['name'] = [str(i) for i in range(z.vcount())]
z.vs['label'] = [str(i) for i in range(z.vcount())]
z.vs['label_size'] = 8
z.es['color'] = cls_edges
z.vs['comm'] = [0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1]
z.vs['color'] = [cls[i] for i in z.vs['comm']]
#ig.plot(z, 'zachary_gt.eps', bbox=(0,0,300,200))
ig.plot(z, bbox=(0,0,350,250))

## Node Roles
 
We compute z(v) (normalized within module degree) and p(v) (participation coefficients) as defined in section 5.2 of the book for the Zachary graph. We identify 3 types of nodes, as described in the book.

* provincial hubs
* peripheral nodes (non-hubs)
* ultra peripheral nodes (non-hubs)
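
As a hedged illustration of the two quantities, the sketch below computes them in plain Python on a small hand-made graph. The toy graph and the helper names `participation` and `within_module_z` are assumptions made for this example only. The participation coefficient is written for any number of communities, $p(v) = 1 - \sum_c (k_c(v)/k(v))^2$, which reduces to the two-term expression used in the next cell when there are two communities. The guard for zero standard deviation is a choice made here, not something taken from the book.

```python
from collections import Counter
from statistics import mean, stdev

# Toy graph (an assumption for illustration): a 4-node community and a triangle,
# joined by the single edge 0-4.
adj = {
    0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1], 3: [0],
    4: [0, 5, 6], 5: [4, 6], 6: [4, 5],
}
comm = {0: 0, 1: 0, 2: 0, 3: 0, 4: 1, 5: 1, 6: 1}

def participation(v):
    """p(v) = 1 - sum over communities c of (k_c(v) / k(v))**2."""
    k = len(adj[v])
    per_comm = Counter(comm[u] for u in adj[v])
    return 1 - sum((kc / k) ** 2 for kc in per_comm.values())

def within_module_z(v):
    """z(v): internal degree of v, standardized over the members of its community."""
    members = [u for u in adj if comm[u] == comm[v]]
    internal = {u: sum(comm[w] == comm[u] for w in adj[u]) for u in members}
    values = list(internal.values())
    sigma = stdev(values)   # sample standard deviation (ddof=1)
    if sigma == 0:          # all members have the same internal degree
        return 0.0
    return (internal[v] - mean(values)) / sigma

for v in adj:
    print(v, round(within_module_z(v), 3), round(participation(v), 3))
```

Node 0 has the largest internal degree in its community, so it gets the largest z, and it is also the only node of that community with an external neighbour, so it is the only one there with p > 0.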
    

In [None]:
## compute internal degrees
in_deg_0 = z.subgraph_edges([e for e in z.es if z.vs['comm'][e.tuple[0]]==0 and z.vs['comm'][e.tuple[1]]==0],
                            delete_vertices=False).degree()
in_deg_1 = z.subgraph_edges([e for e in z.es if z.vs['comm'][e.tuple[0]]==1 and z.vs['comm'][e.tuple[1]]==1],
                            delete_vertices=False).degree()

## compute z (normalized within-module degree)
z.vs['in_deg'] = [in_deg_0[i] + in_deg_1[i] for i in range(z.vcount())]
mu = [np.mean([x for x in in_deg_0 if x>0]),np.mean([x for x in in_deg_1 if x>0])]
sig = [np.std([x for x in in_deg_0 if x>0],ddof=1),np.std([x for x in in_deg_1 if x>0],ddof=1)]
z.vs['z'] = [(v['in_deg']-mu[v['comm']])/sig[v['comm']] for v in z.vs]

## computing p (participation coefficient)
z.vs['deg'] = z.degree()
z.vs['out_deg'] = [v['deg'] - v['in_deg'] for v in z.vs]
z.vs['p'] = [1-(v['in_deg']/v['deg'])**2-(v['out_deg']/v['deg'])**2 for v in z.vs]
D = pd.DataFrame(np.array([z.vs['z'],z.vs['p']]).transpose(),columns=['z','p']).sort_values(by='z',ascending=False)
D.head()


Below, we plot the Zachary graph w.r.t. z where z>2.5 are hubs, which we show as square nodes.
The largest values are for node 0 (instructor), node 33 (president) and node 32.
Nodes 0 and 33 are the key nodes for the division of the group into two factions.


In [None]:
## Zachary graph w.r.t. roles
z.vs['color'] = 'black'
z.vs['shape'] = 'circle'
for v in z.vs:
    if v['z']<2.5: ## non-hub
        if v['p'] < .62 and v['p'] >= .05: ## peripheral
            v['color'] = 'dimgrey'
        if v['p'] < .05: ## ultra-peripheral
            v['color'] = 'gainsboro'
    if v['z']>=2.5 and v['p'] < .3: ## hubs (all provincial here)            
        v['color'] = 'silver'
        v['shape'] = 'square'
#ig.plot(z, 'zachary_roles_1.eps', bbox=(0,0,350,250))
ig.plot(z, bbox=(0,0,350,250))

The code below generates Figure 5.3(b) in the book, again comparing node roles in the Zachary graph.


In [None]:
## Figure 5.3(b) -- comparing the roles
fig, ax = plt.subplots(figsize=(12,9))
ax.scatter(z.vs['p'],z.vs['z'],marker='o',s=75, color='k')

plt.plot([0, .5], [2.5, 2.5], color='k', linestyle='-', linewidth=2)
plt.plot([.05, .05], [-.5, 2.4], color='k', linestyle='-', linewidth=2)

ax.annotate('node 0', (z.vs['p'][0],z.vs['z'][0]-.05), xytext=(z.vs['p'][0]+.01,z.vs['z'][0]-.3), 
            fontsize=14,
            arrowprops = dict(  arrowstyle="-",connectionstyle="angle3,angleA=0,angleB=-90"))

ax.annotate('node 33', (z.vs['p'][33],z.vs['z'][33]-.05), xytext=(z.vs['p'][33]-.07,z.vs['z'][33]-.3), 
            fontsize=14,
            arrowprops = dict(  arrowstyle="-",connectionstyle="angle3,angleA=0,angleB=-90"))

ax.annotate('node 32', (z.vs['p'][32]-.005,z.vs['z'][32]), xytext=(z.vs['p'][32]-.07,z.vs['z'][32]), 
            fontsize=14,
            arrowprops = dict(  arrowstyle="-",connectionstyle="angle3,angleA=0,angleB=-90"))

ax.annotate('node 1', (z.vs['p'][1],z.vs['z'][1]-.05), xytext=(z.vs['p'][1]-.07,z.vs['z'][1]-.3), 
            fontsize=14,
            arrowprops = dict(  arrowstyle="-",connectionstyle="angle3,angleA=0,angleB=-90"))

ax.annotate('node 3', (z.vs['p'][3],z.vs['z'][3]-.05), xytext=(z.vs['p'][3]+.07,z.vs['z'][3]-.3), 
            fontsize=14,
            arrowprops = dict(  arrowstyle="-",connectionstyle="angle3,angleA=0,angleB=-90"))

ax.annotate('node 2', (z.vs['p'][2],z.vs['z'][2]-.05), xytext=(z.vs['p'][2]-.07,z.vs['z'][2]-.3), 
            fontsize=14,
            arrowprops = dict(  arrowstyle="-",connectionstyle="angle3,angleA=0,angleB=-90"))

ax.annotate('provincial hubs',(.3,3), fontsize=18)
ax.annotate('peripheral non-hubs',(.3,1.8), fontsize=18)
ax.annotate('ultra peripheral non-hubs',(0.025,0.0),xytext=(.1,0), fontsize=18,
             arrowprops = dict( arrowstyle="->", connectionstyle="angle3,angleA=0,angleB=-90"))

plt.xlabel('participation coefficient (p)',fontsize=16)
plt.ylabel('normalized within module degree (z)',fontsize=16);
#plt.savefig('zachary_roles_2.eps')

## Strong and weak communities

Communities are defined as strong or weak as per (5.1) and (5.2) in the book.
For the Zachary graph, we verify whether the nodes within communities satisfy the strong criterion, then we verify whether the two communities satisfy the weak definition.

For the strong definition (internal degree larger than external degree for each node), only two nodes do not qualify.

For the weak definition (total community internal degree > total community external degree), both communities satisfy this criterion.
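
The two criteria can be sketched on a toy edge list in plain Python. This is a minimal illustration under stated assumptions: the graph, the community labels and the helper name `internal_external` are made up for this example, and a node counts as violating the strong criterion when its internal degree is not strictly larger than its external degree, as in the cell below.

```python
# Toy graph (an assumption for illustration), given as an undirected edge list.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5), (2, 4)]
comm = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

def internal_external(edges, comm):
    """Return {node: [internal degree, external degree]}."""
    deg = {v: [0, 0] for v in comm}
    for u, v in edges:
        side = 0 if comm[u] == comm[v] else 1
        deg[u][side] += 1
        deg[v][side] += 1
    return deg

deg = internal_external(edges, comm)

# Strong criterion: every node has more internal than external neighbours.
violators = [v for v, (i, e) in deg.items() if i <= e]

# Weak criterion: per community, total internal degree exceeds total external degree.
totals = {}
for v, (i, e) in deg.items():
    t = totals.setdefault(comm[v], [0, 0])
    t[0] += i
    t[1] += e
weak_ok = {c: i > e for c, (i, e) in totals.items()}

print('nodes failing the strong criterion:', violators)
print('weak criterion per community:', weak_ok)
```

Here node 2 has two internal and two external neighbours, so it fails the strong criterion, while both communities still satisfy the weak one.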


In [None]:
## strong criterion
for i in range(z.vcount()):
    c = z.vs[i]['comm']
    n = [z.vs[v]['comm']==c for v in z.neighbors(i)]
    if sum(n)<=len(n)-sum(n):
        print('node',i,'has internal degree',sum(n),'external degree',len(n)-sum(n))

In [None]:
## weak criterion
I = [0,0]
E = [0,0]
for i in range(z.vcount()):
    c = z.vs[i]['comm']
    n = [z.vs[v]['comm']==c for v in z.neighbors(i)]
    I[c] += sum(n)
    E[c] += len(n)-sum(n)
print('community 0 internal degree',I[0],'external degree',E[0])
print('community 1 internal degree',I[1],'external degree',E[1])


## Hierarchical clustering and dendrogram

The Girvan-Newman algorithm is described in section 5.5 of the book. We apply it to the Zachary graph and show the results of this divisive algorithm as a dendrogram.


In [None]:
## Girvan-Newman algorithm
gn = z.community_edge_betweenness()
#ig.plot(gn,'zachary_dendrogram.eps',bbox=(0,0,300,300))
ig.plot(gn,bbox=(0,0,300,300))

This is an example of a hierarchical clustering. In the next plot, we compute modularity for each possible cut of the dendrogram.

We see that we get strong modularity with 2 clusters, but the maximal value is obtained with 5.
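
For readers who want to see what is being computed at each cut, here is a minimal plain-Python sketch of modularity for an unweighted, undirected graph, $q = \sum_c \left[ e_c/m - (d_c/2m)^2 \right]$, where $e_c$ is the number of edges inside community $c$, $d_c$ the sum of degrees in $c$, and $m$ the number of edges. The function name and the toy graph are assumptions for this example; the notebook itself relies on igraph's `modularity` method.

```python
def modularity(edges, membership):
    """q = sum over communities c of e_c/m - (d_c/(2m))**2 (undirected, unweighted)."""
    m = len(edges)
    internal = {}   # e_c: number of edges with both ends in community c
    volume = {}     # d_c: sum of the degrees of the nodes in community c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        volume[cu] = volume.get(cu, 0) + 1
        volume[cv] = volume.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in volume.items())

# Two triangles joined by a single edge (toy example, not from the book).
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
q_two = modularity(edges, [0, 0, 0, 1, 1, 1])   # one community per triangle
q_one = modularity(edges, [0] * 6)              # everything in one community
q_all = modularity(edges, list(range(6)))       # every node on its own
print(q_two, q_one, q_all)
```

Putting each triangle in its own community gives $q = 5/14$, a single community gives 0, and singletons give a negative value, which is the same kind of rise and fall seen along the dendrogram cuts.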


In [None]:
## compute modularity at each possible cut and plot
q = []
for i in np.arange(z.vcount()):
    q.append(z.modularity(gn.as_clustering(n=i+1)))
plt.plot(np.arange(1,1+z.vcount()),q,'o-',color='black')
plt.xlabel('number of clusters',fontsize=14)
plt.ylabel('modularity',fontsize=14);
#plt.savefig('zachary_modularity.eps');

How are the nodes partitioned if we pick only 2 communities? How does this compare to the underlying ground truth?

From the plot below, we see that only 1 node is misclassified.

We also report the modularity of this partition, $q = 0.35996$, and compare the partition with the ground truth via AMI (adjusted mutual information), as defined in section 5.3 of the book. We get a high value, AMI = 0.83276, showing strong concordance.


In [None]:
## show result with 2 clusters -- 
z.vs['gn'] = gn.as_clustering(n=2).membership
print('AMI:',AMI(z.vs['comm'],z.vs['gn']))  ## adjusted mutual information
print('q:',z.modularity(z.vs['gn']))        ## modularity

z.vs['size'] = 10
z.vs['name'] = [str(i) for i in range(z.vcount())]
z.vs['label'] = [str(i) for i in range(z.vcount())]
z.vs['label_size'] = 8
z.es['color'] = cls_edges
z.vs['comm'] = [0,0,0,0,0,0,0,0,1,1,0,0,0,0,1,1,0,0,1,0,1,0,1,1,1,1,1,1,1,1,1,1,1,1]
#z.vs['color'] = [cls[i] for i in z.vs['comm']]
z.vs['color'] = [cls[i] for i in z.vs['gn']]
#ig.plot(z, 'zachary_2.eps',bbox=(0,0,300,200))
ig.plot(z,bbox=(0,0,300,200))

Same as above with 5 communities. We see higher modularity, but weaker AMI value.


In [None]:
## show result with optimal modularity (5 clusters)
z.vs['label'] = gn.as_clustering(n=5).membership
print('AMI:',AMI(z.vs['comm'],z.vs['label']))
print('q:',z.modularity(z.vs['label']))
z.vs['color'] = [cls[i] for i in z.vs['comm']]
z.vs['size'] = 10
z.vs['label_size'] = 8
#ig.plot(z, 'zachary_5.eps',bbox=(0,0,300,200))
ig.plot(z,bbox=(0,0,300,200))

# ABCD graph with 100 nodes

Next we look at a slightly larger graph generated with the ABCD benchmark model, which is described in section 5.3 of the book. This graph has 3 communities. 
Using hierarchical clustering, we compare modularity and AMI for each possible cut.

ABCD parameters used to generate this graph are: $\gamma=3, \tau=2$, degree range [5,15], community size range [25,50], $\xi=.2$.

In [None]:
## read graph and communities; plot
g = ig.Graph.Read_Ncol(datadir+'ABCD/abcd_100.dat',directed=False)
c = np.loadtxt(datadir+'ABCD/abcd_100_comms.dat',dtype='uint16',usecols=(1))
g.vs['comm'] = [c[int(x['name'])-1]-1 for x in g.vs]
gt = {k:v for k,v in enumerate(g.vs['comm'])} ## 'comm' is already 0-based
## map between int(name) to key
n2k = {int(v):k for k,v in enumerate(g.vs['name'])}
g.vs['size'] = 7
g.es['color'] = cls_edges
g.vs['color'] = [cls[i] for i in g.vs['comm']]
ig.plot(g, bbox=(0,0,300,200))

Girvan-Newman algorithm -- Modularity and AMI for each cut

In this case, both modularity and AMI are maximized with 3 communities.

In [None]:
q = []
a = []
gn = g.community_edge_betweenness()
for i in np.arange(g.vcount()):
    q.append(g.modularity(gn.as_clustering(n=i+1)))
    a.append(AMI(g.vs['comm'],gn.as_clustering(n=i+1).membership))
plt.plot(np.arange(1,1+g.vcount()),q,'.-',color='black',label='modularity')
plt.plot(np.arange(1,1+g.vcount()),a,'.-',color='grey',label='AMI')
plt.xlabel('number of clusters',fontsize=14)
plt.ylabel('modularity or AMI',fontsize=14)
plt.legend();
#plt.savefig('abcd_dendrogram.eps');

We see that with 3 communities, $q=0.502$ and AMI=1, so perfect recovery.


In [None]:
n_comm = np.arange(1,g.vcount()+1)
D = pd.DataFrame(np.array([n_comm,q,a]).transpose(),columns=['n_comm','q','AMI'])
df = D.head()
df

What would we get with 4 clusters, for which AMI = 0.95?
We see below that a few nodes are split off from one community.

In [None]:
## 4 communities
g.vs['gn'] = gn.as_clustering(n=4).membership
cls = ['silver','dimgray','black','white']
g.vs['color'] = [cls[i] for i in g.vs['gn']]
#ig.plot(g, 'abcd_4.eps', bbox=(0,0,300,200))
ig.plot(g, bbox=(0,0,300,200))

Those nodes form a triangle.


In [None]:
sg = g.subgraph([v for v in g.vs() if v['gn']==3])
ig.plot(sg, bbox=(0,0,100,100))

# ABCD with varying $\xi$

Here we show a typical way to compare graph clustering algorithms using benchmark graphs. 
We pick some model, here ABCD, and we vary the noise parameter $\xi$. 
With ABCD, the larger $\xi$ is, the closer we are to a random Chung-Lu or configuration model graph (i.e. where only the degree distribution matters). For $\xi=0$, we get pure communities (all edges are internal).

For each choice of $\xi$, we generate 30 graphs, apply several different clustering algorithms,
and compute AMI for each algorithm, comparing with the ground-truth communities.

The code below is commented out as it can take a while to run; a pickle file with results is included in the Data directory. To re-run from scratch, uncomment the cell below.

Parameters for the ABCD benchmark graphs are:

$\gamma=2.5, \tau=1.5$, degree range [10,50], community size range [50,100], $0.1 \le \xi \le 0.8$.

In [None]:
## load data generated with the code from above cell 
with open(datadir+"ABCD/abcd_study.pkl","rb") as f:
    L = pickle.load(f)
## store in dataframe and take averages
D = pd.DataFrame(L,columns=['algo','xi','AMI'])
## take average over 30 runs for each algorithm and every choice of xi
X = D.groupby(by=['algo','xi']).mean()


We plot the results in the following 2 cells. 
We see good results with Louvain and Infomap, and even better results with ECG.
Label propagation is a fast algorithm, but it collapses at moderate to high levels of noise.

From the standard deviation plot, we see high variability around the value(s) of $\xi$ where the different
algorithms start to collapse. We see that this happens later and at a smaller scale with ECG, which is known to have better stability.

Such studies are useful to compare algorithms; using benchmarks, we can directly control parameters such as the noise level.


In [None]:
## plot average results for each algorithm over range of xi
a = ['ECG','Louvain','Infomap','Label Prop.']
lt = ['-','--',':','-.','-.']
cl = ['blue','green','purple','red']
for i in range(len(a)):
    ## pick one - color or greyscale
    plt.plot(X.loc[(a[i])].index,X.loc[(a[i])],lt[i],label=a[i],color=cl[i])
    #plt.plot(X.loc[(a[i])].index,X.loc[(a[i])],lt[i],label=a[i],color='black')
plt.xlabel(r'ABCD noise ($\xi$)',fontsize=14)
plt.ylabel('AMI',fontsize=14)
plt.legend();
#plt.savefig('abcd_study.eps');

In [None]:
##  Look at standard deviations
S = D.groupby(by=['algo','xi']).std()
a = ['ECG','Louvain','Infomap','Label Prop.']
#a = ['ECG','Louvain','Infomap','Label Prop.','Leiden','CNM']
lt = ['-','--',':','-.','--',':']
cl = ['blue','green','purple','red','red','blue']
for i in range(len(a)):
    ## pick one - color or greyscale
    plt.plot(S.loc[(a[i])].index,S.loc[(a[i])],lt[i],label=a[i],color=cl[i])
    #plt.plot(S.loc[(a[i])].index,S.loc[(a[i])],lt[i],label=a[i],color='black')
plt.xlabel(r'ABCD noise ($\xi$)',fontsize=14)
plt.ylabel('Standard Deviation (AMI)',fontsize=14)
plt.legend();
#plt.savefig('abcd_study_stdv.eps');

## Compare stability 

This study is similar to the previous one, but we compare pairs of partitions for each algorithm on the same graph instead of comparing with the ground truth, so we look at the stability of algorithms. Note that an algorithm can be stable but still bad (e.g., always clustering all nodes in a single community).

The code below can take a while to run; a pickle file with results is included in the Data directory. To re-run from scratch, uncomment the cell below.


In [None]:
## load stability results (AMI between pairs of runs on the same graph)
with open(datadir+"ABCD/abcd_study_stability.pkl","rb") as f:
    Ls = pickle.load(f)
## store in dataframe 
D = pd.DataFrame(Ls,columns=['algo','xi','AMI'])
## take averages for each algorithm and each noise value xi
X = D.groupby(by=['algo','xi']).mean()

We plot the results below. The behaviour of algorithms can be clustered in two groups:

* For Louvain and ECG, stability is excellent and degrades gradually for high noise level, with ECG being the more stable algorithm.
* For Infomap and Label Propagation, stability is also good up to the noise value where the results start to degrade, as we saw in the previous study. We then see near-perfect stability for very high noise values, which are the values where the results were very bad in the previous study. This typically happens when the algorithm cannot find any good clustering and returns some trivial partition, such as putting all nodes together in the same community: a stable but bad result.


In [None]:
a = ['ECG','Louvain','Infomap','Label Prop.']
lt = ['-','--',':','-.']
for i in range(len(a)):
    plt.plot(X.loc[(a[i])].index,X.loc[(a[i])],lt[i],label=a[i],color='black')
plt.xlabel(r'ABCD noise ($\xi$)',fontsize=14)
plt.ylabel('AMI between successive runs',fontsize=14)
plt.legend();
#plt.savefig('abcd_study_stability.eps');

# Modularity, resolution limit and rings of cliques

We illustrate issues with modularity with the famous ring of cliques example.

In the example below, we have a ring of 3-cliques where consecutive cliques are connected by a single (inter-clique) edge.

In [None]:
## n cliques of size s
def ringOfCliques(n,s):
    roc = ig.Graph.Erdos_Renyi(n=n*s,p=0)
    ## cliques
    for i in range(n):
        for j in np.arange(s*i,s*(i+1)):
            for k in np.arange(j+1,s*(i+1)):
                roc.add_edge(j,k)
    ## ring
    for i in range(n):
        if i>0:
            roc.add_edge(s*i-1,s*i)
        else:
            roc.add_edge(n*s-1,0)
    roc.vs['size'] = 8
    roc.vs['color'] = cls[2]
    roc.es['color'] = cls_edges
    return roc

## Ex: 10 3-cliques
roc = ringOfCliques(10,3)
#ig.plot(roc,'ring_3.eps',bbox=(0,0,300,300))     
ig.plot(roc,bbox=(0,0,300,300))        

We compare the number of cliques (the natural parts in a partition) with the actual number of communities found via 3 modularity based algorithms (Louvain, CNM, ECG).

We see that both Louvain and CNM return a smaller number of communities than the number of cliques; this is a known problem with modularity: merging cliques in the same community often leads to higher modularity.

A consensus algorithm like ECG can help a lot in such cases; here we see that the cliques are correctly recovered with ECG.
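
The merging effect can be checked by hand. The sketch below, in plain Python, builds the edge list of a ring of $n$ cliques of size $s$ and compares the modularity of "one community per clique" with "one community per pair of adjacent cliques". The helper names are assumptions for this example. For $s=3$, counting edges and degrees gives $q = 3/4 - 1/n$ for single cliques and $q = 7/8 - 2/n$ for pairs (with $n$ even), so pairing wins as soon as $n > 8$.

```python
def ring_of_cliques_edges(n, s):
    """Edge list for n cliques of size s, joined in a ring by single edges."""
    edges = []
    for i in range(n):
        nodes = range(s * i, s * (i + 1))
        edges += [(j, k) for j in nodes for k in nodes if j < k]
        # the last node of clique i is linked to the first node of the next clique
        edges.append((s * (i + 1) - 1, (s * (i + 1)) % (n * s)))
    return edges

def modularity(edges, membership):
    """q = sum over communities c of e_c/m - (d_c/(2m))**2 (undirected, unweighted)."""
    m = len(edges)
    internal, volume = {}, {}
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        volume[cu] = volume.get(cu, 0) + 1
        volume[cv] = volume.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in volume.items())

n, s = 10, 3
edges = ring_of_cliques_edges(n, s)
one_per_clique = [v // s for v in range(n * s)]     # 10 communities
paired = [v // (2 * s) for v in range(n * s)]       # 5 communities of 2 cliques each
q_single = modularity(edges, one_per_clique)
q_paired = modularity(edges, paired)
print(q_single, q_paired)
```

With 10 cliques of size 3 the values are about 0.65 and 0.675, so an algorithm that maximizes modularity has an incentive to merge adjacent cliques even though the cliques are the natural communities.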


In [None]:
## Compare number of cliques and number of clusters found
L = []
s = 3
for n in np.arange(3,50,3):
    roc = ringOfCliques(n,s)
    ml = np.max(roc.community_multilevel().membership)+1
    ec = np.max(roc.community_ecg().membership)+1
    cnm = np.max(roc.community_fastgreedy().as_clustering().membership)+1
    L.append([n,ml,ec,cnm])
D = pd.DataFrame(L,columns=['n','Louvain','ECG','CNM'])
plt.figure(figsize=(8,6))
plt.plot(D['n'],D['Louvain'],'--o',color='black',label='Louvain')
plt.plot(D['n'],D['ECG'],'-o',color='black',label='ECG')
plt.plot(D['n'],D['CNM'],':o',color='black',label='CNM')

plt.xlabel('number of '+str(s)+'-cliques',fontsize=14)
plt.ylabel('number of clusters found',fontsize=14)
plt.legend(fontsize=14);
#plt.savefig('rings.eps');

Let us look at a specific example: 10 cliques of size 3. Below we plot the communities found with Louvain; we clearly see that pairs of cliques are systematically merged into a single community.

In [None]:
## Louvain communities with 10 3-cliques
roc = ringOfCliques(n=10,s=3)
roc.vs['ml'] = roc.community_multilevel().membership
roc.vs['color'] = [cls[x%3] for x in roc.vs['ml']]
#ig.plot(roc,'ring_3_q.eps', bbox=(0,0,300,300))
ig.plot(roc,bbox=(0,0,300,300))

Why does ECG solve this problem? It is due to the first step, where we run an ensemble of level-1 Louvain and assign new weights to edges based on the proportion of times those edges are internal to a community.
We see below that there are exactly 30 edges with maximal edge weight of 1 (edges within cliques) and 10 edges with default minimal weight of 0.05 (edges between cliques). 

With those new weights, the last clustering in ECG can easily recover the cliques as communities.
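
The re-weighting step described above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code of `community_ecg`: the function name `ecg_style_weights` and the three partitions are hypothetical, and the weight is taken to be `w_min + (1 - w_min) * (fraction of runs in which the edge is internal)`, which matches the extreme values 0.05 and 1 reported in this section. Any further details of the actual implementation are not reproduced here.

```python
def ecg_style_weights(edges, partitions, w_min=0.05):
    """Weight each edge by how often its two endpoints share a community."""
    weights = {}
    for u, v in edges:
        votes = sum(p[u] == p[v] for p in partitions)
        weights[(u, v)] = w_min + (1 - w_min) * votes / len(partitions)
    return weights

# Two triangles joined by one edge; three hypothetical level-1 partitions.
edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (3, 5), (4, 5)]
partitions = [
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],   # one run attaches node 3 to the first triangle
]
W = ecg_style_weights(edges, partitions)
for e, w in W.items():
    print(e, round(w, 3))
```

Edges that are internal in every run get weight 1, edges that are never internal get `w_min`, and the bridge (2, 3) ends up in between because only one run kept it internal.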


In [None]:
## ECG weights in this case: all 30 clique edges have max score
roc.es['W'] = roc.community_ecg().W
Counter(roc.es['W'])

# Ego nets and more

Suppose we want to look at nodes "near" some seed node $v$. One common way to do this is to look at its ego-net, i.e. the subgraph consisting of node $v$ and all other nodes that can be reached from $v$ in $k$ hops or less, where $k$ is small, typically 1 or 2. 

Such subgraphs can become large quickly as we increase $k$. In the cells below, we look at ego-nets and compare with another approach to extract subgraph(s) around $v$ via clustering.
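
To make the definition concrete, here is a minimal breadth-first search sketch in plain Python that returns the nodes of a $k$-hop ego-net. The adjacency dictionary and the function name `ego_net` are assumptions for this example; the cells below use igraph's `neighborhood(v, order=k)` for the same purpose.

```python
from collections import deque

def ego_net(adj, seed, k):
    """Nodes reachable from `seed` in at most k hops (breadth-first search)."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        if dist[u] == k:
            continue            # do not expand beyond k hops
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return set(dist)

# Toy graph (assumption): a hub 0 with three neighbours, each with two leaves.
adj = {
    0: [1, 2, 3],
    1: [0, 4, 5], 2: [0, 6, 7], 3: [0, 8, 9],
    4: [1], 5: [1], 6: [2], 7: [2], 8: [3], 9: [3],
}
sizes = [len(ego_net(adj, 0, k)) for k in (0, 1, 2)]
print(sizes)
```

Even on this tiny graph the ego-net of node 0 grows from 1 to 4 to 10 nodes as $k$ goes from 0 to 2, which is the rapid growth mentioned above.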

We consider the airport graph we already saw several times. We consider a simple, undirected version (no loops, directions or edge weights).

We compare ego-nets (1 and 2-hops subgraphs from a given node) with clusters obtained via graph clustering for some vertex $v$ with degree 11 (you can try other vertices).

In [None]:
## read edges and build simple undirected graph
D = pd.read_csv(datadir+'Airports/connections.csv')
g = ig.Graph.TupleList([tuple(x) for x in D.values], directed=True, edge_attrs=['weight'])
#df = D.head()
g = g.as_undirected()
g = g.simplify()

## read vertex attributes and add to graph
A = pd.read_csv(datadir+'Airports/airports_loc.csv')
lookup = {k:v for v,k in enumerate(A['airport'])}
l = [lookup[x] for x in g.vs()['name']]
g.vs()['layout'] = [(A['lon'][i],A['lat'][i]) for i in l]
g.vs()['state'] = [A['state'][i] for i in l]
g.vs()['city'] = [A['city'][i] for i in l]
## add a few more attributes for visualization
g.vs()['size'] = 6
g.vs()['color'] = cls[0]
g.es()['color'] = cls_edges
df = A.head()

## pick a vertex v
v = 207
print(g.vs[v])
print('degree:',g.degree()[v])
g.vs[v]['color'] = 'black'


In [None]:
## show its ego-net for k=1 (vertex v in black)
sg = g.subgraph([i for i in g.neighborhood(v,order=1)])
print(sg.vcount(),'nodes')
#ig.plot(sg,'airport_ego_1.eps',bbox=(0,0,300,300))
ig.plot(sg,bbox=(0,0,300,300))

In [None]:
## show its 2-hops ego-net ... this is already quite large!
sg = g.subgraph([i for i in g.neighborhood(v,order=2)])
sg.vs()['core'] = sg.coreness()
sg.delete_vertices([v for v in sg.vs if v['core']<2])
print(sg.vcount(),'nodes')
#ig.plot(sg,'airport_ego_2.eps',bbox=(0,0,300,300))
ig.plot(sg,bbox=(0,0,300,300))

In [None]:
## apply clustering and show the cluster containing the selected vertex
## recall that we ignore edge weights
## This result can vary somewhat between runs
ec = g.community_ecg(ens_size=16)
g.es['W'] = ec.W
m = ec.membership[v]
sg = g.subgraph([i for i in range(g.vcount()) if ec.membership[i]==m])
sg.vs()['core'] = sg.coreness()
## display the 2-core
sg.delete_vertices([v for v in sg.vs if v['core']<2])
print(sg.vcount(),'nodes')
#ig.plot(sg,'airport_ecg.eps',bbox=(0,0,300,300))
ig.plot(sg,bbox=(0,0,300,300))

We see above that the cluster containing $v$ is smaller than the 2-hops ego-net, and that several of its nodes are tightly connected.

Below we go further and look at the ECG edge weights, which we can use to prune the graph above, so we can look at the nodes most tightly connected to node $v$.

You can adjust the threshold below to zoom in or out.
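
The pruning idea can be sketched without igraph: keep only the edges whose weight exceeds the threshold, then return the connected component that contains the seed. The function name `focus_component`, the node names and the weights below are hypothetical, and the sketch leaves out the final 2-core step used in the notebook cell.

```python
def focus_component(weighted_edges, seed, thresh):
    """Keep edges with weight > thresh, then return the seed's connected component."""
    adj = {}
    for u, v, w in weighted_edges:
        if w > thresh:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)
    seen, stack = {seed}, [seed]
    while stack:
        u = stack.pop()
        for x in adj.get(u, ()):
            if x not in seen:
                seen.add(x)
                stack.append(x)
    return seen

# Hypothetical edge weights in [0.05, 1], as produced by an ECG-like voting step.
weighted_edges = [
    ("a", "b", 1.0), ("b", "c", 0.9), ("a", "c", 0.95),
    ("c", "d", 0.4), ("d", "e", 1.0), ("e", "f", 0.2),
]
loose = focus_component(weighted_edges, "a", 0.3)
tight = focus_component(weighted_edges, "a", 0.85)
print(sorted(loose), sorted(tight))
```

Raising the threshold from 0.3 to 0.85 removes the weak link between `c` and `d`, so the component around the seed shrinks to the tightly connected triangle.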


In [None]:
## filter edges w.r.t. ECG votes (weights)
thresh = .85

tmp = sg.subgraph_edges([e for e in sg.es if e['W'] > thresh])
n = [i for i in range(tmp.vcount()) if tmp.vs[i]['color']=='black'][0]
tmp.vs['cl'] = tmp.connected_components().membership
cl = tmp.vs[n]['cl']
ssg = tmp.subgraph([i for i in tmp.vs if i['cl']==cl])
ssg.vs()['core'] = ssg.coreness()
ssg.delete_vertices([v for v in ssg.vs if v['core']<2])
print(ssg.vcount(),'nodes')
#ig.plot(ssg,'airport_ecg_focus.eps',bbox=(0,0,300,300))
ig.plot(ssg,bbox=(0,0,300,300))

Most nodes in this subgraph are from the same state as node $v$ (MI) or from a nearby state (WI).

In [None]:
## states in the above subgraph
Counter(ssg.vs['state'])

# EXTRA CODE

The code below requires that Julia and ABCD are installed.

This is extra material not in the book.

# ABCD Properties

The cells below are for illustration purposes only, to show some ABCD graphs with different $\xi$ (noise) parameters,
and to show how you can run ABCD with Julia installed. 

* notice the density of edges between communities as $\xi$ increases.
* most runs should yield 3 communities

Natural layouts for noisy graphs make it hard to distinguish communities, as the nodes will overlap a lot.
We use an ad-hoc method to "push away" nodes from the 3 different clusters to allow for better visualization.


In [None]:
## just for visualization -- push the layout apart given 3 communities
## adjust the 'push' factor with d
def push_layout(d=0):
    if np.max(g.vs['comm'])>2:
        return -1
    ly = g.layout()
    g.vs['ly'] = ly
    x = [0,0,0]
    y = [0,0,0]
    for v in g.vs:
        c = v['comm']
        x[c] += v['ly'][0]
        y[c] += v['ly'][1]
    delta = [-d,0,d]
    dx = [delta[i] for i in np.argsort(x)]
    dy = [delta[i] for i in np.argsort(y)]
    for v in g.vs:
        c = v['comm']
        v['ly'][0] += dx[c]
        v['ly'][1] += dy[c]
    return g.vs['ly']

In [None]:
## ABCD with very strong communities (xi = 0.05)
## results will vary, but we see 3 communities in most runs.
xi = 0.05
mc = 0
while mc != 3: ## run until we get 3 communities
    ## generate degree and community size values
    cmd = julia+' '+abcd_path+'deg_sampler.jl deg.dat 2.5 5 15 100 1000'
    os.system(cmd+' >/dev/null 2>&1')
    cmd = julia+' '+abcd_path+'com_sampler.jl cs.dat 1.5 30 50 100 1000'
    os.system(cmd+' >/dev/null 2>&1');
    cmd = julia+' '+abcd_path+'graph_sampler.jl net.dat comm.dat deg.dat cs.dat xi '\
            +str(xi)+' false false'
    os.system(cmd+' >/dev/null 2>&1')
    g = ig.Graph.Read_Ncol('net.dat',directed=False)
    c = np.loadtxt('comm.dat',dtype='uint16',usecols=(1))
    mc = max(c)

## plot
g.vs['comm'] = [c[int(x['name'])-1]-1 for x in g.vs]
g.vs['color'] = [cls[i] for i in g.vs['comm']]
g.vs['size'] = 5
g.es['color'] = 'lightgrey'
ly = push_layout(d=0) ## d=0, no need to push, communities are clear
ig.plot(g, layout=ly, bbox=(0,0,300,300))


In [None]:
## viz: ABCD with strong communities (xi = 0.15)
xi = 0.15
mc = 0
while mc != 3: ## run until we get 3 communities
    ## generate degree and community size values
    cmd = julia+' '+abcd_path+'deg_sampler.jl deg.dat 2.5 5 15 100 1000'
    os.system(cmd+' >/dev/null 2>&1')
    cmd = julia+' '+abcd_path+'com_sampler.jl cs.dat 1.5 30 50 100 1000'
    os.system(cmd+' >/dev/null 2>&1');
    cmd = julia+' '+abcd_path+'graph_sampler.jl net.dat comm.dat deg.dat cs.dat xi '\
            +str(xi)+' false false'
    os.system(cmd+' >/dev/null 2>&1')
    ## read the generated graph and its ground-truth communities
    g = ig.Graph.Read_Ncol('net.dat',directed=False)
    c = np.loadtxt('comm.dat',dtype='uint16',usecols=(1))
    mc = max(c)

## plot
g.vs['comm'] = [c[int(x['name'])-1]-1 for x in g.vs]
g.vs['color'] = [cls[i] for i in g.vs['comm']]
g.vs['size'] = 5
g.es['color'] = 'lightgrey'
ly = push_layout(d=1) ## slightly push clusters apart for viz
ig.plot(g, layout=ly, bbox=(0,0,300,300))


In [None]:
## viz: ABCD with weak communities
## lots of edges between communities as expected
xi = 0.33
mc = 0
while mc != 3: ## run until we get 3 communities
    ## generate degree and community size values
    cmd = julia+' '+abcd_path+'deg_sampler.jl deg.dat 2.5 5 15 100 1000'
    os.system(cmd+' >/dev/null 2>&1')
    cmd = julia+' '+abcd_path+'com_sampler.jl cs.dat 1.5 30 50 100 1000'
    os.system(cmd+' >/dev/null 2>&1');
    cmd = julia+' '+abcd_path+'graph_sampler.jl net.dat comm.dat deg.dat cs.dat xi '\
            +str(xi)+' false false'
    os.system(cmd+' >/dev/null 2>&1')
    ## read the generated graph and its ground-truth communities
    g = ig.Graph.Read_Ncol('net.dat',directed=False)
    c = np.loadtxt('comm.dat',dtype='uint16',usecols=(1))
    mc = max(c)
    
## plot    
g.vs['comm'] = [c[int(x['name'])-1]-1 for x in g.vs]
g.vs['color'] = [cls[i] for i in g.vs['comm']]
g.vs['size'] = 5
g.es['color'] = 'lightgrey'
ly = push_layout(d=3) ## need to push more -- with d=0, communities can't be seen clearly
ig.plot(g, layout=ly, bbox=(0,0,300,300))

In [None]:
## viz: ABCD with very weak communities
xi = 0.5
mc = 0
while mc != 3: ## run until we get 3 communities
    ## generate degree and community size values
    cmd = julia+' '+abcd_path+'deg_sampler.jl deg.dat 2.5 5 15 100 1000'
    os.system(cmd+' >/dev/null 2>&1')
    cmd = julia+' '+abcd_path+'com_sampler.jl cs.dat 1.5 30 50 100 1000'
    os.system(cmd+' >/dev/null 2>&1');
    cmd = julia+' '+abcd_path+'graph_sampler.jl net.dat comm.dat deg.dat cs.dat xi '\
            +str(xi)+' false false'
    os.system(cmd+' >/dev/null 2>&1')
    ## read the generated graph and its ground-truth communities
    g = ig.Graph.Read_Ncol('net.dat',directed=False)
    c = np.loadtxt('comm.dat',dtype='uint16',usecols=(1))
    mc = max(c)

## plot    
g.vs['comm'] = [c[int(x['name'])-1]-1 for x in g.vs]
g.vs['color'] = [cls[i] for i in g.vs['comm']]
g.vs['size'] = 5
g.es['color'] = 'lightgrey'
ly = push_layout(5) ## need to push more -- with d=0, communities can't be seen clearly
ig.plot(g, layout=ly, bbox=(0,0,300,300))

## Measures to compare partitions

* We illustrate the importance of using proper adjusted measures when comparing partitions; this is why we use AMI (adjusted mutual information) or ARI (adjusted Rand index) in our experiments
* We generate some ABCD graph and compare ground truth with **random** partitions of different sizes
* Scores for random partitions should be close to 0 regardless of the number of parts
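
The last point can be illustrated in plain Python with pair counting. The sketch below computes the Rand index and its adjusted version from the contingency counts and applies them to a random labeling. The data and the function names are assumptions for this example, the adjusted index uses the standard "(index - expected) / (max - expected)" correction, and degenerate cases (for instance both partitions having a single part) are not handled.

```python
import random
from collections import Counter

def comb2(x):
    """Number of unordered pairs among x items."""
    return x * (x - 1) // 2

def pair_counts(u, v):
    """Pairs grouped together in both partitions, in u, in v, and in total."""
    both = sum(comb2(c) for c in Counter(zip(u, v)).values())
    in_u = sum(comb2(c) for c in Counter(u).values())
    in_v = sum(comb2(c) for c in Counter(v).values())
    return both, in_u, in_v, comb2(len(u))

def rand_index(u, v):
    both, in_u, in_v, total = pair_counts(u, v)
    return (total + 2 * both - in_u - in_v) / total

def adjusted_rand_index(u, v):
    both, in_u, in_v, total = pair_counts(u, v)
    expected = in_u * in_v / total      # expected agreement for random labelings
    max_index = (in_u + in_v) / 2
    return (both - expected) / (max_index - expected)

random.seed(0)
n = 1000
truth = [i % 10 for i in range(n)]                  # 10 equal-sized parts
guess = [random.randrange(10) for _ in range(n)]    # unrelated random labels
ri, ari = rand_index(truth, guess), adjusted_rand_index(truth, guess)
print(round(ri, 3), round(ari, 3))
```

The unadjusted index is high for a completely unrelated labeling, simply because most pairs are separated in both partitions, while the adjusted index stays near 0.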

In [None]:
## RAND Index: given two clusterings u and v
def RI(u,v):
    ## build the sets of nodes in each part of u and v
    a = np.max(u)+1
    b = np.max(v)+1
    n = len(u)
    if n != len(v):
        raise ValueError('u and v must have the same length')
    A = [set() for i in range(a)]
    B = [set() for i in range(b)]
    for i in range(n):
        A[u[i]].add(i)
        B[v[i]].add(i)   
    ## RAND index step by step
    R = 0
    for i in range(a):
        for j in range(b):
            s = len(A[i].intersection(B[j]))
            if s>1:
                R += s*(s-1)/2
    R *= 2
    for i in range(a):
        s = len(A[i])
        if s>1:
            R -= s*(s-1)/2
    for i in range(b):
        s = len(B[i])
        if s>1:
            R -= s*(s-1)/2
    R += n*(n-1)/2
    R /= n*(n-1)/2
    return R
    

In [None]:
## generate new degree and community size values
cmd = julia+' '+abcd_path+'deg_sampler.jl deg.dat 2.5 5 50 1000 1000'
os.system(cmd+' >/dev/null 2>&1')
cmd = julia+' '+abcd_path+'com_sampler.jl cs.dat 1.5 75 150 1000 1000'
os.system(cmd+' >/dev/null 2>&1')
xi = .1
cmd = julia+' '+abcd_path+'graph_sampler.jl net.dat comm.dat deg.dat cs.dat xi '\
        +str(xi)+' false false'
os.system(cmd+' >/dev/null 2>&1')
g = ig.Graph.Read_Ncol('net.dat',directed=False)
c = np.loadtxt('comm.dat',dtype='uint16',usecols=(1))
## ground-truth communities
gt = [c[int(x['name'])-1]-1 for x in g.vs]
print('number of communities:',np.max(gt)+1)

In [None]:
## generate random clusterings and compute various measures w.r.t. ground truth
## this can take a few minutes to run
from sklearn.metrics import mutual_info_score as MI
from sklearn.metrics import adjusted_rand_score as ARI
from sklearn.metrics import normalized_mutual_info_score as NMI
L = []
n = g.vcount()
tc = {idx:part for idx,part in enumerate(gt)}
ar = np.arange(2,21)
for s in ar:
    for i in range(100):
        r = np.random.choice(s, size=n)
        rd = {idx:part for idx,part in enumerate(r)}        
        L.append([s,MI(gt,r),NMI(gt,r),AMI(gt,r),RI(gt,r),ARI(gt,r),g.gam(tc,rd,adjusted=False),g.gam(tc,rd)])
D = pd.DataFrame(L,columns=['size','MI','NMI','AMI','RI','ARI','GRI','AGRI'])
R = D.groupby(by='size').mean()


Below we show results for 3 measures:
* Mutual information (MI) as is has a strong bias w.r.t. the number of clusters
* Normalized MI is better
* AMI is best, no bias w.r.t. number of clusters.
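
The bias of plain MI is easy to reproduce without any library. The sketch below is a minimal illustration under assumptions (made-up ground truth, hypothetical function name `mutual_info`, natural logarithm): it computes the mutual information between a fixed labeling and unrelated random labelings with 2 or 20 parts.

```python
import math
import random
from collections import Counter

def mutual_info(u, v):
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(u)
    cu, cv, cuv = Counter(u), Counter(v), Counter(zip(u, v))
    return sum(
        (nij / n) * math.log(nij * n / (cu[a] * cv[b]))
        for (a, b), nij in cuv.items()
    )

random.seed(1)
n = 1000
truth = [i % 10 for i in range(n)]      # 10 equal-sized parts
mi_by_size = {}
for s in (2, 20):
    # average over 20 random labelings with s parts, all unrelated to `truth`
    runs = [mutual_info(truth, [random.randrange(s) for _ in range(n)])
            for _ in range(20)]
    mi_by_size[s] = sum(runs) / len(runs)
print(mi_by_size)
```

Although none of the random labelings carries any information about the ground truth, the average MI is larger with 20 parts than with 2, which is the bias that the adjusted measure removes.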

In [None]:
## Mutual information (MI), normalized MI (NMI) and adjusted MI (AMI)
plt.plot(ar,R['MI'],':',color='black',label='MI')
plt.plot(ar,R['NMI'],'--',color='black',label='NMI')
plt.plot(ar,R['AMI'],'-',color='black',label='AMI')
plt.xlabel('number of random clusters',fontsize=14)
plt.legend();
#plt.savefig('MI.eps');


Same below for Rand index (RI) and adjusted version. 

GRI (graph RI) and AGRI (adjusted GRI) are variations of RI specifically for graph data.

In [None]:
## RAND index (RI) and adjusted (ARI)
## Also: Graph-aware RAND index (GRI) and adjusted version (AGRI)
## those measures are included in partition-igraph 
## inputs are partitions of type 'igraph.clustering.VertexClustering' or dictionaries of node:community.
plt.plot(ar,R['RI'],':',color='black',label='RI')
plt.plot(ar,R['GRI'],'--',color='black',label='GRI')

plt.plot(ar,R['ARI'],'-',color='black',label='ARI/AGRI')
plt.plot(ar,R['AGRI'],'-',color='black')

plt.xlabel('number of random clusters',fontsize=14)
plt.legend();
#plt.savefig('RI.eps');