In [None]:
## path to the datasets
datadir='../Datasets/'

## required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import igraph as ig
import partition_igraph
from collections import Counter
from sklearn.metrics import adjusted_mutual_info_score as AMI
import random

## 2.4 Community-based anomaly detection

In this short notebook, we illustrate an application of community detection: anomaly detection.

In a nutshell, we use community detection to look for vertices that are not strongly associated with any single community; such vertices are possible "outliers".


### New dataset: American College Football Graph

This is a small graph useful for illustrating anomaly detection methods.

The graph consists of 115 US college football teams (nodes) playing games (edges).

Teams are part of 12 conferences (the 'communities'):
*   0 = Atlantic Coast
*   1 = Big East
*   2 = Big Ten
*   3 = Big Twelve
*   4 = Conference USA
*   5 = Independents
*   6 = Mid-American
*   7 = Mountain West
*   8 = Pacific Ten
*   9 = Southeastern
*  10 = Sun Belt
*  11 = Western Athletic

14 teams out of 115 appear as "anomalous", namely:
- the 5 teams in conference #5 (Independents) play against teams in other conferences
- the 7 teams in conference #10 (Sun Belt) are split into 2 clumps
- 2 teams from conference #11 (Western Athletic) play mainly against teams from conference #10 (Sun Belt)

[REF]: "Community structure in social and biological networks", M. Girvan and M. E. J. Newman
PNAS June 11, 2002 99 (12) 7821-7826; https://doi.org/10.1073/pnas.122653799

First, we build the graph.

In [None]:
cfg = ig.Graph.Read_Ncol(datadir+'Football/football.edgelist', directed=False)

Next, we load the "ground-truth communities" (i.e., the conferences).

In [None]:
c = np.loadtxt(datadir+'Football/football.community',dtype='uint16', usecols=(0))
cfg.vs['community'] = [c[int(x['name'])] for x in cfg.vs]

Let's look at the College Football Graph and show the conferences in different colors, with anomalies shown as squares.

In [None]:
cfg.vs['shape'] = 'circle'
cfg.vs['anomaly'] = False
pal = ig.ClusterColoringPalette(n=max(cfg.vs['community'])+1) 
cfg.vs['color'] = [pal.get(int(i)) for i in cfg.vs['community']]
for v in cfg.vs:
    if v['community'] in [5,10] or v['name'] in ['28','58']:
        v['shape']='square'
        v['anomaly']=True
ly = cfg.layout_fruchterman_reingold()
ig.plot(cfg, layout=ly, bbox=(0,0,500,300), vertex_size=6, edge_color='lightgray')

### Graph clustering

In practice, we usually don't have access to the ground truth, and we must rely on the communities found by some algorithm(s).

We use ECG (Ensemble Clustering for Graphs), which is a good choice with unweighted graphs. We do a few runs and keep the result with the highest modularity.

In [None]:
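## keep the ECG run with the highest modularity; `ec` ends up holding that run
q = -1  ## modularity is always above -1, so the first run is always kept
for rep in range(10):
    run = cfg.community_ecg(ens_size=32)
    m = cfg.modularity(run.membership)
    if m > q:
        q, ec = m, run
cfg.vs['ecg'] = ec.membership
print('number of communities found:',np.max(cfg.vs['ecg'])+1)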
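
To make the selection criterion concrete, here is a minimal pure-Python sketch of modularity for an unweighted, undirected graph, using the usual Newman-Girvan definition (assumed to match what `cfg.modularity` computes with default settings). The toy graph, the candidate partitions and the helper name `modularity` are illustrative assumptions, not part of the notebook's data.

```python
from collections import defaultdict

def modularity(edges, membership):
    """Newman-Girvan modularity of a partition of an undirected, unweighted graph."""
    m = len(edges)
    internal = defaultdict(int)    # number of edges with both endpoints in community c
    degree_sum = defaultdict(int)  # sum of the degrees of the nodes in community c
    for u, v in edges:
        degree_sum[membership[u]] += 1
        degree_sum[membership[v]] += 1
        if membership[u] == membership[v]:
            internal[membership[u]] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2 for c in degree_sum)

# hypothetical toy graph: two triangles joined by a single edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
candidates = {
    'two triangles': [0, 0, 0, 1, 1, 1],
    'one big cluster': [0, 0, 0, 0, 0, 0],
    'uneven split': [0, 0, 1, 1, 1, 1],
}

# same pattern as the loop above: score every candidate, keep the best one
scores = {name: modularity(edges, part) for name, part in candidates.items()}
best = max(scores, key=scores.get)
print(best, round(scores[best], 4))
```

The partition into the two triangles scores highest (5/14), while putting every node in one cluster scores 0.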

The adjusted mutual information (AMI) measures the agreement between the clusters found and the ground truth, corrected for chance. Values close to 1 indicate a very good match.


In [None]:
AMI(cfg.vs['community'], cfg.vs['ecg'])

## Community-based anomaly detection

We explore two ways to find anomalous nodes, based on the hypothesis that "regular" nodes are part of one or a small number of communities, while anomalous ones have a more heterogeneous edge distribution.

We use two simple methods:

* the **participation coefficient**, a measure of the dispersion of communities amongst a node's neighbours. A **high** value is indicative of an **outlier**.
* the **ECG edge weights**: the ECG clustering method assigns "weights" to edges, indicative of how strongly they are "within" a community. Nodes strongly in a community are expected to have **high** "ecg weights" on their incident edges, while outliers are expected to have **lower** weights.
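
For reference, the participation coefficient as computed in this notebook can be written as follows. The notation is ours: $k_{v,c}$ is the number of neighbours of node $v$ that belong to community $c$, and $k_v = \sum_c k_{v,c}$ is the degree of $v$.

$$
P_v = 1 - \sum_{c} \left( \frac{k_{v,c}}{k_v} \right)^2
$$

$P_v = 0$ when all neighbours of $v$ are in a single community, and $P_v = 1 - 1/n$ when they are spread evenly over $n$ communities.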

In [None]:
def partCoef(l):
    """Compute the participation coefficient of a list of counts.
    This measures heterogeneity: 0 if a single entry holds everything, close to 1 if the total is spread evenly over many entries."""
    s = sum(l)
    pc = 1-sum([i**2/s**2 for i in l]) 
    return pc

For each node, list the clusters of its neighbours and compute the participation coefficient. 

Next we look at the boxplot of "outliers" vs "regular" nodes.

In [None]:
for v in cfg.vs:
    l = list(Counter([cfg.vs[x]['ecg'] for x in cfg.neighbors(v)]).values()) ## neighbour's communities    
    v['pc'] = partCoef(l)
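
The same computation can be traced by hand on a small example. The sketch below uses only the standard library; the toy graph (two 4-cliques plus a node `x` attached to both), the cluster labels and the helper name `participation` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def participation(neighbour_labels):
    """Participation coefficient computed from the cluster labels of a node's neighbours."""
    counts = Counter(neighbour_labels).values()
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# hypothetical toy graph: two 4-cliques, plus node 'x' linked to two nodes in each
edges = [('a', 'b'), ('a', 'c'), ('a', 'd'), ('b', 'c'), ('b', 'd'), ('c', 'd'),
         ('e', 'f'), ('e', 'g'), ('e', 'h'), ('f', 'g'), ('f', 'h'), ('g', 'h'),
         ('x', 'a'), ('x', 'b'), ('x', 'e'), ('x', 'f')]
labels = {'a': 0, 'b': 0, 'c': 0, 'd': 0, 'e': 1, 'f': 1, 'g': 1, 'h': 1, 'x': 0}

# build adjacency lists from the undirected edge list
neighbours = defaultdict(list)
for u, v in edges:
    neighbours[u].append(v)
    neighbours[v].append(u)

pc = {v: participation([labels[n] for n in neighbours[v]]) for v in labels}
most_dispersed = max(pc, key=pc.get)
print(most_dispersed, pc[most_dispersed])
```

Node `x` has two neighbours in each cluster, so its coefficient is 0.5, the largest in this toy graph; nodes whose neighbours all share one label get 0.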

In [None]:
plt.figure(figsize=(6,4))
plt.rcParams['font.size'] = '14'
X = [v['pc'] for v in cfg.vs if not v['anomaly']]
Y = [v['pc'] for v in cfg.vs if v['anomaly']]
plt.boxplot([X,Y],labels=['Regular','Outlier'],sym='.',whis=(0,100), widths=.5)
plt.ylabel('Participation coefficient',fontsize=14);

ECG is an ensemble method: each edge is assigned a "weight" proportional to the number of times the edge was deemed to have both its vertices in the same community.

Below, we collect ECG edge weights and compare the distribution for edges touching an outlier vertex or not.
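
As a rough illustration of where such weights come from, the sketch below computes, for each edge, the fraction of ensemble partitions in which both endpoints share a cluster. This is a simplified model under our own assumptions: the partitions are made up, the helper name `co_clustering_weights` is hypothetical, and the actual `community_ecg` implementation may rescale or otherwise adjust the weights.

```python
def co_clustering_weights(edges, partitions):
    """Fraction of partitions in which both endpoints of each edge share a cluster."""
    return {
        (u, v): sum(p[u] == p[v] for p in partitions) / len(partitions)
        for u, v in edges
    }

# hypothetical path graph 0-1-2-3 and four made-up ensemble partitions
edges = [(0, 1), (1, 2), (2, 3)]
partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]

weights = co_clustering_weights(edges, partitions)
print(weights)
```

The middle edge (1, 2) is co-clustered in only two of the four partitions, so it gets the lowest weight (0.5), while the two outer edges get 0.75.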


In [None]:
cfg.es['ecg_weight'] = ec.W

## label edges touching an outlier node
cfg.es['anomaly'] = False
for v in cfg.vs:
    if v['anomaly']:
        for e in cfg.incident(v):
            cfg.es[e]['anomaly'] = True 

x = [e['ecg_weight'] for e in cfg.es if not e['anomaly']]
y = [e['ecg_weight'] for e in cfg.es if e['anomaly']]

plt.figure(figsize=(6,4))
plt.hist([x,y],label=['regular','outlier'])
plt.legend();

Finally, we compute the average ECG incident edge weights for each node and consider the boxplot of "outliers" vs "regular" nodes.

As with the participation coefficient, we observe good separation between the two classes.


In [None]:
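## store the mean weight under its own attribute so the 'ecg' cluster labels are not overwritten
for v in cfg.vs:
    v['ecg_mean_weight'] = np.mean([cfg.es[e]['ecg_weight'] for e in cfg.incident(v)])

X = [v['ecg_mean_weight'] for v in cfg.vs if not v['anomaly']]
Y = [v['ecg_mean_weight'] for v in cfg.vs if v['anomaly']]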

plt.figure(figsize=(6,4))
plt.boxplot([X,Y],labels=['Regular','Outlier'],sym='.',whis=(0,100), widths=.5)
plt.ylabel('ECG weights',fontsize=14);
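
One possible way to use the per-node score is to rank nodes from lowest to highest mean incident weight and inspect the top of the list. The sketch below does this on a toy weighted edge list; the weights, the node names and the ranking step itself are illustrative assumptions, since the notebook only compares the two distributions.

```python
from collections import defaultdict

# hypothetical edge weights: a tight triangle a-b-c, and a weakly attached chain c-x-d
edge_weight = {
    ('a', 'b'): 0.9,
    ('b', 'c'): 0.8,
    ('a', 'c'): 1.0,
    ('c', 'x'): 0.2,
    ('x', 'd'): 0.3,
}

# collect the weights of the edges incident to each node
incident = defaultdict(list)
for (u, v), w in edge_weight.items():
    incident[u].append(w)
    incident[v].append(w)

mean_weight = {v: sum(ws) / len(ws) for v, ws in incident.items()}

# lowest mean weight first: these are the candidate outliers
ranking = sorted(mean_weight, key=mean_weight.get)
print(ranking)
```

Here `x` and `d` come first in the ranking, since all their incident edges have low weights.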