# Metrics for hierarchical graph clustering

This notebook presents experiments on two metrics for assessing the quality of a hierarchical graph clustering: the relative entropy and Dasgupta's cost.

Let ${\cal T}$ be a binary tree representing the hierarchical structure of a graph.

The **relative entropy** is defined by:
$$
 \sum_{A,B: (A,B) \in {\cal I}}p(A,B) \log \frac{p(A,B)}{\pi(A) \pi(B)},
$$
where:
* ${\cal I}$ is the set of internal nodes of the tree ${\cal T}$ 
* $A,B$ are the two sets of nodes merged at each element of ${\cal I}$, i.e. the leaves under its two children
* $p(A,B)$ is the sampling probability of the node sets $A,B$
* $\pi(A)$ is the sampling probability of the node set $A$

This is the Kullback-Leibler divergence between the probability distribution on node sets induced by the tree ${\cal T}$ and that induced by independent node sampling from the distribution $\pi$. In the experiments below, a higher relative entropy is treated as better.

**Dasgupta's cost**, for which lower is treated as better in the experiments below, is defined by:
$$
\sum_{A,B: (A,B) \in {\cal I}}p(A,B) (\pi(A)  + \pi(B)).
$$
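
The two definitions above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code of `hierarchy_metrics`: the helper name `tree_metrics` is hypothetical, $p(A,B)$ is assumed to be the fraction of the total edge weight running between $A$ and $B$, $\pi(A)$ is assumed to be the fraction of the total degree carried by $A$ (the weighted case), and the dendrogram is assumed to be a list of rows whose first two entries are the merged clusters, with cluster `n + t` created by row `t`. The actual library may use a different normalization.

```python
import numpy as np

def tree_metrics(adjacency, dendrogram):
    """Return (relative entropy, Dasgupta's cost) under the assumptions stated above."""
    adjacency = np.asarray(adjacency, dtype=float)
    n = adjacency.shape[0]
    total_weight = adjacency.sum() / 2             # each edge counted once
    pi = adjacency.sum(axis=1) / adjacency.sum()   # assumed degree-proportional sampling
    clusters = {u: [u] for u in range(n)}
    entropy, cost = 0.0, 0.0
    for t, row in enumerate(dendrogram):
        a = clusters.pop(int(row[0]))
        b = clusters.pop(int(row[1]))
        # Assumed p(A,B): share of the edge weight between the two merged sets
        p_ab = adjacency[np.ix_(a, b)].sum() / total_weight
        pi_a, pi_b = pi[a].sum(), pi[b].sum()
        if p_ab > 0:                               # a zero term contributes nothing
            entropy += p_ab * np.log(p_ab / (pi_a * pi_b))
        cost += p_ab * (pi_a + pi_b)
        clusters[n + t] = a + b
    return entropy, cost

# Path graph 0 - 1 - 2 - 3
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
# A tree that follows the path structure, and one that merges the two end nodes first
good_tree = [[0, 1, 2, 2], [2, 3, 2, 2], [4, 5, 4, 4]]
bad_tree = [[0, 3, 2, 2], [4, 1, 3, 3], [5, 2, 4, 4]]

entropy_good, cost_good = tree_metrics(adjacency, good_tree)
entropy_bad, cost_bad = tree_metrics(adjacency, bad_tree)
print(entropy_good, cost_good)
print(entropy_bad, cost_bad)
```

On this toy graph, the tree that follows the path gets the higher relative entropy and the lower Dasgupta cost.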

## Import

In [1]:
from hierarchy_metrics import *

In [3]:
graph = nx.karate_club_graph()
dendrogram = hierarchical_clustering(graph, algorithm = "newman")

## Real data

In [2]:
import urllib.request

url = "http://perso.telecom-paristech.fr/~bonald/graphs/"

# Openflights
dataset = "openflights.graphml.gz"
# Wikipedia for schools
#dataset = "wikipedia_schools_undirected.graphml.gz"

download = urllib.request.urlretrieve(url + dataset, dataset)

In [5]:
graph = nx.read_graphml(dataset, node_type=int)

In [6]:
# nx.info was removed in NetworkX 3.0; on recent versions use print(graph) instead
print(nx.info(graph))

Name: Openflights
Type: Graph
Number of nodes: 3097
Number of edges: 18193
Average degree:  11.7488


In [None]:
# Number of samples for the random algorithm
number_samples = 100

In [7]:
dendrogram_paris = hierarchical_clustering(graph, "paris")

In [8]:
dendrogram_newman = hierarchical_clustering(graph, "newman")

In [None]:
dendrogram_random = [hierarchical_clustering(graph, "random") for s in range(number_samples)]

In [9]:
print('Relative entropy (weighted, uniform)')
print('Paris hierarchy: ', relative_entropy(graph, dendrogram_paris), relative_entropy(graph, dendrogram_paris, False))
print('Newman hierarchy: ', relative_entropy(graph, dendrogram_newman), relative_entropy(graph, dendrogram_newman, False))
#print('Random hierarchy: ', np.mean([relative_entropy(graph, d) for d in dendrogram_random]),np.mean([relative_entropy(graph, d, False) for d in dendrogram_random]))

Relative entropy (weighted, uniform)
Paris hierarchy:  2.7735716948816957 2.9113663048934426
Newman hierarchy:  2.02722899031336 3.5131042696737222


In [10]:
print('Dasgupta cost (weighted, uniform)')
print('Paris hierarchy: ', dasgupta_cost(graph, dendrogram_paris), dasgupta_cost(graph, dendrogram_paris, False))
print('Newman hierarchy: ', dasgupta_cost(graph, dendrogram_newman), dasgupta_cost(graph, dendrogram_newman, False))
#print('Random hierarchy: ', np.mean([dasgupta_cost(graph, d) for d in dendrogram_random]), np.mean([dasgupta_cost(graph, d, False) for d in dendrogram_random])) 

Dasgupta cost (weighted, uniform)
Paris hierarchy:  0.16716504372119464 0.12967721944804603
Newman hierarchy:  0.24642084053212826 0.1383132701120162


## Synthetic data

In [None]:
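def random_dendrogram(number_nodes = 100):
    """Return a random binary dendrogram as an array with one row [u, v, size, size] per merge."""
    nodes = list(range(number_nodes))
    dendrogram = []
    t = 0
    size = {u: 1 for u in nodes}
    while len(nodes) > 1:
        # Merge two distinct clusters chosen uniformly at random
        u = nodes.pop(np.random.randint(len(nodes)))
        v = nodes.pop(np.random.randint(len(nodes)))
        new_node = number_nodes + t
        t += 1
        size[new_node] = size.pop(u) + size.pop(v)
        # The size of the merged cluster fills both the third and the fourth column
        dendrogram.append([u,v,size[new_node],size[new_node]])
        nodes.append(new_node)
    return np.array(dendrogram, float)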

In [None]:
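def get_similarity(dendrogram):
    """Return pairwise similarities: the inverse of the third column of the merge joining each pair."""
    n = np.shape(dendrogram)[0] + 1
    sim = np.zeros((n,n),float)
    # Original nodes contained in each current cluster
    cluster = {u:[u] for u in range(n)}
    for t in range(n - 1):
        u = int(dendrogram[t][0])
        v = int(dendrogram[t][1])
        # Only sim[i][j] is set, not sim[j][i]: each pair gets a single nonzero entry
        for i in cluster[u]:
            for j in cluster[v]:
                sim[i][j] = 1 / dendrogram[t][2]
        cluster[n + t] = cluster.pop(u) + cluster.pop(v)
    return sim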

In [None]:
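def generate_graph(dendrogram, average_degree = 10):
    """Sample a connected graph whose edge probabilities follow the dendrogram similarity."""
    n = np.shape(dendrogram)[0] + 1
    similarity = get_similarity(dendrogram)
    is_connected = False
    # Resample until the graph is connected
    while not is_connected:
        # Scale similarities so that the probabilities sum to n * average_degree / 2
        adjacency = np.random.rand(n,n) < similarity / np.sum(similarity) * n * average_degree / 2
        # Symmetrize, since each pair has a single nonzero similarity entry
        adjacency = np.array(adjacency + adjacency.T,int)
        # from_numpy_matrix was removed in NetworkX 3.0; from_numpy_array replaces it
        graph = nx.from_numpy_array(adjacency)
        is_connected = nx.is_connected(graph)
    return graph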
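
The scaling used in `generate_graph` can be checked by hand on a small case. The sketch below is an illustration only: the 4-node similarity matrix is written out manually (it corresponds to merging nodes 0 and 1, then 2 and 3, then the two pairs, with similarity equal to the inverse of the merged cluster size), and `average_degree = 1` is chosen so that no probability exceeds 1. Under these assumptions the edge probabilities sum to `n * average_degree / 2`, so the expected average degree equals the target.

```python
import numpy as np

n, average_degree = 4, 1

# Hand-written similarities, one entry per node pair (assumed toy dendrogram)
similarity = np.zeros((n, n))
similarity[0, 1] = similarity[2, 3] = 1 / 2
similarity[0, 2] = similarity[0, 3] = similarity[1, 2] = similarity[1, 3] = 1 / 4

# Same scaling as in generate_graph
prob = similarity / similarity.sum() * n * average_degree / 2

expected_edges = prob.sum()
expected_average_degree = 2 * expected_edges / n
print(expected_edges, expected_average_degree)
```

If some scaled probability exceeds 1, the comparison with a uniform random number saturates and the expected degree falls below the target.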

In [None]:
def add_noise(graph, prob = 0.1):
    """Rewire a fraction prob of the edges at random, resampling until the graph is connected."""
    is_connected = False
    while not is_connected:
        new_graph = graph.copy()
        edges = list(graph.edges())
        # Choose the edges to rewire, without replacement
        indices = np.random.choice(len(edges), replace = False, size = int(np.floor(prob * len(edges))))
        for i in indices:
            u,v = edges[i]
            new_graph.remove_edge(u,v)
            # Add an edge between two distinct nodes chosen uniformly at random
            new_edge = np.random.choice(list(new_graph.nodes()), replace = False, size = 2)
            new_graph.add_edge(new_edge[0],new_edge[1],weight = 1.)
        is_connected = nx.is_connected(new_graph)
    return new_graph

In [None]:
def classification_scores(number_nodes, average_degree, prob_range, number_samples, algorithm, weighted = True):
    """For each noise level, return the fraction of trials where each metric prefers a graph's own dendrogram.

    Column 0 is Dasgupta's cost (lower is better), column 1 is the relative entropy (higher is better).
    """
    results = []
    for prob in prob_range:
        cost = 0.
        quality = 0.
        for s in range(number_samples):
            # Two noisy versions of the same random hierarchical graph
            dendrogram = random_dendrogram(number_nodes)
            graph = generate_graph(dendrogram, average_degree)
            graph1 = add_noise(graph,prob)
            graph2 = add_noise(graph,prob)
            dendrogram1 = hierarchical_clustering(graph1, algorithm)
            dendrogram2 = hierarchical_clustering(graph2, algorithm)
            # Count a success when a graph's own dendrogram beats the other graph's dendrogram
            cost += (dasgupta_cost(graph1, dendrogram1, weighted) < dasgupta_cost(graph1, dendrogram2, weighted))
            cost += (dasgupta_cost(graph2, dendrogram2, weighted) < dasgupta_cost(graph2, dendrogram1, weighted))
            quality += (relative_entropy(graph1, dendrogram1, weighted) > relative_entropy(graph1, dendrogram2, weighted))
            quality += (relative_entropy(graph2, dendrogram2, weighted) > relative_entropy(graph2, dendrogram1, weighted))
        # Two comparisons per sample and per metric
        results.append((cost / 2 / number_samples, quality / 2 / number_samples))
    return np.array(results)

In [None]:
number_nodes = 100
average_degree = 10
prob_range = np.arange(0.01,0.2,0.03)
number_samples = 1000
results_paris = classification_scores(number_nodes, average_degree, prob_range, number_samples, "paris", True)
results_newman = classification_scores(number_nodes, average_degree, prob_range, number_samples, "newman", True)

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.figure()
# Blue: Paris, red: Newman; solid: relative entropy, dashed: Dasgupta's cost
plt.plot(100 * prob_range,100 * results_paris[:,1],label = 'Entropy (Paris)', color = "b")
plt.plot(100 * prob_range,100 * results_paris[:,0],'--',label = 'Dasgupta (Paris)',color = "b")
plt.plot(100 * prob_range,100 * results_newman[:,1],label = 'Entropy (Newman)', color = "r")
plt.plot(100 * prob_range,100 * results_newman[:,0],'--',label = 'Dasgupta (Newman)',color = "r")
plt.xticks(np.arange(0, 21, step=5))
plt.xlabel("Graph distance (%)")
plt.ylabel("Classification score (%)")
plt.legend()
plt.show()