R code for the fast & parallelized calculation of Adjusted Mutual Information between clusterings

Calculate Adjusted Mutual Information between Clusterings

This repository contains code for the fast & parallelized calculation of Adjusted Mutual Information (AMI), Normalized Mutual Information (NMI) and Adjusted Rand Index (ARI) between clusterings in R.

NMI and ARI are widely used and well-established metrics of partition agreement. The Adjusted Mutual Information metric was suggested by Vinh et al., 2009. It is a normalized mutual information metric that corrects for chance: the overlap expected between random partitions shifts with the number and sizes of the clusters, so AMI subtracts an Expected Mutual Information (EMI), computed between random partitions that have the observed cluster size distributions. For more information, see also Wikipedia. The original authors have provided MATLAB code to compute AMI values and more.
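
As a rough guide to the chance correction described above, one commonly cited form of AMI for two partitions U and V of N items is sketched below. The notation is an assumption made for illustration (n_ij is the number of items shared by cluster i of U and cluster j of V, a_i and b_j are the cluster sizes), and the max normalization is only one of several variants in the literature; which variant the code in this repository uses is not stated here.

$$
\begin{aligned}
H(U) &= -\sum_{i} \frac{a_i}{N} \log \frac{a_i}{N}, \\
I(U;V) &= \sum_{i}\sum_{j} \frac{n_{ij}}{N} \log \frac{N\, n_{ij}}{a_i\, b_j}, \\
\mathrm{AMI}(U,V) &= \frac{I(U;V) - \mathbb{E}[I(U;V)]}{\max\{H(U), H(V)\} - \mathbb{E}[I(U;V)]}
\end{aligned}
$$

Here E[I(U;V)] is the EMI: the mutual information averaged over random partitions whose cluster sizes a_i and b_j are held fixed at the observed values. AMI is therefore close to 0 when the agreement is no better than chance and equals 1 for identical partitions.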

The code in this repository provides fast, efficient and parallelizable calculations of AMI, NMI and ARI. It was used in Schmidt et al., 2014 for a specific biological application: to assess the agreement of partitions when clustering microbial metagenomic sequence data into Operational Taxonomic Units (OTUs).

The data provided in this repository is for a set of ~1M sequences, clustered into OTUs according to either hierarchical complete linkage or average linkage clustering. Both partitions are saved in a one-line-per-cluster ("otu mapping") and a one-line-per-sequence ("seq mapping") format; more details are provided in the R script. Importantly, the code is generic and can be used for any type of clustering data; the sequence clustering into OTUs is only one applied example.
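
Because the metrics only need one cluster label per item, any two label vectors of equal length can be compared. The following is a minimal base R sketch of NMI and ARI on toy data. It is not the repository's implementation and is not parallelized; the names `labels_a`, `labels_b`, `entropy`, `mutual_information`, `nmi` and `ari` are hypothetical, and normalizing NMI by the mean of the two entropies is an assumed convention (others exist).

```r
# Toy labels for six items under two hypothetical clusterings
# (stand-ins for, e.g., complete linkage and average linkage partitions).
labels_a <- c(1, 1, 1, 2, 2, 2)
labels_b <- c(1, 1, 2, 2, 3, 3)

# Shannon entropy (in nats) of a vector of counts; empty cells are skipped.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log(p))
}

# Mutual information from a contingency table: I(U;V) = H(U) + H(V) - H(U,V).
mutual_information <- function(tab) {
  entropy(rowSums(tab)) + entropy(colSums(tab)) - entropy(as.vector(tab))
}

# NMI, normalized here by the arithmetic mean of the two entropies
# (an assumed convention, chosen for illustration).
nmi <- function(a, b) {
  tab <- table(a, b)
  h_a <- entropy(rowSums(tab))
  h_b <- entropy(colSums(tab))
  if (h_a + h_b == 0) return(1)  # both partitions consist of a single cluster
  2 * mutual_information(tab) / (h_a + h_b)
}

# Adjusted Rand Index computed from counts of co-clustered pairs.
# Undefined (0/0) when both partitions consist of a single cluster.
ari <- function(a, b) {
  tab <- table(a, b)
  pairs_within <- function(x) sum(choose(x, 2))
  total_pairs <- choose(sum(tab), 2)
  expected <- pairs_within(rowSums(tab)) * pairs_within(colSums(tab)) / total_pairs
  maximum <- (pairs_within(rowSums(tab)) + pairs_within(colSums(tab))) / 2
  (pairs_within(tab) - expected) / (maximum - expected)
}

nmi(labels_a, labels_b)
ari(labels_a, labels_b)
```

Both functions return 1 when a partition is compared with itself. For data of the size described above (~1M sequences), the contingency table should be stored sparsely; the dense `table()` call is only suitable for small examples.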