Generalized Conventional Mutual Information (NMI for Overlapping clusters compatible with standard NMI)
Gecmi evaluates the mutual information of graph covers considering overlaps.
The paper: Comparing network covers using mutual information by Alcides Viamontes Esquivel and Martin Rosvall, 2012.
(c) Alcides Viamontes Esquivel
This is a clone of the slightly outdated gecmi repository on Bitbucket, with compilation fixed under Linux Ubuntu 16.04 x64 and minor I/O extensions to support standard formats and to be easily applicable in the PyCABeM clustering benchmark.
Modified and extended by Artem Lutov artem@exascale.info
The refined, optimized and extended pure C++ version is available in the GenConvNMI repository. It runs two orders of magnitude faster, is more accurate on large networks, and provides a fully automatic cross-platform build that produces a single executable.
The prebuilt binaries for Ubuntu 16.04 x64 can be downloaded from the releases section; the dependencies should be installed additionally, as outlined below.
For the compilation:
Any lower version will also probably work after some tuning.
For the prebuilt executables:
- libtbb2:
$ sudo apt-get install libtbb2
- libboost_program_options v1.58:
$ sudo apt-get install libboost-program-options1.58.0
- libstdc++6:
$ sudo apt-get install libstdc++6
To use the Python module, you will need the development headers of Python, boost::python (included in Boost, which is required anyway) and numpy.
Once you have downloaded and unpacked the source code of Gecmi into a directory, cd into it and look for a file called site_config.py.example, then rename or copy this file so that you get site_config.py in the same directory. Check this file and edit it so that it matches your build environment, the targets you want to compile and where you want to install them.
Finally, do
$ scons
and you will have the build going.
Things to watch out for:
- You can get messages of the kind "error while loading shared libraries" if the dependencies are not correctly installed. In that case, you might want to fiddle with the command locate and the environment variable LD_LIBRARY_PATH, or the equivalents in your operating system of choice.
- I have only tested the build setup in Linux.
When the build finishes, you can install the components using
$ sudo scons install
Alternatively, you can copy the two libraries from build/objects-release/lib to the required directory, together with build/bin/gecmi (to run the C++ executable) or build/python/gecmi.so (to import from Python). Note that libcluster_reader.so and libgecmi_so.so should be located in the same directory as gecmi, or in its 'lib/' subdirectory; otherwise you will have to define LD_LIBRARY_PATH to start gecmi.
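The library placement rule above can be checked before launching the executable. The following is a minimal sketch, not part of gecmi itself: the helper name missing_libs is hypothetical, and the demo uses empty placeholder files in a temporary directory rather than real build outputs.

```python
import os
import tempfile

# Library names as given in this README.
REQUIRED_LIBS = ('libcluster_reader.so', 'libgecmi_so.so')


def missing_libs(exe_dir):
    """Return the required libraries found neither in exe_dir nor in exe_dir/lib.

    Hypothetical helper: it only checks file presence and does not
    consult LD_LIBRARY_PATH or the system loader cache.
    """
    missing = []
    for name in REQUIRED_LIBS:
        candidates = (os.path.join(exe_dir, name),
                      os.path.join(exe_dir, 'lib', name))
        if not any(os.path.isfile(path) for path in candidates):
            missing.append(name)
    return missing


# Demo with empty placeholder files standing in for the real libraries.
with tempfile.TemporaryDirectory() as demo_dir:
    os.mkdir(os.path.join(demo_dir, 'lib'))
    open(os.path.join(demo_dir, 'libcluster_reader.so'), 'w').close()
    before = missing_libs(demo_dir)
    open(os.path.join(demo_dir, 'lib', 'libgecmi_so.so'), 'w').close()
    after = missing_libs(demo_dir)

print(before)
print(after)
```

If the returned list is not empty, either copy the listed files next to gecmi or set LD_LIBRARY_PATH as described above.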
If everything worked, you should be able to run the gecmi executable, or import it from Python with import gecmi, which uses the gecmi.so module.
The standalone program uses files in CNL format:
# The comments start with '#' like this line
# Each non-commented line is a module (cluster, community) consisting of the member nodes separated by space / tab
1
1 2
2
where each line corresponds to the network nodes forming the cluster (community, module).
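To illustrate the CNL layout, here is a minimal Python sketch of a reader. It is not the parser used by gecmi: the function names parse_cnl and node_memberships are hypothetical, and the sketch assumes that every line starting with '#' is a plain comment.

```python
from collections import defaultdict


def parse_cnl(text):
    """Parse CNL text into a list of clusters, each a list of node ids (strings)."""
    clusters = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines (assumed to start with '#').
        if not line or line.startswith('#'):
            continue
        # str.split() with no arguments handles both spaces and tabs.
        clusters.append(line.split())
    return clusters


def node_memberships(clusters):
    """Map each node id to the indices of the clusters that contain it."""
    members = defaultdict(list)
    for index, nodes in enumerate(clusters):
        for node in nodes:
            members[node].append(index)
    return dict(members)


sample = """# The comments start with '#' like this line
1
1 2
2
"""

clusters = parse_cnl(sample)
print(clusters)                    # three clusters
print(node_memberships(clusters))  # nodes '1' and '2' each belong to two clusters
```

Nodes that appear on more than one line are the overlaps that the measure takes into account.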
To run the executable, its local dependencies should be located in the same directory or in the 'lib/' subdirectory.
The original gecmi format is also supported:
vertex: modules
0: 1
1: 1 2
2: 2
This describes a network cover with the vertices 0, 1 and 2. The vertex numbers appear before
the colon. The vertex numbers should be consecutive and start from zero, but they do not need
to appear in the file in that order. The modules of
each vertex appear after the colon, separated by spaces, and there is a line for each
vertex and its memberships. If you prefer, it is
also possible to have the file in the opposite format:
module: vertices
1: 0 1
2: 2
The format is automatically identified by the file header (or its absence).
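The relation between the header-based formats and CNL can be sketched in Python as follows. This is an illustration only, not the detection logic of gecmi: the names parse_gecmi and to_cnl are hypothetical, and the sketch assumes that the header line reads exactly "vertex: modules" or "module: vertices" as in the examples above.

```python
def parse_gecmi(text):
    """Parse either header-based format into {module id: sorted vertex ids}."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header = lines[0].replace(' ', '')
    if header == 'vertex:modules':
        vertex_first = True
    elif header == 'module:vertices':
        vertex_first = False
    else:
        raise ValueError('unrecognized header: %r' % lines[0])

    modules = {}
    for ln in lines[1:]:
        # The number before the colon is the key, the rest are its items.
        key, _, rest = ln.partition(':')
        key = int(key)
        for item in rest.split():
            item = int(item)
            module, vertex = (item, key) if vertex_first else (key, item)
            modules.setdefault(module, set()).add(vertex)
    return {m: sorted(vs) for m, vs in modules.items()}


def to_cnl(modules):
    """Serialize {module id: vertex ids} as CNL, one module per line."""
    rows = (' '.join(str(v) for v in modules[m]) for m in sorted(modules))
    return '\n'.join(rows) + '\n'


by_vertex = """vertex: modules
0: 1
1: 1 2
2: 2
"""

cover = parse_gecmi(by_vertex)
print(cover)
print(to_cnl(cover), end='')
```

Both header-based layouts are reduced to the same module-to-vertices mapping, so a single writer is enough to produce CNL from either of them.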
To get the normalized mutual information of the covers in the two files, issue the command:
$ gecmi file1 file2
If you want to tweak the precision, use the options -e and -r, to set the error and the risk respectively. See the paper for the meaning of these concepts.
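When the executable is called from a script, the command line can be assembled as in the sketch below. The helper names build_gecmi_command and run_gecmi are hypothetical. The sketch assumes that -e and -r each take their value as the next argument and may precede the file names, and the numeric values shown are purely illustrative, not defaults of the tool. The output is returned as raw text because its exact layout is not described here.

```python
import subprocess


def build_gecmi_command(file1, file2, error=None, risk=None, executable='gecmi'):
    """Assemble the argument list for a gecmi run (option layout is an assumption)."""
    cmd = [executable]
    if error is not None:
        cmd += ['-e', str(error)]
    if risk is not None:
        cmd += ['-r', str(risk)]
    cmd += [file1, file2]
    return cmd


def run_gecmi(file1, file2, **options):
    """Run gecmi and return its standard output as unparsed text."""
    result = subprocess.run(build_gecmi_command(file1, file2, **options),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


# Illustrative values only; see the paper for how to choose error and risk.
print(build_gecmi_command('file1', 'file2', error=0.01, risk=0.01))
```

Calling run_gecmi requires the gecmi executable to be on the PATH (or passed via the executable argument) with its libraries placed as described earlier.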
The Python module makes it easier to use this tool in programs. Check an example of how it is used.
- GenConvNMI - A significant reimplementation of the current gecmi with much better performance (two orders of magnitude faster, half the RAM consumption and more accurate results). It is a pure C++ version with a fully automatic cross-platform build that produces a single executable (without the annoying need to copy the 2 library dependencies of the current gecmi).
- OvpNMI - Another method of NMI evaluation for overlapping clusters (communities). Unlike GenConvNMI, it is not compatible with the standard NMI value, but it is much faster and yields exact results, whereas the results of GenConvNMI are probabilistic with some variance.
- ExecTime - A lightweight resource consumption profiler.
- PyCABeM - Python Benchmarking Framework for the Clustering Algorithms Evaluation. Uses extrinsic (NMIs) and intrinsic (Q) measures to evaluate cluster quality, considering overlaps (node membership in multiple clusters).