#clique-percolation
This repo implements k-clique community detection to study the citation networks of articles in U.S. social science journals from 1900 to 2009. The data are stored in network edgelist format and are derived from the Thomson Reuters™ Web of Knowledge (WOK) database.
#Quick Start
To install software for clique percolation on an Ubuntu system and run a test on sample data, enter these commands:

    cd ~/
    git clone https://github.com/brooksambrose/clique-percolation
    cd ~/clique-percolation
    time bash -v install.sh

Then to test the installation:

    time bash -v test.sh

If you are successful, the bottom of your output will look something like:

    Found 709 6-clique-communities
    Found 843 5-clique-communities
    Found 1066 3-clique-communities
    real 0m0.587s
    user 0m0.553s
    sys 0m0.024s
    ### Time to run install.sh including tests:
    real 0m4.206s
    user 0m2.131s
    sys 0m0.414s

#Data
The `in49` and `in99` directories contain the main network edgelist data as compressed text files as described below. The `levels` directory contains compressed text files that are lists of standardized WOK codes in which the order of the list corresponds to the index numbers in the corresponding edgelist files.
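
As a rough illustration of how a `levels` list might be used to translate edgelist index numbers back into WOK codes, here is a minimal sketch. It assumes one code per line and that index numbers start at 1; neither detail is stated above, so check both against the actual files. The function name `load_levels` and the codes in the toy data are hypothetical.

```python
import gzip
import io


def load_levels(fileobj, first_index=1):
    """Map edgelist index numbers to codes, assuming one code per line.

    first_index=1 is an assumption; change it if the ids are 0-based.
    """
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as handle:
        codes = [line.rstrip("\n") for line in handle]
    return {first_index + i: code for i, code in enumerate(codes)}


# Hypothetical stand-in for a compressed levels file; the codes are made up.
fake_levels = io.BytesIO(gzip.compress(b"CODE_A\nCODE_B\nCODE_C\n"))
index_to_code = load_levels(fake_levels)

# Translate one made-up edgelist row of two vertex ids back into codes.
source_id, target_id = 1, 3
print(index_to_code[source_id], index_to_code[target_id])
```

`gzip.open` also accepts a file path, so the same function can be pointed at a file on disk instead of the in-memory stand-in.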
##1900-1949
The `in49` directory contains a smaller set of edgelists covering selected journals in the social sciences from 1900 to 1949.
###Original Citation Codes
* `bel2mel-49crel.txt.gz` is a weighted CR-UT-CR edgelist of 123,055 lines and 10,874 unique vertex ids. CR-UT-CR refers to the WOK codes for cited references (CR) and the record id (UT) of the citing article; thus CR-UT-CR indicates a monopartite network of citations tied by the number of articles containing both in their lists of references.
* `bel2mel-49utel.txt.gz` is a weighted UT-CR-UT edgelist of 12,719 lines and 3,108 unique vertex ids. This monopartite edgelist (mel, projected from a bipartite edgelist, bel) is the inverse of `bel2mel-49crel`. It represents a network of articles tied by the number of citations their reference lists have in common.
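
The two projections can be illustrated on a toy bipartite edgelist. This is a minimal sketch of the general bipartite-to-monopartite idea, not the code that produced the files; the ids `ut1`, `crA`, and so on are made up, and `project` is a hypothetical helper.

```python
from collections import Counter
from itertools import combinations

# Toy bipartite edgelist (bel): (citing article UT, cited reference CR).
bel = [
    ("ut1", "crA"), ("ut1", "crB"), ("ut1", "crC"),
    ("ut2", "crA"), ("ut2", "crB"),
    ("ut3", "crB"), ("ut3", "crC"),
]


def project(pairs):
    """Project (group, member) pairs onto members.

    Each pair of members gets a weight equal to the number of groups
    that contain both of them.
    """
    members = {}
    for group, member in pairs:
        members.setdefault(group, set()).add(member)
    weights = Counter()
    for group_members in members.values():
        for a, b in combinations(sorted(group_members), 2):
            weights[(a, b)] += 1
    return dict(weights)


# CR-UT-CR: references tied by the number of articles citing both.
crel = project(bel)
# UT-CR-UT: articles tied by the number of references they share.
utel = project((cr, ut) for ut, cr in bel)

print(crel)
print(utel)
```

Swapping the roles of the two columns is all that distinguishes the two projections, which is why the same helper serves both.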
###Resolved Identity Citation Codes
To correct a nontrivial level of error and natural variation in the coding of citations over several decades of citation indexing, we applied a machine learning approach to the identity resolution of WOK citation codes. Files with the `z` prefix refer to error-corrected or resolved identity edgelists. These lists are longer because citations were selected for inclusion in the analysis only if they were cited by more than one article; identity resolution has the consequence of saving otherwise solitary codes from exclusion by subsuming them under a set of identified variations (`z` referring to fuzzy sets of variations).
* `bel2mel-49zcrel.txt.gz` is a resolved identity CR-UT-CR edgelist of 139,032 lines and 11,998 unique vertex ids.
* `bel2mel-49zutel.txt.gz` is a resolved identity UT-CR-UT edgelist of 15,420 lines and 3,524 unique vertex ids.
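
The reason resolved lists come out longer can be seen in a small worked example. The sketch below uses made-up citation strings and a hypothetical variant-to-identity mapping; it shows only the inclusion rule (cited by more than one article), not the machine learning step itself.

```python
from collections import defaultdict

# Made-up data: (citing article, cited code). Two spellings of one work.
citations = [
    ("ut1", "SMITH J, 1920, AM J SOCIOL"),
    ("ut2", "SMITH JOHN, 1920, AM J SOCIOLOGY"),
    ("ut1", "JONES A, 1931, SOC FORCES"),
    ("ut3", "JONES A, 1931, SOC FORCES"),
]

# Hypothetical output of an identity resolution step: variant -> resolved id.
resolved = {
    "SMITH J, 1920, AM J SOCIOL": "z1",
    "SMITH JOHN, 1920, AM J SOCIOLOGY": "z1",
    "JONES A, 1931, SOC FORCES": "z2",
}


def cited_more_than_once(pairs):
    """Return the codes cited by more than one distinct article."""
    citers = defaultdict(set)
    for article, code in pairs:
        citers[code].add(article)
    return {code for code, articles in citers.items() if len(articles) > 1}


kept_raw = cited_more_than_once(citations)
kept_resolved = cited_more_than_once((ut, resolved[cr]) for ut, cr in citations)

print(len(kept_raw), len(kept_resolved))
```

Each spelling of the first work is cited only once and would be dropped on its own; once both are subsumed under one identity, the work passes the threshold and enters the network.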
##1900-2009
The `in99` directory contains larger edgelists covering selected journals in the social sciences from roughly 1900 to 2009, including all data from the 1900-1949 subset, with partial coverage extending to 2015. We use the prefix `99` because 1999 will be the extent of the analytical window and to remind the user that these data are later and larger than the `49` data.
* `bel2mel-99crel.txt.gz` is a CR-UT-CR edgelist of 27,303,359 lines and 350,816 unique vertex ids.
* `bel2mel-99utel.txt.gz` is a UT-CR-UT edgelist of 13,485,947 lines and 85,477 unique vertex ids.
* `bel2mel-99zcrel.txt.gz` is a resolved identity CR-UT-CR edgelist of 27,797,685 lines and 352,151 unique vertex ids.
* `bel2mel-99zutel.txt.gz` is a resolved identity UT-CR-UT edgelist of 13,702,183 lines and 86,549 unique vertex ids.
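
To check line and vertex counts like those listed above without decompressing a file to disk, something along these lines may help. It is a sketch that assumes whitespace-separated columns with the two vertex ids first; the actual column layout is not documented here, so adjust the parsing as needed. `summarize_edgelist` is a hypothetical helper and the toy data are made up.

```python
import gzip
import io


def summarize_edgelist(fileobj):
    """Count edge lines and unique vertex ids in a gzipped edgelist.

    Assumes whitespace-separated columns with the two vertex ids first.
    Lines are streamed, so the file is never fully held in memory.
    """
    n_lines = 0
    vertices = set()
    with gzip.open(fileobj, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            n_lines += 1
            vertices.update(fields[:2])
    return n_lines, len(vertices)


# Made-up stand-in for a compressed edgelist: source, target, weight.
toy = io.BytesIO(gzip.compress(b"1 2 3\n1 3 1\n2 3 2\n4 1 1\n"))
n_lines, n_vertices = summarize_edgelist(toy)
print(n_lines, n_vertices)
```

Because `gzip.open` accepts a path as well as a file object, the function can be called directly on any of the `.txt.gz` files.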
#Software
##Berkeley Common Environment
The BCE is a virtualization project led by Berkeley Research Computing to create a common environment for statistical computing. The `install.sh` script is intended to be run within BCE but may work in other Unix-like environments.
##Clique Percolation
Clique percolation is a method for detecting k-clique community structure in large graphs. The original Clique Percolation Method (CPM) is a serial method suitable for smaller graphs. Details of the clique percolation method are described in:
Palla, G., I. Derényi, I. Farkas, and T. Vicsek. 2005. “Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society.” Nature 435(7043):814–18.
We instead implement the parallelized Clique percolation On Steroids (COS) method to accommodate the larger scale of our data. Details of the COS method are described in:
Gregori, E., L. Lenzini, and S. Mainardi. 2013. “Parallel k-Clique Community Detection on Large-Scale Networks.” IEEE Transactions on Parallel and Distributed Systems 24(8):1651–60.
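
The k-clique community definition behind both methods can be illustrated on a tiny graph: two k-cliques are adjacent if they share k-1 vertices, and a community is the union of all k-cliques reachable from one another through adjacent k-cliques. The sketch below is a naive brute-force illustration of that definition, suitable only for very small graphs. It is neither the COS implementation nor the code this repo runs, and the example edges are made up.

```python
from itertools import combinations


def k_clique_communities(edges, k):
    """Naive clique percolation: merge k-cliques that share k-1 vertices."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    # Enumerate every k-clique by brute force (only viable for tiny graphs).
    cliques = [
        frozenset(combo)
        for combo in combinations(sorted(neighbors), k)
        if all(b in neighbors[a] for a, b in combinations(combo, 2))
    ]

    # Union-find over cliques; adjacent cliques share exactly k-1 vertices.
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)

    # A community is the union of the vertices of one group of cliques.
    communities = {}
    for i, clique in enumerate(cliques):
        communities.setdefault(find(i), set()).update(clique)
    return sorted(sorted(c) for c in communities.values())


edges = [
    (1, 2), (1, 3), (2, 3), (2, 4), (3, 4),  # triangles 1-2-3 and 2-3-4 share edge 2-3
    (4, 5),                                   # bridge that belongs to no triangle
    (5, 6), (5, 7), (6, 7),                   # a separate triangle
]
print(k_clique_communities(edges, 3))
```

With k = 3 the two triangles sharing an edge merge into one community while the triangle across the bridge stays separate. Brute-force enumeration grows combinatorially with graph size, which is why data at the scale described above call for a parallel method.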
To run a clique percolation analysis on the entire 1900-1999 database, enter the following:

    cd ~/
    time git clone https://github.com/brooksambrose/clique-percolation-data
    cd ~/clique-percolation
    time bash -v bigjob.sh