brooksambrose/clique-percolation

# clique-percolation

This repo implements k-clique community detection to study the citation networks of articles in U.S. social science journals from 1900 to 2009. The data are stored in network edgelist format and are derived from the Thomson Reuters Web of Knowledge (WOK) database.

# Quick Start

To install software for clique percolation on an Ubuntu system and run a test on sample data, enter these commands:

    cd ~/
    git clone https://github.com/brooksambrose/clique-percolation
    cd ~/clique-percolation
    time bash -v install.sh

Then to test the installation:

    time bash -v test.sh

If you are successful, the bottom of your output will look something like:

    Found	709	6-clique-communities
    Found	843	5-clique-communities
    Found	1066	3-clique-communities

    real	0m0.587s
    user	0m0.553s
    sys	0m0.024s

    ### Time to run install.sh including tests:

    real	0m4.206s
    user	0m2.131s
    sys	0m0.414s

# Data

The in49 and in99 directories contain the main network edgelist data as compressed text files, described below. The levels directory contains compressed text files listing standardized WOK codes; the position of each code in a list corresponds to the index number used for that vertex in the matching edgelist file.
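The index-to-code lookup described above can be sketched in Python. This is a minimal sketch under stated assumptions, not the repository's own loader: it assumes each edgelist line holds whitespace-separated `source target weight` integers, that each levels file holds one code per line, and that the numbering base is known (the hypothetical `index_base` parameter; check the actual files before relying on it). The function names and toy file names are illustrative only.

```python
import gzip
import os
import tempfile


def read_levels(path):
    """Return the codes in file order, one per line (assumed layout)."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in fh]


def read_edgelist(path):
    """Yield (source, target, weight); weight defaults to 1 if the column is absent."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if not fields:
                continue  # skip blank lines
            src, dst = int(fields[0]), int(fields[1])
            weight = int(fields[2]) if len(fields) > 2 else 1
            yield src, dst, weight


def label_edges(edgelist_path, levels_path, index_base=0):
    """Replace vertex indices with the codes found at those positions in the levels list."""
    levels = read_levels(levels_path)
    return [
        (levels[s - index_base], levels[t - index_base], w)
        for s, t, w in read_edgelist(edgelist_path)
    ]


def demo():
    """Write two tiny gzipped toy files and label the edges (toy data, not WOK data)."""
    with tempfile.TemporaryDirectory() as tmp:
        el = os.path.join(tmp, "toy-el.txt.gz")
        lv = os.path.join(tmp, "toy-levels.txt.gz")
        with gzip.open(el, "wt", encoding="utf-8") as fh:
            fh.write("0 1 2\n1 2 1\n")
        with gzip.open(lv, "wt", encoding="utf-8") as fh:
            fh.write("CODE_A\nCODE_B\nCODE_C\n")
        return label_edges(el, lv)
```

Calling `demo()` exercises the round trip on toy files; to use the real data, pass the paths of an edgelist and its matching levels file along with the correct `index_base`.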

## 1900-1949

The in49 directory contains a smaller set of edgelists covering selected journals in the social sciences from 1900 to 1949.

### Original Citation Codes

  • bel2mel-49crel.txt.gz is a weighted CR-UT-CR edgelist of 123,055 lines and 10,874 unique vertex ids.

CR and UT are WOK field codes: CR is a cited reference and UT is the record id of the citing article. CR-UT-CR therefore denotes a monopartite network of cited references in which two references are tied by the number of articles that include both in their reference lists.

  • bel2mel-49utel.txt.gz is a weighted UT-CR-UT edgelist of 12,719 lines and 3,108 unique vertex ids.

This monopartite edgelist (mel, projected from a bipartite edgelist, bel) is the counterpart of bel2mel-49crel: it represents a network of articles in which two articles are tied by the number of cited references that their reference lists share.
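Both projections can be illustrated with a short Python sketch. It assumes the bipartite edgelist is available as `(UT, CR)` pairs; the function name `project`, the `onto` parameter, and the toy data are hypothetical and are not taken from the repository's code.

```python
from collections import Counter
from itertools import combinations

# Toy bipartite edgelist: (citing article UT, cited reference CR).
TOY_BEL = [
    ("u1", "c1"), ("u1", "c2"),
    ("u2", "c1"), ("u2", "c2"), ("u2", "c3"),
    ("u3", "c3"),
]


def project(bipartite_edges, onto="CR"):
    """Project (UT, CR) pairs onto one vertex type.

    onto="CR": references tied by the number of articles citing both (CR-UT-CR).
    onto="UT": articles tied by the number of references they share (UT-CR-UT).
    Returns a Counter mapping each sorted vertex pair to its weight.
    """
    groups = {}
    for ut, cr in set(bipartite_edges):  # set() drops duplicate pairs
        key, member = (ut, cr) if onto == "CR" else (cr, ut)
        groups.setdefault(key, set()).add(member)
    weights = Counter()
    for members in groups.values():
        # Every pair of vertices linked through the same bridging vertex gains one tie.
        for a, b in combinations(sorted(members), 2):
            weights[(a, b)] += 1
    return weights
```

On the toy data, `project(TOY_BEL, "CR")` ties c1 and c2 with weight 2 because both u1 and u2 cite them, while `project(TOY_BEL, "UT")` ties u1 and u2 with weight 2 because they share c1 and c2.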

### Resolved Identity Citation Codes

To correct a nontrivial level of error and natural variation in how citations were coded over several decades of citation indexing, we applied a machine learning approach to resolve the identity of WOK citation codes. Files with the z prefix are the error-corrected, resolved identity edgelists (z refers to fuzzy sets of variations). These edgelists are longer than the originals because a citation was included in the analysis only if more than one article cited it; identity resolution rescues otherwise solitary codes from exclusion by subsuming them under a set of identified variations.

  • bel2mel-49zcrel.txt.gz is a resolved identity CR-UT-CR edgelist of 139,032 lines and 11,998 unique vertex ids.
  • bel2mel-49zutel.txt.gz is a resolved identity UT-CR-UT edgelist of 15,420 lines and 3,524 unique vertex ids.
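Why resolution lengthens the edgelists can be seen in a small Python sketch. The machine learning step itself is not reproduced here: the `resolve` dictionary below is a hand-written, hypothetical stand-in that maps a variant code to its resolved identity, and the function name and toy codes are likewise illustrative.

```python
from collections import defaultdict


def codes_cited_by_multiple(citations, resolve=None):
    """Return the codes cited by more than one distinct article.

    citations: iterable of (UT, code) pairs.
    resolve:   optional dict mapping a variant code to its resolved identity
               (a stand-in for the output of an identity resolution step).
    """
    resolve = resolve or {}
    citers = defaultdict(set)
    for ut, code in citations:
        citers[resolve.get(code, code)].add(ut)
    return {code for code, uts in citers.items() if len(uts) > 1}


# Toy data: "A" and "A_variant" are two spellings of the same reference.
TOY_CITATIONS = [
    ("u1", "A"), ("u2", "A_variant"),
    ("u1", "B"), ("u2", "B"),
    ("u3", "C"),
]
```

Without resolution only `B` passes the more-than-one-article filter, because `A` and `A_variant` each have a single citer. With `resolve={"A_variant": "A"}` the two variants are pooled, so `A` also survives and contributes extra edges.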

## 1900-2009

The in99 directory contains larger edgelists covering selected journals in the social sciences from roughly 1900 to 2009, with partial coverage extending to 2015. It includes all data from the 1900-1949 subset. We use the prefix 99 because 1999 will be the end of the analytical window, and to remind the user that these data are later and larger than the 49 data.

  • bel2mel-99crel.txt.gz is a CR-UT-CR edgelist of 27,303,359 lines and 350,816 unique vertex ids.
  • bel2mel-99utel.txt.gz is a UT-CR-UT edgelist of 13,485,947 lines and 85,477 unique vertex ids.
  • bel2mel-99zcrel.txt.gz is a resolved identity CR-UT-CR edgelist of 27,797,685 lines and 352,151 unique vertex ids.
  • bel2mel-99zutel.txt.gz is a resolved identity UT-CR-UT edgelist of 13,702,183 lines and 86,549 unique vertex ids.
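The line and vertex counts quoted above can be checked with a streaming pass over a file, which avoids loading tens of millions of edges into memory. This is a sketch assuming the first two whitespace-separated columns of each line are vertex ids; the function names are hypothetical.

```python
import gzip
import os
import tempfile


def edgelist_stats(path):
    """Return (number of edge lines, number of unique vertex ids) for a gzipped edgelist."""
    n_lines = 0
    vertices = set()
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            fields = line.split()
            if len(fields) < 2:
                continue  # ignore blank or malformed lines
            n_lines += 1
            vertices.update(fields[:2])  # only the two endpoint columns
    return n_lines, len(vertices)


def demo_stats():
    """Run the counter on a four-line toy file (toy data, not WOK data)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.txt.gz")
        with gzip.open(path, "wt", encoding="utf-8") as fh:
            fh.write("1 2 3\n2 3 1\n1 3 1\n4 1 2\n")
        return edgelist_stats(path)
```

Only the set of vertex ids is held in memory, so the cost grows with the number of vertices rather than the number of lines.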

# Software

## Berkeley Common Environment

The Berkeley Common Environment (BCE) is a virtualization project led by Berkeley Research Computing to create a common environment for statistical computing. The install.sh script is intended to be run within BCE but may work in other Unix-like environments.

## Clique Percolation

Clique percolation is a method for detecting k-clique community structure in large graphs. The original Clique Percolation Method (CPM) is a serial method suitable for smaller graphs. Details of the clique percolation method are described in:

Palla, G., I. Derényi, I. Farkas, and T. Vicsek. 2005. “Uncovering the Overlapping Community Structure of Complex Networks in Nature and Society.” Nature 435:814–18.
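The core idea of k-clique communities can be sketched serially in plain Python: enumerate every k-clique, treat two k-cliques as adjacent when they share k-1 vertices, and take each connected group of cliques as one community. This is only a toy illustration of the definition for small graphs; it is not the COS implementation this repo installs, and the function names are hypothetical.

```python
from itertools import combinations


def k_cliques(adj, k):
    """Yield every k-clique as a frozenset; adj maps vertex -> set of neighbours."""
    def extend(clique, candidates):
        if len(clique) == k:
            yield frozenset(clique)
            return
        for v in sorted(candidates):
            # Keep only larger vertices adjacent to every member chosen so far,
            # so each clique is produced exactly once.
            yield from extend(clique + [v], {u for u in candidates & adj[v] if u > v})
    yield from extend([], set(adj))


def k_clique_communities(edges, k):
    """Return k-clique communities as a sorted list of sorted vertex lists (k >= 2)."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    cliques = list(k_cliques(adj, k))
    parent = list(range(len(cliques)))  # union-find over clique indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Two k-cliques share k-1 vertices exactly when they share a (k-1)-subset.
    first_with_face = {}
    for idx, clique in enumerate(cliques):
        for face in combinations(sorted(clique), k - 1):
            first = first_with_face.setdefault(face, idx)
            parent[find(idx)] = find(first)

    communities = {}
    for idx, clique in enumerate(cliques):
        communities.setdefault(find(idx), set()).update(clique)
    return sorted(sorted(c) for c in communities.values())


# Toy graph: two triangles sharing edge 2-3, plus a triangle attached at vertex 4.
TOY_EDGES = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
```

With `k=3` the toy graph yields two communities, `[1, 2, 3, 4]` and `[4, 5, 6]`; vertex 4 belongs to both, which is the overlapping membership the method is designed to expose.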

We instead implement the parallelized Clique percolation On Steroids (COS) method to accommodate the larger scale of our data. Details of the COS method are described in:

Gregori, E., L. Lenzini, and S. Mainardi. 2013. “Parallel k-Clique Community Detection on Large-Scale Networks.” IEEE Transactions on Parallel and Distributed Systems 24(8):1651–60.

# Run a Big Job

To run a clique percolation analysis on the entire 1900-1999 database, enter the following:

    cd ~/
    time git clone https://github.com/brooksambrose/clique-percolation-data
    cd ~/clique-percolation
    time bash -v bigjob.sh
