Searching for a new probability distribution for modeling non-scale-free heavy-tailed real-world networks
In this study, we consider large-scale network data sets from different disciplines, namely social networks, collaboration networks, web graphs, citation networks, biological networks, product co-purchasing networks, temporal networks, communication networks, ground-truth networks, and brain networks. We study several individual data sets from each discipline. These data sets are publicly available at http://snap.stanford.edu/data/index.html
Usage of the repository for the paper "Searching for a new probability distribution for modeling non-scale-free heavy-tailed real-world networks":
-
In this repository, we present examples with 10 network data sets. These data sets can be imported directly into the R statistical software to analyze the degree distributions of the networks. The repository contains one example data set from each of the 10 domains: ego-Twitter(In) (Social Networks), cit-HepTh(In) (Citation Networks), ca-CondMat (Collaboration Networks), Google(In) (Web graphs), Yeast-PPIN (Biological Networks), amazon0601(In) (Product co-purchasing Networks), sx-mathoverflow(In) (Temporal Networks), Email-Enron (Communication Networks), Wiki-Topcats (Ground-truth Networks), and Human25890-session1 (Brain Networks). The ".csv" file for each data set contains the frequency-degree data for the corresponding real-world complex network.
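As an illustration of what frequency-degree data looks like in practice, the following minimal Python sketch reads such a table and turns it into an empirical degree distribution. It is not part of the repository, whose own analysis is done in R. The column names `degree` and `frequency` and the toy values are assumptions made for this example; check the header of the actual ".csv" files before adapting it.

```python
import csv
import io

# Toy stand-in for one of the repository's ".csv" files. The column names
# "degree" and "frequency" and all values are assumptions for illustration.
TOY_CSV = """degree,frequency
1,50
2,25
3,12
5,8
10,4
50,1
"""

def load_degree_frequency(handle):
    """Read (degree, frequency) rows from an open CSV file handle."""
    degrees, freqs = [], []
    for row in csv.DictReader(handle):
        degrees.append(int(row["degree"]))
        freqs.append(int(row["frequency"]))
    return degrees, freqs

def empirical_pmf(freqs):
    """Normalize the frequencies so that they sum to 1."""
    total = sum(freqs)
    return [f / total for f in freqs]

def empirical_ccdf(pmf):
    """P(K >= k) at each observed degree (degrees sorted in ascending order)."""
    ccdf, tail = [], 1.0
    for p in pmf:
        ccdf.append(tail)
        tail -= p
    return ccdf

# For a real file, replace io.StringIO(TOY_CSV) with open(path, newline="").
degrees, freqs = load_degree_frequency(io.StringIO(TOY_CSV))
pmf = empirical_pmf(freqs)
ccdf = empirical_ccdf(pmf)
for k, p, c in zip(degrees, pmf, ccdf):
    print(f"k={k:3d}  P(K=k)={p:.3f}  P(K>=k)={c:.3f}")
```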
-
The "models.R" file contains implementations of commonly used degree distributions, namely the Lomax, power-law, power-law with cutoff, Log-normal, and Exponential distributions. It also contains implementations of our proposed "Generalized Lomax" (GLM) family of distributions, namely the GLM Type-I, GLM Type-II, GLM Type-III, and GLM Type-IV models. Descriptions of all these models are provided in the manuscript titled "Searching for a new probability distribution for modeling non-scale-free heavy-tailed real-world networks".
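To make the simplest of these models concrete, here is a hedged Python sketch of the standard Lomax (Pareto type II) density, f(x) = (alpha / sigma) * (1 + x / sigma)^(-(alpha + 1)), together with a crude grid-search maximum-likelihood fit to frequency-weighted degrees. It is not taken from "models.R": the parameterization, the estimation method, and the toy data are all assumptions, and the manuscript's treatment (including the GLM variants) may differ.

```python
import math

def lomax_pdf(x, alpha, sigma):
    """Lomax (Pareto type II) density for x >= 0, shape alpha > 0, scale sigma > 0."""
    return (alpha / sigma) * (1.0 + x / sigma) ** (-(alpha + 1.0))

def lomax_survival(x, alpha, sigma):
    """P(X > x) under the Lomax distribution."""
    return (1.0 + x / sigma) ** (-alpha)

def weighted_log_likelihood(degrees, freqs, alpha, sigma):
    """Log-likelihood of frequency-weighted degrees under the Lomax density."""
    return sum(f * math.log(lomax_pdf(k, alpha, sigma))
               for k, f in zip(degrees, freqs))

def grid_search_fit(degrees, freqs, alphas, sigmas):
    """Crude fit: return the (alpha, sigma) grid point with the highest likelihood."""
    return max(((a, s) for a in alphas for s in sigmas),
               key=lambda p: weighted_log_likelihood(degrees, freqs, p[0], p[1]))

# Toy frequency-degree data, made up for illustration (not from the repository).
degrees = [1, 2, 3, 5, 10, 50]
freqs = [50, 25, 12, 8, 4, 1]

# Hypothetical search grid; a real analysis would use a proper optimizer.
alphas = [0.5 + 0.1 * i for i in range(40)]
sigmas = [0.5 + 0.5 * i for i in range(40)]
alpha_hat, sigma_hat = grid_search_fit(degrees, freqs, alphas, sigmas)
print(f"alpha ~ {alpha_hat:.1f}, sigma ~ {sigma_hat:.1f}")
```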
-
Once the models have been run, the predicted outputs of the Lomax, power-law, power-law with cutoff, Log-normal, Exponential, GLM Type-I, GLM Type-II, GLM Type-III, and GLM Type-IV models are stored in a dataname_outputs.csv file. For example, for the ego-Twitter(In).csv data set, the predicted values from all of the probability models are stored in the ego-Twitter(In)_outputs.csv file. This file is then used to compute the metrics that assess the predictive accuracy of the models in the manuscript.
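The sketch below illustrates, in Python, how an outputs file of this kind could be scored against the observed distribution. The column names (`observed`, `model_a`, `model_b`), the toy values, and the choice of RMSE and KL divergence are all assumptions made for this example; the metrics actually reported are defined in the manuscript, and the real "_outputs.csv" files may be laid out differently.

```python
import csv
import io
import math

# Made-up stand-in for a "*_outputs.csv" file. Column names and values are
# hypothetical and are not taken from the repository.
TOY_OUTPUTS = """degree,observed,model_a,model_b
1,0.50,0.48,0.60
2,0.25,0.26,0.20
3,0.15,0.16,0.10
4,0.10,0.10,0.10
"""

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted probabilities."""
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))

def kl_divergence(observed, predicted):
    """KL(observed || predicted); terms with observed == 0 contribute nothing."""
    return sum(o * math.log(o / p)
               for o, p in zip(observed, predicted) if o > 0)

rows = list(csv.DictReader(io.StringIO(TOY_OUTPUTS)))
observed = [float(r["observed"]) for r in rows]

scores = {}
for name in ("model_a", "model_b"):
    predicted = [float(r[name]) for r in rows]
    scores[name] = (rmse(observed, predicted), kl_divergence(observed, predicted))
    print(f"{name}: RMSE={scores[name][0]:.4f}  KL={scores[name][1]:.4f}")
```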
-
Using the dataname_outputs.csv files, we produce the graphs for our paper (plots of the empirical degree distributions together with the fitted probability distributions). The code for these plots is given in figures_plots.m (a MATLAB implementation file).
-
For reproducibility, the results reported in the paper for these network data sets, including the graphs and figures, can be recomputed directly from the implementation files and data sets (along with the outputs) provided in this repository. The remaining data sets can be obtained from http://snap.stanford.edu/data/index.html and analyzed in the same way as these 10 data sets.