This repository contains command line Python scripts to
- import graphs from external sources as files into a running instance of
arangod
, - generate graphs in a running instance and
- run a Pregel algorithm on a graph.
- Import a Graphalytics graph:
python importer.py --endpoint http://localhost:8529/_db/_system graphalytics --dir_graphalytics /PATH/GRAPH_DIRECTORY
- Import a graph saved as a list of edges:
python importer.py --endpoint http://localhost:8529/_db/_system edge-list --edges_file_edge_list /PATH/GRAPH_FILE
- Generate a clique (only one direction for every undirected edge):
python generator.py --endpoint http://localhost:8529/_db/_system clique \
--num_vertices 1000 --graphname Clique --vertex_collection_name cliqueVertices --edge_collection_name cliqueEdges \
--overwrite --vertex_property_type random \
--vertex_property 0.1 0.9 --edge_property_type random --edge_property 0.2 0.8
This will create a clique graph on 1000 vertices in the database. The vertex collection will be cliqueVertices
, the
edge collection cliqueEdges
, the graph itself Clique
. Any existing object with the same name will be overwritten.
The vertices will have a random number between 0.1 and 0.9 as an attribute, the edges - between 0.2 and 0.8.
- Generate a "cliques-graph": a dijoint union of 100 cliques such that
- in a clique, an edge is missing with probability 0.4;
- any two cliques are connected with probability 0.7;
- an edge between two connected cliques exists with probability 0.3:
python generator.py --endpoint http://localhost:8529/_db/_system cliques-graph \
--num_cliques 100 --min_size_clique 12 --max_size_clique 12 \
--prob_missing_one 0.4 --prob_missing_all 0.7 --density_between_two_cliques 0.3
- Generate the 20-partite complete graph with parts of random size between 30 and 35. The constructed graph will be
smart with smart attribute
part
python generator.py --endpoint http://localhost:8529/_db/_system k-partite \
--num_parts 20 --min_size_clique 30 --max_size_clique 35 \
--make_smart --smart_attribute part --overwrite
- Run the Pregel PageRank program on graph
generatedGraph
, write the result into the fieldres_field
, update status every5
seconds, run until value change is at most0.00001
:
python start_Pregel.py --endpoint http://localhost:8529/ --graphname generatedGraph \
--resultField res_field \
--sleep_time 5 --pr_threshold 0.00001 pagerank
- Download the
cit-Patents
dataset from the Graphalytics website, import the graph and run Pregel PageRank until value change is at most0.00001
:
python benchmark_graphalytics.py --endpoint http://localhost:8529/ pagerank cit-Patents \
--pr_threshold 0.00001
The graphs must be stored in local files. Two formats are accepted: graphs from the Graphalytics repository and graphs saved in one file as a list of edges, possiblly with weights. In the following we describe the accepted formats. It is expected that the files are well-formed.
A Graphalytics graph is stored in multiple files. Our script uses only two of them:
-
the file containing vertices has the extension
.v
and lists vertex ids, which are natural numbers, one vertex per line, e.g.,123 321 543 43
-
the edge file describes edges one edge per line as a pair of vertex ids or a triple (vertex id, vertex id, weight). The latter is a real number. As a separator, the space character is used, e.g.,
123 321 2.322 43 321 43.2
A graph is stored in a single file that has the same format as the edge files in Graphalytics format except that it may
contain comment lines starting with #
, %
or /
and the weighs are any sequences of characters without whitespaces.
The import script is importer.py
. You can call with the option -h
to obtain detailed information on its options that
we describe here as well. The examples are given for a *nix system.
- The only necessary parameter is the address of the server running an ArangoDB instance:
python3 importer.py http://localhost:8529/_db/_system
- the format is either
graphalytics
oredge-list
(default isedge-list
):
python3 importer.py http://localhost:8529/_db/_system edge-list
-
database options. These options describe the connection to the database and the objects that will be created in the database.
--user
: username, default isroot
--pwd
: user password, default is the empty string--graphname
: the name of the graph in the database, default isimportedGraph
--vertex_collection_name
: the name of the new vertex collection that will be created in the database, default isvertices
--edge_collection_name
: the name of the new edge collection that will be created in the database, default isedges
--make_smart
: make a SmartGraph (only available if the database is run in the Enterprise edition). Deafult isFalse
. For smart graphs, further options are available (they are ignored for non-smart graphs):--num_shards
: the number of shards for the new collections, default is 5--repl_factor
: replication factor for the new collections, default is 2--smart_attribute
: the smart attribute for the vertex collection, default issmartProp
--overwrite
: a flag to indicate that any of the graph and both collections (edges and vertices) should be deleted if they exist before graph creation, default isFalse
-
translation options:
--bulk_size
: the maximum number of vertices/edges that are internally inserted into the database in one database interaction, default is 10000
-
graphalytics options describe the input files. There are two ways to do this: by giving all three files explicitly or by giving the directory containing the three files. In the latter case, certain conditions must be fulfilled:
- The names of the files must be identical to the name of the directory containing it.
- The vertex file must have the extension
.v
. - The edge file must have the extension
.e
. - The properties file must have the extension
.properties
.
Example: the directory is
./abc/def
and the file names aredef.v
anddef.e
anddef.properties
.--vertices_file_graphalytics
: the file containing the vertices, if not given,--dir_graphalytics
will be used--edges_file_graphalytics
: the file containing the edges, if not given,--dir_graphalytics
will be used--properties_file_graphalytics
: the file containing the graph properties, if not given,--dir_graphalytics
will be used--dir_graphalytics
: the directory containing (at least) the two files, default is the current directory
-
edge list properties:
--edges_file_edge_list
: the file containing the list of edges, default isgraph.txt
-
verbosity:
-- silent
: do not print time statistics, progress bar and what is being currently done, default isFalse
The script name is generator.py
. It can create two types of graphs: undirected cliques and the cliques graphs. An
undirected clique is a graph where every vertex has an edge to every other vertex and, possibly, self-loops. A cliques
graph is a graph that is the result of the following construction.
Make a graph with num_cliques
many vertices, add edges with probability inter_cliques_density
. Then replace every
vertex v
by a set V
of vertices of size randomly chosen between min_size_clique
and
max_size_clique
with equal probability. Add edges between vertices of V
where an edge is not added with
probability prob_missing
. (We call the subgraph induced by V
a clique, although it is not necessarily a complete
subgraph. The idea is that if prob_missing
is very low, the resulting subgraph is "almost"
a clique.) Replace edges (v,w)
by num_edges_between_cliques(|V_v|, |V_w|)
edges between the cliques, choosing
endpoints in the cliques randomly with equal distribution.
The script generator.py
has at least two arguments:
- the address of the server running an ArangoDB instance and
- the graph type (
clique
,cliques-graph
ork-partite
), e.g.,
python3 importer.py http://localhost:8529/_db/_system clique
-h
: print the help and terminate- database parameters. These options describe the connection to the database and the objects that will be created in the
database. Currently, only smart graphs are created.
--user
: username--pwd
: user password--graphname
: the name of the graph in the database--vertices
: the name of the new vertex collection that will be created in the database--edges
: the name of the new edge collection that will be created in the database--num_shards
: the number of shards for the new collections--repl_factor
: replication factor for the new collections--smart_attribute
: the smart attribute for the vertex collection--overwrite
: a flag to indicate that any of the graph and both collections (edges and vertices) should be deleted if they exist before graph creation--make_smart
: whether the graph should be smart
- graph parameters:
- vertex attributes. There are three values that a vertex can have as attributes besides the system attributes:
(1) the id, (2) to which part of the graph it belongs (e.g., the part in a k-partite graph of the clique in a
cliques-graph) and (3) an additional attribute, which currently can be only a random real value. If the graph is
not smart (
--make_smart
is not given), the id is written into the attributeid
, the part (in graphs where it makes sense, currently in clqiues-graphs and in k-partite graphs) into the attributepart
and the additional value (which is optional) into the attribute given in the parameter--additional_vertex_attribute
. The latter cannot beid
orpart
. If the graph is smart, the smart attribute name is given in--smart_attribute
(the default issmartProp
). It is possible that the smart attribute name isid
orpart
; in this case, the ids are written into the attributeid
.--vertex_property_type
: one ofnone
,random
. If this parameter is not given or isnone
, no additional vertex attribute is created. Otherwise, the attribute name is given in--additional_vertex_attribute
and the values are determined by the parameter--vertex_property
.--additional_vertex_attribute
: if the vertices should have another (besidessmart_attribute
andid
) attribute, which is determined by the parameter--vertex_property_type
, its name is given here--vertex_property
: if--vertex_property_type
is random, two space separated numbersa
,b
witha <= b
. The real value is computed randomly with equal distribution betweena
andb
.--edge_attribute
: if the edges should have an attribute, which is determined by the parameter--edge_property_type
, its name is given here--edge_property_type
: one ofnone
,random
. If this parameter is not given or isnone
, the standard edge attributeweight
with propertyNull
is created. Otherwise, the attribute name is given in--edge_attribute
and the values are determined by the parameter--edge_property
.--edge_property
: if--edge_property_type
is random, two space separated numbersa
,b
witha <= b
. The real value is computed randomly with equal distribution betweena
andb
.
- vertex attributes. There are three values that a vertex can have as attributes besides the system attributes:
(1) the id, (2) to which part of the graph it belongs (e.g., the part in a k-partite graph of the clique in a
cliques-graph) and (3) an additional attribute, which currently can be only a random real value. If the graph is
not smart (
- clique parameters:
--size
: the number of vertices in the clique
- cliques graph parameters:
--num_cliques
: the number of cliques--min_size_clique
: the (non-strict) lower bound for the size of a clique--max_size_clique
: the (non-strict) upper bound for the size of a clique--prob_missing
: the probability for an edge in a clique to be missing--inter_cliques_density
: the probability there are some edges between two cliques--density_between_two_cliques
: the density of edges between two cliques, i.e., if the cliques have sizes s1 and s2, ' 'and there are m edges between the two cliques, the density is m/(s1*s2).
It is possible to start a Pregel algorithm and to observe its progress while it is working.
The script is start_Pregel.py
, which can be called with the following parameters.
--endpoint
: the server address, e.g.,http://localhost:8529/_db/_system
--graphname
: the name of an existing graph--silent
: is ignored--store:
,--maxGSS
,--parallelism
,--async
,--resultField
,--useMemoryMaps
,--shardKeyAttribute
: see Pregel documentation--sleep_time
: time in seconds to wait between requesting the Pregel status--user
: your username--pwd
: your password--algorithm
: the name of the Pregel algorithm, one of:pagerank
- Page Rank;sssp
- Single-Source Shortest Path;connectedcomponents
- Connected Components;wcc
- Weakly Connected Components;scc
- Strongly Connected Components;hits
- Hyperlink-Induced Topic Search;effectivecloseness
- Effective Closeness;linerank
- LineRank;labelpropagation
- Label Propagation;slpa
- Speaker-Listener Label Propagation
- algorithm specific parameters:
- for Pagerank:
--pr_threshold
: execute the algorithm until the value changes in the vertices are at most the given value--pr_sourceField
: the attribute of vertices to read the initial rank value from
- for SSSP:
--sssp_source
: the vertex ID to calculate distances to/from--sssp_resultField
: the vertex attribute to write the result to
- for other algorithms: not implemented yet
- for Pagerank:
The corresponding script is a wrapper that downloads a graph file (if it does not find it locally) into the current directory, imports the graph into the database and runs a Pregel algorithm on it. It accepts all Pregel parameters and in addition
--remove_archive
: remove the downloaded arcive file after extracting all files from it--target_directory
: the directory to download and extract the filesdataset
: a positional (necessary) parameter which can be one of
cit-Patents, com-friendster, datagen-7_5-fb, datagen-7_6-fb, datagen-7_7-zf, datagen-7_8-zf,
datagen-7_9-fb, datagen-8_0-fb, datagen-8_1-fb, datagen-8_2-zf, datagen-8_3-zf, datagen-8_4-fb,
datagen-8_5-fb, datagen-8_6-fb, datagen-8_7-zf, datagen-8_8-zf, datagen-8_9-fb, datagen-9_0-fb,
datagen-9_1-fb, datagen-9_2-zf, datagen-9_3-zf, datagen-9_4-fb, datagen-sf3k-fb, dota-league,
example-directed, example-undirected, graph500-22, graph500-23, graph500-24, graph500-25,
graph500-26, graph500-27, graph500-28, graph500-29, kgs, twitter_mpi
You need python3 and all python packages listed in requirements.txt