Skip to content
This repository has been archived by the owner on Sep 12, 2018. It is now read-only.

Notes on graph/connected component/file size limits #134

Closed
fedarko opened this issue Jan 10, 2017 · 1 comment
Closed

Notes on graph/connected component/file size limits #134

fedarko opened this issue Jan 10, 2017 · 1 comment

Comments

@fedarko
Copy link
Owner

fedarko commented Jan 10, 2017

For GraphViz (dot)

I'm not sure here. Per Emden Gansner, there isn't any set upper limit on graph size, although lots of nodes/edges can cause "an explosion in the problem size" due to how dot handles edges crossing multiple levels.

Gansner recommends a general upper limit of 1000 nodes per drawing (see the second link in the previous paragraph), which is not really practical for our purposes at all -- however, we know from experience with the Shakya dataset that laying out larger components isn't that bad (the first component of Shakya has around 6k nodes and 9.8k edges, and takes around five minutes to lay out using dot). Furthermore, since the conceit of AsmViz involves precomputing the layouts, I don't think worrying about efficiency at this part of the program is a super significant issue.

For huge, huge connected components, we have a few options (and almost certainly more that I'm not aware of):

  1. Try using desktop Cytoscape's hierarchical graph layout option -- not sure how fast it is in comparison to dot, but it's worth a shot. If it's much faster than dot, then we can just refer users to desktop Cytoscape from collate (???).
  2. Use the nslimit/nslimitl parameters to bound the upper limit of iterations in dot at the cost of worse graph drawing. I really like this option -- we could even have this be up to the user of collate, so that they could personally make the tradeoff of drawing quality vs. time/memory used.
  3. Force the user to split up the graph at certain places within components (???)

Worth noting that for one of the datasets Jay sent (it's in my testgraphs/ directory as 20170226_oriented.gml) has a largest connected component of around 19,000 nodes. Trying to lay out the .gv file generated for this component (which, due to backfilling, has 15,569 nodes and 31,322 edges) just crashes dot with a segfault -- running the layout operation from within PyGraphviz also results in a crash, naturally. So the upper limit in connected component size is maybe somewhere around 15k nodes as a most optimistic guess? (The massive amount of edges probably also factors into this, also.) At this size of connected component we can focus on SPQR decompositions/desktop Cytoscape/etc.

For sqlite3 .db files

Assuming we go the "no DNA" approach, we can assume that most large graphs will generally approach having their main "size"-causing factors be node count and edge count (this relies upon the assumption that the amount of connected components and clusters, as related to the number of nodes/edges, will be somewhat uniform).

I've read that, at least in one environment, the upper limit of .db files in sql.js is around 120M.

Since the size of the shakya.db file is about 4.6M on my system (and would be, likely, somewhere around 4.8M if it included edge multiplicity and node G/C content data), we can do some (very approximate) math using this + the 120M figure as a basis to estimate the maximum total number of nodes and edges:

imag4418
(my phone's camera is not that great, sorry)

So we can say that, only considering the .db side of things, the maximum amount of nodes is somewhere around 500,000.

For Cytoscape.js' renderer

I know it's somewhere on the order of tens of thousands of nodes, but I'd have to check. (TODO: add here)

@fedarko
Copy link
Owner Author

fedarko commented Aug 9, 2017

This issue was moved to marbl/MetagenomeScope#28

@fedarko fedarko closed this as completed Aug 9, 2017
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant