vg's current graph memory model is weak and extremely bloated. It relies on fixed-width 64-bit integer ids and large hash tables mapping these to other entities. This makes it difficult to store in memory, and a general-purpose key-value store (rocksdb) is used to allow low-memory access to the entire graph. Although this design has some advantages, querying the graph requires costly IO operations, and thus use must be managed carefully when developing high-performance applications.
Fully-indexed graphs should be cheap to store and hold in memory, but it doesn't seem there is a standard approach that can be used just for high-performance access to the sequence and identifier space of the graph. Most work has gone into improving performance for querying the text of such a graph (GCSA) or generating one out of sequencing reads (assemblers such as SGA or fermi2).
The basic requirement is a system that uses a minimal amount of memory to store the sequence of the graph, its edges, and paths in the graph, but still allows constant-time access to the essential features of the graph. The system should support accessing:
- the node's label (a DNA sequence, for instance, or URL)
- the node's neighbors (inbound and outbound edges)
- the node's region in the graph (ranges of node id space that are within some distance of the node)
- node locations relative to stored paths in the graph
- node and edge path membership
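
As a rough illustration of the access patterns listed above, here is a minimal sketch in Python. It is not the xg implementation: the class name `StaticGraph`, its method names, and the 1-based node ids are all assumptions made for this example, and plain Python lists stand in for the succinct bitvectors and compressed integer vectors a real implementation would build (for example with sdsl-lite). Region queries are left out.

```python
from bisect import bisect_right
from itertools import accumulate


class StaticGraph:
    """Immutable graph sketch (hypothetical, for illustration only)."""

    def __init__(self, labels, edges, paths=None):
        # Node ids are 1-based ranks in `labels` (an assumption of this sketch).
        self.seq = "".join(labels)
        # starts[i] is the offset of node i+1 in seq; starts[-1] == len(seq).
        # A real system would mark these offsets in a rank/select bitvector.
        self.starts = [0] + list(accumulate(len(s) for s in labels))
        n = len(labels)
        out_adj = [[] for _ in range(n)]
        in_adj = [[] for _ in range(n)]
        for u, v in edges:
            out_adj[u - 1].append(v)
            in_adj[v - 1].append(u)
        self.out_idx, self.out_ids = self._pack(out_adj)
        self.in_idx, self.in_ids = self._pack(in_adj)
        # Per path: node id -> offsets (in label characters) where it starts.
        self.path_pos = {}
        for name, steps in (paths or {}).items():
            pos, offset = {}, 0
            for node_id in steps:
                pos.setdefault(node_id, []).append(offset)
                offset += len(self.label(node_id))
            self.path_pos[name] = pos

    @staticmethod
    def _pack(adj):
        # Flatten adjacency lists into one id array plus an index of boundaries.
        idx, flat = [0], []
        for ids in adj:
            flat.extend(sorted(ids))
            idx.append(len(flat))
        return idx, flat

    def label(self, node_id):
        return self.seq[self.starts[node_id - 1]:self.starts[node_id]]

    def node_at(self, offset):
        # Binary search here; a rank-supporting bitvector would make this O(1).
        if not 0 <= offset < len(self.seq):
            raise IndexError(offset)
        return bisect_right(self.starts, offset)

    def outbound(self, node_id):
        return self.out_ids[self.out_idx[node_id - 1]:self.out_idx[node_id]]

    def inbound(self, node_id):
        return self.in_ids[self.in_idx[node_id - 1]:self.in_idx[node_id]]

    def paths_of(self, node_id):
        return sorted(p for p, pos in self.path_pos.items() if node_id in pos)

    def path_offsets(self, name, node_id):
        return self.path_pos[name].get(node_id, [])


g = StaticGraph(["GATT", "A", "C", "ACA"],
                [(1, 2), (1, 3), (2, 4), (3, 4)],
                {"ref": [1, 2, 4]})
print(g.label(4), g.outbound(1), g.inbound(4), g.path_offsets("ref", 4))
```

Everything is built once in the constructor and never modified afterwards, which is what lets the node labels and edges live in a few flat arrays instead of per-node hash table entries.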
In theory we could construct a mutable system based on wavelet tries, but research in this area is very new, and I have not found readily-available code for working with these systems. It should be possible to construct mutable wavelet tries using sdsl-lite as a basis, but at present this may be too complex an objective. An immutable system seems like a more straightforward goal.
First some definitions. We have a graph
We first store the concatenated sequences of all elements,
To store edges we keep compressed integer vectors of node ids for the forward
We can represent the path space of the graph using a bitvector marking which entities in the edge-from integer vector