-
Notifications
You must be signed in to change notification settings - Fork 0
/
README.html
32 lines (32 loc) · 8.7 KB
/
README.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="generator" content="pandoc" />
<title></title>
<style type="text/css">code{white-space: pre;}</style>
</head>
<body>
<h1 id="succinct-graphs">succinct graphs</h1>
<p><a href="https://travis-ci.org/ekg/succinct-graph"><img src="https://travis-ci.org/ekg/succinct-graph.svg" alt="Build Status" /></a></p>
<h2 id="challenge">challenge</h2>
<p><code>vg</code>'s current graph memory model is weak and extremely bloated. It relies on fixed-width 64-bit integer ids and large hash tables mapping these to other entities. This makes it difficult to store in memory, and a general-purpose key-value store (rocksdb) is used to allow low-memory access to the entire graph. Although this design has some advantages, querying the graph requires costly IO operations, and thus use must be managed carefully when developing high-performance applications.</p>
<p>Fully-indexed graphs should be cheap to store and hold in memory, but it doesn't seem there is a standard approach that can be used just for high-performance access to the sequence and identifier space of the graph. Most work has gone into improving performance for querying the text of such a graph (<a href="https://github.com/jltsiren/gcsa2">GCSA</a>) or generating one out of sequencing reads (assemblers such as <a href="https://github.com/jts/sga">SGA</a> or <a href="https://github.com/lh3/fermi2">fermi2</a>).</p>
<p>The basic requirement is a system that a minimal amount of memory to store the sequence of the graph, its edges, and paths in the graph, but still allows constant-time access to the essential features of the graph. The system should support accessing:</p>
<ul>
<li>the node's label (a DNA sequence, for instance, or URL)</li>
<li>the node's neighbors (inbound and outbound edges)</li>
<li>the node's region in the graph (ranges of node id space that are within some distance of the node)</li>
<li>node locations relative to stored paths in the graph</li>
<li>node and edge path membership</li>
</ul>
<h2 id="sketch">sketch</h2>
<p><a href="https://rawgit.com/ekg/succinct-graph/master/README.html">read this in HTML for rendered equations</a></p>
<p>In theory we could construct a mutable system based on <a href="http://arxiv.org/abs/1204.3581">wavelet tries</a>, but research in this area is very new, and I have not found readily-available code for working with these systems. It should be possible to construct mutable wavelet tries using sdsl-lite as a basis, but at present this may be too complex an objective. An immutable system seems like a straightforward thing to do.</p>
<p>First some definitions. We have a graph <span class="math"><em>G</em> = <em>N</em>, <em>E</em>, <em>P</em></span> with nodes <span class="math"><em>N</em> = <em>n</em><sub>1</sub>, …, <em>n</em><sub>∣<em>N</em>∣</sub></span>, directed edges <span class="math"><em>E</em> = <em>e</em><sub>1</sub>, …, <em>e</em><sub>∣<em>E</em>∣</sub></span>, and paths <span class="math"><em>P</em> = <em>p</em><sub>1</sub>, …, <em>p</em><sub>∣<em>P</em>∣</sub></span>. Nodes match labels <span class="math"><em>l</em><sub><em>n</em><sub><em>i</em></sub></sub></span> to ranks <span class="math"><em>i</em></span> in the collection of node labels: <span class="math"><em>n</em><sub><em>i</em></sub> = <em>l</em><sub><em>n</em><sub><em>i</em></sub></sub>, <em>i</em></span>. Edges go from one node to another <span class="math"><em>e</em><sub><em>j</em></sub> = <em>n</em><sub><em>x</em></sub>, <em>n</em><sub><em>y</em></sub></span>. Paths match labels <span class="math"><em>l</em><sub><em>p</em><sub><em>k</em></sub></sub></span> to sets of nodes and edges <span class="math"><em>p</em><sub><em>k</em></sub> = <em>l</em><sub><em>p</em><sub><em>k</em></sub></sub>, {<em>n</em><sub>1</sub>, <em>e</em><sub>3</sub>, <em>n</em><sub>4</sub>, <em>e</em><sub>5</sub>, …}</span>.</p>
<p>We first store the concatenated sequences of all elements, <span class="math"><em>S</em> = <em>l</em><sub><em>n</em><sub>1</sub></sub><em>l</em><sub><em>n</em><sub>2</sub></sub><em>l</em><sub><em>n</em><sub>3</sub></sub>…<em>l</em><sub><em>n</em><sub>∣<em>N</em>∣</sub></sub></span>, in the graph in a <a href="https://github.com/simongog/sdsl-lite/blob/master/include/sdsl/enc_vector.hpp#L48-L58">compressed integer vector</a>, <span class="math"><em>S</em><sub><em>i</em><em>v</em></sub></span>. A second <a href="https://github.com/simongog/sdsl-lite/blob/master/include/sdsl/rrr_vector.hpp">compressed bitvector</a>, <span class="math"><em>S</em><sub><em>b</em><em>v</em></sub> : ∣<em>S</em><sub><em>i</em><em>v</em></sub>∣ = ∣<em>S</em><sub><em>b</em><em>v</em></sub>∣</span>, flags node starts, providing a system of node identifiers. We can apply <span class="math"><em>r</em><em>a</em><em>n</em><em>k</em><sub>1</sub>(<em>S</em><sub><em>b</em><em>v</em></sub>, <em>x</em>)</span> to determine the node rank/id at a given position in <span class="math"><em>S</em><sub><em>i</em><em>v</em></sub></span>, and we can use <span class="math"><em>s</em><em>e</em><em>l</em><em>e</em><em>c</em><em>t</em><sub>1</sub>(<em>S</em><sub><em>b</em><em>v</em></sub>, <em>x</em>)</span> to find the positions in <span class="math"><em>S</em><sub><em>i</em><em>v</em></sub></span> corresponding to node with rank/id <span class="math"><em>x</em></span>, thus allowing basic navigation of the nodes and their labels.</p>
<p>To store edges we keep compressed integer vectors of node ids for the forward <span class="math"><em>F</em><sub><em>i</em><em>v</em></sub></span> and reverse <span class="math"><em>T</em><sub><em>i</em><em>v</em></sub></span> link directions, where <span class="math"><em>F</em><sub><em>i</em><em>v</em></sub> = <em>f</em><sub>1</sub>, …, <em>f</em><sub>∣<em>N</em>∣</sub></span> and <span class="math"><em>f</em><sub><em>i</em></sub> = <em>i</em>, <em>t</em><em>o</em><sub><em>i</em><sub>1</sub></sub>, …, <em>t</em><em>o</em><sub><em>i</em><sub>∣<em>t</em><em>o</em><sub><em>i</em></sub>∣</sub></sub></span>. <span class="math"><em>T</em><sub><em>i</em><em>v</em></sub></span> inverts this relationship, providing <span class="math"><em>T</em><sub><em>i</em><em>v</em></sub> = <em>t</em><sub>1</sub>, …, <em>t</em><sub>∣<em>N</em>∣</sub></span> and <span class="math"><em>t</em><sub><em>i</em></sub> = <em>i</em>, <em>f</em><em>r</em><em>o</em><em>m</em><sub><em>i</em><sub>1</sub></sub>, …, <em>f</em><em>r</em><em>o</em><em>m</em><sub><em>i</em><sub>∣<em>f</em><em>r</em><em>o</em><em>m</em><sub><em>i</em></sub>∣</sub></sub></span>. Recall that <span class="math"><em>i</em></span> is the rank of the node. Using another bitvector <span class="math"><em>F</em><sub><em>b</em><em>v</em></sub> : ∣<em>F</em><sub><em>b</em><em>v</em></sub>∣ = ∣<em>F</em><sub><em>i</em><em>v</em></sub>∣</span> and <span class="math"><em>T</em><sub><em>b</em><em>v</em></sub> : ∣<em>T</em><sub><em>b</em><em>v</em></sub>∣ = ∣<em>T</em><sub><em>i</em><em>v</em></sub>∣</span> for we record the first position of each node's entries in <span class="math"><em>F</em><sub><em>i</em><em>v</em></sub></span> and <span class="math"><em>T</em><sub><em>i</em><em>v</em></sub></span>. This first position simply records the rank <span class="math"><em>i</em></span> in <span class="math"><em>S</em><sub><em>i</em><em>v</em></sub></span>. The rest of the positions in the node's range record the ranks/ids of the nodes on the other end of the edge--- on the "to" end in the <span class="math"><em>F</em><sub><em>i</em><em>v</em></sub></span> and the "from" end in <span class="math"><em>T</em><sub><em>i</em><em>v</em></sub></span>. If a node has no edges either coming from or going to it, it will only be represented by reference to its own rank in the correspending edge integer vector.</p>
<p>We can represent the path space of the graph using a bitvector marking which entities in the edge-from integer vector <span class="math"><em>F</em><sub><em>i</em><em>v</em></sub></span> lie in a path. For each traversed node or edge, we mark a 1 in a new bitvector <span class="math"><em>P</em><sub><em>i</em></sub><em></em><sub><em>b</em><em>v</em></sub> : ∣<em>P</em><sub><em>i</em><sub><em>b</em><em>v</em></sub></sub>∣ = ∣<em>F</em><sub><em>i</em><em>v</em></sub>∣</span>. We mark contained entries with 1 and set the un-traversed nodes and edges to 0. Each path thus maps a label to a list of nodes and edges.</p>
</body>
</html>