with requested revisions
ctb committed Jun 19, 2012
1 parent a606213 commit aaf1348
Showing 4 changed files with 31 additions and 9 deletions.
22 changes: 13 additions & 9 deletions kmer-percolation.tex
@@ -63,7 +63,7 @@
analyze the k-mer connectivity of metagenomic samples. The graph
representation is based on a probabilistic data structure, a Bloom
filter, that allows us to efficiently store assembly graphs in as
little as 4 bits per k-mer. We show that this data structure
little as 4 bits per k-mer, albeit inexactly. We show that this data structure
accurately represents DNA assembly graphs in low memory. We apply
this data structure to the problem of partitioning assembly graphs
into components as a prelude to assembly, and show that this reduces
@@ -209,10 +209,12 @@ \subsection{Bloom filters can store de Bruijn graphs}
which 1 in 6 random k-mers tested would be falsely considered present,
each real k-mer can be stored in under 4 bits of memory (see
%Table~\ref{table:bitskmer}
Table 1).
Table 1). While there are many false k-mers, they only matter if they
connect to a real k-mer.

Using a probabilistic data structure to store k-mer nodes does cause
one potential problem: in contrast to an exact graph storage, there is
The false positive rate inherent in Bloom filters thus raises one concern
for graph storage:
in contrast to an exact graph storage, there is
a chance that a k-mer will be adjacent to a false positive k-mer.
That is, a k-mer may connect to another k-mer that does not actually
exist in the original dataset but nonetheless registers as present,
@@ -232,8 +234,9 @@ \subsection{False positives cause local elaboration of graph structure}
graphs using $k=31$ with four different false positive rates
($p_f$=0.01, 0.05, 0.10, and 0.15), we explored the graph using
breadth-first search beginning at the first 31-mer. The graphs in
Figure \ref{fig:circles} illustrate how the local graph structure
elaborates with the false positive rate while the overall circular
Figure \ref{fig:circles} illustrate how the graphs connected to
the original k-mers
elaborate with the false positive rate while the overall circular
graph structure remains, with no erroneous shortcuts between k-mers
that are present in the original sequence. It is visually apparent
that even a high false positive rate of 15\% does not systematically
@@ -363,11 +366,12 @@ \subsection{Erroneous k-mers from sequencing eclipse graph false positives}
%component.
Furthermore, the number of real 17-mers, those that are not false
positives, comprise the majority of the graph.
(As above, we only counted false positive k-mers
that are transitively connected to at least one real k-mer.)

In contrast, when we examined an exact representation of an Illumina
dataset, only 9.9\% of the k-mers in the graph truly exist in the
reference genome. As above, we only counted false positive k-mers
that are transitively connected to at least one real k-mer. The number of
reference genome. The number of
17-mers with more than 2 neighbors in the sequencing reads is higher than for the exact
representation of the genome, which demonstrates that sequencing
errors add to the complexity of the graph. Overall, the errors
@@ -718,7 +722,7 @@ \section{Software and Software Availability}
We have implemented this compressible graph representation and the
associated partitioning algorithm in a software package named khmer.
It is written in C++ and Python 2.6 and is available under the BSD
open source license at https://github.com/ctb/khmer. The graphviz
open source license at https://github.com/ged-lab/khmer. The graphviz
software package was used for graph visualizations. The scripts to
generate the figures of this paper are available in the khmer
repository.
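
The breadth-first exploration described in the hunk at line 232 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the khmer implementation: `make_present` is a hypothetical stand-in for a Bloom filter lookup (real k-mers are always present, any other k-mer is reported present with probability `p_false`, decided by a hash of the k-mer), the genome is a random 1000-base circle, and reverse complements are ignored.

```python
import hashlib
import random
from collections import deque

K = 31
ALPHABET = "ACGT"


def circular_kmers(seq, k=K):
    """Return the k-mers of seq, treated as a circle, in order."""
    wrapped = seq + seq[:k - 1]
    return [wrapped[i:i + k] for i in range(len(seq))]


def neighbors(kmer):
    """Yield the 8 k-mers that overlap kmer by k-1 bases."""
    for base in ALPHABET:
        yield kmer[1:] + base   # successor
        yield base + kmer[:-1]  # predecessor


def make_present(real_kmers, p_false, seed=0):
    """Hypothetical stand-in for a Bloom filter membership test."""
    real = set(real_kmers)

    def present(kmer):
        if kmer in real:
            return True  # no false negatives
        # Map the k-mer to a reproducible value in [0, 1).
        digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
        return int.from_bytes(digest[:8], "big") / 2**64 < p_false

    return present


def explore(start, present, limit=50_000):
    """Breadth-first search from start over k-mers reported present."""
    seen = {start}
    queue = deque([start])
    while queue and len(seen) < limit:  # limit guards against runaway growth
        node = queue.popleft()
        for nb in neighbors(node):
            if nb not in seen and present(nb):
                seen.add(nb)
                queue.append(nb)
    return seen


rng = random.Random(1)
genome = "".join(rng.choice(ALPHABET) for _ in range(1000))
real = circular_kmers(genome)
for p in (0.01, 0.05, 0.10, 0.15):
    reached = explore(real[0], make_present(real, p))
    print(f"p_f={p:.2f}: {len(reached)} k-mers reached, {len(set(real))} real")
```

The counts printed depend on the simulated hash and are not results from the manuscript. The difference between the k-mers reached and the real k-mers is the number of false positives transitively connected to the circle.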
5 changes: 5 additions & 0 deletions table1.tex
@@ -1,3 +1,7 @@
\documentclass[12pt]{article}

\begin{document}

\begin{table*}
\centering
\caption{Bits per k-mer for various false positive rates.}
@@ -14,3 +18,4 @@
\label{table:bitskmer}
\end{table*}

\end{document}
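
The relationship between false positive rate and bits per k-mer that this table reports can be approximated with the textbook Bloom filter sizing formula, $m/n = -\ln p / (\ln 2)^2$, which assumes the optimal number of hash functions. The sketch below uses that formula as an assumption. The values in the table may have been computed differently, so the numbers here are only a rough check on the "under 4 bits" claim.

```python
import math


def bits_per_kmer(p_false):
    """Bits per stored element for a classical Bloom filter that uses
    the optimal number of hash functions (textbook formula)."""
    return -math.log(p_false) / (math.log(2) ** 2)


def optimal_hash_count(p_false):
    """Optimal (real-valued) number of hash functions for p_false."""
    return -math.log2(p_false)


for p in (0.01, 0.05, 0.10, 0.15, 1 / 6):
    print(f"p_f={p:.3f}: {bits_per_kmer(p):.2f} bits per k-mer, "
          f"about {optimal_hash_count(p):.1f} hash functions")
```

Under this formula, a false positive rate of 1 in 6 needs fewer than 4 bits per k-mer, which agrees with the figure quoted in the manuscript text.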
7 changes: 7 additions & 0 deletions table2.tex
@@ -1,3 +1,7 @@
\documentclass[12pt]{article}

\begin{document}

\begin{table}
\centering

@@ -17,3 +21,6 @@
\end{tabular*}
\label{table:ecoli}
\end{table}


\end{document}
6 changes: 6 additions & 0 deletions table3.tex
@@ -1,3 +1,7 @@
\documentclass[12pt]{article}

\begin{document}

\begin{table}
\centering
\caption{Partitioning results on a soil metagenome at k=31.}
@@ -14,3 +18,5 @@

\label{table:parts}
\end{table}

\end{document}
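
To make the idea of partitioning concrete, the sketch below groups reads into components with a union-find structure. It is a deliberate simplification and not khmer's partitioning algorithm: reads are joined only when they share an exact k-mer, adjacency between distinct k-mers and reverse complements are ignored, and the k-mers are held in an exact dictionary rather than a Bloom filter. The function name `partition_reads` is hypothetical.

```python
def kmers(read, k):
    """Yield every k-mer of a read."""
    return (read[i:i + k] for i in range(len(read) - k + 1))


def partition_reads(reads, k):
    """Group read indices into components linked by shared k-mers."""
    parent = list(range(len(reads)))

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_read_with = {}  # k-mer -> index of the first read containing it
    for idx, read in enumerate(reads):
        for kmer in kmers(read, k):
            if kmer in first_read_with:
                union(first_read_with[kmer], idx)
            else:
                first_read_with[kmer] = idx

    groups = {}
    for idx in range(len(reads)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


example_reads = ["ACGTACGTAA", "GTACGTAACC", "TTTTTTTTTT", "TTTTTTTTGG"]
print(partition_reads(example_reads, k=5))
```

Each resulting group could then be assembled independently, which is the motivation the manuscript gives for partitioning as a prelude to assembly.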
