made some revisions based on reviewer comments
jasonpell committed Mar 9, 2012
1 parent fb793f3 commit 6b2763d
Showing 5 changed files with 72 additions and 95 deletions.
Binary file modified f3b001.pdf
Binary file not shown.
Binary file modified f3b005.pdf
Binary file not shown.
Binary file modified f3b010.pdf
Binary file not shown.
Binary file modified f3b015.pdf
Binary file not shown.
167 changes: 72 additions & 95 deletions kmer-percolation.tex
@@ -122,7 +122,7 @@ \section{Introduction}

In this work, we describe a simple probabilistic representation for
storing de Bruijn graphs in memory, based on Bloom filters
-\cite{bloom}. Bloom filters are constant-memory probabilistic data
+\cite{bloom}. Bloom filters are fixed-memory probabilistic data
structures for storing sparse sets; essentially hash tables without collision
detection, set membership queries on Bloom filters can yield false
positives -- elements marked as present that do not actually exist --
@@ -140,7 +140,7 @@ \section{Introduction}

We apply this graph representation to reduce the memory needed
to assemble a soil metagenome sample, through
-the use of read partitioning. Partitioning divides a de Bruijn graph up
+the use of read partitioning. Partitioning separates a de Bruijn graph up
into disconnected graph components; these components can be used to
subdivide sequencing reads into disconnected subsets that can be
assembled separately.
@@ -172,10 +172,10 @@ \subsection{Bloom filters can store de Bruijn graphs}
Thus each k-mer has up to 8 edges connecting to 8 neighbors, which can be determined by
simply building all possible 1-base extensions and testing for their
presence in the Bloom filter. In doing so, we implicitly treat the
-graph as a simple graph as opposed to a multigraph or digraph, which
+graph as a simple graph as opposed to a multigraph, which
means that there can be no self-loops or parallel edges between
-vertices/k-mers. By relying on Bloom filters, the data structure is
-constant memory: no extra memory is used as additional data is
+vertices/k-mers. By relying on Bloom filters, the size of the data structure
+is fixed: no extra memory is used as additional data is
added.

This graph structure is effectively {\em compressible} because one can
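
Note on the hunk above: the neighbor lookup it describes (build every 1-base extension of a k-mer, then test each for membership in the Bloom filter) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The class name `KmerBloomFilter`, the function `neighbors`, the SHA-256 based hashing, and the choice to ignore reverse complements are all assumptions made for illustration.

```python
import hashlib


class KmerBloomFilter:
    """Fixed-size Bloom filter over k-mer strings (illustrative only)."""

    def __init__(self, num_bits, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # The bit array is allocated once and never grows.
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Hypothetical hashing scheme: one salted SHA-256 digest per hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def neighbors(bloom, kmer):
    """Return the up-to-8 neighbors of kmer that the filter reports as present."""
    found = set()
    for base in "ACGT":
        # 4 extensions to the right and 4 to the left.
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            # Skip self-loops, since the graph is treated as a simple graph.
            if candidate != kmer and candidate in bloom:
                found.add(candidate)
    return found


# Usage: load the 4-mers of a short sequence, then query one k-mer.
sequence, k = "ACGTTGCA", 4
bloom = KmerBloomFilter(num_bits=1024)
for i in range(len(sequence) - k + 1):
    bloom.add(sequence[i:i + k])
print(sorted(neighbors(bloom, "CGTT")))
```

Because the filter can return false positives, the reported neighbor set is a superset of the true one. The size of `bloom.bits` does not change as k-mers are added, which is the fixed-memory property the revised text refers to.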
@@ -454,7 +454,7 @@ \subsection{Preservation of long-range structure permits graph partitioning}
can be assembled with parameters chosen for the coverage and sequence
heterogeneity present in each partition. Moreover, data sets partitioned
at a low $k_0$ can be exactly assembled with any $k \ge k_0$, because
-overlaps of $k_0$ bases include all overlaps of greater length.
+overlaps of $k_0-1$ bases include all overlaps of greater length.
Existing approaches to partitioning, however,
rely on exact graph representations that are more memory intensive
than the one presented here. The utility of the probabilistic graph
@@ -492,7 +492,7 @@ \subsection{Concluding thoughts}
impacting the global structure of the graph, allowing graph storage in
as little as 4 bits per k-mer. Because a higher false positive rate
yields a more elaborate local structure, memory can be traded for
-traversal time in e.g. partitioning. Second, it is a constant memory
+traversal time in e.g. partitioning. Second, it is a fixed-memory
data structure, with predictable degradation of both local and global
structure as more data is inserted. For data sets where the number of
unique k-mers is not known in advance, the occupancy of the Bloom
@@ -614,8 +614,8 @@ \section{Graph Partitioning Using A Bloom Filter}
We used the Bloom filter data structure containing the k-mers from a
dataset to discover components of the graph, i.e. to partition the
graph. Here a component is a set of k-mers whose originating reads
-overlap transitively by at least $k$ base pairs. Reads belonging only
-to small components can be discovered and eliminated in constant
+overlap transitively by at least $k-1$ base pairs. Reads belonging only
+to small components can be discovered and eliminated in fixed
memory using a simple traversal algorithm that truncates after
discovering more than a given number of novel k-mers. For discovering
large components we tag the graph at a minimum density by using the
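
Note on the hunk above: the truncated traversal it mentions (stop once more than a given number of novel k-mers has been discovered) can be sketched as a capped breadth-first search. This is a minimal illustration under stated assumptions, not the authors' code. The function name `component_exceeds`, the `kmer_present` callback, and the use of an exact Python set in place of the Bloom filter are hypothetical, and reverse complements are ignored.

```python
from collections import deque


def component_exceeds(kmer_present, start, max_kmers):
    """Return True if the component containing start has more than max_kmers k-mers.

    kmer_present is any membership test (for example a Bloom filter lookup).
    The traversal stops as soon as the cap is exceeded, so at most
    max_kmers + 1 k-mers are ever held in memory.
    """
    seen = {start}
    queue = deque([start])
    while queue:
        kmer = queue.popleft()
        for base in "ACGT":
            # Candidate neighbors: 1-base extensions to the right and to the left.
            for candidate in (kmer[1:] + base, base + kmer[:-1]):
                if candidate not in seen and kmer_present(candidate):
                    seen.add(candidate)
                    if len(seen) > max_kmers:
                        return True  # truncate: the component is "large"
                    queue.append(candidate)
    return False


# Usage with an exact set standing in for the probabilistic filter.
sequence, k = "ACGTTGCA", 4
kmers = {sequence[i:i + k] for i in range(len(sequence) - k + 1)}
print(component_exceeds(kmers.__contains__, "ACGT", max_kmers=3))
print(component_exceeds(kmers.__contains__, "ACGT", max_kmers=10))
```

Reads whose k-mers all fall in components for which this returns False would be the candidates for elimination as small components. The cap bounds the working set regardless of graph size.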
@@ -665,196 +665,171 @@ \section{Software and Software Availability}
\begin{thebibliography}{10}

\bibitem{pubmed19482960}
-M.~Pop.
+Pop~M. (2009{\rm{}}).
\newblock Genome assembly reborn: recent computational challenges.
-\newblock {\em Brief Bioinform}, 10(4):354--66, 2009.
+\newblock {\em Brief Bioinform}, 10(4):354--66.

\bibitem{pubmed22147368}
-S.~Salzberg, A.~Phillippy, A.~Zimin, D.~Puiu, T.~Magoc, S.~Koren, T.~Treangen,
-M.~Schatz, A.~Delcher, M.~Roberts, G.~Marcais, M.~Pop, and J.~Yorke.
+Salzberg~S, et al. (2011{\rm{}}).
\newblock Gage: A critical evaluation of genome assemblies and assembly
algorithms.
-\newblock {\em Genome Res}, 2011.
+\newblock {\em Genome Res}.

\bibitem{metahit}
-Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K., Manichanh, C., Nielsen,
-T., Pons, N., Levenez, F., Yamada, T., Mende, D., Li, J., Xu, J., Li, S., Li,
-D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P.,
-Bertalan, M., Batto, J., Hansen, T., Paslier, D.~L., Linneberg, A., Nielsen,
-H., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H.,
-Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang,
-H., Wang, J., Brunak, S., Dore, J., Guarner, F., Kristiansen, K., Pedersen,
-O., Parkhill, J., Weissenbach, J., Bork, P., Ehrlich, S. and Wang, J.
+Qin ~J, et al.
(2010{\rm{}}).
\newblock A human gut microbial gene catalogue established by metagenomic
sequencing.
\newblock {\rm Nature } \emph{464}, 59--65.

\bibitem{rumen}
-Hess, M., Sczyrba, A., Egan, R., Kim, T., Chokhawala, H., Schroth, G., Luo, S.,
-Clark, D., Chen, F., Zhang, T., Mackie, R., Pennacchio, L., Tringe, S.,
-Visel, A., Woyke, T., Wang, Z. and Rubin, E. (2011{\rm{}}).
+Hess~M, et al.
+(2011{\rm{}}).
\newblock Metagenomic discovery of biomass-degrading genes and genomes from cow
rumen.
\newblock {\rm Science } \emph{331}, 463--7.

\bibitem{pubmed20195499}
-J.~Wooley, A.~Godzik, and I.~Friedberg.
+Wooley~J, Godzik~A, and Friedberg~I. (2010{\rm{}}).
\newblock A primer on metagenomics.
-\newblock {\em PLoS Comput Biol}, 6(2):e1000667, 2010.
+\newblock {\em PLoS Comput Biol}, 6(2):e1000667.

\bibitem{pubmed16123304}
-J.~Gans, M.~Wolinsky, and J.~Dunbar.
+Gans~J, Wolinsky~M, and Dunbar~J. (2005{\rm{}}).
\newblock Computational improvements reveal great bacterial diversity and high
metal toxicity in soil.
-\newblock {\em Science}, 309(5739):1387--90, 2005.
+\newblock {\em Science}, 309(5739):1387--90.

\bibitem{sargasso}
-Venter, J., Remington, K., Heidelberg, J., Halpern, A., Rusch, D., Eisen, J.,
-Wu, D., Paulsen, I., Nelson, K., Nelson, W., Fouts, D., Levy, S., Knap, A.,
-Lomas, M., Nealson, K., White, O., Peterson, J., Hoffman, J., Parsons, R.,
-Baden-Tillson, H., Pfannkoch, C., Rogers, Y. and Smith, H. (2004{\rm{}}).
+Venter~J, et al. (2004{\rm{}}).
\newblock Environmental genome shotgun sequencing of the Sargasso Sea.
\newblock {\rm Science } \emph{304}, 66--74.

\bibitem{permafrost}
-R.~Mackelprang, M.~Waldrop, K.~DeAngelis, M.~David, K.~Chavarria, S.~Blazewicz,
-E.~Rubin, and J.~Jansson.
+Mackelprang~R, et al. (2011{\rm{}}).
\newblock Metagenomic analysis of a permafrost microbial community reveals a
rapid response to thaw.
-\newblock {\em Nature}, 480(7377):368--71, 2011.
+\newblock {\em Nature}, 480(7377):368--71.

\bibitem{pubmed11504945}
-P.~Pevzner, H.~Tang, and M.~Waterman.
-\newblock An eulerian path approach to dna fragment assembly.
-\newblock {\em Proc Natl Acad Sci U S A}, 98(17):9748--53, 2001.
+Pevzner~P, Tang~H, and Waterman~M. (2001{\rm{}}).
+\newblock An Eulerian path approach to DNA fragment assembly.
+\newblock {\em Proc Natl Acad Sci U S A}, 98(17):9748--53.

\bibitem{pubmed20211242}
-J.~Miller, S.~Koren, and G.~Sutton.
+Miller~J, Koren~S, and Sutton~G. (2010{\rm{}}).
\newblock Assembly algorithms for next-generation sequencing data.
-\newblock {\em Genomics}, 95(6):315--27, 2010.
+\newblock {\em Genomics}, 95(6):315--27.

\bibitem{pubmed22068540}
-P.~Compeau, P.~Pevzner, and G.~Tesler.
+Compeau~P, Pevzner~P, and Tesler~G. (2011{\rm{}}).
\newblock How to apply de {B}ruijn graphs to genome assembly.
-\newblock {\em Nat Biotechnol}, 29(11):987--91, 2011.
+\newblock {\em Nat Biotechnol}, 29(11):987--91.

\bibitem{pmid21187386}
-S.~Gnerre, I.~Maccallum, D.~Przybylski, F.~J. Ribeiro, J.~N. Burton, B.~J.
-Walker, T.~Sharpe, G.~Hall, T.~P. Shea, S.~Sykes, A.~M. Berlin, D.~Aird,
-M.~Costello, R.~Daza, L.~Williams, R.~Nicol, A.~Gnirke, C.~Nusbaum, E.~S.
-Lander, and D.~B. Jaffe.
+Gnerre~S, et al. (2011{\rm{}}).
\newblock {{H}igh-quality draft assemblies of mammalian genomes from massively
parallel sequence data}.
-\newblock {\em Proc. Natl. Acad. Sci. U.S.A.}, 108:1513--1518, Jan 2011.
+\newblock {\em Proc. Natl. Acad. Sci. U.S.A.}, 108:1513--1518.

\bibitem{pubmed21114842}
-D.~Kelley, M.~Schatz, and S.~Salzberg.
+Kelley~D, Schatz~M, and Salzberg~S. (2010{\rm{}}).
\newblock Quake: quality-aware detection and correction of sequencing errors.
-\newblock {\em Genome Biol}, 11(11):R116, 2010.
+\newblock {\em Genome Biol}, 11(11):R116.

\bibitem{bloom}
-B.~Bloom.
+Bloom~B. (1970{\rm{}}).
\newblock {{S}pace/time tradeoffs in hash coding with allowable errors}.
-\newblock {\em CACM}, 13(7):422--426, 1970.
+\newblock {\em CACM}, 13(7):422--426.

\bibitem{velvet}
-D.~R. Zerbino, E.~Birney.
+Zerbino~DR, Birney~E. (2008{\rm{}}).
\newblock {{V}elvet: algorithms for de novo short read assembly using de {B}ruijn graphs}.
-\newblock {\em Genome Res.}, 18(5):821-9, 2008.
+\newblock {\em Genome Res.}, 18(5):821-9.

\bibitem{abyss}
-J.~T. Simpson, K.~Wong, S.~D. Jackman, J.~E. Schein, S.~J. Jones, I.~Birol.
+Simpson~JT, et al. (2009{\rm{}}).
\newblock {{A}{B}y{S}{S}: a parallel assembler for short read sequence data}.
-\newblock {\em Genome Res.}, 19(6):1117-23, 2009.
+\newblock {\em Genome Res.}, 19(6):1117-23.

\bibitem{metavelvet}
-T.~Namiki, T.~Hachiya, H.~Tanaka, and Y.~Sakakibara.
+Namiki~T, Hachiya~T, Tanaka~H, and Sakakibara~Y. (2011{\rm{}}).
\newblock {M}eta{V}elvet: {A}n extension of {V}elvet assembler to de novo
metagenome assembly from short sequence reads.
\newblock {\em ACM Conference on Bioinformatics, Computational Biology and
-Biomedicine}, 2011.
+Biomedicine}.

\bibitem{pubmed21685107}
-Y.~Peng, H.~Leung, S.~Yiu, and F.~Chin.
+Peng~Y, Leung~H, Yiu~S, and Chin~F. (2011{\rm{}}).
\newblock Meta-IDBA: a de Novo assembler for metagenomic data.
-\newblock {\em Bioinformatics}, 27(13):i94--i101, 2011.
+\newblock {\em Bioinformatics}, 27(13):i94--i101.

\bibitem{trinity}
-M.~Grabherr, B.~Haas, M.~Yassour, J.~Levin, D.~Thompson, I.~Amit, X.~Adiconis,
-L.~Fan, R.~Raychowdhury, Q.~Zeng, et~al.
+Grabherr~M, et al. (2011{\rm{}}).
\newblock {F}ull-length transcriptome assembly from {R}{N}{A}-{S}eq data
without a reference genome.
-\newblock {\em Nature biotechnology}, 2011.
+\newblock {\em Nature biotechnology}.

\bibitem{staufferintro}
-D.~Stauffer and A.~Aharony.
+Stauffer~D and Aharony~A. (2010{\rm{}}).
\newblock {I}ntroduction to {P}ercolation {T}heory.
-\newblock {\em Taylor and Frances e-Library}, 2010.
+\newblock {\em Taylor and Frances e-Library}.

\bibitem{stauffer1979scaling}
-D.~Stauffer.
+Stauffer~D. (1979{\rm{}}).
\newblock {S}caling theory of percolation clusters.
-\newblock {\em Physics Reports}, 54(1):1--74, 1979.
+\newblock {\em Physics Reports}, 54(1):1--74.

\bibitem{bondy2008graph}
-J.~Bondy and U.~Murty.
+Bondy~J and Murty~U. (2008{\rm{}}).
\newblock {G}raph {T}heory.
-\newblock {\em Graduate Texts in Mathematics}, 2008.
+\newblock {\em Graduate Texts in Mathematics}.

\bibitem{zerbinothesis}
-Z.~DR.
+Zerbino~DR. (2009{\rm{}}).
\newblock Genome assembly and comparison using de {B}ruijn graphs.
-\newblock Ph.D. thesis, University of Cambridge, 2009.
+\newblock Ph.D. thesis, University of Cambridge.

\bibitem{terabasemetag}
-J.~Gilbert, F.~Meyer, D.~Antonopoulos, P.~Balaji, C.~Brown, C.~Brown, N.~Desai,
-J.~Eisen, D.~Evers, D.~Field, W.~Feng, D.~Huson, J.~Jansson, R.~Knight,
-J.~Knight, E.~Kolker, K.~Konstantindis, J.~Kostka, N.~Kyrpides,
-R.~Mackelprang, A.~McHardy, C.~Quince, J.~Raes, A.~Sczyrba, A.~Shade, and
-R.~Stevens.
+Gilbert~J, et al. (2010{\rm{}}).
\newblock Meeting report: the terabase metagenomics workshop and the vision of
an earth microbiome project.
-\newblock {\em Stand Genomic Sci}, 3(3):243--8, 2010.
+\newblock {\em Stand Genomic Sci}, 3(3):243--8.

\bibitem{emp2010}
-Gilbert, J., Meyer, F., Jansson, J., Gordon, J., Pace, N., Tiedje, J., Ley, R.,
-Fierer, N., Field, D., Kyrpides, N., Glockner, F., Klenk, H., Wommack, K.,
-Glass, E., Docherty, K., Gallery, R., Stevens, R. and Knight, R.
-(2010{\rm{b}}).
-\newblock The Earth Microbiome Project: Meeting report of the '1 EMP meeting on
-sample selection and acquisition' at Argonne National Laboratory October 6
+Gilbert~J, et al. (2010{\rm{}}).
+\newblock The Earth Microbiome Project: Meeting report of the ``1 EMP meeting on
+sample selection and acquisition'' at Argonne National Laboratory October 6
2010.
\newblock {\rm Stand Genomic Sci } \emph{3}, 249--53.

\bibitem{zhang2003dna}
-Y.~Zhang and M.~Waterman.
+Zhang~Y and Waterman~M. (2003{\rm{}}).
\newblock {D}{N}{A} {S}equence {A}ssembly and {M}ultiple {S}equence {A}lignment
by an {E}ulerian {P}ath {A}pproach.
\newblock In {\em Cold Spring Harbor Symposia on Quantitative Biology},
-volume~68, pages 205--212. Cold Spring Harbor Laboratory Press, 2003.
+volume~68, pages 205--212. Cold Spring Harbor Laboratory Press.

\bibitem{price2005novo}
-A.~Price, N.~Jones, and P.~Pevzner.
+Price~A, Jones~N, and Pevzner~P. (2005{\rm{}}).
\newblock {D}e novo identification of repeat families in large genomes.
-\newblock {\em Bioinformatics}, 21(suppl 1):i351--i358, 2005.
+\newblock {\em Bioinformatics}, 21(suppl 1):i351--i358.

\bibitem{bloomsurvey}
-A.~Broder and M.~Mitzenmacher.
+Broder~A and Mitzenmacher~M. (2004{\rm{}}).
\newblock {N}etwork applications of bloom filters: {A} survey.
-\newblock {\em Internet Mathematics}, 1(4):485--509, 2004.
+\newblock {\em Internet Mathematics}, 1(4):485--509.

\bibitem{adami2002critical}
-C.~Adami and J.~Chu.
+Adami~C and Chu~J. (2002{\rm{}}).
\newblock {C}ritical and near-critical branching processes.
-\newblock {\em Physical Review E}, 66(1):011907, 2002.
+\newblock {\em Physical Review E}, 66(1):011907.

\bibitem{wald43}
-A.~Wald.
+Wald~A. (1943{\rm{}}).
\newblock Tests of statistical hypotheses concerning several parameters when
the number of observations is large.
-\newblock {\em Transactions of the American Mathematical Society}, 54:426--482,
-1943.
+\newblock {\em Transactions of the American Mathematical Society}, 54:426--482.

\end{thebibliography}

@@ -876,7 +851,9 @@ \section{Software and Software Availability}
\includegraphics[width=2in]{f3b015}

\caption{Graph visualizations demonstrating the decreasing fidelity of
-graph structure with increasing false positive rate. From top left
+graph structure with increasing false positive rate. Erroneous k-mers are
+colored red and k-mers corresponding to the original generated sequence
+are black. From top left
to bottom right, the false positive rates are 0.01, 0.05, 0.10, and
0.15. Shortcuts ``across'' the graph are not created.}
