made some revisions based on reviewer comments

dib-lab · Mar 9, 2012 · 6b2763d
1 parent fb793f3
commit 6b2763d
Showing 5 changed files with 72 additions and 95 deletions.
diff --git a/f3b001.pdf b/f3b001.pdf
diff --git a/f3b005.pdf b/f3b005.pdf
diff --git a/f3b010.pdf b/f3b010.pdf
diff --git a/f3b015.pdf b/f3b015.pdf
diff --git a/kmer-percolation.tex b/kmer-percolation.tex
@@ -122,7 +122,7 @@ \section{Introduction}
 
 In this work, we describe a simple probabilistic representation for
 storing de Bruijn graphs in memory, based on Bloom filters
-\cite{bloom}.  Bloom filters are constant-memory probabilistic data
+\cite{bloom}.  Bloom filters are fixed-memory probabilistic data
 structures for storing sparse sets; essentially hash tables without collision
 detection, set membership queries on Bloom filters can yield false
 positives -- elements marked as present that do not actually exist --
@@ -140,7 +140,7 @@ \section{Introduction}
 
 We apply this graph representation to reduce the memory needed
 to assemble a soil metagenome sample, through
-the use of read partitioning.  Partitioning divides a de Bruijn graph up
+the use of read partitioning.  Partitioning separates a de Bruijn graph
 into disconnected graph components; these components can be used to
 subdivide sequencing reads into disconnected subsets that can be
 assembled separately.
@@ -172,10 +172,10 @@ \subsection{Bloom filters can store de Bruijn graphs}
 Thus each k-mer has up to 8 edges connecting to 8 neighbors, which can be determined by
 simply building all possible 1-base extensions and testing for their
 presence in the Bloom filter.  In doing so, we implicitly treat the
-graph as a simple graph as opposed to a multigraph or digraph, which
+graph as a simple graph as opposed to a multigraph, which
 means that there can be no self-loops or parallel edges between
-vertices/k-mers.  By relying on Bloom filters, the data structure is
-constant memory: no extra memory is used as additional data is
+vertices/k-mers.  By relying on Bloom filters, the size of the data structure 
+is fixed: no extra memory is used as additional data is
 added.
 
 This graph structure is effectively {\em compressible} because one can
@@ -454,7 +454,7 @@ \subsection{Preservation of long-range structure permits graph partitioning}
 can be assembled with parameters chosen for the coverage and sequence
 heterogeneity present in each partition.  Moreover, data sets partitioned
 at a low $k_0$ can be exactly assembled with any $k \ge k_0$, because
-overlaps of $k_0$ bases include all overlaps of greater length.
+overlaps of $k_0-1$ bases include all overlaps of greater length.
 Existing approaches to partitioning, however,
 rely on exact graph representations that are more memory intensive
 than the one presented here.  The utility of the probabilistic graph
@@ -492,7 +492,7 @@ \subsection{Concluding thoughts}
 impacting the global structure of the graph, allowing graph storage in
 as little as 4 bits per k-mer.  Because a higher false positive rate
 yields a more elaborate local structure, memory can be traded for
-traversal time in e.g. partitioning.  Second, it is a constant memory
+traversal time in e.g. partitioning.  Second, it is a fixed-memory
 data structure, with predictable degradation of both local and global
 structure as more data is inserted.  For data sets where the number of
 unique k-mers is not known in advance, the occupancy of the Bloom
@@ -614,8 +614,8 @@ \section{Graph Partitioning Using A Bloom Filter}
 We used the Bloom filter data structure containing the k-mers from a
 dataset to discover components of the graph, i.e. to partition the
 graph.  Here a component is a set of k-mers whose originating reads
-overlap transitively by at least $k$ base pairs.  Reads belonging only
-to small components can be discovered and eliminated in constant
+overlap transitively by at least $k-1$ base pairs.  Reads belonging only
+to small components can be discovered and eliminated in fixed
 memory using a simple traversal algorithm that truncates after
 discovering more than a given number of novel k-mers.  For discovering
 large components we tag the graph at a minimum density by using the
@@ -665,196 +665,171 @@ \section{Software and Software Availability}
 \begin{thebibliography}{10}
 
 \bibitem{pubmed19482960}
-M.~Pop.
+Pop~M. (2009{\rm{}}).
 \newblock Genome assembly reborn: recent computational challenges.
-\newblock {\em Brief Bioinform}, 10(4):354--66, 2009.
+\newblock {\em Brief Bioinform}, 10(4):354--66.
 
 \bibitem{pubmed22147368}
-S.~Salzberg, A.~Phillippy, A.~Zimin, D.~Puiu, T.~Magoc, S.~Koren, T.~Treangen,
-  M.~Schatz, A.~Delcher, M.~Roberts, G.~Marcais, M.~Pop, and J.~Yorke.
+Salzberg~S, et al. (2011{\rm{}}).
 \newblock Gage: A critical evaluation of genome assemblies and assembly
   algorithms.
-\newblock {\em Genome Res}, 2011.
+\newblock {\em Genome Res}.
 
 \bibitem{metahit}
-Qin, J., Li, R., Raes, J., Arumugam, M., Burgdorf, K., Manichanh, C., Nielsen,
-  T., Pons, N., Levenez, F., Yamada, T., Mende, D., Li, J., Xu, J., Li, S., Li,
-  D., Cao, J., Wang, B., Liang, H., Zheng, H., Xie, Y., Tap, J., Lepage, P.,
-  Bertalan, M., Batto, J., Hansen, T., Paslier, D.~L., Linneberg, A., Nielsen,
-  H., Pelletier, E., Renault, P., Sicheritz-Ponten, T., Turner, K., Zhu, H.,
-  Yu, C., Li, S., Jian, M., Zhou, Y., Li, Y., Zhang, X., Li, S., Qin, N., Yang,
-  H., Wang, J., Brunak, S., Dore, J., Guarner, F., Kristiansen, K., Pedersen,
-  O., Parkhill, J., Weissenbach, J., Bork, P., Ehrlich, S.  and Wang, J.
+Qin~J, et al.
   (2010{\rm{}}).
 \newblock A human gut microbial gene catalogue established by metagenomic
   sequencing.
 \newblock {\rm Nature } \emph{464}, 59--65.
 
 \bibitem{rumen}
-Hess, M., Sczyrba, A., Egan, R., Kim, T., Chokhawala, H., Schroth, G., Luo, S.,
-  Clark, D., Chen, F., Zhang, T., Mackie, R., Pennacchio, L., Tringe, S.,
-  Visel, A., Woyke, T., Wang, Z.  and Rubin, E. (2011{\rm{}}).
+Hess~M, et al.
+  (2011{\rm{}}).
 \newblock Metagenomic discovery of biomass-degrading genes and genomes from cow
   rumen.
 \newblock {\rm Science } \emph{331}, 463--7.
 
 \bibitem{pubmed20195499}
-J.~Wooley, A.~Godzik, and I.~Friedberg.
+Wooley~J, Godzik~A, and Friedberg~I. (2010{\rm{}}).
 \newblock A primer on metagenomics.
-\newblock {\em PLoS Comput Biol}, 6(2):e1000667, 2010.
+\newblock {\em PLoS Comput Biol}, 6(2):e1000667.
 
 \bibitem{pubmed16123304}
-J.~Gans, M.~Wolinsky, and J.~Dunbar.
+Gans~J, Wolinsky~M, and Dunbar~J. (2005{\rm{}}).
 \newblock Computational improvements reveal great bacterial diversity and high
   metal toxicity in soil.
-\newblock {\em Science}, 309(5739):1387--90, 2005.
+\newblock {\em Science}, 309(5739):1387--90.
 
 \bibitem{sargasso}
-Venter, J., Remington, K., Heidelberg, J., Halpern, A., Rusch, D., Eisen, J.,
-  Wu, D., Paulsen, I., Nelson, K., Nelson, W., Fouts, D., Levy, S., Knap, A.,
-  Lomas, M., Nealson, K., White, O., Peterson, J., Hoffman, J., Parsons, R.,
-  Baden-Tillson, H., Pfannkoch, C., Rogers, Y.  and Smith, H. (2004{\rm{}}).
+Venter~J, et al. (2004{\rm{}}).
 \newblock Environmental genome shotgun sequencing of the Sargasso Sea.
 \newblock {\rm Science } \emph{304}, 66--74.
 
 \bibitem{permafrost}
-R.~Mackelprang, M.~Waldrop, K.~DeAngelis, M.~David, K.~Chavarria, S.~Blazewicz,
-  E.~Rubin, and J.~Jansson.
+Mackelprang~R, et al. (2011{\rm{}}).
 \newblock Metagenomic analysis of a permafrost microbial community reveals a
   rapid response to thaw.
-\newblock {\em Nature}, 480(7377):368--71, 2011.
+\newblock {\em Nature}, 480(7377):368--71.
 
 \bibitem{pubmed11504945}
-P.~Pevzner, H.~Tang, and M.~Waterman.
-\newblock An eulerian path approach to dna fragment assembly.
-\newblock {\em Proc Natl Acad Sci U S A}, 98(17):9748--53, 2001.
+Pevzner~P, Tang~H, and Waterman~M. (2001{\rm{}}).
+\newblock An Eulerian path approach to DNA fragment assembly.
+\newblock {\em Proc Natl Acad Sci U S A}, 98(17):9748--53.
 
 \bibitem{pubmed20211242}
-J.~Miller, S.~Koren, and G.~Sutton.
+Miller~J, Koren~S, and Sutton~G. (2010{\rm{}}).
 \newblock Assembly algorithms for next-generation sequencing data.
-\newblock {\em Genomics}, 95(6):315--27, 2010.
+\newblock {\em Genomics}, 95(6):315--27.
 
 \bibitem{pubmed22068540}
-P.~Compeau, P.~Pevzner, and G.~Tesler.
+Compeau~P, Pevzner~P, and Tesler~G. (2011{\rm{}}).
 \newblock How to apply de {B}ruijn graphs to genome assembly.
-\newblock {\em Nat Biotechnol}, 29(11):987--91, 2011.
+\newblock {\em Nat Biotechnol}, 29(11):987--91.
 
 \bibitem{pmid21187386}
-S.~Gnerre, I.~Maccallum, D.~Przybylski, F.~J. Ribeiro, J.~N. Burton, B.~J.
-  Walker, T.~Sharpe, G.~Hall, T.~P. Shea, S.~Sykes, A.~M. Berlin, D.~Aird,
-  M.~Costello, R.~Daza, L.~Williams, R.~Nicol, A.~Gnirke, C.~Nusbaum, E.~S.
-  Lander, and D.~B. Jaffe.
+Gnerre~S, et al. (2011{\rm{}}).
 \newblock {{H}igh-quality draft assemblies of mammalian genomes from massively
   parallel sequence data}.
-\newblock {\em Proc. Natl. Acad. Sci. U.S.A.}, 108:1513--1518, Jan 2011.
+\newblock {\em Proc. Natl. Acad. Sci. U.S.A.}, 108:1513--1518.
 
 \bibitem{pubmed21114842}
-D.~Kelley, M.~Schatz, and S.~Salzberg.
+Kelley~D, Schatz~M, and Salzberg~S. (2010{\rm{}}).
 \newblock Quake: quality-aware detection and correction of sequencing errors.
-\newblock {\em Genome Biol}, 11(11):R116, 2010.
+\newblock {\em Genome Biol}, 11(11):R116.
 
 \bibitem{bloom}
-B.~Bloom.
+Bloom~B. (1970{\rm{}}).
 \newblock {{S}pace/time tradeoffs in hash coding with allowable errors}.
-\newblock {\em CACM}, 13(7):422--426, 1970.
+\newblock {\em CACM}, 13(7):422--426.
 
 \bibitem{velvet}
-D.~R. Zerbino, E.~Birney.
+Zerbino~DR and Birney~E. (2008{\rm{}}).
 \newblock {{V}elvet: algorithms for de novo short read assembly using de {B}ruijn graphs}.
-\newblock {\em Genome Res.}, 18(5):821-9, 2008.
+\newblock {\em Genome Res.}, 18(5):821--9.
 
 \bibitem{abyss}
-J.~T. Simpson, K.~Wong, S.~D. Jackman, J.~E. Schein, S.~J. Jones, I.~Birol.
+Simpson~JT, et al. (2009{\rm{}}).
 \newblock {{A}{B}y{S}{S}: a parallel assembler for short read sequence data}.
-\newblock {\em Genome Res.}, 19(6):1117-23, 2009.
+\newblock {\em Genome Res.}, 19(6):1117--23.
 
 \bibitem{metavelvet}
-T.~Namiki, T.~Hachiya, H.~Tanaka, and Y.~Sakakibara.
+Namiki~T, Hachiya~T, Tanaka~H, and Sakakibara~Y. (2011{\rm{}}).
 \newblock {M}eta{V}elvet: {A}n extension of {V}elvet assembler to de novo
   metagenome assembly from short sequence reads.
 \newblock {\em ACM Conference on Bioinformatics, Computational Biology and
-  Biomedicine}, 2011.
+  Biomedicine}.
 
 \bibitem{pubmed21685107}
-Y.~Peng, H.~Leung, S.~Yiu, and F.~Chin.
+Peng~Y, Leung~H, Yiu~S, and Chin~F. (2011{\rm{}}).
 \newblock Meta-IDBA: a de Novo assembler for metagenomic data.
-\newblock {\em Bioinformatics}, 27(13):i94--i101, 2011.
+\newblock {\em Bioinformatics}, 27(13):i94--i101.
 
 \bibitem{trinity}
-M.~Grabherr, B.~Haas, M.~Yassour, J.~Levin, D.~Thompson, I.~Amit, X.~Adiconis,
-  L.~Fan, R.~Raychowdhury, Q.~Zeng, et~al.
+Grabherr~M, et al. (2011{\rm{}}).
 \newblock {F}ull-length transcriptome assembly from {R}{N}{A}-{S}eq data
   without a reference genome.
-\newblock {\em Nature biotechnology}, 2011.
+\newblock {\em Nature biotechnology}.
 
 \bibitem{staufferintro}
-D.~Stauffer and A.~Aharony.
+Stauffer~D and Aharony~A. (2010{\rm{}}).
 \newblock {I}ntroduction to {P}ercolation {T}heory.
-\newblock {\em Taylor and Frances e-Library}, 2010.
+\newblock {\em Taylor and Francis e-Library}.
 
 \bibitem{stauffer1979scaling}
-D.~Stauffer.
+Stauffer~D. (1979{\rm{}}).
 \newblock {S}caling theory of percolation clusters.
-\newblock {\em Physics Reports}, 54(1):1--74, 1979.
+\newblock {\em Physics Reports}, 54(1):1--74.
 
 \bibitem{bondy2008graph}
-J.~Bondy and U.~Murty.
+Bondy~J and Murty~U. (2008{\rm{}}).
 \newblock {G}raph {T}heory.
-\newblock {\em Graduate Texts in Mathematics}, 2008.
+\newblock {\em Graduate Texts in Mathematics}.
 
 \bibitem{zerbinothesis}
-Z.~DR.
+Zerbino~DR. (2009{\rm{}}).
 \newblock Genome assembly and comparison using de {B}ruijn graphs.
-\newblock Ph.D. thesis, University of Cambridge, 2009.
+\newblock Ph.D. thesis, University of Cambridge.
 
 \bibitem{terabasemetag}
-J.~Gilbert, F.~Meyer, D.~Antonopoulos, P.~Balaji, C.~Brown, C.~Brown, N.~Desai,
-  J.~Eisen, D.~Evers, D.~Field, W.~Feng, D.~Huson, J.~Jansson, R.~Knight,
-  J.~Knight, E.~Kolker, K.~Konstantindis, J.~Kostka, N.~Kyrpides,
-  R.~Mackelprang, A.~McHardy, C.~Quince, J.~Raes, A.~Sczyrba, A.~Shade, and
-  R.~Stevens.
+Gilbert~J, et al. (2010{\rm{}}).
 \newblock Meeting report: the terabase metagenomics workshop and the vision of
   an earth microbiome project.
-\newblock {\em Stand Genomic Sci}, 3(3):243--8, 2010.
+\newblock {\em Stand Genomic Sci}, 3(3):243--8.
 
 \bibitem{emp2010}
-Gilbert, J., Meyer, F., Jansson, J., Gordon, J., Pace, N., Tiedje, J., Ley, R.,
-  Fierer, N., Field, D., Kyrpides, N., Glockner, F., Klenk, H., Wommack, K.,
-  Glass, E., Docherty, K., Gallery, R., Stevens, R.  and Knight, R.
-  (2010{\rm{b}}).
-\newblock The Earth Microbiome Project: Meeting report of the '1 EMP meeting on
-  sample selection and acquisition' at Argonne National Laboratory October 6
+Gilbert~J, et al. (2010{\rm{}}).
+\newblock The Earth Microbiome Project: Meeting report of the ``1 EMP meeting on
+  sample selection and acquisition'' at Argonne National Laboratory October 6
   2010.
 \newblock {\rm Stand Genomic Sci } \emph{3}, 249--53.
 
 \bibitem{zhang2003dna}
-Y.~Zhang and M.~Waterman.
+Zhang~Y and Waterman~M. (2003{\rm{}}).
 \newblock {D}{N}{A} {S}equence {A}ssembly and {M}ultiple {S}equence {A}lignment
   by an {E}ulerian {P}ath {A}pproach.
 \newblock In {\em Cold Spring Harbor Symposia on Quantitative Biology},
-  volume~68, pages 205--212. Cold Spring Harbor Laboratory Press, 2003.
+  volume~68, pages 205--212. Cold Spring Harbor Laboratory Press.
 
 \bibitem{price2005novo}
-A.~Price, N.~Jones, and P.~Pevzner.
+Price~A, Jones~N, and Pevzner~P. (2005{\rm{}}).
 \newblock {D}e novo identification of repeat families in large genomes.
-\newblock {\em Bioinformatics}, 21(suppl 1):i351--i358, 2005.
+\newblock {\em Bioinformatics}, 21(suppl 1):i351--i358.
 
 \bibitem{bloomsurvey}
-A.~Broder and M.~Mitzenmacher.
+Broder~A and Mitzenmacher~M. (2004{\rm{}}).
 \newblock {N}etwork applications of bloom filters: {A} survey.
-\newblock {\em Internet Mathematics}, 1(4):485--509, 2004.
+\newblock {\em Internet Mathematics}, 1(4):485--509.
 
 \bibitem{adami2002critical}
-C.~Adami and J.~Chu.
+Adami~C and Chu~J. (2002{\rm{}}).
 \newblock {C}ritical and near-critical branching processes.
-\newblock {\em Physical Review E}, 66(1):011907, 2002.
+\newblock {\em Physical Review E}, 66(1):011907.
 
 \bibitem{wald43}
-A.~Wald.
+Wald~A. (1943{\rm{}}).
 \newblock Tests of statistical hypotheses concerning several parameters when
   the number of observations is large.
-\newblock {\em Transactions of the American Mathematical Society}, 54:426--482,
-  1943.
+\newblock {\em Transactions of the American Mathematical Society}, 54:426--482.
 
 \end{thebibliography}
 
@@ -876,7 +851,9 @@ \section{Software and Software Availability}
 \includegraphics[width=2in]{f3b015}
 
 \caption{Graph visualizations demonstrating the decreasing fidelity of
-  graph structure with increasing false positive rate. From top left
+  graph structure with increasing false positive rate. Erroneous k-mers are 
+  colored red and k-mers corresponding to the original generated sequence 
+  are black. From top left
   to bottom right, the false positive rates are 0.01, 0.05, 0.10, and
   0.15.  Shortcuts ``across'' the graph are not created.}
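
Note on the hunk at `\subsection{Bloom filters can store de Bruijn graphs}`: the neighbor lookup described there (build every possible 1-base extension of a k-mer and test each one for presence in the Bloom filter) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation. The `BloomFilter` class, its SHA-256-based hashing, and the `kmers` and `neighbors` helpers are all hypothetical names introduced here, and reverse complements are ignored for brevity.

```python
import hashlib


class BloomFilter:
    """Hypothetical fixed-size Bloom filter: the bit array never grows."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Assumption: derive each hash by salting SHA-256 with a seed.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def neighbors(kmer, bf):
    """Test all 8 possible 1-base extensions against the filter.

    Self-loops and duplicate candidates are dropped, matching the
    simple-graph treatment described in the text.
    """
    forward = [kmer[1:] + base for base in "ACGT"]
    backward = [base + kmer[:-1] for base in "ACGT"]
    unique = dict.fromkeys(forward + backward)  # keeps order, removes repeats
    return [c for c in unique if c != kmer and c in bf]
```

To try it, load the k-mers of a short sequence with `bf.add(...)` and call `neighbors(kmer, bf)`. The length of `bf.bits` stays the same however many k-mers are inserted, which is the fixed-memory property the revised wording refers to; what changes instead is the false positive rate, so some returned neighbors may not exist in the real data.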