The first step towards analyzing a previously unsequenced organism is
to assemble the reads by merging similar reads into progressively
longer sequences. New assemblers such as Velvet and Euler attempt to
solve the assembly problem by constructing, simplifying, and traversing
the de Bruijn graph of the read sequences. Nodes in the graph represent
substrings of the reads, and directed edges connect consecutive substrings.
Genome assembly is then modeled as finding an Eulerian tour through the
graph, although repeats may lead to multiple possible tours. As such,
assemblers primarily focus on correcting errors, reconstructing unambiguous
regions, and resolving short repeats. These assemblers have successfully
assembled small genomes from short reads, but have had limited success
scaling to larger mammalian-sized genomes, in part, because they
require constructing and manipulating graphs far larger than can fit into
memory.
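The graph construction described above can be illustrated with a minimal Python sketch. It is not taken from Velvet, Euler, or Contrail: the function names `build_de_bruijn` and `unambiguous_path`, the choice of k-mers as nodes, and the toy reads are all assumptions made for illustration, and error correction and reverse complements are ignored.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            graph[kmer]  # touch the key so every node is present, even sinks
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)  # consecutive k-mers overlap by k-1 bases
    return dict(graph)


def unambiguous_path(graph, start):
    """Extend `start` while the next node is the only way forward and back."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for succ in successors:
            in_degree[succ] += 1

    sequence, node, seen = start, start, {start}
    while len(graph[node]) == 1:
        (nxt,) = graph[node]
        if in_degree[nxt] != 1 or nxt in seen:
            break  # a branch or a cycle: the region is no longer unambiguous
        sequence += nxt[-1]  # each step contributes one new base
        seen.add(nxt)
        node = nxt
    return sequence


reads = ["ACGTA", "CGTAC"]  # hypothetical overlapping reads
graph = build_de_bruijn(reads, k=3)
contig = unambiguous_path(graph, "ACG")
```

In this toy case the two reads collapse into a single linear chain of four nodes, so the walk from `ACG` spells out one contig. Repeats would show up as nodes with more than one successor or predecessor, which is where the walk stops.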
To address this limitation, we have developed a new assembly program, Contrail,
that uses Hadoop for de novo assembly of large genomes from short sequencing
reads. Like other leading short-read assemblers, Contrail relies on the
graph-theoretic framework of de Bruijn graphs. However, unlike these programs,
which require large RAM resources, Contrail relies on Hadoop to iteratively
transform an on-disk representation of the assembly graph, allowing an in-depth
analysis even for large genomes. Preliminary results show that Contrail’s
contigs are of similar size and quality to those generated by Velvet when
applied to small (bacterial) genomes, and that Contrail scales far better
when applied to large genomes. We are also developing extensions to Contrail to
efficiently compute a traditional overlap-graph based assembly of large genomes
within Hadoop, a strategy that will be especially valuable as read lengths
increase beyond 100 bp.
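To give a rough sense of what one round of transforming an on-disk graph might look like, the sketch below simulates a single map/shuffle/reduce pass in plain Python. It does not use the Hadoop API and is not Contrail's actual code: the record layout, the `"in"`/`"out"` message tags, and the function names are assumptions chosen for illustration. The round annotates every node with its predecessors, which is the kind of information a later round would need to decide whether a node sits on an unambiguous chain.

```python
from collections import defaultdict


def map_node(node, successors):
    """Emit the node's own out-edges, plus an 'in' message to each successor."""
    yield node, ("out", tuple(sorted(successors)))
    for succ in successors:
        yield succ, ("in", node)


def reduce_node(node, messages):
    """Combine all messages addressed to one node into a single record."""
    out_edges, in_edges = (), []
    for kind, payload in messages:
        if kind == "out":
            out_edges = payload
        else:
            in_edges.append(payload)
    return node, {"in": sorted(in_edges), "out": list(out_edges)}


def run_round(records):
    """Simulate map, shuffle (group by key), and reduce entirely in memory."""
    shuffled = defaultdict(list)
    for node, successors in records:
        for key, value in map_node(node, successors):
            shuffled[key].append(value)
    return dict(reduce_node(key, values) for key, values in sorted(shuffled.items()))


# Hypothetical input: one (node, successors) record per line of an on-disk file.
records = [("ACG", ["CGT"]), ("CGT", ["GTA"]), ("GTA", [])]
annotated = run_round(records)
```

Because each record is processed independently in the map step and each node independently in the reduce step, no single machine has to hold the whole graph, which is the property the paragraph above attributes to the Hadoop-based design.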
Contrail enables de novo assembly of large genomes from short reads by bridging
research in computational biology with research in high-performance computing.
This combination is essential in light of the large data sets involved, and has
the potential to unlock discoveries of critical magnitude. Whereas the
published analysis of the African and Asian human individuals used read mapping
to discover conserved regions and regions with small polymorphisms, de novo
assembly has the unique potential to also discover large scale polymorphisms
between these individuals and the reference human genome. Mapping the
large-scale differences is an important step towards a better understanding of
our own biology, and may reveal previously unknown characteristics of the human
genome related to health or disease. Furthermore, a short read assembler for
large genomes is also essential for sequencing the vast numbers of complex
organisms that have never been sequenced before, and will directly contribute
to new biological knowledge.
Release History
Version 0.8.2
Oct 13, 2010
Initial public release