Skip to content
This repository

HTTPS clone URL

Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP
branch: master
Fetching contributors…

Cannot retrieve contributors at this time

file 761 lines (667 sloc) 39.613 kb
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 503 504 505 506 507 508 509 510 511 512 513 514 515 516 517 518 519 520 521 522 523 524 525 526 527 528 529 530 531 532 533 534 535 536 537 538 539 540 541 542 543 544 545 546 547 548 549 550 551 552 553 554 555 556 557 558 559 560 561 562 563 564 565 566 567 568 569 570 571 572 573 574 575 576 577 578 579 580 581 582 583 584 585 586 587 588 589 590 591 592 593 594 595 596 597 598 599 600 601 602 603 604 605 606 607 608 609 610 611 612 613 614 615 616 617 618 619 620 621 622 623 624 625 626 627 628 629 630 631 632 633 634 635 636 637 638 639 640 641 642 643 644 645 646 647 648 649 650 651 652 653 654 655 656 657 658 659 660 661 662 663 664 665 666 667 668 669 670 671 672 673 674 675 676 677 678 679 680 681 682 683 684 685 686 687 688 689 690 691 692 693 694 695 696 697 698 699 700 701 702 703 704 705 706 707 708 709 710 711 712 713 714 715 716 717 718 719 720 721 722 723 724 725 726 727 728 729 730 731 732 733 734 735 736 737 738 739 740 741 742 743 744 745 746 747 748 749 750 751 752 753 754 755 756 757 758 759 760 761
% !TEX TS-program = pdflatex TEX encoding = UTF-8 Unicode

% This is a simple template for a LaTeX document using the "article"
% class. See "book", "report", "letter" for other types of document.

\documentclass{nature}% use larger type; default would be 10pt

%\usepackage[utf8]{inputenc} % set input encoding (not needed with
%XeLaTeX)


%%% Examples of Article customizations
% These packages are optional, depending whether you want the features
% they provide. See the LaTeX Companion or other references for full
% information.

%%% PAGE DIMENSIONS
%\usepackage{geometry} % to change the page dimensions
%\geometry{a4paper} % or letterpaper (US) ocher a5paper or....
%\geometry{margin=1in} % for example, change the margins to 2 inches
% all round \geometry{landscape} % set up the page for landscape
% read geometry.pdf for detailed page layout information
\usepackage{graphicx}
% \usepackage[parfill]{parskip} % Activate to begin paragraphs with an
% empty line rather than an indent

%%% PACKAGES
%\usepackage{booktabs} % for much better looking tables
%\usepackage{array} % for better arrays (eg matrices) in maths
%\usepackage{paralist} % very flexible & customisable lists
%(eg. enumerate/itemize, etc.)
%\usepackage{verbatim} % adds environment for commenting out blocks of
%text & for better verbatim \usepackage{subfig} % make it possible to
%include more than one captioned figure/table in a single float These
%packages are all incorporated in the memoir class to one degree or
%another...

%%% HEADERS & FOOTERS
%\usepackage{fancyhdr} % This should be set AFTER setting up the page
                      % geometry
%\pagestyle{fancy} % options: empty , plain , fancy
%\renewcommand{\headrulewidth}{0pt} % customise the layout...
%\lhead{}\chead{}\rhead{}
%\lfoot{}\cfoot{\thepage}\rfoot{}

%Acknowledgements:
%We acknowledge the support of Krystle Chavarria and
%Regina Lamendella for extraction of DNA from Great Prairie soil samples
%and the technical support of Stephanie Malfatti and Tijana Glavina del Rio
%at the DOE JGI. Add Eddy. I'm not sure in what capacity.
%Add GLBRC and NSF $ for Adina
%Add John Johnson & Eric and HPC
%Add any grants for Titus including Amazon

%%% SECTION TITLE APPEARANCE
%\usepackage{sectsty} \allsectionsfont{\sffamily\mdseries\upshape} %
%(See the fntguide.pdf for font help) (This matches ConTeXt defaults)

%%% ToC (table of contents) APPEARANCE
%\usepackage[nottoc,notlof,notlot]{tocbibind} % Put the bibliography
%in the ToC \usepackage[titles,subfigure]{tocloft} % Alter the style
%of the Table of Contents
%\renewcommand{\cftsecfont}{\rmfamily\mdseries\upshape}
%\renewcommand{\cftsecpagefont}{\rmfamily\mdseries\upshape} % No bold!

%\usepackage[T1]{fontenc}
% \usepackage[latin9]{inputenc}
%\usepackage[active]{srcltx}
%\usepackage{setspace}
%\usepackage{lscape}
%\doublespacing
\usepackage[english]{babel}
\bibliographystyle{naturemag}
 \usepackage{url}
 \usepackage{graphics}



\title{Assembling large, complex environmental metagenomes}


\author{Adina Chuang Howe$^{1,2}$, Janet Jansson$^{3,4}$, Stephanie A.
Malfatti$^{3}$, Susannah G. Tringe$^{3}$, James M. Tiedje$^{1,2}$, and C. Titus
Brown$^{1,5\ast}$} \begin{document} \maketitle

\begin{affiliations} \item Microbiology and Molecular Genetics, Michigan State
University, East Lansing, MI, USA\\ \item Plant, Soil, and Microbial Sciences,
Michigan State University, East Lansing, MI, USA\\ \item Department of Energy
(DOE) Joint Genome Institute, Walnut Creek, CA, USA\\ \item Lawrence Berkeley
National Laboratory, Earth Sciences Division, Berkeley, CA, USA\\ \item Computer Science and Engineering, Michigan State University, East Lansing, MI, USA\\
\end{affiliations}

\begin{abstract} The large volumes of sequencing data required to sample
complex environments deeply pose new challenges to sequence analysis approaches.
\emph{De novo} metagenomic assembly effectively reduces the total amount of
data to be analyzed but requires significant computational resources. We apply
two pre-assembly filtering approaches, digital normalization and partitioning, to
make large metagenome assemblies more computationally tractable. Using a human
gut mock community dataset, we demonstrate that these methods result in
assemblies nearly identical to assemblies from unprocessed data. We then
assemble two large soil metagenomes from matched Iowa corn and native prairie
soils. The predicted functional content and phylogenetic origin of the
assembled contigs indicate significant taxonomic differences despite similar
function. The assembly strategies presented are generic and can be extended to
any metagenome and any assembler; full source code is freely available under a
BSD license. \end{abstract}

Complex microbial communities operate at the heart of many crucial terrestrial,
aquatic,and host-associated processes, providing critical ecosystem
functionality that underpins much of biology
\cite{Arumugam:2011p735,Hess:2011p686,Iverson:2012p1281,
Mackelprang:2011p1087,Qin:2010p189,Tringe:2005p174,Venter:2004p170}. DNA
sequencing has begun to reveal the enormous diversity and heterogeneity
associated within these systems, making them difficult to study {\em in situ}
\cite{Hess:2011p686,Mackelprang:2011p1087,Qin:2010p189}. With ultradeep
sequencing, we now have unprecedented access to even the rare species in these
environment (i.e., ~50 Tbp required to adequately sample a gram of soil
\cite{Gans:2005p1365}).

Alongside sequencing breakthroughs, new challenges to sequence analysis
approaches have emerged. Sequencing dataset sizes are growing at an exponential
rate, requiring significant computational resources for data storage and
analysis. A single metagenomic project can readily generate as much or more
data than is in global reference databases;for example, a human-gut metagenome
sample containing 578 Gbp \cite{Qin:2010p189} produced more than double the data
in NCBI RefSeq (Release 56). Moreover, short reads contain only minimal signal for
homology searches and are error-prone, limiting direct annotation approaches
against reference databases. And finally, the majority of genes sequenced from
complex metagenomes typically contain little or no similarity to experimentally
studied genes, further complicating homology analysis
\cite{Arumugam:2011p735,Qin:2010p189}.

Consequently, investigators of metagenomic datasets are confounded by
overwhelming volumes of data which they do not have the computational resources
to efficiently analyze or which are unsuitable for the current suite of
bioinformatic tools (because of short read lengths or a lack of reference
genomes). \emph{De novo} assembly of sequence data offers several advantages
for analyzing metagenomic datasets. It provides improved accuracy of sequences by
removing most random sequencing errors and results in contigs longer and more
specific than unassembled sequencing reads. Further, assembly significantly
reduces the total volume of data required for downstream analysis (e.g., gene
annotation). Importantly, \emph{de novo} assembly also does not rely on the
existence of reference genomes, thus allowing for the discovery of novel
genomic elements. The main challenge for metagenomic applications of \emph{de
novo}
assembly is that current assembly tools do not scale to the high diversity and
large volume of metagenomic data: metagenomes from rumen, human gut, and
permafrost soil sequencing could only be assembled by discarding low abundance
sequences prior to assembly
\cite{Hess:2011p686,Mackelprang:2011p1087,Qin:2010p189}. Although many
metagenome-specific assemblers have recently been developed for community
assembly, they cannot work with the volume of reads necessary to achieve high
coverage for extremely diverse environmental metagenomes
\cite{Scholz:2012p1372}.

Here, we present two pre-assembly read filtering strategies, digital
normalization and partitioning, that provide a general strategy for scaling and
improving metagenome assembly. Digital normalization normalizes sequence
coverage and reduces the dataset size by discarding reads from high-coverage
regions \cite{browndiginorm}. Subsequently, partitioning separates reads based
on transitive connectivity, resulting in easily assembled subsets of reads
\cite{howeartifacts,Pell:2012cq}. We evaluate these approaches by applying them
to a human gut mock community (HGMC) dataset, and find that these filtering
methods result in assemblies nearly identical to assemblies from the
unprocessed dataset. Moreover, we show that partitioning separates most reads
into
species-level bins, providing an alternative to abundance-based and k-mer
approaches to species clustering.

We next apply these approaches to the assembly of previously intractable
metagenomes from two matched soils, 100-year cultivated Iowa agricultural soil
and native Iowa prairie. We compare the predicted functional capacities and
phylogenetic origins of the assembled contigs and conclude that despite
significant phylogenetic differences, the functions encoded in both soil data
sets are similar. We also show that virtually no strain-level heterogeneity is
detectable within the assembled reads.

\section*{Results}

\section*{Data reduction results in similar assemblies}

We evaluated the recovery of reference genomes from {\em de novo} metagenomic
assembly of the HGMC dataset by comparing unfiltered traditional assembly to
the the described filtered assembly (Fig. \ref{flowchart}; see Methods and
Supplementary Information). The abundance of genomes within the HGMC dataset
was estimated based on the reference genome coverage of sequence reads in the
unfiltered dataset. Coverage (excluding genomes with less than 3-fold coverage)
ranged from 6-fold to 2,000-fold (Supplementary Table 1 and Supplementary Fig.
2and 3). Overall, the unfiltered dataset reads covered a total of 93\% of the
reference genomes. Reads were removed based on their coverage within the
dataset (See Methods); we removed a total of 5.9 million reads (40\% of the total
reads) from the original HGMC dataset (Table ~\ref{data-summary}). The remaining
reads in the filtered HGMC dataset covered 91\% of the reference genomes (Table
~\ref{data-summary} and Supplementary Fig. 2 and 3).

We next compared the recovery of reference genomes from contigs assembled from
the original and filtered HGMC datasets. Using the Velvet assembler
\cite{Zerbino:2008p665},we recovered 43\% and 44\% of the reference genomes,
respectively. The assembly of the original dataset contained 29,063 contigs and
38 million bp, while the filtered assembly contained 30,082 contigs and 35
million bp (Table ~\ref{assembly-summary}). Comparable recoveries of references
between original and filtered datasets were also obtained with other assemblers
(SOAPdenovo \cite{Li:2010jz} and Meta-IDBA \cite{Peng:2011p898} ,
Table~\ref{assembly-summary}). Overall, the unfiltered and filtered assemblies
were very similar, sharing 95\% of genomic content. In the most abundant
references (the plasmids NC\_005008.1, NC\_005007.1, and NC\_005003.1), the
unfiltered assembly recovered significantly more of the original sequence;
however, for the large majority of genomes,the filtered assembly recovered
similar (and sometimes greater) amounts of the reference genomes (Supplementary
Fig. 2 and 3). The distribution of contig lengths in unfiltered and filtered
assemblies were also comparable (Supplementary Fig. 4).

We compared the estimated abundance of genomes in the HGMC dataset using reads
aligned to reference genomes and the unfiltered and filtered assemblies. Genome
abundance was estimated through the alignment of unassembled reads to either
the reference genome or assembled contigs. Sequencing coverage was determined as
the median base pair coverage of all aligned reads. (Supplementary Fig. 5). For
assembled contigs with coverage greater than 5, the majority of reads which
could be aligned contigs were also mapped to reference genomes (Supplementary
Fig. 3 and 4). Below this threshold, reads were mapped to reference genomes but
were less likely to be associated with assembled contigs. Comparing the
unfiltered and filtered assemblies, the estimated abundance of the HGMC genomes
from the filtered assembly were significantly closer to predicted abundances
from reference genomes (\emph{n} = 28,652; p-value = 0.032, see Supplementary
Information).

\section*{Partitioning separates most reads by species}

We next partitioned the filtered data set based on de Bruijn graph connectivity
and assembled each partition independently \cite{howeartifacts, Pell:2012cq}.
The HGMC dataset was partitioned into 85,818 disconnected partitions containing
a total of 9 million reads (Supplementary Fig. 6). Among these, only 2,359
(2.7\%) of the partitions contained reads originating from more than one
genome,indicating that partitioning separated reads from distinct species. The
resulting assemblies of the unpartitioned and partitioned dataset were very
similar, sharing 99\% identical genomic content.

In general, reference genomes with high sequencing coverage were associated
with fewer partitions (Supplementary Table 1); a total of 112 partitions
contained
reads from high abundance reference genomes (coverage above 25) compared to
2,771 partitions associated with lower abundance genomes (coverage below 25).
This is consistent with previous observations where low coverage in sequences
cause ``breaks'' in connectivity within the assembly graph
\cite{Chaisson:2008p1373,Pevzner:2001p1374}.

To further evaluate the effects of partitioning, we introduced spiked,
simulated reads from \emph{E. coli} genomes into the HGMC dataset. First,
simulated reads
from a single genome (\emph{E. coli} strain E24377A, NC\_009801.1 with 2\%
substitution error and 10x coverage) were added to the HGMC dataset and the
resulting dataset, HGMC.Ecoli1, was filtered, partitioned, and assembled as
described below. Similar amounts of data reduction after digital normalization
and partitioning (Table ~\ref{data-summary}) were observed. Among the 81,154
partitioned sets of reads in the HGMC.Ecoli1 dataset, only 2,580 (3.2\%)
partitions contained reads from multiple genomes. In total, 424
partitions contained reads from the spiked \emph{E. coli} genome (201 partitions
contained \emph{only} spiked reads) and when assembled, these reads aligned to
99.5\% of \emph{E. coli} strain E24377A genome (4,957,067 of 4,979,619 bp)
(Supplementary Fig. 6).

Next, we introduced five closely-related \emph{E. coli} strains into original
HGMC dataset. This dataset, referred to as HGMC.EColi5, was filtered,
partitioned, and assembled, resulting in 81,425 partitions. Among these, 1,154
(1.4\%) partitions contained reads associated with multiple genomes. Among the
partitions which contained reads associated with a single genome, 658
partitions contained reads originating from one of the spiked \emph{E. coli}
strains. In
partitions containing reads from more than one genome, 224 partitions contained
reads from a spiked \emph{E. coli} strain and one other reference genome
(either another spiked strain or from the mock community dataset)(Supplementary
Fig. 7).Independently assembling the partitions containing reads originating
from the
spiked \emph{E. coli} strains resulted in 6,076 contigs, all but three
originating from a spiked \emph{E. coli} genome. The remaining three contigs
were more than 99\% similar to available reference genomes (NC\_000915.1,
NC\_003112.2, and NC\_009614.1). The contigs associated with \emph{E. coli}
were aligned against the spiked reference genomes, recovering greater than 98\%
of
each of the five genomes. Many of these contigs contained similarities to reads
originating from multiple genomes found in the HGMC (Supp Fig. 8), and 3,075
contigs (51\%) could be aligned to reads which originated from more than one
spiked genome.

The assembly of the HGMC.Ecoli5 dataset was also performed without removal of
any reads (e.g., no digital normalization or partitioning). Comparing the
HGMC.Ecoli5 unfiltered and filtered assemblies, we found that the fraction of
contigs which are associated with multiple genomes were similar. In the
unfiltered data set, 66\% of 4,702 contigs were associated with spiked reads
which could have originated from more than one genome.

\section*{Data reduction and partitioning enable the assembly of two soil
metagenomes} We next applied digital normalization and partitioning approaches
to the {\em de novo} assembly of two soil metagenomes. Unfiltered Iowa corn and
prairie datasets (containing 1.8 billion and 3.3 billion reads, respectively)
could not be assembled by Velvet in 500GB of RAM. A 75 million reads subset of
the Iowa corn dataset alone required 110 GB of memory, suggesting that assembly
of the 3.3 billion read data set might need as much as 4 TB of RAM
(Supplementary Fig. 9). Applying the same filtering approaches as used for the
HGMC dataset, the Iowa corn and prairie datasets were reduced to 1.4 billion
and 2.2billion reads, respectively, and after partitioning, a total of 1.0
billion
and 1.7 billion reads remained, respectively. Pre-filtering used 300 GB of RAM
or less. Notably, the large majority of k-mers in the soil metagenomes are
relatively low-abundance (Fig. ~\ref{diginormcoverage}), and consequently
digital normalization did not remove as many reads in the soil metagenomes as
in the mock data set.

Based on the HGMC dataset, we estimated that above a sequencing depth of five,
the large majority of sequences could be assembled into contigs larger than 300
bp (Supplementary Fig. 1). Given the greater diversity expected in the soil
metagenomes, we normalized these datasets to a sequencing depth of 20 (i.e.,
discarding redundant reads within dataset above this coverage). After
partitioning the filtered datasets, we identified a total 31,537,798 and
55,993,006 partitions (containing more than five reads) in the corn and prairie
datasets, respectively. For assembly, we grouped partitions together into files
containing a minimum of 10 million reads. Data reduction and partitioning were
completed in less than 300 GB of RAM; once partitioned, each group of reads
could be assembled in less than 14 GB and 4 hours. This readily enabled the
usage of multiple assemblers and assembly parameters.

The final assembly of the corn and prairie soil metagenomes resulted in a total
of 1.9 million and 3.1 million contigs greater than 300 bp, respectively, and a
total assembly length of 912 million bp and 1.5 billion bp, respectively. To
estimate abundance of assembled contigs and evaluate incorporation of reads,
all quality-trimmed reads were aligned to assembled contigs. Overall, for the
Iowa
corn assembly, 8\% of single reads and 10\% of paired end reads mapped to the
assembly. Among paired end reads, 95.5\% of the reads aligned concordantly. In
the Iowa prairie assembly, 10\% of the single reads and 11\% of the paired end
reads aligned to the assembled contigs, and 95.4\% of the paired ends aligned
concordantly (Table ~\ref{read-map}). Based on the alignment of sequencing
reads to assembled contigs, we estimated the distribution of sequencing coverage
in
resulting assemblies (Fig. ~\ref{soilassemblycoverage}). Overall, there is
a positively skewed distribution of read overage of all contigs from both soil
metagenomes, biased towards a coverage of less than ten-fold, and 48\% and 31\%
of total contigs in Iowa corn and prairie assemblies respectively had a median
basepair coverage less than 10.

As the resulting assemblies are consensus sequences representative of the
unassembled dataset, we investigated the degree of variation (i.e.,
polymorphism) present among aligned reads to assembled contigs. (Supplementary
Information Methods). For both the Iowa corn and prairie metagenomes, more than
99.9\% of contigs contained base calls which were supported by a 95\% consensus
from mapped reads over 90\% of their lengths, demonstrating an unexpectedly low
polymorphism rate (Supplementary Fig. 10).
 
\section*{Annotation of the soil assemblies revealed similar functional
profiles but different taxonomy} We annotated assembled contigs through the
MG-RAST
pipeline, which was modified to account for per-contig abundance. This
annotation resulted in 2,089,779 and 3,460,496 predicted protein coding regions
in the corn and prairie metagenomes, respectively. The large majority of these
regions did not share similarity with any gene in the M5NR database -- 61.8\% in
corn and 70.0\% in prairie. In total, 613,213 (29.3\%)and 777,454 (22.5\%)
protein coding regions were assigned to functional categories. The functional
profile of these annotated features against SEED subsystems were compared (Fig.
~\ref{subsystem}). For both the corn and prairie metagenomes, the most abundant
functions in the assembly were associated with the carbohydrate (e.g., central
carbohydrate metabolism and sugar utilization), amino acid (e.g., biosynthesis
and degradation), and protein (e.g., biosynthesis, processing, and
modification)metabolism subsystems. The subsystem profile of both metagenomes
were very
similar while the taxonomic profile of the metagenomes based on the originating
taxonomy (phyla) was different (Fig. ~\ref{phyla}, Supp Methods). Within both
metagenomes, Proteobacteria were most abundant. In Iowa corn, Actinobacteria,
Bacteroidetes, and Firmicutes were the next most abundant, while in the Iowa
prairie, Acidobacteria, Bacteroidetes, and Actinobacteria were the next most
abundant. The Iowa prairie also had nearly double the fraction of
Verrucomicrobia than did Iowa corn.


\section*{Discussion}

%\subsection{Filtering approaches effectively reduce datasets}

We acknowledge that the diversity and sequencing depth represented by the HGMC
dataset is extremely low compared to that of most environmental metagenomes;
however, it represents a simplified, unevenly sampled model for a metagenomic
dataset which allows for the evaluation of novel approaches through the
availability of source genomes. For this dataset, the filtering approaches
described above were effective at reducing the dataset size without significant
loss of assembly. This strategic filtering normalizes the abundance of reads in
a dataset to a specific sequencing coverage. As there exists an observed
coverage ``sweet spot'' at which point sufficient sequences are present for
robust assembly (Supp Fig. 1), this effectively reduces the volume of the
dataset for assembly while removing errors introduced by extraneous reads.
Further, this normalization also resulted in more even coverage (Fig.
~\ref{diginormcoverage}), minimizing assembly problems caused by variable
coverage. Additional reduction of the dataset was achieved by the removal of
high abundance sequences \cite{howeartifacts}.

The specific effects of filtering varied depending on characteristics of
genomes. We observed that variable abundance and conserved regions in
references had an impact on the recovery of sequences in filtered HGMC datasets.
The
filtered assemblies of the three plasmids of the \emph{Staphylococcus
epidermidis} genome (NC\_005008.1, NC\_005007.1, and NC\_005003.1) were highly
abundant (Supplementary Table 1) and shared several conserved regions (90\%
identity over more than 290 bp). During normalization, repetitive elements in
these genomes appear as high coverage elements and consequently would be
removed, as evidenced by a large difference in the number of reads associated
with NC\_005008.1 in the unfiltered and filtered (normalized) datasets
(supplementary Fig. 2). The unfiltered dataset contained comparably more reads
spanning these repetitive regions which likely enabled assemblers to more
effectively extend the unfiltered assembly of these sequences, ultimately
observed as an increased recovery of these genomes in these assemblies.

This result, though rare among genomes in the HGMC dataset, identifies a
shortcoming of our approach, and indeed for most short-read assembly
approaches,related to repetitive regions and/or polymorphisms. We acknowledge
that in our
assembled soil metagenomes, data reduction may cause information loss but
exchange this disadvantage for the ability to assemble previously intractable
data sets. Overall, evaluation of the HGMC dataset suggests that this
information loss is minimal and that our approaches result in a comparable
assembly whose abundance estimations are even slightly improved.

%\subsection{Partitioning effectively separates genomes for assembly}
Metagenomes contain many distinct genomes, which are largely disconnected from
each other but which often share sequences due to sequence conservation or
lateral transfer. Our pre-filtering approach removes both common multi-genome
elements as well as artificial connectivity stemming from the sequencing
process(\cite{howeartifacts}). As shown above on the HGMC dataset, the removal of
these
sequences does not significantly alter the recovery of reference genomes
through {\em de novo} assembly: the resulting assemblies of unfiltered, filtered,
and
filtered and partitioned datasets were nearly identical. Further, the large
majority of these partitions contained reads from a single reference genome,
supporting our previous hypothesis that most connected subgraphs contain reads
from distinct genomes \cite{Pell:2012cq}. As expected, high abundance,
well-sampled genomes were found to contain fewer partitions and low abundance,
under-sampled genomes contained more partitions, due to fragmentation of the
assembly graph.

To further examine the recovery of sequences through partitioning,
computationally-derived sequences from one or more \emph{E. coli} strains were
amended to the HGMC dataset. When we spiked in a single \emph{E. coli} strain,
we could reassemble 99\% of the original genome (Supplementary Fig. 6). When we
spiked in five closely related strains, we could recover the large majority of
the genomic content of these strains, albeit largely in chimeric contigs
(Supplementary Fig. 8). This result is not unique to our approaches, however, as
assemblies of the unfiltered dataset resulted in a slightly higher fraction of
assembled contigs associated with multiple references. Overall, closely related
sequences which result from either repetitive or inter-strain polymorphisms
challenge assemblers, and our approaches are not specifically designed to
target such regions. However, the partitions resulting from our approach could
provide
a much reduced subset of sequences to be targeted for more sensitive assembly
approaches for highly variable regions (i.e. overlap-layout-consensus
approaches or abundance binning approaches \cite{Sharon:2012kx}).

One valuable result of partitioning is that it subdivides our datasets into
sets of reads which can be assembled with minimal computational resources. For
the
HGMC dataset,this gain was small, reducing unfiltered assembly at 12 GB and 4
hours to less than 2 GB and 1 hour. However, for the soil metagenomes,
previously impossible assemblies could be completed in less than a day and in
under 14 GB of memory enabling the usage of multiple assembly parameters (e.g.,
k-length) and
multiple assemblers (Velvet, SOAPdenovo, and MetaIDBA).

%\subsection{Soil assembly}

The final assemblies of the corn and prairie soil metagenomes resulted in a
total of 1.9million and 3.1 million contigs, respectively, and a total assembly
length of 912 million bp and 1.5 billion bp, respectively -- equivalent to
$\approx$ 500 \emph{E. coli} genomes worth of DNA. We evaluated these
assemblies based on paired-end concordance, which showed that the majority of the
assembled contigs agreed with the raw sequencing data. Overall, there is a
positively
skewed distribution of abundance of all contigs from both soil metagenomes,
biased towards an abundance of less than ten, indicative of the low sequencing
coverage of these metagenomes.

This study represents the largest published soil metagenomic sequencing effort
to date, and these assembly results demonstrate the enormous amount of
diversity within the soil.Even with this level of sequencing, millions of
putative genes
were defined for each metagenome, with hundreds of thousands of functions. More
than half of the assembled contigs are not similar to anything in known
databases, suggesting that soil holds considerable unexplored taxonomic and
functional novelty. Among the protein coding sequences which were annotated,
comparisons of the two soil datasets suggests that the functional profiles are
more similar to one another than the complementing phylogenetic profiles. This
result supports previous hypotheses that despite large diversity with
two different soil systems, the microbial community provides similar overall
function
\cite{Girvan:2005jv,McGradySteed:1997uj,Muller:2002cd,Konstantinidis:2004hr}.

%\section*{Conclusion}

We present two strategies that readily enable the assembly of very large
environmental metagenomes by discarding redundancy and subdividing the data
prior to assembly. The strategies are generic and broadly applicable to any
metagenome. We demonstrate their effectiveness by first evaluating them on the
assembly of a mock community metagenome, and then applying them to two
previously intractable soil metagenomes. Partitioning is an especially valuable
approach because it enables the extraction of read subsets that belong to
individual species. These read partitions are small enough that a variety of
assembly, abundance analysis, and polymorphism analysis techniques can be
easily applied to them individually.

These strategies filter and partition reads prior to assembly and can work with
any assembler, considerably reducing the computational resources needed to
complete an assembly. By acting as pre-filters, digital normalization and
partitioning let downstream assemblers focus on improving their performance on
low-coverage or high variability data without a strong consideration for
computational resources. This should enable significant improvement of
metagenome assembly techniques going forward and provide the critical references
which will enable future investigations of complex environments.

\begin{addendum}
\item This project was supported by Agriculture and Food Research Initiative
Competitive Grant no. 2010-65205-20361 from the United States
Department of Agriculture, National Institute of Food and Agriculture
and National Science Foundation IOS-0923812, both to C.T.B. A.H. was
supported by NSF Postdoctoral Fellowship Award \#0905961 and the Great
Lakes Bioenergy Research Center (Department of Energy BER
DE-FC02-07ER64494). The work conducted by the U.S. Department of
Energy Joint Genome Institute is supported by the Office of Science of
the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
We acknowledge the support of Krystle Chavarria and
Regina Lamendella for extraction of DNA from Great Prairie soil samples
and the technical support of Eddy Rubin and Tijana Glavina del Rio
at the DOE JGI and John Johnson and Eric McDonald at MSU HPC.
%Add any grants for Titus including Amazon
\item[Author Contributions] A.H. and C.T.B. designed experiments and wrote
paper.
A.H. performed experiments and analyzed the data. J.J., S.T., and J.T. discussed
results
and commented on manuscript. S.M. managed raw sequencing datasets.
\item[Competing Interests] The authors declare that they have no competing
financial
interests.
\item[Correspondence] Correspondence should be addressed to C. Titus Brown
(ctb@msu.edu).
\end{addendum}




\pagebreak

%Tables need to be submitted in a word document - gross, really? I leave them here for now.

\section*{Tables}

\begin{table}[h]
\caption{The total number of reads in unfiltered, filtered (normalized
  and high abundance (HA) k-mer removal), and partitioned datasets and
  the computational resources required (memory and time).}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{l c c c }

& Unfiltered & Filtered & Partitioned \\
& Reads (Mbp) & Reads (Mbp) & Reads (Mbp) \\
\hline
HMP Mock & 14,494,884 (1,136) & 8,656,520 (636) & 8,560,124 (631) \\
HMP Mock Spike & 14,992,845 (1,137) & 8,189,928 (612) & 8,094,475 (607) \\
HMP Mock Multispike & 17,010,607 (1,339) & 9,037,142 (702) & 8,930,840 (697) \\
Iowa Corn & 1,810,630,781 (140,750) & 1,406,361,241 (91,043) & 1,040,396,940 (77,603) \\
Iowa Prairie & 3,303,375,485 (256,610) & 2,241,951,533 (144,962) & 1,696,187,797 (125,105) \\
\\
 & Unfiltered (GB / h) & Filtered and Partitioned (GB / h) \\
HMP Mock & 4 / $<$2 & 4 / $<$2 \\
HMP Mock Spike & 4 / $<$2 & 4 /
$<$2 \\
HMP Mock Multispike & 4 / $<$2 &
4 / $<$2 \\
Iowa Corn & 188 / 83 &
234 / 120 \\
Iowa Prairie & 258 / 178 & 287 / 310 \\ \hline

\end{tabular}%
}
\label{data-summary}
\end{table}

\newpage

\begin{table}[h]
\caption{Assembly summary statistics (total contigs, total million bp
  assembly length, maximum contig size bp) of unfiltered (UF) and
  filtered (F) or filtered/partitioned (FP) datasets with Velvet (V)
  assembler. Assembly for UF and FP datasets also shown for MetaIDBA
  (M) and SOAPdenovo(S) assemblers. Iowa corn and prairie metagenomes
  could not be completed on unfiltered datasets.}
\resizebox{\columnwidth}{!}{%
\begin{tabular}{l c c c c}
& UF & F & FP & Assembler \\
\hline
HMP Mock & 29,063 / 38 / 146,795 & 30,082 / 35 / 90,497 & 30,115 / 35
/ 90,497 & V \\
HMP Mock & 24,300 / 36 / 86,445 & - & 27,475 / 36 / 96,041 & M \\
HMP Mock & 36,689 / 37 / 32,736 & - & 29,295 / 37 / 58,598 & S \\
Iowa corn & N/A & N/A & 1,862,962 / 912/ 20,234 & V \\
Iowa corn & N/A & N/A & 1,334,841 / 623 / 15,013 & M \\
Iowa corn & N/A & N/A & 1,542,436 / 675 / 15,075 & S \\
Iowa prairie & N/A & N/A & 3,120,263 / 1,510 / 9,397 & V \\
Iowa prairie & N/A & N/A & 2,102,163 / 998 / 7,206 & M \\
Iowa prairie & N/A & N/A & 2,599,767 / 1,145 / 5,423 & S \\
\end{tabular}%
}
\label{assembly-summary}
\end{table}

\newpage

\begin{table}[h]
\caption{Assembly comparisons of unfiltered (UF) and filtered (F) or
  filtered/partitioned (FP) HMP mock datasets using different
  assemblers (Velvet (V), MetaIDBA (M) and SOAPdenovo (S)). Assembly
  content similarity is based on the fraction of alignment of
  assemblies and similarly, the coverage of reference genomes is based
  on the alignment of assembled contigs to reference genomes (RG).}
\begin{tabular}{l c c c}
Assembly Comparison & Percent Similarity & RG Coverage & Assembler \\
\hline
UF vs. F & 95\% & 43.3\% / 44.5\% & V \\
UF vs. FP & 95\% & 43.3\% / 44.4\% & V\\
UF vs. FP & 93\% & 46.5\% / 45.4\% & M\\
UF vs. FP & 98\% & 46.2\% / 46.4\% & S\\
\end{tabular}
\label{assembly-compare}
\end{table}

\newpage

\begin{table}[h]
\caption{Fraction of single-end (SE) and paired-end (PE) reads mapped
  to Iowa corn and prairie Velvet assemblies.}
\begin{tabular}{l c c}
 & Iowa Corn Assembly & Iowa Prairie Assemby \\
 \hline
Total Unfiltered Reads & 1,810,630,781 & 3,303,375,485\\
Total Unfiltered SE Reads & 141,517,075 & 358,817,057\\
SE aligned 1 time & 11,368,837 & 32,539,726\\
SE aligned $>$ 1 time & 562,637 & 1,437,284\\
\% SE Aligned & 8.43\% & 9.47\% \\
Total Unfiltered PE Reads & 834,556,853 & 1,472,279,214\\
PE aligned 1 time & 54,731,320 & 110,353,902\\
PE aligned $>$ 1 time &1,993,902 & 3,133,710\\
\% PE Aligned Disconcordantly & 0.47\% & 0.63\%\\
\% PE Aligned & 9.68\% & 11.20\%\\
\end{tabular}
\label{read-map}
\end{table}

\newpage

\section*{Figures}

\begin{figure}[ht]
\center{\includegraphics[width=\textwidth,height=\textheight,keepaspectratio]
{./figures/fig1-hmpassemblyhist.pdf}}
\caption{K-mer coverage of HMP mock community dataset before and
  after filtering approaches.}
\label{kmercoverage}
\end{figure}

\newpage

\begin{figure}[ht]
\center{\includegraphics[width=\textwidth,height=\textheight,keepaspectratio]
{./figures/fig2-diginormhist.pdf}}
\caption{K-mer coverage of Iowa corn and prairie metagenomes before
  and after filtering approaches.}
\label{diginormcoverage}
\end{figure}

\newpage

\begin{figure}[ht]
\center{\includegraphics[width=\textwidth,height=\textheight,keepaspectratio]
  {./figures/fig3-coverage.pdf}}
\caption{Coverage (median basepair recovered) distribution of assembled contigs
  from Iowa corn soil (top) and Iowa prairie soil (bottom) metagenomes.}
\label{soilassemblycoverage}
\end{figure}

\newpage

\begin{figure}[ht]
\center{\includegraphics[width=\textwidth,height=\textheight,keepaspectratio]
  {./figures/fig4-phyla.pdf}}
\caption{Phylogenetic distribution from SEED subsystem annotations for
  Iowa corn and prairie metagenomes.}
\label{phyla}
\end{figure}

\newpage

\begin{figure}[ht]
\center{\includegraphics[width=\textwidth,height=\textheight,keepaspectratio]
  {./figures/fig5-subsystems.pdf}}
\caption{Functional distribution from SEED subsystem annotations for
  Iowa corn and prairie metagenomes.}
\label{subsystem}
\end{figure}

\newpage
\newpage

\begin{figure}[ht!]
\center{\includegraphics[scale=0.5]{./figures/flowchart.pdf}}
\caption{Flowchart describing methods for \emph{de novo} metagenomic assembly. Using the HMP
mock community dataset, alternative approaches for data reduction and assembly were compared.
(a) Disconnected subgraphs of the assembly graph were partitioned. Most connected subgraphs contained
reads and contigs aligning to distinct genomes (Supplementary Fig. 6). (b) The genomic content of all assemblies were
found to be comparable in genomic content. (c) Reads and assembled contigs could be aligned to reference
genomes to determine effectiveness of recovery. (d) The abundance of contigs (based on read mapping) could
be compared to estimated abundances of corresponding reference genomes.}
\label{flowchart}
\end{figure}


\newpage

\newpage
\clearpage

\section*{Online Methods}

\section*{Assembly Pipeline}

The entire assembly pipeline for the mock community is described in
detail in an IPython notebook available for download at
\emph{http://nbviewer.ipython.org/urls/raw.github.com/ngs-docs/ngs-notebooks/
master/ngs-70-hmp-diginorm.ipynb}
and
\emph{http://nbviewer.ipython.org/urls/raw.github.com/ngs-docs/ngs-notebooks/
master/ngs-71-hmp-diginorm.ipynb}. Soil assembly was performed with
the same pipeline and parameter changes as described in Supplementary Information.
The annotated metagenome for Iowa corn can be found at
\emph{http://metagenomics.anl.gov/linkin.cgi?metagenome=4504797.3} and
Iowa prairie at \emph{http://metagenomics.anl.gov/linkin.cgi?metagenome=4504798.3}.

\section*{Statistical Methods}
The reference-based abundance (from reads mapped to reference genomes)
and assembly-based abundance (from reads mapped to contigs) of genomes
were compared. Using a one-directional, paired t-test of squared
deviations, the abundance estimates of the unfiltered and filtered
assemblies were compared. The mean and standard deviation of
the abundances of unfiltered contigs, filtered contigs, and reference genes
were 6.8 +/- 7.1, 8.1 +/- 7.7, and 7.8 +/- 5.2, respectively. We expected the
filtered assembly to have increased accuracy due to a reduction of errors (e.g. normalization
and high abundance filtering) and used a one-sided t-test which
indicated that abundance estimations from the filtered assembly were
significantly closer to predicted abundances from reference genomes
(n=28,652, p-value of 0.032).

\newpage

\bibliography{assembly-paper}

\end{document}
% LocalWords: situ ultradeep Tbp dataset metagenomic metagenome Gbp NCBI novo
% LocalWords: RefSeq metagenomes contigs
Something went wrong with that request. Please try again.