many minor changes

commit 0bca8fcb1eca2d178ef26190d48bdd45ce0e12f2 1 parent c9d38a1
C. Titus Brown ctb authored

Showing 1 changed file with 139 additions and 122 deletions.

  1. +139 122 recent-version/artifacts-paper2.tex
261 recent-version/artifacts-paper2.tex
@@ -82,53 +82,53 @@
82 82 % Please keep the abstract between 250 and 300 words
83 83 \section*{Abstract}
84 84
  85 +% @CTB revisit first sentence.
85 86 Coverage-based assembly approaches for metagenomic datasets are
86 87 hindered by the presence of sequencing errors and biases. Here, we
87   -examine several metagenomes for the presence of such sequencing
88   -artifacts through a connectivity analysis of reads within a
89   -representation of each metagenome's respective assembly graph. We
  88 +examine several metagenomes for the presence of sequencing
  89 +artifacts through a connectivity analysis of reads within
  90 +each metagenome's respective assembly graph. We
90 91 identified highly connected sequences which join a large proportion of
91 92 reads within each metagenome, suggesting the presence of
92 93 non-biological biases within sequencing reads. These sequences were
93   -found to be located at specific positions within original reads and
  94 +found to be biased towards specific positions within shotgun reads and
94 95 are minimally incorporated into final assemblies. The removal of
95   -these sequences prior to assembly resulted in similar assembly content
96   -for most metagenomes and enabled the partitioning of reads in the
97   -assembly graph by connectivity, significantly decreasing assembly
98   -memory and time requirements.
99   -
  96 +these sequences prior to assembly results in similar assembly content
  97 +for most metagenomes, and enables the use of graph partitioning
  98 +to decrease assembly memory and time requirements.
100 99
101 100 \section*{Introduction}
102 101
103   -Given the rapid decrease in the costs of sequencing, we can now
104   -achieve the sequencing depth necessary to study even the most complex
105   -environments \cite{Hess:2011p686,Qin:2010p189}. High throughput, deep
  102 +With the rapid decrease in the costs of sequencing, we can now
  103 +achieve the sequencing depth necessary to study microbes from even the most complex
  104 +environments \cite{Hess:2011p686,Qin:2010p189}. Deep
106 105 metagenomic sequencing efforts in permafrost soil, human gut, cow
107 106 rumen, and surface water have provided insights into the genetic and
108 107 biochemical diversity of environmental microbial populations
109   -\cite{Hess:2011p686,Iverson:2012p1281,Qin:2010p189} and the extent to
110   -which they are involved in responding to environmental changes
  108 +\cite{Hess:2011p686,Iverson:2012p1281,Qin:2010p189} and how
  109 +they are involved in responding to environmental changes
111 110 \cite{Mackelprang:2011p1087}. These metagenomic studies have all
112   -leveraged \emph{de novo} metagenomic assembly of short reads to assign
113   -sequences and functions to microbial taxa. \emph{De novo} assembly is
  111 +leveraged \emph{de novo} metagenomic assembly of short reads for
  112 +functional and phylogenetic analyses.
  113 +\emph{De novo} assembly is
114 114 an advantageous approach to sequence analysis as it reduces the
115   -dataset size by collapsing numerous short reads into fewer contigs and
116   -provides longer sequences containing multiple genes and operons
117   -\cite{Miller:2010p226,Pop:2009p798} making annotation-based approaches
118   -more practical. Furthermore, it does not rely on the availability of
  115 +dataset size by collapsing the more numerous short reads into fewer contigs and
  116 +enabling better annotation-based approaches by providing longer sequences.
  117 +\cite{Miller:2010p226,Pop:2009p798}
  118 +Furthermore, it does not rely on the a priori availability of
119 119 reference genomes to enable identification of novel genetic features
120 120 and draft genomes \cite{Hess:2011p686,Iverson:2012p1281}.
121 121 % @CTB what does ``enabling identification ... of draft genomes'' mean?
122 122
123 123 Although \emph{de novo} metagenomic assembly is a promising approach
124   -for deep sequencing of metagenomes, it is complicated by the variable
  124 +for metagenomic sequence analysis, it is complicated by the variable
125 125 coverage of sequencing reads from mixed populations in the environment
126 126 and their associated sequencing errors and biases
127 127 \cite{Mende:2012p1262,Pignatelli:2011p742}. Several
128   -metagenomic-specific assemblers have been developed to deal with
  128 +metagenome-specific assemblers have been developed to deal with
129 129 variable coverage communities, including Meta-IDBA
130   -\cite{Peng:2011p898}, MetaVelvet, and SOAPdenovo. These assemblers
131   -rely on local models of sequencing coverage to help build assemblies
  130 +\cite{Peng:2011p898}, MetaVelvet, and SOAPdenovo (cite). These assemblers
  131 +rely on analysis of local sequencing coverage to help build assemblies
132 132 and thus are sensitive to the effects of sequencing errors and biases
133 133 on coverage estimations of the underlying dataset. The effects of
133 134 sequencing errors on \emph{de novo} assembly have been demonstrated in
@@ -139,22 +139,22 @@ \section*{Introduction}
139 139 Specifically, these models exclude the presence of known
140 140 non-biological sequencing biases
141 141 \cite{GomezAlvarez:2009p1334,Keegan:2012p1336,Niu:2010p1333} which
142   -hinder coverage-based assembly approaches.
  142 +hinder assembly approaches.
143 143 % @CTB also remember to discuss polymorphism
144 144 % @CTB for isolated genomes, add Chitsaz citation.
145 145
146 146 In this study, we examine metagenomic datasets for the presence of
147 147 artificial sequencing biases that affect assembly graph structure,
148 148 extending previous work to large and complex datasets produced from
149   -the Illumina platform. We characterized sequence connectivity in an
  149 +the Illumina platform. We characterize sequence connectivity in an
150 150 assembly graph, identifying potential sequencing biases in regions
151 151 where numerous reads are connected together. Within metagenomic
152   -datasets, we found that there exist highly connected sequences which
153   -originate, at least partially, from sequencing artifacts and that
  152 +datasets, we find that there exist highly connected sequences which
  153 +partially originate from sequencing artifacts. Moreover,
154 154 these sequences limit approaches to divide or partition large datasets
155   -for further analysis, e.g. {\em de novo} assembly. Here, we present
156   -approaches to identify and characterize these highly connected
157   -sequences and examine the effects of removing these sequences on
  155 +for further analysis, and may introduce artifacts into assemblies. Here, we
  156 +identify and characterize these highly connected
  157 +sequences, and examine the effects of removing these sequences on
158 158 downstream assemblies.
159 159
160 160 \section*{Results}
@@ -162,16 +162,17 @@ \section*{Results}
162 162 \subsection*{Connectivity analysis of metagenome datasets}
163 163
164 164 \subsubsection*{Presence of a single, highly connected lump in all datasets}
165   -We selected datasets from three diverse, medium to high diversity
  165 +We selected datasets from three medium to high diversity
166 166 metagenomes from the human gut \cite{Qin:2010p189}, cow rumen
167   -\cite{Hess:2011p686}, and agricultural soil (SRX099904 and SRX099905),
168   -representing metagenomes sequenced to various depths (Table 1). To
  167 +\cite{Hess:2011p686}, and agricultural soil (SRX099904 and SRX099905)
  168 +(Table 1). To
169 169 evaluate the effects of sequencing coverage, we included two subsets
170 170 of the 520 million read soil metagenome containing 50 and 100 million
171   -reads. We also included a previously published error-free simulated,
  171 +reads. We also included a previously published error-free simulated
172 172 metagenome based on a mixture of 112 reference genomes
173 173 \cite{Pignatelli:2011p742}.
174 174
  175 +% @CTB refactor paragraph: using a DBG, cite Pell, etc.
175 176 Initially, we evaluated the amount of connectivity between all
176 177 sequences in each metagenome using an approach similar to the initial
177 178 step of short read assemblers to identify overlaps of short sequences
@@ -187,9 +188,8 @@ \subsubsection*{Presence of a single, highly connected lump in all datasets}
187 188 % @CTB cite
188 189
189 190 Using this assembly graph representation, we separated reads
190   -contributing to disconnected portions of the metagenome assembly graph
191   -(e.g., representatives from separate populations in the source
192   -environment). For each metagenome, regardless of origin, we found a
  191 +contributing to disconnected portions of the metagenome assembly graph.
  192 +For each metagenome, regardless of origin, we found a
193 193 single dominant, highly connected set of sequencing reads which we
194 194 henceforth refer to as the ``lump'' of the dataset (Table 1, column
195 195 3). This lump contained the largest subset of connected sequencing
@@ -203,64 +203,71 @@ \subsubsection*{Presence of a single, highly connected lump in all datasets}
203 203
204 204 \subsubsection*{Characterizing the connectivity in the dominant lump}
205 205
206   -Given the large number of reads connected within metagenomic lumps (up
207   -to 182 and 262 million reads in the soil and human gut datasets,
208   -respectively), we quantified the degree of connectivity of sequences
209   -within the lump by estimating the average local graph density from
210   -each k-mer (k=32 unless otherwise stated) in the assembly graph (See
  206 +% @CTB check scripts to see if this is an accurate characterization.
  207 +% @CTB put scripts in scripts/!!
  208 +We characterized the connectivity of sequences
  209 +within each lump by estimating the average local graph density from
  210 +each k-mer (k=32 unless otherwise stated) in the assembly graph (see
211 211 Methods). Here, local graph density is a measurement of total
212   -connected reads within a radius distance. We observed that sequences
  212 +connected reads within a fixed radius. Sequences
213 213 in the identified metagenomic lumps were characterized by very high
214   -local graph densities, between 22 to 50\% of the total nodes in
  214 +local graph densities: between 22 and 50\% of the total nodes in
215 215 metagenomic lump assembly graphs had average graph densities greater
216   -than 20 (Table 1). In comparison, 17\% of the total nodes in the
  216 +than 20 (Table 1). This means that these nodes were in very nonlinear portions of the assembly graph and had high connectivity. In comparison, 17\% of the total nodes in the
217 217 simulated lump had an average local graph density greater than 20, and
218   -a mixture of the 112 source genomes for the simulated dataset had
219   -fewer than 2\% of its nodes with an average graph density greater than
220   -20.
  218 +fewer than 2\% of the nodes in the entire simulated data set had an
  219 +average graph density higher than 20.
221 220
222 221 We next assessed the extent to which graph density varied by position
223   -along the sequencing reads. The degree of position-specific bias of
  222 +along the sequencing reads. The degree of position-specific variation of
224 223 graph densities was estimated by calculating the average local graph
225 224 density within ten steps of every k-mer by position in each read. In
226   -all environmental metagenomic reads, we observed biases in graph
  225 +all environmental metagenomic reads, we observed variation in graph
227 226 density at the 3'-end region of reads (Figure 1). In soil
228   -metagenomes, we observed the most dramatic biases with local graph
  227 +metagenomes, we observed the most dramatic variation with local graph
229 228 density increasing in sequences located at the 3'-end of the reads.
230   -Notably, this bias was not present in the simulated dataset.
  229 +Notably, this trend was not present in the simulated dataset.
231 230
232 231 Next, we performed an exhaustive traversal of the assembly graph and
233 232 identified the specific sequences within dense regions of the assembly
234 233 graph which consistently contributed to high connectivity. We
235   -observed that this subset of sequences were also found to exhibit
236   -position-specific biases within sequencing reads, with the exception
237   -of these sequences in the simulated dataset (Figure 1, solid lines).
238   -Similar to local density trends, position-specific biases of these
239   -sequences also varied between metagenomes. As sequencing coverage
240   -increased among metagenomes, the amount of 3'-end bias appeared to
241   -decrease (e.g., the soils) or inverse (e.g., rumen and human gut).
  234 +observed that this subset of sequences was also found to exhibit
  235 +position-specific variation within sequencing reads, with the
  236 +exception of these sequences in the simulated dataset (Figure 1, solid
  237 +lines). Similar to local density trends, position-specific trends in
  238 +the location of these sequences also varied between metagenomes. As
  239 +sequencing coverage increased among metagenomes, the amount of 3'-end
  240 +variation appeared to decrease (e.g., the soils) or invert (e.g.,
  241 +rumen and human gut).
242 242
243 243 \subsection*{Effects of removing highly connected sequences on assembly}
244 244
245 245 \subsubsection*{Removal of highly connected sequences enables graph partitioning of metagenome}
246   -Given that highly connected sequences exhibited position-specific
247   -biases associated with sequences of non-biological origin, we assessed
248   -the effects of their removal from reads in metagenomic lumps. We
  246 +
  247 +Since these highly connected sequences exhibited position-specific
  248 +variation indicative of sequences of non-biological origin, we removed
  249 +them and assessed the effect of their removal on assembly
  250 +(see Methods). We
249 251 found that by removing these k-mers, we could effectively break apart
250 252 metagenomic lumps, and the resulting largest partition of connected
251 253 reads in each metagenome was reduced to less than 7\% of the total
252 254 reads in the lump. As a consequence of partitioning the metagenomic
253   -lump, we were able to greatly reduce assembly requirements. Compared
  255 +lump, we were able to greatly reduce assembly requirements.
  256 +% @CTB refactor below
  257 +Compared
254 258 to unfiltered datasets which required greater than 100 GB and 100
255 259 hours in the case of the largest soil metagenome (Table 2), all
256 260 partitioned datasets could be assembled in less than 2 GB of memory
257 261 and less than 1 hour using multiple nodes.
258 262
259 263 \subsubsection*{Removal of highly connected sequences resulted in minimal losses of reference genes}
260   -To explore the extent to which the identified highly connected
261   -sequences impacted assembly, we first evaluated the effects of the
262   -removing these sequences from reads in the simulated lump and its
263   -resulting assemblies. The assembly of the reads in the original,
  264 +
  265 +% @CTB probably need to indicate that since lump is separated from rest
  266 +% we can assemble it separately w/o fear.
  267 +We explored the extent
  268 +to which the identified highly connected
  269 +sequences impacted assembly by first evaluating the effects of
  270 +removing these sequences from the simulated lump. The assembly of the reads in the original,
264 271 unfiltered simulated lump and that of the reads remaining after
265 272 removing highly connected sequences (the filtered assembly) were
266 273 compared for three assemblers: Velvet, Meta-IDBA, and SOAPdenovo.
@@ -277,17 +284,18 @@ \subsubsection*{Removal of highly connected sequences resulted in minimal losses
277 284 of over 3\% of the total unique 32-mers in the simulated metagenome,
278 285 the resulting filtered assemblies resulted in only a loss of 0.1 -
279 286 0.6\% of annotated original reference genes (Tables 1 and 2).
  287 +% @@CTB was normalized blast used here?
280 288
281 289 We next evaluated the effects of using similar approaches on
282   -metagenomic datasets. Similar to the simulated assemblies, the
  290 +real datasets. Similar to the simulated assemblies, the
283 291 removal of highly connected sequences for all metagenomes and
284   -assemblers resulted in a loss of total number of contigs and assembly
  292 +assemblers resulted in a decrease of total number of contigs and assembly
285 293 length (Table 2). In general, filtered assemblies were largely
286 294 contained within unfiltered assemblies and comprised 51-88\% of the
287 295 unfiltered assembly. The observed changes in metagenomic assemblies
288   -were difficult to evaluate as the source genomes to these datasets are
289   -unknown, and a loss in assembly length may actually be beneficial due
290   -to the elimination of contigs which incorporated sequencing artifacts.
  296 +were difficult to evaluate as no reference genomes exist,
  297 +and a decrease in assembly length may actually be beneficial if it
  298 +eliminates contigs that incorporate sequencing artifacts.
291 299 To aid in this evaluation, we used the previously published set of
292 300 rumen draft genomes from \emph{de novo} assembly efforts of high
293 301 abundance sequences in the rumen metagenome \cite{Hess:2011p686}.
@@ -306,15 +314,15 @@ \subsubsection*{Unfiltered assemblies contained only a small fraction of highly
306 314 dependent on the total length of the contig) and examined for the
307 315 presence of the previously identified highly connected sequences. We
308 316 found that contigs, especially in assemblies from Velvet and
309   -Meta-IDBA, incorporated a larger fraction of these sequences at its
310   -ends relative to other binned positions (Figure 3). The SOAPdenovo
  317 +Meta-IDBA, incorporated a larger fraction of these sequences at their
  318 +ends relative to other positions (Figure 3). The SOAPdenovo
311 319 assembler incorporated fewer of the highly connected sequences into
312 320 its assembled contigs; none of these sequences in the simulated
313 321 dataset were assembled, and only 41 in the small soil dataset. For
314 322 the human gut metagenome assemblies, millions of the highly connected
315 323 sequences were incorporated into assembled contigs, comprising nearly
316 324 4\% of all assembled sequences on Velvet contig ends (Figure 4,
317   -suggestion to move to supp figures).
  325 +suggestion to move to supp figures -- YES CTB).
318 326 % @CTB do we want to talk about end bins or percentile bins? Probably fine
319 327 % to leave as is.
320 328
@@ -332,27 +340,29 @@ \subsubsection*{Identifying origins of highly connected sequences in known refer
332 340 (out of how many kmers???), we identified the closest reference
333 341 protein from the NCBI-nr database requiring complete sequence
334 342 identity. Only 1,018 sequences (13\%) matched existing reference
335   -proteins, and many of the annotated sequences matched multiple
336   -conserved protein sequences from multiple genomes. The top five
  343 +proteins, and many of the annotated sequences matched to
  344 +genes conserved across multiple genomes. The top five
337 345 proteins conserved in greater than 3 genomes are shown in Table 4, and
338 346 largely encode for genes involved in protein biosynthesis, DNA
339 347 metabolism, and biochemical cofactors (Table 5).
340 348 % @CTB yes, out of of how many k-mers?
341 349 % @CTB what is our conclusion here, anyway, about the origin?
  350 +% @CTB what does ``top five'' mean here -- abundance?
342 351
343   -A potential cause of artificial high connectivity within metagenomes
344   -is the presence of high abundance sequences. Thus, we identified the
  352 +One potential cause of artificial high connectivity within metagenomes
  353 +is the presence of high abundance subsequences. Thus, we identified the
345 354 subset of highly connected k-mers which were also present with an
346 355 abundance of greater than 50 within each metagenome and their location
347 356 in sequencing reads (Figure 2, dotted lines). These high abundance
348 357 k-mers comprised a very small proportion of the identified highly
349 358 connected sequences, less than 1\% in the soils, 1.5\% in the rumen,
350 359 and 6.4\% in the human gut metagenomes, but the position-specific
351   -biases of these sequences were very similar to the biases of the
  360 +variation of these sequences was very similar to the variation in the
352 361 larger set of highly connected k-mers.
  362 +% @CTB was diginorm used for abundance > 50?
353 363
354 364 To identify consistent patterns within sequences causing
355   -position-specific biases, we examined the abundance of distribution
  365 +position-specific variation, we examined the abundance distribution of
356 366 5-mers contained within the high abundance subset of each dataset's
357 367 highly connected 32-mers. There were significantly fewer 5-mers in
358 368 the simulated sequences compared to those in metagenomes: 336 5-mers
@@ -376,12 +386,13 @@ \section*{Discussion}
376 386 \subsection*{Sequencing artifacts are present in highly connected sequences}
377 387
378 388 Through assessing the connectivity of reads in several metagenomes, we
379   -identified a disproportionately large subset of reads which were
380   -connected together within an assembly graph, hereafter referred to as
381   -the ``lump'' in each metagenome. The total number of reads in
  389 +identified a disproportionately large subset of reads
  390 +connected together within an assembly graph, which we refer to as
  391 +the ``lump''.
  392 +The total number of reads in
382 393 metagenomic lumps (7-75\% of reads) was significantly larger than that
383 394 of the simulated dataset (5\% of reads) (Table 1). As the simulated
384   -dataset contains no errors, its observed connectivity represents
  395 +dataset contains no errors, this observed connectivity represents
385 396 conserved sequences within a single genome or between multiple genomes
386 397 (specific genes identified in Table 4). The larger size of the highly
387 398 connected lump within the soil, rumen, and human gut metagenomes
@@ -392,13 +403,15 @@ \subsection*{Sequencing artifacts are present in highly connected sequences}
392 403 increased slightly from 4.7 to 5.6\% in the medium and large soil
393 404 metagenomes, the number of reads connected in the lump grew
394 405 significantly from 15 million to 182 million. Given the very high
395   -diversity and very low coverage of these soils, the magnitude of the
396   -observed increases in connectivity seemed unlikely from biological
397   -sources, further supporting the presence of sequencing biases within
  406 +diversity and very low coverage of these soil samples, the magnitude of the
  407 +observed increases in connectivity seemed unlikely to originate from biological
  408 +sources, further suggesting the presence of sequencing biases within
398 409 these datasets.
  410 +% @CTB what does ``a 5% increase of sequencing coverage'' mean? in reads?
  411 +% or reads mapped to assembly?
399 412
400 413 If sequencing biases were present within these metagenomes, we would
401   -expect to observe that the metagenomic lumps would consist not only of
  414 +expect that the metagenomic lumps would consist not only of
402 415 artificial sequences but also sequences from reads which would be
403 416 ``preferentially attached'' \cite{Barabasi:1999p1083}. Consider that
404 417 there is an original set of highly connecting ``X'' sequences in a
@@ -414,10 +427,10 @@ \subsection*{Sequencing artifacts are present in highly connected sequences}
414 427 datasets.
415 428 % @CTB rewrite
416 429
417   -To more rigorously demonstrate the presence of artifacts within our
418   -datasets, we considered that the sequencing of metagenomes is a random
419   -process and consequently any position-specific bias within sequencing
420   -reads is unexpected and non-biological (cite). For the metagenomes
  430 +The sequencing of metagenomes is a random
  431 +process and consequently any position-specific variation within sequencing
  432 +reads is unexpected and probably originates from bias in sample preparation
  433 +or the sequencing process (cite). For the metagenomes
421 434 studied here, we used two approaches to examine characteristics of
422 435 connectivity correlated to specific positions within sequencing reads.
423 436 First, we measured the connectivity of sequences at specific positions
@@ -428,11 +441,11 @@ \subsection*{Sequencing artifacts are present in highly connected sequences}
428 441 dataset, we observed no position-specific trends when assessing either
429 442 local graph density (Figure 1) or highly connected k-mers (Figure 2,
430 443 solid lines) as is consistent with the lack of sequencing errors and
431   -biases in this dataset. In all real metagenomes, however, we
  444 +variation in this dataset. In all real metagenomes, however, we
432 445 identified position-specific trends in measurements of both local
433 446 graph density and the location of highly connected sequences, clearly
434 447 indicating the presence of artificial sequences. Although present in
435   -all metagenomes, the direction of the bias varied between soil, rumen,
  448 +all metagenomes, the direction of the variation varied between soil, rumen,
436 449 and human gut datasets, especially for the position-specific presence
437 450 of identified highly connected sequences. It is likely that there is
438 451 a larger presence of indirectly preferentially attached reads which
@@ -446,32 +459,37 @@ \subsection*{Sequencing artifacts are present in highly connected sequences}
446 459 large soil metagenomes and in the soil, rumen, to human gut
447 460 metagenomes (Figure 2).
448 461 % @CTB is this last bit bullshit or not? Speculate on ligation efficiency
449   -% etc.
  462 +% etc. :)
450 463
451 464 \subsection*{Assessing the validity of removing highly connected sequences from metagenomes}
  465 +
452 466 % @CTB refactor roundabout section title
453 467 \subsubsection*{Highly connected sequences are difficult to assemble}
  468 +
  469 +% @CTB refactor
454 470 As is apparent from conserved biological sources of high connectivity
455 471 within the simulated metagenome, not all the observed connectivity
456 472 within real metagenomes is artificial, and our approaches are limited
457 473 in that they cannot differentiate between sequencing artifacts and
458 474 sources of real biological connectivity. Regardless of the origin of
459 475 highly connected sequences, we suspected that these sequences would
460   -challenge assemblers which rely on resolving the complex ``lump'' in
  476 +challenge assemblers which rely on traversing the complex ``lump'' in
461 477 the assembly graph. Indeed, very few highly connected sequences with
462   -abundances greater than 50 were incorporated into any assembly (Table
463   -3) and those which were assembled were often disproportionately placed
464   -at the ends of contigs (Figure 3), suggesting that assembly could
465   -often not extend beyond these sequences. Although this trend was
  478 +abundances greater than 50 were incorporated into contigs (Table
  479 +3). Moreover, those which were assembled were often disproportionately placed
  480 +at the ends of contigs (Figure 3), suggesting that they confused the
  481 +assembly process. Although this trend was
466 482 observed for all assemblers, it was more prevalent in the Velvet and
467 483 Meta-IDBA assemblers, highlighting differences in assembler
468 484 heuristics.
469 485
470 486 \subsubsection*{Removing highly connected sequences enabled more efficient assembly of partitioned reads}
471   -Given that these sequences were found to have position-specific biases
472   -within reads and challenged multiple assemblers, we assessed the
  487 +
  488 +Since these highly connected sequences contained artifacts and
  489 +were challenging for assemblers,
  490 +we assessed the
473 491 effects of removing them for the assembly of metagenomic lumps. We
474   -found that the removal of these highly connected sequences had two key
  492 +found that removal of these highly connected sequences had two key
475 493 advantages: first, it removed artificial sequences which should not be
476 494 assembled, and second, it resulted in the dissolution of the high
477 495 connectivity within the metagenomic lump and consequently allowed for
@@ -523,35 +541,34 @@ \subsubsection*{Removal of highly connected sequences prior to assembly did not
523 541 general, for all metagenomes, we observed $\sim$25\% loss in assembly after
524 542 removing highly connected sequences, much more than observed in
525 543 assemblies of reference genes and genomes in the simulated and rumen
526   -datasets. Some of this loss is likely beneficial, resulting in the
  544 +datasets. Some of this loss is likely beneficial, resulting from
527 545 removal of sequencing artifacts; it is also possible that our approach
528   -removes sequences which can accurately be assembled but cannot be
529   -distinguished due to lack of reference genomes. However, without the
  546 +removes sequences which can accurately be assembled, but we cannot
  547 +evaluate this in the absence of reference genomes.
  548 +However, without the
530 549 removal of these sequences, many of the assemblies of the larger
531 550 metagenomes would not be practical.
532 551
533   -\subsection*{highly connected sequences do not match known reference sequences}
  552 +\subsection*{Highly connected sequences do not match known reference sequences}
534 553
535   -We attempted to identify any biological characteristics of highly
  554 +We attempted to identify biological characteristics of highly
536 555 connected sequences. Among these sequences in the simulated dataset
537 556 and those shared by all metagenomes, we identified only a small
538 557 fraction (13\% in simulated and less than 7\% in metagenomes) which
539   -matched reference genes, mostly associated with housekeeping functions
  558 +matched reference genes associated with core biological functions
540 559 (Tables 4 and 5). This suggests that the remaining sequences are
541 560 either not present in known reference genes (i.e., conserved
542   -non-coding regions) or originate from non-biological sources and
  561 +non-coding regions) or originate from non-biological sources. This
543 562 supports the removal of these sequences for typical assembly and
544 563 annotation pipelines, where assembly is often followed by the
545 564 identification of protein coding regions.
546 565
547 566 Speculating that many of the highly connected sequences originated
548   -from high abundance reads (possibly originating from biological
549   -sources of high connectivity or sequencing biases), we identified
550   -characteristics of the most abundant subset of sequences. We found
551   -that these sequences (present greater than 50x) displayed similar
552   -trends for position-specific biases compared to their respective sets
553   -of highly connected sequences (Figure 2), indicating that they are
554   -contribute significantly as sequencing biases. We attempted to
  567 +from high abundance reads, we examined the most abundant subsequences. We found
  568 +that these subsequences (present greater than 50x) displayed similar
  569 +trends for position-specific variation compared to their respective sets
  570 +of highly connected subsequences (Figure 2), indicating that they
  571 +contribute significantly to position-specific variation. We attempted to
555 572 identify signatures in the abundant, highly connected sequences of
556 573 the simulated and metagenomic datasets. In the simulated dataset, we
557 574 found that the total number of unique 5-mers was significantly lower
@@ -560,12 +577,12 @@ \subsection*{highly connected sequences do not match known reference sequences}
560 577 with the identification of conserved biological motifs in the
561 578 simulated dataset which would result in a small number of highly
562 579 abundant sequences. In contrast, within metagenomic data, we found
563   -that these sequences are evenly distributed and random in metagenomes
  580 +that the 5-mers are evenly distributed and random in metagenomes
564 581 (Figure 5), making them difficult to identify and evaluate.
565 582 Currently, we are evaluating a promising approach to improve the
566 583 identification and removal of probable sequencing artifacts based on
567 584 targeting high abundance sequencing.
568   -% @CTB this is the diginorm abundance removal, right?
  585 +% @CTB this is the diginorm abundance removal, right? should we keep this in?
569 586
570 587 \section*{Conclusion}
571 588
@@ -678,7 +695,7 @@ \subsection*{Local graph density and identifying highly connected k-mers}
678 695 data-in-paper/lumps/HC-kmers/HA-HC-kmers and
679 696 method-examples/4.abundant-hc-kmers. These high abundance, highly
680 697 connected sequences were aligned to sequencing reads to demonstrate
681   -position specific biases as described above. We evaluated the
  698 +position specific variation as described above. We evaluated the
682 699 existence of short k-mer (k=5) motifs within high abundance, highly
683 700 connected k-mers which did not have an exact match to the NCBI
684 701 non-redundant database. Each identified 32-mer was broken up into
