many minor changes

commit 0bca8fcb1eca2d178ef26190d48bdd45ce0e12f2 1 parent c9d38a1
@ctb ctb authored
Showing with 139 additions and 122 deletions.
  1. +139 −122 recent-version/artifacts-paper2.tex
261 recent-version/artifacts-paper2.tex
@@ -82,53 +82,53 @@
% Please keep the abstract between 250 and 300 words
\section*{Abstract}
+% @CTB revisit first sentence.
Coverage-based assembly approaches for metagenomic datasets are
hindered by the presence of sequencing errors and biases. Here, we
-examine several metagenomes for the presence of such sequencing
-artifacts through a connectivity analysis of reads within a
-representation of each metagenome's respective assembly graph. We
+examine several metagenomes for the presence of sequencing
+artifacts through a connectivity analysis of reads within
+each metagenome's respective assembly graph. We
identified highly connected sequences which join a large proportion of
reads within each metagenome, suggesting the presence of
non-biological biases within sequencing reads. These sequences were
-found to be located at specific positions within original reads and
+found to be biased towards specific positions within shotgun reads and
are minimally incorporated into final assemblies. The removal of
-these sequences prior to assembly resulted in similar assembly content
-for most metagenomes and enabled the partitioning of reads in the
-assembly graph by connectivity, significantly decreasing assembly
-memory and time requirements.
-
+these sequences prior to assembly results in similar assembly content
+for most metagenomes, and enables the use of graph partitioning
+to decrease assembly memory and time requirements.
\section*{Introduction}
-Given the rapid decrease in the costs of sequencing, we can now
-achieve the sequencing depth necessary to study even the most complex
-environments \cite{Hess:2011p686,Qin:2010p189}. High throughput, deep
+With the rapid decrease in the costs of sequencing, we can now
+achieve the sequencing depth necessary to study microbes from even the most complex
+environments \cite{Hess:2011p686,Qin:2010p189}. Deep
metagenomic sequencing efforts in permafrost soil, human gut, cow
rumen, and surface water have provided insights into the genetic and
biochemical diversity of environmental microbial populations
-\cite{Hess:2011p686,Iverson:2012p1281,Qin:2010p189} and the extent to
-which they are involved in responding to environmental changes
+\cite{Hess:2011p686,Iverson:2012p1281,Qin:2010p189} and how
+they are involved in responding to environmental changes
\cite{Mackelprang:2011p1087}. These metagenomic studies have all
-leveraged \emph{de novo} metagenomic assembly of short reads to assign
-sequences and functions to microbial taxa. \emph{De novo} assembly is
+leveraged \emph{de novo} metagenomic assembly of short reads for
+functional and phylogenetic analyses.
+\emph{De novo} assembly is
an advantageous approach to sequence analysis as it reduces the
-dataset size by collapsing numerous short reads into fewer contigs and
-provides longer sequences containing multiple genes and operons
-\cite{Miller:2010p226,Pop:2009p798} making annotation-based approaches
-more practical. Furthermore, it does not rely on the availability of
+dataset size by collapsing the more numerous short reads into fewer contigs and
+enabling better annotation-based approaches by providing longer sequences
+\cite{Miller:2010p226,Pop:2009p798}.
+Furthermore, it does not rely on the a priori availability of
reference genomes to enable identification of novel genetic features
and draft genomes \cite{Hess:2011p686,Iverson:2012p1281}.
% @CTB what does ``enabling identification ... of draft genomes'' mean?
Although \emph{de novo} metagenomic assembly is a promising approach
-for deep sequencing of metagenomes, it is complicated by the variable
+for metagenomic sequence analysis, it is complicated by the variable
coverage of sequencing reads from mixed populations in the environment
and their associated sequencing errors and biases
\cite{Mende:2012p1262,Pignatelli:2011p742}. Several
-metagenomic-specific assemblers have been developed to deal with
+metagenome-specific assemblers have been developed to deal with
variable coverage communities, including Meta-IDBA
-\cite{Peng:2011p898}, MetaVelvet, and SOAPdenovo. These assemblers
-rely on local models of sequencing coverage to help build assemblies
+\cite{Peng:2011p898}, MetaVelvet, and SOAPdenovo (cite). These assemblers
+rely on analysis of local sequencing coverage to help build assemblies
and thus are sensitive to the effects of sequencing errors and biases
on coverage estimations of the underlying dataset. The effects of
sequencing errors on \emph{de novo} assembly have been demonstrated in
@@ -139,22 +139,22 @@ \section*{Introduction}
Specifically, these models exclude the presence of known
non-biological sequencing biases
\cite{GomezAlvarez:2009p1334,Keegan:2012p1336,Niu:2010p1333} which
-hinder coverage-based assembly approaches.
+hinder assembly approaches.
% @CTB also remember to discuss polymorphism
% @CTB for isolated genomes, add Chitsaz citation.
In this study, we examine metagenomic datasets for the presence of
artificial sequencing biases that affect assembly graph structure,
extending previous work to large and complex datasets produced from
-the Illumina platform. We characterized sequence connectivity in an
+the Illumina platform. We characterize sequence connectivity in an
assembly graph, identifying potential sequencing biases in regions
where numerous reads are connected together. Within metagenomic
-datasets, we found that there exist highly connected sequences which
-originate, at least partially, from sequencing artifacts and that
+datasets, we find that there exist highly connected sequences which
+partially originate from sequencing artifacts. Moreover,
these sequences limit approaches to divide or partition large datasets
-for further analysis, e.g. {\em de novo} assembly. Here, we present
-approaches to identify and characterize these highly connected
-sequences and examine the effects of removing these sequences on
+for further analysis, and may introduce artifacts into assemblies. Here, we
+identify and characterize these highly connected
+sequences, and examine the effects of removing these sequences on
downstream assemblies.
\section*{Results}
@@ -162,16 +162,17 @@ \section*{Results}
\subsection*{Connectivity analysis of metagenome datasets}
\subsubsection*{Presence of a single, highly connected lump in all datasets}
-We selected datasets from three diverse, medium to high diversity
+We selected datasets from three medium to high diversity
metagenomes from the human gut \cite{Qin:2010p189}, cow rumen
-\cite{Hess:2011p686}, and agricultural soil (SRX099904 and SRX099905),
-representing metagenomes sequenced to various depths (Table 1). To
+\cite{Hess:2011p686}, and agricultural soil (SRX099904 and SRX099905)
+(Table 1). To
evaluate the effects of sequencing coverage, we included two subsets
of the 520 million read soil metagenome containing 50 and 100 million
-reads. We also included a previously published error-free simulated,
+reads. We also included a previously published error-free simulated
metagenome based on a mixture of 112 reference genomes
\cite{Pignatelli:2011p742}.
+% @CTB refactor paragraph: using a DBG, cite Pell, etc.
Initially, we evaluated the amount of connectivity between all
sequences in each metagenome using an approach similar to the initial
step of short read assemblers to identify overlaps of short sequences
@@ -187,9 +188,8 @@ \subsubsection*{Presence of a single, highly connected lump in all datasets}
% @CTB cite
Using this assembly graph representation, we separated reads
-contributing to disconnected portions of the metagenome assembly graph
-(e.g., representatives from separate populations in the source
-environment). For each metagenome, regardless of origin, we found a
+contributing to disconnected portions of the metagenome assembly graph.
+For each metagenome, regardless of origin, we found a
single dominant, highly connected set of sequencing reads which we
henceforth refer to as the ``lump'' of the dataset (Table 1, column
3). This lump contained the largest subset of connected sequencing
@@ -203,64 +203,71 @@ \subsubsection*{Presence of a single, highly connected lump in all datasets}
\subsubsection*{Characterizing the connectivity in the dominant lump}
-Given the large number of reads connected within metagenomic lumps (up
-to 182 and 262 million reads in the soil and human gut datasets,
-respectively), we quantified the degree of connectivity of sequences
-within the lump by estimating the average local graph density from
-each k-mer (k=32 unless otherwise stated) in the assembly graph (See
+% @CTB check scripts to see if this is an accurate characterization.
+% @CTB put scripts in scripts/!!
+We characterized the connectivity of sequences
+within each lump by estimating the average local graph density from
+each k-mer (k=32 unless otherwise stated) in the assembly graph (see
Methods). Here, local graph density is a measurement of total
-connected reads within a radius distance. We observed that sequences
+connected reads within a fixed radius. Sequences
in the identified metagenomic lumps were characterized by very high
-local graph densities, between 22 to 50\% of the total nodes in
+local graph densities: between 22 and 50\% of the total nodes in
metagenomic lump assembly graphs had average graph densities greater
-than 20 (Table 1). In comparison, 17\% of the total nodes in the
+than 20 (Table 1). This means that these nodes were in very nonlinear portions of the assembly graph and had high connectivity. In comparison, 17\% of the total nodes in the
simulated lump had an average local graph density greater than 20, and
-a mixture of the 112 source genomes for the simulated dataset had
-fewer than 2\% of its nodes with an average graph density greater than
-20.
+fewer than 2\% of the nodes in the entire simulated dataset had an
+average graph density higher than 20.
We next assessed the extent to which graph density varied by position
-along the sequencing reads. The degree of position-specific bias of
+along the sequencing reads. The degree of position-specific variation of
graph densities was estimated by calculating the average local graph
density within ten steps of every k-mer by position in each read. In
-all environmental metagenomic reads, we observed biases in graph
+all environmental metagenomic reads, we observed variation in graph
density at the 3'-end region of reads (Figure 1). In soil
-metagenomes, we observed the most dramatic biases with local graph
+metagenomes, we observed the most dramatic variation with local graph
density increasing in sequences located at the 3'-end of the reads.
-Notably, this bias was not present in the simulated dataset.
+Notably, this trend was not present in the simulated dataset.
Next, we performed an exhaustive traversal of the assembly graph and
identified the specific sequences within dense regions of the assembly
graph which consistently contributed to high connectivity. We
-observed that this subset of sequences were also found to exhibit
-position-specific biases within sequencing reads, with the exception
-of these sequences in the simulated dataset (Figure 1, solid lines).
-Similar to local density trends, position-specific biases of these
-sequences also varied between metagenomes. As sequencing coverage
-increased among metagenomes, the amount of 3'-end bias appeared to
-decrease (e.g., the soils) or inverse (e.g., rumen and human gut).
+observed that this subset of sequences was also found to exhibit
+position-specific variation within sequencing reads, with the
+exception of these sequences in the simulated dataset (Figure 1, solid
+lines). Similar to local density trends, position-specific trends in
+the location of these sequences also varied between metagenomes. As
+sequencing coverage increased among metagenomes, the amount of 3'-end
+variation appeared to decrease (e.g., the soils) or invert (e.g.,
+rumen and human gut).
\subsection*{Effects of removing highly connected sequences on assembly}
\subsubsection*{Removal of highly connected sequences enables graph partitioning of metagenome}
-Given that highly connected sequences exhibited position-specific
-biases associated with sequences of non-biological origin, we assessed
-the effects of their removal from reads in metagenomic lumps. We
+
+Since these highly connected sequences exhibited position-specific
+variation indicative of sequences of non-biological origin, we removed
+them and assessed the effect of their removal on assembly
+(see Methods). We
found that by removing these k-mers, we could effectively break apart
metagenomic lumps, and the resulting largest partition of connected
reads in each metagenome was reduced to less than 7\% of the total
reads in the lump. As a consequence of partitioning the metagenomic
-lump, we were able to greatly reduce assembly requirements. Compared
+lump, we were able to greatly reduce assembly requirements.
+% @CTB refactor below
+Compared
to unfiltered datasets which required greater than 100 GB and 100
hours in the case of the largest soil metagenome (Table 2), all
partitioned datasets could be assembled in less than 2 GB of memory
and less than 1 hour using multiple nodes.
\subsubsection*{Removal of highly connected sequences resulted in minimal losses of reference genes}
-To explore the extent to which the identified highly connected
-sequences impacted assembly, we first evaluated the effects of the
-removing these sequences from reads in the simulated lump and its
-resulting assemblies. The assembly of the reads in the original,
+
+% @CTB probably need to indicate that since lump is separated from rest
+% we can assemble it separately w/o fear.
+We explored the extent
+to which the identified highly connected
+sequences impacted assembly by first evaluating the effects of
+removing these sequences from the simulated lump. The assembly of the reads in the original,
unfiltered simulated lump and that of the reads remaining after
removing highly connected sequences (the filtered assembly) were
compared for three assemblers: Velvet, Meta-IDBA, and SOAPdenovo.
@@ -277,17 +284,18 @@ \subsubsection*{Removal of highly connected sequences resulted in minimal losses
of over 3\% of the total unique 32-mers in the simulated metagenome,
the resulting filtered assemblies resulted in only a loss of 0.1 -
0.6\% of annotated original reference genes (Tables 1 and 2).
+% @@CTB was normalized blast used here?
We next evaluated the effects of using similar approaches on
-metagenomic datasets. Similar to the simulated assemblies, the
+real datasets. Similar to the simulated assemblies, the
removal of highly connected sequences for all metagenomes and
-assemblers resulted in a loss of total number of contigs and assembly
+assemblers resulted in a decrease in the total number of contigs and assembly
length (Table 2). In general, filtered assemblies were largely
contained within unfiltered assemblies and comprised 51-88\% of the
unfiltered assembly. The observed changes in metagenomic assemblies
-were difficult to evaluate as the source genomes to these datasets are
-unknown, and a loss in assembly length may actually be beneficial due
-to the elimination of contigs which incorporated sequencing artifacts.
+were difficult to evaluate as no reference genomes exist,
+and a decrease in assembly length may actually be beneficial if it
+eliminates contigs that incorporate sequencing artifacts.
To aid in this evaluation, we used the previously published set of
rumen draft genomes from \emph{de novo} assembly efforts of high
abundance sequences in the rumen metagenome \cite{Hess:2011p686}.
@@ -306,15 +314,15 @@ \subsubsection*{Unfiltered assemblies contained only a small fraction of highly
dependent on the total length of the contig) and examined for the
presence of the previously identified highly connected sequences. We
found that contigs, especially in assemblies from Velvet and
-Meta-IDBA, incorporated a larger fraction of these sequences at its
-ends relative to other binned positions (Figure 3). The SOAPdenovo
+Meta-IDBA, incorporated a larger fraction of these sequences at their
+ends relative to other positions (Figure 3). The SOAPdenovo
assembler incorporated fewer of the highly connected sequences into
its assembled contigs; none of these sequences in the simulated
dataset were assembled, and only 41 in the small soil dataset. For
the human gut metagenome assemblies, millions of the highly connected
sequences were incorporated into assembled contigs, comprising nearly
4\% of all assembled sequences on Velvet contig ends (Figure 4,
-suggestion to move to supp figures).
+suggestion to move to supp figures -- YES CTB).
% @CTB do we want to talk about end bins or percentile bins? Probably fine
% to leave as is.
@@ -332,27 +340,29 @@ \subsubsection*{Identifying origins of highly connected sequences in known refer
(out of how many kmers???), we identified the closest reference
protein from the NCBI-nr database requiring complete sequence
identity. Only 1,018 sequences (13\%) matched existing reference
-proteins, and many of the annotated sequences matched multiple
-conserved protein sequences from multiple genomes. The top five
+proteins, and many of the annotated sequences matched to
+genes conserved across multiple genomes. The top five
proteins conserved in greater than 3 genomes are shown in Table 4, and
largely encode for genes involved in protein biosynthesis, DNA
metabolism, and biochemical cofactors (Table 5).
% @CTB yes, out of of how many k-mers?
% @CTB what is our conclusion here, anyway, about the origin?
+% @CTB what does ``top five'' mean here -- abundance?
-A potential cause of artificial high connectivity within metagenomes
-is the presence of high abundance sequences. Thus, we identified the
+One potential cause of artificial high connectivity within metagenomes
+is the presence of high abundance subsequences. Thus, we identified the
subset of highly connected k-mers which were also present with an
abundance of greater than 50 within each metagenome and their location
in sequencing reads (Figure 2, dotted lines). These high abundance
k-mers comprised a very small proportion of the identified highly
connected sequences, less than 1\% in the soils, 1.5\% in the rumen,
and 6.4\% in the human gut metagenomes, but the position-specific
-biases of these sequences were very similar to the biases of the
+variation of these sequences was very similar to the variation in the
larger set of highly connected k-mers.
+% @CTB was diginorm used for abundance > 50?
To identify consistent patterns within sequences causing
-position-specific biases, we examined the abundance of distribution
+position-specific variation, we examined the abundance distribution of
5-mers contained within the high abundance subset of each dataset's
highly connected 32-mers. There were significantly fewer 5-mers in
the simulated sequences compared to those in metagenomes: 336 5-mers
@@ -376,12 +386,13 @@ \section*{Discussion}
\subsection*{Sequencing artifacts are present in highly connected sequences}
Through assessing the connectivity of reads in several metagenomes, we
-identified a disproportionately large subset of reads which were
-connected together within an assembly graph, hereafter referred to as
-the ``lump'' in each metagenome. The total number of reads in
+identified a disproportionately large subset of reads
+connected together within an assembly graph, which we refer to as
+the ``lump''.
+The total number of reads in
metagenomic lumps (7-75\% of reads) was significantly larger than that
of the simulated dataset (5\% of reads) (Table 1). As the simulated
-dataset contains no errors, its observed connectivity represents
+dataset contains no errors, this observed connectivity represents
conserved sequences within a single genome or between multiple genomes
(specific genes identified in Table 4). The larger size of the highly
connected lump within the soil, rumen, and human gut metagenomes
@@ -392,13 +403,15 @@ \subsection*{Sequencing artifacts are present in highly connected sequences}
increased slightly from 4.7 to 5.6\% in the medium and large soil
metagenomes, the number of reads connected in the lump grew
significantly from 15 million to 182 million. Given the very high
-diversity and very low coverage of these soils, the magnitude of the
-observed increases in connectivity seemed unlikely from biological
-sources, further supporting the presence of sequencing biases within
+diversity and very low coverage of these soil samples, the magnitude of the
+observed increases in connectivity seemed unlikely to originate from biological
+sources, further suggesting the presence of sequencing biases within
these datasets.
+% @CTB what does ``a 5% increase of sequencing coverage'' mean? in reads?
+% or reads mapped to assembly?
If sequencing biases were present within these metagenomes, we would
-expect to observe that the metagenomic lumps would consist not only of
+expect that the metagenomic lumps would consist not only of
artificial sequences but also sequences from reads which would be
``preferentially attached'' \cite{Barabasi:1999p1083}. Consider that
there is an original set of highly connecting ``X'' sequences in a
@@ -414,10 +427,10 @@ \subsection*{Sequencing artifacts are present in highly connected sequences}
datasets.
% @CTB rewrite
-To more rigorously demonstrate the presence of artifacts within our
-datasets, we considered that the sequencing of metagenomes is a random
-process and consequently any position-specific bias within sequencing
-reads is unexpected and non-biological (cite). For the metagenomes
+The sequencing of metagenomes is a random
+process and consequently any position-specific variation within sequencing
+reads is unexpected and probably originates from bias in sample preparation
+or the sequencing process (cite). For the metagenomes
studied here, we used two approaches to examine characteristics of
connectivity correlated to specific positions within sequencing reads.
First, we measured the connectivity of sequences at specific positions
@@ -428,11 +441,11 @@ \subsection*{Sequencing artifacts are present in highly connected sequences}
dataset, we observed no position-specific trends when assessing either
local graph density (Figure 1) or highly connected k-mers (Figure 2,
solid lines) as is consistent with the lack of sequencing errors and
-biases in this dataset. In all real metagenomes, however, we
+variation in this dataset. In all real metagenomes, however, we
identified position-specific trends in measurements of both local
graph density and the location of highly connected sequences, clearly
indicating the presence of artificial sequences. Although present in
-all metagenomes, the direction of the bias varied between soil, rumen,
+all metagenomes, the direction of the variation differed between soil, rumen,
and human gut datasets, especially for the position-specific presence
of identified highly connected sequences. It is likely that there is
a larger presence of indirectly preferentially attached reads which
@@ -446,32 +459,37 @@ \subsection*{Sequencing artifacts are present in highly connected sequences}
large soil metagenomes and in the soil, rumen, to human gut
metagenomes (Figure 2).
% @CTB is this last bit bullshit or not? Speculate on ligation efficiency
-% etc.
+% etc. :)
\subsection*{Assessing the validity of removing highly connected sequences from metagenomes}
+
% @CTB refactor roundabout section title
\subsubsection*{Highly connected sequences are difficult to assemble}
+
+% @CTB refactor
As is apparent from conserved biological sources of high connectivity
within the simulated metagenome, not all the observed connectivity
within real metagenomes is artificial, and our approaches are limited
in that they cannot differentiate between sequencing artifacts and
sources of real biological connectivity. Regardless of the origin of
highly connected sequences, we suspected that these sequences would
-challenge assemblers which rely on resolving the complex ``lump'' in
+challenge assemblers which rely on traversing the complex ``lump'' in
the assembly graph. Indeed, very few highly connected sequences with
-abundances greater than 50 were incorporated into any assembly (Table
-3) and those which were assembled were often disproportionately placed
-at the ends of contigs (Figure 3), suggesting that assembly could
-often not extend beyond these sequences. Although this trend was
+abundances greater than 50 were incorporated into contigs (Table
+3). Moreover, those which were assembled were often disproportionately placed
+at the ends of contigs (Figure 3), suggesting that they confused the
+assembly process. Although this trend was
observed for all assemblers, it was more prevalent in the Velvet and
Meta-IDBA assemblers, highlighting differences in assembler
heuristics.
\subsubsection*{Removing highly connected sequences enabled more efficient assembly of partitioned reads}
-Given that these sequences were found to have position-specific biases
-within reads and challenged multiple assemblers, we assessed the
+
+Since these highly connected sequences contained artifacts and
+were challenging for assemblers,
+we assessed the
effects of removing them for the assembly of metagenomic lumps. We
-found that the removal of these highly connected sequences had two key
+found that removal of these highly connected sequences had two key
advantages: first, it removed artificial sequences which should not be
assembled, and second, it resulted in the dissolution of the high
connectivity within the metagenomic lump and consequently allowed for
@@ -523,35 +541,34 @@ \subsubsection*{Removal of highly connected sequences prior to assembly did not
general, for all metagenomes, we observed $\sim$25\% loss in assembly after
removing highly connected sequences, much more than observed in
assemblies of reference genes and genomes in the simulated and rumen
-datasets. Some of this loss is likely beneficial, resulting in the
+datasets. Some of this loss is likely beneficial, resulting from
removal of sequencing artifacts; it is also possible that our approach
-removes sequences which can accurately be assembled but cannot be
-distinguished due to lack of reference genomes. However, without the
+removes sequences which can accurately be assembled, but we cannot
+evaluate this in the absence of reference genomes.
+However, without the
removal of these sequences, many of the assemblies of the larger
metagenomes would not be practical.
-\subsection*{highly connected sequences do not match known reference sequences}
+\subsection*{Highly connected sequences do not match known reference sequences}
-We attempted to identify any biological characteristics of highly
+We attempted to identify biological characteristics of highly
connected sequences. Among these sequences in the simulated dataset
and those shared by all metagenomes, we identified only a small
fraction (13\% in simulated and less than 7\% in metagenomes) which
-matched reference genes, mostly associated with housekeeping functions
+matched reference genes associated with core biological functions
(Tables 4 and 5). This suggests that the remaining sequences are
either not present in known reference genes (i.e., conserved
-non-coding regions) or originate from non-biological sources and
+non-coding regions) or originate from non-biological sources. This
supports the removal of these sequences for typical assembly and
annotation pipelines, where assembly is often followed by the
identification of protein coding regions.
Speculating that many of the highly connected sequences originated
-from high abundance reads (possibly originating from biological
-sources of high connectivity or sequencing biases), we identified
-characteristics of the most abundant subset of sequences. We found
-that these sequences (present greater than 50x) displayed similar
-trends for position-specific biases compared to their respective sets
-of highly connected sequences (Figure 2), indicating that they are
-contribute significantly as sequencing biases. We attempted to
+from high abundance reads, we examined the most abundant subsequences. We found
+that these subsequences (present greater than 50x) displayed similar
+trends for position-specific variation compared to their respective sets
+of highly connected subsequences (Figure 2), indicating that they
+contribute significantly to position-specific variation. We attempted to
identify signatures in the abundant, highly connected sequences of
the simulated and metagenomic datasets. In the simulated dataset, we
found that the total number of unique 5-mers was significantly lower
@@ -560,12 +577,12 @@ \subsection*{highly connected sequences do not match known reference sequences}
with the identification of conserved biological motifs in the
simulated dataset which would result in a small number of highly
abundant sequences. In contrast, within metagenomic data, we found
-that these sequences are evenly distributed and random in metagenomes
+that the 5-mers are evenly distributed and random in metagenomes
(Figure 5), making them difficult to identify and evaluate.
Currently, we are evaluating a promising approach to improve the
identification and removal of probable sequencing artifacts based on
targeting high abundance sequencing.
-% @CTB this is the diginorm abundance removal, right?
+% @CTB this is the diginorm abundance removal, right? should we keep this in?
\section*{Conclusion}
@@ -678,7 +695,7 @@ \subsection*{Local graph density and identifying highly connected k-mers}
data-in-paper/lumps/HC-kmers/HA-HC-kmers and
method-examples/4.abundant-hc-kmers. These high abundance, highly
connected sequences were aligned to sequencing reads to demonstrate
-position specific biases as described above. We evaluated the
+position specific variation as described above. We evaluated the
existence of short k-mer (k=5) motifs within high abundance, highly
connected k-mers which did not have an exact match to the NCBI
non-redundant database. Each identified 32-mer was broken up into
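
The final hunk describes breaking each high abundance, highly connected 32-mer into short k-mers (k=5) to look for motifs. A minimal sketch of that counting step follows. It assumes overlapping windows, since the hunk is cut off before the text says how the 32-mers are split. The helper name `count_short_kmers` and the example sequences are hypothetical and are not taken from the paper's scripts.

```python
from collections import Counter


def count_short_kmers(sequences, k=5):
    """Count every overlapping k-mer across a collection of sequences.

    Assumption: windows overlap and advance one base at a time. The
    manuscript text visible in this diff does not confirm this choice.
    """
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


# Two made-up 32-mers, used only for illustration (not data from the paper).
example_32mers = ["ACGT" * 8, "A" * 32]
motif_counts = count_short_kmers(example_32mers, k=5)

# Abundance distribution: how many distinct 5-mers occur, and how often.
unique_5mers = len(motif_counts)
total_windows = sum(motif_counts.values())
```

Under this assumption, a low `unique_5mers` relative to `total_windows` would correspond to a few dominant motifs, while a flat distribution would resemble the evenly distributed 5-mers the text reports for the real metagenomes.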