Merging different sequence runs with heterogeneity spacers #331
I've been helping a friend work through using DADA2 with a project that was sequenced across two different MiSeq runs. An interesting aspect of his project is that nearly all samples from the first run were re-sequenced in the second run, so it gives us a great opportunity to examine the performance of various sequence processing pipelines in merging runs — you would expect the same vial of DNA to yield roughly similar sets of amplicon sequences when resequenced.
Working with a subset of samples, we found that a QIIME workflow with uclust gave us exactly what we expected: the OTUs picked individually had the same refseqs, and when picking OTUs with both runs at the same time, reads were split about evenly between the runs for each OTU (though the second run had greater sequencing depth, so there's a tail of rare OTUs present only in that run). Switching to DADA2 with the same sample was distressing: we found that all the variation collapsed, and
I thought that perhaps including some more samples for learning error rates might fix the issue. Instead, I was able to infer more sample sequences, but they were still segregating by run, even though I knew from the uclust results that the sequences were the same.
Pulling the two inferred sequences from the single-sample trial, I was distressed to find that they were exactly the same, save for one sequence being just a single base-pair longer on the 5' end.
That's odd, particularly since these sequences have already been trimmed (by the sequencing facility) to remove the adaptors and sequencing primers. So, I pulled a bunch of sequences out of the sample and had a look at how they aligned:
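One lightweight way to run this check, without a full multiple aligner, is to tally the first few bases of every read: a cleanly trimmed amplicon library is dominated by a single leading k-mer, while variable-length spacers smear the counts across many. A minimal sketch; the reads below are invented for illustration, not from this dataset:

```python
from collections import Counter

def leading_kmer_profile(reads, k=6):
    """Tally the first k bases of each read. A cleanly trimmed amplicon
    library is dominated by one leading k-mer; heterogeneity spacers
    smear the counts across many distinct k-mers."""
    return Counter(read[:k] for read in reads if len(read) >= k)

# Toy reads: the same biological sequence behind 0-3 bp of random spacer.
reads = [
    "ACGTACGTTTTT",   # no spacer
    "GACGTACGTTTT",   # 1 bp spacer
    "TGACGTACGTTT",   # 2 bp spacer
    "CTGACGTACGTT",   # 3 bp spacer
]
profile = leading_kmer_profile(reads)
print(len(profile))  # 4 distinct leading k-mers: raggedy starts
```

If the profile has one dominant k-mer you are probably fine; many roughly equal k-mers is the spacer signature.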
It turns out that there's a lot of variation in where the sequences start and end: the alignment has "raggedy" ends. I talked to my friend about this, and he mentioned that this was intentional, something the sequencing facility does to introduce base diversity for the sequencer. They place a random 1–10 bp variable spacer after the sequencing primer, to keep from swamping the sequencer with the same base at the same cycle in every cluster. The spacers are supposed to be removed along with the sequencing primer!
Well, clearly they're not, and they're introducing small artificial differences at the ends, which is a problem (though easily fixable by trimming ~10bp from each end), but they're also keeping DADA2 from recognizing that it's the same RSV in both runs, which is a much bigger problem.
I solved the problem by including the
I think the more elegant way to solve the issue will be to use the stable regions at the beginning and end of the amplicons (these are V5–V6–V7 16S sequences) for trimming (pretending that they're primers), so that all of the sequences share the same start and end point. Pooling for sample inference across the full set of samples is computationally intensive, to say the least. It'll require the cluster, if that's the route my friend takes with this.
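That pseudo-primer idea can be sketched in a few lines: anchor on a conserved sequence expected near the amplicon start and discard everything 5' of it. Note this toy version uses exact matching only; real primer-trimming tools such as cutadapt tolerate mismatches. The anchor sequence below is a made-up placeholder, not the actual conserved region:

```python
def trim_to_anchor(read, anchor, max_spacer=10):
    """Trim everything 5' of a conserved anchor so all reads start at the
    same biological position. Returns None (read discarded) if the anchor
    is not found within the first max_spacer bases."""
    window = read[: max_spacer + len(anchor)]
    idx = window.find(anchor)
    return read[idx:] if idx != -1 else None

anchor = "ACGTAC"  # hypothetical conserved region at the amplicon start
reads = ["ACGTACGTTTTT", "GACGTACGTTTT", "CTGACGTACGTT", "AAAAAAAAAAAA"]
trimmed = [t for t in (trim_to_anchor(r, anchor) for r in reads)
           if t is not None]
# Every surviving read now begins at the anchor, whatever the spacer length.
```

The same logic applies mirrored on the 3' end for the reverse reads.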
I'm putting this up because it was an issue I struggled with for days, and I wanted to put it out to the community. If you're having trouble merging sequencing runs with DADA2, check the alignment of your sequence ends: I'm not sure how prevalent this use of variable-length heterogeneity spacers after the primer is, but it seems to be gaining popularity.
There is a solution for this problem in the dada2 package:
In general we recommend cleaning up the artificial variation before applying
Changed the title from "Merging different sequence runs" to "Merging different sequence runs with heterogeneity spacers" on Sep 19, 2017
Thanks for posting this, @werdnus! Very interesting. Most of the amplicon-seq data I've encountered has used the "phi-X method" to mitigate the low-nucleotide-diversity artifacts from highly similar sequences, as I imagine is no surprise for you to hear. This is the reason dada2 has included phi-X removal from pretty early on. However, I like this intentional random phase-shift solution (it is more sequencing-efficient than phi-X), and I'd like to see it offered by more vendors.
Generally, it is a fundamental assumption for DADA2 that the input sequences have the same start/end. Minor/rare exceptions within the sequence population can be addressed with the "isShiftmera" method(s), but as you noticed in your careful failure forensics, a massive exception to the same-phase assumption across the sequences is going to lead dada2 to infer artificially high error rates, which results in most/all sequences being explainable as errors.
Another way to catch this early in your process is to check that the estimated error rates look reasonable for your platform/amplicon; e.g., if you have previous successful runs for the same amplicon and sequencing platform, check that the error profiles are not wildly different. If they are, you usually have a trimming problem.
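One crude way to put a number on "not wildly different" is to compare per-transition error-rate estimates from two fits and flag a large relative change. The transition names and rates below are invented for illustration (dada2's learnErrors output is a richer, per-quality-score structure):

```python
def max_relative_diff(rates_a, rates_b):
    """Largest relative difference between two sets of per-transition
    error-rate estimates (e.g. A->C, A->G, ...), keyed identically."""
    return max(
        abs(rates_a[k] - rates_b[k]) / max(rates_a[k], rates_b[k])
        for k in rates_a
    )

# Toy per-transition rates from two hypothetical error-model fits.
previous_run = {"A2C": 0.0020, "A2G": 0.0040}
current_run = {"A2C": 0.0021, "A2G": 0.0038}
drift = max_relative_diff(previous_run, current_run)
# A drift of a few percent is unremarkable; several-fold differences
# for the same amplicon and platform suggest a trimming problem.
```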
Please post back here after you've fixed the trimming and re-run. I imagine I speak for @benjjneb when I say we'd like to hear how it worked out, and how it compared with your OTU clusters.
Okay, we've addressed the trimming issue, and I wanted to compare the error rates from a subset of samples before and after trimming, to see how the error rates differ. I've taken three samples replicated across two sequencing runs (totaling ~25,000 reads, split mostly evenly between samples, with the second sequencing run having about 20% more reads in all samples).
Here's the error rate plot from before trimming (when DADA2 was thinking there was no overlap between runs if we didn't use the
And here's the error rate plot from the same samples after trimming to the same start and end point:
Those are some subtle differences! I don't think I would have picked the second one out of a line up as the one with problematic trimming, honestly.
Even though it didn't seem to make a big difference to the error profiles, it did significantly improve the ability for
Compare that to the first table in the original post: these are the same samples, with the same processing, just trimmed first. Much better!
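The between-run agreement we've been eyeballing in these tables can also be computed directly from the per-run ASV count tables. A minimal sketch; the sequences and counts below are toy values:

```python
def run_specific_fraction(table_a, table_b):
    """Fraction of ASVs observed in only one of two runs, given
    {sequence: read count} tables. 0.0 means perfect overlap."""
    a, b = set(table_a), set(table_b)
    return 1 - len(a & b) / len(a | b)

# Toy ASV tables from two sequencing runs of the same samples.
run1 = {"AAAG": 10, "CCCT": 5, "GGGA": 2}
run2 = {"AAAG": 12, "CCCT": 7, "TTTC": 1}
frac = run_specific_fraction(run1, run2)
print(frac)  # 2 of 4 ASVs are run-specific -> 0.5
```

With technical replicates of the same DNA, you'd hope this fraction is driven only by rare variants near the detection limit, not by systematic run segregation.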
But we're still getting 45–65% of ASVs in only one run or the other. When we use
I wanted to chime in, as I've encountered what I think is a similar problem: I analyse highly diverse 16S soil samples and find that DADA2 gives me very little overlap between the ASVs in each sample. I troubleshot a subset of the data (4 samples, subsampled to 10K reads each) and get the following (where the first table is generated with
Note that I tried
In my full dataset, I have 84 samples with > 100K reads each, so pooling is probably not an option...
Also note that according to the sequencing facility, phi-X was used and not random basepairs at the start of the sequence. This is also confirmed by an alignment - which shows that all sequences are aligned to the same start - and by trying to trim 10 bp from the start of each sequence (
Hm... that is interesting, and quite different from what I've previously observed.
Can you tell me a bit more about this data? It is soil, so quite diverse. What sequencing tech? What primer set? Is it possible there could be primers still on the reads w/ ambiguous nucleotides? What does the