Losing 50% reads after merging #831

Closed · MicroIrene opened this issue Aug 23, 2019 · 18 comments

Comments

@MicroIrene

Hi,

I have a problem with my dataset: I am losing a lot of reads after merging, approximately 50% of them. I have read all the threads related to this problem, but none of them seems to give me a solution... Can you please help? Is this a real problem, or should I just accept that I lose so many reads and keep working with what is left?

My sequences are the bacterial 16S V3-V4 region, primers 341F and 805R, with an expected fragment size (after primer removal) of approx. 426 nt. Sequencing was Illumina 2x300. I truncated the forward reads to 270 nt and the reverse reads to 180 nt, which removed almost all of the low-quality regions. The overlap region should therefore be approx. 24 nt (more than the minimum 12 needed). The primers have definitely been removed from the reads.
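For reference, the filtering call matching this setup would look roughly like the following; the file-path objects and the maxN/maxEE/truncQ values are placeholders in the style of the dada2 tutorial, not values taken from my run:

library(dada2)
# truncate forward reads to 270 nt and reverse reads to 180 nt, as described above
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(270, 180),
                     maxN = 0, maxEE = c(2, 2), truncQ = 2,
                     compress = TRUE, multithread = TRUE)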

head(track)
input filtered denoisedF denoisedR merged nonchim
101-16S 48211 38736 30592 36481 18427 16600
102-16S 49561 41633 33087 39272 20470 18815
103-16S 52595 43115 34682 40605 22060 20211
104-16S 48032 39328 31738 37023 19806 17902
105-16S 54383 44568 44403 44409 44146 43829
111-16S 55997 45441 38579 43359 26823 25150
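(For context, a tracking table like this is typically assembled in the tutorial style sketched below; out is the filterAndTrim return value and the other objects are per-sample lists, so treat the names as placeholders rather than my exact code:)

getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
head(track)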

I have tried to check whether it was a problem with the number of mismatches, but when I looked at the non-merged sequences (with mergePairs, returnRejects = TRUE) I could see that all the pairs that failed to merge had 0 matched nucleotides... Does this mean that they do not overlap at all? I tried to increase the overlap region by truncating the reverse reads less (down to 210), but the result was exactly the same.

Any advice?

Thank you very very much!

Irene

@jgrzadziel

Can you share with us a table of the sequences that were not merged?

@benjjneb
Copy link
Owner

Is it possible that heterogeneity spacers were used in sequencing these samples?

@MicroIrene
Author

Hi! Thank you very much for your answers!

First, @jgrzadziel, I have tried to retrieve the non-merged sequences, but without much success... All I get is this:

library(dplyr)

merg2 <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose = TRUE, returnRejects = TRUE)
merg3 <- merg2[[1]]
nomerged <- merg3 %>%
  filter(accept == FALSE)

head(nomerged)
  sequence abundance forward reverse nmatch nmismatch nindel prefer accept
1                 36     717     530      1         0      0      2  FALSE
2                 29     206     357      2         0      0      1  FALSE
3                 27     905     346      1         0      0      2  FALSE
4                 24      92      33      1         0      0      2  FALSE
5                 22      92     486      1         0      0      1  FALSE
6                 21     295     280      0         0      0      1  FALSE

So there is no sequence attached to them, while the merged ones do have one. I guess that is because no consensus sequence could be built....
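In case it is useful, the reads behind each rejected pair can still be looked up, since the forward and reverse columns are indices into the denoised sequences. A minimal sketch, assuming the objects above (sample 1):

fwd_seqs <- getSequences(dadaFs[[1]])[nomerged$forward]
rev_seqs <- getSequences(dadaRs[[1]])[nomerged$reverse]
head(cbind(fwd_seqs, rev_seqs))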

@benjjneb, I am not sure about the heterogeneity spacers... Maybe? The data they sent me had the indexes and, I assume, any other tags removed, but not the primers. How would spacers affect my sequences? I did notice something odd (low quality) in the first bases of the forward primer, so I removed the first 10 nt, just in case.

@MicroIrene
Author

Sorry, I meant on the forward sequences after removing the primer. Then I removed another 10 nt.

@benjjneb
Copy link
Owner

I am not sure about the heterogeneity spacers... Maybe? The data they sent me had the indexes and, I assume, any other tags removed, but not the primers. How would spacers affect my sequences?

If heterogeneity spacers were left on the reads they sent you, that will interfere with DADA2 and can lead to results like what you are seeing, where a large fraction of reads fail to merge. I don't know if that is what is going on here, but it would be worth confirming the exact library setup the sequencing provider used, and perhaps asking them directly whether they use heterogeneity spacers in their amplicon libraries.
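One quick way to check from the data itself is to look at where the forward primer starts in the raw reads: if the start position varies from read to read, spacers are probably present. A rough sketch (the primer sequence and file name here are assumptions, substitute your own):

library(ShortRead)
library(Biostrings)
primer_fwd <- DNAString("CCTACGGGNGGCWGCAG")            # assumed 341F, check your exact primer
reads <- head(sread(readFastq("101-16S_R1.fastq.gz")), 2000)
hits <- vmatchPattern(primer_fwd, reads, fixed = FALSE) # fixed = FALSE lets IUPAC codes match
table(unlist(startIndex(hits)))
# all hits at position 1 -> no spacers; a spread of start positions suggests heterogeneity spacers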

@MicroIrene
Author

Ok, thanks! Yes, I will contact them and see what I can find out. In the meantime... if the problem is those spacers, will removing the primers with cutadapt solve it? If I understand correctly, the spacers would sit before the primer, and cutadapt will look for the primer and remove everything upstream of it.
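Something like this is what I had in mind (a sketch only; the primer sequences and file names are my assumptions and would need checking):

# -g/-G look for the 5' primers and remove them together with anything upstream (e.g. spacers);
# --discard-untrimmed drops pairs in which the primer was not found
cutadapt -g CCTACGGGNGGCWGCAG -G GACTACHVGGGTATCTAATCC \
    --discard-untrimmed \
    -o 101-16S_R1.trimmed.fastq.gz -p 101-16S_R2.trimmed.fastq.gz \
    101-16S_R1.fastq.gz 101-16S_R2.fastq.gz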

@benjjneb
Copy link
Owner

Will removing the primers with cutadapt solve it? If I understand correctly, the spacers would sit before the primer, and cutadapt will look for the primer and remove everything upstream of it.

If you are right, then yes. But if the spacers are after the primers, then it won't solve the problem. And that second library design (spacers after primers) is one that we have seen more than once.

@MicroIrene
Author

Um, Ok. Thanks. I hope that is not the case. But I will wait for their reply. Thanks a lot!

Irene

@MicroIrene
Author

Hi!

I finally got an answer from the sequencing company. They say that they don't use heterogeneity spacers... I tried cutadapt just in case and, as expected, I got the same results as just trimming a fixed number of nucleotides to remove the primers. Not sure what to do now....

Thanks!

@jgrzadziel

Hi again,
I have observed a similar situation in some of my samples. Usually, after filtering, merging and chimera removal, I keep about 80-90% of the input sequences, rarely less.

But in some cases only about 30-40% remain (so I'm losing over 50% of the reads, just like you). I don't know what the reason is, but I suspect sequencing errors, since the problem is always tied to a whole sequencing run.

For example, in one run I have 20 samples and each of them filters and merges properly, but in another run (sequenced on another day, on another machine) the output-to-input ratio is, as mentioned above, only ~30-40%.

Interestingly, when the same samples are merged with other software (for example fastq-join) using the same parameters as mergePairs (the same minimum overlap and maximum mismatch), they merge much "better": 90-98% of the reads are merged. Just to be clear, when testing other software I used the dada2-filtered sequences (from the fastq/filtered folder).
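For what it's worth, the comparison was along these lines; the file names are just an example and the exact option names may differ between fastq-join versions, so check the tool's help (a sketch, not my exact command):

# ea-utils fastq-join on the dada2-filtered reads;
# -m = minimum overlap, -p = maximum percent difference within the overlap
fastq-join -m 12 -p 0 \
    filtered/101-16S_F_filt.fastq filtered/101-16S_R_filt.fastq \
    -o filtered/101-16S_%.fastq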

In conclusion, I assume the problem is linked to:

  1. Sequencing errors (maybe the chemistry is the reason?)
    and/or
  2. The error-model calculation in dada2, which then causes problems in the subsequent merging of the paired-end reads.

Examples (the same primers, the same sequencing technology, the same company and protocol, but a different run):

Bad sequencing run:

  input filtered merged nonchim % passed
A20 119865 102438 47577 43893 0.37
A22 117759 98857 42156 39017 0.33
B1 166299 137772 44267 41629 0.25
B3 137891 116255 39380 37906 0.27
C23 129657 109436 43049 40351 0.31
C35 123348 105248 46208 43751 0.35
G10 145544 120777 43602 41392 0.28
G5 162744 135980 60080 55944 0.34
G7 126610 104909 45126 43134 0.34
G8 126168 107595 45447 42387 0.34
O11 126883 103759 38547 36715 0.29
O13 146323 122846 39754 37494 0.26
P14 127261 105903 36091 34313 0.27
P16 105805 90545 28805 27460 0.26
R17 131448 110208 37457 35647 0.27
R19 118008 99897 32623 31164 0.26

Good sequencing run:

  input filtered denoisedF denoisedR merged nonchim % passed
HK 59024 52524 52358 52370 52179 48448 0.82
HL 146517 129911 129299 129501 128233 96035 0.66
HP 64797 57285 57157 57151 56954 55946 0.86
RK 18679 16588 16514 16497 16450 16381 0.88
RL 49815 43961 43752 43765 43369 36853 0.74
RP 92668 83322 83134 83136 82828 81727 0.88
ZaH 90147 76997 76888 76782 76606 76233 0.85
ZaR 127245 110668 110509 110526 110245 106082 0.83
ZH 82032 71260 71127 71082 70920 70410 0.86
ZR 83590 70972 70916 70847 70737 69821 0.84

@benjjneb
Owner

benjjneb commented Sep 5, 2019

@MicroIrene I'm not sure, unfortunately. You could try extending truncLen a bit more, e.g. truncLen = c(280, 220), which should give more than enough overlap; but if that doesn't fix things, I'm not sure what is interfering with the merging.
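The adjusted call would look something like this (the other settings are placeholders; keep whatever you used in your original run):

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(280, 220),
                     maxEE = c(2, 2), truncQ = 2,
                     compress = TRUE, multithread = TRUE)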

@benjjneb
Owner

benjjneb commented Sep 5, 2019

@jgrzadziel

Huh. That is a major difference between runs from the same sequencing provider!
Is a difference in chemistry actually a possibility? Did these runs potentially use different sequencing chemistries?

@jgrzadziel

@benjjneb
No, the chemistry was 100% identical. But maybe its quality was different (expired reagents?)...
I have no idea, but I think it's time to change the company.

@MicroIrene
Author

Oh, that's a shame. Well, thank you very much anyway. I already tried trimming a bit more and it still isn't working. I think my only option now is to accept that I lose that many reads and keep going with the analysis... As @jgrzadziel suggests, it might be an issue with the quality of the run. Unfortunately the run cannot be repeated, so I'll stick with what I have.
Thank you very much for your suggestions.

@jgrzadziel

@MicroIrene
Why don't you try merging the reads with other software? Maybe that would help; then you could go back to dada2 and continue the analysis.

@MicroIrene
Author

Interesting.... Yes, I could do that. Actually, I tried before with different software (obitools) and the reads merged very well (99% merged successfully). But I thought dada2 needed to work with the unmerged reads for the error-learning and sample-inference steps.... Is it going to be a problem to work with already-merged reads?
Thanks

@jgrzadziel

@MicroIrene

My suggestion was to apply dada2 for:

filterAndTrim

then, using other software:

merge the sequences

then move back to dada2 for:

removeBimeraDenovo (seqtab.nochim; or remove chimeras in the other software as well)
assignTaxonomy (or use DECIPHER IdTaxa)

What do you think @benjjneb ?
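For concreteness, the dada2 end of that hybrid route could be sketched roughly as below. The file locations are assumptions, and note that this particular sketch only dereplicates the externally merged reads instead of denoising them:

library(dada2)
# hypothetical layout: externally merged reads written to merged/<sample>.fastq.gz
merged_files <- list.files("merged", pattern = "fastq.gz$", full.names = TRUE)
dereps <- lapply(merged_files, derepFastq)
names(dereps) <- sub(".fastq.gz$", "", basename(merged_files))
seqtab <- makeSequenceTable(dereps)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)
taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_v132_train_set.fa.gz", multithread = TRUE)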

@ariannacapparotto

Hello @MicroIrene, I'm having the same problem with my dataset (I'm using DADA2 to process it).
I was wondering if you were able to solve it by merging your sequences using another tool or, if not, what you did. Thanks!
