Losing 50% reads after merging #831

Closed · MicroIrene opened this issue Aug 23, 2019 · 18 comments

Comments

@MicroIrene

Hi,

I have a problem with my dataset: I am losing a lot of reads after merging, approximately 50% of them. I have read all the threads related to this problem, but none of them seems to give me a solution... Can you please help? Is this a real problem, or should I just accept that I lose so many reads and keep working with what is left?

My sequences are the bacterial 16S V3-V4 region, primers 341F and 805R, with an expected fragment size (after primer removal) of approx. 426 nt. Sequencing was Illumina 2x300. I truncated the forward reads to 270 nt and the reverse reads to 180 nt, which removed almost all of the low-quality regions. The overlap region should therefore be approx. 24 nt (more than the minimum 12 needed). The primers have definitely been removed from the reads.
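For reference, the filtering call matching this setup would look roughly like the following; the file-path objects and the maxN/maxEE/truncQ values are placeholders in the style of the dada2 tutorial, not values taken from my run:

library(dada2)
# truncate forward reads to 270 nt and reverse reads to 180 nt, as described above
out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(270, 180),
                     maxN = 0, maxEE = c(2, 2), truncQ = 2,
                     compress = TRUE, multithread = TRUE)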

head(track)
input filtered denoisedF denoisedR merged nonchim
101-16S 48211 38736 30592 36481 18427 16600
102-16S 49561 41633 33087 39272 20470 18815
103-16S 52595 43115 34682 40605 22060 20211
104-16S 48032 39328 31738 37023 19806 17902
105-16S 54383 44568 44403 44409 44146 43829
111-16S 55997 45441 38579 43359 26823 25150
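(For context, a tracking table like this is typically assembled in the tutorial style sketched below; out is the filterAndTrim return value and the other objects are per-sample lists, so treat the names as placeholders rather than my exact code:)

getN <- function(x) sum(getUniques(x))
track <- cbind(out,
               sapply(dadaFs, getN), sapply(dadaRs, getN),
               sapply(mergers, getN), rowSums(seqtab.nochim))
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged", "nonchim")
head(track)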

I have tried to check whether it was a problem with the number of mismatches, but when I looked at the non-merged sequences (with mergePairs, returnRejects = TRUE) I could see that all the pairs that failed to merge had 0 matched nucleotides... Does this mean that they do not overlap at all? I tried to increase the overlap region by truncating the reverse reads less (down to 210), but the result was exactly the same.

Any advice?

Thank you very very much!

Irene

@jgrzadziel

Can you share with us a table of the sequences that were not merged?

@benjjneb
Copy link
Owner

Is it possible that heterogeneity spacers were used in sequencing these samples?

@MicroIrene
Author

Hi! Thank you very much for your answers!

First, @jgrzadziel, I have tried to retrieve the non-merged sequences, but without much success... All I get is this:

library(dplyr)

merg2 <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, verbose = TRUE, returnRejects = TRUE)
merg3 <- merg2[[1]]
nomerged <- merg3 %>%
  filter(accept == FALSE)

head(nomerged)
  sequence abundance forward reverse nmatch nmismatch nindel prefer accept
1                 36     717     530      1         0      0      2  FALSE
2                 29     206     357      2         0      0      1  FALSE
3                 27     905     346      1         0      0      2  FALSE
4                 24      92      33      1         0      0      2  FALSE
5                 22      92     486      1         0      0      1  FALSE
6                 21     295     280      0         0      0      1  FALSE

So there is no sequence attached to them, while the merged ones do have one. I guess that is because no consensus sequence could be built....
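In case it is useful, the reads behind each rejected pair can still be looked up, since the forward and reverse columns are indices into the denoised sequences. A minimal sketch, assuming the objects above (sample 1):

fwd_seqs <- getSequences(dadaFs[[1]])[nomerged$forward]
rev_seqs <- getSequences(dadaRs[[1]])[nomerged$reverse]
head(cbind(fwd_seqs, rev_seqs))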

@benjjneb, I am not sure about the heterogeneity spacers... Maybe? The data they sent me had the indexes and, I assume, any other tags removed, but not the primers. How would spacers affect my sequences? I did notice something odd (low quality) in the first bases of the forward primer, so I removed the first 10 nt, just in case.

@MicroIrene
Author

Sorry, I meant on the forward sequences after removing the primer. Then I removed another 10 nt.

@benjjneb
Copy link
Owner

I am not sure about the heterogeneity spacers... Maybe? The data they sent me had the indexes and, I assume, any other tags removed, but not the primers. How would spacers affect my sequences?

If heterogeneity spacers were left on the reads they sent you, that will interfere with DADA2 and can lead to results like what you are seeing, where a large fraction of reads fail to merge. I don't know if that is what is going on here, but it would be worth confirming the exact library setup the sequencing provider used, and perhaps asking them directly whether they use heterogeneity spacers in their amplicon libraries.
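One quick way to check from the data itself is to look at where the forward primer starts in the raw reads: if the start position varies from read to read, spacers are probably present. A rough sketch (the primer sequence and file name here are assumptions, substitute your own):

library(ShortRead)
library(Biostrings)
primer_fwd <- DNAString("CCTACGGGNGGCWGCAG")            # assumed 341F, check your exact primer
reads <- head(sread(readFastq("101-16S_R1.fastq.gz")), 2000)
hits <- vmatchPattern(primer_fwd, reads, fixed = FALSE) # fixed = FALSE lets IUPAC codes match
table(unlist(startIndex(hits)))
# all hits at position 1 -> no spacers; a spread of start positions suggests heterogeneity spacers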

@MicroIrene
Author

Ok, thanks! Yes, I will contact them and see what I can find out. In the meantime... if the problem is those spacers, will removing the primers with cutadapt solve it? If I understand correctly, the spacers would sit before the primer, and cutadapt will look for the primer and remove everything upstream of it.
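Something like this is what I had in mind (a sketch only; the primer sequences and file names are my assumptions and would need checking):

# -g/-G look for the 5' primers and remove them together with anything upstream (e.g. spacers);
# --discard-untrimmed drops pairs in which the primer was not found
cutadapt -g CCTACGGGNGGCWGCAG -G GACTACHVGGGTATCTAATCC \
    --discard-untrimmed \
    -o 101-16S_R1.trimmed.fastq.gz -p 101-16S_R2.trimmed.fastq.gz \
    101-16S_R1.fastq.gz 101-16S_R2.fastq.gz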

@benjjneb
Copy link
Owner

Will removing the primers with cutadapt solve it? If I understand correctly, the spacers would sit before the primer, and cutadapt will look for the primer and remove everything upstream of it.

If you are right, then yes. But if the spacers are after the primers, then it won't solve the problem. And that second library design (spacers after primers) is one that we have seen more than once.

@MicroIrene
Author

Um, Ok. Thanks. I hope that is not the case. But I will wait for their reply. Thanks a lot!

Irene

@MicroIrene
Author

Hi!

I finally got an answer from the sequencing company. They say that they don't use heterogeneity spacers... I tried cutadapt just in case and, as expected, I got the same results as just trimming a fixed number of nucleotides to remove the primers. Not sure what to do now....

Thanks!

@jgrzadziel

Hi again,
I have observed a similar situation in some of my samples. Usually, after filtering, merging and chimera removal, I keep about 80-90% of the input sequences, rarely less.

But in some cases only about 30-40% remain (so I'm losing over 50% of the reads, just like you). I don't know what the reason is, but I suspect sequencing errors, since the problem is always tied to a whole sequencing run.

For example, in one run I have 20 samples and each of them filters and merges properly, but in another run (sequenced on another day, on another machine) the output-to-input ratio is, as mentioned above, only ~30-40%.

Interestingly, when the same samples are merged with other software (for example fastq-join) using the same parameters as mergePairs (the same minimum overlap and maximum mismatch), they merge much "better": 90-98% of the reads are merged. Just to be clear, when testing other software I used the dada2-filtered sequences (from the fastq/filtered folder).
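For what it's worth, the comparison was along these lines; the file names are just an example and the exact option names may differ between fastq-join versions, so check the tool's help (a sketch, not my exact command):

# ea-utils fastq-join on the dada2-filtered reads;
# -m = minimum overlap, -p = maximum percent difference within the overlap
fastq-join -m 12 -p 0 \
    filtered/101-16S_F_filt.fastq filtered/101-16S_R_filt.fastq \
    -o filtered/101-16S_%.fastq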

In conclusion, I assume the problem is linked to:

  1. Sequencing errors (maybe the chemistry is the reason?)
    and/or
  2. The error-model calculation in dada2, which then causes problems in the subsequent merging of the paired-end reads.

Examples (the same primers, the same sequencing technology, the same company and protocol, but a different run):

Bad sequencing run:

  input filtered merged nonchim % passed
A20 119865 102438 47577 43893 0.37
A22 117759 98857 42156 39017 0.33
B1 166299 137772 44267 41629 0.25
B3 137891 116255 39380 37906 0.27
C23 129657 109436 43049 40351 0.31
C35 123348 105248 46208 43751 0.35
G10 145544 120777 43602 41392 0.28
G5 162744 135980 60080 55944 0.34
G7 126610 104909 45126 43134 0.34
G8 126168 107595 45447 42387 0.34
O11 126883 103759 38547 36715 0.29
O13 146323 122846 39754 37494 0.26
P14 127261 105903 36091 34313 0.27
P16 105805 90545 28805 27460 0.26
R17 131448 110208 37457 35647 0.27
R19 118008 99897 32623 31164 0.26

Good sequencing run:

  input filtered denoisedF denoisedR merged nonchim % passed
HK 59024 52524 52358 52370 52179 48448 0.82
HL 146517 129911 129299 129501 128233 96035 0.66
HP 64797 57285 57157 57151 56954 55946 0.86
RK 18679 16588 16514 16497 16450 16381 0.88
RL 49815 43961 43752 43765 43369 36853 0.74
RP 92668 83322 83134 83136 82828 81727 0.88
ZaH 90147 76997 76888 76782 76606 76233 0.85
ZaR 127245 110668 110509 110526 110245 106082 0.83
ZH 82032 71260 71127 71082 70920 70410 0.86
ZR 83590 70972 70916 70847 70737 69821 0.84

@benjjneb
Owner

benjjneb commented Sep 5, 2019

@MicroIrene I'm not sure, unfortunately. You could try extending truncLen a bit more, e.g. truncLen = c(280, 220), which should give more than enough overlap; but if that doesn't fix things, I'm not sure what is interfering with the merging.
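The adjusted call would look something like this (the other settings are placeholders; keep whatever you used in your original run):

out <- filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                     truncLen = c(280, 220),
                     maxEE = c(2, 2), truncQ = 2,
                     compress = TRUE, multithread = TRUE)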

@benjjneb
Owner

benjjneb commented Sep 5, 2019

@jgrzadziel

Huh. That is a major difference between runs from the same sequencing provider!
Is a difference in chemistry actually a possibility? Did these runs potentially use different sequencing chemistries?

@jgrzadziel

@benjjneb
No, the chemistry was 100% identical. But maybe its quality was different (expired reagents?)...
I have no idea, but I think it's time to change the company.

@MicroIrene
Author

Oh, that's a shame. Well, thank you very much anyway. I already tried trimming a bit more and it still isn't working. I think my only option now is to accept that I lose that many reads and keep going with the analysis... As @jgrzadziel suggests, it might be an issue with the quality of the run. Unfortunately the run cannot be repeated, so I'll stick with what I have.
Thank you very much for your suggestions.

@jgrzadziel

@MicroIrene
Why don't you try merging the reads with other software? Maybe that would help; then you could go back to dada2 and continue the analysis.

@MicroIrene
Author

Interesting.... Yes, I could do that. Actually, I tried before with different software (obitools) and the reads merged very well (99% merged successfully). But I thought dada2 needed to work with the unmerged reads for the error-learning and sample-inference steps.... Is it going to be a problem to work with already-merged reads?
Thanks

@jgrzadziel

@MicroIrene

My suggestion was to apply dada2 for:

filterAndTrim

then, using other software:

merge the sequences

then move back to dada2 for:

removeBimeraDenovo (seqtab.nochim; or remove chimeras in the other software as well)
assignTaxonomy (or use DECIPHER IdTaxa)

What do you think @benjjneb ?
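For concreteness, the dada2 end of that hybrid route could be sketched roughly as below. The file locations are assumptions, and note that this particular sketch only dereplicates the externally merged reads instead of denoising them:

library(dada2)
# hypothetical layout: externally merged reads written to merged/<sample>.fastq.gz
merged_files <- list.files("merged", pattern = "fastq.gz$", full.names = TRUE)
dereps <- lapply(merged_files, derepFastq)
names(dereps) <- sub(".fastq.gz$", "", basename(merged_files))
seqtab <- makeSequenceTable(dereps)
seqtab.nochim <- removeBimeraDenovo(seqtab, method = "consensus", verbose = TRUE)
taxa <- assignTaxonomy(seqtab.nochim, "silva_nr_v132_train_set.fa.gz", multithread = TRUE)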

@ariannacapparotto

Hello @MicroIrene, I'm having the same problem with my dataset (I'm using DADA2 to process it).
I was wondering if you were able to solve it by merging your sequences using another tool or, if not, what you did. Thanks!
