necessity of two-barcode requirement #52

Open
jvolkening opened this issue Jul 12, 2020 · 0 comments

Hello,

I apologize in advance for the long-winded post. I'm hoping to start a discussion on the necessity of the "two-barcode" demultiplexing requirement.

The ARTIC SARS-CoV-2 SOP states: "For the current version of the ARTIC protocol it is essential to demultiplex using strict parameters to ensure barcodes are present at each end of the fragment." This is further elaborated on in the guide here, which explains that the main concern is the possible occurrence of in silico chimeric reads. This all seems fine, and we have been following the recommended stringent Guppy demux settings for nanopore-based SARS-CoV-2 sequencing.

However, on some runs we are seeing a significant percentage of reads that are not assigned to barcode bins because of the two-barcode requirement. The extent of the issue varies from run to run, but in the latest run, for example, 78% of reads were thrown out because of this requirement. Based on the plot below, I suspect that the barcode and adapter sequence are actually missing on one end of these reads. Why they are missing is a separate issue, which the lab doing the sequencing is discussing with ONT tomorrow (although any ideas are welcome).

length_by_class.pdf

In any case, if I remove the two-barcode requirement and leave the rest of the parameters as default, only 7% of the reads are unclassified. Because of uneven coverage for some of the amplicons, the result for many samples is a drastic difference in completeness of the consensus genome sequence.
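For anyone wanting to make the same comparison on their own runs, here is a minimal sketch of how the unclassified fraction could be tallied from a per-read demultiplexing summary. It assumes a tab-separated file with a header row and a `barcode_arrangement` column in which unassigned reads are labelled `unclassified`; the column name, the label, and the function name are assumptions to adjust for whatever your demultiplexer actually writes.

```python
import csv
from collections import Counter
from typing import Iterable


def unclassified_fraction(lines: Iterable[str],
                          column: str = "barcode_arrangement",
                          label: str = "unclassified") -> float:
    """Fraction of reads whose barcode assignment equals `label`.

    `lines` is any iterable of text lines (e.g. an open file handle)
    holding a tab-separated table with a header row. The default column
    and label names are assumptions, not a guaranteed file format.
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    # Avoid dividing by zero on an empty table.
    return counts[label] / total if total else 0.0
```

Running this once on the summary from the strict two-barcode run and once on the summary from the relaxed run gives the two percentages being compared, e.g. `with open(path) as fh: print(unclassified_fraction(fh))`, where `path` is a placeholder for your own summary file.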

In doing some benchmarking of demultiplexing and the effect of tuning various parameters, I'm questioning the need for the two-barcode requirement. My logic is based on the fact that there are multiple other filters and factors which mitigate the potential for issues caused by in silico chimeric reads. Although we've been using Guppy lately for demuxing at the same time as basecalling, the following points are specific to Porechop, as that's what I've used to investigate and benchmark the issue:

  1. Even when two barcode matches are not required, Porechop will not classify a read that has mismatched high-quality barcode hits at both ends. If one end has a best hit to NB1 with a passing score, and the other end has a best hit to NB2 with a passing score within a certain distance of the first, the read is not classified. This is controlled by the --barcode_diff parameter of Porechop.

  2. Porechop will not classify a read that has a matching adapter sequence in the middle, as would be expected for chimeric reads. This is enabled by default when demultiplexing and in fact cannot be disabled.

  3. Perhaps most importantly, we are applying the size filtering using artic guppyplex as recommended in the SOP. Because the ncov19 amplicon sizes are fairly uniform and thus the read length distribution is fairly tight, the upper limit can be set conservatively, essentially precluding the possibility of chimeric reads passing through unless they were too short to begin with.

  4. Again because of the narrow read length distributions, any significant population of chimeric reads would be obvious on the read length QC plot as a second peak. In practice, there is sometimes a small peak at approximately twice the amplicon size (this is before size filtering), but it is a very small fraction of the total population.
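To make points 1 and 3 concrete, here is a simplified model of the decision logic as I've described it. This is not Porechop's or guppyplex's actual code: the function names are hypothetical, the default threshold (75.0) and difference (5.0) are what I believe Porechop uses but should be treated as assumptions, and the 400-700 length window in the usage note is purely illustrative.

```python
from typing import Optional, Tuple

# (barcode name, percent identity of the best hit at one end of the read)
Hit = Tuple[str, float]


def classify_read(start: Optional[Hit], end: Optional[Hit],
                  threshold: float = 75.0, diff: float = 5.0,
                  require_two: bool = False) -> str:
    """Return a barcode name, or 'none' if the read is left unclassified."""
    # Keep only the ends whose best hit passes the score threshold.
    passing = [h for h in (start, end) if h is not None and h[1] >= threshold]
    if not passing:
        return "none"
    if require_two:
        # Strict mode: both ends must pass and name the same barcode.
        if len(passing) == 2 and passing[0][0] == passing[1][0]:
            return passing[0][0]
        return "none"
    if len(passing) == 1:
        # Relaxed mode: a single good barcode is enough.
        return passing[0][0]
    (name_a, score_a), (name_b, score_b) = passing
    if name_a == name_b:
        return name_a
    # Conflicting barcodes at the two ends (point 1): reject the read
    # unless one hit beats the other by at least `diff`.
    if abs(score_a - score_b) < diff:
        return "none"
    return name_a if score_a > score_b else name_b


def within_length_window(read_length: int, min_len: int, max_len: int) -> bool:
    """Size filter (point 3): keep reads inside an inclusive length window."""
    return min_len <= read_length <= max_len
```

Under this model, a read with a good NB01 hit at one end and nothing at the other is kept in relaxed mode but dropped in strict mode, while a read with similarly scored NB01 and NB02 hits at opposite ends is dropped in both. With a window of 400 to 700, a chimera of two full-length ~400 bp amplicons would also fail `within_length_window` regardless of its barcodes.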

Clearly, the ideal solution is to fix the problem of apparent missing barcodes shown in the plot above. However, in the meantime we would like to make the best use of the data we have. My inclination, based on the above reasoning, is to remove the two-barcode requirement and re-process the data. I would appreciate any feedback from the ARTIC experts on the points above and on whether this would be considered acceptable practice.

Many thanks.
