add function to assist Illumina index correction #917
illumina.py::guess_barcodes
Would be cool to handle non-even distributions of pooled samples.
@yesimon Yeah, it would be great to handle non-even distributions once we have relevant test data. This PR is mostly intended to add an automated check on demultiplexed data, so pool quants are not available at the time of demux since they're not part of the sample sheet. It's certainly worth revisiting in the future. We could standardize the practice of adding pool fraction information to the sample sheet. I hope to have a better handle on paths forward after some time in the lab.
Added functionality seems quite useful for lab debugging -- even if there may be ways to improve upon it in the future, it seems worth having on by default. Quick question: do we have a sense of whether this adds much runtime to a demux task on a large flowcell, since it's run sequentially after demux and before the fastqc steps?
I haven't benchmarked it, but the additional runtime should be minimal since it compares the relatively few rows in the picard metrics file against a truncated and already-sorted list of (the top 1000) observed barcodes. I don't think it makes sense to spin it out into a separate task considering the additional overhead.
Sounds good, yeah that's going to be quite quick, since it's only processing two
This adds a function, `guess_barcodes`, to `illumina.py` to help identify barcodes that are outliers by read count, along with potential alternative barcode pairs that may make sense. The heuristic tries to find a barcode pair, not used by another sample, that 1) has one index match (assuming a laboratory swap impacted only one of the indices) and 2) has a higher read count. If single-barcode matches do not work but there is a higher-read-count option with two different barcodes, this is suggested instead. Where colliding pairs exist, the output is cautious and does not suggest alternatives, though outliers are still identified. Outliers are identified based on variance from the assumption of a balanced pool with one negative control, though the threshold and number of controls can be set. As an alternative to finding outliers, the user can specify a sample name explicitly or define a read count threshold below which barcodes will be reassessed. An error is issued if the number of assigned reads is <70% of the pool (configurable). A call to `guess_barcodes` has been incorporated into the demultiplexing workflows, both Snakemake and WDL. A separate WDL task to call only `guess_barcodes` is not included in this PR. Basic tests are included.
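For readers skimming the description, here is a rough sketch of the heuristic as described above. It is not the code in `illumina.py`: the function names, argument shapes, the 0.5 outlier cutoff, and the barcode data are all hypothetical. The outlier test is a simple stand-in for the variance-based check, and the handling of colliding pairs is left out.

```python
# Illustrative sketch only; names, thresholds, and data are assumptions,
# not the guess_barcodes implementation from this PR.

def find_outliers(sample_reads, n_controls=1, threshold=0.5):
    """Flag samples far below the read count expected in a balanced pool.

    Assumes every non-control sample should receive an equal share of reads.
    Negative controls will usually be flagged too, which is expected.
    """
    total = sum(sample_reads.values())
    n_real = max(len(sample_reads) - n_controls, 1)
    expected = total / n_real
    return sorted(s for s, n in sample_reads.items() if n < threshold * expected)


def suggest_pair(sample, sheet, observed):
    """Suggest an alternative barcode pair for one sample, or None.

    sheet:    dict of sample name -> (barcode_1, barcode_2) from the sample sheet
    observed: list of ((barcode_1, barcode_2), read_count), sorted by count descending
    """
    bc1, bc2 = sheet[sample]
    used = set(sheet.values())
    current = dict(observed).get((bc1, bc2), 0)
    # Candidates: pairs not assigned to any sample, with more reads than the current pair.
    candidates = [pair for pair, n in observed if pair not in used and n > current]
    # Prefer pairs that keep one index (a swap affecting only one index).
    single = [p for p in candidates if p[0] == bc1 or p[1] == bc2]
    if single:
        return single[0]
    # Otherwise fall back to the best pair with two different barcodes.
    return candidates[0] if candidates else None


def assigned_fraction(sheet, observed):
    """Fraction of observed reads that belong to pairs listed in the sample sheet."""
    used = set(sheet.values())
    total = sum(n for _, n in observed)
    return sum(n for pair, n in observed if pair in used) / total


# Made-up example data.
sheet = {
    "s1": ("AAAA", "CCCC"),
    "s2": ("GGGG", "TTTT"),
    "s3": ("ACGT", "TGCA"),
    "neg": ("TTAA", "GGCC"),
}
observed = [
    (("AAAA", "CCCC"), 1000),
    (("GGGG", "TTTT"), 950),
    (("ACGT", "CATG"), 900),
    (("TTTT", "AAAA"), 50),
    (("ACGT", "TGCA"), 10),
    (("TTAA", "GGCC"), 5),
]
counts = dict(observed)
sample_reads = {name: counts.get(pair, 0) for name, pair in sheet.items()}

outliers = find_outliers(sample_reads)           # ["neg", "s3"]
suggestion = suggest_pair("s3", sheet, observed)  # ("ACGT", "CATG")
fraction = assigned_fraction(sheet, observed)     # about 0.67, below a 0.70 cutoff
```

In this toy data, `s3` shares its first index with a well-populated unassigned pair, so that pair is suggested. The assigned fraction falls under 70%, which is the situation where the description says an error is issued.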