add function to assist Illumina index correction #917

tomkinsc · 2019-01-25T15:55:14Z

This adds a function, guess_barcodes, to illumina.py to assist identifying barcodes that are outliers by read count and potential alternative barcode pairs that may make sense. The heuristic followed tries to find a barcode pair that is not used by another sample that 1) has one index match (assuming a laboratory swap impacted only one of the indices) and 2) has a higher read count. If single-barcode matches do not work but there is a higher read count option with two different barcode this is suggested instead. Where colliding pairs exist, the output is cautious and does not suggest alternatives, though outliers are still identified. Outliers are identified based on variance from the assumption of a balanced pool with one negative control, though threshold, number of controls can be set. As an alternative to finding outliers, the user can specify a sample name explicitly or define a readcount threshold below which barcodes will be reassessed. An error is issued if the number of assigned reads is <70% of the pool (configurable). A call to guess_barcodes has been incorporated into the demultiplexing workflows, both Snakemake and WDL. A separate WDL task to call only guess_barcodes is not included in this PR. Basic tests are included.

usage: illumina.py subcommand guess_barcodes [-h]
                                             [--readcount_threshold READCOUNT_THRESHOLD | --sample_names [SAMPLE_NAMES [SAMPLE_NAMES ...]]]
                                             [--outlier_threshold OUTLIER_THRESHOLD]
                                             [--expected_assigned_fraction EXPECTED_ASSIGNED_FRACTION]
                                             [--number_of_negative_controls NUMBER_OF_NEGATIVE_CONTROLS]
                                             [--rows_limit ROWS_LIMIT]
                                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                             [--version] [--tmp_dir TMP_DIR]
                                             [--tmp_dirKeep]
                                             in_barcodes in_picard_metrics
                                             out_summary_tsv

Guess the barcode value for a sample name, based on the following: - a list is
made of novel barcode pairs seen in the data, but not in the picard metrics -
for the sample in question, get the most abundant novel barcode pair where one
of the barcodes seen in the data matches one of the barcodes in the picard
metrics (partial match) - if there are no partial matches, get the most
abundant novel barcode pair Limitations: - If multiple samples share a barcode
with multiple novel barcodes, disentangling them is difficult or impossible.
The names of samples to guess are selected: - explicitly by name, passed via
argument, OR - explicitly by read count threshold, OR - automatically (if
names or count threshold are omitted) based on basic outlier detection of
deviation from an assumed-balanced pool with some number of negative controls

positional arguments:
  in_barcodes           The barcode counts file produced by common_barcodes.
  in_picard_metrics     The demultiplexing read metrics produced by Picard.
  out_summary_tsv       Path to the summary file (.tsv format). It includes
                        several columns: (sample_name, expected_barcode_1,
                        expected_barcode_2, expected_barcode_1_name,
                        expected_barcode_2_name, expected_barcodes_read_count,
                        guessed_barcode_1, guessed_barcode_2,
                        guessed_barcode_1_name, guessed_barcode_2_name,
                        guessed_barcodes_read_count, match_type), where the
                        expected values are those used by Picard during
                        demultiplexing and the guessed values are based on the
                        barcodes seen among the data.

optional arguments:
  -h, --help            show this help message and exit
  --readcount_threshold READCOUNT_THRESHOLD
                        If specified, guess barcodes for samples with fewer
                        than this many reads. (default: None)
  --sample_names [SAMPLE_NAMES [SAMPLE_NAMES ...]]
                        If specified, only guess barcodes for these sample
                        names. (default: None)
  --outlier_threshold OUTLIER_THRESHOLD
                        threshold of how far from unbalanced a sample must be
                        to be considered an outlier. (default: 0.675)
  --expected_assigned_fraction EXPECTED_ASSIGNED_FRACTION
                        The fraction of reads expected to be assigned. An
                        exception is raised if fewer than this fraction are
                        assigned. (default: 0.7)
  --number_of_negative_controls NUMBER_OF_NEGATIVE_CONTROLS
                        The number of negative controls in the pool, for
                        calculating expected number of reads in the rest of
                        the pool. (default: 1)
  --rows_limit ROWS_LIMIT
                        The number of rows to use from the in_barcodes.
                        (default: 1000)
  --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}
                        Verboseness of output. [default: INFO]
  --version, -V         show program's version number and exit
  --tmp_dir TMP_DIR     Base directory for temp files. [default:
                        /var/folders/bx/90px0g1n1v122slzjnyvrsk5gn42vm/T/]
  --tmp_dirKeep         Keep the tmp_dir if an exception occurs while running.
                        Default is to delete all temp files at the end, even
                        if there's a failure. (default: False)

illumina.py::guess_barcodes

…ipeline

yesimon

Would be cool to handle non-even distributions of pooled samples.

util/illumina_indices.py

tomkinsc · 2019-01-28T14:54:32Z

@yesimon Yeah, it would be great to handle non-even distributions once we have relevant test data. This PR is mostly intended to add an automated check on demultiplexed data, so pool quants are not available at the time of demux since they're not part of the sample sheet. It's certainly worth revisiting in the future. We could standardize the practice of adding pool fraction information to the sample sheet. I hope to have a better handle on paths forward after some time in the lab.

dpark01

Added functionality seems quite useful for lab debugging -- even if there may be ways to improve upon it in the future, it seems worth having on by default. Quick question: do we have a sense of whether this adds much runtime to a demux task on a large flowcell, since it's run sequentially after demux and before the fastqc steps?

tomkinsc · 2019-01-30T14:33:43Z

I haven't benchmarked it, but the additional runtime should be minimal since it compares the relatively few rows in the picard metrics file against a truncated and already-sorted list of (the top 1000) observed barcodes. I don't think it makes sense to spin it out into a separate task considering the additional overhead.

dpark01 · 2019-01-30T14:46:08Z

Sounds good, yeah that's going to be quite quick, since it's only processing two O(1) input sources.

tomkinsc added 8 commits January 24, 2019 11:04

add functions to assist with Illumina index correction

ef34c94

illumina.py::guess_barcodes

add basic tests related to barcode correction

abfe85f

include outlier barcode detection and guessing as part of WDL pipeline

a2782ee

include outlier barcode detection and guessing as part of snakemake p…

d33e972

…ipeline

missing comma

9f5d189

Merge branch 'master' into ct-guess-barcodes

d432f8e

float cast

b712a26

add --expected_assigned_fraction=0 to guess_barcodes call in WDL

0355f29

yesimon reviewed Jan 25, 2019

View reviewed changes

util/illumina_indices.py Show resolved Hide resolved

dpark01 approved these changes Jan 30, 2019

View reviewed changes

Merge branch 'master' into ct-guess-barcodes

562f140

tomkinsc added 2 commits January 30, 2019 13:10

Merge branch 'master' into ct-guess-barcodes

cc5744b

Merge branch 'master' into ct-guess-barcodes

81cfcf9

tomkinsc merged commit 9c0bc80 into master Jan 31, 2019

tomkinsc deleted the ct-guess-barcodes branch January 31, 2019 02:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add function to assist Illumina index correction #917

add function to assist Illumina index correction #917

tomkinsc commented Jan 25, 2019 •

edited

Loading

yesimon left a comment

tomkinsc commented Jan 28, 2019

dpark01 left a comment

tomkinsc commented Jan 30, 2019 •

edited

Loading

dpark01 commented Jan 30, 2019

add function to assist Illumina index correction #917

add function to assist Illumina index correction #917

Conversation

tomkinsc commented Jan 25, 2019 • edited Loading

yesimon left a comment

Choose a reason for hiding this comment

tomkinsc commented Jan 28, 2019

dpark01 left a comment

Choose a reason for hiding this comment

tomkinsc commented Jan 30, 2019 • edited Loading

dpark01 commented Jan 30, 2019

tomkinsc commented Jan 25, 2019 •

edited

Loading

tomkinsc commented Jan 30, 2019 •

edited

Loading