Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add function to assist Illumina index correction #917

Merged
merged 11 commits into from
Jan 31, 2019
Merged

Conversation

tomkinsc
Copy link
Member

@tomkinsc tomkinsc commented Jan 25, 2019

This adds a function, guess_barcodes, to illumina.py to assist identifying barcodes that are outliers by read count and potential alternative barcode pairs that may make sense. The heuristic followed tries to find a barcode pair that is not used by another sample that 1) has one index match (assuming a laboratory swap impacted only one of the indices) and 2) has a higher read count. If single-barcode matches do not work but there is a higher read count option with two different barcode this is suggested instead. Where colliding pairs exist, the output is cautious and does not suggest alternatives, though outliers are still identified. Outliers are identified based on variance from the assumption of a balanced pool with one negative control, though threshold, number of controls can be set. As an alternative to finding outliers, the user can specify a sample name explicitly or define a readcount threshold below which barcodes will be reassessed. An error is issued if the number of assigned reads is <70% of the pool (configurable). A call to guess_barcodes has been incorporated into the demultiplexing workflows, both Snakemake and WDL. A separate WDL task to call only guess_barcodes is not included in this PR. Basic tests are included.

usage: illumina.py subcommand guess_barcodes [-h]
                                             [--readcount_threshold READCOUNT_THRESHOLD | --sample_names [SAMPLE_NAMES [SAMPLE_NAMES ...]]]
                                             [--outlier_threshold OUTLIER_THRESHOLD]
                                             [--expected_assigned_fraction EXPECTED_ASSIGNED_FRACTION]
                                             [--number_of_negative_controls NUMBER_OF_NEGATIVE_CONTROLS]
                                             [--rows_limit ROWS_LIMIT]
                                             [--loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}]
                                             [--version] [--tmp_dir TMP_DIR]
                                             [--tmp_dirKeep]
                                             in_barcodes in_picard_metrics
                                             out_summary_tsv

Guess the barcode value for a sample name, based on the following: - a list is
made of novel barcode pairs seen in the data, but not in the picard metrics -
for the sample in question, get the most abundant novel barcode pair where one
of the barcodes seen in the data matches one of the barcodes in the picard
metrics (partial match) - if there are no partial matches, get the most
abundant novel barcode pair Limitations: - If multiple samples share a barcode
with multiple novel barcodes, disentangling them is difficult or impossible.
The names of samples to guess are selected: - explicitly by name, passed via
argument, OR - explicitly by read count threshold, OR - automatically (if
names or count threshold are omitted) based on basic outlier detection of
deviation from an assumed-balanced pool with some number of negative controls

positional arguments:
  in_barcodes           The barcode counts file produced by common_barcodes.
  in_picard_metrics     The demultiplexing read metrics produced by Picard.
  out_summary_tsv       Path to the summary file (.tsv format). It includes
                        several columns: (sample_name, expected_barcode_1,
                        expected_barcode_2, expected_barcode_1_name,
                        expected_barcode_2_name, expected_barcodes_read_count,
                        guessed_barcode_1, guessed_barcode_2,
                        guessed_barcode_1_name, guessed_barcode_2_name,
                        guessed_barcodes_read_count, match_type), where the
                        expected values are those used by Picard during
                        demultiplexing and the guessed values are based on the
                        barcodes seen among the data.

optional arguments:
  -h, --help            show this help message and exit
  --readcount_threshold READCOUNT_THRESHOLD
                        If specified, guess barcodes for samples with fewer
                        than this many reads. (default: None)
  --sample_names [SAMPLE_NAMES [SAMPLE_NAMES ...]]
                        If specified, only guess barcodes for these sample
                        names. (default: None)
  --outlier_threshold OUTLIER_THRESHOLD
                        threshold of how far from unbalanced a sample must be
                        to be considered an outlier. (default: 0.675)
  --expected_assigned_fraction EXPECTED_ASSIGNED_FRACTION
                        The fraction of reads expected to be assigned. An
                        exception is raised if fewer than this fraction are
                        assigned. (default: 0.7)
  --number_of_negative_controls NUMBER_OF_NEGATIVE_CONTROLS
                        The number of negative controls in the pool, for
                        calculating expected number of reads in the rest of
                        the pool. (default: 1)
  --rows_limit ROWS_LIMIT
                        The number of rows to use from the in_barcodes.
                        (default: 1000)
  --loglevel {DEBUG,INFO,WARNING,ERROR,CRITICAL,EXCEPTION}
                        Verboseness of output. [default: INFO]
  --version, -V         show program's version number and exit
  --tmp_dir TMP_DIR     Base directory for temp files. [default:
                        /var/folders/bx/90px0g1n1v122slzjnyvrsk5gn42vm/T/]
  --tmp_dirKeep         Keep the tmp_dir if an exception occurs while running.
                        Default is to delete all temp files at the end, even
                        if there's a failure. (default: False)

Copy link
Contributor

@yesimon yesimon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be cool to handle non-even distributions of pooled samples.

util/illumina_indices.py Show resolved Hide resolved
@tomkinsc
Copy link
Member Author

@yesimon Yeah, it would be great to handle non-even distributions once we have relevant test data. This PR is mostly intended to add an automated check on demultiplexed data, so pool quants are not available at the time of demux since they're not part of the sample sheet. It's certainly worth revisiting in the future. We could standardize the practice of adding pool fraction information to the sample sheet. I hope to have a better handle on paths forward after some time in the lab.

Copy link
Member

@dpark01 dpark01 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added functionality seems quite useful for lab debugging -- even if there may be ways to improve upon it in the future, it seems worth having on by default. Quick question: do we have a sense of whether this adds much runtime to a demux task on a large flowcell, since it's run sequentially after demux and before the fastqc steps?

@tomkinsc
Copy link
Member Author

tomkinsc commented Jan 30, 2019

I haven't benchmarked it, but the additional runtime should be minimal since it compares the relatively few rows in the picard metrics file against a truncated and already-sorted list of (the top 1000) observed barcodes. I don't think it makes sense to spin it out into a separate task considering the additional overhead.

@dpark01
Copy link
Member

dpark01 commented Jan 30, 2019

Sounds good, yeah that's going to be quite quick, since it's only processing two O(1) input sources.

@tomkinsc tomkinsc merged commit 9c0bc80 into master Jan 31, 2019
@tomkinsc tomkinsc deleted the ct-guess-barcodes branch January 31, 2019 02:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants