Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix coverage calculation using counts #598

Merged
merged 1 commit into from
Apr 26, 2021
Merged

Conversation

tskir
Copy link
Collaborator

@tskir tskir commented Apr 24, 2021

Fixes #593. Thank you @joys8998 for a super detailed bug report and @tetedange13 for helping out!

This has to do with the alternative coverage calculation method (cnvkit.py coverage -c). For reads with insertions/deletions, calculations using rlen and read.pos will produce incorrect results (for example, as reported in the linked ticket, a negative total read depth).

Instead, for a general solution, we can use the .positions attribute provided by pysam to see where each base maps to the reference genome, and count only the bases which come from within the specified region.

For reads with insertions/deletions, calculations using rlen and
read.pos will produce incorrect results.

Instead, for a general solution, we can use the `.positions` attribute
provided by pysam to see where each base maps to the reference genome,
and count only the bases which come from within the specified region.
@tskir tskir requested a review from etal April 24, 2021 05:44
@tskir
Copy link
Collaborator Author

tskir commented Apr 24, 2021

I have verified that this indeed resolves the issue. It also runs in reasonable time, only about twice as long as the the standard cnvkit.py coverage, and produces reasonable results.

What's interesting, however, is that coverage -c doesn't always agree with coverage, sometimes quite drastically (excuse the hastily built Google Sheets scatterplot):
image

X axis is the standard method and Y axis is with -c. The values are log2.

I suppose it is expected that the two methods sometimes produce different results, but perhaps this is something to be investigated in the future.

@tskir
Copy link
Collaborator Author

tskir commented Apr 24, 2021

And here's another quick graph; this is comparing coverage -c before (X) and after (Y) the fix:

image

(In "before the fix", the regions with negative absolute depth are converted to log2 = –20.)

This is just to verify that the fix didn't drastically break anything. As we can see, most of the regions are right where they were, with minor adjustments in cases where insertions/deletions were affecting the by-count calculation.

@tskir tskir mentioned this pull request Apr 24, 2021
@etal
Copy link
Owner

etal commented Apr 24, 2021

The -c flag is there for exploring different coverage calculation strategies, basically. It makes sense that -c could under-count regions with gapped or secondary alignments, or other quirks.

Here's the rationale behind the two coverage methods:

  • Default: Use htslib's pileup engine to calculate per-base coverages exactly the way samtools depth would to it. CNVkit does very little of its own math here. It would take a strong argument to justify doing coverage calculations any other way.
  • -c: Some other copy number tools take the more naive and possibly faster approach of just counting read start positions, or midpoints, within each bin. It's a little tricky to avoid double-counting reads that span bin boundaries. In this method, CNVkit attempts to behave like other tools, with a bit of extra logic to discount the bases that span bin boundaries.

At the time it was released, -c was about twice as fast as the default, but less robust and accurate than the pileup engine. Since then, the pileup engine has probably gotten faster, too. So, I would not recommend using -c in production, only for algorithm exploration.

Using read.positions makes sense, and I'm surprised the result still isn't identical to the default pileup engine's. Maybe secondary reads are being handled differently.

Another method that would be interesting to implement here, and hopefully significantly faster, is mosdepth: #304

Summary: This fix looks good to me, and I don't mind the performance hit.

@tskir tskir merged commit 2fc633f into etal:master Apr 26, 2021
@tskir tskir deleted the fix-coverage-count branch April 26, 2021 18:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Coverage error
2 participants