Fix coverage calculation using counts #598

tskir · 2021-04-24T05:44:04Z

Fixes #593. Thank you @joys8998 for a super detailed bug report and @tetedange13 for helping out!

This has to do with the alternative coverage calculation method (cnvkit.py coverage -c). For reads with insertions/deletions, calculations using rlen and read.pos will produce incorrect results (for example, as reported in the linked ticket, a negative total read depth).

Instead, for a general solution, we can use the .positions attribute provided by pysam to see where each base maps to the reference genome, and count only the bases which come from within the specified region.

For reads with insertions/deletions, calculations using rlen and read.pos will produce incorrect results. Instead, for a general solution, we can use the `.positions` attribute provided by pysam to see where each base maps to the reference genome, and count only the bases which come from within the specified region.

tskir · 2021-04-24T05:49:48Z

I have verified that this indeed resolves the issue. It also runs in reasonable time, only about twice as long as the the standard cnvkit.py coverage, and produces reasonable results.

What's interesting, however, is that coverage -c doesn't always agree with coverage, sometimes quite drastically (excuse the hastily built Google Sheets scatterplot):

X axis is the standard method and Y axis is with -c. The values are log2.

I suppose it is expected that the two methods sometimes produce different results, but perhaps this is something to be investigated in the future.

tskir · 2021-04-24T05:58:26Z

And here's another quick graph; this is comparing coverage -c before (X) and after (Y) the fix:

(In "before the fix", the regions with negative absolute depth are converted to log2 = –20.)

This is just to verify that the fix didn't drastically break anything. As we can see, most of the regions are right where they were, with minor adjustments in cases where insertions/deletions were affecting the by-count calculation.

etal · 2021-04-24T17:10:05Z

The -c flag is there for exploring different coverage calculation strategies, basically. It makes sense that -c could under-count regions with gapped or secondary alignments, or other quirks.

Here's the rationale behind the two coverage methods:

Default: Use htslib's pileup engine to calculate per-base coverages exactly the way samtools depth would to it. CNVkit does very little of its own math here. It would take a strong argument to justify doing coverage calculations any other way.
-c: Some other copy number tools take the more naive and possibly faster approach of just counting read start positions, or midpoints, within each bin. It's a little tricky to avoid double-counting reads that span bin boundaries. In this method, CNVkit attempts to behave like other tools, with a bit of extra logic to discount the bases that span bin boundaries.

At the time it was released, -c was about twice as fast as the default, but less robust and accurate than the pileup engine. Since then, the pileup engine has probably gotten faster, too. So, I would not recommend using -c in production, only for algorithm exploration.

Using read.positions makes sense, and I'm surprised the result still isn't identical to the default pileup engine's. Maybe secondary reads are being handled differently.

Another method that would be interesting to implement here, and hopefully significantly faster, is mosdepth: #304

Summary: This fix looks good to me, and I don't mind the performance hit.

tskir requested a review from etal April 24, 2021 05:44

tskir mentioned this pull request Apr 24, 2021

Coverage error #593

Closed

etal approved these changes Apr 24, 2021

View reviewed changes

tskir merged commit 2fc633f into etal:master Apr 26, 2021

tskir deleted the fix-coverage-count branch April 26, 2021 18:43

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix coverage calculation using counts #598

Fix coverage calculation using counts #598

tskir commented Apr 24, 2021

tskir commented Apr 24, 2021

tskir commented Apr 24, 2021 •

edited

Loading

etal commented Apr 24, 2021

Fix coverage calculation using counts #598

Fix coverage calculation using counts #598

Conversation

tskir commented Apr 24, 2021

tskir commented Apr 24, 2021

tskir commented Apr 24, 2021 • edited Loading

etal commented Apr 24, 2021

tskir commented Apr 24, 2021 •

edited

Loading