Fix CollectHSMetrics - Don't use Coverage Cap on fold score #1913
Conversation
…Coverage Cap - per documentation.
Hi @JoeVieira, thanks for this PR! I don't believe the user you've tagged is working on Picard right now, so we've asked for a review from someone else.
@kockan appreciated.
Thanks for this fix @JoeVieira! Just a few small questions/comments.
Also, have you checked how speed and memory usage are affected by this change? Presumably the impact isn't significant, but since you're replacing an array that is indexed into with a map, it would be good to confirm that there isn't a massive performance hit.
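For context on the access-pattern difference behind this concern, here is a minimal sketch (not the actual Picard code; class and method names are made up for illustration) of tallying per-base depths with a capped primitive array versus an uncapped map:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the array tally does one bounds-checked index per base,
// while the map tally pays for hashing and boxing on every increment.
class DepthTallySketch {
    static long[] tallyWithArray(final int[] depths, final int coverageCap) {
        final long[] counts = new long[coverageCap + 1];
        for (final int d : depths) {
            counts[Math.min(d, coverageCap)]++;
        }
        return counts;
    }

    static Map<Integer, Long> tallyWithMap(final int[] depths) {
        final Map<Integer, Long> counts = new HashMap<>();
        for (final int d : depths) {
            counts.merge(d, 1L, Long::sum);
        }
        return counts;
    }
}
```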
Review comments on src/main/java/picard/analysis/directed/TargetMetricsCollector.java (outdated, resolved).
@kachulis Appreciate the feedback! I've updated the PR accordingly. I previously shared your concern about additional memory usage; with this change, it can only reduce memory usage & time.
I agree about the memory usage now, but I don't think it's guaranteed this will only ever lead to a speedup. We are replacing tens of millions of array accesses with map lookups.
Fair enough - I'll run some benchmarks.
This isn't a perfect comparison, since the histogram now (correctly) has ~3x as many bins in it as the bugged current version, but at least it shows that performance is on par. For a BAM with 539,688 reads: PR: …
@kachulis Anything else before approving?
Hi @JoeVieira, sorry for the delayed response. I think the performance test is reasonably convincing, but your comment about there being 3x as many bins in the output histogram led me to think about some other issues as well. Specifically, writing out the full histogram (instead of the capped histogram) is a change from previous behavior, even compared to before the initial bug was introduced. Looking into this a bit more, there are two issues I'm wrestling with. The first is that, as currently written, this PR will introduce a behavior difference between … The other issue (and this is a long-standing issue, not something introduced in this PR) is that there seem to be some differences between how COVERAGE_CAP is utilized in the wgs metrics versus the targeted metrics.
However, for targeted metrics, the situation seems a bit messier. It looks like, historically, target metrics were initially uncapped, then some (such as median target coverage and the fold_x metrics) but not all became capped, and this PR would then uncap them again. I think this different behavior between wgs and targeted is reasonable, since for wgs very high coverage spikes are likely artifactual, while for targeted sequencing very high coverage is often the entire point (@yfarjoun do you remember if this was the intention?). It is a bit problematic, IMO, that the same parameter name (COVERAGE_CAP) ends up behaving differently in the two cases. After all that, my current feeling is we have three different places in targeted metrics that are relevant to your PR, where we need to decide whether the calculation should be coverage capped or not:

1. the theoretical sensitivity calculations
2. the summary metrics (median/mean target coverage, the fold_x metrics, etc.)
3. the coverage histogram written to the output
Currently, 1 and 3 are capped, while 2 is a mix of capped and uncapped (although originally it was all uncapped). From the perspective of this PR, I think the best option is to leave 1 capped, make 2 uncapped, and then take your pick on 3 (I think either choice is reasonable). I think this means you just need to align the behaviors of the two histograms, while keeping 1 capped and 2 uncapped, and then I will feel comfortable merging this. @yfarjoun @takutosato do either of you have any thoughts about the COVERAGE_CAP behavior in CollectHsMetrics?
@kachulis I totally agree with this - this region of the code really became tangled a few years back. Does anyone on this thread understand the intention of using the unfiltered data for the theoretical sensitivity calculations (simply to get "all" the data for simulation, I presumed)? Calling the capped data set merely "unfiltered" doesn't seem right, because it's also normalized. Do we want to output both the (uncapped, raw) unfiltered & high quality histograms? Do we also need to output the histogram used for the theoretical calculations? Speaking of which - is it correct to filter the data used for MEAN/MEDIAN_TARGET_COVERAGE, FOLD_80_BASE_PENALTY, etc. down to just the "high quality coverage"? Edited after reviewing the code again & updating my thinking on this.
I would propose: 1.) theoretical sensitivity cannot be uncapped, as that would defeat its purpose. I'm happy to update this PR to do so, if folks are in agreement.
@kachulis, @lbergelson Any thoughts here? I'm happy to update with the above proposed logic. I would love to wrap this up.
@JoeVieira sorry for the delayed response, I have gotten distracted by other things. I don't think the new histogram would be necessary, since the uncapped histogram contains all the information in the capped histogram (and the metrics would be calculated based on the uncapped data after this PR, wouldn't they?). For understanding the values used to calculate theoretical sensitivity, the uncapped histogram can easily be converted into a capped one just by collapsing the top bins, so I would prefer to leave that as something a user can do themselves if they are investigating some data, rather than adding another histogram. Otherwise I think your proposals are reasonable.
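As a rough illustration of the conversion described above (a sketch only, assuming the histogram is represented as depth-to-count pairs; the class and method names are not Picard's actual API), collapsing an uncapped histogram into a capped view just means folding every bin above the cap into the cap bin:

```java
import java.util.Map;
import java.util.TreeMap;

class HistogramCapSketch {
    // Fold every bin above coverageCap into the coverageCap bin.
    static TreeMap<Integer, Long> capHistogram(final Map<Integer, Long> uncapped, final int coverageCap) {
        final TreeMap<Integer, Long> capped = new TreeMap<>();
        for (final Map.Entry<Integer, Long> entry : uncapped.entrySet()) {
            capped.merge(Math.min(entry.getKey(), coverageCap), entry.getValue(), Long::sum);
        }
        return capped;
    }
}
```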
Bumping this - we're very eager for a fixed version of CollectHsMetrics that outputs correct median coverage and Fold-80 scores.
Hi @eboyden, I'd like to get this merged as soon as possible as well. I was going through the previous comments and it seems like @kachulis was happy with the changes proposed by @JoeVieira, except for the addition of a new capped unfiltered histogram. If @JoeVieira would like to make these changes, I could ask for a quick re-review.
create unfiltered, uncapped histogram for output
close TODO of normalizing array directly, to avoid the extra flip between array & histogram
Appreciate everyone's help on this. For the Picard developers: this would be outside the scope of this PR, but I suggest looking into whether CollectTargetedPcrMetrics or CollectWgsMetrics or any other Collect...Metrics tools are affected by this same bug. It seems like at least CollectWgsMetrics might be affected, based on a brief investigation I did for the original bug report #1767.
keep capped data for theoretical stats
move calculation of min / max depth into base loading
ensure histograms are not sparsely populated
Thanks a lot @eboyden! If this is a bug that affects multiple metric collection tools, they should definitely be fixed as well. I'll relay your findings to the team and find out what can be done.
Okay all - I've updated this a lot; there are a lot of threads going on in this code. It now builds the histograms directly and ensures they aren't sparsely populated (which using getDepths directly would cause), and it removes the need for another loop over the coverage data. The change to output uncapped histograms does require creating an additional histogram, since the quality histogram needs to be uncapped. The unfiltered coverage histogram is never output, so it doesn't need to be created as an uncapped histogram. @kockan @kachulis - let me know what you think; this was a bit of a rat's nest to solve. WRT @eboyden's question - I'm not clear on why CollectWgsMetrics would be impacted by this specific bug, since it's a different code path, but a similar logical issue might exist in that collector.
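To illustrate the non-sparse point (a hypothetical sketch, not the PR's code): if the histogram is built only from observed depths, depths that never occur have no bin at all; pre-filling every bin up to the maximum observed depth keeps the bins contiguous.

```java
import java.util.TreeMap;

class DenseHistogramSketch {
    static TreeMap<Integer, Long> buildDense(final int[] depths) {
        final TreeMap<Integer, Long> histogram = new TreeMap<>();
        int maxDepth = 0;
        for (final int d : depths) {
            histogram.merge(d, 1L, Long::sum);
            maxDepth = Math.max(maxDepth, d);
        }
        // Pre-fill missing depths with zero counts so the histogram is not sparse.
        for (int depth = 0; depth <= maxDepth; depth++) {
            histogram.putIfAbsent(depth, 0L);
        }
        return histogram;
    }
}
```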
@JoeVieira Thanks for the updates! Looks good to me, but I am not as familiar with the intricacies of this tool as @kachulis (who returns in about a week and a half), so just to be safe I'd like to wait for his final review before merging this.
@kachulis is out on leave for a few more weeks so I took a look at this. I tried to follow the conversation and I think this PR does what was agreed upon in the comments, so 👍
This PR resolves issue #1767
Per the documentation, COVERAGE_CAP was intended to apply only to the theoretical sensitivity calculations, but it was also applied to MEDIAN_COVERAGE, which in turn capped all of the statistics derived from it.
https://gatk.broadinstitute.org/hc/en-us/articles/360036856051-CollectHsMetrics-Picard-#--COVERAGE_CAP
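As a toy illustration of the effect (hypothetical numbers and a made-up cap value, not taken from the tool's defaults): once per-base depths are capped before the summary statistics are computed, the median, and every metric derived from it, is silently clamped to the cap.

```java
import java.util.Arrays;

class CoverageCapBugSketch {
    static double median(final int[] depths) {
        final int[] sorted = depths.clone();
        Arrays.sort(sorted);
        final int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(final String[] args) {
        final int coverageCap = 200;                        // illustrative cap value
        final int[] rawDepths = {150, 180, 450, 500, 620};  // a deep targeted panel
        final int[] cappedDepths = Arrays.stream(rawDepths).map(d -> Math.min(d, coverageCap)).toArray();
        System.out.println("median (raw)    = " + median(rawDepths));    // 450.0
        System.out.println("median (capped) = " + median(cappedDepths)); // 200.0
    }
}
```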
Changes with test coverage are