Fix CollectHSMetrics - Don't use Coverage Cap on fold score #1913
Conversation
…Coverage Cap - per documentation.
Hi @JoeVieira, thanks for this PR! I don't believe the user you've tagged is working on Picard right now, so we've asked for a review from someone else.
@kockan appreciated.
Thanks for this fix @JoeVieira! Just a few small questions/comments.
Also, have you checked how speed and memory usage are affected by this change? Presumably the impact isn't significant, but since you're replacing an array that is indexed into with a map, it would be good to confirm that there isn't a massive performance hit.
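For context on the access-pattern difference behind this concern, here is a minimal sketch (not the actual Picard code; class and method names are made up for illustration) of tallying per-base depths with a capped primitive array versus an uncapped map:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: the array tally does one bounds-checked index per base,
// while the map tally pays for hashing and boxing on every increment.
class DepthTallySketch {
    static long[] tallyWithArray(final int[] depths, final int coverageCap) {
        final long[] counts = new long[coverageCap + 1];
        for (final int d : depths) {
            counts[Math.min(d, coverageCap)]++;
        }
        return counts;
    }

    static Map<Integer, Long> tallyWithMap(final int[] depths) {
        final Map<Integer, Long> counts = new HashMap<>();
        for (final int d : depths) {
            counts.merge(d, 1L, Long::sum);
        }
        return counts;
    }
}
```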
Review comments on src/main/java/picard/analysis/directed/TargetMetricsCollector.java (outdated, resolved).
@kachulis Appreciate the feedback! I've updated the PR accordingly. I previously shared your concern about additional memory usage; with this change, it can only reduce memory usage & time.
I agree about the memory usage now, but I don't think it's guaranteed this will only ever lead to a speedup. We are replacing tens of millions of array accesses with map lookups.
Fair enough - I'll run some benchmarks.
This isn't a perfect comparison, since the histogram now (correctly) has ~3x as many bins in it as the bugged current version, but at least it shows that performance is on par. For a BAM with 539,688 reads: PR: …
@kachulis Anything else before approving?
Hi @JoeVieira, sorry for the delayed response. I think the performance test is reasonably convincing, but your comment about there being 3x as many bins in the output histogram led me to think about some other issues as well. Specifically, writing out the full histogram (instead of the capped histogram) is a change from previous behavior, even compared to before the initial bug was introduced. Looking into this a bit more, there are two issues I'm wrestling with. The first is that, as currently written, this PR will introduce a behavior difference between … The other issue (and this is a long-standing issue, not something introduced in this PR) is that there seem to be some differences between how COVERAGE_CAP is utilized in the wgs metrics versus the targeted metrics.
However, for targeted metrics, the situation seems a bit messier. It looks like, historically, target metrics were initially uncapped, then some (such as median target coverage and the fold_x metrics) but not all became capped, and this PR would then uncap them again. I think this different behavior between wgs and targeted is reasonable, since for wgs very high coverage spikes are likely artifactual, while for targeted sequencing very high coverage is often the entire point (@yfarjoun do you remember if this was the intention?). It is a bit problematic, IMO, that the same parameter name (COVERAGE_CAP) ends up behaving differently in the two cases. After all that, my current feeling is we have three different places in targeted metrics that are relevant to your PR, where we need to decide whether the calculation should be coverage capped or not:

1. the theoretical sensitivity calculations
2. the summary metrics (median/mean target coverage, the fold_x metrics, etc.)
3. the coverage histogram written to the output
Currently, 1 and 3 are capped, while 2 is a mix of capped and uncapped (although originally it was all uncapped). From the perspective of this PR, I think the best option is to leave 1 capped, make 2 uncapped, and then take your pick on 3 (I think either choice is reasonable). I think this means you just need to align the behaviors of the two histograms, while keeping 1 capped and 2 uncapped, and then I will feel comfortable merging this. @yfarjoun @takutosato do either of you have any thoughts about the COVERAGE_CAP behavior in CollectHsMetrics?
@kachulis I totally agree with this - this region of the code really became tangled a few years back. Does anyone on this thread understand the intention of using the unfiltered data for the theoretical sensitivity calculations (simply to get "all" the data for simulation, I presumed)? Calling the capped data set merely "unfiltered" doesn't seem right, because it's also normalized. Do we want to output both the (uncapped, raw) unfiltered & high quality histograms? Do we also need to output the histogram used for the theoretical calculations? Speaking of which - is it correct to filter the data used for MEAN/MEDIAN_TARGET_COVERAGE, FOLD_80_BASE_PENALTY, etc. down to just the "high quality coverage"? Edited after reviewing the code again & updating my thinking on this.
I would propose: 1.) theoretical sensitivity cannot be uncapped, as that would defeat its purpose. I'm happy to update this PR to do so, if folks are in agreement.
@kachulis, @lbergelson Any thoughts here? I'm happy to update with the above proposed logic. I would love to wrap this up.
@JoeVieira sorry for the delayed response, I have gotten distracted by other things. I don't think the new histogram would be necessary, since the uncapped histogram contains all the information in the capped histogram (and the metrics would be calculated based on the uncapped data after this PR, wouldn't they?). For understanding the values used to calculate theoretical sensitivity, the uncapped histogram can easily be converted into a capped one just by collapsing the top bins, so I would prefer to leave that as something a user can do themselves if they are investigating some data, rather than adding another histogram. Otherwise I think your proposals are reasonable.
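As a rough illustration of the conversion described above (a sketch only, assuming the histogram is represented as depth-to-count pairs; the class and method names are not Picard's actual API), collapsing an uncapped histogram into a capped view just means folding every bin above the cap into the cap bin:

```java
import java.util.Map;
import java.util.TreeMap;

class HistogramCapSketch {
    // Fold every bin above coverageCap into the coverageCap bin.
    static TreeMap<Integer, Long> capHistogram(final Map<Integer, Long> uncapped, final int coverageCap) {
        final TreeMap<Integer, Long> capped = new TreeMap<>();
        for (final Map.Entry<Integer, Long> entry : uncapped.entrySet()) {
            capped.merge(Math.min(entry.getKey(), coverageCap), entry.getValue(), Long::sum);
        }
        return capped;
    }
}
```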
Bumping this - we're very eager for a fixed version of CollectHsMetrics that outputs correct median coverage and Fold-80 scores.
Hi @eboyden, I'd like to get this merged as soon as possible as well. I was going through the previous comments and it seems like @kachulis was happy with the changes proposed by @JoeVieira, except for the addition of a new capped unfiltered histogram. If @JoeVieira would like to make these changes, I could ask for a quick re-review.
create unfiltered, uncapped histogram for output
close TODO of normalizing array directly, to avoid the extra flip between array & histogram
Appreciate everyone's help on this. For the Picard developers: this would be outside the scope of this PR, but I suggest looking into whether CollectTargetedPcrMetrics or CollectWgsMetrics or any other Collect...Metrics tools are affected by this same bug. It seems like at least CollectWgsMetrics might be affected, based on a brief investigation I did for the original bug report #1767.
keep capped data for theoretical stats
move calculation of min / max depth into base loading
ensure histograms are not sparsely populated
Thanks a lot @eboyden! If this is a bug that affects multiple metric collection tools, they should definitely be fixed as well. I'll relay your findings to the team and find out what can be done.
Okay all - I've updated this a lot; there are a lot of threads going on in this code. It now builds the histograms directly and ensures they aren't sparsely populated (which using getDepths directly would cause), and it removes the need for another loop over the coverage data. The change to output uncapped histograms does require creating an additional histogram, since the quality histogram needs to be uncapped. The unfiltered coverage histogram is never output, so it doesn't need to be created as an uncapped histogram. @kockan @kachulis - let me know what you think; this was a bit of a rat's nest to solve. WRT @eboyden's question - I'm not clear on why CollectWgsMetrics would be impacted by this specific bug, since it's a different code path, but a similar logical issue might exist in that collector.
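To illustrate the non-sparse point (a hypothetical sketch, not the PR's code): if the histogram is built only from observed depths, depths that never occur have no bin at all; pre-filling every bin up to the maximum observed depth keeps the bins contiguous.

```java
import java.util.TreeMap;

class DenseHistogramSketch {
    static TreeMap<Integer, Long> buildDense(final int[] depths) {
        final TreeMap<Integer, Long> histogram = new TreeMap<>();
        int maxDepth = 0;
        for (final int d : depths) {
            histogram.merge(d, 1L, Long::sum);
            maxDepth = Math.max(maxDepth, d);
        }
        // Pre-fill missing depths with zero counts so the histogram is not sparse.
        for (int depth = 0; depth <= maxDepth; depth++) {
            histogram.putIfAbsent(depth, 0L);
        }
        return histogram;
    }
}
```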
@JoeVieira Thanks for the updates! Looks good to me, but I am not as familiar with the intricacies of this tool as @kachulis (who returns in about a week and a half), so just to be safe I'd like to wait for his final review before merging this.
@kachulis is out on leave for a few more weeks so I took a look at this. I tried to follow the conversation and I think this PR does what was agreed upon in the comments, so 👍
This PR resolves issue #1767
Per the documentation, COVERAGE_CAP was intended to apply only to the theoretical sensitivity calculations, but it was also applied to MEDIAN_COVERAGE, which in turn capped all of the statistics derived from it.
https://gatk.broadinstitute.org/hc/en-us/articles/360036856051-CollectHsMetrics-Picard-#--COVERAGE_CAP
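As a toy illustration of the effect (hypothetical numbers and a made-up cap value, not taken from the tool's defaults): once per-base depths are capped before the summary statistics are computed, the median, and every metric derived from it, is silently clamped to the cap.

```java
import java.util.Arrays;

class CoverageCapBugSketch {
    static double median(final int[] depths) {
        final int[] sorted = depths.clone();
        Arrays.sort(sorted);
        final int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    public static void main(final String[] args) {
        final int coverageCap = 200;                        // illustrative cap value
        final int[] rawDepths = {150, 180, 450, 500, 620};  // a deep targeted panel
        final int[] cappedDepths = Arrays.stream(rawDepths).map(d -> Math.min(d, coverageCap)).toArray();
        System.out.println("median (raw)    = " + median(rawDepths));    // 450.0
        System.out.println("median (capped) = " + median(cappedDepths)); // 200.0
    }
}
```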
Changes with test coverage are