Rip out the indel recalibration code from BaseRecalibrationEngine #1056
To clarify, this means that the |
That's correct, @akiezun. |
agreed, just wanted to know the top level starting point. |
CEUTrio.HiSeq.WEx.b37.NA12892.chr10.bam (1.2GB on local drive)
Initial optimized run on the 1.2GB file after ripping out some of the indel code:
For reference, here are times for pre-optimization GATK3 and GATK4 (some data from #1033):
GATK4 master branch, with indels
GATK4 master branch, the
(which is already faster than GATK3.4.46; numbers listed below)
GATK3.4.46 with indels
GATK3.4.46 with the
For reference, the lower bound on runtime (for BQSR optimizations, not the reading/writing itself) is established by PrintReads on the same data:
|
@akiezun it's really important to make sure that the results match between the original and newly optimized versions (and if they don't, to make sure we understand why). Some of the "indel stuff" should probably stay, e.g. the masking of sites using known indels. Since we aren't running Indel Realigner anymore, there may be some "errors" that we do want to mask because they are alignment artifacts and not sequencing errors. |
for now this is just the apply step. |
Hey folks, based on this effort, at what point do we start telling users to disable indel quals (or do it for them by default) in the output of BQSR in GATK3? People complain about the file sizes and this would alleviate some of that pain. I know that that's how it's done in production. |
a little bit shaved off. working on more
|
@vdauwera we stopped using indel quals a while back... |
removed boxing:
|
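For readers unfamiliar with why boxing matters in a per-base loop, here is a minimal sketch. The class and method names are hypothetical and are not taken from the GATK code base; it only illustrates the general pattern of replacing a wrapper-typed accumulator with a primitive one.

```java
final class BoxingSketch {
    // Boxed accumulator: each += unboxes the Long, adds, and boxes the result again,
    // which can allocate a new Long object on every iteration.
    static long sumBoxed(byte[] quals) {
        Long total = 0L;
        for (byte q : quals) {
            total += q;
        }
        return total;
    }

    // Primitive accumulator: same arithmetic, no wrapper objects created in the loop.
    static long sumPrimitive(byte[] quals) {
        long total = 0L;
        for (byte q : quals) {
            total += q;
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] quals = {30, 31, 2, 40};
        System.out.println(sumBoxed(quals) + " " + sumPrimitive(quals));
    }
}
```

Both methods return the same value; the difference is only in allocation and unboxing work, which adds up when the loop runs once per base of every read.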
don't even compute indel covariates in ApplyBQSR
|
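A rough sketch of the idea, assuming covariate keys are computed per event type: the apply step asks only for the substitution keys, so the insertion and deletion arrays are never allocated or filled. `CovariateSketch`, `EventKind`, and the placeholder key formula are all made up for illustration and do not reflect the actual GATK classes.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

final class CovariateSketch {
    // Hypothetical event kinds; the names are illustrative only.
    enum EventKind { SUBSTITUTION, INSERTION, DELETION }

    // Allocates and fills one key array per *requested* event kind only.
    static Map<EventKind, int[]> computeKeys(byte[] bases, EnumSet<EventKind> kinds) {
        Map<EventKind, int[]> keys = new EnumMap<>(EventKind.class);
        for (EventKind kind : kinds) {
            int[] perBase = new int[bases.length];
            for (int i = 0; i < bases.length; i++) {
                // Placeholder key; a real covariate would encode context, cycle, etc.
                perBase[i] = bases[i] * 3 + kind.ordinal();
            }
            keys.put(kind, perBase);
        }
        return keys;
    }
}
```

A caller on the apply path would pass `EnumSet.of(EventKind.SUBSTITUTION)`, while a caller that still needs all three kinds would pass `EnumSet.allOf(EventKind.class)`.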
reduce memory allocation at calls using varargs in ApplyBQSR:
|
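As background on the varargs point: in Java, every call to a variable-arity method builds a fresh array at the call site, so a varargs method called once per base allocates once per base. One common fix is a fixed-arity overload, sketched below. The names `combine` and `combine3` and the hash formula are hypothetical; this is not the method that was changed in ApplyBQSR.

```java
final class VarargsSketch {
    // Varargs version: each call site allocates a new int[] to hold the arguments.
    static int combine(int... keys) {
        int h = 17;
        for (int k : keys) {
            h = 31 * h + k;
        }
        return h;
    }

    // Fixed-arity version: same result for three keys, no array allocation.
    static int combine3(int a, int b, int c) {
        int h = 17;
        h = 31 * h + a;
        h = 31 * h + b;
        h = 31 * h + c;
        return h;
    }

    public static void main(String[] args) {
        System.out.println(combine(1, 2, 3) == combine3(1, 2, 3));
    }
}
```

Another option, when the arity is not fixed, is to have the caller reuse one preallocated array across calls instead of letting the compiler create one per call.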
pre-compute the platform from the header rather than recomputing it for every single read
|
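A minimal sketch of that kind of hoisting, assuming the platform is a per-read-group property declared in the file header: build a read-group-to-platform map once, then do a single map lookup per read. `ReadGroup` and the method names here are stand-ins invented for the example, not htsjdk or GATK types.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class PlatformCacheSketch {
    // Hypothetical stand-in for a header read-group record (ID plus platform).
    static final class ReadGroup {
        final String id;
        final String platform;
        ReadGroup(String id, String platform) {
            this.id = id;
            this.platform = platform;
        }
    }

    // Done once per run: normalize each read group's platform up front.
    static Map<String, String> platformsByReadGroup(List<ReadGroup> headerReadGroups) {
        Map<String, String> platforms = new HashMap<>();
        for (ReadGroup rg : headerReadGroups) {
            String p = rg.platform == null ? null : rg.platform.toUpperCase(Locale.ROOT);
            platforms.put(rg.id, p);
        }
        return platforms;
    }

    // Done once per read: a map lookup instead of re-deriving the platform.
    static String platformOf(String readGroupId, Map<String, String> platforms) {
        return platforms.get(readGroupId);
    }

    public static void main(String[] args) {
        Map<String, String> cache = platformsByReadGroup(
                Arrays.asList(new ReadGroup("rg1", "illumina"), new ReadGroup("rg2", null)));
        System.out.println(platformOf("rg1", cache));
    }
}
```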
@droazen having looked into the rip-out a bit more, I think it's too disruptive for alpha. The recalibration table will change, and that will require a more thorough validation. Since this change could alter results, I vote to move it past alpha. I have removed the indel calculations from ApplyBQSR because that does not change any semantics. |
@akiezun For alpha-1 we should either do this or decide that it's not worth doing and close the ticket. |
let's push past alpha-1. I'd like to focus on non-disruptive speedups and on eval. @droazen ok? |
@akiezun Yes, agreed -- moving the milestone for this one. |
Ongoing analyses are in gsa6:/local/akiezun/gatk4_bqsr_deleteIndels_v2. The analysis is to run with and without indels and compare the recalibrated quals, with and without binning. |
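For anyone wanting to reproduce a qual-by-qual comparison, the core check can be sketched as below. This assumes the two runs emit reads in the same order and that each read's base qualities are available as a `byte[]`; `QualCompareSketch` and `countMismatches` are hypothetical names, and this is not the script used for the analysis above.

```java
final class QualCompareSketch {
    // Counts positions where the two recalibrated quality arrays for one read differ.
    // Throws if the arrays have different lengths, since that means the reads do not match up.
    static int countMismatches(byte[] qualsWithIndels, byte[] qualsWithoutIndels) {
        if (qualsWithIndels.length != qualsWithoutIndels.length) {
            throw new IllegalArgumentException("read lengths differ: "
                    + qualsWithIndels.length + " vs " + qualsWithoutIndels.length);
        }
        int mismatches = 0;
        for (int i = 0; i < qualsWithIndels.length; i++) {
            if (qualsWithIndels[i] != qualsWithoutIndels[i]) {
                mismatches++;
            }
        }
        return mismatches;
    }

    public static void main(String[] args) {
        byte[] a = {30, 31, 32};
        byte[] b = {30, 29, 32};
        System.out.println(countMismatches(a, b));
    }
}
```

Summing this count over all reads and getting zero is what "qual-by-qual identical" would mean under these assumptions.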
Update1:
Update2: Results are qual-by-qual identical on a 30GB exome file.
runtime of BaseRecalibrators (on the 30GB file)
runtime of ApplyBQSR
|
rerun on 30GB
|
what was the cause of the previous |
how do i know? NFS or other people using gsa6 maybe |
for comparison, GATK3 on same file |
this one is done |
After doing this, do another comparison run against GATK3 BQSR to see how much eliminating this code bought us in terms of performance.