Rip out the indel recalibration code from BaseRecalibrationEngine #1056

droazen · 2015-10-27T19:22:04Z

After doing this, do another comparison run against GATK3 BQSR to see how much eliminating this code bought us in terms of performance.

akiezun · 2015-10-28T14:07:02Z

To clarify, this means that the disable_indel_quals parameter will always be true, right @eitanbanks ?

eitanbanks · 2015-10-29T01:29:16Z

That's correct, @akiezun.
However, it's not just stripping out that code. There are a ton of optimizations that can then be made to the code to simplify it afterwards. These classes were made very bulky to accommodate the indel calibration, and we should really remove that bulk. I can help with that.

akiezun · 2015-10-29T02:37:50Z

agreed, just wanted to know the top level starting point.

akiezun · 2015-10-30T17:31:07Z

CEUTrio.HiSeq.WEx.b37.NA12892.chr10.bam (1.2Gb on local drive)
Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17;

initial optimized run on 1.2Gb file after ripping out some of the indel code:
4.pre-alpha-71-g50b094b

real    4m58.795s
user    5m3.401s

For reference, here are times for pre-optimization GATK3 and GATK4 (some data from #1033)

GATK4 master branch, with indels

real    9m7.691s
user    9m2.250s

GATK4 master branch, the -DIQ option (ie skip indel quals)

real    5m19.914s
user    5m14.710s

(which is already faster than GATK3.4.46 - numbers listed below)

GATK3.4.46 with indels

real    11m21.538s
user    17m24.320s

GATK3.4.46 with the -DIQ option (ie skip indel quals)

real    6m3.662s
user    10m34.859s

For reference, the best possible bottom line (for bqsr optimizations, not reading/writing itself) is established by PrintReads on the same data:

real    2m48.873s
user    2m27.752s

eitanbanks · 2015-10-30T17:42:11Z

@akiezun it's really important to make sure that the results match between the original and newly optimized versions (and if they don't, to make sure we understand why). Some of the "indel stuff" should probably stay, e.g. the masking of sites using known indels. Since we aren't running Indel Realigner anymore, there may be some "errors" that we do want to mask because they are alignment artifacts and not sequencing errors.
Pinging @yfarjoun for his thoughts. We can discuss this at a future methods meeting if you like.

akiezun · 2015-10-30T17:43:53Z

for now this is just the apply step.

vdauwera · 2015-10-30T17:49:17Z

Hey folks, based on this effort, at what point do we start telling users to disable indel quals (or do it for them by default) in the output of BQSR in GATK3? People complain about the file sizes and this would alleviate some of that pain. I know that that's how it's done in production.

akiezun · 2015-10-30T20:28:37Z

a little bit shaved off. working on more

real    4m40.226s
user    4m42.279s

eitanbanks · 2015-10-30T21:01:15Z

@vdauwera we stopped using indel quals a while back...

akiezun · 2015-10-31T04:44:44Z

removed boxing:

real    4m35.791s
user    4m38.627s

akiezun · 2015-10-31T05:32:18Z

don't even compute indel covariates on applyBQSR

real    4m21.122s
user    4m23.733s

akiezun · 2015-11-02T05:08:12Z

reduce memory allocation at calls using varargs on applyBQSR:

real    4m11.298s
user    4m17.046s

akiezun · 2015-11-02T16:42:09Z

pre-compute the platform from the header rather than recompute it for every single read

real    4m7.124s
user    4m12.170s

akiezun · 2015-11-16T14:38:48Z

@droazen looking into the ripping out a bit more, I think it's too disruptive for alpha. The recalibration table will change and it will require a more thorough validation. As this is a potentially results-changing change, I vote to move this past alpha. I have removed the indel calculations from the ApplyBQSR because that does not change any semantics.
(i propose instead to nominate #1078 for alpha - I can put that in very quickly)

droazen · 2016-01-05T18:39:15Z

@akiezun For alpha-1 we should either do this or decide that it's not worth doing and close the ticket.

akiezun · 2016-01-12T15:25:21Z

let's push past alpha-1. I'd like to focus on non-disruptive speedups and on eval. @droazen ok?

droazen · 2016-01-12T17:19:12Z

@akiezun Yes, agreed -- moving the milestone for this one.

akiezun · 2016-04-20T22:03:49Z

Ongoing analyses in gsa6:/local/akiezun/gatk4_bqsr_deleteIndels_v2
branch with indels removed origin/ak_removeIndelsFromBQSR_take2

The analysis is to run with and without indels and compare recalibrated quals with and without binning

akiezun · 2016-04-20T22:10:53Z

Update1:
Results are qual-by-qual identical on our test data src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam

Update2: Results are qual-by-qual identical on a 30GB exome file
on gsa6 /local/akiezun/data/CEUTrio.HiSeq.WEx.b37.NA12892.bam

runtime of BaseRecalibrators (on the 30GB)

- with indels 115.0 minutes    100%
- without indels 102.8 minutes  89%

runtime of ApplyBQSR

- with indels 43.7 minutes (100%)
- without indels 51.7 minutes  <-- (109% weird, no idea why. rerunning to re-measure)

akiezun · 2016-04-21T14:07:38Z

rerun on 30GB

BaseRecalibrator with indels 119.7 minutes (100%)
BaseRecalibrator no indels    96.1 minutes  (80%)

ApplyBQSR with indels 48.5 minutes 100%
ApplyBQSR no indels 48.4 minutes   100%

droazen · 2016-04-21T14:08:57Z

what was the cause of the previous ApplyBQSR anomaly?

akiezun · 2016-04-21T14:10:16Z

how do i know? NFS or other people using gsa6 maybe

akiezun · 2016-04-21T19:34:12Z

for comparison, GATK3 on same file
BaseRecalibrator 129.29 min
PrintReads 232.25 min

akiezun · 2016-05-18T18:38:10Z

this one is done

droazen added tools performance labels Oct 27, 2015

droazen added this to the alpha milestone Oct 27, 2015

droazen assigned akiezun Oct 27, 2015

akiezun mentioned this issue Nov 8, 2015

50% speedup for ApplyBQSR by removing indel quals and other optimizations #1099

Merged

akiezun modified the milestones: beta, alpha Nov 16, 2015

droazen modified the milestones: alpha-1, beta Jan 5, 2016

droazen modified the milestones: beta, alpha-1 Jan 12, 2016

akiezun modified the milestones: alpha-2, beta Mar 10, 2016

akiezun mentioned this issue Apr 20, 2016

BQSR add option to not compute indel recalibration tables #1745

Merged

akiezun mentioned this issue May 16, 2016

no indel tables in BQSR tables by default #1821

Merged

akiezun closed this as completed May 18, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rip out the indel recalibration code from BaseRecalibrationEngine #1056

Rip out the indel recalibration code from BaseRecalibrationEngine #1056

droazen commented Oct 27, 2015

akiezun commented Oct 28, 2015

eitanbanks commented Oct 29, 2015

akiezun commented Oct 29, 2015

akiezun commented Oct 30, 2015

eitanbanks commented Oct 30, 2015

akiezun commented Oct 30, 2015

vdauwera commented Oct 30, 2015

akiezun commented Oct 30, 2015

eitanbanks commented Oct 30, 2015

akiezun commented Oct 31, 2015

akiezun commented Oct 31, 2015

akiezun commented Nov 2, 2015

akiezun commented Nov 2, 2015

akiezun commented Nov 16, 2015

droazen commented Jan 5, 2016

akiezun commented Jan 12, 2016

droazen commented Jan 12, 2016

akiezun commented Apr 20, 2016

akiezun commented Apr 20, 2016 •

edited

akiezun commented Apr 21, 2016

droazen commented Apr 21, 2016

akiezun commented Apr 21, 2016

akiezun commented Apr 21, 2016 •

edited

akiezun commented May 18, 2016

Rip out the indel recalibration code from BaseRecalibrationEngine #1056

Rip out the indel recalibration code from BaseRecalibrationEngine #1056

Comments

droazen commented Oct 27, 2015

akiezun commented Oct 28, 2015

eitanbanks commented Oct 29, 2015

akiezun commented Oct 29, 2015

akiezun commented Oct 30, 2015

eitanbanks commented Oct 30, 2015

akiezun commented Oct 30, 2015

vdauwera commented Oct 30, 2015

akiezun commented Oct 30, 2015

eitanbanks commented Oct 30, 2015

akiezun commented Oct 31, 2015

akiezun commented Oct 31, 2015

akiezun commented Nov 2, 2015

akiezun commented Nov 2, 2015

akiezun commented Nov 16, 2015

droazen commented Jan 5, 2016

akiezun commented Jan 12, 2016

droazen commented Jan 12, 2016

akiezun commented Apr 20, 2016

akiezun commented Apr 20, 2016 • edited

akiezun commented Apr 21, 2016

droazen commented Apr 21, 2016

akiezun commented Apr 21, 2016

akiezun commented Apr 21, 2016 • edited

akiezun commented May 18, 2016

akiezun commented Apr 20, 2016 •

edited

akiezun commented Apr 21, 2016 •

edited