Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rip out the indel recalibration code from BaseRecalibrationEngine #1056

Closed
droazen opened this issue Oct 27, 2015 · 24 comments
Closed

Rip out the indel recalibration code from BaseRecalibrationEngine #1056

droazen opened this issue Oct 27, 2015 · 24 comments
Assignees
Milestone

Comments

@droazen
Copy link
Collaborator

droazen commented Oct 27, 2015

After doing this, do another comparison run against GATK3 BQSR to see how much eliminating this code bought us in terms of performance.

@akiezun
Copy link
Contributor

akiezun commented Oct 28, 2015

To clarify, this means that the disable_indel_quals parameter will always be true, right @eitanbanks ?

@eitanbanks
Copy link
Contributor

That's correct, @akiezun.
However, it's not just stripping out that code. There are a ton of optimizations that can then be made to the code to simplify it afterwards. These classes were made very bulky to accommodate the indel calibration, and we should really remove that bulk. I can help with that.

@akiezun
Copy link
Contributor

akiezun commented Oct 29, 2015

agreed, just wanted to know the top level starting point.

@akiezun
Copy link
Contributor

akiezun commented Oct 30, 2015

CEUTrio.HiSeq.WEx.b37.NA12892.chr10.bam (1.2Gb on local drive)
Mac OS X 10.9.5 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_25-b17;

initial optimized run on 1.2Gb file after ripping out some of the indel code:
4.pre-alpha-71-g50b094b

real    4m58.795s
user    5m3.401s

For reference, here are times for pre-optimization GATK3 and GATK4 (some data from #1033)

GATK4 master branch, with indels

real    9m7.691s
user    9m2.250s

GATK4 master branch, the -DIQ option (ie skip indel quals)

real    5m19.914s
user    5m14.710s

(which is already faster than GATK3.4.46 - numbers listed below)

GATK3.4.46 with indels

real    11m21.538s
user    17m24.320s

GATK3.4.46 with the -DIQ option (ie skip indel quals)

real    6m3.662s
user    10m34.859s

For reference, the best possible bottom line (for bqsr optimizations, not reading/writing itself) is established by PrintReads on the same data:

real    2m48.873s
user    2m27.752s

@eitanbanks
Copy link
Contributor

@akiezun it's really important to make sure that the results match between the original and newly optimized versions (and if they don't, to make sure we understand why). Some of the "indel stuff" should probably stay, e.g. the masking of sites using known indels. Since we aren't running Indel Realigner anymore, there may be some "errors" that we do want to mask because they are alignment artifacts and not sequencing errors.
Pinging @yfarjoun for his thoughts. We can discuss this at a future methods meeting if you like.

@akiezun
Copy link
Contributor

akiezun commented Oct 30, 2015

for now this is just the apply step.

@vdauwera
Copy link
Contributor

Hey folks, based on this effort, at what point do we start telling users to disable indel quals (or do it for them by default) in the output of BQSR in GATK3? People complain about the file sizes and this would alleviate some of that pain. I know that that's how it's done in production.

@akiezun
Copy link
Contributor

akiezun commented Oct 30, 2015

a little bit shaved off. working on more

real    4m40.226s
user    4m42.279s

@eitanbanks
Copy link
Contributor

@vdauwera we stopped using indel quals a while back...

@akiezun
Copy link
Contributor

akiezun commented Oct 31, 2015

removed boxing:

real    4m35.791s
user    4m38.627s

@akiezun
Copy link
Contributor

akiezun commented Oct 31, 2015

don't even compute indel covariates on applyBQSR

real    4m21.122s
user    4m23.733s

@akiezun
Copy link
Contributor

akiezun commented Nov 2, 2015

reduce memory allocation at calls using varargs on applyBQSR:

real    4m11.298s
user    4m17.046s

@akiezun
Copy link
Contributor

akiezun commented Nov 2, 2015

pre-compute the platform from the header rather than recompute it for every single read

real    4m7.124s
user    4m12.170s

@akiezun
Copy link
Contributor

akiezun commented Nov 16, 2015

@droazen looking into the ripping out a bit more, I think it's too disruptive for alpha. The recalibration table will change and it will require a more thorough validation. As this is a potentially results-changing change, I vote to move this past alpha. I have removed the indel calculations from the ApplyBQSR because that does not change any semantics.
(i propose instead to nominate #1078 for alpha - I can put that in very quickly)

@akiezun akiezun modified the milestones: beta, alpha Nov 16, 2015
@droazen
Copy link
Collaborator Author

droazen commented Jan 5, 2016

@akiezun For alpha-1 we should either do this or decide that it's not worth doing and close the ticket.

@droazen droazen modified the milestones: alpha-1, beta Jan 5, 2016
@akiezun
Copy link
Contributor

akiezun commented Jan 12, 2016

let's push past alpha-1. I'd like to focus on non-disruptive speedups and on eval. @droazen ok?

@droazen
Copy link
Collaborator Author

droazen commented Jan 12, 2016

@akiezun Yes, agreed -- moving the milestone for this one.

@droazen droazen modified the milestones: beta, alpha-1 Jan 12, 2016
@akiezun akiezun modified the milestones: alpha-2, beta Mar 10, 2016
@akiezun
Copy link
Contributor

akiezun commented Apr 20, 2016

Ongoing analyses in gsa6:/local/akiezun/gatk4_bqsr_deleteIndels_v2
branch with indels removed origin/ak_removeIndelsFromBQSR_take2

The analysis is to run with and without indels and compare recalibrated quals with and without binning

@akiezun
Copy link
Contributor

akiezun commented Apr 20, 2016

Update1:
Results are qual-by-qual identical on our test data src/test/resources/large/CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam

Update2: Results are qual-by-qual identical on a 30GB exome file
on gsa6 /local/akiezun/data/CEUTrio.HiSeq.WEx.b37.NA12892.bam

runtime of BaseRecalibrators (on the 30GB)

- with indels 115.0 minutes    100%
- without indels 102.8 minutes  89%

runtime of ApplyBQSR

- with indels 43.7 minutes (100%)
- without indels 51.7 minutes  <-- (109% weird, no idea why. rerunning to re-measure)

@akiezun
Copy link
Contributor

akiezun commented Apr 21, 2016

rerun on 30GB

BaseRecalibrator with indels 119.7 minutes (100%)
BaseRecalibrator no indels    96.1 minutes  (80%)
ApplyBQSR with indels 48.5 minutes 100%
ApplyBQSR no indels 48.4 minutes   100%

@droazen
Copy link
Collaborator Author

droazen commented Apr 21, 2016

what was the cause of the previous ApplyBQSR anomaly?

@akiezun
Copy link
Contributor

akiezun commented Apr 21, 2016

how do i know? NFS or other people using gsa6 maybe

@akiezun
Copy link
Contributor

akiezun commented Apr 21, 2016

for comparison, GATK3 on same file
BaseRecalibrator 129.29 min
PrintReads 232.25 min

@akiezun
Copy link
Contributor

akiezun commented May 18, 2016

this one is done

@akiezun akiezun closed this as completed May 18, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants