
Users reporting unreasonable memory usage in GenotypeGVCFs #4544

Closed
lbergelson opened this issue Mar 20, 2018 · 38 comments
@lbergelson
Member

Several people are reporting unreasonable memory usage in GenotypeGVCFs.
See https://gatkforums.broadinstitute.org/gatk/discussion/11634/genotypegvcfs-resource-problems#latest
and #4467.

We should investigate what's going on; we're probably loading too many lines into memory at once somehow. It's possible this is related to running on a base-pair-resolution file. Possibly #3480 might help?

@droazen
Collaborator

droazen commented Mar 20, 2018

related to #4512 possibly?

@lbergelson
Member Author

Yes, very possible.

@Neato-Nick

Jumping on to report that I've also been having issues with GATK 4 genotyping. CombineGVCFs (v4.0.1.1) runs fine no matter how many samples I use, but GenotypeGVCFs chokes at some limit. I iteratively subsetted my sample list to see at what point it begins to choke, using the line count of the final VCF file as an approximation of how far the genotyper got. Even at 110 samples, though, CombineGVCFs ran for several hours, and it produced a final, genotyped VCF file that was severely truncated, with few variants. See the attached graph.
iterative memory loss

As for the errors:

Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
and several
at java.util.... and at org.broadinstitute.hellbender.tools.walkers.genotyper....

I attached three of these error files so you can see the full list of memory problems. genotype55.e... is the log file from the run with the most samples that worked. genotype60.e... ends abruptly. genotype65.e... contains the memory errors. Note that these error/log files include both CombineGVCFs and GenotypeGVCFs.

genotype55.e5195822.txt
genotype60.e5195820.txt
genotype65.e5195821.txt

For reference, genotyping using GATK 3.8.0 on all 108 of my samples produced a final vcf file 2784 lines long in 36 seconds with no issue. Let me know if you have any other questions!

@droazen droazen added this to the Engine-2Q2018 milestone Mar 26, 2018
@Neato-Nick

UPDATE: I solved the issue on my end. A collaborator was having the same issue with their haploid data, but not their diploid. The problems I described above were for haploid data. He added "--new-qual" to his GenotypeGVCFs command and that solved the issue. It did for me as well! Using the same combined GVCF file as I was before, genotyping finished in less than half a minute after adding the new-qual parameter. Thought it would be useful to know that:

1. This issue appears to affect non-diploids more than diploids.

2. Using --new-qual solved the issue, at least for me.

I've attached the log-file generated from this new output, hopefully it helps in debugging.
genotype108_newqual.e5290702.txt

@droazen
Collaborator

droazen commented Mar 27, 2018

This is interesting. @davidbenjamin, can you think of a reason why --new-qual would use much less memory for non-diploid data vs. old qual?

@davidbenjamin
Contributor

@droazen I don't have a good reason. For ploidy greater than 2, sure, but I would expect old qual's brute force approach to do okay on haploids. I will say that for non-diploids it goes to GeneralPloidyExactAFCalculator, which is a class that nobody paid attention to and contains a lot of spooky code.
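[Editor's note: the combinatorics behind the "ploidy greater than 2" remark can be sketched. The number of unordered genotypes for A alleles at ploidy P is the multiset coefficient C(A + P - 1, P), so the per-sample genotype space a brute-force exact AF calculation must enumerate grows rapidly with ploidy. A hedged Python illustration, not GATK internals:]

```python
from math import comb

def num_genotypes(num_alleles, ploidy):
    # Number of unordered genotypes = multiset coefficient C(A + P - 1, P).
    return comb(num_alleles + ploidy - 1, ploidy)

# The per-sample genotype space grows quickly with ploidy, which is one
# plausible reason a brute-force exact AF calculation copes with haploid
# and diploid data but struggles beyond that:
for ploidy in (1, 2, 4, 6):
    counts = [num_genotypes(alleles, ploidy) for alleles in (2, 4, 8)]
    print(f"ploidy {ploidy}: {counts}")
```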

@droazen
Collaborator

droazen commented Mar 28, 2018

In that case it's likely that there is a bug specific to the old-qual code path in GATK4 that was not present in GATK3.

@chandrans
Contributor

chandrans commented Apr 6, 2018

Another user reported back with some information in the same thread.

When not using the '-new-qual' option the memory overrun does seem to depend on the inclusion of non-diploid samples in the cohort (haploid males in my case). I have now separately analyzed the 37 diploid females in the dataset. The script ran successfully with the following specs

...ncpus=1:mem=12g... gatk CombineGVCFs... gatk --java-options "-Xmx8g" GenotypeGVCFs... Peak memory use of the job was 9.5gb. A previous job with 8gb total memory was terminated due to memory overrun at the CombineGVCFs stage.

@V-Z

V-Z commented Jul 2, 2018

I can confirm this issue (4.0.5.2). With 16 tetraploid samples (CombineGVCFs outputs a 420 MB file), GenotypeGVCFs gets stuck at the beginning and later crashes (32 GB RAM is not enough).

@Neato-Nick

@V-Z did enabling the new-qual option resolve your issue?

@V-Z

V-Z commented Jul 2, 2018

@Neato-Nick No, it's just the same. Sorry, I was typing too fast. I tried it again and it crashes:

Using GATK jar /home/vojta/bin/gatk/gatk-package-4.0.5.2-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /home/vojta/bin/gatk/gatk-package-4.0.5.2-local.jar GenotypeGVCFs -O rad34test.comb2.raw.g.vcf.gz -R ../../../jic_reference/alygenomes.fasta -V rad34test.comb2.raw.vcf.gz -new-qual
21:10:39.975 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/vojta/bin/gatk/gatk-package-4.0.5.2-local.jar!/com/intel/gkl/native/libgkl_compression.so
21:10:43.152 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:10:43.153 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.0.5.2
21:10:43.153 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
21:10:44.334 INFO  GenotypeGVCFs - Initializing engine
21:10:44.849 INFO  FeatureManager - Using codec VCFCodec to read file file:///home/vojta/dokumenty/fakulta/botanika/arabidopsis/samples/lib_2018_06/4_joined/rad34test.comb2.raw.vcf.gz
21:10:44.979 INFO  GenotypeGVCFs - Done initializing engine
21:10:45.057 INFO  ProgressMeter - Starting traversal
21:10:45.057 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
21:10:45.344 WARN  InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples
21:10:47.780 INFO  GenotypeGVCFs - Shutting down engine
[2. července 2018 21:10:47 CEST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.13 minutes.
Runtime.totalMemory()=501219328
java.lang.IllegalArgumentException: log10LikelihoodsOfAC are bad 6.911788849595091E-17,NaN
at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AFCalculationResult.<init>(AFCalculationResult.java:72)
at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.getLog10PNonRef(AlleleFrequencyCalculator.java:143)
at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:255)
at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:210)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.calculateGenotypes(GenotypeGVCFs.java:266)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.regenotypeVC(GenotypeGVCFs.java:222)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:201)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.lambda$traverse$0(VariantWalkerBase.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
at org.broadinstitute.hellbender.engine.VariantWalkerBase.traverse(VariantWalkerBase.java:149)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:984)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:135)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:180)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:199)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)

@lbergelson
Member Author

@V-Z Huh, that might be a different issue then. @davidbenjamin Any thoughts?

@V-Z

V-Z commented Jul 2, 2018

@lbergelson Might be, but otherwise the description in the original post fits well to my problem.

@lbergelson
Member Author

@V-Z The massive memory use without --new-qual is almost certainly the same problem. The solution should be to use --new-qual. The `IllegalArgumentException` I haven't seen before, though. I suspect it's a different bug, or there's something weird about your input data that violates some assumption we make.

@V-Z

V-Z commented Jul 2, 2018

Weird. I don't mean to hijack the issue, but how can I verify that possibility? I have been using GATK 3.8 and I can process my data there...

@droazen
Collaborator

droazen commented Jul 2, 2018

@V-Z The error you encountered is a regression in new-qual that was introduced in 4.0.5.0. You can track this issue here: #4975

For the purposes of this ticket, you might want to do a test with GATK 4.0.4.0 and new-qual, and see if that works.

@V-Z

V-Z commented Jul 2, 2018

@droazen 4.0.4.0 with new-qual works as expected; 4.0.5+ doesn't.

@droazen
Collaborator

droazen commented Jul 3, 2018

@V-Z As I suspected -- in that case I recommend running with 4.0.4.0 for now, and upgrading once #4975 is fixed.

@davidbenjamin
Contributor

@V-Z Would you mind sharing your GVCF, or just the offending chunk, with me so I can debug? I'm pretty sure it's a finite precision error and have a simple fix in mind but I would like to confirm on real data.
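[Editor's note: one way a finite-precision error can produce a NaN like the `log10LikelihoodsOfAC are bad ...,NaN` message above is taking log10 of a quantity such as 1 - 10^x when x should be ≤ 0 but carries a tiny positive rounding error. A minimal Python sketch, purely illustrative; this is not the actual GATK code or its fix:]

```python
import math

LN10 = math.log(10.0)

def naive_log10_one_minus_pow10(log10_p):
    # Naive: if log10_p should be <= 0 but carries a tiny positive
    # rounding error, 1 - 10**log10_p evaluates to <= 0 and log10 fails.
    x = 1.0 - 10.0 ** log10_p
    return math.log10(x) if x > 0.0 else float("nan")

def stable_log10_one_minus_pow10(log10_p):
    # Clamp rounding noise at 0 and use expm1 for accuracy near zero:
    # 1 - 10**log10_p == -expm1(log10_p * ln(10)).
    log10_p = min(log10_p, 0.0)
    if log10_p == 0.0:
        return float("-inf")  # remaining probability mass is exactly 0
    return math.log10(-math.expm1(log10_p * LN10))
```

With an input like the tiny positive value from the stack trace (6.9...E-17), the naive version yields NaN while the clamped version returns a well-defined result.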

@V-Z

V-Z commented Jul 3, 2018

@davidbenjamin Sure, I'm sending it.

@davidbenjamin
Contributor

Thanks @V-Z. I also just noticed that it's not a reference that I'm familiar with (I work on human cancer; Arabidopsis doesn't come up all that often). Could you send me the reference fasta or tell me where to download it?

@davidbenjamin
Contributor

By the way, @V-Z I see something called Arabidopsis_thaliana_TAIR10.fasta on the Broad server, but I have no idea if that's the same as yours.

@V-Z

V-Z commented Jul 3, 2018

@davidbenjamin I'm sorry, I'm sending it.

@davidbenjamin
Contributor

@V-Z Thanks very much for your help! The change I had in mind fixes the issue. @droazen I will submit the PR now.

@sanjeevksh

I have also experienced the same issue, i.e. GenotypeGVCFs not proceeding further after the initial few hundred lines (242 lines, to be precise). I'm running 384 potato samples (242 diploid, 138 tetraploid, and 4 hexaploid) on a chromosome-by-chromosome basis. After adding the '-new-qual' option, the run completed for the whole chromosome (dummy chromosome 0), yielding 7654 lines. My query is whether using '-new-qual' is advisable under GATK best practices, or is it a temporary arrangement until a final fix is found?

Also, in the above completed run, only 102 variant sites were actually processed, as all the rest have the warning 'WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples'. This looks a bit strange to me, as potato is highly heterozygous and there should be more variants common to more than 10 samples unless there is a huge disparity in coverage across samples. This is GBS data, which further concentrates the reads at specific sites, increasing the chances of meeting the above requirement. Any comments on this warning are also highly appreciated.

@Neato-Nick

Neato-Nick commented Jul 30, 2018 via email

@sanjeevksh

Thank you so much @Neato-Nick for your feedback, highly useful indeed! I was just worried that all these locations with warnings were getting bypassed, which, per your feedback, should not be the case.
Regards,
Sanjeev

@davidbenjamin
Contributor

@sanjeevksh new-qual is a final fix for the deficiencies of the old model. You should use it.

@sanjeevksh

Thanks @davidbenjamin, that's really great to hear! With -new-qual option added, the GenotypeGVCFs runs completed very swiftly.
Regards,
Sanjeev

@sanjeevksh

My GenotypeGVCFs run for a single chromosome returned the following completion statement:
18:54:40.516 INFO ProgressMeter - Traversal complete. Processed 606308 total variants in 75.2 minutes.

However, there are only 46814 variant rows (excluding 52 header rows) in the corresponding vcf file. Does the above figure of 606308 correspond to a multiple of 'variants x number of samples'?

Also, there are only 16863 lines in my log file, does this mean that the 'Current Locus' column in the log file doesn't correspond to a single genomic location (bp) in the fasta file?

I am curious to know what is the relation between all these figures to fully understand what is happening while processing the gCVF files.

Also, on the inbreeding coefficient warning issue, I understand from @Neato-Nick's feedback that the variants with these warnings may still be fine and can be retained. However, this still leaves me worrying that out of 384 samples the locus doesn't even have 10 samples for generating the required metrics. Such variants won't be of any use for downstream analyses anyway, where any variants with more than 80% missing samples will be removed. Therefore, I wish to seek some more information about this 10-sample requirement - does it have some other context, or does it literally mean that fewer than 10 samples carry that variant?

Regards,
Sanjeev

@davidbenjamin
Contributor

However, there are only 46814 variant rows (excluding 52 header rows) in the corresponding vcf file. Does the above figure of 606308 correspond to a multiple of 'variants x number of samples'?

606308 is not 'variants x number of samples'. Rather, it is just 'variants'. However, keep in mind that most of the input "variants" that comprise these 606308 are GVCF reference confidence blocks that do not end up in the output.

To be more precise, each input GVCF has a mix of variants and reference confidence blocks, which don't necessarily overlap from sample to sample. The GATK engine turns the independent stream of records from each GVCF into a single multi-sample stream as if it came from a single multi-sample GVCF. This includes reconciling non-overlapping reference confidence blocks. 606308 is the number of effective GVCF records in this multi-sample stream. (BTW I don't mean a Java 8 Stream object -- it's just an Iterator).
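[Editor's note: the merging described above can be sketched as follows. Record shapes and names here are hypothetical and Python is used only for illustration; this is not the GATK engine's actual iterator:]

```python
import heapq
from itertools import groupby

# Hypothetical per-sample GVCF streams: position-sorted (position, record)
# pairs, where a record is a variant or a reference-confidence block.
sample_a = [(100, "ref-block"), (250, "variant"), (251, "ref-block")]
sample_b = [(100, "variant"), (200, "ref-block"), (250, "ref-block")]

def multi_sample_stream(*streams):
    # Merge the sorted streams, then group records by position so each
    # yielded item is one "effective record" of the multi-sample stream.
    merged = heapq.merge(*streams, key=lambda pair: pair[0])
    for pos, group in groupby(merged, key=lambda pair: pair[0]):
        yield pos, [record for _, record in group]

# Counting the yielded groups gives a ProgressMeter-style total that
# includes reference blocks, so it exceeds the variant rows in the VCF.
```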

Therefore, I wish to seek some more information about this 10 sample thing - does it have some other context or does it literally mean that there are only less than 10 samples carrying that variant?

The warning happens when fewer than 10 samples have likelihoods (PLs) for the variant -- it's not a matter of how many samples have the variant. That is, if 10 samples have PLs that say they are hom ref, you don't get the warning.
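[Editor's note: in other words, the check is over samples with PLs, not samples carrying the allele. A tiny illustrative sketch with assumed data shapes, not GATK source:]

```python
# A sample counts toward the minimum if it has genotype likelihoods
# (PLs), even when those PLs say it is hom-ref.
MIN_SAMPLES_WITH_PLS = 10  # threshold quoted in the warning message

def can_annotate_inbreeding_coeff(genotypes):
    # genotypes: one dict per sample; "PL" is None when likelihoods are
    # missing (e.g. a no-call), so such samples do not count.
    samples_with_pls = sum(1 for g in genotypes if g.get("PL") is not None)
    return samples_with_pls >= MIN_SAMPLES_WITH_PLS
```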

@Neato-Nick

Neato-Nick commented Jul 31, 2018 via email

@sanjeevksh

Thanks @davidbenjamin for a very detailed explanation, this helps a lot!

Thanks @Neato-Nick, yes that's the plan - I am just waiting for all runs to finish. There was a pressing need to clarify the 10-sample issue. I have done vcf filtering in the past and plan to do so by applying hard filtering using FilterVcf/VariantFiltration/SelectVariants. I was not aware of the vcfR package; thanks for directing me to it - it looks very useful. I'm not brilliant at R but have some working knowledge, so hopefully I should manage using this package.
Regards,
Sanjeev

@sanjeevksh

Hi @Neato-Nick @davidbenjamin

Apologies for posting this message here. I posted it a few days ago on the regular GATK forum, and also via the direct inbox option, but have gotten no response, so maybe something is wrong with my account.

The issue is: I have done variant calling on 384 potato samples following, mostly, GATK best practices, and have applied hard filters to select SNPs for further usage. However, I am noticing that the '--max-nocall-fraction', '--max-nocall-number', and '--max-fraction-filtered-genotypes' arguments for 'SelectVariants' are not working properly. I have tried various cutoff settings, and every time I observe SNPs with a much larger number of 'no call' genotypes (~246 out of 384 with a 0.10 setting) than the set thresholds. I searched the forum first but couldn't find any relevant threads. I am using the latest GATK version (4.0.7.0). I am attaching three example sets of (1) log files, (2) subset vcf files, and (3) the vcf index file for the three main vcfs. I would appreciate it if you could provide any feedback on this issue and/or whether this behaviour has been observed by other users as well.

The link to the original post is here:
https://gatkforums.broadinstitute.org/gatk/discussion/12688/possible-bug-in-selectvariants-tool#latest
SelectVariantBugReport.zip

Regards,
Sanjeev

@Neato-Nick

@sanjeevksh General GitHub practice is to comment on an issue only with information that is directly related to the original post. As your forum post states, your new comment is about the SelectVariants tool. I'm not with GATK, but from what I've seen, posting both on their forum and on this GitHub repository is okay and encouraged. But since this issue is unrelated to the GenotypeGVCFs tool, you really should open a new issue. It will be more helpful to both you and the GATK team (for various reasons) to open a new issue with your new question rather than just commenting here.

@lbergelson
Member Author

@sanjeevksh This is a good place for bug reports, would you mind moving this report to a new issue as suggested? Thank you.

@sanjeevksh

@Neato-Nick @lbergelson
Thanks for your comments, I will create a new thread.
Cheers,
Sanjeev

@droazen droazen removed this from the Engine-2Q2018 milestone Oct 4, 2018
@davidbenjamin
Contributor

TL;DR: new qual (which is now the default) fixes this.
