Users reporting unreasonable memory usage in GenotypeGVCFs #4544
related to #4512 possibly?
Yes, very possible.
Jumping on to report that I've also been having issues with GATK 4 genotyping. CombineGVCFs (v4.0.1.1) runs fine no matter how many samples I use, but GenotypeGVCFs chokes past some sample count. I iteratively subsetted my sample list to find the point at which it starts to choke, using the line count of the final VCF file as an approximation of how far the genotyper got. Even at 110 samples, though, GenotypeGVCFs ran for several hours and only produced a final, genotyped VCF file that was severely truncated, with few variants; see the attached graph. As for the errors:
I've attached three of these error files so you can see the full list of memory problems: genotype55.e5195822.txt. For reference, genotyping all 108 of my samples with GATK 3.8.0 produced a final VCF file 2784 lines long in 36 seconds with no issue. Let me know if you have any other questions!
UPDATE: I solved the issue on my end. A collaborator was having the same issue with his haploid data, but not his diploid data; the problems I described above were for haploid data. He added "--new-qual" to his GenotypeGVCFs command and that solved the issue. It did for me as well! Using the same combined GVCF file as before, genotyping finished in under half a minute after adding the new-qual parameter. Thought it would be useful to know that: (1) this issue appears to affect non-diploids more than diploids, and (2) using --new-qual solved the issue, at least for me. I've attached the log file generated from this new run; hopefully it helps in debugging.
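For anyone landing here with the same symptom, the workaround described above looks roughly like this (the reference and file names below are placeholders, not taken from the original report):

```shell
# Workaround reported in this thread for GATK 4.0.x: enable the newer
# QUAL model with --new-qual. Paths and file names are illustrative only.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V combined.g.vcf.gz \
    --new-qual \
    -O genotyped.vcf.gz
```

As the end of the thread notes, the new quality model later became the default, so recent GATK versions should not need the flag.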
This is interesting. @davidbenjamin, can you think of a reason why --new-qual would make such a difference for haploid data?
@droazen I don't have a good reason. For ploidy greater than 2, sure, but I would expect old qual's brute-force approach to do okay on haploids. I will say that for non-diploids it goes to …
In that case it's likely that there is a bug specific to the old-qual code path in GATK4 that was not present in GATK3.
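As a side note on the point about ploidy: the cost of a brute-force enumeration over genotypes grows with the number of possible genotypes, which for ploidy P and A alleles is the multiset count C(P + A - 1, P). A quick sketch of that growth (standard combinatorics, not GATK code):

```python
from math import comb

def num_genotypes(ploidy: int, num_alleles: int) -> int:
    """Unordered genotypes of a given ploidy drawn from num_alleles alleles:
    multisets of size `ploidy`, i.e. C(ploidy + num_alleles - 1, ploidy)."""
    return comb(ploidy + num_alleles - 1, ploidy)

# Diploid with 2 alleles: the familiar 3 genotypes (AA, AB, BB).
print(num_genotypes(2, 2))   # 3
# Tetraploid with 4 alleles already needs 35 genotypes per sample.
print(num_genotypes(4, 4))   # 35
# Hexaploid with 6 alleles: 462 genotypes -- brute force scales badly.
print(num_genotypes(6, 6))   # 462
```

This is why an exhaustive genotype walk that is cheap for diploids can blow up for polyploid samples with many alleles.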
Another user reported back with some information in the same thread.
I can confirm this issue (4.0.5.2). With 16 tetraploid samples (CombineGVCFs outputs a 420 MB file), GenotypeGVCFs gets stuck at the beginning and later crashes (32 GB of RAM is not enough).
@V-Z did enabling the --new-qual flag help?
@Neato-Nick
@V-Z Huh, that might be a different issue then. @davidbenjamin Any thoughts?
@lbergelson Might be, but otherwise the description in the original post fits my problem well.
@V-Z The massive memory use without …
Weird. I don't mean to hijack the issue, but how can I verify that possibility? I have been using GATK 3.8 and I can process my data there...
@droazen
@V-Z Would you mind sharing your GVCF, or just the offending chunk, with me so I can debug? I'm pretty sure it's a finite-precision error and have a simple fix in mind, but I would like to confirm on real data.
@davidbenjamin Sure, I'm sending it.
Thanks @V-Z. I also just noticed that it's not a reference I'm familiar with (I work on human cancer; Arabidopsis doesn't come up all that often). Could you send me the reference fasta or tell me where to download it?
By the way, @V-Z, I see something called Arabidopsis_thaliana_TAIR10.fasta on the Broad server, but I have no idea if that's the same as yours.
@davidbenjamin I'm sorry, I'm sending it.
I have also experienced the same issue, i.e. GenotypeGVCFs not proceeding past the initial few hundred lines (242 lines, to be precise). I'm running 384 potato samples (242 diploid, 138 tetraploid and 4 hexaploid) on a chromosome-by-chromosome basis. After adding the '-new-qual' option, the run completed for the whole chromosome (dummy chromosome 0), producing 7654 lines. My query is whether using '-new-qual' is advisable under GATK best practices, or whether it is a temporary arrangement until a final fix is found?

Also, in the above completed run, only '102' variant sites were actually processed, as all the rest have the warning 'WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples'. This looks a bit strange to me, as potato is highly heterozygous and there should be more variants common to more than 10 samples unless there is a huge disparity in coverage across samples. This is GBS data, which further concentrates the reads at specific sites and thus increases the chances of meeting the above requirement. Any comments on this warning are also highly appreciated.
Fyi - just having the warning about the inbreeding coefficient doesn't mean you need to throw out the whole variant. You just won't have the inbreeding coefficient for that variant.
Thank you so much @Neato-Nick for your feedback, highly useful indeed! I was just worried that all the locations with warnings were being bypassed, which, per your feedback, should not be the case.
@sanjeevksh
Thanks @davidbenjamin, that's really great to hear! With the -new-qual option added, the GenotypeGVCFs runs completed very swiftly.
My GenotypeGVCFs run for a single chromosome returned the following completion statement: "18:54:40.516 INFO ProgressMeter - Traversal complete. Processed 606308 total variants in 75.2 minutes." However, there are only 46814 variant rows (excluding 52 header rows) in the corresponding VCF file. Does the above figure of 606308 correspond to a multiple of 'variants x number of samples'?

Also, there are only 16863 lines in my log file; does this mean that the 'Current Locus' column in the log file doesn't correspond to a single genomic location (bp) in the fasta file? I am curious to know how all these figures relate, to fully understand what is happening while processing the GVCF files.

Also, on the inbreeding coefficient warning: I understand from @Neato-Nick's feedback that the variants with these warnings may still be fine and can be retained. However, this still leaves me worrying that, out of 384 samples, a locus doesn't even have 10 samples for generating the required metrics. Such variants won't be of any use for downstream analyses anyway, where any variants with more than 80% missing samples will be removed. Therefore, I wish to seek some more information about this 10-sample threshold: does it have some other context, or does it literally mean that fewer than 10 samples carry that variant?

Regards,
606308 is not 'variants x number of samples'. Rather, it is just 'variants'. However, keep in mind that most of the input "variants" that make up these 606308 are GVCF reference confidence blocks that do not end up in the output. To be more precise, each input GVCF has a mix of variants and reference confidence blocks, which don't necessarily overlap from sample to sample. The GATK engine turns the independent stream of records from each GVCF into a single multi-sample stream, as if it came from a single multi-sample GVCF. This includes reconciling non-overlapping reference confidence blocks. 606308 is the number of effective GVCF records in this multi-sample stream. (BTW, by "stream" I don't mean a Java 8 Stream.)
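The merged-stream idea can be pictured with a toy model: each sample contributes records (variants or reference-block starts) at various positions, and the engine emits roughly one effective record per distinct start position across all samples. A simplified illustration (my own sketch, not the GATK engine; real block reconciliation also splits blocks at end positions):

```python
def merged_record_count(per_sample_starts):
    """per_sample_starts: one sorted list of record start positions per
    sample. The merged multi-sample stream emits one effective record per
    distinct start position across all samples."""
    return len(set().union(*(set(s) for s in per_sample_starts)))

# Two samples whose reference blocks don't line up: the merged stream has
# more records than either sample alone, which is why the progress meter's
# "total variants" can far exceed the variant rows in the output VCF.
sample_a = [100, 250, 400]
sample_b = [100, 180, 400, 520]
print(merged_record_count([sample_a, sample_b]))  # 5
```

With hundreds of samples whose blocks rarely align, the effective record count grows far beyond any single sample's record count.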
The warning happens when fewer than 10 samples have likelihoods (PLs) for the variant -- it's not a matter of how many samples have the variant. That is, if 10 samples have PLs that say they are hom ref, you don't get the warning.
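To illustrate the distinction drawn above: the threshold counts samples with any PLs (i.e. called genotypes), not samples carrying the alternate allele. A toy sketch of that check (my own hypothetical helper, not GATK source code):

```python
MIN_SAMPLES_FOR_INBREEDING_COEFF = 10  # threshold quoted in the warning

def can_annotate_inbreeding(genotypes):
    """genotypes: list of per-sample FORMAT dicts. A sample counts toward
    the threshold if it has PLs at all, even when those PLs say hom-ref."""
    with_pls = [g for g in genotypes if g.get("PL") is not None]
    return len(with_pls) >= MIN_SAMPLES_FOR_INBREEDING_COEFF

# 12 hom-ref samples that all have PLs still satisfy the threshold ...
hom_refs = [{"GT": "0/0", "PL": [0, 30, 300]}] * 12
print(can_annotate_inbreeding(hom_refs))   # True
# ... while 20 no-call samples without PLs do not.
no_calls = [{"GT": "./."}] * 20
print(can_annotate_inbreeding(no_calls))   # False
```

So a site can trigger the warning even in a large cohort if most samples lack genotype likelihoods there, e.g. due to patchy GBS coverage.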
Your best bet is to just start analyzing your data with this VCF. It doesn't sound like your output log file showed any truly problematic errors. Tools like VCFtools or vcfR (if you're familiar with R or want to start learning it) give you some basic stats about your VCF file very quickly. This will alleviate many of your concerns.
Thanks @davidbenjamin for a very detailed explanation, this helps a lot! Thanks @Neato-Nick, yes that's the plan; I am just waiting for all the runs to finish, and there was a pressing need to clarify the 10-sample issue first. I have done VCF filtering in the past and plan to apply hard filtering using FilterVcf/VariantFiltration/SelectVariants. I was not aware of the vcfR package, thanks for directing me to it, it looks very useful. I'm not brilliant at R but have some working knowledge, so hopefully I'll manage with it.
Apologies for posting this message here. I posted it a few days ago on the regular GATK forum, and also via the direct inbox option, but have got no response, so maybe something is wrong with my account.

The issue is: I have done variant calling on 384 potato samples following, mostly, the GATK best practices, and have applied hard filters to select SNPs for further usage. However, I am noticing that the '--max-nocall-fraction', '--max-nocall-number' and '--max-fraction-filtered-genotypes' arguments for 'SelectVariants' are not working properly. I have tried various cutoff settings, and every time I observe SNPs with a much larger number of 'no call' genotypes (~246 out of 384 with a 0.10 setting) than the set thresholds. I searched the forum first but couldn't find any relevant threads. I am using the latest GATK version (4.0.7.0).

I am attaching three example sets of (1) log files, (2) subset VCF files and (3) the VCF index files for the three main VCFs. I would appreciate any feedback on this issue, and/or whether this behaviour has been observed by other users. The link to the original post is here:

Regards,
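One way to sanity-check a filtering tool independently is to compute the no-call fraction per site yourself. A minimal sketch (VCF field parsing heavily simplified; the function name is my own, not a GATK API):

```python
def nocall_fraction(genotype_strings):
    """Fraction of samples whose GT is fully missing ('./.', '.|.', etc.).
    genotype_strings: the per-sample GT values from one VCF row."""
    missing = sum(
        1 for gt in genotype_strings
        if set(gt.replace("|", "/").split("/")) == {"."}
    )
    return missing / len(genotype_strings)

# 246 no-calls out of 384 samples, as reported above, is a fraction of
# about 0.64 -- far over a --max-nocall-fraction threshold of 0.10, so
# such a site surviving the filter would indeed look like a bug.
gts = ["./."] * 246 + ["0/1"] * 138
print(round(nocall_fraction(gts), 2))  # 0.64
```

Running a check like this over the output VCF makes it easy to show concretely which sites exceed the configured threshold.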
@sanjeevksh General GitHub practice is to only comment on an issue with information that is directly related to the original post. As your forum post states, your new comment is about the SelectVariants tool. I'm not with GATK, but from what I've seen, posting both on their forum and on this GitHub repository is okay and even encouraged. But since this issue is unrelated to the GenotypeGVCFs tool, you really should open a new issue. It will be more helpful to both you and the GATK team (for various reasons) to open a new issue with your new question rather than just commenting here.
@sanjeevksh This is a good place for bug reports, but would you mind moving this report to a new issue as suggested? Thank you.
@Neato-Nick @lbergelson
TL;DR: new qual (which is now the default) fixes it.
Several people are reporting unreasonable memory usage in GenotypeGVCFs.
See https://gatkforums.broadinstitute.org/gatk/discussion/11634/genotypegvcfs-resource-problems#latest
and #4467.
We should investigate what's going on; we're probably loading too many lines into memory at once somehow. It's possible it's related to running on a base-pair-resolution file. Possibly #3480 might help?