Users reporting unreasonable memory usage in GenotypeGVCFs #4544
related to #4512 possibly?
Yes, very possible.
Jumping on to report that I've also been having issues with GATK 4 genotyping. CombineGVCFs (v4.0.1.1) runs fine no matter how many samples I use, but GenotypeGVCFs chokes past some sample count. I iteratively subsetted my sample list to find the point at which it starts to choke, using the line count of the final VCF file as an approximation of how far the genotyper got. Even at 110 samples, though, GenotypeGVCFs ran for several hours and only produced a final, genotyped VCF file that was severely truncated, with few variants; see the attached graph. As for the errors:
I've attached three of these error files so you can see the full list of memory problems: genotype55.e5195822.txt. For reference, genotyping all 108 of my samples with GATK 3.8.0 produced a final VCF file 2784 lines long in 36 seconds with no issue. Let me know if you have any other questions!
UPDATE: I solved the issue on my end. A collaborator was having the same issue with his haploid data, but not his diploid data; the problems I described above were for haploid data. He added "--new-qual" to his GenotypeGVCFs command and that solved the issue. It did for me as well! Using the same combined GVCF file as before, genotyping finished in under half a minute after adding the new-qual parameter. Thought it would be useful to know that: (1) this issue appears to affect non-diploids more than diploids, and (2) using --new-qual solved the issue, at least for me. I've attached the log file generated from this new run; hopefully it helps in debugging.
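For anyone landing here with the same symptom, the workaround described above looks roughly like this (the reference and file names below are placeholders, not taken from the original report):

```shell
# Workaround reported in this thread for GATK 4.0.x: enable the newer
# QUAL model with --new-qual. Paths and file names are illustrative only.
gatk GenotypeGVCFs \
    -R reference.fasta \
    -V combined.g.vcf.gz \
    --new-qual \
    -O genotyped.vcf.gz
```

As the end of the thread notes, the new quality model later became the default, so recent GATK versions should not need the flag.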
This is interesting. @davidbenjamin, can you think of a reason why --new-qual would make such a difference for haploid data?
@droazen I don't have a good reason. For ploidy greater than 2, sure, but I would expect old qual's brute-force approach to do okay on haploids. I will say that for non-diploids it goes to …
In that case it's likely that there is a bug specific to the old-qual code path in GATK4 that was not present in GATK3.
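As a side note on the point about ploidy: the cost of a brute-force enumeration over genotypes grows with the number of possible genotypes, which for ploidy P and A alleles is the multiset count C(P + A - 1, P). A quick sketch of that growth (standard combinatorics, not GATK code):

```python
from math import comb

def num_genotypes(ploidy: int, num_alleles: int) -> int:
    """Unordered genotypes of a given ploidy drawn from num_alleles alleles:
    multisets of size `ploidy`, i.e. C(ploidy + num_alleles - 1, ploidy)."""
    return comb(ploidy + num_alleles - 1, ploidy)

# Diploid with 2 alleles: the familiar 3 genotypes (AA, AB, BB).
print(num_genotypes(2, 2))   # 3
# Tetraploid with 4 alleles already needs 35 genotypes per sample.
print(num_genotypes(4, 4))   # 35
# Hexaploid with 6 alleles: 462 genotypes -- brute force scales badly.
print(num_genotypes(6, 6))   # 462
```

This is why an exhaustive genotype walk that is cheap for diploids can blow up for polyploid samples with many alleles.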
Another user reported back with some information in the same thread.
I can confirm this issue (4.0.5.2). With 16 tetraploid samples (CombineGVCFs outputs a 420 MB file), GenotypeGVCFs gets stuck at the beginning and later crashes (32 GB of RAM is not enough).
@V-Z did enabling the --new-qual flag help?
@Neato-Nick
@V-Z Huh, that might be a different issue then. @davidbenjamin Any thoughts?
@lbergelson Might be, but otherwise the description in the original post fits my problem well.
@V-Z The massive memory use without …
Weird. I don't mean to hijack the issue, but how can I verify that possibility? I have been using GATK 3.8 and I can process my data there...
@droazen
@V-Z Would you mind sharing your GVCF, or just the offending chunk, with me so I can debug? I'm pretty sure it's a finite-precision error and have a simple fix in mind, but I would like to confirm on real data.
@davidbenjamin Sure, I'm sending it.
Thanks @V-Z. I also just noticed that it's not a reference I'm familiar with (I work on human cancer; Arabidopsis doesn't come up all that often). Could you send me the reference fasta or tell me where to download it?
By the way, @V-Z, I see something called Arabidopsis_thaliana_TAIR10.fasta on the Broad server, but I have no idea if that's the same as yours.
@davidbenjamin I'm sorry, I'm sending it.
I have also experienced the same issue, i.e. GenotypeGVCFs not proceeding past the initial few hundred lines (242 lines, to be precise). I'm running 384 potato samples (242 diploid, 138 tetraploid and 4 hexaploid) on a chromosome-by-chromosome basis. After adding the '-new-qual' option, the run completed for the whole chromosome (dummy chromosome 0), producing 7654 lines. My query is whether using '-new-qual' is advisable under GATK best practices, or whether it is a temporary arrangement until a final fix is found?

Also, in the above completed run, only '102' variant sites were actually processed, as all the rest have the warning 'WARN InbreedingCoeff - Annotation will not be calculated, must provide at least 10 samples'. This looks a bit strange to me, as potato is highly heterozygous and there should be more variants common to more than 10 samples unless there is a huge disparity in coverage across samples. This is GBS data, which further concentrates the reads at specific sites and thus increases the chances of meeting the above requirement. Any comments on this warning are also highly appreciated.
Fyi - just having the warning about the inbreeding coefficient doesn't mean you need to throw out the whole variant. You just won't have the inbreeding coefficient for that variant.
Thank you so much @Neato-Nick for your feedback, highly useful indeed! I was just worried that all the locations with warnings were being bypassed, which, per your feedback, should not be the case.
@sanjeevksh
Thanks @davidbenjamin, that's really great to hear! With the -new-qual option added, the GenotypeGVCFs runs completed very swiftly.
My GenotypeGVCFs run for a single chromosome returned the following completion statement: "18:54:40.516 INFO ProgressMeter - Traversal complete. Processed 606308 total variants in 75.2 minutes." However, there are only 46814 variant rows (excluding 52 header rows) in the corresponding VCF file. Does the above figure of 606308 correspond to a multiple of 'variants x number of samples'?

Also, there are only 16863 lines in my log file; does this mean that the 'Current Locus' column in the log file doesn't correspond to a single genomic location (bp) in the fasta file? I am curious to know how all these figures relate, to fully understand what is happening while processing the GVCF files.

Also, on the inbreeding coefficient warning: I understand from @Neato-Nick's feedback that the variants with these warnings may still be fine and can be retained. However, this still leaves me worrying that, out of 384 samples, a locus doesn't even have 10 samples for generating the required metrics. Such variants won't be of any use for downstream analyses anyway, where any variants with more than 80% missing samples will be removed. Therefore, I wish to seek some more information about this 10-sample threshold: does it have some other context, or does it literally mean that fewer than 10 samples carry that variant?

Regards,
606308 is not 'variants x number of samples'. Rather, it is just 'variants'. However, keep in mind that most of the input "variants" that make up these 606308 are GVCF reference confidence blocks that do not end up in the output. To be more precise, each input GVCF has a mix of variants and reference confidence blocks, which don't necessarily overlap from sample to sample. The GATK engine turns the independent stream of records from each GVCF into a single multi-sample stream, as if it came from a single multi-sample GVCF. This includes reconciling non-overlapping reference confidence blocks. 606308 is the number of effective GVCF records in this multi-sample stream. (BTW, by "stream" I don't mean a Java 8 Stream.)
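The merged-stream idea can be pictured with a toy model: each sample contributes records (variants or reference-block starts) at various positions, and the engine emits roughly one effective record per distinct start position across all samples. A simplified illustration (my own sketch, not the GATK engine; real block reconciliation also splits blocks at end positions):

```python
def merged_record_count(per_sample_starts):
    """per_sample_starts: one sorted list of record start positions per
    sample. The merged multi-sample stream emits one effective record per
    distinct start position across all samples."""
    return len(set().union(*(set(s) for s in per_sample_starts)))

# Two samples whose reference blocks don't line up: the merged stream has
# more records than either sample alone, which is why the progress meter's
# "total variants" can far exceed the variant rows in the output VCF.
sample_a = [100, 250, 400]
sample_b = [100, 180, 400, 520]
print(merged_record_count([sample_a, sample_b]))  # 5
```

With hundreds of samples whose blocks rarely align, the effective record count grows far beyond any single sample's record count.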
The warning happens when fewer than 10 samples have likelihoods (PLs) for the variant -- it's not a matter of how many samples have the variant. That is, if 10 samples have PLs that say they are hom ref, you don't get the warning.
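To illustrate the distinction drawn above: the threshold counts samples with any PLs (i.e. called genotypes), not samples carrying the alternate allele. A toy sketch of that check (my own hypothetical helper, not GATK source code):

```python
MIN_SAMPLES_FOR_INBREEDING_COEFF = 10  # threshold quoted in the warning

def can_annotate_inbreeding(genotypes):
    """genotypes: list of per-sample FORMAT dicts. A sample counts toward
    the threshold if it has PLs at all, even when those PLs say hom-ref."""
    with_pls = [g for g in genotypes if g.get("PL") is not None]
    return len(with_pls) >= MIN_SAMPLES_FOR_INBREEDING_COEFF

# 12 hom-ref samples that all have PLs still satisfy the threshold ...
hom_refs = [{"GT": "0/0", "PL": [0, 30, 300]}] * 12
print(can_annotate_inbreeding(hom_refs))   # True
# ... while 20 no-call samples without PLs do not.
no_calls = [{"GT": "./."}] * 20
print(can_annotate_inbreeding(no_calls))   # False
```

So a site can trigger the warning even in a large cohort if most samples lack genotype likelihoods there, e.g. due to patchy GBS coverage.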
Your best bet is to just start analyzing your data with this VCF. It doesn't sound like your output log file showed any truly problematic errors. Tools like VCFtools or vcfR (if you're familiar with R or want to start learning it) give you some basic stats about your VCF file very quickly. This will alleviate many of your concerns.
Thanks @davidbenjamin for a very detailed explanation, this helps a lot! Thanks @Neato-Nick, yes that's the plan; I am just waiting for all the runs to finish, and there was a pressing need to clarify the 10-sample issue first. I have done VCF filtering in the past and plan to apply hard filtering using FilterVcf/VariantFiltration/SelectVariants. I was not aware of the vcfR package, thanks for directing me to it, it looks very useful. I'm not brilliant at R but have some working knowledge, so hopefully I'll manage with it.
Apologies for posting this message here. I posted it a few days ago on the regular GATK forum, and also via the direct inbox option, but have got no response, so maybe something is wrong with my account.

The issue is: I have done variant calling on 384 potato samples following, mostly, the GATK best practices, and have applied hard filters to select SNPs for further usage. However, I am noticing that the '--max-nocall-fraction', '--max-nocall-number' and '--max-fraction-filtered-genotypes' arguments for 'SelectVariants' are not working properly. I have tried various cutoff settings, and every time I observe SNPs with a much larger number of 'no call' genotypes (~246 out of 384 with a 0.10 setting) than the set thresholds. I searched the forum first but couldn't find any relevant threads. I am using the latest GATK version (4.0.7.0).

I am attaching three example sets of (1) log files, (2) subset VCF files and (3) the VCF index files for the three main VCFs. I would appreciate any feedback on this issue, and/or whether this behaviour has been observed by other users. The link to the original post is here:

Regards,
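One way to sanity-check a filtering tool independently is to compute the no-call fraction per site yourself. A minimal sketch (VCF field parsing heavily simplified; the function name is my own, not a GATK API):

```python
def nocall_fraction(genotype_strings):
    """Fraction of samples whose GT is fully missing ('./.', '.|.', etc.).
    genotype_strings: the per-sample GT values from one VCF row."""
    missing = sum(
        1 for gt in genotype_strings
        if set(gt.replace("|", "/").split("/")) == {"."}
    )
    return missing / len(genotype_strings)

# 246 no-calls out of 384 samples, as reported above, is a fraction of
# about 0.64 -- far over a --max-nocall-fraction threshold of 0.10, so
# such a site surviving the filter would indeed look like a bug.
gts = ["./."] * 246 + ["0/1"] * 138
print(round(nocall_fraction(gts), 2))  # 0.64
```

Running a check like this over the output VCF makes it easy to show concretely which sites exceed the configured threshold.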
@sanjeevksh General GitHub practice is to only comment on an issue with information that is directly related to the original post. As your forum post states, your new comment is about the SelectVariants tool. I'm not with GATK, but from what I've seen, posting both on their forum and on this GitHub repository is okay and even encouraged. But since this issue is unrelated to the GenotypeGVCFs tool, you really should open a new issue. It will be more helpful to both you and the GATK team (for various reasons) to open a new issue with your new question rather than just commenting here.
@sanjeevksh This is a good place for bug reports, but would you mind moving this report to a new issue as suggested? Thank you.
@Neato-Nick @lbergelson
TL;DR: new qual (which is now the default) fixes it.
Several people are reporting unreasonable memory usage in GenotypeGVCFs.
See https://gatkforums.broadinstitute.org/gatk/discussion/11634/genotypegvcfs-resource-problems#latest
and #4467.
We should investigate what's going on; we're probably loading too many lines into memory at once somehow. It's possible it's related to running on a base-pair-resolution file. Possibly #3480 might help?