Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GenotypeGVCFs error in 4.2.4.1 java.lang.IllegalStateException #7639

Closed
GATKSupportTeam opened this issue Jan 14, 2022 · 26 comments
Closed

GenotypeGVCFs error in 4.2.4.1 java.lang.IllegalStateException #7639

GATKSupportTeam opened this issue Jan 14, 2022 · 26 comments

Comments

@GATKSupportTeam
Copy link
Collaborator

GATKSupportTeam commented Jan 14, 2022

A user running GenotypeGVCFs with a GenomicsDB ran into a new issue with 4.2.4.1. They were previously running 4.1.9.0.

Their complete program log is below:

This request was created from a contribution made by Andrius Jonas Dagilis on January 06, 2022 16:24 UTC.

Link: https://gatk.broadinstitute.org/hc/en-us/community/posts/360073212811-GenotypeGVCFs-stalls-while-using-all-sites#community_comment_4414746059419

Using GATK jar /nas/longleaf/apps/gatk/4.2.4.1/gatk-4.2.4.1/gatk-package-4.2.4.1-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx190g -jar /nas/longleaf/apps/gatk/4.2.4.1/gatk-4.2.4.1/gatk-package-4.2.4.1-local.jar GenotypeGVCFs -R /proj/matutelb/projects/drosophila/melanogaster/dmel6_ref.fasta -V gendb://all_mels_chr2L -L chr2L -O all_mels_chr2L.vcf.gz
19:40:48.803 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/nas/longleaf/apps/gatk/4.2.4.1/gatk-4.2.4.1/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 13, 2022 7:40:50 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
19:40:50.088 INFO GenotypeGVCFs - ------------------------------------------------------------
19:40:50.089 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.4.1
19:40:50.090 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
19:40:50.102 INFO GenotypeGVCFs - Executing as adagilis@t0601.ll.unc.edu on Linux v3.10.0-1160.2.2.el7.x86_64 amd64
19:40:50.103 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_292-b10
19:40:50.103 INFO GenotypeGVCFs - Start Date/Time: January 13, 2022 7:40:48 PM EST
19:40:50.103 INFO GenotypeGVCFs - ------------------------------------------------------------
19:40:50.103 INFO GenotypeGVCFs - ------------------------------------------------------------
19:40:50.104 INFO GenotypeGVCFs - HTSJDK Version: 2.24.1
19:40:50.104 INFO GenotypeGVCFs - Picard Version: 2.25.4
19:40:50.104 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
19:40:50.104 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
19:40:50.104 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
19:40:50.104 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
19:40:50.104 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
19:40:50.105 INFO GenotypeGVCFs - Deflater: IntelDeflater
19:40:50.105 INFO GenotypeGVCFs - Inflater: IntelInflater
19:40:50.105 INFO GenotypeGVCFs - GCS max retries/reopens: 20
19:40:50.105 INFO GenotypeGVCFs - Requester pays: disabled
19:40:50.105 INFO GenotypeGVCFs - Initializing engine
19:40:51.727 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
19:41:10.142 info NativeGenomicsDB - pid=29116 tid=29117 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
19:41:10.143 info NativeGenomicsDB - pid=29116 tid=29117 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
19:41:10.143 info NativeGenomicsDB - pid=29116 tid=29117 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
19:41:28.066 INFO IntervalArgumentCollection - Processing 23513712 bp from intervals
19:41:28.137 INFO GenotypeGVCFs - Done initializing engine
19:41:28.487 INFO ProgressMeter - Starting traversal
19:41:28.488 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
19:42:19.471 INFO ProgressMeter - chr2L:5364 0.8 1000 1176.9
19:42:32.573 INFO ProgressMeter - chr2L:6364 1.1 2000 1872.5
19:42:48.677 INFO ProgressMeter - chr2L:8364 1.3 4000 2992.9
19:43:03.248 INFO ProgressMeter - chr2L:10364 1.6 6000 3799.1
Chromosome chr2L position 11353 (TileDB column 11352) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

19:43:18.779 INFO ProgressMeter - chr2L:12364 1.8 8000 4352.1
Chromosome chr2L position 13464 (TileDB column 13463) has too many alleles in the combined VCF record : 11 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

19:43:32.483 INFO ProgressMeter - chr2L:14364 2.1 10000 4838.9
19:43:43.998 INFO ProgressMeter - chr2L:16364 2.3 12000 5313.3
Chromosome chr2L position 16479 (TileDB column 16478) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

Chromosome chr2L position 17004 (TileDB column 17003) has too many alleles in the combined VCF record : 13 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

19:43:57.857 INFO ProgressMeter - chr2L:17364 2.5 13000 5222.0
Chromosome chr2L position 18754 (TileDB column 18753) has too many alleles in the combined VCF record : 9 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

Chromosome chr2L position 19311 (TileDB column 19310) has too many alleles in the combined VCF record : 16 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

19:44:17.535 INFO ProgressMeter - chr2L:19364 2.8 15000 5324.0
Chromosome chr2L position 19835 (TileDB column 19834) has too many alleles in the combined VCF record : 11 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

19:44:34.904 INFO ProgressMeter - chr2L:21364 3.1 17000 5471.6
19:44:47.867 INFO ProgressMeter - chr2L:23364 3.3 19000 5717.8
Chromosome chr2L position 25349 (TileDB column 25348) has too many alleles in the combined VCF record : 7 : current limit : 6. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

19:45:00.952 INFO GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),57.82864317899994,Cpu time(s),49.25545264700008
[January 13, 2022 7:45:01 PM EST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 4.22 minutes.
Runtime.totalMemory()=2076049408
java.lang.IllegalStateException: Genotype [Ark CTTT/CTTT GQ 24 DP 8 AD 0,8,0,0,0,0,0,0 {SB=0,0,4,4}] does not contain likelihoods necessary to calculate posteriors.
at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.log10NormalizedGenotypePosteriors(AlleleFrequencyCalculator.java:89)
at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.effectiveAlleleCounts(AlleleFrequencyCalculator.java:258)
at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.calculate(AlleleFrequencyCalculator.java:141)
at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:147)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.calculateGenotypes(GenotypeGVCFsEngine.java:244)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.regenotypeVC(GenotypeGVCFsEngine.java:152)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.callRegion(GenotypeGVCFsEngine.java:135)
at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:283)
at org.broadinstitute.hellbender.engine.VariantLocusWalker.lambda$traverse$0(VariantLocusWalker.java:135)
at java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Iterator.forEachRemaining(Iterator.java:116)
(base) [adagilis@longleaf-login4 logs]$
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:490)
at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
@eriqande
Copy link

I have the same problem. The error seems to occur when the number of alleles is exactly one greater than the value of --max-alternate-alleles. Note that is the case in Andrius's output above: it throws a warning, but no error, when there are more than 7 alleles, but when there are exactly 7 alleles (one more than the default max of 6) it errors out.

I have tried this on a number of different chromosomes (small test cases with 4 Chinook salmon) and always get the error immediately after a warning about having 7 alleles. On one chromosome, an earlier warning about 8 alleles causes no error, UNLESS I set --max-alternate-alleles to 7, in which case, the position with 8 alleles causes the error immediately after the warning. This is evident from the two log files for two different runs, below.

Log from run with value of --max-alternate-alleles left at default value of 6:

on-chinookomes-dna-seq-gatk-variant-calling]--% gatk --java-options "-Xmx4g" GenotypeGVCFs    -R resources/genome.fasta    -V gendb://results/genomics_db/chromosomes/CM031199.1   -O results/vcf_parts/CM031199.1.vcf.gz


Using GATK jar /home/eanderson/Documents/projects/yukon-chinookomes-dna-seq-gatk-variant-calling/.snakemake/conda/cd50d464/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -jar /home/eanderson/Documents/projects/yukon-chinookomes-dna-seq-gatk-variant-calling/.snakemake/conda/cd50d464/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar GenotypeGVCFs -R resources/genome.fasta -V gendb://results/genomics_db/chromosomes/CM031199.1 -O results/vcf_parts/CM031199.1.vcf.gz
22:17:18.737 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/eanderson/Documents/projects/yukon-chinookomes-dna-seq-gatk-variant-calling/.snakemake/conda/cd50d464/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 16, 2022 10:17:18 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
22:17:18.863 INFO  GenotypeGVCFs - ------------------------------------------------------------
22:17:18.863 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.4.1
22:17:18.863 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
22:17:18.864 INFO  GenotypeGVCFs - Executing as eanderson@node34.cluster on Linux v4.18.0-193.28.1.el8_2.x86_64 amd64
22:17:18.864 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1-internal+0-adhoc..src
22:17:18.864 INFO  GenotypeGVCFs - Start Date/Time: January 16, 2022 at 10:17:18 PM PST
22:17:18.864 INFO  GenotypeGVCFs - ------------------------------------------------------------
22:17:18.864 INFO  GenotypeGVCFs - ------------------------------------------------------------
22:17:18.864 INFO  GenotypeGVCFs - HTSJDK Version: 2.24.1
22:17:18.864 INFO  GenotypeGVCFs - Picard Version: 2.25.4
22:17:18.865 INFO  GenotypeGVCFs - Built for Spark Version: 2.4.5
22:17:18.865 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
22:17:18.865 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
22:17:18.865 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
22:17:18.865 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
22:17:18.865 INFO  GenotypeGVCFs - Deflater: IntelDeflater
22:17:18.865 INFO  GenotypeGVCFs - Inflater: IntelInflater
22:17:18.865 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
22:17:18.865 INFO  GenotypeGVCFs - Requester pays: disabled
22:17:18.865 INFO  GenotypeGVCFs - Initializing engine
22:17:19.411 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
22:17:19.560 info  NativeGenomicsDB - pid=1148842 tid=1148843 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
22:17:19.560 info  NativeGenomicsDB - pid=1148842 tid=1148843 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
22:17:19.560 info  NativeGenomicsDB - pid=1148842 tid=1148843 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
22:17:19.961 INFO  GenotypeGVCFs - Done initializing engine
22:17:20.067 INFO  ProgressMeter - Starting traversal
22:17:20.068 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
22:17:26.952 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position CM031199.1:72 and possibly subsequent; at least 10 samples must have called genotypes
22:17:30.076 INFO  ProgressMeter -    CM031199.1:418514              0.2                 77000         461630.7
22:17:40.142 INFO  ProgressMeter -   CM031199.1:2361698              0.3                427000        1276277.8
22:17:50.156 INFO  ProgressMeter -   CM031199.1:4294969              0.5                763000        1521536.8
22:18:00.209 INFO  ProgressMeter -   CM031199.1:6366520              0.7               1158000        1730898.6
Chromosome CM031199.1 position 6786063 (TileDB column 6786062) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.

22:18:10.220 INFO  ProgressMeter -   CM031199.1:7700897              0.8               1438000        1720370.1
22:18:20.237 INFO  ProgressMeter -   CM031199.1:9799772              1.0               1840000        1834862.4
22:18:30.318 INFO  ProgressMeter -  CM031199.1:11583359              1.2               2205000        1883274.0
22:18:40.327 INFO  ProgressMeter -  CM031199.1:13596363              1.3               2623000        1960926.0
22:18:50.379 INFO  ProgressMeter -  CM031199.1:15138950              1.5               2935000        1949928.6
22:19:00.394 INFO  ProgressMeter -  CM031199.1:16835922              1.7               3304000        1975958.4
22:19:10.397 INFO  ProgressMeter -  CM031199.1:18887424              1.8               3732000        2029566.1
22:19:20.417 INFO  ProgressMeter -  CM031199.1:20313013              2.0               4041000        2014640.8
22:19:30.446 INFO  ProgressMeter -  CM031199.1:22312939              2.2               4473000        2058491.9
22:19:40.460 INFO  ProgressMeter -  CM031199.1:24063238              2.3               4842000        2069348.7
22:19:52.201 INFO  ProgressMeter -  CM031199.1:25798492              2.5               5213000        2055964.2
22:20:02.224 INFO  ProgressMeter -  CM031199.1:27826886              2.7               5646000        2089099.4
22:20:12.237 INFO  ProgressMeter -  CM031199.1:29589960              2.9               6022000        2098635.6
Chromosome CM031199.1 position 29761047 (TileDB column 29761046) has too many alleles in the combined VCF record : 7 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.

22:20:13.114 INFO  GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),84.82845143211256,Cpu time(s),73.66556314507389
[January 16, 2022 at 10:20:13 PM PST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 2.91 minutes.
Runtime.totalMemory()=683671552
java.lang.IllegalStateException: Genotype [T199970 ATATATAT/T GQ 49 DP 4 AD 0,2,0,0,0,2,0,0 {SB=0,0,2,2}] does not contain likelihoods necessary to calculate posteriors.
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.log10NormalizedGenotypePosteriors(AlleleFrequencyCalculator.java:89)
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.effectiveAlleleCounts(AlleleFrequencyCalculator.java:258)
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.calculate(AlleleFrequencyCalculator.java:141)
	at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:147)
	at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.calculateGenotypes(GenotypeGVCFsEngine.java:244)
	at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.regenotypeVC(GenotypeGVCFsEngine.java:152)
	at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.callRegion(GenotypeGVCFsEngine.java:135)
	at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:283)
	at org.broadinstitute.hellbender.engine.VariantLocusWalker.lambda$traverse$0(VariantLocusWalker.java:135)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:502)
	at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)

Log From run with value of --max-alternate-alleles set to 7:

on-chinookomes-dna-seq-gatk-variant-calling]--% gatk --java-options "-Xmx4g" GenotypeGVCFs    -R resources/genome.fasta    -V gendb://results/genomics_db/chromosomes/CM031199.1   --max-alternate-alleles 7  -O results/vcf_parts/CM031199.1.vcf.gz


Using GATK jar /home/eanderson/Documents/projects/yukon-chinookomes-dna-seq-gatk-variant-calling/.snakemake/conda/cd50d464/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx4g -jar /home/eanderson/Documents/projects/yukon-chinookomes-dna-seq-gatk-variant-calling/.snakemake/conda/cd50d464/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar GenotypeGVCFs -R resources/genome.fasta -V gendb://results/genomics_db/chromosomes/CM031199.1 --max-alternate-alleles 7 -O results/vcf_parts/CM031199.1.vcf.gz
21:57:11.346 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/eanderson/Documents/projects/yukon-chinookomes-dna-seq-gatk-variant-calling/.snakemake/conda/cd50d464/share/gatk4-4.2.4.1-0/gatk-package-4.2.4.1-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jan 16, 2022 9:57:11 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
21:57:11.476 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:57:11.477 INFO  GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.4.1
21:57:11.477 INFO  GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
21:57:11.477 INFO  GenotypeGVCFs - Executing as eanderson@node34.cluster on Linux v4.18.0-193.28.1.el8_2.x86_64 amd64
21:57:11.477 INFO  GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v11.0.9.1-internal+0-adhoc..src
21:57:11.477 INFO  GenotypeGVCFs - Start Date/Time: January 16, 2022 at 9:57:11 PM PST
21:57:11.477 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:57:11.477 INFO  GenotypeGVCFs - ------------------------------------------------------------
21:57:11.478 INFO  GenotypeGVCFs - HTSJDK Version: 2.24.1
21:57:11.478 INFO  GenotypeGVCFs - Picard Version: 2.25.4
21:57:11.478 INFO  GenotypeGVCFs - Built for Spark Version: 2.4.5
21:57:11.478 INFO  GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
21:57:11.478 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
21:57:11.478 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
21:57:11.478 INFO  GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
21:57:11.478 INFO  GenotypeGVCFs - Deflater: IntelDeflater
21:57:11.478 INFO  GenotypeGVCFs - Inflater: IntelInflater
21:57:11.478 INFO  GenotypeGVCFs - GCS max retries/reopens: 20
21:57:11.478 INFO  GenotypeGVCFs - Requester pays: disabled
21:57:11.478 INFO  GenotypeGVCFs - Initializing engine
21:57:12.094 INFO  GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
21:57:12.243 info  NativeGenomicsDB - pid=1148095 tid=1148096 No valid combination operation found for INFO field InbreedingCoeff  - the field will NOT be part of INFO fields in the generated VCF records
21:57:12.243 info  NativeGenomicsDB - pid=1148095 tid=1148096 No valid combination operation found for INFO field MLEAC  - the field will NOT be part of INFO fields in the generated VCF records
21:57:12.243 info  NativeGenomicsDB - pid=1148095 tid=1148096 No valid combination operation found for INFO field MLEAF  - the field will NOT be part of INFO fields in the generated VCF records
21:57:12.642 INFO  GenotypeGVCFs - Done initializing engine
21:57:12.752 INFO  ProgressMeter - Starting traversal
21:57:12.752 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
21:57:19.659 WARN  InbreedingCoeff - InbreedingCoeff will not be calculated at position CM031199.1:72 and possibly subsequent; at least 10 samples must have called genotypes
21:57:22.809 INFO  ProgressMeter -    CM031199.1:441638              0.2                 81000         483293.6
21:57:32.865 INFO  ProgressMeter -   CM031199.1:2436041              0.3                441000        1315567.0
21:57:42.870 INFO  ProgressMeter -   CM031199.1:4476368              0.5                793000        1579838.6
21:57:52.884 INFO  ProgressMeter -   CM031199.1:6506439              0.7               1189000        1777633.8
Chromosome CM031199.1 position 6786063 (TileDB column 6786062) has too many alleles in the combined VCF record : 8 : current limit : 7. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.

21:57:54.374 INFO  GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),15.171388932000157,Cpu time(s),14.039159156998815
[January 16, 2022 at 9:57:54 PM PST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.72 minutes.
Runtime.totalMemory()=444596224
java.lang.IllegalStateException: Genotype [T199967 A*/ATT GQ 29 DP 29 AD 25,3,1,0,0,0,0,0,0 {SB=15,10,3,1}] does not contain likelihoods necessary to calculate posteriors.
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.log10NormalizedGenotypePosteriors(AlleleFrequencyCalculator.java:89)
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.effectiveAlleleCounts(AlleleFrequencyCalculator.java:258)
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.calculate(AlleleFrequencyCalculator.java:141)
	at org.broadinstitute.hellbender.tools.walkers.genotyper.GenotypingEngine.calculateGenotypes(GenotypingEngine.java:147)
	at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.calculateGenotypes(GenotypeGVCFsEngine.java:244)
	at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.regenotypeVC(GenotypeGVCFsEngine.java:152)
	at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFsEngine.callRegion(GenotypeGVCFsEngine.java:135)
	at org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs.apply(GenotypeGVCFs.java:283)
	at org.broadinstitute.hellbender.engine.VariantLocusWalker.lambda$traverse$0(VariantLocusWalker.java:135)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:183)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:177)
	at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
	at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
	at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
	at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
	at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
	at java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
	at java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
	at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
	at java.base/java.util.stream.ReferencePipeline.forEachOrdered(ReferencePipeline.java:502)
	at org.broadinstitute.hellbender.engine.VariantLocusWalker.traverse(VariantLocusWalker.java:132)
	at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1085)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
	at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
	at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
	at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
	at org.broadinstitute.hellbender.Main.main(Main.java:289)

@eriqande
Copy link

eriqande commented Jan 17, 2022

Given the above behavior, and looking at the error message produced, I wonder if it might not be an issue with the GenomicsDB, but rather just a bug in the code that doesn't properly no-call individuals at sites with too many alleles.

I wonder that because the error message is emitted at the bottom of this block:

https://github.com/broadinstitute/gatk/blob/master/src/main/java/org/broadinstitute/hellbender/tools/walkers/genotyper/afcalc/AlleleFrequencyCalculator.java#L78-L90

and it only gets there for diploids if neither of g.isHomRef() or g.isNoCall() is true.

@eriqande
Copy link

eriqande commented Jan 17, 2022

If you would like the GenomicsDB for chromosome CM031199.1 (which, by the way, was created with GATK 4.2.4.1) that I used in the above two examples, for your own debugging purposes, you can download it as a .tar archive from:

https://drive.google.com/file/d/1LzZCkWfmNb8IcZpdreaNIxtJ8GQQ-b7g/view?usp=sharing

It is 385 Mb, and it has an SHA1 hash (from the Unix shasum utility) of d330a28120713fb05c95d1bf54342944f5d741c9.

@seboyden
Copy link

I'm having the same problem, running GenotypeGVCFs 4.2.4.1 following GenomicsDBImport 4.2.4.1 on 3 human WGS samples, NOT using -all-sites (as in original post), and using default --max-alternate-alleles 6, I get:

...
Chromosome chr1 position 94824 (TileDB column 94823) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 104160 (TileDB column 104159) has too many alleles in the combined VCF record : 10 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 181583 (TileDB column 181582) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 267777 (TileDB column 267776) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 271413 (TileDB column 271412) has too many alleles in the combined VCF record : 7 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.

13:40:02.922 INFO  GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.5637504830000029,Cpu time(s),0.5106995390000095
[January 18, 2022 1:40:02 PM MST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 0.56 minutes.
Runtime.totalMemory()=2227699712
java.lang.IllegalStateException: Genotype [1057-01 CTT*/* GQ 99 DP 67 AD 34,21,3,6,1,0,2,0 {SB=7,27,15,18}] does not contain likelihoods necessary to calculate posteriors.

If I add -G StandardAnnotation -G AS_StandardAnnotation to my command line, it gets a little farther:

...
Chromosome chr1 position 94824 (TileDB column 94823) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 104160 (TileDB column 104159) has too many alleles in the combined VCF record : 10 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 181583 (TileDB column 181582) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 267777 (TileDB column 267776) has too many alleles in the combined VCF record : 8 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 271413 (TileDB column 271412) has too many alleles in the combined VCF record : 7 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 275583 (TileDB column 275582) has too many alleles in the combined VCF record : 7 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 275664 (TileDB column 275663) has too many alleles in the combined VCF record : 7 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.
Chromosome chr1 position 285109 (TileDB column 285108) has too many alleles in the combined VCF record : 7 : current limit : 6. Fields, such as  PL, with length equal to the number of genotypes will NOT be added for this location.

20:40:00.698 INFO  GenotypeGVCFs - Shutting down engine
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.6447453060000009,Cpu time(s),0.6353685659999999
[January 14, 2022 8:40:00 PM MST] org.broadinstitute.hellbender.tools.walkers.GenotypeGVCFs done. Elapsed time: 1.40 minutes.
Runtime.totalMemory()=2523922432
java.lang.IllegalStateException: Genotype [1057-01 CTT*/* GQ 99 DP 67 AD 34,21,3,6,1,0,2,0 {SB=7,27,15,18}] does not contain likelihoods necessary to calculate posteriors.

So in my case, it is also failing after a site where number of alt alleles (7) is exactly 1 more than --max-alternate-alleles (6), but not always on the first such variant with 7 alt alleles, depending on other command line options.

If I revert GenotypeGVCFs to previous versions from 4.1.9.0 to 4.2.4.0 they all work (using the same GVCF database from GenomicsDBImport 4.2.4.1), so this apparent bug is new in GenotypeGVCFs 4.2.4.1.

@gbrandt6
Copy link
Contributor

@nalinigans @mlathara do you have any insights to if this would result from any of the GenomicsDB changes in 4.2.4.1?

@mlathara
Copy link
Contributor

I think changes in 4.2.4.1 are exposing an issue, but I think @eriqande is right that the issue isn't with GenomicsDB. The change was that previously GATK wasn't correctly passing the --max-alternate-allele argument to GenomicsDB, but now that it is doing so it makes it more likely that some sites don't contain fields like PL (dependent on number of possible genotypes).

It looks like the AF calculation used to ignore sites if they didn't have likelihoods, but has now been updated slightly to also allow sites with GQ or sites where any alleles are called+nonref+not symbolic.

public static boolean genotypeIsUsableForAFCalculation(Genotype g) {
return g.hasLikelihoods() || g.hasGQ() || g.getAlleles().stream().anyMatch(a -> a.isCalled() && a.isNonReference() && !a.isSymbolic());
}

Possibly that condition needs to be tweaked? @ldgauthier might be the best person to ask.

As for this working with previous versions, that is only because the --max-alternate-alleles wasn't being passed to GenomicsDB correctly. if you want to revert to that behavior you can set --max-alternate-alleles to 50 (which is what GenomicsDB defaults to).

@seboyden
Copy link

To elaborate & clarify, in my use case:

GenomicsDBImport 4.2.4.0 plus GenotypeGVCFs 4.2.4.0: works
GenomicsDBImport 4.2.4.0 plus GenotypeGVCFs 4.2.4.1: fails
GenomicsDBImport 4.2.4.1 plus GenotypeGVCFs 4.2.4.0: works
GenomicsDBImport 4.2.4.1 plus GenotypeGVCFs 4.2.4.1: fails

So I agree the issue appears to be the change to AF calculation in GenotypeGVCFs 4.2.4.1, not the --max-alternate-alleles bug fix in GenomicsDBImport. Workaround would be to revert to GenotypeGVCFs 4.2.4.0; you don't have to change --max-alternate-alleles value.

@eriqande
Copy link

Thanks Seboyden. That is super helpful. I am just finishing up a long series of GenotypeGVCFs 4.2.4.0 using genomicsDBs from GenomicsDBImport 4.2.4.1, without having passed any --max-alternate-alleles option to GenomicsDBImport. These have all run without any problems.

@bbimber
Copy link
Contributor

bbimber commented Jan 24, 2022

We're also hitting this, and will try to downgrade. However, can anyone from GATK comment on whether it's possible to add a fix to GATK itself?

@ldgauthier
Copy link
Contributor

@eriqande thanks for sharing your data, but I'll need your reference fasta as well to be able to reproduce your error.

@ldgauthier
Copy link
Contributor

Never mind -- refseq to the rescue. I'm working on a patch now.

@eriqande
Copy link

eriqande commented Jan 27, 2022 via email

@bbimber
Copy link
Contributor

bbimber commented Jan 27, 2022

I havent looked into this in detail, but if the description above is right, would running this with a GenomicsDB data source and "--max-alternate-alleles=1" likely trigger this? The reasoning is that there are probably sites in your existing test data with more than one ALT?

@ldgauthier
Copy link
Contributor

Yep, right now that's what would happen. The previous behavior was that GenomicsDB would return likelihoods for up to 50 alts and then no likelihoods for more than 50 and GenotypeGVCFs I think would subset 7-49 ALTs down to 6 and drop sites with more than 50 ALTs, which is sort of the computational efficiency compromise. I personally don't trust a diploid site with 50 ALTs. I believe the old CombineGVCFs behavior capped the number of likelihoods, so ploidy 1 could have more than 50 ALTs, but higher ploidies would have fewer.

So the current plan is to separate the max alts for GGVCFs and GenomicsDB so that the GenomicsDB max alts to combine can be greater and thus convey enough information for GGVCFs to subset down to its max alts, if desired. But if you want to be conservative with memory, you could set them to be equal and low.

Should we filter sites with too many alts, which is the recommendation for the GnarlyGenotyper pipeline, rather than drop them, like GGVCFs used to do?

@droazen
Copy link
Collaborator

droazen commented Jan 27, 2022

@mlathara The underlying issue here is that in previous versions of GATK, users would run GenotypeGVCFs with --max-alternate-alleles set to some low value such as 6, while GenomicsDB would at the same time run with a limit of 50 alleles. Now that GenomicsDB is respecting the --max-alternate-alleles argument, we're running with the same value (eg., 6) in both GenotypeGVCFs and GenomicsDB and getting these exceptions.

We're probably going to need a separate max alts argument specifically for GenomicsDB....

@seboyden
Copy link

To reiterate, for me, GenotypeGVCFs 4.2.4.1 gives an IllegalStateException at the exact same place regardless of whether I run GenomicsDBImport 4.2.4.0 (with the 50 max allele bug) or GenomicsDBImport 4.2.4.1 (with the fix to respect 6 max alleles). As far as I can tell, the IllegalStateException has nothing to do with GenomicsDBImport, it's simply a new bug in GenotypeGVCFs 4.2.4.1 that happens to occur when the number of ALT alleles is 1 more than the limit set in GenotypeGVCFs.

As @mlathara stated: "the AF calculation [in GenotypeGVCFs] used to [in 4.2.4.0] ignore sites if they didn't have likelihoods, but has now been updated slightly [in 4.2.4.1] to also allow sites with GQ or sites where any alleles are called+nonref+not symbolic.

The stack traces above show the AlleleFrequencyCalculator as the culprit:

java.lang.IllegalStateException: Genotype [T199970 ATATATAT/T GQ 49 DP 4 AD 0,2,0,0,0,2,0,0 {SB=0,0,2,2}] does not contain likelihoods necessary to calculate posteriors.
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.log10NormalizedGenotypePosteriors(AlleleFrequencyCalculator.java:89)
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.effectiveAlleleCounts(AlleleFrequencyCalculator.java:258)
	at org.broadinstitute.hellbender.tools.walkers.genotyper.afcalc.AlleleFrequencyCalculator.calculate(AlleleFrequencyCalculator.java:141)

@ldgauthier
Copy link
Contributor

Yep, this should fix your issue as well.

@droazen
Copy link
Collaborator

droazen commented Jan 27, 2022

We will do a GATK point release within the next few days to patch the GenotypeGVCFs bug, and also add a --max-genomicsdb-alternate-alleles argument that is separate from (and greater than) --max-alternate-alleles

@mlathara
Copy link
Contributor

Let me paraphrase to make sure I understand: GenotypeGVCFs somehow uses the "extra" PL fields returned by GenomicsDB when they each have different max-alternate-alleles arguments. We want to preserve this because this offers a good balance of compute time vs accuracy. Is that right?

So, outlining a few cases:

  • GenotypeGVCFs and GenomicsDB have the same max-alternate-alleles set: some sort of fix still needed for this?
  • GenotypeGVCFs max-alternate-alleles is less than GenomicsDB: default
  • GenotypeGVCFs max-alternate-alleles is greater than GenomicsDB: illegal?

Am I thinking about that correctly?

Do we need to worry about max-genotype-count here? That is also tied together in GenomicsDB and GATK. And that is actually the argument more directly tied to the number of likelihoods...

@ldgauthier
Copy link
Contributor

@mlathara do you mean because of the <NON_REF> allele? As in GenomicsDB max should be at least GGVCFs max plus one for the non-ref that GGVCFs won't output since the GGVCFs arg describes the genotyped output?

@mlathara
Copy link
Contributor

@ldgauthier you mean this part?

GenotypeGVCFs max-alternate-alleles is greater than GenomicsDB: illegal?

I just wasn't sure whether that particular permutation makes sense (either from a user perspective, or one that GenotypeGVCFs should support). I suppose as long as the argument description defines the potential interaction between the arguments, all permutations can be supported.

@ldgauthier
Copy link
Contributor

@eriqande your data looks good with the update in #7655 Hopefully we can get a point release out next week, but if you're in a rush you can roll your own jar from that branch.

There are two args now (both to be used in the GenotypeGVCFs command). I would suggest setting the --max-alternate-alleles-for-gdb to a value 3-4 more than the --max-alternate-alleles value that's used by GenotypeGVCFs if you want to be sensitive about retaining variants, keeping in mind that sites with more than --max-alternate-alleles-for-gdb will ultimately be dropped. For a human cohort, I would typically expect sites with more than 6 alleles to be (indel) PCR stutter errors, but use your own judgement based on your organism's mutation rate and population diversity.

@nalinigans
Copy link
Collaborator

nalinigans commented Jan 28, 2022

@ldgauthier, would it be possible to rename --max-alternate-alleles-for-gdb to --genomicsdb-max-alternate-alleles? That way all genomicsdb related args are together and in my mind gdb refers to the debugger gdb.

Based on @mlathara's comment, do we need to worry about --max-genotype-count as what is passed to GenomicsDB is tethered to the GenotypeGVCFs argument?

@droazen
Copy link
Collaborator

droazen commented Feb 4, 2022

Fixed in GATK 4.2.5.0

@droazen droazen closed this as completed Feb 4, 2022
@xylebori
Copy link

Hello all. Thank you for this incredibly useful software tome of tools and documentation. I can't imagine progressing far in any meaningful way without your hard work.

I am re-opening this thread because I have encountered a problem and this thread seems to be closest to the issue I am experiencing. I am running GATK 4.2.5.0 on a GenomicsDB consisting of ~500 wolf exomes, created using GATK 4.2.2.0. I have tried running GGVCFs with 500gb memory, 600gb memory, and 1.6tb memory. In each case the run stops without issuing a warning or exception at exactly the same spot. I can't run a stack trace as the java process terminates. My exact command is

gatk --java-options "-Xmx1600g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenotypeGVCFs
-R /home/dan_vanderpool/Wolf_raw_reads/Wolf_genome/GCA_905319855.2_mCanLor1.2_genomic.fa
-V gendb://Wolf_Genome_Variantsdb
-O All_Wolf_Samples_Joint_Genotypes_Raw.vcf.gz
-L /scratch/dan/Wolf_reads_raw/Wolf_GenCov300_Q20_Merged.interval_list
-imr ALL
--genomicsdb-max-alternate-alleles 10
--max-alternate-alleles 6

This runs perfectly until it reaches the 2 millionth variant mark whereupon everything stops, and all processes are terminated. You will notice this isn't occurring at chr1= ~200k (as in previous reports), but instead on the variants processed = ~2million. It seems odd that previous posts had a similar error (reported alternately as chr position ~200k or 2m).

If I try running "SelectVariants" on any interval in the database

gatk SelectVariants
-R /home/dan_vanderpool/Wolf_raw_reads/Wolf_genome/GCA_905319855.2_mCanLor1.2_genomic.fa
-V gendb://Wolf_Genome_Variantsdb
-select-type SNP
-O test_error1m.vcf.gz
-L chr1:1000000-2000000

The process stalls (doesn't terminate) without reporting any variants with the progress meter as below.

18:46:19.529 INFO SelectVariants - Done initializing engine
18:46:19.574 INFO ProgressMeter - Starting traversal
18:46:19.574 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute

The stack trace on the stalled "SelectVariants" command looks like:

"G1 Refine#78" os_prio=0 tid=0x00007ff536ea6000 nid=0x1af0e0 runnable

"G1 Refine#79" os_prio=0 tid=0x00007ff536ea8000 nid=0x1af0e1 runnable

"G1 Refine#80" os_prio=0 tid=0x00007ff536ea9800 nid=0x1af0e2 runnable

"G1 Refine#81" os_prio=0 tid=0x00007ff536eab800 nid=0x1af0e3 runnable

"G1 Refine#82" os_prio=0 tid=0x00007ff536ead000 nid=0x1af0e4 runnable

"G1 Young RemSet Sampling" os_prio=0 tid=0x00007ff536eaf000 nid=0x1af0e5 runnable
"VM Periodic Task Thread" os_prio=0 tid=0x00007ff5310eb000 nid=0x1af101 waiting on condition

JNI global references: 13

Heap
garbage-first heap total 5378048K, used 1388875K [0x0000000082000000, 0x0000000800000000)
region size 4096K, 334 young (1368064K), 1 survivors (4096K)
Metaspace used 49722K, capacity 50240K, committed 50768K, reserved 1093632K
class space used 6018K, capacity 6219K, committed 6268K, reserved 1048576K

This is the screen log for GGVCFs command, although I was hoping by adding the flag "-DGATK_STACKTRACE_ON_USER_EXCEPTION=true" I would get more information to share. As it takes 11 hours to get to the point where the error occurs it has been difficult to trouble shoot, I am hoping that I can fix this without rebuilding everything which is why I decided to write. Thanks for any information or suggestions you may have.

Dan

Using GATK jar /home/dan_vanderpool/src/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -Xmx1600g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true -jar /home/dan_vanderpool/src/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar GenotypeGVCFs -R /home/dan_vanderpool/Wolf_raw_reads/Wolf_genome/GCA_905319855.2_mCanLor1.2_genomic.fa -V gendb://Wolf_Genome_Variantsdb -O All_Wolf_Samples_Joint_Genotypes_Raw.vcf.gz -L /scratch/dan/Wolf_reads_raw/Wolf_GenCov300_Q20_Merged.interval_list -imr ALL --genomicsdb-max-alternate-alleles 10 --max-alternate-alleles 6
17:49:29.781 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/dan_vanderpool/src/gatk-4.2.5.0/gatk-package-4.2.5.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Feb 23, 2022 5:49:30 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
17:49:30.164 INFO GenotypeGVCFs - ------------------------------------------------------------
17:49:30.165 INFO GenotypeGVCFs - The Genome Analysis Toolkit (GATK) v4.2.5.0
17:49:30.165 INFO GenotypeGVCFs - For support and documentation go to https://software.broadinstitute.org/gatk/
17:49:30.165 INFO GenotypeGVCFs - Executing as dan_vanderpool@0e07622619ad on Linux v4.4.0-210-generic amd64
17:49:30.165 INFO GenotypeGVCFs - Java runtime: OpenJDK 64-Bit Server VM v10.0.2+13
17:49:30.166 INFO GenotypeGVCFs - Start Date/Time: February 23, 2022 at 5:49:29 PM UTC
17:49:30.166 INFO GenotypeGVCFs - ------------------------------------------------------------
17:49:30.166 INFO GenotypeGVCFs - ------------------------------------------------------------
17:49:30.167 INFO GenotypeGVCFs - HTSJDK Version: 2.24.1
17:49:30.167 INFO GenotypeGVCFs - Picard Version: 2.25.4
17:49:30.167 INFO GenotypeGVCFs - Built for Spark Version: 2.4.5
17:49:30.167 INFO GenotypeGVCFs - HTSJDK Defaults.COMPRESSION_LEVEL : 2
17:49:30.167 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
17:49:30.168 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
17:49:30.168 INFO GenotypeGVCFs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
17:49:30.168 INFO GenotypeGVCFs - Deflater: IntelDeflater
17:49:30.168 INFO GenotypeGVCFs - Inflater: IntelInflater
17:49:30.168 INFO GenotypeGVCFs - GCS max retries/reopens: 20
17:49:30.168 INFO GenotypeGVCFs - Requester pays: disabled
17:49:30.168 INFO GenotypeGVCFs - Initializing engine
17:49:31.083 INFO GenomicsDBLibLoader - GenomicsDB native library version : 1.4.3-6069e4a
17:49:34.547 info NativeGenomicsDB - pid=1756702 tid=1756703 No valid combination operation found for INFO field InbreedingCoeff - the field will NOT be part of INFO fields in the generated VCF records
17:49:34.547 info NativeGenomicsDB - pid=1756702 tid=1756703 No valid combination operation found for INFO field MLEAC - the field will NOT be part of INFO fields in the generated VCF records
17:49:34.547 info NativeGenomicsDB - pid=1756702 tid=1756703 No valid combination operation found for INFO field MLEAF - the field will NOT be part of INFO fields in the generated VCF records
17:49:36.065 INFO FeatureManager - Using codec IntervalListCodec to read file file:///scratch/dan/Wolf_reads_raw/Wolf_GenCov300_Q20_Merged.interval_list
17:49:37.550 INFO IntervalArgumentCollection - Processing 82453944 bp from intervals
17:49:37.586 INFO GenotypeGVCFs - Done initializing engine
17:49:37.646 INFO ProgressMeter - Starting traversal
17:49:37.646 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.33596011100000017,Cpu time(s),0.33382469800000003
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8989001699999994,Cpu time(s),0.717782154
17:50:06.448 INFO ProgressMeter - chr1:68635 0.5 1000 2083.2
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.5475372219999998,Cpu time(s),0.5469811949999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.730766379,Cpu time(s),0.6442443980000002
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.4045866530000001,Cpu time(s),0.40229067200000024
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.5350814799999993,Cpu time(s),0.53377121

.............
...........
.......
....
...
.

13:59:13.758 INFO ProgressMeter - chr1:93376595 638.0 1968000 3084.5
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8929931749999995,Cpu time(s),0.8602520589999988
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.38337427000000013,Cpu time(s),0.3833719339999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.18107355799999994,Cpu time(s),0.18105254200000012
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.158825212,Cpu time(s),0.15884046699999993
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.06781657199999999,Cpu time(s),0.06781145800000003
14:00:12.277 INFO ProgressMeter - chr1:93457728 639.0 1969000 3081.4
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.20574665200000009,Cpu time(s),0.20577224800000005
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.4271922139999996,Cpu time(s),0.4271902219999999
Chromosome chr1 position 93497864 (TileDB column 93497863) has too many alleles in the combined VCF record : 30 : current limit : 10. Fields, such as PL, with length equal to the number of genot
ypes will NOT be added for this location.

Chromosome chr1 position 93497885 (TileDB column 93497884) has too many alleles in the combined VCF record : 11 : current limit : 10. Fields, such as PL, with length equal to the number of genot
ypes will NOT be added for this location.

Chromosome chr1 position 93497887 (TileDB column 93497886) has too many alleles in the combined VCF record : 13 : current limit : 10. Fields, such as PL, with length equal to the number of genot
ypes will NOT be added for this location.

Chromosome chr1 position 93497889 (TileDB column 93497888) has too many alleles in the combined VCF record : 16 : current limit : 10. Fields, such as PL, with length equal to the number of genot
ypes will NOT be added for this location.

Chromosome chr1 position 93497891 (TileDB column 93497890) has too many alleles in the combined VCF record : 43 : current limit : 10. Fields, such as PL, with length equal to the number of genot
ypes will NOT be added for this location.

GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),1.009227539,Cpu time(s),1.0086322770000005
14:00:48.676 INFO ProgressMeter - chr1:93517737 639.6 1970000 3080.0
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8217597579999995,Cpu time(s),0.8216645980000011
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.12151769400000002,Cpu time(s),0.12153772400000001
14:01:14.063 INFO ProgressMeter - chr1:93623332 640.0 1971000 3079.5
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.6328776320000002,Cpu time(s),0.6328368160000004
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.09548567399999998,Cpu time(s),0.095487573
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.5727771769999997,Cpu time(s),0.5728123240000006
14:01:50.074 INFO ProgressMeter - chr1:93683674 640.6 1972000 3078.2
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.9968464419999995,Cpu time(s),0.9966557420000003
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8894786709999988,Cpu time(s),0.8894220020000002
14:02:13.616 INFO ProgressMeter - chr1:93744177 641.0 1973000 3077.9
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.4512276350000002,Cpu time(s),0.4511926880000002
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.9605921880000003,Cpu time(s),0.7330661930000003
14:02:39.922 INFO ProgressMeter - chr1:93824730 641.5 1974000 3077.3
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.5861020839999999,Cpu time(s),0.5860650889999997
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.194984899,Cpu time(s),0.194934499
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.04492449,Cpu time(s),0.044919816
14:03:15.246 INFO ProgressMeter - chr1:93885681 642.1 1975000 3076.1
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8663779130000002,Cpu time(s),0.8662686440000006
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.05405293100000002,Cpu time(s),0.054069363999999995
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7206055259999994,Cpu time(s),0.7204763370000008
14:03:50.355 INFO ProgressMeter - chr1:93966088 642.6 1976000 3074.8
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.46577267300000014,Cpu time(s),0.46567525699999984
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8081788709999999,Cpu time(s),0.8079976489999998
14:04:17.870 INFO ProgressMeter - chr1:94021107 643.1 1977000 3074.2

.............
...........
.........
....
..
.

GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.203489618,Cpu time(s),0.2034794189999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.29252334599999985,Cpu time(s),0.2924567550000001
Chromosome chr1 position 94664769 (TileDB column 94664768) has too many alleles in the combined VCF record : 11 : current limit : 10. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7820505179999996,Cpu time(s),0.7815515469999997
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.3168020280000001,Cpu time(s),0.3168988810000002
14:11:05.627 INFO ProgressMeter - chr1:94684848 649.9 1987000 3057.4
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.12011685800000002,Cpu time(s),0.12010213400000001
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8632601950000016,Cpu time(s),0.8633032279999995
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.12743692499999998,Cpu time(s),0.12740448599999996
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.14069430000000005,Cpu time(s),0.14074666499999997
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),7.25002E-4,Cpu time(s),6.93458E-4
14:12:12.560 INFO ProgressMeter - chr1:94785159 651.0 1988000 3053.7
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.4463671760000001,Cpu time(s),0.4463201570000004
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.009312711999999999,Cpu time(s),0.009311537
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.46523442000000026,Cpu time(s),0.4651473790000001
14:12:50.881 INFO ProgressMeter - chr1:94866778 651.6 1989000 3052.3
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8587987370000005,Cpu time(s),0.858104161
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7125395419999994,Cpu time(s),0.7125397220000003
14:13:19.980 INFO ProgressMeter - chr1:94927297 652.1 1990000 3051.5
Chromosome chr1 position 94927445 (TileDB column 94927444) has too many alleles in the combined VCF record : 15 : current limit : 10. Fields, such as PL, with length equal to the number of genotypes will NOT be added for this location.

GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7910685349999996,Cpu time(s),0.7911189049999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.051357398999999984,Cpu time(s),0.05138313999999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),1.0258357940000005,Cpu time(s),1.025852223
14:14:00.091 INFO ProgressMeter - chr1:95007480 652.8 1991000 3049.9
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8301361840000009,Cpu time(s),0.8300436479999992
14:14:12.464 INFO ProgressMeter - chr1:95028006 653.0 1992000 3050.5
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8915903110000003,Cpu time(s),0.8917716420000006
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),1.0141684109999995,Cpu time(s),1.0141129959999997
14:14:42.670 INFO ProgressMeter - chr1:95089317 653.5 1993000 3049.7
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.15443601299999993,Cpu time(s),0.15444636099999998
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.00388675,Cpu time(s),0.0038859320000000004
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.011457441,Cpu time(s),0.01146462
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.9513679139999998,Cpu time(s),0.9510472580000006
14:15:32.332 INFO ProgressMeter - chr1:95149684 654.3 1994000 3047.4
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7858223750000004,Cpu time(s),0.7860161419999994
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.5609820189999997,Cpu time(s),0.5610445
14:15:58.556 INFO ProgressMeter - chr1:95190360 654.8 1995000 3046.9
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7966412100000002,Cpu time(s),0.7964724209999999
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7449437060000001,Cpu time(s),0.7449131769999995
14:16:26.143 INFO ProgressMeter - chr1:95230563 655.2 1996000 3046.2
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.2818602259999998,Cpu time(s),0.28184821800000015
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7874050149999989,Cpu time(s),0.7865822079999993
14:16:51.877 INFO ProgressMeter - chr1:95271581 655.7 1997000 3045.8
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.5230967649999999,Cpu time(s),0.52315023
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.6448535830000001,Cpu time(s),0.6398179819999991
14:17:17.644 INFO ProgressMeter - chr1:95312109 656.1 1998000 3045.3
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.7141784470000003,Cpu time(s),0.7140978140000002
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.790297878,Cpu time(s),0.7903505010000009
14:17:46.985 INFO ProgressMeter - chr1:95352498 656.6 1999000 3044.6
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.8911955110000004,Cpu time(s),0.8911741600000005
GENOMICSDB_TIMER,GenomicsDB iterator next() timer,Wall-clock time(s),0.6716321679999999,Cpu time(s),0.6714488319999998

@droazen
Copy link
Collaborator

droazen commented Feb 23, 2022

@xylebori There were a couple of issues with the GenomicsDB / GenotypeGVCFs pipeline in the 4.2.5.0 release that will be fixed with the patch in #7670. I'm not certain they are related to the issue you're encountering, but you could try running with that patch (or waiting for 4.2.6.0, which should be out shortly) and see if it helps. If it doesn't help, file a new ticket (rather than commenting on this closed ticket) and we'll take a look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants