Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GATK fails to genotype insertion immediately following a deletion at hg19:chr1:68896832 #6538

Open
1 of 2 tasks
tfenne opened this issue Apr 3, 2020 · 1 comment
Open
1 of 2 tasks

Comments

@tfenne
Copy link
Contributor

tfenne commented Apr 3, 2020

Bug Report

Affected tool(s) or class(es)

HaplotypeCaller

Affected version(s)

  • Latest public release version [4.1.6.0]
  • Latest master branch (not tested)

Description

I have a sample that has a slightly complicated event in it that is getting miscalled. The easiest way to see it is probably with an IGV screenshot:

variant

BWA aligns the reads with a 7bp deletion followed by 2 mismatches, though am inclined to think of it as a 9bp deletion coupled with a 2bp insertion (or a swap of 9bp of reference for 2bp of novel sequence). The original alignments are in the top half of the IGV view. The bottom is the assembly BAM from running the HaplotypeCaller. From what I see the assembly is getting it right.

But the problem is that the event extraction/genotyping goes wrong. I've run it two ways. If I run to generate a called VCF directly using:

gatk HaplotypeCaller \
  --input snippet.bam \
  --output snippet.vcf \
  -R hg19/hg19.fa \
  --bam-output assembly.bam \
  -L chr1:68896800-68896900 \
  --ploidy 2 \
  --min-pruning 2 \
  --min-dangling-branch-length 2 \
  --pcr-indel-model CONSERVATIVE  

Then I get only a single variant reported in the region (the 9bp deletion):

#CHROM  POS       ID  REF         ALT  QUAL     FILTER  INFO                                                                                                                                                         FORMAT          test-sample
chr1    68896832  .   CTTTAGTTTT  C    1597.60  .       AC=1;AF=0.500;AN=2;BaseQRankSum=0.000;DP=122;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=14.52;ReadPosRankSum=1.341;SOR=0.350  GT:AD:DP:GQ:PL  0/1:67,43:110:99:1605,0,2683

If i run to generate a gvcf then things get more interesting:

gatk HaplotypeCaller \
  --input snippet.bam \
  --output snippet.g.vcf \
  -R hg19/hg19.fa \
  -ERC GVCF \
  --bam-output assembly.bam \
  -L chr1:68896800-68896900 \
  --ploidy 2 \
  --min-pruning 2 \
  --min-dangling-branch-length 2 \
  --pcr-indel-model CONSERVATIVE  

yields:

#CHROM  POS       ID  REF         ALT              QUAL     FILTER  INFO             FORMAT              test-sample
chr1    68896800  .   G           <NON_REF>        .        .       END=68896831     GT:DP:GQ:MIN_DP:PL  0/0:118:99:107:0,120,1800
chr1    68896832  .   CTTTAGTTTT  C,<NON_REF>      1597.60  .       BaseQRankSum...  GT:AD:DP:GQ:PL:SB   0/1:67,43,0:110:99:1605,0,2683,1807,2813,4620:67,0,43,0
chr1    68896841  .   T           *,TCC,<NON_REF>  344.02   .       DP=110;Exces...  GT:GQ:PL            ./.:99:0,0,0,0,0,0,0,0,0,0
chr1    68896842  .   G           <NON_REF>        .        .       END=68896899     GT:DP:GQ:MIN_DP:PL  0/0:71:99:41:0,99,1485

I.e. a record is emitted for the insertion but the genotype is ./. with a quality of 99, 0s for all the PLs and the other per-sample annotations we'd expect on a variant record missing.

I suspect the problem has something to do with the fact that the deletion and insertion are in cis and the insertion's anchor base is within the deletion. I'm not even sure how one would represent this as a pair of variants. I think ideally this would be emitted as a single variant with REF=CTTTAGTTTT and ALT=CCC.

Steps to reproduce

Use the command lines above with the files attached in
deletion-then-insertion-files.zip

Expected behavior

The resulting VCF (and gVCF) should emit one or more variants with called genotypes that indicate a 9bp deletion followed by a 2bp insertion in cis.

Actual behavior

The 9bp deletion is emitted and a) in the gVCF the 2bp insertion is emitted without a genotype or other information that would allow genotyping, b) if a VCF is called directly the 2bp insertion is absent entirely.

@nh13
Copy link
Contributor

nh13 commented Feb 25, 2024

When I ran this with 4.5.0.0, I get:

chr1	68896800	.	G	<NON_REF>	.	.	END=68896831	GT:DP:GQ:MIN_DP:PL	0/0:118:99:107:0,120,1800
chr1	68896832	.	CTTTAGTTTT	C,<NON_REF>	1597.60	.	BaseQRankSum=0.000;DP=122;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=439200,122;ReadPosRankSum=1.341	GT:AD:DP:GQ:PGT:PID:PL:PS:SB	0|1:67,43,0:110:99:0|1:68896832_CTTTAGTTTT_C:1605,0,2683,1807,2813,4620:68896832:67,0,43,0
chr1	68896841	.	T	TCC,<NON_REF>	1596.60	.	DP=110;ExcessHet=0.0000;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=0.000;RAW_MQandDP=396000,110;ReadPosRankSum=1.566	GT:AD:DP:GQ:PGT:PID:PL:PS:SB	0|1:67,43,0:110:99:0|1:68896832_CTTTAGTTTT_C:1604,0,2684,1806,2813,4619:68896832:67,0,43,0
chr1	68896842	.	G	<NON_REF>	.	.	END=68896899	GT:DP:GQ:MIN_DP:PL	0/0:71:99:41:0,99,1485
chr1	68896900	.	T	<NON_REF>	.	.	END=68896900	GT:DP:GQ:MIN_DP:PL	0/0:41:78:41:0,78,1170

And the 2bp insertion has genotype 0|1 (phased) as we'd like.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants