Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VariantFiltration is incorrectly filtering variants #5916

Open
sahilseth opened this issue May 3, 2019 · 7 comments

Comments

Projects
None yet
2 participants
@sahilseth
Copy link

commented May 3, 2019

To me, it seems VariantFiltration is incorrectly adding a filter when it clearly should not - unless I am missing something.

I am adding a filter when VAF in normal is > 2% or VAF in tumor is < 5%. For this i am using the following expression:

gatk VariantFiltration -R $ref_fasta -V tmp5 \
  --filter-expression 'vc.getGenotype("3105-T").getAD().1 / vc.getGenotype("3105-T").getDP() < 0.05 ' --filter-name "low_taf" \
  --filter-expression 'vc.getGenotype("3105-N").getAD().1 / vc.getGenotype("3105-N").getDP() > 0.02 ' --filter-name "high_naf" \
  -O tmp6

However, for this critical variant, the tool is incorrectly adding a filter, when TAF is 0.55.

tail -n2 tmp5
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  3105-N        3105-T
12      25398284        .       C       T       .       clustered_events        CONTQ=93;DP=529;ECNT=5;GERMQ=93;MBQ=27,27;MFRL=0,0;MMQ=97,91;MPOS=80;NALOD=2.18;NLOD=31.43;POPAF=6.00;REF_BASES=GCCTACGCCACCAGCTCCAAC;SEQQ=93;STRANDQ=93;TLOD=309.98        GT:AD:AF:DP:F1R2:F2R1:SB        0/0:105,0:7.035e-03:105:15,0:84,0:87,18,0,0     0/1:99,123:0.556:222:20,50:71,69:72,27,71,52
tail -n1 tmp6
12      25398284        .       C       T       .       clustered_events;low_taf        CONTQ=93;DP=529;ECNT=5;GERMQ=93;MBQ=27,27;MFRL=0,0;MMQ=97,91;MPOS=80;NALOD=2.18;NLOD=31.43;POPAF=6.00;REF_BASES=GCCTACGCCACCAGCTCCAAC;SEQQ=93;STRANDQ=93;TLOD=309.98        GT:AD:AF:DP:F1R2:F2R1:SB        0/0:105,0:7.035e-03:105:15,0:84,0:87,18,0,0     0/1:99,123:0.556:222:20,50:71,69:72,27,71,52

VAF calc:

#   GT:AD:AF:DP:F1R2:F2R1:SB
#   0/0:105,0:7.035e-03:105:15,0:84,0:87,18,0,0
#   0/1:99,123:0.556:222:20,50:71,69:72,27,71,52

TAF: 123/222=0.5540541
NAF: 0/105=0 [however 7.035e-03 is reported in AF field which is also fine...]

I have another unrelated question regarding non-mutect files. If a tool has the AD values listed under a different name (in this case AO: allele observed, instead of allele depth), how can I construct a jexl expression to filter the same?

Imagine having the same data as above, with a slightly different naming convention:

AD --> AO
#   GT:AO:AF:DP:F1R2:F2R1:SB
#   0/0:105,0:7.035e-03:105:15,0:84,0:87,18,0,0
#   0/1:99,123:0.556:222:20,50:71,69:72,27,71,52

Thanks!

@sahilseth

This comment has been minimized.

Copy link
Author

commented May 3, 2019

Just to follow up, I see the same issue with gatk versions 4.1.2.0 and 3.7. It seems the division operation has an issue:

java -jar ~/gatk/3.7/GenomeAnalysisTK.jar -T VariantFiltration -R $ref_fasta -V tmp5 \
  --filterExpression 'vc.getGenotype("3105-T").getAD().1 > 122 ' --filterName "ad_more_than_122" \
  --filterExpression 'vc.getGenotype("3105-T").getDP() == 222 ' --filterName "dp_is_222" \
  --filterExpression 'vc.getGenotype("3105-T").getAD().1 / vc.getGenotype("3105-T").getDP() < 0.01 ' --filterName "taf_less_than_1pct" \
  --filterExpression 'vc.getGenotype("3105-T").getAD().1 / vc.getGenotype("3105-T").getDP() < 0.001 ' --filterName "taf_less_than_0.1pct" \
  --filterExpression 'vc.getGenotype("3105-T").getAD().1 / vc.getGenotype("3105-T").getDP() == 0.0 ' --filterName "taf_is_zero" \
  --filterExpression 'vc.getGenotype("3105-T").getAD().1 / vc.getGenotype("3105-T").getDP() > 0.05 ' --filterName "taf_more_than_5pct" \
  -o tmp6
tail -n1 tmp6
12      25398284        .       C       T       .       ad_more_than_122;clustered_events;dp_is_222;taf_is_zero;taf_less_than_0.1pct;taf_less_than_1pct CONTQ=93;DP=529;ECNT=5;GERMQ=93;MBQ=27,27;MFRL=0,0;MMQ=97,91;MPOS=80;NALOD=2.18;NLOD=31.43;POPAF=6.00;REF_BASES=GCCTACGCCACCAGCTCCAAC;SEQQ=93;STRANDQ=93;TLOD=309.98        GT:AD:AF:DP:F1R2:F2R1:SB    0/0:105,0:7.035e-03:105:15,0:84,0:87,18,0,0     0/1:99,123:0.556:222:20,50:71,69:72,27,71,52

It is picking up the right values, but the division is giving ZERO instead of 0.55.

@cmnbroad

This comment has been minimized.

Copy link
Collaborator

commented May 3, 2019

@sahilseth This is because you're doing integer division, and then comparing the resulting int to a float. Try multiplying the (AD/DP) values by 1.0 and then doing the division.

@sahilseth

This comment has been minimized.

Copy link
Author

commented May 4, 2019

Thanks that helped, and also going through forums again I saw this post on jexl, and the suggested approach also worked as expected:

vc.getGenotype("3105-T").getAD().1.floatValue()/vc.getGenotype("3105-T").getDP() > 0.4

Re: the second question, what would be a way to retrieve non-cannonical values?

gatk --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true' VariantFiltration -R $ref_fasta -V tmp5 -O tmp6 --verbosity DEBUG \
  --filter-expression ' vc.getGenotype("3105-T").getAttributeAsInt2("DP", 0) > 0 ' --filter-name "dp_gt0_default0_asint" \
  --filter-expression ' vc.getGenotype("3105-T").getAttributeAsInt("DP", 0.5) > 0 ' --filter-name "dp_gt0_default0.5_asint" \
  --filter-expression ' vc.getGenotype("3105-T").getAttributeAsDouble("DP", 0.5) > 0 ' --filter-name "dp_gt0_default0.5_asdouble"
tail -n1 tmp6
12      25398284        .       C       T       .       clustered_events;dp_gt0_default0.5_asdouble     CONTQ=93;DP=529;ECNT=5;GERMQ=93;MBQ=27,27;MFRL=0,0;MMQ=97,91;MPOS=80;NALOD=2.18;NLOD=31.43;POPAF=6.00;REF_BASES=GCCTACGCCACCAGCTCCAAC;SEQQ=93;STRANDQ=93;TLOD=309.98    GT:AD:AF:DP:F1R2:F2R1:SB        0/0:105,0:7.035e-03:105:15,0:84,0:87,18,0,0     0/1:99,123:0.556:222:20,50:71,69:72,27,71,52

This to me suggests:

  1. setting defaults using (getAttributeAsInt("DP", 0.5), 0.5 in this case), works as expected
  2. However, I am not able to retrieve the actual value of DP, which is 222

Thanks!!

@cmnbroad

This comment has been minimized.

Copy link
Collaborator

commented May 6, 2019

@sahilseth Yes, you should be able to call those.

@sahilseth

This comment has been minimized.

Copy link
Author

commented May 7, 2019

thanks @cmnbroad, however vc.getGenotype("3105-T").getAttributeAsInt("DP", 0) is always returning 0 irrespective of the actual vaue of DP - am I missing something in the code?

@cmnbroad

This comment has been minimized.

Copy link
Collaborator

commented May 14, 2019

@sahilseth My apologies, I missed the last notification for this thread. For your last question, are you actually trying to retrieve genotype "DP" (in which case you should use vc.getGenotype("3105-T").getDP()), or some custom attribute? If you include the actual variant and expression you're using I can take a closer look.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.