AS_FilterStatus uses an incorrect number of fields for some records #6857

DonFreed · 2020-09-30T22:17:02Z

Bug Report

Affected tool

Mutect2

Affected versions

Latest public release version 4.1.8.1
Latest master branch as of 9/20/2020

Description

Mutect2’s header defines AS_FilterStatus as follows:

##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Filter status for each allele, as assessed by ApplyRecalibration. Note that the VCF filter field will reflect the most lenient/sensitive status across all alleles.">

AS_FilterStatus uses the pipe character | for per-allele concatenation and a comma , for filter concatenation. This causes records to have an incorrect number of values at sites with multiple filters or multiple alleles. Some examples:

chr1    826950  .       G       T       .       clustered_events;contamination;map_qual;strand_bias    AS_FilterStatus=map_qual,strand_bias,contamination;AS_SB_TABLE=86,101|7,0;DP=199;ECNT=3;GERMQ=93;MBQ=35,34;MFRL=193,211;MMQ=33,27;MPOS=6;NALOD=1.28;NLOD=5.42;POPAF=1.39;ROQ=80;TLOD=8.79       GT:AD:AF:DP:F1R2:F2R1:SB        0/0:36,0:0.05:36:18,0:18,0:24,12,0,0    0/1:151,7:0.055:158:76,4:71,3:62,89,7,0
chr1    3633298 .       GT      G,GTT   .       contamination;multiallelic;normal_artifact;slippage;weak_evidence    AS_FilterStatus=weak_evidence,contamination|weak_evidence,contamination;AS_SB_TABLE=89,7|7,0|7,0;DP=129;ECNT=1;GERMQ=67;MBQ=20,20,20;MFRL=0,0,0;MMQ=60,60,60;MPOS=17,31;NALOD=-0.2424,0.21;NLOD=6.4,6.36;POPAF=2.49,2.04;ROQ=93;RPA=11,10,12;RU=T;STR;STRQ=1;TLOD=3.04,4.6      GT:AD:AF:DP:F1R2:F2R1:SB        0/0:58,4,4:0.072,0.069:66:28,2,2:28,2,2:54,4,8,0        0/1/2:38,3,3:0.083,0.084:44:21,1,3:15,2,0:35,3,6,0

A quick fix would be to define Number=1 for AS_FilterStatus in the VCF header. Alternatively, using a pipe for filter concatenation and a comma for per-allele concatenation might be more compliant with the VCF specification.

The text was updated successfully, but these errors were encountered:

…ute#6857.

droazen · 2020-10-05T18:10:34Z

@ldgauthier It seems that we might be violating the VCF spec with the way values are delimited in the AS_FilterStatus annotation. How would you feel about correcting this?

ldgauthier · 2020-10-06T15:18:24Z

@droazen Are you thinking String with count=1 and then DIY parsing? Despite years of discussions, we never made any progress on lists of lists in the VCF spec.

ldgauthier · 2020-10-06T16:02:22Z

It looks like that's just what Don did in #6858

TnakaNY · 2020-12-21T23:00:29Z

I am facing same problem when I tried to merge Mutect2 vcf files (from several patients) to one. I tried to use bcftools merge, but I got a same error. I used Mutect2 of GATK4.1.9.0.

From when does Mutect2 have this error?

I am thinking of downgrade GATK tools such as 4.1.7.0 or older version.
Does anyone have information? Which GATK4 Mutect2 version should I try?

asntech · 2021-08-04T10:58:48Z

Is this issue fixed yet? @TnakaNY I am using GATK v4.1.7.0 and still getting the error wrong number of fields in AS_FilterStatus? while merging Mutect2 VCF files from multiple donors using the bcftools merge.

jkobject · 2022-01-31T15:17:06Z

I am running v4.2.4.0 and still have the issue. It seems that mutect2 is doing the wrong thing of separating the AS_FilterStatus for different allele using | and listing many AS_FilterStatus per alleles using , but the rule is actually the opposite apparently samtools/bcftools#1570

davetang · 2022-05-10T02:13:40Z

I came across this problem when using bcftools merge on some Mutect2 VCFs and as suggested, a quick fix is to change AS_FilterStatus in the VCF header. However I changed Number to ., which according to the VCF spec:

If the number of possible values varies, is unknown, or is unbounded, then this value should be ‘.’.

Sharing my code below in case it is useful for others:

bcftools head my.vcf \
   | perl -nle "if (/ID=AS_FilterStatus,/){ s/Number=A/Number=./ } print" > my.header.txt

bcftools reheader -h my.header.txt my.vcf \
   | bgzip > my.vcf.gz
tabix my.vcf.gz

kasia-te · 2022-06-08T21:19:15Z

I have the same problem and I use sed to change the header line before merging.
But could you consider changing the allele separator in this field to , and keep the Number=A in the header to allow for easy splitting of the field with bcftools norm -m, as in the examples below:
Number=A - split

zcat test.vcf.gz 
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T,C     100     PASS    ANNOTATION=aboutT,aboutC        GT      0/2     0/1

bcftools norm -m -any test.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test.vcf.gz; Date=Wed Jun  8 20:51:27 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T       100     PASS    AS_FilterStatus=aboutT  GT      0/0     0/1
chr1    1000    .       A       C       100     PASS    AS_FilterStatus=aboutC  GT      0/1     0/0

Number=. - no split

zcat test1.vcf.gz 
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T,C     100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/2     0/1

bcftools norm -m -any test1.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test1.vcf.gz; Date=Wed Jun  8 21:14:22 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T       100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/0     0/1
chr1    1000    .       A       C       100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/1     0/0

Best,
Kasia

jkobject · 2022-06-10T20:53:16Z

Yes we made a workflow on Terra to debug this: https://dockstore.org/workflows/github.com/broadinstitute/depmap_omics/omics_mutect2:dev?tab=files
this is the fix_mutect2.wdl file I believe

dining-philosopher · 2022-09-14T12:03:15Z

Hello!

I wrote a script to circumvent this error:

#!/usr/bin/python3

# makes Mutect2 output vcf compliant with VCF specification in case of AS_FilterStatus delimiters at multiallelic loci
# ("wrong number of fields in AS_FilterStatus?")

# https://github.com/broadinstitute/gatk/issues/6857

# usage: cat sample.vcf | correct_mutect.py | bgzip > sample.right.vcf.gz && tabix -p vcf sample.right.vcf.gz

import sys

if __name__ == "__main__":
    for l in sys.stdin:
        if l[0] != "#":
            cols = l.split("\t")
            # if "," in cols[4]: # not sure if the correction is needed at biallelic sites
            if True:
                info = cols[7].split(";")
                for i in range(len(info)):
                    a = info[i]
                    if a[:16] == "AS_FilterStatus=":
                        b = list(map(lambda c: c.split(","), a[16:].split("|")))
                        info[i] = "AS_FilterStatus=" + ",".join(map(lambda c: "|".join(c), b))
                cols = cols[:7] + [";".join(info)] + cols[8:]
                l = "\t".join(cols)
        print(l, end = "")

antonylebechec · 2023-02-08T11:38:33Z

Hello!

Same problem with comma in Description field in VCF header.

I wrote a awk script to prevent error in further tools that can not deal with this format.

fix_vcf_header_with_comma_in_description.awk

BEGIN {
    sep=";"    # alternative separator 
}
{
    # header line with comma in Description
    if ( ($0 ~ /^##INFO/ || $0 ~ /^##FORMAT/) && $0 ~ /Description=\".*,.*\"/ ) {
        match($0, /^(.*)(Description=\".*\")(.*)$/, arr);
        gsub(",",sep,arr[2]);
        print arr[1] arr[2] arr[3]
    # other lines
    } else {
        print $0
    }
}

Usage:

awk -f fix_vcf_header_with_comma_in_description.awk example.vcf
cat example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk
bcftools view example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk

dpmccabe · 2024-06-12T12:55:54Z

A faster awk alternative to the above Python script to swap the characters in the variant rows:

awk '{
    # find the filter status field
    match($0, /AS_FilterStatus=[^;]+;/);

    if (RSTART != 0) {
    	# the line matches
        before = substr($0, 1, RSTART - 1);
        match_str = substr($0, RSTART, RLENGTH);
        after = substr($0, RSTART + RLENGTH);

        # temp replace "|" with a non-printing char to swap "|" and "," chars
        gsub(/\|/, "\x1e", match_str);
        gsub(/,/, "|", match_str);
        gsub(/\x1e/, ",", match_str);

        # print modified line
        print before match_str after;
    } else {
    	# no match
        print $0;
    }
}' your.vcf > fixed.vcf

DonFreed added a commit to DonFreed/gatk that referenced this issue Sep 30, 2020

Update the header for AS_FilterStatus. Possible fix for broadinstit…

a95abf4

…ute#6857.

DonFreed mentioned this issue Sep 30, 2020

Update the header of AS_FilterStatus #6858

Open

droazen added the Annotations label Oct 5, 2020

droazen assigned ldgauthier Oct 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AS_FilterStatus uses an incorrect number of fields for some records #6857

AS_FilterStatus uses an incorrect number of fields for some records #6857

DonFreed commented Sep 30, 2020

droazen commented Oct 5, 2020

ldgauthier commented Oct 6, 2020

ldgauthier commented Oct 6, 2020

TnakaNY commented Dec 21, 2020

asntech commented Aug 4, 2021

jkobject commented Jan 31, 2022 •

edited

Loading

davetang commented May 10, 2022 •

edited

Loading

kasia-te commented Jun 8, 2022

jkobject commented Jun 10, 2022

dining-philosopher commented Sep 14, 2022 •

edited

Loading

antonylebechec commented Feb 8, 2023 •

edited

Loading

dpmccabe commented Jun 12, 2024

AS_FilterStatus uses an incorrect number of fields for some records #6857

AS_FilterStatus uses an incorrect number of fields for some records #6857

Comments

DonFreed commented Sep 30, 2020

Bug Report

Affected tool

Affected versions

Description

droazen commented Oct 5, 2020

ldgauthier commented Oct 6, 2020

ldgauthier commented Oct 6, 2020

TnakaNY commented Dec 21, 2020

asntech commented Aug 4, 2021

jkobject commented Jan 31, 2022 • edited Loading

davetang commented May 10, 2022 • edited Loading

kasia-te commented Jun 8, 2022

jkobject commented Jun 10, 2022

dining-philosopher commented Sep 14, 2022 • edited Loading

antonylebechec commented Feb 8, 2023 • edited Loading

dpmccabe commented Jun 12, 2024

jkobject commented Jan 31, 2022 •

edited

Loading

davetang commented May 10, 2022 •

edited

Loading

dining-philosopher commented Sep 14, 2022 •

edited

Loading

antonylebechec commented Feb 8, 2023 •

edited

Loading