Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AS_FilterStatus uses an incorrect number of fields for some records #6857

Open
2 tasks done
DonFreed opened this issue Sep 30, 2020 · 12 comments
Open
2 tasks done

AS_FilterStatus uses an incorrect number of fields for some records #6857

DonFreed opened this issue Sep 30, 2020 · 12 comments
Assignees

Comments

@DonFreed
Copy link
Contributor

Bug Report

Affected tool

Mutect2

Affected versions

  • Latest public release version 4.1.8.1
  • Latest master branch as of 9/20/2020

Description

Mutect2’s header defines AS_FilterStatus as follows:

##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Filter status for each allele, as assessed by ApplyRecalibration. Note that the VCF filter field will reflect the most lenient/sensitive status across all alleles.">

AS_FilterStatus uses the pipe character | for per-allele concatenation and a comma , for filter concatenation. This causes records to have an incorrect number of values at sites with multiple filters or multiple alleles. Some examples:

chr1    826950  .       G       T       .       clustered_events;contamination;map_qual;strand_bias    AS_FilterStatus=map_qual,strand_bias,contamination;AS_SB_TABLE=86,101|7,0;DP=199;ECNT=3;GERMQ=93;MBQ=35,34;MFRL=193,211;MMQ=33,27;MPOS=6;NALOD=1.28;NLOD=5.42;POPAF=1.39;ROQ=80;TLOD=8.79       GT:AD:AF:DP:F1R2:F2R1:SB        0/0:36,0:0.05:36:18,0:18,0:24,12,0,0    0/1:151,7:0.055:158:76,4:71,3:62,89,7,0
chr1    3633298 .       GT      G,GTT   .       contamination;multiallelic;normal_artifact;slippage;weak_evidence    AS_FilterStatus=weak_evidence,contamination|weak_evidence,contamination;AS_SB_TABLE=89,7|7,0|7,0;DP=129;ECNT=1;GERMQ=67;MBQ=20,20,20;MFRL=0,0,0;MMQ=60,60,60;MPOS=17,31;NALOD=-0.2424,0.21;NLOD=6.4,6.36;POPAF=2.49,2.04;ROQ=93;RPA=11,10,12;RU=T;STR;STRQ=1;TLOD=3.04,4.6      GT:AD:AF:DP:F1R2:F2R1:SB        0/0:58,4,4:0.072,0.069:66:28,2,2:28,2,2:54,4,8,0        0/1/2:38,3,3:0.083,0.084:44:21,1,3:15,2,0:35,3,6,0

A quick fix would be to define Number=1 for AS_FilterStatus in the VCF header. Alternatively, using a pipe for filter concatenation and a comma for per-allele concatenation might be more compliant with the VCF specification.

@droazen
Copy link
Contributor

droazen commented Oct 5, 2020

@ldgauthier It seems that we might be violating the VCF spec with the way values are delimited in the AS_FilterStatus annotation. How would you feel about correcting this?

@ldgauthier
Copy link
Contributor

@droazen Are you thinking String with count=1 and then DIY parsing? Despite years of discussions, we never made any progress on lists of lists in the VCF spec.

@ldgauthier
Copy link
Contributor

It looks like that's just what Don did in #6858

@TnakaNY
Copy link

TnakaNY commented Dec 21, 2020

I am facing same problem when I tried to merge Mutect2 vcf files (from several patients) to one. I tried to use bcftools merge, but I got a same error. I used Mutect2 of GATK4.1.9.0.

From when does Mutect2 have this error?

I am thinking of downgrade GATK tools such as 4.1.7.0 or older version.
Does anyone have information? Which GATK4 Mutect2 version should I try?

@asntech
Copy link

asntech commented Aug 4, 2021

Is this issue fixed yet? @TnakaNY I am using GATK v4.1.7.0 and still getting the error wrong number of fields in AS_FilterStatus? while merging Mutect2 VCF files from multiple donors using the bcftools merge.

@jkobject
Copy link

jkobject commented Jan 31, 2022

I am running v4.2.4.0 and still have the issue. It seems that mutect2 is doing the wrong thing of separating the AS_FilterStatus for different allele using | and listing many AS_FilterStatus per alleles using , but the rule is actually the opposite apparently samtools/bcftools#1570

@davetang
Copy link

davetang commented May 10, 2022

I came across this problem when using bcftools merge on some Mutect2 VCFs and as suggested, a quick fix is to change AS_FilterStatus in the VCF header. However I changed Number to ., which according to the VCF spec:

If the number of possible values varies, is unknown, or is unbounded, then this value should be ‘.’.

Sharing my code below in case it is useful for others:

bcftools head my.vcf \
   | perl -nle "if (/ID=AS_FilterStatus,/){ s/Number=A/Number=./ } print" > my.header.txt

bcftools reheader -h my.header.txt my.vcf \
   | bgzip > my.vcf.gz
tabix my.vcf.gz

@kasia-te
Copy link

kasia-te commented Jun 8, 2022

I have the same problem and I use sed to change the header line before merging.
But could you consider changing the allele separator in this field to , and keep the Number=A in the header to allow for easy splitting of the field with bcftools norm -m, as in the examples below:
Number=A - split

zcat test.vcf.gz 
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T,C     100     PASS    ANNOTATION=aboutT,aboutC        GT      0/2     0/1

bcftools norm -m -any test.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=A,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test.vcf.gz; Date=Wed Jun  8 20:51:27 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T       100     PASS    AS_FilterStatus=aboutT  GT      0/0     0/1
chr1    1000    .       A       C       100     PASS    AS_FilterStatus=aboutC  GT      0/1     0/0

Number=. - no split

zcat test1.vcf.gz 
##fileformat=VCFv4.2
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T,C     100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/2     0/1

bcftools norm -m -any test1.vcf.gz
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=AS_FilterStatus,Number=.,Type=String,Description="Some Description">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=chr1>
##bcftools_normVersion=1.10.2+htslib-1.10.2
##bcftools_normCommand=norm -m -any test1.vcf.gz; Date=Wed Jun  8 21:14:22 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  s1      s2
chr1    1000    .       A       T       100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/0     0/1
chr1    1000    .       A       C       100     PASS    AS_FilterStatus=aboutT|aboutC   GT      0/1     0/0

Best,
Kasia

@jkobject
Copy link

Yes we made a workflow on Terra to debug this: https://dockstore.org/workflows/github.com/broadinstitute/depmap_omics/omics_mutect2:dev?tab=files
this is the fix_mutect2.wdl file I believe

@dining-philosopher
Copy link

dining-philosopher commented Sep 14, 2022

Hello!

I wrote a script to circumvent this error:

#!/usr/bin/python3

# makes Mutect2 output vcf compliant with VCF specification in case of AS_FilterStatus delimiters at multiallelic loci
# ("wrong number of fields in AS_FilterStatus?")

# https://github.com/broadinstitute/gatk/issues/6857

# usage: cat sample.vcf | correct_mutect.py | bgzip > sample.right.vcf.gz && tabix -p vcf sample.right.vcf.gz

import sys

if __name__ == "__main__":
    for l in sys.stdin:
        if l[0] != "#":
            cols = l.split("\t")
            # if "," in cols[4]: # not sure if the correction is needed at biallelic sites
            if True:
                info = cols[7].split(";")
                for i in range(len(info)):
                    a = info[i]
                    if a[:16] == "AS_FilterStatus=":
                        b = list(map(lambda c: c.split(","), a[16:].split("|")))
                        info[i] = "AS_FilterStatus=" + ",".join(map(lambda c: "|".join(c), b))
                cols = cols[:7] + [";".join(info)] + cols[8:]
                l = "\t".join(cols)
        print(l, end = "")

@antonylebechec
Copy link

antonylebechec commented Feb 8, 2023

Hello!

Same problem with comma in Description field in VCF header.

I wrote a awk script to prevent error in further tools that can not deal with this format.

fix_vcf_header_with_comma_in_description.awk

BEGIN {
    sep=";"    # alternative separator 
}
{
    # header line with comma in Description
    if ( ($0 ~ /^##INFO/ || $0 ~ /^##FORMAT/) && $0 ~ /Description=\".*,.*\"/ ) {
        match($0, /^(.*)(Description=\".*\")(.*)$/, arr);
        gsub(",",sep,arr[2]);
        print arr[1] arr[2] arr[3]
    # other lines
    } else {
        print $0
    }
}

Usage:

awk -f fix_vcf_header_with_comma_in_description.awk example.vcf
cat example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk
bcftools view example.vcf | awk -f fix_vcf_header_with_comma_in_description.awk

@dpmccabe
Copy link

A faster awk alternative to the above Python script to swap the characters in the variant rows:

awk '{
    # find the filter status field
    match($0, /AS_FilterStatus=[^;]+;/);

    if (RSTART != 0) {
    	# the line matches
        before = substr($0, 1, RSTART - 1);
        match_str = substr($0, RSTART, RLENGTH);
        after = substr($0, RSTART + RLENGTH);

        # temp replace "|" with a non-printing char to swap "|" and "," chars
        gsub(/\|/, "\x1e", match_str);
        gsub(/,/, "|", match_str);
        gsub(/\x1e/, ",", match_str);

        # print modified line
        print before match_str after;
    } else {
    	# no match
        print $0;
    }
}' your.vcf > fixed.vcf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests