Drop the incorrect genotype attributes for decomposed SNPs. #1334

Closed
NeillGibson opened this issue Apr 13, 2016 · 9 comments
@NeillGibson

NeillGibson commented Apr 13, 2016

Hi,

Would it be possible to drop the incorrect genotype attributes for decomposed variants?

The FreeBayes variant calling pipeline currently decomposes longer variants into SNPs/indels where possible and keeps all genotype attributes.

vcfallelicprimitives -t DECOMPOSED --keep-geno

https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/freebayes.py#L128

The total set of genotype attributes kept is

GT
AO
DP
GQ
PL
QA
QR
RO

Of these genotype attributes, the following can be corrupt for decomposed multi-allelic variants.
They are corrupt because the number of values for these attributes no longer matches the number of alleles in the VCF ALT column.

PL
RO
AO
QA
QR 
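The mismatch described above can be checked mechanically. The following is a minimal sketch, not part of bcbio or vcflib: `expected_count` and `mismatched_fields` are hypothetical helper names, and the `Number` codes passed in (`A` for AO, `G` for PL) are assumptions that should be read from the header of your own VCF.

```python
from math import comb


def expected_count(number, n_alt, ploidy=2):
    """Number of values a field should carry for a VCF 'Number' code."""
    if number == "A":  # one value per ALT allele
        return n_alt
    if number == "R":  # one value per allele, REF included
        return n_alt + 1
    if number == "G":  # one value per unordered genotype
        return comb(n_alt + ploidy, ploidy)
    return int(number)  # fixed count such as "1"


def mismatched_fields(alt, fmt, sample, numbers, ploidy=2):
    """Return FORMAT keys whose value count disagrees with the ALT column.

    `numbers` maps FORMAT key -> Number code (assumed, taken from the header).
    """
    n_alt = len(alt.split(","))
    bad = []
    for key, value in zip(fmt.split(":"), sample.split(":")):
        code = numbers.get(key)
        if code is None or code == "." or value == ".":
            continue  # undeclared, variable-length, or missing: nothing to check
        if len(value.split(",")) != expected_count(code, n_alt, ploidy):
            bad.append(key)
    return bad


# Assumed declarations for illustration only.
numbers = {"GT": "1", "DP": "1", "AO": "A", "PL": "G"}
# A record that was tri-allelic before decomposition but now has one ALT.
print(mismatched_fields("T", "GT:DP:AO:PL", "0/1:20:8,3:50,0,60,70,80,90", numbers))
```

A record with a single ALT allele should carry one AO value and three PL values for a diploid sample, so both fields are reported here.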

This causes a bcftools assertion to fail during some subset operations, for example:

bcftools view -s ^sample_to_exclude --trim-alt-alleles    my_file.vcf  

samtools/bcftools#404

This could be fixed by dropping the PL, AO and QA attributes for all variants or just for the decomposed ones.

This would leave the following attributes for the decomposed SNPs.

GT
DP
GQ

I tried dropping the PL, AO and QA attributes for the decomposed variants and the resulting VCF file still seems to be valid. I did this by first splitting the VCF file into decomposed and non-decomposed variants.

I thought the following command would just remove the possibly corrupt genotype attributes from the decomposed variants and also output the non-decomposed variants unchanged.

bcftools annotate -x FMT/PL,FMT/AO,FMT/QA  -i INFO/DECOMPOSED=1 my.vcf

However, it filters out all the non-decomposed variants and outputs only variants with the DECOMPOSED flag set.
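One way to get the intended behaviour (strip the attributes from decomposed records, pass everything else through unchanged) is a small line filter. This is a rough sketch operating on plain-text VCF lines; `strip_format_keys` is a hypothetical helper, not an existing bcftools or vcflib feature, and it assumes the record carries the DECOMPOSED tag in INFO as set by `-t DECOMPOSED`. It does not touch the header declarations.

```python
def strip_format_keys(line, drop, flag="DECOMPOSED"):
    """Remove the FORMAT keys in `drop` from one VCF line, but only if the
    record's INFO column carries `flag`. Other lines are returned unchanged."""
    if line.startswith("#"):
        return line  # header lines pass through
    ending = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 10:
        return line  # no FORMAT/sample columns to edit
    info_keys = {entry.split("=", 1)[0] for entry in cols[7].split(";")}
    if flag not in info_keys:
        return line  # non-decomposed record: keep as-is
    keys = cols[8].split(":")
    keep = [i for i, key in enumerate(keys) if key not in drop]
    cols[8] = ":".join(keys[i] for i in keep)
    for c in range(9, len(cols)):
        values = cols[c].split(":")
        # Trailing sample fields may be omitted in VCF, hence the bounds check.
        cols[c] = ":".join(values[i] for i in keep if i < len(values))
    return "\t".join(cols) + ending


# Usage sketch: filter a VCF from stdin to stdout.
# import sys
# for line in sys.stdin:
#     sys.stdout.write(strip_format_keys(line, {"PL", "AO", "QA"}))
```

Unlike `-i`, which selects which records are output, this keeps every record and only edits the flagged ones.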

Maybe another tool could be added that only removes the possibly corrupt genotype attributes for the decomposed variants?
Or I could open a ticket at vcflib to ask for an option "--keep-minimal-geno"?

Do you have any ideas on how to handle this?

@chapmanb
Member

Neill;
Thanks for the detailed report. I agree that we should be providing output that downstream tools will not error out on. I also hate to throw away PLs and other data for every DECOMPOSED variant, when this should only be a problem for those that start multi-allelic and end up single allele after decomposition.

It might be worth suggesting to drop (or fix) these in vcfallelicprimitives itself. I know @zeeev and @ekg have been working on vcflib recently so might have ideas how to make it do the right thing.

@ekg

ekg commented Apr 13, 2016

If you use vcfallelicprimitives -kg then it will keep the INFO and FORMAT fields through the decomposition.

I need to make this more clear, because it's come up many times. But, I can't make this the default behavior as it is not correct and I don't know how to automatically re-derive the fields after decomposition.

@zeeev

zeeev commented Apr 13, 2016

@ekg Are you saying that after breaking a tri-allelic site the genotype likelihoods are no longer valid?

I was just bitten by the same behavior in vcffilter.

Maybe a tool is in order?

@chapmanb
Member

Erik and Zev;
Thanks for the thoughts. We use -g/--keep-geno in bcbio which is what leads to the incorrect PLs for these tricky records. Losing all of the genotype information by not using -g is not great either, so this is the compromise we've settled on. A tool that removed invalid GTs that don't match alleles after decomposition would be perfect. Then we could save all the things that were fine and still end up with valid VCF for the tricky ones. Thanks so much for thinking about ideas for this.

@ekg

ekg commented Apr 14, 2016

So it is possible to filter out the alleles and maintain correct genotype likelihoods. However, this isn't what's being used in the vcfallelicprimitives code.
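As an illustration of what filtering alleles while keeping the likelihoods consistent could look like, here is a minimal sketch for diploid samples. It is not the vcflib implementation; `pl_index` and `subset_pl` are hypothetical names. It relies on the VCF specification's genotype ordering, where genotype j/k (j <= k) sits at index k*(k+1)/2 + j. Shifting the subset so its smallest value is 0 is an assumed convention, controlled by the `renormalize` parameter.

```python
def pl_index(j, k):
    """Position of diploid genotype j/k in a G-ordered field (VCF spec ordering)."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def subset_pl(pl, keep_alleles, renormalize=True):
    """Keep only the PL entries for genotypes made of `keep_alleles`.

    `keep_alleles` lists original allele indices in their new order,
    with 0 (REF) first, e.g. [0, 2] to keep REF and the second ALT.
    """
    sub = [
        pl[pl_index(keep_alleles[j], keep_alleles[k])]
        for k in range(len(keep_alleles))
        for j in range(k + 1)
    ]
    if renormalize:
        lowest = min(sub)
        sub = [value - lowest for value in sub]  # best genotype gets PL 0
    return sub


# Tri-allelic PL, ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
pl = [50, 0, 60, 70, 80, 90]
print(subset_pl(pl, [0, 1]))  # genotypes 0/0, 0/1, 1/1
print(subset_pl(pl, [0, 2]))  # genotypes 0/0, 0/2, 2/2
```

Each resulting list has three values, which matches a record with a single ALT allele.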

@nh13

nh13 commented Jun 5, 2016

@NeillGibson I am running into this same issue, but isn't "bcftools view -a - 2> /dev/null" being run before the vcfallelicprimitives command? See: https://github.com/chapmanb/bcbio-nextgen/blob/master/bcbio/variation/freebayes.py#L128-L130

@NeillGibson
Author

Hi @nh13, I think vcfallelicprimitives itself trims the alternative alleles that become duplicates after conversion to primitives.

My guess is that the bcftools view -a - 2> /dev/null command is in place to filter alt alleles that arise from another source and are not present in any sample (even before converting to primitives).

Something like an alt allele that arises from sequencing noise and is put in the alt alleles field but was never assigned to a sample genotype.
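To make the idea of an unassigned alt allele concrete, here is a small sketch that lists ALT alleles no sample genotype refers to. `unused_alt_indices` is a hypothetical helper for illustration only; it is not how bcftools implements `view -a`.

```python
import re


def unused_alt_indices(alt, fmt, samples):
    """Return 1-based ALT indices that appear in no sample's GT."""
    n_alt = len(alt.split(","))
    gt_pos = fmt.split(":").index("GT")
    used = set()
    for sample in samples:
        gt = sample.split(":")[gt_pos]
        for allele in re.split(r"[/|]", gt):  # unphased or phased separator
            if allele != ".":  # skip missing calls
                used.add(int(allele))
    return [i for i in range(1, n_alt + 1) if i not in used]


# The second ALT allele (G) is listed but never called in any sample.
print(unused_alt_indices("T,G", "GT:DP", ["0/1:10", "1/1:12"]))
```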

chapmanb added a commit that referenced this issue Jun 16, 2016
- Use --strict-vcf to avoid Integer/Float problems for genotype quality
  (samtools/bcftools#420)
- Remove FMT/DPR since bcftools does not like outputs annotated as A for
  REF/ALT (#1334)
- Removes needs for /dev/null stderr redirection since
  FreeBayes includes contig lines.
- Fixes gVCF reference allele problems. Thanks to @lijiayong
@chapmanb
Member

Neill and Nils;
I was testing the latest development FreeBayes with bcbio and it looks like the DPR attribute triggers this problem with bcftools view -a. This might be the pre-vcfallelicprimitives problem you saw. I updated the bcbio wrapper to handle this with the annotate -x approach (and we can also quit swallowing stderr since FreeBayes includes contig lines). We still need to work on a fix for the post-vcfallelicprimitives case, but I wanted to give an update on other cases where I saw the issue.

@naumenko-sa
Contributor

Please reopen and update if it is still an issue.
