
vt normalize doesn't handle collisions #16

Closed

tfarrah opened this issue Mar 25, 2015 · 6 comments

Comments

@tfarrah

tfarrah commented Mar 25, 2015

vt normalize doesn't handle certain cases where two variants are very close together and left-aligning the right-hand one slides it past the other.

Here is an example:

chrM reference 56-60 is ATTTT

Here are two lines from a fictitious VCF file. Together they specify ATTTT -> AGTTTT:
chrM 57 . T G
chrM 59 . T TT

A correct normalization would be:
chrM 57 . T G
chrM 58 . T TT

(The ideal normalization would be a single line, but vt normalize is not intended to do anything this sophisticated:
chrM 56 . A AG)

vt incorrectly gives the following, which together specify ATTTT -> ATGTTT, different from what the original lines specify:
chrM 56 . A AT
chrM 57 . T G

You may find it helpful to know that this same issue came up with the SMaSH normalizer: https://groups.google.com/forum/#!topic/smash-benchmarking/2Vnn7OR0ug8

They addressed it by writing some collision-handling code: amplab/smash#7
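
For concreteness, here is a minimal Python sketch (not vt's actual code) of per-record left alignment and trimming; run on the two original records, it reproduces the collision:

```python
# Minimal sketch (not vt's code) of left-align-and-trim normalization
# applied to each record independently. Positions are 1-based.

REF = dict(zip(range(56, 61), "ATTTT"))  # chrM 56-60 = ATTTT

def normalize(pos, ref, alt, genome):
    """Left-align and trim a single REF/ALT pair against `genome`."""
    while True:
        if ref[-1] == alt[-1]:
            # Shared trailing base: drop it; if an allele empties,
            # extend both alleles one base to the left.
            ref, alt = ref[:-1], alt[:-1]
            if not ref or not alt:
                pos -= 1
                ref, alt = genome[pos] + ref, genome[pos] + alt
        elif len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
            # Shared leading base and both alleles longer than one base: drop it.
            ref, alt = ref[1:], alt[1:]
            pos += 1
        else:
            return pos, ref, alt

print(normalize(57, "T", "G", REF))   # (57, 'T', 'G')   -- the SNP stays put
print(normalize(59, "T", "TT", REF))  # (56, 'A', 'AT')  -- slides left past the SNP
```

The insertion left-aligns through the run of Ts with no awareness of the SNP at 57, which is where the collision comes from.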

@atks
Owner

atks commented Mar 25, 2015

Hi Terry,

Thanks for the report. This is an interesting example.

But before I begin, I would like to distinguish between normalization and decomposition of variants (as we define them).

Decomposition of variants involves breaking down a variant record into multiple records. It may be done vertically - multiallelics becoming biallelics - or horizontally - a cluster of indels and SNPs represented as a complex variant being split up into several records. Horizontal decompositions in general do not have a unique solution. Because of this, reconstructing a haplotype from simpler variants is not a straightforward process.
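
As a rough illustration (a sketch only, not the actual vt decompose code, and the record below is a made-up example), vertical decomposition just splits one record per alternate allele:

```python
# Rough sketch of vertical decomposition (multiallelic -> biallelics).
# Not vt's actual code; genotypes and INFO fields are ignored here.

def decompose_vertical(chrom, pos, ref, alts):
    """Split one multiallelic record into one biallelic record per ALT allele."""
    return [(chrom, pos, ref, alt) for alt in alts]

for record in decompose_vertical("chr1", 100, "A", ["G", "AT"]):  # hypothetical record
    print(record)
# ('chr1', 100, 'A', 'G')
# ('chr1', 100, 'A', 'AT')
```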

Normalization involves ensuring that the representation of a variant record is left aligned and parsimonious, and it does not increase or decrease the number of records representing that variant. Normalization can be applied to biallelic variants or multiallelic variants. The problem of normalization is solvable, and there exists a unique representation that is left aligned and parsimonious. A mathematical proof is published. [http://bioinformatics.oxfordjournals.org/content/early/2015/03/22/bioinformatics.btv112]

Supporting haplotype reconstruction is actually not the goal of vt's normalization; it is meant to allow one to compare the alleles of variant call sets produced by different variant callers, possibly applied to multiple samples.

The notion of normalization that you described involves reconstruction of haplotypes, and you are right to say that there should be some built-in collision detection mechanism. It is a really nice example, and in the context of a single sample, under the assumption that the alternate alleles occur on the same haplotype, your normalization is correct.

Some points to note:

  1. Considering alternate sequences containing all possible alleles
    When I look at 2 records like the following:
    chrM 57 . T G
    chrM 58 . T TT
    the single record combining them should allow for all combinations of the alternate alleles, which yields
    chrM 56 ATTTT AGTTT,ATTTTT,AGTTTT
    which may be normalized to
    chrM 57 T G,TT,GT
    (see the sketch after these points).

  2. Considering a variant record for 2 samples
    It is possible that in one sample there is only a T->G SNP and the other sample only has a T insertion, giving you:
    chrM 57 . T G
    chrM 58 . T TT
    which can be safely normalized to
    chrM 56 . A AT
    chrM 57 . T G
    Thus, for this case, applying the collision algorithm and then combining the 2 variant records would create a fake variant.

  3. Retaining haplotype information
    In general, if a variant caller finds a need to retain haplotype information, it should report a single variant record containing the nearby variants. By calling the variants separately, information is lost and it is difficult to reconstruct the original haplotypes observed, due to the inherent issues of identifiability.
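
To make point 1 concrete, here is a rough multiallelic version of the left-align/trim rules (a Python sketch, not vt's implementation); applied to the combined record, it reproduces the representation above:

```python
# Sketch (not vt's code): the left-align/trim rules generalized to all
# alleles of a multiallelic record at once. Positions are 1-based.

REF = dict(zip(range(56, 61), "ATTTT"))  # chrM 56-60 = ATTTT

def normalize_multi(pos, alleles, genome):
    """Left-align and trim a list of alleles (REF first) against `genome`."""
    alleles = list(alleles)
    while True:
        if len({a[-1] for a in alleles}) == 1:
            # Every allele ends in the same base: drop it, and extend all
            # alleles one base to the left if any of them becomes empty.
            alleles = [a[:-1] for a in alleles]
            if any(not a for a in alleles):
                pos -= 1
                alleles = [genome[pos] + a for a in alleles]
        elif all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
            # Every allele starts with the same base and none is a single base.
            alleles = [a[1:] for a in alleles]
            pos += 1
        else:
            return pos, alleles

print(normalize_multi(56, ["ATTTT", "AGTTT", "ATTTTT", "AGTTTT"], REF))
# (57, ['T', 'G', 'TT', 'GT'])  i.e. chrM 57 T G,TT,GT
```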

@tfarrah
Author

tfarrah commented Mar 25, 2015

Thank you, Adrian, for your prompt reply.

I now understand the difference between normalization and decomposition of variants, as you define them. And I see that the desired result of normalization in the case of nearby variants depends on several characteristics of the input.

In our case, we are combining several single-sample and multi-sample WGS data sources to create a catalog (Kaviar, http://db.systemsbiology.org/kaviar) of allele frequencies for all observed SNVs and short indels/substitutions. Some data sources are full of unnormalized variants (Wellderly and PGP, for example) and we want to normalize them so that we know when we are observing the same variant across the different data sources.

I now see that the correct normalization for cases such as the one I describe depends on:

  • whether the two variants are seen in the same sample,
  • if in the same sample, whether they are homozygous or heterozygous,
  • if both are heterozygous, whether they are on the same haplotype (if known).
    (And, as you say in point 3, if the variant caller retains haplotype information, it should report a single variant containing both nearby variants for clarity.)

If they are in the same sample and either variant is homozygous, or both are on the same haplotype, then my collision-avoiding normalization is correct.

If those conditions are not met, then vt already does the right thing.

I can see that the issue is complex, and that your software would need to consider homozygous vs. heterozygous, phased vs. unphased, and single-sample vs. multi-sample data in order to avoid collisions in the appropriate cases.

@atks
Owner

atks commented Mar 25, 2015

To be precise, even if those conditions are not met, vt is not guaranteed to do the right thing, as it has no idea what the correct thing to do is. Eventually, vt should use such information when it is available and perform the correct variant consolidation steps through other subprograms (not normalize).

Thanks for linking Kaviar, http://db.systemsbiology.org/kaviar, it's a very useful catalog!

@tfarrah
Author

tfarrah commented Apr 9, 2015

Adrian, thank you for the very detailed discussion. I now have a much better understanding of the issues.

@tfarrah tfarrah closed this as completed Apr 9, 2015
@xuyangy

xuyangy commented Jan 9, 2017

Can vt at least mark up variants that have other variants located close by, e.g. within a 3 bp window? Then these variants could be conveniently filtered out for manual checking. This feature would also help with the complex but crucial scenario where two variants lie in the same codon: one variant is rare while the other is common (and would be filtered out), but together the two variants make up a stop codon, for example.
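
Something along the lines of this sketch is what I have in mind (hypothetical behaviour, not an existing vt option):

```python
# Hypothetical sketch of the requested feature (not an existing vt option):
# flag any variant that has another variant within a 3 bp window on the
# same chromosome, so flagged records can be pulled out for manual review.

def flag_nearby(records, window=3):
    """records: list of (chrom, pos) tuples, sorted by chrom then pos."""
    flagged = set()
    for i, (chrom, pos) in enumerate(records):
        for other_chrom, other_pos in records[i + 1:]:
            if other_chrom != chrom or other_pos - pos > window:
                break
            flagged.add((chrom, pos))
            flagged.add((other_chrom, other_pos))
    return flagged

records = [("chrM", 57), ("chrM", 59), ("chrM", 300)]
print(sorted(flag_nearby(records)))  # [('chrM', 57), ('chrM', 59)]
```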

@atks
Owner

atks commented Jan 12, 2017

@xuyangy

You can use "vt filter_overlap".
http://genome.sph.umich.edu/wiki/Vt#Filter_overlap
