This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Biallelic variants #754

Open
david4096 opened this issue Dec 9, 2016 · 0 comments

Comments

@david4096
Member

Splitting from #752 (comment)

In reply to @mbaudis and @ljdursi

We have a somewhat interesting feature that lets one uniquely identify variants within a VCF. Supporting bi-allelic variants will lead to some changes in how we use pysam, but they are minimal. We currently combine the variant's position information with its bases array to uniquely identify variants, filtering them after a range fetch. I believe this generalizes to bi-allelic variants.
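To make the identification step concrete, here is a minimal sketch of filtering after a range fetch. It does not reproduce the server's actual code: `Record`, `variant_key`, and `find_variant` are hypothetical names, the MD5 key format is an assumption, and a plain list stands in for whatever a pysam fetch would return.

```python
import hashlib
from typing import Iterable, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    """Hypothetical stand-in for one VCF record returned by a range fetch."""
    contig: str
    start: int               # 0-based start position
    ref: str                 # reference bases
    alts: Tuple[str, ...]    # alternate bases (one entry if bi-allelic)


def variant_key(rec: Record) -> str:
    """Combine position and bases into a stable key (assumed scheme)."""
    raw = "{}:{}:{}:{}".format(rec.contig, rec.start, rec.ref, ",".join(rec.alts))
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


def find_variant(fetched: Iterable[Record], key: str) -> Optional[Record]:
    """Filter the records from a range fetch down to the one matching key."""
    for rec in fetched:
        if variant_key(rec) == key:
            return rec
    return None


# Two records overlapping the same position, as a range fetch might return.
fetched = [
    Record("1", 99, "A", ("G", "T")),   # multi-allelic line
    Record("1", 99, "AT", ("A",)),      # deletion at the same position
]
wanted = variant_key(fetched[1])
match = find_variant(fetched, wanted)

# A bi-allelic record uses the same scheme with a single alternate base.
biallelic_key = variant_key(Record("1", 99, "A", ("G",)))
```

Nothing in the key depends on how many alternate bases there are, which is why I expect the approach to carry over unchanged.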

If this issue brings the groundswell needed to implement this feature, it has my full support. It greatly simplifies variant representation and eases interchange in some cases. One induced problem is that calls in multi-allelic VCFs are made against both of the alternate bases. Would we create new call messages to go with both alleles of a multi-allelic variant?

For example, if I received a positive call on the VCF line where A becomes G or T, do I create a call message saying I observed the variant representing the G replacement, T replacement, or both? Perhaps we can help VCF move away from multiallelic variants, but this complicates the perspective.

Another induced problem of bi-allelic variants is that dbSNP has represented named variants that are multi-allelic, and these rsIds see common use. Multiple variant messages in a VCF would be represented by the same rs number in this case, but neither message would represent what is shown in dbSNP. Again, perhaps multiallelic variants present problems for analysis, but our goal is to provide ease of interchange.

We can solve the first problem, of generating call messages, by giving them a variant_ids field, in that a call can be made against multiple variant messages at once. This allows us to reconstruct the original VCF structure more reliably by using the Calls message as a way of storing when two variants were observed as multiallelic in the file. That way we only generate one call message and two variant messages for one multiallelic line.
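A rough sketch of what that could look like, assuming simplified message shapes. The `Variant` and `Call` classes, the `split_line` and `rebuild_alts` helpers, the ID format, and the sample name are all hypothetical and do not reflect the current schema. I also assume the call keeps the original VCF genotype indices, so allele index `i > 0` refers to `variant_ids[i - 1]`.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Variant:
    """Simplified bi-allelic variant message (hypothetical shape)."""
    id: str
    reference_name: str
    start: int
    reference_bases: str
    alternate_bases: str   # exactly one alternate allele


@dataclass
class Call:
    """Simplified call message carrying the proposed variant_ids field."""
    call_set_name: str
    genotype: List[int]    # original VCF allele indices (assumption)
    variant_ids: List[str] = field(default_factory=list)


def split_line(reference_name: str, start: int, ref: str, alts: Sequence[str],
               sample: str, genotype: Sequence[int]) -> Tuple[List[Variant], Call]:
    """Turn one multi-allelic line into N variants and a single call."""
    variants = [
        Variant(
            id="{}:{}:{}:{}".format(reference_name, start, ref, alt),
            reference_name=reference_name,
            start=start,
            reference_bases=ref,
            alternate_bases=alt,
        )
        for alt in alts
    ]
    call = Call(sample, list(genotype), [v.id for v in variants])
    return variants, call


def rebuild_alts(variants: Sequence[Variant], call: Call) -> List[str]:
    """Recover the original ALT column order from the call's variant_ids."""
    by_id = {v.id: v for v in variants}
    return [by_id[vid].alternate_bases for vid in call.variant_ids]


# The "A becomes G or T" line, with a sample that carries both alternates.
variants, call = split_line("1", 99, "A", ["G", "T"], "sample-1", [1, 2])
alts = rebuild_alts(variants, call)
```

The order of `variant_ids` is what preserves the original ALT ordering, so a data preparer could write the multi-allelic line back out from one call and two variants.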

The second problem, that dbSNP represents variants in a multiallelic way, seems a bit more problematic. It doesn't make sense to label both variant messages as being this or that rsID, since each will contain only a single alternate base field. You might have to provide a field in the variant message that lets a client know that it is part of a multiallelic representation. That way, you could resolve from message to message when two variants are returned with the same rsID.
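One way a client could use such a field, as a sketch only. The `multiallelic_group_id` field, the `Variant` shape, and `group_multiallelic` are hypothetical, and `rsEXAMPLE1` and `rsEXAMPLE2` are made-up placeholders rather than real dbSNP identifiers.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class Variant:
    """Simplified bi-allelic variant message (hypothetical shape)."""
    id: str
    alternate_bases: str
    names: List[str]                              # e.g. rsIDs
    multiallelic_group_id: Optional[str] = None   # hypothetical new field


def group_multiallelic(variants: Iterable[Variant]) -> Dict[str, List[Variant]]:
    """Collect variants that were split from the same multi-allelic record."""
    groups = defaultdict(list)
    for v in variants:
        if v.multiallelic_group_id is not None:
            groups[v.multiallelic_group_id].append(v)
    return dict(groups)


returned = [
    Variant("v1", "G", ["rsEXAMPLE1"], multiallelic_group_id="g1"),
    Variant("v2", "T", ["rsEXAMPLE1"], multiallelic_group_id="g1"),
    Variant("v3", "C", ["rsEXAMPLE2"]),   # ordinary bi-allelic variant
]
groups = group_multiallelic(returned)

# Reassemble the alternate bases that the shared rsID describes as one record.
combined_alts = [v.alternate_bases for v in groups["g1"]]
```

With the group field present, two messages sharing an rsID can be resolved back to the single multiallelic record instead of each being mistaken for the complete dbSNP entry.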

I fully support the move to bi-allelic variants, but we need to make sure that we provide a way for clients and data preparers to interchange their multiallelic representations.
