Must we split multi-allelic sites in our Genotype model? #1231

Closed
jpdna opened this issue Oct 28, 2016 · 4 comments

@jpdna
Member

jpdna commented Oct 28, 2016

Taking another look at https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto,
@heuermh pointed out to me that, despite previously advocating a "one allele per row" model, the current GA4GH model in the schema above directly supports multi-allelic sites.

Specifically, the genotype_likelihood field here:
https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto#L132

can be a home to Number=G fields like GL, which we currently cannot represent in our "multi-allele splitting" model in ADAM.
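To make the Number=G problem concrete, here is a minimal Python sketch (not ADAM or GA4GH code; the function names are made up for illustration). It uses the diploid genotype ordering from the VCF specification, where genotype j/k with j <= k sits at index k*(k+1)/2 + j. A site with two ALT alleles carries six GL values, while each split biallelic record only has room for three, so the split cannot hold the original array as-is.

```python
from itertools import combinations_with_replacement


def genotype_index(j, k):
    """Position of the unphased diploid genotype j/k in a Number=G array."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


def diploid_genotypes(n_alt):
    """All diploid genotypes for a site with n_alt ALT alleles, in VCF order."""
    alleles = range(n_alt + 1)  # 0 is REF, 1..n_alt are the ALT alleles
    pairs = combinations_with_replacement(alleles, 2)
    return sorted(pairs, key=lambda g: genotype_index(*g))


if __name__ == "__main__":
    for n_alt in (1, 2, 3):
        genotypes = diploid_genotypes(n_alt)
        labels = ", ".join(f"{j}/{k}" for j, k in genotypes)
        print(f"{n_alt} ALT allele(s): {len(genotypes)} GL values ({labels})")
```

For one ALT allele this lists 0/0, 0/1, 1/1; for two it lists 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.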

@fnothaft and others, can you remind us here of what the history and motivation is for us to require splitting all multi-allelic sites?

IMO, there would be some advantages to using the current GA4GH model, even if represented in Avro: it would make it easier to round-trip a VCF file without data loss, provide a home for Number=G fields, and improve compatibility with GA4GH.

I'm considering making Avro schemas like GA4GHVariant and GA4GHCall to explore the idea more concretely.

@jpdna
Member Author

jpdna commented Nov 3, 2016

Here is some interesting documentation about how Hail handles multi-allele splitting:
https://hail.is/commands.html#splitmulti

  • they state: "most analytic methods only support analyzing data represented as biallelics. Therefore, the current recommendation is to split multiallelics using the command splitmulti when importing a VCF"
    This would seem to parallel ADAM's current focus on splitting alleles.
  • To deal with INFO fields that have one value per allele (Number=R), they say:

"Hail does not split annotations in the info field. This means that if a multiallelic site with info.AC value [10, 2] is split, each split site will contain the same array [10, 2]. The provided allele index annotation va.aIndex can be used to select the value corresponding to the split allele's position"

  • They do have a "non split" mode when importing VCFs
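The approach described in that quote can be sketched roughly as follows. This is a toy Python model, not Hail's API: `Site`, `split_multi`, `alt_count`, and the 1-based `a_index` convention are all assumptions made for illustration. Each split record keeps the full per-ALT INFO array and only records which ALT allele it came from.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    ref: str
    alts: tuple
    info: dict
    a_index: int = 0  # assumed convention: 0 = never split, 1 = first ALT, ...


def split_multi(site):
    """Emit one biallelic Site per ALT allele, leaving INFO arrays untouched."""
    return [
        Site(site.ref, (alt,), site.info, a_index=i)
        for i, alt in enumerate(site.alts, start=1)
    ]


def alt_count(site):
    """Pick this record's own value out of the unsplit per-ALT AC array."""
    if site.a_index < 1:
        raise ValueError("site was not produced by split_multi")
    return site.info["AC"][site.a_index - 1]


if __name__ == "__main__":
    multi = Site("A", ("C", "T"), {"AC": [10, 2]})
    for part in split_multi(multi):
        print(part.alts[0], part.info["AC"], alt_count(part))
```

Because the arrays are carried along unchanged, nothing is dropped at split time; the cost is that every consumer has to apply the index itself.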

@fnothaft
Member

fnothaft commented Nov 3, 2016

> @fnothaft and others, can you remind us here of what the history and motivation is for us to require splitting all multi-allelic sites?

> they state: "most analytic methods only support analyzing data represented as biallelics. Therefore, the current recommendation is to split multiallelics using the command splitmulti when importing a VCF"

The TL;DR is that this was the exact motivation.

jpdna closed this as completed Nov 3, 2016

@fnothaft
Member

fnothaft commented Nov 3, 2016

There's a variety of other representational issues too. E.g., what do you do for sites where you have a long INDEL/linear SV (e.g., loss of heterozygosity for a whole gene) that covers up all of the point variants on a haplotype?

@jpdna
Member Author

jpdna commented Nov 3, 2016

Yeah, I guess my main interest was in being able to round-trip a VCF into Parquet/HBase as a "lossless" archive. However, this may not be a particularly urgent need: if the gzipped VCFs are on HDFS, it is pretty easy to go back and scan them quickly if the need arose to recover dropped tags or the exact multi-allelic representation. Thanks for the comments. I'm gonna go ahead and keep this closed so it doesn't clutter.
