Must we split multi-allelic sites in our Genotype model? #1231

jpdna · 2016-10-28T14:18:38Z

Taking a look at https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto
again, @heuermh pointed out to me that, despite previously advocating a "one allele per row" model, the current GA4GH model in the schema above directly supports multi-allelic sites.

Specifically, the genotype_likelihood field here:
https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto#L132

can be a home to Number=G fields like GL, which we currently cannot represent in our "multi-allele splitting" model in ADAM.
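
For concreteness, here is a minimal Python sketch of why Number=G fields are awkward under allele splitting. It assumes diploid genotypes and the VCF ordering for genotype j/k (index k*(k+1)/2 + j, with j <= k). The function names `num_genotypes`, `gl_index`, and `split_gl` are hypothetical and are not ADAM or GA4GH APIs.

```python
from math import comb
from typing import List


def num_genotypes(n_alleles: int, ploidy: int = 2) -> int:
    """Number of entries in a Number=G field: C(n_alleles + ploidy - 1, ploidy)."""
    return comb(n_alleles + ploidy - 1, ploidy)


def gl_index(j: int, k: int) -> int:
    """Position of diploid genotype j/k in a Number=G array (VCF ordering)."""
    lo, hi = sorted((j, k))
    return hi * (hi + 1) // 2 + lo


def split_gl(gl: List[float], alt_index: int) -> List[float]:
    """Keep only the 0/0, 0/a, a/a entries for ALT allele `alt_index` (1-based).

    Entries for genotypes involving any other ALT allele are dropped,
    which is the information lost by a biallelic split.
    """
    a = alt_index
    return [gl[gl_index(0, 0)], gl[gl_index(0, a)], gl[gl_index(a, a)]]


# REF plus two ALTs: values ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2 (made-up numbers)
gl = [-0.1, -1.0, -5.0, -2.0, -6.0, -9.0]
first_alt = split_gl(gl, 1)
second_alt = split_gl(gl, 2)
```

Each split record keeps three of the six values, and the 1/2 entry has no home in either biallelic record, so the original GL array cannot be rebuilt from the split rows alone.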

@fnothaft and others, can you remind us here of what the history and motivation is for us to require splitting all multi-allelic sites?

IMO, there would be some advantages to using the current GA4GH model, even if represented in Avro: it would make it easier to round-trip a VCF file without data loss, provide a home for Number=G fields, and give better compatibility with GA4GH.

I'm considering making Avro schemas like GA4GHVariant and GA4GHCall to explore the idea more concretely.


jpdna · 2016-11-03T16:55:59Z

Here is some interesting documentation about how Hail handles multi-allele splitting:
https://hail.is/commands.html#splitmulti

They state: "most analytic methods only support analyzing data represented as biallelics. Therefore, the current recommendation is to split multiallelics using the command splitmulti when importing a VCF"
This would seem to parallel ADAM's current focus on splitting alleles.
To deal with per-allele INFO fields (Number=A or Number=R), they say:

"Hail does not split annotations in the info field. This means that if a multiallelic site with info.AC value [10, 2] is split, each split site will contain the same array [10, 2]. The provided allele index annotation va.aIndex can be used to select the value corresponding to the split allele's position"

They do have a "non split" mode when importing VCFs.
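
The lookup described in that quote can be sketched in Python as follows. This is only an illustration of the idea, not Hail's implementation: `SplitSite`, `split_multi`, and `allele_count` are hypothetical names, and the sketch assumes the allele index is 1-based over the ALT alleles of the original site.

```python
from typing import Dict, List, NamedTuple


class SplitSite(NamedTuple):
    ref: str
    alt: str
    a_index: int                 # assumed 1-based position of this ALT in the original site
    info: Dict[str, List[int]]   # INFO arrays copied unsplit from the original site


def split_multi(ref: str, alts: List[str], info: Dict[str, List[int]]) -> List[SplitSite]:
    """Emit one biallelic record per ALT, leaving the INFO arrays untouched."""
    return [SplitSite(ref, alt, i, info) for i, alt in enumerate(alts, start=1)]


def allele_count(site: SplitSite) -> int:
    """Select this allele's AC value from the unsplit array via the allele index."""
    return site.info["AC"][site.a_index - 1]


# A multiallelic site with AC = [10, 2], as in the quoted example
sites = split_multi("A", ["C", "T"], {"AC": [10, 2]})
counts = [allele_count(s) for s in sites]
```

Both split records still carry the full `[10, 2]` array, and the per-allele value is only recovered at lookup time through the stored index.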

fnothaft · 2016-11-03T17:11:06Z

> @fnothaft and others, can you remind us here of what the history and motivation is for us to require splitting all multi-allelic sites?

> they state: "most analytic methods only support analyzing data represented as biallelics. Therefore, the current recommendation is to split multiallelics using the command splitmulti when importing a VCF"

The TL;DR is that this was the exact motivation.

fnothaft · 2016-11-03T17:14:28Z

There are a variety of other representational issues too. E.g., what do you do for sites where you have a long INDEL/linear SV (e.g., loss of heterozygosity for a whole gene) that covers up all of the point variants on a haplotype?

jpdna · 2016-11-03T17:18:49Z

Yeah, I guess my main interest was in being able to round-trip a VCF into Parquet/HBase as a "lossless" archive. However, this may not be a particularly urgent need: if the gzipped VCFs are on HDFS, they are pretty easy to go back to and scan quickly if the need arose to recover dropped tags or the exact multi-allelic representation. Thanks for the comments. I'm gonna go ahead and close this so it doesn't clutter.

jpdna closed this as completed Nov 3, 2016
