Must we split multi-allelic sites in our Genotype model? #1231
Comments
Here is some interesting documentation about how Hail handles multi-allele splitting
"Hail does not split annotations in the info field. This means that if a multiallelic site with info.AC value [10, 2] is split, each split site will contain the same array [10, 2]. The provided allele index annotation va.aIndex can be used to select the value corresponding to the split allele's position"
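To make the quoted behaviour concrete, here is a minimal sketch in plain Python (not Hail's API). The names `SplitSite`, `split_multiallelic`, and `allele_count` are hypothetical, and the allele index is assumed to be 1-based (1 being the first alternate allele), mirroring how `va.aIndex` is described.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SplitSite:
    """One biallelic record produced by splitting a multiallelic site (hypothetical type)."""
    ref: str
    alt: str
    a_index: int                 # assumed 1-based position of this alt in the original site
    info: Dict[str, List[int]]   # info values are carried over unsplit


def split_multiallelic(ref: str, alts: List[str], info: Dict[str, List[int]]) -> List[SplitSite]:
    """Emit one record per alternate allele; every record keeps the full info arrays."""
    return [SplitSite(ref, alt, i, dict(info)) for i, alt in enumerate(alts, start=1)]


def allele_count(site: SplitSite) -> int:
    """Use the allele index to pick this allele's entry out of the unsplit AC array."""
    return site.info["AC"][site.a_index - 1]


sites = split_multiallelic("A", ["C", "T"], {"AC": [10, 2]})
for s in sites:
    print(s.alt, s.info["AC"], allele_count(s))
```

Each split record still carries the whole `[10, 2]` array, so nothing is lost at the info level; the index is what recovers the per-allele value.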
The TL;DR is that this was the exact motivation.
There's a variety of other representational issues too. E.g., what do you do for sites where you have a long INDEL/linear SV (e.g., loss of heterozygosity for a whole gene) that covers up all of the point variants on a haplotype?
Yeah, I guess my main interest was in being able to round-trip a VCF into Parquet/HBase as a "lossless" archive. However, this may not be a particularly urgent need: if the gzipped VCFs are on HDFS, it is pretty easy to go back and scan them quickly if the need arose to recover dropped tags or the exact multi-allelic representation. Thanks for the comments. I'm going to go ahead and keep this closed so it doesn't clutter.
Taking a look at https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto
again, @heuermh pointed out to me that, despite previously advocating a "one allele per row" model, the current GA4GH model in the schema above directly supports multi-allelic sites.
Specifically, the genotype_likelihood field here:
https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto#L132
can be a home to Number=G fields like GL, which we currently cannot represent in our "multi-allele splitting" model in ADAM.

@fnothaft and others, can you remind us here of what the history and motivation is for us to require splitting all multi-allelic sites?
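As a rough illustration of why Number=G fields are awkward to split, the sketch below uses the diploid genotype ordering from the VCF specification, where genotype j/k (j <= k) sits at index k*(k+1)/2 + j. The helper names and the GL values are made up for illustration, and `biallelic_gl` is only one naive way a split could subset the array, not what ADAM does.

```python
from typing import List


def genotype_count(n_alleles: int) -> int:
    """Number of diploid genotypes (and thus Number=G values) for n alleles, ref included."""
    return n_alleles * (n_alleles + 1) // 2


def genotype_index(j: int, k: int) -> int:
    """Index of unphased diploid genotype j/k in a Number=G array (VCF spec ordering)."""
    j, k = sorted((j, k))
    return k * (k + 1) // 2 + j


def biallelic_gl(gl: List[float], alt: int) -> List[float]:
    """Naive subset for one alt allele: keeps only 0/0, 0/alt and alt/alt."""
    return [gl[genotype_index(0, 0)], gl[genotype_index(0, alt)], gl[genotype_index(alt, alt)]]


# Illustrative GL values for a site with REF plus two ALTs (3 alleles, 6 genotypes),
# ordered 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
gl = [-0.1, -1.0, -5.0, -2.0, -6.0, -9.0]

print(genotype_count(3))     # 6
print(biallelic_gl(gl, 1))   # [-0.1, -1.0, -5.0]
print(biallelic_gl(gl, 2))   # [-0.1, -2.0, -9.0]
print(genotype_index(1, 2))  # 4: the 1/2 entry appears in neither split record
```

Under this subsetting, the likelihood for the 1/2 genotype has no home in either biallelic record, which is the kind of loss a model that keeps the full array per site would avoid.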
IMO, there would be some advantages to using the current GA4GH model, even if represented in Avro: it would make it easier to round-trip a VCF file without data loss, provide a home for Number=G fields, and improve compatibility with GA4GH.
I'm considering making Avro schemas like GA4GHVariant and GA4GHCall to explore the idea more concretely.