Must we split multi-allelic sites in our Genotype model? #1231

Closed
jpdna opened this Issue Oct 28, 2016 · 4 comments

Member

jpdna commented Oct 28, 2016

Taking a look at https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto
again, @heuermh pointed out to me that, despite previously advocating a "one allele per row" model, the current GA4GH model in the schema above directly supports multi-allelic sites.

Specifically, the genotype_likelihood field here:
https://github.com/ga4gh/schemas/blob/master/src/main/proto/ga4gh/variants.proto#L132

can be a home to Number=G fields like GL, which we currently cannot represent in our "multi-allele splitting" model in ADAM.
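To make the Number=G problem concrete, here is a minimal sketch (plain Python, not ADAM or GA4GH code; all function names are hypothetical). It uses the diploid genotype ordering from the VCF spec, where genotype j/k with j <= k sits at index k*(k+1)/2 + j, and shows that taking a per-allele biallelic subset of GL drops the entries for genotypes made of two different alternate alleles.

```python
from math import comb


def num_genotypes(n_alt, ploidy=2):
    """Length of a Number=G field for a site with n_alt alternate alleles."""
    n_alleles = n_alt + 1  # alternates plus the reference allele
    return comb(n_alleles + ploidy - 1, ploidy)


def diploid_gl_index(j, k):
    """Position of genotype j/k in a diploid Number=G array (VCF ordering)."""
    if j > k:
        j, k = k, j
    return k * (k + 1) // 2 + j


def biallelic_gl_subset(gl, alt_index):
    """Keep only the 0/0, 0/a and a/a likelihoods for alternate allele a.

    alt_index is 1-based (1 = first alternate allele). This is one
    illustrative way to split, not a description of what ADAM does.
    """
    a = alt_index
    return [
        gl[diploid_gl_index(0, 0)],
        gl[diploid_gl_index(0, a)],
        gl[diploid_gl_index(a, a)],
    ]


# Two alternate alleles: genotypes are 0/0, 0/1, 1/1, 0/2, 1/2, 2/2.
gl = [0.0, -1.0, -2.0, -3.0, -4.0, -5.0]
first = biallelic_gl_subset(gl, 1)   # entries for 0/0, 0/1, 1/1
second = biallelic_gl_subset(gl, 2)  # entries for 0/0, 0/2, 2/2
```

Neither subset contains the 1/2 likelihood (index 4), which is the information a split representation has no obvious place for.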

@fnothaft and others, can you remind us here of what the history and motivation is for us to require splitting all multi-allelic sites?

IMO, there would be some advantages to using the current GA4GH model, even if represented in Avro: it would make it easier to round-trip a VCF file without data loss, provide a home for Number=G fields, and improve compatibility with GA4GH.

I'm considering making Avro schemas like GA4GHVariant and GA4GHCall to explore the idea more concretely.

Member

jpdna commented Nov 3, 2016

Here is some interesting documentation about how Hail handles multi-allele splitting
https://hail.is/commands.html#splitmulti

  • They state: "most analytic methods only support analyzing data represented as biallelics. Therefore, the current recommendation is to split multiallelics using the command splitmulti when importing a VCF"
    This would seem to parallel ADAM's current focus on splitting alleles.
  • To deal with INFO fields whose length depends on the number of alleles (R), they say:

"Hail does not split annotations in the info field. This means that if a multiallelic site with info.AC value [10, 2] is split, each split site will contain the same array [10, 2]. The provided allele index annotation va.aIndex can be used to select the value corresponding to the split allele's position"

  • They do have a "non split" mode when importing VCFs
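As a rough illustration of the behavior quoted above, here is a small sketch in plain Python. This is not Hail's API: the record layout and the names `split_multiallelic`, `a_index` and `ac_for_split` are made up for the example, and it assumes the allele index is 1-based over the alternate alleles, so per-alternate arrays such as AC are read at `a_index - 1`.

```python
def split_multiallelic(ref, alts, info):
    """Return one biallelic record per alternate allele.

    The INFO dict is copied as-is (arrays are NOT subset), and each record
    carries the index of its alternate allele in the original site.
    """
    return [
        {"ref": ref, "alt": alt, "info": dict(info), "a_index": i}
        for i, alt in enumerate(alts, start=1)
    ]


def ac_for_split(record):
    """Select the AC value belonging to this record's alternate allele."""
    return record["info"]["AC"][record["a_index"] - 1]


records = split_multiallelic("A", ["C", "T"], {"AC": [10, 2]})
per_allele_ac = [ac_for_split(r) for r in records]
```

Every split record still holds the full `[10, 2]` array, and the allele index is what lets a consumer recover the per-allele value.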

Member

fnothaft commented Nov 3, 2016

> @fnothaft and others, can you remind us here of what the history and motivation is for us to require splitting all multi-allelic sites?

> they state: "most analytic methods only support analyzing data represented as biallelics. Therefore, the current recommendation is to split multiallelics using the command splitmulti when importing a VCF"

The TL;DR is that this was the exact motivation.

jpdna closed this Nov 3, 2016

Member

fnothaft commented Nov 3, 2016

There are a variety of other representational issues too. For example, what do you do at sites where a long INDEL/linear SV (e.g., loss of heterozygosity for a whole gene) covers all of the point variants on a haplotype?

Member

jpdna commented Nov 3, 2016

Yeah - I guess my main interest was in being able to round-trip a VCF into Parquet/HBase as a "lossless" archive. However, this may not be a particularly urgent need: if the gzipped VCFs are on HDFS, it is pretty easy to go back and scan them quickly if the need arises to recover dropped tags or the exact multi-allelic representation. Thanks for the comments - I'm going to keep this closed so it doesn't clutter.
