Variant data model should be conceptual, not defined in terms of VCF file format. #379

diekhans · 2015-08-14T14:41:50Z

The variant data model is described in terms of VCF rather than as a clear conceptual model that is then related to VCF.

For instance, the statement 'The variant set is equivalent to a VCF file.' leads to the interpretation that a VCF split by chromosome should be multiple variant sets.
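
A minimal Python sketch of the two readings, for illustration only. The `VariantSet` class, both helper functions, and the file names are hypothetical and are not taken from the GA4GH schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VariantSet:
    """Hypothetical stand-in for a variant set; not the GA4GH record."""
    id: str
    source_files: List[str] = field(default_factory=list)


def file_centric(files: List[str]) -> List[VariantSet]:
    # Reading "a variant set is equivalent to a VCF file" literally:
    # every file becomes its own variant set.
    return [VariantSet(id=name, source_files=[name]) for name in files]


def conceptual(study_id: str, files: List[str]) -> List[VariantSet]:
    # Conceptual reading: one logical collection of variants, however
    # the data happens to be split across files.
    return [VariantSet(id=study_id, source_files=list(files))]


# Hypothetical per-chromosome files from a single study.
files = ["study.chr1.vcf", "study.chr2.vcf", "study.chrX.vcf"]
per_file_sets = file_centric(files)
per_study_sets = conceptual("study", files)
```

Under the file-centric reading the same study yields three variant sets; under the conceptual reading it yields one.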

Transliterating the VCF format into JSON, as opposed to the VCF conceptual model, has led to a more complex and confusing API.

jacmarjorie · 2015-08-14T16:52:57Z

+1

Another confusing aspect that has come out of this is that a GAVariant has a String variantSetId, which the documentation describes as the "ID of the variant set that this variant belongs to". If we think of the GAVariantsets, GAVariants, etc. in terms of the abstract matrix representation, this implies that if two callsets had exactly the same variant but originated from different variantSets, the searchVariants function would return that variant as duplicates, each with a different set of calls. If we want a variant object to represent a unique variant, this becomes troublesome and leads to the development of database workaround functions, much like the mergeVariants function (https://cloud.google.com/genomics/v1beta2/reference/variantsets/mergeVariants).

To us, the String variantSetId on a GAVariant seemed like a consequence of designing the schema based on a file hierarchy. Regardless, it is then left to the database to track which variants are identical and merge their calls into one record when returning a response, which diminishes some of the advantages of a columnar store.
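
To make the duplication concrete, here is a minimal Python sketch. The `Variant` and `Call` classes, the `merge_by_site` helper, and all IDs are hypothetical stand-ins, not the GA4GH schema or any server's implementation. It assumes two records describe the same variant when reference name, start, reference bases, and alternate bases all match.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Call:
    call_set_id: str
    genotype: List[int]


@dataclass
class Variant:
    variant_set_id: str
    reference_name: str
    start: int
    reference_bases: str
    alternate_bases: Tuple[str, ...]
    calls: List[Call] = field(default_factory=list)


def merge_by_site(variants):
    """Collect calls from all records that share the same site and alleles."""
    merged = defaultdict(list)
    for v in variants:
        # Assumed identity key; variant_set_id is deliberately left out.
        key = (v.reference_name, v.start, v.reference_bases, v.alternate_bases)
        merged[key].extend(v.calls)
    return dict(merged)


# The same A>G change, stored once per variant set it was loaded from.
records = [
    Variant("vs-A", "1", 1000, "A", ("G",), [Call("sample-1", [0, 1])]),
    Variant("vs-B", "1", 1000, "A", ("G",), [Call("sample-2", [1, 1])]),
]
merged = merge_by_site(records)
```

Here `records` holds two entries for one biological variant, while `merged` holds a single key whose value carries both calls. That grouping is the work left to the database when the identity of a variant includes its variant set.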

This could potentially be its own issue, but it could also be fixed by a new conceptual design, so I'd suggest taking it into consideration when designing a more conceptual model.

dglazer · 2015-08-14T23:42:27Z

@diekhans , are you suggesting in this issue that we improve the documentation in https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variants.avdl, that we change the schema, or both? (Personally I agree the doc can use some work; I don't know of reasons to change the schema, but I'm open to suggestions.) Re doc, we might be able to borrow words from the Variants section of http://ga4gh.org/#/documentation, which is much less file-centric.

@jacmarjorie , I think you're suggesting that the API should be able to support a world where all calls from all samples are retrievable by a single searchVariants request -- is that right? If so, I think today's API already supports that, by letting you load all of your data into a single variantset. But it also lets you have different populations / studies, that are separately searched, if you want. Or am I misunderstanding?

diekhans · 2015-08-15T15:40:03Z

Hi @dglazer,

This is a documentation ticket. It has the Documentation label set, so the bioinformatician who is updating the documentation will work on it when we get more bandwidth. She found the variant API perplexing.

We really need to take a careful look at the variant API. It has baggage from a difficult, compromise file format that will be a burden going forward. In particular, the fact that a variant can contain thousands of supporting calls is not a forward-looking data model. It will not compose well or work with caching, and it is mismatched to query languages.
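
As a rough illustration of the nesting concern, the Python sketch below contrasts a variant that embeds its calls with a flat, one-row-per-call layout. The field names and the `flatten_calls` helper are hypothetical; this is one possible alternative, not a proposal taken from the schema.

```python
from typing import Dict, Iterator

# Nested shape: every call travels inside the variant record, so a
# variant with thousands of samples is one very large object.
nested_variant = {
    "id": "v1",
    "reference_name": "1",
    "start": 1000,
    "calls": [
        {"call_set_id": f"sample-{i}", "genotype": [0, i % 2]}
        for i in range(3)
    ],
}


def flatten_calls(variant: Dict) -> Iterator[Dict]:
    """Yield one row per call, linked to the variant by its id."""
    for call in variant["calls"]:
        yield {
            "variant_id": variant["id"],
            "call_set_id": call["call_set_id"],
            "genotype": call["genotype"],
        }


rows = list(flatten_calls(nested_variant))

# A flat layout can be filtered row by row, the way a query language
# would, without loading every call of the variant first.
non_ref = [r["call_set_id"] for r in rows if any(g != 0 for g in r["genotype"])]
```

With calls as separate rows, the variant record stays small enough to cache, and per-sample questions become ordinary filters over rows.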

A good rethinking when adding structural variation would be well advised, as we are going to have to live with this for a long time.

dglazer · 2015-08-15T15:51:47Z

Thanks @diekhans -- I missed the label. Happy to look at a doc pull request when ready; feel free to borrow from the ga4gh.org site, and/or from Google's documentation.

(And re rethinking the API itself -- sounds like that's a topic for another thread.)

diekhans added the Documentation label Aug 14, 2015

diekhans changed the title ~~Variant data model should be conceptual, not define in therm of VCF files.~~ Variant data model should be conceptual, not define in terms of VCF file format. Aug 14, 2015

dglazer mentioned this issue Aug 14, 2015

compliance_redux: questions/suggestions for test data ga4gh/compliance#34

Closed

sguthrie mentioned this issue Aug 15, 2015

Made a CallSet belong to one VariantSet. #376

Closed

diekhans mentioned this issue Aug 17, 2015

Formal definition of the data model needed #380

Open

diekhans added this to the comprehensive doc milestone Sep 14, 2015

This was referenced Mar 22, 2016

Review and update variants documentation #408

Open

Can a callset be in multiple variant sets? #583

Open

diekhans mentioned this issue Apr 5, 2016

VariantSetMetadata specification is unclear #598

Open
