This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Variant data model should be conceptual, not define in terms of VCF file format. #379

Open
diekhans opened this issue Aug 14, 2015 · 4 comments

Comments

@diekhans
Contributor

The variant data model is described in terms of VCF, rather than as a clear conceptual model that is then related to VCF.

For instance, the statement 'The variant set is equivalent to a VCF file.' leads to the interpretation that a VCF split by chromosome should be multiple variant sets.

Transliterating the VCF format into JSON, as opposed to the VCF conceptual model, has led to a more complex and confusing API.

@diekhans changed the title from "Variant data model should be conceptual, not define in therm of VCF files." to "Variant data model should be conceptual, not define in terms of VCF file format." on Aug 14, 2015
@jacmarjorie
Member

+1

Another confusing aspect that has come out of this is that a GAVariant has a String variantSetId, which the documentation describes as the "ID of the variant set that this variant belongs to". If we think of GAVariantSets, GAVariants, etc. in terms of the abstract matrix representation, this implies that if two call sets had the exact same variant but originated from different variant sets, that variant would exist as a duplicate, with a different set of calls returned for each by searchVariants. If we want a variant object to represent a unique variant, this becomes troublesome and leads to the development of database workaround functions, much like mergeVariants (https://cloud.google.com/genomics/v1beta2/reference/variantsets/mergeVariants).

To us, the String variantSetId on a GAVariant seemed like a consequence of designing the schema based on a file hierarchy. Regardless, it is then left to the database to track which variants are identical and merge their calls into one when returning a response, which undermines some of the advantages of a columnar store.

This could potentially be its own issue, but it could also be fixed with a new conceptual design; I'd suggest taking it into consideration when designing a more conceptual model.
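
To make the duplication concrete, here is a minimal sketch in Python. The `Variant` and `Call` classes, the `site_key` method, and the `merge_variants` helper are hypothetical simplifications for illustration only; they are not the GA4GH schema and not the Google mergeVariants implementation. It assumes two records are "the same variant" when reference name, start, reference bases, and alternate bases all match.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Call:
    call_set_id: str
    genotype: tuple


@dataclass
class Variant:
    # Hypothetical, simplified stand-in for a GAVariant.
    variant_set_id: str
    reference_name: str
    start: int
    reference_bases: str
    alternate_bases: tuple
    calls: list = field(default_factory=list)

    def site_key(self):
        # Identity of the variant itself, independent of the set it came from.
        return (self.reference_name, self.start,
                self.reference_bases, self.alternate_bases)


def merge_variants(variants):
    """Group records that describe the same variant and pool their calls."""
    merged = defaultdict(list)
    for variant in variants:
        merged[variant.site_key()].extend(variant.calls)
    return dict(merged)


# The same variant loaded from two variant sets appears as two records.
records = [
    Variant("setA", "1", 1000, "A", ("G",), [Call("sample1", (0, 1))]),
    Variant("setB", "1", 1000, "A", ("G",), [Call("sample2", (1, 1))]),
]

# The server (or the client) has to collapse them after the fact.
merged = merge_variants(records)
```

In this sketch the merge step exists only because variantSetId is part of each record; if a variant were identified by its site alone, the two calls would already share one row.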

@dglazer
Member

dglazer commented Aug 14, 2015

@diekhans , are you suggesting in this issue that we improve the documentation in https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variants.avdl, that we change the schema, or both? (Personally I agree the doc can use some work; I don't know of reasons to change the schema, but I'm open to suggestions.) Re doc, we might be able to borrow words from the Variants section of http://ga4gh.org/#/documentation, which is much less file-centric.

@jacmarjorie , I think you're suggesting that the API should be able to support a world where all calls from all samples are retrievable by a single searchVariants request -- is that right? If so, I think today's API already supports that, by letting you load all of your data into a single variantset. But it also lets you have different populations / studies, that are separately searched, if you want. Or am I misunderstanding?

@diekhans
Contributor Author

Hi @dglazer,

This is a documentation ticket. It has the documentation label set,
so the bioinformatician who is updating the documentation will work
on it when we get more bandwidth. She found the variant API
perplexing.

We really need to take a careful look at the variant API. It
has baggage from a difficult, compromise file format that will
be a burden going forward. In particular, the fact that a
variant can contain thousands of supporting calls is not a
forward-looking data model. This will not compose well or work
with caching, and it is mismatched to query languages.

A good rethinking when adding structural variation would be well
advised, as we are going to have to live with this for a long
time.
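
As a rough illustration of the nesting concern, the sketch below contrasts a variant that embeds its calls with a flat, one-row-per-call layout. The dictionary keys are a simplified assumption loosely modeled on the variant and call records, and `flatten_calls` is a hypothetical helper, not part of any existing API.

```python
def flatten_calls(variant):
    """Yield one (variant_id, call_set_id, genotype) row per embedded call."""
    for call in variant["calls"]:
        yield (variant["id"], call["callSetId"], tuple(call["genotype"]))


# Nested form: every call travels with the variant, so fetching one
# sample's genotype means fetching the whole record.
nested = {
    "id": "v1",
    "referenceName": "1",
    "start": 1000,
    "calls": [
        {"callSetId": f"sample{i}", "genotype": [0, 1]} for i in range(3)
    ],
}

# Flat form: each call is addressable on its own by (variant, call set),
# which maps more directly onto a relational or columnar query.
rows = list(flatten_calls(nested))
by_key = {(variant_id, call_set_id): genotype
          for variant_id, call_set_id, genotype in rows}
```

With three calls the difference is trivial; the concern above is about the case where `calls` holds thousands of entries per variant.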


@dglazer
Member

dglazer commented Aug 15, 2015

Thanks @diekhans -- I missed the label. Happy to look at a doc pull request when ready; feel free to borrow from the ga4gh.org site, and/or from Google's documentation.

(And re rethinking the API itself -- sounds like that's a topic for another thread.)
