Variant data model should be conceptual, not defined in terms of VCF file format. #379
Comments
+1. Another confusing aspect that has come out of this is that a GAVariant has a String variantSetId, which the documentation describes as the "ID of the variant set that this variant belongs to". If we think of GAVariantSets, GAVariants, etc. in terms of the abstract matrix representation, this implies that if two call sets contained exactly the same variant but originated from different variant sets, searchVariants would return that variant twice, each copy with a different set of calls. If we want a variant object to represent a unique variant, this becomes troublesome and leads to database workaround functions, much like the mergeVariants function (https://cloud.google.com/genomics/v1beta2/reference/variantsets/mergeVariants). To us, the String variantSetId on a GAVariant seemed like a consequence of designing the schema around a file hierarchy. As it stands, it is left to the database to work out which variants are identical and to merge their calls into one when returning a response, which undercuts some of the advantages of a columnar store. This could potentially be its own issue, but it could be fixed with a new conceptual design; I'd suggest taking it into consideration when designing a more conceptual model.
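To make the duplication concrete, here is a minimal sketch of the kind of merge a backend ends up doing. The `Variant` class is a simplified, hypothetical stand-in for GAVariant (it is not the real schema), and the choice of identity key (reference name, start, reference bases, alternate bases) is an assumption made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Variant:
    # Simplified, hypothetical stand-in for GAVariant; not the real schema.
    variant_set_id: str
    reference_name: str
    start: int
    reference_bases: str
    alternate_bases: tuple
    calls: list = field(default_factory=list)


def merge_across_variant_sets(variants):
    """Group records that describe the same variant and pool their calls."""
    merged = defaultdict(list)
    for v in variants:
        # Assumed identity key: position plus alleles, ignoring variant_set_id.
        key = (v.reference_name, v.start, v.reference_bases, v.alternate_bases)
        merged[key].extend(v.calls)
    return dict(merged)


records = [
    Variant("setA", "1", 1000, "A", ("T",), ["callset1"]),
    Variant("setB", "1", 1000, "A", ("T",), ["callset2"]),  # same variant, other set
    Variant("setB", "1", 2000, "G", ("C",), ["callset2"]),
]
merged = merge_across_variant_sets(records)
```

With the variant set ID as part of a variant's identity, the first two records stay separate; keyed on position and alleles alone, they collapse into one variant carrying both calls.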
@diekhans , are you suggesting in this issue that we improve the documentation in https://github.com/ga4gh/schemas/blob/master/src/main/resources/avro/variants.avdl, that we change the schema, or both? (Personally I agree the doc can use some work; I don't know of reasons to change the schema, but I'm open to suggestions.) Re doc, we might be able to borrow words from the Variants section of http://ga4gh.org/#/documentation, which is much less file-centric. @jacmarjorie , I think you're suggesting that the API should be able to support a world where all calls from all samples are retrievable by a single |
Hi @dglazer, This is a documentation ticket; it has the documentation label set. We really need to take a careful look at the variant API. It [...] A good rethinking when adding structural variation would be well [...]
Thanks @diekhans -- I missed the label. Happy to look at a doc pull request when ready; feel free to borrow from the ga4gh.org site, and/or from Google's documentation. (And re rethinking the API itself -- sounds like that's a topic for another thread.)
The variant data model is described in terms of VCF, as opposed to a clear conceptual model that is then related to VCF.
For instance, the statement 'The variant set is equivalent to a VCF file.' leads to the interpretation that a VCF split by chromosome should be multiple variant sets.
Transliterating the VCF format into JSON, as opposed to the VCF conceptual model, has led to a more complex and confusing API.
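As a rough illustration of the distinction, the sketch below treats a variant set as a conceptual collection and records files only as provenance. The `VariantSet` class, its fields, and the file names are all hypothetical and are not taken from the GA4GH schema.

```python
from dataclasses import dataclass, field


@dataclass
class VariantSet:
    # Hypothetical conceptual container: defined by its contents, not by a file.
    id: str
    variants: list = field(default_factory=list)
    source_files: list = field(default_factory=list)  # provenance only

    def add_file(self, path, variants):
        """Load variants from one file into this set without creating a new set."""
        self.source_files.append(path)
        self.variants.extend(variants)


# One study whose VCF happens to be split by chromosome (made-up file names).
study = VariantSet(id="study1")
study.add_file("study1.chr1.vcf", [("1", 1000, "A", "T")])
study.add_file("study1.chr2.vcf", [("2", 500, "G", "C")])
```

Under this reading, a VCF split by chromosome is still a single variant set; how the data is laid out on disk does not change the model.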