Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Variant annotation support #302

Merged
merged 27 commits into from
Jul 1, 2015
Merged

Variant annotation support #302

merged 27 commits into from
Jul 1, 2015

Conversation

sarahhunt
Copy link
Contributor

These three new protocols form an initial proposal for variant annotation support.

VariantAnnotation and AlleleAnnotation records hold different types of annotation derived by comparing a Variant or Allele to a set of reference data. AnnotationSets group VariantAnnotation/ AlleleAnnotation records and hold full details of all software and reference data sets used.

The effect of alternate alleles on transcript sets is the first type of annotation to be considered in detail.

Two methods protocols are proposed. alleleAnnotationmethods.avdl supports the mining of pre-calculated annotation data and annotateAllelemethods.avdl supports annotation as a service.

@reece
Copy link
Member

reece commented May 9, 2015

Verbatim copy of GA4GH_prototype_annotations2.docx, sent by Sarah Hunt to ga4gh-dwg-annotation@googlegroups.com on May 7 is available as a gist at https://gist.github.com/reece/0604e25e78f5c63e1a5f .

@pgrosu
Copy link
Contributor

pgrosu commented May 9, 2015

Thank you for sharing @reece - it definitely helps having a concrete example. I'm not on that email list, since it would be helpful to see how we got to this design. Is it free to join?

Thanks,
Paul

@reece
Copy link
Member

reece commented May 11, 2015

@pgrosu I think you need to contact @skeenan to be added to the annotation list.

@pgrosu
Copy link
Contributor

pgrosu commented May 11, 2015

Ah, thank you @reece. @skeenan any possibility? It would just help with being in sync.

Thank you,
Paul

@reece
Copy link
Member

reece commented May 12, 2015

@sarahhunt - Thanks for the proposal. That helped tremendously!

Below is a long list of comments specific to your proposal and, more importantly, some design goals that I believe we should consider.

Putting money where my mouth is, I created an example (https://gist.github.com/reece/880aa4283d404a2563cd) that illustrates some of these goals. (N.B. My example has it's own flaws, like not being able to store annotations from multiple versions of the same tool. This group should consider it a strawman alternative, not a proposed solution.)

Specific suggestions about the proposal:

  • Use consistent syntax. I'm especially thinking about camelCase v. underscore v. all-caps. Assuming we're standardizing on cameCase, "IMPACT" should be "impact", "variantannotations" should be "variantAnnotations", etc. I'm fine with notable exceptions like "HGVSg". (I don't care at all about which convention, just that there is exactly one.)
  • Drop the concept of annotationset in the annotation itself. This is better modeled as an annotation set containing annotations (rather than annotations referring to their container).

Goals:

  • Files structures containing variants and variant annotations should be easily mergable. The primary use case is to merge precomputed variant annotations with a patient's set of annotations.
  • Predictions should be typed. That is, annotations from a single source or of a particular type should follow a similar schema.
  • Use stable external refs wherever possible. For example, "effects" should use SO ids. (The human readable names are not guaranteed stable and have changed at least once to my knowledge.)
  • All locations must have a reference sequence specified by accession (not name, like "chr22"). Doing otherwise is as good as a address number without a street name. I think it's essential that this annotation spec make the sequence reference unambiguous.

@pcingola
Copy link
Contributor

Hi @reece, thank you for your comments, here are some answers (at least my opinion):

i) Consistent syntax: The examples you mentioned seem to in proper cameCase in Sarah's PR ('variantannotations' is lowercase only in POST URLs, and I did not find the "IMPACT" in all caps). May be I'm looking at the wrong place...?

ii) The concept of annotationSet in the annotation itself is analogous to the concept of 'readGroupId' in 'ReadAlignment' itself. This was added as a suggestion form the group (the initial version did not have it) and we all thought it was quite useful.

Goals:
i) Your comment about "variants and variant annotations should be easily mergable", in my opinion has been take care of. "Merging" variant with their respective annotations is a function of the annotation algorithm, which is included in Sarah's proposal (see AnnotateVariantsRequest). The API user should NEVER have to this manually because doing so is error prone and annotation algorithms take care of many corner cases that users are usually unaware of.

ii) I'm not sure what you meant with "Predictions should be typed", can you clarify?

iii) SO IDs: We could include them, but in my opinion this makes the record less useful ('missense_variant' is much more readable than 'SO:0001583'). In any case, I think that what you mention is a problem about versioning ontologies. A problem that we don't intend to solve here, furthermore, the structure of the ontology can also change, so using IDs doesn't solve it. We should leave this as it is and modify it later, when GA4GH solves ontologies versioning issues and then use an 'OntologyTerm'.

iv) Annotations are anchored to variants, which already have reference sequence ID, so asking for "chr22" makes a lot of sense because this is a context specific query (may be I misunderstood your point here). In any case, there should perhaps be an option for using sequence IDs instead of chromosome names, but I would suggest to add it after the PR has been accepted.

@helenp
Copy link

helenp commented May 12, 2015

Hi @pcingola iii) you could use the SO term, id and version. If you don't do this then ids need to be looked up and there is an annotation-ontology mapping problem and an ontology versioning problem to deal with. We will work on the ontology component to make some recommendations for discussion in Leiden so that this is consistent across different components.
All ontologies can change structure and there are diff tools that allow tracking of changes - moving between ontology versions is a complex problem and I don't think GA4GH can solve this though there are some tools available.

@sarahhunt
Copy link
Contributor Author

Hi @reece and @pcingola

A few additional comments -

The avro schemas follow the standard syntax: https://github.com/ga4gh/schemas/blob/master/CONTRIBUTING.md#syntax_style, apart from the HGVS attributes, which maybe should really be all be lower case too. @reece - I suspect you are looking at the output from the prototype implementation that I mailed round the group rather than the schemas. As mentioned, it is a work in progress and you have spotted a couple of compliance errors! Maybe it was too early to share this output.

The annotationSet id is in the annotation record to support the extraction of a subset of annotation - maybe over a specific gene - and the annotationSet record which contains information on software and reference data versions used. An annotationSet could be a database of 'all current variants vs all current transcripts', so if the set referenced the annotations it could get quite unwieldy.

Thanks for the recommendation @helenp. The effects are currently represented by an array of OntologyTerm's. There is no version in the OntologyTerm at the moment, ( just source, id, name) but it would make sense to add it.

@pcingola
Copy link
Contributor

+1

Only return variant annotations for any of these features
If null, return all variant annotations in specified window.
*/
union { null, array<org.ga4gh.models.Feature> } features;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does the matching work? Is this matching feature IDs of TranscriptEffects or is this doing a live JOIN against the regions annotated by those features?

If it's the former, this should be named more specifically and should refer to the feature_id. If it's the latter, I'm not sure what this buys you beyond the above start/end parameters.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the former. Do the new comments at the top of the file make this clear?

@sarahhunt
Copy link
Contributor Author

Thanks for the input @calbach!

@diekhans - I've added to the documentation, but if you, or anyone else, find anything else too vague do let me know.

@sarahhunt sarahhunt closed this May 29, 2015
@sarahhunt sarahhunt reopened this May 29, 2015

/** The ID of the annotation set this record belongs to. */
string annotationSetId;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Please remove these double newlines throughout, unless this style convention is used elsewhere(?).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - that's tidied up now.

@calbach
Copy link
Contributor

calbach commented Jun 26, 2015

I've added some more comments inline. In general things look sane, but I haven't dug too much into the details of the contents of the actual annotations. I would like it if someone from a more bio-heavy background could do a thorough review of those fields.

@jacmarjorie
Copy link
Member

+1

@AngieHinrichs
Copy link

+1 Thanks for laying the foundation!

@rajgopals
Copy link

+1

@skeenan
Copy link
Member

skeenan commented Jul 1, 2015

Merging to master.

skeenan added a commit that referenced this pull request Jul 1, 2015
@skeenan skeenan merged commit 2d1afa8 into master Jul 1, 2015
@stevenbrenner
Copy link

+1

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet