G2P '/genotypephenotype/search' endpoint #607

bwalsh · 2015-09-04T01:20:44Z

Summary

We extended the GA4GH Reference server to include a the '/genotypephenotype/search' endpoint. This document describes the work to date

Approach

We based our work on the model captured in ga4gh/schemas commit of Jul 30, 2015. This version of the schema predates the separated genotype to phenotype files from baseline.

The code was based on a branch setup for this purpose by the server team.
No major refactoring of the server was needed, additional code was added to ga4gh/backend.py,ga4gh/frontend.py and test/unit/test_views.py. A new model was created in datamodel/genotype_phenotype.py

Data

The cancer genome database Clinical Genomics Knowledge Base published by the Monarch project was the source of Evidence.

API

The GA4GH schemas define a single endpoint /genotypephenotype/search which accepts a POST of a request body containing one or more of Feature, PhenotypeInstance, EnvironmentalContext, and Evidence which are combined as a logical AND to query the underlying datastore. Missing types are treated as a wildcard returning all data. Responses of matching data are returned as a list of FeaturePhenotypeAssociation. All types rely heavily on OntologyTerm

Request

http://yuml.me/edit/bf06b90a

Response

http://yuml.me/edit/25343da1

Implementation

http://yuml.me/c97fada2

Issues

Query by example

There are four datatypes types for each entity [string, external identifier, ontology identifier and 'entity'].
Currently the implementation handles queries of [string, external identifier and ontology identifier].

The 'entity' query is a type of query-by-example has been deferred. Challenges that arose:

schema constraints: there are several fields within the schemas that are defined as non-null. This may be fine when creating an entity from a data store, however, they are problematic when creating an entity to be used in a query.
additional discussions needed to determine what properties from an existing entity will be used for the query and which will be ignored. For example a Feature has [id,parentIds, featureSetId, referenceName, start,end, strand, featureType, attributes] we need to specify exactly what the query's expectations are.

Recommendation: Leave the schema definitions as-is. However, leave the entity query-by-example unimplemented. Implement when demand exists with sufficient use case details.

Name collision (SearchFeaturesResponse)

That schema contains two definitions of the class [SearchFeaturesRequest,SearchFeaturesResponse]. These conflict in the generated code with other classes of the same name.

The schema project the current server is based on is version = '0.6.be171b00'
Snippets from this commit follow

One in the file genotypephenotypemethods.avdl, protocol GenotypePhenotypeMethods

/** This is the response from `POST /genotypephenotype/search` expressed as JSON. */
record SearchFeaturesResponse {
  /**
  The list of matching FeaturePhenotypeAssociation.
  */
  array<org.ga4gh.models.FeaturePhenotypeAssociation> associations = [];

  ...

The second one is found in sequenceAnnotationmethods.avdl

  /** This is the response from `POST /features/search` expressed as JSON. */
  record SearchFeaturesResponse {
    /**
    The list of matching annotations, sorted by start position. Annotations which
    share a start position are returned in a deterministic order.
    */
    array<org.ga4gh.models.Feature> features = [];

    ...

The generated code only has the class associated with sequenceAnnotationmethods.avdl

Both sequenceAnnotationmethods.avdl and genotypephenotypemethods.avdl share the same namespace @namespace("org.ga4gh.methods") each file defines an enclosing protocol.

In the names section of the spec

A name only is specified, i.e., a name that contains no dots. In this case the namespace is taken from the most tightly enclosing schema or protocol. For example, if "name": "X" is specified, and this occurs within a field of the record definition of org.foo.Y, then the fullname is org.foo.X. If there is no enclosing namespace then the null namespace is used.

The schemas pass validation.

A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)

Recommendation: Rename the GenotypePhenotypeMethods [SearchFeaturesRequest,SearchFeaturesResponse] to [SearchGenotypePhenotypeRequest,SearchGenotypePhenotypeResponse]

jeromekelleher · 2015-09-04T16:50:17Z

This is superb @bwalsh --- great overview, and the code looks great too. I'll get back with some specific comments next week: there's a lot to take in here!

adamnovak · 2015-09-09T16:41:02Z

As Benedict just said on the call, can you please add corresponding PRs against ga4gh/schemas and ga4gh/compliance?

The compliance PR may have to be based on the compliance_redux branch.

Once that's done, we can merge all 3 PRs at once, with discussion probably happening mostly here.

adamnovak · 2015-09-09T16:46:24Z

tests/unit/test_views.py

@@ -27,7 +27,7 @@ def setUpClass(cls):
            "SIMULATED_BACKEND_NUM_CALLS": 1,
            "SIMULATED_BACKEND_VARIANT_DENSITY": 1.0,
            "SIMULATED_BACKEND_NUM_VARIANT_SETS": 1,
-            # "DEBUG" : True
+            "DEBUG": True


Do you want to leave this in?

Good catch. I'll remove it

adamnovak · 2015-09-09T16:52:11Z

So on the server side it looks like this backends into RDF queried with sparql.

How is the RDF data indexed? Will queries be efficient from disk, or will the server have to keep the whole g2p graph in memory?

Also, does this PR include the recommended changes to resolve the name collision? (That would probably actually belong in the corresponding PR against the schemas repo).

Where are the .avdl files that correspond to this server?

bwalsh · 2015-09-09T17:51:27Z

Thanks for the detailed feedback. I'll be attending the meeting & Hackathon in NYC in October. My goal is to support this PR and hopefully be ready for new areas (metadata).

Also, does this PR include the recommended changes to resolve the name collision? (That would probably actually belong in the corresponding PR against the schemas repo).

It does not. As suggested, I'll create a PR against the schema.

Where are the .avdl files that correspond to this server?

The version of the schema these changes were based on: ga4gh/ga4gh-schemas@846b711

As Benedict just said on the call, can you please add corresponding PRs against ga4gh/schemas and ga4gh/compliance? The compliance PR may have to be based on the compliance_redux branch.

Did not realize there was a call. Will do. This may get tricky as I saw the following in the ga4gh/compliance docs. Is there a contact over on that project available for questions?

This version of the Schema is not tracking changes to the real v0.5.1 Schema!

Nor, of course, does it track to your own version of the schema if you were changing that. So you may need to convert the schemas module to be a git submodule which is tracking the version/branch of Schema you care about. If you do this (or otherwise edit the schema) you'll need to run the assertions generator so your test assertThat() functions match your new schema classes

Will queries be efficient from disk, or will the server have to keep the whole g2p graph in memory?

Regarding the runtime performance, my goal was to create an initial implementation; something self contained and easy to implement and understand ( no extra servers and minimum dependencies).

As-is, the server uses memory, there are other stores that ship with RDFLib. My preference is to roadmap adjustments or optimizations after this (set of) PR

hjellinek · 2015-09-09T17:58:50Z

Hi @bwalsh - The comment

This version of the Schema is not tracking changes to the real v0.5.1 Schema!

is obsolete. I will fix that document.

It is correct to say that the compliance tool kit does not automatically sync its schemas with an outside repo or authority. If you're testing against modified schemas, you need to copy those modifications into the compliance test kit's ctk-schemas module, add appropriate endpoint URLs, etc.

jeromekelleher · 2015-09-10T09:00:37Z

ga4gh/backend.py

@@ -10,11 +10,13 @@
 import json
 import random

+


Is this extra blank line intentional? We don't have this in any of the other files, so I would remove it in the interest of consistency (unless there's a good argument for it).

jeromekelleher · 2015-10-08T09:07:32Z

ga4gh/datamodel/genotype_phenotype.py

+"""
+Module responsible for translating g2p data into GA4GH native
+objects.
+"""


We're missing the standard future imports here. Copy these from any existing file.

jeromekelleher · 2015-10-08T09:46:00Z

Looks great @bwalsh, thank for the updates. I've made some more style-type comments above. Some high-level stuff that I think needs to be addressed before we merge:

We have to make the g2p dataset parameterisable. We should be able to find it in the data tree just like referenceSets and dataSets.
We need to break the tests for g2p out into their own module. Do we really need the cgd.ttl file to be included in our test data? Isn't this an example file that should be provided as example data, and not shipped around with our source code? In general, users are going to install the code from pypi, so we can't make any assumptions at all about file system paths. So, I suggest deleting this file, unless it's absolutely necessary we keep it for testing purposes.
I'm pretty sure we can simplify the paging stuff. Let's talk about it at the hackathon.

jeromekelleher · 2015-10-08T09:46:41Z

ga4gh/backend.py

+            ])
+
+
+class GenotypePhenotypeBackend(AbstractBackend):


This class doesn't really do anything, just remove it.

jeromekelleher · 2015-10-08T09:50:27Z

Also, it would be good to rebase this against the latest develop so we get the new data input format code.

bwalsh · 2015-10-09T19:29:27Z

Guys, I was working through these latest comments. All was fine until I tried to rebase again.
I wont get into the minutia, but to check my sanity, I cloned https://github.com/ga4gh/server and ran tests on that.

======================================================================
ERROR: test suite for <class 'tests.end_to_end.test_oidc.TestOidc'>
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/nose/suite.py", line 209, in run
    self.setUp()
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/nose/suite.py", line 292, in setUp
    self.setupContext(ancestor)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/nose/suite.py", line 315, in setupContext
    try_run(context, names)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/nose/util.py", line 471, in try_run
    return func()
  File "/home/ga4gh/testupstream/server/tests/end_to_end/server_test.py", line 70, in setUpClass
    cls.otherSetup()
  File "/home/ga4gh/testupstream/server/tests/end_to_end/test_oidc.py", line 58, in otherSetup
    cls.opServer.start()
  File "/home/ga4gh/testupstream/server/tests/end_to_end/server.py", line 243, in start
    super(OidcOpServerForTesting, self).start()
  File "/home/ga4gh/testupstream/server/tests/end_to_end/server.py", line 59, in start
    assert not self.isRunning(), "Another server is running"
AssertionError: Another server is running

======================================================================
ERROR: testSamConversion (tests.unit.test_converters.TestSamConverter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ga4gh/testupstream/server/tests/unit/test_converters.py", line 69, in testSamConversion
    self.verifyFullConversion(readGroupSet, readGroup, reference)
  File "/home/ga4gh/testupstream/server/tests/unit/test_converters.py", line 43, in verifyFullConversion
    converter.convert()
  File "/home/ga4gh/testupstream/server/ga4gh/converters.py", line 71, in convert
    alignedSegment = SamLine.toAlignedSegment(read, targetIds)
  File "/home/ga4gh/testupstream/server/ga4gh/converters.py", line 129, in toAlignedSegment
    ret.flag = cls.toSamFlag(read)
  File "/home/ga4gh/testupstream/server/ga4gh/converters.py", line 163, in toSamFlag
    if read.nextMatePosition.strand == protocol.Strand.NEG_STRAND:
AttributeError: 'NoneType' object has no attribute 'strand'

jeromekelleher · 2015-10-09T19:48:53Z

The first problem is due to a server running in a background process left over from some old tests. If you kill this it should go away.

The second problem is doesn't happen for me when I run the tests. Do you have some other read data in the tests/data directory? What's happening is you've provoked issue #705, but this shouldn't happen with the way that particular test is set up at the moment. Or, perhaps tests are run slightly differently on a whatever platform you're on... You're using Fedora, right? It'd be surprising if there was a difference between this and Debian.

The input data directory must now be in the following format: dataDir/ datasets/ dataset1 /variants /reads dataset2 #etc referenceSets referenceSet1/ referenceSet2 Datasets and ReferenceSets are now treated symmetrically.

benedictpaten added the in progress label Sep 4, 2015

jeromekelleher removed the in progress label Sep 4, 2015

bwalsh force-pushed the g2p branch 2 times, most recently from 2cdf427 to 25e19f2 Compare September 4, 2015 19:42

adamnovak reviewed Sep 9, 2015
View reviewed changes

jeromekelleher reviewed Sep 10, 2015
View reviewed changes

jeromekelleher reviewed Oct 8, 2015
View reviewed changes

jeromekelleher and others added 4 commits October 9, 2015 22:04

Changed the input data format for consistency.

529999a

The input data directory must now be in the following format: dataDir/ datasets/ dataset1 /variants /reads dataset2 #etc referenceSets referenceSet1/ referenceSet2 Datasets and ReferenceSets are now treated symmetrically.

fix style changes part3

5ae0cea

fix style changes part3

98a721e

sparql perfomance improvements

6f74dc1

dcolligan closed this Oct 12, 2015

bwalsh mentioned this pull request Oct 23, 2015

G2P '/genotypephenotype/search' endpoint #770

Closed

This was referenced Jan 21, 2016

Create documentation for g2p schema ga4gh/g2p-team#14

Closed

merge after rebase from master - trigger travis build #881

Closed

bwalsh mentioned this pull request Feb 6, 2016

g2p reference feature branch #915

Merged

bwalsh mentioned this pull request Feb 16, 2016

G2p ga4gh/compliance#169

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

G2P '/genotypephenotype/search' endpoint #607

G2P '/genotypephenotype/search' endpoint #607

bwalsh commented Sep 4, 2015

jeromekelleher commented Sep 4, 2015

adamnovak commented Sep 9, 2015

adamnovak Sep 9, 2015

bwalsh Sep 9, 2015

adamnovak commented Sep 9, 2015

bwalsh commented Sep 9, 2015

hjellinek commented Sep 9, 2015

jeromekelleher Sep 10, 2015

jeromekelleher Oct 8, 2015

jeromekelleher commented Oct 8, 2015

jeromekelleher Oct 8, 2015

jeromekelleher commented Oct 8, 2015

bwalsh commented Oct 9, 2015

jeromekelleher commented Oct 9, 2015

G2P '/genotypephenotype/search' endpoint #607

G2P '/genotypephenotype/search' endpoint #607

Conversation

bwalsh commented Sep 4, 2015

Summary

Approach

Data

API

Request

Response

Implementation

Issues

Query by example

Name collision (SearchFeaturesResponse)

jeromekelleher commented Sep 4, 2015

adamnovak commented Sep 9, 2015

adamnovak Sep 9, 2015

Choose a reason for hiding this comment

bwalsh Sep 9, 2015

Choose a reason for hiding this comment

adamnovak commented Sep 9, 2015

bwalsh commented Sep 9, 2015

hjellinek commented Sep 9, 2015

jeromekelleher Sep 10, 2015

Choose a reason for hiding this comment

jeromekelleher Oct 8, 2015

Choose a reason for hiding this comment

jeromekelleher commented Oct 8, 2015

jeromekelleher Oct 8, 2015

Choose a reason for hiding this comment

jeromekelleher commented Oct 8, 2015

bwalsh commented Oct 9, 2015

jeromekelleher commented Oct 9, 2015