G2P '/genotypephenotype/search' endpoint #607
This is superb @bwalsh, great overview, and the code looks great too. I'll get back with some specific comments next week: there's a lot to take in here!
2cdf427 to 25e19f2
As Benedict just said on the call, can you please add corresponding PRs against ga4gh/schemas and ga4gh/compliance? The compliance PR may have to be based on the compliance_redux branch. Once that's done, we can merge all 3 PRs at once, with discussion probably happening mostly here.
@@ -27,7 +27,7 @@ def setUpClass(cls):
 "SIMULATED_BACKEND_NUM_CALLS": 1,
 "SIMULATED_BACKEND_VARIANT_DENSITY": 1.0,
 "SIMULATED_BACKEND_NUM_VARIANT_SETS": 1,
-# "DEBUG" : True
+"DEBUG": True
Do you want to leave this in?
Good catch. I'll remove it.
So on the server side it looks like this is backed by RDF queried with SPARQL. How is the RDF data indexed? Will queries be efficient from disk, or will the server have to keep the whole g2p graph in memory? Also, does this PR include the recommended changes to resolve the name collision? (That would probably actually belong in the corresponding PR against the schemas repo.) Where are the .avdl files that correspond to this server?
Thanks for the detailed feedback. I'll be attending the meeting & Hackathon in NYC in October. My goal is to support this PR and hopefully be ready for new areas (metadata).
It does not. As suggested, I'll create a PR against the schema.
The version of the schema these changes were based on: ga4gh/ga4gh-schemas@846b711
Did not realize there was a call. Will do. This may get tricky as I saw the following in the ga4gh/compliance docs. Is there a contact over on that project available for questions?
Regarding the runtime performance, my goal was to create an initial implementation: something self-contained and easy to implement and understand (no extra servers and minimal dependencies). As-is, the server keeps the graph in memory; other stores ship with RDFLib. My preference is to roadmap adjustments or optimizations after this (set of) PRs.
Hi @bwalsh - The comment
is obsolete. I will fix that document. It is correct to say that the compliance tool kit does not automatically sync its schemas with an outside repo or authority. If you're testing against modified schemas, you need to copy those modifications into the compliance test kit's
@@ -10,11 +10,13 @@
import json
import random

Is this extra blank line intentional? We don't have this in any of the other files, so I would remove it in the interest of consistency (unless there's a good argument for it).
"""
Module responsible for translating g2p data into GA4GH native
objects.
"""
We're missing the standard future imports here. Copy these from any existing file.
Looks great @bwalsh, thanks for the updates. I've made some more style-type comments above. Some high-level stuff that I think needs to be addressed before we merge:
])


class GenotypePhenotypeBackend(AbstractBackend):
This class doesn't really do anything, just remove it.
Also, it would be good to rebase this against the latest develop so we get the new data input format code.
Guys, I was working through these latest comments. All was fine until I tried to rebase again.
The first problem is due to a server running in a background process left over from some old tests. If you kill this it should go away. The second problem doesn't happen for me when I run the tests. Do you have some other read data in the
The input data directory must now be in the following format:
    dataDir/
        datasets/
            dataset1/
                variants/
                reads/
            dataset2/
                # etc
        referenceSets/
            referenceSet1/
            referenceSet2/
Datasets and ReferenceSets are now treated symmetrically.
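As a rough illustration of that layout, the sketch below builds the example tree in a temporary directory and lists both top-level collections the same way. The helper names (`make_layout`, `list_entries`) and the `LAYOUT` constant are made up for this example and are not part of the server code.

```python
import os
import tempfile

# Placeholder directory names copied from the layout description;
# this is not the server's own loading code.
LAYOUT = [
    ("datasets", "dataset1", "variants"),
    ("datasets", "dataset1", "reads"),
    ("datasets", "dataset2"),
    ("referenceSets", "referenceSet1"),
    ("referenceSets", "referenceSet2"),
]


def make_layout(root):
    """Create the example directory tree under root."""
    for parts in LAYOUT:
        os.makedirs(os.path.join(root, *parts))


def list_entries(root, kind):
    """List the entries under <root>/<kind>, sorted by name."""
    return sorted(os.listdir(os.path.join(root, kind)))


data_dir = tempfile.mkdtemp()
make_layout(data_dir)
# Datasets and reference sets are discovered with the same call.
print(list_entries(data_dir, "datasets"))
print(list_entries(data_dir, "referenceSets"))
```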
Summary
We extended the GA4GH Reference server to include the '/genotypephenotype/search' endpoint. This document describes the work to date.
Approach
We based our work on the model captured in the ga4gh/schemas commit of Jul 30, 2015. This version of the schema predates the separation of the genotype-to-phenotype files from baseline.
The code was based on a branch set up for this purpose by the server team.
No major refactoring of the server was needed; additional code was added to ga4gh/backend.py, ga4gh/frontend.py and test/unit/test_views.py. A new model was created in datamodel/genotype_phenotype.py.
Data
The cancer genome database Clinical Genomics Knowledge Base published by the Monarch project was the source of Evidence.
API
The GA4GH schemas define a single endpoint
/genotypephenotype/search
which accepts a POST of a request body containing one or more of Feature, PhenotypeInstance, EnvironmentalContext, and Evidence, which are combined as a logical AND to query the underlying datastore. Missing types are treated as a wildcard, returning all data. Matching data is returned as a list of FeaturePhenotypeAssociation. All types rely heavily on OntologyTermRequest.
http://yuml.me/edit/bf06b90a
Response
http://yuml.me/edit/25343da1
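The AND-with-wildcard semantics described for the request can be sketched as follows. This is a minimal illustration only: the flat dictionaries, the key names (`feature`, `phenotype`, `evidence`) and the placeholder values are assumptions made for the example, and the real schema types are considerably richer.

```python
# Made-up placeholder records; not real data and not the schema's types.
ASSOCIATIONS = [
    {"feature": "feature-A", "phenotype": "phenotype-X", "evidence": "evidence-1"},
    {"feature": "feature-A", "phenotype": "phenotype-Y", "evidence": "evidence-2"},
    {"feature": "feature-B", "phenotype": "phenotype-X", "evidence": "evidence-1"},
]


def search(associations, request):
    """Return the associations that match every term present in request.

    Terms that are absent (or None) act as wildcards, so an empty
    request returns everything.
    """
    terms = {key: value for key, value in request.items() if value is not None}
    return [
        association
        for association in associations
        if all(association.get(key) == value for key, value in terms.items())
    ]


# Two terms are combined as a logical AND.
print(search(ASSOCIATIONS, {"feature": "feature-A", "evidence": "evidence-1"}))
# No terms at all: every association matches.
print(len(search(ASSOCIATIONS, {})))
```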
Implementation
http://yuml.me/c97fada2
Issues
Query by example
There are four data types for each entity [string, external identifier, ontology identifier and 'entity'].
Currently the implementation handles queries of [string, external identifier and ontology identifier].
The 'entity' query, a type of query-by-example, has been deferred. Challenges that arose:
Recommendation: Leave the schema definitions as-is. However, leave the entity query-by-example unimplemented. Implement when demand exists with sufficient use case details.
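One way to picture the split between the three supported query kinds and the deferred 'entity' kind is a small dispatcher like the one below. The dictionary keys (`externalIdentifier`, `ontologyTerm`) and the function name are hypothetical, chosen for illustration; they are not taken from the schema or the server code.

```python
def classify_term(term):
    """Classify a query term by its shape (hypothetical shapes).

    Plain strings, external identifiers and ontology identifiers are
    recognised; anything else is treated as an 'entity' query, which
    is left unimplemented.
    """
    if isinstance(term, str):
        return "string"
    if isinstance(term, dict):
        if "externalIdentifier" in term:
            return "external identifier"
        if "ontologyTerm" in term:
            return "ontology identifier"
    raise NotImplementedError("entity query-by-example is deferred")


print(classify_term("some free text"))
print(classify_term({"externalIdentifier": "placeholder-id"}))
print(classify_term({"ontologyTerm": "placeholder-term"}))
```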
Name collision (SearchFeaturesResponse)
The schema contains two definitions of each of the classes [SearchFeaturesRequest, SearchFeaturesResponse]. These conflict in the generated code with other classes of the same name. The schema project the current server is based on is version = '0.6.be171b00'.
Snippets from this commit follow.
sequenceAnnotationmethods.avdl
Both sequenceAnnotationmethods.avdl and genotypephenotypemethods.avdl share the same namespace @namespace("org.ga4gh.methods"), and each file defines an enclosing protocol.
In the names section of the spec
The schemas pass validation.
Recommendation: Rename the GenotypePhenotypeMethods [SearchFeaturesRequest, SearchFeaturesResponse] to [SearchGenotypePhenotypeRequest, SearchGenotypePhenotypeResponse].
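To make the collision concrete, the sketch below keys record definitions by namespace plus record name, which is how Avro forms a full name when a record inherits the namespace of its enclosing protocol. The registry and helper functions are illustrative only and are not the actual code generator; the protocol name `SequenceAnnotationMethods` is an assumption, since only `GenotypePhenotypeMethods` is named above.

```python
NAMESPACE = "org.ga4gh.methods"


def full_name(namespace, name):
    """Build a full name from a namespace and a record name."""
    return namespace + "." + name


def register(registry, namespace, protocol, record_names):
    """Record which protocol defines each full name; reject duplicates."""
    for name in record_names:
        key = full_name(namespace, name)
        if key in registry:
            raise ValueError(
                "{} is defined by both {} and {}".format(key, registry[key], protocol)
            )
        registry[key] = protocol


registry = {}
register(registry, NAMESPACE, "SequenceAnnotationMethods",
         ["SearchFeaturesRequest", "SearchFeaturesResponse"])

# The same record names in the same namespace collide.
try:
    register(registry, NAMESPACE, "GenotypePhenotypeMethods",
             ["SearchFeaturesRequest", "SearchFeaturesResponse"])
    collided = False
except ValueError as error:
    collided = True
    print(error)

# With the recommended rename, both protocols can coexist.
register(registry, NAMESPACE, "GenotypePhenotypeMethods",
         ["SearchGenotypePhenotypeRequest", "SearchGenotypePhenotypeResponse"])
print(sorted(registry))
```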