Skip to content
This repository has been archived by the owner on Jan 24, 2018. It is now read-only.

G2P '/genotypephenotype/search' endpoint #607

Closed
wants to merge 8 commits into from
Closed

Conversation

bwalsh
Copy link
Member

@bwalsh bwalsh commented Sep 4, 2015

Summary

We extended the GA4GH Reference server to include a the '/genotypephenotype/search' endpoint. This document describes the work to date

Approach

We based our work on the model captured in ga4gh/schemas commit of Jul 30, 2015. This version of the schema predates the separated genotype to phenotype files from baseline.

The code was based on a branch setup for this purpose by the server team.
No major refactoring of the server was needed, additional code was added to ga4gh/backend.py,ga4gh/frontend.py and test/unit/test_views.py. A new model was created in datamodel/genotype_phenotype.py

Data

The cancer genome database Clinical Genomics Knowledge Base published by the Monarch project was the source of Evidence.

image

API

The GA4GH schemas define a single endpoint /genotypephenotype/search which accepts a POST of a request body containing one or more of Feature, PhenotypeInstance, EnvironmentalContext, and Evidence which are combined as a logical AND to query the underlying datastore. Missing types are treated as a wildcard returning all data. Responses of matching data are returned as a list of FeaturePhenotypeAssociation. All types rely heavily on OntologyTerm

Request

image
http://yuml.me/edit/bf06b90a

Response

image

http://yuml.me/edit/25343da1


Implementation

image

http://yuml.me/c97fada2


Issues

Query by example

There are four datatypes types for each entity [string, external identifier, ontology identifier and 'entity'].
Currently the implementation handles queries of [string, external identifier and ontology identifier].

The 'entity' query is a type of query-by-example has been deferred. Challenges that arose:

  • schema constraints: there are several fields within the schemas that are defined as non-null. This may be fine when creating an entity from a data store, however, they are problematic when creating an entity to be used in a query.
  • additional discussions needed to determine what properties from an existing entity will be used for the query and which will be ignored. For example a Feature has [id,parentIds, featureSetId, referenceName, start,end, strand, featureType, attributes] we need to specify exactly what the query's expectations are.

Recommendation: Leave the schema definitions as-is. However, leave the entity query-by-example unimplemented. Implement when demand exists with sufficient use case details.

Name collision (SearchFeaturesResponse)

That schema contains two definitions of the class [SearchFeaturesRequest,SearchFeaturesResponse]. These conflict in the generated code with other classes of the same name.

The schema project the current server is based on is version = '0.6.be171b00'
Snippets from this commit follow

  • One in the file genotypephenotypemethods.avdl, protocol GenotypePhenotypeMethods
/** This is the response from `POST /genotypephenotype/search` expressed as JSON. */
record SearchFeaturesResponse {
  /**
  The list of matching FeaturePhenotypeAssociation.
  */
  array<org.ga4gh.models.FeaturePhenotypeAssociation> associations = [];

  ...
  • The second one is found in sequenceAnnotationmethods.avdl
  /** This is the response from `POST /features/search` expressed as JSON. */
  record SearchFeaturesResponse {
    /**
    The list of matching annotations, sorted by start position. Annotations which
    share a start position are returned in a deterministic order.
    */
    array<org.ga4gh.models.Feature> features = [];

    ... 
  • The generated code only has the class associated with sequenceAnnotationmethods.avdl

Both sequenceAnnotationmethods.avdl and genotypephenotypemethods.avdl share the same namespace @namespace("org.ga4gh.methods") each file defines an enclosing protocol.

In the names section of the spec

A name only is specified, i.e., a name that contains no dots. In this case the namespace is taken from the most tightly enclosing schema or protocol. For example, if "name": "X" is specified, and this occurs within a field of the record definition of org.foo.Y, then the fullname is org.foo.X. If there is no enclosing namespace then the null namespace is used.

The schemas pass validation.

A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)

Recommendation: Rename the GenotypePhenotypeMethods [SearchFeaturesRequest,SearchFeaturesResponse] to [SearchGenotypePhenotypeRequest,SearchGenotypePhenotypeResponse]

@jeromekelleher
Copy link
Contributor

This is superb @bwalsh --- great overview, and the code looks great too. I'll get back with some specific comments next week: there's a lot to take in here!

@adamnovak
Copy link
Contributor

As Benedict just said on the call, can you please add corresponding PRs against ga4gh/schemas and ga4gh/compliance?

The compliance PR may have to be based on the compliance_redux branch.

Once that's done, we can merge all 3 PRs at once, with discussion probably happening mostly here.

@@ -27,7 +27,7 @@ def setUpClass(cls):
"SIMULATED_BACKEND_NUM_CALLS": 1,
"SIMULATED_BACKEND_VARIANT_DENSITY": 1.0,
"SIMULATED_BACKEND_NUM_VARIANT_SETS": 1,
# "DEBUG" : True
"DEBUG": True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to leave this in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. I'll remove it

@adamnovak
Copy link
Contributor

So on the server side it looks like this backends into RDF queried with sparql.

How is the RDF data indexed? Will queries be efficient from disk, or will the server have to keep the whole g2p graph in memory?

Also, does this PR include the recommended changes to resolve the name collision? (That would probably actually belong in the corresponding PR against the schemas repo).

Where are the .avdl files that correspond to this server?

@bwalsh
Copy link
Member Author

bwalsh commented Sep 9, 2015

Thanks for the detailed feedback. I'll be attending the meeting & Hackathon in NYC in October. My goal is to support this PR and hopefully be ready for new areas (metadata).

Also, does this PR include the recommended changes to resolve the name collision? (That would probably actually belong in the corresponding PR against the schemas repo).

It does not. As suggested, I'll create a PR against the schema.

Where are the .avdl files that correspond to this server?

The version of the schema these changes were based on: ga4gh/ga4gh-schemas@846b711

As Benedict just said on the call, can you please add corresponding PRs against ga4gh/schemas and ga4gh/compliance? The compliance PR may have to be based on the compliance_redux branch.

Did not realize there was a call. Will do. This may get tricky as I saw the following in the ga4gh/compliance docs. Is there a contact over on that project available for questions?

This version of the Schema is not tracking changes to the real v0.5.1 Schema!

Nor, of course, does it track to your own version of the schema if you were changing that. So you may need to convert the schemas module to be a git submodule which is tracking the version/branch of Schema you care about. If you do this (or otherwise edit the schema) you'll need to run the assertions generator so your test assertThat() functions match your new schema classes

Will queries be efficient from disk, or will the server have to keep the whole g2p graph in memory?

Regarding the runtime performance, my goal was to create an initial implementation; something self contained and easy to implement and understand ( no extra servers and minimum dependencies).

As-is, the server uses memory, there are other stores that ship with RDFLib. My preference is to roadmap adjustments or optimizations after this (set of) PR

@hjellinek
Copy link
Contributor

Hi @bwalsh - The comment

This version of the Schema is not tracking changes to the real v0.5.1 Schema!

is obsolete. I will fix that document.

It is correct to say that the compliance tool kit does not automatically sync its schemas with an outside repo or authority. If you're testing against modified schemas, you need to copy those modifications into the compliance test kit's ctk-schemas module, add appropriate endpoint URLs, etc.

@@ -10,11 +10,13 @@
import json
import random


Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this extra blank line intentional? We don't have this in any of the other files, so I would remove it in the interest of consistency (unless there's a good argument for it).

"""
Module responsible for translating g2p data into GA4GH native
objects.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're missing the standard future imports here. Copy these from any existing file.

@jeromekelleher
Copy link
Contributor

Looks great @bwalsh, thank for the updates. I've made some more style-type comments above. Some high-level stuff that I think needs to be addressed before we merge:

  1. We have to make the g2p dataset parameterisable. We should be able to find it in the data tree just like referenceSets and dataSets.
  2. We need to break the tests for g2p out into their own module. Do we really need the cgd.ttl file to be included in our test data? Isn't this an example file that should be provided as example data, and not shipped around with our source code? In general, users are going to install the code from pypi, so we can't make any assumptions at all about file system paths. So, I suggest deleting this file, unless it's absolutely necessary we keep it for testing purposes.
  3. I'm pretty sure we can simplify the paging stuff. Let's talk about it at the hackathon.

])


class GenotypePhenotypeBackend(AbstractBackend):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class doesn't really do anything, just remove it.

@jeromekelleher
Copy link
Contributor

Also, it would be good to rebase this against the latest develop so we get the new data input format code.

@bwalsh
Copy link
Member Author

bwalsh commented Oct 9, 2015

Guys, I was working through these latest comments. All was fine until I tried to rebase again.
I wont get into the minutia, but to check my sanity, I cloned https://github.com/ga4gh/server and ran tests on that.

======================================================================
ERROR: test suite for <class 'tests.end_to_end.test_oidc.TestOidc'>
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/nose/suite.py", line 209, in run
    self.setUp()
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/nose/suite.py", line 292, in setUp
    self.setupContext(ancestor)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/nose/suite.py", line 315, in setupContext
    try_run(context, names)
  File "/root/.pyenv/versions/2.7.10/lib/python2.7/site-packages/nose/util.py", line 471, in try_run
    return func()
  File "/home/ga4gh/testupstream/server/tests/end_to_end/server_test.py", line 70, in setUpClass
    cls.otherSetup()
  File "/home/ga4gh/testupstream/server/tests/end_to_end/test_oidc.py", line 58, in otherSetup
    cls.opServer.start()
  File "/home/ga4gh/testupstream/server/tests/end_to_end/server.py", line 243, in start
    super(OidcOpServerForTesting, self).start()
  File "/home/ga4gh/testupstream/server/tests/end_to_end/server.py", line 59, in start
    assert not self.isRunning(), "Another server is running"
AssertionError: Another server is running

======================================================================
ERROR: testSamConversion (tests.unit.test_converters.TestSamConverter)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ga4gh/testupstream/server/tests/unit/test_converters.py", line 69, in testSamConversion
    self.verifyFullConversion(readGroupSet, readGroup, reference)
  File "/home/ga4gh/testupstream/server/tests/unit/test_converters.py", line 43, in verifyFullConversion
    converter.convert()
  File "/home/ga4gh/testupstream/server/ga4gh/converters.py", line 71, in convert
    alignedSegment = SamLine.toAlignedSegment(read, targetIds)
  File "/home/ga4gh/testupstream/server/ga4gh/converters.py", line 129, in toAlignedSegment
    ret.flag = cls.toSamFlag(read)
  File "/home/ga4gh/testupstream/server/ga4gh/converters.py", line 163, in toSamFlag
    if read.nextMatePosition.strand == protocol.Strand.NEG_STRAND:
AttributeError: 'NoneType' object has no attribute 'strand'

@jeromekelleher
Copy link
Contributor

The first problem is due to a server running in a background process left over from some old tests. If you kill this it should go away.

The second problem is doesn't happen for me when I run the tests. Do you have some other read data in the tests/data directory? What's happening is you've provoked issue #705, but this shouldn't happen with the way that particular test is set up at the moment. Or, perhaps tests are run slightly differently on a whatever platform you're on... You're using Fedora, right? It'd be surprising if there was a difference between this and Debian.

jeromekelleher and others added 4 commits October 9, 2015 22:04
The input data directory must now be in the following
format:

dataDir/
    datasets/
        dataset1
            /variants
            /reads
        dataset2
            #etc
    referenceSets
        referenceSet1/
        referenceSet2

Datasets and ReferenceSets are now treated symmetrically.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants