Handover method for data access #114

Open
mbaudis opened this Issue Oct 31, 2017 · 6 comments

5 participants
@mbaudis
Contributor

mbaudis commented Oct 31, 2017

As discussed previously, and with a future slot assigned on the Beacon roadmap, there is general consensus on the need to specify a handoff protocol. The arguments supporting this development can be summarized as:

  • For Beacon protocol simplicity and for scoping security issues, the Beacon protocol should not include direct access to non-aggregated data.
  • However, there is considerable interest in using the results of Beacon queries as entry points for "data discovery" activities.
  • A promising concept is to deliver a key pointing at the sample specific query results, which can be used - after going through an authentication system - to access/stream the data which matched the Beacon query. This can be called a handoff scenario.

As a starting point for discussions on the merits of this concept and how to implement it, we have prototyped a very basic version of a handoff concept (without use of a proper authentication procedure):

Beacon+ query
=> internal matched variants
==> internal retrieved callset ids
===> internal storage of callset ids in record in tmp database
====> external delivery of BeaconDatasetAlleleResponse + info.callset_access_handle

Data access
=> callset_access_handle is submitted to authentication system
==> authentication procedure + fwd of callset_access_handle
===> data retrieval options based on authentication status
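
The two flows above can be sketched roughly as follows. This is a minimal in-memory illustration, not the Beacon+ implementation: `HANDLE_STORE`, `store_callset_ids`, `beacon_response` and `retrieve_callset_ids` are hypothetical names, a dict stands in for the temporary database, and the `authenticated` flag stands in for a real authentication system.

```python
import uuid

# Hypothetical stand-in for the temporary database of stored query matches.
HANDLE_STORE = {}


def store_callset_ids(callset_ids):
    """Store matched callset ids and return a randomly generated handle."""
    handle = str(uuid.uuid4())
    HANDLE_STORE[handle] = list(callset_ids)
    return handle


def beacon_response(callset_ids):
    """Build a reduced response: the aggregate answer plus the access handle."""
    return {
        "exists": len(callset_ids) > 0,
        "info": {"callset_access_handle": store_callset_ids(callset_ids)},
    }


def retrieve_callset_ids(handle, authenticated):
    """Return the stored ids only if the caller passed authentication."""
    if not authenticated:
        raise PermissionError("authentication required for data access")
    return HANDLE_STORE[handle]
```

The point of the sketch is that the Beacon response itself only carries the opaque handle; the callset ids never leave the server unless the data access step succeeds.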

As part of the Beacon specifications, it should probably be sufficient to define an attribute name/format for the access handle; authentication etc. would be for demonstrators, "discovery" product ... but probably out of scope for the Beacon protocol itself (?).

See Beacon+ => CNV example => handover in response table => ...; the current concept is detailed in these slides.

@mbaudis mbaudis added this to the 0.5 milestone Oct 31, 2017

@juhtornr juhtornr added the proposal label Mar 7, 2018

@juhtornr juhtornr changed the title from Proposal: Handover method for data access to Handover method for data access Mar 7, 2018

@mbaudis

Contributor

mbaudis commented May 8, 2018

Following the comment/request #157 from @mfiume, we should work on specifying the Handover structure with e.g. the DOS use case.

Working assumptions for the structure of the Handover protocol extension now are that:

  • (a) randomly generated identifier(s) link(s) to the server-side stored matches of the Beacon query
  • the identifier(s), along with a specification of the supported response format(s), are exposed in the Beacon response; draft:
handover: [
  {
    schema: "DOS",
    access_key: "30822e80-8ef8-4ac9-af5d-304aa7f8c1dd"
  }
],
  • the client can process the response to generate a query against the Handover system, containing the identifier and the selected response type (e.g. DOS)
  • the Handover mechanism will usually include an authentication procedure which will evaluate credentials && request scope - this is not part of the Beacon protocol, but may be integrated with procedures in place for secured Beacon environments
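
As a rough sketch of the client-side step in the list above: the client picks the handover entry for the response format it wants and builds a query against the Handover system. The function name `build_handover_query`, the `base_url` argument and the query parameter names are assumptions made for illustration; the draft does not define them.

```python
from urllib.parse import urlencode


def build_handover_query(beacon_response, schema, base_url):
    """Return a request URL for the first handover entry matching `schema`.

    `base_url` is a hypothetical Handover endpoint; the parameter names
    `access_key` and `schema` simply mirror the draft handover object.
    """
    for entry in beacon_response.get("handover", []):
        if entry["schema"] == schema:
            params = urlencode(
                {"access_key": entry["access_key"], "schema": schema}
            )
            return f"{base_url}?{params}"
    return None  # the Beacon did not offer this response format
```

Authentication is deliberately absent here, since the draft leaves it to the Handover system.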

The Handover implementation proposal addresses #157 and is related to #107.

As a reminder, a simplified implementation has been prototyped for the Beacon+ resource and is conceptually documented here, though the Handover is now assumed to be an object rather than the callset_access_handle used in the demonstrator.

@mfiume

Contributor

mfiume commented May 30, 2018

@mbaudis what if the object lives on a different server from the one that is generating the response?

Here, is the access key used to ID the object or does it comprise the authentication information required to fetch it, or both?

What about having a url in the handover struct to point to the payload?

Can you provide an example of how the authentication procedure would be provided? I agree that this would be very helpful to encode as a hint, just wondering how you'd approach it.

@mbaudis

Contributor

mbaudis commented May 30, 2018

@mfiume I don't think that this would be part of Beacon, but the general idea would be that the "handoff" key would point to whatever action is then executed. It doesn't really matter which server the data resides on; this is resolved from data_access_handle and selected "action". The Beacon itself could expose a vocabulary of actions, so that a distributed query could e.g. be run over many nodes.

Sure, the handover object could be a url; but the url should not provide ids or such, just point to a resolver which can then work out which data objects are pointed to. Basically the same as above, with

    url: "https://beacondeliver.mygenomecollection.org/handover/30822e80-8ef8-4ac9-af5d-304aa7f8c1dd"

instead of

    access_key: "30822e80-8ef8-4ac9-af5d-304aa7f8c1dd"

Authentication could be provided in OAuth etc., and the resolver would match credentials to access rights. This would allow the layered access of public beacon query + limited data retrieval.
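
A minimal sketch of such a resolver, assuming the opaque identifier is the last path segment of the handover url and that the authentication layer has already translated credentials into a set of permitted actions. Both `extract_access_id` and `resolve`, as well as the action names used with them, are hypothetical.

```python
from urllib.parse import urlparse


def extract_access_id(handover_url):
    """Return the last path segment of the handover url (the opaque id)."""
    path = urlparse(handover_url).path
    return path.rstrip("/").rsplit("/", 1)[-1]


def resolve(handover_url, action, granted_actions):
    """Return (access_id, action) if the caller's rights cover the action.

    `granted_actions` is assumed to come from an external authentication
    step (e.g. derived from OAuth scopes); that step is not modelled here.
    """
    if action not in granted_actions:
        raise PermissionError(f"action '{action}' not permitted")
    return extract_access_id(handover_url), action
```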

Our testbed implementation

In our current implementation, the callset_access_handle points to a document in a temporary DB, which holds the details:

  • a beacon query (against a non-aggregated collection) leads to matching callsets
  • the number ... of callsets is in the beacon response
  • the _id values of the callsets are stored in a database, where the _id value of the document is returned as callset_access_handle (well, our name here); the document looks like:
{
	"_id" : "966fc3c2-5a11-11e8-bf6d-8f10af00a547",
	"query_coll" : "callsets",
	"query_key" : "id",
	"query_values" : [
		"PGX_AM_CS_GSM511473",
		"PGX_AM_CS_GSM1102907",
		"PGX_AM_CS_GSM437026",
		"PGX_AM_CS_GSM878881"
	],
	"query_db" : "arraymap_ga4gh"
}

Data can now be retrieved by building different kinds of queries from this document.

  1. Getting the callset ids:
db.querybuffer.findOne({_id:'966fc3c2-5a11-11e8-bf6d-8f10af00a547'})

... would deliver the document shown. This has its own:

  • database and collection to query
    • "query_db" : "arraymap_ga4gh"
    • "query_coll" : "callsets"
  • attribute name
    • "query_key" : "id"
  • attribute values
    • "query_values" : [ ... ]

If you now follow the original GA4GH schema, you can retrieve e.g. all biosample ids by querying:

db.callsets.find({id:{$in:["PGX_AM_CS_GSM511473","PGX_AM_CS_GSM188255"]}},{biosample_id:1})

... etc., and then get the biosample data; similarly for all variants from the matching callsets etc.
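
The step from the stored document to such a query can be sketched generically. The helper below is hypothetical; it only assembles the database name, collection name, filter and projection from the document fields shown above and does not talk to a database.

```python
def build_query(handover_doc, projection_field=None):
    """Derive the query target and filter from a stored handover document.

    Returns (database name, collection name, filter, projection), where
    the filter uses the MongoDB `$in` operator as in the shell example.
    """
    query_filter = {
        handover_doc["query_key"]: {"$in": handover_doc["query_values"]}
    }
    projection = {projection_field: 1} if projection_field else None
    return (
        handover_doc["query_db"],
        handover_doc["query_coll"],
        query_filter,
        projection,
    )
```

With a driver such as PyMongo, the returned parts could then be passed on along the lines of `client[db][coll].find(query_filter, projection)`; this usage is an assumption and not part of the testbed description.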

But this requires a standardised data structure in the handover delivery (here the GA4GH schema - which we use); or one starts to define other endpoints (and provides this with the Beacon response's handover info).

It is all rather trivial, if keeping to the basic principles of a schema which had been developed over years, without enforcing some of the more esoteric "recapitulate VCF column format" ideas of it.

Oh well...

@jrambla jrambla modified the milestones: 0.5, 0.6 Sep 18, 2018

@mbaudis

Contributor

mbaudis commented Oct 30, 2018

Updated scenario: providing a url + label handover list for direct access to the identified resources.

We have now implemented this scenario, for "one click" actions, based on the variants/callsets/samples identified in the Beacon query.

Example (this is the excerpt from the Beacon response):

"datasetAlleleResponses": [
  {
    "callCount": 163,
    "datasetId": "arraymap",
    "error": null,
    "exists": true,
    "externalUrl": "https://beacon.progenetix.org/beacon/info/",
    "frequency": 0.157,
    "handover": [
      {
        "action": "create CNV histogram from matched callsets",
        "label": "Histogram",
        "url": "/beaconplus-server/beacondeliver.cgi?do=histogram&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0"
      },
      {
        "action": "export all biosample data of matched callsets",
        "label": "Biosamples",
        "url": "https://beacon.progenetix.org/beaconplus-server/beacondeliver.cgi?do=biosamples&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0"
      },
      {
        "action": "export all variants of matched callsets",
        "label": "Callsets",
        "url": "/beaconplus-server/beacondeliver.cgi?do=variants&accessid=2a0136df-dc49-11e8-a927-8d34da1c5bc0"
      },
      {
        "action": "retrieve matching variants",
        "label": "Variants",
        "url": "/beaconplus-server/beacondeliver.cgi?do=variants&accessid=2a01d0bc-dc49-11e8-a927-a8c3673772cb"
      }
    ],
    "info": {
      "callset_access_handle": "2a0136df-dc49-11e8-a927-8d34da1c5bc0",
      "description": "The query was against database \"arraymap\", variant collection \"variants\". 163 matched callsets for 152 distinct variants. Out of 51820 biosamples in the database, 1038 matched the biosample query; of those, 163 had the variant.",
      "payload": null
    },
    "sampleCount": 163,
    "variantCount": 152
  }
],

This

  • provides a direct option to select handover responses
  • still allows authentication to be handled by the delivery mechanism
  • could be extended to include an authentication key
  • keeps "Beacon response" and "Data delivery" nicely separated
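
On the client side, consuming such a handover list could look roughly like the sketch below. It assumes that relative urls are meant to be resolved against the Beacon host, which the example response does not state, and `summarize_handovers` is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urljoin, urlparse


def summarize_handovers(dataset_response, base_url):
    """List label, absolute url, action key and access id per handover."""
    summary = []
    for handover in dataset_response.get("handover", []):
        # urljoin leaves absolute urls untouched and resolves relative ones.
        absolute = urljoin(base_url, handover["url"])
        params = parse_qs(urlparse(absolute).query)
        summary.append(
            {
                "label": handover["label"],
                "url": absolute,
                "do": params["do"][0],
                "accessid": params["accessid"][0],
            }
        )
    return summary
```

An interface could render each entry as a "one click" link using `label` as the link text.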

sdelatorrep added a commit that referenced this issue Nov 6, 2018

@mbaudis

Contributor

mbaudis commented Nov 8, 2018

@sdelatorrep I would suggest adding also a label attribute to the handover object.

Reasoning:

  • You propose using an ontology term for the handover action. In OT implementations, it is common to represent them with a human readable label:
"type" : {
  "id" : "ncit:C40078",
  "label": "Ovarian clear cell adenocarcinoma"
}
  • This label could then serve also e.g. in an interface.

Also, this would be an interesting scenario in which we have to decide whether we should implement the general OntologyClass concept, which finds its way into other parts of the GA4GH schemas.
So, the schema could then look like:

Handover:
  type: object
  required:
    - type
    - url
  properties:
    type:
      type: object
      required:
        - id
      properties:
        id:
          type: string
          description: The use of an ontology term, in CURIE syntax, is strongly recommended. Use “CUSTOM” when no ontology is available.
          default: CUSTOM
        label:
          type: string
          description: A short label for the handover action. In the case of an ontology, this would be the "preferred Label".
    url:
      type: string
      description: URL endpoint to where the handover process could progress (in RFC 3986 format).
    note:
      type: string
      description: Additional human readable information or description about the handover.

(The type here is a bit confusing, both as attribute name and as keyword... Alas, this is just for discussion.)
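
To make the draft concrete, here is a small hand-written check of a handover object against the structure sketched above. It is an illustration of the discussion draft only, not a validator for any released Beacon specification, and `validate_handover` is a hypothetical name.

```python
def validate_handover(handover):
    """Return a list of problems found; an empty list means it conforms."""
    if not isinstance(handover, dict):
        return ["handover must be an object"]
    errors = []
    handover_type = handover.get("type")
    if not isinstance(handover_type, dict):
        errors.append("'type' is required and must be an object")
    elif not isinstance(handover_type.get("id"), str):
        errors.append("'type.id' is required and must be a string")
    elif "label" in handover_type and not isinstance(
        handover_type["label"], str
    ):
        errors.append("'type.label' must be a string")
    if not isinstance(handover.get("url"), str):
        errors.append("'url' is required and must be a string")
    if "note" in handover and not isinstance(handover["note"], str):
        errors.append("'note' must be a string")
    return errors
```

For example, `{"type": {"id": "CUSTOM", "label": "Histogram"}, "url": "https://example.org/handover"}` passes, while an object without `url` is reported as incomplete.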

@sdelatorrep

Contributor

sdelatorrep commented Nov 13, 2018

Hi @mbaudis , looks good! Though we think it's not necessary to create an object for the field type. Check our proposal in PR #230, please.
