How should we specify the metadata endpoint? #3

Closed
nsheff opened this issue Nov 12, 2020 · 8 comments

@nsheff
Member

nsheff commented Nov 12, 2020

The primary functions of seqcol are to 1) define unique identifiers for sequence collections; 2) provide a protocol to serve sequence collection data given the identifiers; and 3) provide a function for comparing compatibility among sequence collections.

An important ancillary function is to provide metadata associated with a particular sequence collection, like provider, version, or organism. How should we provide this data? The current proposal is:

  • There will be a /metadata/:seqcol_digest endpoint which returns an annotated JSON with all metadata for a given sequence collection.
  • There will be a /metadata-schema endpoint that provides a JSON-schema defining the allowed and required fields.
  • Each server of sequence collections would define its own schema.
  • The protocol provides a base schema with some fundamental metadata, which can be extended by particular servers.

Does that seem reasonable? If so, what is the base information that should be included in the base schema? In other words, let's define the core JSON-schema.

Here's a proposal for a JSON-schema that could define a base set of metadata fields:

description: "Schema for sequence collection metadata"
type: array
items: 
  type: object
  properties:
    source:
      type: string
      description: The entity that produced this collection.
    organism:
      type: string
      description: Identifier from the NCBI Taxonomy ontology.
    version:
      type: string
      description: Optional, an identifier for the release version of this collection.
    aliases:
      type: array
      description: A list of human-readable identifiers used to refer to this collection.
      items:
        type: string
  required:
    - source
    - organism
    - aliases

This means the /metadata/:seqcol_digest would return an array of what we might call "metadata packages", where each package must contain "source", "organism", and "aliases", and may contain "version". The rationale behind making this provide an array of "packages" instead of just one package is that multiple providers may provide the same collection, and annotate it in different ways, and this approach keeps their metadata separate.
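To make the "array of packages" idea concrete, here is a minimal sketch in Python of what such a response could look like and how a client might check it against the proposed base schema. It is an illustration only: the checker is a hand-written stand-in for a real JSON-schema validator, and the provider names, aliases, and version string are made-up example values.

```python
# Sketch of the proposed /metadata/:seqcol_digest response shape.
# All example values below are illustrative, not taken from a real server.
REQUIRED_FIELDS = ("source", "organism", "aliases")


def validate_packages(packages):
    """Return a list of error strings; an empty list means the response is valid."""
    if not isinstance(packages, list):
        return ["response must be an array of metadata packages"]
    errors = []
    for i, package in enumerate(packages):
        if not isinstance(package, dict):
            errors.append(f"package {i}: must be an object")
            continue
        # Required fields from the proposed base schema.
        for field in REQUIRED_FIELDS:
            if field not in package:
                errors.append(f"package {i}: missing required field '{field}'")
        # String-typed fields ("version" is optional).
        for field in ("source", "organism", "version"):
            if field in package and not isinstance(package[field], str):
                errors.append(f"package {i}: '{field}' must be a string")
        # "aliases" must be an array of strings.
        aliases = package.get("aliases")
        if "aliases" in package and not (
            isinstance(aliases, list) and all(isinstance(a, str) for a in aliases)
        ):
            errors.append(f"package {i}: 'aliases' must be an array of strings")
    return errors


# Two hypothetical providers annotating the same collection separately.
example_response = [
    {"source": "provider-a", "organism": "9606", "aliases": ["hg38"], "version": "p13"},
    {"source": "provider-b", "organism": "9606", "aliases": ["GRCh38"]},
]

print(validate_packages(example_response))
```

Because each provider contributes its own package, the two entries can disagree on aliases or version without one overwriting the other.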

Perhaps it makes sense to use a simple ontology (or at least controlled vocabulary) for providers, and use those terms in the source field. If we did that, then the metadata endpoint could be qualified by a provider identifier, so you could retrieve only the metadata package specified by a particular provider. I'm not sure going to this level of complexity is really warranted, though.
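As a rough sketch of what qualifying by provider could mean, the Python snippet below filters an array of packages by a provider term drawn from a controlled vocabulary. The vocabulary contents and the function name are hypothetical; nothing here is part of the proposal itself.

```python
# Hypothetical controlled vocabulary of provider identifiers.
KNOWN_PROVIDERS = {"provider-a", "provider-b"}


def select_by_provider(packages, provider_id):
    """Return only the metadata packages whose 'source' matches provider_id."""
    if provider_id not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider_id!r}")
    return [p for p in packages if p.get("source") == provider_id]


# Made-up example data: the same collection annotated by two providers.
packages = [
    {"source": "provider-a", "organism": "9606", "aliases": ["hg38"]},
    {"source": "provider-b", "organism": "9606", "aliases": ["GRCh38"]},
]

print(select_by_provider(packages, "provider-b"))
```

The controlled vocabulary is what makes the filter reliable: with free-text source values, two spellings of the same provider would silently end up in different buckets.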

@andrewyatz
Collaborator

@jb-adams you have had some experience with service-info, I think? Also, one possible alternative to the metadata endpoint is to have the JSON send back the specific schema it uses:

{
  "$id": "http://yourdomain.com/schemas/myschema.json",
  "$schema": "http://json-schema.org/schema#"
}

Both bits here shamelessly stolen from JSON Schema's basics page.

@jb-adams
Member

@andrewyatz @nsheff yes I'm quite familiar with /service-info. What are you thinking of in particular for this issue? At the outset, I think we can keep these design considerations in mind:

  1. Implement the /service-info endpoint in such a way that it won't collide with another /service-info endpoint if multiple GA4GH API specs are implemented by a single web service (e.g. a refget + seqcol service). e.g. this may be as simple as implementing the endpoint at /collections/service-info rather than /service-info

  2. Extend the /service-info endpoint with custom attributes specific to seqcol. In htsget we added an htsget object to the base service info response, which contains info about supported formats, etc. For seqcol, we could use this to inform clients whether an optional metadata endpoint is implemented, and/or what schema(s) are supported by the service if we want to allow for multiple metadata schemas
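The two considerations above could be sketched in Python roughly as follows. The top-level fields reflect my understanding of the GA4GH service-info base response; the attribute names under "seqcol", the example organization, and the schema URL are assumptions made up for illustration.

```python
import json

# Consideration 1: a namespaced path avoids colliding with another spec's
# /service-info on the same web service (path suggested in the comment above).
SERVICE_INFO_PATH = "/collections/service-info"


def build_service_info(metadata_supported, schema_urls):
    """Build a service-info style response with a seqcol-specific extension."""
    return {
        "id": "org.example.seqcol",
        "name": "Example seqcol service",
        "type": {"group": "org.ga4gh", "artifact": "seqcol", "version": "0.0.0"},
        "organization": {"name": "Example Org", "url": "https://example.org"},
        "version": "0.0.0",
        # Consideration 2: custom attributes under a spec-specific key,
        # mirroring the 'htsget' object. Key names here are hypothetical.
        "seqcol": {
            "metadataEndpoint": metadata_supported,
            "metadataSchemas": list(schema_urls),
        },
    }


info = build_service_info(True, ["https://example.org/schemas/seqcol-metadata.json"])
print(SERVICE_INFO_PATH)
print(json.dumps(info, indent=2))
```

A client could then read the "seqcol" object first and skip any call to an optional metadata endpoint when the flag is false.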

@andrewyatz
Collaborator

It's more about how to offer the same endpoint but with extensions. In service-info we just allowed individual specifications in OpenAPI to inherit our base schema and extend it. But that means you have to do it in OpenAPI, and there is no way to access the schema other than going into the OpenAPI document. Maybe it's worth considering what our best practice is here.

@jb-adams
Member

Oh, so how to extend the base ServiceInfo schema in OpenAPI? We did this in htsget with the allOf construct, basically importing all base attributes and adding our own:

htsgetServiceInfo:
      allOf:
        - '$ref': '#/components/schemas/ServiceInfo'
        - type: object
          properties:
            htsget:
              type: object
              description: extended attributes for htsget
              properties:
                datatype:
                  type: string
                  description: >
                    Indicates the htsget datatype category ('reads' or 'variants')
                    served by the ticket endpoint related to this service-info
                    endpoint
                  enum: [reads, variants]
                  example: reads
                formats:
                  type: array
                  description: >
                    List of alignment or variant file formats supported
                    by the htsget endpoint. If absent, clients cannot make 
                    assumptions about what formats are supported ahead
                    of making a query.
                  items:
                    type: string
                    enum: [BAM, CRAM, VCF, BCF]
                fieldsParameterEffective:
                  type: boolean
                  description: >
                    Indicates whether the web service supports alignment field
                    inclusion/exclusion via the `fields` parameter. If absent,
                    clients cannot make assumptions about whether the `fields`
                    parameter is effective ahead of making a query.
                tagsParametersEffective:
                  type: boolean
                  description: >
                    Indicates whether the web service supports alignment tag
                    inclusion/exclusion via the `tags` and `notags` parameters.
                    If absent, clients cannot make assumptions about whether the
                    `tags` and `notags` parameters are effective ahead of making
                    a query.
        - type: object
          description: >
            This response extends the GA4GH Service Info specification
            with htsget-specific properties under the 'htsget' attribute.
            ServiceType 'artifact' property MUST be 'htsget' for both reads 
            and variants endpoints.
          required:
            - type
          properties:
            type:
              type: object
              required:
                - artifact
              properties:
                artifact:
                  type: string
                  enum: [htsget]
                  example: htsget

You'll see 3 objects under the allOf parameter. In order, they:

  1. import the base Service schema from service info
  2. add extended attributes under the htsget property
  3. constrain the type.artifact value so that only htsget is allowable

Is this what you're referring to?
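For readers less familiar with allOf, the sketch below shows its semantics in Python: an instance is valid only if it satisfies every subschema in the list. The validator handles just the handful of keywords used here and is not a full JSON Schema or OpenAPI implementation; the three subschemas are heavily trimmed stand-ins for the ones above, and the base schema's required fields are an assumption.

```python
def validate(instance, schema):
    """Validate against a tiny subset of keywords: allOf, type, enum, required, properties."""
    errors = []
    # allOf: collect the errors from every subschema; all must pass.
    for sub in schema.get("allOf", []):
        errors.extend(validate(instance, sub))
    if schema.get("type") == "object" and not isinstance(instance, dict):
        return errors + ["expected an object"]
    if schema.get("type") == "string" and not isinstance(instance, str):
        return errors + ["expected a string"]
    if "enum" in schema and instance not in schema["enum"]:
        errors.append(f"{instance!r} not in {schema['enum']}")
    if isinstance(instance, dict):
        for name in schema.get("required", []):
            if name not in instance:
                errors.append(f"missing required property '{name}'")
        for name, sub in schema.get("properties", {}).items():
            if name in instance:
                errors.extend(validate(instance[name], sub))
    return errors


# 1. Trimmed stand-in for the imported base schema (required fields assumed).
BASE = {"type": "object", "required": ["id", "name"]}
# 2. Extended attributes under the 'htsget' property.
EXTENSION = {
    "type": "object",
    "properties": {
        "htsget": {
            "type": "object",
            "properties": {"datatype": {"type": "string", "enum": ["reads", "variants"]}},
        }
    },
}
# 3. Constraint so that only 'htsget' is allowed as type.artifact.
CONSTRAINT = {
    "type": "object",
    "required": ["type"],
    "properties": {
        "type": {
            "type": "object",
            "required": ["artifact"],
            "properties": {"artifact": {"type": "string", "enum": ["htsget"]}},
        }
    },
}
HTSGET_SERVICE_INFO = {"allOf": [BASE, EXTENSION, CONSTRAINT]}

good = {"id": "x", "name": "y", "type": {"artifact": "htsget"}, "htsget": {"datatype": "reads"}}
bad = dict(good, type={"artifact": "refget"})
print(validate(good, HTSGET_SERVICE_INFO))
print(validate(bad, HTSGET_SERVICE_INFO))
```

The same pattern would carry over to seqcol by swapping the second and third subschemas for seqcol-specific ones.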

@andrewyatz
Collaborator

Discussions at the seqcol meeting just now concluded that we should go the same route as refget, which specified the schema only in OpenAPI format. This issue will also be split in two: one on whether to have this endpoint at all (and whether it is mandatory), and one on the format of its response if we do (assuming I understood the resolution correctly).

@sveinugu
Collaborator

Hi, and thanks for including me in the seqcol meeting! I am a senior engineer employed by ELIXIR Norway (at the University of Oslo).

The reason I was invited is that I am one of the main developers of the FAIRtracks draft standard (and related tool infrastructure) for metadata of genomic track files, which is the result of an ELIXIR implementation study: http://fairtracks.github.io. FAIRtracks is available in the form of a set of JSON schemas: https://github.com/fairtracks/fairtracks_standard/. It is for now a suggestion and is meant to evolve. So obviously the metadata aspect of seqcol is of interest to me, and adding seqcol support would be a natural extension. A manuscript is written and will be submitted soon.

This seems to be a bit late in the process, so I hope I am not being too presumptuous here. I just wanted to present some initial thoughts:

  1. As mentioned in the meeting, having a way for the metadata content to refer back to the schema would be useful for versioning purposes (and would be nice to include in the first version, so that downstream implementations don't have to add a specific rule for the first version). In the FAIRtracks standard, we have added an '@Schema' field which contains a URL that includes a version string. Another useful feature of an '@Schema' field, as someone else mentioned, is to provide a simple way to validate the payload.

  2. I think it would be an idea to ponder a bit on the FAIR principles (https://www.go-fair.org/fair-principles/). I think most of the points are already handled by the current specification or are not relevant, but there are at least some that pose a challenge:

  • "I2. (Meta)data use vocabularies that follow FAIR principles": As mentioned by @nsheff, it would be nice if all relevant fields, such as source, would point to an ontology or vocabulary.
  • "I3. (Meta)data include qualified references to other (meta)data". My main idea here is that it would be nice to have a pointer to a record describing the source content. In FAIRtracks, we make use of CURIE identifiers resolvable by https://identifiers.org for this (and we will probably also support n2t.net at some point). The seqcol identifier is a new identifier that is meant to be used in place of such source-specific identifiers, but I think the metadata should contain the relation.
  • "R1. (Meta)data are richly described with a plurality of accurate and relevant attributes". This would not be natural for a minimal standard such as seqcol, but providing a resolvable source identifier (see I3) would make it relatively easy to access a larger set of relevant metadata fields. So the main approach of FAIRtracks in this context is to include the fields that are most useful, and refer to other records for the rest.
  3. As to the question of whether metadata should be required or not, I am in the "required" camp, at least for the most important fields, which for me would be source identifier (as CURIE), organism identifier, and I think also version.
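A small Python sketch of the two suggestions (a schema back-reference carrying a version, and a resolvable CURIE for the source) might look like this. The field casing, the schema URL layout, and the example CURIEs are assumptions for illustration; only the general identifiers.org pattern of appending prefix:accession to the resolver URL is relied upon.

```python
import re

IDENTIFIERS_ORG = "https://identifiers.org/"


def curie_to_url(curie):
    """Turn a compact identifier (prefix:accession) into an identifiers.org URL."""
    prefix, sep, accession = curie.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a CURIE: {curie!r}")
    return f"{IDENTIFIERS_ORG}{prefix}:{accession}"


def schema_version(schema_url):
    """Extract a semantic version from a URL shaped like .../v1.2.3/... (assumed layout)."""
    match = re.search(r"/v(\d+\.\d+\.\d+)/", schema_url)
    return match.group(1) if match else None


# Hypothetical metadata package; the schema URL and values are made up.
package = {
    "@schema": "https://example.org/seqcol-metadata/v0.1.0/schema.json",
    "source": "insdc.gca:GCA_000001405.15",
    "organism": "taxonomy:9606",
}

print(schema_version(package["@schema"]))
print(curie_to_url(package["source"]))
print(curie_to_url(package["organism"]))
```

With the version embedded in the schema URL, a client can pick the right validation rules from the payload alone, and the resolved source URL gives access to the richer record that a minimal standard deliberately leaves out.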

@nsheff
Member Author

nsheff commented Jan 25, 2023

From today's discussion:

@nsheff
Member Author

nsheff commented Feb 21, 2024

Solved with "no metadata endpoint" decision in #54.

nsheff closed this as completed Feb 21, 2024