How should we specify the metadata endpoint? #3

nsheff · 2020-11-12T13:05:49Z

The primary functions of seqcol are to 1) define unique identifiers for sequence collections; 2) provide a protocol to serve sequence collection data given the identifiers; and 3) provide a function for comparing compatibility among sequence collections.

An important ancillary function is to provide metadata associated with a particular sequence collection, like provider, version, or organism. How should we provide this data? The current proposal is:

There will be a /metadata/:seqcol_digest endpoint which returns an annotated JSON with all metadata for a given sequence collection.
There will be a /metadata-schema endpoint that provides a JSON-schema defining the allowed and required files for
Each server of sequence collections would define their schema.
The protocol provides a base schema with some fundamental metadata, which can be extended by particular servers

Does that seem reasonable? If so, what is the base information that should be included in the base schema? In other words, let's define the core JSON-schema.

Here's a proposal for a JSON-schema that could define a base set of metadata fields:

description: "Schema for sequence collection metadata"
type: array
items: 
  type: object
  properties:
    source:
      type: string
      description: The entity that produced this collection.
    organism:
      type: string
      description: Identifier from the NCBI Taxonomy ontology.
    version:
      type: string
      description: Optional, an identifier for the release version of this collection.
    aliases:
      type: array
      description: A list of human-readable identifiers used to refer to this collection.
      items:
        type: string
  required:
    - source
    - organism
    - aliases

This means the /metadata/:seqcol_digest would return an array of what we might call "metadata packages", where each package must contain "source", "organism", and "aliases", and may contain "version". The rationale behind making this provide an array of "packages" instead of just one package is that multiple providers may provide the same collection, and annotate it in different ways, and this approach keeps their metadata separate.
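To make the shape of that response concrete, here is a minimal Python sketch of what /metadata/:seqcol_digest might return under the proposed schema, together with a hand-written check of the required fields. The provider names, the organism value, the aliases, and the `validate_packages` helper are all illustrative assumptions, not part of the proposal.

```python
# Hypothetical response body for /metadata/:seqcol_digest: an array of
# "metadata packages", one per provider. All values are made up.
example_response = [
    {
        "source": "provider-a",
        "organism": "9606",
        "version": "v1",
        "aliases": ["hg38", "GRCh38"],
    },
    {
        # "version" is optional, so this package omits it.
        "source": "provider-b",
        "organism": "9606",
        "aliases": ["human-reference"],
    },
]

REQUIRED = ("source", "organism", "aliases")
STRING_FIELDS = ("source", "organism", "version")


def validate_packages(packages):
    """Return a list of error strings; an empty list means the array is valid."""
    if not isinstance(packages, list):
        return ["top-level value must be an array"]
    errors = []
    for i, pkg in enumerate(packages):
        if not isinstance(pkg, dict):
            errors.append(f"item {i}: must be an object")
            continue
        for key in REQUIRED:
            if key not in pkg:
                errors.append(f"item {i}: missing required field '{key}'")
        for key in STRING_FIELDS:
            if key in pkg and not isinstance(pkg[key], str):
                errors.append(f"item {i}: '{key}' must be a string")
        aliases = pkg.get("aliases")
        if aliases is not None and not (
            isinstance(aliases, list) and all(isinstance(a, str) for a in aliases)
        ):
            errors.append(f"item {i}: 'aliases' must be an array of strings")
    return errors
```

A real server would more likely hand the schema to a JSON-schema validator; the explicit checks here only spell out what the schema above requires.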

Perhaps it makes sense to use a simple ontology (or at least controlled vocabulary) for providers, and use those terms in the source field. If we did that, then the metadata endpoint could be qualified by a provider identifier, so you could retrieve only the metadata package specified by a particular provider. I'm not sure going to this level of complexity is really warranted though.
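As a rough sketch of what qualifying the endpoint by provider could mean, the Python function below filters an array of metadata packages by the value of the source field. The function name, the optional `provider_id` argument, and the example provider terms are hypothetical; nothing here is specified by the proposal.

```python
def packages_for_provider(packages, provider_id=None):
    """Return all packages, or only those whose 'source' equals provider_id."""
    if provider_id is None:
        # Unqualified request: return every provider's package.
        return list(packages)
    return [pkg for pkg in packages if pkg.get("source") == provider_id]


# Made-up packages from two hypothetical providers for the same collection.
packages = [
    {"source": "provider-a", "organism": "9606", "aliases": ["hg38"]},
    {"source": "provider-b", "organism": "9606", "aliases": ["GRCh38"]},
]
```

This only works reliably if the source values come from a controlled vocabulary, which is the point of the suggestion above.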


andrewyatz · 2020-11-13T17:42:07Z

@jb-adams you have had some experience with service-info I think? Also one possible alternative to the metadata endpoint is to have the JSON send back the specific schema it uses

{
  "$id": "http://yourdomain.com/schemas/myschema.json",
  "$schema": "http://json-schema.org/schema#"
}

Both bits here shamelessly stolen from JSON Schema's basics page.

jb-adams · 2020-11-13T17:54:23Z

@andrewyatz @nsheff yes I'm quite familiar with /service-info. What are you thinking of in particular for this issue? At the outset, I think we can keep these design considerations in mind:

Implement the /service-info endpoint in such a way that it won't collide with another /service-info endpoint if multiple GA4GH API specs are implemented by a single web service (e.g. a refget + seqcol service). e.g. this may be as simple as implementing the endpoint at /collections/service-info rather than /service-info
Extend the /service-info endpoint with custom attributes specific to seqcol. In htsget we added an htsget object to the base service info response, which contains info about supported formats, etc. For seqcol, we could use this to inform clients whether an optional metadata endpoint is implemented, and/or what schema(s) are supported by the service if we want to allow for multiple metadata schemas
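The second point could look something like the Python sketch below, which builds a service-info response carrying a seqcol object and lets a client check whether the optional metadata endpoint is offered. The top-level fields follow the general shape of GA4GH service-info, but every value is made up, and the `seqcol` object with its `metadata_endpoint` and `metadata_schemas` keys is a hypothetical extension, not an agreed design.

```python
def build_service_info(metadata_supported, schema_urls):
    """Build an illustrative service-info body with a seqcol extension."""
    return {
        "id": "org.example.seqcol",
        "name": "Example seqcol service",
        "type": {"group": "org.ga4gh", "artifact": "seqcol", "version": "0.0.0"},
        "organization": {"name": "Example Org", "url": "https://example.org"},
        "version": "0.0.0",
        # Hypothetical seqcol-specific extension, analogous to the
        # "htsget" object in the htsget service-info response.
        "seqcol": {
            "metadata_endpoint": metadata_supported,
            "metadata_schemas": list(schema_urls),
        },
    }


def supports_metadata(service_info):
    """True only if the service explicitly advertises a metadata endpoint."""
    return bool(service_info.get("seqcol", {}).get("metadata_endpoint", False))


info = build_service_info(True, ["https://example.org/schemas/base.json"])
```

Serving this at /collections/service-info rather than /service-info would address the first point without changing the body.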

andrewyatz · 2020-11-13T17:56:45Z

It's more about how to offer the same endpoint but with extensions. In service-info we just allowed individual specifications in OpenAPI to inherit our base schema and extend. But that means you have to do it in openAPI and there is no way to access the schema bar going into OpenAPI. Maybe one to consider what our best practice is here

jb-adams · 2020-11-13T18:03:02Z

Oh, so how to extend the base ServiceInfo schema in OpenAPI? We did this in htsget with the allOf construct, basically importing all base attributes and adding our own:

htsgetServiceInfo:
      allOf:
        - '$ref': '#/components/schemas/ServiceInfo'
        - type: object
          properties:
            htsget:
              type: object
              description: extended attributes for htsget
              properties:
                datatype:
                  type: string
                  description: >
                    Indicates the htsget datatype category ('reads' or 'variants')
                    served by the ticket endpoint related to this service-info
                    endpoint
                  enum: [reads, variants]
                  example: reads
                formats:
                  type: array
                  description: >
                    List of alignment or variant file formats supported
                    by the htsget endpoint. If absent, clients cannot make 
                    assumptions about what formats are supported ahead
                    of making a query.
                  items:
                    type: string
                    enum: [BAM, CRAM, VCF, BCF]
                fieldsParameterEffective:
                  type: boolean
                  description: >
                    Indicates whether the web service supports alignment field
                    inclusion/exclusion via the `fields` parameter. If absent,
                    clients cannot make assumptions about whether the `fields`
                    parameter is effective ahead of making a query.
                tagsParametersEffective:
                  type: boolean
                  description: >
                    Indicates whether the web service supports alignment tag
                    inclusion/exclusion via the `tags` and `notags` parameters.
                    If absent, clients cannot make assumptions about whether the
                    `tags` and `notags` parameters are effective ahead of making
                    a query.
        - type: object
          description: >
            This response extends the GA4GH Service Info specification
            with htsget-specific properties under the 'htsget' attribute.
            ServiceType 'artifact' property MUST be 'htsget' for both reads 
            and variants endpoints.
          required:
            - type
          properties:
            type:
              type: object
              required:
                - artifact
              properties:
                artifact:
                  type: string
                  enum: [htsget]
                  example: htsget

You'll see 3 objects under the allOf parameter. In order, they:

import the base Service schema from service info
add extended attributes under the htsget property
constrain the type.artifact value so that only htsget is allowable

Is this what you're referring to?
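The combined effect of the three allOf entries can be sketched with a tiny checker in Python. This is an illustration under assumptions: `check` handles only the handful of JSON Schema keywords used here, and the `ServiceInfo` definition is a heavily trimmed stand-in for the real base schema, which has more required fields.

```python
def check(instance, schema, definitions):
    """Validate against a small subset: $ref, allOf, type, enum, required, properties."""
    if "$ref" in schema:
        name = schema["$ref"].rsplit("/", 1)[-1]
        return check(instance, definitions[name], definitions)
    # allOf: the instance must satisfy every subschema.
    for sub in schema.get("allOf", []):
        if not check(instance, sub, definitions):
            return False
    expected = schema.get("type")
    if expected == "object" and not isinstance(instance, dict):
        return False
    if expected == "string" and not isinstance(instance, str):
        return False
    if "enum" in schema and instance not in schema["enum"]:
        return False
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                return False
        for key, sub in schema.get("properties", {}).items():
            if key in instance and not check(instance[key], sub, definitions):
                return False
    return True


definitions = {
    # Trimmed stand-in for the base ServiceInfo schema (assumption).
    "ServiceInfo": {
        "type": "object",
        "required": ["id", "name"],
        "properties": {"id": {"type": "string"}, "name": {"type": "string"}},
    },
}

htsget_service_info = {
    "allOf": [
        {"$ref": "#/components/schemas/ServiceInfo"},  # 1. import the base
        {  # 2. add extended attributes under "htsget"
            "type": "object",
            "properties": {
                "htsget": {
                    "type": "object",
                    "properties": {
                        "datatype": {"type": "string", "enum": ["reads", "variants"]}
                    },
                }
            },
        },
        {  # 3. constrain type.artifact to "htsget"
            "type": "object",
            "required": ["type"],
            "properties": {
                "type": {
                    "type": "object",
                    "required": ["artifact"],
                    "properties": {"artifact": {"type": "string", "enum": ["htsget"]}},
                }
            },
        },
    ]
}

good = {"id": "a", "name": "b", "type": {"artifact": "htsget"}, "htsget": {"datatype": "reads"}}
```

A seqcol version would presumably follow the same pattern, with a seqcol object in place of htsget.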

andrewyatz · 2020-11-25T16:05:16Z

Discussions from the seqcol meeting just now said we should go the same route as refget, which specified the schema only in OpenAPI format. Also that this issue will get split into two to address the issue of having this endpoint (and if it is mandatory) and if so what is the format of that response (assuming I understood the resolution correctly)

sveinugu · 2020-11-25T16:46:07Z

Hi, and thanks for including me in the seqcol meeting! I am a senior engineer employed by ELIXIR Norway (at the University of Oslo).

So the reason I was invited was that I am one of the main developers of the FAIRtracks draft standard (and related tool infrastructure) for metadata of genomic track files, which is the result of an ELIXIR implementation study: http://fairtracks.github.io. FAIRtracks is available in the form of a set of JSON schemas: https://github.com/fairtracks/fairtracks_standard/. It is for now a suggestion and is meant to evolve. So obviously the metadata aspect of seqcol is of interest to me, and adding seqcol support would be a natural extension. A manuscript is written and will be submitted soon.

This seems to be a bit late in the process, so I hope I am not being too presumptuous here. I just wanted to present some initial thoughts:

As mentioned in the meeting, having a way for the metadata content to refer back to the schema would be useful for versioning purposes (and would be nice to be included in the first version, so that downstream implementations don't have to add a specific rule for the first version). In the FAIRtracks standard, we have added an '@Schema' field which contains a URL that includes a version string. Another useful feature of a '@Schema' field, as someone else mentioned, is to provide a simple way to validate the payload.
I think it would be an idea to ponder a bit on the FAIR principles (https://www.go-fair.org/fair-principles/). I think most of the points are already handled by the current specification or are not relevant, but there are at least some that pose a challenge:

"I2. (Meta)data use vocabularies that follow FAIR principles": As mentioned by @nsheff, it would be nice if all relevant fields, such as source, would point to an ontology or vocabulary.
"I3. (Meta)data include qualified references to other (meta)data". My main idea here is that it would be nice to have a pointer to a record describing the source content. In FAIRtracks, we make use of CURIE identifiers resolvable by https://identifiers.org for this (and we will probably also support n2t.net at some point). The seqcol digest is a new identifier that is meant to be used in place of such source-specific identifiers, but I think the metadata should contain the relation.
"R1. (Meta)data are richly described with a plurality of accurate and relevant attributes". This would not be natural for a minimal standard such as seqcol, but providing a resolvable source identifier (see I3) would make it relatively easy to access a larger set of relevant metadata fields. So the main approach of FAIRtracks in this context is to include the fields that are most useful, and refer to other records for the rest.

As to the question of whether metadata should be required or not, I am in the "required" camp, at least for the most important fields, which for me would be source identifier (as CURIE), organism identifier, and I think also version.
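The two mechanisms suggested above, a schema reference carrying a version string and CURIE-based identifiers, can be sketched in Python as follows. The key spelling "@schema", the schema URL layout, and the package contents are assumptions made for illustration; the only external convention relied on is that identifiers.org resolves URLs of the form https://identifiers.org/prefix:id.

```python
import re


def schema_version(schema_url):
    """Extract a version path segment such as 'v0.1.0' from a schema URL, or None."""
    match = re.search(r"/(v\d+(?:\.\d+)*)/", schema_url)
    return match.group(1) if match else None


def curie_to_url(curie):
    """Turn a CURIE like 'taxonomy:9606' into an identifiers.org URL."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    return f"https://identifiers.org/{prefix}:{local_id}"


# Hypothetical metadata package; the schema URL and values are made up.
package = {
    "@schema": "https://example.org/seqcol-metadata/v0.1.0/schema.json",
    "organism": "taxonomy:9606",
    "aliases": ["hg38"],
}
```

A client could use `schema_version(package["@schema"])` to pick the right parsing rules and `curie_to_url(package["organism"])` to reach the fuller external record, which is how a minimal standard can still point at rich metadata.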

nsheff · 2023-01-25T17:55:59Z

From today's discussion:

server-scoped metadata, like the schema we described above, should be served by /service-info (Define what the service info will contain #39)
collection-scoped or sequence-scoped metadata don't fit under /service-info. For these, they could maybe go into a /metadata endpoint? But I think we should wait for discussion on Discussion on undigested attributes and sorted-name-length-pairs #40, since the "undigested attributes" could correspond to metadata, so how to serve those will come out in that discussion.

nsheff · 2024-02-21T19:31:08Z

Solved with "no metadata endpoint" decision in #54.

tcezard mentioned this issue Nov 23, 2020

Will the API offer an alias to digest conversion endpoint? #4

Open

nsheff mentioned this issue Jan 11, 2023

Define what the service info will contain #39

Open

nsheff mentioned this issue Jul 26, 2023

Add ADR for no metadata endpoint #54

Merged

nsheff added the likely-solved label Jul 26, 2023

nsheff closed this as completed Feb 21, 2024
