This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Provide a JSON-LD context file or files #311

cmungall opened this issue May 23, 2015 · 4 comments

cmungall (Member) commented May 23, 2015

A JSON-LD context file provides an unambiguous, machine-interpretable way to translate a JSON document to an RDF serialization. Even in cases where translation to RDF is not a desired goal, a JSON-LD context file can be a useful way of clarifying certain aspects of the GA4GH schema, such as how identifiers should be specified.

Broadly speaking, the JSON-LD context file would provide mappings to RDF URIs for the following:

  • The keys in a JSON/Avro object; for example, a key that specifies the position of a variant could be mapped to a FALDO property
  • The values in a JSON/Avro object; for example, an enum-type value could be mapped to an ontology class URI, or an ontology class ID could be mapped to its formal URI
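As a sketch of what such a context might look like (the FALDO terms are real, but the key names and example.org IRIs are illustrative assumptions, not a proposal), a key like `start` can be mapped to a FALDO property, and an enum value to a FALDO class via `"@type": "@vocab"` coercion:

```json
{
  "@context": {
    "faldo": "http://biohackathon.org/resource/faldo#",
    "start": {"@id": "faldo:begin"},
    "strand": {"@id": "http://example.org/ga4gh/strand", "@type": "@vocab"},
    "POSITIVE_STRAND": "faldo:ForwardStrandPosition"
  }
}
```

With the `@vocab` coercion on `strand`, a plain enum string such as `"POSITIVE_STRAND"` expands to the class IRI defined for that term, while ordinary string values elsewhere are untouched.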

Note that I am proposing this be adopted in a non-invasive way. The JSON-LD context can be safely ignored by developers seeking to consume JSON, and the adoption of a JSON-LD schema should not affect modeling decisions in the Avro schemas. To make a JSON document JSON-LD, all that is necessary is to add an "@context" object in the header, but even this should not be required; rather, there should be a simple translation from the Avro-specified JSON to JSON-LD that adds it.
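The non-invasive translation described here is essentially a single dictionary merge; a minimal sketch, in which the document keys and the context URL are hypothetical:

```python
import json

# A plain Avro-derived JSON document (hypothetical example fields).
doc = {"id": "var123", "referenceName": "chr1", "start": 42}

# The only change needed to make this JSON-LD is injecting "@context";
# consumers that ignore JSON-LD still see exactly the same keys and values.
ld_doc = {"@context": "http://example.org/ga4gh-context.jsonld", **doc}

print(json.dumps(ld_doc, indent=2))
```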

One area where a GA4GH context file would immediately provide some clarification is issue #165, where there is currently no consensus on the form of an ID used to denote an ontology class. If GA4GH uses this OBO JSON-LD context, or some subset of it, this would provide an unambiguous way of writing identifiers for any OBO library class.
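Such a context boils down to prefix-to-IRI-stem mappings, so identifier expansion is deterministic; a minimal sketch, where the two prefixes shown are an assumed tiny subset of the full context:

```python
# Hypothetical subset of an OBO-style JSON-LD context: each prefix maps
# to an IRI stem, so a CURIE such as "SO:0001059" expands unambiguously.
OBO_CONTEXT = {
    "SO": "http://purl.obolibrary.org/obo/SO_",
    "GO": "http://purl.obolibrary.org/obo/GO_",
}

def expand_curie(curie: str, context: dict) -> str:
    """Expand a prefix:local CURIE to a full IRI; pass through unknowns."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in context:
        return context[prefix] + local
    return curie

print(expand_curie("SO:0001059", OBO_CONTEXT))
# → http://purl.obolibrary.org/obo/SO_0001059
```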

The use of JSON-LD contexts was brought up by @tetron in the ever-expanding discussion of #264, and may help clarify certain aspects of that discussion.

A JSON-LD context file that provides complete coverage of all keys used across the union of all Avro modules would be a larger task. This need not all be done at once, and need not be done as an 'official' GA4GH project (although it would be better to avoid a situation where we have competing JSON-LD contexts).


tetron commented May 23, 2015

An important thing to keep in mind: the JSON-LD context file needs to stay in sync with the Avro schema that it maps from, which may be challenging if it has to be updated manually against an evolving spec.

For the Common Workflow Language effort, which is associated with the GA4GH Containers and Workflows task team, we have been working on extending the Avro schema language with annotations that enable automatic generation of the JSON-LD context and RDFS schema from the Avro schema. Example schema with annotations:

https://github.com/common-workflow-language/common-workflow-language/blob/master/schemas/draft-2/cwl-avro.yml

The processing code is here:

https://github.com/common-workflow-language/common-workflow-language/tree/master/reference/cwltool/avro_ld

A few notes:

This is formatted using YAML instead of JSON for ease of writing inline documentation, since YAML supports multiline string literals and plain JSON doesn't.

It supports data type definitions; I have not tried to use it with protocol definitions, but I don't expect much additional work would be required for the JSON notation.

It does not support the Avro IDL syntax currently used to write most GA4GH schemas. I'm not sure how best to go about implementing that.

The processor also implements record subclassing, abstract types, template types with specialization, and documentation generation. Example:

http://common-workflow-language.github.io

I am considering splitting this out from CWL into its own project; however, to make it useful to the rest of GA4GH it would need outside contributors, because my time is pretty limited.

cmungall (Member, Author) commented

@tetron - good point about synchronization. But we already have this problem whenever the schema changes and an implementation doesn't. It should be possible to automatically check that the LD context and the Avro schema are in sync using some simple tooling.
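That sync check could be as simple as a set comparison between a record's field names and the context's terms; a sketch over deliberately out-of-sync toy inputs (the field names and the FALDO IRI are illustrative):

```python
# Toy Avro record schema and JSON-LD context, deliberately out of sync.
avro_schema = {
    "type": "record",
    "name": "Variant",
    "fields": [{"name": "id"}, {"name": "start"}, {"name": "end"}],
}

context = {
    "id": "@id",
    "start": "http://biohackathon.org/resource/faldo#begin",
    # note: no mapping for "end"
}

field_names = {f["name"] for f in avro_schema["fields"]}
unmapped = field_names - set(context)   # fields with no JSON-LD term
stale = set(context) - field_names      # terms with no matching field

print("unmapped:", sorted(unmapped))  # → unmapped: ['end']
print("stale:", sorted(stale))        # → stale: []
```

A real checker would recurse over every record in every module, but the core of it is this comparison.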

The idea of annotating Avro with additional information is an interesting one. This would require broad agreement and substantial changes across the GA4GH. I would prefer to discuss this in a separate ticket.

The proposal I have outlined would have virtually zero impact on existing schema development and implementation, and would be an optional add-on (the only area that might be impacted would be forcing the adoption of a set of standardized prefixes for identifiers, which I think would be a good thing).

This may be sub-optimal in the long run, and it may be better to eventually tightly couple the json-ld and the avro. But perhaps best to do things incrementally?


tetron commented May 24, 2015

Actually, I disagree about having zero impact on existing schema development. To apply JSON-LD successfully, several details that affect the ability to capture all the semantics of idiomatic JSON have to be accounted for in schema design:

  • JSON lists can be interpreted either as an unordered set of property values or as an ordered list associated with a single property. It becomes important to document whether order is significant for lists in the base (Avro) data model.
  • json-ld mostly assumes all keys are known ahead of time and ignores anything it does not understand, with some allowance for hacks based on @vocab. In particular, map types with arbitrary keys are hard to map to rdf successfully.
  • there is a tendency to re-use field names for similar or related but not identical purposes, where the exact meaning is derived from the type of object the property is attached to. This causes semantic ambiguity and problems for rdfs entailment. Fields need to be named precisely.
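On the first point, at least the chosen interpretation can be made explicit in the context itself: JSON-LD lets each property declare an ordered or unordered container (the property IRIs below are illustrative):

```json
{
  "@context": {
    "aliases": {"@id": "http://example.org/ga4gh/aliases", "@container": "@set"},
    "cigar": {"@id": "http://example.org/ga4gh/cigar", "@container": "@list"}
  }
}
```

`@set` values become repeated, unordered property assertions in RDF, while `@list` produces an rdf:List that preserves order.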


cmungall commented Jul 4, 2015

@tetron good points

  • lists: do we know how often order-dependent lists are used? Maybe we could force explicit semantics, e.g. lists of objects with rank fields. The obvious objection is bulkier objects.
  • maps: I forgot about these. They seem to be used for extensible tag-value property lists. Options here are to force more explicit modeling (e.g. a TagValue record, which would have other advantages, albeit at the cost of some bulkiness), or to have an extra-schema-level restriction on what the keys can be
  • identical key name in different records meaning different things: is it not possible to override this in JSON-LD on a per-object type basis?
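On the last question: in JSON-LD 1.0 (the current spec as of this discussion) a term maps to a single IRI for the whole document, so not directly; later JSON-LD 1.1 work on type-scoped contexts allows exactly this kind of per-record override (all IRIs here are illustrative):

```json
{
  "@context": {
    "Variant": {
      "@id": "http://example.org/ga4gh/Variant",
      "@context": {"start": "http://biohackathon.org/resource/faldo#begin"}
    },
    "ReadAlignment": {
      "@id": "http://example.org/ga4gh/ReadAlignment",
      "@context": {"start": "http://example.org/ga4gh/alignmentStart"}
    }
  }
}
```

Here `start` expands to a different IRI depending on the `@type` of the enclosing object, which is the per-object-type override being asked about.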
