sharing schema across resources #71

Closed
sballesteros opened this Issue Oct 16, 2013 · 21 comments

@sballesteros

If I have a list of resources (several csv files for instance) each having the very same schema, is there a way to factor the schema or do I have to repeat it for each resource ?

One way to solve that would be to allow path to be an array of files instead of a single file but I am afraid that this would break a lot of things (including the discussed foreignkey support)

@rufuspollock
Contributor

Really glad you opened this - we've been thinking about this for a while but hadn't opened an issue (and I recently ran into this with https://github.com/okfn/publicbodies). Could I suggest you raise this on-list at http://lists.okfn.org/mailman/listinfo/data-protocols?

Questions:

  • Can you reference a common schema within a data package (and where is that schema located)
  • Can you reference externally located schemas? (Or schemas located within another data package)

It is worth looking here at JSON schema which has $ref which can be a URL to another JSON schema or relative to current schema. See e.g. http://json-schema.org/example1.html

My straw man suggestion which would support both options is:

{
  "resources": [
    {
    ...
      "schema": {
         # if url - url should be of form: http://url-to/datapackage.json#schema-name
         "$ref": schema-name (or url)
      }
    }
  ],
  "schemas": {
    "schema-name": {
      schema goes here ...
    }
  }
}

This introduces 2 new things:

  • schemas entry in datapackage.json
  • $ref attribute in a schema

What do you think?

@rufuspollock
Contributor

@sballesteros @paulfitz any thoughts on whether you think this is a sensible approach?

@maxogden don't know if this is ever relevant to your usecase? (Do you ever want to share schemas for CSVs across multiple projects?)

@sballesteros

looks great to me!

a new (optional) schemas hash, and $ref in the schema property of a resource containing a JSON pointer.
For the url, I would suggest to be more general and use:
http://url-to-anything#JSON-pointer-to-schema as some schemas can be stored elsewhere than in datapackages.
So in your example:
I would replace:

http://url-to/datapackage.json#schema-name

by:

http://url-to/datapackage.json#schemas/schema-name
@paulfitz
Contributor

Low-level point: Perhaps "schema": "schema-name" would be a slightly simpler syntax? So instead of schema being strictly a hash, then having a $ref exception, schema could be instead a string (name or url reference) or a hash (in-place schema).

Higher-level point: the suggestion seems reasonable - but the one thing that makes me nervous is that, as the schema moves further away from the data, are they more likely to get out of sync? One particular case seems useful to think about: a column being appended to a (remote) schema, without the local table getting updated (or its maintainer even knowing there was a change). Could that just be defined as OK? That it is valid to have a table with fewer columns than the specified schema? It would be reasonable to outlaw other inconsistencies, but allowing this one could save some unnecessary hassles down the line. Not exactly sure if this is nailed down one way or the other in the spec.
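
The relaxed rule floated here (a table may carry fewer columns than the shared schema, as long as the columns it does have line up) could be checked roughly as follows. This is a minimal sketch under one assumed reading, namely that the table's header must be a leading prefix of the schema's field names. The function name `header_matches_schema` and the sample schema are hypothetical, and the spec does not define this behaviour.

```python
def header_matches_schema(header, schema):
    """Return True if `header` is a leading prefix of the schema's field names.

    Assumption: columns appended to a shared schema later are tolerated,
    while renamed, reordered, or extra local columns are not.
    """
    field_names = [field["name"] for field in schema["fields"]]
    return header == field_names[:len(header)]


shared_schema = {
    "fields": [
        {"name": "id", "type": "integer"},
        {"name": "name", "type": "string"},
        {"name": "added_later", "type": "string"},  # appended upstream
    ]
}

print(header_matches_schema(["id", "name"], shared_schema))  # older table, still accepted
print(header_matches_schema(["name", "id"], shared_schema))  # reordered, rejected
```

A stricter or looser policy (for example matching by name regardless of order) would be an equally defensible choice; the point is only that the tolerated inconsistency has to be spelled out somewhere.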

@sballesteros

Low-level point: Perhaps "schema": "schema-name" would be a slightly simpler syntax? So instead of schema being strictly a hash, then having a $ref exception, schema could be instead a string (name or url reference) or a hash (in-place schema).

+1

@rufuspollock
Contributor

@paulfitz would you like to have a stab at an actual pull-request for this?

@rufuspollock
Contributor

Based on @paulfitz's suggestion, here is an updated (and possibly final!) proposal:

schema entry in resource MAY be a string or a Hash. If a Hash, it should follow a specification appropriate to this data package type (e.g. JSON Table Schema for Simple Data Format data packages).

If it is a string it must be:

  • EITHER: a URL. The URL must consist of a URL to a data package datapackage.json file combined with an anchor section following JSON Pointer notation, pointing to the schema within that data package
  • OR: a simple string name which MUST correspond to an entry in the schemas hash in the same datapackage.json file

schemas hash. A Data Package MAY have a schemas Hash. Each schema name in the Hash must consist only of lower-case alphanumeric letters, together with - and _. Each value for an entry must be a Hash specifying an appropriate schema.
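
The naming rule for the schemas hash can be checked mechanically. The sketch below assumes "lower-case alphanumeric" means ASCII `a-z` and `0-9`; the helper name `invalid_schema_names` and the sample package are hypothetical.

```python
import re

# Lower-case ASCII alphanumerics plus "-" and "_" (ASCII-only is an assumption).
SCHEMA_NAME = re.compile(r"[a-z0-9_-]+")


def invalid_schema_names(datapackage):
    """Return the names in the optional `schemas` hash that break the rule."""
    schemas = datapackage.get("schemas", {})
    return [name for name in schemas if not SCHEMA_NAME.fullmatch(name)]


package = {"schemas": {"xyz-schema": {}, "Bad Name": {}}}
print(invalid_schema_names(package))  # lists only "Bad Name"
```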

Examples:

{
  "resources": [
    {
    ...
      "schema": "http://url-to/datapackage.json#schemas/schema-name"
    }
  ],
...
}

{
  "resources": [
    {
    ...
      "schema": "xyz-schema"
    }
  ],
  "schemas": {
    "xyz-schema": {
      schema goes here ...
    }
  }
}
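
For tool authors, resolving the local-name case might look roughly like this. It is a sketch only: `resolve_schema` is a hypothetical helper, the file names are made up, URL references are deliberately left unhandled, and a resource without a schema is treated as an error purely to keep the example short.

```python
def resolve_schema(resource, datapackage):
    """Return the schema hash for `resource`, following a local name if needed."""
    schema = resource.get("schema")
    if isinstance(schema, dict):
        return schema  # in-place schema, nothing to look up
    if isinstance(schema, str):
        if "://" in schema:
            # URL references need a fetch; left out of this sketch.
            raise NotImplementedError("remote schema references are not handled here")
        try:
            return datapackage["schemas"][schema]
        except KeyError:
            raise ValueError(f"no entry named {schema!r} in the schemas hash") from None
    raise TypeError("schema must be a string or a hash")


package = {
    "resources": [
        {"path": "a.csv", "schema": "xyz-schema"},  # hypothetical files
        {"path": "b.csv", "schema": "xyz-schema"},
    ],
    "schemas": {"xyz-schema": {"fields": [{"name": "id"}]}},
}

resolved = [resolve_schema(r, package) for r in package["resources"]]
print(resolved[0] is resolved[1])  # both resources share the one schema object
```
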
@rufuspollock
Contributor

@paulfitz @sballesteros @ldodds what do you think?

@ldodds
Contributor
ldodds commented Mar 10, 2014

I like the basic proposal, but personally I think it would be better to have the URL refer to a JSON document which is the schema itself, i.e. it conforms to JSON Table schema, not a datapackage containing a schema. I don't think that adds any value, especially if you're just exposing a schema for others to follow: there would be nothing in the package.

You could support both, though:

  • Either URL resolves to a JSON document that conforms to JSON Table schema, or
  • URL includes a fragment identifier which conforms to JSON Pointer. A validator would fetch the document and extract the schema from the designated element. That achieves the same goal but adds flexibility around what the overall JSON structure might be.
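
The two options above can share one code path: with no fragment the whole document is the schema, and with a fragment the JSON Pointer selects part of it. A minimal sketch, assuming the caller supplies a `fetch` callable that returns parsed JSON (here faked with a dict so nothing touches the network); `resolve_pointer`, `extract_schema`, and the example.com URLs are hypothetical.

```python
from urllib.parse import unquote, urldefrag


def resolve_pointer(document, pointer):
    """Walk `document` following a JSON Pointer (RFC 6901) string."""
    if pointer == "":
        return document  # empty pointer means the whole document
    if not pointer.startswith("/"):
        raise ValueError(f"not a JSON Pointer: {pointer!r}")
    node = document
    for token in pointer[1:].split("/"):
        # Unescape in the order RFC 6901 requires: "~1" first, then "~0".
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def extract_schema(url, fetch):
    """Fetch the document at `url` and return the part its fragment points to."""
    base, fragment = urldefrag(url)
    return resolve_pointer(fetch(base), unquote(fragment))


# Stand-in for real HTTP fetching (assumption: documents are already parsed).
documents = {
    "http://example.com/schema.json": {"fields": [{"name": "id"}]},
    "http://example.com/datapackage.json": {"schemas": {"xyz-schema": {"fields": []}}},
}
fetch = documents.__getitem__

print(extract_schema("http://example.com/schema.json", fetch))
print(extract_schema("http://example.com/datapackage.json#/schemas/xyz-schema", fetch))
```

Note that a strict JSON Pointer fragment begins with `/`, so this sketch would reject the `#schemas/schema-name` form used in earlier comments.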
@rufuspollock
Contributor

@ldodds very good points and I guess we can just use JSON Pointer notation even where we do point to another datapackage. Personally, I think pointing within a given datapackage file will (initially) prove most useful and likely the most reliable (external dependencies are a bit dangerous as they could easily break).

My one concern here is the burden placed on tools to recognize this. I guess a first pass would be to support this in the normalize code in https://github.com/okfn/datapackage-read and see how tough this is.

@Floppy
Floppy commented Nov 7, 2014

+1 from me as well on this, would be great to have.

@Floppy
Floppy commented Nov 7, 2014

I can give this a go on my repository at http://github.com/SomethingNewUK/finances. I'll take a look at the current proposal, and see what I can do.

@Floppy
Floppy commented Nov 26, 2014

So I'm now doing this in two different ways:

All thoughts welcome!

@Floppy
Floppy commented Nov 26, 2014

Reading back, looks like what I've done basically conforms to @rgrp's (final?) proposal on Jan 12. Only difference is that I'm linking straight to the Popolo JSON schemas, and they aren't in JSON Table Schema format. Both are slightly awkward in that of course I don't control them. I guess the thing to do would be to wrap them with a datapackage that then describes what they are? Not sure.

@rufuspollock
Contributor

This is a very useful demonstration. The point re linking to popolo is interesting. What does the schema look like there exactly? I certainly think we should probably be able to link to any json file on the web but the schema would need to conform to json table schema structure. I didn't think popolo yet had a version that was json-table-schema'd but that could be useful (also relevant for https://github.com/okfn/publicbodies)

@rufuspollock
Contributor

OK, I'm going to implement this into the Data Package spec but I am wondering whether I should split this out in a separate mini-spec e.g. data-package-schema-references or something ...

@paulfitz @pwalsh @jpmckinney ... wdyt?

@rufuspollock rufuspollock added a commit that closed this issue May 26, 2015
@rufuspollock rufuspollock [dp,!][m]: add support for sharing schemas across resources via schema references - fixes #71.

* Breaking: not strictly a breaking change as an enhancement but places new requirements on processors so flagging as potentially breaking for them.
c4d75a3
@rufuspollock
Contributor

FIXED but discussion still open - this could be split out, and I'm interested in whether there are any sharp edges in the new material (one example: how does a processor know whether a schema string value is a url or a local reference - I guess one can check whether it starts with http but that seems sort of hacky ...)

@jpmckinney
Member

If a local reference, the value should be a fragment identifier as in JSON Schema, i.e. starting with #.

http://json-schema.org/latest/json-schema-core.html#anchor31

@rufuspollock
Contributor

@jpmckinney is this a convention or part of json pointer? I like the idea though as a simple way of disambiguating ...

@jpmckinney
Member

JSON Schema can use JSON Pointer. The local references can start with #/ as in JSON Pointer's URI Fragment Identifier Representation.
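
With that convention, telling the two kinds of string apart no longer depends on sniffing for `http` alone. A rough sketch of the check, where `classify_schema_reference` is a hypothetical name and treating only http/https as remote is an assumption:

```python
from urllib.parse import urlsplit


def classify_schema_reference(value):
    """Classify a string `schema` value as 'local', 'remote', or 'unknown'."""
    if value.startswith("#"):
        return "local"  # fragment-only reference into the same file
    if urlsplit(value).scheme in ("http", "https"):
        return "remote"
    return "unknown"  # e.g. a bare name such as "xyz-schema"


for ref in ("#/schemas/xyz-schema",
            "http://url-to/datapackage.json#/schemas/schema-name",
            "xyz-schema"):
    print(ref, "->", classify_schema_reference(ref))
```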

@Floppy
Floppy commented May 27, 2015

Fantastic, thanks @rgrp. I am now standard-compliant again :)
