Closer alignment with JSON Schema #46

Closed
jpmckinney opened this Issue May 22, 2013 · 16 comments


@jpmckinney
Member

It seems to me like you can use JSON Schema (or something very close) to describe CSVs just as well as JSON files, where instead of:

{
  "fields": [{
    "id": "foo",
    "label": "bar",
    "description": "...",
    "type": "date",
    "format": "YYYY.MM.DD"
  }, {
    ...
  }]
}

You'd have:

{
  "$schema": "http://link-to-dataprotocols-dialect-of-json-schema.tld/path",
  "properties": {
    "foo": {
      "title": "bar",
      "description": "...",
      "type": "date",
      "format": "YYYY.MM.DD"
    },
    "baz": {
      ...
    },
    ...
  }
}

format has a different meaning in JSON Schema, so maybe mint a scheme property.

JSON Schema is already well supported, so I figure it would be a benefit to try to be as close to it as possible.

@rufuspollock
Contributor

This is a really good idea - and there is already re-use of some of the json schema with the types etc. In terms of mappings:

  • fields => properties
  • description + type are already identical
  • label would have to remap to title
  • format - my reading of http://json-schema.org/latest/json-schema-validation.html#anchor104 indicates that format as used there is compatible with the one here, so it would be OK to converge
    • the only difference is that JTS allows date/datetime as actual types, whereas JSON Schema treats them as formats of string values rather than as types
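
A minimal sketch of the mapping listed above, assuming the field keys from the example in this issue (id, label, description, type, format). The function name fields_to_properties is hypothetical and not part of any spec or library; the sketch only renames keys and does not address column order.

```python
def fields_to_properties(fields):
    """Turn a JTS-style list of fields into a JSON Schema-style "properties" object.

    Hypothetical helper for illustration: "id" becomes the property key and
    "label" is renamed to "title"; all other keys are copied unchanged.
    """
    properties = {}
    for field in fields:
        # Copy everything except the id, which becomes the key.
        prop = {key: value for key, value in field.items() if key != "id"}
        if "label" in prop:
            prop["title"] = prop.pop("label")
        properties[field["id"]] = prop
    return {"properties": properties}


jts = {
    "fields": [
        {
            "id": "foo",
            "label": "bar",
            "description": "...",
            "type": "date",
            "format": "YYYY.MM.DD",
        }
    ]
}
schema = fields_to_properties(jts["fields"])
```

Because ids become object keys, two fields with the same id would overwrite each other in this sketch.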

Challenges:

  • order matters for properties/fields in JTS but does not for json schema. How do we handle this? (We could introduce an order argument)
@jpmckinney
Member

Yes, I think a position or order field would be fine. On re-reading the format docs, it seems you can indeed add your own custom formats.
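
As a rough sketch of what a position field could look like, assuming a key literally named "position" (the name is only a suggestion in this thread, not something JSON Schema or JTS defines) and hypothetical helper names:

```python
def fields_to_ordered_properties(fields):
    """Store each field under its id and record its column index as "position"."""
    properties = {}
    for position, field in enumerate(fields):
        prop = {key: value for key, value in field.items() if key != "id"}
        prop["position"] = position  # assumed key name, for illustration only
        properties[field["id"]] = prop
    return properties


def properties_to_fields(properties):
    """Rebuild the ordered field list by sorting on the "position" key."""
    ordered = sorted(properties.items(), key=lambda item: item[1]["position"])
    fields = []
    for name, prop in ordered:
        field = {"id": name}
        field.update({k: v for k, v in prop.items() if k != "position"})
        fields.append(field)
    return fields


example_fields = [
    {"id": "foo", "type": "date"},
    {"id": "baz", "type": "string"},
]
example_properties = fields_to_ordered_properties(example_fields)
```

With this approach the column order survives even if a JSON parser or serializer reorders the object's keys, at the cost of an extra key on every property.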

@rufuspollock
Contributor

I have to say I don't like the order item that much but I guess it would be essential.

What's the neatest example of json schema handling for arrays?

It also seems that id is a special keyword in json schema (obviously not so much an issue if we switch to keys approach but that has challenges too - like required uniqueness ...)

@jpmckinney
Member

I hadn't thought of non-unique table headers - maybe use a composite key of position and header?

I haven't seen examples where JSON schema handles array order. For an unordered array, see e.g. "Set of products schema" on this page.

@rufuspollock
Contributor

@jpmckinney hmmm, the issue is that an array itself is not a natural part of json schema - we aren't talking here about describing arrays using json schemas but about how we use json schema itself ...

Two options:

  • Keep arrays, which means this won't fit json schema perfectly
  • introduce order / position field - this feels kind of hacky but is possible
@trickvi
Contributor
trickvi commented Jun 20, 2013

I like this idea!

I'm wondering why we would try to use properties when json schema defines items. I say we use items instead of fields. Pretty simple!

I'm also confused by the use of the $schema attribute. I understand it if we have a standard csv format, e.g. for spending data, but for varying data it might be difficult to keep track of all different schemas (even though it is recommended) since these schemas are a part of the datapackage.json file and not individual schema files. Another option would be to use the extends attribute and provide some basic schemas somewhere central.

@jpmckinney
Member

@tryggvib $schema can just point to the version of JSON Schema that you are currently using. I'm not sure items saves us. items is for describing arrays, but CSV doesn't have arrays. The properties defined under items aren't ordered either. The question here is how to communicate array order. @rgrp gave the two options.

@trickvi
Contributor
trickvi commented Jun 20, 2013

Hmmm... Well using JSON schema to describe a CSV is probably not what it was originally intended for ;) A CSV doesn't have an array but each row in the CSV can be seen as an array so I think that would be a good fit.

An array in JSON is ordered (at least according to the spec on json.org, where it says "An array is an ordered collection of values.").

So here's what I'm thinking:

{
    "$schema": "http://link-to-dataprotocols-sdf-schema/",
    "items": [
        {
            "id": "foo",
            "title": "bar",
            "description": "...",
            "type": "date",
            "format": "YYYY.MM.DD"
        },
        {
            "id": "baz",
            ...
        },
        ...
    ]
}

So items is a list of subschemas.

@jpmckinney
Member

@tryggvib Ah, but items in JSON Schema is not an array! It's an object (hash).

@trickvi
Contributor
trickvi commented Jun 20, 2013

@jpmckinney I'm not sure I understand you there. JSON schema says this about items: "This attribute defines the allowed items in an instance array, and MUST be a schema or an array of schemas.". In more detail:

When this attribute value is a schema and the instance value is an
array, then all the items in the array MUST be valid according to the
schema.

When this attribute value is an array of schemas and the instance
value is an array, each position in the instance array MUST conform
to the schema in the corresponding position for this array.

So it's only an object if all items are supposed to follow the same schema. Since a CSV row has different kinds of values in its columns, we could use an array of schemas to define it (which must abide by the position condition).

The problem with that is the header row, the only row that doesn't necessarily conform to the schema. Either we define that as part of the json schema (and make it harder to read) or we skip it (like the current schema does).
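
To illustrate the positional reading of items quoted above, here is a small hand-rolled sketch that checks one CSV row against an array of subschemas. It is not a JSON Schema validator: the function name, the casting table, the restriction to string/integer/number types and the length check are all assumptions made for the example. The header row is skipped by the caller.

```python
# Assumed mapping from type names to Python casts; CSV cells arrive as strings.
CASTERS = {"string": str, "integer": int, "number": float}


def validate_row(row, item_schemas):
    """Return a list of error messages; an empty list means the row conforms.

    The value at each position is checked against the subschema at the same
    position, mirroring the "array of schemas" form of items.
    """
    errors = []
    if len(row) != len(item_schemas):
        # Assumption for this sketch: a row must have exactly one value per subschema.
        errors.append(f"expected {len(item_schemas)} values, got {len(row)}")
    for index, (value, subschema) in enumerate(zip(row, item_schemas)):
        caster = CASTERS[subschema["type"]]
        try:
            caster(value)
        except ValueError:
            errors.append(
                f"column {index} ({subschema['id']}): "
                f"{value!r} is not a valid {subschema['type']}"
            )
    return errors


items = [
    {"id": "name", "type": "string"},
    {"id": "count", "type": "integer"},
]
rows = [["name", "count"], ["apples", "3"], ["pears", "many"]]
# Skip the header row (rows[0]) and validate the data rows only.
results = [validate_row(row, items) for row in rows[1:]]
```

Here the first data row yields no errors, while the second reports a problem in the count column.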

@jpmckinney
Member

Indeed, you are correct. I'd never seen examples where the value of items was an array, but it does seem to validate against linters: http://jsonschemalint.com/

@rufuspollock
Contributor

I generally like this though I must say items seems a lot less descriptive than fields (I note that e.g. Google BigQuery schema also uses fields ...)

@rufuspollock rufuspollock referenced this issue in okfn/ideas Oct 5, 2013
Open

The Good Data Toolchain #71

@rufuspollock
Contributor

@paulfitz another request for your (excellent) thoughts here. I'm still in two minds about what to do here. Real compatibility would involve quite significant changes to JTS and would interact with many of the existing proposals. But that does not mean this is a "bad thing" (tm).

@pwalsh
Member
pwalsh commented Mar 7, 2016

@rgrp @danfowler @jpmckinney

We are well beyond this discussion by now. I think we've got as much from JSON Schema as we'll get into the JTS spec by now. Can this issue be closed?

@rufuspollock
Contributor

FIXED / WONTFIX. @pwalsh I think you are right - at this point we are unlikely to get that much more alignment, and we already have a fair amount (e.g. reuse of types etc).

@jpmckinney if you have any thoughts here or think we should reopen please say!

@jpmckinney
Member

I agree that we've probably gone as far as we can on this. Thanks!
