Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[jts] Refine field/column data types in JSON Table Schema #159

Closed
rufuspollock opened this issue Dec 24, 2014 · 31 comments
Closed

[jts] Refine field/column data types in JSON Table Schema #159

rufuspollock opened this issue Dec 24, 2014 · 31 comments

Comments

@rufuspollock
Copy link
Contributor

This is a discussion issue for potentially refining the list of column / field data types in JSON Table Schema:

  • string: a string (of arbitrary length)
  • number: a number including floating point numbers.
  • integer: an integer.
  • date: a date. This MUST be in ISO6801 format YYYY-MM-DD or, if
    not, a format field must be provided describing the structure.
  • time: a time without a date
  • datetime: a date-time. This MUST be in ISO 8601 format of
    YYYY-MM-DDThh:mm:ssZ in UTC time or, if not, a format field must be
    provided.
  • boolean: a boolean value (1/0, true/false).
  • binary: base64 representation of binary data.
  • object: (alias json) an JSON-encoded object
  • geopoint: has one of the following structures:
{ lon: ..., lat: ... }

[lon,lat]

"lon, lat"

Also worth comparing against the (in-progress W3C spec - which has been informed by this spec ...): http://w3c.github.io/csvw/metadata/index.html#datatypes

@rufuspollock rufuspollock changed the title Refine list of field/column data types in JSON Table Schema [jts] Refine field/column data types in JSON Table Schema Dec 24, 2014
@pwalsh
Copy link
Member

pwalsh commented Jan 20, 2015

cc @tryggvib

Suggestion:

Several data types are just format patterns for strings.

JSON Schema has a set of "built-in" formats, and "allows" for custom formats.

Getting JSON Table Schema closer to JSON Schema in this regard would potentially be useful (3rd party tools for validation, etc.).

For built-in formats of the string type in JSON Schema, see:

http://spacetelescope.github.io/understanding-json-schema/reference/string.html#format

I propose the following:

  • Support all the builtin JSON Schema formats explicitly: date-time, email, hostname, ipv4, ipv6 and uri
    • I grant that hostname, ipv4 and ipv6 may not seem terribly necessary
  • Remove the JSON Table Schema datetime type (it will be replaced by the date-time format on the string type)
  • Remove the JSON Table Schema date and time types, and replace them with date and time as accepted formats for the string type
  • While I'm at it :), I don't really see any need to maintain a json alias for the object type. It is (potentially) confusing when array is also JSON.

I believe that this simplifies JSON Table Schema, by correctly differentiating between a type and a format that the type should conform with.

Following from this, there is another opportunity to simplify the spec with regards to geopoint and geojson. Again, these are really formats for (array or object or string), and I suggest would be better represented as such:

eg:

  • {"type": "string", "format": "geopoint"} > "lon, lat"
  • {"type": "array", "format": "geopoint"} > [lon, lat]
  • {"type": "object", "format": "geopoint"} > {lon: ..., lat: ...}
  • {"type": "object", "format": "geojson"} > { ... }

@rufuspollock
Copy link
Contributor Author

@pwalsh need to think about this one. I'm not sure how much we get moving the "complexity" down into "format" (though maybe, as you suggest, we can reuse json schema stuff more ...). I also acknowledge that format is currently underspecified generally.

Could you also summarize in a list what the new set of types (+ formats) would be.

/cc @paulfitz @jpmckinney

@pwalsh
Copy link
Member

pwalsh commented Jan 21, 2015

I'll give one example of where tightening the type/format relationship could be useful for code that implements the spec:

As a user, if I have an object that is geojson, I'd define it as type geojson in the current spec.

With the current spec, the validation library would have to explicitly support geojson as a type - I don't think any do at present (may be wrong though..).

However, if the user was able to declare the object as type object and format geojson, we might expect that the validation library could at least validate it as a validly formed object, even if it doesn't support the geojson format.

By reducing the primitive type set, I think it makes it easier in the following areas:

  • For people writing libraries around JTS, to provide full type support, and then iterate on format support.
  • For people working on the JTS spec, to introduce new formats which are subsets of types
  • It also gives the user a useful level of assurance as to the validity of the data, even if the format check is lacking.

So in general I'm suggesting:

  • type array
    • format geopoint
  • type boolean
  • type integer
  • type number
  • type null
  • type object
    • format geopoint
    • format geojson
  • type string
    • format date-time (ISO6801)
    • format date (ISO6801)
    • format time (ISO6801)
    • format geopoint
    • format email
    • format uri
  • type binary

@jpmckinney
Copy link

+1 for @pwalsh's three bulleted reasons, plus it avoids an unnecessary difference with JSON Schema.

Separately: why are there three serializations of a geopoint?

@pwalsh
Copy link
Member

pwalsh commented Jan 21, 2015

@jpmckinney because the current JSON Table Schema spec prescribes these three forms for geopoint, each being valid.

Plus, it serves an an interesting example of formats > types - these three forms are quite common in the wild.

@jpmckinney
Copy link

@pwalsh I've seen the array and string forms in the wild - but the object form? I've seen all sorts of keys: lat/lng, lat/long, latitude/longitude, etc. The object form seems the least standardized.

@pwalsh
Copy link
Member

pwalsh commented Jan 21, 2015

@jpmckinney fair enough. In any event, they are here because they are all described in the current spec (obj form using lat and lon).

@rufuspollock
Copy link
Contributor Author

Note: re geopoint I think we should drop array version but re the object and string versions that was there to support very common usage (certainly string version ...). I would be up for narrowing to even one but challenge will be that either one you drop you exclude a lot of existing use (and convenience if you drop string).

@pwalsh
Copy link
Member

pwalsh commented Jan 23, 2015

So I guess we should first get general agreement on the type/format implementation I've suggested above.

But, about geopoint specifically, I'd keep all three:

@jpmckinney
Copy link

I'm fine with keeping all three. Going back to this comment #159 (comment) I agree with @pwalsh 's suggestion.

@pwalsh
Copy link
Member

pwalsh commented Jan 27, 2015

A slight amendment to the type/format list above:

binary (which is a base64 encoded string in JTS) should be a format of the string type, IMHO.

@pwalsh
Copy link
Member

pwalsh commented Jan 28, 2015

@rgrp how do we move forward with this (or not)? I'm just not sure of the process with the data protocols specifications.

@rufuspollock
Copy link
Contributor Author

@pwalsh can we summarize again what the exact proposal is. Also I'm really not sure how this interacts with format field (which is somewhat underspecified).

Also clarity on pros / cons would be super-useful (obvious con is this is a substantial breaking change but that is not the end of the world). Once we have that we can resolve on this, trial it and then release into the spec.

@pwalsh
Copy link
Member

pwalsh commented Jan 29, 2015

Ok, summary below. BTW, if this is too radical, that's fine. I made the proposal based on (a) working with JTS in code, and (b) under the assumption that the spec is still rather fluid, and there is not necessarily a deep commitment to the stability of the spec as is. Could be the (b) is an incorrect assumption.

Proposal - Standardise field type/format for consistency, ease of use, and wide(r) compatibility with JSON Schema

I find the implementation of the type and format properties of field hashes to be inconsistent and unclear. I'm proposing to reduce the valid types in JSON Table Schema, and make the use of format consistent and explicit for each type.

Making format "consistent and explicit" means that formats, if declared, should always be a particular pattern or structure that type conforms to. In addition, this also means that something should not be a type if it is really a format of another type (e.g.: geojson is really just a format for object).

Relation to JSON Schema

The available types should be consistent with JSON Schema. formats can align with JSON Schema where possible, but also extend and deviate from those in JSON Schema. JSON Schema has a set of "built-in" formats, and "allows" for custom formats.

Getting JSON Table Schema closer to JSON Schema in this regard would potentially be useful for 3rd party tools for validation, etc.

For built-in formats of the string type in JSON Schema, see:

http://spacetelescope.github.io/understanding-json-schema/reference/string.html#format

I propose the following:

  • Support all the builtin JSON Schema formats explicitly: date-time, email, hostname, ipv4, ipv6 and uri
    • I grant that hostname, ipv4 and ipv6 may not seem terribly necessary
  • Remove the JSON Table Schema datetime type (it will be replaced by the date-time format on the string type)
  • Remove the JSON Table Schema date and time types, and replace them with date and time as accepted formats for the string type
  • While I'm at it :), I don't really see any need to maintain a json alias for the object type. It is (potentially) confusing when array is also JSON.

The change

The type/format relation I'm suggesting would still cover the entire spec as it stands today, while making the use of both type and format clearer.

  • type array
    • format geopoint
  • type boolean
  • type integer
  • type number
  • type null
  • type object
    • format geopoint
    • format geojson
  • type string
    • format date-time (ISO8601)
    • format date (ISO8601)
    • format time (ISO8601)
    • format geopoint
    • format email
    • format uri
    • format binary (base64 encoded string)

User story

I'll give one example of where tightening the type/format relationship could be useful for code that implements the spec:

As a user, if I have an object that is geojson, I'd define it as type geojson in the current spec.

With the current spec, the validation library would have to explicitly support geojson as a type - I don't think any do at present (may be wrong though..).

However, if the user was able to declare the object as type object and format geojson, we might expect that the validation library could at least validate it as a validly formed object, even if it doesn't support the geojson format.

Pros/Cons

Cons

  • A potentially significant breaking change
  • If this actually would require JTS validators to support the spec in its current form, as well as in the proposed form, I don't think it is worth it

Pros

Reducing the primitive type set is an improvement in the following areas:

  • Writing libraries around JTS may be easier:
    • full type support can be provided by leveraging current JSON Schema libraries
    • new libraries can first implement type support and iterate on format support
    • clearer seperation in code between checking for type and format
  • Iterating on the JTS spec itself may be clearer:
    • the type interface is clear and defined, additions can be made by introducing new formats for existing types
  • Users of JTS validation code may be happier :):
    • See the above user story as an example

@rufuspollock
Copy link
Contributor Author

This is wonderfully clear and a really great summary @pwalsh.

In general, I think there is a lot to be said for this change.

My only immediate thought is about "promoting" date-time (and I guess date and time) to a first-class type.

Why: it is a "first-class" type - very common. Also could allow us to introduce flexibility on the date-time format using the format string on date-time (default would still be full ISO8601).

Why not: dates are strings in JSON and CSV and are notoriously tricky. This system is consistent with treating them basically as strings with structure to them.

Overall I'm +1 on this change.

@jpmckinney @paulfitz @ldodds - thoughts, votes +1, -1, +0, -0 0.

@pwalsh
Copy link
Member

pwalsh commented Jan 29, 2015

The date/time thing with formatting is a bit tricky, and in order to provide custom formats, I see why the case can be made for date/time as types, and not formats on string.

I though of suggesting "format modifiers" for this very problem, but held back because it may confuse the bigger issue of type/format clarity.

Basically, the idea of a format modifier would be that certain formats could take a custom pattern. Example for date: "type": "string", "format": "date:d% %b %Y"

@jpmckinney
Copy link

+1 to proposed changes. We can continue thinking about dates/times, but I think the proposal is in the right direction in that regard.

@paulfitz
Copy link
Contributor

I generally like this, especially for describing tables that are themselves encoded in JSON. For describing types in an sql database, I'm less sure, mostly because of date-time/date/time.

I like that, ignoring the type/format hierarchy, all the geopoint representations are grouped (via format). I don't like that there's no grouping of date-time/date/time representations.

Geopoints showing up as strings, objects, or arrays reminds me of sqlite's notion of "affinities" for dealing with date-times (https://www.sqlite.org/datatype3.html#datetime), which can be stored as strings, reals, or integers.

What if format were retained for variations on a theme, while adding a logicalType field that captures the same basic idea being expressed different ways?

So a date might be "type": "string", "logicalType": "date-time", "format": "date", and a geopoint expressed as an array might be "type": "array", "logicalType": "geopoint" (no format needed until there are geopoints with lat/long in differing orders).

@ldodds
Copy link
Contributor

ldodds commented Jan 30, 2015

I think I'm -1 on this. I'd prefer to see the direction of travel be alignment with the W3C metadata vocabulary. I've already advocated for use of XML Schema type system in the other type changes that I've proposed (and which have been partially adopted).

I'm not clear why alignment with JSON Schema is that desirable, as opposed to alignment with the W3C guidance.

@rufuspollock
Copy link
Contributor Author

@pwalsh I've also been reflecting and I think that my major considerations would be:

  • being relatively simple and easy to support (for tool implementors and consumers) - obviously it needs to be good for producers too but i would actually put them secondary to some extent (since the risk with fitting all producers is you blow out on simplicity)
  • easy alignment with sql (it would be nice to have relatively easy fidelity here)
  • easy alignment with json (schema)
  • easy support for common types in CSV or similar in the wild

This makes me think:

  • expandable is important - we may want core types but allow people to have some way to indicate a specific type for their use case (e.g. export / import sql data to csv)
    • A lot of this can be done through the "format" attribute (e.g. all the different sql string types)
  • date/time is so fundamental and common that I would like to see date/time stuff promoted to type (esp as we can then use format naturally if we want)
  • Re geopoint I'm sort of easy but think again we may promote as it is so useful (especially for geocsv [New] GeoCSV / Geo CSV #81) - @pwalsh i would be interested to understand how making this a format really helps you as an implementor have an easier job

@ldodds there is a bit of circularity re W3C metadata vocab since that is in part coming from here to. Also, it sounds like you are +1 on a significant upgrade, but just in a different direction. Interested in your thoughts on this comment.

Aside: if we do a significant change we probably could fold this in with #126 (switch to datatype from type)

@pwalsh
Copy link
Member

pwalsh commented Jan 30, 2015

@ldodds: about the W3 XML Schema, I don't really see why XML Schema is preferable to JSON Schema either, but, the JTS spec as written certainly engages more with JSON Schema than W3 XML Schema. As a relative newcomer to JTS (me), that alone makes a significant difference. Also see #46.

@rgrp: ok now that SQL comes into the picture that changes things (for me) somewhat. I approached JTS as a way to declare schemes for text (CSV) and JSON sources, hence the emphasis on string formats. Perhaps the spec needs to be more explicit in the expected applications of JTS - the only SQL reference I picked out of the spec was to show contrast to a text-based table.

So, I would def. support date/time types if we are not just talking text/json-based data sources here.

About geocsv I didn't know it, but I'll check it out.

@rufuspollock
Copy link
Contributor Author

@pwalsh point re SQL is that in terms of people using tabular data (and esp using CSV), SQL is a very obvious load target (plus things like bigquery, redshift etc) tend to be close to SQL. To be clear, i'm not suggesting we support all of SQL but an easy ability to map (pretty well) is def desirable.

@ldodds
Copy link
Contributor

ldodds commented Jan 30, 2015

@pwalsh my reasoning here is that a specification should be as clear as possible. rather than defining a new type system we can use one that already exists. I don't think the JSON Schema type system is particularly well defined. It defining types that are arguably unnecessary for a tabular data structure (JSON objects) and doesn't define other types that are useful (date, uri, etc). Rather than define those things in JTS, we can just refer to another specification.

As an aside I was wondering whether datapackage should just let a user say what type of schema it includes, which might be either a JSON Schema or the new W3C format. Then ditch JTS all together. Then rather than try to align various specifications, we just let people choose what they want to use.

@rufuspollock
Copy link
Contributor Author

@ldodds i have to say i'm -1 or -0 on just using full xsd types. Real aim for simplicity here. Also W3C format is a long way from done and is significantly reusing JTS :-) I also think JTS will be useful on its own.

As such I don't think making this "choose your own" is the way to go for Tabular Data Package per se - though I strongly +1 idea of allowing schema field to mean other things for other extensions of Data Package :-)

@ldodds
Copy link
Contributor

ldodds commented Jan 30, 2015

I wasn't suggesting supporting all of XSD, just that we draw from its type system which is well defined. See also discussions in #124 & #96.

fwiw, csvlint already supports using some of the XSD schema types and people are creating & applying schemas using it. haven't found any particular complexity from an implementation point of view.

@pwalsh
Copy link
Member

pwalsh commented Feb 17, 2015

So, some great points have been brought up in the thread; here is a revised version of my type/format proposal. The goal is still to have explicit definitions of types and the supported formats per type.

  • type array
    • format does not accept format
  • type object
    • format does not accept format
  • type boolean
    • format does not accept format, but should have an explicit set of true and false strings (e.g.: 'yes', 'y', 'true', '0', 'no', 'n','false', '1')
  • type integer
    • format does not accept format
  • type number
    • format does not accept format
  • type null
    • format does not accept format, but should have an explicit set of null strings (e.g.: 'null', 'none', 'nil', 'nan', '-', '')
  • type datetime
  • type date
  • type time
    • (formats below apply to datetime, date, and time)
    • format 'ISO6801' (default)
    • format 'non-strict' (allow the library to try to attempt to parse as any date/time format possible e.g.: in python, via python-dateutils' dateutil.parser.parse)
    • format (?) any other date time string format that is 100% language agnostic (must be made explicit in spec)
  • type geopoint
    • format 'string-geopoint' (default) (comma separated values of lat then long)
    • format 'array-geopoint' (2 items, being lat then long)
    • format 'object-geopoint' (2 keys, being lat, lng, and any other aliases we explicitly define)
  • type geojson
    • format does not accept format (I'm not sure if there are competing representations of geojson?)
  • type any
  • type string
    • format email
    • format uri
    • format binary (base64 encoded string)

@rufuspollock
Copy link
Contributor Author

@pwalsh I think this looks great. Only thing in spec would be to make clear that format is always optional (e.g. string does not require format). Re geopoint could we lose the '-geopoint' in the format descriptions. also for string, default should lng then lat (i understand that to be an underlying standard) and i think we then add 'string-latlng' or similar as a format.

Would you be happy to submit a pull request with these changes to the spec and we can get it in (remember to note the changes at the top of the spec).

@pwalsh
Copy link
Member

pwalsh commented Feb 18, 2015

Sure. I'll do a pull request on the spec for this over the next few days.

@jpmckinney
Copy link

👍 Re geojson formats, we might consider topojson?

@pwalsh
Copy link
Member

pwalsh commented Feb 18, 2015

Good idea. Seems practical. I'll add it to the pull request too.

@rufuspollock
Copy link
Contributor Author

FIXED. This was fixed in PR #168.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Archived in project
Development

No branches or pull requests

5 participants