[jts] Refine field/column data types in JSON Table Schema #159
Comments
cc @tryggvib Suggestion: Several data types are just format patterns for strings. JSON Schema has a set of "built-in" formats, and "allows" for custom formats. Getting JSON Table Schema closer to JSON Schema in this regard would potentially be useful (3rd party tools for validation, etc.). For built-in formats of the string type in JSON Schema, see: http://spacetelescope.github.io/understanding-json-schema/reference/string.html#format I propose the following:
I believe that this simplifies JSON Table Schema, by correctly differentiating between a type and a format that the type should conform with. Following from this, there is another opportunity to simplify the spec with regards to eg:
|
@pwalsh need to think about this one. I'm not sure how much we get moving the "complexity" down into "format" (though maybe, as you suggest, we can reuse json schema stuff more ...). I also acknowledge that Could you also summarize in a list what the new set of types (+ formats) would be. /cc @paulfitz @jpmckinney |
I'll give one example of where tightening the type/format relationship could be useful for code that implements the spec: As a user, if I have an object that is geojson, I'd define it as type geojson in the current spec. With the current spec, the validation library would have to explicitly support geojson as a type - I don't think any do at present (may be wrong though..). However, if the user was able to declare the object as type By reducing the primitive type set, I think it makes it easier in the following areas:
So in general I'm suggesting:
|
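One possible reading of the type/format split in the user story above, as a minimal sketch: a validator checks the primitive type first and treats `format` as an optional refinement. The function name `validate_field`, the `FORMAT_CHECKERS` registry, the type and format names, and the fallback for unknown formats are all assumptions made for illustration; they are not taken from the spec or from any existing library.

```python
# Hypothetical primitive types mapped to Python types (illustration only).
PRIMITIVES = {
    "string": str,
    "object": dict,
    "array": list,
}


def _is_geojson(value):
    # Deliberately shallow check: GeoJSON objects carry a string "type" member.
    return isinstance(value.get("type"), str)


# Optional format checkers, keyed by (type, format).
FORMAT_CHECKERS = {
    ("object", "geojson"): _is_geojson,
}


def validate_field(value, type_name, format_name=None):
    """Check the primitive type first, then the format if a checker is known."""
    expected = PRIMITIVES.get(type_name)
    if expected is None or not isinstance(value, expected):
        return False
    checker = FORMAT_CHECKERS.get((type_name, format_name))
    # Assumption: an unknown format falls back to the type check alone.
    return True if checker is None else checker(value)
```

With this shape, a library that has never heard of a given format can still validate the value against its primitive type.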
+1 for @pwalsh's three bulleted reasons, plus it avoids an unnecessary difference with JSON Schema. Separately: why are there three serializations of a geopoint? |
@jpmckinney because the current JSON Table Schema spec prescribes these three forms for geopoint, each being valid. Plus, it serves as an interesting example of formats > types - these three forms are quite common in the wild. |
@pwalsh I've seen the array and string forms in the wild - but the object form? I've seen all sorts of keys: lat/lng, lat/long, latitude/longitude, etc. The object form seems the least standardized. |
@jpmckinney fair enough. In any event, they are here because they are all described in the current spec (obj form using lat and lon). |
Note: re geopoint I think we should drop array version but re the object and string versions that was there to support very common usage (certainly string version ...). I would be up for narrowing to even one but challenge will be that either one you drop you exclude a lot of existing use (and convenience if you drop string). |
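To make the three geopoint forms discussed above concrete, here is a hedged sketch that normalizes each of them to a `(lon, lat)` tuple. The function name `parse_geopoint` and the `order` parameter are hypothetical. The comma-separated string layout and the component order of the string and array forms are assumptions; only the `lat` and `lon` keys of the object form are mentioned in the thread.

```python
def parse_geopoint(value, order=("lon", "lat")):
    """Normalize a geopoint given as string, array, or object to (lon, lat).

    Assumptions for this sketch: the string form is "a, b", the string and
    array forms hold two numbers in `order`, and the object form uses the
    keys "lat" and "lon".
    """
    if isinstance(value, dict):
        return float(value["lon"]), float(value["lat"])
    if isinstance(value, str):
        parts = [p.strip() for p in value.split(",")]
    elif isinstance(value, (list, tuple)):
        parts = list(value)
    else:
        raise TypeError("unsupported geopoint form")
    if len(parts) != 2:
        raise ValueError("a geopoint needs exactly two components")
    # Pair each component with its name so either order can be handled.
    pair = dict(zip(order, (float(p) for p in parts)))
    return pair["lon"], pair["lat"]
```

Keeping the order as a parameter avoids baking in a choice between "lon, lat" and "lat, lon" that the thread has not settled at this point.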
So I guess we should first get general agreement on the type/format implementation I've suggested above. But, about geopoint specifically, I'd keep all three:
|
I'm fine with keeping all three. Going back to this comment #159 (comment) I agree with @pwalsh 's suggestion. |
A slight amendment to the type/format list above: binary (which is a base64 encoded string in JTS) should be a format of the string type, IMHO. |
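As a rough illustration of treating binary as a format of the string type, the sketch below accepts a value only if it is a string and decodes as standard base64. The function name `is_binary_format` is hypothetical, and strict decoding with the standard alphabet is an assumption rather than something the spec states.

```python
import base64
import binascii


def is_binary_format(value):
    """Return True if `value` is a string that decodes as standard base64."""
    # Type check first: the value must be a string at all.
    if not isinstance(value, str):
        return False
    try:
        # validate=True rejects characters outside the base64 alphabet.
        base64.b64decode(value, validate=True)
    except (binascii.Error, ValueError):
        return False
    return True
```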
@rgrp how do we move forward with this (or not)? I'm just not sure of the process with the data protocols specifications. |
@pwalsh can we summarize again what the exact proposal is. Also I'm really not sure how this interacts with format field (which is somewhat underspecified). Also clarity on pros / cons would be super-useful (obvious con is this is a substantial breaking change but that is not the end of the world). Once we have that we can resolve on this, trial it and then release into the spec. |
Ok, summary below. BTW, if this is too radical, that's fine. I made the proposal based on (a) working with JTS in code, and (b) under the assumption that the spec is still rather fluid, and there is not necessarily a deep commitment to the stability of the spec as is. It could be that (b) is an incorrect assumption.
Proposal - Standardise field type/format for consistency, ease of use, and wide(r) compatibility with JSON Schema
I find the implementation of the Making
Relation to JSON Schema
The available Getting JSON Table Schema closer to JSON Schema in this regard would potentially be useful for 3rd party tools for validation, etc. For built-in formats of the string type in JSON Schema, see: http://spacetelescope.github.io/understanding-json-schema/reference/string.html#format I propose the following:
The change
The
User story
I'll give one example of where tightening the type/format relationship could be useful for code that implements the spec: As a user, if I have an object that is geojson, I'd define it as type geojson in the current spec. With the current spec, the validation library would have to explicitly support geojson as a type - I don't think any do at present (may be wrong though..). However, if the user was able to declare the object as type
Pros/Cons
Cons
Pros
Reducing the primitive type set is an improvement in the following areas:
|
This is wonderfully clear and a really great summary @pwalsh. In general, I think there is a lot to be said for this change. My only immediate thought is about "promoting" date-time (and I guess date and time) to a first-class type. Why: it is a "first-class" type - very common. Also could allow us to introduce flexibility on the date-time format using the format string on date-time (default would still be full ISO8601). Why not: dates are strings in JSON and CSV and are notoriously tricky. This system is consistent with treating them basically as strings with structure to them. Overall I'm +1 on this change. @jpmckinney @paulfitz @ldodds - thoughts, votes +1, -1, +0, -0 0. |
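A small sketch of the idea above, assuming the default date-time pattern is the `YYYY-MM-DDThh:mm:ssZ` UTC form quoted in the issue description and that a custom format would be given as a `strptime`-style pattern. The function name `parse_datetime`, the `fmt` parameter, and the choice of `strptime` directives are assumptions for illustration only.

```python
from datetime import datetime, timezone

# Default pattern: YYYY-MM-DDThh:mm:ssZ, interpreted as UTC.
ISO_UTC = "%Y-%m-%dT%H:%M:%SZ"


def parse_datetime(value, fmt=None):
    """Parse with the default UTC pattern unless a custom pattern is given."""
    if fmt is None:
        return datetime.strptime(value, ISO_UTC).replace(tzinfo=timezone.utc)
    # Custom patterns are parsed as-is; no timezone is attached here.
    return datetime.strptime(value, fmt)
```

For example, `parse_datetime("31/01/2014", "%d/%m/%Y")` would handle a day-first date that the default pattern rejects.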
The date/time thing with formatting is a bit tricky, and in order to provide custom formats, I see why the case can be made for date/time as types, and not formats on string. I thought of suggesting "format modifiers" for this very problem, but held back because it may confuse the bigger issue of type/format clarity. Basically, the idea of a format modifier would be that certain formats could take a custom pattern. Example for date: |
+1 to proposed changes. We can continue thinking about dates/times, but I think the proposal is in the right direction in that regard. |
I generally like this, especially for describing tables that are themselves encoded in JSON. For describing types in an sql database, I'm less sure, mostly because of date-time/date/time. I like that, ignoring the type/format hierarchy, all the geopoint representations are grouped (via Geopoints showing up as strings, objects, or arrays reminds me of sqlite's notion of "affinities" for dealing with date-times (https://www.sqlite.org/datatype3.html#datetime), which can be stored as strings, reals, or integers. What if So a date might be |
I think I'm -1 on this. I'd prefer to see the direction of travel be alignment with the W3C metadata vocabulary. I've already advocated for use of XML Schema type system in the other type changes that I've proposed (and which have been partially adopted). I'm not clear why alignment with JSON Schema is that desirable, as opposed to alignment with the W3C guidance. |
@pwalsh I've also been reflecting and I think that my major considerations would be:
This makes me think:
@ldodds there is a bit of circularity re W3C metadata vocab since that is in part coming from here too. Also, it sounds like you are +1 on a significant upgrade, but just in a different direction. Interested in your thoughts on this comment. Aside: if we do a significant change we probably could fold this in with #126 (switch to datatype from type) |
@ldodds: about the W3 XML Schema, I don't really see why XML Schema is preferable to JSON Schema either, but, the JTS spec as written certainly engages more with JSON Schema than W3 XML Schema. As a relative newcomer to JTS (me), that alone makes a significant difference. Also see #46. @rgrp: ok now that SQL comes into the picture that changes things (for me) somewhat. I approached JTS as a way to declare schemas for text (CSV) and JSON sources, hence the emphasis on string formats. Perhaps the spec needs to be more explicit in the expected applications of JTS - the only SQL reference I picked out of the spec was to show contrast to a text-based table. So, I would def. support date/time types if we are not just talking text/json-based data sources here. About geocsv I didn't know it, but I'll check it out. |
@pwalsh point re SQL is that in terms of people using tabular data (and esp using CSV), SQL is a very obvious load target (plus things like bigquery, redshift etc tend to be close to SQL). To be clear, i'm not suggesting we support all of SQL but an easy ability to map (pretty well) is def desirable. |
@pwalsh my reasoning here is that a specification should be as clear as possible. Rather than defining a new type system we can use one that already exists. I don't think the JSON Schema type system is particularly well defined. It defines types that are arguably unnecessary for a tabular data structure (JSON objects) and doesn't define other types that are useful (date, uri, etc). Rather than define those things in JTS, we can just refer to another specification. As an aside I was wondering whether datapackage should just let a user say what type of schema it includes, which might be either a JSON Schema or the new W3C format. Then ditch JTS altogether. Then rather than try to align various specifications, we just let people choose what they want to use. |
@ldodds i have to say i'm -1 or -0 on just using full xsd types. Real aim for simplicity here. Also W3C format is a long way from done and is significantly reusing JTS :-) I also think JTS will be useful on its own. As such I don't think making this "choose your own" is the way to go for Tabular Data Package per se - though I strongly +1 idea of allowing schema field to mean other things for other extensions of Data Package :-) |
I wasn't suggesting supporting all of XSD, just that we draw from its type system which is well defined. See also discussions in #124 & #96. fwiw, csvlint already supports using some of the XSD schema types and people are creating & applying schemas using it. haven't found any particular complexity from an implementation point of view. |
So, some great points have been brought up in the thread; here is a revised version of my type/format proposal. The goal is still to have explicit definitions of types and the supported formats per type.
|
@pwalsh I think this looks great. Only thing in spec would be to make clear that format is always optional (e.g. string does not require format). Re geopoint could we lose the '-geopoint' in the format descriptions. Also for string, default should be lng then lat (i understand that to be an underlying standard) and i think we then add 'string-latlng' or similar as a format. Would you be happy to submit a pull request with these changes to the spec and we can get it in (remember to note the changes at the top of the spec). |
Sure. I'll do a pull request on the spec for this over the next few days. |
👍 Re geojson formats, we might consider |
Good idea. Seems practical. I'll add it to the pull request too. |
FIXED. This was fixed in PR #168. |
This is a discussion issue for potentially refining the list of column / field data types in JSON Table Schema:
not, a format field must be provided describing the structure.
YYYY-MM-DDThh:mm:ssZ in UTC time or, if not, a format field must be
provided.
Also worth comparing against the in-progress W3C spec (which has been informed by this spec ...): http://w3c.github.io/csvw/metadata/index.html#datatypes