[JTS] Extend JSON Table Schema to support simple field validation #95

Closed
ldodds opened this Issue Jan 27, 2014 · 34 comments

@ldodds
Contributor
ldodds commented Jan 27, 2014

Background to this issue can be found here:

https://github.com/theodi/csv-validation-research/wiki/Extending-Data-Packages-to-Support-CSV-File-Validation

Revise JTS to support the following:

  • required -- boolean. Field must have a value (not be an empty string) in every row
  • minLength -- integer. Minimum number of characters in value
  • maxLength -- integer. Maximum number of characters in value
  • unique -- boolean. Every value must be unique within the file in which it is found. Defines a unique key for a row although a row could potentially have several such keys.
  • pattern -- string. Regular expression to be matched against the field, which will be treated as a string of characters

The existing format key is underspecified, as there's no guidance on expected patterns.

@rufuspollock
Contributor

What does HTML5 validation have for this stuff - would be nice to match as much as possible. I also wonder if there is any existing mini-JSON spec for validation ...

@ldodds
Contributor
ldodds commented Jan 27, 2014

I think this is close to HTML5 already. Form validation supports basic types (as described in #96), including date, datetime, url, etc. In addition, form elements can have min/max values (again, covered by #96) and can also specify a pattern for simple string matching.

HTML5 defines some additional algorithms for turning strings into various types, but these are orthogonal to how the types are defined and constraints on form values.

The proposals here are also a common subset of the validation approaches supported by a number of different CSV validator tools, so they build on existing efforts elsewhere; see https://github.com/theodi/csv-validation-research

@rufuspollock
Contributor

Questions:

  • Should these be top-level attributes or nested in, say, constraints attribute?
  • format: should we use this to specify some kind of subtypes e.g. email, year - or should we get that stuff by expanding types as discussed in #96?

Also /cc @maxogden @paulfitz @sballesteros @besquared @jpmckinney for thoughts ...

@jpmckinney
Member

  1. I prefer to nest under constraints, which seems to be the term we agreed on in #23
  2. JSON Schema has format and pattern fields (see http://json-schema.org/latest/json-schema-validation.html#anchor104). May be worth copying.

@ldodds
Contributor
ldodds commented Jan 28, 2014

Personally I prefer the combination of declaring a type plus pattern as per
#95 and #96. I don't think JSON Schema format adds much here.

Happy to go with nested constraints.

@jpmckinney
Member

@ldodds Can you write an example JSON document so I can see what the option you propose looks like? e.g. an example like what I did at #23 (comment)

@ldodds
Contributor
ldodds commented Jan 28, 2014

There are examples that cover this proposal and my others here:
https://github.com/theodi/csv-validation-research/wiki/Extending-Data-Packages-to-Support-CSV-File-Validation

@jpmckinney
Member

@ldodds I don't see any important difference with the JSON Schema properties. pattern has the same semantics. If you rename type to format and change its list of possible values to match those used by JSON Schema (e.g. use date-time instead of dateTime), you get the benefit of not forcing developers and implementers to learn two slightly different ways of doing pretty much exactly the same thing. A toolchain exists for JSON Schema. Make it easy for people to tailor those tools for JTS by not choosing different terms for the same use case.

@ldodds
Contributor
ldodds commented Jan 28, 2014

OK, so we agree on pattern.

I think we're only disagreeing over the name and values for the key that
assigns a type to a column. Based on naming alone, I think "type" is more
descriptive than "format". HTML5 uses "type" as well. JSON-LD does too. So
it's debatable which might be easier for people to understand.

In terms of values, then as I noted in the type discussion, I think
grounding it on something that's widely supported like XML Schema is better
than defining new names for types and then having to define their value
spaces. I don't think that JSON Schema really covers the variety of data
types JTS needs to support. The format keyword can have values of
date-time, email, hostname, uri and ip addresses. That's very limited.
HTML5 has a broader and more useful set. In particular I think the range of
date types offered by XSD is useful. I think it's better to have more
built-in types rather than forcing people to use regular expressions.

@jpmckinney
Member

In JSON Schema, you would have:

{
  "created": {
    "type": "string",
    "format": "date-time"
  }
}

It has separate type and format fields, because JSON only has strings, numbers, objects, arrays, true, false and null as possible types. When a JSON Table Schema describes the data in a CSV file, the type values can be all sorts of other values, like dates, so I see your point now.

JSON-LD has @type, which is a URI to an RDF class:

{
  "@context": {
    "created": {
      "@id": "http://purl.org/dc/terms/created",
      "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
    }
  },
  "created": "1940-10-09"
}

Or a string which is aliased to a URI:

{
  "@context": {
    "dateTime": "http://www.w3.org/2001/XMLSchema#dateTime",
    "created": {
      "@id": "http://purl.org/dc/terms/created",
      "@type": "dateTime"
    }
  },
  "created": "1940-10-09"
}

I think using XSD values makes sense. Is there any reason we can't use something closer to JSON-LD's @context block for the fields block in JTS? Or we can keep JTS more or less as proposed, but offer a .jsonld context file, that maps the values (dateTime, etc.) to URIs (http://www.w3.org/2001/XMLSchema#dateTime, etc.). That way, people can use JTS without knowing about its ties to other vocabularies via JSON-LD.
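
A minimal sketch of the second option, in Python: a small lookup table stands in for the .jsonld context file and expands short type names to XML Schema URIs. The names `XSD`, `CONTEXT` and `expand_type`, and the particular list of types, are hypothetical and chosen only for illustration; they are not part of JTS or of any JSON-LD library.

```python
# Hypothetical stand-in for a JSON-LD context: short names -> XSD URIs.
XSD = "http://www.w3.org/2001/XMLSchema#"
CONTEXT = {name: XSD + name for name in ("string", "integer", "date", "dateTime")}


def expand_type(value):
    """Return the full URI for a short type name; pass URIs through unchanged."""
    if value.startswith("http://") or value.startswith("https://"):
        return value
    return CONTEXT[value]  # raises KeyError for names the context does not define


field = {"name": "created", "type": "dateTime"}
print(expand_type(field["type"]))
```

With something like this, a schema author could keep writing dateTime while tools that care about the vocabulary resolve it to the full URI.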

@jpmckinney
Member

So, in short, I agree with most of the proposal, which I understand as using required, minLength, maxLength, pattern as in JSON Schema, and using type like @type in JSON-LD.

It doesn't seem like JSON Schema has a way to validate uniqueness, because JSON Schema validation operates at the field level, not the collection level, as far as I can tell. I think we discussed unique as being a different type of constraint in issue #23. The proposal cannot do composite unique keys. Since uniqueness operates at a different level than the other constraints, it may be easier to deal with it separately and remove it from this proposal.
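
To illustrate why uniqueness sits at a different level, here is a rough Python sketch: the check needs every row of the file, not a single value, and a composite key simply means building the key from several columns. The function name `duplicate_keys` and the sample data are assumptions made up for this example.

```python
import csv
import io


def duplicate_keys(rows, columns):
    """Return the key tuples that occur more than once across all rows.

    `columns` may name one column (a simple unique constraint) or several
    (a composite key, which a per-field `unique` flag cannot express).
    """
    seen = set()
    duplicates = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            duplicates.add(key)
        seen.add(key)
    return duplicates


# Made-up sample data: id is unique, the (year, region) pair is not.
data = "id,year,region\n1,2013,N\n2,2013,S\n3,2014,N\n4,2013,N\n"
rows = list(csv.DictReader(io.StringIO(data)))

print(duplicate_keys(rows, ["id"]))
print(duplicate_keys(rows, ["year", "region"]))
```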

@jpmckinney
Member

Separate issue: the linked document uses as an example a pattern value of "%Y-%m-%d" which is a strftime string and not a regular expression. I would like to restrict pattern values to regular expressions, instead of supporting a wide variety of possible values which will be more demanding to implement in third-party libraries consuming JTS.

@ldodds
Contributor
ldodds commented Feb 3, 2014

I was proposing that pattern was used in two ways:

  • for string types it defines a regex for validating the basic structure of
    the field
  • for dates it defines a format mask for parsing the date; we want to
    create a valid date so that we can also apply minimum/maximum range values,
    which would be parsed with the same pattern

Regular expressions are awkward for doing the latter. Most libraries I've
used support strftime values for date parsing so I don't see much
implementation overhead. I thought the overloading of the key was also
useful as we can indicate that the additional behaviour is only for certain
data types.

@jpmckinney
Member

I think it would be safer to mint a new property datePattern than to overload one property.

@besquared
Contributor

@ldodds I believe that interpretation and validation are separate concerns so I don't think that the strftime style formatting and the pattern style regular expression matching should be conflated into a single property.

The pattern property validates the syntax of the given field but says nothing about its semantic meaning. That's the point of the format property for dates and for other types. There are several data types other than dates that have identical syntax definitions but with different semantics and both need to be represented unambiguously.

+1 for pattern being a regular expression syntactical validation
+1 for keeping format around for semantic interpretation (and semantic validation)
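
The split between syntactic and semantic checks can be sketched in Python as follows. The regular expression, the strftime-style mask and the helper names are assumptions chosen for the example; the point is only that a value can have the right shape and still not be a real date.

```python
import re
from datetime import datetime

DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")  # syntactic check only


def matches_pattern(value):
    """True if the whole value has the shape NNNN-NN-NN."""
    return DATE_SHAPE.fullmatch(value) is not None


def is_real_date(value, date_format="%Y-%m-%d"):
    """True if the value parses as a calendar date with the given mask."""
    try:
        datetime.strptime(value, date_format)
        return True
    except ValueError:
        return False


value = "2014-02-30"  # right shape, but February has no 30th day
print(matches_pattern(value), is_real_date(value))
```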

@ldodds
Contributor
ldodds commented Feb 10, 2014

Picking this up again and to summarise discussion so far. I believe we've decided to:

  • Add a constraints key for specifying validation rules
  • The initial constraints are required, minLength, maxLength, and pattern -- as described in #96 we can extend this to include minimum and maximum
  • type is used to specify an XML Schema data type for a column
  • dataPattern is used to specify a strftime compatible pattern for parsing date-times from fields

Does that seem reasonable? If so I'll work on a PR.

Personally I think adding unique as a single column constraint is useful. We can add additional constraints at the resource level, e.g. to allow for multiple column indexes or cross-referencing between resources.

@besquared
Contributor

I think it was datePattern (maybe a typo) and I still think it's too narrow a use case to call it that. There are other kinds of data that are represented as formatted strings where the format definition would be useful for the consumer. It's also not really a pattern as much as it is a format. Even the name "strftime" in the date case is about formatting not patterns.

@ldodds
Contributor
ldodds commented Feb 10, 2014

Yep a typo, I meant datePattern.

Proper parsing and validation of dates is a common use case, so I personally don't see an issue in having a specific key for this. I dislike the current format key as it's not properly specified; as others have commented, having a single key for multiple uses is awkward. dateFormat is fine with me as an alternative name.

For some types of data validation, e.g. emails or simple identifier strings, the regex support in pattern will be sufficient to check the data. The goal for datePattern is to help implementations parse the value into a date/date-time object: partly so that its semantics can be checked (is it a valid date?) but also to support additional constraints (date must be within a specific range).

@rufuspollock
Contributor

This looks to have been a productive thread. @ldodds would you like to make a stab at a comment providing an updated summary proposal (perhaps with refs to other proposals people have brought in e.g. using @type which is #89 ...). We can try to get it approved :-)

@ldodds
Contributor
ldodds commented Feb 14, 2014

OK, will try and do that over the next few days.

@rufuspollock
Contributor

@ldodds ping :-)

@ldodds
Contributor
ldodds commented Mar 3, 2014

OK, here is an attempt to pull together the proposals in this and related threads. If there's general agreement we can put together PRs for the changes.

The following covers discussions here and in #96.

Addition of Constraints

The definition of a JTS field will be revised to be just the following:

  • name -- required
  • title -- optional
  • description -- optional
  • constraints -- optional

The constraints key is a hash whose contents describe a set of constraints that are used to validate the contents of that field. All constraints are applied together.

A data package resource is valid if all of its fields are valid according to their associated schemas.

The existing type and format fields will be replaced by equivalents defined below.

Basic Field Constraints

The following basic constraints can be used to validate a field:

  • required -- boolean. Field must have a value (not be an empty string) in every row
  • minLength -- integer. Minimum number of characters in value
  • maxLength -- integer. Maximum number of characters in value
  • unique -- boolean. Every value must be unique within the file in which it is found. Defines a unique key for a row although a row could potentially have several such keys.
  • pattern -- string. Regular expression to be matched against the field, which will be treated as a string of characters. We should use a standard regular expression syntax, e.g. the simple regex format defined in XML Schema. (See also this reference)

Here is an example:

{ "name": "ID",
  "description": "Unique transaction code",
  "constraints": {
    "required": true,
    "minLength": 38,
    "maxLength": 38,
    "unique": true
  }
}
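
As a hedged illustration of how a consumer might apply the per-value constraints, here is a minimal Python sketch. It assumes that empty values are only checked against required, and it uses Python's `re` syntax with a whole-value match, which is not identical to the XML Schema regex syntax suggested above. `check_value` is a hypothetical helper, and unique is left out because it needs the whole column rather than a single value.

```python
import re


def check_value(value, constraints):
    """Return a list of the constraint names that `value` violates."""
    errors = []
    if value == "":
        # An empty string only fails if the field is required; the other
        # checks are skipped for empty values (an assumption of this sketch).
        return ["required"] if constraints.get("required") else []
    if "minLength" in constraints and len(value) < constraints["minLength"]:
        errors.append("minLength")
    if "maxLength" in constraints and len(value) > constraints["maxLength"]:
        errors.append("maxLength")
    if "pattern" in constraints and not re.fullmatch(constraints["pattern"], value):
        errors.append("pattern")
    return errors


field = {
    "name": "ID",
    "constraints": {"required": True, "minLength": 38, "maxLength": 38, "unique": True},
}
print(check_value("", field["constraints"]))
print(check_value("{" + "A" * 36 + "}", field["constraints"]))
print(check_value("too-short", field["constraints"]))
```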

Type Constraints

In addition to the basic field constraints, fields can be validated against a data type. This information can also be used for more than validation, e.g. to map values to types in a programming language or database schema.

A data type is associated with a field by adding a type key to its constraints hash. Types are defined using URIs. It is recommended that the XML Schema data type URIs are used to provide identifiers for a well-specified set of data types.

Example:

{
  "name": "Price",
  "description": "Price Paid",
  "constraints": {
    "required": true,
    "type": "http://www.w3.org/2001/XMLSchema#nonNegativeInteger"
  }
}

Additional well-known URIs could be defined for other types. This would provide a way to support some of the existing types supported in the current specification.

Values of the type field could be expanded according to a JSON-LD @context referenced from the schema, or its enclosing data package.

The value of a field with a type constraint can be further constrained by using two additional constraints:

  • minimum -- the minimum value for the field, after converting it to the specified type, e.g. the smallest allowed integer for an integer field
  • maximum -- the maximum value for the field

These differ from minLength and maxLength as those constraints operate on the length of the field value, not its type.
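
The difference can be shown with a short Python sketch. It assumes an integer-typed field and uses `int` as the conversion; `check_range` and its parameters are hypothetical names, not part of the proposal.

```python
def check_range(raw, minimum=None, maximum=None, cast=int):
    """Convert `raw` with `cast`, then compare against minimum/maximum."""
    try:
        value = cast(raw)
    except ValueError:
        return ["type"]
    errors = []
    if minimum is not None and value < minimum:
        errors.append("minimum")
    if maximum is not None and value > maximum:
        errors.append("maximum")
    return errors


# "0099" is four characters long, so it would satisfy a minLength of 3,
# but after conversion it is the integer 99, which is below a minimum of 100.
print(check_range("0099", minimum=100))
print(check_range("1000", minimum=100, maximum=5000))
print(check_range("abc", minimum=100))
```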

Date Validation

XML Schema defines a number of date and time types. These will be useful for checking a variety of real-world data. These definitions include standard formats for dates and date times. Data publishers should be strongly encouraged, but not required (see #104), to use these standard formats.

However, data is often published with different date formats. It should be possible to write a schema for existing datasets. With this in mind, JTS should support a description of the date format used in the field.

The constraints hash for a field can have an optional datePattern key. This key provides a strftime compatible date format mask that can be used to parse a date or date time from the field value. This allows custom date patterns to be validated. If no datePattern is provided, the field is assumed to match the relevant XML Schema pattern.

If a minimum or maximum value constraint is also provided then its value must also conform to the provided datePattern.

Here is an example:

{
  "name": "Date of Transfer",
  "description": "Date of transfer",
  "constraints": {
    "required": true,
    "type": "http://www.w3.org/2001/XMLSchema#dateTime",
    "datePattern": "%Y-%m-%d %H:%M"
  }
}
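
A minimal sketch of how a validator might use datePattern, assuming Python's `datetime.strptime` as the strftime-compatible parser. The minimum and maximum values below are made up for illustration (the example above does not include them); they are parsed with the same mask as the field value, as described earlier. `check_date` is a hypothetical helper.

```python
from datetime import datetime

# datePattern comes from the example; minimum and maximum are assumed values.
constraints = {
    "datePattern": "%Y-%m-%d %H:%M",
    "minimum": "2014-01-01 00:00",
    "maximum": "2014-12-31 23:59",
}


def check_date(raw, constraints):
    """Parse `raw` with datePattern, then apply the optional range checks."""
    fmt = constraints["datePattern"]
    try:
        value = datetime.strptime(raw, fmt)
    except ValueError:
        return ["datePattern"]
    errors = []
    if "minimum" in constraints and value < datetime.strptime(constraints["minimum"], fmt):
        errors.append("minimum")
    if "maximum" in constraints and value > datetime.strptime(constraints["maximum"], fmt):
        errors.append("maximum")
    return errors


print(check_date("2014-03-03 12:42", constraints))
print(check_date("2013-12-31 23:59", constraints))
print(check_date("03/03/2014", constraints))
```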

I think this covers everything?

@ldodds
Contributor
ldodds commented Mar 3, 2014

I've also created a couple of draft schemas. Here is one for the Land Registry Monthly Price Paid CSV files:

https://gist.github.com/ldodds/9070478

Here is one for UK Govt Senior Staff pay data CSV format:

https://gist.github.com/ldodds/9070198

There is initial support for the proposal in theodi/csvlint.rb, which is used in the alpha csvlint.io service, so it's possible to play with this now.

If this proposal is approved then I can also update the schemas in dataprotocols/schemas and the validation code in theodi/datapackage.rb.

@rufuspollock
Contributor

@ldodds one immediate question, you write:

"The definition of a JTS field will be revised to be just the following: [emphasis added]"

But we already allow for e.g. primaryKey and foreignKey etc and in theory could allow other additions. I think what we're really doing here is introducing a constraints keyword.

I'm also concerned about the major, breaking change to move type down into constraints - I'd be happy to lose format altogether (cf #104 etc) but moving type down seems a biggie.

Finally, I wonder if we keep type ultra-simple (as per definition at present) and allow @type for the more complex stuff including xsd date.

@ldodds
Contributor
ldodds commented Mar 3, 2014

@rgrp: Sorry, I should have referenced primaryKey and foreignKey, that
was an oversight. (I've not used them)

re: the breaking change to type. Personally I'd be happy to keep it at the
top level, although we might also want to move up datePattern too as
they're closely related. There were other suggestions to have all
constraints separate and I may have mis-read that as wanting all of those
fields grouped together.

@rufuspollock
Contributor

@ldodds let's get this moving. I suggest for now:

  • we add constraints with values as per your spec BUT
    • we leave type (and @type) at top level
    • I suggest we go for pattern and we leave datePattern for the moment - we punt date stuff into a separate issue (perhaps #104 or similar)

wdyt - we can get this in and you can update to use :-)

@ldodds
Contributor
ldodds commented Mar 28, 2014

I'm planning to work on a PR for this early next week, now that I've got time
to focus on it again, so will flag up when it's ready for review.

@rufuspollock
Contributor

@ldodds ping :-)

@ldodds ldodds referenced this issue Apr 6, 2014
Merged

Added description of constraints #112

0 of 2 tasks complete
@rufuspollock
Contributor

@ldodds great we've got most of this in.

What do we think of supporting @type - #89 (at the moment I think you use xml schema uris for the validation stuff and it wld be good to get back to compatibility). Also see #124 which is about pulling in the xmlschema list of types.

I also think we need to spec out datePattern properly somewhere (do we use printf, regex etc). See also new #125 which is about improving format definition.

@rufuspollock rufuspollock referenced this issue in okfn/okfn.github.com May 27, 2014
Closed

Newsletter May 2014 #211

@rufuspollock
Contributor

@ldodds ping ^^^

@ldodds
Contributor
ldodds commented Jul 22, 2014

I've added some comments to the other issues. But for datePattern we've been using strftime. It's already pretty well defined, provides lots of options and, importantly, is widely supported in a variety of languages.

@rufuspollock
Contributor

@ldodds could you add that comment re datePattern to #125 (and I guess we could then resolve that in favour of strftime)

@ldodds
Contributor
ldodds commented Aug 8, 2014

@rgrp done!

@Floppy Floppy referenced this issue in theodi/csvlint.rb Jan 12, 2015
Open

Spec difference for type field? #116

@rufuspollock
Contributor

FIXED. I'm closing this as fixed as I think most of this went in and that which did not now has its own issues.
