Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data Resource JSON schema definition for `schema` seems to contradict specification #645

Open
iSnow opened this issue Oct 31, 2019 · 5 comments
Projects

Comments

@iSnow
Copy link

@iSnow iSnow commented Oct 31, 2019

Overview

Problem description

Caution: there's a lot of different uses of "schema" coming, so bear with me...

In the JSON schema definition file for a resource, data-resource.json, the formal definition for a schema property of a Resource is:

"schema": {
  "propertyOrder": 40,
  "title": "Schema",
  "description": "A schema for this resource.",
  "type": "object"
},

This allows only JSON-objects as schema definitions, essentially JSON-Strings. It disallows URLs or file paths.

This seemingly contradicts the human-readable specification of a schema property:

The value for the schema property on a resource MUST be an object representing the schema OR a string that identifies the location of the schema.

(Emphasis mine).

Proposed fix

I believe the section should read:

"oneOf": [
    {
      "title": "Schema path",
      "description": "A fully qualified URL, or a POSIX file path..",
      "type": "string"
    },
    {
      "title": "Schema as JSON",
      "description": "A schema encoded as a JSON object",
      "type": "object"
    }
]

I am no specialist for JSON schema, but if I change it, the Java implementation works with schema properties that are URLs and file paths.

How to reproduce

Reproduce the buggy behavior

  • Head to https://www.jsonschemavalidator.net/

  • copy the JSON content from https://frictionlessdata.io/schemas/data-resource.json into the "Schema" field on the left

  • Copy into "Input JSON" on the right the following JSON:

     {
      "name": "test",
      "schema": "https://raw.githubusercontent.com/frictionlessdata/datapackage-java/master/src/test/resources/fixtures/schema/population_schema.json",
      "path": "https://raw.githubusercontent.com/frictionlessdata/datapackage-java/master/src/test/resources/fixtures/data/cities.csv"
     }
    

Result: validation will fail with the following error:

Found 1 error(s)
Message: Invalid type. Expected Object but got String.
Schema path: #/properties/schema/type

Reproduce the proposed fix

  • Insert the proposed fix into the schema JSON
  • Run validation against the same JSON snippet

Result: no validation error will be shown.

Fix to YAML source

Created a pull request, please check and if working, merge

@lwinfree

This comment has been minimized.

Copy link
Member

@lwinfree lwinfree commented Oct 31, 2019

Thanks @iSnow! @rufuspollock + @pwalsh what do you think?

@roll

This comment has been minimized.

Copy link
Member

@roll roll commented Nov 1, 2019

@iSnow @lwinfree
It is one of the confusing parts in the connection between the specs and the implementations.

The idea behind this decision is that this JSON schema has to be used after:

  • descriptor retrieval
  • descriptor dereferencing (so-called) - loading and parsing JSON objects

For example in Python:

# Process descriptor
descriptor = helpers.retrieve_descriptor(descriptor)
descriptor = helpers.dereference_resource_descriptor(descriptor, base_path)
# validate the descriptor against JSON schema

Because of these preparational steps, there is a guarantee that resource.schema and resource.dialect are objects.

It works fine for dynamic languages but probably creates problems for static languages like Java (I'm not sure). If it does I would recommend adding string as an option to JSON schema types.

PS.
The word schema was the most confusing word when I started working on the project. It still can be confusing for me so I created a rule for myself that I say profile when I talk about JSON Schema and schema when I talk about FrictionlessData Table Schema. Some of our libraries use this naming convention e.g. having profiles folders with JSON schemas.

@rufuspollock

This comment has been minimized.

Copy link
Contributor

@rufuspollock rufuspollock commented Nov 2, 2019

I think @roll has summarized this well. It is a tricky one since obviously there are times a user wants to validate a data package "as is" without dereferencing but that then makes the json schema "profile" very flexible.

@iSnow

This comment has been minimized.

Copy link
Author

@iSnow iSnow commented Nov 2, 2019

The idea behind this decision is that this JSON schema has to be used after:

descriptor retrieval

descriptor dereferencing (so-called) - loading and parsing JSON objects

In that scenario, obviously we'll only have objects there. I don't quite understand the stipulation that the schema is to be applied after those steps, though. IMO, this kind of undermines the rigidity of having a JSON schema in the first place, as general-purpose schema validators like https://www.jsonschemavalidator.net/ will mark perfectly valid DataPackages as invalid.

I'd always validate first and after that, de-reference the dependencies as part of the parsing step. Is it because a validator might not load the resources, either because they are local to the file system or will not want to load any old URL, and therefore isn't able to fully validate the schema?

If that's the rationale, there should be an explanation (or maybe I just missed it). The Java code I took over does not do resource de-referencing before validation, but will first validate the DataPackge descriptor JSON and then lazy validate each resource if/when it is queried - which is how I fell over the issue mentioned above. It's of course perfectly possible in Java to first resolve the URL's/file descriptors and only then go for validation, but the behavior of the Java implementation seems more correct to me.

@roll

This comment has been minimized.

Copy link
Member

@roll roll commented Nov 4, 2019

I don't quite understand the stipulation that the schema is to be applied after those steps

I think the point was to validate a schema also because it can turn a data package to be invalid after dereferencing. I agree this area is not handled very well in the specs and probably some kind of composite approach is better when e.g. validation happens on different levels (package/resource/schema) although the current approach seems the easiest to do the trick.

PS.
Actually, I'm not sure what downside is of having object OR url-or-path as a type for schema/dialect. It seems more correct regarding generic validation tools.

@roll roll added this to Specifications in General Nov 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
General
  
Specifications
4 participants
You can’t perform that action at this time.