Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add schema.fieldsMatch property; clarified extra/non-specified fields in Table Schema #39

Merged
merged 23 commits into from Mar 28, 2024

Conversation

roll
Copy link
Member

@roll roll commented Feb 21, 2024


Overview

It's the first attempt at implementing a new partial/syncSchema/schemaComplete feature that also clarifies a default field-column mapping behavior in Table Schema. Also, it naturally makes field.name unique amongst other fields in Table Schema with backward-compatibility note for implementors.

I propose to add this property to the Table Schema level as it regulates field-column mapping rules where both field and column concepts are defined on the Table Schema level (not on the Data Resource level). Also, because Table Schema can be shared or constitute catalogs, I think it will be right if compliance rules configuration belongs to abstract Table Schema rather than concrete Data Resources.

For example, if I publish a partial Table Schema I declare that any data source having fields A, B, C in any order will be valid to this schema. If I publish non-partial (default) Table Schema I declare that only a data source having only 3 fields A, B, C in exact order is valid to this schema. I think it's important for e.g. government regulation Table Schemas etc

@roll
Copy link
Member Author

roll commented Feb 23, 2024

I think a found one more good argument why partiality configuration is better to be on the Table Schema level.

If we think about it conceptually:

  • JSON Schema: validates structured data
  • Table Schema: validates tabular data

JSON Schema has settings like additionalProperties that you can set to false limiting the amount of properties in the DATA. That's basically what we do with this currently called schema.partial -- we constraint or not amount and order of columns in the DATA. I think it's really symmetrical and in both cases it happens within the abstract reusable validation container (rather than in the concrete data).

@nichtich
Copy link

This seems to mix two indepenent features:

  • require unique names and any order
  • allow additional fields

Both should get their own flag. If order is relevant, additional fields can stil be allowed after the last defined field, so both flags are independent from each other.

@roll
Copy link
Member Author

roll commented Feb 23, 2024

@nichtich
Good point!

Probably two flags extraFields and orderedFields will be even better. At least, it's more self-explanatory and better aligns with JSON Schema counterpart spirit

Copy link

cloudflare-pages bot commented Feb 23, 2024

Deploying datapackage with  Cloudflare Pages  Cloudflare Pages

Latest commit: f71a530
Status: ✅  Deploy successful!
Preview URL: https://9e56f6e0.datapackage.pages.dev
Branch Preview URL: https://367-schema-partial.datapackage.pages.dev

View logs

@roll roll changed the title Add schema.partial property; clarified extra/non-specified fields in Table Schema Add schema.exact/orderedFields properties; clarified extra/non-specified fields in Table Schema Feb 23, 2024
@roll
Copy link
Member Author

roll commented Feb 23, 2024

I've updated the PR to have two properties that make data source requirements stricter if set to true:

  • exactFields
  • orderedFields

Selected defaults make it non-breaking in relation to the current v1 definition.

@roll
Copy link
Member Author

roll commented Mar 11, 2024

Hi @peterdesmet,

can you please take a look? WDYT?

Copy link
Member

@peterdesmet peterdesmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Selected defaults make it non-breaking in relation to the current v1 definition.

@roll, can you clarify this comment? The defaults for both fields are false, allowing partial matching, while frictionless v1 mandated complete matching.

Irrespective, do you think the default should be the most lenient approach?


A Table Schema descriptor `MAY` contain a property `exactFields` that `MUST` be boolean with default value `false`:

- **false** (default): The number of fields in the data source `MUST` be equal or more than the number of elements in the `fields` array.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about the (common) use case where you have less fields in the data source compared to the fields array?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterdesmet
It's a good question. Won't it make the data source invalid against the given Table Schema? I guess software still can read it but my feeling that the missing fields need to be marked as a structural error

I'm just trying to think here about a Table Schema as a data contract and missing fields seems to be a violation but I'm not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In biodiversity informatics, it is common to work with schemas (like this one https://rs.gbif.org/extension/gbif/1.0/distribution_2022-02-02.xml), where a publisher will only use a number of fields, since not all apply to the data they hold. I'd like to be able to use Table Schemas for that. 😄 Fields that are expected to be there can be indicated with the required property (and raise a validation error if they are not).

If you agree, would it be possible to update the wording to allow this use case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, missing fields can be filled by null and then e.g. required: true be used but it seems a little bit different as it's on an individual rows level

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@peterdesmet
Make sense!

I've updated the PR -- if a Table Schema author needs strictness here, they can use exactFields: true. If there is a use case of requiring equal or more fields we can add a separate constraint later

content/docs/specifications/table-schema.md Outdated Show resolved Hide resolved
roll and others added 2 commits March 14, 2024 15:03
@roll
Copy link
Member Author

roll commented Mar 14, 2024

@peterdesmet
It was a little bit of a surprise for me, but the current wording is:

It MUST contain a property fields. fields MUST be an array where each entry in the array is a field descriptor (as defined below). The order of elements in fields array SHOULD be the order of fields in the CSV file. The number of elements in the fields array SHOULD be the same as the number of fields in the CSV file.

It doesn't use MUST so basically, it's the same as:

exactFields: false
orderedFields: false

Irrespective, do you think the default should be the most lenient approach?

Yeah, I was thinking about it, and I did like the concept of applying constraints similar to JSONSchema e.g. additionalProperties: false. On the other hand, I don't have a strong opinion here but it seems to be the only non-breaking way based on the v1 wording.

@peterdesmet
Copy link
Member

@roll the current SHOULD wording is a "recent" change though. It was suggested by me in 2020. From 2012 to 2020 it was MUST, so I believe that is what most implementors/users consider v1.

@roll
Copy link
Member Author

roll commented Mar 14, 2024

@peterdesmet
Probably it was committed back then by mistake without updating a version according to semver etc but it's been there for 3 years so I think we just need to consider v1 to be as it is now.

BTW even before the change it had been using SHOULD for length:

The number of elements in fields array SHOULD be exactly the same as the number of fields in the CSV file.

@peterdesmet
Copy link
Member

Regarding v1: ok, seems it was SHOULD. But I think a number of data consumers assumed MUST (including frictionless-r) and did offer sync_schema functionality. Even if it is a breaking change, I think the strict approach (orderedFields: true, exactFields: true) makes more sense as a default to me. But I'd like to hear from others too.

@roll
Copy link
Member Author

roll commented Mar 14, 2024

I would vote for any, both false or both true, by default. I agree that both true by default is what, currently, is a de-facto standard on the implementation level.

I'm curious if there are better names to mirror it and use false by default .e.g unorderedFields and arbitraryFields. Personally, I always feel more natural not to use flags that truth by default but, of course, it's a matter of taste

@peterdesmet
Copy link
Member

@roll, we have some precedents of true defaults (and false):

  • CSV dialect doubleQuote: true
  • CSV dialect header: true
  • Table schema bareNumber: true
  • Table schema required: false
  • Table schema unique: false

I think the positive properties (orderedFields, exactFields) are easier to understand than the negative ones (unorderedFields, ...)

@roll
Copy link
Member Author

roll commented Mar 14, 2024

@peterdesmet
I agree, it's not a problem. I'll update defaults for voting now as it seems we're more leaning towards true by default

Upd.
Done

@roll roll changed the title Add schema.exact/orderedFields properties; clarified extra/non-specified fields in Table Schema Add schema.fieldsMatch property; clarified extra/non-specified fields in Table Schema Mar 18, 2024
@roll roll added the candidate label Mar 19, 2024
content/docs/specifications/table-schema.md Outdated Show resolved Hide resolved
content/docs/specifications/table-schema.md Outdated Show resolved Hide resolved
content/docs/specifications/table-schema.md Outdated Show resolved Hide resolved
@roll
Copy link
Member Author

roll commented Mar 28, 2024

With Paul's support expressed in the original issue we got the quorum. It was one of the most awaited features in the Data Package, and it's really great that we finally made it! Esp. in a form that beautifully aligned with the set theory that became real only after a great discussion (the initial proposal was way different).

🎉

@roll
Copy link
Member Author

roll commented Mar 28, 2024

ACCEPTED by WG (6/9)

@roll roll merged commit bd163e8 into main Mar 28, 2024
2 checks passed
@roll roll deleted the 367/schema-partial branch March 28, 2024 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Clarity around extra/non-specified fields
3 participants