
Community schemas: allowing less fields than are defined in the schema #635

Closed
peterdesmet opened this issue Oct 1, 2020 · 17 comments

@peterdesmet (Member)

The Data Resource spec allows pointing to an externally hosted Resource Schema (schema: url-or-path). This is great, because it allows communities to, for example, develop and host an agreed-upon set of Table Schemas (example), and data producers within that community to indicate that they are following the shared schema:

"resources": [
    {
      "name": "deployments",
      "path": "deployments.csv"
      "profile": "tabular-data-resource",
      "schema": "https://gitlab.com/oscf/camtrap-package-schemas/-/raw/development/deployments-table-schema.json"
    },
   {
      "name": "media",
      "path": "media.csv",
      "profile": "tabular-data-resource",
      "schema": "https://gitlab.com/oscf/camtrap-package-schemas/-/raw/development/media-table-schema.json"
    }
]

A very common use case, however, is that data producers will have fewer fields in their CSV files than are defined in the schema (e.g. skipping some non-required fields that are not applicable). According to the Table Schema spec, that should be possible (key word is SHOULD) as long as the fields are in the correct order:

A Table Schema is represented by a descriptor. The descriptor MUST be a JSON object (JSON is defined in RFC 4627).

It MUST contain a property fields. fields MUST be an array where each entry in the array is a field descriptor (as defined below). The order of elements in fields array MUST be the order of fields in the CSV file. The number of elements in fields array SHOULD be exactly the same as the number of fields in the CSV file.

Is this a correct assumption?

As in:

id,name,population
1,Britain,67
2,France,

Should be valid against (an externally hosted):

{
  "fields": [
    { "name": "id" },
    { "name": "neighbour_id" }, <-- Extra field
    { "name": "name" },
    { "name": "population" }
  ]
}
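For illustration, a strictly positional validator would pair CSV columns with schema fields by index, which misaligns everything after the extra field (a hypothetical plain-Python sketch, not any library's actual logic):

# Hypothetical illustration of strict positional matching
header = ["id", "name", "population"]                         # CSV columns
schema_fields = ["id", "neighbour_id", "name", "population"]  # schema order

for i, field in enumerate(schema_fields):
    label = header[i] if i < len(header) else None
    status = "ok" if label == field else "MISMATCH"
    print(f"position {i}: schema={field!r} data={label!r} -> {status}")

# position 0: schema='id' data='id' -> ok
# position 1: schema='neighbour_id' data='name' -> MISMATCH
# position 2: schema='name' data='population' -> MISMATCH
# position 3: schema='population' data=None -> MISMATCH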

Validation in e.g. frictionless-py seems to require the exact same number of fields, which is a serious drawback for externally hosted schemas, so I wanted to understand first what the intention of the spec is.

@roll (Member) commented Oct 2, 2020

Hi @peterdesmet,

It's a very good issue. It needs to be discussed at the specs level, as it's not only a dimension requirement but also an addressing one: the mapping between file fields and schema fields is tied to their order.

On the software level it's much easier: you just need to provide the sync_schema argument. In this case, the provided schema.fields will be used as a reference table indexed by field name, and it can have more or fewer fields than the data has.

from frictionless import extract, validate

extract('path/to/table.csv', schema=schema, sync_schema=True)
validate('path/to/table.csv', schema=schema, sync_schema=True)

https://colab.research.google.com/drive/1is_PcpzFl42aWI2B2tHaBGj3jxsKZ_eZ#scrollTo=myOn0wYQcIHY

@peterdesmet (Member Author)

Fantastic! So the implementation allows having fewer fields (1) and fields in a different order (2) than are provided in the schema. Question: is this sync_schema also available as an option in the CLI?

The fewer fields case (1) is addressed in the specs (SHOULD), but as you point out, the order (2) is currently required (MUST) in the specs:

The order of elements in fields array MUST be the order of fields in the CSV file.

@roll (Member) commented Oct 2, 2020

Fantastic! So the implementation allows having fewer fields (1) and fields in a different order (2) than are provided in the schema. Question: is this sync_schema also available as an option in the CLI?

Yes, it's available in the CLI.
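For example (the exact flag spelling here is an assumption: recent frictionless-py spells the Python argument schema_sync and the CLI mirrors it, so check frictionless validate --help for your installed version):

# Assumed spelling; verify against your installed version
frictionless validate path/to/table.csv --schema schema.json --schema-sync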

Not sure why the spec uses SHOULD instead of MUST

@peterdesmet (Member Author)

Not sure why the spec uses SHOULD instead of MUST

But given the need (and the already supported implementation), the issue at hand is to change MUST into SHOULD (recommended but not required), right?

It MUST contain a property fields. fields MUST be an array where each entry in the array is a field descriptor (as defined below). The order of elements in fields array ~~MUST~~ SHOULD be the order of fields in the CSV file. The number of elements in fields array SHOULD be exactly the same as the number of fields in the CSV file.

@roll (Member) commented Oct 6, 2020

I agree. Would you like to PR?

@peterdesmet (Member Author)

Yes, please see frictionlessdata/specs#707

@roll (Member) commented Oct 6, 2020

Thanks

@peterdesmet (Member Author) commented Oct 7, 2020

Commented by @roll in frictionlessdata/specs#707 (comment):

@peterdesmet
TBH I thought you were going to make both MUST, not SHOULD. I think they have to be MUST.

But actually, I think there are two aspects:

  • for concrete data + schema (as a Resource has), they should be MUST
  • for an abstract schema applicable to many files, or a community schema used as a reference, there is no CSV file at all

Maybe we need to rephrase this paragraph.

@peterdesmet (Member Author)

there is no CSV file at all

I don't understand this. E.g. a camera trap data package for a specific project has CSV files. These can be described literally (no reference) in the schema for each resource. Or (better) schema can reference the URL of the community table schema (an external JSON file that may describe more or fewer columns, in a different order).

@roll (Member) commented Oct 7, 2020

It MUST contain a property fields. fields MUST be an array where each entry in the array is a field descriptor (as defined below).

The order of fields in a compliant tabular file MUST BE the order of elements in the fields array. The number of fields in a compliant tabular file MUST BE the number of elements in the fields array.

I would use something like this, just because a Schema makes sense (and can exist) even when you don't have a file.

@roll (Member) commented Oct 7, 2020

there is no CSV file at all.
I don't understand this.

I mean you can write an abstract Table Schema. It's not tied to any file. That's why it can be a community/shared/etc. schema 😃 One schema can describe N data files.

By contrast, Data Package and Data Resource are tied to exact files.


Table Schema: metadata
Data Resource: metadata + data
Data Package: metadata + data

@roll (Member) commented Oct 7, 2020

E.g. a camera trap data package for a specific project has CSV files. These can be described literally (no reference) in the schema for each resource. Or (better) schema can reference the URL of the community table schema (an external JSON file that may describe more or fewer columns, in a different order).

The problem is that this will not work in any existing software unless you provide special flags at the software level, like validate(..., sync_schema=True). But that's an extension over the spec; it's not actually described in the spec, even if we change MUST -> SHOULD.

To make it possible we would need a whole new feature in the specs that allows software to look up the fields by name, some explicit analogue of sync_schema; otherwise it will be ambiguous.

PS.
Sorry for the many messages; I've only just fully understood the problem.

@peterdesmet (Member Author)

@roll

Regarding MUST vs SHOULD in Table Schema

you write an abstract Table Schema. It's not tied to any file.

I agree, which is exactly why relaxing the order/number of fields (SHOULD not MUST, frictionlessdata/specs#707) is an improvement imo, because it reflects reality and supports more use cases (e.g. community standards). Since it's not tied to data, it is a bit odd that "CSV file" is mentioned (3 times) in the Table Schema spec though. It makes it easy to read, but maybe that should be adapted?

Regarding making the sync_schema implementation explicit

I agree completely regarding:

To make it possible we would need a whole new feature in the specs that allows software to look up the fields by name, some explicit analogue of sync_schema; otherwise it will be ambiguous.

I guess this should then be mentioned in the specs for Tabular Data Resource (not Table Schema), somewhere at:

The schema property MUST follow the Table Schema specification, either as a JSON object directly under the property, or a string referencing another JSON document containing the Table Schema
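For illustration, such a by-name lookup could work roughly like this (a hypothetical sketch; the function name and the fallback to a permissive "any" field are illustrative, not from the spec or from frictionless-py):

# Hypothetical sketch of an explicit "look up fields by name" analogue
# of sync_schema; not actual spec or library behaviour.
def sync_fields(schema_fields, header):
    """Filter and reorder schema fields to match the data header by name."""
    by_name = {f["name"]: f for f in schema_fields}
    # Schema fields absent from the header are dropped; header labels
    # absent from the schema fall back to a permissive "any" field.
    return [by_name.get(label, {"name": label, "type": "any"}) for label in header]

schema_fields = [
    {"name": "id"},
    {"name": "neighbour_id"},  # in the schema, not in the data: dropped
    {"name": "name"},
    {"name": "population"},
]
print(sync_fields(schema_fields, ["id", "name", "population"]))
# [{'name': 'id'}, {'name': 'name'}, {'name': 'population'}]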

@kgeographer

Regarding the Oct 2 message from @roll:

On the software level it's much easier: you just need to provide the sync_schema argument. In this case, the provided schema.fields will be used as a reference table indexed by field name, and it can have more or fewer fields than the data has.

extract('path/to/table.csv', schema=schema, sync_schema=True)
validate('path/to/table.csv', schema=schema, sync_schema=True)

I find that adding sync_schema makes it impossible to trap header errors, for example non-matching-header. In the validation report, the header is changed to include possibly incorrect incoming header strings, and constraints initially in the referenced schema (e.g. 'required') are dropped; the type becomes 'any'.

I encountered this trying to move from goodtables to frictionless to take advantage of its xlsx support. My situation is that incoming data can have fewer fields than the schema describes, but a few of the fields are required. Data creators shouldn't have to care about the order of fields, and in goodtables (2.4.1) this is not an issue.
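One possible workaround is to do the required-field check yourself, by name, before handing the file to validate(..., sync_schema=True). A plain-Python sketch (file names are hypothetical; it assumes the referenced schema is a Table Schema JSON file with required constraints):

# Hypothetical pre-check: sync_schema alone won't flag a missing
# required column, so verify required headers by name first.
import csv
import json

with open("schema.json") as f:              # the referenced Table Schema
    schema = json.load(f)

required = [
    field["name"] for field in schema["fields"]
    if field.get("constraints", {}).get("required")
]

with open("table.csv", newline="") as f:    # the incoming data file
    header = next(csv.reader(f))

missing = [name for name in required if name not in header]
if missing:
    print("missing required columns:", missing)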

@roll (Member) commented Nov 23, 2020

Hi @kgeographer,

Can you please post a simple example of the header and the field names in your case? I'm trying to figure out: if the labels in the data don't match the field names, how can we link them for validation?

@kgeographer

Yes, my case arises when validating data contributions for World Historical Gazetteer using the LP-TSV variation of our contribution format.

Attached are a sample data file, a sample schema, and a little Python script as a demo.

One problem is that with sync_schema=True, the file is validated even if a header for a required field is missing, or if the column is missing altogether. I guess this is because the schema is modified on the fly to simply match what arrives.

The second problem is that if sync_schema is not used, and an incoming file header is not in the same order as the schema, and/or is missing some non-required fields, then it generates many errors, because columns are evaluated against the schema by index/position, not by header string.

thanks

issue_704.zip

@roll (Member) commented Nov 25, 2020

Thanks @kgeographer,

I've created a feature request - frictionlessdata/frictionless-py#546

@rufuspollock transferred this issue from frictionlessdata/specs on Mar 31, 2021
@frictionlessdata locked and limited conversation to collaborators on Mar 31, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
