
Community schemas: allowing less fields than are defined in the schema #635

Closed
peterdesmet opened this issue Oct 1, 2020 · 17 comments

@peterdesmet (Member)

The Data Resource spec allows pointing to an externally hosted Resource Schema (schema: url-or-path). This is great, because it allows communities to, for example, develop and host an agreed-upon set of Table Schemas (example), and data producers within that community to indicate that they are following the shared schema:

"resources": [
    {
      "name": "deployments",
      "path": "deployments.csv"
      "profile": "tabular-data-resource",
      "schema": "https://gitlab.com/oscf/camtrap-package-schemas/-/raw/development/deployments-table-schema.json"
    },
   {
      "name": "media",
      "path": "media.csv",
      "profile": "tabular-data-resource",
      "schema": "https://gitlab.com/oscf/camtrap-package-schemas/-/raw/development/media-table-schema.json"
    }
]

A very common use case, however, is that data producers will have fewer fields in their CSV files than are defined in the schema (e.g. skipping some non-required fields that are not applicable). According to the Table Schema spec, that should be possible (key word is SHOULD) as long as the fields are in the correct order:

A Table Schema is represented by a descriptor. The descriptor MUST be a JSON object (JSON is defined in RFC 4627).

It MUST contain a property fields. fields MUST be an array where each entry in the array is a field descriptor (as defined below). The order of elements in fields array MUST be the order of fields in the CSV file. The number of elements in fields array SHOULD be exactly the same as the number of fields in the CSV file.

Is this a correct assumption?

As in:

id,name,population
1,Britain,67
2,France,

Should be valid against (an externally hosted):

{
  "fields": [
    { "name": "id" },
    { "name": "neighbour_id" }, <-- Extra field
    { "name": "name" },
    { "name": "population" }
  ]
}
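For illustration, a strictly positional validator would pair CSV columns with schema fields by index, which misaligns everything after the extra field (a hypothetical plain-Python sketch, not any library's actual logic):

# Hypothetical illustration of strict positional matching
header = ["id", "name", "population"]                         # CSV columns
schema_fields = ["id", "neighbour_id", "name", "population"]  # schema order

for i, field in enumerate(schema_fields):
    label = header[i] if i < len(header) else None
    status = "ok" if label == field else "MISMATCH"
    print(f"position {i}: schema={field!r} data={label!r} -> {status}")

# position 0: schema='id' data='id' -> ok
# position 1: schema='neighbour_id' data='name' -> MISMATCH
# position 2: schema='name' data='population' -> MISMATCH
# position 3: schema='population' data=None -> MISMATCH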

Validation in e.g. frictionless-py seems to require the exact same number of fields, which is a serious drawback for externally hosted schemas, so I wanted to understand first what the intention of the spec is.

@roll (Member) commented Oct 2, 2020

Hi @peterdesmet,

It's a very good issue. It needs to be discussed at the specs level, as it's not only a dimension requirement but also an addressing one: the mapping between file fields and schema fields is tied to their order.

On the software level it's much easier: you just need to provide the sync_schema argument. In this case, the provided schema.fields will be used as a reference table indexed by field name, and it can have more or fewer fields than the data has.

from frictionless import extract, validate

extract('path/to/table.csv', schema=schema, sync_schema=True)
validate('path/to/table.csv', schema=schema, sync_schema=True)

https://colab.research.google.com/drive/1is_PcpzFl42aWI2B2tHaBGj3jxsKZ_eZ#scrollTo=myOn0wYQcIHY

@peterdesmet (Member Author)

Fantastic! So the implementation allows having fewer fields (1) and fields in a different order (2) than are provided in the schema. Question: is this sync_schema also available as an option in the CLI?

The fewer fields case (1) is addressed in the specs (SHOULD), but as you point out, the order (2) is currently required (MUST) in the specs:

The order of elements in fields array MUST be the order of fields in the CSV file.

@roll (Member) commented Oct 2, 2020

Fantastic! So the implementation allows having fewer fields (1) and fields in a different order (2) than are provided in the schema. Question: is this sync_schema also available as an option in the CLI?

Yes, it's available in the CLI.
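For example (the exact flag spelling here is an assumption: recent frictionless-py spells the Python argument schema_sync and the CLI mirrors it, so check frictionless validate --help for your installed version):

# Assumed spelling; verify against your installed version
frictionless validate path/to/table.csv --schema schema.json --schema-sync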

Not sure why the spec uses SHOULD instead of MUST

@peterdesmet (Member Author)

Not sure why the spec uses SHOULD instead of MUST

But given the need (and the already supported implementation), the issue at hand is to change MUST into SHOULD (recommended but not required), right?

It MUST contain a property fields. fields MUST be an array where each entry in the array is a field descriptor (as defined below). The order of elements in fields array ~~MUST~~ SHOULD be the order of fields in the CSV file. The number of elements in fields array SHOULD be exactly the same as the number of fields in the CSV file.

@roll (Member) commented Oct 6, 2020

I agree. Would you like to PR?

@peterdesmet (Member Author)

Yes, please see frictionlessdata/specs#707

@roll (Member) commented Oct 6, 2020

Thanks

@peterdesmet (Member Author) commented Oct 7, 2020

Commented by @roll in frictionlessdata/specs#707 (comment):

@peterdesmet
TBH I thought you were going to make both MUST, not SHOULD. I think they have to be MUST.

But actually, I think there are two aspects:

  • for concrete data + schema (as a Resource has), they should be MUST
  • for an abstract schema applicable to many files, or a community schema used as a reference, there is no CSV file at all

Maybe we need to rephrase this paragraph.

@peterdesmet (Member Author)

there is no CSV file at all

I don't understand this. E.g. a camera trap data package for a specific project has CSV files. These can be described literally (no reference) in the schema for each resource. Or (better) schema can reference the URL of the community table schema (an external JSON file that may describe more or fewer columns, in a different order).

@roll (Member) commented Oct 7, 2020

It MUST contain a property fields. fields MUST be an array where each entry in the array is a field descriptor (as defined below).

The order of fields in a compliant tabular file MUST BE the order of elements in the fields array. The number of fields in a compliant tabular file MUST BE the number of elements in the fields array.

I would use something like this, just because a Schema makes sense (and can exist) even when you don't have a file.

@roll (Member) commented Oct 7, 2020

there is no CSV file at all.
I don't understand this.

I mean you can write an abstract Table Schema. It's not tied to any file. That's why it can be a community/shared/etc. schema 😃 One schema can describe N data files.

By contrast, Data Package and Data Resource are tied to exact files.


Table Schema: metadata
Data Resource: metadata + data
Data Package: metadata + data

@roll (Member) commented Oct 7, 2020

E.g. a camera trap data package for a specific project has CSV files. These can be described literally (no reference) in the schema for each resource. Or (better) schema can reference the URL of the community table schema (an external JSON file that may describe more or fewer columns, in a different order).

The problem is that this will not work in any existing software unless you provide special flags at the software level, like validate(..., sync_schema=True). But that's an extension over the spec; it's not actually described in the spec, even if we change MUST -> SHOULD.

To make it possible we would need a whole new feature in the specs that allows software to look up the fields by name, some explicit analogue of sync_schema; otherwise it will be ambiguous.

PS.
Sorry for the many messages; I've only just fully understood the problem.

@peterdesmet (Member Author)

@roll

Regarding MUST vs SHOULD in Table Schema

you write an abstract Table Schema. It's not tied to any file.

I agree, which is exactly why relaxing the order/number of fields (SHOULD not MUST, frictionlessdata/specs#707) is an improvement imo, because it reflects reality and supports more use cases (e.g. community standards). Since it's not tied to data, it is a bit odd that "CSV file" is mentioned (3 times) in the Table Schema spec though. It makes it easy to read, but maybe that should be adapted?

Regarding making the sync_schema implementation explicit

I agree completely regarding:

To make it possible we would need a whole new feature in the specs that allows software to look up the fields by name, some explicit analogue of sync_schema; otherwise it will be ambiguous.

I guess this should then be mentioned in the specs for Tabular Data Resource (not Table Schema), somewhere at:

The schema property MUST follow the Table Schema specification, either as a JSON object directly under the property, or a string referencing another JSON document containing the Table Schema
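For illustration, such a by-name lookup could work roughly like this (a hypothetical sketch; the function name and the fallback to a permissive "any" field are illustrative, not from the spec or from frictionless-py):

# Hypothetical sketch of an explicit "look up fields by name" analogue
# of sync_schema; not actual spec or library behaviour.
def sync_fields(schema_fields, header):
    """Filter and reorder schema fields to match the data header by name."""
    by_name = {f["name"]: f for f in schema_fields}
    # Schema fields absent from the header are dropped; header labels
    # absent from the schema fall back to a permissive "any" field.
    return [by_name.get(label, {"name": label, "type": "any"}) for label in header]

schema_fields = [
    {"name": "id"},
    {"name": "neighbour_id"},  # in the schema, not in the data: dropped
    {"name": "name"},
    {"name": "population"},
]
print(sync_fields(schema_fields, ["id", "name", "population"]))
# [{'name': 'id'}, {'name': 'name'}, {'name': 'population'}]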

@kgeographer

Regarding the Oct 2 message from @roll:

On the software level it's much easier: you just need to provide the sync_schema argument. In this case, the provided schema.fields will be used as a reference table indexed by field name, and it can have more or fewer fields than the data has.

extract('path/to/table.csv', schema=schema, sync_schema=True)
validate('path/to/table.csv', schema=schema, sync_schema=True)

I find that adding sync_schema makes it impossible to trap header errors, for example non-matching-header. In the validation report, the header is changed to include possibly incorrect incoming header strings, and constraints initially in the referenced schema (e.g. 'required') are dropped; the type becomes 'any'.

I encountered this trying to move from goodtables to frictionless to take advantage of its xlsx support. My situation is that incoming data can have fewer fields than the schema describes, but a few of the fields are required. Data creators shouldn't have to care about the order of fields, and in goodtables (2.4.1) this is not an issue.
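One possible workaround is to do the required-field check yourself, by name, before handing the file to validate(..., sync_schema=True). A plain-Python sketch (file names are hypothetical; it assumes the referenced schema is a Table Schema JSON file with required constraints):

# Hypothetical pre-check: sync_schema alone won't flag a missing
# required column, so verify required headers by name first.
import csv
import json

with open("schema.json") as f:              # the referenced Table Schema
    schema = json.load(f)

required = [
    field["name"] for field in schema["fields"]
    if field.get("constraints", {}).get("required")
]

with open("table.csv", newline="") as f:    # the incoming data file
    header = next(csv.reader(f))

missing = [name for name in required if name not in header]
if missing:
    print("missing required columns:", missing)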

@roll (Member) commented Nov 23, 2020

Hi @kgeographer,

Can you please post a simple example of the header and the field names in your case? I'm trying to figure out: if the labels in the data don't match the field names, how can we link them for validation?

@kgeographer

Yes, my case arises when validating data contributions for World Historical Gazetteer using the LP-TSV variation of our contribution format.

Attached are a sample data file, a sample schema, and a little Python script as a demo.

One problem is that with sync_schema=True, the file is validated even if a header for a required field is missing, or if the column is missing altogether. I guess this is because the schema is modified on the fly to simply match what arrives.

The second problem is that if sync_schema is not used, and an incoming file header is not in the same order as the schema, and/or is missing some non-required fields, then it generates many errors, because columns are evaluated against the schema by index/position, not by header string.

thanks

issue_704.zip

@roll (Member) commented Nov 25, 2020

Thanks @kgeographer,

I've created a feature request - frictionlessdata/frictionless-py#546

@rufuspollock transferred this issue from frictionlessdata/specs on Mar 31, 2021
@frictionlessdata locked and limited conversation to collaborators on Mar 31, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
