This issue was moved to a discussion.
Community schemas: allowing less fields than are defined in the schema #635
Comments
Hi @peterdesmet, it's a very good issue. It needs to be discussed at the specs level, as it's not only a dimension requirement but also an addressing one: the mapping between file fields and schema fields is tied to their order. At the software level it's much easier; you just need to provide `extract('path/to/table.csv', schema=schema, sync_schema=True)` or `validate('path/to/table.csv', schema=schema, sync_schema=True)`. https://colab.research.google.com/drive/1is_PcpzFl42aWI2B2tHaBGj3jxsKZ_eZ#scrollTo=myOn0wYQcIHY
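Conceptually, a sync-schema option subsets and reorders the schema's fields to match the file header before validation. A minimal sketch in plain Python (this is not the frictionless API; `sync_schema_fields` is a hypothetical helper for illustration):

```python
import csv
import io

def sync_schema_fields(schema_fields, header):
    """Return the subset of schema fields, reordered to match the file header.
    Hypothetical helper illustrating what a sync_schema-style option could do."""
    by_name = {f["name"]: f for f in schema_fields}
    return [by_name[label] for label in header if label in by_name]

schema_fields = [
    {"name": "id", "type": "integer", "constraints": {"required": True}},
    {"name": "name", "type": "string"},
    {"name": "notes", "type": "string"},
]

# File has fewer columns than the schema, in a different order
data = "name,id\nAlice,1\n"
header = next(csv.reader(io.StringIO(data)))
synced = sync_schema_fields(schema_fields, header)
print([f["name"] for f in synced])  # ['name', 'id']
```

Note that this silently drops schema fields absent from the file, which is exactly the behavior questioned later in this thread.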
Fantastic! So the implementation allows fewer fields (1) and fields in a different order (2) than is provided in the schema. Question: is this … The fewer fields case (1) is addressed in the specs (
Yes, it's available in the CLI. Not sure why the spec uses …
But given the need (and the already supported implementation), the issue at hand is to change …
I agree. Would you like to PR?
Yes, please see frictionlessdata/specs#707
Thanks
Commented by @roll in frictionlessdata/specs#707 (comment):
I don't understand this. E.g. a camera trap data package for a specific project has CSV files. These can be described literally (no reference) in …
I would use something like this just because a Schema makes sense, and can exist, even when you don't have a file.
I mean you can write an abstract Table Schema. It's not tied to any file. That's why it can be a community/shared/etc. schema 😃 One schema can describe N data files. Data Package and Data Resource, for example, are tied to exact files; Table Schema is metadata.
There is a problem: this will not work in any existing software unless you provide special flags at the software level like … To make it possible we would need a whole new feature in the specs that allows software to look up the fields by name. Some explicit analogue of … PS. …
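To make the idea of name-based field lookup concrete, here is a sketch of what such a spec feature could enable software to do. This is purely illustrative (neither an existing spec feature nor the frictionless API; `check_header` is a hypothetical helper): match header labels to schema fields by name, and report required fields missing from the header as well as labels that match no schema field.

```python
def check_header(schema_fields, header):
    """Name-based matching: report required fields absent from the header,
    and header labels that match no schema field. Illustrative only."""
    names = {f["name"] for f in schema_fields}
    required = {
        f["name"]
        for f in schema_fields
        if f.get("constraints", {}).get("required")
    }
    missing_required = sorted(required - set(header))
    unknown_labels = sorted(set(header) - names)
    return missing_required, unknown_labels

fields = [
    {"name": "id", "constraints": {"required": True}},
    {"name": "title", "constraints": {"required": True}},
    {"name": "notes"},
]
result = check_header(fields, ["title", "extra"])
print(result)  # (['id'], ['extra'])
```

Unlike a sync-schema option that rewrites the schema to fit the file, this keeps the schema fixed and surfaces the mismatch, which is what the `required` constraint needs.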
Regarding …
Regarding the Oct 1 message from @roll:
I find that adding sync_schema makes it impossible to trap header errors, for example non-matching-header. In the validation report, the header is changed to include possibly incorrect incoming header strings, constraints initially in the referenced schema (e.g. 'required') are dropped, and the type becomes 'any'. I encountered this while trying to move from goodtables to frictionless to take advantage of its xlsx support. My situation is that incoming data can have fewer fields than the schema describes, but a few of the fields are required. Data creators shouldn't have to care about the order of fields, and in goodtables (2.4.1) this is not an issue.
Hi @kgeographer, can you please post a simple example of the header and the field names in your case? I'm trying to figure out: if labels in the data don't match the field names, how can we link them for validation?
Yes, my case arises in validating data contributions for World Historical Gazetteer using the LP-TSV variation of our contribution format. Attached are a sample data file, a sample schema, and a little Python script for demo. One problem is that with sync_schema=True, the file is validated even if a header for a required field is missing or if the column is missing altogether, I guess because the schema under consideration is modified on the fly to simply match what arrives. The second problem is that if sync_schema is not used, and an incoming file's header is not in the same order as the schema, and/or it is missing some non-required fields, then it generates many errors because columns are evaluated against the schema by index/position, not by header string. Thanks
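The positional-vs-name matching problem described above can be sketched as follows. This is an illustrative stub (not part of frictionless or goodtables; `cast_errors` is a hypothetical function): the same row produces spurious type errors when columns are matched to schema fields by position, but none when matched by header name.

```python
def cast_errors(header, row, fields, by_name):
    """Count cells failing the schema type when columns are matched
    by position vs by header name. Illustrative stub only."""
    errors = 0
    for i, value in enumerate(row):
        if by_name:
            field = next((f for f in fields if f["name"] == header[i]), None)
        else:
            field = fields[i] if i < len(fields) else None
        # Only integer casting is checked, to keep the sketch minimal
        if field and field["type"] == "integer" and not value.isdigit():
            errors += 1
    return errors

fields = [
    {"name": "id", "type": "integer"},
    {"name": "title", "type": "string"},
]
header, row = ["title", "id"], ["Roma", "7"]  # valid data, swapped column order

print(cast_errors(header, row, fields, by_name=False))  # 1 (spurious error)
print(cast_errors(header, row, fields, by_name=True))   # 0
```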
Thanks @kgeographer, I've created a feature request - frictionlessdata/frictionless-py#546 |
The Data Resource spec allows pointing to an externally hosted Resource Schema (`schema: url-or-path`). This is great, because it allows communities to, for example, develop and host an agreed-upon set of Table Schemas (example), and data producers within that community can indicate that they are following the shared schema.

A very common use case, however, is that data producers will have fewer fields in their CSV files than are defined in the schema (e.g. skipping some non-required fields that are not applicable). According to the Table Schema specs, that should be possible (the key word is `SHOULD`) as long as the fields are in the correct order. Is this a correct assumption?
As in:
Should be valid against (an externally hosted):
Validation in e.g. frictionless-py seems to require the exact same number of fields, which is a serious drawback for externally hosted schemas, so I wanted to understand first what the intention of the spec is.
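The "fewer fields, same order" reading of the spec amounts to checking that the file header is an ordered subset (a subsequence) of the schema's field names. A minimal sketch under that assumption (plain Python, not any existing validator's API):

```python
def is_ordered_subset(header, schema_field_names):
    """True if header is a subsequence of the schema's field order,
    i.e. fields may be skipped but their relative order is preserved."""
    it = iter(schema_field_names)
    # `label in it` consumes the iterator, enforcing order
    return all(label in it for label in header)

schema_order = ["id", "name", "start", "end", "notes"]
ok = is_ordered_subset(["id", "start", "notes"], schema_order)
bad = is_ordered_subset(["start", "id"], schema_order)
print(ok)   # True: fewer fields, schema order kept
print(bad)  # False: order differs from the schema
```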