Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Relax resource.name rules but keep it required and unique #27

Merged
merged 7 commits into from
Feb 20, 2024

Conversation

roll
Copy link
Member

@roll roll commented Feb 5, 2024


Rationale

It's a fixed version of #21. In this edition, resource.name and field.name will have the same level of requirements (making it quite nice symmetry across the specs). Please see the reasons for relaxing the rules in the linked issue and PR.

Copy link
Member

@peterdesmet peterdesmet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested some changes:

  • Align the definitions
  • Include human-readable aspect for both
  • Reinstate uniqueness of resource names as a MUST
  • Added section on required properties

@@ -191,23 +191,17 @@ Thus, a consumer of resource object `MAY` assume if no format or mediatype prope

### Metadata Properties

#### Required Properties
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Required properties section should be kept, since name is a required property.

content/docs/specifications/data-resource.md Show resolved Hide resolved
@@ -146,7 +146,7 @@ In addition to the required properties, the following properties `SHOULD` be inc

##### `name`

A short url-usable (and preferably human-readable) name of the package. This `MUST` be lower-case and contain only alphanumeric characters along with ".", "\_" or "-" characters. It will function as a unique identifier and therefore `SHOULD` be unique in relation to any registry in which this package will be deposited (and preferably globally unique).
A short url-usable (and preferably human-readable) name of the package. This `SHOULD` be lower-case and contain only alphanumeric characters along with ".", "\_" or "-" characters. It will function as a unique identifier and therefore `SHOULD` be unique in relation to any registry in which this package will be deposited (and preferably globally unique).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
A short url-usable (and preferably human-readable) name of the package. This `SHOULD` be lower-case and contain only alphanumeric characters along with ".", "\_" or "-" characters. It will function as a unique identifier and therefore `SHOULD` be unique in relation to any registry in which this package will be deposited (and preferably globally unique).
A simple name or identifier for the package.
The name `SHOULD` be human-readable and consist only of lowercase alphanumeric characters plus ".", "-" and "\_".
The name `SHOULD` be unique in relation to any registry in which this package will be deposited (and preferably globally unique).
  • Rewrote to align with resource name
  • Removed the url-usable (since this aligns with the character recommendation)

package.
- It `MUST` consist only of lowercase alphanumeric characters plus ".", "-" and "\_".
- If present, the name `SHOULD` be unique amongst all resources in this data package.
- It `SHOULD` consist only of lowercase alphanumeric characters plus ".", "-" and "\_".
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- It `SHOULD` consist only of lowercase alphanumeric characters plus ".", "-" and "\_".
- It `SHOULD` be human-readable and consist only of lowercase alphanumeric characters plus ".", "-" and "\_".

Add suggestion to be human-readable, cf. package name.

@roll
Copy link
Member Author

roll commented Feb 19, 2024

@peterdesmet
I still think that we have a missing opportunity here by being too strict on resource.name as keeping uniqueness will basically mean keeping the thing that creates a lot of implementation problems (e.g., in Open Data Editor).

Personally, and based on Rufus's arguments, I consider resource.name as a feature that allows foreign keys referencing and resource access by names, but as any feature, it's not needed for everyone. If a data producer just wants to add a simple data API to their datasets or if we talk about metadata mapping (e.g. we will be doing for CKAN and Zenodo exports later on the projects), this feature is just not needed (e.g. as in mentioned OWID datasets). So for me it feels more like this you know 80 / 20 concept -- complicating things for 80% of users to benefit the 20% rest.

BTW, maybe it can be just enforced on an extension level? E.g. data package doesn't require resource.name uniqueness but camtrap requires it?

@peterdesmet
Copy link
Member

Thanks @roll, I think the main issue for me is the "resource access by names" feature, which is currently key to reading data in frictionless-r.

Rufus argument "All you need to identify a resource is a url or relative path!" is incorrect in my opinion, since path is optional (over data) and it can contain an array of paths. However, one can always uniquely identify a resource by its index in resources (e.g. resource 2 of 3). frictionless-r could potentially be adapted to use that as an alternative to name (ping @khusmann to know what he thinks of this). But if a name is present, I think that "access by name" and "foreign keys" would really benefit if it is kept unique.

So I suggest:

  • Don't require resource name: this should solve the issues that OWID datasets and Open Data Editor have with the current (strict) spec. This is with some reluctance, as I really like having resource name 😄 , but if it results having to invent meaningless names or having invalid datapackage.json being published, then that's not great.
  • Don't require resource name to stick to a certain formatting: that was never really necessary. Best to align with field.
  • If name is present, have it be unique, so any feature relying on it doesn't have to check that (but can just check if name is present). This won't affect implementations that don't use name.

@roll
Copy link
Member Author

roll commented Feb 19, 2024

@peterdesmet
Great wrap-up! Let me update the PR tomorrow based on your comment.

Regarding implementations, I think the key here is that an implementation can generate names if it needs it (e.g. frictionless-py generates field names if not provided etc -- as we do our best to read the data anyway even if the metadata is not fully valid). Not requiring resource.name we just unload it from data publishers.

BTW I'm curious if we can finally make field.name unique as well? If we introduce schema.partial (or similar) and clarify field ordering I think it will make a lot of sense

@peterdesmet
Copy link
Member

Let me update the PR tomorrow based on your comment.

👍 Might be good to start from the original text on main, so unintended changes don't creep in.

make field.name unique as well

Perhaps. And indeed as part of schema partial. It is non-backward compatible change though.

@khusmann
Copy link
Contributor

khusmann commented Feb 19, 2024

Hmmm, like @peterdesmet I am loath to drop resource.name. My mental model of a datapackage is similar to an SQL db: a collection of named tables identified by unique names. But I understand this creates challenges for the OWID and the Open Data Editor.

However, one can always uniquely identify a resource by its index in resources (e.g. resource 2 of 3)

I think this is a key observation by @peterdesmet, and equally applies to fields. This gives us a guaranteed way of identifying fields and resources, even without names.

In an API like frictionless-r using names to target specific resources & fields is preferred. But we could also have numeric indexing work when names are not available. There is precedent for this behavior in R:

foo <- list(field1 = "a", field2 = "b", field3 = "c")
foo[["field1"] # "a"
foo[[1]] # "a"

We could do a similar thing with foreignKeys and other resource / field references within schemas: When a value is numeric, locate the resource / field via index. When a value is string, locate the resource / field via name field.

For example, a foreign keys def using names looking like this:

"foreignKeys": [
  {
    "fields": "product",
    "reference": {
      "resource": ["product-info"],
      "fields": ["product"]
     }
   }
]

could be equivalently specified with indices like this (assuming "product-info" was the first resource in the datapackage, and "product" was its first field):

"foreignKeys": [
  {
    "fields": "product",
    "reference": {
      "resource": [1],
      "fields": [1]
     }
   }
]

Whether we're using the name field or an index for the reference can safely inferred from the usage of a numeric or string type.

If name is present, have it be unique, so any feature relying on it doesn't have to check that (but can just check if name is present). This won't affect implementations that don't use name
Don't require resource name to stick to a certain formatting: that was never really necessary. Best to align with field.

Agreed. As an identifier, resource.name should be unique (scoped to within a datapackage). I also don't care about its formatting.

make field.name unique as well

Oof, I didn't realize field.name wasn't already unique... I think unique field names (scoped to within a resource) is important, because I would expect to use them as identifiers. I support this change even though it is breaking, because I'd go so far to consider this a "bug" in the original spec. In my experience, allowing duplicate column names is a very bad time™.

@peterdesmet
Copy link
Member

@khusmann excellent points, just a bit reluctant to roll out [resource name or index] everywhere where a resource is mentioned. It makes sense, but it is quite the dramatic change, just like dropping resource.name is.

@roll can I suggest reducing this PR to removing the string constraints for resource.name? There seems to be consensus for that and will make it easy to reference.

And to create a separate PR for no longer requiring resource.name, which needs a bit more thought, such as does foreign keys still expect it or can indexes be used everywhere?

@ezwelty
Copy link

ezwelty commented Feb 20, 2024

Focusing on the string constraints on resource.name – I'm very happy about the change but why even retain a SHOULD on package.name and resource.name containing only a-z0-9.-_ (as opposed to field.name which was always character-agnostic)? It feels overly restrictive and old-fashioned in an age of browsers that know how to navigate to URLs like https://zh.wikipedia.org/wiki/近畿地方 (by escaping special characters, but that's quickly becoming an implementation detail). And rather latin-script-centric, given there are billions of people for which a-z is not particularly "human-readable" (or at least not their first choice) as the specs recommend.

Copy link

cloudflare-pages bot commented Feb 20, 2024

Deploying with  Cloudflare Pages  Cloudflare Pages

Latest commit: 2004fdf
Status: ✅  Deploy successful!
Preview URL: https://2a4d30c0.datapackage.pages.dev
Branch Preview URL: https://685-resource-name-optional.datapackage.pages.dev

View logs

@roll
Copy link
Member Author

roll commented Feb 20, 2024

Great! I think we have a broad agreement on removing resource.name sluggishness but any other changes need to be better justified and analysed. I've updated the PR just to switch from MUST to SHOULD regarding resource.name lexical pattern

@roll roll changed the title Relax resource.name rules but keep it required Relax resource.name rules but keep it required and unique Feb 20, 2024
@roll
Copy link
Member Author

roll commented Feb 20, 2024

ACCEPTED by WG (6/9)

@roll roll merged commit 1d95b9e into main Feb 20, 2024
1 check passed
@roll roll deleted the 685/resource-name-optional branch February 20, 2024 10:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Resource name should no longer be required (and need not be slug-friendly)
4 participants