name and id as identifiers for Data Packages #237

Closed
rufuspollock opened this Issue Dec 22, 2015 · 25 comments

@rufuspollock
Contributor

rufuspollock commented Dec 22, 2015

Currently Data Packages must have a name attribute but do not have an id attribute.

There has been debate about both the semantics (e.g. uniqueness) of the name field and its usability for certain cases (e.g. importing datasets into a new catalog) - see #220 for extensive discussions.

Proposal

Two identifier fields:

  • name: SHOULD be present (and certainly required for installation etc). Name is human meaningful and is designed to support both resolution (protocol to be determined) and easy use by humans e.g. in data dependencies
    • (?) Have this as a MUST?
  • id: MAY be present. If present MUST be globally unique. Propose it is a 36-character UUID or similar.

What is the structure of name?

name may only contain lower case alphanumeric characters plus `_`, `-` and `.`, with `/` as a separator (?? should we allow other URL-compatible values, e.g. `:`?)

Option 1 - 3 part

Name has the following structure:

[registry/[owner-or-namespace/]]local-name

The primary Data Package registry (assuming there is one) will have the special registry name dp

local-name MUST NOT contain a '/'

# single-part - for resolution one would anticipate these implicitly become
# `{primary-registry}/core/{name}`
finance-vix

#2 part: `registry/local-name`
# Propose that namespace MUST
# either come from a designated central data package registry if / when we have one e.g. `core/gdp`
# OR be a valid domain name e.g. `data.gov.uk/my-name` (so we can piggy back on domain name issuance)
datahub.io/xyz
data.gov.uk/xyz

#3 part:
doi/{doi}   # {doi} usually has /
github.com/rgrp/court-decisions-gb
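
As a rough illustration of Option 1, the sketch below splits a name into its optional registry, optional owner-or-namespace, and local-name parts, using the character set proposed above. `PackageName` and `parse_name` are hypothetical names, not part of any spec or library. The sketch rejects names with more than three parts, so a `doi/{doi}` name whose DOI contains a '/' would need a special rule that is not settled here.

```python
import re
from typing import NamedTuple, Optional

# Assumption: each part is lower case alphanumeric plus "_", "-" and ".".
_PART = re.compile(r"[a-z0-9_.-]+")


class PackageName(NamedTuple):
    registry: Optional[str]
    owner: Optional[str]
    local_name: str


def parse_name(name: str) -> PackageName:
    """Split `[registry/[owner-or-namespace/]]local-name` into its parts."""
    parts = name.split("/")
    if len(parts) > 3:
        raise ValueError(f"too many '/'-separated parts in {name!r}")
    for part in parts:
        if not _PART.fullmatch(part):
            raise ValueError(f"invalid part {part!r} in {name!r}")
    registry = parts[0] if len(parts) >= 2 else None
    owner = parts[1] if len(parts) == 3 else None
    return PackageName(registry, owner, parts[-1])
```

With this sketch, `parse_name("finance-vix")` has no registry or owner, while `parse_name("github.com/rgrp/court-decisions-gb")` fills all three fields. How a single-part name maps to a default registry is left open, as in the proposal.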

Asides

  1. I did think about having an initial "scheme" value e.g. dp/core/abc or www/data.gov.uk/xyz but felt we were starting to reinvent the URL wheel a bit too much ...
  2. One option I thought about was keeping name single-valued and having id support the multipart option.
  3. What about just using DOI? Ans: DOI requires a relatively complex registration process in order to be able to issue DOIs. We want anyone to be able to create data packages.
  4. Why not just use URIs / URLs? That is an option and we should think more about it. Main disadvantages are:
    • They are somewhat cumbersome
    • They are liable to breakage e.g. if a registry simply moves url ... (but that may create problems with the above too?)
    • Do not translate well to local installation
    • Implicitly creates relation between name and URL resolution -- what happens if you don't control any url space?

Use Cases

Why does having an identifier matter? What is it used for? At the moment the use cases are not very clear.

Note also @amercader comment: "As a Catalogue / Registry / Command Line Utility I Want Data Packages to have a global unique id So That I can sanely decide if a Data Package is the same as another one." -- though my question is why do you want to decide if it is the same?

Context

  • Check out Zooko's Triangle. For names it is hard to have more than 2 of:
    • meaningful (for humans)
    • decentralized
    • secure / non-colliding

Aims for name:

  • be human-usable and usable in dependencies
  • make non-collision possible and likely, but not guaranteed
  • be partially distributed

Content-based naming / addressing

One attractive approach to naming that is both secure and decentralized is content-based naming based on hashes. The basic idea is you name content via the (e.g. sha1) hash of the content.

This is attractive and clever but does have 2 drawbacks:

  • The name changes if the content changes (this could be a feature rather than a bug)
  • The name is an opaque long string
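
A minimal sketch of the idea, assuming the "content" is simply the raw bytes of a file and using SHA-1 as in the example above. `content_name` is a hypothetical helper and the sample bytes are made up for illustration.

```python
import hashlib


def content_name(content: bytes) -> str:
    """Name a piece of content by the hex SHA-1 digest of its bytes."""
    return hashlib.sha1(content).hexdigest()


# Made-up sample data: two versions of the same tiny CSV file.
original = b"a,b\n1,2\n"
revised = b"a,b\n1,3\n"

# The same bytes always give the same name; changing one value changes the name.
print(content_name(original))
print(content_name(revised))
```

Both drawbacks show up directly: the two versions get unrelated names, and each name is a 40-character hex string with no human meaning.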
@rufuspollock
Contributor

rufuspollock commented Dec 31, 2015

@pwalsh

pwalsh Jan 4, 2016

Member

I think making name SHOULD and id MAY means that we've removed any type of identifier from data packages, by default. TBH I'm not sure what is best... but I think if id is not becoming a MUST, then we should leave name a MUST.

@mfenner

mfenner Jan 6, 2016

@rgrp thanks for summarizing the issues again, and for proposing a path forward. My view is slightly different, summarized as follows:

  • I strongly recommend requiring a globally unique identifier for every data package
  • the easiest way to do this, in particular without any registration, is to use a UUID
  • human-readable name identifiers are common for software packages. I think the use case for data is different, and an id is a better fit. One example would be hundreds of datasets only differing in the time window or geolocation they describe (e.g. gold prices). It is hard to imagine how to find a common naming scheme for cases such as this one
  • URLs are very good unique identifiers, as they are actionable and have a namespace. One important disadvantage is that a lot more infrastructure is required to generate URLs

In short, my view is that a data package:

  • MUST have a globally unique id
  • MUST have a name which does not need to be globally unique (i.e. the current spec)
  • MAY have one or more 'url' fields which can point to a registry
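
To make the "no registration" point concrete, here is a minimal sketch of creating such a descriptor entirely offline with Python's standard `uuid` module. The `new_descriptor` helper is hypothetical, and treating `url` as a single optional string is an assumption for illustration only.

```python
import json
import uuid
from typing import Optional


def new_descriptor(name: str, url: Optional[str] = None) -> dict:
    """Build a descriptor with a freshly generated random (version 4) UUID."""
    descriptor = {"id": str(uuid.uuid4()), "name": name}
    if url is not None:
        # Assumption: a single optional url; the proposal allows one or more.
        descriptor["url"] = url
    return descriptor


print(json.dumps(new_descriptor("gold-prices"), indent=2))
```

No network access or registry lookup is involved: the id is generated locally, and the name is free to collide with other packages.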

@rufuspollock

rufuspollock Jan 29, 2016

Contributor

OK. Here's the updated proposal:

  • id: introduce this field and define semantics as globally unique.
    • Qu: what is syntax? Do we allow anything or restrict to simple "slug" stuff -- this would exclude e.g. doi or use of URL as id
    • This would be SHOULD not MUST as we want to allow people to create ultra-simple data packages without knowing how to create a UUID. However emphasize a MUST if you want to publish to e.g. registries.
  • name: probably keep this very simple as it is (at most allow a '/' in it - probably not, as it breaks sluggability ...)
    • Qu: make this MUST or SHOULD? as per @karissa question in #220

All the fancy multi-part stuff is really a bit of a confusion here as it is really about "package identifiers" for use in e.g. installation of dataDependencies. This could / should go into: http://dataprotocols.org/data-package-identifier/

@mfenner @karissa @danfowler @pwalsh any thoughts here before we move to implement?

@amercader

amercader Jan 29, 2016

Member

Note also @amercader comment: "As a Catalogue / Registry / Command Line Utility I Want Data Packages to have a global unique id So That I can sanely decide if a Data Package is the same as another one." -- though my question is why do you want to decide if it is the same?

Basically so you can decide whether to create a new entry on the registry or update an existing one.

If id needs to be globally unique, then allowing it to be a slug without a centralized registry is not really going to work, so I would strongly suggest that the spec recommends using UUIDs, a URI or another unique identifier (like the Figshare id mentioned). We can add a small note along the lines of "If unsure, use a UUID generated on this site (http://mozilla.pettay.fi/cgi-bin/mozuuid.pl) or this one (http://www.guidgen.com/)".
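
The create-or-update decision could look roughly like the sketch below. `Registry` and `publish` are hypothetical names for an in-memory stand-in, not the API of CKAN or any real catalogue.

```python
from typing import Dict


class Registry:
    """In-memory stand-in for a catalogue that keys packages by their id."""

    def __init__(self) -> None:
        self._packages: Dict[str, dict] = {}

    def publish(self, descriptor: dict) -> str:
        """Store the descriptor and report whether it was new or an update."""
        package_id = descriptor["id"]
        action = "updated" if package_id in self._packages else "created"
        self._packages[package_id] = descriptor
        return action
```

Two packages that share a name but carry different ids end up as two separate entries, while republishing the same id overwrites the existing entry. This only works if ids do not collide across publishers.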

@karissa

karissa Jan 29, 2016

Encouraging the use of a DOI would be ideal, but any unique id scheme
(e.g., uuid) would work.

@rufuspollock

rufuspollock Jan 31, 2016

Contributor

OK. So current new language looks like:

* `id` - an identifier string for this package. If present, this MUST be
  globally unique and persistently so. For example, it could be a [UUID][], a
  [DOI][], or even a URI/URL (under a domain you control).

  The id SHOULD be invariant, meaning that it SHOULD NOT change when a data
  package is updated, unless the new package version should be considered a
  distinct package, e.g. due to significant changes in structure or
  interpretation.

  *Relationship to `name`*: `id` and `name` both serve as identifiers for a
  package. ...

Having written this much I realise there is a bit of a challenge explaining to the average reader of the spec why there are two "identifier" fields and their relationship. Questions so far:

  • If id is a SHOULD I think name probably should be too.
  • MUST one of id or name be present?
  • I am struggling now to explain why we have name and/or its relationship to id. Traditionally e.g. in software packaging name would be your unique, meaningful identifier. We could preserve some of this use here if we had a central registry (which we sort of but do not really have yet). I also feel it is pretty useful for some purposes (e.g. as a file name when installing on disk etc) and as a starting point for many people compared to a UUID.
    • One stab at an explanation: "name is human-readable and unique within a given registry (e.g. the registry this data package comes from). However, we anticipate data packages being published to and from multiple registries. As such, name may not be globally unique. id is globally unique but is likely human-opaque."
@karissa

karissa Jan 31, 2016

Yes, I like that explanation.

It would be nice if they were both 'SHOULD'

@pwalsh

pwalsh Feb 1, 2016

Member

This is complicated.

If name MUST be unique within a given registry (based on last explanation text), this may only be knowable after the creation of a Data Package, and even after possible publication to other registries under the same name.

I understand the logic of all the various arguments here, but I really think we will fall into a hole by trying to please all and having both name and id, where each is a type of identifier.

@mfenner

mfenner Feb 1, 2016

I agree with @pwalsh. This is a bit too complicated. As much as I like the human readability of name, and understand that this is a pattern for software packages, I don't see this pattern as a good fit for data packages. While we rarely have more than a few software packages that do the same thing, we can easily have dozens or hundreds of data packages all describing UK gold prices or weather in Paris. If we drop the idea that we have an identifier that is both human readable and globally unique, things become easier, and we can drop name.

The problem with identifiers that first need to be registered, e.g. DOI, or name that is unique in a registry is that this creates overhead when the package is created. For the use cases of data packages that I see, I want to be able to create them without this overhead, e.g. when I have no network connection.

My suggestion is to

  • require id, and make this a uuid
  • allow additional identifiers, including DOIs in another field, named for example other_ids. We can also put human readable identifiers such as name in here
  • drop name, as it is confusing if we also have id
  • use title for a human readable description, but this doesn't have to be unique

I agree with @rgrp that

The id SHOULD be invariant, meaning that it SHOULD NOT change when a data package 
is updated, unless the new package version should be considered a distinct package, 
e.g. due to significant changes in structure or interpretation.

@karissa

karissa Feb 1, 2016

I agree with you until you advocate using title instead of name. It'd be
nice to keep the functionality you propose with title, but keep name as the
keyword. This way, it doesn't break backwards compatibility.

@mfenner

mfenner Feb 1, 2016

Agree about keeping name for backwards compatibility.

@pwalsh

pwalsh Feb 1, 2016

Member

@mfenner +1 with the addition of @karissa re name.

I just don't see any other way forward that will not be confusing.

@jbenet

jbenet Feb 1, 2016

  • +1 on good description of id @rgrp
  • +1 to @mfenner's comments and +1 on using "name"
  • please always make doi optional, as another attribute.
  • for id, please use uuid4 or just a large cryptographically random string.
  • if you use uuid, consider no dashes for copy-paste friendliness -- dashes tend to be confusing too.
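
Both suggestions are easy to sketch with the Python standard library. The helper names `dashless_uuid4` and `random_id` are hypothetical, and the 16-byte default is an assumption chosen only to match the 128 bits of a UUID.

```python
import secrets
import uuid


def dashless_uuid4() -> str:
    """A random (version 4) UUID as 32 hex characters, without dashes."""
    return uuid.uuid4().hex


def random_id(num_bytes: int = 16) -> str:
    """A cryptographically random hex string of 2 * num_bytes characters."""
    return secrets.token_hex(num_bytes)


print(dashless_uuid4())
print(random_id())
```

The dashless form can still be parsed back with `uuid.UUID(...)`, which accepts 32 hex digits with or without dashes.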

@danfowler

danfowler Feb 3, 2016

Contributor

+1 @mfenner @karissa on requiring id (MUST) while also keeping name for backwards compatibility. I suppose this means that name should, going forward, be a SHOULD property?

One comment, if id is required to be a UUID, why not just explicitly make the field name uuid? I also like the other_ids approach, and suggest an object like so:

"uuid": "ff28b27c-6556-45fd-9d17-13368ae6fd44",
"other_ids": {
  "doi": "...",
  "uri": "..."
}
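
For a consuming tool, reading such a fragment might look like the sketch below. It assumes the fragment sits inside a JSON object, that `uuid` holds a standard UUID string, and that `other_ids` is optional. `read_identifiers` is a hypothetical helper, and the "..." values are placeholders carried over from the example above.

```python
import json
import uuid

# The fragment above, wrapped in braces so that it is valid JSON.
fragment = """{
  "uuid": "ff28b27c-6556-45fd-9d17-13368ae6fd44",
  "other_ids": {"doi": "...", "uri": "..."}
}"""


def read_identifiers(text: str):
    """Return the parsed UUID and the (possibly empty) other_ids mapping."""
    descriptor = json.loads(text)
    # uuid.UUID raises ValueError if the string is not a well-formed UUID.
    package_uuid = uuid.UUID(descriptor["uuid"])
    other_ids = descriptor.get("other_ids", {})
    return package_uuid, other_ids


package_uuid, other_ids = read_identifiers(fragment)
print(package_uuid, sorted(other_ids))
```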
@mfenner

mfenner Feb 3, 2016

I like what @danfowler said. Not sure I prefer id or uuid, can be convinced both ways.

@rufuspollock

rufuspollock Feb 3, 2016

Contributor

@danfowler @mfenner @pwalsh (and everyone else)

I emphasize the constant principle of simplicity. Yes, we can have a bunch of other ids. But: for what purpose right now?

Everything we add to the spec is extra cognitive load for users and implementors. Remember we can always have extensions of the spec as separate mini-specs -- and you can add your own fields at the moment as long as they don't conflict.

Thus, on the other_ids / identifiers field, I'm a bit cautious about adding it, as I'm not clear what exact use case it serves for people who want to use the packages.

In fact, I again want to check what people think the use cases are for these fields. It is not super clear to me atm -- see one attempt to describe them at the top of the issue. I ask as this would help clarify, for example:

  • What value there is in supporting other identifiers explicitly right now in the spec
  • Whether id should always be a uuid or may also be a doi or uri.
  • Whether id is MUST or SHOULD? Atm I am assuming SHOULD based on discussion above.

Example of how user stories help us think about this:

  • Making a data package creator tool. I need to know whether I MUST or SHOULD have an id present and how it can be generated.
  • A tool "consuming" a Data Package: what relevance is id to me? At the moment I can only really see one use: importing to a catalog / registry - there it does seem that allowing a DOI or URI as id could be problematic e.g. consider CKAN.
@mfenner

mfenner Feb 3, 2016

I advocate that id MUST be present, and that it always be a uuid.

In my view id is essential if you want to handle more than a handful of datasets. Imagine, for example, the use case of an R script analyzing 5 datasets with weather data from 5 different locations. And then try to rerun that script 6 months later, when files have moved around, might have changed, etc. I would argue that a required id is essential for any data package.

other_ids is important for example for my use case, using DOIs to uniquely identify data packages (I work for DataCite and we assign DOIs to data). Keeping DOIs and other identifiers out of the id field is important, as parsing id otherwise becomes a nightmare.
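
To make the parsing argument concrete, here is a small Python sketch of what a consumer could do if id were always a UUID and DOIs lived in other_ids. The function names and the error strings are hypothetical, and the DOI in the usage note is a placeholder, not a real identifier.

```python
import uuid


def is_uuid(value) -> bool:
    """Return True if value is a string in any form that uuid.UUID accepts."""
    try:
        uuid.UUID(value)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


def check_identifiers(descriptor: dict) -> list[str]:
    """Report problems with the id field; other_ids is left uninterpreted."""
    problems = []
    if "id" not in descriptor:
        problems.append("id is missing")
    elif not is_uuid(descriptor["id"]):
        problems.append("id is not a UUID")
    return problems
```

For instance, `check_identifiers({"id": "ff28b27c-6556-45fd-9d17-13368ae6fd44", "other_ids": {"doi": "10.1234/example"}})` reports no problems, while putting the DOI string directly into id is flagged.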

@pwalsh

pwalsh Jul 12, 2016

Member

@rgrp and all

I'd like to move to solving this asap - it comes up all the time, and the discussion here shows that a one size fits all approach will be hard to get consensus on, even though, in principle, we all want the same thing, being: a reliable way to uniquely identify a package.

My take on all of the above is:

explicit part of spec

  • name a must, not unique, human readable title (this is backwards compatible, but going forward what was title becomes name)
  • id not a must, but if present, must be globally unique (not enforcing uuid though, as there are valid use cases for a "just a large cryptographically random string" ( @jbenet ), such as in DAT)

documented patterns (and, the core libs that OKI maintain in Python and Javascript will support these patterns)

  • uuid not a must, but if present, must be a globally unique uuid
  • doi not a must, but if present, must be a globally unique DOI

If one of uuid or doi exists, then id and the other of the two should not exist.

No, this is not elegant. It tries to strike a balance between:

  • not requiring ids for general, local use of data package
  • having a generic id field which is claimed to be globally unique, but cannot be verified by anyone (except a system that generates and processes the packages)
  • having explicit ID fields for the most common globally unique identifiers that are applicable to data packages (UUID and DOI)

If this approach is not workable, then I would support a solution where id is still not must, but when present, must be a UUID.
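
For anyone weighing how this would look in a validator, the rules above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `validate_identifiers` is a hypothetical name, the DOI pattern is a loose shape check rather than an official DOI validator, and none of this reflects an existing library.

```python
import re
import uuid

# Loose "10.<registrant>/<suffix>" shape check; an assumption for this sketch.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

ID_FIELDS = ("id", "uuid", "doi")


def validate_identifiers(descriptor: dict) -> list[str]:
    """Return a list of rule violations for the name/id/uuid/doi fields."""
    errors = []
    # name is a must, but is not required to be unique.
    if not descriptor.get("name"):
        errors.append("name is required")
    # At most one of id, uuid, doi may be present.
    present = [field for field in ID_FIELDS if field in descriptor]
    if len(present) > 1:
        errors.append("only one of id, uuid, doi may be present; found: " + ", ".join(present))
    # uuid and doi can be checked for shape; a generic id cannot.
    if "uuid" in descriptor:
        try:
            uuid.UUID(str(descriptor["uuid"]))
        except ValueError:
            errors.append("uuid is not a valid UUID")
    if "doi" in descriptor and not DOI_PATTERN.match(str(descriptor["doi"])):
        errors.append("doi does not look like a DOI")
    return errors
```

Note that the generic id is only checked for exclusivity here, which mirrors the point that its global uniqueness cannot be verified from the descriptor alone.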

@icklecows

icklecows Jul 28, 2016

+1 Using DOIs as the identifier - we already create this globally unique identifier, and using UUID would require additional development as it's not already in use in our system.

Whatever is decided, name is very awkward for us - any identifier is better as a necessity than name.

@roll roll changed the title from name and id as identifiers for Data Packages to name and id as identifiers for Data Packages Aug 8, 2016

@roll roll added the backlog label Aug 8, 2016

@roll roll removed the backlog label Aug 29, 2016

@rufuspollock rufuspollock added this to the Current milestone Sep 27, 2016

@jpmckinney jpmckinney added the blocker label Sep 28, 2016

@pwalsh

pwalsh Nov 10, 2016

Member

@rgrp @frictionlessdata/specs-working-group

On reflection, I think my last suggestion is ridiculous, as by trying to please every opinion, it serves no one particularly well.

I'm starting to swing towards what I think is the view of @rgrp being that the spec does not need a unique identifier as part of it: it is a platform-specific concern (federated or otherwise).

What we could do is reserve id as a protected property for use as a unique identifier by implementations, and then see how it is used, or, we could require id to be a UUID, which can be validated, and is the most implementation-neutral of all suggestions I've seen so far.

@frictionlessdata/specs-working-group this has been a persistent issue, well beyond this thread. I'd love to hear additional thoughts.

@jpmckinney

jpmckinney Nov 11, 2016

Member

I'm not sure where the proposal for name landed. At the top @rgrp describes a fairly complex process of assigning names that is much more labor-intensive than generating a UUID, which you can do online if you happen to be generating IDs manually.

If, as @pwalsh writes, we're tending towards not having any unique identifiers, then I'm in favor of eliminating name, which is very confusing.

If we do add id, then I'm in favor that it SHOULD be a UUID, but if some of the ecosystem already uses DOI or other identifiers (dat was mentioned above), then by all means let them continue to use those IDs in a 'grandfathered' way.

@joehand

joehand Nov 11, 2016

> I'm starting to swing towards what I think is the view of @rgrp being that the spec does not need a unique identifier as part of it: it is a platform-specific concern (federated or otherwise).

This seems like a good compromise. Global uniqueness will be hard to guarantee and only meaningful within the datapackage space. Allowing the id to be a platform specific unique id will make it easier to use datapackages in those platforms. This would allow id's such as:

"id" : "https://doi.org/10.5281/zenodo.166271"
"id" : "dat://f677bd23661a1d5871e40092268d197c73f213f6b8aefebe01709647cfde9528/"

These IDs will be resolvable within the specific platform but also meaningful when viewing the datapackage outside that platform. It is clear what subspace these IDs come from and where they are guaranteed to be unique.
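
As a sketch of how a consumer might tell which subspace such an id comes from, the scheme and host can be read off with Python's standard `urllib.parse`. The function name `id_source` is hypothetical, and treating an id without a scheme as "unknown platform" is an assumption of this example.

```python
from urllib.parse import urlsplit


def id_source(identifier: str) -> tuple[str, str]:
    """Return (scheme, host) of a URL-style id, or ("", "") if it has neither."""
    parts = urlsplit(identifier)
    return (parts.scheme, parts.netloc)
```

With the two ids above, the first yields the scheme `https` and the host `doi.org`, and the second yields the scheme `dat` with the key as the host part; a bare UUID yields two empty strings.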

@roll roll removed the blocker label Nov 16, 2016

@rufuspollock rufuspollock self-assigned this Nov 17, 2016

@rufuspollock

rufuspollock Nov 17, 2016

Contributor

@joehand @pwalsh good suggestions. Next steps are now clear:

  • Separate issue about creating id field - which is up to users to make unique
  • Separate issue about making name non-required and more flexible (it is basically there for registries)
@rufuspollock

rufuspollock Feb 5, 2017

Contributor

AGREED: will do as separate PR:

  • id field which MUST be unique (e.g. uuid, doi etc)
  • name field is MAY and can be anything you like within reason ...
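
Since uniqueness is left to users, a catalog or registry importing packages may still want to detect collisions on its side. A minimal Python sketch, assuming descriptors are already loaded as dicts and using the hypothetical helper name `duplicate_ids`:

```python
from collections import Counter


def duplicate_ids(descriptors: list[dict]) -> list[str]:
    """Return the sorted id values that occur in more than one descriptor."""
    counts = Counter(d["id"] for d in descriptors if "id" in d)
    return sorted(value for value, count in counts.items() if count > 1)
```

Descriptors without an id are skipped rather than treated as errors here; whether a missing id should be rejected is a separate decision.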