Identifier construction: To prefix or not to prefix #37

Open
nsheff opened this issue Oct 5, 2022 · 11 comments
@nsheff
Member

nsheff commented Oct 5, 2022

On 2022-09-21 we debated how to actually form the identifiers. Like, is there a <prefix>, and/or a <type_prefix>, and are these modifiers used just for returning identifiers, or are they actually digested, since our protocol involves digesting digests.

Here are some thoughts:

  • I think it will be useful to disambiguate the terms digest and identifier, then: identifier = <prefix>:<type_prefix>.<digest>
  • we should probably include prefixes for the "sequence_digests" array
  • the "sequence_digests" array should therefore refer to identifiers, rather than digests, and should then probably be renamed to "sequence_identifiers"
  • we should probably also include prefixes/type_prefixes for the level 1 digest algorithm, but then we have to define these type prefixes for each array.
  • it seems like the definition of the type prefixes should happen by an authority at a level higher than our working group.
  • are the prefixes actually added before the digest, or just returned to the user? There is not really any identifiability value added in actually digesting them. Could they just be affixed at the return/display stage? This makes the specification more universal.
  • Would prefixes be required or optional for the input from a user, when requesting a lookup given an identifier/digest?
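
To make the "affix at the return/display stage" option concrete, here is a minimal sketch, assuming the identifier = <prefix>:<type_prefix>.<digest> layout from the first bullet. The names `Identifier`, `format_identifier` and `parse_identifier` are hypothetical and not part of any specification; the point is only that the bare digest can be stored and digested as-is, with prefixes added or removed at the edges.

```python
from typing import NamedTuple, Optional


class Identifier(NamedTuple):
    prefix: Optional[str]       # namespace prefix, e.g. "ga4gh" (may be absent)
    type_prefix: Optional[str]  # type prefix, e.g. "SQ" (may be absent)
    digest: str                 # the bare digest, never modified


def format_identifier(digest: str, type_prefix: Optional[str] = None,
                      prefix: Optional[str] = None) -> str:
    """Affix prefixes at display time; the digest itself is left untouched."""
    reference = f"{type_prefix}.{digest}" if type_prefix else digest
    return f"{prefix}:{reference}" if prefix else reference


def parse_identifier(text: str) -> Identifier:
    """Split '<prefix>:<type_prefix>.<digest>', tolerating missing prefixes."""
    prefix, colon, rest = text.partition(":")
    if not colon:                # no namespace prefix present
        prefix, rest = None, text
    type_prefix, dot, digest = rest.partition(".")
    if not dot:                  # no type prefix present
        type_prefix, digest = None, rest
    return Identifier(prefix, type_prefix, digest)
```

Under this sketch a lookup could accept `ga4gh:SQ.<digest>`, `SQ.<digest>` or a bare digest and reduce all three to the same stored value, which is one possible answer to the last question above.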
@nsheff
Member Author

nsheff commented Oct 19, 2022

On 2022-10-19, there was some support for including the type prefix, but not the namespace prefix (ga4gh). To define the type prefixes, we'd just re-use the array names. If we did it this way, it would look like this:

Level 0

seqcol.a6748aa0f6a1e165f871dbed5e54ba62

Level 1

{
  "lengths": "lengths.4925cdbd780a71e332d13145141863c1",
  "names": "names.ce04be1226e56f48da55b6c130d45b94",
  "sequences": "sequences.3b379221b4d6ea26da26cec571e5911c"
}

Level 2

{
  "lengths": [
    "1216",
    "970",
    "1788"
  ],
  "names": [
    "A",
    "B",
    "C"
  ],
  "sequences": [
    "SQ.76f9f3315fa4b831e93c36cd88196480",
    "SQ.d5171e863a3d8f832f0559235987b1e5",
    "SQ.b9b1baaa7abf206f6b70cf31654172db"
  ]
}

At level 2, would we want to add in the ga4gh namespace because that would be necessary for the lookup for refget 2.0? If so, you'd end up with this, which would lose consistency:

  "sequences": [
    "ga4gh:SQ.76f9f3315fa4b831e93c36cd88196480",
    "ga4gh:SQ.d5171e863a3d8f832f0559235987b1e5",
    "ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db"
  ]
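
One way to keep level 2 consistent while still supporting refget 2.0 lookups would be to store the un-namespaced form and add the namespace only when building the lookup request. This is a rough sketch under that assumption; `with_namespace` and `without_namespace` are hypothetical helper names.

```python
GA4GH_NAMESPACE = "ga4gh"  # namespace prefix discussed in this thread


def with_namespace(identifier: str, namespace: str = GA4GH_NAMESPACE) -> str:
    """Add the namespace prefix unless the identifier already carries one."""
    if ":" in identifier:
        return identifier
    return f"{namespace}:{identifier}"


def without_namespace(identifier: str) -> str:
    """Drop everything up to and including the first colon, if any."""
    return identifier.split(":", 1)[-1]


# Level 2 "sequences" array as stored (type prefix only).
sequences = [
    "SQ.76f9f3315fa4b831e93c36cd88196480",
    "SQ.d5171e863a3d8f832f0559235987b1e5",
    "SQ.b9b1baaa7abf206f6b70cf31654172db",
]

# Namespaced form, produced only when a refget lookup needs it.
lookup_ids = [with_namespace(s) for s in sequences]
```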

@andrewyatz
Collaborator

Level 2 is the bit that worries me since ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db is the identifier and it is our domain knowledge that allows us to know we need to add ga4gh: before it is valid. I wonder if there is a missing component in the schema level where a namespace can be specified and that really means you have to add the following namespace onto the identifier before it is a valid identifier?

@sveinugu
Collaborator

sveinugu commented Oct 27, 2022

"Level 2 is the bit that worries me since ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db is the identifier and it is our domain knowledge that allows us to know we need to add ga4gh: before it is valid. I wonder if there is a missing component in the schema level where a namespace can be specified and that really means you have to add the following namespace onto the identifier before it is a valid identifier?"

According to the CURIE Syntax document:

A host language MAY declare a default prefix value, or MAY provide a mechanism for defining a default prefix value. In such a host language, when the prefix is omitted from a CURIE, the default prefix value MUST be used. Conversely, if such a language does not define a default prefix value mechanism and does not define a set of reserved values, CURIEs MUST NOT be used without a leading prefix and colon.

Not 100% sure whether a service-related schema such as ours would qualify as a "host language", but if so we seem to be free to define our own mechanism for defining a default prefix value.

I googled my way to the specification of the UHF Hypermedia Format (UHF), which makes use of default CURIE prefixes and is also similar to our use case as it is basically a JSON schema or "format".

I am really only arguing that we can omit the prefix and still state that the values are CURIEs. Any automated usage must still extract our default prefix in a custom way, as the CURIE syntax document does not seem to define a canonical method for providing the default prefix in an automated fashion.

In the end, I suggest we contact identifiers.org or other relevant entities to get their view of the issue.

@andrewyatz For clarity, does the refget standard specify that the endpoints require the prefix to be available or is it optional?

@andrewyatz
Collaborator

GA4GH-compliant refget instances in v2 will accept GA4GH identifiers of the format ga4gh:SQ.XXXXXXXXX..., md5 checksums, or namespace:identifier constructs such as insdc:CM000663.2. The prefix is seen as non-optional.
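
As an illustration of the three accepted input shapes, here is a rough classifier. The regular expressions are assumptions made for this sketch (in particular the digest alphabet after `SQ.` and the 32-hex-character md5 form); they are not taken from the refget specification, and `classify_refget_input` is a hypothetical name.

```python
import re

# Assumed patterns, for illustration only.
GA4GH_RE = re.compile(r"^ga4gh:SQ\.[A-Za-z0-9_-]+$")
MD5_RE = re.compile(r"^[0-9a-fA-F]{32}$")
NAMESPACED_RE = re.compile(r"^[^:\s]+:\S+$")


def classify_refget_input(value: str) -> str:
    """Return 'ga4gh', 'md5', 'namespaced' or 'unknown' for a lookup string."""
    if GA4GH_RE.match(value):
        return "ga4gh"
    if MD5_RE.match(value):
        return "md5"
    if NAMESPACED_RE.match(value):
        return "namespaced"
    return "unknown"
```

With these assumed patterns, a type-prefixed but un-namespaced value such as `SQ.76f9f3315fa4b831e93c36cd88196480` falls into "unknown", which is the case the prefix requirement rules out.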

@sveinugu
Collaborator

sveinugu commented Nov 2, 2022

A nice blog post about CURIEs and why we need them, as background: https://cthoyt.com/2021/09/14/curies.html

@nsheff
Member Author

nsheff commented Nov 2, 2022

Some summary from today's discussion:

2 questions posted by Tim:

  1. Do we want what is going into the serialization to be the same thing that we expose to the public? Or do we not care about this level of consistency?

    • What we make available publicly is a lot easier to change in the future. We could always change prefixes later. In contrast if we change what's in the digest, that messes stuff up.
  2. If you don't necessarily require the same thing that is digested, is there much value in adding a lot of unnecessary characters to what you digest?

It seems we were approaching consensus that we could offer API endpoints that behave both ways: either they give exactly the string that was digested, if requested, or they give a more information-rich version. In fact, if we include non-digested arrays, then by definition the server will be serving up data that is different from exactly what is digested. Maybe it would be nice to have a flag or endpoint or option to get the exact digested string, though.

So, a thought experiment is:

  • for internal stuff (seqcol entities), we digest only digests, not identifiers (no prefixes or type prefixes)
  • for external identifiers, like refget identifiers, we accept them as strings at face value
  • for sequence digest arrays specifically, we're following the ga4gh specification, so we'd expect these to be complete identifiers, with both namespace and type prefixes. But really, this is not specified by seqcol, which specifies no additional constraints
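
The thought experiment can be sketched as follows. Two things here are assumptions made only for illustration: the digest function (SHA-512 truncated to 24 bytes and base64url-encoded) and the canonical serialization (compact JSON with sorted keys). The function names are hypothetical. What the sketch shows is that level 1 and level 0 are computed from bare digests, while the refget identifiers in the "sequences" array are taken as strings at face value.

```python
import base64
import hashlib
import json
from typing import Dict, List, Tuple


def sha512t24u(data: bytes) -> str:
    """SHA-512, truncated to 24 bytes, base64url-encoded (assumed algorithm)."""
    return base64.urlsafe_b64encode(hashlib.sha512(data).digest()[:24]).decode("ascii")


def canonical(obj) -> bytes:
    """Assumed canonical form: compact JSON with sorted keys, UTF-8 encoded."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


def seqcol_digests(level2: Dict[str, List[str]]) -> Tuple[str, Dict[str, str]]:
    # Level 1: one bare digest per array; no type prefix is mixed in.
    level1 = {name: sha512t24u(canonical(values)) for name, values in level2.items()}
    # Level 0: digest of the level 1 object, which holds digests only.
    level0 = sha512t24u(canonical(level1))
    return level0, level1


level2 = {
    "lengths": ["1216", "970", "1788"],
    "names": ["A", "B", "C"],
    # External refget identifiers, accepted as strings at face value.
    "sequences": [
        "ga4gh:SQ.76f9f3315fa4b831e93c36cd88196480",
        "ga4gh:SQ.d5171e863a3d8f832f0559235987b1e5",
        "ga4gh:SQ.b9b1baaa7abf206f6b70cf31654172db",
    ],
}
level0, level1 = seqcol_digests(level2)
```

Any prefix shown to users would then be added when formatting the response, so changing it later would not change `level0` or `level1`.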

So this leads to a few next questions:

  1. what do we want to accept in the API? with or without prefixes?
  2. what does the server serve? the output provided to the user. Do we have to say that these strings have to be prefixed with something? When we return things, do we include these prefixes? Or do we make it user-controlled through query parameters or something?

@sveinugu
Collaborator

sveinugu commented Nov 3, 2022

Great writeup, @nsheff!

I only want to add some comments regarding the Refget v2 digest. I think we also agreed that the Refget v2 digest isn’t actually a CURIE, even though it looks very much like one. This was surprising to me and I think it has also been a cause of misunderstandings lately.

From the CURIE syntax document:

CURIEs are an abbreviation for strings which are intended to represent IRIs (as defined by the IRI production in [IRI]), but checking that intent is not part of CURIE conformance. The intended IRI is constructed by concatenating the prefix binding with the reference part, if any. There MUST be a prefix binding for the prefix (or the default prefix, if the prefix is absent) in scope. The concatenation of the prefix value associated with a CURIE and its reference MUST be an IRI (as defined by the IRI production in [IRI]).

So for the refget v2 digest to be a CURIE, say ga4gh:SQ.a63c69dcd…, it should be possible to replace the "ga4gh" part with an IRI prefix and produce a valid IRI that would resolve into the concept that the CURIE represents, here the sequence itself. But since the ga4gh namespace is mandatory input for the refget endpoint, this is not possible.

Example:

Say you host a refget v2 server with the main endpoint available at (sorry, I did not bother looking up the actual endpoint name requirements in refget v2):

https://my.refgetserver.net/refget/

Then if ga4gh:SQ.a63c69dcd… was a CURIE, one should be able to replace the namespace with the endpoint IRI, and get a working IRI:

https://my.refgetserver.net/refget/SQ.a63c69dcd…

However, this leaves out the namespace from the input to the endpoint, contrary to what Refget v2 requires, according to @andrewyatz (#37 (comment)).
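
The mismatch can be shown with a small sketch of CURIE expansion as described in the quoted syntax document (concatenate the prefix binding with the reference). The server URL is the made-up one from the example above, the digest is left truncated as in the text, and `expand_curie` is a hypothetical helper.

```python
from typing import Dict


def expand_curie(curie: str, prefix_map: Dict[str, str]) -> str:
    """Expand a CURIE by concatenating its prefix binding with the reference."""
    prefix, colon, reference = curie.partition(":")
    if not colon or prefix not in prefix_map:
        raise ValueError(f"no prefix binding in scope for {curie!r}")
    return prefix_map[prefix] + reference


base = "https://my.refgetserver.net/refget/"  # made-up endpoint from the example
prefix_map = {"ga4gh": base}

curie = "ga4gh:SQ.a63c69dcd…"                 # truncated, as in the text
iri = expand_curie(curie, prefix_map)

# What the endpoint receives after expansion no longer contains the namespace.
endpoint_input = iri[len(base):]
```

Here `endpoint_input` is the reference part only, so it differs from the namespaced string that Refget v2 expects as input.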

I think it is unfortunate that the Refget v2 digest quacks like a duck without being a duck (but perhaps a swan?… 😁). Even if the standard does not state that the digest is a CURIE, it looks very much like one. I understand the ship has sailed in Refget v2 on this, sadly.

I think another thing we were nearing consensus on was that we would probably want to raise an issue to a higher power in GA4GH on what to use for the namespace of a seqcol CURIE identifier?

I would argue that using just ga4gh would make the refget v2 digest look even more like a CURIE and thus generate even more confusion. One possibility could be to instead include a type prefix in the namespace prefix, e.g.:

ga4gh.seqcol:6bc72cdf

Which is not uncommon for CURIEs, e.g. ega.study: and ega.dataset:.

Including some variant of a seqcol prefix on both sides of the colon is, I suppose, also a possibility:

ga4gh.seqcol:sc.6bc72cdf

@sveinugu
Collaborator

sveinugu commented Nov 30, 2022

Just wanted to concretize some of my thoughts after today's meeting and the decision to not include any prefixes in the serializations (except the Refget one):

Digests vs identifiers

For me, the decision was made based on a clear separation of concerns between the:

  • digest, which represents a particular content
  • identifier, which represents a particular concept

Two different concepts should have different identifiers, even if the contents are the same.

A way to clearly separate these concerns is to not include any prefixes at all in the digests. This is in essence what I believe we decided on today.

About identifiers

Regarding the identifiers, I think we should discern between locally and globally unique identifiers (Reference: "Unique, persistent identifiers" FAIR Cookbook). Identifiers should also be persistent and machine-resolvable. Identifiers could be full URIs, for instance using persistent URLs, or they could be represented as CURIEs (see the FAIR Cookbook recipe or the above-mentioned blog post).

Suggestion for top-level seqcol identifiers

Syntax

So I have the following simple suggestion for relating globally unique identifiers in the form of CURIEs with the top level digests:

ga4gh.seqcol:<digest>

e.g.

ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi

Globally vs locally unique

If we remove the prefix, we get a locally unique identifier, which in this case is equal to the digest. Following the conceptual framework from the CURIE syntax, this can be viewed as defining, in the context of a seqcol server, that the "default prefix value" is ga4gh.seqcol. In the context of a seqcol server, a top-level digest then also functions as a locally unique identifier and is furthermore also a valid CURIE!

Similarly, when others are making use of the seqcol identifiers in other contexts, they could in the same way define ga4gh.seqcol as the default prefix for the particular field holding the seqcol identifier. In such cases, the top-level-digest would still be a valid CURIE.

In conclusion: In the specification, we can basically say that a seqcol identifier is a CURIE, constructed according to the above syntax, and that the default prefix for a seqcol server is ga4gh.seqcol. One would not need to say anything about how the identifier should be used elsewhere; typing it as a CURIE would make sure of proper usage.

Note: A consequence of defining ga4gh.seqcol as the default prefix is that we might want the endpoints to also allow the user to specify the identifier WITH the prefix. Since the default prefix for a CURIE is only considered in the cases where the prefix is not present, it might be natural to make it optional to specify the prefix. Restricting the endpoints to only allow CURIEs without the default prefix will remove the possibility for later extending support to other prefixes, should we want to do that. We have anyway discussed having the prefix as optional just to be nice to the user.
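
A minimal sketch of what an endpoint with an optional default prefix could do with its input, assuming ga4gh.seqcol is the only supported prefix for now. `to_digest` and `to_curie` are hypothetical names, not proposed API.

```python
DEFAULT_PREFIX = "ga4gh.seqcol"  # default prefix suggested above


def to_digest(identifier: str, default_prefix: str = DEFAULT_PREFIX) -> str:
    """Accept a seqcol identifier with or without the default prefix."""
    prefix, colon, reference = identifier.partition(":")
    if not colon:
        # Prefix omitted: the default prefix applies, the input is the digest.
        return identifier
    if prefix != default_prefix:
        # Other prefixes are rejected for now but could be supported later.
        raise ValueError(f"unsupported prefix: {prefix!r}")
    return reference


def to_curie(digest: str, default_prefix: str = DEFAULT_PREFIX) -> str:
    """Produce the globally unique form from a locally unique digest."""
    return f"{default_prefix}:{digest}"
```

Both `ya7YJT-8kndreP6UamO9v20BZIPacuCi` and `ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi` would then reach the same lookup.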

Resolving the CURIE identifiers to URIs

In a CURIE resolution service, such as identifiers.org or N2T, one could e.g. provide the following mappings:

ga4gh.seqcol -> https://www.ncbi.nlm.nih.gov/seqcol/collection/
ga4gh.seqcol -> https://www.ebi.ac.uk/ga4gh/collection/

Resolving the ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi CURIE to the list

https://www.ncbi.nlm.nih.gov/seqcol/collection/ya7YJT-8kndreP6UamO9v20BZIPacuCi
https://www.ebi.ac.uk/ga4gh/collection/ya7YJT-8kndreP6UamO9v20BZIPacuCi
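
The resolution step can be sketched as a lookup from prefix to a list of URL prefixes. The registry below simply reuses the two example mappings above, which are illustrative URLs rather than real services, and `resolve` is a hypothetical name.

```python
from typing import Dict, List

# Example mappings from the text above; illustrative, not real endpoints.
REGISTRY: Dict[str, List[str]] = {
    "ga4gh.seqcol": [
        "https://www.ncbi.nlm.nih.gov/seqcol/collection/",
        "https://www.ebi.ac.uk/ga4gh/collection/",
    ],
}


def resolve(curie: str, registry: Dict[str, List[str]] = REGISTRY) -> List[str]:
    """Resolve a CURIE to every URL registered for its prefix."""
    prefix, colon, reference = curie.partition(":")
    if not colon:
        raise ValueError(f"not a prefixed CURIE: {curie!r}")
    return [base + reference for base in registry.get(prefix, [])]


urls = resolve("ga4gh.seqcol:ya7YJT-8kndreP6UamO9v20BZIPacuCi")
```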

Suggestion for second-level seqcol identifiers

So what about possible identifiers for concepts represented by arrays (second level)?

I suggest the following syntax:

ga4gh.seqcol:<array name>.<digest>

e.g.

ga4gh.seqcol:lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o

CURIE resolution services would then resolve this identifier into e.g.:

https://www.ncbi.nlm.nih.gov/seqcol/collection/lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o
https://www.ebi.ac.uk/ga4gh/collection/lengths.kiVAmcKvvUQ8LRWIkIeQf2n9psRqKx8o

Whether the endpoints would accept that identifier or not is up to the implementation.
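
For an implementation that does choose to accept both forms, telling them apart could look like the sketch below. It assumes the digest alphabet contains no "." (true for base64url digests like the examples here), so the first "." in the reference separates the array name from the digest. `SeqcolRef` and `parse_seqcol_curie` are hypothetical names.

```python
from typing import NamedTuple, Optional


class SeqcolRef(NamedTuple):
    array_name: Optional[str]  # None for a top-level identifier
    digest: str


def parse_seqcol_curie(curie: str, prefix: str = "ga4gh.seqcol") -> SeqcolRef:
    """Parse 'ga4gh.seqcol:<digest>' or 'ga4gh.seqcol:<array name>.<digest>'."""
    head, colon, reference = curie.partition(":")
    if not colon or head != prefix:
        raise ValueError(f"expected a {prefix!r} CURIE, got {curie!r}")
    array_name, dot, digest = reference.partition(".")
    if not dot:
        return SeqcolRef(None, reference)  # top-level identifier
    return SeqcolRef(array_name, digest)   # second-level identifier
```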

Note on persistent URLs

One could also later provide a mapping to a persistent URL scheme if there is a need for that, e.g.:

http://purl.org/ga4gh/seqcol/ya7YJT-8kndreP6UamO9v20BZIPacuCi

(BTW: I found this ga4gh domain under the Internet Archive-governed PURL system. It seems to have been registered by the GA4GH-Pedigree-Standard, helpfully using the top-level domain directly...)

@nsheff
Member Author

nsheff commented Jan 11, 2023

In discussions in November and December 2022, we divided this issue into 2 related issues:

  1. Should we prefix things internally?
  2. Should we prefix the final level 0 digest in what we refer to as the "seqcol identifier?"

For the first, we have an agreement: we do not include the ga4gh prefix, or type prefixes. This is codified in PR #42.

The second is kind of a spinoff question, which I believe is still under debate.

@andrewyatz
Collaborator

Following a 1:1 discussion I had with Nathan (apologies for not being in the meeting yesterday from the start), we think there is a good course of action. We also believe that, due to the misnaming of namespaced identifiers as CURIEs, we have conflated retrieval of an entity by its ID with the data required to resolve such an identifier.

  1. Change refget to accept non-prefixed identifiers, i.e. SQ.nnnn, which I think was discussed in previous refget meetings as a sensible extension (since SQ. is unique).
  2. Suggest that things should not be prefixed internally (the change in point 1 allows seqcol sequence ids to resolve to a sequence).
  3. Talk to the VRS group about their use of CURIEs. Allowing refget to sit in a halfway house would allow VRS to continue to work, as enforcing a change so that refget no longer resolves namespaced identifiers would be a major issue for them.

@ahwagner
Member

ahwagner commented Feb 3, 2023

We discussed this in the GKS leads call this week. A few takeaways from the discussion:

  1. There is no requirement that CURIEs are locatable. URIs cover URLs, URNs, and other URI types. AFAIK only URLs need be locatable, but CURIEs are not limited to URLs only. I've always thought of VRS object identifiers as URNs.
  2. @larrybabb thinks we should have gone the <namespace>.<type_prefix>:<digest> route. I agree with him.
  3. Following from 2, I don't think there's any reason the ga4gh namespace or SQ. prefix need be stored in refget. I actually think it is somewhat awkward to do this inside VRS objects, since we also strip those components when computing nested VRS digests that contain nested identifiable objects.
  4. Unrelated, it would be great if refget could move to just one digest scheme, but @andrewyatz rightly pointed out that this would be breaking for the CRAM spec. Though I would still push refget to consider a major version release at some point that is TRUNC512 only, leaving older versions available for use with CRAM, etc.
  5. I would like the VR team to work under a shared identifier / digest paradigm with refget, and assume that @larrybabb and @andreasprlic feel similarly, but would encourage them to chime in here too.
