
Make SAID more generic #21

Closed
KDean-GS1 opened this issue May 22, 2022 · 11 comments
Comments

@KDean-GS1

The SAID concept is great, but the binding to CESR alone is inherently limiting. The ID ends up being a string that, while referenceable, is not easily discoverable without the addition of another field. For example, if I wanted to access the “Sue Smith” object without embedding it, I would have to write something like this:

"partner": {
    "id": "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk",
    "source": "https://www.example.com/repo/EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk"
}

The standard should speak to multiple representations. I see two possible categories.

The first is protocol specific. If the did:example method uses SAIDs, then any did:example ID is, by definition, a SAID. The pre-hash encoding would look like this:

{
    "id": "did:example:############################################",
    "first": "Sue",
    "last": "Smith",
    "role": "Founder"
}

The field is no longer called said, because the did:example method implies it. Whatever processes are used to discover the document that has the post-hash DID would retrieve a document that could be verified as per the specification.
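For illustration, a sketch of that derivation (assuming Blake3-256, the 44-character 'E'-prefixed CESR form, compact JSON serialization, and the blake3 package; the helper names are hypothetical) might look like this:

import base64
import json
import blake3  # assumed: the blake3 PyPI package

def saidify(sad: dict, label: str = "id") -> dict:
    """Return a copy of `sad` whose `label` field holds its SAID (sketch, not normative)."""
    # 1. Fill the ID field with dummy characters of the final SAID's length (44 for Blake3-256).
    dummy = dict(sad, **{label: "#" * 44})
    # 2. Serialize and digest. (The spec's canonical serialization rules apply here;
    #    compact JSON is used only for illustration.)
    raw = blake3.blake3(json.dumps(dummy, separators=(",", ":")).encode()).digest()
    # 3. CESR-encode: left-pad the 32-byte digest to 33 bytes, Base64URL-encode to 44
    #    characters, then replace the leading pad character with the Blake3-256 code 'E'.
    said = "E" + base64.urlsafe_b64encode(b"\x00" + raw).decode()[1:]
    return dict(sad, **{label: said})

def verify(sad: dict, label: str = "id") -> bool:
    """Recompute the SAID and compare it with the embedded one."""
    return saidify(sad, label)[label] == sad[label]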

The second is application specific. This is a little more complicated, because it maps to existing protocols that will have a mix of SAID and non-SAID identifiers.

Suppose, for example, that you want your JSON schemas to be discoverable and ingestible by standard validation tools. The $id fields would have to be HTTPS URLs and those URLs would have to be resolvable. Your pre-hash JSON schema example would therefore look something like this:

{
    "$id": "https://www.example.com/schema/said/############################################",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "full_name": {
            "type": "string"
        }
    }
}

The key here is the structure of the URL. If the URL matches a regex pattern, then it is probably a SAID. We did the same with GS1 Digital Link: we published a regex pattern that tells you that the URL you’re looking at is conformant to the GS1 Digital Link specification and the data in it is likely a GS1 AI (Application Identifier) string (see Section 6.1 of this document). It’s not 100% guaranteed, but the probability of collision is very low.
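For illustration, a recognition pattern of that kind (assuming the example URL layout above and the 44-character 'E'-prefixed form; a regex match is only a probable SAID, not a verification) might look like this:

import re

# Assumed example layout: the SAID is the last path segment of the schema URL.
SAID_URL_RE = re.compile(r"^https://www\.example\.com/schema/said/(E[A-Za-z0-9_-]{43})$")

url = "https://www.example.com/schema/said/EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk"
m = SAID_URL_RE.match(url)
if m:
    said = m.group(1)  # probably a SAID; still requires cryptographic verification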

That brings me to the underlying CESR representation itself. I personally hate single-character mnemonics; they are incredibly hard to remember, read, and debug. They also require regular maintenance of a lookup table and tricks to extend beyond the range of single characters (e.g., “1AAD” to represent an Ed448 public signing verification key, with the ‘1’ denoting that it’s a four-character code). I understand the desire for a compact representation for IoT devices, but there should be support for something more expansive for applications that don’t have storage and bandwidth constraints, as well as automatic support for other algorithms. The pre-hash did:example and HTTPS URL could instead look like this:

did:example:Blake3-256:###########################################
https://www.example.com/schema/said/Blake3-256/###########################################

Support for CESR could still be included:

did:example:CESR:############################################
https://www.example.com/schema/said/CESR/############################################

Or, CESR could be implied by the absence of a hash algorithm:

did:example:############################################
https://www.example.com/schema/said/############################################

GS1 Application Identifiers are guilty of this as well, but they were developed in the 1980s. :-)

@SmithSamuelM
Contributor

A reference for this discussion is Section 3 of the ACDC specification.

In that specification, the value of the $id field is allowed to be something other than a bare CESR SAID. This includes URIs that carry the SAID at a known location within the URI, DIDs that carry the SAID at a known location within the URN, and OOBI URIs (another spec) that carry the SAID at a known location within the OOBI. Essentially, it allows values that embed a SAID at a known location within another namespace.

The problem is that namespaces in general may not have a natural location. URIs specifically are dynamic indirections to resources with path, query, and fragment components, each of which could hold a SAID. Thus there is no "preferred" or natural location within the URI to place the SAID so that it is universally parseable. So it can't be generalized but must be application specific. This is especially the case for URLs. URNs like DIDs are static and more constrained. For each URN type, one might be able to specify a known location for a SAID. For did URNs, each did method has a method-specific ID, which would be the natural location to put a SAID. So that one is easy.

I believe that this SAID specification could be amended with namespace- and application-specific appendices that allow other value types besides bare SAIDs for specific applications, where the appendix defines the parseable location within the other namespace/application.

@SmithSamuelM
Contributor

SmithSamuelM commented May 25, 2022

To your point about CESR extension: more verbose expression, IMHO, goes down a slippery slope at the bottom of which are JWT, JOSE, COSE, etc., and everything else in between.

My personal experience, and the experience of the various developers on KERI who use CESR, is positive. Since one only ever uses a handful of codes, the combination of the context and the code is meaningful enough. The fact that it's text and compact is preferable to the verbosity of JWTs, which are unmanageably verbose and therefore hard to debug. The simplicity of a document with 10 cryptographic primitives that fits in 10 lines of text is amazing. But I get that not everyone will see it that way. CESR is really addressing the problem of textualizing binary cryptographic primitives and has to make tradeoffs. Compared to binary primitives, CESR text is super easy to debug and work with. It's all in what you are comparing it to. The primary feature of CESR is composability, and none of your proposals above satisfy the composability property, so unfortunately they are non-viable for CESR. But, as your examples above propose, one can always embed a CESR text string either in something that looks like a JWT, by adding all the extra stuff you want, or in some other namespace. So one can create a non-composable namespace specification that embeds a CESR primitive (see did:keri). But that specification is not SAID or CESR; it is an application-specific profile for using either.

CESR is meant for over-the-wire representation of cryptographic primitives that benefit greatly from its composability property. What you call a trick is essential to satisfying composability, which requires alignment on 24-bit boundaries.

One can always, in post-processing (after over-the-wire verification has happened), augment the CESR representation. One can always add a data-shaping layer on top that annotates the CESR with verbosity. Indeed, this is one of the touted advantages of CESR's Base64 representation. A CESR string of primitives can be trivially annotated and trivially de-annotated. Annotation is a universal approach that enables any degree of verbosity that explains the CESR primitive to any degree in any context, but requires only the simplest logic for annotating and de-annotating.
It preserves the namespace agnosticism exhibited by the family of KERI specifications to which CESR and SAID belong. Then, when debugging, just turn annotation on, and turn it off when not debugging. Or leave it on for any post-processing and strip the annotation over the wire, then re-add it. One can define a table of normative annotations for a given application without introducing new namespaces or embedding CESR in a namespace that is not composable.

For example.

CESR string

"EABCDEF123456789sabecdes"

Annotated CESR string

"EABCDEF123456789sabecdes  # this is an annotated CESR string and this is the annotation"

A regex-based de-annotator extracts only the Base64 characters up to but not including the first non-Base64 character and ignores everything else.

An annotator appends at least one non-Base64 character to delimit the annotation, followed by any number of characters of any type as the annotation.

Multi Line Annotation

A multi-line de-annotator acts similarly but respects line feeds as separators between multiple CESR primitives. Any line that does not start with a Base64 character is ignored, and any line that does start with a Base64 character contains CESR primitive(s) up to but not including the first non-Base64 character; the remaining characters are ignored up to and including the line feed.

For example, here is a line-annotated multi-primitive CESR stream:

# Trans Indexed Sig Groups counter code 1 following group
-FAB 
    
# trans prefix of signer for sigs
E_T2_p83_gRSuAYvGhqV3S0JzYEF2dIa-OCPLbIhBO7Y   

# sequence number of est event of signer's public keys for sigs
-EAB0AAAAAAAAAAAAAAAAAAAAAAB    

# digest of est event of signer's public keys for sigs
EwmQtlcszNoEIDfqD-Zih3N6o5B3humRKvBBln2juTEM  
  
# Controller Indexed Sigs counter code 3 following sigs
-AAD 

# sig 0    
AA5267UlFg1jHee4Dauht77SzGl8WUC_0oimYG5If3SdIOSzWM8Qs9SFajAilQcozXJVnbkY5stG_K4NbKdNB4AQ 

# sig 1       
ABBgeqntZW3Gu4HL0h3odYz6LaZ_SMfmITL-Btoq_7OZFe3L16jmOe49Ur108wH7mnBaq2E_0U0N0c5vgrJtDpAQ  

# sig 2
ACTD7NDX93ZGTkZBBuSeSGsAQ7u0hngpNTZTK_Um7rUZGnLRNJvo5oOnnC1J2iBQHuxoq8PyjdT3BHS2LiPrs2Cg 

We find that annotation is the easy way to debug and understand CESR streams without sacrificing its compactness and composability.
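For illustration, a minimal de-annotation sketch (assuming the URL-safe Base64 alphabet and '#' as the annotation delimiter; the helper names are hypothetical) might look like this:

import re

B64_RUN = re.compile(r"^[A-Za-z0-9_-]+")  # run of URL-safe Base64 characters

def deannotate(stream: str) -> str:
    """Strip annotation from a line-annotated CESR stream (sketch)."""
    out = []
    for line in stream.splitlines():
        m = B64_RUN.match(line)
        if m:  # lines not starting with a Base64 character are ignored entirely
            out.append(m.group(0))  # keep characters up to the first non-Base64 character
    return "".join(out)

def annotate(primitive: str, note: str) -> str:
    """Append an annotation delimited by at least one non-Base64 character."""
    return primitive + "  # " + note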

The temptation is to look at the text representation and want to augment it because it's not verbose enough. But (except for annotation) that exactly defeats the purpose of having the minimally sufficient text verbosity that can be converted to binary. It's a trade-space that you may never inhabit. But CESR is designed to support performant streaming, and that is a hard constraint to satisfy. So the over-the-wire text representation will never be as verbose and descriptive or memorable or usable as one could want. But compared to a binary over-the-wire expression, CESR is exponentially more usable. Frankly, when I use JWTs I don't manually debug the Base64 raw values. It's not possible. I assume that my code that converts to Base64 works. Likewise, when I look at a CESR expression for a primitive I don't manually debug the conversion. The only time I care is when I am debugging the code that implements the conversion, and in that case remembering which one-letter code I am expecting is more than adequate. Any more is just wasted verbosity.

@SmithSamuelM
Contributor

SmithSamuelM commented May 25, 2022

Compared to URLs, CESR primitives for public keys and hashes are on average more compact. So usability is better with CESR.

Compared to other textual encodings for crypto primitives, such as Base58-check, CESR is a little more compact but is composable, and Base58-check isn't. CESR is also cryptographically agile and Base58-check is not. So usability is better with CESR.

Compared to MultiCodec which is a binary encoding, CESR is more usable.

A serendipitous use of part of a MultiCodec byte code as a text code is not generalizable. It's not interoperable. So CESR is more usable.

Compared to URIs that embed a textual cryptographic primitive (non-CESR), such as a hashlink, the size and verbosity are similar but the CESR version has cryptographic agility. So usability is better with CESR.

For streaming, CESR is the only usable text representation because it is composable.

Consider self-describing data structures that provide a cryptographic primitive in multiple parts, such as a JWT. When read in isolation, the JWT is more descriptive than a bare CESR primitive. But in any real-world use case with multiple primitives in a single document, such as public keys, IDs, hashes, and signatures, the weight of multiple JWTs makes that document too big to fit on one screen, so readability is worse (even though describability is still better). However, comparing apples to apples, the same document with annotated CESR strings for each primitive (versus a JWT structure) is equally descriptive and far more compact. So annotated CESR is more usable than JWT.

Machine-automated validation tools do not benefit, by comparison, from verbosity or self-describing data structures. So annotation does not suffer with respect to validation accuracy, safety, or security versus a JWT. A normative table of codes is both more performant and easier to lock down. It's a little harder to code the first time, but the one-time cost of validation libraries pales in comparison to the transmission cost of verbosity over the wire.

@SmithSamuelM
Contributor

SmithSamuelM commented May 26, 2022

@KDean-GS1

The SAID concept is great, but the binding to CESR alone is inherently limiting. The ID ends up being a string that, while referenceable, is not easily discoverable without the addition of another field. For example, if I wanted to access the “Sue Smith” object without embedding it, I would have to write something like this:

"partner": {
    "id": "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk",
    "source": "https://www.example.com/repo/EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk"
}

The issue of discovery is an important one. The example you gave, which puts both the discovery mechanism or vector (as a URL) and the discovered data in the same structure, i.e. the SAD, is confusing to me.

The ietf-oobi protocol (see IETF-OOBI) is about discovery. It puts AIDs and SAIDs inside URLs that allow verifiable discovery of a SAD (which by definition contains a SAID), as it should be.
All I need to verify is the bare SAID, not the discovery vector. Indeed, embedding the SAID inside the discovery vector is good for discovery, but then going one step farther and embedding that discovery vector inside the discovered data (the SAD) not only adds unnecessary parsing complexity when verifying, it also adds semantic confusion. If I discover the SAD using a different discovery vector than the one embedded in the SAD, does that mean it's an invalid discovery? The verification of the SAD, given any discovery vector, should be agnostic about the vector used for discovery. The whole idea of end-verifiability is based on the property that the path, the vector, by which a SAD or other end-verifiable data is obtained is immaterial to the validity of that data. Making the embedded SAID by which a SAD is verified (integrity) into the discovery vector goes against that principle.

To clarify, it makes sense in some cases for one SAD to reference another SAD by using a discovery vector that embeds the SAID of the other SAD. But the other SAD itself should not include the vector as its SAID, merely its bare SAID. Verifying the result of discovery via the vector means comparing the SAID embedded in the vector with the provided SAID in the discovered SAD. This way any vector reference type or namespace is verifiable against the same SAD. Any discovery path is verifiable using the exact same SAID. The discovery vector is a different specification. (This assumes that the parsed position of the SAID inside the vector is unambiguous to the verifier.)

Your suggestion, IMHO, is confusing verifiability via a SAID in a provided SAD with discovery of the SAD via a namespaced vector that includes an embedded SAID.

How to embed a SAID in some vector namespace is namespace-specific and should not pollute the SAID itself. This is why there is an IETF-OOBI specification. It provides normative rules for embedding an AID or a SAID in a URL for the purpose of discovery.

Looking at SAIDs from a different perspective, one extremely useful feature of the class of content-addressable identifiers (of which SAIDs are a member) is that they support database de-duplication. As is well known, a content-addressable identifier is universally unique to the data content that it addresses. Therefore, using a content-addressable identifier as a database key trivially makes that database de-duped. Infrastructure that uses SAIDs gets this benefit for free, but not if the SAID is embedded in a discovery- (location-) oriented namespace like a URL. De-duplication is also very helpful from a security perspective in identifying tampering and duplicity in data provided from any source.

This is why I strongly believe that SAIDs MUST be just the cryptographic (CESR) primitive that is embedded in the SAD in a universally standard way that is independent of discovery mechanisms. Nonetheless, a verifiable external reference to a SAD is its SAID. When that external reference is also being used to locate or discover the SAD, then that external reference is a namespace, and it MAY include the SAID embedded in its namespace. But verifiability breaks unless there is one and only one way to parse the namespace reference to extract the SAID. If not, then it is no longer a verifiable external reference. Many URI approaches are very loose about where one might put a SAID, and if multiple strings that look like SAIDs may appear in a given URI, then it is no longer verifiable. This would be bad.

@SmithSamuelM
Contributor

SmithSamuelM commented May 26, 2022

@KDean-GS1

The key here is the structure of the URL. If the URL matches a regex pattern, then it is probably a SAID.

A probable match is OK for discovery but NOT OK for cryptographic verification. The match MUST be unambiguous.
But I stand by my comment that your examples are mixing discovery and verification processes. The SAD itself only needs the bare SAID. An external reference not in the SAD may be used for discovery, and then a protocol-specific namespace embedding specification is required in order to recognize and extract the embedded SAID from the namespaced identifier used as an external reference.

Confusing the usages of discovery namespaced identifiers that embed cryptographically derived non-namespaced identifiers with the bare cryptographically derived identifiers themselves, albeit common, is fraught with danger from a security perspective.

The internet is broken from a security perspective because DNS cannot be trusted to provide secure authenticity (source of resources), and therefore any URL must be considered insecure and depended on only as an out-of-band mechanism relative to an in-band secure authentication mechanism such as KERI. Consequently, a URL can only be used for the discovery of a data resource, not for verifiably secure identification for the purpose of authenticity or integrity. The netloc part of a URL was supposed to provide the security of everything in its namespace. Because it doesn't, we have to treat it as broken for authenticity, though it may still be used for discovery. This applies to authentic data items as identified entities. Given the ubiquity of web tooling, infrastructure, and familiarity, respecting this rule is very difficult to do. But in order for Web 3.0 to realize its promise of a secure internet, it must respect this rule.

The big picture is that web 2.0 infrastructure must necessarily morph into web 3.0 infrastructure in order to fix its broken security model. What is unclear is what the final web 3.0 infrastructure looks like. The vision of KERI is that AIDs and SAIDs provide the foundational identifier classes that, together with other cryptographic primitives, namely keys, salts, and signatures, may be used in verifiable data structures that fix the security model in a portable, universal way. This necessarily means jettisoning some of the typical use cases of web 2.0 infrastructure, namely the misuse of URLs as supposedly secure terminal identifiers instead of merely as insecure, indirectable discovery namespaces. Bare AIDs and SAIDs are namespace-agnostic, verifiably secure terminal identifiers. Their namespace agnosticism is essential to portability.

@SmithSamuelM
Contributor

SmithSamuelM commented May 26, 2022

@KDean-GS1 Finally, with respect to your example above of a JSON Schema $id field.

Suppose, for example, that you want your JSON schemas to be discoverable and ingestible by standard validation tools. The $id fields would have to be HTTPS URLs and those URLs would have to be resolvable. Your pre-hash JSON schema example would therefore look something like this:

{
    "$id": "https://www.example.com/schema/said/############################################",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {
        "full_name": {
            "type": "string"
        }
    }
}

Most people I have encountered on this topic, with respect to ACDC's use of JSON Schema, have not carefully read the JSON Schema specification and assume incorrectly that the normative value of the $id field in a JSON Schema must be a network-addressable URI, not simply an identifier in URI form. This assumption is FALSE. Otherwise, we could not use JSON Schema in a secure way with ACDCs and would have had to pick a different schema mechanism. BTW, this is the main reason ACDC could not use schema.org schemas.

It seems you are also making this incorrect assumption.

JSON schemas to be discoverable and ingestible by standard validation tools

The reference is provided below. But note the following important clarifying description

" Even though schemas are identified by URIs, those identifiers are not necessarily network-addressable. They are just identifiers. Generally, implementations don’t make HTTP requests (https://) or read from the file system (file://) to fetch schemas. Instead, they provide a way to load schemas into an internal schema database. When a schema is referenced by it’s URI identifier, the schema is retrieved from the internal schema database."

This means that the tooling by default MUST NOT enforce network addressability of the value of the $id field (i.e., be discoverable over a network), and therefore any identifier value in the $id field is usable (even when not in URI form) with compliant JSON Schema tooling. Network addressability is a non-default deployment option, not the usual case. Indeed, one has to go out of one's way to deploy JSON Schema tooling that ingests a schema by first looking it up from its $id over a network. To reiterate, the $id field is just a database key in default JSON Schema tooling, and hence a bare SAID as the $id field value is perfectly compatible with standard JSON Schema tooling. This means the schema's type identifier is also its de-duped content address as a SAID. This allows the schema, via its SAID, to also be reasoned with using the BowTie principle from Ricardian contracts.
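For illustration, a non-normative sketch of that default behaviour (a plain in-memory store keyed by the bare SAID in $id, validated with the jsonschema package and no network access; the store and helper names are hypothetical) might look like this:

import jsonschema  # assumed: the standard jsonschema package

SCHEMA_DB = {}  # hypothetical internal schema database keyed by bare SAID

def load_schema(schema: dict) -> None:
    SCHEMA_DB[schema["$id"]] = schema

def validate(instance: dict, said: str) -> None:
    # The $id is used purely as a database key; nothing is fetched over the network.
    jsonschema.validate(instance=instance, schema=SCHEMA_DB[said])

load_schema({
    "$id": "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk",
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "properties": {"full_name": {"type": "string"}},
})
validate({"full_name": "Sue Smith"}, "EnKa0ALimLL8eQdZGzglJG_SxvncxkmvwFDhIyLFchUk")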

To elaborate, the reason the note is provided in the spec is that making the $id field dual purpose by allowing it to be a URI is confusing because in the standard use case it's used as a database key, not a network addressable resource. Its use as a network addressable resource is a special case that is a recent addition to the specification. Historically, a JSON Schema was only obtainable from a local database and currently, the vast majority of use cases for JSON Schema still follow this pattern.

This ($id field as database key) enables ACDC to leverage off-the-shelf JSON schema tooling in a secure way. The ACDC spec defines the secure profile for using JSON Schema.

Moreover, one reason we use the $id field to provide the SAID of the schema instead of another field label (which is allowed) is to make it more difficult to inadvertently misuse the schema in an insecure way. It also avoids any confusion as to how the schema is meant to be used. When the $id field is not network addressable, then any attempt to load the schema or a subschema from the network will fail. Whereas if the SAID were in a different field and a network-addressable URI were also provided in the $id field, then that schema could confuse a validator and could be subject to attack.

https://json-schema.org/understanding-json-schema/structuring.html

Schema documents are not required to have an identifier, but you will need one if you want to reference one schema from another. In this documentation, we will refer to schemas with no identifier as “anonymous schemas”.

In the following sections we will see how the “identifier” for a schema is determined.

Note
URI terminology can sometimes be unintuitive. In this document, the following definitions are used.
URI [1] or non-relative URI: A full URI containing a scheme (https). It may contain a URI fragment (#foo). Sometimes this document will use “non-relative URI” to make it extra clear that relative URIs are not allowed.
relative reference [2]: A partial URI that does not contain a scheme (https). It may contain a fragment (#foo).
URI-reference [3]: A relative reference or non-relative URI. It may contain a URI fragment (#foo).
absolute URI [4] A full URI containing a scheme (https) but not a URI fragment (#foo).
Note
Even though schemas are identified by URIs, those identifiers are not necessarily network-addressable. They are just identifiers. Generally, implementations don’t make HTTP requests (https://) or read from the file system (file://) to fetch schemas. Instead, they provide a way to load schemas into an internal schema database. When a schema is referenced by it’s URI identifier, the schema is retrieved from the internal schema database.

@KDean-GS1
Author

KDean-GS1 commented May 30, 2022

Discussed and agreed re: use of CESR. We can close off the "more readable" format support as "not planned".

Regarding the use of a SAID-compliant URI as a resolvable identifier, I think there's still value in it, from the perspective of "good enough" (two words guaranteed to strike fear into the hearts of those working in any kind of security) or "don't let the perfect be the enemy of the good".

Let's assume that we're working with a genealogy database. The document describing every individual is identified and secured with a SAID, and, where known, each document references the parents of the individual:

"partner": {
    "id": "EABCD0123",
    "fatherID": "EEFGH4567",
    "motherID": "EIJKL8901",
    ...
}

As long as we're dealing with a single database, this works just fine: the SAIDs are database keys and we can go up the tree all the way to the first Ardipithecus that thought coming down from the trees in the first place was a Really Good Idea.

In a network of genealogy services, however, we need a mechanism to discover and retrieve documents that have been created by other (possibly competing) services. Let's take it as a given that we can create that capability in a way that provides all the guarantees of integrity, security, reliability, etc. of a full-blown ACDC implementation. If that's what we want, we can do it.

That may be overkill, though. For the developers of a genealogy interoperability standard, building on Web 2.0 may be "good enough". They want the integrity that SAID provides, but they don't want the complexity of ACDC, and now we end up with something like this:

"partner": {
    "id": "https://www.example.com/genealogydb/EMNOP2345",
    "fatherID": "https://www.example.com/genealogydb/EQRST6789",
    "motherID": "https://www.example.co.uk/docs/EUVWX0123",
    ...
}

This comes with all of the problems of Web 2.0, but it provides a simple, easy-to-understand rule for walking up the tree: if you don't have the document in your local database, then you retrieve the document using HTTP libraries you already have and understand, validate it, and store it under its ID.

Validation is straightforward. Step 1 is to compare the ID of the retrieved document with the ID used for retrieval, and step 2 is to extract the SAID from the ID and verify it against the document.
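For illustration, a sketch of that two-step check (assuming the URL layout above, the requests library, and a hypothetical verify_against_said helper that applies the application-specific dummying rule) might look like this:

import re
import requests  # assumed: the requests HTTP library

SAID_IN_URL = re.compile(r"/(E[A-Za-z0-9_-]{43})$")  # assumed URL layout

def fetch_and_validate(url: str) -> dict:
    sad = requests.get(url, timeout=10).json()
    # Step 1: the retrieved document's ID must match the ID used for retrieval.
    if sad["id"] != url:
        raise ValueError("retrieved document does not carry the ID used for retrieval")
    # Step 2: extract the SAID from the ID and verify it against the document.
    said = SAID_IN_URL.search(url).group(1)
    if not verify_against_said(sad, said):  # hypothetical SAID verification helper
        raise ValueError("document does not verify against the extracted SAID")
    return sad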

The ID therefore becomes the canonical location for this document, but it doesn't have to be the only location. If I have a copy in my local database, that copy is just as good as the one retrieved from the canonical location, as long as the two-step validation succeeds. If the website has been reorganized and I have to follow a redirect while retrieving the document, and the original document is available at the new location, that copy is just as good as it was when it was at the original location, again assuming that the two-step validation succeeds.

If I can't retrieve the document at all, or if the document I retrieve is not the one I expect, whether due to malice, incompetence, or the company simply having gone out of business, then I have a decision to make: Have I got enough information? If yes (perhaps because I've gone up four generations and that's "good enough"), then I'll live with what I've got, with the knowledge that either the data is irretrievable or that the referencing document is itself not valid (because although it has the right SAID in its own ID, it could be poisoned data, a possibility that applies all the way back down the tree). Unless my need for the data is in some way critical, such as laying claim to the throne of some far away kingdom, the lack of complete data doesn't cost more than unsatisfied curiosity about my ancestry.

I do acknowledge that this is in no way a generic solution. This form of retrieval and validation is not guaranteed to work in other problem domains. However, the developers of the genealogy interoperability standard can make this an application-specific use of SAID and they would have to document it accordingly. Other application domains may be happy with "good enough" as well, and people working in each application domain will have to understand their application-specific rules.

Another reason for including guidance on how to use a SAID as part of a URI is that, as mentioned in the ACDC spec, there are some places where an ID must be a URI, and there are multiple application domains where IDs will need to be dereferenced to retrieve a document. A combination of the sad:SAID URI scheme and OOBI will do the trick, but the SAID specification would be more valuable if it provides guidance on how to create SAIDs that are part of a larger URI scheme with built-in discovery and retrieval mechanisms (e.g., did:example:EYZAB4567, https://www.example.com/ECDEF8901). High-integrity environments may still require raw SAID plus OOBI, but many won't.

@SmithSamuelM
Contributor

@KDean-GS1

That may be overkill, though. For the developers of a genealogy interoperability standard, building on Web 2.0 may be "good enough". They want the integrity that SAID provides, but they don't want the complexity of ACDC, and now we end up with something like this:

"partner": {
    "id": "https://www.example.com/genealogydb/EMNOP2345",
    "fatherID": "https://www.example.com/genealogydb/EQRST6789",
    "motherID": "https://www.example.co.uk/docs/EUVWX0123",
    ...
}

This comes with all of the problems of Web 2.0, but it provides a simple, easy-to-understand rule for walking up the tree: if you don't have the document in your local database, then you retrieve the document using HTTP libraries you already have and understand, validate it, and store it under its ID.

Validation is straightforward. Step 1 is to compare the ID of the retrieved document with the ID used for retrieval, and step 2 is to extract the SAID from the ID and verify it against the document.

The ID therefore becomes the canonical location for this document, but it doesn't have to be the only location. If I have a copy in my local database, that copy is just as good as the one retrieved from the canonical location, as long as the two-step validation succeeds. If the website has been reorganized and I have to follow a redirect while retrieving the document, and the original document is available at the new location, that copy is just as good as it was when it was at the original location, again assuming that the two-step validation succeeds.

I agree that the tendency for web 2.0 infrastructure is to treat a URL as somehow canonical. This is different, though, from a canonical URL. A canonical URL would be more like a URN in terms of semantics, as it is fixed and may not be network addressable. The example justification you give is for all URLs to be canonical, because you assume that they are network addressable and only do a fallback lookup if they are not.

But where we differ is in the statement "if they don't want the complexity of ACDC". The core security principle is not dependent on ACDC and does not require any ACDC infrastructure to implement. It is indeed independent of ACDC. We use it in KERI, which is lower level than ACDC. But for SAIDs the principle is even lighter weight, because we are merely talking about integrity of the data (tamper-evidence), not authenticity. The core distinction is separation of concerns between integrity protection and discovery. As a design principle, a content address is not a discovery mechanism. There is no reason, from the standpoint of verifying integrity, to store a discovery mechanism within the data item itself. These are independent. Mixing the two, while apparently convenient in some cases, comes with limitations that are imposed on any database mechanisms for integrity, monotonic update logic, and consistency consensus. Whereas separating discovery cleanly from integrity enables more expressive power in both integrity/consistency and discovery.

As we discussed, for legacy adoption where there is a true canonical URL and no appetite to separate discovery from integrity, having a URL format (discovery mechanism) that embeds integrity protection (a SAID) is a compromise that helps with adoption. But, as we discussed, it is not a preferred mechanism and must be caveated.

Each and every use case of embedding requires a use-case (special-case) specific appendix that details how the SAID is embedded and parsed by verifiers, and optionally whether or not the implementer trusts or wants to implement the particular discovery mechanism. Critically, interoperability across trust domains is dependent on implementations adding the complexity of each and every use case. This is but the first step down the slippery slope of many different embedding forms. Today it's GS1's preferred mechanism (preferred because it requires the least effort to adapt to GS1's already existing way of doing things), but tomorrow it will be the next entity, and so on.

The problem is not that one can define a given format for embedding and verifying. That part is easy. That is not the source of my concern. My concern is that everyone will want to define a slight variation and have it standardized. And then implementations become burdened with the complexity of all those variations.

The goal should be to keep discovery and integrity on separate layers of the protocol stack. Bare SAIDs should always be preferred, and discovery mechanisms should be separate. Only by clean separation can we guarantee any semblance of future-proofing for both integrity and discovery. When they are commingled in a standard way, we make it extremely difficult to separate them later (they should be on separate layers), and that leads to problems, as discovery works best (allows innovation over time) if it's not burdened by security, and integrity is easier to secure (allows innovation over time) if it's not burdened by discovery. It seems easy now to combine the two in a URL, but that is only because it's familiar and convenient for the time being; it violates the principle of layered protocol design.

But as a nod to bridging with legacy web 2.0 infrastructure, a well-caveated (use-at-your-own-risk) mechanism for embedding a SAID in a URL is OK in an appendix. So as long as we are moving in that direction, let's do it. But the mechanism needs to give consideration to being general enough that it's the only embedded variant that combines discovery and integrity.

In contrast, for example, the OOBI protocol is a discovery protocol (mechanism) that explicitly leverages DNS/Web URLs with embedded SAIDs, for example for discovering verifiable integrity data items. It looks very similar to what you have proposed. The important distinction is that OOBI URLs do not replace the SAID field value in the discovered data item; the SAID field in the discovered data item is always the bare SAID. This means that the OOBI discovery protocol and the database integrity/consistency mechanisms are mutually isolated from each other, and therefore both are future-proofed with respect to changes in the other. New discovery protocols can be created that do not impinge at all on database integrity, consistency, and consensus mechanisms. Different database integrity, consistency, and consensus mechanisms can be created that work with any and all discovery mechanisms.

If I am using, for example, a graph database versus an SQL database versus a non-SQL document database, and I am using eventual consistency or causal consistency or BFT consistency across my database replicants, then embedding a URL in each and every copy causes semantic confusion. Where do I look for the data? Is it out-of-band discovery or consistency (across replicants) that I care about: first, last, foremost, eventually? Discovery has no place at this level. It should be outside the scope of database consistency, integrity, and authenticity. The source of truth is the database, not the discovery mechanism.

In the OOBI protocol spec there is a discussion about how to use a simple BADA-RUN mechanism for updating data in a database. This is complementary to discovery. It's simpler than ACDC. It works with any database. It is agnostic about discovery mechanisms; it merely sets the criteria, from a security perspective, for how to update the database. It allows authenticity with monotonic logic to prevent replay attacks, and redundant consistent copies to prevent deletion attacks. Both replay and deletion attacks are vulnerabilities of authentic data. Deletion attacks are also a vulnerability with integral (non-authenticatable) data. The mitigation for replay attacks is monotonicity. The mitigation for deletion attacks is redundancy, which requires consistency across the redundant copies.

Discovery, in contrast, is independent of the database update rules. Discovery is how an external entity finds the database in order to either read from or write to it, not how the database manages its own data. It doesn't need to discover its own data because it already has it. If it needs to find replicants, then it has a database-specific replicant addressing mechanism (which means that each replicant has a different address, not the same one across all replicants). This makes a clear distinction between discovery by external hosts and internal replicant addressing.

Because SAIDs are content addresses, they can be used for de-duplication across replicants and update sources. Because they are universally unique, they can be used as efficient database keys in all types of database indexes, be it a hash table, a B-tree, a cross-table reference, or a graph edge. Moreover, one can use them for in-stride, integrity-optimized (tamper-evident) database cleaning whenever the chain of custody (process reboot, OS change, device change, etc.) of the data is broken. This is essential to the most basic zero-trust architecture.

But NOT if that SAID is encumbered with a URL namespace. Then it must be parsed and stripped with every access. It's just really cumbersome, except for the corner case of legacy point designs that have already baked in a URL-centric philosophy.

So once again, as a legacy adoption vector it makes sense to define an embedded SAID URL as the "ID" field in the data item itself instead of a bare SAID. But the preferred design should be to cleanly separate discovery from self-referential self-addressing SAID fields.

@KDean-GS1
Author

I agree completely with all of the above. What I'm driving at here is a separation of concerns in standards documentation (separate from the separation of concerns of identification and discovery).

There are, or are likely to be, use cases for SAIDs in URIs. For the most part, these may be satisfied through a sad:SAID URI scheme, but there may be arguments to be made in application-specific domains, whether it be for web 2.0 legacy compatibility (the URL example above) or as the unique portion of a DID URI (which would be method-specific).

Documenting "here's how a SAID may be incorporated into a URI" doesn't preclude any of the above, because other specifications that depend on SAID (e.g., ACDC) will simply say that only raw SAIDs are supported. This would be analogous to GS1 standards: we support multiple barcode formats, for example, but only a handful are supported in retail environments and only with a limited subset of all the data elements we support. Any barcode scanning application would have to be application-aware in order to process the content.

@SmithSamuelM
Contributor

Suggest pull request with appendix that outlines how to genericize SAID for legacy adoption in application specific domains.

@m00sey
Member

m00sey commented Aug 1, 2023

@KDean-GS1 This seems to have been addressed by @SmithSamuelM

If you feel differently, please open a new issue:
https://github.com/trustoverip/tswg-said-specification/issues

@m00sey m00sey closed this as completed Aug 1, 2023