Schema Registry Proposal #625

Merged: 3 commits, Jun 21, 2020

Contributor

@clemensv commented May 20, 2020

Signed-off-by: clemensv <clemensv@microsoft.com>

For initial review. I'm still updating both documents including changing some names, but the combination of OpenAPI doc and the spec doc should already tell a fairly complete story.

Microsoft proposal for #610

Signed-off-by: clemensv <clemensv@microsoft.com>
type: integer
tags:
- 'schemas'
/schemagroups/{group-name}/schemas/{schema-name}/versions/{version-number}:

There might be a case for metadata to live at the version level of a schema. For example, when the schema was created, who created it, etc. I wonder if we need separate endpoints for the metadata vs fetching the raw schema itself.

Contributor Author

I think those extra metadata items can be exposed through metadata annotations or through an OPTIONS call if they are not part of the schema itself.

- Type: `Integer`
- Description: The version of the schema. This is a simple counter and tracks
the version in the scope of this schema within the schema group. The schema
document MAY indicate a schema that follows a different versioning scheme.
@ryanhorn commented May 21, 2020

How do we envision this working exactly? Does this just mean implementations might follow a scheme that isn't simply a monotonically increasing Integer? This wording feels like we're opening it up for implementations to diverge from the spec without being specific about how they should do that.

Contributor Author

This means that the schema document itself can use semver or some other "embedded" versioning notion, while the API goes strictly by the order in which changes have been added.

Understood, but wouldn't semver (or other versioning schemes) imply a different data type?

Contributor

I was wondering the same.


For simple scenarios, the API allows for version management to be automatic and
transparent. Whenever a schema is updated, a new version number is assigned and
prior schema versions are retained. The latest available schema is always the
Collaborator

@duglin commented May 24, 2020

It seems there should be a MUST in here stating that "no version == latest".

document MAY indicate a schema that follows a different versioning scheme.
- Constraints:
- REQUIRED
- Assigned by server.
Collaborator

I think I'm on the same track as Ryan. I'm wondering if we need to allow the author to decide the version string, so they can choose a simple int or a semver pattern.

Contributor Author

The goal of the server-assigned integer is for versioning not to complicate the API model. With automatic numbers, you can make all updates a plain POST on the schema URI and you can enforce the compatibility rules you want using service-side logic as the update happens.

Introducing breaking changes should really require the pain of a wholly new schema with its own backcompat versioning sequence.

Semver 1.x, 2.x, 3.x is really better captured by

/schemagroups/myapp/schemas/foo.1/versions/{n}
/schemagroups/myapp/schemas/foo.2/versions/{n}
/schemagroups/myapp/schemas/foo.3/versions/{n}

than by

/schemagroups/myapp/schemas/foo/versions/1.1
/schemagroups/myapp/schemas/foo/versions/2.2
/schemagroups/myapp/schemas/foo/versions/3.0

Those things under foo are not the same if they don't describe structurally compatible data.
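
A minimal in-memory sketch of the server-assigned counter described in this comment. It assumes version numbers start at 1 and that the compatibility rule is a caller-supplied function; `SchemaHistory`, `VersionConflict`, and `is_compatible` are hypothetical names for illustration, not part of the proposal:

```python
from typing import Callable, List, Optional


class VersionConflict(Exception):
    """Raised when an update violates the (hypothetical) compatibility policy."""


class SchemaHistory:
    """Holds every version of one schema; the server assigns the numbers."""

    def __init__(self, is_compatible: Optional[Callable[[bytes, bytes], bool]] = None):
        self._versions: List[bytes] = []
        # Assumption: the policy compares the latest document with the new one.
        self._is_compatible = is_compatible

    def post(self, document: bytes) -> int:
        """Append a new version and return its server-assigned number."""
        if self._versions and self._is_compatible is not None:
            if not self._is_compatible(self._versions[-1], document):
                raise VersionConflict("update rejected by compatibility policy")
        self._versions.append(document)
        return len(self._versions)  # assumption: counting starts at 1

    def get(self, version: Optional[int] = None) -> bytes:
        """Return the requested version, or the latest one when none is given."""
        if not self._versions:
            raise KeyError("schema has no versions")
        if version is None:
            return self._versions[-1]
        if not 1 <= version <= len(self._versions):
            raise KeyError(version)
        return self._versions[version - 1]
```

Under this model, a breaking change would go to a separate `SchemaHistory` (for example one for `foo.2`) rather than to a new version in the existing one.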

Collaborator

By doing this you're basically asking the impl to be a document management system. Would it be so bad if we left that up to the author of the schema files and they could pretty much pick any URL pattern they wanted (within their permissions/scope)? Meaning, if they PUT over an existing one then it updates it. If they PUT to a new URL then they're creating a new one. Then the impl doesn't need any versioning at all, no saving of history, etc. That's left up to systems that are built for that kind of thing, and the results are pushed here for sharing/viewing.

Contributor

Given the ability and suggestion to semver the dataschema or schema id, complex versioning schemes within a schema seem less necessary. A server-side check to prevent footguns should be possible even without versions, and this API doesn't actually prevent replacing a specified version with an incompatible document at the same URL (delete the existing version and/or schema, then re-create a schema with the same name and re-insert schemas until the version lines up, but the content doesn't).

Just discovered this project. Evaluating whether we could use this instead of writing our own schema registry. We want to be able to use semver to represent versions for a schema, where foo v1.y can be used as-is instead of foo v1.x (for y >= x), but foo v1.x and foo v2 are incompatible representations of the same data (and then maybe there'll be a converter registered in our system, etc.).

It's not clear to me which direction the conversation is leaning, but it'd be really lovely if it was possible to control the version id assigned to a new version of a schema. Otherwise, I mean, we can always add a proxy in front that remaps foo/versions/x.y to foo.x/versions/y, I guess. But it would be simpler if we could just control the version id to be created as we want. This is a very localized configuration point, too.
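
The proxy remapping mentioned above could look roughly like the following sketch. It assumes paths of the form `.../schemas/{schema}/versions/{major}.{minor}`; the function name `remap` is hypothetical, and a real proxy would also have to reconcile minor numbers with the server-assigned counters:

```python
import re

# Matches .../schemas/<schema>/versions/<major>.<minor>
_SEMVER_PATH = re.compile(
    r"^(?P<prefix>.*/schemas/)(?P<schema>[^/]+)/versions/(?P<major>\d+)\.(?P<minor>\d+)$"
)


def remap(path: str) -> str:
    """Rewrite foo/versions/x.y to foo.x/versions/y; leave other paths untouched."""
    m = _SEMVER_PATH.match(path)
    if m is None:
        return path
    return f"{m['prefix']}{m['schema']}.{m['major']}/versions/{m['minor']}"
```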

One interesting thing I was thinking about with respect to semver support for versioning, is that compatibility checking could perhaps be customized with a different configuration based on a real understanding of semver. So if a client tries to update the content of a schema that is currently at version 2.7.3, uploading content to new version 2.7.4 then that is a patch version release and the compatibility checker can be very strict. But if the client does the same thing to new version 3.0.0 then the compatibility checker can e.g. allow breaking changes.

Just a thought. :)
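
As a rough illustration of that idea, the sketch below picks a compatibility level from the semver component that was bumped. The level names (`"strict"`, `"backward"`, `"none"`) and the treatment of minor bumps are assumptions made for this example; the comment above only describes the patch and major cases:

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release or build tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def required_compatibility(current: str, proposed: str) -> str:
    """Return how strictly the new content should be checked against the old."""
    cur, new = parse(current), parse(proposed)
    if new <= cur:
        raise ValueError("proposed version must be greater than the current one")
    if new[0] > cur[0]:
        return "none"      # major bump: breaking changes allowed
    if new[1] > cur[1]:
        return "backward"  # minor bump: assumed middle ground
    return "strict"        # patch bump: strictest check
```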

Contributor Author

@chrish42 @EricWittmann the model here allows for a "trivial" case where the version is just a server-assigned counter, similar to how a GitHub commit identifier is a server-assigned identifier for an entry in the sequential commit log. That also has no deeper meaning. (The commit id is not a plain counter for different reasons.) If you want to do something more sophisticated, you can always manage schema versions explicitly as "myschema:v1.1" and "myschema:v1.2" as separate schemas within the group, and if you want to have the "latest" functionality, you also maintain a "myschema" that always returns the latest version. An implementation behind the protocol could easily provide that.


`/schemagroups/{group-id}/schemas/{schema-id}/versions/{version}`

The name of the first segment of the path is a suggestion and MAY differ between
Collaborator

How do you see the query for all groups getting interop w/o agreement on the first one?

Contributor Author

"schemagroups" is really part of the path to the registry itself; everything defined here sits under that path. The segment could even be empty or could be multiple segments. I don't see there being interop issues, because you would always have it being at the root of the URI.

Collaborator

Ah, I didn't realize you meant for "schemagroups" to be an impl choice. We might want to make that clearer throughout the entire doc.

Also update the OpenAPI to remove /schemagroups from paths. Would make it easier to generate servers/clients from the OpenAPI at some future point.

Contributor

@gunnarmorling left a comment

Really welcoming this initiative! I like where it's going; put in a few comments.

description: Schema group already exists
tags:
- 'groups'
delete:
Contributor

Should DELETE actually be allowed? Did you consider a "decommission" or "soft delete" option as an alternative? That would, e.g., prohibit producing new events that reference a schema taken out of service, while existing events could still be decoded.

Yes, in general deletes are problematic and should IMO be either discouraged or prevented. Immutability has been a useful property in our registry for schema versions - we can deprecate and disable versions. And I wish we had done the same for the entire schema (deprecate/disable) rather than allow deletes.

> 2) Since the above strategy is truly RESTful, but quite esoteric if you've not
> grown up as a RESTafarian, the alternative strategy for concurrently
> handling multiple schema formats is much simpler: Constrain each schema
> group to a single format.
Contributor

This seems simple and pragmatic. Does it make sense to start with this approach?

What about constraining on a per-schema basis (giving each schema a "format" or "type"). The reason here is that I can imagine a logical group of schemas that aren't all the same technology. Especially if this were ever to expand beyond schemas and into e.g. API Designs as well - I would want to have a group that included perhaps multiple OpenAPI documents as well as some JSON Schemas... and my OpenAPI would likely have $refs to the JSON Schemas.

Contributor Author

@EricWittmann I will add that as an XOR option, i.e. you can either define the formats at the group or schema level. If you define it at the group level, that is binding, meaning you can't override.
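
A small sketch of how that XOR rule might be resolved when looking up the format that applies to a schema. The function name `effective_format` is hypothetical, and treating "no format at either level" as an error is an assumption made here, not something stated in the thread:

```python
from typing import Optional


def effective_format(group_format: Optional[str], schema_format: Optional[str]) -> str:
    """Resolve the format for a schema under the group-XOR-schema rule."""
    if group_format is not None and schema_format is not None:
        # A group-level format is binding and cannot be overridden per schema.
        raise ValueError("format is set at the group level; it cannot be set per schema")
    if group_format is not None:
        return group_format
    if schema_format is not None:
        return schema_format
    # Assumption: a schema without any format is rejected.
    raise ValueError("no format set at either the group or the schema level")
```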

- Constraints:
- REQUIRED
- MUST be a non-empty string
- MUST conform with RFC3986/3.3 `segment-nz-nc` syntax
Contributor

Would it be more familiar to specify this as a hostname / reg-name in section 3.2.2 ("Host"), rather than a path segment (which is slightly more permissive)?

I think segment-nz-nc better supports some common ways you might want to name your groups to indicate a hierarchy where none exists. Or an Organization + Project format - that sort of thing. I'm thinking things like how a lot of NPM packages are now being named...
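
For reference, RFC 3986 defines `segment-nz-nc` as one or more of unreserved characters, percent-encoded octets, sub-delims, or `@`, which means no colon and no slash. A minimal validator sketch follows; the function name `is_valid_group_id` is hypothetical, and applying it to the group identifier is an assumption based on this thread:

```python
import re

# segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
_SEGMENT_NZ_NC = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=@]|%[0-9A-Fa-f]{2})+")


def is_valid_group_id(value: str) -> bool:
    """Return True if value is a non-empty RFC 3986 segment-nz-nc."""
    return _SEGMENT_NZ_NC.fullmatch(value) is not None
```

This accepts names such as `@myorg.myproject`, which a hostname-style `reg-name` rule would reject because of the `@`.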

A schema version is a document. The "body" of a schema version MAY be a text
document or binary stream. An implementation SHOULD validate whether a
schema version is valid according to the rules of its format, for instance
whether it is a valid Avro schema document when the format is Apache Avro.
Contributor

How do recipients know what type of content to expect in the body? Is this based on the datacontenttype in the received message (plus some sort of lookup table to map datacontenttype to format)?

If so, making format the same value as datacontenttype would simplify the API.

Contributor Author

@EricWittmann "SHOULD" leaves that up to what you think is right.

- 'groups'
put:
summary: Create schema group
description: Create schema group with specified format format in registry namespace.
Contributor

Suggested change
- description: Create schema group with specified format format in registry namespace.
+ description: Create schema group with specified format in registry namespace.

operationId: getLatestSchema
responses:
'200':
$ref: '#/components/responses/SchemaBytePayloadResponse'
Contributor

Do you want to include the ID that was served in this response, so that clients can retrieve the value again on subsequent calls?

@@ -0,0 +1,257 @@
# CNCF Schema Registry API Version 0.1-rc01s
Collaborator

maybe "CNCF Schema Registry API - wip" since it's not an 'rc' yet.

This section further describes the elements enumerated in the introduction.

### 2.1. Schema Group

Just wanted to add that groups is a great concept that we don't (yet) have in Apicurio Registry, but may have been a mistake to not include. It's important I think to organize these things into groupings. I guess I'm just saying +1 to this concept from me. :)

Comment on lines +89 to +90
This specification does not define management constructs for such access control
rules.

+1 Good

Comment on lines +144 to +148
A newer schema version might introduce breaking changes or it might only
introduce careful changes that preserve compatibility. These strategies are not
subject of this specification, but the API provides a conflict handling
mechanism that allows an implementation to reject updates that do not comply
with a compatibility policy, if one has been implemented.

Schema evolution is a pretty significant concept in a schema registry. I haven't seen how the spec facilitates configuring the compatibility (or validity) rules. This paragraph mentions a compatibility policy, but if that's mentioned anywhere else I missed it. :(

This might be an important enough feature to include in the spec. For Apicurio Registry we have the concept of "rules" that can be configured globally or per-artifact (in this spec I imagine rules could also be configured at the schemagroup level). Right now we have only two rules: Validity and Compatibility. But perhaps "rule" is a concept to consider?

Contributor Author

@EricWittmann I believe this is an important feature of the registry, but not an important feature of the protocol. At the protocol level, a policy violation bubbles up as a plain conflict. If we were designing an implementation and the management API for that implementation, I would agree.

Fair enough. :)

described data structure. All documents coexisting within the same version
SHOULD describe the exact same data structure.

### 2.2.2. Schema attributes

It would be useful to keep some other metadata for each artifact, particularly if anyone ever wants to create a Registry UI: name, description, labels, creationTime, modifiedTime, etc.

Comment on lines +183 to +186
A schema version is a document. The "body" of a schema version MAY be a text
document or binary stream. An implementation SHOULD validate whether a
schema version is valid according to the rules of its format, for instance
whether it is a valid Avro schema document when the format is Apache Avro.

Can validity be disabled or configured (e.g. syntax vs. semantic validity)? Treating Validity and Compatibility in similar ways has proven useful in Apicurio Registry.

- Type: `String`
- Description:
- Constraints:
- OPTIONAL. Can be used if and only if not format has been set for the schema

Should be: if and only if format has not been set for the schema

Signed-off-by: clemensv <clemensv@microsoft.com>
retention policy, but implementations MAY retire and remove outdated schema
versions.

The latest available schema is always the default version that is retrieved when
Collaborator

Might want to add a MUST in here someplace....

When the URL to a schema is used without a version string, the implementation MUST return the latest version of that schema.

perhaps?

- Description: Instant when the schema was added to the registry.
- Constraints:
- OPTIONAL
- Assigned by the server.
Collaborator

My OCD is kicking in... sometimes you have periods, sometimes you don't on bulleted lists :-) Can we choose one? I'd prefer no periods.

Contributor Author

I am referring to the CE spec for the data types now. Do you think we need to copy them?

Collaborator

I was just referring to the lack of consistency on the bulleted lists - some ending in a period and some not. I don't have an opinion on copying vs referencing the data types

- Description: Instant when the schema was added to the registry.
- Constraints:
- OPTIONAL
- Assigned by the server.
Collaborator

as above, I think we need a MUST here specifying the format/syntax - or define "Timestamp" in some kind of "data types" section

schema version is valid according to the rules of its format, for instance
whether it is a valid Avro schema document when the format is Apache Avro.

Within the scope of the schema set, the version is identified by the combination
Collaborator

@duglin commented Jun 11, 2020

Use of the word "set" here might confuse people since it's new. Did you mean "group" or "set of versions for one particular schema"?


Within the scope of the schema set, the version is identified by the combination
of a version number and an optional format identifier. The schema version MAY
also have an additional, optional unique identifier within the scope of the
Collaborator

Can you elaborate on this last sentence? Do you mean they can add extensions that are unique identifier values? If so, why did you call this out? Just curious.

Contributor Author

Any concrete schema document may also have a unique identifier for itself. There is either a path /group/xyz/schemas/abc/versions/1 or you could just address that exact doc with its ID. We want that to enable a URL shortener function.

- 1
- 2

#### id
Collaborator

How do you see this being used by a consumer of the registry?

Contributor Author

aka.ms/s/{id} URL shortener option for greedy protocols.

Collaborator

Why isn't that just part of the URL shortener function's logic? E.g. tinyurl.com doesn't ask for an id - I just give it a full URL


These dependencies are reflected in the path structure:

`[/schemagroups]/{group-id}/schemas/{schema-id}/versions/{version}`
Collaborator

Not sure if it matters, but the use of the word id got me wondering if it should be name instead. While the semantics would be the same either way, all examples of these IDs appear to be more like human-friendly names rather than IDs (e.g., GUIDs). So while it's not totally human friendly (e.g., no spaces), if we expect people to use meaningful "words" and not "random chars", then perhaps name would help guide them in that direction.

Just a thought

Collaborator

Do we need to add something here to indicate that the /versions/{version} part is optional?
[/schemagroups]/{group-id}/schemas/{schema-id}[/versions/{version}]
or text?
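
To make the bracketed form concrete, here is a sketch of a parser for `[/schemagroups]/{group-id}/schemas/{schema-id}[/versions/{version}]`. The names `SchemaRef` and `parse_schema_path` are hypothetical, the prefix is treated as an implementation choice as discussed earlier in the thread, and a missing version is represented as `None`, which a server could then resolve to the latest version:

```python
from typing import NamedTuple, Optional


class SchemaRef(NamedTuple):
    group_id: str
    schema_id: str
    version: Optional[int]  # None means "no version given"


def parse_schema_path(path: str, prefix: str = "/schemagroups") -> SchemaRef:
    """Split a registry path into group, schema, and optional version."""
    if prefix and not path.startswith(prefix + "/"):
        raise ValueError(f"path does not start with {prefix!r}")
    rest = path[len(prefix):] if prefix else path
    parts = rest.strip("/").split("/")
    if len(parts) == 3 and parts[1] == "schemas":
        return SchemaRef(parts[0], parts[2], None)
    if len(parts) == 5 and parts[1] == "schemas" and parts[3] == "versions":
        return SchemaRef(parts[0], parts[2], int(parts[4]))
    raise ValueError(f"not a schema path: {path!r}")
```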