Schema Registry Proposal #625
Signed-off-by: clemensv <clemensv@microsoft.com>
schemaregistry/schemaregistry.yaml
Outdated
type: integer | ||
tags: | ||
- 'schemas' | ||
/schemagroups/{group-name}/schemas/{schema-name}/versions/{version-number}: |
There might be a case for metadata to live at the version level of a schema. For example, when the schema was created, who created it, etc. I wonder if we need separate endpoints for the metadata vs fetching the raw schema itself.
I think those extra metadata items can be exposed through metadata annotations or through an OPTIONS call if they are not part of the schema itself.
schemaregistry/schemaregistry.md
Outdated
- Type: `Integer` | ||
- Description: The version of the schema. This is a simple counter and tracks | ||
the version in the scope of this schema within the schema group. The schema | ||
document MAY indicate a schema that follows a different versioning scheme. |
How do we envision this working exactly? Does this just mean implementations might follow a scheme that isn't simply a monotonically increasing Integer? This wording feels like we're opening it up for implementations to diverge from the spec without being specific about how they should do that.
This means that the schema document itself can use semver or some other "embedded" versioning notion while the API goes strictly by order or when changes have been added.
Understood, but wouldn't semver (or other versioning schemes) imply a different data type?
I was wondering the same.
schemaregistry/schemaregistry.md
Outdated
|
||
For simple scenarios, the API allows for version management to be automatic and | ||
transparent. Whenever a schema is updated, a new version number is assigned and | ||
prior schema versions are retained. The latest available schema is always the |
seems there should be a MUST in here about "no version == latest"
document MAY indicate a schema that follows a different versioning scheme. | ||
- Constraints: | ||
- REQUIRED | ||
- Assigned by server. |
I think I'm on the same track as Ryan. I'm wondering if we need to allow the author to decide the version string, so they can choose a simple int or a semver pattern.
The goal of the server-assigned integer is for versioning not to complicate the API model. With automatic numbers, you can make all updates a plain POST on the schema URI and you can enforce the compatibility rules you want using service-side logic as the update happens.
Introducing breaking changes should really require the pain of a wholly new schema with its own backcompat versioning sequence.
Semver 1.x, 2.x, 3.x is really better captured by
/schemagroups/myapp/schemas/foo.1/versions/{n}
/schemagroups/myapp/schemas/foo.2/versions/{n}
/schemagroups/myapp/schemas/foo.3/versions/{n}
than by
/schemagroups/myapp/schemas/foo/versions/1.1
/schemagroups/myapp/schemas/foo/versions/2.2
/schemagroups/myapp/schemas/foo/versions/3.0
Those things under foo are not the same if they don't describe structurally compatible data.
By doing this you're basically asking the impl to be a document management system. Would it be so bad if we left that up to the author of the schema files and they could pretty much pick any URL pattern they wanted (within their permissions/scope)? Meaning, if they PUT over an existing one then it updates it. If they PUT to a new URL then they're creating a new one. Then the impl doesn't need any versioning at all, no saving of history, etc. That's left up to systems that are built for that kind of thing, and the results are pushed here for sharing/viewing.
Given the ability and suggestion to semver the `dataschema` or schema `id`, complex versioning schemes within a schema seem less necessary. A server-side check to prevent footguns should be possible even without versions, and this API doesn't actually prevent replacing a specified version with an incompatible document at the same URL (delete the existing version and/or schema, then re-create a schema with the same name and re-insert schemas until the version lines up, but the content doesn't).
Just discovered this project. Evaluating whether we could use this instead of writing our own schema registry. We want to be able to use semver to represent versions for a schema, where foo v1.y can be used as-is instead of foo v1.x (for y >= x), but foo v1.x and foo v2 are incompatible representations of the same data (and then maybe there'll be a converter registered in our system, etc.).
It's not clear to me which direction the conversation is leaning, but it'd be really lovely if it was possible to control the version id assigned to a new version of a schema. Otherwise, I mean, we can always add a proxy in front that remaps foo/versions/x.y to foo.x/versions/y, I guess. But it would be simpler if we could just control the version id to be created as we want. This is a very localized configuration point, too.
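For illustration, the proxy remapping mentioned here could look roughly like the sketch below. It only rewrites paths and assumes a two-part `x.y` version; `remap_semver_path` and the regular expression are hypothetical names, not part of the proposal. Since the server assigns the counter, `y` only lines up with the stored version number if the minor versions were uploaded in order.

```python
import re

# Hypothetical client-facing form: .../schemas/{name}/versions/{major}.{minor}
_SEMVER_PATH = re.compile(
    r"^(?P<prefix>.*/schemas/)(?P<name>[^/]+)/versions/(?P<major>\d+)\.(?P<minor>\d+)$"
)


def remap_semver_path(path: str) -> str:
    """Rewrite foo/versions/x.y to foo.x/versions/y; other paths pass through."""
    match = _SEMVER_PATH.match(path)
    if match is None:
        # Plain integer versions (or unrelated paths) are left untouched.
        return path
    return "{prefix}{name}.{major}/versions/{minor}".format(**match.groupdict())
```

With this, `/schemagroups/myapp/schemas/foo/versions/2.2` would be forwarded as `/schemagroups/myapp/schemas/foo.2/versions/2`.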
One interesting thing I was thinking about with respect to semver support for versioning is that compatibility checking could perhaps be customized with a different configuration based on a real understanding of semver. So if a client tries to update the content of a schema that is currently at version `2.7.3`, uploading content to new version `2.7.4`, then that is a patch version release and the compatibility checker can be very strict. But if the client does the same thing to new version `3.0.0`, then the compatibility checker can, e.g., allow breaking changes.
Just a thought. :)
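A minimal sketch of that idea, assuming plain three-part `major.minor.patch` strings without pre-release tags. `Bump`, `classify_bump`, and `breaking_changes_allowed` are hypothetical names; nothing like this exists in the proposed API.

```python
from enum import Enum


class Bump(Enum):
    MAJOR = "major"
    MINOR = "minor"
    PATCH = "patch"


def parse(version: str) -> tuple[int, int, int]:
    """Parse 'major.minor.patch'; raises ValueError on any other shape."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def classify_bump(current: str, proposed: str) -> Bump:
    """Return which semver component changes between two versions."""
    cur, new = parse(current), parse(proposed)
    if new <= cur:
        raise ValueError(f"{proposed} is not newer than {current}")
    if new[0] != cur[0]:
        return Bump.MAJOR
    if new[1] != cur[1]:
        return Bump.MINOR
    return Bump.PATCH


def breaking_changes_allowed(current: str, proposed: str) -> bool:
    """Assumed policy: only a major bump may break compatibility."""
    return classify_bump(current, proposed) is Bump.MAJOR
```

A compatibility checker could then pick its strictness from the result, for example strict for `2.7.3` to `2.7.4` and permissive for `2.7.3` to `3.0.0`.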
@chrish42 @EricWittmann the model here allows for a "trivial" case where the version is just a server-assigned counter, similar to how a GitHub commit identifier is a server-assigned identifier for an entry in the sequential commit log. That identifier has no deeper meaning either. (The commit id is not a plain counter, for different reasons.) If you want to do something more sophisticated, you can always manage schema versions explicitly as "myschema:v1.1" and "myschema:v1.2" as separate schemas within the group, and if you want the "latest" functionality, you also maintain a "myschema" that always returns the latest version. An implementation behind the protocol could easily provide that.
schemaregistry/schemaregistry.md
Outdated
|
||
`/schemagroups/{group-id}/schemas/{schema-id}/versions/{version}` | ||
|
||
The name of the first segment of the path is a suggestion and MAY differ between |
How do you see the query for all group getting interop w/o agreement on the first one?
"schemagroups" is really part of the path to the registry itself; everything defined here sits under that path. The segment could even be empty or could be multiple segments. I don't see there being interop issues, because you would always have it at the root of the URI.
ah, I didn't realize you meant for "schemagroups" to be an impl choice - we might want to make that clearer throughout the entire doc
Also update the OpenAPI to remove `/schemagroups` from paths. Would make it easier to generate servers/clients from the OpenAPI at some future point.
Really welcoming this initiative! I like where it's going; put in a few comments.
description: Schema group already exists | ||
tags: | ||
- 'groups' | ||
delete: |
Should DELETE actually be allowed? Did you consider a "decommission" or "soft delete" option as an alternative? That would, e.g., prohibit producing new events that reference a schema taken out of service, while existing events could still be decoded.
Yes, in general deletes are problematic and should IMO be either discouraged or prevented. Immutability has been a useful property in our registry for schema versions - we can deprecate and disable versions. And I wish we had done the same for the entire schema (deprecate/disable) rather than allow deletes.
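To make the "soft delete" idea concrete, here is a rough sketch of what a decommissioned version could mean for readers and writers. The `State` values, `SchemaVersion` class, and method names are all hypothetical; the proposal itself only defines DELETE.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    DECOMMISSIONED = "decommissioned"  # assumed state, not in the proposal


@dataclass
class SchemaVersion:
    body: bytes
    state: State = State.ACTIVE

    def decommission(self) -> None:
        """Take the version out of service without removing its content."""
        self.state = State.DECOMMISSIONED

    def read_for_decoding(self) -> bytes:
        """Consumers can always fetch the body to decode existing events."""
        return self.body

    def read_for_producing(self) -> bytes:
        """Producers may only reference versions that are still active."""
        if self.state is not State.ACTIVE:
            raise PermissionError("schema version is decommissioned")
        return self.body
```

The body is never mutated or dropped, which keeps the immutability property described above.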
schemaregistry/schemaregistry.md
Outdated
> 2) Since the above strategy is truly RESTful, but quite esoteric if you've not | ||
> grown up as a RESTafarian, the alternative strategy for concurrently | ||
> handling multiple schema formats is much simpler: Constrain each schema | ||
> group to a single format. |
This seems simple and pragmatic. Does it make sense to start with this approach?
What about constraining on a per-schema basis (giving each schema a "format" or "type")? The reason here is that I can imagine a logical group of schemas that aren't all the same technology. Especially if this were ever to expand beyond schemas and into e.g. API Designs as well - I would want to have a group that included perhaps multiple OpenAPI documents as well as some JSON Schemas... and my OpenAPI would likely have `$ref`s to the JSON Schemas.
@EricWittmann I will add that as an XOR option, i.e. you can either define the formats at the group or schema level. If you define it at the group level, that is binding, meaning you can't override.
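The XOR rule could be checked along these lines. This is a sketch only: `effective_format` is a hypothetical helper, and treating "no format at either level" as an error is an assumption the thread does not settle.

```python
from typing import Optional


def effective_format(group_format: Optional[str], schema_format: Optional[str]) -> str:
    """Resolve a schema's format under the group-XOR-schema rule."""
    if group_format is not None:
        # A group-level format is binding; a schema cannot override it.
        if schema_format is not None:
            raise ValueError("format is fixed by the group and cannot be set on the schema")
        return group_format
    if schema_format is None:
        # Assumption: some format must be known at one of the two levels.
        raise ValueError("no format set at group or schema level")
    return schema_format
```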
schemaregistry/schemaregistry.md
Outdated
- Constraints: | ||
- REQUIRED | ||
- MUST be a non-empty string | ||
- MUST conform with RFC3986/3.3 `segment-nz-nc` syntax |
Would it be more familiar to specify this as a hostname / `reg-name` in section 3.2.2 ("Host"), rather than a path segment (which is slightly more permissive)?
I think `segment-nz-nc` better supports some common ways you might want to name your groups to indicate a hierarchy where none exists. Or an Organization + Project format - that sort of thing. I'm thinking things like how a lot of NPM packages are now being named...
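For reference, RFC 3986 section 3.3 defines `segment-nz-nc` as one or more of unreserved characters, percent-encoded octets, sub-delims, or `@`, with no colon. A small check could look like this; the function name `is_segment_nz_nc` is just for illustration.

```python
import re

# segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
#   unreserved  = ALPHA / DIGIT / "-" / "." / "_" / "~"
#   sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
_SEGMENT_NZ_NC = re.compile(r"(?:[A-Za-z0-9\-._~!$&'()*+,;=@]|%[0-9A-Fa-f]{2})+")


def is_segment_nz_nc(value: str) -> bool:
    """True if value is a non-empty path segment containing no ':' or '/'."""
    return _SEGMENT_NZ_NC.fullmatch(value) is not None
```

Names such as `@myorg` or `myorg.project` pass, while anything containing `/` or `:` is rejected, so an npm-style `@myorg/project` would need a different separator.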
A schema version is a document. The "body" of a schema version MAY be a text | ||
document or binary stream. An implementation SHOULD validate whether a | ||
schema version is valid according to the rules of its format, for instance | ||
whether it is a valid Avro schema document when the format is Apache Avro. |
How do recipients know what type of content to expect in the body? Is this based on the `datacontenttype` in the received message (plus some sort of lookup table to map `datacontenttype` to `format`)?
If so, making `format` the same value as `datacontenttype` would simplify the API.
@EricWittmann "SHOULD" leaves that up to what you think is right.
schemaregistry/schemaregistry.yaml
Outdated
- 'groups' | ||
put: | ||
summary: Create schema group | ||
description: Create schema group with specified format format in registry namespace. |
description: Create schema group with specified format format in registry namespace. | |
description: Create schema group with specified format in registry namespace. |
operationId: getLatestSchema | ||
responses: | ||
'200': | ||
$ref: '#/components/responses/SchemaBytePayloadResponse' |
Do you want to include the ID that was served in this response, so that clients can retrieve the value again on subsequent calls?
schemaregistry/schemaregistry.md
Outdated
@@ -0,0 +1,257 @@ | |||
# CNCF Schema Registry API Version 0.1-rc01s |
maybe "CNCF Schema Registry API - wip" since it's not an 'rc' yet.
This section further describes the elements enumerated in the introduction. | ||
|
||
### 2.1. Schema Group | ||
|
Just wanted to add that groups is a great concept that we don't (yet) have in Apicurio Registry, but may have been a mistake to not include. It's important I think to organize these things into groupings. I guess I'm just saying +1 to this concept from me. :)
This specification does not define management constructs for such access control | ||
rules. |
+1 Good
A newer schema version might introduce breaking changes or it might only | ||
introduce careful changes that preserve compatibility. These strategies are not | ||
subject of this specification, but the API provides a conflict handling | ||
mechanism that allows an implementation to reject updates that do not comply | ||
with a compatibility policy, if one has been implemented. |
Schema evolution is a pretty significant concept in a schema registry. I haven't seen how the spec facilitates configuring the compatibility (or validity) rules. This paragraph mentions a compatibility policy, but if that's mentioned anywhere else I missed it. :(
This might be an important enough feature to include in the spec. For Apicurio Registry we have the concept of "rules" that can be configured globally or per-artifact (in this spec I imagine rules could also be configured at the schemagroup level). Right now we have only two rules: Validity and Compatibility. But perhaps "rule" is a concept to consider?
@EricWittmann I believe this is an important feature of the registry, but not an important feature of the protocol. At the protocol level, a policy violation bubbles up as a plain conflict. If we were designing an implementation and the management API for that implementation, I would agree.
Fair enough. :)
described data structure. All documents coexisting within the same version | ||
SHOULD describe the exact same data structure. | ||
|
||
### 2.2.2. Schema attributes |
It would be useful to keep some other metadata for each artifact, particularly if anyone ever wants to create a Registry UI. Name, description, labels, creationTime, modifiedTime, etc.
A schema version is a document. The "body" of a schema version MAY be a text | ||
document or binary stream. An implementation SHOULD validate whether a | ||
schema version is valid according to the rules of its format, for instance | ||
whether it is a valid Avro schema document when the format is Apache Avro. |
Can validity be disabled or configured (e.g. syntax vs. semantic validity)? Treating Validity and Compatibility in similar ways has proven useful in Apicurio Registry.
schemaregistry/schemaregistry.md
Outdated
- Type: `String` | ||
- Description: | ||
- Constraints: | ||
- OPTIONAL. Can be used if and only if not format has been set for the schema |
Should be: if and only if format has not been set for the schema
Signed-off-by: clemensv <clemensv@microsoft.com>
2254b64 to c33abad
retention policy, but implementations MAY retire and remove outdated schema | ||
versions. | ||
|
||
The latest available schema is always the default version that is retrieved when |
Might want to add a MUST in here someplace, perhaps:
"When the URL to a schema is used without a version string, the implementation MUST return the latest version of that schema."
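In implementation terms, the suggested MUST amounts to something like the sketch below. It assumes versions are kept in a dict keyed by the server-assigned integer; `resolve_version` is a hypothetical helper, not part of the OpenAPI document.

```python
from typing import Optional


def resolve_version(versions: dict[int, bytes], requested: Optional[int] = None) -> bytes:
    """Return the requested schema version, or the latest one if none is given."""
    if not versions:
        raise LookupError("schema has no versions")
    if requested is None:
        # No version in the URL: the highest assigned number is the latest.
        return versions[max(versions)]
    try:
        return versions[requested]
    except KeyError:
        raise LookupError(f"version {requested} not found") from None
```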
- Description: Instant when the schema was added to the registry. | ||
- Constraints: | ||
- OPTIONAL | ||
- Assigned by the server. |
My OCD is kicking in... sometimes you have periods, sometimes you don't on bulleted lists :-) Can we choose one? I'd prefer no periods.
I am referring to the CE spec for the data types now. Do you think we need to copy them?
I was just referring to the lack of consistency on the bulleted lists - some ending in a period and some not. I don't have an opinion on copying vs referencing the data types
- Description: Instant when the schema was added to the registry. | ||
- Constraints: | ||
- OPTIONAL | ||
- Assigned by the server. |
as above, I think we need a MUST here specifying the format/syntax - or define "Timestamp" in some kind of "data types" section
schemaregistry/schemaregistry.md
Outdated
schema version is valid according to the rules of its format, for instance | ||
whether it is a valid Avro schema document when the format is Apache Avro. | ||
|
||
Within the scope of the schema set, the version is identified by the combination |
use of the word "set" here might confuse people since it's new. Did you mean "group" or "set of versions for one particular schema" ?
schemaregistry/schemaregistry.md
Outdated
|
||
Within the scope of the schema set, the version is identified by the combination | ||
of a version number and an optional format identifier. The schema version MAY | ||
also have an additional, optional unique identifier within the scope of the |
Can you elaborate on this last sentence? Do you mean they can add extensions that are unique identifier values? If so, why did you call this out? Just curious.
Any concrete schema document may also have a unique identifier for itself. There is either a path /group/xyz/schemas/abc/versions/1 or you could just address that exact doc with its ID. We want that to enable a URL shortener function.
- 1 | ||
- 2 | ||
|
||
#### id |
How do you see this being used by a consumer of the registry?
aka.ms/s/{id} URL shortener option for greedy protocols.
Why isn't that just part of the URL shortener function's logic? E.g. tinyurl.com doesn't ask for an id - I just give it a full URL
schemaregistry/schemaregistry.md
Outdated
|
||
These dependencies are reflected in the path structure: | ||
|
||
`[/schemagroups]/{group-id}/schemas/{schema-id}/versions/{version}` |
Not sure if it matters, but the use of the word `id` got me wondering if it should be `name` instead. While the semantics would be the same either way, all examples of these IDs appear to be more like human-friendly names rather than IDs (e.g. GUIDs). So while it's not totally human friendly (e.g. no spaces, etc.), if we expect people to use meaningful "words" and not "random chars", then perhaps `name` would help guide them in that direction.
Just a thought
Do we need to add something here to indicate that the `/versions/{version}` part is optional?
`[/schemagroups]/{group-id}/schemas/{schema-id}[/versions/{version}]`
or text?
Signed-off-by: clemensv <clemensv@microsoft.com>
For initial review. I'm still updating both documents including changing some names, but the combination of OpenAPI doc and the spec doc should already tell a fairly complete story.
Microsoft proposal for #610