[schema] Introduce multi version generic record schema #3670

Merged Feb 26, 2019 (1 commit)

@sijie (Member) commented Feb 22, 2019

Motivation

Currently AUTO_CONSUME only supports decoding records with the latest schema,
so schema version information is lost. This makes AUTO_CONSUME less useful in some use cases,
such as CDC, because applications have no way to know which schema version
a message is using.

To support multi-version schemas, we need to propagate the schema version from
the message header through the schema#decode method to the decoded record.

Modifications

  • Introduce a new decode method, decode(byte[] data, byte[] schemaVersion), which allows an implementation
    to leverage the schema version.
  • Introduce a method, supportSchemaVersioning, that tells which decode method to use, because for most schema
    implementations, such as primitive schemas and POJO-based schemas, using a schema version doesn't make sense.
  • Introduce a SchemaProvider, which returns a specific schema instance for a given schema version.
  • Implement a MultiVersionGenericRecordSchema, which decodes messages based on schema version. All records
    decoded by this schema carry the schema version and its corresponding schema definition.
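
As a rough illustration of how these pieces could fit together, here is a minimal, self-contained sketch. It does not use the Pulsar client classes: `Schema`, `SchemaProvider`, `VersionedRecord`, `TextSchema`, `InMemoryProvider`, and `MultiVersionSchema` below are simplified, hypothetical stand-ins, and the in-memory map takes the place of a provider that would fetch schemas from brokers.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Illustrative stand-ins only; these are not the Pulsar client classes.
class MultiVersionSketch {

    /** Schema contract with the two additions listed above. */
    interface Schema<T> {
        T decode(byte[] data);

        // Version-unaware schemas (primitives, POJOs) keep this default.
        default boolean supportSchemaVersioning() {
            return false;
        }

        // By default the version is ignored and the plain decode is used.
        default T decode(byte[] data, byte[] schemaVersion) {
            return decode(data);
        }
    }

    /** Returns the schema instance for a given schema version. */
    interface SchemaProvider<T> {
        Schema<T> getSchema(byte[] schemaVersion);
    }

    /** Decoded record that carries the version it was decoded with. */
    static final class VersionedRecord {
        final byte[] schemaVersion;
        final String payload;

        VersionedRecord(byte[] schemaVersion, String payload) {
            this.schemaVersion = schemaVersion;
            this.payload = payload;
        }
    }

    /** Hypothetical per-version schema: UTF-8 text prefixed with a label. */
    static final class TextSchema implements Schema<VersionedRecord> {
        private final String label;

        TextSchema(String label) {
            this.label = label;
        }

        @Override
        public boolean supportSchemaVersioning() {
            return true;
        }

        @Override
        public VersionedRecord decode(byte[] data) {
            return decode(data, null);
        }

        @Override
        public VersionedRecord decode(byte[] data, byte[] schemaVersion) {
            String text = new String(data, StandardCharsets.UTF_8);
            return new VersionedRecord(schemaVersion, label + ":" + text);
        }
    }

    /** Provider backed by a map; a real one would fetch from brokers. */
    static final class InMemoryProvider implements SchemaProvider<VersionedRecord> {
        private final Map<String, Schema<VersionedRecord>> byVersion = new HashMap<>();

        void register(byte[] version, Schema<VersionedRecord> schema) {
            byVersion.put(Arrays.toString(version), schema);
        }

        @Override
        public Schema<VersionedRecord> getSchema(byte[] schemaVersion) {
            Schema<VersionedRecord> schema = byVersion.get(Arrays.toString(schemaVersion));
            if (schema == null) {
                throw new IllegalArgumentException("unknown schema version");
            }
            return schema;
        }
    }

    /** Delegates each decode to the schema registered for that version. */
    static final class MultiVersionSchema implements Schema<VersionedRecord> {
        private final SchemaProvider<VersionedRecord> provider;

        MultiVersionSchema(SchemaProvider<VersionedRecord> provider) {
            this.provider = provider;
        }

        @Override
        public boolean supportSchemaVersioning() {
            return true;
        }

        @Override
        public VersionedRecord decode(byte[] data) {
            throw new UnsupportedOperationException("a schema version is required");
        }

        @Override
        public VersionedRecord decode(byte[] data, byte[] schemaVersion) {
            return provider.getSchema(schemaVersion).decode(data, schemaVersion);
        }
    }

    public static void main(String[] args) {
        InMemoryProvider provider = new InMemoryProvider();
        provider.register(new byte[] {1}, new TextSchema("users-v1"));
        provider.register(new byte[] {2}, new TextSchema("users-v2"));
        MultiVersionSchema schema = new MultiVersionSchema(provider);
        byte[] data = "alice".getBytes(StandardCharsets.UTF_8);
        System.out.println(schema.decode(data, new byte[] {2}).payload); // prints users-v2:alice
    }
}
```

In this sketch the multi-version schema rejects the version-less decode outright; that is one possible choice, and falling back to the latest schema would be another.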

NOTES

This implementation only introduces the mechanism; it doesn't wire the multi-version schema
into the AUTO_CONSUME schema. A subsequent pull request will implement a schema provider
that fetches and caches schemas from brokers.

@sijie added the area/client, type/feature, and component/schemaregistry labels Feb 22, 2019
@sijie added this to the 2.4.0 milestone Feb 22, 2019
@sijie self-assigned this Feb 22, 2019
@ivankelly (Contributor)

To clarify, this is only ever used for AUTO_CONSUME, no? When a client passes a schema like Schema.AVRO(MyPojo.class), it won't actually know the schema version. So the schema version is only used here for the client to request a specific version of the schema from the broker?

@sijie (Member, Author) commented Feb 23, 2019

@ivankelly that's correct. POJO and primitive schemas will not know the schema version. For any other schema implementation that returns GenericRecord, the client will use the schema version to request a specific version of the schema from the broker. Currently the schema implementation returning GenericRecord is AUTO_CONSUME.
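
A minimal sketch of the dispatch this implies on the consuming side, assuming the version bytes have already been read from the message header. The `Schema` interface, both schema classes, and the `decodeMessage` helper are hypothetical stand-ins written for this example, not Pulsar client code.

```java
import java.nio.charset.StandardCharsets;

// Illustrative stand-ins only; these are not the Pulsar client classes.
class DecodeDispatchSketch {

    interface Schema<T> {
        T decode(byte[] data);

        default boolean supportSchemaVersioning() {
            return false;
        }

        default T decode(byte[] data, byte[] schemaVersion) {
            return decode(data);
        }
    }

    /** Version-unaware schema, comparable to a primitive or POJO schema. */
    static final class StringSchema implements Schema<String> {
        @Override
        public String decode(byte[] data) {
            return new String(data, StandardCharsets.UTF_8);
        }
    }

    /** Version-aware schema that tags its output with the version it received. */
    static final class VersionTaggingSchema implements Schema<String> {
        @Override
        public boolean supportSchemaVersioning() {
            return true;
        }

        @Override
        public String decode(byte[] data) {
            return decode(data, null);
        }

        @Override
        public String decode(byte[] data, byte[] schemaVersion) {
            boolean missing = schemaVersion == null || schemaVersion.length == 0;
            String tag = missing ? "unknown" : "v" + schemaVersion[0];
            return tag + ":" + new String(data, StandardCharsets.UTF_8);
        }
    }

    /** Picks the decode overload based on supportSchemaVersioning(). */
    static <T> T decodeMessage(Schema<T> schema, byte[] payload, byte[] versionFromHeader) {
        if (schema.supportSchemaVersioning() && versionFromHeader != null) {
            return schema.decode(payload, versionFromHeader);
        }
        return schema.decode(payload);
    }

    public static void main(String[] args) {
        byte[] payload = "hi".getBytes(StandardCharsets.UTF_8);
        byte[] version = new byte[] {2};
        System.out.println(decodeMessage(new StringSchema(), payload, version));         // prints hi
        System.out.println(decodeMessage(new VersionTaggingSchema(), payload, version)); // prints v2:hi
    }
}
```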

codelipenghui approved these changes Feb 23, 2019

@Override
public GenericRecord decode(byte[] bytes, byte[] schemaVersion) {
    return provider.getSchema(schemaVersion).decode(bytes, schemaVersion);

@codelipenghui (Contributor)

We already get the specific schema here; is it necessary to pass schemaVersion when calling decode()?

@sijie (Member, Author)

@codelipenghui good point. I think it is a good practice to pass the schema version down to the specific implementation and let the implementation decide whether to use this schema version or not.

@sijie sijie merged commit 3c36705 into apache:master Feb 26, 2019
@sijie sijie deleted the schema_management branch February 26, 2019 12:02
merlimat pushed a commit that referenced this pull request Mar 29, 2019
jiazhai pushed a commit that referenced this pull request Jul 1, 2020
Motivation
Pulsar 2.4.0 added schema versioning to support producing and consuming multi-version messages (#3876 #3670 #4211 #4325 #4548), but the doc was not updated accordingly.

Modifications
Update the schema version in the pulsar registry doc for releases 2.4.0/2.4.1/2.4.2.
huangdx0726 pushed a commit to huangdx0726/pulsar that referenced this pull request Aug 24, 2020