Use Message.getReaderSchema() in Pulsar IO Sinks when possible #10557

eolivelli · 2021-05-12T12:26:45Z

Motivation

When you run a Sink and call record.getSchema(), the returned schema does not accurately reflect the schema of the record once the topic's schema has been updated.

For instance:

the topic starts with AVRO Schema schema1
the AutoConsumeSchema starts by populating the internal SchemaInfo with the definition of schema1
the topic advances to AVRO Schema schema2
the AutoConsumeSchema still reports SchemaInfo for schema1
record.getSchema().getNativeSchema() reports the wrong schema definition

Modifications

With this change we leverage the PIP-85 Message.getReaderSchema() API, which returns the exact schema used for the Message, that is, a schema whose information is aligned to the same "schemaVersion" as the Message represented by the Record. That schema:

reports the correct getSchemaInfo()
reports the correct getNativeSchema()
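
The difference between the consumer-level schema and the per-message reader schema can be sketched with a toy model. This is only an illustration under stated assumptions: ToyMessage, ReaderSchemaSketch and effectiveSchema are hypothetical names, schemas are plain strings, and the fallback to the consumer-level schema is a guess based on the "when possible" in the title, not the code in this PR.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Toy model only: hypothetical stand-ins, not Pulsar's real Message or Schema types.
final class ToyMessage {
    final long schemaVersion;         // version the message was written with
    final Map<Long, String> registry; // version -> schema definition (stand-in for the registry)

    ToyMessage(long schemaVersion, Map<Long, String> registry) {
        this.schemaVersion = schemaVersion;
        this.registry = registry;
    }

    // Mirrors the idea behind Message.getReaderSchema(): resolve the schema
    // at the version of this message; empty if that version is unknown.
    Optional<String> getReaderSchema() {
        return Optional.ofNullable(registry.get(schemaVersion));
    }
}

final class ReaderSchemaSketch {
    // Prefer the per-message schema; fall back to the consumer-level one (assumed behaviour).
    static String effectiveSchema(ToyMessage message, String consumerSchema) {
        return message.getReaderSchema().orElse(consumerSchema);
    }

    public static void main(String[] args) {
        Map<Long, String> registry = new HashMap<>();
        registry.put(1L, "schema1");
        registry.put(2L, "schema2");

        // The consumer-level schema was populated while the topic was at schema1.
        String consumerSchema = "schema1";

        // A message produced after the topic advanced to schema2.
        ToyMessage afterUpdate = new ToyMessage(2L, registry);

        System.out.println(consumerSchema);                               // stale definition
        System.out.println(effectiveSchema(afterUpdate, consumerSchema)); // definition matching the message
    }
}
```

In the toy model the consumer-level value never changes after the first message, while the per-message lookup follows the schema version carried by each message.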

I am also fixing a problem in KeyValueSchema for atSchemaVersion(): the constructor of KeyValueSchema did not fill in the SchemaInfo data structure, resulting in NPEs.
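
The general pattern behind this kind of fix can be sketched as follows. This is a toy model, not the real KeyValueSchema: ToySchemaInfo, ToyKeyValueSchema and their fields are hypothetical, and the actual change in this PR may be structured differently. The point is only that the info object is built inside the constructor, so every instance, including one returned by a method like atSchemaVersion(), has it.

```java
import java.util.Objects;

// Hypothetical stand-in for a schema description; not Pulsar's SchemaInfo.
final class ToySchemaInfo {
    final String name;

    ToySchemaInfo(String name) {
        this.name = name;
    }
}

// Hypothetical stand-in for a key/value schema; not Pulsar's KeyValueSchema.
final class ToyKeyValueSchema {
    private final String keySchemaName;
    private final String valueSchemaName;
    private final ToySchemaInfo schemaInfo;

    ToyKeyValueSchema(String keySchemaName, String valueSchemaName) {
        this.keySchemaName = Objects.requireNonNull(keySchemaName);
        this.valueSchemaName = Objects.requireNonNull(valueSchemaName);
        // Populate the info eagerly so no instance can exist without it.
        this.schemaInfo = new ToySchemaInfo("KeyValue(" + keySchemaName + "," + valueSchemaName + ")");
    }

    ToySchemaInfo getSchemaInfo() {
        return schemaInfo;
    }

    // Builds a new instance through the same constructor, so the result
    // also carries a non-null info object.
    ToyKeyValueSchema withSchemaNames(String newKeySchemaName, String newValueSchemaName) {
        return new ToyKeyValueSchema(newKeySchemaName, newValueSchemaName);
    }
}
```

Here withSchemaNames() plays the role that atSchemaVersion() plays in the description: it derives a new schema object, and callers can dereference getSchemaInfo() on the result without a null check.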

Verifying this change

This change adds a new integration test and a unit test

Does this pull request potentially affect one of the following parts:

It affects Pulsar Sinks that implement Sink: Record.getSchema() now returns accurate schema information.

Documentation

No need for docs: the previous behaviour was unexpected, and the new behaviour is what you would expect.

eolivelli · 2021-05-12T12:27:28Z

I am working on the integration test; I have marked this PR as "draft" for early review.

codelipenghui

LGTM

congbobo184 · 2021-05-13T09:54:49Z

LGTM!

Use Message.getReaderSchema in Pulsar IO Sinks when possible

9fb5603

eolivelli self-assigned this May 12, 2021

eolivelli added area/connector component/schemaregistry labels May 12, 2021

eolivelli added this to the 2.8.0 milestone May 12, 2021

codelipenghui reviewed May 12, 2021

add integration test

9e31671

eolivelli marked this pull request as ready for review May 12, 2021 14:56

eolivelli added 3 commits May 12, 2021 18:28

Debug CI error

7229625

Ensure that KeyValueSchema always has a SchemaInfo

cc20cc0

fix test

84e9f7e

congbobo184 approved these changes May 13, 2021

eolivelli merged commit 90117b2 into apache:master May 14, 2021

eolivelli deleted the impl/pulsario-getreaderschema branch May 14, 2021 05:29

eolivelli added a commit to datastax/pulsar that referenced this pull request May 14, 2021

Use Message.getReaderSchema() in Pulsar IO Sinks when possible (apach…

5dee3eb

…e#10557) (cherry picked from commit 90117b2)
