
Allow Pulsar IO Sources to push GenericRecord instances encoded with AVRO #9481

Closed
wants to merge 10 commits

Conversation

eolivelli (Contributor)

Motivation

Currently, Pulsar Source connectors cannot produce messages using the GenericRecord API.

Producing messages with the GenericRecord API allows a connector to dynamically generate data structures that can be consumed downstream using the supported encodings.

Modifications

Allow a Pulsar Source to be declared as producing "GenericRecord" instances. This in turn means changing PulsarSink, the entity that receives messages from the Source and writes them to the Pulsar topic.

  • PulsarSink: do not pre-create a Schema for the GenericRecord data type (the schema will be created at the first write)
  • AvroWriter: support encoding any GenericRecord, not only GenericAvroRecord
  • Add an integration test

Verifying this change

The change adds integration tests and new unit tests.

Documentation

There is no need to add documentation: a developer will naturally declare the source as Source<GenericRecord> and expect it to work, as in the sketch below.
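
As an illustration, a minimal sketch of such a source (hypothetical connector and field names, not part of this patch), building its schema dynamically with the Pulsar client's SchemaBuilder API:

import java.util.Map;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.schema.GenericRecord;
import org.apache.pulsar.client.api.schema.GenericSchema;
import org.apache.pulsar.client.api.schema.RecordSchemaBuilder;
import org.apache.pulsar.client.api.schema.SchemaBuilder;
import org.apache.pulsar.common.schema.SchemaType;
import org.apache.pulsar.functions.api.Record;
import org.apache.pulsar.io.core.Source;
import org.apache.pulsar.io.core.SourceContext;

public class DynamicSchemaSource implements Source<GenericRecord> {

    private GenericSchema<GenericRecord> schema;

    @Override
    public void open(Map<String, Object> config, SourceContext ctx) throws Exception {
        // The schema is assembled at runtime, e.g. from an external system's metadata.
        RecordSchemaBuilder builder = SchemaBuilder.record("dynamic");
        builder.field("name").type(SchemaType.STRING);
        schema = Schema.generic(builder.build(SchemaType.AVRO));
    }

    @Override
    public Record<GenericRecord> read() throws Exception {
        GenericRecord value = schema.newRecordBuilder().set("name", "hello").build();
        return new Record<GenericRecord>() {
            @Override
            public GenericRecord getValue() {
                return value;
            }

            @Override
            public Schema<GenericRecord> getSchema() {
                // PulsarSink picks up the schema from the record itself.
                return schema;
            }
        };
    }

    @Override
    public void close() {
    }
}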

codelipenghui (Contributor)

@congbobo184 Could you please also help review this PR?

codelipenghui (Contributor) left a comment:

LGTM

import lombok.Getter;
import lombok.Setter;
import lombok.extern.slf4j.Slf4j;
import org.apache.pulsar.client.api.*;

Contributor:

It's better to avoid star imports.

eolivelli (Author):

Oh, sure. I will update the patch after @congbobo184's review, in order to save CI cycles.

Thanks.

eolivelli (Author)

@congbobo184 do you have cycles?

eolivelli (Author)

@codelipenghui I have addressed your comment and merged with current master.

congbobo184 (Contributor) left a comment:

@eolivelli Sorry for the late comment.

Comment on lines 70 to 94

/**
 * This is an adapter from Pulsar GenericRecord to Avro classes.
 */
private class GenericRecordAdapter extends SpecificRecordBase {
    private GenericRecord message;

    void setCurrentMessage(GenericRecord message) {
        this.message = message;
    }

    @Override
    public Schema getSchema() {
        return schema;
    }

    @Override
    public Object get(int field) {
        return message.getField(schema.getFields().get(field).name());
    }

    @Override
    public void put(int field, Object value) {
        throw new UnsupportedOperationException();
    }
}

Contributor:

The writer we generate now is a GenericDatumWriter, which can't write a SpecificRecordBase; we would have to support SpecificDatumWriter as well. So I don't think this is a good way to do it.

eolivelli (Author) on Feb 8, 2021:

Here we are in a writer, and a writer is supposed to only read from the object, not to alter its contents. So it is good that we prevent any change here. I don't understand your point.

The integration test gives an example of a Source that pushes GenericRecord instances; those messages can then be consumed by a standard Pulsar consumer with AUTO_CONSUME.
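
A minimal consumer sketch for that pattern, assuming a local broker and a hypothetical topic name (not taken from the integration test):

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.schema.GenericRecord;

public class AutoConsumeExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumption: local broker
                .build();
        try (Consumer<GenericRecord> consumer = client.newConsumer(Schema.AUTO_CONSUME())
                .topic("dynamic-topic") // hypothetical topic name
                .subscriptionName("sub")
                .subscribe()) {
            Message<GenericRecord> msg = consumer.receive();
            // Fields are accessed by name; the consumer needs no compile-time schema.
            System.out.println(msg.getValue().getField("name"));
            consumer.acknowledge(msg);
        } finally {
            client.close();
        }
    }
}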

congbobo184 (Contributor) on Feb 9, 2021:

The message inside GenericRecordAdapter can still be changed, so the adapter seems to have no effect: it can't actually prevent changes.

eolivelli (Author):

Here we are adding an adapter from a Pulsar GenericRecord to SpecificRecordBase, but we are only interested in implementing the read-only methods.

If you prefer, I can leave the implementation of this method empty; the method won't be called by the writer.

If you run the integration test you will see that we pass through this code.

The current implementation in master assumes that the GenericRecord is always a GenericAvroRecord, but with this new feature it can be any implementation of GenericRecord (in fact, in my case it will be a generic data structure generated dynamically, with a dynamic schema).

Contributor:

Why does GenericRecordAdapter extend SpecificRecordBase instead of implementing GenericRecord? It seems that private final GenericDatumWriter<org.apache.avro.generic.GenericRecord> writer; can only write GenericRecord, not SpecificRecordBase.

eolivelli (Author):

@congbobo184 I have double-checked, and we indeed only need to implement GenericRecord. The usage of SpecificRecordBase was a leftover of my initial implementation. I have simplified the code by implementing all of the GenericRecord methods.

Good catch!
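
A hedged sketch of that simplified shape (illustrative names, not the exact code merged in this patch): an adapter implementing org.apache.avro.generic.GenericRecord directly, so the existing GenericDatumWriter can serialize any Pulsar GenericRecord:

class PulsarToAvroRecordAdapter implements org.apache.avro.generic.GenericRecord {
    private final org.apache.avro.Schema schema;
    private final org.apache.pulsar.client.api.schema.GenericRecord message;

    PulsarToAvroRecordAdapter(org.apache.avro.Schema schema,
                              org.apache.pulsar.client.api.schema.GenericRecord message) {
        this.schema = schema;
        this.message = message;
    }

    @Override
    public org.apache.avro.Schema getSchema() {
        return schema;
    }

    @Override
    public Object get(String key) {
        // Read-only delegation to the Pulsar record.
        return message.getField(key);
    }

    @Override
    public Object get(int field) {
        // Resolve the positional field to a name via the Avro schema.
        return message.getField(schema.getFields().get(field).name());
    }

    @Override
    public void put(String key, Object value) {
        // The datum writer never mutates the record.
        throw new UnsupportedOperationException();
    }

    @Override
    public void put(int field, Object value) {
        throw new UnsupportedOperationException();
    }
}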

Comment on lines +200 to +202
if (!input && clazz.equals(GenericRecord.class)) {
    return new AutoProduceBytesSchema();
}

congbobo184 (Contributor) on Feb 8, 2021:

It seems the integration test doesn't cover this; it still uses record.getSchema(). What is the integration test testing? I don't understand this part.

eolivelli (Author):

The integration test does not work without this change.

Contributor:

You start a source, but the source doesn't get its schema through this method. I don't understand where this change is used.

eolivelli (Author):

@congbobo184
This method is called by PulsarSink in its initializeSchema() method, via TopicSchema#getSchema.
The PulsarSink is the entity that actually writes to Pulsar: when you create a Source, Pulsar IO creates a PulsarSink to write to the destination topic.

When you start the Source, you use this special AutoProduceBytesSchema (it was pre-existing, I did not add it).

Initially the Source does not enforce a Schema on the topic (we achieve this with AutoProduceBytesSchema).

When the Source passes a Record to the Pulsar runtime, the PulsarSink picks up the Schema (using Record#getSchema) and sets the schema properly. Therefore, when the schema changes (in a compatible way) the runtime automatically updates the schema.

In fact, when the Source starts you cannot know the schema, because the schema is generated dynamically by the Source itself.
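
A simplified, hypothetical sketch of that flow (class and caching strategy are illustrative, not the actual PulsarSink code), assuming one producer per distinct schema:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.schema.GenericRecord;
import org.apache.pulsar.functions.api.Record;

public class SchemaAwareSink {
    private final PulsarClient client;
    private final String topic;
    // Simplification: the real sink caches producers more carefully than by Schema instance.
    private final Map<Schema<GenericRecord>, Producer<GenericRecord>> producers =
            new ConcurrentHashMap<>();

    public SchemaAwareSink(PulsarClient client, String topic) {
        this.client = client;
        this.topic = topic;
    }

    public void write(Record<GenericRecord> record) {
        // The schema travels with the record (Record#getSchema);
        // nothing is pre-created when the sink starts.
        Schema<GenericRecord> schema = record.getSchema();
        Producer<GenericRecord> producer = producers.computeIfAbsent(schema, s -> {
            try {
                return client.newProducer(s).topic(topic).create();
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        producer.newMessage().value(record.getValue()).sendAsync();
    }
}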

Contributor:

Thank you for your detailed reply, I understand now. :)

eolivelli (Author)

@congbobo184 I have addressed your comment. PTAL, and thanks for your review.

lhotari (Member) left a comment:

LGTM

eolivelli (Author) commented Feb 9, 2021

/pulsarbot run-failures-tests

eolivelli (Author)

/pulsarbot run-failure-tests

gaoran10 (Contributor) commented Feb 9, 2021

/pulsarbot run-failure-checks

eolivelli (Author)

@congbobo184 CI passed; I hope that this patch is good to go now.

Thank you again for your reviews, @congbobo184, @codelipenghui, and @sijie.

congbobo184 (Contributor)

LGTM! great work!

eolivelli (Author)

Thank you @congbobo184. Can you please commit this patch? That way I can rebase another patch of mine and send it.

eolivelli (Author)

@sijie @codelipenghui: @congbobo184 approved this patch.

Can we merge it, please?

rdhabalia (Contributor) left a comment:

LGTM

rdhabalia (Contributor)

@sijie can you please help review this PR?

sijie (Member) left a comment:

@eolivelli I don't think this is the right implementation.

The fundamental problem this pull request tries to solve is to allow the Pulsar Sink to write a record based on the schema information it carries.

This should be done in the PulsarSink implementation, since the current PulsarSink already does the job of constructing a message based on the schema of a sink record. I don't think we need to change the existing schema implementations: those are designed to write one version and read multiple versions, and multi-version schema writes are implemented at the Pulsar client level.

I am happy to craft a skeleton of this change.

sijie (Member) commented Feb 15, 2021

@eolivelli I think the change can be as simple as #9590. I haven't verified that it works, but it should be the right direction to go, because PulsarSink already handles writing messages using multiple schemas. For GenericRecord source connectors, each source record already carries its schema information, and the PulsarSink can use that schema information to write the record.

sijie (Member) commented Feb 16, 2021

For the reviewers of this pull request, please take a look at #9590. To support writing a generic record, we don't need to change the existing AVRO schema implementation. It should be done in PulsarSink, as it needs to support writing multiple versions of generic records, and Pulsar Sink already has the framework to support that by leveraging PIP-43.

#9590 is ready for review with unit tests; integration tests are still to be added.

codelipenghui (Contributor)

@sijie Thanks for the comment. If I understand correctly, the approach uses the AUTO_CONSUME schema and PIP-43 to achieve this purpose. Previously, we always used the AUTO_CONSUME schema for consumers. Essentially, the AUTO_CONSUME schema is a generic schema that works with GenericRecord: it encodes a GenericRecord to byte[] and decodes byte[] back into a GenericRecord. So we don't need to change the existing implementation; we can use the AUTO_CONSUME schema for a producer too, to allow the producer to publish GenericRecord.

And with PIP-45, the message carries the real schema, so we are able to publish GenericRecords with different schema versions.

Please point out if I am wrong, thanks.

sijie (Member) commented Feb 16, 2021

@codelipenghui: Most of the understanding is correct. There is one mistake: we don't use AUTO_CONSUME for serializing the messages. We just use AUTO_CONSUME to tell PulsarSink that it doesn't need to pre-create a producer. The messages produced to the sink are all GenericRecords, and they all carry their schema information themselves.

Then we leverage PIP-45 to write the messages based on the schema information they carry.

eolivelli (Author)

I am closing this PR in favour of #9590. I would like to see the integration tests in this patch added to #9590, as they show a use case I have to support.
