
Pulsar IO - KafkaSource - allow to manage Avro Encoded messages #9448

Merged

Conversation

eolivelli
Contributor

@eolivelli eolivelli commented Feb 3, 2021

Motivation

Currently, KafkaSource can only deal with strings and byte arrays; it does not support records with a Schema.
In Kafka, messages can be encoded with Avro, and schemas can be managed by a Schema Registry (by Confluent®).

Modifications

Summary of changes:

  • allow the current KafkaSource (KafkaBytesSource) to use io.confluent.kafka.serializers.KafkaAvroDeserializer and copy the raw bytes to the Pulsar topic, setting the Schema appropriately
  • this source supports Schema Evolution end-to-end (i.e. add fields to the original schema on the Kafka side and see the new fields in the Pulsar topic, without any reconfiguration or restart)
  • add the Confluent® Schema Registry Client to the Kafka Connector NAR; its license is compatible with the Apache 2 license and we can redistribute it
  • the Schema Registry Client is configured via the consumerProperties property of the source (usually you add schema.registry.url); see the sketch after this list
  • add integration tests with Kafka and the Schema Registry
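
A minimal sketch (not text from the PR) of what the consumerProperties described above might look like when targeting the Confluent Avro deserializer; the broker and registry addresses are placeholder assumptions:

```java
import java.util.Properties;

public class KafkaAvroSourceConfigSketch {

    // Assemble the Kafka consumer properties the source would be configured with.
    public static Properties consumerProperties() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");            // placeholder broker address
        props.put("group.id", "pulsar-io-kafka-source");         // placeholder consumer group
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        // The Schema Registry Client is configured through consumerProperties:
        props.put("schema.registry.url", "http://schema-registry:8081"); // placeholder
        return props;
    }
}
```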

Verifying this change

The patch introduces new integration tests.
The integration tests launch a Kafka container and a Confluent Schema Registry container.

Documentation

I will be happy to provide documentation once this patch is committed.

Member

@sijie sijie left a comment


@eolivelli I think this approach makes unnecessary changes to the existing Pulsar schema library to achieve a very narrow goal. Pulsar and Kafka have very similar data models. Most of the time you should just transfer the bytes. The only thing you need to manage is converting the "Kafka SerDe" to a "Pulsar Schema". You don't really need to write a special record to carry the schema information.

I wrote a schema-aware Kafka source at https://github.com/streamnative/pulsar-io-kafka, which we haven't contributed back yet.

  1. It only has one source connector, which deals with byte[]. It works for all schemas, so you don't need different source connectors.

  2. All you need to do is convert the Kafka SerDe to a Pulsar schema. See: https://github.com/streamnative/pulsar-io-kafka/blob/master/src/main/java/io/streamnative/connectors/kafka/KafkaSource.java#L338

  3. Because Kafka encodes the schema id in the leading bytes of Avro messages, you need to write a Schema wrapper (see the sketch after this list). https://github.com/streamnative/pulsar-io-kafka/blob/master/src/main/java/io/streamnative/connectors/kafka/schema/KafkaAvroSchema.java

  4. Kafka also handles JSON slightly differently, hence we need to process that differently as well. https://github.com/streamnative/pulsar-io-kafka/blob/master/src/main/java/io/streamnative/connectors/kafka/schema/KafkaJsonSchema.java

  5. In a lot of cases, you can just use Kafka's existing tools to convert the Kafka schema into a standard Avro schema to be stored as a Pulsar schema. That is the benefit of using an open-standard serialization framework rather than introducing your own type system: reuse existing tools instead of reinventing the wheel.
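
To make point 3 concrete, here is a hedged sketch (not code from either connector) of how the Confluent wire format places the schema id ahead of the Avro payload: a magic byte followed by a 4-byte schema id.

```java
import java.nio.ByteBuffer;

public final class ConfluentWireFormatSketch {

    private static final byte MAGIC_BYTE = 0x0;

    // Extract the Schema Registry id that prefixes a Confluent Avro-encoded value.
    public static int readSchemaId(byte[] kafkaValue) {
        ByteBuffer buffer = ByteBuffer.wrap(kafkaValue);
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Not a Confluent Avro-encoded payload");
        }
        return buffer.getInt(); // 4-byte schema id; the Avro-encoded body follows
    }
}
```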

Contributor Author

@eolivelli eolivelli left a comment


@sijie
I think you are right that we should use open standards and existing tools, and not reinvent the wheel.

Indeed, this implementation of the KafkaSource uses exactly KafkaAvroDeserializer, which is the tool available in the Kafka ecosystem to deal with Avro.

I prefer this solution because it uses only standard tools from the Kafka ecosystem.
I believe that in the long term this very simple code will be easy to maintain here in the Pulsar codebase.

I would prefer not to get into the details of how Kafka and Confluent serialize data (extracting the schema id, extracting the raw payload, connecting to the schema registry...).

It is better to use the official library and standard APIs.
In the future it will be easy to upgrade to new versions of the Kafka/Confluent client and to support future changes/evolutions and enterprise features.

I believe this is a good enhancement to the Pulsar IO framework:
we allow Sources to push GenericRecords, and Pulsar will package the structure using the Schema provided by the Source via the Pulsar Schema API (Avro or whatever we will support in the future...).
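
A hedged illustration of that idea (not the code from this PR), assuming a Record#getSchema hook is available to the framework; the class name and wiring are illustrative:

```java
import java.util.Optional;

import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.schema.GenericRecord;
import org.apache.pulsar.functions.api.Record;

// A source record that carries the Schema it was decoded with,
// so the framework can create the producer with that schema.
public class GenericRecordWithSchema implements Record<GenericRecord> {

    private final GenericRecord value;
    private final Schema<GenericRecord> schema;

    public GenericRecordWithSchema(GenericRecord value, Schema<GenericRecord> schema) {
        this.value = value;
        this.schema = schema;
    }

    @Override
    public GenericRecord getValue() {
        return value;
    }

    @Override
    public Schema<GenericRecord> getSchema() { // assumption: Record#getSchema exists in this Pulsar version
        return schema;
    }

    @Override
    public Optional<String> getKey() {
        return Optional.empty(); // key handling is discussed separately below
    }
}
```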

If you prefer I can split the patch into two parts:

  • allow PulsarSource to deal with GenericRecord
  • enhance the standard KafkaSource

@sijie
Member

sijie commented Feb 4, 2021

@eolivelli

Unfortunately, I disagree with you on this. This approach sets a very bad example for other people to follow.

First of all, it basically creates a "connector" class per schema type. This is a very bad practice, and I would discourage a connector implementation from going down this route. It is impossible to maintain.

Secondly, it solves a very narrowly scoped problem by introducing a specific type of source connector in which the key is a string and the value is Avro. All the key schema information is dropped. Key schema information is important to a lot of streaming use cases, and you can't solve that problem with this approach. Following your approach, you will end up creating N*N connector classes (where N is the number of schema types).

I would encourage people not to introduce code changes that are specialized for their own needs. The connector implementation should be beneficial to a broader set of users.

> I would prefer not to get into the details of how Kafka and Confluent serialize data

Unfortunately, Kafka Avro is a Confluent open-source thing, not a Kafka community thing.

We are always in the game of touching serialization details when converting a message from format X to format Y. You either do the conversion at a high level using abstractions, or at a low level by working directly with the serialization details.

The approach you proposed also has worse performance, because it churns through a lot of object allocations. Working directly with the serialization details can save a lot of memory copies and serialization/deserialization.

> we allow Sources to push GenericRecords

This is a good initiative, but it should be isolated from this KafkaSource change.

@eolivelli
Contributor Author

@sijie there is no problem in disagreeing;
let's work together to find the right way to provide features to users, in the best way for the project.

I am going to split the patch into two parts; this way we can take one step at a time.

> First of all, it basically creates a "connector" class per schema type. This is a very bad practice, and I would discourage a connector implementation from going down this route. It is impossible to maintain.

We already have KafkaBytesSource and KafkaStringSource, so I am just adding a new flavour of the KafkaSource; in fact the implementation just adds a new subclass of KafkaAbstractSource.
I am following the current style.

In my plans I would like to work more on this KafkaSource and on the KafkaSink and try to improve the structure.

There is ongoing work that will allow putting more sinks in the same NAR and provide a better user experience.
#3678

In the meantime, users of the Kafka source can go with "--classname" (or select it from a Web UI in interactive Pulsar management consoles).

> in which the key is a string and the value is Avro

We can work on this as well (it is on my backlog); I didn't want to introduce too many features at once.

I have users who are used to advanced data mapping mechanisms, both for the key and for the value, so mapping the key is very important to me.
That said, the KafkaAbstractSource currently works on a string key; that is pre-existing code.

> The approach you proposed also has worse performance, because it churns through a lot of object allocations. Working directly with the serialization details can save a lot of memory copies and serialization/deserialization.

I know about this, and I know how the StreamNative connector works.

Using the Java model with GenericRecord adds that additional cost (see the consumer sketch after this list), but the benefits are:

  • simpler code, using code provided by the same vendor that maintains the serialization protocol
  • we can follow the protocol's evolution just by upgrading the Confluent library
  • we use pure Kafka/Pulsar APIs, fully integrated with the framework, which will allow us to leverage all future improvements
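
For context, a hedged sketch (my illustration, not the PR's code) of what "using the Java model with GenericRecord" looks like on the consumer side: each polled value is a fully decoded Avro GenericRecord, which is the extra per-message allocation discussed above. Topic name and connection settings are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class KafkaAvroConsumeSketch {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");                     // placeholder
        props.put("group.id", "example");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "io.confluent.kafka.serializers.KafkaAvroDeserializer");
        props.put("schema.registry.url", "http://schema-registry:8081");  // placeholder

        try (KafkaConsumer<String, GenericRecord> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("my-topic"));    // placeholder topic
            ConsumerRecords<String, GenericRecord> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, GenericRecord> rec : records) {
                // every value has been deserialized into an Avro GenericRecord object
                System.out.println(rec.value().getSchema().getFullName());
            }
        }
    }
}
```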

Member

@lhotari lhotari left a comment


Good work @eolivelli. I left a few comments about some minor things.

@eolivelli eolivelli force-pushed the impl/kafka-schema-aware-with-tests branch from f06c417 to 8e456e3 on March 8, 2021 10:21
Contributor Author

@eolivelli eolivelli left a comment


@sijie I have added a Guava cache, removed AUTO_PRODUCE_BYTES, removed the extractKey method, switched to ByteBuffer, and addressed all of your comments.
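
A hedged sketch of the kind of Guava cache mentioned here (not the merged code): memoize the Pulsar schema derived from each Avro schema definition so it is not rebuilt for every record. The cache key and the helper are assumptions for illustration.

```java
import java.nio.ByteBuffer;

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;

import org.apache.pulsar.client.api.Schema;

public class SchemaCacheSketch {

    private final LoadingCache<String, Schema<ByteBuffer>> schemas = CacheBuilder.newBuilder()
            .maximumSize(100) // bound the number of cached schema wrappers
            .build(new CacheLoader<String, Schema<ByteBuffer>>() {
                @Override
                public Schema<ByteBuffer> load(String avroDefinition) {
                    // buildPassThroughSchema is a hypothetical helper that wraps the Avro
                    // definition into a ByteBuffer pass-through schema (see the schema
                    // sketch later in this thread).
                    return buildPassThroughSchema(avroDefinition);
                }
            });

    public Schema<ByteBuffer> schemaFor(String avroDefinition) {
        return schemas.getUnchecked(avroDefinition);
    }

    private static Schema<ByteBuffer> buildPassThroughSchema(String avroDefinition) {
        throw new UnsupportedOperationException("placeholder for illustration only");
    }
}
```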

Please take a look again

@eolivelli eolivelli requested a review from sijie March 8, 2021 10:42
Member

@sijie sijie left a comment


@eolivelli overall looks good.

I see you removed the key handling from this PR. In general, I think we should structure the class hierarchy as follows. As we are introducing a new bytes connector, we need to make the key pure bytes, because that is the default behavior in Kafka.

KafkaAbstractSource -> KafkaAbstractStringKeySource -> KafkaStringSource
KafkaAbstractSource -> KafkaAbstractBytesKeySource -> KafkaBytesSource
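
A hedged sketch of what that hierarchy could look like; the generic parameters and member comments are assumptions, not the PR's code (one file per class in a real codebase, shown together here for brevity):

```java
abstract class KafkaAbstractSource<V> {
    // shared consumer lifecycle, polling loop, and consumerProperties handling
}

abstract class KafkaAbstractStringKeySource<V> extends KafkaAbstractSource<V> {
    // configures key.deserializer = StringDeserializer; records keyed by String
}

abstract class KafkaAbstractBytesKeySource<V> extends KafkaAbstractSource<V> {
    // configures key.deserializer = ByteArrayDeserializer; keys kept as raw bytes (Kafka's default)
}

class KafkaStringSource extends KafkaAbstractStringKeySource<String> {
    // existing String-value source
}

class KafkaBytesSource extends KafkaAbstractBytesKeySource<java.nio.ByteBuffer> {
    // new bytes / Avro-aware source
}
```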

<version>${kafka.confluent.avroserializer.version}</version>
</dependency>

<dependency>
Member


Why do you need this dependency?

Contributor Author


Good catch. I needed it initially; now it is unused.

Dropped.

.properties(Collections.emptyMap())
.schema(definition.getBytes(StandardCharsets.UTF_8)
).build();
return new Schema<ByteBuffer>() {
Member


I think you need to override the decode method and throw an UnsupportedOperationException. Otherwise, if the decode method is used, it will cause a StackOverflowError.

Contributor Author


done
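
For readers following along, a hedged sketch (not the exact merged code) of a pass-through Schema<ByteBuffer> along the lines of this exchange: encode forwards the already-serialized Kafka payload unchanged, and decode is explicitly unsupported so the mutually delegating default decode methods cannot recurse. The class name is illustrative and the SchemaInfo builder usage follows the snippet above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.common.schema.SchemaInfo;
import org.apache.pulsar.common.schema.SchemaType;

final class PassThroughAvroSchemaSketch {

    static Schema<ByteBuffer> of(String avroSchemaDefinition) {
        SchemaInfo info = SchemaInfo.builder() // assumes the same builder shown in the snippet above
                .name("kafka-avro")
                .type(SchemaType.AVRO)
                .properties(Collections.emptyMap())
                .schema(avroSchemaDefinition.getBytes(StandardCharsets.UTF_8))
                .build();

        return new Schema<ByteBuffer>() {
            @Override
            public byte[] encode(ByteBuffer message) {
                ByteBuffer dup = message.duplicate(); // do not consume the caller's buffer
                byte[] bytes = new byte[dup.remaining()];
                dup.get(bytes);
                return bytes;
            }

            @Override
            public ByteBuffer decode(byte[] bytes, byte[] schemaVersion) {
                // explicit override: without it, the default decode methods delegate to
                // each other and can trigger the StackOverflowError mentioned above
                throw new UnsupportedOperationException("pass-through schema is write-only");
            }

            @Override
            public SchemaInfo getSchemaInfo() {
                return info;
            }

            @Override
            public Schema<ByteBuffer> clone() {
                return this; // stateless wrapper, safe to reuse
            }
        };
    }
}
```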

@eolivelli
Contributor Author

@sijie I have addressed your last comments.
I created a new issue regarding the management of the key.
#9848

This way we can move forward one step at a time.

@eolivelli
Contributor Author

@codelipenghui @sijie CI passed. We are probably good to go now :)

@eolivelli
Contributor Author

/pulsarbot rerun-failure-checks

@eolivelli eolivelli requested a review from sijie March 12, 2021 11:08
@sijie sijie added this to the 2.8.0 milestone Mar 13, 2021
@sijie
Member

sijie commented Mar 13, 2021

@freeznet @codelipenghui Can you review this PR?

@eolivelli
Contributor Author

@codelipenghui can you please help merge this patch?

@codelipenghui codelipenghui merged commit d52a1b0 into apache:master Mar 15, 2021
@eolivelli eolivelli deleted the impl/kafka-schema-aware-with-tests branch March 15, 2021 07:46
@eolivelli
Contributor Author

Thank you very much @codelipenghui and @sijie

fmiguelez pushed a commit to fmiguelez/pulsar that referenced this pull request Mar 16, 2021