Support writing general records to Pulsar sink #9590

Merged
merged 7 commits into from
Mar 14, 2021

Conversation

sijie
Member

@sijie sijie commented Feb 15, 2021

No description provided.

@sijie sijie changed the title [WIP] Support writing general records to Pulsar sink Support writing general records to Pulsar sink Feb 16, 2021
@sijie
Member Author

sijie commented Feb 16, 2021

The unit tests are added. Integration tests to be added.

if (GenericRecord.class.isAssignableFrom(typeArg)) {
    consumerConfig.setSchemaType(SchemaType.AUTO_CONSUME.toString());
    SchemaType configuredSchemaType = SchemaType.valueOf(pulsarSinkConfig.getSchemaType());
    if (SchemaType.AUTO_CONSUME != configuredSchemaType) {
Contributor

Do we need to add consumerConfig.setSchemaType(pulsarSinkConfig.getSchemaType()); in this branch?

Member Author

No. The schema type is already overwritten in line 419. This is just to log an info message to indicate that the schema type has been overwritten to AUTO_CONSUME.
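
The behaviour described in this reply can be sketched in isolation. This is a minimal illustration, not the code from the PR: `GenericRecord` and `SchemaType` below are local stand-ins rather than the Pulsar classes, and `SinkSchemaResolver`, `resolveConsumerSchemaType`, and `infoMessages` are hypothetical names introduced only for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-ins for illustration only; these are NOT the Pulsar classes.
interface GenericRecord {}

enum SchemaType { AVRO, JSON, AUTO_CONSUME }

final class SinkSchemaResolver {
    // Messages are collected in a list instead of going to a logger,
    // so the sketch has no external dependencies.
    final List<String> infoMessages = new ArrayList<>();

    /**
     * Returns the schema type the consumer config should carry, or null
     * when no schema type is configured. For GenericRecord types the result
     * is always AUTO_CONSUME; a differing configured value only produces
     * an info message.
     */
    String resolveConsumerSchemaType(Class<?> typeArg, String configuredSchemaType) {
        if (configuredSchemaType == null || configuredSchemaType.isEmpty()) {
            return null;
        }
        if (GenericRecord.class.isAssignableFrom(typeArg)) {
            SchemaType configured = SchemaType.valueOf(configuredSchemaType);
            if (SchemaType.AUTO_CONSUME != configured) {
                infoMessages.add("Configured schema type " + configured
                        + " is overwritten to AUTO_CONSUME");
            }
            return SchemaType.AUTO_CONSUME.toString();
        }
        return configuredSchemaType;
    }
}
```

In this sketch the inner check never changes the returned value; it only decides whether the override is reported, which is the point made in the reply.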

@@ -400,7 +415,16 @@ public void close() throws Exception {
ConsumerConfig consumerConfig = new ConsumerConfig();
consumerConfig.setSchemaProperties(pulsarSinkConfig.getSchemaProperties());
if (!StringUtils.isEmpty(pulsarSinkConfig.getSchemaType())) {
    consumerConfig.setSchemaType(pulsarSinkConfig.getSchemaType());
    if (GenericRecord.class.isAssignableFrom(typeArg)) {
Contributor

@sijie initially you pointed out that making the change here in PulsarSink was not the right way, and that we should only work on TopicSchema:
#9481 (comment)

In fact, I believe that in my PR #9481 I took the right approach, driven by your suggestions.

I believe that this change is not enough to support my needs.

BTW, if the integration test I added to #9481 works with this patch, then we can converge on a good solution.
My goal is to get that use case to work, in the way that is best for the project in the mid/long term.

Member Author

@eolivelli Yes and no on my original comment.

My original comment was meant to make sure we return the write schema information via TopicSchema, because we use AUTO_CONSUME in the PulsarSink to indicate that GenericRecord values are published to the Pulsar topic. AUTO_CONSUME can be used by both sources and sinks. To avoid impacting sources, I didn't add the logic to TopicSchema. Instead, I added it to PulsarSink to make it more explicit, which results in a one-line change similar to your initial change. But that doesn't mean your original and current implementations are heading in the right direction.

The main problem with your previous and current implementations in #9481 is that you are trying to hijack the existing AVRO implementation to introduce support for lazy schema initialization. Lazy schema initialization is already implemented as part of multi-schema write support, so you don't need to add such a hack.
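
To make the idea of lazy schema initialization concrete, here is a toy model, under the assumption that "lazy" means no schema is attached when the producer is created and each schema is registered the first time a message carrying it is sent. `LazySchemaProducer` and its methods are hypothetical stand-ins written for this sketch; they are not the Pulsar producer and do not reflect its internals.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in, NOT the Pulsar producer: it only models
// the moment at which a schema becomes attached to the topic.
final class LazySchemaProducer {
    private final Set<String> registeredSchemas = new LinkedHashSet<>();

    // Creating the producer registers nothing, so no schema is forced on the topic.
    LazySchemaProducer() {}

    /**
     * The schema travels with each message and is registered the first
     * time it is seen. Returns true if this call registered the schema.
     * The value itself is ignored in this toy model.
     */
    boolean send(String schemaName, Object value) {
        return registeredSchemas.add(schemaName);
    }

    Set<String> registeredSchemas() {
        return registeredSchemas;
    }
}
```

With this shape, a sink that receives records of several schemas can publish them through one producer without choosing a schema up front.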

@eolivelli
Contributor

@sijie

I totally agree that the main point here is to prevent the PulsarSink from creating the Producer and forcing a Schema on the topic when the type is GenericRecord.
So I am fine with this approach as well.

If you are okay with it, I can merge this patch into my branch at #9481 (and revert the changes to TopicSchema), as we already have integration tests there and I can continue the work.
If you prefer, I can instead close my PR and let you complete your patch, but please add an integration test like mine (it basically covers my use case).

I just want to see this feature land on the master branch and be available to our users.

@sijie
Member Author

sijie commented Feb 17, 2021

I will complete the integration tests here.

@codelipenghui
Contributor

/pulsarbot run-failure-checks

Contributor

@eolivelli eolivelli left a comment

LGTM

Great work

Thanks @sijie

@zymap
Member

zymap commented Feb 23, 2021

Failed tests:

~~~~~~~ SKIPPED -- [TestClass name=class org.apache.pulsar.tests.integration.io.GenericRecordSourceTest].testGenericRecordSource([])-------------- Starting test [TestClass name=class org.apache.pulsar.tests.integration.io.GenericRecordSourceTest].testGenericRecordSource([])-------
07:19:45.283 [TestNG-method=testGenericRecordSource-1:org.apache.pulsar.tests.integration.io.GenericRecordSourceTest@112] INFO  org.apache.pulsar.tests.integration.io.GenericRecordSourceTest - Run command : /pulsar/bin/pulsar-admin sources create --name test-state-source-isolqgjy --destinationTopicName test-state-source-output-zrbylspk --archive /pulsar/examples/java-test-functions.jar --classname org.apache.pulsar.tests.integration.io.GenericRecordSource
07:19:45.288 [docker-java-stream-57079303:org.apache.pulsar.tests.integration.utils.DockerUtils$2@216] INFO  org.apache.pulsar.tests.integration.utils.DockerUtils - DOCKER.exec(ckmqvcaq-standalone:/pulsar/bin/pulsar-admin sources create --name test-state-source-isolqgjy --destinationTopicName test-state-source-output-zrbylspk --archive /pulsar/examples/java-test-functions.jar --classname org.apache.pulsar.tests.integration.io.GenericRecordSource): Executing...
07:19:47.833 [docker-java-stream-57079303:org.apache.pulsar.tests.integration.utils.DockerUtils$2@221] INFO  org.apache.pulsar.tests.integration.utils.DockerUtils - DOCKER.exec(ckmqvcaq-standalone:/pulsar/bin/pulsar-admin sources create --name test-state-source-isolqgjy --destinationTopicName test-state-source-output-zrbylspk --archive /pulsar/examples/java-test-functions.jar --classname org.apache.pulsar.tests.integration.io.GenericRecordSource): STDERR: Source class org.apache.pulsar.tests.integration.io.GenericRecordSource must be in class path
07:19:47.834 [docker-java-stream-57079303:org.apache.pulsar.tests.integration.utils.DockerUtils$2@221] INFO  org.apache.pulsar.tests.integration.utils.DockerUtils - DOCKER.exec(ckmqvcaq-standalone:/pulsar/bin/pulsar-admin sources create --name test-state-source-isolqgjy --destinationTopicName test-state-source-output-zrbylspk --archive /pulsar/examples/java-test-functions.jar --classname org.apache.pulsar.tests.integration.io.GenericRecordSource): STDERR: Reason: Source class org.apache.pulsar.tests.integration.io.GenericRecordSource must be in class path
07:19:48.163 [docker-java-stream-57079303:org.apache.pulsar.tests.integration.utils.DockerUtils$2@236] INFO  org.apache.pulsar.tests.integration.utils.DockerUtils - DOCKER.exec(ckmqvcaq-standalone:/pulsar/bin/pulsar-admin sources create --name test-state-source-isolqgjy --destinationTopicName test-state-source-output-zrbylspk --archive /pulsar/examples/java-test-functions.jar --classname org.apache.pulsar.tests.integration.io.GenericRecordSource): Done
07:19:48.167 [docker-java-stream-57079303:org.apache.pulsar.tests.integration.utils.DockerUtils$2@254] INFO  org.apache.pulsar.tests.integration.utils.DockerUtils - DOCKER.exec(ckmqvcaq-standalone:/pulsar/bin/pulsar-admin sources create --name test-state-source-isolqgjy --destinationTopicName test-state-source-output-zrbylspk --archive /pulsar/examples/java-test-functions.jar --classname org.apache.pulsar.tests.integration.io.GenericRecordSource): completed with 1
07:19:48.167 [docker-java-stream-57079303:org.apache.pulsar.tests.integration.utils.DockerUtils$2@257] ERROR org.apache.pulsar.tests.integration.utils.DockerUtils - DOCKER.exec(ckmqvcaq-standalone:/pulsar/bin/pulsar-admin sources create --name test-state-source-isolqgjy --destinationTopicName test-state-source-output-zrbylspk --archive /pulsar/examples/java-test-functions.jar --classname org.apache.pulsar.tests.integration.io.GenericRecordSource): completed with non zero return code: 1
stdout: 
stderr: Source class org.apache.pulsar.tests.integration.io.GenericRecordSource must be in class path

Reason: Source class org.apache.pulsar.tests.integration.io.GenericRecordSource must be in class path

!!!!!!!!! FAILURE-- [TestClass name=class org.apache.pulsar.tests.integration.io.GenericRecordSourceTest].testGenericRecordSource([])-------
Error:  Tests run: 5, Failures: 1, Errors: 0, Skipped: 1, Time elapsed: 206.535 s <<< FAILURE! - in TestSuite
Error:  pulsar-standalone-suite(org.apache.pulsar.tests.integration.io.GenericRecordSourceTest)  Time elapsed: 2.904 s  <<< FAILURE!
org.apache.pulsar.tests.integration.docker.ContainerExecException: /pulsar/bin/pulsar-admin sources create --name test-state-source-isolqgjy --destinationTopicName test-state-source-output-zrbylspk --archive /pulsar/examples/java-test-functions.jar --classname org.apache.pulsar.tests.integration.io.GenericRecordSource failed on 267d5fc4c5f0ad1512ae6ba8588f32236a6d72ddd840ba940a24d9d2a94de872 with error code 1
	at org.apache.pulsar.tests.integration.utils.DockerUtils$2.onComplete(DockerUtils.java:259)
	at org.testcontainers.shaded.com.github.dockerjava.core.exec.AbstrAsyncDockerCmdExec$1.onComplete(AbstrAsyncDockerCmdExec.java:51)
	at org.testcontainers.shaded.com.github.dockerjava.core.DefaultInvocationBuilder.lambda$executeAndStream$1(DefaultInvocationBuilder.java:276)
	at java.lang.Thread.run(Thread.java:748)

[INFO] 
[INFO] Results:
[INFO] 
Error:  Failures: 
Error:  org.apache.pulsar.tests.integration.io.GenericRecordSourceTest.pulsar-standalone-suite(org.apache.pulsar.tests.integration.io.GenericRecordSourceTest)
[INFO]   Run 1: PASS
Error:    Run 2: GenericRecordSourceTest.testGenericRecordSource » ContainerExec /pulsar/bin/pu...
[INFO] 
[INFO] 
Error:  Tests run: 4, Failures: 1, Errors: 0, Skipped: 0

@eolivelli
Contributor

@sijie do you have time to complete this patch?

It is very useful for a couple of use cases I have seen, and I would really like this work to land on master.
The patch has already been approved by @codelipenghui and @freeznet.

@freeznet
Contributor

@eolivelli I am working with @sijie to fix the CI failure; we will update and finish this PR soon.

@codelipenghui codelipenghui merged commit b826e03 into apache:master Mar 14, 2021
Contributor

@eolivelli eolivelli left a comment

Thank you guys for this work.
It is really a good step forward.

But I need a little more; please check my latest comment.

Thread.sleep(20);

int value = count.incrementAndGet();
GenericRecord record = schema.newRecordBuilder()
Contributor

@sijie @codelipenghui unfortunately this is not exactly like my original test case, which reproduced my use case:
being able to push an object that implements GenericRecord.

Here you are using the builder provided by Pulsar, but this is not enough for me, because my user would like to use an object from their own domain, just by implementing Pulsar's GenericRecord Java interface, since that saves resources (allocations and cycles).

https://github.com/apache/pulsar/pull/9481/files#diff-bbdf586ddad181a0a9dae17974b19ca5cbbce398716ec7fa5b4c45b69be58f41R66
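
The use case in this comment can be illustrated with a small sketch: a domain object that exposes its own state through a record-style interface, so no per-message copy into a builder-produced record is needed. `SimpleRecord` is a pared-down, hypothetical stand-in (the real Pulsar GenericRecord interface has more methods), and `Order` is an invented domain class, not one from #9481.

```java
import java.util.Arrays;
import java.util.List;

// Pared-down, hypothetical stand-in for a generic-record interface.
// This is NOT the Pulsar GenericRecord interface.
interface SimpleRecord {
    List<String> fieldNames();

    Object field(String name);
}

// An invented domain object that serves its fields directly
// instead of being copied into a separate record instance.
final class Order implements SimpleRecord {
    private final String id;
    private final int quantity;

    Order(String id, int quantity) {
        this.id = id;
        this.quantity = quantity;
    }

    @Override
    public List<String> fieldNames() {
        return Arrays.asList("id", "quantity");
    }

    @Override
    public Object field(String name) {
        switch (name) {
            case "id":
                return id;
            case "quantity":
                return quantity; // boxed to Integer
            default:
                return null; // unknown field
        }
    }
}
```

Whether implementing the interface directly is supported is exactly what this thread goes on to discuss; the sketch only shows what the request amounts to.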

Member Author

GenericRecord was designed to be built from RecordSchemaBuilder. People are not expected to implement GenericRecord directly. I don't understand the allocations and cycles issue. If there is an allocations or cycles issue, it should be fixed in RecordSchemaBuilder; it shouldn't be solved just by implementing GenericRecord.

Contributor

@sijie okay
I will check downstream if using RecordSchemaBuilder is a valid option.
thanks

fmiguelez pushed a commit to fmiguelez/pulsar that referenced this pull request Mar 16, 2021
eolivelli pushed a commit to datastax/pulsar that referenced this pull request May 26, 2021
5 participants