
add protobuf inputformat #11018

Merged: 10 commits merged into apache:master on Apr 13, 2021
Conversation

@bananaaggle (Contributor) commented Mar 20, 2021:

Because parseSpec is deprecated, I developed ProtobufInputFormat for the new interface, which supports stream ingestion of data encoded with Protobuf.
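For context, a stream ingestion spec would reference the new format roughly like this (illustrative only; the type name and field names follow the docs as discussed later in this thread, and the descriptor path is a placeholder):

```json
"inputFormat": {
  "type": "protobuf",
  "protoBytesDecoder": {
    "type": "file",
    "descriptor": "file:///tmp/metrics.desc",
    "protoMessageType": "Metrics"
  },
  "flattenSpec": {
    "useFieldDiscovery": true,
    "fields": []
  }
}
```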



This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@bananaaggle (Contributor, Author):

@clintropolis Hi, I created ProtobufInputFormat following your suggestion. I haven't used this interface before, so I'm not very familiar with it. Can you review my code and tell me whether this implementation meets the requirements? If this implementation is correct, I will add more unit tests. By the way, where in the documentation should I describe this feature?

@bananaaggle (Contributor, Author):

@clintropolis I reviewed the Avro inputFormat code and learned that it only supports batch ingestion jobs. Why don't we support stream ingestion jobs? I think it's not very hard to implement, and I'd be glad to do it.

@clintropolis (Member):

> @clintropolis Hi, I created ProtobufInputFormat following your suggestion. I haven't used this interface before, so I'm not very familiar with it. Can you review my code and tell me whether this implementation meets the requirements? If this implementation is correct, I will add more unit tests. By the way, where in the documentation should I describe this feature?

Thanks! I will have a look this weekend. I think https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md is the appropriate place to document the new InputFormat (I guess we also forgot to update the protobuf section of this in the last PR, https://github.com/apache/druid/blob/master/docs/ingestion/data-formats.md#protobuf-parser)

> @clintropolis I reviewed the Avro inputFormat code and learned that it only supports batch ingestion jobs. Why don't we support stream ingestion jobs? I think it's not very hard to implement, and I'd be glad to do it.

👍 The only reason streaming Avro isn't supported yet is basically the same reason it wasn't done for Protobuf, simply that no one has done the conversion. I think it would be great if you would like to take that on, especially since I think Avro and Protobuf (until this PR) are the only "core" extensions that do not yet support InputFormat. It would make ingestion be consistent for native batch and streaming, and be much appreciated!


@clintropolis (Member) left a comment:

Apologies for the delay in having a look, overall lgtm 👍

```diff
@@ -37,7 +37,8 @@
     return Collections.singletonList(
         new SimpleModule("ProtobufInputRowParserModule")
             .registerSubtypes(
-                new NamedType(ProtobufInputRowParser.class, "protobuf")
+                new NamedType(ProtobufInputRowParser.class, "protobuf"),
+                new NamedType(ProtobufInputFormat.class, "protobuf_format")
```
@clintropolis (Member):

I think this could just be `protobuf`, the same as the parser name, since they are separate interfaces.
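In other words, both registrations can share the name (a sketch of the suggested change; Jackson resolves registered subtype names per polymorphic base type, so the InputRowParser and InputFormat entries don't collide):

```java
new SimpleModule("ProtobufInputRowParserModule")
    .registerSubtypes(
        // The same type name is fine: InputRowParser and InputFormat are
        // separate polymorphic hierarchies, so Jackson resolves "protobuf"
        // independently for each base type.
        new NamedType(ProtobufInputRowParser.class, "protobuf"),
        new NamedType(ProtobufInputFormat.class, "protobuf")
    );
```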

Comment on lines 70 to 72
```java
return CloseableIterators.withEmptyBaggage(
    Iterators.singletonIterator(JsonFormat.printer().print(protobufBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open())))
)));
```
@clintropolis (Member):

The InputRowParser implementation for protobuf has an optimization that skips the conversion to JSON if a flattenSpec is not defined (see #9999), since the overhead of converting in order to flatten can slow input processing a fair bit (per the numbers in that PR).

To retain this, it might make sense to make the intermediary format be ByteBuffer or byte[], and handle the cases of having a flattenSpec or not separately. I think these could probably be done within this same class, just making parseInputRows behave differently for each situation, and it maybe makes sense to use JSON conversion for the toMap method (it is used by InputSourceSampler for the sampler API).
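Concretely, the split might look something like this (a rough sketch only; `toPlainJavaMap` and `parseViaJsonFlattening` are hypothetical helper names, not the PR's actual code):

```java
// Hypothetical sketch: branch on whether a flattenSpec is configured.
private List<InputRow> parseInputRows(DynamicMessage message) throws IOException
{
  if (flattenSpec == null) {
    // Fast path (the #9999 optimization): build the row directly from the
    // decoded message, skipping the protobuf -> JSON -> flatten round trip.
    return Collections.singletonList(
        MapInputRowParser.parse(inputRowSchema, toPlainJavaMap(message))
    );
  }
  // A flattenSpec needs the JSON form so its JSONPath expressions can run.
  return parseViaJsonFlattening(JsonFormat.printer().print(message));
}
```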

@bananaaggle (Contributor, Author):

@clintropolis Thanks for your review. I didn't know about this feature before; changing the code for this optimization is a good idea. Do you have any other suggestions about my code? If there are no more concerns, I'll add unit tests and documentation.

@clintropolis (Member):

> Do you have any other suggestions about my code? If there are no more concerns, I'll add unit tests and documentation.

Everything else looked good to me 👍

@bananaaggle (Contributor, Author) commented Mar 28, 2021:

@clintropolis I've added documentation for this feature and supplemented the documentation for the last commit. Can you review those docs and help me refine them? And what else do you think I should add for unit tests?

@clintropolis (Member) left a comment:

docs are almost there I think, 👍

As far as additional testing goes, I guess this covers most of the InputFormat stuff, and a lot of the other stuff, such as the decoding logic itself, is already covered by other tests.

I guess serialization/deserialization tests on ProtobufInputFormat would be good to make sure that JSON requests work as expected when submitting tasks/supervisor specs. Beyond that, integration tests would probably be the most useful, but that can be done in a separate PR since there is going to be a bit of setup to do to get that working.
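Such a serde test is typically a Jackson round trip along these lines (a sketch; it assumes a test ObjectMapper with the extension's module registered so the type name resolves, plus fixture values for the two fields):

```java
@Test
public void testSerde() throws IOException
{
  // flattenSpec and protobufBytesDecoder are whatever fixtures the test sets up.
  ProtobufInputFormat inputFormat = new ProtobufInputFormat(flattenSpec, protobufBytesDecoder);

  String json = jsonMapper.writeValueAsString(inputFormat);
  InputFormat fromJson = jsonMapper.readValue(json, InputFormat.class);

  // Relies on a proper equals() implementation on ProtobufInputFormat.
  Assert.assertEquals(inputFormat, fromJson);
}
```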

Comment on lines 75 to 77
```java
return CloseableIterators.withEmptyBaggage(
    Iterators.singletonIterator(protobufBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open())
))));
```
@clintropolis (Member) commented Mar 30, 2021:

nit: formatting is sort of funny here

Suggested change:

```diff
-return CloseableIterators.withEmptyBaggage(
-    Iterators.singletonIterator(protobufBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open())
-))));
+return CloseableIterators.withEmptyBaggage(
+    Iterators.singletonIterator(protobufBytesDecoder.parse(ByteBuffer.wrap(IOUtils.toByteArray(source.open()))))
+);
```

```
@@ -1104,6 +1142,83 @@ Sample spec:
See the [extension description](../development/extensions-core/protobuf.md) for
more details and examples.

#### Protobuf Bytes Decoder

If `type` is not included, the avroBytesDecoder defaults to `schema_registry`.
```
@clintropolis (Member):

Suggested change:

```diff
-If `type` is not included, the avroBytesDecoder defaults to `schema_registry`.
+If `type` is not included, the `protoBytesDecoder` defaults to `schema_registry`.
```
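For reference, a minimal `schema_registry` decoder config might look like this (illustrative; the registry URL is a placeholder):

```json
"protoBytesDecoder": {
  "type": "schema_registry",
  "url": "http://localhost:8081"
}
```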


| Field | Type | Description | Required |
|-------|------|-------------|----------|
|type| String| This should be set to `protobuf` to read Protobuf file| yes |
@clintropolis (Member):

Since I guess this is more stream oriented, maybe:

Suggested change:

```diff
-|type| String| This should be set to `protobuf` to read Protobuf file| yes |
+|type| String| This should be set to `protobuf` to read Protobuf serialized data| yes |
```

or "to read Protobuf formatted data" or similar.

```json
    }
  ]
},
"binaryAsString": false
```
@clintropolis (Member):

I think this is an Avro/Parquet/ORC parameter

@bananaaggle (Contributor, Author):

I've updated the document and added a serialization test. What are the integration tests you mentioned above? Can you give me an example of how to implement them? @clintropolis

@lgtm-com (bot) commented Mar 30, 2021:

This pull request introduces 2 alerts when merging 1751a3f into 6789ed0 - view on LGTM.com

new alerts:

  • 2 for Inconsistent equals and hashCode
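That alert class usually means a class overrides equals without hashCode, or vice versa; the conventional fix is the standard pair. A generic sketch against the format's two fields (field names assumed from the discussion above, not copied from the PR):

```java
@Override
public boolean equals(Object o)
{
  if (this == o) {
    return true;
  }
  if (o == null || getClass() != o.getClass()) {
    return false;
  }
  ProtobufInputFormat that = (ProtobufInputFormat) o;
  return Objects.equals(flattenSpec, that.flattenSpec)
         && Objects.equals(protobufBytesDecoder, that.protobufBytesDecoder);
}

@Override
public int hashCode()
{
  // Must hash exactly the fields that equals() compares.
  return Objects.hash(flattenSpec, protobufBytesDecoder);
}
```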

@bananaaggle (Contributor, Author):

@clintropolis Hi, I added a unit test for serialization/deserialization when using the schema registry. As for the integration test, should I add it in this PR, or in another PR after this one is merged?

@clintropolis (Member):

> As for the integration test, should I add it in this PR, or in another PR after this one is merged?

I think it would be fine to do the integration test as a follow-up, since it is a bit more involved than the Avro input format due to there not being an existing integration test for protobuf.

Adding it would be pretty similar to what is already there for Avro and the other formats, with the main work being implementing EventSerializer for the protobuf input format so that test data can be written to the stream, and adding all the JSON template resources so that it gets added to the "data format" test group.

@clintropolis (Member) left a comment:

lgtm 👍

clintropolis merged commit 0e0c1a1 into apache:master on Apr 13, 2021
clintropolis added this to the 0.22.0 milestone on Aug 12, 2021