Add Protocol Buffer Stream Decoder #8972
62484e1 to 96bb773
Codecov Report
@@             Coverage Diff              @@
##             master    #8972      +/-   ##
============================================
- Coverage     68.56%   61.35%    -7.22%
+ Complexity     4640     4523      -117
============================================
  Files          1741     1828       +87
  Lines         91475    97090     +5615
  Branches      13674    14641      +967
============================================
- Hits          62724    59571     -3153
- Misses        24363    33078     +8715
- Partials       4388     4441       +53
...tobuf/src/main/java/org/apache/pinot/plugin/inputformat/protobuf/ProtoBufMessageDecoder.java
@Override
public void init(Map<String, String> props, Set<String> fieldsToRead, String topicName)
    throws Exception {
  Preconditions.checkState(props.containsKey(DESCRIPTOR_FILE_PATH),
I am still not clear on what a descriptor file represents after reading https://github.com/os72/protobuf-dynamic and viewing sample.desc. Is it a binary file generated by the proto compiler?
Yes. The .proto file contains the schema. From this schema, you can either generate Java classes or generate a binary .desc file. The latter approach lets us parse any proto message without relying on generated Java classes. It is a bit slower, though.
The slowness should only be a one-time overhead, right? We do not decode .desc files on every message.
The slowness is not due to decoding the .desc file. It comes from using the DynamicMessage class instead of a compiled proto Java class to deserialize the messages.
I ran some benchmarks on this a few years back: https://codeburst.io/using-dynamic-messages-in-protocol-buffers-in-scala-9fda4f0efcb3
These results are from my machine with the proto version we are using. They show significant improvements over the numbers in the blog post.
Benchmark                                             (_numRecords)  Mode  Cnt   Score    Error  Units
BenchmarkDynamicMessage.compiledClassDeserialization         100000  avgt    5  12.787 ±  0.297  ms/op
BenchmarkDynamicMessage.dynamicClassDeserialization          100000  avgt    5  28.150 ±  0.691  ms/op
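To put the two scores side by side, here is the arithmetic worked out. The ratio comes straight from the reported scores; the per-record figures additionally assume that one benchmark operation deserializes all 100000 records, which is inferred from the _numRecords parameter and not stated in the thread.

$$
\frac{28.150\ \text{ms/op}}{12.787\ \text{ms/op}} \approx 2.20
$$

$$
\frac{12.787\ \text{ms}}{10^{5}\ \text{records}} \approx 0.13\ \mu\text{s/record},
\qquad
\frac{28.150\ \text{ms}}{10^{5}\ \text{records}} \approx 0.28\ \mu\text{s/record}
$$

Under that assumption, DynamicMessage deserialization is roughly 2.2 times slower than the compiled class in this run, at well under a microsecond per record either way.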
Overall lgtm.
adaa65e to f88c7cf
(cherry picked from commit 8806dc3)
Allows users to ingest protocol buffers in realtime streams.
Supported configs
descriptorFile - Path of the descriptor file. You can generate this file with the command protoc -o file.desc --include_imports file.proto. The path can be a local path, in which case the file must be available on all servers, or a DFS path such as S3 or GCS. If you provide a DFS path, make sure Pinot is configured to use that filesystem.
protoClassName - If the descriptor file contains multiple proto objects, the name of the class to use for parsing.
Example
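A minimal sketch of how the two configs above might look as the props map that the decoder's init(Map<String, String> props, ...) receives. The key names come from the list above. The file path, the message name MyMessage, the class name ProtoBufDecoderPropsSketch, and the assumption that DESCRIPTOR_FILE_PATH resolves to "descriptorFile" are all hypothetical. The required-key check imitates the Preconditions.checkState call in the reviewed snippet using plain Java, and is not the decoder's actual code.

```java
import java.util.HashMap;
import java.util.Map;

public class ProtoBufDecoderPropsSketch {
  // Key names as listed under "Supported configs" in this PR description.
  static final String DESCRIPTOR_FILE = "descriptorFile";
  static final String PROTO_CLASS_NAME = "protoClassName";

  /** Builds an example props map; both values are placeholders, not real paths or types. */
  static Map<String, String> buildProps() {
    Map<String, String> props = new HashMap<>();
    // Hypothetical local path; a DFS URI would go here instead if Pinot is configured for it.
    props.put(DESCRIPTOR_FILE, "/path/to/file.desc");
    // Only relevant when the descriptor file holds more than one message type.
    props.put(PROTO_CLASS_NAME, "MyMessage");
    return props;
  }

  /** Fails fast when the descriptor file path is missing, similar in spirit to the reviewed init() check. */
  static void validate(Map<String, String> props) {
    if (!props.containsKey(DESCRIPTOR_FILE)) {
      throw new IllegalStateException("Protocol Buffer descriptor file path must be provided");
    }
  }

  public static void main(String[] args) {
    Map<String, String> props = buildProps();
    validate(props);
    System.out.println(DESCRIPTOR_FILE + " = " + props.get(DESCRIPTOR_FILE));
    System.out.println(PROTO_CLASS_NAME + " = " + props.get(PROTO_CLASS_NAME));
  }
}
```

How these properties are wired into a table's stream config is not shown in this PR description, so the sketch stops at the map the decoder sees.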
For future PRs