Make StreamMessage generic and a bug fix #9544
Conversation
@sajjad-moradi and @mcvsubbu please review.
Tagging @navina to take a look as well.
_key = key;
_value = value;
_metadata = metadata;
_length = length;
Could you please update the javadoc for the new field?
@vvivekiyer I had intentionally tried to stay away from using
I am curious why LinkedIn does this. Isn't it expensive to deserialize every record before the deserialized payload is actually needed? The consumer's contract should not involve deserializing the payload. Can you please explain why this is useful? Using generics forces the segment manager implementation to deal with raw usage of parameterized classes (due to type erasure) and makes the code hard to read and maintain. Soliciting feedback from @npawar / @Jackie-Jiang / @kishoreg here. Thanks!
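To make the trade-off being debated concrete, here is a minimal sketch of the two message contracts. These are simplified, hypothetical stand-ins, not the actual Pinot classes:

```java
// Hypothetical sketch of the two contracts under discussion;
// simplified stand-ins, not the actual Pinot classes.

// Generic form: the consumer can hand back a payload of any type T
// (e.g. an already-deserialized record).
class StreamMessage<T> {
    private final byte[] _key;
    private final T _value;

    StreamMessage(byte[] key, T value) {
        _key = key;
        _value = value;
    }

    T getValue() {
        return _value;
    }
}

// byte[]-only form: every consumer must hand back raw bytes, so a
// consumer that already holds a deserialized record must re-serialize it.
class ByteStreamMessage {
    private final byte[] _key;
    private final byte[] _value;

    ByteStreamMessage(byte[] key, byte[] value) {
        _key = key;
        _value = value;
    }

    byte[] getValue() {
        return _value;
    }
}
```

The generic form keeps type information at the call site; the cost Navina points out is that consumers of `StreamMessage<T>` (such as the segment manager) must either propagate the type parameter or fall back to raw types because of erasure.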
Thank you for fixing this! I recently noticed this in my testing.
Another question: If LinkedIn is using
@navina I've tried to answer the questions below. Please take a look and let me know what you think.
Yes. LinkedIn has a custom Kafka consumer implementation.
LinkedIn's Kafka consumer directly fetches the deserialized payload. AFAIK, LinkedIn Kafka has a schema registry where the payload's schema is registered, so they provide (optimized) deser and do not allow clients to have their own deserialization. @sajjad-moradi to add more details, if any. Just to give more clarity about LinkedIn's custom implementation of interfaces:
I meant additional serialization and deserialization. Edited the description.
As per my understanding of the code, SegmentManager only deals with GenericRow once the deserialization is done. Depending on various implementations,
I agree with this part. We can discuss and arrive at the best way to do this. But IMO, forcing
This is my understanding of our OSS code prior to #9224:
Note that MessageBatch is a generic interface because users of Pinot are free to use their custom kafka (or other) client implementations that could return messages in any format. After #9224, the code looks as follows:
Looking at the above, it looks like we've introduced a new step (2), where we are forcing messages of generic type (MessageBatch) to be serialized to
I'm merging this PR as the issue it fixes has broken LinkedIn's build. As Vivek mentioned, we can discuss short term & long term fixes if needed.
@vvivekiyer Not sure how much value a discussion can offer if the PR has already been merged. But here is my take on this:
I understand the flexibility that this generic MessageBatch provides. But we want a stronger interface contract so that developing a plugin becomes trivial and streamlined. Features we can add:
Yes. that was the whole point of changing to the new code. I was trying to de-couple the "decoding" of a message from "fetching" of a message from the stream.
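The decoupling described here can be sketched as two narrow interfaces, one per concern. The interface and class names below are invented for illustration, not the actual Pinot plugin SPI:

```java
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the decoupling: fetching is only about pulling
// raw messages off the stream; decoding is a separate, pluggable step.
interface Fetcher {
    Iterator<byte[]> fetchBatch();  // knows nothing about the payload format
}

interface Decoder<T> {
    T decode(byte[] payload);  // knows nothing about the transport
}

// Trivial in-memory stand-in for a stream consumer.
class InMemoryFetcher implements Fetcher {
    private final List<byte[]> _messages;

    InMemoryFetcher(List<byte[]> messages) {
        _messages = messages;
    }

    public Iterator<byte[]> fetchBatch() {
        return _messages.iterator();
    }
}

// One possible pluggable decoder.
class Utf8Decoder implements Decoder<String> {
    public String decode(byte[] payload) {
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

With this split, swapping the decoder never touches the fetcher, which is the streamlining benefit claimed; the open question in the thread is what this costs consumers whose client library never exposes raw bytes in the first place.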
IIRC, LinkedIn has multiple client libraries and I am fairly certain that almost all of them, except
Can you help me understand why this approach cannot be used by LinkedIn Pinot?
@navina @vvivekiyer @sajjad-moradi after going through the history, I see that more discussion is warranted here to ensure we agree on the right design going forward. Given that the PR was already merged and there are open questions, could we get to consensus quickly? I'd like to avoid a situation where the right design is not aligned with the merged PR, but the PR stays in the system long enough to make it harder to fix cleanly. Please let me know how I can help.
Taking this discussion to slack to get faster resolution. cc: @mayankshriv @navina @sajjad-moradi |
With PR #9224, `StreamMessage` can only accept the `byte[]` datatype for values. However, the `MessageBatch` interface is generic and lets users implement a class with a custom type. For example, LinkedIn uses `MessageBatch<IndexedRecord>` for its Kafka client. Keeping `StreamMessage` generic will avoid unnecessary serializing and deserializing for such users.

Also fixed an NPE in `StreamDataDecoderImpl.java` where the metadata header is null.

Added tests to `StreamMessageTest.java`. Also tested the changes on a cluster.
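The NPE fix amounts to a null guard when reading the metadata header map. A hedged sketch of the shape of the fix (method and prefix names are invented, not the exact `StreamDataDecoderImpl` code):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the null guard described in the PR; names are
// invented, not the exact StreamDataDecoderImpl implementation.
class HeaderExtractor {
    // Copies header key/values into row fields, tolerating a null header
    // map -- the case that previously triggered the NPE.
    static Map<String, Object> extractHeaders(Map<String, Object> headers) {
        Map<String, Object> fields = new HashMap<>();
        if (headers != null) {  // guard: header metadata may be absent
            for (Map.Entry<String, Object> e : headers.entrySet()) {
                fields.put("header$" + e.getKey(), e.getValue());
            }
        }
        return fields;
    }
}
```

A caller can then treat "no headers" and "empty headers" identically instead of crashing on the former.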