New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
THRIFT-5443: add support for partial Thrift deserialization #2439
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I admit that I did not run any tests, just from what I saw from looking at the code and the README.
That would be awesome:
"TBD: add additional information on defining map keys and values."
I have a strong feeling this will be among the first questions people are going to ask ;-)
Also, did you run tests against "fuzzy" input data? That's becoming kind of standard these days, given an extension like this that should probably be done before we merge.
Updated README.md file to add clarification on map field path syntax. I have not done fuzzy testing. However, the code includes in depth testing. In addition, the code is in active use in production jobs at Pinterest. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Java is also not my bread and butter, so my feedback here are more focused on the API than the implementation itself.
I think overall the design makes sense. To map it to go (my bread and butter), some heavy reflect
usage will be needed (we might even need some compiler changes to support it), but go's reflect
implementation is in general efficient so I'm not too worried about that. An alternative to the string version of fully qualified field definition is to use the numeric ids, but that doesn't really make the implementation easier (at least not for go), and string version is more readable.
// Having to call readFieldBegin() to compute TFieldData really results in lower | ||
// performance. However, readFieldBegin() accesses some private vars that this method | ||
// does not have access to. We could make it more performant when contributing to | ||
// origianl source code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are contributing to the original source code here, so maybe this comment should be updated accordingly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a TODO comment for now. That way we can better separate original contribution from subsequent performance improvement changes.
* @return deserialized instance. | ||
* @throws TException if an error is encountered during deserialization. | ||
*/ | ||
public Object deserialize(byte[] bytes) throws TException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this return TBase
, or take TBase
as an arg and return void
instead? that's how the current TDeserializer
interface is defined.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It returns Object on purpose. See the class header comment as well as the README.md file for more info.
This arrangement allows non-TBase values to be directly instantiated from serialized Thrift objects. One example of its usefulness is Spark engine reading SequenceFile of Thrift objects. In that case, this arrangement allows SparkThriftRowProcessor (not a part of this change) to directly deserialize a serialized Thrift blob into Spark's InternalRow representation. That enables significant performance boost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not convinced that this API need to be so dramatically different from the existing TDeserializer API (https://www.javadoc.io/doc/org.apache.thrift/libthrift/latest/org/apache/thrift/TDeserializer.html), can you try to match that as much as possible? The current TDeserializer API in Java already provided partiallyDeserialize*
functions.
I'm really worried about how this API will map to other languages, even if it makes sense for the Java case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @fishy,
Happy to make changes if I get clarity on the concerns you have and on the type of changes you are looking for.
Based on your comments I believe you prefer adding functionality to existing interfaces instead of adding new ones (please confirm since this is my guess). If so, how does the following work for you?
- move common skip*() methods from PartialThriftProtocol.java to TProtocol.java.
- move encoding specific skip*() methods from PartialThriftProtocol.java to TProtocol (for example, from PartialThriftBinaryProtocol.java to TBinaryProtocol)
- get rid of PartialThriftDeserializer.java and add its methods to TDeserializer
Would something like the above address your concerns?
lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java
Outdated
Show resolved
Hide resolved
|
||
### Partial deserialization | ||
|
||
Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this is where you explained the API difference from TDeserialize
.
Is there a real use case of deserializing a non-TBase
, though? My main concern is that to map this API to other languages, Object
would be interface{}
in go, and we really want to avoid interface{}
in go as much as possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is very useful in certain cases (see earlier comment). ThriftStructProcessor enables direct deserialization into TBase.
What are your thoughts on providing a simpler wrapper over ThriftStructProcessor and PartialThriftDeserializer that directly supports deserializing into TBase? That way, not every language would need to expose the ability to deserialize into non-TBase.
- PartialThriftBinaryProtocol.java | ||
- PartialThriftCompactProtocol.java | ||
|
||
This component implements efficient skipping over fields that need not be deserialized. Note that this skipping is more efficient compared to that achieved by using `TProtocolUtil.skip()`. The latter calls the corresponding `read()`, allocates and initializes certain values (for example, strings) and then discards the returned value. In comparison, `PartialThriftProtocol` skips a field by incrementing internal offset into the transport buffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this is true that the existing skip
implementations all need to read the field wholly then discard it, they don't necessarily have to be this way. I would prefer to replace the existing TProtocolUtil.skip
with the more efficient implementation, than creating a new, separated API for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would have been great. However, TProtocolUtil is protocol agnostic and the specifics of certain types of skip differ between compact and binary protocol. That is why it needs to happen at protocol level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current TProtocolUtil.skip
requires a TProtocol
arg (https://www.javadoc.io/static/org.apache.thrift/libthrift/0.11.0/org/apache/thrift/protocol/TProtocolUtil.html), so you can totally then delegate to the concrete TProtocol to decide how to skip that field in the most efficient way. (for example, in Go we already have Skip
function defined in the TProtocol
interface, so it can be different from read*
. we can probably do something similar in java.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it. basically you are recommending moving skip*() methods directly into T*Protocol. can you please confirm?
addressed review comments |
…ft into kpandit/thrift-partial
@fishy Thanks for reviewing! Can you please provide clarification on the questions? |
@bhalchandrap for skip: yes please make For partial deserialization, I don't know which makes more sense in Java API. If you can totally use the existing API to implement it that would be awesome (I imagine you at least need to add new constructors to |
@fishy thanks for the earlier suggestions. I applied all of them. Can you please re-review? |
public Object deserialize(byte[] bytes, int offset, int length) throws TException { | ||
ensurePartialDeserializationMode(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already have a deserialize
function with very similar signature:
public void deserialize(TBase base, byte[] bytes, int offset, int length) throws TException
adding this function will make it really confusing to users on what to call. my suggestions are:
- Maybe rename this one to
deserializeObject
to emphasize that it returnsObject
instead of writes toTBase
. - Why does this only support partial mode (it throws a runtime exception when the
TDeserializer
was not consutrcted with partial mode), while there's nothing in its name or even javadoc mentioning that? that sounds like a huge footgun. Is there even a reason this should callensurePartialDeserializationMode
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry that was an oversight. The method should have been called partialDeserializeObject() like its other overload. fixed. That should address both #1 and #2 points above. The name clarifies two intents:
- it performs partial deserialization
- it returns an Object
* @param processor the Processor that handles deserialized field values. | ||
* @param protocolFactory the Factory to create a protocol. | ||
*/ | ||
public TDeserializer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(maybe it just works, I'm not familiar with the java codebase and this is not obvious to me)
I don't see you changing any of the existing public functions of TDeserializer
, for example:
public void deserialize(TBase base, byte[] bytes) throws TException
public void deserialize(TBase base, byte[] bytes, int offset, int length) throws TException
public void deserialize(TBase base, java.lang.String data, java.lang.String charset) throws TException
Since you are adding new constructors to TDeserializer
, you need to ensure that those existing public functions still work as intended. When someone uses this constructor to create a TDeserializer
, then calls deserialize(TBase base, byte[] bytes)
, they would expect that deserialize
function does partial deserialization, not the full one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your observation is correct and is intentional. I did not want to change the behavior of existing methods as it can cause backward compatibility issues. With that in mind, if one calls any of the existing deserialize() methods, they will continue to get the existing behavior. One will need to call partialDeserilizeObject() to perform partial deserialization.
If you strongly believe that we need to change the behavior of existing methods, I can make it happen. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your observation is correct and is intentional. I did not want to change the behavior of existing methods as it can cause backward compatibility issues. With that in mind, if one calls any of the existing deserialize() methods, they will continue to get the existing behavior. One will need to call partialDeserilizeObject() to perform partial deserialization.
If you strongly believe that we need to change the behavior of existing methods, I can make it happen. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree. Imagine this PR is merged and some developer wrote this code:
TDeserializer deserializer = new TDeserializer(thriftClass, fieldNames, processor, protocolFactory);
deserializer.deserialize(base, buf);
Do you expect it to do full or partial deserialization? I would say that as deserializer
here is constructed via a constructor that does partial deserialization, I would expect it to only do partial deserializations.
protected void skipBool() throws TException { | ||
this.skipBytes(1); | ||
} | ||
|
||
protected void skipByte() throws TException { | ||
this.skipBytes(1); | ||
} | ||
|
||
protected void skipI16() throws TException { | ||
this.skipBytes(2); | ||
} | ||
|
||
protected void skipI32() throws TException { | ||
this.skipBytes(4); | ||
} | ||
|
||
protected void skipI64() throws TException { | ||
this.skipBytes(8); | ||
} | ||
|
||
protected void skipDouble() throws TException { | ||
this.skipBytes(8); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the default implementations should use hardcoded number of bytes. the default implementations should just call read*
and discard the result. this guarantees correctness, and leaves individual protocols the room to override the default implementation to improve performance. Using hardcoded number of bytes cannot guarantee the correctness.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had other aspect in mind. That is, have the derived protocols override when they do not do what the base does. If I call read() for each skip() then each of the derived protocols would want to override all of the methods.
Happy to do what you suggest. Can you please confirm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is fine, but then you need to fix all the TProtocol
implementations that use different bytes here, which includes TJSONProtocol
and TSimpleJSONProtocol
, and you did not "fix" them.
TProtocol
is not some thing "either binary or compact". there are a lot more of them, and the fixed numbers are only correct for TBInaryProtocol
. So it really should be the other way around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I did not know about additional protocols. I updated default implementation of all skip methods to call read and moved specific byte skipping to binary protocol. It'd be great if someone who understands the other protocols better provide more efficient skip implementation for them.
@fishy thanks for the careful code review and the suggestions. I applied your suggestions. Can you please re-review? I will add tests corresponding to the new code once the overall structure looks ok to you. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for the delayed review, I think the API design in the current state is good enough.
but still I'd wait for someone more familiar with the java code to review it.
@@ -82,12 +147,16 @@ public void deserialize(TBase base, byte[] bytes) throws TException { | |||
* @throws TException if an error is encountered during deserialization. | |||
*/ | |||
public void deserialize(TBase base, byte[] bytes, int offset, int length) throws TException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we probably need to update the javadoc of this function accordingly.
@Jens-G can you please let me know if I can do anything from my side to make progress on this pull request? |
If there are no objections and the CI is green again I would merge this in the next days. |
Next time please either squash the PR or at least rebase it onto master every now and then. |
Certainly. This was the first time I contributed to OSS. It was a good
learning experience for me.
…On Fri, Nov 19, 2021 at 4:01 PM Jens Geyer ***@***.***> wrote:
Next time please either squash the PR or at least rebase it onto master
every now and then.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#2439 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ATA7ZBX7PVQ4PZXUGM4RCEDUM3QNVANCNFSM5CILR3NQ>
.
Triage notifications on the go with GitHub Mobile for iOS
<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675>
or Android
<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
Source files: | ||
- PartialThriftProtocol.java | ||
- PartialThriftBinaryProtocol.java | ||
- PartialThriftCompactProtocol.java |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these source files have been removed,the document should be updated。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry that was an oversight. I will update the document asap and send another PR.
Client: java
Adds support for partial Thrift deserialization in Java.
More information is available at these locations:
blog: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
in source: lib/java/src/org/apache/thrift/partial/README.md
[skip ci]
anywhere in the commit message to free up build resources.