Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

THRIFT-5443: add support for partial Thrift deserialization #2439

Closed
wants to merge 22 commits into from

Conversation

bhalchandrap
Copy link
Contributor

Client: java

Adds support for partial Thrift deserialization in Java.
More information is available at these locations:
blog: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
in source: lib/java/src/org/apache/thrift/partial/README.md

  • Did you create an Apache Jira ticket? (not required for trivial changes)
  • If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
  • Did you squash your changes to a single commit? (not required, but preferred)
  • Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
  • If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

@Jens-G Jens-G self-requested a review August 22, 2021 16:39
Copy link
Member

@Jens-G Jens-G left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I admit that I did not run any tests, just from what I saw from looking at the code and the README.

That would be awesome:

"TBD: add additional information on defining map keys and values."
I have a strong feeling this will be among the first questions people are going to ask ;-)

Also, did you run tests against "fuzzy" input data? That's becoming kind of standard these days, given an extension like this that should probably be done before we merge.

@bhalchandrap
Copy link
Contributor Author

Updated README.md file to add clarification on map field path syntax.

I have not done fuzzy testing. However, the code includes in depth testing. In addition, the code is in active use in production jobs at Pinterest.

Copy link
Member

@fishy fishy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Java is also not my bread and butter, so my feedback here are more focused on the API than the implementation itself.

I think overall the design makes sense. To map it to go (my bread and butter), some heavy reflect usage will be needed (we might even need some compiler changes to support it), but go's reflect implementation is in general efficient so I'm not too worried about that. An alternative to the string version of fully qualified field definition is to use the numeric ids, but that doesn't really make the implementation easier (at least not for go), and string version is more readable.

// Having to call readFieldBegin() to compute TFieldData really results in lower
// performance. However, readFieldBegin() accesses some private vars that this method
// does not have access to. We could make it more performant when contributing to
// origianl source code.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are contributing to the original source code here, so maybe this comment should be updated accordingly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added a TODO comment for now. That way we can better separate original contribution from subsequent performance improvement changes.

* @return deserialized instance.
* @throws TException if an error is encountered during deserialization.
*/
public Object deserialize(byte[] bytes) throws TException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this return TBase, or take TBase as an arg and return void instead? that's how the current TDeserializer interface is defined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It returns Object on purpose. See the class header comment as well as the README.md file for more info.
This arrangement allows non-TBase values to be directly instantiated from serialized Thrift objects. One example of its usefulness is Spark engine reading SequenceFile of Thrift objects. In that case, this arrangement allows SparkThriftRowProcessor (not a part of this change) to directly deserialize a serialized Thrift blob into Spark's InternalRow representation. That enables significant performance boost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not convinced that this API need to be so dramatically different from the existing TDeserializer API (https://www.javadoc.io/doc/org.apache.thrift/libthrift/latest/org/apache/thrift/TDeserializer.html), can you try to match that as much as possible? The current TDeserializer API in Java already provided partiallyDeserialize* functions.

I'm really worried about how this API will map to other languages, even if it makes sense for the Java case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @fishy,
Happy to make changes if I get clarity on the concerns you have and on the type of changes you are looking for.

Based on your comments I believe you prefer adding functionality to existing interfaces instead of adding new ones (please confirm since this is my guess). If so, how does the following work for you?

  1. move common skip*() methods from PartialThriftProtocol.java to TProtocol.java.
  2. move encoding specific skip*() methods from PartialThriftProtocol.java to TProtocol (for example, from PartialThriftBinaryProtocol.java to TBinaryProtocol)
  3. get rid of PartialThriftDeserializer.java and add its methods to TDeserializer

Would something like the above address your concerns?


### Partial deserialization

Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this is where you explained the API difference from TDeserialize.

Is there a real use case of deserializing a non-TBase, though? My main concern is that to map this API to other languages, Object would be interface{} in go, and we really want to avoid interface{} in go as much as possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is very useful in certain cases (see earlier comment). ThriftStructProcessor enables direct deserialization into TBase.

What are your thoughts on providing a simpler wrapper over ThriftStructProcessor and PartialThriftDeserializer that directly supports deserializing into TBase? That way, not every language would need to expose the ability to deserialize into non-TBase.

- PartialThriftBinaryProtocol.java
- PartialThriftCompactProtocol.java

This component implements efficient skipping over fields that need not be deserialized. Note that this skipping is more efficient compared to that achieved by using `TProtocolUtil.skip()`. The latter calls the corresponding `read()`, allocates and initializes certain values (for example, strings) and then discards the returned value. In comparison, `PartialThriftProtocol` skips a field by incrementing internal offset into the transport buffer.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is true that the existing skip implementations all need to read the field wholly then discard it, they don't necessarily have to be this way. I would prefer to replace the existing TProtocolUtil.skip with the more efficient implementation, than creating a new, separated API for them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would have been great. However, TProtocolUtil is protocol agnostic and the specifics of certain types of skip differ between compact and binary protocol. That is why it needs to happen at protocol level.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current TProtocolUtil.skip requires a TProtocol arg (https://www.javadoc.io/static/org.apache.thrift/libthrift/0.11.0/org/apache/thrift/protocol/TProtocolUtil.html), so you can totally then delegate to the concrete TProtocol to decide how to skip that field in the most efficient way. (for example, in Go we already have Skip function defined in the TProtocol interface, so it can be different from read*. we can probably do something similar in java.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it. basically you are recommending moving skip*() methods directly into T*Protocol. can you please confirm?

@bhalchandrap
Copy link
Contributor Author

addressed review comments

@bhalchandrap
Copy link
Contributor Author

@fishy @Jens-G do you have additional comments? do you know a Java dev who can review this PR?

@bhalchandrap
Copy link
Contributor Author

@fishy Thanks for reviewing! Can you please provide clarification on the questions?

@fishy
Copy link
Member

fishy commented Sep 14, 2021

@bhalchandrap for skip: yes please make TProtocolUtil.skip choose the most efficient way to skip a field in a TProtocol, instead of adding a new API for "efficient skip" and keep TProtocolUtil.skip as the "slow skip".

For partial deserialization, I don't know which makes more sense in Java API. If you can totally use the existing API to implement it that would be awesome (I imagine you at least need to add new constructors to TDeserializer to take the field maps you want to partially deserialize). If you can't, I think it's fine to add new APIs for it as long as the new API design makes sense, but the new API should be more similar to the existing TDeserializer API, unless we have a good reason to make it quite different (e.g. the current TDeserializer API is bad/inefficient/etc. because of X Y Z).

@bhalchandrap
Copy link
Contributor Author

@fishy thanks for the earlier suggestions. I applied all of them. Can you please re-review?

Comment on lines 445 to 446
public Object deserialize(byte[] bytes, int offset, int length) throws TException {
ensurePartialDeserializationMode();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a deserialize function with very similar signature:

public void deserialize​(TBase base, byte[] bytes, int offset, int length) throws TException

adding this function will make it really confusing to users on what to call. my suggestions are:

  1. Maybe rename this one to deserializeObject to emphasize that it returns Object instead of writes to TBase.
  2. Why does this only support partial mode (it throws a runtime exception when the TDeserializer was not consutrcted with partial mode), while there's nothing in its name or even javadoc mentioning that? that sounds like a huge footgun. Is there even a reason this should call ensurePartialDeserializationMode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry that was an oversight. The method should have been called partialDeserializeObject() like its other overload. fixed. That should address both #1 and #2 points above. The name clarifies two intents:

  1. it performs partial deserialization
  2. it returns an Object

* @param processor the Processor that handles deserialized field values.
* @param protocolFactory the Factory to create a protocol.
*/
public TDeserializer(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(maybe it just works, I'm not familiar with the java codebase and this is not obvious to me)

I don't see you changing any of the existing public functions of TDeserializer, for example:

public void deserialize​(TBase base, byte[] bytes) throws TException
public void deserialize​(TBase base, byte[] bytes, int offset, int length) throws TException
public void deserialize​(TBase base, java.lang.String data, java.lang.String charset) throws TException

Since you are adding new constructors to TDeserializer, you need to ensure that those existing public functions still work as intended. When someone uses this constructor to create a TDeserializer, then calls deserialize​(TBase base, byte[] bytes), they would expect that deserialize function does partial deserialization, not the full one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your observation is correct and is intentional. I did not want to change the behavior of existing methods as it can cause backward compatibility issues. With that in mind, if one calls any of the existing deserialize() methods, they will continue to get the existing behavior. One will need to call partialDeserilizeObject() to perform partial deserialization.

If you strongly believe that we need to change the behavior of existing methods, I can make it happen. Let me know what you think.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your observation is correct and is intentional. I did not want to change the behavior of existing methods as it can cause backward compatibility issues. With that in mind, if one calls any of the existing deserialize() methods, they will continue to get the existing behavior. One will need to call partialDeserilizeObject() to perform partial deserialization.

If you strongly believe that we need to change the behavior of existing methods, I can make it happen. Let me know what you think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree. Imagine this PR is merged and some developer wrote this code:

TDeserializer deserializer = new TDeserializer(thriftClass, fieldNames, processor, protocolFactory);
deserializer.deserialize(base, buf);

Do you expect it to do full or partial deserialization? I would say that as deserializer here is constructed via a constructor that does partial deserialization, I would expect it to only do partial deserializations.

Comment on lines 278 to 300
protected void skipBool() throws TException {
this.skipBytes(1);
}

protected void skipByte() throws TException {
this.skipBytes(1);
}

protected void skipI16() throws TException {
this.skipBytes(2);
}

protected void skipI32() throws TException {
this.skipBytes(4);
}

protected void skipI64() throws TException {
this.skipBytes(8);
}

protected void skipDouble() throws TException {
this.skipBytes(8);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the default implementations should use hardcoded number of bytes. the default implementations should just call read* and discard the result. this guarantees correctness, and leaves individual protocols the room to override the default implementation to improve performance. Using hardcoded number of bytes cannot guarantee the correctness.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had other aspect in mind. That is, have the derived protocols override when they do not do what the base does. If I call read() for each skip() then each of the derived protocols would want to override all of the methods.

Happy to do what you suggest. Can you please confirm?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is fine, but then you need to fix all the TProtocol implementations that use different bytes here, which includes TJSONProtocol and TSimpleJSONProtocol, and you did not "fix" them.

TProtocol is not some thing "either binary or compact". there are a lot more of them, and the fixed numbers are only correct for TBInaryProtocol. So it really should be the other way around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! I did not know about additional protocols. I updated default implementation of all skip methods to call read and moved specific byte skipping to binary protocol. It'd be great if someone who understands the other protocols better provide more efficient skip implementation for them.

@bhalchandrap
Copy link
Contributor Author

@fishy thanks for the careful code review and the suggestions. I applied your suggestions. Can you please re-review? I will add tests corresponding to the new code once the overall structure looks ok to you.

Copy link
Member

@fishy fishy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry for the delayed review, I think the API design in the current state is good enough.

but still I'd wait for someone more familiar with the java code to review it.

@@ -82,12 +147,16 @@ public void deserialize(TBase base, byte[] bytes) throws TException {
* @throws TException if an error is encountered during deserialization.
*/
public void deserialize(TBase base, byte[] bytes, int offset, int length) throws TException {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we probably need to update the javadoc of this function accordingly.

@bhalchandrap
Copy link
Contributor Author

Hi @Jens-G, @fishy approved the pull request. I had addressed your earlier comments. Do you have more suggestions? I would like to know what I can do from my side to make progress on this pull request.

@bhalchandrap
Copy link
Contributor Author

@Jens-G can you please let me know if I can do anything from my side to make progress on this pull request?

@Jens-G
Copy link
Member

Jens-G commented Nov 13, 2021

This branch cannot be rebased due to conflicts

If there are no objections and the CI is green again I would merge this in the next days.

@Jens-G Jens-G closed this in 5a9d139 Nov 19, 2021
@Jens-G
Copy link
Member

Jens-G commented Nov 20, 2021

Next time please either squash the PR or at least rebase it onto master every now and then.

@bhalchandrap
Copy link
Contributor Author

bhalchandrap commented Nov 20, 2021 via email

Source files:
- PartialThriftProtocol.java
- PartialThriftBinaryProtocol.java
- PartialThriftCompactProtocol.java
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these source files have been removed,the document should be updated。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that was an oversight. I will update the document asap and send another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants