THRIFT-5443: add support for partial Thrift deserialization #2439

bhalchandrap · 2021-08-16T19:32:38Z

Client: java

Adds support for partial Thrift deserialization in Java.
More information is available at these locations:
blog: https://medium.com/pinterest-engineering/improving-data-processing-efficiency-using-partial-deserialization-of-thrift-16bc3a4a38b4
in source: lib/java/src/org/apache/thrift/partial/README.md

Did you create an Apache Jira ticket? (not required for trivial changes)
If a ticket exists: Does your pull request title follow the pattern "THRIFT-NNNN: describe my issue"?
Did you squash your changes to a single commit? (not required, but preferred)
Did you do your best to avoid breaking changes? If one was needed, did you label the Jira ticket with "Breaking-Change"?
If your change does not involve any code, include [skip ci] anywhere in the commit message to free up build resources.

Client: java

Jens-G

Nice! I admit that I did not run any tests, just from what I saw from looking at the code and the README.

That would be awesome:

"TBD: add additional information on defining map keys and values."
I have a strong feeling this will be among the first questions people are going to ask ;-)

Also, did you run tests against "fuzzy" input data? That's becoming kind of standard these days, given an extension like this that should probably be done before we merge.

…ft into kpandit/thrift-partial

bhalchandrap · 2021-08-31T01:36:40Z

Updated README.md file to add clarification on map field path syntax.

I have not done fuzzy testing. However, the code includes in depth testing. In addition, the code is in active use in production jobs at Pinterest.

fishy

Java is also not my bread and butter, so my feedback here are more focused on the API than the implementation itself.

I think overall the design makes sense. To map it to go (my bread and butter), some heavy reflect usage will be needed (we might even need some compiler changes to support it), but go's reflect implementation is in general efficient so I'm not too worried about that. An alternative to the string version of fully qualified field definition is to use the numeric ids, but that doesn't really make the implementation easier (at least not for go), and string version is more readable.

fishy · 2021-09-05T01:08:17Z

lib/java/src/org/apache/thrift/partial/PartialThriftCompactProtocol.java

+    // Having to call readFieldBegin() to compute TFieldData really results in lower
+    // performance. However, readFieldBegin() accesses some private vars that this method
+    // does not have access to. We could make it more performant when contributing to
+    // origianl source code.


you are contributing to the original source code here, so maybe this comment should be updated accordingly?

added a TODO comment for now. That way we can better separate original contribution from subsequent performance improvement changes.

fishy · 2021-09-05T01:17:07Z

lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java

+   * @return deserialized instance.
+   * @throws TException if an error is encountered during deserialization.
+   */
+  public Object deserialize(byte[] bytes) throws TException {


should this return TBase, or take TBase as an arg and return void instead? that's how the current TDeserializer interface is defined.

It returns Object on purpose. See the class header comment as well as the README.md file for more info.
This arrangement allows non-TBase values to be directly instantiated from serialized Thrift objects. One example of its usefulness is Spark engine reading SequenceFile of Thrift objects. In that case, this arrangement allows SparkThriftRowProcessor (not a part of this change) to directly deserialize a serialized Thrift blob into Spark's InternalRow representation. That enables significant performance boost.

I'm still not convinced that this API need to be so dramatically different from the existing TDeserializer API (https://www.javadoc.io/doc/org.apache.thrift/libthrift/latest/org/apache/thrift/TDeserializer.html), can you try to match that as much as possible? The current TDeserializer API in Java already provided partiallyDeserialize* functions.

I'm really worried about how this API will map to other languages, even if it makes sense for the Java case.

Hi @fishy,
Happy to make changes if I get clarity on the concerns you have and on the type of changes you are looking for.

Based on your comments I believe you prefer adding functionality to existing interfaces instead of adding new ones (please confirm since this is my guess). If so, how does the following work for you?

move common skip*() methods from PartialThriftProtocol.java to TProtocol.java.

move encoding specific skip*() methods from PartialThriftProtocol.java to TProtocol (for example, from PartialThriftBinaryProtocol.java to TBinaryProtocol)

get rid of PartialThriftDeserializer.java and add its methods to TDeserializer

Would something like the above address your concerns?

lib/java/src/org/apache/thrift/partial/PartialThriftDeserializer.java

fishy · 2021-09-05T02:08:04Z

lib/java/src/org/apache/thrift/partial/README.md

+
+### Partial deserialization
+
+Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.


I see this is where you explained the API difference from TDeserialize.

Is there a real use case of deserializing a non-TBase, though? My main concern is that to map this API to other languages, Object would be interface{} in go, and we really want to avoid interface{} in go as much as possible.

Yes, it is very useful in certain cases (see earlier comment). ThriftStructProcessor enables direct deserialization into TBase.

What are your thoughts on providing a simpler wrapper over ThriftStructProcessor and PartialThriftDeserializer that directly supports deserializing into TBase? That way, not every language would need to expose the ability to deserialize into non-TBase.

fishy · 2021-09-05T02:11:24Z

lib/java/src/org/apache/thrift/partial/README.md

+- PartialThriftBinaryProtocol.java
+- PartialThriftCompactProtocol.java
+
+This component implements efficient skipping over fields that need not be deserialized. Note that this skipping is more efficient compared to that achieved by using `TProtocolUtil.skip()`. The latter calls the corresponding `read()`, allocates and initializes certain values (for example, strings) and then discards the returned value. In comparison, `PartialThriftProtocol` skips a field by incrementing internal offset into the transport buffer.


While this is true that the existing skip implementations all need to read the field wholly then discard it, they don't necessarily have to be this way. I would prefer to replace the existing TProtocolUtil.skip with the more efficient implementation, than creating a new, separated API for them.

That would have been great. However, TProtocolUtil is protocol agnostic and the specifics of certain types of skip differ between compact and binary protocol. That is why it needs to happen at protocol level.

The current TProtocolUtil.skip requires a TProtocol arg (https://www.javadoc.io/static/org.apache.thrift/libthrift/0.11.0/org/apache/thrift/protocol/TProtocolUtil.html), so you can totally then delegate to the concrete TProtocol to decide how to skip that field in the most efficient way. (for example, in Go we already have Skip function defined in the TProtocol interface, so it can be different from read*. we can probably do something similar in java.)

got it. basically you are recommending moving skip*() methods directly into T*Protocol. can you please confirm?

bhalchandrap · 2021-09-06T20:16:10Z

addressed review comments

…ft into kpandit/thrift-partial

bhalchandrap · 2021-09-10T17:40:31Z

@fishy @Jens-G do you have additional comments? do you know a Java dev who can review this PR?

bhalchandrap · 2021-09-13T02:09:20Z

@fishy Thanks for reviewing! Can you please provide clarification on the questions?

fishy · 2021-09-14T02:55:25Z

@bhalchandrap for skip: yes please make TProtocolUtil.skip choose the most efficient way to skip a field in a TProtocol, instead of adding a new API for "efficient skip" and keep TProtocolUtil.skip as the "slow skip".

For partial deserialization, I don't know which makes more sense in Java API. If you can totally use the existing API to implement it that would be awesome (I imagine you at least need to add new constructors to TDeserializer to take the field maps you want to partially deserialize). If you can't, I think it's fine to add new APIs for it as long as the new API design makes sense, but the new API should be more similar to the existing TDeserializer API, unless we have a good reason to make it quite different (e.g. the current TDeserializer API is bad/inefficient/etc. because of X Y Z).

…TDeserializer

…ft into kpandit/thrift-partial

bhalchandrap · 2021-10-01T00:23:50Z

@fishy thanks for the earlier suggestions. I applied all of them. Can you please re-review?

fishy · 2021-10-02T16:00:00Z

lib/java/src/org/apache/thrift/TDeserializer.java

+  public Object deserialize(byte[] bytes, int offset, int length) throws TException {
+    ensurePartialDeserializationMode();


We already have a deserialize function with very similar signature:

public void deserialize(TBase base, byte[] bytes, int offset, int length) throws TException

adding this function will make it really confusing to users on what to call. my suggestions are:

Maybe rename this one to deserializeObject to emphasize that it returns Object instead of writes to TBase.

Why does this only support partial mode (it throws a runtime exception when the TDeserializer was not consutrcted with partial mode), while there's nothing in its name or even javadoc mentioning that? that sounds like a huge footgun. Is there even a reason this should call ensurePartialDeserializationMode?

sorry that was an oversight. The method should have been called partialDeserializeObject() like its other overload. fixed. That should address both #1 and #2 points above. The name clarifies two intents:

it performs partial deserialization

it returns an Object

fishy · 2021-10-02T16:05:13Z

lib/java/src/org/apache/thrift/TDeserializer.java

+   * @param processor the Processor that handles deserialized field values.
+   * @param protocolFactory the Factory to create a protocol.
+   */
+  public TDeserializer(


(maybe it just works, I'm not familiar with the java codebase and this is not obvious to me)

I don't see you changing any of the existing public functions of TDeserializer, for example:

public void deserialize(TBase base, byte[] bytes) throws TException public void deserialize(TBase base, byte[] bytes, int offset, int length) throws TException public void deserialize(TBase base, java.lang.String data, java.lang.String charset) throws TException

Since you are adding new constructors to TDeserializer, you need to ensure that those existing public functions still work as intended. When someone uses this constructor to create a TDeserializer, then calls deserialize(TBase base, byte[] bytes), they would expect that deserialize function does partial deserialization, not the full one.

Your observation is correct and is intentional. I did not want to change the behavior of existing methods as it can cause backward compatibility issues. With that in mind, if one calls any of the existing deserialize() methods, they will continue to get the existing behavior. One will need to call partialDeserilizeObject() to perform partial deserialization.

If you strongly believe that we need to change the behavior of existing methods, I can make it happen. Let me know what you think.

I disagree. Imagine this PR is merged and some developer wrote this code:

TDeserializer deserializer = new TDeserializer(thriftClass, fieldNames, processor, protocolFactory); deserializer.deserialize(base, buf);

Do you expect it to do full or partial deserialization? I would say that as deserializer here is constructed via a constructor that does partial deserialization, I would expect it to only do partial deserializations.

fishy · 2021-10-02T16:08:54Z

lib/java/src/org/apache/thrift/protocol/TProtocol.java

+  protected void skipBool() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipByte() throws TException {
+    this.skipBytes(1);
+  }
+
+  protected void skipI16() throws TException {
+    this.skipBytes(2);
+  }
+
+  protected void skipI32() throws TException {
+    this.skipBytes(4);
+  }
+
+  protected void skipI64() throws TException {
+    this.skipBytes(8);
+  }
+
+  protected void skipDouble() throws TException {
+    this.skipBytes(8);
+  }


I don't think the default implementations should use hardcoded number of bytes. the default implementations should just call read* and discard the result. this guarantees correctness, and leaves individual protocols the room to override the default implementation to improve performance. Using hardcoded number of bytes cannot guarantee the correctness.

I had other aspect in mind. That is, have the derived protocols override when they do not do what the base does. If I call read() for each skip() then each of the derived protocols would want to override all of the methods.

Happy to do what you suggest. Can you please confirm?

That is fine, but then you need to fix all the TProtocol implementations that use different bytes here, which includes TJSONProtocol and TSimpleJSONProtocol, and you did not "fix" them.

TProtocol is not some thing "either binary or compact". there are a lot more of them, and the fixed numbers are only correct for TBInaryProtocol. So it really should be the other way around.

Good catch! I did not know about additional protocols. I updated default implementation of all skip methods to call read and moved specific byte skipping to binary protocol. It'd be great if someone who understands the other protocols better provide more efficient skip implementation for them.

bhalchandrap · 2021-10-12T02:21:22Z

@fishy thanks for the careful code review and the suggestions. I applied your suggestions. Can you please re-review? I will add tests corresponding to the new code once the overall structure looks ok to you.

fishy

sorry for the delayed review, I think the API design in the current state is good enough.

but still I'd wait for someone more familiar with the java code to review it.

fishy · 2021-10-16T03:56:34Z

lib/java/src/org/apache/thrift/TDeserializer.java

@@ -82,12 +147,16 @@ public void deserialize(TBase base, byte[] bytes) throws TException {
   * @throws TException if an error is encountered during deserialization.
   */
  public void deserialize(TBase base, byte[] bytes, int offset, int length) throws TException {


we probably need to update the javadoc of this function accordingly.

bhalchandrap · 2021-10-26T01:04:17Z

Hi @Jens-G, @fishy approved the pull request. I had addressed your earlier comments. Do you have more suggestions? I would like to know what I can do from my side to make progress on this pull request.

bhalchandrap · 2021-11-11T01:55:39Z

@Jens-G can you please let me know if I can do anything from my side to make progress on this pull request?

Jens-G · 2021-11-13T23:36:02Z

This branch cannot be rebased due to conflicts

If there are no objections and the CI is green again I would merge this in the next days.

Jens-G · 2021-11-20T00:01:19Z

Next time please either squash the PR or at least rebase it onto master every now and then.

bhalchandrap · 2021-11-20T00:55:53Z

Certainly. This was the first time I contributed to OSS. It was a good learning experience for me.

…

On Fri, Nov 19, 2021 at 4:01 PM Jens Geyer ***@***.***> wrote: Next time please either squash the PR or at least rebase it onto master every now and then. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#2439 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ATA7ZBX7PVQ4PZXUGM4RCEDUM3QNVANCNFSM5CILR3NQ> . Triage notifications on the go with GitHub Mobile for iOS <https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android <https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.

dulong · 2021-11-23T03:20:35Z

lib/java/src/org/apache/thrift/partial/README.md

+Source files:
+- PartialThriftProtocol.java
+- PartialThriftBinaryProtocol.java
+- PartialThriftCompactProtocol.java


If these source files have been removed，the document should be updated。

Sorry that was an oversight. I will update the document asap and send another PR.

kpandit added 2 commits August 16, 2021 12:09

add partial Thrift support

b111ab5

THRIFT-5443: add support for partial Thrift deserialization

b2a7518

Client: java

Jens-G self-requested a review August 22, 2021 16:39

Jens-G reviewed Aug 30, 2021

View reviewed changes

kpandit and others added 4 commits August 30, 2021 18:19

merge with master

180e916

add clarification on map field paths

65e67dc

Merge branch 'apache:master' into kpandit/thrift-partial

e41b268

Merge branch 'kpandit/thrift-partial' of github.com:bhalchandrap/thri…

85e2dcb

…ft into kpandit/thrift-partial

fishy reviewed Sep 5, 2021

View reviewed changes

bhalchandrap and others added 2 commits September 6, 2021 12:45

Merge branch 'apache:master' into kpandit/thrift-partial

9b667f0

merge with master

02811b4

kpandit added 2 commits September 6, 2021 13:16

addressed review comments

93851a1

Merge branch 'kpandit/thrift-partial' of github.com:bhalchandrap/thri…

28bedae

…ft into kpandit/thrift-partial

bhalchandrap and others added 3 commits September 21, 2021 17:03

Merge branch 'apache:master' into kpandit/thrift-partial

9df82f5

move skip() functionality to T*Protocol, add partial deser method to …

df30051

…TDeserializer

Merge branch 'kpandit/thrift-partial' of github.com:bhalchandrap/thri…

de2bcf3

…ft into kpandit/thrift-partial

fishy requested changes Oct 2, 2021

View reviewed changes

kpandit and others added 3 commits October 6, 2021 17:40

use correct name for overloaded method partialDeserializeObject()

2780489

Merge branch 'apache:master' into kpandit/thrift-partial

c4e4139

use read() as the default implementation of skip()

28af5f6

fishy approved these changes Oct 16, 2021

View reviewed changes

fishy reviewed Oct 16, 2021

View reviewed changes

Merge branch 'apache:master' into kpandit/thrift-partial

6d66622

Merge branch 'apache:master' into kpandit/thrift-partial

4e3a7f4

bhalchandrap requested a review from Jens-G October 26, 2021 01:03

bhalchandrap and others added 3 commits October 28, 2021 08:20

Merge branch 'apache:master' into kpandit/thrift-partial

e132925

Merge branch 'apache:master' into kpandit/thrift-partial

0aeb0f0

trigger build

e42321a

merge with master

55da9c4

Jens-G closed this in 5a9d139 Nov 19, 2021

dulong reviewed Nov 23, 2021

View reviewed changes


		### Partial deserialization

		Partial deserialization involves deserializing only a subset of the fields of a serialized Thrift object while efficiently skipping over the rest. One very important benefit of partial deserialization is that the output of the deserialization process is not limited to a `TBase` derived object. It can deserialize a serialized blob into any type by using an appropriate `ThriftFieldValueProcessor`.

		public Object deserialize(byte[] bytes, int offset, int length) throws TException {
		ensurePartialDeserializationMode();

THRIFT-5443: add support for partial Thrift deserialization #2439

THRIFT-5443: add support for partial Thrift deserialization #2439

Conversation

bhalchandrap commented Aug 16, 2021

Jens-G left a comment

Choose a reason for hiding this comment

bhalchandrap commented Aug 31, 2021

fishy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bhalchandrap Sep 6, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bhalchandrap commented Sep 6, 2021

bhalchandrap commented Sep 10, 2021

bhalchandrap commented Sep 13, 2021

fishy commented Sep 14, 2021

bhalchandrap commented Oct 1, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bhalchandrap commented Oct 12, 2021

fishy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bhalchandrap commented Oct 26, 2021

bhalchandrap commented Nov 11, 2021

Jens-G commented Nov 13, 2021

Jens-G commented Nov 20, 2021

bhalchandrap commented Nov 20, 2021 via email

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bhalchandrap Sep 6, 2021 •

edited