[FLINK-32650][protobuf]Added the ability to split flink-protobuf code… #23162

Closed
wants to merge 2 commits into from

Conversation

ljw-hit

@ljw-hit ljw-hit commented Aug 8, 2023

What is the purpose of the change

When a message has so many fields that the compiled decode/encode method body exceeds 8K, the JIT compiler will not optimize that method, which seriously hurts serialization and deserialization performance.
This pull request adds the ability to split the code generated by flink-protobuf so that the decode/encode methods stay small enough to be optimized.
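
As a rough illustration of the idea described above (not the PR's actual implementation), the sketch below groups per-field code snippets into chunks that stay under a size limit; each chunk could then become the body of its own small method. The class name `ChunkBySizeSketch`, the method `chunk`, and the limit used in `main` are all assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; this class is not part of flink-protobuf. */
class ChunkBySizeSketch {

    /**
     * Groups snippets into chunks of at most maxChars characters (counting one
     * newline per snippet). A single snippet longer than the limit still gets a
     * chunk of its own.
     */
    static List<String> chunk(List<String> snippets, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String snippet : snippets) {
            // Close the current chunk when this snippet would push it past the limit.
            if (current.length() > 0
                    && current.length() + snippet.length() + 1 > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(snippet).append('\n');
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> fieldCode = Arrays.asList(
                "rowData.setField(0, message.getA());",
                "rowData.setField(1, message.getB());",
                "rowData.setField(2, message.getC());");
        // Illustrative limit only; the PR defines its own threshold constant.
        for (String c : chunk(fieldCode, 80)) {
            System.out.println("--- chunk ---");
            System.out.print(c);
        }
    }
}
```

Chunking by character count is only a proxy for compiled method size, so a real threshold would need to be chosen conservatively.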

Brief change log

  • Add a codegenSplit method to the PbCodegenDeserializer/PbCodegenSerializer interfaces
  • Add PbCodeSplitter to split Row type code
  • Implement codegenSplit in all PbCodegenDeserializer/PbCodegenSerializer implementations

Verifying this change

  • This change is already covered by existing tests.
  • Add a new unit test, BigProtoBufCodeSpiltterTest

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no) no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no) no
  • The serializers: (yes / no / don't know) no
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know) no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know) no
  • The S3 file system connector: (yes / no / don't know) no

Documentation

  • Does this pull request introduce a new feature? (yes / no) yes
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) not applicable

@flinkbot
Collaborator

flinkbot commented Aug 8, 2023

CI report:


@ljw-hit
Author

ljw-hit commented Aug 13, 2023

@libenchao Hi, if you have time, could you help review this code?

@libenchao
Member

@maosuhan Are you interested in reviewing this?

@maosuhan
Contributor

@libenchao Sure, I will take it. Maybe it will take me about one week.

@ljw-hit ljw-hit force-pushed the FLINK-32650 branch 3 times, most recently from a567d7f to e609488 Compare September 12, 2023 08:19
@maosuhan
Contributor

@ljw-hit Hi, thanks for your effort. The code already looks to be in good shape to me. I have left a few comments about the unit tests. Could you also provide a benchmark for this improvement? For example, how much time is saved when encoding/decoding 10M large rows after this change?

@ljw-hit
Author

ljw-hit commented Sep 15, 2023

@maosuhan I haven’t seen any comments about the unit tests here. Have the comments been submitted?

*
* <p>It is valid proto definition.
*/
public class BigProtoBufCodeSpiltterTest {
Contributor

@ljw-hit I suggest that we retrieve the generated code from RowToProtoConverter or ProtoToRowConverter, then check it to make sure that the statically split code exists.
Please also write complete deserialization/serialization tests to make sure that the data is processed correctly.

Author

@maosuhan I have added complete deserialization/serialization tests. Thank you for carefully reviewing my code.
Are there any other issues with the current code? If not, can you ask a committer to merge it?

Contributor

@ljw-hit It seems the test code cannot explicitly tell whether BigPbMessage is handled by the split or the non-split logic. Can you write a test to ensure that BigPbMessage is handled by the split logic? For example, check that the generated code contains the split methods.
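
One way such a check could look, as a minimal sketch: count the helper methods that appear in the generated source text. The `splitN` naming scheme, the class `SplitCheckSketch`, and the method `countSplitMethods` are assumptions for illustration; the real generator may name its helpers differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch; not part of the PR's test code. */
class SplitCheckSketch {

    // Assumed naming scheme for generated helpers: static void split0(...), split1(...), ...
    private static final Pattern SPLIT_METHOD =
            Pattern.compile("static\\s+void\\s+split\\d+\\s*\\(");

    /** Returns how many split helper declarations occur in the generated code. */
    static int countSplitMethods(String generatedCode) {
        Matcher matcher = SPLIT_METHOD.matcher(generatedCode);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String generated =
                "private static void split0(Object rowData, Object message) {}\n"
                        + "private static void split1(Object rowData, Object message) {}\n";
        System.out.println(countSplitMethods(generated));
    }
}
```

A test could then assert that the count is greater than zero for a large message and zero for a small one.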

Author

OK, this is no problem. I can add some explicit tests to indicate that the current code has been split.

Author

@maosuhan Resolved. I use an isCodeSplit method to explicitly indicate that the current code has been split.

Contributor

@ljw-hit Adding this check flag looks good to me. I think this MR is in good shape now. Thanks for your effort.

Author

@maosuhan Thank you very much for your review! @libenchao Can you approve this PR?

@ljw-hit ljw-hit force-pushed the FLINK-32650 branch 2 times, most recently from 7bc8bed to 816de52 Compare October 5, 2023 07:58
@libenchao
Member

@ljw-hit Thanks for your contribution, and @maosuhan thanks for the review. I'll try to give it a final review and merge it in the next two weeks (this week is rather busy).

@ljw-hit
Author

ljw-hit commented Nov 3, 2023

@libenchao Sorry to bother you. Do you have time for the final review and commit soon?

@libenchao
Member

@libenchao Sorry to bother you. Do you have time for the final review and commit soon?

Thanks for your patience. I'll review this next week; let's move forward and get it in.

@libenchao
Member

@flinkbot run azure

Member

@libenchao libenchao left a comment

@ljw-hit Thanks for your contribution, I've left my comments below.

public class PbCodeSplitter {
private final List<String> splitMethodStack = new ArrayList<>();

public PbCodeSplitter() {}
Member

No need to add a blank public default constructor.

Author

resolved

float f_field_8 = 29;
bool f_field_9 = 30;
string f_field_10 = 31;
bytes f_field_11 = 32;
Member

The field naming is not consistent (int_field, a_field_n, map_field). Can you normalize it to one pattern?

Author

resolved

rowData.setField(9, false);
rowData.setField(10, 1F);
rowData.setField(11, 2D);
rowData.setField(12, new byte[] {1, 2, 3});
Member

Can you set values for all fields? Then we can be confident that splitting does not affect correctness.

Author

Hmm, thank you for your suggestion, I will fix it.

Comment on lines 74 to 75
* Flink-Protobuf serialize codegen code size is 13999, over code threshold.
* So pbCodeSplitter split the code.
Member

I think the comment is not really needed, since the test name and body already explain it. Besides, 13999 could easily go stale in future iterations.

Author

resolved

* So pbCodeSplitter split the code.
*/
@Test
public void testSerializeSplit() throws Exception {
Member

How about testSplitInSerialization?

Author

Hmm, thanks for the suggestion, I will adopt it

* @param pbObjectCode may be a variable or expression. Current codegen environment can use this
* literal name directly to access the input. {@code pbObject} should be a protobuf object
* literal name directly to access the input. {@code pbGetStr} is a value coming from
Member

I'm not sure why we should reference something that is not in the method signature, or why we should change resultVariable to returnInternalDataVarName and pbObject to pbGetStr.

Author

resolved

* literal name directly to access the input. {@code pbObject} should be a protobuf object
* literal name directly to access the input. {@code pbGetStr} is a value coming from
* protobuf object
* @param pbCodeSplitter when encode/decode method body over 4K, use PbCodeSplitter to Split
Member

Actually it's PbConstant.CODEGEN_SPLIT_THRESHOLD

Author

resolved

@@ -41,8 +42,8 @@ public PbCodegenSimpleSerializer(
this.formatContext = formatContext;
}

@Override
public String codegen(String resultVar, String flinkObjectCode, int indent)
public String codegenSplit(
Member

@Override

Author

resolved

* @param internalDataGetStr may be a variable or expression. Current codegen environment can
* use this literal name directly to access the input. {@code internalDataGetStr} is a value
* coming from flink object.
* @param pbCodeSplitter when encode/decode method body over 4K, use PbCodeSplitter to Split
Member

The comments on PbCodegenDeserializer also apply here.

Author

resolved

return String.format("%s(%s, %s);", splitMethodName, rowDataVar, messageTypeVar);
}

public String splitSerializerRowTypeMethod(
Member

splitSerializerRowTypeMethod and splitDeserializerRowTypeMethod share most of their code, so I'm wondering if we can reuse it.

Furthermore, I think these two methods are not actually necessary, and PbCodeSplitter is somewhat confusing. Can we just use PbFormatContext and:

  • Add a final List<String> splitMethods = new ArrayList<>()
  • Add a method addCodeIntoMethod(String code)

and leave the rest to the caller, since there is only one caller of these two methods?

Then we can avoid introducing PbCodeSplitter everywhere.
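
The shape of this suggestion might look roughly like the following sketch. It is only an assumption about what such a context could contain: `FormatContextSketch` stands in for PbFormatContext, and the helper naming, the `Object` parameter types, and the `getSplitMethods` accessor are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the context class discussed in this review. */
class FormatContextSketch {

    // Helper methods generated so far; the caller emits them next to the main method.
    private final List<String> splitMethods = new ArrayList<>();

    /** Wraps the given code in a new static helper and returns the statement that calls it. */
    String addCodeIntoMethod(String code) {
        String name = "split" + splitMethods.size();
        splitMethods.add(
                "private static void " + name + "(Object rowData, Object message) {\n"
                        + code + "\n}\n");
        return name + "(rowData, message);";
    }

    /** Read-only view of the generated helpers. */
    List<String> getSplitMethods() {
        return Collections.unmodifiableList(splitMethods);
    }

    public static void main(String[] args) {
        FormatContextSketch context = new FormatContextSketch();
        // The caller decides what goes into each helper and where the call is placed.
        System.out.println(context.addCodeIntoMethod("rowData.toString();"));
        context.getSplitMethods().forEach(System.out::print);
    }
}
```

Keeping the list of helpers in the context means the single caller owns the decision of when to split, which is the point the comment makes.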

Author

Thank you for such targeted suggestions.

  1. I have found a way to reuse the code, so I can solve this part.
  2. Regarding the second point, I am a little confused. Do you mean that PbCodeSplitter is not needed, and that all the code-split logic should go into PbFormatContext?

@@ -27,4 +27,10 @@ public class PbConstant {
public static final String PB_MAP_KEY_NAME = "key";
public static final String PB_MAP_VALUE_NAME = "value";
public static final String PB_OUTER_CLASS_SUFFIX = "OuterClass";
/**
* JIT optimizer threshold is 8K, unicode encode one char use 2byte, so use 3K as
Member

@libenchao libenchao Nov 7, 2023

This is not correct: ASCII characters take only 1 byte each in UTF-8 encoding.

Author

@ljw-hit ljw-hit Nov 12, 2023

Thank you for your suggestion, I will modify my comment. By the way, if 1 character corresponds to 1 byte, does this threshold need to be modified?
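
To make the byte counts in this exchange concrete, here is a small sketch comparing the encoded size of a string in UTF-8 and UTF-16. The class `CharSizeSketch` and its helper methods are made up for the example. Note also that, as far as I know, HotSpot's huge-method limit is measured in bytes of compiled bytecode rather than in source characters, so any character-based threshold is only an approximation.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch comparing encoded sizes of generated source text. */
class CharSizeSketch {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes the string occupies when encoded as UTF-16 (big-endian, no BOM). */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String code = "rowData.setField(0, message.getA());";
        System.out.println("chars:  " + code.length());
        System.out.println("UTF-8:  " + utf8Bytes(code));
        System.out.println("UTF-16: " + utf16Bytes(code));
    }
}
```

For pure-ASCII generated code the UTF-8 size equals the character count, while the UTF-16 size is twice that.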

@ljw-hit
Author

ljw-hit commented Nov 13, 2023

@libenchao Thank you very much for your code review. I learned a lot from it, and I have addressed all the comments. Please review again when you have time.

Member

@libenchao libenchao left a comment

@ljw-hit Thanks for the update. It generally looks good now; I've left a few more minor comments.

String genCode = codegenDes.codegen("rowData", "message", 0);
// if codgen generate code size over threshod then split the code
PbCodeSplitter pbCodeSplitter = new PbCodeSplitter();
LOG.info("Fast-pb generate split deserialize code");
Member

Ping, it seems you missed this one.

@@ -109,4 +117,8 @@ public byte[] convertRowToProtoBinary(RowData rowData) throws Exception {
AbstractMessage message = (AbstractMessage) encodeMethod.invoke(null, rowData);
return message.toByteArray();
}

public boolean isCodeSplit() {
Member

for testing.

@@ -84,11 +85,18 @@ public RowToProtoConverter(RowType rowType, PbFormatConfig formatConfig)
PbCodegenSerializer codegenSer =
PbCodegenSerializeFactory.getPbCodegenTopRowSer(
descriptor, rowType, formatContext);
LOG.info("Fast-pb generate split serialize code");
Member

Remove the unnecessary log.


map<string, bytes> map_field_32 = 32;
map<string, string> map_field_33 = 33;
}
Member

Always end the final line with a \n; then it will not complain about "No newline at end of file".

@ljw-hit
Author

ljw-hit commented Nov 13, 2023

@libenchao Thank you for your detailed review. I have addressed these comments.

Member

@libenchao libenchao left a comment

+1, will merge after the CI is green. Thanks for your contribution @ljw-hit, and thanks for the review @maosuhan.

4 participants