[FLINK-32650][protobuf]Added the ability to split flink-protobuf code… #23162

ljw-hit · 2023-08-08T03:07:50Z

What is the purpose of the change

When a message has many fields, the compiled body of the generated decode/encode method can exceed 8K. The JIT then does not optimize the method, which seriously hurts serialization and deserialization performance.
This pull request adds the ability to split the flink-protobuf codegen code to improve decode/encode method performance.
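
To illustrate the idea, here is a minimal sketch of splitting generated statements into helper methods once they outgrow a threshold. It is not the code in this PR: `CodeSplitSketch`, `SPLIT_THRESHOLD`, the `splitN` method names and the `decode(Object row, Object message)` signature are all assumptions made for the example, and the threshold counts source characters only as a rough proxy.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: groups generated statements into small helper methods. */
final class CodeSplitSketch {
    // Hypothetical limit in source characters; the PR keeps its real value in PbConstant.
    static final int SPLIT_THRESHOLD = 120;

    /** Returns generated source text: one entry method plus any split helper methods. */
    static String generate(List<String> statements) {
        List<String> helpers = new ArrayList<>();
        StringBuilder entryBody = new StringBuilder();
        StringBuilder chunk = new StringBuilder();
        for (String stmt : statements) {
            // Close the current chunk before it would grow past the threshold.
            if (chunk.length() > 0 && chunk.length() + stmt.length() > SPLIT_THRESHOLD) {
                moveChunkToHelper(chunk, helpers, entryBody);
            }
            chunk.append("    ").append(stmt).append('\n');
        }
        if (helpers.isEmpty()) {
            // Everything fit under the threshold: keep a single method.
            entryBody.append(chunk);
        } else if (chunk.length() > 0) {
            moveChunkToHelper(chunk, helpers, entryBody);
        }
        StringBuilder out = new StringBuilder();
        out.append("static void decode(Object row, Object message) {\n")
                .append(entryBody)
                .append("}\n");
        for (String helper : helpers) {
            out.append(helper);
        }
        return out.toString();
    }

    /** Wraps the chunk in a new helper method and replaces it with a call in the entry method. */
    private static void moveChunkToHelper(
            StringBuilder chunk, List<String> helpers, StringBuilder entryBody) {
        String name = "split" + helpers.size();
        helpers.add("static void " + name + "(Object row, Object message) {\n" + chunk + "}\n");
        entryBody.append("    ").append(name).append("(row, message);\n");
        chunk.setLength(0);
    }

    public static void main(String[] args) {
        List<String> statements = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            statements.add("row.setField(" + i + ", message.getField" + i + "());");
        }
        // Prints the entry method followed by the generated helper methods.
        System.out.println(generate(statements));
    }
}
```

The entry method ends up as a short list of calls, and each helper holds only a few statements, so every method body stays small.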

Brief change log

Add a codegenSplit method to the PbCodegenDeserializer/PbCodegenSerializer interfaces
Add PbCodeSplitter to split Row type code
Implement the codegenSplit method in all PbCodegenDeserializer/PbCodegenSerializer implementations

Verifying this change

This change is already covered by existing tests.
A new unit test, BigProtoBufCodeSpiltterTest, is also added.

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (yes / no) no
The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no) no
The serializers: (yes / no / don't know) no
The runtime per-record code paths (performance sensitive): (yes / no / don't know) no
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know) no
The S3 file system connector: (yes / no / don't know) no

Documentation

Does this pull request introduce a new feature? (yes / no) yes
If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented) not applicable

flinkbot · 2023-08-08T03:10:30Z

CI report:

161ea09 Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure re-run the last Azure build

ljw-hit · 2023-08-13T02:40:00Z

@libenchao Hi, if you have time, could you help review this code?

libenchao · 2023-08-13T15:11:14Z

@maosuhan Are you interested in reviewing this?

maosuhan · 2023-08-15T14:27:48Z

@libenchao Sure, I will take it. Maybe it will take me about one week.

...uf/src/main/java/org/apache/flink/formats/protobuf/deserialize/PbCodegenRowDeserializer.java

maosuhan · 2023-09-13T09:05:14Z

@ljw-hit Hi, thanks for your effort. The code already looks to be in good shape to me. I have left a few comments about unit tests. Could you also provide a benchmark for this improvement? For example, how much time is saved when encoding/decoding 10M large rows after this change?

ljw-hit · 2023-09-15T06:08:17Z

@maosuhan I haven’t seen any comments about UT here. Have the comments been submitted?

maosuhan · 2023-09-13T08:52:32Z

...nk-protobuf/src/test/java/org/apache/flink/formats/protobuf/BigProtoBufCodeSpiltterTest.java

+ *
+ * <p>It is valid proto definition.
+ */
+public class BigProtoBufCodeSpiltterTest {


@ljw-hit I suggest that we retrieve the generated code from RowToProtoConverter or ProtoToRowConverter, then check it to make sure that the static split code exists.
Please also write complete deserialization/serialization tests to make sure that the data is processed correctly.

@maosuhan I already have complete deserialization/serialization tests. Thank you for carefully reviewing my code.
Are there any other issues with the current code? If not, could you ask a committer to merge it?

@ljw-hit It seems the test code cannot explicitly tell whether BigPbMessage is handled by the split or the non-split logic. Can you write a test to ensure that BigPbMessage is handled by the split logic? For example, check that the generated code contains the split code?

OK, this is no problem. I can add some explicit tests to indicate that the current code has been split.

@maosuhan Resolved. I use an isCodeSplit method to explicitly indicate that the current code has been split.

@ljw-hit Adding this checking flag looks good to me. I think this MR is in a good status now. Thanks for your effort.

@maosuhan Thank you very much for your review! @libenchao Can you approve this PR?

libenchao · 2023-10-16T09:49:44Z

@ljw-hit Thanks for your contribution, and @maosuhan thanks for the review. I'll try to give it a final review and merge it in the next two weeks (it's kind of busy this week).

ljw-hit · 2023-11-03T03:19:11Z

@libenchao Sorry to bother you, do you have time to do the last step of review and commit soon?

libenchao · 2023-11-03T04:36:55Z

Thanks for the patience, I'll review this next week, let's move forward to get it in.

libenchao · 2023-11-06T11:41:44Z

@flinkbot run azure

libenchao

@ljw-hit Thanks for your contribution, I've left my comments below.

libenchao · 2023-11-06T11:45:00Z

...k-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/PbCodeSplitter.java

+public class PbCodeSplitter {
+    private final List<String> splitMethodStack = new ArrayList<>();
+
+    public PbCodeSplitter() {}


No need to add an empty public default constructor.

libenchao · 2023-11-06T12:08:31Z

flink-formats/flink-protobuf/src/test/proto/test_big_pb.proto

+  float f_field_8 = 29;
+  bool f_field_9 = 30;
+  string f_field_10 = 31;
+  bytes f_field_11 = 32;


The field naming is not consistent (int_field, a_field_n, map_field). Can you normalize it to one pattern?

libenchao · 2023-11-06T12:13:03Z

...mats/flink-protobuf/src/test/java/org/apache/flink/formats/protobuf/BigPbRowToProtoTest.java

+        rowData.setField(9, false);
+        rowData.setField(10, 1F);
+        rowData.setField(11, 2D);
+        rowData.setField(12, new byte[] {1, 2, 3});


Can you set values for all fields? Then we can be confident that splitting does not affect correctness.

Hmm, thank you for your suggestion, I will fix it.

libenchao · 2023-11-06T12:20:32Z

...mats/flink-protobuf/src/test/java/org/apache/flink/formats/protobuf/BigPbRowToProtoTest.java

+     * Flink-Protobuf serialize codegen code size is 13999， over code threshold.
+     * So pbCodeSplitter split the code.


I think the comment is not really needed, since the test name and body already explain it. Besides, 13999 could easily go stale in future iterations.

libenchao · 2023-11-06T12:25:11Z

...mats/flink-protobuf/src/test/java/org/apache/flink/formats/protobuf/BigPbRowToProtoTest.java

+     * So pbCodeSplitter split the code.
+     */
+    @Test
+    public void testSerializeSplit() throws Exception {


How about testSplitInSerialization?

Hmm, thanks for the suggestion, I will adopt it

libenchao · 2023-11-06T12:54:53Z

...tobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/PbCodegenDeserializer.java

     * @param pbObjectCode may be a variable or expression. Current codegen environment can use this
-     *     literal name directly to access the input. {@code pbObject} should be a protobuf object
+     *     literal name directly to access the input. {@code pbGetStr} is a value coming from


I'm not sure why we should reference something that is not in the method signature, or why we should change resultVariable to returnInternalDataVarName and pbObject to pbGetStr.

libenchao · 2023-11-06T12:59:30Z

...tobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/PbCodegenDeserializer.java

-     *     literal name directly to access the input. {@code pbObject} should be a protobuf object
+     *     literal name directly to access the input. {@code pbGetStr} is a value coming from
+     *     protobuf object
+     * @param pbCodeSplitter when encode/decode method body over 4K, use PbCodeSplitter to Split


Actually the threshold is PbConstant.CODEGEN_SPLIT_THRESHOLD, not a hard-coded 4K.

libenchao · 2023-11-06T13:00:53Z

...buf/src/main/java/org/apache/flink/formats/protobuf/serialize/PbCodegenSimpleSerializer.java

@@ -41,8 +42,8 @@ public PbCodegenSimpleSerializer(
        this.formatContext = formatContext;
    }

-    @Override
-    public String codegen(String resultVar, String flinkObjectCode, int indent)
+    public String codegenSplit(


Please keep the @Override annotation.

libenchao · 2023-11-06T13:02:21Z

...-protobuf/src/main/java/org/apache/flink/formats/protobuf/serialize/PbCodegenSerializer.java

+     * @param internalDataGetStr may be a variable or expression. Current codegen environment can
+     *     use this literal name directly to access the input. {@code internalDataGetStr} is a value
+     *     coming from flink object.
+     * @param pbCodeSplitter when encode/decode method body over 4K, use PbCodeSplitter to Split


Comments in PbCodegenDeserializer also apply here.

libenchao · 2023-11-07T02:54:28Z

...k-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/PbCodeSplitter.java

+        return String.format("%s(%s, %s);", splitMethodName, rowDataVar, messageTypeVar);
+    }
+
+    public String splitSerializerRowTypeMethod(


splitSerializerRowTypeMethod and splitDeserializerRowTypeMethod share most of their code, so I'm wondering if we can reuse it.

Furthermore, I think these two methods are not actually necessary, and PbCodeSplitter is kind of confusing. Can we just use PbFormatContext and:

Add a final List<String> splitMethods = new ArrayList<>()

Add a method addCodeIntoMethod(String code)

And leave the rest to the caller, since there is only one caller of these two methods.

Then we can avoid introducing PbCodeSplitter everywhere.
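
A rough sketch of what this suggestion could look like follows. Only the `splitMethods` list and the `addCodeIntoMethod(String code)` name come from the comment above; the class name `FormatContextSketch`, the `splitN` naming, the fixed `(rowData, message)` signature and `getSplitMethodsCode()` are assumptions for illustration, not the real PbFormatContext.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative stand-in for a format context that collects split methods. */
final class FormatContextSketch {
    private final List<String> splitMethods = new ArrayList<>();

    /**
     * Wraps the given code in a new static method, remembers it, and returns the
     * statement that calls it. The method name and parameter list are hypothetical.
     */
    String addCodeIntoMethod(String code) {
        String name = "split" + splitMethods.size();
        splitMethods.add(
                "private static void " + name + "(Object rowData, Object message) {\n"
                        + code + "\n}\n");
        return name + "(rowData, message);";
    }

    /** Returns all collected methods so the caller can append them to the generated class. */
    String getSplitMethodsCode() {
        return String.join("\n", splitMethods);
    }

    public static void main(String[] args) {
        FormatContextSketch context = new FormatContextSketch();
        // The caller decides when a piece of generated code is large enough to move out.
        String call = context.addCodeIntoMethod("    // generated field handling goes here");
        System.out.println(call);
        System.out.println(context.getSplitMethodsCode());
    }
}
```

With this shape, the decision of when to split stays with the single caller, and the context only stores the extracted methods.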

Thank you for such targeted suggestions.

1. I have found a way to reuse the code, so I can resolve this part.
2. Regarding the second point, I am a little confused. Do you mean that PbCodeSplitter is not needed, and that all the code-split logic should go into PbFormatContext?

libenchao · 2023-11-07T03:57:07Z

flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/PbConstant.java

@@ -27,4 +27,10 @@ public class PbConstant {
    public static final String PB_MAP_KEY_NAME = "key";
    public static final String PB_MAP_VALUE_NAME = "value";
    public static final String PB_OUTER_CLASS_SUFFIX = "OuterClass";
+    /**
+     * JIT optimizer threshold is 8K, unicode encode one char use 2byte, so use 3K as


This is not correct: ASCII characters take only 1 byte in a Unicode encoding such as UTF-8.

Thank you for your suggestion, I will modify my comment. By the way, if 1 character corresponds to 1 byte, does this threshold need to be modified?
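
As a quick check of the byte counts under discussion, the sketch below measures one ASCII statement in two encodings. The class name `EncodedSizeSketch` and the sample statement are made up for the example. Note also that HotSpot's huge-method limit is measured on compiled bytecode, not on source characters, so a character-based threshold is only a proxy either way.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: compares character count with encoded byte counts. */
final class EncodedSizeSketch {
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    static int utf16Bytes(String s) {
        // UTF_16BE is used so that no byte-order mark is added to the count.
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // Hypothetical line of generated code, ASCII only.
        String generated = "rowData.setField(0, message.getA());";
        System.out.println("chars:  " + generated.length());
        System.out.println("UTF-8:  " + utf8Bytes(generated) + " bytes");
        System.out.println("UTF-16: " + utf16Bytes(generated) + " bytes");
    }
}
```

For ASCII-only generated code, the UTF-8 byte count equals the character count, while UTF-16 uses two bytes per character.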

ljw-hit · 2023-11-13T03:27:47Z

@libenchao Thank you very much for your code review. I learned a lot from it, and I have addressed all the comments. Please review again when you have time.

libenchao

@ljw-hit Thanks for the update. It generally looks good now; I've left a few more minor comments.

libenchao · 2023-11-13T03:50:36Z

...rotobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java

-            String genCode = codegenDes.codegen("rowData", "message", 0);
+            // if codgen generate code size over threshod then split the code
+            PbCodeSplitter pbCodeSplitter = new PbCodeSplitter();
+            LOG.info("Fast-pb generate split deserialize code");


ping, it seems you are missing this one.

libenchao · 2023-11-13T03:51:40Z

...-protobuf/src/main/java/org/apache/flink/formats/protobuf/serialize/RowToProtoConverter.java

@@ -109,4 +117,8 @@ public byte[] convertRowToProtoBinary(RowData rowData) throws Exception {
        AbstractMessage message = (AbstractMessage) encodeMethod.invoke(null, rowData);
        return message.toByteArray();
    }
+
+    public boolean isCodeSplit() {


for testing.

libenchao · 2023-11-13T03:52:06Z

...-protobuf/src/main/java/org/apache/flink/formats/protobuf/serialize/RowToProtoConverter.java

@@ -84,11 +85,18 @@ public RowToProtoConverter(RowType rowType, PbFormatConfig formatConfig)
            PbCodegenSerializer codegenSer =
                    PbCodegenSerializeFactory.getPbCodegenTopRowSer(
                            descriptor, rowType, formatContext);
+            LOG.info("Fast-pb generate split serialize code");


remove unnecessary log.

libenchao · 2023-11-13T03:53:54Z

flink-formats/flink-protobuf/src/test/proto/test_big_pb.proto

+
+  map<string, bytes> map_field_32 = 32;
+  map<string, string> map_field_33 = 33;
+}


Always end the file with a \n; then Git will not complain with "No newline at end of file".

ljw-hit · 2023-11-13T04:57:10Z

@libenchao Thank you for your detailed review work. I have solved these comments.

libenchao

+1, will merge after the CI is green. Thanks for your contribution @ljw-hit, and thanks for the review @maosuhan.

ljw-hit force-pushed the FLINK-32650 branch from 012134b to 1b39d49 Compare August 8, 2023 03:09

maosuhan reviewed Aug 24, 2023

ljw-hit force-pushed the FLINK-32650 branch 3 times, most recently from a567d7f to e609488 Compare September 12, 2023 08:19

ljw-hit requested a review from maosuhan September 15, 2023 02:17

maosuhan reviewed Sep 15, 2023

ljw-hit force-pushed the FLINK-32650 branch 2 times, most recently from 7bc8bed to 816de52 Compare October 5, 2023 07:58

ljw-hit force-pushed the FLINK-32650 branch from bb381e6 to 251064d Compare October 18, 2023 07:29

ljw-hit mentioned this pull request Nov 3, 2023

[FLINK-32738][formats] PROTOBUF format supports projection push down #23323

Open

libenchao reviewed Nov 7, 2023

ljw-hit force-pushed the FLINK-32650 branch from 251064d to 09f65c1 Compare November 13, 2023 03:24

[FLINK-32650][protobuf]Added the ability to split flink-protobuf cod…

b24d7a9

…egen code

ljw-hit force-pushed the FLINK-32650 branch from 09f65c1 to b24d7a9 Compare November 13, 2023 03:32

libenchao reviewed Nov 13, 2023

resolved comments

161ea09

libenchao approved these changes Nov 13, 2023

libenchao closed this in a2ec4c3 Nov 13, 2023

flinkbot added the component=Formats label Apr 4, 2024
