
[HUDI-8310] Fix the bug where the Flink table config hoodie.populate.meta.fields is not effective and optimize write performance #12065

Closed
usberkeley wants to merge 6 commits into apache:master from usberkeley:HUDI-8310

Conversation


@usberkeley usberkeley commented Oct 5, 2024

Change Logs

1. Fix the bug

  • Resolve missing hoodie.populate.meta.fields in Table Config (hoodie.properties)
  • Enhance validity checks

2. Optimize write performance

  • Reduce unnecessary decoding overhead when disabling meta fields

Impact

Improves write performance. After optimization, writes with hoodie.populate.meta.fields=false complete in 42.9% less time than with hoodie.populate.meta.fields=true.

Testing method
Consume from the earliest position in Kafka until all messages are consumed (Kafka Lag = 0), and compare the time taken for both.

1) populate meta fields
time taken: 21 hours 25 mins
(screenshot omitted)

2) no meta fields
time taken: 12 hours 14 mins
(screenshot omitted)
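As a sanity check, the quoted 42.9% matches the reported timings. This helper is illustrative arithmetic only, not part of the PR:

```java
// Illustrative arithmetic: verify that the reported timings
// (21 h 25 m with meta fields vs 12 h 14 m without) yield the
// quoted 42.9% reduction in total write time.
public class SpeedupCheck {
    public static void main(String[] args) {
        double withMetaMinutes = 21 * 60 + 25;  // 1285 minutes
        double noMetaMinutes = 12 * 60 + 14;    //  734 minutes
        double reductionPct = (withMetaMinutes - noMetaMinutes) / withMetaMinutes * 100;
        System.out.printf("time reduction: %.1f%%%n", reductionPct);  // prints 42.9%
    }
}
```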

Risk level (write none, low, medium or high below)

medium

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Oct 5, 2024
// before
HoodieRecord populatedRecord =
    finalRecord.prependMetaFields(schema, writeSchemaWithMetaFields, metadataValues, recordProperties);
// after
if (!metadataValues.isEmpty()) {
  populatedRecord = finalRecord.prependMetaFields(schema, writeSchemaWithMetaFields, metadataValues, recordProperties);
}
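The intent of this guard can be sketched standalone. The names below are simplified stand-ins for Hudi's `MetadataValues` and `HoodieRecord#prependMetaFields`, not the real API:

```java
// Minimal sketch of the optimization, assuming a simplified record model:
// when no metadata values were produced (the case with
// hoodie.populate.meta.fields=false), the record passes through unchanged
// instead of being decoded and rewritten.
import java.util.Collections;
import java.util.Map;

public class MetaFieldGuardSketch {
    // Stand-in for HoodieRecord#prependMetaFields: an expensive rewrite
    static String prependMetaFields(String record, Map<String, String> metadataValues) {
        return metadataValues + "|" + record;
    }

    static String write(String record, Map<String, String> metadataValues) {
        // After the fix: rewrite only when there is something to prepend
        return metadataValues.isEmpty() ? record : prependMetaFields(record, metadataValues);
    }

    public static void main(String[] args) {
        System.out.println(write("payload", Collections.emptyMap()));  // prints payload (no rewrite)
        System.out.println(write("payload", Map.of("_hoodie_commit_time", "t1")));
    }
}
```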
Contributor:

In #11028, we already fixed the unnecessary rewrite when the schemas are exactly the same, is your benchmark based on the fix then?

Contributor Author:

> In #11028, we already fixed the unnecessary rewrite when the schemas are exactly the same, is your benchmark based on the fix then?

Yes, our benchmark is already based on that fix (#11028)

// Check if operation metadata fields are allowed
if (writeConfig.allowOperationMetadataField()) {
  if (!writeConfig.populateMetaFields()) {
    throw new HoodieException("Operation metadata fields are allowed, but populateMetaFields is not enabled. "
Contributor:

Operation metadata fields are allowed -> Operation metadata field is allowed

Contributor Author:

> Operation metadata fields are allowed -> Operation metadata field is allowed

Could you please help explain why it is possible to enable the 'allow operation meta field' when the 'populate meta field' is turned off?

Contributor:

The operation field is introduced to mark the row-level operation type. For the use cases where metadata fields are disabled (pure inserts), I think it makes sense to also disable the operation field, because all the operation types should be I (insert).
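For context, the row-level operation marker the reviewer describes can be sketched as a small change-kind enum. This is illustrative only, not Hudi's actual HoodieOperation class:

```java
// Illustrative sketch: a row-level operation marker. In a pure-insert
// (append-only) pipeline every record carries the same INSERT kind, so
// persisting this field adds no information, which is why disabling it
// alongside the meta fields is reasonable.
import java.util.HashSet;
import java.util.Set;

public class OperationSketch {
    enum Operation {
        INSERT('I'), UPDATE('U'), DELETE('D');
        final char code;
        Operation(char code) { this.code = code; }
    }

    public static void main(String[] args) {
        // An append-only stream: every operation is INSERT
        Set<Character> kindsSeen = new HashSet<>();
        for (int i = 0; i < 3; i++) {
            kindsSeen.add(Operation.INSERT.code);
        }
        System.out.println(kindsSeen.size());  // prints 1: the field carries one value only
    }
}
```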

.defaultValue(true)
.withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
    // before:
    + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");
    // after:
    + "and incremental queries will not be functional. In the disabled state, the number of record key fields must be equal to one.");
Contributor @danny0405 commented Oct 7, 2024:

Why change the limitations?

Contributor Author:

> Why change the limitations?

Because when querying with the Log Scanner, records need to be merged based on the record key, and the original design only supports the Simple Key Generator (a single simple key field). This PR temporarily keeps the original design; whether the Complex Key Generator is also acceptable still needs to be verified. Therefore, we currently limit it to a single record key field.

The code for generating the record key is as follows:

// AbstractHoodieLogRecordReader#processDataBlock calls wrapIntoHoodieRecordPayloadWithParams to generate a HoodieRecord
HoodieRecord completedRecord = recordIterator.next()
    .wrapIntoHoodieRecordPayloadWithParams(recordsIteratorSchemaPair.getRight(),
        hoodieTableMetaClient.getTableConfig().getProps(),
        recordKeyPartitionPathFieldPair,
        this.withOperationField,
        this.partitionNameOverrideOpt,
        populateMetaFields,
        Option.empty());

// HoodieAvroRecord#wrapIntoHoodieRecordPayloadWithParams; simpleKeyGenFieldsOpt indicates that the record key can be generated from the record content
public HoodieRecord wrapIntoHoodieRecordPayloadWithParams(
    Schema recordSchema, Properties props,
    Option<Pair<String, String>> simpleKeyGenFieldsOpt,
    Boolean withOperation,
    Option<String> partitionNameOp,
    Boolean populateMetaFields,
    Option<Schema> schemaWithoutMetaFields) throws IOException {
  IndexedRecord indexedRecord = (IndexedRecord) data.getInsertValue(recordSchema, props).get();
  String payloadClass = ConfigUtils.getPayloadClass(props);
  String preCombineField = ConfigUtils.getOrderingField(props);
  return HoodieAvroUtils.createHoodieRecordFromAvro(indexedRecord, payloadClass, preCombineField, simpleKeyGenFieldsOpt, withOperation, partitionNameOp, populateMetaFields, schemaWithoutMetaFields);
}

Contributor:

Yeah, it looks like there is a limitation in the code:

public static <R> HoodieRecord<R> convertToHoodieRecordPayload(GenericRecord record, String payloadClazz,
                                                                 String preCombineField,
                                                                 Pair<String, String> recordKeyPartitionPathFieldPair,
                                                                 boolean withOperationField,
                                                                 Option<String> partitionName,
                                                                 Option<Schema> schemaWithoutMetaFields) {
    final String recKey = record.get(recordKeyPartitionPathFieldPair.getKey()).toString();
    final String partitionPath = (partitionName.isPresent() ? partitionName.get() :
        record.get(recordKeyPartitionPathFieldPair.getRight()).toString());

I think we should support composite record keys for non-metadata fields use cases in the near future.
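A composite key for this path could be sketched as below. This is a hypothetical illustration of the direction the reviewer suggests, joining several key fields in the "field1:value1,field2:value2" style that Hudi's complex key generators produce; it is not the PR's code:

```java
// Hypothetical sketch of composite record-key extraction for the
// no-meta-fields path: instead of reading one hard-coded key field,
// join the configured key fields into a single string key.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CompositeKeySketch {
    static String recordKey(Map<String, Object> record, List<String> recordKeyFields) {
        return recordKeyFields.stream()
            .map(f -> f + ":" + record.get(f))
            .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        Map<String, Object> record = Map.of("user_id", 42, "region", "eu");
        System.out.println(recordKey(record, List.of("user_id", "region")));
        // prints user_id:42,region:eu
    }
}
```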

.setTableVersion(conf.getInteger(FlinkOptions.WRITE_TABLE_VERSION))
.setDatabaseName(conf.getString(FlinkOptions.DATABASE_NAME))
// before
.setRecordKeyFields(conf.getString(FlinkOptions.RECORD_KEY_FIELD, null))
// after
.setRecordKeyFields(conf.getString(FlinkOptions.RECORD_KEY_FIELD))
Contributor:

Why change the default value? This may not work for the pk-less table scenario.
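The concern can be illustrated with a toy config holder. The names are hypothetical, assuming RECORD_KEY_FIELD declares a non-null default (e.g. "uuid"): the two-argument form returns the caller's fallback (null for a pk-less table) when the option is unset, while the one-argument form falls back to the option's declared default:

```java
// Toy illustration: dropping the explicit null fallback means an unset
// record key silently resolves to the option's default instead of null,
// so a primary-key-less table can no longer be detected downstream.
import java.util.HashMap;
import java.util.Map;

public class ConfSketch {
    record Opt(String key, String defaultValue) {}

    private final Map<String, String> values = new HashMap<>();

    String getString(Opt option) {                   // falls back to the option's default
        return values.getOrDefault(option.key(), option.defaultValue());
    }

    String getString(Opt option, String fallback) {  // caller-supplied fallback
        return values.getOrDefault(option.key(), fallback);
    }

    public static void main(String[] args) {
        Opt recordKeyField = new Opt("hoodie.datasource.write.recordkey.field", "uuid");
        ConfSketch conf = new ConfSketch();          // option left unset: pk-less table
        System.out.println(conf.getString(recordKeyField));        // prints uuid
        System.out.println(conf.getString(recordKeyField, null));  // prints null
    }
}
```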

@usberkeley usberkeley marked this pull request as draft October 29, 2024 10:47
Collaborator @hudi-bot:

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@usberkeley usberkeley closed this Dec 3, 2024