
[HUDI-8310] Fix the bug where the Flink table config hoodie.populate.meta.fields is not effective and optimize write performance #12065

Closed
usberkeley wants to merge 6 commits into apache:master from usberkeley:HUDI-8310

Conversation


@usberkeley usberkeley commented Oct 5, 2024

Change Logs

1. Fix the bug

  • Resolve missing hoodie.populate.meta.fields in Table Config (hoodie.properties)
  • Enhance validity checks

2. Optimize write performance

  • Reduce unnecessary decoding overhead when disabling meta fields

Impact

Improves write performance. After optimization, writes with hoodie.populate.meta.fields=false complete in 42.9% less time than with hoodie.populate.meta.fields=true.

Testing method
Consume from the earliest position in Kafka until all messages are consumed (Kafka Lag = 0), and compare the time taken for both.

1) populate meta fields
time taken: 21 hours 25 mins
(screenshot omitted)

2) no meta fields
time taken: 12 hours 14 mins
(screenshot omitted)
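As a sanity check, the quoted 42.9% matches the reported timings. This helper is illustrative arithmetic only, not part of the PR:

```java
// Illustrative arithmetic: verify that the reported timings
// (21 h 25 m with meta fields vs 12 h 14 m without) yield the
// quoted 42.9% reduction in total write time.
public class SpeedupCheck {
    public static void main(String[] args) {
        double withMetaMinutes = 21 * 60 + 25;  // 1285 minutes
        double noMetaMinutes = 12 * 60 + 14;    //  734 minutes
        double reductionPct = (withMetaMinutes - noMetaMinutes) / withMetaMinutes * 100;
        System.out.printf("time reduction: %.1f%%%n", reductionPct);  // prints 42.9%
    }
}
```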

Risk level (write none, low, medium or high below)

medium

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:L PR with lines of changes in (300, 1000] label Oct 5, 2024
// before
HoodieRecord populatedRecord =
    finalRecord.prependMetaFields(schema, writeSchemaWithMetaFields, metadataValues, recordProperties);
// after
if (!metadataValues.isEmpty()) {
  populatedRecord = finalRecord.prependMetaFields(schema, writeSchemaWithMetaFields, metadataValues, recordProperties);
}
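The intent of this guard can be sketched standalone. The names below are simplified stand-ins for Hudi's `MetadataValues` and `HoodieRecord#prependMetaFields`, not the real API:

```java
// Minimal sketch of the optimization, assuming a simplified record model:
// when no metadata values were produced (the case with
// hoodie.populate.meta.fields=false), the record passes through unchanged
// instead of being decoded and rewritten.
import java.util.Collections;
import java.util.Map;

public class MetaFieldGuardSketch {
    // Stand-in for HoodieRecord#prependMetaFields: an expensive rewrite
    static String prependMetaFields(String record, Map<String, String> metadataValues) {
        return metadataValues + "|" + record;
    }

    static String write(String record, Map<String, String> metadataValues) {
        // After the fix: rewrite only when there is something to prepend
        return metadataValues.isEmpty() ? record : prependMetaFields(record, metadataValues);
    }

    public static void main(String[] args) {
        System.out.println(write("payload", Collections.emptyMap()));  // prints payload (no rewrite)
        System.out.println(write("payload", Map.of("_hoodie_commit_time", "t1")));
    }
}
```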
Contributor:

In #11028, we already fixed the unnecessary rewrite when the schemas are exactly the same, is your benchmark based on the fix then?

Contributor Author:

> In #11028, we already fixed the unnecessary rewrite when the schemas are exactly the same, is your benchmark based on the fix then?

Yes, our benchmark is already based on that fix (#11028)

// Check if operation metadata fields are allowed
if (writeConfig.allowOperationMetadataField()) {
  if (!writeConfig.populateMetaFields()) {
    throw new HoodieException("Operation metadata fields are allowed, but populateMetaFields is not enabled. "
Contributor:

Operation metadata fields are allowed -> Operation metadata field is allowed

Contributor Author:

> Operation metadata fields are allowed -> Operation metadata field is allowed

Could you please help explain why it is possible to enable the 'allow operation meta field' when the 'populate meta field' is turned off?

Contributor:

The operation field is introduced to mark the row-level operation type. For the use cases where metadata fields are disabled (pure inserts), I think it makes sense to also disable the operation field, because all the operation types should be I (insert).
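For context, the row-level operation marker the reviewer describes can be sketched as a small change-kind enum. This is illustrative only, not Hudi's actual HoodieOperation class:

```java
// Illustrative sketch: a row-level operation marker. In a pure-insert
// (append-only) pipeline every record carries the same INSERT kind, so
// persisting this field adds no information, which is why disabling it
// alongside the meta fields is reasonable.
import java.util.HashSet;
import java.util.Set;

public class OperationSketch {
    enum Operation {
        INSERT('I'), UPDATE('U'), DELETE('D');
        final char code;
        Operation(char code) { this.code = code; }
    }

    public static void main(String[] args) {
        // An append-only stream: every operation is INSERT
        Set<Character> kindsSeen = new HashSet<>();
        for (int i = 0; i < 3; i++) {
            kindsSeen.add(Operation.INSERT.code);
        }
        System.out.println(kindsSeen.size());  // prints 1: the field carries one value only
    }
}
```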

.defaultValue(true)
.withDocumentation("When enabled, populates all meta fields. When disabled, no meta fields are populated "
    // before:
    + "and incremental queries will not be functional. This is only meant to be used for append only/immutable data for batch processing");
    // after:
    + "and incremental queries will not be functional. In the disabled state, the number of record key fields must be equal to one.");
Contributor @danny0405 commented Oct 7, 2024:

Why change the limitations?

Contributor Author:

> Why change the limitations?

Because when querying with the Log Scanner, records need to be merged based on the record key, and the original design only supports the Simple Key Generator (a single simple key field). This PR temporarily keeps the original design; whether the Complex Key Generator is also acceptable still needs to be verified. Therefore, we currently limit it to a single record key field.

The code for generating the record key is as follows:

// AbstractHoodieLogRecordReader#processDataBlock calls wrapIntoHoodieRecordPayloadWithParams to generate a HoodieRecord
HoodieRecord completedRecord = recordIterator.next()
    .wrapIntoHoodieRecordPayloadWithParams(recordsIteratorSchemaPair.getRight(),
        hoodieTableMetaClient.getTableConfig().getProps(),
        recordKeyPartitionPathFieldPair,
        this.withOperationField,
        this.partitionNameOverrideOpt,
        populateMetaFields,
        Option.empty());

// HoodieAvroRecord#wrapIntoHoodieRecordPayloadWithParams; simpleKeyGenFieldsOpt indicates that the record key can be generated from the record content
public HoodieRecord wrapIntoHoodieRecordPayloadWithParams(
    Schema recordSchema, Properties props,
    Option<Pair<String, String>> simpleKeyGenFieldsOpt,
    Boolean withOperation,
    Option<String> partitionNameOp,
    Boolean populateMetaFields,
    Option<Schema> schemaWithoutMetaFields) throws IOException {
  IndexedRecord indexedRecord = (IndexedRecord) data.getInsertValue(recordSchema, props).get();
  String payloadClass = ConfigUtils.getPayloadClass(props);
  String preCombineField = ConfigUtils.getOrderingField(props);
  return HoodieAvroUtils.createHoodieRecordFromAvro(indexedRecord, payloadClass, preCombineField, simpleKeyGenFieldsOpt, withOperation, partitionNameOp, populateMetaFields, schemaWithoutMetaFields);
}

Contributor:

Yeah, it looks like there is a limitation in the code:

public static <R> HoodieRecord<R> convertToHoodieRecordPayload(GenericRecord record, String payloadClazz,
                                                                 String preCombineField,
                                                                 Pair<String, String> recordKeyPartitionPathFieldPair,
                                                                 boolean withOperationField,
                                                                 Option<String> partitionName,
                                                                 Option<Schema> schemaWithoutMetaFields) {
    final String recKey = record.get(recordKeyPartitionPathFieldPair.getKey()).toString();
    final String partitionPath = (partitionName.isPresent() ? partitionName.get() :
        record.get(recordKeyPartitionPathFieldPair.getRight()).toString());

I think we should support composite record keys for non-metadata fields use cases in the near future.
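A composite key for this path could be sketched as below. This is a hypothetical illustration of the direction the reviewer suggests, joining several key fields in the "field1:value1,field2:value2" style that Hudi's complex key generators produce; it is not the PR's code:

```java
// Hypothetical sketch of composite record-key extraction for the
// no-meta-fields path: instead of reading one hard-coded key field,
// join the configured key fields into a single string key.
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CompositeKeySketch {
    static String recordKey(Map<String, Object> record, List<String> recordKeyFields) {
        return recordKeyFields.stream()
            .map(f -> f + ":" + record.get(f))
            .collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        Map<String, Object> record = Map.of("user_id", 42, "region", "eu");
        System.out.println(recordKey(record, List.of("user_id", "region")));
        // prints user_id:42,region:eu
    }
}
```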

.setTableVersion(conf.getInteger(FlinkOptions.WRITE_TABLE_VERSION))
.setDatabaseName(conf.getString(FlinkOptions.DATABASE_NAME))
// before
.setRecordKeyFields(conf.getString(FlinkOptions.RECORD_KEY_FIELD, null))
// after
.setRecordKeyFields(conf.getString(FlinkOptions.RECORD_KEY_FIELD))
Contributor:

Why change the default value? This may not work for the pk-less table scenario.
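The concern can be illustrated with a toy config holder. The names are hypothetical, assuming RECORD_KEY_FIELD declares a non-null default (e.g. "uuid"): the two-argument form returns the caller's fallback (null for a pk-less table) when the option is unset, while the one-argument form falls back to the option's declared default:

```java
// Toy illustration: dropping the explicit null fallback means an unset
// record key silently resolves to the option's default instead of null,
// so a primary-key-less table can no longer be detected downstream.
import java.util.HashMap;
import java.util.Map;

public class ConfSketch {
    record Opt(String key, String defaultValue) {}

    private final Map<String, String> values = new HashMap<>();

    String getString(Opt option) {                   // falls back to the option's default
        return values.getOrDefault(option.key(), option.defaultValue());
    }

    String getString(Opt option, String fallback) {  // caller-supplied fallback
        return values.getOrDefault(option.key(), fallback);
    }

    public static void main(String[] args) {
        Opt recordKeyField = new Opt("hoodie.datasource.write.recordkey.field", "uuid");
        ConfSketch conf = new ConfSketch();          // option left unset: pk-less table
        System.out.println(conf.getString(recordKeyField));        // prints uuid
        System.out.println(conf.getString(recordKeyField, null));  // prints null
    }
}
```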

@usberkeley usberkeley marked this pull request as draft October 29, 2024 10:47
Collaborator @hudi-bot:

CI report:

Bot commands
@hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@usberkeley usberkeley closed this Dec 3, 2024