Skip to content

perf: optimize removeCommitMetadata method in HoodieCDCLogger#17669

Merged
wombatu-kun merged 2 commits intoapache:masterfrom
kamronis:master
Dec 23, 2025
Merged

perf: optimize removeCommitMetadata method in HoodieCDCLogger#17669
wombatu-kun merged 2 commits intoapache:masterfrom
kamronis:master

Conversation

@kamronis
Copy link
Contributor

Describe the issue this Pull Request addresses

Around 35-40% of put time in HoodieCDCLogger is removeCommitMetadata. This is because for each record schema comparison is called.
Before:
Selection_074

After:
Selection_076

Summary and Changelog

Added the logic to construct record based on schema without heavy comparison.

Impact

Performance improve for CDC.

Risk Level

None

Documentation Update

None

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Dec 22, 2025
Copy link
Contributor

@yihua yihua left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could the perf issue be fixed by #17672 ?

@yihua
Copy link
Contributor

yihua commented Dec 23, 2025

cc @voonhous

@kamronis
Copy link
Contributor Author

kamronis commented Dec 23, 2025

Could the perf issue be fixed by #17672 ?

#17672 PR is about increasing performance of schema comparing. I suggest not to compare schema at all during metadata removal.
Also, I faced this issue on 0.15 branch before HoodieSchema was introduced

@voonhous
Copy link
Member

voonhous commented Dec 23, 2025

@yihua This PR should be different form the O(1) comparison #17672, which only affects workflows that involves HoodieMetadataPayload.

For this PR, the main optimization here is avoiding the costly recursive schema checks. To be specific, performance improvements here comes from replacing the highly generic, recursive, and safety-heavy utility method with a specialized, flat, and shallow implementation.

Less recursion:

Old way:
It uses a recursive switch statement. For every field, it calls rewriteRecordWithNewSchema again. If you have a record with 20 fields, that’s 20+ method calls, stack pushes/pops, and type check.

New Way:
The new getRecordWithoutMetadata performs a single, flat for loop over the top-level fields of the schema. Since CDC data stripping usually only happens at the top level, this avoids the overhead of traversing the entire object tree.

I believe it also alleviates GC pressure:

Old Way:

The utility creates several helper objects for every record processed:

  • A Deque<String> fieldNames to track the breadcrumb path (mainly for error reporting IIUC).
  • Map<String, String> renameCols (even if empty).
  • Multiple string concatenations for createNamePrefix and createFullName

New Way:

  • It creates exactly one GenericData.Record. No collections, no iterators, and no string builders are initialized per record.

CMIIW @kamronis, the performance optimization here should see the most increase for records that are deeply nested.

I'm wondering if we can put this function into a utility class so others can use this.

/**
 * Projects a record to a new schema by performing a shallow copy of fields.
 * Best used for removing top-level metadata fields.
 * <p>
 * This is a high-performance alternative to deep rewriting. It only iterates through 
 * the top-level fields of the target schema and pulls values from the source record 
 * by field name.
 * <p>
 * <p>
 * This is significantly faster than {@link #rewriteRecordWithNewSchema} for:
 * 1. Wide records (many top-level fields): Reduces CPU overhead/recursion.
 * 2. Deeply nested records: Uses reference-copying for nested structures instead of rebuilding them.
 * <p>
 * <b>Warning:</b> This method does not recursively rewrite/transform nested records, arrays, 
 * or maps. It assumes that the underlying values for each field are already 
 * compatible with the target schema.
 *
 * @param record      The source GenericRecord to project.
 * @param targetSchema The schema to project the record into.
 * @return A new GenericRecord matching targetSchema, or the original record if 
 * the schemas are identical in field count.
 */

@kamronis
Copy link
Contributor Author

Hi @voonhous! Thank you for reply.
You are right, nested field will see much performance improve this way.
And we can put this in utility class, but which one should I put it in?
I thought HoodieAvroUtils, but I see you are working on its removal now.

return record == null ? null : getRecordWithoutMetadata(record);
}

private GenericRecord getRecordWithoutMetadata(GenericRecord record) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we can do this because there is prerequisite that no fields reordering or renaming, maybe we should check all the usages of HoodieAvroUtils.rewriteRecordWithNewSchema and replace it with this more performant way if it is the similiar use case.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fire an issue to trace this: #17679

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danny0405 I can take this task. Please assign to me

@wombatu-kun
Copy link
Contributor

@hudi-bot run azure

@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@voonhous
Copy link
Member

Hi @voonhous! Thank you for reply. You are right, nested field will see much performance improve this way. And we can put this in utility class, but which one should I put it in? I thought HoodieAvroUtils, but I see you are working on its removal now.

HoodieAvroUtils makes sense. We can add it in there first as it's a direct manipulation on a Avro GenericRecord. While we are actively removing it, HoodieSchema is still delegating to it. So, full removal will not happen so quickly.

As of now, i feel the most apt and suitable area to place it.

@wombatu-kun
Copy link
Contributor

HoodieAvroUtils makes sense. We can add it in there first as it's a direct manipulation on a Avro GenericRecord. While we are actively removing it, HoodieSchema is still delegating to it. So, full removal will not happen so quickly.

I think, we can merge this PR as @kamronis proposed. And make refactoring under #17679

@wombatu-kun wombatu-kun merged commit 39ae177 into apache:master Dec 23, 2025
138 of 141 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size:S PR with lines of changes in (10, 100]

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants