perf: optimize removeCommitMetadata method in HoodieCDCLogger by kamronis · Pull Request #17669 · apache/hudi

kamronis · 2025-12-22T10:28:27Z

Describe the issue this Pull Request addresses

Around 35-40% of put time in HoodieCDCLogger is removeCommitMetadata. This is because for each record schema comparison is called.
Before:

After:

Summary and Changelog

Added the logic to construct record based on schema without heavy comparison.

Impact

Performance improve for CDC.

Risk Level

None

Documentation Update

None

Contributor's checklist

Read through contributor's guide
Enough context is provided in the sections above
Adequate tests were added if applicable

yihua

Could the perf issue be fixed by #17672 ?

yihua · 2025-12-23T00:32:45Z

cc @voonhous

kamronis · 2025-12-23T02:01:56Z

Could the perf issue be fixed by #17672 ?

#17672 PR is about increasing performance of schema comparing. I suggest not to compare schema at all during metadata removal.
Also, I faced this issue on 0.15 branch before HoodieSchema was introduced

voonhous · 2025-12-23T03:08:25Z

@yihua This PR should be different form the O(1) comparison #17672, which only affects workflows that involves HoodieMetadataPayload.

For this PR, the main optimization here is avoiding the costly recursive schema checks. To be specific, performance improvements here comes from replacing the highly generic, recursive, and safety-heavy utility method with a specialized, flat, and shallow implementation.

Less recursion:

Old way:
It uses a recursive switch statement. For every field, it calls rewriteRecordWithNewSchema again. If you have a record with 20 fields, that’s 20+ method calls, stack pushes/pops, and type check.

New Way:
The new getRecordWithoutMetadata performs a single, flat for loop over the top-level fields of the schema. Since CDC data stripping usually only happens at the top level, this avoids the overhead of traversing the entire object tree.

I believe it also alleviates GC pressure:

Old Way:

The utility creates several helper objects for every record processed:

A Deque<String> fieldNames to track the breadcrumb path (mainly for error reporting IIUC).
Map<String, String> renameCols (even if empty).
Multiple string concatenations for createNamePrefix and createFullName

New Way:

It creates exactly one GenericData.Record. No collections, no iterators, and no string builders are initialized per record.

CMIIW @kamronis, the performance optimization here should see the most increase for records that are deeply nested.

I'm wondering if we can put this function into a utility class so others can use this.

/**
 * Projects a record to a new schema by performing a shallow copy of fields.
 * Best used for removing top-level metadata fields.
 * <p>
 * This is a high-performance alternative to deep rewriting. It only iterates through 
 * the top-level fields of the target schema and pulls values from the source record 
 * by field name.
 * <p>
 * <p>
 * This is significantly faster than {@link #rewriteRecordWithNewSchema} for:
 * 1. Wide records (many top-level fields): Reduces CPU overhead/recursion.
 * 2. Deeply nested records: Uses reference-copying for nested structures instead of rebuilding them.
 * <p>
 * <b>Warning:</b> This method does not recursively rewrite/transform nested records, arrays, 
 * or maps. It assumes that the underlying values for each field are already 
 * compatible with the target schema.
 *
 * @param record      The source GenericRecord to project.
 * @param targetSchema The schema to project the record into.
 * @return A new GenericRecord matching targetSchema, or the original record if 
 * the schemas are identical in field count.
 */

kamronis · 2025-12-23T04:11:16Z

Hi @voonhous! Thank you for reply.
You are right, nested field will see much performance improve this way.
And we can put this in utility class, but which one should I put it in?
I thought HoodieAvroUtils, but I see you are working on its removal now.

danny0405 · 2025-12-23T04:11:42Z

hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieCDCLogger.java

+    return record == null ? null : getRecordWithoutMetadata(record);
+  }
+
+  private GenericRecord getRecordWithoutMetadata(GenericRecord record) {


yes, we can do this because there is prerequisite that no fields reordering or renaming, maybe we should check all the usages of HoodieAvroUtils.rewriteRecordWithNewSchema and replace it with this more performant way if it is the similiar use case.

fire an issue to trace this: #17679

@danny0405 I can take this task. Please assign to me

wombatu-kun · 2025-12-23T05:13:14Z

@hudi-bot run azure

hudi-bot · 2025-12-23T06:23:45Z

CI report:

a123e79 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure re-run the last Azure build

voonhous · 2025-12-23T07:50:35Z

Hi @voonhous! Thank you for reply. You are right, nested field will see much performance improve this way. And we can put this in utility class, but which one should I put it in? I thought HoodieAvroUtils, but I see you are working on its removal now.

HoodieAvroUtils makes sense. We can add it in there first as it's a direct manipulation on a Avro GenericRecord. While we are actively removing it, HoodieSchema is still delegating to it. So, full removal will not happen so quickly.

As of now, i feel the most apt and suitable area to place it.

wombatu-kun · 2025-12-23T08:04:14Z

HoodieAvroUtils makes sense. We can add it in there first as it's a direct manipulation on a Avro GenericRecord. While we are actively removing it, HoodieSchema is still delegating to it. So, full removal will not happen so quickly.

I think, we can merge this PR as @kamronis proposed. And make refactoring under #17679

kamronis added 2 commits December 22, 2025 17:24

perf: optimize removeCommitMetadata method in HoodieCDCLogger

4b07511

fix: checkstyle

a123e79

github-actions bot added the size:S PR with lines of changes in (10, 100] label Dec 22, 2025

yihua reviewed Dec 23, 2025

View reviewed changes

danny0405 reviewed Dec 23, 2025

View reviewed changes

wombatu-kun approved these changes Dec 23, 2025

View reviewed changes

wombatu-kun merged commit 39ae177 into apache:master Dec 23, 2025
138 of 141 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: optimize removeCommitMetadata method in HoodieCDCLogger#17669

perf: optimize removeCommitMetadata method in HoodieCDCLogger#17669
wombatu-kun merged 2 commits intoapache:masterfrom
kamronis:master

kamronis commented Dec 22, 2025

Uh oh!

yihua left a comment

Uh oh!

yihua commented Dec 23, 2025

Uh oh!

kamronis commented Dec 23, 2025 •

edited

Loading

Uh oh!

voonhous commented Dec 23, 2025 •

edited

Loading

Uh oh!

kamronis commented Dec 23, 2025

Uh oh!

danny0405 Dec 23, 2025

Uh oh!

danny0405 Dec 23, 2025

Uh oh!

kamronis Dec 23, 2025

Uh oh!

wombatu-kun commented Dec 23, 2025

Uh oh!

hudi-bot commented Dec 23, 2025

Uh oh!

voonhous commented Dec 23, 2025

Uh oh!

wombatu-kun commented Dec 23, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Conversation

kamronis commented Dec 22, 2025

Describe the issue this Pull Request addresses

Summary and Changelog

Impact

Risk Level

Documentation Update

Contributor's checklist

Uh oh!

yihua left a comment

Choose a reason for hiding this comment

Uh oh!

yihua commented Dec 23, 2025

Uh oh!

kamronis commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

voonhous commented Dec 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Less recursion:

I believe it also alleviates GC pressure:

Uh oh!

kamronis commented Dec 23, 2025

Uh oh!

danny0405 Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

danny0405 Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

kamronis Dec 23, 2025

Choose a reason for hiding this comment

Uh oh!

wombatu-kun commented Dec 23, 2025

Uh oh!

hudi-bot commented Dec 23, 2025

CI report:

Uh oh!

voonhous commented Dec 23, 2025

Uh oh!

wombatu-kun commented Dec 23, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

kamronis commented Dec 23, 2025 •

edited

Loading

voonhous commented Dec 23, 2025 •

edited

Loading