perf: optimize removeCommitMetadata method in HoodieCDCLogger#17669
wombatu-kun merged 2 commits into apache:master
cc @voonhous
@yihua This PR should be different from the O(1) comparison in #17672, which only affects workflows that involve

For this PR, the main optimization is avoiding the costly recursive schema checks. To be specific, the performance improvement comes from replacing the highly generic, recursive, and safety-heavy utility method with a specialized, flat, and shallow implementation.

Less recursion:
Old way:
New way:

I believe it also alleviates GC pressure:
Old way: The utility creates several helper objects for every record processed:
New way:

CMIIW @kamronis, the gain here should be largest for records that are deeply nested. I'm wondering if we can put this function into a utility class so others can use it.

/**
* Projects a record to a new schema by performing a shallow copy of fields.
* Best used for removing top-level metadata fields.
* <p>
* This is a high-performance alternative to deep rewriting. It only iterates through
* the top-level fields of the target schema and pulls values from the source record
* by field name.
* <p>
* This is significantly faster than {@link #rewriteRecordWithNewSchema} for:
* 1. Wide records (many top-level fields): Reduces CPU overhead/recursion.
* 2. Deeply nested records: Uses reference-copying for nested structures instead of rebuilding them.
* <p>
* <b>Warning:</b> This method does not recursively rewrite/transform nested records, arrays,
* or maps. It assumes that the underlying values for each field are already
* compatible with the target schema.
*
* @param record The source GenericRecord to project.
* @param targetSchema The schema to project the record into.
* @return A new GenericRecord matching targetSchema, or the original record if
* the schemas are identical in field count.
*/
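The behaviour described in the Javadoc above can be sketched in plain Java. This is only an illustration under stated assumptions, not the code from this PR: the record is modelled as a `Map<String, Object>` and the target schema as an ordered list of top-level field names, and the names `ShallowProjection` and `projectShallow` are hypothetical. A real implementation would work on Avro `GenericRecord` and `Schema` instead.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the proposed utility; not part of Hudi or Avro.
final class ShallowProjection {

    // Projects `source` onto `targetFields` by copying top-level values by name.
    // Assumption: a Map models the record, a List of names models the target schema.
    static Map<String, Object> projectShallow(Map<String, Object> source, List<String> targetFields) {
        // Same top-level field count: treat the record as already projected
        // and return it unchanged, mirroring the @return note in the Javadoc.
        if (source.size() == targetFields.size()) {
            return source;
        }
        Map<String, Object> projected = new LinkedHashMap<>();
        for (String name : targetFields) {
            // Reference copy only: nested maps and lists are shared with the
            // source record, never traversed or rebuilt.
            projected.put(name, source.get(name));
        }
        return projected;
    }

    public static void main(String[] args) {
        Map<String, Object> address = new LinkedHashMap<>();
        address.put("city", "Hanoi");

        Map<String, Object> record = new LinkedHashMap<>();
        record.put("_hoodie_commit_time", "20240101000000");
        record.put("id", 1);
        record.put("address", address);

        // Drop the metadata field by projecting onto the data fields only.
        Map<String, Object> projected = projectShallow(record, List.of("id", "address"));
        System.out.println(projected.keySet());
    }
}
```

Because only references are copied, the cost is proportional to the number of top-level fields and does not depend on nesting depth. The trade-off is the one the warning in the Javadoc names: nested values are not checked or transformed, so they must already match the target schema.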
Hi @voonhous! Thank you for the reply.
    return record == null ? null : getRecordWithoutMetadata(record);
  }

  private GenericRecord getRecordWithoutMetadata(GenericRecord record) {
Yes, we can do this because there is a prerequisite that no fields are reordered or renamed. Maybe we should check all the usages of HoodieAvroUtils.rewriteRecordWithNewSchema and replace them with this more performant approach where the use case is similar.
@danny0405 I can take this task. Please assign it to me.
@hudi-bot run azure
As of now, I feel this is the most apt and suitable place for it.
I think we can merge this PR as @kamronis proposed and do the refactoring under #17679.
Describe the issue this Pull Request addresses
Around 35-40% of the put time in HoodieCDCLogger is spent in removeCommitMetadata, because a schema comparison is performed for every record.

Before:
After:

Summary and Changelog
Added logic to construct the record from the schema without the heavy comparison.
Impact
Improved performance for CDC.
Risk Level
None
Documentation Update
None
Contributor's checklist