[core] Fix dedicated-format bundle write path #7598
Can you explain what went wrong?
The old dedicated-format [...] That becomes problematic in the dedicated fan-out path, where one logical bundle has to be written to projected main/blob/vector writers. This patch makes those constraints explicit: replayable bundles can be passed through safely, non-replayable bundles are materialized once, and dedicated row-data writers preserve the row-level side effects after bundle writes.
So the issue was not that the old row-by-row fallback always corrupted data. The issue was that the dedicated-format bundle path could not preserve bundle semantics safely and correctly once a single bundle needed to be fanned out to multiple writers.
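The pass-through versus materialize-once rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the type names follow the discussion, but every signature here (rows as `Object[]`, writers as `Consumer`, the `ensureReplayable` and `fanOut` helpers) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical stand-ins: names follow the discussion, signatures are assumptions.
interface BundleRecords extends Iterable<Object[]> {
    long rowCount();
}

// Opt-in marker: iterator() may be called repeatedly and yields the same rows.
interface ReplayableBundleRecords extends BundleRecords {}

// Copies a one-shot bundle into memory so it can be iterated many times.
final class MaterializedBundleRecords implements ReplayableBundleRecords {
    private final List<Object[]> rows = new ArrayList<>();

    MaterializedBundleRecords(BundleRecords source) {
        for (Object[] row : source) {
            rows.add(row.clone());
        }
    }

    @Override
    public Iterator<Object[]> iterator() {
        return rows.iterator();
    }

    @Override
    public long rowCount() {
        return rows.size();
    }
}

final class FanOutSketch {
    // Replayable bundles pass through untouched; anything else is copied exactly once.
    static ReplayableBundleRecords ensureReplayable(BundleRecords bundle) {
        if (bundle instanceof ReplayableBundleRecords) {
            return (ReplayableBundleRecords) bundle;
        }
        return new MaterializedBundleRecords(bundle);
    }

    // Hands the same logical bundle to every writer (for example main, blob, vector).
    static void fanOut(BundleRecords bundle, List<Consumer<BundleRecords>> writers) {
        ReplayableBundleRecords replayable = ensureReplayable(bundle);
        for (Consumer<BundleRecords> writer : writers) {
            writer.accept(replayable);
        }
    }
}
```

Without the `ensureReplayable` step, a bundle backed by a one-shot iterator would be drained by the first writer and the remaining writers would see no rows.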
JingsongLi left a comment
Thanks for fixing the dedicated-format bundle write path. This is a complex change touching the append writer hierarchy. Comments:
Design:
- ReplayableBundleRecords / ProjectableBundleRecords interfaces: The opt-in capability pattern is clean. It avoids strengthening the base BundleRecords contract and allows runtime capability checks. This is the right extension pattern for the existing architecture.
- Materialization strategy: "Materialize only non-replayable bundles" is correct: if the bundle supports replay, we can iterate it multiple times without copying. The MaterializedBundleRecords fallback for non-replayable bundles ensures correctness.
- ArrowBundleRecords implementing ProjectableBundleRecords: This preserves the Arrow fast path through projection, which avoids row-by-row materialization. The project() implementation using rowType.project(projection) is elegant.
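To make the opt-in capability check concrete, here is a small sketch of how a caller might prefer a bundle's own projection and fall back to row-by-row projection otherwise. It does not use the Arrow API: `ColumnarBundle` is a hypothetical column-oriented stand-in, and `RowListBundle`, `projectFor`, and the `project(int[])` signature are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins: names follow the review, signatures are assumptions.
interface BundleRecords extends Iterable<Object[]> {
    long rowCount();
}

// Opt-in capability: the bundle can produce a projected view of itself.
interface ProjectableBundleRecords extends BundleRecords {
    BundleRecords project(int[] projection);
}

// Row-oriented bundle without the projection capability.
final class RowListBundle implements BundleRecords {
    private final List<Object[]> rows;

    RowListBundle(List<Object[]> rows) {
        this.rows = rows;
    }

    @Override
    public Iterator<Object[]> iterator() {
        return rows.iterator();
    }

    @Override
    public long rowCount() {
        return rows.size();
    }
}

// Column-oriented bundle, a loose stand-in for an Arrow-backed one.
final class ColumnarBundle implements ProjectableBundleRecords {
    private final Object[][] columns; // indexed as columns[column][row]
    private final int rowCount;

    ColumnarBundle(Object[][] columns, int rowCount) {
        this.columns = columns;
        this.rowCount = rowCount;
    }

    // Projection only re-selects column references; no row is copied.
    @Override
    public BundleRecords project(int[] projection) {
        Object[][] picked = new Object[projection.length][];
        for (int i = 0; i < projection.length; i++) {
            picked[i] = columns[projection[i]];
        }
        return new ColumnarBundle(picked, rowCount);
    }

    // Rows are assembled only when a consumer actually iterates.
    @Override
    public Iterator<Object[]> iterator() {
        List<Object[]> rows = new ArrayList<>();
        for (int r = 0; r < rowCount; r++) {
            Object[] row = new Object[columns.length];
            for (int c = 0; c < columns.length; c++) {
                row[c] = columns[c][r];
            }
            rows.add(row);
        }
        return rows.iterator();
    }

    @Override
    public long rowCount() {
        return rowCount;
    }
}

final class ProjectionSketch {
    // Runtime capability check: use the bundle's own projection when offered.
    static BundleRecords projectFor(BundleRecords bundle, int[] projection) {
        if (bundle instanceof ProjectableBundleRecords) {
            return ((ProjectableBundleRecords) bundle).project(projection);
        }
        // Fallback: project each row individually.
        List<Object[]> projected = new ArrayList<>();
        for (Object[] row : bundle) {
            Object[] out = new Object[projection.length];
            for (int i = 0; i < projection.length; i++) {
                out[i] = row[projection[i]];
            }
            projected.add(out);
        }
        return new RowListBundle(projected);
    }
}
```

Because the capability lives in a separate interface, existing `BundleRecords` implementations keep working unchanged and simply take the fallback branch.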
Concerns:
- BundlePassThroughWriter interface: This introduces yet another interface in the writer hierarchy. The writer abstraction is getting quite layered: SingleFileWriter → RowDataFileWriter → BundleAwareRowDataFileWriter → implements BundlePassThroughWriter. Ensure this doesn't become a maintenance burden.
- Sequence number side effects: The PR description mentions "preserves row-level side effects in bundle writes, including sequence number updates." This is critical: if the sequence number counter isn't advanced correctly for bundle writes, it could cause data consistency issues during compaction. The test BundleAwareRowDataRollingFileWriterTest should explicitly verify sequence numbers are correct after bundle writes.
- File index writing: "file-index writing" side effects: does the main-file index get updated correctly when the blob writer uses the bundle path? The DedicatedFormatRollingFileWriterVectorTest should cover this.
- 1054 lines added: The change is large. Could the interface additions (ReplayableBundleRecords, ProjectableBundleRecords) be a separate preparatory PR?
Good test coverage with 4 dedicated test classes. Please confirm sequence number correctness in tests.
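One way to phrase the sequence number check is as an equivalence between the row path and the bundle path. The sketch below is a toy model under stated assumptions, not the writer from the patch: `SequenceCountingWriter`, its accessors, and `bundlePathMatchesRowPath` are hypothetical names, and the counter is assumed to advance by one per written row.

```java
import java.util.List;

// Hypothetical toy writer: models only the counters, not any file I/O.
final class SequenceCountingWriter {
    private final long firstSequenceNumber;
    private long nextSequenceNumber;
    private long recordCount;

    SequenceCountingWriter(long firstSequenceNumber) {
        this.firstSequenceNumber = firstSequenceNumber;
        this.nextSequenceNumber = firstSequenceNumber;
    }

    // Row path: each row advances both counters by one.
    void write(Object[] row) {
        recordCount++;
        nextSequenceNumber++;
    }

    // Bundle path: a real writer would pass the whole bundle to the format writer here.
    void writeBundle(List<Object[]> bundle) {
        recordCount += bundle.size();
        // The row-level side effect that must not be skipped on the bundle path.
        nextSequenceNumber += bundle.size();
    }

    long recordCount() {
        return recordCount;
    }

    long minSequenceNumber() {
        return firstSequenceNumber;
    }

    long maxSequenceNumber() {
        return nextSequenceNumber - 1;
    }
}

final class SequenceCheckSketch {
    // True when writing the rows one by one and as a single bundle leave the same counters.
    static boolean bundlePathMatchesRowPath(List<Object[]> rows, long start) {
        SequenceCountingWriter rowWriter = new SequenceCountingWriter(start);
        for (Object[] row : rows) {
            rowWriter.write(row);
        }
        SequenceCountingWriter bundleWriter = new SequenceCountingWriter(start);
        bundleWriter.writeBundle(rows);
        return rowWriter.recordCount() == bundleWriter.recordCount()
                && rowWriter.minSequenceNumber() == bundleWriter.minSequenceNumber()
                && rowWriter.maxSequenceNumber() == bundleWriter.maxSequenceNumber();
    }
}
```

A test in this style fails as soon as the bundle path forwards the rows but forgets to advance the counter, which is the regression the review is asking to guard against.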
Signed-off-by: QuakeWang <wangfuzheng0814@foxmail.com>
@JingsongLi Thanks for the review. I added explicit coverage for the side effects:
I kept [...] The remaining failed CI is a timeout in [...]
Purpose
Fix the dedicated-format writeBundle path so bundle writes remain correct when data is fanned out to main/blob/vector writers.
This change:
Tests
Added or updated tests in paimon-core:
These cover: