[Fix][Connector-V2][Iceberg] Fix IcebergAggregatedCommitter atomic commit per checkpoint#10714
Conversation
good job, and the unit tests cover the merge logic well.
DanielLeens
left a comment
I pulled this locally and traced both the normal checkpoint-complete path and the restore path.
The change in IcebergAggregatedCommitter.commitFiles() does fix the main aggregated-commit path: SinkAggregatedCommitterTask.notifyCheckpointComplete() now merges all workers' WriteResults into a single filesCommitter.doCommit(...), so the normal checkpoint-close path produces one Iceberg commit instead of N per-worker commits.
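The fixed shape of the aggregated-commit path can be sketched in plain Java. The types below (`WriteResult`, `doCommit`, `commitFiles`) are simplified stand-ins for the connector classes, not the real API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the merged-commit fix: instead of calling doCommit() once per
// worker (N snapshots per checkpoint), merge every worker's result first and
// commit once. All types here are simplified stand-ins, not the real connector API.
public class MergedCommitSketch {

    // A worker's per-checkpoint write result (here: just its data files).
    record WriteResult(List<String> dataFiles) {}

    static int snapshotsProduced = 0;

    // Stand-in for IcebergFilesCommitter.doCommit: one call == one snapshot.
    static void doCommit(List<String> allFiles, long checkpointId) {
        snapshotsProduced++;
    }

    // The fixed shape: merge every worker's result, then commit once.
    static void commitFiles(List<WriteResult> workerResults, long checkpointId) {
        List<String> merged = new ArrayList<>();
        for (WriteResult r : workerResults) {
            merged.addAll(r.dataFiles());
        }
        doCommit(merged, checkpointId); // exactly one snapshot per checkpoint
    }

    public static void main(String[] args) {
        commitFiles(List.of(new WriteResult(List.of("f1")),
                            new WriteResult(List.of("f2")),
                            new WriteResult(List.of("f3")),
                            new WriteResult(List.of("f4"))),
                    1L);
        System.out.println(snapshotsProduced); // one commit for four workers
    }
}
```

The pre-fix behavior would be the same loop calling `doCommit` inside the iteration, producing one snapshot per worker.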
However, I don't think the PR fully fixes the restore-side semantics described in the issue.
On recovery, the engine replays both:
- writer state via SinkFlowLifeCycle.restoreState() -> IcebergSink.restoreWriter() -> IcebergSinkWriter.preCommit(states) (IcebergSinkWriter.java:81-91)
- aggregated commit state via SinkAggregatedCommitterTask.restoreState() -> aggregatedCommitter.restoreCommit(...) (SinkAggregatedCommitterTask.java:262-276, SinkAggregatedCommitter.java:42-45)
This PR only changes the aggregated committer path. IcebergSinkWriter.preCommit() still calls filesCommitter.doCommit(...) directly for each restored writer state, so after a recovery we can still replay already-committed files outside the new merged path.
Because of that, I don't think the broader "exactly 1 snapshot per checkpoint regardless of parallelism" / "duplicate rows on restore are fixed" claim is fully established yet.
Could we either:
- make the aggregated committer the only replay path when this sink uses an aggregated committer, or
- make the writer-side replay explicitly idempotent (for example by using commitUser / checkpoint metadata) and add a recovery regression test?
Since issue #10710 explicitly calls out restore behavior, I think we need to close this gap before calling the fix complete.
Superseded by my blocking review #10714 (review), which keeps the clearer restore-path conclusion for this PR. I am shortening this earlier note to avoid a duplicate mixed signal.
Thanks for the detailed review; this helped clarify the gap on the restore path. Based on your feedback, I'm planning to make the aggregated committer the single commit path for both normal execution and recovery. In particular, I'll remove the writer-side commit during restore so that recovery flows only through the aggregated committer. Let me know if this direction makes sense, or if you'd recommend a different approach for handling idempotency during recovery.
Thanks for the detailed follow-up, and yes, the direction makes sense to me. I re-checked the current code at the latest PR head, and the blocking gap is still present because the implementation has not changed yet.
So the restore path is still bypassing the aggregated committer in the current code, even though the normal checkpoint-complete path is fixed. I do think your proposed direction is the right one: make the aggregated committer the single restore/commit path, and remove the writer-side restore commit. If you still need idempotency after that, storing checkpoint metadata with each commit should be enough. My suggestion would be to keep the next patch focused on that first structural step.
Once that code lands, I'm happy to re-review it. You're definitely moving in the right direction here.
davidzollo
left a comment
+1 if CI passes.
LGTM. It would be much better to add an E2E test, just as @nzw921rx said.
In conclusion, thank you so much for contributing to Apache SeaTunnel — your participation is greatly welcomed!
DanielLeens
left a comment
Thanks for iterating on this. I pulled the latest branch locally and rechecked the restore path. The writer no longer commits restored state in its constructor, prepareCommit(long checkpointId) now carries checkpoint identity, and the aggregated committer restores through the idempotency check instead of double-committing restored files. That addresses the correctness issue I raised earlier. I don't see a new code blocker from the current implementation. The remaining item is the red Build check.
ed88ace to 98ec9b2
…ers atomically per checkpoint
…rkers per checkpoint and prevent duplicate commits on recovery
Previously, IcebergAggregatedCommitter committed each worker's WriteResults as a separate Iceberg snapshot, breaking atomicity. On recovery, SinkWriter's
preCommit() could re-apply already-committed checkpoints, causing duplicates.
Changes:
- Merge all workers' WriteResults into a single Iceberg snapshot per checkpoint
- Add checkpoint ID as Iceberg snapshot summary property (seatunnel.checkpoint-id)
for idempotency tracking
- Implement restoreCommit() to skip already-committed checkpoints on recovery
- Move commit responsibility fully to IcebergAggregatedCommitter; remove
filesCommitter/results/preCommit() from IcebergSinkWriter
- Add checkpointId to IcebergCommitInfo and propagate via prepareCommit(checkpointId)
- Simplify IcebergSinkState to no longer carry WriteResults
- Add unit tests (IcebergAggregatedCommitterTest, IcebergFilesCommitterTest,
IcebergSinkWriterTest) and E2E tests covering streaming, batch, RowDelta upsert,
and crash-recovery scenarios
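The idempotency mechanism described above (a `seatunnel.checkpoint-id` snapshot summary property checked by `isAlreadyCommitted()`) can be sketched with a plain in-memory stand-in for the table's snapshot history; the class and method names below are illustrative, not the real connector code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of checkpoint-id-based idempotency: every commit tags its snapshot
// with the checkpoint id in the summary map, and a replayed commit for the
// same checkpoint is skipped. The snapshot "history" is an in-memory list here.
public class CheckpointIdempotencySketch {
    // Property name taken from the PR description.
    static final String CHECKPOINT_PROP = "seatunnel.checkpoint-id";

    // Stand-in for the table's snapshot history: each snapshot's summary map.
    final List<Map<String, String>> snapshotSummaries = new ArrayList<>();

    void doCommit(List<String> files, long checkpointId) {
        if (isAlreadyCommitted(checkpointId)) {
            return; // recovery replay: this checkpoint's snapshot already exists
        }
        Map<String, String> summary = new HashMap<>();
        summary.put(CHECKPOINT_PROP, Long.toString(checkpointId));
        snapshotSummaries.add(summary); // one snapshot, tagged with its checkpoint
    }

    boolean isAlreadyCommitted(long checkpointId) {
        String id = Long.toString(checkpointId);
        return snapshotSummaries.stream()
                .anyMatch(summary -> id.equals(summary.get(CHECKPOINT_PROP)));
    }
}
```

A legacy snapshot without the property simply never matches, which is why the tests cover that path separately.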
…) for Java 8 compatibility in IcebergSinkWriterTest
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
98ec9b2 to 8ec0684
…ng scenario
- Rename table/method iceberg_parallel_sink_table → iceberg_parallel_streaming_table to distinguish the streaming-parallelism test from the recovery test
- Replace awaitility polling with direct assertions now that FakeSource terminates naturally on Zeta after all splits are consumed
- Add @DisabledOnContainer(SPARK, FLINK) because FakeSource in STREAMING mode does not terminate on those engines
- Set split.read-interval=3000 ms and split.num=4 so data is spread across multiple checkpoints, making the one-snapshot-per-checkpoint invariant observable
DanielLeens
left a comment
I pulled the latest branch locally and re-traced the Iceberg sink write and restore paths.
Current runtime chain:
normal write
-> IcebergSinkWriter.prepareCommit(checkpointId)
-> returns IcebergCommitInfo only
-> SinkAggregatedCommitterTask.notifyCheckpointComplete(...)
-> IcebergAggregatedCommitter.commit(...)
-> IcebergFilesCommitter.doCommit(..., checkpointId)
restore path
-> IcebergSinkWriter.of(..., states)
-> restores commitUser only, no writer-side doCommit()
-> SinkAggregatedCommitterTask.restoreState(...)
-> IcebergAggregatedCommitter.restoreCommit(...)
-> IcebergFilesCommitter.isAlreadyCommitted(checkpointId, ...)
-> skip already-persisted checkpoints
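The restore step at the end of that chain can be sketched as a filter over the restored commit infos; the names below (`CommitInfo`, `restoreCommit`, the predicate standing in for `isAlreadyCommitted`) are hypothetical simplifications of the connector classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongPredicate;

// Sketch of the restore path: re-commit only the restored checkpoints whose
// snapshots are not already persisted, then report nothing left pending.
public class RestoreCommitSketch {

    // Stand-in for IcebergCommitInfo after the PR: it now carries a checkpointId.
    record CommitInfo(long checkpointId, List<String> files) {}

    // Stand-in for IcebergAggregatedCommitter.restoreCommit(...). The predicate
    // plays the role of IcebergFilesCommitter.isAlreadyCommitted(checkpointId).
    static List<Long> restoreCommit(List<CommitInfo> restored,
                                    LongPredicate alreadyCommitted) {
        List<Long> recommitted = new ArrayList<>();
        for (CommitInfo info : restored) {
            if (!alreadyCommitted.test(info.checkpointId())) {
                recommitted.add(info.checkpointId()); // stand-in for doCommit(...)
            }
        }
        return recommitted;
    }

    public static void main(String[] args) {
        // Checkpoint 1 was already committed before the crash; only 2 is replayed.
        List<Long> done = restoreCommit(
                List.of(new CommitInfo(1L, List.of("a")),
                        new CommitInfo(2L, List.of("b"))),
                cp -> cp == 1L);
        System.out.println(done); // [2]
    }
}
```

The key contrast with the pre-fix code is that nothing on the writer side calls `doCommit` during restore anymore; only this path decides what to replay.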
I also checked the delta since my previous review. The new commit is test-only:
- IcebergSinkParallelCommitIT now scopes the streaming assertion to the Zeta engine, where FakeSource terminates naturally after exhausting splits.
- fake_to_iceberg_parallel_streaming.conf now uses split.read-interval=3000 and split.num=4 so multiple checkpoints are guaranteed and the "one snapshot per checkpoint" invariant is actually observable.
I do not see a new code blocker in the current revision. The remaining item is the pending Build check.
@DanielLeens Looking back, I made two mistakes in the original test.
To fix them, I followed the patterns in IcebergSinkCDCIT and KafkaIT for handling streaming jobs. I also removed the unnecessary @DisabledOnContainer(SPARK, FLINK) at the method level and updated the class-level SPARK reason to accurately reflect the limitations of spark-submit --master local in this test environment. I'll keep an eye on the pending build check to ensure everything passes this time. Sorry for the noise, and thanks again for catching these issues!
…nd poll snapshot invariants
FakeSource in STREAMING mode never self-terminates, causing executeJob() to block indefinitely. Replace synchronous execution with CompletableFuture.supplyAsync() and wrap snapshot assertions in Awaitility polling (atMost 60s), removing the Flink/Spark @DisabledOnContainer restriction.
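The harness pattern that commit describes (run the non-terminating job asynchronously, then poll the table with a deadline) can be sketched without the test framework; a plain polling loop stands in for Awaitility here, and `awaitSnapshots` is a hypothetical helper, not a SeaTunnel API:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of the E2E harness change: start the streaming job asynchronously
// and poll the observed state with a deadline instead of blocking on job
// completion (which never happens for a non-terminating streaming FakeSource).
public class AsyncPollSketch {

    // Poll until the snapshot count reaches `expected` or the timeout elapses.
    static boolean awaitSnapshots(AtomicInteger snapshotCount, int expected, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (snapshotCount.get() >= expected) {
                return true;
            }
            Thread.sleep(10);
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        AtomicInteger snapshots = new AtomicInteger();
        // Stand-in for executeJob(): a "streaming job" that keeps producing snapshots.
        CompletableFuture.runAsync(() -> {
            for (int i = 0; i < 4; i++) {
                snapshots.incrementAndGet();
                try { Thread.sleep(20); } catch (InterruptedException e) { return; }
            }
        });
        System.out.println(awaitSnapshots(snapshots, 4, 5_000));
    }
}
```

In the real test, Awaitility's `await().atMost(...)` replaces the manual loop and the polled condition asserts on the Iceberg table's snapshot summaries.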
DanielLeens
left a comment
I re-pulled the latest HEAD locally and checked only the delta since my previous approval.
Current E2E chain is now:
start streaming job asynchronously
-> poll Iceberg table snapshots
-> assert distinct checkpoint ids in snapshot summary
The new commit only changes the test harness in IcebergSinkParallelCommitIT so it no longer blocks on synchronous completion of a non-terminating streaming FakeSource. The production write / restore path I approved earlier is unchanged, and the updated test still checks the same per-checkpoint snapshot invariant.
I do not see a new code blocker in the current revision. The remaining item is the pending Build check.
…nk containers in IcebergSinkParallelCommitIT
@nzw921rx @DanielLeens https://github.com/suhyeon729/seatunnel/actions/runs/13149723048
DanielLeens
left a comment
I pulled the latest branch locally and rechecked both the write path and the restore path.
Current runtime chain:
normal checkpoint
-> IcebergSinkWriter.prepareCommit(checkpointId)
-> returns IcebergCommitInfo only
-> SinkAggregatedCommitterTask.notifyCheckpointComplete(...)
-> IcebergAggregatedCommitter.commit(...)
-> IcebergFilesCommitter.doCommit(..., checkpointId)
restore path
-> IcebergSinkWriter.of(..., states)
-> restores commitUser only
-> SinkAggregatedCommitterTask.restoreState(...)
-> IcebergAggregatedCommitter.restoreCommit(...)
-> filesCommitter.isAlreadyCommitted(...)
-> skip already-persisted checkpoints
I also checked the delta since my previous review. The new commit is test-only:
- IcebergSinkParallelCommitIT now scopes the streaming assertion to the environments where the checkpoint-to-snapshot invariant is actually observable
- the batch E2E config was aligned with the expected row count
I do not see a new code blocker in the current revision. The remaining item is the pending Build check.
Hi @davidzollo, I hope this finds you well. I've added the E2E test class IcebergSinkParallelCommitIT as requested a week ago.
Could you please take a look when you have a moment?
Hi @suhyeon729, I rechecked the current head locally. The important runtime chain is unchanged from my earlier approval, and the added unit/E2E coverage still exercises the parallel commit and recovery behavior that mattered for this fix.
Conclusion: can merge. No new blocker from my side; thanks for sticking with the recovery-path details.
dybyte
left a comment
+1. Thanks for your contribution.
Fixes: #10710
Purpose of this pull request
IcebergAggregatedCommitter had two related correctness issues:
Non-atomic commits (N snapshots per checkpoint)
commitFiles() called filesCommitter.doCommit() once per worker's IcebergCommitInfo, producing N Iceberg snapshots per checkpoint
(where N = parallelism) instead of a single atomic snapshot. This caused partial data visibility if any commit failed mid-way,
and snapshot proliferation proportional to parallelism.
Duplicate rows on recovery (idempotency gap)
On pipeline restore, IcebergSinkWriter.preCommit() re-submitted already-committed WriteResults, causing duplicate rows in Iceberg
tables.
Changes:
Does this PR introduce any user-facing change?
Yes. Previously, with parallelism N, each checkpoint produced N Iceberg snapshots instead of 1. This caused:
After this fix, each checkpoint produces exactly 1 atomic snapshot regardless of parallelism, and recovery correctly skips already-committed checkpoints.
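The "exactly 1 snapshot per checkpoint" claim reduces to a simple invariant over snapshot summaries, which can be stated as a small check; the class and method here are illustrative, not part of the connector:

```java
import java.util.HashSet;
import java.util.List;

// The invariant this fix establishes: each checkpoint id appears in the
// summary of at most one snapshot, i.e. the list of per-snapshot checkpoint
// ids (read from the seatunnel.checkpoint-id property) has no duplicates.
public class SnapshotInvariantSketch {

    static boolean oneSnapshotPerCheckpoint(List<String> snapshotCheckpointIds) {
        return new HashSet<>(snapshotCheckpointIds).size() == snapshotCheckpointIds.size();
    }

    public static void main(String[] args) {
        System.out.println(oneSnapshotPerCheckpoint(List.of("1", "2", "3"))); // true: post-fix
        System.out.println(oneSnapshotPerCheckpoint(List.of("1", "1", "2"))); // false: pre-fix, N snapshots per checkpoint
    }
}
```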
How was this patch tested?
Unit tests:
- checkpoints each produce exactly one snapshot
- isAlreadyCommitted() correctly identifies already-committed checkpoints (including legacy path without checkpoint-id property)
E2E tests (IcebergSinkParallelCommitIT):
Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.