[PIP-30] Improve Paimon committer in Flink#7963
Conversation
JingsongLi
left a comment
There was a problem hiding this comment.
This is a significant architectural change (PIP-30). A few high-level concerns:
Design Questions:
-
Custom HDFS state vs Flink StateBackend: I understand the timing mismatch rationale, but custom state management introduces its own complexity:
- Who cleans up stale state files when jobs are cancelled/fail without proper cleanup?
- What happens if the HDFS path is not accessible (permissions, HA failover)?
- How does this interact with Flink's checkpoint retention policies?
-
State path collision: The path
<flink-checkpoint-dir>/pwc/<operatorId>/checkpoint-{ckId}.statelives inside Flink's checkpoint directory but is managed independently. Could Flink's checkpoint cleanup accidentally delete these files? Or could stale PWC state files accumulate unboundedly? -
Scope limitation: Currently only supports
FixedBucketSink. What's the plan forUnawareBucketSinkandDynamicBucketSink? The design should be extensible to these without requiring a second large refactoring.
Code Comments:
-
DiscardingSinkfor committed stream: InFlinkSink.doCommit(), when coordinator is enabled, the committed stream is discarded viawritten.sinkTo(new DiscardingSink<>()).name("end").setParallelism(1). Why parallelism 1? If the write operator has high parallelism, this creates a bottleneck for the stream before discarding. Can you just set it to the same parallelism as writers? -
Error handling in coordinator: What happens when the coordinator fails to commit (e.g., conflict with another writer)? How is this surfaced to the Flink job? Does it trigger a failover?
-
Testing: The current tests cover recovery logic, but I'd like to see:
- An end-to-end IT test that validates data correctness after checkpoint/restore
- A test for concurrent writes (multiple subtasks committing in the same checkpoint)
- A test for the cancelled job → stale state scenario
This is promising work but given its complexity, it might benefit from a more detailed design doc on the wiki for community discussion before merging.
JingsongLi
left a comment
There was a problem hiding this comment.
If this feature is strongly bound to HDFS, then there must be a problem, and we definitely need to support object storage.
Purpose
Following PIP-30 (Improvement For Paimon Committer In Flink) , this PR implements the Paimon Write Coordinator (PWC) to replace the current
CommitOperatorwith a JobManager-levelOperatorCoordinator, eliminating the network shuffle bottleneck for commit messages.Key Design Decision: Custom HDFS State
Instead of using Flink's native
StateBackend, I chose custom HDFS state management because:snapshotState()returns, but PWC needs to persist aggregated messages before acknowledging WriteOperators.resetToCheckpoint()logic with idempotent commit.State path:
<flink-checkpoint-dir>/pwc/<operatorId>/checkpoint-{ckId}.stateCurrent Scope
FixedBucketSinkTesting
Related Issue
fix #2641