[fix](streaming-job) Fix NPE in StreamingInsertJob.replayOnCommitted during EditLog replay#62416
Conversation
…during EditLog replay When a streaming insert task is canceled while its transaction is being committed, beforeCommitted() silently returned without setting the txnCommitAttachment. The transaction still committed with a null attachment, causing NullPointerException during EditLog replay in replayOnCommitted(). Fix two issues in beforeCommitted(): 1. Throw TransactionException when task is canceled instead of silent return, preventing the transaction from committing with null attachment. 2. Fix write lock leak by using passCheck pattern - release lock on failure, keep it for onStreamTaskSuccess/Fail on success. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
FE UT Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
Pull request overview
Fixes a null attachment being written during streaming insert transaction commit, which could later crash FE checkpoint EditLog replay with an NPE in StreamingInsertJob.replayOnCommitted().
Changes:
- Change
beforeCommitted()to throwTransactionExceptionwhen the running streaming task is canceled, preventing committing with a nulltxnCommitAttachment. - Adjust
beforeCommitted()locking so the write lock is released on failure paths (viapassCheck), while keeping it held for subsequent success/failure callbacks to release.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Review finding:
- Missing coverage for the exact cancel-during-commit replay path that this patch fixes.
Critical checkpoint conclusions:
- Goal of the task: Mostly accomplished in code. The change prevents
beforeCommitted()from silently allowing a canceled streaming task to commit with a null attachment, and it fixes the leaked write-lock pattern. However, there is still no test proving the replay scenario is fixed. - Modification size/focus: Yes. The change is small and focused to
StreamingInsertJob.beforeCommitted(). - Concurrency: Acceptable in the modified path. The new
passCheckflow now matches the existing callback-lock pattern used in similar code: failure releases the lock infinally, success retains it foronStreamTaskSuccess/Fail()to release. I did not find a new deadlock path in the reviewed flow. - Special lifecycle/static initialization: Not applicable for this patch.
- Configuration changes: None.
- Incompatible changes: None observed in APIs or persisted formats.
- Functionally parallel code paths: The fix is consistent with the
RoutineLoadJob.executeBeforeCheck()pattern, which is the relevant parallel transaction-callback path. - Special conditional checks: The new canceled-task check is justified because the prior silent return allowed a committed txn with null attachment, which then crashed replay.
- Test coverage: Insufficient. There is no FE unit or regression test for cancel-during-commit followed by EditLog replay / FE restart, which is the exact failure mode described in the PR.
- Observability: Adequate for this change. The new exception includes job/task/txn identifiers; no extra metrics appear necessary.
- Transaction and persistence: Improved. The change should prevent persisting committed streaming txns without attachments.
- Data writes/modifications: Improved. The system now rejects commit for canceled tasks instead of allowing inconsistent transaction metadata.
- New FE-BE variables: Not applicable.
- Performance: Neutral.
- Other issues: No additional correctness issues found beyond the missing coverage above.
|
PR approved by anyone and no changes requested. |
|
PR approved by at least one committer and no changes requested. |
|
/review |
There was a problem hiding this comment.
One blocking issue remains in the write-lock fix.
- Goal of the task: The PR does prevent the null
txnCommitAttachment/ replay NPE when a streaming task is canceled beforebeforeCommitted()finishes. That part looks correct. However, the lock-leak fix is incomplete: oncebeforeCommitted()passes and keeps the job write lock, there is still no cleanup for commit failures that happen afterbeforeStateTransform(COMMITTED)and beforeafterCommitted().StreamingInsertJob.afterAborted()is empty, so the job can stay permanently write-locked. - Small / focused change: Yes, the patch is small and focused.
- Concurrency: Applicable. The callback intentionally holds
StreamingInsertJob's write lock across the transaction callback window, but the newpassCheckonly unlocks failures insidebeforeCommitted()itself. IfDatabaseTransactionMgr.commitTransaction()throws later fromunprotectedCommitTransaction(...),persistTransactionState(...), orenqueueTransactionState(...)(fe/fe-core/src/main/java/org/apache/doris/transaction/DatabaseTransactionMgr.java:820-840and893-909),afterCommitted()is never reached andafterAborted()still does not unlock, so later pause/resume/cancel/scheduling paths can block forever. - Lifecycle / static-init concerns: No special lifecycle issue beyond the existing callback-retention mechanism in
TransactionState, which is used correctly here. - Config changes: None.
- Compatibility changes: None.
- Parallel code paths: The routine-load callback uses the same hold-lock-until-after pattern, but it has explicit unlock logic in
afterCommitted()/ abort handling. The streaming path still lacks equivalent cleanup for the post-check failure window. - Conditional checks: The new canceled-task check is reasonable and the error message now includes job/task/txn identifiers, which is good for diagnosis.
- Test coverage: No FE unit test or regression test was added for the canceled-during-commit replay case or for the lock-release behavior. This is still a testing gap.
- Observability: The new exception message is sufficient for tracing the failing transaction.
- Transaction / persistence: Applicable. Replay correctness is improved, but the transaction callback can still leave the job in a stuck state if commit fails after the attachment is built and before the
afterCommitted()callback runs. - Data write / modification safety: Applicable. A stuck write lock can block later job-state transitions after a failed commit.
- FE-BE variable passing: Not applicable.
- Performance: No material performance concern in the patch itself.
- Other issues: None beyond the lock-release gap above.
Please cover the post-passCheck failure path as well, for example by making sure the held lock is released when the transaction transitions to abort / failed commit cleanup, not only when beforeCommitted() itself throws.
The lock will be released onStreamTaskFail.
…during EditLog replay (#62416) ### What problem does this PR solve? Problem Summary: Fix NPE in `StreamingInsertJob.replayOnCommitted()` during FE checkpoint EditLog replay. When a streaming insert task is canceled while its transaction is being committed, `beforeCommitted()` silently returned without setting `txnCommitAttachment`. The transaction still committed with a null attachment written to EditLog. During checkpoint replay, `replayOnCommitted()` calls `Preconditions.checkNotNull(txnState.getTxnCommitAttachment())`, throwing NPE. Two issues fixed in `beforeCommitted()`: 1. Throw `TransactionException` when task is canceled instead of silent `return`, preventing commit with null attachment. Consistent with `RoutineLoadJob.executeBeforeCheck()` pattern. 2. Fix write lock leak — replaced broken `shouldReleaseLock` (always `false`) with `passCheck` pattern: release lock on failure in `finally`, keep it for `onStreamTaskSuccess/Fail()` to release on success.
…during EditLog replay (apache#62416) Problem Summary: Fix NPE in `StreamingInsertJob.replayOnCommitted()` during FE checkpoint EditLog replay. When a streaming insert task is canceled while its transaction is being committed, `beforeCommitted()` silently returned without setting `txnCommitAttachment`. The transaction still committed with a null attachment written to EditLog. During checkpoint replay, `replayOnCommitted()` calls `Preconditions.checkNotNull(txnState.getTxnCommitAttachment())`, throwing NPE. Two issues fixed in `beforeCommitted()`: 1. Throw `TransactionException` when task is canceled instead of silent `return`, preventing commit with null attachment. Consistent with `RoutineLoadJob.executeBeforeCheck()` pattern. 2. Fix write lock leak — replaced broken `shouldReleaseLock` (always `false`) with `passCheck` pattern: release lock on failure in `finally`, keep it for `onStreamTaskSuccess/Fail()` to release on success. (cherry picked from commit 8548d7b)
What problem does this PR solve?
Problem Summary:
Fix NPE in
StreamingInsertJob.replayOnCommitted()during FE checkpoint EditLog replay.When a streaming insert task is canceled while its transaction is being committed,
beforeCommitted()silently returned without settingtxnCommitAttachment. The transaction still committed with a null attachment written to EditLog. During checkpoint replay,replayOnCommitted()callsPreconditions.checkNotNull(txnState.getTxnCommitAttachment()), throwing NPE.Two issues fixed in
beforeCommitted():TransactionExceptionwhen task is canceled instead of silentreturn, preventing commit with null attachment. Consistent withRoutineLoadJob.executeBeforeCheck()pattern.shouldReleaseLock(alwaysfalse) withpassCheckpattern: release lock on failure infinally, keep it foronStreamTaskSuccess/Fail()to release on success.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)