
[fix](streaming-job) Fix NPE in StreamingInsertJob.replayOnCommitted during EditLog replay #62416

Merged
JNSimba merged 1 commit into apache:master from JNSimba:fix/streaming-job-beforecommitted-npe
Apr 15, 2026

JNSimba (Member) commented Apr 13, 2026

What problem does this PR solve?

Problem Summary:

Fix NPE in StreamingInsertJob.replayOnCommitted() during FE checkpoint EditLog replay.

When a streaming insert task is canceled while its transaction is being committed, beforeCommitted() silently returned without setting txnCommitAttachment. The transaction still committed with a null attachment written to EditLog. During checkpoint replay, replayOnCommitted() calls Preconditions.checkNotNull(txnState.getTxnCommitAttachment()), throwing NPE.

Two issues fixed in beforeCommitted():

  1. Throw TransactionException when task is canceled instead of silent return, preventing commit with null attachment. Consistent with RoutineLoadJob.executeBeforeCheck() pattern.
  2. Fix write lock leak — replaced broken shouldReleaseLock (always false) with passCheck pattern: release lock on failure in finally, keep it for onStreamTaskSuccess/Fail() to release on success.
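
The two fixes above can be sketched in isolation. This is a minimal illustration, not the Doris source: `StreamingJobSketch`, `SketchTxnException`, the `taskCanceled` flag, and the string attachment are all hypothetical stand-ins, and the only behavior taken from the PR description is "throw on cancel" plus "release the lock in `finally` unless the check passed".

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for Doris's TransactionException.
class SketchTxnException extends Exception {
    SketchTxnException(String msg) {
        super(msg);
    }
}

// Hypothetical, simplified model of a streaming job; not the real StreamingInsertJob.
class StreamingJobSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);
    private volatile boolean taskCanceled;
    private Object txnCommitAttachment;

    void cancelTask() {
        taskCanceled = true;
    }

    boolean isWriteLockHeld() {
        return lock.isWriteLockedByCurrentThread();
    }

    Object getTxnCommitAttachment() {
        return txnCommitAttachment;
    }

    void beforeCommitted(long txnId) throws SketchTxnException {
        boolean passCheck = false;
        lock.writeLock().lock();
        try {
            if (taskCanceled) {
                // Fix 1: fail the commit instead of returning silently.
                throw new SketchTxnException("task canceled, txn " + txnId + " must not commit");
            }
            // Assumed placeholder for the real commit attachment.
            txnCommitAttachment = "progress-for-txn-" + txnId;
            passCheck = true;
        } finally {
            // Fix 2: release only on failure; on success the lock stays held
            // until a later callback releases it.
            if (!passCheck) {
                lock.writeLock().unlock();
            }
        }
    }

    // Stand-in for the success/fail callback that releases the retained lock.
    void onStreamTaskSuccess() {
        lock.writeLock().unlock();
    }
}
```

In this sketch the callback must run on the same thread that took the lock, because `ReentrantReadWriteLock` write locks are owned by a thread; whether that holds in the real callback chain is outside what the PR text states.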

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • Previous test can cover this change.
  • Behavior changed:

    • Yes. When a streaming insert task is canceled during transaction commit, the transaction will now fail with TransactionException instead of committing with null attachment.
  • Does this need documentation?

    • No.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

…during EditLog replay

When a streaming insert task is canceled while its transaction is being
committed, beforeCommitted() silently returned without setting the
txnCommitAttachment. The transaction still committed with a null
attachment, causing NullPointerException during EditLog replay in
replayOnCommitted().

Fix two issues in beforeCommitted():
1. Throw TransactionException when task is canceled instead of silent
   return, preventing the transaction from committing with null attachment.
2. Fix write lock leak by using passCheck pattern - release lock on
   failure, keep it for onStreamTaskSuccess/Fail on success.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Thearas (Contributor) commented Apr 13, 2026

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

JNSimba (Member, Author) commented Apr 13, 2026

run buildall

hello-stephen (Contributor) commented

FE UT Coverage Report

Increment line coverage 0.00% (0/6) 🎉
Increment coverage report
Complete coverage report

JNSimba (Member, Author) commented Apr 14, 2026

/review

Copilot AI (Contributor) left a comment

Pull request overview

Fixes a null attachment being written during streaming insert transaction commit, which could later crash FE checkpoint EditLog replay with an NPE in StreamingInsertJob.replayOnCommitted().

Changes:

  • Change beforeCommitted() to throw TransactionException when the running streaming task is canceled, preventing committing with a null txnCommitAttachment.
  • Adjust beforeCommitted() locking so the write lock is released on failure paths (via passCheck), while keeping it held for subsequent success/failure callbacks to release.


github-actions Bot (Contributor) left a comment

Review finding:

  • Missing coverage for the exact cancel-during-commit replay path that this patch fixes.

Critical checkpoint conclusions:

  • Goal of the task: Mostly accomplished in code. The change prevents beforeCommitted() from silently allowing a canceled streaming task to commit with a null attachment, and it fixes the leaked write-lock pattern. However, there is still no test proving the replay scenario is fixed.
  • Modification size/focus: Yes. The change is small and focused on StreamingInsertJob.beforeCommitted().
  • Concurrency: Acceptable in the modified path. The new passCheck flow now matches the existing callback-lock pattern used in similar code: failure releases the lock in finally, success retains it for onStreamTaskSuccess/Fail() to release. I did not find a new deadlock path in the reviewed flow.
  • Special lifecycle/static initialization: Not applicable for this patch.
  • Configuration changes: None.
  • Incompatible changes: None observed in APIs or persisted formats.
  • Functionally parallel code paths: The fix is consistent with the RoutineLoadJob.executeBeforeCheck() pattern, which is the relevant parallel transaction-callback path.
  • Special conditional checks: The new canceled-task check is justified because the prior silent return allowed a committed txn with null attachment, which then crashed replay.
  • Test coverage: Insufficient. There is no FE unit or regression test for cancel-during-commit followed by EditLog replay / FE restart, which is the exact failure mode described in the PR.
  • Observability: Adequate for this change. The new exception includes job/task/txn identifiers; no extra metrics appear necessary.
  • Transaction and persistence: Improved. The change should prevent persisting committed streaming txns without attachments.
  • Data writes/modifications: Improved. The system now rejects commit for canceled tasks instead of allowing inconsistent transaction metadata.
  • New FE-BE variables: Not applicable.
  • Performance: Neutral.
  • Other issues: No additional correctness issues found beyond the missing coverage above.
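
The replay failure mode that the review says is untested can be modeled in a few lines. This is a rough sketch under stated assumptions: `ReplaySketch`, `CommittedTxnRecord`, `commitLegacy`, and `commitFixed` are hypothetical names, the in-memory list stands in for the EditLog, and the JDK's `Objects.requireNonNull` stands in for the `Preconditions.checkNotNull` call mentioned in the PR description.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical stand-in for one committed-transaction entry in the EditLog.
final class CommittedTxnRecord {
    final long txnId;
    final String attachment; // null only in the pre-fix flow

    CommittedTxnRecord(long txnId, String attachment) {
        this.txnId = txnId;
        this.attachment = attachment;
    }
}

// Hypothetical model of commit plus replay; not the Doris implementation.
class ReplaySketch {
    private final List<CommittedTxnRecord> editLog = new ArrayList<>();

    // Pre-fix behavior: a canceled task leaves the attachment null,
    // yet the commit is still logged.
    void commitLegacy(long txnId, boolean taskCanceled) {
        String attachment = taskCanceled ? null : "progress-" + txnId;
        editLog.add(new CommittedTxnRecord(txnId, attachment));
    }

    // Post-fix behavior: a canceled task makes the commit fail, so nothing is logged.
    void commitFixed(long txnId, boolean taskCanceled) {
        if (taskCanceled) {
            throw new IllegalStateException("task canceled, txn " + txnId + " rejected");
        }
        editLog.add(new CommittedTxnRecord(txnId, "progress-" + txnId));
    }

    // Replays every logged record and returns how many were applied.
    int replayAll() {
        int replayed = 0;
        for (CommittedTxnRecord record : editLog) {
            // Throws NullPointerException when the attachment is missing.
            Objects.requireNonNull(record.attachment,
                    "txn " + record.txnId + " has no commit attachment");
            replayed++;
        }
        return replayed;
    }
}
```

A real regression test would need to drive the actual transaction manager and an FE restart or checkpoint; the sketch only shows the invariant such a test would assert, namely that no committed record reaches replay without an attachment.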

sollhui (Contributor) left a comment

LGTM

github-actions (Contributor) commented

PR approved by anyone and no changes requested.

liaoxin01 (Contributor) left a comment

LGTM

github-actions Bot added the "approved" label (Indicates a PR has been approved by one committer.) on Apr 15, 2026

github-actions (Contributor) commented

PR approved by at least one committer and no changes requested.

zclllyybb (Contributor) commented

/review

github-actions Bot (Contributor) left a comment

One blocking issue remains in the write-lock fix.

  1. Goal of the task: The PR does prevent the null txnCommitAttachment / replay NPE when a streaming task is canceled before beforeCommitted() finishes. That part looks correct. However, the lock-leak fix is incomplete: once beforeCommitted() passes and keeps the job write lock, there is still no cleanup for commit failures that happen after beforeStateTransform(COMMITTED) and before afterCommitted(). StreamingInsertJob.afterAborted() is empty, so the job can stay permanently write-locked.
  2. Small / focused change: Yes, the patch is small and focused.
  3. Concurrency: Applicable. The callback intentionally holds StreamingInsertJob's write lock across the transaction callback window, but the new passCheck only unlocks failures inside beforeCommitted() itself. If DatabaseTransactionMgr.commitTransaction() throws later from unprotectedCommitTransaction(...), persistTransactionState(...), or enqueueTransactionState(...) (fe/fe-core/src/main/java/org/apache/doris/transaction/DatabaseTransactionMgr.java:820-840 and 893-909), afterCommitted() is never reached and afterAborted() still does not unlock, so later pause/resume/cancel/scheduling paths can block forever.
  4. Lifecycle / static-init concerns: No special lifecycle issue beyond the existing callback-retention mechanism in TransactionState, which is used correctly here.
  5. Config changes: None.
  6. Compatibility changes: None.
  7. Parallel code paths: The routine-load callback uses the same hold-lock-until-after pattern, but it has explicit unlock logic in afterCommitted() / abort handling. The streaming path still lacks equivalent cleanup for the post-check failure window.
  8. Conditional checks: The new canceled-task check is reasonable and the error message now includes job/task/txn identifiers, which is good for diagnosis.
  9. Test coverage: No FE unit test or regression test was added for the canceled-during-commit replay case or for the lock-release behavior. This is still a testing gap.
  10. Observability: The new exception message is sufficient for tracing the failing transaction.
  11. Transaction / persistence: Applicable. Replay correctness is improved, but the transaction callback can still leave the job in a stuck state if commit fails after the attachment is built and before the afterCommitted() callback runs.
  12. Data write / modification safety: Applicable. A stuck write lock can block later job-state transitions after a failed commit.
  13. FE-BE variable passing: Not applicable.
  14. Performance: No material performance concern in the patch itself.
  15. Other issues: None beyond the lock-release gap above.

Please cover the post-passCheck failure path as well, for example by making sure the held lock is released when the transaction transitions to abort / failed commit cleanup, not only when beforeCommitted() itself throws.
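
The cleanup requested here could look roughly like the following. This is only a sketch of the reviewer's suggestion, not code from the merged patch: `AbortCleanupSketch` and `CommitDriverSketch` are hypothetical, and the driver assumes that an abort callback is invoked on the committing thread when a later commit step fails, which the thread does not confirm for the real transaction manager.

```java
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical job whose callbacks hold a write lock across the commit window.
class AbortCleanupSketch {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock(true);

    void beforeCommitted() {
        // Checks are assumed to pass; the lock is intentionally kept.
        lock.writeLock().lock();
    }

    void afterCommitted() {
        releaseIfHeld();
    }

    // The suggested addition: also release when the commit is aborted.
    void afterAborted() {
        releaseIfHeld();
    }

    private void releaseIfHeld() {
        // Guard so that a callback on a path that never took the lock is a no-op.
        if (lock.isWriteLockedByCurrentThread()) {
            lock.writeLock().unlock();
        }
    }

    boolean isWriteLocked() {
        return lock.isWriteLocked();
    }
}

// Hypothetical commit driver that simulates a failure after beforeCommitted().
class CommitDriverSketch {
    static boolean commit(AbortCleanupSketch job, boolean persistFails) {
        job.beforeCommitted();
        try {
            if (persistFails) {
                throw new IllegalStateException("simulated persist failure");
            }
            job.afterCommitted();
            return true;
        } catch (IllegalStateException e) {
            // Assumption: the failure path reaches the abort callback.
            job.afterAborted();
            return false;
        }
    }
}
```

The `isWriteLockedByCurrentThread()` guard keeps the release idempotent, so either callback can run without risking an `IllegalMonitorStateException` from unlocking a lock the thread does not own.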

@JNSimba JNSimba dismissed github-actions[bot]’s stale review April 15, 2026 06:30

The lock will be released in onStreamTaskFail().

@JNSimba JNSimba merged commit 8548d7b into apache:master Apr 15, 2026
39 of 42 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 15, 2026
…during EditLog replay (#62416)

yiguolei pushed a commit that referenced this pull request Apr 16, 2026
…OnCommitted during EditLog replay #62416 (#62516)

Cherry-picked from #62416

Co-authored-by: wudi <wudi@selectdb.com>
JNSimba added a commit to JNSimba/doris that referenced this pull request Apr 20, 2026
…during EditLog replay (apache#62416)

(cherry picked from commit 8548d7b)

Labels

approved (Indicates a PR has been approved by one committer.), dev/4.0.x, dev/4.1.1-merged, reviewed

8 participants