[improve](streaming-job) Cap auto-resume attempts and expose structured FailureReason#62345
[improve](streaming-job) Cap auto-resume attempts and expose structured FailureReason#62345JNSimba merged 1 commit intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
/review |
|
run buildall |
There was a problem hiding this comment.
Found 2 issues worth fixing before merging.
StreamingInsertJob.resetFailureInfo()does not resetlatestAutoResumeTimestamp, so the next failure after a manual resume or a successful run can reuse the previous retry window and auto-resume earlier than the base backoff.lastTaskSuccessTimeis written and exposed, but streaming replay only applies a partial update inreplayOnUpdated(). The new field is not copied there, sojobs()losesLastTaskSuccessTimeafter FE replay/restart.
Critical checkpoints
- Goal of task: Partially met. The retry cap/backoff accumulation and TVF additions are implemented, but the two issues above leave the fresh-backoff behavior and persisted TVF state incomplete.
- Minimality/focus: Yes. The patch stays focused on streaming job retry policy and job TVF output.
- Concurrency: Reviewed. The changed paths run under the existing job lock/scheduler flow; I did not find a new lock-order or deadlock issue. The fresh-backoff bug is logical rather than synchronization-related.
- Lifecycle/static init: No special lifecycle or static-init issue introduced.
- Config changes:
streaming_job_max_auto_resume_countis master-only and mutable, and the scheduler reads it directly so runtime changes take effect without restart. - Compatibility/incompatible change: Applicable.
jobs("type"="insert")gains a new column; no FE/BE protocol issue seen. User-visible TVF behavior changed, so replay correctness for the new field matters. - Parallel paths: The new
LastTaskSuccessTimecolumn was added for both insert and streaming job TVF paths. - Special conditionals: The new auto-resume guards are understandable, but the reset path is incomplete because it leaves the old retry timestamp behind.
- Test coverage: Added regression tests cover happy-path
LastTaskSuccessTimeand max-retry exhaustion, but they miss FE restart/edit-log replay and the fresh-backoff-after-resume/success timing case. - Observability: Adequate for this scope.
- Transaction/persistence: Applicable. The new serialized field is not fully replayed on the streaming partial-update path.
- Data writes/modifications: Not applicable beyond job metadata persistence.
- FE/BE variable passing: Not applicable.
- Performance: No material new performance issue found.
- Other issues: None beyond the two findings above.
Overall opinion: not ready yet; the retry reset and replay path should be fixed first.
There was a problem hiding this comment.
Pull request overview
This PR improves operational safety and observability for streaming insert jobs by (1) capping auto-resume retries to avoid infinite retry loops, and (2) exposing richer, structured failure/progress information via the jobs("type"="insert") TVF.
Changes:
- Add a new FE config
streaming_job_max_auto_resume_countto cap auto-resume attempts; once exhausted, the job rewrites its failure reason toCANNOT_RESUME_ERRto stop further retries. - Change streaming insert job TVF
ErrorMsgto return a JSON-serializedFailureReason(code + msg), and add a new TVF columnLastTaskSuccessTime. - Fix retry/backoff accounting by no longer resetting
autoResumeCountonPENDING -> RUNNING, and add regression coverage for the retry cap + new TVF column.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/job_p0/streaming_job/test_streaming_job_max_retry.groovy | New regression to verify retry budget exhaustion rewrites failure reason to CANNOT_RESUME_ERR. |
| regression-test/suites/job_p0/streaming_job/test_streaming_insert_job.groovy | Adds assertion that LastTaskSuccessTime is populated after successful commits. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingJobSchedulerTask.java | Enforces max auto-resume count and stops retrying by rewriting failure reason. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java | Resets retry budget only on true “success/resume”; emits FailureReason JSON and LastTaskSuccessTime in TVF. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/InsertJob.java | Extends insert-job TVF schema/row with LastTaskSuccessTime. |
| fe/fe-core/src/main/java/org/apache/doris/job/base/AbstractJob.java | Tracks lastTaskSuccessTime on task success for jobs that use onTaskSuccess. |
| fe/fe-common/src/main/java/org/apache/doris/common/Config.java | Adds streaming_job_max_auto_resume_count config definition. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
e23f742 to
d7bc886
Compare
|
run buildall |
…ed FailureReason Add a max auto-resume retry budget and make the jobs() TVF expose the full FailureReason as JSON so the frontend can discriminate failure codes without grepping the message string. Also fix a latent exponential backoff bug so the new budget actually accumulates across failed retries. - New master-only mutable config streaming_job_max_auto_resume_count (default 10). Once exhausted, autoResumeHandler rewrites the failure reason to CANNOT_RESUME_ERR and stops retrying, without introducing a new JobStatus. - ErrorMsg TVF column now returns GsonUtils.GSON.toJson(failureReason) so frontends can switch on the structured code instead of pattern-matching msg. - New LastTaskSuccessTime TVF column shows the time of the last successful task commit, so operators can see whether a job is making real progress. Also copied in replayOnUpdated() so it survives FE restart. - Fix latent backoff bug: handlePendingState used to reset autoResumeCount on every PENDING->RUNNING transition, so the exponential backoff interval was stuck at the base value across repeated failures. The counter (and the paired latestAutoResumeTimestamp) is now reset only inside resetFailureInfo(null), which fires on a successful task or on RESUME JOB, so exponential backoff and the max retry budget actually accumulate across failed retries. - autoResumeCount and latestAutoResumeTimestamp are intentionally not serialized: an FE restart resets the retry budget, which gives paused jobs a fresh retry chance rather than freezing them at the burn-out boundary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
|
run cloud_p0 |
FE Regression Coverage ReportIncrement line coverage |
…ed FailureReason (#62345) ### What problem does this PR solve? `StreamingInsertJob` has two gaps: 1. **No retry budget.** `autoResumeHandler` retries forever — a permanently broken job keeps hammering at the 5-min backoff cap indefinitely. 2. **Unstructured `ErrorMsg`.** The `jobs()` TVF only returns `FailureReason.msg`, so the frontend cannot tell manual pause from auto-recovery from non-recoverable failure without grepping the error message.
What problem does this PR solve?
StreamingInsertJobhas two gaps:autoResumeHandlerretries forever — a permanently broken job keeps hammering at the 5-min backoff cap indefinitely.ErrorMsg. Thejobs()TVF only returnsFailureReason.msg, so the frontend cannot tell manual pause from auto-recovery from non-recoverable failure without grepping the error message.Changes
streaming_job_max_auto_resume_count(default 10). Once exhausted,autoResumeHandlerrewrites the failure reason toCANNOT_RESUME_ERRand stops retrying — no newJobStatusvalue needed.ErrorMsgcolumn now returns a FailureReason JSON ({"code":"...","msg":"..."}). Existing.contains(...)test assertions still work because the message is embedded.LastTaskSuccessTimeTVF column for observability.handlePendingStateused to resetautoResumeCounton everyPENDING → RUNNING, so the backoff interval was stuck at the base value. Reset is now insideresetFailureInfo(null)(fires on task success andRESUME JOB), so exponential backoff and the retry budget accumulate correctly.Release note
Add
streaming_job_max_auto_resume_countconfig (default 10) to cap auto-resume attempts for streaming insert jobs. TheErrorMsgcolumn injobs("type"="insert")now returnsFailureReasonas JSON. NewLastTaskSuccessTimecolumn shows the time of the last successful task commit.Check List (For Author)
Test
test_streaming_job_max_retry;test_streaming_insert_jobextended withLastTaskSuccessTimeassertion)Behavior changed:
ErrorMsgcolumn: plain message →FailureReasonJSON (.contains(...)unaffected).LastTaskSuccessTimecolumn appended toInsertJobschema.streaming_job_max_auto_resume_countattempts.Does this need documentation?
Check List (For Reviewer who merge this PR)