[fix](streaming-job) keep isCanceled set when cancel runs on terminal task#63427

Open
JNSimba wants to merge 2 commits into apache:master from JNSimba:fix/streaming-late-commit-offset-skip

@JNSimba
Member

@JNSimba JNSimba commented May 20, 2026

What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary:

A streaming insert job (CDC source / JdbcSourceOffsetProvider) can become permanently stuck in PAUSED when a BE-side commit arrives after the FE-side task timeout. Symptoms observed in production:

  • Job status PAUSED with empty ErrorMsg / JobRuntimeMsg.
  • Latest task PENDING and never scheduled (scheduler logs "do not need to schedule invalid task ... job status: PAUSED").
  • Previous task status SUCCESS but its ErrorMsg = "task failed cause timeout".
  • auto resume never recovers the job; only manual RESUME JOB works.

Root cause

  1. FE processTimeoutTasks detects task timeout and calls runningMultiTask.onFail("task failed cause timeout"). AbstractStreamingTask.onFail sets task status to FAILED.
  2. StreamingInsertJob.onStreamTaskFail sets failureReason and calls updateJobStatus(PAUSED), which in turn invokes clearRunningStreamTask, which calls task.cancel(true).
  3. AbstractStreamingTask.cancel() short-circuits on terminal status: it returns immediately when status is already FAILED/SUCCESS/CANCELED, so isCanceled is never flipped to true.
  4. A late BE callback arrives at StreamingInsertJob.commitOffset. The existing checks (runningStreamTask != null, instanceof StreamingMultiTblTask, and the taskId match) all pass, and the downstream defenses in successCallback/beforeCommitted gate on getIsCanceled().get(), which is still false. successCallback therefore overwrites the task status with SUCCESS and calls onStreamTaskSuccess, which calls resetFailureInfo(null) and clears failureReason.
  5. StreamingJobSchedulerTask.autoResumeHandler returns early whenever failureReason == null, so the PAUSED job is never resumed.

The bug is essentially: cancel() is supposed to be the single source of truth that says "this task instance is dead, do not accept further callbacks", but its terminal short-circuit prevents the signal from being broadcast through isCanceled, leaving every other defense in the streaming task path silently bypassed.
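
The five steps above can be reproduced with a small stand-alone model. This is only a sketch: `Task` and `Job` are hypothetical stand-ins for AbstractStreamingTask and StreamingInsertJob, and the method bodies are simplified assumptions, not the actual Doris code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Simplified model of the race; hypothetical classes, not the Doris sources. */
public class LateCallbackBugSketch {
    enum Status { RUNNING, FAILED, SUCCESS, CANCELED }

    /** Stand-in for AbstractStreamingTask before the fix. */
    static class Task {
        Status status = Status.RUNNING;
        final AtomicBoolean isCanceled = new AtomicBoolean(false);

        void onFail() {
            status = Status.FAILED;
        }

        // Pre-fix behavior: a terminal status returns before the flag is set.
        void cancel() {
            if (status != Status.RUNNING) {
                return;
            }
            isCanceled.set(true);
            status = Status.CANCELED;
        }
    }

    /** Stand-in for StreamingInsertJob. */
    static class Job {
        String failureReason;
        final Task runningTask = new Task();

        void onTaskTimeout() {
            runningTask.onFail();                        // step 1: FAILED
            failureReason = "task failed cause timeout"; // step 2
            runningTask.cancel();                        // step 3: no-op on FAILED
        }

        void commitOffset() {                            // step 4: late BE callback
            if (runningTask.isCanceled.get()) {
                return;                                  // never taken before the fix
            }
            runningTask.status = Status.SUCCESS;
            failureReason = null;
        }

        boolean autoResume() {                           // step 5
            return failureReason != null;
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        job.onTaskTimeout();
        job.commitOffset();
        System.out.println("status=" + job.runningTask.status
                + " failureReason=" + job.failureReason
                + " autoResumes=" + job.autoResume());
    }
}
```

In this model the run ends with `status=SUCCESS failureReason=null autoResumes=false`, which matches the production symptoms: a SUCCESS task, an empty failure reason, and no auto resume.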

Fix

  • AbstractStreamingTask.cancel(): always flip isCanceled on entry, even when the task is already in a terminal state. This restores the contract that 10+ existing getIsCanceled().get() checks across the streaming task path rely on (e.g. successCallback, beforeCommitted, internal abort points in StreamingInsertTask / StreamingMultiTblTask).
  • StreamingInsertJob.commitOffset(): add an isCanceled guard right after the instanceof StreamingMultiTblTask check so the late callback is dropped (logged at INFO) before any side effects (updateNoTxnJobStatisticAndOffset, onTaskCommitted, persistOffsetProviderIfNeed) run.
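
A minimal sketch of the two changes, using hypothetical `Task` and `Job` stand-ins rather than the real classes. Using `compareAndSet` for the idempotent early return and returning a boolean from `commitOffset` are assumptions made for illustration; the actual patch may be structured differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Simplified model of the fix; hypothetical classes, not the Doris sources. */
public class LateCallbackFixSketch {
    enum Status { RUNNING, FAILED, SUCCESS, CANCELED }

    /** Stand-in for AbstractStreamingTask after the fix. */
    static class Task {
        Status status = Status.RUNNING;
        final AtomicBoolean isCanceled = new AtomicBoolean(false);

        void cancel() {
            // Flip the flag on entry; a repeated call finds it set and returns.
            if (!isCanceled.compareAndSet(false, true)) {
                return;
            }
            // Terminal statuses are preserved; only a live task becomes CANCELED.
            if (status == Status.RUNNING) {
                status = Status.CANCELED;
            }
        }
    }

    /** Stand-in for StreamingInsertJob after the fix. */
    static class Job {
        String failureReason;
        long committedOffset = 0L;
        final Task runningTask = new Task();

        /** Returns true if the offset was applied, false if the callback was dropped. */
        boolean commitOffset(long offset) {
            if (runningTask.isCanceled.get()) {
                // Guard runs before any side effect on offset, status, or failureReason.
                System.out.println("skip late commitOffset for canceled task");
                return false;
            }
            committedOffset = offset;
            runningTask.status = Status.SUCCESS;
            failureReason = null;
            return true;
        }
    }

    public static void main(String[] args) {
        Job job = new Job();
        job.runningTask.status = Status.FAILED;          // timeout already recorded
        job.failureReason = "task failed cause timeout";
        job.runningTask.cancel();                        // now sets isCanceled on FAILED
        boolean applied = job.commitOffset(42L);         // late BE callback
        System.out.println("applied=" + applied
                + " status=" + job.runningTask.status
                + " failureReason=" + job.failureReason);
    }
}
```

With the flag set, the late callback is dropped, the task stays FAILED, and failureReason survives for the auto-resume handler to act on.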

Release note

Fix streaming insert job stuck in PAUSED when a late BE commit callback arrives after FE-side task timeout.

Check List (For Author)

  • Test
    • Unit Test
    • Regression test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason

New unit tests in StreamingInsertJobLateCallbackTest:

  • cancel() flips isCanceled on a terminal-state task (FAILED / SUCCESS) without overriding the existing status.
  • cancel() transitions a RUNNING task to CANCELED correctly.
  • cancel() is idempotent — the second invocation early-returns and leaves task state untouched.
  • commitOffset() silently skips when the running task is already canceled (status preserved, no successCallback side effects).

Regression coverage relies on the existing CDC pause/resume suites under regression-test/suites/job_p0/streaming_job/cdc/ to guard the normal happy path. The exact "BE late callback after FE timeout" timing cannot be reliably reproduced in the existing CDC tests, which are not in the nonConcurrent group, without adding debug points to commitOffset.

  • Behavior changed:

    • Yes. A late BE commitOffset arriving after a streaming task has been canceled is now dropped (logged at INFO) instead of being allowed to mutate task / job state and clear failureReason. Side effect: on non-unique-key target tables, auto-resume may now produce a small number of duplicate rows from re-running the same input range, in exchange for the job no longer being permanently stuck in PAUSED.
  • Does this need documentation?

    • No.

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Member Author

JNSimba commented May 20, 2026

run buildall

@JNSimba JNSimba requested a review from Copilot May 20, 2026 03:50
@JNSimba
Member Author

JNSimba commented May 20, 2026

/review

Contributor

Copilot AI left a comment

Pull request overview

Fixes a streaming insert (CDC / JdbcSourceOffsetProvider) edge case where a late BE commitOffset callback could overwrite a timed-out/canceled task’s terminal state, clear the job’s failureReason, and leave the job permanently stuck in PAUSED (auto-resume never triggers).

Changes:

  • Update AbstractStreamingTask.cancel() to always flip isCanceled even if the task is already in a terminal state.
  • Add an isCanceled guard in StreamingInsertJob.commitOffset() to drop late commit callbacks for canceled multi-table tasks before any side effects occur.
  • Add unit tests covering terminal-state cancel semantics, idempotency, and commitOffset late-callback suppression.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/AbstractStreamingTask.java Ensures cancel() reliably publishes cancellation via isCanceled, even for terminal tasks.
fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java Drops late commitOffset callbacks when the current multi-table task is already canceled.
fe/fe-core/src/test/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJobLateCallbackTest.java Adds unit coverage for terminal-task cancel behavior and late-callback suppression.

@JNSimba
Member Author

JNSimba commented May 20, 2026

run buildall

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@hello-stephen
Contributor

TPC-H: Total hot run time: 31561 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 22dabb03b721ca94f021653f087f3f2fa5ad3917, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17599	3895	3909	3895
q2	q3	10824	1451	831	831
q4	4695	479	343	343
q5	7580	2348	2129	2129
q6	249	179	141	141
q7	946	793	653	653
q8	9437	1674	1578	1578
q9	5109	4990	4919	4919
q10	6385	2106	1775	1775
q11	442	268	242	242
q12	639	427	293	293
q13	18115	3461	2778	2778
q14	263	256	240	240
q15	q16	828	776	719	719
q17	1016	1002	949	949
q18	6934	5667	5646	5646
q19	1166	1322	1139	1139
q20	541	536	313	313
q21	5834	2821	2655	2655
q22	464	385	323	323
Total cold run time: 99066 ms
Total hot run time: 31561 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4897	4557	4498	4498
q2	q3	4902	5344	4668	4668
q4	2164	2236	1423	1423
q5	4737	4835	4733	4733
q6	233	177	127	127
q7	1882	1816	1544	1544
q8	2415	2065	2068	2065
q9	7779	7558	7263	7263
q10	4470	4424	4005	4005
q11	538	375	346	346
q12	718	718	504	504
q13	3100	3397	2872	2872
q14	271	290	250	250
q15	q16	678	697	608	608
q17	1319	1252	1243	1243
q18	7245	6915	6859	6859
q19	1156	1079	1086	1079
q20	2212	2219	1929	1929
q21	5358	4686	4592	4592
q22	527	467	405	405
Total cold run time: 56601 ms
Total hot run time: 51013 ms

@JNSimba
Member Author

JNSimba commented May 20, 2026

run cloud_p0

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169981 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 22dabb03b721ca94f021653f087f3f2fa5ad3917, data reload: false

query5	4319	665	538	538
query6	354	222	205	205
query7	4243	551	322	322
query8	341	228	218	218
query9	8833	4004	4037	4004
query10	450	358	323	323
query11	5765	2498	2233	2233
query12	183	134	133	133
query13	1298	615	426	426
query14	5957	5371	5068	5068
query14_1	4369	4384	4363	4363
query15	210	205	186	186
query16	1063	471	490	471
query17	1196	734	589	589
query18	2546	484	379	379
query19	220	201	170	170
query20	135	136	130	130
query21	220	141	121	121
query22	13581	13603	13327	13327
query23	17152	16512	16163	16163
query23_1	16192	16303	16043	16043
query24	7631	1755	1301	1301
query24_1	1300	1301	1314	1301
query25	544	516	403	403
query26	1324	312	171	171
query27	2690	518	348	348
query28	4439	1970	1943	1943
query29	998	632	502	502
query30	308	246	199	199
query31	1110	1059	942	942
query32	100	81	72	72
query33	549	369	314	314
query34	1169	1159	645	645
query35	768	789	676	676
query36	1317	1344	1145	1145
query37	154	107	93	93
query38	3234	3131	3031	3031
query39	933	937	905	905
query39_1	886	891	907	891
query40	247	151	126	126
query41	68	64	62	62
query42	120	116	111	111
query43	324	334	279	279
query44	
query45	209	202	195	195
query46	1070	1199	729	729
query47	2339	2353	2249	2249
query48	403	429	339	339
query49	648	513	403	403
query50	1040	348	265	265
query51	4298	4270	4296	4270
query52	109	109	99	99
query53	255	291	211	211
query54	329	297	279	279
query55	98	97	87	87
query56	314	317	320	317
query57	1428	1396	1353	1353
query58	320	288	282	282
query59	1604	1677	1442	1442
query60	338	344	360	344
query61	157	147	154	147
query62	667	632	517	517
query63	244	201	212	201
query64	2441	796	645	645
query65	
query66	1715	478	361	361
query67	30080	29990	29941	29941
query68	
query69	477	342	335	335
query70	1011	1024	1040	1024
query71	318	275	273	273
query72	2956	2738	2454	2454
query73	825	760	442	442
query74	5060	4886	4704	4704
query75	2657	2620	2254	2254
query76	2296	1127	766	766
query77	407	407	334	334
query78	12050	12066	11748	11748
query79	1251	1029	687	687
query80	578	558	470	470
query81	450	276	241	241
query82	234	153	123	123
query83	292	270	250	250
query84	263	135	111	111
query85	865	536	457	457
query86	349	331	351	331
query87	3372	3383	3231	3231
query88	3484	2647	2622	2622
query89	435	405	342	342
query90	2184	185	177	177
query91	182	165	143	143
query92	80	78	73	73
query93	1404	1463	915	915
query94	521	345	317	317
query95	676	377	458	377
query96	1018	831	338	338
query97	2684	2717	2577	2577
query98	232	228	227	227
query99	1102	1087	1000	1000
Total cold run time: 251602 ms
Total hot run time: 169981 ms

@JNSimba
Member Author

JNSimba commented May 20, 2026

run fe_ut

@JNSimba
Member Author

JNSimba commented May 20, 2026

run cloud_p0

@JNSimba
Member Author

JNSimba commented May 20, 2026

run p0

@JNSimba
Member Author

JNSimba commented May 20, 2026

run feut

@JNSimba
Member Author

JNSimba commented May 20, 2026

run external

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 2.11% (2/95) 🎉
Increment coverage report
Complete coverage report

@JNSimba
Member Author

JNSimba commented May 20, 2026

/review

@hello-stephen
Contributor

TPC-H: Total hot run time: 31818 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6164e104c6667632613fa84e86ad823acf29e452, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	18004	3977	3923	3923
q2	q3	10799	1371	838	838
q4	4685	475	348	348
q5	7537	2319	2128	2128
q6	244	181	139	139
q7	988	792	644	644
q8	9363	1772	1514	1514
q9	5850	4978	4949	4949
q10	6464	2114	1812	1812
q11	448	278	249	249
q12	664	434	299	299
q13	18123	3360	2844	2844
q14	269	255	236	236
q15	q16	817	799	716	716
q17	1003	944	940	940
q18	7077	6007	5703	5703
q19	1219	1317	1233	1233
q20	576	444	312	312
q21	6006	2887	2662	2662
q22	569	371	329	329
Total cold run time: 100705 ms
Total hot run time: 31818 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4658	4578	4518	4518
q2	q3	5011	5292	4594	4594
q4	2145	2242	1432	1432
q5	4911	4590	4758	4590
q6	241	182	133	133
q7	1869	1727	1538	1538
q8	2397	2089	2088	2088
q9	7687	7248	7205	7205
q10	4485	4384	3982	3982
q11	529	383	351	351
q12	723	745	518	518
q13	3008	3398	2777	2777
q14	270	279	247	247
q15	q16	682	703	616	616
q17	1271	1261	1248	1248
q18	7172	6890	6803	6803
q19	1143	1086	1078	1078
q20	2219	2216	1937	1937
q21	5395	4716	4536	4536
q22	525	478	414	414
Total cold run time: 56341 ms
Total hot run time: 50605 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171240 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6164e104c6667632613fa84e86ad823acf29e452, data reload: false

query5	4368	664	537	537
query6	332	229	214	214
query7	4229	592	307	307
query8	341	235	220	220
query9	8809	4071	4017	4017
query10	464	347	309	309
query11	5815	2432	2249	2249
query12	187	132	132	132
query13	1281	618	448	448
query14	6018	5374	5108	5108
query14_1	4425	4377	4400	4377
query15	212	208	189	189
query16	1075	475	459	459
query17	1156	763	603	603
query18	2750	483	371	371
query19	226	218	183	183
query20	149	139	134	134
query21	227	139	122	122
query22	13570	13549	13466	13466
query23	17236	16471	16062	16062
query23_1	16242	16186	16136	16136
query24	7487	1770	1319	1319
query24_1	1308	1308	1308	1308
query25	591	500	454	454
query26	1331	354	181	181
query27	2667	569	343	343
query28	4398	1993	1925	1925
query29	1016	657	540	540
query30	316	242	199	199
query31	1110	1074	940	940
query32	94	82	78	78
query33	567	363	304	304
query34	1220	1134	660	660
query35	761	778	664	664
query36	1332	1350	1191	1191
query37	155	108	90	90
query38	3228	3146	3065	3065
query39	933	921	888	888
query39_1	868	871	879	871
query40	232	152	129	129
query41	67	64	64	64
query42	115	115	119	115
query43	329	332	285	285
query44	
query45	219	201	198	198
query46	1110	1166	717	717
query47	2287	2351	2184	2184
query48	391	423	285	285
query49	633	496	393	393
query50	981	351	259	259
query51	4314	4291	4275	4275
query52	108	112	97	97
query53	252	285	204	204
query54	319	276	261	261
query55	96	99	87	87
query56	327	326	325	325
query57	1440	1374	1321	1321
query58	307	279	269	269
query59	1598	1605	1441	1441
query60	338	337	317	317
query61	162	148	148	148
query62	673	627	572	572
query63	249	209	204	204
query64	2377	790	631	631
query65	
query66	1664	502	357	357
query67	30024	30021	29932	29932
query68	
query69	463	349	314	314
query70	1095	1033	965	965
query71	308	286	279	279
query72	2987	2730	2442	2442
query73	847	799	417	417
query74	5103	4917	4718	4718
query75	2709	2604	2283	2283
query76	2309	1147	788	788
query77	408	432	346	346
query78	11994	12118	11617	11617
query79	1472	1053	762	762
query80	674	573	468	468
query81	458	273	245	245
query82	1368	166	120	120
query83	363	278	249	249
query84	266	147	117	117
query85	902	543	463	463
query86	415	346	323	323
query87	3425	3363	3196	3196
query88	3565	2685	2650	2650
query89	450	392	335	335
query90	1993	183	187	183
query91	179	170	147	147
query92	86	77	75	75
query93	1558	1452	933	933
query94	553	361	326	326
query95	682	396	457	396
query96	1072	796	348	348
query97	2703	2701	2544	2544
query98	242	238	232	232
query99	1095	1128	985	985
Total cold run time: 253745 ms
Total hot run time: 171240 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 1.16% (2/173) 🎉
Increment coverage report
Complete coverage report

Contributor

@github-actions github-actions Bot left a comment

I found a blocking data-correctness issue in the new late-callback handling.

Critical checkpoints:

  • Goal and tests: The PR aims to keep failed/canceled streaming multi-table tasks from being turned back into successful tasks by late BE commit callbacks. The unit test covers the skipped-callback shape, but it does not cover the case where the BE write has already succeeded and only the FE callback is late.
  • Scope/minimality: The code change is small, but the canceled-task guard is broader than the intended failed-terminal case.
  • Concurrency/lifecycle: This path is inherently concurrent with user pause/stop, timeout failure handling, BE stream-load completion, and callback delivery. The job write lock serializes FE state mutation, but it does not cancel or roll back the already-issued BE multi-table write.
  • Compatibility/config/static lifecycle: No new configs, persistence format changes, static lifecycle concerns, or mixed-version protocol changes found.
  • Parallel paths: Single-table transactional streaming uses txn callbacks; this issue is specific to multi-table commitOffset, where offset persistence is separated from BE writes.
  • Tests: The added unit test verifies that a canceled failed task does not become SUCCESS, but it misses the data-consistency scenario where a canceled/timed-out task already loaded rows and still needs its offset committed or otherwise compensated.
  • Observability: Existing logs are sufficient to see skipped callbacks, but logging does not prevent offset/data divergence.
  • Transaction/persistence/data writes: Blocking issue found: target data can be loaded while source offset persistence is skipped, causing reprocessing/duplicates after resume.
  • User focus: No additional user-provided review focus.

Please address the inline issue before approval.
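
The offset/data divergence described in the transaction checkpoint can be illustrated with a toy model. Everything here is hypothetical (an in-memory list as an append-only target table, a single long as the source offset); it only shows the shape of the problem, not the Doris write path.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a BE write that lands while the FE offset commit is skipped. */
public class OffsetDivergenceSketch {
    /** Hypothetical append-only target table (no unique key). */
    final List<Long> targetRows = new ArrayList<>();
    /** Hypothetical persisted source offset. */
    long committedOffset = 0L;

    /** BE side: writes source rows [from, to) to the target table. */
    void beWrite(long from, long to) {
        for (long i = from; i < to; i++) {
            targetRows.add(i);
        }
    }

    /** FE side: persists the new offset unless the task was canceled. */
    void feCommit(long newOffset, boolean taskCanceled) {
        if (taskCanceled) {
            return; // late callback dropped; rows already written stay in the table
        }
        committedOffset = newOffset;
    }

    public static void main(String[] args) {
        OffsetDivergenceSketch s = new OffsetDivergenceSketch();
        // Task 1: BE writes rows 0..2, but FE had already timed out and canceled it.
        s.beWrite(s.committedOffset, 3L);
        s.feCommit(3L, true);
        // After resume, task 2 restarts from the unchanged offset.
        s.beWrite(s.committedOffset, 3L);
        s.feCommit(3L, false);
        System.out.println("rows=" + s.targetRows + " offset=" + s.committedOffset);
    }
}
```

In this model the table ends with `rows=[0, 1, 2, 0, 1, 2]` while the offset is 3: the same input range is loaded twice. On a unique-key target the second write would overwrite rather than duplicate, which is the distinction the PR description draws.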
