Skip to content

[fix](be) Fix time-sharing executor queued task count#63568

Merged
yiguolei merged 1 commit into
apache:masterfrom
Jungzhang:fix-time-sharing-queue-count
May 26, 2026
Merged

[fix](be) Fix time-sharing executor queued task count#63568
yiguolei merged 1 commit into
apache:masterfrom
Jungzhang:fix-time-sharing-queue-count

Conversation

@Jungzhang
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary:

TimeSharingTaskExecutor uses _total_queued_tasks for queue-size metrics and capacity checks. When queued splits were removed before execution, for example when a task was cancelled or removed, the split queue removed those splits but _total_queued_tasks was not decremented.

After repeated removals, _total_queued_tasks could become larger than the real queue size. This made the executor report a non-zero queue size even when there were no active or queued splits, and later submissions could be rejected as if the queue were full.

This PR keeps queue offer/remove operations consistent by updating _total_queued_tasks together with the split queue and token state.

Release note

Fix a bug where the time-sharing scan executor queue size could become inaccurate after queued splits were removed before execution.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. Queued splits removed before execution now decrement executor queued-task accounting, so queue-size metrics and capacity checks reflect the real queued split count.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Keep the executor queued-task metric consistent when queued splits are removed before execution. The fix routes queue offer/remove operations through helpers that update _total_queued_tasks together with the split queue and token state.

Add a regression test that enqueues splits with no worker threads, removes their tasks, and verifies the queue count returns to zero so later submissions are not rejected as full.
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@Jungzhang
Copy link
Copy Markdown
Contributor Author

run buildall

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-H: Total hot run time: 31309 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3859c1aecff89494887d6fbd8c67e1ab30b042ef, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17621	4050	4033	4033
q2	q3	10747	1379	814	814
q4	4681	479	351	351
q5	7539	2275	2140	2140
q6	255	168	135	135
q7	954	786	639	639
q8	9388	1667	1582	1582
q9	6048	4977	4938	4938
q10	6454	2245	1917	1917
q11	451	274	247	247
q12	693	434	294	294
q13	18212	3463	2758	2758
q14	266	261	242	242
q15	q16	827	772	713	713
q17	978	1023	888	888
q18	6906	5739	5554	5554
q19	1228	1365	1092	1092
q20	500	418	281	281
q21	5704	2619	2395	2395
q22	436	356	296	296
Total cold run time: 99888 ms
Total hot run time: 31309 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4380	4379	4339	4339
q2	q3	4566	5015	4365	4365
q4	2126	2240	1424	1424
q5	4471	4344	4588	4344
q6	259	201	149	149
q7	2080	1856	1700	1700
q8	2529	2176	2168	2168
q9	8073	8045	8042	8042
q10	4844	4766	4365	4365
q11	596	453	396	396
q12	744	763	571	571
q13	3319	3660	2992	2992
q14	302	318	275	275
q15	q16	717	725	643	643
q17	1385	1384	1451	1384
q18	7908	7393	7010	7010
q19	1162	1098	1096	1096
q20	2243	2259	1944	1944
q21	5405	4640	4521	4521
q22	537	487	426	426
Total cold run time: 57646 ms
Total hot run time: 52154 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

TPC-DS: Total hot run time: 171614 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3859c1aecff89494887d6fbd8c67e1ab30b042ef, data reload: false

query5	4327	649	515	515
query6	327	230	198	198
query7	4276	583	329	329
query8	335	234	217	217
query9	8832	4122	4059	4059
query10	473	331	301	301
query11	5790	2464	2264	2264
query12	184	127	124	124
query13	1270	625	448	448
query14	6111	5498	5238	5238
query14_1	4576	4532	4485	4485
query15	210	203	184	184
query16	985	458	430	430
query17	1155	750	636	636
query18	2536	487	342	342
query19	221	204	158	158
query20	145	128	129	128
query21	211	133	117	117
query22	13648	13620	13402	13402
query23	17401	16549	16213	16213
query23_1	16259	16316	16330	16316
query24	7603	1778	1344	1344
query24_1	1305	1304	1331	1304
query25	524	463	424	424
query26	1328	324	170	170
query27	2699	583	337	337
query28	4453	2032	1997	1997
query29	995	629	502	502
query30	312	251	198	198
query31	1117	1078	950	950
query32	95	74	72	72
query33	536	349	289	289
query34	1165	1129	654	654
query35	771	800	696	696
query36	1405	1452	1292	1292
query37	173	113	94	94
query38	3223	3154	3088	3088
query39	913	913	909	909
query39_1	885	877	866	866
query40	229	148	121	121
query41	65	64	60	60
query42	138	108	109	108
query43	332	329	288	288
query44	
query45	214	205	193	193
query46	1064	1173	759	759
query47	2382	2410	2217	2217
query48	380	443	296	296
query49	626	481	381	381
query50	1061	350	248	248
query51	4413	4284	4250	4250
query52	102	102	92	92
query53	253	278	205	205
query54	305	283	251	251
query55	92	88	89	88
query56	299	306	313	306
query57	1445	1409	1340	1340
query58	301	268	251	251
query59	1578	1725	1424	1424
query60	317	317	305	305
query61	157	159	151	151
query62	687	641	588	588
query63	242	196	207	196
query64	2368	792	634	634
query65	
query66	1682	519	350	350
query67	30010	29932	29132	29132
query68	
query69	464	340	289	289
query70	1028	1012	994	994
query71	304	270	265	265
query72	2958	2946	2416	2416
query73	847	743	435	435
query74	5138	4952	4831	4831
query75	2719	2611	2272	2272
query76	2272	1136	780	780
query77	398	407	338	338
query78	12375	12311	11842	11842
query79	1259	1035	757	757
query80	588	605	483	483
query81	467	280	255	255
query82	242	162	131	131
query83	373	288	255	255
query84	257	145	112	112
query85	983	618	539	539
query86	387	356	336	336
query87	3477	3399	3263	3263
query88	3625	2772	2741	2741
query89	433	396	359	359
query90	2024	189	180	180
query91	192	186	170	170
query92	78	84	80	80
query93	1424	1447	880	880
query94	535	363	341	341
query95	717	388	356	356
query96	1034	768	365	365
query97	2751	2715	2613	2613
query98	254	232	239	232
query99	1180	1146	1034	1034
Total cold run time: 253225 ms
Total hot run time: 171614 ms

@hello-stephen
Copy link
Copy Markdown
Contributor

BE Regression && UT Coverage Report

Increment line coverage 93.33% (28/30) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.70% (27991/37979)
Line Coverage 57.69% (304310/527497)
Region Coverage 54.94% (255012/464151)
Branch Coverage 56.39% (110020/195113)

Copy link
Copy Markdown
Contributor

@kaka11chen kaka11chen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 25, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Copy Markdown
Contributor

ghost commented May 25, 2026

PR approved by anyone and no changes requested.

@Jungzhang Jungzhang requested a review from kaka11chen May 25, 2026 11:48
@Jungzhang
Copy link
Copy Markdown
Contributor Author

Jungzhang commented May 25, 2026

Can it be patched to 4.1.0?

@yiguolei
Copy link
Copy Markdown
Contributor

@Jungzhang I will be released to 4.1.2. And I think you can patch the PR to 4.1.0 in your own env and test it.

@yiguolei
Copy link
Copy Markdown
Contributor

/review

Copy link
Copy Markdown
Contributor

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review result: no blocking issue found in the changed code.

Critical checkpoint conclusions:

  • Goal/test: The PR fixes stale queued-task accounting when queued splits are removed and when auto-rescheduled splits are re-offered. The added unit test covers the remove_task capacity/count regression.
  • Scope: The change is small and focused on centralizing queue offer/remove accounting.
  • Concurrency: The modified counters and token state transitions are performed under _lock; task maps/sets remain under _mutex. I did not find a new lock-order or missed-notification issue in the changed paths.
  • Lifecycle/static init: No new static initialization or non-obvious ownership lifecycle risk was introduced.
  • Configuration/compatibility: No new config, protocol, storage, or compatibility surface.
  • Parallel paths: The changed direct queue offers/removals in submit, auto-reschedule, task removal, and destructor are covered by the helper paths. Stop already resets the queue count after clearing.
  • Tests: A targeted BE unit test was added for queued count cleanup after remove_task. I did not run the full BE unit suite in this review environment.
  • Observability/performance: Existing metrics now receive a more accurate queued count; no obvious added hot-path overhead beyond constant-time accounting and existing queue-size scans on removal.
  • User focus: No additional user-provided review focus was present.

@yiguolei yiguolei merged commit ff1012d into apache:master May 26, 2026
29 of 30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants