Skip to content

Conversation

@kaka11chen
Copy link
Contributor

@kaka11chen kaka11chen commented Nov 9, 2025

What problem does this PR solve?

Problem Summary:

Release note

Fix time-sharing task executor don't re-schedule when exceeded concurrency limit case. The time-sharing task executor needs the state of ScanTask to track split scheduling and determine whether to schedule it as the first time by enqueue_splits or reschedule it by re_enqueue_split.

virtual Status submit_scan_task(SimplifiedScanTask scan_task) override {
        if (!_is_stop) {
            std::cout << "submitting scan task to task executor scheduler, context id: "
                      << scan_task.scanner_context->ctx_id << std::endl;
            std::shared_ptr<SplitRunner> split_runner;
            if (scan_task.scan_task->is_first_schedule) {
                split_runner = std::make_shared<ScannerSplitRunner>("scanner_split_runner",
                                                                    scan_task.scan_func);
                RETURN_IF_ERROR(split_runner->init());
                auto result = _task_executor->enqueue_splits(
                        scan_task.scanner_context->task_handle(), false, {split_runner});
                if (!result.has_value()) {
                    LOG(WARNING) << "enqueue_splits failed: " << result.error();
                    return result.error();
                }
                scan_task.scan_task->is_first_schedule = false;
            } else {
                split_runner = scan_task.scan_task->split_runner.lock();
                if (split_runner == nullptr) {
                    return Status::OK();
                }
                RETURN_IF_ERROR(_task_executor->re_enqueue_split(
                        scan_task.scanner_context->task_handle(), false, split_runner));
            }
            scan_task.scan_task->split_runner = split_runner;
            return Status::OK();
        } else {
            return Status::InternalError<false>("scanner pool {} is shutdown.", _sched_name);
        }
    }

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
      tpcds1000 q88
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Copy link
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaka11chen
Copy link
Contributor Author

run buildall

@kaka11chen kaka11chen marked this pull request as draft November 9, 2025 05:20
@kaka11chen kaka11chen force-pushed the fix_time_sharing_task_executor_concurrency branch from bcb9f25 to 4f3de8c Compare November 9, 2025 11:35
@kaka11chen
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TPC-H: Total hot run time: 35639 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 4f3de8cbb00df64e983077cfbd3a0d9aba5dcbbf, data reload: false

------ Round 1 ----------------------------------
q1	17639	5281	5123	5123
q2	2064	333	223	223
q3	10245	1307	755	755
q4	10235	897	383	383
q5	7524	2467	2364	2364
q6	189	168	140	140
q7	958	772	654	654
q8	9385	1382	1205	1205
q9	7034	5220	5174	5174
q10	6928	2248	1852	1852
q11	511	303	292	292
q12	376	383	244	244
q13	17804	3668	3085	3085
q14	238	239	217	217
q15	584	520	504	504
q16	1012	1000	959	959
q17	619	889	359	359
q18	7414	7605	7643	7605
q19	1334	1005	609	609
q20	389	359	245	245
q21	4372	4187	2563	2563
q22	1157	1084	1104	1084
Total cold run time: 108011 ms
Total hot run time: 35639 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5526	5413	5383	5383
q2	262	351	247	247
q3	2441	2923	2501	2501
q4	1441	1911	1570	1570
q5	4574	4441	4583	4441
q6	217	174	132	132
q7	2004	1916	1820	1820
q8	2593	2897	2560	2560
q9	7213	7390	7184	7184
q10	2913	3141	2674	2674
q11	592	509	493	493
q12	670	738	628	628
q13	3294	3660	3075	3075
q14	272	288	265	265
q15	553	505	501	501
q16	1053	1040	1006	1006
q17	1124	1431	1349	1349
q18	7329	7249	7102	7102
q19	823	775	871	775
q20	1921	1981	1819	1819
q21	4780	4460	4319	4319
q22	1058	1075	995	995
Total cold run time: 52653 ms
Total hot run time: 50839 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 185689 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 4f3de8cbb00df64e983077cfbd3a0d9aba5dcbbf, data reload: false

query1	1065	421	389	389
query2	6606	1702	1711	1702
query3	6760	225	223	223
query4	26331	23293	23223	23223
query5	4440	622	485	485
query6	326	227	214	214
query7	4652	490	299	299
query8	299	249	248	248
query9	8685	2601	2575	2575
query10	515	318	281	281
query11	15277	15204	14889	14889
query12	194	117	111	111
query13	1678	568	456	456
query14	10579	9441	9483	9441
query15	205	189	182	182
query16	7585	689	538	538
query17	1264	778	648	648
query18	2038	435	338	338
query19	219	222	186	186
query20	131	125	118	118
query21	219	143	115	115
query22	4027	4065	3949	3949
query23	33880	32882	32938	32882
query24	8483	2473	2426	2426
query25	594	520	449	449
query26	1240	273	158	158
query27	2750	488	348	348
query28	4348	2210	2185	2185
query29	796	605	489	489
query30	297	228	195	195
query31	883	836	703	703
query32	82	75	73	73
query33	593	358	329	329
query34	796	833	519	519
query35	807	841	737	737
query36	955	1013	897	897
query37	124	106	85	85
query38	3595	3520	3444	3444
query39	1481	1408	1404	1404
query40	256	131	118	118
query41	63	59	60	59
query42	127	116	113	113
query43	491	524	471	471
query44	1237	740	744	740
query45	188	178	171	171
query46	883	980	648	648
query47	1754	1780	1746	1746
query48	400	423	317	317
query49	783	530	430	430
query50	654	686	411	411
query51	4043	3862	3854	3854
query52	112	108	105	105
query53	241	290	204	204
query54	304	291	271	271
query55	92	87	84	84
query56	336	327	330	327
query57	1177	1190	1116	1116
query58	289	272	276	272
query59	2543	2694	2513	2513
query60	342	349	329	329
query61	155	157	161	157
query62	780	735	693	693
query63	230	215	210	210
query64	4466	1132	864	864
query65	4055	3916	3940	3916
query66	1167	431	349	349
query67	15311	15055	14958	14958
query68	7674	878	601	601
query69	506	318	292	292
query70	1335	1228	1299	1228
query71	414	349	314	314
query72	6093	4929	2686	2686
query73	659	568	361	361
query74	8902	9143	8915	8915
query75	3292	3293	2787	2787
query76	3282	1147	753	753
query77	499	390	312	312
query78	9530	9774	8950	8950
query79	1798	789	610	610
query80	718	575	499	499
query81	490	261	230	230
query82	204	167	138	138
query83	271	261	250	250
query84	251	108	92	92
query85	908	476	455	455
query86	322	284	286	284
query87	3892	3629	3579	3579
query88	2834	2290	2244	2244
query89	403	329	306	306
query90	1848	221	224	221
query91	204	192	159	159
query92	74	70	68	68
query93	1156	969	649	649
query94	676	464	353	353
query95	415	328	326	326
query96	484	590	273	273
query97	2895	2941	2881	2881
query98	250	219	210	210
query99	1311	1413	1284	1284
Total cold run time: 270868 ms
Total hot run time: 185689 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 27.99 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 4f3de8cbb00df64e983077cfbd3a0d9aba5dcbbf, data reload: false

query1	0.05	0.05	0.06
query2	0.09	0.04	0.04
query3	0.25	0.08	0.08
query4	1.61	0.12	0.11
query5	0.27	0.25	0.25
query6	1.16	0.64	0.65
query7	0.04	0.03	0.02
query8	0.05	0.04	0.04
query9	0.61	0.51	0.51
query10	0.58	0.58	0.57
query11	0.16	0.12	0.12
query12	0.14	0.12	0.12
query13	0.62	0.61	0.60
query14	0.99	1.01	1.01
query15	0.83	0.83	0.84
query16	0.40	0.40	0.39
query17	1.03	1.04	1.05
query18	0.22	0.20	0.20
query19	1.89	1.82	1.85
query20	0.02	0.01	0.02
query21	15.43	0.19	0.13
query22	5.01	0.07	0.05
query23	15.68	0.25	0.10
query24	2.59	0.93	1.08
query25	0.09	0.07	0.06
query26	0.13	0.14	0.14
query27	0.07	0.05	0.06
query28	4.91	1.17	0.91
query29	12.56	4.01	3.23
query30	0.28	0.15	0.11
query31	2.82	0.60	0.37
query32	3.23	0.56	0.46
query33	3.00	3.04	3.00
query34	16.01	5.16	4.57
query35	4.55	4.58	4.58
query36	0.66	0.49	0.49
query37	0.09	0.07	0.06
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.17	0.14	0.14
query41	0.08	0.03	0.03
query42	0.03	0.03	0.03
query43	0.04	0.04	0.03
Total cold run time: 98.54 s
Total hot run time: 27.99 s

@doris-robot
Copy link

BE UT Coverage Report

Increment line coverage 80.00% (4/5) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.78% (18226/34533)
Line Coverage 38.13% (165763/434731)
Region Coverage 33.16% (129064/389195)
Branch Coverage 33.87% (55314/163328)

@kaka11chen kaka11chen marked this pull request as ready for review November 10, 2025 06:21
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 14, 2025
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@morningman
Copy link
Contributor

run check_coverage

@morningman morningman merged commit 4017fc7 into apache:master Nov 14, 2025
28 of 30 checks passed
github-actions bot pushed a commit that referenced this pull request Nov 14, 2025
… exceeded concurrency limit case. (#57833)

### What problem does this PR solve?

Problem Summary:

### Release note

Fix time-sharing task executor don't re-schedule when exceeded
concurrency limit case. The time-sharing task executor needs the state
of `ScanTask` to track split scheduling and determine whether to
schedule it as the first time by `enqueue_splits` or reschedule it by
`re_enqueue_split`.
```
virtual Status submit_scan_task(SimplifiedScanTask scan_task) override {
        if (!_is_stop) {
            std::cout << "submitting scan task to task executor scheduler, context id: "
                      << scan_task.scanner_context->ctx_id << std::endl;
            std::shared_ptr<SplitRunner> split_runner;
            if (scan_task.scan_task->is_first_schedule) {
                split_runner = std::make_shared<ScannerSplitRunner>("scanner_split_runner",
                                                                    scan_task.scan_func);
                RETURN_IF_ERROR(split_runner->init());
                auto result = _task_executor->enqueue_splits(
                        scan_task.scanner_context->task_handle(), false, {split_runner});
                if (!result.has_value()) {
                    LOG(WARNING) << "enqueue_splits failed: " << result.error();
                    return result.error();
                }
                scan_task.scan_task->is_first_schedule = false;
            } else {
                split_runner = scan_task.scan_task->split_runner.lock();
                if (split_runner == nullptr) {
                    return Status::OK();
                }
                RETURN_IF_ERROR(_task_executor->re_enqueue_split(
                        scan_task.scanner_context->task_handle(), false, split_runner));
            }
            scan_task.scan_task->split_runner = split_runner;
            return Status::OK();
        } else {
            return Status::InternalError<false>("scanner pool {} is shutdown.", _sched_name);
        }
    }
```
yiguolei pushed a commit that referenced this pull request Nov 15, 2025
…chedule when exceeded concurrency limit case. #57833 (#58013)

Cherry-picked from #57833

Co-authored-by: Qi Chen <chenqi@selectdb.com>
morningman pushed a commit that referenced this pull request Nov 15, 2025
…lt value to true. (#58017)

### What problem does this PR solve?

Related PR: #55935, #57833

Problem Summary:

### Release note
Because the time task scheduler timeout issue, in #55935, we set
`enable_task_executor_in_internal_table` conf's default value to true.
in #57833, we fix it for some cases. So in this PR, set
`enable_task_executor_in_internal_table` conf's default value to true.
github-actions bot pushed a commit that referenced this pull request Nov 15, 2025
…lt value to true. (#58017)

### What problem does this PR solve?

Related PR: #55935, #57833

Problem Summary:

### Release note
Because the time task scheduler timeout issue, in #55935, we set
`enable_task_executor_in_internal_table` conf's default value to true.
in #57833, we fix it for some cases. So in this PR, set
`enable_task_executor_in_internal_table` conf's default value to true.
nagisa-kunhah pushed a commit to nagisa-kunhah/doris that referenced this pull request Dec 14, 2025
… exceeded concurrency limit case. (apache#57833)

### What problem does this PR solve?

Problem Summary:

### Release note

Fix time-sharing task executor don't re-schedule when exceeded
concurrency limit case. The time-sharing task executor needs the state
of `ScanTask` to track split scheduling and determine whether to
schedule it as the first time by `enqueue_splits` or reschedule it by
`re_enqueue_split`.
```
virtual Status submit_scan_task(SimplifiedScanTask scan_task) override {
        if (!_is_stop) {
            std::cout << "submitting scan task to task executor scheduler, context id: "
                      << scan_task.scanner_context->ctx_id << std::endl;
            std::shared_ptr<SplitRunner> split_runner;
            if (scan_task.scan_task->is_first_schedule) {
                split_runner = std::make_shared<ScannerSplitRunner>("scanner_split_runner",
                                                                    scan_task.scan_func);
                RETURN_IF_ERROR(split_runner->init());
                auto result = _task_executor->enqueue_splits(
                        scan_task.scanner_context->task_handle(), false, {split_runner});
                if (!result.has_value()) {
                    LOG(WARNING) << "enqueue_splits failed: " << result.error();
                    return result.error();
                }
                scan_task.scan_task->is_first_schedule = false;
            } else {
                split_runner = scan_task.scan_task->split_runner.lock();
                if (split_runner == nullptr) {
                    return Status::OK();
                }
                RETURN_IF_ERROR(_task_executor->re_enqueue_split(
                        scan_task.scanner_context->task_handle(), false, split_runner));
            }
            scan_task.scan_task->split_runner = split_runner;
            return Status::OK();
        } else {
            return Status::InternalError<false>("scanner pool {} is shutdown.", _sched_name);
        }
    }
```
nagisa-kunhah pushed a commit to nagisa-kunhah/doris that referenced this pull request Dec 14, 2025
…lt value to true. (apache#58017)

### What problem does this PR solve?

Related PR: apache#55935, apache#57833

Problem Summary:

### Release note
Because the time task scheduler timeout issue, in apache#55935, we set
`enable_task_executor_in_internal_table` conf's default value to true.
in apache#57833, we fix it for some cases. So in this PR, set
`enable_task_executor_in_internal_table` conf's default value to true.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/4.0.2-merged reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants