[fix](load) fix the error msg of task submission failure for memory back pressure #51078

Merged
liaoxin01 merged 1 commit into apache:master on May 22, 2025

Conversation

sollhui
Contributor

@sollhui sollhui commented May 20, 2025

What problem does this PR solve?

When a backend node reaches its memory limit, task submission fails because of memory back pressure. However, the resulting error message is confusing, because it reports TOO_MANY_TASKS rather than a memory problem:

failed to send task: errCode = 2, detailMessage = failed to submit task. error code: TOO_MANY_TASKS, msg:
(127.0.0.1)[TOO_MANY_TASKS]...

This PR changes the error message to:

failed to submit task. error code: MEM_LIMIT_EXCEEDED, msg: (127.0.0.1)[MEM_LIMIT_EXCEEDED]...
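
The idea behind the change can be illustrated with a minimal sketch. It is not the Doris implementation: `StatusCode`, `BackendState`, `check_submission`, and every field name below are hypothetical and exist only to show how a rejection caused by memory pressure can be reported with its own error code instead of being folded into TOO_MANY_TASKS.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class StatusCode(Enum):
    OK = "OK"
    TOO_MANY_TASKS = "TOO_MANY_TASKS"
    MEM_LIMIT_EXCEEDED = "MEM_LIMIT_EXCEEDED"


@dataclass
class BackendState:
    # Hypothetical fields that model the two possible rejection causes.
    host: str
    queued_tasks: int
    max_queued_tasks: int
    memory_used_bytes: int
    memory_limit_bytes: int


def check_submission(state: BackendState) -> Tuple[StatusCode, str]:
    """Return a status code and message that name the real rejection cause."""
    if state.memory_used_bytes >= state.memory_limit_bytes:
        # Memory back pressure: report it as a memory problem.
        code = StatusCode.MEM_LIMIT_EXCEEDED
    elif state.queued_tasks >= state.max_queued_tasks:
        # The task queue is actually full.
        code = StatusCode.TOO_MANY_TASKS
    else:
        return StatusCode.OK, ""
    msg = (
        f"failed to submit task. error code: {code.value}, "
        f"msg: ({state.host})[{code.value}]"
    )
    return code, msg


if __name__ == "__main__":
    # A backend at its memory limit with a nearly empty task queue.
    state = BackendState("127.0.0.1", 1, 10, 100, 100)
    print(check_submission(state)[1])
```

Keeping the two causes apart lets an operator tell from the message alone whether to look at memory usage or at task concurrency.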

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui
Contributor Author

sollhui commented May 20, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 33792 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 0a8788866768880be645f120fc2bc5996132b98d, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	26248	5000	5004	5000
q2	2067	273	179	179
q3	10416	1229	708	708
q4	10232	999	522	522
q5	7663	2386	2374	2374
q6	182	161	134	134
q7	912	736	614	614
q8	9314	1333	1204	1204
q9	6769	5102	5051	5051
q10	6869	2288	1890	1890
q11	488	291	271	271
q12	351	350	218	218
q13	17766	3680	3105	3105
q14	244	218	205	205
q15	529	469	504	469
q16	425	429	371	371
q17	616	865	367	367
q18	7626	7025	7226	7025
q19	1370	940	563	563
q20	337	331	212	212
q21	3905	3128	2346	2346
q22	994	1000	964	964
Total cold run time: 115323 ms
Total hot run time: 33792 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
q1	5121	5038	5206	5038
q2	235	321	229	229
q3	2132	2688	2319	2319
q4	1311	1741	1319	1319
q5	4478	4405	4378	4378
q6	222	163	128	128
q7	2047	1940	1764	1764
q8	2605	2604	2518	2518
q9	7254	7223	6950	6950
q10	3036	3202	2702	2702
q11	578	512	492	492
q12	689	753	623	623
q13	3504	3878	3219	3219
q14	273	301	277	277
q15	519	485	487	485
q16	436	505	447	447
q17	1120	1613	1365	1365
q18	7715	7668	7380	7380
q19	853	822	875	822
q20	2080	2031	1846	1846
q21	4816	4531	4432	4432
q22	1107	1036	1012	1012
Total cold run time: 52131 ms
Total hot run time: 49745 ms

Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

@github-actions github-actions bot added the approved label (indicates a PR has been approved by one committer) on May 20, 2025
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

@liaoxin01 liaoxin01 added the dev/2.1.x and dev/3.0.x labels and removed the approved and reviewed labels on May 20, 2025
@doris-robot

TPC-DS: Total hot run time: 192462 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0a8788866768880be645f120fc2bc5996132b98d, data reload: false

query	cold (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot (ms)
query1	1402	1095	1057	1057
query2	6292	1884	1913	1884
query3	10995	4419	4405	4405
query4	53714	26096	22847	22847
query5	5199	493	447	447
query6	348	211	184	184
query7	4911	502	280	280
query8	301	238	218	218
query9	5985	2649	2639	2639
query10	456	325	271	271
query11	15037	15026	14832	14832
query12	170	121	109	109
query13	1076	535	434	434
query14	10055	6413	6204	6204
query15	211	194	194	194
query16	7138	669	500	500
query17	1119	744	596	596
query18	1667	422	338	338
query19	195	202	183	183
query20	134	132	125	125
query21	208	124	109	109
query22	4349	4451	4300	4300
query23	34322	33585	33603	33585
query24	6501	2374	2409	2374
query25	464	461	419	419
query26	741	271	156	156
query27	2496	520	341	341
query28	3277	2163	2151	2151
query29	590	574	442	442
query30	273	222	189	189
query31	881	891	803	803
query32	75	65	65	65
query33	459	375	330	330
query34	812	858	537	537
query35	819	860	738	738
query36	945	998	898	898
query37	128	101	82	82
query38	4158	4337	4239	4239
query39	1518	1484	1474	1474
query40	216	117	106	106
query41	59	56	53	53
query42	156	109	110	109
query43	491	543	505	505
query44	1392	861	840	840
query45	179	187	164	164
query46	856	1026	656	656
query47	1841	1858	1793	1793
query48	411	438	345	345
query49	693	541	444	444
query50	666	702	413	413
query51	4306	4283	4207	4207
query52	116	112	104	104
query53	241	260	184	184
query54	603	582	523	523
query55	86	88	83	83
query56	368	321	297	297
query57	1226	1180	1129	1129
query58	263	264	268	264
query59	2831	2907	2725	2725
query60	334	323	314	314
query61	142	123	119	119
query62	694	718	664	664
query63	227	185	190	185
query64	1719	997	734	734
query65	4332	4254	4256	4254
query66	709	405	302	302
query67	15813	15738	15363	15363
query68	5704	889	544	544
query69	532	317	274	274
query70	1172	1118	1122	1118
query71	464	317	291	291
query72	5840	4842	4865	4842
query73	1258	674	357	357
query74	9120	9157	8766	8766
query75	3408	3192	2692	2692
query76	3793	1182	766	766
query77	531	377	281	281
query78	9963	10141	9444	9444
query79	2123	830	580	580
query80	584	510	444	444
query81	496	254	218	218
query82	302	130	158	130
query83	245	249	227	227
query84	292	109	82	82
query85	792	345	308	308
query86	411	297	270	270
query87	4334	4445	4365	4365
query88	3649	2345	2328	2328
query89	408	336	288	288
query90	1843	203	208	203
query91	143	143	118	118
query92	71	60	57	57
query93	1750	927	579	579
query94	659	398	315	315
query95	368	302	280	280
query96	583	578	286	286
query97	2705	2750	2680	2680
query98	237	201	206	201
query99	1359	1434	1270	1270
Total cold run time: 295530 ms
Total hot run time: 192462 ms

@doris-robot

ClickBench: Total hot run time: 29.23 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 0a8788866768880be645f120fc2bc5996132b98d, data reload: false

query	cold (s)	hot run 1 (s)	hot run 2 (s)
query1	0.03	0.04	0.03
query2	0.13	0.10	0.11
query3	0.26	0.19	0.18
query4	1.60	0.19	0.11
query5	0.44	0.42	0.44
query6	1.16	0.66	0.65
query7	0.02	0.02	0.02
query8	0.04	0.03	0.03
query9	0.59	0.50	0.52
query10	0.58	0.59	0.56
query11	0.16	0.10	0.11
query12	0.15	0.12	0.12
query13	0.63	0.61	0.60
query14	0.79	0.80	0.81
query15	0.87	0.85	0.83
query16	0.37	0.39	0.39
query17	1.01	1.06	1.02
query18	0.23	0.21	0.21
query19	1.93	1.82	1.80
query20	0.01	0.01	0.01
query21	15.42	0.86	0.55
query22	0.77	1.17	0.66
query23	14.98	1.36	0.65
query24	6.87	1.14	1.32
query25	0.50	0.23	0.06
query26	0.53	0.15	0.13
query27	0.05	0.05	0.05
query28	10.59	0.84	0.45
query29	12.54	3.96	3.32
query30	0.25	0.10	0.07
query31	2.83	0.59	0.39
query32	3.24	0.56	0.47
query33	3.05	3.08	3.12
query34	15.73	5.14	4.46
query35	4.52	4.50	4.47
query36	0.66	0.50	0.49
query37	0.09	0.06	0.07
query38	0.05	0.04	0.04
query39	0.03	0.02	0.02
query40	0.17	0.13	0.12
query41	0.08	0.03	0.03
query42	0.03	0.02	0.02
query43	0.04	0.03	0.02
Total cold run time: 104.02 s
Total hot run time: 29.23 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 63.16% (12/19) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 55.94% (14922/26675)
Line Coverage 44.76% (132402/295796)
Region Coverage 43.86% (66618/151897)
Branch Coverage 38.44% (34127/88782)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 63.16% (12/19) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 79.40% (20850/26260)
Line Coverage 72.64% (214862/295794)
Region Coverage 70.82% (126389/178455)
Branch Coverage 64.60% (65516/101422)

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved label (indicates a PR has been approved by one committer) on May 22, 2025
Contributor

PR approved by at least one committer and no changes requested.

Contributor

PR approved by anyone and no changes requested.

@liaoxin01 liaoxin01 merged commit 98d87b2 into apache:master May 22, 2025
32 of 33 checks passed
github-actions bot pushed a commit that referenced this pull request May 22, 2025
…ack pressure (#51078)

If backend node memory reached limit, task submission will fail for
memory back pressure. But the error msg is confusing:
```
failed to send task: errCode = 2, detailMessage = failed to submit task. error code: TOO_MANY_TASKS, msg:
(127.0.0.1)[TOO_MANY_TASKS]...
```

Change the error msg to:
```
failed to submit task. error code: MEM_LIMIT_EXCEEDED, msg: (127.0.0.1)[MEM_LIMIT_EXCEEDED]...
```
yiguolei pushed a commit that referenced this pull request May 22, 2025
…for memory back pressure #51078 (#51131)

Cherry-picked from #51078

Co-authored-by: hui lai <laihui@selectdb.com>
dataroaring pushed a commit that referenced this pull request May 24, 2025
…for memory back pressure #51078 (#51130)

Cherry-picked from #51078

Co-authored-by: hui lai <laihui@selectdb.com>
koarz pushed a commit to koarz/doris that referenced this pull request Jun 4, 2025
…ack pressure (apache#51078)

If backend node memory reached limit, task submission will fail for
memory back pressure. But the error msg is confusing:
```
failed to send task: errCode = 2, detailMessage = failed to submit task. error code: TOO_MANY_TASKS, msg:
(127.0.0.1)[TOO_MANY_TASKS]...
```

Change the error msg to:
```
failed to submit task. error code: MEM_LIMIT_EXCEEDED, msg: (127.0.0.1)[MEM_LIMIT_EXCEEDED]...
```
Labels
approved (indicates a PR has been approved by one committer), dev/2.1.11-merged, dev/3.0.6-merged, reviewed
6 participants