
[opt](memory) Refactor memory maintenance thread (retry) #40551

Merged
xinyiZzz merged 2 commits into apache:master from xinyiZzz:20240909_fix_memory3
Sep 10, 2024

@xinyiZzz
Contributor

@xinyiZzz xinyiZzz commented Sep 9, 2024

step 1. Refresh process memory metrics.
step 2. Refresh allocator memory metrics.
step 3. Update and print memory stats when memory changes by 256M.
step 4. Asynchronously refresh cache capacity.
step 5. Cancel the top memory task when process memory exceeds the hard limit.
step 6. Refresh the weighted memory ratio of workload groups.
step 7. Analyze blocking queries.
step 8. Flush memtable.
step 9. Jemalloc purges dirty pages in all arenas.
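
The step order above can be sketched as a single pass function. This is a minimal Python illustration, not the Doris C++ implementation: `MaintenanceState`, `maintenance_pass`, and the step names are hypothetical, steps other than 1, 3, and 5 are placeholders, and reading "changes by 256M" as an absolute difference from the last reported value is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List

MB = 1024 * 1024
STAT_LOG_THRESHOLD = 256 * MB  # step 3 threshold taken from the PR description


@dataclass
class MaintenanceState:
    """Hypothetical state kept between passes (not a Doris type)."""
    last_reported_bytes: int = 0
    log: List[str] = field(default_factory=list)


def maintenance_pass(state: MaintenanceState,
                     read_process_memory: Callable[[], int],
                     hard_limit_bytes: int) -> List[str]:
    """Run one pass and return the names of the steps that were executed."""
    done = []
    used = read_process_memory()                      # step 1
    done.append("refresh_process_metrics")
    done.append("refresh_allocator_metrics")          # step 2 (placeholder)
    # step 3: report only when usage moved by at least 256M (assumed semantics)
    if abs(used - state.last_reported_bytes) >= STAT_LOG_THRESHOLD:
        state.last_reported_bytes = used
        state.log.append(f"process memory: {used // MB} MB")
        done.append("print_memory_stat")
    done.append("refresh_cache_capacity")             # step 4 (placeholder)
    if used > hard_limit_bytes:                       # step 5
        done.append("cancel_top_memory_task")
    # steps 6-9 are placeholders that always run in this sketch
    done += ["refresh_workload_group_ratio", "analyze_blocking_queries",
             "flush_memtable", "jemalloc_purge_dirty_pages"]
    return done


state = MaintenanceState()
first = maintenance_pass(state, lambda: 300 * MB, hard_limit_bytes=1024 * MB)
second = maintenance_pass(state, lambda: 400 * MB, hard_limit_bytes=1024 * MB)
third = maintenance_pass(state, lambda: 2048 * MB, hard_limit_bytes=1024 * MB)
```

In this sketch the second pass skips the stat print because usage moved by only 100 MB, and only the third pass reaches the cancel step.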

`memory_maintenance_thread` cost per execution:

  • 3ms (cluster idle)
  • 20ms (cluster under high concurrency, CPU fully loaded)

`memory_maintenance_thread` CPU usage:

  • 10%-20% (default `memory_maintenance_sleep_time_ms=20ms`)
  • 20%-30% (`memory_maintenance_sleep_time_ms=10ms`)
  • 30%+ (`memory_maintenance_sleep_time_ms=5ms`)
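
One way to relate the two lists above is a simple duty-cycle estimate. The Python sketch below assumes the thread is busy for the whole pass and then sleeps for `memory_maintenance_sleep_time_ms`; that model and the helper name `estimated_cpu_share` are assumptions for illustration, not something the PR states.

```python
def estimated_cpu_share(pass_cost_ms: float, sleep_ms: float) -> float:
    """Fraction of one core used if each pass is followed by a fixed sleep."""
    return pass_cost_ms / (pass_cost_ms + sleep_ms)


# 3 ms is the idle-cluster cost reported in the PR description.
for sleep_ms in (20, 10, 5):
    share = estimated_cpu_share(3, sleep_ms)
    print(f"sleep={sleep_ms}ms -> {share:.1%} of one core")
```

With the 3 ms idle cost, this model gives roughly 13%, 23%, and 37.5% for sleep times of 20 ms, 10 ms, and 5 ms, which falls inside the ranges reported above. The PR does not say under what load the CPU figures were measured, so the agreement may be coincidental.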

@doris-robot

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the documentation has been moved to doris-website.
See the Doris documentation.

@xinyiZzz
Contributor Author

xinyiZzz commented Sep 9, 2024

run buildall

@xinyiZzz xinyiZzz force-pushed the 20240909_fix_memory3 branch from 0e426c4 to ad96e15 Compare September 9, 2024 09:17
@doris-robot

TPC-H: Total hot run time: 38659 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 0e426c42b130e1fc99c4b980278e5b7d605a3c54, data reload: false

------ Round 1 ----------------------------------
q1	18235	4840	4495	4495
q2	2943	190	181	181
q3	11589	1143	1151	1143
q4	10667	720	680	680
q5	7756	2987	2896	2896
q6	238	139	141	139
q7	974	609	601	601
q8	9323	2098	2062	2062
q9	7277	6562	6566	6562
q10	7006	2216	2205	2205
q11	465	239	255	239
q12	402	222	218	218
q13	17968	3130	3071	3071
q14	287	242	245	242
q15	564	498	493	493
q16	525	433	442	433
q17	978	749	790	749
q18	7459	6978	6962	6962
q19	1397	1012	1044	1012
q20	681	326	333	326
q21	4487	2961	3175	2961
q22	1103	989	1017	989
Total cold run time: 112324 ms
Total hot run time: 38659 ms
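
The robot's output does not label the columns of the rows above. Summing the first column over all 22 Round 1 rows gives 112324 and summing the last column gives 38659, matching the reported totals, which suggests the columns are the cold run, two hot runs, and the faster of the two hot runs. A minimal Python sketch of that reading, using three of the rows above; `summarize` is a hypothetical helper and not part of tpch-tools:

```python
# Three rows copied from the Round 1 table above (times in ms).
ROWS = """\
q1 18235 4840 4495 4495
q2 2943 190 181 181
q6 238 139 141 139
"""


def summarize(text: str) -> tuple:
    """Return (cold_total, hot_total) under the assumed column layout."""
    cold_total = hot_total = 0
    for line in text.splitlines():
        _name, *values = line.split()
        cold, hot1, hot2, best = (int(v) for v in values)
        # The last column should be the faster of the two hot runs.
        assert best == min(hot1, hot2)
        cold_total += cold
        hot_total += best
    return cold_total, hot_total


cold_total, hot_total = summarize(ROWS)
```

Feeding all 22 rows to the same function should reproduce the two totals printed by the robot.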

----- Round 2, with runtime_filter_mode=off -----
q1	4427	4317	4313	4313
q2	382	264	271	264
q3	2938	2660	2715	2660
q4	1966	1660	1630	1630
q5	5469	5441	5462	5441
q6	217	129	130	129
q7	2181	1757	1763	1757
q8	3222	3385	3399	3385
q9	8553	8486	8559	8486
q10	3478	3286	3216	3216
q11	612	506	511	506
q12	825	606	608	606
q13	10600	3060	3098	3060
q14	308	281	284	281
q15	536	489	499	489
q16	565	471	467	467
q17	1794	1512	1483	1483
q18	7922	7463	7422	7422
q19	1714	1596	1507	1507
q20	2049	1834	1836	1834
q21	5498	5190	5189	5189
q22	1108	1059	1029	1029
Total cold run time: 66364 ms
Total hot run time: 55154 ms

@github-actions
Contributor

github-actions Bot commented Sep 9, 2024

PR approved by at least one committer and no changes requested.

@github-actions github-actions Bot added the approved label (Indicates a PR has been approved by one committer.) Sep 9, 2024
@github-actions
Contributor

github-actions Bot commented Sep 9, 2024

PR approved by anyone and no changes requested.

@doris-robot

TPC-DS: Total hot run time: 192176 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 0e426c42b130e1fc99c4b980278e5b7d605a3c54, data reload: false

query1	925	378	403	378
query2	6495	1913	1881	1881
query3	6650	203	215	203
query4	34227	23508	23215	23215
query5	4167	499	491	491
query6	269	179	157	157
query7	4583	296	296	296
query8	280	208	237	208
query9	8641	2473	2473	2473
query10	445	266	254	254
query11	18003	15188	15154	15154
query12	160	99	97	97
query13	1624	386	363	363
query14	9465	7129	7355	7129
query15	270	164	171	164
query16	7845	470	446	446
query17	1575	579	564	564
query18	1741	288	279	279
query19	233	144	146	144
query20	119	112	110	110
query21	207	105	102	102
query22	4555	4140	4117	4117
query23	34237	33793	33703	33703
query24	11194	2923	2856	2856
query25	634	380	392	380
query26	1138	152	151	151
query27	2789	274	276	274
query28	7575	2049	2039	2039
query29	775	419	435	419
query30	307	156	146	146
query31	1014	773	796	773
query32	99	53	57	53
query33	752	297	283	283
query34	983	480	490	480
query35	875	744	697	697
query36	1075	914	913	913
query37	159	91	85	85
query38	4066	3876	3908	3876
query39	1448	1443	1405	1405
query40	195	115	113	113
query41	48	48	46	46
query42	114	94	94	94
query43	490	469	445	445
query44	1186	770	727	727
query45	201	164	166	164
query46	1107	761	759	759
query47	1885	1788	1823	1788
query48	366	297	301	297
query49	1119	484	455	455
query50	814	408	411	408
query51	7066	6922	6939	6922
query52	99	86	85	85
query53	255	178	187	178
query54	916	450	448	448
query55	77	72	76	72
query56	286	253	253	253
query57	1237	1103	1073	1073
query58	252	223	225	223
query59	2923	2912	2913	2912
query60	298	268	275	268
query61	106	102	99	99
query62	835	655	643	643
query63	223	186	212	186
query64	5297	665	663	663
query65	3248	3175	3176	3175
query66	1392	350	342	342
query67	16221	15447	15260	15260
query68	3115	852	848	848
query69	472	315	319	315
query70	1155	1196	1183	1183
query71	350	338	338	338
query72	6052	3496	2638	2638
query73	592	587	583	583
query74	9128	8942	8775	8775
query75	3122	2944	2956	2944
query76	1845	851	841	841
query77	495	402	397	397
query78	9957	9180	9169	9169
query79	894	853	863	853
query80	817	815	801	801
query81	452	263	265	263
query82	268	272	264	264
query83	192	192	189	189
query84	237	107	104	104
query85	649	447	440	440
query86	320	322	308	308
query87	4289	4401	4343	4343
query88	4361	4150	4134	4134
query89	450	370	368	368
query90	1299	319	310	310
query91	125	124	124	124
query92	75	76	71	71
query93	921	939	918	918
query94	456	348	345	345
query95	415	407	410	407
query96	474	472	476	472
query97	3090	3091	3159	3091
query98	224	223	226	223
query99	1436	1303	1261	1261
Total cold run time: 287582 ms
Total hot run time: 192176 ms

@doris-robot

ClickBench: Total hot run time: 31.65 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 0e426c42b130e1fc99c4b980278e5b7d605a3c54, data reload: false

query1	0.04	0.05	0.04
query2	0.08	0.04	0.04
query3	0.23	0.05	0.05
query4	1.68	0.08	0.07
query5	0.52	0.50	0.49
query6	1.13	0.74	0.73
query7	0.02	0.01	0.01
query8	0.06	0.04	0.04
query9	0.56	0.50	0.49
query10	0.53	0.56	0.54
query11	0.15	0.12	0.12
query12	0.15	0.13	0.12
query13	0.60	0.60	0.59
query14	1.41	1.40	1.42
query15	0.85	0.84	0.83
query16	0.38	0.38	0.38
query17	1.06	1.06	1.05
query18	0.21	0.20	0.19
query19	1.86	1.79	1.76
query20	0.01	0.02	0.01
query21	15.40	0.67	0.67
query22	4.22	6.90	2.02
query23	18.34	1.31	1.28
query24	2.06	0.24	0.22
query25	0.14	0.09	0.08
query26	0.28	0.18	0.19
query27	0.08	0.09	0.08
query28	13.20	1.03	1.01
query29	12.59	3.31	3.32
query30	0.25	0.06	0.05
query31	2.89	0.40	0.40
query32	3.26	0.48	0.49
query33	3.00	2.97	3.04
query34	17.11	4.39	4.42
query35	4.46	4.48	4.49
query36	0.66	0.49	0.50
query37	0.19	0.17	0.16
query38	0.15	0.15	0.15
query39	0.05	0.04	0.04
query40	0.15	0.13	0.13
query41	0.09	0.05	0.04
query42	0.06	0.04	0.04
query43	0.05	0.03	0.04
Total cold run time: 110.21 s
Total hot run time: 31.65 s

@xinyiZzz
Contributor Author

xinyiZzz commented Sep 9, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 38340 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit ad96e15bf2cef7935c64626f1844cb8325dfb83e, data reload: false

------ Round 1 ----------------------------------
q1	17633	4370	4328	4328
q2	2017	188	181	181
q3	10461	1143	1167	1143
q4	10120	693	701	693
q5	8067	2907	2847	2847
q6	230	142	140	140
q7	978	632	644	632
q8	9322	2054	2088	2054
q9	7093	6559	6535	6535
q10	7013	2175	2247	2175
q11	465	245	246	245
q12	392	229	232	229
q13	17755	3143	3095	3095
q14	275	250	244	244
q15	542	488	485	485
q16	523	430	442	430
q17	971	713	788	713
q18	7402	7015	6841	6841
q19	1395	1051	1054	1051
q20	694	356	335	335
q21	4453	3039	2953	2953
q22	1130	991	1039	991
Total cold run time: 108931 ms
Total hot run time: 38340 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4397	4356	4320	4320
q2	370	266	276	266
q3	2885	2650	2737	2650
q4	1957	1761	1685	1685
q5	5469	5427	5433	5427
q6	219	133	135	133
q7	2149	1779	1799	1779
q8	3211	3361	3377	3361
q9	8500	8462	8589	8462
q10	3472	3234	3198	3198
q11	602	498	516	498
q12	790	657	619	619
q13	10571	3163	3132	3132
q14	328	270	292	270
q15	517	491	479	479
q16	511	485	486	485
q17	1779	1503	1514	1503
q18	7940	7532	7405	7405
q19	1661	1494	1577	1494
q20	2044	1847	1818	1818
q21	5428	5166	5219	5166
q22	1122	1016	1026	1016
Total cold run time: 65922 ms
Total hot run time: 55166 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.86% (9393/25486)
Line Coverage: 28.25% (77491/274273)
Region Coverage: 27.65% (40004/144678)
Branch Coverage: 24.27% (20345/83830)
Coverage Report: http://coverage.selectdb-in.cc/coverage/ad96e15bf2cef7935c64626f1844cb8325dfb83e_ad96e15bf2cef7935c64626f1844cb8325dfb83e/report/index.html

@doris-robot

TPC-DS: Total hot run time: 192820 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ad96e15bf2cef7935c64626f1844cb8325dfb83e, data reload: false

query1	917	381	374	374
query2	6482	1885	1885	1885
query3	6656	205	212	205
query4	34481	23184	23287	23184
query5	4218	514	495	495
query6	265	178	170	170
query7	4585	303	300	300
query8	318	228	222	222
query9	8548	2486	2463	2463
query10	452	288	291	288
query11	17324	14994	15076	14994
query12	151	98	97	97
query13	1641	403	365	365
query14	10051	7548	6742	6742
query15	246	169	177	169
query16	7923	485	493	485
query17	1696	557	540	540
query18	1957	285	276	276
query19	312	162	142	142
query20	120	108	116	108
query21	214	105	103	103
query22	4586	4083	4277	4083
query23	34110	33659	34207	33659
query24	11241	2885	2840	2840
query25	599	387	384	384
query26	1139	152	152	152
query27	2755	278	276	276
query28	7056	2061	2041	2041
query29	807	430	426	426
query30	314	156	155	155
query31	1018	754	803	754
query32	96	58	58	58
query33	754	280	283	280
query34	981	476	489	476
query35	868	717	731	717
query36	1073	942	944	942
query37	153	91	84	84
query38	3980	3903	3860	3860
query39	1444	1392	1398	1392
query40	204	115	114	114
query41	49	44	45	44
query42	118	94	99	94
query43	479	454	471	454
query44	1270	760	746	746
query45	197	168	168	168
query46	1102	769	772	769
query47	1914	1807	1836	1807
query48	372	299	300	299
query49	1054	430	437	430
query50	826	405	406	405
query51	7034	6845	6890	6845
query52	103	89	85	85
query53	260	184	182	182
query54	952	458	447	447
query55	77	75	74	74
query56	274	250	253	250
query57	1240	1091	1113	1091
query58	221	222	222	222
query59	3074	2824	2723	2723
query60	296	273	261	261
query61	103	97	100	97
query62	834	673	676	673
query63	227	185	186	185
query64	5243	700	647	647
query65	3215	3145	3168	3145
query66	1220	330	328	328
query67	15845	15388	15391	15388
query68	3119	850	843	843
query69	413	314	324	314
query70	1136	1119	1152	1119
query71	338	336	340	336
query72	6126	3510	3435	3435
query73	585	595	585	585
query74	9106	8926	8851	8851
query75	3122	2942	2951	2942
query76	1935	852	847	847
query77	460	422	396	396
query78	9452	10074	9869	9869
query79	896	876	868	868
query80	887	813	814	813
query81	447	258	263	258
query82	262	264	262	262
query83	196	196	198	196
query84	225	113	107	107
query85	643	391	377	377
query86	314	312	334	312
query87	4352	4355	4232	4232
query88	4344	4103	4098	4098
query89	367	363	367	363
query90	1442	317	315	315
query91	121	121	122	121
query92	82	73	74	73
query93	923	927	916	916
query94	532	373	398	373
query95	416	415	406	406
query96	478	472	476	472
query97	3102	3092	3093	3092
query98	242	222	221	221
query99	1390	1249	1260	1249
Total cold run time: 286787 ms
Total hot run time: 192820 ms

@doris-robot

ClickBench: Total hot run time: 31.42 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit ad96e15bf2cef7935c64626f1844cb8325dfb83e, data reload: false

query1	0.05	0.04	0.04
query2	0.07	0.04	0.04
query3	0.23	0.05	0.05
query4	1.66	0.07	0.08
query5	0.50	0.48	0.50
query6	1.14	0.73	0.72
query7	0.02	0.01	0.02
query8	0.05	0.04	0.05
query9	0.55	0.50	0.48
query10	0.55	0.58	0.56
query11	0.16	0.12	0.12
query12	0.15	0.13	0.13
query13	0.60	0.58	0.59
query14	1.43	1.40	1.41
query15	0.84	0.84	0.82
query16	0.37	0.37	0.38
query17	1.01	0.99	1.06
query18	0.20	0.20	0.22
query19	1.96	1.78	1.81
query20	0.01	0.00	0.01
query21	15.40	0.67	0.67
query22	3.84	7.68	1.85
query23	18.25	1.36	1.31
query24	2.08	0.23	0.22
query25	0.15	0.07	0.08
query26	0.29	0.18	0.19
query27	0.08	0.09	0.07
query28	13.31	1.02	1.01
query29	12.64	3.30	3.29
query30	0.24	0.06	0.07
query31	2.84	0.40	0.40
query32	3.25	0.48	0.48
query33	3.00	3.00	2.99
query34	16.93	4.42	4.44
query35	4.47	4.43	4.45
query36	0.65	0.50	0.50
query37	0.18	0.16	0.15
query38	0.16	0.15	0.15
query39	0.04	0.04	0.04
query40	0.16	0.12	0.12
query41	0.09	0.05	0.05
query42	0.06	0.05	0.04
query43	0.04	0.04	0.04
Total cold run time: 109.7 s
Total hot run time: 31.42 s
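
The hot-run totals reported in this thread for the two pushed commits (0e426c4, then ad96e15) can be compared directly. A small Python sketch, with the numbers copied from the robot comments; `relative_change` is a hypothetical helper:

```python
# (total for commit 0e426c4, total for commit ad96e15), as reported above.
HOT_TOTALS = {
    "TPC-H (ms)": (38659, 38340),
    "TPC-DS (ms)": (192176, 192820),
    "ClickBench (s)": (31.65, 31.42),
}


def relative_change(before: float, after: float) -> float:
    """Signed change of `after` relative to `before`."""
    return (after - before) / before


changes = {name: relative_change(*pair) for name, pair in HOT_TOTALS.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.2%}")
```

All three differences are below 1% in magnitude. Both commits belong to this PR, so the comparison says nothing about master, and two runs are too few to separate a real effect from run-to-run variation.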

Contributor

@wangbo wangbo left a comment

LGTM

@xinyiZzz xinyiZzz merged commit 140008f into apache:master Sep 10, 2024
dataroaring pushed a commit that referenced this pull request Oct 9, 2024
HappenLee pushed a commit to HappenLee/incubator-doris that referenced this pull request Apr 24, 2026

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.0.3-merged, not-merge/2.1, reviewed

5 participants