
[fix](memory) fix Hadoop metrics2 memory leak causing FE OOM in external table workloads #60927

Open
shuke987 wants to merge 3 commits into apache:master from shuke987:pr-60926


@shuke987
Collaborator

@shuke987 shuke987 commented Mar 1, 2026

What problem does this PR solve?

Each new Hadoop FileSystem instance (e.g., S3AFileSystem) registers metrics
with the global DefaultMetricsSystem singleton but never unregisters them on
close(). Since Doris disables the Hadoop FS cache, every cache miss creates a
new FileSystem, causing MetricCounterLong and MBeanAttributeInfo objects to
accumulate unboundedly until OOM.

Fix: run the Hadoop metrics system in standby mode by setting the JVM option -Dhadoop.metrics.init.mode=STANDBY.
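
To make the leak pattern described above concrete, here is a minimal, self-contained sketch. It does not use the Hadoop API: `MetricsLeakSketch`, `ToyFileSystem`, and `REGISTRY` are hypothetical stand-ins for a process-wide metrics singleton and a file system that registers with it. It only illustrates why register-without-unregister grows without bound when instances are not cached; it does not model the standby-mode setting itself.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of a global metrics registry (hypothetical, not the Hadoop API). */
final class MetricsLeakSketch {
    // Lives as long as the JVM, like a singleton metrics system.
    static final Map<String, long[]> REGISTRY = new HashMap<>();
    static int nextId = 0;

    /** Hypothetical file system that registers a metrics source when created. */
    static final class ToyFileSystem implements AutoCloseable {
        private final String sourceName;
        private final boolean unregisterOnClose;

        ToyFileSystem(boolean unregisterOnClose) {
            this.sourceName = "ToyFS-" + (nextId++);
            this.unregisterOnClose = unregisterOnClose;
            // Counters owned by this instance, but referenced from the global map.
            REGISTRY.put(sourceName, new long[16]);
        }

        @Override
        public void close() {
            // The leaky variant skips this step, so the entry outlives the instance.
            if (unregisterOnClose) {
                REGISTRY.remove(sourceName);
            }
        }
    }

    /** Creates and closes n file systems; returns how many entries remain registered. */
    static int simulate(int n, boolean unregisterOnClose) {
        REGISTRY.clear();
        for (int i = 0; i < n; i++) {
            try (ToyFileSystem fs = new ToyFileSystem(unregisterOnClose)) {
                // A query would read from fs here.
            }
        }
        return REGISTRY.size();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + simulate(1000, false)); // prints "leaky: 1000"
        System.out.println("clean: " + simulate(1000, true));  // prints "clean: 0"
    }
}
```

In the leaky variant every closed instance leaves one entry behind, so the retained memory scales with the number of file systems ever created rather than the number currently open.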

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@shuke987
Collaborator Author

shuke987 commented Mar 1, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 28728 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6dcdf42957f0698903842b5e2e91db22a3d97d53, data reload: false

------ Round 1 ----------------------------------
============================================
query	cold	hot1	hot2	best_hot (all times in ms)
q1	17617	4620	4295	4295
q2	
q3	10651	769	513	513
q4	4678	359	249	249
q5	7534	1175	1027	1027
q6	168	175	145	145
q7	758	833	662	662
q8	9286	1441	1301	1301
q9	4632	4714	4683	4683
q10	6827	1876	1655	1655
q11	483	253	234	234
q12	709	558	463	463
q13	17785	4229	3415	3415
q14	223	224	211	211
q15	952	799	807	799
q16	744	727	660	660
q17	686	882	377	377
q18	5879	5272	5241	5241
q19	1113	957	611	611
q20	513	494	388	388
q21	4766	2013	1549	1549
q22	374	309	250	250
Total cold run time: 96378 ms
Total hot run time: 28728 ms

----- Round 2, with runtime_filter_mode=off -----
============================================
query	cold	hot1	hot2	best_hot (all times in ms)
q1	4520	4490	4527	4490
q2	
q3	1793	2248	1761	1761
q4	872	1213	768	768
q5	4020	4360	4346	4346
q6	201	194	146	146
q7	1810	1675	1575	1575
q8	2535	2728	2532	2532
q9	7451	7320	7399	7320
q10	2666	2900	2361	2361
q11	513	433	409	409
q12	489	582	444	444
q13	4000	4457	3509	3509
q14	287	294	271	271
q15	839	799	786	786
q16	697	918	729	729
q17	1156	1529	1330	1330
q18	7123	6768	6641	6641
q19	894	889	874	874
q20	2112	2206	2020	2020
q21	3988	3511	3276	3276
q22	468	445	365	365
Total cold run time: 48434 ms
Total hot run time: 45953 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 183774 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6dcdf42957f0698903842b5e2e91db22a3d97d53, data reload: false

query	cold	hot1	hot2	best_hot (all times in ms)
query5	4573	667	501	501
query6	336	236	204	204
query7	4236	485	275	275
query8	356	245	241	241
query9	8749	2724	2668	2668
query10	523	388	353	353
query11	16965	17444	17322	17322
query12	226	161	147	147
query13	1308	485	364	364
query14	7281	3297	3051	3051
query14_1	2924	2903	2841	2841
query15	210	206	186	186
query16	4987	505	395	395
query17	1495	785	688	688
query18	2349	466	350	350
query19	215	205	185	185
query20	153	133	132	132
query21	228	143	118	118
query22	5958	5590	4933	4933
query23	17269	16887	16544	16544
query23_1	16730	16659	16736	16659
query24	7148	1604	1212	1212
query24_1	1206	1234	1224	1224
query25	533	461	393	393
query26	1237	262	150	150
query27	2729	464	283	283
query28	4503	1888	1845	1845
query29	732	542	466	466
query30	301	247	210	210
query31	886	725	647	647
query32	81	76	70	70
query33	518	340	281	281
query34	911	899	557	557
query35	629	657	583	583
query36	1091	1150	997	997
query37	138	89	84	84
query38	2963	2938	2833	2833
query39	885	885	851	851
query39_1	820	870	848	848
query40	228	146	134	134
query41	65	58	57	57
query42	104	100	100	100
query43	368	395	342	342
query44	
query45	198	188	180	180
query46	881	1012	611	611
query47	2093	2137	2050	2050
query48	313	315	240	240
query49	616	509	388	388
query50	667	275	224	224
query51	4111	4114	4052	4052
query52	104	105	96	96
query53	291	335	283	283
query54	308	263	278	263
query55	88	83	82	82
query56	309	304	299	299
query57	1376	1334	1284	1284
query58	280	286	277	277
query59	2563	2700	2589	2589
query60	325	336	325	325
query61	147	147	144	144
query62	641	603	540	540
query63	310	273	280	273
query64	4473	1248	1107	1107
query65	
query66	1311	460	368	368
query67	16465	16575	16237	16237
query68	
query69	403	321	287	287
query70	1014	973	911	911
query71	336	313	296	296
query72	2959	2753	2377	2377
query73	532	544	322	322
query74	9963	9944	9745	9745
query75	2821	2737	2471	2471
query76	1866	1036	665	665
query77	355	373	309	309
query78	11160	11382	10752	10752
query79	2738	832	595	595
query80	1724	634	539	539
query81	568	278	257	257
query82	971	145	113	113
query83	335	265	253	253
query84	254	122	98	98
query85	897	484	420	420
query86	405	304	326	304
query87	3102	3151	2967	2967
query88	3540	2674	2675	2674
query89	423	365	346	346
query90	2032	173	172	172
query91	165	162	131	131
query92	74	79	67	67
query93	1131	835	516	516
query94	625	331	289	289
query95	580	333	373	333
query96	631	516	224	224
query97	2457	2496	2426	2426
query98	229	212	226	212
query99	1009	1004	935	935
Total cold run time: 259601 ms
Total hot run time: 183774 ms
