branch-3.0: [Fix](catalog)Remove the fs.disable.cache parameter to prevent excessive FS-associated objects and memory leaks #46184 #46189

Merged
morningman merged 1 commit into branch-3.0 from auto-pick-46184-branch-3.0 on Dec 31, 2024

@github-actions

Cherry-picked from #46184

…ive FS-associated objects and memory leaks (#46184)

### Background
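In the current file system implementation, the fs.disable.cache
parameter (in Hadoop, the per-scheme fs.<scheme>.impl.disable.cache
setting) allows disabling FS caching. While this provides flexibility,
it introduces the critical issues described below. The heap class
histogram that follows (columns: rank, instance count, bytes, class
name) is dominated by HashMap nodes, JMX attribute objects, and Hadoop
metrics counters: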
```

1:      22537201      721190432  java.util.HashMap$Node
2:      21559238      689895616  javax.management.MBeanAttributeInfo
3:      21559098      517418352  javax.management.Attribute
4:      19380247      465125928  org.apache.hadoop.metrics2.impl.MetricCounterLong
5:        122603      461180096  [J
6:        294309      255533536  [B
7:        724598      252264048  [Ljava.lang.Object;
8:       2012368      189047432  [C
9:        159442      131064400  [Ljava.util.HashMap$Node;
10:        114752       88075072  [Ljavax.management.MBeanAttributeInfo;
11:       1899581       45589944  java.lang.String
12:       1720140       41283360  org.apache.hadoop.metrics2.impl.MetricGaugeLong
```

#### Unbounded FS Instance Creation
When fs.disable.cache=true, a new FS instance is created for every
access, preventing instance reuse.
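```
    // Excerpt from Hadoop's FileSystem.get(URI, Configuration).
    // If fs.<scheme>.impl.disable.cache is true, the shared FileSystem cache
    // is bypassed and a new instance is created on every call.
    String disableCacheName = String.format("fs.%s.impl.disable.cache", scheme);
    if (conf.getBoolean(disableCacheName, false)) {
      LOGGER.debug("Bypassing cache to create filesystem {}", uri);
      return createFileSystem(uri, conf);
    }
```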
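To make the effect of that branch concrete, here is a minimal sketch that models the cache bypass with a toy class. It does not use Hadoop: `FsCacheSketch`, `ToyFs`, and `instancesCreated` are hypothetical names introduced only for illustration, and the cache key is simplified to scheme plus authority (Hadoop's real key also includes the user identity).

```java
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model of a filesystem cache; NOT Hadoop's actual FileSystem class. */
public class FsCacheSketch {
    /** Counts how many toy filesystem objects have been constructed. */
    static final AtomicInteger CREATED = new AtomicInteger();

    /** Stand-in for a heavyweight FS client (threads, metrics, connections). */
    static final class ToyFs {
        final URI uri;
        ToyFs(URI uri) {
            this.uri = uri;
            CREATED.incrementAndGet();
        }
    }

    /** Simplified cache keyed by scheme and authority (an assumption). */
    private static final Map<String, ToyFs> CACHE = new ConcurrentHashMap<>();

    /** Mirrors the branch above: bypass the cache when disableCache is true. */
    static ToyFs get(URI uri, boolean disableCache) {
        if (disableCache) {
            return new ToyFs(uri); // a new instance on every access
        }
        String key = uri.getScheme() + "://" + uri.getAuthority();
        return CACHE.computeIfAbsent(key, k -> new ToyFs(uri));
    }

    /** Performs `accesses` lookups and returns how many instances were built. */
    static int instancesCreated(int accesses, boolean disableCache) {
        CACHE.clear();
        CREATED.set(0);
        URI uri = URI.create("cosn://bucket/warehouse");
        for (int i = 0; i < accesses; i++) {
            get(uri, disableCache);
        }
        return CREATED.get();
    }

    public static void main(String[] args) {
        System.out.println("cache enabled:  " + instancesCreated(1000, false));
        System.out.println("cache disabled: " + instancesCreated(1000, true));
    }
}
```

With the cache enabled, 1000 lookups of the same URI build one instance; with the cache disabled, the same lookups build 1000 instances.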

#### Resource Leakage
Associated objects, such as thread metrics and connection pools, are not
properly released due to excessive FS instance creation, leading to
memory leaks.
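One plausible mechanism, consistent with the metrics and JMX classes at the top of the histogram in the Background section, is that each FS instance registers per-instance entries in a process-wide registry and only removes them when it is closed. The sketch below models this with toy classes; `FsLeakSketch`, `ToyFs`, `REGISTRY`, and the metric names are hypothetical and are not taken from Hadoop or Doris.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

/** Toy model of per-instance metrics accumulating in a global registry. */
public class FsLeakSketch {
    /** Stand-in for a process-wide metrics/JMX registry (an assumption). */
    static final Map<String, AtomicLong> REGISTRY = new ConcurrentHashMap<>();
    private static final AtomicLong NEXT_ID = new AtomicLong();

    /** Toy FS client that registers two counters when constructed. */
    static final class ToyFs implements AutoCloseable {
        private final String prefix = "fs-" + NEXT_ID.incrementAndGet();

        ToyFs() {
            REGISTRY.put(prefix + ".bytesRead", new AtomicLong());
            REGISTRY.put(prefix + ".bytesWritten", new AtomicLong());
        }

        /** Unregisters this instance's counters. */
        @Override
        public void close() {
            REGISTRY.remove(prefix + ".bytesRead");
            REGISTRY.remove(prefix + ".bytesWritten");
        }
    }

    /** Creates n uncached instances and returns the registry entries left behind. */
    static int leftoverEntries(int n, boolean closeAfterUse) {
        REGISTRY.clear();
        for (int i = 0; i < n; i++) {
            ToyFs fs = new ToyFs();
            if (closeAfterUse) {
                fs.close();
            }
        }
        return REGISTRY.size();
    }

    public static void main(String[] args) {
        System.out.println("never closed: " + leftoverEntries(500, false));
        System.out.println("closed:       " + leftoverEntries(500, true));
    }
}
```

In this model, the unclosed instances stay reachable through the registry, so the garbage collector cannot reclaim their counters. Reusing one cached instance avoids the growth without relying on every caller to close what it created.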

#### Performance Degradation
Frequent creation and destruction of FS instances impose significant
overhead, especially in high-concurrency scenarios.



### Release note

None

### Check List (For Author)

- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [x] Manual test (add detailed scripts or steps below)
```
CREATE CATALOG `iceberg_cos` PROPERTIES (
"warehouse" = "cosn://ha/ha/ha/stress/multi_fs",
"type" = "iceberg",
"iceberg.catalog.type" = "hadoop",
"cos.secret_key" = "*XXX",
"cos.region" = "ap-beijing",
"cos.endpoint" = "cos.ap-beijing.myqcloud.com",
"cos.access_key" = "**************"
);
```

Create a catalog backed by object storage, then run a scheduled script that continuously refreshes the catalog. Query the catalog periodically and monitor whether thread and memory usage behaves as expected.
<img width="1131" alt="image"
src="https://github.com/user-attachments/assets/c7b04a5a-449f-432c-975b-524fdb81247a"
/>

At 22:30 in the chart above, I replaced the running build with the fixed version.
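The manual test (refresh the catalog on a schedule and query it periodically) could be scripted roughly as follows. This is a sketch under stated assumptions, not the script used for the PR: `CatalogRefreshLoop`, `run`, and `dryRun` are hypothetical names, the `runSql` hook stands in for whatever client sends statements to the Doris FE (for example a JDBC connection), and `SHOW DATABASES FROM iceberg_cos` is only one possible choice of periodic query.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Sketch of a scheduled "refresh, then query" loop for the manual test. */
public class CatalogRefreshLoop {
    /**
     * Runs "refresh, then query" `rounds` times with a fixed delay between rounds.
     * `runSql` is a hypothetical hook; a real run would send each statement
     * to the Doris FE. `periodMillis` must be greater than zero.
     */
    static void run(Consumer<String> runSql, int rounds, long periodMillis)
            throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch remaining = new CountDownLatch(rounds);
        scheduler.scheduleWithFixedDelay(() -> {
            if (remaining.getCount() == 0) {
                return; // all requested rounds are done
            }
            try {
                runSql.accept("REFRESH CATALOG iceberg_cos");
                runSql.accept("SHOW DATABASES FROM iceberg_cos");
            } catch (RuntimeException e) {
                // Keep the schedule alive if one statement fails.
                System.err.println("statement failed: " + e.getMessage());
            } finally {
                remaining.countDown();
            }
        }, 0, periodMillis, TimeUnit.MILLISECONDS);
        remaining.await();
        scheduler.shutdownNow();
    }

    /** Dry run that records the statements instead of executing them. */
    static List<String> dryRun(int rounds) {
        List<String> issued = new CopyOnWriteArrayList<>();
        try {
            run(issued::add, rounds, 1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted during dry run", e);
        }
        return issued;
    }

    public static void main(String[] args) {
        dryRun(3).forEach(System.out::println);
    }
}
```

While such a loop runs against a real FE, memory and thread counts would be watched externally, as in the chart above.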
@Thearas

Thearas commented Dec 31, 2024

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@dataroaring reopened this on Dec 31, 2024
@Thearas

Thearas commented Dec 31, 2024

run buildall

@doris-robot

TPC-H: Total hot run time: 40702 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fa91de3fbed35be476a3e50e53bc6d6efc817018, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17583	7432	7226	7226
q2	2068	177	166	166
q3	10590	1055	1227	1055
q4	10584	726	767	726
q5	7745	2795	2907	2795
q6	236	150	142	142
q7	1078	615	605	605
q8	9575	2003	2042	2003
q9	6709	6462	6497	6462
q10	7039	2304	2356	2304
q11	469	265	257	257
q12	398	207	215	207
q13	17773	2990	2981	2981
q14	237	217	217	217
q15	576	522	520	520
q16	700	618	610	610
q17	965	525	542	525
q18	7220	6755	6627	6627
q19	1409	969	935	935
q20	472	211	206	206
q21	3979	3156	3152	3152
q22	1131	1027	981	981
Total cold run time: 108536 ms
Total hot run time: 40702 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	7254	7267	7223	7223
q2	326	233	239	233
q3	2950	2963	2893	2893
q4	2021	1818	1828	1818
q5	5699	5778	5750	5750
q6	219	139	142	139
q7	2309	1857	1850	1850
q8	3317	3625	3470	3470
q9	8806	8872	8843	8843
q10	3578	3564	3534	3534
q11	600	502	498	498
q12	795	597	593	593
q13	9195	3212	3190	3190
q14	288	290	264	264
q15	578	537	517	517
q16	700	669	651	651
q17	1870	1607	1570	1570
q18	8317	7730	7595	7595
q19	1685	1578	1435	1435
q20	2105	1859	1866	1859
q21	5495	5323	5325	5323
q22	1144	1071	1057	1057
Total cold run time: 69251 ms
Total hot run time: 60305 ms

@doris-robot

TPC-DS: Total hot run time: 198692 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fa91de3fbed35be476a3e50e53bc6d6efc817018, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1264	937	920	920
query2	6240	2063	2068	2063
query3	11052	4284	4235	4235
query4	66301	29243	23462	23462
query5	4925	477	459	459
query6	415	186	206	186
query7	5569	333	305	305
query8	320	238	231	231
query9	8919	2715	2697	2697
query10	452	283	264	264
query11	17238	15243	15783	15243
query12	152	102	105	102
query13	1502	437	431	431
query14	10239	7601	7444	7444
query15	205	180	184	180
query16	7152	522	471	471
query17	1055	561	579	561
query18	2043	315	313	313
query19	215	161	163	161
query20	118	114	114	114
query21	84	43	41	41
query22	4784	4577	4589	4577
query23	34761	34458	34544	34458
query24	6084	2905	2877	2877
query25	518	411	412	411
query26	654	176	176	176
query27	1861	307	307	307
query28	4332	2508	2498	2498
query29	711	466	458	458
query30	250	169	169	169
query31	992	820	836	820
query32	73	65	59	59
query33	425	292	285	285
query34	933	492	514	492
query35	835	788	763	763
query36	1082	985	969	969
query37	127	78	74	74
query38	4134	4069	4051	4051
query39	1492	1559	1475	1475
query40	143	86	90	86
query41	55	50	51	50
query42	119	106	103	103
query43	527	503	503	503
query44	1218	862	850	850
query45	196	175	170	170
query46	1171	736	730	730
query47	2071	1935	1938	1935
query48	497	396	382	382
query49	727	376	392	376
query50	847	429	429	429
query51	7342	7325	7229	7229
query52	101	84	83	83
query53	246	174	179	174
query54	541	458	441	441
query55	73	74	75	74
query56	261	243	242	242
query57	1230	1106	1111	1106
query58	209	201	211	201
query59	3281	3004	3032	3004
query60	285	252	242	242
query61	127	105	106	105
query62	785	661	691	661
query63	208	199	197	197
query64	1404	656	619	619
query65	3302	3218	3204	3204
query66	712	306	304	304
query67	16091	15773	15604	15604
query68	3689	589	584	584
query69	424	271	262	262
query70	1189	1090	1135	1090
query71	360	249	247	247
query72	6313	4133	4057	4057
query73	754	351	352	351
query74	10062	8941	9024	8941
query75	3355	2648	2662	2648
query76	1865	1025	1023	1023
query77	493	271	268	268
query78	10458	9736	9687	9687
query79	1156	596	596	596
query80	845	418	434	418
query81	494	246	238	238
query82	1292	123	121	121
query83	227	143	142	142
query84	282	76	86	76
query85	894	304	291	291
query86	328	284	291	284
query87	4350	4377	4389	4377
query88	3820	2401	2376	2376
query89	406	293	295	293
query90	2054	180	186	180
query91	177	146	142	142
query92	58	51	50	50
query93	1309	538	552	538
query94	820	291	291	291
query95	357	253	255	253
query96	603	273	278	273
query97	3357	3209	3234	3209
query98	215	208	200	200
query99	1567	1305	1280	1280
Total cold run time: 316850 ms
Total hot run time: 198692 ms

@doris-robot

ClickBench: Total hot run time: 32.75 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit fa91de3fbed35be476a3e50e53bc6d6efc817018, data reload: false

query	cold (s)	hot 1 (s)	hot 2 (s)
query1	0.03	0.03	0.04
query2	0.07	0.04	0.03
query3	0.24	0.07	0.06
query4	1.63	0.10	0.10
query5	0.53	0.50	0.52
query6	1.13	0.74	0.72
query7	0.02	0.01	0.01
query8	0.04	0.03	0.03
query9	0.57	0.50	0.50
query10	0.56	0.55	0.57
query11	0.14	0.09	0.10
query12	0.14	0.12	0.13
query13	0.61	0.59	0.59
query14	2.91	2.92	2.92
query15	0.88	0.83	0.82
query16	0.38	0.38	0.37
query17	0.99	1.05	1.01
query18	0.26	0.22	0.22
query19	1.82	1.74	1.97
query20	0.01	0.01	0.01
query21	15.35	0.60	0.61
query22	2.77	2.26	2.08
query23	17.05	1.05	0.83
query24	3.39	0.66	0.66
query25	0.25	0.09	0.12
query26	0.41	0.15	0.14
query27	0.04	0.03	0.04
query28	11.01	1.11	1.06
query29	12.56	3.24	3.24
query30	0.25	0.07	0.06
query31	2.86	0.38	0.38
query32	3.28	0.46	0.47
query33	2.99	3.05	2.99
query34	17.15	4.44	4.43
query35	4.51	4.48	4.50
query36	0.68	0.51	0.50
query37	0.10	0.06	0.06
query38	0.05	0.03	0.04
query39	0.04	0.03	0.02
query40	0.16	0.12	0.12
query41	0.07	0.02	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 108 s
Total hot run time: 32.75 s

@morningman merged commit 7e6c77d into branch-3.0 on Dec 31, 2024
@github-actions bot deleted the auto-pick-46184-branch-3.0 branch on December 31, 2024 09:02