[improvement](fe) Improve external catalog meta cache observability#63809

Open
suxiaogang223 wants to merge 2 commits into
apache:masterfrom
suxiaogang223:codex/add-catalog-meta-cache-eviction-rate

@suxiaogang223
Member

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

External catalog meta cache statistics exposed cumulative eviction count, but did not provide a direct replacement frequency metric for judging whether cache capacity is too small. This PR adds EVICTION_RATE to information_schema.catalog_meta_cache_statistics, calculated as eviction_count / request_count and returned as 0 when there are no requests.
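The EVICTION_RATE rule described above (eviction_count / request_count, with 0 returned when there are no requests) can be sketched as follows. This is a minimal illustration, not the code in this PR: the class name `EvictionRateSketch` and the method `evictionRate` are hypothetical.

```java
// Hypothetical helper for illustration only; names are not taken from the Doris code base.
final class EvictionRateSketch {
    private EvictionRateSketch() {
    }

    /** Returns evictionCount / requestCount, or 0 when no requests were recorded. */
    static double evictionRate(long evictionCount, long requestCount) {
        if (requestCount <= 0) {
            // No requests yet: report 0 instead of dividing by zero.
            return 0.0;
        }
        return (double) evictionCount / requestCount;
    }

    public static void main(String[] args) {
        System.out.println(evictionRate(0, 0));
        System.out.println(evictionRate(250, 1000));
    }
}
```

A rate that stays close to 0 suggests the cache rarely replaces entries, while a rate approaching 1 suggests that most requests force an eviction and the capacity is likely too small.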

Hive partition metadata cache defaults were also too small for common external catalog workloads, causing frequent evictions without explicit tuning. This PR increases the default Hive single-partition cache capacity from 10,000 to 100,000 and the Hive partitioned-table values cache capacity from 1,000 to 10,000. While checking similar cache entries, MaxCompute partition_values was found to cache table-level partition value structures but use the Hive single-partition capacity; it now follows the table-level partition values capacity.

Release note

Add EVICTION_RATE to information_schema.catalog_meta_cache_statistics, increase default Hive partition meta cache capacities, and make MaxCompute partition_values use the table-level partition values capacity.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
      • ./run-fe-ut.sh --run org.apache.doris.datasource.metacache.MetaCacheEntryTest
      • ./run-fe-ut.sh --run org.apache.doris.datasource.hive.HiveMetaStoreCacheTest,org.apache.doris.datasource.maxcompute.MaxComputeExternalMetaCacheTest
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. catalog_meta_cache_statistics includes EVICTION_RATE; default Hive partition meta cache capacities are larger; MaxCompute partition_values uses the table-level partition values capacity.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The catalog_meta_cache_statistics table exposed cumulative eviction count for external catalog meta cache entries, but did not provide a direct replacement frequency metric. Users had to derive replacement pressure manually when checking whether the configured meta cache capacity was too small. This change adds EVICTION_RATE, computed as eviction_count / request_count with zero returned when there are no requests, and exposes it beside EVICTION_COUNT in the information schema result.

### Release note

Add EVICTION_RATE to information_schema.catalog_meta_cache_statistics for observing catalog meta cache replacement frequency.

### Check List (For Author)

- Test: Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.datasource.metacache.MetaCacheEntryTest
- Behavior changed: Yes (catalog_meta_cache_statistics now includes EVICTION_RATE)
- Does this need documentation: No

### What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary: The default Hive partition metadata cache capacity was too small for common external catalog workloads, which could cause frequent evictions even without explicit capacity tuning. This change raises the single Hive partition metadata cache default from 10,000 to 100,000 and raises the Hive partitioned-table values cache default from 1,000 to 10,000. External fuzzy testing now includes the new default-size candidates. While checking similar cache entries, MaxCompute partition_values was found to cache table-level partition value structures while reusing the single-partition Hive capacity; it now follows the table-level partition values capacity instead.
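To illustrate why an undersized capacity leads to frequent evictions, here is a small simulation. It is a sketch under stated assumptions: a plain LRU map built on `java.util.LinkedHashMap` stands in for the real meta cache (whose eviction policy may differ), the names `CapacitySketch`, `CountingLruCache`, and `simulate` are hypothetical, and the sizes are scaled-down stand-ins rather than the actual defaults.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical simulation; not the Doris meta cache implementation.
final class CapacitySketch {
    private CapacitySketch() {
    }

    /** Tiny LRU cache that counts requests and evictions. */
    static final class CountingLruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;
        long requestCount;
        long evictionCount;

        CountingLruCache(int capacity) {
            super(16, 0.75f, true); // true = access order, which gives LRU behavior
            this.capacity = capacity;
        }

        /** Returns the cached value, or stores and returns the supplied one on a miss. */
        V getOrLoad(K key, V loaded) {
            requestCount++;
            V cached = get(key);
            if (cached != null) {
                return cached;
            }
            put(key, loaded);
            return loaded;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            boolean evict = size() > capacity;
            if (evict) {
                evictionCount++;
            }
            return evict;
        }
    }

    /** Scans distinctKeys keys for the given number of rounds and returns evictions / requests. */
    static double simulate(int capacity, int distinctKeys, int rounds) {
        CountingLruCache<Integer, String> cache = new CountingLruCache<>(capacity);
        for (int r = 0; r < rounds; r++) {
            for (int k = 0; k < distinctKeys; k++) {
                cache.getOrLoad(k, "partition-" + k);
            }
        }
        return cache.requestCount == 0 ? 0.0 : (double) cache.evictionCount / cache.requestCount;
    }

    public static void main(String[] args) {
        System.out.println("capacity 10, 100 keys:  " + simulate(10, 100, 3));
        System.out.println("capacity 100, 100 keys: " + simulate(100, 100, 3));
    }
}
```

In this toy model, a working set ten times larger than the capacity evicts on almost every request, while a capacity that covers the working set never evicts. That is the kind of difference the larger defaults are meant to produce for workloads with many partitions.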

### Release note

Increase default Hive partition meta cache capacities to reduce frequent evictions. MaxCompute partition_values cache now uses the table-level partition values capacity setting.

### Check List (For Author)

- Test: Unit Test
    - ./run-fe-ut.sh --run org.apache.doris.datasource.hive.HiveMetaStoreCacheTest,org.apache.doris.datasource.maxcompute.MaxComputeExternalMetaCacheTest
- Behavior changed: Yes (default Hive partition meta cache capacities are larger, and MaxCompute partition_values uses the table-level partition values capacity)
- Does this need documentation: No

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@suxiaogang223 suxiaogang223 marked this pull request as ready for review May 28, 2026 07:27
@suxiaogang223
Member Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31577 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ff4fea880a326c846de12a5d6fe56a5b439d23b6, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17695	4184	4268	4184
q2	q3	10769	1425	839	839
q4	4681	472	350	350
q5	7550	2248	2107	2107
q6	257	179	143	143
q7	982	795	640	640
q8	9418	1781	1539	1539
q9	5131	4986	4961	4961
q10	6386	2242	1872	1872
q11	434	269	246	246
q12	633	432	298	298
q13	18164	3388	2751	2751
q14	274	257	243	243
q15	q16	826	780	711	711
q17	986	984	911	911
q18	7111	5902	5731	5731
q19	1335	1257	1068	1068
q20	532	408	257	257
q21	5862	2629	2430	2430
q22	459	358	296	296
Total cold run time: 99485 ms
Total hot run time: 31577 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4469	4487	4417	4417
q2	q3	4499	4987	4339	4339
q4	2110	2246	1399	1399
q5	4443	4328	4766	4328
q6	285	238	159	159
q7	2166	1877	1663	1663
q8	2623	2275	2246	2246
q9	8373	8021	8022	8021
q10	4891	4790	4333	4333
q11	574	442	431	431
q12	761	780	544	544
q13	3337	3701	3010	3010
q14	295	300	268	268
q15	q16	721	740	663	663
q17	1460	1435	1347	1347
q18	7934	7344	7410	7344
q19	1173	1170	1113	1113
q20	2229	2238	1961	1961
q21	5300	4598	4452	4452
q22	517	474	404	404
Total cold run time: 58160 ms
Total hot run time: 52442 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 172185 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ff4fea880a326c846de12a5d6fe56a5b439d23b6, data reload: false

query5	4323	679	519	519
query6	366	231	214	214
query7	4230	589	322	322
query8	340	241	235	235
query9	8815	4089	4032	4032
query10	468	350	312	312
query11	5814	2539	2254	2254
query12	211	134	127	127
query13	1283	615	449	449
query14	6150	5533	5198	5198
query14_1	4564	4514	4483	4483
query15	224	211	183	183
query16	1036	484	462	462
query17	1182	774	623	623
query18	2573	513	376	376
query19	223	208	175	175
query20	140	137	133	133
query21	222	147	121	121
query22	13887	13675	13353	13353
query23	17510	16585	16142	16142
query23_1	16371	16451	16316	16316
query24	7476	1814	1358	1358
query24_1	1346	1348	1311	1311
query25	576	504	444	444
query26	1294	326	184	184
query27	2684	584	356	356
query28	4473	2038	2054	2038
query29	1014	677	529	529
query30	303	245	205	205
query31	1143	1123	953	953
query32	96	74	76	74
query33	543	349	300	300
query34	1180	1162	660	660
query35	786	790	714	714
query36	1381	1418	1226	1226
query37	155	114	96	96
query38	3219	3192	3065	3065
query39	937	917	926	917
query39_1	889	871	878	871
query40	242	152	125	125
query41	64	63	62	62
query42	113	110	115	110
query43	332	340	294	294
query44	
query45	217	205	207	205
query46	1118	1275	781	781
query47	2347	2381	2219	2219
query48	403	437	298	298
query49	649	521	389	389
query50	1006	347	257	257
query51	4307	4292	4350	4292
query52	107	107	95	95
query53	251	287	210	210
query54	313	263	251	251
query55	95	93	90	90
query56	296	316	297	297
query57	1437	1411	1332	1332
query58	303	280	268	268
query59	1634	1745	1454	1454
query60	325	332	319	319
query61	157	161	158	158
query62	708	655	588	588
query63	248	204	202	202
query64	2430	806	655	655
query65	
query66	1699	485	357	357
query67	29793	29978	29645	29645
query68	
query69	477	364	312	312
query70	1077	1031	991	991
query71	322	275	272	272
query72	2991	2813	2428	2428
query73	836	828	414	414
query74	5117	4961	4800	4800
query75	2700	2619	2286	2286
query76	2292	1187	779	779
query77	417	419	346	346
query78	12524	12506	11903	11903
query79	1501	1109	752	752
query80	924	545	452	452
query81	493	282	248	248
query82	1380	164	121	121
query83	356	288	253	253
query84	264	141	112	112
query85	930	567	508	508
query86	456	359	334	334
query87	3467	3407	3252	3252
query88	3682	2722	2748	2722
query89	452	397	339	339
query90	1801	192	194	192
query91	186	174	150	150
query92	81	79	73	73
query93	1474	1495	906	906
query94	635	362	299	299
query95	670	388	456	388
query96	1115	798	345	345
query97	2732	2769	2594	2594
query98	247	227	237	227
query99	1189	1145	1053	1053
Total cold run time: 255807 ms
Total hot run time: 172185 ms

@github-actions github-actions Bot added the approved Indicates a PR has been approved by one committer. label May 28, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

PR approved by anyone and no changes requested.

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 57.14% (4/7) 🎉
Increment coverage report
Complete coverage report

@Gabriel39
Contributor

/review

@github-actions github-actions Bot left a comment

Review opinion: no blocking issues found in the PR diff.

Critical checkpoint conclusions:

  • Goal/test: The PR adds EVICTION_RATE, increases Hive partition metadata cache defaults, and adjusts MaxCompute partition_values capacity. The changed code and added unit tests cover these goals.
  • Scope: The changes are small and focused on cache configuration and information_schema metadata exposure.
  • Concurrency/lifecycle: No new concurrency or special lifecycle risk found; cache specs are still captured during catalog cache initialization as before.
  • Configuration: Existing config defaults are changed, with fuzzy config updated accordingly. No new dynamic config behavior is introduced.
  • Compatibility: The information_schema table gains a new trailing column; row generation was updated in the same order.
  • Parallel paths: Hive and MaxCompute partition-values cache paths were checked; the MaxCompute path now uses the intended table-level capacity.
  • Tests: I attempted to run ./run-fe-ut.sh --run org.apache.doris.datasource.metacache.MetaCacheEntryTest, but the runner lacks thirdparty/installed/bin/protoc, so generated-source setup failed before the test executed. I did not run the remaining FE unit tests for the same environment reason.
  • Observability: The new EVICTION_RATE is exposed through catalog_meta_cache_statistics and computed safely as 0 when there are no requests.

User focus: No additional user-provided review focus was present.

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 4.35% (1/23) 🎉
Increment coverage report
Complete coverage report

Labels

approved Indicates a PR has been approved by one committer. reviewed

3 participants