[feat](metric) Add JVM buffer pool metrics to FE metric endpoints #63916

Open
saurabhkgp21 wants to merge 1 commit into apache:master from saurabhkgp21:jvm-metrics

@saurabhkgp21

Summary

Expose JVM buffer pool statistics in FE metric endpoints to improve observability of off-heap memory usage.

Changes:

  • Add buffer pool metrics (jvm_buffer_pool_used_bytes, jvm_buffer_pool_capacity_bytes, jvm_buffer_pool_count) to both Prometheus and JSON metric visitors, broken down by pool name (direct, mapped)
  • Add jvm_buffer_pool_max_bytes metric exposing the configured MaxDirectMemorySize
  • Improve JvmInfo to parse -XX:MaxDirectMemorySize= from JVM input arguments as a fallback when the standard API returns 0
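
As a rough illustration of the three per-pool metrics listed above, the sketch below reads the standard `BufferPoolMXBean` platform beans and renders them as Prometheus text lines. The class name `BufferPoolMetricsSketch` and the `render` method are hypothetical; the PR itself wires the metrics through the existing FE metric visitors and `JvmStats`, which may be structured differently.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not the PR's visitor code: renders JVM buffer pool
// statistics as Prometheus text lines using the metric names from the summary.
final class BufferPoolMetricsSketch {

    static String render() {
        StringBuilder sb = new StringBuilder();
        // HotSpot typically registers one bean per pool, e.g. "direct" and "mapped".
        for (BufferPoolMXBean pool
                : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            String labels = "{name=\"" + pool.getName() + "\"} ";
            // Bytes the JVM estimates are currently in use by this pool.
            sb.append("jvm_buffer_pool_used_bytes").append(labels)
              .append(pool.getMemoryUsed()).append('\n');
            // Total capacity of all buffers in this pool.
            sb.append("jvm_buffer_pool_capacity_bytes").append(labels)
              .append(pool.getTotalCapacity()).append('\n');
            // Number of buffers in this pool.
            sb.append("jvm_buffer_pool_count").append(labels)
              .append(pool.getCount()).append('\n');
        }
        return sb.toString();
    }
}
```

Calling `BufferPoolMetricsSketch.render()` after allocating a direct `ByteBuffer` should show a non-zero count for the `direct` pool.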

Motivation

Buffer pool metrics are critical for diagnosing off-heap memory issues (e.g., direct buffer OOM). Currently, FE exposes heap and GC metrics but not buffer pool stats, making it difficult to monitor direct memory pressure via Prometheus/Grafana.

Expose JVM buffer pool statistics (used bytes, capacity, count) and
direct memory max in both Prometheus and JSON metric visitors to improve
observability of off-heap memory usage.

Also improve MaxDirectMemorySize detection by parsing JVM input arguments
as a fallback when the standard API returns 0.
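
A minimal sketch of the argument-parsing fallback described above, assuming the flag value uses the usual `k`/`m`/`g`/`t` suffixes. The class `MaxDirectMemoryArgSketch` and its `parseSize` helper are hypothetical stand-ins; the PR uses `ByteSizeValue.simpleParseBytesSizeValue` for the actual parsing.

```java
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the fallback in JvmInfo; names are illustrative only.
final class MaxDirectMemoryArgSketch {
    private static final String FLAG = "-XX:MaxDirectMemorySize=";

    // Returns the configured limit in bytes, or 0 when the flag is absent or malformed.
    static long parseFromArgs(List<String> jvmArgs) {
        long result = 0L;
        for (String arg : jvmArgs) {
            if (arg.startsWith(FLAG)) {
                long parsed = parseSize(arg.substring(FLAG.length()));
                if (parsed > 0) {
                    // Keep the last valid occurrence, assuming later flags override earlier ones.
                    result = parsed;
                }
            }
        }
        return result;
    }

    // Parses values such as "512m" or "2G" into bytes; returns 0 on any parse failure.
    static long parseSize(String text) {
        String s = text.trim().toLowerCase(Locale.ROOT);
        if (s.isEmpty()) {
            return 0L;
        }
        long multiplier = 1L;
        switch (s.charAt(s.length() - 1)) {
            case 'k': multiplier = 1L << 10; break;
            case 'm': multiplier = 1L << 20; break;
            case 'g': multiplier = 1L << 30; break;
            case 't': multiplier = 1L << 40; break;
            default: break;
        }
        if (multiplier != 1L) {
            s = s.substring(0, s.length() - 1);
        }
        try {
            return Math.multiplyExact(Long.parseLong(s), multiplier);
        } catch (NumberFormatException | ArithmeticException e) {
            return 0L;
        }
    }

    // Reads the arguments the running JVM was started with.
    static long fromRunningJvm() {
        return parseFromArgs(ManagementFactory.getRuntimeMXBean().getInputArguments());
    }
}
```

Note that this path still yields 0 when the flag is not passed at all, so a caller needs a separate answer for the unset case.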
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@saurabhkgp21
Author

/review

@morrySnow
Contributor

/review

@morrySnow
Contributor

run buildall

Contributor

@github-actions github-actions Bot left a comment

Review result: request changes.

Critical checkpoint conclusions:

  • Goal/test: The PR adds FE JVM buffer pool metrics to JSON and Prometheus endpoints. The used/capacity/count metrics are wired to existing JvmStats buffer-pool collection, but the new max-direct-memory metric is incorrect for the default FE Java 17 startup path, and I did not find a test covering that case.
  • Scope: The change is small and focused on metric visitors plus JvmInfo fallback logic.
  • Concurrency/lifecycle: Metric generation is under MetricRepo.getMetric synchronization and JvmInfo is a static singleton; no new mutable shared state or lock-order issue was found.
  • Config/compatibility: No new config or storage/protocol compatibility concern.
  • Parallel paths: JSON and Prometheus visitors were both updated. SimpleCoreMetricVisitor intentionally remains unchanged.
  • Conditions/error handling: The direct-memory fallback silently keeps 0 after all detection paths fail; this becomes user-visible through the new metric and is the blocking issue.
  • Test coverage: No tests were added for metric output or MaxDirectMemorySize fallback/default behavior.
  • Observability/performance: The feature improves observability, but the new max metric is misleading in common deployments until the fallback is fixed. No meaningful performance issue was found.

User focus: No additional user-provided review focus was present.

try {
    directMemoryMax = ByteSizeValue.simpleParseBytesSizeValue(
            arg.substring("-XX:MaxDirectMemorySize=".length()),
            "MaxDirectMemorySize");
Contributor

This still leaves directMemoryMax as 0 in the default FE Java 17 configuration. Neither conf/fe.conf nor bin/start_fe.sh sets -XX:MaxDirectMemorySize or opens java.base/jdk.internal.misc, so sun.misc.VM is unavailable, jdk.internal.misc.VM.maxDirectMemory() cannot be made accessible, and the new fallback finds no matching input argument. The newly added jvm_buffer_pool_max_bytes{name="direct"} metric then reports 0, even though HotSpot defaults the limit to the max heap size when MaxDirectMemorySize is unset. That makes the new metric misleading for normal deployments and breaks direct-memory pressure alerting based on used / max. Please resolve the default case, for example by falling back to the VM's effective default instead of publishing zero, or by not emitting the max metric when the limit is genuinely unknown.
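
One possible shape for the first suggestion, sketched under assumptions: the class `EffectiveDirectMemoryMaxSketch` is hypothetical, the `HotSpotDiagnosticMXBean` lookup is HotSpot-specific and is assumed to report `MaxDirectMemorySize` as a byte count (0 when unset), and `Runtime.getRuntime().maxMemory()` is used as the effective default on the understanding that the JDK applies it when the flag is not given. It avoids `sun.misc.VM` and `jdk.internal.misc.VM`, so no `--add-opens` is needed.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical sketch of a default-aware lookup; not the PR's JvmInfo code.
final class EffectiveDirectMemoryMaxSketch {

    // Uses the configured limit when it is positive, otherwise the supplied default.
    static long resolve(long configuredBytes, long defaultBytes) {
        return configuredBytes > 0 ? configuredBytes : defaultBytes;
    }

    // Reads the MaxDirectMemorySize VM option; returns 0 when unset or unavailable.
    static long configuredFromVmOption() {
        try {
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            if (bean == null) {
                return 0L;
            }
            return Long.parseLong(bean.getVMOption("MaxDirectMemorySize").getValue());
        } catch (RuntimeException e) {
            // Covers non-HotSpot VMs, unknown options, and unparsable values.
            return 0L;
        }
    }

    // Effective limit: explicit setting if present, else the max heap reported by the runtime.
    static long effectiveMax() {
        return resolve(configuredFromVmOption(), Runtime.getRuntime().maxMemory());
    }
}
```

With a positive `effectiveMax()`, a pressure ratio such as used / max stays well defined in the default configuration instead of dividing by zero.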

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 81.40% (35/43) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 6.68% (35/524) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

TPC-H: Total hot run time: 29167 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7418019dbabc2bf587e0a977797e5457e0fc6f39, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	17706	4045	3981	3981
q2	q3	10772	1375	826	826
q4	4689	483	345	345
q5	7549	895	599	599
q6	179	170	138	138
q7	764	864	655	655
q8	9348	1700	1629	1629
q9	5934	4521	4512	4512
q10	6749	1815	1558	1558
q11	434	274	259	259
q12	634	432	283	283
q13	18138	3338	2773	2773
q14	265	255	237	237
q15	q16	831	787	708	708
q17	948	875	966	875
q18	6973	5684	5609	5609
q19	1331	1309	987	987
q20	524	400	263	263
q21	6348	2820	2605	2605
q22	451	386	325	325
Total cold run time: 100567 ms
Total hot run time: 29167 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
q1	5177	4705	4833	4705
q2	q3	4829	5386	4768	4768
q4	2098	2195	1396	1396
q5	4828	4765	4825	4765
q6	233	181	128	128
q7	1850	1747	1579	1579
q8	2381	2134	2073	2073
q9	7959	7873	7556	7556
q10	4752	4635	4206	4206
q11	526	386	356	356
q12	732	739	530	530
q13	2953	3442	2799	2799
q14	268	279	251	251
q15	q16	686	703	619	619
q17	1268	1247	1245	1245
q18	7175	6869	6734	6734
q19	1171	1073	1118	1073
q20	2194	2201	1945	1945
q21	5271	4561	4405	4405
q22	507	449	398	398
Total cold run time: 56858 ms
Total hot run time: 51531 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171232 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7418019dbabc2bf587e0a977797e5457e0fc6f39, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query5	4308	648	520	520
query6	329	243	198	198
query7	4223	539	308	308
query8	333	227	213	213
query9	8812	4037	3994	3994
query10	461	335	289	289
query11	5803	2404	2186	2186
query12	194	133	131	131
query13	1282	619	430	430
query14	6087	5426	5080	5080
query14_1	4404	4418	4402	4402
query15	211	201	181	181
query16	999	447	442	442
query17	952	721	577	577
query18	2466	479	353	353
query19	222	202	175	175
query20	140	134	136	134
query21	222	143	118	118
query22	13602	13704	13327	13327
query23	17347	16473	16244	16244
query23_1	16336	16350	16380	16350
query24	7493	1790	1305	1305
query24_1	1335	1322	1322	1322
query25	597	510	457	457
query26	1304	326	173	173
query27	2697	569	344	344
query28	4464	2021	1995	1995
query29	1026	650	529	529
query30	308	238	202	202
query31	1154	1086	950	950
query32	97	77	76	76
query33	563	370	302	302
query34	1174	1134	644	644
query35	772	806	712	712
query36	1392	1411	1291	1291
query37	155	109	89	89
query38	3224	3170	3088	3088
query39	937	918	919	918
query39_1	878	885	890	885
query40	244	157	133	133
query41	76	72	71	71
query42	121	118	116	116
query43	329	333	290	290
query44	
query45	217	207	200	200
query46	1092	1192	719	719
query47	2445	2409	2234	2234
query48	403	425	298	298
query49	661	526	411	411
query50	994	352	267	267
query51	4403	4256	4274	4256
query52	112	107	99	99
query53	263	284	207	207
query54	349	292	288	288
query55	98	95	99	95
query56	329	322	349	322
query57	1463	1439	1361	1361
query58	331	287	282	282
query59	1581	1669	1418	1418
query60	343	342	361	342
query61	165	166	160	160
query62	710	646	578	578
query63	247	204	215	204
query64	2399	805	636	636
query65	
query66	1715	479	359	359
query67	29622	29644	29490	29490
query68	
query69	475	338	301	301
query70	1043	989	994	989
query71	317	297	278	278
query72	3081	2756	2483	2483
query73	826	783	437	437
query74	5119	4941	4793	4793
query75	2710	2614	2273	2273
query76	2270	1151	757	757
query77	402	411	341	341
query78	12398	12414	11896	11896
query79	1536	1055	752	752
query80	777	531	484	484
query81	472	279	245	245
query82	1487	157	129	129
query83	347	277	246	246
query84	303	143	112	112
query85	931	546	503	503
query86	443	358	324	324
query87	3429	3373	3245	3245
query88	3655	2722	2724	2722
query89	458	391	344	344
query90	1803	191	188	188
query91	176	163	138	138
query92	82	78	76	76
query93	1560	1439	818	818
query94	628	374	302	302
query95	666	393	437	393
query96	1017	764	343	343
query97	2744	2740	2607	2607
query98	261	226	243	226
query99	1178	1167	1051	1051
Total cold run time: 254515 ms
Total hot run time: 171232 ms
