[improvement](executor) use real elapsed time to compute workload group metrics refresh interval #63224

Closed
bosswnx wants to merge 1 commit into apache:master from bosswnx:fix/wg-metrics-real-interval

@bosswnx bosswnx commented May 13, 2026

Proposed changes

Issue Number: N/A

Problem Summary

WorkloadGroupMetrics::refresh_metrics() computes per-second CPU time and
local/remote scan bytes by dividing the delta of each counter by an
interval derived from config::workload_group_metrics_interval_ms:

int interval_second = config::workload_group_metrics_interval_ms / 1000;
...
_per_sec_cpu_time_nanos = (_current_cpu_time_nanos - _last_cpu_time_nanos) / interval_second;

This is inaccurate for several reasons:

  1. The refresh thread is not guaranteed to fire exactly every
    workload_group_metrics_interval_ms. Under BE load, GC pressure, or
    scheduling delays, the actual gap between two refreshes can drift
    significantly, but the divisor is still the configured interval, so the
    reported per-second rates do not reflect reality.
  2. If workload_group_metrics_interval_ms is changed at runtime, the divisor
    updates immediately while the counter delta still spans the old
    interval, producing a one-shot incorrect rate.
  3. When workload_group_metrics_interval_ms < 1000, the integer division
    rounds the divisor down to 0, which would cause a divide-by-zero.
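
The first and third points can be illustrated with a small standalone sketch. The function names below are hypothetical and are not part of the Doris code base; only the integer division by 1000 mirrors the snippet quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Divisor as computed in the quoted snippet: whole seconds via integer division.
inline int configured_interval_seconds(int interval_ms) {
    return interval_ms / 1000;
}

// Rate as reported when dividing by the configured interval.
// Precondition: the configured interval is at least 1000 ms; otherwise the
// divisor is 0 (point 3 above).
inline int64_t reported_rate(int64_t delta, int configured_interval_ms) {
    int interval_second = configured_interval_seconds(configured_interval_ms);
    assert(interval_second > 0);
    return delta / interval_second;
}

// Rate over the time that actually elapsed between two refreshes.
inline int64_t actual_rate(int64_t delta, int64_t elapsed_ms) {
    assert(elapsed_ms > 0);
    return delta * 1000 / elapsed_ms;
}
```

For example, with a configured interval of 5000 ms and a real gap of 8000 ms, a counter delta of 10000 is reported as 2000 per second by reported_rate, while actual_rate gives 1250 per second. A configured interval of 500 ms yields a divisor of 0.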

What this PR does

Replace the fixed config-based interval with the actual monotonic time
delta between two consecutive refresh_metrics() invocations:

uint64_t current_time_ms = MonotonicMillis();
uint64_t interval_second = (current_time_ms - _last_refresh_time_ms) / 1000;
_last_refresh_time_ms = current_time_ms;

A new member std::atomic<uint64_t> _last_refresh_time_ms{0} is added to
record the timestamp of the previous refresh.

This makes the per-second CPU / local-scan / remote-scan metrics reflect the
actual elapsed time between refreshes, as measured by the monotonic clock,
regardless of refresh-thread jitter or runtime config changes.
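
As a rough illustration of the idea (not the code in this PR), the sketch below tracks a single counter and derives its per-second rate from the elapsed monotonic time. The class PerSecondRate and the helper monotonic_millis() are hypothetical names; std::chrono::steady_clock stands in for MonotonicMillis(). Unlike the snippet above, the sketch keeps millisecond precision and skips the update when no time has elapsed, since an integer division by 1000 yields 0 for any gap shorter than one second.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Stand-in for a monotonic millisecond clock (assumption: plays the role of
// MonotonicMillis() in the description).
inline uint64_t monotonic_millis() {
    using namespace std::chrono;
    return static_cast<uint64_t>(
            duration_cast<milliseconds>(steady_clock::now().time_since_epoch()).count());
}

// Hypothetical tracker for one monotonically increasing counter.
class PerSecondRate {
public:
    explicit PerSecondRate(uint64_t start_ms) : _last_time_ms(start_ms) {}

    // Returns the per-second rate since the previous refresh. If no measurable
    // time has elapsed, the previous rate is kept to avoid dividing by zero.
    uint64_t refresh(uint64_t current_value, uint64_t now_ms) {
        assert(now_ms >= _last_time_ms && current_value >= _last_value);
        uint64_t elapsed_ms = now_ms - _last_time_ms;
        if (elapsed_ms > 0) {
            _rate = (current_value - _last_value) * 1000 / elapsed_ms;
            _last_value = current_value;
            _last_time_ms = now_ms;
        }
        return _rate;
    }

private:
    uint64_t _last_time_ms;
    uint64_t _last_value = 0;
    uint64_t _rate = 0;
};
```

In a refresh loop this would be called as tracker.refresh(counter_value, monotonic_millis()). Passing the timestamp in as a parameter keeps the arithmetic easy to unit test.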

Files changed

  • be/src/runtime/workload_group/workload_group_metrics.cpp
  • be/src/runtime/workload_group/workload_group_metrics.h

Known limitation / follow-up

On the first invocation after BE startup, _last_refresh_time_ms is 0,
so interval_second becomes MonotonicMillis() / 1000 (a very large
number) and the first sample of each per-second metric is reported as
near-zero. The values converge to correct readings from the second refresh
onwards. A follow-up can initialize _last_refresh_time_ms to
MonotonicMillis() in the constructor so that the first sample is also
accurate; this is kept out of this PR to keep the change minimal.
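
The effect of the initial value can be seen in a small sketch. interval_seconds() mirrors the whole-second arithmetic quoted earlier; the RefreshClock struct is a hypothetical holder showing what seeding the timestamp at construction might look like, not the actual follow-up change.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the interval arithmetic from the description: whole seconds,
// integer division.
inline uint64_t interval_seconds(uint64_t now_ms, uint64_t last_refresh_ms) {
    assert(now_ms >= last_refresh_ms);
    return (now_ms - last_refresh_ms) / 1000;
}

// Hypothetical holder that seeds the timestamp at construction time instead
// of leaving it at 0.
struct RefreshClock {
    uint64_t last_refresh_ms;

    explicit RefreshClock(uint64_t construction_time_ms)
            : last_refresh_ms(construction_time_ms) {}

    // Returns the whole seconds elapsed since the previous tick and records
    // the new timestamp.
    uint64_t tick(uint64_t now_ms) {
        uint64_t seconds = interval_seconds(now_ms, last_refresh_ms);
        last_refresh_ms = now_ms;
        return seconds;
    }
};
```

If the monotonic clock reads 86400000 ms at the first refresh and the stored timestamp is still 0, the divisor is 86400 seconds. With the timestamp seeded at construction and the first refresh arriving 5000 ms later, the divisor is 5.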

Checklist

  • I have created an [improvement] type PR
  • I have updated the relevant unit/regression tests if necessary
    (no behavioral change observable from SQL; covered by existing
    workload-group metrics tests)
  • Documentation update is not required

Need backport

Please backport to all maintained branches:

  • master
  • branch-4.1
  • branch-4.0
  • branch-3.1
  • branch-3.0
  • branch-2.1
  • branch-2.0

…up metrics refresh interval

Replace the fixed config-based interval with the actual monotonic time delta between two refreshes when calculating per-second CPU and scan IO rates in WorkloadGroupMetrics, so the rates stay accurate even when the refresh thread is delayed or the configured interval is changed at runtime.
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bosswnx
Author

bosswnx commented May 13, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29887 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6a0d4c0113ff49c1a9081faae71c6160f88129ea, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17612	4035	4061	4035
q2	q3	10725	878	612	612
q4	4664	465	343	343
q5	7456	1349	1159	1159
q6	195	177	147	147
q7	908	968	754	754
q8	9304	1440	1311	1311
q9	5621	5364	5342	5342
q10	6262	2083	1805	1805
q11	464	266	258	258
q12	633	411	299	299
q13	18079	3256	2726	2726
q14	303	287	262	262
q15	q16	905	865	795	795
q17	954	1132	750	750
q18	6468	5727	5589	5589
q19	1168	1262	1108	1108
q20	510	402	271	271
q21	4607	2392	1986	1986
q22	449	388	335	335
Total cold run time: 97287 ms
Total hot run time: 29887 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4978	4994	5027	4994
q2	q3	4692	4787	4187	4187
q4	2129	2207	1411	1411
q5	5015	5039	5278	5039
q6	210	181	139	139
q7	2054	1814	1601	1601
q8	3333	3098	3091	3091
q9	8337	8431	8477	8431
q10	4505	4472	4247	4247
q11	616	407	390	390
q12	698	742	521	521
q13	3306	3588	2926	2926
q14	306	296	353	296
q15	q16	858	813	678	678
q17	1403	1348	1278	1278
q18	8040	7240	7095	7095
q19	1148	1143	1162	1143
q20	2233	2222	1972	1972
q21	6175	5453	4901	4901
q22	547	506	417	417
Total cold run time: 60583 ms
Total hot run time: 54757 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169899 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6a0d4c0113ff49c1a9081faae71c6160f88129ea, data reload: false

query5	4317	651	498	498
query6	341	227	195	195
query7	4282	586	312	312
query8	331	240	219	219
query9	8853	4012	3963	3963
query10	451	354	289	289
query11	5783	2394	2176	2176
query12	175	126	121	121
query13	1264	612	466	466
query14	5977	5344	5085	5085
query14_1	4374	4369	4320	4320
query15	212	204	187	187
query16	995	480	462	462
query17	987	770	640	640
query18	2470	497	369	369
query19	218	202	163	163
query20	138	132	131	131
query21	231	139	119	119
query22	13702	13521	13288	13288
query23	17278	16417	15954	15954
query23_1	16093	16095	16206	16095
query24	7408	1778	1322	1322
query24_1	1335	1353	1390	1353
query25	544	514	453	453
query26	1321	317	177	177
query27	2720	622	346	346
query28	4430	1997	2007	1997
query29	1001	639	517	517
query30	307	240	191	191
query31	1129	1049	941	941
query32	86	76	78	76
query33	529	368	289	289
query34	1199	1175	653	653
query35	777	788	692	692
query36	1314	1326	1128	1128
query37	153	105	97	97
query38	3203	3162	3032	3032
query39	936	934	888	888
query39_1	920	881	859	859
query40	245	169	140	140
query41	77	64	66	64
query42	108	108	112	108
query43	328	335	279	279
query44	
query45	214	202	193	193
query46	1086	1221	737	737
query47	2299	2374	2168	2168
query48	399	404	293	293
query49	628	563	436	436
query50	729	287	225	225
query51	4334	4219	4223	4219
query52	106	102	92	92
query53	253	284	211	211
query54	306	271	249	249
query55	91	88	85	85
query56	303	294	291	291
query57	1377	1392	1290	1290
query58	291	276	268	268
query59	1527	1624	1396	1396
query60	345	331	330	330
query61	161	155	161	155
query62	665	616	555	555
query63	242	198	206	198
query64	2444	835	679	679
query65	
query66	1752	557	395	395
query67	30001	29918	29747	29747
query68	
query69	459	345	304	304
query70	1017	1001	966	966
query71	307	274	264	264
query72	2915	2719	2468	2468
query73	851	763	443	443
query74	5086	4929	4746	4746
query75	2756	2634	2323	2323
query76	2277	1140	790	790
query77	422	427	359	359
query78	12804	12833	12410	12410
query79	1402	1043	772	772
query80	707	586	492	492
query81	463	284	237	237
query82	1386	157	121	121
query83	357	284	251	251
query84	292	143	116	116
query85	895	518	442	442
query86	391	323	352	323
query87	3403	3339	3200	3200
query88	3620	2720	2692	2692
query89	438	392	339	339
query90	1957	185	181	181
query91	189	165	145	145
query92	81	79	73	73
query93	966	973	553	553
query94	534	351	307	307
query95	647	483	337	337
query96	1041	766	363	363
query97	2698	2703	2535	2535
query98	239	228	228	228
query99	1126	1106	981	981
Total cold run time: 252710 ms
Total hot run time: 169899 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.57% (20614/38484)
Line Coverage 37.20% (194903/523893)
Region Coverage 33.62% (152536/453725)
Branch Coverage 34.62% (66523/192128)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.69% (27771/37684)
Line Coverage 57.62% (301063/522512)
Region Coverage 54.90% (251541/458141)
Branch Coverage 56.37% (108716/192856)

@bosswnx
Author

bosswnx commented May 14, 2026

/review

@bosswnx bosswnx closed this May 14, 2026