[improvement](executor) use real elapsed time to compute workload group metrics refresh interval #63244

Open
bosswnx wants to merge 2 commits into apache:master from bosswnx:improvement/wg-metrics-real-interval

@bosswnx bosswnx commented May 14, 2026

What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary:

WorkloadGroupMetrics::refresh_metrics() computes per-second CPU time and
local/remote scan bytes by dividing the delta of each counter by an interval
derived from config::workload_group_metrics_interval_ms:

int interval_second = config::workload_group_metrics_interval_ms / 1000;
...
_per_sec_cpu_time_nanos = (_current_cpu_time_nanos - _last_cpu_time_nanos) / interval_second;

This is inaccurate for several reasons:

  1. The refresh thread is not guaranteed to fire exactly every
    workload_group_metrics_interval_ms. Under BE load or scheduling delays
    the actual gap between two refreshes can drift significantly, but the
    divisor is still the configured interval, so the reported per-second
    rates do not reflect reality.
  2. If workload_group_metrics_interval_ms is changed at runtime, the
    divisor updates immediately while the counter delta still spans the
    old interval, producing a one-shot incorrect rate.
  3. When workload_group_metrics_interval_ms < 1000, the integer division
    rounds the divisor down to 0, which would cause a divide-by-zero.
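
To make the first and third points concrete, here is a minimal sketch. It is not code from the PR: the function names config_interval_seconds, config_based_rate and elapsed_based_rate are hypothetical, and only the divide-by-1000 arithmetic is taken from the snippet above.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the old divisor: configured milliseconds truncated to whole seconds.
int64_t config_interval_seconds(int64_t configured_interval_ms) {
    return configured_interval_ms / 1000;  // e.g. 500 / 1000 == 0
}

// Old-style rate (hypothetical helper). The divisor ignores how much time
// really passed, and it is 0 for any configured interval below 1000 ms.
uint64_t config_based_rate(uint64_t counter_delta, int64_t configured_interval_ms) {
    int64_t interval_second = config_interval_seconds(configured_interval_ms);
    assert(interval_second > 0);  // guard: integer division by zero is undefined behavior
    return counter_delta / static_cast<uint64_t>(interval_second);
}

// Reference rate (hypothetical helper) based on the time that actually elapsed.
double elapsed_based_rate(uint64_t counter_delta, uint64_t elapsed_ms) {
    assert(elapsed_ms > 0);
    return counter_delta * 1000.0 / elapsed_ms;
}
```

With a configured interval of 30000 ms and a refresh that actually arrives after 45000 ms, a counter delta of 45000 is reported as 1500 per second by the config-based formula, while the elapsed-time formula gives 1000 per second. A configured interval of 500 ms yields a divisor of 0.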

This PR replaces the fixed config-based interval with the actual
monotonic time delta between two consecutive refresh_metrics() invocations:

uint64_t current_time_ms = MonotonicMillis();
uint64_t interval_second = (current_time_ms - _last_refresh_time_ms) / 1000;
_last_refresh_time_ms = current_time_ms;

A new member std::atomic<uint64_t> _last_refresh_time_ms{0} is added to
record the timestamp of the previous refresh. This makes the per-second
CPU / local-scan / remote-scan metrics reflect the actual elapsed
interval as measured by the monotonic clock, regardless of refresh-thread
jitter or runtime config changes.
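
One way to picture the new computation is the sketch below. It is an illustration under assumptions, not the PR's implementation: ElapsedRate and its members are hypothetical names, and the caller passes in the timestamp (for example a MonotonicMillis() reading) so the logic can be exercised without a real clock. Unlike the snippet above, the sketch keeps millisecond resolution by multiplying the delta by 1000 before dividing, because (current_time_ms - _last_refresh_time_ms) / 1000 truncates to whole seconds and evaluates to 0 when two refreshes are less than one second apart.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not part of Doris: turns successive readings of a
// cumulative counter into a per-second rate using elapsed monotonic time.
// Assumes the counter and the timestamps never decrease.
class ElapsedRate {
public:
    ElapsedRate(uint64_t start_ms, uint64_t start_value)
            : _last_ms(start_ms), _last_value(start_value) {}

    // now_ms is expected to come from a monotonic clock.
    uint64_t update(uint64_t now_ms, uint64_t value) {
        uint64_t elapsed_ms = now_ms - _last_ms;
        if (elapsed_ms == 0) {
            return _rate;  // no time has passed; keep the previous rate
        }
        // Scale by 1000 before dividing so that sub-second gaps do not
        // truncate the divisor to 0.
        _rate = (value - _last_value) * 1000 / elapsed_ms;
        _last_ms = now_ms;
        _last_value = value;
        return _rate;
    }

private:
    uint64_t _last_ms;
    uint64_t _last_value;
    uint64_t _rate = 0;
};
```

Passing the starting timestamp to the constructor also means the first call to update() already divides by a realistic interval.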

Known limitation: on the very first invocation after BE startup,
_last_refresh_time_ms is 0, so interval_second becomes
MonotonicMillis() / 1000 (a large number) and the first sample of each
per-second metric will be reported as near-zero. The values converge to
correct readings from the second refresh onwards. A follow-up may
initialize _last_refresh_time_ms to MonotonicMillis() in the
constructor; kept out of this PR to keep the change minimal.
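
The effect on the first sample can be checked with plain arithmetic. The sketch below reuses the PR's formula with the clock reading passed as a parameter; per_sec_rate is a hypothetical name, and the one-day clock reading and 30-second refresh gap are assumed values chosen only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Same arithmetic as the PR snippet, with the timestamps passed in
// explicitly (hypothetical helper for illustration only).
uint64_t per_sec_rate(uint64_t counter_delta, uint64_t now_ms, uint64_t last_refresh_ms) {
    uint64_t interval_second = (now_ms - last_refresh_ms) / 1000;
    assert(interval_second > 0);  // guard against a zero divisor
    return counter_delta / interval_second;
}
```

Assume the monotonic clock reads 86400000 ms (one day) at the first refresh and 30000000000 ns of CPU time have accumulated. With the timestamp left at 0, per_sec_rate(30000000000, 86400000, 0) gives 347222 ns per second. With the timestamp seeded 30 seconds earlier, per_sec_rate(30000000000, 86400000, 86370000) gives 1000000000 ns per second.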

Release note

None

Check List (For Author)

  • Test

    • No need to test or manual test. Explain why:
      • Previous test can cover this change. The change only replaces
        the per-second-rate divisor with a more accurate value; the
        metric semantics, plumbing and existing workload-group metrics
        tests remain unchanged.
  • Behavior changed:

    • No.
  • Does this need documentation?

    • No.

…up metrics refresh interval

Replace the fixed config-based interval with the actual monotonic time delta between two refreshes when calculating per-second CPU and scan IO rates in WorkloadGroupMetrics, so the rates stay accurate even when the refresh thread is delayed or the configured interval is changed at runtime.
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@bosswnx
Author

bosswnx commented May 14, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29661 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b25939d7e38299b32e567abfa728616850bcef9a, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17871	3863	3935	3863
q2
q3	10714	870	604	604
q4	4670	461	342	342
q5	7451	1317	1135	1135
q6	191	171	139	139
q7	898	954	776	776
q8	9324	1420	1301	1301
q9	5637	5398	5373	5373
q10	6259	2069	1821	1821
q11	468	271	268	268
q12	637	418	298	298
q13	18054	3327	2688	2688
q14	294	288	263	263
q15
q16	861	875	795	795
q17	959	1053	761	761
q18	6488	5764	5613	5613
q19	1191	1167	1082	1082
q20	523	395	261	261
q21	5076	2423	1947	1947
q22	486	425	331	331
Total cold run time: 98052 ms
Total hot run time: 29661 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4713	4821	4811	4811
q2
q3	4687	4768	4207	4207
q4	2126	2215	1420	1420
q5	4982	5010	5255	5010
q6	201	182	142	142
q7	2031	1852	1639	1639
q8	3384	3229	3162	3162
q9	8571	8464	8470	8464
q10	4531	4494	4253	4253
q11	602	434	390	390
q12	717	757	521	521
q13	3301	3597	2925	2925
q14	453	350	288	288
q15
q16	807	790	705	705
q17	1319	1293	1323	1293
q18	7956	7049	7136	7049
q19	1175	1135	1115	1115
q20	2233	2256	1967	1967
q21	6105	5427	4994	4994
q22	541	501	397	397
Total cold run time: 60435 ms
Total hot run time: 54752 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169745 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b25939d7e38299b32e567abfa728616850bcef9a, data reload: false

query5	4340	664	526	526
query6	354	220	212	212
query7	4245	565	311	311
query8	333	245	230	230
query9	8836	4087	4070	4070
query10	472	347	306	306
query11	5788	2422	2200	2200
query12	180	130	126	126
query13	1298	602	426	426
query14	6035	5386	5083	5083
query14_1	4416	4400	4411	4400
query15	215	204	187	187
query16	992	469	390	390
query17	1005	794	649	649
query18	2459	499	362	362
query19	230	214	169	169
query20	143	145	134	134
query21	223	140	128	128
query22	13722	13544	13457	13457
query23	17238	16438	16014	16014
query23_1	16273	16263	16217	16217
query24	7415	1738	1350	1350
query24_1	1348	1335	1354	1335
query25	558	480	438	438
query26	1315	327	171	171
query27	2694	622	334	334
query28	4489	1956	1927	1927
query29	1048	629	507	507
query30	304	229	195	195
query31	1123	1067	948	948
query32	97	77	76	76
query33	544	359	289	289
query34	1179	1155	638	638
query35	783	779	670	670
query36	1317	1324	1149	1149
query37	154	110	92	92
query38	3227	3158	3055	3055
query39	931	931	897	897
query39_1	872	903	910	903
query40	232	158	138	138
query41	66	63	61	61
query42	110	112	112	112
query43	322	335	285	285
query44	
query45	215	200	192	192
query46	1057	1156	688	688
query47	2300	2284	2104	2104
query48	406	415	286	286
query49	625	523	461	461
query50	718	286	216	216
query51	4272	4300	4256	4256
query52	105	110	97	97
query53	257	287	205	205
query54	309	304	248	248
query55	93	90	87	87
query56	294	315	313	313
query57	1423	1400	1300	1300
query58	307	261	272	261
query59	1561	1656	1436	1436
query60	335	335	322	322
query61	150	155	155	155
query62	670	616	563	563
query63	242	204	208	204
query64	2427	821	670	670
query65	
query66	1769	521	391	391
query67	30072	29866	29179	29179
query68	
query69	458	330	297	297
query70	979	1037	1000	1000
query71	301	272	269	269
query72	2968	2739	2405	2405
query73	871	762	444	444
query74	5096	4893	4726	4726
query75	2778	2676	2331	2331
query76	2288	1134	763	763
query77	419	419	349	349
query78	12841	12876	12463	12463
query79	1380	946	735	735
query80	667	582	501	501
query81	464	281	241	241
query82	1292	165	119	119
query83	365	292	259	259
query84	297	144	112	112
query85	879	501	438	438
query86	415	330	329	329
query87	3443	3358	3234	3234
query88	3533	2653	2663	2653
query89	445	384	339	339
query90	1943	181	173	173
query91	177	170	148	148
query92	85	80	78	78
query93	984	929	565	565
query94	575	357	345	345
query95	681	479	366	366
query96	1042	805	328	328
query97	2673	2702	2562	2562
query98	240	231	237	231
query99	1128	1106	1016	1016
Total cold run time: 253250 ms
Total hot run time: 169745 ms

@bosswnx
Author

bosswnx commented May 14, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29528 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 6a22b6a14826608663456532a644be8c1d8ddc1d, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17706	3934	3812	3812
q2
q3	10698	901	608	608
q4	4676	461	347	347
q5	7460	1323	1124	1124
q6	183	165	138	138
q7	908	964	735	735
q8	9314	1384	1265	1265
q9	5716	5360	5352	5352
q10	6273	2094	1788	1788
q11	456	275	259	259
q12	625	416	293	293
q13	18122	3424	2777	2777
q14	289	282	260	260
q15
q16	896	876	786	786
q17	1014	992	704	704
q18	6514	5776	5618	5618
q19	1159	1233	1089	1089
q20	505	404	268	268
q21	5079	2371	1962	1962
q22	477	441	343	343
Total cold run time: 98070 ms
Total hot run time: 29528 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4677	4718	4760	4718
q2
q3	4662	4765	4195	4195
q4	2158	2219	1441	1441
q5	4972	4952	5229	4952
q6	200	179	140	140
q7	2044	1812	1614	1614
q8	3312	3123	3175	3123
q9	8512	8367	8446	8367
q10	4479	4517	4262	4262
q11	621	429	392	392
q12	695	750	508	508
q13	3252	3540	2886	2886
q14	320	400	281	281
q15
q16	758	783	691	691
q17	1335	1292	1271	1271
q18	7967	7185	7089	7089
q19	1144	1165	1156	1156
q20	2230	2237	1966	1966
q21	6163	5439	4890	4890
q22	547	501	428	428
Total cold run time: 60048 ms
Total hot run time: 54370 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171560 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 6a22b6a14826608663456532a644be8c1d8ddc1d, data reload: false

query5	4336	648	533	533
query6	328	211	202	202
query7	4247	570	310	310
query8	321	232	220	220
query9	8823	4008	4042	4008
query10	444	347	295	295
query11	5788	2366	2282	2282
query12	184	128	123	123
query13	1286	580	437	437
query14	6197	5387	5076	5076
query14_1	4380	4333	4355	4333
query15	212	206	200	200
query16	1030	459	424	424
query17	1120	745	617	617
query18	2744	486	357	357
query19	214	206	166	166
query20	137	133	128	128
query21	215	141	119	119
query22	13586	14989	14347	14347
query23	17547	16527	16184	16184
query23_1	16332	16198	16103	16103
query24	7412	1764	1317	1317
query24_1	1382	1351	1370	1351
query25	548	487	425	425
query26	1341	326	167	167
query27	2688	584	325	325
query28	4416	1985	1944	1944
query29	965	608	514	514
query30	305	237	197	197
query31	1135	1059	940	940
query32	86	73	68	68
query33	527	347	291	291
query34	1207	1135	662	662
query35	760	766	681	681
query36	1370	1322	1222	1222
query37	160	99	86	86
query38	3216	3129	3064	3064
query39	933	920	903	903
query39_1	870	880	877	877
query40	232	151	134	134
query41	63	60	63	60
query42	107	107	107	107
query43	315	322	280	280
query44	
query45	217	206	192	192
query46	1078	1194	719	719
query47	2294	2284	2198	2198
query48	398	418	290	290
query49	640	521	417	417
query50	734	282	222	222
query51	4350	4269	4220	4220
query52	102	101	93	93
query53	260	282	198	198
query54	312	272	251	251
query55	92	87	91	87
query56	291	310	299	299
query57	1387	1358	1285	1285
query58	296	282	274	274
query59	1535	1628	1386	1386
query60	340	339	325	325
query61	158	166	160	160
query62	671	620	555	555
query63	237	203	205	203
query64	2386	897	740	740
query65	
query66	1714	524	407	407
query67	30107	30032	29821	29821
query68	
query69	476	353	309	309
query70	1063	1012	965	965
query71	312	282	262	262
query72	3178	2988	2657	2657
query73	870	752	395	395
query74	5146	4902	4720	4720
query75	2774	2658	2335	2335
query76	2306	1126	733	733
query77	423	431	343	343
query78	12998	12997	12370	12370
query79	1485	967	755	755
query80	1222	576	505	505
query81	501	273	241	241
query82	1324	155	129	129
query83	345	284	245	245
query84	264	143	112	112
query85	915	508	444	444
query86	428	360	319	319
query87	3425	3322	3211	3211
query88	3553	2682	2663	2663
query89	446	372	346	346
query90	1786	179	175	175
query91	184	163	140	140
query92	79	80	73	73
query93	968	952	555	555
query94	610	368	297	297
query95	675	375	466	375
query96	1053	794	338	338
query97	2721	2697	2562	2562
query98	238	229	230	229
query99	1092	1110	973	973
Total cold run time: 254920 ms
Total hot run time: 171560 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.57% (20629/38505)
Line Coverage 37.20% (195030/524257)
Region Coverage 33.62% (152684/454145)
Branch Coverage 34.61% (66549/192270)

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.49% (20640/38586)
Line Coverage 37.15% (195060/525073)
Region Coverage 33.52% (152649/455411)
Branch Coverage 34.55% (66545/192628)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (3/3) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.71% (27851/37786)
Line Coverage 57.67% (302004/523696)
Region Coverage 54.89% (252387/459826)
Branch Coverage 56.46% (109167/193358)
