[feat](job) add per-job routine load metrics #63576

Open
sollhui wants to merge 1 commit into apache:master from sollhui:enhance_job_metrics

@sollhui
Contributor

@sollhui sollhui commented May 25, 2026

What problem does this PR solve?

Problem Summary:

Before this change, FE only exposed aggregate routine load metrics, such as total loaded rows, error rows, received bytes, task execution time, progress, lag, and aborted task count across all routine load jobs. These metrics were useful for observing the whole FE, but they could not identify which routine load job contributed to a spike, lag, or abnormal error/task count.

This change adds per-job routine load metrics. Each metric is exported with job_id and job_name labels, so users can inspect the status of a single routine load job from the FE metrics endpoint.

The new metrics are:

  • doris_fe_routine_load_per_job_total_rows
  • doris_fe_routine_load_per_job_error_rows
  • doris_fe_routine_load_per_job_received_bytes
  • doris_fe_routine_load_per_job_task_execute_time
  • doris_fe_routine_load_per_job_task_execute_count
  • doris_fe_routine_load_per_job_progress
  • doris_fe_routine_load_per_job_lag
  • doris_fe_routine_load_per_job_abort_task_num
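
As a rough illustration of how the `job_id` and `job_name` labels can be consumed, the following Python sketch parses per-job samples out of metrics text and picks the job with the highest lag. The sample text, job IDs, job names, and values are hypothetical, and the exact line layout emitted by FE is an assumption based on the usual Prometheus text exposition format.

```python
import re
from typing import Dict, List, NamedTuple, Optional, Tuple


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float


PREFIX = "doris_fe_routine_load_per_job_"

# Matches lines such as: metric_name{key="value",...} 123 [timestamp]
_SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
_LABEL_RE = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="((?:[^"\\]|\\.)*)"')

# Hypothetical metrics text; real output may differ in detail.
SAMPLE = """
# TYPE doris_fe_routine_load_per_job_lag gauge
doris_fe_routine_load_per_job_lag{job_id="10001",job_name="orders_load"} 42
doris_fe_routine_load_per_job_lag{job_id="10002",job_name="clicks_load"} 7
doris_fe_routine_load_per_job_error_rows{job_id="10001",job_name="orders_load"} 3
doris_fe_routine_load_total_rows 500
"""


def parse_per_job_samples(text: str) -> List[Sample]:
    """Collect labelled per-job samples, skipping comments and other metrics."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match is None or not match.group("name").startswith(PREFIX):
            continue
        labels = dict(_LABEL_RE.findall(match.group("labels")))
        samples.append(Sample(match.group("name"), labels, float(match.group("value"))))
    return samples


def worst_lag_job(samples: List[Sample]) -> Optional[Tuple[str, float]]:
    """Return (job_name, lag) for the job with the largest lag, if any."""
    lags = [s for s in samples if s.name == PREFIX + "lag"]
    if not lags:
        return None
    worst = max(lags, key=lambda s: s.value)
    return worst.labels.get("job_name", ""), worst.value


if __name__ == "__main__":
    print(worst_lag_job(parse_per_job_samples(SAMPLE)))
```

The same parsed samples can be grouped by `job_id` instead of `job_name` if job names are not unique in a deployment.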

Example query against the FE metrics endpoint:

curl -s "http://<fe_host>:<http_port>/metrics" | grep routine_load_per_job
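
For scripts that cannot shell out to `curl` and `grep`, roughly the same check can be sketched in Python with the standard library only. The function names here are hypothetical, and the host and port must be supplied by the caller, as in the command above.

```python
from typing import List
from urllib.request import urlopen


def filter_per_job_lines(metrics_text: str, needle: str = "routine_load_per_job") -> List[str]:
    """Keep only the lines that contain `needle`, like `grep needle`."""
    return [line for line in metrics_text.splitlines() if needle in line]


def fetch_per_job_lines(fe_host: str, http_port: int, timeout: float = 5.0) -> List[str]:
    """Fetch /metrics from the FE HTTP port and return the per-job lines."""
    url = f"http://{fe_host}:{http_port}/metrics"
    with urlopen(url, timeout=timeout) as resp:
        body = resp.read().decode("utf-8")
    return filter_per_job_lines(body)
```

`fetch_per_job_lines` needs a reachable FE, while `filter_per_job_lines` can be exercised on any saved copy of the metrics output.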

Example Prometheus query:

doris_fe_routine_load_per_job_lag
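
The bare metric name returns one series per job. To narrow the result to a single job, a label matcher on `job_name` (or `job_id`) can be added. The sketch below builds such a selector and the matching URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). The Prometheus base URL and the job name are placeholders, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

LAG_METRIC = "doris_fe_routine_load_per_job_lag"


def build_lag_query(job_name: str) -> str:
    """Return a PromQL selector for the lag of one routine load job."""
    # Escape backslashes and double quotes so the label value stays a valid string.
    escaped = job_name.replace("\\", "\\\\").replace('"', '\\"')
    return f'{LAG_METRIC}{{job_name="{escaped}"}}'


def build_query_url(prometheus_base: str, promql: str) -> str:
    """Return the instant-query URL for the given PromQL expression."""
    return f"{prometheus_base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Placeholder Prometheus address and job name, for illustration only.
    print(build_query_url("http://localhost:9090", build_lag_query("orders_load")))
```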

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@sollhui force-pushed the enhance_job_metrics branch from 0c17b7d to 17b51b9 on May 25, 2026 03:48
@sollhui
Contributor Author

sollhui commented May 25, 2026

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31618 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
TPC-H sf100 test result on commit 17b51b93fd6852779de71f06802f8a45122699b5, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17683	4057	4059	4057
q2	q3	10749	1374	797	797
q4	4686	480	351	351
q5	7620	2241	2071	2071
q6	257	174	140	140
q7	997	768	628	628
q8	9415	1665	1575	1575
q9	6664	4974	4932	4932
q10	6433	2227	1905	1905
q11	430	271	243	243
q12	698	426	290	290
q13	18203	3410	2783	2783
q14	265	256	242	242
q15	q16	808	763	707	707
q17	913	900	922	900
q18	6706	5890	5520	5520
q19	1344	1220	1230	1220
q20	544	429	282	282
q21	6047	2772	2668	2668
q22	466	378	307	307
Total cold run time: 100928 ms
Total hot run time: 31618 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4971	4812	4968	4812
q2	q3	4989	5344	4683	4683
q4	2107	2206	1424	1424
q5	4987	4797	4656	4656
q6	228	172	130	130
q7	1866	1772	1630	1630
q8	2449	1970	1954	1954
q9	7462	7427	7407	7407
q10	4745	4674	4252	4252
q11	546	394	364	364
q12	745	749	536	536
q13	3075	3391	2803	2803
q14	271	289	249	249
q15	q16	691	702	609	609
q17	1314	1271	1289	1271
q18	7247	6746	6857	6746
q19	1096	1077	1089	1077
q20	2223	2227	1987	1987
q21	5332	4668	4555	4555
q22	524	449	404	404
Total cold run time: 56868 ms
Total hot run time: 51549 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 173292 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 17b51b93fd6852779de71f06802f8a45122699b5, data reload: false

query5	4335	650	514	514
query6	332	226	198	198
query7	4241	575	326	326
query8	324	230	232	230
query9	8839	4164	4161	4161
query10	443	351	299	299
query11	5808	2483	2205	2205
query12	180	134	125	125
query13	1280	611	426	426
query14	6246	5552	5259	5259
query14_1	4570	4570	4552	4552
query15	215	206	187	187
query16	1025	459	449	449
query17	1178	762	616	616
query18	2605	496	362	362
query19	219	212	168	168
query20	146	130	131	130
query21	218	149	122	122
query22	13701	13634	13424	13424
query23	17455	16566	16343	16343
query23_1	16432	16388	16369	16369
query24	7410	1788	1326	1326
query24_1	1334	1296	1356	1296
query25	567	486	433	433
query26	1329	330	166	166
query27	2672	584	350	350
query28	4434	2012	2013	2012
query29	966	626	492	492
query30	301	239	204	204
query31	1148	1081	959	959
query32	84	74	74	74
query33	539	361	291	291
query34	1181	1146	651	651
query35	766	799	714	714
query36	1403	1388	1263	1263
query37	152	103	91	91
query38	3208	3186	3092	3092
query39	920	926	887	887
query39_1	901	878	887	878
query40	234	145	127	127
query41	68	66	65	65
query42	109	118	111	111
query43	338	349	306	306
query44	
query45	209	204	195	195
query46	1103	1217	727	727
query47	2447	2428	2340	2340
query48	417	415	314	314
query49	654	507	404	404
query50	989	361	256	256
query51	4497	4368	4289	4289
query52	105	103	93	93
query53	281	283	205	205
query54	312	272	270	270
query55	93	89	86	86
query56	293	302	294	294
query57	1443	1408	1343	1343
query58	297	280	254	254
query59	1669	1735	1526	1526
query60	325	319	312	312
query61	159	158	159	158
query62	695	654	578	578
query63	248	201	205	201
query64	2380	796	652	652
query65	
query66	1691	480	353	353
query67	30050	30021	29907	29907
query68	
query69	468	348	294	294
query70	1081	1003	1011	1003
query71	301	274	268	268
query72	3073	2685	2473	2473
query73	847	776	429	429
query74	5179	4958	4844	4844
query75	2701	2604	2256	2256
query76	2302	1120	771	771
query77	410	410	348	348
query78	12473	12321	11848	11848
query79	1271	1012	763	763
query80	585	538	492	492
query81	453	277	249	249
query82	239	153	120	120
query83	271	304	250	250
query84	256	143	114	114
query85	864	547	464	464
query86	361	330	330	330
query87	3427	3397	3255	3255
query88	3641	2760	2736	2736
query89	427	389	343	343
query90	2202	185	181	181
query91	189	169	142	142
query92	80	83	74	74
query93	1527	1458	855	855
query94	520	362	320	320
query95	706	465	342	342
query96	1049	858	342	342
query97	2722	2753	2613	2613
query98	235	231	229	229
query99	1174	1146	1029	1029
Total cold run time: 254194 ms
Total hot run time: 173292 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 35.14% (39/111) 🎉
Increment coverage report
Complete coverage report
