
[improve](streaming-job) add per-job lag metric to streaming insert jobs #63194

Merged
JNSimba merged 2 commits into apache:master from JNSimba:improve/streaming-job-per-job-lag-metric
May 15, 2026

Conversation

@JNSimba
Member

@JNSimba JNSimba commented May 13, 2026

What problem does this PR solve?

Related PR: #62224 (per-job metrics), #62269 (Lag column)

Problem Summary:

#62224 introduced per-job metrics (streaming_job_per_job_scanned_rows, _load_bytes, _filtered_rows, _succeed_task_count, _failed_task_count) for streaming insert jobs, exposed via /metrics with job_id/job_name labels for Prometheus.

#62269 later added a Lag column to SHOW JOBS / jobs() TVF that reports end-to-end CDC delay in seconds, but the value was only exposed through SQL; there was no corresponding Prometheus metric, so dashboards and alerting on lag were not possible.

This PR adds streaming_job_per_job_lag (unit: SECONDS) to the existing per-job metric set.

Implementation:

  • New StreamingInsertJob#getLagSeconds() parses the existing offsetProvider.getLag() string into a long. Returns -1 when lag is not applicable (S3, snapshot phase) or unparseable, so dashboards can filter N/A jobs via lag >= 0 and distinguish them from "CDC caught up (lag = 0)".
  • MetricRepo.updateStreamingJobPerJobMetrics() registers the gauge for every streaming insert job (no skip), matching the always-present behavior of the other per-job metrics.
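
The parsing rule described in the first bullet can be sketched as follows. This is a minimal illustration and not the merged code: the class `LagParser` and the method `parseLagSeconds` are hypothetical stand-ins for `StreamingInsertJob#getLagSeconds()`, and the string argument stands in for whatever `offsetProvider.getLag()` returns.

```java
// Hypothetical helper, not part of Doris: mirrors the described
// "parse the lag string, fall back to -1" behavior.
class LagParser {
    /** Sentinel for "lag not applicable or unparseable" (-1 per the PR description). */
    static final long LAG_NOT_APPLICABLE = -1L;

    static long parseLagSeconds(String lag) {
        // Empty or missing lag: S3 source, snapshot phase, or no provider.
        if (lag == null || lag.trim().isEmpty()) {
            return LAG_NOT_APPLICABLE;
        }
        try {
            return Long.parseLong(lag.trim());
        } catch (NumberFormatException e) {
            // Never let a malformed string break metric collection.
            return LAG_NOT_APPLICABLE;
        }
    }
}
```

Returning a sentinel instead of throwing keeps the gauge readable for every job, which is what lets a dashboard filter with `lag >= 0`.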

Semantics summary:

| Job state | TVF Lag | Metric value |
| --- | --- | --- |
| S3 / snapshot phase / no provider | "" | -1 |
| CDC caught up (idle) | "0" | 0 |
| CDC actively lagging | "N" | N |
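
On the consuming side, the three rows above map onto three states. The sketch below shows one way a client might interpret a scraped sample; the enum `LagStatus` and its method are hypothetical names introduced only for illustration.

```java
// Hypothetical consumer-side helper: classifies a scraped
// streaming_job_per_job_lag sample according to the semantics table.
enum LagStatus {
    NOT_APPLICABLE, // -1: S3 source, snapshot phase, or no provider
    CAUGHT_UP,      // 0: CDC is idle and fully caught up
    LAGGING;        // N > 0: CDC is N seconds behind

    static LagStatus fromMetricValue(long lagSeconds) {
        if (lagSeconds < 0) {
            return NOT_APPLICABLE;
        }
        return lagSeconds == 0 ? CAUGHT_UP : LAGGING;
    }
}
```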

Release note

Add streaming_job_per_job_lag Prometheus metric (unit: seconds, -1 means N/A) for streaming insert jobs.

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Member Author

JNSimba commented May 13, 2026

/review

Contributor

Copilot AI left a comment

Pull request overview

This PR extends the existing per-job Prometheus metrics for streaming insert jobs by adding a per-job lag gauge (streaming_job_per_job_lag, seconds). The lag value is sourced from the CDC offset provider and is only emitted when meaningful (CDC incremental/binlog/WAL phase), aligning Prometheus exposure with the existing SQL Lag column added previously.

Changes:

  • Add new per-job metric name constant and register/unregister logic for streaming_job_per_job_lag in FE metrics export.
  • Add StreamingInsertJob#getLag() wrapper to expose offset-provider lag to the metrics layer.
  • Extend MySQL streaming job regression test to (a) drive the job into incremental phase and (b) validate the new per-job lag metric appears in /metrics?type=json.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| regression-test/suites/job_p0/streaming_job/cdc/test_streaming_mysql_job_metrics.groovy | Ensures lag becomes available after snapshot and checks the new per-job lag metric is emitted. |
| fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java | Adds per-job lag metric registration and cleanup as part of per-request per-job metric regeneration. |
| fe/fe-core/src/main/java/org/apache/doris/job/extensions/insert/streaming/StreamingInsertJob.java | Adds a small wrapper method to fetch lag from the offset provider for metric consumption. |

Contributor

@github-actions github-actions Bot left a comment

Review result: no blocking issues found.

Critical checkpoint conclusions:

  • Goal and tests: The change adds streaming_job_per_job_lag for streaming insert jobs and extends the MySQL CDC metrics regression test to wait for a numeric Lag and then assert the metric appears.
  • Scope: The actual GitHub PR diff is focused on StreamingInsertJob, MetricRepo, and the CDC metrics regression test.
  • Concurrency and locking: No new locking or shared mutable structures are introduced; per-job metric refresh remains inside the existing synchronized metric collection flow.
  • Lifecycle: No new persisted state or replay behavior is added; lag is read from the existing offset provider.
  • Configuration and compatibility: No new config, thrift, storage format, or rolling-upgrade-sensitive protocol changes.
  • Parallel paths: Existing per-job streaming metrics are extended consistently with remove/re-register behavior; empty/non-applicable lag is skipped as intended.
  • Error handling: Invalid or non-numeric lag strings are ignored rather than breaking /metrics, matching the defensive behavior expected for observability code.
  • Test coverage: The regression test covers CDC lag becoming available and verifies the Prometheus JSON metric label appears. I did not run the external MySQL regression locally in this runner.
  • Observability: Metric name, unit, and labels follow the adjacent per-job streaming metrics.
  • User focus: No additional user-provided focus points were present.

@JNSimba
Member Author

JNSimba commented May 13, 2026

/review

@JNSimba
Member Author

JNSimba commented May 13, 2026

run buildall

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

Comment on lines +1334 to +1344
GaugeMetric<Long> lag = new GaugeMetric<Long>(
        STREAMING_JOB_PER_JOB_LAG, MetricUnit.SECONDS,
        "per job lag in seconds of streaming job, -1 means N/A") {
    @Override
    public Long getValue() {
        return sJob.getLagSeconds();
    }
};
lag.addLabel(new MetricLabel("job_id", jobId))
        .addLabel(new MetricLabel("job_name", jobName));
DORIS_METRIC_REGISTER.addMetrics(lag);
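
The snippet above relies on a pull-style gauge: the value is not stored at registration time but computed each time the metric is read. The self-contained sketch below illustrates that pattern only. `LabeledGauge` is a hypothetical class, not Doris's `GaugeMetric`, and the rendered line is a simplified Prometheus-like format rather than the exact `/metrics` output.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.LongSupplier;
import java.util.stream.Collectors;

// Hypothetical stand-in for a labeled gauge whose value is read lazily.
class LabeledGauge {
    private final String name;
    private final LongSupplier valueSource;
    // LinkedHashMap keeps labels in insertion order for stable rendering.
    private final Map<String, String> labels = new LinkedHashMap<>();

    LabeledGauge(String name, LongSupplier valueSource) {
        this.name = name;
        this.valueSource = valueSource;
    }

    LabeledGauge addLabel(String key, String value) {
        labels.put(key, value);
        return this;
    }

    // Evaluated on every read, so each scrape sees the current lag.
    long getValue() {
        return valueSource.getAsLong();
    }

    // Simplified rendering: name{k="v",...} value
    String render() {
        String labelText = labels.entrySet().stream()
                .map(e -> e.getKey() + "=\"" + e.getValue() + "\"")
                .collect(Collectors.joining(","));
        return name + "{" + labelText + "} " + getValue();
    }
}
```

Because the supplier is consulted at read time, a job that moves from the snapshot phase into the incremental phase changes its reported value without being re-registered.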
Contributor

@github-actions github-actions Bot left a comment

I found one blocking test-coverage issue: the added regression check only verifies that the new lag metric exists, not that it exposes a real CDC lag value. Because the implementation emits -1 for N/A/snapshot/unparseable lag, this test would still pass if the job never reaches the binlog/WAL phase or if getLag() always returns an empty/unparseable value.

Critical checkpoint conclusions:

  • Goal/test proof: The code adds a per-job lag metric, but the regression test does not prove the CDC lag value is actually surfaced.
  • Scope/focus: The production change is small and localized.
  • Concurrency/lifecycle: Metrics are updated from the existing synchronized metrics path; no new lock-order or lifecycle issue found.
  • Configuration/compatibility: No new configs or storage/protocol compatibility concerns found.
  • Parallel paths: The metric is added to the existing streaming per-job metrics path.
  • Tests: The new assertion needs to validate the metric value is numeric and non-negative after the job has entered CDC/binlog phase, instead of accepting the -1 N/A fallback.
  • Observability/performance: The new metric is lightweight and uses existing labels; no additional observability issue found.

User focus: No additional user-provided review focus was present.
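
The stricter check requested in this review could look roughly like the sketch below. It is an assumption-laden illustration: the actual regression test is written in Groovy and reads `/metrics?type=json`, whereas this version scans a simplified Prometheus text payload, and the class `LagMetricCheck` and its methods are hypothetical.

```java
import java.util.OptionalLong;

// Hypothetical test helper: asserts on the lag VALUE, not just on presence.
class LagMetricCheck {
    static final String METRIC = "streaming_job_per_job_lag";

    // Returns the lag sample for the given job, if one can be found and parsed.
    static OptionalLong findLag(String metricsText, String jobName) {
        String label = "job_name=\"" + jobName + "\"";
        for (String line : metricsText.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.startsWith("#") || !trimmed.contains(METRIC) || !trimmed.contains(label)) {
                continue;
            }
            // Assumes the sample value is the last space-separated token.
            String value = trimmed.substring(trimmed.lastIndexOf(' ') + 1);
            try {
                return OptionalLong.of(Long.parseLong(value));
            } catch (NumberFormatException e) {
                return OptionalLong.empty();
            }
        }
        return OptionalLong.empty();
    }

    // True only when the series exists and carries a real (non-sentinel) lag.
    static boolean hasRealLag(String metricsText, String jobName) {
        OptionalLong lag = findLag(metricsText, jobName);
        return lag.isPresent() && lag.getAsLong() >= 0;
    }
}
```

A presence-only assertion would pass for a job stuck at `-1`; requiring `>= 0` after the job enters the binlog/WAL phase closes that gap.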

@JNSimba
Member Author

JNSimba commented May 13, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

Review summary:

No additional distinct inline issues found beyond the existing review threads.

Critical checkpoint conclusions:

  • Goal and proof: The PR adds a per-job streaming lag metric from the existing Lag value. The implementation exposes the metric, but the existing thread correctly notes the regression test only proves the series exists and does not prove a real non-N/A CDC lag value is exported.
  • Scope: The code change is small and focused on one metric plus its regression coverage.
  • Concurrency and locking: No new metadata locks or multi-lock paths are introduced. The metric reads existing job/offset state during /metrics collection, matching the existing metrics pattern.
  • Lifecycle/static initialization: No new static initialization dependency or non-obvious lifecycle ownership issue found.
  • Configuration/compatibility: No new config, persistence format, FE-BE protocol, or rolling-upgrade compatibility issue found.
  • Parallel paths: The new metric is added to the existing per-job streaming metric registration/removal path. No separate cloud/shared-nothing path requiring another change was found.
  • Conditional checks/error handling: The -1 N/A sentinel behavior is intentional per the current PR description; existing review discussion already covers the semantic tradeoff and test implications.
  • Test coverage: Existing review threads cover the main gap: the regression test should verify the lag value semantics, not just metric presence.
  • Observability/performance: The metric improves observability and uses the existing metric refresh path; no additional hot-path or excessive-allocation concern found.
  • Transaction/data correctness: No transaction visibility, persistence, or data write path changes are introduced.

User focus: No additional user-provided review focus was specified.

@hello-stephen
Contributor

TPC-H: Total hot run time: 29725 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9f11c95b853170daccc6b988a6e4afb3efba5a18, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17621	3907	3847	3847
q2	q3	10709	878	612	612
q4	4661	467	337	337
q5	7449	1346	1142	1142
q6	187	166	138	138
q7	913	958	742	742
q8	9325	1421	1280	1280
q9	5609	5378	5359	5359
q10	6256	2083	1809	1809
q11	462	270	267	267
q12	649	421	292	292
q13	18122	3386	2751	2751
q14	288	289	265	265
q15	q16	907	875	786	786
q17	967	1092	837	837
q18	6534	5734	5632	5632
q19	1161	1234	1066	1066
q20	494	408	261	261
q21	4635	2468	1946	1946
q22	470	407	356	356
Total cold run time: 97419 ms
Total hot run time: 29725 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4892	4700	4806	4700
q2	q3	4657	4834	4213	4213
q4	2144	2228	1404	1404
q5	5009	5026	5285	5026
q6	191	165	139	139
q7	2085	1778	1648	1648
q8	3426	3123	3121	3121
q9	8447	8445	8484	8445
q10	4478	4471	4255	4255
q11	590	421	403	403
q12	699	752	502	502
q13	3269	3566	2935	2935
q14	313	314	288	288
q15	q16	799	797	681	681
q17	1502	1291	1260	1260
q18	8033	7112	7038	7038
q19	1190	1179	1150	1150
q20	2242	2218	1961	1961
q21	6097	5396	4836	4836
q22	557	537	418	418
Total cold run time: 60620 ms
Total hot run time: 54423 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 171883 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9f11c95b853170daccc6b988a6e4afb3efba5a18, data reload: false

query5	4310	649	510	510
query6	337	219	202	202
query7	4309	566	332	332
query8	336	244	226	226
query9	8852	4023	4028	4023
query10	443	346	299	299
query11	5810	2417	2237	2237
query12	185	128	126	126
query13	1303	595	446	446
query14	6101	5369	5047	5047
query14_1	4350	4333	4313	4313
query15	211	204	179	179
query16	1005	465	423	423
query17	1131	754	623	623
query18	2723	476	355	355
query19	211	202	164	164
query20	140	133	136	133
query21	213	140	115	115
query22	13568	14083	14552	14083
query23	17422	16625	16259	16259
query23_1	16335	16442	16223	16223
query24	7368	1766	1360	1360
query24_1	1355	1371	1377	1371
query25	592	518	463	463
query26	1321	314	173	173
query27	2749	608	349	349
query28	4412	1952	1976	1952
query29	1031	669	554	554
query30	303	239	197	197
query31	1129	1067	942	942
query32	89	75	72	72
query33	557	361	301	301
query34	1169	1109	617	617
query35	772	792	663	663
query36	1326	1336	1217	1217
query37	156	106	90	90
query38	3186	3105	3061	3061
query39	922	916	893	893
query39_1	881	883	892	883
query40	240	161	142	142
query41	70	67	65	65
query42	111	109	106	106
query43	335	332	286	286
query44	
query45	210	207	196	196
query46	1107	1168	739	739
query47	2340	2341	2213	2213
query48	401	448	291	291
query49	639	534	416	416
query50	699	279	209	209
query51	4262	4280	4187	4187
query52	105	105	94	94
query53	256	272	197	197
query54	319	280	253	253
query55	93	89	83	83
query56	296	307	316	307
query57	1421	1417	1312	1312
query58	311	276	271	271
query59	1527	1620	1362	1362
query60	346	330	318	318
query61	164	155	157	155
query62	667	623	550	550
query63	243	192	206	192
query64	2373	843	672	672
query65	
query66	1686	530	384	384
query67	30015	29992	29854	29854
query68	
query69	473	344	297	297
query70	1012	990	973	973
query71	308	277	266	266
query72	3021	2727	2475	2475
query73	858	770	441	441
query74	5089	4884	4726	4726
query75	2806	2664	2340	2340
query76	2254	1129	775	775
query77	443	431	361	361
query78	12944	13052	12378	12378
query79	1493	975	723	723
query80	1278	579	501	501
query81	496	283	231	231
query82	1364	161	125	125
query83	373	288	253	253
query84	269	144	118	118
query85	959	529	446	446
query86	426	316	312	312
query87	3424	3374	3192	3192
query88	3504	2657	2627	2627
query89	438	382	340	340
query90	1815	178	177	177
query91	182	180	146	146
query92	80	78	75	75
query93	963	969	551	551
query94	626	324	317	317
query95	665	461	350	350
query96	1047	765	338	338
query97	2714	2704	2585	2585
query98	237	232	243	232
query99	1118	1099	991	991
Total cold run time: 254403 ms
Total hot run time: 171883 ms

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 7.14% (1/14) 🎉
Increment coverage report
Complete coverage report

Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

@github-actions github-actions Bot added the "approved" label (Indicates a PR has been approved by one committer.) May 14, 2026
@github-actions
Contributor

PR approved by at least one committer and no changes requested.

@JNSimba JNSimba merged commit 5be6557 into apache:master May 15, 2026
39 checks passed
github-actions Bot pushed a commit that referenced this pull request May 15, 2026
…obs (#63194)

JNSimba added a commit that referenced this pull request May 19, 2026
…obs (#63194)

yiguolei pushed a commit that referenced this pull request May 20, 2026
…ing insert jobs #63194 (#63271)

Cherry-picked from #63194

Co-authored-by: wudi <wudi@selectdb.com>

Labels

approved (Indicates a PR has been approved by one committer.)
dev/4.1.2-merged

5 participants