[improve](streaming-job) Add per-job metrics for streaming insert jobs by JNSimba · Pull Request #62224 · apache/doris

JNSimba · 2026-04-08T08:49:01Z

Summary

Add per-job granularity metrics for streaming insert jobs with job_id and job_name labels
New metrics: streaming_job_per_job_scanned_rows, streaming_job_per_job_load_bytes, streaming_job_per_job_filtered_rows, streaming_job_per_job_succeed_task_count, streaming_job_per_job_failed_task_count
Existing global aggregated metrics remain unchanged
Follow-up to [Improve](StreamingJob) add more metrics to observe the streaming job #60493

Approach

This follows the generateBackendsTabletMetrics() pattern: on each /metrics request, all previously registered per-job metrics are removed and then re-registered from the current job data. Values are therefore always up to date, and metrics for stale jobs are cleaned up automatically.
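The remove-then-re-register cycle can be sketched roughly as below. This is only an illustration under stated assumptions: `PerJobMetricsSketch`, `JobStats`, `refreshPerJobMetrics`, and `render` are hypothetical names, not the Doris `MetricRepo` API, and `job_name` is assumed to be a plain identifier that needs no escaping.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration; NOT the Doris MetricRepo or job classes.
class PerJobMetricsSketch {
    /** Minimal view of one streaming job's counters (assumed shape). */
    static final class JobStats {
        final long jobId;
        final String jobName;
        final long scannedRows;
        final long loadBytes;

        JobStats(long jobId, String jobName, long scannedRows, long loadBytes) {
            this.jobId = jobId;
            this.jobName = jobName;
            this.scannedRows = scannedRows;
            this.loadBytes = loadBytes;
        }
    }

    static final String PER_JOB_PREFIX = "streaming_job_per_job_";

    // Series key (metric name plus labels) -> current value.
    private final Map<String, Long> registry = new LinkedHashMap<>();

    /** Called on every scrape: drop all per-job series, then rebuild them from the live jobs. */
    synchronized void refreshPerJobMetrics(List<JobStats> currentJobs) {
        registry.keySet().removeIf(key -> key.startsWith(PER_JOB_PREFIX));
        for (JobStats job : currentJobs) {
            String labels = "{job_id=\"" + job.jobId + "\",job_name=\"" + job.jobName + "\"}";
            registry.put(PER_JOB_PREFIX + "scanned_rows" + labels, job.scannedRows);
            registry.put(PER_JOB_PREFIX + "load_bytes" + labels, job.loadBytes);
        }
    }

    /** Renders one "name{labels} value" line per registered series. */
    synchronized List<String> render() {
        List<String> lines = new ArrayList<>();
        registry.forEach((key, value) -> lines.add(key + " " + value));
        return lines;
    }

    public static void main(String[] args) {
        PerJobMetricsSketch repo = new PerJobMetricsSketch();
        repo.refreshPerJobMetrics(Arrays.asList(
                new JobStats(1L, "job_a", 10, 100),
                new JobStats(2L, "job_b", 20, 200)));
        // job_b no longer exists at the next scrape, so its series are not re-registered.
        repo.refreshPerJobMetrics(Arrays.asList(new JobStats(1L, "job_a", 15, 150)));
        repo.render().forEach(System.out::println);
    }
}
```

Rebuilding on every scrape trades a little per-request work for not having to track job removal separately, which is the trade-off the description above relies on.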

Offset info is intentionally excluded from metric labels to avoid Prometheus series churn and serialization issues. Offset can be viewed via SHOW STREAMING JOBS or jobs("type"="insert") TVF.

Test plan

Verify per-job metrics appear in /metrics?type=json with correct job_id and job_name labels
Verify existing global streaming job metrics still present
Verify FE replay is not affected
Run test_streaming_mysql_job_metrics.groovy regression test

🤖 Generated with Claude Code

Add per-job granularity metrics (scanned_rows, load_bytes, filtered_rows, succeed/failed_task_count, offset) with job_id and job_name labels to the /metrics endpoint, enabling Grafana monitoring at individual job level.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

JNSimba · 2026-04-08T08:52:06Z

run buildall

JNSimba · 2026-04-08T09:53:04Z

/review

github-actions

Found 1 issue during review.

fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java: current_offset / end_offset are exported as raw label values even though the offset providers return JSON strings. Both metric visitors serialize labels without escaping, so a normal non-empty offset makes /metrics invalid Prometheus text and /metrics?type=json invalid JSON. Because these labels also change as the job advances, they would create a new Prometheus series identity on each offset update even if escaping were added later.
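To make the escaping problem concrete, here is a minimal sketch of what a label-value escaper for the Prometheus text format could look like (backslash, double quote, and line feed must be escaped inside label values). `LabelEscapeSketch`, `escapeLabelValue`, `sample`, and the offset payload are hypothetical and only for illustration; the actual offset format and the Doris visitor code are not shown in this thread.

```java
// Hypothetical helper for illustration; not part of the Doris metric visitors.
class LabelEscapeSketch {
    /** Escapes a label value for the Prometheus text format: backslash, double quote, line feed. */
    static String escapeLabelValue(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 8);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '\\':
                    sb.append("\\\\");
                    break;
                case '"':
                    sb.append("\\\"");
                    break;
                case '\n':
                    sb.append("\\n");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders a single sample line; the label value is inserted exactly as given. */
    static String sample(String name, String labelName, String labelValue, long value) {
        return name + "{" + labelName + "=\"" + labelValue + "\"} " + value;
    }

    public static void main(String[] args) {
        // Assumed JSON-shaped offset; the real payload may differ.
        String offset = "{\"file\":\"binlog.000003\",\"pos\":154}";
        // Raw: the embedded double quotes terminate the label value early.
        System.out.println(sample("example_metric", "current_offset", offset, 1));
        // Escaped: parses as one label value, but still changes whenever the offset advances.
        System.out.println(sample("example_metric", "current_offset", escapeLabelValue(offset), 1));
    }
}
```

Escaping would only address the serialization half of the finding; the series-identity problem remains as long as the offset is a label.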

Critical checkpoint conclusions:

Goal of the task: Add per-job streaming-job metrics for monitoring. Partially achieved; the per-job counters/gauges are reasonable, but the offset-label design breaks exporter correctness.
Small/clear/focused: Mostly focused, but the offset information should not be modeled as metric labels.
Concurrency: No new locking or deadlock issue found in this patch. MetricRepo.getMetric() is synchronized, and JobManager.queryJobs() reads from a ConcurrentHashMap, so the new traversal is weakly consistent but safe enough for metrics.
Lifecycle/static initialization: No special lifecycle or static-init issue found.
Configuration changes: None.
Compatibility/incompatible changes: No FE/BE protocol or storage compatibility issue found.
Functionally parallel code paths: The bug affects both /metrics and /metrics?type=json because both visitors serialize the same labels.
Special conditional checks: No issue beyond the unsafe assumption that arbitrary offset strings are valid metric labels.
Test coverage: A regression test was added, but there is still no direct coverage for label escaping/export serialization. I did not run the test suite in this review.
Observability: Per-job metrics are useful, but dynamic offset labels are not safe observability design because they break encoding and cause series churn.
Transaction/persistence: Not applicable for this patch.
Data writes/modifications: Not applicable for this patch.
FE/BE variable passing: Not applicable for this patch.
Performance: Walking insert jobs per scrape is acceptable; dynamic offset labels would create avoidable scrape/TSDB churn.
Other issues: None beyond the finding above.
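The series-churn point in the checklist above can be illustrated with a small simulation. In Prometheus a series is identified by its metric name plus its full label set, so a label whose value changes on every scrape produces a new series each time. The class and method names and the `pos-N` offset values below are hypothetical; this is a sketch of the effect, not of any Doris code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical simulation for illustration only.
class SeriesChurnSketch {
    /** Series identity: metric name plus the complete, order-independent label set. */
    static String seriesId(String name, Map<String, String> labels) {
        return name + new TreeMap<>(labels);
    }

    /** Counts how many distinct series one job produces over the given number of scrapes. */
    static int distinctSeries(boolean includeOffsetLabel, int scrapes) {
        Set<String> seen = new HashSet<>();
        for (int scrape = 0; scrape < scrapes; scrape++) {
            Map<String, String> labels = new HashMap<>();
            labels.put("job_id", "1");
            labels.put("job_name", "job_a");
            if (includeOffsetLabel) {
                // Assumed offset value that advances between scrapes.
                labels.put("current_offset", "pos-" + (100 + scrape));
            }
            seen.add(seriesId("streaming_job_per_job_scanned_rows", labels));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        System.out.println("stable labels only: " + distinctSeries(false, 5));
        System.out.println("with offset label:  " + distinctSeries(true, 5));
    }
}
```

With only `job_id` and `job_name` as labels the job maps to a single series across scrapes, whereas adding the advancing offset yields one series per scrape.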

hello-stephen · 2026-04-08T10:08:34Z

FE UT Coverage Report

Increment line coverage 2.67% (2/75) 🎉
Increment coverage report
Complete coverage report

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

JNSimba · 2026-04-08T10:17:35Z

run buildall

Offset is a JSON string that changes frequently, which would create series churn in Prometheus and break metric serialization. Remove the offset metric; offset info can be viewed via SHOW STREAMING JOBS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

JNSimba · 2026-04-08T11:41:21Z

run buildall

hello-stephen · 2026-04-08T11:42:33Z

FE UT Coverage Report

Increment line coverage 2.67% (2/75) 🎉
Increment coverage report
Complete coverage report

liaoxin01

LGTM

…g insert jobs #62224 (#62790) Cherry-picked from #62224 Co-authored-by: wudi <wudi@selectdb.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…obs (#63194)

### What problem does this PR solve?

Related PR: #62224 (per-job metrics), #62269 (Lag column)

Problem Summary: #62224 introduced per-job metrics (`streaming_job_per_job_scanned_rows`, `_load_bytes`, `_filtered_rows`, `_succeed_task_count`, `_failed_task_count`) for streaming insert jobs, exposed via `/metrics` with `job_id`/`job_name` labels for Prometheus. #62269 later added a `Lag` column to `SHOW JOBS` / `jobs()` TVF that reports end-to-end CDC delay in seconds, but the value was only exposed through SQL. There was no corresponding Prometheus metric, so dashboards and alerting on lag were not possible. This PR adds `streaming_job_per_job_lag` (unit: `SECONDS`) to the existing per-job metric set.

JNSimba added dev/4.1.x labels Apr 8, 2026

github-actions Bot reviewed Apr 8, 2026

Comment thread fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java Outdated

[fix](streaming-job) Escape offset JSON in per-job metrics labels

8d980b0

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

liaoxin01 approved these changes Apr 24, 2026

liaoxin01 merged commit 608cfd3 into apache:master Apr 24, 2026
27 of 28 checks passed

github-actions Bot mentioned this pull request Apr 24, 2026

branch-4.1: [improve](streaming-job) Add per-job metrics for streaming insert jobs #62224 #62790

Merged

yiguolei added dev/4.1.1-merged and removed dev/4.1.x labels Apr 25, 2026

JNSimba mentioned this pull request May 13, 2026

[improve](streaming-job) add per-job lag metric to streaming insert jobs #63194

Merged

12 tasks

yiguolei mentioned this pull request May 20, 2026

4.1.1 Release Notes #63426

Open

liaoxin01 merged 3 commits into apache:master from JNSimba:add_per_job_metrics

4 participants
