[improve](streaming-job) Add per-job metrics for streaming insert jobs #62224

Merged: liaoxin01 merged 3 commits into apache:master from JNSimba:add_per_job_metrics on Apr 24, 2026

@JNSimba
Member

@JNSimba JNSimba commented Apr 8, 2026

Summary

  • Add per-job granularity metrics for streaming insert jobs with job_id and job_name labels
  • New metrics: streaming_job_per_job_scanned_rows, streaming_job_per_job_load_bytes, streaming_job_per_job_filtered_rows, streaming_job_per_job_succeed_task_count, streaming_job_per_job_failed_task_count
  • Existing global aggregated metrics remain unchanged
  • Follow-up to [Improve](StreamingJob) add more metrics to observe the streaming job #60493

Approach

Follows the generateBackendsTabletMetrics() pattern: on each /metrics request, remove all previous per-job metrics, then re-register them with current job data. This keeps values up to date and cleans up metrics for stale jobs automatically.
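
The remove-then-re-register approach described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `PerJobMetricsSketch`, `JobStats`, `Registry`, and `scrape` are hypothetical names standing in for the FE classes, and only two of the five metrics are shown.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class PerJobMetricsSketch {
    static final String PREFIX = "streaming_job_per_job_";

    /** Hypothetical snapshot of one streaming insert job's counters. */
    static final class JobStats {
        final long jobId;
        final String jobName;
        final long scannedRows;
        final long loadBytes;

        JobStats(long jobId, String jobName, long scannedRows, long loadBytes) {
            this.jobId = jobId;
            this.jobName = jobName;
            this.scannedRows = scannedRows;
            this.loadBytes = loadBytes;
        }
    }

    /** Hypothetical registry keyed by metric name plus labels; not the real MetricRepo. */
    static final class Registry {
        private final Map<String, Long> gauges = new LinkedHashMap<>();

        synchronized void removeByPrefix(String prefix) {
            gauges.keySet().removeIf(key -> key.startsWith(prefix));
        }

        synchronized void register(String name, JobStats job, long value) {
            // Label escaping is omitted; job names are assumed to be plain identifiers.
            String key = name + "{job_id=\"" + job.jobId + "\",job_name=\"" + job.jobName + "\"}";
            gauges.put(key, value);
        }

        synchronized List<String> render() {
            List<String> lines = new ArrayList<>();
            gauges.forEach((key, value) -> lines.add(key + " " + value));
            return lines;
        }
    }

    /** Called once per scrape: drop every per-job series, then rebuild from live jobs. */
    static List<String> scrape(Registry registry, List<JobStats> currentJobs) {
        registry.removeByPrefix(PREFIX);
        for (JobStats job : currentJobs) {
            registry.register(PREFIX + "scanned_rows", job, job.scannedRows);
            registry.register(PREFIX + "load_bytes", job, job.loadBytes);
        }
        return registry.render();
    }
}
```

Because the per-job series are rebuilt from the live job list on every scrape, a job that no longer exists simply stops appearing in the output, with no separate cleanup hook.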

Offset info is intentionally excluded from metric labels to avoid Prometheus series churn and serialization issues. Offset can be viewed via SHOW STREAMING JOBS or jobs("type"="insert") TVF.
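
To illustrate the series-churn concern, the sketch below counts how many distinct series identities one job produces with and without an offset label. The class `SeriesChurnSketch`, its helper names, and the `pos-N` offset values are made up for illustration; the `current_offset` label name follows the review discussion in this thread.

```java
import java.util.HashSet;
import java.util.Set;

public class SeriesChurnSketch {
    /** Builds a series identity from the metric name and its label set. */
    static String seriesKey(String metric, long jobId, String offsetLabel) {
        String labels = "job_id=\"" + jobId + "\"";
        if (offsetLabel != null) {
            labels += ",current_offset=\"" + offsetLabel + "\"";
        }
        return metric + "{" + labels + "}";
    }

    /** Counts distinct series seen while one job advances through `updates` offsets. */
    static int distinctSeries(long jobId, int updates, boolean includeOffset) {
        Set<String> keys = new HashSet<>();
        for (int i = 0; i < updates; i++) {
            // Hypothetical offset value; real offsets are JSON strings from the provider.
            String offset = includeOffset ? "pos-" + i : null;
            keys.add(seriesKey("streaming_job_per_job_scanned_rows", jobId, offset));
        }
        return keys.size();
    }
}
```

With the offset in the label set, every offset update yields a new series; with only stable labels such as `job_id`, the job keeps a single series for its whole lifetime.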

Test plan

  • Verify per-job metrics appear in /metrics?type=json with correct job_id and job_name labels
  • Verify existing global streaming job metrics are still present
  • Verify FE replay is not affected
  • Run test_streaming_mysql_job_metrics.groovy regression test

🤖 Generated with Claude Code

Add per-job granularity metrics (scanned_rows, load_bytes, filtered_rows,
succeed/failed_task_count, offset) with job_id and job_name labels to the
/metrics endpoint, enabling Grafana monitoring at individual job level.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@JNSimba
Member Author

JNSimba commented Apr 8, 2026

run buildall

@JNSimba
Member Author

JNSimba commented Apr 8, 2026

/review

Contributor

@github-actions github-actions Bot left a comment

Found 1 issue during review.

  1. fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java: current_offset / end_offset are exported as raw label values even though the offset providers return JSON strings. Both metric visitors serialize labels without escaping, so a normal non-empty offset makes /metrics invalid Prometheus text and /metrics?type=json invalid JSON. Because these labels also change as the job advances, they would create a new Prometheus series identity on each offset update even if escaping were added later.
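
For context on the serialization half of this finding: the Prometheus text exposition format requires backslash, double quote, and line feed inside a label value to be escaped. The sketch below shows such escaping in isolation; `LabelEscapeSketch` and its methods are hypothetical and are not part of the Doris metric visitors. As noted above, escaping alone would not address the series churn.

```java
public class LabelEscapeSketch {
    /** Escapes a label value per the Prometheus text format: \ -> \\, " -> \", LF -> \n. */
    static String escapeLabelValue(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '\\':
                    sb.append("\\\\");
                    break;
                case '"':
                    sb.append("\\\"");
                    break;
                case '\n':
                    sb.append("\\n");
                    break;
                default:
                    sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Renders one sample line with a single, escaped label. */
    static String sample(String metric, String labelName, String labelValue, long value) {
        return metric + "{" + labelName + "=\"" + escapeLabelValue(labelValue) + "\"} " + value;
    }
}
```

A JSON-valued label such as `{"file":"x"}` contains unescaped double quotes, which is why emitting it verbatim breaks the surrounding `label="value"` syntax.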

Critical checkpoint conclusions:

  • Goal of the task: Add per-job streaming-job metrics for monitoring. Partially achieved; the per-job counters/gauges are reasonable, but the offset-label design breaks exporter correctness.
  • Small/clear/focused: Mostly focused, but the offset information should not be modeled as metric labels.
  • Concurrency: No new locking or deadlock issue found in this patch. MetricRepo.getMetric() is synchronized, and JobManager.queryJobs() reads from a ConcurrentHashMap, so the new traversal is weakly consistent but safe enough for metrics.
  • Lifecycle/static initialization: No special lifecycle or static-init issue found.
  • Configuration changes: None.
  • Compatibility/incompatible changes: No FE/BE protocol or storage compatibility issue found.
  • Functionally parallel code paths: The bug affects both /metrics and /metrics?type=json because both visitors serialize the same labels.
  • Special conditional checks: No issue beyond the unsafe assumption that arbitrary offset strings are valid metric labels.
  • Test coverage: A regression test was added, but there is still no direct coverage for label escaping/export serialization. I did not run the test suite in this review.
  • Observability: Per-job metrics are useful, but dynamic offset labels are not safe observability design because they break encoding and cause series churn.
  • Transaction/persistence: Not applicable for this patch.
  • Data writes/modifications: Not applicable for this patch.
  • FE/BE variable passing: Not applicable for this patch.
  • Performance: Walking insert jobs per scrape is acceptable; dynamic offset labels would create avoidable scrape/TSDB churn.
  • Other issues: None beyond the finding above.

Comment thread fe/fe-core/src/main/java/org/apache/doris/metric/MetricRepo.java Outdated
@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 2.67% (2/75) 🎉
Increment coverage report
Complete coverage report

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JNSimba
Member Author

JNSimba commented Apr 8, 2026

run buildall

Offset is a JSON string that changes frequently, which would create
series churn in Prometheus and break metric serialization. Remove
the offset metric; offset info can be viewed via SHOW STREAMING JOBS.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@JNSimba
Member Author

JNSimba commented Apr 8, 2026

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 2.67% (2/75) 🎉
Increment coverage report
Complete coverage report

Contributor

@liaoxin01 liaoxin01 left a comment

LGTM

@liaoxin01 liaoxin01 merged commit 608cfd3 into apache:master Apr 24, 2026
27 of 28 checks passed
github-actions Bot pushed a commit that referenced this pull request Apr 24, 2026
#62224)
morningman pushed a commit that referenced this pull request Apr 24, 2026
#62224)
yiguolei pushed a commit that referenced this pull request Apr 25, 2026
…g insert jobs #62224 (#62790)

Cherry-picked from #62224

Co-authored-by: wudi <wudi@selectdb.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
JNSimba added a commit that referenced this pull request May 15, 2026
…obs (#63194)

### What problem does this PR solve?

Related PR: #62224 (per-job metrics), #62269 (Lag column)

Problem Summary:

#62224 introduced per-job metrics (`streaming_job_per_job_scanned_rows`,
`_load_bytes`, `_filtered_rows`, `_succeed_task_count`,
`_failed_task_count`) for streaming insert jobs, exposed via `/metrics`
with `job_id`/`job_name` labels for Prometheus.

#62269 later added a `Lag` column to `SHOW JOBS` / `jobs()` TVF that
reports end-to-end CDC delay in seconds, but the value was only exposed
through SQL — there was no corresponding Prometheus metric, so
dashboards/alerting on lag was not possible.

This PR adds `streaming_job_per_job_lag` (unit: `SECONDS`) to the
existing per-job metric set.
github-actions Bot pushed a commit that referenced this pull request May 15, 2026
…obs (#63194)
JNSimba added a commit that referenced this pull request May 19, 2026
…obs (#63194)
4 participants