Skip to content

[spark] Expose Paimon metrics via JMX for Prometheus scraping#7647

Open
junmuz wants to merge 2 commits into
apache:masterfrom
junmuz:spark_paimon_metrics
Open

[spark] Expose Paimon metrics via JMX for Prometheus scraping#7647
junmuz wants to merge 2 commits into
apache:masterfrom
junmuz:spark_paimon_metrics

Conversation

@junmuz
Copy link
Copy Markdown
Contributor

@junmuz junmuz commented Apr 14, 2026

Purpose

Bridge Paimon commit metrics into Spark's MetricsSystem using a Codahale MetricRegistry so they are accessible as JMX MBeans. This enables jmx_prometheus_javaagent to scrape Paimon metrics (commit duration, files added/deleted, records appended, partitions written, etc.).

  • Add SparkMetricGroup extending MetricGroupImpl to dual-register metrics in both Paimon's internal registry and a shared Codahale registry
  • Add PaimonMetricsSource singleton with its own JmxReporter to register MBeans immediately (bypasses Spark MetricsSystem snapshot limitation)
  • Wire SparkMetricRegistry to use SparkMetricGroup
  • Add withMetricRegistry() to V1 commit path in PaimonSparkWriter

Tests

  • Unit tests for SparkMetricGroup: gauge/counter/histogram registration, lazy value updates, close cleanup, name sanitization, collision handling, backward compatibility with getMetrics()
    • Integration tests verifying commit metrics appear in the Codahale registry after writes and update correctly on subsequent commits

@junmuz junmuz force-pushed the spark_paimon_metrics branch from f583df8 to 3261073 Compare April 15, 2026 16:09
@junmuz
Copy link
Copy Markdown
Contributor Author

junmuz commented Apr 16, 2026

@JingsongLi @Zouxxyy Can I get a review on the PR? This change allows scraping metrics from Prometheus.

@junmuz
Copy link
Copy Markdown
Contributor Author

junmuz commented Apr 28, 2026

@YannByron @Zouxxyy I see you have been pretty active with the Spark - Paimon connector. Can I get a review on the PR for exposing Paimon metrics?

@junmuz junmuz force-pushed the spark_paimon_metrics branch from 9cc14b2 to 62b13d2 Compare April 28, 2026 10:01
@junmuz junmuz force-pushed the spark_paimon_metrics branch from 62b13d2 to 9859bd6 Compare April 28, 2026 10:02
Copy link
Copy Markdown
Contributor

@JingsongLi JingsongLi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding metrics observability to the Spark integration!

Architecture concerns:

  1. PaimonMetricsSource singleton with its own JmxReporter: Bypassing Spark's MetricsSystem snapshot limitation is pragmatic, but running a separate JmxReporter means Paimon metrics appear under a different MBean domain than Spark's own metrics. Document this clearly — operators using jmx_prometheus_javaagent need to know which domain to scrape.

  2. Thread safety of SparkMetricGroup: The registerCodahaleGauge method does remove + register. While MetricRegistry is internally thread-safe, the two-step operation has a race window where another thread could register between remove and register. In practice this is unlikely since commit is single-threaded per executor, but worth noting.

  3. Gauge semantics: Registering counters as Codahale Gauges (via paimonCounter::getCount) works but loses the "counter" semantic. Monitoring tools that expect monotonically increasing counters won't get proper rate() calculations from a Gauge. Consider using codahaleRegistry.counter() for Paimon counters instead.

  4. Memory lifecycle: The SparkMetricGroup.close() removes registered names from the registry. But if close() is never called (e.g., task failure), the stale gauges remain in the registry pointing to deallocated state. The registerCodahaleGauge remove-before-register pattern helps with subsequent commits, but for multi-table scenarios, old gauges could accumulate.

  5. Metric naming: paimon.{table}.{groupName}.{metricName} — the table name may contain dots or special chars. The sanitize() method replaces non-alphanumeric chars with _, which is good, but consider also handling empty table names gracefully.

Overall the approach is sound for V1. Please address the counter vs gauge semantic issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants