Skip to content

Gracefully handle negative counter values for prometheus-emitter & update metric types for sys/disk/* metrics#18719

Merged
abhishekrb19 merged 6 commits intoapache:masterfrom
aho135:prometheus-counter-negative-value
Nov 17, 2025
Merged

Gracefully handle negative counter values for prometheus-emitter & update metric types for sys/disk/* metrics#18719
abhishekrb19 merged 6 commits intoapache:masterfrom
aho135:prometheus-counter-negative-value

Conversation

@aho135
Copy link
Contributor

@aho135 aho135 commented Nov 5, 2025

The OshiSysMonitor DiskStats uses KeyedDiff which can occasionally lead to negative values for metrics such as sys/disk/write/size when the underlying long in HWDiskStore overflows. This leads to the following uncaught exception:

ERROR [MonitorScheduler-0] org.apache.druid.java.util.common.concurrent.ScheduledExecutors - Uncaught exception.
java.lang.IllegalArgumentException: Amount to increment must be non-negative.
	at io.prometheus.client.Counter$Child.incWithExemplar(Counter.java:237) ~[simpleclient-0.12.0.jar:?]
	at io.prometheus.client.Counter$Child.inc(Counter.java:215) ~[simpleclient-0.12.0.jar:?]
	at org.apache.druid.emitter.prometheus.PrometheusEmitter.emitMetric(PrometheusEmitter.java:173) ~[?:?]
	at org.apache.druid.emitter.prometheus.PrometheusEmitter.emit(PrometheusEmitter.java:132) ~[?:?]
	at org.apache.druid.java.util.emitter.core.ComposingEmitter.emit(ComposingEmitter.java:57) ~[druid-processing-32.0.1.jar:32.0.1]
	at org.apache.druid.java.util.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:67) ~[druid-processing-32.0.1.jar:32.0.1]
	at org.apache.druid.java.util.emitter.service.ServiceEmitter.emit(ServiceEmitter.java:72) ~[druid-processing-32.0.1.jar:32.0.1]
	at org.apache.druid.java.util.metrics.OshiSysMonitor$DiskStats.emit(OshiSysMonitor.java:282) ~[druid-processing-32.0.1.jar:32.0.1]
	at org.apache.druid.java.util.metrics.OshiSysMonitor.monitorDiskStats(OshiSysMonitor.java:146) ~[druid-processing-32.0.1.jar:32.0.1]
	at org.apache.druid.java.util.metrics.OshiSysMonitor.lambda$doMonitor$0(OshiSysMonitor.java:122) ~[druid-processing-32.0.1.jar:32.0.1]
	at com.google.common.collect.RegularImmutableMap.forEach(RegularImmutableMap.java:297) ~[guava-32.0.1-jre.jar:?]
	at org.apache.druid.java.util.metrics.OshiSysMonitor.doMonitor(OshiSysMonitor.java:120) ~[druid-processing-32.0.1.jar:32.0.1]
	at org.apache.druid.java.util.metrics.AbstractMonitor.monitor(AbstractMonitor.java:54) ~[druid-processing-32.0.1.jar:32.0.1]
	at org.apache.druid.java.util.metrics.BasicMonitorScheduler.lambda$startMonitor$0(BasicMonitorScheduler.java:55) ~[druid-processing-32.0.1.jar:32.0.1]
	at org.apache.druid.java.util.common.concurrent.ScheduledExecutors$4.run(ScheduledExecutors.java:153) [druid-processing-32.0.1.jar:32.0.1]
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539) [?:?]
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) [?:?]
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.base/java.lang.Thread.run(Thread.java:840) [?:?]

This kind of metric should arguably be configured as gauge. But in the case of misconfiguration this logs out a warning instead of throwing an uncaught exception

Description

More graceful handling when a counter is unintentionally updated with a negative value

Release note

More graceful handling when a prometheus counter is unintentionally updated with a negative value.
Update the OshiSysMonitor DiskStats sys/disk/* delta metrics from counter to gauge type for prometheus-emitter, dropwizard-emitter and statsd-emitter.


Key changed/added classes in this PR
  • PrometheusEmitter
  • PrometheusEmitterTest

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@abhishekrb19
Copy link
Contributor

Thanks for the investigation and fix, @aho135!

The OshiSysMonitor DiskStats uses KeyedDiff which can occasionally lead to negative values for metrics such as sys/disk/write/size when the underlying long in HWDiskStore overflows. This kind of metric should arguably be configured as gauge.

I agree that these metrics should be configured as gauges instead. Should we also fix the metric type in the defaultMetrics.json config? (I think it'd apply to a few other emitters too)

For any other metric misconfigurations, the exception handling should give us more insights into that I suppose.

@aho135
Copy link
Contributor Author

aho135 commented Nov 5, 2025

Thanks for the investigation and fix, @aho135!

The OshiSysMonitor DiskStats uses KeyedDiff which can occasionally lead to negative values for metrics such as sys/disk/write/size when the underlying long in HWDiskStore overflows. This kind of metric should arguably be configured as gauge.

I agree that these metrics should be configured as gauges instead. Should we also fix the metric type in the defaultMetrics.json config? (I think it'd apply to a few other emitters too)

For any other metric misconfigurations, the exception handling should give us more insights into that I suppose.

Thanks for the review @abhishekrb19 Yeah good point, let me update the metric type and take a look at other emitters

aho135 and others added 3 commits November 13, 2025 23:12
…/druid/emitter/prometheus/PrometheusEmitter.java

Co-authored-by: Abhishek Radhakrishnan <abhishek.rb19@gmail.com>
@abhishekrb19 abhishekrb19 changed the title prometheus-emitter: Gracefully handle negative counter values Gracefully handle negative counter values for prometheus-emitter & update metric types for sys/disk/* metrics Nov 17, 2025
@abhishekrb19 abhishekrb19 merged commit e0baddd into apache:master Nov 17, 2025
44 checks passed
@kgyrtkirk kgyrtkirk added this to the 36.0.0 milestone Jan 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants