-
Notifications
You must be signed in to change notification settings - Fork 3.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
prometheus metric exporter #10412
prometheus metric exporter #10412
Conversation
@@ -0,0 +1,128 @@ | |||
{ | |||
"query/time" : { "dimensions" : ["dataSource", "type"], "type" : "timer", "conversionFactor": 1000.0, "help": "Seconds taken to complete a query."}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these conversion a good idea?
It would mean that these metrics will be slightly different from how they are described in this documentation. https://druid.apache.org/docs/latest/operations/metrics.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://prometheus.io/docs/practices/naming/#base-units
It's coming from prometheus common practice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. I guess there are tradeoffs with either choices. Maybe a good way to do is that for those converted, we can just put the unit in the prometheus names. Otherwise, if we refer to the druid metrics doc, we would find the name of the metrics to be documented in a different unit.
"query/failed/count" : { "dimensions" : [], "type" : "count", "help": "Number of failed queries"}, | ||
"query/interrupted/count" : { "dimensions" : [], "type" : "count", "help": "Number of queries interrupted due to cancellation or timeout"}, | ||
|
||
"query/cache/delta/numEntries" : { "dimensions" : [], "type" : "count", "help": "Number of entries in cache"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential bug, deltas can be negative but Prometheus counter accepts only non-negative increment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will change to guage
. This only happens in since last emission we got more entries evicted than added right.
...theus-emitter/src/main/java/org/apache/druid/emitter/prometheus/PrometheusEmitterConfig.java
Outdated
Show resolved
Hide resolved
private final Metrics metrics; | ||
private final PrometheusEmitterConfig config; | ||
private final PrometheusEmitterConfig.Strategy strategy; | ||
private final Pattern pattern = Pattern.compile("[^a-zA-Z0-9_][^a-zA-Z0-9_]*"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reuse the pattern in PrometheusEmitterConfig
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two is not the same regex. The one in PromtheusEmitterConfig is for namespace regex that need to start with alphabetic character.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry, my bad.
private static final Logger log = new Logger(Metrics.class); | ||
private final Map<String, DimensionsAndCollector> map = new HashMap<>(); | ||
private final ObjectMapper mapper = new ObjectMapper(); | ||
private final Pattern pattern = Pattern.compile("[^a-zA-Z_:][^a-zA-Z0-9_:]*"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reuse the pattern in PrometheusEmitterConfig
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
used the one in PrometheusEmitter.java
} | ||
} | ||
|
||
public Map<String, DimensionsAndCollector> getMap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we can rename the map
to registeredMetrics
and then we could rename this method to getRegisteredMetrics()
, I feel like this will be easier to read
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
}).readValue(is); | ||
} | ||
catch (IOException e) { | ||
throw new ISE(e, "Failed to parse metric dimensions and types"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above
} | ||
} | ||
|
||
void emitMetric(ServiceMetricEvent metricEvent) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private?
Map<String, DimensionsAndCollector> map = metrics.getMap(); | ||
try { | ||
for (DimensionsAndCollector collector : map.values()) { | ||
pushGateway.push(collector.getCollector(), config.getNamespace(), ImmutableMap.of(config.getNamespace(), identifier)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potential NPE? if the configured strategy is not pushgateway, then this pushGateway
wouldn't have been instantiated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also should we use a more meaningful label name for identifier
instead of using the config.getNamespace()
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will add the null check - however flush() for this emitter should only called by close() which strategy check already done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the identifier label name, any suggestion? The config.namespace
will be set in config files for each service. So for example peon task it could be peon=taskXXX as groupingKey.
|
||
|
||
@Override | ||
public void start() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should schedule a task to push updates periodically when the strategy is set to pushgateway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, every 5min sounds reasonable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this - I think the scheduled executor may not be necessary. Main reason we've added strategy pushgateway
is for things that are potentially too short-lived to be scraped by prometheus (in druid that's really just peon tasks). Things that are living long enough to be pushing every 5 minutes are likely not "task" based, and may be better fit for normal scraping. I lean toward keeping things simple, and pushing once at close seems sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the only metric pushed by peon is "last pushed timestamp", I think it's valid to remove the scheduled task. Removed.
|
||
public DimensionsAndCollector getByName(String name, String service) | ||
{ | ||
if (map.containsKey(name)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return Optional.ofNullable(map.get(name)).orElse(map.get(service + "_" + name));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the second part to a getOrDefault()
for simplification. I don't see the need of changing this function return type from DimensionsAndCollector
to Optional<DimensionsAndCollector>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to change the return type, but anyway this is a minor comment, feel free to pick whichever you prefer
Is there a timeline for this feature? |
For what is worth, Ive been running production clusters with this extension for monitoring for over a year. Things are stable but there are a couple open issues around pushgateway collection of Peon task metrics that are the main reasons to delay merging. |
Verification of metrics on a local mini druid cluster as shown in below table. Coordinator, historical, broker, router, middle-manager, overlord set up as prometheus monitoring target.
|
@Tiaaa awesome! I see peon metrics in pushgateway as well. One thing - the label we're using right now is the task ID. I think this it going to be too high cardinality for Prometheus |
hmm, it looks like the commits have become sort of messed up for this PR one way or another and github is showing a lot of unrelated commits. @Tiaaa any chance you can try clean this up to only show the changes of this PR to make it easier to review? |
… updated statsd json)
…are not using seconds. Correct metric dimension name in documentation
…st multiple calls to PrometheusEmitter.start()
…alues. Additional tests
4b0b414
to
3a7a2b6
Compare
Any update on this ? Looking forward having it. |
@clintropolis it looks like this is the last failing build step: https://travis-ci.com/github/apache/druid/jobs/483326128 - seems unrelated to the new emitter. Are we good to merge? |
Sorry for the delay, I will have a look as soon as I'm able and see if we can get this merged 👍 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor comments, but overall lgtm 👍 I don't know prometheus very well, but I think the mappings of metrics looks reasonable.
Thanks for your patience and persistence!
website/.spelling
Outdated
@@ -1223,7 +1228,7 @@ SysMonitor | |||
TaskCountStatsMonitor | |||
TaskSlotCountStatsMonitor | |||
bufferCapacity | |||
bufferpoolName | |||
bufferPoolName |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, missed one, it's causing CI to fail
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@clintropolis anything left we need to do before merge? |
Will this be available în the next 0.21.0 release? Thank you |
Oops no, sorry got distracted and hadn't got back to this yet.
Unfortunately we have already cut the branch for 0.21.0, after which we only merge bug fixes, so this will go out in the release after that. 0.21.0 has been a bit delayed, so it shouldn't be too much longer before we begin the next release as well. |
Fixes #8621
Adds a new extension prometheus-emitter to expose Druid metrics for collection directly by a Prometheus server.
This PR has:
Key changed/added classes in this PR
This is re-open the PR here #8621