Additional dimensions for service/heartbeat #14743
docs/operations/metrics.md
Outdated
| |------|-----------|----------|------------| | ||
| | `service/heartbeat` | Metric indicating the service is up. `ServiceStatusMonitor` must be enabled. |`leader` on the Overlord and Coordinator.|1| | ||
| |Metric|Description| Dimensions |Normal value| | ||
| |------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------| |
Nit: I am not sure if this is the format we follow in the docs. The original one made more sense to me.
This is what IntelliJ did when I edited the other lines in the table. I can undo these changes, but are there settings I can enable in IntelliJ to auto-lint the markdown so it matches the expected style guide?
/**
 * The named binding for tags that should be reported with the `service/heartbeat` metric.
 */
public static final String TAGS_BINDING = "heartbeat";
Suggested change:
- public static final String TAGS_BINDING = "heartbeat";
+ public static final String HEARTBEAT_TAGS = "heartbeat";
services/src/main/java/org/apache/druid/cli/CliMiddleManager.java (two outdated threads, resolved)
kfaraz left a comment
Minor fix required after latest commit, otherwise LGTM.
Thanks for the review, @kfaraz!
  "namespace/cache/heapSizeInBytes" : { "dimensions" : [], "type" : "gauge" },

- "service/heartbeat" : { "dimensions" : ["leader"], "type" : "gauge" }
+ "service/heartbeat" : { "dimensions" : ["leader"], "type" : "count" }
This should be gauge IMO, as it's not a cumulative value. E.g. it doesn't make sense to talk about the total number of heartbeats yesterday.
From the Datadog docs, it looks like a COUNT metric reports the total of the values received in a flush interval and is not cumulative across intervals:
Suppose you are submitting a COUNT metric, activeusers.basket_size, from a single host running the Datadog Agent. This host emits the following values in a flush time interval: [1,1,1,2,2,2,3,3].
The Agent adds all of the values received in one time interval. Then, it submits the total number, in this case 15, as the COUNT metric’s value.
From an analysis standpoint, it is very useful to know how many heartbeats were sent over a time period: if we know how many heartbeats are sent per minute, the total gives us an idea of uptime.
What do you think?
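To make the arithmetic in the quoted example concrete, here is a minimal sketch of the two aggregation styles being discussed. It is not Datadog or Druid code: the class and method names are hypothetical, the gauge behaviour (keep the last value in the interval) is an assumption added for contrast, and the uptime estimate simply divides observed heartbeats by the number expected.

```java
import java.util.List;

// Hypothetical illustration only; none of these names exist in Druid or Datadog.
final class HeartbeatCountSketch
{
  // COUNT semantics as described in the quoted docs:
  // add up every value received within one flush interval.
  static long countForInterval(List<Integer> values)
  {
    long total = 0;
    for (int v : values) {
      total += v;
    }
    return total;
  }

  // GAUGE semantics (assumed here for contrast): keep only the last value
  // received within the interval.
  static int gaugeForInterval(List<Integer> values)
  {
    if (values.isEmpty()) {
      throw new IllegalArgumentException("no values in this interval");
    }
    return values.get(values.size() - 1);
  }

  // Rough uptime estimate: observed heartbeats divided by the number expected
  // over the period, capped at 1.0.
  static double estimatedUptime(long observedHeartbeats, long expectedPerMinute, long minutes)
  {
    long expected = expectedPerMinute * minutes;
    if (expected <= 0) {
      throw new IllegalArgumentException("expected heartbeat count must be positive");
    }
    return Math.min(1.0, (double) observedHeartbeats / expected);
  }

  public static void main(String[] args)
  {
    List<Integer> flushed = List.of(1, 1, 1, 2, 2, 2, 3, 3);
    System.out.println(countForInterval(flushed));   // prints 15, matching the quoted example
    System.out.println(gaugeForInterval(flushed));   // prints 3
    System.out.println(estimatedUptime(54, 1, 60));  // prints 0.9
  }
}
```

With one heartbeat expected per minute, 54 heartbeats in an hour would suggest roughly 90% uptime, which is the kind of question a per-interval count can answer and a gauge cannot.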
public Supplier<Map<String, Object>> heartbeatDimensions(WorkerConfig workerConfig, WorkerTaskManager workerTaskManager)
{
  return () -> ImmutableMap.of(
      "workerVersion", workerConfig.getVersion(),
Should we try to keep the number of distinct dimensions as small as possible? That is, use general dimension names that can be reused by other services, e.g. workerVersion -> version, workerCategory -> category.
In this instance I think more specific dimension names are helpful because they avoid confusion. E.g. version can be misinterpreted: is it the version of Druid that is running, or the version of the worker?
If we find a dimension that should be reported on multiple services, this would be good to consider.
Thanks to this comment, I looked for standard dimension names that other metrics already use and switched to static references to those dimensions. LMK if this looks better.
* Additional dimensions for service/heartbeat * docs * review * review
Description
This patch builds on top of #14443 to report some additional dimensions along with the heartbeat metric for the peons and middle managers.
The peons now report the task id, group id, datasource and task type. Operators can use these dimensions to see how much of their cluster capacity is being used by different task types, datasources, groups, etc.
The middle managers now report the worker version, the worker category, and whether the worker is enabled.
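As a rough sketch of the idea described above, the dimensions can be modelled as a supplier that is re-evaluated each time the heartbeat is emitted. This is not the code from the patch: the class name, method names and dimension keys below are assumptions chosen for illustration, and it uses only `java.util` types rather than Druid or Guava classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical illustration only; names and dimension keys are assumptions.
final class HeartbeatDimensionsSketch
{
  // Dimensions a peon might attach to each heartbeat. They are fixed for the
  // lifetime of the task, so the supplier returns the same values every time.
  static Supplier<Map<String, Object>> peonDimensions(
      String taskId,
      String groupId,
      String dataSource,
      String taskType
  )
  {
    return () -> {
      Map<String, Object> dims = new LinkedHashMap<>();
      dims.put("taskId", taskId);
      dims.put("groupId", groupId);
      dims.put("dataSource", dataSource);
      dims.put("taskType", taskType);
      return dims;
    };
  }

  // Dimensions a middle manager might attach. "enabled" can change at runtime,
  // so it is read through a BooleanSupplier on every call to get().
  static Supplier<Map<String, Object>> middleManagerDimensions(
      String workerVersion,
      String category,
      BooleanSupplier enabled
  )
  {
    return () -> {
      Map<String, Object> dims = new LinkedHashMap<>();
      dims.put("workerVersion", workerVersion);
      dims.put("category", category);
      dims.put("enabled", enabled.getAsBoolean());
      return dims;
    };
  }
}
```

Returning a supplier rather than a fixed map means a value such as the enabled flag reflects the worker's state at emission time instead of at startup.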
Release note
Improved: The service/heartbeat metric now has additional dimensions to provide more insight into ingestion on the cluster.
The metric on the middle manager looks like
The metric on the peon looks like
This PR has: