Skip to content

Additional dimensions for service/heartbeat#14743

Merged
suneet-s merged 4 commits intoapache:masterfrom
suneet-s:heartbeat
Aug 4, 2023
Merged

Additional dimensions for service/heartbeat#14743
suneet-s merged 4 commits intoapache:masterfrom
suneet-s:heartbeat

Conversation

@suneet-s
Copy link
Copy Markdown
Contributor

@suneet-s suneet-s commented Aug 3, 2023

Description

This patch builds on top of #14443 to report some additional dimensions along with the heartbeat metric for the peons and middle managers.

The peons now report the task id, group id, datasource and task type. This will be useful as operators can use these dimensions to see how much of their cluster capacity is being used by different types, datasources, group, etc.

The middle managers will now report the worker version, category and whether it is enabled or not.

Release note

Improved: The service/heartbeat metric now has additional dimensions to provide more insight into ingestion on the cluster.

The metric on the middle manager looks like

{"feed":"metrics","metric":"service/heartbeat","workerCapacity":2,"workerEnabled":1,"service":"druid/middleManager","workerVersion":"0","host":"localhost:8091","version":"28.0.0-SNAPSHOT","value":1,"timestamp":"2023-08-03T01:57:34.335Z"}

The metric on the peon looks like

{"feed":"metrics","taskType":"index_parallel","metric":"service/heartbeat","service":"druid/middleManager","groupId":"index_parallel_wikipedia_hibldimn_2023-08-03T02:00:01.754Z","host":"localhost:8100","version":"28.0.0-SNAPSHOT","value":1,"dataSource":"wikipedia","taskId":"index_parallel_wikipedia_hibldimn_2023-08-03T02:00:01.754Z","timestamp":"2023-08-03T02:00:07.146Z"}

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

|------|-----------|----------|------------|
| `service/heartbeat` | Metric indicating the service is up. `ServiceStatusMonitor` must be enabled. |`leader` on the Overlord and Coordinator.|1|
|Metric|Description| Dimensions |Normal value|
|------|-----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I am not sure if this is the format we follow in the docs. The original one made more sense to me.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is what intelliJ did when I edited the other lines in the table. I can undo these changes, but are there some settings I can set in intelliJ to auto lint the markdown to match the expected styleguide?

/**
* The named binding for tags that should be reported with the `service/heartbeat` metric.
*/
public static final String TAGS_BINDING = "heartbeat";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public static final String TAGS_BINDING = "heartbeat";
public static final String HEARTBEAT_TAGS = "heartbeat";

Copy link
Copy Markdown
Contributor

@kfaraz kfaraz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor fix required after latest commit, otherwise LGTM.

@suneet-s
Copy link
Copy Markdown
Contributor Author

suneet-s commented Aug 3, 2023

Thanks for the review @kfaraz !

"namespace/cache/heapSizeInBytes" : { "dimensions" : [], "type" : "gauge" },

"service/heartbeat" : { "dimensions" : ["leader"], "type" : "gauge" }
"service/heartbeat" : { "dimensions" : ["leader"], "type" : "count" }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be gauge IMO, as it's not a cumulative value. E.g. it doesn't make sense to say the total number of heartbeat yesterday.

Copy link
Copy Markdown
Contributor Author

@suneet-s suneet-s Aug 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the datadog docs - it looks like a count metric reports the count of the metrics retrieved in that interval and is not cumulative across intervals.

Suppose you are submitting a COUNT metric, activeusers.basket_size, from a single host running the Datadog Agent. This host emits the following values in a flush time interval: [1,1,1,2,2,2,3,3].

The Agent adds all of the values received in one time interval. Then, it submits the total number, in this case 15, as the COUNT metric’s value.

From an analysis point, it is very useful to know how many heartbeats were sent across a time period as it gives us an idea of uptime if we know how many heartbeats are sent per minute.

What do you think?

public Supplier<Map<String, Object>> heartbeatDimensions(WorkerConfig workerConfig, WorkerTaskManager workerTaskManager)
{
return () -> ImmutableMap.of(
"workerVersion", workerConfig.getVersion(),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we try to make the dimensions as few as possible? Means to have a general dimension name so it can be reused in other services, e.g. workerVersion -> version, workerCategory -> category

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this instance I think more specific dimension names are helpful as it clears up some confusion. eg. version can be misinterpreted - Should it be the version of Druid that it is running or version of the worker.

If we find a dimension that should be reported on multiple services, this would be good to consider

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks to this comment, I did look for standard dimension names that are used by other metrics and used static references to those dimensions. LMK if this looks better

Copy link
Copy Markdown
Contributor

@YongGang YongGang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@suneet-s suneet-s merged commit 62ddeaf into apache:master Aug 4, 2023
@suneet-s suneet-s deleted the heartbeat branch August 4, 2023 18:01
@LakshSingla LakshSingla added this to the 28.0 milestone Oct 12, 2023
FrankChen021 pushed a commit that referenced this pull request Feb 3, 2025
* Additional dimensions for service/heartbeat

* docs

* review

* review
GabrielCWT pushed a commit to GabrielCWT/druid that referenced this pull request Sep 9, 2025
* Additional dimensions for service/heartbeat

* docs

* review

* review
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants