
supervisor: Emit active/publishing task counts#17268

Closed
ac9817 wants to merge 4 commits intoapache:masterfrom
ac9817:emit-task-counts

@ac9817 ac9817 commented Oct 7, 2024

Description

Adding this metric would help show how much time a supervisor spends publishing tasks. It is important to keep this time low because autoscaling is skipped during this period, which could cause increased lag.

Release note

Adds new metrics: task/supervisor/active/count and task/supervisor/publishing/count.


Key changed/added classes in this PR
  • SeekableStreamSupervisor.java

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

Adithya Chakilam added 2 commits October 7, 2024 14:51

@suneet-s suneet-s left a comment

Instead of reporting two new metrics, could you add SeekableStreamIndexTaskRunner#status as a dimension to the service/heartbeat metric?

This change would give visibility into all the different states a streaming task can be in. The metric would also show which specific task is in which state, rather than only the number of tasks in the publishing state.
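To illustrate why a per-task status dimension subsumes the proposed counts, here is a minimal sketch in which every heartbeat carries a task ID and a status, and the count of publishing tasks falls out of a simple group-by. The `TaskStatus` enum, the `Heartbeat` record, and `countByStatus` are all hypothetical names for illustration; they are not Druid APIs, and the enum values are only assumed to resemble a streaming task's phases.

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class HeartbeatStatusCounts {
  // Hypothetical task phases, assumed for illustration only (not Druid's enum).
  enum TaskStatus { NOT_STARTED, STARTING, READING, PAUSED, PUBLISHING }

  // One hypothetical heartbeat event: which task reported it and its current phase.
  record Heartbeat(String taskId, TaskStatus status) {}

  // Groups heartbeats by status. The PUBLISHING entry carries the same information
  // a dedicated "publishing count" metric would, while each individual Heartbeat
  // still identifies exactly which task is in which state.
  static Map<TaskStatus, Integer> countByStatus(List<Heartbeat> heartbeats) {
    Map<TaskStatus, Integer> counts = new EnumMap<>(TaskStatus.class);
    for (Heartbeat hb : heartbeats) {
      counts.merge(hb.status(), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(String[] args) {
    List<Heartbeat> heartbeats = List.of(
        new Heartbeat("index_kafka_a", TaskStatus.READING),
        new Heartbeat("index_kafka_b", TaskStatus.PUBLISHING),
        new Heartbeat("index_kafka_c", TaskStatus.READING));
    System.out.println(countByStatus(heartbeats)); // {READING=2, PUBLISHING=1}
  }
}
```

In a real deployment the grouping would happen in the metrics backend rather than in Java; the point is only that counts are recoverable from per-task dimensions, but not the other way around.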

@abhishekagarwal87

There have to be some docs changes. How are you going to infer the time spent publishing tasks (by the way, what exactly does "a supervisor publishing a task" mean)? And how would you keep that time low, assuming you find that it is high?

@ac9817 ac9817 marked this pull request as draft October 8, 2024 04:01

kfaraz commented Oct 8, 2024

@adithyachakilam, leaving some suggestions here even though the PR is currently in draft.

> how much of time a supervisor is spending to publish tasks

Could you please elaborate? What time are you referring to exactly?
The supervisor is just a thread which wakes up and launches or kills tasks and updates some metadata.

If you want to capture the time a task spends in publishing segments,
then the correct metric for that would be something like ingest/publish/time (in the same vein as ingest/handoff/time and ingest/merge/time).

If you want to capture the number of tasks currently in publishing phase etc, then as @suneet-s has suggested, emitting the current phase/state of a streaming task in its heartbeat makes sense.
But it would need some changes from the current approach:

  • The status is not an intrinsic property of a task and must not be a part of the Task interface. You can inject the runner to build up the heartbeat map in the CliPeon.heartbeatDimensions() method.
  • For non-streaming tasks, instead of always emitting UNKNOWN, do not emit any value for this dimension.
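The two suggestions above can be sketched as follows: the status is supplied from outside the task by an injected, runner-like object, and the dimension is simply left out when no streaming runner is present, rather than emitted as UNKNOWN. `StatusReporter`, `heartbeatDimensions`, and the dimension keys `dataSource`, `taskId`, and `taskStatus` are hypothetical names chosen for illustration; this is not the actual `CliPeon.heartbeatDimensions()` code or any Druid interface.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class HeartbeatDimensionsSketch {
  // Hypothetical stand-in for an injected streaming task runner that can
  // report its current phase. Not a Druid interface.
  interface StatusReporter {
    String currentStatus();
  }

  // Builds heartbeat dimensions. The status comes from the injected runner,
  // not from a method on the task itself, and the key is omitted entirely
  // when there is no streaming runner (instead of emitting "UNKNOWN").
  static Map<String, Object> heartbeatDimensions(
      String dataSource, String taskId, Optional<StatusReporter> runner) {
    Map<String, Object> dims = new HashMap<>();
    dims.put("dataSource", dataSource);
    dims.put("taskId", taskId);
    runner.ifPresent(r -> dims.put("taskStatus", r.currentStatus()));
    return dims;
  }

  public static void main(String[] args) {
    StatusReporter streaming = () -> "PUBLISHING";
    // Streaming task: the status dimension is present.
    System.out.println(
        heartbeatDimensions("wiki", "index_kafka_x", Optional.of(streaming)).get("taskStatus"));
    // Non-streaming task: the status dimension is absent.
    System.out.println(
        heartbeatDimensions("wiki", "index_parallel_y", Optional.empty()).containsKey("taskStatus"));
  }
}
```

Omitting the key keeps batch-task heartbeats free of a meaningless value, so dashboards grouping by status only see tasks for which a status actually exists.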

github-actions bot commented Dec 8, 2024

This pull request has been marked as stale due to 60 days of inactivity.
It will be closed in 4 weeks if no further activity occurs. If you think
that's incorrect or this pull request should instead be reviewed, please simply
write any comment. Even if closed, you can still revive the PR at any time or
discuss it on the dev@druid.apache.org list.
Thank you for your contributions.

@github-actions github-actions bot added the stale label Dec 8, 2024

github-actions bot commented Jan 6, 2025

This pull request/issue has been closed due to lack of activity. If you think that
is incorrect, or the pull request requires review, you can revive the PR at any time.

@github-actions github-actions bot closed this Jan 6, 2025