Expose recovery, snapshot, and restore rate limits and throttle times in node stats #91354

Open
kingherc opened this issue Nov 7, 2022 · 1 comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. Team:Distributed Meta label for distributed team (obsolete)

Comments

@kingherc
Contributor

kingherc commented Nov 7, 2022

Description

The effective rate limits for recoveries and snapshots are derived indirectly from several Elasticsearch settings, which makes it hard to determine the values actually in use at runtime, because:

  • Adjust indices.recovery.max_bytes_per_sec according to external settings #82819 introduced a way to indirectly influence the recovery rate limit indices.recovery.max_bytes_per_sec by configuring three key node bandwidth settings.
  • The snapshot restore rate limit max_restore_bytes_per_sec is already capped by the recovery rate limit.
  • Per-node rate limits for snapshots #57023 aims to tie the snapshot rate limit max_snapshot_bytes_per_sec to the recovery rate limit, and potentially cap it there (when the node bandwidth settings are configured), in addition to the existing snapshot rate limit setting.
  • Furthermore, some of the aforementioned limits can be configured per node and/or repository and/or at runtime.
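As a rough illustration of why the values in use are hard to read off the configuration, the capping described above can be sketched as follows. This is a minimal sketch and not Elasticsearch's implementation: the function names, the use of `None` for "unlimited", and the exact capping rule for snapshots (which #57023 only proposes) are assumptions made for illustration.

```python
from typing import Dict, Optional

MB = 1024 * 1024


def effective_limit(configured: Optional[int], cap: Optional[int]) -> Optional[int]:
    """Return the smaller of two bytes-per-second limits; None means unlimited."""
    if configured is None:
        return cap
    if cap is None:
        return configured
    return min(configured, cap)


def effective_repo_limits(
    recovery_limit: Optional[int],
    max_restore: Optional[int],
    max_snapshot: Optional[int],
    bandwidth_settings_configured: bool,
) -> Dict[str, Optional[int]]:
    """Hypothetical helper: combine repository settings with the recovery limit."""
    # The restore limit is always capped by the recovery rate limit.
    restore = effective_limit(max_restore, recovery_limit)
    # Assumption: the snapshot limit is capped by the recovery limit only
    # when the node bandwidth settings are configured.
    snapshot_cap = recovery_limit if bandwidth_settings_configured else None
    snapshot = effective_limit(max_snapshot, snapshot_cap)
    return {"restore_bytes_per_sec": restore, "snapshot_bytes_per_sec": snapshot}
```

For example, with a 40 MB/s recovery limit, an unlimited restore setting, and a 100 MB/s snapshot setting, the repository configuration alone would suggest 100 MB/s for snapshots, while under these assumptions the value in effect would be 40 MB/s once the bandwidth settings are configured.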

To make these rate limits easier to observe, the proposal is to expose the effective rate limit values (after applying any cap, e.g., by the recovery rate limit) in the node stats. Snapshot rate limits will be reported per repository.

Apart from the speed, the proposal is to also expose the throttling times more prominently. Specifically:

  • The repository analysis API exposes the snapshot and snapshot restore throttle times, but this is not useful for monitoring a live cluster. We could expose the throttle times in node stats per repository.
  • Recovery throttling is already exposed in node stats under node_id > indices > recovery > throttle_time. I think no change is needed for recovery throttling stats.
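For reference, the existing recovery throttle time can be pulled out of a node stats response roughly like this. The sketch assumes the JSON has already been fetched (e.g., from `GET _nodes/stats/indices/recovery`) and parsed into a dict, and that the millisecond field is named `throttle_time_in_millis`; the helper name and the sample payload are made up for illustration.

```python
from typing import Any, Dict


def recovery_throttle_millis(node_stats: Dict[str, Any]) -> Dict[str, int]:
    """Map each node id to its recovery throttle time in milliseconds.

    Nodes whose stats lack the recovery section are skipped.
    """
    result: Dict[str, int] = {}
    for node_id, node in node_stats.get("nodes", {}).items():
        recovery = node.get("indices", {}).get("recovery", {})
        if "throttle_time_in_millis" in recovery:
            result[node_id] = recovery["throttle_time_in_millis"]
    return result


# Made-up sample payload in the assumed shape of a node stats response.
sample_stats = {
    "nodes": {
        "node-a": {
            "indices": {
                "recovery": {
                    "current_as_source": 0,
                    "current_as_target": 1,
                    "throttle_time_in_millis": 1250,
                }
            }
        },
        "node-b": {"indices": {}},
    }
}

throttle_by_node = recovery_throttle_millis(sample_stats)
```

The per-repository snapshot and restore throttle times proposed above could follow a similar shape, but their location and field names in node stats are not defined in this issue.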
@kingherc kingherc added >enhancement :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. Team:Distributed Meta label for distributed team (obsolete) needs:triage Requires assignment of a team area label Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. labels Nov 7, 2022
@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Nov 7, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)
