Do not age out no-op peer recovery retention leases #47905

Open
henningandersen opened this issue Oct 11, 2019 · 3 comments
Labels
:Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. >enhancement Team:Distributed Meta label for distributed team

Comments

@henningandersen
Contributor

We remove peer recovery retention leases belonging to unavailable nodes according to these rules, any one of which is sufficient:

  • When all shard copies are started for the shard.
  • If too many operations have been applied since the retaining-seq-no, making an operations-based recovery undesirable anyway.
  • If the index.soft_deletes.retention_lease.period time has passed.

If the index is no longer receiving any updates and the lease does not really retain any operations (retaining-seq-no >= gcp+1), it seems desirable to keep the retention lease until all shards are started, regardless of the retention period. This would ensure that a no-op recovery can happen rather than a file-based recovery.
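As a rough illustration of the rules above and of the proposed exception, here is a minimal Python sketch. It is not Elasticsearch code: `Lease`, `is_noop`, `should_expire`, `max_ops_behind` and the `keep_noop_leases` flag are hypothetical names, and the "too many operations" rule is simplified to a plain operation count.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical stand-in for a peer recovery retention lease."""
    retaining_seq_no: int  # first sequence number the lease retains
    age_millis: int        # time since the lease was last renewed


def is_noop(lease: Lease, global_checkpoint: int) -> bool:
    # The lease retains no operations if it starts above the global checkpoint (gcp).
    return lease.retaining_seq_no >= global_checkpoint + 1


def should_expire(lease: Lease, *, all_copies_started: bool,
                  global_checkpoint: int, max_ops_behind: int,
                  retention_period_millis: int,
                  keep_noop_leases: bool = False) -> bool:
    """Decide whether a lease owned by an unavailable node should be removed."""
    if all_copies_started:
        return True  # rule 1: every shard copy is started
    ops_behind = global_checkpoint + 1 - lease.retaining_seq_no
    if ops_behind > max_ops_behind:
        return True  # rule 2 (simplified): too many operations to replay
    if keep_noop_leases and is_noop(lease, global_checkpoint):
        return False  # proposed: never age out a lease that retains nothing
    return lease.age_millis > retention_period_millis  # rule 3: retention period passed
```

With `keep_noop_leases=False` the sketch follows the three rules listed above; with `keep_noop_leases=True` a no-op lease survives the retention period and is only dropped once all copies are started.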

This will only affect edge cases where either there is no extra node to assign the shard to or allocation has been disabled for more than 12h (with default values). Also, the benefit is small to non-existent if the shards have already been undergoing a file-based recovery. Still, the change is simple and could be beneficial, so it seems worth doing.

@henningandersen henningandersen added >enhancement :Distributed/Recovery Anything around constructing a new shard, either from a local or a remote source. labels Oct 11, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Recovery)

@DaveCTurner
Contributor

One further advantage of this is that we would no longer need to renew (and persist) PRRLs when they are half-expired. It's a small thing, but it is a little irksome that otherwise stationary shards (including frozen ones) must still periodically write things to disk. Apropos of #45286 we may want to improve our support for readonly filesystems in future.
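To make the renewal point concrete, here is a small Python sketch of the decision. It is only an illustration: `needs_renewal` and the `keep_noop_leases` flag are hypothetical names, and "half-expired" is modelled as the lease age reaching half of the retention period.

```python
def needs_renewal(age_millis: int, retention_period_millis: int, *,
                  is_noop: bool = False, keep_noop_leases: bool = False) -> bool:
    """Return True if the lease should be renewed (and persisted) now."""
    if keep_noop_leases and is_noop:
        # Under the proposal a no-op lease is never aged out,
        # so a stationary shard would not need to write anything to disk.
        return False
    # Current behaviour as described above: renew once the lease is half-expired.
    return age_millis >= retention_period_millis // 2
```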

I think we should make an attempt to bound the number of inactive leases. Today they belong to the node (i.e. its persistent ID) and in most situations the number of these is bounded. But I can imagine a situation where a badly configured cluster repeatedly starts up a node on an empty data path, creates a new node ID, obtains a lease and then crashes, and this would create leases ad infinitum if it never got to green. In practice it takes quite some time to start a node so the 12-hour time bound limits the number of leases in play to a few thousand, and also offers a way for users to remove any unnecessary leases (by setting the time bound to 0 on this index).
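The "few thousand" estimate can be checked with some back-of-the-envelope arithmetic. The sketch below is illustrative only: `max_leases_in_play` is a hypothetical helper, and the per-cycle durations (time for a node to start, obtain a lease and crash) are assumed values, not measurements.

```python
def max_leases_in_play(retention_period_seconds: int, seconds_per_cycle: int) -> int:
    """Upper bound on leases created within one retention period,
    assuming one new lease per start-and-crash cycle."""
    return retention_period_seconds // seconds_per_cycle


TWELVE_HOURS = 12 * 60 * 60  # the 12-hour time bound, in seconds

# Assumed cycle durations in seconds; real start-up times vary.
for cycle in (10, 30, 60):
    print(cycle, max_leases_in_play(TWELVE_HOURS, cycle))
```

Under these assumptions the bound ranges from several hundred to a few thousand leases. Without the time bound, nothing in this arithmetic would cap the count.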

@rjernst rjernst added the Team:Distributed Meta label for distributed team label May 4, 2020
@tlrx
Member

tlrx commented Aug 31, 2022

We discussed this again today as a team and agreed that all changes proposed here (keep the retention lease until all shards are started, bound the number of inactive leases) are still relevant improvements, so we'll keep this issue open. But we have no plans to prioritize this work for now.
