Do not age out no-op peer recovery retention leases #47905

henningandersen · 2019-10-11T11:47:05Z

We remove peer recovery leases belonging to unavailable nodes following these rules:

1. When all copies of the shard are started.
2. If too many operations have been applied since the retaining-seq-no, making an operations-based recovery undesirable anyway.
3. If the index.soft_deletes.retention_lease.period time has passed.

If the index is no longer receiving any updates and the lease does not actually retain any operations (retaining-seq-no >= gcp+1, where gcp is the global checkpoint), it seems desirable to keep the retention lease until all shard copies are started, regardless of the retention period. This would ensure that a no-op recovery can happen rather than a file-based recovery.
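As a rough illustration of the rules above plus the proposed exception, here is a minimal sketch. It is not the actual Elasticsearch code: the class, method, and parameter names are hypothetical, and counting retained operations as `globalCheckpoint + 1 - retainingSeqNo` is a simplifying assumption, as is the `maxOpsWorthReplaying` threshold.

```java
import java.time.Duration;

/** Hypothetical sketch of the expiry decision for a lease held for an unavailable node. */
final class LeaseExpirySketch {

    static boolean shouldExpire(long retainingSeqNo,
                                long lastRenewedMillis,
                                boolean allCopiesStarted,
                                long globalCheckpoint,
                                long maxOpsWorthReplaying,
                                long nowMillis,
                                Duration retentionPeriod) {
        // Existing rule: once every copy is started, the lease is no longer needed.
        if (allCopiesStarted) {
            return true;
        }

        // Assumption: approximate the operations the lease retains by its distance
        // from the global checkpoint. Zero means retaining-seq-no >= gcp + 1.
        long retainedOps = Math.max(0L, globalCheckpoint + 1 - retainingSeqNo);

        // Existing rule: too many operations to make an operations-based recovery worthwhile.
        if (retainedOps > maxOpsWorthReplaying) {
            return true;
        }

        // Proposed exception: a lease that retains nothing never ages out,
        // so a later no-op recovery stays possible.
        if (retainedOps == 0) {
            return false;
        }

        // Existing rule: the retention period has passed since the last renewal.
        return nowMillis - lastRenewedMillis > retentionPeriod.toMillis();
    }
}
```

With this ordering, a no-op lease is only dropped once all copies are started, while a lease that does retain operations still ages out as before.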

This will only affect edge cases where either there is no extra node to assign the shard to or allocation has been disabled for more than 6h (with default values). Also, the benefit is small to non-existent if the shards have already undergone a file-based recovery. Still, the change is simple and could be beneficial, so it seems worthwhile.

elasticmachine · 2019-10-11T11:47:06Z

Pinging @elastic/es-distributed (:Distributed/Recovery)

DaveCTurner · 2019-10-15T07:52:50Z

One further advantage of this is that we would no longer need to renew (and persist) peer recovery retention leases (PRRLs) when they are half-expired. It's a small thing, but it is a little irksome that otherwise stationary shards (including frozen ones) must still periodically write things to disk. Apropos of #45286 we may want to improve our support for read-only filesystems in future.

I think we should make an attempt to bound the number of inactive leases. Today they belong to the node (i.e. its persistent ID) and in most situations the number of these is bounded. But I can imagine a situation where a badly-configured cluster repeatedly starts up a node on an empty data path, creates a new node ID, obtains a lease and then crashes, and this would create leases ad infinitum if it never got to green. In practice it takes quite some time to start a node so the 12-hour time bound limits the number of leases in play to a few thousand, and also offers a way for users to remove any unnecessary leases (by setting the time bound to 0 on this index).
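To make the "few thousand" estimate concrete, here is a back-of-the-envelope sketch. The 15-second start-and-crash cycle is an assumed figure chosen only for illustration, and the class and method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical estimate of how many leases a crash-looping node could leave behind. */
final class LeaseBoundSketch {

    /**
     * Each start-and-crash cycle creates one lease under a fresh node ID, and each
     * lease ages out after the retention period, so at most
     * retentionPeriod / cycleTime leases exist at any one time.
     */
    static long maxLeasesInPlay(Duration retentionPeriod, Duration cycleTime) {
        if (cycleTime.isZero() || cycleTime.isNegative()) {
            throw new IllegalArgumentException("cycleTime must be positive");
        }
        return retentionPeriod.getSeconds() / cycleTime.getSeconds();
    }

    public static void main(String[] args) {
        // 12 hours = 43,200 seconds; assuming one cycle every 15 seconds.
        System.out.println(maxLeasesInPlay(Duration.ofHours(12), Duration.ofSeconds(15))); // prints 2880
    }
}
```

If no-op leases were exempt from the time bound, this limit would no longer apply to them, which is why a separate bound on inactive leases is needed.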

tlrx · 2022-08-31T10:18:11Z

We discussed this again today as a team and agreed that all changes proposed here (keep the retention lease until all shards are started, bound the number of inactive leases) are still relevant improvements, so we'll keep this issue open. But we have no plans to prioritize this work for now.

henningandersen added the >enhancement and :Distributed/Recovery labels Oct 11, 2019

DaveCTurner added the team-discuss label Oct 15, 2019

henningandersen removed the team-discuss label Oct 23, 2019

rjernst added the Team:Distributed label May 4, 2020
