
Add a configurable cooldown period between SLM operations #47520

Closed
dakrone opened this issue Oct 3, 2019 · 6 comments
Assignees: dakrone
Labels: :Data Management/ILM+SLM Index and Snapshot lifecycle management, >enhancement

@dakrone
Member

dakrone commented Oct 3, 2019

For some of our repository types (particularly S3), there can sometimes be issues due to eventual visibility of files that have been uploaded.

To handle this from SLM, we should add a configurable "cooldown" period in between operations (snapshot and delete).
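
To make the proposal concrete, here is a minimal sketch of what such a cooldown gate could look like; the class and method names (and the idea of wiring it to a dedicated setting) are purely illustrative, not the actual implementation:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Illustrative only: a minimal "cooldown gate" that SLM-style scheduling code
 * could consult before starting the next snapshot or delete operation.
 */
public class OperationCooldown {

    private final Duration cooldown; // e.g. read from a hypothetical cooldown setting
    private Instant lastOperationFinished = Instant.EPOCH;

    public OperationCooldown(Duration cooldown) {
        this.cooldown = cooldown;
    }

    /** Record that a snapshot or delete just completed. */
    public synchronized void operationFinished() {
        lastOperationFinished = Instant.now();
    }

    /** True once the cooldown since the last completed operation has elapsed. */
    public synchronized boolean mayStartOperation() {
        return Duration.between(lastOperationFinished, Instant.now()).compareTo(cooldown) >= 0;
    }
}
```

The scheduler would call `mayStartOperation()` before kicking off a snapshot or delete and reschedule the operation if it returns false.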

@dakrone dakrone added >enhancement :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Oct 3, 2019
@dakrone dakrone self-assigned this Oct 3, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (:Core/Features/ILM)

dakrone added a commit to dakrone/elasticsearch that referenced this issue Oct 3, 2019
This is a special kind of thread pool executor that runs operations in a
single-threaded manner, but with a configurable cooldown in between. The
executor is always forced to have a single fixed thread with a
configurable queue size.

This special executor is not currently used, but is part of work for elastic#47520
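
A minimal sketch of the kind of executor described above, under naming of my own choosing (this is not the actual elastic#47522 code): a ThreadPoolExecutor forced to a single fixed thread with a bounded queue, whose worker sleeps for the cooldown after each task so the next queued operation cannot start early.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/** Illustrative sketch: single-threaded executor with a cooldown between tasks. */
public class CooldownThreadPoolExecutor extends ThreadPoolExecutor {

    private final long cooldownMillis;

    public CooldownThreadPoolExecutor(long cooldownMillis, int queueSize) {
        // Exactly one fixed thread; pending operations wait in a bounded queue.
        super(1, 1, 0L, TimeUnit.MILLISECONDS, new ArrayBlockingQueue<>(queueSize));
        this.cooldownMillis = cooldownMillis;
    }

    @Override
    protected void afterExecute(Runnable task, Throwable error) {
        super.afterExecute(task, error);
        try {
            // Block the single worker thread so the next queued task cannot
            // begin until the cooldown has elapsed.
            Thread.sleep(cooldownMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```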
@jasontedor
Member

This doesn't feel right to me: our solution to "eventual consistency" problems is to wait a little bit, while we still have no guarantee that however long we wait, what we're waiting on will become visible. I think @original-brownbear has been working to make us less susceptible to eventual consistency issues, so I really want to challenge the necessity of this. Since he's spent far more time thinking about this, I would solicit his input here.

@original-brownbear
Member

@dakrone @jasontedor yea, I talked about this with @ywelsch this morning and I gotta admit I'm a little at fault for suggesting we move ahead with this.

The situation is this:

As it stands now we are still susceptible to S3 eventual consistency issues, unfortunately. The impact has been reduced step by step over the last year, though. #46250 will reduce the impact to a single operation during snapshotting only, and a follow-up should be able to fix the eventual consistency issue for good.
My assumption when suggesting we move the wait from operations into ES/SLM was that:

  1. It would be very simple to add that wait to SLM
  2. We'd be able to remove it transparently once it has become unnecessary

> we still have no guarantee that however long we wait, what we're waiting on will become visible

That's true. Unfortunately, though, waiting is currently what we do in operations to work around these issues, and in practice the wait we're using has turned out to be safe across a very large sample size.

=> I think what this comes down to is:
If SLM schedules operations (create/delete) on the repo in rapid succession, that could be a problem with S3 repositories. Do we want to fix this problem for users now, or can we wait another minor release for the eventual consistency fixes to land?

@jasontedor
Member

It would be my preference that we wait another minor release for the eventual consistency fixes to land rather than hack around this and build infrastructure (#47522) that would otherwise be unused.

@dakrone
Member Author

dakrone commented Oct 4, 2019

I agree; I'd rather this was handled on the snapshot/restore side than with infrastructure in SLM. It is actually not very simple to add to SLM: the dedicated #47522 threadpool is the "simplest" way I could think of to address it, and I'd be more than happy not to have to add it.

@dakrone
Member Author

dakrone commented Oct 4, 2019

Given that we have agreed (I think?) to address this through consistency fixes in snapshot/delete itself, I'm going to close this as well.

@dakrone dakrone closed this as completed Oct 4, 2019
original-brownbear added a commit that referenced this issue Nov 14, 2019
This is intended as a stop-gap solution/improvement to #38941 that
prevents repo modifications without an intermittent master failover
from causing inconsistent (outdated due to inconsistent listing of index-N blobs)
`RepositoryData` to be written.

Tracking the latest repository generation will move to the cluster state in a
separate pull request. This is intended as a low-risk change to be backported as
far as possible, and motivated by the recently increased chance of #38941
causing trouble via SLM (see #47520).

Closes #47834
Closes #49048
The same change was backported via additional commits (to original-brownbear/elasticsearch and elasticsearch) that referenced this issue on Nov 14–15, 2019.
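
The core idea in the commit message above, sketched in isolation (the names here are hypothetical, not the actual BlobStoreRepository code): remember the newest index-N generation this node has written or observed, and treat any blob listing that reports an older generation as stale rather than deriving new RepositoryData from it.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative sketch of tracking the latest known repository generation. */
public class RepositoryGenerationTracker {

    // -1 as a sentinel meaning "no generation observed yet"
    private final AtomicLong latestKnownGeneration = new AtomicLong(-1L);

    /** Record a generation we just wrote or reliably observed. */
    public void updateGeneration(long generation) {
        latestKnownGeneration.accumulateAndGet(generation, Math::max);
    }

    /**
     * True if a listing that claims {@code listedGeneration} as the newest
     * index-N blob is at least as recent as anything we have already seen;
     * a lower value indicates an eventually consistent (outdated) listing.
     */
    public boolean isListingConsistent(long listedGeneration) {
        return listedGeneration >= latestKnownGeneration.get();
    }
}
```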