
Use Cluster State to Track Repository Generation #49729

Merged

Conversation


@original-brownbear (Member) commented on Nov 29, 2019

Step on the road to #49060.

This commit adds the logic to keep track of a repository's generation across repository operations. See changes to the package-level Javadoc for the concrete changes in the distributed state machine.

It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will ever be overwritten for the same repository, so eventual-consistency issues around conflicting updates to the same `index-N` are no longer possible.

With this change, the read side will still use listing of the repository contents instead of relying solely on the cluster state.
The logic for that will be introduced in #49060. For the time being, this retains the ability to externally delete the contents of a repository and continue using it afterwards. In #49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
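
For illustration, here is a minimal, self-contained sketch of the write-side flow described above. The class and method names are made up for this example and are not the actual BlobStoreRepository code; the point is that the pending generation only ever moves forward (via what in reality would be a cluster state update on the master) before any blob is written, so no `index-N` can ever be written twice.

```java
// Illustrative sketch only, not the Elasticsearch implementation.
// Models the three-step write this PR describes:
//   1) bump the pending generation (in reality: a cluster state update task),
//   2) write the new index-N blob under that fresh generation,
//   3) publish the new generation as the current safe one.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class RepoGenerationSketch {

    static final long EMPTY_REPO_GEN = -1L; // mirrors RepositoryData.EMPTY_REPO_GEN

    static final class State {
        long generation = EMPTY_REPO_GEN;        // last successfully written index-N
        long pendingGeneration = EMPTY_REPO_GEN; // highest generation ever attempted
    }

    private final Map<String, State> statePerRepo = new ConcurrentHashMap<>();
    private final Map<Long, String> blobStore = new ConcurrentHashMap<>(); // stand-in blob store

    long writeNewIndexN(String repoName, long expectedGen, String repositoryData) {
        final State state = statePerRepo.computeIfAbsent(repoName, n -> new State());
        final long newGen;
        synchronized (state) { // stands in for a serialized cluster state update
            if (state.generation != expectedGen) {
                throw new IllegalStateException("concurrent modification, expected generation ["
                    + expectedGen + "] but current generation is [" + state.generation + "]");
            }
            // Step 1: the pending generation never moves backwards, even if a previous
            // write attempt failed, so every attempt gets a generation that was never used.
            state.pendingGeneration = Math.max(state.pendingGeneration, state.generation) + 1;
            newGen = state.pendingGeneration;
        }
        // Step 2: write index-N for the new generation; existing blobs are never overwritten.
        blobStore.put(newGen, repositoryData);
        // Step 3: only after the blob exists does the safe generation move to newGen.
        synchronized (state) {
            state.generation = newGen;
        }
        return newGen;
    }
}
```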

@original-brownbear added the WIP and :Distributed/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) labels on Nov 29, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear marked this pull request as ready for review on November 30, 2019
@@ -137,7 +140,7 @@ public void testRetrieveSnapshots() throws Exception {

public void testReadAndWriteSnapshotsThroughIndexFile() throws Exception {
final BlobStoreRepository repository = setupRepo();

final long pendingGeneration = getPendingGeneration(repository);

Member Author:

The generation here will now be something other than -1 because we are using the same repository name across tests. I intentionally didn't add a repository cleanup step between tests here, since this serves as a neat test (and illustration) of how the pending generation is kept consistent no matter what happens to the repo contents.

@@ -497,11 +497,10 @@ public void testSnapshotWithStuckNode() throws Exception {
logger.info("--> Go through a loop of creating and deleting a snapshot to trigger repository cleanup");
client().admin().cluster().prepareCleanupRepository("test-repo").get();

// Subtract four files that will remain in the repository:
// Expect two files to remain in the repository:

Member Author:

This whole business of keeping a "backup" index-N has long been fairly obsolete. At best it may have saved some repository status API calls from errors on S3 (when the most recent index-N did not show up in a listing ... but then an old state was obviously being served anyway). With a consistent way of tracking the generation for writes now, I don't see any point in keeping this behavior around, and simplified the deletion of old index-N blobs accordingly in production code.

@original-brownbear (Member, Author)

@ywelsch @tlrx let me know what you think about splitting #49060 in half this way.
I was thinking (hoping) that fixing only the write path while retaining all the fallbacks on the read side makes this somewhat easy to review, because it seems like a relatively non-controversial change. The only behavioral change is that we no longer start at generation -1 when writing to a repo that was emptied externally. This means we get by without BwC changes here and can make the difficult decisions about how to handle external repo modifications in the last step (#49060), without having to discuss the details of the new structure in the CS and its handling, because it's all introduced here.

@ywelsch (Contributor) left a comment

I've left some comments and questions. I would prefer to reuse RepositoryMetaData for this, just so that we have all repo-related information in one location.

@@ -114,6 +115,7 @@ public ClusterModule(Settings settings, ClusterService clusterService, List<Clus
public static List<Entry> getNamedWriteables() {
List<Entry> entries = new ArrayList<>();
// Cluster State
registerClusterCustom(entries, RepositoriesState.TYPE, RepositoriesState::new, RepositoriesState::readDiffFrom);

Contributor:

why not use the RepositoryMetaData for this? It is already persisted in the cluster state and even survives full-cluster restarts. This one is a bit odd, as it now has to keep a map for each repo again, and needs clean-up when repos are added / removed.

Member Author:

Hmm, I was thinking we may not actually want to persist this ... but now that I think about S3 and the pending generation, we actually probably want to ... will see what I can do about the RepositoryMetaData here :)

Member Author:

@ywelsch hmm on second thought:

> This one is a bit odd, as it now has to keep a map for each repo again, and needs clean-up when repos are added / removed.

This seems to be the only real downside though? And it's not really a downside, since if I move this stuff into RepositoryMetaData I get some new complications from having to adjust RepositoriesService to use custom comparisons on RepositoryMetaData that ignore the generation-related fields when deciding whether a repo has changed.

The other downside is that we now have the generation logic leak into all the Repository implementations in a bunch of new spots. CCR and all the wrapping repositories don't really care about the generation, and it's entirely BlobStoreRepository specific. Would we then make adjustments to the serialization of e.g. GetRepositoriesResponse so that it doesn't include the generation? We also lose some incrementality/efficiency in the ClusterState serialization as a result of mixing somewhat dynamic and pretty static things in it, don't we?

I still think persisting this state across full-cluster restarts is a good thing, but I'm not so sure about using RepositoryMetaData here, having now tried implementing this in practice.
Do you still think it's worth it, and that we should make adjustments to things like GetRepositoriesResponse accordingly?
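
To make the comparison concern above concrete, a small self-contained sketch (hypothetical fields, not the real RepositoryMetaData class): if the generations live on the repository metadata, RepositoriesService would need an equality check that ignores them, so that a generation bump alone doesn't look like a changed repository configuration.

```java
// Hypothetical sketch of repository metadata that also carries generation fields.
// Not the actual org.elasticsearch RepositoryMetaData class.
import java.util.Map;
import java.util.Objects;

final class RepoMetaDataSketch {
    final String name;
    final String type;
    final Map<String, String> settings;
    final long generation;        // current safe index-N generation
    final long pendingGeneration; // highest generation handed out so far

    RepoMetaDataSketch(String name, String type, Map<String, String> settings,
                       long generation, long pendingGeneration) {
        this.name = name;
        this.type = type;
        this.settings = settings;
        this.generation = generation;
        this.pendingGeneration = pendingGeneration;
    }

    /**
     * The custom comparison mentioned above: decide whether the repository needs to be
     * re-created by ignoring the frequently-changing generation fields.
     */
    boolean equalsIgnoringGenerations(RepoMetaDataSketch other) {
        return name.equals(other.name)
            && type.equals(other.type)
            && Objects.equals(settings, other.settings);
    }
}
```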

Contributor:

> RepositoriesService to now use some custom comparisons on RepositoryMetaData that ignores the generation related fields to see if a repo has changed

I think this is ok (it's similar to how we handle most items in the cluster state). Further down the line, we should not reinitialize the repository implementation whenever there is an update from the CS, but leave it to the implementation to decide whether it needs complete reinitialization or not. This will allow us to do dynamic throttling (e.g. allow the user to dynamically change max_restore_bytes_per_sec, which would take effect directly on an ongoing restore). It's a bit more investment right now, but will hopefully pay off further down the line.

Regarding serialization, we can make RepositoryMetaData implement Diffable to have it more smartly serialize changes (this would benefit any other field in it as well). I don't expect this to matter much though.

I also think we won't need to adapt GetRepositoriesResponse but can just conditionally expose the additional fields by parameterizing toXContent if need be (I'm even fine exposing these fields on the repositories API and not having two variants).
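
For illustration, a self-contained sketch of what "parameterizing toXContent" could mean here (the parameter name and the map-based output are invented; the real code would use XContentBuilder and ToXContent params): the generation fields are only rendered when the caller explicitly asks for them, so the default repositories API response stays unchanged.

```java
// Hypothetical sketch: conditionally exposing generation fields in an API response.
import java.util.LinkedHashMap;
import java.util.Map;

final class RepoResponseSketch {
    private final String type;
    private final long generation;
    private final long pendingGeneration;

    RepoResponseSketch(String type, long generation, long pendingGeneration) {
        this.type = type;
        this.generation = generation;
        this.pendingGeneration = pendingGeneration;
    }

    Map<String, Object> toXContent(Map<String, String> params) {
        final Map<String, Object> out = new LinkedHashMap<>();
        out.put("type", type);
        // "include_generations" is an invented parameter name for this sketch only.
        if (Boolean.parseBoolean(params.getOrDefault("include_generations", "false"))) {
            out.put("generation", generation);
            out.put("pending_generation", pendingGeneration);
        }
        return out;
    }
}
```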

if (currentGen != expectedGen) {
// the index file was updated by a concurrent operation, so we were operating on stale
// repository data
throw new RepositoryException(metadata.name(), "concurrent modification of the index-N file, expected current generation [" +

Contributor:

listener.onFailure(new RepositoryException...)?

Member Author:

Sure, seems nicer :)
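
For clarity, a tiny self-contained sketch of the change being agreed on here (the listener interface and exception type below are stand-ins for Elasticsearch's ActionListener and RepositoryException, not the exact production code): rather than throwing out of an asynchronous step, the failure is delivered through the listener so the caller's callback always completes.

```java
// Illustrative only: fail the listener instead of throwing from an async step.
final class GenerationCheckSketch {

    interface Listener<T> { // stand-in for ActionListener<T>
        void onResponse(T result);
        void onFailure(Exception e);
    }

    static void verifyGeneration(String repoName, long currentGen, long expectedGen,
                                 Listener<Void> listener) {
        if (currentGen != expectedGen) {
            // Previously this threw; routing it through onFailure keeps the async chain intact.
            listener.onFailure(new IllegalStateException("[" + repoName
                + "] concurrent modification of the index-N file, expected current generation ["
                + expectedGen + "] but it was [" + currentGen + "]"));
            return;
        }
        listener.onResponse(null);
    }
}
```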

final RepositoriesState.State repoState = Optional.ofNullable(state.state(repoName)).orElseGet(
() -> RepositoriesState.builder().putState(repoName, expectedGen, expectedGen).build().state(repoName));
if (repoState.pendingGeneration() != repoState.generation()) {
logger.warn("Trying to write new repository data of generation [{}] over unfinished write, repo is in state [{}]",

Contributor:

Why do we warn here? Shouldn't this just be info level logging?

Member Author:

Yea, this isn't such a bad state :) made it info


@Override
public void onFailure(String source, Exception e) {
l.onFailure(e);

Contributor:

Should we wrap the exception here to add more context about where exactly something went wrong (same for setting the pending generation)?

Member Author:

Yea added some wrapping here
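
As a sketch of the kind of wrapping meant here (names invented for the example; the production code would presumably use RepositoryException and the cluster-state task's source string): the original exception stays attached as the cause, while the message records which repository and which update step failed.

```java
// Illustrative only: add context when propagating a cluster state update failure.
final class FailureContextSketch {

    static Exception withContext(String repoName, String source, Exception cause) {
        // Keep the original exception as the cause so nothing is lost, but say which
        // repository and which cluster state update task ("source") actually failed.
        return new RuntimeException(
            "[" + repoName + "] failed to execute cluster state update [" + source + "]", cause);
    }

    // Usage sketch inside the onFailure callback shown above:
    //   public void onFailure(String source, Exception e) {
    //       l.onFailure(withContext(repoName, source, e));
    //   }
}
```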


// Step 1: Set repository generation state to the next possible pending generation
final StepListener<Long> setPendingStep = new StepListener<>();
clusterService.submitStateUpdateTask("set pending repository generation", new ClusterStateUpdateTask() {

Contributor:

maybe mention generation + repo here

Member Author:

Sure added for both CS update steps

repoState.pendingGeneration(), repoState);
}
if (expectedGen != RepositoryData.EMPTY_REPO_GEN && expectedGen != repoState.generation()) {
throw new IllegalStateException(

Contributor:

when do we expect this to happen?

Member Author:

Come to think of it ... it's impossible already :) Removing it in favor of an assert. We already make sure not to be in a bad spot here in safeRepositoryData anyway

});

// Step 2: Write new index-N blob to repository and update index.latest
setPendingStep.whenComplete(newGen -> threadPool().generic().execute(ActionRunnable.wrap(listener, l -> {

Contributor:

why use the generic threadpool here, and not the snapshot one?

Member Author:

Right ... moved to the snapshot pool.

@original-brownbear (Member, Author) left a comment

Answered and addressed all comments; working on the below now:

> I've left some comments and questions. I would prefer to reuse RepositoryMetaData for this

@original-brownbear (Member, Author)

@ywelsch thanks, addressed all comments now. Let me know what you think about the RepositoriesMetaData situation :)

@original-brownbear (Member, Author)

@ywelsch @tlrx all points addressed again I think :)

@tlrx (Member) left a comment

LGTM, thanks Armin

@ywelsch (Contributor) left a comment

LGTM (Left one comment about an exception message)

@original-brownbear (Member, Author)

Thanks so much Yannick + Tanguy! Just one more step left now in this ordeal I think :)

@original-brownbear merged commit b34daeb into elastic:master on Dec 4, 2019
@original-brownbear deleted the repo-uses-cs-increment branch on December 4, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 5, 2019
This moves the blob store repository to only use the information available in the cluster state for loading `RepositoryData`, without falling back to listing to determine a repository's generation.

Relates elastic#49729
Closes elastic#38941
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 9, 2019
original-brownbear added a commit that referenced this pull request Dec 9, 2019
original-brownbear added a commit that referenced this pull request Dec 17, 2019

Follow up to #49729 

This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. It also makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
As a result, this removes the resiliency to concurrent modifications of the repository and puts a repository in a `corrupted` state if loading `RepositoryData` from the assumed generation fails.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 17, 2019
original-brownbear added a commit that referenced this pull request Dec 17, 2019
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
@original-brownbear restored the repo-uses-cs-increment branch on August 6, 2020
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >enhancement v7.6.0 v8.0.0-alpha1