Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization #38368

@original-brownbear
Member

commented Feb 4, 2019

  • Partly extracted and inspired by https://github.com/elastic/elasticsearch/compare/master...ywelsch:snapshot-refactored?expand=1#diff-a0853be4492c052f24917b5c1464003dR975
  • The problem in #38226 is that in some corner cases multiple calls to endSnapshot were made concurrently, leading to non-deterministic behavior (beginSnapshot was triggering a repository finalization while one that was triggered by a deleteSnapshot was already in progress)
    • Fix by:
      • Making all endSnapshot calls originate from the cluster state being in a "completed" state (apart from one short-circuit when initializing an empty snapshot). This required storing the failure string in SnapshotsInProgress.Entry.
      • Adding deduplication logic to endSnapshot
  • Also:
    • Streamlined the init behavior to work the same way (keep state on the SnapshotsService to decide which snapshot entries are stale)
  • closes #38226
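
The deduplication mentioned above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `EndSnapshotDeduplicator`, `tryBegin` and `finish` are hypothetical names, and snapshots are identified by plain strings for brevity. The idea is that only the first caller for a given snapshot proceeds to finalization, while concurrent or repeated calls are dropped until that finalization has completed.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of Elasticsearch: tracks which snapshots
// currently have a finalization in flight so that ending runs at most once.
final class EndSnapshotDeduplicator {

    // Thread-safe set of snapshot names that are currently being ended.
    private final Set<String> endingSnapshots = ConcurrentHashMap.newKeySet();

    // Returns true only for the first caller. Set.add is atomic here, so two
    // concurrent callers can never both get true for the same snapshot.
    boolean tryBegin(String snapshot) {
        return endingSnapshots.add(snapshot);
    }

    // To be called once finalization has completed, successfully or not.
    void finish(String snapshot) {
        endingSnapshots.remove(snapshot);
    }
}
```

A caller would guard the repository finalization with `if (dedup.tryBegin(name)) { ... }` and call `dedup.finish(name)` from the completion handler.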

Note: I ran a few thousand iterations of the SnapshotResiliencyTests for these changes and they came back green.

original-brownbear added some commits Feb 4, 2019

bck
@elasticmachine

commented Feb 4, 2019

@@ -680,14 +692,27 @@ public void applyClusterState(ClusterChangedEvent event) {
try {
if (event.localNodeMaster()) {
// We don't remove old master when master flips anymore. So, we need to check for change in master
if (event.nodesRemoved() || event.previousState().nodes().isLocalNodeElectedMaster() == false) {
processSnapshotsOnRemovedNodes(event);
final SnapshotsInProgress snapshotsInProgress = event.state().custom(SnapshotsInProgress.TYPE);

@original-brownbear

original-brownbear Feb 4, 2019

Author Member

Simplified the logic here a little to avoid the deeply nested null checks, which made it really hard to figure out which chain of conditions led to something being executed.

// 1. Completed snapshots
// 2. Snapshots in state INIT that the previous master failed to start
// 3. Snapshots in any other state that have all their shard tasks completed
snapshotsInProgress.entries().stream().filter(

@original-brownbear

original-brownbear Feb 4, 2019

Author Member

All snapshot ending happens here now.

  1. This should prevent any future stale snapshots that have all their shards completed.
  2. Makes it much easier to reason about master failovers.
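
The three cases listed in the code comment above (completed snapshots, INIT snapshots that the previous master failed to start, and snapshots in any other state whose shard tasks have all completed) can be read as a single predicate over the in-progress entries. The following is a simplified sketch under stated assumptions: `SketchState`, `SketchEntry` and `SnapshotEndingFilter` are hypothetical stand-ins for the real cluster-state types, the per-shard status map is collapsed into one boolean, and `completed()` is assumed to be true for SUCCESS and FAILED only.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified stand-in for the snapshot state enum (assumption).
enum SketchState {
    INIT, STARTED, SUCCESS, FAILED, ABORTED;

    boolean completed() {
        return this == SUCCESS || this == FAILED;
    }
}

// Simplified stand-in for an in-progress snapshot entry (assumption).
final class SketchEntry {
    final String name;
    final SketchState state;
    final boolean allShardsCompleted;

    SketchEntry(String name, SketchState state, boolean allShardsCompleted) {
        this.name = name;
        this.state = state;
        this.allShardsCompleted = allShardsCompleted;
    }
}

final class SnapshotEndingFilter {

    // Returns the names of all snapshots that should be ended now.
    // initializingSnapshots holds the snapshots this node itself is starting;
    // an INIT entry that is not in that set was left behind by a previous master.
    static List<String> snapshotsToEnd(List<SketchEntry> entries, Set<String> initializingSnapshots) {
        return entries.stream()
            .filter(e -> e.state.completed()
                || (e.state == SketchState.INIT && !initializingSnapshots.contains(e.name))
                || (e.state != SketchState.INIT && e.allShardsCompleted))
            .map(e -> e.name)
            .collect(Collectors.toList());
    }
}
```

Evaluating one predicate on every cluster state update means a snapshot with all shards done cannot be missed just because the event that completed it was handled by a different master.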
*/
private void removeFinishedSnapshotFromClusterState(ClusterChangedEvent event) {

@original-brownbear

original-brownbear Feb 4, 2019

Author Member

This is now automatically covered by the applyClusterState hook

}
}
entries.add(updatedSnapshot);
} else if (snapshot.state() == State.INIT && initializingSnapshots.contains(snapshot.snapshot()) == false) {

@original-brownbear

original-brownbear Feb 4, 2019

Author Member

This should be more stable and easier to reason about. It's weird that we check newMaster on some version of the state and then later run this code based on whether or not we failed over earlier.

return false;
private static boolean removedNodesCleanupNeeded(SnapshotsInProgress snapshotsInProgress, List<DiscoveryNode> removedNodes) {
// If at least one shard was running on a removed node - we need to fail it
return removedNodes.isEmpty() == false && snapshotsInProgress.entries().stream().flatMap(snapshot ->

@original-brownbear

original-brownbear Feb 4, 2019

Author Member

This could be simplified considerably now too, since we're already cleaning up snapshots in SUCCESS and INIT state at the top level of applyClusterState.

- * @param failure failure reason or null if snapshot was successful
*/
- private void endSnapshot(final SnapshotsInProgress.Entry entry, final String failure) {
+ private void endSnapshot(final SnapshotsInProgress.Entry entry) {

@original-brownbear

original-brownbear Feb 4, 2019

Author Member

Just one private method now, the potential failure message lives in the cluster state.
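
To illustrate what "the failure message lives in the cluster state" means for callers, here is a minimal sketch. It is not the real `SnapshotsInProgress.Entry`: `InProgressEntry`, `withFailure` and `SnapshotFinalizer` are hypothetical names, and the finalization step is reduced to building a description string.

```java
// Hypothetical, simplified entry: the failure reason travels with the entry
// itself instead of being passed to endSnapshot as a separate argument.
final class InProgressEntry {
    final String snapshot;
    final String failure; // null if the snapshot was successful

    InProgressEntry(String snapshot, String failure) {
        this.snapshot = snapshot;
        this.failure = failure;
    }

    // Entries are immutable; recording a failure yields a new entry that
    // would be published through a cluster state update.
    InProgressEntry withFailure(String reason) {
        return new InProgressEntry(snapshot, reason);
    }
}

final class SnapshotFinalizer {

    // Single entry point: everything needed to finalize is read from the entry.
    static String endSnapshot(InProgressEntry entry) {
        return entry.failure == null
            ? "finalized " + entry.snapshot + " as SUCCESS"
            : "finalized " + entry.snapshot + " as FAILED: " + entry.failure;
    }
}
```

Because the reason is part of the entry, any node that later becomes master and sees the entry can finalize the snapshot with the same outcome.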

SnapshotsStatusResponse status =
client.admin().cluster().prepareSnapshotStatus("repository").setSnapshots("snap").get();
assertThat(status.getSnapshots().iterator().next().getState(), equalTo(State.ABORTED));
} catch (Exception e) {

@original-brownbear

original-brownbear Feb 4, 2019

Author Member

This isn't necessary anymore; we'll never create a broken repository with this fix.

@original-brownbear original-brownbear changed the title from "Fix Concurrent Snapshot Ending" to "[WIP] Fix Concurrent Snapshot Ending" Feb 4, 2019

original-brownbear added some commits Feb 5, 2019

@@ -156,9 +154,6 @@ public void clusterChanged(ClusterChangedEvent event) {
logger.info("--> got exception from race in master operation retries");
} else {
logger.info("--> got exception from hanged master", ex);
assertThat(cause, instanceOf(MasterNotDiscoveredException.class));

@original-brownbear

original-brownbear Feb 5, 2019

Author Member

The timing here changed now and we're running into

[2019-02-05T09:27:33,492][INFO ][o.e.d.SnapshotDisruptionIT] [testDisruptionOnSnapshotInitialization] --> got exception from hanged master
java.util.concurrent.ExecutionException: RemoteTransportException[[node_tm0][127.0.0.1:46407][cluster:admin/snapshot/create]]; nested: InvalidSnapshotNameException[[test-repo:test-snap-2] Invalid snapshot name [test-snap-2], snapshot with the same name already exists];

in most cases, caused by the retries on the hung master. I relaxed the assertion, as we already did elsewhere for this case.
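
How exactly the assertion was relaxed is not visible in this hunk. One way to express such a relaxed check is to accept any of several expected exception types anywhere in the cause chain. The sketch below assumes that approach: `RelaxedCauseCheck` is a hypothetical helper, and the two `Stub...` exception classes are local stand-ins for the real Elasticsearch exceptions.

```java
// Local stand-ins for the real exception types (assumptions for this sketch).
class StubMasterNotDiscoveredException extends RuntimeException {
    StubMasterNotDiscoveredException(String message) {
        super(message);
    }
}

class StubInvalidSnapshotNameException extends RuntimeException {
    StubInvalidSnapshotNameException(String message) {
        super(message);
    }
}

final class RelaxedCauseCheck {

    // Walks the cause chain of t and reports whether any throwable in it
    // is an instance of one of the allowed types.
    static boolean hasCauseOfAnyType(Throwable t, Class<?>... allowed) {
        for (Throwable current = t; current != null; current = current.getCause()) {
            for (Class<?> type : allowed) {
                if (type.isInstance(current)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

A test would then assert `hasCauseOfAnyType(ex, StubMasterNotDiscoveredException.class, StubInvalidSnapshotNameException.class)` instead of requiring a single exception type.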

@original-brownbear original-brownbear changed the title from "[WIP] Fix Concurrent Snapshot Ending" to "Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization" Feb 5, 2019

@original-brownbear original-brownbear removed the WIP label Feb 5, 2019

@original-brownbear

Member Author

commented Feb 5, 2019

Jenkins run elasticsearch-ci/2

@original-brownbear

Member Author

commented Feb 5, 2019

test failure is due to #38412

@original-brownbear original-brownbear merged commit 2f6afd2 into elastic:master Feb 5, 2019

7 checks passed

CLA: Commit author has signed the CLA
elasticsearch-ci/1: Build finished.
elasticsearch-ci/2: Build finished.
elasticsearch-ci/default-distro: Build finished.
elasticsearch-ci/docbldesx: Build finished.
elasticsearch-ci/oss-distro-docs: Build finished.
elasticsearch-ci/packaging-sample: Build finished.

@original-brownbear original-brownbear deleted the original-brownbear:fix-concurrent-snapshot-ending branch Feb 5, 2019

@original-brownbear

Member Author

commented Feb 5, 2019

@ywelsch thanks!

@colings86 colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Feb 12, 2019

Fix Concurrent Snapshot Ending And Stabilize Snapshot Finalization (elastic#38368)

* The problem in elastic#38226 is that in some corner cases multiple calls to `endSnapshot` were made concurrently, leading to non-deterministic behavior (`beginSnapshot` was triggering a repository finalization while one that was triggered by a `deleteSnapshot` was already in progress)
   * Fixed by:
      * Making all `endSnapshot` calls originate from the cluster state being in a "completed" state (apart from one short-circuit when initializing an empty snapshot). This required storing the failure string in `SnapshotsInProgress.Entry`.
      * Adding deduplication logic to `endSnapshot`
* Also:
  * Streamlined the init behavior to work the same way (keep state on the `SnapshotsService` to decide which snapshot entries are stale)
* closes elastic#38226

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Feb 28, 2019

Snapshot Stability Fixes
* Backport of various snapshot stability fixes from `master` to `6.7`
* Includes elastic#38368, elastic#38025 and elastic#37612

original-brownbear added a commit that referenced this pull request Mar 1, 2019

Snapshot Stability Fixes (#39502)
* Snapshot Stability Fixes

* Backport of various snapshot stability fixes from `master` to `6.7`
* Includes #38368, #38025 and #37612

original-brownbear added a commit that referenced this pull request Mar 4, 2019

Snapshot Stability Fixes (#39550)
* Backport of various snapshot stability fixes from `master` to `6.7` making the snapshot logic in `6.7` equivalent to that in `master` functionally
* Includes #38368, #38025 and #37612

kovrus added a commit to crate/crate that referenced this pull request Apr 24, 2019

Port ES snapshotting code.
- Fix two race conditions that lead to stuck snapshots (elastic/elasticsearch#37686)
- Improve resilience SnapshotShardService (elastic/elasticsearch#36113)
- Fix concurrent snapshot ending and stabilize snapshot finalization
    (elastic/elasticsearch#38368)

@kovrus kovrus referenced this pull request Apr 24, 2019

Merged

Port ES snapshotting code. #8601

5 of 5 tasks complete

kovrus added five more commits to crate/crate that referenced this pull request between Apr 25 and Apr 26, 2019, and mergify bot added one on Apr 26, 2019, each repeating the "Port ES snapshotting code." commit message above.