
[CI] Multiple Failures in org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT #38447

Closed
original-brownbear opened this issue Feb 5, 2019 · 6 comments
Assignees
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0-beta1

Comments

@original-brownbear
Member

original-brownbear commented Feb 5, 2019

It seems #38368 destabilized org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT on CI (so far I can't reproduce this locally, though).

I'm on it.

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=3356211FA607A974 \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testMasterAndDataShutdownDuringSnapshot" \
  -Dtests.security.manager=true \
  -Dtests.locale=teo \
  -Dtests.timezone=Asia/Aqtobe \
  -Dcompiler.java=12 \
  -Druntime.java=11

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=A2E8801713D0E35A \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testSnapshotWithStuckNode" \
  -Dtests.security.manager=true \
  -Dtests.locale=bg-BG \
  -Dtests.timezone=Etc/GMT+8 \
  -Dcompiler.java=11 \
  -Druntime.java=8

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=immutable&&linux&&docker/224/console

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1846/console

@original-brownbear original-brownbear added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0 labels Feb 5, 2019
@original-brownbear original-brownbear self-assigned this Feb 5, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@henningandersen
Contributor

Thanks @original-brownbear. FYI, my PR build, where #38368 is not merged, also failed in testMasterAndDataShutdownDuringSnapshot:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/7454/console

Looks like it ran for an extended duration:

HEARTBEAT J4 PID(48487@elasticsearch-ci-immutable-ubuntu-1404-1549379268547314587): 2019-02-05T15:27:19, stalled for 60.3s at: DedicatedClusterSnapshotRestoreIT.testMasterAndDataShutdownDuringSnapshot

@henningandersen
Contributor

I was able to reproduce the same failure locally by running the single test method on repeat until failure from within IntelliJ.
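For reference, the same run-until-failure approach can also be scripted from the command line; a rough sketch using the build's randomized-testing parameters (the iteration count here is an arbitrary choice, and `tests.iters` is the repeat knob exposed by the randomizedtesting runner the build uses):

```shell
# Repeat the single failing test method many times with fresh random seeds.
# Assumes the standard Elasticsearch gradle test properties; adjust the
# iteration count as needed.
./gradlew :server:integTest \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testMasterAndDataShutdownDuringSnapshot" \
  -Dtests.iters=100
```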

@original-brownbear
Member Author

Thanks for checking @henningandersen, same here. Looking into fixing it now :)

@original-brownbear
Member Author

It seems that with the changes from #38368 we're missing a state update when nodes leave the cluster, and snapshots get stuck in a state like:

{
  "snapshots": [
    {
      "repository": "test-repo",
      "snapshot": "test-snap",
      "uuid": "KuLcFik0T86h21Y4bMVZ4Q",
      "include_global_state": true,
      "partial": false,
      "state": "STARTED",
      "indices": [
        {
          "name": "test-idx",
          "id": "shkWw-veQFaFvBHAGegcjQ"
        }
      ],
      "start_time_millis": 1549394078202,
      "repository_state_id": -1,
      "shards": [
        {
          "index": {
            "index_name": "test-idx",
            "index_uuid": "Lhug5HNuQ-i3ykpz0gkz-A"
          },
          "shard": 1,
          "state": "SUCCESS",
          "node": "6JO5slm7SzynyqKZ-C2sGg"
        },
        {
          "index": {
            "index_name": "test-idx",
            "index_uuid": "Lhug5HNuQ-i3ykpz0gkz-A"
          },
          "shard": 0,
          "state": "INIT",
          "node": "Ovz6FD8bSEOdCzohp0_DKw"
        }
      ]
    }
  ]
}

This happens even though node Ovz6FD8bSEOdCzohp0_DKw has already left the cluster. Still on it :)
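For anyone following along, a stuck state like the one above can be inspected on a live cluster via the snapshot status API (the host and port here are placeholders for a local test cluster):

```shell
# Shows the per-shard state (INIT/STARTED/SUCCESS/...) of an in-progress
# snapshot, matching the "shards" section of the cluster-state dump above.
curl -s 'http://localhost:9200/_snapshot/test-repo/test-snap/_status?pretty'
```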

@original-brownbear
Copy link
Member Author

This fixes it: we were missing the node-removal handling when it coincided with a master failover. I'll open a PR in a bit (trying to create a deterministic test that reproduces the issue):

diff --git a/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java b/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
index 17306e1585f..e2628fda991 100644
--- a/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
+++ b/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
@@ -687,8 +687,9 @@ public class SnapshotsService extends AbstractLifecycleComponent implements Clus
             if (event.localNodeMaster()) {
                 // We don't remove old master when master flips anymore. So, we need to check for change in master
                 final SnapshotsInProgress snapshotsInProgress = event.state().custom(SnapshotsInProgress.TYPE);
+                final boolean newMaster = event.previousState().nodes().isLocalNodeElectedMaster() == false;
                 if (snapshotsInProgress != null) {
-                    if (removedNodesCleanupNeeded(snapshotsInProgress, event.nodesDelta().removedNodes())) {
+                    if (newMaster || removedNodesCleanupNeeded(snapshotsInProgress, event.nodesDelta().removedNodes())) {
                         processSnapshotsOnRemovedNodes();
                     }
                     if (event.routingTableChanged() && waitingShardsStartedOrUnassigned(snapshotsInProgress, event)) {
@@ -704,7 +705,7 @@ public class SnapshotsService extends AbstractLifecycleComponent implements Clus
                             || entry.state() != State.INIT && completed(entry.shards().values())
                     ).forEach(this::endSnapshot);
                 }
-                if (event.previousState().nodes().isLocalNodeElectedMaster() == false) {
+                if (newMaster) {
                     finalizeSnapshotDeletionFromPreviousMaster(event);
                 }
             }        
