
[CI] Multiple Failures in org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT #38447

Closed
original-brownbear opened this issue Feb 5, 2019 · 6 comments
Assignees
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0-beta1

Comments

@original-brownbear
Member

original-brownbear commented Feb 5, 2019

It seems #38368 destabilized org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT on CI (so far I can't reproduce this locally, though).

I'm on it.

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=3356211FA607A974 \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testMasterAndDataShutdownDuringSnapshot" \
  -Dtests.security.manager=true \
  -Dtests.locale=teo \
  -Dtests.timezone=Asia/Aqtobe \
  -Dcompiler.java=12 \
  -Druntime.java=11

REPRODUCE WITH: ./gradlew :server:integTest \
  -Dtests.seed=A2E8801713D0E35A \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testSnapshotWithStuckNode" \
  -Dtests.security.manager=true \
  -Dtests.locale=bg-BG \
  -Dtests.timezone=Etc/GMT+8 \
  -Dcompiler.java=11 \
  -Druntime.java=8

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=immutable&&linux&&docker/224/console

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/1846/console

@original-brownbear original-brownbear added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI v7.0.0 labels Feb 5, 2019
@original-brownbear original-brownbear self-assigned this Feb 5, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@henningandersen
Contributor

Thanks @original-brownbear. FYI, my PR build, where #38368 is not merged, also failed in testMasterAndDataShutdownDuringSnapshot:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/7454/console

Looks like it ran for an extended duration:

HEARTBEAT J4 PID(48487@elasticsearch-ci-immutable-ubuntu-1404-1549379268547314587): 2019-02-05T15:27:19, stalled for 60.3s at: DedicatedClusterSnapshotRestoreIT.testMasterAndDataShutdownDuringSnapshot

@henningandersen
Contributor

I was able to reproduce the same failure locally by running the single test method on repeat until failure from within IntelliJ.
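For reference, the same run-until-failure approach can also be scripted from the command line; a rough sketch using the build's randomized-testing parameters (the iteration count here is an arbitrary choice, and `tests.iters` is the repeat knob exposed by the randomizedtesting runner the build uses):

```shell
# Repeat the single failing test method many times with fresh random seeds.
# Assumes the standard Elasticsearch gradle test properties; adjust the
# iteration count as needed.
./gradlew :server:integTest \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testMasterAndDataShutdownDuringSnapshot" \
  -Dtests.iters=100
```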

@original-brownbear
Member Author

Thanks for checking @henningandersen, same here. Looking into fixing it now :)

@original-brownbear
Member Author

It seems that with the changes from #38368 we're missing a state update when nodes leave the cluster, and snapshots get stuck in a state like:

{
  "snapshots": [
    {
      "repository": "test-repo",
      "snapshot": "test-snap",
      "uuid": "KuLcFik0T86h21Y4bMVZ4Q",
      "include_global_state": true,
      "partial": false,
      "state": "STARTED",
      "indices": [
        {
          "name": "test-idx",
          "id": "shkWw-veQFaFvBHAGegcjQ"
        }
      ],
      "start_time_millis": 1549394078202,
      "repository_state_id": -1,
      "shards": [
        {
          "index": {
            "index_name": "test-idx",
            "index_uuid": "Lhug5HNuQ-i3ykpz0gkz-A"
          },
          "shard": 1,
          "state": "SUCCESS",
          "node": "6JO5slm7SzynyqKZ-C2sGg"
        },
        {
          "index": {
            "index_name": "test-idx",
            "index_uuid": "Lhug5HNuQ-i3ykpz0gkz-A"
          },
          "shard": 0,
          "state": "INIT",
          "node": "Ovz6FD8bSEOdCzohp0_DKw"
        }
      ]
    }
  ]
}

This happens even though node Ovz6FD8bSEOdCzohp0_DKw has already left the cluster. Still on it :)
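For anyone following along, a stuck state like the one above can be inspected on a live cluster via the snapshot status API (the host and port here are placeholders for a local test cluster):

```shell
# Shows the per-shard state (INIT/STARTED/SUCCESS/...) of an in-progress
# snapshot, matching the "shards" section of the cluster-state dump above.
curl -s 'http://localhost:9200/_snapshot/test-repo/test-snap/_status?pretty'
```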

@original-brownbear
Copy link
Member Author

This fixes it: we were missing the node-removal handling when it coincided with a master failover. I'll open a PR in a bit (trying to create a deterministic test that reproduces the issue):

diff --git a/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java b/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
index 17306e1585f..e2628fda991 100644
--- a/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
+++ b/server/src/main/java/org/elasticsearch/snapshots/SnapshotsService.java
@@ -687,8 +687,9 @@ public class SnapshotsService extends AbstractLifecycleComponent implements Clus
             if (event.localNodeMaster()) {
                 // We don't remove old master when master flips anymore. So, we need to check for change in master
                 final SnapshotsInProgress snapshotsInProgress = event.state().custom(SnapshotsInProgress.TYPE);
+                final boolean newMaster = event.previousState().nodes().isLocalNodeElectedMaster() == false;
                 if (snapshotsInProgress != null) {
-                    if (removedNodesCleanupNeeded(snapshotsInProgress, event.nodesDelta().removedNodes())) {
+                    if (newMaster || removedNodesCleanupNeeded(snapshotsInProgress, event.nodesDelta().removedNodes())) {
                         processSnapshotsOnRemovedNodes();
                     }
                     if (event.routingTableChanged() && waitingShardsStartedOrUnassigned(snapshotsInProgress, event)) {
@@ -704,7 +705,7 @@ public class SnapshotsService extends AbstractLifecycleComponent implements Clus
                             || entry.state() != State.INIT && completed(entry.shards().values())
                     ).forEach(this::endSnapshot);
                 }
-                if (event.previousState().nodes().isLocalNodeElectedMaster() == false) {
+                if (newMaster) {
                     finalizeSnapshotDeletionFromPreviousMaster(event);
                 }
             }        
