
[CI] DedicatedClusterSnapshotRestoreIT testSnapshotWithStuckNode failed #39852

Closed
droberts195 opened this issue Mar 8, 2019 · 9 comments · Fixed by #43537
Assignees
Labels
:Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@droberts195
Contributor

org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT testSnapshotWithStuckNode failed in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=immutable&&linux&&docker/60/console

The error was:

FAILURE 4.30s J4 | DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode <<< FAILURES!
   > Throwable #1: java.lang.AssertionError: not all files were deleted during snapshot cancellation
   > Expected: <1>
   >      but: was <0>
   > 	at __randomizedtesting.SeedInfo.seed([5C205D54D3187BA0:97C59A25EEAE3FCF]:0)
   > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   > 	at org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode(DedicatedClusterSnapshotRestoreIT.java:496)
   > 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   > 	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
   > 	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
   > 	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
   > 	at java.base/java.lang.Thread.run(Thread.java:834)

The REPRO command was:

./gradlew :server:integTest \
  -Dtests.seed=5C205D54D3187BA0 \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testSnapshotWithStuckNode" \
  -Dtests.security.manager=true \
  -Dtests.locale=fr-SC \
  -Dtests.timezone=America/Montevideo \
  -Dcompiler.java=12 \
  -Druntime.java=11

(Note compiled with Java 12.)

The problem did not reproduce locally for me.

@droberts195 droberts195 added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Mar 8, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Mar 11, 2019
* The test failure in elastic#39852 is caused by a file in the initial repository when there should not be any
  * It seems that on a normal, consistent file system no left-over file should ever exist here after the validation finishes, and I can't reproduce or see any other path to a dangling file in the fresh repository
=> added a more verbose and strict assertion that will log what file is left over next time
* Relates elastic#39852
@original-brownbear
Member

It's a little tricky to tell where this is coming from without better logging/asserting => I opened #39893 for that.
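The stricter assertion added for this could look roughly like the following. This is a hypothetical sketch, not the actual Elasticsearch test code: walk the repository directory and include the full file list in the failure message so the leftover file is visible in the CI log.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.stream.*;

public class AssertFileCountSketch {
    // Hypothetical sketch of a stricter assertFileCount: on a mismatch, fail
    // with the full list of files so the leftover shows up in the CI output.
    static void assertFileCount(Path dir, int expectedCount) throws IOException {
        try (Stream<Path> stream = Files.walk(dir)) {
            List<Path> files = stream.filter(Files::isRegularFile).collect(Collectors.toList());
            if (files.size() != expectedCount) {
                throw new AssertionError("Unexpected file count, found: " + files
                    + ". expected:<" + expectedCount + "> but was:<" + files.size() + ">");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path repo = Files.createTempDirectory("repo");
        Files.writeString(repo.resolve("index-0"), "x");
        Files.writeString(repo.resolve("index.latest"), "x");
        assertFileCount(repo, 2); // matches, no failure
        try {
            assertFileCount(repo, 3); // fails and names both files
        } catch (AssertionError e) {
            System.out.println("caught: " + e.getMessage().contains("index-0"));
        }
    }
}
```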

original-brownbear added further commits (the merge and backports of the change above) that referenced this issue Mar 11, 2019, each with the same message as above.
@jaymode
Member

jaymode commented Mar 18, 2019

This test failed again, https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=openjdk12,nodes=immutable&&linux&&docker/81/console. It does not reproduce locally for me.

Reproduce line:

./gradlew :server:integTest \
  -Dtests.seed=3584DAFDBBDAC378 \
  -Dtests.class=org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT \
  -Dtests.method="testSnapshotWithStuckNode" \
  -Dtests.security.manager=true \
  -Dtests.locale=bm-ML \
  -Dtests.timezone=Asia/Hovd \
  -Dcompiler.java=12 \
  -Druntime.java=12

The failure with the additional output added by @original-brownbear is:

java.lang.AssertionError: Unexpected file count, found: [[/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/J5/temp/org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT_3584DAFDBBDAC378-001/tempDir-008/repos/tsrjWCInwK/index-0, /var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/J5/temp/org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT_3584DAFDBBDAC378-001/tempDir-008/repos/tsrjWCInwK/index-1, /var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/J5/temp/org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT_3584DAFDBBDAC378-001/tempDir-008/repos/tsrjWCInwK/index.latest, /var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/J5/temp/org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT_3584DAFDBBDAC378-001/tempDir-008/repos/tsrjWCInwK/indices/j_HRCLsWRiyTc5QZyQR9hQ/0/__0, /var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/J5/temp/org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT_3584DAFDBBDAC378-001/tempDir-008/repos/tsrjWCInwK/incompatible-snapshots]]. expected:<4> but was:<5>
	at __randomizedtesting.SeedInfo.seed([3584DAFDBBDAC378:FE611D8C866C8717]:0)
	at org.junit.Assert.fail(Assert.java:88)
	at org.junit.Assert.failNotEquals(Assert.java:834)
	at org.junit.Assert.assertEquals(Assert.java:645)
	at org.elasticsearch.snapshots.AbstractSnapshotIntegTestCase.assertFileCount(AbstractSnapshotIntegTestCase.java:87)
	at org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT.testSnapshotWithStuckNode(DedicatedClusterSnapshotRestoreIT.java:493)

Output from testSnapshotWithStuckNode:

  1> [2019-03-18T23:49:36,462][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] before test
  1> [2019-03-18T23:49:36,462][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] [DedicatedClusterSnapshotRestoreIT#testSnapshotWithStuckNode]: setting up test
  1> [2019-03-18T23:49:36,463][INFO ][o.e.t.InternalTestCluster] [testSnapshotWithStuckNode] Setup InternalTestCluster [TEST-CHILD_VM=[5]-CLUSTER_SEED=[-1333041512154370405]-HASH=[226C51948EB]-cluster] with seed [ED8015CEB5306E9B] using [0] dedicated masters, [0] (data) nodes and [0] coord only nodes (min_master_nodes are [auto-managed])
  1> [2019-03-18T23:49:36,463][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] [DedicatedClusterSnapshotRestoreIT#testSnapshotWithStuckNode]: all set up test
  1> [2019-03-18T23:49:36,463][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] --> start 2 nodes
  1> [2019-03-18T23:49:36,479][INFO ][o.e.e.NodeEnvironment    ] [testSnapshotWithStuckNode] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [252.4gb], net total_space [295.2gb], types [ext4]
  1> [2019-03-18T23:49:36,479][INFO ][o.e.e.NodeEnvironment    ] [testSnapshotWithStuckNode] heap size [512mb], compressed ordinary object pointers [true]
  1> [2019-03-18T23:49:36,484][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] node name [node_t0], node ID [a9IZPJC5TF-pc-AY-fVYVA]
  1> [2019-03-18T23:49:36,484][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] version[7.1.0], pid[26969], build[unknown/unknown/Unknown/Unknown], OS[Linux/4.4.0-142-generic/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/12-ea/12-ea+30]
  1> [2019-03-18T23:49:36,484][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] JVM Home [/var/lib/jenkins/.java/openjdk-12+30-linux]
  1> [2019-03-18T23:49:36,484][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] JVM arguments [-Dfile.encoding=UTF8, -ea, -esa, -Xmx512m, -Xms512m, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/heapdump, --illegal-access=warn, -Dtests.prefix=tests, -Dtests.seed=3584DAFDBBDAC378, -Druntime.java=12, -Des.scripting.update.ctx_in_params=false, -Dtests.security.manager=true, -Djava.io.tmpdir=./temp, -Dtests.gradle=true, -Dtests.artifact=server, -Djava.awt.headless=true, -Dcompiler.java=12, -Dtests.task=:server:integTest, -Djna.nosys=true, -Dtests.logger.level=WARN, -Djunit4.childvm.cwd=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/J5, -Djunit4.tempDir=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/temp, -Djunit4.childvm.id=5, -Djunit4.childvm.count=8]
  1> [2019-03-18T23:49:36,486][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] no modules loaded
  1> [2019-03-18T23:49:36,486][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$BrokenSettingPlugin]
  1> [2019-03-18T23:49:36,486][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$TestCustomMetaDataPlugin]
  1> [2019-03-18T23:49:36,486][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.snapshots.mockstore.MockRepository$Plugin]
  1> [2019-03-18T23:49:36,486][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.test.ESIntegTestCase$AssertActionNamePlugin]
  1> [2019-03-18T23:49:36,486][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.test.ESIntegTestCase$TestSeedPlugin]
  1> [2019-03-18T23:49:36,486][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.test.MockHttpTransport$TestPlugin]
  1> [2019-03-18T23:49:36,486][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2019-03-18T23:49:36,521][INFO ][o.e.d.DiscoveryModule    ] [testSnapshotWithStuckNode] using discovery type [zen] and seed hosts providers [settings, file]
  1> [2019-03-18T23:49:36,538][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] initialized
  1> [2019-03-18T23:49:36,539][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[5]-CLUSTER_SEED=[-1333041512154370405]-HASH=[226C51948EB]-cluster[T#1]]] starting ...
  1> [2019-03-18T23:49:36,543][INFO ][o.e.t.TransportService   ] [[test_TEST-CHILD_VM=[5]-CLUSTER_SEED=[-1333041512154370405]-HASH=[226C51948EB]-cluster[T#1]]] publish_address {127.0.0.1:36099}, bound_addresses {[::1]:39781}, {127.0.0.1:36099}
  1> [2019-03-18T23:49:36,546][INFO ][o.e.c.c.Coordinator      ] [[test_TEST-CHILD_VM=[5]-CLUSTER_SEED=[-1333041512154370405]-HASH=[226C51948EB]-cluster[T#1]]] setting initial configuration to VotingConfiguration{a9IZPJC5TF-pc-AY-fVYVA}
  1> [2019-03-18T23:49:36,662][INFO ][o.e.c.s.MasterService    ] [node_t0] elected-as-master ([1] nodes joined)[{node_t0}{a9IZPJC5TF-pc-AY-fVYVA}{0vlPLqPITuKm1QJAX75V4w}{127.0.0.1}{127.0.0.1:36099} elect leader, _BECOME_MASTER_TASK_, _FINISH_ELECTION_], term: 1, version: 1, reason: master node changed {previous [], current [{node_t0}{a9IZPJC5TF-pc-AY-fVYVA}{0vlPLqPITuKm1QJAX75V4w}{127.0.0.1}{127.0.0.1:36099}]}
  1> [2019-03-18T23:49:36,723][INFO ][o.e.c.s.ClusterApplierService] [node_t0] master node changed {previous [], current [{node_t0}{a9IZPJC5TF-pc-AY-fVYVA}{0vlPLqPITuKm1QJAX75V4w}{127.0.0.1}{127.0.0.1:36099}]}, term: 1, version: 1, reason: Publication{term=1, version=1}
  1> [2019-03-18T23:49:36,724][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[5]-CLUSTER_SEED=[-1333041512154370405]-HASH=[226C51948EB]-cluster[T#1]]] started
  1> [2019-03-18T23:49:36,745][INFO ][o.e.g.GatewayService     ] [node_t0] recovered [0] indices into cluster_state
  1> [2019-03-18T23:49:36,750][INFO ][o.e.e.NodeEnvironment    ] [testSnapshotWithStuckNode] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [252.4gb], net total_space [295.2gb], types [ext4]
  1> [2019-03-18T23:49:36,751][INFO ][o.e.e.NodeEnvironment    ] [testSnapshotWithStuckNode] heap size [512mb], compressed ordinary object pointers [true]
  1> [2019-03-18T23:49:36,758][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] node name [node_t1], node ID [aCk35BpAT8atIDo0bV4Dyw]
  1> [2019-03-18T23:49:36,759][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] version[7.1.0], pid[26969], build[unknown/unknown/Unknown/Unknown], OS[Linux/4.4.0-142-generic/amd64], JVM[Oracle Corporation/OpenJDK 64-Bit Server VM/12-ea/12-ea+30]
  1> [2019-03-18T23:49:36,759][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] JVM Home [/var/lib/jenkins/.java/openjdk-12+30-linux]
  1> [2019-03-18T23:49:36,759][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] JVM arguments [-Dfile.encoding=UTF8, -ea, -esa, -Xmx512m, -Xms512m, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/heapdump, --illegal-access=warn, -Dtests.prefix=tests, -Dtests.seed=3584DAFDBBDAC378, -Druntime.java=12, -Des.scripting.update.ctx_in_params=false, -Dtests.security.manager=true, -Djava.io.tmpdir=./temp, -Dtests.gradle=true, -Dtests.artifact=server, -Djava.awt.headless=true, -Dcompiler.java=12, -Dtests.task=:server:integTest, -Djna.nosys=true, -Dtests.logger.level=WARN, -Djunit4.childvm.cwd=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/J5, -Djunit4.tempDir=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/openjdk12/nodes/immutable&&linux&&docker/server/build/testrun/integTest/temp, -Djunit4.childvm.id=5, -Djunit4.childvm.count=8]
  1> [2019-03-18T23:49:36,760][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] no modules loaded
  1> [2019-03-18T23:49:36,760][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$BrokenSettingPlugin]
  1> [2019-03-18T23:49:36,760][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$TestCustomMetaDataPlugin]
  1> [2019-03-18T23:49:36,760][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.snapshots.mockstore.MockRepository$Plugin]
  1> [2019-03-18T23:49:36,760][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.test.ESIntegTestCase$AssertActionNamePlugin]
  1> [2019-03-18T23:49:36,760][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.test.ESIntegTestCase$TestSeedPlugin]
  1> [2019-03-18T23:49:36,760][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.test.MockHttpTransport$TestPlugin]
  1> [2019-03-18T23:49:36,760][INFO ][o.e.p.PluginsService     ] [testSnapshotWithStuckNode] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2019-03-18T23:49:36,813][INFO ][o.e.d.DiscoveryModule    ] [testSnapshotWithStuckNode] using discovery type [zen] and seed hosts providers [settings, file]
  1> [2019-03-18T23:49:36,833][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] initialized
  1> [2019-03-18T23:49:36,837][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[5]-CLUSTER_SEED=[-1333041512154370405]-HASH=[226C51948EB]-cluster[T#2]]] starting ...
  1> [2019-03-18T23:49:36,841][INFO ][o.e.t.TransportService   ] [[test_TEST-CHILD_VM=[5]-CLUSTER_SEED=[-1333041512154370405]-HASH=[226C51948EB]-cluster[T#2]]] publish_address {127.0.0.1:42438}, bound_addresses {[::1]:43774}, {127.0.0.1:42438}
  1> [2019-03-18T23:49:36,937][INFO ][o.e.c.s.MasterService    ] [node_t0] node-join[{node_t1}{aCk35BpAT8atIDo0bV4Dyw}{pBM7_6NzQHCe--jaEOMuAA}{127.0.0.1}{127.0.0.1:42438} join existing leader], term: 1, version: 3, reason: added {{node_t1}{aCk35BpAT8atIDo0bV4Dyw}{pBM7_6NzQHCe--jaEOMuAA}{127.0.0.1}{127.0.0.1:42438},}
  1> [2019-03-18T23:49:36,985][INFO ][o.e.c.s.ClusterApplierService] [node_t1] master node changed {previous [], current [{node_t0}{a9IZPJC5TF-pc-AY-fVYVA}{0vlPLqPITuKm1QJAX75V4w}{127.0.0.1}{127.0.0.1:36099}]}, added {{node_t0}{a9IZPJC5TF-pc-AY-fVYVA}{0vlPLqPITuKm1QJAX75V4w}{127.0.0.1}{127.0.0.1:36099},}, term: 1, version: 3, reason: ApplyCommitRequest{term=1, version=3, sourceNode={node_t0}{a9IZPJC5TF-pc-AY-fVYVA}{0vlPLqPITuKm1QJAX75V4w}{127.0.0.1}{127.0.0.1:36099}}
  1> [2019-03-18T23:49:36,990][INFO ][o.e.n.Node               ] [[test_TEST-CHILD_VM=[5]-CLUSTER_SEED=[-1333041512154370405]-HASH=[226C51948EB]-cluster[T#2]]] started
  1> [2019-03-18T23:49:36,991][INFO ][o.e.c.s.ClusterApplierService] [node_t0] added {{node_t1}{aCk35BpAT8atIDo0bV4Dyw}{pBM7_6NzQHCe--jaEOMuAA}{127.0.0.1}{127.0.0.1:42438},}, term: 1, version: 3, reason: Publication{term=1, version=3}
  1> [2019-03-18T23:49:37,034][INFO ][o.e.c.m.MetaDataCreateIndexService] [node_t0] [test-idx] creating index, cause [api], templates [], shards [2]/[0], mappings []
  1> [2019-03-18T23:49:37,373][INFO ][o.e.c.r.a.AllocationService] [node_t0] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[test-idx][1]] ...]).
  1> [2019-03-18T23:49:37,424][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] --> indexing some data
  1> [2019-03-18T23:49:37,430][INFO ][o.e.c.m.MetaDataMappingService] [node_t0] [test-idx/gnYgaDbeRIW7DwHa9KL1vg] create_mapping [doc]
  1> [2019-03-18T23:49:38,020][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] --> creating repository
  1> [2019-03-18T23:49:38,021][INFO ][o.e.s.m.MockRepository   ] [testSnapshotWithStuckNode] starting mock repository with random prefix ZHXZJJQppL
  1> [2019-03-18T23:49:38,022][INFO ][o.e.r.RepositoriesService] [node_t0] put repository [test-repo]
  1> [2019-03-18T23:49:38,067][INFO ][o.e.s.m.MockRepository   ] [node_t1] starting mock repository with random prefix ZHXZJJQppL
  1> [2019-03-18T23:49:38,070][INFO ][o.e.s.m.MockRepository   ] [node_t0] starting mock repository with random prefix ZHXZJJQppL
  1> [2019-03-18T23:49:38,119][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] --> snapshot
  1> [2019-03-18T23:49:38,163][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:test-snap/X9-2vaDMQw6Zlg9smbRRQg] started
  1> [2019-03-18T23:49:38,189][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] --> waiting for block to kick in
  1> [2019-03-18T23:49:38,257][INFO ][o.e.s.m.MockRepository   ] [node_t1] [test-repo] blocking I/O operation for file [__0] at path [[indices][j_HRCLsWRiyTc5QZyQR9hQ][0]]
  1> [2019-03-18T23:49:38,290][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] --> execution was blocked on node [node_t1], aborting snapshot
  1> [2019-03-18T23:49:38,383][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:test-snap/X9-2vaDMQw6Zlg9smbRRQg] completed with state [FAILED]
  1> [2019-03-18T23:49:38,392][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] --> stopping node [node_t1]
  1> [2019-03-18T23:49:38,393][INFO ][o.e.t.InternalTestCluster] [testSnapshotWithStuckNode] Closing filtered random node [node_t1] 
  1> [2019-03-18T23:49:38,393][INFO ][o.e.t.InternalTestCluster] [testSnapshotWithStuckNode] adding voting config exclusions [node_t1] prior to restart/shutdown
  1> [2019-03-18T23:49:38,539][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] stopping ...
  1> [2019-03-18T23:49:38,551][INFO ][o.e.c.s.MasterService    ] [node_t0] node-left[{node_t1}{aCk35BpAT8atIDo0bV4Dyw}{pBM7_6NzQHCe--jaEOMuAA}{127.0.0.1}{127.0.0.1:42438} disconnected], term: 1, version: 19, reason: removed {{node_t1}{aCk35BpAT8atIDo0bV4Dyw}{pBM7_6NzQHCe--jaEOMuAA}{127.0.0.1}{127.0.0.1:42438},}
  1> [2019-03-18T23:49:38,591][INFO ][o.e.c.s.ClusterApplierService] [node_t0] removed {{node_t1}{aCk35BpAT8atIDo0bV4Dyw}{pBM7_6NzQHCe--jaEOMuAA}{127.0.0.1}{127.0.0.1:42438},}, term: 1, version: 19, reason: Publication{term=1, version=19}
  1> [2019-03-18T23:49:38,594][WARN ][o.e.s.SnapshotShardsService] [node_t1] [[test-idx][0]][test-repo:test-snap/X9-2vaDMQw6Zlg9smbRRQg] failed to snapshot shard
  1> org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.checkAborted(BlobStoreRepository.java:1408) ~[main/:?]
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1401) ~[main/:?]
  1> 	at java.io.InputStream.transferTo(InputStream.java:771) ~[?:?]
  1> 	at org.elasticsearch.core.internal.io.Streams.copy(Streams.java:47) ~[elasticsearch-core-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.common.blobstore.fs.FsBlobContainer.writeBlob(FsBlobContainer.java:133) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.mockstore.BlobContainerWrapper.writeBlob(BlobContainerWrapper.java:53) ~[test/:?]
  1> 	at org.elasticsearch.snapshots.mockstore.MockRepository$MockBlobStore$MockBlobContainer.writeBlob(MockRepository.java:349) ~[test/:?]
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1335) ~[main/:?]
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1274) ~[main/:?]
  1> 	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:846) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:368) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:93) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:314) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:747) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:835) [?:?]
  1> [2019-03-18T23:49:38,598][WARN ][o.e.s.SnapshotShardsService] [node_t1] [test-repo:test-snap/X9-2vaDMQw6Zlg9smbRRQg] [ShardSnapshotStatus[state=FAILED, nodeId=aCk35BpAT8atIDo0bV4Dyw, reason=IndexShardSnapshotFailedException[Aborted]]] failed to update snapshot state
  1> org.elasticsearch.transport.SendRequestTransportException: [node_t1][127.0.0.1:42438][internal:cluster/snapshot/update_snapshot_status]
  1> 	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:639) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:542) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:517) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.lambda$sendSnapshotShardUpdate$1(SnapshotShardsService.java:507) ~[main/:?]
  1> 	at org.elasticsearch.transport.TransportRequestDeduplicator.executeOnce(TransportRequestDeduplicator.java:52) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.sendSnapshotShardUpdate(SnapshotShardsService.java:493) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.notifyFailedSnapshotShard(SnapshotShardsService.java:488) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService.access$400(SnapshotShardsService.java:93) ~[main/:?]
  1> 	at org.elasticsearch.snapshots.SnapshotShardsService$1.onAfter(SnapshotShardsService.java:332) ~[main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onAfter(ThreadContext.java:722) [main/:?]
  1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:41) [main/:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
  1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
  1> 	at java.lang.Thread.run(Thread.java:835) [?:?]
  1> Caused by: org.elasticsearch.transport.TransportException: TransportService is closed stopped can't send request
  1> 	at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:621) ~[main/:?]
  1> 	... 13 more
  1> [2019-03-18T23:49:38,603][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] stopped
  1> [2019-03-18T23:49:38,603][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] closing ...
  1> [2019-03-18T23:49:38,611][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] closed
  1> [2019-03-18T23:49:38,611][INFO ][o.e.t.InternalTestCluster] [testSnapshotWithStuckNode] removing voting config exclusions for [node_t1] after restart/shutdown
  1> [2019-03-18T23:49:38,653][INFO ][o.e.s.SnapshotsService   ] [node_t0] snapshot [test-repo:test-snap/X9-2vaDMQw6Zlg9smbRRQg] deleted
  1> [2019-03-18T23:49:38,669][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] --> making sure that snapshot no longer exists
  1> [2019-03-18T23:49:38,682][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] [DedicatedClusterSnapshotRestoreIT#testSnapshotWithStuckNode]: cleaning up after test
  1> [2019-03-18T23:49:38,697][INFO ][o.e.c.m.MetaDataDeleteIndexService] [node_t0] [test-idx/gnYgaDbeRIW7DwHa9KL1vg] deleting index
  1> [2019-03-18T23:49:38,774][INFO ][o.e.r.RepositoriesService] [node_t0] delete repository [test-repo]
  1> [2019-03-18T23:49:38,815][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] stopping ...
  1> [2019-03-18T23:49:38,817][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] stopped
  1> [2019-03-18T23:49:38,817][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] closing ...
  1> [2019-03-18T23:49:38,820][INFO ][o.e.n.Node               ] [testSnapshotWithStuckNode] closed
  1> [2019-03-18T23:49:38,820][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] [DedicatedClusterSnapshotRestoreIT#testSnapshotWithStuckNode]: cleaned up after test
  1> [2019-03-18T23:49:38,820][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testSnapshotWithStuckNode] after test

@original-brownbear
Member

Thanks @jaymode! Looks like we're leaving a dangling data file behind here. That's kind of interesting:

repos/tsrjWCInwK/indices/j_HRCLsWRiyTc5QZyQR9hQ/0/__0

shouldn't be there. Looking into it shortly.

@original-brownbear
Member

I tracked this down now. The problem is that we are deleting a snapshot while a data node is still writing shard files (when it should already have aborted).
If I artificially slow down the delete on the master, it fails every time, because the data node writes the extra shard file repos/tsrjWCInwK/indices/j_HRCLsWRiyTc5QZyQR9hQ/0/__0 before the master node tries to delete the index's folder. The only reason this test normally passes is that the master node deletes the index folder before the data node writes the data file, so the data node simply errors out on the write.
=> this makes the whole test somewhat questionable to begin with, since the behavior seen here does not apply to e.g. S3, where writing a blob to any path succeeds without any check for a "parent" path, since there is no such thing in S3.
I'll look into ways to stabilize the situation without a big change, but ideally we want some stronger retrying on deletes here, for which #40144 would be a prerequisite imo.
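The race described above can be illustrated with plain `java.nio.file` calls (a hypothetical standalone demo, not Elasticsearch code): on a local filesystem, a write into a directory that has already been deleted fails, which is the only reason the test usually passes; an object store like S3 has no parent directories, so the equivalent "write after delete" would succeed and leave a dangling blob.

```java
import java.io.IOException;
import java.nio.file.*;

public class WriteAfterDeleteRace {
    public static void main(String[] args) throws IOException {
        Path shardDir = Files.createDirectories(
            Files.createTempDirectory("repo").resolve("indices/someIndex/0"));

        // Master side: delete the shard's folder first ...
        Files.delete(shardDir); // directory is still empty, so a plain delete suffices

        // Data node side: ... then try to write the shard data file __0.
        try {
            Files.write(shardDir.resolve("__0"), new byte[]{1, 2, 3});
            System.out.println("write succeeded (S3-like semantics, dangling blob)");
        } catch (IOException e) {
            System.out.println("write failed: parent directory is gone");
        }
    }
}
```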

@jtibshirani
Contributor

It sounds like @original-brownbear has enough logging to identify the issue, so I've muted the test in 1de2a25 while we work on a fix.

@droberts195
Contributor Author

@original-brownbear
Member

Yeah, this unfortunately is a pretty fundamental issue with the way the blob store repository works. A fix is incoming in #42189, but until that is in we should probably just mute this test wherever it comes up.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Jun 24, 2019
* See comment in the test: The problem is that when the snapshot delete only partially succeeds on master failover and the retry fails with `SnapshotMissingException`, no repository cleanup is run => we still failed even with the repo cleanup logic now in the delete path
   * Fixed the test by rerunning a create-snapshot-and-delete loop to clean up the repo before verifying file counts
* Closes elastic#39852
original-brownbear added a commit that referenced this issue Jun 25, 2019
* Fix DedicatedClusterSnapshotRestoreIT testSnapshotWithStuckNode

* See comment in the test: The problem is that when the snapshot delete only partially succeeds on master failover and the retry fails with `SnapshotMissingException`, no repository cleanup is run => we still failed even with the repo cleanup logic now in the delete path
   * Fixed the test by rerunning a create-snapshot-and-delete loop to clean up the repo before verifying file counts
* Closes #39852
jkakavas and original-brownbear added backport commits (including #44082) that referenced this issue Jun 27 – Jul 9, 2019, each with the same message as above.
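The test-side workaround from the fix commit can be sketched abstractly (hypothetical code, not the actual test): instead of asserting the file count after the first delete, rerun a create-snapshot-and-delete cycle until the repository is clean, since a fresh delete triggers the repository cleanup logic that removes files left behind by the earlier partial delete.

```java
import java.util.function.IntSupplier;

public class RepoCleanupLoop {
    // Retry a snapshot create+delete cycle until the repository holds only the
    // expected files, then let the real assertion run.
    static void cleanUpAndVerify(Runnable createAndDeleteSnapshot,
                                 IntSupplier fileCount, int expected, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (fileCount.getAsInt() == expected) {
                return; // repository is clean
            }
            // A full create + delete cycle runs the cleanup logic in the delete
            // path, removing leftovers from the earlier partial delete.
            createAndDeleteSnapshot.run();
        }
        throw new AssertionError("repository still dirty after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) {
        int[] files = {5};                    // simulate one leftover file (5 instead of 4)
        cleanUpAndVerify(() -> files[0] = 4,  // the cycle cleans up the extra file
                         () -> files[0], 4, 3);
        System.out.println("repository clean");
    }
}
```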
6 participants