New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] SourceOnlySnapshotIT testSnapshotAndRestore fails reproducibly #36330

Open
jdconrad opened this Issue Dec 6, 2018 · 7 comments

Comments

Projects
None yet
4 participants
@jdconrad
Contributor

jdconrad commented Dec 6, 2018

I was able to reproduce this locally. Looks like a node wasn't able to start.

REPRODUCE WITH:

./gradlew :x-pack:plugin:core:internalClusterTest \
  -Dtests.seed=3F6F96908F439A68 \
  -Dtests.class=org.elasticsearch.snapshots.SourceOnlySnapshotIT \
  -Dtests.method="testSnapshotAndRestore" \
  -Dtests.security.manager=true \
  -Dtests.locale=fi \
  -Dtests.timezone=Africa/Lome \
  -Dcompiler.java=11 \
  -Druntime.java=8

CONSOLE:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/564/console

STACK TRACE:

FAILURE 38.2s | SourceOnlySnapshotIT.testSnapshotAndRestore <<< FAILURES!
11:07:27    > Throwable #1: java.lang.AssertionError: timed out waiting for green state
11:07:27    > 	at __randomizedtesting.SeedInfo.seed([3F6F96908F439A68:FF08F5E3528C675F]:0)
11:07:27    > 	at org.elasticsearch.test.ESIntegTestCase.ensureColor(ESIntegTestCase.java:970)
11:07:27    > 	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:926)
11:07:27    > 	at org.elasticsearch.test.ESIntegTestCase.ensureGreen(ESIntegTestCase.java:915)
11:07:27    > 	at org.elasticsearch.snapshots.SourceOnlySnapshotIT.testSnapshotAndRestore(SourceOnlySnapshotIT.java:121)
11:07:27    > 	at java.lang.Thread.run(Thread.java:748)
11:07:27   1> [2018-12-06T19:07:27,659][INFO ][o.e.n.Node               ] [suite] stopping ...
11:07:27   1> [2018-12-06T19:07:27,662][INFO ][o.e.c.s.ClusterApplierService] [node_sm2] master node changed {previous [{node_s0}{7y2b9LZYSOqyP_0ClCkBZw}{OdTbtFHyReuXaQwZ6a8Idg}{127.0.0.1}{127.0.0.1:44305}], current []}, reason: becoming candidate: onLeaderFailure
11:07:27   1> [2018-12-06T19:07:27,666][INFO ][o.e.c.s.ClusterApplierService] [node_sd4] master node changed {previous [{node_s0}{7y2b9LZYSOqyP_0ClCkBZw}{OdTbtFHyReuXaQwZ6a8Idg}{127.0.0.1}{127.0.0.1:44305}], current []}, reason: becoming candidate: onLeaderFailure
11:07:27   1> [2018-12-06T19:07:27,662][INFO ][o.e.c.s.ClusterApplierService] [node_sc1] master node changed {previous [{node_s0}{7y2b9LZYSOqyP_0ClCkBZw}{OdTbtFHyReuXaQwZ6a8Idg}{127.0.0.1}{127.0.0.1:44305}], current []}, reason: becoming candidate: onLeaderFailure
11:07:27   1> [2018-12-06T19:07:27,665][INFO ][o.e.n.Node               ] [suite] stopped
11:07:27   1> [2018-12-06T19:07:27,668][INFO ][o.e.n.Node               ] [suite] closing ...
11:07:27   1> [2018-12-06T19:07:27,675][WARN ][o.e.c.NodeConnectionsService] [node_sm2] failed to connect to node {node_s0}{7y2b9LZYSOqyP_0ClCkBZw}{OdTbtFHyReuXaQwZ6a8Idg}{127.0.0.1}{127.0.0.1:44305} (tried [1] times)
11:07:27   1> org.elasticsearch.transport.ConnectTransportException: [node_s0][127.0.0.1:44305] connect_exception
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1628) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:98) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:778) ~[?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2140) ~[?:1.8.0_192]
11:07:27   1> 	at org.elasticsearch.common.concurrent.CompletableContext.addListener(CompletableContext.java:45) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.SocketChannelContext.addConnectListener(SocketChannelContext.java:82) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSocketChannel.addConnectListener(NioSocketChannel.java:77) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.nio.MockNioTransport$MockSocketChannel.addConnectListener(MockNioTransport.java:302) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport.initiateConnection(TcpTransport.java:394) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:361) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:223) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:109) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:372) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:359) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:109) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
11:07:27   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
11:07:27   1> Caused by: java.net.ConnectException: Connection refused
11:07:27   1> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
11:07:27   1> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
11:07:27   1> 	at org.elasticsearch.nio.SocketChannelContext.connect(SocketChannelContext.java:116) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.EventHandler.handleConnect(EventHandler.java:97) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.nio.TestingSocketEventHandler.handleConnect(TestingSocketEventHandler.java:43) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.attemptConnect(NioSelector.java:363) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.setUpNewChannels(NioSelector.java:380) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.preSelect(NioSelector.java:230) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:133) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:109) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	... 1 more
11:07:27   1> [2018-12-06T19:07:27,676][WARN ][o.e.c.NodeConnectionsService] [node_sc1] failed to connect to node {node_s0}{7y2b9LZYSOqyP_0ClCkBZw}{OdTbtFHyReuXaQwZ6a8Idg}{127.0.0.1}{127.0.0.1:44305} (tried [1] times)
11:07:27   1> org.elasticsearch.transport.ConnectTransportException: [node_s0][127.0.0.1:44305] connect_exception
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1628) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:98) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:778) ~[?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2140) ~[?:1.8.0_192]
11:07:27   1> 	at org.elasticsearch.common.concurrent.CompletableContext.addListener(CompletableContext.java:45) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.SocketChannelContext.addConnectListener(SocketChannelContext.java:82) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSocketChannel.addConnectListener(NioSocketChannel.java:77) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.nio.MockNioTransport$MockSocketChannel.addConnectListener(MockNioTransport.java:302) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport.initiateConnection(TcpTransport.java:394) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:361) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:223) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:109) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:372) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:359) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:109) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
11:07:27   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
11:07:27   1> Caused by: java.net.ConnectException: Connection refused
11:07:27   1> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
11:07:27   1> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
11:07:27   1> 	at org.elasticsearch.nio.SocketChannelContext.connect(SocketChannelContext.java:116) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.EventHandler.handleConnect(EventHandler.java:97) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.nio.TestingSocketEventHandler.handleConnect(TestingSocketEventHandler.java:43) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.attemptConnect(NioSelector.java:363) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.setUpNewChannels(NioSelector.java:380) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.preSelect(NioSelector.java:230) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:133) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:109) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	... 1 more
11:07:27   1> [2018-12-06T19:07:27,678][WARN ][o.e.c.NodeConnectionsService] [node_sd4] failed to connect to node {node_s0}{7y2b9LZYSOqyP_0ClCkBZw}{OdTbtFHyReuXaQwZ6a8Idg}{127.0.0.1}{127.0.0.1:44305} (tried [1] times)
11:07:27   1> org.elasticsearch.transport.ConnectTransportException: [node_s0][127.0.0.1:44305] connect_exception
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener.onFailure(TcpTransport.java:1628) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:98) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:42) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:778) ~[?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2140) ~[?:1.8.0_192]
11:07:27   1> 	at org.elasticsearch.common.concurrent.CompletableContext.addListener(CompletableContext.java:45) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.SocketChannelContext.addConnectListener(SocketChannelContext.java:82) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSocketChannel.addConnectListener(NioSocketChannel.java:77) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.nio.MockNioTransport$MockSocketChannel.addConnectListener(MockNioTransport.java:302) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport.initiateConnection(TcpTransport.java:394) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:361) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.ConnectionManager.internalOpenConnection(ConnectionManager.java:223) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.ConnectionManager.connectToNode(ConnectionManager.java:109) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:372) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:359) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.cluster.NodeConnectionsService$1.doRun(NodeConnectionsService.java:109) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
11:07:27   1> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
11:07:27   1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
11:07:27   1> Caused by: java.net.ConnectException: Connection refused
11:07:27   1> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) ~[?:?]
11:07:27   1> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717) ~[?:?]
11:07:27   1> 	at org.elasticsearch.nio.SocketChannelContext.connect(SocketChannelContext.java:116) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.EventHandler.handleConnect(EventHandler.java:97) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.transport.nio.TestingSocketEventHandler.handleConnect(TestingSocketEventHandler.java:43) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.attemptConnect(NioSelector.java:363) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.setUpNewChannels(NioSelector.java:380) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.preSelect(NioSelector.java:230) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:133) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:109) ~[elasticsearch-nio-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
11:07:27   1> 	... 1 more```
@elasticmachine

This comment has been minimized.

elasticmachine commented Dec 6, 2018

@tbrooks8

This comment has been minimized.

Contributor

tbrooks8 commented Dec 6, 2018

This seems like it might be similar to #36276?

At least same test suite and started failing recently.

@jdconrad

This comment has been minimized.

Contributor

jdconrad commented Dec 6, 2018

@tbrooks8 I looked at the other issue as well, and maybe it's the same problem, but definitely different error messages I believe.

@tbrooks8

This comment has been minimized.

Contributor

tbrooks8 commented Dec 7, 2018

After some investigation I think the issue is the same as #36276. There are stack traces that come before the timeout waiting for green:

[2018-12-06T23:29:17,024][WARN ][o.e.i.r.PeerRecoveryTargetService] [node_s0] error while listing local files, recovering as if there are none
org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /private/var/folders/vj/_qsxqj6n6z14dpr0crpw2rbr0000gn/T/org.elasticsearch.snapshots.SourceOnlySnapshotIT_3F6F96908F439A68-058/tempDir-002/custom/XVxgoOZJNO/0/zxnpJEqtTCa3lzR8Ud7Z0Q/0/index/write.lock
	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.MockDirectoryWrapper.obtainLock(MockDirectoryWrapper.java:1046) ~[lucene-test-framework-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.elasticsearch.index.store.Store.getMetadata(Store.java:291) ~[main/:?]
	at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:1143) ~[main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.getStoreMetadataSnapshot(PeerRecoveryTargetService.java:300) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.getStartRecoveryRequest(PeerRecoveryTargetService.java:321) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:182) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$500(PeerRecoveryTargetService.java:82) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:633) [main/:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [main/:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Additionally, I think I have isolated what is the underlying issue. Although I do not yet know that is causing it.

What I see is that we start a recovery from snapshot.

java.lang.Throwable: start snapshot recovery
	at org.elasticsearch.index.shard.IndexShard.startRecovery(IndexShard.java:2138)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:611)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:155)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:560)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:537)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:234)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:483)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:480)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:162)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

This obtains a filesystem lock in the ReadOnlyEngine

java.lang.Throwable: Obtain
	at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:105)
	at org.elasticsearch.snapshots.SourceOnlySnapshotRepository.lambda$getEngineFactory$1(SourceOnlySnapshotRepository.java:149)
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2251)
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2233)
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1389)
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1346)
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:474)
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:279)
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:302)
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:277)
	at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1631)
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$7(IndexShard.java:2142)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

This lock will be held until ReadOnlyEngine#closeNoLock is called. That will happen when the engine is closed or failed.

In the test we update the number of replicas.

[2018-12-06T23:41:23,940][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node_s0] updating number_of_replicas to [1] for indices [test-idx]

At this point I see another recovery start. However, this is a PEER recovery. The ReadOnlyEngine is still open and holding the filesystem lock.

java.lang.Throwable: start Recovery
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.startRecovery(PeerRecoveryTargetService.java:145)
	at org.elasticsearch.index.shard.IndexShard.startRecovery(IndexShard.java:2128)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:611)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:155)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:560)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:537)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:234)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:483)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:480)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:162)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

The peer recovery throws an exception when it goes to get the Store.MetadataSnapshot for the index.

org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /private/var/folders/vj/_qsxqj6n6z14dpr0crpw2rbr0000gn/T/org.elasticsearch.snapshots.SourceOnlySnapshotIT_3F6F96908F439A68-061/tempDir-002/custom/XVxgoOZJNO/0/T29u63acR4erreBcU8dteA/0/index/write.lock
	at org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:139) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:41) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:45) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.MockDirectoryWrapper.obtainLock(MockDirectoryWrapper.java:1046) ~[lucene-test-framework-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.apache.lucene.store.FilterDirectory.obtainLock(FilterDirectory.java:105) ~[lucene-core-8.0.0-snapshot-c78429a554.jar:8.0.0-snapshot-c78429a554 c78429a554d28611dacd90c388e6c34039b228d1 - romseygeek - 2018-12-04 10:17:44]
	at org.elasticsearch.index.store.Store.getMetadata(Store.java:291) ~[main/:?]
	at org.elasticsearch.index.shard.IndexShard.snapshotStoreMetadata(IndexShard.java:1143) ~[main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.getStoreMetadataSnapshot(PeerRecoveryTargetService.java:300) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.getStartRecoveryRequest(PeerRecoveryTargetService.java:321) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.doRecovery(PeerRecoveryTargetService.java:182) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.access$500(PeerRecoveryTargetService.java:82) [main/:?]
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService$RecoveryRunner.doRun(PeerRecoveryTargetService.java:633) [main/:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:759) [main/:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]

If I remove one case of randomBoolean() call in the test (and set the value directly to what it is when random is called), the test passes reliably.

I see the recovery from snapshot that opens a ReadOnlyEngine.

java.lang.Throwable: start snapshot recovery
	at org.elasticsearch.index.shard.IndexShard.startRecovery(IndexShard.java:2138)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:611)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:155)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:560)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:537)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:234)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:483)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:480)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:162)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

java.lang.Throwable: Obtain
	at org.elasticsearch.index.engine.ReadOnlyEngine.<init>(ReadOnlyEngine.java:105)
	at org.elasticsearch.snapshots.SourceOnlySnapshotRepository.lambda$getEngineFactory$1(SourceOnlySnapshotRepository.java:149)
	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:2251)
	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:2233)
	at org.elasticsearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1389)
	at org.elasticsearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1346)
	at org.elasticsearch.index.shard.StoreRecovery.restore(StoreRecovery.java:474)
	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromRepository$5(StoreRecovery.java:279)
	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:302)
	at org.elasticsearch.index.shard.StoreRecovery.recoverFromRepository(StoreRecovery.java:277)
	at org.elasticsearch.index.shard.IndexShard.restoreFromRepository(IndexShard.java:1631)
	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$7(IndexShard.java:2142)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

And the recovery from PEER after the settings update.

[2018-12-06T23:51:45,427][INFO ][o.e.c.m.MetaDataUpdateSettingsService] [node_s0] updating number_of_replicas to [1] for indices [test-idx]

java.lang.Throwable: start Recovery
	at org.elasticsearch.indices.recovery.PeerRecoveryTargetService.startRecovery(PeerRecoveryTargetService.java:145)
	at org.elasticsearch.index.shard.IndexShard.startRecovery(IndexShard.java:2128)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:611)
	at org.elasticsearch.indices.IndicesService.createShard(IndicesService.java:155)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createShard(IndicesClusterStateService.java:560)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.createOrUpdateShards(IndicesClusterStateService.java:537)
	at org.elasticsearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:234)
	at org.elasticsearch.cluster.service.ClusterApplierService.lambda$callClusterStateAppliers$6(ClusterApplierService.java:483)
	at java.base/java.lang.Iterable.forEach(Iterable.java:75)
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:480)
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467)
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:418)
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:162)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:660)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:244)
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:207)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:834)

The only different that I see is that the filesystem locks are obtained for different temp directories. In the failing case both are for directory 002. And in the passing case one is on directory 002 and the other is on directory 003.

.SourceOnlySnapshotIT_3F6F96908F439A68-064/tempDir-002/data/nodes/0/indices/9jsnmkG5SDGgHj_zHeVFiw/0/index
.SourceOnlySnapshotIT_3F6F96908F439A68-064/tempDir-002/data/nodes/0/indices/9jsnmkG5SDGgHj_zHeVFiw/0/index

vs

.SourceOnlySnapshotIT_3F6F96908F439A68-064/tempDir-002/data/nodes/0/indices/9jsnmkG5SDGgHj_zHeVFiw/0/index
.SourceOnlySnapshotIT_3F6F96908F439A68-064/tempDir-003/data/nodes/0/indices/9jsnmkG5SDGgHj_zHeVFiw/0/index

My understanding of ESTestCase is that these different directories are for different nodes. So it seems like in the passing case, the two different recoveries for the same index are happening on different nodes. And in the failing case, the recoveries are happening on the same node, leading to a filesystem lock collision.

@tbrooks8

This comment has been minimized.

Contributor

tbrooks8 commented Dec 7, 2018

I actually was looking at the code wrong and the a single cluster shares one tempDir. So the different nodes is not the explanation.

@tbrooks8

This comment has been minimized.

Contributor

tbrooks8 commented Dec 7, 2018

Actually 🙃 it looks like this specific test starts a data node with a different path.home directory (003). So it does look like the test fails depending on which node the peer recovery is initiated on.

Where we create the new node:

        logger.info("--> start a new data node");
        final Settings dataSettings = Settings.builder()
            .put(Node.NODE_NAME_SETTING.getKey(), randomAlphaOfLength(5))
            .put(Environment.PATH_HOME_SETTING.getKey(), createTempDir()) // to get a new node id
            .build();
@droberts195

This comment has been minimized.

Contributor

droberts195 commented Dec 7, 2018

I muted this test in master in 60289e7 as it is making PR builds fail

droberts195 added a commit that referenced this issue Dec 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment