AutoFollowIT#testCleanFollowedLeaderIndexUUIDs failures #41071

Closed
cbuescher opened this issue Apr 10, 2019 · 10 comments
Labels: :Distributed/CCR (Issues around the Cross Cluster State Replication features), >test-failure (Triaged test failures from CI)

@cbuescher
Member

On 7.x: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+multijob-darwin-compatibility/99/console

Could not reproduce locally on 7.x:

./gradlew :x-pack:plugin:ccr:internalClusterTest --tests "org.elasticsearch.xpack.ccr.AutoFollowIT.testCleanFollowedLeaderIndexUUIDs" \
  -Dtests.seed=CB4054056BD16251 \
  -Dtests.security.manager=true \
  -Dtests.locale=el-CY \
  -Dtests.timezone=America/Santo_Domingo \
  -Dcompiler.java=12 \
  -Druntime.java=8

Stacktraces:

java.lang.RuntimeException: failed to start nodes
	at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1672)
	at org.elasticsearch.test.InternalTestCluster.reset(InternalTestCluster.java:1214)
	at org.elasticsearch.test.InternalTestCluster.beforeTest(InternalTestCluster.java:1110)
	at org.elasticsearch.xpack.CcrIntegTestCase.startClusters(CcrIntegTestCase.java:171)
	[...]
Caused by: java.util.concurrent.ExecutionException: java.lang.IllegalStateException: failed to connect to remote clusters
	at java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.util.concurrent.FutureTask.get(FutureTask.java:192)
	at org.elasticsearch.test.InternalTestCluster.startAndPublishNodesAndClients(InternalTestCluster.java:1667)
	... 39 more
Caused by: java.lang.IllegalStateException: failed to connect to remote clusters
	at org.elasticsearch.transport.RemoteClusterService.initializeRemoteClusters(RemoteClusterService.java:431)
	at org.elasticsearch.transport.TransportService.doStart(TransportService.java:241)
	at org.elasticsearch.common.component.AbstractLifecycleComponent.start(AbstractLifecycleComponent.java:61)
	at org.elasticsearch.node.Node.start(Node.java:662)
	at org.elasticsearch.test.InternalTestCluster$NodeAndClient.startNode(InternalTestCluster.java:961)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:677)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Caused by: java.util.concurrent.ExecutionException: ConnectTransportException[[leader1][127.0.0.1:53965] general node connection failure]; nested: IllegalStateException[handshake failed with {leader1}{C0InN5g5Q1ywlJOdPNNOwg}{nCYQ3OCCQc23dtHZNr8p9w}{127.0.0.1}{127.0.0.1:53965}{xpack.installed=true}]; nested: ReceiveTimeoutTransportException[[leader1][127.0.0.1:53965][internal:transport/handshake] request_id [4] timed out after [30001ms]];	
cbuescher added the >test-failure and :Distributed/CCR labels Apr 10, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@ywelsch
Contributor

ywelsch commented May 7, 2019

This looks to be a transport issue. Can you have a look, @tbrooks8?

Recent failure: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.1+multijob-darwin-compatibility/10/consoleFull

Relevant log line:

1> [2019-05-06T23:06:23,228][ERROR][o.e.t.n.MockNioTransport ] [follower1] exception from server channel caught on transport layer [sun.nio.ch.ServerSocketChannelImpl[localhost/127.0.0.1:57988]]
  1> java.net.SocketException: Invalid argument
  1> 	at sun.nio.ch.Net.setIntOption0(Native Method) ~[?:?]
  1> 	at sun.nio.ch.Net.setSocketOption(Net.java:334) ~[?:?]
  1> 	at sun.nio.ch.SocketChannelImpl.setOption(SocketChannelImpl.java:190) ~[?:?]
  1> 	at sun.nio.ch.SocketAdaptor.setBooleanOption(SocketAdaptor.java:271) ~[?:?]
  1> 	at sun.nio.ch.SocketAdaptor.setTcpNoDelay(SocketAdaptor.java:306) ~[?:?]
  1> 	at org.elasticsearch.nio.ChannelFactory$RawChannelFactory.configureSocketChannel(ChannelFactory.java:210) ~[elasticsearch-nio-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.ChannelFactory$RawChannelFactory.acceptNioChannel(ChannelFactory.java:185) ~[elasticsearch-nio-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.ChannelFactory.acceptNioChannel(ChannelFactory.java:55) ~[elasticsearch-nio-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.ServerChannelContext.acceptChannels(ServerChannelContext.java:47) ~[elasticsearch-nio-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.EventHandler.acceptChannel(EventHandler.java:45) ~[elasticsearch-nio-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.processKey(NioSelector.java:227) ~[elasticsearch-nio-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.singleLoop(NioSelector.java:172) ~[elasticsearch-nio-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at org.elasticsearch.nio.NioSelector.runLoop(NioSelector.java:129) ~[elasticsearch-nio-7.1.0-SNAPSHOT.jar:7.1.0-SNAPSHOT]
  1> 	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_212]

@dnhatn
Member

dnhatn commented May 26, 2019

Similar issue #41071.

@dnhatn
Member

dnhatn commented Jun 11, 2019

Another instance: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.1+multijob-darwin-compatibility/15/consoleText

 1> [2019-06-11T10:53:49,883][ERROR][o.e.t.n.MockNioTransport ] [followerm1] exception from server channel caught on transport layer [sun.nio.ch.ServerSocketChannelImpl[localhost/127.0.0.1:52943]]
  1> java.net.SocketException: Invalid argument
  1> 	at sun.nio.ch.Net.setIntOption0(Native Method) ~[?:?]
  1> 	at sun.nio.ch.Net.setSocketOption(Net.java:334) ~[?:?]
  1> 	at sun.nio.ch.SocketChannelImpl.setOption(SocketChannelImpl.java:190) ~[?:?]
  1> 	at sun.nio.ch.SocketAdaptor.setBooleanOption(SocketAdaptor.java:271) ~[?:?]
  1> 	at sun.nio.ch.SocketAdaptor.setTcpNoDelay(SocketAdaptor.java:306) ~[?:?]
  1> 	at org.elasticsearch.nio.ChannelFactory$RawChannelFactory.configureSocketChannel(ChannelFactory.java:210) ~[elasticsearch-nio-7.1.2-SNAPSHOT.jar:7.1.2-SNAPSHOT]
org.elasticsearch.xpack.ccr.AutoFollowIT > testAutoFollowSoftDeletesDisabled FAILED
    java.lang.RuntimeException: failed to start nodes

        Caused by:
        java.util.concurrent.ExecutionException: java.lang.IllegalStateException: failed to connect to remote clusters

            Caused by:
            java.lang.IllegalStateException: failed to connect to remote clusters

                Caused by:
                java.util.concurrent.ExecutionException: ConnectTransportException[[][127.0.0.1:52858] connect_timeout[30s]]

                    Caused by:
                    ConnectTransportException[[][127.0.0.1:52858] connect_timeout[30s]]

@ywelsch
Contributor

ywelsch commented Jun 12, 2019

@tbrooks8 can you have a look?

@Tim-Brooks
Contributor

My best guess is that this is similar to this: envoyproxy/envoy#1446

@ywelsch
Contributor

ywelsch commented Jun 17, 2019

@tbrooks8 interesting find. How do you think we should fix this? Should we just log a warning on OS X and move on? Could we possibly delay setting the option until we're fully connected?
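
For reference, the first option (log a warning and move on) could look roughly like the sketch below. It is only an illustration: `LenientSocketOptions` and `trySetTcpNoDelay` are hypothetical names, not Elasticsearch code, and it uses `java.util.logging` purely to stay self-contained.

```java
import java.io.IOException;
import java.net.StandardSocketOptions;
import java.nio.channels.SocketChannel;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of Elasticsearch.
class LenientSocketOptions {

    private static final Logger LOGGER = Logger.getLogger(LenientSocketOptions.class.getName());

    /** Tries to enable TCP_NODELAY; logs a warning and returns false if the call fails. */
    static boolean trySetTcpNoDelay(SocketChannel channel) {
        try {
            channel.setOption(StandardSocketOptions.TCP_NODELAY, true);
            return true;
        } catch (IOException e) {
            // SocketException ("Invalid argument") and ClosedChannelException are both IOExceptions.
            LOGGER.log(Level.WARNING, "could not set TCP_NODELAY, continuing without it", e);
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        try (SocketChannel channel = SocketChannel.open()) {
            System.out.println("options applied: " + trySetTcpNoDelay(channel));
        }
    }
}
```

The trade-off is that the channel may keep running without the option, which is why delaying the call until the connection is established is the other option raised here.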

@Tim-Brooks
Contributor

We could delay setting the socket options until the connection is complete. That would require a little reworking of how things are now. Possibly just add a connection-complete future that sets them.

I'll think about it and submit a PR sometime this week.
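
A minimal sketch of the "connection-complete future" idea, assuming plain `java.nio` rather than the elasticsearch-nio classes. The class name `DeferredSocketOptions` and the method `connectThenConfigure` are made up for illustration, and the small selector loop only stands in for a real event loop.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch; names are illustrative and not taken from Elasticsearch.
class DeferredSocketOptions {

    /** Connects in non-blocking mode and sets TCP_NODELAY only once the connect has finished. */
    static SocketChannel connectThenConfigure(InetSocketAddress address) throws IOException {
        SocketChannel channel = SocketChannel.open();
        try {
            channel.configureBlocking(false);

            // The socket option is attached to the future instead of being set up front.
            CompletableFuture<Void> connectionComplete = new CompletableFuture<>();
            CompletableFuture<Void> configured = connectionComplete.thenRun(() -> {
                try {
                    channel.setOption(StandardSocketOptions.TCP_NODELAY, true);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });

            // Stand-in for the event loop: wait until the connect can be finished.
            try (Selector selector = Selector.open()) {
                channel.register(selector, SelectionKey.OP_CONNECT);
                boolean connected = channel.connect(address);
                while (!connected) {
                    selector.select(1000);
                    selector.selectedKeys().clear();
                    connected = channel.finishConnect();
                }
            }

            connectionComplete.complete(null); // runs the option callback on this thread
            configured.join();                 // rethrows if setting the option failed
            return channel;
        } catch (IOException | RuntimeException e) {
            channel.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress address = (InetSocketAddress) server.getLocalAddress();
            try (SocketChannel channel = connectThenConfigure(address)) {
                System.out.println("TCP_NODELAY = " + channel.getOption(StandardSocketOptions.TCP_NODELAY));
            }
        }
    }
}
```

The point of the future is that whoever finishes the connect does not need to know which options exist; it only completes the future.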

ywelsch added a commit that referenced this issue Jul 17, 2019
Brings some temporary relief for test failures until #41071 is addressed.
@ebadyano
Contributor

@jdconrad
Contributor

jkakavas pushed a commit that referenced this issue Jul 31, 2019
Currently in the transport-nio work we connect and bind channels on
a thread before the channel is registered with a selector. Additionally,
it is at this point that we set all the socket options. This commit
moves these operations onto the event-loop after the channel has been
registered with a selector. It attempts to set the socket options for a
non-server channel at registration time. If that fails, it will attempt
to set the options after the channel is connected. This should fix
#41071.
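
The "try at registration, retry after connect" behaviour described in this commit message can be sketched with plain `java.nio` as follows. This is not the code from the commit: `RegistrationTimeOptions`, `setOptions`, and `open` are hypothetical names, the choice of `TCP_NODELAY` and `SO_KEEPALIVE` is an assumption, and the loop is a simplified stand-in for the event loop, covering only an outbound (non-server) channel.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.StandardSocketOptions;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical sketch; names are illustrative and not taken from the commit.
class RegistrationTimeOptions {

    /** Applies the options; throws if the OS rejects them. */
    static void setOptions(SocketChannel channel) throws IOException {
        channel.setOption(StandardSocketOptions.TCP_NODELAY, true);
        channel.setOption(StandardSocketOptions.SO_KEEPALIVE, true);
    }

    static SocketChannel open(Selector selector, InetSocketAddress address) throws IOException {
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);
        SelectionKey key = channel.register(selector, SelectionKey.OP_CONNECT);

        // First attempt: right after registration, before the connect is issued.
        boolean optionsSet;
        try {
            setOptions(channel);
            optionsSet = true;
        } catch (IOException e) {
            optionsSet = false; // remember the failure and retry once connected
        }

        boolean connected = channel.connect(address);
        while (!connected) {
            selector.select(1000);
            selector.selectedKeys().clear();
            connected = channel.finishConnect();
        }
        key.interestOps(0); // the connect is done, stop listening for OP_CONNECT

        // Second attempt: only if the first one failed. A failure here propagates.
        if (!optionsSet) {
            setOptions(channel);
        }
        return channel;
    }

    public static void main(String[] args) throws IOException {
        try (ServerSocketChannel server = ServerSocketChannel.open();
             Selector selector = Selector.open()) {
            server.bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            InetSocketAddress address = (InetSocketAddress) server.getLocalAddress();
            try (SocketChannel channel = open(selector, address)) {
                System.out.println("TCP_NODELAY = " + channel.getOption(StandardSocketOptions.TCP_NODELAY));
            }
        }
    }
}
```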
Tim-Brooks added a commit to Tim-Brooks/elasticsearch that referenced this issue Aug 2, 2019
Tim-Brooks added a commit that referenced this issue Aug 2, 2019