[CI] AutoFollowIT#testConflictingPatterns failing #35480

Closed
talevy opened this issue Nov 13, 2018 · 5 comments

Labels: :Distributed/CCR (Issues around the Cross Cluster State Replication features), >test-failure (Triaged test failures from CI)

Comments

talevy commented Nov 13, 2018

Hey all, not sure whether this is a real failure, but it popped up in my PR's test run. It may be worth investigating.

CI: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request/2600/

./gradlew :x-pack:plugin:ccr:internalClusterTest -Dtests.seed=4205CBCCCD58E520 -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT -Dtests.method="testConflictingPatterns" -Dtests.security.manager=true -Dtests.locale=es-PY -Dtests.timezone=Pacific/Majuro -Dcompiler.java=11 -Druntime.java=8

Stack trace:

  1> Caused by: org.elasticsearch.transport.RemoteTransportException: [leader0][127.0.0.1:53885][indices:data/read/xpack/ccr/shard_changes[s]]
  1> Caused by: org.elasticsearch.index.shard.IndexShardClosedException: CurrentState[CLOSED] Closed
  1> 	at org.elasticsearch.index.shard.GlobalCheckpointListeners.close(GlobalCheckpointListeners.java:158) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:103) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.core.internal.io.IOUtils.close(IOUtils.java:61) ~[elasticsearch-core-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.shard.IndexShard.close(IndexShard.java:1187) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.IndexService.closeShard(IndexService.java:436) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.IndexService.removeShard(IndexService.java:419) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.index.IndexService.close(IndexService.java:277) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.indices.IndicesService.removeIndex(IndicesService.java:638) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1> 	at org.elasticsearch.indices.cluster.IndicesClusterStateService.deleteIndices(IndicesClusterStateService.java:286) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
@talevy talevy added >test-failure Triaged test failures from CI :Distributed/CCR Issues around the Cross Cluster State Replication features labels Nov 13, 2018
@elasticmachine

Pinging @elastic/es-distributed

javanna commented Nov 21, 2018

Same test but a different failure here: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-darwin-compatibility/72 .

16:58:09 FAILURE 30.1s J0 | AutoFollowIT.testConflictingPatterns <<< FAILURES!
16:58:09    > Throwable #1: java.lang.AssertionError: 
16:58:09    > Expected: <0L>
16:58:09    >      but: was <1L>
16:58:09    > 	at __randomizedtesting.SeedInfo.seed([D307FB40E8829BD7:C074EF21CD3507C2]:0)
16:58:09    > 	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
16:58:09    > 	at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testConflictingPatterns$6(AutoFollowIT.java:217)
16:58:09    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
16:58:09    > 	at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:822)
16:58:09    > 	at org.elasticsearch.xpack.ccr.AutoFollowIT.testConflictingPatterns(AutoFollowIT.java:214)
16:58:09    > 	at java.lang.Thread.run(Thread.java:748)
16:58:09    > 	Suppressed: java.lang.AssertionError: 
16:58:09    > Expected: <1L>
16:58:09    >      but: was <0L>
16:58:09    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
16:58:09    > 		at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testConflictingPatterns$6(AutoFollowIT.java:216)
16:58:09    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
16:58:09    > 		... 39 more
16:58:09    > 	Suppressed: java.lang.AssertionError: 
16:58:09    > Expected: <1L>
16:58:09    >      but: was <0L>
16:58:09    > 		at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
16:58:09    > 		at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testConflictingPatterns$6(AutoFollowIT.java:216)
16:58:09    > 		at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
16:58:09    > 		... 39 more
16:58:09    > 	Suppressed: java.lang.AssertionError: .....................

@martijnvg

The last test failure happens because the auto follow coordinator failed to fetch the history UUIDs the first time it tried to follow an index. On its second attempt the history UUIDs are available and auto following succeeds. The test simply does not expect auto following to fail the first time. I will think about how to address this.

Initial failure:

java.lang.IllegalArgumentException: leader index's commit stats are missing
  1>    at org.elasticsearch.xpack.ccr.CcrLicenseChecker.lambda$fetchLeaderHistoryUUIDs$6(CcrLicenseChecker.java:258) ~[main/:?]
  1>    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:60) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:48) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1119) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport$2.doRun(TcpTransport.java:1297) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:140) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.handleResponse(TcpTransport.java:1289) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1244) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:1030) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.TcpTransport.consumeNetworkReads(TcpTransport.java:1055) ~[elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.MockTcpTransport.readMessage(MockTcpTransport.java:166) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.MockTcpTransport.access$800(MockTcpTransport.java:74) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.MockTcpTransport$MockChannel$2.lambda$doRun$0(MockTcpTransport.java:343) ~[framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:106) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.transport.MockTcpTransport$MockChannel$2.doRun(MockTcpTransport.java:343) [framework-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.0.0-SNAPSHOT.jar:7.0.0-SNAPSHOT]
  1>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_192]
  1>    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_192]
  1>    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_192]
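
The fail-then-succeed sequence described above can be modeled with a small sketch. Everything here is hypothetical (the class `AutoFollowRetrySketch`, its counters, and `runOnce` are illustrative names, not the real coordinator or its stats). It only shows why a check that waits for one successful follow and zero failed follows can never pass once the first run has failed: the failure counter does not go back down after the retry succeeds.

```java
/**
 * Hypothetical model of the sequence in this issue. This is NOT the real
 * AutoFollowCoordinator; names and counters are assumptions for illustration.
 */
final class AutoFollowRetrySketch {
    long successfulFollows = 0;
    long failedFollows = 0;

    /** One auto follow run: it fails while the leader's commit stats are missing. */
    void runOnce(boolean commitStatsAvailable) {
        if (commitStatsAvailable) {
            successfulFollows++;
        } else {
            // Models "leader index's commit stats are missing" on the first run.
            failedFollows++;
        }
    }

    public static void main(String[] args) {
        AutoFollowRetrySketch coordinator = new AutoFollowRetrySketch();
        coordinator.runOnce(false); // first run: primaries not started yet
        coordinator.runOnce(true);  // retry: history UUIDs are now available
        // A test that expects failedFollows == 0 keeps failing from here on.
        System.out.println("successful=" + coordinator.successfulFollows
            + ", failed=" + coordinator.failedFollows);
    }
}
```

Running `main` prints `successful=1, failed=1`, which matches the shape of the assertion failure above (expected `<0L>` but was `<1L>`), assuming the failing assertion is on a failure counter.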

@martijnvg

The leader index is created at the same time as the auto follow coordinator fetches the remote cluster state. Not all primary shards have been allocated yet, because the create index request for the leader index has not returned. This causes index following to fail, because not every primary shard has a history UUID in its index shard stats.

I will work on a change that makes the auto follow coordinator hold off on auto following an index until all of its primary shards have been started. This should avoid these situations.
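
The proposed check can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual change: `PrimaryShardCheckSketch`, `Shard`, `ShardState`, and `allPrimariesStarted` are hypothetical names, and the real coordinator would read shard states from the remote cluster state's routing information rather than from a plain list.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: only auto follow an index once every primary shard has started. */
final class PrimaryShardCheckSketch {
    enum ShardState { UNASSIGNED, INITIALIZING, STARTED }

    /** Minimal stand-in for one shard copy's routing entry (an assumption, not a real ES class). */
    static final class Shard {
        final boolean primary;
        final ShardState state;

        Shard(boolean primary, ShardState state) {
            this.primary = primary;
            this.state = state;
        }
    }

    /** Returns true only if the index has at least one primary and all primaries are STARTED. */
    static boolean allPrimariesStarted(List<Shard> shards) {
        boolean sawPrimary = false;
        for (Shard shard : shards) {
            if (shard.primary) {
                sawPrimary = true;
                if (shard.state != ShardState.STARTED) {
                    return false; // skip this index now; a later auto follow run retries it
                }
            }
        }
        return sawPrimary;
    }

    public static void main(String[] args) {
        List<Shard> justCreated = Arrays.asList(
            new Shard(true, ShardState.STARTED),
            new Shard(true, ShardState.INITIALIZING));
        List<Shard> ready = Arrays.asList(
            new Shard(true, ShardState.STARTED),
            new Shard(true, ShardState.STARTED),
            new Shard(false, ShardState.INITIALIZING)); // replicas do not block following
        System.out.println(allPrimariesStarted(justCreated)); // false
        System.out.println(allPrimariesStarted(ready));       // true
    }
}
```

Skipping the index instead of failing means the first run is not counted as a failed follow, so the index is simply picked up on a later run.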

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Nov 22, 2018
This change adds an extra check that verifies that all primary shards
of an index that is about to be auto followed have been started.

If not all primary shards have been started for an index,
then the next auto follow run will try to auto follow
this index again.

Closes elastic#35480
martijnvg added a commit that referenced this issue Nov 27, 2018
…nator,

it has nothing to do with the hlrc support for put auto follow pattern api,
this test was added for.

Relates to #35480
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Nov 27, 2018
… may be removed

AutoFollowCoordinator should take into account that, after auto following
an index and while recording that the leader index has been followed,
the auto follow pattern may have been removed via the delete auto follow patterns
api.

Closes elastic#35480
martijnvg added a commit that referenced this issue Nov 29, 2018
…35814)

This change adds an extra check that verifies that all primary shards
of an index that is about to be auto followed have been started.

If not all primary shards have been started for an index,
then the next auto follow run will try to auto follow
this index again.

Closes #35480
martijnvg added a commit that referenced this issue Dec 4, 2018
… may be removed (#35945)

AutoFollowCoordinator should take into account that, after auto following
an index and while recording that the leader index has been followed,
the auto follow pattern may have been removed via the delete auto follow patterns
api.

Also fixed a bug so that, when a remote cluster connection has been removed,
the auto follow coordinator does not die when it tries to get a remote client for
that cluster.

Closes #35480
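
The pattern-removal case in the commit message above can be illustrated with a small sketch. All names are hypothetical (`FollowedIndexRecorderSketch`, `recordFollowed`, and the map of pattern name to followed leader index UUIDs are assumptions for illustration); the actual coordinator updates cluster state metadata, which is not reproduced here. The point is only the guard: if the pattern is gone by the time the followed index is recorded, the update is skipped rather than failing.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of recording a followed leader index when the pattern may be gone. */
final class FollowedIndexRecorderSketch {

    /**
     * Returns a new map with leaderIndexUuid appended to the pattern's followed list,
     * or the unchanged input map if the pattern was removed in the meantime.
     */
    static Map<String, List<String>> recordFollowed(
            Map<String, List<String>> followedByPattern,
            String patternName,
            String leaderIndexUuid) {
        List<String> existing = followedByPattern.get(patternName);
        if (existing == null) {
            // The pattern was deleted concurrently: nothing to update, and no error.
            return followedByPattern;
        }
        Map<String, List<String>> updated = new HashMap<>(followedByPattern);
        List<String> uuids = new ArrayList<>(existing);
        uuids.add(leaderIndexUuid);
        updated.put(patternName, uuids);
        return updated;
    }

    public static void main(String[] args) {
        Map<String, List<String>> state = new HashMap<>();
        state.put("pattern1", new ArrayList<>());
        System.out.println(recordFollowed(state, "pattern1", "uuid-1")); // pattern1 gains uuid-1
        System.out.println(recordFollowed(state, "deleted-pattern", "uuid-2") == state); // true
    }
}
```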