Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] AutoFollowIT.testAutoFollowManyIndices test failure #36761

Closed
spinscale opened this issue Dec 18, 2018 · 15 comments
Closed

[CI] AutoFollowIT.testAutoFollowManyIndices test failure #36761

spinscale opened this issue Dec 18, 2018 · 15 comments
Assignees
Labels
:Distributed/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI

Comments

@spinscale
Copy link
Contributor

Could not reproduce this under Linux or osx. Link https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=oraclelinux/125/console

log snippet

  1> [2018-12-18T07:47:41,534][INFO ][o.e.x.c.AutoFollowIT     ] [testAutoFollowManyIndices] after test
FAILURE 15.9s J0 | AutoFollowIT.testAutoFollowManyIndices <<< FAILURES!
   > Throwable #1: java.lang.AssertionError:
   > Expected: <37>
   >      but: was <38>
   >    at __randomizedtesting.SeedInfo.seed([F02ACEB2EC9BD622:D818E1CF850A7355]:0)
   >    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >    at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$6(AutoFollowIT.java:154)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:822)
   >    at org.elasticsearch.xpack.ccr.AutoFollowIT.testAutoFollowManyIndices(AutoFollowIT.java:151)
   >    at java.lang.Thread.run(Thread.java:748)
   >    Suppressed: java.lang.AssertionError:
   > Expected: <37>
   >      but: was <30>
   >            at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >            at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$6(AutoFollowIT.java:154)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
   >            ... 39 more
   >    Suppressed: java.lang.AssertionError:
   > Expected: <37>
   >      but: was <30>
   >            at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >            at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$6(AutoFollowIT.java:154)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
   >            ... 39 more
   >    Suppressed: java.lang.AssertionError:
   > Expected: <37>
   >      but: was <30>
   >            at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >            at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$6(AutoFollowIT.java:154)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
   >            ... 39 more
   >    Suppressed: java.lang.AssertionError:

reproduction

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=F02ACEB2EC9BD622 \
  -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT \
  -Dtests.method="testAutoFollowManyIndices" \
  -Dtests.security.manager=true \
  -Dtests.locale=id \
  -Dtests.timezone=Africa/Monrovia \
  -Dcompiler.java=11 \
  -Druntime.java=8

Link to the plain text logfile is at https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=oraclelinux/125/consoleText

@spinscale spinscale added >test-failure Triaged test failures from CI :Distributed/CCR Issues around the Cross Cluster State Replication features labels Dec 18, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

martijnvg added a commit that referenced this issue Dec 18, 2018
martijnvg added a commit that referenced this issue Dec 18, 2018
martijnvg added a commit that referenced this issue Dec 18, 2018
@martijnvg
Copy link
Member

I've pushed this test fix: e4391af
I will close this issue if this test doen't fail in the coming days.

@gwbrown
Copy link
Contributor

gwbrown commented Jan 11, 2019

@martijnvg
Copy link
Member

After taking a better look at the test and how the auto follow coordinator can work, I think that the underlying issue may be caused by a real issue in the auto follow coordinator.

The logs-does-not-count index should not get auto followed, but it does get auto followed. In the test an auto follow pattern is added, a number of leader indices get created that should get auto followed, the auto follow pattern gets removed, the logs-does-not-count gets created (and should not get auto followed), then the auto follow patterns get re-added and finally a number of leader indices get created that should get auto followed.

When the auto follow coordinator removes the auto follower, the auto follower itself doen't know it is deleted yet and may do another auto follow round before it really stops. I will open a PR to fix this.

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Jan 13, 2019
Currently when there are no more auto follow patterns for a remote cluster then
the AutoFollower instance for this remote cluster will be removed. If
a new auto follow pattern for this remote cluster gets added quickly enough
after the last delete then there may be two AutoFollower instance running
for this remote cluster instead of one.

Each AutoFollower instance stops automatically after it sees in the
start() method that there are no more auto follow patterns for the
remote cluster it is tracking. However when an auto follow pattern
gets removed and then added back quickly enough then old AutoFollower
may never detect that at some point there were no auto follow patterns
for the remote cluster it is monitoring. The creation and removal of
an AutoFollower instance happens independently in the `updateAutoFollowers()`
as part of a cluster state update.

By adding the `removed` field, an AutoFollower instance will not miss the
fact there were no auto follow patterns at some point in time. The
`updateAutoFollowers()` method now marks an AutoFollower instance as
removed when it sees that there are no more patterns for a remote cluster.
The updateAutoFollowers() method can then safely start a new AutoFollower
instance.

Relates to elastic#36761
martijnvg added a commit that referenced this issue Jan 15, 2019
Currently when there are no more auto follow patterns for a remote cluster then
the AutoFollower instance for this remote cluster will be removed. If
a new auto follow pattern for this remote cluster gets added quickly enough
after the last delete then there may be two AutoFollower instance running
for this remote cluster instead of one.

Each AutoFollower instance stops automatically after it sees in the
start() method that there are no more auto follow patterns for the
remote cluster it is tracking. However when an auto follow pattern
gets removed and then added back quickly enough then old AutoFollower
may never detect that at some point there were no auto follow patterns
for the remote cluster it is monitoring. The creation and removal of
an AutoFollower instance happens independently in the `updateAutoFollowers()`
as part of a cluster state update.

By adding the `removed` field, an AutoFollower instance will not miss the
fact there were no auto follow patterns at some point in time. The
`updateAutoFollowers()` method now marks an AutoFollower instance as
removed when it sees that there are no more patterns for a remote cluster.
The updateAutoFollowers() method can then safely start a new AutoFollower
instance.

Relates to #36761
martijnvg added a commit that referenced this issue Jan 15, 2019
Currently when there are no more auto follow patterns for a remote cluster then
the AutoFollower instance for this remote cluster will be removed. If
a new auto follow pattern for this remote cluster gets added quickly enough
after the last delete then there may be two AutoFollower instance running
for this remote cluster instead of one.

Each AutoFollower instance stops automatically after it sees in the
start() method that there are no more auto follow patterns for the
remote cluster it is tracking. However when an auto follow pattern
gets removed and then added back quickly enough then old AutoFollower
may never detect that at some point there were no auto follow patterns
for the remote cluster it is monitoring. The creation and removal of
an AutoFollower instance happens independently in the `updateAutoFollowers()`
as part of a cluster state update.

By adding the `removed` field, an AutoFollower instance will not miss the
fact there were no auto follow patterns at some point in time. The
`updateAutoFollowers()` method now marks an AutoFollower instance as
removed when it sees that there are no more patterns for a remote cluster.
The updateAutoFollowers() method can then safely start a new AutoFollower
instance.

Relates to #36761
martijnvg added a commit that referenced this issue Jan 15, 2019
Currently when there are no more auto follow patterns for a remote cluster then
the AutoFollower instance for this remote cluster will be removed. If
a new auto follow pattern for this remote cluster gets added quickly enough
after the last delete then there may be two AutoFollower instance running
for this remote cluster instead of one.

Each AutoFollower instance stops automatically after it sees in the
start() method that there are no more auto follow patterns for the
remote cluster it is tracking. However when an auto follow pattern
gets removed and then added back quickly enough then old AutoFollower
may never detect that at some point there were no auto follow patterns
for the remote cluster it is monitoring. The creation and removal of
an AutoFollower instance happens independently in the `updateAutoFollowers()`
as part of a cluster state update.

By adding the `removed` field, an AutoFollower instance will not miss the
fact there were no auto follow patterns at some point in time. The
`updateAutoFollowers()` method now marks an AutoFollower instance as
removed when it sees that there are no more patterns for a remote cluster.
The updateAutoFollowers() method can then safely start a new AutoFollower
instance.

Relates to #36761
@martijnvg
Copy link
Member

Fix has been pushed. If it starts failing again then this issues can be re-opened.

@markharwood
Copy link
Contributor

markharwood commented Feb 20, 2019

Two other instances recently on master

@markharwood markharwood reopened this Feb 20, 2019
@benwtrent
Copy link
Member

Test failed builds again:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=immutable&&linux&&docker/30/console

./gradlew :x-pack:plugin:ccr:internalClusterTest -Dtests.seed=FC1F17C78EA1673 -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT -Dtests.method="testAutoFollowManyIndices" -Dtests.security.manager=true -Dtests.locale=ig-NG -Dtests.timezone=Pacific/Guam -Dcompiler.java=12 -Druntime.java=11

@martijnvg
Copy link
Member

I think we should mute this test in master, 7.x, 7.0 and 6.7 branches. I will take a look at the recent failures tomorrow.

@martijnvg
Copy link
Member

I've unmuted this test on master and added more logging, so when it fails then there is more information to debug.

martijnvg added a commit that referenced this issue Feb 27, 2019
so that when it fails there is more information to debug.

Relates to #36761
@martijnvg martijnvg changed the title [CI] AutoFollowIT.testAutoFollowManyIndices failed on 6.x [CI] AutoFollowIT.testAutoFollowManyIndices test failure Feb 27, 2019
martijnvg added a commit that referenced this issue Mar 8, 2019
martijnvg added a commit that referenced this issue Mar 8, 2019
reduce the number of indices to be auto followed

Relates to #36761
martijnvg added a commit that referenced this issue Mar 8, 2019
@martijnvg
Copy link
Member

This test finally failed again. Looks related to the fact an assertbusy(...) takes too long to complete. I've reduced the amount of leader indices that need to be auto followed and I will re-enable the test on all branches.

martijnvg added a commit that referenced this issue Mar 8, 2019
martijnvg added a commit that referenced this issue Mar 8, 2019
@danielmitterdorfer
Copy link
Member

It has failed again twice:

On 7.0 in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/63/console (CI build log):

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=7E8FF37C0456B038 \
  -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT \
  -Dtests.method="testAutoFollowManyIndices" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-ER \
  -Dtests.timezone=America/Cuiaba \
  -Dcompiler.java=11 \
  -Druntime.java=11

It also failed on 7.x in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/66/console (CI build log):

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=62B0C83BEB2B0B6D \
  -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT \
  -Dtests.method="testAutoFollowManyIndices" \
  -Dtests.security.manager=true \
  -Dtests.locale=sn \
  -Dtests.timezone=America/Marigot \
  -Dcompiler.java=11 \
  -Druntime.java=11

martijnvg added a commit that referenced this issue Mar 11, 2019
* reduce the number of leader indices to be auto followed
* also check the number of follower indices being created
* also check the whether leader indices are marked as auto followed

Relates to #36761
martijnvg added a commit that referenced this issue Mar 11, 2019
* reduce the number of leader indices to be auto followed
* also check the number of follower indices being created
* also check the whether leader indices are marked as auto followed

Relates to #36761
martijnvg added a commit that referenced this issue Mar 11, 2019
* reduce the number of leader indices to be auto followed
* also check the number of follower indices being created
* also check the whether leader indices are marked as auto followed

Relates to #36761
martijnvg added a commit that referenced this issue Mar 11, 2019
* reduce the number of leader indices to be auto followed
* also check the number of follower indices being created
* also check the whether leader indices are marked as auto followed

Relates to #36761
@martijnvg
Copy link
Member

I've made additional changes to the this test to all branches (^). It looks like sometimes there isn't enough time to auto follow many indices, so I've further reduced the leader indices to be auto followed. I've also added additional assertions in the test.

@martijnvg
Copy link
Member

thanks @danielmitterdorfer, that is helpful.

@martijnvg
Copy link
Member

This test hasn't failed in the last 3 days after the above tweaks were pushed. I will close this issue for now, please re-open if this test fails again for similar reasons.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/CCR Issues around the Cross Cluster State Replication features >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

7 participants