[CI] AutoFollowIT.testAutoFollowManyIndices test failure #36761

spinscale · 2018-12-18T11:15:07Z

I could not reproduce this on Linux or macOS. Build link: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=oraclelinux/125/console

log snippet

  1> [2018-12-18T07:47:41,534][INFO ][o.e.x.c.AutoFollowIT     ] [testAutoFollowManyIndices] after test
FAILURE 15.9s J0 | AutoFollowIT.testAutoFollowManyIndices <<< FAILURES!
   > Throwable #1: java.lang.AssertionError:
   > Expected: <37>
   >      but: was <38>
   >    at __randomizedtesting.SeedInfo.seed([F02ACEB2EC9BD622:D818E1CF850A7355]:0)
   >    at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >    at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$6(AutoFollowIT.java:154)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:848)
   >    at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:822)
   >    at org.elasticsearch.xpack.ccr.AutoFollowIT.testAutoFollowManyIndices(AutoFollowIT.java:151)
   >    at java.lang.Thread.run(Thread.java:748)
   >    Suppressed: java.lang.AssertionError:
   > Expected: <37>
   >      but: was <30>
   >            at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >            at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$6(AutoFollowIT.java:154)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
   >            ... 39 more
   >    Suppressed: java.lang.AssertionError:
   > Expected: <37>
   >      but: was <30>
   >            at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >            at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$6(AutoFollowIT.java:154)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
   >            ... 39 more
   >    Suppressed: java.lang.AssertionError:
   > Expected: <37>
   >      but: was <30>
   >            at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
   >            at org.elasticsearch.xpack.ccr.AutoFollowIT.lambda$testAutoFollowManyIndices$6(AutoFollowIT.java:154)
   >            at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:836)
   >            ... 39 more
   >    Suppressed: java.lang.AssertionError:

reproduction

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=F02ACEB2EC9BD622 \
  -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT \
  -Dtests.method="testAutoFollowManyIndices" \
  -Dtests.security.manager=true \
  -Dtests.locale=id \
  -Dtests.timezone=Africa/Monrovia \
  -Dcompiler.java=11 \
  -Druntime.java=8

Link to the plain text logfile is at https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=oraclelinux/125/consoleText

elasticmachine · 2018-12-18T11:15:08Z

Pinging @elastic/es-distributed

martijnvg · 2019-01-11T16:38:03Z

I've pushed this test fix: e4391af
I will close this issue if this test doesn't fail in the coming days.

gwbrown · 2019-01-11T23:42:49Z

This test failed again today in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+multijob-unix-compatibility/os=debian/47/console

martijnvg · 2019-01-12T09:46:50Z

After taking a closer look at the test and at how the auto follow coordinator works, I think the test failure may be caused by a real bug in the auto follow coordinator.

The logs-does-not-count index should not get auto followed, but it does. The test runs in this order: an auto follow pattern is added; a number of leader indices are created that should get auto followed; the auto follow pattern is removed; the logs-does-not-count index is created (and should not get auto followed); the auto follow pattern is re-added; finally, more leader indices are created that should get auto followed.

When the auto follow coordinator removes the auto follower, the auto follower itself doesn't know yet that it has been deleted and may do another auto follow round before it really stops. I will open a PR to fix this.

Currently, when there are no more auto follow patterns for a remote cluster, the AutoFollower instance for that remote cluster is removed. If a new auto follow pattern for this remote cluster gets added quickly enough after the last delete, there may be two AutoFollower instances running for this remote cluster instead of one. Each AutoFollower instance stops automatically once it sees in the `start()` method that there are no more auto follow patterns for the remote cluster it is tracking. However, when an auto follow pattern gets removed and then added back quickly enough, the old AutoFollower may never detect that at some point there were no auto follow patterns for the remote cluster it is monitoring. The creation and removal of an AutoFollower instance happens independently in `updateAutoFollowers()` as part of a cluster state update.

By adding the `removed` field, an AutoFollower instance will not miss the fact that there were no auto follow patterns at some point in time. The `updateAutoFollowers()` method now marks an AutoFollower instance as removed when it sees that there are no more patterns for a remote cluster. The `updateAutoFollowers()` method can then safely start a new AutoFollower instance. Relates to elastic#36761
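The idea behind the `removed` field can be illustrated with a minimal sketch. This is not the actual Elasticsearch code: `SketchAutoFollower`, `SketchCoordinator`, `runRound()` and the demo class are hypothetical names invented for illustration, and only `removed` and `updateAutoFollowers()` are taken from the description above. The sketch assumes the coordinator sets the flag on the old instance at the moment it drops it, so the old instance stops on its next round even if a pattern has been re-added in the meantime.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for the per-remote-cluster auto follower. */
class SketchAutoFollower {
    final String remoteCluster;
    // Set by the coordinator, read by the follower's own loop.
    volatile boolean removed = false;

    SketchAutoFollower(String remoteCluster) {
        this.remoteCluster = remoteCluster;
    }

    /** Runs one round; returns false when this instance must stop for good. */
    boolean runRound() {
        if (removed) {
            return false; // we were dropped at some point, even if patterns exist again
        }
        // ... a real follower would look for new leader indices here ...
        return true;
    }
}

/** Hypothetical stand-in for the coordinator that owns the followers. */
class SketchCoordinator {
    private final Map<String, SketchAutoFollower> followers = new HashMap<>();

    /** Simulates a cluster state update; the argument lists remotes that still have patterns. */
    void updateAutoFollowers(Set<String> remotesWithPatterns) {
        // Mark followers without patterns as removed before dropping them.
        followers.entrySet().removeIf(entry -> {
            if (!remotesWithPatterns.contains(entry.getKey())) {
                entry.getValue().removed = true;
                return true;
            }
            return false;
        });
        // Start a fresh follower for every remote that has patterns but no follower.
        for (String remote : remotesWithPatterns) {
            followers.computeIfAbsent(remote, SketchAutoFollower::new);
        }
    }

    SketchAutoFollower get(String remote) {
        return followers.get(remote);
    }
}

class RemovedFlagDemo {
    public static void main(String[] args) {
        SketchCoordinator coordinator = new SketchCoordinator();
        coordinator.updateAutoFollowers(Set.of("leader_cluster"));
        SketchAutoFollower old = coordinator.get("leader_cluster");

        // Pattern removed and quickly re-added between two rounds of the old follower.
        coordinator.updateAutoFollowers(Set.of());
        coordinator.updateAutoFollowers(Set.of("leader_cluster"));
        SketchAutoFollower fresh = coordinator.get("leader_cluster");

        System.out.println("old follower keeps running: " + old.runRound());
        System.out.println("new follower keeps running: " + fresh.runRound());
    }
}
```

In this sketch the old instance refuses to run another round while the new one proceeds, so at most one follower per remote cluster stays active. Without the flag, the old instance would only check whether patterns currently exist and could keep running alongside the new one.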

martijnvg · 2019-01-16T10:48:47Z

The fix has been pushed. If the test starts failing again, this issue can be reopened.

markharwood · 2019-02-20T15:36:27Z

Two other instances recently on master

benwtrent · 2019-02-21T16:58:06Z

The test failed in a build again:

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+matrix-java-periodic/ES_BUILD_JAVA=openjdk12,ES_RUNTIME_JAVA=java11,nodes=immutable&&linux&&docker/30/console

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=FC1F17C78EA1673 \
  -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT \
  -Dtests.method="testAutoFollowManyIndices" \
  -Dtests.security.manager=true \
  -Dtests.locale=ig-NG \
  -Dtests.timezone=Pacific/Guam \
  -Dcompiler.java=12 \
  -Druntime.java=11

martijnvg · 2019-02-21T19:16:23Z

I think we should mute this test in master, 7.x, 7.0 and 6.7 branches. I will take a look at the recent failures tomorrow.

martijnvg · 2019-02-27T09:28:01Z

I've unmuted this test on master and added more logging, so that when it fails there is more information to debug.

martijnvg · 2019-03-08T11:46:48Z

This test finally failed again. It looks related to the fact that an assertBusy(...) takes too long to complete. I've reduced the number of leader indices that need to be auto followed and I will re-enable the test on all branches.
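For readers unfamiliar with this style of assertion, here is a simplified sketch of a retry-until-timeout check. It is not the code of `ESTestCase.assertBusy`; the class name `BusyAssertSketch`, the doubling sleep and the parameter names are assumptions made for illustration. It shows why a slow cluster can turn into an assertion failure, and why a failure report like the one in the log above carries earlier attempts as `Suppressed` entries: each failed attempt is remembered and attached to the final error.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

class BusyAssertSketch {
    /**
     * Re-runs the check with growing sleeps until it passes or maxWait elapses.
     * On timeout, the last failure is thrown with all earlier failures suppressed.
     */
    static void assertBusy(Runnable check, long maxWait, TimeUnit unit) {
        long maxMillis = unit.toMillis(maxWait);
        long sleepMillis = 1;
        long waited = 0;
        List<AssertionError> failures = new ArrayList<>();
        while (waited < maxMillis) {
            try {
                check.run();
                return; // condition holds, stop waiting
            } catch (AssertionError e) {
                failures.add(e);
            }
            long step = Math.min(sleepMillis, maxMillis - waited);
            try {
                Thread.sleep(step);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", e);
            }
            waited += step;
            sleepMillis *= 2; // back off between attempts
        }
        // Final attempt after the time budget is used up.
        try {
            check.run();
        } catch (AssertionError e) {
            for (AssertionError earlier : failures) {
                e.addSuppressed(earlier);
            }
            throw e;
        }
    }
}
```

With a helper like this, a check that needs more time than `maxWait` fails even if the system would eventually reach the expected state, which is consistent with reducing the number of leader indices so that the work fits into the time budget.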

danielmitterdorfer · 2019-03-11T08:08:38Z

It has failed again twice:

On 7.0 in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/63/console (CI build log):

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=7E8FF37C0456B038 \
  -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT \
  -Dtests.method="testAutoFollowManyIndices" \
  -Dtests.security.manager=true \
  -Dtests.locale=ar-ER \
  -Dtests.timezone=America/Cuiaba \
  -Dcompiler.java=11 \
  -Druntime.java=11

It also failed on 7.x in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.x+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/66/console (CI build log):

./gradlew :x-pack:plugin:ccr:internalClusterTest \
  -Dtests.seed=62B0C83BEB2B0B6D \
  -Dtests.class=org.elasticsearch.xpack.ccr.AutoFollowIT \
  -Dtests.method="testAutoFollowManyIndices" \
  -Dtests.security.manager=true \
  -Dtests.locale=sn \
  -Dtests.timezone=America/Marigot \
  -Dcompiler.java=11 \
  -Druntime.java=11

* reduce the number of leader indices to be auto followed
* also check the number of follower indices being created
* also check whether leader indices are marked as auto followed

Relates to #36761

martijnvg · 2019-03-11T09:08:03Z

I've made additional changes to this test on all branches (^). It looks like sometimes there isn't enough time to auto follow many indices, so I've further reduced the number of leader indices to be auto followed. I've also added additional assertions to the test.

danielmitterdorfer · 2019-03-11T15:18:25Z

To help with future analysis here is the build-stats link (only available for authenticated users) to see the details about build failures: https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(branch),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'class:%22org.elasticsearch.xpack.ccr.AutoFollowIT%22%20AND%20test:%22testAutoFollowManyIndices%22'),sort:!(time,desc))

martijnvg · 2019-03-11T16:16:05Z

Thanks @danielmitterdorfer, that is helpful.

martijnvg · 2019-03-14T10:15:01Z

This test hasn't failed in the last 3 days after the above tweaks were pushed. I will close this issue for now, please re-open if this test fails again for similar reasons.

spinscale added the >test-failure (Triaged test failures from CI) and :Distributed/CCR (Issues around the Cross Cluster State Replication features) labels Dec 18, 2018

ywelsch assigned martijnvg Dec 18, 2018

martijnvg added a commit that referenced this issue Dec 18, 2018

[TEST] Added more logging

819bda5

Relates to #36761

martijnvg added a commit that referenced this issue Dec 18, 2018

[TEST] Added more logging

b5c5954

Relates to #36761

martijnvg added a commit that referenced this issue Dec 18, 2018

[TEST] Added more logging

1afcfc9

Relates to #36761

martijnvg added a commit that referenced this issue Jan 11, 2019

Test fix, wait for auto follower to have stopped in the background

e4391af

Relates to #36761

martijnvg added a commit that referenced this issue Jan 11, 2019

Test fix, wait for auto follower to have stopped in the background

c875ba9

Relates to #36761

martijnvg added a commit that referenced this issue Jan 11, 2019

Test fix, wait for auto follower to have stopped in the background

c5590b2

Relates to #36761

martijnvg mentioned this issue Jan 14, 2019

When removing an AutoFollower also mark it as removed. #37402

Merged

martijnvg closed this as completed Jan 16, 2019

markharwood reopened this Feb 20, 2019

This was referenced Feb 21, 2019

Muting AutoFollowIT.testAutoFollowManyIndices #39264

Merged

Muting AutoFollowIT.testAutoFollowManyIndices #39265

Merged

Muting AutoFollowIT.testAutoFollowManyIndices #39266

Merged

Muting AutoFollowIT.testAutoFollowManyIndices #39267

Merged

martijnvg added a commit that referenced this issue Feb 27, 2019

Unmuted testAutoFollowManyIndices() test and added more logging,

9ccfc01

so that when it fails there is more information to debug. Relates to #36761

martijnvg changed the title ~~[CI] AutoFollowIT.testAutoFollowManyIndices failed on 6.x~~ [CI] AutoFollowIT.testAutoFollowManyIndices test failure Feb 27, 2019

martijnvg added a commit that referenced this issue Mar 8, 2019

reduce the number of indices to be auto followed

1ca9544

Relates to #36761

martijnvg added a commit that referenced this issue Mar 8, 2019

log the existing indices instead of entire Metadata and

09c53cc

reduce the number of indices to be auto followed Relates to #36761

martijnvg added a commit that referenced this issue Mar 8, 2019

unmuted and tweaked test

8666aa1

Relates to #36761

martijnvg added a commit that referenced this issue Mar 8, 2019

unmuted and tweaked test

7bb0afe

Relates to #36761

martijnvg added a commit that referenced this issue Mar 8, 2019

unmuted and tweaked test

098d7dc

Relates to #36761

martijnvg closed this as completed Mar 14, 2019
