[CI] AutoFollowIT.testAutoFollowManyIndices test failure #36761
Comments
Pinging @elastic/es-distributed |
I've pushed this test fix: e4391af |
This test failed again today in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.6+multijob-unix-compatibility/os=debian/47/console |
After taking a closer look at the test and at how the auto follow coordinator works, I think the failure may be caused by a real issue in the auto follow coordinator. When the auto follow coordinator removes an auto follower, the auto follower itself doesn't know yet that it has been deleted and may do another auto follow round before it really stops. I will open a PR to fix this. |
Currently, when there are no more auto follow patterns for a remote cluster, the AutoFollower instance for this remote cluster is removed. If a new auto follow pattern for this remote cluster gets added quickly enough after the last delete, then there may be two AutoFollower instances running for this remote cluster instead of one.

Each AutoFollower instance stops automatically after it sees in the `start()` method that there are no more auto follow patterns for the remote cluster it is tracking. However, when an auto follow pattern gets removed and then added back quickly enough, the old AutoFollower may never detect that at some point there were no auto follow patterns for the remote cluster it is monitoring. The creation and removal of an AutoFollower instance happen independently in the `updateAutoFollowers()` method as part of a cluster state update.

By adding the `removed` field, an AutoFollower instance will not miss the fact that there were no auto follow patterns at some point in time. The `updateAutoFollowers()` method now marks an AutoFollower instance as removed when it sees that there are no more patterns for a remote cluster. The `updateAutoFollowers()` method can then safely start a new AutoFollower instance. Relates to elastic#36761
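The race and the `removed` flag described above can be illustrated with a minimal sketch. This is a simplified, hypothetical model and not the actual Elasticsearch code: the class `AutoFollowCoordinatorSketch`, the `rounds` counter, the `get` accessor, and the `Set<String>` argument to `updateAutoFollowers()` are assumptions made for illustration. Only the names `AutoFollower`, `start()`, `updateAutoFollowers()` and `removed` come from the commit message.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of the coordinator; not the Elasticsearch source. */
class AutoFollowCoordinatorSketch {

    /** Stand-in for an AutoFollower that tracks one remote cluster. */
    static final class AutoFollower {
        final String remoteCluster;
        // Set by the coordinator. Volatile because the follow loop may run on another thread.
        volatile boolean removed = false;
        int rounds = 0; // illustration only: counts completed auto follow rounds

        AutoFollower(String remoteCluster) {
            this.remoteCluster = remoteCluster;
        }

        /** Runs one auto follow round. Returns false once this instance has stopped. */
        boolean start() {
            if (removed) {
                // Stop even if a pattern was added back in the meantime.
                return false;
            }
            rounds++;
            return true;
        }
    }

    private final Map<String, AutoFollower> autoFollowers = new HashMap<>();

    /** Called per cluster state update with the remote clusters that still have patterns. */
    void updateAutoFollowers(Set<String> clustersWithPatterns) {
        // Mark and drop every follower whose remote cluster has no patterns left.
        autoFollowers.entrySet().removeIf(entry -> {
            if (clustersWithPatterns.contains(entry.getKey()) == false) {
                entry.getValue().removed = true;
                return true;
            }
            return false;
        });
        // Start a fresh follower for every remote cluster that has none yet.
        for (String cluster : clustersWithPatterns) {
            autoFollowers.computeIfAbsent(cluster, AutoFollower::new);
        }
    }

    AutoFollower get(String remoteCluster) {
        return autoFollowers.get(remoteCluster);
    }
}
```

In this sketch, deleting the last pattern and adding one back right away yields a new instance, while the old instance sees `removed == true` on its next `start()` call and stops, so only one follower stays active per remote cluster.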
The fix has been pushed. If the test starts failing again, this issue can be re-opened. |
Two other instances recently on master |
Test failed builds again:
|
I think we should mute this test in master, 7.x, 7.0 and 6.7 branches. I will take a look at the recent failures tomorrow. |
I've unmuted this test on master and added more logging, so when it fails then there is more information to debug. |
so that when it fails there is more information to debug. Relates to #36761
reduce the number of indices to be auto followed Relates to #36761
This test finally failed again. The failure looks related to an `assertBusy(...)` call taking too long to complete. I've reduced the number of leader indices that need to be auto followed and I will re-enable the test on all branches. |
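For readers unfamiliar with busy assertions, the following sketch shows the general polling pattern: retry an assertion until it passes or a deadline expires. It is a simplified stand-in written for illustration, not the helper used by the Elasticsearch test framework. The class name `BusyAssertSketch`, the signature, and the backoff values are all assumptions.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical polling helper; a simplified stand-in, not the Elasticsearch test utility. */
class BusyAssertSketch {

    /**
     * Re-runs the assertion until it passes or maxWait has elapsed.
     * Once the deadline has passed, the last AssertionError is rethrown.
     */
    static void assertBusy(Runnable assertion, long maxWait, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(maxWait);
        long sleepMillis = 1;
        while (true) {
            try {
                assertion.run();
                return; // the condition finally holds
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the failure to the test
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting", ie);
            }
            // Back off between attempts, capped at 100 ms (arbitrary choice).
            sleepMillis = Math.min(sleepMillis * 2, 100);
        }
    }
}
```

With a fixed deadline like this, the more leader indices the assertion waits for, the more likely the work does not finish in time on a slow CI machine, which is consistent with reducing the number of leader indices.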
It has failed again twice: On 7.0 in https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.0+matrix-java-periodic/ES_BUILD_JAVA=java11,ES_RUNTIME_JAVA=zulu11,nodes=immutable&&linux&&docker/63/console (CI build log):
It also failed on
|
* reduce the number of leader indices to be auto followed
* also check the number of follower indices being created
* also check whether leader indices are marked as auto followed

Relates to #36761
I've made additional changes to this test on all branches (^). It looks like sometimes there isn't enough time to auto follow many indices, so I've further reduced the number of leader indices to be auto followed. I've also added more assertions to the test. |
To help with future analysis here is the build-stats link (only available for authenticated users) to see the details about build failures: https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(branch),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'class:%22org.elasticsearch.xpack.ccr.AutoFollowIT%22%20AND%20test:%22testAutoFollowManyIndices%22'),sort:!(time,desc)) |
thanks @danielmitterdorfer, that is helpful. |
This test hasn't failed in the last 3 days after the above tweaks were pushed. I will close this issue for now, please re-open if this test fails again for similar reasons. |
Could not reproduce this under Linux or osx. Link https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=oraclelinux/125/console
log snippet
reproduction
Link to the plain text logfile is at https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.x+multijob-unix-compatibility/os=oraclelinux/125/consoleText