-
Notifications
You must be signed in to change notification settings - Fork 24.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[CI] MinimumMasterNodesIT.testMultipleNodesShutdownNonMasterNodes sporadically fails #22120
Comments
Not sure this is happens sporadically, I saw this a couple of times today out of probably 10 runs. |
…master PR elastic#22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (elastic#22120) based on nodes with same transport address (but different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with same transport address.
This test failure is triggered by PR #22049 which simplifies the cluster state update logic to only change the list of disco-nodes in the cluster state when said cluster state is going to be published. This means that a node cannot just locally change its cluster state and state changes always have to go through an active master. The test failure here highlights a few issues that are not only bound to happen with the above change (but which makes them more frequent):
|
…master (#22134) PR #22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (#22120) based on nodes with same transport address (but different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with same transport address.
…master (#22134) PR #22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (#22120) based on nodes with same transport address (but different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with same transport address.
The `UnicastZenPing` shows it's age and is the result of many small changes. The current state of affairs is confusing and is hard to reason about. This PR cleans it up (while following the same original intentions). Highlights of the changes are: 1) Clear 3 round flow - no interleaving of scheduling. 2) The previous implementation did a best effort attempt to wait for ongoing pings to be sent and completed. The pings were guaranteed to complete because each used the total ping duration as a timeout. This did make it hard to reason about the total ping duration and the flow of the code. All of this is removed now and ping should just complete within the given duration or not be counted (note that it was very handy for testing, but I move the needed sync logic to the test). 3) Because of (2) the pinging scheduling changed a bit, to give a chance for the last round to complete. We now ping at the beginning, 1/3 and 2/3 of the duration. 4) To offset for (3) a bit, incoming ping requests are now added to on going ping collections. 5) UnicastZenPing never establishes full blown connections (but does reuse them if there). Relates to #22120 6) Discovery host providers are only used once per pinging round. Closes #21739 7) Usage of the ability to open a connection without connecting to a node ( #22194 ) and shorter connection timeouts helps with connections piling up. Closes #19370 8) Beefed up testing and sped them up. 9) removed light profile from production code
…master (elastic#22134) PR elastic#22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (elastic#22120) based on nodes with same transport address (but different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with same transport address.
The `UnicastZenPing` shows it's age and is the result of many small changes. The current state of affairs is confusing and is hard to reason about. This PR cleans it up (while following the same original intentions). Highlights of the changes are: 1) Clear 3 round flow - no interleaving of scheduling. 2) The previous implementation did a best effort attempt to wait for ongoing pings to be sent and completed. The pings were guaranteed to complete because each used the total ping duration as a timeout. This did make it hard to reason about the total ping duration and the flow of the code. All of this is removed now and ping should just complete within the given duration or not be counted (note that it was very handy for testing, but I move the needed sync logic to the test). 3) Because of (2) the pinging scheduling changed a bit, to give a chance for the last round to complete. We now ping at the beginning, 1/3 and 2/3 of the duration. 4) To offset for (3) a bit, incoming ping requests are now added to on going ping collections. 5) UnicastZenPing never establishes full blown connections (but does reuse them if there). Relates to #22120 6) Discovery host providers are only used once per pinging round. Closes #21739 7) Usage of the ability to open a connection without connecting to a node ( #22194 ) and shorter connection timeouts helps with connections piling up. Closes #19370 8) Beefed up testing and sped them up. 9) removed light profile from production code
…master (#22134) PR #22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (#22120) based on nodes with same transport address (but different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with same transport address.
Closed by #22134 |
one more failure of this test: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=sles/814/ . Should I open a separate issue for it? |
The
testMultipleNodesShutdownNonMasterNodes
test inorg.elasticsearch.cluster.MinimumMasterNodesIT
suite started to fail regularly. It failed 6 times over the weekend and once today. I cannot reproduce it locally.The failure is
The text was updated successfully, but these errors were encountered: