[CI] MinimumMasterNodesIT.testMultipleNodesShutdownNonMasterNodes sporadically fails #22120

Closed
imotov opened this issue Dec 12, 2016 · 5 comments

Comments

@imotov
Contributor

imotov commented Dec 12, 2016

The testMultipleNodesShutdownNonMasterNodes test in the org.elasticsearch.cluster.MinimumMasterNodesIT suite started to fail regularly. It failed 6 times over the weekend and once today. I cannot reproduce it locally.

The failure is:

  1> [2016-12-12T09:11:56,643][INFO ][o.e.c.MinimumMasterNodesIT] [testMultipleNodesShutdownNonMasterNodes]: after test
ERROR   4.89s J1 | MinimumMasterNodesIT.testMultipleNodesShutdownNonMasterNodes <<< FAILURES!
   > Throwable #1: ClusterBlockException[blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];]
   >    at __randomizedtesting.SeedInfo.seed([4489C6FFC6B9860E:F68A1C476DB0C1A]:0)
   >    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(ClusterBlocks.java:161)
   >    at org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(ClusterBlocks.java:147)
   >    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:92)
   >    at org.elasticsearch.action.search.TransportSearchAction.doExecute(TransportSearchAction.java:52)
   >    at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170)
   >    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142)
   >    at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84)
   >    at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:75)
   >    at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:64)
   >    at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:400)
   >    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:77)
   >    at org.elasticsearch.action.ActionRequestBuilder.execute(ActionRequestBuilder.java:51)
   >    at org.elasticsearch.cluster.MinimumMasterNodesIT.testMultipleNodesShutdownNonMasterNodes(MinimumMasterNodesIT.java:264)
   >    at java.lang.Thread.run(Thread.java:745)
@imotov added the :Distributed/Discovery-Plugins (Anything related to our integration plugins with EC2, GCP and Azure) and >test (Issues or PRs that are addressing/adding tests) labels Dec 12, 2016
@javanna
Member

javanna commented Dec 12, 2016

Not sure this happens only sporadically; I saw this a couple of times today out of probably 10 runs.

ywelsch added a commit to ywelsch/elasticsearch that referenced this issue Dec 13, 2016
…master

PR elastic#22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (elastic#22120) based on nodes with the same transport address (but a different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with the same transport address.
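
A minimal Java sketch of the conflict resolution this commit describes, using a hypothetical `Node` record and `ElectionNodeResolver` class rather than the actual DiscoveryNodes builder API: while a node is becoming master, a joining node whose transport address collides with a stale entry from the previous cluster state replaces that entry.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for org.elasticsearch.cluster.node.DiscoveryNode:
// only the fields relevant to the conflict (node id + transport address) are modeled.
record Node(String id, String transportAddress) {}

class ElectionNodeResolver {

    /**
     * Builds the node set used while becoming master. Joining nodes take
     * precedence over existing (possibly dead) nodes that share the same
     * transport address but carry a different node id.
     */
    static Map<String, Node> resolve(List<Node> existingNodes, List<Node> joiningNodes) {
        // Key by transport address so address conflicts are easy to detect.
        Map<String, Node> byAddress = new HashMap<>();
        for (Node existing : existingNodes) {
            byAddress.put(existing.transportAddress(), existing);
        }
        for (Node joining : joiningNodes) {
            // A joining node with the same address but a different id replaces
            // the stale entry; this only applies while *becoming* master.
            byAddress.put(joining.transportAddress(), joining);
        }
        // Re-key by node id, which is how the cluster state refers to nodes.
        Map<String, Node> byId = new HashMap<>();
        for (Node node : byAddress.values()) {
            byId.put(node.id(), node);
        }
        return byId;
    }
}
```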
@ywelsch
Contributor

ywelsch commented Dec 13, 2016

This test failure is triggered by PR #22049, which simplifies the cluster state update logic to only change the list of disco-nodes in the cluster state when said cluster state is going to be published. This means that a node cannot just locally change its cluster state; state changes always have to go through an active master. The test failure here highlights a few issues that are not exclusively bound to the above change (but which the change makes more frequent):

  • Validating full transport connections: Pinging reuses an existing connection when one is available. However, it does not re-check on reuse whether the node on the other side is still who it thinks it is. This is an issue when the cluster state contains disco-nodes that are dead and there are new live nodes with the same transport address (but a different node id) as the dead nodes: the pinging simply reconnects to a different node.
  • Master election gets confused by dead nodes: Master election proceeds in two steps: 1) pinging to determine a master and collecting joins, 2) publishing a cluster state which sets the new master and which needs "approval" from the master-eligible nodes through the process of committing. These two steps can disagree, however, when we elect a master based on nodes with the same transport address (but a different node id) as previous nodes which are now dead. I've opened a PR (Prefer joining node with conflicting transport address when becoming master #22134) so that the live nodes take precedence over the dead nodes when completing the master-election process, and the cluster state electing the master is published and committed by the right nodes.
  • Async shard fetching is brittle in the presence of the issues above: The test actually fails because it ends up with unassigned shards. The reason is that the update logic for the cache used by async shard fetching does not properly handle the scenario where the target nodes of a shard fetch are not the nodes that the master (which issues the fetch) believes them to be. When the master starts fetching, it adds a NodeEntry to the cache, keyed by the node id. When a response is then received, the node id contained in the response object is used to resolve the cache entry. If a node has a different node id than what the master believes it to have but the same transport address, the response is simply ignored, which also prevents future fetching on this node id because the master thinks that fetching is still in progress for that node. As a result, shard fetching never completes and the shards cannot be allocated, as allocation waits for the fetching to complete. A minimal sketch of this cache behavior follows below.
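
To illustrate the last point, here is a simplified Java sketch of the node-id-keyed cache. The `NodeEntry` and `ShardFetchCache` names loosely mirror the description above and are not the real AsyncShardFetch internals: a response from a node whose id differs from the one the master expects finds no cache entry, so it is dropped and the fetch never completes.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the per-node fetch state kept by the master.
class NodeEntry {
    boolean fetching = true; // set when the fetch request is sent
    Object shardData;        // the shard metadata, once a response arrives
}

class ShardFetchCache {
    // Keyed by the node id the *master* believes it is talking to.
    private final Map<String, NodeEntry> cache = new HashMap<>();

    void startFetching(String expectedNodeId) {
        cache.put(expectedNodeId, new NodeEntry());
    }

    // The response carries the node id of whoever actually answered. If that
    // node shares the expected node's transport address but has a different id,
    // there is no matching entry: the response is dropped, the original entry
    // stays in "fetching" state forever, and the shards stay unassigned because
    // the master keeps waiting for the fetch to complete.
    void onResponse(String respondingNodeId, Object shardData) {
        NodeEntry entry = cache.get(respondingNodeId);
        if (entry == null) {
            return; // silently ignored -- the failure mode seen in this test
        }
        entry.fetching = false;
        entry.shardData = shardData;
    }

    boolean hasPendingFetches() {
        return cache.values().stream().anyMatch(e -> e.fetching);
    }
}
```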

ywelsch added a commit that referenced this issue Dec 14, 2016
…master (#22134)

PR #22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (#22120) based on nodes with the same transport address (but a different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with the same transport address.
ywelsch added a commit that referenced this issue Dec 14, 2016
…master (#22134)

PR #22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (#22120) based on nodes with the same transport address (but a different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with the same transport address.
bleskes added a commit that referenced this issue Dec 21, 2016
The `UnicastZenPing` shows its age and is the result of many small changes. The current state of affairs is confusing and hard to reason about. This PR cleans it up (while following the same original intentions). Highlights of the changes are:

1) Clear 3-round flow - no interleaving of scheduling.
2) The previous implementation made a best-effort attempt to wait for ongoing pings to be sent and completed. The pings were guaranteed to complete because each used the total ping duration as a timeout. This made it hard to reason about the total ping duration and the flow of the code. All of this is removed now and a ping should just complete within the given duration or not be counted (note that it was very handy for testing, but I moved the needed sync logic to the test).
3) Because of (2) the ping scheduling changed a bit, to give the last round a chance to complete. We now ping at the beginning, 1/3 and 2/3 of the duration (a sketch of this schedule follows the list).
4) To offset for (3) a bit, incoming ping requests are now added to ongoing ping collections.
5) UnicastZenPing never establishes full-blown connections (but does reuse them if they are already there). Relates to #22120
6) Discovery host providers are only used once per pinging round. Closes #21739
7) Usage of the ability to open a connection without connecting to a node ( #22194 ) and shorter connection timeouts help with connections piling up. Closes #19370
8) Beefed up the tests and sped them up.
9) Removed the light profile from production code.
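
A rough Java sketch of the three-round schedule from points 1 and 3, assuming a plain `ScheduledExecutorService` instead of Elasticsearch's thread pool; `sendPings` is a placeholder for whatever sends one ping round:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

class PingRoundScheduler {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    /**
     * Schedules the three ping rounds: immediately, at 1/3 and at 2/3 of the
     * total ping duration, so the last round still has a third of the budget
     * left to complete before results are collected.
     */
    void schedulePingRounds(long pingDurationMillis, Runnable sendPings) {
        scheduler.schedule(sendPings, 0, TimeUnit.MILLISECONDS);
        scheduler.schedule(sendPings, pingDurationMillis / 3, TimeUnit.MILLISECONDS);
        scheduler.schedule(sendPings, 2 * pingDurationMillis / 3, TimeUnit.MILLISECONDS);
        // Results are gathered once the full duration has elapsed; pings that
        // have not completed by then are simply not counted.
    }
}
```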
ywelsch added a commit to ywelsch/elasticsearch that referenced this issue Dec 23, 2016
…master (elastic#22134)

PR elastic#22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (elastic#22120) based on nodes with the same transport address (but a different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with the same transport address.
bleskes added a commit that referenced this issue Dec 23, 2016
The `UnicastZenPing` shows its age and is the result of many small changes. The current state of affairs is confusing and hard to reason about. This PR cleans it up (while following the same original intentions). Highlights of the changes are:

1) Clear 3-round flow - no interleaving of scheduling.
2) The previous implementation made a best-effort attempt to wait for ongoing pings to be sent and completed. The pings were guaranteed to complete because each used the total ping duration as a timeout. This made it hard to reason about the total ping duration and the flow of the code. All of this is removed now and a ping should just complete within the given duration or not be counted (note that it was very handy for testing, but I moved the needed sync logic to the test).
3) Because of (2) the ping scheduling changed a bit, to give the last round a chance to complete. We now ping at the beginning, 1/3 and 2/3 of the duration.
4) To offset for (3) a bit, incoming ping requests are now added to ongoing ping collections.
5) UnicastZenPing never establishes full-blown connections (but does reuse them if they are already there). Relates to #22120
6) Discovery host providers are only used once per pinging round. Closes #21739
7) Usage of the ability to open a connection without connecting to a node ( #22194 ) and shorter connection timeouts help with connections piling up. Closes #19370
8) Beefed up the tests and sped them up.
9) Removed the light profile from production code.
ywelsch added a commit that referenced this issue Dec 23, 2016
…master (#22134)

PR #22049 changed the node update logic to never remove nodes from the cluster state when the cluster state is not published. This led to an issue electing a master (#22120) based on nodes with the same transport address (but a different node id) as previous nodes. The joining nodes should take precedence over conflicting ones. Note that this only applies to the action of becoming master. If a master is established and a node joins an existing master, it will be rejected if there is another node with the same transport address.
@jasontedor
Member

Closed by #22134

@javanna
Member

javanna commented Apr 10, 2017

One more failure of this test: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=sles/814/

Should I open a separate issue for it?

@ywelsch
Contributor

ywelsch commented Apr 10, 2017

Thanks for the notification @javanna. I've pushed 12471c4 as a follow-up to a3cceb8 (see #23922), which should fix this. If more failures of this occur after this commit, feel free to open a separate issue.
