
[CI] rolling-upgrade-multi-cluster tests failing to start node #91517

Closed
mark-vieira opened this issue Nov 10, 2022 · 2 comments · Fixed by #91659
Assignees
Labels
:Distributed/Distributed (A catch all label for anything in the Distributed Area. If you aren't sure, use this one.), Team:Distributed (Meta label for distributed team), >test-failure (Triaged test failures from CI)

Comments

@mark-vieira
Contributor

I've seen three builds fail now with the same error, and this started happening on Nov 7, so it's likely we've introduced something here:

https://gradle-enterprise.elastic.co/scans/failures?failures.failureClassification=all_failures&failures.failureMessage=Execution%20failed%20for%20task%20%27:x-pack:qa:rolling-upgrade-multi-cluster:v8.6.0%23follower%23oneThirdUpgradedTest%27.%0A%3E%20%60cluster%7B:x-pack:qa:rolling-upgrade-multi-cluster:v8.6.0-follower%7D%60%20failed%20to%20wait%20for%20cluster%20health%20yellow%20after%2040%20SECONDS%0A%20%20%20%20IO%20error%20while%20waiting%20cluster%0A%20%20%20%20%20%20503%20Service%20Unavailable&search.relativeStartTime=P28D&search.timeZoneId=America/Los_Angeles#

* What went wrong:
Execution failed for task ':x-pack:qa:rolling-upgrade-multi-cluster:v8.6.0#follower#oneThirdUpgradedTest'.
> `cluster{:x-pack:qa:rolling-upgrade-multi-cluster:v8.6.0-follower}` failed to wait for cluster health yellow after 40 SECONDS
    IO error while waiting cluster
      503 Service Unavailable

It seems the follower cluster is having issues coming back up after the upgrade.
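
For context, the "wait for cluster health yellow" step is essentially a poll of the `_cluster/health` API; the sketch below is a rough, self-contained illustration of that kind of check, not the actual Gradle test-clusters code, and the address/port are assumptions. The relevant point is that a persistent 503 means the node never reports a formed cluster, rather than recovery simply being slow.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch of the health wait that is timing out above
// (the real test-clusters implementation differs; address/port are assumed).
public class WaitForYellow {
    public static void main(String[] args) throws Exception {
        long deadline = System.currentTimeMillis() + 40_000; // "40 SECONDS" from the failure message
        while (System.currentTimeMillis() < deadline) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(
                    "http://127.0.0.1:9200/_cluster/health?wait_for_status=yellow&timeout=5s").openConnection();
                int status = conn.getResponseCode(); // 200 once yellow is reached within the per-request timeout
                if (status == 200) {
                    System.out.println("cluster reached yellow");
                    return;
                }
                System.out.println("not ready, got HTTP " + status); // e.g. 503 Service Unavailable while no cluster is formed
            } catch (IOException e) {
                System.out.println("node not reachable yet: " + e.getMessage());
            }
            Thread.sleep(1_000);
        }
        throw new IllegalStateException("failed to wait for cluster health yellow after 40 SECONDS");
    }
}
```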

@mark-vieira mark-vieira added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test-failure Triaged test failures from CI labels Nov 10, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Nov 10, 2022
@idegtiarenko idegtiarenko self-assigned this Nov 15, 2022
@idegtiarenko
Contributor

idegtiarenko commented Nov 17, 2022

I was looking into the logs and found the following:

[2022-11-16T11:09:01,635][ERROR][o.e.c.c.NodeRemovalClusterStateTaskExecutor] [v8.6.0-follower-1] unexpected failure during [node-left] java.lang.NullPointerException: Cannot invoke "org.elasticsearch.cluster.routing.RoutingNode.nodeId()" because "node" is null
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:62)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.decideCanAllocate(DesiredBalanceReconciler.java:432)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.findRelocationTarget(DesiredBalanceReconciler.java:420)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.balance(DesiredBalanceReconciler.java:385)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.run(DesiredBalanceReconciler.java:88)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.recordTime(DesiredBalanceShardsAllocator.java:299)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.reconcile(DesiredBalanceShardsAllocator.java:213)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.allocate(DesiredBalanceShardsAllocator.java:162)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.lambda$reroute$6(AllocationService.java:423)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:518)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.executeWithRoutingAllocation(AllocationService.java:444)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:420)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.disassociateDeadNodes(AllocationService.java:275)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.coordination.NodeRemovalClusterStateTaskExecutor.execute(NodeRemovalClusterStateTaskExecutor.java:74)

and

[2022-11-16T11:09:17,405][INFO ][o.e.c.c.JoinHelper       ] [v8.6.0-follower-0] failed to join {v8.6.0-follower-1}{Rqb3crOxRAeyGIaeVnWDhA}{h2XQ3Xh8SR6FysJuBFEQ9w}{v8.6.0-follower-1}{127.0.0.1}{127.0.0.1:40817}{cdfhilmrstw}{testattr=test, xpack.installed=true} with JoinRequest{sourceNode={v8.6.0-follower-0}{Xcn9_bCFRbSZstKrfNCS6A}{CH-DfRU8T42L2tTnA90FOA}{v8.6.0-follower-0}{127.0.0.1}{127.0.0.1:45999}{cdfhilmrstw}{xpack.installed=true, upgraded=true, testattr=test}, minimumTerm=1, optionalJoin=Optional.empty} org.elasticsearch.transport.RemoteTransportException: [v8.6.0-follower-1][127.0.0.1:40817][internal:cluster/coordination/join]
Caused by: java.lang.IllegalArgumentException: can't add node {v8.6.0-follower-0}{Xcn9_bCFRbSZstKrfNCS6A}{CH-DfRU8T42L2tTnA90FOA}{v8.6.0-follower-0}{127.0.0.1}{127.0.0.1:45999}{cdfhilmrstw}{xpack.installed=true, upgraded=true, testattr=test}, found existing node {v8.6.0-follower-0}{Xcn9_bCFRbSZstKrfNCS6A}{iJ7kiW0NT7unyP5E-tvb6A}{v8.6.0-follower-0}{127.0.0.1}{127.0.0.1:40489}{cdfhilmrstw}{xpack.installed=true, testattr=test} with the same id but is a different node instance
	at org.elasticsearch.cluster.node.DiscoveryNodes$Builder.add(DiscoveryNodes.java:648)
	at org.elasticsearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:126)

I suspect the above error derailed disassociateDeadNodes. As a result, the node was not removed from the cluster state when it was stopping. It then could not rejoin on restart because a stale instance of the same node (same node id, different node instance) was still present in the cluster state.
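
For illustration only (this is not necessarily the change made in #91659): the NullPointerException comes from handing a null RoutingNode to the deciders when the desired balance still references a node id that has just left the cluster. A minimal guard at that lookup could look roughly like the following sketch:

```java
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

// Minimal sketch of a null guard, not the actual fix: RoutingNodes.node(nodeId)
// returns null once the node has been removed, and that null must not reach
// AllocationDeciders.canAllocate (which dereferences node.nodeId()).
private static Decision decideCanAllocate(RoutingAllocation allocation, ShardRouting shard, String desiredNodeId) {
    RoutingNode target = allocation.routingNodes().node(desiredNodeId);
    if (target == null) {
        // Node already gone from the routing table; treat as NO instead of throwing.
        return Decision.NO;
    }
    return allocation.deciders().canAllocate(shard, target, allocation);
}
```

Either way, the node-left task must not be derailed by a reconciler failure, otherwise the stale node entry lingers in the cluster state and blocks the rejoin seen in the second log excerpt.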
