
[CI] rolling-upgrade-multi-cluster tests failing to start node #91517

Closed
mark-vieira opened this issue Nov 10, 2022 · 2 comments · Fixed by #91659
Assignees
Labels
:Distributed/Distributed (A catch all label for anything in the Distributed Area. If you aren't sure, use this one.), Team:Distributed (Meta label for distributed team), >test-failure (Triaged test failures from CI)

Comments

@mark-vieira
Contributor

I've seen three builds fail now with the same error, and this started happening on Nov 7, so it's likely we've introduced something here:

https://gradle-enterprise.elastic.co/scans/failures?failures.failureClassification=all_failures&failures.failureMessage=Execution%20failed%20for%20task%20%27:x-pack:qa:rolling-upgrade-multi-cluster:v8.6.0%23follower%23oneThirdUpgradedTest%27.%0A%3E%20%60cluster%7B:x-pack:qa:rolling-upgrade-multi-cluster:v8.6.0-follower%7D%60%20failed%20to%20wait%20for%20cluster%20health%20yellow%20after%2040%20SECONDS%0A%20%20%20%20IO%20error%20while%20waiting%20cluster%0A%20%20%20%20%20%20503%20Service%20Unavailable&search.relativeStartTime=P28D&search.timeZoneId=America/Los_Angeles#

* What went wrong:
Execution failed for task ':x-pack:qa:rolling-upgrade-multi-cluster:v8.6.0#follower#oneThirdUpgradedTest'.
> `cluster{:x-pack:qa:rolling-upgrade-multi-cluster:v8.6.0-follower}` failed to wait for cluster health yellow after 40 SECONDS
    IO error while waiting cluster
      503 Service Unavailable

It seems the follower cluster is having issues coming back up after the upgrade.
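
For context, the "wait for cluster health yellow" step is essentially a poll of the `_cluster/health` API; the sketch below is a rough, self-contained illustration of that kind of check, not the actual Gradle test-clusters code, and the address/port are assumptions. The relevant point is that a persistent 503 means the node never reports a formed cluster, rather than recovery simply being slow.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Rough sketch of the health wait that is timing out above
// (the real test-clusters implementation differs; address/port are assumed).
public class WaitForYellow {
    public static void main(String[] args) throws Exception {
        long deadline = System.currentTimeMillis() + 40_000; // "40 SECONDS" from the failure message
        while (System.currentTimeMillis() < deadline) {
            try {
                HttpURLConnection conn = (HttpURLConnection) new URL(
                    "http://127.0.0.1:9200/_cluster/health?wait_for_status=yellow&timeout=5s").openConnection();
                int status = conn.getResponseCode(); // 200 once yellow is reached within the per-request timeout
                if (status == 200) {
                    System.out.println("cluster reached yellow");
                    return;
                }
                System.out.println("not ready, got HTTP " + status); // e.g. 503 Service Unavailable while no cluster is formed
            } catch (IOException e) {
                System.out.println("node not reachable yet: " + e.getMessage());
            }
            Thread.sleep(1_000);
        }
        throw new IllegalStateException("failed to wait for cluster health yellow after 40 SECONDS");
    }
}
```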

@mark-vieira mark-vieira added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. >test-failure Triaged test failures from CI labels Nov 10, 2022
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team label Nov 10, 2022
@idegtiarenko idegtiarenko self-assigned this Nov 15, 2022
@idegtiarenko
Contributor

idegtiarenko commented Nov 17, 2022

I was looking into the logs and found the following:

[2022-11-16T11:09:01,635][ERROR][o.e.c.c.NodeRemovalClusterStateTaskExecutor] [v8.6.0-follower-1] unexpected failure during [node-left] java.lang.NullPointerException: Cannot invoke "org.elasticsearch.cluster.routing.RoutingNode.nodeId()" because "node" is null
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.decider.AllocationDeciders.canAllocate(AllocationDeciders.java:62)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.decideCanAllocate(DesiredBalanceReconciler.java:432)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.findRelocationTarget(DesiredBalanceReconciler.java:420)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.balance(DesiredBalanceReconciler.java:385)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceReconciler.run(DesiredBalanceReconciler.java:88)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.recordTime(DesiredBalanceShardsAllocator.java:299)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.reconcile(DesiredBalanceShardsAllocator.java:213)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.allocator.DesiredBalanceShardsAllocator.allocate(DesiredBalanceShardsAllocator.java:162)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.lambda$reroute$6(AllocationService.java:423)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:518)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.executeWithRoutingAllocation(AllocationService.java:444)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.reroute(AllocationService.java:420)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.routing.allocation.AllocationService.disassociateDeadNodes(AllocationService.java:275)
	at org.elasticsearch.server@8.6.0-SNAPSHOT/org.elasticsearch.cluster.coordination.NodeRemovalClusterStateTaskExecutor.execute(NodeRemovalClusterStateTaskExecutor.java:74)

and

[2022-11-16T11:09:17,405][INFO ][o.e.c.c.JoinHelper       ] [v8.6.0-follower-0] failed to join {v8.6.0-follower-1}{Rqb3crOxRAeyGIaeVnWDhA}{h2XQ3Xh8SR6FysJuBFEQ9w}{v8.6.0-follower-1}{127.0.0.1}{127.0.0.1:40817}{cdfhilmrstw}{testattr=test, xpack.installed=true} with JoinRequest{sourceNode={v8.6.0-follower-0}{Xcn9_bCFRbSZstKrfNCS6A}{CH-DfRU8T42L2tTnA90FOA}{v8.6.0-follower-0}{127.0.0.1}{127.0.0.1:45999}{cdfhilmrstw}{xpack.installed=true, upgraded=true, testattr=test}, minimumTerm=1, optionalJoin=Optional.empty} org.elasticsearch.transport.RemoteTransportException: [v8.6.0-follower-1][127.0.0.1:40817][internal:cluster/coordination/join]
Caused by: java.lang.IllegalArgumentException: can't add node {v8.6.0-follower-0}{Xcn9_bCFRbSZstKrfNCS6A}{CH-DfRU8T42L2tTnA90FOA}{v8.6.0-follower-0}{127.0.0.1}{127.0.0.1:45999}{cdfhilmrstw}{xpack.installed=true, upgraded=true, testattr=test}, found existing node {v8.6.0-follower-0}{Xcn9_bCFRbSZstKrfNCS6A}{iJ7kiW0NT7unyP5E-tvb6A}{v8.6.0-follower-0}{127.0.0.1}{127.0.0.1:40489}{cdfhilmrstw}{xpack.installed=true, testattr=test} with the same id but is a different node instance
	at org.elasticsearch.cluster.node.DiscoveryNodes$Builder.add(DiscoveryNodes.java:648)
	at org.elasticsearch.cluster.coordination.JoinTaskExecutor.execute(JoinTaskExecutor.java:126)

I suspect the above error derailed disassociateDeadNodes. As a result, the node was not removed from the cluster state when it was stopping. It then could not rejoin on restart because a stale instance of the same node (same node id, different node instance) was still present in the cluster state.
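
For illustration only (this is not necessarily the change made in #91659): the NullPointerException comes from handing a null RoutingNode to the deciders when the desired balance still references a node id that has just left the cluster. A minimal guard at that lookup could look roughly like the following sketch:

```java
import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

// Minimal sketch of a null guard, not the actual fix: RoutingNodes.node(nodeId)
// returns null once the node has been removed, and that null must not reach
// AllocationDeciders.canAllocate (which dereferences node.nodeId()).
private static Decision decideCanAllocate(RoutingAllocation allocation, ShardRouting shard, String desiredNodeId) {
    RoutingNode target = allocation.routingNodes().node(desiredNodeId);
    if (target == null) {
        // Node already gone from the routing table; treat as NO instead of throwing.
        return Decision.NO;
    }
    return allocation.deciders().canAllocate(shard, target, allocation);
}
```

Either way, the node-left task must not be derailed by a reconciler failure, otherwise the stale node entry lingers in the cluster state and blocks the rejoin seen in the second log excerpt.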
