rolling-upgrade:v8.12.0#oneThirdUpgradedTest IllegalStateException: failed to obtain node locks #101231

Open
stu-elastic opened this issue Oct 23, 2023 · 16 comments
Labels
:Delivery/Build Build or test infrastructure medium-risk An open issue or test failure that is a medium risk to future releases Team:Delivery Meta label for Delivery team >test-failure Triaged test failures from CI >upgrade

Comments

@stu-elastic
Contributor

stu-elastic commented Oct 23, 2023

CI Link

https://gradle-enterprise.elastic.co/s/augsybdqwff3i

Repro line

:x-pack:plugin:shutdown:qa:rolling-upgrade:v8.12.0-1

Does it reproduce?

Didn't try

Applicable branches

main

Failure history

No response

Failure excerpt

» [2023-10-23T18:50:26,585][ERROR][o.e.b.Elasticsearch      ] [v8.12.0-1] fatal exception while booting Elasticsearch java.lang.IllegalStateException: failed to obtain node locks, tried [/dev/shm/elastic+elasticsearch+main+intake+multijob+bwc-snapshots/x-pack/plugin/shutdown/qa/rolling-upgrade/build/testclusters/v8.12.0-1/data]; maybe these locations are not writable or multiple nodes were started on the same data path?
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:297)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.construct(NodeConstruction.java:484)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:244)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:181)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:236)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:236)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:73)
»  Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /dev/shm/elastic+elasticsearch+main+intake+multijob+bwc-snapshots/x-pack/plugin/shutdown/qa/rolling-upgrade/build/testclusters/v8.12.0-1/data/node.lock
»  	at org.apache.lucene.core@9.8.0/org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:117)
»  	at org.apache.lucene.core@9.8.0/org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43)
»  	at org.apache.lucene.core@9.8.0/org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:235)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:209)
»  	at org.elasticsearch.server@8.12.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:289)
»  	... 6 more
»  
»  ERROR: Elasticsearch did not exit normally - check the logs at /dev/shm/elastic+elasticsearch+main+intake+multijob+bwc-snapshots/x-pack/plugin/shutdown/qa/rolling-upgrade/build/testclusters/v8.12.0-1/logs/v8.12.0.log
»  
»  ERROR: Elasticsearch exited unexpectedly, with exit code 1
@stu-elastic stu-elastic added :Delivery/Build Build or test infrastructure >test-failure Triaged test failures from CI >upgrade needs:triage Requires assignment of a team area label labels Oct 23, 2023
@elasticsearchmachine elasticsearchmachine added blocker Team:Delivery Meta label for Delivery team labels Oct 23, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-delivery (Team:Delivery)

@elasticsearchmachine elasticsearchmachine removed the needs:triage Requires assignment of a team area label label Oct 23, 2023
@mark-vieira
Contributor

mark-vieira commented Oct 23, 2023

@breskeby this sounds like it might be related to #101069. The cluster logs suggest we're attempting to start an already started cluster, which would explain the error above. Perhaps the updated logic is losing track of clusters that are used across multiple tasks, as is the case for many BWC tests. My guess is some state is getting confused when we upgrade nodes in a cluster.
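
To illustrate why starting an already started node fails this way, here is a minimal sketch. It is not the Elasticsearch or Lucene implementation; it assumes only that a node guards its data path with an OS-level file lock on `node.lock`, as the stack trace above suggests. The class name `NodeLockSketch`, the method `obtain`, and the `HELD` set are hypothetical names chosen for illustration.

```java
import java.io.Closeable;
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one exclusive lock per data path, released on close().
class NodeLockSketch {
    // Lock files currently held by this JVM (assumption: tracked in-process so a
    // second attempt is rejected before any second channel is opened on the file).
    private static final Set<Path> HELD = ConcurrentHashMap.newKeySet();

    static Closeable obtain(Path dataPath) throws IOException {
        Files.createDirectories(dataPath);
        final Path lockFile = dataPath.toAbsolutePath().normalize().resolve("node.lock");
        if (!HELD.add(lockFile)) {
            throw new IllegalStateException(
                "failed to obtain node lock, already held in this JVM: " + lockFile);
        }
        FileChannel channel = null;
        try {
            channel = FileChannel.open(lockFile, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            // tryLock() returns null when another process already holds the lock.
            FileLock lock = channel.tryLock();
            if (lock == null) {
                throw new IllegalStateException(
                    "failed to obtain node lock, held by another program: " + lockFile);
            }
            final FileChannel held = channel;
            // Closing the channel releases the file lock; then forget the path.
            return () -> {
                try {
                    held.close();
                } finally {
                    HELD.remove(lockFile);
                }
            };
        } catch (IOException | RuntimeException e) {
            if (channel != null) {
                channel.close();
            }
            HELD.remove(lockFile);
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        Path dataPath = Files.createTempDirectory("testcluster-data");
        try (Closeable first = obtain(dataPath)) {
            try {
                obtain(dataPath); // a second "start" on the same data path
            } catch (IllegalStateException e) {
                System.out.println("second start failed: " + e.getMessage());
            }
        }
        // Once the first holder has stopped, the same path can be locked again.
        obtain(dataPath).close();
        System.out.println("restart after stop succeeded");
    }
}
```

Under these assumptions, a test fixture that starts a node on a data path before the previous process on that path has fully stopped would hit the second-start branch, which is consistent with the guess above that the build logic loses track of a running cluster.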

@breskeby
Contributor

breskeby commented Oct 23, 2023

@mark-vieira From a brief look at the logic we changed and the project in question, I couldn't see how that change affected this, and I wasn't able to reproduce it. I'll take a fresh look tomorrow, as it does seem related that we see this failure after the change we made in #101069.

@mark-vieira mark-vieira added medium-risk An open issue or test failure that is a medium risk to future releases and removed blocker labels Oct 31, 2023
@mark-vieira
Contributor

@breskeby looks like this is still happening occasionally: #103839

@slobodanadamovic
Contributor

Another failure today: https://gradle-enterprise.elastic.co/s/izhi63q6ustnw

@joegallo
Contributor

joegallo commented Feb 7, 2024

And another: https://gradle-enterprise.elastic.co/s/6ey6xm4uylriy

Note that this one was a failure of x-pack:plugin:eql:qa:ccs-rolling-upgrade:v8.13.0#oneThirdUpgraded, not the specific test indicated in the issue description. The "failed to obtain node locks" error and stack trace are present, however, so I thought it was fair to attach it to this issue.

@martijnvg
Member

I ran into this failure in a PR: https://gradle-enterprise.elastic.co/s/lwkluhs5zpwf6/console-log?page=3#L2846
I also noticed that it happened today on the main branch: https://gradle-enterprise.elastic.co/s/e4ca4ihilzigw/console-log?page=2#L1183

@iverase
Contributor

iverase commented Mar 18, 2024

Another one today: https://gradle-enterprise.elastic.co/s/coocr6hsiw7ny

@williamrandolph
Contributor

We had one in the intake build on 17 March: https://gradle-enterprise.elastic.co/s/a4567iuaplgju/

@benwtrent
Member

Here is another intake build failure due to this: https://gradle-enterprise.elastic.co/s/h4gi5trbgx5rk

All three nodes crashed due to failing to obtain locks on their data paths.

@mark-vieira
Contributor

I think we want to move this pull request forward. The downside is that it'll probably make test execution a bit slower, but I think the improvement in stability is worth it. I'll pick this back up.

@DaveCTurner
Contributor

https://gradle-enterprise.elastic.co/s/l34azxsevqole looks like another instance of this

@nik9000
Member

nik9000 commented Apr 18, 2024

@bpintea
Contributor

bpintea commented Apr 22, 2024

@kkrik-es
Contributor

kkrik-es commented May 1, 2024

@davidkyle
Member

Same error for :qa:ccs-rolling-upgrade-remote-cluster:v8.15.0#twoThirdUpgraded

https://gradle-enterprise.elastic.co/s/nprfknz2niwso
