Fix a downsample persistent task assignment bug #106247

Conversation

martijnvg
Member

If the source downsample index no longer exists at the time of persistent task assignment, then the persistent task framework will continuously try to find an assignment and fail with an IndexNotFoundException (which gets logged as a warning on the elected master node).

This fixes a bug in resolving the shard routing: if the index no longer exists, any node is returned, so that the persistent task can fail gracefully at a later stage.

The original fix via #98769 didn't get this part right.
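To make the failure mode concrete, here is a minimal sketch of the assignment logic described above. The method name and signature are illustrative stand-ins for the real PersistentTasksExecutor implementation, not the actual patch in this PR; ClusterState, Metadata#hasIndex, RoutingTable#shardRoutingTable, and PersistentTasksCustomMetadata.Assignment are real Elasticsearch types. The key point is that resolving the routing table for a deleted index throws IndexNotFoundException, so the existence check has to come first:

```java
import java.util.Collection;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.index.shard.ShardId;
import org.elasticsearch.persistent.PersistentTasksCustomMetadata.Assignment;

class DownsampleAssignmentSketch {
    // Hypothetical stand-in for the real PersistentTasksExecutor#getAssignment.
    static Assignment assignDownsampleShardTask(
        ShardId sourceShardId,
        Collection<DiscoveryNode> candidateNodes,
        ClusterState clusterState
    ) {
        // Check existence before touching the routing table:
        // RoutingTable#shardRoutingTable throws IndexNotFoundException for a
        // deleted index, which is what kept the reassignment loop failing
        // on the elected master node.
        if (clusterState.metadata().hasIndex(sourceShardId.getIndexName()) == false) {
            // Return any candidate node; the task then fails gracefully on
            // that node at a later stage instead of never being assigned.
            DiscoveryNode anyNode = candidateNodes.iterator().next();
            return new Assignment(anyNode.getId(), "source index [" + sourceShardId.getIndexName() + "] no longer exists");
        }
        // Normal case: assign the task to the node holding the primary shard.
        ShardRouting primary = clusterState.routingTable().shardRoutingTable(sourceShardId).primaryShard();
        return new Assignment(primary.currentNodeId(), "primary shard found");
    }
}
```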
@martijnvg martijnvg added >bug :StorageEngine/Downsampling Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data v8.13.1 v8.14.0 labels Mar 12, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Collaborator

Hi @martijnvg, I've created a changelog YAML for you.

@martijnvg
Member Author

martijnvg commented Mar 12, 2024

Failure looks unrelated. Starting the upgraded node failed in rolling upgrade integration tests:

[2024-03-12T16:16:18,173][ERROR][o.e.b.Elasticsearch      ] [v8.13.0-0] fatal exception while booting Elasticsearch
java.lang.IllegalStateException: failed to obtain node locks, tried [/dev/shm/bk/bk-agent-prod-gcp-1710258034806935985/elastic/elasticsearch-pull-request/x-pack/qa/rolling-upgrade/build/testclusters/v8.13.0-0/data]; maybe these locations are not writable or multiple nodes were started on the same data path?
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:293)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.validateSettings(NodeConstruction.java:504)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:255)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:192)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:240)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:240)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:75)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /dev/shm/bk/bk-agent-prod-gcp-1710258034806935985/elastic/elasticsearch-pull-request/x-pack/qa/rolling-upgrade/build/testclusters/v8.13.0-0/data/node.lock
	at org.apache.lucene.core@9.10.0/org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:117)
	at org.apache.lucene.core@9.10.0/org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43)
	at org.apache.lucene.core@9.10.0/org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:231)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:206)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:285)
	... 6 more


This type of failure is being tracked in #101231.
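As background on what "failed to obtain node locks" means: NodeEnvironment takes a Lucene filesystem lock named node.lock in each data path, so a second process on the same path is rejected. Here is a minimal, self-contained sketch of that mechanism; the data path and class name are illustrative, and this is plain Lucene, not the actual NodeEnvironment code:

```java
import java.nio.file.Path;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;
import org.apache.lucene.store.LockObtainFailedException;

public class NodeLockDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative data path; in the failing test this is the cluster's data dir.
        Path dataPath = Path.of("/tmp/demo-data");
        try (Directory dir = FSDirectory.open(dataPath);
             // FSDirectory uses NativeFSLockFactory by default, the same
             // mechanism behind the "Lock held by another program" message.
             Lock lock = dir.obtainLock("node.lock")) {
            System.out.println("acquired " + lock);
            // A second process calling obtainLock("node.lock") on the same
            // path would now fail, which is what the rolling-upgrade test
            // tripped over when two nodes shared one data path.
        } catch (LockObtainFailedException e) {
            System.err.println("lock already held: " + e.getMessage());
        }
    }
}
```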

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci/bwc-snapshots

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci/8.13.0 / bwc-snapshots

@martijnvg martijnvg added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Mar 12, 2024
@martijnvg martijnvg merged commit d54593f into elastic:main Mar 13, 2024
14 checks passed
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 13, 2024
If the source downsample index no longer exists at the time of persistent task assignment, then the persistent task framework will continuously try to find an assignment and fail with an IndexNotFoundException (which gets logged as a warning on the elected master node).

This fixes a bug in resolving the shard routing: if the index no longer exists, any node is returned, so that the persistent task can fail gracefully at a later stage.

The original fix via elastic#98769 didn't get this part right.
@elasticsearchmachine
Collaborator

💚 Backport successful

Status: success · Branch: 8.13

elasticsearchmachine pushed a commit that referenced this pull request Apr 5, 2024
If the source downsample index no longer exists at the time of persistent task assignment, then the persistent task framework will continuously try to find an assignment and fail with an IndexNotFoundException (which gets logged as a warning on the elected master node).

This fixes a bug in resolving the shard routing: if the index no longer exists, any node is returned, so that the persistent task can fail gracefully at a later stage.

The original fix via #98769 didn't get this part right.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>