Fix a downsample persistent task assignment bug #106247

Conversation

martijnvg
Member

If the source downsample index no longer exists at the time of persistent task assignment, then the persistent task framework will continuously try to find an assignment and fail with an IndexNotFoundException (which gets logged as a warning on the elected master node).

This fixes a bug in resolving the shard routing: if the index no longer exists, any node is returned, so that the persistent task can fail gracefully at a later stage.

The original fix via #98769 didn't get this part right.
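To make the failure mode concrete, here is a minimal sketch of the assignment logic described above. The method name and signature are illustrative stand-ins for the real PersistentTasksExecutor implementation, not the actual patch in this PR; ClusterState, Metadata#hasIndex, RoutingTable#shardRoutingTable, and PersistentTasksCustomMetadata.Assignment are real Elasticsearch types. The key point is that resolving the routing table for a deleted index throws IndexNotFoundException, so the existence check has to come first:

```java
import java.util.Collection;

import org.elasticsearch.cluster.ClusterState;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.index.shard.ShardId;
import org.elasticsearch.persistent.PersistentTasksCustomMetadata.Assignment;

class DownsampleAssignmentSketch {
    // Hypothetical stand-in for the real PersistentTasksExecutor#getAssignment.
    static Assignment assignDownsampleShardTask(
        ShardId sourceShardId,
        Collection<DiscoveryNode> candidateNodes,
        ClusterState clusterState
    ) {
        // Check existence before touching the routing table:
        // RoutingTable#shardRoutingTable throws IndexNotFoundException for a
        // deleted index, which is what kept the reassignment loop failing
        // on the elected master node.
        if (clusterState.metadata().hasIndex(sourceShardId.getIndexName()) == false) {
            // Return any candidate node; the task then fails gracefully on
            // that node at a later stage instead of never being assigned.
            DiscoveryNode anyNode = candidateNodes.iterator().next();
            return new Assignment(anyNode.getId(), "source index [" + sourceShardId.getIndexName() + "] no longer exists");
        }
        // Normal case: assign the task to the node holding the primary shard.
        ShardRouting primary = clusterState.routingTable().shardRoutingTable(sourceShardId).primaryShard();
        return new Assignment(primary.currentNodeId(), "primary shard found");
    }
}
```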
@martijnvg martijnvg added >bug :StorageEngine/Downsampling Downsampling (replacement for rollups) - Turn fine-grained time-based data into coarser-grained data v8.13.1 v8.14.0 labels Mar 12, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-storage-engine (Team:StorageEngine)

@elasticsearchmachine
Collaborator

Hi @martijnvg, I've created a changelog YAML for you.

@martijnvg
Member Author

martijnvg commented Mar 12, 2024

Failure looks unrelated. Starting the upgraded node failed in rolling upgrade integration tests:

[2024-03-12T16:16:18,173][ERROR][o.e.b.Elasticsearch      ] [v8.13.0-0] fatal exception while booting Elasticsearch
java.lang.IllegalStateException: failed to obtain node locks, tried [/dev/shm/bk/bk-agent-prod-gcp-1710258034806935985/elastic/elasticsearch-pull-request/x-pack/qa/rolling-upgrade/build/testclusters/v8.13.0-0/data]; maybe these locations are not writable or multiple nodes were started on the same data path?
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:293)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.validateSettings(NodeConstruction.java:504)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.node.NodeConstruction.prepareConstruction(NodeConstruction.java:255)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.node.Node.<init>(Node.java:192)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch$2.<init>(Elasticsearch.java:240)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.initPhase3(Elasticsearch.java:240)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:75)
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by another program: /dev/shm/bk/bk-agent-prod-gcp-1710258034806935985/elastic/elasticsearch-pull-request/x-pack/qa/rolling-upgrade/build/testclusters/v8.13.0-0/data/node.lock
	at org.apache.lucene.core@9.10.0/org.apache.lucene.store.NativeFSLockFactory.obtainFSLock(NativeFSLockFactory.java:117)
	at org.apache.lucene.core@9.10.0/org.apache.lucene.store.FSLockFactory.obtainLock(FSLockFactory.java:43)
	at org.apache.lucene.core@9.10.0/org.apache.lucene.store.BaseDirectory.obtainLock(BaseDirectory.java:44)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:231)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment$NodeLock.<init>(NodeEnvironment.java:206)
	at org.elasticsearch.server@8.14.0-SNAPSHOT/org.elasticsearch.env.NodeEnvironment.<init>(NodeEnvironment.java:285)
	... 6 more


This type of failure is being tracked in #101231.
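As background on what "failed to obtain node locks" means: NodeEnvironment takes a Lucene filesystem lock named node.lock in each data path, so a second process on the same path is rejected. Here is a minimal, self-contained sketch of that mechanism; the data path and class name are illustrative, and this is plain Lucene, not the actual NodeEnvironment code:

```java
import java.nio.file.Path;

import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.Lock;
import org.apache.lucene.store.LockObtainFailedException;

public class NodeLockDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative data path; in the failing test this is the cluster's data dir.
        Path dataPath = Path.of("/tmp/demo-data");
        try (Directory dir = FSDirectory.open(dataPath);
             // FSDirectory uses NativeFSLockFactory by default, the same
             // mechanism behind the "Lock held by another program" message.
             Lock lock = dir.obtainLock("node.lock")) {
            System.out.println("acquired " + lock);
            // A second process calling obtainLock("node.lock") on the same
            // path would now fail, which is what the rolling-upgrade test
            // tripped over when two nodes shared one data path.
        } catch (LockObtainFailedException e) {
            System.err.println("lock already held: " + e.getMessage());
        }
    }
}
```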

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci/bwc-snapshots

@martijnvg
Member Author

@elasticmachine run elasticsearch-ci/8.13.0 / bwc-snapshots

@martijnvg martijnvg added the auto-backport-and-merge Automatically create backport pull requests and merge when ready label Mar 12, 2024
@martijnvg martijnvg merged commit d54593f into elastic:main Mar 13, 2024
14 checks passed
martijnvg added a commit to martijnvg/elasticsearch that referenced this pull request Mar 13, 2024
If the source downsample index no longer exists at the time of persistent task assignment, then the persistent task framework will continuously try to find an assignment and fail with an IndexNotFoundException (which gets logged as a warning on the elected master node).

This fixes a bug in resolving the shard routing: if the index no longer exists, any node is returned, so that the persistent task can fail gracefully at a later stage.

The original fix via elastic#98769 didn't get this part right.
@elasticsearchmachine
Collaborator

💚 Backport successful

Status: success · Branch: 8.13

elasticsearchmachine pushed a commit that referenced this pull request Apr 5, 2024
If the source downsample index no longer exists at the time of persistent task assignment, then the persistent task framework will continuously try to find an assignment and fail with an IndexNotFoundException (which gets logged as a warning on the elected master node).

This fixes a bug in resolving the shard routing: if the index no longer exists, any node is returned, so that the persistent task can fail gracefully at a later stage.

The original fix via #98769 didn't get this part right.

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>