SLM metadata records enormous string describing last failure #71325

DaveCTurner · 2021-04-06T10:45:55Z

Elasticsearch version (bin/elasticsearch --version): Reported in 7.11.2 but still in master.

Plugins installed: Cloud

JVM version (java -version): N/A

OS version (uname -a if on a Unix-like system): Cloud

Description of the problem including expected versus actual behavior:

A user reported a multi-megabyte response from the Get snapshot lifecycle policy API, which they found surprising. The bulk of the content reported every single shard failure in great detail, complete with stack trace. These failures are all included as suppressed exceptions here ...

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java

Line 115 in a92a647

    snapInfo.shardFailures().forEach(failure -> e.addSuppressed(failure.getCause()));

... and converted to a string here:

elasticsearch/x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/slm/SnapshotLifecycleTask.java

Line 221 in a92a647

    newPolicyMetadata.setLastFailure(new SnapshotInvocationRecord(snapshotName, timestamp, exceptionToString()));
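
To illustrate why the rendered string grows with the number of failed shards, here is a minimal JDK-only sketch. It does not use the Elasticsearch classes: `SuppressedGrowthDemo`, `render`, and `renderedLength` are hypothetical names, and plain `RuntimeException`s stand in for the shard failures. It assumes only that the stored string includes a printed stack trace, as the `stack_trace` field in the log excerpt suggests.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical demo class, not part of Elasticsearch.
final class SuppressedGrowthDemo {

    // Renders a throwable the way Throwable#printStackTrace does,
    // which includes one "Suppressed:" section per suppressed exception.
    static String render(Throwable t) {
        StringWriter sw = new StringWriter();
        PrintWriter pw = new PrintWriter(sw);
        t.printStackTrace(pw);
        pw.flush();
        return sw.toString();
    }

    // Builds one top-level failure with the given number of suppressed
    // per-shard failures and returns the length of its rendered form.
    static int renderedLength(int shardFailures) {
        Exception e = new RuntimeException("failed to create snapshot successfully");
        for (int i = 0; i < shardFailures; i++) {
            e.addSuppressed(new RuntimeException("shard " + i + " failed"));
        }
        return render(e).length();
    }

    public static void main(String[] args) {
        System.out.println("0 shard failures:   " + renderedLength(0) + " chars");
        System.out.println("800 shard failures: " + renderedLength(800) + " chars");
    }
}
```

Each suppressed exception contributes its own message and stack frames, so the output grows roughly linearly with the number of failed shards.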

I'm not sure all this detail is useful, and it certainly seems bad to keep something so large in the cluster state. Can we trim this down somehow?
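
One possible way to trim this down, sketched here with the JDK only, is to keep the top-level message, list at most a few suppressed failures without stack traces, and cap the total length. This is an illustration under stated assumptions, not the fix that was applied in Elasticsearch: `FailureDetailsTrimmer`, `summarize`, `truncate`, and the `MAX_DETAILS_CHARS` value are all hypothetical.

```java
// Hypothetical helper, not part of Elasticsearch.
final class FailureDetailsTrimmer {

    // Assumed cap on the stored details string; the value is arbitrary.
    static final int MAX_DETAILS_CHARS = 1024;

    // Describes the failure without stack traces, listing at most
    // maxSuppressed suppressed exceptions and counting the rest.
    static String summarize(Throwable failure, int maxSuppressed) {
        StringBuilder sb = new StringBuilder(String.valueOf(failure));
        Throwable[] suppressed = failure.getSuppressed();
        int shown = Math.min(maxSuppressed, suppressed.length);
        for (int i = 0; i < shown; i++) {
            sb.append("\n  suppressed: ").append(suppressed[i]);
        }
        if (suppressed.length > shown) {
            sb.append("\n  ... and ")
              .append(suppressed.length - shown)
              .append(" more suppressed failures");
        }
        return sb.toString();
    }

    // Cuts the string at maxChars and notes how much was dropped.
    static String truncate(String details, int maxChars) {
        if (details == null || details.length() <= maxChars) {
            return details;
        }
        int omitted = details.length() - maxChars;
        return details.substring(0, maxChars) + "... (" + omitted + " more characters omitted)";
    }

    public static void main(String[] args) {
        Exception e = new RuntimeException("800 out of 800 total shards failed");
        for (int i = 0; i < 800; i++) {
            e.addSuppressed(new RuntimeException("shard " + i + " failed"));
        }
        System.out.println(truncate(summarize(e, 3), MAX_DETAILS_CHARS));
    }
}
```

With a cap like this the stored string stays bounded regardless of how many shards fail, while the first few failures remain visible for diagnosis.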

Provide logs (if relevant):

"details": "{\"type\":\"snapshot_exception\",\"reason\":\"[found-snapshots:cloud-snapshot-2021.04.01-REDACTED] failed to create snapshot successfully, REDACTED(>800) out of REDACTED(>800) total shards failed\",\"stack_trace\":\"SnapshotException[[found-snapshots:cloud-snapshot-2021.04.01-REDACTED] failed to create snapshot successfully, REDACTED(>800) out of REDACTED(>800) total shards failed]\\n\\tat org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:111)\\n\\tat org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:93)\\n\\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32)\\n\\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83)\\n\\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:77)\\n\\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32)\\n\\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:143)\\n\\tat org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:76)\\n\\tat org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:216)\\n\\tat org.elasticsearch.snapshots.SnapshotsService.completeListenersIgnoringException(SnapshotsService.java:2681)\\n\\tat org.elasticsearch.snapshots.SnapshotsService.lambda$finalizeSnapshotEntry$34(SnapshotsService.java:1577)\\n\\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:117)\\n\\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$finalizeSnapshot$37(BlobStoreRepository.java:1130)\\n\\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:117)\\n\\tat org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)\\n\\tat 
org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)\\n\\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)\\n\\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\\n\\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)\\n\\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)\\n\\tat java.base/java.lang.Thread.run(Thread.java:832)\\n\\tSuppressed: [REDACTED/REDACTED][[REDACTED][0]] IndexShardSnapshotFailedException[NoSuchFileException[Blob object [snapshots/REDACTED/indices/REDACTED/0/index-REDACTED] not found: 404 Not Found\\nGET https://storage.googleapis.com/download/storage/v1/b/REDACTED/o/snapshots%REDACTED%2Findices%REDACTED%2F0%2Findex-REDACTED?alt=media\\nNo such object: REDACTED/snapshots/REDACTED/indices/REDACTED/0/index-REDACTED]]\\n\\t\\tat org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:66)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:54)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1524)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:115)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1472)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1469)\\n\\t\\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository.doGetRepositoryData(BlobStoreRepository.java:1463)\\n\\t\\t... 6 more\\n\\tSuppressed: [REDACTED/REDACTED][[REDACTED][0]] IndexShardSnapshotFailedException[NoSuchFileException... [many MBs of the same snipped]


elasticmachine · 2021-04-06T10:45:57Z

Pinging @elastic/es-core-features (Team:Core/Features)

Leaf-Lin · 2021-09-20T01:57:21Z

Came across this in 8.0.0-alpha2. Feels like the stack_trace part should be removed?

kingherc · 2023-06-29T15:24:15Z

FYI this has been the root cause of some ES upgrades being blocked: such large exceptions produce a huge cluster state, and we cannot deserialize such large cluster states due to bug #96976. However, such large fields should not be stored in the cluster state in the first place.

I see that the relevant #96918 has been fixed already, which should fix the issue with large fields in the cluster state. I am unsure whether this ticket is still relevant? I guess the XContentSerialization limit should also limit the response of the Get snapshot lifecycle policy API?

andreidan · 2023-06-29T16:07:07Z

I believe this would've been fixed by @original-brownbear in #80942

ywangd · 2024-01-11T01:57:25Z

Running into this again today on an 8.11.1 cluster. The details field can still contain a string that is too large for deserialization.

DaveCTurner added the >bug and :Data Management/ILM+SLM (Index and Snapshot lifecycle management) labels Apr 6, 2021

elasticmachine added the Team:Data Management (Meta label for data/management team) label Apr 6, 2021
