SLM metadata records enormous string describing last failure #71325

Open
DaveCTurner opened this issue Apr 6, 2021 · 5 comments
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team

Comments

@DaveCTurner
Contributor

Elasticsearch version (bin/elasticsearch --version): Reported in 7.11.2 but still in master.

Plugins installed: Cloud

JVM version (java -version): N/A

OS version (uname -a if on a Unix-like system): Cloud

Description of the problem including expected versus actual behavior:

A user reported a surprising multi-megabyte response from the Get snapshot lifecycle policy API. The bulk of the content reported every single shard failure in great detail, complete with stack trace. These failures are all added as suppressed exceptions here ...

snapInfo.shardFailures().forEach(failure -> e.addSuppressed(failure.getCause()));

... and converted to a string here:

newPolicyMetadata.setLastFailure(new SnapshotInvocationRecord(snapshotName, timestamp, exceptionToString()));

I'm not sure all this detail is useful, and it certainly seems bad to keep something so large in the cluster state. Can we trim this down somehow?
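
One way to picture the trimming asked for above is a hard cap on the rendered string before it is stored. The following is a minimal sketch under stated assumptions, not the Elasticsearch implementation: the class `BoundedFailureDetails`, its methods, and the 1024-character limit are all hypothetical names and values chosen for illustration, and it renders the exception with plain `printStackTrace` rather than the XContent serialization SLM actually uses.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper, not part of Elasticsearch: caps the size of a rendered failure.
final class BoundedFailureDetails {

    // Assumed limit for illustration only; the thread does not propose a specific value.
    static final int MAX_DETAILS_CHARS = 1024;

    // Renders the exception, its stack trace, and every suppressed exception as text.
    static String render(Throwable e) {
        StringWriter out = new StringWriter();
        e.printStackTrace(new PrintWriter(out, true));
        return out.toString();
    }

    // Keeps at most maxChars characters and notes how many were dropped.
    static String truncate(String details, int maxChars) {
        if (details.length() <= maxChars) {
            return details;
        }
        int omitted = details.length() - maxChars;
        return details.substring(0, maxChars) + "... [" + omitted + " chars omitted]";
    }

    public static void main(String[] args) {
        // Mimic the reported scenario: one top-level failure with hundreds of suppressed shard failures.
        Exception e = new RuntimeException("failed to create snapshot successfully");
        for (int shard = 0; shard < 800; shard++) {
            e.addSuppressed(new IOException("shard " + shard + " failed"));
        }
        String full = render(e);
        String bounded = truncate(full, MAX_DETAILS_CHARS);
        System.out.println("full length: " + full.length());
        System.out.println("bounded length: " + bounded.length());
    }
}
```

A character cap like this bounds what ends up in the cluster state regardless of how many shards fail, at the cost of possibly cutting the text mid-frame.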

Provide logs (if relevant):

"details": "{\"type\":\"snapshot_exception\",\"reason\":\"[found-snapshots:cloud-snapshot-2021.04.01-REDACTED] failed to create snapshot successfully, REDACTED(>800) out of REDACTED(>800) total shards failed\",\"stack_trace\":\"SnapshotException[[found-snapshots:cloud-snapshot-2021.04.01-REDACTED] failed to create snapshot successfully, REDACTED(>800) out of REDACTED(>800) total shards failed]\\n\\tat org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:111)\\n\\tat org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:93)\\n\\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32)\\n\\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83)\\n\\tat org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:77)\\n\\tat org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:32)\\n\\tat org.elasticsearch.action.ActionListener$2.onResponse(ActionListener.java:143)\\n\\tat org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:76)\\n\\tat org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:216)\\n\\tat org.elasticsearch.snapshots.SnapshotsService.completeListenersIgnoringException(SnapshotsService.java:2681)\\n\\tat org.elasticsearch.snapshots.SnapshotsService.lambda$finalizeSnapshotEntry$34(SnapshotsService.java:1577)\\n\\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:117)\\n\\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$finalizeSnapshot$37(BlobStoreRepository.java:1130)\\n\\tat org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:117)\\n\\tat org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)\\n\\tat 
org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)\\n\\tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)\\n\\tat org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)\\n\\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)\\n\\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)\\n\\tat java.base/java.lang.Thread.run(Thread.java:832)\\n\\tSuppressed: [REDACTED/REDACTED][[REDACTED][0]] IndexShardSnapshotFailedException[NoSuchFileException[Blob object [snapshots/REDACTED/indices/REDACTED/0/index-REDACTED] not found: 404 Not Found\\nGET https://storage.googleapis.com/download/storage/v1/b/REDACTED/o/snapshots%REDACTED%2Findices%REDACTED%2F0%2Findex-REDACTED?alt=media\\nNo such object: REDACTED/snapshots/REDACTED/indices/REDACTED/0/index-REDACTED]]\\n\\t\\tat org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:66)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:54)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1524)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:115)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1472)\\n\\t\\tat org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1469)\\n\\t\\tat org.elasticsearch.repositories.blobstore.BlobStoreRepository.doGetRepositoryData(BlobStoreRepository.java:1463)\\n\\t\\t... 6 more\\n\\tSuppressed: [REDACTED/REDACTED][[REDACTED][0]] IndexShardSnapshotFailedException[NoSuchFileException... [many MBs of the same snipped]

@DaveCTurner DaveCTurner added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management labels Apr 6, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Apr 6, 2021
@elasticmachine
Collaborator

Pinging @elastic/es-core-features (Team:Core/Features)

@Leaf-Lin
Contributor

Came across this in 8.0.0-alpha2. Feels like the stack_trace part should be removed?
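
Dropping the stack trace, as suggested here, could look roughly like the sketch below. It is an assumption-laden illustration rather than anything from the Elasticsearch code base: `FailureSummary` and `summarize` are hypothetical names, and collapsing the suppressed exceptions into per-type counts is an extra simplification that the thread does not itself propose.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Elasticsearch: describes a failure without stack traces.
final class FailureSummary {

    // Returns the top-level type and message, plus a count of suppressed exceptions per type.
    static String summarize(Throwable e) {
        // LinkedHashMap keeps the types in the order they were first seen.
        Map<String, Integer> countsByType = new LinkedHashMap<>();
        for (Throwable suppressed : e.getSuppressed()) {
            countsByType.merge(suppressed.getClass().getSimpleName(), 1, Integer::sum);
        }
        StringBuilder summary = new StringBuilder();
        summary.append(e.getClass().getSimpleName()).append(": ").append(e.getMessage());
        for (Map.Entry<String, Integer> entry : countsByType.entrySet()) {
            summary.append("; ").append(entry.getValue()).append(" suppressed ").append(entry.getKey());
        }
        return summary.toString();
    }

    public static void main(String[] args) {
        Exception e = new IllegalStateException("snapshot failed");
        for (int shard = 0; shard < 800; shard++) {
            e.addSuppressed(new IOException("shard " + shard + " failed"));
        }
        System.out.println(summarize(e));
    }
}
```

The summary length then grows with the number of distinct exception types rather than with the number of failed shards.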

@kingherc
Contributor

FYI, this has been the root cause of some blocked ES upgrades: such large exceptions produce a huge cluster state, which we cannot deserialize because of bug #96976. However, such large fields should not be stored in the cluster state in the first place.

I see that the relevant #96918 has been fixed already, which should fix the issue with large fields in the cluster state, so I am unsure whether this ticket is still relevant. I guess the XContentSerialization limit should also limit the response of the Get snapshot lifecycle policy API?

@andreidan
Contributor

I believe this would've been fixed by @original-brownbear in #80942

@ywangd
Member

ywangd commented Jan 11, 2024

Running into this again today on an 8.11.1 cluster. The details field can still contain a string that is too large for deserialization.

6 participants