Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Less Verbose Serialization of Snapshot Failure in SLM Metadata #80942

Conversation

original-brownbear
Copy link
Member

We should not serialize the full exception including cause(s) and stacktraces
here. This can be a string of multiple MBs for a very large cluster that has
a large subset of indices/shards failing to snapshot.
We can get the full details of what failed for each shard in detail from the
repository as well as from logs anyway. If we fail to finalize the snapshot
we still get the rough reason for that failure with this change and can
look at the logs for more details.
Also, moved generating the JSON to the proper utility for that.

somewhat relates #77466 (since this is motivated by a user issue that saw a multi MB string being serialized here).

We should not serialize the full exception including cause(s) and stacktraces
here. This can be a string of multiple MBs for a very large cluster that has
a large subset of indices/shards failing to snapshot.
We can get the full details of what failed for each shard in detail from the
repository as well as from logs anyway. If we fail to finalize the snapshot
we still get the rough reason for that failure with this change and can
look at the logs for more details.
@original-brownbear original-brownbear added >bug :Data Management/ILM+SLM Index and Snapshot lifecycle management v8.0.0 v8.1.0 labels Nov 23, 2021
@elasticmachine elasticmachine added the Team:Data Management Meta label for data/management team label Nov 23, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-data-management (Team:Data Management)

return exception.map(e -> Strings.toString((builder, params) -> {
ElasticsearchException.generateThrowableXContent(builder, EXCEPTION_SERIALIZATION_PARAMS, e);
return builder;
}, EXCEPTION_SERIALIZATION_PARAMS)).orElse(null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am afraid in the linked sdh the exception become huge because of 2089 suppressed exceptions (attached to the root, not to the nested exception):

{"type":"snapshot_exception","reason":"[azure_repository:routine-snapshot-2021.09.29-08uqgt9wskigd-qqjym8ea] failed to create snapshot successfully, 2089 out of 3497 total shards failed","stack_trace":"SnapshotException[[azure_repository:
routine-snapshot-2021.09.29-08uqgt9wskigd-qqjym8ea] failed to create snapshot successfully, 2089 out of 3497 total shards failed]
  at org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:111)
  at org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:93)
  at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
  at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83)
  at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:77)
  at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
  at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:163)
  at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:217)
  at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:184)
  at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101)
  at org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:291)
  at org.elasticsearch.snapshots.SnapshotsService.completeListenersIgnoringException(SnapshotsService.java:2850)
  at org.elasticsearch.snapshots.SnapshotsService.lambda$finalizeSnapshotEntry$36(SnapshotsService.java:1710)
  at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134)
  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$finalizeSnapshot$38(BlobStoreRepository.java:1140)
  at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134)
  at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
  at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
  at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
  at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
  at java.base/java.lang.Thread.run(Thread.java:831)
  Suppressed: [openapi-iis-live-7.11.2-2021.08.19/h80zl_siTwC5zSzUIsqqlw][[openapi-iis-live-7.11.2-2021.08.19][1]] IndexShardSnapshotFailedException[UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOE
xception: Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw]; nested: IOException[Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw]; nested: AbortedSnapshotException[aborted]]
    at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:66)
    at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:54)
    at org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1614)
    at org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:119)
    at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1555)
    at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1552)
    at org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:291)
    at org.elasticsearch.action.ResultDeduplicator$CompositeListener.onResponse(ResultDeduplicator.java:92)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$getRepositoryData$47(BlobStoreRepository.java:1376)
    at org.elasticsearch.action.ResultDeduplicator.executeOnce(ResultDeduplicator.java:40)
    at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1376)
    at org.elasticsearch.snapshots.SnapshotsService.endSnapshot(SnapshotsService.java:1552)
    at org.elasticsearch.snapshots.SnapshotsService.access$900(SnapshotsService.java:119)
    at org.elasticsearch.snapshots.SnapshotsService$16.clusterStateProcessed(SnapshotsService.java:3355)
    at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:523)
    at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:410)
    at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
    at org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:410)
    at org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:270)
    at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:262)
    at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:239)
    at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:62)
    at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140)
    at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139)
    at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)
    at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)
    ... 3 more
  Suppressed: [openapi-svc-live-7.11.2-2021.09.07/whEHoP6ATW6aOKy8HNOzEA][[openapi-svc-live-7.11.2-2021.09.07][3]] IndexShardSnapshotFailedException[UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOE
:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were using 7.13.1 in case that was changed since then

Copy link
Contributor

@idegtiarenko idegtiarenko Nov 23, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was looking on the exception json and I think we print each suppressed exception twice:

~/Downloads/sdh-5176/api-diagnostics-20211029-145129 ᐅ grep "at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:66)" details.json -c
4178
~/Downloads/sdh-5176/api-diagnostics-20211029-145129 ᐅ grep "Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw" details.json -c
2

First time it is written by ExceptionsHelper.stackTrace(throwable) > e.printStackTrace(printWriter) inside of STACK_TRACE
and second time inside our own SUPPRESSED array

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went a little further here now. As you pointed out to me on another channel, even with this change we are still serialising all the suppressed exceptions.
These exceptions are pointless without the stack-trace it turns out (the message doesn't really contain any actionable information). I don't think we can serialise the stack traces for every shard or even should do so for a sub-set of shards, it's just a massive number of bytes. To me this seems to even be outright wrong for storing history items to the SLM history as there's no telling how large those exceptions could serialize.
-> I just dropped collecting all the shard level exceptions in SLM itself and moved serialisation to what we would serialize in a REST response both for the history store and the cluster state.

This still gives us a reasonably actionable amount of information in the case of an outright failure of the snapshot request via the internal client and for all the shard level details we track how many shards failed and that's it. Further analysing this can be done by inspecting the GET /_snapshot/ API and if even more detail is needed, the logs.
Storing this kind of information outside the logs and including the stack-trace simply doesn't scale to situations like failing all shards in a 100k shard snapshot and the like.

@dakrone could you take a look here and see if you're fine with ^^, we really need to do something here :) Otherwise this is going to bite us (already has IMO, with the fact that we indexed O(10M) docs in a linked issue).

Copy link
Member

@dakrone dakrone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing this Armin

@original-brownbear
Copy link
Member Author

@elasticmachine update branch

@original-brownbear
Copy link
Member Author

Jenkins run elasticsearch-ci/part-2

@original-brownbear
Copy link
Member Author

Thanks Ievgen and Lee!

@original-brownbear original-brownbear merged commit 1f93339 into elastic:master Nov 29, 2021
@original-brownbear original-brownbear deleted the compact-serialization-slm-exception branch November 29, 2021 20:46
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.0 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 80942

weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
* upstream/master: (150 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
* upstream/master: (55 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...
weizijun added a commit to weizijun/elasticsearch that referenced this pull request Nov 30, 2021
* upstream/master: (55 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Data Management/ILM+SLM Index and Snapshot lifecycle management Team:Data Management Meta label for data/management team v8.0.0 v8.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants