Log reason(s) for reroute if exceptional #58259
Labels
:Distributed/Allocation
All issues relating to the decision making around placing a shard (both master logic & on the nodes)
>enhancement
Team:Distributed
Meta label for distributed team
Sometimes we call `reroute()` because something went wrong, and often that "something" is a problem common to multiple shards. For instance, a `CircuitBreakingException` during shard fetching will likely affect all the shard fetches on that node at once. We want to know about such exceptions in the logs, but there's little value in logging the exception for every single shard.

Today we discussed this as a team (in relation to #57804 (comment)) and decided that it seems natural to use the batching facility built into the `BatchedRerouteService` to record examples of these failures in the logs, on a reroute-by-reroute basis, without having to record them all.
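As a rough illustration of the idea, the sketch below collects the failures reported between two reroutes and emits a single summary containing only a few examples. It is not the `BatchedRerouteService` API: the class `RerouteFailureSampler`, its methods, and the summary format are all hypothetical names chosen for this example, and a real implementation would hand the summary to the logger instead of returning a string.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of Elasticsearch): counts every failure that
 * triggered a reroute, but keeps only the first few as logging examples.
 */
final class RerouteFailureSampler {
    private final int maxExamples;
    private final List<Exception> examples = new ArrayList<>();
    private int total;

    RerouteFailureSampler(int maxExamples) {
        this.maxExamples = maxExamples;
    }

    /** Called once per failed shard; cheap even if thousands of shards fail. */
    synchronized void record(Exception e) {
        total++;
        if (examples.size() < maxExamples) {
            examples.add(e);
        }
    }

    /**
     * Called once per batched reroute. Returns a single summary for the batch
     * and resets the state, or null if no failures were recorded.
     */
    synchronized String drainSummary() {
        if (total == 0) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        sb.append("reroute triggered by ").append(total)
          .append(" failure(s); showing ").append(examples.size())
          .append(" example(s)");
        for (Exception e : examples) {
            sb.append("\n  ").append(e.getClass().getSimpleName())
              .append(": ").append(e.getMessage());
        }
        total = 0;
        examples.clear();
        return sb.toString();
    }
}
```

With a limit of one example, two failures recorded before a reroute would produce one summary that reports a count of two and the details of only the first exception.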