APIs like /_cluster/state Break for Large Clusters due to Response Size Limitations #79560

@original-brownbear

ES currently cannot return REST responses larger than 2GB (the max int value, 2^31 - 1 bytes) because of the way we serialize messages into BytesReference instances.
This causes APIs like /_cluster/state to stop working eventually. In the case at hand we're talking about ~15k indices with Auditbeat templates: the limit is hit when using ?human&pretty, and the response is almost 1GB even without those parameters.
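
As a rough illustration of where the limit comes from (this is not the actual Elasticsearch serialization code, and the class and method names below are made up for the example): any byte container whose length is an `int` cannot describe more than `Integer.MAX_VALUE` bytes. A larger size either overflows silently on a narrowing cast or has to be rejected.

```java
public final class ResponseSizeLimit {
    // Largest length an int-indexed byte container can report: 2^31 - 1 bytes.
    public static final long MAX_INT_LENGTH = Integer.MAX_VALUE;

    // Hypothetical helper: true if a response of this size could be described by an int length.
    public static boolean fitsInIntLength(long sizeInBytes) {
        return sizeInBytes >= 0 && sizeInBytes <= MAX_INT_LENGTH;
    }

    // Hypothetical helper: converts a size to an int length, failing loudly instead of wrapping.
    public static int toIntLength(long sizeInBytes) {
        return Math.toIntExact(sizeInBytes); // throws ArithmeticException on overflow
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;   // roughly the response size without ?human&pretty
        long threeGiB = 3L << 30; // an example size past the limit
        System.out.println(fitsInIntLength(oneGiB));   // true
        System.out.println(fitsInIntLength(threeGiB)); // false
        System.out.println((int) threeGiB);            // -1073741824: a plain cast wraps to a negative length
    }
}
```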

Even before outright breaking due to the 2GB size limit, requesting a response of this size can destabilize smaller master nodes. This has already been observed for smaller states when concurrent requests come into the mix.

This is not all that important an issue in practice for most users, because these massive responses are of limited use in most cases, but:
One implication of this issue is that the support diagnostics tool breaks and/or that running it might destabilize the master/cluster.
Another issue is orchestration tooling that might hit endpoints like the cluster state endpoint and destabilize/break clusters that way (already observed in the real world).

It is definitely a bug to have endpoints that eventually become unusable or, worse yet, allow for bringing down a node if called.
A likely solution is to remove these endpoints rather than make them work at larger scale, and to force users/tooling to use more specific endpoints for the problem at hand instead.
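
To illustrate what "more specific endpoints" can look like from the tooling side, here is a minimal sketch that builds a cluster state request restricted to selected metrics and indices, using the `/_cluster/state/<metrics>/<indices>` form of the API. The class name, method name, host, and index name are assumptions for the example, and no request is actually sent.

```java
import java.net.URI;

public final class NarrowClusterStateRequest {
    // Hypothetical helper: builds a cluster state URI limited to the given metrics and indices,
    // e.g. /_cluster/state/metadata/my-index instead of the full /_cluster/state.
    public static URI build(String baseUrl, String[] metrics, String[] indices) {
        if (metrics.length == 0 || indices.length == 0) {
            throw new IllegalArgumentException("at least one metric and one index are required");
        }
        String path = "/_cluster/state/" + String.join(",", metrics) + "/" + String.join(",", indices);
        return URI.create(baseUrl + path);
    }

    public static void main(String[] args) {
        // Assumed local node address and index name, for illustration only.
        URI uri = build("http://localhost:9200", new String[] {"metadata"}, new String[] {"my-index"});
        System.out.println(uri); // http://localhost:9200/_cluster/state/metadata/my-index
    }
}
```

A request scoped like this returns only the requested slice of the state, so its size does not grow with the total number of indices in the cluster.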

relates #77466

Labels

:Distributed Coordination/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection.)
>bug
Team:Clients (Meta label for clients team)
Team:Distributed (Obsolete) (Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.)
