New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tighter circuit-breakers for aggs on master-eligible nodes #76145
Comments
Pinging @elastic/es-analytics-geo (Team:Analytics) |
That's an interesting question! If we do it I think it's important that the error message complain about not being able to reserve memory because its running on a master node. It might be obvious to some folks but I'll say it out loud: there are two sorts of times when aggs can happen on master nodes:
Both of these can use a bunch of memory. The work of the data nodes generally fall to the |
This seems like the heart of it: by the time we're tripping the real memory breaker we're basically useless as a master node, we don't have the memory left to deal with incoming requests from the other nodes in the cluster that ask us to do master-related things. In the linked issue the user suggests ignoring the real memory circuit breaker on certain kinds of master-node requests but really this isn't going to work, there's lots of other master-node requests and updating the cluster state itself may need a bunch of memory too, so if we just start ignoring the real-memory circuit breaker we'll just blow up in a different way. So I guess the question is whether we can track the memory usage of aggs on the coordinating node more accurately with the request breaker. That sounds like a big ask. |
I want to do that as part of a few other ideas I've had. But its a big project. I wonder if we could hack something together which gives us a lower real memory breaker on the masters? |
I'm sure we could but we'd also have to make master-node requests ignore the newly-lowered breaker. I'd rather not do that, I think the master should push back on the rest of the cluster if it's approaching OOM territory. |
Aggs already get a special breaker. It's not simple, but its not as crazy as it sounds. |
@DaveCTurner @nik9000 , I was reviewing this issue, and wanted to get your thoughts - are there any actions we can take on this issue specifically as next steps? Nik opened an issue to follow up, should we close this one? |
I don't think there is much we can do here are this point. |
In #76073 a user reported problems with hitting the real-memory circuit breaker on a mixed master/data node that was running heavy aggregations. We recommend dedicated masters to protect against this kind of problem, but it's hard for users to pin down exactly when it makes sense to introduce dedicated masters into their cluster.
My question is whether we could push back on heavy aggregations earlier on master-eligible nodes in order to maintain enough headroom for doing their master-related activities.
The text was updated successfully, but these errors were encountered: