Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tighter circuit-breakers for aggs on master-eligible nodes #76145

Open
DaveCTurner opened this issue Aug 5, 2021 · 9 comments
Open

Tighter circuit-breakers for aggs on master-eligible nodes #76145

DaveCTurner opened this issue Aug 5, 2021 · 9 comments
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@DaveCTurner
Copy link
Contributor

In #76073 a user reported problems with hitting the real-memory circuit breaker on a mixed master/data node that was running heavy aggregations. We recommend dedicated masters to protect against this kind of problem, but it's hard for users to pin down exactly when it makes sense to introduce dedicated masters into their cluster.

My question is whether we could push back on heavy aggregations earlier on master-eligible nodes in order to maintain enough headroom for doing their master-related activities.

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Aug 5, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@nik9000
Copy link
Member

nik9000 commented Aug 5, 2021

My question is whether we could push back on heavy aggregations earlier on master-eligible nodes in order to maintain enough headroom for doing their master-related activities.

That's an interesting question! If we do it I think it's important that the error message complain about not being able to reserve memory because its running on a master node.

It might be obvious to some folks but I'll say it out loud: there are two sorts of times when aggs can happen on master nodes:

  1. When master nodes are also data nodes
  2. When master nodes are also coordinating nodes

Both of these can use a bunch of memory. The work of the data nodes generally fall to the request breaker. The work of the coordinating nodes generally falls to the real memory breaker. Its easier to reason about the request breaker and I imagine it'd do a better job helping in this case. I'm not sure how to think about the real memory breaker here.

@DaveCTurner
Copy link
Contributor Author

The work of the coordinating nodes generally falls to the real memory breaker.

This seems like the heart of it: by the time we're tripping the real memory breaker we're basically useless as a master node, we don't have the memory left to deal with incoming requests from the other nodes in the cluster that ask us to do master-related things. In the linked issue the user suggests ignoring the real memory circuit breaker on certain kinds of master-node requests but really this isn't going to work, there's lots of other master-node requests and updating the cluster state itself may need a bunch of memory too, so if we just start ignoring the real-memory circuit breaker we'll just blow up in a different way.

So I guess the question is whether we can track the memory usage of aggs on the coordinating node more accurately with the request breaker. That sounds like a big ask.

@nik9000
Copy link
Member

nik9000 commented Aug 5, 2021

So I guess the question is whether we can track the memory usage of aggs on the coordinating node more accurately with the request breaker. That sounds like a big ask.

I want to do that as part of a few other ideas I've had. But its a big project. I wonder if we could hack something together which gives us a lower real memory breaker on the masters?

@DaveCTurner
Copy link
Contributor Author

I wonder if we could hack something together which gives us a lower real memory breaker on the masters?

I'm sure we could but we'd also have to make master-node requests ignore the newly-lowered breaker. I'd rather not do that, I think the master should push back on the rest of the cluster if it's approaching OOM territory.

@nik9000
Copy link
Member

nik9000 commented Aug 5, 2021

I'm sure we could but we'd also have to make master-node requests ignore the newly-lowered breaker. I'd rather not do that, I think the master should push back on the rest of the cluster if it's approaching OOM territory.

Aggs already get a special breaker. It's not simple, but its not as crazy as it sounds.

@nik9000
Copy link
Member

nik9000 commented Sep 8, 2021

I opened #77449 to talk about using a dense representation for aggs for reduction. I think if we're going to do this, then it'll be more worth doing if we had #77449. I'm still not sure its worth doing though. I'm going to drop team-discuss for now because it isn't going to be a near term thing.

@wchaparro
Copy link
Contributor

@DaveCTurner @nik9000 , I was reviewing this issue, and wanted to get your thoughts - are there any actions we can take on this issue specifically as next steps? Nik opened an issue to follow up, should we close this one?

@nik9000
Copy link
Member

nik9000 commented Feb 22, 2022

@DaveCTurner @nik9000 , I was reviewing this issue, and wanted to get your thoughts - are there any actions we can take on this issue specifically as next steps? Nik opened an issue to follow up, should we close this one?

I don't think there is much we can do here are this point.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/Aggregations Aggregations >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

4 participants