Memory management: do not enforce the BigArrays limit on the network layer and the translog. #6374
Conversation
…layer and the translog. This commit disables memory circuit breaking in BigArrays on FsTranslog, NettyHttpServerTransport and NettyTransport so that these components are not impacted by heavy search requests. Close elastic#6332
 * used but has a different limit. This is typically useful for having component-specific limits that would be different
 * from the default one.
 */
public BigArrays limit(long newMaxSizeInBytes) {
I think the method name is a bit confusing, as it doesn't set the limit to a new value but rather returns a new BigArrays instance. I think we need a better name, or maybe a new constructor BigArrays(BigArrays arrays, long limit).
I'd rather go with another name, since a method allows me to override it in MockBigArrays and keep the sanity checks. What about clone(long newMaxSizeInBytes)?
sounds good to me.
limitedTo(long bytes)? Clone is kinda non-specific.
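The naming proposals above can be sketched as follows. This is a minimal, hypothetical illustration of the immutable-factory pattern being discussed (a limitedTo(bytes) method that returns a new instance with a different limit while sharing the same underlying byte accounting); class and field names are illustrative and do not reflect the actual Elasticsearch BigArrays implementation.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: limitedTo(bytes) returns a NEW instance with its own
// limit, leaving the original untouched. Both instances share the same
// byte-usage counter, mirroring the "shared counter" concern in this thread.
class LimitedArrays {
    private final long maxSizeInBytes;     // per-instance limit
    private final AtomicLong sharedUsed;   // accounting shared across instances

    LimitedArrays(long maxSizeInBytes) {
        this(maxSizeInBytes, new AtomicLong());
    }

    private LimitedArrays(long maxSizeInBytes, AtomicLong sharedUsed) {
        this.maxSizeInBytes = maxSizeInBytes;
        this.sharedUsed = sharedUsed;
    }

    /** Returns a new instance enforcing a different limit; this one is unchanged. */
    LimitedArrays limitedTo(long bytes) {
        return new LimitedArrays(bytes, sharedUsed);
    }

    long maxSizeInBytes() {
        return maxSizeInBytes;
    }
}
```

Because the method never mutates the receiver, a subclass (such as a mock used for sanity checks) can still override it, which was the motivation for preferring a method over a constructor.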
The commit looks good, but I worry about the scenario where transmitting a large cluster state (for example) still counts towards all other requests because of the shared counter. This will cause all other requests to fail until the bytes are released. I wonder if we should just introduce a dedicated buffer for the cluster state/translog/critical memory users that shouldn't be blocked.
I think this is a good property? Otherwise the risk is that the node would go out-of-memory, which is worse? On a related note, the case that you just described was my motivation to open #6375 and try to make sure we do not set a limit that is too low, as @kevinkluge mentioned on #6332. Maybe the right fix for now would be to just completely back out this feature until we have a proper way to share memory across breakers.
What worries me is that if you have a constant load of requests that just works, and suddenly a big cluster state update pushes the buffer above the limit, you only see it in the requests and have no way of knowing what happened. This will be hard to debug because cluster state publishing is internal to ES. One alternative is to log a warning for that case (letting X through, but the limit has been breached). Another is to say the CS and translog use their own unlimited BigArrays, as the PR doesn't limit them anyway.
I think the safest thing to do is to leave the breaker as-is, but increase it to a very high amount until we're able to share memory across breakers. I think OOME and rejecting CS/translog requests are both bad, so I'd rather wait until we have a breaker that has the potential to shrink (since BigArrays pages don't currently ever shrink); otherwise we could end up overloading a cluster that could potentially recover while remaining under the memory limit.
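The failure mode discussed above (one large internal allocation tripping the breaker for all unrelated requests) can be sketched with a toy shared-counter breaker. This is a hedged illustration only; the class and method names are invented for this sketch and are not the Elasticsearch circuit-breaker API.

```java
import java.util.concurrent.atomic.AtomicLong;

// Toy sketch of a shared-counter breaker: every component reserves bytes
// against the SAME counter, so a single large reservation (e.g. a big
// cluster-state publish) can cause unrelated small requests to be rejected
// until the bytes are released.
class SharedBreaker {
    private final long limitBytes;
    private final AtomicLong used = new AtomicLong();

    SharedBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Tries to reserve bytes; returns false (trips) when the shared limit would be exceeded. */
    boolean tryReserve(long bytes) {
        long newUsed = used.addAndGet(bytes);
        if (newUsed > limitBytes) {
            used.addAndGet(-bytes); // roll back the failed reservation
            return false;
        }
        return true;
    }

    void release(long bytes) {
        used.addAndGet(-bytes);
    }
}
```

With a 100-byte limit, an internal reservation of 90 bytes leaves only 10 bytes of headroom, so a subsequent 20-byte request-level reservation fails until the internal bytes are released; this is exactly why the PR exempts the translog and network layer from the shared limit.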
This commit disables memory circuit breaking in BigArrays on FsTranslog,
NettyHttpServerTransport and NettyTransport so that these components are not
impacted by heavy search requests.
Close #6332