CircuitBreakingException on extremely small dataset #18144
Comments
"fun"
@danielmitterdorfer this might be you though it is hard to tell. @tylersmalley is there any chance you can take a thread dump when this happens? Maybe just the hot_threads API (though it might not work because of the breaker)?
@nik9000, I will get that once it returns to a failed state again.
Thanks!
The _msearch requests will begin failing before the entire ES cluster. Nothing ever appeared in the logs. The requests fail with:
{
"error" : {
"root_cause" : [ {
"type" : "circuit_breaking_exception",
"reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]",
"bytes_wanted" : 726582240,
"bytes_limit" : 726571417
} ],
"type" : "circuit_breaking_exception",
"reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]",
"bytes_wanted" : 726582240,
"bytes_limit" : 726571417
},
"status" : 503
}
Here is the thread dump: https://gist.githubusercontent.com/tylersmalley/00105a27a0dd7b86016d78dc65e1bfb1/raw/jstack_7647_2.log
I will keep the cluster in a failed state should you need any additional information from it.
It says "I'm not doing anything". Any chance you can get a task list? The breaker you are hitting is trying to prevent requests from overwhelming memory. If you had in flight requests I should have seen them doing something in the thread dump. Lots of stuff in Elasticsearch is async so I wouldn't see everything but I expected something. The task list goes the other way - it registers something whenever a request starts and removes it when it stops. If we see something in the task list, especially if it is a lot of somethings, then we have our smoking gun. If we see nothing, well, we go look other places. The next place might be getting a heap dump. But I'm not going to put you through that. I should be able to reproduce this on my side. I believe @danielmitterdorfer, who I pinged, will not be around tomorrow so I might just keep this issue. |
curl 'http://localhost:9200/_tasks?pretty=true'
{
"error" : {
"root_cause" : [ {
"type" : "circuit_breaking_exception",
"reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]",
"bytes_wanted" : 726582240,
"bytes_limit" : 726571417
} ],
"type" : "circuit_breaking_exception",
"reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]",
"bytes_wanted" : 726582240,
"bytes_limit" : 726571417
},
"status" : 503
}
I will restart the cluster and monitor it. Here is a heap dump in its current failed state: https://gist.github.com/tylersmalley/00105a27a0dd7b86016d78dc65e1bfb1/raw/jmap_7647.bin
@nik9000 I tried to reproduce the scenario locally by running topbeat and running the query above periodically, but so far the circuit breaker did not trip. I am not surprised that the thread dump does not reveal much because the circuit breaker essentially prevents further work from coming into the system. Based on an analysis of the heap dump, I guess that the system is not really busy but the bytes are not freed properly and add up over time. I had a closer look at how the bytes are freed in:
inFlightRequestsBreaker(circuitBreakerService).addWithoutBreaking(-request().content().length());
Considering that the content is represented by a …
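For illustration, here is a minimal, self-contained sketch of the failure mode described above: bytes are reserved on a breaker using the content length seen when the request arrives, but released using a length computed later, so the counter drifts upward until the breaker trips. All names here are hypothetical; this is not Elasticsearch's actual accounting code.

import java.util.concurrent.atomic.AtomicLong;

final class InFlightAccountingSketch {

    // Minimal stand-in for a memory circuit breaker.
    static final class Breaker {
        private final AtomicLong used = new AtomicLong();
        private final long limit;

        Breaker(long limit) { this.limit = limit; }

        void add(long bytes) {
            long newUsed = used.addAndGet(bytes);
            if (bytes > 0 && newUsed > limit) {
                throw new IllegalStateException("Data too large: " + newUsed + " > " + limit);
            }
        }

        long used() { return used.get(); }
    }

    public static void main(String[] args) {
        Breaker breaker = new Breaker(1_000);

        // Reserve based on the content length seen on the way in ...
        breaker.add(100);

        // ... but release based on a length computed later, which can be smaller
        // if the content reference was replaced or sliced in the meantime.
        breaker.add(-60);

        // 40 bytes are now leaked; repeated requests accumulate until the limit is hit.
        System.out.println("leaked bytes after one request: " + breaker.used());
    }
}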
I have also installed Kibana 5.0.0-alpha2, imported the dashboard from topbeat, opened it, and set it to auto-refresh every 5 seconds. I could just see that the request breaker (which is used by BigArrays) … So I followed the respective …
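One low-effort way to watch that slow drift is to poll the breaker section of the node stats API and compare the estimated sizes over time. The sketch below is a hypothetical helper and assumes the stats are exposed at GET /_nodes/stats/breaker (as in recent Elasticsearch versions); it simply prints the raw JSON every few seconds.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class BreakerWatcher {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/_nodes/stats/breaker"))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();

        // Print the raw breaker stats every 5 seconds; a steadily growing
        // estimate with no traffic on the cluster points at a leak.
        while (true) {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(System.currentTimeMillis() + " " + response.body());
            Thread.sleep(5_000);
        }
    }
}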
@danielmitterdorfer in my testing the request circuit breaker (backing BigArrays) has always reset to 0 if there are no requests. You should be able to turn on TRACE logging for the breaker.
@dakrone Thanks for the hint. I'll check that.
@danielmitterdorfer I believe I have found what was causing this on my end, but I am unsure if it should have triggered the CircuitBreaker. While doing other testing I still had a script running which hit the health endpoint.
@tylersmalley Even that should not trip any circuit breaker, so we definitely need to investigate. If you can shed more light on how we can reproduce it, that's great.
I was able to reproduce this also on 5.0.0-alpha2 with x-pack installed and Kibana hitting the node. Just like @danielmitterdorfer said, the request breaker is increasing very slowly; it looks like there is a leak. I also tried setting …
@danielmitterdorfer here is the node script I have to perform the health requests on ES. In it I added a second check to run in parallel to speed up triggering the fault. https://gist.githubusercontent.com/tylersmalley/00105a27a0dd7b86016d78dc65e1bfb1/raw/test.js
This reproduces pretty easily now: building from master (or 5.0.0-alpha2), simply turn on logging and then run Kibana; the periodic health check that Kibana does causes it to increase over time.
@dakrone I can reproduce the increase now too, but the problem is not the …
I have investigated and now have a minimal reproduction scenario. The problem is that a …
I have also checked 2.x. It is not affected as the code is structured differently there.
With this commit we free all bytes reserved on the request circuit breaker. Closes elastic#18144
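The commit message only states that all reserved bytes are now freed. Purely as an illustration of one way to make the release symmetric (a sketch under my own assumptions, not the actual patch), the amount reserved can be captured once and released verbatim when the request finishes, regardless of what happens to the content in between:

import java.util.concurrent.atomic.AtomicLong;

final class ReservedBytes implements AutoCloseable {
    // Minimal stand-in for the in-flight requests accounting counter.
    static final AtomicLong inFlightBytes = new AtomicLong();

    private final long bytes;

    ReservedBytes(long bytes) {
        this.bytes = bytes;
        inFlightBytes.addAndGet(bytes);   // reserve (a real breaker would also check its limit here)
    }

    @Override
    public void close() {
        inFlightBytes.addAndGet(-bytes);  // release exactly what was reserved
    }

    public static void main(String[] args) {
        long contentLength = 128;         // hypothetical request content length
        try (ReservedBytes reservation = new ReservedBytes(contentLength)) {
            // handle the request; the content reference may change here without
            // affecting how many bytes will be released
        }
        System.out.println("in-flight bytes after request: " + inFlightBytes.get()); // prints 0
    }
}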
@tylersmalley The problem is fixed now and the fix will be included in the next release of the 5.0 series. Thanks for reporting and helping with the reproduction. Much appreciated!
Great, thanks @danielmitterdorfer!
@dakrone I also checked why this happens:
It's caused by the implementation of `ChildMemoryCircuitBreaker#limit()`. As far as I can see the overhead is only taken into account for logging statements but never for actual limiting. To me this does not sound like it's intended that way.
@danielmitterdorfer the overhead is taken into account also when comparing against the limit:
if (memoryBytesLimit > 0 && newUsedWithOverhead > memoryBytesLimit) {
    ....
}
Now that I remember it correctly (I was misinterpreting what a feature I added did, doh!), the overhead is only used for tweaking the estimation of an addition, not to factor into the total at all. This is because the fielddata circuit breaker estimates the amount of memory used but ultimately adjusts with the exact value used, so it should not add the overhead-modified usage, but the actual usage. Only the overhead is used for the per-addition check. Hopefully that clarifies things; I was slightly confusing myself there too, assuming it was taken into account in the total amount added to the breaker, but the current behavior is correct.
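To make that explanation concrete, here is a small self-contained sketch (my own code, not ChildMemoryCircuitBreaker itself) of the behavior described: the overhead multiplier is applied only when an individual addition is checked against the limit, while the running total always stores the raw estimate.

import java.util.concurrent.atomic.AtomicLong;

final class OverheadCheckSketch {
    private final AtomicLong used = new AtomicLong();
    private final long memoryBytesLimit;
    private final double overheadConstant;

    OverheadCheckSketch(long memoryBytesLimit, double overheadConstant) {
        this.memoryBytesLimit = memoryBytesLimit;
        this.overheadConstant = overheadConstant;
    }

    void addEstimateAndMaybeBreak(long bytes, String label) {
        long newUsed = used.addAndGet(bytes);
        long newUsedWithOverhead = (long) (newUsed * overheadConstant);
        if (memoryBytesLimit > 0 && newUsedWithOverhead > memoryBytesLimit) {
            used.addAndGet(-bytes);  // roll back the raw amount before breaking
            throw new IllegalStateException("[" + label + "] Data too large, would be ["
                    + newUsedWithOverhead + "] which is larger than the limit of [" + memoryBytesLimit + "]");
        }
        // Only the raw 'bytes' value stays in 'used'; the overhead never
        // accumulates into the total, matching the explanation above.
    }

    public static void main(String[] args) {
        OverheadCheckSketch breaker = new OverheadCheckSketch(1_000, 1.03);
        breaker.addEstimateAndMaybeBreak(900, "fielddata");      // 900 * 1.03 = 927 <= 1000, accepted
        try {
            breaker.addEstimateAndMaybeBreak(80, "fielddata");   // 980 * 1.03 = 1009 > 1000, trips
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}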
@dakrone Ah, right. I missed this line. Thanks for the explanation. Maybe we should add a comment in the code so the next time it comes up we don't have to dig to find this in the ticket again. :) With that explanation I am not sure whether any circuit breaker except the field data circuit breaker should have a user-defined overhead at all. Wdyt?
Elasticsearch version: alpha2
JVM version: build 1.8.0_74-b02
OS version: OS X El Capitan 10.11.3
Description of the problem including expected versus actual behavior:
Steps to reproduce:
Install and run elasticsearch-alpha2, topbeat-alpha2, and kibana-alpha2 (Topbeat is only monitoring the node process on a 20 second interval.)
I am using Kibana to monitor a node process. Here is the query Kibana is using to generate the visualization:
Run for about 30 minutes.
I have 1027 documents, and the total size is 1.6MB.
Provide logs (if relevant):
Eventually, all requests to ES will fail with this exception, including _stats.