CircuitBreakingException on extremely small dataset #18144

Closed
tylersmalley opened this issue May 4, 2016 · 23 comments
Labels
blocker · >bug · :Core/Infra/Circuit Breakers · v5.0.0-alpha3

Comments

@tylersmalley
Contributor

Elasticsearch version: alpha2

JVM version: build 1.8.0_74-b02

OS version: OS X El Capitan 10.11.3

Description of the problem including expected versus actual behavior:

Steps to reproduce:

  1. Install and run elasticsearch-alpha2, topbeat-alpha2, and kibana-alpha2 (Topbeat is only monitoring the node process on a 20 second interval.)

  2. I am using Kibana to monitor a node process. Here is the query Kibana is using to generate the visualization:

    {
       "size":0,
       "aggs":{
          "2":{
             "date_histogram":{
                "field":"@timestamp",
                "interval":"30s",
                "time_zone":"America/Los_Angeles",
                "min_doc_count":1
             },
             "aggs":{
                "1":{
                   "max":{
                      "field":"proc.mem.rss"
                   }
                }
             }
          }
       },
       "highlight":{
          "pre_tags":[
             "@kibana-highlighted-field@"
          ],
          "post_tags":[
             "@/kibana-highlighted-field@"
          ],
          "fields":{
             "*":{
    
             }
          },
          "require_field_match":false,
          "fragment_size":2147483647
       },
       "query":{
          "bool":{
             "must":[
                {
                   "query_string":{
                      "query":"*",
                      "analyze_wildcard":true
                   }
                },
                {
                   "match":{
                      "proc.cmdline":{
                         "query":"/Users/tyler/.nvm/versions/node/v4.4.3/bin/node ./bin/../src/cli",
                         "type":"phrase"
                      }
                   }
                },
                {
                   "range":{
                      "@timestamp":{
                         "gte":1462389453519,
                         "lte":1462390353519,
                         "format":"epoch_millis"
                      }
                   }
                }
             ],
             "must_not":[
    
             ]
          }
       }
    }
  3. Run for about 30 minutes.

I have 1027 documents, and the total size is 1.6MB.

Provide logs (if relevant):

CircuitBreakingException[[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]]
    at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:211)
    at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:128)
    at org.elasticsearch.http.HttpServer.dispatchRequest(HttpServer.java:109)
    at org.elasticsearch.http.netty.NettyHttpServerTransport.dispatchRequest(NettyHttpServerTransport.java:489)
    at org.elasticsearch.http.netty.HttpRequestHandler.messageReceived(HttpRequestHandler.java:65)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.http.netty.pipelining.HttpPipeliningHandler.messageReceived(HttpPipeliningHandler.java:85)
    at org.jboss.netty.channel.SimpleChannelHandler.handleUpstream(SimpleChannelHandler.java:88)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.jboss.netty.handler.codec.http.HttpChunkAggregator.messageReceived(HttpChunkAggregator.java:145)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.jboss.netty.handler.codec.http.HttpContentDecoder.messageReceived(HttpContentDecoder.java:108)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.jboss.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:459)
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.callDecode(ReplayingDecoder.java:536)
    at org.jboss.netty.handler.codec.replay.ReplayingDecoder.messageReceived(ReplayingDecoder.java:435)
    at org.jboss.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.OpenChannelsHandler.handleUpstream(OpenChannelsHandler.java:83)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.jboss.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.jboss.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.jboss.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
    at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.jboss.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.jboss.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Eventually, all requests to ES fail with this exception, including _stats.

@nik9000
Member

nik9000 commented May 4, 2016

"fun"

data for [<http_request>] would be larger than limit of [726571417/692.9mb]

@danielmitterdorfer this might be you though it is hard to tell.

@tylersmalley is there any chance you can take a thread dump when this happens? Maybe just the hot_threads API (though it might not work because of the breaker)?

@tylersmalley
Contributor Author

@nik9000, I will get that once it returns to a failed state again.

@nik9000
Member

nik9000 commented May 4, 2016

@nik9000, I will get that once it returns to a failed state again.

Thanks!

@tylersmalley
Contributor Author

The _msearch requests begin failing before the entire ES cluster does. Nothing ever appeared in /_nodes/hot_threads, and eventually it failed with the same exception:

{
  "error" : {
    "root_cause" : [ {
      "type" : "circuit_breaking_exception",
      "reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]",
      "bytes_wanted" : 726582240,
      "bytes_limit" : 726571417
    } ],
    "type" : "circuit_breaking_exception",
    "reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]",
    "bytes_wanted" : 726582240,
    "bytes_limit" : 726571417
  },
  "status" : 503
}

Here is the thread dump: https://gist.githubusercontent.com/tylersmalley/00105a27a0dd7b86016d78dc65e1bfb1/raw/jstack_7647_2.log

I will keep the cluster in a failed state should you need any additional information from it.

@nik9000
Member

nik9000 commented May 4, 2016

Here is the thread dump

It says "I'm not doing anything". Any chance you can get a task list? curl localhost:9200/_tasks should do it. That is another (new and shiny!) way for me to figure out what is going on.

The breaker you are hitting is trying to prevent requests from overwhelming memory. If you had in-flight requests, I should have seen them doing something in the thread dump. Lots of stuff in Elasticsearch is async, so I wouldn't see everything, but I expected something. The task list goes the other way - it registers something whenever a request starts and removes it when the request stops. If we see something in the task list, especially if it is a lot of somethings, then we have our smoking gun. If we see nothing, well, we go look in other places.

The next place might be getting a heap dump. But I'm not going to put you through that. I should be able to reproduce this on my side.

I believe @danielmitterdorfer, who I pinged, will not be around tomorrow so I might just keep this issue.

@tylersmalley
Contributor Author

Any chance you can get a task list?

curl 'http://localhost:9200/_tasks?pretty=true'
{
  "error" : {
    "root_cause" : [ {
      "type" : "circuit_breaking_exception",
      "reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]",
      "bytes_wanted" : 726582240,
      "bytes_limit" : 726571417
    } ],
    "type" : "circuit_breaking_exception",
    "reason" : "[parent] Data too large, data for [<http_request>] would be larger than limit of [726571417/692.9mb]",
    "bytes_wanted" : 726582240,
    "bytes_limit" : 726571417
  },
  "status" : 503
}

I will restart the cluster and monitor the _tasks endpoint until it begins throwing exceptions and report back.

Here is a heap dump in its current failed state: https://gist.github.com/tylersmalley/00105a27a0dd7b86016d78dc65e1bfb1/raw/jmap_7647.bin

@danielmitterdorfer
Member

danielmitterdorfer commented May 5, 2016

@nik9000 I tried to reproduce the scenario locally by running topbeat and running the query above periodically, but so far the circuit breaker has not tripped. I am not surprised that the thread dump does not reveal much, because the circuit breaker essentially prevents further work from coming into the system. Based on an analysis of the heap dump, my guess is that the system is not really busy but bytes are not freed properly and add up over time. I had a closer look at how the bytes are freed in HttpServer.ResourceHandlingHttpChannel:

inFlightRequestsBreaker(circuitBreakerService).addWithoutBreaking(-request().content().length());

Considering that the content is represented by a ChannelBufferBytesReference, which returns the readable bytes in the channel buffer, we could end up in a situation where we reserve more bytes than we free (as the readable bytes could change over time). But this is only a theory, and I was not able to observe this behavior. If this were the case, the fix would simply be to provide the number of reserved bytes to HttpServer.ResourceHandlingHttpChannel, but I am somewhat reluctant to do a fix without being able to reproduce it.
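
For illustration only, here is a minimal sketch of the pattern described above: if the amount to free is re-derived from a mutable source (such as a buffer's current readable bytes) instead of being recorded once at reservation time, reservation and release can drift apart. The types and names below are hypothetical, not Elasticsearch source.

// Hypothetical sketch of the reserve/free accounting, not actual Elasticsearch code.
interface ByteAccountingBreaker {
    void addWithoutBreaking(long bytes); // positive reserves, negative frees
}

final class ReservedHttpRequest implements AutoCloseable {
    private final ByteAccountingBreaker breaker;
    private final long reservedBytes; // captured once, at reservation time

    ReservedHttpRequest(ByteAccountingBreaker breaker, long contentLength) {
        this.breaker = breaker;
        this.reservedBytes = contentLength;
        breaker.addWithoutBreaking(reservedBytes); // reserve
    }

    @Override
    public void close() {
        // Frees exactly what was reserved. The risky variant described in the
        // comment above would instead re-read the request content's current
        // length here; if that value had shrunk in the meantime, the breaker
        // would be left with a permanent positive remainder.
        breaker.addWithoutBreaking(-reservedBytes);
    }
}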

@danielmitterdorfer
Member

danielmitterdorfer commented May 5, 2016

I have also installed Kibana 5.0.0-alpha2, imported the dashboard from topbeat, opened it, and set it to auto-refresh every 5 seconds. I could see that the request breaker (which is used by BigArrays) is slowly increasing (a few MB after a few minutes). The in-flight requests breaker always resets to zero; I think it just happens to be the victim, as it tries very early during request processing to reserve an amount of bytes (we can also see from the stack trace that it is actually the parent circuit breaker that trips, not the in-flight requests breaker).

So I followed the respective close() calls that are supposed to free the reserved bytes, but there are a lot of places to follow. @dakrone: Is it intended that the number of reserved bytes in the request circuit breaker grows over time (since BigArrays seems to be intended as some kind of pool), or should the number of bytes reserved in BigArrays be zero after a request has been processed?

@dakrone
Member

dakrone commented May 5, 2016

@danielmitterdorfer in my testing, the request circuit breaker (backing BigArrays) has always reset to 0 when there are no requests.

You should be able to turn on TRACE logging for the org.elasticsearch.indices.breaker package and see all increments and decrements to the breakers (note: this is very verbose).
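
(For anyone following along: one way to do this on 5.x-era builds is via the dynamic cluster settings API, roughly as below. Depending on the exact version, the logger key may need to be the abbreviated form, e.g. logger.indices.breaker, rather than the fully qualified package name.)

curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
  "transient": {
    "logger.org.elasticsearch.indices.breaker": "TRACE"
  }
}'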

@danielmitterdorfer
Member

@dakrone Thanks for the hint. I'll check that.

@tylersmalley
Contributor Author

@danielmitterdorfer I believe I have found what was causing this on my end, but I am unsure whether it should have triggered the CircuitBreaker. While doing other testing I still had a script running which hit the cluster/_health endpoint, paused for 100ms, then repeated. I am fairly certain this was not an issue in 2.3, but I can double-check.

@danielmitterdorfer
Member

@tylersmalley Even that should not trip any circuit breaker, so we definitely need to investigate. If you can shed more light on how we can reproduce it, that would be great.

@dakrone
Member

dakrone commented May 5, 2016

I was able to reproduce this also on 5.0.0-alpha2 with x-pack installed and Kibana hitting the node. Just like @danielmitterdorfer said, the request breaker is increasing very slowly; it looks like there is a leak.

I also tried setting network.breaker.inflight_requests.overhead: 0.0 and it looks like it is not being taken into account in at least one place (still having bytes added over time instead of all in_flight_request additions being 0)

@tylersmalley
Contributor Author

@danielmitterdorfer here is the node script I use to perform the health requests against ES. In it I added a second check to run in parallel to speed up reproducing the fault.

https://gist.githubusercontent.com/tylersmalley/00105a27a0dd7b86016d78dc65e1bfb1/raw/test.js
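
(The gist above is the authoritative script; for readers without Node handy, an equivalent quick-and-dirty shell loop that polls the same endpoint at roughly the same rate, with a second request in parallel, might look like this.)

while true; do
  curl -s 'http://localhost:9200/_cluster/health' > /dev/null &
  curl -s 'http://localhost:9200/_cluster/health' > /dev/null
  sleep 0.1
done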

@dakrone
Member

dakrone commented May 5, 2016

This reproduces pretty easily now: building from master (or 5.0.0-alpha2), simply turn on logging and then run Kibana; the periodic health check that Kibana does causes it to increase over time.

@danielmitterdorfer
Member

@dakrone I can reproduce the increase now too, but the problem is not the in_flight_requests breaker; it is the request breaker that keeps increasing. Nevertheless, I'll investigate what's going on.

@danielmitterdorfer
Member

danielmitterdorfer commented May 9, 2016

I have investigated and now have a minimal reproduction scenario: curl -XHEAD http://localhost:9200/

The problem is that an XContentBuilder is created which is backed by a BigArrays instance, but then we use a constructor of BytesRestResponse without a builder. After that we lose track of the BigArrays instance and never free it. This affects at least RestMainAction and probably other actions too. I am now working on a fix.
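
To make the leak shape concrete for future readers, here is a hypothetical, heavily simplified sketch (not the actual RestMainAction or XContentBuilder code): bytes reserved for a builder's backing storage are only released on the code path that builds the response from the builder.

// Hypothetical sketch of the leak shape, not actual Elasticsearch source.
interface PagedBytes extends AutoCloseable {
    @Override
    void close(); // returns pages to the pool and decrements the request breaker
}

final class ResponseBuilder {
    final PagedBytes backing; // breaker-tracked storage reserved at construction time

    ResponseBuilder(PagedBytes backing) {
        this.backing = backing;
    }
}

final class Responses {
    // Path A: the response is built from the builder, so its backing bytes get released.
    static String fromBuilder(ResponseBuilder builder) {
        try {
            return "response body";
        } finally {
            builder.backing.close(); // reservation freed
        }
    }

    // Path B (the bug shape): a builder was created and reserved bytes, but the
    // response is constructed without it, so nothing ever closes the backing
    // storage and the request breaker's count only ever grows.
    static String withoutBuilder(ResponseBuilder leakedBuilder) {
        return "response body"; // leakedBuilder.backing is never closed
    }
}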

@danielmitterdorfer
Member

I have also checked 2.x. It is not affected, as the code is structured differently there.

danielmitterdorfer added a commit to danielmitterdorfer/elasticsearch that referenced this issue May 9, 2016
With this commit we free all bytes reserved on the request
circuit breaker.

Closes elastic#18144
@danielmitterdorfer danielmitterdorfer removed their assignment May 9, 2016
@danielmitterdorfer
Member

@tylersmalley The problem is fixed now, and the fix will be included in the next release of the 5.0 series. Thanks for reporting and helping with the reproduction. Much appreciated!

@tylersmalley
Contributor Author

Great, thanks @danielmitterdorfer!

@danielmitterdorfer
Member

@dakrone I also checked why this happens:

I also tried setting network.breaker.inflight_requests.overhead: 0.0 and it looks like it is not being taken into account in at least one place (still having bytes added over time instead of all in_flight_request additions being 0)

It's caused by the implementation of ChildMemoryCircuitBreaker#limit(). As far as I can see, the overhead is only taken into account for logging statements but never for actual limiting. To me this does not sound like it's intended that way.

@dakrone
Member

dakrone commented May 10, 2016

@danielmitterdorfer the overhead is also taken into account when comparing against the limit:

if (memoryBytesLimit > 0 && newUsedWithOverhead > memoryBytesLimit) {
    ....
}

I remember it correctly now (I was misinterpreting what a feature I added did, doh!): the overhead is only for tweaking the estimation of an addition, not for factoring into the total at all. This is because the fielddata circuit breaker estimates the amount of memory used but ultimately adjusts with the exact value used, so it should add the actual usage, not the overhead-modified usage. The overhead is only used for the per-addition check.

Hopefully that clarifies things; I was slightly confusing myself there too, assuming the overhead was factored into the breaker's added total, but the current behavior is correct.
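
For future readers, a stripped-down sketch of the semantics just described (hypothetical class, not the real ChildMemoryCircuitBreaker): the overhead multiplier only influences the per-addition limit check, while the running total stores the raw estimate.

// Hypothetical sketch of the described semantics, not Elasticsearch source.
final class SketchBreaker {
    private final long limitBytes;
    private final double overhead; // multiplier used only for the limit check
    private long used;

    SketchBreaker(long limitBytes, double overhead) {
        this.limitBytes = limitBytes;
        this.overhead = overhead;
    }

    void addEstimateBytesAndMaybeBreak(long bytes) {
        long newUsed = used + bytes;
        long newUsedWithOverhead = (long) (newUsed * overhead);
        // The overhead-adjusted value is compared against the limit...
        if (limitBytes > 0 && newUsedWithOverhead > limitBytes) {
            throw new IllegalStateException("[breaker] data too large");
        }
        // ...but the stored total is the raw estimate, so a caller that later
        // adjusts to the exact measured size (as the fielddata breaker does)
        // stays consistent.
        used = newUsed;
    }
}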

@danielmitterdorfer
Member

@dakrone Ah, right, I missed this line... Thanks for the explanation. Maybe we should add a comment in the code so the next time this comes up we don't have to dig through the ticket to find it again. :) With that explanation, I am not sure whether any circuit breaker except the fielddata circuit breaker should have a user-defined overhead at all. Wdyt?
