
Aggregations can be bottlenecked on ChildMemoryCircuitBreaker.limit() #58647

Closed
hackerwin7 opened this issue Jun 29, 2020 · 17 comments · Fixed by #66895
Assignees
Labels
:Analytics/Aggregations Aggregations :Performance All issues related to Elasticsearch performance including regressions and investigations Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Performance Meta label for performance team

Comments

@hackerwin7

hackerwin7 commented Jun 29, 2020

In ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(), the limit() method runs a while loop that atomically compares and sets the used value.

When we run high-QPS queries against an ES node, the major cost is ChildMemoryCircuitBreaker.limit(), which makes AggregationPhase.preProcess() extremely expensive.

If a data node receives complicated search request bodies at high QPS, it is kept busy creating aggregators: SumAggregator.<init> leads to BigArrays.newDoubleArray(), which leads to BigArrays.adjustBreaker(), which finally calls breaker.addEstimateBytesAndMaybeBreak() and ChildMemoryCircuitBreaker.limit()
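
For context, limit() boils down to a compare-and-swap retry loop on the breaker's shared counter. A simplified sketch of the pattern (not the actual ChildMemoryCircuitBreaker code):

```
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the contended accounting loop, not the real
// ChildMemoryCircuitBreaker: a single shared counter that every small
// allocation bumps with a compare-and-swap retry loop.
class SketchBreaker {
    private final AtomicLong used = new AtomicLong();
    private final long limitBytes;

    SketchBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    long addEstimateBytesAndMaybeBreak(long bytes, String label) {
        long current;
        long updated;
        do {
            current = used.get();
            updated = current + bytes;
            if (updated > limitBytes) {
                throw new IllegalStateException("[" + label + "] would use " + updated + " bytes, over the limit");
            }
            // Under high QPS every aggregator construction lands here and the
            // threads keep retrying this CAS against each other.
        } while (used.compareAndSet(current, updated) == false);
        return updated;
    }
}
```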

@markharwood markharwood added :Performance All issues related to Elasticsearch performance including regressions and investigations :Search/Search Search-related issues that do not fall into other categories labels Jun 29, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-perf (:Performance)

@elasticmachine elasticmachine added the Team:Performance Meta label for performance team label Jun 29, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jun 29, 2020
@hackerwin7
Author

A flame graph, here it is:
[flame graph image]

The biggest cost is ChildMemoryCircuitBreaker.limit().

@jtibshirani jtibshirani added :Analytics/Aggregations Aggregations :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 7, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed Team:Search Meta label for search team labels Jul 7, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Circuit Breakers)

@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jul 7, 2020
@nik9000
Member

nik9000 commented Jul 7, 2020

I wonder if we should consolidate how we grab bytes from the breaker. Or if we should grab a big chunk whenever we grab any and then only come back when we need more. Sort of like a TLAB. For the most part BigArrays is designed to oversize its allocations, so for big requests the grabs amortize to happening infrequently. But for many small requests we grab breaker bytes very frequently.
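
A rough sketch of that TLAB-style idea, using illustrative names rather than the actual Elasticsearch API: reserve a chunk from the shared breaker and satisfy small requests locally, only going back when the chunk runs out.

```
// Illustrative sketch only; "RequestBreaker" stands in for the real child breaker.
interface RequestBreaker {
    void addEstimateBytesAndMaybeBreak(long bytes, String label);
}

class ChunkedBreaker {
    private static final long CHUNK_BYTES = 1024 * 1024; // hypothetical chunk size
    private final RequestBreaker shared;
    private long localFree = 0; // bytes reserved from the shared breaker but not yet handed out

    ChunkedBreaker(RequestBreaker shared) {
        this.shared = shared;
    }

    // Called on the single thread building one request's aggregators,
    // so the local counter needs no atomics.
    void add(long bytes, String label) {
        if (bytes > localFree) {
            long reserve = Math.max(bytes - localFree, CHUNK_BYTES);
            shared.addEstimateBytesAndMaybeBreak(reserve, label); // one contended call per chunk
            localFree += reserve;
        }
        localFree -= bytes;
    }

    // On request close, hand back whatever part of the reservation went unused.
    void close(String label) {
        if (localFree > 0) {
            shared.addEstimateBytesAndMaybeBreak(-localFree, label);
            localFree = 0;
        }
    }
}
```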

@nik9000
Member

nik9000 commented Jul 7, 2020

@hackerwin7 could you try running this against master or 7.x? Could you share the search and reproduction?

It looks like the search is a terms which contains another terms which contains a sum. I've cut down on the number of aggregator allocations in 7.x and master over at #56487. It may not help if the cardinality the sum agg has is ultimately pretty low. The flame graph shows that the initial allocation is what is getting wedged, which I haven't changed. But there should be fewer allocations.

@hackerwin7
Author

hackerwin7 commented Jul 8, 2020

@nik9000 Thanks for your reply.

@hackerwin7 could you try running this against master or 7.x? Could you share the search and reproduction?

We tested on 7.6.2. It seems that with a high QPS of complicated aggregation searches (which may build many aggregators), the main cost is creating the Aggregators and BigArrays.adjustBreaker(), which in the end calls ChildMemoryCircuitBreaker.limit().

I've cut down on the number of aggregator allocations in 7.x and master over at #56487

Reducing aggregator allocations may be a way to reduce the contention in limit() (the loop that sets the atomic value). I will test this on the master branch and check the effect.

With high QPS and searches that build many more aggregators, according to this flame graph, can we reduce the frequency of calls to limit() to reduce the contention? Maybe we could batch limit() or addEstimateBytesAndMaybeBreak(): only when the added bytes reach a buffer size would we run the real loop that sets the atomic value.

@danielmitterdorfer
Member

I wonder if we should consolidate how we grab bytes from the breaker

I agree that the right approach is to reduce the number of calls to the circuit breaker instead of optimizing the circuit breaker's #limit() method which appears to be the bottleneck because of the high number of calls to it (2.6 million calls in the flame graph that you present).

I also think that the issue title mentioning a "race condition" is misleading because - in my understanding - this issue is about inefficient behavior.

@danielmitterdorfer danielmitterdorfer changed the title Race condition in ChildMemoryCircuitBreaker.limit() Aggregations can be bottlenecked on ChildMemoryCircuitBreaker.limit() Jul 8, 2020
@lqbilbo

lqbilbo commented Nov 16, 2020

I wonder if we should consolidate how we grab bytes from the breaker

I agree that the right approach is to reduce the number of calls to the circuit breaker instead of optimizing the circuit breaker's #limit() method which appears to be the bottleneck because of the high number of calls to it (2.6 million calls in the flame graph that you present).

I also think that the issue title mentioning a "race condition" is misleading because - in my understanding - this issue is about inefficient behavior.

@danielmitterdorfer Hi Daniel, if we add a setting indices.breaker.memory_usage_estimates.enable in ClusterSettings that dynamically controls whether memory usage is estimated during the aggregation phase, we can reduce the number of calls to AggregatorBase#addRequestCircuitBreakerBytes and BigArrays#adjustBreaker when the switch is off. Do you think that's a good approach?

For example:

protected long addRequestCircuitBreakerBytes(long bytes) {
        // Only use the potential to circuit break if bytes are being incremented
        if (isMemoryUsageEstimatesEnabled()) {   // we add a switch
            if (bytes > 0) {
                this.breakerService
                    .getBreaker(CircuitBreaker.REQUEST)
                    .addEstimateBytesAndMaybeBreak(bytes, "<agg [" + name + "]>");
            } else {
                this.breakerService
                    .getBreaker(CircuitBreaker.REQUEST)
                    .addWithoutBreaking(bytes);
            }
        }
        this.requestBytesUsed += bytes;
        return requestBytesUsed;
    }

and

void adjustBreaker(final long delta, final boolean isDataAlreadyCreated) {
        if (isMemoryUsageEstimatesEnabled()) {         // we add a switch
            if (this.breakerService != null) {
                CircuitBreaker breaker = this.breakerService.getBreaker(breakerName);
                if (this.checkBreaker) {
                    // checking breaker means potentially tripping, but it doesn't
                    // have to if the delta is negative
                    if (delta > 0) {
                        try {
                            breaker.addEstimateBytesAndMaybeBreak(delta, "<reused_arrays>");
                        } catch (CircuitBreakingException e) {
                            if (isDataAlreadyCreated) {
                                // since we've already created the data, we need to
                                // add it so closing the stream re-adjusts properly
                                breaker.addWithoutBreaking(delta);
                            }
                            // re-throw the original exception
                            throw e;
                        }
                    } else {
                        breaker.addWithoutBreaking(delta);
                    }
                } else {
                    // even if we are not checking the breaker, we need to adjust
                    // its totals, so add without breaking
                    breaker.addWithoutBreaking(delta);
                }
            }
        }
    }

Because we only add an abstract method in SearchContext and a volatile variable in BigArrays, and we assign the switch value to BigArrays's variable only when AggregatorBase is initially built, we have kept the scope of the change small. Like this:

protected final SearchContext context;

protected AggregatorBase(String name, AggregatorFactories factories, SearchContext context, Aggregator parent,
            List<PipelineAggregator> pipelineAggregators, Map<String, Object> metaData) throws IOException {
        this.name = name;
        this.pipelineAggregators = pipelineAggregators;
        this.metaData = metaData;
        this.parent = parent;
        this.context = context;
        this.breakerService = context.bigArrays().breakerService();
        // if we disable memory usage estimates, bigArrays doesn't need to adjust memory size.
        context.bigArrays().enableMemoryUsageEstimates(context.enableMemoryUsageEstimates());
        ...
    }
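
For reference, here is a sketch of how the proposed switch could be registered as a dynamic node setting using Elasticsearch's Setting API; the setting name is from this proposal and does not exist upstream:

```
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;

// Sketch only: registers the proposed (not upstream) switch so it could be
// flipped dynamically; defaults to true to keep the current behavior.
public final class MemoryUsageEstimatesSettings {
    public static final Setting<Boolean> MEMORY_USAGE_ESTIMATES_ENABLED = Setting.boolSetting(
        "indices.breaker.memory_usage_estimates.enable",
        true,
        Property.Dynamic,
        Property.NodeScope
    );

    private MemoryUsageEstimatesSettings() {}
}
```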

We have also run a lot of test cases. They show that the bucket and metrics aggregations in our query phase consume little memory, but the number of breaker calls is large.

I reviewed pull requests #46751 and #57042, and they still don't fix this problem. So if needed, I'll share the code.

Looking forward to your reply!

@nik9000
Member

nik9000 commented Nov 17, 2020

I'm curious if this is still a big problem in 7.9.

I'm a bit worried that mutating BigArrays would be dangerous because it's shared by the system. We could certainly fork a fresh one for each request but at that point it might be cleaner to have the forking process reserve some chunk up front for the request.

I dunno. As much fun as it is to attack bottlenecks, it isn't clear to me that this is still a big bottleneck.

@lqbilbo

lqbilbo commented Nov 23, 2020

I'm curious if this is still a big problem in 7.9.

I'm a bit worried that mutating BigArrays would be dangerous because it's shared by the system. We could certainly fork a fresh one for each request but at that point it might be cleaner to have the forking process reserve some chunk up front for the request.

@nik9000 Thanks for your reply. After seeing it, I ran some test cases on 7.9, but regrettably the new version still doesn't solve the problem. I agree with you that we shouldn't mutate BigArrays, so I put the setting into BreakerSettings and added the condition if (this.breakerService != null && this.breakerService.isMemoryUsageEstimatesEnabled()) to BigArrays#adjustBreaker. And isMemoryUsageEstimatesEnabled() is only false when AggregatorBase sets it, so it will not affect other parts of the system.

I dunno. As much fun as it is to attack bottlenecks it isn't clear to me that this is both still a big bottleneck.

We really do face this problem: at high QPS the CPU is almost fully used, and that affects the stability of our Elasticsearch cluster. So we want to attack the bottleneck for this kind of business scenario.

@nik9000 nik9000 self-assigned this Nov 24, 2020
@nik9000
Member

nik9000 commented Nov 24, 2020

@lqbilbo. OK. I'm going to add this to my list of things to think about some more too then. I'm not really sold on the idea of not checking the breaker on allocation but it's certainly something. I want to play with pre-allocating a bunch of bytes up front and see what that looks like for us. It might not be as big a change as it sounds.... But it'll take me a week or two before I can get to it what with Thanksgiving here in the US and me spending most of my time on the initial push for runtime fields.

Regardless of how we fix this I'd like to know more about your tests. Ultimately I'm going to have to reproduce this with rally and/or with our jmh benchmarks. Can you tell me what you were doing to get this to show up? I have a suspicion I'll be able to see this really easily with JMH but with rally I'd need to set up a bunch of machines to run a realistic cluster and generate load. Like, what kinds of searches are you sending? To how many shards? On how many nodes? More shards on single nodes'll all step on each other.....

@nik9000
Member

nik9000 commented Dec 2, 2020

I had a look this morning. I made a JMH benchmark that builds six sum aggs as fast as its silicon heart can manage. The measurements I get out of it are kind of funny, especially when I run them on a GCP node with 60 cores. There, I see that building these SumAggregators on all the threads clocks in at about 9 microseconds. Not super duper fast, but, like, fine.

Now, with the circuit breaker on I see lots of contention and building the SumAggregators clocks in at 280 microseconds or 620 microseconds. I get both measurements and I'm not 100% sure why, but they are fairly consistent. Measurement mysteries aside, the circuit breaker costs about 600 microseconds in 60 threads. When I run the benchmark with a single thread I see the circuit breaker adding 1.2 microseconds to building the aggregators. So contention is certainly killing us here. I mean, the flame graph said it, but I wanted to confirm.

I hacked together a prototype that only dips into the circuit breaker once for the whole request. Presumably we could make a slight overestimate of the memory we'll be needing for a "nearly empty" aggregation and allocate all of that up front. That dropped the construction to 170 microseconds or 85 microseconds. Same deal with the two consistent and different measurements. Something must be wrong with my test, but it shows us the order of magnitude. Anyway, 170 microseconds is a lot more than 9 microseconds, but it's still a fair savings percentage-wise.

What kind of numbers were you seeing, @lqbilbo? What kind of aggs are you running and how many cores do you have? What kind of CPU are you running? I haven't tried ARM. When I ran this on my desktop which has 12 cores I didn't really see much contention. So I imagine you are running with far more cores.

How fast are the requests you are measuring? 600 microseconds is, like, not all that much time. If you end up going to disk for any of your requests the 600 microseconds is a drop in the bucket. Still, lower CPU load is lower CPU load. And we're wasting a bunch of time here if you are running with dozens of CPUs. So I still think this is worth me looking more into.
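
For reference, a stripped-down JMH harness along these lines (a sketch, not the benchmark that actually landed) might look like this, hammering one shared counter from every core:

```
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

// Sketch of a contention benchmark: every JMH worker thread repeatedly "builds"
// six aggregators, i.e. makes six small reservations against one shared counter,
// so the CAS retry loop is essentially the only thing being measured.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
@Threads(Threads.MAX) // one worker per available core to provoke contention
@State(Scope.Benchmark)
public class BreakerContentionBenchmark {
    private final AtomicLong used = new AtomicLong();

    @Benchmark
    public long buildSixSums() {
        long last = 0;
        for (int agg = 0; agg < 6; agg++) {
            last = addEstimate(1024); // roughly what each small BigArrays allocation does
        }
        return last;
    }

    private long addEstimate(long bytes) {
        long current;
        long updated;
        do {
            current = used.get();
            updated = current + bytes;
        } while (used.compareAndSet(current, updated) == false);
        return updated;
    }
}
```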

@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@lqbilbo

lqbilbo commented Dec 5, 2020

I had a look this morning. I made a JMH benchmark that builds six sum aggs as fast as its silicon heart can manage. The measurements I get out of it are kind of funny, especially when I run them on a GCP node with 60 cores. There, I see that building these SumAggregators on all the threads clocks in at about 9 microseconds. Not super duper fast, but, like, fine.

How fast are the requests you are measuring? 600 microseconds is, like, not all that much time. If you end up going to disk for any of your requests the 600 microseconds is a drop in the bucket. Still, lower CPU load is lower CPU load. And we're wasting a bunch of time here if you are running with dozens of CPUs. So I still think this is worth me looking more into.

We ran tests on a test cluster with 48 cores and 256 GB of memory on each node. The aggs are a query phase with one date histogram filter, one terms filter, and about sixty sum aggs. Our cluster is the production cluster; all of its boxes have a lot of HDD space (it's a logging scenario), and we set up RAID 6 for higher I/O read speed. The cluster configuration is three client nodes and 24 data & master nodes. We ran nearly three thousand requests per second and the CPU load at its highest point was about 60 percent, not too high, since nearly 100% CPU load would lead to a full search queue, which would make the test meaningless. Also, the p99 is just 600 microseconds, as your test shows.

I hacked together a prototype that only dips into the circuit breaker once for the whole request. Presumably we could make a slight overestimate of the memory we'll be needing for a "nearly empty" aggregation and allocate all of that up front. That dropped the construction to 170 microseconds or 85 microseconds. Same deal with the two consistent and different measurements. Something must be wrong with my test, but it shows us the order of magnitude. Anyway, 170 microseconds is a lot more than 9 microseconds, but it's still a fair savings percentage-wise.

So could I follow your measurements and run the tests with the circuit breaker dipped into only once for the whole request, estimating the memory or allocating everything the aggregations will consume up front, before any of this happens?

@nik9000
Member

nik9000 commented Dec 7, 2020

We ran tests on a test cluster with 48 cores and 256 GB of memory on each node. The aggs are a query phase with one date histogram filter, one terms filter, and about sixty sum aggs. Our cluster is the production cluster; all of its boxes have a lot of HDD space (it's a logging scenario), and we set up RAID 6 for higher I/O read speed. The cluster configuration is three client nodes and 24 data & master nodes. We ran nearly three thousand requests per second and the CPU load at its highest point was about 60 percent, not too high, since nearly 100% CPU load would lead to a full search queue, which would make the test meaningless. Also, the p99 is just 600 microseconds, as your test shows.

Perfect.

Is the index being written to? If it isn't, you might be able to get caching by separating your aggs from your top_hits searches. But if your queries vary then you won't get any cache hits anyway. One thing that is useful: if two requests come in with the same cache key and we're caching the request, then we merge those requests together.

So could I follow your measurements and run the tests with the circuit breaker dipped into only once for the whole request, estimating the memory or allocating everything the aggregations will consume up front, before any of this happens?

I really did just hack together the benchmark to see if I could reproduce your issue and to see if my hunch that we could pre-allocate would help. It isn't really useful outside of theorizing. Be that as it may, I did a sixty sums version on my 12 core machine and got:

Benchmark                                            (breaker)  Mode  Cnt    Score    Error  Units
AggConstructionContentionBenchmark.termsSixtySums         noop  avgt   10   81.853 ±  2.188  us/op
AggConstructionContentionBenchmark.termsSixtySums         real  avgt   10  297.595 ± 13.411  us/op
AggConstructionContentionBenchmark.termsSixtySums  preallocate  avgt   10  110.816 ± 10.993  us/op

Which looks to me like pre-allocating could do the trick.

@rjernst rjernst removed the :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload label Dec 17, 2020
@rjernst rjernst removed the needs:triage Requires assignment of a team area label label Dec 17, 2020
@elasticmachine elasticmachine removed the Team:Core/Infra Meta label for core/infra team label Dec 17, 2020
nik9000 added a commit that referenced this issue Jan 4, 2021
This lowers the contention on the `REQUEST` circuit breaker when building
many aggregations on many threads by preallocating a chunk of breaker
up front. This cuts down on the number of times we enter the busy loop
in `ChildMemoryCircuitBreaker.limit`. Now we hit it one time when building
aggregations. We still hit the busy loop if we collect many buckets.

We let the `AggregationBuilder` pick the size of the "chunk" that we
preallocate, but it doesn't have much to go on - not even the field types.
But it is available in a convenient spot and the estimates don't have to
be particularly accurate.

The benchmarks on my 12 core desktop are interesting:
```
Benchmark         (breaker)  Mode  Cnt    Score    Error  Units
sum                    noop  avgt   10    1.672 ±  0.042  us/op
sum                    real  avgt   10    4.100 ±  0.027  us/op
sum             preallocate  avgt   10    4.230 ±  0.034  us/op
termsSixtySums         noop  avgt   10   92.658 ±  0.939  us/op
termsSixtySums         real  avgt   10  278.764 ± 39.751  us/op
termsSixtySums  preallocate  avgt   10  120.896 ± 16.097  us/op
termsSum               noop  avgt   10    4.573 ±  0.095  us/op
termsSum               real  avgt   10    9.932 ±  0.211  us/op
termsSum        preallocate  avgt   10    7.695 ±  0.313  us/op
```

They show pretty clearly that not using the circuit breaker at all is
faster. But we can't do that because we don't want to bring the node
down on bad aggs. When there are many aggs (termsSixtySums) the
preallocation claws back much of the performance. It even helps
marginally when there are two aggs (termsSum). For a single agg (sum)
we see a 130 nanosecond hit. Fine.

But these values are all pretty small. At best we're seeing a 160
microsecond savings. Not so on a 160 vCPU machine:

```
Benchmark         (breaker)  Mode  Cnt      Score       Error  Units
sum                    noop  avgt   10     44.956 ±     8.851  us/op
sum                    real  avgt   10    118.008 ±    19.505  us/op
sum             preallocate  avgt   10    241.234 ±   305.998  us/op
termsSixtySums         noop  avgt   10   1339.802 ±    51.410  us/op
termsSixtySums         real  avgt   10  12077.671 ± 12110.993  us/op
termsSixtySums  preallocate  avgt   10   3804.515 ±  1458.702  us/op
termsSum               noop  avgt   10     59.478 ±     2.261  us/op
termsSum               real  avgt   10    293.756 ±   253.854  us/op
termsSum        preallocate  avgt   10    197.963 ±    41.578  us/op
```

All of these numbers are larger because we're running all the CPUs
flat out and we're seeing more contention everywhere. Even the "noop"
breaker sees some contention, but I think it is mostly around memory
allocation. Anyway, with many many (termsSixtySums) aggs we're looking
at 8 milliseconds of savings by preallocating. Just by dodging the busy
loop as much as possible. The error in the measurements there is
substantial. Here are the runs:
```
real:
Iteration   1: 8679.417 ±(99.9%) 273.220 us/op
Iteration   2: 5849.538 ±(99.9%) 179.258 us/op
Iteration   3: 5953.935 ±(99.9%) 152.829 us/op
Iteration   4: 5763.465 ±(99.9%) 150.759 us/op
Iteration   5: 14157.592 ±(99.9%) 395.224 us/op
Iteration   1: 24857.020 ±(99.9%) 1133.847 us/op
Iteration   2: 24730.903 ±(99.9%) 1107.718 us/op
Iteration   3: 18894.383 ±(99.9%) 738.706 us/op
Iteration   4: 5493.965 ±(99.9%) 120.529 us/op
Iteration   5: 6396.493 ±(99.9%) 143.630 us/op
preallocate:
Iteration   1: 5512.590 ±(99.9%) 110.222 us/op
Iteration   2: 3087.771 ±(99.9%) 120.084 us/op
Iteration   3: 3544.282 ±(99.9%) 110.373 us/op
Iteration   4: 3477.228 ±(99.9%) 107.270 us/op
Iteration   5: 4351.820 ±(99.9%) 82.946 us/op
Iteration   1: 3185.250 ±(99.9%) 154.102 us/op
Iteration   2: 3058.000 ±(99.9%) 143.758 us/op
Iteration   3: 3199.920 ±(99.9%) 61.589 us/op
Iteration   4: 3163.735 ±(99.9%) 71.291 us/op
Iteration   5: 5464.556 ±(99.9%) 59.034 us/op
```

That variability from 5.5ms to 25ms is terrible. It makes me not
particularly trust the 8ms savings from the report. But still,
the preallocating method has much less variability between runs
and almost all the runs are faster than all of the non-preallocated
runs. Maybe the savings is more like 2 or 3 milliseconds, but still.
Or maybe we should think of the savings as worst vs worst? If so it's
19 milliseconds.

Anyway, it's hard to measure how much this helps. But, certainly some.

Closes #58647
@nik9000
Member

nik9000 commented Jan 4, 2021

For those following along at home, the contention gets much, much worse when you have a ton of CPUs all trying. The preallocation helps quite a lot in that case, so it's ultimately what we merged in #66895 to close this.
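
Conceptually, the preallocation idea looks roughly like this (a sketch based on the commit message above, not the actual code merged in #66895): take one chunk from the shared breaker when the request starts, serve aggregator-construction allocations from that chunk, and only account the spill-over against the shared breaker.

```
import java.util.concurrent.atomic.AtomicLong;

// Conceptual sketch of the per-request preallocation, not the merged code.
class PreallocatedBreakerSketch {
    private final AtomicLong sharedUsed; // stands in for the real REQUEST breaker
    private final long preallocated;     // chunk size chosen up front for the request
    private long locallyUsed = 0;

    PreallocatedBreakerSketch(AtomicLong sharedUsed, long preallocated) {
        this.sharedUsed = sharedUsed;
        this.preallocated = preallocated;
        sharedUsed.addAndGet(preallocated); // the single contended hit while building aggs
    }

    // Called for each small allocation while constructing aggregators.
    void addEstimateBytes(long bytes) {
        long before = locallyUsed;
        locallyUsed += bytes;
        // Only the part that spills past the preallocated chunk touches the shared breaker.
        long spill = Math.max(locallyUsed - preallocated, 0) - Math.max(before - preallocated, 0);
        if (spill > 0) {
            sharedUsed.addAndGet(spill);
        }
    }

    // On request close, release everything this wrapper accounted against the shared breaker.
    void close() {
        sharedUsed.addAndGet(-(preallocated + Math.max(locallyUsed - preallocated, 0)));
    }
}
```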

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 4, 2021
nik9000 added a commit that referenced this issue Jan 4, 2021