
Aggregations can be bottlenecked on ChildMemoryCircuitBreaker.limit() #58647

Closed
hackerwin7 opened this issue Jun 29, 2020 · 17 comments · Fixed by #66895
Assignees
Labels
:Analytics/Aggregations Aggregations :Performance All issues related to Elasticsearch performance including regressions and investigations Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Performance Meta label for performance team

Comments

@hackerwin7

hackerwin7 commented Jun 29, 2020

In ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(), the limit() method runs a while loop that atomically compares and sets the used value.

When we run high-QPS queries against an ES node, the major cost is ChildMemoryCircuitBreaker.limit(), which makes AggregationPhase.preProcess() extremely expensive.

If a data node receives complicated search request bodies at high QPS, it is kept busy creating aggregators: SumAggregator.<init> leads to BigArrays.newDoubleArray(), which leads to BigArrays.adjustBreaker(), which finally calls breaker.addEstimateBytesAndMaybeBreak() and ChildMemoryCircuitBreaker.limit()
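
For context, limit() boils down to a compare-and-swap retry loop on the breaker's shared counter. A simplified sketch of the pattern (not the actual ChildMemoryCircuitBreaker code):

```
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch of the contended accounting loop, not the real
// ChildMemoryCircuitBreaker: a single shared counter that every small
// allocation bumps with a compare-and-swap retry loop.
class SketchBreaker {
    private final AtomicLong used = new AtomicLong();
    private final long limitBytes;

    SketchBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    long addEstimateBytesAndMaybeBreak(long bytes, String label) {
        long current;
        long updated;
        do {
            current = used.get();
            updated = current + bytes;
            if (updated > limitBytes) {
                throw new IllegalStateException("[" + label + "] would use " + updated + " bytes, over the limit");
            }
            // Under high QPS every aggregator construction lands here and the
            // threads keep retrying this CAS against each other.
        } while (used.compareAndSet(current, updated) == false);
        return updated;
    }
}
```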

@markharwood markharwood added :Performance All issues related to Elasticsearch performance including regressions and investigations :Search/Search Search-related issues that do not fall into other categories labels Jun 29, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-perf (:Performance)

@elasticmachine elasticmachine added the Team:Performance Meta label for performance team label Jun 29, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jun 29, 2020
@hackerwin7
Author

A flame graph, here it is:
[flame graph image]

The biggest cost is ChildMemoryCircuitBreaker.limit().

@jtibshirani jtibshirani added :Analytics/Aggregations Aggregations :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload and removed :Search/Search Search-related issues that do not fall into other categories labels Jul 7, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine elasticmachine added Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) and removed Team:Search Meta label for search team labels Jul 7, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Circuit Breakers)

@elasticmachine elasticmachine added the Team:Core/Infra Meta label for core/infra team label Jul 7, 2020
@nik9000
Member

nik9000 commented Jul 7, 2020

I wonder if we should consolidate how we grab bytes from the breaker. Or if we should grab a big chunk whenever we grab any and then only come back when we need more. Sort of like a TLAB. For the most part BigArrays is designed to oversize its allocations, so for big requests the grabs amortize to happening infrequently. But for many small requests we grab breaker bytes very frequently.
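
A rough sketch of that TLAB-style idea, using illustrative names rather than the actual Elasticsearch API: reserve a chunk from the shared breaker and satisfy small requests locally, only going back when the chunk runs out.

```
// Illustrative sketch only; "RequestBreaker" stands in for the real child breaker.
interface RequestBreaker {
    void addEstimateBytesAndMaybeBreak(long bytes, String label);
}

class ChunkedBreaker {
    private static final long CHUNK_BYTES = 1024 * 1024; // hypothetical chunk size
    private final RequestBreaker shared;
    private long localFree = 0; // bytes reserved from the shared breaker but not yet handed out

    ChunkedBreaker(RequestBreaker shared) {
        this.shared = shared;
    }

    // Called on the single thread building one request's aggregators,
    // so the local counter needs no atomics.
    void add(long bytes, String label) {
        if (bytes > localFree) {
            long reserve = Math.max(bytes - localFree, CHUNK_BYTES);
            shared.addEstimateBytesAndMaybeBreak(reserve, label); // one contended call per chunk
            localFree += reserve;
        }
        localFree -= bytes;
    }

    // On request close, hand back whatever part of the reservation went unused.
    void close(String label) {
        if (localFree > 0) {
            shared.addEstimateBytesAndMaybeBreak(-localFree, label);
            localFree = 0;
        }
    }
}
```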

@nik9000
Member

nik9000 commented Jul 7, 2020

@hackerwin7 could you try running this against master or 7.x? Could you share the search and reproduction?

It looks like the search is a terms which contains another terms which contains a sum. I've cut down on the number of aggregator allocations in 7.x and master over at #56487. It may not help if the cardinality the sum agg has is ultimately pretty low. The flame graph shows that the initial allocation is what is getting wedged, which I haven't changed. But there should be fewer allocations.

@hackerwin7
Author

hackerwin7 commented Jul 8, 2020

@nik9000 Thanks for your reply.

@hackerwin7 could you try running this against master or 7.x? Could you share the search and reproduction?

We tested on 7.6.2. It seems that with a high QPS of complicated aggregation searches (which may build many aggregators), the main cost is creating the Aggregators and BigArrays.adjustBreaker(), which in the end calls ChildMemoryCircuitBreaker.limit().

I've cut down on the number of aggregator allocations in 7.x and master over at #56487

Reducing aggregator allocations may be a way to reduce the contention in limit() (the loop that sets the atomic value). I will test this on the master branch and check the effect.

With high QPS and searches that build many more aggregators, according to this flame graph, can we reduce the frequency of calls to limit() to reduce the contention? Maybe we could batch limit() or addEstimateBytesAndMaybeBreak(): only when the added bytes reach a buffer size would we run the real loop that sets the atomic value.

@danielmitterdorfer
Member

I wonder if we should consolidate how we grab bytes from the breaker

I agree that the right approach is to reduce the number of calls to the circuit breaker instead of optimizing the circuit breaker's #limit() method which appears to be the bottleneck because of the high number of calls to it (2.6 million calls in the flame graph that you present).

I also think that the issue title mentioning a "race condition" is misleading because - in my understanding - this issue is about inefficient behavior.

@danielmitterdorfer danielmitterdorfer changed the title Race condition in ChildMemoryCircuitBreaker.limit() Aggregations can be bottlenecked on ChildMemoryCircuitBreaker.limit() Jul 8, 2020
@lqbilbo

lqbilbo commented Nov 16, 2020

I wonder if we should consolidate how we grab bytes from the breaker

I agree that the right approach is to reduce the number of calls to the circuit breaker instead of optimizing the circuit breaker's #limit() method which appears to be the bottleneck because of the high number of calls to it (2.6 million calls in the flame graph that you present).

I also think that the issue title mentioning a "race condition" is misleading because - in my understanding - this issue is about inefficient behavior.

@danielmitterdorfer Hi Daniel, if we add a setting indices.breaker.memory_usage_estimates.enable in ClusterSettings that dynamically controls whether memory usage is estimated during the aggregation phase, we can reduce the number of calls to AggregatorBase#addRequestCircuitBreakerBytes and BigArrays#adjustBreaker when the switch is off. Do you think that's a good approach?

For example:

protected long addRequestCircuitBreakerBytes(long bytes) {
        // Only use the potential to circuit break if bytes are being incremented
        if (isMemoryUsageEstimatesEnabled()) {   // we add a switch
            if (bytes > 0) {
                this.breakerService
                    .getBreaker(CircuitBreaker.REQUEST)
                    .addEstimateBytesAndMaybeBreak(bytes, "<agg [" + name + "]>");
            } else {
                this.breakerService
                    .getBreaker(CircuitBreaker.REQUEST)
                    .addWithoutBreaking(bytes);
            }
        }
        this.requestBytesUsed += bytes;
        return requestBytesUsed;
    }

and

void adjustBreaker(final long delta, final boolean isDataAlreadyCreated) {
        if (isMemoryUsageEstimatesEnabled()) {         // we add a switch
            if (this.breakerService != null) {
                CircuitBreaker breaker = this.breakerService.getBreaker(breakerName);
                if (this.checkBreaker) {
                    // checking breaker means potentially tripping, but it doesn't
                    // have to if the delta is negative
                    if (delta > 0) {
                        try {
                            breaker.addEstimateBytesAndMaybeBreak(delta, "<reused_arrays>");
                        } catch (CircuitBreakingException e) {
                            if (isDataAlreadyCreated) {
                                // since we've already created the data, we need to
                                // add it so closing the stream re-adjusts properly
                                breaker.addWithoutBreaking(delta);
                            }
                            // re-throw the original exception
                            throw e;
                        }
                    } else {
                        breaker.addWithoutBreaking(delta);
                    }
                } else {
                    // even if we are not checking the breaker, we need to adjust
                    // its totals, so add without breaking
                    breaker.addWithoutBreaking(delta);
                }
            }
        }
    }

Because we only add an abstract method in SearchContext and a volatile variable in BigArrays, and we assign the switch value to BigArrays's variable only when AggregatorBase is initially built, we have kept the scope of the change small. Like this:

protected final SearchContext context;

protected AggregatorBase(String name, AggregatorFactories factories, SearchContext context, Aggregator parent,
            List<PipelineAggregator> pipelineAggregators, Map<String, Object> metaData) throws IOException {
        this.name = name;
        this.pipelineAggregators = pipelineAggregators;
        this.metaData = metaData;
        this.parent = parent;
        this.context = context;
        this.breakerService = context.bigArrays().breakerService();
        // if we disable memory usage estimates, bigArrays doesn't need to adjust memory size.
        context.bigArrays().enableMemoryUsageEstimates(context.enableMemoryUsageEstimates());
        ...
    }
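
For reference, here is a sketch of how the proposed switch could be registered as a dynamic node setting using Elasticsearch's Setting API; the setting name is from this proposal and does not exist upstream:

```
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;

// Sketch only: registers the proposed (not upstream) switch so it could be
// flipped dynamically; defaults to true to keep the current behavior.
public final class MemoryUsageEstimatesSettings {
    public static final Setting<Boolean> MEMORY_USAGE_ESTIMATES_ENABLED = Setting.boolSetting(
        "indices.breaker.memory_usage_estimates.enable",
        true,
        Property.Dynamic,
        Property.NodeScope
    );

    private MemoryUsageEstimatesSettings() {}
}
```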

We have also run a lot of test cases. They show that the bucket and metrics aggregations in our query phase consume little memory, but the number of breaker calls is large.

I reviewed pull requests #46751 and #57042, and they still don't fix this problem. So if needed, I'll share the code.

Looking forward to your reply!

@nik9000
Member

nik9000 commented Nov 17, 2020

I'm curious if this is still a big problem in 7.9.

I'm a bit worried that mutating BigArrays would be dangerous because it's shared by the system. We could certainly fork a fresh one for each request but at that point it might be cleaner to have the forking process reserve some chunk up front for the request.

I dunno. As much fun as it is to attack bottlenecks, it isn't clear to me that this is still a big bottleneck.

@lqbilbo

lqbilbo commented Nov 23, 2020

I'm curious if this is still a big problem in 7.9.

I'm a bit worried that mutating BigArrays would be dangerous because it's shared by the system. We could certainly fork a fresh one for each request but at that point it might be cleaner to have the forking process reserve some chunk up front for the request.

@nik9000 Thanks for your reply. After seeing it, I ran some test cases on 7.9, but regrettably the new version still doesn't solve the problem. I agree with you that we shouldn't mutate BigArrays, so I put the setting into BreakerSettings and added the condition if (this.breakerService != null && this.breakerService.isMemoryUsageEstimatesEnabled()) to BigArrays#adjustBreaker. And isMemoryUsageEstimatesEnabled() is only false when AggregatorBase sets it, so it will not affect other parts of the system.

I dunno. As much fun as it is to attack bottlenecks it isn't clear to me that this is both still a big bottleneck.

We really do face this problem: at high QPS the CPU is almost fully used, and that affects the stability of our Elasticsearch cluster. So we want to attack the bottleneck for this kind of business scenario.

@nik9000 nik9000 self-assigned this Nov 24, 2020
@nik9000
Member

nik9000 commented Nov 24, 2020

@lqbilbo. OK. I'm going to add this to my list of things to think about some more too then. I'm not really sold on the idea of not checking the breaker on allocation but it's certainly something. I want to play with pre-allocating a bunch of bytes up front and see what that looks like for us. It might not be as big a change as it sounds.... But it'll take me a week or two before I can get to it what with Thanksgiving here in the US and me spending most of my time on the initial push for runtime fields.

Regardless of how we fix this I'd like to know more about your tests. Ultimately I'm going to have to reproduce this with rally and/or with our jmh benchmarks. Can you tell me what you were doing to get this to show up? I have a suspicion I'll be able to see this really easily with JMH but with rally I'd need to set up a bunch of machines to run a realistic cluster and generate load. Like, what kinds of searches are you sending? To how many shards? On how many nodes? More shards on single nodes'll all step on each other.....

@nik9000
Member

nik9000 commented Dec 2, 2020

I had a look this morning. I made a JMH benchmark that builds six sum aggs as fast as its silicon heart can manage. The measurements I get out of it are kind of funny, especially when I run them on a GCP node with 60 cores. There, I see that building these SumAggregators on all the threads clocks in at about 9 microseconds. Not super duper fast, but, like, fine.

Now, with the circuit breaker on I see lots of contention and building the SumAggregators clocks in at 280 microseconds or 620 microseconds. I get both measurements and I'm not 100% sure why, but they are fairly consistent. Measurement mysteries aside, the circuit breaker costs about 600 microseconds in 60 threads. When I run the benchmark with a single thread I see the circuit breaker adding 1.2 microseconds to building the aggregators. So contention is certainly killing us here. I mean, the flame graph said it, but I wanted to confirm.

I hacked together a prototype that only dips into the circuit breaker once for the whole request. Presumably we could make a slight overestimate of the memory we'll be needing for a "nearly empty" aggregation and allocate all of that up front. That dropped the construction to 170 microseconds or 85 microseconds. Same deal with the two consistent and different measurements. Something must be wrong with my test, but it shows us the order of magnitude. Anyway, 170 microseconds is a lot more than 9 microseconds, but it's still a fair savings percentage-wise.

What kind of numbers were you seeing, @lqbilbo? What kind of aggs are you running and how many cores do you have? What kind of CPU are you running? I haven't tried ARM. When I ran this on my desktop which has 12 cores I didn't really see much contention. So I imagine you are running with far more cores.

How fast are the requests you are measuring? 600 microseconds is, like, not all that much time. If you end up going to disk for any of your requests the 600 microseconds is a drop in the bucket. Still, lower CPU load is lower CPU load. And we're wasting a bunch of time here if you are running with dozens of CPUs. So I still think this is worth me looking more into.
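
For reference, a stripped-down JMH harness along these lines (a sketch, not the benchmark that actually landed) might look like this, hammering one shared counter from every core:

```
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Threads;
import org.openjdk.jmh.annotations.Warmup;

// Sketch of a contention benchmark: every JMH worker thread repeatedly "builds"
// six aggregators, i.e. makes six small reservations against one shared counter,
// so the CAS retry loop is essentially the only thing being measured.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Warmup(iterations = 5)
@Measurement(iterations = 10)
@Fork(1)
@Threads(Threads.MAX) // one worker per available core to provoke contention
@State(Scope.Benchmark)
public class BreakerContentionBenchmark {
    private final AtomicLong used = new AtomicLong();

    @Benchmark
    public long buildSixSums() {
        long last = 0;
        for (int agg = 0; agg < 6; agg++) {
            last = addEstimate(1024); // roughly what each small BigArrays allocation does
        }
        return last;
    }

    private long addEstimate(long bytes) {
        long current;
        long updated;
        do {
            current = used.get();
            updated = current + bytes;
        } while (used.compareAndSet(current, updated) == false);
        return updated;
    }
}
```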

@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@lqbilbo

lqbilbo commented Dec 5, 2020

I had a look this morning. I made a JMH benchmark that builds six sum aggs as fast as its silicon heart can manage. The measurements I get out of it are kind of funny, especially when I run them on a GCP node with 60 cores. There, I see that building these SumAggregators on all the threads clocks in at about 9 microseconds. Not super duper fast, but, like, fine.

How fast are the requests you are measuring? 600 microseconds is, like, not all that much time. If you end up going to disk for any of your requests the 600 microseconds is a drop in the bucket. Still, lower CPU load is lower CPU load. And we're wasting a bunch of time here if you are running with dozens of CPUs. So I still think this is worth me looking more into.

We ran tests on a test cluster with 48 cores and 256 GB of memory on each node. The aggs are a query phase with one date histogram filter, one terms filter, and about sixty sum aggs. Our cluster is the production cluster; all of its boxes have a lot of HDD space (it's a logging scenario), and we set up RAID 6 for higher I/O read speed. The cluster configuration is three client nodes and 24 data & master nodes. We ran nearly three thousand requests per second and the CPU load at its highest point was about 60 percent, not too high, since nearly 100% CPU load would lead to a full search queue, which would make the test meaningless. Also, the p99 is just 600 microseconds, as your test shows.

I hacked together a prototype that only dips into the circuit breaker once for the whole request. Presumably we could make a slight overestimate of the memory we'll be needing for a "nearly empty" aggregation and allocate all of that up front. That dropped the construction to 170 microseconds or 85 microseconds. Same deal with the two consistent and different measurements. Something must be wrong with my test, but it shows us the order of magnitude. Anyway, 170 microseconds is a lot more than 9 microseconds, but it's still a fair savings percentage-wise.

So could I follow your measurements and run the tests with the circuit breaker dipped into only once for the whole request, estimating the memory or allocating everything the aggregations will consume up front, before any of this happens?

@nik9000
Member

nik9000 commented Dec 7, 2020

We ran tests on a test cluster with 48 cores and 256 GB of memory on each node. The aggs are a query phase with one date histogram filter, one terms filter, and about sixty sum aggs. Our cluster is the production cluster; all of its boxes have a lot of HDD space (it's a logging scenario), and we set up RAID 6 for higher I/O read speed. The cluster configuration is three client nodes and 24 data & master nodes. We ran nearly three thousand requests per second and the CPU load at its highest point was about 60 percent, not too high, since nearly 100% CPU load would lead to a full search queue, which would make the test meaningless. Also, the p99 is just 600 microseconds, as your test shows.

Perfect.

Is the index being written to? If it isn't, you might be able to get caching by separating your aggs from your top_hits searches. But if your queries vary then you won't get any cache hits anyway. One thing that is useful: if two requests come in with the same cache key and we're caching the request, then we merge those requests together.

So could I follow your measurements and run the tests with the circuit breaker dipped into only once for the whole request, estimating the memory or allocating everything the aggregations will consume up front, before any of this happens?

I really did just hack together the benchmark to see if I could reproduce your issue and to see if my hunch that we could pre-allocate would help. It isn't really useful outside of theorizing. Be that as it may, I did a sixty sums version on my 12 core machine and got:

Benchmark                                            (breaker)  Mode  Cnt    Score    Error  Units
AggConstructionContentionBenchmark.termsSixtySums         noop  avgt   10   81.853 ±  2.188  us/op
AggConstructionContentionBenchmark.termsSixtySums         real  avgt   10  297.595 ± 13.411  us/op
AggConstructionContentionBenchmark.termsSixtySums  preallocate  avgt   10  110.816 ± 10.993  us/op

Which looks to me like pre-allocating could do the trick.

@rjernst rjernst removed the :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload label Dec 17, 2020
@rjernst rjernst removed the needs:triage Requires assignment of a team area label label Dec 17, 2020
@elasticmachine elasticmachine removed the Team:Core/Infra Meta label for core/infra team label Dec 17, 2020
nik9000 added a commit that referenced this issue Jan 4, 2021
This lowers the contention on the `REQUEST` circuit breaker when building
many aggregations on many threads by preallocating a chunk of breaker
up front. This cuts down on the number of times we enter the busy loop
in `ChildMemoryCircuitBreaker.limit`. Now we hit it one time when building
aggregations. We still hit the busy loop if we collect many buckets.

We let the `AggregationBuilder` pick the size of the "chunk" that we
preallocate, but it doesn't have much to go on - not even the field types.
But it is available in a convenient spot and the estimates don't have to
be particularly accurate.

The benchmarks on my 12 core desktop are interesting:
```
Benchmark         (breaker)  Mode  Cnt    Score    Error  Units
sum                    noop  avgt   10    1.672 ±  0.042  us/op
sum                    real  avgt   10    4.100 ±  0.027  us/op
sum             preallocate  avgt   10    4.230 ±  0.034  us/op
termsSixtySums         noop  avgt   10   92.658 ±  0.939  us/op
termsSixtySums         real  avgt   10  278.764 ± 39.751  us/op
termsSixtySums  preallocate  avgt   10  120.896 ± 16.097  us/op
termsSum               noop  avgt   10    4.573 ±  0.095  us/op
termsSum               real  avgt   10    9.932 ±  0.211  us/op
termsSum        preallocate  avgt   10    7.695 ±  0.313  us/op
```

They show pretty clearly that not using the circuit breaker at all is
faster. But we can't do that because we don't want to bring the node
down on bad aggs. When there are many aggs (termsSixtySums) the
preallocation claws back much of the performance. It even helps
marginally when there are two aggs (termsSum). For a single agg (sum)
we see a 130 nanosecond hit. Fine.

But these values are all pretty small. At best we're seeing a 160
microsecond savings. Not so on a 160 vCPU machine:

```
Benchmark         (breaker)  Mode  Cnt      Score       Error  Units
sum                    noop  avgt   10     44.956 ±     8.851  us/op
sum                    real  avgt   10    118.008 ±    19.505  us/op
sum             preallocate  avgt   10    241.234 ±   305.998  us/op
termsSixtySums         noop  avgt   10   1339.802 ±    51.410  us/op
termsSixtySums         real  avgt   10  12077.671 ± 12110.993  us/op
termsSixtySums  preallocate  avgt   10   3804.515 ±  1458.702  us/op
termsSum               noop  avgt   10     59.478 ±     2.261  us/op
termsSum               real  avgt   10    293.756 ±   253.854  us/op
termsSum        preallocate  avgt   10    197.963 ±    41.578  us/op
```

All of these numbers are larger because we're running all the CPUs
flat out and we're seeing more contention everywhere. Even the "noop"
breaker sees some contention, but I think it is mostly around memory
allocation. Anyway, with many many (termsSixtySums) aggs we're looking
at 8 milliseconds of savings by preallocating. Just by dodging the busy
loop as much as possible. The error in the measurements there is
substantial. Here are the runs:
```
real:
Iteration   1: 8679.417 ±(99.9%) 273.220 us/op
Iteration   2: 5849.538 ±(99.9%) 179.258 us/op
Iteration   3: 5953.935 ±(99.9%) 152.829 us/op
Iteration   4: 5763.465 ±(99.9%) 150.759 us/op
Iteration   5: 14157.592 ±(99.9%) 395.224 us/op
Iteration   1: 24857.020 ±(99.9%) 1133.847 us/op
Iteration   2: 24730.903 ±(99.9%) 1107.718 us/op
Iteration   3: 18894.383 ±(99.9%) 738.706 us/op
Iteration   4: 5493.965 ±(99.9%) 120.529 us/op
Iteration   5: 6396.493 ±(99.9%) 143.630 us/op
preallocate:
Iteration   1: 5512.590 ±(99.9%) 110.222 us/op
Iteration   2: 3087.771 ±(99.9%) 120.084 us/op
Iteration   3: 3544.282 ±(99.9%) 110.373 us/op
Iteration   4: 3477.228 ±(99.9%) 107.270 us/op
Iteration   5: 4351.820 ±(99.9%) 82.946 us/op
Iteration   1: 3185.250 ±(99.9%) 154.102 us/op
Iteration   2: 3058.000 ±(99.9%) 143.758 us/op
Iteration   3: 3199.920 ±(99.9%) 61.589 us/op
Iteration   4: 3163.735 ±(99.9%) 71.291 us/op
Iteration   5: 5464.556 ±(99.9%) 59.034 us/op
```

That variability from 5.5ms to 25ms is terrible. It makes me not
particularly trust the 8ms savings from the report. But still,
the preallocating method has much less variability between runs
and almost all the runs are faster than all of the non-preallocated
runs. Maybe the savings is more like 2 or 3 milliseconds, but still.
Or maybe we should think of the savings as worst vs worst? If so it's
19 milliseconds.

Anyway, it's hard to measure how much this helps. But, certainly some.

Closes #58647
@nik9000
Member

nik9000 commented Jan 4, 2021

For those following along at home, the contention gets much, much worse when you have a ton of CPUs all trying. The preallocation helps quite a lot in that case, so it's ultimately what we merged in #66895 to close this.
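
Conceptually, the preallocation idea looks roughly like this (a sketch based on the commit message above, not the actual code merged in #66895): take one chunk from the shared breaker when the request starts, serve aggregator-construction allocations from that chunk, and only account the spill-over against the shared breaker.

```
import java.util.concurrent.atomic.AtomicLong;

// Conceptual sketch of the per-request preallocation, not the merged code.
class PreallocatedBreakerSketch {
    private final AtomicLong sharedUsed; // stands in for the real REQUEST breaker
    private final long preallocated;     // chunk size chosen up front for the request
    private long locallyUsed = 0;

    PreallocatedBreakerSketch(AtomicLong sharedUsed, long preallocated) {
        this.sharedUsed = sharedUsed;
        this.preallocated = preallocated;
        sharedUsed.addAndGet(preallocated); // the single contended hit while building aggs
    }

    // Called for each small allocation while constructing aggregators.
    void addEstimateBytes(long bytes) {
        long before = locallyUsed;
        locallyUsed += bytes;
        // Only the part that spills past the preallocated chunk touches the shared breaker.
        long spill = Math.max(locallyUsed - preallocated, 0) - Math.max(before - preallocated, 0);
        if (spill > 0) {
            sharedUsed.addAndGet(spill);
        }
    }

    // On request close, release everything this wrapper accounted against the shared breaker.
    void close() {
        sharedUsed.addAndGet(-(preallocated + Math.max(locallyUsed - preallocated, 0)));
    }
}
```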

nik9000 added a commit to nik9000/elasticsearch that referenced this issue Jan 4, 2021
nik9000 added a commit that referenced this issue Jan 4, 2021