
Request-level circuit breaker support on coordinating nodes #62223

Merged
merged 16 commits into from
Sep 24, 2020

Conversation

jimczi
Contributor

@jimczi jimczi commented Sep 10, 2020

This commit allows the coordinating node to account for the memory used to perform partial and final reduces of
aggregations in the request circuit breaker. The search coordinator adds the memory that it uses to save
and reduce the results of shard aggregations to the request circuit breaker. Before any partial or final
reduce, the memory needed to reduce the aggregations is estimated, and a CircuitBreakingException is thrown
if it exceeds the maximum memory allowed by this breaker.
The size is estimated as roughly 1.5 times the size of the serialized aggregations that need to be reduced.
This estimation can be completely off for some aggregations, but it is corrected with the real size after
the reduce completes.
If the reduce is successful, we update the circuit breaker to remove the size of the source aggregations
and replace the estimation with the serialized size of the newly reduced result.
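The accounting flow described above can be sketched as follows. All names here are hypothetical simplifications, not Elasticsearch's actual `CircuitBreaker` API: reserve an estimate before the reduce, fail fast if it would trip the breaker, and replace the estimate with the actual size once the reduce succeeds.

```java
// Hypothetical sketch of request-breaker accounting around a reduce.
public class ReduceBreakerSketch {
    static final double ESTIMATE_FACTOR = 1.5; // rough multiplier from the PR description

    final long limitBytes;
    long usedBytes;

    ReduceBreakerSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Reserve an estimate before the reduce; throw if it would trip the breaker.
    void addEstimateAndMaybeBreak(long bytes) {
        if (usedBytes + bytes > limitBytes) {
            throw new IllegalStateException("circuit breaking: would use " + (usedBytes + bytes) + " bytes");
        }
        usedBytes += bytes;
    }

    // After a successful reduce, replace the estimate with the real serialized size.
    void replaceEstimate(long estimate, long actual) {
        usedBytes += actual - estimate;
    }

    public static void main(String[] args) {
        ReduceBreakerSketch breaker = new ReduceBreakerSketch(1000);
        long serializedSize = 400;
        long estimate = (long) (serializedSize * ESTIMATE_FACTOR); // 600
        breaker.addEstimateAndMaybeBreak(estimate);
        // ... perform the partial or final reduce here ...
        long actualReducedSize = 250;
        breaker.replaceEstimate(estimate, actualReducedSize);
        System.out.println(breaker.usedBytes); // prints 250
    }
}
```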

As a follow-up, we could trigger partial reduces based on the memory accounted in the circuit breaker instead
of relying on a static number of shard responses. A simpler follow-up that could be done in the meantime is
to reduce the default batch reduce size of blocking search requests (elastic#51857) to a saner number.

Closes #37182

@jimczi jimczi added >enhancement release highlight :Analytics/Geo Indexing, search aggregations of geo points and shapes :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload v8.0.0 v7.10.0 labels Sep 10, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Circuit Breakers)

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Geo)

@elasticmachine elasticmachine added Team:Core/Infra Meta label for core/infra team Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) labels Sep 10, 2020
@@ -1,177 +0,0 @@
/*
* Licensed to Elasticsearch under one or more contributor

Member


It is a bit of a shame that these go from single-node tests to full-blown IT tests. What is the reasoning behind this choice?

Contributor Author


I regrouped the search action tests in a single IT class. I agree that these tests may not require the full IT setup, but they are grouped with other tests that require it, so I thought it made sense to move them here.

Member

@nik9000 nik9000 left a comment


This looks right to me. I'll have to go over it more closely before 👍 it and I think I'd like to wait a day just to have fresh eyes on it.

This commit removes the serialization of partial reduces in order to speed up merges when
the batch reduce size is smaller than the number of shards in the request.
The estimation of the size of a partial reduce is still based on the binary size (serialized form), but we
keep the full Java object and estimate the size with a counting stream output.
Finally, this change adds a benchmark for the reduce of nested terms aggs. This benchmark was used to
optimize the code in this PR.
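The counting stream output mentioned above can be approximated with a trivial `OutputStream` that discards bytes and only counts them. This is a hypothetical sketch, not the actual Elasticsearch class: serializing into it yields the estimated binary size without materializing any bytes.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: estimate serialized size by writing to a stream
// that discards the bytes and only counts them.
public class CountingOutput extends OutputStream {
    long count;

    @Override
    public void write(int b) {
        count++; // one byte written
    }

    @Override
    public void write(byte[] b, int off, int len) {
        count += len; // bulk write: just add the length
    }

    @Override
    public void write(byte[] b) {
        count += b.length;
    }

    public static void main(String[] args) {
        CountingOutput out = new CountingOutput();
        // Stand-in for serializing a partial reduce result:
        out.write("partial reduce result".getBytes(StandardCharsets.UTF_8));
        System.out.println(out.count); // prints 21
    }
}
```

This keeps the full Java object alive while still producing a size to feed the circuit breaker, avoiding the cost of a real serialization round-trip.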
@jimczi
Contributor Author

jimczi commented Sep 15, 2020

We discussed offline with @nik9000, and I pushed some changes to speed up the partial merge. First of all, I removed the serialization of the partial reduce and replaced it with an estimation of the size based on a noop serialization (just counting the bytes). That resulted in much better performance for large-cardinality aggregations and allows us to estimate more precisely the memory used by the final reduce. I added the benchmark that I used to the PR to be able to replay the numbers.
For instance, a terms/terms agg with a topN size of 100, 512 shards, and an overall cardinality of 10,000 gives the following numbers on my machine:

Benchmark                        (bufferSize)  (cardinalityFactor)  (numShards)  (topNSize)  Mode  Cnt      Score        Error  Units
TermsReduceBenchmark.reduceAggs           512                  100          512         100  avgt    7   1756.207 ±    15.345  ms/op
TermsReduceBenchmark.reduceAggs            32                  100          512         100  avgt    7  27112.587 ± 13811.555  ms/op
TermsReduceBenchmark.reduceAggs            32                  100          512         100  avgt    7  10336.829 ±  2762.412  ms/op

The first result is the time it takes to reduce with a batch reduce size of 512, the second result is when we serialize the results of partial aggs with a batch reduce size of 32 and the last one is when we don't serialize partial results. As you can see the speedups are significant.
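For illustration, the batched reduce strategy the benchmark exercises can be sketched like this. It is a hypothetical simplification in which merging shard aggregations is replaced by summing longs: shard results accumulate in a buffer, and a partial reduce fires every `bufferSize` responses, with a final reduce at the end.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of batched partial reduces on the coordinating node.
public class BatchedReduce {
    // Stand-in for merging aggregations: fold the buffer into an accumulator.
    static long reduce(List<Long> buffer, long acc) {
        for (long v : buffer) {
            acc += v;
        }
        return acc;
    }

    static long reduceAll(int numShards, int bufferSize) {
        List<Long> buffer = new ArrayList<>();
        long partial = 0;
        for (int shard = 0; shard < numShards; shard++) {
            buffer.add(1L); // each shard contributes one result
            if (buffer.size() == bufferSize) {
                partial = reduce(buffer, partial); // partial reduce
                buffer.clear();
            }
        }
        return reduce(buffer, partial); // final reduce of any remainder
    }

    public static void main(String[] args) {
        // 512 shards reduced in batches of 32 vs. a single batch of 512:
        // same result, different number of intermediate reduces.
        System.out.println(reduceAll(512, 32));  // prints 512
        System.out.println(reduceAll(512, 512)); // prints 512
    }
}
```

A smaller batch size bounds how many shard responses are held at once, at the cost of more intermediate reduces, which is exactly the trade-off the benchmark numbers above measure.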

Member

@nik9000 nik9000 left a comment


I left some small things but LGTM.

It's a shame about reserializing being slow.

Member

@javanna javanna left a comment


left a couple of small questions

assertThat(response.get().getFailure().getCause(), instanceOf(IllegalArgumentException.class));
assertEquals("Unknown NamedWriteable category [" + InternalAggregation.class.getName() + "]",
    response.get().getFailure().getCause().getMessage());
}
Member


is this test no longer relevant?

Contributor Author


It cannot work anymore since we no longer serialize the aggs. I think that's ok since we have other tests that check that exceptions thrown during a partial/final reduce are handled correctly.

Member


I see, you mean the condition that the test had to trigger the failure, which was around serialization?

Contributor Author


yep


@nik9000
Member

nik9000 commented Sep 23, 2020

Sorry not to comment publicly about the serialization change! It makes me sad not to serialize but I see the reasoning.

@jimczi
Contributor Author

jimczi commented Sep 24, 2020

@elasticmachine run elasticsearch-ci/2

@jimczi jimczi merged commit fbed2a1 into elastic:master Sep 24, 2020
@jimczi jimczi deleted the enhancements/reduce_aggs_circuit_breaker branch September 24, 2020 12:02
jimczi added a commit that referenced this pull request Sep 24, 2020
Ensures that the test always runs with a memory circuit breaker.

Relates #62223