
Request-level circuit breaker support on coordinating nodes #37182

Closed
markharwood opened this issue Jan 7, 2019 · 11 comments · Fixed by #62223
Labels
:Analytics/Aggregations Aggregations :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) Team:Core/Infra Meta label for core/infra team

Comments

@markharwood
Contributor

Currently we do not have circuit breaker support for search requests executed on the coordinating node. We have multi-phase reduction, which should help avoid OOMs, but it is still possible for abusive queries to take a node down.
A recent example OOM was caused by date histograms with 5-minute intervals executed across many time-based indices. Each of the data nodes failed to trip a circuit breaker because they were only seeing a small part of the final result. The multi-phase reduction did nothing to reduce the final number of buckets required, and the final OOM occurred while rendering results in toXContent. This scenario was exacerbated by a top-level terms agg on hostname under which the date histograms were nested.
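For illustration only, the request that triggered this OOM was shaped roughly like the following (a hypothetical reconstruction using the 7.x Java client's AggregationBuilders; the field names and agg names are invented):

    import org.elasticsearch.search.aggregations.AggregationBuilders;
    import org.elasticsearch.search.aggregations.bucket.histogram.DateHistogramInterval;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    // A terms agg per hostname, each bucket holding a 5-minute date histogram.
    // Every data node sees only its slice of the data, so no per-node breaker
    // trips, but the coordinating node must materialize roughly
    // (#hosts x time-range/5m) buckets to render the final response.
    SearchSourceBuilder source = new SearchSourceBuilder()
        .size(0)
        .aggregation(
            AggregationBuilders.terms("per_host").field("hostname")
                .subAggregation(
                    AggregationBuilders.dateHistogram("per_5m")
                        .field("@timestamp")
                        .fixedInterval(DateHistogramInterval.minutes(5))));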

@markharwood markharwood added >enhancement :Analytics/Aggregations Aggregations :Core/Infra/Circuit Breakers Track estimates of memory consumption to prevent overload team-discuss labels Jan 7, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo

@elasticmachine
Collaborator

Pinging @elastic/es-core-infra

@markharwood
Contributor Author

@jimczi Since opening this ticket I've been made aware of #27581.
Was that the final solution, or is there more work on circuit breakers to be done on coordinating nodes?

@jimczi
Contributor

jimczi commented Jan 7, 2019

Was that the final solution, or is there more work on circuit breakers to be done on coordinating nodes?

I think so, yes. The issue you described here should not happen on master, where the maximum number of buckets is set to 10,000 by default.
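(For reference, that limit is the search.max_buckets setting. What follows is a simplified sketch of the guard it applies during reduce, not the actual MultiBucketConsumerService code:)

    // Simplified sketch of the bucket-count guard behind search.max_buckets.
    final class BucketLimitGuard {
        private final int maxBuckets; // defaults to 10,000 in 7.x
        private int count;

        BucketLimitGuard(int maxBuckets) {
            this.maxBuckets = maxBuckets;
        }

        // Invoked each time the reduce phase creates new buckets.
        void accept(int newBuckets) {
            count += newBuckets;
            if (count > maxBuckets) {
                throw new IllegalStateException(
                    "Trying to create too many buckets, must be <= [" + maxBuckets + "]");
            }
        }
    }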

@markharwood
Contributor Author

While #27581 should address issues with aggregations like DateHistogram that produce many small buckets, we don't currently account for the internal memory consumed by more expensive aggregations like cardinality.
For this reason I'm keeping this ticket open to track the problem of accounting for these more complex aggregations.

@howardhuanghua
Contributor

@jimczi The maximum number of buckets only limits the bucket count in the reduction phase. What if the coordinating node receives a large number of first-phase aggregation bucket results from the data nodes before the reduction phase begins, and that explodes the coordinating node's memory? How do we limit the memory used by this buffered temporary data?

@howardhuanghua
Contributor

Follow "results.consumeResult(result);" may consume lots of memory and how to avoid OOM in huge number of nodes case aggregation request?

    @Override
    public final void onShardSuccess(Result result) {
        successfulOps.incrementAndGet();
        results.consumeResult(result);
        if (logger.isTraceEnabled()) {
            logger.trace("got first-phase result from {}", result != null ? result.getSearchShardTarget() : null);
        }
        // clean a previous error on this shard group (note, this code will be serialized on the same shardIndex value level
        // so its ok concurrency wise to miss potentially the shard failures being created because of another failure
        // in the #addShardFailure, because by definition, it will happen on *another* shardIndex
        AtomicArray<ShardSearchFailure> shardFailures = this.shardFailures.get();
        if (shardFailures != null) {
            shardFailures.set(result.getShardIndex(), null);
        }
    }

@ppf2
Member

ppf2 commented Jan 23, 2019

@markharwood @jimczi

we don't currently account for the internal memory consumed by more expensive aggregations like cardinality.

Will the new real memory breaker in 7.0 resolve this, since it should break based on a real heap usage threshold even on coordinating nodes? Or do we still plan on implementing a coordinating-node-specific request breaker in addition to the real memory breaker?

@danielmitterdorfer
Member

@ppf2 I don't think the real memory circuit breaker is able to reliably prevent this. Circuit breakers do not actively observe system state; rather, they need to be invoked explicitly at certain points during request handling. Only then do they check current resource usage and potentially reject a request. If a request is past that check and then allocates a lot of memory, even the real memory circuit breaker cannot prevent this situation (although it might detect the excessive memory usage and reject other requests that are sent concurrently). Therefore, a circuit breaker that checks the (expected) memory usage of aggregations on the coordinating node can make sense.
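To make those check points concrete: a breaker reservation is an explicit call such as CircuitBreaker#addEstimateBytesAndMaybeBreak, and memory allocated after that call is invisible to the breaker. A minimal sketch (the method and its surrounding names are invented):

    import org.elasticsearch.common.breaker.CircuitBreaker;

    // Hypothetical buffering step on the coordinating node. The breaker only
    // accounts for bytes explicitly charged to it; anything allocated after
    // the check can still exhaust the heap.
    void bufferShardResult(CircuitBreaker requestBreaker, byte[] serializedAggs) {
        // Explicit check point: reserve the bytes or throw CircuitBreakingException.
        requestBreaker.addEstimateBytesAndMaybeBreak(serializedAggs.length, "<buffered_aggs>");
        // ... buffer the result for the next partial reduce ...
    }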

@nik9000
Member

nik9000 commented Jul 27, 2020

This is sort of linked in with a few things that @jimczi and I are working on slowly, in between a bunch of other things. In particular, we now have an accurate accounting of the memory usage of buffered aggregation results, which we expect to be the bulk of the "big" stuff in a request on the coordinating node. We're working on refactoring the partial reductions (in #58461). From there we hope to trigger these partial reductions based on memory usage. That isn't quite the same thing as this issue, but it is fairly close.

@jimczi
Contributor

jimczi commented Jul 27, 2020

From there we hope to trigger these partial reductions based on memory usage. That isn't quite the same thing as this issue, but it is fairly close.

In my mind, the follow-up to #58461 was to account for the memory used by buffered aggs in the circuit breaker, so that's closer to what this issue is about.

jimczi added a commit to jimczi/elasticsearch that referenced this issue Sep 10, 2020
This commit allows the coordinating node to account for the memory used to perform partial and final reduces of
aggregations in the request circuit breaker. The search coordinator adds the memory that it used to save
and reduce the results of shard aggregations to the request circuit breaker. Before any partial or final
reduce, the memory needed to reduce the aggregations is estimated and a CircuitBreakingException is thrown
if it exceeds the maximum memory allowed by this breaker.
This size is estimated as roughly 1.5 times the size of the serialized aggregations that need to be reduced.
This estimation can be completely off for some aggregations, but it is corrected with the real size after
the reduce completes.
If the reduce is successful, we update the circuit breaker to remove the size of the source aggregations
and replace the estimation with the serialized size of the newly reduced result.

As a follow-up, we could trigger partial reduces based on the memory accounted in the circuit breaker instead
of relying on a static number of shard responses. A simpler follow-up that could be done in the meantime is
to [reduce the default batch reduce size](elastic#51857) of blocking
search requests to a saner number.

Closes elastic#37182
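Condensed, the accounting pattern described in this commit looks roughly like the following (a simplified sketch with invented names; the real change is in #62223):

    import org.elasticsearch.common.breaker.CircuitBreaker;

    // Simplified sketch of the estimate-then-correct accounting described above.
    long reduceWithBreaker(CircuitBreaker requestBreaker, long serializedSize) {
        // Reserve roughly 1.5x the serialized size of the aggs to be reduced;
        // throws CircuitBreakingException if the breaker's limit would be exceeded.
        long estimate = serializedSize + serializedSize / 2;
        requestBreaker.addEstimateBytesAndMaybeBreak(estimate, "<reduce_aggs>");

        long reducedSize = doReduce(); // hypothetical: runs the partial or final reduce

        // Correct the accounting: release the estimate and keep only the
        // serialized size of the newly reduced result.
        requestBreaker.addWithoutBreaking(reducedSize - estimate);
        return reducedSize;
    }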