Parent Circuit Breaker should cause/allow memory to free before failing #88517
Pinging @elastic/es-core-infra (Team:Core/Infra)
Here are my thoughts on the current theory behind what's happening. In short, it's a limitation/bug in our circuit breakers' ability at the moment:
There are a few ways we can tackle this problem; all have drawbacks:
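One of the directions the issue title suggests can be sketched as follows. This is a hypothetical illustration, not Elasticsearch's actual breaker implementation; the class and method names are invented. The idea: when a reservation would push usage over the limit, give the collector one chance to reclaim memory that is already garbage (e.g. buffers from finished searches) and only trip the breaker if usage is still over the limit afterwards.

```java
// Hypothetical sketch of a "free memory before failing" parent breaker.
// Names (ParentBreakerSketch, tryReserve) are invented for illustration.
class ParentBreakerSketch {
    private final long limitBytes;

    ParentBreakerSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Returns true if the reservation is admitted, false if the breaker trips.
    boolean tryReserve(long bytes) {
        if (usedHeapBytes() + bytes <= limitBytes) {
            return true;
        }
        // Over the limit: much of the accounted heap may already be garbage,
        // so give the collector a chance to reclaim it before failing.
        System.gc(); // hint only; a real implementation would wait on GC notifications
        return usedHeapBytes() + bytes <= limitBytes;
    }
}
```

A real implementation would also need to bound how often a GC can be requested, since stacking full collections on the request path has its own latency drawbacks.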
We're running into this problem in a production setting, with a 32GB heap. Did you figure out a workaround? Any GC settings you tuned to increase the minimum amount of free heap space?
@taliastocks Which version of Elasticsearch are you using? There may be a better GitHub issue for your question, given some recent changes in the JDK.
We're on ES 7, sadly.
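For reference, the findings in this issue suggest one stopgap that was observed to work in the benchmark: raising the parent breaker limit so a major GC gets a chance to reclaim leftover humongous objects before requests are rejected. A minimal sketch of the setting, assuming a recent ES 7.x where the real-memory breaker is the default; verify against the docs for your exact version before using this in production:

```yaml
# elasticsearch.yml — stopgap, not a recommendation: trading breaker
# protection for fewer spurious circuit_breaking_exceptions.
indices.breaker.total.limit: 100%
```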
Background
Recently @salvatore-campagna and others added multiple aggregations-centric workloads to our nightly benchmarks at https://elasticsearch-benchmarks.elastic.co/. One in particular has been failing frequently and was recently removed: the `aggs` challenge in the `nyc_taxis` track, defined here: https://github.com/elastic/rally-tracks/blob/master/nyc_taxis/challenges/default.json#L506. We were running this workload with two single-node configurations, one with an 8GB heap and one with a 1GB heap. The purpose of the 1GB heap was to track performance in a memory-constrained environment, in case changes to the JVM, GC settings, or object sizes over time lead to regressions. However, this configuration ended up being too unstable in its current form to run on a nightly basis, as errors during the benchmark fail the entire run, and we publish gaps in the charts.
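For anyone trying to reproduce this locally, an invocation along these lines should target that challenge. This is an assumption about the setup, not the exact command used in the nightlies, and the `--car` value naming the 1GB-heap configuration may differ in your Rally install:

```shell
# Hypothetical reproduction command; adjust --car to your Rally cars.
esrally race --track=nyc_taxis --challenge=aggs --car=1gheap
```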
Problem
At the point where the benchmark breaks, we spam the following search repeatedly, without the query cache:
With a 1G heap we can pretty reliably get an error like:
Summarized Findings
- `int[]`s used as backing store for a Lucene `BKDPointTree` dominate allocations. We see method invocations of `DocIdSetBuilder::grow` and `DocIdSetBuilder::addBuffer` in the allocation stacktraces for these `int[]`s.
- Memory (image)
- Allocations (image)
- Setting `indices.breaker.total.limit` to `100%` allowed the benchmark to succeed. A 60 ms major GC cleaned up the humongous objects left over by prior searches and there were no circuit_breaking_exceptions.

Items to investigate
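One thing worth checking against the humongous-object theory: in G1, any single allocation of at least half a region becomes a humongous object, and those are mainly reclaimed by major collections, which matches the 60 ms major GC observation above. A back-of-the-envelope sketch, with the region size assumed for illustration (G1 picks 1-32 MB depending on heap size):

```java
// Sketch: when does a growing int[] buffer (as used by DocIdSetBuilder)
// cross G1's humongous threshold? Region size below is an assumed example.
class HumongousCheck {
    static final long REGION_BYTES = 4L * 1024 * 1024; // assumed 4 MB G1 region
    static final long HEADER_BYTES = 16;               // typical array header, 64-bit JVM

    // An object is humongous in G1 if it is at least half a region in size.
    static boolean isHumongous(int intArrayLength) {
        long sizeBytes = HEADER_BYTES + 4L * intArrayLength;
        return sizeBytes >= REGION_BYTES / 2;
    }
}
```

Under these assumptions, a buffer of about half a million ints already qualifies, so a 1GB heap can accumulate regions that only a major collection frees.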
Circuit breaker changes