Parent Circuit Breaker should cause/allow memory to free before failing #88517

Open · DJRickyB opened this issue Jul 13, 2022 · 5 comments

Labels: :Core/Infra/Resiliency, Team:Core/Infra, team-discuss

Comments

@DJRickyB (Contributor)

Background

Recently @salvatore-campagna and others added multiple aggregations-centric workloads to our nightly benchmarks at https://elasticsearch-benchmarks.elastic.co/. One in particular has been failing frequently and was recently removed: the aggs challenge in the nyc_taxis track, defined here: https://github.com/elastic/rally-tracks/blob/master/nyc_taxis/challenges/default.json#L506. We were running this workload with two single-node configurations, one with an 8GB heap and one with a 1GB heap.

The purpose of the 1GB heap was to track performance in a memory-constrained environment, in case changes to the JVM, GC settings, or object sizes over time lead to regressions. However, this configuration ended up being too unstable in its current form to run on a nightly basis: errors during the benchmark fail the entire run, and we end up publishing gaps in the charts.

Problem

At the point where the benchmark breaks, we spam the following search repeatedly, without the query cache:

{
  "size": 0,
  "query": {
    "range": {
      "dropoff_datetime": {
        "gte": "2015-01-01 00:00:00",
        "lt": "2015-03-01 00:00:00"
      }
    }
  },
  "aggs": {
    "dropoffs_over_time": {
      "date_histogram": {
        "field": "dropoff_datetime",
        "calendar_interval": "week",
        "time_zone": "America/New_York"
      }
    }
  }
}
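
For anyone reproducing this by hand, the same body can be sent directly with shard request caching turned off (a sketch only; the nyc_taxis index name and the request_cache=false parameter are my assumptions about how the track disables caching, not taken from the benchmark config):

POST /nyc_taxis/_search?request_cache=false
{
  "size": 0,
  "query": { "range": { "dropoff_datetime": { "gte": "2015-01-01 00:00:00", "lt": "2015-03-01 00:00:00" } } },
  "aggs": { "dropoffs_over_time": { "date_histogram": { "field": "dropoff_datetime", "calendar_interval": "week", "time_zone": "America/New_York" } } }
}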

With a 1 GB heap we can pretty reliably get an error like:

{'error': {'root_cause': [], 'type': 'search_phase_execution_exception', 'reason': '', 'phase': 'fetch', 'grouped': True, 'failed_shards': [], 'caused_by': {'type': 'circuit_breaking_exception', 'reason': '[parent] Data too large, data for [<reduce_aggs>] would be [1035903210/987.9mb], which is larger than the limit of [1020054732/972.7mb],
real usage: [1035902976/987.9mb], new bytes reserved: [234/234b], usages [eql_sequence=0/0b, model_inference=0/0b, inflight_requests=484/484b, request=234/234b, fielddata=0/0b]', 'bytes_wanted': 1035903210, 'bytes_limit': 1020054732, 'durability': 'TRANSIENT'}}, 'status': 429}

Summarized Findings

  1. We see 800 MB of humongous allocations at the peak during the benchmark, prior to failing with the circuit_breaking_exception. These are evidently large int[]s used as the backing store for a Lucene BKDPointTree; we see invocations of DocIdSetBuilder::grow and DocIdSetBuilder::addBuffer in the allocation stack traces for these int[]s.
    Memory (img)
    Allocations (img)
  2. Setting indices.breaker.total.limit to 100% allowed the benchmark to succeed (see the settings sketch after this list). A 60 ms major GC cleaned up the humongous objects left over by prior searches, and there were no circuit_breaking_exceptions.
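
For reference, the workaround in item 2 can be applied as a dynamic cluster setting (a minimal sketch; in the nightly benchmark the limit may instead be set through the node config or the Rally car):

PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "100%"
  }
}

Note this only works around the symptom: per the findings above, the breaker is tripping on heap that a mixed or full collection would have reclaimed anyway.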

Items to investigate

Circuit breaker changes

  • According to the description in Enhance real memory circuit breaker with G1 GC #58674, we should be seeing G1GC invoked before the parent circuit breaker trips. Is this interpretation correct? Can/should this overlimit strategy be made more aggressive, ensuring we attempt a concurrent collection before failing a request? Note that the problem statement there is similar to what we're seeing here:

The circuit breaking itself hinders that from occurring in a timely manner since it breaks all request before real work is done.
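
While digging into this, one way to watch the parent breaker against the real heap while the workload runs is the breaker section of the node stats API (a small sketch; this is the standard API, nothing specific to this benchmark):

GET _nodes/stats/breaker

Each breaker, including parent, reports estimated_size_in_bytes, limit_size_in_bytes, and a tripped counter, which makes it easy to see whether the estimate stays pinned near the limit between collections.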

@elasticmachine (Collaborator)

Pinging @elastic/es-core-infra (Team:Core/Infra)

@grcevski (Contributor)

Here are my thoughts on the current theory behind what's happening - in short, it's a limitation/bug in our current circuit breaker implementation:

  • G1 reserves a portion of the heap for humongous objects, but at a high enough rate of humongous allocations it has no choice but to place these objects in the 'old space' (the tenured area). The 'old space' is what we effectively count as occupied (non-free) heap, and it requires a 'mixed' or 'global/full' collection to actually be reclaimed, so unless one of those happens, G1GC's accounting of 'free' memory is inaccurate. Since the GC tends to be defensive about how much time it spends collecting, it will not do this 'old space' work unless provoked. (See the region-size arithmetic after this list for why these buffers end up humongous on a 1 GB heap.)
  • As new transactions arrive, they check how much free memory there is, and since we give them a wrong estimate, they hit the circuit breakers. If these transactions were allowed to proceed, they would trigger a G1GC collection that attempts to clean up the tenured regions and frees the memory occupied by the now-dead humongous allocations. However, there is no way to know whether that memory is occupied by live or dead objects until a 'mixed' or 'global' GC runs, and if we let a transaction proceed while the 'old space' is mostly live objects, we will hit an OOM.
  • This scenario is only a problem with small heaps (< 4GB). With larger heaps, the 'old' memory taken by the influx of humongous allocations is a smaller percentage of the whole heap, leaving enough headroom for transactions to proceed and trigger 'old space' collections as needed. The situation may still arise, but it's orders of magnitude less likely.
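
For a sense of scale on a 1 GB heap, here is the back-of-the-envelope arithmetic (my assumption: default G1 region sizing, i.e. roughly heap/2048 rounded to a power of two and clamped to at least 1 MB; not measured from this benchmark):

  G1 region size      = max(1 MB, 1024 MB / 2048)  = 1 MB
  humongous threshold = region size / 2            = 512 KB
  int[] elements      = ~512 KB / 4 bytes per int  = ~131,000 ints

So a DocIdSetBuilder buffer beyond roughly 128K doc IDs is already allocated directly into humongous (old) regions, which is consistent with the 800 MB of humongous allocations reported in the findings above.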

There are a few ways we can tackle this problem; all have drawbacks:

  • We can try using various G1GC options for small heaps that make mixed collections more likely; we only need some of them to happen to entice the GC to free memory for us (see the jvm.options sketch after this list). This has possible side effects on other workloads: will using these options negatively impact the performance of Elasticsearch elsewhere?
  • If we monitor the allocation rate relative to the growth of the occupied heap, we can detect the pollution of the 'old space' with these humongous objects. Essentially, if the 'old space' grew quickly to high occupancy because of a high allocation rate, we can estimate with high confidence that some of those objects may already be dead. When we detect such an event, we can dynamically raise indices.breaker.total.limit to a higher ceiling to allow transactions to proceed, and lower it back to normal once we detect a global collection, or once mixed collections have significantly reduced the 'old space' occupancy. Getting the thresholds right will not be easy: if we are too aggressive we might hit OOMs where we wouldn't before, and if we are not aggressive enough we might not improve the situation at all.
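
To make the first option concrete, these are the kinds of G1 knobs involved, shown as a jvm.options override (illustrative values only, not recommendations validated against this benchmark; the file name is hypothetical):

# config/jvm.options.d/g1-small-heap.options (hypothetical override file)
# start concurrent marking at a fixed, lower occupancy instead of the adaptive IHOP
-XX:-G1UseAdaptiveIHOP
-XX:InitiatingHeapOccupancyPercent=30
# let mixed collections keep running until less reclaimable space is left behind
-XX:G1HeapWastePercent=2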

@taliastocks

We're running into this problem in a production setting, with a 32GB heap. Did you figure out a workaround? Any GC settings you tuned to increase the minimum amount of free heap space?

@williamrandolph (Contributor)

@taliastocks Which version of Elasticsearch are you using? There may be a better GitHub issue for your question, given some recent changes in the JDK.

@taliastocks

> @taliastocks Which version of Elasticsearch are you using? There may be a better GitHub issue for your question, given some recent changes in the JDK.

We're on ES 7, sadly.
