
Too many empty buckets still cause OOM #35896

Closed
synhershko opened this issue Nov 26, 2018 · 7 comments
@synhershko (Contributor)

I'm aware of the request circuit breaker that protects against OOMs in aggregation requests (namely #19394), but we are still able to crash a node with an OOM.

Consider the following:

PUT /sports/
{
   "mappings": {
      "doc": {
         "properties": {
            "birthdate": {
               "type": "date",
               "format": "dateOptionalTime"
            },
            "location": {
               "type": "geo_point"
            },
            "name": {
               "type": "keyword"
            },
            "rating": {
               "type": "integer"
            },
            "sport": {
               "type": "keyword"
            }
         }
      }
   }
}

POST _bulk
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Michael", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["5", "4"],  "location":"46.22,-68.45"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Bob", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["3", "4"],  "location":"45.21,-68.35"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Jim", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["3", "2"],  "location":"45.16,-63.58" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Joe", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Tim", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["3", "3"],  "location":"46.22,-68.85"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Alfred", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.12,-68.35"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Jeff Cohen", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["2", "3"], "location":"46.12,-68.55"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Will", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["4", "4"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Mick", "birthdate":"1989-10-1", "sport":"Baseball", "rating": ["3", "4"],  "location":"46.22,-68.45"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Pong", "birthdate":"1989-11-2", "sport":"Baseball", "rating": ["1", "3"],  "location":"45.21,-68.35"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Ray Ban", "birthdate":"1988-10-3", "sport":"Baseball", "rating": ["2", "2"],  "location":"45.16,-63.58" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Ping", "birthdate":"1992-5-20", "sport":"Baseball", "rating": ["4", "3"],  "location":"45.22,-68.53"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Duke", "birthdate":"1992-2-28", "sport":"Baseball", "rating": ["5", "2"],  "location":"46.22,-68.85"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Hal", "birthdate":"1990-9-9", "sport":"Baseball", "rating": ["4", "2"],  "location":"45.12,-68.35"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Charge", "birthdate":"1990-4-1", "sport":"Baseball", "rating": ["3", "2"], "location":"46.12,-68.55"}
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Barry", "birthdate":"1988-3-1", "sport":"Baseball", "rating": ["5", "2"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Bank", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["6", "4"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Bingo", "birthdate":"1988-3-1", "sport":"Golf", "rating": ["10", "7"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"James", "birthdate":"1988-3-1", "sport":"Basketball", "rating": ["10", "8"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Wayne", "birthdate":"1988-3-1", "sport":"Hockey", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Brady", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }
{"index":{"_index":"sports","_type":"doc"}}
{"name":"Lewis Kay", "birthdate":"1988-3-1", "sport":"Football", "rating": ["10", "10"], "location":"46.25,-68.55" }

A histogram aggregation (not date_histogram) with an interval of 50 on the birthdate field (yeah, I know) is enough to crash the node, especially if it has a small heap.
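The crashing request itself isn't shown, but from the description it would be a plain histogram over birthdate, along these lines (the exact body is a reconstruction, not copied from the report):

```
POST /sports/_search
{
   "size": 0,
   "aggs": {
      "birthdays": {
         "histogram": {
            "field": "birthdate",
            "interval": 50
         }
      }
   }
}
```

Because birthdate is a date field, a plain histogram buckets its raw epoch-millisecond values, so interval 50 means buckets 50 ms wide spanning roughly four years of data.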

@elasticmachine (Collaborator)

Pinging @elastic/es-analytics-geo

@jimczi (Contributor) commented Nov 26, 2018

@synhershko we introduced a new limit in 6.x through a cluster setting named search.max_buckets:
#27581
However, we couldn't activate it by default in a minor release, so in 6.x a deprecation warning is printed when the number of buckets to render exceeds 10,000. You can force a value in 6.x (it's a dynamic cluster setting) to throw an error instead. The setting is active by default in 7.x (with a default limit of 10,000), so I hope you don't mind if I close this issue. I tried your recreation in 6.x with search.max_buckets set to 10,000, and in 7.x, and I get the expected error:

{
    "error": {
        "root_cause": [],
        "type": "search_phase_execution_exception",
        "reason": "",
        "phase": "fetch",
        "grouped": true,
        "failed_shards": [],
        "caused_by": {
            "type": "too_many_buckets_exception",
            "reason": "Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting.",
            "max_buckets": 10000
        }
    },
    "status": 503
}
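For reference, the dynamic setting mentioned above can be lowered on a running 6.x cluster with the cluster settings API; a sketch (the value is an example, pick one that fits your heap):

```
PUT _cluster/settings
{
   "transient": {
      "search.max_buckets": 10000
   }
}
```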

@jimczi jimczi closed this as completed Nov 26, 2018
@synhershko (Contributor, Author)

The problem is that this error/warning is never printed, because the ES node hits OOM and crashes (in the createEmptyBuckets method; full stack trace not handy at the moment). Specifically, that happened on a 1g heap, so I assume the small heap is the reason and a lower search.max_buckets setting would solve it. Thanks!

@jimczi (Contributor) commented Nov 26, 2018

> The problem is that this error/warning is never printed, because the ES node hits OOM and crashes (in the createEmptyBuckets method; full stack trace not handy at the moment)

It should be printed in the deprecation logs as soon as createEmptyBuckets reaches 10,000 buckets. If it's not, then there is a bug.

@synhershko (Contributor, Author)

Deprecation log contains:

[2018-11-26T10:46:32,835][WARN ][o.e.d.s.a.MultiBucketConsumerService] [ip-172-31-28-55] This aggregation creates too many buckets (10001) and will throw an error in future versions. You should update the [search.max_buckets] cluster setting or use the [composite] aggregation to paginate all buckets in multiple requests.

But the node still crashes with the following (truncated at the end in the original file; the process probably quit before finishing the flush to disk):

[2018-11-26T10:46:33,992][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-31-28-55] [gc][8471] overhead, spent [901ms] collecting in the last [1s]
[2018-11-26T10:46:36,606][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-31-28-55] [gc][8472] overhead, spent [2.3s] collecting in the last [2.6s]
[2018-11-26T10:46:38,199][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-31-28-55] [gc][8473] overhead, spent [1.5s] collecting in the last [1.5s]
[2018-11-26T10:46:41,889][WARN ][o.e.m.j.JvmGcMonitorService] [ip-172-31-28-55] [gc][8474] overhead, spent [3.6s] collecting in the last [3.6s]
[2018-11-26T10:46:49,522][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [ip-172-31-28-55] fatal error in thread [elasticsearch[ip-172-31-28-55][search][T#1]], exiting
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:3181) ~[?:1.8.0_191]
        at java.util.ArrayList.grow(ArrayList.java:265) ~[?:1.8.0_191]
        at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:239) ~[?:1.8.0_191]
        at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:231) ~[?:1.8.0_191]
        at java.util.ArrayList.add(ArrayList.java:479) ~[?:1.8.0_191]
        at java.util.ArrayList$ListItr.add(ArrayList.java:964) ~[?:1.8.0_191]
        at org.elasticsearch.search.aggregations.bucket.histogram.InternalHistogram.addEmptyBuckets(InternalHistogram.java:405) ~[elasticsearch-6.5.0.jar:6.5.0]
        at org.elasticsearch.search.aggregations.bucket.histogram.InternalHist

@jimczi (Contributor) commented Nov 26, 2018

Thanks @synhershko. The issue is that we don't have a circuit breaker on the coordinating node; we only use one on each shard that participates in the search, so the search.max_buckets setting is different. It can be seen as a limit on the number of buckets that we can transport between shards and finally to the user. We didn't want to break every use case that returns more than 10,000 buckets in a minor release, so we chose the deprecation path, but as you noticed this doesn't prevent the node from crashing if the heap cannot handle this amount of memory. However, there is a workaround in 6.x (where you can force search.max_buckets to a reasonable value), and this shouldn't be an issue in 7.x anymore, where the default value should protect against the explosion.
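To illustrate why this is fatal on a small heap: a back-of-the-envelope count of the empty buckets the reported request (histogram with interval 50 on birthdate) asks the node to materialize. This is a sketch; the date range comes from the sample documents above and the interval from the report:

```python
from datetime import datetime, timezone

# Earliest and latest birthdates in the sample data, as epoch milliseconds
# (a date field in a plain histogram is bucketed on its raw millisecond value).
lo = int(datetime(1988, 3, 1, tzinfo=timezone.utc).timestamp() * 1000)
hi = int(datetime(1992, 5, 20, tzinfo=timezone.utc).timestamp() * 1000)

interval = 50  # the histogram interval from the report, i.e. 50 ms per bucket
num_buckets = (hi - lo) // interval + 1
print(f"{num_buckets:,} buckets")  # 2,662,848,001 buckets
```

Roughly 2.7 billion buckets, each a Java object in an ArrayList in addEmptyBuckets, which a 1g heap cannot hold long before the 10,000-bucket warning matters.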

@synhershko (Contributor, Author)

For the record, this happens on a single-node instance (master + data + ingest roles) with a 1g heap, so I did expect the circuit breaker to trip.

Understood regarding the setting. Obviously this is not a production deployment and the query is faulty, but again, I did expect Elasticsearch to protect itself. I'll check this with 7.x later...
