possible memory leak index query cache #18161

Closed
wfelipe opened this issue May 5, 2016 · 5 comments
@wfelipe commented May 5, 2016

Elasticsearch version: 2.2.2 and 2.3.2

JVM version: 1.8.0_65 and 1.8.0_92

OS version: centos 7 (kernel 3.10.0-327.13.1.el7.x86_64)

Description of the problem including expected versus actual behavior:
This is a test system: writing has been disabled and only search traffic is being served. Here is the environment:

  • 10 indices, with a total of 156 shards (75 of them primaries) and a total size of 6.4 TB
  • the cluster has 4 servers with 64 GB of RAM each, 30 GB of which is allocated to Elasticsearch (one instance per server)
  • one index is 2.1 TB (4.4 TB total with one replica), created with 30 shards
  • using the default GC tuning from Elasticsearch
  • number of segments: 4.7k (about 1.2k on each node)

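As a rough sanity check on the per-node numbers above (a quick sketch; the even distribution across the 4 nodes is an assumption):

```python
# Rough per-node arithmetic for the cluster described above,
# assuming shards and segments are spread evenly across the 4 nodes.
total_shards = 156
total_segments = 4700
nodes = 4

shards_per_node = total_shards / nodes      # 39.0
segments_per_node = total_segments / nodes  # 1175.0, matching "about 1.2k on each node"

print(shards_per_node, segments_per_node)
```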
configuration:

cluster.name: es_testing
#
# ------------------------------------ Node ------------------------------------
#
node.name: hostname1
node.max_local_storage_nodes: 1
#
# ----------------------------------- Paths ------------------------------------
#
path.conf: /etc/elasticsearch
path.data: /u1/elasticsearch,/u2/elasticsearch,/u3/elasticsearch,/u4/elasticsearch,/u5/elasticsearch
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
bootstrap.mlockall: true
#
# ---------------------------------- Network -----------------------------------
#
network.host: 0.0.0.0
http.port: 9200
#
# ---------------------------------- Gateway -----------------------------------
#
#
# --------------------------------- Discovery ----------------------------------
#
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: hostname1,hostname2,hostname3,hostname4
discovery.zen.minimum_master_nodes: 2
#
# ---------------------------------- Various -----------------------------------
#
action.auto_create_index: true
action.destructive_requires_name: true
#
# -------------------------- Custom Chef Configuration --------------------------
#
action.disable_delete_all_indices: true
gateway.expected_nodes: 1
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.trace: 500ms
index.indexing.slowlog.threshold.index.warn: 10s
index.mapper.dynamic: false
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.trace: 200ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.query.warn: 10s
indices.breaker.fielddata.limit: 20%
indices.breaker.request.limit: 20%
indices.breaker.total.limit: 20%
indices.fielddata.cache.size: 10%
monitor.jvm.gc.old.debug: 2s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.warn: 1000ms
network.publish_host: _site_
script.engine.groovy.inline.aggs: true
script.engine.groovy.inline.mapping: false
script.engine.groovy.inline.plugin: false
script.engine.groovy.inline.search: true
script.engine.groovy.inline.update: false
script.groovy.sandbox.receiver_whitelist: "java.lang.String,java.lang.Object,java.lang.Math"
threadpool.search.size: 1000

Steps to reproduce:
The cluster stays healthy when no queries are running (heap around 7 GB on 2.3.2, and 4 GB on 2.2.2). Once we start sending queries (200-300 req/s), the cluster eats up the heap, and once old-gen GC starts running it never frees enough memory.

After a couple of hours the cluster becomes unresponsive and a restart is required.

Provide logs (if relevant):
Two heap dumps were taken, and both reported the same suspects. Here are the findings from one of the dumps:

  • more than 5,000,000 instances of org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader, using a total of 15 GB
  • 990 instances of org.apache.lucene.index.SegmentCoreReaders, using a total of 6 GB
  • about 1 million instances of HashMap, also using 6 GB

Memory reports attached.

@wfelipe (Author) commented May 5, 2016

Forgot to attach the images; here they are:

[screenshot: 2016-05-05 10:41:14 AM]

[screenshot: 2016-05-05 9:53:04 AM]

[screenshot: 2016-05-05 9:53:09 AM]

[screenshot: 2016-05-05 9:54:15 AM]

@clintongormley (Member) commented May 6, 2016

What happens when you remove the ridiculously high search thread pool?

threadpool.search.size: 1000
@wfelipe (Author) commented May 6, 2016

We keep track of thread pool usage, and raising it to 1000 was an attempt to observe the behavior. The number of threads actually in use hovers around 5-15. It only reaches 100 once the heap is exhausted, so that is a side effect rather than the cause.

@clintongormley (Member) commented May 7, 2016

@wfelipe yes, but what happens when you use the default setting for the search threadpool size, which is ((number of processors * 3) / 2) + 1? You don't mention how many processors you have, but simply unsetting threadpool.search.size will restore the default. With a very large pool, if search is struggling for whatever reason, it will just grab another of the many threads you have allowed it to use, which will bring the system to its knees. With a reasonable thread pool size, search requests are instead queued or rejected, keeping the system healthy.

That's why I want to see what happens to memory usage when the threads setting is the default.
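For reference, the default formula works out as follows (a quick sketch; the core counts are illustrative, not taken from this cluster, since the issue does not state the processor count):

```python
# Default search thread pool size in Elasticsearch 2.x:
# ((available processors * 3) / 2) + 1, using integer division.
def default_search_pool_size(processors: int) -> int:
    return (processors * 3) // 2 + 1

for cores in (8, 16, 32):
    print(cores, default_search_pool_size(cores))  # 13, 25, 49 respectively
```

Even on a 32-core box the default is 49 threads, far below the configured 1000.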

@jpountz (Contributor) commented May 7, 2016

Everything you are describing is a side effect of having too many threads * segments per node. Lucene keeps per-segment state in a thread local, which is why you are seeing so many instances of SegmentCoreReaders and CompressingStoredFieldsReader. You should try to reduce the size of the search/get thread pools and have fewer (larger) segments per node.
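A back-of-the-envelope illustration of the threads * segments effect (assumptions: each thread that has touched a segment holds its own thread-local stored-fields reader for it, ~1.2k segments per node as reported above, and worst-case use of the full pool):

```python
# Worst-case count of per-segment thread-local reader clones:
# each searching thread gets its own reader per segment it touches,
# so instances grow as pool_size * segments_per_node.
segments_per_node = 1200
nodes = 4

for pool_size in (49, 1000):  # ~default for 32 cores vs. the configured value
    per_node = pool_size * segments_per_node
    cluster = per_node * nodes
    print(pool_size, per_node, cluster)
```

With the configured 1000-thread pool this allows ~1.2 million readers per node (~4.8 million cluster-wide), which is in the same ballpark as the >5 million CompressingStoredFieldsReader instances in the heap dump; the default-sized pool caps it around 59k per node.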
