possible memory leak index query cache #18161

Closed
wfelipe opened this issue May 5, 2016 · 5 comments
@wfelipe commented May 5, 2016

Elasticsearch version: 2.2.2 and 2.3.2

JVM version: 1.8.0_65 and 1.8.0_92

OS version: centos 7 (kernel 3.10.0-327.13.1.el7.x86_64)

Description of the problem including expected versus actual behavior:
This is a test system: writing has been disabled and only search traffic is being served. Here is the environment:

  • 10 indices, with a total of 156 shards (75 of them primaries) and a total size of 6.4 TB
  • the cluster has 4 servers with 64 GB of RAM each, 30 GB of which is allocated to Elasticsearch (one instance per server)
  • one index is 2.1 TB (4.4 TB total with one replica), created with 30 shards
  • using the default GC tuning from Elasticsearch
  • number of segments: 4.7k (about 1.2k on each node)

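As a rough sanity check on the per-node numbers above (a quick sketch; the even distribution across the 4 nodes is an assumption):

```python
# Rough per-node arithmetic for the cluster described above,
# assuming shards and segments are spread evenly across the 4 nodes.
total_shards = 156
total_segments = 4700
nodes = 4

shards_per_node = total_shards / nodes      # 39.0
segments_per_node = total_segments / nodes  # 1175.0, matching "about 1.2k on each node"

print(shards_per_node, segments_per_node)
```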
configuration:

cluster.name: es_testing
#
# ------------------------------------ Node ------------------------------------
#
node.name: hostname1
node.max_local_storage_nodes: 1
#
# ----------------------------------- Paths ------------------------------------
#
path.conf: /etc/elasticsearch
path.data: /u1/elasticsearch,/u2/elasticsearch,/u3/elasticsearch,/u4/elasticsearch,/u5/elasticsearch
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
bootstrap.mlockall: true
#
# ---------------------------------- Network -----------------------------------
#
network.host: 0.0.0.0
http.port: 9200
#
# ---------------------------------- Gateway -----------------------------------
#
#
# --------------------------------- Discovery ----------------------------------
#
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: hostname1,hostname2,hostname3,hostname4
discovery.zen.minimum_master_nodes: 2
#
# ---------------------------------- Various -----------------------------------
#
action.auto_create_index: true
action.destructive_requires_name: true
#
# -------------------------- Custom Chef Configuration --------------------------
#
action.disable_delete_all_indices: true
gateway.expected_nodes: 1
index.indexing.slowlog.threshold.index.debug: 2s
index.indexing.slowlog.threshold.index.info: 5s
index.indexing.slowlog.threshold.index.trace: 500ms
index.indexing.slowlog.threshold.index.warn: 10s
index.mapper.dynamic: false
index.search.slowlog.threshold.fetch.debug: 500ms
index.search.slowlog.threshold.fetch.info: 800ms
index.search.slowlog.threshold.fetch.trace: 200ms
index.search.slowlog.threshold.fetch.warn: 1s
index.search.slowlog.threshold.query.debug: 2s
index.search.slowlog.threshold.query.info: 5s
index.search.slowlog.threshold.query.trace: 500ms
index.search.slowlog.threshold.query.warn: 10s
indices.breaker.fielddata.limit: 20%
indices.breaker.request.limit: 20%
indices.breaker.total.limit: 20%
indices.fielddata.cache.size: 10%
monitor.jvm.gc.old.debug: 2s
monitor.jvm.gc.old.info: 5s
monitor.jvm.gc.old.warn: 10s
monitor.jvm.gc.young.debug: 400ms
monitor.jvm.gc.young.info: 700ms
monitor.jvm.gc.young.warn: 1000ms
network.publish_host: _site_
script.engine.groovy.inline.aggs: true
script.engine.groovy.inline.mapping: false
script.engine.groovy.inline.plugin: false
script.engine.groovy.inline.search: true
script.engine.groovy.inline.update: false
script.groovy.sandbox.receiver_whitelist: "java.lang.String,java.lang.Object,java.lang.Math"
threadpool.search.size: 1000

Steps to reproduce:
The cluster stays healthy when no queries are running (heap around 7 GB on 2.3.2, and 4 GB on 2.2.2). Once we start sending queries (200-300 req/s), the cluster eats up the heap, and once old-gen GC starts running it never frees enough memory.

After a couple of hours the cluster becomes unresponsive and a restart is required.

Provide logs (if relevant):
Two heap dumps were taken, and both reported the same suspects. Here are the findings from one of the dumps:

  • more than 5,000,000 instances of org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader, using a total of 15 GB
  • 990 instances of org.apache.lucene.index.SegmentCoreReaders, using a total of 6 GB
  • about 1 million instances of HashMap, also using 6 GB

Memory reports attached.

@wfelipe (Author) commented May 5, 2016

Forgot to attach the images; here they are:

[screenshot: 2016-05-05 10:41:14 AM]

[screenshot: 2016-05-05 9:53:04 AM]

[screenshot: 2016-05-05 9:53:09 AM]

[screenshot: 2016-05-05 9:54:15 AM]

@clintongormley (Member) commented May 6, 2016

What happens when you remove the ridiculously high search thread pool?

threadpool.search.size: 1000
@wfelipe (Author) commented May 6, 2016

We keep track of thread pool usage, and raising it to 1000 was an attempt to observe the behavior. The number of threads actually in use hovers around 5-15. It only reaches 100 once the heap is exhausted, so that is a side effect rather than the cause.

@clintongormley (Member) commented May 7, 2016

@wfelipe yes, but what happens when you use the default setting for the search threadpool size, which is ((number of processors * 3) / 2) + 1? You don't mention how many processors you have, but simply unsetting threadpool.search.size will restore the default. With a very large pool, if search is struggling for whatever reason, it will just grab another of the many threads you have allowed it to use, which will bring the system to its knees. With a reasonable thread pool size, search requests are instead queued or rejected, keeping the system healthy.

That's why I want to see what happens to memory usage when the threads setting is the default.
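For reference, the default formula works out as follows (a quick sketch; the core counts are illustrative, not taken from this cluster, since the issue does not state the processor count):

```python
# Default search thread pool size in Elasticsearch 2.x:
# ((available processors * 3) / 2) + 1, using integer division.
def default_search_pool_size(processors: int) -> int:
    return (processors * 3) // 2 + 1

for cores in (8, 16, 32):
    print(cores, default_search_pool_size(cores))  # 13, 25, 49 respectively
```

Even on a 32-core box the default is 49 threads, far below the configured 1000.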

@jpountz (Contributor) commented May 7, 2016

Everything you are describing is a side effect of having too many threads * segments per node. Lucene keeps per-segment state in a thread local, which is why you are seeing so many instances of SegmentCoreReaders and CompressingStoredFieldsReader. You should try to reduce the size of the search/get thread pools and have fewer (larger) segments per node.
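A back-of-the-envelope illustration of the threads * segments effect (assumptions: each thread that has touched a segment holds its own thread-local stored-fields reader for it, ~1.2k segments per node as reported above, and worst-case use of the full pool):

```python
# Worst-case count of per-segment thread-local reader clones:
# each searching thread gets its own reader per segment it touches,
# so instances grow as pool_size * segments_per_node.
segments_per_node = 1200
nodes = 4

for pool_size in (49, 1000):  # ~default for 32 cores vs. the configured value
    per_node = pool_size * segments_per_node
    cluster = per_node * nodes
    print(pool_size, per_node, cluster)
```

With the configured 1000-thread pool this allows ~1.2 million readers per node (~4.8 million cluster-wide), which is in the same ballpark as the >5 million CompressingStoredFieldsReader instances in the heap dump; the default-sized pool caps it around 59k per node.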
