Too many translog files open, and reach ulimit setting(65535) #49970

seeIT-52 · 2019-12-08T11:08:38Z

Describe the feature:
Hi,
I ran into a 'too many open files' problem in my ES environment. Using the command ls -al /proc/ES_PID/fd, I found that ES had opened a particularly large number of translog files.
After some research on this problem, I tried restarting a normally running ES (6.6.1) node. I found that every time ES is restarted, there is one more translog file on some shards, and none of these translog files contain any operation data, like below (the server time is incorrect, but I don't think that matters):

Elasticsearch version (bin/elasticsearch --version):
Version: 6.6.1, Build: default/tar/1fd8f69/2019-02-13T17:10:04.160291Z, JVM: 1.8.0_60
Plugins installed: []
No Plugins
JVM version (java -version):
java version "1.8.0_60"
Java(TM) SE Runtime Environment (build 1.8.0_60-b27)
OS version (uname -a if on a Unix-like system):
CentOS 7.6
Linux 3.10.0-957.5.1.el7.x86_64 #1 SMP Fri Feb 1 14:54:57 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Empty translog files (no operation data) keep increasing.
Old translog files still exist beyond 12 hours (index.translog.retention.age = 12h).
I think empty translog files should be deleted, or new translog files should not be created.
Steps to reproduce:
Restart the ES node.
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query etc. The easier you make for
us to reproduce it, the more likely that somebody will take the time to look at it.

The default index settings are as follows:
"settings": {
  "index.number_of_replicas": "0",
  "index.refresh_interval": "10s",
  "index.number_of_shards": "1",
  "index.routing.allocation.require.tag": "hot",
  "index.indexing.slowlog.level": "info",
  "index.indexing.slowlog.threshold.index.trace": "500ms",
  "index.indexing.slowlog.threshold.index.debug": "2s",
  "index.indexing.slowlog.threshold.index.info": "5s",
  "index.indexing.slowlog.threshold.index.warn": "10s",
  "index.search.slowlog.level": "info",
  "index.search.slowlog.threshold.fetch.trace": "200ms",
  "index.search.slowlog.threshold.fetch.debug": "500ms",
  "index.search.slowlog.threshold.fetch.info": "800ms",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.search.slowlog.threshold.query.trace": "500ms",
  "index.search.slowlog.threshold.query.debug": "2s",
  "index.search.slowlog.threshold.query.info": "5s",
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.translog.durability": "async",
  "index.translog.flush_threshold_size": "5000mb",
  "index.translog.sync_interval": "120s",
  "index.mapping.ignore_malformed": true
}
Indices are auto-created with dynamic mapping, and index names are based on the date, so only the index created today is active. Indices created before today receive no new data.
After each ES restart, the number of translog files for indices created before today (which have no data to index) keeps increasing.

If I missed some configuration, please let me know.
I'd be happy to provide additional details, whatever is needed.

Thanks!

elasticmachine · 2019-12-09T10:34:39Z

Pinging @elastic/es-distributed (:Distributed/Engine)

DaveCTurner · 2019-12-09T11:12:37Z

I figured this would have been addressed by #47414, and maybe also no longer an issue in 7.x thanks to #45473, but in fact that's not true. After restarting a one-node 7.5.0 cluster repeatedly it had over 100 translog generations, despite the limit imposed in #47414. We do a kind of partial flush in InternalEngine#recoverFromTranslogInternal, calling commitIndexWriter, but only if there were ops to recover and we don't trim the translog there either way. Maybe we should?

In the meantime a POST /index/_flush?force will, I think, clean things up.
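
The force flush suggested above could be scripted along these lines. This is only a minimal sketch using the Python standard library: the host `http://localhost:9200` and the index name `my-index` are placeholders that do not come from this thread, and the helper names are made up for illustration. It relies on the `force` query parameter of the flush API.

```python
import urllib.parse
import urllib.request


def force_flush_url(base_url: str, index: str) -> str:
    """Build the URL for POST /<index>/_flush?force=true."""
    # Keep commas and wildcards so multi-index expressions still work.
    quoted = urllib.parse.quote(index, safe=",*")
    return f"{base_url.rstrip('/')}/{quoted}/_flush?force=true"


def force_flush(base_url: str, index: str, timeout: float = 30.0) -> bytes:
    """Send the force flush and return the raw response body."""
    request = urllib.request.Request(force_flush_url(base_url, index), method="POST")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Example call (hypothetical host and index name, adjust for the real cluster):
# print(force_flush("http://localhost:9200", "my-index").decode())
```

Building the URL in a separate function makes it easy to check what will be sent before touching a live node.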

seeIT-52 · 2019-12-10T07:36:03Z

Hi @DaveCTurner ,
Thanks a lot for your answer.
And is there any command like elasticsearch-translog trim for merging translogs? Until all the translogs are merged, my ES node recovers very slowly or can't even start (too many open files).

ywelsch · 2019-12-10T08:52:10Z

No, you'll have to increase the number of file descriptors before starting up the node next time.
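
To decide how far the file descriptor limit needs to be raised, it may help to count the translog generation files on disk first. The following is a rough sketch, assuming that shard translog files live in directories named `translog` and are named `translog-<generation>.tlog`; the data path in the usage comment is hypothetical and the function names are invented for this example.

```python
import os
import re
from collections import Counter

# Generation files are assumed to be named translog-<generation>.tlog.
TLOG_NAME = re.compile(r"^translog-\d+\.tlog$")


def count_translog_generations(data_path: str) -> Counter:
    """Map each translog directory under data_path to its number of .tlog files."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(data_path):
        if os.path.basename(dirpath) != "translog":
            continue
        n = sum(1 for name in filenames if TLOG_NAME.match(name))
        if n:
            counts[dirpath] = n
    return counts


def report(data_path: str, top: int = 10) -> str:
    """Summarize the total and the directories with the most .tlog files."""
    counts = count_translog_generations(data_path)
    lines = [f"total .tlog files: {sum(counts.values())}"]
    lines += [f"{n:6d}  {path}" for path, n in counts.most_common(top)]
    return "\n".join(lines)


# Example (hypothetical data path):
# print(report("/var/lib/elasticsearch"))
```

The total can then be compared with the configured ulimit (65535 in this report) to see how much headroom the node needs at startup.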

dnhatn · 2019-12-12T03:12:32Z

We create a new translog generation whenever we open a new Translog instance. We need to do that to make sure each generation has at most one primary term. The actual problem is that the translog deletion policy uses the translog generation tag from the safe commit as the baseline. Hence, as David said, we won't be able to clean up translog unless we (force) flush. We need to handle sync_id carefully if we force flush a recovering shard.

I am prototyping an option where the translog deletion policy uses the local checkpoint from the safe commit instead. It would allow us to clean up the extra translog without having a new commit. However, it's a quite substantial change as we've relied on the translog generation tag in many places.
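
The difference between the two baselines can be illustrated with a toy model. This is not Elasticsearch's code: it assumes each translog generation can be summarized by the highest sequence number it contains (-1 if empty), and every class and function name below is invented for the sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Generation:
    gen: int         # translog generation number
    max_seq_no: int  # highest sequence number stored in it, -1 if empty


def min_gen_by_tag(committed_generation: int) -> int:
    """Generation-tag baseline: the generation recorded in the safe commit."""
    return committed_generation


def min_gen_by_checkpoint(generations: List[Generation], local_checkpoint: int) -> int:
    """Checkpoint baseline: oldest generation holding an op above the checkpoint.

    Falls back to the newest generation when no such op exists.
    """
    needed = [g.gen for g in generations if g.max_seq_no > local_checkpoint]
    return min(needed, default=max(g.gen for g in generations))


def retained(generations: List[Generation], min_gen: int) -> List[int]:
    """Generations that must be kept for a given minimum required generation."""
    return [g.gen for g in generations if g.gen >= min_gen]


# The safe commit records generation 3 and local checkpoint 99; generation 3
# holds ops up to seq_no 99. Three restarts then add empty generations 4..6.
gens = [Generation(3, 99), Generation(4, -1), Generation(5, -1), Generation(6, -1)]
print(retained(gens, min_gen_by_tag(3)))                # [3, 4, 5, 6]
print(retained(gens, min_gen_by_checkpoint(gens, 99)))  # [6]
```

In this model the tag baseline keeps every generation created since the last commit, while the checkpoint baseline keeps only the newest one, because no retained operation lies above the committed local checkpoint.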

Today we use the translog_generation of the safe commit as the minimum required translog generation for recovery. This approach has a limitation, where we won't be able to clean up translog unless we flush. Reopening an already recovered engine will create a new empty translog, and we leave it there until we force flush. This commit removes the translog_generation commit tag and uses the local checkpoint of the safe commit to calculate the minimum required translog generation for recovery instead. Closes #49970

DaveCTurner added the :Distributed/Engine label (Anything around managing Lucene and the Translog in an open shard.) Dec 9, 2019

DaveCTurner added the >bug label Dec 9, 2019

ywelsch assigned dnhatn Dec 9, 2019

dnhatn mentioned this issue Dec 15, 2019

Account trimAboveSeqNo in committed translog generation #50205

Merged

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

dnhatn mentioned this issue Feb 5, 2020

Use local checkpoint to calculate min translog gen for recovery #51905

Merged

dnhatn closed this as completed in #51905 Feb 10, 2020

dnhatn mentioned this issue Feb 26, 2020

Use local checkpoint to calculate min translog gen for recovery #52841

Closed

codebrain mentioned this issue Apr 1, 2020

7.7.0 meta ticket (Part 2) elastic/elasticsearch-net#4533

Closed
