Bulk task queue grows infinitely after upgrading to 6.2.1 (from 5.6.4) #28714
Comments
Hi @bra-fsn, we reserve Github for bug reports and feature requests only. Please ask questions like these in the Elasticsearch forum instead. Thank you! Some hints to get you started:
There's a deadlock in The stack trace below should be pretty self-explanatory of what's going on:
@danielmitterdorfer: this is a bug report, not a question...
@ywelsch thanks for double-checking. I did not spot the deadlock in the thread dump originally.
I am looking into it
Pruning tombstones is best effort and should not block if a key is currently locked. Blocking can cause a deadlock in rare situations if we switch off the append-only optimization while heavily updating the same key in the engine while the LiveVersionMap is locked. This is very rare since this code path is only executed every 15 seconds by default, which is the interval at which we try to prune the deletes in the version map. Closes elastic#28714
@bra-fsn thanks for reporting this. @danielmitterdorfer @ywelsch I opened a PR for this.
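The idea behind the fix ("best effort, do not block on a locked key") can be illustrated with a small sketch. This is not the Elasticsearch code, which is Java. `KeyedLocks` and `prune_tombstones` are hypothetical names, and the tombstone map is a plain dict used only for illustration. The only real API relied on is Python's `threading.Lock.acquire(blocking=False)`.

```python
import threading


class KeyedLocks:
    """Hypothetical per-key lock table (illustration only)."""

    def __init__(self):
        self._guard = threading.Lock()
        self._locks = {}

    def lock_for(self, key):
        # Create the per-key lock on first use; the guard protects the dict.
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())


def prune_tombstones(tombstones, locks, now, max_age):
    """Remove expired tombstones, skipping any key whose lock is held.

    Returns (pruned_keys, skipped_keys). Skipped keys are simply left
    for the next pruning round instead of being waited on.
    """
    pruned, skipped = [], []
    for key, deleted_at in list(tombstones.items()):
        if now - deleted_at < max_age:
            continue  # not old enough to prune yet
        lock = locks.lock_for(key)
        # Non-blocking acquire: never wait for a writer that holds the key.
        if not lock.acquire(blocking=False):
            skipped.append(key)
            continue
        try:
            tombstones.pop(key, None)
            pruned.append(key)
        finally:
            lock.release()
    return pruned, skipped
```

Because the pruner never waits on a key lock, it cannot take part in a lock cycle with a thread that is updating that key; the cost is that a busy key may keep its tombstone until a later round.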
Elasticsearch version (bin/elasticsearch --version):
Version: 6.2.1, Build: 7299dc3/2018-02-07T19:34:26.990113Z, JVM: 1.8.0_131
Plugins installed: []
analysis-icu
JVM version (java -version):
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
OS version (uname -a if on a Unix-like system):
FreeBSD 11.1
Description of the problem including expected versus actual behavior:
After upgrading to 6.2.1, one of the data (and also client) nodes rejects all bulk operations. In the reject responses, queued tasks grow without bound while completed tasks don't change (see attached logs).
Steps to reproduce:
We have a lot of different data and indices spread over 40 nodes. So far I could observe this error only on one node. When I try to restart the node with kill, it doesn't stop. Below is the stacktrace.
Provide logs (if relevant):
This covers just about a minute (logged by an application that uses the Python elasticsearch client). Queued tasks grow, while completed tasks don't.
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@3684362a on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14926, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@11961ec on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14939, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@573952db on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14951, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@573952db on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14951, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@4127dfd2 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14960, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@4c6147ea on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14961, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@4958c58e on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14960, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@7acb9dba on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14967, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@75af7186 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14967, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@7830996e on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14972, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@513cdd87 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14979, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@1166d220 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14983, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@498b99a2 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 14998, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@64e61a2f on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15005, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@372ab4a0 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15006, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@4381386b on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15006, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@2c661b6d on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15025, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@5fcb36db on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15025, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@21834e56 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15038, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@58444a25 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15038, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@25da450 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15047, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@6f3c0966 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15047, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@27485a8e on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15050, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
with: {u'reason': u'rejected execution of org.elasticsearch.transport.TransportService$7@170e6b30 on EsThreadPoolExecutor[name = fmfe16/bulk, queue capacity = 3000, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@12006f74[Running, pool size = 24, active threads = 24, queued tasks = 15056, completed tasks = 253961]]', u'type': u'es_rejected_execution_exception'}
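For anyone who wants to check their own client logs for the same pattern, here is a rough sketch that pulls the counters out of a rejection `reason` string and flags the "queued grows, completed is flat" symptom. `parse_rejection` and `looks_stalled` are hypothetical helper names, and the regular expression assumes the message format shown in the log lines above; it may need adjusting for other versions.

```python
import re

# Assumes the EsThreadPoolExecutor message layout seen in the log above.
_STATS = re.compile(
    r"queue capacity = (?P<capacity>\d+).*?"
    r"active threads = (?P<active>\d+), "
    r"queued tasks = (?P<queued>\d+), "
    r"completed tasks = (?P<completed>\d+)"
)


def parse_rejection(reason):
    """Return the thread pool counters from a rejection reason, or None."""
    match = _STATS.search(reason)
    if match is None:
        return None
    return {name: int(value) for name, value in match.groupdict().items()}


def looks_stalled(samples):
    """True if completed tasks never change while queued tasks grow."""
    if len(samples) < 2:
        return False
    completed_values = {s["completed"] for s in samples}
    return (
        len(completed_values) == 1
        and samples[-1]["queued"] > samples[0]["queued"]
    )
```

Fed the first and last entries of the log above, `looks_stalled` returns `True`: queued tasks go from 14926 to 15056 while completed tasks stay at 253961.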
And this is the stacktrace after I tried to kill the node and it didn't stop:
https://pastebin.com/rnzESu5B