Pending Tasks Queue flooded with Shard-Failed Tasks after losing a data node. #27194
Do you have any logs you can share from the node that dies due to OOM?
I can try to get one, but I think it's happening because my cluster is a little undersized relative to the ingest rate. I am more concerned about why the failure renders the cluster unusable; I expect to lose data nodes.
@russellday maybe so, but logs will definitely help lead to the cause.
I have hit the issue again. After about 10 hours of ingestion, /_cat/nodes reports that 1 of the data nodes is no longer present. The node itself is actually running but cannot communicate with the master. Log from the master node (grepped on instance ID i-0cb4dc7bc907acb08, the data node that is out of the cluster):
[2017-11-02T13:25:01,689][INFO ][o.e.c.m.MetaDataMappingService] [ip_r3.2xlarge_i-099746d14bfafaceb] [a-apps-000049/hgTFPuiJR96laS2JaIR1vg] update_mapping [Event]
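For reference, the node check mentioned above is the cat nodes API; a minimal sketch, where the column selection via h= is only illustrative:

GET /_cat/nodes?v&h=name,node.role,master,heap.percent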
Would you please share the output of your pending tasks?
Shard-Failed (pending tasks had already cleared, but at the time that is what was in the queue)
@russellday I want to see the actual tasks; I am expecting that it must be filled with duplicates for the same shard, but I want to see this with my own eyes in your pending tasks list.
I actually had the output in a terminal window (gist below). It's our suspicion as well that they are duplicates. We simply do not have enough shards to account for that many tasks. https://gist.github.com/russellday/093e0f7084f24eb4a29d9da229695d7c
@jasontedor does _cat/pending_tasks provide enough detail to confirm they are duplicates? It only gives you the insertOrder, timeInQueue, priority, and source columns.
No, it doesn't give enough; it doesn't give the shard that failed. Now I have to look up when we added that.
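For context, an illustrative request and response for that cat API; the values below are made up, and, per the comment above, the four columns are all it returns on this version, so the failing shard is not visible here:

GET /_cat/pending_tasks?v

insertOrder timeInQueue priority source
       7642          1h HIGH     shard-failed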
Sadly, it was not until 5.6 that we added more information here for shard-started, and we have not yet added more information for shard-failed. 😢 How about your logs, do they have lots of messages of the form
@jasontedor If we were to go with the assumption that they are duplicates (which is likely correct), what would that indicate, and how could I clear that queue of tasks? In our cluster, we are using the rollover API and taking snapshots of indices once they are rolled over. At that point they are no longer ingesting and are backed up, so we set the replica count to zero. We are not overly concerned with the data, since we can restore or potentially re-ingest for a given time period. However, the huge pending task list results in an outage.
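As an aside, the rollover-plus-zero-replicas workflow described here would look roughly like the following sketch; the a-apps alias name and the conditions are placeholders (a-apps-000049 is the rolled-over index name visible in the log above):

POST /a-apps/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 100000000
  }
}

PUT /a-apps-000049/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}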
@jasontedor
I wonder if the high number of tasks is an effect, not the cause of the troubles here. From the logs that you showed above, we see:
That means cluster state publishing was halted for 30 seconds. Many tasks can accumulate during that time. I wonder how fast cluster state publishing in general is in your cluster, i.e., whether the master has trouble publishing the cluster state and so has trouble keeping up with the number of incoming tasks.
This should tell us, from the master logs, how fast cluster state publishing is.
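The exact setting being referred to is not quoted in the thread; as an assumption, one way to surface cluster state publishing times in the master logs is to raise the cluster service logger dynamically:

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.service": "DEBUG"
  }
}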
@ywelsch - Yes, I have made the setting change and I will re-enable ingestion. I think what you are suggesting makes a lot of sense. This cluster receives a very high rate of ingestion; it might be struggling to keep up. We do see a good bit of put-mapping tasks in the pending_tasks, since we are ingesting some dynamic data.
@ywelsch - Turned on ingest again and lost another data node. I have the entire log file, but I tried to cherry-pick examples of the items that seemed relevant. https://gist.github.com/russellday/ed2dda7be149845671098da29f7c117e
Can you provide the full log file? Excerpts are not really going to help here. If you don't want to share it publicly, you can send it to my_first_name@elastic.co
@ywelsch - you have mail
I've had a look at the logs, which confirms my theory above that the tasks are an effect, not the cause of the troubles. What's interesting is that once a node fails, it takes a long time for that node to be booted from the cluster. Every cluster state update after that node dropped then takes 30 seconds (i.e., it times out), because the master waits for all nodes to apply the cluster state update (which the faulty node never does, yet this node does not seem to be dropped or booted from the cluster either). There are cases where the cluster state update takes even longer than 30 seconds, see
where it took more than 3 minutes, the reason being trouble connecting to nodes, see for example:
This means that the master was blocked for nearly 4 minutes from doing anything meaningful, causing the large list of tasks. I'm not sure why fault detection has trouble kicking the node out. Are you perhaps taking heap dumps on OOM of the data/client nodes? It looks a bit like the connection to the failed node is in some kind of limbo state. Is there anything special about your networking setup? Also, can you upgrade to the latest ES version (we have made some changes around connection handling) and check whether you still hit the same problem there?
@ywelsch - Thanks for the confirmation. We are not doing anything out of the norm on any of the nodes. We will upgrade today and see if we still encounter the same behavior.
@ywelsch - Same results on 5.6.2.
@russellday That’s not the latest version. The latest is 5.6.4, and the fixes that are referenced appear in 5.6.3 and 5.6.4; I wouldn’t expect any benefit from 5.6.2 for this issue.
@jasontedor - Thanks! I will try to get 5.6.4 pushed tomorrow and let you know.
@jasontedor - same results on 5.6.4
Can you reproduce the same behavior if you simply kill a node?
@ywelsch - killing a node does not exhibit the same behavior. The recovery in that scenario is normal: a reasonable number of shard-failed tasks and quick recovery. Not taking heap dumps on the OOMs currently, and no changes to the fault detection. I have not investigated the OOM heaps, since this test is using instances that I expect will fail on a regular basis. The hope was that we can handle replacing failed nodes quickly and maintain cluster availability.
@ywelsch - Sent you a log and an update via email.
I've gotten the logs via e-mail, which were unfortunately not from the active master node, so I asked for those as well. No response to that, so I'm closing this. |
@russellday @ywelsch Any result on this issue? We have encountered a very similar problem:
In the instance @russellday reported, logs on the master are spewing:
Excerpt from pending tasks:
7642 1h HIGH shard-failed
It appears what happened is that, upon starting up, a bunch of the hosts were over the low water mark but had replicas on them; it failed to start the shards in that case, and those failures kept being repeated and piling up in pending tasks. The low water mark was increased, allowing some more of the replicas to become active, but the shard-failed pending task growth just continued, seemingly with the additional shard-failed entries still being for the same shards that failed initially.
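For reference, raising the low disk watermark as described can be done with a dynamic cluster setting; the 90% value here is only a placeholder:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%"
  }
}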
@azsolinsky do you mind opening a topic on the forums and posting the specific exception the shards fail with there? When we figure it out, we can open an issue.
I am running a large cluster on public AWS EC2 that ingests a significant volume (several TB per day). The cluster will periodically lose a data node, typically due to heap OOM. When this happens, the pending tasks queue fills with several hundred thousand "Shard-Failed" tasks. This renders the cluster useless until I recycle the cluster several times. I feel it's likely something with either my implementation or configuration, because ES should be able to lose data nodes in a more graceful manner. Can I please get some guidance on how I might mitigate this issue?
I am using the new ROLLOVER API, if that is of consequence.
Elasticsearch version (bin/elasticsearch --version): 5.1.1
Plugins installed: [amazon-cloudwatch, discovery-ec2, repository-s3]
JVM version (java -version):
"versions" : [
{
"version" : "1.8.0_141",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.141-b16",
"vm_vendor" : "Oracle Corporation",
"count" : 38
},
{
"version" : "1.8.0_151",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.151-b12",
"vm_vendor" : "Oracle Corporation",
"count" : 11
}
]
OS version (uname -a if on a Unix-like system): Amazon Linux
Description of the problem including expected versus actual behavior:
(Above)
Steps to reproduce:
Intermittent; the cluster will run fine for a few hours, then I will lose a node due to the ingestion rate.
Provide logs (if relevant):
{
"cluster_name" : "removed",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 49,
"number_of_data_nodes" : 35,
"active_primary_shards" : 2112,
"active_shards" : 2492,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 72,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 423050,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 8814583,
"active_shards_percent_as_number" : 95.75388767550702
}
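The JSON above looks like cluster health output; presumably it was retrieved with something like:

GET /_cluster/health?pretty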