
Pending Tasks Queue flooded with Shard-Failed Tasks after losing a data node #27194

Closed
russellday opened this issue Oct 31, 2017 · 32 comments
Labels: :Distributed/Distributed, v5.5.1

@russellday

I am running a large cluster on public AWS EC2 that ingests a significant volume (several TB per day). The cluster will periodically lose a data node, typically due to a heap OOM. When this happens, the pending tasks queue fills with several hundred thousand "shard-failed" tasks. This renders the cluster useless until I recycle it, sometimes several times. I suspect it is something in my implementation or configuration, because ES should be able to lose data nodes more gracefully. Can I please get some guidance on how I might mitigate this issue?

I am using the new ROLLOVER API, if that is of consequence.

Elasticsearch version (bin/elasticsearch --version):
5.1.1

Plugins installed: [amazon-cloudwatch, discovery-ec2, repository-s3]

JVM version (java -version):
"versions" : [
{
"version" : "1.8.0_141",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.141-b16",
"vm_vendor" : "Oracle Corporation",
"count" : 38
},
{
"version" : "1.8.0_151",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.151-b12",
"vm_vendor" : "Oracle Corporation",
"count" : 11
}

OS version (uname -a if on a Unix-like system):
Amazon Linux

Description of the problem including expected versus actual behavior:
(Above)

Steps to reproduce:
Intermittent; the cluster will run fine for a few hours, then I will lose a node due to the ingestion rate.

Provide logs (if relevant):

{
"cluster_name" : "removed",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 49,
"number_of_data_nodes" : 35,
"active_primary_shards" : 2112,
"active_shards" : 2492,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 72,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 423050,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 8814583,
"active_shards_percent_as_number" : 95.75388767550702
}
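
For reference, a minimal way to watch this backlog from the command line (the host and port are placeholders, not taken from this report):

# Poll just the queue depth and the age of the oldest queued task.
curl -s 'http://localhost:9200/_cluster/health?filter_path=number_of_pending_tasks,task_max_waiting_in_queue_millis&pretty'
# Or keep it on screen while the cluster recovers:
watch -n 30 "curl -s 'http://localhost:9200/_cluster/health?filter_path=number_of_pending_tasks,task_max_waiting_in_queue_millis&pretty'"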

@talevy (Contributor) commented Oct 31, 2017

Do you have any logs you can share from the node that dies due to OOM?

@talevy talevy added the v5.5.1 label Oct 31, 2017
@russellday (Author)

I can try to get one, but I think it's happening because my cluster is a little undersized relative to the ingest rate. I am more concerned about why the failure renders the cluster unusable; I expect to lose data nodes.

@talevy talevy self-assigned this Nov 1, 2017
@talevy (Contributor) commented Nov 1, 2017

@russellday That may be so, but logs will definitely help lead to the cause.

@russellday (Author) commented Nov 2, 2017

I have hit the issue again. After about 10 hours of ingestion, /_cat/nodes reports that one of the data nodes is no longer present. The node itself is actually running but cannot communicate with the master.
{
"cluster_name" : "removed",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 45,
"number_of_data_nodes" : 34,
"active_primary_shards" : 2319,
"active_shards" : 3070,
"relocating_shards" : 0,
"initializing_shards" : 25,
"unassigned_shards" : 74,
"delayed_unassigned_shards" : 74,
"number_of_pending_tasks" : 142848,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 1849884,
"active_shards_percent_as_number" : 96.87598611549384
}

Log from the master node (grepped on instance ID i-0cb4dc7bc907acb08, the data node that is out of the cluster):

[2017-11-02T13:25:01,689][INFO ][o.e.c.m.MetaDataMappingService] [ip_r3.2xlarge_i-099746d14bfafaceb] [a-apps-000049/hgTFPuiJR96laS2JaIR1vg] update_mapping [Event]
[2017-11-02T13:25:25,645][DEBUG][o.e.a.a.c.n.s.TransportNodesStatsAction] [ip_r3.2xlarge_i-099746d14bfafaceb] failed to execute on node [4la7TcDWRVqhMwEDQ9GucQ]
org.elasticsearch.transport.ReceiveTimeoutTransportException: [ip_2.2xlarge_i-0cb4dc7bc907acb08][ip:8193][cluster:monitor/nodes/stats[n]] request_id [11224061] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:869) [elasticsearch-5.1.1.jar:5.1.1]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]

Caused by: org.elasticsearch.transport.ReceiveTimeoutTransportException: [54.147.210.141_t2.2xlarge_i-0cb4dc7bc907acb08][ip:8193][cluster:monitor/nodes/stats[n]] request_id [11224061] timed out after [15000ms]
at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:869) ~[elasticsearch-5.1.1.jar:5.1.1]
... 4 more
[2017-11-02T13:25:27,223][WARN ][o.e.t.TransportService ] [ip_r3.2xlarge_i-099746d14bfafaceb] Received response for a request that has timed out, sent [16578ms] ago, timed out [1578ms] ago, action [cluster:monitor/nodes/stats[n]], node [{ip_t2.2xlarge_i-0cb4dc7bc907acb08}{4la7TcDWRVqhMwEDQ9GucQ}{IAm78ZOYTv-JFLM5ijgK4g}{ip}{ip:8193}{availability_zone=us-east-1a, instance_type=t2.2xlarge, external_ip=54.147.210.141, tags=Data}], id [11224061]
[2017-11-02T13:25:28,920][INFO ][o.e.c.r.a.DiskThresholdMonitor] [ip_r3.2xlarge_i-099746d14bfafaceb] low disk watermark [80%] exceeded on [9j66xKfnRR6sXViYoDuL2Q][ip_c3.8xlarge_i-0b17df3c4edfe229c][/mnt/raid/data/nodes/0] free: 86.3gb[13.7%], replicas will not be assigned to this node
[2017-11-02T13:25:28,920][INFO ][o.e.c.r.a.DiskThresholdMonitor] [ip_r3.2xlarge_i-099746d14bfafaceb] low disk watermark [80%] exceeded on [GiG8oFwbRiK6f27rH84cyQ][ip_t2.2xlarge_i-0ee3daefb64d7c7c8][/mnt/raid/data/nodes/0] free: 140.9gb[17.9%], replicas will not be assigned to this node
[2017-11-02T13:25:28,920][WARN ][o.e.c.r.a.DiskThresholdMonitor] [ip_r3.2xlarge_i-099746d14bfafaceb] high disk watermark [90%] exceeded on [wCWDNAvmQ-W2SUk8LnE6mg][ip_t2.2xlarge_i-002c1b345ee73a497][/mnt/raid/data/nodes/0] free: 0b[0%], shards will be relocated away from this node

[2017-11-02T13:27:21,408][WARN ][o.e.d.z.PublishClusterStateAction] [ip_r3.2xlarge_i-099746d14bfafaceb] timed out waiting for all nodes to process published state [10133] (timeout [30s], pending nodes: [{ip_t2.2xlarge_i-0cb4dc7bc907acb08}{4la7TcDWRVqhMwEDQ9GucQ}{IAm78ZOYTv-JFLM5ijgK4g}{ip}{ip:8193}{availability_zone=us-east-1a, instance_type=t2.2xlarge, external_ip=ip, tags=Data}])

@jasontedor (Member)

Would you please share the output of /_cat/pending_tasks (in a gist, for example)?
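
One way to capture that for a gist (any HTTP-enabled node works; host and port are placeholders):

# Dump the full pending-tasks listing, with headers, to a file that can be attached to a gist.
curl -s 'http://localhost:9200/_cat/pending_tasks?v' > pending_tasks.txt
wc -l pending_tasks.txt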

@russellday (Author) commented Nov 2, 2017

Shard-failed (the pending tasks had already cleared, but at the time that is what was in the queue).

@jasontedor (Member)

@russellday I want to see the actual tasks. I expect that the queue is filled with duplicates for the same shard, but I want to see this with my own eyes in your pending tasks list.

@russellday (Author)

I actually had the output in a terminal window (gist below). It's our suspicion as well that they are duplicates. We simply do not have enough shards to account for that many tasks.

https://gist.github.com/russellday/093e0f7084f24eb4a29d9da229695d7c

@russellday (Author)

@jasontedor does _cat/pending_tasks provide enough detail to confirm they are duplicates? It only gives you the insertOrder, timeInQueue, priority, and source columns.

@jasontedor (Member)

No, it doesn't give enough detail; it doesn't show the shard that failed. Now I have to look up when we added that.

@jasontedor (Member)

Sadly, it was not until 5.6 that we added more information here for shard started, and we have not yet added more information for shard failed. 😢

How about your logs: do they have lots of messages of the form "marking and sending shard failed due to" for the same index/shard copy?
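
A rough way to check for that on a data node, assuming the default package log location (the path is an assumption):

# Count the messages, then group them by the [index][shard] pair that appears in each line.
# A line can contain the pair more than once, so treat the counts as relative, not exact.
grep 'marking and sending shard failed due to' /var/log/elasticsearch/*.log | wc -l
grep 'marking and sending shard failed due to' /var/log/elasticsearch/*.log \
  | grep -o '\[[^][]*\]\[[0-9]\+\]' | sort | uniq -c | sort -rn | head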

@russellday (Author)

@jasontedor If we were to go with the assumption that they are duplicates (which is likely correct), what would that indicate, and how could I clear that queue of tasks? In our cluster, we are using the rollover API and taking snapshots of indices once they are rolled over. At that point they are no longer ingesting and are backed up, so we set the replica count to zero. We are not overly concerned with the data, since we can restore or potentially re-ingest for a given time period. However, the huge pending task list results in an outage.
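
For context, the replica-count change described here is a single index settings call (host and port are placeholders; the index name is taken from the logs above purely as an example):

# Drop replicas on an index that has rolled over and been snapshotted.
curl -s -XPUT 'http://localhost:9200/a-apps-000049/_settings' \
  -H 'Content-Type: application/json' \
  -d '{"index": {"number_of_replicas": 0}}'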

@russellday (Author) commented Nov 2, 2017

@jasontedor
[2017-11-02T13:29:24,379][WARN ][o.e.c.a.s.ShardStateAction] [ip_r3.2xlarge_i-099746d14bfafaceb] [largeapp-removed-000014][0] received shard failed for shard id [[largeapp-removed-000014][0]], allocation id [LcBdeCsfTPWZETj7Kk7ypg], primary term [1], message [failed to perform indices:data/write/bulk[s] on replica [largeapp-removed-000014][0], node[4la7TcDWRVqhMwEDQ9GucQ], [R], s[STARTED], a[id=LcBdeCsfTPWZETj7Kk7ypg]], failure [NodeDisconnectedException[[ip_t2.2xlarge_i-0cb4dc7bc907acb08][ip:8193][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [ip_t2.2xlarge_i-0cb4dc7bc907acb08][ip:8193][indices:data/write/bulk[s][r]] disconnected

@ywelsch (Contributor) commented Nov 2, 2017

I wonder if the high number of tasks is an effect, not the cause, of the troubles here.

From the logs that you showed above, we see:

[2017-11-02T13:27:21,408][WARN ][o.e.d.z.PublishClusterStateAction] [54.157.56.12_r3.2xlarge_i-099746d14bfafaceb] timed out waiting for all nodes to process published state [10133] (timeout [30s], pending nodes: [{54.147.210.141_t2.2xlarge_i-0cb4dc7bc907acb08}{4la7TcDWRVqhMwEDQ9GucQ}{IAm78ZOYTv-JFLM5ijgK4g}{172.31.15.252}{172.31.15.252:8193}{availability_zone=us-east-1a, instance_type=t2.2xlarge, external_ip=54.147.210.141, tags=Data}])

means that cluster state publishing was halted for 30 seconds. Many tasks can accumulate during that time. I wonder how fast cluster state publishing is in your cluster in general, i.e., whether the master has trouble publishing the cluster state and therefore has issues keeping up with the number of incoming tasks.
Can you enable debug logging for "cluster.service" on the master node?

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.service": "debug"
  }
}

This should tell us in the master logs how fast cluster state publishing is.
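
The same change expressed as curl, plus how to undo it once the logs have been captured (host and port are placeholders):

# Enable debug logging for cluster.service on the master node...
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger.org.elasticsearch.cluster.service": "debug"}}'

# ...and reset the logger to its default level afterwards.
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{"transient": {"logger.org.elasticsearch.cluster.service": null}}'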

@russellday (Author)

@ywelsch - Yes, I have made the setting change and I will re-enable ingestion. I think what you are suggesting makes a lot of sense: this cluster receives a very high rate of ingestion, and it might be struggling to keep up. We do see a good bit of put-mapping tasks in the pending_tasks, since we are ingesting some dynamic data.
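
If the dynamic fields turn out to be the driver, one possible mitigation is to lock down dynamic mapping in the index template; this is only a sketch, with a placeholder template name and index pattern, and the Event type taken from the logs above:

# Example only: stop dynamic field creation for new rollover indices so that bursts of
# new fields no longer generate put-mapping cluster state updates.
# "dynamic": false silently ignores unmapped fields; "strict" rejects such documents instead.
curl -s -XPUT 'http://localhost:9200/_template/no_dynamic_fields' \
  -H 'Content-Type: application/json' \
  -d '{
    "template": "a-apps-*",
    "mappings": {
      "Event": { "dynamic": false }
    }
  }'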

@russellday (Author)

@ywelsch - Turned on ingest again and lost another data node. I have the entire log file, but I tried to cherry-pick examples of the items that seemed relevant.

https://gist.github.com/russellday/ed2dda7be149845671098da29f7c117e

@ywelsch (Contributor) commented Nov 3, 2017

Can you provide the full log file? Excerpts are not really going to help here. If you don't want to share it publicly, you can send it to my_first_name@elastic.co.
Also, can you share some information about the size of your cluster state? Using dynamic mappings in a cluster with high ingestion rates is a bad idea.
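
A quick way to get a feel for the cluster state size (host and port are placeholders):

# Rough size of the serialized cluster state in bytes, and the mappings-heavy metadata portion alone.
curl -s 'http://localhost:9200/_cluster/state' | wc -c
curl -s 'http://localhost:9200/_cluster/state/metadata' | wc -c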

@russellday (Author)

@ywelsch - you have mail

@ywelsch (Contributor) commented Nov 6, 2017

I've had a look at the logs, which confirm my theory above that the tasks are an effect, not the cause, of the troubles. What's interesting is that once a node fails, it takes a long time for that node to be booted from the cluster. Every cluster state update after that node dropped then takes 30 seconds (i.e., the timeout), because the master waits for all nodes to apply the cluster state update, which the faulty node never does; yet that node does not seem to be dropped or booted from the cluster either.

There are cases where the cluster state update takes even longer than 30 seconds, see

[2017-11-02T14:11:20,553][WARN ][o.e.c.s.ClusterService   ] [34.224.89.99_r3.2xlarge_i-0ab5787c61924c821] cluster state update task [zen-disco-receive(from master [master {54.157.56.12_r3.2xlarge_i-099746d14bfafaceb}{d4_cBqcYT2Gy4hoqTp2cSQ}{tOeAuZhoQwSiyX5vf74cdw}{172.31.20.173}{172.31.20.173:8193}{availability_zone=us-east-1b, instance_type=r3.2xlarge, external_ip=54.157.56.12, tags=Master} committed version [10165]])] took [3.7m] above the warn threshold of 30s

where it took more than 3 minutes, the reason being issues to connect to nodes, see for example:

[2017-11-02T14:11:17,535][WARN ][o.e.c.NodeConnectionsService] [34.224.89.99_r3.2xlarge_i-0ab5787c61924c821] failed to connect to node {54.82.189.132_m4.4xlarge_i-09cf7a79c4bb3b28a}{mj-8PwXnSs6HTpSYbsGsqg}{fw807l-hSkKz0vt6ckwWwg}{172.31.22.244}{172.31.22.244:8193}{availability_zone=us-east-1b, instance_type=m4.4xlarge, external_ip=54.82.189.132, tags=Client} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [54.82.189.132_m4.4xlarge_i-09cf7a79c4bb3b28a][172.31.22.244:8193] connect_timeout[30s]
	at org.elasticsearch.transport.netty4.Netty4Transport.connectToChannels(Netty4Transport.java:379) ~[?:?]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:413) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:387) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:291) ~[elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.cluster.NodeConnectionsService.validateNodeConnected(NodeConnectionsService.java:113) [elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.cluster.NodeConnectionsService.connectToNodes(NodeConnectionsService.java:86) [elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.cluster.service.ClusterService.runTasksForExecutor(ClusterService.java:700) [elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.cluster.service.ClusterService$UpdateTask.run(ClusterService.java:920) [elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:458) [elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:238) [elasticsearch-5.1.1.jar:5.1.1]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:201) [elasticsearch-5.1.1.jar:5.1.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_141]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_141]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_141]
Caused by: io.netty.channel.ConnectTimeoutException: connection timed out: 172.31.22.244/172.31.22.244:8193
	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:267) ~[?:?]
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38) ~[?:?]
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:120) ~[?:?]
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:418) ~[?:?]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:454) ~[?:?]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:873) ~[?:?]
	... 1 more

This means that the master was blocked for nearly 4 minutes, unable to do anything meaningful, which caused the large list of tasks.

I'm not sure why the fault detection has trouble kicking the node out. Are you perhaps taking heap dumps on OOM on the data/client nodes? It looks a bit like the connection to the failed node is in some kind of limbo state. Is there anything special about your networking setup?

Also, can you upgrade to the latest ES version (we have made some changes around connection handling) and check if you still hit the same problem there?
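
For reference, heap dumps on OOM are controlled by standard JVM flags, and writing a multi-gigabyte dump can keep a dying node half-alive for minutes. A quick way to check whether they are enabled (the install paths are assumptions):

# Is the node running with heap-dump-on-OOM enabled, and is it configured in jvm.options?
ps aux | grep 'HeapDumpOnOutOfMemoryError'
grep 'HeapDump' /etc/elasticsearch/jvm.options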

@russellday (Author)

@ywelsch - Thanks for the confirmation. We are not doing anything out of the norm on any of the nodes. We will upgrade today and see if we still encounter the same behavior.

@russellday (Author)

@ywelsch - Same results on 5.6.2.

@jasontedor (Member)

@russellday That’s not the latest version. The latest is 5.6.4, and the fixes that are referenced appear in 5.6.3 and 5.6.4; I wouldn’t expect any benefit from 5.6.2 for this issue.

@russellday (Author)

@jasontedor - Thanks! I will try to get 5.6.4 pushed tomorrow and let you know.

@russellday (Author)

@jasontedor - same results on 5.6.4

@ywelsch (Contributor) commented Nov 14, 2017

Can you reproduce the same behavior if you simply kill a node (with kill -9)? Are you taking heap dumps on OOMs? How long do those take? Have you made changes to the fault detection settings?
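
A quick way to check for overrides of those settings on a node (the config path is an assumption; the defaults in the comment are the documented zen discovery defaults):

# Check whether any zen fault-detection settings were overridden on this node.
# Documented defaults: discovery.zen.fd.ping_interval 1s, ping_timeout 30s, ping_retries 3.
grep 'discovery.zen.fd' /etc/elasticsearch/elasticsearch.yml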

@russellday (Author)

@ywelsch - Killing a node does not exhibit the same behavior; the recovery in that scenario is normal, with a reasonable number of shard-failed tasks and a quick recovery. We are not taking heap dumps on the OOMs currently, and there are no changes to the fault detection settings. I have not investigated the OOM heaps, since this test uses instances that I expect will fail on a regular basis. The hope was that we could handle replacing failed nodes quickly and maintain cluster availability.

@russellday (Author)

@ywelsch - Sent you a log and an update via email.

@ywelsch (Contributor) commented Dec 18, 2017

I've gotten the logs via e-mail, but unfortunately they were not from the active master node, so I asked for those as well. There has been no response, so I'm closing this.

@ywelsch ywelsch closed this as completed Dec 18, 2017
@jgq2008303393 commented Feb 7, 2018

@russellday @ywelsch Any result on this issue? We have encountered a very similar problem:
https://discuss.elastic.co/t/half-dead-node-lead-to-cluster-hang/113658

@azsolinsky commented Feb 12, 2018

In the instance @russellday reported, the master's logs are spewing:

[2018-02-12T22:00:17,463][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [v5-2018-02-09-08-shared][3] received shard failed for shard id [[v5-2018-02-09-08-shared][3]], allocation id [7chbXHqMTOSHxmkUikYsDA], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,464][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [shrink-v5-2018-01-19-12-shared][1] received shard failed for shard id [[shrink-v5-2018-01-19-12-shared][1]], allocation id [is_1YwAPSLSQs159Eo-l_A], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,466][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [v5-2018-02-06-10-shared][5] received shard failed for shard id [[v5-2018-02-06-10-shared][5]], allocation id [v2PGlZ94RiaUlJMmZONKig], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,470][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [v5-2018-02-09-03-shared][9] received shard failed for shard id [[v5-2018-02-09-03-shared][9]], allocation id [JGkx1bXfTTqNBWMybB7B_A], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,472][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [shrink-v5-2017-12-03-09-shared][0] received shard failed for shard id [[shrink-v5-2017-12-03-09-shared][0]], allocation id [PB8nrR4qQFOY7sxt6AH5GQ], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,474][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [shrink-v5-2017-12-17-07-shared][0] received shard failed for shard id [[shrink-v5-2017-12-17-07-shared][0]], allocation id [mhcHnOkiT_-xA7HaNhng4w], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,476][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [shrink-v5-2018-01-05-05-shared][0] received shard failed for shard id [[shrink-v5-2018-01-05-05-shared][0]], allocation id [ju_4pa67T7yBbFv8JE1bEg], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,477][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [shrink-v5-2017-12-08-17-shared][0] received shard failed for shard id [[shrink-v5-2017-12-08-17-shared][0]], allocation id [O6OIe2TpQpSqIhceuGiiVA], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,479][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [v5-2018-02-01-03-shared][5] received shard failed for shard id [[v5-2018-02-01-03-shared][5]], allocation id [fs70Ik4mTk6Ch9asOrP-Yw], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,482][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [shrink-v5-2018-01-05-13-shared][1] received shard failed for shard id [[shrink-v5-2018-01-05-13-shared][1]], allocation id [PEK-HPElSoeT6ZiTNsXVkw], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]
[2018-02-12T22:00:17,484][WARN ][o.e.c.a.s.ShardStateAction] [<redacted>_m4.16xlarge_i-03f84c85e0a871309] [v5-2018-02-09-06-shared][6] received shard failed for shard id [[v5-2018-02-09-06-shared][6]], allocation id [Ve5FGUFTTeCMONTDCZWXrg], primary term [0], message [master {<redacted>_m4.16xlarge_i-03f84c85e0a871309}{eef-RM2vTzWWDLKy4WI0VQ}{Q3WrsXuZQjS6gcY98mKPZQ}{172.31.61.188}{172.31.61.188:8193}{availability_zone=us-east-1b, instance_type=m4.16xlarge, external_ip=<redacted>, tags=Master} has not removed previously failed shard. resending shard failure]

excerpt from pending tasks:

"number_of_pending_tasks": 126510,

7642 1h HIGH shard-failed
7643 1h HIGH shard-failed
7420 1h HIGH shard-failed
7645 1h HIGH shard-failed
7646 1h HIGH shard-failed
7647 1h HIGH shard-failed
7648 1h HIGH shard-failed
7649 1h HIGH shard-failed
7913 1h HIGH shard-failed
7651 1h HIGH shard-failed
7916 1h HIGH shard-failed
7653 1h HIGH shard-failed
7654 1h HIGH shard-failed
7655 1h HIGH shard-failed
7924 1h HIGH shard-failed
7657 1h HIGH shard-failed
7658 1h HIGH shard-failed
7659 1h HIGH shard-failed
7932 1h HIGH shard-failed
7661 1h HIGH shard-failed
7662 1h HIGH shard-failed
7663 1h HIGH shard-failed
7664 1h HIGH shard-failed
7665 1h HIGH shard-failed

@azsolinsky

It appears that what happened is the following: upon starting up, a number of hosts were over the low watermark but had replicas assigned to them; the shards failed to start in that case, and those failures kept being repeated and piling up in pending tasks. The low watermark was raised, allowing some more of the replicas to become active, but the shard-failed pending task growth just continued, seemingly with the additional shard-failed entries still being for the same shards that failed initially.
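
For anyone following along, the watermark change described above is a dynamic cluster settings update along these lines (host, port, and the percentages are placeholders, not the values used here):

# Raise the disk watermarks so replicas on nearly-full nodes can be assigned again.
curl -s -XPUT 'http://localhost:9200/_cluster/settings' \
  -H 'Content-Type: application/json' \
  -d '{
    "transient": {
      "cluster.routing.allocation.disk.watermark.low": "90%",
      "cluster.routing.allocation.disk.watermark.high": "95%"
    }
  }'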

@bleskes (Contributor) commented Feb 13, 2018

@azsolinsky do you mind opening a topic on the forums and posting the specific exception the shards fail with there? When we figure it out, we can open an issue.

@lcawl lcawl added :Search/Search Search-related issues that do not fall into other categories and removed :Allocation labels Feb 13, 2018
@clintongormley clintongormley added :Distributed/Distributed A catch all label for anything in the Distributed Area. If you aren't sure, use this one. and removed :Search/Search Search-related issues that do not fall into other categories labels Feb 13, 2018