Pending Tasks Queue flooded with Shard-Failed Tasks after losing a data node. #27194
Do you have any logs you can share from the node that dies due to OOM?
I can try to get one, but I think it's happening because my cluster is a little undersized relative to the ingest rate. I am more concerned about why the failure renders the cluster unusable; I expect to lose data nodes.
@russellday maybe so, but logs will definitely help lead to the cause.
I have hit the issue again. After about 10 hours of ingestion, /_cat/nodes reports that 1 of the data nodes is no longer present. The node itself is actually running but cannot communicate with the master. Log from the master node (grepped on instance ID i-0cb4dc7bc907acb08, the data node that is out of the cluster):
[2017-11-02T13:25:01,689][INFO ][o.e.c.m.MetaDataMappingService] [ip_r3.2xlarge_i-099746d14bfafaceb] [a-apps-000049/hgTFPuiJR96laS2JaIR1vg] update_mapping [Event]
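For reference, the node check mentioned above is the cat nodes API; a minimal sketch, where the column selection via h= is only illustrative:

GET /_cat/nodes?v&h=name,node.role,master,heap.percent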
Would you please share the output of your pending tasks?
Shard-Failed (pending tasks had already cleared, but at the time that is what was in the queue)
@russellday I want to see the actual tasks; I am expecting that it must be filled with duplicates for the same shard, but I want to see this with my own eyes in your pending tasks list.
I actually had the output in a terminal window (gist below). It's our suspicion as well that they are duplicates. We simply do not have enough shards to account for that many tasks. https://gist.github.com/russellday/093e0f7084f24eb4a29d9da229695d7c
@jasontedor does _cat/pending_tasks provide enough detail to confirm they are duplicates? It only gives you the insertOrder, timeInQueue, priority, and source columns.
No, it doesn't give enough; it doesn't give the shard that failed. Now I have to look up when we added that.
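For context, an illustrative request and response for that cat API; the values below are made up, and, per the comment above, the four columns are all it returns on this version, so the failing shard is not visible here:

GET /_cat/pending_tasks?v

insertOrder timeInQueue priority source
       7642          1h HIGH     shard-failed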
Sadly, it was not until 5.6 that we added more information here for shard-started, and we have not yet added more information for shard-failed. 😢 How about your logs, do they have lots of messages of the form
@jasontedor If we were to go with the assumption that they are duplicates (which is likely correct), what would that indicate, and how could I clear that queue of tasks? In our cluster, we are using the rollover API and taking snapshots of indices once they are rolled over. At that point they are no longer ingesting and are backed up, so we set the replica count to zero. We are not overly concerned with the data, since we can restore or potentially re-ingest for a given time period. However, the huge pending task list results in an outage.
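As an aside, the rollover-plus-zero-replicas workflow described here would look roughly like the following sketch; the a-apps alias name and the conditions are placeholders (a-apps-000049 is the rolled-over index name visible in the log above):

POST /a-apps/_rollover
{
  "conditions": {
    "max_age": "1d",
    "max_docs": 100000000
  }
}

PUT /a-apps-000049/_settings
{
  "index": {
    "number_of_replicas": 0
  }
}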
@jasontedor
I wonder if the high number of tasks is an effect, not the cause of the troubles here. From the logs that you showed above, we see:
That means cluster state publishing was halted for 30 seconds. Many tasks can accumulate during that time. I wonder how fast cluster state publishing in general is in your cluster, i.e., whether the master has trouble publishing the cluster state and so has trouble keeping up with the number of incoming tasks.
This should tell us, from the master logs, how fast cluster state publishing is.
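The exact setting being referred to is not quoted in the thread; as an assumption, one way to surface cluster state publishing times in the master logs is to raise the cluster service logger dynamically:

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.cluster.service": "DEBUG"
  }
}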
@ywelsch - Yes, I have made the setting change and I will re-enable ingestion. I think what you are suggesting makes a lot of sense. This cluster receives a very high rate of ingestion; it might be struggling to keep up. We do see a good bit of put-mapping tasks in the pending_tasks, since we are ingesting some dynamic data.
@ywelsch - Turned on ingest again and lost another data node. I have the entire log file, but I tried to cherry-pick examples of the items that seemed relevant. https://gist.github.com/russellday/ed2dda7be149845671098da29f7c117e
Can you provide the full log file? Excerpts are not really going to help here. If you don't want to share it publicly, you can send it to my_first_name@elastic.co
@ywelsch - you have mail
I've had a look at the logs, which confirms my theory above that the tasks are an effect, not the cause of the troubles. What's interesting is that once a node fails, it takes a long time for that node to be booted from the cluster. Every cluster state update after that node dropped then takes 30 seconds (i.e., it times out), because the master waits for all nodes to apply the cluster state update (which the faulty node never does, yet this node does not seem to be dropped or booted from the cluster either). There are cases where the cluster state update takes even longer than 30 seconds, see
where it took more than 3 minutes, the reason being trouble connecting to nodes, see for example:
This means that the master was blocked for nearly 4 minutes from doing anything meaningful, causing the large list of tasks. I'm not sure why fault detection has trouble kicking the node out. Are you perhaps taking heap dumps on OOM of the data/client nodes? It looks a bit like the connection to the failed node is in some kind of limbo state. Is there anything special about your networking setup? Also, can you upgrade to the latest ES version (we have made some changes around connection handling) and check whether you still hit the same problem there?
@ywelsch - Thanks for the confirmation. We are not doing anything out of the norm on any of the nodes. We will upgrade today and see if we still encounter the same behavior.
@ywelsch - Same results on 5.6.2.
@russellday That’s not the latest version. The latest is 5.6.4, and the fixes that are referenced appear in 5.6.3 and 5.6.4; I wouldn’t expect any benefit from 5.6.2 for this issue.
@jasontedor - Thanks! I will try to get 5.6.4 pushed tomorrow and let you know.
@jasontedor - same results on 5.6.4
Can you reproduce the same behavior if you simply kill a node?
@ywelsch - killing a node does not exhibit the same behavior. The recovery in that scenario is normal: a reasonable number of shard-failed tasks and quick recovery. Not taking heap dumps on the OOMs currently, and no changes to the fault detection. I have not investigated the OOM heaps, since this test is using instances that I expect will fail on a regular basis. The hope was that we can handle replacing failed nodes quickly and maintain cluster availability.
@ywelsch - Sent you a log and an update via email.
I've gotten the logs via e-mail, which were unfortunately not from the active master node, so I asked for those as well. No response to that, so I'm closing this. |
@russellday @ywelsch Any result on this issue? We have encountered a very similar problem:
In the instance @russellday reported, logs on the master are spewing:
Excerpt from pending tasks:
7642 1h HIGH shard-failed
It appears what happened is that, upon starting up, a bunch of the hosts were over the low water mark but had replicas on them; it failed to start the shards in that case, and those failures kept being repeated and piling up in pending tasks. The low water mark was increased, allowing some more of the replicas to become active, but the shard-failed pending task growth just continued, seemingly with the additional shard-failed entries still being for the same shards that failed initially.
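For reference, raising the low disk watermark as described can be done with a dynamic cluster setting; the 90% value here is only a placeholder:

PUT /_cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "90%"
  }
}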
@azsolinsky do you mind opening a topic on the forums and posting the specific exception the shards fail with there? When we figure it out, we can open an issue.
I am running a large cluster on public AWS EC2 that ingests a significant volume (several TB per day). The cluster will periodically lose a data node, typically due to heap OOM. When this happens, the pending tasks queue fills with several hundred thousand "Shard-Failed" tasks. This renders the cluster useless until I recycle the cluster several times. I feel it's likely something with either my implementation or configuration, because ES should be able to lose data nodes in a more graceful manner. Can I please get some guidance on how I might mitigate this issue?
I am using the new ROLLOVER API, if that is of consequence.
Elasticsearch version (bin/elasticsearch --version): 5.1.1
Plugins installed: [amazon-cloudwatch, discovery-ec2, repository-s3]
JVM version (java -version):
"versions" : [
{
"version" : "1.8.0_141",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.141-b16",
"vm_vendor" : "Oracle Corporation",
"count" : 38
},
{
"version" : "1.8.0_151",
"vm_name" : "OpenJDK 64-Bit Server VM",
"vm_version" : "25.151-b12",
"vm_vendor" : "Oracle Corporation",
"count" : 11
}
]
OS version (uname -a if on a Unix-like system): Amazon Linux
Description of the problem including expected versus actual behavior:
(Above)
Steps to reproduce:
Intermittent; the cluster will run fine for a few hours, then I will lose a node due to the ingestion rate.
Provide logs (if relevant):
{
"cluster_name" : "removed",
"status" : "red",
"timed_out" : false,
"number_of_nodes" : 49,
"number_of_data_nodes" : 35,
"active_primary_shards" : 2112,
"active_shards" : 2492,
"relocating_shards" : 0,
"initializing_shards" : 0,
"unassigned_shards" : 72,
"delayed_unassigned_shards" : 0,
"number_of_pending_tasks" : 423050,
"number_of_in_flight_fetch" : 0,
"task_max_waiting_in_queue_millis" : 8814583,
"active_shards_percent_as_number" : 95.75388767550702
}
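The JSON above looks like cluster health output; presumably it was retrieved with something like:

GET /_cluster/health?pretty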