Add timeout for Search Network Action to Improve Cluster Resistance #60037
Comments
The bad consequences of a disk failure are likely much more widespread than searches alone. However, I think the problem described here will be improved in 7.9 thanks to #52680, which will remove a node from the cluster fairly promptly if its disk fails. Could you confirm whether this improves your experience?
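For context, the idea behind #52680 is a periodic filesystem health probe. The sketch below is a rough illustration in Python, not Elasticsearch's actual implementation; the probe file name and the probing interval are made up for the example.

```python
# Illustrative sketch of a filesystem health probe (the idea behind
# #52680): periodically write and fsync a small file on the data path.
# If the probe fails (or, in the real service, hangs past a deadline),
# the node is considered unhealthy and can be removed from the cluster.
import os
import tempfile

def fs_probe(path, data=b"health-probe"):
    """Return True if a small write+fsync on `path` succeeds."""
    probe_file = os.path.join(path, ".es_temp_probe")  # illustrative name
    try:
        with open(probe_file, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # force the write through to disk
        os.remove(probe_file)
        return True
    except OSError:
        # A failed or read-only disk typically surfaces here as EIO/EROFS.
        return False

healthy = fs_probe(tempfile.gettempdir())
```

A probe like this catches hard failures; as discussed below in this thread, a merely slow disk is a different and harder case, since the write may still eventually succeed.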
Pinging @elastic/es-search (:Search/Search)
@DaveCTurner Thanks for your reply. I think handling read-only filesystems is not enough; we also need to handle the overload situation. Even when a disk is heavily loaded and its %util reaches 95%, it can still serve reads and writes, just very slowly. Will a heavily loaded disk be treated as a disk failure?
This is interesting. I opened #59824; not sure if that helps with your case @boicehuang
In my experience, concurrent aggregation requests stuck on the coordinating nodes can easily drive many nodes in the cluster into full GC and make them unresponsive when a disk is heavily loaded or failing. That is why I mentioned search requests above.
@DaveCTurner Thanks for your reply. Let me explain more about my experience. The process is as follows.
My question:
In theory we could but it doesn't really help much. We'd carry on sending traffic to the bad node, all of which will have to wait for the same timeout, so latency would be terrible and a backlog would still form. Much better to stop sending requests to the node ASAP, i.e. to remove it from the cluster.
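The backlog argument can be made concrete with a back-of-the-envelope calculation based on Little's law. The numbers below are hypothetical, purely for illustration:

```python
# Little's law: average in-flight requests = arrival rate x time held.
# Even with a per-request timeout, requests keep being routed to the
# bad node, and each one occupies memory on the coordinating node
# until its timeout fires. Hypothetical numbers:

arrival_rate = 200        # shard-search requests/second sent to the bad node
per_request_timeout = 30  # seconds before each request gives up

# Steady-state number of requests stuck in flight:
backlog = arrival_rate * per_request_timeout
print(backlog)  # → 6000 requests held in memory at once
```

This is why the comment above argues for removing the node instead: once it leaves the cluster, the arrival rate to it drops to zero and no backlog forms at all.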
This is the bit I am not understanding. Faulty disks fail IO requests quickly IME (assuming they're properly configured and don't retry for ages first) and you can detect that they're failing and remove the node from service yourself much earlier and more reliably than Elasticsearch can (e.g. by looking at SMART metrics).

Given that this is not an Elasticsearch-specific problem I've looked at how other similar systems handle this (e.g. MongoDB, Cassandra, CockroachDB). As far as I can tell, none of them have any special handling or health checks of this nature. I've spoken with our syseng folks internally and we don't see our own clusters failing like this either. I haven't been able to find any other research or documentation that suggests this is something that needs addressing at the application level either. Please help us understand what we've missed here.
@DaveCTurner. Thanks for your reply.
As far as I know, in HBase we limit how long a single RPC call can run by setting
Neither of those looks relevant to the question of detecting and handling disk failures with a timeout. Repeating what I said above: timing out individual (sub-)requests does not prevent future (sub-)requests from going to the same faulty node; it is much better to remove broken nodes from the cluster. In terms of the more general question of timing out a search, you can already do this today in the client (#43332), and we're already discussing returning partial results on a timeout (#30897) too. You may also be looking for adaptive replica selection to steer searches away from nodes that are simply busy. I don't think these are useful in the case of a genuinely broken node, the subject of this thread, but maybe they are helpful to you.
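The client-side timeout mentioned here can be sketched with nothing but the standard library; `slow_search` below is a hypothetical stand-in for a call to a stuck node, not an Elasticsearch API:

```python
# Minimal sketch of a client-side search deadline: the application
# bounds how long it waits, rather than relying on the server timing
# out sub-requests internally. `slow_search` is a placeholder that
# simulates a request stuck on a node with a failed disk.
import concurrent.futures
import time

def slow_search():
    time.sleep(1)          # simulated stuck request
    return {"hits": []}

timed_out = False
with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_search)
    try:
        result = future.result(timeout=0.2)  # client-side deadline
    except concurrent.futures.TimeoutError:
        # Give up waiting; note the request may still be running
        # server-side, which is exactly the limitation discussed above.
        timed_out = True
```

Real clients expose the same idea directly (for example, elasticsearch-py accepts a per-request timeout option), but as the comment notes, giving up on one request does nothing to stop the next one from being routed to the same broken node.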
No further feedback received. @boicehuang if you have any more information for us please add it in a comment and we can look at re-opening this issue. |
Issue
Currently, the coordinating node sends the query and fetch network actions to remote data nodes without any timeout option.
This has a very bad impact: when one of the data nodes suffers a disk failure, it cannot handle I/O operations such as reading or writing data from disk, yet it remains connected to the other nodes. Such a node acts as a black hole in the cluster, stalling every shard search request from the coordinating node. The accumulating requests consume more and more memory on the coordinating node, soon driving it into full GC.
We maintain a production environment of about 300 nodes, where disk failures are very common. We tried setting a timeout in the search request body, as described in https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html#global-search-timeout, but it does not take effect in our situation, since that timeout mainly applies at the Lucene level, as discussed in #9156.
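For reference, the timeout tried here is the documented search `timeout` body parameter (the linked page also describes the cluster-wide `search.default_search_timeout` setting). A minimal request, in the console syntax used by the Elasticsearch docs:

```
GET /_search
{
  "timeout": "2s",
  "query": { "match_all": {} }
}
```

As #9156 discusses, this deadline is checked during Lucene's document collection phase, so it cannot cancel a network round-trip that is stuck on a node with a failed disk — which is exactly the gap this issue is about.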
So I have added a request-body timeout for the query, fetch, and write network actions. It seems to greatly improve the cluster's resilience when a node is overloaded or its disk fails. Is this solution good enough, or is there a better one?