Add timeout for Search Network Action to Improve Cluster Resistance #60037

Closed
boicehuang opened this issue Jul 22, 2020 · 10 comments
Labels
>bug feedback_needed :Search/Search Search-related issues that do not fall into other categories Team:Search Meta label for search team

Comments

@boicehuang
Contributor

boicehuang commented Jul 22, 2020

Issue

Currently, the coordinating node sends the query and fetch network actions to remote data nodes without any timeout options.

    public <T extends TransportResponse> void sendChildRequest(final Transport.Connection connection, final String action,
                                                               final TransportRequest request, final Task parentTask,
                                                               final TransportResponseHandler<T> handler) {
        sendChildRequest(connection, action, request, parentTask, TransportRequestOptions.EMPTY, handler);
    }

This has a very bad impact: when one of the data nodes suffers a disk failure, it cannot handle I/O operations such as reading or writing data on disk, but it is still connected to the other nodes. The node acts as a black hole in the cluster, and every shard search request from the coordinating node gets stuck on it. The accumulating requests consume more and more memory on the coordinating node, and soon drive it into full GC.

We maintain a production environment of about 300 nodes, and disk failures are very common. We tried setting a timeout in the search request body, as described in https://www.elastic.co/guide/en/elasticsearch/reference/current/search.html#global-search-timeout, but it does not take effect in our situation since that timeout mainly bounds Lucene execution, as discussed in #9156.
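
A minimal sketch of that kind of request-body timeout (using the 7.x Java high-level REST client; the index name and the 5s value are placeholders):

    import org.elasticsearch.action.search.SearchRequest;
    import org.elasticsearch.common.unit.TimeValue;
    import org.elasticsearch.index.query.QueryBuilders;
    import org.elasticsearch.search.builder.SearchSourceBuilder;

    // Request-body timeout: bounds query execution inside Lucene on each shard,
    // but does not bound how long the coordinating node waits at the transport layer.
    SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.matchAllQuery())
            .timeout(TimeValue.timeValueSeconds(5));
    SearchRequest searchRequest = new SearchRequest("my-index").source(source);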

So I have wired the request-body timeout into the query, fetch, and write network actions as a transport-level timeout, as sketched below. It seems to greatly improve cluster resistance to the overload or disk failure of a single node. I wonder whether this solution is good enough, or whether there is a better one?
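
Roughly, the change amounts to passing explicit TransportRequestOptions with a timeout instead of TransportRequestOptions.EMPTY when dispatching the child shard request. A sketch only, not the actual patch; the shardRequestTimeout parameter and how its value is plumbed from the search request are placeholders:

    public <T extends TransportResponse> void sendChildRequest(final Transport.Connection connection, final String action,
                                                               final TransportRequest request, final Task parentTask,
                                                               final TimeValue shardRequestTimeout, // placeholder: taken from the search request body
                                                               final TransportResponseHandler<T> handler) {
        // With a timeout set, the response handler fails with a timeout exception
        // instead of waiting indefinitely on an unresponsive node.
        final TransportRequestOptions options = TransportRequestOptions.builder()
            .withTimeout(shardRequestTimeout)
            .build();
        sendChildRequest(connection, action, request, parentTask, options, handler);
    }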

@boicehuang boicehuang added >bug needs:triage Requires assignment of a team area label labels Jul 22, 2020
@DaveCTurner
Contributor

The bad consequences of a disk failure are likely much more widespread than searches alone. However I think the problem described here will be improved in 7.9 thanks to #52680 which will remove a node from the cluster fairly promptly if its disk fails. Would you confirm or otherwise that this improves your experience?

@DaveCTurner DaveCTurner added :Search/Search Search-related issues that do not fall into other categories feedback_needed labels Jul 22, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Search)

@elasticmachine elasticmachine added the Team:Search Meta label for search team label Jul 22, 2020
@DaveCTurner DaveCTurner added Team:Search Meta label for search team and removed Team:Search Meta label for search team needs:triage Requires assignment of a team area label labels Jul 22, 2020
@boicehuang
Contributor Author

boicehuang commented Jul 22, 2020

@DaveCTurner Thanks for your reply. I think detecting read-only filesystems is not enough; we also need to handle the overload situation. Even when a disk is heavily loaded and its %util reaches 95%, it can still handle read or write operations, just very slowly. Will a heavily loaded disk be treated as a disk failure?

@Bukhtawar
Contributor

This is interesting. I opened #59824; not sure if that helps with your case, @boicehuang.

@boicehuang
Contributor Author

boicehuang commented Jul 22, 2020

In my experience, concurrent aggregation requests stuck on coordinating nodes can easily drive many nodes of the cluster into full GC and make them unresponsive when a disk is heavily loaded or failing. That is why I focused on search requests above.

@boicehuang
Contributor Author

boicehuang commented Jul 24, 2020

@DaveCTurner Thanks for your reply. Let me explain more about my experience. The process is as follows.

  1. A big aggregation request was sent to the coordinating node with a timeout and split into 5 shard search queries. The shard search requests were sent to the data nodes without any timeout.

  2. 4 of the shards returned their shard search responses quickly, but one data node's machine became so heavily loaded that Elasticsearch handled its shard search request very slowly and only returned its response after 5 minutes. (In my experience, some types of disk failure just slow down I/O.)

  3. The coordinating node was still waiting for the response from shard 1. The client waited 5s, still could not get any partial results, and discarded the request, but the aggregation request was still stuck in Elasticsearch with 4 shard search responses, waiting for the one shard search response that was left.

  4. After 5 minutes the remaining shard finally returned its response and the aggregation finished.

My question:

  1. Can we end the aggregation earlier and respond to the request with partial results, instead of waiting for the missing shard response or for the node health checker?

@DaveCTurner
Contributor

Can we end the aggregation earlier and respond to the request with partial results, instead of waiting for the missing shard response or for the node health checker?

In theory we could but it doesn't really help much. We'd carry on sending traffic to the bad node, all of which will have to wait for the same timeout, so latency would be terrible and a backlog would still form. Much better to stop sending requests to the node ASAP, i.e. to remove it from the cluster.

In my experience, some types of disk failure just slow down I/O.

This is the bit I am not understanding. Faulty disks fail IO requests quickly IME (assuming they're properly configured and don't retry for ages first) and you can detect that they're failing and remove the node from service yourself much earlier and more reliably than Elasticsearch can (e.g. by looking at SMART metrics).

Given that this is not an Elasticsearch-specific problem I've looked at how other similar systems handle this (e.g. MongoDB, Cassandra, CockroachDB). As far as I can tell, none of them have any special handling or health checks of this nature. I've spoken with our syseng folks internally and we don't see our own clusters failing like this either. I haven't been able to find any other research or documentation that suggests this is something that needs addressing at the application level either. Please help us understand what we've missed here.

@boicehuang
Contributor Author

boicehuang commented Jul 27, 2020

@DaveCTurner. Thanks for your reply.

I haven't been able to find any other research or documentation that suggests this is something that needs addressing at the application level either. Please help us understand what we've missed here.

As far as I know, in HBase we limit how long a single RPC call can run by setting hbase.rpc.timeout. In MySQL, we can also use max_execution_time to set a session-wide timeout; long-running query execution is interrupted once it expires.
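
For example, in the HBase Java client that limit is just a configuration property (a rough sketch; the 60s value is illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;

    // Bound every client RPC to the region servers at 60 seconds.
    Configuration conf = HBaseConfiguration.create();
    conf.setInt("hbase.rpc.timeout", 60000);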

@DaveCTurner
Contributor

Neither of those looks relevant to the question of detecting and handling disk failures with a timeout. Repeating what I said above: timing out individual (sub-)requests does not prevent future (sub-)requests from going to the same faulty node. Much better to remove broken nodes from the cluster.

In terms of the more general question of timing out a search, you can already do this today in the client (#43332) and we're already discussing returning partial results on a timeout (#30897) too. You may also be looking for adaptive replica selection to steer searches away from nodes that are simply busy. I don't think these are useful in the case of a genuinely broken node, the subject of this thread, but maybe they are helpful to you.
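
For illustration, one way to impose such a client-side timeout with the low-level Java REST client (the 5s socket timeout and the host are arbitrary examples):

    import org.apache.http.HttpHost;
    import org.elasticsearch.client.RestClient;

    // Fail any request, including searches, after 5 seconds without a response.
    RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200, "http"))
        .setRequestConfigCallback(requestConfigBuilder -> requestConfigBuilder.setSocketTimeout(5000))
        .build();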

@cjcenizal
Contributor

No further feedback received. @boicehuang if you have any more information for us please add it in a comment and we can look at re-opening this issue.
