Fault detection ping doesn't check for disk health #40326

Closed
Bukhtawar opened this issue Mar 21, 2019 · 24 comments
Labels
:Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection), feedback_needed, resiliency

Comments

@Bukhtawar
Contributor

Problem

The fault detection pings are light-weight and mostly check for network connectivity across nodes, kicking nodes that are not reachable out of the cluster. In some cases one of the disks can become unresponsive, in which case some APIs like /_nodes/stats might get stuck. Ongoing writes can be stalled until this state is detected by some other means and the disk is replaced. I believe the lag detector in 7.x would be the first to figure this out and remove the node from the cluster state, but only if there has been a cluster state update. Could this be detected upfront with deeper health checks?
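One way to observe this failure mode from outside the node is to poll the node-local fs stats with a client-side deadline and treat a timeout as a sign that the stats call is wedged on the unresponsive disk. A minimal sketch (illustrative only; the endpoint address and timeouts are placeholders, not anything Elasticsearch ships):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// External probe for the symptom described above: if the node-local fs stats
// call does not answer within the deadline, the disk (or the stats path) is
// likely wedged.
public class NodeStatsProbe {

    public static void main(String[] args) throws Exception {
        String endpoint = "http://localhost:9200/_nodes/_local/stats/fs"; // placeholder address
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
            .timeout(Duration.ofSeconds(10)) // a healthy node answers well within this
            .GET()
            .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("fs stats returned HTTP " + response.statusCode());
        } catch (HttpTimeoutException e) {
            System.err.println("fs stats did not respond within 10s; disk may be unresponsive");
        }
    }
}
```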

@DaveCTurner added the resiliency and :Distributed/Cluster Coordination labels on Mar 22, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

Some initial thoughts:

Can local disks become unresponsive as you describe, or is this uniquely a failure mode of network-attached storage? We generally recommend against network-attached storage.

How exactly do you propose deciding that the disk has become unresponsive?

I don't think removing such a broken node from the cluster is the right way to react here. If it were still running then it would keep trying to rejoin the cluster, and I think this would be rather disruptive. I suspect the right reaction is to shut it down.

I don't think that this needs any kind of distributed check. The node ought to be able to make this decision locally and react accordingly.

Somewhat related to #18417.

@Bukhtawar
Contributor Author

@DaveCTurner This has happened on i3.16xlarge (SSD-based instance storage).
Can a node, before joining the cluster, run checks locally to detect a problem with disk health and avoid sending a join request to the master? This would avoid unnecessary disruptions.

@Bukhtawar
Contributor Author

An initial proposal
I think this doesn't necessarily have to be tied to the FD pings; it could run as an observer thread on the master that checks FS stats. If those stats over a period of time are indicative of a bad disk, the observer thread can submit a task on the master to remove the specific node from the cluster state. Each node would also check its FS stats before sending a join request.

We can expose an API like /_healthcheck that users can choose to invoke if they want to terminate instances with a bad disk. That way the node can remain isolated from the cluster and be terminated on user action.
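As a purely illustrative sketch of the kind of local probe such a hypothetical /_healthcheck could be backed by (not Elasticsearch code; class name, paths, and timeouts are made up): write and fsync a small file on the data path, and treat a timeout or an IOException as an unhealthy disk.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical local disk probe: a health-check endpoint could call isHealthy()
// and report the result without involving the master at all.
public class LocalDiskProbe {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public boolean isHealthy(Path dataPath, long timeoutMillis) {
        Future<Boolean> probe = executor.submit(() -> {
            Path tmp = Files.createTempFile(dataPath, "health-", ".tmp");
            try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                channel.write(ByteBuffer.wrap(new byte[] { 1 }));
                channel.force(true); // fsync: surfaces read-only remounts and hung volumes
            } finally {
                Files.deleteIfExists(tmp);
            }
            return true;
        });
        try {
            return probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            probe.cancel(true); // the write may be stuck in the kernel; don't wait for it
            return false;
        }
    }
}
```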

@DaveCTurner
Contributor

This has happened on i3.16xlarge (SSD-based instance storage)

Interesting.

If those stats over a period of time are indicative of a bad disk

This is the crux of the matter. How precisely do you propose to make this decision?

@Bukhtawar
Contributor Author

For writes, it could be that the write threadpool queue size is consistently high while the disk throughput (based on /proc/diskstats) is consistently very low. The interval and limits could be fine-tuned, I believe. I agree this would need some brainstorming to precisely detect a disk health issue.
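To make the throughput side of that heuristic concrete, here is a rough, Linux-only sketch (the device name and sampling interval are placeholders; field layout follows the kernel's /proc/diskstats documentation, where counters are in 512-byte sectors) that samples the sectors-written counter twice and derives a write throughput:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch: sample sectors written for one device from /proc/diskstats
// and derive a rough write throughput over a fixed interval.
public class DiskThroughputProbe {

    // Field 10 (1-based) of /proc/diskstats is "sectors written"; a sector is 512 bytes.
    static long sectorsWritten(String device) throws Exception {
        for (String line : Files.readAllLines(Paths.get("/proc/diskstats"))) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length > 9 && fields[2].equals(device)) {
                return Long.parseLong(fields[9]);
            }
        }
        throw new IllegalArgumentException("device not found: " + device);
    }

    public static void main(String[] args) throws Exception {
        String device = args.length > 0 ? args[0] : "nvme0n1"; // hypothetical device name
        long before = sectorsWritten(device);
        Thread.sleep(10_000);                                   // sampling interval
        long after = sectorsWritten(device);
        double bytesPerSec = (after - before) * 512.0 / 10.0;
        // A health monitor might flag the disk if this stays near zero while
        // the write threadpool queue stays consistently high.
        System.out.printf("%s write throughput ~%.0f bytes/s%n", device, bytesPerSec);
    }
}
```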

@DaveCTurner
Contributor

Can you share the stats you collected from /proc/diskstats during the outage you mentioned above so we can see if it's possible to detect a clearly faulty disk in this way?

Also do you have an idea for implementing this portably? Your suggestion of /proc/diskstats is only supported on Linux, and I can't find anything to indicate that it's a stable interface that won't change in a future kernel version.

@DaveCTurner
Contributor

We discussed this idea as a team and raised two questions:

  • if this is a problem in practice, we expect there to be other systems in existence that have encountered, and solved, it. @Bukhtawar could you look into this and summarise how this is addressed elsewhere?

  • as far as we can tell there are kernel-level timeouts that should prevent a disk from being unresponsive. @Bukhtawar can you explain in more detail how these timeouts failed to prevent the situation described in the OP?

@ebadyano
Contributor

No further feedback received. @Bukhtawar if you have the requested information please add it in a comment and we can look at re-opening this issue.

@Bukhtawar
Contributor Author

Can't we leverage the lag detector (or something along similar lines) to send out periodic no-op cluster state updates if there hasn't been an update recently (say every minute or every five minutes, so as to not overload the cluster)? If the node with the bad disk fails to apply the cluster state, we can kick it out.
@DaveCTurner thoughts?

@DaveCTurner
Contributor

It's impossible for us to say whether this would help without the further information requested above. "Doesn't trigger the lag detector" is a very weak health check.

@Bukhtawar
Contributor Author

Bukhtawar commented Aug 1, 2019

Some context: I noticed a problem with the volume (it turned read-only) which continued for over two hours. No cluster state update was published during the first 40 minutes of the outage, and all that while requests were stalled on the problematic node. Only after the volume recovered did the node apply the cluster state update. The point here is that if there are no updates, the master can't detect a disk that has turned read-only, and requests can stay stalled.
I agree this isn't sufficient, but it would cover cases where there is a definite problem (like a read-only volume, which is common). While I have yet to look at how other systems react, I had a thought about how the most obvious issues can be mitigated.

There are other approaches, like reading and writing a file and checking the latency, but they suffer from false positives, which could be caused by long GC pauses.

Another way could be for the master to initiate periodic writes on the nodes; if a node hasn't processed the writes within a threshold (60s), the master kicks it out. A node joining back should validate that the writes go through.
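As a sketch of what such a periodic write probe could look like if run locally (or driven by the master), with a consecutive-failure threshold to damp the GC-pause false positives mentioned above. All names and thresholds are hypothetical, and a write that never returns would additionally need the timeout wrapper from the earlier sketch:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog: probe the data path every 30s and only react after
// several consecutive slow or failed probes.
public class PeriodicWriteProbe {

    private static final long SLOW_THRESHOLD_MILLIS = 60_000; // 60s, as suggested above
    private static final int FAILURE_THRESHOLD = 3;           // consecutive bad probes before acting

    private int consecutiveFailures = 0;

    public void start(Path dataPath, Runnable onSuspectedDiskFailure) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            boolean bad;
            long start = System.nanoTime();
            try {
                Path tmp = Files.createTempFile(dataPath, "probe-", ".tmp");
                Files.write(tmp, new byte[] { 1 }, StandardOpenOption.SYNC);
                Files.delete(tmp);
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
                bad = elapsedMillis > SLOW_THRESHOLD_MILLIS;
            } catch (Exception e) { // read-only filesystem, I/O error, ...
                bad = true;
            }
            consecutiveFailures = bad ? consecutiveFailures + 1 : 0;
            if (consecutiveFailures >= FAILURE_THRESHOLD) {
                onSuspectedDiskFailure.run(); // e.g. shut the node down or leave the cluster
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```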

@DaveCTurner
Contributor

You certainly can't start a node on a readonly filesystem, so maybe a node should shut down if its filesystem becomes readonly while it's running. This very question is already on our agenda to discuss at some point in the future.

I don't think we need to do anything as complicated as a cluster state update to check for a readonly filesystem, as I noted above:

I don't think that this needs any kind of distributed check. The node ought to be able to make this decision locally and react accordingly.

The same is true of checking for IO latency: why bother the master with this at all?
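For the readonly case specifically, a purely local check seems straightforward. A minimal sketch (not Elasticsearch code; the data path is a placeholder) could query the FileStore and attempt a tiny write, since a remount to read-only typically surfaces as an IOException:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical local read-only check a node could run periodically and act on
// by shutting itself down, without any distributed coordination.
public class ReadOnlyCheck {

    public static boolean isWritable(Path dataPath) {
        try {
            if (Files.getFileStore(dataPath).isReadOnly()) {
                return false;
            }
            Path tmp = Files.createTempFile(dataPath, "ro-check-", ".tmp");
            Files.delete(tmp);
            return true;
        } catch (IOException e) { // e.g. "Read-only file system"
            return false;
        }
    }

    public static void main(String[] args) {
        Path dataPath = Paths.get(args.length > 0 ? args[0] : "/var/lib/elasticsearch"); // placeholder path
        if (!isWritable(dataPath)) {
            System.err.println("Data path is not writable: " + dataPath);
            System.exit(1); // a real node would trigger an orderly shutdown instead
        }
    }
}
```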

@DaveCTurner
Contributor

... and I'll ask again for answers to the questions we posed earlier.

@Bukhtawar
Contributor Author

The same is true of checking for IO latency: why bother the master with this at all?

If most of the nodes were facing an outage (region-wide), then a cluster-level decision becomes important. It would be more desirable to kick a node out only if the cluster health doesn't go RED as a result, or based on some other health characteristics.

I did try looking into how other systems behave, but it looks like these systems face a lot of false positives and require operator intervention.

@Bukhtawar
Contributor Author

The best part about the lag detector is that it's more deterministic, leaving less room for false positives. I am not saying that a cluster state update is the solution here, but anything along those lines should definitely be helpful, at least for cases where the issue is very obvious and no remediation is otherwise anticipated.

@Bukhtawar
Contributor Author

@DaveCTurner just curious: how would the lag detector respond to a read-only filesystem on a joining node? After kicking the node out of the cluster due to a lagging state, the node would retry the join. The join validation on the master and the full join validation on the node would not catch a read-only disk (they may not write anything to disk), so the node responds successfully to the join validations, causing the master to update the cluster state with the node join. But then the joining node would fail to apply this state (its own join), causing this to go in a loop.
Please let me know if I am missing something here.

@DaveCTurner
Contributor

But then the joining node would fail to apply this state (its own join), causing this to go in a loop.

Yes, that's the issue, and the very argument for performing these checks locally.

@Bukhtawar
Contributor Author

Bukhtawar commented Aug 2, 2019

But isn't that an issue today with the lag detector? Shouldn't this need a fix to avoid too many flip-flops once a disk goes read-only? Would it be better if join requests could persist the cluster state passed by the master as part of join validation before acking back?

@DaveCTurner I hear you, but the only point I am trying to make is that taking a cluster-wide decision (through the master, maybe) could help protect overall cluster health from going RED.

Also, would node start-up operations always involve disk reads/writes? I see there is a plan for better consistency checks as part of #44624. Is that the reason your recommendation of local checks and shutdown won't need additional start-up checks (i.e. flip-flops won't happen if there is a clear disk failure, e.g. read-only)?

@DaveCTurner
Contributor

But isn't that an issue today with the lag detector?

Yes indeed, that's why it's on our agenda to discuss.

taking a cluster-wide decision (through the master, maybe) could help protect overall cluster health from going RED.

I don't follow. If the cluster cannot accept writes to some shards then RED is surely the correct health to report?

Also, would node start-up operations always involve disk reads/writes?

Yes, you cannot start up a node on a readonly filesystem.

@Bukhtawar
Contributor Author

I don't follow. If the cluster cannot accept writes to some shards then RED is surely the correct health to report?

Would shard relocation not work if the disk is read-only? I guess it would be more ideal to relocate shards off the node as a best effort before shutting it down, to avoid running into a RED state.

@DaveCTurner
Contributor

A readonly filesystem on a data node is an unsupported situation, although I will admit that Elasticsearch's behaviour if the filesystem goes readonly could be better defined than it is today. We don't really expect anything to work in such a case. Elasticsearch expects to be able to write to disk on the source node (the primary) during a peer recovery, and may fail the shard if it discovers it cannot do so.

This conversation started out discussing local disks becoming readonly, but now you seem to be concerned with outages affecting multiple nodes in a region-wide fashion. Can you explain more clearly how you can have a whole region's worth of local disks go readonly at the same time?

Can you also answer the outstanding question about why your IO subsystem was hanging rather than timing out and returning an error when the local disk became unresponsive?

@Bukhtawar
Contributor Author

Thanks @DaveCTurner for the detailed explanation

Can you explain more clearly how you can have a whole region's worth of local disks go readonly at the same time?

Apologies, I wasn't clear: what I actually meant was that multiple nodes (maybe an AZ, not the entire region) can face outages at the same time. In cases where the cluster isn't zone-aware/zone-balanced, it would be more desirable to remediate one node at a time, possibly to not cause a RED cluster for read-intensive workloads (I am assuming reads should go through).

Can you also answer the outstanding question about why your IO subsystem was hanging rather than timing out and returning an error when the local disk became unresponsive?

I noticed it was something like /_bulk?timeout=10000ms (I couldn't get a chance to investigate further). I am not sure whether the explicit timeout wasn't honored, causing it to stall for the entire outage.

@Bukhtawar
Contributor Author

@DaveCTurner I was thinking having a similar check on node joins should help with #16745. Thoughts?
