Fault detection ping doesn't check for disk health #40326

Closed
Bukhtawar opened this issue Mar 21, 2019 · 24 comments
Labels
:Distributed/Cluster Coordination (Cluster formation and cluster state publication, including cluster membership and fault detection), feedback_needed, resiliency

Comments

@Bukhtawar
Contributor

Problem

The fault detection pings are light-weight and mostly check for network connectivity across nodes, kicking nodes that are not reachable out of the cluster. In some cases one of the disks can become unresponsive, in which case some APIs like /_nodes/stats might get stuck. Ongoing writes can be stalled until this state is detected by some other means and the disk is replaced. I believe the lag detector in 7.x would be the first to figure this out and remove the node from the cluster state, but only if there has been a cluster state update. Could this be detected upfront with deeper health checks?
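One way to observe this failure mode from outside the node is to poll the node-local fs stats with a client-side deadline and treat a timeout as a sign that the stats call is wedged on the unresponsive disk. A minimal sketch (illustrative only; the endpoint address and timeouts are placeholders, not anything Elasticsearch ships):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// External probe for the symptom described above: if the node-local fs stats
// call does not answer within the deadline, the disk (or the stats path) is
// likely wedged.
public class NodeStatsProbe {

    public static void main(String[] args) throws Exception {
        String endpoint = "http://localhost:9200/_nodes/_local/stats/fs"; // placeholder address
        HttpClient client = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint))
            .timeout(Duration.ofSeconds(10)) // a healthy node answers well within this
            .GET()
            .build();
        try {
            HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println("fs stats returned HTTP " + response.statusCode());
        } catch (HttpTimeoutException e) {
            System.err.println("fs stats did not respond within 10s; disk may be unresponsive");
        }
    }
}
```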

@DaveCTurner added the resiliency and :Distributed/Cluster Coordination labels on Mar 22, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@DaveCTurner
Contributor

Some initial thoughts:

Can local disks become unresponsive as you describe, or is this uniquely a failure mode of network-attached storage? We generally recommend against network-attached storage.

How exactly do you propose deciding that the disk has become unresponsive?

I don't think removing such a broken node from the cluster is the right way to react here. If it were still running then it would keep trying to rejoin the cluster, and I think this would be rather disruptive. I suspect the right reaction is to shut it down.

I don't think that this needs any kind of distributed check. The node ought to be able to make this decision locally and react accordingly.

Somewhat related to #18417.

@Bukhtawar
Contributor Author

@DaveCTurner This has happened on i3.16xlarge (SSD-based instance storage).
Can a node, before joining the cluster, run checks locally to detect a problem with disk health and avoid sending a join request to the master? This would avoid unnecessary disruptions.

@Bukhtawar
Contributor Author

An initial proposal
I think this doesn't necessarily have to be tied to the FD pings; it could run as an observer thread on the master that checks FS stats. If those stats over a period of time are indicative of a bad disk, the observer thread can submit a task on the master to remove the specific node from the cluster state. Each node would also check its FS stats before sending a join request.

We can expose an API like /_healthcheck that users can choose to invoke if they want to terminate instances with a bad disk. That way the node can remain isolated from the cluster and be terminated on user action.
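As a purely illustrative sketch of the kind of local probe such a hypothetical /_healthcheck could be backed by (not Elasticsearch code; class name, paths, and timeouts are made up): write and fsync a small file on the data path, and treat a timeout or an IOException as an unhealthy disk.

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical local disk probe: a health-check endpoint could call isHealthy()
// and report the result without involving the master at all.
public class LocalDiskProbe {

    private final ExecutorService executor = Executors.newSingleThreadExecutor();

    public boolean isHealthy(Path dataPath, long timeoutMillis) {
        Future<Boolean> probe = executor.submit(() -> {
            Path tmp = Files.createTempFile(dataPath, "health-", ".tmp");
            try (FileChannel channel = FileChannel.open(tmp, StandardOpenOption.WRITE)) {
                channel.write(ByteBuffer.wrap(new byte[] { 1 }));
                channel.force(true); // fsync: surfaces read-only remounts and hung volumes
            } finally {
                Files.deleteIfExists(tmp);
            }
            return true;
        });
        try {
            return probe.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException | InterruptedException e) {
            probe.cancel(true); // the write may be stuck in the kernel; don't wait for it
            return false;
        }
    }
}
```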

@DaveCTurner
Contributor

This has happened on i3.16xlarge (SSD-based instance storage)

Interesting.

If those stats over a period of time are indicative of a bad disk

This is the crux of the matter. How precisely do you propose to make this decision?

@Bukhtawar
Contributor Author

For writes, it could be that the write threadpool queue size is consistently high while the disk throughput (based on /proc/diskstats) is consistently very low. The interval and limits could be fine-tuned, I believe. I agree this would need some brainstorming to precisely detect a disk health issue.
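To make the throughput side of that heuristic concrete, here is a rough, Linux-only sketch (the device name and sampling interval are placeholders; field layout follows the kernel's /proc/diskstats documentation, where counters are in 512-byte sectors) that samples the sectors-written counter twice and derives a write throughput:

```java
import java.nio.file.Files;
import java.nio.file.Paths;

// Illustrative sketch: sample sectors written for one device from /proc/diskstats
// and derive a rough write throughput over a fixed interval.
public class DiskThroughputProbe {

    // Field 10 (1-based) of /proc/diskstats is "sectors written"; a sector is 512 bytes.
    static long sectorsWritten(String device) throws Exception {
        for (String line : Files.readAllLines(Paths.get("/proc/diskstats"))) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length > 9 && fields[2].equals(device)) {
                return Long.parseLong(fields[9]);
            }
        }
        throw new IllegalArgumentException("device not found: " + device);
    }

    public static void main(String[] args) throws Exception {
        String device = args.length > 0 ? args[0] : "nvme0n1"; // hypothetical device name
        long before = sectorsWritten(device);
        Thread.sleep(10_000);                                   // sampling interval
        long after = sectorsWritten(device);
        double bytesPerSec = (after - before) * 512.0 / 10.0;
        // A health monitor might flag the disk if this stays near zero while
        // the write threadpool queue stays consistently high.
        System.out.printf("%s write throughput ~%.0f bytes/s%n", device, bytesPerSec);
    }
}
```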

@DaveCTurner
Contributor

Can you share the stats you collected from /proc/diskstats during the outage you mentioned above so we can see if it's possible to detect a clearly faulty disk in this way?

Also do you have an idea for implementing this portably? Your suggestion of /proc/diskstats is only supported on Linux, and I can't find anything to indicate that it's a stable interface that won't change in a future kernel version.

@DaveCTurner
Contributor

We discussed this idea as a team and raised two questions:

  • if this is a problem in practice, we expect there to be other systems in existence that have encountered, and solved, it. @Bukhtawar could you look into this and summarise how this is addressed elsewhere?

  • as far as we can tell there are kernel-level timeouts that should prevent a disk from being unresponsive. @Bukhtawar can you explain in more detail how these timeouts failed to prevent the situation described in the OP?

@ebadyano
Contributor

No further feedback received. @Bukhtawar if you have the requested information please add it in a comment and we can look at re-opening this issue.

@Bukhtawar
Contributor Author

Can't we leverage the lag detector (or something along similar lines) to send out periodic no-op cluster state updates if there hasn't been an update recently (say every minute or every five minutes, so as to not overload the cluster)? If the node with the bad disk fails to apply the cluster state, we can kick it out.
@DaveCTurner thoughts?

@DaveCTurner
Contributor

It's impossible for us to say whether this would help without the further information requested above. "Doesn't trigger the lag detector" is a very weak health check.

@Bukhtawar
Contributor Author

Bukhtawar commented Aug 1, 2019

Some context: I noticed a problem with the volume (it turned read-only) which continued for over two hours. No cluster state update was published during the first 40 minutes of the outage, and all that while requests were stalled on the problematic node. Only after the volume recovered did the node apply the cluster state update. The point here is that if there are no updates, the master can't detect a disk that has turned read-only, and requests can stay stalled.
I agree this isn't sufficient, but it would cover cases where there is a definite problem (like a read-only volume, which is common). While I have yet to look at how other systems react, I had a thought about how the most obvious issues can be mitigated.

There are other approaches, like reading and writing a file and checking the latency, but they suffer from false positives, which could be caused by long GC pauses.

Another way could be for the master to initiate periodic writes on the nodes; if a node hasn't processed the writes within a threshold (60s), the master kicks it out. A node joining back should validate that the writes go through.
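As a sketch of what such a periodic write probe could look like if run locally (or driven by the master), with a consecutive-failure threshold to damp the GC-pause false positives mentioned above. All names and thresholds are hypothetical, and a write that never returns would additionally need the timeout wrapper from the earlier sketch:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical watchdog: probe the data path every 30s and only react after
// several consecutive slow or failed probes.
public class PeriodicWriteProbe {

    private static final long SLOW_THRESHOLD_MILLIS = 60_000; // 60s, as suggested above
    private static final int FAILURE_THRESHOLD = 3;           // consecutive bad probes before acting

    private int consecutiveFailures = 0;

    public void start(Path dataPath, Runnable onSuspectedDiskFailure) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            boolean bad;
            long start = System.nanoTime();
            try {
                Path tmp = Files.createTempFile(dataPath, "probe-", ".tmp");
                Files.write(tmp, new byte[] { 1 }, StandardOpenOption.SYNC);
                Files.delete(tmp);
                long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
                bad = elapsedMillis > SLOW_THRESHOLD_MILLIS;
            } catch (Exception e) { // read-only filesystem, I/O error, ...
                bad = true;
            }
            consecutiveFailures = bad ? consecutiveFailures + 1 : 0;
            if (consecutiveFailures >= FAILURE_THRESHOLD) {
                onSuspectedDiskFailure.run(); // e.g. shut the node down or leave the cluster
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}
```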

@DaveCTurner
Contributor

You certainly can't start a node on a readonly filesystem, so maybe a node should shut down if its filesystem becomes readonly while it's running. This very question is already on our agenda to discuss at some point in the future.

I don't think we need to do anything as complicated as a cluster state update to check for a readonly filesystem, as I noted above:

I don't think that this needs any kind of distributed check. The node ought to be able to make this decision locally and react accordingly.

The same is true of checking for IO latency: why bother the master with this at all?
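For the readonly case specifically, a purely local check seems straightforward. A minimal sketch (not Elasticsearch code; the data path is a placeholder) could query the FileStore and attempt a tiny write, since a remount to read-only typically surfaces as an IOException:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical local read-only check a node could run periodically and act on
// by shutting itself down, without any distributed coordination.
public class ReadOnlyCheck {

    public static boolean isWritable(Path dataPath) {
        try {
            if (Files.getFileStore(dataPath).isReadOnly()) {
                return false;
            }
            Path tmp = Files.createTempFile(dataPath, "ro-check-", ".tmp");
            Files.delete(tmp);
            return true;
        } catch (IOException e) { // e.g. "Read-only file system"
            return false;
        }
    }

    public static void main(String[] args) {
        Path dataPath = Paths.get(args.length > 0 ? args[0] : "/var/lib/elasticsearch"); // placeholder path
        if (!isWritable(dataPath)) {
            System.err.println("Data path is not writable: " + dataPath);
            System.exit(1); // a real node would trigger an orderly shutdown instead
        }
    }
}
```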

@DaveCTurner
Contributor

... and I'll ask again for answers to the questions we posed earlier.

@Bukhtawar
Contributor Author

The same is true of checking for IO latency: why bother the master with this at all?

If most of the nodes were facing an outage (region-wide), then a cluster-level decision becomes important. It would be more desirable to kick a node out only if the cluster health doesn't go RED as a result, or based on some other health characteristics.

I did try looking into how other systems behave, but it looks like these systems face a lot of false positives and require operator intervention.

@Bukhtawar
Contributor Author

The best part about the lag detector is that it's more deterministic, leaving less room for false positives. I am not saying that a cluster state update is the solution here, but anything along those lines should definitely be helpful, at least for cases where the issue is very obvious and no remediation is otherwise anticipated.

@Bukhtawar
Contributor Author

@DaveCTurner just curious: how would the lag detector respond to a read-only filesystem on a joining node? After kicking the node out of the cluster due to a lagging state, the node would retry the join. The join validation on the master and the full join validation on the node would not catch a read-only disk (they may not write anything to disk), so the node responds successfully to the join validations, causing the master to update the cluster state with the node join. But then the joining node would fail to apply this state (its own join), causing this to go in a loop.
Please let me know if I am missing something here.

@DaveCTurner
Contributor

But then the joining node would fail to apply this state (its own join), causing this to go in a loop.

Yes, that's the issue, and the very argument for performing these checks locally.

@Bukhtawar
Contributor Author

Bukhtawar commented Aug 2, 2019

But isn't that an issue today with the lag detector? Shouldn't this need a fix to avoid too many flip-flops once a disk goes read-only? Would it be better if join requests could persist the cluster state passed by the master as part of join validation before acking back?

@DaveCTurner I hear you, but the only point I am trying to make is that taking a cluster-wide decision (through the master, maybe) could help protect overall cluster health from going RED.

Also, would node start-up operations always involve disk reads/writes? I see there is a plan for better consistency checks as part of #44624. Is that the reason your recommendation of local checks and shutdown won't need additional start-up checks (i.e. flip-flops won't happen if there is a clear disk failure, e.g. read-only)?

@DaveCTurner
Contributor

But isn't that an issue today with the lag detector?

Yes indeed, that's why it's on our agenda to discuss.

taking a cluster-wide decision (through the master, maybe) could help protect overall cluster health from going RED.

I don't follow. If the cluster cannot accept writes to some shards then RED is surely the correct health to report?

Also, would node start-up operations always involve disk reads/writes?

Yes, you cannot start up a node on a readonly filesystem.

@Bukhtawar
Contributor Author

I don't follow. If the cluster cannot accept writes to some shards then RED is surely the correct health to report?

Would shard relocation not work if the disk is read-only? I guess it would be more ideal to relocate shards off the node as a best effort before shutting it down, to avoid running into a RED state.

@DaveCTurner
Contributor

A readonly filesystem on a data node is an unsupported situation, although I will admit that Elasticsearch's behaviour if the filesystem goes readonly could be better defined than it is today. We don't really expect anything to work in such a case. Elasticsearch expects to be able to write to disk on the source node (the primary) during a peer recovery, and may fail the shard if it discovers it cannot do so.

This conversation started out discussing local disks becoming readonly, but now you seem to be concerned with outages affecting multiple nodes in a region-wide fashion. Can you explain more clearly how you can have a whole region's worth of local disks go readonly at the same time?

Can you also answer the outstanding question about why your IO subsystem was hanging rather than timing out and returning an error when the local disk became unresponsive?

@Bukhtawar
Contributor Author

Thanks @DaveCTurner for the detailed explanation

Can you explain more clearly how you can have a whole region's worth of local disks go readonly at the same time?

Apologies, I wasn't clear: what I actually meant was that multiple nodes (maybe an AZ, not the entire region) can face outages at the same time. In cases where the cluster isn't zone-aware/zone-balanced, it would be more desirable to remediate one node at a time, possibly to not cause a RED cluster for read-intensive workloads (I am assuming reads should go through).

Can you also answer the outstanding question about why your IO subsystem was hanging rather than timing out and returning an error when the local disk became unresponsive?

I noticed it was something like /_bulk?timeout=10000ms (I couldn't get a chance to investigate further). I am not sure whether the explicit timeout wasn't honored, causing it to stall for the entire outage.

@Bukhtawar
Contributor Author

@DaveCTurner I was thinking having a similar check on node joins should help with #16745. Thoughts?
