Adds resiliency to read-only filesystems #45286 #52680
Conversation
Rebase from fork
…ite to all paths and emits an is_writable stat as part of the node stats API. FsReadOnlyMonitor pulls up the stats and tries to remove the node if not all paths are found to be writable. Addresses elastic#45286.
Pinging @elastic/es-core-infra (:Core/Infra/Resiliency)
Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)
Thanks @Bukhtawar, I left a few initial comments on the general approach. I think this will need some changes to make it testable too, so it won't really make sense to review it in depth until you've added some tests. The CoordinatorTests test suite is a good place to look, as it lets you write tests that hit timeouts without actually having to wait.
Please make sure to reformat your code too - there are a few places where the whitespace doesn't fit the style of the surrounding code.
* Monitor runs on master and listens for events from #ClusterInfoService on node stats. It checks to see if
* a node has all paths writable if not removes the node from the cluster based on the setting monitor.fs.unhealthy.remove_enabled
*/
public class FsReadOnlyMonitor {
This (and the extensions to ClusterInfoService) seems unnecessary. It would be preferable for the FollowersChecker to report a node as unhealthy directly.
If I understand correctly, we don't want NodeClient pulling up FS stats. Instead the FollowerChecker ping should pull the FS health info too?
I'll change the current implementation and use the Transport request handler. Let me know if that's what is expected.
// delete any lingering file from a previous failure
Files.deleteIfExists(resolve);
Files.createFile(resolve);
Files.delete(resolve);
This is too weak a check IMO. It doesn't write any data or fsync anything.
Added an fsync check.
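For reference, a minimal sketch of the kind of stronger probe discussed above: write real bytes and fsync them instead of only creating and deleting an empty file. The class, the TEMP_FILE_NAME constant, and the probe bytes are illustrative assumptions, not the PR's exact code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FsProbeSketch {
    private static final String TEMP_FILE_NAME = ".es_temp_file"; // assumed marker file name
    private static final byte[] PROBE_BYTES = "temp".getBytes(StandardCharsets.UTF_8);

    static void probePath(Path path) throws IOException {
        Path tempFile = path.resolve(TEMP_FILE_NAME);
        // delete any lingering file from a previous failed run
        Files.deleteIfExists(tempFile);
        try (FileChannel channel = FileChannel.open(tempFile,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(PROBE_BYTES)); // actually write some data
            channel.force(true);                          // and fsync it to disk
        }
        Files.delete(tempFile);
    }
}
```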
}catch(IOException ex){
logger.error("Failed to perform writes on path {} due to {}", path, ex);
pathHealthStats.put(path, Status.UNHEALTHY);
} catch(Exception ex){
I don't understand why we don't count this as UNHEALTHY too. Can you explain?
Removed. The initial thought was to catch any unanticipated bug via the parent Exception.
Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(5), TimeValue.timeValueSeconds(1), | ||
Setting.Property.NodeScope, Setting.Property.Dynamic); | ||
public static final Setting<TimeValue> HEALTHCHECK_TIMEOUT_SETTING = | ||
Setting.timeSetting("monitor.fs.health.unhealthy_timeout", TimeValue.timeValueMinutes(5), TimeValue.timeValueMinutes(2), |
5 minutes seems a very long timeout to me. Do we really want to consider a node healthy if it's taking literally minutes to pass this simple check?
I also think we should be stricter about the UNHEALTHY -> HEALTHY transition to try to avoid flapping. What about keeping the node UNHEALTHY until the check passes very quickly (~1 second)?
Done
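For context, a rough sketch of the asymmetric transition described above: flipping back to HEALTHY requires a much faster successful check than staying HEALTHY does. The class name and threshold values are illustrative assumptions, not the PR's actual settings.

```java
// Illustrative hysteresis sketch: thresholds and names are assumptions, not the PR's API.
enum Status { HEALTHY, UNHEALTHY }

final class HysteresisSketch {
    private final long unhealthyThresholdMillis = 60_000; // HEALTHY -> UNHEALTHY after 60s without a pass
    private final long recoveryThresholdMillis = 1_000;   // UNHEALTHY -> HEALTHY only if a check passes within ~1s
    private Status status = Status.HEALTHY;

    synchronized void onCheckCompleted(long checkDurationMillis, boolean succeeded) {
        if (status == Status.HEALTHY) {
            if (succeeded == false || checkDurationMillis > unhealthyThresholdMillis) {
                status = Status.UNHEALTHY;
            }
        } else if (succeeded && checkDurationMillis <= recoveryThresholdMillis) {
            status = Status.HEALTHY;
        }
    }

    synchronized Status status() {
        return status;
    }
}
```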
@Override
protected void doStart() {
//TODO check if this needs to be a part of a dedicated threadpool
scheduledFuture = threadPool.scheduleWithFixedDelay(new FsHealthMonitor(), refreshInterval, ThreadPool.Names.SAME);
I think this should not be on the SAME threadpool since it's doing IO that's potentially slow. GENERIC would be ok, but then I think we need protection to make sure there's only one check running at once.
Done. Also modified the scheduled checks to be one per data path to honour the 1s HEALTHY SLA.
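A minimal sketch of how the per-path checks could be scheduled on the GENERIC pool rather than SAME, assuming the Elasticsearch 7.x ThreadPool API; the class and field names are illustrative and not the PR's exact code.

```java
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.threadpool.Scheduler;
import org.elasticsearch.threadpool.ThreadPool;

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class SchedulingSketch {
    private final List<Scheduler.Cancellable> cancellables = new ArrayList<>();

    void start(ThreadPool threadPool, Path[] dataPaths, TimeValue refreshInterval) {
        for (Path path : dataPaths) {
            // each data path gets its own periodic check on the GENERIC pool;
            // scheduleWithFixedDelay does not overlap runs of the same task
            cancellables.add(threadPool.scheduleWithFixedDelay(
                () -> probe(path), refreshInterval, ThreadPool.Names.GENERIC));
        }
    }

    void stop() {
        cancellables.forEach(Scheduler.Cancellable::cancel);
    }

    private void probe(Path path) {
        // perform the write-and-fsync check for this path (see the earlier sketch)
    }
}
```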
Thanks @DaveCTurner, I have made the modifications as suggested and wrote some basic tests to validate that the happy cases work fine. I'll continue on tests while I get some feedback on the source code (maybe that saves time with revisions). There are a few things, like backward compatibility and integration tests, that still need some work.
server/src/test/java/org/elasticsearch/cluster/coordination/PreVoteCollectorTests.java (outdated, resolved)
@@ -1173,6 +1179,12 @@ public void run() {
return;
}

if(fsService.stats().getTotal().isWritable() == Boolean.FALSE){
I had left out the spaces assuming checkstyle would catch them, but unfortunately it didn't. I'll fix the whitespace.
Ping @DaveCTurner, looks like I again ran into conflicts with master, but only in tests so far. If you can take a look at the code changes and share your feedback, I'll re-raise them with the conflicts resolved.
Hi @DaveCTurner, is there anything I should be doing, other than resolving conflicts and fixing/adding tests, that would help this PR get some traction? I wanted your thoughts on the source code changes and the final approach before proceeding with tests. Please share your thoughts on taking this forward.
I'll re-raise the new revision soon, also addressing the merge conflicts, leveraging
I left a few more suggestions for simplifications and better tests.
server/src/test/java/org/elasticsearch/monitor/fs/FsHealthServiceTests.java (resolved)
import java.util.Set;
import java.util.function.Supplier;

public class NodeFsHealthChecker {
This doesn't seem necessary; it's enough for followers to reject today's health checks.
Agreed. That simplifies things a great deal.
@@ -48,16 +48,19 @@
long total = -1;
long free = -1;
long available = -1;
@Nullable
Boolean isWritable;
I don't think we should add this to the stats -- we aim to remove read-only nodes from the cluster, so this will effectively always be true when collecting stats.
Sure, but when we actually have a read-only node removed, it would still stay around unless an operator intervenes by either fixing some of the issues or replacing it. I feel /_nodes/_local/stats might still serve a good purpose and would let the system know it needs attention.
I think this is already handled by the cluster health API -- the faulty node will report red health when it is removed from the cluster, which is a much clearer indication that action is needed, and we can record helpful details in the logs since we always check the logs in this kind of situation.
While I understand we don't want to expose the stats, is the interface FsService#stats().getTotal().isWritable() acceptable at the multiple places where I have used it to reject requests (PreVoteCollector/JoinHelper/FollowersChecker), or should FsService expose another interface without touching FsInfo at all?
Simply having a RED node from the health API may not distinguish a network issue, a GC pause, or an FS issue, and the remediation actions might differ. Having a metric may help with some automation that would otherwise need a log dive.
Let me know your thoughts anyway.
I'll change this to FsService.FsHealthService#isWritable(); elsewhere I don't think we need to carry any other baggage. Do you think FsHealthService can exist independently, outside MonitorService?
Yes, I don't think FsService should be involved here. FsHealthService makes sense on its own.
@Override
protected void doStart() {
for (Path path : nodeEnv.nodeDataPaths()) {
scheduledFutures.add(threadPool.scheduleWithFixedDelay(new FsPathHealthMonitor(path), refreshInterval,
I think we only need to schedule one task which loops through all paths itself. There's no need to check them in parallel like this.
The idea behind it is that there is still a possibility that multiple data paths are mounted on separate (network) volumes and can fail independently. Since we were publishing stats per data path, it made sense to report the health per data path individually. Let me know if you think otherwise. If you don't think /_nodes/_local/stats adds any value we can consider alternatives.
It's true that the paths can fail independently, but this doesn't matter: we will fail the node if any of the paths are broken. I see that independent checks may be useful for stats, but as per my previous comment I don't think we need to expose this in stats.
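A sketch of the simplification suggested above: one periodic task that checks every data path in sequence and marks the whole node unhealthy if any path fails. The class and field names are assumptions for illustration only.

```java
import java.io.IOException;
import java.nio.file.Path;

final class SinglePassMonitorSketch implements Runnable {
    private final Path[] dataPaths;
    private volatile boolean allPathsWritable = true;

    SinglePassMonitorSketch(Path[] dataPaths) {
        this.dataPaths = dataPaths;
    }

    @Override
    public void run() {
        boolean healthy = true;
        for (Path path : dataPaths) {
            try {
                probe(path); // write-and-fsync check, as in the FsProbeSketch above
            } catch (IOException e) {
                healthy = false; // any broken path marks the whole node unhealthy
            }
        }
        allPathsWritable = healthy;
    }

    boolean allPathsWritable() {
        return allPathsWritable;
    }

    private void probe(Path path) throws IOException {
        // see the FsProbeSketch above
    }
}
```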
private volatile TimeValue healthCheckTimeoutInterval;
private final NodeEnvironment nodeEnv;
private final LongSupplier currentTimeMillisSupplier;
private Map<Path, TimeStampedStatus> pathHealthStats;
This seems unnecessarily detailed. I think we only really need to keep track of the time of the last successful check.
Sure, based on the above discussion I'll simplify further.
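A minimal sketch of the simplified bookkeeping being discussed: only the time of the last fully successful check is tracked, and health is derived from how long ago that was. The LongSupplier stands in for a monotonic clock such as ThreadPool#relativeTimeInMillis; the class name and fields are assumptions, not the PR's code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class LastSuccessSketch {
    private final AtomicLong lastSuccessfulRunMillis;
    private final LongSupplier currentTimeMillisSupplier;
    private final long healthyTimeoutMillis;

    LastSuccessSketch(LongSupplier currentTimeMillisSupplier, long healthyTimeoutMillis) {
        this.currentTimeMillisSupplier = currentTimeMillisSupplier;
        this.healthyTimeoutMillis = healthyTimeoutMillis;
        // start out healthy: the node has one full timeout window to pass its first check
        this.lastSuccessfulRunMillis = new AtomicLong(currentTimeMillisSupplier.getAsLong());
    }

    // called by the periodic monitor after a check completes without errors
    void onSuccessfulCheck() {
        lastSuccessfulRunMillis.set(currentTimeMillisSupplier.getAsLong());
    }

    // healthy as long as some check has succeeded within the configured timeout
    boolean isHealthy() {
        return currentTimeMillisSupplier.getAsLong() - lastSuccessfulRunMillis.get() <= healthyTimeoutMillis;
    }
}
```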
Plumbing looks much better now, much simpler!
I left a few smaller comments on the implementation of FsHealthService. I haven't been through the tests in great detail yet but they look promising.
@FunctionalInterface
public interface NodeHealthService {

enum Status { HEALTHY, UNHEALTHY, UNKNOWN }
Suggest collapsing UNKNOWN with HEALTHY, there's no need to distinguish these cases IMO.
The intent was to cover the case where the health check hasn't yet started (not sure if that's possible) while consumers have already started to poll for health. Changed for now as suggested.
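Roughly how the quoted interface reads once UNKNOWN is collapsed into HEALTHY as suggested; a sketch only, not necessarily the exact shape that was merged.

```java
@FunctionalInterface
public interface NodeHealthService {

    enum Status { HEALTHY, UNHEALTHY }

    // implementations (e.g. a filesystem health service) report the node's current status
    Status getHealth();
}
```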
Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(1), TimeValue.timeValueMillis(10), | ||
Setting.Property.NodeScope); | ||
public static final Setting<TimeValue> HEALTHY_TIMEOUT_SETTING = | ||
Setting.timeSetting("monitor.fs.health.healthy_timeout", TimeValue.timeValueSeconds(1), TimeValue.timeValueMillis(1), |
I checked a few example systems in production and it seems it's not that unusual to see delays of a few tens of seconds that eventually succeed. This is common enough that I think it would be bad to start failing nodes in those cases by default. I will follow up with some of my systems engineering colleagues to agree on a sensible default here, but 1s is certainly too low.
Also a reminder about having a shorter timeout for the UNHEALTHY -> HEALTHY transition vs the HEALTHY -> UNHEALTHY one, mentioned first here: #52680 (comment)
@Override
public Status getHealth() {
if (enabled == false) {
return Status.UNKNOWN;
I think Status.HEALTHY is fine here.
Changed
pathHealthStats.put(path, Status.UNHEALTHY);
}
}
lastRunTimeMillis.getAndUpdate(l -> Math.max(l, currentTimeMillisSupplier.getAsLong()));
If this individual check saw no exceptions but took longer than the timeout interval then I don't think we should record it as a successful run as is done here, since that will result in a fatally-slow node still occasionally reporting itself as healthy, joining the cluster, and then failing again.
Good point. My bad, I missed this.
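A sketch of the fix being agreed above: a check that saw no exception but took longer than the healthy timeout is not recorded as a success, so a fatally slow node stays unhealthy. Names mirror the earlier sketches and are assumptions, not the PR's exact code.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class TimedCheckSketch {
    private final AtomicLong lastSuccessfulRunMillis;
    private final LongSupplier clock;          // e.g. a monotonic relative-time clock
    private final long healthyTimeoutMillis;   // a check slower than this does not count as a success

    TimedCheckSketch(LongSupplier clock, long healthyTimeoutMillis) {
        this.clock = clock;
        this.healthyTimeoutMillis = healthyTimeoutMillis;
        this.lastSuccessfulRunMillis = new AtomicLong(clock.getAsLong());
    }

    void runCheck() {
        long start = clock.getAsLong();
        boolean sawException = false;
        try {
            probeAllPaths(); // write-and-fsync every data path (see earlier sketches)
        } catch (IOException e) {
            sawException = true;
        }
        long elapsed = clock.getAsLong() - start;
        // only an error-free check that also finished within the timeout counts as a successful run
        if (sawException == false && elapsed <= healthyTimeoutMillis) {
            lastSuccessfulRunMillis.set(clock.getAsLong());
        }
    }

    private void probeAllPaths() throws IOException {
        // iterate over the node's data paths and probe each one
    }
}
```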
pathHealthStats.put(path, Status.UNHEALTHY);
}
}
lastRunTimeMillis.getAndUpdate(l -> Math.max(l, currentTimeMillisSupplier.getAsLong()));
Also why Math.max? I think that currentTimeMillisSupplier is ThreadPool#relativeTimeInMillis, which is monotonic.
Ahh, I had originally started lastRunTimeMillis at Long.MIN_VALUE.
}

private void monitorFSHealth() {
if (checkInProgress.compareAndSet(false, true) == false) {
Can that happen? I think threadPool.scheduleWithFixedDelay avoids this?
My bad, I asked for this in #52680 (comment) it seems.
:) I added it just as an additional guarantee. I too think it's not needed.
* and this node needs to be removed from the cluster
*/

public class FsHealthcheckFailureException extends ElasticsearchException {
Nit: "health check" is two words:
public class FsHealthcheckFailureException extends ElasticsearchException {
public class FsHealthCheckFailureException extends ElasticsearchException {
@@ -1041,7 +1041,9 @@ public String toString() {
org.elasticsearch.ingest.IngestProcessorException.class,
org.elasticsearch.ingest.IngestProcessorException::new,
157,
Version.V_7_5_0);
Version.V_7_5_0),
FS_HEALTHCHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthcheckFailureException.class,
nit: two words
FS_HEALTHCHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthcheckFailureException.class,
FS_HEALTH_CHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthCheckFailureException.class,
Documents the feature and settings introduced in elastic#52680.
Thanks @DaveCTurner, I have one concern around BWC during a version upgrade. Would
No, we use a
Today we do not allow a node to start if its filesystem is readonly, but it is possible for a filesystem to become readonly while the node is running. We don't currently have any infrastructure in place to make sure that Elasticsearch behaves well if this happens. A node that cannot write to disk may be poisonous to the rest of the cluster. With this commit we periodically verify that nodes' filesystems are writable. If a node fails these writability checks then it is removed from the cluster and prevented from re-joining until the checks start passing again. Closes elastic#45286
Today we do not allow a node to start if its filesystem is readonly, but it is possible for a filesystem to become readonly while the node is running. We don't currently have any infrastructure in place to make sure that Elasticsearch behaves well if this happens. A node that cannot write to disk may be poisonous to the rest of the cluster. With this commit we periodically verify that nodes' filesystems are writable. If a node fails these writability checks then it is removed from the cluster and prevented from re-joining until the checks start passing again. Closes #45286 Co-authored-by: Bukhtawar Khan <bukhtawar7152@gmail.com>
Documents the feature and settings introduced in #52680. Co-authored-by: James Rodewig <james.rodewig@elastic.co>
In elastic#52680 we introduced a new health check mechanism. This commit fixes up some sporadic related test failures, and improves the behaviour of the `FollowersChecker` slightly in the case that no retries are configured. Closes elastic#59252 Closes elastic#59172
In elastic#52680 we introduced a new health check mechanism. This commit fixes up some related test failures on Windows caused by erroneously assuming that all paths begin with `/`. Closes elastic#59380
In #52680 we introduced a mechanism that will allow nodes to remove themselves from the cluster if they locally determine themselves to be unhealthy. The only check today is that their data paths are all empirically writeable. This commit extends this check to consider a failure of `NodeEnvironment#assertEnvIsLocked()` to be an indication of unhealthiness. Closes #58373
Fixes read-only file system handling, part of the overall proposal for #45286.