Adds resiliency to read-only filesystems #45286 #52680
Conversation
Rebase from fork
…ite to all paths and emits an is_writable stat as part of the node stats API. FsReadOnlyMonitor pulls up the stats and tries to remove the node if not all paths are found to be writable. Addresses elastic#45286.
Pinging @elastic/es-core-infra (:Core/Infra/Resiliency)
Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)
Thanks @Bukhtawar, I left a few initial comments on the general approach. I think this will need some changes to make it testable too, so it won't really make sense to review it in depth until you've added some tests. The CoordinatorTests test suite is a good place to look, as it lets you write tests that hit timeouts without actually having to wait.
Please make sure to reformat your code too - there are a few places where the whitespace doesn't fit the style of the surrounding code.
* Monitor runs on master and listens for events from #ClusterInfoService on node stats. It checks to see if
* a node has all paths writable if not removes the node from the cluster based on the setting monitor.fs.unhealthy.remove_enabled
*/
public class FsReadOnlyMonitor {
This (and the extensions to ClusterInfoService) seems unnecessary. It would be preferable for the FollowersChecker to report a node as unhealthy directly.
If I understand correctly, we don't want NodeClient pulling up FS stats. Instead the FollowerChecker ping should pull the FS health info too?
I'll change the current implementation and use the Transport request handler. Let me know if that's what is expected.
// delete any lingering file from a previous failure
Files.deleteIfExists(resolve);
Files.createFile(resolve);
Files.delete(resolve);
This is too weak a check IMO. It doesn't write any data or fsync anything.
Added an fsync check.
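For reference, a minimal sketch of the kind of stronger probe discussed above: write real bytes and fsync them instead of only creating and deleting an empty file. The class, the TEMP_FILE_NAME constant, and the probe bytes are illustrative assumptions, not the PR's exact code.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FsProbeSketch {
    private static final String TEMP_FILE_NAME = ".es_temp_file"; // assumed marker file name
    private static final byte[] PROBE_BYTES = "temp".getBytes(StandardCharsets.UTF_8);

    static void probePath(Path path) throws IOException {
        Path tempFile = path.resolve(TEMP_FILE_NAME);
        // delete any lingering file from a previous failed run
        Files.deleteIfExists(tempFile);
        try (FileChannel channel = FileChannel.open(tempFile,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap(PROBE_BYTES)); // actually write some data
            channel.force(true);                          // and fsync it to disk
        }
        Files.delete(tempFile);
    }
}
```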
}catch(IOException ex){
logger.error("Failed to perform writes on path {} due to {}", path, ex);
pathHealthStats.put(path, Status.UNHEALTHY);
} catch(Exception ex){
I don't understand why we don't count this as UNHEALTHY too. Can you explain?
Removed. The initial thought was to catch any unanticipated bug via the parent Exception.
Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(5), TimeValue.timeValueSeconds(1), | ||
Setting.Property.NodeScope, Setting.Property.Dynamic); | ||
public static final Setting<TimeValue> HEALTHCHECK_TIMEOUT_SETTING = | ||
Setting.timeSetting("monitor.fs.health.unhealthy_timeout", TimeValue.timeValueMinutes(5), TimeValue.timeValueMinutes(2), |
5 minutes seems a very long timeout to me. Do we really want to consider a node healthy if it's taking literally minutes to pass this simple check?
I also think we should be stricter about the UNHEALTHY -> HEALTHY transition to try to avoid flapping. What about keeping the node UNHEALTHY until the check passes very quickly (~1 second)?
Done
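For context, a rough sketch of the asymmetric transition described above: flipping back to HEALTHY requires a much faster successful check than staying HEALTHY does. The class name and threshold values are illustrative assumptions, not the PR's actual settings.

```java
// Illustrative hysteresis sketch: thresholds and names are assumptions, not the PR's API.
enum Status { HEALTHY, UNHEALTHY }

final class HysteresisSketch {
    private final long unhealthyThresholdMillis = 60_000; // HEALTHY -> UNHEALTHY after 60s without a pass
    private final long recoveryThresholdMillis = 1_000;   // UNHEALTHY -> HEALTHY only if a check passes within ~1s
    private Status status = Status.HEALTHY;

    synchronized void onCheckCompleted(long checkDurationMillis, boolean succeeded) {
        if (status == Status.HEALTHY) {
            if (succeeded == false || checkDurationMillis > unhealthyThresholdMillis) {
                status = Status.UNHEALTHY;
            }
        } else if (succeeded && checkDurationMillis <= recoveryThresholdMillis) {
            status = Status.HEALTHY;
        }
    }

    synchronized Status status() {
        return status;
    }
}
```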
@Override
protected void doStart() {
//TODO check if this needs to be a part of a dedicated threadpool
scheduledFuture = threadPool.scheduleWithFixedDelay(new FsHealthMonitor(), refreshInterval, ThreadPool.Names.SAME);
I think this should not be on the SAME threadpool since it's doing IO that's potentially slow. GENERIC would be ok, but then I think we need protection to make sure there's only one check running at once.
Done. Also modified the scheduled checks to be one per data path to honour the 1s HEALTHY SLA.
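A minimal sketch of how the per-path checks could be scheduled on the GENERIC pool rather than SAME, assuming the Elasticsearch 7.x ThreadPool API; the class and field names are illustrative and not the PR's exact code.

```java
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.threadpool.Scheduler;
import org.elasticsearch.threadpool.ThreadPool;

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

final class SchedulingSketch {
    private final List<Scheduler.Cancellable> cancellables = new ArrayList<>();

    void start(ThreadPool threadPool, Path[] dataPaths, TimeValue refreshInterval) {
        for (Path path : dataPaths) {
            // each data path gets its own periodic check on the GENERIC pool;
            // scheduleWithFixedDelay does not overlap runs of the same task
            cancellables.add(threadPool.scheduleWithFixedDelay(
                () -> probe(path), refreshInterval, ThreadPool.Names.GENERIC));
        }
    }

    void stop() {
        cancellables.forEach(Scheduler.Cancellable::cancel);
    }

    private void probe(Path path) {
        // perform the write-and-fsync check for this path (see the earlier sketch)
    }
}
```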
Thanks @DaveCTurner, I have made the modifications as suggested and wrote some basic tests to validate that the happy cases work fine. I'll continue on tests while I get some feedback on the source code (maybe that saves time with revisions). There are a few things, like backward compatibility and integration tests, that still need some work.
server/src/test/java/org/elasticsearch/cluster/coordination/PreVoteCollectorTests.java (outdated, resolved)
@@ -1173,6 +1179,12 @@ public void run() {
return;
}

if(fsService.stats().getTotal().isWritable() == Boolean.FALSE){
I had left out the spaces assuming checkstyle would catch them, but unfortunately it didn't. I'll fix the whitespace.
Ping @DaveCTurner, looks like I again ran into conflicts with master, but only in tests so far. If you can take a look at the code changes and share your feedback, I'll re-raise them with the conflicts resolved.
Hi @DaveCTurner, is there anything I should be doing, other than resolving conflicts and fixing/adding tests, that would help this PR get some traction? I wanted your thoughts on the source code changes and the final approach before proceeding with tests. Please share your thoughts on taking this forward.
I'll re-raise the new revision soon, also addressing the merge conflicts, leveraging
I left a few more suggestions for simplifications and better tests.
server/src/test/java/org/elasticsearch/monitor/fs/FsHealthServiceTests.java (resolved)
import java.util.Set;
import java.util.function.Supplier;

public class NodeFsHealthChecker {
This doesn't seem necessary; it's enough for followers to reject today's health checks.
Agreed. That simplifies things a great deal.
@@ -48,16 +48,19 @@
long total = -1;
long free = -1;
long available = -1;
@Nullable
Boolean isWritable;
I don't think we should add this to the stats -- we aim to remove read-only nodes from the cluster, so this will effectively always be true when collecting stats.
Sure, but when we actually have a read-only node removed, it would still stay around unless an operator intervenes by either fixing some of the issues or replacing it. I feel /_nodes/_local/stats might still serve a good purpose and would let the system know it needs attention.
I think this is already handled by the cluster health API -- the faulty node will report red health when it is removed from the cluster, which is a much clearer indication that action is needed, and we can record helpful details in the logs since we always check the logs in this kind of situation.
While I understand we don't want to expose the stats, is the interface FsService#stats().getTotal().isWritable() acceptable at the multiple places where I have used it to reject requests (PreVoteCollector/JoinHelper/FollowersChecker), or should FsService expose another interface without touching FsInfo at all?
Simply having a RED node from the health API may not distinguish a network issue, a GC pause, or an FS issue, and the remediation actions might differ. Having a metric may help with some automation that would otherwise need a log dive.
Let me know your thoughts anyway.
I'll change this to FsService.FsHealthService#isWritable(); elsewhere I don't think we need to carry any other baggage. Do you think FsHealthService can exist independently, outside MonitorService?
Yes, I don't think FsService should be involved here. FsHealthService makes sense on its own.
@Override
protected void doStart() {
for (Path path : nodeEnv.nodeDataPaths()) {
scheduledFutures.add(threadPool.scheduleWithFixedDelay(new FsPathHealthMonitor(path), refreshInterval,
I think we only need to schedule one task which loops through all paths itself. There's no need to check them in parallel like this.
The idea behind it is that there is still a possibility that multiple data paths are mounted on separate (network) volumes and can fail independently. Since we were publishing stats per data path, it made sense to report the health per data path individually. Let me know if you think otherwise. If you don't think /_nodes/_local/stats adds any value we can consider alternatives.
It's true that the paths can fail independently, but this doesn't matter: we will fail the node if any of the paths are broken. I see that independent checks may be useful for stats, but as per my previous comment I don't think we need to expose this in stats.
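A sketch of the simplification suggested above: one periodic task that checks every data path in sequence and marks the whole node unhealthy if any path fails. The class and field names are assumptions for illustration only.

```java
import java.io.IOException;
import java.nio.file.Path;

final class SinglePassMonitorSketch implements Runnable {
    private final Path[] dataPaths;
    private volatile boolean allPathsWritable = true;

    SinglePassMonitorSketch(Path[] dataPaths) {
        this.dataPaths = dataPaths;
    }

    @Override
    public void run() {
        boolean healthy = true;
        for (Path path : dataPaths) {
            try {
                probe(path); // write-and-fsync check, as in the FsProbeSketch above
            } catch (IOException e) {
                healthy = false; // any broken path marks the whole node unhealthy
            }
        }
        allPathsWritable = healthy;
    }

    boolean allPathsWritable() {
        return allPathsWritable;
    }

    private void probe(Path path) throws IOException {
        // see the FsProbeSketch above
    }
}
```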
private volatile TimeValue healthCheckTimeoutInterval;
private final NodeEnvironment nodeEnv;
private final LongSupplier currentTimeMillisSupplier;
private Map<Path, TimeStampedStatus> pathHealthStats;
This seems unnecessarily detailed. I think we only really need to keep track of the time of the last successful check.
Sure, based on the above discussion I'll simplify further.
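A minimal sketch of the simplified bookkeeping being discussed: only the time of the last fully successful check is tracked, and health is derived from how long ago that was. The LongSupplier stands in for a monotonic clock such as ThreadPool#relativeTimeInMillis; the class name and fields are assumptions, not the PR's code.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class LastSuccessSketch {
    private final AtomicLong lastSuccessfulRunMillis;
    private final LongSupplier currentTimeMillisSupplier;
    private final long healthyTimeoutMillis;

    LastSuccessSketch(LongSupplier currentTimeMillisSupplier, long healthyTimeoutMillis) {
        this.currentTimeMillisSupplier = currentTimeMillisSupplier;
        this.healthyTimeoutMillis = healthyTimeoutMillis;
        // start out healthy: the node has one full timeout window to pass its first check
        this.lastSuccessfulRunMillis = new AtomicLong(currentTimeMillisSupplier.getAsLong());
    }

    // called by the periodic monitor after a check completes without errors
    void onSuccessfulCheck() {
        lastSuccessfulRunMillis.set(currentTimeMillisSupplier.getAsLong());
    }

    // healthy as long as some check has succeeded within the configured timeout
    boolean isHealthy() {
        return currentTimeMillisSupplier.getAsLong() - lastSuccessfulRunMillis.get() <= healthyTimeoutMillis;
    }
}
```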
Plumbing looks much better now, much simpler!
I left a few smaller comments on the implementation of FsHealthService. I haven't been through the tests in great detail yet but they look promising.
@FunctionalInterface
public interface NodeHealthService {

enum Status { HEALTHY, UNHEALTHY, UNKNOWN }
Suggest collapsing UNKNOWN with HEALTHY, there's no need to distinguish these cases IMO.
The intent was to cover the case where the health check hasn't yet started (not sure if that's possible) while consumers have already started to poll for health. Changed for now as suggested.
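Roughly how the quoted interface reads once UNKNOWN is collapsed into HEALTHY as suggested; a sketch only, not necessarily the exact shape that was merged.

```java
@FunctionalInterface
public interface NodeHealthService {

    enum Status { HEALTHY, UNHEALTHY }

    // implementations (e.g. a filesystem health service) report the node's current status
    Status getHealth();
}
```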
Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(1), TimeValue.timeValueMillis(10), | ||
Setting.Property.NodeScope); | ||
public static final Setting<TimeValue> HEALTHY_TIMEOUT_SETTING = | ||
Setting.timeSetting("monitor.fs.health.healthy_timeout", TimeValue.timeValueSeconds(1), TimeValue.timeValueMillis(1), |
I checked a few example systems in production and it seems it's not that unusual to see delays of a few tens of seconds that eventually succeed. This is common enough that I think it would be bad to start failing nodes in those cases by default. I will follow up with some of my systems engineering colleagues to agree on a sensible default here, but 1s is certainly too low.
Also a reminder about having a shorter timeout for the UNHEALTHY -> HEALTHY transition vs the HEALTHY -> UNHEALTHY one, mentioned first here: #52680 (comment)
@Override
public Status getHealth() {
if (enabled == false) {
return Status.UNKNOWN;
I think Status.HEALTHY is fine here.
Changed
pathHealthStats.put(path, Status.UNHEALTHY);
}
}
lastRunTimeMillis.getAndUpdate(l -> Math.max(l, currentTimeMillisSupplier.getAsLong()));
If this individual check saw no exceptions but took longer than the timeout interval then I don't think we should record it as a successful run as is done here, since that will result in a fatally-slow node still occasionally reporting itself as healthy, joining the cluster, and then failing again.
Good point. My bad, I missed this.
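A sketch of the fix being agreed above: a check that saw no exception but took longer than the healthy timeout is not recorded as a success, so a fatally slow node stays unhealthy. Names mirror the earlier sketches and are assumptions, not the PR's exact code.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

final class TimedCheckSketch {
    private final AtomicLong lastSuccessfulRunMillis;
    private final LongSupplier clock;          // e.g. a monotonic relative-time clock
    private final long healthyTimeoutMillis;   // a check slower than this does not count as a success

    TimedCheckSketch(LongSupplier clock, long healthyTimeoutMillis) {
        this.clock = clock;
        this.healthyTimeoutMillis = healthyTimeoutMillis;
        this.lastSuccessfulRunMillis = new AtomicLong(clock.getAsLong());
    }

    void runCheck() {
        long start = clock.getAsLong();
        boolean sawException = false;
        try {
            probeAllPaths(); // write-and-fsync every data path (see earlier sketches)
        } catch (IOException e) {
            sawException = true;
        }
        long elapsed = clock.getAsLong() - start;
        // only an error-free check that also finished within the timeout counts as a successful run
        if (sawException == false && elapsed <= healthyTimeoutMillis) {
            lastSuccessfulRunMillis.set(clock.getAsLong());
        }
    }

    private void probeAllPaths() throws IOException {
        // iterate over the node's data paths and probe each one
    }
}
```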
pathHealthStats.put(path, Status.UNHEALTHY);
}
}
lastRunTimeMillis.getAndUpdate(l -> Math.max(l, currentTimeMillisSupplier.getAsLong()));
Also why Math.max? I think that currentTimeMillisSupplier is ThreadPool#relativeTimeInMillis, which is monotonic.
Ahh, I had originally started lastRunTimeMillis at Long.MIN_VALUE.
}

private void monitorFSHealth() {
if (checkInProgress.compareAndSet(false, true) == false) {
Can that happen? I think threadPool.scheduleWithFixedDelay avoids this?
My bad, I asked for this in #52680 (comment) it seems.
:) I added it just as an additional guarantee. I too think it's not needed.
* and this node needs to be removed from the cluster
*/

public class FsHealthcheckFailureException extends ElasticsearchException {
Nit: "health check" is two words:
public class FsHealthcheckFailureException extends ElasticsearchException {
public class FsHealthCheckFailureException extends ElasticsearchException {
@@ -1041,7 +1041,9 @@ public String toString() {
org.elasticsearch.ingest.IngestProcessorException.class,
org.elasticsearch.ingest.IngestProcessorException::new,
157,
Version.V_7_5_0);
Version.V_7_5_0),
FS_HEALTHCHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthcheckFailureException.class,
nit: two words
FS_HEALTHCHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthcheckFailureException.class,
FS_HEALTH_CHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthCheckFailureException.class,
Documents the feature and settings introduced in elastic#52680.
Thanks @DaveCTurner, I have one concern around BWC during a version upgrade. Would
No, we use a
Today we do not allow a node to start if its filesystem is readonly, but it is possible for a filesystem to become readonly while the node is running. We don't currently have any infrastructure in place to make sure that Elasticsearch behaves well if this happens. A node that cannot write to disk may be poisonous to the rest of the cluster. With this commit we periodically verify that nodes' filesystems are writable. If a node fails these writability checks then it is removed from the cluster and prevented from re-joining until the checks start passing again. Closes elastic#45286
Today we do not allow a node to start if its filesystem is readonly, but it is possible for a filesystem to become readonly while the node is running. We don't currently have any infrastructure in place to make sure that Elasticsearch behaves well if this happens. A node that cannot write to disk may be poisonous to the rest of the cluster. With this commit we periodically verify that nodes' filesystems are writable. If a node fails these writability checks then it is removed from the cluster and prevented from re-joining until the checks start passing again. Closes #45286 Co-authored-by: Bukhtawar Khan <bukhtawar7152@gmail.com>
Documents the feature and settings introduced in #52680. Co-authored-by: James Rodewig <james.rodewig@elastic.co>
In elastic#52680 we introduced a new health check mechanism. This commit fixes up some sporadic related test failures, and improves the behaviour of the `FollowersChecker` slightly in the case that no retries are configured. Closes elastic#59252 Closes elastic#59172
In elastic#52680 we introduced a new health check mechanism. This commit fixes up some related test failures on Windows caused by erroneously assuming that all paths begin with `/`. Closes elastic#59380
In #52680 we introduced a mechanism that will allow nodes to remove themselves from the cluster if they locally determine themselves to be unhealthy. The only check today is that their data paths are all empirically writeable. This commit extends this check to consider a failure of `NodeEnvironment#assertEnvIsLocked()` to be an indication of unhealthiness. Closes #58373
Fixes read-only file system handling, part of the overall proposal for #45286.