
Adds resiliency to read-only filesystems #45286 #52680

Merged
merged 50 commits on Jul 7, 2020

Conversation

Bukhtawar
Contributor

@Bukhtawar Bukhtawar commented Feb 23, 2020

Fixes read-only file system handling; part of the overall proposal in #45286
Pending

  • Finalisation on TODOs
  • API Backward compatibility
  • Covering missing cases
  • Unit and Integration tests

Bukhtawar and others added 3 commits July 4, 2019 12:15
…ite to all paths and emits an is_writable stat as part of the node stats API.

 FsReadOnlyMonitor pulls up the stats and tries to remove the node if not all paths are found to be writable.
 Addresses elastic#45286.
@DaveCTurner DaveCTurner added :Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. labels Feb 24, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Resiliency)

@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Cluster Coordination)

Contributor

@DaveCTurner DaveCTurner left a comment

Thanks @Bukhtawar, I left a few initial comments on the general approach. I think this will need some changes to make it testable too, so it won't really make sense to review it in depth until you've added some tests. The CoordinatorTests test suite is a good place to look as this lets you write tests that hit timeouts without actually having to wait.

Please make sure to reformat your code too - there are a few places where the whitespace doesn't fit the style of the surrounding code.

* Monitor runs on the master and listens for node stats events from #ClusterInfoService. It checks whether
* a node has all paths writable and, if not, removes the node from the cluster based on the setting monitor.fs.unhealthy.remove_enabled
*/
public class FsReadOnlyMonitor {
Contributor

This (and the extensions to ClusterInfoService) seem unnecessary. It would be preferable for the FollowersChecker to report a node as unhealthy directly.

Contributor Author

If I understand correctly, we don't want NodeClient pulling up FS stats. Should the FollowerChecker ping pull the FS health info instead?

Contributor Author

I'll change the current implementation and use the Transport request handler. Let me know if that's what is expected.

// delete any lingering file from a previous failure
Files.deleteIfExists(resolve);
Files.createFile(resolve);
Files.delete(resolve);
Contributor

This is too weak a check IMO. It doesn't write any data or fsync anything.

Contributor Author

Added fsync check
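For reference, a minimal sketch of what a write-and-fsync probe could look like (the temp file name and payload here are illustrative, not necessarily the PR's exact implementation):

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class FsProbe {
    // Write real bytes and force them to disk so a read-only or stalled
    // filesystem is actually exercised, not just metadata operations.
    static void checkWritable(Path path) throws IOException {
        final Path tempFile = path.resolve(".es_health_check"); // illustrative name
        Files.deleteIfExists(tempFile); // remove any leftover from a previous failure
        try (FileChannel channel = FileChannel.open(tempFile,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE)) {
            channel.write(ByteBuffer.wrap("health check".getBytes(StandardCharsets.UTF_8)));
            channel.force(true); // fsync: fails fast if the filesystem is read-only or broken
        }
        Files.delete(tempFile);
    }
}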

} catch (IOException ex) {
    logger.error("Failed to perform writes on path {} due to {}", path, ex);
    pathHealthStats.put(path, Status.UNHEALTHY);
} catch (Exception ex) {
Contributor

I don't understand why we don't count this as UNHEALTHY too. Can you explain?

Contributor Author

Removed. The initial thought was to catch any unanticipated bug via the parent Exception.

Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(5), TimeValue.timeValueSeconds(1),
Setting.Property.NodeScope, Setting.Property.Dynamic);
public static final Setting<TimeValue> HEALTHCHECK_TIMEOUT_SETTING =
Setting.timeSetting("monitor.fs.health.unhealthy_timeout", TimeValue.timeValueMinutes(5), TimeValue.timeValueMinutes(2),
Contributor

5 minutes seems a very long timeout to me. Do we really want to consider a node healthy if it's taking literally minutes to pass this simple check?

I also think we should be stricter about the UNHEALTHY -> HEALTHY transition to try and avoid flapping. What about keeping the node UNHEALTHY until the check passes very quickly (~1 second?)

Contributor Author

Done

@Override
protected void doStart() {
//TODO check if this needs to be a part of a dedicated threadpool
scheduledFuture = threadPool.scheduleWithFixedDelay(new FsHealthMonitor(), refreshInterval, ThreadPool.Names.SAME);
Contributor

I think this should not be on the SAME threadpool since it's doing IO that's potentially slow. GENERIC would be ok, but then I think we need protection to make sure there's only one check running at once.

Contributor Author

Done. Also modified the scheduled checks to be one per data path to honour the 1s HEALTHY SLA
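For illustration, the reviewer's suggestion could look roughly like the following, reusing the scheduleWithFixedDelay call shown above but targeting the GENERIC pool and guarding against overlapping runs (the guard field is an assumption for this sketch):

// Run the IO-bound check off the GENERIC pool rather than SAME, and skip a
// run if the previous one is still in flight.
private final AtomicBoolean checkInProgress = new AtomicBoolean(); // java.util.concurrent.atomic

@Override
protected void doStart() {
    scheduledFuture = threadPool.scheduleWithFixedDelay(() -> {
        if (checkInProgress.compareAndSet(false, true)) {
            try {
                new FsHealthMonitor().run();
            } finally {
                checkInProgress.set(false);
            }
        }
    }, refreshInterval, ThreadPool.Names.GENERIC);
}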

@Bukhtawar
Contributor Author

Thanks @DaveCTurner, I have made the modifications as suggested and written some basic tests to validate that the happy cases work fine. I'll continue on tests while I get some feedback on the source code (it may save time on revisions). There are a few things, like backward compatibility and integration tests, that still need some work.

@@ -1173,6 +1179,12 @@ public void run() {
return;
}

if(fsService.stats().getTotal().isWritable() == Boolean.FALSE){
Contributor Author

I left out the spaces assuming checkstyle would catch it, but unfortunately it didn't. I'll fix the whitespace.

@Bukhtawar
Contributor Author

Bukhtawar commented Apr 2, 2020

Ping @DaveCTurner,
Hope you are doing well!

Looks like I ran into conflicts with master again, but only in tests so far. If you can take a look at the code changes and share your feedback, I'll re-raise them with the conflicts resolved.

@Bukhtawar
Contributor Author

Hi @DaveCTurner, is there anything I should be doing, other than resolving conflicts and fixing/adding tests, that would help this PR get some traction? I wanted your thoughts on the source code changes and the final approach before proceeding with tests. Please share your thoughts on taking this forward.

@Bukhtawar
Contributor Author

I'll re-raise the new revision soon, addressing the merge conflicts, leveraging the FollowersChecker as the transport mechanism, and cutting down on the FS stats broadcast.

Contributor

@DaveCTurner DaveCTurner left a comment

I left a few more suggestions for simplifications and better tests.

import java.util.Set;
import java.util.function.Supplier;

public class NodeFsHealthChecker {
Contributor

This doesn't seem necessary; it's enough for followers to reject today's health checks.

Contributor Author

Agreed. That simplifies things a great deal.

@@ -48,16 +48,19 @@
long total = -1;
long free = -1;
long available = -1;
@Nullable
Boolean isWritable;
Contributor

I don't think we should add this to the stats -- we aim to remove read-only nodes from the cluster, so this will effectively always be true when collecting stats.

Contributor Author

Sure, but once a read-only node is actually removed it will still stay around unless an operator intervenes, either by fixing the underlying issue or by replacing the node. I feel /_nodes/_local/stats might still serve a good purpose and would signal that the node needs attention.

Contributor

I think this is already handled by the cluster health API -- the faulty node will report red health when it is removed from the cluster which is a much clearer indication that action is needed, and we can record helpful details in the logs since we always check the logs in this kind of situation.

Contributor Author

@Bukhtawar Bukhtawar Apr 21, 2020

While I understand we don't want to expose the stats, is the interface FsService#stats().getTotal().isWritable() acceptable in the multiple places where I have used it to reject requests (PrevoteCollector/JoinHelper/FollowersChecker), or should FsService expose another interface without touching FsInfo at all?

Simply having a RED node from the health API may not distinguish a network issue, a GC pause, or an FS issue, and the remediation actions might differ. Having a metric may help with some automation that would otherwise need a log dive.

Let me know your thoughts either way.

Contributor Author

@Bukhtawar Bukhtawar Apr 22, 2020

I'll change this to FsService.FsHealthService#isWritable(); elsewhere I don't think we need to carry any other baggage. Do you think FsHealthService can exist independently, outside MonitorService?

Contributor

Yes, I don't think FsService should be involved here. FsHealthService makes sense on its own.

@Override
protected void doStart() {
for (Path path : nodeEnv.nodeDataPaths()) {
scheduledFutures.add(threadPool.scheduleWithFixedDelay(new FsPathHealthMonitor(path), refreshInterval,
Contributor

I think we only need to schedule one task which loops through all paths itself. There's no need to check them in parallel like this.

Contributor Author

The idea behind it is that multiple data paths can be mounted on separate (network) volumes and can fail independently. Since we were publishing stats per data path, it made sense to report health individually per data path. Let me know if you think otherwise. If you don't think /_nodes/_local/stats adds any value, we can consider alternatives.

Contributor

It's true that the paths can fail independently but this doesn't matter, we will fail the node if any of the paths are broken. I see that independent checks may be useful for stats but as per my previous comment I don't think we need to expose this in stats.

private volatile TimeValue healthCheckTimeoutInterval;
private final NodeEnvironment nodeEnv;
private final LongSupplier currentTimeMillisSupplier;
private Map<Path, TimeStampedStatus> pathHealthStats;
Contributor

This seems unnecessarily detailed. I think we only really need to keep track of the time of the last successful check.

Contributor Author

Sure, based on the above discussion I'll simplify further.
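A rough sketch of that simplification — one pass over all data paths per scheduled run, reporting failure if any path fails (the probe helper and field names are illustrative):

// Check every data path in a single pass; the node is healthy only if all pass.
private boolean checkAllPathsWritable() {
    boolean allPathsHealthy = true;
    for (Path path : nodeEnv.nodeDataPaths()) {
        try {
            checkWritable(path); // write + fsync probe, as sketched earlier
        } catch (IOException e) {
            logger.error("health check of path [" + path + "] failed", e);
            allPathsHealthy = false;
        }
    }
    return allPathsHealthy;
}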

@Bukhtawar Bukhtawar requested a review from DaveCTurner May 3, 2020 23:30
Contributor

@DaveCTurner DaveCTurner left a comment

Plumbing looks much better now, much simpler!

I left a few smaller comments on the implementation of FsHealthService. I haven't been through the tests in great detail yet but they look promising.

@FunctionalInterface
public interface NodeHealthService {

enum Status { HEALTHY, UNHEALTHY, UNKNOWN }
Contributor

Suggest collapsing UNKNOWN with HEALTHY, there's no need to distinguish these cases IMO.

Contributor Author

@Bukhtawar Bukhtawar May 4, 2020

The intent was to cover the case where the health check hasn't started yet (not sure if that's possible) while consumers have already begun to poll for health. Changed for now as suggested.
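After collapsing UNKNOWN into HEALTHY, the contract reduces to something along these lines (a sketch; the merged interface may differ in detail):

@FunctionalInterface
public interface NodeHealthService {

    enum Status { HEALTHY, UNHEALTHY }

    Status getHealth();
}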

Setting.timeSetting("monitor.fs.health.refresh_interval", TimeValue.timeValueSeconds(1), TimeValue.timeValueMillis(10),
Setting.Property.NodeScope);
public static final Setting<TimeValue> HEALTHY_TIMEOUT_SETTING =
Setting.timeSetting("monitor.fs.health.healthy_timeout", TimeValue.timeValueSeconds(1), TimeValue.timeValueMillis(1),
Contributor

I checked a few example systems in production and it seems like it's not that unusual to see delays of a few 10s of seconds that eventually succeed. This is common enough that I think it would be bad to start failing nodes in those cases by default. I will follow up with some of my systems engineering colleagues to agree on a sensible default here, but 1s is certainly too low.

Also a reminder about having a shorter timeout for the UNHEALTHY -> HEALTHY transition vs the HEALTHY -> UNHEALTHY one, mentioned first here: #52680 (comment)
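For illustration, the asymmetric transition could be expressed along these lines, where a node flips to UNHEALTHY once a check exceeds the longer threshold but only flips back once a check completes within the shorter one (the threshold field names are assumptions for this sketch):

// Hysteresis to avoid flapping: slow checks mark the node UNHEALTHY quickly,
// but only a genuinely fast check marks it HEALTHY again.
private volatile Status status = Status.HEALTHY;

void onCheckCompleted(long elapsedMillis) {
    if (elapsedMillis > unhealthyThresholdMillis) {
        status = Status.UNHEALTHY;
    } else if (status == Status.UNHEALTHY && elapsedMillis <= healthyThresholdMillis) {
        status = Status.HEALTHY;
    }
    // otherwise keep the current status
}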

@Override
public Status getHealth() {
if (enabled == false) {
return Status.UNKNOWN;
Contributor

I think Status.HEALTHY is fine here.

Contributor Author

Changed

pathHealthStats.put(path, Status.UNHEALTHY);
}
}
lastRunTimeMillis.getAndUpdate(l -> Math.max(l, currentTimeMillisSupplier.getAsLong()));
Contributor

If this individual check saw no exceptions but took longer than the timeout interval then I don't think we should record it as a successful run as is done here, since that will result in a fatally-slow node still occasionally reporting itself as healthy, joining the cluster, and then failing again.

Contributor Author

Good point. My bad I missed this
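A sketch of how that could be tightened — measure the elapsed time of the run and only record it as successful if it both saw no exceptions and completed within the timeout (names adapted from the snippet above, the rest illustrative):

// Do not treat a run that stalled for longer than the timeout as a success,
// even if it eventually saw no exceptions.
final long startTimeMillis = currentTimeMillisSupplier.getAsLong();
final boolean allPathsHealthy = checkAllPathsWritable();
final long elapsedMillis = currentTimeMillisSupplier.getAsLong() - startTimeMillis;
if (allPathsHealthy && elapsedMillis < healthyTimeoutMillis) {
    lastRunTimeMillis.getAndUpdate(l -> Math.max(l, currentTimeMillisSupplier.getAsLong()));
}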

pathHealthStats.put(path, Status.UNHEALTHY);
}
}
lastRunTimeMillis.getAndUpdate(l -> Math.max(l, currentTimeMillisSupplier.getAsLong()));
Contributor

Also why Math.max? I think that currentTimeMillisSupplier is ThreadPool#relativeTimeInMillis which is monotonic.

Contributor Author

Ah, I originally initialised lastRunTimeMillis to Long.MIN_VALUE.

}

private void monitorFSHealth() {
if (checkInProgress.compareAndSet(false, true) == false) {
Contributor

Can that happen? I think threadPool.scheduleWithFixedDelay avoids this?

Contributor

My bad, I asked for this in #52680 (comment) it seems.

Contributor Author

:) I added it just as an additional guarantee. I too think it's not needed.

* and this node needs to be removed from the cluster
*/

public class FsHealthcheckFailureException extends ElasticsearchException {
Contributor

Nit: "health check" is two words:

Suggested change
public class FsHealthcheckFailureException extends ElasticsearchException {
public class FsHealthCheckFailureException extends ElasticsearchException {

@@ -1041,7 +1041,9 @@ public String toString() {
org.elasticsearch.ingest.IngestProcessorException.class,
org.elasticsearch.ingest.IngestProcessorException::new,
157,
Version.V_7_5_0);
Version.V_7_5_0),
FS_HEALTHCHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthcheckFailureException.class,
Contributor

nit: two words

Suggested change
FS_HEALTHCHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthcheckFailureException.class,
FS_HEALTH_CHECK_FAILURE_EXCEPTION(org.elasticsearch.cluster.coordination.FsHealthCheckFailureException.class,

@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 7, 2020
Documents the feature and settings introduced in elastic#52680.
@Bukhtawar
Contributor Author

Thanks @DaveCTurner, I have one concern around BWC during a version upgrade. Would NodeHealthCheckFailureException cause serialisation exceptions at the master in a mixed-version cluster, for example if the master is on an older version and a read-only data node is on a newer version?

@DaveCTurner
Contributor

No, we use a NotSerializableExceptionWrapper in that case.

DaveCTurner pushed a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 7, 2020
Today we do not allow a node to start if its filesystem is readonly, but
it is possible for a filesystem to become readonly while the node is
running. We don't currently have any infrastructure in place to make
sure that Elasticsearch behaves well if this happens. A node that cannot
write to disk may be poisonous to the rest of the cluster.

With this commit we periodically verify that nodes' filesystems are
writable. If a node fails these writability checks then it is removed
from the cluster and prevented from re-joining until the checks start
passing again.

Closes elastic#45286
DaveCTurner added a commit that referenced this pull request Jul 7, 2020
Today we do not allow a node to start if its filesystem is readonly, but
it is possible for a filesystem to become readonly while the node is
running. We don't currently have any infrastructure in place to make
sure that Elasticsearch behaves well if this happens. A node that cannot
write to disk may be poisonous to the rest of the cluster.

With this commit we periodically verify that nodes' filesystems are
writable. If a node fails these writability checks then it is removed
from the cluster and prevented from re-joining until the checks start
passing again.

Closes #45286

Co-authored-by: Bukhtawar Khan <bukhtawar7152@gmail.com>
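As a rough illustration of how the health status is consumed after the merge, a coordination-layer component can consult the NodeHealthService and refuse to proceed while the node reports itself unhealthy (the call site, request type, and message below are illustrative, not the exact merged code):

// Illustrative guard: an unhealthy node rejects follower checks, which causes
// the master to remove it and keeps it from rejoining until checks pass again.
void handleFollowerCheck(FollowerCheckRequest request) {
    if (nodeHealthService.getHealth() == NodeHealthService.Status.UNHEALTHY) {
        throw new NodeHealthCheckFailureException("node is unhealthy, rejecting follower check");
    }
    // ... normal follower check handling ...
}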
DaveCTurner added a commit that referenced this pull request Jul 7, 2020
Documents the feature and settings introduced in #52680.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
DaveCTurner added a commit that referenced this pull request Jul 7, 2020
Documents the feature and settings introduced in #52680.

Co-authored-by: James Rodewig <james.rodewig@elastic.co>
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 9, 2020
In elastic#52680 we introduced a new health check mechanism. This commit fixes
up some sporadic related test failures, and improves the behaviour of
the `FollowersChecker` slightly in the case that no retries are
configured.

Closes elastic#59252
Closes elastic#59172
DaveCTurner added a commit that referenced this pull request Jul 9, 2020
In #52680 we introduced a new health check mechanism. This commit fixes
up some sporadic related test failures, and improves the behaviour of
the `FollowersChecker` slightly in the case that no retries are
configured.

Closes #59252
Closes #59172
DaveCTurner added a commit that referenced this pull request Jul 9, 2020
In #52680 we introduced a new health check mechanism. This commit fixes
up some sporadic related test failures, and improves the behaviour of
the `FollowersChecker` slightly in the case that no retries are
configured.

Closes #59252
Closes #59172
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Jul 13, 2020
In elastic#52680 we introduced a new health check mechanism. This commit fixes
up some related test failures on Windows caused by erroneously assuming
that all paths begin with `/`.

Closes elastic#59380
DaveCTurner added a commit that referenced this pull request Jul 13, 2020
In #52680 we introduced a new health check mechanism. This commit fixes
up some related test failures on Windows caused by erroneously assuming
that all paths begin with `/`.

Closes #59380
DaveCTurner added a commit that referenced this pull request Jul 13, 2020
In #52680 we introduced a new health check mechanism. This commit fixes
up some related test failures on Windows caused by erroneously assuming
that all paths begin with `/`.

Closes #59380
@Bukhtawar Bukhtawar deleted the ro-fs-handling branch July 15, 2020 06:33
DaveCTurner pushed a commit that referenced this pull request Sep 22, 2020
In #52680 we introduced a mechanism that will allow nodes to remove
themselves from the cluster if they locally determine themselves to be
unhealthy. The only check today is that their data paths are all
empirically writeable. This commit extends this check to consider a
failure of `NodeEnvironment#assertEnvIsLocked()` to be an indication of
unhealthiness.

Closes #58373
DaveCTurner pushed a commit that referenced this pull request Sep 22, 2020
In #52680 we introduced a mechanism that will allow nodes to remove
themselves from the cluster if they locally determine themselves to be
unhealthy. The only check today is that their data paths are all
empirically writeable. This commit extends this check to consider a
failure of `NodeEnvironment#assertEnvIsLocked()` to be an indication of
unhealthiness.

Closes #58373
Labels
:Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. >enhancement Team:Core/Infra Meta label for core/infra team Team:Distributed Meta label for distributed team (obsolete) v7.9.0 v8.0.0-alpha1