
Switch indices read-only if a node runs out of disk space #25541

Merged
merged 8 commits into from Jul 5, 2017

Conversation

6 participants
@s1monw
Contributor

commented Jul 4, 2017

Today, when a node runs out of disk space, all kinds of unexpected things can happen, and nodes become hard to maintain once the disk is full. While we try to move shards away when the existing watermarks are hit, this is not possible in many situations. Based on the discussion in #24299, this change monitors disk utilization and adds a flood-stage watermark: all indices with shards allocated on a node that exceeds the flood-stage mark are switched to read-only (with the option to be deleted). Subsequent write requests are rejected, which lets users react to the low-disk situation, and users can switch individual indices back to read-write once the situation is resolved. There is no automatic read-write switch once the node has enough space again; this requires user interaction.

The flood-stage watermark is set to `95%` utilization by default.

Closes #24299
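
As a rough sketch of the manual recovery step described above (not part of this change's public API, and assuming the block is exposed as the index setting `index.blocks.read_only_allow_delete`), an operator-side helper that switches an index back to read-write could look something like this:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.common.settings.Settings;

public final class FloodStageRecoveryHelper {

    /**
     * Clears the assumed read-only/allow-delete block from the given index once
     * disk space has been freed. The setting key is an assumption for illustration.
     */
    public static void switchBackToReadWrite(Client client, String index) {
        client.admin().indices()
            .prepareUpdateSettings(index)
            .setSettings(Settings.builder()
                .putNull("index.blocks.read_only_allow_delete") // reset to the default (no block)
                .build())
            .get();
    }
}
```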

@bleskes

bleskes approved these changes Jul 5, 2017

Member

left a comment

LGTM. Left a bunch of small nits.

if (usage.getFreeBytes() < diskThresholdSettings.getFreeBytesThresholdFloodStage().getBytes() ||
usage.getFreeDiskAsPercentage() < diskThresholdSettings.getFreeDiskThresholdFloodStage()) {
RoutingNode routingNode = state.getRoutingNodes().node(node);
if (routingNode != null) { // this might happen if we haven't got the full cluster-state yet?!

@bleskes

bleskes Jul 5, 2017

Member

Sadly, this happens when you have a non-data node. It's been on the hit list to remove (it's annoying IMO and is a leftover from the node-as-client days); no one has gotten to it yet.


protected void markIndicesReadOnly(Set<String> indicesToMarkReadOnly) {
// set read-only block but don't block on the response
client.admin().indices().prepareUpdateSettings(indicesToMarkReadOnly.toArray(Strings.EMPTY_ARRAY)).

@bleskes

bleskes Jul 5, 2017

Member

I think we should protect against errors here and log a warning. There doesn't seem to be any protection higher up the call stack.

@s1monw

s1monw Jul 5, 2017

Author Contributor

We do catch errors in InternalClusterInfoService and log a warning, so I think we are good here?

@bleskes

bleskes Jul 5, 2017

Member

You're right - I missed it.
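
For context, the fire-and-forget update with a logged warning that this exchange refers to could look roughly like the following continuation of the markIndicesReadOnly fragment above (illustrative only; the listener body, logger, and settings key are assumptions, not the actual code in InternalClusterInfoService):

```java
// Illustrative sketch: apply the block without waiting, but surface failures as a warning.
client.admin().indices()
    .prepareUpdateSettings(indicesToMarkReadOnly.toArray(Strings.EMPTY_ARRAY))
    .setSettings(Settings.builder()
        .put("index.blocks.read_only_allow_delete", true) // assumed block setting
        .build())
    .execute(new ActionListener<UpdateSettingsResponse>() {
        @Override
        public void onResponse(UpdateSettingsResponse response) {
            // block applied; nothing further to do
        }

        @Override
        public void onFailure(Exception e) {
            logger.warn("failed to mark indices read-only after hitting the flood stage watermark", e);
        }
    });
```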

@@ -42,6 +42,10 @@
new Setting<>("cluster.routing.allocation.disk.watermark.high", "90%",
(s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.watermark.high"),
Setting.Property.Dynamic, Setting.Property.NodeScope);
public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_SETTING =
new Setting<>("cluster.routing.allocation.disk.floodstage", "95%",

@bleskes

bleskes Jul 5, 2017

Member

We're missing watermark in the setting path (i.e. disk.floodstage vs. disk.watermark.floodstage). Feels a bit strange to me, but I'm not a native speaker. @dakrone, WDYT?

@s1monw

s1monw Jul 5, 2017

Author Contributor

Yeah, my logic was: we have two watermarks (high and low) plus the flood stage, so I left watermark out of the key?!

@s1monw

s1monw Jul 5, 2017

Author Contributor

I am curious what @rjernst thinks; I am OK with whatever makes sense.

@jasontedor

jasontedor Jul 5, 2017

Member

I think the key should be cluster.routing.allocation.disk.watermark.flood_stage.

@@ -199,6 +199,7 @@ public void apply(Settings value, Settings current, Settings previous) {
ThrottlingAllocationDecider.CLUSTER_ROUTING_ALLOCATION_NODE_CONCURRENT_RECOVERIES_SETTING,
DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_LOW_DISK_WATERMARK_SETTING,
DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_HIGH_DISK_WATERMARK_SETTING,
DiskThresholdSettings.CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_SETTING,

@bleskes

bleskes Jul 5, 2017

Member

Can we add DISK to this name (with or without WATERMARK, see the other comment)? I.e. CLUSTER_ROUTING_ALLOCATION_FLOOD_STAGE_DISK_WATERMARK or something?
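
Putting the two naming suggestions together (watermark in the setting path, DISK and WATERMARK in the constant name), the definition could end up looking something like the following, mirroring the existing high watermark setting in the hunk above; purely illustrative, since the final names were still under discussion:

```java
public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_WATERMARK_SETTING =
    new Setting<>("cluster.routing.allocation.disk.watermark.flood_stage", "95%",
        (s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.watermark.flood_stage"),
        Setting.Property.Dynamic, Setting.Property.NodeScope);
```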

@@ -590,7 +596,7 @@ public Node start() throws NodeValidationException {
injector.getInstance(SnapshotShardsService.class).start();
injector.getInstance(RoutingService.class).start();
injector.getInstance(SearchService.class).start();
injector.getInstance(MonitorService.class).start();
nodeService.getMonitorService().start();

@@ -762,7 +768,7 @@ public synchronized void close() throws IOException {
toClose.add(() -> stopWatch.stop().start("discovery"));
toClose.add(injector.getInstance(Discovery.class));
toClose.add(() -> stopWatch.stop().start("monitor"));
toClose.add(injector.getInstance(MonitorService.class));
toClose.add(nodeService.getMonitorService());

@bleskes

bleskes Jul 5, 2017

Member

This is redundant; it seems to already be closed by NodeService. It's a bit weird that NodeService closes the monitor service but doesn't own calling the stop/start methods. Not a big deal though.

@s1monw

s1monw Jul 5, 2017

Author Contributor

That's a leftover from a different approach I tried...

@s1monw

Contributor Author

commented Jul 5, 2017

@bleskes Today we don't run a reroute if we go across the flood stage; the question is whether we should? I mean, in reality we would likely have already moved shards away if we could?

@dakrone

dakrone approved these changes Jul 5, 2017

Member

left a comment

LGTM, I left one super minor nit, thanks for doing this!

@@ -917,8 +923,8 @@ protected Node newTribeClientNode(Settings settings, Collection<Class<? extends

/** Constructs a ClusterInfoService which may be mocked for tests. */
protected ClusterInfoService newClusterInfoService(Settings settings, ClusterService clusterService,
ThreadPool threadPool, NodeClient client) {
return new InternalClusterInfoService(settings, clusterService, threadPool, client);
ThreadPool threadPool, NodeClient client, Consumer<ClusterInfo> listeners) {

@dakrone

dakrone Jul 5, 2017

Member

Minor nit: listeners -> listener

@bleskes

Member

commented Jul 5, 2017

@bleskes Today we don't run a reroute if we go across the flood stage; the question is whether we should? I mean, in reality we would likely have already moved shards away if we could?

I think we can rely on the fact that if the flood stage is on, then the high watermark is also on, which has already kicked off the reroute (as you say). I also think that the flood stage is about making indices read-only (and only that); a follow-up reroute that moves shards won't change that (as we don't automatically make the indices writable again). So +1 to keeping it as is and not adding a reroute.

@dakrone

Member

commented Jul 5, 2017

Oh, something else I forgot to mention: I think we also need to document this feature and the new flood level?

@@ -42,6 +42,10 @@
new Setting<>("cluster.routing.allocation.disk.watermark.high", "90%",
(s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.watermark.high"),
Setting.Property.Dynamic, Setting.Property.NodeScope);
public static final Setting<String> CLUSTER_ROUTING_ALLOCATION_DISK_FLOOD_STAGE_SETTING =
new Setting<>("cluster.routing.allocation.disk.floodstage", "95%",
(s) -> validWatermarkSetting(s, "cluster.routing.allocation.disk.floodstage"),

@jasontedor

jasontedor Jul 5, 2017

Member

It seems that we have no validation that low <= high <= flood, but I think that we should?

@s1monw

s1monw Jul 5, 2017

Author Contributor

I am happy to do this, but it should be a separate PR. It's also tricky to do, since we don't have a full picture of the new settings when they are updated; essentially, I don't think we can easily protect against this today. To make it work, I think we need an extra validation round when settings are updated.

@jasontedor

jasontedor Jul 5, 2017

Member

For this I opened #25560 as a possible approach.
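
For reference, the invariant under discussion is simply low <= high <= flood stage. A standalone check over already-parsed percentage values might look like the sketch below (illustrative only; as noted above, the hard part is wiring such a check into the settings-update path, which is what #25560 explores, and this is not necessarily that PR's approach):

```java
/**
 * Validates that the disk watermarks are consistently ordered.
 * Assumes the raw watermark settings have already been parsed into percentages.
 */
static void validateWatermarkOrder(double lowPercent, double highPercent, double floodStagePercent) {
    if (lowPercent > highPercent || highPercent > floodStagePercent) {
        throw new IllegalArgumentException("disk watermarks must satisfy low <= high <= flood stage, but got low="
            + lowPercent + ", high=" + highPercent + ", flood stage=" + floodStagePercent);
    }
}
```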

@jasontedor
Member

left a comment

I left a question. I think we need docs too.

@s1monw

Contributor Author

commented Jul 5, 2017

I was waiting for initial feedback before documenting this; will add docs soon.

s1monw added some commits Jul 5, 2017

@s1monw

Contributor Author

commented Jul 5, 2017

@jasontedor @dakrone I pushed docs and the requested changes.

@jasontedor
Member

left a comment

LGTM.

@s1monw s1monw merged commit 6e5cc42 into elastic:master Jul 5, 2017

2 checks passed

CLA: Commit author is a member of Elasticsearch
elasticsearch-ci: Build finished.

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jul 6, 2017

Merge branch 'master' into multiple-settings-validation
* master:
  Refactor PathTrie and RestController to use a single trie for all methods (elastic#25459)
  Switch indices read-only if a node runs out of disk space (elastic#25541)