
[Feature Request] Make Elasticsearch crash after Caused by: java.io.IOException: No space left on device, rather than spamming logs #24299

Closed
nullpixel opened this issue Apr 24, 2017 · 34 comments
Labels: :Core/Infra/Core, discuss, resiliency

@nullpixel

Describe the feature: As per #20354, logs are spammed with Caused by: java.io.IOException: No space left on device when the disk space runs out. Why not make it leave a single message in the logs, then just crash? As the disk is full, the database cannot run anyway, so rather than spamming logs, just adopt a simpler approach.

This is a production database, and I wouldn't expect my logs to look like that after waking up.

@martijnvg
Member

I don't think that crashing the node is the best approach when there is no space left on device. The node can still serve read requests and if indices are removed/relocated (or in this case old log files are removed) then there should be sufficient space to handle write requests.

In case of log files filling up, maybe ES should try to prevent this by capping the size of the log file, for example as is done for deprecation logging (with log4j's SizeBasedTriggeringPolicy)?
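For illustration, a rough sketch of what such a size-based rollover might look like in log4j2.properties; the appender name, paths, pattern, size, and archive count below are arbitrary placeholders, not shipped defaults:

```properties
# Hypothetical rolling appender capped by size: roll at 128MB and keep at most
# four compressed archives, so the main log can never grow without bound.
appender.size_capped.type = RollingFile
appender.size_capped.name = size_capped
appender.size_capped.fileName = /var/log/elasticsearch/elasticsearch.log
appender.size_capped.filePattern = /var/log/elasticsearch/elasticsearch-%i.log.gz
appender.size_capped.layout.type = PatternLayout
appender.size_capped.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] %m%n
appender.size_capped.policies.type = Policies
appender.size_capped.policies.size.type = SizeBasedTriggeringPolicy
appender.size_capped.policies.size.size = 128MB
appender.size_capped.strategy.type = DefaultRolloverStrategy
appender.size_capped.strategy.max = 4

rootLogger.level = info
rootLogger.appenderRef.size_capped.ref = size_capped
```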

@clintongormley added the :Core/Infra/Core, discuss, and resiliency labels Apr 25, 2017
@clintongormley

Also, a disk might fill up because of a big merge. As soon as that fails, disk space could be freed up again allowing the node to continue working.

@khionu

khionu commented Apr 25, 2017

In that case, it might be best to put a hold on the operations that are unable to complete.

A queue that retries at a much lower rate until operations succeed would address the concerns about staying available for reads and waiting for disk to be freed, without causing undue spam. For an exception that requires manual intervention, a retry rate of once a minute would be acceptable, in my opinion.

@s1monw
Contributor

s1monw commented Apr 25, 2017

I agree that killing the node might not be the right thing here from a 10k-feet view, though I can see it as a viable solution for some use cases. Out of the box I'd like us to rather detect that we are close to a full disk and then simply stop write operations like merging and indexing. Yet this still has issues: if we have a replica on the node we'd reject a write, which would bring the replica out of sync at that moment. Disk-threshold allocation deciders should help here by moving stuff away from the node, but essentially killing it and leaving the cluster would be the right thing to do here IMO. If we wanna do that we need to somehow add corresponding handlers in many places, or we start with adding it into the engine and allow folks to opt out of it? I generally think we should kill nodes more often in disaster situations instead of just sitting there and waiting for more disaster to happen.

@nullpixel
Author

Yeah, so you cannot write to the disk if it's full, but spamming the logs isn't the way to go.

We could do a "read only" style mode where it just logs that it went read-only because there is no disk space. There's no need to spam the logs, even with a cap.

@jasontedor
Member

Yeah, so you cannot write to the disk if it's full, but spamming the logs isn't the way to go.

This is not necessarily true: the logs could (and should!) be on a separate mount, they could have log rotation applied to them, etc.

@jasontedor
Member

I generally think we should kill nodes more often in disaster situations instead of just sitting there and waiting for more disaster to happen.

I agree but I'm unsure if disk-full qualifies as such a disaster situation since it's possible to recover.

@jasontedor
Member

jasontedor commented Apr 25, 2017

Why not make it leave a single message in the logs, then just crash? As the disk is full, the database cannot run anyway, so rather than spamming logs, just adopt a simpler approach.

In a concurrent server application there are likely many disk operations in flight, so expecting a single message is not realistic. As others have mentioned, this situation is not completely fatal and can be recovered from, so crashing on disk-full should not be the first option.

@khionu

khionu commented Apr 25, 2017

One option is to provide a few behaviors for the host to pick from: fatal crash, start refusing writes, or something else.

Something else to consider is prevention: it should be reasonable to check the space remaining daily and, if it goes below, say, 5%, log a severe warning.
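Something along these lines, as a rough sketch; the data path and the 5% threshold below are just examples, not anything Elasticsearch ships:

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Rough sketch of the preventive check described above: warn loudly when free
// space on the data path drops below a threshold. Path and threshold are
// illustrative only.
public class FreeDiskSpaceCheck {
    public static void main(String[] args) throws IOException {
        Path dataPath = Paths.get(args.length > 0 ? args[0] : "/var/lib/elasticsearch");
        FileStore store = Files.getFileStore(dataPath);
        double freeFraction = (double) store.getUsableSpace() / store.getTotalSpace();
        if (freeFraction < 0.05) {
            System.err.printf("SEVERE: only %.1f%% disk space left on %s%n",
                    freeFraction * 100, dataPath);
        }
    }
}
```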

@s1monw
Contributor

s1monw commented Apr 27, 2017

@khionu just out of curiosity, did you look into this?

@s1monw
Contributor

s1monw commented Apr 27, 2017

I agree but I'm unsure if disk-full qualifies as such a disaster situation since it's possible to recover.

The question is what the recovery path is. Most likely we need to relocate shards, but shouldn't the disk-threshold decider have taken care of this already? Once we are at, like, 99% there is not much we can do but fail? It's likely the most healthy option here: it tells the cluster to heal itself by allocating shards on other nodes, and it notifies the users since a node died. The log message might be clear and we can refuse to start up until we have at least 5% disk back? I kind of like this option the more I think about it.

@jasontedor
Member

@s1monw You're starting to convince me. Here's another thought that has occurred to me: if we keep the node alive, I don't think there's a lot that we should or can do (without a lot of jumping through hoops) about the disk-full log messages, so those are going to keep pumping out. If those log messages are being sent to a remote monitoring cluster, the disks on nodes of the remote monitoring cluster could be overwhelmed too, and now you have two problems (a remote denial of service on the monitoring cluster). This is an argument for dying.

@s1monw
Contributor

s1monw commented Apr 27, 2017

I'd like to hear what @rjernst and @nik9000 are thinking about this

@jasontedor
Member

Another thing to consider with respect to dying is that operators are going to have their nodes set to auto-restart (we encourage this because of dying with dignity). If we fail startup when the disk is full, as we should if we proceed with dying when the disk is full, we will end up in an infinite retry loop that will also spam the logs, and we won't have solved anything. I discussed this concern with @s1monw and we came up with the idea of a marker file to track the number of retries and simply not start at all if that marker file is present and the count exceeds some amount. At this point manual intervention is required, but it already is for disk-full anyway.
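A rough sketch of that marker-file idea; the file name and retry limit are made up, and a real implementation would also have to cope with the marker write itself failing on a full disk:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the marker-file idea: count startup attempts following a disk-full
// death and refuse to start once the count exceeds a limit, so an
// auto-restarting node does not loop forever. Names and limit are illustrative.
public class DiskFullStartupGuard {
    private static final Path MARKER = Paths.get("disk_full_retries.marker");
    private static final int MAX_RETRIES = 3;

    public static boolean mayStart() throws IOException {
        int retries = Files.exists(MARKER)
                ? Integer.parseInt(new String(Files.readAllBytes(MARKER)).trim())
                : 0;
        if (retries >= MAX_RETRIES) {
            // Manual intervention required: free disk space and delete the marker.
            return false;
        }
        Files.write(MARKER, Integer.toString(retries + 1).getBytes());
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("may start: " + DiskFullStartupGuard.mayStart());
    }
}
```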

@nik9000
Member

nik9000 commented Apr 27, 2017

I'd like to hear what @rjernst and @nik9000 are thinking about this

Lots of things.

  1. What if you only have a single copy of the shard and the node kills itself because it ran out of space? I mean having only a single copy is asking for trouble, but the problem still stands with multiple copies. It is just less common.
  2. If a primary fails to write to a replica because it ran out of space then it'll fail the replica, freeing up the space that the replica was taking up pretty quick. I could see this being a transient problem if you are running really close to the edge. Shooting the node seems drastic in these situations because it'll have disk to work with soon.
  3. Why don't we roll the log files based on size? If we did that at least we'd limit the amount of disk we consume in this case. We'd still chew through IOPS but still.
  4. I feel like we should do more indoctrination about running separate mount points for the data directory and everything else. It is fairly uncommon these days to separate /var/log and / but I've never run Elasticsearch in production without /var/lib/elasticsearch being on a separate volume.
  5. I feel like if disk is full it'd be nice if the node stuck around so other nodes could recover from it.
  6. I'm fairly concerned that killing the node will make the problem worse for the rest of the cluster which will have to recover the copies of the shard somewhere else which will fill the disk space of other nodes. At least, it'll push them as close to the edge as the disk allocation decider allows them to go.

I'm sure I'll think of more things. I don't really like the idea of shooting the node if it runs out of disk space. I just have a gut feeling about it more than all the stuff I wrote.

@s1monw
Contributor

s1monw commented Apr 28, 2017

What if you only have a single copy of the shard and the node kills itself because it ran out of space? I mean having only a single copy is asking for trouble, but the problem still stands with multiple copies. It is just less common.

This one is interesting. If you have only one copy it will be unavailable until the node has enough disk to recover, but you won't lose data; it's still there. If you have a copy, the cluster will try to allocate it elsewhere and we slowly heal the cluster, or the node comes back quickly with more disk space. I think dropping out is a good option here.

What if you only have a single copy of the shard and the node kills itself because it ran out of space? I mean having only a single copy is asking for trouble, but the problem still stands with multiple copies. It is just less common.

The same goes for OOM or any other fatal/non-recoverable error. The question is whether we treat disk-full as non-recoverable. IMO yes: we won't recover from it, and the cluster will be in trouble anyhow.

If a primary fails to write to a replica because it ran out of space then it'll fail the replica, freeing up the space that the replica was taking up pretty quick. I could see this being a transient problem if you are running really close to the edge. Shooting the node seems drastic in these situations because it'll have disk to work with soon.

This is not true - we keep the data on disk until we have allocated the replica on another node, and this can take a very long time.

Why don't we roll the log files based on size? If we did that at least we'd limit the amount of disk we consume in this case. We'd still chew through IOPS but still.

This is a different problem which I agree we should tackle, but it's unrelated to running out of disk IMO.

I feel like if disk is full it'd be nice if the node stuck around so other nodes could recover from it.

We don't have this option unless we switch all indices allocated on it to read-only? That is pretty drastic and very error-prone.

@nik9000
Member

nik9000 commented May 1, 2017

This is not true - we keep the data on disk until we have allocated the replica on another node, and this can take a very long time.

Right. I take this point back.

Personally I don't think nodes should kill themselves, but I'm aware that is asking too much. There are unrecoverable things like OOMs and other bugs. We work hard to prevent these, and they break things in subtle ways when they hit.

If running out of disk is truly as unrecoverable as a Java OOM then we should kill the node but we need to open intensive efforts to make sure that it doesn't happen. Like we did with the circuit breakers. The disk allocation stuff doesn't look like it is enough.

@s1monw
Contributor

s1monw commented May 2, 2017

If running out of disk is truly as unrecoverable as a Java OOM then we should kill the node but we need to open intensive efforts to make sure that it doesn't happen. Like we did with the circuit breakers. The disk allocation stuff doesn't look like it is enough.

Agreed, we should think and try harder to make it less likely to get to this point. One thought I was playing with was to tell the primary that the replica does not have enough space left, and if so we can reject subsequent writes. Such information can be transported back to the primary with the replica write responses. Once the primary is in such a state it will stay there until the replica tells its primary it's in good shape again, so we can continue indexing. I really think we should push back to the user if stuff like this happens, and we need to give the nodes some time to move shards away, which is sometimes not possible due to allocation deciders or no space on other nodes. In such a case we can only reject writes.

@nik9000
Member

nik9000 commented May 2, 2017

Agreed, we should think and try harder to make it less likely to get to this point. One thought I was playing with was to tell the primary that the replica does not have enough space left, and if so we can reject subsequent writes. Such information can be transported back to the primary with the replica write responses. Once the primary is in such a state it will stay there until the replica tells its primary it's in good shape again, so we can continue indexing. I really think we should push back to the user if stuff like this happens, and we need to give the nodes some time to move shards away, which is sometimes not possible due to allocation deciders or no space on other nodes. In such a case we can only reject writes.

+1

@nik9000
Member

nik9000 commented May 2, 2017

@s1monw, if there is space between when we start moving shards off of a node and when we reject writes then this can be a backstop. I guess also we'd want the primary to have this behavior too, right?

@s1monw
Contributor

s1monw commented May 2, 2017

@nik9000 yes, we should accept all writes on replicas all the time; we just need to prevent the primary from sending them. So yes, the primary should also have such a flag.

@nik9000
Member

nik9000 commented May 2, 2017

Ah! Now I get it.

For those of you like me that don't get it at first: the replication model in Elasticsearch dictates that if a primary has accepted a write but the replica rejects it, then the replica must fail. Once failed, the replica has to recover from the primary, which is a fairly heavy operation. So replicas must do their best not to fail. In this context that means that the replica should absorb the write even though it is running out of space, but it should tell the primary to reject further writes. We can tuck the "help, I'm running out of space" flag into the write response.

So we have to set the disk-space threshold quite a bit before we get full, because the whole thing is super asynchronous.
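A toy sketch of that flag-in-the-write-response idea; none of these class names exist in Elasticsearch, this is just the shape of the mechanism:

```java
// Toy model of the mechanism described above, not Elasticsearch code: the
// replica always absorbs the write but reports disk pressure in its response,
// and the primary then rejects new client writes until a healthy response
// arrives again.
public class ReplicaDiskPressureSketch {

    static final class ReplicaWriteResponse {
        final boolean lowDiskSpace;
        ReplicaWriteResponse(boolean lowDiskSpace) {
            this.lowDiskSpace = lowDiskSpace;
        }
    }

    static final class PrimaryShard {
        private volatile boolean rejectNewWrites = false;

        void onReplicaResponse(ReplicaWriteResponse response) {
            // The flag stays set only as long as the replica keeps reporting pressure.
            rejectNewWrites = response.lowDiskSpace;
        }

        boolean acceptClientWrite() {
            return rejectNewWrites == false;
        }
    }

    public static void main(String[] args) {
        PrimaryShard primary = new PrimaryShard();
        primary.onReplicaResponse(new ReplicaWriteResponse(true));
        System.out.println("accepting writes? " + primary.acceptClientWrite()); // false
        primary.onReplicaResponse(new ReplicaWriteResponse(false));
        System.out.println("accepting writes? " + primary.acceptClientWrite()); // true
    }
}
```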

@s1monw
Contributor

s1monw commented May 2, 2017

@nik9000 yes that is what I meant... thanks for explaining it in other words again.

@jasontedor
Member

A concern I have: consider a homogeneous cluster with well-sharded data (these are not unreasonable assumptions). If one node is running low on disk space, then they are all running low on disk space. Killing the first node to run out of disk space will lead to recoveries on the other nodes in the cluster, exacerbating their low-disk issues. Shooting a node can lead to a cluster-wide outage.

@s1monw
Contributor

s1monw commented May 4, 2017

A concern I have: consider a homogeneous cluster with well-sharded data (these are not unreasonable assumptions). If one node is running low on disk space, then they are all running low on disk space. Killing the first node to run out of disk space will lead to recoveries on the other nodes in the cluster, exacerbating their low-disk issues. Shooting a node can lead to a cluster-wide outage.

We spoke about this yesterday in a meeting but I want to add my response here anyway for completeness. I think in such a situation the watermarks will protect us, since if a node is already high on disk usage we will not allocate shards on it. We also have a good notion of how big shards are for relocation, so we can make good decisions here. That is not absolutely watertight but I think we are ok along those lines.

We also spoke about a possible solution to the problem of continuing indexing when a node is under disk-space pressure. The plan to tackle this issue is to introduce a new kind of index-level cluster block that will be set automatically on all indices that have at least one shard on a node that is above the flood_stage watermark (another setting, set to 95% disk utilization by default). The master currently monitors disk usage of all nodes with a refresh interval of 30 seconds. The new cluster block will prevent indexing / updating but only for operations with Engine.Operation.Origin.PRIMARY such that we never get in the way of replica requests etc. The user will still be able to delete indices to make room on tight nodes with this cluster block, but Elasticsearch will not make any effort to remove the block on its own. This has to happen based on user action, i.e. the user must remove the block from the indices.
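For anyone landing here later, a sketch of roughly how this surfaced as settings once implemented; the names below are as I understand them, so verify them against the docs for your version:

```
# Adjust the flood-stage threshold (95% disk utilization by default):
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

# Once space has been freed, remove the block from an affected index by hand:
PUT my-index/_settings
{
  "index.blocks.read_only_allow_delete": null
}
```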

@s1monw
Contributor

s1monw commented May 4, 2017

Just giving @bleskes a ping here since he might be interested in this as well...

@bleskes
Contributor

bleskes commented May 9, 2017

Thx @s1monw . Indeed interesting.

  1. Reading the ticket I kept going back and forth between killing the node and trying to deal with it while staying alive. I get the argument for killing the node as it is not functioning well. On the other side - killing a node is a very drastic operation from a cluster perspective - it's not likely to come back within a minute (as is the case with OOM), and this means that the cluster will start recovering all shards that used to be on it to other nodes. So if you have 500GB of data on the node and one active shard causing the disk to fill up, or if you had a logging issue (and it's not on a different mount), now we start copying all those 500GB around.
  2. I like the idea of throttling indexing when indexing overloads shard moving - i.e., the master is already trying to move shards off the troubled node but it takes time; if someone is indexing at a rate that fills up the disk faster than we're moving data, we should slow them down.
  3. In terms of throttling - did we consider using the current throttling mechanism we already have - i.e., when the memory controller locks indexing to a single thread? Will tying that to free disk space be enough throttling, with the advantage that it's fully local?
  4. I would like to understand better where the out of disk came from - if it is a merge, will stopping indexing actually help? Also, it seems that a disk full issue only affects shards that are actively writing. Maybe instead of killing the node, we should kill the shard that can't live on it - i.e., fail it? I looked a bit at the code and it seems we currently treat this as a document failure (please do tell me I'm wrong - Lucene is complicated ;)).

@s1monw
Contributor

s1monw commented May 11, 2017

I like the idea of throttling indexing when indexing overloads shard moving - i.e., the master is already trying to move shards off the troubled node but it takes time; if someone is indexing at a rate that fills up the disk faster than we're moving data, we should slow them down.

I am convinced we should reject all writes and not throttle once we cross a certain line. Folks can raise that bar if they feel confident, but let's not continue writing.

I would like to understand better where the out of disk came from - if it is a merge, will stopping indexing actually help?

We are trying not to even get to this point. We try to prevent adding more data once we have crossed the flood_stage, which will then hopefully prevent running out of disk. If, let's say, the last doc we index before we cross the line triggers a merge and that merge causes an out-of-disk exception, we fail that shard immediately. That will just cause this one shard to fail, and we give back the space we used for the merge target immediately, so I think we are fine here. We also spoke about disabling merges on read-only indices (i.e. when the shard goes inactive, which can be a side-effect of marking them as read-only).

Also, it seems that a disk full issue only affects shards that are actively writing. Maybe instead of killing the node, we should kill the shard that can't live on it - i.e., fail it? I looked a bit at the code and it seems we currently treat this as a document failure (please do tell me I'm wrong - Lucene is complicated ;)).

My summary of our chats steps away from failing the node... we try to not even get to that point and instead make the indices allocated on a node that crosses the flood_stage read-only, so I think it's fine?!

@bleskes
Contributor

bleskes commented May 12, 2017

I am convinced we should reject all writes and not throttle once we cross a certain line.

I thought about this more and I agree. It's a simpler solution than slowing things down.

my summary of our chats steps away from failing the node...

Ok. Good. Then there is no need to offer alternatives :)

The new cluster block will prevent indexing / updating but only for operations with Engine.Operation.Origin.PRIMARY such that we never get in the way of replica requests etc.

By adding an index-level block which excludes writes, we block write operations at the reroute phase, even before they go into the replication phase. Everything beyond this point will be processed correctly on both replicas and primaries. I think we're good here.

@s1monw self-assigned this Jun 30, 2017
s1monw added a commit to s1monw/elasticsearch that referenced this issue Jul 4, 2017
s1monw added a commit that referenced this issue Jul 5, 2017
Today when we run out of disk all kinds of crazy things can happen
and nodes are becoming hard to maintain once out of disk is hit.
While we try to move shards away if we hit watermarks this might not
be possible in many situations. Based on the discussion in #24299
this change monitors disk utilization and adds a flood-stage watermark
that causes all indices that are allocated on a node hitting the flood-stage
mark to be switched read-only (with the option to be deleted). This allows users to react on the low disk
situation while subsequent write requests will be rejected. Users can switch
individual indices read-write once the situation is sorted out. There is no
automatic read-write switch once the node has enough space. This requires
user interaction.

The flood-stage watermark is set to `95%` utilization by default.

Closes #24299
@nullpixel
Author

Woo!

@Bukhtawar
Contributor

The user will still be able to delete indices to make room on tight nodes with this cluster block, but Elasticsearch will not make any effort to remove the block on its own. This has to happen based on user action, i.e. the user must remove the block from the indices.

Can we not re-evaluate the blocks when disk frees up, rather than letting the end user worry about blocks?

@inqueue
Member

inqueue commented Feb 20, 2019

Can we not re-evaluate the blocks when disk frees up, rather than letting the end user worry about blocks?

Hi @Bukhtawar this is worth discussion. Will you file a new issue for it?

@vaishali-prophecy

@s1monw I am using v7.0.4 but still getting this error. Can you tell me which version to use so that ES doesn't become unresponsive when the disk is full?

@RS146BIJAY

RS146BIJAY commented Jul 6, 2023

The issue mentions that when the flood-stage watermark is breached, indices would be made read-only (while still allowing deletes) and merges would be disabled. Wondering why merges were not disabled on read-only indices? Is that intentional?

We also spoke about disabling merges on read-only indices.
