
[Feature Request] Make Elasticsearch crash after Caused by: java.io.IOException: No space left on device, rather than spamming logs #24299

Closed
nullpixel opened this issue Apr 24, 2017 · 34 comments
Labels: :Core/Infra/Core, discuss, resiliency

@nullpixel

Describe the feature: As per #20354, logs are spammed with Caused by: java.io.IOException: No space left on device when the disk space runs out. Why not make it leave a single message in the logs, then just crash? As the disk is full, the database cannot run anyway, so rather than spamming logs, just adopt a simpler approach.

This is a production database, and I wouldn't expect my logs to look like that after waking up.

@martijnvg
Member

I don't think that crashing the node is the best approach when there is no space left on device. The node can still serve read requests and if indices are removed/relocated (or in this case old log files are removed) then there should be sufficient space to handle write requests.

In case of log files filling up, maybe ES should try to prevent this by capping the size of the log file, for example as is done for deprecation logging (with log4j's SizeBasedTriggeringPolicy)?
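For illustration, a rough sketch of what such a size-based rollover might look like in log4j2.properties; the appender name, paths, pattern, size, and archive count below are arbitrary placeholders, not shipped defaults:

```properties
# Hypothetical rolling appender capped by size: roll at 128MB and keep at most
# four compressed archives, so the main log can never grow without bound.
appender.size_capped.type = RollingFile
appender.size_capped.name = size_capped
appender.size_capped.fileName = /var/log/elasticsearch/elasticsearch.log
appender.size_capped.filePattern = /var/log/elasticsearch/elasticsearch-%i.log.gz
appender.size_capped.layout.type = PatternLayout
appender.size_capped.layout.pattern = [%d{ISO8601}][%-5p][%-25c{1.}] %m%n
appender.size_capped.policies.type = Policies
appender.size_capped.policies.size.type = SizeBasedTriggeringPolicy
appender.size_capped.policies.size.size = 128MB
appender.size_capped.strategy.type = DefaultRolloverStrategy
appender.size_capped.strategy.max = 4

rootLogger.level = info
rootLogger.appenderRef.size_capped.ref = size_capped
```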

@clintongormley added the :Core/Infra/Core, discuss, and resiliency labels Apr 25, 2017
@clintongormley

Also, a disk might fill up because of a big merge. As soon as that fails, disk space could be freed up again allowing the node to continue working.

@khionu

khionu commented Apr 25, 2017

In that case, it might be best to put a hold on the operations that are unable to complete.

A queue that retries at a much lower rate until operations succeed would address the concerns about staying available for reads and waiting for disk to be freed, without causing undue spam. For an exception that requires manual intervention, a retry rate of once a minute would be acceptable, in my opinion.

@s1monw
Contributor

s1monw commented Apr 25, 2017

I agree that killing the node might not be the right thing here from a 10k-feet view, though I can see it as a viable solution for some use cases. Out of the box I'd like us to rather detect that we are close to a full disk and then simply stop write operations like merging and indexing. Yet this still has issues: if we have a replica on the node we'd reject a write, which would bring the replica out of sync at that moment. Disk-threshold allocation deciders should help here by moving stuff away from the node, but essentially killing it and leaving the cluster would be the right thing to do here IMO. If we wanna do that we need to somehow add corresponding handlers in many places, or we start with adding it into the engine and allow folks to opt out of it? I generally think we should kill nodes more often in disaster situations instead of just sitting there and waiting for more disaster to happen.

@nullpixel
Author

Yeah, so you cannot write to the disk if it's full, but spamming the logs isn't the way to go.

We could do a "read only" style mode where it just logs that it went read-only because there is no disk space. There's no need to spam the logs, even with a cap.

@jasontedor
Member

Yeah, so you cannot write to the disk if it's full, but spamming the logs isn't the way to go.

This is not necessarily true: the logs could (and should!) be on a separate mount, they could have log rotation applied to them, etc.

@jasontedor
Member

I generally think we should kill nodes more often in disaster situations instead of just sitting there and waiting for more disaster to happen.

I agree but I'm unsure if disk-full qualifies as such a disaster situation since it's possible to recover.

@jasontedor
Member

jasontedor commented Apr 25, 2017

Why not make it leave a single message in the logs, then just crash? As the disk is full, the database cannot run anyway, so rather than spamming logs, just adopt a simpler approach.

In a concurrent server application there are likely many disk operations in flight, so expecting a single message is not realistic. As others have mentioned, this situation is not completely fatal and can be recovered from, so crashing on disk-full should not be the first option.

@khionu

khionu commented Apr 25, 2017

One option is to provide a few behaviors for the host to pick from: fatal crash, start refusing writes, or something else.

Something else to consider is prevention: it should be reasonable to check the space remaining daily and, if it goes below, say, 5%, log a severe warning.
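Something along these lines, as a rough sketch; the data path and the 5% threshold below are just examples, not anything Elasticsearch ships:

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Rough sketch of the preventive check described above: warn loudly when free
// space on the data path drops below a threshold. Path and threshold are
// illustrative only.
public class FreeDiskSpaceCheck {
    public static void main(String[] args) throws IOException {
        Path dataPath = Paths.get(args.length > 0 ? args[0] : "/var/lib/elasticsearch");
        FileStore store = Files.getFileStore(dataPath);
        double freeFraction = (double) store.getUsableSpace() / store.getTotalSpace();
        if (freeFraction < 0.05) {
            System.err.printf("SEVERE: only %.1f%% disk space left on %s%n",
                    freeFraction * 100, dataPath);
        }
    }
}
```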

@s1monw
Contributor

s1monw commented Apr 27, 2017

@khionu just out of curiosity, did you look into this?

@s1monw
Contributor

s1monw commented Apr 27, 2017

I agree but I'm unsure if disk-full qualifies as such a disaster situation since it's possible to recover.

The question is what the recovery path is. Most likely we need to relocate shards, but shouldn't the disk-threshold decider have taken care of this already? Once we are at, like, 99% there is not much we can do but fail? It's likely the most healthy option here: it tells the cluster to heal itself by allocating shards on other nodes, and it notifies the users since a node died. The log message might be clear and we can refuse to start up until we have at least 5% disk back? I kind of like this option the more I think about it.

@jasontedor
Member

@s1monw You're starting to convince me. Here's another thought that has occurred to me: if we keep the node alive, I don't think there's a lot that we should or can do (without a lot of jumping through hoops) about the disk-full log messages, so those are going to keep pumping out. If those log messages are being sent to a remote monitoring cluster, the disks on nodes of the remote monitoring cluster could be overwhelmed too, and now you have two problems (a remote denial of service on the monitoring cluster). This is an argument for dying.

@s1monw
Contributor

s1monw commented Apr 27, 2017

I'd like to hear what @rjernst and @nik9000 are thinking about this

@jasontedor
Member

Another thing to consider with respect to dying is that operators are going to have their nodes set to auto-restart (we encourage this because of dying with dignity). If we fail startup when the disk is full, as we should if we proceed with dying when the disk is full, we will end up in an infinite retry loop that will also spam the logs, and we won't have solved anything. I discussed this concern with @s1monw and we came up with the idea of a marker file to track the number of retries and simply not start at all if that marker file is present and the count exceeds some amount. At this point manual intervention is required, but it already is for disk-full anyway.
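A rough sketch of that marker-file idea; the file name and retry limit are made up, and a real implementation would also have to cope with the marker write itself failing on a full disk:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of the marker-file idea: count startup attempts following a disk-full
// death and refuse to start once the count exceeds a limit, so an
// auto-restarting node does not loop forever. Names and limit are illustrative.
public class DiskFullStartupGuard {
    private static final Path MARKER = Paths.get("disk_full_retries.marker");
    private static final int MAX_RETRIES = 3;

    public static boolean mayStart() throws IOException {
        int retries = Files.exists(MARKER)
                ? Integer.parseInt(new String(Files.readAllBytes(MARKER)).trim())
                : 0;
        if (retries >= MAX_RETRIES) {
            // Manual intervention required: free disk space and delete the marker.
            return false;
        }
        Files.write(MARKER, Integer.toString(retries + 1).getBytes());
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println("may start: " + DiskFullStartupGuard.mayStart());
    }
}
```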

@nik9000
Member

nik9000 commented Apr 27, 2017

I'd like to hear what @rjernst and @nik9000 are thinking about this

Lots of things.

  1. What if you only have a single copy of the shard and the node kills itself because it ran out of space? I mean having only a single copy is asking for trouble, but the problem still stands with multiple copies. It is just less common.
  2. If a primary fails to write to a replica because it ran out of space then it'll fail the replica, freeing up the space that the replica was taking up pretty quick. I could see this being a transient problem if you are running really close to the edge. Shooting the node seems drastic in these situations because it'll have disk to work with soon.
  3. Why don't we roll the log files based on size? If we did that at least we'd limit the amount of disk we consume in this case. We'd still chew through IOPS but still.
  4. I feel like we should do more indoctrination about running separate mount points for the data directory and everything else. It is fairly uncommon these days to separate /var/log and / but I've never run Elasticsearch in production without /var/lib/elasticsearch being on a separate volume.
  5. I feel like if disk is full it'd be nice if the node stuck around so other nodes could recover from it.
  6. I'm fairly concerned that killing the node will make the problem worse for the rest of the cluster which will have to recover the copies of the shard somewhere else which will fill the disk space of other nodes. At least, it'll push them as close to the edge as the disk allocation decider allows them to go.

I'm sure I'll think of more things. I don't really like the idea of shooting the node if it runs out of disk space. I just have a gut feeling about it more than all the stuff I wrote.

@s1monw
Contributor

s1monw commented Apr 28, 2017

What if you only have a single copy of the shard and the node kills itself because it ran out of space? I mean having only a single copy is asking for trouble, but the problem still stands with multiple copies. It is just less common.

This one is interesting. If you have only one copy it will be unavailable until the node has enough disk to recover, but you won't lose data; it's still there. If you have a copy, the cluster will try to allocate it elsewhere and we slowly heal the cluster, or the node comes back quickly with more disk space. I think dropping out is a good option here.

What if you only have a single copy of the shard and the node kills itself because it ran out of space? I mean having only a single copy is asking for trouble, but the problem still stands with multiple copies. It is just less common.

The same goes for OOM or any other fatal/non-recoverable error. The question is whether we treat disk-full as non-recoverable. IMO yes: we won't recover from it, and the cluster will be in trouble anyhow.

If a primary fails to write to a replica because it ran out of space then it'll fail the replica, freeing up the space that the replica was taking up pretty quick. I could see this being a transient problem if you are running really close to the edge. Shooting the node seems drastic in these situations because it'll have disk to work with soon.

This is not true - we keep the data on disk until we have allocated the replica on another node, and this can take a very long time.

Why don't we roll the log files based on size? If we did that at least we'd limit the amount of disk we consume in this case. We'd still chew through IOPS but still.

This is a different problem which I agree we should tackle, but it's unrelated to running out of disk IMO.

I feel like if disk is full it'd be nice if the node stuck around so other nodes could recover from it.

We don't have this option unless we switch all indices allocated on it to read-only? That is pretty drastic and very error-prone.

@nik9000
Member

nik9000 commented May 1, 2017

This is not true - we keep the data on disk until we have allocated the replica on another node, and this can take a very long time.

Right. I take this point back.

Personally I don't think nodes should kill themselves, but I'm aware that is asking too much. There are unrecoverable things like OOMs and other bugs. We work hard to prevent these, and they break things in subtle ways when they hit.

If running out of disk is truly as unrecoverable as a Java OOM then we should kill the node but we need to open intensive efforts to make sure that it doesn't happen. Like we did with the circuit breakers. The disk allocation stuff doesn't look like it is enough.

@s1monw
Contributor

s1monw commented May 2, 2017

If running out of disk is truly as unrecoverable as a Java OOM then we should kill the node but we need to open intensive efforts to make sure that it doesn't happen. Like we did with the circuit breakers. The disk allocation stuff doesn't look like it is enough.

Agreed, we should think and try harder to make it less likely to get to this point. One thought I was playing with was to tell the primary that the replica does not have enough space left, and if so we can reject subsequent writes. Such information can be transported back to the primary with the replica write responses. Once the primary is in such a state it will stay there until the replica tells its primary it's in good shape again, so we can continue indexing. I really think we should push back to the user if stuff like this happens, and we need to give the nodes some time to move shards away, which is sometimes not possible due to allocation deciders or no space on other nodes. In such a case we can only reject writes.

@nik9000
Member

nik9000 commented May 2, 2017

Agreed, we should think and try harder to make it less likely to get to this point. One thought I was playing with was to tell the primary that the replica does not have enough space left, and if so we can reject subsequent writes. Such information can be transported back to the primary with the replica write responses. Once the primary is in such a state it will stay there until the replica tells its primary it's in good shape again, so we can continue indexing. I really think we should push back to the user if stuff like this happens, and we need to give the nodes some time to move shards away, which is sometimes not possible due to allocation deciders or no space on other nodes. In such a case we can only reject writes.

+1

@nik9000
Member

nik9000 commented May 2, 2017

@s1monw, if there is space between when we start moving shards off of a node and when we reject writes then this can be a backstop. I guess also we'd want the primary to have this behavior too, right?

@s1monw
Contributor

s1monw commented May 2, 2017

@nik9000 yes, we should accept all writes on replicas all the time; we just need to prevent the primary from sending them. So yes, the primary should also have such a flag.

@nik9000
Member

nik9000 commented May 2, 2017

Ah! Now I get it.

For those of you like me that don't get it at first: the replication model in Elasticsearch dictates that if a primary has accepted a write but the replica rejects it, then the replica must fail. Once failed, the replica has to recover from the primary, which is a fairly heavy operation. So replicas must do their best not to fail. In this context that means that the replica should absorb the write even though it is running out of space, but it should tell the primary to reject further writes. We can tuck the "help, I'm running out of space" flag into the write response.

So we have to set the disk-space threshold quite a bit before we get full, because the whole thing is super asynchronous.
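A toy sketch of that flag-in-the-write-response idea; none of these class names exist in Elasticsearch, this is just the shape of the mechanism:

```java
// Toy model of the mechanism described above, not Elasticsearch code: the
// replica always absorbs the write but reports disk pressure in its response,
// and the primary then rejects new client writes until a healthy response
// arrives again.
public class ReplicaDiskPressureSketch {

    static final class ReplicaWriteResponse {
        final boolean lowDiskSpace;
        ReplicaWriteResponse(boolean lowDiskSpace) {
            this.lowDiskSpace = lowDiskSpace;
        }
    }

    static final class PrimaryShard {
        private volatile boolean rejectNewWrites = false;

        void onReplicaResponse(ReplicaWriteResponse response) {
            // The flag stays set only as long as the replica keeps reporting pressure.
            rejectNewWrites = response.lowDiskSpace;
        }

        boolean acceptClientWrite() {
            return rejectNewWrites == false;
        }
    }

    public static void main(String[] args) {
        PrimaryShard primary = new PrimaryShard();
        primary.onReplicaResponse(new ReplicaWriteResponse(true));
        System.out.println("accepting writes? " + primary.acceptClientWrite()); // false
        primary.onReplicaResponse(new ReplicaWriteResponse(false));
        System.out.println("accepting writes? " + primary.acceptClientWrite()); // true
    }
}
```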

@s1monw
Contributor

s1monw commented May 2, 2017

@nik9000 yes that is what I meant... thanks for explaining it in other words again.

@jasontedor
Member

A concern I have: consider a homogeneous cluster with well-sharded data (these are not unreasonable assumptions). If one node is running low on disk space, then they are all running low on disk space. Killing the first node to run out of disk space will lead to recoveries on the other nodes in the cluster, exacerbating their low-disk issues. Shooting a node can lead to a cluster-wide outage.

@s1monw
Contributor

s1monw commented May 4, 2017

A concern I have: consider a homogeneous cluster with well-sharded data (these are not unreasonable assumptions). If one node is running low on disk space, then they are all running low on disk space. Killing the first node to run out of disk space will lead to recoveries on the other nodes in the cluster, exacerbating their low-disk issues. Shooting a node can lead to a cluster-wide outage.

We spoke about this yesterday in a meeting but I want to add my response here anyway for completeness. I think in such a situation the watermarks will protect us, since if a node is already high on disk usage we will not allocate shards on it. We also have a good notion of how big shards are for relocation, so we can make good decisions here. That is not absolutely watertight but I think we are ok along those lines.

We also spoke about a possible solution to the problem of continuing indexing when a node is under disk-space pressure. The plan to tackle this issue is to introduce a new kind of index-level cluster block that will be set automatically on all indices that have at least one shard on a node that is above the flood_stage watermark (another setting, set to 95% disk utilization by default). The master currently monitors disk usage of all nodes with a refresh interval of 30 seconds. The new cluster block will prevent indexing / updating but only for operations with Engine.Operation.Origin.PRIMARY such that we never get in the way of replica requests etc. The user will still be able to delete indices to make room on tight nodes with this cluster block, but Elasticsearch will not make any effort to remove the block on its own. This has to happen based on user action, i.e. the user must remove the block from the indices.
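For anyone landing here later, a sketch of roughly how this surfaced as settings once implemented; the names below are as I understand them, so verify them against the docs for your version:

```
# Adjust the flood-stage threshold (95% disk utilization by default):
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.flood_stage": "95%"
  }
}

# Once space has been freed, remove the block from an affected index by hand:
PUT my-index/_settings
{
  "index.blocks.read_only_allow_delete": null
}
```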

@s1monw
Contributor

s1monw commented May 4, 2017

Just giving @bleskes a ping here since he might be interested in this as well...

@bleskes
Contributor

bleskes commented May 9, 2017

Thx @s1monw . Indeed interesting.

  1. Reading the ticket I kept going back and forth between killing the node and trying to deal with it while staying alive. I get the argument for killing the node as it is not functioning well. On the other side - killing a node is a very drastic operation from a cluster perspective - it's not likely to come back within a minute (as is the case with OOM), and this means that the cluster will start recovering all shards that used to be on it to other nodes. So if you have 500GB of data on the node and one active shard causing the disk to fill up, or if you had a logging issue (and it's not on a different mount), now we start copying all those 500GB around.
  2. I like the idea of throttling indexing when indexing overloads shard moving - i.e., the master is already trying to move shards off the troubled node but it takes time; if someone is indexing at a rate that fills up the disk faster than we're moving data, we should slow them down.
  3. In terms of throttling - did we consider using the current throttling mechanism we already have - i.e., when the memory controller locks indexing to a single thread? Will tying that to free disk space be enough throttling, with the advantage that it's fully local?
  4. I would like to understand better where the out of disk came from - if it is a merge, will stopping indexing actually help? Also, it seems that a disk full issue only affects shards that are actively writing. Maybe instead of killing the node, we should kill the shard that can't live on it - i.e., fail it? I looked a bit at the code and it seems we currently treat this as a document failure (please do tell me I'm wrong - Lucene is complicated ;)).

@s1monw
Contributor

s1monw commented May 11, 2017

I like the idea of throttling indexing when indexing overloads shard moving - i.e., the master is already trying to move shards off the troubled node but it takes time; if someone is indexing at a rate that fills up the disk faster than we're moving data, we should slow them down.

I am convinced we should reject all writes and not throttle once we cross a certain line. Folks can raise that bar if they feel confident, but let's not continue writing.

I would like to understand better where the out of disk came from - if it is a merge, will stopping indexing actually help?

We are trying not to even get to this point. We try to prevent adding more data once we have crossed the flood_stage, which will then hopefully prevent running out of disk. If, let's say, the last doc we index before we cross the line triggers a merge and that merge causes an out-of-disk exception, we fail that shard immediately. That will just cause this one shard to fail, and we give back the space we used for the merge target immediately, so I think we are fine here. We also spoke about disabling merges on read-only indices (i.e. when the shard goes inactive, which can be a side-effect of marking them as read-only).

Also, it seems that a disk full issue only affects shards that are actively writing. Maybe instead of killing the node, we should kill the shard that can't live on it - i.e., fail it? I looked a bit at the code and it seems we currently treat this as a document failure (please do tell me I'm wrong - Lucene is complicated ;)).

My summary of our chats steps away from failing the node... we try to not even get to that point and instead make the indices allocated on a node that crosses the flood_stage read-only, so I think it's fine?!

@bleskes
Contributor

bleskes commented May 12, 2017

I am convinced we should reject all writes and not throttle once we cross a certain line.

I thought about this more and I agree. It's a simpler solution than slowing things down.

my summary of our chats steps away from failing the node...

Ok. Good. Then there is no need to offer alternatives :)

The new cluster block will prevent indexing / updating but only for operations with Engine.Operation.Origin.PRIMARY such that we never get in the way of replica requests etc.

By adding an index-level block which excludes writes, we block write operations at the reroute phase, even before they go into the replication phase. Everything beyond this point will be processed correctly on both replicas and primaries. I think we're good here.

@s1monw self-assigned this Jun 30, 2017
s1monw added a commit to s1monw/elasticsearch that referenced this issue Jul 4, 2017
s1monw added a commit that referenced this issue Jul 5, 2017
Today when we run out of disk all kinds of crazy things can happen
and nodes are becoming hard to maintain once out of disk is hit.
While we try to move shards away if we hit watermarks this might not
be possible in many situations. Based on the discussion in #24299
this change monitors disk utilization and adds a flood-stage watermark
that causes all indices that are allocated on a node hitting the flood-stage
mark to be switched read-only (with the option to be deleted). This allows users to react on the low disk
situation while subsequent write requests will be rejected. Users can switch
individual indices read-write once the situation is sorted out. There is no
automatic read-write switch once the node has enough space. This requires
user interaction.

The flood-stage watermark is set to `95%` utilization by default.

Closes #24299
@nullpixel
Author

Woo!

@Bukhtawar
Contributor

The user will still be able to delete indices to make room on tight nodes with this cluster block, but Elasticsearch will not make any effort to remove the block on its own. This has to happen based on user action, i.e. the user must remove the block from the indices.

Can we not re-evaluate the blocks when disk frees up, rather than letting the end user worry about blocks?

@inqueue
Member

inqueue commented Feb 20, 2019

Can we not re-evaluate the blocks when disk frees up, rather than letting the end user worry about blocks?

Hi @Bukhtawar this is worth discussion. Will you file a new issue for it?

@vaishali-prophecy

@s1monw I am using v7.0.4 but still getting this error. Can you tell me which version to use so that ES doesn't become unresponsive when the disk is full?

@RS146BIJAY

RS146BIJAY commented Jul 6, 2023

The issue mentions that when the flood-stage watermark is breached, indices would be made read-only (while still allowing deletes) and merges would be disabled. Wondering why merges were not disabled on read-only indices? Is that intentional?

We also spoke about disabling merges on read-only indices.
