Retry writes to the AOF on ENOSPC #588

Open

@saj
saj commented Jul 18, 2012

Big, busy Redis servers configured with AOF persistence can chew through a lot of disk storage. It's usually not a big deal if capacity estimates come in a little low: operators can adapt to steady growth by using filesystems that can be grown while online.

What we can't (instantaneously) adapt to, however, are the big, unpredictable surges in demand that come with automatic AOF rewrites. If a BGREWRITEAOF happens to completely fill its filesystem (with the temporary AOF), there is a very real chance that Redis will die because it has no space left to append to the real AOF.

You then find yourself in a somewhat ironic situation: the AOF rewrite process designed to prevent Redis from filling its disk just killed Redis... because it filled the disk. :)

With the existing implementation, Redis terminates immediately if a write to its AOF fails.

This commit modifies AOF write behaviour when the fsync policy is set to one of the less durable modes (everysec or no). With this commit, Redis will assume that if an AOF write fails because of ENOSPC and we have a BGREWRITEAOF child running, the write might succeed if we try it again after the child dies. This keeps Redis up and servicing requests even if an automatic BGREWRITEAOF momentarily fills the disk.
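
As a rough sketch of the decision described above (this is not the code in the patch; the enum and function names are hypothetical and exist only for illustration), the retry condition boils down to something like:

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not the identifiers used in Redis. */
enum fsync_policy { FSYNC_NO, FSYNC_EVERYSEC, FSYNC_ALWAYS };
enum write_action { WRITE_ABORT, WRITE_RETRY_LATER };

/* Decide what to do after a failed AOF write.
 *
 * err                   - errno reported by the failed write
 * policy                - configured fsync policy
 * rewrite_child_running - true if a BGREWRITEAOF child is currently alive
 *
 * A retry is only attempted when all three conditions hold: the disk is
 * full, the policy is one of the less durable modes, and a rewrite child
 * exists whose temporary file may free space once the child dies. */
static enum write_action on_aof_write_error(int err,
                                            enum fsync_policy policy,
                                            bool rewrite_child_running)
{
    if (err == ENOSPC && policy != FSYNC_ALWAYS && rewrite_child_running)
        return WRITE_RETRY_LATER;
    return WRITE_ABORT; /* existing behaviour: give up */
}
```

Every other combination, including ENOSPC with no child running, falls through to the existing terminate-on-failure path.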

Children running BGREWRITEAOF will, of course, keep dying on ENOSPC until an operator comes along and throws more storage at Redis. You can use something like #586 to automatically detect this problem.

Requires #587.

@saj saj Retry writes to the AOF on ENOSPC
0882378
@antirez
Owner
antirez commented Jul 27, 2012

This seems like a good idea in general, but I have to review the patch more carefully before merging it. Adding to the 2.6 milestone so I'll review it before 2.6 stable for sure.

@antirez antirez was assigned Jul 27, 2012
@antirez
Owner
antirez commented Sep 27, 2012

2.6 -> 2.8
