Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WARN when filesystem is mounted with flushoncommit — best practices? #68

Open
d235j opened this issue Jul 8, 2018 · 1 comment
Open
Labels

Comments

@d235j
Copy link

@d235j d235j commented Jul 8, 2018

The README for bees indicates that filesystems should be mounted with the flushoncommit option.
This is not the default for btrfs. Additionally, having it enabled causes a warning in __writeback_inodes_sb_nr, which is discussed a bit in this thread:
https://www.spinics.net/lists/linux-btrfs/msg72483.html

What would be the best practice in this situation?

@Zygo
Copy link
Owner

@Zygo Zygo commented Jul 8, 2018

The README lists flushoncommit as a tested configuration, not a requirement. Neither flushoncommit setting conflicts with bees, so you can use either mount option with bees.

I don't test bees on filesystems with noflushoncommit because the data loss that normally occurs on crashes with noflushoncommit creates noise results in testing that make it difficult to determine whether any data loss that is found was expected btrfs behavior, or some bug in either btrfs or bees.

The default (noflushoncommit) makes btrfs performance competitive in filesystem benchmarks, but it's very unforgiving when there is a crash. In addition to the usual empty files that get left behind on a crash on other filesystems like ext4, btrfs noflushoncommit also injects hole extents into the middle of files if delayed allocation causes a reordering of extent flushes around a transaction commit, and a crash occurs between the transaction commit and the delayed flushes. This gets worse on larger/busier systems because the flushes take more time to execute, so the percentage of time when data is at risk rapidly reaches 100%, and the locations of the holes injected into files become less predictable. In other words, data corruption (as opposed to loss) on a crash is certain with noflushoncommit, the only variables are how much data and which parts of which files are damaged.

Because of the problems with noflushoncommit, I don't recommend using btrfs without flushoncommit except in specific circumstances:

  • Dedicated application servers can use flushoncommit safely if all writes are known to be explicitly persisted with fsync, and lost file writes can be detected and tolerated by all applications on the server. Database engines and sqlite can meet this requirement, but tools like 'cp' and 'rsync' do not.
  • Another solution is to delete all files that were written just before a crash. A build server can do this by erasing the workspace of an interrupted build and restarting builds from a fresh checkout after a reboot. A build in progress would likely be corrupted in ways that are difficult to detect, especially if GNU ld is involved.
  • An end-user running a collection of desktop productivity applications without rigorous automated supervision is just rolling the dice on data loss. Such users should always mount with flushoncommit.

bees does call fsync() after creating temporary extents for dedup src, and all of its temporary files are created with O_TMPFILE so they are automatically removed on reboots; however, there is still a race condition when noflushoncommit is used that has no solution:

  • two extents are identified as duplicate by bees
  • some non-bees process writes identical data to the src extent, replacing the src extent with delalloc blocks in memory
  • bees dedups the extent (with dedup_file_range ioctl), replacing a reference to a flushed dst extent on disk with a reference to a non-flushed src extent in memory. In the previous step we required identical data so the dedup is still accepted by the kernel
  • a btrfs commit occurs, deleting the old dst ref, but not flushing the new src ref because noflushoncommit
  • system crashes before the flush occurs

With noflushoncommit, in theory the above sequence could replace the duplicate extent with a hole (note: I haven't tested noflushoncommit with bees, so I have not reproduced this in practice, and there may be some other mechanism in btrfs that prevents this case). flushoncommit does not allow this to happen as the flush and commit occur within the same transaction.

Probably the best approach would be to fix the bug introduced in kernel commit ce8ea7cc6eb3 ("btrfs: don't call btrfs_start_delalloc_roots in flushoncommit"). See also https://marc.info/?l=linux-btrfs&m=151315564008773. This bug does not affect LTS kernels (i.e. 4.14.x) so it can also be avoided by using one of those.

To date I haven't seen evidence of this warning causing any problems in practice; however, I wouldn't expect it to be a problem unless a umount is attempted with a filesystem under heavy write load, and I would expect the effect in that case to be limited to a kernel crash or maybe a deadlock.

@Zygo Zygo added the kernel bug label Jul 8, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
2 participants
You can’t perform that action at this time.