-
Notifications
You must be signed in to change notification settings - Fork 6.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Skip directory fsync for filesystem btrfs #8903
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking through the code, there are a lot of places that this would not be covered. The checkpoint and backup code both rename files in the directory, fsync the directory. This change will not cover those cases.
Personally, I think it might be better to have a FSDirectory::RenameFile API that takes the two names, and the other options (IOOptions, IODebugContext*) and does the rename of the file and the Fsync (if an IOOptions suggests it).
I think that the FSDirectory class should have many of the methods off FileSystem (listChildren, exists, etc) that are basic wrappers back to the FileSystem. The difference is the methods are not required to specify the full path as input.
include/rocksdb/file_system.h
Outdated
// Fsync after renaming a file. Depends on the filesystem, it may fsync | ||
// directory or just the renaming file (e.g. btrfs). By default, it just calls | ||
// directory fsync. | ||
virtual IOStatus FsyncForRename(const std::string& filename, | ||
const IOOptions& options, | ||
IODebugContext* dbg) { | ||
return Fsync(options, dbg); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than adding a new API, why not make it an IOOption? That seems like it would be cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, this could just be Fsync(const std::string & filename...). Not sure why this is specific to "rename"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is only for a file rename. For new file creation (or deletion) we still just call Fsync(). We want to be explicit for that. But I agree that putting that information in IOOption would be more flexible.
env/io_posix.cc
Outdated
if (is_btrfs_) { | ||
int fd = open(filename.c_str(), O_RDONLY); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the directory is /home/my_dir and the filename is /home/my_file, what should happen? Should it be a requirement that the filename is in the directory or an error (NotFound?) is returned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If that's the case, the dir-fsync won't persist the file renaming. At least with this change, on btrfs the renaming could be persisted with file fsync.
env/fs_posix.cc
Outdated
struct statfs buf; | ||
int ret = fstatfs(fd, &buf); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be cleaner if this code was moved inside of the PosixDirectory constructor.
And if it is done there, can this check be skipped altogether if BTRFS_SUPER_MAGIC is not defined?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sense, updated.
We cannot use definition of BTRFS_SUPER_MAGIC to check it's on btrfs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If BTRFS_SUPER_MAGIC is not defined, is there any reason to do this fstatfs?
We do want to keep the fsync for checkpoint and backup, as they're renaming a directory instead of a file. They're currently using old Directory which is not change. But I think we do want to provide an API for FSDirectory to force sync, in case they are moving to FSDirectory. I think switch to using IOOptions is a good idea.
Then the
Sorry, can you elaborate? |
The Backup code renames both files and directories (checkpoint appears to only rename directories). Compaction, Options, and Identity files are also examples of files that are created and renamed. I do not know what the rule/expectation for syncs are around those files.
Sorry, I meant that there should be an FSDirectory::RenameFile that does the rename. It can still use the base FileSystem implementation to do the work, but can then do the sync if required. |
Backup and checkpoint do need dir-fsync, as they're using
First, nothing changed for non-btrfs.
Yeah, that's one option. After discussed with Siying, to minimize the change, we will add an API |
@jay-zhuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
I still believe it would be better to NOT use a new API but instead use a new IOOptions instead.:
Personally, I would prefer having an IOOptions::sync that, if set to true, would encourage the FileSystem/Directory to do an Fsync after. Then either the RenameFile (or even better an SDirectory::RenameFile) would perform the operation after it is called. The sync option could apply to other commands as well (such as DeleteFile). IMO, this is much cleaner and makes it easier to see the intent of what is going on. It would also be easier to find all of the places that should or should not be doing this sync code and keep track of them. |
To do that, we have to add |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I tried it out and found a few places that need FsyncAfterFileRename().
- the IDENTITY file after it's renamed
[pid 3470993] fdatasync(7</mnt/btrfs/dbbench/000000.dbtmp>) = 0
[pid 3470993] rename("/mnt/btrfs/dbbench/000000.dbtmp", "/mnt/btrfs/dbbench/IDENTITY") = 0
- OPTIONS file after it's renamed
[pid 3470993] fsync(10</mnt/btrfs/dbbench/OPTIONS-000006.dbtmp>) = 0
[pid 3470993] rename("/mnt/btrfs/dbbench/OPTIONS-000006.dbtmp", "/mnt/btrfs/dbbench/OPTIONS-000007") = 0
- Files created/copied during backup, and also the metafile
[pid 3512221] fdatasync(23</mnt/btrfs/backup/shared_checksum/.000041_sPZT1TXAIDLV27X0ZGXTZ_1057300.sst.tmp>) = 0
[pid 3512195] rename("/mnt/btrfs/backup//shared_checksum/.000041_sPZT1TXAIDLV27X0ZGXTZ_1057300.sst.tmp", "/mnt/btrfs/backup//shared_checksum/000041_sPZT1TXAIDLV27X0ZGXTZ_1057300.sst") = 0
...
[pid 3512195] fdatasync(23</mnt/btrfs/backup/meta/.1.tmp>) = 0
[pid 3512195] rename("/mnt/btrfs/backup//meta/.1.tmp", "/mnt/btrfs/backup//meta/1") = 0
Also, checkpoint is odd in that it renames the directory at the end. I am not sure how to handle that.
[pid 3542581] fdatasync(19</mnt/btrfs/checkpoint.tmp/CURRENT>) = 0
[pid 3542581] rename("/mnt/btrfs/checkpoint.tmp", "/mnt/btrfs/checkpoint/") = 0
I used following commands btw (creating a filesystem in /mnt/btrfs turned out to be wasted effort since I found out our devserver uses btrfs already...):
$ TEST_TMPDIR=/mnt/btrfs strace -fye rename,fsync,fdatasync ./db_bench -benchmarks=filluniquerandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -num=100000
$ strace -fye rename,fsync,fdatasync ./ldb backup --backup_dir=/mnt/btrfs/backup/ --db=/mnt/btrfs/dbbench/
$ strace -fye rename,fsync,fdatasync ./ldb checkpoint --checkpoint_dir=/mnt/btrfs/checkpoint/ --db=/mnt/btrfs/dbbench/
Thanks @ajkr for the review.
|
But in the case of non-btrfs we probably sync the directory for some reason not too long after any file is created. For example the dir would be synced for the new MANIFEST not too long after the IDENTITY file is created. The MANIFEST's dir sync coincidentally syncs the IDENTITY file pretty quickly. Whereas, skipping dir sync and missing a FsyncAfterFileRename() in btrfs would be more consequential because nothing would ever (?) force that file entry to be synced. |
That said, I am fine with adding the dir sync for non-btrfs in these cases, if you want. |
0867b38
to
c70a2de
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Sounds good.
|
include/rocksdb/file_system.h
Outdated
|
||
IOOptions() : IOOptions(false) {} | ||
|
||
IOOptions(bool force_dir_fsync_) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be explicit
.
IOOptions() : timeout(0), prio(IOPriority::kIOLow), type(IOType::kUnknown) {} | ||
// Force directory fsync, some file systems like btrfs may skip directory | ||
// fsync, set this to force the fsync | ||
bool force_dir_fsync; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know, this feels inconsistent with the API doc on IOOptions
: "These are hints and are not necessarily guaranteed". A hint to force something (IMO) means it might happen in which case client still needs to handle the case it didn't happen. But the way we are using Fsync(IOOptions(true) ...
is requiring the dir fsync happened.
IMO add an Fsync()
overload with argument contents_synced
(or better name) whose default implementation ignores it, but whose PosixFileSystem implementation uses it to skip the dir sync on btrfs. It is true by default but false when something like directory rename just happened and the new name has not been synced.
edit: Or just make the IOOption indicate "there's nothing important that needs to be synced" (idk the name for this); then it is fine to be ignored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can sort of see ignoring force_dir_fsync
to mean just do the fsync if I strain to think about it (because after all we're in a function called Fsync()
...). So it's fine to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. Are you worried about the size of IOOptions
to suggest having the contents_synced
as argument? If it's not a concern, how about adding contents_synced
to IOOptions
, maybe also having the string renamed_file
in IOOptions
(to get ride of FsyncAfterFileRename()
):
struct IOOptions {
...
// only for dir fsync()
std::string renamed_file; // or vector of renamed_files?
bool contents_synced = false;
}
Then we don't need to change public API. For existing POSIX filesystems, they just ignore these new added fields. And for btrfs, it can take advantage of that (or ignoring them won't cause correctness problem):
Fsync(opts) {
// if it's btrfs, otherwise just do fsync dir:
if (!opts.contents_synced) {
if (opts.renamed_file) {
Fsync(opts.renamed_file); // fsync renamed file
} else {
Fsync(fd_); // fsync dir
}
}
}
contents_synced
is set to false by default, so the default behavior is not changed. Please let me know what you think.
db/compaction/compaction_job.cc
Outdated
@@ -2322,7 +2322,7 @@ Status CompactionServiceCompactionJob::Run() { | |||
constexpr IODebugContext* dbg = nullptr; | |||
|
|||
if (output_directory_) { | |||
io_s = output_directory_->Fsync(IOOptions(), dbg); | |||
io_s = output_directory_->Fsync(IOOptions(true), dbg); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is dir fsync forced for remote compaction?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we do multiple Rename()
s for remote compaction result (sst files):
rocksdb/db/compaction/compaction_job.cc
Line 1086 in 1c290c7
s = fs_->RenameFile(src_file, tgt_file, IOOptions(), nullptr); |
we should force the dir fsync. We could also sync all compaction result files, which needs to make
FsyncAfterFileRename()
support list of files. To be simple, we just force the sync here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also checked file ingestion, seems extra safe (it syncs twice!). LGTM because it seems to work.
$ ./ldb write_extern_sst ./tmp.sst --db=/data/users/andrewkr/dbbench/dbbench/ << EOF
a ==> b
EOF
$ strace -fye link,fsync,fdatasync ./ldb ingest_extern_sst ./tmp.sst --db=/data/users/andrewkr/dbbench/dbbench/ --move_files
...
[pid 1808814] link("./tmp.sst", "/data/users/andrewkr/dbbench/dbbench//000152.sst") = 0
[pid 1808814] fdatasync(25</data/users/andrewkr/dbbench/dbbench/000152.sst>) = 0
[pid 1808814] fdatasync(25</data/users/andrewkr/dbbench/dbbench/000152.sst>) = 0
...
c70a2de
to
b44590a
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
1 similar comment
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
Discussed with @ajkr offline that we decided to add a new option |
409276c
to
51ea9c8
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is good, though somewhat difficult as fsync always is...
env/io_posix.cc
Outdated
} | ||
return s; | ||
} | ||
// fallback to dir-fsync for kDefault and kDirRenamed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and kFileDeleted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe kFileDeleted should skip the fsync. It's used during DB::Open so wouldn't it defeat this optimization if it falls back to fsync?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should keep having fsync for file deletion to keep consistent with other file systems (for file deletion, btrfs and other file system have the same behavior).
If we decide the fsync after deletion is not necessary, maybe we should remove the fsync for all. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing I am stuck on though is this PR doesn't seem to optimize any known scenario (please correct me if I'm wrong). Should we have a plan to finish eliminating directory fsyncs in the foreground during DB::Open? For example I wonder if we can use the background purge thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, that make sense.
I'm going to add kFileDeleted and eliminate fsync for btrfs first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This use of background purge thread looks promising.
Lines 1615 to 1620 in f24c39a
// PurgeObsoleteFiles here does not delete files. Instead, it adds the | |
// files to be deleted to a job queue, and deletes it in a separate | |
// background thread. | |
state->db->PurgeObsoleteFiles(job_context, true /* schedule only */); | |
state->mu->Lock(); | |
state->db->SchedulePurge(); |
We'd need to replace Open()'s call to DeleteObsoleteFiles() with something that can pick/schedule the purge and subsequent dir fsync in the background.
rocksdb/db/db_impl/db_impl_open.cc
Lines 1698 to 1699 in f24c39a
impl->DeleteObsoleteFiles(); | |
s = impl->directories_.GetDbDir()->Fsync(IOOptions(), nullptr); |
It looks like we have an option, avoid_unnecessary_blocking_io
, that is typically used to control delegating such I/O to the background. I think we can use it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, I changed it to do background purge if avoid_unnecessary_blocking_io = true
. And removed dir sync after file purge.
Added a benchmark for that, which shows 20% performance improvement on btrfs:
with fix:
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
DBOpen/iterations:200_mean 23894325 ns 18697120 ns 3
no fix:
-----------------------------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------------------------
DBOpen/iterations:200_mean 30343592 ns 24304870 ns 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! That change is cleaner than I would have guessed. Does the background thread do dir sync already?
51ea9c8
to
ed8553d
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
ed8553d
to
db02ef1
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
2bc47da
to
a270f6d
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
3 similar comments
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
edc27ec
to
96588c8
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Summary: Directory fsync might be expensive on btrfs and it may not be needed. Here are 3 directory fsync cases: 1. creating a new file: dir-fsync is not needed on btrfs, as long as the new file itself is synced. 2. renaming a file: dir-fsync is not needed if the renamed file is synced. So an API `FsyncForRename(filename, ...)` is provided to sync the file on btrfs. By default, it just calls dir-fsync. 3. deleting files: dir-fsync is skipped which will not persist the deletion. It should be harmless as RocksDB should try to cleanup later. Test Plan: run tests on btrfs and non btrfs
destroy Add write during bench
This reverts commit db287b11e0ad1631fb534e2f77e0ac108e474189.
cad3947
to
63609fe
Compare
@jay-zhuang has updated the pull request. You must reimport the pull request before landing. |
@jay-zhuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
@jay-zhuang merged this pull request in 2910264. |
Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:
new file itself is synced.
synced. So an API
FsyncAfterFileRename(filename, ...)
is providedto sync the file on btrfs. By default, it just calls dir-fsync.
IOOptions.force_dir_fsync = true
forced, the same as above.
Test Plan: run tests on btrfs and non btrfs