Skip directory fsync for filesystem btrfs #8903

Closed
wants to merge 11 commits into from

Contributor

@jay-zhuang jay-zhuang commented Sep 10, 2021

Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 4 directory fsync cases:

  1. creating a new file: dir-fsync is not needed on btrfs, as long as the
    new file itself is synced.
  2. renaming a file: dir-fsync is not needed if the renamed file is
    synced. So an API FsyncAfterFileRename(filename, ...) is provided
    to sync the file on btrfs. By default, it just calls dir-fsync.
  3. deleting files: dir-fsync is forced by setting
    IOOptions.force_dir_fsync = true
  4. renaming multiple files (like backup and checkpoint): dir-fsync is
    forced, the same as above.
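
For readers less familiar with the pattern being optimized, here is a minimal POSIX sketch of "write a temp file, sync it, rename it, then persist the rename". It is not RocksDB code: `FsyncPath`, `WriteAndPublish`, and the `file_fsync_persists_rename` flag are hypothetical names used only for illustration. The flag stands in for the btrfs behavior described in case 2 above, which this sketch takes as an assumption from the PR description.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <cstdio>
#include <string>

// Hypothetical helper: open `path` with `flags`, fsync it, close it.
bool FsyncPath(const std::string& path, int flags) {
  int fd = open(path.c_str(), flags);
  if (fd < 0) return false;
  bool ok = (fsync(fd) == 0);
  close(fd);
  return ok;
}

// Hypothetical helper: write `data` to dir/tmp_name, sync the file, rename it
// to dir/final_name, then persist the rename.
// If `file_fsync_persists_rename` is true (the assumption made for btrfs in
// this PR), the renamed file is synced instead of the directory.
bool WriteAndPublish(const std::string& dir, const std::string& tmp_name,
                     const std::string& final_name, const std::string& data,
                     bool file_fsync_persists_rename) {
  const std::string tmp = dir + "/" + tmp_name;
  const std::string dst = dir + "/" + final_name;

  int fd = open(tmp.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
  if (fd < 0) return false;
  ssize_t n = write(fd, data.data(), data.size());
  bool ok = (n == static_cast<ssize_t>(data.size())) && (fsync(fd) == 0);
  close(fd);
  if (!ok) return false;

  if (std::rename(tmp.c_str(), dst.c_str()) != 0) return false;

  if (file_fsync_persists_rename) {
    return FsyncPath(dst, O_RDONLY);  // sync only the renamed file
  }
  return FsyncPath(dir, O_RDONLY | O_DIRECTORY);  // classic directory fsync
}
```

The only difference between the two modes is the final call, which is the part this PR tries to make cheaper on btrfs.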

Test Plan: run tests on btrfs and non btrfs

Contributor

@mrambacher mrambacher left a comment

Looking through the code, there are a lot of places that this would not be covered. The checkpoint and backup code both rename files in the directory, fsync the directory. This change will not cover those cases.

Personally, I think it might be better to have a FSDirectory::RenameFile API that takes the two names, and the other options (IOOptions, IODebugContext*) and does the rename of the file and the Fsync (if an IOOptions suggests it).

I think that the FSDirectory class should have many of the methods of FileSystem (listChildren, exists, etc) that are basic wrappers back to the FileSystem. The difference is the methods are not required to specify the full path as input.

Comment on lines 1100 to 1151
// Fsync after renaming a file. Depends on the filesystem, it may fsync
// directory or just the renaming file (e.g. btrfs). By default, it just calls
// directory fsync.
virtual IOStatus FsyncForRename(const std::string& filename,
                                const IOOptions& options,
                                IODebugContext* dbg) {
  return Fsync(options, dbg);
}

Contributor

Rather than adding a new API, why not make it an IOOption? That seems like it would be cleaner.

Contributor

Alternatively, this could just be Fsync(const std::string & filename...). Not sure why this is specific to "rename"

Contributor Author

Because this is only for a file rename. For new file creation (or deletion) we still just call Fsync(). We want to be explicit for that. But I agree that putting that information in IOOption would be more flexible.

env/io_posix.cc Outdated
Comment on lines 1535 to 1536
if (is_btrfs_) {
int fd = open(filename.c_str(), O_RDONLY);
Contributor

If the directory is /home/my_dir and the filename is /home/my_file, what should happen? Should it be a requirement that the filename is in the directory or an error (NotFound?) is returned

Contributor Author

If that's the case, the dir-fsync won't persist the file renaming. At least with this change, on btrfs the renaming could be persisted with file fsync.

env/fs_posix.cc Outdated
Comment on lines 562 to 563
struct statfs buf;
int ret = fstatfs(fd, &buf);
Contributor

I think it would be cleaner if this code was moved inside of the PosixDirectory constructor.

And if it is done there, can this check be skipped altogether if BTRFS_SUPER_MAGIC is not defined?

Contributor Author

Makes sense, updated.

We cannot use whether BTRFS_SUPER_MAGIC is defined to check that the directory is on btrfs.

Contributor

If BTRFS_SUPER_MAGIC is not defined, is there any reason to do this fstatfs?
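
A minimal sketch of the check discussed in this thread, assuming a Linux build: the macro only tells us at compile time whether the magic number is known, while `fstatfs` answers the runtime question of which filesystem the directory lives on. `DirIsOnBtrfs` is a hypothetical helper name, not the function in the PR. When the macro is unavailable, the sketch skips the `fstatfs` call entirely, as the reviewer suggests.

```cpp
#include <fcntl.h>
#include <unistd.h>

#if defined(__linux__)
#include <sys/vfs.h>  // fstatfs, struct statfs
#if __has_include(<linux/magic.h>)
#include <linux/magic.h>  // may define BTRFS_SUPER_MAGIC
#endif
#endif

// Hypothetical helper: returns true only if `dir_path` can be opened and the
// kernel reports the btrfs magic number for it.
bool DirIsOnBtrfs(const char* dir_path) {
#if defined(__linux__) && defined(BTRFS_SUPER_MAGIC)
  int fd = open(dir_path, O_RDONLY);
  if (fd < 0) return false;
  struct statfs buf;
  bool is_btrfs =
      fstatfs(fd, &buf) == 0 &&
      static_cast<unsigned long>(buf.f_type) ==
          static_cast<unsigned long>(BTRFS_SUPER_MAGIC);
  close(fd);
  return is_btrfs;
#else
  // Magic number unknown at compile time: no point in calling fstatfs.
  (void)dir_path;
  return false;
#endif
}
```

Doing this once, for example in a directory object's constructor, avoids repeating the `fstatfs` call on every sync.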

@jay-zhuang
Contributor Author

> Looking through the code, there are a lot of places that this would not be covered. The checkpoint and backup code both rename files in the directory, fsync the directory. This change will not cover those cases.

We do want to keep the fsync for checkpoint and backup, as they're renaming a directory instead of a file. They're currently using the old Directory, which is not changed. But I think we do want to provide an API for FSDirectory to force a sync, in case they move to FSDirectory. I think switching to IOOptions is a good idea.

> Personally, I think it might be better to have a FSDirectory::RenameFile API that takes the two names, and the other options (IOOptions, IODebugContext*) and does the rename of the file and the Fsync (if an IOOptions suggests it).

Then the FileSystem::RenameFile() caller needs to check if the file is on btrfs and set IOOptions, which we want to do inside of FileSystem or FSDirectory. Or are you suggesting adding a new FSDirectory::RenameFile?

> I think that the FSDirectory class should have many of the methods of FileSystem (listChildren, exists, etc) that are basic wrappers back to the FileSystem. The difference is the methods are not required to specify the full path as input.

Sorry, can you elaborate?

@mrambacher
Contributor

> > Looking through the code, there are a lot of places that this would not be covered. The checkpoint and backup code both rename files in the directory, fsync the directory. This change will not cover those cases.
>
> We do want to keep the fsync for checkpoint and backup, as they're renaming a directory instead of a file. They're currently using the old Directory, which is not changed. But I think we do want to provide an API for FSDirectory to force a sync, in case they move to FSDirectory. I think switching to IOOptions is a good idea.

The Backup code renames both files and directories (checkpoint appears to only rename directories).

Compaction, Options, and Identity files are also examples of files that are created and renamed. I do not know what the rule/expectation for syncs are around those files.

> > Personally, I think it might be better to have a FSDirectory::RenameFile API that takes the two names, and the other options (IOOptions, IODebugContext*) and does the rename of the file and the Fsync (if an IOOptions suggests it).
>
> Then the FileSystem::RenameFile() caller needs to check if the file is on btrfs and set IOOptions, which we want to do inside of FileSystem or FSDirectory. Or are you suggesting adding a new FSDirectory::RenameFile?

> > I think that the FSDirectory class should have many of the methods of FileSystem (listChildren, exists, etc) that are basic wrappers back to the FileSystem. The difference is the methods are not required to specify the full path as input.
>
> Sorry, can you elaborate?

Sorry, I meant that there should be an FSDirectory::RenameFile that does the rename. It can still use the base FileSystem implementation to do the work, but can then do the sync if required.

@jay-zhuang
Contributor Author

> The Backup code renames both files and directories (checkpoint appears to only rename directories).

Backup and checkpoint do need dir-fsync. As they're using Directory instead of FSDirectory, we will keep them as they are for now. If we want to move them to FSDirectory in the future, we can add an option to force the dir-fsync.

> Compaction, Options, and Identity files are also examples of files that are created and renamed. I do not know what the rule/expectation for syncs are around those files.

First, nothing changed for non-btrfs.
For btrfs, as long as the new file is synced, there's no need to do a dir-fsync (which is not true for other filesystems). Details: T99353462. We checked that for compaction, flush, WAL, etc. we do fsync the file, unless there's no strong persistence requirement.

> Sorry, I meant that there should be an FSDirectory::RenameFile that does the rename. It can still use the base FileSystem implementation to do the work, but can then do the sync if required.

Yeah, that's one option. After discussing with Siying, to minimize the change, we will add an API FsyncAfterFileRename().

@facebook-github-bot
Contributor

@jay-zhuang has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@mrambacher
Contributor

I still believe it would be better NOT to add a new API and to use a new IOOptions field instead:

  1. A new API makes it very hard for a wrapped class to provide their own implementation of FSync. I believe the wrapper for FSyncRename calls target->FSyncRename, which calls target->FSync (not Wrapped FSync), defeating the ability to wrap the calls without wrapping both.
  2. This adds a new API for one specific use case. If another file system does something different for create or delete, do we need to add APIs for those modes as well?
  3. The IOOptions says it provides hints to the "file system" that may or may not be honored and this seems like the sort of thing it should be used for.

Personally, I would prefer having an IOOptions::sync that, if set to true, would encourage the FileSystem/Directory to do an Fsync after. Then either the RenameFile (or even better an FSDirectory::RenameFile) would perform the operation after it is called. The sync option could apply to other commands as well (such as DeleteFile).

IMO, this is much cleaner and makes it easier to see the intent of what is going on. It would also be easier to find all of the places that should or should not be doing this sync code and keep track of them.
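
To make point 1 concrete, here is a small self-contained sketch of the forwarding problem. The classes `Dir`, `DirWrapper`, and `CountingWrapper` are hypothetical stand-ins, not the RocksDB `FSDirectory` and `FSDirectoryWrapper` types, and the signatures are simplified. The sketch assumes, as the comment above does, that the wrapper forwards each method to the same method on its target.

```cpp
#include <string>

// Stand-in for a directory class with a default rename-sync implementation.
class Dir {
 public:
  virtual ~Dir() = default;
  virtual int Fsync() {
    ++fsync_calls;
    return 0;
  }
  // Default: just call Fsync() on *this* object.
  virtual int FsyncForRename(const std::string& /*filename*/) {
    return Fsync();
  }
  int fsync_calls = 0;
};

// Stand-in for a forwarding wrapper: every call goes straight to the target.
class DirWrapper : public Dir {
 public:
  explicit DirWrapper(Dir* target) : target_(target) {}
  int Fsync() override { return target_->Fsync(); }
  int FsyncForRename(const std::string& filename) override {
    return target_->FsyncForRename(filename);
  }

 protected:
  Dir* target_;
};

// A user wrapper that only customizes Fsync().
class CountingWrapper : public DirWrapper {
 public:
  using DirWrapper::DirWrapper;
  int Fsync() override {
    ++intercepted;
    return DirWrapper::Fsync();
  }
  int intercepted = 0;
};
```

Calling `FsyncForRename()` on a `CountingWrapper` reaches the target's `Fsync()` without ever passing through the wrapper's override, so a wrapper author has to override both methods to observe every sync.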

@jay-zhuang
Contributor Author

To do that, we would have to add FSDirectory::Rename(), which is a new API and requires refactoring existing code (a Rename() within FSDirectory also does not fit our usage perfectly), so we decided to have FsyncAfterFileRename() for this case.

Contributor

@ajkr ajkr left a comment

Looks good! I tried it out and found a few places that need FsyncAfterFileRename().

  • the IDENTITY file after it's renamed
[pid 3470993] fdatasync(7</mnt/btrfs/dbbench/000000.dbtmp>) = 0
[pid 3470993] rename("/mnt/btrfs/dbbench/000000.dbtmp", "/mnt/btrfs/dbbench/IDENTITY") = 0
  • OPTIONS file after it's renamed
[pid 3470993] fsync(10</mnt/btrfs/dbbench/OPTIONS-000006.dbtmp>) = 0
[pid 3470993] rename("/mnt/btrfs/dbbench/OPTIONS-000006.dbtmp", "/mnt/btrfs/dbbench/OPTIONS-000007") = 0
  • Files created/copied during backup, and also the metafile
[pid 3512221] fdatasync(23</mnt/btrfs/backup/shared_checksum/.000041_sPZT1TXAIDLV27X0ZGXTZ_1057300.sst.tmp>) = 0
[pid 3512195] rename("/mnt/btrfs/backup//shared_checksum/.000041_sPZT1TXAIDLV27X0ZGXTZ_1057300.sst.tmp", "/mnt/btrfs/backup//shared_checksum/000041_sPZT1TXAIDLV27X0ZGXTZ_1057300.sst") = 0
...
[pid 3512195] fdatasync(23</mnt/btrfs/backup/meta/.1.tmp>) = 0
[pid 3512195] rename("/mnt/btrfs/backup//meta/.1.tmp", "/mnt/btrfs/backup//meta/1") = 0

Also, checkpoint is odd in that it renames the directory at the end. I am not sure how to handle that.

[pid 3542581] fdatasync(19</mnt/btrfs/checkpoint.tmp/CURRENT>) = 0
[pid 3542581] rename("/mnt/btrfs/checkpoint.tmp", "/mnt/btrfs/checkpoint/") = 0

I used following commands btw (creating a filesystem in /mnt/btrfs turned out to be wasted effort since I found out our devserver uses btrfs already...):

$ TEST_TMPDIR=/mnt/btrfs strace -fye rename,fsync,fdatasync ./db_bench -benchmarks=filluniquerandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -num=100000
$ strace -fye rename,fsync,fdatasync ./ldb backup --backup_dir=/mnt/btrfs/backup/ --db=/mnt/btrfs/dbbench/
$ strace -fye rename,fsync,fdatasync ./ldb checkpoint --checkpoint_dir=/mnt/btrfs/checkpoint/ --db=/mnt/btrfs/dbbench/

@jay-zhuang
Contributor Author

Thanks @ajkr for the review.

  1. Identity file
    Currently we don't sync the dir after renaming the identity file:

    s = env->RenameFile(tmp, IdentityFileName(dbname));

    Should we add the dir sync for that? (As we rarely update the identity file, adding a dir-sync won't have much impact, I think.)

  2. Options file
    We don't sync dir after renaming:

    s = RenameTempFileToOptionsFile(file_name);

    Without the dir-sync, it could be a problem that the latest option file is lost, right?

  3. Backup and checkpoint
    Yes, we should add that. (I mistakenly thought that they use Directory instead of FSDirectory, so they're not impacted, which is not the case, as Directory still calls FSDirectory). Will add an option to force the directory sync.

@ajkr
Contributor

ajkr commented Sep 23, 2021

> Should we add the dir sync for that? (as we rarely update identity file, adding dir-sync won't have too much impact I think).
> Without the dir-sync, it could be a problem that the latest option file is lost, right?

But in the case of non-btrfs we probably sync the directory for some reason not too long after any file is created. For example the dir would be synced for the new MANIFEST not too long after the IDENTITY file is created. The MANIFEST's dir sync coincidentally syncs the IDENTITY file pretty quickly. Whereas, skipping dir sync and missing a FsyncAfterFileRename() in btrfs would be more consequential because nothing would ever (?) force that file entry to be synced.

That said, I am fine with adding the dir sync for non-btrfs in these cases, if you want.

@facebook-github-bot
Contributor

@jay-zhuang has updated the pull request. You must reimport the pull request before landing.

@jay-zhuang
Contributor Author

> That said, I am fine with adding the dir sync for non-btrfs in these cases, if you want.

Sounds good.

  1. Added FsyncAfterFileRename() for the Identity and Options files.
  2. Forced dir-fsync for backup, checkpoint, and also deletion.


IOOptions() : IOOptions(false) {}

IOOptions(bool force_dir_fsync_)
Contributor

should be explicit.
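
A short illustration of why a single-argument `bool` constructor is usually marked `explicit`. The two structs below are hypothetical stand-ins for `IOOptions`, reduced to the one field under discussion.

```cpp
#include <type_traits>

// Without `explicit`, a bare bool silently converts to an options object.
struct ImplicitOptions {
  ImplicitOptions(bool force_dir_fsync_) : force_dir_fsync(force_dir_fsync_) {}
  bool force_dir_fsync;
};

// With `explicit`, callers must spell out ExplicitOptions(true).
struct ExplicitOptions {
  explicit ExplicitOptions(bool force_dir_fsync_)
      : force_dir_fsync(force_dir_fsync_) {}
  bool force_dir_fsync;
};

// Accepts ImplicitOptions, so a call such as ForcedImplicit(true) compiles
// even though the caller never named the options type.
inline bool ForcedImplicit(const ImplicitOptions& opts) {
  return opts.force_dir_fsync;
}

static_assert(std::is_convertible<bool, ImplicitOptions>::value,
              "bool converts implicitly without `explicit`");
static_assert(!std::is_convertible<bool, ExplicitOptions>::value,
              "`explicit` blocks the implicit conversion");
```

With the implicit version, a call like `dir->Fsync(true, dbg)` would compile and quietly build an options object, which hides the intent at the call site.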

IOOptions() : timeout(0), prio(IOPriority::kIOLow), type(IOType::kUnknown) {}
// Force directory fsync, some file systems like btrfs may skip directory
// fsync, set this to force the fsync
bool force_dir_fsync;
Contributor

@ajkr ajkr Sep 24, 2021

I don't know, this feels inconsistent with the API doc on IOOptions: "These are hints and are not necessarily guaranteed". A hint to force something (IMO) means it might happen, in which case the client still needs to handle the case where it didn't happen. But the way we are using Fsync(IOOptions(true), ...) requires that the dir fsync actually happened.

IMO add an Fsync() overload with argument contents_synced (or better name) whose default implementation ignores it, but whose PosixFileSystem implementation uses it to skip the dir sync on btrfs. It is true by default but false when something like directory rename just happened and the new name has not been synced.

edit: Or just make the IOOption indicate "there's nothing important that needs to be synced" (idk the name for this); then it is fine to be ignored.

Contributor

I can sort of see ignoring force_dir_fsync to mean just do the fsync if I strain to think about it (because after all we're in a function called Fsync()...). So it's fine to me.

Contributor Author

@jay-zhuang jay-zhuang Sep 26, 2021

I agree. Are you suggesting contents_synced as an argument because you're worried about the size of IOOptions? If that's not a concern, how about adding contents_synced to IOOptions, and maybe also a renamed_file string (to get rid of FsyncAfterFileRename()):

struct IOOptions {
  ...
  // only for dir fsync()
  std::string renamed_file; // or vector of renamed_files?
  bool contents_synced = false;
};

Then we don't need to change the public API. Existing POSIX filesystems just ignore these newly added fields. And btrfs can take advantage of them (ignoring them won't cause a correctness problem):

Fsync(opts) {
  // if it's btrfs, otherwise just do fsync dir:
  if (!opts.contents_synced) {
    if (!opts.renamed_file.empty()) {
       Fsync(opts.renamed_file); // fsync renamed file
    } else {
       Fsync(fd_); // fsync dir
    }
  }
}

contents_synced is set to false by default, so the default behavior is not changed. Please let me know what you think.
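
The pseudocode above can be turned into a compilable sketch along these lines. `IOOptionsSketch` and `DirSketch` are hypothetical types, and the real syscalls are replaced by a log of the actions that would be taken, so the branching can be inspected without a btrfs mount.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the two proposed IOOptions fields.
struct IOOptionsSketch {
  std::string renamed_file;      // set when a single file was just renamed
  bool contents_synced = false;  // true: nothing in the dir still needs a sync
};

// Hypothetical directory object that records what it would sync.
class DirSketch {
 public:
  explicit DirSketch(bool is_btrfs) : is_btrfs_(is_btrfs) {}

  void Fsync(const IOOptionsSketch& opts) {
    if (!is_btrfs_) {
      // Other filesystems ignore the new fields and always sync the dir.
      log.push_back("fsync dir");
      return;
    }
    if (opts.contents_synced) {
      return;  // btrfs: nothing left to persist
    }
    if (!opts.renamed_file.empty()) {
      log.push_back("fsync file " + opts.renamed_file);
    } else {
      log.push_back("fsync dir");
    }
  }

  std::vector<std::string> log;

 private:
  bool is_btrfs_;
};
```

With default-constructed options both directory kinds fall through to a directory fsync, which matches the claim that the default behavior is unchanged.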

@@ -2322,7 +2322,7 @@ Status CompactionServiceCompactionJob::Run() {
constexpr IODebugContext* dbg = nullptr;

if (output_directory_) {
-io_s = output_directory_->Fsync(IOOptions(), dbg);
+io_s = output_directory_->Fsync(IOOptions(true), dbg);
Contributor

Why is dir fsync forced for remote compaction?

Contributor Author

As we do multiple Rename()s for the remote compaction results (SST files):

s = fs_->RenameFile(src_file, tgt_file, IOOptions(), nullptr);

we should force the dir fsync. We could also sync all compaction result files, but that would require FsyncAfterFileRename() to support a list of files. To keep it simple, we just force the sync here.

Contributor

@ajkr ajkr left a comment

Also checked file ingestion, seems extra safe (it syncs twice!). LGTM because it seems to work.

$ ./ldb write_extern_sst ./tmp.sst --db=/data/users/andrewkr/dbbench/dbbench/ << EOF
a ==> b
EOF

$ strace -fye link,fsync,fdatasync ./ldb ingest_extern_sst ./tmp.sst --db=/data/users/andrewkr/dbbench/dbbench/ --move_files
...
[pid 1808814] link("./tmp.sst", "/data/users/andrewkr/dbbench/dbbench//000152.sst") = 0
[pid 1808814] fdatasync(25</data/users/andrewkr/dbbench/dbbench/000152.sst>) = 0
[pid 1808814] fdatasync(25</data/users/andrewkr/dbbench/dbbench/000152.sst>) = 0
...

@jay-zhuang
Contributor Author

After discussing offline with @ajkr, we decided to add a new option, DirFsyncOptions, to the Fsync() API. The option will include the dir fsync reason and the corresponding options. As we cannot override the virtual function with a different signature, the new function is named FsyncWithDirOptions(). By default, it behaves the same as Fsync(). If the caller specifies DirFsyncOptions, btrfs (or another future FS) can take advantage of that and skip the dir fsync.
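
A rough sketch of how such an API could look, under stated assumptions. The names `DirFsyncOptions` and `FsyncWithDirOptions` and the reasons `kDefault`, `kDirRenamed`, and `kFileDeleted` come from this review; the reasons `kNewFileSynced` and `kFileRenamed`, the `renamed_new_name` field, and both class names are assumptions made for illustration. Counters replace the real syscalls.

```cpp
#include <string>

// Sketch of the option struct: a reason plus reason-specific data.
struct DirFsyncOptions {
  enum class Reason {
    kDefault,
    kNewFileSynced,  // assumed name: a new file was created and synced
    kFileRenamed,    // assumed name: a single file was renamed
    kDirRenamed,
    kFileDeleted,
  };
  Reason reason = Reason::kDefault;
  std::string renamed_new_name;  // assumed field, used with kFileRenamed
};

// Base directory: the new method defaults to a plain directory fsync.
class FSDirectorySketch {
 public:
  virtual ~FSDirectorySketch() = default;
  virtual bool Fsync() {
    ++dir_fsyncs;
    return true;
  }
  virtual bool FsyncWithDirOptions(const DirFsyncOptions& /*opts*/) {
    return Fsync();
  }
  int dir_fsyncs = 0;
};

// btrfs-like directory: uses the reason to avoid the directory fsync.
class BtrfsDirectorySketch : public FSDirectorySketch {
 public:
  bool FsyncWithDirOptions(const DirFsyncOptions& opts) override {
    switch (opts.reason) {
      case DirFsyncOptions::Reason::kNewFileSynced:
      case DirFsyncOptions::Reason::kFileDeleted:
        return true;  // skip: no dir fsync for these reasons in this sketch
      case DirFsyncOptions::Reason::kFileRenamed:
        ++file_fsyncs;  // would fsync opts.renamed_new_name instead
        return true;
      case DirFsyncOptions::Reason::kDefault:
      case DirFsyncOptions::Reason::kDirRenamed:
        break;  // fall back to the directory fsync
    }
    return Fsync();
  }
  int file_fsyncs = 0;
};
```

Because the base class ignores the options, file systems that do not know about them keep their existing behavior.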

Contributor

@ajkr ajkr left a comment

I think it is good, though somewhat difficult as fsync always is...

env/io_posix.cc Outdated
}
return s;
}
// fallback to dir-fsync for kDefault and kDirRenamed
Contributor

and kFileDeleted?

Contributor

Or maybe kFileDeleted should skip the fsync. It's used during DB::Open so wouldn't it defeat this optimization if it falls back to fsync?

Contributor Author

I think we should keep the fsync for file deletion to stay consistent with other file systems (for file deletion, btrfs and other file systems have the same behavior).
If we decide the fsync after deletion is not necessary, maybe we should remove the fsync for all. What do you think?

Contributor

The thing I am stuck on though is this PR doesn't seem to optimize any known scenario (please correct me if I'm wrong). Should we have a plan to finish eliminating directory fsyncs in the foreground during DB::Open? For example I wonder if we can use the background purge thread.

Contributor Author

Sure, that makes sense.
I'm going to add kFileDeleted and eliminate the fsync for btrfs first.

Contributor

This use of background purge thread looks promising.

rocksdb/db/db_impl/db_impl.cc

Lines 1615 to 1620 in f24c39a

// PurgeObsoleteFiles here does not delete files. Instead, it adds the
// files to be deleted to a job queue, and deletes it in a separate
// background thread.
state->db->PurgeObsoleteFiles(job_context, true /* schedule only */);
state->mu->Lock();
state->db->SchedulePurge();

We'd need to replace Open()'s call to DeleteObsoleteFiles() with something that can pick/schedule the purge and subsequent dir fsync in the background.

impl->DeleteObsoleteFiles();
s = impl->directories_.GetDbDir()->Fsync(IOOptions(), nullptr);

It looks like we have an option, avoid_unnecessary_blocking_io, that is typically used to control delegating such I/O to the background. I think we can use it here.

Contributor Author

Sounds good. I changed it to do a background purge if avoid_unnecessary_blocking_io = true, and removed the dir sync after the file purge.
I added a benchmark for that, which shows roughly a 20% improvement in DB open time on btrfs (a mean of about 23.9 ms with the fix versus 30.3 ms without):

with fix:
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
DBOpen/iterations:200_mean     23894325 ns     18697120 ns            3



no fix:
-----------------------------------------------------------------------
Benchmark                             Time             CPU   Iterations
-----------------------------------------------------------------------
DBOpen/iterations:200_mean     30343592 ns     24304870 ns            3

Contributor

Awesome! That change is cleaner than I would have guessed. Does the background thread do dir sync already?

Contributor

@ajkr ajkr left a comment

LGTM!

Summary:
Directory fsync might be expensive on btrfs and it may not be needed.
Here are 3 directory fsync cases:
1. creating a new file: dir-fsync is not needed on btrfs, as long as the
   new file itself is synced.
2. renaming a file: dir-fsync is not needed if the renamed file is
   synced. So an API `FsyncForRename(filename, ...)` is provided to sync
   the file on btrfs. By default, it just calls dir-fsync.
3. deleting files: dir-fsync is skipped which will not persist the
   deletion. It should be harmless as RocksDB should try to cleanup
   later.

Test Plan: run tests on btrfs and non btrfs
destroy

Add write during bench
This reverts commit db287b11e0ad1631fb534e2f77e0ac108e474189.
@facebook-github-bot
Contributor

@jay-zhuang merged this pull request in 2910264.
