OSX fcntl(fd, F_NOCACHE, 1) not equivalent to O_DIRECT on Linux #48

Closed
aktau opened this issue Feb 22, 2015 · 4 comments
@aktau aktau commented Feb 22, 2015

This is probably already well known, but it bit me while planning to do some comparative benchmarks between Linux and OSX. I started with OSX and direct=1, rw=randread, but noticed that OSX was reading at 1200MB/s from a 1GB file. This was unexpected as I only have one spinning rust HDD in the MBP.

So I looked up the ways to do direct I/O on OSX. On Stack Overflow and the Apple mailing lists, fcntl(fd, F_NOCACHE, 1) looked to be the canonical solution. This was also implemented in fio(1) in 2011 in commit 7e8ad19. However, it seems that F_NOCACHE only disables the page cache from that point on: the file in question was already in the page cache, so its pages will not be purged and will still be used.
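
For reference, that canonical pattern looks roughly like this (a minimal sketch, not fio's actual code; open_nocache is just an illustrative name):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Open a file and disable caching for subsequent I/O. Note that pages
 * of the file already resident in the page cache are still served from
 * memory; F_NOCACHE only affects I/O issued from this point on. */
int open_nocache(const char *path)
{
	int fd = open(path, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return -1;
	}
	/* Similar in intent to O_DIRECT on Linux, but not equivalent */
	if (fcntl(fd, F_NOCACHE, 1) < 0) {
		perror("fcntl(F_NOCACHE)");
		close(fd);
		return -1;
	}
	return fd;
}
```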

I also commented on the Stack Overflow question "clear buffer cache on OSX" with my observations. I'll copy it here as well:

It's my impression that even when turning off the cache like this (with F_NOCACHE or F_GLOBAL_NOCACHE), if there are pages of the file already in the page cache, those will still be used. I tried to test this by using fio(1) with direct=1. It seems to confirm my suspicions (I get ~1200MB/s throughput on a random read, on a spindle HDD in my MBP, not an SSD). I've confirmed with dtruss that fio(1) actually calls fcntl correctly.

After running "sudo purge" and trying the same fio(1) invocation, it's much slower. So yes, it appears that F_NOCACHE is not a direct equivalent of O_DIRECT on Linux.

I'm not advocating running sudo purge as part of fio, but perhaps the documentation of direct could mention that the behaviour is quite different from O_DIRECT, even though both are fio's way of doing direct IO. Running sudo purge is slow and has an adverse effect on the rest of the system while (and after) it runs, for obvious reasons.

Another idea I had was to forcibly re-write the file each time before reading, while having the file open with F_NOCACHE, which keeps the written pages from entering the page cache (UBC). That would also (possibly) be slow, but hopefully it wouldn't evict other, unrelated files.
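
A rough sketch of that idea (rewrite_in_place is a hypothetical, untested helper; it assumes fd was opened O_RDWR with F_NOCACHE already set):

```c
#include <stdlib.h>
#include <unistd.h>

/* Copy each block of the file back onto itself. With F_NOCACHE set,
 * the freshly written pages should bypass the page cache; whether this
 * also invalidates pages that are already cached is the open question. */
static int rewrite_in_place(int fd, size_t blocksize)
{
	char *buf = malloc(blocksize);
	off_t off = 0;
	ssize_t n;

	if (!buf)
		return -1;
	while ((n = pread(fd, buf, blocksize, off)) > 0) {
		if (pwrite(fd, buf, n, off) != n) {
			free(buf);
			return -1;
		}
		off += n;
	}
	free(buf);
	return n < 0 ? -1 : 0;
}
```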

An interesting discussion involving an Apple dev on the Apple mailing list: http://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html. Specifically the part mentioning that files opened with F_NOCACHE will still use already-loaded pages if they're present.

EDIT: Some train-of-thought rambling: perhaps mmap + MAP_NOCACHE on OSX is a possible avenue (plus telling users to use that for direct IO testing). The man page doesn't give me a lot of hope though, as it appears there are no guarantees and the OS might keep things in memory if it feels like it anyway. A small reference mentioning it: http://www.qtcentre.org/threads/24733-High-performance-large-file-reading-on-OSX
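
That avenue would look roughly like this (a minimal sketch, assuming MAP_NOCACHE behaves as hinted; map_nocache is just an illustrative name):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only with MAP_NOCACHE. Per the man page this is only
 * a hint that the pages can be evicted first under memory pressure, so
 * there is still no guarantee the reads are uncached. */
void *map_nocache(const char *path, size_t *len)
{
	struct stat st;
	void *p;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return NULL;
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE | MAP_NOCACHE,
		 fd, 0);
	close(fd); /* the mapping remains valid after close */
	if (p == MAP_FAILED)
		return NULL;
	*len = (size_t)st.st_size;
	return p;
}
```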

@axboe axboe (Owner) commented Feb 22, 2015

Thanks for doing a thorough investigation! And I agree with you, we should add some mention of having to run purge to evict any previous cache. Or at least be aware that caching effects may exist.

But it really is a horrible interface; it seems strange that OSX doesn't give you finer control over what is cached and how.

@aktau aktau (Author) commented Feb 22, 2015

Thanks for doing a thorough investigation!

No problem at all. I'm glad I learned something and found a workaround in short order. At least the difference was large enough (1200MB/s vs 22MB/s is pretty noticeable).

But it really is a horrible interface; it seems strange that OSX doesn't give you finer control over what is cached and how.

I agree. I've looked around a bit more but could find nothing at all. It seems purge(8) uses a kernel extension that isn't properly documented, so it doesn't seem productive to find out whether it could be used for fio's purposes. A bit of documentation seems like the best option at the moment.

@axboe axboe closed this Jan 15, 2016
sitsofe added a commit to sitsofe/fio that referenced this issue Aug 27, 2017
Create an fio_directio() helper for the explicit directio setting call
and refactor its usage to be entirely within filesetup.c by calling it
from generic_open_file(). Also make initial layout in extend_file()
call it, thus keeping up with the change introduced by
6e344dc ("filesetup: keep OS_O_DIRECT flag when pre-allocating file").

A positive side effect of this change is that the following job

rm -f /tmp/fiofile; ./fio --loops=10 --filename /tmp/fiofile --bs=4k \
 --size=1M --direct=1 --name=go

that creates a brand new file will no longer report cached I/O speeds
(such as 2000MiB/s) on macOS. This was happening because fio on macOS
is unable to invalidate already-cached data (see axboe#48) and data
was being cached during the layout phase.

Signed-off-by: Sitsofe Wheeler <sitsofe@yahoo.com>
@sitsofe sitsofe (Collaborator) commented Sep 8, 2017

Just for the record: I checked whether mmap + MADV_DONTNEED/MADV_FREE on OSX 10.9 would cause pieces of files pre-cached into the UBC (Unified Buffer Cache) to be evicted/invalidated when re-read. Despite the rumour, the answer is no: it does not cause instant eviction/invalidation of entries, and it seems purge(8) is the only way to instantly evict things from the UBC on OSX.
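
The experiment had roughly this shape (a sketch under the stated setup, not the exact test code; advise_away is just an illustrative name):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a pre-cached file and advise the kernel its pages are unneeded,
 * then re-read and time it. On OSX 10.9 this did NOT evict the file's
 * pages from the UBC: the subsequent read still ran at cache speed. */
int advise_away(const char *path)
{
	struct stat st;
	void *p;
	int ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return -1;
	ret = madvise(p, st.st_size, MADV_DONTNEED); /* or MADV_FREE */
	munmap(p, st.st_size);
	return ret;
}
```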

axboe added a commit that referenced this issue Sep 1, 2020
When the test target device has a maximum open zones limit, the zones
in the test target region may not be openable up to the limit, because
zones outside the test target region may be open. To ensure that the
test target zones can be opened up to the limit, reset all zones of
the test target device before test cases with write workloads start.
Introduce the helper function prep_write() to check whether resetting
all zones is required and to do the reset.

Also remove unnecessary reset_zone calls for test cases #29 and #48.
These are no longer required by virtue of the improvement in
zbd_setup_files() to set up zones to meet the max_open_zones limit.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
axboe added a commit that referenced this issue Oct 30, 2020
The --size option was not specified for the fio command of test case
#48. This resulted in write operations to all available sequential
write required zones and relaxed the zone locking test conditions.
Specify the option to limit the test target to 16 zones so that zone
locking is tested under the expected conditions.

Fixes: 3bd2078 ("zbd: add test for stressing zone locking")
Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Reviewed-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
janekmi pushed a commit to janekmi/fio that referenced this issue Dec 4, 2020
rpma: fix a typo
axboe added a commit that referenced this issue Jan 29, 2021
Most of the ZBD code in fio uses zone_lock() to lock write pointer
zones. Besides taking the obvious pthread mutex lock, this wrapper
quiesces outstanding I/O when running via asynchronous ioengines. This
is necessary to avoid deadlocks. The function zbd_process_swd(),
however, still uses the naked pthread mutex to lock zones, and this
leads to a deadlock when running ZBD test #48 against regular nullb
devices.

The fix added in the same patch series that introduced test #48 was to
NOT initialize SWD at all, but that solution was found to create
problems with verify. As the proper fix, modify zbd_process_swd() to
use zone_lock(). This makes test #48 pass even when the SWD counter is
initialized.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
axboe added a commit that referenced this issue Jan 29, 2021
Test cases #3, #4, #28, #29 and #48 require rather large numbers of
sequential zones to run properly, and they fail if the test target
device does not have enough such zones in its zone configuration.

Check how many sequential zones are present on the test device and
skip any test cases for which this number is insufficient.

Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
axboe added a commit that referenced this issue Jan 29, 2021
Test #48 runs some I/O to the test device for 30 seconds and then
waits 45 seconds for fio to finish. If this wait times out, the test
assumes that fio is hung because of a zone locking issue and fails. It
has been observed that 45s may not be enough for some HDDs, especially
ones running specialized firmware.

Increase the timeout to 180 seconds to avoid any false positives.
There is no change in test duration for the most common devices; the
test waits the full 180 seconds only if it fails, otherwise it
finishes very soon after the 30-second I/O period ends.

Signed-off-by: Dmitry Fomichev <dmitry.fomichev@wdc.com>
Reviewed-by: Shin'ichiro Kawasaki <shinichiro.kawasaki@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>