os/bluestore: add discard method for ssd's performance #14727
Conversation
We should add a bdev_discard bool config option to enable/disable this, as discard is bad on some devices. Also, the bluestore piece is more complicated. The existing discard op is synchronous... we don't want to call it where you have it. We either need a thread doing these, or to use aio. And we probably need to add some ordering infrastructure to make sure that a deallocate+discard doesn't run concurrently with a new allocation and write into that space. I'm not sure exactly what the rules are, but at a minimum we need to make sure the discard aio completes before allowing the space to be reallocated (i.e., discard before release to allocator). Note that the last time I had a conversation with @allensamuels (SanDisk) about this we decided it wasn't that important that we worry about discards. I forget what the reason was.
This needs a rebase.
Not all the vendors implement the discard/trim mechanism; they can take the request and drop it / not implement it at the firmware level. But it is generally a good idea to pass the discards through to help the GC process. As Sage mentioned, we need to measure the cost of sync calls vs async, with some test results if possible on an aged device with GC running.
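A minimal sketch of the kind of measurement suggested above, assuming a stand-in discard callable (not the Ceph API); the sync and threaded/async variants would simply be passed in as different callables:

```cpp
#include <chrono>
#include <cstdint>
#include <functional>
#include <vector>

struct Extent { uint64_t off, len; };

// Time how long issuing a batch of discards takes with a given
// implementation, in milliseconds.
double time_discards(const std::vector<Extent>& batch,
                     const std::function<void(const Extent&)>& discard) {
  auto t0 = std::chrono::steady_clock::now();
  for (const auto& e : batch)
    discard(e);
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```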
@varadakari Thanks
@voidbag: I asked on the ML and was pointed to this PR. Is there anything else you need to add/fix before this can be reviewed and maybe merged?
@wido Thanks
So this one is still pending?
Yeah, I'm still nervous about it. Two things:
1. The special async interface for discard, separate from the aio one, rubs me the wrong way, although it's probably justified since you can't submit discard requests via libaio.
2. I'm not sure this will actually be a good thing to do on a real SSD. For XFS most seem to recommend using fstrim periodically instead of online discard because so many firmwares suck and there are often performance anomalies (or bugs) from lots of small discards. Even if that isn't the case, if we are discarding small bits as we free them, they may be dropped by the SSD such that we don't trim the larger region they eventually form.
I suppose we should probably rebase this, and do some testing, but leave it off by default (maybe with the option marked DEV or something so that there is some warning for users?).
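The concern about many small discards can be illustrated with a coalescing sketch (hypothetical, not from this PR): adjacent frees are merged so the device can be sent one large trim instead of many tiny ones that its firmware may throttle or drop.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <map>

// Collects freed byte ranges, merging adjacent/overlapping ones, so a
// later pass can issue a few large discards instead of many small ones.
class TrimSet {
public:
  void insert(uint64_t off, uint64_t len) {
    auto next = m_.lower_bound(off);
    // Merge with the predecessor if it touches or overlaps [off, off+len).
    if (next != m_.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second >= off) {
        uint64_t end = std::max(prev->first + prev->second, off + len);
        off = prev->first;
        len = end - off;
        m_.erase(prev);
      }
    }
    // Absorb any following ranges that now touch or overlap.
    while (next != m_.end() && next->first <= off + len) {
      len = std::max(off + len, next->first + next->second) - off;
      next = m_.erase(next);
    }
    m_[off] = len;
  }
  const std::map<uint64_t, uint64_t>& ranges() const { return m_; }
private:
  std::map<uint64_t, uint64_t> m_;  // offset -> length
};
```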
I'll rebase this PR and check the issues you all pointed out by this weekend. @liewegas I think fstrim is just for local filesystems, unlike BlueFS, which uses a raw block device.
Right about fstrim. I mean that it is a periodic operation that reexamines the freelist and trims it, versus an online approach that trims as we free things. The offline one will submit fewer trims and they should work (as well as the device allows) regardless of the size of the individual released extents.
FWIW I think we will want to have an fstrim-like function either way in order to discard upgraded bluestores. Once it's there we can compare results of a periodic approach vs the online one. I think the StupidAllocator in-memory structure should make it reasonably easy to implement since it already orders/batches free extents by size; we can focus on discarding the big extents first (or only).
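A sketch of such an fstrim-like pass under the assumptions above — the binning mirrors the idea of a size-ordered free structure like StupidAllocator's, but all names here are hypothetical:

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

// Free extents binned by length; a periodic pass can discard the largest
// first (or only those above a threshold).
class FreeList {
public:
  void free_extent(uint64_t off, uint64_t len) { by_len_[len].insert(off); }

  // Collect extents of at least min_len bytes, biggest first.
  std::vector<std::pair<uint64_t, uint64_t>> trim_candidates(uint64_t min_len) const {
    std::vector<std::pair<uint64_t, uint64_t>> out;  // (offset, length)
    for (auto it = by_len_.rbegin(); it != by_len_.rend(); ++it) {
      if (it->first < min_len) break;          // everything below is too small
      for (uint64_t off : it->second)
        out.emplace_back(off, it->first);
    }
    return out;
  }
private:
  std::map<uint64_t, std::set<uint64_t>> by_len_;  // length -> set of offsets
};
```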
@liewegas The problem with mounting EXT4 or XFS with the discard option isn't discarding itself, but SATA 3.0. The SATA 3.1 spec brings queued TRIM, which overcomes these problems. Many controllers out there, however, only implement SATA 3.0. A few resources:
So yes, off by default and let the user enable it if they need to. Might need to write a log line telling them that in certain scenarios having discard on can degrade performance.
src/os/bluestore/BlueFS.cc
for (auto p = to_release[i].begin(); p != to_release[i].end(); ++p) {
  if (cct->_conf->bdev_enable_discard && r == 0)
    bdev[i]->discard(p.get_start(), p.get_len());
I think this code is also wrong, because r is undefined in that scope...
The intention of the code is that it synchronously discards a given range which failed to be discarded async.
And I don't understand your (@liewegas) comment "This doesn't seem right.. first, if you discard async, you don't want to later discard() sync, or call release(), right?"
Does it mean 'Don't discard synchronously and don't call release(), if discard has been done async'?
Am I right?
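The intention described above could be sketched like this, with hypothetical names and a stub device standing in for the real BlockDevice:

```cpp
#include <cstdint>

struct Extent { uint64_t off, len; };

// Stand-in device: queue_discard() returns 0 when the async discard was
// queued, <0 when queueing failed. Purely illustrative, not the Ceph API.
struct StubDevice {
  bool fail_async = false;
  int async_calls = 0, sync_calls = 0;
  int queue_discard(const Extent&) { ++async_calls; return fail_async ? -1 : 0; }
  void discard_sync(const Extent&) { ++sync_calls; }
};

// Try the async discard first; only if it could not be queued, fall back
// to a synchronous discard. Note that r is defined in this scope.
template <typename Dev, typename ReleaseFn>
void release_extent(Dev& dev, const Extent& e, ReleaseFn release_to_allocator) {
  int r = dev.queue_discard(e);
  if (r < 0)
    dev.discard_sync(e);      // sync fallback for the failed async discard
  release_to_allocator(e);    // release to the allocator exactly once
}
```

Note this sketch elides the ordering concern raised earlier in the thread: on the async-success path, the release should really wait until the discard has completed.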
That's a comment I deleted after I looked harder and understood the code :)
@liewegas Am I right? Please correct me if I made a mistake.
I think 3 isn't needed; just 2!
Okay, then, |
src/common/config_opts.h
@@ -1024,6 +1024,8 @@ OPTION(bdev_debug_aio_suicide_timeout, OPT_FLOAT, 60.0)
// NVMe driver is loaded while osd is running.
OPTION(bdev_nvme_unbind_from_kernel, OPT_BOOL, false)
OPTION(bdev_nvme_retry_count, OPT_INT, -1) // -1 means by default which is 4
OPTION(bdev_enable_discard, OPT_BOOL, true)
Shouldn't these be set to false?
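If the default is flipped to false as suggested, users who want online discard would opt in explicitly; a hypothetical ceph.conf fragment (option name taken from the diff above):

```ini
[osd]
# Discard/TRIM is off by default; enable only if the device handles it well.
bdev_enable_discard = true
```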
Do we have an update on this PR? There are more SATA 3.1 and SAS 3.0 SSDs (Samsung PM1633a) out there which queue TRIM asynchronously, so it can be 'sync' in the code. Would like to see this in BlueStore.
Do I understand properly that if …
@k0ste Yes, it does. I think we can skip (or make it configurable) TRIM for db_slow as that's usually on a HDD or non-TRIM device.
if (!rotational)
  r = block_device_discard(fd_direct, (int64_t)offset, (int64_t)len);
return r;
}
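For context, a sketch of what such a block_device_discard() helper typically does on Linux — this is an assumption about the implementation, not a copy of Ceph's code: the BLKDISCARD ioctl takes a {offset, length} pair in bytes and requires an open fd on a real block device with appropriate permissions.

```cpp
#include <cerrno>
#include <cstdint>
#include <sys/ioctl.h>
#include <linux/fs.h>   // BLKDISCARD

// Ask the device to discard len bytes starting at offset.
// Returns 0 on success, -errno on failure.
static int block_device_discard(int fd, int64_t offset, int64_t len) {
  uint64_t range[2] = { static_cast<uint64_t>(offset),
                        static_cast<uint64_t>(len) };
  if (ioctl(fd, BLKDISCARD, range) < 0)
    return -errno;
  return 0;
}
```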
I see @voidbag! You are correct.
@voidbag Can you resolve the conflicts? After that I hope it's ready to be merged.
@wido Rebase has been done. Could anyone tell me how the bluestore*.yaml file has to be modified for periodic discard? Thanks
Thanks! @liewegas might know better, but I think that ./qa/objectstore/bluestore.yaml is sufficient. Although the testing HW should also support discard in order to make this work. Would be very nice if this one gets merged. It has been open for a very long time.
Just enable the setting in bluestore.yaml... that should be sufficient. Thanks!
The smithi testing hardware that we use primarily for rados is all NVMe-based, so this should be exercised.
Discard method is added for ssd's performance. Signed-off-by: Taeksang Kim <voidbag@gmail.com>
discard is added to BlueFS.cc and BlueStore.cc Signed-off-by: Taeksang Kim <voidbag@gmail.com>
Signed-off-by: Taeksang Kim <voidbag@gmail.com>
set bdev_enable_discard and bdev_async_discard true. Signed-off-by: Taeksang Kim <voidbag@gmail.com>
Great, hopefully the tests are OK so that this one can be merged.
The failures in the first run were caused by sepia issues.
Discard method is added to BlockDevice.h for SSD performance.
For SSD performance, discard should be used when BlueStore or BlueFS releases a block device's area.
Signed-off-by: Taeksang Kim voidbag@gmail.com