Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/bluestore: add discard method for ssd's performance #14727

Merged
merged 4 commits into from Feb 24, 2018

Conversation

voidbag
Copy link
Contributor

@voidbag voidbag commented Apr 22, 2017

discard method is added to BlockDevice.h for ssd's performance
For ssd's performance, discard should be used, when BlueStore or BlueFS releases block device's area.

Signed-off-by: Taeksang Kim voidbag@gmail.com

@liewegas
Copy link
Member

We should add a bdev_discard bool config option to enable/disable this, as discard is bad on some devices.

Also, teh bluestore piece is more complicated. The existing discard op is synchronous.. we don't want to call it where you have it. We either need a thread doing these, or to use aio.

And we probably need to add some ordering infrastructure to make sure that a deallocate+discard doesn't run concurrently with a new allocation and write into that space. I'm not sure exactly what the rules are, but at a minimum we need to make sure the discard aio completes before allowing the space to be reallocated (i.e., discard before release to allocator).

Note that the last time I had a conversation with @allensamuels (sandisk) about this we decided it wasn't that important that we worry about discards. I forget what the reason was.

@varadakari
Copy link
Contributor

This needs a rebase.

@varadakari
Copy link
Contributor

Not all the vendors implement discard/trim mechanism, they can take the request and drop/not implement in the firmware level. But it is generally good idea to pass the discards to help the gc process. As Sage mentioned we need to measure the cost of sync calls vs async, some test results if possible on a aged device with gc running.

@voidbag
Copy link
Contributor Author

voidbag commented May 19, 2017

@varadakari
I'll rebase it
I had a lot of things to do...
I will update it based on Sage 's comment until next week.
Sorry for the delay.

Thanks

@voidbag voidbag force-pushed the wip-voidbag-bdev-discard branch 5 times, most recently from 997a923 to af4c6e4 Compare May 31, 2017 14:42
@wido
Copy link
Member

wido commented Jun 27, 2017

@voidbag: I asked on the ML and was pointed to this PR.

Is there anything else you need to add/fix before this can be reviewed and maybe merged?

@voidbag
Copy link
Contributor Author

voidbag commented Jun 29, 2017

@wido
I think i have done the implementation.
The remaining thing is to fix the failure of the test, which u and other can point out.
I'm also trying to find the cause of the failure of test.

Thanks

@wido
Copy link
Member

wido commented Sep 11, 2017

So this one is still pending?

@liewegas
Copy link
Member

liewegas commented Sep 11, 2017 via email

@voidbag
Copy link
Contributor Author

voidbag commented Sep 11, 2017

I'll rebase this pr and check the issues you all pointed out, until this weekend

@liewegas I think fstrim is for just local filesystem unlike BlueFS using raw block device.
So i think BlueFS has to issue discard command to raw block device like other filesystem.

@liewegas
Copy link
Member

Right about fstrim. I mean that it is a periodic operation that reexamines the freelist and trims it, versus an online approach that trims as we free things. The offline one will submit fewer trims and they should (as well as the device allows) regardless of the size of the individual released extents.

@liewegas
Copy link
Member

FWIW I think we will want to have an fstrim-like function either way in order to discard upgraded bluestores. One it's there we can compare results of a periodic approach vs the online one. I think the StupidAllocot in-memory structure should make it reasonably easy to implement since it already orders/batches free extents by size; we can focus on discarding the big extents first (or only).

@wido
Copy link
Member

wido commented Sep 12, 2017

@liewegas The problem with mounting EXT4 or XFS with the discard option isn't that discarding is the problem, but it's SATA 3.0

The SATA 3.1 spec brings queued trimming which overcomes these problems. Many controllers out there however only implement SATA 3.0.

A few resources:

So yes, off by default and let the user enable it if the need to. Might need to write a log line telling them that in certain scenarios having discard on can degrade performance.

for (auto p = to_release[i].begin(); p != to_release[i].end(); ++p) {
if (cct->_conf->bdev_enable_discard && r == 0)
bdev[i]->discard(p.get_start(), p.get_len());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code is also wrong, because r is undefined in that scope...
The intention of the code is that it discards given range which is failed to discard async, synchronously.

And i don't understand your(@liewegas) comment "This doesn't seem right.. first, if you discard async, you do'nt want ot later discard() sync, or call release(), right?"
Does it mean 'Don't discard synchronously and don't call release(), if discard has been done async'
Am i right??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a comment I deleted after I looked harder and understood the code :)

@voidbag
Copy link
Contributor Author

voidbag commented Sep 18, 2017

@liewegas
I read all your comments.
I'll briefly describe what I have to do according to your comments.
Step1: Rebase this, fix bugs and add options whose postfix is _dev (This PR)
Step2: Implement periodic discard comment using stupid alloc with only big extents. (new PR)
Step3: Implement fstrim function with every extents like other filesystem (xfs..) (new PR)

Am I right? Plz correct me if i made a mistake.

@liewegas
Copy link
Member

I think 3 isn't needed; just 2!

@voidbag
Copy link
Contributor Author

voidbag commented Sep 18, 2017

Okay, then,
First, i tried to add _dev options in this PR

@@ -1024,6 +1024,8 @@ OPTION(bdev_debug_aio_suicide_timeout, OPT_FLOAT, 60.0)
// NVMe driver is loaded while osd is running.
OPTION(bdev_nvme_unbind_from_kernel, OPT_BOOL, false)
OPTION(bdev_nvme_retry_count, OPT_INT, -1) // -1 means by default which is 4
OPTION(bdev_enable_discard, OPT_BOOL, true)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't these be set to false?

@wido
Copy link
Member

wido commented Nov 3, 2017

Do we have an update on this PR?

There are more SATA 3.1 and SAS 3.0 SSDs (Samsung PM1633a) out there which async queue TRIM so it can be 'sync' in the code.

Would like to see this in BlueStore

@voidbag voidbag force-pushed the wip-voidbag-bdev-discard branch 4 times, most recently from 1566d64 to e070bd4 Compare November 4, 2017 02:55
@k0ste
Copy link
Contributor

k0ste commented Jan 24, 2018

I'm properly understand that if bdev_enable_discard is enabled TRIM command issued for: wal, db and db_slow (backend) device. What if backend device is HDD?

@wido
Copy link
Member

wido commented Jan 24, 2018

@k0ste Yes, it does. I think we can skip (or make it configurable) TRIM for db_slow as that's usually on a HDD or non-TRIM device.

@k0ste
Copy link
Contributor

k0ste commented Jan 24, 2018

@wido yes we should. I think is_rotational function can be used for it.
@voidbag please take a look. TRIM should not be issued if backend device is HDD.

@wido
Copy link
Member

wido commented Jan 24, 2018

@k0ste Good suggestion! @voidbag can you look into that?

if (!rotational)
r = block_device_discard(fd_direct, (int64_t)offset, (int64_t)len);
return r;
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wido @k0ste
I had already implemented that part.
You all don't have to worry about issuing discard command to rotational device.

@wido
Copy link
Member

wido commented Jan 25, 2018

I see @voidbag ! You are correct

@wido
Copy link
Member

wido commented Feb 16, 2018

@voidbag Can you resolve the conflicts?

After that I hope it's ready to be merged

@voidbag
Copy link
Contributor Author

voidbag commented Feb 19, 2018

@wido Rebase has been done.

Could anyone tell me how the bluestore*.yaml file has to be modified?
Do I have to modify existing file or create new one.
If I have to create new one, where do I have to do?

Periodic discard
I have analyzed allocator code(Stupid, Bitmap), roughly.
And I realized that bitmap allocator has fixed range of disk unlike stupid allocator that has only free area.
There isn't any good idea with bitmap allocator in my mind.
Current bad idea is a flag for undiscarded block, which requires full search among discarded and undiscarded blocks.
So,
I'll start to write the periodic discard separating undiscarded and discarded blocks with 'stupid allocator', as soon as this pr is merged.

Thanks

@wido
Copy link
Member

wido commented Feb 20, 2018

Thanks! @liewegas might know better, but I think that ./qa/objectstore/bluestore.yaml is sufficient.

Although the testing HW should also support discard in order to make this work.

Would be very nice if this one gets merged. It has been open for a very long time.

@liewegas
Copy link
Member

Just enable the setting in bluestore.yaml.. that should be sufficient. Thanks!

@liewegas
Copy link
Member

The smithi testing hardware that we use primarily for rados is all nvme based so this should be exercised.

@voidbag voidbag force-pushed the wip-voidbag-bdev-discard branch 2 times, most recently from 74a007a to b63ea00 Compare February 20, 2018 12:47
Discard method is added for ssd's performance.

Signed-off-by: Taeksang Kim <voidbag@gmail.com>
discard is added to BlueFS.cc and BlueStore.cc

Signed-off-by: Taeksang Kim <voidbag@gmail.com>
Signed-off-by: Taeksang Kim <voidbag@gmail.com>
set bdev_enable_discard and bdev_async_discard true.

Signed-off-by: Taeksang Kim <voidbag@gmail.com>
@wido
Copy link
Member

wido commented Feb 21, 2018

Great, hopefully the tests are OK so that this one can be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants