blk, os/bluestore: introduce huge page-based read buffers #43849
Conversation
@rzarzynski - could you please add the rationale/goals for this PR in the description?
And it would be great to have some test coverage in store_test if possible, please.
@ifed01: sure, will add. Thanks for the review!
This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved.
Added unit testing & rebased. I will add a summary to the PR's description based on the commit messages in the morning.
jenkins retest this please
jenkins retest this please
src/blk/kernel/KernelDevice.cc
Outdated
```cpp
@@ -1041,6 +1041,25 @@ int KernelDevice::discard(uint64_t offset, uint64_t len)
   return r;
 }

// FIXME: copied from buffer.cc
#define CEPH_BUFFER_ALLOC_UNIT 4096u
```
why not start exposing it in buffer.h?
Yeah, probably making it public is the way to go.
Switched to create_small_page_aligned(), which also avoids duplicating the constant.
src/blk/kernel/KernelDevice.cc
Outdated
```cpp
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
             -1,
             0);
ceph_assert(mmaped_region != MAP_FAILED);
```
IMO redundant
Fixed!
src/blk/kernel/KernelDevice.cc
Outdated
```cpp
    region_q(region_q) {
    // the `mmaped_region` has been passed to `raw` as the buffer's `data`
  }
  ~mmaped_buffer_raw() {
```
tag with override?
Tagged.
```cpp
             -1,
             0);
ceph_assert(mmaped_region != MAP_FAILED);
if (mmaped_region == MAP_FAILED) {
```
shouldn't we make this a bit more user friendly and permit OSD functioning after such a failure, e.g. using regular allocation mechanics?
This is setup code called once & early. I believe a user who wants to poke at such low-level things would like to see the failure as explicitly as possible; there is simply no business in requesting HP and falling back to plain pages.
Well - that's questionable to me given the expected lack of proper documentation. IIUC one has to preconfigure a large enough number of HPs in the system beforehand, something like OSD_count * num_hp_per_osd. If something goes wrong along the way, e.g. the OSD count is increased or some pages are allocated (or even leaked?) by different apps, the user would get non-operational OSDs which might be tricky to fix for one who is unaware of all the implementation details...
That's not very critical given this is disabled by default, but I can easily imagine relevant questions/tickets.... ;)
> If something goes wrong along the way, e.g. OSD count is increased or some pages are allocated (or even leaked?) by different apps - user would get non-operational OSDs which might be tricky to fix for a one who is unaware of all the implementation details...

Yes, this is nasty, but nasty in a very explicit way. With the implicit fallback there could be an even worse scenario: somebody adds a new OSD and the finely (and costly) tuned performance gets hit. Initially nobody notices as the cluster is underloaded, but some time later the targeted peak load comes...
Chasing such a problem would be far harder than investigating an assertion failure, I think.
Hmm, on the other hand maybe dout(0) / a warning + an extra sentence in the configurable's description would do? If somebody does such tuning, he should have log monitoring as well?
IMO the best way would be to raise a cluster warning (alert) if something goes wrong here and proceed. But maybe that's a bit of an overkill....
src/common/options/global.yaml.in
Outdated
```yaml
    from a KernelDevice. Applied to minimize size of scatter-gather lists
    sent to NICs. Targets really big buffers (>= 2 or 4 MBs).
    Keep in mind the system must be configured accordingly (see /proc/sys/vm/nr_hugepages).
  fmt_desc: List of key=value pairs delimited by comma, semicolon or tab
```
It would be nice to provide a description of what the key and the value are in this list. And some sample(s)...
Makes sense!
Improved comments and added an example.
src/blk/kernel/KernelDevice.cc
Outdated
```cpp
          << " bdev_read_preallocated_huge_buffer_num="
          << cct->_conf->bdev_read_preallocated_huge_buffer_num
          << " bdev_read_preallocated_huge_buffers="
          << cct->_conf.get_val<std::string>("bdev_read_preallocated_huge_buffers")
          << dendl;
  return lucky_raw;
} else {
  // fallthrough due to empty buffer pool. this can happen also
  // when the configurable was explicitly set to 0.
  dout(5) << __func__ << " cannot allocate from huge pool"
```
isn't that too verbose? This will print at level 5 on each allocation if the huge pool is disabled. And probably the successful-allocation logging above is too verbose as well...
IMO one should distinguish the failed-to-allocate (important) and huge-pool-disabled (debug-only) cases and log accordingly.
Very likely it is. Will take another look at those debugs.
Moved everything to level 20.
```yaml
    Keep in mind the system must be configured accordingly (see /proc/sys/vm/nr_hugepages).
  fmt_desc: List of key=value pairs delimited by comma, semicolon or tab
  see_also:
  - bluestore_max_blob_size
```
does it really make any sense here?
It's useful when filling the store with data -- you want to avoid scattering over many small blobs.
Sorry, it's absolutely unclear from the user perspective why this new parameter refers to bluestore_max_blob_size...
So it deserves a comment! Will do.
Explained in the desc why bluestore_max_blob_size is needed.
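Putting the two options together, a ceph.conf-style sketch (the option names appear in this PR; the values are examples only, not recommendations):

```ini
# Illustrative only -- not from the PR.
[osd]
# number of preallocated huge page-backed read buffers (default 128)
bdev_read_preallocated_huge_buffer_num = 128
# with the default 64 KB blob size, big reads would be scattered over
# many small blobs and never reach the huge-buffer path
bluestore_max_blob_size = 4194304
```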
```cpp
@@ -1188,6 +1189,7 @@ ceph::unique_leakable_ptr<buffer::raw> KernelDevice::create_custom_aligned(
          << " bdev_read_preallocated_huge_buffers="
          << cct->_conf.get_val<std::string>("bdev_read_preallocated_huge_buffers")
          << dendl;
  ioc->flags |= IOContext::FLAG_DONT_CACHE;
```
This doesn't look very elegant to me - it's rather an implicit side effect that a single huge page allocation causes no caching for the whole IO context... Not critical, but doubtful/questionable.
IMO it should be a buffer's DO-NOT-CACHE (or RELEASE-ASAP) property, not an IO context one.
Maybe there are ideas how to fix that?
Agreed. It's a compromise that takes into account the burden a per-buffer flag would put on the interfaces.
I'm sincerely open to a better solution.
src/test/objectstore/store_test.cc
Outdated
```cpp
  return ibp.is_raw_marked<BlockDevice::hugepaged_raw_marker_t>();
}

TEST_P(StoreTestDeferredSetup, BluestoreHugeReads)
```
Would this test case work properly if huge pages are disabled on the host? I presume it wouldn't, due to the assertion/abort in ExplicitHugePagePool's ctor.
No, it won't. Commented on that in the PR's description.
so this would rather fail the general QA testing, not to mention private dev runs... IMO we should have automatic disablement of it then....
Well, I was and still am afraid of that. Though, my dev env offered HP support out-of-the-box, and I was curious even about the nodes serving make check. It looks like they are fine. Therefore I'm having mixed feelings.
Disabled by default; commented why.
…tore. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
The idea here is to bring a pool of `mmap`-allocated, constant-sized buffers which take precedence over the 2 MB-aligned, THP-based mechanism. On the first attempt to acquire a 4 MB buffer, KernelDevice mmaps `bdev_read_preallocated_huge_buffer_num` (default 128) memory regions using the MAP_HUGETLB option. If this fails, the entire process is aborted. When their lifetimes end, buffers are recycled through a lock-free queue shared across the entire process. Remember to allocate the appropriate number of huge pages in the system! For instance:

```
echo 256 | sudo tee /proc/sys/vm/nr_hugepages
```

This commit bases on / cherry-picks with changes 897a493. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
When testing, remember about `bluestore_max_blob_size` as it's only 64 KB by default, while the entire huge page-based pool machinery targets far bigger scenarios (initially 4 MB!). Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
We're going to reuse it outside `test/bufferlist.cc`. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Its initial user will be a unit test for BlueStore's huge page-backed reading. Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
@ifed01: just rebased to include the fix for crimson's unit testing. Ready for re-review!
Failures unrelated, tracked in: Details:
@rzarzynski
The first is probably from a missing FreeBSD-specific include file, since it is known in Otherwise I'll have I could wiggle it with
@wjwithagen: thanks for bringing this up!
I will make a fix but will need help with testing as I'm missing a FreeBSD dev env :-(. |
@rzarzynski ATM I replace those 2 with:
And that compiles, and all my tests run OK.
Created #44612 to address the FTBFS while (hopefully; no testing env :-(, unfortunately) preserving the huge page-backed read buffers.
Summary
This PR brings two facilities in `KernelDevice` that enable efficient utilization of some NICs during large reads in BlueStore. These facilities minimize the size of the scatter-gather list the network hardware is provided with. For instance, on a system offering 2 MB pages, reading an entire RBD segment may require up to 2 different, non-contiguous regions of physical memory.
By default all these facilities are disabled, so there should be no flow difference compared to the current master.
The provided unit test requires a machine with HP available. I'm not sure we can assume this is always the case; if not, we can disable it by default.