
blk, os/bluestore: introduce huge page-based read buffers #43849

Merged
merged 9 commits into from
Jan 14, 2022

Conversation

rzarzynski
Contributor

@rzarzynski rzarzynski commented Nov 8, 2021

Summary

This PR brings two facilities that enable efficient utilization of some NICs during large reads in BlueStore:

  1. configurable alignment for allocating read buffers in KernelDevice,
  2. allocation from a configurable pools of reusable buffers placed on huge pages.

These features minimize the size of the scatter-gather list the network hardware is provided with. For instance, on a system offering 2 MB pages, reading an entire RBD segment may require no more than 2 different, non-contiguous regions of physical memory.

By default all these facilities are disabled, and there should be no difference in flow compared to the current master.

The provided unit test requires a machine with huge pages (HP) available. I'm not sure we can assume this is always the case; if not, we can disable the test by default.
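For reference, checking and reserving huge pages uses the standard Linux procfs interface; a hedged sketch (the page count is just an example, and the 2 MB page size assumption holds on most x86-64 systems):

```shell
# Inspect the current huge page state.
grep -E 'HugePages_(Total|Free)|Hugepagesize' /proc/meminfo

# Reserve 256 huge pages (needs root): with 2 MB pages that is 512 MB,
# i.e. enough backing for 128 x 4 MB read buffers.
if [ -w /proc/sys/vm/nr_hugepages ]; then
  echo 256 > /proc/sys/vm/nr_hugepages
else
  echo "run as root (or via sudo tee) to reserve huge pages"
fi
```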

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@ifed01
Contributor

ifed01 commented Dec 2, 2021

@rzarzynski - could you please add the rationales/goals for this PR in the description?

@ifed01
Contributor

ifed01 commented Dec 2, 2021

And it would be great to have some test coverage in store_test if possible, please

@rzarzynski
Contributor Author

@ifed01: sure, will add. Thanks for the review!

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@rzarzynski
Contributor Author

Added unit testing & rebased. I will add a summary to the PR's description based on the commit messages in the morning.

@rzarzynski
Contributor Author

jenkins retest this please

The following tests FAILED:
	236 - unittest-btree-lba-manager (Child aborted)

@rzarzynski
Contributor Author

jenkins retest this please

Test failures:

	234 - unittest-seastar-errorator (Child aborted)

@rzarzynski
Contributor Author

@ifed01: updated! Could you please take another look?

The aborts in the Seastar-related unit tests are addressed in #44532.

@@ -1041,6 +1041,25 @@ int KernelDevice::discard(uint64_t offset, uint64_t len)
return r;
}


// FIXME: copied from buffer.cc
#define CEPH_BUFFER_ALLOC_UNIT 4096u
Contributor

why not start by exposing it in buffer.h?

Contributor Author

Yeah, probably making it public is the way to go.

Contributor Author

Switched to create_small_page_aligned() which also avoids duplicating the constant.

MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
-1,
0);
ceph_assert(mmaped_region != MAP_FAILED);
Contributor

IMO redundant

Contributor Author

Fixed!

src/blk/kernel/KernelDevice.cc (outdated; resolved)
region_q(region_q) {
// the `mmaped_region` has been passed to `raw` as the buffer's `data`
}
~mmaped_buffer_raw() {
Contributor

tag with override?

Contributor Author

Tagged.

-1,
0);
ceph_assert(mmaped_region != MAP_FAILED);
if (mmaped_region == MAP_FAILED) {
Contributor

shouldn't we make this a bit more user friendly and permit ODS functioning after such a failure, e.g. using regular allocation mechanics?

Contributor Author

This is setup code called once and early. I believe a user who wants to poke at such low-level things would like to see the failure as explicitly as possible; there is simply no business in requesting HP and then falling back to plain pages.

Contributor

@ifed01 ifed01 Jan 11, 2022

Well, that's questionable to me given the expected lack of proper documentation. IIUC one has to preconfigure a large enough number of HP in the system beforehand, something like OSD_count * num_hp_per_osd. If something goes wrong along the way, e.g. the OSD count is increased or some pages are allocated (or even leaked?) by different apps, the user would get non-operational OSDs, which might be tricky to fix for someone unaware of all the implementation details...
That's not very critical given this is disabled by default, but I can easily imagine relevant questions/tickets.... ;)

Contributor Author

If something goes wrong along the way, e.g. OSD count is increased or some pages are allocated (or even leaked?) by different apps - user would get non-operational OSDs which might be tricky to fix for a one who is unaware of all the implementation details...

Yes, this is nasty, but nasty in a very explicit way. With the implicit fallback there could be an even worse scenario: somebody adds a new OSD and the finely (and costly) tuned performance gets hit. Initially nobody notices as the cluster is underloaded, but some time later the targeted peak load comes...
Chasing such a problem would be far harder than investigating an assertion failure, I think.

Contributor Author

Hmm, on the other hand, maybe dout(0) / a warning plus an extra sentence in the configurable's description would do? If somebody does such tuning, they should have log monitoring as well?

Contributor

IMO the best way would be to raise a cluster warning (alert) if something goes wrong here and proceed working. But that may be a bit of overkill....

from a KernelDevice. Applied to minimize size of scatter-gather lists
sent to NICs. Targets really big buffers (>= 2 or 4 MBs).
Keep in mind the system must be configured accordingly (see /proc/sys/vm/nr_hugepages).
fmt_desc: List of key=value pairs delimited by comma, semicolon or tab
Contributor

It would be nice to provide a description of what the key and the value in this list are. And some sample(s)...

Contributor Author

Makes sense!

Contributor Author

Improved comments and added an example.

<< " bdev_read_preallocated_huge_buffer_num="
<< cct->_conf->bdev_read_preallocated_huge_buffer_num
<< " bdev_read_preallocated_huge_buffers="
<< cct->_conf.get_val<std::string>("bdev_read_preallocated_huge_buffers")
<< dendl;
return lucky_raw;
} else {
// fallthrough due to empty buffer pool. this can happen also
// when the configurable was explicitly set to 0.
dout(5) << __func__ << " cannot allocate from huge pool"
Contributor

Isn't that too verbose? This will print at level 5 on each allocation if the huge pool is disabled. And probably the successful-allocation logging above is too verbose as well...
IMO one should distinguish the failed-to-allocate (important) and huge-pool-disabled (debug-only) cases and log accordingly.

Contributor Author

Very likely it is. Will take another look at those debug messages.

Contributor Author

Moved everything to 20.

Keep in mind the system must be configured accordingly (see /proc/sys/vm/nr_hugepages).
fmt_desc: List of key=value pairs delimited by comma, semicolon or tab
see_also:
- bluestore_max_blob_size
Contributor

does it really make any sense here?

Contributor Author

It's useful when filling the store with data -- you want to avoid scattering over many small blobs.

Contributor

Sorry, it's absolutely unclear from the user perspective why this new parameter refers to bluestore_max_blob_size...

Contributor Author

So it deserves a comment! Will do.

Contributor Author

Explained in the desc why bluestore_max_blob_size is needed.

@@ -1188,6 +1189,7 @@ ceph::unique_leakable_ptr<buffer::raw> KernelDevice::create_custom_aligned(
<< " bdev_read_preallocated_huge_buffers="
<< cct->_conf.get_val<std::string>("bdev_read_preallocated_huge_buffers")
<< dendl;
ioc->flags |= IOContext::FLAG_DONT_CACHE;
Contributor

This doesn't look very elegant to me: it's an implicit side effect that a single huge-page allocation causes no caching for the whole IO context... That's not critical, but doubtful/questionable.
IMO it should be the buffer's DO-NOT-CACHE (or RELEASE-ASAP) property, not the IO context's.
Maybe there are some ideas how to fix that?

Contributor Author

Agreed. It's a compromise that takes into account the burden on the interfaces a per-buffer flag would bring.
I'm sincerely open to a better solution.
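One possible shape for a per-buffer property, sketched below as a plain C++ illustration (the class names are borrowed from this PR, but these are simplified stand-ins, not the actual Ceph types): tag the buffer class with an empty marker base class and query it with `dynamic_cast`, in the same spirit as the `is_raw_marked<hugepaged_raw_marker_t>()` check the unit test relies on.

```cpp
#include <cassert>

// Simplified stand-ins for the real buffer classes.
struct raw {
  virtual ~raw() = default;
};

// Empty marker base: inheriting from it tags a buffer type.
struct hugepaged_raw_marker_t {
  virtual ~hugepaged_raw_marker_t() = default;
};

struct mmaped_buffer_raw : raw, hugepaged_raw_marker_t {};
struct plain_buffer_raw : raw {};

// A caller can ask "does this buffer carry marker T?" without any
// flags field on the buffer or on the IO context.
template <class MarkerT>
bool is_raw_marked(const raw* r) {
  return dynamic_cast<const MarkerT*>(r) != nullptr;
}
```

The marker is a property of the buffer itself, so the no-cache decision could follow each buffer instead of being smeared over the whole IOContext.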

return ibp.is_raw_marked<BlockDevice::hugepaged_raw_marker_t>();
}

TEST_P(StoreTestDeferredSetup, BluestoreHugeReads)
Contributor

@ifed01 ifed01 Jan 11, 2022

Would this test case work properly if huge pages are disabled on the host? I presume it wouldn't, due to the assertion/abort in ExplicitHugePagePool's ctor.

Contributor Author

No, it won't. Commented on that in the PR's description.

Contributor

So this would rather fail the general QA testing, not to mention private dev runs... IMO we should have it automatically disabled then....

Contributor Author

Well, I was and still am afraid of that. Though, my dev environment offered HP support out of the box, and I was curious even about the nodes serving make check. It looks like they are fine. Therefore I'm having mixed feelings.

Contributor Author

Disabled by default; commented why.

…tore.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
The idea here is to bring a pool of `mmap`-allocated,
constantly-sized buffers which takes precedence
over the 2 MB-aligned, THP-based mechanism. On the first
attempt to acquire a 4 MB buffer, KernelDevice mmaps
`bdev_read_preallocated_huge_buffer_num` (default 128)
memory regions using the MAP_HUGETLB option. If this
fails, the entire process is aborted. Buffers, after
their lifetimes are over, are recycled via a lock-free
queue shared across the entire process.

Remember about allocating the appropriate number of
huge pages in the system! For instance:

```
echo 256 | sudo tee /proc/sys/vm/nr_hugepages
```

This commit is based on / cherry-picks with changes
897a493.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
When testing, remember about `bluestore_max_blob_size` as it's
only 64 KB by default, while the entire huge page-based pool
machinery targets far bigger scenarios (initially 4 MB!).

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
We're going to reuse it outside `test/bufferlist.cc`.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Its initial user will be a unit test for BlueStore's huge
paged-backed reading.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
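The pool mechanism described in the commit messages above can be sketched in isolation. Note this is an illustrative stand-in, not the Ceph code: the real implementation uses a lock-free queue and aborts when MAP_HUGETLB mmap fails, while this demo guards the queue with a mutex and falls back to plain anonymous pages so it also runs on hosts without reserved huge pages.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Map one fixed-size region, preferring explicit huge pages.
static void* map_region(std::size_t size) {
  const int base = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_HUGETLB
  void* p = ::mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   base | MAP_HUGETLB, -1, 0);
  if (p != MAP_FAILED) {
    return p;
  }
  // Demo-only fallback to plain pages; the PR aborts here instead.
#endif
  return ::mmap(nullptr, size, PROT_READ | PROT_WRITE, base, -1, 0);
}

class BufferPool {
 public:
  BufferPool(std::size_t buffer_size, unsigned count)
      : buffer_size_(buffer_size) {
    // All regions are mapped up front, mirroring the eager setup
    // driven by bdev_read_preallocated_huge_buffer_num.
    for (unsigned i = 0; i < count; ++i) {
      void* p = map_region(buffer_size_);
      assert(p != MAP_FAILED);
      free_.push_back(p);
    }
  }
  // nullptr means "pool exhausted": the caller then falls back to a
  // regular allocation, as the fallthrough path in the PR does.
  void* acquire() {
    std::lock_guard<std::mutex> l(m_);
    if (free_.empty()) return nullptr;
    void* p = free_.back();
    free_.pop_back();
    return p;
  }
  // Called when a buffer's lifetime is over: recycle, don't unmap.
  void release(void* p) {
    std::lock_guard<std::mutex> l(m_);
    free_.push_back(p);
  }
 private:
  std::size_t buffer_size_;
  std::mutex m_;
  std::vector<void*> free_;
};
```

Recycling instead of unmapping is the point of the design: the expensive huge-page setup happens once, and steady-state reads only move pointers through the queue.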
@rzarzynski
Contributor Author

@ifed01: just rebased to include the fix for crimson's unit testing. Ready for re-review!

@amathuria
Contributor

http://pulpito.front.sepia.ceph.com/yuriw-2022-01-13_18:06:52-rados-wip-yuri3-testing-2022-01-13-0809-distro-default-smithi/

Failures unrelated, tracked in:
https://tracker.ceph.com/issues/44587
https://tracker.ceph.com/issues/53843
https://tracker.ceph.com/issues/43887
https://tracker.ceph.com/issues/53807
https://tracker.ceph.com/issues/53886
https://tracker.ceph.com/issues/53424
https://tracker.ceph.com/issues/50830

Details:
Bug #44587: failed to write 34473 to cgroup.procs
Bug #53843: mgr/dashboard: Error - yargs parser supports a minimum Node.js version of 12
Bug #43887: ceph_test_rados_delete_pools_parallel failure
Bug #53807: Dead jobs in rados/cephadm/smoke-roleless
Bug #53886: ansible: Failed to update apt cache
Bug #53424: CEPHADM_DAEMON_PLACE_FAIL in orch:cephadm/mgr-nfs-upgrade/
Bug #50830: rgw-ingress does not install

@yuriw yuriw merged commit 31ff495 into ceph:master Jan 14, 2022
@wjwithagen
Contributor

@rzarzynski
Hmmm,
just got a nice challenge to fix this for not working on FreeBSD.

/home/jenkins/workspace/ceph-master-compile/src/blk/kernel/KernelDevice.cc:1081:39: error: use of undeclared identifier 'MAP_POPULATE'
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
                                      ^
/home/jenkins/workspace/ceph-master-compile/src/blk/kernel/KernelDevice.cc:1081:54: error: use of undeclared identifier 'MAP_HUGETLB'
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
                                             

The first is probably from a missing FreeBSD-specific include file, since it is known in mmap(2).
What would happen if I removed MAP_HUGETLB from the mmap() call?
Would all the other code still do the right thing?

Otherwise I could wiggle it with MAP_ALIGNED_SUPER to get huge-page alignment.
The last thing I would like to do is undo all this for FreeBSD.

@rzarzynski
Contributor Author

@wjwithagen: thanks for bringing this up!

MAP_POPULATE is Linux-specific too. However, its absence isn't a big deal as, IIUC, all it does is populate the pages with frames immediately. I guess we could mimic that with just a memset of these regions.

MAP_HUGETLB is essential on Linux but, according to a discussion, FreeBSD has a different, more automated mechanism. MAP_ALIGNED_SUPER looks useful for handling the situation where the number of huge pages is not enough.

I will make a fix but will need help with testing as I'm missing a FreeBSD dev env :-(.

@wjwithagen
Contributor

@rzarzynski
I agree with your analysis, and I think this would get almost the same effect.
The thing I'm a bit worried about is whether it will error/soft-fail in the same way...
If the only issue is that the pages are not pre-read but paged in on demand, causing a performance penalty, that is not too bad.
Other/weirder issues might be hard to detect.

ATM I replace those 2 with:

--- a/src/blk/kernel/KernelDevice.cc
+++ b/src/blk/kernel/KernelDevice.cc
@@ -1078,7 +1078,7 @@ struct ExplicitHugePagePool {
         nullptr,
         buffer_size,
         PROT_READ | PROT_WRITE,
-        MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
+        MAP_PRIVATE | MAP_ANONYMOUS | MAP_PREFAULT_READ | MAP_ALIGNED_SUPER,
         -1,
         0);
       if (mmaped_region == MAP_FAILED) {

And that compiles, and all my tests run okay.
But then I don't exercise BlueStore too much, since that PR still needs work and acceptance. (my bad)

@rzarzynski
Contributor Author

Created #44612 to address the FTBFS while – hopefully (no testing env :-(, unfortunately) – preserving the huge page-backed read buffers.
