
blk, os/bluestore: introduce huge page-based read buffers #43849

Merged
merged 9 commits into from
Jan 14, 2022

Conversation

rzarzynski
Contributor

@rzarzynski rzarzynski commented Nov 8, 2021

Summary

This PR brings two facilities that enable efficient utilization of some NICs during large reads in BlueStore:

  1. configurable alignment for allocating read buffers in KernelDevice,
  2. allocation from a configurable pools of reusable buffers placed on huge pages.

These features minimize the size of the scatter-gather list the network hardware is provided with. For instance, on a system offering 2 MB pages, reading an entire RBD segment may require no more than 2 different, non-contiguous regions of physical memory.

By default all these facilities are disabled, and there should be no difference in flow compared to the current master.

The provided unit test requires a machine with huge pages (HP) available. I'm not sure we can assume this is always the case; if not, we can disable the test by default.
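For reference, checking and reserving huge pages uses the standard Linux procfs interface; a hedged sketch (the page count is just an example, and the 2 MB page size assumption holds on most x86-64 systems):

```shell
# Inspect the current huge page state.
grep -E 'HugePages_(Total|Free)|Hugepagesize' /proc/meminfo

# Reserve 256 huge pages (needs root): with 2 MB pages that is 512 MB,
# i.e. enough backing for 128 x 4 MB read buffers.
if [ -w /proc/sys/vm/nr_hugepages ]; then
  echo 256 > /proc/sys/vm/nr_hugepages
else
  echo "run as root (or via sudo tee) to reserve huge pages"
fi
```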

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

Available Jenkins commands:
  • jenkins retest this please
  • jenkins test classic perf
  • jenkins test crimson perf
  • jenkins test signed
  • jenkins test make check
  • jenkins test make check arm64
  • jenkins test submodules
  • jenkins test dashboard
  • jenkins test dashboard cephadm
  • jenkins test api
  • jenkins test docs
  • jenkins render docs
  • jenkins test ceph-volume all
  • jenkins test ceph-volume tox

@ifed01
Contributor

ifed01 commented Dec 2, 2021

@rzarzynski - could you please add the rationales/goals for this PR in the description?

@ifed01
Contributor

ifed01 commented Dec 2, 2021

And it would be great to have some test coverage in store_test if possible, please

@rzarzynski
Contributor Author

@ifed01: sure, will add. Thanks for the review!

@github-actions

This pull request can no longer be automatically merged: a rebase is needed and changes have to be manually resolved

@rzarzynski
Contributor Author

Added unit testing & rebased. I will add a summary to the PR's description based on the commit messages in the morning.

@rzarzynski
Contributor Author

jenkins retest this please

The following tests FAILED:
	236 - unittest-btree-lba-manager (Child aborted)

@rzarzynski
Contributor Author

jenkins retest this please

Test failures:

	234 - unittest-seastar-errorator (Child aborted)

@rzarzynski
Contributor Author

@ifed01: updated! Could you please take another look?

The aborts in the Seastar-related unit tests are addressed in #44532.

@@ -1041,6 +1041,25 @@ int KernelDevice::discard(uint64_t offset, uint64_t len)
return r;
}


// FIXME: copied from buffer.cc
#define CEPH_BUFFER_ALLOC_UNIT 4096u
Contributor

why not start by exposing it in buffer.h?

Contributor Author

Yeah, probably making it public is the way to go.

Contributor Author

Switched to create_small_page_aligned() which also avoids duplicating the constant.

MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
-1,
0);
ceph_assert(mmaped_region != MAP_FAILED);
Contributor

IMO redundant

Contributor Author

Fixed!

src/blk/kernel/KernelDevice.cc (outdated; resolved)
region_q(region_q) {
// the `mmaped_region` has been passed to `raw` as the buffer's `data`
}
~mmaped_buffer_raw() {
Contributor

tag with override?

Contributor Author

Tagged.

-1,
0);
ceph_assert(mmaped_region != MAP_FAILED);
if (mmaped_region == MAP_FAILED) {
Contributor

shouldn't we make this a bit more user friendly and permit ODS functioning after such a failure, e.g. using regular allocation mechanics?

Contributor Author

This is setup code called once and early. I believe a user who wants to poke at such low-level things would like to see the failure as explicitly as possible; there is simply no business in requesting HP and then falling back to plain pages.

Contributor

@ifed01 ifed01 Jan 11, 2022

Well, that's questionable to me given the expected lack of proper documentation. IIUC one has to preconfigure a large enough number of HP in the system beforehand, something like OSD_count * num_hp_per_osd. If something goes wrong along the way, e.g. the OSD count is increased or some pages are allocated (or even leaked?) by different apps, the user would get non-operational OSDs, which might be tricky to fix for someone unaware of all the implementation details...
That's not very critical given this is disabled by default, but I can easily imagine relevant questions/tickets.... ;)

Contributor Author

If something goes wrong along the way, e.g. OSD count is increased or some pages are allocated (or even leaked?) by different apps - user would get non-operational OSDs which might be tricky to fix for a one who is unaware of all the implementation details...

Yes, this is nasty, but nasty in a very explicit way. With the implicit fallback there could be an even worse scenario: somebody adds a new OSD and the finely (and costly) tuned performance gets hit. Initially nobody notices as the cluster is underloaded, but some time later the targeted peak load comes...
Chasing such a problem would be far harder than investigating an assertion failure, I think.

Contributor Author

Hmm, on the other hand, maybe dout(0) / a warning plus an extra sentence in the configurable's description would do? If somebody does such tuning, they should have log monitoring as well?

Contributor

IMO the best way would be to raise a cluster warning (alert) if something goes wrong here and proceed working. But that may be a bit of overkill....

from a KernelDevice. Applied to minimize size of scatter-gather lists
sent to NICs. Targets really big buffers (>= 2 or 4 MBs).
Keep in mind the system must be configured accordingly (see /proc/sys/vm/nr_hugepages).
fmt_desc: List of key=value pairs delimited by comma, semicolon or tab
Contributor

It would be nice to provide a description of what the key and the value in this list are. And some sample(s)...

Contributor Author

Makes sense!

Contributor Author

Improved comments and added an example.

<< " bdev_read_preallocated_huge_buffer_num="
<< cct->_conf->bdev_read_preallocated_huge_buffer_num
<< " bdev_read_preallocated_huge_buffers="
<< cct->_conf.get_val<std::string>("bdev_read_preallocated_huge_buffers")
<< dendl;
return lucky_raw;
} else {
// fallthrough due to empty buffer pool. this can happen also
// when the configurable was explicitly set to 0.
dout(5) << __func__ << " cannot allocate from huge pool"
Contributor

Isn't that too verbose? This will print at level 5 on each allocation if the huge pool is disabled. And probably the successful-allocation logging above is too verbose as well...
IMO one should distinguish the failed-to-allocate (important) and huge-pool-disabled (debug-only) cases and log accordingly.

Contributor Author

Very likely it is. Will take another look at those debug messages.

Contributor Author

Moved everything to 20.

Keep in mind the system must be configured accordingly (see /proc/sys/vm/nr_hugepages).
fmt_desc: List of key=value pairs delimited by comma, semicolon or tab
see_also:
- bluestore_max_blob_size
Contributor

does it really make any sense here?

Contributor Author

It's useful when filling the store with data -- you want to avoid scattering over many small blobs.

Contributor

Sorry, it's absolutely unclear from the user perspective why this new parameter refers to bluestore_max_blob_size...

Contributor Author

So it deserves a comment! Will do.

Contributor Author

Explained in the desc why bluestore_max_blob_size is needed.

@@ -1188,6 +1189,7 @@ ceph::unique_leakable_ptr<buffer::raw> KernelDevice::create_custom_aligned(
<< " bdev_read_preallocated_huge_buffers="
<< cct->_conf.get_val<std::string>("bdev_read_preallocated_huge_buffers")
<< dendl;
ioc->flags |= IOContext::FLAG_DONT_CACHE;
Contributor

This doesn't look very elegant to me: it's an implicit side effect that a single huge-page allocation causes no caching for the whole IO context... That's not critical, but doubtful/questionable.
IMO it should be the buffer's DO-NOT-CACHE (or RELEASE-ASAP) property, not the IO context's.
Maybe there are some ideas how to fix that?

Contributor Author

Agreed. It's a compromise that takes into account the burden on the interfaces a per-buffer flag would bring.
I'm sincerely open to a better solution.
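One possible shape for a per-buffer property, sketched below as a plain C++ illustration (the class names are borrowed from this PR, but these are simplified stand-ins, not the actual Ceph types): tag the buffer class with an empty marker base class and query it with `dynamic_cast`, in the same spirit as the `is_raw_marked<hugepaged_raw_marker_t>()` check the unit test relies on.

```cpp
#include <cassert>

// Simplified stand-ins for the real buffer classes.
struct raw {
  virtual ~raw() = default;
};

// Empty marker base: inheriting from it tags a buffer type.
struct hugepaged_raw_marker_t {
  virtual ~hugepaged_raw_marker_t() = default;
};

struct mmaped_buffer_raw : raw, hugepaged_raw_marker_t {};
struct plain_buffer_raw : raw {};

// A caller can ask "does this buffer carry marker T?" without any
// flags field on the buffer or on the IO context.
template <class MarkerT>
bool is_raw_marked(const raw* r) {
  return dynamic_cast<const MarkerT*>(r) != nullptr;
}
```

The marker is a property of the buffer itself, so the no-cache decision could follow each buffer instead of being smeared over the whole IOContext.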

return ibp.is_raw_marked<BlockDevice::hugepaged_raw_marker_t>();
}

TEST_P(StoreTestDeferredSetup, BluestoreHugeReads)
Contributor

@ifed01 ifed01 Jan 11, 2022

Would this test case work properly if huge pages are disabled on the host? I presume it wouldn't, due to the assertion/abort in ExplicitHugePagePool's ctor.

Contributor Author

No, it won't. Commented on that in the PR's description.

Contributor

So this would rather fail the general QA testing, not to mention private dev runs... IMO we should have it automatically disabled then....

Contributor Author

Well, I was and still am afraid of that. Though, my dev environment offered HP support out of the box, and I was curious even about the nodes serving make check. It looks like they are fine. Therefore I'm having mixed feelings.

Contributor Author

Disabled by default; commented why.

…tore.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
The idea here is to bring a pool of `mmap`-allocated,
constantly-sized buffers which takes precedence
over the 2 MB-aligned, THP-based mechanism. On the first
attempt to acquire a 4 MB buffer, KernelDevice mmaps
`bdev_read_preallocated_huge_buffer_num` (default 128)
memory regions using the MAP_HUGETLB option. If this
fails, the entire process is aborted. Buffers, after
their lifetimes are over, are recycled via a lock-free
queue shared across the entire process.

Remember about allocating the appropriate number of
huge pages in the system! For instance:

```
echo 256 | sudo tee /proc/sys/vm/nr_hugepages
```

This commit is based on / cherry-picks with changes
897a493.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
When testing, remember about `bluestore_max_blob_size` as it's
only 64 KB by default, while the entire huge page-based pool
machinery targets far bigger scenarios (initially 4 MB!).

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
We're going to reuse it outside `test/bufferlist.cc`.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Its initial user will be a unit test for BlueStore's huge
paged-backed reading.

Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
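The pool mechanism described in the commit messages above can be sketched in isolation. Note this is an illustrative stand-in, not the Ceph code: the real implementation uses a lock-free queue and aborts when MAP_HUGETLB mmap fails, while this demo guards the queue with a mutex and falls back to plain anonymous pages so it also runs on hosts without reserved huge pages.

```cpp
#include <sys/mman.h>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Map one fixed-size region, preferring explicit huge pages.
static void* map_region(std::size_t size) {
  const int base = MAP_PRIVATE | MAP_ANONYMOUS;
#ifdef MAP_HUGETLB
  void* p = ::mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   base | MAP_HUGETLB, -1, 0);
  if (p != MAP_FAILED) {
    return p;
  }
  // Demo-only fallback to plain pages; the PR aborts here instead.
#endif
  return ::mmap(nullptr, size, PROT_READ | PROT_WRITE, base, -1, 0);
}

class BufferPool {
 public:
  BufferPool(std::size_t buffer_size, unsigned count)
      : buffer_size_(buffer_size) {
    // All regions are mapped up front, mirroring the eager setup
    // driven by bdev_read_preallocated_huge_buffer_num.
    for (unsigned i = 0; i < count; ++i) {
      void* p = map_region(buffer_size_);
      assert(p != MAP_FAILED);
      free_.push_back(p);
    }
  }
  // nullptr means "pool exhausted": the caller then falls back to a
  // regular allocation, as the fallthrough path in the PR does.
  void* acquire() {
    std::lock_guard<std::mutex> l(m_);
    if (free_.empty()) return nullptr;
    void* p = free_.back();
    free_.pop_back();
    return p;
  }
  // Called when a buffer's lifetime is over: recycle, don't unmap.
  void release(void* p) {
    std::lock_guard<std::mutex> l(m_);
    free_.push_back(p);
  }
 private:
  std::size_t buffer_size_;
  std::mutex m_;
  std::vector<void*> free_;
};
```

Recycling instead of unmapping is the point of the design: the expensive huge-page setup happens once, and steady-state reads only move pointers through the queue.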
@rzarzynski
Contributor Author

@ifed01: just rebased to include the fix for crimson's unit testing. Ready for re-review!

@amathuria
Contributor

http://pulpito.front.sepia.ceph.com/yuriw-2022-01-13_18:06:52-rados-wip-yuri3-testing-2022-01-13-0809-distro-default-smithi/

Failures unrelated, tracked in:
https://tracker.ceph.com/issues/44587
https://tracker.ceph.com/issues/53843
https://tracker.ceph.com/issues/43887
https://tracker.ceph.com/issues/53807
https://tracker.ceph.com/issues/53886
https://tracker.ceph.com/issues/53424
https://tracker.ceph.com/issues/50830

Details:
Bug #44587: failed to write 34473 to cgroup.procs
Bug #53843: mgr/dashboard: Error - yargs parser supports a minimum Node.js version of 12
Bug #43887: ceph_test_rados_delete_pools_parallel failure
Bug #53807: Dead jobs in rados/cephadm/smoke-roleless
Bug #53886: ansible: Failed to update apt cache
Bug #53424: CEPHADM_DAEMON_PLACE_FAIL in orch:cephadm/mgr-nfs-upgrade/
Bug #50830: rgw-ingress does not install

@yuriw yuriw merged commit 31ff495 into ceph:master Jan 14, 2022
@wjwithagen
Contributor

@rzarzynski
Hmmm,
just got a nice challenge to fix this for not working on FreeBSD.

/home/jenkins/workspace/ceph-master-compile/src/blk/kernel/KernelDevice.cc:1081:39: error: use of undeclared identifier 'MAP_POPULATE'
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
                                      ^
/home/jenkins/workspace/ceph-master-compile/src/blk/kernel/KernelDevice.cc:1081:54: error: use of undeclared identifier 'MAP_HUGETLB'
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
                                             

The first is probably from a missing FreeBSD-specific include file, since it is known in mmap(2).
What would happen if I removed MAP_HUGETLB from the mmap() call?
Would all the other code still do the right thing?

Otherwise I could wiggle it with MAP_ALIGNED_SUPER to get huge-page alignment.
The last thing I would like to do is undo all this for FreeBSD.

@rzarzynski
Contributor Author

@wjwithagen: thanks for bringing this up!

MAP_POPULATE is Linux-specific too. However, its absence isn't a big deal as, IIUC, all it does is populate the pages with frames immediately. I guess we could mimic that with just a memset of these regions.

MAP_HUGETLB is essential on Linux but, according to a discussion, FreeBSD has a different, more automated mechanism. MAP_ALIGNED_SUPER looks useful for handling the situation where the number of huge pages is not enough.

I will make a fix but will need help with testing as I'm missing a FreeBSD dev env :-(.

@wjwithagen
Contributor

@rzarzynski
I agree with your analysis, and I think this would get almost the same effect.
The thing I'm a bit worried about is whether it will error/soft-fail in the same way...
If the only issue is that the pages are not pre-read but paged in on demand, causing a performance penalty, that is not too bad.
Other/weirder issues might be hard to detect.

ATM I replace those 2 with:

--- a/src/blk/kernel/KernelDevice.cc
+++ b/src/blk/kernel/KernelDevice.cc
@@ -1078,7 +1078,7 @@ struct ExplicitHugePagePool {
         nullptr,
         buffer_size,
         PROT_READ | PROT_WRITE,
-        MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
+        MAP_PRIVATE | MAP_ANONYMOUS | MAP_PREFAULT_READ | MAP_ALIGNED_SUPER,
         -1,
         0);
       if (mmaped_region == MAP_FAILED) {

And that compiles, and all my tests run okay.
But then I don't exercise BlueStore too much, since that PR still needs work and acceptance. (my bad)

@rzarzynski
Contributor Author

Created #44612 to address the FTBFS while – hopefully (no testing env :-(, unfortunately) – preserving the huge page-backed read buffers.
