
os/bluestore: make deferred writes less aggressive for large writes #42725

Merged
9 commits merged on Aug 12, 2021

Conversation

Contributor

@ifed01 ifed01 commented Aug 9, 2021

Fixes: https://tracker.ceph.com/issues/52089

Signed-off-by: Igor Fedotov ifedotov@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


Signed-off-by: Igor Fedotov <ifedotov@suse.com>
…ncrement

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Contributor

@aclamk aclamk left a comment

Good stuff.

if (!l) {
l = max_bsize;
}
l = std::min(uint64_t(l), length);
Contributor

How about change:


uint32_t l = p2nphase(offset, max_bsize);
if (!l) {
  l = max_bsize;
}

to

uint32_t l = p2roundup(l, max_bsize) - offset;

?

Contributor Author

replaced with
l = max_bsize - p2phase(offset, max_bsize);
l = std::min(uint64_t(l), length);

@@ -13734,7 +13738,7 @@ bool BlueStore::BigDeferredWriteContext::can_defer(
ceph_assert(b_off % chunk_size == 0);
ceph_assert(blob_aligned_len() % chunk_size == 0);

-  res = blob_aligned_len() <= prefer_deferred_size &&
+  res = blob_aligned_len() < prefer_deferred_size &&
Contributor

+1 for bringing the code in line with the declarative meaning of the configuration parameter

while (left > 0) {
ceph_assert(prealloc_left > 0);
has_chunk2defer |= (prealloc_pos_length < prefer_deferred_size.load());
Contributor

s/prefer_deferred_size.load()/prefer_deferred_size_snapshot
I did not notice that when I wrote this code.

@@ -14241,14 +14261,19 @@ int BlueStore::_do_alloc_write(

PExtentVector extents;
int64_t left = final_length;
bool has_chunk2defer = false;
Contributor

+1 Good name.

@@ -7483,7 +7630,8 @@ TEST_P(StoreTestSpecificAUSize, DeferredDifferentChunks) {
CEPH_OSD_OP_FLAG_FADVISE_NOCACHE);
r = queue_transaction(store, ch, std::move(t));
++exp_bluestore_write_big;
++exp_bluestore_write_big_deferred;
if (expected_write_size != prefer_deferred_size)
Contributor

if (expected_write_size > prefer_deferred_size)

?

Contributor Author

fixed

offset += l;
length -= l;
-      logger->inc(l_bluestore_write_big_blobs, remaining ? 2 : 1);
+      logger->inc(l_bluestore_write_big_blobs, remaining ? 2 : 1);
Contributor

line touched, but unchanged.

Contributor Author

fixed

ifed01 and others added 6 commits August 10, 2021 14:17
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
write I/O into chunks.

Without the fix, the following write sequence:

0~4M
4096~4M

produces tons of deferred writes at the second stage.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Now the deferred write in _do_alloc_write does not depend on the blob size,
but on the size of the extent allocated on disk.
It is now possible to set bluestore_prefer_deferred_size much larger than
bluestore_max_blob_size and still get the desired behavior.
Example: for deferred=256K and blob=64K, when the op writes 128K, both blobs
are written as deferred; when the op writes 256K, everything goes as a regular write.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
…er_deferred_size

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@neha-ojha
Member

started a rados run with latest changes included: https://pulpito.ceph.com/nojha-2021-08-11_00:04:11-rados-wip-42725-2021-08-10-distro-basic-smithi/

@markhpc
Member

markhpc commented Aug 11, 2021

I tested out a couple of simple code changes in my wip branch here (fix1-5 change different aspects of the deferred IO path):

https://github.com/markhpc/ceph/tree/wip-bs-deferred-fix

Adam created a PR with a different approach at a fix here:

#42721

And this PR from Igor incorporates Adam's fix and goes much further in changing the deferred IO path.

I tested all of these using the same benchmark configuration I was testing previously; in more recent tests I also reverted a recent PR that caused a performance regression in master (thanks to @mkogan1 for finding this). There appears to be another performance regression in master that I haven't tracked down yet, but I will try to do so this week.

https://docs.google.com/spreadsheets/d/1mTnKvLh8NIxPukad0o0f8ao-T2-GORXWASTQBdxtDFM/edit?usp=sharing

As a reminder, these tests look at deferred IO behavior on NVMe drives rather than HDDs, so the performance numbers are not necessarily relevant (though they may be, depending on the situation). The more important numbers to look at are the memtable flushing and compaction behavior (which one of our teams at Red Hat observed to be excessively high in pacific-based vs nautilus-based builds on HDD). Generally speaking, it looks like #42725 behaves pretty similarly to FIX1/3/5 in my tests where:

markhpc@b0a0482#diff-6f3b1a5dcf5313dc06f71ba4997ee28ec219ecb5633a728b916465c684f1d87dL14298-R14299

What's surprising is that there seems to be some yet-unknown effect on the 128KB random reads that is causing a bimodal performance distribution. It seems like the deferred IO path may be having an effect, but it's not consistent (for instance, deferred=0k in pacific was bad, deferred=0k in master was good, deferred=64k in pacific was good, but deferred=64k in master was bad).

So far however, it's debatable if any of these fixes are quite as good as just increasing the blob size in pacific to 256K (though I haven't tested that configuration in master yet). It's possible that Igor's PR here may confer some advantages for small writes (they are much faster with this PR), but I need to verify that we are actually properly deferring those writes in this PR and not just behaving like prefer_deferred=0k (which would be good for NVMe and bad for HDD).
