
os/bluestore: make deferred writes less aggressive for large writes #42725

Merged
9 commits merged on Aug 12, 2021

Conversation

Contributor

@ifed01 ifed01 commented Aug 9, 2021

Fixes: https://tracker.ceph.com/issues/52089

Signed-off-by: Igor Fedotov ifedotov@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


Signed-off-by: Igor Fedotov <ifedotov@suse.com>
…ncrement

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Contributor

@aclamk aclamk left a comment

Good stuff.

if (!l) {
l = max_bsize;
}
l = std::min(uint64_t(l), length);
Contributor

How about change:


uint32_t l = p2nphase(offset, max_bsize);
if (!l) {
  l = max_bsize;
}

to

uint32_t l = p2roundup(l, max_bsize) - offset;

?

Contributor Author

replaced with
l = max_bsize - p2phase(offset, max_bsize);
l = std::min(uint64_t(l), length);

@@ -13734,7 +13738,7 @@ bool BlueStore::BigDeferredWriteContext::can_defer(
ceph_assert(b_off % chunk_size == 0);
ceph_assert(blob_aligned_len() % chunk_size == 0);

-  res = blob_aligned_len() <= prefer_deferred_size &&
+  res = blob_aligned_len() < prefer_deferred_size &&
Contributor

+1 for bringing the code in line with the declarative meaning of the configuration parameter

while (left > 0) {
ceph_assert(prealloc_left > 0);
has_chunk2defer |= (prealloc_pos_length < prefer_deferred_size.load());
Contributor

s/prefer_deferred_size.load()/prefer_deferred_size_snapshot
I did not notice that when I wrote this code.

@@ -14241,14 +14261,19 @@ int BlueStore::_do_alloc_write(

PExtentVector extents;
int64_t left = final_length;
bool has_chunk2defer = false;
Contributor

+1 Good name.

@@ -7483,7 +7630,8 @@ TEST_P(StoreTestSpecificAUSize, DeferredDifferentChunks) {
CEPH_OSD_OP_FLAG_FADVISE_NOCACHE);
r = queue_transaction(store, ch, std::move(t));
++exp_bluestore_write_big;
++exp_bluestore_write_big_deferred;
if (expected_write_size != prefer_deferred_size)
Contributor

if (expected_write_size > prefer_deferred_size)

?

Contributor Author

fixed

offset += l;
length -= l;
-      logger->inc(l_bluestore_write_big_blobs, remaining ? 2 : 1);
+      logger->inc(l_bluestore_write_big_blobs, remaining ? 2 : 1);
Contributor

line touched, but unchanged.

Contributor Author

fixed

ifed01 and others added 6 commits August 10, 2021 14:17
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
write I/O into chunks.

Without the fix, the following write sequence:

0~4M
4096~4M

produces tons of deferred writes at the second stage.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Now the deferred write in _do_alloc_write does not depend on the blob size,
but on the size of the extent allocated on disk.
It is now possible to set bluestore_prefer_deferred_size much larger than
bluestore_max_blob_size and still get the desired behavior.
Example: for deferred=256K and blob=64K, when the op writes 128K, both blobs
are written as deferred; when the op writes 256K, everything goes as a regular write.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
…er_deferred_size

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@neha-ojha
Member

started a rados run with latest changes included: https://pulpito.ceph.com/nojha-2021-08-11_00:04:11-rados-wip-42725-2021-08-10-distro-basic-smithi/

@markhpc
Member

markhpc commented Aug 11, 2021

I tested out a couple of simple code changes in my wip branch here (fix1-5 change different aspects of the deferred IO path):

https://github.com/markhpc/ceph/tree/wip-bs-deferred-fix

Adam created a PR with a different approach at a fix here:

#42721

And this PR from Igor incorporates Adam's fix and goes much further in changing the deferred IO path.

I tested all of these using the same benchmark configuration I was testing previously; in more recent tests I also reverted a recent PR that caused a performance regression in master (thanks to @mkogan1 for finding this). There appears to be another performance regression in master that I haven't tracked down yet, but I will try to do so this week.

https://docs.google.com/spreadsheets/d/1mTnKvLh8NIxPukad0o0f8ao-T2-GORXWASTQBdxtDFM/edit?usp=sharing

As a reminder, these tests look at deferred IO behavior on NVMe drives rather than HDDs, so the performance numbers are not necessarily relevant (though they may be, depending on the situation). The more important numbers to look at are the memtable flushing and compaction behavior (which one of our teams at Red Hat observed to be excessively high in pacific-based vs nautilus-based builds on HDD). Generally speaking, it looks like #42725 behaves pretty similarly to FIX1/3/5 in my tests where:

markhpc@b0a0482#diff-6f3b1a5dcf5313dc06f71ba4997ee28ec219ecb5633a728b916465c684f1d87dL14298-R14299

What's surprising is that there seems to be some yet-unknown effect on the 128KB random reads that is causing a bimodal performance distribution. It seems like the deferred IO path may be having an effect, but it's not consistent (for instance, deferred=0k in pacific was bad, deferred=0k in master was good, deferred=64k in pacific was good, but deferred=64k in master was bad).

So far however, it's debatable if any of these fixes are quite as good as just increasing the blob size in pacific to 256K (though I haven't tested that configuration in master yet). It's possible that Igor's PR here may confer some advantages for small writes (they are much faster with this PR), but I need to verify that we are actually properly deferring those writes in this PR and not just behaving like prefer_deferred=0k (which would be good for NVMe and bad for HDD).
