os/bluestore: make deferred writes less aggressive for large writes #42725
Conversation
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
…ncrement Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Good stuff.
src/os/bluestore/BlueStore.cc
Outdated
```cpp
if (!l) {
  l = max_bsize;
}
l = std::min(uint64_t(l), length);
```
How about changing:

```cpp
uint32_t l = p2nphase(offset, max_bsize);
if (!l) {
  l = max_bsize;
}
```

to

```cpp
uint32_t l = p2roundup(l, max_bsize) - offset;
```

?
Replaced with:

```cpp
l = max_bsize - p2phase(offset, max_bsize);
l = std::min(uint64_t(l), length);
```
```diff
@@ -13734,7 +13738,7 @@ bool BlueStore::BigDeferredWriteContext::can_defer(
   ceph_assert(b_off % chunk_size == 0);
   ceph_assert(blob_aligned_len() % chunk_size == 0);

-  res = blob_aligned_len() <= prefer_deferred_size &&
+  res = blob_aligned_len() < prefer_deferred_size &&
```
+1 for bringing the code in line with the declarative meaning of the configuration parameter
src/os/bluestore/BlueStore.cc
Outdated
```cpp
while (left > 0) {
  ceph_assert(prealloc_left > 0);
  has_chunk2defer |= (prealloc_pos_length < prefer_deferred_size.load());
```
s/prefer_deferred_size.load()/prefer_deferred_size_snapshot
I did not notice that when I wrote this code.
```diff
@@ -14241,14 +14261,19 @@ int BlueStore::_do_alloc_write(

   PExtentVector extents;
   int64_t left = final_length;
+  bool has_chunk2defer = false;
```
+1 Good name.
src/test/objectstore/store_test.cc
Outdated
```diff
@@ -7483,7 +7630,8 @@ TEST_P(StoreTestSpecificAUSize, DeferredDifferentChunks) {
       CEPH_OSD_OP_FLAG_FADVISE_NOCACHE);
     r = queue_transaction(store, ch, std::move(t));
     ++exp_bluestore_write_big;
-    ++exp_bluestore_write_big_deferred;
+    if (expected_write_size != prefer_deferred_size)
```
```cpp
if (expected_write_size > prefer_deferred_size)
```

?
fixed
src/os/bluestore/BlueStore.cc
Outdated
```diff
   offset += l;
   length -= l;
-  logger->inc(l_bluestore_write_big_blobs, remaining ? 2 : 1);
+  logger->inc(l_bluestore_write_big_blobs, remaining ? 2 : 1);
```
line touched, but unchanged.
fixed
Fixes: https://tracker.ceph.com/issues/52089 Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
write I/O into chunks. Without the fix the following write seq: 0~4M 4096~4M produces tons of deferred writes at the second stage. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Now deferred write in _do_alloc_write does not depend on blob size, but on size of extent allocated on disk. It is now possible to set bluestore_prefer_deferred_size way larger than bluestore_max_blob_size and still get desired behavior. Example: for deferred=256K, blob=64K : when op write is 128K both blobs will be written as deferred. When op write is 256K then all will go as regular write. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
…er_deferred_size Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Compare: 85902f6 to 324ea6b
Started a rados run with the latest changes included: https://pulpito.ceph.com/nojha-2021-08-11_00:04:11-rados-wip-42725-2021-08-10-distro-basic-smithi/
I tested out a couple of simple code changes in my wip branch here (fix1-5 change different aspects of the deferred IO path): https://github.com/markhpc/ceph/tree/wip-bs-deferred-fix

Adam created a PR with a different approach to a fix here: And Igor's PR here incorporates Adam's fix and goes much further in changing the deferred IO path.

I tested all of these using the same benchmark configuration I was testing previously, but in more recent tests also reverted a recent PR causing a performance regression in master (thanks to @mkogan1 for finding this). There appears to be another performance regression in master that I haven't tracked down, but I will try to do so this week.

https://docs.google.com/spreadsheets/d/1mTnKvLh8NIxPukad0o0f8ao-T2-GORXWASTQBdxtDFM/edit?usp=sharing

As a reminder, these tests look at deferred IO behavior on NVMe drives rather than HDDs, so the performance numbers are not necessarily relevant (but may be, depending on the situation). The more important numbers to look at are the memtable flushing and compaction behavior (which one of our teams at Red Hat did observe as excessively high in pacific- vs nautilus-based builds on HDD).

Generally speaking, it looks like #42725 behaves pretty similarly to FIX1/3/5 in my tests, where: markhpc@b0a0482#diff-6f3b1a5dcf5313dc06f71ba4997ee28ec219ecb5633a728b916465c684f1d87dL14298-R14299

What's strangely surprising is that there seems to be some as-yet-unknown effect on the 128KB random reads that is causing a bimodal performance distribution. It seems like the deferred IO path may be having an effect, but it's not consistent (for instance, deferred=0k in pacific was bad, deferred=0k in master was good, deferred=64k in pacific was good, but deferred=64k in master was bad).

So far, however, it's debatable whether any of these fixes are quite as good as just increasing the blob size in pacific to 256K (though I haven't tested that configuration in master yet).

It's possible that Igor's PR here may confer some advantages for small writes (they are much faster with this PR), but I need to verify that we are actually properly deferring those writes in this PR and not just behaving like prefer_deferred=0k (which would be good for NVMe and bad for HDD).
No related failures in https://pulpito.ceph.com/?branch=wip-42725-2021-08-10
Fixes: https://tracker.ceph.com/issues/52089
Signed-off-by: Igor Fedotov <ifedotov@suse.com>