os/bluestore: go deferred for 'big' writes. #33434
Conversation
src/os/bluestore/BlueStore.cc (outdated)
bufferlist::iterator& blp,
WriteContext* wctx)
{
  bluestore_deferred_op_t* op = _get_deferred_op(txc);
This function, if stripped of "big" namings, could be used as part of _do_write_small.
That's right, but in fact its usage from _do_write_small isn't that straightforward, so I'd like to postpone this sort of cleanup for a later refactoring.
Intended to ensure a spinner's read performance for 4K min alloc size is on par with 64K. The degradation is caused by additional fragmentation produced by partial overwrites when 4K min alloc size is set. E.g. consider a 4K partial overwrite in the middle of a previously written contiguous 64K blob. For 64K min alloc size the deferred write procedure comes in and preserves data continuity. When min alloc size is set to 4K, the overwritten blob is broken into 3 parts without the patch. This patch applies deferred write processing for that case and hence avoids breaking blob continuity, at the cost of some write performance drop though. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
This allows different callback function signatures. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
It makes no sense if affected blob's range is already non-continuous or full overwrite takes place. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
// will read some data to fill out the chunk?
head_read = p2phase<uint64_t>(b_off, chunk_size);
tail_read = p2nphase<uint64_t>(b_off + used, chunk_size);
b_off -= head_read;
Is it possible to get head_read != 0 or tail_read != 0 ?
I think that such cases will be caught by do_write_small.
It's possible to have chunk size > alloc unit sometimes, e.g. when expected_write_size is provided; see _choose_write_options(...).
Hence one can get head/tail at checksum chunk boundaries rather than at allocation unit ones. But _do_write_small/_do_write_big are distinguished on the latter only.
Absolutely! More tests are always better :) Will add, thanks!
src/os/bluestore/BlueStore.cc (outdated)
ceph_assert(blob_aligned_len() % chunk_size == 0);

res = blob_aligned_len() <= prefer_deferred_size &&
      blob_aligned_len() < ondisk &&
If head_read can be != 0 or tail_read can be != 0, then the condition should be blob_aligned_len() <= ondisk.
Right, thanks!
Force-pushed from 54facd2 to 511c208 (compare)
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
This applies to the bluestore_write_small_deferred and bluestore_write_small_new counters: in fact they now apply to both big and small writes. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
'bluestore_debug_omit_block_device_write' wasn't respected for 'big' deferred writes. Plus a minor cleanup around calc_csum calls. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Force-pushed from 511c208 to 9b92674 (compare)
I like this way of avoiding fragmentation.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
If this is backported we will require #34754
How will we know if/when it gets backported?
Was this backported to nautilus?
Intended to ensure a spinner's read performance for 4K min alloc size is
on par with 64K. The degradation is caused by additional fragmentation
produced by partial overwrites when 4K min alloc size is set.
E.g. consider a 4K partial overwrite in the middle of a previously written contiguous 64K blob.
For 64K min alloc size the deferred write procedure comes in and preserves data continuity.
When min alloc size is set to 4K, the overwritten blob is broken into 3 parts
without the patch. This patch applies deferred write processing for
that case and hence avoids breaking blob continuity, at the cost of some write performance drop though.
Signed-off-by: Igor Fedotov <ifedotov@suse.com>