
os/bluestore: go deferred for 'big' writes. #33434

Merged — 8 commits, Apr 13, 2020

Conversation

ifed01
Contributor

@ifed01 ifed01 commented Feb 20, 2020

Intended to ensure that a spinner's read performance with a 4K min alloc size is on par with a 64K one. The degradation is caused by the additional fragmentation produced by partial overwrites when a 4K min alloc size is set.

E.g. consider a 4K partial overwrite in the middle of a previously written contiguous 64K blob. With a 64K min alloc size the deferred write procedure comes in and preserves data continuity. With the min alloc size set to 4K, the overwritten blob is broken into 3 parts without this patch. This patch applies deferred write processing to that case and hence avoids breaking blob continuity, at the cost of a write performance drop though.

Signed-off-by: Igor Fedotov ifedotov@suse.com

Checklist

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug


bufferlist::iterator& blp,
WriteContext* wctx)
{
bluestore_deferred_op_t* op = _get_deferred_op(txc);
Contributor

This function, if stripped of "big" namings, could be used as part of _do_write_small.

Contributor Author

That's right, but in fact its usage from _do_write_small isn't that straightforward, so I'd like to postpone this sort of cleanup for a later refactoring.

Intended to ensure that a spinner's read performance with a 4K min alloc size is on par with a 64K one. The degradation is caused by additional fragmentation produced by partial overwrites when a 4K min alloc size is set.

E.g. consider a 4K partial overwrite in the middle of a previously written contiguous 64K blob. With a 64K min alloc size the deferred write procedure comes in and preserves data continuity. With the min alloc size set to 4K, the overwritten blob is broken into 3 parts without the patch. This patch applies deferred write processing to that case and hence avoids breaking blob continuity, at the cost of a write performance drop though.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
This allows different callback function signatures.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
It makes no sense if the affected blob's range is already non-contiguous or a full overwrite takes place.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
// will read some data to fill out the chunk?
head_read = p2phase<uint64_t>(b_off, chunk_size);
tail_read = p2nphase<uint64_t>(b_off + used, chunk_size);
b_off -= head_read;
Contributor

Is it possible to get head_read != 0 or tail_read != 0? I think that such cases will be caught by _do_write_small.

Contributor Author

It's possible to have a chunk size > alloc unit sometimes, e.g. when expected_write_size is provided; see _choose_write_options(...).
Hence one can get head/tail at checksum chunk boundaries rather than at allocation unit ones. But _do_write_small/_do_write_big are distinguished on the latter only.

Contributor

@ifed01
I was not sure how _do_write_small and _do_write_big_deferred interact for different chunk_sizes, so I tested it.
Do you think it is useful to grab it here?
7762a82

Contributor Author

Absolutely! More tests are always better :) Will add, thanks!

ceph_assert(blob_aligned_len() % chunk_size == 0);

res = blob_aligned_len() <= prefer_deferred_size &&
blob_aligned_len() < ondisk &&
Contributor

If head_read can be != 0 or tail_read can be != 0, then the condition should be blob_aligned_len() <= ondisk.

Contributor Author

Right, thanks!

@ifed01 ifed01 force-pushed the wip-ifed-big-writes-deferred branch 2 times, most recently from 54facd2 to 511c208, April 1, 2020 13:01
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
This applies to bluestore_write_small_deferred and
bluestore_write_small_new counters.

In fact they apply to both big and small writes.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
'bluestore_debug_omit_block_device_write' wasn't respected for 'big'
deferred writes.
Plus a minor cleanup around calc_csum calls.

Signed-off-by: Igor Fedotov <ifedotov@suse.com>
@ifed01 ifed01 force-pushed the wip-ifed-big-writes-deferred branch from 511c208 to 9b92674, April 1, 2020 13:53
@aclamk aclamk self-requested a review April 3, 2020 10:14
Contributor

@aclamk aclamk left a comment

I like this way of avoiding fragmentation.

Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
@tchaikov tchaikov merged commit a79fc42 into ceph:master Apr 13, 2020
@ifed01 ifed01 deleted the wip-ifed-big-writes-deferred branch April 13, 2020 16:26
@badone
Contributor

badone commented Apr 27, 2020

If this is backported we will require #34754

@smithfarm
Contributor

How will we know if/when it gets backported?

@ifed01
Contributor Author

ifed01 commented May 13, 2020

How will we know if/when it gets backported?
We should probably discuss this during perf/core meeting... Will try tomorrow...

@disaster123

Was this backported to nautilus?

@tchaikov
Contributor

Was this backported to nautilus?

see https://tracker.ceph.com/issues/45195#note-17
