New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os/bluestore: optimize blob usage when doing appends/overwrites #13337

Merged
merged 9 commits into from Mar 29, 2017

Conversation

Projects
None yet
2 participants
@ifed01
Contributor

ifed01 commented Feb 9, 2017

We can reuse blob when appending/overwriting some extent instead of allocating another one.
This reduces reference Onode (4Mb total /4K AU) mem usage from 280K to ~32K (max_blob_size=64K and min_alloc_size = 4K)
Read performance is greatly improved too.

@ifed01 ifed01 requested a review from liewegas Feb 9, 2017

Show outdated Hide outdated src/os/bluestore/bluestore_types.h
@@ -456,6 +459,10 @@ struct bluestore_blob_t {
bluestore_blob_t(uint32_t f = 0) : flags(f) {}
const PExtentVector get_extents() const {

This comment has been minimized.

@liewegas

liewegas Feb 10, 2017

Member

PExtentVector& ?

@liewegas

liewegas Feb 10, 2017

Member

PExtentVector& ?

@ifed01 ifed01 changed the title from os/bluestore: [RFC]optimize blob usage when doing appends/overwrites to os/bluestore: optimize blob usage when doing appends/overwrites Mar 9, 2017

@liewegas

This comment has been minimized.

Show comment
Hide comment
@liewegas

liewegas Mar 13, 2017

Member

rebase?

Member

liewegas commented Mar 13, 2017

rebase?

Show outdated Hide outdated src/os/bluestore/BlueStore.cc
uint64_t loffs,
uint64_t blob_len)
{
loffs = P2ALIGN(loffs, blob_len);

This comment has been minimized.

@liewegas

liewegas Mar 13, 2017

Member

is blob_len necessarily a power of 2?

@liewegas

liewegas Mar 13, 2017

Member

is blob_len necessarily a power of 2?

This comment has been minimized.

@ifed01

ifed01 Mar 13, 2017

Contributor

This is allocated blob length and hence it has min_alloc_size granularity. That's supposed to be power of 2 IMO.
Moreover currently it's used from do_write_small only that always allocates single AU. Will add an assert though...

@ifed01

ifed01 Mar 13, 2017

Contributor

This is allocated blob length and hence it has min_alloc_size granularity. That's supposed to be power of 2 IMO.
Moreover currently it's used from do_write_small only that always allocates single AU. Will add an assert though...

This comment has been minimized.

@liewegas

liewegas Mar 13, 2017

Member

min_alloc_size is a power of 2, but the blob length needn't be... if min_alloc_size is 64k the blob could be 196k

@liewegas

liewegas Mar 13, 2017

Member

min_alloc_size is a power of 2, but the blob length needn't be... if min_alloc_size is 64k the blob could be 196k

This comment has been minimized.

@ifed01

ifed01 Mar 14, 2017

Contributor

got it, fixed

@ifed01

ifed01 Mar 14, 2017

Contributor

got it, fixed

Show outdated Hide outdated src/os/bluestore/BlueStore.cc
max_bsize,
offset0 - bstart,
&alloc_len)) {
assert(alloc_len = min_alloc_size); // expecting data always

This comment has been minimized.

@liewegas

liewegas Mar 13, 2017

Member

== not =

@liewegas

liewegas Mar 13, 2017

Member

== not =

This comment has been minimized.

@ifed01

ifed01 Mar 13, 2017

Contributor

yep

@ifed01

ifed01 Mar 13, 2017

Contributor

yep

Igor Fedotov added some commits Jan 17, 2017

Igor Fedotov
os/bluestore: store blob logical length rather than recalculate it on…
… pextent vector

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Igor Fedotov
os/bluestore: refactor bluestore_blob_t to encapsulate extents/*lengt…
…h fields.

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Igor Fedotov
os/bluestore: make punch_hole call from set_lextent optional.
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Igor Fedotov
os/bluestore: do garbage collection after main write path to be able …
…to call set_lextent from _do_alloc_write (will be required later for blob alignment).

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
@ifed01

This comment has been minimized.

Show comment
Hide comment
@ifed01

ifed01 Mar 27, 2017

Contributor

retest this please

Contributor

ifed01 commented Mar 27, 2017

retest this please

@liewegas

This comment has been minimized.

Show comment
Hide comment
@liewegas

liewegas Mar 27, 2017

Member

/home/jenkins-build/build/workspace/ceph-pull-requests/src/os/bluestore/bluestore_types.cc: In member function ‘void bluestore_blob_t::allocated(uint32_t, uint32_t, const AllocExtentVector&)’:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/os/bluestore/bluestore_types.cc:785:14: error: no match for ‘operator=’ (operand types are ‘__gnu_cxx::__normal_iterator<bluestore_pextent_t*, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)2u, bluestore_pextent_t> > >’ and ‘void’)
start_it = extents.insert(start_it,

insert doesn't return an iterator until c++14 iirc?

Member

liewegas commented Mar 27, 2017

/home/jenkins-build/build/workspace/ceph-pull-requests/src/os/bluestore/bluestore_types.cc: In member function ‘void bluestore_blob_t::allocated(uint32_t, uint32_t, const AllocExtentVector&)’:
/home/jenkins-build/build/workspace/ceph-pull-requests/src/os/bluestore/bluestore_types.cc:785:14: error: no match for ‘operator=’ (operand types are ‘__gnu_cxx::__normal_iterator<bluestore_pextent_t*, std::vector<bluestore_pextent_t, mempool::pool_allocator<(mempool::pool_index_t)2u, bluestore_pextent_t> > >’ and ‘void’)
start_it = extents.insert(start_it,

insert doesn't return an iterator until c++14 iirc?

Show outdated Hide outdated src/os/bluestore/BlueStore.cc
if (new_blen > blen) {
int64_t overflow = new_blen;
overflow -= target_blob_size;
assert(blen <= new_blen - overflow);

This comment has been minimized.

@liewegas

liewegas Mar 27, 2017

Member

could this be triggered if the target blob size changes (is reduced) and the existing blob is already larger than that?

@liewegas

liewegas Mar 27, 2017

Member

could this be triggered if the target blob size changes (is reduced) and the existing blob is already larger than that?

This comment has been minimized.

@ifed01

ifed01 Mar 28, 2017

Contributor

yep, will fix

@ifed01

ifed01 Mar 28, 2017

Contributor

yep, will fix

Show outdated Hide outdated src/os/bluestore/BlueStore.cc
@@ -8675,6 +8758,39 @@ void BlueStore::_do_write_small(
logger->inc(l_bluestore_write_small_deferred);
return;
}
uint32_t alloc_len = min_alloc_size;
auto offset0 = P2ALIGN(offset, alloc_len);
auto llen = b->get_blob().get_logical_length();

This comment has been minimized.

@liewegas

liewegas Mar 27, 2017

Member

llen can move inside the condition

@liewegas

liewegas Mar 27, 2017

Member

llen can move inside the condition

Show outdated Hide outdated src/os/bluestore/BlueStore.cc
return false;
}
}
if (new_blen > blen) {

This comment has been minimized.

@liewegas

liewegas Mar 27, 2017

Member

If i'm understanding this correctly, the new_blen > blen case can only happen if we take the top branch above (b_offset >= blen). If we take the second path, that means we're writing into an unallocated AU inside a larger blob. And our write always falls within one AU (this is _do_write_small), so we can't possibly be extending the blob. Right?

@liewegas

liewegas Mar 27, 2017

Member

If i'm understanding this correctly, the new_blen > blen case can only happen if we take the top branch above (b_offset >= blen). If we take the second path, that means we're writing into an unallocated AU inside a larger blob. And our write always falls within one AU (this is _do_write_small), so we can't possibly be extending the blob. Right?

This comment has been minimized.

@liewegas

liewegas Mar 27, 2017

Member

Oh, _do_write_big uses it too.

@liewegas

liewegas Mar 27, 2017

Member

Oh, _do_write_big uses it too.

// from garbage blobs(compressed?) can share logical space within the same
// AU. That's in turn might be caused by unaligned len in clone_range2.
// Hence the second write will fail in an attempt to reuse blob at
// do_alloc_write().

This comment has been minimized.

@liewegas

liewegas Mar 27, 2017

Member

clone range offsets are now always aligned... is this still necessary?

@liewegas

liewegas Mar 27, 2017

Member

clone range offsets are now always aligned... is this still necessary?

This comment has been minimized.

@ifed01

ifed01 Mar 28, 2017

Contributor

Saying about aligned clone_range offsets do you mean they are aligned with min_alloc_size? Or just src_offset == dst_offset? The latter isn't enough. At least your latest fix for store_test (commit b2e7f42) assumes no real min_alloc_size alignment. And what's about length alignment during clone_range? Unaligned length can probably cause the same issue.
Anyway IMO such restrictions on clone_range params are pretty implicit at the moment and might tend to change in future. Their impact on the blob reuse stuff is even more implicit and troubleshooting if any will be a challenge....
Hence suggest to leave this check as is.

@ifed01

ifed01 Mar 28, 2017

Contributor

Saying about aligned clone_range offsets do you mean they are aligned with min_alloc_size? Or just src_offset == dst_offset? The latter isn't enough. At least your latest fix for store_test (commit b2e7f42) assumes no real min_alloc_size alignment. And what's about length alignment during clone_range? Unaligned length can probably cause the same issue.
Anyway IMO such restrictions on clone_range params are pretty implicit at the moment and might tend to change in future. Their impact on the blob reuse stuff is even more implicit and troubleshooting if any will be a challenge....
Hence suggest to leave this check as is.

This comment has been minimized.

@liewegas

liewegas Mar 28, 2017

Member

yep, makes sense!

@liewegas

liewegas Mar 28, 2017

Member

yep, makes sense!

Show outdated Hide outdated src/os/bluestore/BlueStore.cc
// its reuse ratio, e.g. in case of reverse write
uint32_t suggested_boff =
(wi.logical_offset - (wi.b_off0 - wi.b_off)) % max_bsize;
if((suggested_boff % (1 << csum_order)) == 0 &&

This comment has been minimized.

@liewegas

liewegas Mar 27, 2017

Member

whitespace

@liewegas

liewegas Mar 27, 2017

Member

whitespace

@ifed01

This comment has been minimized.

Show comment
Hide comment
@ifed01

ifed01 Mar 28, 2017

Contributor

insert doesn't return an iterator until c++14 iirc?

Actually that's a C++11 feature. Looks like some older gcc lacks its support though. Works for my gcc version 5.4.0. Will make a workaround...

Contributor

ifed01 commented Mar 28, 2017

insert doesn't return an iterator until c++14 iirc?

Actually that's a C++11 feature. Looks like some older gcc lacks its support though. Works for my gcc version 5.4.0. Will make a workaround...

Igor Fedotov added some commits Feb 27, 2017

Igor Fedotov
os/bluestore: try to reuse blob rather than create new one on overwrite.
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Igor Fedotov
os/bluestore: add test cases for BlueStore blob reuse.
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Igor Fedotov
os/bluestore: remove excessive 0x in logging
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Igor Fedotov
test/store_test: adjust OnodeSizeTracking test case to measure full o…
…bject metadata using empty Onode as a basis

Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>
Igor Fedotov
os/bluestore: allow 'single step' blob reuse when overwriting
Signed-off-by: Igor Fedotov <ifedotov@mirantis.com>

@liewegas liewegas merged commit a66d658 into ceph:master Mar 29, 2017

2 of 3 checks passed

default Build finished.
Details
Signed-off-by all commits in this PR are signed
Details
Unmodifed Submodules submodules for project are unmodified
Details

@ifed01 ifed01 deleted the ifed01:wip-bluestore-minimiza-blobs2 branch Mar 29, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment