
os/bluestore: _do_remove: dirty shard individually as each blob is unshared #16822

Merged
merged 1 commit into ceph:master from liewegas:wip-20849 on Aug 7, 2017

Conversation

@liewegas
Member

liewegas commented Aug 4, 2017

I'm a bit nervous about this fix because it means we dirty slightly less, and there might be misuses of dirty_range that relied on that sloppiness elsewhere in the code. I'd rather give it lots of testing before backport to luminous...


@liewegas liewegas requested a review from xiexingguo Aug 4, 2017

@xiexingguo


Member

xiexingguo commented Aug 5, 2017

@liewegas I doubt this is the right fix. fault_range and dirty_range share the same logic for seeking shards. If dirty_range complains that some shard is missing (not loaded), then perhaps it is because fault_range did something wrong? (Or simply because fault_range loads fewer shards than we expect it to?)

@liewegas


Member

liewegas commented Aug 5, 2017

It's not directly related to fault_range here. We are looking at a different object and checking whether the blob happens to be loaded. If so, we unshare it and include it in the range... but it only needs to be enough to ensure we write the shard for the blob we unshared.

Hmm, I think there is still a bug here, though. You could have two shards in memory, say 0~0x80000 and 0x100000~0x80000, unshare a blob in each of those, and end up with a range that covers the gap (0x80000~0x80000), which isn't loaded?

I think we should actually call dirty_range for each blob we unshare, for exactly the range that it covers (or a subset of it--just enough to ensure we dirty the shard).

@xiexingguo


Member

xiexingguo commented Aug 5, 2017

It's not directly related to fault_range here. We are looking at a different object and checking whether the blob happens to be loaded. If so, we unshare it and include it in the range... but it only needs to be enough to ensure we write the shard for the blob we unshared.

Sorry, I didn't read the original tracker carefully (it happens during BlueStore::_do_remove).

Hmm, I think there is still a bug here, though. You could have two shards in memory, say 0~0x80000 and 0x100000~0x80000, unshare a blob in each of those, and end up with a range that covers the gap (0x80000~0x80000), which isn't loaded?

Exactly.

I think we should actually call dirty_range for each blob we unshare, for exactly the range that it covers (or a subset of it--just enough to ensure we dirty the shard).

Sounds reasonable.

os/bluestore: _do_remove: dirty shard individually as each blob is unshared

Two problems with old code:

1- dirty_shard range is inclusive, so we might dirty the shard after b_end
2- we might unshare blobs in two shards with an unloaded shard in between,
which would mean dirtying a shard that isn't loaded.

Fix by ensuring the shard for each unshared blob is dirty individually.

Fixes: http://tracker.ceph.com/issues/20849
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas changed the title from os/bluestore: fix dirty_shard off-by-one to os/bluestore: _do_remove: dirty shard individually as each blob is unshared Aug 6, 2017

@liewegas


Member

liewegas commented Aug 6, 2017

@xiexingguo updated!

@liewegas liewegas added this to the luminous milestone Aug 6, 2017

@xiexingguo


Member

xiexingguo commented Aug 7, 2017

@liewegas The change itself looks fine, but I have another question here:

  ghobject_t nogen = o->oid;
  nogen.generation = ghobject_t::NO_GEN;
  OnodeRef h = c->onode_map.lookup(nogen);

  if (!h || !h->exists) {
    return 0;
  }

  dout(20) << __func__ << " checking for unshareable blobs on " << h
	   << " " << h->oid << dendl;
  map<SharedBlob*,bluestore_extent_ref_map_t> expect;
  for (auto& e : h->extent_map.extent_map) {

I see we are looking the head object up directly from the cache and assuming its extent map is fully loaded here. Is this always true? What if it isn't?

@liewegas


Member

liewegas commented Aug 7, 2017

The loaded/unloaded state is a bit constrained at the moment. We never trim extents from the map--we can only throw out the whole onode. So shards can only be loaded, and loaded ones can be split or merged. So if the extent is there, its shard is loaded.

The blob might be a spanning blob. I don't think that matters, though (except that we don't actually need to dirty the shard in that case).

@liewegas


Member

liewegas commented Aug 7, 2017

retest this please

@liewegas


Member

liewegas commented Aug 7, 2017

http://pulpito.ceph.com/sage-2017-08-06_16:51:13-rados-wip-sage-testing2-20170806a-distro-basic-smithi/

I don't think the ceph_test_objectstore failure is related, but I want to be sure first before merging this.

@xiexingguo


Member

xiexingguo commented Aug 7, 2017

The loaded/unloaded state is a bit constrained at the moment. We never trim extents from the map--we can only throw out the whole onode. So shards can only be loaded, and loaded ones can be split or merged. So if the extent is there, its shard is loaded.

@liewegas That's true. Thanks for the detailed explanation. But let me make my point a bit clearer.

Rechecking the original logic of BlueStore::_do_remove:
(1) suppose we are currently deleting a gen object;
(2) we want to check whether there are any shared blobs in the corresponding head object that we might now be able to unshare, and if there are, mark them as unshared in the head object too.

Here is my concern: we are calling c->onode_map.lookup(nogen); instead of c->get_onode(nogen, false); here, which means we only check whether the head object has already been loaded into memory. But what if it is not? Shouldn't we instead load the head object from disk and continue the above check?

@xiexingguo


Member

xiexingguo commented Aug 7, 2017

I don't think the ceph_test_objectstore failure is related,

Me neither :P

@liewegas


Member

liewegas commented Aug 7, 2017

I chose to be opportunistic: if the nogen object is already in cache, unshare the blobs; if not, don't slow things down by loading any new metadata.

I'm not sure that's the best choice, but it seems reasonable. It's hard to tell whether to trade a metadata read now against the cost of loading and then deleting the shared blob in the future (when the blob is eventually freed). This optimization is really targeting EC pools that clone the overwritten range and then delete it moments later, so for that case at least the head object will basically always be there.

@liewegas


Member

liewegas commented Aug 7, 2017

Yeah, that csum crash triggered on two runs, one with this change and one without it. Merging!

@liewegas liewegas merged commit 311bfca into ceph:master Aug 7, 2017

4 checks passed:
- Signed-off-by: all commits in this PR are signed
- Unmodified Submodules: submodules for project are unmodified
- make check: make check succeeded
- make check (arm64): make check succeeded

@liewegas liewegas deleted the liewegas:wip-20849 branch Aug 7, 2017

@xiexingguo


Member

xiexingguo commented Aug 7, 2017

Got it. Fine with me. @liewegas Thanks for the explanation!

@liewegas


Member

liewegas commented Sep 6, 2017

/a/joshd-2017-08-04_06:16:52-rados-wip-20904-distro-basic-smithi/1482581 has the logs that led to diagnosis
