os/bluestore: shard extent map #10963

liewegas · 2016-09-02T16:52:52Z

Rewrote much of the persistence of onode metadata. The
highlights:

 - extents and blobs stored together (the blob with the
   first referencing extent).
 - extents sharded across multiple k/v keys
 - if a blob is referenced from multiple shards, it's
   stored in the onode key (called a "spanning blob").
 - when we clone a blob we copy the metadata, but mark
   it shared and put (just) the ref_map on the underlying
   blocks in a shared_blob key.  at this point we also
   assign a globally unique id (sbid = shared blob id)
   so the key has a unique name.
 - we instantiate a SharedBlob in memory regardless of
   whether we need to load the ref_map (which is only
   needed for deallocations!).  the BufferSpace is
   attached to this SharedBlob so we get unified caching
   across clones.
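
A minimal sketch of the sharding idea above, not the actual BlueStore code. It assumes that a "spanning blob" means a blob referenced by extents that land in more than one shard, and that each extent is classified by its starting offset. `Extent`, `shard_index`, and `find_spanning_blobs` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical, simplified logical extent: a range of the object that
// points at a blob.
struct Extent {
  uint64_t logical_offset;
  uint64_t length;
  int blob_id;
};

// Return the index of the shard that covers `offset`.
// `shard_starts` holds the first logical offset of each shard, sorted
// ascending, and must begin with 0.
inline std::size_t shard_index(const std::vector<uint64_t>& shard_starts,
                               uint64_t offset) {
  assert(!shard_starts.empty() && shard_starts.front() == 0);
  auto it = std::upper_bound(shard_starts.begin(), shard_starts.end(), offset);
  return static_cast<std::size_t>(it - shard_starts.begin()) - 1;
}

// Collect the ids of blobs that are referenced from more than one shard.
// In the scheme described above, these would live in the onode key rather
// than in any single shard key.
inline std::set<int> find_spanning_blobs(
    const std::vector<Extent>& extents,
    const std::vector<uint64_t>& shard_starts) {
  std::map<int, std::set<std::size_t>> shards_per_blob;
  for (const Extent& e : extents) {
    shards_per_blob[e.blob_id].insert(
        shard_index(shard_starts, e.logical_offset));
  }
  std::set<int> spanning;
  for (const auto& kv : shards_per_blob) {
    if (kv.second.size() > 1) {
      spanning.insert(kv.first);
    }
  }
  return spanning;
}
```

With shard boundaries at 0, 0x10000, and 0x20000, a blob referenced at offsets 0x1000 and 0x10000 would be reported as spanning, while a blob referenced only inside one shard would stay with that shard.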

ifed01 · 2016-09-06T11:53:10Z

src/test/objectstore/store_test.cc

+    while (in_flight)
+      cond.Wait(lock);
+    store->umount();
+    store->fsck();


We have fsck_on_(u)mount set to true for this test suite (see main func), so there is no need to call fsck directly. That's just a waste of time...

liewegas · 2016-09-06T20:02:03Z

This passes tests, except for the bitmap granularity issue in #10999, which is worked around in that PR. I suggest we merge that one too (with an interim fix) until we do something more clever with min_alloc_size.

liewegas · 2016-09-06T20:02:12Z

@chhabaramesh

allensamuels · 2016-09-06T20:30:08Z

src/os/bluestore/BlueStore.cc

 {
  const char *p = key.c_str();
-  if (key.length() < 2 + 8 + 4)
+  if (key.length() < 8)


8 --- ugh. Magic number.


liewegas · 2016-09-06T22:34:24Z

Thanks @allensamuels, fixed that. Also rolled the allocator granularity bug workaround into this PR for now so that the tests pass with all the new fsck's. It's passing the full test suite for me; I think it's ready to merge.

[==========] 61 tests from 1 test case ran. (2841968 ms total)
[ PASSED ] 61 tests.

allensamuels · 2016-09-06T23:02:50Z

src/os/bluestore/BlueStore.cc

    lock("BlueStore::Collection::lock", true, false),
    exists(true),
-    bnode_set(MAX(16, g_conf->bluestore_onode_cache_size / 128)),
-    onode_map(cs)
+    shared_blob_set(MAX(16, g_conf->bluestore_onode_cache_size / 4)),


More magic numbers -- that are different from the old magic numbers :)


We could bump the _max value for a TransContext in its prepare state, have it wait for a long time on IO, and let another txc allocate and commit something with an id higher than the previous max. Fix this first by pushing the max ids into the TransContext where we can deal with them at commit time, and then making _kv_sync_thread bump the committed max in a safe way. Note that this will need to change if/when we do these commits in parallel. Signed-off-by: Sage Weil <sage@redhat.com>
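
A rough sketch of the fix described above, under simplifying assumptions: a single id space, a single sync thread that commits batches in order, and a fixed preallocation window. `IdAllocator`, `allocate`, `prepare_commit`, and the `last_id` field are hypothetical names for illustration, not the BlueStore implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical transaction context: remembers the highest id it was given.
struct TransContext {
  uint64_t last_id = 0;
};

class IdAllocator {
 public:
  explicit IdAllocator(uint64_t prealloc) : prealloc_(prealloc) {}

  // Prepare path (any thread): hand out an id and record it on the txc.
  // Nothing is persisted here, so a txc stalled on IO cannot hold back
  // the committed max.
  uint64_t allocate(TransContext& txc) {
    std::lock_guard<std::mutex> l(lock_);
    uint64_t id = ++last_allocated_;
    txc.last_id = std::max(txc.last_id, id);
    return id;
  }

  // Commit path (single sync thread): look at the ids actually carried by
  // the batch about to commit. Returns true and sets *new_max when a larger
  // max must be written in the same k/v transaction as the batch.
  bool prepare_commit(const std::vector<TransContext*>& batch,
                      uint64_t* new_max) {
    uint64_t highest = 0;
    for (const TransContext* t : batch) {
      highest = std::max(highest, t->last_id);
    }
    if (highest < committed_max_) {
      return false;  // every id in the batch is already covered
    }
    committed_max_ = highest + prealloc_;
    *new_max = committed_max_;
    return true;
  }

  // Only meaningful on the sync thread in this sketch.
  uint64_t committed_max() const { return committed_max_; }

 private:
  std::mutex lock_;
  uint64_t last_allocated_ = 0;
  uint64_t committed_max_ = 0;  // touched only by the sync thread
  uint64_t prealloc_;
};
```

Because the decision is made per batch at commit time, a transaction that allocated a low id and then stalled cannot leave a later, higher id committed above the persisted max. As the commit message notes, this relies on commits being serialized.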


liewegas added the bluestore label Sep 2, 2016

liewegas force-pushed the wip-bluestore-sharded-extent-map branch 3 times, most recently from 4351945 to 605063e Compare September 2, 2016 20:26

ifed01 reviewed Sep 6, 2016

liewegas force-pushed the wip-bluestore-sharded-extent-map branch 3 times, most recently from a9a83dc to 0160c6a Compare September 6, 2016 19:20

allensamuels reviewed Sep 6, 2016

liewegas added 3 commits September 6, 2016 17:58

ceph_test_objectstore: add SyntheticMatrixSharding

025bbc6

Signed-off-by: Sage Weil <sage@redhat.com>

ceph_test_objectstore: occasional umount/fsck/mount

933a1da

Signed-off-by: Sage Weil <sage@redhat.com>

ceph_test_objectstore: test shards for longer

58030cc

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas force-pushed the wip-bluestore-sharded-extent-map branch from 0160c6a to 2a77a90 Compare September 6, 2016 21:58

liewegas mentioned this pull request Sep 6, 2016

os/bluestore: use block_size for bitmap granularity #10999

Closed

allensamuels reviewed Sep 6, 2016

markhpc added the performance label Sep 7, 2016

liewegas added 10 commits September 7, 2016 11:26

os/bluestore: optimize compress_extent_map

7f35725

Only examine the range we just wrote to (and to the left and right). Signed-off-by: Sage Weil <sage@redhat.com>
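
To illustrate the idea in this commit message, here is a small sketch that merges adjacent extents only around a just-written range instead of walking the whole map. The `Extent` layout, the `ExtentMap` alias, and `compress_range` are assumptions made for the example and do not mirror the real compress_extent_map signature.

```cpp
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical, simplified extent: maps a logical range onto a blob range.
struct Extent {
  uint64_t length;
  int blob_id;
  uint64_t blob_offset;
};

// Keyed by logical offset.
using ExtentMap = std::map<uint64_t, Extent>;

// Merge neighbouring extents that are contiguous both logically and within
// the same blob, but only inspect the written range [offset, offset+length)
// plus one extent to its left and one to its right.
// Returns the number of merges performed.
inline int compress_range(ExtentMap& em, uint64_t offset, uint64_t length) {
  if (em.empty()) {
    return 0;
  }
  const uint64_t end = offset + length;
  auto p = em.lower_bound(offset);
  if (p != em.begin()) {
    --p;  // include the left neighbour
  }
  int merged = 0;
  while (p != em.end()) {
    auto n = std::next(p);
    if (n == em.end() || p->first >= end) {
      break;  // p is already at or past the right neighbour
    }
    const Extent& b = n->second;
    if (p->second.blob_id == b.blob_id &&
        p->first + p->second.length == n->first &&
        p->second.blob_offset + p->second.length == b.blob_offset) {
      p->second.length += b.length;  // absorb n into p
      em.erase(n);
      ++merged;
    } else {
      p = n;
    }
  }
  return merged;
}
```

Extents far from the written range are left alone even if they could be merged, which is the point of the optimization: the cost tracks the size of the write rather than the size of the map.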

os/bluestore: fix fsck used_block bitmap

6e251cf

This has to be block_size bits because min_alloc_size can vary over mounts. Signed-off-by: Sage Weil <sage@redhat.com>
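
A small sketch of why the bitmap is kept at block_size granularity: extents written under different min_alloc_size values only share block_size as a common unit, so one bit per block can represent all of them. `mark_used` and its parameters are hypothetical and stand in for whatever fsck actually does.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Mark [offset, offset+length) as used in a bitmap with one bit per block.
// Returns false if any block in the range was already marked, which an
// fsck-style check would report as a double reference.
inline bool mark_used(std::vector<bool>& used_blocks, uint64_t block_size,
                      uint64_t offset, uint64_t length) {
  // Assumption: every extent is block-aligned, whatever min_alloc_size
  // was in effect when it was written.
  assert(block_size > 0);
  assert(offset % block_size == 0 && length % block_size == 0);
  bool clean = true;
  const uint64_t first = offset / block_size;
  const uint64_t last = (offset + length) / block_size;
  for (uint64_t b = first; b < last; ++b) {
    if (used_blocks.at(b)) {
      clean = false;
    }
    used_blocks[b] = true;
  }
  return clean;
}
```

With a 4096-byte block, two adjacent 4 KiB extents written under a small min_alloc_size occupy separate bits, whereas a bitmap with one bit per 16 KiB would place both in the same bit and flag a false collision.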

os/bluestore: use block_size for allocator unit

dcc58c9

We need to handle objects written during previous mounts that may have had a smaller min_alloc_size. Use block_size, which is a safe lower bound. Signed-off-by: Sage Weil <sage@redhat.com>

os/bluestore: make blob_t unused helpers use logical length

2df9aa8

These were taking min_alloc_size, but this can change across mounts; better to use the logical blob length instead (that's what we want anyway!). Signed-off-by: Sage Weil <sage@redhat.com>

os/bluestore: instrument big/small writes

e152e97

Signed-off-by: Sage Weil <sage@redhat.com>

os/bluestore: instrument transaction count

3fb6c5c

Signed-off-by: Sage Weil <sage@redhat.com>

os/bluestore: instrument onode reshard events

f69af0b

Signed-off-by: Sage Weil <sage@redhat.com>

os/bluestore: dump some stats after fsck

2d8a145

Signed-off-by: Sage Weil <sage@redhat.com>

os/bluestore: assert shared blob cache cleared on split

fad3d99

Signed-off-by: Sage Weil <sage@redhat.com>

liewegas force-pushed the wip-bluestore-sharded-extent-map branch from 79b79e8 to fad3d99 Compare September 7, 2016 15:35

liewegas merged commit 68cf9d8 into ceph:master Sep 7, 2016

liewegas deleted the wip-bluestore-sharded-extent-map branch September 7, 2016 15:36
