os/bluestore: shard extent map #10963
4351945
to
605063e
Compare
while (in_flight)
  cond.Wait(lock);
store->umount();
store->fsck();
We have fsck_on_(u)mount set to true for this test suite (see main func) hence there is no need to call fsck directly. That's just a waste of time...
a9a83dc
to
0160c6a
Compare
This passes tests, except for the bitmap granularity issue in #10999 that is worked around in that PR. I suggest we merge that one too (with an interim fix) until we do something more clever with min_alloc_size.
{
const char *p = key.c_str();
if (key.length() < 2 + 8 + 4)
if (key.length() < 8)
8 --- ugh. Magic number.
Signed-off-by: Sage Weil <sage@redhat.com>
0160c6a
to
2a77a90
Compare
Thanks @allensamuels, fixed that. Also rolled the allocator granularity bug workaround into this PR for now so that the tests pass with all the new fsck's. It's passing the full test suite for me; I think it's ready to merge.
[==========] 61 tests from 1 test case ran. (2841968 ms total)
lock("BlueStore::Collection::lock", true, false),
exists(true),
bnode_set(MAX(16, g_conf->bluestore_onode_cache_size / 128)),
onode_map(cs)
shared_blob_set(MAX(16, g_conf->bluestore_onode_cache_size / 4)),
More magic numbers -- that are different from the old magic numbers :)
Rewrote much of the persistence of onode metadata. The highlights:
- extents and blobs stored together (the blob with the first referencing extent).
- extents sharded across multiple k/v keys.
- if a blob is referenced from multiple shards, it's stored in the onode key (called a "spanning blob").
- when we clone a blob we copy the metadata, but mark it shared and put (just) the ref_map on the underlying blocks in a shared_blob key. At this point we also assign a globally unique id (sbid = shared blob id) so the key has a unique name.
- we instantiate a SharedBlob in memory regardless of whether we need to load the ref_map (which is only needed for deallocations!). The BufferSpace is attached to this SharedBlob so we get unified caching across clones.
Signed-off-by: Sage Weil <sage@redhat.com>
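To make the sharding and "spanning blob" idea concrete, here is a minimal sketch. It is not the BlueStore code: `Extent`, `find_spanning_blobs`, and the fixed-size `shard_span` are hypothetical names and simplifications, and it assumes a blob counts as spanning when the extents that reference it land in more than one shard. Each extent is assigned to a shard by its starting offset only.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical, simplified stand-in for a logical extent; not the
// actual BlueStore structure.
struct Extent {
  uint64_t logical_offset;
  uint64_t length;
  int blob_id;
};

// Assign each extent to a shard using fixed-size logical ranges (an
// assumption made for this sketch), then return the ids of all blobs
// that are referenced from more than one shard.
std::set<int> find_spanning_blobs(const std::vector<Extent>& extents,
                                  uint64_t shard_span) {
  assert(shard_span > 0);
  std::map<int, std::set<uint64_t>> shards_by_blob;
  for (const Extent& e : extents) {
    // Shard index is derived from the extent's starting offset only.
    shards_by_blob[e.blob_id].insert(e.logical_offset / shard_span);
  }
  std::set<int> spanning;
  for (const auto& entry : shards_by_blob) {
    if (entry.second.size() > 1) {
      spanning.insert(entry.first);
    }
  }
  return spanning;
}
```

In this sketch a blob referenced only within one shard could be encoded alongside that shard's extents, while the returned ids are the candidates that would have to live somewhere shared, such as the onode key.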
We could bump the _max value for a TransContext in its prepare state, have it wait for a long time on IO, and let another txc allocate and commit something with an id higher than the previous max. Fix this first by pushing the max ids into the TransContext where we can deal with them at commit time, and then making _kv_sync_thread bump the committed max in a safe way. Note that this will need to change if/when we do these commits in parallel. Signed-off-by: Sage Weil <sage@redhat.com>
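A rough sketch of the ordering problem and the fix described above, under loud assumptions: `TxcSketch` and `IdTracker` are hypothetical types, ids are handed out one at a time, and a single thread performs all commits. The point is only that each transaction remembers the highest id it allocated, and the committed max is raised at commit time regardless of which transaction finishes first.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a transaction context: it only records the
// highest id allocated on its behalf.
struct TxcSketch {
  uint64_t last_id = 0;
};

// Hypothetical tracker; assumes one sync thread calls on_commit().
struct IdTracker {
  uint64_t next_id = 0;        // in-memory allocation cursor
  uint64_t committed_max = 0;  // highest id covered by a committed max

  // Allocate an id and remember it in the transaction, not globally.
  uint64_t allocate(TxcSketch& txc) {
    uint64_t id = ++next_id;
    txc.last_id = std::max(txc.last_id, id);
    return id;
  }

  // Called at commit time. Returns true when the committed max had to be
  // raised, i.e. when a new max value would need to be persisted.
  bool on_commit(const TxcSketch& txc) {
    assert(txc.last_id <= next_id);  // cannot commit an unallocated id
    if (txc.last_id > committed_max) {
      committed_max = txc.last_id;
      return true;
    }
    return false;
  }
};
```

If a slow transaction allocates id 1 and a faster one allocates id 2 and commits first, the committed max becomes 2 at that commit, and the slow transaction's later commit leaves it unchanged. As the commit message notes, this reasoning relies on commits being serialized.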
Only examine the range we just wrote to (and to the left and right). Signed-off-by: Sage Weil <sage@redhat.com>
This has to be block_size bits because min_alloc_size can vary over mounts. Signed-off-by: Sage Weil <sage@redhat.com>
We need to handle objects written during previous mounts that may have had a smaller min_alloc_size. Use block_size, which is a safe lower bound. Signed-off-by: Sage Weil <sage@redhat.com>
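The reasoning about min_alloc_size changing across mounts can be illustrated with a small alignment check. This is a sketch with hypothetical names (`is_aligned` is not a BlueStore function) and example sizes chosen for illustration: an extent allocated under a smaller min_alloc_size fails an alignment check against a larger one, but still passes against block_size.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: true when both the offset and the length are
// multiples of the given granularity.
bool is_aligned(uint64_t offset, uint64_t length, uint64_t granularity) {
  assert(granularity > 0);
  return offset % granularity == 0 && length % granularity == 0;
}
```

For example, with an assumed block_size of 4096, an extent at offset 4096 with length 4096 written while min_alloc_size was 4096 is not aligned to a later min_alloc_size of 65536, so a check keyed on the current min_alloc_size would wrongly flag it. The same extent is aligned to block_size, which is why block_size works as a safe lower bound.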
These were taking min_alloc_size, but this can change across mounts; better to use the logical blob length instead (that's what we want anyway!). Signed-off-by: Sage Weil <sage@redhat.com>
79b79e8
to
fad3d99
Compare
Rewrote much of the persistence of onode metadata. The
highlights: