os/bluestore: synchronous on_applied completions #18196
Conversation
Force-pushed from ebf75ca to cc1bc3d
LGTM.
list<pair<Context*,int> > ls_rval;
// This way other threads can submit new contexts to complete
// while we are working.
vector<pair<Context*,int>> ls;
Do we need a pair here? It seems we're not handling anything other than a zero; if we are to handle non-zero values, then we can add it.
Other queue() method(s) can take a non-zero value of r.
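As a sketch of why the pair is worth carrying (all names here are illustrative stand-ins, not Ceph's actual Finisher or Context classes), storing a (Context*, r) pair lets a queue() caller hand a non-zero result code to each context:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Minimal stand-in for Ceph's Context; illustrative only.
struct Context {
  virtual ~Context() {}
  virtual void finish(int r) = 0;
  void complete(int r) { finish(r); delete this; }
};

// A context that records the result it was completed with.
struct Record : Context {
  int* out;
  explicit Record(int* o) : out(o) {}
  void finish(int r) override { *out = r; }
};

// Sketch of a finisher queue that keeps a per-context result code, so
// queue(c, r) with a non-zero r (e.g. an errno value) reaches finish(r).
struct MiniFinisher {
  std::vector<std::pair<Context*, int>> q;
  void queue(Context* c, int r = 0) { q.push_back(std::make_pair(c, r)); }
  void drain() {
    // Swap the queue out so other threads could keep submitting
    // new contexts while we complete this batch.
    std::vector<std::pair<Context*, int>> ls;
    ls.swap(q);
    for (auto p : ls)
      p.first->complete(p.second);
  }
};
```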
@@ -10272,7 +10284,6 @@ void PrimaryLogPG::_committed_pushed_object(

void PrimaryLogPG::_applied_recovered_object(ObjectContextRef obc)
{
  lock();
Maybe we need to assert for lock()?
}

void PrimaryLogPG::_applied_recovered_object_replica()
{
  lock();
Better to assert for lock().
Force-pushed from cc1bc3d to bac146b
Force-pushed from 054b19e to ec2d2e1
This is passing my tests now; ready for review!
will review this early tomorrow.
i need 1 or 2 more days to digest this change as i am not familiar with BlueStore.
c->complete(ls_rval.front().second);
ls_rval.pop_front();
}
for (auto p : ls) {
auto&
my assumption was that since the value is just 2 words (pointer + int) the reference isn't helpful? I guess it'll get optimized away either way...
cool, makes sense.
src/common/Finisher.h
Outdated
finisher_queue.push_back(NULL);
} else
finisher_queue.push_back(c);
finisher_queue.push_back(pair<Context*, int>(c, r));
nit: use make_pair(), w/o repeating the types of the std::pair elements.
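A tiny illustration of the nit (with int* standing in for Context*; names are illustrative, not the real Finisher members):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Illustrative only: make_pair() deduces the element types, so the call
// site doesn't have to repeat pair<Context*, int>.
std::vector<std::pair<int*, int>> finisher_queue;

void queue_one(int* c, int r) {
  // before: finisher_queue.push_back(std::pair<int*, int>(c, r));
  finisher_queue.push_back(std::make_pair(c, r));
  // C++11 alternative that constructs the pair in place:
  // finisher_queue.emplace_back(c, r);
}
```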
@@ -4049,6 +4049,7 @@ int PG::build_scrub_map_chunk(
// objects
vector<hobject_t> ls;
vector<ghobject_t> rollback_obs;
osr->flush();
shall we consider this a bug fix, and hence a candidate to be backported?
The change is only needed because of the synchronous onreadable behavior change in BlueStore. Until now, the OSD didn't expect changes to be there until onreadable is called, which happens later, when the creates/deletes are visible.
src/os/ObjectStore.h
Outdated
assert(out_on_applied);
assert(out_on_commit);
assert(out_on_applied_sync);
for (vector<Transaction>::iterator i = t.begin();
nit, range-based loop
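The suggested rewrite, sketched with a stand-in Transaction type (not the real ObjectStore::Transaction):

```cpp
#include <cassert>
#include <vector>

// Stand-in for ObjectStore::Transaction, just to show the loop shape.
struct Transaction { int ops; };

// before:
//   for (vector<Transaction>::iterator i = t.begin(); i != t.end(); ++i)
//     total += i->ops;
int count_ops(std::vector<Transaction>& t) {
  int total = 0;
  for (auto& i : t)  // range-based loop, as the review suggests
    total += i.ops;
  return total;
}
```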
src/osd/PrimaryLogPG.cc
Outdated
void finish(int r) override {
  pg->lock();
nit, i think we could use guardedly_lock() or with_unique_lock() here.
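The idea behind such a helper, sketched generically (guardedly_lock()/with_unique_lock() are the reviewer's names for Ceph helpers; this version is illustrative, not Ceph's actual API): take the lock, run the body, and release on every path, instead of pairing pg->lock()/pg->unlock() by hand.

```cpp
#include <cassert>
#include <mutex>

// Generic sketch of a scoped-lock helper: the lock_guard releases the
// mutex on every exit path, including exceptions thrown by fn().
template <typename Mutex, typename Fn>
void with_lock(Mutex& m, Fn&& fn) {
  std::lock_guard<Mutex> g(m);
  fn();
}
```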
src/os/ObjectStore.h
Outdated
@@ -1579,6 +1579,11 @@ class ObjectStore {
virtual bool wants_journal() = 0;  //< prefers a journal
virtual bool allows_journal() = 0; //< allows a journal

/// true if a txn is readable immediately after it is queued.
virtual bool is_sync_onreadable() {
could mark it const
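A capability query like this doesn't mutate the store, so it can be const and still overridden per backend. A minimal sketch (class names are stand-ins, not Ceph's):

```cpp
#include <cassert>

// Illustrative base class: the query is const, so it can be called
// through a const reference, and backends can still override it.
struct StoreBase {
  virtual ~StoreBase() {}
  virtual bool is_sync_onreadable() const { return false; }
};

// A backend (think BlueStore) that advertises sync onreadable.
struct SyncStore : StoreBase {
  bool is_sync_onreadable() const override { return true; }
};
```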
src/os/bluestore/BlueStore.h
Outdated
@@ -1655,6 +1655,8 @@ class BlueStore : public ObjectStore,
Sequencer *parent;
BlueStore *store;

spg_t shard_hint;
i think we can just remove the shard_hint in ObjectStore::Sequencer with this change, as nobody is actually reading it. fio_ceph_objectstore.cc sets it, but i am not sure if there is any consumer of this value.
The OSD is also using this to pass the shard_hint through (see PG ctor). So it goes PG -> Sequencer -> BlueStore::OpSequencer
http://pulpito.ceph.com/sage-2017-10-27_20:23:18-rados-wip-sage-testing-2017-10-27-1303-distro-basic-smithi/ has several failures to diagnose!
src/osd/PrimaryLogPG.cc
Outdated
@@ -141,6 +141,10 @@ class PrimaryLogPG::BlessedGenContext : public GenContext<T> {
  c.release()->complete(t);
  pg->unlock();
}
bool sync_finish(T t) {
  c.release()->complete(t);
Shouldn't it call (and return the result of) sync_complete() instead? The same below...
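A sketch of the delegation being asked for, with simplified stand-ins (not the actual GenContext/BlessedGenContext classes): the wrapper's sync_finish() forwards to the wrapped context's sync path and propagates whether it actually completed, rather than unconditionally calling complete():

```cpp
#include <cassert>

// Simplified context base: sync_complete() returns whether the work
// was done synchronously; false means fall back to the async path.
struct Ctx {
  virtual ~Ctx() {}
  virtual void finish(int r) {}
  virtual bool sync_finish(int r) { return false; }
  void complete(int r) { finish(r); delete this; }
  bool sync_complete(int r) {
    if (sync_finish(r)) { delete this; return true; }
    return false;
  }
};

// An inner context that does support synchronous completion.
struct SyncInner : Ctx {
  int* out;
  explicit SyncInner(int* o) : out(o) {}
  bool sync_finish(int r) override { *out = r; return true; }
};

// Wrapper: forward to the inner sync path instead of calling complete().
struct Blessed : Ctx {
  Ctx* c;
  explicit Blessed(Ctx* c) : c(c) {}
  void finish(int r) override { Ctx* t = c; c = nullptr; t->complete(r); }
  bool sync_finish(int r) override {
    if (c->sync_complete(r)) { c = nullptr; return true; }
    return false;  // inner needs the async path; caller falls back
  }
};
```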
src/os/bluestore/BlueStore.cc
Outdated
@@ -8921,6 +8921,7 @@ int BlueStore::queue_transactions(
} else {
  osr = new OpSequencer(cct, this);
  osr->parent = posr;
  osr->shard_hint = posr->shard_hint;
we can probably call shard_hint.hash_to_shard() and preserve its result in the osr instance instead. This saves us a bit...
src/os/bluestore/BlueStore.cc
Outdated
@@ -10653,6 +10653,7 @@ int BlueStore::_do_remove(
txc->removed(o);
o->extent_map.clear();
o->onode = bluestore_onode_t();
txc->note_modified_object(o);
The txc->removed() call above seems to be the only usage of the TransContext::removed() func. And the latter (in addition to the onodes list update) performs a removal from the 'modified_objects' list. Then we insert into it again with note_modified_object(). Maybe simply modify the removed() function behavior?
src/osd/PG.h
Outdated
@@ -848,6 +848,11 @@ class PG : public DoutPrefixProvider {
eversion_t v;
C_UpdateLastRollbackInfoTrimmedToApplied(PG *pg, epoch_t e, eversion_t v)
  : pg(pg), e(e), v(v) {}
bool sync_complete(int r) override {
Probably it's more correct/purer to override sync_finish() here instead?
src/osd/PrimaryLogPG.cc
Outdated
@@ -224,8 +224,15 @@ class PrimaryLogPG::C_OSD_AppliedRecoveredObject : public Context {
public:
C_OSD_AppliedRecoveredObject(PrimaryLogPG *p, ObjectContextRef o) :
  pg(p), obc(o) {}
bool sync_complete(int r) override {
sync_finish() instead?
src/osd/PrimaryLogPG.cc
Outdated
@@ -93,6 +93,11 @@ struct PrimaryLogPG::C_OSD_OnApplied : Context {
  epoch_t epoch,
  eversion_t v)
  : pg(pg), epoch(epoch), v(v) {}
bool sync_complete(int r) override {
sync_finish?
Force-pushed from ec2d2e1 to 28bcaa4
The parent may go away, so we need to keep our own copy of shard_hint in OpSequencer to avoid a use-after-free (e.g., when the user drops their osr and calls OpSequencer::discard()). Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This doesn't work as implemented. We are doing _txc_finalize_kv() from queue_transactions, which calls into the freelist and does this verify code. However, we have no assurance that a previous txc in the sequencer has applied its changes to the kv store, which means that a simple sequence like

- write object
- delete object

can trigger it if the write is waiting for aio. This currently happens with ObjectStore/StoreTest.SimpleRemount/2. Comment out the verify, but leave the _verify_range() helper in place in case we can use it in the future in some other context. Signed-off-by: Sage Weil <sage@redhat.com>
The one exception to "immediately readable" is collection_list, which is not readable until the kv transaction is applied. Our choices are:

1. Wait until the kv txn applies to trigger onreadable (for any create/remove ops). This wipes away much of the benefit of fully sync onreadable.
2. Add tracking for created/removed objects in BlueStore so that we can incorporate those into collection_list. This is complex.
3. flush() from collection_list. Unfortunately we don't have osr linked to Collection, so this doesn't quite work with the current ObjectStore interface.
4. Require the caller to flush() before list and put a big * next to the "immediately onreadable" claim.

It turns out that because of FileStore, the OSD already does flush() before collection_list anyway, so this does not require any actual change... except to the store_test tests. (This didn't affect filestore because store_test is using apply_transaction, which waits for readable, and on filestore that also implies visible by collection_list.) Signed-off-by: Sage Weil <sage@redhat.com>
We need to make sure we carry this ref through until the object is deleted, or else another request might try to read it before the kv txn is applied. (This is easy to trigger now that onreadable is completed at queue time instead of commit time.) Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Good suggestion from Igor! Signed-off-by: Sage Weil <sage@redhat.com>
Make the only caller of removed() not need to call note_modified_object separately, dropping the unneeded erase() call. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from 197fee9 to f525d86
I looked at the first 3 patches here for a different PR, so copying over the comment I had.
Besides trying to enforce easy-to-use interfaces, I'd like to see some performance tests before this merges. Just based on some commit messages I think there may be significant second-order effects here, although based on an irc conversation they might not be as serious as I'd initially thought.
class Context {
  Context(const Context& other);
  const Context& operator=(const Context& other);

protected:
  virtual void finish(int r) = 0;

  // variant of finish that is safe to call "synchronously."  override should
  // return true.
There ought to be a way to build this interface so that it doesn't require returning the correct value. (One terrible way: a private member we set in the default implementation indicating "not implemented" and check for when invoking it.)
Also, it needs more documentation in the source. What other constraints exist?
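One possible shape for the "terrible way" mentioned above, sketched with hypothetical names (none of these are the actual Ceph classes): the default sync override sets a private "not implemented" flag, and a non-virtual wrapper checks the flag instead of trusting each override to return the right value.

```cpp
#include <cassert>

class Ctx {
  bool sync_unimplemented = false;
protected:
  virtual void finish(int r) = 0;
  virtual void sync_finish(int r) {
    sync_unimplemented = true;  // default: no synchronous support
  }
public:
  virtual ~Ctx() {}
  void complete(int r) { finish(r); delete this; }
  // Returns true if the context completed synchronously. Overrides can
  // no longer claim sync support by accident, only by implementing it.
  bool try_sync_complete(int r) {
    sync_unimplemented = false;
    sync_finish(r);
    if (sync_unimplemented)
      return false;  // caller falls back to the async path
    delete this;
    return true;
  }
};

// A context that is safe to complete inline.
class SyncCtx : public Ctx {
  int* out;
protected:
  void finish(int r) override { *out = r; }
  void sync_finish(int r) override { *out = r; }  // safe to run inline
public:
  explicit SyncCtx(int* o) : out(o) {}
};

// A context that never implemented the sync variant.
class AsyncOnlyCtx : public Ctx {
protected:
  void finish(int r) override {}
};
```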
When we call handle_sub_write after a write completion, we may do a sync read completion and then call back into check_ops(). Attaching the on_write events to the op we're applying means that we don't ensure that the on_write event(s) happen before the next write in the queue is submitted (when we call back into check_ops()). For example, if we have op A, on_write event W, then op B, a sync applied completion would mean that we would queue the write for A, call back into SubWriteApplied -> handle_sub_write_reply -> check_ops and then process B... before getting to W. Resolve this by attaching the on_write callback to a separate Op that is placed into the queue, just like any other Op. This keeps the ordering logic clean, although it is a bit ugly with the polymorphism around Op being either an Op or an on_write callback. Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from f525d86 to 1908c06
Okay, ran a zillion more tests and grepped logs to verify the dicey EC change was being tested, and it looks good. @gregsfortytwo maybe there is some C++ syntax that allows an override without returning true, but I'm not sure that covers all the cases we care about. I implemented a C_Contexts finisher as part of this (not sure if it ended up in the final version, though; may not have needed it?) where the sync/non-sync behavior really is conditional. Happy to have someone come along and pretty it up later, but I don't want to get stuck on that now. There are fewer than a half dozen contexts to fix if we come up with something better later.
FWIW, I'm running this through tests now that caused a segfault on radoslaw's async read PR. Should complete in a couple of hours.
This did not fail on any of the NVMe (rbd, 4 node, 1 NVMe/node, moderate concurrency, 3x rep) performance tests I ran, but did not show any performance advantages versus a previous master run. There was a moderate (~10%) drop in write performance in several write oriented tests. I will need to compare against the commit this is based on to determine if it's specifically caused by this PR though.
Thanks @markhpc --- please let me know when you have the master comparison. Need to confirm that (1) there isn't a regression here and (2) figure out what broke master! (Also, this is blocking the pg removal, which is blocking the next thing. :)
I'm closer to narrowing down the segfault to Radoslaw's PR. He thinks he has an issue with op ordering that he's working on fixing now. Next up is testing the commit this PR is based on.
Ok, when comparing this PR vs da7d071, the variation I'm most concerned about disappears (historically I see more noise with large IO tests which is more or less what's left now). Both however remain slower vs an older master for small sequential writes. I'd go ahead and merge this and I'll have to bisect to figure out which PR in master introduced the small seq write regression. Thanks for waiting!
turns out I read most of this earlier in the pg removal pr
Fix up the size inconsistency on freelist init. This way it will always happen after an upgrade... and before the user moves to something post-luminous. Signed-off-by: Sage Weil <sage@redhat.com> (cherry picked from commit b064ed1) ceph#18196 Resolves: rhbz#1504179
This takes some pieces from wip-bluestore-rtc but applies them just to the onreadable portion. The short version is that any transaction is readable with bluestore at the end of queue_transactions, so we can trigger all of those completions synchronously. Those Contexts implement sync_complete(), which doesn't retake pg->lock (we already hold it).
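A rough sketch of that shape, with all names simplified stand-ins for PG, pg->lock, and the on_applied contexts: the sync path runs inline while pg->lock is already held, so unlike the finisher-thread path it must not take the lock again.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for a PG with its lock.
struct MiniPG {
  std::mutex lock;
  int applied = 0;
};

struct OnApplied {
  MiniPG* pg;
  // Async path: invoked later from a finisher thread, takes the lock.
  void finish() {
    std::lock_guard<std::mutex> g(pg->lock);
    ++pg->applied;
  }
  // Sync path: invoked from queue_transactions() with pg->lock already
  // held; taking a non-recursive mutex again here would deadlock.
  void sync_finish() {
    ++pg->applied;
  }
};
```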