
os/bluestore: synchronous on_applied completions #18196

Merged
liewegas merged 20 commits into ceph:master from wip-bluestore-sync-onreadable on Dec 19, 2017

Conversation

@liewegas (Member) commented Oct 9, 2017

This takes some pieces from wip-bluestore-rtc but applies them just to the onreadable portion. The short version is that any transaction is readable with bluestore at the end of queue_transaction, so we can trigger all of those completions synchronously. Those Contexts implement sync_complete(), which doesn't retake pg->lock (we already hold it).
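
A minimal, self-contained sketch of the pattern described above. The names (C_OnReadable, FakePG, sync_finish) are hypothetical stand-ins for the real ceph classes, and the exact hook signature is an assumption based on the diff hunks quoted later in this conversation:

    #include <mutex>

    struct Context {
      virtual ~Context() {}
      virtual void finish(int r) = 0;
      // Sync variant added by this PR (assumed shape); returns true if the
      // override handled the completion synchronously.
      virtual bool sync_finish(int r) { (void)r; return false; }
      void complete(int r) { finish(r); delete this; }
    };

    struct FakePG {
      std::mutex lock;          // stands in for PG::lock()/unlock()
      void on_readable(int) {}  // whatever work the completion does
    };

    struct C_OnReadable : Context {
      FakePG *pg;
      explicit C_OnReadable(FakePG *p) : pg(p) {}

      // Async path: runs later on a finisher thread, so it must take the PG lock.
      void finish(int r) override {
        std::lock_guard<std::mutex> l(pg->lock);
        pg->on_readable(r);
      }

      // Sync path: invoked directly from queue_transactions() while the caller
      // already holds the PG lock, so it must not retake it.  (The caller is
      // assumed to delete the Context after a successful sync completion.)
      bool sync_finish(int r) override {
        pg->on_readable(r);
        return true;
      }
    };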

@varadakari (Contributor) left a comment

LGTM.

list<pair<Context*,int> > ls_rval;
// This way other threads can submit new contexts to complete
// while we are working.
vector<pair<Context*,int>> ls;
Contributor:

Do we need a pair here? It doesn't seem to handle anything other than zero; if we are to handle non-zero values, then we can add it.

Member Author:

Other queue() methods can take a non-zero value of r.
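
For context, a hedged sketch of the drain-into-a-local pattern that the diff comment above refers to; MiniFinisher and its members are made-up names, not the real Finisher code:

    #include <mutex>
    #include <utility>
    #include <vector>

    struct Context {
      virtual ~Context() {}
      virtual void finish(int r) = 0;
      void complete(int r) { finish(r); delete this; }
    };

    struct MiniFinisher {
      std::mutex lock;
      std::vector<std::pair<Context*, int>> queue;  // (context, result code)

      // Other queue() variants may pass a non-zero r, hence the pair.
      void queue_completion(Context *c, int r = 0) {
        std::lock_guard<std::mutex> l(lock);
        queue.emplace_back(c, r);
      }

      void drain() {
        std::vector<std::pair<Context*, int>> ls;
        {
          std::lock_guard<std::mutex> l(lock);
          ls.swap(queue);  // take the whole batch while holding the lock
        }
        // Run completions without the lock so other threads can keep queueing.
        for (auto p : ls)  // copying the two-word pair is cheap (see below)
          p.first->complete(p.second);
      }
    };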

@@ -10272,7 +10284,6 @@ void PrimaryLogPG::_committed_pushed_object(

void PrimaryLogPG::_applied_recovered_object(ObjectContextRef obc)
{
lock();
Contributor:

Maybe we need to assert that the lock is held here?

}

void PrimaryLogPG::_applied_recovered_object_replica()
{
lock();
Contributor:

Better to assert that the lock is held.

@liewegas (Member Author):

This is passing my tests now; ready for review!

@tchaikov (Contributor):

will review this early tomorrow.

@tchaikov (Contributor) left a comment

I need 1 or 2 more days to digest this change as I am not familiar with BlueStore.

c->complete(ls_rval.front().second);
ls_rval.pop_front();
}
for (auto p : ls) {
Contributor:

auto&

Member Author:

my assumption was that since the value is just 2 words (pointer + int) the reference isn't helpful? I guess it'll get optimized away either way...

Contributor:

cool, makes sense.

finisher_queue.push_back(NULL);
} else
finisher_queue.push_back(c);
finisher_queue.push_back(pair<Context*, int>(c, r));
Contributor:

nit: use make_pair() to avoid repeating the types of the std::pair elements.

@@ -4049,6 +4049,7 @@ int PG::build_scrub_map_chunk(
// objects
vector<hobject_t> ls;
vector<ghobject_t> rollback_obs;
osr->flush();
Contributor:

shall we consider this a bug fix, and hence a candidate to be backported?

Member Author:

The change is only needed because of the synchronous onreadable behavior change in BlueStore. Before this change, the OSD doesn't expect changes to be there until onreadable is called, which happens later, when the creates/deletes are visible.
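
A hypothetical sketch of the caller-side rule this implies, with stand-in types rather than the real ObjectStore API; the point is only the ordering (flush the sequencer, then list the collection):

    #include <string>
    #include <vector>

    struct FakeSequencer {
      void flush() { /* wait for queued txns on this osr to apply */ }
    };
    struct FakeStore {
      void collection_list(std::vector<std::string> *out) { out->clear(); /* ... */ }
    };

    void list_for_scrub(FakeStore *store, FakeSequencer *osr,
                        std::vector<std::string> *out)
    {
      osr->flush();                 // make pending creates/deletes visible
      store->collection_list(out);  // the listing now matches what readers see
    }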

assert(out_on_applied);
assert(out_on_commit);
assert(out_on_applied_sync);
for (vector<Transaction>::iterator i = t.begin();
Contributor:

nit, range-based loop

void finish(int r) override {
pg->lock();
Contributor:

nit: I think we could use guardedly_lock() or with_unique_lock() here.

@@ -1579,6 +1579,11 @@ class ObjectStore {
virtual bool wants_journal() = 0; //< prefers a journal
virtual bool allows_journal() = 0; //< allows a journal

/// true if a txn is readable immediately after it is queued.
virtual bool is_sync_onreadable() {
Contributor:

could mark it const

@@ -1655,6 +1655,8 @@ class BlueStore : public ObjectStore,
Sequencer *parent;
BlueStore *store;

spg_t shard_hint;
Contributor:

I think we can just remove shard_hint from ObjectStore::Sequencer with this change, as nobody is actually reading it. fio_ceph_objectstore.cc sets it, but I am not sure there is any consumer of this value.

Member Author:

The OSD is also using this to pass the shard_hint through (see PG ctor). So it goes PG -> Sequencer -> BlueStore::OpSequencer
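
A hedged sketch of that pass-through with stand-in types (not the real ceph declarations); note that the OpSequencer keeps its own copy rather than reading through parent, which matters once the parent can go away (see the commit message further down):

    struct ShardHint   { unsigned shard = 0; };
    struct Sequencer   { ShardHint shard_hint; };   // ObjectStore-side handle, set by the PG ctor
    struct OpSequencer { Sequencer *parent = nullptr; ShardHint shard_hint; };

    void adopt_parent(OpSequencer *osr, Sequencer *posr) {
      osr->parent = posr;
      osr->shard_hint = posr->shard_hint;  // own copy, not a reference to the parent
    }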

@liewegas added the DNM label on Oct 27, 2017
@@ -141,6 +141,10 @@ class PrimaryLogPG::BlessedGenContext : public GenContext<T> {
c.release()->complete(t);
pg->unlock();
}
bool sync_finish(T t) {
c.release()->complete(t);
Contributor:

Shouldn't it call (and return result of) sync_complete instead? The same below...

@@ -8921,6 +8921,7 @@ int BlueStore::queue_transactions(
} else {
osr = new OpSequencer(cct, this);
osr->parent = posr;
osr->shard_hint = posr->shard_hint;
Contributor:

we can probably call shard_hint.hash_to_shard and preserve its result in osr instance instead. This saves us a bit...

@@ -10653,6 +10653,7 @@ int BlueStore::_do_remove(
txc->removed(o);
o->extent_map.clear();
o->onode = bluestore_onode_t();
txc->note_modified_object(o);
@ifed01 (Contributor) commented Nov 15, 2017

The txc->removed() call above seems to be the only usage of TransContext::removed(). And the latter (in addition to updating the onodes list) performs a removal from the 'modified_objects' list, and then we insert into it again with note_modified_object(). Maybe simply modify the behavior of removed()?
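
A hypothetical sketch of the simplification suggested here (stand-in types, not the real TransContext): removed() keeps the onode in modified_objects instead of erasing it, so the caller no longer needs a separate note_modified_object() call:

    #include <memory>
    #include <set>

    struct Onode {};
    using OnodeRef = std::shared_ptr<Onode>;

    struct TransContext {
      std::set<OnodeRef> onodes;            // onodes touched by this txn
      std::set<OnodeRef> modified_objects;  // refs we must hold until commit

      void note_modified_object(OnodeRef o) { modified_objects.insert(o); }

      void removed(OnodeRef o) {
        onodes.erase(o);
        modified_objects.insert(o);  // keep the ref; no erase + re-insert dance
      }
    };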

src/osd/PG.h Outdated
@@ -848,6 +848,11 @@ class PG : public DoutPrefixProvider {
eversion_t v;
C_UpdateLastRollbackInfoTrimmedToApplied(PG *pg, epoch_t e, eversion_t v)
: pg(pg), e(e), v(v) {}
bool sync_complete(int r) override {
Contributor:

Probably it's more correct/purer to override sync_finish() here instead?

@@ -224,8 +224,15 @@ class PrimaryLogPG::C_OSD_AppliedRecoveredObject : public Context {
public:
C_OSD_AppliedRecoveredObject(PrimaryLogPG *p, ObjectContextRef o) :
pg(p), obc(o) {}
bool sync_complete(int r) override {
Contributor:

sync_finish() instead?

@@ -93,6 +93,11 @@ struct PrimaryLogPG::C_OSD_OnApplied : Context {
epoch_t epoch,
eversion_t v)
: pg(pg), epoch(epoch), v(v) {}
bool sync_complete(int r) override {
Contributor:

sync_finish?

@liewegas force-pushed the wip-bluestore-sync-onreadable branch from ec2d2e1 to 28bcaa4 on November 15, 2017 16:54
The parent may go away, so we need to keep our own copy of shard_hint in
OpSequencer to avoid a use-after-free (e.g., when the user drops their
osr and calls OpSequencer::discard()).

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This doesn't work as implemented.  We are doing _txc_finalize_kv() from
queue_transactions, which calls into the freelist and does this verify
code.  However, we have no assurance that a previous txc in the sequencer
has applied its changes to the kv store, which means that a simple sequence
like

 - write object
 - delete object

can trigger if the write is waiting for aio.  This currently happens
with ObjectStore/StoreTest.SimpleRemount/2.

Comment out the verify, but leave _verify_range() helper in place in case
we can use it in the future in some other context.

Signed-off-by: Sage Weil <sage@redhat.com>
The one exception to the "immediately readable" rule is collection_list, which
is not readable until the kv transaction is applied.  Our choices are

1. Wait until kv to apply to trigger onreadable (for any create/remove
ops).  This wipes away much of the benefit of fully sync onreadable.

2. Add tracking for created/removed objects in BlueStore so that we can
incorporate those into collection_list.  This is complex.

3. flush() from collection_list.  Unfortunately we don't have osr linked
to Collection, so this doesn't quite work with the current ObjectStore
interface.

4. Require the caller flush() before list and put a big * next to the
"immediately onreadable" claim.  It turns out that because of FileStore,
the OSD already does flush() before collection_list anyway, so this does
not require any actual change... except to store_test tests.  (This didn't
affect filestore because store_test is using apply_transaction, which
waits for readable, and on filestore that also implies visible by
collection_list.)

Signed-off-by: Sage Weil <sage@redhat.com>
We need to make sure we carry this ref through until the object is
deleted, or else another request might try to read it before the kv
txn is applied.  (This is easy to trigger now that onreadable is completed
at queue time instead of commit time.)

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Good suggestion from Igor!

Signed-off-by: Sage Weil <sage@redhat.com>
Make the only caller of removed() not need to call note_modified_object
separately, dropping the unneeded erase() call.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
@gregsfortytwo (Member) left a comment

I looked at the first 3 patches here for a different PR, so copying over the comment I had.

Besides trying to enforce easy-to-use interfaces, I'd like to see some performance tests before this merges. Just based on some commit messages I think there may be significant second-order effects here, although based on an irc conversation they might not be as serious as I'd initially thought.

class Context {
Context(const Context& other);
const Context& operator=(const Context& other);

protected:
virtual void finish(int r) = 0;

// variant of finish that is safe to call "synchronously." override should
// return true.
Member:

There ought to be a way to build this interface so that it doesn't require returning the correct value. (One terrible way: a private member we set in the default implementation indicating "not implemented" and check for when invoking it.)

Also, it needs more documentation in the source. What other constraints exist?
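
One hypothetical way to build the "can't return the wrong value" interface suggested above, along the lines of the private-member idea: the base class records whether the default sync_finish() ran, so callers don't have to trust each override to return true. This is only a sketch of the suggestion, not what the PR implements:

    struct Context {
      virtual ~Context() {}
      virtual void finish(int r) = 0;
      void complete(int r) { finish(r); delete this; }

      // Caller-facing entry point: returns true (and consumes the Context) only
      // if a synchronous override actually ran.
      bool try_sync_complete(int r) {
        sync_supported = true;       // assume overridden until proven otherwise
        sync_finish(r);
        if (sync_supported) {        // default impl did not run, so it was handled
          delete this;
          return true;
        }
        return false;                // fall back to the async finish() path
      }

    protected:
      // Overrides do the synchronous work and must not delete themselves.
      // The default implementation flags "not implemented".
      virtual void sync_finish(int r) { (void)r; sync_supported = false; }

    private:
      bool sync_supported = false;
    };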

@liewegas mentioned this pull request on Dec 11, 2017
When we call handle_sub_write after a write completion, we may
do a sync read completion and then call back into check_ops().  Attaching
the on_write events to the op we're applying means that we don't ensure
that the on_write event(s) happen before the next write in the queue
is submitted (when we call back into check_ops()).

For example, if we have op A, on_write event W, then op B, a sync
applied completion would mean that we would queue the write for A, call
back into SubWriteApplied -> handle_sub_write_reply -> check_ops and then
process B... before getting to W.

Resolve this by attaching the on_write callback to a separate Op that is
placed into the queue, just like any other Op.  This keeps the ordering
logic clean, although it is a bit ugly with the polymorphism around Op
being either an Op or an on_write callback.

Signed-off-by: Sage Weil <sage@redhat.com>
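
A loose, hedged analogy for the fix described in the commit message above (generic stand-ins, not the real ECBackend types): on_write callbacks become their own entries in the FIFO pipeline instead of riding on the preceding op, so a synchronous completion of op A cannot reach op B before W runs:

    #include <deque>
    #include <functional>
    #include <utility>

    struct MiniPipeline {
      std::deque<std::function<void()>> ops;  // write ops and on_write callbacks alike

      void queue_write(std::function<void()> w)     { ops.push_back(std::move(w)); }
      void queue_on_write(std::function<void()> cb) { ops.push_back(std::move(cb)); }

      // check_ops() analogue: strictly FIFO, so for "A, W, B" nothing can skip W.
      void check_ops() {
        while (!ops.empty()) {
          auto work = std::move(ops.front());
          ops.pop_front();
          work();
        }
      }
    };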
@liewegas force-pushed the wip-bluestore-sync-onreadable branch from f525d86 to 1908c06 on December 12, 2017 17:22
@liewegas (Member Author):

Okay, ran a zillion more tests and grepped logs to verify the dicey ec change was being tested and it looks good.

@gregsfortytwo maybe there's some C++ syntax that allows an override without returning true, but I'm not sure that covers all the cases we care about--I implemented a C_Contexts finisher as part of this (not sure if it ended up in the final version, though; it may not have been needed?) where the sync/non-sync behavior really is conditional. Happy to have someone come along and pretty it up later, but I don't want to get stuck on that now. There are fewer than half a dozen contexts to fix if we come up with something better later.

@markhpc (Member) commented Dec 14, 2017

FWIW, I'm running this through tests now that caused a segfault on radoslaw's async read PR. Should complete in a couple of hours.

@markhpc (Member) commented Dec 14, 2017

This did not fail on any of the NVMe (rbd, 4 node, 1 NVMe/node, moderate concurrency, 3x rep) performance tests I ran, but did not show any performance advantages versus a previous master run. There was a moderate (~10%) drop in write performance in several write oriented tests. I will need to compare against the commit this is based on to determine if it's specifically caused by this PR though.

@liewegas (Member Author):

Thanks @markhpc --- please let me know when you have the master comparison. Need to confirm that (1) there isn't a regression here and (2) figure out what broke master! (Also, this is blocking the pg removal, which is blocking the next thing. :)

@markhpc (Member) commented Dec 15, 2017

I'm closer to narrowing down the segfault to Radoslaw's PR. He thinks he has an issue with op ordering that he's working on fixing now. Next up is testing the commit this PR is based on.

@markhpc (Member) commented Dec 16, 2017

Ok, when comparing this PR vs da7d071, the variation I'm most concerned about disappears (historically I see more noise with large IO tests which is more or less what's left now). Both however remain slower vs an older master for small sequential writes. I'd go ahead and merge this and I'll have to bisect to figure out which PR in master introduced the small seq write regression. Thanks for waiting!

@jdurgin (Member) left a comment

turns out I read most of this earlier in the pg removal pr

@liewegas merged commit d35fed9 into ceph:master on Dec 19, 2017
@liewegas deleted the wip-bluestore-sync-onreadable branch on December 19, 2017 14:22
ivancich pushed a commit to ivancich/ceph-fork that referenced this pull request Jan 3, 2018
Fix up the size inconsistency on freelist init.  This way it will always
happen after an upgrade... and before the user moves to something
post-luminous.

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit b064ed1)

ceph#18196

Resolves: rhbz#1504179