os/bluestore: synchronous on_applied completions #18196

Merged
merged 20 commits into from Dec 19, 2017

7 participants
@liewegas
Member

commented Oct 9, 2017

This takes some pieces from wip-bluestore-rtc but applies them just to the onreadable portion.
The short version is that any transaction is readable with bluestore at the end of queue_transaction,
so we can trigger all of those completions synchronously. Those Contexts implement sync_complete(),
which doesn't retake pg->lock (we already hold it).
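
As a rough illustration of the idea described above, here is a minimal sketch of a completion that has both an asynchronous path (which takes the lock) and a synchronous path (which assumes the caller already holds it). This is not the code from this PR: `PG`, `C_Applied`, and `complete_on_applied` are hypothetical stand-ins, and the `Context` shown is a simplified model of the interface discussed later in the review.

```cpp
#include <mutex>
#include <vector>

// Simplified stand-in for a completion context; not the real Ceph class.
struct Context {
  virtual ~Context() = default;
  // Asynchronous path: run finish() and free the context.
  virtual void complete(int r) { finish(r); delete this; }
  // Synchronous path: returns false (and leaves the context alive) when the
  // subclass does not support being completed synchronously.
  virtual bool sync_complete(int r) {
    if (sync_finish(r)) { delete this; return true; }
    return false;
  }
 protected:
  virtual void finish(int r) = 0;
  virtual bool sync_finish(int) { return false; }
};

// Hypothetical placement group with a non-recursive lock.
struct PG {
  std::mutex lock;
  int applied = 0;
};

// Hypothetical on_applied callback.
struct C_Applied : Context {
  PG* pg;
  explicit C_Applied(PG* p) : pg(p) {}
 protected:
  void finish(int) override {
    // Called later from another thread, so the lock must be taken here.
    std::lock_guard<std::mutex> l(pg->lock);
    ++pg->applied;
  }
  bool sync_finish(int) override {
    // Called at queue time; the caller already holds pg->lock.
    ++pg->applied;
    return true;
  }
};

// Try the synchronous path first; otherwise defer the context for later.
inline void complete_on_applied(Context* c, bool store_is_sync,
                                std::vector<Context*>& deferred) {
  if (store_is_sync && c->sync_complete(0)) return;
  deferred.push_back(c);
}
```

With a store that is readable at queue time, the callback runs inline under the lock the caller already holds; with any other store, it lands in `deferred` and is completed later through `complete()`.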

@liewegas liewegas requested a review from jdurgin Oct 9, 2017

@liewegas liewegas force-pushed the liewegas:wip-bluestore-sync-onreadable branch from ebf75ca to cc1bc3d Oct 9, 2017

@varadakari
Contributor

left a comment

LGTM.

list<pair<Context*,int> > ls_rval;
// This way other threads can submit new contexts to complete
// while we are working.
vector<pair<Context*,int>> ls;

@varadakari

varadakari Oct 10, 2017

Contributor

Do we need a pair here? It doesn't seem to handle anything other than zero. If we later need to handle non-zero values, we can add it then.

@liewegas

liewegas Oct 10, 2017

Author Member

other queue() method(s) can take a non-zero value of r
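
To make the exchange above concrete, here is a small sketch of a finisher-style queue that stores each callback together with its result code and swaps the queue out under the lock before running anything. It is a simplified model, not Ceph's Finisher: `MiniFinisher` and `Callback` are hypothetical names, and `std::function` stands in for `Context*`.

```cpp
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a Context: a callback taking a result code.
using Callback = std::function<void(int)>;

class MiniFinisher {
  std::mutex lock_;
  std::vector<std::pair<Callback, int>> queue_;

 public:
  // Queue a callback with the result code it should receive (0 by default).
  void queue(Callback c, int r = 0) {
    std::lock_guard<std::mutex> l(lock_);
    queue_.emplace_back(std::move(c), r);
  }

  // Swap the pending entries out under the lock, then run them unlocked.
  // This way other threads (or the callbacks themselves) can submit new
  // entries while we are working. Returns the number of callbacks run.
  std::size_t drain() {
    std::vector<std::pair<Callback, int>> ls;
    {
      std::lock_guard<std::mutex> l(lock_);
      ls.swap(queue_);
    }
    for (auto& p : ls) {
      p.first(p.second);
    }
    return ls.size();
  }
};
```

Keeping the `int` next to each callback is what lets a caller pass a non-zero `r` through the same queue, which is the reason given above for using a pair.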

@@ -10272,7 +10284,6 @@ void PrimaryLogPG::_committed_pushed_object(

void PrimaryLogPG::_applied_recovered_object(ObjectContextRef obc)
{
lock();

@varadakari

varadakari Oct 10, 2017

Contributor

Maybe we need to assert that the lock is held here?

}

void PrimaryLogPG::_applied_recovered_object_replica()
{
lock();

@varadakari

varadakari Oct 10, 2017

Contributor

Better to assert that the lock is held.

@liewegas liewegas force-pushed the liewegas:wip-bluestore-sync-onreadable branch from cc1bc3d to bac146b Oct 10, 2017

@liewegas liewegas force-pushed the liewegas:wip-bluestore-sync-onreadable branch from 054b19e to ec2d2e1 Oct 18, 2017

@liewegas liewegas requested a review from tchaikov Oct 18, 2017

@liewegas

Member Author

commented Oct 18, 2017

This is passing my tests now; ready for review!

@tchaikov

Contributor

commented Oct 19, 2017

will review this early tomorrow.

@tchaikov
Contributor

left a comment

i need 1 or 2 more days to digest this change as i am not familiar with BlueStore.

c->complete(ls_rval.front().second);
ls_rval.pop_front();
}
for (auto p : ls) {

@tchaikov

tchaikov Oct 20, 2017

Contributor

auto&

@liewegas

liewegas Oct 20, 2017

Author Member

my assumption was that since the value is just 2 words (pointer + int) the reference isn't helpful? I guess it'll get optimized away either way...

@tchaikov

tchaikov Oct 20, 2017

Contributor

cool, makes sense.

finisher_queue.push_back(NULL);
} else
finisher_queue.push_back(c);
finisher_queue.push_back(pair<Context*, int>(c, r));

@tchaikov

tchaikov Oct 20, 2017

Contributor

nit, make_pair() w/o repeating the types of std::pair elements..

@@ -4049,6 +4049,7 @@ int PG::build_scrub_map_chunk(
// objects
vector<hobject_t> ls;
vector<ghobject_t> rollback_obs;
osr->flush();

@tchaikov

tchaikov Oct 20, 2017

Contributor

shall we consider this a bug fix, and hence a candidate to be backported?

@liewegas

liewegas Oct 20, 2017

Author Member

The change is only needed because of the synchronous onreadable behavior change in BlueStore. Until now, the OSD did not expect changes to be there until onreadable was called, which happens later, when the creates/deletes are visible.
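
The following toy model sketches why a flush is needed before listing once completions fire at queue time. It is only an illustration under stated assumptions: `MiniStore` and its members are hypothetical, with a cache that is visible as soon as a create is queued and a separate key/value set that listing reads from.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical store: reads are served from a cache that is updated at queue
// time, while listing only sees what has been applied to the kv set.
class MiniStore {
  std::map<std::string, std::string> cache_;  // visible at queue time
  std::set<std::string> kv_;                  // what collection_list() sees
  std::vector<std::string> pending_;          // creates not yet applied to kv

 public:
  void queue_create(const std::string& oid, const std::string& data) {
    cache_[oid] = data;
    pending_.push_back(oid);
  }

  bool read(const std::string& oid, std::string* out) const {
    auto it = cache_.find(oid);
    if (it == cache_.end()) return false;
    *out = it->second;
    return true;
  }

  // Apply everything that is still pending to the kv set.
  void flush() {
    kv_.insert(pending_.begin(), pending_.end());
    pending_.clear();
  }

  std::vector<std::string> collection_list() const {
    return std::vector<std::string>(kv_.begin(), kv_.end());
  }
};
```

In this model a freshly queued object can be read right away but does not show up in `collection_list()` until `flush()` has run, which mirrors the reason given above for adding `osr->flush()` before the scrub listing.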

assert(out_on_applied);
assert(out_on_commit);
assert(out_on_applied_sync);
for (vector<Transaction>::iterator i = t.begin();

@tchaikov

tchaikov Oct 20, 2017

Contributor

nit, range-based loop

void finish(int r) override {
pg->lock();

@tchaikov

tchaikov Oct 20, 2017

Contributor

nit, i think we could use guardedly_lock() or with_unique_lock() here.

@@ -1579,6 +1579,11 @@ class ObjectStore {
virtual bool wants_journal() = 0; //< prefers a journal
virtual bool allows_journal() = 0; //< allows a journal

/// true if a txn is readable immediately after it is queued.
virtual bool is_sync_onreadable() {

@tchaikov

tchaikov Oct 20, 2017

Contributor

could mark it const
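
A minimal sketch of what the const-qualified query could look like, and why it helps: callers holding only a const reference to the store can still ask the question. The class names and the default return value here are assumptions for illustration; the diff above does not show the body of the real method.

```cpp
// Hypothetical base class; the default value is an assumption.
struct ObjectStoreLike {
  virtual ~ObjectStoreLike() = default;
  /// true if a txn is readable immediately after it is queued.
  virtual bool is_sync_onreadable() const { return false; }
};

// Hypothetical backend that is readable at queue time.
struct SyncStore : ObjectStoreLike {
  bool is_sync_onreadable() const override { return true; }
};

// Because the query is const, it works through a const reference.
inline bool needs_finisher(const ObjectStoreLike& store) {
  return !store.is_sync_onreadable();
}
```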

@@ -1655,6 +1655,8 @@ class BlueStore : public ObjectStore,
Sequencer *parent;
BlueStore *store;

spg_t shard_hint;

@tchaikov

tchaikov Oct 20, 2017

Contributor

i think we can just remove the shard_hint in ObjectStore::Sequencer with this change. as nobody is actually reading it. fio_ceph_objectstore.cc sets it, but i am not sure if there is any consumer of this value.

@liewegas

liewegas Oct 20, 2017

Author Member

The OSD is also using this to pass the shard_hint through (see PG ctor). So it goes PG -> Sequencer -> BlueStore::OpSequencer


@liewegas liewegas added the DNM label Oct 27, 2017

@@ -141,6 +141,10 @@ class PrimaryLogPG::BlessedGenContext : public GenContext<T> {
c.release()->complete(t);
pg->unlock();
}
bool sync_finish(T t) {
c.release()->complete(t);

@ifed01

ifed01 Nov 15, 2017

Contributor

Shouldn't it call (and return result of) sync_complete instead? The same below...

@@ -8921,6 +8921,7 @@ int BlueStore::queue_transactions(
} else {
osr = new OpSequencer(cct, this);
osr->parent = posr;
osr->shard_hint = posr->shard_hint;

@ifed01

ifed01 Nov 15, 2017

Contributor

we can probably call shard_hint.hash_to_shard and preserve its result in osr instance instead. This saves us a bit...

@@ -10653,6 +10653,7 @@ int BlueStore::_do_remove(
txc->removed(o);
o->extent_map.clear();
o->onode = bluestore_onode_t();
txc->note_modified_object(o);

@ifed01

ifed01 Nov 15, 2017

Contributor

The txc->removed call above seems to be the only usage of the TransContext::removed() func. And the latter (in addition to the onodes list update) performs a removal from the 'modified_objects' list. Then we insert into it again with note_modified_object. Maybe simply modify the removed() function's behavior?
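
One way to picture the suggestion above is the sketch below, in which `removed()` itself keeps the object in the modified set instead of erasing it and relying on a second call to re-insert it. This is a simplified model, not BlueStore's code: `MiniTransContext` is hypothetical, and `std::shared_ptr` stands in for the onode reference type.

```cpp
#include <memory>
#include <set>

struct Onode {};
using OnodeRef = std::shared_ptr<Onode>;  // stand-in for the real ref type

// Hypothetical, simplified transaction context.
struct MiniTransContext {
  std::set<OnodeRef> onodes;            // onodes that still need writing out
  std::set<OnodeRef> modified_objects;  // onodes touched by this transaction

  void write_onode(const OnodeRef& o) { onodes.insert(o); }

  // Simplified removed(): stop tracking the onode for write-out, but record
  // it as modified so that this transaction keeps a reference to it.
  void removed(const OnodeRef& o) {
    onodes.erase(o);
    modified_objects.insert(o);
  }
};
```

The reference held in `modified_objects` also keeps the removed onode alive for the lifetime of the transaction context in this model.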

@@ -848,6 +848,11 @@ class PG : public DoutPrefixProvider {
eversion_t v;
C_UpdateLastRollbackInfoTrimmedToApplied(PG *pg, epoch_t e, eversion_t v)
: pg(pg), e(e), v(v) {}
bool sync_complete(int r) override {

@ifed01

ifed01 Nov 15, 2017

Contributor

Probably it's more correct/purer to override finish_sync here instead?

@@ -224,8 +224,15 @@ class PrimaryLogPG::C_OSD_AppliedRecoveredObject : public Context {
public:
C_OSD_AppliedRecoveredObject(PrimaryLogPG *p, ObjectContextRef o) :
pg(p), obc(o) {}
bool sync_complete(int r) override {

@ifed01

ifed01 Nov 15, 2017

Contributor

sync_finish() instead?

@@ -93,6 +93,11 @@ struct PrimaryLogPG::C_OSD_OnApplied : Context {
epoch_t epoch,
eversion_t v)
: pg(pg), epoch(epoch), v(v) {}
bool sync_complete(int r) override {

@ifed01

ifed01 Nov 15, 2017

Contributor

sync_finish?

@liewegas liewegas force-pushed the liewegas:wip-bluestore-sync-onreadable branch from ec2d2e1 to 28bcaa4 Nov 15, 2017

liewegas added some commits Oct 10, 2017

os/bluestore: fix _set_alloc_sizes dout
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore/BitmapFreelistManager: disable bluestore_debug_freelist
This doesn't work as implemented.  We are doing _txc_finalize_kv() from
queue_transactions, which calls into the freelist and does this verify
code.  However, we have no assurance that a previous txc in the sequencer
has applied its changes to the kv store, which means that a simple sequence
like

 - write object
 - delete object

can trigger if the write is waiting for aio.  This currently happens
with ObjectStore/StoreTest.SimpleRemount/2.

Comment out the verify, but leave _verify_range() helper in place in case
we can use it in the future in some other context.

Signed-off-by: Sage Weil <sage@redhat.com>
ceph_test_objectstore: flush() before collection_list()
The one exception to the "immediately readable" rule is collection_list, which
is not readable until the kv transaction is applied.  Our choices are

1. Wait until the kv txn is applied to trigger onreadable (for any create/remove
ops).  This wipes away much of the benefit of fully sync onreadable.

2. Add tracking for created/removed objects in BlueStore so that we can
incorporate those into collection_list.  This is complex.

3. flush() from collection_list.  Unfortunately we don't have osr linked
to Collection, so this doesn't quite work with the current ObjectStore
interface.

4. Require the caller flush() before list and put a big * next to the
"immediately onreadable" claim.  It turns out that because of FileStore,
the OSD already does flush() before collection_list anyway, so this does
not require any actual change... except to store_test tests.  (This didn't
affect filestore because store_test is using apply_transaction, which
waits for readable, and on filestore that also implies visible by
collection_list.)

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: keep ref to deleted onode in _do_remove
We need to make sure we carry this ref through until the object is
deleted or else another request might try to read it before the kv
txn is applied.  (This is easy to trigger now that onreadable is completed
at queue time instead of commit time.)

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimaryLogPG: sync complete from C_OSD_OpApplied
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: store shard, not shard_hint,in OpSequencer
Good suggestion from Igor!

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: simplify Onode tracking for removed object
Make the only caller of removed() not need to call note_modified_object
separately, dropping the unneeded erase() call.

Signed-off-by: Sage Weil <sage@redhat.com>
osd/PrimarLogPG: remove jewel compat for error code logging
Signed-off-by: Sage Weil <sage@redhat.com>

@liewegas liewegas force-pushed the liewegas:wip-bluestore-sync-onreadable branch from 197fee9 to f525d86 Dec 11, 2017

@gregsfortytwo
Member

left a comment

I looked at the first 3 patches here for a different PR, so copying over the comment I had.

Besides trying to enforce easy-to-use interfaces, I'd like to see some performance tests before this merges. Just based on some commit messages I think there may be significant second-order effects here, although based on an irc conversation they might not be as serious as I'd initially thought.

class Context {
Context(const Context& other);
const Context& operator=(const Context& other);

protected:
virtual void finish(int r) = 0;

// variant of finish that is safe to call "synchronously." override should
// return true.

@gregsfortytwo

gregsfortytwo Dec 11, 2017

Member

There ought to be a way to build this interface so that it doesn't require returning the correct value. (One terrible way: a private member we set in the default implementation indicating "not implemented" and check for when invoking it.)

Also, it needs more documentation in the source. What other constraints exist?
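
For illustration only, the "terrible way" mentioned above might look roughly like the sketch below: the synchronous hook returns nothing, and the base class's default implementation flips a private flag that the caller checks. This is a hypothetical variant, not what the PR merged, and the class names are made up.

```cpp
// Hypothetical variant of the interface in which overrides return nothing.
class Context {
  bool sync_not_implemented_ = false;

 protected:
  virtual void finish(int r) = 0;
  // Default: record that no synchronous variant exists. An override simply
  // does its work and never touches the flag.
  virtual void sync_finish(int) { sync_not_implemented_ = true; }

 public:
  virtual ~Context() = default;

  void complete(int r) {
    finish(r);
    delete this;
  }

  // Returns true (and frees the context) only if a subclass overrode
  // sync_finish(); otherwise the context is left alive for complete().
  bool sync_complete(int r) {
    sync_not_implemented_ = false;
    sync_finish(r);
    if (sync_not_implemented_) return false;
    delete this;
    return true;
  }
};

// Only has the asynchronous path.
struct AsyncOnly : Context {
  int* n;
  explicit AsyncOnly(int* n_) : n(n_) {}
 protected:
  void finish(int) override { *n += 1; }
};

// Has both paths; the synchronous one adds 10 so the two are distinguishable.
struct SyncCapable : Context {
  int* n;
  explicit SyncCapable(int* n_) : n(n_) {}
 protected:
  void finish(int) override { *n += 1; }
  void sync_finish(int) override { *n += 10; }
};
```

Note that a design like this cannot express a context whose synchronous support is decided at run time, which is the conditional case raised in the author's reply later in the thread.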

@liewegas liewegas referenced this pull request Dec 11, 2017

Merged

osd: put pg removal in op_wq #19433

5 of 5 tasks complete
osd/ECBackend: fix on_write ordering w/ sync onreadable callbacks
When we call handle_sub_write after a write completion, we may
do a sync read completion and then call back into check_ops().  Attaching
the on_write events to the op we're applying means that we don't ensure
that the on_write event(s) happen before the next write in the queue
is submitted (when we call back into check_ops()).

For example, if we have op A, on_write event W, then op B, a sync
applied completion would mean that we would queue the write for A, call
back into SubWriteApplied -> handle_sub_write_reply -> check_ops and then
process B... before getting to W.

Resolve this by attaching the on_write callback to a separate Op that is
placed into the queue, just like any other Op.  This keeps the ordering
logic clean, although it is a bit ugly with the polymorphism around Op
being either an Op or an on_write callback.

Signed-off-by: Sage Weil <sage@redhat.com>
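
The ordering fix described in the commit message above can be sketched with a toy queue in which an on_write callback is just another entry. This is a simplified model under stated assumptions, not ECBackend's code: `MiniPipeline` and `Op` are hypothetical, and the recursive call stands in for a synchronous applied completion calling back into `check_ops()`.

```cpp
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// A queue entry is either a real write (name set) or an on_write callback.
struct Op {
  std::string name;
  std::function<void()> on_write;
};

class MiniPipeline {
  std::deque<Op> waiting_;

 public:
  std::vector<std::string> log;

  void queue_write(std::string name) {
    waiting_.push_back(Op{std::move(name), nullptr});
  }
  void queue_on_write(std::function<void()> cb) {
    waiting_.push_back(Op{std::string(), std::move(cb)});
  }

  void check_ops() {
    while (!waiting_.empty()) {
      Op op = std::move(waiting_.front());
      waiting_.pop_front();
      if (op.on_write) {
        op.on_write();  // callback entry: run it in queue order
        continue;
      }
      log.push_back("write " + op.name);
      // Model a synchronous applied completion re-entering check_ops().
      check_ops();
    }
  }
};
```

Because W sits in the queue between A and B, the re-entrant call reaches W before it reaches B, so the log order in this model is A, W, B even though the completion for A fires synchronously.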

@liewegas liewegas force-pushed the liewegas:wip-bluestore-sync-onreadable branch from f525d86 to 1908c06 Dec 12, 2017

@liewegas

Member Author

commented Dec 13, 2017

Okay, ran a zillion more tests and grepped logs to verify the dicey ec change was being tested and it looks good.

@gregsfortytwo maybe there is some C++ syntax that allows an override without returning true, but I'm not sure that covers all the cases we care about. I implemented a C_Contexts finisher as part of this (not sure if it ended up in the final version, though; it may not have been needed) where the sync/non-sync behavior really is conditional. Happy to have someone come along and pretty it up later, but I don't want to get stuck on that now. There are fewer than half a dozen contexts to fix if we come up with something better later.

@markhpc

Member

commented Dec 14, 2017

FWIW, I'm running this through tests now that caused a segfault on radoslaw's async read PR. Should complete in a couple of hours.

@markhpc

Member

commented Dec 14, 2017

This did not fail on any of the NVMe (rbd, 4 node, 1 NVMe/node, moderate concurrency, 3x rep) performance tests I ran, but did not show any performance advantages versus a previous master run. There was a moderate (~10%) drop in write performance in several write oriented tests. I will need to compare against the commit this is based on to determine if it's specifically caused by this PR though.

@liewegas

Member Author

commented Dec 14, 2017

Thanks @markhpc, please let me know when you have the master comparison. We need to (1) confirm that there isn't a regression here and (2) figure out what broke master! (Also, this is blocking the pg removal, which is blocking the next thing. :)

@markhpc

Member

commented Dec 15, 2017

I'm closer to narrowing down the segfault to Radoslaw's PR. He thinks he has an issue with op ordering that he's working on fixing now. Next up is testing the commit this PR is based on.

@markhpc

Member

commented Dec 16, 2017

Ok, when comparing this PR vs da7d071, the variation I'm most concerned about disappears (historically I see more noise with large IO tests which is more or less what's left now). Both however remain slower vs an older master for small sequential writes. I'd go ahead and merge this and I'll have to bisect to figure out which PR in master introduced the small seq write regression. Thanks for waiting!

@jdurgin
Member

left a comment

turns out I read most of this earlier in the pg removal pr

@jdurgin jdurgin added needs-qa and removed needs-review labels Dec 19, 2017

@liewegas liewegas merged commit d35fed9 into ceph:master Dec 19, 2017

5 checks passed

Docs: build check: OK - docs built
Signed-off-by: all commits in this PR are signed
Unmodified Submodules: submodules for project are unmodified
make check: make check succeeded
make check (arm64): make check succeeded

@liewegas liewegas deleted the liewegas:wip-bluestore-sync-onreadable branch Dec 19, 2017

ivancich pushed a commit to ivancich/ceph-fork that referenced this pull request Jan 3, 2018

os/bluestore: repair 21089 on freelist init
Fix up the size inconsistency on freelist init.  This way it will always
happen after an upgrade... and before the user moves to something
post-luminous.

Signed-off-by: Sage Weil <sage@redhat.com>
(cherry picked from commit b064ed1)

ceph#18196

Resolves: rhbz#1504179