os/bluestore: synchronous on_applied completions #18196
Conversation
Force-pushed from ebf75ca to cc1bc3d
LGTM.
list<pair<Context*,int> > ls_rval;
// This way other threads can submit new contexts to complete
// while we are working.
vector<pair<Context*,int>> ls;
Do we need a pair here? It seems we're not handling anything other than a zero; if we are to handle non-zero values, then we can add it.
Other queue() method(s) can take a non-zero value of r.
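As a sketch of why the pair is worth carrying (all names here are illustrative stand-ins, not Ceph's actual Finisher or Context classes), storing a (Context*, r) pair lets a queue() caller hand a non-zero result code to each context:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Minimal stand-in for Ceph's Context; illustrative only.
struct Context {
  virtual ~Context() {}
  virtual void finish(int r) = 0;
  void complete(int r) { finish(r); delete this; }
};

// A context that records the result it was completed with.
struct Record : Context {
  int* out;
  explicit Record(int* o) : out(o) {}
  void finish(int r) override { *out = r; }
};

// Sketch of a finisher queue that keeps a per-context result code, so
// queue(c, r) with a non-zero r (e.g. an errno value) reaches finish(r).
struct MiniFinisher {
  std::vector<std::pair<Context*, int>> q;
  void queue(Context* c, int r = 0) { q.push_back(std::make_pair(c, r)); }
  void drain() {
    // Swap the queue out so other threads could keep submitting
    // new contexts while we complete this batch.
    std::vector<std::pair<Context*, int>> ls;
    ls.swap(q);
    for (auto p : ls)
      p.first->complete(p.second);
  }
};
```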
@@ -10272,7 +10284,6 @@ void PrimaryLogPG::_committed_pushed_object(

void PrimaryLogPG::_applied_recovered_object(ObjectContextRef obc)
{
  lock();
Maybe we need to assert for lock()?
}

void PrimaryLogPG::_applied_recovered_object_replica()
{
  lock();
Better to assert for lock().
Force-pushed from cc1bc3d to bac146b
Force-pushed from 054b19e to ec2d2e1
This is passing my tests now; ready for review!
will review this early tomorrow.
i need 1 or 2 more days to digest this change as i am not familiar with BlueStore.
c->complete(ls_rval.front().second);
ls_rval.pop_front();
}
for (auto p : ls) {
auto&
my assumption was that since the value is just 2 words (pointer + int) the reference isn't helpful? I guess it'll get optimized away either way...
cool, makes sense.
src/common/Finisher.h
Outdated
finisher_queue.push_back(NULL);
} else
finisher_queue.push_back(c);
finisher_queue.push_back(pair<Context*, int>(c, r));
nit: use make_pair(), w/o repeating the types of the std::pair elements.
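A tiny illustration of the nit (with int* standing in for Context*; names are illustrative, not the real Finisher members):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Illustrative only: make_pair() deduces the element types, so the call
// site doesn't have to repeat pair<Context*, int>.
std::vector<std::pair<int*, int>> finisher_queue;

void queue_one(int* c, int r) {
  // before: finisher_queue.push_back(std::pair<int*, int>(c, r));
  finisher_queue.push_back(std::make_pair(c, r));
  // C++11 alternative that constructs the pair in place:
  // finisher_queue.emplace_back(c, r);
}
```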
@@ -4049,6 +4049,7 @@ int PG::build_scrub_map_chunk(
// objects
vector<hobject_t> ls;
vector<ghobject_t> rollback_obs;
osr->flush();
shall we consider this a bug fix, and hence a candidate to be backported?
The change is only needed because of the synchronous onreadable behavior change in BlueStore. Until now, the OSD didn't expect changes to be there until onreadable is called, which happens later, when the creates/deletes are visible.
src/os/ObjectStore.h
Outdated
assert(out_on_applied);
assert(out_on_commit);
assert(out_on_applied_sync);
for (vector<Transaction>::iterator i = t.begin();
nit, range-based loop
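The suggested rewrite, sketched with a stand-in Transaction type (not the real ObjectStore::Transaction):

```cpp
#include <cassert>
#include <vector>

// Stand-in for ObjectStore::Transaction, just to show the loop shape.
struct Transaction { int ops; };

// before:
//   for (vector<Transaction>::iterator i = t.begin(); i != t.end(); ++i)
//     total += i->ops;
int count_ops(std::vector<Transaction>& t) {
  int total = 0;
  for (auto& i : t)  // range-based loop, as the review suggests
    total += i.ops;
  return total;
}
```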
src/osd/PrimaryLogPG.cc
Outdated
void finish(int r) override {
  pg->lock();
nit, i think we could use guardedly_lock() or with_unique_lock() here.
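The idea behind such a helper, sketched generically (guardedly_lock()/with_unique_lock() are the reviewer's names for Ceph helpers; this version is illustrative, not Ceph's actual API): take the lock, run the body, and release on every path, instead of pairing pg->lock()/pg->unlock() by hand.

```cpp
#include <cassert>
#include <mutex>

// Generic sketch of a scoped-lock helper: the lock_guard releases the
// mutex on every exit path, including exceptions thrown by fn().
template <typename Mutex, typename Fn>
void with_lock(Mutex& m, Fn&& fn) {
  std::lock_guard<Mutex> g(m);
  fn();
}
```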
src/os/ObjectStore.h
Outdated
@@ -1579,6 +1579,11 @@ class ObjectStore {
virtual bool wants_journal() = 0;  //< prefers a journal
virtual bool allows_journal() = 0; //< allows a journal

/// true if a txn is readable immediately after it is queued.
virtual bool is_sync_onreadable() {
could mark it const
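A capability query like this doesn't mutate the store, so it can be const and still overridden per backend. A minimal sketch (class names are stand-ins, not Ceph's):

```cpp
#include <cassert>

// Illustrative base class: the query is const, so it can be called
// through a const reference, and backends can still override it.
struct StoreBase {
  virtual ~StoreBase() {}
  virtual bool is_sync_onreadable() const { return false; }
};

// A backend (think BlueStore) that advertises sync onreadable.
struct SyncStore : StoreBase {
  bool is_sync_onreadable() const override { return true; }
};
```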
src/os/bluestore/BlueStore.h
Outdated
@@ -1655,6 +1655,8 @@ class BlueStore : public ObjectStore,
Sequencer *parent;
BlueStore *store;

spg_t shard_hint;
i think we can just remove the shard_hint in ObjectStore::Sequencer with this change, as nobody is actually reading it. fio_ceph_objectstore.cc sets it, but i am not sure if there is any consumer of this value.
The OSD is also using this to pass the shard_hint through (see PG ctor). So it goes PG -> Sequencer -> BlueStore::OpSequencer
http://pulpito.ceph.com/sage-2017-10-27_20:23:18-rados-wip-sage-testing-2017-10-27-1303-distro-basic-smithi/ has several failures to diagnose!
src/osd/PrimaryLogPG.cc
Outdated
@@ -141,6 +141,10 @@ class PrimaryLogPG::BlessedGenContext : public GenContext<T> {
  c.release()->complete(t);
  pg->unlock();
}
bool sync_finish(T t) {
  c.release()->complete(t);
Shouldn't it call (and return the result of) sync_complete() instead? The same below...
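A sketch of the delegation being asked for, with simplified stand-ins (not the actual GenContext/BlessedGenContext classes): the wrapper's sync_finish() forwards to the wrapped context's sync path and propagates whether it actually completed, rather than unconditionally calling complete():

```cpp
#include <cassert>

// Simplified context base: sync_complete() returns whether the work
// was done synchronously; false means fall back to the async path.
struct Ctx {
  virtual ~Ctx() {}
  virtual void finish(int r) {}
  virtual bool sync_finish(int r) { return false; }
  void complete(int r) { finish(r); delete this; }
  bool sync_complete(int r) {
    if (sync_finish(r)) { delete this; return true; }
    return false;
  }
};

// An inner context that does support synchronous completion.
struct SyncInner : Ctx {
  int* out;
  explicit SyncInner(int* o) : out(o) {}
  bool sync_finish(int r) override { *out = r; return true; }
};

// Wrapper: forward to the inner sync path instead of calling complete().
struct Blessed : Ctx {
  Ctx* c;
  explicit Blessed(Ctx* c) : c(c) {}
  void finish(int r) override { Ctx* t = c; c = nullptr; t->complete(r); }
  bool sync_finish(int r) override {
    if (c->sync_complete(r)) { c = nullptr; return true; }
    return false;  // inner needs the async path; caller falls back
  }
};
```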
src/os/bluestore/BlueStore.cc
Outdated
@@ -8921,6 +8921,7 @@ int BlueStore::queue_transactions(
} else {
  osr = new OpSequencer(cct, this);
  osr->parent = posr;
  osr->shard_hint = posr->shard_hint;
we can probably call shard_hint.hash_to_shard() and preserve its result in the osr instance instead. This saves us a bit...
src/os/bluestore/BlueStore.cc
Outdated
@@ -10653,6 +10653,7 @@ int BlueStore::_do_remove(
txc->removed(o);
o->extent_map.clear();
o->onode = bluestore_onode_t();
txc->note_modified_object(o);
The txc->removed() call above seems to be the only usage of the TransContext::removed() func. And the latter (in addition to the onodes list update) performs a removal from the 'modified_objects' list. Then we insert into it again with note_modified_object(). Maybe simply modify the removed() function behavior?
src/osd/PG.h
Outdated
@@ -848,6 +848,11 @@ class PG : public DoutPrefixProvider {
eversion_t v;
C_UpdateLastRollbackInfoTrimmedToApplied(PG *pg, epoch_t e, eversion_t v)
  : pg(pg), e(e), v(v) {}
bool sync_complete(int r) override {
Probably it's more correct/purer to override sync_finish() here instead?
src/osd/PrimaryLogPG.cc
Outdated
@@ -224,8 +224,15 @@ class PrimaryLogPG::C_OSD_AppliedRecoveredObject : public Context {
public:
C_OSD_AppliedRecoveredObject(PrimaryLogPG *p, ObjectContextRef o) :
  pg(p), obc(o) {}
bool sync_complete(int r) override {
sync_finish() instead?
src/osd/PrimaryLogPG.cc
Outdated
@@ -93,6 +93,11 @@ struct PrimaryLogPG::C_OSD_OnApplied : Context {
  epoch_t epoch,
  eversion_t v)
  : pg(pg), epoch(epoch), v(v) {}
bool sync_complete(int r) override {
sync_finish?
Force-pushed from ec2d2e1 to 28bcaa4
The parent may go away, so we need to keep our own copy of shard_hint in OpSequencer to avoid a use-after-free (e.g., when the user drops their osr and calls OpSequencer::discard()). Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
This doesn't work as implemented. We are doing _txc_finalize_kv() from queue_transactions, which calls into the freelist and does this verify code. However, we have no assurance that a previous txc in the sequencer has applied its changes to the kv store, which means that a simple sequence like

- write object
- delete object

can trigger it if the write is waiting for aio. This currently happens with ObjectStore/StoreTest.SimpleRemount/2. Comment out the verify, but leave the _verify_range() helper in place in case we can use it in the future in some other context. Signed-off-by: Sage Weil <sage@redhat.com>
The one exception to "immediately readable" is collection_list, which is not readable until the kv transaction is applied. Our choices are:

1. Wait until the kv txn applies to trigger onreadable (for any create/remove ops). This wipes away much of the benefit of fully sync onreadable.
2. Add tracking for created/removed objects in BlueStore so that we can incorporate those into collection_list. This is complex.
3. flush() from collection_list. Unfortunately we don't have osr linked to Collection, so this doesn't quite work with the current ObjectStore interface.
4. Require the caller to flush() before list and put a big * next to the "immediately onreadable" claim.

It turns out that because of FileStore, the OSD already does flush() before collection_list anyway, so this does not require any actual change... except to the store_test tests. (This didn't affect filestore because store_test is using apply_transaction, which waits for readable, and on filestore that also implies visible by collection_list.) Signed-off-by: Sage Weil <sage@redhat.com>
We need to make sure we carry this ref through until the object is deleted, or else another request might try to read it before the kv txn is applied. (This is easy to trigger now that onreadable is completed at queue time instead of commit time.) Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Good suggestion from Igor! Signed-off-by: Sage Weil <sage@redhat.com>
Make the only caller of removed() not need to call note_modified_object separately, dropping the unneeded erase() call. Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from 197fee9 to f525d86
I looked at the first 3 patches here for a different PR, so copying over the comment I had.
Besides trying to enforce easy-to-use interfaces, I'd like to see some performance tests before this merges. Just based on some commit messages I think there may be significant second-order effects here, although based on an irc conversation they might not be as serious as I'd initially thought.
class Context {
  Context(const Context& other);
  const Context& operator=(const Context& other);

protected:
  virtual void finish(int r) = 0;

  // variant of finish that is safe to call "synchronously."  override should
  // return true.
There ought to be a way to build this interface so that it doesn't require returning the correct value. (One terrible way: a private member we set in the default implementation indicating "not implemented" and check for when invoking it.)
Also, it needs more documentation in the source. What other constraints exist?
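One possible shape for the "terrible way" mentioned above, sketched with hypothetical names (none of these are the actual Ceph classes): the default sync override sets a private "not implemented" flag, and a non-virtual wrapper checks the flag instead of trusting each override to return the right value.

```cpp
#include <cassert>

class Ctx {
  bool sync_unimplemented = false;
protected:
  virtual void finish(int r) = 0;
  virtual void sync_finish(int r) {
    sync_unimplemented = true;  // default: no synchronous support
  }
public:
  virtual ~Ctx() {}
  void complete(int r) { finish(r); delete this; }
  // Returns true if the context completed synchronously. Overrides can
  // no longer claim sync support by accident, only by implementing it.
  bool try_sync_complete(int r) {
    sync_unimplemented = false;
    sync_finish(r);
    if (sync_unimplemented)
      return false;  // caller falls back to the async path
    delete this;
    return true;
  }
};

// A context that is safe to complete inline.
class SyncCtx : public Ctx {
  int* out;
protected:
  void finish(int r) override { *out = r; }
  void sync_finish(int r) override { *out = r; }  // safe to run inline
public:
  explicit SyncCtx(int* o) : out(o) {}
};

// A context that never implemented the sync variant.
class AsyncOnlyCtx : public Ctx {
protected:
  void finish(int r) override {}
};
```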
When we call handle_sub_write after a write completion, we may do a sync read completion and then call back into check_ops(). Attaching the on_write events to the op we're applying means that we don't ensure that the on_write event(s) happen before the next write in the queue is submitted (when we call back into check_ops()). For example, if we have op A, on_write event W, then op B, a sync applied completion would mean that we would queue the write for A, call back into SubWriteApplied -> handle_sub_write_reply -> check_ops and then process B... before getting to W. Resolve this by attaching the on_write callback to a separate Op that is placed into the queue, just like any other Op. This keeps the ordering logic clean, although it is a bit ugly with the polymorphism around Op being either an Op or an on_write callback. Signed-off-by: Sage Weil <sage@redhat.com>
Force-pushed from f525d86 to 1908c06
Okay, ran a zillion more tests and grepped logs to verify the dicey EC change was being tested, and it looks good. @gregsfortytwo maybe there is some C++ syntax that allows an override without returning true, but I'm not sure that covers all the cases we care about. I implemented a C_Contexts finisher as part of this (not sure if it ended up in the final version, though; may not have needed it?) where the sync/non-sync behavior really is conditional. Happy to have someone come along and pretty it up later, but I don't want to get stuck on that now. There are fewer than a half dozen contexts to fix if we come up with something better later.
FWIW, I'm running this through tests now that caused a segfault on radoslaw's async read PR. Should complete in a couple of hours.
This did not fail on any of the NVMe (rbd, 4 node, 1 NVMe/node, moderate concurrency, 3x rep) performance tests I ran, but did not show any performance advantages versus a previous master run. There was a moderate (~10%) drop in write performance in several write oriented tests. I will need to compare against the commit this is based on to determine if it's specifically caused by this PR though.
Thanks @markhpc --- please let me know when you have the master comparison. Need to confirm that (1) there isn't a regression here and (2) figure out what broke master! (Also, this is blocking the pg removal, which is blocking the next thing. :)
I'm closer to narrowing down the segfault to Radoslaw's PR. He thinks he has an issue with op ordering that he's working on fixing now. Next up is testing the commit this PR is based on.
Ok, when comparing this PR vs da7d071, the variation I'm most concerned about disappears (historically I see more noise with large IO tests which is more or less what's left now). Both however remain slower vs an older master for small sequential writes. I'd go ahead and merge this and I'll have to bisect to figure out which PR in master introduced the small seq write regression. Thanks for waiting!
turns out I read most of this earlier in the pg removal pr
Fix up the size inconsistency on freelist init. This way it will always happen after an upgrade... and before the user moves to something post-luminous. Signed-off-by: Sage Weil <sage@redhat.com> (cherry picked from commit b064ed1) ceph#18196 Resolves: rhbz#1504179
This takes some pieces from wip-bluestore-rtc but applies them just to the onreadable portion. The short version is that any transaction is readable with bluestore at the end of queue_transactions, so we can trigger all of those completions synchronously. Those Contexts implement sync_complete(), which doesn't retake pg->lock (we already hold it).
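A rough sketch of that shape, with all names simplified stand-ins for PG, pg->lock, and the on_applied contexts: the sync path runs inline while pg->lock is already held, so unlike the finisher-thread path it must not take the lock again.

```cpp
#include <cassert>
#include <mutex>

// Stand-in for a PG with its lock.
struct MiniPG {
  std::mutex lock;
  int applied = 0;
};

struct OnApplied {
  MiniPG* pg;
  // Async path: invoked later from a finisher thread, takes the lock.
  void finish() {
    std::lock_guard<std::mutex> g(pg->lock);
    ++pg->applied;
  }
  // Sync path: invoked from queue_transactions() with pg->lock already
  // held; taking a non-recursive mutex again here would deadlock.
  void sync_finish() {
    ++pg->applied;
  }
};
```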