
os/bluestore: fix deferred writes; improve flush #13888

Merged

merged 44 commits into ceph:master from wip-bluestore-dw on Mar 21, 2017

Conversation

liewegas (Member) commented Mar 8, 2017:

No description provided.

liewegas force-pushed the wip-bluestore-dw branch 5 times, most recently from 309f0fa to 2261ed2, on March 10, 2017 04:09
liewegas changed the title from "os/bluestore: fix deferred writes; improve flush" to "WIP os/bluestore: fix deferred writes; improve flush" on Mar 10, 2017
liewegas force-pushed the wip-bluestore-dw branch 3 times, most recently from e4ff4df to 4890a69, on March 13, 2017 11:58
liewegas changed the title from "WIP os/bluestore: fix deferred writes; improve flush" to "os/bluestore: fix deferred writes; improve flush" on Mar 13, 2017
liewegas requested a review from ifed01 on March 13, 2017 13:54
liewegas (Member, Author) commented Mar 13, 2017:

ready for review!

liewegas force-pushed the wip-bluestore-dw branch 2 times, most recently from bbbc56e to e60ae8c, on March 14, 2017 16:12

```diff
   uint64_t max_alloc_size = 0; ///< maximum allocation unit (power of 2)

-  bool sync_wal_apply;      ///< see config option bluestore_sync_wal_apply
+  bool sync_deferred_apply; ///< see config option bluestore_sync_deferred_apply
```
Contributor:

assign a value?
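
A minimal sketch of the suggested fix, giving the member an in-class default (the value shown is an assumption; the real value would come from the bluestore_sync_deferred_apply config option at mount time):

```cpp
// defensive in-class initializer so the flag is never read uninitialized;
// config parsing would still overwrite it with the conf value
bool sync_deferred_apply = true; ///< see config option bluestore_sync_deferred_apply
```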

```diff
@@ -34,7 +34,8 @@ class KernelDevice : public BlockDevice {
   interval_set<uint64_t> debug_inflight;

   Mutex flush_lock;
```
Contributor:

Wondering if this is used anywhere?

Member (Author):

still used to protect flush_txns for Onode::flush() etc

Contributor:

There are three flush_locks around BlueStore: one in Onode, another in KernelDevice, and a third in SharedDriverData in NVMEDevice.cc. What you're saying is valid for the Onode one, while I was referring to the KernelDevice member. IMO it's not used...

Member (Author):

yep, removed! good catch

```diff
@@ -7237,6 +7237,10 @@ void BlueStore::_txc_finish_io(TransContext *txc)
     _txc_state_proc(&*p++);
   } while (p != osr->q.end() &&
            p->state == TransContext::STATE_IO_DONE);

+  if (osr->kv_submitted_waiters) {
```
Contributor:

we can probably set the flag in the loop above when p->state >= STATE_KV_SUBMITTED and notify depending on this flag's value, just to avoid excessive wakeups.
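
A rough sketch of that suggestion against the loop above (assuming the osr exposes a qcond condition variable that kv_submitted_waiters wait on):

```cpp
// remember whether any txc in this batch reached STATE_KV_SUBMITTED,
// and wake waiters only in that case instead of on every _txc_finish_io
bool saw_kv_submitted = false;
do {
  if (p->state >= TransContext::STATE_KV_SUBMITTED)
    saw_kv_submitted = true;
  _txc_state_proc(&*p++);
} while (p != osr->q.end() &&
         p->state == TransContext::STATE_IO_DONE);

if (saw_kv_submitted && osr->kv_submitted_waiters)
  osr->qcond.notify_all();
```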

Member (Author):

fixed (moved wakeup to _txc_state_proc when state is set)

```diff
@@ -1717,6 +1720,9 @@ class BlueStore : public ObjectStore,

   vector<Cache*> cache_shards;

+  std::mutex osr_lock;            ///< protect osr_set
+  std::set<OpSequencer*> osr_set; ///< set of all OpSequencers
```
Contributor:

Replace with an intrusive list (or set, if we need lookup support) for cheaper insert/removal? Will need more stuff in osr_drain_all to make a clone, though...
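
For illustration, a minimal boost::intrusive sketch of that idea (hook placement and names are assumptions, not the PR's code):

```cpp
#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

// embedding the hook in OpSequencer makes insert/erase O(1) and
// allocation-free, at the cost of losing std::set-style ordered lookup
struct OpSequencer : public bi::list_base_hook<> {
  // ... existing members ...
};

bi::list<OpSequencer> osr_list;  // would stand in for std::set<OpSequencer*>
```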


```diff
   void discard() override {
     std::lock_guard<std::mutex> l(register_lock);
     if (registered) {
```
Contributor:

replace with an atomic compare-and-exchange and get rid of the mutex?

liewegas (Member, Author), Mar 15, 2017:

it protects both registered and membership in the osr_set as a unit

ifed01 (Contributor), Mar 16, 2017:

Still disagree. There is a lock within (un)register_osr, and hence osr_set is protected by those calls. The OpSequencer ctor doesn't need any protection and can simply set registered to true and call register_osr; there is no way to call the ctor concurrently. The only questionable location is discard. If it's called from different threads simultaneously, the luckiest one atomically clears the 'registered' var via an atomic compare-and-exchange and proceeds to the unregister call. The loser just compares and does nothing. There is no need for an additional mutex here if registered is a std::atomic instance.
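
A compact sketch of ifed01's proposal (the field is assumed to become a std::atomic<bool>; the unregister entry point name is hypothetical):

```cpp
#include <atomic>

std::atomic<bool> registered{true};  // set once in the ctor, no lock needed

void discard() override {
  bool expected = true;
  // only one concurrent caller wins the exchange; the losers see
  // expected == false and return without double-unregistering
  if (registered.compare_exchange_strong(expected, false)) {
    store->_unregister_osr(this);  // hypothetical unregister call
  }
}
```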

```diff
   OpSequencer(CephContext* cct, BlueStore *store)
     //set the qlock to PTHREAD_MUTEX_RECURSIVE mode
     : Sequencer_impl(cct),
       parent(NULL), store(store) {
+    std::lock_guard<std::mutex> l(register_lock);
```
Contributor:

mutex acquisition seems excessive in ctor?

```diff
@@ -7786,7 +7798,8 @@ int BlueStore::_deferred_replay()
     _txc_state_proc(txc);
   }
   dout(20) << __func__ << " draining osr" << dendl;
-  osr->drain();
+  _osr_drain_all();
```
Contributor:

Is drain_all really intended here?

```diff
-      contents[new_obj].data.clear();
-      contents[new_obj].data.append(contents[old_obj].data.c_str(),
-                                    contents[old_obj].data.length());
+      contents[new_obj].data = contents[old_obj].data;
```
Contributor:

Do you think it might fix http://tracker.ceph.com/issues/19247?

Member (Author):

ah! yeah, probably!

Contributor:

Unfortunately it doesn't. Just managed to reproduce the issue with this commit cherry-picked...

Member (Author):

perhaps try with this whole PR? there were several hard-to-hit races that were fixed

```diff
@@ -8916,7 +8919,9 @@ void BlueStore::_wctx_finish(
       // longer allocated. Note that this will leave behind edge bits
       // that are no longer referenced but not deallocated (until they
       // age out of the cache naturally).
-      b->discard_unallocated(c.get());
+      if (!blob.is_shared()) {
```
Contributor:

doing the check inside discard_unallocated seems better...

Member (Author):

fixed

liewegas (Member, Author) commented Mar 16, 2017 via email.

voidbag (Contributor) left a review:

I have a few comments about consistency...

```diff
@@ -7170,39 +7110,26 @@ void BlueStore::_txc_state_proc(TransContext *txc)
   case TransContext::STATE_KV_SUBMITTED:
     txc->log_state_latency(logger, l_bluestore_state_kv_committing_lat);
     txc->state = TransContext::STATE_KV_DONE;
-    _txc_finish_kv(txc);
+    _txc_committed_kv(txc);
```
Contributor:

For safety, you must call _txc_applied_kv after db->submit_transaction_sync, not after submit_transaction.

Member (Author):

the flush() call currently just enforces correct ordering, e.g.,

  • omap_set a=b
  • omap_clear (needs to see the 'a' key in order to remove it)

it doesn't have anything to do with commit or durability; that's what the callbacks are for (so the caller can reason about that).
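
For illustration, a minimal sketch of that ordering dependency using the ObjectStore transaction API (cid and oid are placeholders, not from the PR):

```cpp
ObjectStore::Transaction t1, t2;
ceph::bufferlist v;
v.append("b");
t1.omap_setkeys(cid, oid, {{"a", v}});  // omap_set a=b
t2.omap_clear(cid, oid);  // must observe the 'a' key in order to remove it
// flush() only guarantees t2 is applied after t1; durability is signaled
// separately via the commit callbacks
```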

Contributor:

Okay, I forgot the onreadable callback's role.
I was thinking of the old code, where o->flush() blocked read() because the onreadable callback happened before data was applied to storage. Now, with the temporary buffer cache, o->flush() isn't responsible for durability anymore.

Thank you for the detailed description.

```diff
@@ -5778,7 +5719,6 @@ int BlueStore::_do_read(
   }

   utime_t start = ceph_clock_now();
-  o->flush();
```
Contributor:

o->flush cannot be removed. That's because a worker can read the buffer of an uncommitted transaction...
A read request should be serviced only after all flush_txns of the onode have executed db->submit_transaction_sync.

Member (Author):

That's okay. The OSD layer above this doesn't read data until it's gotten the onreadable callback.

liewegas (Member, Author) commented Mar 16, 2017 via email.

dmick (Member) commented Mar 17, 2017:

My fault the submodule test failed; ignore it.

Commits

Signed-off-by: Sage Weil <sage@redhat.com>
Make osr_set hold refcounts so that it can tolerate a Sequencer destruction
racing with flush, or a Sequencer that outlives the BlueStore instance
itself.

Signed-off-by: Sage Weil <sage@redhat.com>
First, eliminate the work queue--it's useless.  We are dispatching aio and
should not block.  And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).

Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing.  For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.

Signed-off-by: Sage Weil <sage@redhat.com>
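
As an illustration of the "latest write wins" collapsing, a toy sketch (the names and the fixed-chunk assumption are mine, not the PR's):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// toy model: deferred writes keyed by disk offset; assumes aligned,
// equal-sized chunks so a later write fully replaces an earlier one
// (the real code splits and merges intervals)
struct PendingWrite {
  uint64_t offset;
  std::vector<char> data;
};

void queue_deferred(std::map<uint64_t, PendingWrite>& batch, PendingWrite w) {
  // operator[] plus move: the newest write at an offset overwrites the old
  batch[w.offset] = std::move(w);
}
```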
This is less fragile, especially with 2 constructors.

Signed-off-by: Sage Weil <sage@redhat.com>
In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.

We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().

Send everything through _osr_drain_all() for simplicity.

This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).

Signed-off-by: Sage Weil <sage@redhat.com>
If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.

Signed-off-by: Sage Weil <sage@redhat.com>
Allow several deferred writes to accumulate before we submit them.  In
general we have no time pressure, and on HDD (and perhaps sometimes SSD)
it is beneficial to accumulate and batch these so that they result in
fewer seeks.  On HDD, this is particularly true of seeks away from the
journal.  And on sequential workloads this can avoid seeks.  It may even
allow the block layer or SSD firmware to merge IOs and perform fewer
writes.

Signed-off-by: Sage Weil <sage@redhat.com>
When the Sequencer goes away it gets deregistered.  If there are still
deferred IOs in flight, we need to wait for those too.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…eferred

If we have no non-deferred IO to flush, and we are running bluefs on a
single shared device, then we can rely on the bluefs flush to make our
current batch of deferred ios stable.

Separate deferred into a "done" and a "stable" list.  If we do a sync
commit, put everything from "done" onto "stable".  Otherwise, move "done"
to "stable" after we do our kv commit via bluefs.

Signed-off-by: Sage Weil <sage@redhat.com>
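
A minimal sketch of the done-to-stable promotion (container choice and names are assumptions; splice keeps the move O(1)):

```cpp
#include <list>

struct DeferredBatch;  // opaque here; stands in for a batch of deferred IOs

std::list<DeferredBatch*> deferred_done;    // IO complete, not yet stable
std::list<DeferredBatch*> deferred_stable;  // safe to clean up at next kv commit

// after a sync commit (or once the bluefs-backed kv commit lands),
// everything "done" becomes "stable" in one O(1) splice
void promote_done_to_stable() {
  deferred_stable.splice(deferred_stable.end(), deferred_done);
}
```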
Signed-off-by: Sage Weil <sage@redhat.com>
We were modifying bufferlists in place, and kludging around it by making
full copies elsewhere.  Instead, never modify a buffer.

This fixes issues where the buffer we submit to ObjectStore ends up in
the cache and we modify in place later, corrupting the implementation's
copy.  (This was affecting BlueStore.)

Rearrange the data methods to be next to each other and clean them up a
bit too.

Signed-off-by: Sage Weil <sage@redhat.com>
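
For context, ceph::bufferlist assignment shares the underlying buffers rather than deep-copying them, which is exactly why in-place modification was dangerous; a tiny sketch (builds inside the ceph tree):

```cpp
#include "include/buffer.h"  // ceph::bufferlist

void share_example() {
  ceph::bufferlist a, b;
  a.append("payload");
  b = a;  // cheap: b now references the same underlying raw buffers as a
  // rewriting those shared bytes in place (e.g. via copy_in) would be
  // visible through both lists; hence the rule adopted by this commit:
  // never modify a buffer once it has been handed to ObjectStore
}
```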
Kick off deferred IOs if we pass the throttle midpoint or if we would
block during submission.

Signed-off-by: Sage Weil <sage@redhat.com>
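
A sketch of the midpoint trigger (names are illustrative; the real check sits behind BlueStore's deferred-bytes throttle):

```cpp
#include <cstdint>

// returns true when pending deferred bytes pass half the throttle budget,
// or when a submitter would otherwise have to block
bool should_kick_deferred(uint64_t pending_bytes, uint64_t max_bytes,
                          bool would_block) {
  return would_block || pending_bytes * 2 >= max_bytes;
}
```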
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
These can survive as long as the txc, which can be longer than the
Collection.  Make sure we have a valid ref as both finish_write and
~SharedBlob use coll for the SharedBlobSet (and coll->store->cct for
debug).

Signed-off-by: Sage Weil <sage@redhat.com>
Otherwise cache items survive beyond umount into the next mount cycle!

Also, ensure that we flush_cache *before* clearing coll_map, as some cache
items have references back to the Collection.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We can't use a bare Collection since we get/put refs, the last put will
delete it, and the dtor asserts nref == 0 (no faking a ref and deliberately
leaking!).

Signed-off-by: Sage Weil <sage@redhat.com>
Better test coverage!

Signed-off-by: Sage Weil <sage@redhat.com>
Clearer, and fewer wakeups.

Signed-off-by: Sage Weil <sage@redhat.com>
We've been avoiding doing this for a while and it has finally caught up
with us: the SharedBlob may outlive the split due to deferred IO, and
a read on the child collection may load a competing Blob and SharedBlob
and read from the on-disk blocks that haven't been written yet.

Fix by preserving the one-SharedBlob-instance invariant by moving cache
items to the new Collection and cache shard like we should have from the
beginning.

Signed-off-by: Sage Weil <sage@redhat.com>
Add assertions if we fail to flush everything.

Signed-off-by: Sage Weil <sage@redhat.com>
It's possible for the Sequencer to go away while the OpSequencer still has
txcs in flight.  We were handling the case where the osr was on the
deferred_queue, but it may be off the deferred_queue but waiting for the
commit to happen, and we still need to wait for that.

Fix this by introducing a 'zombie' state for the osr, in which we keep the
osr in the osr_set.

Clean up the OpSequencer methods and a few other method names.

Signed-off-by: Sage Weil <sage@redhat.com>
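
A rough sketch of the zombie idea (field and method shapes are assumptions):

```cpp
#include <atomic>

struct OpSequencer {
  std::atomic<bool> zombie{false};  // user-facing Sequencer is gone

  // called when the Sequencer is destroyed: rather than unregistering
  // immediately, mark ourselves a zombie and stay in osr_set so that
  // _osr_drain_all() still waits for our in-flight txcs; the entry is
  // removed only once everything has drained
  void discard() {
    zombie = true;
  }
};
```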
liewegas merged commit 66b42be into ceph:master on Mar 21, 2017
liewegas deleted the wip-bluestore-dw branch on March 21, 2017 20:06