
os/bluestore: fix deferred writes; improve flush #13888

Merged

merged 44 commits into ceph:master from wip-bluestore-dw on Mar 21, 2017

Conversation

liewegas (Member) commented Mar 8, 2017:

No description provided.

liewegas force-pushed the wip-bluestore-dw branch 5 times, most recently from 309f0fa to 2261ed2, on March 10, 2017 04:09
liewegas changed the title from "os/bluestore: fix deferred writes; improve flush" to "WIP os/bluestore: fix deferred writes; improve flush" on Mar 10, 2017
liewegas force-pushed the wip-bluestore-dw branch 3 times, most recently from e4ff4df to 4890a69, on March 13, 2017 11:58
liewegas changed the title from "WIP os/bluestore: fix deferred writes; improve flush" to "os/bluestore: fix deferred writes; improve flush" on Mar 13, 2017
liewegas requested a review from ifed01 on March 13, 2017 13:54
liewegas (Member, Author) commented Mar 13, 2017:

ready for review!

liewegas force-pushed the wip-bluestore-dw branch 2 times, most recently from bbbc56e to e60ae8c, on March 14, 2017 16:12

```diff
   uint64_t max_alloc_size = 0; ///< maximum allocation unit (power of 2)

-  bool sync_wal_apply;      ///< see config option bluestore_sync_wal_apply
+  bool sync_deferred_apply; ///< see config option bluestore_sync_deferred_apply
```
Contributor:

assign a value?
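
A minimal sketch of the suggested fix, giving the member an in-class default (the value shown is an assumption; the real value would come from the bluestore_sync_deferred_apply config option at mount time):

```cpp
// defensive in-class initializer so the flag is never read uninitialized;
// config parsing would still overwrite it with the conf value
bool sync_deferred_apply = true; ///< see config option bluestore_sync_deferred_apply
```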

```diff
@@ -34,7 +34,8 @@ class KernelDevice : public BlockDevice {
   interval_set<uint64_t> debug_inflight;

   Mutex flush_lock;
```
Contributor:

Wondering if this is used anywhere?

Member (Author):

still used to protect flush_txns for Onode::flush() etc

Contributor:

There are three flush_locks around BlueStore: one in Onode, another in KernelDevice, and a third in SharedDriverData in NVMEDevice.cc. What you're saying is valid for the Onode one, while I was referring to the KernelDevice member. IMO it's not used...

Member (Author):

yep, removed! good catch

```diff
@@ -7237,6 +7237,10 @@ void BlueStore::_txc_finish_io(TransContext *txc)
     _txc_state_proc(&*p++);
   } while (p != osr->q.end() &&
            p->state == TransContext::STATE_IO_DONE);

+  if (osr->kv_submitted_waiters) {
```
Contributor:

we can probably set the flag in the loop above when p->state >= STATE_KV_SUBMITTED and notify depending on this flag's value, just to avoid excessive wakeups.
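
A rough sketch of that suggestion against the loop above (assuming the osr exposes a qcond condition variable that kv_submitted_waiters wait on):

```cpp
// remember whether any txc in this batch reached STATE_KV_SUBMITTED,
// and wake waiters only in that case instead of on every _txc_finish_io
bool saw_kv_submitted = false;
do {
  if (p->state >= TransContext::STATE_KV_SUBMITTED)
    saw_kv_submitted = true;
  _txc_state_proc(&*p++);
} while (p != osr->q.end() &&
         p->state == TransContext::STATE_IO_DONE);

if (saw_kv_submitted && osr->kv_submitted_waiters)
  osr->qcond.notify_all();
```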

Member (Author):

fixed (moved wakeup to _txc_state_proc when state is set)

```diff
@@ -1717,6 +1720,9 @@ class BlueStore : public ObjectStore,

   vector<Cache*> cache_shards;

+  std::mutex osr_lock;            ///< protect osr_set
+  std::set<OpSequencer*> osr_set; ///< set of all OpSequencers
```
Contributor:

Replace with an intrusive list (or set, if we need lookup support) for cheaper insert/removal? Will need more stuff in osr_drain_all to make a clone, though...
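
For illustration, a minimal boost::intrusive sketch of that idea (hook placement and names are assumptions, not the PR's code):

```cpp
#include <boost/intrusive/list.hpp>

namespace bi = boost::intrusive;

// embedding the hook in OpSequencer makes insert/erase O(1) and
// allocation-free, at the cost of losing std::set-style ordered lookup
struct OpSequencer : public bi::list_base_hook<> {
  // ... existing members ...
};

bi::list<OpSequencer> osr_list;  // would stand in for std::set<OpSequencer*>
```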


```diff
   void discard() override {
     std::lock_guard<std::mutex> l(register_lock);
     if (registered) {
```
Contributor:

replace with an atomic compare-and-exchange and get rid of the mutex?

liewegas (Member, Author), Mar 15, 2017:

it protects both registered and membership in the osr_set as a unit

ifed01 (Contributor), Mar 16, 2017:

Still disagree. There is a lock within (un)register_osr, and hence osr_set is protected by those calls. The OpSequencer ctor doesn't need any protection and can simply set registered to true and call register_osr; there is no way to call the ctor concurrently. The only questionable location is discard. If it's called from different threads simultaneously, the luckiest one atomically clears the 'registered' var via an atomic compare-and-exchange and proceeds to the unregister call. The loser just compares and does nothing. There is no need for an additional mutex here if registered is a std::atomic instance.
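
A compact sketch of ifed01's proposal (the field is assumed to become a std::atomic<bool>; the unregister entry point name is hypothetical):

```cpp
#include <atomic>

std::atomic<bool> registered{true};  // set once in the ctor, no lock needed

void discard() override {
  bool expected = true;
  // only one concurrent caller wins the exchange; the losers see
  // expected == false and return without double-unregistering
  if (registered.compare_exchange_strong(expected, false)) {
    store->_unregister_osr(this);  // hypothetical unregister call
  }
}
```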

```diff
   OpSequencer(CephContext* cct, BlueStore *store)
     //set the qlock to PTHREAD_MUTEX_RECURSIVE mode
     : Sequencer_impl(cct),
       parent(NULL), store(store) {
+    std::lock_guard<std::mutex> l(register_lock);
```
Contributor:

mutex acquisition seems excessive in ctor?

```diff
@@ -7786,7 +7798,8 @@ int BlueStore::_deferred_replay()
     _txc_state_proc(txc);
   }
   dout(20) << __func__ << " draining osr" << dendl;
-  osr->drain();
+  _osr_drain_all();
```
Contributor:

Is drain_all really intended here?

```diff
-      contents[new_obj].data.clear();
-      contents[new_obj].data.append(contents[old_obj].data.c_str(),
-                                    contents[old_obj].data.length());
+      contents[new_obj].data = contents[old_obj].data;
```
Contributor:

Do you think it might fix http://tracker.ceph.com/issues/19247?

Member (Author):

ah! yeah, probably!

Contributor:

Unfortunately it doesn't. Just managed to reproduce the issue with this commit cherry-picked...

Member (Author):

perhaps try with this whole PR? there were several hard-to-hit races that were fixed

```diff
@@ -8916,7 +8919,9 @@ void BlueStore::_wctx_finish(
       // longer allocated. Note that this will leave behind edge bits
       // that are no longer referenced but not deallocated (until they
       // age out of the cache naturally).
-      b->discard_unallocated(c.get());
+      if (!blob.is_shared()) {
```
Contributor:

doing the check inside discard_unallocated seems better...

Member (Author):

fixed

liewegas (Member, Author) commented Mar 16, 2017 via email.

voidbag (Contributor) left a review:

I have a few comments about consistency...

```diff
@@ -7170,39 +7110,26 @@ void BlueStore::_txc_state_proc(TransContext *txc)
   case TransContext::STATE_KV_SUBMITTED:
     txc->log_state_latency(logger, l_bluestore_state_kv_committing_lat);
     txc->state = TransContext::STATE_KV_DONE;
-    _txc_finish_kv(txc);
+    _txc_committed_kv(txc);
```
Contributor:

For safety, you must call _txc_applied_kv after db->submit_transaction_sync, not after submit_transaction.

Member (Author):

the flush() call currently just enforces correct ordering, e.g.,

  • omap_set a=b
  • omap_clear (needs to see the 'a' key in order to remove it)

it doesn't have anything to do with commit or durability; that's what the callbacks are for (so the caller can reason about that).
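
For illustration, a minimal sketch of that ordering dependency using the ObjectStore transaction API (cid and oid are placeholders, not from the PR):

```cpp
ObjectStore::Transaction t1, t2;
ceph::bufferlist v;
v.append("b");
t1.omap_setkeys(cid, oid, {{"a", v}});  // omap_set a=b
t2.omap_clear(cid, oid);  // must observe the 'a' key in order to remove it
// flush() only guarantees t2 is applied after t1; durability is signaled
// separately via the commit callbacks
```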

Contributor:

Okay, I forgot the onreadable callback's role.
I was thinking of the old code, where o->flush() blocked read() because the onreadable callback happened before data was applied to storage. Now, with the temporary buffer cache, o->flush() isn't responsible for durability anymore.

Thank you for the detailed description.

```diff
@@ -5778,7 +5719,6 @@ int BlueStore::_do_read(
   }

   utime_t start = ceph_clock_now();
-  o->flush();
```
Contributor:

o->flush cannot be removed. That's because a worker can read the buffer of an uncommitted transaction...
A read request should be serviced only after all flush_txns of the onode have executed db->submit_transaction_sync.

Member (Author):

That's okay. The OSD layer above this doesn't read data until it's gotten the onreadable callback.

liewegas (Member, Author) commented Mar 16, 2017 via email.

dmick (Member) commented Mar 17, 2017:

My fault the submodule test failed; ignore it.

Commits

Signed-off-by: Sage Weil <sage@redhat.com>
Make osr_set hold refcounts so that it can tolerate a Sequencer destruction
racing with flush, or a Sequencer that outlives the BlueStore instance
itself.

Signed-off-by: Sage Weil <sage@redhat.com>
First, eliminate the work queue--it's useless.  We are dispatching aio and
should not block.  And if a single thread isn't sufficient to do it, it
probably means we should be parallelizing kv_sync_thread too (which is our
only caller that matters).

Repurpose the old osr-list -> txc-list-per-osr queue structure to manage
the queuing.  For any given osr, dispatch one batch of aios at a time,
taking care to collapse any overwrites so that the latest write wins.

Signed-off-by: Sage Weil <sage@redhat.com>
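
As an illustration of the "latest write wins" collapsing, a toy sketch (the names and the fixed-chunk assumption are mine, not the PR's):

```cpp
#include <cstdint>
#include <map>
#include <vector>

// toy model: deferred writes keyed by disk offset; assumes aligned,
// equal-sized chunks so a later write fully replaces an earlier one
// (the real code splits and merges intervals)
struct PendingWrite {
  uint64_t offset;
  std::vector<char> data;
};

void queue_deferred(std::map<uint64_t, PendingWrite>& batch, PendingWrite w) {
  // operator[] plus move: the newest write at an offset overwrites the old
  batch[w.offset] = std::move(w);
}
```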
This is less fragile, especially with 2 constructors.

Signed-off-by: Sage Weil <sage@redhat.com>
In a simple HDD workload with queue depth of 1, we halve our throughput
because the kv thread does a full commit twice per IO: once for the
initial commit, and then again to clean up the deferred write record. The
second wakeup is unnecessary; we can clean it up on the next commit.

We do need to do this wakeup in a few cases, though, when draining the
OpSequencers: (1) on replay during startup, and (2) on shutdown in
_osr_drain_all().

Send everything through _osr_drain_all() for simplicity.

This doubles HDD qd=1 IOPS from ~50 to ~100 on my 7200 rpm test device
(rados bench 30 write -b 4096 -t 1).

Signed-off-by: Sage Weil <sage@redhat.com>
If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.

Signed-off-by: Sage Weil <sage@redhat.com>
Allow several deferred writes to accumulate before we submit them.  In
general we have no time pressure, and on HDD (and perhaps sometimes SSD)
it is beneficial to accumulate and batch these so that they result in
fewer seeks.  On HDD, this is particularly true of seeks away from the
journal.  And on sequential workloads this can avoid seeks.  It may even
allow the block layer or SSD firmware to merge IOs and perform fewer
writes.

Signed-off-by: Sage Weil <sage@redhat.com>
When the Sequencer goes away it gets deregistered.  If there are still
deferred IOs in flight, we need to wait for those too.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
…eferred

If we have no non-deferred IO to flush, and we are running bluefs on a
single shared device, then we can rely on the bluefs flush to make our
current batch of deferred ios stable.

Separate deferred into a "done" and a "stable" list.  If we do a sync
commit, put everything from "done" onto "stable".  Otherwise, move "done"
to "stable" after we do our kv commit via bluefs.

Signed-off-by: Sage Weil <sage@redhat.com>
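
A minimal sketch of the done-to-stable promotion (container choice and names are assumptions; splice keeps the move O(1)):

```cpp
#include <list>

struct DeferredBatch;  // opaque here; stands in for a batch of deferred IOs

std::list<DeferredBatch*> deferred_done;    // IO complete, not yet stable
std::list<DeferredBatch*> deferred_stable;  // safe to clean up at next kv commit

// after a sync commit (or once the bluefs-backed kv commit lands),
// everything "done" becomes "stable" in one O(1) splice
void promote_done_to_stable() {
  deferred_stable.splice(deferred_stable.end(), deferred_done);
}
```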
Signed-off-by: Sage Weil <sage@redhat.com>
We were modifying bufferlists in place, and kludging around it by making
full copies elsewhere.  Instead, never modify a buffer.

This fixes issues where the buffer we submit to ObjectStore ends up in
the cache and we modify in place later, corrupting the implementation's
copy.  (This was affecting BlueStore.)

Rearrange the data methods to be next to each other and clean them up a
bit too.

Signed-off-by: Sage Weil <sage@redhat.com>
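
For context, ceph::bufferlist assignment shares the underlying buffers rather than deep-copying them, which is exactly why in-place modification was dangerous; a tiny sketch (builds inside the ceph tree):

```cpp
#include "include/buffer.h"  // ceph::bufferlist

void share_example() {
  ceph::bufferlist a, b;
  a.append("payload");
  b = a;  // cheap: b now references the same underlying raw buffers as a
  // rewriting those shared bytes in place (e.g. via copy_in) would be
  // visible through both lists; hence the rule adopted by this commit:
  // never modify a buffer once it has been handed to ObjectStore
}
```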
Kick off deferred IOs if we pass the throttle midpoint or if we would
block during submission.

Signed-off-by: Sage Weil <sage@redhat.com>
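
A sketch of the midpoint trigger (names are illustrative; the real check sits behind BlueStore's deferred-bytes throttle):

```cpp
#include <cstdint>

// returns true when pending deferred bytes pass half the throttle budget,
// or when a submitter would otherwise have to block
bool should_kick_deferred(uint64_t pending_bytes, uint64_t max_bytes,
                          bool would_block) {
  return would_block || pending_bytes * 2 >= max_bytes;
}
```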
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
These can survive as long as the txc, which can be longer than the
Collection.  Make sure we have a valid ref as both finish_write and
~SharedBlob use coll for the SharedBlobSet (and coll->store->cct for
debug).

Signed-off-by: Sage Weil <sage@redhat.com>
Otherwise cache items survive beyond umount into the next mount cycle!

Also, ensure that we flush_cache *before* clearing coll_map, as some cache
items have references back to the Collection.

Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
Signed-off-by: Sage Weil <sage@redhat.com>
We can't use a bare Collection since we get/put refs, the last put will
delete it, and the dtor asserts nref == 0 (no faking a ref and deliberately
leaking!).

Signed-off-by: Sage Weil <sage@redhat.com>
Better test coverage!

Signed-off-by: Sage Weil <sage@redhat.com>
Clearer, and fewer wakeups.

Signed-off-by: Sage Weil <sage@redhat.com>
We've been avoiding doing this for a while and it has finally caught up
with us: the SharedBlob may outlive the split due to deferred IO, and
a read on the child collection may load a competing Blob and SharedBlob
and read from the on-disk blocks that haven't been written yet.

Fix by preserving the one-SharedBlob-instance invariant by moving cache
items to the new Collection and cache shard like we should have from the
beginning.

Signed-off-by: Sage Weil <sage@redhat.com>
Add assertions if we fail to flush everything.

Signed-off-by: Sage Weil <sage@redhat.com>
It's possible for the Sequencer to go away while the OpSequencer still has
txcs in flight.  We were handling the case where the osr was on the
deferred_queue, but it may be off the deferred_queue but waiting for the
commit to happen, and we still need to wait for that.

Fix this by introducing a 'zombie' state for the osr, in which we keep the
osr in the osr_set.

Clean up the OpSequencer methods and a few other method names.

Signed-off-by: Sage Weil <sage@redhat.com>
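
A rough sketch of the zombie idea (field and method shapes are assumptions):

```cpp
#include <atomic>

struct OpSequencer {
  std::atomic<bool> zombie{false};  // user-facing Sequencer is gone

  // called when the Sequencer is destroyed: rather than unregistering
  // immediately, mark ourselves a zombie and stay in osr_set so that
  // _osr_drain_all() still waits for our in-flight txcs; the entry is
  // removed only once everything has drained
  void discard() {
    zombie = true;
  }
};
```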
liewegas merged commit 66b42be into ceph:master on Mar 21, 2017
liewegas deleted the wip-bluestore-dw branch on March 21, 2017 20:06