os/bluestore: fix deferred writes; improve flush #13888

Merged
merged 44 commits into from Mar 21, 2017

4 participants
@liewegas
Member

liewegas commented Mar 8, 2017

No description provided.

@liewegas liewegas changed the title from os/bluestore: fix deferred writes; improve flush to WIP os/bluestore: fix deferred writes; improve flush Mar 10, 2017

@liewegas liewegas changed the title from WIP os/bluestore: fix deferred writes; improve flush to os/bluestore: fix deferred writes; improve flush Mar 13, 2017

@liewegas liewegas requested a review from ifed01 Mar 13, 2017

@liewegas

liewegas Mar 13, 2017

Member

ready for review!

contents[new_obj].data.clear();
contents[new_obj].data.append(contents[old_obj].data.c_str(),
contents[old_obj].data.length());
contents[new_obj].data = contents[old_obj].data;

@ifed01

ifed01 Mar 15, 2017

Contributor

Do you think it might fix http://tracker.ceph.com/issues/19247?

@liewegas

liewegas Mar 15, 2017

Member

ah! yeah, probably!

@ifed01

ifed01 Mar 16, 2017

Contributor

Unfortunately it doesn't. Just managed to reproduce the issue with this commit cherry-picked...

@liewegas

liewegas Mar 16, 2017

Member

perhaps try with this whole pr? there were several hard-to-hit races that were fixed

@voidbag

voidbag suggested changes Mar 16, 2017 edited

I have a few comments about consistency...

@@ -7170,39 +7110,26 @@ void BlueStore::_txc_state_proc(TransContext *txc)
case TransContext::STATE_KV_SUBMITTED:
txc->log_state_latency(logger, l_bluestore_state_kv_committing_lat);
txc->state = TransContext::STATE_KV_DONE;
_txc_finish_kv(txc);
_txc_committed_kv(txc);

@voidbag

voidbag Mar 16, 2017

Contributor

You must call _txc_applied_kv after db->submit_transaction_sync, not submit_transaction for safety.

@liewegas

liewegas Mar 16, 2017

Member

the flush() call currently just enforces correct ordering, e.g.,

  • omap_set a=b
  • omap_clear (needs to see 'a' key in order to remove it)

it doesn't have anything to do with committed or durability; that's what the callbacks are for (so the caller can reason about that).
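
The ordering point can be illustrated with a toy sketch. This is not BlueStore code: `ToyStore`, `queue_set`, `flush`, and `omap_clear` are hypothetical names, and durability is deliberately left out. It assumes that a clear works by enumerating the keys that are currently visible, so it only removes 'a' if the earlier set has already been applied.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model, not BlueStore: sets are queued and only become visible in
// `kv` once flush() applies them. Durability is not modelled at all.
struct ToyStore {
  std::map<std::string, std::string> kv;                     // applied state
  std::vector<std::pair<std::string, std::string>> pending;  // queued sets

  void queue_set(const std::string& k, const std::string& v) {
    pending.emplace_back(k, v);
  }

  // Apply every queued set so later operations can see them.
  void flush() {
    for (auto& p : pending) kv[p.first] = p.second;
    pending.clear();
  }

  // Clear enumerates the keys that are currently visible, so it can only
  // remove keys that have already been applied. Returns the removed count.
  std::size_t omap_clear(bool flush_first) {
    if (flush_first) flush();
    std::size_t n = kv.size();
    kv.clear();
    return n;
  }
};

// Runs "set a=b; clear; apply the rest" and reports whether 'a' survives.
bool key_survives_clear(bool flush_first) {
  ToyStore s;
  s.queue_set("a", "b");
  s.omap_clear(flush_first);
  s.flush();  // any set that is still pending lands after the clear
  return s.kv.count("a") > 0;
}
```

Without the flush, the clear misses the pending key and 'a' reappears afterwards; with the flush, the clear sees and removes it. Neither case says anything about whether the data is on disk.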

@voidbag

voidbag Mar 16, 2017

Contributor

Okay, I forgot the onreadable callback's role.
I was thinking of the old code, where o->flush() blocks read() because the onreadable callback happens before data is applied to storage. Now, with the temporary buffer cache, o->flush() isn't responsible for durability anymore.

Thank you for the detailed description.

@@ -5778,7 +5719,6 @@ int BlueStore::_do_read(
}
utime_t start = ceph_clock_now();
o->flush();

@voidbag

voidbag Mar 16, 2017

Contributor

o->flush() cannot be removed, because a worker can read the buffer of an uncommitted transaction...
A read request should be serviced only after all flush_txns of the onode have executed db->submit_transaction_sync.

@liewegas

liewegas Mar 16, 2017

Member

That's okay. The OSD layer above this doesn't read data until it's gotten the onreadable callback.

@dmick

dmick Mar 17, 2017

Member

My fault the submodule test failed, ignore

@ifed01

ifed01 Mar 17, 2017

Contributor

> The lock is held for the duration of many calls into _txc_state_proc (see
> _txc_finish_io). My assumption is that calling cond.notify multiple times
> is essentially a no-op (presumably the implementation sees it's already on
> the wakeup list and does nothing). But i'm just assuming...

Yep, I missed that the mutex is still held; I was thinking of semantics similar to a wait() call, which releases the lock...
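
A small sketch of the behaviour being discussed, using the standard `std::condition_variable` rather than anything from BlueStore (the function name `run_notify_demo` is hypothetical). It assumes the notifier holds the mutex across several `notify_all()` calls; the waiter still observes the state change exactly once, because it cannot return from `wait()` until the lock is released.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Toy sketch: the notifier calls notify_all() several times while holding
// the mutex. Returns how many times the waiter woke up with `done` set.
int run_notify_demo(int notify_count) {
  std::mutex m;
  std::condition_variable cv;
  bool done = false;
  int wakeups_with_done = 0;

  std::thread waiter([&] {
    std::unique_lock<std::mutex> l(m);
    cv.wait(l, [&] { return done; });  // wait() drops the lock while blocked
    ++wakeups_with_done;
  });

  {
    std::lock_guard<std::mutex> l(m);
    done = true;
    for (int i = 0; i < notify_count; ++i)
      cv.notify_all();  // repeated notifications are harmless
  }  // the waiter can only return from wait() after the lock is released

  waiter.join();
  return wakeups_with_done;
}
```

Whether repeated notifications cost anything measurable depends on the implementation; the sketch only shows that they do not change the outcome.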

liewegas added some commits Mar 8, 2017

os/bluestore: wal -> deferred
"wal" can refer to both the rocksdb wal (effectively, our journal) and the
"wal" events we include in it (mainly promises to do future IO or release
extents to the freelist).  This is super confusing!

Instead, call them 'deferred'.. deferred transactions, ops, writes, or
releases.

Signed-off-by: Sage Weil <sage@redhat.com>
vstart.sh: larger wal device
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: pin writing cache buffers until txc is finished
Notably, this includes WAL writes, which means an in-flight WAL write will
always be in the cache.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: no need to Onode::flush() in _do_read
We now ensure that deferred writes are in cache until the txc retires,
so there is no need to wait here.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: update freelist on initial commit
It does not matter if we update the freelist in the initial commit or when
cleaning up the deferred transaction; both will eventually update the
persistent kv freelist.  We maintain one case to ensure that legacy
deferred events (from a kraken upgrade) release when they are replayed.

What matters while online is the Allocator, which has an independent
in-memory copy of the freelist to make decisions.  And we can delay that
as long as we want.  To avoid any concerns about deferred writes racing
against released blocks, just defer any release until the txc is fully
completed (including any deferred writes).  This ensures that even if we
have a pattern like

 txc 1: schedule deferred write on block A
 txc 2: release block A
 txc 1+2: commit
 txc 2: done!
 txc 1: do deferred write
 txc 1: done!

then txc 2 won't do its release because it is stuck behind txc 1 in the
OpSequencer queue:

 ...
 txc 1: reaped
 txc 2: reaped (and extents released to alloc)

This builds in some delay in just-released space being usable again, but
it should be a very small amount of space relative to the size of the
store!

Signed-off-by: Sage Weil <sage@redhat.com>
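
The release ordering described in this commit message can be sketched with a toy in-order reap queue. All names here (`ToyTxc`, `ToySequencer`, `replay`) are hypothetical, and the model assumes only what the message states: txcs are reaped strictly in queue order, and released extents reach the allocator at reap time.

```cpp
#include <deque>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Toy model of an in-order reap queue (all names are hypothetical).
struct ToyTxc {
  int id;
  bool finished;                      // all IO, including deferred, is done
  std::vector<std::string> released;  // blocks this txc wants to free
};

struct ToySequencer {
  std::deque<ToyTxc> q;               // submission order
  std::set<std::string> free_blocks;  // stand-in for the in-memory Allocator

  void submit(int id, std::vector<std::string> released = {}) {
    q.push_back(ToyTxc{id, false, std::move(released)});
  }

  void mark_finished(int id) {
    for (auto& t : q)
      if (t.id == id) t.finished = true;
    reap();
  }

  // Reap strictly from the front: a finished txc behind an unfinished one
  // stays queued, so its released blocks are not handed back yet.
  void reap() {
    while (!q.empty() && q.front().finished) {
      for (auto& b : q.front().released) free_blocks.insert(b);
      q.pop_front();
    }
  }
};

// Replays the sequence from the commit message and records whether block
// "A" was reusable before and after txc 1 finished its deferred write.
std::pair<bool, bool> replay() {
  ToySequencer s;
  s.submit(1);         // txc 1: schedules a deferred write on block A
  s.submit(2, {"A"});  // txc 2: releases block A
  s.mark_finished(2);  // txc 2: done, but stuck behind txc 1
  bool early = s.free_blocks.count("A") > 0;
  s.mark_finished(1);  // txc 1: deferred write done, both get reaped
  bool late = s.free_blocks.count("A") > 0;
  return {early, late};
}
```

In this model block A only becomes allocatable once txc 1 has been reaped, which is the delay the message accepts in exchange for never racing a deferred write against a release.
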
os/bluestore: no need to Onode::flush() on truncate
We do not release extents until after any deferred IO, so this flush() is
unnecessary.

Signed-off-by: Sage Weil <sage@redhat.com>

# Conflicts:
#	src/os/bluestore/BlueStore.cc

liewegas added some commits Mar 9, 2017

os/bluestore: make Sequencer::flush() more efficient
BlueStore collection methods only need preceding transactions to be
applied to the kv db; they do not need to be committed.

Note that this is *only* needed for collection listings; all other read
operations are immediately safe after queue_transactions().

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: fix OpSequencer/Sequencer lifecycle
Make osr_set entries refcounted so that it can tolerate a Sequencer destruction
racing with flush or a Sequencer that outlives the BlueStore instance
itself.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: make OnodeSpace onode_map private
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: keep onode refs for lifetime of obc
This ensures that we don't trim an onode from the cache while it has a
txc that is still in flight.  Which in turn ensures that if we try to read
the object, we will have any writing buffers available.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: keep all OpSequencers registered
Maintain the set of all live OpSequencers.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: batch up to bluestore_deferred_batch_ops before submitting
Allow several deferred writes to accumulate before we submit them.  In
general we have no time pressure, and on HDD (and perhaps sometimes SSD)
it is beneficial to accumulate and batch these so that they result in
fewer seeks.  On HDD, this is particularly true of seeks away from the
journal.  And on sequential workloads this can avoid seeks.  It may even
allow the block layer or SSD firmware to merge IOs and perform fewer
writes.

Signed-off-by: Sage Weil <sage@redhat.com>
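
A minimal sketch of the batching idea, assuming a simple count threshold in the spirit of `bluestore_deferred_batch_ops`. The class `DeferredBatcher` and the helper `batches_after` are hypothetical; real deferred ops carry extents and data, not integers.

```cpp
#include <cstddef>
#include <vector>

// Toy batcher (hypothetical names): deferred ops accumulate until the batch
// reaches `batch_ops`, then the whole batch is submitted at once.
class DeferredBatcher {
 public:
  explicit DeferredBatcher(std::size_t batch_ops) : batch_ops_(batch_ops) {}

  // Queue one op; returns true if this call triggered a submission.
  bool queue(int op) {
    pending_.push_back(op);
    if (pending_.size() >= batch_ops_) {
      submit();
      return true;
    }
    return false;
  }

  // Submit whatever is pending, e.g. on an explicit flush.
  void submit() {
    if (pending_.empty()) return;
    submitted_batches_.push_back(pending_);
    pending_.clear();
  }

  std::size_t batches() const { return submitted_batches_.size(); }
  std::size_t pending() const { return pending_.size(); }

 private:
  std::size_t batch_ops_;
  std::vector<int> pending_;
  std::vector<std::vector<int>> submitted_batches_;
};

// Queues `n` ops with the given threshold and returns the batch count.
std::size_t batches_after(std::size_t n, std::size_t batch_ops) {
  DeferredBatcher b(batch_ops);
  for (std::size_t i = 0; i < n; ++i) b.queue(static_cast<int>(i));
  return b.batches();
}
```

Ten ops with a threshold of four produce two submissions and leave two ops pending, so something else (a flush, or a throttle kick) has to push out the tail.
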
os/bluestore: avoid extra dev flush on single device when all io is deferred

If we have no non-deferred IO to flush, and we are running bluefs on a
single shared device, then we can rely on the bluefs flush to make our
current batch of deferred ios stable.

Separate deferred into a "done" and "stable" list.  If we do sync, put
everything from "done" onto "stable".  Otherwise, after we do our kv
commit via bluefs, move "done" to "stable" then.

Signed-off-by: Sage Weil <sage@redhat.com>
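
One way to read the "done" and "stable" lists is sketched below. `DeferredTracker` and its members are hypothetical, and the condition for skipping the flush is an interpretation of the message above, not the actual BlueStore logic.

```cpp
#include <cstddef>
#include <vector>

// Toy model (hypothetical names) of the "done" and "stable" lists.
struct DeferredTracker {
  std::vector<int> done;    // deferred IO finished, not yet known durable
  std::vector<int> stable;  // deferred IO known durable
  int explicit_flushes = 0;

  void io_finished(int id) { done.push_back(id); }

  void move_done_to_stable() {
    stable.insert(stable.end(), done.begin(), done.end());
    done.clear();
  }

  // One kv sync cycle. If there is non-deferred IO (or bluefs is on a
  // separate device), flush the device first and promote "done" right away.
  // Otherwise skip that flush and promote only after the kv commit, whose
  // own flush on the shared device is assumed to cover the deferred writes.
  void kv_sync(bool have_other_io, bool bluefs_on_shared_device) {
    bool need_flush = have_other_io || !bluefs_on_shared_device;
    if (need_flush) {
      ++explicit_flushes;
      move_done_to_stable();
    }
    // ... the kv commit via bluefs would happen here ...
    if (!need_flush) move_done_to_stable();
  }
};
```

In both paths every finished deferred IO ends up on "stable" by the end of the cycle; the difference is only whether an extra device flush was issued to get there.
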
os/bluestore: drop obsolete comment
Signed-off-by: Sage Weil <sage@redhat.com>
ceph_test_objectstore: fix Synthetic to never modify bufferlists
We were modifying bufferlists in place, and kludging around it by making
full copies elsewhere.  Instead, never modify a buffer.

This fixes issues where the buffer we submit to ObjectStore ends up in
the cache and we modify in place later, corrupting the implementation's
copy.  (This was affecting BlueStore.)

Rearrange the data methods to be next to each other and clean them up a
bit too.

Signed-off-by: Sage Weil <sage@redhat.com>
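
The aliasing problem this commit describes can be reproduced with a toy cache that keeps a reference to the caller's buffer. `Buf` and `ToyCache` are hypothetical stand-ins (a `std::shared_ptr<std::string>` instead of a real bufferlist), so this is only a sketch of the failure mode and of the "never modify a buffer" fix.

```cpp
#include <memory>
#include <string>

using Buf = std::shared_ptr<std::string>;  // stand-in for a ref-counted buffer

// Toy store that caches the caller's buffer by reference, not by copy.
struct ToyCache {
  Buf cached;
  void write(const Buf& b) { cached = b; }
};

// Buggy pattern: mutate the submitted buffer in place afterwards.
std::string cached_after_in_place_edit() {
  ToyCache c;
  Buf data = std::make_shared<std::string>("hello");
  c.write(data);
  (*data)[0] = 'J';  // also changes what the cache sees
  return *c.cached;
}

// Fixed pattern: build a new buffer and leave the submitted one untouched.
std::string cached_after_copy_edit() {
  ToyCache c;
  Buf data = std::make_shared<std::string>("hello");
  c.write(data);
  Buf next = std::make_shared<std::string>(*data);
  (*next)[0] = 'J';
  data = next;  // the caller's own view moves on to the new buffer
  return *c.cached;
}
```

The first function returns the corrupted contents; the second leaves the cached copy intact.
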
os/bluestore: only discard deallocated regions of a blob if !shared
If a blob is shared, we can't discard deallocated regions: there may
be deferred buffers in flight and we might get a read via the clone.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: prevent throttle deadlock due to deferred writes
Kick off deferred IOs if we pass the throttle midpoint or if we would
block during submission.

Signed-off-by: Sage Weil <sage@redhat.com>
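
The two kick conditions named above can be written as a small predicate. `should_kick_deferred` and its parameters are hypothetical and do not correspond to the BlueStore throttle API; the sketch assumes a single counter that both new transactions and outstanding deferred writes are charged against.

```cpp
#include <cstdint>

// Toy decision function (hypothetical; not the BlueStore throttle API).
// `current` is the amount already charged to the throttle, `max` its limit,
// and `cost` the size of the transaction about to be submitted.
bool should_kick_deferred(std::uint64_t current, std::uint64_t max,
                          std::uint64_t cost) {
  bool past_midpoint = current > max / 2;
  bool would_block = current + cost > max;
  return past_midpoint || would_block;
}
```

Under that assumption, batched deferred writes that are never submitted would hold their share of the throttle indefinitely, so a submitter that blocks on the throttle has to start them itself.
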
os/bluestore: flush old/discarded OpSequencers too
When the Sequencer goes away it gets deregistered.  If there are still
deferred IOs in flight, we need to wait for those too.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: debug alloc release
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: make throttles tunable online
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: remove dead _do_deferred_op code
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: fix perfcounters for deferred io
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: take Collection ref from SharedBlob
These can survive as long as the txc, which can be longer than the
Collection.  Make sure we have a valid ref as both finish_write and
~SharedBlob use coll for the SharedBlobSet (and coll->store->cct for
debug).

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: flush_cache on umount, fsck finish, etc.
Otherwise cache items survive beyond umount into the next mount cycle!

Also, ensure that we flush_cache *before* clearing coll_map, as some cache
items have references back to the Collection.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: nicer Onode dout prefix
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: better debugging around collections
Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore/KernelDevice: drop unused flush_lock
Signed-off-by: Sage Weil <sage@redhat.com>
unittest_bluestore_types: fix Collection using tests
We can't use a bare Collection since we get/put refs, the last put will
delete it, and the dtor asserts nref == 0 (no faking a ref and deliberately
leaking!).

Signed-off-by: Sage Weil <sage@redhat.com>
ceph_test_objectstore: set bluestore cache shards to 5
Better test coverage!

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: move cached items around on collection split
We've been avoiding doing this for a while and it has finally caught up
with us: the SharedBlob may outlive the split due to deferred IO, and
a read on the child collection may load a competing Blob and SharedBlob
and read from the on-disk blocks that haven't been written yet.

Fix by preserving the one-SharedBlob-instance invariant by moving cache
items to the new Collection and cache shard like we should have from the
beginning.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: simplify flush() wake-up condition
Clearer, and fewer wakeups.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: clean up flush_all()
Add assertions if we fail to flush everything.

Signed-off-by: Sage Weil <sage@redhat.com>
os/bluestore: handle zombie OpSequencers
It's possible for the Sequencer to go away while the OpSequencer still has
txcs in flight.  We were handling the case where the osr was on the
deferred_queue, but it may be off the deferred_queue but waiting for the
commit to happen, and we still need to wait for that.

Fix this by introducing a 'zombie' state for the osr, in which we keep the
osr in the osr_set.

Clean up the OpSequencer methods and a few other method names.

Signed-off-by: Sage Weil <sage@redhat.com>
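
A rough sketch of the zombie idea, using hypothetical names (`ToyOsr`, `ToyRegistry`) and a plain in-flight counter in place of real txcs. It assumes only what the message states: an osr whose Sequencer has gone away stays in the registered set until its in-flight work drains.

```cpp
#include <memory>
#include <set>

// Toy stand-in for an OpSequencer (hypothetical names throughout).
struct ToyOsr {
  int inflight = 0;     // txcs that have not finished yet
  bool zombie = false;  // the parent Sequencer has gone away
};
using OsrRef = std::shared_ptr<ToyOsr>;

struct ToyRegistry {
  std::set<OsrRef> osr_set;  // every osr that may still have work pending

  OsrRef create() {
    OsrRef o = std::make_shared<ToyOsr>();
    osr_set.insert(o);
    return o;
  }

  // Called when the parent Sequencer is destroyed.
  void discard(const OsrRef& o) {
    if (o->inflight == 0)
      osr_set.erase(o);
    else
      o->zombie = true;  // keep it registered until it drains
  }

  // Called when one txc on this osr finishes.
  void txc_finished(const OsrRef& o) {
    if (--o->inflight == 0 && o->zombie) osr_set.erase(o);
  }

  // flush_all-style check: is anything registered still in flight?
  bool all_idle() const {
    for (const auto& o : osr_set)
      if (o->inflight > 0) return false;
    return true;
  }
};
```

Because the zombie stays in the set, a store-wide flush that walks the set still waits for its remaining work.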

@liewegas liewegas merged commit 66b42be into ceph:master Mar 21, 2017

3 checks passed

Signed-off-by: all commits in this PR are signed
Unmodifed Submodules: submodules for project are unmodified
default: Build finished.

@liewegas liewegas deleted the liewegas:wip-bluestore-dw branch Mar 21, 2017
