os/bluestore: fix alloc release timing on sync submits #11983

Merged
merged 6 commits into ceph:master from liewegas:wip-bluestore-release on Nov 23, 2016


5 participants

@liewegas
Member

If we submit the txc synchronously, we can't immediately release our
freed space to the allocator; that still needs to be done between
commit_start() and commit_finish() from the kv_sync_thread, protected
by the bdev barriers.

Signed-off-by: Sage Weil <sage@redhat.com>
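
To make the ordering concrete, here is a minimal sketch of the kv_sync_thread sequence the commit message relies on (simplified; member names like kv_committing and synct are illustrative, not the literal code):

// Hedged sketch of the kv_sync_thread ordering described above.
void BlueStore::_kv_sync_thread_iteration_sketch() {
  bdev->flush();             // barrier: all prior data writes are stable
  fm->commit_start();        // freelist commit begins
  for (auto txc : kv_committing) {
    // Releasing freed extents here is safe: any reuse of the space is
    // ordered after the flush above and sealed by commit_finish() below.
    alloc->release(txc->released);  // illustrative; real code iterates the interval set
  }
  db->submit_transaction_sync(synct);  // commit the kv metadata
  fm->commit_finish();
}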

@ifed01

LGTM. You may want to add test cases to cover async transaction submit though. Looks like store_test lacks that for now.

@liewegas
Member

@voidbag This series fixes a few issues I saw while trying to reason through your change. I think these have to go in first. And the txc unstable ios count in the OpSequencer should be usable in place of the pipelined_io bool.

Do you mind reviewing?

@dachary
Member
dachary commented Nov 16, 2016

jenkins test this please (jenkins general failure)

@dachary
Member
dachary commented Nov 16, 2016

jenkins test this please (tox failure)

@voidbag
voidbag commented Nov 16, 2016 edited

@liewegas
I also used is_pipelined_io when the current transaction is blocked by the sequencer in _txc_finish_io.
That's because ShardedOpWQ may yield its context; _txc_finish_io can then be called again by the aio_thread of KernelDevice afterwards.
So I don't think is_pipelined_io can be completely replaced with txc_with_unstable_io.

Update:
If RTC is based on this patch:

if (!txc_with_unstable_io && all previous transactions have been submitted to the finishers && txc was blocked in _txc_finish_io),
then KernelDevice.aio_thread executes db->submit_transaction_sync() and _txc_finish_kv (see the sketch at the end of this comment).
LGTM

P.S.
When the RTC option is used, the condition that all previous transactions have been submitted to the finishers (as used in #11814) must be considered, because of pg ordering.

Thank you
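
Expressed as code, the RTC condition above might look like this (a sketch only; all_prior_txcs_submitted and blocked_in_finish_io are hypothetical names for the conditions described):

// sketch, in the KernelDevice aio_thread completion path
if (txc->osr->txc_with_unstable_io == 0 &&
    all_prior_txcs_submitted(txc->osr) &&   // hypothetical helper
    txc->blocked_in_finish_io) {            // hypothetical flag
  db->submit_transaction_sync(txc->t);      // commit kv inline (RTC)
  _txc_finish_kv(txc);
}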

@voidbag

@liewegas
I have a concern about where txc->osr->txc_with_unstable_io is incremented.
Could you read my comment?

Thanks

src/os/bluestore/BlueStore.cc
@@ -6289,6 +6289,8 @@ void BlueStore::_txc_state_proc(TransContext *txc)
txc->log_state_latency(logger, l_bluestore_state_prepare_lat);
if (txc->ioc.has_pending_aios()) {
txc->state = TransContext::STATE_AIO_WAIT;
+ ++txc->osr->txc_with_unstable_io;
+ txc->had_ios = true;
@voidbag
voidbag Nov 16, 2016

I think ++txc->osr->txc_with_unstable_io can be moved to right after case TransContext::STATE_IO_DONE:, which is executed while holding osr->qlock.
Like `case TransContext::STATE_IO_DONE: if (txc->had_ios) { ++txc->osr->txc_with_unstable_io; }`

Otherwise, a recent transaction may prevent KernelDevice.aio_thread from doing RTC with an earlier transaction.
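
As a sketch, the proposed placement in _txc_state_proc would be (simplified; this matches the diff discussed later in the thread):

case TransContext::STATE_IO_DONE:
  // osr->qlock is held in this state (see _txc_finish_io), so the
  // increment cannot race with aio_thread iterating the sequencer queue
  if (txc->had_ios) {
    ++txc->osr->txc_with_unstable_io;
  }
  ...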

@liewegas
liewegas Nov 16, 2016 Member

I think that change is safe, but I'm not sure it would help: if the new txn is queuing new io, it is in the queuing calling context, which means that there is no other txc currently in queue context (ios are not submitted in parallel within a sequencer).

@voidbag
voidbag Nov 16, 2016

Suppose the IOs below can be done via submit_transaction_sync, and KernelDevice.aio_thread is executing _txc_finish_io while holding osr->qlock:

queue: (IO 1) (IO 2)

Now ShardedOpWQ wants to deal with IO 3.
If the ShardedOpWQ context increments txc->osr->txc_with_unstable_io before acquiring osr->qlock, then IO 1 and IO 2 cannot be executed via submit_transaction_sync, because txc->osr->txc_with_unstable_io was incremented by the later txn.

This is just one scenario I see.
If you have a different view, I'd be glad to hear it.

Thank you

@voidbag
voidbag commented Nov 17, 2016 edited

I'll describe the above example more specifically.

My assumption is that IO 1 and 2 didn't increment the unstable io count, because each is a WAL-only or metadata-only operation.

Suppose IO 0 exists before IO 1 and 2.
IO 0 incremented the unstable io count in the prepare state and submitted aio.

Then IO 1 and 2 are blocked by IO 0.
When IO 0 is done with the data device, KernelDevice.aio_thread reaps it using io_getevents.

KernelDevice.aio_thread calls the _txc_finish_io method while holding osr->qlock.
On the first iteration, it hands IO 0 off to kv_sync_thread.
kv_sync_thread does bdev->flush and decrements the unstable io count.

At that moment, the unstable io count is zero and IO 1 and 2 could be processed in submit_sync mode by KernelDevice.aio_thread. I think this case is very rare, because bdev->flush is much slower than iterating over the elements of the osr in _txc_finish_io, but it is not impossible.

In this situation, a later transaction can increment the unstable io count, which prevents IO 1 and 2 from being processed in submit_sync mode.

Thank you
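
The interleaving described above, as a timeline (illustrative):

aio_thread (_txc_finish_io, qlock held)      other contexts
---------------------------------------      --------------
pops IO 0, queues it for kv sync
                                             kv_sync_thread: bdev->flush()
                                             kv_sync_thread: unstable io count 1 -> 0
                                             ShardedOpWQ (IO 3, queue context):
                                               unstable io count 0 -> 1
iterates to IO 1, IO 2: sees count == 1,
so no submit_sync, even though IO 0 is
fully stable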

@ifed01 ifed01 dismissed their review Nov 17, 2016

More changes were made.

@liewegas liewegas changed the title from os/bluestore: fix alloc release timing on sync submits to DNM: os/bluestore: fix alloc release timing on sync submits Nov 17, 2016
@liewegas liewegas changed the title from DNM: os/bluestore: fix alloc release timing on sync submits to os/bluestore: fix alloc release timing on sync submits Nov 17, 2016
@liewegas
Member

@voidbag I don't follow your example.

IO 0 should block IO 1 and 2 if we haven't yet decremented io_unstable, because the flush hasn't completed and we cannot queue or commit metadata before the data is stable.

I don't think we have to worry about a later IO 3, though. If IO 3 increments the unstable count, that means IO 3 was in the queueing context, which means that IO 1 and 2 have already been queued and have either completed (if IO 0 finished its sync) or were deferred with in_queue_context=false (if IO 0 didn't finish its sync in time). It's impossible for two IOs to both be in_queue_context for the same sequencer, so a later IO cannot block an earlier IO's ability to RTC: the earlier IO will have already completed or been queued.

@dachary
Member
dachary commented Nov 17, 2016

jenkins test this please (object memstore)

@voidbag
voidbag commented Nov 18, 2016

I agree with you.
That's why I implemented RTC using the fact that there are never two in_queue_context threads for the same sequencer simultaneously.
I don't think incrementing the unstable io count in queue context has any effect on the RTC path itself.

But the reason I left the comment is that I thought a later IO 3 in queue context might influence the control flow of the deferred IO 1 and 2 by incrementing a count that those earlier transactions may read (I think it might be unsafe). In this case, the deferred IO 1 and 2 are being processed by KernelDevice.aio_thread in the _txc_finish_io method.
After the prior transaction (IO 0) decrements the unstable io counter in kv_sync_thread, the unstable io count will be 0. At that moment, IO 3 can increment the unstable io count in queue context. I think this is a case where a later transaction affects prior transactions, because the deferred IO 1 and 2 are still being processed by KernelDevice.aio_thread, not by the queue context.

So I think it's better to move the line that increments the unstable io count to after the sequencer lock is taken.

Thank you for reading my comment, which might be wrong; I may have made a mistake. Nevertheless, I left it because I don't think I delivered my thought to you completely before.

I really appreciate the comments you left.
Thank you!
:)

@liewegas
Member

@ifed01 this passes ceph_test_objectstore

@xiexingguo
Contributor

Rebase needed.

@liewegas
Member

I think I see what you're worried about, but keep in mind that kv_committing_serially will prevent us from doing anything inline for IO 3 (or 1 or 2) if IO 0 hasn't finished completely. Is that right?

kv_committing_serially is also important in the general case, where IO 0 submits some aio, the aio completes and is flushed and stable, but the kv txn hasn't been written yet, and then IO 1 comes along with no io. IO 1 can't be submitted synchronously because IO 0 is in the pipeline ahead of it.

Am I still missing something? (Thanks for being patient!)
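
A hedged sketch of the gate described here (condition and member names follow the thread; the actual code in the PR may differ):

// sketch: a sync submit is only safe when nothing ahead of this txc in
// the sequencer is unstable or still committing
bool can_submit_sync =
    txc->osr->txc_with_unstable_io == 0 &&   // no unflushed aio ahead
    txc->osr->kv_committing_serially == 0;   // no kv txn queued ahead
if (can_submit_sync) {
  db->submit_transaction_sync(txc->t);       // commit inline
  _txc_finish_kv(txc);
} else {
  kv_queue.push_back(txc);                   // defer to kv_sync_thread
  kv_cond.notify_one();                      // illustrative wakeup
}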

liewegas added some commits Nov 15, 2016
@liewegas liewegas os/bluestore: fix alloc release timing on sync submits
If we submit the txc synchronously, we can't immediately release our
freed space to the allocator; that still needs to be done between
commit_start() and commit_finish() from the kv_sync_thread, protected
by the bdev barriers.

Signed-off-by: Sage Weil <sage@redhat.com>
652d18f
@liewegas liewegas ceph_test_objectstore: test w/ and w/o sync_submit_transaction
Signed-off-by: Sage Weil <sage@redhat.com>
315e777
@liewegas liewegas os/bluestore: remove unused KV_COMMITTING state
Signed-off-by: Sage Weil <sage@redhat.com>
af12f84
@liewegas liewegas os/bluestore: kill kv_submitted bool; use a new state
We have state definitions; use them!

Signed-off-by: Sage Weil <sage@redhat.com>
a3747de
@voidbag
voidbag commented Nov 18, 2016

@liewegas
I agree with the role of kv_committing_serially.

But I was considering the case where IO 0 has finished completely, in the kv_sync_thread context, before the next iterations (IO 1 and 2) of _txc_finish_io run in the KernelDevice.aio_thread context.
After that, I think the deferred IO 1 and 2 can do something inline, because IO 0 has finished completely, not partially (kv_committing_serially: 0, txc_with_unstable_io: 0). (I hope you agree with this sentence.)
=========== No thread is holding the pg lock in this situation ===========

At that moment, a thread (in_queue_context) with IO 3 can enter TransContext::STATE_PREPARE, because there is no other ShardedOpWQ (in_queue_context) thread for the same pg.
The context with IO 3 can execute ++txc->osr->txc_with_unstable_io regardless of kv_committing_serially and txc_with_unstable_io. After entering the _txc_finish_io method, this context (in_queue_context) will block on osr->qlock, which is already held by the KernelDevice.aio_thread context.
After that, IO 1 and 2 will see kv_committing_serially as 0 and txc_with_unstable_io as 1, which differs from (kv_committing_serially: 0, txc_with_unstable_io: 0).

In summary,
I want to point out the specific situation where IO 0 has finished completely in kv_sync_thread and IO 3 increments the unstable io count before the next iterations (IO 1 and 2) are executed by KernelDevice.aio_thread. In that case, the later transaction IO 3 could affect the prior transactions IO 1 and 2.

Thank you.

@dachary
Member
dachary commented Nov 18, 2016

jenkins test this please (eio)

@liewegas
Member
@voidbag
voidbag commented Nov 18, 2016

Finally!
I'm very happy for your understanding!
:D

Thank you for trying to understand my comment.
Again, thank you very much!

src/os/bluestore/BlueStore.cc
@@ -6210,6 +6211,9 @@ void BlueStore::_txc_state_proc(TransContext *txc)
case TransContext::STATE_IO_DONE:
//assert(txc->osr->qlock.is_locked()); // see _txc_finish_io
+ if (txc->had_aios) {
@voidbag
voidbag Nov 18, 2016

I think the build error is from this line.
There is a typo: the variable name is had_ios, not had_aios.

Thank you!

liewegas added some commits Nov 15, 2016
@liewegas liewegas os/bluestore: count txcs with unstable io per osr
Signed-off-by: Sage Weil <sage@redhat.com>
5035388
@liewegas liewegas os/bluestore: prevent sync_submit if there is unstable io for this osr
If this txc or any txc that precedes it has unstable IO we cannot queue
our kv transaction without a bdev sync.  Currently such a sync is only
triggered from the kv_sync_thread, so we need to do this kv submission
async.

Signed-off-by: Sage Weil <sage@redhat.com>
662cda6
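
In sketch form, the guard this commit describes (simplified, not the literal diff):

// sketch: before choosing the sync-submit path
if (txc->osr->txc_with_unstable_io.load()) {
  // A preceding txc's data has not been flushed yet; committing kv
  // metadata now could make it refer to unstable data.  Only the
  // kv_sync_thread performs the bdev sync, so submit asynchronously.
  sync_submit = false;
}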
@liewegas
Member

@ifed01 do you mind reviewing?

@ifed01
ifed01 approved these changes Nov 23, 2016

LGTM

@ifed01 ifed01 merged commit df97766 into ceph:master Nov 23, 2016

2 checks passed

Signed-off-by: all commits in this PR are signed
default: Build finished.
@liewegas liewegas deleted the liewegas:wip-bluestore-release branch Nov 23, 2016