os/bluestore: RTC programming model has been applied #11814
Conversation
There are 2 big changes here combined into a single commit. Let's ignore the second change (finishers, no pg lock) since the locking change is complex and focus just on the first change (sync transaction commits). First, we'll want a PR that does just the transaction part.
Can you explain what RTC means?
The flush before we commit the transaction is important because the data has to be stable before we commit metadata that references it. I'm guessing that's why you are checking for wal_txn (if it's a WAL write you don't need to do the initial flush), but that isn't quite right: you can have a write that includes both a WAL write and a direct write. You should be explicitly looking at whether there is IO beforehand, either for this txn or any previous one in the sequencer. So I think you're on the right track with the no_rtc checks in _txc_finish_io. That's good, although the variable name is confusing. But with a simpler patch that focuses on just this change I think we will be in good shape.
I think we should revisit the config option names for bluestore_sync_submit_transaction and this option (bluestore_sync_submit_transaction_sync?). :/ 'rtc' doesn't really describe what's going on...
@@ -1050,6 +1050,9 @@ OPTION(bluestore_debug_inject_read_err, OPT_BOOL, false)
OPTION(bluestore_debug_randomize_serial_transaction, OPT_INT, 0)
OPTION(bluestore_inject_wal_apply_delay, OPT_FLOAT, 0)
OPTION(bluestore_shard_finishers, OPT_BOOL, false)
OPTION(bluestore_rtc, OPT_BOOL, false)
What does RTC stand for?
RTC stands for Run To Completion, which means that a thread does its work to the end without a context switch.
In bluestore, RTC means that ShardedOpWQ (the thread pool) does the WAL commit (nvdimm) without a context switch. (bluestore_rtc)
Furthermore, ShardedOpWQ does work on behalf of the finishers, if possible. (bluestore_no_finisher)
ShardedOpWQ in RTC mode already holds the pg lock, so I needed to remove the pg lock in the callback contexts.
I observed a performance increase in no_finisher mode (i7 machine, 2 replicas, 2 osds):
bluestore_rtc (12.5 KIOPS) -> bluestore_no_finisher (15 KIOPS)
Without bluestore_rtc, it's impossible for ShardedOpWQ to do callbacks on behalf of the finishers, so I added the patch to remove the finisher.
If needed, I'll separate the patch into two.
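A minimal sketch of the difference (all names here are illustrative stand-ins, not BlueStore's actual API): in the pipelined model the commit callback is handed to a finisher thread, costing a context switch, while in RTC mode the submitting thread runs it inline before returning.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Stand-in for a finisher: in real code a separate thread drains this queue.
struct Finisher {
  std::queue<std::function<void()>> q;
  void queue(std::function<void()> fn) { q.push(std::move(fn)); }
};

std::vector<std::string> trace;  // records what ran in the submitting thread

void submit_txn(bool rtc, Finisher& fin) {
  auto on_commit = [] { trace.push_back("commit-callback"); };
  trace.push_back("kv-commit");  // e.g. a fast submit_transaction_sync on NVDIMM
  if (rtc)
    on_commit();                 // run to completion: no handoff, no context switch
  else
    fin.queue(on_commit);        // pipelined: a finisher thread runs it later
}
```

With `rtc == true` the callback lands in the trace immediately; with `rtc == false` it sits in the finisher queue until another thread gets scheduled to run it, which is the context switch RTC avoids.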
@@ -58,6 +59,7 @@ class Context {
virtual void finish(int r) = 0;

public:
static const int NO_PGLOCK = 256;
Skipping the PG lock for all of these completions is simply not safe... not without a lot of other changes in the PG code.
I only skip the PG lock when ShardedOpWQ already holds the PG lock, so I think the above approach is safe.
If needed, I'll separate the patch into two.
@@ -6282,6 +6282,7 @@ void BlueStore::_txc_state_proc(TransContext *txc)
case TransContext::STATE_PREPARE:
txc->log_state_latency(logger, l_bluestore_state_prepare_lat);
if (txc->ioc.has_pending_aios()) {
txc->no_rtc = true;
The meaning of no_rtc is confusing. Making it a positive (can_rtc or can_submit_txn_sync) or describe what it actually means (is_pipelined_io?) would be better.
I think is_pipelined_io is better 👍
I'll change it
if (txc->osr->kv_finisher_submitting == 1) {
db->submit_transaction_sync(txc->t);
_txc_state_proc(txc);
return;
If I'm understanding correctly, the goal here is to avoid the context switch into the kv_sync_thread when the kv commits are blazing fast (e.g., nvdimm). That seems reasonable. What does that have to do with wal_txn, though? Wouldn't we want to do this regardless of whether there is wal IO? (There usually isn't!).
I think I can remove the txc->wal_txn condition.
It is redundant :)
It's not a redundant check.
Without checking txc->wal_txn, ShardedOpWQ tries to take osr->qlock multiple times.
f->complete(Context::NO_PGLOCK);
txc->oncommits.pop_front();
}
} else {
This PG lock stuff isn't safe. Even if it were, it should be separated out into a different patch.
I'll do it
pg->process_copy_chunk(oid, tid, r);
if (r == Context::NO_PGLOCK) {
if (last_peering_reset == pg->get_last_peering_reset()) {
pg->process_copy_chunk(oid, tid, r);
Calls like this will behave very poorly without the PG lock.
The value Context::NO_PGLOCK is passed only when the current thread is a ShardedOpWQ thread holding the PG lock.
So, I think it'll work correctly.
Actually, I did similar things when developing SPDK. I think this PR's real significance is pushing our future changes toward making everything shardable... three things need to be done first:
I see. In that case, I'd say we could remove the bluestore_sync_submit_transaction option and make bluestore_rtc control that behavior as well. But either way, yes, we really need to separate this into separate patches making each logical change. For the no_finisher patch, we should change all the completions to look more like if (!have_pglock) pg->lock(); ... if (!have_pglock) pg->unlock(); instead of duplicating every function body.
(Also, this is great!)
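The conditional-locking pattern suggested here could be sketched like this (PG is a minimal stand-in for illustration, not the real class):

```cpp
#include <cassert>
#include <mutex>

// Minimal stand-in for a PG: process_event() must run under the PG lock.
struct PG {
  std::mutex m;  // stand-in for the real PG lock
  int events = 0;
  void lock() { m.lock(); }
  void unlock() { m.unlock(); }
  void process_event() { ++events; }
};

// One completion body serves both cases: take the lock only if the caller
// (e.g. a ShardedOpWQ thread completing synchronously) doesn't already hold it.
void on_commit(PG* pg, bool have_pglock) {
  if (!have_pglock) pg->lock();
  pg->process_event();
  if (!have_pglock) pg->unlock();
}
```

The point of the pattern is that the function body is written once, rather than duplicated into locked and lock-free variants for every completion.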
Okay, I'll reflect the opinion and break the patch into two.
Two commits in the same PR is ideal, yes.
Force-pushed from 13422c8 to 026b6b1
I think #11624 is the cause of the build failure.
Force-pushed from 026b6b1 to 9ed8902
@liewegas Thank you
Force-pushed from f3879f5 to f24caf2
@liewegas Thank you
Force-pushed from f24caf2 to 9c4cf3a
The build check succeeded after rebasing on master.
Much better! I have several suggestions for the first patch, and a few for the second. I didn't review the atomics tracking completions carefully; will do that on the next pass once the first patch is sorted out. Thanks!!
@@ -6409,6 +6418,7 @@ void BlueStore::_txc_finish_io(TransContext *txc)
if (p->state < TransContext::STATE_IO_DONE) {
dout(20) << __func__ << " " << txc << " blocked by " << &*p << " "
<< p->get_state_name() << dendl;
txc->is_pipelined_io = true;
I don't think this does anything... we already set this to true for this txc when we submitted the IO in the PREPARE case above; no need to set it again.
I think what we actually want is to mark is_pipelined_io if any of the txc's ahead of us have/had IO and are in a state < KV_DONE.
Umm... the is_pipelined_io flag is used to check whether a ShardedOpWQ thread holding the pg lock is running.
If I change where is_pipelined_io is set as you suggested (STATE_PREPARE), then I can remove it.
My intention is that is_pipelined_io checks whether this code line is executed by ShardedOpWQ. This patch can already check the txc's ahead of us via the osr->kv_finisher_submitting counter.
Thank you
I see!
@@ -6322,11 +6323,19 @@ void BlueStore::_txc_state_proc(TransContext *txc)
<< dendl;
} else {
_txc_finalize_kv(txc, txc->t);
if (g_conf->bluestore_rtc && !txc->is_pipelined_io && txc->wal_txn) {
if (txc->osr->kv_finisher_submitting == 1) {
Can combine these ifs.
I still don't understand the relevance of wal_txn here. You mentioned earlier that it crashed without it... can you explain what happened? I don't think we should (need to) treat a WAL write any differently than, say, a delete (which has no IO at all).
Without txc->wal_txn, ShardedOpWQ, while holding osr->qlock, runs into _txc_finish and the _osr_reap_done method, which also requires osr->qlock.
I think I can solve the problem by not trying to take osr->qlock if there is only a metadata operation.
I'll change the code.
Thank you for your opinion
_txc_state_proc(txc);
return;
}
}
txc->kv_submitted = true;
May as well move kv_submitted above. Although nobody cares now, that might change in the future.
Actually, no... I think we want a different flag, kv_submitted_sync, so that we can explicitly condition the sync completions on a sync commit of the transaction (because it is strictly dependent on being in the caller context holding the pg lock).
Okay, I'll add the kv_submitted_sync flag.
Thank you for the opinion
@@ -1053,6 +1053,7 @@ OPTION(bluestore_debug_randomize_serial_transaction, OPT_INT, 0)
OPTION(bluestore_inject_wal_apply_delay, OPT_FLOAT, 0)
OPTION(bluestore_shard_finishers, OPT_BOOL, false)
OPTION(bluestore_rtc, OPT_BOOL, false)
OPTION(bluestore_no_finisher, OPT_BOOL, false)
how about bluestore_rtc_sync_completions, and default to true? (it'll only matter if bluestore_rtc is also true.)
I'll change bluestore_no_finisher to the name you suggested.
Thanks
@@ -58,6 +58,7 @@ class Context {
virtual void finish(int r) = 0;

public:
static const int HAVE_PGLOCK = 256;
nit: let's call this FLAG_SYNC indicating we are still in the submitting context. At this layer we don't know what a pglock is or what it means.
I'll change the variable's name.
Thank you for the opinion
auto f = txc->oncommits.front();
finishers[n]->queue(f);
txc->oncommits.pop_front();
if (g_conf->bluestore_no_finisher && !txc->is_pipelined_io &&
I think we want to use the new flag I mentioned (txc->kv_submitted_sync or similar?)... is_pipelined_io may be false for things that weren't done synchronously.
I'd like to know a case where is_pipelined_io is false but the commit at the above code line wasn't synchronous.
Thank you for the opinion.
@@ -6282,6 +6282,7 @@ void BlueStore::_txc_state_proc(TransContext *txc)
case TransContext::STATE_PREPARE:
txc->log_state_latency(logger, l_bluestore_state_prepare_lat);
if (txc->ioc.has_pending_aios()) {
txc->is_pipelined_io = true;
txc->state = TransContext::STATE_AIO_WAIT;
_txc_aio_submit(txc);
return;
...I think this is where we need an else case where we look for txc's ahead of us that are doing IO. I think it's sufficient to check that if the next txc is >= KV_DONE then we're safe; otherwise set is_pipelined_io?
Alternatively, flip this bool around to be safe_for_sync_commit or something...
I don't understand the last sentence.
Do you mean that I have to change the variable name 'is_pipelined_io' to 'safe_for_sync_commit' or something?
I'll check the unsafe case.
Thanks
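The predecessor check discussed above could be sketched like this (the enum values and names are assumptions for illustration, not BlueStore's actual definitions): a txc is safe for a sync commit only if every txc ahead of it in the sequencer has reached at least KV_DONE, i.e. has no IO still in flight.

```cpp
#include <cassert>
#include <deque>

// Illustrative transaction states, roughly ordered by pipeline progress.
enum State { PREPARE, AIO_WAIT, IO_DONE, KV_QUEUED, KV_DONE, WAL_APPLYING, FINISHING };

struct Txc { State state; };

// Walk the sequencer queue from the front; any txc ahead of 'me' that is
// still below KV_DONE may have unstable IO, so a sync commit isn't safe.
bool safe_for_sync_commit(const std::deque<Txc*>& q, const Txc* me) {
  for (const Txc* p : q) {
    if (p == me)
      return true;           // every predecessor already reached KV_DONE
    if (p->state < KV_DONE)
      return false;          // a txc ahead of us may still have IO in flight
  }
  return true;
}
```

This is the same invariant the kv_finisher_submitting counter approximates with a single atomic instead of a queue scan.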
I skimmed through your comments. It's 5:23 A.M. here :D Thank you for your comments.
Force-pushed from 9c4cf3a to 9d367ec
@liewegas Thank you
finishers[n]->queue(f);
txc->oncommits.pop_front();
if (g_conf->bluestore_rtc_sync_completions && txc->kv_submitted_sync &&
txc->wal_txn && txc->osr->is_finisher_done()) {
The txc->wal_txn condition is needed.
It's hard to apply the bluestore_rtc_sync_completions option to transactions without data.
That's because some cases cannot pass FLAG_SYNC to the destructors calling context->complete():
~RunOnDelete() { /* FLAG_SYNC can't be delivered to the to_run object... */
~FlushState() { /* FLAG_SYNC can't be delivered to this destructor... */
So I added the wal_txn condition (transactions without data) to exclude the above cases.
Even if it's hard to avoid taking the pg lock when ShardedOpWQ already holds it, I want to add this option.
That's because there is a performance gain (14 KIOPS (56 MB/s) w/ finishers -> 17.5 KIOPS (70 MB/s) w/o finishers), 2 osds, 2 replicas.
This problem is resolved by #11844.
The destructors are called only by the finishers (onreadable).
kv_queue.push_back(txc);
kv_cond.notify_one();
return;
}
I added this condition to avoid trying to take osr->qlock in the _txc_finish method several times, which conflicts with the _txc_finish_io method taking osr->qlock.
To solve this problem, calling the _txc_finish method is delegated to kv_sync_thread in the metadata-only I/O case.
Force-pushed from 9d367ec to cf3ca0d
@@ -6531,6 +6555,7 @@ void BlueStore::_txc_finish_kv(TransContext *txc)
txc->oncommits.pop_front();
}
Can I reorder the oncommits callbacks with the onreadable callback?
I didn't apply sync_completions to the oncommits callbacks, because I think oncommits must come after onreadable.
I want to know whether oncommits pushed back by BlueStore::flush_commit should wait for onreadable or oncommit.
Thanks
Re: wal_txn in the condition. This strikes me as extremely fragile: we are inferring whether the caller can handle a synchronous callback from whether there happens to be IO associated with the transaction. Instead, let's have a patch that precedes this one that adds a flags argument to queue_transaction, and define a flag that indicates that synchronous callbacks are allowed. We can store the flags value in the TransContext. Then, if I'm not mistaken, the first patch shouldn't need wal_txn, but the second patch should check that when deciding whether to do the callback synchronously...
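A sketch of the flags idea (the flag name and the signature shown are assumptions for illustration, not the eventual API): the caller declares up front that it tolerates synchronous callbacks, and the flag is stashed in the TransContext so the store never has to infer intent from wal_txn.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag: the caller can handle its completion callbacks being
// invoked synchronously, in the submitting context.
constexpr uint32_t TXN_ALLOW_SYNC_COMPLETIONS = 1u << 0;

struct TransContext {
  uint32_t flags = 0;
  bool may_complete_sync() const {
    return flags & TXN_ALLOW_SYNC_COMPLETIONS;
  }
};

// Simplified stand-in for queue_transactions with a flags argument: the
// decision is recorded at queue time rather than guessed later.
TransContext queue_transactions(uint32_t flags) {
  TransContext txc;
  txc.flags = flags;
  return txc;
}
```

The design advantage is that the contract lives with the caller, so a data-free transaction from a sync-tolerant caller can still take the fast path.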
kv_queue.push_back(txc);
kv_cond.notify_one();
return;
}
@liewegas
I'll explain the !txc->wal_txn condition in the first patch.
To have _txc_finish and _osr_reap_done called by a thread other than ShardedOpWQ, I added the !txc->wal_txn condition.

wal_txn case:
Context1 (ShardedOpWQ, pg lock): _txc_finish_io -> _txc_finish_kv (osr->qlock) -> _wal_apply -> _txc_aio_submit
Context2 (KernelDevice aio_thread): _wal_finish
Context3 (kv_sync_thread, osr->qlock): _kv_sync_thread -> _txc_finish calling _osr_reap_done (osr->qlock)
There's no problem.

w/o the !wal_txn condition in _txc_state_proc in the STATE_KV_DONE case:
Context1 (ShardedOpWQ, pg lock): _txc_finish_io -> _txc_finish_kv (osr->qlock) -> _txc_finish (osr->qlock)
There is a problem: ShardedOpWQ tries to take osr->qlock, which the same thread has already taken. Furthermore, _osr_reap_done tries to remove elements of osr->q while _txc_finish_io iterates over osr->q.
So I added the if (txc->kv_submitted_sync && !txc->wal_txn) statement, so that kv_sync_thread calls _txc_finish, which calls _osr_reap_done.

w/ the !wal_txn condition in _txc_state_proc in the STATE_KV_DONE case:
Context1 (ShardedOpWQ, pg lock): _txc_finish_io -> _txc_finish_kv (osr->qlock)
Context2 (kv_sync_thread, osr->qlock): _kv_sync_thread -> _txc_finish calling _osr_reap_done (osr->qlock)
I think this way is okay, because kv_sync_thread isn't the critical path in this case.
I think we can avoid the qlock issue by making osr_reap_done do a try_lock, and dropping the lock surrounding the STATE_DONE assignment. I'll play with this locally...
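The try_lock idea could look something like this sketch (illustrative names, not the real BlueStore types): if qlock is already held, the reap is skipped instead of deadlocking, and a later caller reaps the accumulated backlog.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <thread>

struct OpSequencer {
  std::mutex qlock;
  std::deque<int> q;  // placeholders for completed txcs awaiting reap
  int reaped = 0;

  // Reap completed txcs only if qlock is free; otherwise defer to the next
  // caller rather than blocking (or self-deadlocking).
  bool try_reap() {
    std::unique_lock<std::mutex> l(qlock, std::try_to_lock);
    if (!l.owns_lock())
      return false;  // contended: leave the backlog for later
    reaped += static_cast<int>(q.size());
    q.clear();
    return true;
  }
};

// Demonstrate contention: while qlock is held here, a try_reap() from
// another thread must fail immediately instead of blocking.
bool contended_reap_fails(OpSequencer& osr) {
  std::unique_lock<std::mutex> hold(osr.qlock);
  bool failed = false;
  std::thread t([&] { failed = !osr.try_reap(); });
  t.join();
  return failed;
}
```

Nothing is lost by a failed attempt: the txcs stay queued and the next reaper picks them up.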
@liewegas Your opinion is to add a flag to the queue_transactions method. Do you mean that I should discard the FLAG_SYNC option for each context, or not? Thank you.
I will also try applying try_lock to solve this qlock issue. 👍 Thank you for your good opinion.
@liewegas Update: I have modified my code and I'm testing with unique_lock.
Force-pushed from cf3ca0d to 244d9c0
@liewegas I think I'm done with the 1st patch. There's no txc->wal_txn condition. Can you please review the patches?
I went to play with these patches locally and ended up needing to make several other fixes/cleanups first to fix bugs with sync_submit_transaction. Pushing those first; see #11983. There is still a deeper problem, though, that I think we have to address first. The normal kv_sync_thread loop does: while (true) { sync bdev; submit transactions; submit sync transactions; sync bdev; call completions; }. The current bluestore_sync_submit_transaction completely skips the initial bdev sync, which means that we may queue a kv metadata write that references blocks that aren't stable on disk yet. We need to fix this, probably by adding a bdev->sync here, which will either be very slow (normal devices) or free (NVRAM).
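The ordering constraint can be reduced to this sketch (function names are illustrative stand-ins): the initial flush is what keeps a kv metadata write from referencing data blocks that aren't yet stable on disk.

```cpp
#include <cassert>
#include <string>
#include <vector>

std::vector<std::string> iolog;  // records the order of operations

void bdev_flush()       { iolog.push_back("flush"); }
void submit_kv_commit() { iolog.push_back("kv-commit"); }
void call_completions() { iolog.push_back("complete"); }

// One iteration of the loop described above: if any prior data IO may be
// unstable, the block device must be flushed before the kv commit that
// references those blocks is submitted.
void kv_sync_iteration(bool unstable_io) {
  if (unstable_io)
    bdev_flush();       // make prior data writes durable first
  submit_kv_commit();   // metadata may now safely reference those blocks
  call_completions();
}
```

On NVRAM the flush is essentially free, which is why skipping it only when there is provably no unstable IO (rather than unconditionally) is the safe optimization.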
@liewegas I have a question about your comment. Do we need to flush the bdev first in the WAL case? Thank you.
If it's a pure WAL write, and all prior txcs also have no unstable IO, then yes, we can simply do a sync kv commit. The OpSequencer counter in my other PR addresses this, I think. And in fact the change to do the commit sync on top of it is pretty simple... playing with that now.
@liewegas I already added a variable, the osr->kv_finisher_submitting counter, to check for unstable transactions in the same sequencer.
Is there any other condition I have to consider? Thank you.
Ah, yes, that looks right. I'd like to rebase this on top of the wip-bluestore-release branch, though, since that fixes the existing sync_submit_transaction path to delay the kv submit if there is unstable IO. I started to adapt your patch at https://github.com/liewegas/ceph/commits/wip-bluestore-rtc, but ended up simplifying a bit (and breaking it into smaller changes). I also added a per-txc field that tracks whether we are still in the queuing execution context... I think this will be a more reliable/safer way to decide whether the qlock dance is necessary?
@liewegas But I have an opinion about the flag, and I left my comment on your branch (https://github.com/liewegas/ceph/commits/wip-bluestore-rtc).
The kv_finisher_submitting counter increment can be moved to right after TransContext::STATE_IO_DONE:
@@ -6961,6 +6978,7 @@ int BlueStore::queue_transactions(

// prepare
TransContext *txc = _txc_create(osr);
++txc->osr->kv_finisher_submitting;
@liewegas
I think this line can be moved to right after TransContext::STATE_IO_DONE.
jenkins test this please (eio error now ignored in master)
@dachary
Force-pushed from 244d9c0 to 828578c
@dachary Thanks
The bluestore_rtc option was added

- ShardedOpWQ calls submit_transaction_sync and queues the txc to the finishers when WAL is used
- The consistency problem has been solved, using the sequencer and its member variables

Signed-off-by: Taeksang Kim <voidbag@gmail.com>
- It must be used with the bluestore_rtc option
- The Context::FLAG_SYNC constant was added to do context callbacks without the pg lock
- It's desirable to use this option on a machine with few cores

Signed-off-by: Taeksang Kim <voidbag@gmail.com>
Force-pushed from 828578c to 166b882
The rebase on top of master is finished. Thanks.
Thanks for notifying me.
See #12062.
Decouple Block Device I/O from WAL/DB Device I/O
When NVDIMM is used as the WAL & DB device, the RTC model is better than the pipeline model.
bluestore_rtc option was added
bluestore_no_finisher option was added
Current WAL Path
ShardedOpWQ -> kv_sync_thread -> finishers -> Pipe:Writer
I/O sequence:
submit_transaction(NVDIMM) -> [1]bdev->flush(SSD, bottleneck)
-> submit_transaction_sync(NVDIMM) -> txc_finish_kv(reply)
================= Above Path is the Critical Path of Client ===============
-> wal_apply(SSD) -> bdev->flush(SSD) -> rm_single_key(NVDIMM)
-> submit_transaction_sync(NVDIMM)
======================= End of Transaction ======================
The above [1] bdev->flush isn't for this transaction but for the previous transaction, so it can be bypassed.
WAL Path w/ bluestore_rtc on
ShardedOpWQ -> finishers -> Pipe:Writer
-> submit_transaction_sync(NVDIMM) -> txc_finish_kv(reply)
================= Above Path is the Critical Path of Client ===============
-> wal_apply(SSD) -> bdev->flush(SSD) -> rm_single_key(NVDIMM)
-> submit_transaction_sync(NVDIMM)
======================= End of Transaction ======================
In the above path, the bdev->flush has been removed.
WAL Path w/ bluestore_rtc & bluestore_no_finisher on
ShardedOpWQ -> Pipe:Writer
-> submit_transaction_sync(NVDIMM) -> txc_finish_kv(reply)
================= Above Path is the Critical Path of Client ===============
-> wal_apply(SSD) -> bdev->flush(SSD) -> rm_single_key(NVDIMM)
-> submit_transaction_sync(NVDIMM)
======================= End of Transaction ======================
In the above path, the finishers weren't used.
Signed-off-by: Taeksang Kim voidbag@gmail.com