
os/bluestore: RTC programming model has been applied #11814

Closed · wants to merge 2 commits

Conversation

@voidbag (Contributor) commented Nov 7, 2016

Decouple block device I/O from WAL/DB device I/O.
When NVDIMM is used as the WAL & DB device, the RTC model performs better than the pipeline model.

The bluestore_rtc option was added

  • ShardedOpWQ calls submit_transaction_sync and queues the txc to the finishers when WAL is used
  • The consistency problem is solved using the sequencer and its member variables

The bluestore_no_finisher option was added

  • It must be used together with the bluestore_rtc option
  • The Context::NO_PGLOCK constant was added so that context callbacks can run without the pg lock
  • This option is most useful on machines with a small number of cores
  • This option is most useful in replication mode

Current WAL Path

ShardedOpWQ -> kv_sync_thread -> finishers -> Pipe:Writer
I/O sequence:
submit_transaction(NVDIMM) -> [1]bdev->flush(SSD, bottleneck)
-> submit_transaction_sync(NVDIMM) -> txc_finish_kv(reply)
================= Above Path is the Critical Path of Client ===============
-> wal_apply(SSD) -> bdev->flush(SSD) -> rm_single_key(NVDIMM)
-> submit_transaction_sync(NVDIMM)
======================= End of Transaction ======================
The bdev->flush at [1] is not for this transaction but for a previous one,
so it can be bypassed.

WAL Path w/ bluestore_rtc on

ShardedOpWQ -> finishers -> Pipe:Writer

  • I/O sequence: submit_transaction(NVDIMM) ->
    -> submit_transaction_sync(NVDIMM) -> txc_finish_kv(reply)
    ================= Above Path is the Critical Path of Client ===============
    -> wal_apply(SSD) -> bdev->flush(SSD) -> rm_single_key(NVDIMM)
    -> submit_transaction_sync(NVDIMM)
    ======================= End of Transaction ======================
    In this path, the initial bdev->flush has been removed

WAL Path w/ bluestore_rtc & bluestore_no_finisher on

ShardedOpWQ -> Pipe:Writer

  • I/O sequence: submit_transaction(NVDIMM) ->
    -> submit_transaction_sync(NVDIMM) -> txc_finish_kv(reply)
    ================= Above Path is the Critical Path of Client ===============
    -> wal_apply(SSD) -> bdev->flush(SSD) -> rm_single_key(NVDIMM)
    -> submit_transaction_sync(NVDIMM)
    ======================= End of Transaction ======================
    In this path, the finishers are not used (a sketch of the gating logic follows below)
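
A minimal sketch (pseudocode built from the identifiers used in this PR, not the exact patch) of the gate that decides between the RTC path and the classic pipelined path:

  if (g_conf->bluestore_rtc &&
      !txc->is_pipelined_io &&                  // no data aio in flight for this txc
      txc->osr->kv_finisher_submitting == 1) {  // nothing unstable ahead of us in the sequencer
    db->submit_transaction_sync(txc->t);        // commit on NVDIMM in the submitting thread
    _txc_state_proc(txc);                       // run to completion, no kv_sync_thread hop
  } else {
    kv_queue.push_back(txc);                    // fall back to the pipelined path
    kv_cond.notify_one();
  }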

Signed-off-by: Taeksang Kim voidbag@gmail.com

@liewegas (Member) left a comment

There are 2 big changes here combined into a single commit. Let's ignore the second change (finishers, no pg lock) since the locking change is complex and focus just on the first change (sync transaction commits). First, we'll want a PR that does just the transaction part.

Can you explain what RTC means?

The flush before we commit the transaction is important because the data has to be stable before we commit metadata that references it. I'm guessing that's why you are checking for wal_txn (if it's a WAL write you don't need to do the initial flush) but that isn't quite right: you can have a write that includes a wal write and a direct write. You should be explicitly looking at whether there is IO beforehand, either for this txn or any previous one in the sequencer. So I think you're on the right track with the no_rtc checks in _txc_finish_io. That's good, although the variable name is confusing. But with a simpler patch that focuses on just this change I think we will be in good shape.

I think we should revisit the config option names for bluestore_sync_submit_transaction and this option (bluestore_sync_submit_transaction_sync?). :/ 'rtc' doesn't really describe what's going on...

@@ -1050,6 +1050,9 @@ OPTION(bluestore_debug_inject_read_err, OPT_BOOL, false)
OPTION(bluestore_debug_randomize_serial_transaction, OPT_INT, 0)
OPTION(bluestore_inject_wal_apply_delay, OPT_FLOAT, 0)
OPTION(bluestore_shard_finishers, OPT_BOOL, false)
OPTION(bluestore_rtc, OPT_BOOL, false)
Member

What does RTC stand for?

Contributor Author

RTC stands for Run To Completion, which means that a thread carries its work through to the end without a context switch.
In BlueStore, RTC means that ShardedOpWQ (the thread pool) does the WAL commit (on NVDIMM) without a context switch (bluestore_rtc).
Furthermore, ShardedOpWQ does the work of the finishers where possible (bluestore_no_finisher).
ShardedOpWQ in RTC mode already holds the pg lock, so I needed to skip taking the pg lock in the callback contexts.

I observed a performance increase in no_finisher mode (i7 machine),
2 replicas, 2 OSDs:
bluestore_rtc (12.5 KIOPS) -> bluestore_no_finisher (15 KIOPS)

Without bluestore_rtc, it's impossible for ShardedOpWQ to do callbacks on behalf of the finishers,
so I added the patch to remove the finisher.
If needed, I'll separate the patch into two.

@@ -58,6 +59,7 @@ class Context {
virtual void finish(int r) = 0;

public:
static const int NO_PGLOCK = 256;
Member

Skipping the PG lock for all of these completions is simply not safe... not without a lot of other changes in the PG code.

Contributor Author

I only skipped the PG lock when ShardedOpWQ already holds the PG lock.
So, I think the above approach is safe.

If needed, I'll split the patch into two.

@@ -6282,6 +6282,7 @@ void BlueStore::_txc_state_proc(TransContext *txc)
case TransContext::STATE_PREPARE:
txc->log_state_latency(logger, l_bluestore_state_prepare_lat);
if (txc->ioc.has_pending_aios()) {
txc->no_rtc = true;
Member

The meaning of no_rtc is confusing. Making it positive (can_rtc or can_submit_txn_sync) or describing what it actually means (is_pipelined_io?) would be better.

Contributor Author

I think is_pipelined_io is better 👍
I'll change it

if (txc->osr->kv_finisher_submitting == 1) {
db->submit_transaction_sync(txc->t);
_txc_state_proc(txc);
return;
Member

If I'm understanding correctly, the goal here is to avoid the context switch into the kv_sync_thread when the kv commits are blazing fast (e.g., nvdimm). That seems reasonable. What does that have to do with wal_txn, though? Wouldn't we want to do this regardless of whether there is wal IO? (There usually isn't!).

Contributor Author

I think I can remove the txc->wal_txn condition.
It is redundant :)

Contributor Author

It's not a redundant check.
Without checking txc->wal_txn, ShardedOpWQ tries to take osr->qlock multiple times.

f->complete(Context::NO_PGLOCK);
txc->oncommits.pop_front();
}
} else {
Member

This PG lock stuff isn't safe. Even if it were, it should be separated out into a different patch.

Contributor Author

I'll do it

pg->process_copy_chunk(oid, tid, r);
if (r == Context::NO_PGLOCK) {
if (last_peering_reset == pg->get_last_peering_reset()) {
pg->process_copy_chunk(oid, tid, r);
Member

Calls like this will behave very poorly without the PG lock.

Contributor Author

The value Context::NO_PGLOCK is passed only when the current thread is ShardedOpWQ and it already holds the PG lock.

So, I think it will work correctly.

@yuyuyu101 (Member)

Actually, I do similar things when developing the SPDK support; I think the point of this PR now is that it pushes our future changes toward making everything shardable...

Three things need to be done first:

  1. async read
  2. avoid signal/wait
  3. almost everything should be non-blocking

@liewegas (Member) commented Nov 7, 2016

I see. In that case, I'd say we could remove the bluestore_sync_submit_transaction option and make bluestore_rtc control that behavior as well.

But either way, yes, we really need to split this into separate patches, one per logical change.

For the no_finisher patch, we should change all the completions to look more like

if (!have_pglock)
  pg->lock();
...
if (!have_pglock)
  pg->unlock();

instead of duplicating every function body.
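
A small, self-contained C++ sketch of that pattern (stand-in types, not the real OSD classes): one copy of the completion body, and the PG lock is taken only when the caller does not already hold it.

  #include <mutex>

  struct PGStub {
    std::mutex pg_lock;                       // stands in for PG::lock()/unlock()
    void process_completion(int r) { /* ... the real completion work ... */ }
  };

  void complete(PGStub *pg, int r, bool have_pglock) {
    if (!have_pglock)
      pg->pg_lock.lock();
    pg->process_completion(r);                // single implementation, no duplicated body
    if (!have_pglock)
      pg->pg_lock.unlock();
  }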

@liewegas (Member) commented Nov 7, 2016

(Also, this is great!)

@voidbag (Contributor Author) commented Nov 7, 2016

Okay, I'll incorporate the feedback and break the patch into two.
I have a question for you, @liewegas:
I understand you are asking me to make two commits and send a pull request.
Am I right?

@liewegas (Member) commented Nov 7, 2016

Two commits in the same PR is ideal, yes.

@voidbag (Contributor Author) commented Nov 7, 2016

I think #11624 is the cause of the build failure.

@voidbag (Contributor Author) commented Nov 8, 2016

@liewegas
I updated the RTC code with the bugs fixed.
There are 3 commits.

Thank you

@voidbag force-pushed the wip-voidbag-rtc branch 2 times, most recently from f3879f5 to f24caf2, on November 8, 2016 06:19
@voidbag (Contributor Author) commented Nov 8, 2016

@liewegas
I made two commits based on the recent master.
Can you please review the patches?

Thank you

@voidbag (Contributor Author) commented Nov 9, 2016

The build check succeeded after rebasing onto master.

@liewegas (Member) left a comment

Much better! I have several suggestions for the first patch, and a few for the second. I didn't review the atomics tracking completions carefully; will do that on the next pass once the first patch is sorted out. Thanks!!

@@ -6409,6 +6418,7 @@ void BlueStore::_txc_finish_io(TransContext *txc)
if (p->state < TransContext::STATE_IO_DONE) {
dout(20) << __func__ << " " << txc << " blocked by " << &*p << " "
<< p->get_state_name() << dendl;
txc->is_pipelined_io = true;
Member

I don't think this does anything... we already set this to true for this txc when we submitted the io in the PREPARE case above; no need to set it again.

I think what we actually want is to mark is_pipelined_io if any of the txc's ahead of us have/had IO and are in a state < KV_DONE.

Contributor Author

Umm... the is_pipelined_io flag is used to check whether ShardedOpWQ, which holds the pg lock, is running.
If I change where is_pipelined_io is set as you suggested (STATE_PREPARE), then I can remove this line.
My intention is that is_pipelined_io checks whether this code line is being executed by ShardedOpWQ. The patch can already check the txcs ahead of us via the osr->kv_finisher_submitting counter.

Thank you

Member

I see!

@@ -6322,11 +6323,19 @@ void BlueStore::_txc_state_proc(TransContext *txc)
<< dendl;
} else {
_txc_finalize_kv(txc, txc->t);
if (g_conf->bluestore_rtc && !txc->is_pipelined_io && txc->wal_txn) {
if (txc->osr->kv_finisher_submitting == 1) {
Member

Can combine these ifs.

I still don't understand the relevance of wal_txn here. You mentioned earlier that it crashed without it.. can you explain what happened? I don't think we should (need to) treat a wal write any differently than, say, a delete (which has no IO at all).

Contributor Author

Without txc->wal_txn, ShardedOpWQ, while holding osr->qlock, runs on into _txc_finish and the _osr_reap_done method, which also require osr->qlock.

I think I can solve the problem by not trying to take osr->qlock when there is only a metadata operation.
I'll change the code.

Thank you for your opinion

_txc_state_proc(txc);
return;
}
}
txc->kv_submitted = true;
Member

May as well move kv_submitted above. Although nobody cares now, that might change in the future.

Member

Actually, no... I think we want a different flag, kv_submitted_sync, so that we can explicitly condition the sync completions on a sync commit of the transaction (because it is strictly dependent on being in the caller context holding the pg lock).

Contributor Author

Okay, I'll add the kv_submitted_sync flag.

Thank you for the opinion

@@ -1053,6 +1053,7 @@ OPTION(bluestore_debug_randomize_serial_transaction, OPT_INT, 0)
OPTION(bluestore_inject_wal_apply_delay, OPT_FLOAT, 0)
OPTION(bluestore_shard_finishers, OPT_BOOL, false)
OPTION(bluestore_rtc, OPT_BOOL, false)
OPTION(bluestore_no_finisher, OPT_BOOL, false)
Member

How about bluestore_rtc_sync_completions, defaulting to true? (It will only matter if bluestore_rtc is also true.)

Contributor Author

I'll change bluestore_no_finisher to the name you suggested.
Thanks

@@ -58,6 +58,7 @@ class Context {
virtual void finish(int r) = 0;

public:
static const int HAVE_PGLOCK = 256;
Member

nit: let's call this FLAG_SYNC indicating we are still in the submitting context. At this layer we don't know what a pglock is or what it means.

Contributor Author

I'll change the variable's name.
Thank you for the opinion

auto f = txc->oncommits.front();
finishers[n]->queue(f);
txc->oncommits.pop_front();
if (g_conf->bluestore_no_finisher && !txc->is_pipelined_io &&
Member

I think we want to use the new flag I mentioned (txc->kv_submitted_sync or similar?).. is_pipelined_io may be false for things that weren't done synchronously

Contributor Author

I'd like to know when is_pipelined_io can be false while there is no synchronous IO at this code line.

Thank you for the opinion.

@@ -6282,6 +6282,7 @@ void BlueStore::_txc_state_proc(TransContext *txc)
case TransContext::STATE_PREPARE:
txc->log_state_latency(logger, l_bluestore_state_prepare_lat);
if (txc->ioc.has_pending_aios()) {
txc->is_pipelined_io = true;
txc->state = TransContext::STATE_AIO_WAIT;
_txc_aio_submit(txc);
return;
Member

I think this is where we need an else case in which we look for txcs ahead of us that are doing IO. I think it's sufficient to check that if the next txc is >= KV_DONE then we're safe; otherwise set is_pipelined_io?

Alternatively, flip this bool around to be safe_for_sync_commit or something...
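
A self-contained model of that check (illustrative types and state names, not BlueStore's): before allowing a sync commit, scan the transactions ahead of us in the same sequencer and mark ourselves as pipelined if any of them still has unstable IO.

  #include <list>

  enum TxcState { STATE_PREPARE, STATE_AIO_WAIT, STATE_IO_DONE,
                  STATE_KV_QUEUED, STATE_KV_DONE, STATE_DONE };

  struct Txc {
    TxcState state = STATE_PREPARE;
    bool is_pipelined_io = false;
  };

  // 'q' is the sequencer queue, oldest first; 'me' has already been appended at the back.
  void mark_if_pipelined(const std::list<Txc*>& q, Txc *me) {
    for (Txc *p : q) {
      if (p == me)
        break;                              // only transactions ahead of us matter
      if (p->state < STATE_KV_DONE) {       // prior IO/commit not yet stable
        me->is_pipelined_io = true;
        return;
      }
    }
  }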

Contributor Author

I don't understand the last sentence.
Do you mean that I should change the variable name 'is_pipelined_io' to 'safe_for_sync_commit' or something similar?

Contributor Author

I'll check the unsafe case.

Thanks

@voidbag (Contributor Author) commented Nov 9, 2016

@liewegas

I skimmed through your comments.
I'll read your comments completely and write replies.
Then I'll push updated patches based on the comments in a few days.

It's 5:23 A.M. here :D

Thank you for your comments

@voidbag (Contributor Author) commented Nov 13, 2016

@liewegas
I modified the code according to your suggestions.
Can you please review the patches?

Thank you

finishers[n]->queue(f);
txc->oncommits.pop_front();
if (g_conf->bluestore_rtc_sync_completions && txc->kv_submitted_sync &&
txc->wal_txn && txc->osr->is_finisher_done()) {
Contributor Author

The txc->wal_txn condition is needed.
It's hard to apply the bluestore_rtc_sync_completions option to transactions without data.
That's because some cases cannot pass FLAG_SYNC to the destructors calling context->complete()...
~RunOnDelete() { /* FLAG_SYNC can't be delivered to the to_run object... */
~FlushState() { /* FLAG_SYNC can't be delivered to this destructor... */

So I added the wal_txn condition to exclude the above cases (transactions with no data).

Contributor Author

Even if it's hard not to take the pg lock when ShardedOpWQ already holds it, I want to add this option.
That's because there is a performance gain (14 KIOPS (56 MB/s) with finishers -> 17.5 KIOPS (70 MB/s) without finishers), with 2 OSDs, 2 replicas.

@voidbag (Contributor Author) Nov 14, 2016

This problem is resolved by #11844.
The destructors are called only by the finishers (onreadable).

kv_queue.push_back(txc);
kv_cond.notify_one();
return;
}
Contributor Author

I added this condition to avoid trying to take osr->qlock in the _txc_finish method multiple times, which conflicts with the _txc_finish_io method taking osr->qlock.
To solve this problem, the call to _txc_finish is delegated to kv_sync_thread in the case of metadata-only I/O.

@voidbag (Contributor Author) commented Nov 14, 2016

@liewegas
I rebased my branch on PR #11844.
Thanks

@@ -6531,6 +6555,7 @@ void BlueStore::_txc_finish_kv(TransContext *txc)
txc->oncommits.pop_front();
}
Contributor Author

Can I reorder the oncommits callbacks with the onreadable callback?
I didn't apply sync_completions to the oncommits callbacks,
because I think oncommits must come after onreadable.
I want to know whether the oncommits pushed back by BlueStore::flush_commit should wait for onreadable or for oncommit.

Thanks

@liewegas (Member)

Re: wal_txn in the condition. This strikes me as extremely fragile: we are inferring whether the caller can handle a synchronous callback from whether there happens to be IO associated with the transaction.

Instead, let's have a patch that precedes this that adds a flags argument to queue_transaction, and define a flag that indicates that synchronous callbacks are allowed. We can store the flags value in the TransContext.

Then, if I'm not mistaken, the first patch shouldn't need wal_txn, but the second patch should check that when deciding whether to do the callback synchronously...
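
A hedged sketch of that interface change (all names here are illustrative, not the merged Ceph API): the caller passes a flags word into the queue call, the flags are stored on the transaction context, and the completion path consults them instead of inferring intent from wal_txn.

  #include <cstdint>

  struct TransContextSketch {
    uint32_t queue_flags = 0;
    bool kv_submitted_sync = false;
  };

  // Set by a caller that can tolerate its on-commit callback running synchronously
  // in the submitting thread (for example, because it already holds its own locks).
  constexpr uint32_t TXN_FLAG_ALLOW_SYNC_COMPLETION = 1u << 0;

  void queue_transaction(TransContextSketch *txc, uint32_t flags) {
    txc->queue_flags = flags;               // remembered for the completion path
    // ... prepare and submit the transaction ...
  }

  bool can_complete_synchronously(const TransContextSketch *txc, bool rtc_sync_completions) {
    return rtc_sync_completions &&
           txc->kv_submitted_sync &&
           (txc->queue_flags & TXN_FLAG_ALLOW_SYNC_COMPLETION);
  }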

kv_queue.push_back(txc);
kv_cond.notify_one();
return;
}
@voidbag (Contributor Author) Nov 14, 2016

@liewegas
I'll explain the !txc->wal_txn condition in the first patch.
I added the !txc->wal_txn condition so that _txc_finish and _osr_reap_done are called by a thread other than ShardedOpWQ.

wal_txn case:

Context1 (ShardedOpWQ, pg lock): _txc_finish_io -> _txc_finish_kv (osr->qlock) -> _wal_apply -> _txc_aio_submit
Context2 (KernelDevice aio_thread): _wal_finish
Context3 (kv_sync_thread): _kv_sync_thread -> _txc_finish (osr->qlock) -> _osr_reap_done (osr->qlock)

There is no problem here.

Without the !wal_txn condition in _txc_state_proc, STATE_KV_DONE case:

Context1 (ShardedOpWQ, pg lock): _txc_finish_io -> _txc_finish_kv (osr->qlock) -> _txc_finish (osr->qlock)

There is a problem: ShardedOpWQ tries to take osr->qlock, which the same thread has already taken.
Furthermore, _osr_reap_done tries to remove elements of osr->q while _txc_finish_io iterates osr->q.
So I added the if (txc->kv_submitted_sync && !txc->wal_txn) statement, so that kv_sync_thread calls _txc_finish, which calls _osr_reap_done.

With the !wal_txn condition in _txc_state_proc, STATE_KV_DONE case:

Context1 (ShardedOpWQ, pg lock): _txc_finish_io -> _txc_finish_kv (osr->qlock)
Context2 (kv_sync_thread): _kv_sync_thread -> _txc_finish (osr->qlock) -> _osr_reap_done (osr->qlock)

I think this way is okay, because kv_sync_thread is not on the critical path in this case.

@liewegas (Member)

I think we can avoid the qlock issue by making osr_reap_done do a try_lock, and dropping the lock surrounding the STATE_DONE assignment. I'll play with this locally...
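
A rough, self-contained sketch of the try_lock idea (names mirror this thread, not the final BlueStore code): if another thread, for example one iterating the sequencer queue under qlock, currently holds the lock, skip reaping now and let a later call pick up the finished entries.

  #include <mutex>
  #include <deque>

  enum ReapState { TXC_RUNNING, TXC_DONE };
  struct TxcEntry { ReapState state = TXC_RUNNING; };

  struct OpSequencerSketch {
    std::mutex qlock;
    std::deque<TxcEntry*> q;

    void reap_done() {
      std::unique_lock<std::mutex> l(qlock, std::try_to_lock);
      if (!l.owns_lock())
        return;                             // contended: a later call will reap
      while (!q.empty() && q.front()->state == TXC_DONE) {
        delete q.front();
        q.pop_front();
      }
    }
  };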

@voidbag (Contributor Author) commented Nov 14, 2016

@liewegas
In the second patch, I already removed the wal_txn condition for synchronous callbacks.

Your suggestion is to add a flags argument to the queue_transactions method.
I think that callback contexts shouldn't take the pg lock when executing callbacks synchronously.
I understand you want me to add logic along the lines of "if an existing callback context needs the pg lock, then the new argument (is_sync_callback) is set to true".

Then, do you mean that I should discard the FLAG_SYNC value passed to each context, or not?
I'm a little confused...

Thank you.

@voidbag (Contributor Author) commented Nov 14, 2016

I will also try applying try_lock to solve this qlock issue.

👍 Thank you for your good opinion

@voidbag (Contributor Author) commented Nov 14, 2016

@liewegas
I forgot to mention an important thing about the qlock issue.
I tried to take qlock only if !wal_txn && kv_submitted_sync.
Then another issue appeared:
the elements of osr->q are removed in _osr_reap_done while _txc_finish_io iterates osr->q.
I think that issue is what led me to using kv_sync_thread.

Update:
I found my mistake.
I used lock_guard like this:
if (wal_txn || !kv_submitted_sync) std::lock_guard<std::mutex>
The condition above is useless, because the lock is released automatically as soon as the guard goes out of scope.

I have modified my code and I'm testing with unique_lock
Thanks
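
A self-contained illustration of the scoping mistake described above and the unique_lock/defer_lock alternative (the mutex and condition names are placeholders):

  #include <mutex>

  std::mutex qlock;

  void broken(bool need_lock) {
    if (need_lock)
      std::lock_guard<std::mutex> l(qlock);   // BUG: the guard's scope is just this
                                              // statement, so qlock is released here
    // ... this runs without qlock held, even when need_lock is true ...
  }

  void fixed(bool need_lock) {
    std::unique_lock<std::mutex> l(qlock, std::defer_lock);
    if (need_lock)
      l.lock();                               // held for the rest of the function
    // ... critical section ...
  }                                           // unlocked here only if it was locked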

@voidbag (Contributor Author) commented Nov 15, 2016

@liewegas
I modified the 1st patch based on your suggestions.
- I resolved the problem of popping osr->q by keeping the previous TransContext in a variable (a cursor).
- I used unique_lock(osr->qlock, std::defer_lock) instead of lock_guard(osr->qlock) in the _txc_finish and _osr_reap_done methods.

I think I'm done with the 1st patch. There's no txc->wal_txn condition anymore.
Thank you for the good suggestions!

Can you please review the patches?

@liewegas (Member)

I went to play with these patches locally and ended up needing to make several other fixes/cleanups first to fix bugs with the sync_submit_transaction. Pushing those first, see #11983

There is still a deeper problem, though, that I think we have to address first. The normal kv_sync_thread loop does

while (true) {
  sync bdev;
  submit transactions;
  submit sync transactions;
  sync bdev;
  call completions;
}

The current bluestore_sync_submit_transaction completely skips the initial bdev sync, which means that we may queue a kv metadata write that references blocks that aren't stable on disk yet. We need to fix this.. probably by adding a bdev->sync in here. Which will either be very slow (normal devices) or free (NVRAM).
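
A rough sketch of the fix being described, under the stated assumption that metadata must not be committed before the data blocks it references are stable (function names are stand-ins, not the Ceph API):

  #include <atomic>

  std::atomic<bool> bdev_dirty{false};   // set when data IO completed but was not yet flushed

  void bdev_flush()     { /* BlockDevice::flush() stand-in */ bdev_dirty = false; }
  void kv_submit_sync() { /* KeyValueDB::submit_transaction_sync() stand-in */ }

  void sync_submit_transaction() {
    if (bdev_dirty)
      bdev_flush();     // very slow on ordinary devices, essentially free on NVRAM
    kv_submit_sync();   // the metadata now only references stable blocks
  }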

@voidbag (Contributor Author) commented Nov 15, 2016

@liewegas
Thank you for your opinion.

I have a question about your comment.
I skipped the first bdev->flush() only when the transaction does data journaling (WAL).
I think db->submit_transaction_sync is enough to ensure data durability when using data journaling (WAL).
I think this is valid, like FileStore, which replies to the client right after doing data journaling (WAL).

Do we need to flush the bdev first in the WAL case?

Thank you

@liewegas (Member)

If it's a pure WAL write, and all prior txcs also have no unstable IO, then yes, we can simply do a sync kv commit. The OpSequencer counter in my other PR addresses this, I think. And in fact the change to do the commit sync on top of it is pretty simple... playing with that now.

@voidbag (Contributor Author) commented Nov 15, 2016

@liewegas
I think you are referring to the kv_committing_serially variable.

I already added a counter, osr->kv_finisher_submitting, to check for unstable transactions in the same sequencer.
The counter indicates the number of transactions not yet queued to the finishers.
If a transaction is queued to the finishers, then osr->kv_finisher_submitting is decremented.
(kv_finisher_submitting is a stricter condition than kv_committing_serially.)

if (g_conf->bluestore_rtc && !txc->is_pipelined_io && txc->osr->kv_finisher_submitting == 1)
I think the 1st patch can simply do a sync kv commit by checking kv_finisher_submitting, much like the kv_committing_serially check you created.
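
A self-contained sketch of the counter semantics above (illustrative types): incremented when a txc is queued on the sequencer, decremented when its completion is handed to a finisher, so a value of 1 means this txc is the only one still in flight.

  #include <atomic>

  struct OpSequencerCounter {
    std::atomic<int> kv_finisher_submitting{0};
  };

  void on_queue_transaction(OpSequencerCounter &osr)  { ++osr.kv_finisher_submitting; }
  void on_queued_to_finisher(OpSequencerCounter &osr) { --osr.kv_finisher_submitting; }

  bool only_txc_in_flight(const OpSequencerCounter &osr) {
    return osr.kv_finisher_submitting.load() == 1;   // nothing unstable ahead of us
  }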

Is there any other condition I have to consider?

Thank you

@liewegas (Member)

Ah, yes, that looks right. I'd like to rebase this on top of the wip-bluestore-release branch, though, since that fixes the existing sync_submit_transaction path to delay kv submit if there is unstable IO.

I started to adapt your patch at https://github.com/liewegas/ceph/commits/wip-bluestore-rtc, but ended up simplifying a bit (and breaking it into smaller changes). I also added a per-txc field that tracks whether we are still in the queuing execution context...I think this will be a more reliable/safer way to decide whether the qlock dance is necessary?

@voidbag (Contributor Author) commented Nov 15, 2016

@liewegas
Thank you for adapting my patch

But I have a concern about the flag, and I left a comment on your branch (https://github.com/liewegas/ceph/commits/wip-bluestore-rtc).
Thank you.

@voidbag (Contributor Author) left a comment

The kv_finisher_submitting counter increment can be moved to right after TransContext::STATE_IO_DONE:

@@ -6961,6 +6978,7 @@ int BlueStore::queue_transactions(

// prepare
TransContext *txc = _txc_create(osr);
++txc->osr->kv_finisher_submitting;
Contributor Author

@liewegas
I think this line can be moved right after TransContext::STATE_IO_DONE:

@ghost commented Nov 17, 2016

jenkins test this please (eio error now ignored in master)

@voidbag (Contributor Author) commented Nov 17, 2016

@dachary
Thanks
I'll rebase this branch onto master and push it.
Thank you

@voidbag (Contributor Author) commented Nov 17, 2016

@dachary
Jenkins test is finished.

Thanks

bluestore_rtc option was added
- ShardedOpWQ calls submit_transaction_sync and queues txc to finishers
  when wal is used
- Consistency problem has been solved, using sequencer and its member variables

Signed-off-by: Taeksang Kim <voidbag@gmail.com>
- It must be used with bluestore_rtc option
- Context::FLAG_SYNC constant was added to do context callbacks, w/o pg lock
- It's desirable to use this option with small core machine

Signed-off-by: Taeksang Kim <voidbag@gmail.com>
@voidbag (Contributor Author) commented Nov 21, 2016

The rebase on top of master is finished.

Thanks

@varadakari (Contributor)

@liewegas has a patch for the same change: #12062.

@voidbag (Contributor Author) commented Nov 21, 2016

Thanks for notifying me.

@liewegas (Member)

see #12062

@liewegas closed this on Dec 27, 2016