opt(lua): Coordinate single-shard Lua evals in remote thread #1845

Merged: 10 commits from lua-queue into main on Sep 13, 2023

Conversation

@chakaz (Collaborator) commented on Sep 12, 2023:

This removes the need for an additional (and costly) Fiber.


boost::intrusive_ptr<Transaction> stub_tx = new Transaction{cntx->transaction};
cntx->transaction = stub_tx.get();
tx->PrepareSquashedMultiHop(registry_.Find("EVAL"), [&](ShardId id) { return id == *sid; });
Collaborator:
Can you please explain, here or in the commit description, what happens here? We disable squashing before CallFromScript(cntx, args) but call PrepareSquashedMultiHop here. What does it do?

Contributor:
I suggested that he re-use parts of the squashing mechanism instead of implementing a new function. PrepareSquashedMultiHop just initializes the transaction with a "fake" command to prepare the hop. In theory it's not needed, because we have only a single shard and it's correctly marked as active by StartMultiLockedAhead.

What is disabled is the "real" squashing, which is responsible for executing commands in parallel.

Collaborator:
Nice. So how come this works, while using the real transaction did not? Where do we switch to the original one to allow command execution? (Beforehand, ScheduleSingleHop calls crashed because cb_ptr was already set with the coordinator callback.)

Contributor:
We tried different things 🙂

Transaction is not designed to be re-entrant, meaning you can't use "itself" while executing a callback. So a custom function that is not ScheduleSingleHop needs to be implemented... But then you run into trouble again: because you run from within the callback, the multi-command logic still submits callbacks and calls PollExecution, yet PollExecution is also not re-entrant, so you run into the weirdest bugs there.

In general I'd suggest not using a complicated approach, because it's difficult to verify and test. We already have squashing that works fairly reliably (and in a much simpler way, by offloading all the difficulties to a stub transaction).
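
To make the re-entrancy problem concrete, here is a minimal, self-contained C++ sketch. This is toy code, not Dragonfly's actual Transaction API; the ToyTransaction class and its single callback slot are invented purely to illustrate the cb_ptr issue mentioned above:

#include <cassert>
#include <functional>
#include <iostream>

class ToyTransaction {
 public:
  using Callback = std::function<void(ToyTransaction*)>;

  // Owns a single callback slot, similar in spirit to the cb_ptr mentioned
  // above. Scheduling while a callback is already installed is a bug.
  void ScheduleSingleHop(Callback cb) {
    assert(!cb_ && "re-entrant ScheduleSingleHop: callback slot already taken");
    cb_ = std::move(cb);
    cb_(this);  // in a real engine this would run on a shard thread
    cb_ = nullptr;
  }

 private:
  Callback cb_;
};

int main() {
  ToyTransaction tx;

  // Fine: one non-nested hop.
  tx.ScheduleSingleHop([](ToyTransaction*) { std::cout << "hop ok\n"; });

  // Not fine: scheduling again from inside the callback. This mirrors the
  // crash described above when EVAL tried to reuse the coordinating
  // transaction from within its own callback.
  tx.ScheduleSingleHop([](ToyTransaction* t) {
    t->ScheduleSingleHop([](ToyTransaction*) { std::cout << "nested hop\n"; });
  });
  return 0;
}

Built with assertions enabled, the nested call trips the assert, which is roughly the shape of the failure the reviewers describe; the stub-transaction approach discussed below avoids the nested scheduling altogether.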

@chakaz (Collaborator, Author):
The issue is that ScheduleSingleHop (and many other Transaction functions) is not re-entrant, so calling it from within a callback that it is itself executing requires subtle changes and hacks.
But Vlad correctly commented that the squashing mechanism already handles a similar case: if a transaction is set to have the SQUASHED_STUB role, it executes a different path, similar to what we're trying to achieve:

if (multi_ && multi_->role == SQUASHED_STUB) {
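
For illustration, here is a minimal sketch of the idea behind that branch. It is toy code under loose assumptions: ToyTx, its Role enum, and Execute are invented stand-ins, not the real Transaction, PrepareSquashedMultiHop, or the scheduling internals.

#include <functional>
#include <iostream>

class ToyTx {
 public:
  enum class Role { DEFAULT, SQUASHED_STUB };

  using Callback = std::function<void(ToyTx*)>;

  explicit ToyTx(Role role) : role_(role) {}

  void Execute(Callback cb) {
    if (role_ == Role::SQUASHED_STUB) {
      // Stub path: we are already on the single active shard's thread, so the
      // command runs inline; no extra fiber and no nested scheduling needed.
      cb(this);
      return;
    }
    // Default path: stands in for the full scheduling machinery
    // (ScheduleSingleHop and friends).
    std::cout << "scheduling a hop through the coordinator...\n";
    cb(this);
  }

 private:
  Role role_;
};

int main() {
  ToyTx real{ToyTx::Role::DEFAULT};
  // The EVAL coordinator swaps in a stub for the duration of the script,
  // similar to `cntx->transaction = stub_tx.get()` in the diff above.
  ToyTx stub{ToyTx::Role::SQUASHED_STUB};

  real.Execute([&stub](ToyTx*) {
    // Inside the shard callback, the script's commands go through the stub,
    // which executes them inline instead of re-entering the scheduler.
    stub.Execute([](ToyTx*) { std::cout << "redis.call executed inline\n"; });
  });
  return 0;
}

The point is only the dispatch on the role: the stub runs the script's commands inline on the shard thread it was handed to, so the coordinator neither spawns an extra fiber nor re-enters its own scheduling functions.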

Collaborator:
OK, please add statistics about how many times we offload the coordinator vs. how many times we do it in its original thread.

Collaborator:
Just to make sure we do not forget it: OOO (out-of-order) transactions can run immediately, and this is something we must verify with this code.

Contributor:
They will run immediately, but we have +1 hop for scheduling and +1 hop for unlocking.

Collaborator:
We can simulate OOO conditions easily using memtier, like I described to you offline: use a redis.call('set', KEYS[1], 'foo') script with a large key range. In fact, I would like us to repeat the same experiment and also check that we get performance similar to what I measured.

@chakaz (Collaborator, Author):
Re: simulating OOO:

Without this change (i.e. with release 1.9):

ALL STATS
===================================================================================================
Type          Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
---------------------------------------------------------------------------------------------------
Evalshas    451195.23         0.55021         0.47900         1.50300         2.30300     46216.21 
Totals      451195.23         0.55021         0.47900         1.50300         2.30300     46216.21 

With this change:

ALL STATS
===================================================================================================
Type          Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec
---------------------------------------------------------------------------------------------------
Evalshas    620400.31         0.40318         0.36700         0.85500         1.77500     63548.00
Totals      620400.31         0.40318         0.36700         0.85500         1.77500     63548.00

@chakaz requested a review from romange on September 12, 2023.
@@ -1418,6 +1420,8 @@ void ServerFamily::Info(CmdArgList args, ConnectionContext* cntx) {
append("defrag_attempt_total", m.shard_stats.defrag_attempt_total);
append("defrag_realloc_total", m.shard_stats.defrag_realloc_total);
append("defrag_task_invocation_total", m.shard_stats.defrag_task_invocation_total);
append("eval_local_coordination", m.eval_local_coordination);
Collaborator:
Counters should end with _total.

cntx->transaction = stub_tx.get();
tx->PrepareSquashedMultiHop(registry_.Find("EVAL"), [&](ShardId id) { return id == *sid; });
tx->ScheduleSingleHop([&](Transaction* tx, EngineShard*) {
++ServerState::tlocal()->stats.eval_remote_coordination;
Collaborator:
Can you think of other options for the stats naming? I find the remote/local naming confusing because it depends on the point of view. It could be something like
eval_io_coordination_cnt?
eval_shardlocal_coordination_cnt?

@chakaz requested a review from romange on September 13, 2023.
@@ -1418,6 +1420,8 @@ void ServerFamily::Info(CmdArgList args, ConnectionContext* cntx) {
append("defrag_attempt_total", m.shard_stats.defrag_attempt_total);
append("defrag_realloc_total", m.shard_stats.defrag_realloc_total);
append("defrag_task_invocation_total", m.shard_stats.defrag_task_invocation_total);
append("eval_io_coordination_cnt", m.eval_io_coordination_cnt);
Collaborator:
The string names should have _total at the end, according to the OpenMetrics spec.

@chakaz requested a review from romange on September 13, 2023.
@chakaz merged commit b91435e into main on Sep 13, 2023; 10 checks passed.
@chakaz deleted the lua-queue branch on September 13, 2023.