This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Async snapshot #268

Merged
merged 4 commits into from
Mar 23, 2022

Conversation

MathieuBordere
Contributor

@MathieuBordere commented Mar 23, 2022

This PR introduces:

  • a `raft_io->async_work` method that allows raft to run arbitrary workloads asynchronously, off the main loop.
  • `raft_fsm` version 2, which introduces the `snapshot_async` and `snapshot_finalize` methods. `snapshot_async` is run through the new `async_work` interface and allows a user of raft to take snapshots off the main loop.

Updated version-info according to https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html

Allows raft to perform a potentially long running task asynchronously
from the main loop.
Users of this function should take the necessary thread-safety measures,
as the work can run in parallel to the main thread.
@codecov-commenter

codecov-commenter commented Mar 23, 2022

Codecov Report

Merging #268 (05ab704) into master (ced1ca7) will increase coverage by 0.18%.
The diff coverage is 94.83%.

```diff
@@            Coverage Diff             @@
##           master     #268      +/-   ##
==========================================
+ Coverage   87.71%   87.90%   +0.18%
==========================================
  Files         108      110       +2
  Lines       15527    15777     +250
  Branches     2398     2412      +14
==========================================
+ Hits        13620    13869     +249
- Misses       1720     1722       +2
+ Partials      187      186       -1
```
| Impacted Files | Coverage Δ |
|---|---|
| src/uv_work.c | 74.41% <74.41%> (ø) |
| src/uv.c | 87.23% <75.00%> (-0.35%) ⬇️ |
| test/integration/test_uv_work.c | 94.73% <94.73%> (ø) |
| src/fixture.c | 94.40% <100.00%> (+0.15%) ⬆️ |
| src/replication.c | 82.92% <100.00%> (+2.50%) ⬆️ |
| src/uv_append.c | 93.31% <100.00%> (ø) |
| src/uv_segment.c | 90.41% <100.00%> (ø) |
| src/uv_snapshot.c | 76.33% <100.00%> (+0.05%) ⬆️ |
| test/integration/test_fixture.c | 100.00% <100.00%> (ø) |
| test/integration/test_snapshot.c | 100.00% <100.00%> (ø) |
| ... and 5 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ced1ca7...05ab704.

@MathieuBordere force-pushed the async-snapshot branch 3 times, most recently from 04d3b43 to d0b0557 on March 23, 2022 14:28
@MathieuBordere force-pushed the async-snapshot branch 2 times, most recently from 6581c9a to c01d121 on March 23, 2022 14:55
@MathieuBordere marked this pull request as ready for review March 23, 2022 18:29
Contributor

@freeekanayaka freeekanayaka left a comment


I didn't go through all details yet, but here's a couple of questions:

  1. If you can disclose that, where did the need for this feature come from? (I assume some real-world use case.) I ask because I'm slightly surprised that synchronous snapshots can become a performance issue, at least with the assumption that the whole FSM is kept in memory.
  2. In version 2 of `raft_fsm`, is it intentional that the `snapshot` method still gets called even if `snapshot_async` is defined?

@MathieuBordere
Contributor Author

> I didn't go through all details yet, but here's a couple of questions:
>
> 1. If you can disclose that, where did the need of adding this feature come from? (I assume some real world use case). I ask because I'm slightly surprised that synchronous snapshots can become a performance issue, at least with the assumption that the whole FSM is kept in memory.
>
> 2. In version 2 of `raft_fsm` is it intentional that the `snapshot` method still gets called even if `snapshot_async` is defined?
1. Yes, with larger databases they can become an issue; on my system a database of 1.5GB can take 800-900ms to snapshot, which is quite long to block the event loop.

We have a (rudimentary) benchmark; a typical result of a run with synchronous snapshots on the GitHub Actions machines looks something like:

```
Write results:
n 677104
n_err 1
avg [ms] 4.759796
max [ms] 17417.409460
min [ms] 0.354293

Read results:
n 676192
n_err 0
avg [ms] 0.438674
max [ms] 12199.753818
min [ms] 0.038299
```

Notice the large maximum times a write or a read can take.

A typical run with async snapshots (on the dqlite branch where I'm implementing it) looks like:

```
Write results:
n 716654
n_err 1
avg [ms] 4.886415
max [ms] 4860.268703
min [ms] 0.487495

Read results:
n 715646
n_err 0
avg [ms] 0.108416
max [ms] 536.435864
min [ms] 0.044499
```

The spikes are still there, but much smaller. I'll try to do some more measurements tomorrow.

2. Yes, it's intentional that `snapshot` is called in the main loop; in dqlite, e.g., this is used to copy the WAL and prevent checkpoints from running. Afterwards `snapshot_async` is run in a worker thread; in dqlite we would copy pointers to the database pages into raft buffers in this method. Finally, `snapshot_finalize` is called in the main loop again; in dqlite that will allow checkpoints to run again and free any allocated memory.

@freeekanayaka
Contributor

> The spikes are still there, but a lot less high. I'll try to do some more measurements tomorrow.

I see. It'd be interesting to run those measurements on a bare metal machine with as little disk activity as possible going on (other than raft/dqlite itself). I would expect the version with async snapshot support to virtually eliminate the write spikes. If not, perhaps there is something else also to look at, as a separate work from this PR.

> 2. Yes, it's the intention `snapshot` is called in the main loop, in dqlite e.g. this is used to copy the wal and prevent checkpoints from running, Afterwards `snapshot_async` is run in a worker thread, in dqlite we would copy pointers to the database pages to raft buffers in this method. And finally, `snapshot_finalize` is called in the main loop again, in dqlite that will allow checkpoints to run again and free any allocated memory.

Oh I see. So you'd copy the whole WAL synchronously (since it should typically be relatively small) so there are no race conditions, and then copy the database part asynchronously and since it won't be changed (checkpoints off) it is okay to do that in a separate thread.

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 23, 2022

> Oh I see. So you'd copy the whole WAL synchronously (since it should typically be relatively small) so there are no race conditions, and then copy the database part asynchronously and since it won't be changed (checkpoints off) it is okay to do that in a separate thread.

Yes, and now that checkpoints are off and we have a copy of the WAL, we no longer need to copy the database; we can just pass the page pointers to raft.

@freeekanayaka
Contributor

> Yes, and now that checkpoints are off and we have a copy of the WAL, we no longer need to copy the database, we can just pass the page pointers to raft.

Ah nice.

@stgraber merged commit b37dacd into canonical:master Mar 23, 2022
@freeekanayaka
Contributor

Another question: since only page pointers are passed to raft, what would the `snapshot_async` implementation of dqlite do that is expensive and needs to run outside of the main loop? It feels that the real trick here is basically to disable checkpoints, so in principle just `snapshot_finalize` would be needed? Or I guess I'm missing something.

@MathieuBordere
Contributor Author

> Another question: since only page pointers are passed to raft, what would the snapshot_async implementation of dqlite do that is expensive and requires to be run outside of the main loop? It feels that basically the real trick here is to disable checkpoints, so in principle just snapshot_finalize would be needed? Or I guess I'm missing something.

Yeah, indeed. I first made this while I was still copying the database in the worker thread; only afterwards did I move dqlite to passing pointers. I still think it's a bit cleaner to offload the work to a thread.

@freeekanayaka
Contributor

In which sense do you think it's cleaner? I'd say it's cleaner if the goal is "add support for asynchronous snapshots". But if we switch the goal to "add support for post-snapshot cleanup", then simply introducing `snapshot_finalize` would be equally clean.

I'm just a bit perplexed because it feels there's additional complexity (new `raft_io` method, new `raft_io` request, new snapshot-related code paths, etc.) that is not strictly needed right now to solve the performance issue of dqlite.

I know it might sound bad, but perhaps it'd be worth considering implementing just `snapshot_finalize` instead for now (i.e. support for snapshot cleanup)? And keep this branch around as experimental until there's need for it in dqlite or someone asks for it for some other use case.

@MathieuBordere
Contributor Author

> I know it might sound bad, but perhaps it'd be worth considering implementing just snapshot_finalize instead for now (i.e. support for snapshot cleanup)? And keep this branch around as experimental until there's need for it in dqlite or someone asks for it for some other use case.

It was already merged last night; if you really object, I'll talk to @stgraber to see if it's worth reverting.
I would need to do some measurements to find out what happens if we do all the work in the main thread instead of in a worker.

@freeekanayaka
Contributor

Assuming that the work is just copying pointers, it should really be fast, and I wouldn't expect any regression.

I'd personally go for the minimum required change (adding snapshot_finalize), the main reason being that adding parallelism can always introduce subtle issues that might not be immediate to see right now, and it's also something to maintain and be careful of when modifying the code in the future. But I'll let you guys decide since YMMV.

@freeekanayaka
Contributor

For example, we have a few spots that check if a snapshot is in progress, and they use `if (r->snapshot.put.data != NULL)`, e.g.:

https://github.com/canonical/raft/blob/master/src/replication.c#L1276
https://github.com/canonical/raft/blob/master/src/replication.c#L1418
https://github.com/canonical/raft/blob/master/src/replication.c#L1657

Do they still work correctly now that a snapshot might be in progress even if `r->snapshot.put.data` is `NULL` (because we're still running `snapshot_async`)? Or am I misinterpreting the change?

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 24, 2022

> For example, we have a few spots that check if a snapshot is in progress, and they use if (r->snapshot.put.data != NULL), e.g.:
>
> https://github.com/canonical/raft/blob/master/src/replication.c#L1276 https://github.com/canonical/raft/blob/master/src/replication.c#L1418 https://github.com/canonical/raft/blob/master/src/replication.c#L1657
>
> Do they still work correctly now that a snapshot might be in progress even if r->snapshot.put.data is NULL (because we're still running snapshot_async)? Or am I misinterpreting the change?

https://github.com/canonical/raft/blob/master/src/replication.c#L1276 checks `snapshot.pending.term != 0` to detect whether a snapshot is in progress; that's still okay.

Same for https://github.com/canonical/raft/blob/master/src/replication.c#L1418.

https://github.com/canonical/raft/blob/master/src/replication.c#L1657 only tries to detect a snapshot install and is still okay.

@MathieuBordere
Contributor Author

Don't get me wrong, I understand your concern, and with your input I would probably have done it differently.

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 24, 2022

I did some measurements and the async work is indeed really fast. I would then probably revert this merge (@stgraber) and work on the simpler approach to remove some risk, and keep this one around for when it's really needed.

@freeekanayaka
Contributor

Thanks, sounds like a wise choice to me.

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 24, 2022

> Thanks, sounds like a wise choice to me.

Thank you for the input.

@calvin2021y

calvin2021y commented Jun 22, 2022

hi @MathieuBordere

Can we have an async branch updated against master?

I am working on a SQL-based solution with an in-memory database; I need async snapshot so I can flush the database to disk.

Without async snapshot it will block the event loop for a long time if I have a big database.

I also need async restore with a callback to let raft know it is ready to continue (without blocking the loop, because there could be other raft groups running in it).

@MathieuBordere
Contributor Author

MathieuBordere commented Jun 22, 2022

> Can we have a async branch with updated commit to master ?
>
> I also need async restore with a callback let the raft know it is ready to continue.

Hi, it will probably land this week, but you can already find it here. There's no `async_restore` yet, but it probably makes sense to add it; let me think about it.

@calvin2021y

calvin2021y commented Jun 23, 2022

hi @MathieuBordere

I have merged your branch into master and it works well.

Please consider adding `async_restore` in v3, and make it optional like `async_snapshot`.

With `async_restore` I am able to add disk support with fast snapshots for big databases. Without `async_restore` the event loop will be blocked, which breaks a lot of other things (like a NATS sub client running in the event loop, or another raft group).

@MathieuBordere mentioned this pull request Jun 28, 2022
@MathieuBordere deleted the async-snapshot branch December 9, 2022 10:15