This repository has been archived by the owner on Mar 4, 2024. It is now read-only.

Async snapshot #268

Merged
merged 4 commits into from
Mar 23, 2022

Conversation

MathieuBordere
Contributor

@MathieuBordere commented Mar 23, 2022

This PR introduces:

  • a `raft_io->async_work` method that allows raft to run arbitrary workloads asynchronously, off the main loop.
  • `raft_fsm` version 2, which introduces the `snapshot_async` and `snapshot_finalize` methods. `snapshot_async` is run through the new `async_work` interface and allows a user of raft to take snapshots off the main loop.

Updated version-info according to https://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html

Allows raft to perform a potentially long running task asynchronously
from the main loop.
Users of this function should take the necessary thread-safety measures,
as the work can run in parallel to the main thread.
@codecov-commenter

codecov-commenter commented Mar 23, 2022

Codecov Report

Merging #268 (05ab704) into master (ced1ca7) will increase coverage by 0.18%.
The diff coverage is 94.83%.

```diff
@@            Coverage Diff             @@
##           master     #268      +/-   ##
==========================================
+ Coverage   87.71%   87.90%   +0.18%
==========================================
  Files         108      110       +2
  Lines       15527    15777     +250
  Branches     2398     2412      +14
==========================================
+ Hits        13620    13869     +249
- Misses       1720     1722       +2
+ Partials      187      186       -1
```
| Impacted Files | Coverage Δ |
|---|---|
| src/uv_work.c | 74.41% <74.41%> (ø) |
| src/uv.c | 87.23% <75.00%> (-0.35%) ⬇️ |
| test/integration/test_uv_work.c | 94.73% <94.73%> (ø) |
| src/fixture.c | 94.40% <100.00%> (+0.15%) ⬆️ |
| src/replication.c | 82.92% <100.00%> (+2.50%) ⬆️ |
| src/uv_append.c | 93.31% <100.00%> (ø) |
| src/uv_segment.c | 90.41% <100.00%> (ø) |
| src/uv_snapshot.c | 76.33% <100.00%> (+0.05%) ⬆️ |
| test/integration/test_fixture.c | 100.00% <100.00%> (ø) |
| test/integration/test_snapshot.c | 100.00% <100.00%> (ø) |
| ... and 5 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ced1ca7...05ab704.

@MathieuBordere force-pushed the async-snapshot branch 3 times, most recently from 04d3b43 to d0b0557 on March 23, 2022 14:28
@MathieuBordere force-pushed the async-snapshot branch 2 times, most recently from 6581c9a to c01d121 on March 23, 2022 14:55
@MathieuBordere marked this pull request as ready for review March 23, 2022 18:29
Contributor

@freeekanayaka freeekanayaka left a comment


I didn't go through all details yet, but here's a couple of questions:

  1. If you can disclose that, where did the need for this feature come from? (I assume some real-world use case.) I ask because I'm slightly surprised that synchronous snapshots can become a performance issue, at least with the assumption that the whole FSM is kept in memory.
  2. In version 2 of `raft_fsm`, is it intentional that the `snapshot` method still gets called even if `snapshot_async` is defined?

@MathieuBordere
Contributor Author

> I didn't go through all details yet, but here's a couple of questions:
>
> 1. If you can disclose that, where did the need of adding this feature come from? (I assume some real world use case). I ask because I'm slightly surprised that synchronous snapshots can become a performance issue, at least with the assumption that the whole FSM is kept in memory.
>
> 2. In version 2 of `raft_fsm` is it intentional that the `snapshot` method still gets called even if `snapshot_async` is defined?
1. Yes, with larger databases they can become an issue; on my system a database of 1.5GB can take 800-900ms to snapshot, which is quite long to block the event loop.

We have a (rudimentary) benchmark; a typical result of a run with synchronous snapshots on the GitHub Actions machines looks something like:

```
Write results:
n 677104
n_err 1
avg [ms] 4.759796
max [ms] 17417.409460
min [ms] 0.354293

Read results:
n 676192
n_err 0
avg [ms] 0.438674
max [ms] 12199.753818
min [ms] 0.038299
```

Notice the large maximum times a write or a read can take.

A typical run with async snapshots (on the dqlite branch where I'm implementing it) looks like:

```
Write results:
n 716654
n_err 1
avg [ms] 4.886415
max [ms] 4860.268703
min [ms] 0.487495

Read results:
n 715646
n_err 0
avg [ms] 0.108416
max [ms] 536.435864
min [ms] 0.044499
```

The spikes are still there, but much smaller. I'll try to do some more measurements tomorrow.

2. Yes, it's intentional that `snapshot` is called in the main loop; in dqlite, e.g., this is used to copy the WAL and prevent checkpoints from running. Afterwards `snapshot_async` is run in a worker thread; in dqlite we would copy pointers to the database pages into raft buffers in this method. Finally, `snapshot_finalize` is called in the main loop again; in dqlite that will allow checkpoints to run again and free any allocated memory.

@freeekanayaka
Contributor

> The spikes are still there, but a lot less high. I'll try to do some more measurements tomorrow.

I see. It'd be interesting to run those measurements on a bare metal machine with as little disk activity as possible going on (other than raft/dqlite itself). I would expect the version with async snapshot support to virtually eliminate the write spikes. If not, perhaps there is something else also to look at, as a separate work from this PR.

> 2. Yes, it's the intention `snapshot` is called in the main loop, in dqlite e.g. this is used to copy the wal and prevent checkpoints from running, Afterwards `snapshot_async` is run in a worker thread, in dqlite we would copy pointers to the database pages to raft buffers in this method. And finally, `snapshot_finalize` is called in the main loop again, in dqlite that will allow checkpoints to run again and free any allocated memory.

Oh I see. So you'd copy the whole WAL synchronously (since it should typically be relatively small) so there are no race conditions, and then copy the database part asynchronously and since it won't be changed (checkpoints off) it is okay to do that in a separate thread.

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 23, 2022

> Oh I see. So you'd copy the whole WAL synchronously (since it should typically be relatively small) so there are no race conditions, and then copy the database part asynchronously and since it won't be changed (checkpoints off) it is okay to do that in a separate thread.

Yes, and now that checkpoints are off and we have a copy of the WAL, we no longer need to copy the database; we can just pass the page pointers to raft.

@freeekanayaka
Contributor

> Yes, and now that checkpoints are off and we have a copy of the WAL, we no longer need to copy the database, we can just pass the page pointers to raft.

Ah nice.

@stgraber merged commit b37dacd into canonical:master Mar 23, 2022
@freeekanayaka
Contributor

Another question: since only page pointers are passed to raft, what would the `snapshot_async` implementation of dqlite do that is expensive and needs to run outside of the main loop? It feels that the real trick here is basically to disable checkpoints, so in principle just `snapshot_finalize` would be needed? Or I guess I'm missing something.

@MathieuBordere
Contributor Author

> Another question: since only page pointers are passed to raft, what would the snapshot_async implementation of dqlite do that is expensive and requires to be run outside of the main loop? It feels that basically the real trick here is to disable checkpoints, so in principle just snapshot_finalize would be needed? Or I guess I'm missing something.

Yeah, indeed. I first made this while I was still copying the database in the worker thread; only afterwards did I move dqlite to passing pointers. I still think it's a bit cleaner to offload the work to a thread.

@freeekanayaka
Contributor

In which sense do you think it's cleaner? I'd say it's cleaner if the goal is "add support for asynchronous snapshots". But if we switch the goal to "add support for post-snapshot cleanup", then simply introducing `snapshot_finalize` would be equally clean.

I'm just a bit perplexed because it feels there's additional complexity (new `raft_io` method, new `raft_io` request, new snapshot-related code paths, etc.) that is not strictly needed right now to solve the performance issue of dqlite.

I know it might sound bad, but perhaps it'd be worth considering implementing just `snapshot_finalize` instead for now (i.e. support for snapshot cleanup)? And keep this branch around as experimental until there's need for it in dqlite or someone asks for it for some other use case.

@MathieuBordere
Contributor Author

> I know it might sound bad, but perhaps it'd be worth considering implementing just snapshot_finalize instead for now (i.e. support for snapshot cleanup)? And keep this branch around as experimental until there's need for it in dqlite or someone asks for it for some other use case.

It was already merged last night; if you really object, I'll talk to @stgraber to see if it's worth reverting.
I would need to do some measurements to find out what happens if we do all the work in the main thread instead of in a worker.

@freeekanayaka
Contributor

Assuming that the work is just copying pointers, it should really be fast, and I wouldn't expect any regression.

I'd personally go for the minimum required change (adding snapshot_finalize), the main reason being that adding parallelism can always introduce subtle issues that might not be immediate to see right now, and it's also something to maintain and be careful of when modifying the code in the future. But I'll let you guys decide since YMMV.

@freeekanayaka
Contributor

For example, we have a few spots that check if a snapshot is in progress, and they use `if (r->snapshot.put.data != NULL)`, e.g.:

https://github.com/canonical/raft/blob/master/src/replication.c#L1276
https://github.com/canonical/raft/blob/master/src/replication.c#L1418
https://github.com/canonical/raft/blob/master/src/replication.c#L1657

Do they still work correctly now that a snapshot might be in progress even if `r->snapshot.put.data` is `NULL` (because we're still running `snapshot_async`)? Or am I misinterpreting the change?

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 24, 2022

> For example, we have a few spots that check if a snapshot is in progress, and they use if (r->snapshot.put.data != NULL), e.g.:
>
> https://github.com/canonical/raft/blob/master/src/replication.c#L1276 https://github.com/canonical/raft/blob/master/src/replication.c#L1418 https://github.com/canonical/raft/blob/master/src/replication.c#L1657
>
> Do they still work correctly now that a snapshot might be in progress even if r->snapshot.put.data is NULL (because we're still running snapshot_async)? Or am I misinterpreting the change?

https://github.com/canonical/raft/blob/master/src/replication.c#L1276 checks `snapshot.pending.term != 0` to detect whether a snapshot is in progress; that's still okay.

Same for https://github.com/canonical/raft/blob/master/src/replication.c#L1418.

https://github.com/canonical/raft/blob/master/src/replication.c#L1657 only tries to detect a snapshot install and is still okay.

@MathieuBordere
Contributor Author

Don't get me wrong, I understand your concern, and with your input I would probably have done it differently.

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 24, 2022

I did some measurements and the async work is indeed really fast. I would then probably revert this merge (@stgraber) and work on the simpler approach to remove some risk, and keep this one around for when it's really needed.

@freeekanayaka
Contributor

Thanks, sounds like a wise choice to me.

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 24, 2022

> Thanks, sounds like a wise choice to me.

Thank you for the input.

@calvin2021y

calvin2021y commented Jun 22, 2022

hi @MathieuBordere

Can we have an async branch updated against master?

I am working on a SQL-based solution with an in-memory database; I need async snapshot so I can flush the database to disk.

Without async snapshot it will block the event loop for a long time if I have a big database.

I also need async restore with a callback to let raft know it is ready to continue (without blocking the loop, because there could be other raft groups running in it).

@MathieuBordere
Contributor Author

MathieuBordere commented Jun 22, 2022

> Can we have a async branch with updated commit to master ?
>
> I also need async restore with a callback let the raft know it is ready to continue.

Hi, it will probably land this week, but you can already find it here. There's no `async_restore` yet, but it probably makes sense to add it; let me think about it.

@calvin2021y

calvin2021y commented Jun 23, 2022

hi @MathieuBordere

I have merged your branch into master and it works well.

Please consider adding `async_restore` in v3, and make it optional like `async_snapshot`.

With `async_restore` I am able to add disk support with fast snapshots for big databases. Without `async_restore` the event loop will be blocked, which breaks a lot of other things (like a NATS sub client running in the event loop, or another raft group).

@MathieuBordere mentioned this pull request Jun 28, 2022
@MathieuBordere deleted the async-snapshot branch December 9, 2022 10:15