
Shallow snapshot #356

Merged
merged 8 commits into from Apr 12, 2022

Conversation

MathieuBordere
Contributor

@freeekanayaka

This PR is WIP; it's just here to discuss the methodology, so don't review the code yet. As discussed in the raft PR on async snapshots, the trick to getting faster snapshots in dqlite is to:

  • copy the WAL
  • disallow checkpoints while taking a snapshot
  • pass pointers to the database pages to raft to eliminate the copy of the whole database
  • allow checkpoints again after raft is done accessing the page pointers

I currently set a read_lock flag on the db object to indicate that a snapshot or a checkpoint is busy, so that neither can start while the other is running. A minimal sketch of the idea is shown below.
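Roughly, the locking idea looks like this (an illustrative sketch, not the actual dqlite code; the struct layout and helper names are made up for the example):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch: a read_lock flag on the db object ensures a snapshot and a
 * checkpoint can never run concurrently. */
struct db
{
	/* ... existing fields ... */
	bool read_lock; /* Set while a snapshot or a checkpoint is busy. */
};

/* Returns 0 if the lock was taken, -1 if the other operation holds it. */
static int db_try_read_lock(struct db *db)
{
	if (db->read_lock) {
		return -1;
	}
	db->read_lock = true;
	return 0;
}

static void db_read_unlock(struct db *db)
{
	assert(db->read_lock);
	db->read_lock = false;
}
```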

The problem I encountered is that, in the current implementation, the dqlite leader decides when to checkpoint: it issues a checkpoint command to the whole cluster, and when a node applies the checkpoint command, its sqlite database is checkpointed.

Because nodes independently decide when to snapshot, it can happen that applying that checkpoint command fails on a follower that is taking a snapshot (and thus holding the read_lock while compressing and writing to disk) at that moment, and the checkpoint command will not be executed again, because we only apply raft logs once.
My solution is to also let the nodes decide independently when to checkpoint; this then happens at the end of apply_frames instead of as a result of a checkpoint command.

Now, here comes the question: do you see any obvious issues with that approach?

@freeekanayaka
Contributor

freeekanayaka commented Mar 24, 2022


I looked very briefly at the change, and at first sight, yes, I do see a potential problem, although I'd need to refresh my memory on some aspects of the code to confirm. Currently the leader decides whether to issue a checkpoint command in leaderMaybeCheckpoint (in leader.c), and the main check is basically whether there are active readers: if we're above the threshold and there are no readers, a checkpoint FSM command is issued. At that point we're guaranteed that the checkpoint operation will succeed on all FSMs (leader and followers), since the leader will acquire a lock (in leaderMaybeCheckpoint) and followers don't currently support readers. This makes the FSM totally deterministic.

If, instead, we perform the "are there active readers?" check in the apply-frames command, then on the leader it might turn out that there are active readers and the checkpoint can't be performed. That makes the FSM non-deterministic, and you'd end up with a different state on the leader than on the followers. At that point I believe further apply-frames commands will be inconsistent, because the WAL will be different. The current design (i.e. the checkpoint command) was done with exactly that goal in mind: fully deterministic FSMs, as the Raft paper requires.

This is a bit of a tricky problem and it seems equivalent to the problem of allowing readers on followers. One obvious idea would be to "throttle" checkpoint commands and subsequent frames commands, but it feels complicated.
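For reference, the current leader-driven flow described above is roughly this (a simplified sketch; apart from leaderMaybeCheckpoint, the helper names are illustrative, not the actual leader.c code):

```c
/* Simplified sketch of the current design: the "are there active
 * readers?" check happens only on the leader, *before* the checkpoint
 * command is replicated, so applying the command is deterministic on
 * every FSM. */
static void leaderMaybeCheckpoint(struct leader *l)
{
	if (walSize(l->db) < checkpointThreshold(l->db)) {
		return; /* WAL still small enough. */
	}
	if (hasActiveReaders(l->db)) {
		return; /* Try again after a later commit. */
	}
	/* Safe to replicate: followers have no readers, so the
	 * checkpoint will succeed on every FSM. */
	issueCheckpointCommand(l);
}
```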

@MathieuBordere
Contributor Author

MathieuBordere commented Mar 24, 2022


It will be inconsistent in how the information is spread between the database file and the WAL, but the content/information in the database (as in, database file and WAL file together) should be identical, no?

@freeekanayaka
Contributor

Unless there's something obvious that I'm missing, I have the sensation this is going to be a bit of a tough problem :/ Or at least I can't think of any straightforward solution right now. What do you think?

@freeekanayaka
Contributor

> It will be inconsistent in how the information is spread between the database and the WAL but the content should be identical no?

Say that at apply-frames command N the followers perform the checkpoint but the leader does not. Won't apply-frames command N+1 be screwed? (i.e. the followers will not be able to apply it correctly because their WAL is different.) Or maybe that was a problem only in earlier implementations. I'm checking the code now to refresh my memory.

@freeekanayaka
Contributor

Okay, I was probably confusing the situation with an earlier implementation of dqlite, where we were actually sending WAL frames and not merely database pages, so there was a need for the checkpoint command. Now we're only shipping page numbers, and there seems to be no dependency on the WAL state. So it might be fine. I'd just be extra careful and check whether there is any subtle ramification of the fact that, although the information in the database is the same, the physical bytes can be different, so the "FSM is deterministic" assumption gets relaxed. I don't know if there are consequences of that.

@MathieuBordere
Contributor Author


That's good news, thanks a lot, I'll be extra careful :-)

@MathieuBordere MathieuBordere force-pushed the shallow-snapshot branch 14 times, most recently from ea82c45 to afee9f4 on April 5, 2022
@MathieuBordere MathieuBordere force-pushed the shallow-snapshot branch 4 times, most recently from 8def505 to ee1f9b4 on April 5, 2022
@MathieuBordere MathieuBordere mentioned this pull request Apr 6, 2022
@MathieuBordere MathieuBordere force-pushed the shallow-snapshot branch 3 times, most recently from 05caba9 to cf13a29 on April 7, 2022
@MathieuBordere
Contributor Author

MathieuBordere commented Apr 7, 2022

I think this is now ready for review. Please take your time going through it, and let me know whatever needs clarifying. I'm out for the evening, will comment back in the morning.

Changes:

  • Nodes no longer checkpoint as a result of a checkpoint command; instead, each node independently decides to checkpoint the database, based on the size of the WAL, at the end of apply_frames. I have added tests in test_vfs covering scenarios where nodes checkpoint at different times.
  • dqlite now uses the snapshot_finalize method introduced in raft 0.13.0. In fsm__snapshot, dqlite sets a read_lock flag on the database to indicate that a snapshot is active; this flag disallows checkpoints from running at the same time as the snapshot. dqlite then deep-copies the WAL and copies pointers to the database pages. Those pointers can be safely accessed by raft to write the snapshot to disk as long as the WAL is not checkpointed into the database, which is exactly what the read_lock prevents. The snapshot format on disk has not changed. Finally, after the snapshot has reached disk or has failed, raft calls fsm__snapshot_finalize, which releases the read_lock (allowing checkpoints again) and frees any allocated buffers. See the condensed sketch after this list.
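A condensed sketch of the snapshot path (illustrative only: the real code goes through the registry, tracks pages individually and handles more errors; the raft 0.13.0 snapshot/snapshot_finalize signatures and the RAFT_BUSY return code are assumed):

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>
#include <raft.h>

/* Hypothetical in-memory database state, for illustration. */
struct db
{
	void *pages;     /* In-memory database pages */
	size_t pages_len;
	void *wal;       /* In-memory WAL */
	size_t wal_len;
	bool read_lock;  /* Blocks checkpoints while a snapshot is active */
};

static int fsm__snapshot(struct raft_fsm *fsm,
			 struct raft_buffer *bufs[],
			 unsigned *n_bufs)
{
	struct db *db = fsm->data; /* Illustrative; real code differs. */
	struct raft_buffer *b;
	if (db->read_lock) {
		return RAFT_BUSY; /* A checkpoint is in flight. */
	}
	db->read_lock = true; /* Block checkpoints until finalize. */
	b = raft_malloc(2 * sizeof *b);
	if (b == NULL) {
		db->read_lock = false;
		return RAFT_NOMEM;
	}
	/* Shallow copy: point straight at the database pages. */
	b[0].base = db->pages;
	b[0].len = db->pages_len;
	/* Deep copy: the WAL keeps changing, so it must be duplicated. */
	b[1].base = raft_malloc(db->wal_len);
	if (b[1].base == NULL) {
		raft_free(b);
		db->read_lock = false;
		return RAFT_NOMEM;
	}
	memcpy(b[1].base, db->wal, db->wal_len);
	b[1].len = db->wal_len;
	*bufs = b;
	*n_bufs = 2;
	return 0;
}

static int fsm__snapshot_finalize(struct raft_fsm *fsm,
				  struct raft_buffer *bufs[],
				  unsigned *n_bufs)
{
	struct db *db = fsm->data;
	raft_free((*bufs)[1].base); /* Only the WAL copy is owned by us. */
	raft_free(*bufs);
	*bufs = NULL;
	*n_bufs = 0;
	db->read_lock = false; /* Checkpoints may run again. */
	return 0;
}
```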

Remarks:

  • Legacy (pre-this-PR) nodes in a cluster run by a Modern (this PR) node will still receive checkpoint commands; these commands are no-ops on Modern nodes. A leader will issue checkpoint commands once its WAL has size 0, i.e. once it has performed a checkpoint itself. This is because the callback containing this logic runs after the WAL has potentially been checkpointed.
  • Modern nodes in a cluster run by a Legacy leader will perform checkpoints autonomously and will apply the incoming checkpoint commands as no-ops. I have manually tested these mixed-cluster scenarios, but they're not in CI.
  • A snapshot that takes very long to reach disk will block the checkpoint from running; however, once the snapshot eventually reaches disk, the checkpoint will have a chance to run before raft triggers a new snapshot. This is because raft applies commands before taking a snapshot, and dqlite checkpoints during the execution of a command. So there's a risk that the WAL grows fairly large during a slow snapshot write, but I believe this is minor. (A sketch of the autonomous checkpoint check follows this list.)
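The autonomous checkpoint decision at the end of apply_frames is essentially this (simplified sketch; the struct and threshold field are illustrative, reusing the hypothetical db struct from earlier in the thread, while sqlite3_wal_checkpoint_v2 is the real SQLite entry point):

```c
#include <sqlite3.h>

/* Sketch: each node checks the WAL size itself at the end of
 * apply_frames, instead of waiting for a cluster-wide command. */
static void maybeCheckpoint(struct db *db, sqlite3 *conn, unsigned wal_frames)
{
	int log;  /* Frames left in the WAL after the checkpoint. */
	int ckpt; /* Frames moved into the database file. */
	if (wal_frames < db->checkpoint_threshold) {
		return; /* WAL still small enough. */
	}
	if (db->read_lock) {
		return; /* A snapshot is busy; retry on the next apply. */
	}
	db->read_lock = true;
	/* TRUNCATE mode moves all frames into the database file and
	 * resets the WAL to zero length on success. */
	sqlite3_wal_checkpoint_v2(conn, "main", SQLITE_CHECKPOINT_TRUNCATE,
				  &log, &ckpt);
	db->read_lock = false;
}
```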

@MathieuBordere MathieuBordere marked this pull request as ready for review April 7, 2022 17:14
Contributor

@freeekanayaka freeekanayaka left a comment

Looks good to me, minus a few nitpicks. Good work on the VFS-level tests.

Mathieu Borderé added 5 commits April 11, 2022 10:45
This can happen when the leader of a cluster is running an older version
of dqlite where checkpoints were synchronized cluster-wide.
Legacy followers only checkpoint their WAL when they receive a
checkpoint command from the leader. To prevent the WAL from growing too large,
checkpoint commands still have to be issued to accommodate these nodes.
@MathieuBordere
Copy link
Contributor Author

MathieuBordere commented Apr 11, 2022

Found a bug, will reopen soon.

edit: the bug is fixed now; it was not really related to this PR. I'll just try to fix this last issue with clang complaining about the wrong ASan runtime.

edit2: should be good now

Mathieu Borderé added 2 commits April 11, 2022 14:51
When the gateway is closed and the leader is freed, it's possible that
there is still a reference to the raft_apply member in the leader
requests queue of the raft node. Later, when raft tries to access that
memory while cleaning up its leader requests queue, a memory fault can be
triggered.

Solve this by allocating the request and only freeing it in the callback.
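In other words, the pattern in the fix is roughly this (a sketch against raft's public raft_apply API; the wrapper name is made up):

```c
#include <raft.h>

/* Sketch of the fix: heap-allocate the raft_apply request so it can
 * outlive the leader/gateway, and free it only in the apply callback. */
static void applyCb(struct raft_apply *req, int status, void *result)
{
	(void)status;
	(void)result;
	/* ... react to the result only if the gateway still exists ... */
	raft_free(req); /* The callback owns the request. */
}

static int applyCommand(struct raft *r, const struct raft_buffer *buf)
{
	struct raft_apply *req = raft_malloc(sizeof *req);
	int rv;
	if (req == NULL) {
		return RAFT_NOMEM;
	}
	rv = raft_apply(r, req, buf, 1, applyCb);
	if (rv != 0) {
		raft_free(req); /* Callback won't fire; free here. */
	}
	return rv;
}
```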
@MathieuBordere MathieuBordere marked this pull request as ready for review April 11, 2022 13:01
@stgraber
Contributor

@freeekanayaka got a few minutes to do another quick review on this before we hit merge?

@freeekanayaka
Contributor

All good!
