Shallow snapshot #356
Conversation
I looked very briefly at the change, and at first sight, yes, I do see a potential problem, although I'd need to refresh my memory on some aspects of the code to confirm. Currently the leader decides whether to issue a checkpoint command.

This is a bit of a tricky problem, and it seems equivalent to the problem of allowing readers on followers. One obvious idea would be to "throttle" checkpoint commands and subsequent frames commands, but it feels complicated.
It will be inconsistent in how the information is spread between the database file and the WAL, but the content of the database (as in the database file and WAL file together) should be identical, no?
Unless there's something obvious that I'm missing, I have the feeling this is going to be a bit of a tough problem :/ Or at least I can't think of any straightforward solution right now. What do you think?
Say that at apply-frames command N the followers perform the checkpoint but the leader does not. Won't apply-frames command N+1 be screwed? (i.e. the followers will not be able to apply it correctly because the WAL is different) Or maybe that was a problem only in earlier implementations. I'm checking the code now to refresh my memory.
Okay, I'm probably confusing the situation with an earlier implementation of dqlite where we were actually sending WAL frames, not merely database pages, so there was a need for the checkpoint command. Now we're only shipping page numbers, and there seems to be no dependency on the WAL state. So it might be fine; I'd just be extra careful and check whether there is any subtle ramification of the fact that, although the information in the database is the same, the physical bytes can be different, so the "FSM is deterministic" assumption gets relaxed. I don't know if that has consequences.
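A toy illustration of why page-based replication has no dependency on the WAL state (hypothetical types and a fixed page size, not the actual dqlite code): applying a frames command only needs the page numbers and page contents it carries, not a matching WAL layout on the receiving node.

```c
#include <string.h>

#define PAGE_SIZE 4096 /* illustrative; real code reads this from the db header */

/* Hypothetical frame: a page number plus the page's content. */
struct frame
{
    unsigned long page_number; /* 1-based, as in SQLite */
    unsigned char page[PAGE_SIZE];
};

/* Applying frames is a pure function of the page store: whether a
 * given page currently lives in the WAL or in the main file on this
 * node does not change the logical outcome. */
static void apply_frames(unsigned char *pages,
                         const struct frame *frames,
                         unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        memcpy(&pages[(frames[i].page_number - 1) * PAGE_SIZE],
               frames[i].page, PAGE_SIZE);
    }
}
```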
That's good news, thanks a lot, I'll be extra careful :-)
I think this is now ready for review; please take your time to go through it, and let me clarify anything that comes to mind. I'm out for the evening, will comment back in the morning. Changes:
Remarks:
Looks good to me, minus nitpicks. Good work with VFS-level tests.
This can happen when the leader of a cluster is running an older version of dqlite where checkpoints were synchronized cluster-wide.
Legacy followers only checkpoint their WAL when they receive a checkpoint command from the leader. To prevent the WAL from growing too large, checkpoint commands still have to be issued to accommodate these nodes.
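A minimal sketch of that compatibility rule, with a hypothetical helper and an illustrative threshold (the real dqlite trigger and constants may differ):

```c
#include <stdbool.h>

/* Illustrative threshold: broadcast a checkpoint command once the
 * WAL holds this many frames. Not the actual dqlite constant. */
#define CHECKPOINT_THRESHOLD 1000

/* Hypothetical helper: new-style nodes checkpoint on their own at
 * the end of apply_frames, but legacy followers only checkpoint on
 * command, so the leader must still issue the command once the WAL
 * grows large enough. */
static bool should_issue_checkpoint_command(unsigned wal_frames)
{
    return wal_frames >= CHECKPOINT_THRESHOLD;
}
```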
Found a bug, will reopen soon. Edit: the bug is fixed now; it was not really related to this PR. I'll just try to fix this last issue with clang complaining about the wrong ASan runtime. Edit 2: should be good now.
When the gateway is closed and the leader is freed, it's possible that there is still a reference to the raft_apply member in the leader requests queue of the raft node. Later, when raft accesses that memory while cleaning up its leader requests queue, a memory fault can be triggered. Solve this by allocating the request on the heap and only freeing it in the callback.
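A sketch of the ownership pattern described in that fix, against C-raft's public raft_apply() API (the surrounding dqlite code is simplified away; names here are illustrative):

```c
#include <raft.h>
#include <stdlib.h>

/* The request is heap-allocated and owned by raft until the callback
 * fires, so freeing the gateway/leader early can no longer leave a
 * dangling pointer in raft's leader requests queue. */
static void apply_cb(struct raft_apply *req, int status, void *result)
{
    (void)status;
    (void)result;
    free(req); /* freed only here, once raft is done with it */
}

static int submit_apply(struct raft *r, const struct raft_buffer *buf)
{
    struct raft_apply *req = malloc(sizeof *req);
    if (req == NULL) {
        return RAFT_NOMEM;
    }
    int rv = raft_apply(r, req, buf, 1, apply_cb);
    if (rv != 0) {
        free(req); /* raft did not take ownership on failure */
    }
    return rv;
}
```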
@freeekanayaka got a few minutes to do another quick review on this before we hit merge?
All good!
@freeekanayaka
This PR is WIP; it is just to have a discussion on the methodology, don't review the code yet. As discussed in the raft PR on the async snapshots, the trick to have faster snapshots in dqlite is to:

I currently set a flag `read_lock` on the db object to indicate that a snapshot or a checkpoint is busy, and allow neither to start while the other is running.

The problem I encountered is that, in the current implementation, the dqlite leader decides when to checkpoint: it issues a checkpoint command to the whole cluster, and when a node applies the checkpoint command, the sqlite database is checkpointed. Because nodes independently decide when to snapshot, the application of that checkpoint command can fail on a follower that is taking a snapshot (holding the `read_lock` while compressing and writing to disk) at that time, and the checkpoint command will not be executed again, because we only apply raft logs once.

My solution is to also let the nodes decide independently when to checkpoint; this happens at the end of `apply_frames` instead of as a result of a checkpoint command, as in the sketch below.

Now, here comes the question: do you see any obvious issues with that approach?
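To make the discussion concrete, here is a minimal sketch of that mutual exclusion, with hypothetical names; it assumes this state is only touched from dqlite's single event-loop thread, so a plain flag is enough:

```c
#include <stdbool.h>

struct db
{
    bool read_lock; /* set while a snapshot or a checkpoint is running */
};

/* Try to start a snapshot or a checkpoint: fail instead of blocking
 * if the other operation is in progress. */
static bool db_try_read_lock(struct db *db)
{
    if (db->read_lock) {
        return false;
    }
    db->read_lock = true;
    return true;
}

static void db_read_unlock(struct db *db)
{
    db->read_lock = false;
}

/* Called at the end of apply_frames: each node independently attempts
 * a checkpoint; if a snapshot currently holds the lock, the checkpoint
 * is simply skipped and retried after a later apply_frames. */
static void maybe_checkpoint(struct db *db)
{
    if (!db_try_read_lock(db)) {
        return; /* snapshot in progress, try again next time */
    }
    /* ... run the sqlite WAL checkpoint ... */
    db_read_unlock(db);
}
```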