Clarify autocheckpoints consequences #547

Closed
FiloSottile opened this issue Dec 19, 2023 · 5 comments

Comments

@FiloSottile

https://litestream.io/tips/ recommends turning off autocheckpoints for high-write-load servers, if I understand correctly because the application might race Litestream while it switches locks.

I would like to understand the consequences of such a race happening. Is Litestream just going to notice it missed a WAL and make a fresh snapshot, like it does when it is stopped and restarted, or is it going to corrupt the replica?

I am asking because I would like not to make my application dependent on Litestream for checkpointing, and I am willing to take the risk of a few more snapshots, but not of a corrupt replica.
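For context, the tips-page recommendation amounts to the application handing checkpointing over to Litestream at connection setup. A minimal sketch of that, assuming database/sql with the mattn/go-sqlite3 driver (the driver and exact PRAGMA values are illustrative, not taken from this issue):

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

// open configures a SQLite database the way the Litestream tips page
// describes: WAL mode, a busy timeout, and autocheckpoint disabled so that
// Litestream is the only process running checkpoints.
func open(path string) (*sql.DB, error) {
	db, err := sql.Open("sqlite3", path)
	if err != nil {
		return nil, err
	}
	for _, pragma := range []string{
		"PRAGMA journal_mode = WAL;",
		"PRAGMA busy_timeout = 5000;",
		"PRAGMA wal_autocheckpoint = 0;", // Litestream owns checkpointing
	} {
		if _, err := db.Exec(pragma); err != nil {
			db.Close()
			return nil, err
		}
	}
	return db, nil
}

func main() {
	db, err := open("app.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```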

@hifi
Collaborator

hifi commented Dec 19, 2023

I'm not the original author, but I have worked with Litestream for a good while, so take my comments with a grain of salt.

From what I understand from the code and some late fixes I've done around checkpointing: if the WAL gets successfully checkpointed outside Litestream's supervision, Litestream will declare that it has lost its position due to a checksum mismatch and force a new generation, as you suspected.

This could be tested by disabling the persistent read lock code path and forcing checkpoints and writes to a database during replication.

However, I don't quite follow your reasoning for wanting your application to control checkpointing. Litestream intentionally keeps a read transaction open to prevent application checkpoints from rolling the WAL from the outside, so in practice all of your own checkpoints would fail unless they race Litestream successfully, which is an error condition.
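To illustrate why an application-side checkpoint attempt is harmless while Litestream holds its read transaction, here is a rough sketch (my own, not from Litestream or this issue; mattn/go-sqlite3 is assumed): PRAGMA wal_checkpoint reports a busy flag instead of truncating the WAL when another connection blocks it.

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "app.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// PRAGMA wal_checkpoint returns three columns: busy, log frames,
	// and checkpointed frames.
	var busy, logFrames, checkpointed int
	err = db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE);").
		Scan(&busy, &logFrames, &checkpointed)
	if err != nil {
		log.Fatal(err)
	}
	if busy == 1 {
		// Another connection (e.g. Litestream's long-lived read
		// transaction) blocked the checkpoint; the WAL stays in place
		// and nothing is lost.
		fmt.Println("checkpoint blocked, WAL untouched")
	} else {
		fmt.Printf("checkpointed %d of %d frames\n", checkpointed, logFrames)
	}
}
```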

@FiloSottile
Author

Thank you for the answer! Glad to hear it fails safe. Maybe it's worth mentioning on the tips page? As it is, it looked a bit scary. ("So wait, if I don't read this page and remember to set a PRAGMA, do I risk corruption?")

> However, I don't quite follow your reasoning for wanting your application to control checkpointing. Litestream intentionally keeps a read transaction open to prevent application checkpoints from rolling the WAL from the outside, so in practice all of your own checkpoints would fail unless they race Litestream successfully, which is an error condition.

Oh, sorry, I wasn't clear. I am saying that I don't want my application to have to be run with Litestream. Some users might use Litestream, some might use EBS atomic snapshots, and some might choose to have no replication at all. If I turn off autocheckpointing, users who don't run Litestream will end up with an endlessly growing WAL, so I'd have to add a config option, which is annoying and error-prone.

@hifi
Collaborator

hifi commented Dec 19, 2023

Ah, right, that makes sense.

I'd suggest keeping sane defaults and allowing the PRAGMAs to be overridden in config for anyone who wants to improve compatibility with Litestream, i.e. checkpointing stays on by default. That's what we've been doing: https://github.com/mautrix/go-util/blob/main/dbutil/litestream/register.go

But it should indeed be safe to have checkpointing on; at most it would force a new generation/snapshot if the app wins the unlikely race.
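A minimal sketch of that suggestion (hypothetical names, not the API in the linked mautrix/go-util file): a single config flag decides whether the Litestream-specific PRAGMA is applied, so checkpointing stays on by default for users who don't run Litestream.

```go
package dbconfig

// SQLiteConfig sketches how a config flag can keep SQLite's default
// autocheckpointing for plain deployments while letting Litestream
// deployments hand checkpointing over to Litestream.
type SQLiteConfig struct {
	LitestreamMode bool // if true, Litestream owns checkpointing
}

// Pragmas returns the connection-setup PRAGMAs for this configuration.
func (c SQLiteConfig) Pragmas() []string {
	p := []string{
		"PRAGMA journal_mode = WAL;",
		"PRAGMA busy_timeout = 5000;",
	}
	if c.LitestreamMode {
		// Without Litestream running, this would let the WAL grow
		// unbounded, hence keeping it behind a flag.
		p = append(p, "PRAGMA wal_autocheckpoint = 0;")
	}
	return p
}
```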

@hifi
Collaborator

hifi commented Dec 20, 2023

Just to be sure, I ran some tests where the read lock was intentionally removed from Litestream so it couldn't prevent external checkpoints at all. Regardless of how much I abused it, it always recovered successfully with:

time=2023-12-20T12:46:52.927+02:00 level=INFO msg="sync: new generation" db=//path/to/test.db generation=b9b04512b4365e9a reason="wal overwritten by another process"
time=2023-12-20T12:46:52.931+02:00 level=INFO msg="write snapshot" db=/path/to/test.db replica=file position=b9b04512b4365e9a/00000001:4152

So at worst it would do as expected and start a new generation if it lost the WAL. The remote was never corrupted and could always be restored up to the latest sync.

I'll update the documentation to be less scary about that, thanks!
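For anyone who wants to reproduce this, here is a rough reconstruction of the kind of abuse described above (my own sketch, not the actual test harness; mattn/go-sqlite3 assumed): one goroutine writes continuously while the main loop forces TRUNCATE checkpoints, so the WAL keeps getting checkpointed outside Litestream's control while Litestream replicates the database.

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "test.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if _, err := db.Exec("CREATE TABLE IF NOT EXISTS t (v INTEGER)"); err != nil {
		log.Fatal(err)
	}

	// Writer: keep appending rows so the WAL is never empty.
	go func() {
		for i := 0; ; i++ {
			if _, err := db.Exec("INSERT INTO t (v) VALUES (?)", i); err != nil {
				log.Println("insert:", err)
			}
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Checkpointer: repeatedly force checkpoints behind Litestream's back.
	for {
		if _, err := db.Exec("PRAGMA wal_checkpoint(TRUNCATE);"); err != nil {
			log.Println("checkpoint:", err)
		}
		time.Sleep(100 * time.Millisecond)
	}
}
```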

@hifi
Collaborator

hifi commented Dec 26, 2023

I added a new sentence to the paragraph:

> When Litestream notices this it will force a new generation and take a full snapshot to ensure consistency.

That should clear up the fear of it breaking. Closing this issue.

@hifi hifi closed this as completed Dec 26, 2023