
Restore is slow #266

Open
sirupsen opened this issue Jan 11, 2022 · 13 comments
Assignees
Labels
performance Improve performance of Litestream
Milestone

Comments

@sirupsen

sirupsen commented Jan 11, 2022

Hey @benbjohnson, long time no see!! Thank you for working on Litestream! 🙏🏻

I, too, love SQLite. I wanted to track a few events on my website, e.g. what people search for, and saw this as an opportunity to use Litestream. I loved the idea of tracking events in SQLite and just doing the analysis on a local copy.

However, even though my db is only ~100 KB on disk and ~1,000 rows over a few days, it takes ~10 seconds to restore with litestream restore, and this is going up fast.

Is there a plan for a litestream compress or similar to avoid replaying the WAL from early on, similar to what databases do when the WAL gets big enough? Or am I doing something wrong? Unfortunately this will be a bit of a deal-breaker for me using this in production :(

@benbjohnson
Owner

benbjohnson commented Jan 12, 2022

hey @sirupsen! 👋 It's been a long time indeed! Still at Shopify?

Restore performance is something that needs improvement but I'm surprised it's 10s for such a small workload. You can change the retention and snapshot-interval configuration fields to create a snapshot more frequently. If you don't care about keeping historic data, you can just set retention to 1h and it'll just keep one snapshot that's an hour old at most.

If you do want to retain data longer, then you could set retention to 24h (which is the default anyway) and the snapshot-interval to 1h. That'll make a new snapshot every hour but only keep them for a rolling 24 hours. Here's the docs for those settings: https://litestream.io/reference/config/#replica-settings
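
For reference, a minimal sketch of what those replica settings could look like in litestream.yml (the database path and bucket URL below are just placeholders):

dbs:
  - path: /data/app.db                # local SQLite database (placeholder path)
    replicas:
      - url: s3://my-bucket/app.db    # replica destination (placeholder bucket)
        retention: 24h                # keep snapshots/WAL for a rolling 24 hours
        snapshot-interval: 1h         # write a new snapshot every hour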

Another option you could try is setting the -parallelism N flag on the litestream restore command. If you set that to something high like 64 then it should speed up the downloads at least.
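
For example, something along these lines (the output path and replica URL are placeholders):

litestream restore -parallelism 64 -o /data/app.db s3://my-bucket/app.db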

Finally, there are plans for maintaining a hot backup so you can restore instantly but that's still a few months away. I'm also working on a version that works in a serverless environment but that's probably going to be ready later in the year.

@benbjohnson benbjohnson self-assigned this Jan 12, 2022
@benbjohnson benbjohnson added the bug (Something isn't working) and enhancement (New feature or request) labels and removed the bug (Something isn't working) label Jan 12, 2022
@benbjohnson benbjohnson added this to the v0.4.0 milestone Jan 12, 2022
@benbjohnson benbjohnson added the performance (Improve performance of Litestream) label and removed the enhancement (New feature or request) label Jan 12, 2022
@sirupsen
Author

sirupsen commented Jan 12, 2022

I stopped working at Shopify mid last year, doing infra consulting now :)

Thank youuuu! snapshot-interval is exactly what I needed... I somehow missed that in the docs. I can add a section on it to the Tips & Caveats docs if you'd accept it? That said, -parallelism did help a lot to speed up getting ~24h worth of WALs.

FWIW I'm using Cloud Run, so for me, it's already working in serverless 😉

I'm stoked for hot standbys, and maybe one day the ability to 'merge' instances would be cool too.

@benbjohnson
Owner

I can add a section on it to the Tips & Caveats docs if you'd accept it?

Yes! That'd be awesome. Thanks, Simon.

FWIW I'm using Cloud Run, so for me, it's already working in serverless

Cool. I saw some folks talking about getting Litestream running on Cloud Run but I haven't had a chance to give it a go yet.

The idea of "serverless SQLite" that I'm thinking of is paging in data on-demand in a way that's transactionally safe. That way it'd give you zero startup time but also low-latency queries once data is hot on a serverless instance. I'm still toying around with the idea but I think it might have some legs.

@benbjohnson
Owner

@sirupsen I saw your comment in #223 (comment) but I'm moving the discussion back over to this ticket.

I am seeing it being stuck on restore too. Is there a good debugging step I can take? 👀

Can you hit CTRL-\ to issue a SIGQUIT when it gets stuck for a bit? That should dump out a stack trace that'll tell us what it's stuck on.
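
If the process isn't attached to a terminal (e.g. it's running inside a container), sending the signal from another shell does the same thing; this assumes the process command line contains "litestream restore":

kill -QUIT $(pgrep -f "litestream restore")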

@sirupsen
Author

sirupsen commented Jan 17, 2022

@benbjohnson Sorry, I might have misused the word 'stuck'. It doesn't get stuck in a loop inside Litestream; the container is stuck in a restart loop because restoring with Litestream keeps failing. litestream restore just exits after a few seconds with the same error as @pfw:

cannot find max wal index for restore: missing initial wal segment: generation=4f2abd0f421cf473 index=00001c73 offset=1080

I got a few stack traces by sending SIGQUIT just before it exits, and they all look like the one in the screenshot below.

I have nothing sensitive in this database, so I've DM'ed you a zip of the generations directory on the Litestream slack 👍🏻

[Screenshot: stack trace output from SIGQUIT, captured 2022-01-17]

@benbjohnson
Owner

@sirupsen The missing initial wal segment issue from @pfw turned out to be two applications replicating into the same bucket, where each one's retention enforcement was deleting the other's WAL segments: #224 (comment)

I think the issue might be that GCR doesn't enforce a single instance at a time, so there can be overlap, especially when deploying, and that's what's causing the problem. I think GCR isn't going to work well until I can get better support for serverless into Litestream.

I'm not sure if you're committed to GCR but another good alternative is fly.io. If you attach a persistent disk on their instances then they enforce a single instance at a time.

@sirupsen
Author

sirupsen commented Jan 17, 2022

Fair enough... I will consider migrating to fly.io. 👍🏻

How do I fix this error, though, even when nothing is running on GCP? I'd like to recover my dear database.

@benbjohnson
Owner

How do I fix this error, though, even when nothing is running on GCP? I'd like to recover my dear database.

Unfortunately, with the missing initial WAL segment the best you can do is recover from the last snapshot.

# Copy out the last snapshot.
cp generations/f6d6d1e96d38dafb/snapshots/00000093.snapshot.lz4 db.lz4

# Uncompress it (lz4 needs -d to decompress; write the output to "db")
lz4 -d db.lz4 db

# Verify the database integrity
sqlite3 db
sqlite> PRAGMA integrity_check;
ok

@sirupsen
Author

I don't deserve you, thank you :)

benbjohnson/litestream.io#43

@benbjohnson
Owner

Thanks for going on this debugging journey with me, @sirupsen! The doc updates are incredibly helpful. 🎉

@benbjohnson benbjohnson modified the milestones: v0.4.0, v0.4.1 Mar 5, 2022
@hifi
Collaborator

hifi commented Aug 24, 2022

@sirupsen do you still have a test case you could throw at #416? I'm curious if you just hit an ordering bottleneck when retrieving WAL segments.

@sirupsen
Author

@hifi it's very fast for me these days, despite the database being far larger. You could probably still make it slow with poorly chosen snapshot intervals and retention, though!

@northeastprince

Can this be closed?
