dqlite disk backend #368
I agree with your assessment. Even without considering the rather complex effort of implementing what you describe (with all the ramifications you correctly identified), I feel that, as you point out, the main deal breaker is indeed the fact that the

There are brute force ways of solving the issue of the main loop being blocked, for example going back to tricks like using

The current in-memory approach of dqlite was driven by 1) the fact that the synchronous SQLite APIs would make using an event loop AND storing data on disk quite hard (as I briefly mentioned above), and 2) the goal of dqlite as a database being to hold shared state between agents of various types (like LXD cluster nodes or Anbox nodes or edge devices etc.) without the need for an external "heavy-weight" highly-available database (e.g. CockroachDB). Those use cases typically don't need to store a lot of data; there's a lot one can store in a few gigabytes, and that can perfectly fit in memory on modern hardware.

I'm not entirely sure what use cases of dqlite have now emerged that stretch those assumptions; if it's mainly the k8s integration, perhaps there are ways to solve the problem at the

Since @stgraber mentioned that you were looking at this issue, I've been thinking on and off about possible strategies to implement the on-disk approach, but I haven't come up with anything convincing yet. Perhaps it needs more investigation, but at first sight it feels like a bit of a hard problem.
We're certainly going to run into trouble if we have both low memory and slow I/O. The main goal here is to handle cases where you have a DB of, say, 20GB with only 1GB or so being commonly accessed, on a system with 8GB of RAM.

In an ideal case, what we'd see is dqlite start up, read the data from disk as it needs it (hopefully never doing a full scan), with the VFS providing a fast in-memory caching layer for the data. So long as the VFS cache doesn't get evicted, this should give us performance not too far off from what we have in-memory, as the data would only need to be read once from disk and would then be in the VFS cache.

One approach I suggested at the beginning, but which was discarded because just having sqlite run with a file backend seemed easier, was to use mmap to map on-disk data into memory, effectively allowing dqlite to still operate in much the same way it usually does, but with the data residing on disk and being copied in and out of memory based on memory pressure. The main difficulty would be to ensure that dqlite never needs the entirety of the database in memory at once, and also avoids having to perform full reads. This particularly affects snapshots, where we'd want a filesystem-based approach to atomically sync memory to the backing file, perform a copy of the backing file, and then resume operations.

I also expect this mode to be an opt-in thing. We're still going to get much better performance and reliability out of the default in-memory operation mode, but we've seen that users like microk8s can have very large databases of mostly stale data, where our current behavior quickly becomes problematic.
The good news here is that, assuming we find a way to do something along the lines of what @MathieuBordere described (WAL in memory and database file on disk), there's no work to do for the caching, because SQLite already does that in a layer above the VFS one. It already has its internal cache that tries to minimize VFS-level I/O.
As said, I believe we can probably just rely on SQLite's cache for this specific behavior. Having dqlite's VFS implement the disk I/O only with no caching (again, just as described by @MathieuBordere) would be enough.
This would certainly be the cheapest way to achieve something very similar to what @MathieuBordere described, while at the same time minimizing the code changes.
This shouldn't be that much of a problem, because once you manage to have the database file on disk instead of in memory, you could basically change the on-disk format of the snapshot a bit and just

It's probably obvious what I'm writing, but I'm just thinking out loud so we're sure to be on the same page.
I was thinking the same indeed.
I know I'm most probably beating a dead horse, but if microk8s is the only real-world case so far where this is an issue, maybe the question should be put on the table of whether it's a better idea to use etcd, instead of going down a very costly and potentially sub-optimal path.

My only concern in implementing this feature is really purely technical: the "cheap" way to do it is with synchronous I/O (and as @MathieuBordere was saying, it's not even that cheap, because of all the ramifications it has), and doing synchronous I/O means blocking the event loop, which will then bring its own set of problems (reads can't be concurrent, slow I/O can disrupt the cluster, etc.). The "optimal" way would probably be coroutines combined with non-blocking I/O (either with

At the moment I can't think of other approaches, but maybe there are.
I have a branch where this works; it's still a bit hacky, but it works reasonably well. I'll look into

edit: or are async snapshots no longer needed once libco is involved?
Yeah, I certainly would be happy to see microk8s doing something similar to what we're doing with microcloud/microceph, where we use dqlite to bootstrap a cluster, store machine roles and manage system configs, then use that config to spin up etcd on the correct systems and connect all machines into it. That would feel like a much more appropriate setup for them. Though large sqlite databases aren't uncommon, so having a way to handle them in dqlite still makes sense in general.

We also have Juju planning to move their database over to dqlite. In their case, they're currently looking at running a lot of small dqlite databases (up to 3 per model, though maybe fewer in the current iteration of the design), each relatively small in size, but with a total footprint that would be quite significant if all held in memory.
The advantage of
Note that I think there are several C coroutine libraries named

In any case, as @stgraber was suggesting, I'd make sure that this behavior is triggered by a configuration flag, and that by default you get the current code basically unchanged, since it is now relatively mature and doesn't suffer from block-the-main-loop issues.

I think you will still need async snapshots, unless you introduce coroutines at the raft level too, which doesn't feel appropriate since it can be avoided.
Right, that does feel like a sane approach.

We have always said it, but it might be worth repeating: regardless of the on-disk issue, since what k8s needs is a watchable key/value store, emulating that with SQL is not the greatest idea, as it introduces additional complexity for nothing in return (and you actually get back slightly worse performance). Let's keep dqlite for things that need SQL and are fine with dqlite's limitations (e.g. single-writer, in-memory, etc.). SQLite itself has made it clear what its appropriate uses are. If you want, I can volunteer to try to persuade folks on that :)
Yes, it would be nice to have that, but it's impractical and costly, so in my mind you'd better have a really compelling reason, and I don't quite see one at the moment (though I realize I don't know many details). At the very least, that's something that could be deferred and those energies spent differently.
Ok, I have no idea what their requirements are, but the rationale should ideally be "we'll use dqlite because, after having evaluated our options, we believe it's the right tool for the job". Although I certainly understand that using something built in house is a plus over a third-party option, it carries the risk of stretching things a bit. I'd just wish to make sure that people understand the technical factors that make the in-memory approach a wise practical one, and that going beyond it might be counterproductive (you might spend a lot of time to get something sub-optimal when compared to other options).
Naive question incoming: as a way of addressing the blocking issue, what about isolating all sqlite3 API calls in a separate thread that communicates with the main thread (the one that runs the raft/dqlite server code) using queues/channels? I've been thinking about this for a bit, and my sense is that it would be possible to implement, at the cost of a lot of extra code complexity -- how that compares to the tradeoffs of the coroutine approach, I don't know.
That might work, I believe, but then you'll probably have to deal with locks and race conditions. I'm not entirely clear on the implications of that, but that would presumably be the tradeoff. There are a lot of ramifications involved in implementing this; I still recommend making this feature fully opt-in and keeping the current code unless the feature is explicitly enabled.
Regarding snapshots with possibly very large snapshot files: In the dqlite disk-mode case, the WALs will live in memory and the DB files will live on disk.
As a consequence, the fsm_snapshot functions have to change their signature to allow the user to provide a callback. I would propose passing the whole snapshot struct as an argument, with the fsm filling out
The internal raft functions also have to somehow know whether the user prefers to load snapshots in memory because they're sufficiently small, or prefers to leave them on disk,

We could opt to never load snapshots in memory, always

Does this sound sensible?
I agree that snapshots should continue to live in memory by default (snapshotting to disk should be opt-in). My sense is that we expect most applications to want one of
If that's true, we should have a single configuration knob that makes it easy to set the DB mode and the snapshot mode simultaneously (without necessarily making it impossible to set them to distinct values). |
Agreed! It does seem like a daunting task to pick out all our sqlite3 API calls and rework them into a non-blocking idiom -- especially since (as far as I can tell) libuv doesn't offer a way to run a piece of blocking work in a dedicated thread. Might be something we can do incrementally, I guess.
There's

Hope that's clear.
What do you mean exactly? Implement support for the on-disk database incrementally?
That sounds like a good point to me. I think it's safe to assume that this is really an on/off switch, with only one of the two strategies supported and no mixing of these modes. |
Ah, I was under the mistaken impression that SQLite required all API calls to originate from a single thread, which would preclude putting those calls into a thread pool.
Ah, thanks for explaining. We're still keeping the WAL in memory, right? Where do problems come up in this suspend/resume dance if we're keeping the DB file on disk?
What I had in mind was merging the on-disk DB/snapshot support in a form that just blocks libuv at first, with follow-up commits gradually moving different pieces of blocking work off the loop thread. |
Well, SQLite has different modes of operation and thread-safety. We surely need to take into account both SQLite-level thread-safety and dqlite-level thread-safety if we are going to use threads in some form or another.
That's a good question, and basically the answer is: I'm not sure :) That's precisely the tricky part of the business, understanding all the ramifications. For one, if you start executing
Sure, any incremental implementation strategy is ok; the main thing is to have as detailed an idea as possible of the end picture, in order to avoid surprises down the road.
The feeling I get is that this design is based around

Instead, I'd say that we should probably bite the bullet and find a design to support large snapshots that can't be fully loaded in memory, in a way that does not assume

This is going to be quite tricky and it complicates the interfaces and implementation (and that's the reason why the current implementation assumes snapshots that fit in memory), but I don't see a way around it if we really want this feature. I'll try to sketch a proposal just to illustrate better what I mean.
No, your impression is completely right. |
One possible design would be something like this:

```c
/**
 * Tell the FSM that we want to start taking a snapshot.
 *
 * After this method is called, the FSM is guaranteed that no other snapshot
 * will be taken or restored until @snapshot_stop is called.
 *
 * However, while the snapshot is being taken the FSM must still be able to
 * handle new entries via the @apply method, which may be called in parallel
 * with the @snapshot_acquire method.
 */
int (*snapshot_start)(struct raft_fsm *fsm);

/**
 * Acquire the next chunk of the snapshot being taken.
 *
 * This method will be called one or more times after @snapshot_start has been
 * called, until the FSM implementation sets the @last output parameter to
 * #true.
 *
 * This method will be called using the @async_work method of the #raft_io
 * implementation, so its implementation is allowed to perform blocking I/O.
 * However, this also means that it might be called in parallel with the
 * @apply method of the FSM.
 */
int (*snapshot_acquire)(struct raft_fsm *fsm,
                        struct raft_buffer *bufs[],
                        unsigned *n_bufs, bool *last);

/**
 * Release the given chunk of the snapshot being taken.
 *
 * This method will be called after each invocation of @snapshot_acquire,
 * notifying the FSM that the chunk won't be used anymore, and giving it a
 * chance to release any associated memory.
 */
int (*snapshot_release)(struct raft_fsm *fsm,
                        struct raft_buffer *bufs[],
                        unsigned *n_bufs);

/**
 * Tell the FSM that we want to stop taking a snapshot.
 *
 * This method will always be called after @snapshot_start has been invoked,
 * to notify the FSM that we want to stop taking the snapshot (either because
 * we completed it or we aborted it).
 */
int (*snapshot_stop)(struct raft_fsm *fsm);

/**
 * Tell the FSM that we want to start restoring a snapshot.
 *
 * After this method is called, the FSM is guaranteed that no snapshot will be
 * taken or new entries applied.
 */
int (*restore_start)(struct raft_fsm *fsm);

/**
 * Restore the next snapshot chunk.
 *
 * This method will be called using the @async_work method of the #raft_io
 * implementation, so its implementation is allowed to perform blocking I/O.
 */
int (*restore_chunk)(struct raft_fsm *fsm,
                     struct raft_buffer *bufs[],
                     unsigned *n_bufs, bool last);

/**
 * Tell the FSM that we want to stop restoring a snapshot.
 */
int (*restore_stop)(struct raft_fsm *fsm);
```

This is just a rough sketch; I'm not sure it captures everything we want, but it's enough to convey the idea. I believe it will require non-trivial modifications in libraft.
One thing that will need tweaking is the
I quite like the proposal, will try to grasp the most important ramifications for libraft. |
One thing that I'm not totally sure about is whether to make

Ideally, I believe I would prefer the asynchronous approach; I just wrote the synchronous one for simplicity. Perhaps it's worth reasoning a bit about it.
I'm trying as well to look at other aspects. I might be able to come up with a more detailed proposal too. |
I'm currently working on a disk backend for dqlite; it's in a proof-of-concept phase right now, but I wanted to start a discussion on it.
My initial idea is/was to keep the WAL in memory; this allows us to easily reuse the replication logic while storing the DB on disk in a regular sqlite3 database file. Checkpointing then flushes the WAL to the disk DB.
The `vfsFileMethods` for the DB file would then generally be the vanilla sqlite3 `vfsFileMethods`.

A couple of issues:
- dqlite/src/fsm.c, line 274 in 35b5418: 1024 transactions is also possibly very heavy on the storage. We can lower the frequency, but this will increase the memory footprint of the raft log.
- installSnapshot RPC. We would need to add a way to send the snapshot in chunks, adding to overall complexity (e.g. we can't just delete snapshots that are still being sent, etc.).

All in all, I'm not feeling too confident right now about whether this is practically feasible. Do you want to share some insights on this, @freeekanayaka?