
dqlite disk backend #368

Open
MathieuBordere opened this issue May 24, 2022 · 23 comments
Labels
Feature New feature, not a bug

Comments

@MathieuBordere
Contributor

MathieuBordere commented May 24, 2022

I'm currently working on a disk backend for dqlite. It's in a proof-of-concept phase right now, but I wanted to start a discussion on it.

My initial idea is/was to keep the WAL in memory, which allows us to easily reuse the replication logic, and to store the DB on disk as a regular sqlite3 database file. Checkpointing then flushes the WAL to the on-disk DB.
The vfsFileMethods for the DB file would then generally be the vanilla sqlite3 vfsFileMethods.

A couple of issues:

  • Checkpointing to disk happens in the main loop; this can quickly take a couple of hundred ms, blocking the event loop. I don't really know how this can be avoided (see the sketch after this list).
    maybeCheckpoint(db);
  • Snapshotting the database means copying the on-disk database, instead of passing pointers to raft. For this we would probably need to add the async snapshot behavior to raft. Async snapshot raft#268 .
  • Snapshotting, a.k.a. copying a large database every 1024 transactions, is also possibly very heavy on the storage. We can lower the frequency, but this will increase the memory footprint of the raft log.
  • Because the database can be large, we can't just load the entire snapshot in memory when sending a raft installSnapshot RPC. We would need to add a way to send the snapshot in chunks, adding to overall complexity. (e.g. we can't just delete snapshots that are still being sent, etc.)
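
As a rough illustration of the checkpoint problem above, here is a minimal sketch (hypothetical code, assuming a libuv timer drives the checkpoint): the sqlite3_wal_checkpoint_v2() call runs synchronously on the loop thread, so the whole event loop stalls until it returns.

/* Hypothetical sketch, not dqlite's actual maybeCheckpoint(): a periodic
 * libuv timer callback that checkpoints synchronously on the loop thread.
 * Nothing else is serviced until sqlite3_wal_checkpoint_v2() returns. */
#include <sqlite3.h>
#include <uv.h>

static void maybe_checkpoint_cb(uv_timer_t *timer)
{
    sqlite3 *db = timer->data; /* connection stored on the handle (assumption) */
    int wal_frames;
    int checkpointed_frames;

    /* Blocking: flushes WAL frames into the on-disk database file. */
    sqlite3_wal_checkpoint_v2(db, "main", SQLITE_CHECKPOINT_PASSIVE,
                              &wal_frames, &checkpointed_frames);
}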

All in all, I'm not feeling too confident right now that this is practically feasible. Do you want to share some insights on this, @freeekanayaka?

@freeekanayaka
Contributor

I agree with your assessment. Even without considering the rather complex effort of implementing what you describe (with all the ramifications you correctly identified), I feel that, as you point out, the main deal breaker is indeed the fact that the vfsFileMethods interface is a synchronous one. Not only would checkpointing block the main loop, but so would any operation that involves reading a database page (which would need to be fetched from disk). So even simple SELECT statements could end up blocking the event loop. Since disk access performance can vary a lot depending on the system state at a given moment, the worst-case scenario is that by the time a very long blocking disk I/O syscall returns and the event loop is unblocked, enough time might have elapsed that raft timers kick in, for example deposing the leader or converting a remote follower to a candidate (because it didn't receive anything from the leader). Not only that, but during that time nothing else could be done (e.g. serving another SELECT), which would probably reduce the overall performance/throughput of the system.
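
To make the synchronous nature of the interface concrete, here is a minimal sketch (not dqlite's actual VFS, just an illustration) of what a disk-backed xRead in sqlite3_io_methods looks like: the pread() call blocks whatever thread called sqlite3_step(), which in dqlite's case is the event-loop thread.

/* Sketch only: a disk-backed xRead for sqlite3_io_methods. SQLite calls it
 * synchronously whenever it needs a page that isn't cached, so the calling
 * thread (the event loop, in dqlite) blocks until pread() returns. */
#include <sqlite3.h>
#include <unistd.h>

struct disk_file {
    sqlite3_file base; /* must come first: SQLite treats this as a sqlite3_file */
    int fd;            /* hypothetical file descriptor of the database file */
};

static int disk_read(sqlite3_file *file, void *buf, int amount, sqlite3_int64 offset)
{
    struct disk_file *f = (struct disk_file *)file;
    ssize_t n = pread(f->fd, buf, (size_t)amount, (off_t)offset); /* blocking I/O */
    if (n == amount) {
        return SQLITE_OK;
    }
    /* A real VFS must also zero-fill the tail of the buffer on a short read. */
    return n < 0 ? SQLITE_IOERR_READ : SQLITE_IOERR_SHORT_READ;
}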

There are brute force ways of solving the issue of the main loop being blocked, for example going back to tricks like using libco (a coroutine library which was used in dqlite) and threads, but the effort would become even more substantial and complicated.

The current in-memory approach of dqlite was driven by 1) the fact that the synchronous SQLite APIs would make using an event loop AND storing stuff on disk quite hard (as I briefly mentioned above), and 2) the goal of dqlite as a database being to hold shared state between agents of various types (like LXD cluster nodes or Anbox nodes or edge devices etc.) without the need for an external "heavy-weight" highly-available database (e.g. CockroachDB). Those use cases typically don't need to store a lot of data, and there's a lot one can store in a few gigabytes, which can perfectly fit in memory on modern hardware.

I'm not entirely sure what use cases of dqlite have now emerged that stretch those assumptions. If it's mainly the k8s integration, perhaps there are ways to solve the problem at the kine level. Or maybe acknowledge that etcd might be a better fit for k8s (at least in the case of deployments that for some reason require storing a large amount of data).

Since @stgraber mentioned that you were looking at this issue, I've been thinking on and off about possible strategies to implement the on-disk approach, but I haven't come up with anything convincing yet. Perhaps it needs more investigation, but at first sight it feels like a bit of a hard problem.

@stgraber
Contributor

We're certainly going to run into trouble if we have both low memory and slow I/O.

The main goal here is to handle cases where you have a DB of say 20GB with only 1GB or so being commonly accessed on a system with 8GB of RAM. In an ideal case, what we'd see is Dqlite start up, read the data from disk as it needs it (hopefully never doing a full scan) and the VFS providing a fast in-memory caching layer for the data.

So long as the VFS cache doesn't get evicted, this should give us performance that's not too far off what we have in-memory as the data would still need to be read once from disk and then would be in the VFS cache.

One approach I suggested at the beginning, but which was discarded because just having sqlite run with a file backend seemed easier, was to use mmap to map on-disk data into memory, effectively allowing dqlite to still operate in much the same way it usually does, but with the data residing on disk and being copied in and out of memory based on memory pressure.

The main difficulty would be to ensure that dqlite doesn't ever need the entirety of the database in memory at once and also avoids having to perform full reads. This particularly affects snapshots where we'd want a filesystem based approach to atomically sync memory to the backing file, perform a copy of the backing file and then resume operations.

I also expect this mode to be an opt-in thing. We're still going to get much better performance and reliability out of the default in-memory operation mode, but we've seen that users like microk8s can have very large databases of mostly stale data where our current behavior becomes quickly problematic.

@freeekanayaka
Contributor

We're certainly going to run into trouble if we have both low memory and slow I/O.

The main goal here is to handle cases where you have a DB of say 20GB with only 1GB or so being commonly accessed on a system with 8GB of RAM. In an ideal case, what we'd see is Dqlite start up, read the data from disk as it needs it (hopefully never doing a full scan) and the VFS providing a fast in-memory caching layer for the data.

The good news here is that, assuming we find a way to do something along the lines of what @MathieuBordere described (WAL in memory and database file on disk), there's no work to do for the caching, because SQLite already does that in a layer above the VFS one. It already has its internal cache that tries to minimize VFS-level I/O.

So long as the VFS cache doesn't get evicted, this should give us performance that's not too far off what we have in-memory as the data would still need to be read once from disk and then would be in the VFS cache.

As said, I believe we can probably just rely on SQLite's cache for this specific behavior. Having dqlite's VFS implement the disk I/O only with no caching (again, just as described by @MathieuBordere) would be enough.
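
For reference, the size of that internal page cache is tunable per connection, e.g. via PRAGMA cache_size (a minimal sketch, with an arbitrary 64 MiB value):

/* Sketch: rely on SQLite's own page cache above the VFS. A negative
 * cache_size is interpreted as a limit in KiB, so this asks for ~64 MiB. */
#include <sqlite3.h>

static int configure_page_cache(sqlite3 *db)
{
    return sqlite3_exec(db, "PRAGMA cache_size=-65536", NULL, NULL, NULL);
}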

One approach I suggested at the beginning, but which was discarded because just having sqlite run with a file backend seemed easier, was to use mmap to map on-disk data into memory, effectively allowing dqlite to still operate in much the same way it usually does, but with the data residing on disk and being copied in and out of memory based on memory pressure.

This would certainly be the cheapest way to achieve something very similar to what @MathieuBordere described, while at the same time minimizing the code changes.
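
As a side note (a related but not identical idea): stock SQLite can already memory-map the database file for reads through its default VFS, enabled per database with PRAGMA mmap_size. A minimal sketch, with an arbitrary 1 GiB limit:

/* Sketch: enable stock SQLite's memory-mapped I/O. The kernel pages the
 * mapped portion of the database file in and out under memory pressure. */
#include <sqlite3.h>

static int enable_mmap_io(sqlite3 *db)
{
    return sqlite3_exec(db, "PRAGMA mmap_size=1073741824", NULL, NULL, NULL);
}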

The main difficulty would be to ensure that dqlite doesn't ever need the entirety of the database in memory at once and also avoids having to perform full reads. This particularly affects snapshots where we'd want a filesystem based approach to atomically sync memory to the backing file, perform a copy of the backing file and then resume operations.

This shouldn't be that much of a problem, because once you manage to have the database file on disk instead of in memory, you could basically change the on-disk format of the snapshot a bit and just cp the file (I don't think a full read can be avoided, or at least avoiding it would require other changes as well, but you wouldn't load it all in memory). When you need to send the snapshot to other nodes you would stream it in chunks like @MathieuBordere was suggesting, so it wouldn't all be loaded in memory.

It's probably obvious what I'm writing, but I'm just thinking out loud so we are sure to be on the same page.

I also expect this mode to be an opt-in thing. We're still going to get much better performance and reliability out of the default in-memory operation mode

I was thinking the same indeed.

but we've seen that users like microk8s can have very large databases of mostly stale data where our current behavior becomes quickly problematic.

I know I'm most probably beating a dead horse, but if microk8s is the only real-world case so far where this is an issue, maybe the question should be put on the table whether it's a better idea to use etcd, instead of going down a very costly and potentially sub-optimal path.

My only concern in implementing this feature is really purely technical: the "cheap" way to do it is with synchronous I/O (and as @MathieuBordere was saying it's not even that cheap, because of all the ramifications it has), and doing synchronous I/O means blocking the event loop which will then bring its own set of problems (reads can't be concurrent, slow I/O can disrupt the cluster, etc). The "optimal" way would probably be with coroutines combined with non-blocking I/O (either with AsyncIO, io_uring or simply a threadpool): the coroutines would be the trick to "suspend" SQLite's blocking APIs and resume them when the non-blocking I/O completes. However this approach is costly and complex to implement and would require a careful implementation that will then need to be hardened against the unavoidable bugs and edge cases.

At the moment I can't think of other approaches, but maybe there are.

@MathieuBordere
Contributor Author

MathieuBordere commented May 24, 2022

The good news here is that, assuming we find a way to do something along the lines of what @MathieuBordere described (WAL in memory and database file on disk), there's no work to do for the caching, because SQLite already does that in a layer above the VFS one. It already has its internal cache that tries to minimize VFS-level I/O.

I have a branch where this works, it's still a bit hacky, but works reasonably well.
Regarding the mmap approach, I can also investigate it if we have the feeling it's more appropriate.

I'll look into Libco and then probably start first by reintroducing raft async snapshots and chunked snapshot sends (necessary here and useful for vanilla dqlite/raft too).

edit: or are async snapshots no longer needed once Libco is involved?

@stgraber
Contributor

I know I'm most probably beating a dead horse, but if microk8s is the only real-world case so far where this is an issue, maybe the question should be put on the table whether it's a better idea to use etcd, instead of going down a very costly and potentially sub-optimal path.

Yeah, I certainly would be happy to see microk8s doing something similar to what we're doing with microcloud/microceph where we use dqlite to bootstrap a cluster, store machine roles and manage system configs, then use that config to spin up etcd on the correct systems and connect all machines into it. That would feel like a much more appropriate setup for them.

Though large sqlite databases aren't uncommon and so having a way to handle them in dqlite still makes sense in general.

We also have Juju that's planning on moving their database over to dqlite. In their case, they're currently looking at running a lot of small dqlite databases (up to 3 per model, though maybe less in their current iteration of the design), each of those being relatively small in size but the total footprint being quite significant if all held in-memory.

@freeekanayaka
Contributor

The good news here is that, assuming we find a way to do something along the lines of what @MathieuBordere described (WAL in memory and database file on disk), there's no work to do for the caching, because SQLite already does that in a layer above the VFS one. It already has its internal cache that tries to minimize VFS-level I/O.

I have a branch where this works, it's still a bit hacky, but works reasonably well. Regarding the mmap approach, I can also investigate it if we have the feeling it's more appropriate.

The advantage of mmap would be to minimize the code changes, but of course it would have mostly the same drawbacks as performing manual blocking I/O. I believe it depends where you want to go after the hacky approach is in place: if the plan is to keep blocking I/O and not introduce coroutines plus async I/O, then it probably doesn't matter much whether you use manual blocking I/O or mmap; perhaps you'd have a bit more control with the manual blocking I/O approach (at the cost of more code changes). Otherwise, if the plan is to start with blocking I/O to have something relatively cheap that starts working now and then move to coroutines plus async I/O later, then mmap is maybe actually the better option, since it buys you almost everything you get with manual blocking I/O with probably far fewer code changes, and those code changes would need to be thrown away anyway once you go async (or at least that's my feeling).

I'll look into Libco and then probably start first by reintroducing raft async snapshots and chunked snapshot sends (necessary here and useful for vanilla dqlite/raft too).

Note that I think there are several C coroutine libraries named libco; the one that dqlite used in the past was this one. To be honest, it feels like a lot of work and a can of worms to go down that path (moving away from libco was a major source of simplification for dqlite), but YMMV.

In any case, as @stgraber was suggesting I'd make sure that this behavior is triggered by a configuration flag and by default you get the current code basically unchanged, since it is now relatively mature and won't suffer from block-the-main-loop issues.

I think you will still need async snapshots, unless you introduce coroutines also at the raft level, which doesn't feel appropriate since it can be avoided.

@freeekanayaka
Contributor

I know I'm most probably beating a dead horse, but if microk8s is the only real-world case so far where this is an issue, maybe the question should be put on the table whether it's a better idea to use etcd, instead of going down a very costly and potentially sub-optimal path.

Yeah, I certainly would be happy to see microk8s doing something similar to what we're doing with microcloud/microceph where we use dqlite to bootstrap a cluster, store machine roles and manage system configs, then use that config to spin up etcd on the correct systems and connect all machines into it. That would feel like a much more appropriate setup for them.

Right, that indeed feels like a sane approach. We have always said it, but it might be worth repeating: regardless of the on-disk issue, since what k8s needs is a watchable key/value store, emulating that with SQL is not the greatest idea, since it introduces additional complexity for nothing in return (and you actually get slightly worse performance back). Let's keep dqlite for things that need SQL and are fine with dqlite's limitations (e.g. single-writer, in-memory, etc.). SQLite itself has made it clear what its appropriate uses are. If you want, I can volunteer to try to persuade folks on that :)

Though large sqlite databases aren't uncommon and so having a way to handle them in dqlite still makes sense in general.

Yes, it would be nice to have that, but it's impractical and costly, so in my mind you'd better have a really compelling reason, and I don't quite see one at the moment (though I realize I don't know many details). At the very least, that's something that could be deferred and those energies spent differently.

We also have Juju that's planning on moving their database over to dqlite. In their case, they're currently looking at running a lot of small dqlite databases (up to 3 per model, though maybe less in their current iteration of the design), each of those being relatively small in size but the total footprint being quite significant if all held in-memory.

Ok, I have no idea what their requirements are, but the rationale should ideally be "we'll use dqlite because after having evaluated our options we believe it's the right tool for the job". Although I certainly understand that using something that was built in house is a plus over a third party, it risks stretching things a bit.

I'd just like to make sure that people understand the technical factors that make the in-memory approach a wise practical one, and that going beyond that might be counterproductive (you might spend a lot of time to get something sub-optimal compared to other options).

@cole-miller
Contributor

cole-miller commented Sep 16, 2022

Naive question incoming: as a way of addressing the blocking issue, what about isolating all sqlite3 API calls in a separate thread that communicates using queues/channels with the main thread (the one that runs raft/dqlite server code)? I've been thinking about this for a bit and my sense is that it would be possible to implement, at the cost of a lot of extra code complexity -- how that compares to the tradeoffs of the coroutine approach, I don't know.

@freeekanayaka
Contributor

Naive question incoming: as a way of addressing the blocking issue, what about isolating all sqlite3 API calls in a separate thread that communicates using queues/channels with the main thread (the one that runs raft/dqlite server code)? I've been thinking about this for a bit and my sense is that it would be possible to implement, at the cost of a lot of extra code complexity -- how that compares to the tradeoffs of the coroutine approach, I don't know.

That might work, I believe, but then you'll probably have to deal with locks and race conditions. I'm not entirely clear on the implications of that, but that would presumably be the tradeoff.
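
As a very rough sketch of the shape such a worker thread could take (hypothetical names; libuv primitives only for the loop-side wakeup; all locking and shutdown questions deliberately glossed over):

/* Sketch only: a dedicated thread (started with uv_thread_create()) owns all
 * sqlite3 calls; the loop thread enqueues requests and is woken up via
 * uv_async_send() when a result is ready. The queue, locking and error
 * handling are omitted. */
#include <sqlite3.h>
#include <uv.h>

struct sql_request {
    const char *sql; /* statement to execute on the worker thread */
    int rc;          /* result code, filled in by the worker */
};

static uv_async_t done_async; /* wakes the loop thread when a request completes */

static void worker_main(void *arg)
{
    sqlite3 *db = arg;
    for (;;) {
        /* Hypothetical: pop the next request from a mutex-protected queue,
         * waiting on a condition variable until one is available. */
        struct sql_request *req = NULL /* queue_pop() */;
        if (req == NULL) {
            break; /* shutdown */
        }
        req->rc = sqlite3_exec(db, req->sql, NULL, NULL, NULL); /* blocks off-loop */
        uv_async_send(&done_async); /* thread-safe: tell the loop a result is ready */
    }
}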

There are a lot of ramifications involved in implementing this, I still recommend making this feature fully opt-in and keeping the current code unless you explicitly enable the feature.

@MathieuBordere
Contributor Author

MathieuBordere commented Sep 26, 2022

Regarding snapshots with possibly very large snapshot files:

In the dqlite disk-mode case, the WALs will live in memory and the DB files will live on disk.
If checkpoints are disabled, we can still copy the WAL like today and pass a pointer to the mmapped database files through the raft buffers that are passed in the snapshot functions' arguments. We just have to make sure to call munmap when cleaning up the snapshot; in the regular case we would do this in snapshot_finalize.

struct raft_snapshot is used in multiple places throughout the raft code, e.g. when loading a snapshot to send in the context of an InstallSnapshot RPC. It might be useful to add a user-provided snapshot_close_cb (or something similar) to struct raft_snapshot. raft would then promise to always run this when it no longer needs the snapshot. For snapshots provided by the dqlite fsm, this cb could munmap the DB files, while for in-memory snapshots, this would just be a regular raft_free, for example. The struct would look something like this.

struct raft_snapshot
{
    raft_index index;
    raft_term term;
    struct raft_configuration configuration;
    raft_index configuration_index;
    struct raft_buffer *bufs;
    unsigned n_bufs;
    snapshot_cb cb; /* Always runs on destruction, i.e. for cleanup */
};
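
To illustrate, for the disk case the cb could be something along these lines (hypothetical code, assuming the cb receives the snapshot itself and that the buffers point at mmapped regions of the DB file):

#include <raft.h>
#include <sys/mman.h>

/* Hypothetical cleanup callback for an mmapped, on-disk snapshot: raft would
 * invoke it once the snapshot is no longer needed, unmapping the regions
 * that were exposed through the raft buffers. */
static void disk_snapshot_cb(struct raft_snapshot *snapshot)
{
    unsigned i;
    for (i = 0; i < snapshot->n_bufs; i++) {
        munmap(snapshot->bufs[i].base, snapshot->bufs[i].len);
    }
    raft_free(snapshot->bufs);
}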

As a consequence the fsm_snapshot functions have to change signature to allow the user to provide a callback. I would propose to pass the whole snapshot struct as argument and the fsm fills out bufs, n_bufs and the cb.
Additionally we would need a snapshot_restore_async to load large snapshot files.

/* Communicates the bufs, n_bufs and snapshot_cb through the provided snapshot struct */
int (*snapshot)(struct raft_fsm *fsm,
                struct raft_snapshot * /* OUT */);
int (*snapshot_restore)(struct raft_fsm *fsm,
                        struct raft_snapshot *);
int (*snapshot_finalize)(struct raft_fsm *fsm,
                         struct raft_snapshot *);
int (*snapshot_async)(struct raft_fsm *fsm,
                      struct raft_snapshot * /* OUT */);
int (*snapshot_restore_async)(struct raft_fsm *fsm,
                              struct raft_snapshot *);

The internal raft functions also have to somehow know whether the user prefers to load snapshots in memory because they're sufficiently small, or prefers to leave them on disk and mmap them. Note that compressed snapshots have to be decompressed straight to disk for this to work without causing large allocations.

We could opt to never load snapshots in memory and always mmap them, but this would incur an extra disk write (decompressing to disk instead of to memory) and an extra disk read (first reading the compressed data, then reading the uncompressed data), which doesn't sound optimal. We could instead add a field to struct raft_fsm that's filled out by the user before init and indicates how raft should treat the snapshots.

Does this sound sensible?

@cole-miller
Contributor

I agree that snapshots should continue to live in memory by default (snapshotting to disk should be opt-in). My sense is that we expect most applications to want one of

  • DB file in memory, snapshots in memory
  • DB file on disk, snapshots on disk

If that's true, we should have a single configuration knob that makes it easy to set the DB mode and the snapshot mode simultaneously (without necessarily making it impossible to set them to distinct values).

@cole-miller
Contributor

@freeekanayaka

There are a lot of ramifications involved in implementing this, I still recommend making this feature fully opt-in and keeping the current code unless you explicitly enable the feature.

Agreed! It does seem like a daunting task to pick out all our sqlite3 API calls and rework them into a non-blocking idiom -- especially since (as far as I can tell) libuv doesn't offer a way to run a piece of blocking work in a dedicated thread. Might be something we can do incrementally, I guess.

@freeekanayaka
Contributor

@freeekanayaka

There are a lot of ramifications involved in implementing this, I still recommend making this feature fully opt-in and keeping the current code unless you explicitly enable the feature.

Agreed! It does seem like a daunting task to pick out all our sqlite3 API calls and rework them into a non-blocking idiom -- especially since (as far as I can tell) libuv doesn't offer a way to run a piece of blocking work in a dedicated thread.

There's uv_queue_work, but it doesn't help that much. The only sqlite3 API call we care about in this context is sqlite3_step(), but the problem is that it's not as simple as running it in a thread. We need to effectively "suspend" its execution (which is currently done by our custom VFS) during the final stages of a commit, where data is getting written to the WAL: we want to stop there, before the WAL is actually modified, capture the data that SQLite is writing to the WAL, replicate it with raft, and only then "resume" sqlite3_step() and finalize the WAL modification.
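
To make the "suspend" idea a bit more concrete, here is a very rough sketch with libco-style coroutines (hypothetical glue code, not the old dqlite implementation; the byuu libco API with co_create/co_switch/co_active is assumed):

/* Rough sketch of suspend/resume with coroutines: sqlite3_step() runs in its
 * own coroutine; when the custom VFS is about to modify the WAL it switches
 * back to the loop, which replicates the frames via raft and later resumes
 * the coroutine to finalize the write. */
#include <libco.h>
#include <sqlite3.h>

static cothread_t loop_co;          /* the event-loop coroutine */
static cothread_t step_co;          /* coroutine executing sqlite3_step() */
static sqlite3_stmt *pending_stmt;  /* statement being stepped (hypothetical) */

static void step_entry(void)
{
    sqlite3_step(pending_stmt); /* suspends inside the VFS at commit time */
    co_switch(loop_co);         /* done: hand control back to the loop */
}

/* Called from the custom VFS just before the WAL would be modified. */
static void vfs_wal_commit_hook(void)
{
    co_switch(loop_co); /* suspend: the loop thread replicates the frames via raft */
    /* execution resumes here once replication has succeeded */
}

/* Called by the loop thread to start stepping a statement. */
static void exec_step(sqlite3_stmt *stmt)
{
    pending_stmt = stmt;
    loop_co = co_active();
    step_co = co_create(1024 * 1024, step_entry); /* 1 MiB coroutine stack */
    co_switch(step_co);
}

/* Called by the loop when the raft apply callback fires. */
static void resume_commit(void)
{
    co_switch(step_co); /* resume sqlite3_step() and finalize the WAL write */
}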

Hope that's clear.

Might be something we can do incrementally, I guess.

What do you mean exactly? Implement support for on-disk databases incrementally?

@freeekanayaka
Contributor

I agree that snapshots should continue to live in memory by default (snapshotting to disk should be opt-in). My sense is that we expect most applications to want one of

  • DB file in memory, snapshots in memory
  • DB file on disk, snapshots on disk

If that's true, we should have a single configuration knob that makes it easy to set the DB mode and the snapshot mode simultaneously (without necessarily making it impossible to set them to distinct values).

That sounds like a good point to me. I think it's safe to assume that this is really an on/off switch, with only one of the two strategies supported and no mixing of these modes.

@cole-miller
Contributor

Ah, I was under the mistaken impression that SQLite required all API calls to originate from a single thread, which would preclude putting those calls into a thread pool.

We need to effectively "suspend" its execution (which is currently done by our custom VFS) during the final stages of a commit, where data is getting written to the WAL: we want to stop there, before the WAL is actually modified, capture the data that SQLite is writing to the WAL, replicate it with raft, and only then "resume" sqlite3_step() and finalize the WAL modification.

Ah, thanks for explaining. We're still keeping the WAL in memory, right? Where do problems come up in this suspend/resume dance if we're keeping the DB file on disk?

What do you mean exactly? Implement support for on-disk databases incrementally?

What I had in mind was merging the on-disk DB/snapshot support in a form that just blocks libuv at first, with follow-up commits gradually moving different pieces of blocking work off the loop thread.

@freeekanayaka
Contributor

Ah, I was under the mistaken impression that SQLite required all API calls to originate from a single thread, which would preclude putting those calls into a thread pool.

Well, SQLite has different modes of operation and thread-safety. We surely need to take into account both SQLite-level thread-safety and dqlite-level thread-safety if we are going to use threads in some form or another.
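
For reference, a minimal sketch of checking and configuring SQLite's threading mode (standard sqlite3 APIs; sqlite3_config() must be called before sqlite3_initialize()):

/* Sketch: SQLite's threading mode matters if sqlite3 calls move off the loop
 * thread. sqlite3_threadsafe() reports the compile-time setting; the runtime
 * mode can be tightened or relaxed at start time with sqlite3_config(). */
#include <sqlite3.h>

static int check_threading(void)
{
    if (sqlite3_threadsafe() == 0) {
        return -1; /* compiled with SQLITE_THREADSAFE=0: threads are off limits */
    }
    /* Multi-thread mode: each connection may only be used by one thread at a time. */
    return sqlite3_config(SQLITE_CONFIG_MULTITHREAD);
}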

We need to effectively "suspend" its execution (which is currently done by our custom VFS) during the final stages of a commit, where data is getting written to the WAL: we want to stop there, before the WAL is actually modified, capture the data that SQLite is writing to the WAL, replicate it with raft, and only then "resume" sqlite3_step() and finalize the WAL modification.

Ah, thanks for explaining. We're still keeping the WAL in memory, right? Where do problems come up in this suspend/resume dance if we're keeping the DB file on disk?

That's a good question, and basically the answer is I'm not sure :) And that's precisely the tricky part of the business: understanding all the ramifications. For one, if you start executing sqlite3_step() in a thread, you'll have to be careful about all the locking/concurrency mechanisms that we currently get for free because we're single-threaded. In other words, what happens exactly when 2 different concurrent requests call sqlite3_step() in parallel in different threads? Right now the whole system and VFS is designed around single-thread assumptions, and everything needs to be examined to see whether those assumptions still hold or not.

What do you mean exactly? Implement support for on-disk databases incrementally?

What I had in mind was merging the on-disk DB/snapshot support in a form that just blocks libuv at first, with follow-up commits gradually moving different pieces of blocking work off the loop thread.

Sure, any incremental implementation strategy is OK; the main thing is to have as detailed an idea of the end picture as possible, in order to avoid surprises down the road.

@freeekanayaka
Contributor

Regarding snapshots with possibly very large snapshot files:

In the dqlite disk-mode case, the WALs will live in memory and the DB files will live on disk. If checkpoints are disabled, we can still copy the WAL like today and pass a pointer to the mmapped database files through the raft buffers that are passed in the snapshot functions' arguments. We just have to make sure to call munmap when cleaning up the snapshot; in the regular case we would do this in snapshot_finalize.

struct raft_snapshot is used in multiple places throughout the raft code, e.g. when loading a snapshot to send in the context of an InstallSnapshot RPC. It might be useful to add a user-provided snapshot_close_cb (or something similar) to struct raft_snapshot. raft would then promise to always run this when it no longer needs the snapshot. For snapshots provided by the dqlite fsm, this cb could munmap the DB files, while for in-memory snapshots, this would just be a regular raft_free, for example. The struct would look something like this.

struct raft_snapshot
{
    raft_index index;
    raft_term term;
    struct raft_configuration configuration;
    raft_index configuration_index;
    struct raft_buffer *bufs;
    unsigned n_bufs;
    snapshot_cb cb; /* Always runs on destruction, i.e. for cleanup */
};

As a consequence the fsm_snapshot functions have to change signature to allow the user to provide a callback. I would propose to pass the whole snapshot struct as argument and the fsm fills out bufs, n_bufs and the cb. Additionally we would need a snapshot_restore_async to load large snapshot files.

/* Communicates the bufs, n_bufs and snapshot_cb through the provided snapshot struct */
int (*snapshot)(struct raft_fsm *fsm,
                struct raft_snapshot * /* OUT */);
int (*snapshot_restore)(struct raft_fsm *fsm,
                        struct raft_snapshot *);
int (*snapshot_finalize)(struct raft_fsm *fsm,
                         struct raft_snapshot *);
int (*snapshot_async)(struct raft_fsm *fsm,
                      struct raft_snapshot * /* OUT */);
int (*snapshot_restore_async)(struct raft_fsm *fsm,
                              struct raft_snapshot *);

The internal raft functions also have to somehow know whether the user prefers to load snapshots in memory because they're sufficiently small, or prefers to leave them on disk and mmap them. Note that compressed snapshots have to be decompressed straight to disk for this to work without causing large allocations.

We could opt to never load snapshots in memory and always mmap them, but this would incur an extra disk write (decompressing to disk instead of to memory) and an extra disk read (first reading the compressed data, then reading the uncompressed data), which doesn't sound optimal. We could instead add a field to struct raft_fsm that's filled out by the user before init and indicates how raft should treat the snapshots.

Does this sound sensible?

The feeling I get is that this design is based around mmap as a way to minimize changes in libraft. Am I having the wrong impression?

Instead, I'd say that we should probably bite the bullet and find a design to support large snapshots that can't be fully loaded in-memory in a way that does not assume mmap.

This is going to be quite tricky and complicates the interfaces and implementation (and that's the reason why the current implementation assumes snapshots that can fit in memory), but I don't see a way around this if we really want this feature. I'll try to sketch some proposal just to illustrate better what I mean.

@MathieuBordere
Contributor Author

The feeling I get is that this design is based around mmap as a way to minimize changes in libraft. Am I having the wrong impression?

No, your impression is completely right.

@freeekanayaka
Contributor

One possible design would be something like this:

/**
 * Tell the FSM that we want to start taking a snapshot.
 *
 * After this method is called, the FSM is guaranteed that no other snapshot
 * will be taken or restored until @snapshot_stop is called.
 *
 * However, while the snapshot is being taken the FSM must still be able to handle
 * new entries via the @apply method, which may be called in parallel with the
 * @snapshot_acquire method.
 */
int (*snapshot_start)(struct raft_fsm *fsm);

/**
 * Acquire the next chunk of the snapshot being taken.
 *
 * This method will be called one or more times after @snapshot_start has been
 * called, until the FSM implementation sets the @last output parameter to #true.
 *
 * This method will be called using the @async_work method of the #raft_io
 * implementation, so its implementation is allowed to perform blocking I/O.
 * However, this also means that it might be called in parallel with the @apply
 * method of the FSM.
 */
int (*snapshot_acquire)(struct raft_fsm *fsm,
                        struct raft_buffer *bufs[],
                        unsigned *n_bufs, bool *last);

/**
 * Release the given chunk of the snapshot being taken.
 *
 * This method will be called after each invocation of @snapshot_acquire,
 * notifying the FSM that the chunk won't be used anymore, and giving it a chance to possibly
 * release associated memory.
 */
int (*snapshot_release)(struct raft_fsm *fsm,
                        struct raft_buffer *bufs[],
                        unsigned *n_bufs);

/**
 * Tell the FSM that we want to stop taking a snapshot.
 *
 * This method will always be called after @snapshot_start has been invoked,
 * to notify the FSM that we want to stop taking the snapshot (either because
 * we completed it or we aborted it).                                                                    
 */
int (*snapshot_stop)(struct raft_fsm *fsm);

/**
 * Tell the FSM that we want to start restoring a snapshot.
 *
 * After this method is called, the FSM is guaranteed that no snapshot will be
 * taken or new entries applied.
 */
int (*restore_start)(struct raft_fsm *fsm);

/**
 * Restore the next snapshot chunk.
 *
 * This method will be called using the @async_work method of the #raft_io
 * implementation, so its implementation is allowed to perform blocking I/O.
 */
int (*restore_chunk)(struct raft_fsm *fsm,
                     struct raft_buffer *bufs[],
                     unsigned *n_bufs, bool last);

/**
 * Tell the FSM that we want to stop restoring a snapshot.
 */
int (*restore_stop)(struct raft_fsm *fsm);

This is just a rough sketch, I'm not sure it captures everything we want, but it's just to convey the idea. It will require non-trivial modifications in libraft I believe.
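
To give an idea of the FSM side, here is a hypothetical snapshot_acquire for the disk case that reads the database file in fixed-size chunks (names, chunk size and the fsm->data layout are all invented for the sketch; error handling and the in-memory WAL part of the snapshot are omitted):

#include <raft.h>
#include <stdbool.h>
#include <unistd.h>

#define SNAPSHOT_CHUNK_SIZE (4 * 1024 * 1024) /* 4 MiB per chunk (arbitrary) */

struct disk_fsm {
    int snapshot_fd;    /* opened by snapshot_start on the database file */
    off_t snapshot_off; /* how far we have read so far */
};

/* Hypothetical implementation of the proposed snapshot_acquire method. */
static int disk_fsm__snapshot_acquire(struct raft_fsm *fsm,
                                      struct raft_buffer *bufs[],
                                      unsigned *n_bufs,
                                      bool *last)
{
    struct disk_fsm *d = fsm->data;
    struct raft_buffer *buf = raft_malloc(sizeof *buf);

    buf->base = raft_malloc(SNAPSHOT_CHUNK_SIZE);
    /* Blocking read, expected to run via async_work, off the loop thread. */
    ssize_t n = pread(d->snapshot_fd, buf->base, SNAPSHOT_CHUNK_SIZE,
                      d->snapshot_off);
    buf->len = (size_t)n;
    d->snapshot_off += n;

    *bufs = buf;
    *n_bufs = 1;
    *last = n < SNAPSHOT_CHUNK_SIZE; /* short read means we hit end of file */
    return 0;
}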

@freeekanayaka
Contributor

One thing that will need tweaking is the load method of raft_io, since that assumes an in-memory snapshot. Of course snapshot_put and snapshot_get as well.

@MathieuBordere
Contributor Author

This is just a rough sketch, I'm not sure it captures everything we want, but it's just to convey the idea. It will require non-trivial modifications in libraft I believe.

I quite like the proposal, will try to grasp the most important ramifications for libraft.

@freeekanayaka
Contributor

freeekanayaka commented Sep 28, 2022

One thing that I'm not totally sure about is whether to make snapshot_acquire and restore_chunk synchronous (hence requiring them to be run by async_work and dealing with threads/parallelism) or rather to make them natively asynchronous (with a callback parameter).

Ideally I believe I would prefer the asynchronous approach; I just wrote the synchronous one for simplicity. Perhaps it's worth reasoning a bit about it.
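
For comparison, the natively asynchronous variant could look roughly like this (purely illustrative, nothing like this exists in libraft today):

/* Purely illustrative: a natively asynchronous acquire. The FSM starts the
 * read itself and invokes the callback from the loop thread once the chunk's
 * buffers are ready. */
struct raft_fsm_snapshot_acquire_req; /* hypothetical per-request handle */

typedef void (*raft_fsm_snapshot_acquire_cb)(
    struct raft_fsm_snapshot_acquire_req *req,
    struct raft_buffer bufs[],
    unsigned n_bufs,
    bool last,
    int status);

int (*snapshot_acquire)(struct raft_fsm *fsm,
                        struct raft_fsm_snapshot_acquire_req *req,
                        raft_fsm_snapshot_acquire_cb cb);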

@freeekanayaka
Contributor

This is just a rough sketch, I'm not sure it captures everything we want, but it's just to convey the idea. It will require non-trivial modifications in libraft I believe.

I quite like the proposal, will try to grasp the most important ramifications for libraft.

I'm trying as well to look at other aspects. I might be able to come up with a more detailed proposal too.
