Recover from brutal shutdown (checkpointing) (Take 2) #608
chubei started this conversation in Feature Requests
## Environment

Our checkpointing solution is based on a data structure `Environment`, which is built upon lmdb; we borrow lmdb's terminology here. `Environment` is defined as follows, ignoring most error handling. Its most important method is `rewind`: with its help, we can implement the simple checkpointing algorithm below.
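A minimal sketch of that interface, assuming lmdb-style storage underneath; the method names `rewind`, `history_range`, and `drop_history_before` are from this discussion, while the signatures and the `Checkpoint` placeholder are assumptions:

```rust
/// Placeholder for the `Checkpoint` type; see #516 for the actual definition.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
pub struct Checkpoint(pub u64);

/// Minimal sketch of the `Environment` interface.
pub trait Environment {
    /// The range of checkpoints this environment keeps, as
    /// (oldest rewindable checkpoint, latest checkpoint).
    fn history_range(&self) -> (Checkpoint, Checkpoint);

    /// Restore all persisted state to `checkpoint`.
    fn rewind(&mut self, checkpoint: Checkpoint);

    /// Drop all history before `checkpoint`, releasing space.
    fn drop_history_before(&mut self, checkpoint: Checkpoint);
}
```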
## Checkpointing

We use the `Checkpoint` type defined in the last discussion about checkpointing (see #516) to monomorphize the `Environment` type.

Every node in the DAG has an associated `Environment`, and nodes persist their data to it.

Upon pipeline start, we read every node's `history_range` and find the largest checkpoint that all nodes can rewind to. We rewind all nodes to this checkpoint and start the sources from it.
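A minimal sketch of this startup algorithm, assuming the `Environment` interface above and that `Checkpoint` is totally ordered (`recover` is a hypothetical name):

```rust
/// Sketch: rewind every node to the latest checkpoint they can all
/// reach, and return it so the sources can resume from there.
fn recover<E: Environment>(nodes: &mut [E]) -> Checkpoint {
    let target = nodes
        .iter()
        .map(|node| node.history_range().1) // latest checkpoint per node
        .min()
        .expect("the DAG has at least one node");
    for node in nodes.iter_mut() {
        node.rewind(target);
    }
    target
}
```

Taking the minimum of the latest checkpoints works because every node retains history at least back to the last merged checkpoint, so the minimum stays within every node's rewindable range.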
## Merger

As the pipeline runs, the space the environment occupies grows, because it has to remember all of the history. The `Merger` is responsible for dropping old history and releasing space.

The `Merger` runs in a separate thread and periodically queries the history range of every environment. It finds the largest checkpoint that all nodes can rewind to and calls `drop_history_before` with this checkpoint on every environment.
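A sketch of the `Merger` loop under the same assumed interface; the merge interval, the locking scheme, and `spawn_merger` are all assumptions:

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Sketch: periodically compute the latest checkpoint every node can
/// rewind to and drop all history before it.
fn spawn_merger<E: Environment + Send + 'static>(envs: Vec<Arc<Mutex<E>>>) {
    thread::spawn(move || loop {
        thread::sleep(Duration::from_secs(60)); // interval is an assumption
        let target = envs
            .iter()
            .map(|env| env.lock().unwrap().history_range().1)
            .min()
            .expect("the DAG has at least one node");
        for env in &envs {
            env.lock().unwrap().drop_history_before(target);
        }
    });
}
```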
## Implementing Managed Database

A managed database is implemented as one primary database, one snapshot database, and several incremental databases:

- The primary database always holds the latest state.
- The snapshot database is initially empty and is only updated on `drop_history_before`.
- Each incremental database is the difference between one checkpoint and its immediate successor.
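To make the layout concrete, here is a sketch with an in-memory `BTreeMap` standing in for each lmdb database; all names and types here are assumptions:

```rust
use std::collections::BTreeMap;

type Key = Vec<u8>;
type Value = Vec<u8>;

/// Stand-in for one lmdb database; an ordered map plays the role here.
type Database = BTreeMap<Key, Value>;

/// The operation log covering one (checkpoint, next checkpoint) interval.
enum Operation {
    Put(Key, Value),
    Delete(Key),
}
type IncrementalDatabase = Vec<Operation>;

struct ManagedDatabase {
    /// Always holds the latest state.
    primary: Database,
    /// State at the oldest retained checkpoint; empty at first.
    snapshot: Database,
    /// One log per checkpoint interval, oldest first.
    incremental: Vec<IncrementalDatabase>,
}
```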
## Initialization

Upon initialization, the primary database and the snapshot database are both empty, and there are no incremental databases.
## Create Transaction

Upon transaction creation, an incremental database covering the range from the latest checkpoint to the next one is created. Subsequent writes operate on this incremental database, as sketched below.
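In terms of the sketch above (`create_transaction` is a hypothetical name):

```rust
impl ManagedDatabase {
    /// Sketch: open a fresh incremental database covering the interval
    /// from the latest checkpoint to the next one.
    fn create_transaction(&mut self) {
        self.incremental.push(Vec::new());
    }
}
```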
## Reads

Reads always read from the primary database.
## Writes

Upon writing, `put`s and `delete`s operate on the primary database, and the operation log is recorded in the incremental database.

There are two ways to implement the incremental database: one is to simply serialize all the operations; the other is to summarize the operations and store only the net difference. For example, for the operations `Put 1 1`, `Put 2 2` and `Delete 1 1`, the first way would store `[Put 1 1, Put 2 2, Delete 1 1]`, while the second would store only `[Put 2 2]`. The exact workings of the second way are not totally clear yet.
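A sketch of the first approach, continuing the `ManagedDatabase` sketch (`current_log` is a hypothetical helper):

```rust
impl ManagedDatabase {
    /// Sketch: apply the write to the primary database and log it in
    /// the newest incremental database.
    fn put(&mut self, key: Key, value: Value) {
        self.primary.insert(key.clone(), value.clone());
        self.current_log().push(Operation::Put(key, value));
    }

    fn delete(&mut self, key: Key) {
        self.primary.remove(&key);
        self.current_log().push(Operation::Delete(key));
    }

    fn current_log(&mut self) -> &mut IncrementalDatabase {
        self.incremental
            .last_mut()
            .expect("create_transaction must be called first")
    }
}
```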
## Query History Range

The history range is just [snapshot database, primary database].
## Rewinding

Rewinding happens in three steps: reset the primary database to the snapshot database, replay the incremental databases up to the target checkpoint onto the primary database, and remove the incremental databases past the target checkpoint.

This assumes our incremental database doesn't support going backwards in time. If it does, we can instead apply the operations in reverse to the primary database and remove each incremental database once it has been applied.
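A sketch of the forward-only variant, with the target checkpoint expressed as the number of incremental intervals to keep (`keep` is a hypothetical parameter):

```rust
impl ManagedDatabase {
    /// Sketch: rewind by rebuilding the primary database from the
    /// snapshot, assuming incremental databases only replay forwards.
    fn rewind(&mut self, keep: usize) {
        // 1. Reset the primary database to the snapshot.
        self.primary = self.snapshot.clone();
        // 2. Replay the incremental databases up to the target checkpoint.
        for ops in &self.incremental[..keep] {
            for op in ops {
                match op {
                    Operation::Put(k, v) => {
                        self.primary.insert(k.clone(), v.clone());
                    }
                    Operation::Delete(k) => {
                        self.primary.remove(k);
                    }
                }
            }
        }
        // 3. Drop the incremental databases past the target checkpoint.
        self.incremental.truncate(keep);
    }
}
```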
## Merging

Upon merging, we apply all the operations in each incremental database to the snapshot database and remove that incremental database, repeating until we reach the target checkpoint.
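The corresponding merge, sketched with the same hypothetical indexing (`n` is the number of oldest intervals to fold in):

```rust
impl ManagedDatabase {
    /// Sketch of the merge behind `drop_history_before`: fold the `n`
    /// oldest incremental databases into the snapshot, then discard them.
    fn merge(&mut self, n: usize) {
        for ops in self.incremental.drain(..n) {
            for op in ops {
                match op {
                    Operation::Put(k, v) => {
                        self.snapshot.insert(k, v);
                    }
                    Operation::Delete(k) => {
                        self.snapshot.remove(&k);
                    }
                }
            }
        }
    }
}
```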