chore: client module recovery refactor #4035

dpc · 2024-01-13T07:19:58Z

This is an initial step towards the client module recovery system described and agreed on in #2977.

Most notable points of the new design:

Module recoveries work as just a standalone function (future): ClientModuleInit::recover is called instead of the usual ClientModuleInit::init. This makes the module's job easier (in a sense at least), as it is not constantly being interrupted (which will help with efficient streaming through blocks). It's up to the implementation to periodically save state and send progress updates, but follow-up changes will introduce helper functions to set the module implementation on the right track. It also avoids saving tons of in-progress data in the state machine execution log. Conceptually, module recovery is not a client state machine.

Module recovery state is persisted and tracked and modules become available as they complete their recovery (on the next client start). A slow module will not block recovery of other modules.
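
To illustrate the shape described above, here is a minimal sketch of a recovery routine that runs as one uninterrupted loop, periodically persists a checkpoint, and emits progress updates. It is not the code from this PR: `FakeDb`, `RecoveryState`, `checkpoint_every`, and the synchronous `recover` function are hypothetical stand-ins (the real `ClientModuleInit::recover` is a future), and only the `complete`/`total` fields of `RecoveryProgress` come from the diff.

```rust
use std::sync::mpsc::Sender;

/// Progress report with the same fields as `RecoveryProgress` in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct RecoveryProgress {
    pub complete: u32,
    pub total: u32,
}

/// Hypothetical persisted recovery state: the next block to process.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct RecoveryState {
    pub next_block: u32,
    pub finished: bool,
}

/// Hypothetical stand-in for the module's database.
#[derive(Debug, Default)]
pub struct FakeDb {
    pub saved: Option<RecoveryState>,
}

/// Runs (or resumes) a recovery over blocks `0..total`.
/// A checkpoint is saved and a progress update is sent every
/// `checkpoint_every` blocks and once more at the end.
pub fn recover(
    db: &mut FakeDb,
    total: u32,
    checkpoint_every: u32,
    progress: &Sender<RecoveryProgress>,
) {
    // Guard against a zero interval, which would divide by zero below.
    let every = checkpoint_every.max(1);
    // Resume from the last checkpoint if there is one.
    let mut state = db.saved.clone().unwrap_or_default();

    while state.next_block < total {
        // Real code would fetch and process the block here.
        state.next_block += 1;

        if state.next_block % every == 0 || state.next_block == total {
            db.saved = Some(state.clone());
            // A closed receiver only means nobody is listening for progress.
            let _ = progress.send(RecoveryProgress {
                complete: state.next_block,
                total,
            });
        }
    }

    // Mark completion so the module can be initialized normally next time.
    state.finished = true;
    db.saved = Some(state);
}
```

Because the checkpoint is the only thing persisted, an interrupted recovery resumes from the last saved block rather than replaying a log of intermediate states.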

Follow-up work

Mint recovery should be refactored into a framework-function that will take care of saving progress and sending updates (a method on ClientmoduleRecoverArgs)
Add streaming of blocks like in chore: speed up recovery by batching and streaming #4019, but this time for the whole range all at once (should lead to much better performance): feat: stream blocks in new refactoring of mint module recovery #4042
Caching of api.await_block in a transparent LRU: feat: await_block LRU cache (non-global) #4080

codecov · 2024-01-13T07:28:09Z

Codecov Report

Attention: 667 lines in your changes are missing coverage. Please review.

Comparison is base (d6f1ab8) 58.38% compared to head (57711ff) 58.34%.
Report is 31 commits behind head on master.

Files	Patch %	Lines
fedimint-client/src/lib.rs	31.93%	260 Missing ⚠️
modules/fedimint-mint-client/src/lib.rs	4.05%	142 Missing ⚠️
fedimint-client/src/module/init.rs	0.00%	98 Missing ⚠️
fedimint-client/src/module/mod.rs	0.00%	30 Missing ⚠️
fedimint-client/src/module/recovery.rs	0.00%	27 Missing ⚠️
fedimint-client/src/sm/executor.rs	58.46%	27 Missing ⚠️
fedimint-core/src/core.rs	0.00%	21 Missing ⚠️
fedimint-client/src/backup.rs	5.55%	17 Missing ⚠️
fedimint-client/src/db.rs	56.75%	16 Missing ⚠️
fedimint-cli/src/lib.rs	0.00%	13 Missing ⚠️
... and 6 more

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4035      +/-   ##
==========================================
- Coverage   58.38%   58.34%   -0.04%     
==========================================
  Files         193      192       -1     
  Lines       42577    42659      +82     
==========================================
+ Hits        24857    24889      +32     
- Misses      17720    17770      +50

elsirion

First part of review, a lot to follow, but better to get it out than to forget about it.

fedimint-client/src/module/init.rs

fedimint-client/src/module/mod.rs

fedimint-client/src/module/recovery.rs

elsirion · 2024-01-15T13:16:12Z

modules/fedimint-mint-client/src/client_db.rs

@@ -13,6 +14,7 @@ pub enum DbKeyPrefix {
    Note = 0x20,
    NextECashNoteIndex = 0x2a,
    CancelledOOBSpend = 0x2b,
+    RestoreState = 0x2c,


It had to happen eventually (in fact, it already did), still sad that we are now overloading the DB key prefixes. It's not a problem due to DB isolation, but reading raw hex encoded data will be a bit more ambiguous (e.g. if one doesn't know the instance<>kind mapping).

Yeah, we'll need dbdump to do a good job, e.g. even for keys that are not implemented, print <kind> <prefix> or something.
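
A rough sketch of what such a fallback could look like, assuming the dump tool has an instance-to-kind mapping at hand. The function name `describe_key`, the `u16` instance id type, and the output layout are assumptions for illustration, not the actual dbdump behavior.

```rust
use std::collections::BTreeMap;

/// Renders a raw module key as `<kind> <prefix> <rest-as-hex>`, so that an
/// overloaded prefix byte such as 0x2c stays unambiguous in a dump.
pub fn describe_key(kinds: &BTreeMap<u16, &str>, instance_id: u16, raw_key: &[u8]) -> String {
    // Fall back to a marker when the instance id has no known kind.
    let kind = kinds.get(&instance_id).copied().unwrap_or("unknown-kind");

    match raw_key.split_first() {
        Some((prefix, rest)) => {
            // Hex-encode everything after the prefix byte.
            let hex: String = rest.iter().map(|b| format!("{b:02x}")).collect();
            format!("{kind} 0x{prefix:02x} {hex}")
        }
        None => format!("{kind} <empty key>"),
    }
}
```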

maan2003

LGTM, just small nits

fedimint-client/src/lib.rs

maan2003 · 2024-01-16T04:51:34Z

fedimint-client/src/module/recovery.rs

+#[derive(Debug, Copy, Clone, Encodable, Decodable)]
+pub struct RecoveryProgress {
+    pub complete: u32,
+    pub total: u32,
+}


nice, apps can measure speed of recovery using updates and provide some ETA.
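
As a sketch of that idea, an app could keep two timestamped updates and extrapolate linearly. The function `estimate_remaining` and the choice to pass timestamps as `Duration` offsets are assumptions for illustration; only the `complete`/`total` fields come from the diff above, and the derive list is trimmed so the example compiles on its own.

```rust
use std::time::Duration;

/// Same fields as the struct in the diff, without the encoding derives.
#[derive(Debug, Copy, Clone)]
pub struct RecoveryProgress {
    pub complete: u32,
    pub total: u32,
}

/// Estimates the remaining time from two updates, each paired with the time
/// (as an offset from any fixed starting point) at which it was received.
/// Returns `None` when no progress or no time has passed between them.
pub fn estimate_remaining(
    earlier: (Duration, RecoveryProgress),
    later: (Duration, RecoveryProgress),
) -> Option<Duration> {
    let elapsed = later.0.checked_sub(earlier.0)?;
    let done = later.1.complete.checked_sub(earlier.1.complete)?;
    if done == 0 || elapsed.is_zero() {
        return None;
    }

    let remaining = later.1.total.saturating_sub(later.1.complete);
    // Integer arithmetic in nanoseconds: elapsed * remaining / done.
    let nanos = elapsed.as_nanos() * u128::from(remaining) / u128::from(done);
    if nanos > u128::from(u64::MAX) {
        return None;
    }
    Some(Duration::from_nanos(nanos as u64))
}
```

A linear extrapolation is only a rough guide, since the recovery speed may vary between block ranges.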

fedimint-client/src/module/recovery.rs

dpc · 2024-01-16T06:32:42Z

.

elsirion · 2024-01-17T09:00:11Z

fedimint-client/src/lib.rs

@@ -1552,18 +1761,42 @@ impl ClientBuilder {
        Ok(client)
    }

+    pub async fn join(


Needs documentation on when to use imo (with some big warnings about the risk of fund loss if re-joining a federation with a previously used key without recovering). Even if the user never provided the key and it was randomly generated, it will be a problem if e.g. an app allows leaving and re-joining federations.

Maybe we should always call recover and only attempt recovery if there is a backup (since otherwise it will take forever for real federations anyway)?

Oh, I actually wanted to have a top level secret key to be an enum:

enum RootSecret { Fresh(DeriveableSecret), FromUser(DeriveableSecret), }

so that Fresh is only returned when the key is randomly generated, and FromUser when entered by the user. This way type system would steer the implementation to handle it correctly, and it would be a good place to put docs explaining why it's needed.

Edit: This does not seem to work all that well.
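
For reference, the idea floated above (which, per the edit, did not work out well in practice) would look roughly like the sketch below. `DeriveableSecret` here is a local placeholder spelled as in the comment, and `StartMode` and `start_mode` are hypothetical names used only to show how the enum would steer callers.

```rust
/// Placeholder for the real secret type, spelled as in the comment above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct DeriveableSecret(pub [u8; 32]);

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum RootSecret {
    /// Randomly generated just now, so it cannot have joined anything yet.
    Fresh(DeriveableSecret),
    /// Entered by the user, so it may have been used before.
    FromUser(DeriveableSecret),
}

/// Hypothetical: which client entry point the caller should use.
#[derive(Debug, PartialEq, Eq)]
pub enum StartMode {
    Join,
    Recover,
}

/// The exhaustive match forces every caller to decide what to do with a
/// secret of unknown history instead of silently joining with it.
pub fn start_mode(secret: &RootSecret) -> StartMode {
    match secret {
        RootSecret::Fresh(_) => StartMode::Join,
        RootSecret::FromUser(_) => StartMode::Recover,
    }
}
```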

Ehhh... I wrote some docstrings, but it made me realize that without some reliable way to tell if a given root_secret ever joined Federations, most apps will have to pessimize and call recover a lot, which is particularly slow when a given Federation was not used (as it causes a whole history scan, because there can't be any previous backup).

> Maybe we should always call recover and only attempt recovery if there is a backup (since otherwise it will take forever for real federations anyway)?

And I'm not sure if relying on a backup is fundamentally sound. I wouldn't be surprised if backups will be a source of DoS and popular Federations will need to disable it altogether, or prune very old ones etc.

elsirion · 2024-01-17T09:08:30Z

fedimint-client/src/sm/executor.rs

+    /// Adds a number of state machines to the executor atomically with other DB
+    /// changes in `dbtx`, but without actually starting executing them.
+    ///
+    /// Like [`Self::add_state_machines_dbtx`] but useful for recovering
+    /// modules, where the module itself is not yet available (recovered) for
+    /// the executor.
+    pub async fn add_state_machines_inactive_dbtx(
+        &self,
+        dbtx: &mut DatabaseTransaction<'_>,
+        states: Vec<DynState<GC>>,
+    ) -> AddStateMachinesResult {
+        for state in states {
+            if !self
+                .inner
+                .valid_module_ids
+                .contains(&state.module_instance_id())
+            {
+                return Err(AddStateMachinesError::Other(anyhow!("Unknown module")));
+            }
+
+            let is_active_state = dbtx
+                .get_value(&ActiveStateKey::from_state(state.clone()))
+                .await
+                .is_some();
+            let is_inactive_state = dbtx
+                .get_value(&InactiveStateKey::from_state(state.clone()))
+                .await
+                .is_some();
+
+            if is_active_state || is_inactive_state {
+                return Err(AddStateMachinesError::StateAlreadyExists);
+            }
+
+            dbtx.insert_entry(
+                &ActiveStateKey::from_state(state.clone()),
+                &ActiveState::new(),
+            )
+            .await;
+        }
+
+        Ok(())
+    }


The naming of this fn seems weird. It adds a state as active to the executor without checking if it actually is active. This also means afaik that we have to only call it with active states, otherwise inactive ones will get stuck in the active table forever, as they don't have any state transitions by definition.

Either we make the is_terminal check not require the actual module context (that seems to be the problem here) or we have to add a filter in the executor that will remove inactive states from the active states table if they end up there by accident.

> Either we make the is_terminal check not require the actual module context (that seems to be the problem here) or we have to add a filter in the executor that will remove inactive states from the active states table if they end up there by accident.

Detecting terminal ones and handling them correctly seems to simplify the caller and to be generally more robust.

> Detecting terminal ones and handling them correctly seems to simplify the caller and to be generally more robust.

Can we do this in this PR please? I want to avoid forgetting about it after this one is used.

🙄 Ok, ok. See the last commit. I wasn't all that confident about how I handled it there, but I hope it's OK.
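
A simplified sketch of the approach discussed in this thread: one function that checks for duplicates and then routes each state to the active or inactive table depending on whether it is terminal. This is not the code from the last commit. The `State` trait, `Tables`, `Demo`, and `add_states` are hypothetical, and in-memory sets stand in for the `dbtx`-backed tables.

```rust
use std::collections::BTreeSet;

/// Hypothetical minimal view of a state machine state.
pub trait State {
    fn id(&self) -> u64;
    /// A terminal state has no further transitions.
    fn is_terminal(&self) -> bool;
}

/// In-memory stand-in for the active and inactive state tables.
#[derive(Debug, Default)]
pub struct Tables {
    pub active: BTreeSet<u64>,
    pub inactive: BTreeSet<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AddError {
    StateAlreadyExists,
}

/// Example state used to exercise the sketch.
pub struct Demo {
    pub id: u64,
    pub done: bool,
}

impl State for Demo {
    fn id(&self) -> u64 {
        self.id
    }
    fn is_terminal(&self) -> bool {
        self.done
    }
}

/// Adds states, placing terminal ones directly into the inactive table so
/// they can never get stuck among the active states.
pub fn add_states<S: State>(tables: &mut Tables, states: &[S]) -> Result<(), AddError> {
    // Validate everything first so that a failure leaves the tables untouched.
    for s in states {
        if tables.active.contains(&s.id()) || tables.inactive.contains(&s.id()) {
            return Err(AddError::StateAlreadyExists);
        }
    }
    for s in states {
        if s.is_terminal() {
            tables.inactive.insert(s.id());
        } else {
            tables.active.insert(s.id());
        }
    }
    Ok(())
}
```

With this shape the caller no longer needs to know which states are terminal, which is the simplification argued for above.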

dpc · 2024-01-17T20:33:26Z

@maan2003 @elsirion I added docstrings. Unless there are some architectural blockers, I'd rather land this, so the follow-up work is easier to do. Feel free to add any requests as items to #2977.

fedimint-client/src/sm/executor.rs

elsirion

Thx for adding the last commit!

elsirion · 2024-02-08T14:22:39Z

modules/fedimint-mint-client/src/lib.rs

@@ -1526,7 +1600,6 @@ pub enum MintClientStateMachines {
    Output(MintOutputStateMachine),
    Input(MintInputStateMachine),
    OOB(MintOOBStateMachine),
-    Restore(MintRestoreStateMachine),


I should have caught this: running recoveries or logs of past recoveries could crash the client. We can:

Revert this change and put a dummy struct in there (since enums are length-encoded we don't have to preserve MintRestoreStateMachine but can just ignore whatever bytes are found)

Write a client DB migration that removes this particular SM. This would be super hacky since it would have to be done on the client and not module level.

Wait for SM migrations to land. What's the progress looking there @m1sterc001guy?
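
To make the first option concrete, here is a sketch of decoding a length-prefixed enum where a removed variant keeps its index but its payload is skipped. The wire format used here (one byte for the variant index, one byte for the payload length) is an assumption chosen to keep the example small and is not fedimint's actual encoding. `Machine`, `ObsoleteRestore`, and `decode` are hypothetical names.

```rust
/// Variant indices follow the order in the diff: Output, Input, OOB, Restore.
#[derive(Debug, PartialEq, Eq)]
pub enum Machine {
    Output(Vec<u8>),
    Input(Vec<u8>),
    Oob(Vec<u8>),
    /// Placeholder for the removed variant; its payload is discarded.
    ObsoleteRestore,
}

/// Decodes one value and returns it with the remaining input.
/// Assumed layout: [variant: u8][len: u8][payload: len bytes].
pub fn decode(bytes: &[u8]) -> Option<(Machine, &[u8])> {
    let (&variant, rest) = bytes.split_first()?;
    let (&len, rest) = rest.split_first()?;
    let len = usize::from(len);
    if rest.len() < len {
        return None;
    }
    // The length prefix lets us step over a payload without understanding it.
    let (payload, rest) = rest.split_at(len);

    let machine = match variant {
        0 => Machine::Output(payload.to_vec()),
        1 => Machine::Input(payload.to_vec()),
        2 => Machine::Oob(payload.to_vec()),
        3 => Machine::ObsoleteRestore,
        _ => return None,
    };
    Some((machine, rest))
}
```

The point is that old database entries with the removed variant still decode, so the client does not crash when it encounters them.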

dpc requested review from a team as code owners January 13, 2024 07:19

dpc requested a review from a team as a code owner January 13, 2024 07:37

dpc force-pushed the 23-11-21-recovery-impov-1 branch 2 times, most recently from 090ce1b to 53be986 Compare January 13, 2024 07:56

elsirion reviewed Jan 15, 2024

maan2003 self-requested a review January 15, 2024 16:37

dpc mentioned this pull request Jan 15, 2024

feat: stream blocks in new refactoring of mint module recovery #4042

Merged

dpc force-pushed the 23-11-21-recovery-impov-1 branch from 53be986 to bd4e2a7 Compare January 15, 2024 17:47

dpc mentioned this pull request Jan 16, 2024

feat: await_block LRU cache #4046

Closed

maan2003 reviewed Jan 16, 2024

dpc force-pushed the 23-11-21-recovery-impov-1 branch from bd4e2a7 to fc8fd92 Compare January 16, 2024 06:30

dpc force-pushed the 23-11-21-recovery-impov-1 branch from fc8fd92 to c57e0d4 Compare January 16, 2024 06:40

maan2003 previously approved these changes Jan 16, 2024

elsirion reviewed Jan 17, 2024

dpc dismissed maan2003’s stale review via bafefe9 January 17, 2024 20:28

dpc requested a review from elsirion January 17, 2024 20:29

dpc mentioned this pull request Jan 17, 2024

Rearchitect modularized backup recovery #2977

Closed

4 tasks

maan2003 previously approved these changes Jan 18, 2024

elsirion requested a review from joschisan January 18, 2024 13:42

dpc dismissed maan2003’s stale review via 8be9f38 January 19, 2024 19:40

dpc added 5 commits January 19, 2024 11:41

chore: don't print error about uncommitted dbtx on user f error

addb2a8

chore: client module recovery refactor

6c7ca1b

fix: way for recoveries to create states

e95fb47

feat: fedimint-cli dev wait-complete

90f3082

chore: clippy

7b3b343

dpc added 2 commits January 19, 2024 11:44

chore: clippy

9c1b0c8

chore: add docstring to join and recover

8b45be2

dpc force-pushed the 23-11-21-recovery-impov-1 branch from 8be9f38 to 81dd56e Compare January 19, 2024 19:44

dpc requested a review from maan2003 January 19, 2024 20:10

dpc mentioned this pull request Jan 19, 2024

feat: await_block LRU cache (non-global) #4080

Merged

maan2003 previously approved these changes Jan 22, 2024

elsirion reviewed Jan 23, 2024

fedimint-client/src/sm/executor.rs (outdated)

refactor: unify add_state_machines{,_inactive}_dbtx

57711ff

dpc dismissed maan2003’s stale review via 57711ff January 23, 2024 22:34

dpc force-pushed the 23-11-21-recovery-impov-1 branch from 81dd56e to 57711ff Compare January 23, 2024 22:34

elsirion approved these changes Jan 24, 2024

elsirion requested a review from maan2003 January 24, 2024 06:43

dpc enabled auto-merge January 24, 2024 07:13

maan2003 approved these changes Jan 24, 2024

dpc added this pull request to the merge queue Jan 24, 2024

Merged via the queue into fedimint:master with commit aef0f62 Jan 24, 2024
20 checks passed

dpc deleted the 23-11-21-recovery-impov-1 branch January 24, 2024 08:18

elsirion mentioned this pull request Jan 27, 2024

Unable to restore without backup #4143

Open

elsirion reviewed Feb 8, 2024

elsirion mentioned this pull request Feb 8, 2024

Fix Mint restore SM breaking change #4271

Closed
