Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: client module recovery refactor #4035

Merged
merged 8 commits into from
Jan 24, 2024

Conversation

dpc
Copy link
Contributor

@dpc dpc commented Jan 13, 2024

Re #2977

This is a initial step towards client module recovery system described/agreed on in #2977.

Most notable points of the new design:

Module recoveries work as just standalone function (future): ClientModuleInit::recover called instead of usual ClientModuleInit::init. This makes module job easier (in a sense at least), as it is not constantly being interrupted (will help with efficient streaming through blocks). It's up to the implementation to periodically save state and send progress updates, but a follow up changes will introduce helper functions to set the module implementation on the right track. It also avoids saving tons of in-progress data in the state machine execuation log. Conceptually module recovery is not a client state machine.

Module recovery state is persisted and tracked and modules become available as they complete their recovery (on the next client start). A slow module will not block recovery of other modules.

Follow-up work

@dpc dpc requested review from a team as code owners January 13, 2024 07:19
Copy link

codecov bot commented Jan 13, 2024

Codecov Report

Attention: 667 lines in your changes are missing coverage. Please review.

Comparison is base (d6f1ab8) 58.38% compared to head (57711ff) 58.34%.
Report is 31 commits behind head on master.

Files Patch % Lines
fedimint-client/src/lib.rs 31.93% 260 Missing ⚠️
modules/fedimint-mint-client/src/lib.rs 4.05% 142 Missing ⚠️
fedimint-client/src/module/init.rs 0.00% 98 Missing ⚠️
fedimint-client/src/module/mod.rs 0.00% 30 Missing ⚠️
fedimint-client/src/module/recovery.rs 0.00% 27 Missing ⚠️
fedimint-client/src/sm/executor.rs 58.46% 27 Missing ⚠️
fedimint-core/src/core.rs 0.00% 21 Missing ⚠️
fedimint-client/src/backup.rs 5.55% 17 Missing ⚠️
fedimint-client/src/db.rs 56.75% 16 Missing ⚠️
fedimint-cli/src/lib.rs 0.00% 13 Missing ⚠️
... and 6 more
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4035      +/-   ##
==========================================
- Coverage   58.38%   58.34%   -0.04%     
==========================================
  Files         193      192       -1     
  Lines       42577    42659      +82     
==========================================
+ Hits        24857    24889      +32     
- Misses      17720    17770      +50     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dpc dpc requested a review from a team as a code owner January 13, 2024 07:37
@dpc dpc force-pushed the 23-11-21-recovery-impov-1 branch 2 times, most recently from 090ce1b to 53be986 Compare January 13, 2024 07:56
Copy link
Contributor

@elsirion elsirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First part of review, a lot to follow, but better to get it out than to forget about it.

fedimint-client/src/module/init.rs Outdated Show resolved Hide resolved
fedimint-client/src/module/init.rs Outdated Show resolved Hide resolved
fedimint-client/src/module/mod.rs Show resolved Hide resolved
fedimint-client/src/module/mod.rs Show resolved Hide resolved
fedimint-client/src/module/recovery.rs Show resolved Hide resolved
fedimint-client/src/module/recovery.rs Show resolved Hide resolved
@@ -13,6 +14,7 @@ pub enum DbKeyPrefix {
Note = 0x20,
NextECashNoteIndex = 0x2a,
CancelledOOBSpend = 0x2b,
RestoreState = 0x2c,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It had to happen eventually (in fact, it already did), still sad that we are now overloading the DB key prefixes. It's not a problem due to DB isolation, but reading raw hex encoded data will be a bit more ambiguous (e.g. if one doesn't know the instance<>kind mapping).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we'll need dbdump do a good job. eg. even for keys that are not implemented print <kind> <prefix> or something.

Copy link
Member

@maan2003 maan2003 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just small nits

fedimint-client/src/lib.rs Outdated Show resolved Hide resolved
fedimint-client/src/lib.rs Show resolved Hide resolved
Comment on lines +110 to +114
#[derive(Debug, Copy, Clone, Encodable, Decodable)]
pub struct RecoveryProgress {
pub complete: u32,
pub total: u32,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice, apps can measure speed of recovery using updates and provide some ETA.

fedimint-client/src/module/recovery.rs Show resolved Hide resolved
@dpc dpc force-pushed the 23-11-21-recovery-impov-1 branch from bd4e2a7 to fc8fd92 Compare January 16, 2024 06:30
@dpc
Copy link
Contributor Author

dpc commented Jan 16, 2024

.

@dpc dpc force-pushed the 23-11-21-recovery-impov-1 branch from fc8fd92 to c57e0d4 Compare January 16, 2024 06:40
maan2003
maan2003 previously approved these changes Jan 16, 2024
@@ -1552,18 +1761,42 @@ impl ClientBuilder {
Ok(client)
}

pub async fn join(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs documentation on when to use imo (with some big warnings about risk of fund loss if re-joining a federation with a previously used key without recovering). Even if the user never provided the key and it was randomly generated, but e.g. an app allowed leaving+joining federations that will be a problem.

Maybe we should always call recover and only attempt recovery if there is a backup (since otherwise it will take forever for real federations anyway)?

Copy link
Contributor Author

@dpc dpc Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I actually wanted to have a top level secret key to be an enum:

enum RootSecret {
  Fresh(DeriveableSecret),
  FromUser(DeriveableSecret),
}

so that Fresh is only returned when the key is randomly generated, and FromUser when entered by the user. This way type system would steer the implementation to handle it correctly, and it would be a good place to put docs explaining why it's needed.

Edit: This does not seem to work all that well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehhh... I wrote some docstrings, but it made me realize that without some reliable way to tell if a given root_secret ever joined Federations, most apps will have to pessimize and call recover a lot which is particularly slow when a given Federation was not used (as it causes a whole history scan, because there can't be any previous backup).

Maybe we should always call recover and only attempt recovery if there is a backup (since otherwise it will take forever for real federations anyway)?

And I'm not sure if relying on a backup is fundamentally sound. I wouldn't be surprised if backups will be a source of DoS and popular Federations will need to disable it altogether, or prune very old ones etc.

Comment on lines 185 to 226
/// Adds a number of state machines to the executor atomically with other DB
/// changes is `dbtx`, but without actually starting executing them.
///
/// Like [`Self::add_state_machines_dbtx`] but useful for recovering
/// modules, where to module itself is not yet available (recovered) for
/// the executor.
pub async fn add_state_machines_inactive_dbtx(
&self,
dbtx: &mut DatabaseTransaction<'_>,
states: Vec<DynState<GC>>,
) -> AddStateMachinesResult {
for state in states {
if !self
.inner
.valid_module_ids
.contains(&state.module_instance_id())
{
return Err(AddStateMachinesError::Other(anyhow!("Unknown module")));
}

let is_active_state = dbtx
.get_value(&ActiveStateKey::from_state(state.clone()))
.await
.is_some();
let is_inactive_state = dbtx
.get_value(&InactiveStateKey::from_state(state.clone()))
.await
.is_some();

if is_active_state || is_inactive_state {
return Err(AddStateMachinesError::StateAlreadyExists);
}

dbtx.insert_entry(
&ActiveStateKey::from_state(state.clone()),
&ActiveState::new(),
)
.await;
}

Ok(())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The naming of this fn seems weird. It adds a state as active to the executor without checking if it actually is active. This also means afaik that we have to only call it with active states, otherwise inactive ones will get stuck in the active table forever, as they don't have any state transitions by definition.

Either we make the is_terminal check not require the actual module context (that seems to be the problem here) or we have to add a filter in the executor that will remove inactive states from the active states table if they end up there by accident.

Copy link
Contributor Author

@dpc dpc Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either we make the is_terminal check not require the actual module context (that seems to be the problem here) or we have to add a filter in the executor that will remove inactive states from the active states table if they end up there by accident.

Detecting terminal ones and handling them correctly seems to simplify the caller and generally more robust.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Detecting terminal ones and handling them correctly seems to simplify the caller and generally more robust.

Can we do this in this PR please? I want to avoid forgetting about it after this one is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🙄 Ok, ok. See the last commit. I wasn't all that confident about how I handled it there, but I hope it's OK.

@dpc
Copy link
Contributor Author

dpc commented Jan 17, 2024

@maan2003 @elsirion I added docstring. Unless there are some architectural blockers, I'd rather land, so the follow-up work is easier to do. Feel free to add any requests as an item to #2977

maan2003
maan2003 previously approved these changes Jan 18, 2024
maan2003
maan2003 previously approved these changes Jan 22, 2024
Copy link
Contributor

@elsirion elsirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thx for adding the last commit!

@dpc dpc enabled auto-merge January 24, 2024 07:13
@dpc dpc added this pull request to the merge queue Jan 24, 2024
Merged via the queue into fedimint:master with commit aef0f62 Jan 24, 2024
20 checks passed
@dpc dpc deleted the 23-11-21-recovery-impov-1 branch January 24, 2024 08:18
@@ -1526,7 +1600,6 @@ pub enum MintClientStateMachines {
Output(MintOutputStateMachine),
Input(MintInputStateMachine),
OOB(MintOOBStateMachine),
Restore(MintRestoreStateMachine),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I should have caught this, running recoveries or logs of past recoveries could crash the client. We can:

  1. Revert this change and put a dummy struct in there (since enums are length-encoded we don't have to preserver MintRestoreStateMachine but can just ignore whatever bytes are found)
  2. Write a client DB migration that removes this particular SM. This would be super hacky since it would have to be done on the client and not module level.
  3. Wait for SM migrations to land. What's the progress looking there @m1sterc001guy?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants