chore: client module recovery refactor #4035
Conversation
Codecov Report. Attention: additional details and impacted files:

```diff
@@           Coverage Diff            @@
##           master    #4035    +/-  ##
=========================================
- Coverage   58.38%   58.34%   -0.04%
=========================================
  Files         193      192       -1
  Lines       42577    42659      +82
=========================================
+ Hits        24857    24889      +32
- Misses      17720    17770      +50
```

View full report in Codecov by Sentry.
Force-pushed from 090ce1b to 53be986
First part of review, a lot to follow, but better to get it out than to forget about it.
```diff
@@ -13,6 +14,7 @@ pub enum DbKeyPrefix {
     Note = 0x20,
     NextECashNoteIndex = 0x2a,
     CancelledOOBSpend = 0x2b,
+    RestoreState = 0x2c,
```
It had to happen eventually (in fact, it already did); still sad that we are now overloading the DB key prefixes. It's not a problem due to DB isolation, but reading raw hex-encoded data will be a bit more ambiguous (e.g. if one doesn't know the instance<>kind mapping).
Yeah, we'll need dbdump to do a good job, e.g. even for keys that are not implemented, print `<kind> <prefix>` or something.
Force-pushed from 53be986 to bd4e2a7
LGTM, just small nits
```rust
#[derive(Debug, Copy, Clone, Encodable, Decodable)]
pub struct RecoveryProgress {
    pub complete: u32,
    pub total: u32,
}
```
Nice: apps can measure the speed of recovery from these updates and provide some ETA.
Force-pushed from bd4e2a7 to fc8fd92
Force-pushed from fc8fd92 to c57e0d4
```diff
@@ -1552,18 +1761,42 @@ impl ClientBuilder {
         Ok(client)
     }

+    pub async fn join(
```
Needs documentation on when to use it, imo (with some big warnings about the risk of fund loss when re-joining a federation with a previously used key without recovering). Even if the user never provided the key and it was randomly generated, an app that e.g. allowed leaving and re-joining federations would hit this problem.
Maybe we should always call recover and only attempt recovery if there is a backup (since otherwise it will take forever for real federations anyway)?
Oh, I actually wanted the top-level secret key to be an enum:

```rust
enum RootSecret {
    Fresh(DeriveableSecret),
    FromUser(DeriveableSecret),
}
```

so that `Fresh` is only returned when the key is randomly generated, and `FromUser` when it was entered by the user. This way the type system would steer the implementation to handle it correctly, and it would be a good place to put docs explaining why it's needed.
Edit: This does not seem to work all that well.
Ehhh... I wrote some docstrings, but it made me realize that without some reliable way to tell whether a given `root_secret` ever joined any Federations, most apps will have to pessimize and call `recover` a lot, which is particularly slow when a given Federation was not used (as it causes a whole history scan, because there can't be any previous backup).

> Maybe we should always call recover and only attempt recovery if there is a backup (since otherwise it will take forever for real federations anyway)?

And I'm not sure relying on a backup is fundamentally sound. I wouldn't be surprised if backups become a source of DoS and popular Federations need to disable them altogether, or prune very old ones, etc.
fedimint-client/src/sm/executor.rs (outdated)
```rust
/// Adds a number of state machines to the executor atomically with other DB
/// changes in `dbtx`, but without actually starting to execute them.
///
/// Like [`Self::add_state_machines_dbtx`] but useful for recovering
/// modules, where the module itself is not yet available (recovered) for
/// the executor.
pub async fn add_state_machines_inactive_dbtx(
    &self,
    dbtx: &mut DatabaseTransaction<'_>,
    states: Vec<DynState<GC>>,
) -> AddStateMachinesResult {
    for state in states {
        if !self
            .inner
            .valid_module_ids
            .contains(&state.module_instance_id())
        {
            return Err(AddStateMachinesError::Other(anyhow!("Unknown module")));
        }

        let is_active_state = dbtx
            .get_value(&ActiveStateKey::from_state(state.clone()))
            .await
            .is_some();
        let is_inactive_state = dbtx
            .get_value(&InactiveStateKey::from_state(state.clone()))
            .await
            .is_some();

        if is_active_state || is_inactive_state {
            return Err(AddStateMachinesError::StateAlreadyExists);
        }

        dbtx.insert_entry(
            &ActiveStateKey::from_state(state.clone()),
            &ActiveState::new(),
        )
        .await;
    }

    Ok(())
}
```
The naming of this fn seems weird. It adds a state as active to the executor without checking whether it actually is active. This also means, afaik, that we have to only call it with active states, otherwise inactive ones will get stuck in the active table forever, as they don't have any state transitions by definition.

Either we make the `is_terminal` check not require the actual module context (that seems to be the problem here), or we have to add a filter in the executor that will remove inactive states from the active states table if they end up there by accident.
> Either we make the `is_terminal` check not require the actual module context (that seems to be the problem here) or we have to add a filter in the executor that will remove inactive states from the active states table if they end up there by accident.
Detecting terminal ones and handling them correctly seems to simplify the caller and is generally more robust.
> Detecting terminal ones and handling them correctly seems to simplify the caller and generally more robust.
Can we do this in this PR please? I want to avoid forgetting about it after this one is used.
🙄 Ok, ok. See the last commit. I wasn't all that confident about how I handled it there, but I hope it's OK.
Force-pushed from 8be9f38 to 81dd56e
Force-pushed from 81dd56e to 57711ff
Thx for adding the last commit!
```diff
@@ -1526,7 +1600,6 @@ pub enum MintClientStateMachines {
     Output(MintOutputStateMachine),
     Input(MintInputStateMachine),
     OOB(MintOOBStateMachine),
-    Restore(MintRestoreStateMachine),
```
I should have caught this: running recoveries, or logs of past recoveries, could crash the client. We can:

- Revert this change and put a dummy struct in there (since enums are length-encoded we don't have to preserve `MintRestoreStateMachine` but can just ignore whatever bytes are found)
- Write a client DB migration that removes this particular SM. This would be super hacky since it would have to be done on the client and not the module level.
- Wait for SM migrations to land. What's the progress looking like there @m1sterc001guy?
Re #2977
This is an initial step towards the client module recovery system described/agreed on in #2977.

Most notable points of the new design:

- Module recoveries work as a standalone function (future): `ClientModuleInit::recover` is called instead of the usual `ClientModuleInit::init`. This makes the module's job easier (in a sense at least), as it is not constantly being interrupted (which will help with efficient streaming through blocks). It's up to the implementation to periodically save state and send progress updates, but follow-up changes will introduce helper functions to set module implementations on the right track. It also avoids saving tons of in-progress data in the state machine execution log. Conceptually, module recovery is not a client state machine.
- Module recovery state is persisted and tracked, and modules become available as they complete their recovery (on the next client start). A slow module will not block recovery of other modules.
Follow-up work:

- `ClientModuleRecoverArgs`
- `api.await_block` in a transparent LRU: feat: await_block LRU cache (non-global) #4080