-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: restart federation setup #3669
Conversation
f50471e
to
1554057
Compare
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## master #3669 +/- ##
==========================================
+ Coverage 57.13% 57.38% +0.25%
==========================================
Files 193 193
Lines 42861 42778 -83
==========================================
+ Hits 24487 24548 +61
+ Misses 18374 18230 -144 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, does anyone more familiar with the setup take a look? @justinmoon
fedimint-server/src/config/io.rs
Outdated
remove_existing_file(path.join(LOCAL_CONFIG).with_extension(JSON_EXT))?; | ||
remove_existing_file(path.join(CONSENSUS_CONFIG).with_extension(JSON_EXT))?; | ||
remove_existing_file(path.join(CLIENT_INVITE_CODE_FILE))?; | ||
remove_existing_file(path.join(CLIENT_CONFIG).with_extension(JSON_EXT))?; | ||
remove_existing_file(path.join(PRIVATE_CONFIG).with_extension(ENCRYPTED_EXT))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is very brittle if we ever add any files. In another PR we should introduce a list of files that FM potentially creates and have a test that checks that data dirs contain only these and nothing more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tracking here #3692
1554057
to
e55dd5a
Compare
CI is failing and it seems related to the PR |
e55dd5a
to
0a7a8c7
Compare
@elsirion , now green on CI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, but I'll leave @justinmoon a chance to take another look
One concern is that we could accidentally delete important key material if we ever called these delete fns by accident somewhere. To reduce that problem I'd propose to |
That's a good point, but probably best practice to keep the in-progress things in a temp directory (randomized) and |
marked as draft while applying this |
beff7a8
to
d065f9c
Compare
Are you on this @okjodom? @justinmoon should this get into 0.2? |
Trying to debug the failing reconnect test now. Been up and down I think an api changed? |
Ps, this was originally marked for 0.2.1 |
d065f9c
to
7ad169a
Compare
7ad169a
to
c4447a1
Compare
@elsirion, @justinmoon I'm tracking two issues as follow-ups to this. Take a look |
4471ada
to
e7dbb10
Compare
fixes #3788 |
- adds a helper for asserting the config server is in one of several expected status
- add restart_federation_setup rpc used by federation members for restarting the setup process - leader waits for all peers to progess to SetupRestarted. it then progresses to AwaitingPassword - each follower wait for leader to progress to AwaitingPassword before proceeding to AwaitingPassword
e7dbb10
to
6de7e8d
Compare
- add tests that cover restart of federation setup mid-way - this test restarts the federation right after SharingConfigGenParams
6de7e8d
to
f82f234
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks mostly good, left some questions.
I don't think this is stability-relevant, we don't need to guarantee stability of the setup API, just need to release the admin UI in lockstep (but any reasonable deployment should use the admin UI that fits the Fedimint version).
I also didn't find any Rust API breakage, so in principle this could be backported to 0.2 if we saw the need (I'd like to avoid that though).
let leader = { | ||
let expected_status = [ | ||
ServerStatus::SharingConfigGenParams, | ||
ServerStatus::ReadyForConfigGen, | ||
ServerStatus::ConfigGenFailed, | ||
ServerStatus::VerifyingConfigs, | ||
ServerStatus::VerifiedConfigs, | ||
]; | ||
let mut state = self.require_any_status(&expected_status)?; | ||
|
||
let cfg_staging_dir = self.data_dir.join(CONFIG_STAGING_DIR); | ||
self.remove_staged_configs(&cfg_staging_dir)?; | ||
|
||
state.status = ServerStatus::SetupRestarted; | ||
info!( | ||
target: fedimint_logging::LOG_NET_PEER_DKG, | ||
"Update config gen status to 'Setup restarted'" | ||
); | ||
// Create a WSClient for the leader | ||
state | ||
.local | ||
.clone() | ||
.and_then(|local| local.leader_api_url.map(WsAdminClient::new)) | ||
}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the reason to put this into a separate block?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this allows us to acquire lock on self.state
, use it and and releasing lock, before we call update_leader
. update leader creates a separate lock on state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't remember that require_any_status
returned a mutex guard, makes sense!
fn reset(&mut self) { | ||
self.config = None; | ||
self.peers = Default::default(); | ||
self.auth = None; | ||
self.requested_params = None; | ||
self.status = ServerStatus::AwaitingPassword; | ||
self.local = None; | ||
|
||
info!( | ||
target: fedimint_logging::LOG_NET_PEER_DKG, | ||
"Reset config gen state" | ||
); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to reset any on-disk state? I think the PASSWORD
file has been written already at that point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
at this point, PASSWORD
or any other config file is already written to a file in a temporary dir. remove_staged_config()
resets disk state by deleting the whole dir, and this function finally resets server state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thx for bearing with us, the release took a lot of bandwidth and this PR a few iterations, but I think we arrived at a good result at the end! |
After setting config gen params and before starting consensus, any peer can initiate a restart of the federation setup using
restart_federation_setup
rpc.SetupRestarted
status.SetupRestarted
.AwaitingPassword
.AwaitingPassword
, they also make progress to the same stateAll peers can then begin the setup process
While the server is in any of these states
SharingConfigGenParams
|ReadyForConfigGen
|ConfigGenFailed
|VerifyingConfigs
|VerifiedConfigs
, the setup UI should watch peer information in consensus status. Should any peer be inSetupRestarted
, the setup UI should immediately prompt the user to initiate restart by make a call torestart_federation_setup
. Suggestion here it to make this some blocking but interactive UXcloses #3535