feat: restart federation setup #3669

Merged
merged 4 commits into from
Jan 15, 2024

Conversation

Contributor

@okjodom okjodom commented Nov 21, 2023

After setting config gen params and before starting consensus, any peer can initiate a restart of the federation setup using the restart_federation_setup RPC.

server-side behavior

  • This API sets the peer server to SetupRestarted status.
  • The leader server waits for all peers to progress to SetupRestarted.
  • The leader then automatically progresses to AwaitingPassword.
  • Followers watch the leader's status; when they see AwaitingPassword, they progress to the same state.

All peers can then begin the setup process again.
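
The transitions listed above can be sketched as two pure functions. This is only an illustration of the rules in this description, not the code in fedimint-server: the `Status` enum is a reduced stand-in for `ServerStatus`, and `leader_next` and `follower_next` are hypothetical names.

```rust
/// Reduced stand-in for the server status enum; only the variants needed
/// to illustrate the restart flow are included.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    SharingConfigGenParams,
    SetupRestarted,
    AwaitingPassword,
}

/// Leader rule: after restarting itself, the leader moves on to
/// `AwaitingPassword` only once every peer reports `SetupRestarted`.
pub fn leader_next(own: Status, peers: &[Status]) -> Status {
    let all_restarted = peers.iter().all(|s| *s == Status::SetupRestarted);
    if own == Status::SetupRestarted && all_restarted {
        Status::AwaitingPassword
    } else {
        own
    }
}

/// Follower rule: after restarting, a follower copies the leader as soon
/// as the leader reaches `AwaitingPassword`.
pub fn follower_next(own: Status, leader: Status) -> Status {
    if own == Status::SetupRestarted && leader == Status::AwaitingPassword {
        Status::AwaitingPassword
    } else {
        own
    }
}
```

Keeping the rules free of I/O makes the "leader waits, followers follow" ordering easy to test in isolation.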

suggested user experience

While the server is in any of these states (SharingConfigGenParams | ReadyForConfigGen | ConfigGenFailed | VerifyingConfigs | VerifiedConfigs), the setup UI should watch peer information in the consensus status. Should any peer be in SetupRestarted, the setup UI should immediately prompt the user to initiate a restart by making a call to restart_federation_setup. The suggestion here is to make this a blocking but interactive UX.
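
The check the UI would need can be sketched as a single predicate. The setup UI is likely not written in Rust, so this only illustrates the logic; `should_prompt_restart` is a hypothetical name and the enum below lists just the statuses mentioned in this PR.

```rust
/// The statuses named in this PR; the real enum may contain more.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ServerStatus {
    AwaitingPassword,
    SharingConfigGenParams,
    ReadyForConfigGen,
    ConfigGenFailed,
    VerifyingConfigs,
    VerifiedConfigs,
    SetupRestarted,
}

/// Returns true when the local server is in a state from which a restart
/// is allowed and at least one peer has already restarted its setup.
pub fn should_prompt_restart(local: ServerStatus, peers: &[ServerStatus]) -> bool {
    let restartable = matches!(
        local,
        ServerStatus::SharingConfigGenParams
            | ServerStatus::ReadyForConfigGen
            | ServerStatus::ConfigGenFailed
            | ServerStatus::VerifyingConfigs
            | ServerStatus::VerifiedConfigs
    );
    restartable && peers.contains(&ServerStatus::SetupRestarted)
}
```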

closes #3535

@okjodom okjodom marked this pull request as ready for review November 21, 2023 20:29
@okjodom okjodom requested review from a team as code owners November 21, 2023 20:29

codecov bot commented Nov 22, 2023

Codecov Report

Attention: 15 lines in your changes are missing coverage. Please review.

Comparison is base (e882197) 57.13% compared to head (4471ada) 57.38%.
Report is 13 commits behind head on master.

❗ Current head 4471ada differs from pull request most recent head f82f234. Consider uploading reports for the commit f82f234 to get more accurate results

Files                                Patch %   Lines
fedimint-server/src/config/api.rs    93.30%    15 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3669      +/-   ##
==========================================
+ Coverage   57.13%   57.38%   +0.25%     
==========================================
  Files         193      193              
  Lines       42861    42778      -83     
==========================================
+ Hits        24487    24548      +61     
+ Misses      18374    18230     -144     


elsirion previously approved these changes Nov 22, 2023
Contributor

@elsirion elsirion left a comment

Looks good. Could someone more familiar with the setup take a look? @justinmoon

fedimint-server/src/config/api.rs Outdated
Comment on lines 135 to 139
remove_existing_file(path.join(LOCAL_CONFIG).with_extension(JSON_EXT))?;
remove_existing_file(path.join(CONSENSUS_CONFIG).with_extension(JSON_EXT))?;
remove_existing_file(path.join(CLIENT_INVITE_CODE_FILE))?;
remove_existing_file(path.join(CLIENT_CONFIG).with_extension(JSON_EXT))?;
remove_existing_file(path.join(PRIVATE_CONFIG).with_extension(ENCRYPTED_EXT))?;
Contributor

This is very brittle if we ever add any files. In another PR we should introduce a list of files that FM potentially creates and have a test that checks that data dirs contain only these and nothing more.

Contributor Author

tracking here #3692
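
The check proposed in this thread could look roughly like the sketch below. It is an illustration only: `unexpected_entries` is a hypothetical helper, and the allowlist is passed in by the caller because the full set of files FM creates is not listed in this conversation.

```rust
use std::collections::BTreeSet;
use std::fs;
use std::io;
use std::path::Path;

/// Returns the names of all entries in `dir` that are not in `allowed`.
/// A test could assert that this set is empty for a freshly set up data dir.
pub fn unexpected_entries(dir: &Path, allowed: &[&str]) -> io::Result<BTreeSet<String>> {
    let mut unexpected = BTreeSet::new();
    for entry in fs::read_dir(dir)? {
        // Non-UTF-8 names are converted lossily, which is enough for reporting.
        let name = entry?.file_name().to_string_lossy().into_owned();
        if !allowed.contains(&name.as_str()) {
            unexpected.insert(name);
        }
    }
    Ok(unexpected)
}
```

With a single shared allowlist, the deletion code and the test would both break visibly when a new file is introduced without updating the list.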

@elsirion
Contributor

CI is failing, and the failure seems related to this PR.

@okjodom
Contributor Author

okjodom commented Nov 24, 2023

@elsirion, CI is green now.

elsirion previously approved these changes Nov 24, 2023
Contributor

@elsirion elsirion left a comment

Looks good to me, but I'll leave @justinmoon a chance to take another look

@elsirion
Contributor

One concern is that we could accidentally delete important key material if we ever called these delete fns by accident somewhere. To reduce that problem I'd propose to chmod -w config files right before starting the consensus, so that deletes are prevented.
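
A minimal sketch of the read-only idea using only the standard library follows; `make_read_only` is a hypothetical helper, not something in this PR. One caveat to keep in mind: on Unix, removing a file is governed by the permissions of the containing directory, so marking the file itself read-only mainly guards against overwrites there, and the directory would need the same treatment to block deletes.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Marks a single file as read-only, similar in spirit to `chmod -w`.
/// Whether this also blocks deletion depends on the platform (see the
/// caveat in the text above).
pub fn make_read_only(path: &Path) -> io::Result<()> {
    let mut perms = fs::metadata(path)?.permissions();
    perms.set_readonly(true);
    fs::set_permissions(path, perms)
}
```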

@dpc
Contributor

dpc commented Nov 26, 2023

> One concern is that we could accidentally delete important key material if we ever called these delete fns by accident somewhere. To reduce that problem I'd propose to chmod -w config files right before starting the consensus, so that deletes are prevented.

That's a good point, but the best practice is probably to keep in-progress files in a (randomized) temp directory and rename it atomically on success.
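
The staging approach can be sketched as follows, assuming the staging and target directories live on the same filesystem so that the rename is a single atomic step on POSIX systems. `write_then_publish` is a hypothetical name, and choosing a randomized staging path is left to the caller because the standard library has no random source.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Writes all files into `staging`, then moves the whole directory to
/// `target` in one rename. Until the rename succeeds, `target` does not
/// exist, so an aborted setup only ever leaves a disposable staging dir.
pub fn write_then_publish(
    staging: &Path,
    target: &Path,
    files: &[(&str, &[u8])],
) -> io::Result<()> {
    fs::create_dir_all(staging)?;
    for (name, contents) in files {
        fs::write(staging.join(name), contents)?;
    }
    // Assumption: `target` does not exist yet (or is an empty directory).
    fs::rename(staging, target)
}
```

Restarting the setup then reduces to deleting the staging directory, and finished configs are never touched by the cleanup path.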

@okjodom okjodom marked this pull request as draft November 27, 2023 16:14
@okjodom
Contributor Author

okjodom commented Nov 27, 2023

> > One concern is that we could accidentally delete important key material if we ever called these delete fns by accident somewhere. To reduce that problem I'd propose to chmod -w config files right before starting the consensus, so that deletes are prevented.
>
> That's a good point, but probably best practice to keep the in-progress things in a temp directory (randomized) and rename it atomically on success.

Marked as draft while I apply this.

@elsirion
Contributor

Are you on this @okjodom? @justinmoon should this get into 0.2?

@okjodom
Contributor Author

okjodom commented Nov 29, 2023

> Are you on this @okjodom? @justinmoon should this get into 0.2?

Trying to debug the failing reconnect test now. Been up and down.

I think an API changed?

@okjodom
Contributor Author

okjodom commented Nov 29, 2023

P.S. This was originally marked for 0.2.1.
I bumped it after the first round of reviews because I thought I'd be able to land it quickly.

@okjodom
Contributor Author

okjodom commented Nov 29, 2023

@elsirion, @justinmoon I'm tracking two issues as follow-ups to this. Take a look

@okjodom okjodom force-pushed the restart-setup branch 2 times, most recently from 4471ada to e7dbb10 Compare December 10, 2023 15:56
@okjodom
Contributor Author

okjodom commented Dec 11, 2023

fixes #3788

- add a helper for asserting the config server is in one of several
  expected statuses
- add restart_federation_setup rpc used by federation members for
  restarting the setup process
- leader waits for all peers to progress to SetupRestarted. it then
  progresses to AwaitingPassword
- each follower waits for leader to progress to AwaitingPassword before
  proceeding to AwaitingPassword
- add tests that cover restart of federation setup mid-way
- this test restarts the federation right after SharingConfigGenParams
Contributor

@elsirion elsirion left a comment

Looks mostly good, left some questions.

I don't think this is stability-relevant, we don't need to guarantee stability of the setup API, just need to release the admin UI in lockstep (but any reasonable deployment should use the admin UI that fits the Fedimint version).

I also didn't find any Rust API breakage, so in principle this could be backported to 0.2 if we saw the need (I'd like to avoid that though).

Comment on lines +439 to +462
let leader = {
    let expected_status = [
        ServerStatus::SharingConfigGenParams,
        ServerStatus::ReadyForConfigGen,
        ServerStatus::ConfigGenFailed,
        ServerStatus::VerifyingConfigs,
        ServerStatus::VerifiedConfigs,
    ];
    let mut state = self.require_any_status(&expected_status)?;

    let cfg_staging_dir = self.data_dir.join(CONFIG_STAGING_DIR);
    self.remove_staged_configs(&cfg_staging_dir)?;

    state.status = ServerStatus::SetupRestarted;
    info!(
        target: fedimint_logging::LOG_NET_PEER_DKG,
        "Update config gen status to 'Setup restarted'"
    );
    // Create a WSClient for the leader
    state
        .local
        .clone()
        .and_then(|local| local.leader_api_url.map(WsAdminClient::new))
};
Contributor

What's the reason to put this into a separate block?

Contributor Author

This lets us acquire the lock on self.state, use it, and release it before we call update_leader. update_leader acquires its own lock on the state.

Contributor

I didn't remember that require_any_status returned a mutex guard, makes sense!
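
The scoping pattern discussed in this thread can be reproduced in a small, self-contained sketch. It uses `std::sync::Mutex` and a reduced `Status` enum as assumptions; the real server code may use a different mutex type, and `update_leader` here is only a stand-in that re-locks the state.

```rust
use std::sync::{Mutex, MutexGuard};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Status {
    SharingConfigGenParams,
    SetupRestarted,
}

pub struct Server {
    state: Mutex<Status>,
}

impl Server {
    pub fn new(status: Status) -> Self {
        Server { state: Mutex::new(status) }
    }

    /// Checks the status and hands back the guard, so the caller can keep
    /// mutating the state without locking a second time.
    fn require_any_status(&self, expected: &[Status]) -> Result<MutexGuard<'_, Status>, String> {
        let guard = self.state.lock().map_err(|_| "state lock poisoned".to_string())?;
        if expected.contains(&*guard) {
            Ok(guard)
        } else {
            Err(format!("unexpected status {:?}", *guard))
        }
    }

    /// Stand-in for a function that takes its own lock. Calling it while a
    /// guard from `require_any_status` is still alive would deadlock.
    fn update_leader(&self) -> Status {
        *self.state.lock().expect("state lock poisoned")
    }

    pub fn restart(&self) -> Result<Status, String> {
        {
            let mut state = self.require_any_status(&[Status::SharingConfigGenParams])?;
            *state = Status::SetupRestarted;
        } // The guard is dropped here, releasing the lock.
        Ok(self.update_leader())
    }
}
```

An explicit `drop(state)` would work as well; the inner block simply makes the lifetime of the guard visible in the code's structure.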

Comment on lines +718 to +730
fn reset(&mut self) {
    self.config = None;
    self.peers = Default::default();
    self.auth = None;
    self.requested_params = None;
    self.status = ServerStatus::AwaitingPassword;
    self.local = None;

    info!(
        target: fedimint_logging::LOG_NET_PEER_DKG,
        "Reset config gen state"
    );
}
Contributor

Do we need to reset any on-disk state? I think the PASSWORD file has been written already at that point?

Contributor Author

At this point, PASSWORD and any other config file have already been written to a temporary dir. remove_staged_configs() resets the disk state by deleting the whole dir, and this function finally resets the server state.
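
Deleting the whole staging directory can be sketched with the standard library as below. This is not the implementation from the PR: the signature is an assumption, and treating a missing directory as success is a design choice made here so that repeated restarts stay idempotent.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Removes the staging directory and everything in it. A directory that
/// does not exist is treated as already clean.
pub fn remove_staged_configs(staging_dir: &Path) -> io::Result<()> {
    match fs::remove_dir_all(staging_dir) {
        Ok(()) => Ok(()),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(()),
        Err(e) => Err(e),
    }
}
```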

Contributor

@justinmoon justinmoon left a comment

LGTM

@elsirion elsirion added this pull request to the merge queue Jan 15, 2024
@elsirion
Contributor

Thx for bearing with us, the release took a lot of bandwidth and this PR a few iterations, but I think we arrived at a good result at the end!

Merged via the queue into fedimint:master with commit 9ec5418 Jan 15, 2024
18 checks passed
Development

Successfully merging this pull request may close these issues.

RPC method to restart setup process
4 participants