feat(ic-admin): Added take-subnet-offline-for-repairs.#7361
Conversation
1f82bcf to
54691b1
Compare
There was a problem hiding this comment.
This pull request changes code owned by the Governance team. Therefore, make sure that
you have considered the following (for Governance-owned code):
-
Update
unreleased_changelog.md(if there are behavior changes, even if they are
non-breaking). -
Are there BREAKING changes?
-
Is a data migration needed?
-
Security review?
How to Satisfy This Automatic Review
-
Go to the bottom of the pull request page.
-
Look for where it says this bot is requesting changes.
-
Click the three dots to the right.
-
Select "Dismiss review".
-
In the text entry box, respond to each of the numbered items in the previous
section, declare one of the following:
-
Done.
-
$REASON_WHY_NO_NEED. E.g. for
unreleased_changelog.md, "No
canister behavior changes.", or for item 2, "Existing APIs
behave as before.".
Brief Guide to "Externally Visible" Changes
"Externally visible behavior change" is very often due to some NEW canister API.
Changes to EXISTING APIs are more likely to be "breaking".
If these changes are breaking, make sure that clients know how to migrate, how to
maintain their continuity of operations.
If your changes are behind a feature flag, then, do NOT add entrie(s) to
unreleased_changelog.md in this PR! But rather, add entrie(s) later, in the PR
that enables these changes in production.
Reference(s)
For a more comprehensive checklist, see here.
GOVERNANCE_CHECKLIST_REMINDER_DEDUP
- No canister behavior changes.
- Non-breaking
- All existing data is fine.
- Will happen.
54691b1 to
ee18e25
Compare
| /// | ||
| /// This is the first step of subnet recovery. Previously the first step was | ||
| /// done using propose-to-update-subnet. However, that does not support | ||
| /// changing ssn_node_state_write_access, which is needed when a subnet has |
There was a problem hiding this comment.
| /// changing ssn_node_state_write_access, which is needed when a subnet has | |
| /// changing ssh_node_state_write_access, which is needed when a subnet has |
| let node_id = PrincipalId::from_str(node_id).map_err(|err| format!("{err}"))?; | ||
| let node_id = NodeId::from(node_id); | ||
|
|
||
| // Parse public_keys, by simply splitting on ','.< |
There was a problem hiding this comment.
| // Parse public_keys, by simply splitting on ','.< | |
| // Parse public_keys, by simply splitting on ','. |
| let node_id = parts.next().ok_or("Missing node ID.")?; | ||
| let public_keys = parts | ||
| .next() | ||
| .ok_or("Missing semicolon separating node ID and SSH public key.")? |
There was a problem hiding this comment.
| .ok_or("Missing semicolon separating node ID and SSH public key.")? | |
| .ok_or("Missing semicolon separating node ID and SSH public keys.")? |
| /// clobber. However, in practice, it would probably be empty to begin with, | ||
| /// so most likely, this won't be an issue. | ||
| #[clap(long, value_parser, num_args(1..))] | ||
| pub ssh_node_state_write_access: Vec<NodeSshAccessFlagValue>, |
There was a problem hiding this comment.
Does this mean we need to pass at least one value? Ideally we would be able to switch to these new proposals entirely. That means it should also be possible to use them to recover subnets without SEV, in which case we don't need to deploy any ssh_node_state_write_access.
The same applies for ssh_readonly_access: You could imagine sending the proposal without any public keys, just to halt the subnet.
…irs`. (#7375) Closes [NNS1-4223]. [NNS1-4223]: https://dfinity.atlassian.net/browse/NNS1-4223 [👈 Previous PR][prev] [prev]: #7361
…very (#8718) As part of SEV Recovery, recovery operators might eventually need to provision write-access SSH keys to patch the state. Registry changes were implemented in previous PRs (notably #7361), and orchestrator changes in #7265. This PR implements the required changes in `ic-recovery`. The tool now uses the new `ProposeToTakeSubnetOfflineForRepairs` `ic-admin` command to provision an SSH key to a given `NodeId`, that the tool asks for in the `Halt` step. To help the recovery operator choose, this step will now print the heights of all nodes, next to their `NodeId`. In contrast to regular recoveries, that first step will have to be ran on a local machine instead of directly on a node. Here is how the new `Halt` step would look like: <img width="3816" height="1106" alt="image" src="https://github.com/user-attachments/assets/77bb1119-b211-4baf-b2c9-abf21adfebe6" /> --------- Co-authored-by: Jason Zhu <jason.zhu@dfinity.org> Co-authored-by: Ruediger Birkner <ruediger.birkner@dfinity.org>
Closes NNS1-4222.
Next PR 👉