Skip to content

Commit

Permalink
Merge branch 'kpop_con_922' into 'master'
Browse files Browse the repository at this point in the history
feat(Consensus, CON-922): add a state to the recovery tool so that it's possible to resume the previous execution.

In order to make the recovery tool resumable the following changes have been made:

1. The Recovery structs (i.e. AppSubnetRecovery, NNSRecoverySameNodes, NNSRecoveryFalloverNodes) have been made serializable.
2. Added two new user prompts at the beginning of the program which asks the user whether to resume the recovery if an appropriate state file has been found and if the arguments passed to the program have changed.
3. Explicitly skip the steps which have been completed in the previous run.
4. After execution of each step save the state to the disk.

Jira ticket(s): [CON-922](https://dfinity.atlassian.net/browse/CON-922?atlOrigin=eyJpIjoiMDhjZjU4NDY5MzE0NGM4MTkyYjZhNDcyMmU2NTVkMTMiLCJwIjoiaiJ9)
One pager: [CON-922](https://docs.google.com/document/d/1ALqyUXprY30aweyk123PVMtT2N4RUdZ8Cb9TSavK97w/edit#)

Tests:
1. Added some unit tests;
2. Ran the tool:
    1. broke an app subnet;
    2. ran the tool to recover it;
    3. CTRL-C'ed out of it after the "DownloadState" step;
    4. reran the tool (see the output below);
    5. confirmed that the subnet is recovered (see the attached image).

```
kpop@zh1-spm34:~/repos/ic/rs$ cargo run --bin ic-recovery -- --dir /home/kpop/recovery-test-2 --nns-url http://[2602:fb2b:100:10:5000:12ff:feec:cae3]:8080 --replica-version 1ff87140826d378d74854a01364583d12c18fc3d --test app-subnet-recovery --subnet-id nmb2x-aqod7-k257h-gho3i-tl3q2-v3poy-kp2a4-daytu-sqtv6-4l7kg-jae
warning: /home/kpop/repos/ic/rs/boundary_node/ic_balance_exporter/Cargo.toml: unused manifest key: bin.0.src
    Finished dev [unoptimized + debuginfo] target(s) in 1.15s
     Running `target/debug/ic-recovery --dir /home/kpop/recovery-test-2 --nns-url 'http://[2602:fb2b:100:10:5000:12ff:feec:cae3]:8080' --replica-version 1ff87140826d378d74854a01364583d12c18fc3d --test app-subnet-recovery --subnet-id nmb2x-aqod7-k257h-gho3i-tl3q2-v3poy-kp2a4-daytu-sqtv6-4l7kg-jae`
Mar 06 09:20:16.135 INFO Recovery state file found with parameters {
  "recovery_args": {
    "dir": "/home/kpop/recovery-test-2",
    "nns_url": "http://[2602:fb2b:100:10:5000:12ff:feec:cae3]:8080/",
    "replica_version": {
      "version_id": "1ff87140826d378d74854a01364583d12c18fc3d"
    },
    "key_file": null,
    "test_mode": true
  },
  "subcommand_args": {
    "AppSubnetRecovery": {
      "subnet_id": "nmb2x-aqod7-k257h-gho3i-tl3q2-v3poy-kp2a4-daytu-sqtv6-4l7kg-jae",
      "upgrade_version": null,
      "replacement_nodes": null,
      "pub_key": "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIG7mlrQoHBB+Iq16f/R0hMjCWVf8Xc/WgbfRME6cvJ00 kamil.popielarz@Kamil-Popielarzs-MacBook-Pro.local",
      "download_node": "2602:fb2b:100:10:5000:c3ff:feeb:5b52",
      "keep_downloaded_state": true,
      "upload_node": null,
      "ecdsa_subnet_id": null,
      "last_executed_step": "DownloadState"
    }
  },
  "neuron_args": null
}
Mar 06 09:20:16.135 INFO Resume previously started recovery? [y/n]
y
Mar 06 09:20:33.097 INFO

Mar 06 09:20:33.097 INFO ###############################
Mar 06 09:20:33.097 INFO V     App Subnet Recovery     V
Mar 06 09:20:33.097 INFO ###############################
Mar 06 09:20:33.098 INFO NNS Url: http://[2602:fb2b:100:10:5000:12ff:feec:cae3]:8080/
Mar 06 09:20:33.098 INFO Starting recovery of subnet with ID:
Mar 06 09:20:33.098 INFO -> nmb2x-aqod7-k257h-gho3i-tl3q2-v3poy-kp2a4-daytu-sqtv6-4l7kg-jae
Mar 06 09:20:33.098 INFO Binary version:
Mar 06 09:20:33.098 INFO -> Some(ReplicaVersion { version_id: "1ff87140826d378d74854a01364583d12c18fc3d" })
Mar 06 09:20:33.098 INFO Creating recovery directory in "/home/kpop/recovery-test-2"
Mar 06 09:20:33.098 INFO Press [ENTER] to continue...

Mar 06 09:20:37.545 INFO ic-admin exists, skipping download.
Mar 06 09:20:37.545 INFO nns.pem exists, skipping download of NNS public key
Mar 06 09:20:37.545 INFO Syncing registry local store
Mar 06 09:20:37.545 INFO Continuing with public key:
-----BEGIN PUBLIC KEY-----
MIGCMB0GDSsGAQQBgtx8BQMBAgEGDCsGAQQBgtx8BQMCAQNhAJYUx5HHCrc6jJ94
tF18BKuqBbfuvhxfP162gypbLSJ/pXydTON52qZ47im/J8lGLBX91IqmmicWY2CC
wg9SawgklTe1UAeJ/tMJEWKqHaiQ++3LIAlgrI01kIHrUWoJXQ==
-----END PUBLIC KEY-----

Mar 06 09:20:37.545 WARN Downloaded key is NOT equal to included NNS public key
Mar 06 09:20:37.549 INFO s:/n:/ic_registry_replicator/ic_registry_replicator Local registry store is not empty, skipping initialization.
Mar 06 09:20:37.553 INFO Skipping already executed step Halt
Mar 06 09:20:37.553 INFO Skipping already executed step DownloadCertifications
Mar 06 09:20:37.554 INFO Skipping already executed step MergeCertificationPools
Mar 06 09:20:37.554 INFO Skipping already executed step DownloadState
Mar 06 09:20:37.554 INFO

Mar 06 09:20:37.554 INFO ####################
Mar 06 09:20:37.554 INFO V     ICReplay     V
Mar 06 09:20:37.554 INFO ####################
Mar 06 09:20:37.554 INFO Delete old checkpoints found in /home/kpop/recovery-test-2/recovery/working_dir/data/ic_state/checkpoints, and execute:
ic-replay /home/kpop/recovery-test-2/recovery/working_dir/ic.json5 --subnet-id nmb2x-aqod7-k257h-gho3i-tl3q2-v3poy-kp2a4-daytu-sqtv6-4l7kg-jae
Mar 06 09:20:37.554 INFO Execute now? [y/n]
```

![image](/uploads/eb086ddbb9189a68bff1bc6cb6d37333/image.png) 

See merge request dfinity-lab/public/ic!11107
  • Loading branch information
kpop-dfinity committed Mar 15, 2023
2 parents 520fc2f + 5ddd229 commit cf2ae2e
Show file tree
Hide file tree
Showing 15 changed files with 695 additions and 86 deletions.
35 changes: 24 additions & 11 deletions rs/recovery/src/app_subnet_recovery.rs
Original file line number Diff line number Diff line change
Expand Up @@ -8,14 +8,16 @@ use crate::{error::RecoveryError, RecoveryArgs};
use clap::Parser;
use ic_base_types::{NodeId, SubnetId};
use ic_types::ReplicaVersion;
use serde::{Deserialize, Serialize};
use slog::{info, Logger};
use std::iter::Peekable;
use std::net::IpAddr;
use strum::IntoEnumIterator;
use strum_macros::EnumIter;
use strum_macros::{EnumIter, EnumString};

use crate::{NeuronArgs, Recovery, Step};

#[derive(Debug, Copy, Clone, EnumIter)]
#[derive(Debug, Copy, Clone, PartialEq, EnumIter, EnumString, Serialize, Deserialize)]
pub enum StepType {
Halt,
DownloadCertifications,
Expand All @@ -32,7 +34,7 @@ pub enum StepType {
Cleanup,
}

#[derive(Parser)]
#[derive(Debug, Clone, PartialEq, Parser, Serialize, Deserialize)]
#[clap(version = "1.0")]
pub struct AppSubnetRecoveryArgs {
/// Id of the broken subnet
Expand Down Expand Up @@ -66,11 +68,17 @@ pub struct AppSubnetRecoveryArgs {
/// Id of the ecdsa subnet used for resharing ecdsa key of subnet to be recovered
#[clap(long, parse(try_from_str=crate::util::subnet_id_from_str))]
pub ecdsa_subnet_id: Option<SubnetId>,

/// If present the tool will start execution for the provided step, skipping the initial ones
#[clap(long = "resume")]
pub next_step: Option<StepType>,
}

pub struct AppSubnetRecovery {
step_iterator: Box<dyn Iterator<Item = StepType>>,
step_iterator: Peekable<StepTypeIter>,
pub params: AppSubnetRecoveryArgs,
pub recovery_args: RecoveryArgs,
pub neuron_args: Option<NeuronArgs>,
recovery: Recovery,
interactive: bool,
logger: Logger,
Expand All @@ -84,13 +92,14 @@ impl AppSubnetRecovery {
subnet_args: AppSubnetRecoveryArgs,
interactive: bool,
) -> Self {
let ssh_confirmation = neuron_args.is_some();
let recovery = Recovery::new(logger.clone(), recovery_args, neuron_args, ssh_confirmation)
let recovery = Recovery::new(logger.clone(), recovery_args.clone(), neuron_args.clone())
.expect("Failed to init recovery");
recovery.init_registry_local_store();
Self {
step_iterator: Box::new(StepType::iter()),
step_iterator: StepType::iter().peekable(),
params: subnet_args,
recovery_args,
neuron_args,
recovery,
logger,
interactive,
Expand All @@ -102,11 +111,15 @@ impl AppSubnetRecovery {
}
}

impl RecoveryIterator<StepType> for AppSubnetRecovery {
fn get_step_iterator(&mut self) -> &mut Box<dyn Iterator<Item = StepType>> {
impl RecoveryIterator<StepType, StepTypeIter> for AppSubnetRecovery {
fn get_step_iterator(&mut self) -> &mut Peekable<StepTypeIter> {
&mut self.step_iterator
}

fn store_next_step(&mut self, step_type: Option<StepType>) {
self.params.next_step = step_type;
}

fn get_logger(&self) -> &Logger {
&self.logger
}
Expand All @@ -130,8 +143,8 @@ impl RecoveryIterator<StepType> for AppSubnetRecovery {

StepType::DownloadCertifications => {
info!(&self.logger, "Ensure subnet is halted.");
// This can hardly be automated as currently the notion of "subnet is halted" is unclear,
// especially in the presence of failures.
// This can hardly be automated as currently the notion of "subnet is halted" is
// unclear, especially in the presence of failures.
wait_for_confirmation(&self.logger);
}

Expand Down
87 changes: 87 additions & 0 deletions rs/recovery/src/args_merger.rs
Original file line number Diff line number Diff line change
@@ -0,0 +1,87 @@
use serde_json::Value;
use slog::{warn, Logger};

/// Merges two serializable objects ignoring all the None values of [new]. For each field where
/// there is a difference between the old and the new value, will emit a warning with the
/// new value.
pub fn merge<T>(logger: &Logger, label: &str, old: &T, new: &T) -> Result<T, serde_json::Error>
where
T: Clone + for<'a> serde::Deserialize<'a> + serde::Serialize,
{
serde_json::from_value(merge_values(
logger,
label,
serde_json::to_value(old.clone()).unwrap(),
serde_json::to_value(new.clone()).unwrap(),
))
}

fn merge_values(logger: &Logger, label: &str, old: Value, new: Value) -> Value {
match (old, new) {
(old, Value::Null) => old,
(Value::Object(old), Value::Object(new)) => {
let mut merged = old.clone();

for (key, new_value) in new {
if !new_value.is_null() {
let old_value = old.get(&key).unwrap_or(&Value::Null).clone();
merged.insert(
key.clone(),
merge_values(logger, &key, old_value, new_value),
);
}
}

Value::Object(merged)
}
(old_value, new_value) => {
if old_value != new_value {
warn!(
logger,
"Value of {} has changed. Old: {}, new: {}", label, old_value, new_value
);
}
new_value
}
}
}

#[cfg(test)]
mod tests {
use std::path::PathBuf;
use url::Url;

use crate::{util, RecoveryArgs};

use super::*;

#[test]
fn merge_test() {
let logger = util::make_logger();

let args1 = RecoveryArgs {
dir: PathBuf::from("/dir1/"),
nns_url: Url::parse("https://fake_nns_url.com/").unwrap(),
replica_version: None,
key_file: Some(PathBuf::from("/dir1/key_file")),
test_mode: true,
};
let args2 = RecoveryArgs {
dir: PathBuf::from("/dir2/"),
nns_url: Url::parse("https://fake_nns_url.com/").unwrap(),
replica_version: None,
key_file: None,
test_mode: false,
};

let expected = RecoveryArgs {
dir: args2.dir.clone(),
nns_url: args2.nns_url.clone(),
replica_version: args2.replica_version.clone(),
key_file: args1.key_file.clone(),
test_mode: args2.test_mode,
};

assert_eq!(expected, merge(&logger, "test", &args1, &args2).unwrap());
}
}
71 changes: 47 additions & 24 deletions rs/recovery/src/cli.rs
Original file line number Diff line number Diff line change
@@ -1,13 +1,15 @@
//! Command line interfaces to various subnet recovery processes.
//! Calls the recovery library.
use crate::app_subnet_recovery::{AppSubnetRecovery, AppSubnetRecoveryArgs};
use crate::get_node_heights_from_metrics;
use crate::nns_recovery_failover_nodes::{NNSRecoveryFailoverNodes, NNSRecoveryFailoverNodesArgs};
use crate::nns_recovery_same_nodes::{NNSRecoverySameNodes, NNSRecoverySameNodesArgs};
use crate::recovery_iterator::RecoveryIterator;
use crate::recovery_state::HasRecoveryState;
use crate::steps::Step;
use crate::util;
use crate::util::subnet_id_from_str;
use crate::{NeuronArgs, RecoveryArgs};
use core::fmt::Debug;
use ic_registry_client::client::RegistryClientImpl;
use ic_types::{NodeId, ReplicaVersion, SubnetId};
use slog::{info, warn, Logger};
Expand All @@ -34,14 +36,13 @@ pub fn app_subnet_recovery(
logger: Logger,
args: RecoveryArgs,
subnet_recovery_args: AppSubnetRecoveryArgs,
test: bool,
mut neuron_args: Option<NeuronArgs>,
) {
print_step(&logger, "App Subnet Recovery");
print_summary(&logger, &args, subnet_recovery_args.subnet_id);
wait_for_confirmation(&logger);

let mut neuron_args = None;
if !test {
if neuron_args.is_none() && !args.test_mode {
neuron_args = Some(read_neuron_args(&logger));
}

Expand All @@ -50,13 +51,10 @@ pub fn app_subnet_recovery(
args,
neuron_args,
subnet_recovery_args,
true,
/*interactive=*/ true,
);

for (step_type, step) in subnet_recovery {
print_step(&logger, &format!("{:?}", step_type));
execute_step_after_consent(&logger, step);
}
execute_steps(&logger, subnet_recovery);
}

/// NNS is recovered on same nodes by:
Expand All @@ -74,19 +72,19 @@ pub fn nns_recovery_same_nodes(
logger: Logger,
args: RecoveryArgs,
nns_recovery_args: NNSRecoverySameNodesArgs,
test: bool,
) {
print_step(&logger, "NNS Recovery Same Nodes");
print_summary(&logger, &args, nns_recovery_args.subnet_id);
wait_for_confirmation(&logger);

let nns_recovery =
NNSRecoverySameNodes::new(logger.clone(), args, nns_recovery_args, test, true);
let nns_recovery = NNSRecoverySameNodes::new(
logger.clone(),
args,
nns_recovery_args,
/*interactive=*/ true,
);

for (step_type, step) in nns_recovery {
print_step(&logger, &format!("{:?}", step_type));
execute_step_after_consent(&logger, step);
}
execute_steps(&logger, nns_recovery);
}

/// NNS is recovered on failover nodes by:
Expand All @@ -104,23 +102,48 @@ pub fn nns_recovery_failover_nodes(
logger: Logger,
args: RecoveryArgs,
nns_recovery_args: NNSRecoveryFailoverNodesArgs,
test: bool,
mut neuron_args: Option<NeuronArgs>,
) {
print_step(&logger, "NNS Recovery Failover Nodes");
print_summary(&logger, &args, nns_recovery_args.subnet_id);
wait_for_confirmation(&logger);

let mut neuron_args = None;
if !test {
if neuron_args.is_none() && !args.test_mode {
neuron_args = Some(read_neuron_args(&logger));
}

let nns_recovery =
NNSRecoveryFailoverNodes::new(logger.clone(), args, neuron_args, nns_recovery_args, true);
let nns_recovery = NNSRecoveryFailoverNodes::new(
logger.clone(),
args,
neuron_args,
nns_recovery_args,
/*interactive=*/ true,
);

for (step_type, step) in nns_recovery {
print_step(&logger, &format!("{:?}", step_type));
execute_step_after_consent(&logger, step);
execute_steps(&logger, nns_recovery);
}

fn execute_steps<
StepType: Copy + Debug + PartialEq,
I: Iterator<Item = StepType>,
Steps: HasRecoveryState<StepType = StepType>
+ RecoveryIterator<StepType, I>
+ Iterator<Item = (StepType, Box<dyn Step>)>,
>(
logger: &Logger,
mut steps: Steps,
) {
if let Some(next_step) = steps.get_next_step() {
steps.resume(next_step);
}

while let Some((step_type, step)) = steps.next() {
print_step(logger, &format!("{:?}", step_type));
execute_step_after_consent(logger, step);

if let Err(e) = steps.get_state().save() {
warn!(logger, "Failed to save the recovery state: {}", e);
}
}
}

Expand Down
7 changes: 4 additions & 3 deletions rs/recovery/src/cmd.rs
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
use clap::Parser;
use ic_types::ReplicaVersion;
use serde::{Deserialize, Serialize};
use std::path::PathBuf;
use url::Url;

Expand All @@ -10,12 +11,12 @@ use crate::{
};

/// Subcommands for recovery procedures (application subnets, NNS with failover nodes, etc...)
#[derive(Parser)]
#[derive(Clone, Debug, PartialEq, Parser, Serialize, Deserialize)]
pub enum SubCommand {
/// Application subnet recovery on same or failover nodes.
AppSubnetRecovery(AppSubnetRecoveryArgs),
/// NNS recovery on a failover IC.
NNSRecoveryFailoverNodes(Box<NNSRecoveryFailoverNodesArgs>),
NNSRecoveryFailoverNodes(NNSRecoveryFailoverNodesArgs),
/// NNS recovery on the same nodes.
NNSRecoverySameNodes(NNSRecoverySameNodesArgs),
}
Expand Down Expand Up @@ -50,5 +51,5 @@ pub struct RecoveryToolArgs {
pub test: bool,

#[clap(subcommand)]
pub subcmd: SubCommand,
pub subcmd: Option<SubCommand>,
}
14 changes: 14 additions & 0 deletions rs/recovery/src/error.rs
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,8 @@ pub enum RecoveryError {
CommandError(Option<i32>, String),
OutputError(String),
DownloadError(String, FileDownloadError),
ParsingError(serde_json::Error),
SerializationError(serde_json::Error),
UnexpectedError(String),
StepSkipped,
}
Expand All @@ -39,6 +41,12 @@ impl RecoveryError {
pub(crate) fn invalid_output_error(output: String) -> Self {
RecoveryError::OutputError(format!("Invalid output: {}", output))
}
pub(crate) fn parsing_error(e: serde_json::Error) -> Self {
RecoveryError::ParsingError(e)
}
pub(crate) fn serialization_error(e: serde_json::Error) -> Self {
RecoveryError::SerializationError(e)
}
pub(crate) fn download_error(url: String, target: &Path, e: FileDownloadError) -> Self {
RecoveryError::DownloadError(
format!("Failed to download from {} to {:?}", url, target),
Expand Down Expand Up @@ -68,6 +76,12 @@ impl fmt::Display for RecoveryError {
RecoveryError::StepSkipped => {
write!(f, "Recovery step skipped.")
}
RecoveryError::ParsingError(e) => {
write!(f, "Parsing error, error: {:?}", e)
}
RecoveryError::SerializationError(e) => {
write!(f, "Serialization error, error: {:?}", e)
}
}
}
}
Expand Down

0 comments on commit cf2ae2e

Please sign in to comment.