Skip to content

Commit

Permalink
Merge branch 'eichhorl/key-alert' into 'master'
Browse files Browse the repository at this point in the history
feat(consensus): CON-1041 ECDSA public key monitoring

Currently, if a subnet changes or loses its ECDSA key due to a bug, for instance after a replica upgrade, we will not be aware of it until the issue propagates to users.

With this MR we instead detect such a situation by inspecting the CUP of an ECDSA subnet. If its key differs from the one included in the previous CUP of the subnet, we increment an error metric and raise an alert.

In particular, by upgrading the ECDSA backup subnet first, this would allow us to be notified of certain problems in advance of upgrading the ECDSA signing subnet.

Note the [Design document](https://docs.google.com/document/d/1WBMkPHbUe8R4IWtuQsrOvlcKPzN14dwC6A4QvGUXyJc/edit) primarily talks about only monitoring the first CUP after an upgrade. Here we extend this idea to analogously compare every new CUP with the previous one.

One problem is monitoring key changes done in the last CUP before a restart. As the orchestrator typically shuts down immediately after inspecting the CUP, the error metric is reset before it can be scraped. This can be fixed by persisting the metric in the orchestrator's data directory, which will be done in a follow-up MR. 

See merge request dfinity-lab/public/ic!13013
  • Loading branch information
eichhorl committed Jun 27, 2023
2 parents 2fc4d35 + 8df3cd6 commit 6a00d6e
Show file tree
Hide file tree
Showing 7 changed files with 322 additions and 11 deletions.
4 changes: 4 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

4 changes: 4 additions & 0 deletions rs/orchestrator/BUILD.bazel
Original file line number Diff line number Diff line change
Expand Up @@ -43,6 +43,7 @@ rust_library(
"//rs/registry/proto_data_provider",
"//rs/registry/routing_table",
"//rs/sys",
"//rs/types/ic00_types",
"//rs/types/types",
"//rs/utils",
"@crate_index//:candid",
Expand Down Expand Up @@ -84,7 +85,10 @@ rust_test(
crate = ":lib",
deps = [
"//rs/crypto/temp_crypto",
"//rs/crypto/test_utils/canister_threshold_sigs",
"//rs/crypto/test_utils/reproducible_rng",
"//rs/registry/fake",
"//rs/test_utilities",
"//rs/test_utilities/in_memory_logger",
"//rs/test_utilities/logger",
"@crate_index//:assert_cmd",
Expand Down
4 changes: 4 additions & 0 deletions rs/orchestrator/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ ic-crypto-utils-threshold-sig = { path = "../crypto/utils/threshold_sig" }
ic-crypto-sha = { path = "../crypto/sha" }
ic-dashboard = { path = "./dashboard" }
ic-http-utils = { path = "../http_utils" }
ic-ic00-types = { path = "../types/ic00_types" }
ic-image-upgrader = { path = "./image_upgrader" }
ic-interfaces = { path = "../interfaces" }
ic-interfaces-registry = { path = "../interfaces/registry" }
Expand Down Expand Up @@ -60,7 +61,10 @@ url = "2.1.1"
[dev-dependencies]
assert_cmd = "0.12"
ic-crypto-temp-crypto = { path = "../crypto/temp_crypto" }
ic-crypto-test-utils-canister-threshold-sigs = { path = "../crypto/test_utils/canister_threshold_sigs" }
ic-crypto-test-utils-reproducible-rng = { path = "../crypto/test_utils/reproducible_rng" }
ic-registry-client-fake = { path = "../registry/fake" }
ic-test-utilities = { path = "../test_utilities" }
ic-test-utilities-in-memory-logger = { path = "../test_utilities/in_memory_logger" }
ic-test-utilities-logger = { path = "../test_utilities/logger" }
mockall = "0.8.3"
Expand Down
3 changes: 1 addition & 2 deletions rs/orchestrator/src/catch_up_package_provider.rs
Original file line number Diff line number Diff line change
Expand Up @@ -310,7 +310,6 @@ impl CatchUpPackageProvider {
registry_version,
))?;

let unsigned = latest_cup.signature.signature.get_ref().0.is_empty();
let height = Some(latest_cup.content.height());
// We recreate the local registry CUP everytime to avoid incompatibility issues. Without
// this recreation, we might run into the following problem: assume the orchestrator of
Expand All @@ -323,7 +322,7 @@ impl CatchUpPackageProvider {
//
// By re-creating the unsigned CUP every time we realize it's the newest one, we instead
// recreate the CUP on all orchestrators of the version B before starting the replica.
if height > local_cup_height || height == local_cup_height && unsigned {
if height > local_cup_height || height == local_cup_height && latest_cup.is_unsigned() {
self.persist_cup(&latest_cup_proto)?;
}
Ok((latest_cup_proto, latest_cup))
Expand Down
8 changes: 7 additions & 1 deletion rs/orchestrator/src/metrics.rs
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
use prometheus::{IntCounter, IntGauge, IntGaugeVec};
use prometheus::{IntCounter, IntCounterVec, IntGauge, IntGaugeVec};
use strum::IntoEnumIterator;
use strum_macros::{EnumIter, IntoStaticStr};

Expand All @@ -15,6 +15,7 @@ pub struct OrchestratorMetrics {
pub reboot_duration: IntGauge,
pub orchestrator_info: IntGaugeVec,
pub key_rotation_status: IntGaugeVec,
pub ecdsa_key_changed_errors: IntCounterVec,
}

#[derive(Copy, Clone, Debug, EnumIter, Eq, IntoStaticStr, PartialOrd, Ord, PartialEq)]
Expand Down Expand Up @@ -77,6 +78,11 @@ impl OrchestratorMetrics {
"The current key rotation status.",
&["status"],
),
ecdsa_key_changed_errors: metrics_registry.int_counter_vec(
"orchestrator_tecdsa_key_changed_errors_total",
"Critical error counter monitoring changed tECDSA public keys",
&["key_id"],
),
}
}

Expand Down

0 comments on commit 6a00d6e

Please sign in to comment.