Skip to content

Commit

Permalink
Merge branch 'eichhorl/monitor-last-cup' into 'master'
Browse files Browse the repository at this point in the history
feat(consensus): CON-1063 Persist ECDSA key monitoring metric across upgrades

Erroneous threshold public key changes are monitored in the orchestrator by comparing the keys of a CUP with those of the previous CUP. If a key was changed or deleted, the corresponding metric is incremented, thus raising an alert. 

However, this doesn't work for the last CUP before an upgrade:

As the orchestrator typically shuts down immediately after inspecting the last CUP before an upgrade, the error metric is reset before it can be scraped.

In this MR we persist the result of comparing the last two CUPs in the orchestrator's data directory. After restart, we initialize the metric with the persisted result and thus guarantee that the alert is raised correctly.

Metrics of keys that haven't changed in the last CUP are reset to 0, as we assume that the alert was already raised.

In order to not raise false alarms after membership changes, we reset the metric once a node is unassigned. 

See merge request dfinity-lab/public/ic!13126
  • Loading branch information
eichhorl committed Jun 28, 2023
2 parents 8783b1e + 5e9c687 commit 1afbb9f
Show file tree
Hide file tree
Showing 4 changed files with 151 additions and 35 deletions.
4 changes: 2 additions & 2 deletions rs/orchestrator/image_upgrader/src/image_upgrader.rs
Original file line number Diff line number Diff line change
Expand Up @@ -101,8 +101,8 @@ pub trait ImageUpgrader<V: Clone + Debug + PartialEq + Eq + Send + Sync, R: Send
/// Path to the image image download and unpacking destination.
fn image_path(&self) -> &PathBuf;
/// Optional data path, used for storing latest reboot time. Default is None.
fn data_dir(&self) -> &Option<PathBuf> {
&None
fn data_dir(&self) -> Option<&PathBuf> {
None
}
/// Return the logger to be passed to the upgrade functions.
fn log(&self) -> &ReplicaLogger;
Expand Down
2 changes: 1 addition & 1 deletion rs/orchestrator/src/args.rs
Original file line number Diff line number Diff line change
Expand Up @@ -59,7 +59,7 @@ pub struct OrchestratorArgs {
/// The path to directory that is dedicated to data specific to the orchstrator.
/// If not provided, the relevant data are not persisted to the disk.
#[clap(long, parse(from_os_str))]
pub(crate) orchestrator_data_directory: Option<PathBuf>,
pub(crate) orchestrator_data_directory: PathBuf,
}

impl OrchestratorArgs {
Expand Down
14 changes: 14 additions & 0 deletions rs/orchestrator/src/error.rs
Original file line number Diff line number Diff line change
Expand Up @@ -51,6 +51,9 @@ pub enum OrchestratorError {

/// Generic error while handling reboot time
RebootTimeError(String),

/// Generic error while monitoring key changes
ThresholdKeyMonitoringError(String),
}

impl OrchestratorError {
Expand All @@ -61,6 +64,10 @@ impl OrchestratorError {
pub(crate) fn invalid_configuration_error(msg: impl ToString) -> Self {
OrchestratorError::InvalidConfigurationError(msg.to_string())
}

pub(crate) fn key_monitoring_error(msg: impl ToString) -> Self {
OrchestratorError::ThresholdKeyMonitoringError(msg.to_string())
}
}

impl fmt::Display for OrchestratorError {
Expand Down Expand Up @@ -94,6 +101,13 @@ impl fmt::Display for OrchestratorError {
OrchestratorError::RebootTimeError(msg) => {
write!(f, "Failed to read or write reboot time: {}", msg)
}
OrchestratorError::ThresholdKeyMonitoringError(msg) => {
write!(
f,
"Failed to read or write threshold key changed metric: {}",
msg
)
}
OrchestratorError::SubnetMissingError(subnet_id, registry_version) => write!(
f,
"Subnet ID {:?} does not exist in the Registry at registry version {:?}",
Expand Down

0 comments on commit 1afbb9f

Please sign in to comment.