@pierugo-dfinity pierugo-dfinity commented Oct 30, 2025

This PR logs more useful information (especially the state hash) about the local CUP just before persisting it in the orchestrator.

This is useful when the orchestrator breaks after an upgrade, which would prevent us from provisioning read-only SSH keys to recover the subnet. In that case, there is no easy way to learn the latest state hash to include in the recovery CUP, other than hoping that the recovery operator's node is up to date. Logging information about the CUP just before rebooting removes this requirement, as long as the latest logs are scraped before the node reboots.

Edit: Following the PR comments, the original solution had a flaw: the logs might not be scraped before rebooting if the node reboots too quickly. Since the state manager already logs the state hash before the CUP is actually created, we can rely on that log instead. The twin PR (#7525), originally intended to test this functionality, now relies on the log from the state manager, which prevents that log from being removed in the future. It is now also open, since we do not need to wait for the current PR to be merged to mainnet NNS. The two PRs are independent.

Still, logging the state hash in the orchestrator cannot hurt, and this PR does just that.
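
For illustration only, here is a minimal sketch of the "log, then persist" ordering described above. It is not the orchestrator's actual code: `Cup`, `describe_cup` and `log_then_persist` are hypothetical names, the CUP is reduced to a height and a 32-byte state hash, and the logger and the persistence step are passed in as closures so the example stays self-contained.

```rust
use std::fmt::Write as _;

/// Hypothetical, simplified stand-in for a catch-up package (CUP).
/// The real type carries much more than a height and a state hash.
struct Cup {
    height: u64,
    state_hash: [u8; 32],
}

/// Hex-encode a byte slice so the hash can be copied out of the logs.
fn to_hex(bytes: &[u8]) -> String {
    let mut out = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        write!(out, "{:02x}", b).expect("writing to a String cannot fail");
    }
    out
}

/// Build the log line (assumed format) describing the CUP.
fn describe_cup(cup: &Cup) -> String {
    format!(
        "Persisting local CUP: height={}, state_hash={}",
        cup.height,
        to_hex(&cup.state_hash)
    )
}

/// Emit the log line first, then persist the CUP via the supplied closure,
/// so the hash reaches the logs even if something breaks afterwards.
fn log_then_persist<L, P>(cup: &Cup, mut log: L, persist: P) -> std::io::Result<()>
where
    L: FnMut(&str),
    P: FnOnce(&Cup) -> std::io::Result<()>,
{
    log(&describe_cup(cup));
    persist(cup)
}

fn main() -> std::io::Result<()> {
    let cup = Cup { height: 42, state_hash: [0xab; 32] };
    // Stand-ins: print to stdout instead of a real logger, and skip real I/O.
    log_then_persist(&cup, |line| println!("{}", line), |_cup| Ok(()))
}
```

The ordering only helps if the log line is actually scraped before the node reboots, which is the caveat raised in the edit above.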

Regarding the original 2-second sleep at the end of the orchestrator to let Vector scrape late logs: there may be a way to persist logs before rebooting and ask `systemd-journal-gatewayd` to serve logs from the previous boot, but I do not think it is worth the effort (we would need to change the Vector configs, for example) just to recover a few missing log lines.

@github-actions github-actions bot added the chore label Oct 30, 2025
@pierugo-dfinity pierugo-dfinity changed the title from "chore(orchestrator): add local CUP state hash metric" to "chore(orchestrator): log local CUP info after persisting it" Nov 4, 2025
@pierugo-dfinity pierugo-dfinity added the CI_ALL_BAZEL_TARGETS (Runs all bazel targets and uploads them to S3) label Nov 4, 2025
@pierugo-dfinity pierugo-dfinity marked this pull request as ready for review November 13, 2025 12:54
@pierugo-dfinity pierugo-dfinity requested a review from a team as a code owner November 13, 2025 12:54
@pierugo-dfinity pierugo-dfinity changed the title from "chore(orchestrator): log local CUP info after persisting it" to "chore(orchestrator): log more info about local CUP before persisting it" Nov 19, 2025
@pierugo-dfinity pierugo-dfinity added this pull request to the merge queue Nov 26, 2025
Merged via the queue into master with commit b0ea5e9 Nov 26, 2025
69 of 70 checks passed
@pierugo-dfinity pierugo-dfinity deleted the pierugo/orchestrator/local-cup-hash-metric branch November 26, 2025 13:27
mraszyk pushed a commit that referenced this pull request Dec 1, 2025
…it (#7487)


Labels

chore, CI_ALL_BAZEL_TARGETS (Runs all bazel targets and uploads them to S3), @consensus
