
Conversation

@pierugo-dfinity (Contributor) commented Dec 1, 2025

The upgrade loop in the orchestrator is responsible both for executing upgrades and for determining the subnet ID of the node, which is used to provision SSH keys and rotate IDKG keys. However, there are multiple code flows in which the orchestrator determines the subnet ID but a later step in the loop fails, causing the function to return an error and the caller to discard the subnet ID. This prevents SSH keys from being provisioned even though the subnet ID had been correctly identified.
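To illustrate the failure mode, here is a minimal, self-contained sketch (all names are hypothetical stand-ins, not the actual orchestrator types): the subnet ID determined inside the check is lost whenever the check as a whole returns an error.

```rust
use std::sync::RwLock;

#[derive(Clone, Copy, Debug, PartialEq)]
#[allow(dead_code)]
enum SubnetAssignment {
    Unknown,
    Assigned(u64), // stand-in for the real SubnetId
}

// The subnet ID is determined early, but a later step fails, so the whole
// function returns `Err` and the already-known ID is dropped with it.
fn check_for_upgrade() -> Result<SubnetAssignment, String> {
    let _subnet_id = 42; // correctly determined...
    Err("local CUP not deserializable".into()) // ...but a later step fails
}

fn main() {
    let assignment = RwLock::new(SubnetAssignment::Unknown);
    match check_for_upgrade() {
        // Only the Ok path applies the assignment (and hence SSH keys).
        Ok(a) => *assignment.write().unwrap() = a,
        // The Err path loses the subnet ID entirely.
        Err(e) => eprintln!("check_for_upgrade failed: {e}"),
    }
    // The node stays `Unknown` even though the subnet ID was known.
    assert_eq!(*assignment.read().unwrap(), SubnetAssignment::Unknown);
}
```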

An example of such a code flow is when the local CUP is not deserializable but its NiDkgId is, which still allows the subnet ID to be determined correctly (i.e. we hit here). But since the CUP is not deserializable and currently has the highest height compared to a recovery or peers' CUP (imagine we are at the very start of a recovery, before applying SSH keys, so there is no recovery CUP yet), we return an error here, the subnet ID is not updated, and SSH keys are not provisioned. If the local CUP does not have the highest height (i.e. there is a recovery CUP), then we can use the latter, which explains why we can still recover.
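A hedged sketch of the first step of that flow, with hypothetical helper names: even when the full CUP fails to deserialize, parsing just the NiDkgId out of the stored proto can still yield the subnet ID.

```rust
// All types and helpers below are illustrative stand-ins.
struct CupProto; // the serialized CUP as stored on disk
struct NiDkgId { subnet_id: u64 }
struct Cup { nidkg_id: NiDkgId }

fn deserialize_cup(_proto: &CupProto) -> Result<Cup, String> {
    Err("corrupted payload".into()) // pretend the full CUP is corrupted
}

fn deserialize_nidkg_id(_proto: &CupProto) -> Result<NiDkgId, String> {
    Ok(NiDkgId { subnet_id: 7 }) // ...but the NiDkgId inside it still parses
}

// Even with an undeserializable CUP, the subnet ID can be recovered.
fn subnet_id_from_proto(proto: &CupProto) -> Option<u64> {
    deserialize_cup(proto)
        .map(|cup| cup.nidkg_id.subnet_id)
        .or_else(|_| deserialize_nidkg_id(proto).map(|id| id.subnet_id))
        .ok()
}

fn main() {
    assert_eq!(subnet_id_from_proto(&CupProto), Some(7));
}
```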

Note: the existing system test sr_app_no_upgrade_with_chain_keys_test checks that we can recover a subnet in exactly that case (the CUP is not deserializable but the NiDkgId is). As explained, nodes can see the recovery CUP, but we do not apply readonly keys even though we could. In a parallel PR, I distinguished the cases where the NiDkgId was corrupted from those where it was not. If it was, then there is indeed no way of provisioning SSH keys, but there is also no way of seeing the recovery CUP, so failover nodes must be used. If it was not, then we should be able to provision SSH keys. When the second case runs on the current implementation, it fails because we cannot provision SSH keys. When this branch is merged into it, the test succeeds, which is a positive sign for the added value of this change.

Another example is when we detect that we need to leave the subnet but removing the state fails (i.e. we hit here). Then we would again return an error and fail to remove the subnet's SSH keys.

This PR is not supposed to bring any functional change to the upgrade logic; instead, it moves the responsibility for setting the subnet assignment from the caller of check_for_upgrade into that function directly.

PS: the PR also uses the same registry version for the entire loop, instead of determining the latest registry version multiple times (in the functions prepare_upgrade_if_scheduled, check_for_upgrade_as_unassigned, and should_node_become_unassigned), in order to have a more consistent and predictable behaviour.
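A small sketch of this idea, with stand-in names modeled on the functions listed above: the version is read once per loop iteration and threaded through, so every decision in that iteration observes the same registry snapshot.

```rust
// Illustrative stand-ins for the registry and the helpers named above.
struct Registry;
type RegistryVersion = u64;

impl Registry {
    fn get_latest_version(&self) -> RegistryVersion {
        1234
    }
}

// Each helper takes the version as a parameter instead of re-reading the
// latest one internally.
fn prepare_upgrade_if_scheduled(_version: RegistryVersion) {}
fn should_node_become_unassigned(_version: RegistryVersion) -> bool {
    false
}

fn main() {
    let registry = Registry;
    // Read the latest version once per loop iteration...
    let version = registry.get_latest_version();
    // ...and thread the same snapshot through every decision.
    prepare_upgrade_if_scheduled(version);
    let _leaving = should_node_become_unassigned(version);
}
```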

github-actions bot added the feat label Dec 1, 2025
pierugo-dfinity added the CI_ALL_BAZEL_TARGETS (Runs all bazel targets and uploads them to S3) label Dec 1, 2025
pierugo-dfinity changed the title from "feat(orchestrator): do not swallow subnet assignment on upgrade loop errors" to "feat(orchestrator): do not ignore subnet assignment on upgrade loop errors" Dec 2, 2025
Comment on lines 732 to 734
@pierugo-dfinity (Contributor, Author):

Note: I would also argue for changing both of these false values to true. I.e., in case the registry version is unavailable locally, or the field is somehow empty or not deserializable, I would prefer to not accidentally remove the state, to keep the subnet's SSH keys a bit too long, and to try to rotate IDKG keys (in which case the registry should deny the request anyway, because we would have left the subnet), rather than the opposite.

As of today, I cannot see a way for the registry version to be unavailable locally, since it is always lower than the latest version we have. But if this function gets reused somewhere else, returning true feels more fail-safe than false. What do you guys think?

Note that changing this to true could also mean launching the replica even though we are unassigned. But again, I do not think it hurts much if a single node does so, since the other nodes would ignore it.
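A generic sketch of the fail-safe default being argued for (the exact function in the hidden diff is not visible here, so the signature and semantics below are hypothetical):

```rust
// Hypothetical generic shape; returning `true` means "treat the node as
// still assigned".
fn treat_as_assigned(registry_lookup: Result<bool, String>) -> bool {
    // Fail-safe default: on a failed lookup, keep the state and SSH keys
    // around (the registry would reject a stale IDKG rotation anyway) rather
    // than tearing things down based on incomplete information.
    registry_lookup.unwrap_or(true)
}

fn main() {
    assert!(treat_as_assigned(Err("registry version unavailable".into())));
    assert!(!treat_as_assigned(Ok(false)));
}
```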

pierugo-dfinity changed the title from "feat(orchestrator): do not ignore subnet assignment on upgrade loop errors" to "feat(orchestrator): return subnet assignment also on upgrade loop errors" Dec 11, 2025
pierugo-dfinity marked this pull request as ready for review December 11, 2025 15:00
pierugo-dfinity requested a review from a team as a code owner December 11, 2025 15:00

```diff
 #[must_use = "This may be a `Stop` variant, which should be handled"]
-pub(crate) enum OrchestratorControlFlow {
+pub(crate) enum UpgradeCheckResult {
```
@eichhorl (Contributor):

This is called "UpgradeCheckResult", but it is not actually a Result type, which is confusing. The enum does have different Ok and Err variants, however. Could we just turn it into a "real" Result? For instance, could we have a new error type that wraps both an OrchestratorError and SubnetAssignment, or similar?
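One possible shape for that suggestion, sketched with stand-in types derived from the names in this thread:

```rust
// Illustrative stand-ins for the real orchestrator types.
#[derive(Debug)]
struct OrchestratorError(String);

#[derive(Debug)]
#[allow(dead_code)]
enum SubnetAssignment { Assigned(u64), Unassigned }

#[derive(Debug)]
#[allow(dead_code)]
enum OrchestratorControlFlow { KeepGoing, Stop }

// A "real" error type: it carries the underlying error together with the
// assignment that was still successfully determined before the failure.
#[derive(Debug)]
struct UpgradeCheckError {
    assignment: Option<SubnetAssignment>,
    source: OrchestratorError,
}

// ...so the check can return a genuine `Result`.
type UpgradeCheckOutcome = Result<OrchestratorControlFlow, UpgradeCheckError>;

fn main() {
    let outcome: UpgradeCheckOutcome = Err(UpgradeCheckError {
        assignment: Some(SubnetAssignment::Assigned(42)),
        source: OrchestratorError("local CUP not deserializable".into()),
    });
    println!("{outcome:?}");
}
```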

```diff
     Stop,
+    /// There was an error while checking for an upgrade, but we still successfully determined that
+    /// the node is assigned to the given subnet.
+    ErrorAsAssigned(SubnetId, OrchestratorError),
```
@eichhorl (Contributor):

Should there also be an ErrorAsLeaving variant?

Comment on lines 453 to 554
```diff
-) -> OrchestratorResult<OrchestratorControlFlow> {
-    let registry_version = self.registry.get_latest_version();
+    registry_version: RegistryVersion,
+) -> OrchestratorResult<bool> {
```
@eichhorl (Contributor):

Does this make a difference?

@pierugo-dfinity (Contributor, Author):

Are you talking about the return type being bool? Not much; I just did not want to return some ErrorAsUnassigned here and instead kept a simple Err to be more generic (i.e. the function does not need to “know” about UpgradeCheckResult). This was reverted anyway.

```diff
-Ok(Ok(control_flow)) => {
+Ok(upgrade_result) => {
+    // Update the subnet assignment based on the latest upgrade result.
+    *subnet_assignment.write().unwrap() = upgrade_result.as_subnet_assignment();
```
@eichhorl (Contributor):

If returning the subnet assignment is too much of a hassle, especially since it needs to be updated in both the Ok and the Err case, maybe another idea could be to have the Upgrade struct own a copy of this RwLock and mutate it directly there?
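A minimal sketch of that idea with hypothetical fields: the Upgrade task holds an Arc to the shared assignment and writes to it in place, so nothing needs to be returned to the caller.

```rust
use std::sync::{Arc, RwLock};

#[derive(Debug)]
#[allow(dead_code)]
enum SubnetAssignment { Unknown, Assigned(u64) }

// Hypothetical shape: the upgrade task owns a handle to the shared state.
struct Upgrade {
    subnet_assignment: Arc<RwLock<SubnetAssignment>>,
}

impl Upgrade {
    fn check(&self) {
        // Mutate in place instead of returning the assignment.
        *self.subnet_assignment.write().unwrap() = SubnetAssignment::Assigned(42);
    }
}

fn main() {
    let shared = Arc::new(RwLock::new(SubnetAssignment::Unknown));
    let upgrade = Upgrade { subnet_assignment: Arc::clone(&shared) };
    upgrade.check();
    println!("{:?}", *shared.read().unwrap());
}
```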

```diff
 }
-(None, None) => match self.registry.get_subnet_id(latest_registry_version) {
+let maybe_local_cup_proto = self.cup_provider.get_local_cup_proto();
+let (subnet_id, maybe_local_cup) = 'block_subnet_id_local_cup: {
```
@eichhorl (Contributor):

I think this block is a bit hard to understand. If this refactoring is needed, is there a way to avoid the named blocks and break statements? We could also consider extracting some functions if that makes it easier.
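Purely as an illustration of the suggestion (stand-in functions, not the real code): a labeled block can usually become a named helper where `break 'label value` turns into an early `return`.

```rust
fn from_local_cup() -> Option<u64> { None } // stand-in source #1
fn from_registry() -> u64 { 7 }             // stand-in source #2

// After: a named helper with early returns instead of labeled breaks.
fn determine_subnet_id() -> u64 {
    if let Some(id) = from_local_cup() {
        return id;
    }
    from_registry()
}

fn main() {
    // Before: the labeled-block style being questioned above.
    let subnet_id = 'block: {
        if let Some(id) = from_local_cup() {
            break 'block id;
        }
        from_registry()
    };
    assert_eq!(subnet_id, determine_subnet_id());
}
```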

@pierugo-dfinity (Contributor, Author):

Indeed. I reverted this and kept most of what was already there, while improving it to return just the subnet ID out of the block, instead of also the CUP and the CUP proto. LMK what you think. (I recommend reviewing with whitespace changes hidden.)

pierugo-dfinity added commits:

  • perf: grab exclusive lock at the start of the function
  • Revert "perf: grab exclusive lock at the start of the function" (reverts commit bbf44232d18db279c35a21ef033cb76a21d0bade)
  • perf: less locking
  • style: use ?
  • refactor: allow harmless race condition
@pierugo-dfinity (Contributor, Author) commented:

Thanks for the review @eichhorl, I agree with all your remarks; that function has quite high coupling, so it's tricky 😅. I'll cover the remarks that I haven't answered directly here, since they are interconnected.

This is called "UpgradeCheckResult", but it is not actually a Result type, which is confusing. The enum does have different Ok and Err variants, however. Could we just turn it into a "real" Result? For instance, could we have a new error type that wraps both an OrchestratorError and SubnetAssignment, or similar?

and

Should there also be an ErrorAsLeaving variant?

I think these translate to keeping the current OrchestratorControlFlow enum and returning a Result<OrchestratorControlFlow, (Option<OrchestratorControlFlow>, OrchestratorError)> (obviously with intermediary types; the Option covers the case where we hit an error and can't tell the flow). This could even encode an error during Stop. But I don't think it helps readability much:

  • We would probably still need a get_control_flow function or something similar to get the flow from either the Ok or the Err variant without pattern matching in the caller, but we would need to introduce a new trait for that because Result is a foreign type (or wrap the Result in a 1-tuple); see the sketch after this list.
  • Each time we'd like to return an error in check(), we would have to return something like Err(UpgradeCheckError::new(OrchestratorControlFlow::Unassigned, err)), in contrast to the current UpgradeCheckResult::ErrorAsUnassigned(err), which reads more easily in my opinion.
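To make the first point concrete, here is a sketch with stand-in types: because Result is a foreign type, extracting the flow from either variant without matching at every call site requires a local extension trait.

```rust
// Illustrative stand-ins.
#[allow(dead_code)]
enum OrchestratorControlFlow { Assigned(u64), Unassigned, Stop }
#[allow(dead_code)]
struct OrchestratorError(String);
#[allow(dead_code)]
struct UpgradeCheckError(Option<OrchestratorControlFlow>, OrchestratorError);

// A local extension trait is needed because `Result` is a foreign type.
trait GetControlFlow {
    fn control_flow(&self) -> Option<&OrchestratorControlFlow>;
}

impl GetControlFlow for Result<OrchestratorControlFlow, UpgradeCheckError> {
    fn control_flow(&self) -> Option<&OrchestratorControlFlow> {
        match self {
            Ok(flow) => Some(flow),
            // On error, the flow may or may not have been determined.
            Err(UpgradeCheckError(maybe_flow, _)) => maybe_flow.as_ref(),
        }
    }
}

fn main() {
    let result: Result<OrchestratorControlFlow, UpgradeCheckError> =
        Ok(OrchestratorControlFlow::Stop);
    assert!(result.control_flow().is_some());
}
```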

Which brings us to

If returning the subnet assignment is too much of a hassle, especially since it needs to be updated in both the Ok and Err case, maybe another idea could be to have the Upgrade struct own a copy to this RwLock and mutating it directly there?

I think this is the cleanest solution. Unfortunately, it is impossible to hold the lock across an .await. We could use an async RwLock, but that does not feel very idiomatic (it is slower and would make the dashboard's response async). Instead, we can commit the Assigned assignment early and overwrite it later if we discover we are actually no longer part of the subnet. This means that other tasks could potentially see the unassignment a bit late, but that is not a big deal, since the early commitment to Assigned is sound: we were previously in the subnet. This is what I finally implemented, reverting a lot of the original changes of the PR. I have to agree it looks much nicer.
I'm also open to using an async RwLock to avoid requesting an exclusive lock twice, but that could introduce blocking on the readers (including the dashboard's response) while we are fetching a CUP from our peers (in get_latest_cup), so I lean towards the current solution.
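A synchronous sketch of the implemented shape (stand-in names; in the real loop the gap between the two writes contains .await points, which is why the lock cannot simply be held throughout):

```rust
use std::sync::RwLock;

#[derive(Debug, PartialEq)]
enum SubnetAssignment { Unknown, Assigned(u64), Unassigned }

// Stand-in for the long-running check; in the real code this is async
// (e.g. fetching a CUP from peers), so the lock must not be held across it.
fn long_running_check(_subnet_id: u64) -> bool {
    false // pretend we discover that the node has left the subnet
}

fn main() {
    let assignment = RwLock::new(SubnetAssignment::Unknown);
    let subnet_id = 42;

    // Commit `Assigned` early: sound, since the node was in the subnet before.
    *assignment.write().unwrap() = SubnetAssignment::Assigned(subnet_id);

    // The lock is NOT held here.
    let still_assigned = long_running_check(subnet_id);

    // Overwrite later if we learned that we actually left the subnet.
    if !still_assigned {
        *assignment.write().unwrap() = SubnetAssignment::Unassigned;
    }
    assert_eq!(*assignment.read().unwrap(), SubnetAssignment::Unassigned);
}
```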

