Clean up setup experience cancellation behavior#43437
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #43437 +/- ##
==========================================
- Coverage 66.91% 66.89% -0.03%
==========================================
Files 2596 2597 +1
Lines 208103 208259 +156
Branches 9322 9214 -108
==========================================
+ Hits 139250 139305 +55
- Misses 56197 56276 +79
- Partials 12656 12678 +22
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
…ctivity emission Zed + Opus 4.6; prompt: Are there tests covering the new activity creation in `server/worker/apple_mdm.go` L594-608? If not, add test coverage to assert that the activity is as expected.
Zed + Opis 4.6; prompt: Update tests added on this branch to reflect the cleanup done in the most recent commit.
Zed + Opus 4.6; prompt: Resolve test failures in `TestIntegrationsEnterprise/TestSetupExperienceLinuxWithSoftware` and `TestIntegrationsMDM/TestCancelUpcomingActivity`. You can enable MySQL tests to validate that tests run correctly; don't guess based on code analysis.
(even though I personally use "cancelled")
| { | ||
| HostUUID: hostUUID, | ||
| Status: fleet.SetupExperienceStatusFailure, | ||
| SetupExperienceScriptID: ptr.Uint(4), |
There was a problem hiding this comment.
@marko-lisica @melpike Right now we don't block setup experience on a failed script, so we don't emit a cancelled setup experience activity when a script is cancelled. Looks like we emit the ran_script activity with a failure status, but the FromSetupExperience flag doesn't actually get persisted so setup experience script activities aren't retrieves as such.
Given that setup experience scripts for macOS are a bit of a legacy feature (vs. script packages), do we need to do anything here?
ee/server/service/orbit_test.go
Outdated
| assert.Equal(t, "FailedApp", cseAct.SoftwareTitle) | ||
| assert.Equal(t, failedTitleID, cseAct.SoftwareTitleID) | ||
|
|
||
| // Second: canceled install software |
There was a problem hiding this comment.
@marko-lisica @melpike Do we want this order on activities (failed install first, then cxl'd setup experience, then cxl'd software/app installs) or do we want the setup experience cancellation activity after all of the individual install cancellations? The install failure itself happens a fair amount earlier in a different request so that would need to stay first, absent a larger refactor.
| assert.True(t, vppAct.FromSetupExperience) | ||
| }) | ||
|
|
||
| t.Run("script cancellation does not trigger activity", func(t *testing.T) { |
There was a problem hiding this comment.
@melpike @marko-lisica This test explicitly describes what I mentioned above.
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
@claude review once |
There was a problem hiding this comment.
Pull request overview
Adds audit-log visibility and UI rendering for setup experience cancellation due to software install failures, and improves cancellation attribution for related “canceled install” activities.
Changes:
- Adds a new
canceled_setup_experienceactivity type and UI templates for host + global activity feeds. - Extends “canceled install” activities (software + App Store app) with
from_setup_experiencemetadata and surfaces that in UI copy. - Records additional activities on VPP enqueue failures during setup experience and adds/updates tests accordingly.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
server/worker/apple_mdm.go |
Emits an installed_app_store_app activity when VPP enqueue fails during iOS/iPadOS setup experience. |
server/worker/apple_mdm_test.go |
Adds worker test coverage asserting activity emission on VPP enqueue failure. |
server/service/integration_mdm_test.go |
Updates integration expectations to include from_setup_experience: false on canceled install activities. |
server/service/integration_enterprise_test.go |
Updates setup-experience Linux integration test to expect canceled install activity semantics. |
server/fleet/setup_experience.go |
Adds helper IsForVPPApp() to classify setup experience status results. |
server/fleet/activities.go |
Adds canceled_setup_experience activity type; extends canceled install activity details with from_setup_experience. |
frontend/pages/hosts/details/cards/Activity/ActivityItems/CanceledSetupExperienceActivityItem/index.ts |
Exports the new host activity item component. |
frontend/pages/hosts/details/cards/Activity/ActivityItems/CanceledSetupExperienceActivityItem/CanceledSetupExperienceActivityItem.tsx |
Renders the new host-scoped canceled setup experience activity item. |
frontend/pages/hosts/details/cards/Activity/ActivityItems/CanceledInstallSoftwareActivityItem/CanceledInstallSoftwareActivityItem.tsx |
Updates host activity copy to mention “during setup experience” when applicable. |
frontend/pages/hosts/details/cards/Activity/ActivityConfig.tsx |
Wires the new activity type to the host activity component map. |
frontend/pages/DashboardPage/cards/ActivityFeed/GlobalActivityItem/GlobalActivityItem.tsx |
Adds global-feed detail template + switch case for canceled setup experience; updates copy for canceled installs. |
frontend/interfaces/activity.ts |
Introduces new activity enum value and filter label; adds from_setup_experience to details typing. |
ee/server/service/setup_experience.go |
Emits an installed_app_store_app activity on VPP enqueue failure during setup experience. |
ee/server/service/orbit.go |
Replaces “fail cancelled installs” logic with “record canceled activities + mark cancelled as failure” logic. |
ee/server/service/orbit_test.go |
Adds unit tests for the new cancellation-recording helper and WasFromAutomation behavior. |
ee/server/service/devices.go |
Uses the new cancellation-recording helper when serving device setup experience status. |
cmd/fleet/serve.go |
Passes svc.NewActivity into the Apple MDM worker schedule. |
cmd/fleet/cron.go |
Threads NewActivityFn into the Apple MDM worker instance. |
changes/34288-setup-experience-cancel-activity |
Adds release notes for the new cancellation/failure activities behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Resolves #43437 (comment)
…dupes if that step doesn't succeed Resolves #43437 (review)
|
@claude review once |
There was a problem hiding this comment.
The bug hunter found a couple of confirmed issues worth addressing before merge: the FromSetupExperience field on ActivityInstalledAppStoreApp and ActivityTypeInstalledSoftware carries json:"-" instead of json:"from_setup_experience", so the frontend "during setup experience" annotation will silently never appear for those activity types; and ActivityTypeCanceledSetupExperience is missing SoftwareDisplayName, so custom VPP display names won't show in the canceled-setup-experience activity.
Extended reasoning...
Overview
PR #43437 cleans up setup experience cancellation behavior: the canceled_setup_experience activity is now emitted from maybyCancelPendingSetupExperienceSteps rather than downstream, ActivityTypeCanceledSetupExperience is introduced as a new struct, and several activity types gain a FromSetupExperience: true flag to annotate the frontend with "during setup experience" context.
Security Risks
No security-sensitive paths touched. This is purely activity/telemetry plumbing and setup experience state transitions.
Confirmed Bugs Found
Two bugs were confirmed by static inspection of the codebase:
Bug 4 (json tag): In server/fleet/activities.go, both ActivityTypeInstalledSoftware.FromSetupExperience (line 1091) and ActivityInstalledAppStoreApp.FromSetupExperience (line 1334) are tagged json:"-". The PR's new code sets FromSetupExperience: true on these types in multiple call sites, but the field will never serialize to JSON. The frontend template that gates on from_setup_experience will therefore never fire for installed_software or installed_app_store_app activities. By contrast, ActivityTypeCanceledInstallSoftware and ActivityTypeCanceledInstallAppStoreApp correctly use json:"from_setup_experience".
Bug 1 (missing field): ActivityTypeCanceledSetupExperience has SoftwareTitle but no SoftwareDisplayName field. The new CanceledSetupExperienceActivityItem frontend component calls getDisplayedSoftwareName(activity.details.software_title, activity.details.software_display_name), which prefers software_display_name for VPP apps — but that key will always be undefined because the Go struct never emits it.
Level of Scrutiny
Medium-high. The core feature this PR ships ("during setup experience" annotation and canceled-setup-experience activity) is partially broken due to the json:"-" tag mismatch. These are not cosmetic nits — they affect the visible correctness of the activity feed. A human should verify the fix and confirm the two other flagged issues (spurious upcoming-activity activation and potential duplicate activity emission) before merging.
| failActivity := fleet.ActivityInstalledAppStoreApp{ | ||
| HostID: host.ID, | ||
| HostDisplayName: host.DisplayName(), | ||
| SoftwareTitle: app.Name, | ||
| AppStoreID: ptr.ValOrZero(app.VPPAppAdamID), | ||
| Status: string(fleet.SoftwareInstallFailed), | ||
| HostPlatform: host.Platform, | ||
| FromSetupExperience: true, | ||
| } | ||
| if actErr := a.NewActivityFn(ctx, nil, failActivity); actErr != nil { | ||
| a.Log.WarnContext(ctx, "failed to create activity for VPP app install failure during setup experience", "err", actErr) | ||
| } | ||
| } |
There was a problem hiding this comment.
🔴 ActivityInstalledAppStoreApp.MustActivateNextUpcomingActivity() lacks the CommandUUID != "" guard present in the analogous ActivityTypeInstalledSoftware.MustActivateNextUpcomingActivity(), causing ActivateNextUpcomingActivity to be called with an empty execution ID when a VPP install fails before any MDM command is sent. This does not delete a completed entry (the delete is guarded by fromCompletedExecID != ""), but the function still proceeds to find and spuriously activate the next pending upcoming activity for the host out of order. Fix: add the same a.CommandUUID != "" guard to ActivityInstalledAppStoreApp.MustActivateNextUpcomingActivity().
Extended reasoning...
What the bug is and how it manifests
This PR introduces two new code paths that emit ActivityInstalledAppStoreApp with an empty CommandUUID when a VPP install fails before an MDM command is ever sent: (1) ee/server/service/setup_experience.go around line 285, where installSoftwareFromVPP returns an error before InsertHostVPPSoftwareInstall is called; (2) server/worker/apple_mdm.go around line 605, where installSoftwareFromVPP returns an error in the iOS/iPadOS setup experience path. In both cases, failActivity is constructed with CommandUUID left at its zero value (empty string), and svc.NewActivity / a.NewActivityFn is called with that activity.
The specific code path that triggers it
When NewActivity is called, the activity service eventually calls MustActivateNextUpcomingActivity() on the ActivityInstalledAppStoreApp struct (server/fleet/activities.go ~line 1355). That method is currently:
return a.Status != string(SoftwareInstalled)
Because Status is SoftwareInstallFailed, this returns true. The service then calls ActivateNextUpcomingActivityArgs() which returns (hostID, ""), and ultimately ActivateNextUpcomingActivity(ctx, hostID, "") is called (new_activity.go ~line 61).
Why existing code does not prevent it
Inside activateNextUpcomingActivity (activities.go ~line 933), the first action is to delete a completed upcoming activity by its execution ID. This delete step is correctly guarded by if fromCompletedExecID != "" (activities.go ~line 972), so with an empty string it is a no-op. However, the function unconditionally continues: it queries for the next unactivated pending upcoming activity for that host, and if one exists, activates it. Because no entry was deleted, this activation happens out of order — a pending activity is promoted without its predecessor having actually completed.
The analogous ActivityTypeInstalledSoftware.MustActivateNextUpcomingActivity() (activities.go ~line 1112) was already fixed for exactly this scenario:
return a.CommandUUID != "" && a.Status != string(SoftwareInstalled)
The CommandUUID guard ensures that when a failure activity is emitted without a real execution ID (pre-enqueue failure), the upcoming-activity machinery is not triggered at all. ActivityInstalledAppStoreApp was not updated to match.
What the impact would be
If a host has other pending upcoming activities queued (e.g., a queued software install, script run, or lock command), the failure of a VPP pre-enqueue will spuriously activate one of those activities out of order. The upcoming-activities queue is designed to process one item at a time in a specific sequence; out-of-order activation can cause a later activity to execute before an earlier dependency has finished (or started), potentially leading to incorrect fleet state.
How to fix it
Change ActivityInstalledAppStoreApp.MustActivateNextUpcomingActivity() in server/fleet/activities.go to mirror the existing guard:
return a.CommandUUID != "" && a.Status != string(SoftwareInstalled)
This ensures that pre-enqueue failure activities (which have no real MDM command UUID) do not trigger the upcoming-activity activation pipeline.
Step-by-step proof
- A VPP install is attempted via setup experience for an iOS/iPadOS host that has other upcoming activities queued.
- installSoftwareFromVPP returns an error (e.g., no available licenses); no MDM command is sent, so cmdUUID is empty.
- failActivity := fleet.ActivityInstalledAppStoreApp{..., Status: SoftwareInstallFailed} is constructed with CommandUUID left as "".
- a.NewActivityFn(ctx, nil, failActivity) is called.
- The activity service invokes failActivity.MustActivateNextUpcomingActivity() → returns true (Status != SoftwareInstalled, no CommandUUID check).
- ActivateNextUpcomingActivity(ctx, hostID, "") is called.
- Inside activateNextUpcomingActivity: the delete WHERE execution_id='' is a no-op (empty-string guard). The function then queries the host's upcoming activities and activates the next one.
- A second upcoming activity is now in "activated" state even though the first one never completed, breaking the intended sequential ordering.
| // Emit the canceled_setup_experience activity once at cancellation time. | ||
| // Find the software item that failed and triggered this cancellation from the | ||
| // already-loaded statuses (no extra DB call). | ||
| if newActivityFn != nil { | ||
| for _, s := range statuses { | ||
| if s.Status == fleet.SetupExperienceStatusFailure && s.IsForSoftware() { | ||
| if err := newActivityFn(ctx, nil, fleet.ActivityTypeCanceledSetupExperience{ | ||
| HostID: host.ID, | ||
| HostDisplayName: host.DisplayName(), | ||
| SoftwareTitle: s.Name, | ||
| SoftwareTitleID: ptr.ValOrZero(s.SoftwareTitleID), | ||
| }); err != nil { | ||
| return ctxerr.Wrap(ctx, err, "creating canceled setup experience activity") | ||
| } | ||
| break | ||
| } | ||
| } | ||
| } | ||
|
|
||
| return nil | ||
| } |
There was a problem hiding this comment.
🔴 When a late MDM result arrives for a VPP or software install that was set to cancelled by CancelPendingSetupExperienceSteps, MaybeUpdateSetupExperienceVPPStatus (and the software variant) silently transitions that row from cancelled to failure with no status guard on the UPDATE, returning updated=true. This causes maybeUpdateSetupExperienceStatus to re-invoke maybeCancelPendingSetupExperienceSteps, which finds the original failed software item still in failure state and emits a duplicate canceled_setup_experience activity. Fix by guarding the datastore UPDATE to only apply when the current status is not already terminal, or by checking whether CancelPendingSetupExperienceSteps actually cancelled any rows before emitting the activity.
Extended reasoning...
What the bug is and how it manifests
The new maybeCancelPendingSetupExperienceSteps function (server/service/setup_experience.go:318–341) emits ActivityTypeCanceledSetupExperience whenever it finds a software item in SetupExperienceStatusFailure in the loaded statuses list. This emission happens unconditionally after calling ds.CancelPendingSetupExperienceSteps, regardless of whether that call actually cancelled any rows. Because maybeCancelPendingSetupExperienceSteps can be invoked multiple times for the same host (once per late-arriving MDM result), a duplicate activity is emitted on every subsequent invocation that still finds a failure row.
The specific code path that triggers it
When requireAllSoftware=true and software item A fails, maybeCancelPendingSetupExperienceSteps is called (via maybeUpdateSetupExperienceStatus at line 397–403). It cancels in-flight item B and emits the activity. Later, B's in-flight MDM command result arrives with an error status. MaybeUpdateSetupExperienceVPPStatus (and the software-install variant) execute a simple SELECT id … WHERE host_uuid=? AND nano_command_uuid=? followed by UPDATE setup_experience_status_results SET status=? WHERE id=?. There is no AND status NOT IN ('cancelled','failure','success') guard on this UPDATE, so the row transitions from cancelled → failure and RowsAffected=1. maybeUpdateSetupExperienceStatus sees updated=true && status==Failure and calls maybeCancelPendingSetupExperienceSteps a second time.
Why existing code doesn't prevent it
The comment in the new code says 'Emit the canceled_setup_experience activity once at cancellation time,' but nothing in the function enforces idempotency. The function scans statuses (freshly fetched from DB) and finds item A still in SetupExperienceStatusFailure – A's status is never updated because it was the original failure trigger, not a cancelled item. So every invocation that sees at least one failed software item will emit the activity, and the number of invocations is unbounded when requireAllSoftware=true (the device is never released, orbit keeps polling).
What the impact would be
Every in-flight MDM result that arrives after cancellation produces a spurious duplicate canceled_setup_experience activity entry in the audit log. Under requireAllSoftware=true the device is never released, so orbit polls indefinitely, meaning the duplicate activity is emitted once per poll cycle (typically every 30 seconds), producing an unbounded stream of duplicates for the same cancellation event.
How to fix it
Two equivalent fixes: (a) guard the datastore UPDATE in MaybeUpdateSetupExperienceVPPStatus and MaybeUpdateSetupExperienceSoftwareInstallStatus so rows already in a terminal state (cancelled, failure, success) are not updated – this makes updated=false and prevents the re-trigger entirely; or (b) have CancelPendingSetupExperienceSteps return the number of rows it actually set to cancelled and gate the activity emission in maybeCancelPendingSetupExperienceSteps on rowsCancelled > 0.
Step-by-step proof
requireAllSoftware=true. macOS host has software A (pending) and VPP app B (running/pending). A's install result arrives with failure status.maybeUpdateSetupExperienceStatus→ A's status set to Failure (updated=true, status=Failure).maybeCancelPendingSetupExperienceStepsinvoked.CancelPendingSetupExperienceStepssets B: Running→Cancelled (1 row affected). Activity emitted for A. ✓- B's MDM command result arrives as Error.
MaybeUpdateSetupExperienceVPPStatusruns:SELECT id … WHERE host_uuid=? AND nano_command_uuid=?finds B (status is now 'cancelled', but SELECT has no status guard).UPDATE … SET status='failure' WHERE id=?runs, RowsAffected=1. Returnsupdated=true. maybeUpdateSetupExperienceStatusseesupdated=true && status==Failure→ callsmaybeCancelPendingSetupExperienceStepsagain.- Freshly loaded statuses: A=Failure, B=Failure. No pending/running items.
CancelPendingSetupExperienceStepscancels 0 rows. - Activity emission loop finds A=Failure → emits DUPLICATE
ActivityTypeCanceledSetupExperience. - Device still not released (requireAllSoftware+failure path), orbit polls again in ~30s, step 5–7 repeats indefinitely.
There was a problem hiding this comment.
@iansltx this from the bot does seem like a decent find. Thoughts?
There was a problem hiding this comment.
Yeah, digging into the other findings first but this is on my list. TBD what the fix shakes out to be here.
… no command UUID Resolves #43437 (comment)
Zed + Opus 4.6; prompt (after multiple attempts it wouldn't create the test any other way):
Commit `159194acc9d92843bb2de933309f159c84a501aa` should fix the issue in the following execution chain:
1. A VPP install is attempted via setup experience for an iOS/iPadOS host that has other upcoming activities queued.
2. installSoftwareFromVPP returns an error (e.g., no available licenses); no MDM command is sent, so cmdUUID is empty.
3. failActivity := fleet.ActivityInstalledAppStoreApp{..., Status: SoftwareInstallFailed} is constructed with CommandUUID left as "".
4. a.NewActivityFn(ctx, nil, failActivity) is called.
5. The activity service invokes failActivity.MustActivateNextUpcomingActivity() → returns true (Status != SoftwareInstalled, no CommandUUID check).
6. ActivateNextUpcomingActivity(ctx, hostID, "") is called.
7. Inside activateNextUpcomingActivity: the delete WHERE execution_id='' is a no-op (empty-string guard). The function then queries the host's upcoming activities and activates the next one.
8. A second upcoming activity is now in "activated" state even though the first one never completed, breaking the intended sequential ordering.
Build out a test that covers this workflow, confirming that the commit does indeed solve this issue. Item 8 in the list needs to be checked for as part of this.
Zed + Opus 4.6; prompt: Fix `TestIntegrationsEnterprise` and `TestIntegrationsMDM` failures. Pretty sure they're all due to `from_setup_experience` being added to activities.
Zed + Sonnet 4.6; prompt: Update audit-logs.md to reflect the new fields introduced in `d3666ab`
…e activites module Zed + Opus 4.6; prompt: Move `TestVPPInstallFailureEmptyCommandUUIDDoesNotActivateNext` to the main module to avoid the activity module depending on the main module.
For #34288.
Related issue: Resolves #
Checklist for submitter
If some of the following don't apply, delete the relevant line.
changes/,orbit/changes/oree/fleetd-chrome/changes.See Changes files for more information.
Testing
Added/updated automated tests
QA'd all new/changed functionality manually
Summary by CodeRabbit