Conversation
…a certificate install failure
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Pull request overview
Implements automatic retry behavior for Android certificate template installs, adding persistent retry tracking and tests to validate retry/terminal failure behavior. Resolves #37546.
Changes:
- Added
retry_counttohost_certificate_templates(schema + migration) and surfaced it inHostCertificateTemplate. - Added datastore support to reset failed installs back to
pendingand increment retry count for automatic retries. - Updated the certificate status update service flow and expanded integration/datastore tests around retries, terminal failure, and resend semantics.
Reviewed changes
Copilot reviewed 10 out of 11 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| server/service/certificates.go | Adds automatic retry logic on install failure and attempts to log install activities. |
| server/datastore/mysql/host_certificate_templates.go | Adds RetryHostCertificateTemplate and includes retry_count in record retrieval + renewal reset. |
| server/datastore/mysql/certificate_templates.go | Updates resend logic to set retry_count to max (single post-resend attempt semantics). |
| server/fleet/host_certificate_template.go | Introduces MaxCertificateInstallRetries and RetryCount field on the model. |
| server/fleet/datastore.go | Extends datastore interface with RetryHostCertificateTemplate. |
| server/mock/datastore_mock.go | Adds mock support for the new datastore method. |
| server/datastore/mysql/schema.sql | Adds retry_count column and bumps migration status table seed. |
| server/datastore/mysql/migrations/tables/20260331000000_AddCertRetryCount.go | Migration to add retry_count column. |
| server/datastore/mysql/host_certificate_templates_test.go | Adds unit test coverage for retry reset behavior and challenge deletion. |
| server/service/integration_android_certificate_templates_test.go | Adds integration coverage for auto-retry/terminal failure/resend behavior. |
| changes/37546-android-certificate-install-activity | Adds a changelog entry for the retry behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
WalkthroughThis PR adds automatic retry logic for Android certificate installation failures: a new Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
server/datastore/mysql/certificate_templates.go (1)
405-423:⚠️ Potential issue | 🟠 MajorReset the retry budget on resend.
Line 418 writes
fleet.MaxCertificateInstallRetriesback intoretry_count, so the next install failure after a manual resend is treated as terminal and gets zero automatic retries. That makes a resend a one-shot attempt instead of starting a fresh delivery cycle.Proposed fix
func (ds *Datastore) ResendHostCertificateTemplate(ctx context.Context, hostID uint, templateID uint) error { - stmt := fmt.Sprintf(` + const stmt = ` UPDATE host_certificate_templates hct INNER JOIN hosts h ON h.uuid = hct.host_uuid SET hct.uuid = UUID_TO_BIN(UUID(), true), hct.fleet_challenge = NULL, hct.not_valid_before = NULL, hct.not_valid_after = NULL, hct.serial = NULL, hct.detail = NULL, - hct.retry_count = %d, + hct.retry_count = 0, hct.status = ? WHERE h.id = ? AND hct.certificate_template_id = ? - `, fleet.MaxCertificateInstallRetries) + `🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/datastore/mysql/certificate_templates.go` around lines 405 - 423, The ResendHostCertificateTemplate SQL resets retry_count incorrectly to fleet.MaxCertificateInstallRetries which makes the next install failure treated as terminal; change the UPDATE in ResendHostCertificateTemplate to reset hct.retry_count back to 0 (or the initial starting value your delivery cycle expects) instead of fleet.MaxCertificateInstallRetries so a manual resend restarts the delivery retry budget; locate the UPDATE that touches host_certificate_templates / hct.retry_count in the ResendHostCertificateTemplate function and replace the injected fleet.MaxCertificateInstallRetries with 0 (or the configured initial retry counter) and run tests.
🧹 Nitpick comments (1)
server/datastore/mysql/host_certificate_templates_test.go (1)
2012-2077: Good retry-flow coverage; add assertions for cleared challenge and regenerated UUID.This test already checks the key state transitions well. Consider also asserting
record.FleetChallenge == niland UUID rotation after retry to fully lock down the retry contract.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/datastore/mysql/host_certificate_templates_test.go` around lines 2012 - 2077, Add assertions in testRetryHostCertificateTemplate to ensure the FleetChallenge is cleared and the template UUID is rotated on retry: before calling ds.RetryHostCertificateTemplate capture the initial record's FleetChallenge and UUID via ds.GetHostCertificateTemplateRecord, then after the first RetryHostCertificateTemplate call assert record.FleetChallenge == nil (or record.FleetChallenge pointer is nil) and assert the record.UUID (or the unique identifier field returned by GetHostCertificateTemplateRecord) is different from the captured initial UUID; use the same helpers (GetHostCertificateTemplateRecord and RetryHostCertificateTemplate) to locate and verify these values.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/datastore/mysql/host_certificate_templates.go`:
- Around line 175-205: The RetryHostCertificateTemplate function currently
performs the DELETE and UPDATE outside a transaction and does not verify the
UPDATE actually affected a row; change it to run both statements inside a single
transaction (use ds.writer(ctx).BeginTx / tx := ds.writer(ctx).BeginTx or
equivalent to obtain a tx and use tx.ExecContext for both statements), ensure
you call tx.Rollback on error and tx.Commit on success, and after the UPDATE
check the sql.Result.RowsAffected value and return an error (wrapped with
ctxerr.Wrap) if zero rows were affected so missed target rows are surfaced; keep
existing error messages and preserve the detail parameter handling and UUID
logic when moving the statements into the transaction.
In `@server/service/certificates.go`:
- Around line 699-741: The activity for failed installs is being created before
the retry/DB update, causing transient failures to be logged; change the flow so
svc.NewActivity (creating CertificateActivityInstalled /
CertificateActivityFailedInstall) is only called after the datastore terminal
write completes — i.e., run RetryHostCertificateTemplate and/or
UpsertCertificateStatus first (use svc.ds.RetryHostCertificateTemplate and the
UpsertCertificateStatus path that writes the final record), then, only when the
status is terminal (MDMDeliveryVerified or MDMDeliveryFailed with
record.RetryCount >= fleet.MaxCertificateInstallRetries), call svc.NewActivity
to emit the installed_certificate/failed_install activity (using
CertificateActivityFailedInstall, MDMDeliveryFailed, record.RetryCount and
MaxCertificateInstallRetries to determine terminality) and handle/log errors as
before.
In `@server/service/integration_android_certificate_templates_test.go`:
- Around line 1621-1633: The current check in TestCertificateTemplateResend only
matches act.Type and can pick an earlier successful installed_certificate
activity; instead, after fetching hostActivitiesResp (listActivitiesResponse)
locate the most recent activity for ActivityTypeInstalledCertificate by
inspecting hostActivitiesResp.Activities ordered by created_at (or use index 0
if response is newest-first), then assert the activity's payload/status/detail
indicates failure (e.g., payload.Status == "failed_install" or payload.Detail
contains the failure message) rather than just matching act.Type; update the
verification using the activity struct fields (from listActivitiesResponse/act)
to confirm the terminal failed_install record is logged.
---
Outside diff comments:
In `@server/datastore/mysql/certificate_templates.go`:
- Around line 405-423: The ResendHostCertificateTemplate SQL resets retry_count
incorrectly to fleet.MaxCertificateInstallRetries which makes the next install
failure treated as terminal; change the UPDATE in ResendHostCertificateTemplate
to reset hct.retry_count back to 0 (or the initial starting value your delivery
cycle expects) instead of fleet.MaxCertificateInstallRetries so a manual resend
restarts the delivery retry budget; locate the UPDATE that touches
host_certificate_templates / hct.retry_count in the
ResendHostCertificateTemplate function and replace the injected
fleet.MaxCertificateInstallRetries with 0 (or the configured initial retry
counter) and run tests.
---
Nitpick comments:
In `@server/datastore/mysql/host_certificate_templates_test.go`:
- Around line 2012-2077: Add assertions in testRetryHostCertificateTemplate to
ensure the FleetChallenge is cleared and the template UUID is rotated on retry:
before calling ds.RetryHostCertificateTemplate capture the initial record's
FleetChallenge and UUID via ds.GetHostCertificateTemplateRecord, then after the
first RetryHostCertificateTemplate call assert record.FleetChallenge == nil (or
record.FleetChallenge pointer is nil) and assert the record.UUID (or the unique
identifier field returned by GetHostCertificateTemplateRecord) is different from
the captured initial UUID; use the same helpers
(GetHostCertificateTemplateRecord and RetryHostCertificateTemplate) to locate
and verify these values.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 201481b3-3ee9-4874-805b-ec201936a77b
📒 Files selected for processing (11)
changes/37546-android-certificate-install-activityserver/datastore/mysql/certificate_templates.goserver/datastore/mysql/host_certificate_templates.goserver/datastore/mysql/host_certificate_templates_test.goserver/datastore/mysql/migrations/tables/20260331000000_AddCertRetryCount.goserver/datastore/mysql/schema.sqlserver/fleet/datastore.goserver/fleet/host_certificate_template.goserver/mock/datastore_mock.goserver/service/certificates.goserver/service/integration_android_certificate_templates_test.go
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #42734 +/- ##
==========================================
+ Coverage 66.80% 66.82% +0.01%
==========================================
Files 2541 2542 +1
Lines 203737 204013 +276
Branches 9279 9279
==========================================
+ Hits 136107 136332 +225
- Misses 55307 55333 +26
- Partials 12323 12348 +25
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
server/service/certificates.go (1)
699-744:⚠️ Potential issue | 🟠 MajorPersist the retry/final state before creating the certificate activity.
ActivityTypeInstalledCertificateis written before eitherRetryHostCertificateTemplateorUpsertCertificateStatusruns, so a later datastore error can leave aninstalled/failed_installactivity for a transition that never committed. It also recordsfailed_installbefore the code knows whether the failure is terminal or just being queued for retry.💡 Suggested structure
- // Log activity for install statuses (not removals). Failures are logged on every attempt - // (including retries) so IT admins have visibility into retry attempts. - if update.OperationType == fleet.MDMOperationTypeInstall { - var actStatus fleet.CertificateActivityStatus - switch update.Status { - case fleet.MDMDeliveryVerified: - actStatus = fleet.CertificateActivityInstalled - case fleet.MDMDeliveryFailed: - actStatus = fleet.CertificateActivityFailedInstall - } - if actStatus != "" { - detail := "" - if update.Detail != nil { - detail = *update.Detail - } - if err := svc.NewActivity(ctx, nil, fleet.ActivityTypeInstalledCertificate{ - HostID: host.ID, - HostDisplayName: host.DisplayName(), - CertificateTemplateID: update.CertificateTemplateID, - CertificateName: record.Name, - Status: string(actStatus), - Detail: detail, - }); err != nil { - // Log and continue since we don't want the client to fail on this. - svc.logger.ErrorContext(ctx, "failed to create certificate install activity", "host.id", host.ID, "activity.status", actStatus, - "err", err) - ctxerr.Handle(ctx, err) - } - } - } - // For failed installs, automatically retry if under the retry limit. if update.OperationType == fleet.MDMOperationTypeInstall && update.Status == fleet.MDMDeliveryFailed { if record.RetryCount < fleet.MaxCertificateInstallRetries { detail := "" if update.Detail != nil { detail = *update.Detail } if err := svc.ds.RetryHostCertificateTemplate(ctx, host.UUID, update.CertificateTemplateID, detail); err != nil { return ctxerr.Wrap(ctx, err, "retrying certificate install") } return nil } } - return svc.ds.UpsertCertificateStatus(ctx, update) + if err := svc.ds.UpsertCertificateStatus(ctx, update); err != nil { + return err + } + + if update.OperationType == fleet.MDMOperationTypeInstall { + var actStatus fleet.CertificateActivityStatus + switch update.Status { + case fleet.MDMDeliveryVerified: + actStatus = fleet.CertificateActivityInstalled + case fleet.MDMDeliveryFailed: + actStatus = fleet.CertificateActivityFailedInstall + } + if actStatus != "" { + detail := "" + if update.Detail != nil { + detail = *update.Detail + } + if err := svc.NewActivity(ctx, nil, fleet.ActivityTypeInstalledCertificate{ + HostID: host.ID, + HostDisplayName: host.DisplayName(), + CertificateTemplateID: update.CertificateTemplateID, + CertificateName: record.Name, + Status: string(actStatus), + Detail: detail, + }); err != nil { + svc.logger.ErrorContext(ctx, "failed to create certificate install activity", "host.id", host.ID, "activity.status", actStatus, "err", err) + ctxerr.Handle(ctx, err) + } + } + } + + return nil🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/service/certificates.go` around lines 699 - 744, The activity (ActivityTypeInstalledCertificate with statuses CertificateActivityInstalled/CertificateActivityFailedInstall) is created before the datastore changes (RetryHostCertificateTemplate or UpsertCertificateStatus) are persisted, which can leave activities that never correspond to a committed state; change the flow so you first perform the datastore operation (for MDMDeliveryFailed: call RetryHostCertificateTemplate when retrying, otherwise call UpsertCertificateStatus to persist the final status), check for errors, and only after the datastore call succeeds call svc.NewActivity to record the CertificateActivity; preserve the same fields (detail from update.Detail, host.ID/DisplayName, update.CertificateTemplateID, record.Name) when creating the activity and keep existing error handling around NewActivity.
🧹 Nitpick comments (1)
server/fleet/activities.go (1)
1831-1837: Use typed status instead of raw string for install-certificate activity.Line 1836 currently uses
string, which bypasses the status type constraints added in this PR and can allow invalid values.Proposed refactor
type ActivityTypeInstalledCertificate struct { HostID uint `json:"host_id"` HostDisplayName string `json:"host_display_name"` CertificateTemplateID uint `json:"certificate_template_id"` CertificateName string `json:"certificate_name"` - Status string `json:"status"` + Status CertificateActivityStatus `json:"status"` Detail string `json:"detail,omitempty"` }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/fleet/activities.go` around lines 1831 - 1837, The ActivityTypeInstalledCertificate struct's Status field should use the new typed status instead of raw string: change the Status field type in ActivityTypeInstalledCertificate from string to the project's certificate/install status type (e.g., InstallCertificateStatus or CertificateInstallStatus introduced in this PR), update its JSON tag as needed, and fix any call sites that assign or compare Status to use the typed enum (including conversions or imports) so only valid status values are allowed.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@server/datastore/mysql/schema.sql`:
- Around line 1811-1813: Remove the manually added migration_status_tables row
with value 20260331000000 (the tuple starting with (504,20260331000000,...)) and
any manual schema changes for the retry_count column from schema.sql, then
create a proper migration file named with the timestamp prefix 20260331000000
(i.e. 20260331000000*.go) under the migrations/tables migrations set that
performs the retry_count change; after adding that migration, run the project's
migration/schema-generation step so schema.sql is regenerated from migrations
and includes the new migration entry instead of a manual edit.
---
Duplicate comments:
In `@server/service/certificates.go`:
- Around line 699-744: The activity (ActivityTypeInstalledCertificate with
statuses CertificateActivityInstalled/CertificateActivityFailedInstall) is
created before the datastore changes (RetryHostCertificateTemplate or
UpsertCertificateStatus) are persisted, which can leave activities that never
correspond to a committed state; change the flow so you first perform the
datastore operation (for MDMDeliveryFailed: call RetryHostCertificateTemplate
when retrying, otherwise call UpsertCertificateStatus to persist the final
status), check for errors, and only after the datastore call succeeds call
svc.NewActivity to record the CertificateActivity; preserve the same fields
(detail from update.Detail, host.ID/DisplayName, update.CertificateTemplateID,
record.Name) when creating the activity and keep existing error handling around
NewActivity.
---
Nitpick comments:
In `@server/fleet/activities.go`:
- Around line 1831-1837: The ActivityTypeInstalledCertificate struct's Status
field should use the new typed status instead of raw string: change the Status
field type in ActivityTypeInstalledCertificate from string to the project's
certificate/install status type (e.g., InstallCertificateStatus or
CertificateInstallStatus introduced in this PR), update its JSON tag as needed,
and fix any call sites that assign or compare Status to use the typed enum
(including conversions or imports) so only valid status values are allowed.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 59c8ee8f-676b-43d1-b6fa-f4e697e1ab55
📒 Files selected for processing (13)
changes/37546-android-certificate-install-activityserver/datastore/mysql/certificate_templates.goserver/datastore/mysql/host_certificate_templates.goserver/datastore/mysql/host_certificate_templates_test.goserver/datastore/mysql/migrations/tables/20260331000000_AddCertRetryCount.goserver/datastore/mysql/schema.sqlserver/fleet/activities.goserver/fleet/certificate_templates.goserver/fleet/datastore.goserver/fleet/host_certificate_template.goserver/mock/datastore_mock.goserver/service/certificates.goserver/service/integration_android_certificate_templates_test.go
…ert-retry # Conflicts: # changes/37546-android-certificate-install-activity # server/service/certificates.go
…cert-retry # Conflicts: # changes/37546-android-certificate-install-activity # server/service/certificates.go
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
<!-- Add the related story/sub-task/bug number, like Resolves fleetdm#123, or remove if NA --> **Related issue:** Resolves fleetdm#37546 Docs: fleetdm#42780 Demo: https://www.youtube.com/watch?v=K44wRg9_79M # Checklist for submitter If some of the following don't apply, delete the relevant line. - [x] Changes file added for user-visible changes in `changes/`, `orbit/changes/` or `ee/fleetd-chrome/changes`. See [Changes files](https://github.com/fleetdm/fleet/blob/main/docs/Contributing/guides/committing-changes.md#changes-files) for more information. ## Testing - [x] Added/updated automated tests - [x] QA'd all new/changed functionality manually ## Database migrations - [x] Checked schema for all modified table for columns that will auto-update timestamps during migration. <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Automatic retry for Android certificate installations: failed installs are retried up to 3 times before marked terminal. * Installation activities recorded: install/failed-install events (with details) are logged for better visibility and troubleshooting. * Resend/reset actions now reset retry state so retries behave predictably after manual resend. <!-- end of auto-generated comment: release notes by coderabbit.ai -->
Related issue: Resolves #37546
Docs: #42780
Demo: https://www.youtube.com/watch?v=K44wRg9_79M
Checklist for submitter
If some of the following don't apply, delete the relevant line.
changes/,orbit/changes/oree/fleetd-chrome/changes.See Changes files for more information.
Testing
Database migrations
Summary by CodeRabbit