Fix SCEP autorenew failing for offline hosts#44250
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #44250 +/- ##
==========================================
+ Coverage 66.76% 66.77% +0.01%
==========================================
Files 2633 2633
Lines 211751 211709 -42
Branches 9506 9365 -141
==========================================
+ Hits 141370 141376 +6
+ Misses 57539 57493 -46
+ Partials 12842 12840 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| challenge_retrieved_at = VALUES(challenge_retrieved_at), | ||
| not_valid_before = VALUES(not_valid_before), | ||
| not_valid_after = VALUES(not_valid_after), | ||
| not_valid_before = COALESCE(VALUES(not_valid_before), not_valid_before), |
There was a problem hiding this comment.
Could this cause an infinite loop where the renewal cron keeps renewing forever?
There was a problem hiding this comment.
If they keep failing infinitely on renewal, possibly
There was a problem hiding this comment.
The current behaviour is that if there's a failure it silently forgets to renew forever
There was a problem hiding this comment.
Since the query is:
SELECT
hmmc.host_uuid,
hmmc.profile_uuid,
hmmc.not_valid_after,
DATEDIFF(hmmc.not_valid_after, hmmc.not_valid_before) AS validity_period
FROM
host_mdm_managed_certificates hmmc
INNER JOIN
`+table+` hp
ON hmmc.host_uuid = hp.host_uuid AND hmmc.profile_uuid = hp.profile_uuid
WHERE
hmmc.type = ? AND hp.status IS NOT NULL AND hp.operation_type = ?
HAVING
validity_period IS NOT NULL AND
((validity_period > 30 AND not_valid_after < DATE_ADD(NOW(), INTERVAL 30 DAY)) OR
(validity_period <= 30 AND not_valid_after < DATE_ADD(NOW(), INTERVAL validity_period/2 DAY)))
```
Wouldn't an offline host that we send a renewal to get the cert resent, then it would go from status NULL(set by reconciler)->Pending, then it would get resent again hourly in perpetuity? Right now the null check stops the resends from running forever
There was a problem hiding this comment.
That's a good point, I'll see if I can narrow the renew loop to not try on intermediate statuses
There was a problem hiding this comment.
I changed the renewal check to look for verified/failed instead of NOT NULL in the status field. It shouldn't re-trigger for in-flight SCEP profiles anymore
There was a problem hiding this comment.
I think there might still be a bit of a race here on iOS/iPadOS where, since we don't have osquery the profiles go straight to verified on the ack but it could be up to an hour before they check in again and the renewal cron might run in that interval and re-renew the cert?
There is some code that at least for NDES and maybe others triggers a repush of the cert profile when the challenge is expired. Do we know if that's triggering here?
There was a problem hiding this comment.
I have two possible solutions for that problem. I'm not super familiar with the iOS MDM cycle, so take these suggestions with a grain of salt. I could either add a column to keep track of when the last renewal was on the instantly 'verified' profile, or I could detect if the profile contains a SCEP profile and only set it it to 'verifying' until the certificate list is renewed. Do either of those make more sense?
There was a problem hiding this comment.
I got claude to implement both solutions and the second seemed far less invasive. Let me know what you think
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughSCEP renewal behavior changed to retry after an initial failure and to only reset renewal status for profiles in settled delivery states ( 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@changes/44111-scep-autorenew-fail`:
- Line 1: The release note/commit message in changes/44111-scep-autorenew-fail
contains a typo: change "retying" to "retrying" so the line reads "Fixed SCEP
renewals not retrying after initial failure"; update that exact string in the
file (changes/44111-scep-autorenew-fail) so any references to the issue use the
correct spelling.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2ae2f1c1-ec3e-4cc8-b806-b29fed50fbb2
📒 Files selected for processing (5)
changes/44111-scep-autorenew-failserver/datastore/mysql/apple_mdm_test.goserver/datastore/mysql/mdm.goserver/fleet/mdm.goserver/service/integration_mdm_test.go
There was a problem hiding this comment.
Pull request overview
This PR fixes SCEP certificate auto-renewal failures for offline hosts by preventing managed certificate metadata from being wiped during reconcile, extending the one-time SCEP challenge TTL, and ensuring the renewal cron doesn’t retrigger while a profile delivery is still in-flight.
Changes:
- Preserve existing cert metadata (
not_valid_before,not_valid_after,serial) duringBulkUpsertMDMManagedCertificateswhen incoming values areNULL. - Increase
fleet.OneTimeChallengeTTLfrom 1 hour to 7 days to accommodate offline devices. - Restrict renewal cron to only retrigger settled delivery statuses (
verified,failed) to avoid generating hourly orphan challenges/commands for pending deliveries.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| server/service/integration_mdm_test.go | Updates Custom SCEP integration test to backdate challenge expiration relative to fleet.OneTimeChallengeTTL. |
| server/fleet/mdm.go | Extends one-time challenge TTL to 7 days with clarifying comments. |
| server/datastore/mysql/mdm.go | Uses COALESCE in upsert to preserve cert metadata; tightens renewal cron status filtering to settled states. |
| server/datastore/mysql/apple_mdm_test.go | Adds regression coverage for in-flight status skip and cert-metadata preservation behavior. |
| changes/44111-scep-autorenew-fail | Adds release note entries for the bug fix. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
🧹 Nitpick comments (1)
server/datastore/mysql/challenges_test.go (1)
75-77: Assert that backdating UPDATEs affect exactly one row.Right now these UPDATEs return only execution error. Adding a rows-affected assertion will make TTL tests fail fast if fixture setup silently misses the target row.
Proposed hardening
ExecAdhocSQL(t, ds, func(q sqlx.ExtContext) error { - _, err := q.ExecContext(ctx, `UPDATE challenges SET created_at = ? WHERE challenge = ?`, backdated, challenge) - return err + res, err := q.ExecContext(ctx, `UPDATE challenges SET created_at = ? WHERE challenge = ?`, backdated, challenge) + if err != nil { + return err + } + affected, err := res.RowsAffected() + if err != nil { + return err + } + require.EqualValues(t, 1, affected) + return nil })Apply the same pattern to the UPDATE blocks at Line 95 and Line 116.
Also applies to: 94-97, 115-117
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/datastore/mysql/challenges_test.go` around lines 75 - 77, The UPDATEs inside the ExecAdhocSQL calls currently only return execution errors and don't verify rows affected; change each anonymous function passed to ExecAdhocSQL (the func(q sqlx.ExtContext) error used with q.ExecContext to run `UPDATE challenges SET created_at = ? WHERE challenge = ?`) to check the result.RowsAffected() and return an error if it is not exactly 1 so tests fail fast; apply the same rows-affected assertion pattern to the other two identical UPDATE blocks used in the test (the other ExecAdhocSQL anonymous functions that perform similar `UPDATE challenges ... WHERE challenge = ?` calls).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@server/datastore/mysql/challenges_test.go`:
- Around line 75-77: The UPDATEs inside the ExecAdhocSQL calls currently only
return execution errors and don't verify rows affected; change each anonymous
function passed to ExecAdhocSQL (the func(q sqlx.ExtContext) error used with
q.ExecContext to run `UPDATE challenges SET created_at = ? WHERE challenge = ?`)
to check the result.RowsAffected() and return an error if it is not exactly 1 so
tests fail fast; apply the same rows-affected assertion pattern to the other two
identical UPDATE blocks used in the test (the other ExecAdhocSQL anonymous
functions that perform similar `UPDATE challenges ... WHERE challenge = ?`
calls).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 60f3e937-1b2b-4532-9c9b-e4cbc0d5fefc
📒 Files selected for processing (2)
server/datastore/mysql/challenges_test.goserver/service/integration_mdm_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- server/service/integration_mdm_test.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
server/datastore/mysql/challenges_test.go (1)
75-78: Assert fixture backdating actually updated one row.The setup updates currently only check execution errors. Adding a
RowsAffected == 1assertion would make TTL-boundary tests more robust and avoid silent false positives if setup ever stops matching a row.Suggested hardening
ExecAdhocSQL(t, ds, func(q sqlx.ExtContext) error { - _, err := q.ExecContext(ctx, `UPDATE challenges SET created_at = ? WHERE challenge = ?`, backdated, challenge) + res, err := q.ExecContext(ctx, `UPDATE challenges SET created_at = ? WHERE challenge = ?`, backdated, challenge) + if err != nil { + return err + } + n, err := res.RowsAffected() + if err != nil { + return err + } + require.EqualValues(t, 1, n) return err })Also applies to: 94-97, 115-118
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@server/datastore/mysql/challenges_test.go` around lines 75 - 78, The test setup that calls ExecAdhocSQL and runs q.ExecContext to backdate the challenge should assert that the update actually affected one row: after executing q.ExecContext (the UPDATE challenges SET created_at = ? WHERE challenge = ? call used inside ExecAdhocSQL), capture the result, call RowsAffected(), and require it equals 1 (fail the test if not). Apply the same hardening to the other similar setup blocks that use q.ExecContext to update created_at (the other two occurrences mentioned) so each update asserts RowsAffected() == 1.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@server/datastore/mysql/challenges_test.go`:
- Around line 75-78: The test setup that calls ExecAdhocSQL and runs
q.ExecContext to backdate the challenge should assert that the update actually
affected one row: after executing q.ExecContext (the UPDATE challenges SET
created_at = ? WHERE challenge = ? call used inside ExecAdhocSQL), capture the
result, call RowsAffected(), and require it equals 1 (fail the test if not).
Apply the same hardening to the other similar setup blocks that use
q.ExecContext to update created_at (the other two occurrences mentioned) so each
update asserts RowsAffected() == 1.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 8d3a3f76-527d-4fe2-82ac-55e0c69fe123
📒 Files selected for processing (1)
server/datastore/mysql/challenges_test.go
iOS/iPadOS profiles short-circuit pending -> verified on MDM ack since osquery isn't available. For managed-cert profiles, that left a window where the renewal cron saw 'verified' with stale cert metadata still in the renewal window and re-fired renewal each tick until CertificateList caught up. Skip the short-circuit when a managed-cert row exists; flip verifying -> verified from updateHostMDMManagedCertDetailsDB instead.
JordanMontgomery
left a comment
There was a problem hiding this comment.
I think this is probably OK. It feels a little bit racy but I can't quite think of a better way to fix it unless there's something we're missing here
There was a problem hiding this comment.
Actually I think there's one more potential bug here:
If a previously-delivered cert profile is in a failed state like "failed because you deleted your CA" or "failed because the host no longer has IDP variables" I think the cron will keep picking those up now right? Before it wouldn't because the first failure would blank the values I think. Will this cause the cron to get clogged if there are e.g. > 100 of these?
THis might be worth adding a testcase for, I bet the cron will keep marking them for resend as there's no longer a circuit breaker
|
@JordanMontgomery Added a 24h backoff if the certificate is broken |
Reapplies the three independent improvements from #44250 (reverted via #44535) and adds an ingest-side backfill that catches the actual silent-fail mechanism (missed toInsert matcher) without breaking the natural in-flight synchronization between reconcile and the renewal cron. - Bump OneTimeChallengeTTL 1h → 7d so renewals don't fail with "challenge not found" for offline devices that pick up the InstallProfile push days later. - Restrict the renewal cron to settled delivery states ('verified', 'failed') to avoid re-firing renewal while a previous delivery is still in flight. - Gate the new 'failed' branch on a 24h backoff so permanent render-time failures (CA deleted, missing IDP variables) don't loop hourly. - Add backfillHostMDMManagedCertsFromHostCertsDB: when the toInsert matcher in UpdateHostCertificates misses a renewed cert (replica lag, transaction race, verified-without-actual-renewal), look up a matching cert in host_certificates by the 'fleet-<profile_uuid>' substring and populate hmmc. Gated by a 4h grace on hmmc.updated_at so it doesn't clobber the in-flight blank-out, and a monotonic-forward predicate so it's idempotent. Does NOT reintroduce the COALESCE-preserve in BulkUpsertMDMManagedCertificates or the iOS-only park-at-'verifying' carve-out from #44250 — those broke the natural cron synchronization gate (reconcile NULLs hmmc → cron's HAVING IS NOT NULL excludes the row until ingest repopulates). Resolves #44111
Related issue: Resolves #44111
changes/,orbit/changes/oree/fleetd-chrome/changes.See Changes files for more information.
SELECT *is avoided, SQL injection is prevented (using placeholders for values in statements), JS inline code is prevented especially for url redirects, and untrusted data interpolated into shell scripts/commands is validated against shell metacharacters.Testing
For unreleased bug fixes in a release candidate, one of:
Customers reported certificates deployed via custom SCEP proxy were silently failing to auto-renew, leaving devices with expired certs. Five compounding bugs were causing this:
1. Cert metadata was wiped on every reconcile re-render
BulkUpsertMDMManagedCertificatesunconditionally overwrotenot_valid_before,not_valid_after, andserialinON DUPLICATE KEY UPDATE. Since the SCEP-proxy render-time payload has those fields nil (cert details aren't known until the device completes the handshake and osquery reports), every renewal trigger wiped them. Once NULL, the renewal cron'sHAVING validity_period IS NOT NULLclause excluded the row — silently disabling future renewal attempts.Fixed by switching those columns to
COALESCE(VALUES(col), col)so a nil incoming value preserves the existing value. DigiCert's flow (which does set the fields) and osquery's separate UPDATE inupdateHostMDMManagedCertDetailsDBare unaffected.2. 1-hour challenge TTL was too short for offline devices
The challenge is generated at profile-render time but consumed when the device makes its SCEP request — which can be hours or days later (laptop asleep, on a plane, etc.). Devices that didn't pick up the InstallProfile push within the hour hit
challenge not found: sql: no rows in result setand the renewal failed.Bumped
OneTimeChallengeTTLfrom 1 hour to 7 days. Once consumed, the challenge is deleted immediately regardless of TTL.3. Renewal cron re-fired on in-flight deliveries
WHERE hp.status IS NOT NULLmatched'pending'and'verifying'too, so a host whose delivery was still in flight (e.g., offline laptop) would have its profile re-rendered with a fresh challenge every cron tick — generating orphan nano commands and challenge rows hourly. Pre-fix this was masked by bug 1; once the COALESCE preserves cert metadata, the loop becomes visible.Tightened the filter to
WHERE hp.status IN ('verified', 'failed')— settled states only.4. iOS/iPadOS managed-cert profiles short-circuited to verified before cert metadata synced
iOS/iPadOS profiles short-circuit
pending→verifieddirectly on MDM ack (noverifyingstep), since osquery isn't available to drive the standard verification cycle. That's correct for non-cert profiles, but for managed-cert profiles it created a window where the renewal cron sawstatus='verified'paired with stale cert metadata still in the renewal window — and the newIN ('verified', 'failed')filter from bug 3 kept matching, re-firing renewal each tick untilCertificateListingestion eventually caught up.Fixed by parking iOS/iPadOS managed-cert profiles at
'verifying'on MDM ack and flipping them to'verified'fromupdateHostMDMManagedCertDetailsDBonce fresh cert metadata arrives — i.e., reusing the existing state machine instead of inventing a parallel "renewal in flight" tracking column. TheEXISTS(SELECT 1 FROM host_mdm_managed_certificates ...)check is folded into the existing platform-detection query, so no extra round-trip. macOS is unaffected: the new flip is redundant withVerifyHostMDMProfilesbut idempotent.Trade-off worth flagging: if
CertificateListingestion never runs for an iOS managed-cert profile (broken cron, device offline indefinitely), the profile sits at'verifying'and the renewal cron's filter excludes it. In practice both run on the same Apple MDM cron loop — if one is broken, much else is too — but it's a sharper failure mode than letting renewals re-fire wastefully.5. Permanent-failure profiles loop hourly through the renewal cron
Once
'failed'was added to the cron's status filter (bug 3), there was no longer any circuit breaker for profiles that fail at render time for non-transient reasons — CA deleted from app config, IDP variables missing from host, premium license downgraded. Each cron tick (1h interval) the cron flips'failed'→ NULL, reconcile re-renders and immediately re-fails viafleet.MarkProfilesFailed, status returns to'failed', repeat. Pre-fix this was masked by bug 1 (metadata wipe acted as accidental circuit breaker); once metadata is preserved (bug 1 fix), the loop becomes real and produces a profile-render attempt + nano command per failed cert per hour.Added a
renewalFailedRetryBackoffconstant (24h) and gated the'failed'branch onhp.updated_at < DATE_SUB(NOW(), INTERVAL renewalFailedRetryBackoff SECOND). Transient SCEP-server outages still recover (within at most 24h, well under any cert validity window). Permanent failures still get retried daily (so a customer fixing the underlying issue eventually auto-recovers), but they don't churn nano commands hourly.'verified'rows in the renewal window are unaffected — they bypass the gate.Tests
testMDMManagedSCEPCertificates: three new sub-tests covering (a) cert-metadata preservation across reconcile re-renders, (b) in-flight-status skip behavior, (c) the permanent-failure backoff. Exercised against both NDES and Custom SCEP via the existing table-driven harness.testIOSManagedCertProfileStaysVerifying: verifies that on iOS, a managed-cert profile stays at'verifying'after MDM ack and only flips to'verified'onceUpdateHostCertificatesingests fresh cert metadata.challenges_test.gocoveringNewChallenge/ConsumeChallengelifecycle and TTL boundaries.TestCustomSCEPIntegration: updated the hardcoded 2-hour challenge backdate to usefleet.OneTimeChallengeTTLso it stays correct as the constant evolves.TestCustomSCEPRenewalPreservesCertMetadataend-to-end test: drives the full reconcile path (rather than calling the bare datastore method) so a future change to the render-time payload structure can't silently regress the COALESCE preservation.