Skip to content

Fix SCEP autorenew failing for offline hosts#44250

Merged
dantecatalfamo merged 13 commits intomainfrom
44111-scep-autorenew-fail
Apr 29, 2026
Merged

Fix SCEP autorenew failing for offline hosts#44250
dantecatalfamo merged 13 commits intomainfrom
44111-scep-autorenew-fail

Conversation

@dantecatalfamo
Copy link
Copy Markdown
Member

@dantecatalfamo dantecatalfamo commented Apr 27, 2026

Related issue: Resolves #44111

  • Changes file added for user-visible changes in changes/, orbit/changes/ or ee/fleetd-chrome/changes.
    See Changes files for more information.
  • Input data is properly validated, SELECT * is avoided, SQL injection is prevented (using placeholders for values in statements), JS inline code is prevented especially for url redirects, and untrusted data interpolated into shell scripts/commands is validated against shell metacharacters.
  • Timeouts are implemented and retries are limited to avoid infinite loops

Testing

  • Added/updated automated tests
  • QA'd all new/changed functionality manually

For unreleased bug fixes in a release candidate, one of:

  • Confirmed that the fix is not expected to adversely impact load test results

Customers reported certificates deployed via custom SCEP proxy were silently failing to auto-renew, leaving devices with expired certs. Five compounding bugs were causing this:

1. Cert metadata was wiped on every reconcile re-render

BulkUpsertMDMManagedCertificates unconditionally overwrote not_valid_before, not_valid_after, and serial in ON DUPLICATE KEY UPDATE. Since the SCEP-proxy render-time payload has those fields nil (cert details aren't known until the device completes the handshake and osquery reports), every renewal trigger wiped them. Once NULL, the renewal cron's HAVING validity_period IS NOT NULL clause excluded the row — silently disabling future renewal attempts.

Fixed by switching those columns to COALESCE(VALUES(col), col) so a nil incoming value preserves the existing value. DigiCert's flow (which does set the fields) and osquery's separate UPDATE in updateHostMDMManagedCertDetailsDB are unaffected.

2. 1-hour challenge TTL was too short for offline devices

The challenge is generated at profile-render time but consumed when the device makes its SCEP request — which can be hours or days later (laptop asleep, on a plane, etc.). Devices that didn't pick up the InstallProfile push within the hour hit challenge not found: sql: no rows in result set and the renewal failed.

Bumped OneTimeChallengeTTL from 1 hour to 7 days. Once consumed, the challenge is deleted immediately regardless of TTL.

3. Renewal cron re-fired on in-flight deliveries

WHERE hp.status IS NOT NULL matched 'pending' and 'verifying' too, so a host whose delivery was still in flight (e.g., offline laptop) would have its profile re-rendered with a fresh challenge every cron tick — generating orphan nano commands and challenge rows hourly. Pre-fix this was masked by bug 1; once the COALESCE preserves cert metadata, the loop becomes visible.

Tightened the filter to WHERE hp.status IN ('verified', 'failed') — settled states only.

4. iOS/iPadOS managed-cert profiles short-circuited to verified before cert metadata synced

iOS/iPadOS profiles short-circuit pendingverified directly on MDM ack (no verifying step), since osquery isn't available to drive the standard verification cycle. That's correct for non-cert profiles, but for managed-cert profiles it created a window where the renewal cron saw status='verified' paired with stale cert metadata still in the renewal window — and the new IN ('verified', 'failed') filter from bug 3 kept matching, re-firing renewal each tick until CertificateList ingestion eventually caught up.

Fixed by parking iOS/iPadOS managed-cert profiles at 'verifying' on MDM ack and flipping them to 'verified' from updateHostMDMManagedCertDetailsDB once fresh cert metadata arrives — i.e., reusing the existing state machine instead of inventing a parallel "renewal in flight" tracking column. The EXISTS(SELECT 1 FROM host_mdm_managed_certificates ...) check is folded into the existing platform-detection query, so no extra round-trip. macOS is unaffected: the new flip is redundant with VerifyHostMDMProfiles but idempotent.

Trade-off worth flagging: if CertificateList ingestion never runs for an iOS managed-cert profile (broken cron, device offline indefinitely), the profile sits at 'verifying' and the renewal cron's filter excludes it. In practice both run on the same Apple MDM cron loop — if one is broken, much else is too — but it's a sharper failure mode than letting renewals re-fire wastefully.

5. Permanent-failure profiles loop hourly through the renewal cron

Once 'failed' was added to the cron's status filter (bug 3), there was no longer any circuit breaker for profiles that fail at render time for non-transient reasons — CA deleted from app config, IDP variables missing from host, premium license downgraded. Each cron tick (1h interval) the cron flips 'failed' → NULL, reconcile re-renders and immediately re-fails via fleet.MarkProfilesFailed, status returns to 'failed', repeat. Pre-fix this was masked by bug 1 (metadata wipe acted as accidental circuit breaker); once metadata is preserved (bug 1 fix), the loop becomes real and produces a profile-render attempt + nano command per failed cert per hour.

Added a renewalFailedRetryBackoff constant (24h) and gated the 'failed' branch on hp.updated_at < DATE_SUB(NOW(), INTERVAL renewalFailedRetryBackoff SECOND). Transient SCEP-server outages still recover (within at most 24h, well under any cert validity window). Permanent failures still get retried daily (so a customer fixing the underlying issue eventually auto-recovers), but they don't churn nano commands hourly. 'verified' rows in the renewal window are unaffected — they bypass the gate.

Tests

  • testMDMManagedSCEPCertificates: three new sub-tests covering (a) cert-metadata preservation across reconcile re-renders, (b) in-flight-status skip behavior, (c) the permanent-failure backoff. Exercised against both NDES and Custom SCEP via the existing table-driven harness.
  • New testIOSManagedCertProfileStaysVerifying: verifies that on iOS, a managed-cert profile stays at 'verifying' after MDM ack and only flips to 'verified' once UpdateHostCertificates ingests fresh cert metadata.
  • New challenges_test.go covering NewChallenge/ConsumeChallenge lifecycle and TTL boundaries.
  • TestCustomSCEPIntegration: updated the hardcoded 2-hour challenge backdate to use fleet.OneTimeChallengeTTL so it stays correct as the constant evolves.
  • New TestCustomSCEPRenewalPreservesCertMetadata end-to-end test: drives the full reconcile path (rather than calling the bare datastore method) so a future change to the render-time payload structure can't silently regress the COALESCE preservation.

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 27, 2026

Codecov Report

❌ Patch coverage is 94.11765% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.77%. Comparing base (62b60fe) to head (396f27c).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
server/datastore/mysql/host_certificates.go 81.81% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #44250      +/-   ##
==========================================
+ Coverage   66.76%   66.77%   +0.01%     
==========================================
  Files        2633     2633              
  Lines      211751   211709      -42     
  Branches     9506     9365     -141     
==========================================
+ Hits       141370   141376       +6     
+ Misses      57539    57493      -46     
+ Partials    12842    12840       -2     
Flag Coverage Δ
backend 68.54% <94.11%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

challenge_retrieved_at = VALUES(challenge_retrieved_at),
not_valid_before = VALUES(not_valid_before),
not_valid_after = VALUES(not_valid_after),
not_valid_before = COALESCE(VALUES(not_valid_before), not_valid_before),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this cause an infinite loop where the renewal cron keeps renewing forever?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they keep failing infinitely on renewal, possibly

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current behaviour is that if there's a failure it silently forgets to renew forever

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the query is:

	SELECT
		hmmc.host_uuid,
		hmmc.profile_uuid,
		hmmc.not_valid_after,
		DATEDIFF(hmmc.not_valid_after, hmmc.not_valid_before) AS validity_period
	FROM
		host_mdm_managed_certificates hmmc
	INNER JOIN
		`+table+` hp
		ON hmmc.host_uuid = hp.host_uuid AND hmmc.profile_uuid = hp.profile_uuid
	WHERE
		hmmc.type = ? AND hp.status IS NOT NULL AND hp.operation_type = ?
	HAVING
		validity_period IS NOT NULL AND
		((validity_period > 30 AND not_valid_after < DATE_ADD(NOW(), INTERVAL 30 DAY)) OR
		(validity_period <= 30 AND not_valid_after < DATE_ADD(NOW(), INTERVAL validity_period/2 DAY)))
		```
		
Wouldn't an offline host that we send a renewal to get the cert resent, then it would go from status NULL(set by reconciler)->Pending, then it would get resent again hourly in perpetuity? Right now the null check stops the resends from running forever

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I'll see if I can narrow the renew loop to not try on intermediate statuses

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed the renewal check to look for verified/failed instead of NOT NULL in the status field. It shouldn't re-trigger for in-flight SCEP profiles anymore

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might still be a bit of a race here on iOS/iPadOS where, since we don't have osquery the profiles go straight to verified on the ack but it could be up to an hour before they check in again and the renewal cron might run in that interval and re-renew the cert?

There is some code that at least for NDES and maybe others triggers a repush of the cert profile when the challenge is expired. Do we know if that's triggering here?

Copy link
Copy Markdown
Member Author

@dantecatalfamo dantecatalfamo Apr 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two possible solutions for that problem. I'm not super familiar with the iOS MDM cycle, so take these suggestions with a grain of salt. I could either add a column to keep track of when the last renewal was on the instantly 'verified' profile, or I could detect if the profile contains a SCEP profile and only set it it to 'verifying' until the certificate list is renewed. Do either of those make more sense?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got claude to implement both solutions and the second seemed far less invasive. Let me know what you think

@dantecatalfamo dantecatalfamo marked this pull request as ready for review April 28, 2026 19:27
@dantecatalfamo dantecatalfamo requested a review from a team as a code owner April 28, 2026 19:27
Copilot AI review requested due to automatic review settings April 28, 2026 19:27
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

SCEP renewal behavior changed to retry after an initial failure and to only reset renewal status for profiles in settled delivery states (verified, failed), excluding in-flight states. One-time SCEP challenge TTL increased from 1 hour to 7 days. Certificate upserts now preserve existing SCEP metadata when incoming not_valid_before, not_valid_after, or serial are NULL by using COALESCE(VALUES(col), col). Tests were added/updated to cover one-time challenge TTL/consumption semantics and to ensure renewal preserves SCEP cert metadata and respects delivery-status handling.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately and concisely summarizes the main fix: SCEP autorenew failures for offline hosts.
Linked Issues check ✅ Passed All three compounding issues in #44111 are directly addressed: cert metadata preservation with COALESCE, TTL extension from 1 hour to 7 days, and renewal cron filtering to only settled delivery states.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the three identified SCEP autorenew bugs. No unrelated modifications detected.
Description check ✅ Passed PR description is comprehensive and follows the required template structure with all major sections completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch 44111-scep-autorenew-fail

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@changes/44111-scep-autorenew-fail`:
- Line 1: The release note/commit message in changes/44111-scep-autorenew-fail
contains a typo: change "retying" to "retrying" so the line reads "Fixed SCEP
renewals not retrying after initial failure"; update that exact string in the
file (changes/44111-scep-autorenew-fail) so any references to the issue use the
correct spelling.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2ae2f1c1-ec3e-4cc8-b806-b29fed50fbb2

📥 Commits

Reviewing files that changed from the base of the PR and between a97a223 and 8d2db50.

📒 Files selected for processing (5)
  • changes/44111-scep-autorenew-fail
  • server/datastore/mysql/apple_mdm_test.go
  • server/datastore/mysql/mdm.go
  • server/fleet/mdm.go
  • server/service/integration_mdm_test.go

Comment thread changes/44111-scep-autorenew-fail Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes SCEP certificate auto-renewal failures for offline hosts by preventing managed certificate metadata from being wiped during reconcile, extending the one-time SCEP challenge TTL, and ensuring the renewal cron doesn’t retrigger while a profile delivery is still in-flight.

Changes:

  • Preserve existing cert metadata (not_valid_before, not_valid_after, serial) during BulkUpsertMDMManagedCertificates when incoming values are NULL.
  • Increase fleet.OneTimeChallengeTTL from 1 hour to 7 days to accommodate offline devices.
  • Restrict renewal cron to only retrigger settled delivery statuses (verified, failed) to avoid generating hourly orphan challenges/commands for pending deliveries.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
server/service/integration_mdm_test.go Updates Custom SCEP integration test to backdate challenge expiration relative to fleet.OneTimeChallengeTTL.
server/fleet/mdm.go Extends one-time challenge TTL to 7 days with clarifying comments.
server/datastore/mysql/mdm.go Uses COALESCE in upsert to preserve cert metadata; tightens renewal cron status filtering to settled states.
server/datastore/mysql/apple_mdm_test.go Adds regression coverage for in-flight status skip and cert-metadata preservation behavior.
changes/44111-scep-autorenew-fail Adds release note entries for the bug fix.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread server/fleet/mdm.go
Comment thread changes/44111-scep-autorenew-fail Outdated
Comment thread changes/44111-scep-autorenew-fail Outdated
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
server/datastore/mysql/challenges_test.go (1)

75-77: Assert that backdating UPDATEs affect exactly one row.

Right now these UPDATEs return only execution error. Adding a rows-affected assertion will make TTL tests fail fast if fixture setup silently misses the target row.

Proposed hardening
 ExecAdhocSQL(t, ds, func(q sqlx.ExtContext) error {
-	_, err := q.ExecContext(ctx, `UPDATE challenges SET created_at = ? WHERE challenge = ?`, backdated, challenge)
-	return err
+	res, err := q.ExecContext(ctx, `UPDATE challenges SET created_at = ? WHERE challenge = ?`, backdated, challenge)
+	if err != nil {
+		return err
+	}
+	affected, err := res.RowsAffected()
+	if err != nil {
+		return err
+	}
+	require.EqualValues(t, 1, affected)
+	return nil
 })

Apply the same pattern to the UPDATE blocks at Line 95 and Line 116.

Also applies to: 94-97, 115-117

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/datastore/mysql/challenges_test.go` around lines 75 - 77, The UPDATEs
inside the ExecAdhocSQL calls currently only return execution errors and don't
verify rows affected; change each anonymous function passed to ExecAdhocSQL (the
func(q sqlx.ExtContext) error used with q.ExecContext to run `UPDATE challenges
SET created_at = ? WHERE challenge = ?`) to check the result.RowsAffected() and
return an error if it is not exactly 1 so tests fail fast; apply the same
rows-affected assertion pattern to the other two identical UPDATE blocks used in
the test (the other ExecAdhocSQL anonymous functions that perform similar
`UPDATE challenges ... WHERE challenge = ?` calls).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/datastore/mysql/challenges_test.go`:
- Around line 75-77: The UPDATEs inside the ExecAdhocSQL calls currently only
return execution errors and don't verify rows affected; change each anonymous
function passed to ExecAdhocSQL (the func(q sqlx.ExtContext) error used with
q.ExecContext to run `UPDATE challenges SET created_at = ? WHERE challenge = ?`)
to check the result.RowsAffected() and return an error if it is not exactly 1 so
tests fail fast; apply the same rows-affected assertion pattern to the other two
identical UPDATE blocks used in the test (the other ExecAdhocSQL anonymous
functions that perform similar `UPDATE challenges ... WHERE challenge = ?`
calls).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 60f3e937-1b2b-4532-9c9b-e4cbc0d5fefc

📥 Commits

Reviewing files that changed from the base of the PR and between d1e616c and fee9b0b.

📒 Files selected for processing (2)
  • server/datastore/mysql/challenges_test.go
  • server/service/integration_mdm_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • server/service/integration_mdm_test.go

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
server/datastore/mysql/challenges_test.go (1)

75-78: Assert fixture backdating actually updated one row.

The setup updates currently only check execution errors. Adding a RowsAffected == 1 assertion would make TTL-boundary tests more robust and avoid silent false positives if setup ever stops matching a row.

Suggested hardening
 ExecAdhocSQL(t, ds, func(q sqlx.ExtContext) error {
-	_, err := q.ExecContext(ctx, `UPDATE challenges SET created_at = ? WHERE challenge = ?`, backdated, challenge)
+	res, err := q.ExecContext(ctx, `UPDATE challenges SET created_at = ? WHERE challenge = ?`, backdated, challenge)
+	if err != nil {
+		return err
+	}
+	n, err := res.RowsAffected()
+	if err != nil {
+		return err
+	}
+	require.EqualValues(t, 1, n)
 	return err
 })

Also applies to: 94-97, 115-118

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@server/datastore/mysql/challenges_test.go` around lines 75 - 78, The test
setup that calls ExecAdhocSQL and runs q.ExecContext to backdate the challenge
should assert that the update actually affected one row: after executing
q.ExecContext (the UPDATE challenges SET created_at = ? WHERE challenge = ? call
used inside ExecAdhocSQL), capture the result, call RowsAffected(), and require
it equals 1 (fail the test if not). Apply the same hardening to the other
similar setup blocks that use q.ExecContext to update created_at (the other two
occurrences mentioned) so each update asserts RowsAffected() == 1.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@server/datastore/mysql/challenges_test.go`:
- Around line 75-78: The test setup that calls ExecAdhocSQL and runs
q.ExecContext to backdate the challenge should assert that the update actually
affected one row: after executing q.ExecContext (the UPDATE challenges SET
created_at = ? WHERE challenge = ? call used inside ExecAdhocSQL), capture the
result, call RowsAffected(), and require it equals 1 (fail the test if not).
Apply the same hardening to the other similar setup blocks that use
q.ExecContext to update created_at (the other two occurrences mentioned) so each
update asserts RowsAffected() == 1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 8d3a3f76-527d-4fe2-82ac-55e0c69fe123

📥 Commits

Reviewing files that changed from the base of the PR and between fee9b0b and a428a76.

📒 Files selected for processing (1)
  • server/datastore/mysql/challenges_test.go

iOS/iPadOS profiles short-circuit pending -> verified on MDM ack since
osquery isn't available. For managed-cert profiles, that left a window
where the renewal cron saw 'verified' with stale cert metadata still in
the renewal window and re-fired renewal each tick until CertificateList
caught up. Skip the short-circuit when a managed-cert row exists; flip
verifying -> verified from updateHostMDMManagedCertDetailsDB instead.
Copy link
Copy Markdown
Member

@JordanMontgomery JordanMontgomery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is probably OK. It feels a little bit racy but I can't quite think of a better way to fix it unless there's something we're missing here

Copy link
Copy Markdown
Member

@JordanMontgomery JordanMontgomery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I think there's one more potential bug here:

If a previously-delivered cert profile is in a failed state like "failed because you deleted your CA" or "failed because the host no longer has IDP variables" I think the cron will keep picking those up now right? Before it wouldn't because the first failure would blank the values I think. Will this cause the cron to get clogged if there are e.g. > 100 of these?

THis might be worth adding a testcase for, I bet the cron will keep marking them for resend as there's no longer a circuit breaker

@dantecatalfamo
Copy link
Copy Markdown
Member Author

@JordanMontgomery Added a 24h backoff if the certificate is broken

@dantecatalfamo dantecatalfamo merged commit 96569a9 into main Apr 29, 2026
51 checks passed
@dantecatalfamo dantecatalfamo deleted the 44111-scep-autorenew-fail branch April 29, 2026 21:14
mostlikelee added a commit that referenced this pull request Apr 30, 2026
Reapplies the three independent improvements from #44250 (reverted via #44535)
and adds an ingest-side backfill that catches the actual silent-fail mechanism
(missed toInsert matcher) without breaking the natural in-flight
synchronization between reconcile and the renewal cron.

- Bump OneTimeChallengeTTL 1h → 7d so renewals don't fail with "challenge not
  found" for offline devices that pick up the InstallProfile push days later.
- Restrict the renewal cron to settled delivery states ('verified', 'failed')
  to avoid re-firing renewal while a previous delivery is still in flight.
- Gate the new 'failed' branch on a 24h backoff so permanent render-time
  failures (CA deleted, missing IDP variables) don't loop hourly.
- Add backfillHostMDMManagedCertsFromHostCertsDB: when the toInsert matcher
  in UpdateHostCertificates misses a renewed cert (replica lag, transaction
  race, verified-without-actual-renewal), look up a matching cert in
  host_certificates by the 'fleet-<profile_uuid>' substring and populate
  hmmc. Gated by a 4h grace on hmmc.updated_at so it doesn't clobber the
  in-flight blank-out, and a monotonic-forward predicate so it's idempotent.

Does NOT reintroduce the COALESCE-preserve in BulkUpsertMDMManagedCertificates
or the iOS-only park-at-'verifying' carve-out from #44250 — those broke the
natural cron synchronization gate (reconcile NULLs hmmc → cron's HAVING IS
NOT NULL excludes the row until ingest repopulates).

Resolves #44111
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Custom SCEP proxy certificates intermittently failing to auto-renew

4 participants