Replace OnceCell::get_or_try_init with deadlock-safe non-blocking load_roots#205

Merged
lukevalenta merged 1 commit into main from lvalenta/onceLock-fix
Apr 13, 2026

Conversation

@lukevalenta
Contributor

The Workers runtime cancels any request that awaits a promise (OnceCell future) created by a different request context. Previously, a concurrent add-chain request arriving while another request was initializing ROOTS via get_or_try_init would be canceled with a 500.

Fix: load_roots now checks ROOTS.get() first (fast path), then builds the pool itself if not yet initialized, then calls ROOTS.set(). If another concurrent request races and sets first, the losing request discards its result and returns the value already in the cell. All concurrent requests do the work independently rather than waiting on each other.

Applied to both ct_worker and bootstrap_mtc_worker.
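
For readers who want to see the shape of the fix, here is a minimal sketch of the get() / build / set() pattern described above. It is not the PR's code: `CertPool` is a stand-in struct, `build_pool` is a hypothetical synchronous builder (the real one is async and fetches roots from KV), and the cell is a `std::sync::OnceLock`. The same shape should apply to any cell that exposes synchronous `get()` and `set()`.

```rust
use std::sync::OnceLock;

/// Stand-in for the real certificate pool type (hypothetical, for illustration).
#[derive(Debug)]
struct CertPool {
    roots: Vec<String>,
}

static ROOTS: OnceLock<CertPool> = OnceLock::new();

/// Hypothetical builder; assumed fallible, like the original get_or_try_init closure.
fn build_pool() -> Result<CertPool, String> {
    Ok(CertPool { roots: vec!["example-root".to_string()] })
}

fn load_roots() -> Result<&'static CertPool, String> {
    // Fast path: the cell was already filled by this or an earlier request.
    if let Some(pool) = ROOTS.get() {
        return Ok(pool);
    }
    // Slow path: build independently instead of awaiting another request's work.
    // On error the cell stays empty, so a later request can try again.
    let pool = build_pool()?;
    // If another request filled the cell first, `set` hands our pool back
    // as Err(pool) and it is dropped here.
    let _ = ROOTS.set(pool);
    Ok(ROOTS.get().expect("just set"))
}
```

Whichever request wins the race, every caller returns a reference to the same pool, and no caller holds a future that was created in another request's context.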

Contributor

@rozbb rozbb left a comment

One question: is that cancelling behavior documented anywhere? I didn’t know this was a thing to look out for

@lukevalenta
Contributor Author

One question: is that cancelling behavior documented anywhere? I didn’t know this was a thing to look out for

It should probably be documented here: https://developers.cloudflare.com/workers/observability/errors/#the-script-will-never-generate-a-response-errors. I'll follow up with the Workers team to confirm the expected behavior and update the docs.

@lukevalenta
Contributor Author

/bonk review

@ask-bonk ask-bonk Bot left a comment

The fix is sound. Replacing OnceCell::get_or_try_init with a manual get() + build + set() pattern correctly avoids the cross-request promise cancellation in the Workers runtime, and the integration test changes are well-considered.

Code review

load_roots in both workers -- The pattern is correct and idiomatic for this constraint:

  1. Fast path via ROOTS.get() returns immediately if already initialized.
  2. Each concurrent request independently builds a CertPool (including the KV fetch for CCADB roots).
  3. ROOTS.set(pool) is a non-blocking race -- the loser silently discards its work.
  4. ROOTS.get().expect("just set") is safe because some request must have succeeded at set() before this point.

The trade-off is clear: under cold-start contention, N concurrent requests may redundantly fetch from KV and parse PEM certs. This is strictly better than N-1 of them being killed with 500s.
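
The race in steps 2 to 4 can be illustrated with a small sketch. It uses OS threads as a stand-in for concurrent requests, which is an assumption made purely for demonstration (Workers requests are not threads), and the `VALUE`, `BUILDS`, `load`, and `race` names are hypothetical. The point is only that every caller ends up observing the same value, while the number of redundant builds can be anywhere from 1 to N.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;
use std::thread;

static VALUE: OnceLock<usize> = OnceLock::new();
static BUILDS: AtomicUsize = AtomicUsize::new(0);

/// Each caller brings its own candidate, then races to publish it.
fn load(candidate: usize) -> usize {
    if let Some(v) = VALUE.get() {
        return *v; // fast path: someone already published
    }
    BUILDS.fetch_add(1, Ordering::SeqCst); // count (possibly redundant) builds
    let _ = VALUE.set(candidate); // a losing candidate is discarded
    *VALUE.get().expect("just set")
}

/// Run `n` racing callers and return what each one observed.
fn race(n: usize) -> Vec<usize> {
    let handles: Vec<_> = (0..n).map(|i| thread::spawn(move || load(i))).collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Calling `race(8)` returns eight identical values. Which candidate wins, and how many builds ran, depends on scheduling.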

Integration test changes -- The get_roots_returns_valid_certs tests in both suites no longer call ensure_initialized(), which makes sense: get-roots is a stateless endpoint that doesn't depend on sequencer state, and the old ensure_initialized pre-call was specifically working around the get_or_try_init deadlock that this PR eliminates. The static_ct_api ensure_initialized also drops the explicit get-roots readiness loop in favor of a simpler get_log_v3_json retry, which is the right simplification -- the metadata endpoint doesn't touch ROOTS and is a more targeted liveness check.

One minor observation: ROOTS is still typed as tokio::sync::OnceCell even though the code now only uses get() and set() (both of which are synchronous on tokio::sync::OnceCell). A std::sync::OnceLock would work equivalently and make it explicit that no async waiting happens. Not blocking on this -- tokio::sync::OnceCell::get()/set() don't involve futures, so it's functionally identical. Just a readability nicety to consider.
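
As a small illustration of the synchronous API this observation refers to, the sketch below uses `std::sync::OnceLock` directly. The `publish` helper is hypothetical. `OnceLock::set` returns `Err(value)` when the cell is already full, and neither `set` nor `get` produces a future to await.

```rust
use std::sync::OnceLock;

/// Try to publish `value`; report whether this caller won and what the cell holds.
fn publish(cell: &OnceLock<String>, value: &str) -> (bool, String) {
    // `set` is a plain synchronous call: Ok(()) if we filled the cell,
    // Err(value) if it was already full.
    let won = cell.set(value.to_string()).is_ok();
    (won, cell.get().expect("set above or earlier").clone())
}
```

The first call to `publish` on an empty cell wins and stores its value. Any later call loses and sees the first value unchanged.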

Clippy and tests pass cleanly.

@ask-bonk

ask-bonk Bot commented Apr 11, 2026

I've reviewed and approved PR #205. Here's a summary of the review:

Verdict: Approved

The fix correctly replaces OnceCell::get_or_try_init with a manual get() / build / set() pattern that avoids the Cloudflare Workers cross-request promise cancellation issue. Key points:

  • The pattern is safe: ROOTS.get().expect("just set") after ROOTS.set(pool) cannot panic, because either this request's set() succeeded or another concurrent request's set() already did.
  • The trade-off (redundant KV fetches under cold-start contention) is strictly better than 500 errors from cancelled requests.
  • Integration test simplifications are well-motivated -- removing the ensure_initialized call from get_roots_returns_valid_certs is correct since that workaround existed specifically for the deadlock this PR fixes.
  • Clippy passes with zero warnings and all unit tests pass.

I left one minor observation that ROOTS could be changed from tokio::sync::OnceCell to std::sync::OnceLock since only the synchronous get()/set() methods are used now, but noted that it does not block approval.

@lukevalenta lukevalenta force-pushed the lvalenta/onceLock-fix branch from fd290f7 to 7a4b13c Compare April 13, 2026 13:14
@lukevalenta
Contributor Author

/bonk review again

@ask-bonk ask-bonk Bot left a comment

Review

The core fix is correct and well-motivated. Replacing tokio::sync::OnceCell::get_or_try_init with std::sync::OnceLock + manual get()/build/set() eliminates the cross-request promise cancellation. The trade-off (redundant KV fetches under cold-start contention) is clearly the right one.

A few items:

Doc comment stutter in static_ct_api.rs

Lines 54-55 have a copy-paste artifact where the first sentence was partially edited:

/// Shared once-per-binary initialization: ensures the worker is fully live and
/// Ensures the worker is fully live and has sequenced at least one entry

Should be just:

/// Shared once-per-binary initialization: ensures the worker is fully live and
/// has sequenced at least one entry

CI changes are unrelated but look good

Splitting the single build job into lint/build/test that run in parallel, and separating pedantic clippy into continue-on-error: true, is a nice improvement. The cargo-machete step and new cargo-machete ignored annotations in tlog_tiles_wasm/Cargo.toml and integration_tests/Cargo.toml close the loop on unused-dependency detection.

One note: the old CI had cargo clippy -- -D warnings -W clippy::pedantic as a single step (pedantic warnings were non-fatal because they were -W not -D). The new split is cleaner since it separates "required" from "advisory" explicitly with continue-on-error.

console_error_panic_hook removal from generic_log_worker

This appears in the Cargo.lock diff but I don't see it in the generic_log_worker/Cargo.toml diff shown -- the lock diff shows it being removed. Worth confirming this was intentional (i.e., the dep was truly unused) and not an artifact of a different change.

Agree with bonk's OnceLock observation

Since only get() and set() are used now, std::sync::OnceLock (which is what the PR actually uses) is the right choice over tokio::sync::OnceCell. The PR already made this switch -- good.

Overall this is a clean, well-scoped fix. The only actionable item is the doc comment stutter on lines 54-55 of static_ct_api.rs.

Comment thread crates/integration_tests/tests/static_ct_api.rs Outdated
@ask-bonk

ask-bonk Bot commented Apr 13, 2026

Both the general review comment and the inline suggestion were posted successfully. Here's a summary of my review:

PR #205 Review Summary:

The fix is sound. It correctly replaces tokio::sync::OnceCell::get_or_try_init with std::sync::OnceLock + manual get()/build/set() to avoid the Cloudflare Workers cross-request promise cancellation that was causing 500 errors on concurrent cold-start requests.

I posted:

  1. A general review comment covering:

    • Confirmation the core fix pattern is correct
    • A doc comment stutter bug in static_ct_api.rs lines 54-55 (the only actionable item)
    • Noting the CI split into lint/build/test is a nice unrelated improvement
    • Flagging the console_error_panic_hook removal from generic_log_worker Cargo.lock for confirmation
    • Agreement with the OnceLock choice (already done correctly in the PR)
  2. An inline suggestion on static_ct_api.rs:54-56 with a concrete fix for the doc comment stutter where "Shared once-per-binary initialization: ensures the worker is fully live and" is followed by "Ensures the worker is fully live and has sequenced..." — a copy-paste leftover.

@lukevalenta
Contributor Author

One question: is that cancelling behavior documented anywhere? I didn’t know this was a thing to look out for

It should probably be documented here: https://developers.cloudflare.com/workers/observability/errors/#the-script-will-never-generate-a-response-errors. I'll follow up with the Workers team to confirm the expected behavior and update the docs.

@rozbb I submitted this PR to document the issue (pending review from the Workers team): cloudflare/cloudflare-docs#29797

@lukevalenta lukevalenta force-pushed the lvalenta/onceLock-fix branch from 7a4b13c to 5136019 Compare April 13, 2026 14:15
@lukevalenta lukevalenta force-pushed the lvalenta/rustcrypto-ecosystem-update branch from 5e8ffb5 to b8787c2 Compare April 13, 2026 15:10
@lukevalenta lukevalenta force-pushed the lvalenta/onceLock-fix branch from 5136019 to 9336b0f Compare April 13, 2026 18:07
@lukevalenta lukevalenta changed the base branch from lvalenta/rustcrypto-ecosystem-update to main April 13, 2026 18:08
…d_roots

The Workers runtime cancels any request that awaits a promise (OnceCell
future) created by a different request context. Previously, a concurrent
add-chain request arriving while another request was initializing ROOTS
via get_or_try_init would be canceled with a 500.

Fix: load_roots now checks ROOTS.get() first (fast path), then builds the
pool itself if not yet initialized, then calls ROOTS.set(). If another
concurrent request races and sets first, the losing request discards its
result and returns the value already in the cell. All concurrent requests
do the work independently rather than waiting on each other.

Applied to both ct_worker and bootstrap_mtc_worker.

Also switch ROOTS from tokio::sync::OnceCell to std::sync::OnceLock —
the code only uses get() and set() (both synchronous), so std::sync::OnceLock
is more appropriate and makes the non-async nature explicit.

Remove tokio from [dependencies] in ct_worker and bootstrap_mtc_worker —
it was unused after removing the OnceCell import.
@lukevalenta lukevalenta force-pushed the lvalenta/onceLock-fix branch from 9336b0f to a019552 Compare April 13, 2026 18:11
@lukevalenta
Contributor Author

/bonk review

@ask-bonk ask-bonk Bot left a comment

The fix is correct and well-scoped. Replacing tokio::sync::OnceCell::get_or_try_init with std::sync::OnceLock + manual get()/set() eliminates the cross-request promise cancellation that was causing 500s on concurrent cold-start requests.

Core pattern is sound:

  • ROOTS.get() fast path avoids redundant work after initialization.
  • Each concurrent request independently builds the CertPool rather than awaiting a shared future, which is the correct approach given the Workers runtime constraint.
  • let _ = ROOTS.set(pool) followed by ROOTS.get().expect("just set") is safe: the current request either won the set() race or another request already set it — either way, get() is guaranteed to return Some.
  • The trade-off of redundant KV fetches under cold-start contention is clearly preferable to killing N-1 concurrent requests with 500s.

Dependency cleanup is correct:

The tokio direct dependency is properly removed from both ct_worker/Cargo.toml and bootstrap_mtc_worker/Cargo.toml since the only usage was tokio::sync::OnceCell. The Cargo.lock diff only removes tokio from those two crate dependency lists — nothing else was inadvertently dropped.

Integration test changes are well-motivated:

  • get_roots_returns_valid_certs no longer calls ensure_initialized() in either test suite. This is correct: get-roots is stateless and the old ensure_initialized call was specifically a workaround for the get_or_try_init deadlock this PR fixes.
  • In static_ct_api.rs, the ensure_initialized warmup loop now retries get_log_v3_json() instead of get_roots() as its liveness probe. This is a better choice: get_log_v3_json is a more targeted reachability check that doesn't touch ROOTS, and is the endpoint actually needed (for meta) in the subsequent checkpoint verification.
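
A liveness-probe retry loop of the kind described in the last bullet could look roughly like the sketch below. This is an assumption-laden illustration, not the test suite's code: `wait_until_live` is a hypothetical helper, the probe is passed in as a closure (standing in for a call such as get_log_v3_json()), and it is synchronous for simplicity, whereas the real integration tests are likely async.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Call `probe` at least once and at most `attempts` times, sleeping `delay`
/// between failures. Returns the first success, or the last error otherwise.
fn wait_until_live<T, E>(
    mut probe: impl FnMut() -> Result<T, E>,
    attempts: usize,
    delay: Duration,
) -> Result<T, E> {
    let mut last = probe();
    for _ in 1..attempts {
        if last.is_ok() {
            break; // endpoint answered; stop retrying
        }
        sleep(delay);
        last = probe();
    }
    last
}
```

Returning the probe's successful value lets the caller reuse it, which matches the note above that the metadata is needed for the subsequent checkpoint verification.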

Note on a previous review's suggestions: The doc comment at static_ct_api.rs:54-55 reads correctly in the current code ("ensures the worker is fully live and / has sequenced at least one entry") — there is no stutter to fix. The earlier review comment about console_error_panic_hook removal was also inaccurate; the Cargo.lock diff only removes tokio from the two worker crates.

Clean fix. LGTM.

@ask-bonk

ask-bonk Bot commented Apr 13, 2026

I've reviewed and approved PR #205. Here's a summary:

Approved — The fix correctly replaces tokio::sync::OnceCell::get_or_try_init with std::sync::OnceLock + manual get()/set() to avoid the Cloudflare Workers cross-request promise cancellation that was causing 500 errors on concurrent cold starts.

Key points from my review:

  • The get() → build → set() → get().expect() pattern is safe and correct for the Workers runtime constraint
  • The trade-off of redundant KV fetches under contention is clearly better than killing requests with 500s
  • tokio dependency properly removed from both worker crates since only tokio::sync::OnceCell was used
  • Integration test simplifications are well-motivated — removing ensure_initialized() from get_roots_returns_valid_certs and switching the liveness probe from get_roots() to get_log_v3_json() are both correct
  • Corrected two inaccurate observations from prior reviews (doc comment stutter and console_error_panic_hook removal were not actual issues)

@lukevalenta
Contributor Author

Aside from some minor fixes/rebasing, nothing substantial has changed since @rozbb 's review, so I'll go ahead and merge.

@lukevalenta lukevalenta merged commit ddc4ec7 into main Apr 13, 2026
6 checks passed
@lukevalenta lukevalenta deleted the lvalenta/onceLock-fix branch April 13, 2026 18:20