Replace OnceCell::get_or_try_init with deadlock-safe non-blocking load_roots#205
Conversation
rozbb
left a comment
There was a problem hiding this comment.
One question: is that cancelling behavior documented anywhere? I didn’t know this was a thing to look out for
It should probably be documented here: https://developers.cloudflare.com/workers/observability/errors/#the-script-will-never-generate-a-response-errors. I'll follow up with the Workers team to confirm the expected behavior and update the docs. |
|
/bonk review |
There was a problem hiding this comment.
The fix is sound. Replacing OnceCell::get_or_try_init with a manual get() + build + set() pattern correctly avoids the cross-request promise cancellation in the Workers runtime, and the integration test changes are well-considered.
Code review
load_roots in both workers -- The pattern is correct and idiomatic for this constraint:
- Fast path via
ROOTS.get()returns immediately if already initialized. - Each concurrent request independently builds a
CertPool(including the KV fetch for CCADB roots). ROOTS.set(pool)is a non-blocking race -- the loser silently discards its work.ROOTS.get().expect("just set")is safe because some request must have succeeded atset()before this point.
The trade-off is clear: under cold-start contention, N concurrent requests may redundantly fetch from KV and parse PEM certs. This is strictly better than N-1 of them being killed with 500s.
Integration test changes -- The get_roots_returns_valid_certs tests in both suites no longer call ensure_initialized(), which makes sense: get-roots is a stateless endpoint that doesn't depend on sequencer state, and the old ensure_initialized pre-call was specifically working around the get_or_try_init deadlock that this PR eliminates. The static_ct_api ensure_initialized also drops the explicit get-roots readiness loop in favor of a simpler get_log_v3_json retry, which is the right simplification -- the metadata endpoint doesn't touch ROOTS and is a more targeted liveness check.
One minor observation: ROOTS is still typed as tokio::sync::OnceCell even though the code now only uses get() and set() (both of which are synchronous on tokio::sync::OnceCell). A std::sync::OnceLock would work equivalently and make it explicit that no async waiting happens. Not blocking on this -- tokio::sync::OnceCell::get()/set() don't involve futures, so it's functionally identical. Just a readability nicety to consider.
Clippy and tests pass cleanly.
|
I've reviewed and approved PR #205. Here's a summary of the review: Verdict: Approved The fix correctly replaces
I left one minor observation that |
fd290f7 to
7a4b13c
Compare
|
/bonk review again |
There was a problem hiding this comment.
Review
The core fix is correct and well-motivated. Replacing tokio::sync::OnceCell::get_or_try_init with std::sync::OnceLock + manual get()/build/set() eliminates the cross-request promise cancellation. The trade-off (redundant KV fetches under cold-start contention) is clearly the right one.
A few items:
Doc comment stutter in static_ct_api.rs
Lines 54-55 have a copy-paste artifact where the first sentence was partially edited:
/// Shared once-per-binary initialization: ensures the worker is fully live and
/// Ensures the worker is fully live and has sequenced at least one entry
Should be just:
/// Shared once-per-binary initialization: ensures the worker is fully live and
/// has sequenced at least one entry
CI changes are unrelated but look good
Splitting the single build job into lint/build/test that run in parallel, and separating pedantic clippy into continue-on-error: true, is a nice improvement. The cargo-machete step and new cargo-machete ignored annotations in tlog_tiles_wasm/Cargo.toml and integration_tests/Cargo.toml close the loop on unused-dependency detection.
One note: the old CI had cargo clippy -- -D warnings -W clippy::pedantic as a single step (pedantic warnings were non-fatal because they were -W not -D). The new split is cleaner since it separates "required" from "advisory" explicitly with continue-on-error.
console_error_panic_hook removal from generic_log_worker
This appears in the Cargo.lock diff but I don't see it in the generic_log_worker/Cargo.toml diff shown -- the lock diff shows it being removed. Worth confirming this was intentional (i.e., the dep was truly unused) and not an artifact of a different change.
Agree with bonk's OnceLock observation
Since only get() and set() are used now, std::sync::OnceLock (which is what the PR actually uses) is the right choice over tokio::sync::OnceCell. The PR already made this switch -- good.
Overall this is a clean, well-scoped fix. The only actionable item is the doc comment stutter on lines 54-55 of static_ct_api.rs.
|
Both the general review comment and the inline suggestion were posted successfully. Here's a summary of my review: PR #205 Review Summary: The fix is sound. It correctly replaces I posted:
|
@rozbb I submitted this PR to document the issue (pending review from the Workers team): cloudflare/cloudflare-docs#29797 |
7a4b13c to
5136019
Compare
5e8ffb5 to
b8787c2
Compare
5136019 to
9336b0f
Compare
…d_roots The Workers runtime cancels any request that awaits a promise (OnceCell future) created by a different request context. Previously, a concurrent add-chain request arriving while another request was initializing ROOTS via get_or_try_init would be canceled with a 500. Fix: load_roots now checks ROOTS.get() first (fast path), then builds the pool itself if not yet initialized, then calls ROOTS.set(). If another concurrent request races and sets first, the losing request discards its result and returns the value already in the cell. All concurrent requests do the work independently rather than waiting on each other. Applied to both ct_worker and bootstrap_mtc_worker. Also switch ROOTS from tokio::sync::OnceCell to std::sync::OnceLock — the code only uses get() and set() (both synchronous), so std::sync::OnceLock is more appropriate and makes the non-async nature explicit. Remove tokio from [dependencies] in ct_worker and bootstrap_mtc_worker — it was unused after removing the OnceCell import.
9336b0f to
a019552
Compare
|
/bonk review |
There was a problem hiding this comment.
The fix is correct and well-scoped. Replacing tokio::sync::OnceCell::get_or_try_init with std::sync::OnceLock + manual get()/set() eliminates the cross-request promise cancellation that was causing 500s on concurrent cold-start requests.
Core pattern is sound:
ROOTS.get()fast path avoids redundant work after initialization.- Each concurrent request independently builds the
CertPoolrather than awaiting a shared future, which is the correct approach given the Workers runtime constraint. let _ = ROOTS.set(pool)followed byROOTS.get().expect("just set")is safe: the current request either won theset()race or another request already set it — either way,get()is guaranteed to returnSome.- The trade-off of redundant KV fetches under cold-start contention is clearly preferable to killing N-1 concurrent requests with 500s.
Dependency cleanup is correct:
The tokio direct dependency is properly removed from both ct_worker/Cargo.toml and bootstrap_mtc_worker/Cargo.toml since the only usage was tokio::sync::OnceCell. The Cargo.lock diff only removes tokio from those two crate dependency lists — nothing else was inadvertently dropped.
Integration test changes are well-motivated:
get_roots_returns_valid_certsno longer callsensure_initialized()in either test suite. This is correct:get-rootsis stateless and the oldensure_initializedcall was specifically a workaround for theget_or_try_initdeadlock this PR fixes.- In
static_ct_api.rs, theensure_initializedwarmup loop now retriesget_log_v3_json()instead ofget_roots()as its liveness probe. This is a better choice:get_log_v3_jsonis a more targeted reachability check that doesn't touchROOTS, and is the endpoint actually needed (formeta) in the subsequent checkpoint verification.
Note on a previous review's suggestions: The doc comment at static_ct_api.rs:54-55 reads correctly in the current code ("ensures the worker is fully live and / has sequenced at least one entry") — there is no stutter to fix. The earlier review comment about console_error_panic_hook removal was also inaccurate; the Cargo.lock diff only removes tokio from the two worker crates.
Clean fix. LGTM.
|
I've reviewed and approved PR #205. Here's a summary: Approved — The fix correctly replaces Key points from my review:
|
|
Aside from some minor fixes/rebasing, nothing substantial has changed since @rozbb 's review, so I'll go ahead and merge. |
The Workers runtime cancels any request that awaits a promise (OnceCell future) created by a different request context. Previously, a concurrent add-chain request arriving while another request was initializing ROOTS via get_or_try_init would be canceled with a 500.
Fix: load_roots now checks ROOTS.get() first (fast path), then builds the pool itself if not yet initialized, then calls ROOTS.set(). If another concurrent request races and sets first, the losing request discards its result and returns the value already in the cell. All concurrent requests do the work independently rather than waiting on each other.
Applied to both ct_worker and bootstrap_mtc_worker.