fix(cache): atomic writes to prevent torn-read deletion of valid cache entries#1056
Merged
Conversation
The response cache could silently delete a valid entry on a torn read. Each cache write fires cleanupCache() fire-and-forget at 10% probability; a second write to the same key overwrote the file with a non-atomic writeFile. The first write's async cleanup could then read the file mid-overwrite, fail to JSON.parse the truncated content, and delete it as "corrupted" in collectEntryMetadata — losing a valid cache entry. This surfaced as a ~10% flaky unit test (response-cache.test.ts "--fresh round-trip"), which is why it passed on PR #1051's CI run but failed on the post-merge main run. The bug is pre-existing (cleanup code predates #1051). Fixes: - atomicWriteCacheFile(): write to a unique temp file then rename into place. rename within the same directory is atomic on POSIX and Windows, so a concurrent reader always sees a complete old or new file, never a torn one. - collectEntryMetadata(): skip files on transient parse/read failure instead of deleting them. Genuinely corrupt entries already self-heal on the read path (getCachedResponse) or are overwritten on the next write. Verified: amplified repro 300/300 pass (was ~10% failing); actual test file 20/20 runs green; full unit suite 7421 passed / 13 skipped / 0 failed.
Contributor
|
Contributor
Codecov Results 📊✅ Patch coverage is 88.89%. Project has 4340 uncovered lines. Files with missing lines (1)
Coverage diff@@ Coverage Diff @@
## main #PR +/-##
==========================================
+ Coverage 81.97% 82.08% +0.11%
==========================================
Files 334 335 +1
Lines 24157 24217 +60
Branches 15807 15843 +36
==========================================
+ Hits 19802 19877 +75
- Misses 4355 4340 -15
- Partials 1659 1659 —Generated by Codecov Action |
…le temps Addresses adversarial self-review findings on the torn-read fix: - Add log.debug() to the previously-silent catch blocks in atomicWriteCacheFile and collectEntryMetadata (AGENTS.md prohibits silent catches in src/). - collectEntryMetadata now splits read vs parse failures: a transient read failure is skipped (never deleted), while a fully-read-but-unparseable file is genuine corruption and is marked expired so deleteExpiredEntries reclaims it. This keeps corrupt files visible to MAX_CACHE_ENTRIES eviction (they no longer escape the count) and is safe now that atomic writes preclude torn reads. - cleanupCache sweeps orphaned .tmp files older than 60s, so a crash between writeFile and rename can't leak temp files unbounded. - Soften the Windows atomicity claim in the JSDoc (near-atomic, same volume). - Add tests: atomic write leaves no temp file behind; repeated overwrite-read never loses the entry (torn-read regression).
BYK
added a commit
that referenced
this pull request
Jun 3, 2026
…undary (#1060) ## Summary Fixes a flaky property test, `auto-paginate.property.test.ts` → *"drops nextCursor whenever the result is trimmed to limit"*, which failed on seeds that hit an exact page-boundary input (e.g. counterexample `[total=393, limit=392, pageSize=98]`, since `392 % 98 === 0`). **This is a test-only fix — the `autoPaginate()` implementation is correct.** ## Root cause The test asserted that whenever `total > limit`, `nextCursor` must be `undefined`. That predicate conflates two cases with **opposite, correct** contracts in the multi-page path: - **True overshoot** — a page pushes accumulation strictly past `limit`. Rows are trimmed and the server cursor points *past* the trimmed tail, so it is dropped. Preserving it would skip rows (the original skip-bug guard). ✅ - **Exact boundary** (`limit % effectivePageSize === 0`) — accumulation lands exactly on `limit`, nothing is trimmed, and the last page's cursor points at index `limit` (the first unreturned row). It **must be preserved** so callers can page forward; dropping it strands the tail — the regression explicitly pinned by `events-overshoot.test.ts`. ✅ In the counterexample, `autoPaginate` correctly returns `nextCursor="392"` (following it yields row 392 — contiguous, no skip), but the test demanded `undefined`. ## Fix Split the assertion by the page-boundary discriminator: - `limit % effectivePageSize === 0` (exact boundary) → assert `nextCursor === String(limit)` (positively verifies the cursor resumes contiguously). - otherwise (true overshoot) → assert `nextCursor === undefined` (skip-bug guard, unchanged). Also updated the file header to document the boundary nuance. ## Why not change the implementation? Dropping the cursor at the exact boundary would reintroduce the **stranding** regression that `events-overshoot.test.ts` guards against (`-c next` could never advance past `limit`). The implementation's two-branch logic is correct. ## Verification - Exhaustive sweep of **all** `(total, limit, pageSize)` combinations in the test's arbitrary ranges (103,171 cases): **0 violations** against the new contract. - `auto-paginate.property.test.ts` + `events-overshoot.test.ts`: pass. - Property test run 15× consecutively: **15/15 green** (was intermittently failing). - `pnpm run lint`: clean. `pnpm exec tsc --noEmit`: exit 0. ## Context This flake was spotted during the full-suite run while fixing the response-cache torn-read bug (#1056) and noted there as a follow-up.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes the failing
mainbuild caused by a ~10% flaky unit test(
response-cache.test.ts→--fresh round-trip: stale entry is replaced by fresh response) that exposed a real production bug in the HTTP responsecache.
Root cause
In
src/lib/response-cache.ts, each cache write firescleanupCache()fire-and-forget at 10% probability (
CLEANUP_PROBABILITY = 0.1). A secondwrite to the same key overwrote the file with a non-atomic
writeFile.The first write's async cleanup could then read the file mid-overwrite
(torn read), fail to
JSON.parsethe truncated content, and delete it as"corrupted" in
collectEntryMetadata— silently losing a valid cache entry.Proof: A 300-iteration repro failed ~10% of the time (matching
CLEANUP_PROBABILITY), and on every miss the cache directory was empty —the valid entry was deleted, not expired.
This is a genuine correctness bug beyond tests: any two rapid writes to the same
cache key (paginated fetches, concurrent CLI invocations sharing the cache dir)
can race so cleanup's torn read deletes a valid entry.
Why PR checks didn't catch it
The flake is ~10%, so any single CI "Unit Tests" run passes ~90% of the time.
PR #1051's Unit Tests check landed in the lucky 90%; the post-merge
mainrunhit the unlucky 10%. The bug is pre-existing — the cleanup code predates
#1051, which never touched
response-cache.ts.Fix
atomicWriteCacheFile()writes to a unique temp file(
<key>.<pid>.<uuid>.tmp) thenrenames it into place.renameis atomic onPOSIX (same filesystem) and near-atomic on Windows (same volume), so a
concurrent reader sees a complete old or new file, never a torn one.
collectEntryMetadata()now separates failure modes:sweep) → skip, never delete.
torn reads), so the file is marked expired and reclaimed by
deleteExpiredEntries. This keeps corrupt files visible toMAX_CACHE_ENTRIESeviction instead of leaking past the count.cleanupCache()removes orphaned.tmpfiles olderthan 60s, so a crash between
writeFileandrenamecan't leak temp files.log.debug()thesuppressed error (per AGENTS.md no-silent-catch rule).
Verification
pnpm run lint: clean.pnpm exec tsc --noEmit: exit 0.flake noted below).
Out of scope (follow-up)
During full-suite runs I observed an unrelated, pre-existing flaky property
test:
auto-paginate.property.test.ts:94(counterexample[393,392,98]— anautoPaginatenextCursortrim bug whentotal > limit). It passes inisolation and is independent of this change. Tracked as a follow-up.