Skip to content

fix(cache): atomic writes to prevent torn-read deletion of valid cache entries#1056

Merged
BYK merged 2 commits into
mainfrom
fix/response-cache-torn-read-deletion
Jun 3, 2026
Merged

fix(cache): atomic writes to prevent torn-read deletion of valid cache entries#1056
BYK merged 2 commits into
mainfrom
fix/response-cache-torn-read-deletion

Conversation

@BYK
Copy link
Copy Markdown
Member

@BYK BYK commented Jun 3, 2026

Summary

Fixes the failing main build caused by a ~10% flaky unit test
(response-cache.test.ts--fresh round-trip: stale entry is replaced by fresh response) that exposed a real production bug in the HTTP response
cache.

Root cause

In src/lib/response-cache.ts, each cache write fires cleanupCache()
fire-and-forget at 10% probability (CLEANUP_PROBABILITY = 0.1). A second
write to the same key overwrote the file with a non-atomic writeFile.
The first write's async cleanup could then read the file mid-overwrite
(torn read), fail to JSON.parse the truncated content, and delete it as
"corrupted" in collectEntryMetadata — silently losing a valid cache entry.

Proof: A 300-iteration repro failed ~10% of the time (matching
CLEANUP_PROBABILITY), and on every miss the cache directory was empty
the valid entry was deleted, not expired.

This is a genuine correctness bug beyond tests: any two rapid writes to the same
cache key (paginated fetches, concurrent CLI invocations sharing the cache dir)
can race so cleanup's torn read deletes a valid entry.

Why PR checks didn't catch it

The flake is ~10%, so any single CI "Unit Tests" run passes ~90% of the time.
PR #1051's Unit Tests check landed in the lucky 90%; the post-merge main run
hit the unlucky 10%. The bug is pre-existing — the cleanup code predates
#1051, which never touched response-cache.ts.

Fix

  1. Atomic writes — new atomicWriteCacheFile() writes to a unique temp file
    (<key>.<pid>.<uuid>.tmp) then renames it into place. rename is atomic on
    POSIX (same filesystem) and near-atomic on Windows (same volume), so a
    concurrent reader sees a complete old or new file, never a torn one.
  2. Cleanup hardeningcollectEntryMetadata() now separates failure modes:
    • transient read failure (locking, AV scanner, ENOENT from a concurrent
      sweep) → skip, never delete.
    • fully read but unparseable → genuine corruption (atomic writes preclude
      torn reads), so the file is marked expired and reclaimed by
      deleteExpiredEntries. This keeps corrupt files visible to
      MAX_CACHE_ENTRIES eviction instead of leaking past the count.
  3. Temp-file sweepcleanupCache() removes orphaned .tmp files older
    than 60s, so a crash between writeFile and rename can't leak temp files.
  4. Diagnostics — previously-silent catch blocks now log.debug() the
    suppressed error (per AGENTS.md no-silent-catch rule).

Verification

  • Amplified repro: 300/300 pass (was ~10% failing before).
  • Actual test file: 10/10 runs green (36 tests, +2 new regression tests).
  • pnpm run lint: clean. pnpm exec tsc --noEmit: exit 0.
  • Full unit suite: 325 files, all passing (one unrelated pre-existing property
    flake noted below).

Out of scope (follow-up)

During full-suite runs I observed an unrelated, pre-existing flaky property
test: auto-paginate.property.test.ts:94 (counterexample [393,392,98] — an
autoPaginate nextCursor trim bug when total > limit). It passes in
isolation and is independent of this change. Tracked as a follow-up.

The response cache could silently delete a valid entry on a torn read. Each
cache write fires cleanupCache() fire-and-forget at 10% probability; a second
write to the same key overwrote the file with a non-atomic writeFile. The
first write's async cleanup could then read the file mid-overwrite, fail to
JSON.parse the truncated content, and delete it as "corrupted" in
collectEntryMetadata — losing a valid cache entry.

This surfaced as a ~10% flaky unit test (response-cache.test.ts "--fresh
round-trip"), which is why it passed on PR #1051's CI run but failed on the
post-merge main run. The bug is pre-existing (cleanup code predates #1051).

Fixes:
- atomicWriteCacheFile(): write to a unique temp file then rename into place.
  rename within the same directory is atomic on POSIX and Windows, so a
  concurrent reader always sees a complete old or new file, never a torn one.
- collectEntryMetadata(): skip files on transient parse/read failure instead
  of deleting them. Genuinely corrupt entries already self-heal on the read
  path (getCachedResponse) or are overwritten on the next write.

Verified: amplified repro 300/300 pass (was ~10% failing); actual test file
20/20 runs green; full unit suite 7421 passed / 13 skipped / 0 failed.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://cli.sentry.dev/_preview/pr-1056/

Built to branch gh-pages at 2026-06-03 14:16 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

Codecov Results 📊

✅ Patch coverage is 88.89%. Project has 4340 uncovered lines.
✅ Project coverage is 82.08%. Comparing base (base) to head (head).

Files with missing lines (1)
File Patch % Lines
src/lib/response-cache.ts 88.89% ⚠️ 3 Missing and 1 partials
Coverage diff
@@            Coverage Diff             @@
##          main       #PR       +/-##
==========================================
+ Coverage    81.97%    82.08%    +0.11%
==========================================
  Files          334       335        +1
  Lines        24157     24217       +60
  Branches     15807     15843       +36
==========================================
+ Hits         19802     19877       +75
- Misses        4355      4340       -15
- Partials      1659      1659         —

Generated by Codecov Action

…le temps

Addresses adversarial self-review findings on the torn-read fix:

- Add log.debug() to the previously-silent catch blocks in atomicWriteCacheFile
  and collectEntryMetadata (AGENTS.md prohibits silent catches in src/).
- collectEntryMetadata now splits read vs parse failures: a transient read
  failure is skipped (never deleted), while a fully-read-but-unparseable file is
  genuine corruption and is marked expired so deleteExpiredEntries reclaims it.
  This keeps corrupt files visible to MAX_CACHE_ENTRIES eviction (they no longer
  escape the count) and is safe now that atomic writes preclude torn reads.
- cleanupCache sweeps orphaned .tmp files older than 60s, so a crash between
  writeFile and rename can't leak temp files unbounded.
- Soften the Windows atomicity claim in the JSDoc (near-atomic, same volume).
- Add tests: atomic write leaves no temp file behind; repeated overwrite-read
  never loses the entry (torn-read regression).
@BYK BYK merged commit 1bed9ea into main Jun 3, 2026
29 checks passed
@BYK BYK deleted the fix/response-cache-torn-read-deletion branch June 3, 2026 14:24
BYK added a commit that referenced this pull request Jun 3, 2026
…undary (#1060)

## Summary

Fixes a flaky property test, `auto-paginate.property.test.ts` →
*"drops nextCursor whenever the result is trimmed to limit"*, which
failed on
seeds that hit an exact page-boundary input (e.g. counterexample
`[total=393, limit=392, pageSize=98]`, since `392 % 98 === 0`).

**This is a test-only fix — the `autoPaginate()` implementation is
correct.**

## Root cause

The test asserted that whenever `total > limit`, `nextCursor` must be
`undefined`. That predicate conflates two cases with **opposite,
correct**
contracts in the multi-page path:

- **True overshoot** — a page pushes accumulation strictly past `limit`.
Rows are
trimmed and the server cursor points *past* the trimmed tail, so it is
dropped.
  Preserving it would skip rows (the original skip-bug guard). ✅
- **Exact boundary** (`limit % effectivePageSize === 0`) — accumulation
lands
exactly on `limit`, nothing is trimmed, and the last page's cursor
points at
index `limit` (the first unreturned row). It **must be preserved** so
callers
can page forward; dropping it strands the tail — the regression
explicitly
  pinned by `events-overshoot.test.ts`. ✅

In the counterexample, `autoPaginate` correctly returns
`nextCursor="392"`
(following it yields row 392 — contiguous, no skip), but the test
demanded
`undefined`.

## Fix

Split the assertion by the page-boundary discriminator:
- `limit % effectivePageSize === 0` (exact boundary) → assert
  `nextCursor === String(limit)` (positively verifies the cursor resumes
  contiguously).
- otherwise (true overshoot) → assert `nextCursor === undefined`
(skip-bug
  guard, unchanged).

Also updated the file header to document the boundary nuance.

## Why not change the implementation?

Dropping the cursor at the exact boundary would reintroduce the
**stranding**
regression that `events-overshoot.test.ts` guards against (`-c next`
could never
advance past `limit`). The implementation's two-branch logic is correct.

## Verification

- Exhaustive sweep of **all** `(total, limit, pageSize)` combinations in
the
test's arbitrary ranges (103,171 cases): **0 violations** against the
new
  contract.
- `auto-paginate.property.test.ts` + `events-overshoot.test.ts`: pass.
- Property test run 15× consecutively: **15/15 green** (was
intermittently
  failing).
- `pnpm run lint`: clean. `pnpm exec tsc --noEmit`: exit 0.

## Context

This flake was spotted during the full-suite run while fixing the
response-cache
torn-read bug (#1056) and noted there as a follow-up.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant