Cache CAS interactions at lower layer to fix lost input handling #29551

Open
fmeum wants to merge 12 commits into bazelbuild:master from fmeum:fix-lost-inputs

@fmeum
Collaborator

@fmeum fmeum commented May 15, 2026

Description

The error handling for uploads in UploadTask depends on the exec path of the uploaded file, which wasn't part of the casUploadCache cache key. Fix this by moving deduplication logic for CAS uploads and FindMissing calls into RemoteCacheClient.
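To illustrate the problem, here is a minimal sketch of how deduplicating by digest alone can attach the wrong exec path to a failure. All names (`LostInputAttributionSketch`, `LostInputException`, `reportedPath`) are hypothetical stand-ins, not Bazel's actual classes, and digests are simplified to strings.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical illustration only; not Bazel's real API. */
final class LostInputAttributionSketch {
  /** Stand-in for an upload failure that records which input was lost. */
  static final class LostInputException extends RuntimeException {
    final String execPath;

    LostInputException(String execPath) {
      super("lost input: " + execPath);
      this.execPath = execPath;
    }
  }

  // Keyed by digest only: the exec path is not part of the cache key.
  private final Map<String, CompletableFuture<Void>> uploadsByDigest = new ConcurrentHashMap<>();

  /** Simulates an upload that always fails because the local file is gone. */
  CompletableFuture<Void> upload(String digest, String execPath) {
    return uploadsByDigest.computeIfAbsent(
        digest,
        d -> {
          CompletableFuture<Void> result = new CompletableFuture<>();
          // The failure is tagged with the exec path of whichever caller came first.
          result.completeExceptionally(new LostInputException(execPath));
          return result;
        });
  }

  /** Returns the exec path reported by a failed upload, or null if it succeeded. */
  static String reportedPath(CompletableFuture<Void> upload) {
    try {
      upload.join();
      return null;
    } catch (CompletionException e) {
      return ((LostInputException) e.getCause()).execPath;
    }
  }
}
```

In this sketch, a second action uploading the same digest under a different exec path receives the first action's failure, so the reported lost input is not one of its own inputs.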

Motivation

Avoids spurious Bazel failures on lost inputs. These failures can't be recovered from via build or action rewinding because the CacheNotFoundException is marked with the exec paths of inputs to concurrent actions rather than those of the failing action (see the new test case).

Build API Changes

No

Checklist

  • I have added tests for the new use cases (if any).
  • I have updated the documentation (if applicable).

Release Notes

RELNOTES: Fixed an issue that caused Bazel to fail on a lost input even with build or action rewinding enabled.

@fmeum fmeum force-pushed the fix-lost-inputs branch 2 times, most recently from 8b0b556 to 1b38764 May 17, 2026 21:15
@fmeum fmeum marked this pull request as ready for review May 17, 2026 21:24
@fmeum fmeum requested a review from a team as a code owner May 17, 2026 21:24
@fmeum fmeum requested a review from coeuvre May 17, 2026 21:24
@github-actions github-actions Bot added the team-Remote-Exec and awaiting-review labels May 17, 2026
@fmeum
Collaborator Author

fmeum commented May 17, 2026

@iancha1992 This would be another candidate for a 9.1.1 release.

@fmeum
Collaborator Author

fmeum commented May 17, 2026

@bazel-io flag

@bazel-io bazel-io added the potential release blocker label May 17, 2026
@fmeum fmeum changed the title from "Cache CAS uploads at lower layer to fix lost inputs handling" to "Cache CAS interactions at lower layer to fix lost inputs handling" May 17, 2026
@fmeum fmeum changed the title from "Cache CAS interactions at lower layer to fix lost inputs handling" to "Cache CAS interactions at lower layer to fix lost input handling" May 17, 2026
@iancha1992
Member

@bazel-io fork 9.1.1

@iancha1992
Member

@bazel-io fork 9.2.0

@bazel-io bazel-io removed the potential release blocker label May 18, 2026
Comment thread src/main/java/com/google/devtools/build/lib/remote/common/RemoteCacheClient.java Outdated
fmeum and others added 8 commits May 21, 2026 22:43
Co-authored-by: Son Luong Ngoc <sluongng@gmail.com>
…calls

Restores the dedup behavior the lower-layer CAS caching removed, by
adding a second AsyncTaskCache<Digest, Boolean> that tracks in-flight
findMissingDigests results. The actual upload deduplication still
happens in casUploadCache (after the per-consumer isAvailableLocally
check inside uploadFile), so each consumer's lost-input path remains
distinct.
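As a rough sketch of the idea, concurrent lookups for the same digest can share a single pending query. This uses plain `CompletableFuture` and `ConcurrentHashMap` instead of Bazel's `AsyncTaskCache`, simplifies digests to strings, and assumes that `true` means "missing"; the class and method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical sketch of deduplicating findMissing-style queries per digest. */
final class FindMissingDedupSketch {
  // digest -> pending or finished answer to "is this blob missing remotely?"
  private final Map<String, CompletableFuture<Boolean>> queries = new ConcurrentHashMap<>();
  private final Function<String, CompletableFuture<Boolean>> remoteQuery;

  FindMissingDedupSketch(Function<String, CompletableFuture<Boolean>> remoteQuery) {
    this.remoteQuery = remoteQuery;
  }

  /** Issues at most one remote query per digest; later callers share its result. */
  CompletableFuture<Boolean> isMissing(String digest) {
    return queries.computeIfAbsent(digest, remoteQuery);
  }
}
```

Only the existence check is shared here. Any per-consumer check, such as whether the file is still available locally, would run separately for each caller.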

Also wraps uploadBlob(byte[]), uploadVirtualActionInput, and the
additionalInputs upload path with casUploadCache so directory blobs
and action/command messages are deduped consistently with file uploads.

Converts RemoteCacheClient from interface to abstract class and moves
the casUploadCache dedup logic out of CombinedCache and
RemoteExecutionCache into the base class. Adds `force` overloads on
the public upload methods.

Concrete implementations now override the protected-flavored
uploadFileImpl / uploadBlobImpl methods that perform the raw network
call; the public uploadFile / uploadBlob wrappers dedupe via
casUploadCache before invoking them. The previous double-wrap (one
cache in CombinedCache, another in RemoteExecutionCache around the
same digest) caused a lock-ordering deadlock once the client also had
its own cache, so the wrappers in CombinedCache/RemoteExecutionCache
are removed.
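The shape described above can be sketched as follows. This is a simplified illustration, not the actual RemoteCacheClient: digests are strings, futures are `CompletableFuture`, and it assumes that `force` means "bypass the deduplication cache". Eviction of failed uploads is omitted.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch of a base class that dedupes before calling the raw upload. */
abstract class DedupingCacheClientSketch {
  // One cache, owned by the base class, so that no wrapper layers a second cache on top.
  private final Map<String, CompletableFuture<Void>> casUploadCache = new ConcurrentHashMap<>();

  /** Performs the raw network call; concrete clients override this. */
  protected abstract CompletableFuture<Void> uploadBlobImpl(String digest, byte[] data);

  /** Uploads the blob unless an upload for the same digest is already known. */
  public final CompletableFuture<Void> uploadBlob(String digest, byte[] data) {
    return uploadBlob(digest, data, /* force= */ false);
  }

  /** As above, but with force set the cache is skipped (assumed semantics). */
  public final CompletableFuture<Void> uploadBlob(String digest, byte[] data, boolean force) {
    if (force) {
      return uploadBlobImpl(digest, data);
    }
    return casUploadCache.computeIfAbsent(digest, d -> uploadBlobImpl(d, data));
  }
}
```

Because the public methods are final in this sketch, subclasses cannot add their own wrapper around the same digest, which is the kind of double wrapping the commit message describes as problematic.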

AsyncTaskCache and RxFutures are extracted into a small `rx_helpers`
java_library so the `common` package can depend on them without
creating a cycle with `remote/util`.

Tests that mocked the public upload methods are updated to mock the
*Impl variants instead, and switched from mock() to spy() of
InMemoryCacheClient where the deduplication wrapper needs to run.

- Extract a dedupedUpload() helper in RemoteCacheClient to remove the
  repeated casUploadCache.execute / RxFutures plumbing in uploadFile
  and uploadBlob(Blob).
- Tighten visibility of CombinedCache.uploadFile(force) and
  uploadBlob(force) from protected to private; they are only used
  internally now that subclasses no longer wrap them.
- Drop the second InMemoryCacheClient instance and findMissingDigests
  stub from upload_failedUploads_doNotDeduplicate; the spy can fall
  through via callRealMethod() for the success branch.

Before this commit, a digest that findMissingCache returned as "missing"
stayed cached as missing forever, even after the triggered upload turned
it present. After action rewinding (BuildWithoutTheBytesIntegrationTest)
this caused an infinite rewind loop: the bar action's retry would still
see the cached "missing", re-trigger the upload path, fail the
isAvailableLocally check in BWoB mode, and ask for another rewind.

Fix: after each upload attempt, update findMissingCache to reflect the
actual remote state — replace with "present" on success, or invalidate
on failure so the next caller re-queries. To enable doing this without
the previous lock-ordering deadlock between findMissingCache.lock and
casUploadCache.lock, AsyncTaskCache's `finished` map becomes a
ConcurrentHashMap and the new put/invalidate operations on it don't
take the cache's lock.
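A minimal sketch of that update step, assuming a plain `ConcurrentHashMap` as the finished-results map and `true` meaning "missing". The names are hypothetical and this is not the AsyncTaskCache implementation.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: keep cached findMissing results in sync with upload outcomes. */
final class FindMissingRefreshSketch {
  // digest -> true if the remote last reported the blob as missing (assumed encoding).
  final ConcurrentHashMap<String, Boolean> findMissingFinished = new ConcurrentHashMap<>();

  /** Updates the cached result once the upload completes, without taking any extra lock. */
  CompletableFuture<Void> trackUpload(String digest, CompletableFuture<Void> upload) {
    return upload.whenComplete(
        (unused, error) -> {
          if (error == null) {
            // Upload succeeded: the blob is now present remotely.
            findMissingFinished.put(digest, false);
          } else {
            // Upload failed: drop the entry so the next caller re-queries.
            findMissingFinished.remove(digest);
          }
        });
  }
}
```

Since `put` and `remove` on a `ConcurrentHashMap` are thread-safe on their own, the upload callback in this sketch never needs to acquire a lock belonging to the other cache.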
@fmeum fmeum force-pushed the fix-lost-inputs branch from 1b38764 to 33748b8 May 21, 2026 21:14
@fmeum fmeum requested a review from coeuvre May 22, 2026 05:35
@coeuvre coeuvre added the awaiting-PR-merge label and removed the awaiting-review label May 22, 2026
Labels

awaiting-PR-merge: PR has been approved by a reviewer and is ready to be merged internally
team-Remote-Exec: Issues and PRs for the Execution (Remote) team
