
fix(pds): store blobs from Worker, not the firehose-holding DO #165

Merged
ascorbic merged 2 commits into main from fix/blob-upload-off-do
May 15, 2026

@ascorbic
Owner

Summary

  • Follow-up to #162 (fix(pds): prevent relay desync after failed write). A blob upload to com.atproto.repo.uploadBlob (even a small link-card OG image) intermittently triggered "Durable Object storage operation exceeded timeout which caused object to be reset". The reset dropped the relay's subscribeRepos firehose connection and left the relay desynced until a manual requestCrawl.
  • Root cause: uploadBlob computed the CID and did the R2 put inside the AccountDurableObject. That DO is single-threaded and also holds the firehose WebSocket; awaiting an R2 put inside it pins the input gate, and R2 latency is independent of object size — even a small image can stall long enough for Cloudflare to reset the object.
  • Fix: the stateless Worker now computes the CID and writes to R2 directly, mirroring the existing sync.getBlob download path (which already bypasses the DO with the comment "R2ObjectBody can't be serialized across RPC"). The DO is only called for the small imported_blobs tracking row via a new rpcTrackBlob. rpcUploadBlob is removed.
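The Worker-side flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `BlobBucket`, `AccountStub`, `uploadBlobFromWorker`, the `rpcTrackBlob` argument list, and the `did/cid` key layout are all hypothetical, and the CID is assumed to be a CIDv1 with the raw codec and a SHA-256 multihash. It uses `node:crypto` so it runs under Node; a Worker could compute the same digest with `crypto.subtle.digest`.

```typescript
import { createHash } from "node:crypto";

// Hypothetical minimal shapes; the real R2 binding and DO stub are richer.
interface BlobBucket {
  put(key: string, value: Uint8Array): Promise<unknown>;
}
interface AccountStub {
  // Signature is an assumption; the PR only names the method.
  rpcTrackBlob(cid: string, size: number, mimeType: string): Promise<void>;
}

const BASE32 = "abcdefghijklmnopqrstuvwxyz234567";

// RFC 4648 base32, lowercase, no padding (multibase prefix "b" is added by the caller).
function base32(bytes: Uint8Array): string {
  let value = 0;
  let bits = 0;
  let out = "";
  for (let i = 0; i < bytes.length; i++) {
    value = (value << 8) | bytes[i];
    bits += 8;
    while (bits >= 5) {
      out += BASE32[(value >>> (bits - 5)) & 31];
      bits -= 5;
    }
    value &= (1 << bits) - 1; // keep only the bits not yet emitted
  }
  if (bits > 0) out += BASE32[(value << (5 - bits)) & 31];
  return out;
}

// Assumed CID layout: version 1, raw codec (0x55), sha2-256 (0x12), 32-byte digest (0x20).
function blobCid(bytes: Uint8Array): string {
  const digest = createHash("sha256").update(bytes).digest();
  const cid = new Uint8Array(4 + digest.length);
  cid.set([0x01, 0x55, 0x12, 0x20], 0);
  cid.set(digest, 4);
  return "b" + base32(cid);
}

// Runs in the stateless Worker: hash and R2 write happen here, so the DO
// never awaits R2 and only receives the small tracking call.
async function uploadBlobFromWorker(
  bucket: BlobBucket,
  account: AccountStub,
  did: string,
  bytes: Uint8Array,
  mimeType: string,
): Promise<{ cid: string; size: number; mimeType: string }> {
  const cid = blobCid(bytes);
  await bucket.put(`${did}/${cid}`, bytes); // key layout is hypothetical
  await account.rpcTrackBlob(cid, bytes.length, mimeType);
  return { cid, size: bytes.length, mimeType };
}
```

In this sketch the tracking call runs only after the R2 write resolves, so a failed or slow write never reaches the Durable Object at all.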

Why #162 didn't cover this

#162 fixed in-memory Repo divergence in the record-write paths. This is a different mechanism: the whole DO is reset, not a caught write error, so that fix can't apply. The two together close both desync paths.

Test plan

  • pnpm --filter @getcirrus/pds test:unit — 273 tests pass (incl. blobs.test.ts)
  • Production: post a link card with an OG thumbnail repeatedly; confirm the relay keeps tracking with no manual requestCrawl

uploadBlob computed the CID and did the R2 put inside the AccountDurableObject. That DO is single-threaded and also holds the relay's subscribeRepos firehose WebSocket; awaiting an R2 put inside it pins the input gate (R2 latency is independent of object size — even a small link-card OG image can stall), and Cloudflare resets the object with "Durable Object storage operation exceeded timeout", dropping the firehose and desyncing the relay until a manual requestCrawl.

The Worker now computes the CID and writes to R2 directly, mirroring the existing sync.getBlob download path, and only calls the DO (new rpcTrackBlob) for the small imported_blobs tracking row. rpcUploadBlob is removed.
@cloudflare-workers-and-pages

cloudflare-workers-and-pages Bot commented May 15, 2026

Deploying with Cloudflare Workers

The latest updates on your project. Learn more about integrating Git with Workers.

| Status | Name | Latest Commit | Updated (UTC) |
| --- | --- | --- | --- |
| ✅ Deployment successful! | atproto-pds | 5e70483 | May 15 2026, 07:10 PM |

@pkg-pr-new

pkg-pr-new Bot commented May 15, 2026

npm i https://pkg.pr.new/create-pds@165
npm i https://pkg.pr.new/@getcirrus/oauth-provider@165
npm i https://pkg.pr.new/@getcirrus/pds@165

commit: 5e70483

@ascorbic ascorbic merged commit 5e058c8 into main May 15, 2026
5 checks passed
@ascorbic ascorbic deleted the fix/blob-upload-off-do branch May 15, 2026 19:11
@mixie-bot mixie-bot Bot mentioned this pull request May 15, 2026