Skip to content

Research Review: Pushsync Silent Chunk Loss #5400

@gacevicljubisa

Description

@gacevicljubisa

PR: #5390
Branch: fix/pushsync-out-of-depth-chunk-loss

Summary

Chunks can silently disappear from the network after a successful upload. Three bugs in the pushsync/pusher pipeline allow a chunk to be "delivered" (from the origin's perspective) while no node in the correct neighborhood actually retains it. The proposed fixes change protocol-level forwarding and receipt handling behavior.

Note: These issues are reproducible in smaller clusters such as the public testnet, where limited neighborhood sizes make out-of-depth forwarding and shallow receipts more likely.


Bug 1: Out-of-Depth Storing

Problem: When a node has no closer peer (ErrWantSelf) but the chunk is outside its AOR, it stores the chunk anyway. The chunk lands in a low-proximity bin, unreserve() evicts it immediately, but the origin already has a valid receipt. Chunk is gone.

Fix: Guard with proximity(chunk, self) < rad before storing — return ErrOutOfDepthStoring instead, origin retries with next peer. (pkg/pushsync/pushsync.go:301-309)

Doubling concern (@martinconic): The check uses StorageRadius() which already accounts for doubling (CommittedDepth - doublings). Using CommittedDepth instead would incorrectly reject chunks in sister neighborhoods. Appears correct, but needs research validation for edge cases during radius transitions.


Bug 2: Shallow Receipt Short-Circuits Parallel Pushes

Problem: On any shallow receipt, pushToClosest returns immediately with ErrShallowReceipt, discarding results from other inflight multiplex pushes (up to 32). Only 1 attempt of the error budget is used. Valid receipts from parallel pushes are thrown away.

Fix: Treat shallow receipt like any other peer failure — cache it, burn the budget, wait for inflight pushes. Only return ErrShallowReceipt after the full budget is exhausted or when falling back to ErrWantSelf. (pkg/pushsync/pushsync.go:369-530)


Bug 3: False ChunkSynced After Retry Exhaustion

Problem: After 6 pusher retries, chunks are marked ChunkSynced even though no node in the correct neighborhood stored them. The upload appears complete but the chunk is lost.

Fix: Report ChunkCouldNotSync instead of ChunkSynced. (pkg/pusher/pusher.go:283-289)

Gap identified (@martinconic): upload.Report() tag switch (pkg/storer/internal/upload/uploadstore.go:609-617) doesn't handle ChunkCouldNotSync — chunk gets cleaned up from the upload store but tag counters are never updated. Uploads with undeliverable chunks will show permanently incomplete sync counts.

Possible approaches:

  • A. Add ChunkCouldNotSync case with a new Failed counter in TagItem + surface via API
  • B. Keep chunk in push queue for later retry (risk: infinite loops for undeliverable chunks)
  • C. Surface via metrics/API only, accept the tag gap

Protocol Impact

Scenario Before After
Node outside AOR, no closer peer Stores + sends receipt (evicted shortly after) Returns error, origin retries
Shallow receipt during multiplex Immediate abort, 1 of 32 attempts used Full budget exhausted before giving up
Pusher exhausts 6 retries Marked ChunkSynced (false positive) Marked ChunkCouldNotSync (correct, but tag gap)

Metadata

Metadata

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions