PR: #5390
Branch: fix/pushsync-out-of-depth-chunk-loss
Summary
Chunks can silently disappear from the network after a successful upload. Three bugs in the pushsync/pusher pipeline allow a chunk to be "delivered" (from the origin's perspective) while no node in the correct neighborhood actually retains it. The proposed fixes change protocol-level forwarding and receipt handling behavior.
Note: These issues are reproducible in smaller clusters such as the public testnet, where limited neighborhood sizes make out-of-depth forwarding and shallow receipts more likely.
Bug 1: Out-of-Depth Storing
Problem: When a node has no closer peer (ErrWantSelf) but the chunk is outside its AOR, it stores the chunk anyway. The chunk lands in a low-proximity bin, unreserve() evicts it immediately, but the origin already has a valid receipt. Chunk is gone.
Fix: Guard with proximity(chunk, self) < rad before storing — return ErrOutOfDepthStoring instead, origin retries with next peer. (pkg/pushsync/pushsync.go:301-309)
Doubling concern (@martinconic): The check uses StorageRadius() which already accounts for doubling (CommittedDepth - doublings). Using CommittedDepth instead would incorrectly reject chunks in sister neighborhoods. Appears correct, but needs research validation for edge cases during radius transitions.
Bug 2: Shallow Receipt Short-Circuits Parallel Pushes
Problem: On any shallow receipt, pushToClosest returns immediately with ErrShallowReceipt, discarding results from other inflight multiplex pushes (up to 32). Only 1 attempt of the error budget is used. Valid receipts from parallel pushes are thrown away.
Fix: Treat shallow receipt like any other peer failure — cache it, burn the budget, wait for inflight pushes. Only return ErrShallowReceipt after the full budget is exhausted or when falling back to ErrWantSelf. (pkg/pushsync/pushsync.go:369-530)
Bug 3: False ChunkSynced After Retry Exhaustion
Problem: After 6 pusher retries, chunks are marked ChunkSynced even though no node in the correct neighborhood stored them. The upload appears complete but the chunk is lost.
Fix: Report ChunkCouldNotSync instead of ChunkSynced. (pkg/pusher/pusher.go:283-289)
Gap identified (@martinconic): upload.Report() tag switch (pkg/storer/internal/upload/uploadstore.go:609-617) doesn't handle ChunkCouldNotSync — chunk gets cleaned up from the upload store but tag counters are never updated. Uploads with undeliverable chunks will show permanently incomplete sync counts.
Possible approaches:
- A. Add
ChunkCouldNotSync case with a new Failed counter in TagItem + surface via API
- B. Keep chunk in push queue for later retry (risk: infinite loops for undeliverable chunks)
- C. Surface via metrics/API only, accept the tag gap
Protocol Impact
| Scenario |
Before |
After |
| Node outside AOR, no closer peer |
Stores + sends receipt (evicted shortly after) |
Returns error, origin retries |
| Shallow receipt during multiplex |
Immediate abort, 1 of 32 attempts used |
Full budget exhausted before giving up |
| Pusher exhausts 6 retries |
Marked ChunkSynced (false positive) |
Marked ChunkCouldNotSync (correct, but tag gap) |
PR: #5390
Branch:
fix/pushsync-out-of-depth-chunk-lossSummary
Chunks can silently disappear from the network after a successful upload. Three bugs in the pushsync/pusher pipeline allow a chunk to be "delivered" (from the origin's perspective) while no node in the correct neighborhood actually retains it. The proposed fixes change protocol-level forwarding and receipt handling behavior.
Note: These issues are reproducible in smaller clusters such as the public testnet, where limited neighborhood sizes make out-of-depth forwarding and shallow receipts more likely.
Bug 1: Out-of-Depth Storing
Problem: When a node has no closer peer (
ErrWantSelf) but the chunk is outside its AOR, it stores the chunk anyway. The chunk lands in a low-proximity bin,unreserve()evicts it immediately, but the origin already has a valid receipt. Chunk is gone.Fix: Guard with
proximity(chunk, self) < radbefore storing — returnErrOutOfDepthStoringinstead, origin retries with next peer. (pkg/pushsync/pushsync.go:301-309)Doubling concern (@martinconic): The check uses
StorageRadius()which already accounts for doubling (CommittedDepth - doublings). UsingCommittedDepthinstead would incorrectly reject chunks in sister neighborhoods. Appears correct, but needs research validation for edge cases during radius transitions.Bug 2: Shallow Receipt Short-Circuits Parallel Pushes
Problem: On any shallow receipt,
pushToClosestreturns immediately withErrShallowReceipt, discarding results from other inflight multiplex pushes (up to 32). Only 1 attempt of the error budget is used. Valid receipts from parallel pushes are thrown away.Fix: Treat shallow receipt like any other peer failure — cache it, burn the budget, wait for inflight pushes. Only return
ErrShallowReceiptafter the full budget is exhausted or when falling back toErrWantSelf. (pkg/pushsync/pushsync.go:369-530)Bug 3: False
ChunkSyncedAfter Retry ExhaustionProblem: After 6 pusher retries, chunks are marked
ChunkSyncedeven though no node in the correct neighborhood stored them. The upload appears complete but the chunk is lost.Fix: Report
ChunkCouldNotSyncinstead ofChunkSynced. (pkg/pusher/pusher.go:283-289)Gap identified (@martinconic):
upload.Report()tag switch (pkg/storer/internal/upload/uploadstore.go:609-617) doesn't handleChunkCouldNotSync— chunk gets cleaned up from the upload store but tag counters are never updated. Uploads with undeliverable chunks will show permanently incomplete sync counts.Possible approaches:
ChunkCouldNotSynccase with a newFailedcounter inTagItem+ surface via APIProtocol Impact
ChunkSynced(false positive)ChunkCouldNotSync(correct, but tag gap)