UnreserveBatch incorrectly unpins chunks #3037

Closed
ldeffenb opened this issue Jun 22, 2022 · 4 comments
Labels
bug, issue

Comments

@ldeffenb
Collaborator

Context

bee 1.6.2-684a7384, but it has been happening for a long time now, based on my anecdotal observations of pinned chunks subsequently being retrieved from the swarm.

Summary

I am pinning my entire OSM tile collection (reference below) to a freshly db-nuked node. Each individual manifest node and each referenced file is explicitly pinned using the POST /pins/{reference} API (modified so that it doesn't attempt to traverse the entire manifest internally). But even after pinning, chunks for pinned manifest nodes are still being retrieved from the swarm.

I added extensive logging to my -dirty node and learned that the batch reserve uses the same underlying pin-counter data store as the /pins API. When the local reserve capacity is exceeded, the Storage Radius is incremented, causing portions of the reserve to be evicted. However, in some cases the evictor decrements the pin counter and removes the pin, allowing chunks to be gc'd even though the corresponding API pin record still exists.
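
To make the failure mode concrete, here is a minimal, self-contained model of the shared counter. This is not the bee localstore code; the names and values are purely illustrative of the behaviour described above.

```go
// pincounter_model.go - toy model of a single pin counter shared by the
// /pins API and the batch reserve. NOT the actual bee localstore code.
package main

import "fmt"

var pinCount = map[string]int{} // shared per-chunk pin counter
var gcIndex = map[string]bool{} // chunks eligible for garbage collection

// setPin models the shared increment path used by both the API and the reserve.
func setPin(addr string) {
	pinCount[addr]++
	delete(gcIndex, addr) // pinned chunks are removed from gc
}

// setUnpin models the shared decrement path; at zero the chunk becomes gc-able.
func setUnpin(addr string) {
	pinCount[addr]--
	if pinCount[addr] <= 0 {
		delete(pinCount, addr)
		gcIndex[addr] = true // "unpinned and added to gcIndex"
	}
}

// unreserveBatch models eviction past the new storage radius. The problem:
// it unpins every chunk in the bin, including chunks whose only pin came
// from the API and never from the reserve.
func unreserveBatch(chunks []string) {
	for _, addr := range chunks {
		setUnpin(addr)
	}
}

func main() {
	const chunk = "52c0ddb5...e4724f"
	setPin(chunk)                   // API pin via POST /pins/{reference} -> count 1
	unreserveBatch([]string{chunk}) // reserve eviction decrements the same shared counter
	fmt.Println("pin count:", pinCount[chunk], "gc-able:", gcIndex[chunk])
	// prints: pin count: 0 gc-able: true -- the API pin is gone
}
```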

Once the chunk is in the gcIndex, it's only a matter of time before it is actually deleted from the localstore, causing subsequent /bytes invocations to go to the swarm for chunk retrieval.

Details of my findings will appear below, but this is really bad when I can't rely on a pinned dataset to remain local!

Expected behavior

API pinned chunks should not be unpinned except via the API.

Actual behavior

References pinned via the API are being unpinned and garbage-collected, causing subsequent /bytes invocations to retrieve the chunks from the swarm instead of reading them locally. Yet GET /pins/{reference} still claims that the reference is pinned!

Steps to reproduce

  1. Start with a freshly db-nuked node.
  2. Walk a large manifest collection by individually /bytes-retrieving each component part (using mantaray-js) and POST /pins each part after retrieval (note: a modified pinner is required so that it doesn't attempt to pin the entire manifest in one call, only the node itself).
  3. Monitor the bee_batchstore_storage_radius and bee_localstore_reserve_size metrics for changes.
  4. After a change, re-retrieve selected previously pinned references and watch the node's --verbosity 5 logs for "retrieving" messages. Manifest node references that are "furthest" from the node's overlay ID are more likely to exhibit the problem.
  5. If the node retrieves any chunks from the swarm, double-check with a GET /pins/{reference} to confirm that the API-requested pin is still present.
  6. This works best for manifest node references, which in my experience are all single-chunk references. File references from the manifest may contain multiple chunks, but only the root chunk is explicitly pinned. File reference chunks are also unpinned, but they are harder to detect and prove.

This can take a LONG time to show the problem, because you need to exceed the 5 million-chunk reserve capacity (possibly twice) AND have the garbage collector actually delete the evicted chunks before the problem becomes visible.

Note: I'm using https://github.com/ldeffenb/Swarming-Wikipedia/blob/main/src/pin-reference.ts for step 2, but I can't stress enough that you need to have a modified pinner in the target node that doesn't attempt to pin an entire manifest with a single invocation. Mainnet reference f09da1184cc9ef6af3b228f72c6ff965fcfee58b47d65d52ef6cf4e5347c766e contains 55+ million chunks and will likely not succeed if passed to a stock POST /pins/f09da1184cc9ef6af3b228f72c6ff965fcfee58b47d65d52ef6cf4e5347c766e
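
For reference, a rough client-side sketch of steps 2 and 5 — pinning each already-collected reference individually and verifying the pin afterwards — might look like the following. The bee API base URL and the reference list are assumptions; the modified pinner itself is a node-side change and is not shown here.

```go
// pinrefs.go - pin each reference individually via the bee API and verify it.
package main

import (
	"fmt"
	"net/http"
)

const beeAPI = "http://localhost:1633" // assumed bee API endpoint

// pinReference issues POST /pins/{reference} for a single reference.
func pinReference(ref string) error {
	req, err := http.NewRequest(http.MethodPost, beeAPI+"/pins/"+ref, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusCreated && resp.StatusCode != http.StatusOK {
		return fmt.Errorf("pin %s: unexpected status %s", ref, resp.Status)
	}
	return nil
}

// isPinned checks GET /pins/{reference}: 200 means pinned, 404 means not.
func isPinned(ref string) (bool, error) {
	resp, err := http.Get(beeAPI + "/pins/" + ref)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	refs := []string{
		"4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334", // example taken from the logs below
	}
	for _, ref := range refs {
		if err := pinReference(ref); err != nil {
			fmt.Println("pin failed:", err)
			continue
		}
		pinned, err := isPinned(ref)
		fmt.Printf("ref %s pinned=%v err=%v\n", ref, pinned, err)
	}
}
```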

I have also enhanced my node with extensive logging around the pin-counter updates and have caught the proverbial "smoking gun" in the following set of logs. In both cases you can see that the garbage collector didn't actually delete the chunk until roughly two and a half minutes after it was unpinned by UnreserveBatch.

This is a subordinate chunk of a file reference:

xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="retrieval: requesting chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f from peer 5261c968eda739af6ea391bfcfaab61c4a18224574dff858f4f30dee77b5992b"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:CreatePin: ref 24e470c0096d9d04754ca46197e93a210cc430b20cf222caccbd818e558308b3 Chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f pinned"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:ModeSetPin: Chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f pinned"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:setPin: Chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f pin removing from gc"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:setPin: Chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f pinned, count now 1"
xdais2-162-4.log:time="2022-06-21T11:46:44+02:00" level=trace msg="pinTrace:UnreserveBatch: Batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df Radius 2 Chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f unpinned"
xdais2-162-4.log:time="2022-06-21T11:46:44+02:00" level=trace msg="pinTrace:setUnpin: Chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f decrement pin count now 0 - Deleted!"
xdais2-162-4.log:time="2022-06-21T11:46:44+02:00" level=trace msg="pinTrace:setUnpin: Chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f unpinned and added to gcIndex"
xdais2-162-4.log:time="2022-06-21T11:49:07+02:00" level=trace msg="pinTrace:collectGarbage: Deleting chunk 52c0ddb5a6ce96dd548591abfd90776fc9a3fc0eb73bf5297164f43964e4724f"

And this is a directly pinned chunk that suffered the same fate:

xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="name resolve: valid bzz address "4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334""
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="retrieval: requesting chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 from peer 4c28177571b6fdd14680c0c112ee676ebefc791e9a5a038d17a0dce759ba4dcf"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=info msg="api access" duration=0.051172826 host="localhost:11642" ip=127.0.0.1 method=GET proto=HTTP/1.1 size=128 status=200 uri=/bytes/4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 user-agent=axios/0.26.1
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=info msg="api access" duration=0.000139436 host="localhost:11642" ip=127.0.0.1 method=GET proto=HTTP/1.1 size=36 status=404 uri=/pins/4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 user-agent=axios/0.26.1
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:CreatePin: ref 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 Chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 pinned"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:ModeSetPin: Chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 pinned"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:setPin: Chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 pin removing from gc"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:setPin: Chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 pinned, count now 1"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=trace msg="pinTrace:CreatePin: ref 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 pinned traverse true"
xdais2-162-4.log:time="2022-06-21T06:14:38+02:00" level=info msg="api access" duration=0.000360288 host="localhost:11642" ip=127.0.0.1 method=POST proto=HTTP/1.1 size=34 status=201 uri=/pins/4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 user-agent=axios/0.26.1
xdais2-162-4.log:time="2022-06-21T11:46:34+02:00" level=trace msg="pinTrace:UnreserveBatch: Batch 0e8366a6fdac185b6f0327dc89af99e67d9d3b3f2af22432542dc5971065c1df Radius 2 Chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 unpinned"
xdais2-162-4.log:time="2022-06-21T11:46:34+02:00" level=trace msg="pinTrace:setUnpin: Chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 decrement pin count now 0 - Deleted!"
xdais2-162-4.log:time="2022-06-21T11:46:34+02:00" level=trace msg="pinTrace:setUnpin: Chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334 unpinned and added to gcIndex"
xdais2-162-4.log:time="2022-06-21T11:49:07+02:00" level=trace msg="pinTrace:collectGarbage: Deleting chunk 4d60c1f4a566be91c865022368962ddfbad0ac04fad587c0d86df52b6a758334"

I will be running another test with even more logging in the next day or two and will hopefully be able to provide additional details as to what the batch reserve is doing to the chunks' pin counters, and in particular which chunks were and were not pinned by preserveOrCache when they were initially retrieved into the node's localstore.

Possible solution

Still trying to determine one. The base issue is that a single pin-counter record is used by both the reserve and the API pin logic, but the reserve eviction sometimes decrements the counter either multiple times or for chunks that were never pinned into the reserve in the first place.
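
One possible direction, sketched against a toy model rather than the real bee localstore (all names here are hypothetical), would be to give the reserve its own marker so that eviction can only release what the reserve itself added:

```go
// Hypothetical sketch of a direction, not a patch against the real bee localstore.
package localstore

// pinState keeps the API pin count separate from the reserve's own marker,
// so eviction can only ever release what the reserve itself added.
type pinState struct {
	apiPins   int  // touched exclusively by the /pins API handlers
	inReserve bool // set when the batch reserve claims the chunk, cleared on eviction
}

func (p *pinState) apiPin() { p.apiPins++ }

func (p *pinState) apiUnpin() {
	if p.apiPins > 0 {
		p.apiPins--
	}
}

func (p *pinState) reservePin() { p.inReserve = true }

// reserveUnpin is the eviction path: it never touches apiPins, so an
// UnreserveBatch sweep cannot cancel a pin created through the API.
func (p *pinState) reserveUnpin() { p.inReserve = false }

// gcEligible reports whether the chunk may be placed into the gcIndex.
func (p *pinState) gcEligible() bool { return p.apiPins == 0 && !p.inReserve }
```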

From a comment made to @significance in a private Discord chat:
And in looking at preserveOrCache's logging, I'm having a new suspicion that the "withinRadiusFn" may simply be looking at the wrong radius when making the preserve/cache decision. It (withinRadius in pkg/localstore/reserve.go) is using the Item's Radius, which has been zero for the items I've logged (which is why I'm going to be logging the non-reserve items next time). I'm thinking it should actually be looking at the current Storage Radius or something like that.
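
Restated in code form (illustrative names only; this is not the actual pkg/localstore/reserve.go implementation), the suspicion amounts to:

```go
// Illustrative only; not the actual pkg/localstore/reserve.go code.
package localstore

// withinRadius restates the suspicion: compare the chunk's proximity order (po)
// to the node's *current* storage radius, rather than to a per-item Radius
// field that may still be zero.
func withinRadius(storageRadius, po uint8) bool {
	return po >= storageRadius
}
```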

ldeffenb added the needs-triaging label on Jun 22, 2022
bee-runner bot added the issue label on Jun 22, 2022
agazso added the bug label and removed the needs-triaging label on Jun 22, 2022
@ldeffenb
Collaborator Author

I've worked around this issue on my own major pinning node by hacking the code so that any API-based pin (from uploading or explicit pinning) increments the pin counter by 10,000, so that UnreserveBatch can't (in any reasonable period of time) decrement it down to zero and cause the chunk to become unpinned.
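
In terms of the toy counter model from the issue description above, the workaround amounts to something like the following sketch (not the actual -dirty patch):

```go
// Sketch of the workaround against the toy shared-counter model; not the actual patch.
package main

import "fmt"

const apiPinWeight = 10_000 // API pins move the shared counter by a large constant

var pinCount = map[string]int{}

func apiPin(addr string) { pinCount[addr] += apiPinWeight }

// reserveUnpin models the (buggy) eviction path, which still decrements by 1.
func reserveUnpin(addr string) { pinCount[addr]-- }

func main() {
	const chunk = "4d60c1f4...758334"
	apiPin(chunk)
	reserveUnpin(chunk)                                // stray decrement from UnreserveBatch
	fmt.Println("still pinned:", pinCount[chunk] > 0) // true: 9,999 > 0
}
```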

@istae
Member

istae commented Feb 27, 2023

Has this issue gone away with the current version (>= v1.12.0)?
As a side note, the localstore rewrite will also tackle the previous pinning issues in a much better way.

@ldeffenb
Collaborator Author

ldeffenb commented Feb 27, 2023

My uploading node is running a hacked (-dirty) version of local pinning that increments the internal pin counter by 10,000 to avoid issues with the reserve's double use of this counter, which I left incrementing/decrementing by 1. But a recent pin-check pass on this node revealed a bunch of chunks with an internal pin counter of 0, 1 or 2 when it should have been ~10,000, so I'm not sure what happened. It is possible that I ran a non-dirty version for a while, but my memory is cloudy on that.

And yes, I'm really looking forward to the updated reserve tracking and multiple-stamp capability of the upcoming localstore rewrite. But who knows what new bugs/oversights it might bring to the node as well. Hopefully we'll have a good period of testnet testing on that before it rolls into the mainnet.

@tmm360
Contributor

tmm360 commented Feb 27, 2023

Will the new version fix the pinning inconsistencies, or would it be better to nuke the db after the storage refactoring is complete?
