UnreserveBatch incorrectly unpins chunks #3037
Comments
I've worked around this issue on my own major pinning node by hacking the code so that any API-based pins (from uploading and explicit pinning) increment the pin counter by 10,000, so that UnreserveBatch can't (in any reasonable period of time) decrement it down to zero and cause the chunk to become unpinned.
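For illustration only, here is a toy sketch of that workaround in Go; the counter map and constant are hypothetical stand-ins, not bee's actual localstore code:

```go
package main

import "fmt"

// Toy sketch of the workaround described above (hypothetical names): the
// same per-chunk counter is shared by the /pins API and the batch reserve,
// so the hacked API pin adds a constant that single decrements cannot
// realistically exhaust.
const apiPinIncrement = 10000 // stock code increments by 1

func main() {
	pinCounter := map[string]uint64{}
	addr := "chunk-address" // placeholder for a swarm chunk address

	// Hacked API pin: jump the counter far above anything the
	// UnreserveBatch/evict path will decrement away one at a time.
	pinCounter[addr] += apiPinIncrement

	// A reserve-side decrement, still by 1:
	if c := pinCounter[addr]; c > 0 {
		pinCounter[addr] = c - 1
	}

	fmt.Println(pinCounter[addr]) // 9999: effectively still pinned
}
```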
Has this issue gone away with the current version (>= v1.12.0)?
My uploading node is running a hacked (-dirty) version of local pinning that increments the internal pin counter by 10,000, to avoid issues with the reserve's double use of this counter (the reserve itself still increments/decrements by 1). But a recent pin-check pass on this node revealed a bunch of chunks with an internal pin counter of 0, 1, or 2 when it should have been ~10,000, so I'm not sure what happened. It's possible that I ran a non-dirty version for a while, but my memory is cloudy on that.

And yes, I'm really looking forward to the updated reserve tracking and multiple-stamp capability of the upcoming localstore rewrite. But who knows what new bugs/oversights it might bring to the node as well. Hopefully we'll have a good period of testnet testing on that before it rolls into the mainnet.
Will the new version fix pinning inconsistencies, or would it be better to nuke the db after the storage refactoring is complete?
Context
bee 1.6.2-684a7384, but this has been happening for a long time now, based on my anecdotal observations of pinned chunks subsequently being retrieved from the swarm.
Summary
I am pinning my entire OSM tile collection (reference below) to a freshly db-nuked node. Each individual manifest node and each referenced file is explicitly pinned via POST /pins/{reference} (modified so that it doesn't attempt to traverse the entire manifest internally). But even after pinning, chunks for pinned manifest nodes are still being retrieved from the swarm.
I added extensive logging to my -dirty node and learned that the batch reserve uses the same underlying pin-counter data store as the /pins API. When the local reserve capacity is exceeded, the Storage Radius is incremented, causing portions of the reserve to be evicted. However, in some cases the evictor decrements the pin counter and removes the pin, allowing chunks to be gc'd even though the corresponding API pin record still exists.
Once a chunk is in the gcIndex, it's only a matter of time before it is actually deleted from the localstore, causing subsequent /bytes invocations to go to the swarm for chunk retrieval.
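To make that lifecycle concrete, here is a toy Go model of the failure mode as described above: a single counter shared by two subsystems, where a spurious reserve-side decrement makes the chunk gc-eligible while the API's own pin record survives (all names hypothetical):

```go
package main

import "fmt"

// Toy model of the reported failure mode: one shared counter with two
// users (the /pins API and the batch reserve). An extra reserve-side
// decrement zeroes the counter, making the chunk gc-eligible even though
// the API pin record still exists.
func main() {
	pinCounter := 0
	apiPinRecord := false

	pinCounter++        // chunk retained by the batch reserve
	pinCounter++        // POST /pins/{reference} increments the same counter
	apiPinRecord = true // ...and writes the API pin record

	pinCounter-- // UnreserveBatch evicts the chunk from the reserve
	pinCounter-- // extra/spurious decrement, as observed in the logs

	gcEligible := pinCounter == 0
	// GET /pins/{reference} still answers "pinned", but the chunk is now
	// in line for deletion by the garbage collector.
	fmt.Println("gc-eligible:", gcEligible, "API pin record:", apiPinRecord)
}
```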
Details of my findings appear below, but this is really bad: I can't rely on a pinned dataset to remain local!
Expected behavior
API pinned chunks should not be unpinned except via the API.
Actual behavior
References pinned via the API are being unpinned and garbage-collected, causing subsequent /bytes invocations to retrieve the chunks from the swarm instead of reading them locally. Yet GET /pins/{reference} still claims that the reference is pinned!
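As a quick way to observe the mismatch, here is a minimal Go sketch that queries both endpoints against a local node. The API address is an assumption (bee's default API port is 1633), and whether /bytes was served locally or via swarm retrieval is only visible in the node's logs, not in the HTTP response:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
)

// Check whether a reference that GET /pins/{reference} reports as pinned
// can still be fetched via GET /bytes/{reference}.
func main() {
	ref := os.Args[1] // e.g. a pinned manifest or file reference
	api := "http://localhost:1633"

	for _, path := range []string{"/pins/", "/bytes/"} {
		resp, err := http.Get(api + path + ref)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		resp.Body.Close()
		// 200 on /pins means "pinned"; 200 on /bytes means "retrievable",
		// possibly only after a swarm round-trip.
		fmt.Printf("GET %s%s -> %s\n", path, ref, resp.Status)
	}
}
```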
Steps to reproduce
This can take a LONG time to show the problem because you need to exceed the 5 million chunk reserve capacity (possibly twice) AND have the garbage collector actually delete the evicted chunks before the problem becomes visible.
Note: I'm using https://github.com/ldeffenb/Swarming-Wikipedia/blob/main/src/pin-reference.ts for step 2, but I can't stress enough that you need to have a modified pinner in the target node that doesn't attempt to pin an entire manifest with a single invocation. Mainnet reference f09da1184cc9ef6af3b228f72c6ff965fcfee58b47d65d52ef6cf4e5347c766e contains 55+ million chunks and will likely not succeed if passed to a stock POST /pins/f09da1184cc9ef6af3b228f72c6ff965fcfee58b47d65d52ef6cf4e5347c766e
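For reference, a minimal Go sketch of the client-side loop that pin-reference.ts performs (the real script is TypeScript, linked above). It assumes the per-node references have already been extracted, one per line on stdin, and that the target node's pin handler has been modified as described to pin only the given reference without traversing the whole manifest:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
)

// POST /pins/{reference} once per manifest-node / file reference.
func main() {
	api := "http://localhost:1633" // assumed local bee API address
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		ref := sc.Text()
		if ref == "" {
			continue
		}
		resp, err := http.Post(api+"/pins/"+ref, "application/json", nil)
		if err != nil {
			fmt.Fprintf(os.Stderr, "pin %s: %v\n", ref, err)
			continue
		}
		resp.Body.Close()
		fmt.Printf("POST /pins/%s -> %s\n", ref, resp.Status)
	}
}
```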
I have also enhanced my node to have extensive logging around the pin-counter updates and have caught the proverbial "smoking gun" in the following set of logs. In both cases you can see that the garbage collector didn't actually delete the chunk until 3 minutes after it was unpinned by UnreserveBatch.
This is a subordinate chunk of a file reference:
And this is a directly pinned chunk that suffered the same fate:
I will be running another test with even more logging in the next day or two, and hopefully will be able to provide additional details about what the batch reserve is doing to the chunks' pin counters, and in particular which chunks were and were not pinned by preserveOrCache when initially retrieved into the node's localstore.
Possible solution
Still trying to determine one. The base issue is that a single pin-counter record is used by both the reserve and the API pin logic, yet the reserve evict sometimes decrements the counter either multiple times or for chunks that were never originally pinned into the reserve.
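One direction this could take, sketched in Go with entirely hypothetical types (not bee's actual data structures): keep the API pin count and the reserve's retention state separate, so that reserve eviction can never zero out an API pin:

```go
package main

import "fmt"

// Separate the two concerns that currently share one counter.
type pinState struct {
	apiPins   uint64 // touched only by the /pins API
	inReserve bool   // touched only by batch-reserve logic
}

func (p pinState) gcEligible() bool {
	return p.apiPins == 0 && !p.inReserve
}

func main() {
	p := pinState{}
	p.apiPins++        // POST /pins/{reference}
	p.inReserve = true // also retained by the batch reserve

	p.inReserve = false // UnreserveBatch evicts the batch...
	// ...but the API pin still protects the chunk from gc.
	fmt.Println("gc-eligible:", p.gcEligible()) // false
}
```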
From a comment made to @significance in a private Discord chat:
And in looking at preserveOrCache's logging, I have a new suspicion that the "withinRadiusFn" may simply be looking at the wrong radius when making the preserve/cache decision. It (withinRadius in pkg/localstore/reserve.go) is using the Item's Radius, which has been zero for the items I've logged (which is why I'm going to be logging the non-reserve items next time). I'm thinking it should actually be looking at the current Storage Radius or something like that.
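A paraphrase of that suspicion in Go, with hypothetical names and signatures (the real function lives in pkg/localstore/reserve.go): if the per-item Radius field is zero, every chunk trivially passes the "within radius" check, so the preserve/cache decision is made on the wrong basis:

```go
package main

import "fmt"

type item struct {
	Radius uint8 // observed as zero in the reporter's logs
}

// Comparison as suspected today: against the item's own Radius field.
func withinRadiusSuspect(po uint8, it item) bool {
	return po >= it.Radius
}

// Comparison as the comment suggests it perhaps should be: against the
// node's current Storage Radius.
func withinRadiusProposed(po uint8, storageRadius uint8) bool {
	return po >= storageRadius
}

func main() {
	it := item{Radius: 0}
	po, storageRadius := uint8(3), uint8(7)
	fmt.Println(withinRadiusSuspect(po, it))             // true: 3 >= 0
	fmt.Println(withinRadiusProposed(po, storageRadius)) // false: 3 < 7
}
```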