This repository has been archived by the owner on Aug 2, 2021. It is now read-only.

LocalStore GC with Quantiles #2035

Open
janos opened this issue Dec 11, 2019 · 0 comments

Comments

@janos
Member

janos commented Dec 11, 2019

Previous discussions

localstore/db how to insert items in GC index #1031
Spam protection of Swarm through modifying GC rules #757

Multiple ideas were discussed previously, and insertion at quantiles is recognized as the best approach. It is beneficial to briefly read those discussions in order to get the wider picture, even though most of the ideas are now obsolete.

Current state

The LocalStore package storage/localstore implements a high-level API for chunk (chunk.Chunk) disk persistence. It exposes interfaces to store, get and change the state of individual chunks. The implementation uses the LevelDB key/value store indirectly, through shed abstractions, most notably shed.Index, which is made specific to localize encoding and decoding of chunk-related data defined in shed.Item. It is not a general-purpose index.

Garbage collection is an internal (not exported) functionality of LocalStore. It uses a shed.Index as a queue of chunks that can be removed by garbage collection. Chunks are put into the GC index when they are considered synced by other methods of LocalStore (Set or Put). The ordering of chunks in the GC index is changed when a chunk is accessed (by the Get method), moving it to the top of the queue. A garbage collection run happens when the number of chunks in the GC index reaches a certain limit; it iterates over the index from the tail, removing chunks from the GC index and other LocalStore indexes, most importantly the retrieval data index where the chunk data is stored. This iteration should be as performant as possible, meaning that it should not contain any additional operations such as LevelDB Get, which can slow down iteration significantly.

Chunk lifetime in the GC queue:

  • inserted at the top (by constructing the appropriate index key with access timestamp)
  • moved to the top when it is accessed (by changing the index key to have the current access timestamp, with two operations: removing the old key and inserting a new one)
  • removed from the bottom by garbage collection run
  • optionally removed manually by using the Set method with ModeSetRemove
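The ordering that makes this queue behaviour possible can be illustrated with a short sketch. This is not the actual localstore key encoding; the key layout and the gcKey helper are assumptions used only to show why an access-timestamp prefix orders the index the way the list above describes.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// gcKey sketches how a GC index key could be laid out so that lexicographic
// ordering matches access-time ordering: an 8-byte big-endian access
// timestamp followed by the chunk address.
func gcKey(accessTimestamp int64, address []byte) []byte {
	key := make([]byte, 8+len(address))
	binary.BigEndian.PutUint64(key[:8], uint64(accessTimestamp))
	copy(key[8:], address)
	return key
}

func main() {
	older := gcKey(1000, []byte("chunk-a"))
	newer := gcKey(2000, []byte("chunk-b"))
	// The less recently accessed chunk sorts first, so iterating the index
	// from that end visits garbage collection candidates in eviction order.
	fmt.Println(bytes.Compare(older, newer) < 0) // true
}
```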

GC index insertion at quantiles

Garbage collection can be improved to prioritize some chunks over others. The main reason for prioritization is the different profitability of chunks, incentivizing more profitable chunks to be stored for a longer time than less profitable ones.

The required change is a new way of inserting chunks into the GC index. Less profitable chunks should initially be inserted into the GC index at a lower position, which essentially means defining an index key that ensures such ordering. A concept of quantiles can be used to achieve that. Quantiles can be tracked by keeping a map of quantile values to the index keys that are currently at those quantiles. When inserting a new chunk into the GC index, the quantile where it should be inserted must be calculated from its profitability, and an appropriate index key should be constructed so that the desired ordering is achieved. There may be multiple ways to achieve this, but all involve defining appropriate values in shed.Item. It is possible that an additional value needs to be added to shed.Item, or that AccessTimestamp can be adjusted instead. The cleanest approach should be found.
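A minimal sketch of the idea, assuming a fixed set of quantiles and a profitability score in [0, 1); the type, function names and thresholds are illustrative assumptions, not existing code:

```go
package main

import "fmt"

// quantileBoundaries maps a quantile (e.g. 0.25, 0.5, 0.75) to the access
// timestamp of the GC index item currently sitting at that position. The
// profitability scale and all names here are assumptions for illustration.
type quantileBoundaries map[float64]int64

// insertionTimestamp picks the timestamp component of the new GC index key:
// a less profitable chunk receives the timestamp of a lower quantile
// boundary and is therefore ordered closer to the bottom of the queue,
// where it will be garbage collected sooner.
func insertionTimestamp(b quantileBoundaries, profitability float64, now int64) int64 {
	switch {
	case profitability < 0.25:
		return b[0.25]
	case profitability < 0.5:
		return b[0.5]
	case profitability < 0.75:
		return b[0.75]
	default:
		// The most profitable chunks are inserted at the top, as before.
		return now
	}
}

func main() {
	b := quantileBoundaries{0.25: 500, 0.5: 1000, 0.75: 1500}
	fmt.Println(insertionTimestamp(b, 0.1, 2000)) // 500: near the bottom quartile
	fmt.Println(insertionTimestamp(b, 0.9, 2000)) // 2000: top of the queue
}
```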

Tracking quantiles requires updating the quantile map every time a new chunk is inserted or moved in the GC index, as positions change. This may be a costly operation, but it should have as small an impact on performance as possible.
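A toy illustration of why boundaries must be maintained on every insertion, using a sorted slice as a stand-in for the ordered index; in the real store a boundary would be moved with an index iterator rather than by re-sorting:

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// Access timestamps, oldest first, standing in for the GC index order.
	queue := []int64{100, 200, 300, 400}
	median := func() int64 { return queue[len(queue)/2] }
	fmt.Println(median()) // 300: current 0.5 quantile boundary

	// Inserting an item below the boundary shifts every boundary above it.
	queue = append(queue, 150)
	sort.Slice(queue, func(i, j int) bool { return queue[i] < queue[j] })
	fmt.Println(median()) // 200: the boundary moved down by one position
}
```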

If a different persistent data structure is found to be better than shed.Index, it may be considered for implementation, but it must ensure that irrecoverable data inconsistencies are impossible. This is currently achieved by using a LevelDB batch for writes.
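The consistency guarantee mentioned here comes from LevelDB write batches: moving an item between two index positions is a delete plus a put applied atomically, so a crash cannot leave only one of the two operations written. A minimal sketch using the goleveldb package (the key strings are placeholders, not the real index key encoding):

```go
package main

import (
	"log"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	db, err := leveldb.OpenFile("/tmp/example-gc-db", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	batch := new(leveldb.Batch)
	batch.Delete([]byte("gc|oldAccessTimestamp|chunkAddress"))  // remove old position
	batch.Put([]byte("gc|newAccessTimestamp|chunkAddress"), nil) // insert new position
	// The batch is applied atomically, avoiding irrecoverable inconsistencies.
	if err := db.Write(batch, nil); err != nil {
		log.Fatal(err)
	}
}
```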

The new GC algorithm should have unit tests for correctness, and a series of measurements with a larger data size should be performed to measure performance. A larger data size is considered to be around 20GB of storage, which is the default value for storage size.

Chunks in the last quantile

An additional feature is that chunks considered to belong to the last quantile should not be stored in LocalStore at all, not even in the retrieval data index.
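In other words, the profitability check would happen before any index is written. A trivial sketch of that gate, with an assumed threshold and function name:

```go
package main

import "fmt"

// lastQuantileThreshold is an assumed cutoff: chunks whose profitability
// falls below it belong to the last quantile.
const lastQuantileThreshold = 0.25

// shouldStore mirrors the proposal above: only chunks outside the last
// quantile are written to the retrieval data index and the GC index.
func shouldStore(profitability float64) bool {
	return profitability >= lastQuantileThreshold
}

func main() {
	fmt.Println(shouldStore(0.1)) // false: the chunk is not persisted at all
	fmt.Println(shouldStore(0.9)) // true: stored and inserted into the GC index
}
```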

GC abstraction

Garbage collection functionality in LocalStore may be abstracted into its own package and injected into LocalStore as an option in the constructor. This is not crucial, but it would improve the overall design.
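A sketch of what such an injection could look like, using the functional options pattern; the Collector interface, Option type and constructor signature are illustrative assumptions, not the existing localstore API.

```go
package localstore

// Collector abstracts the garbage collection strategy so that alternative
// implementations (for example a quantile-based one) can be swapped in.
type Collector interface {
	// Run removes chunks until the store is back under its capacity target
	// and reports how many chunks were collected.
	Run(db *DB) (collected uint64, err error)
}

// Option configures a DB during construction.
type Option func(*DB)

// WithCollector overrides the default garbage collector.
func WithCollector(c Collector) Option {
	return func(db *DB) { db.collector = c }
}

// DB is reduced here to the field relevant for the sketch.
type DB struct {
	collector Collector
}

// New constructs a DB and applies the given options.
func New(path string, opts ...Option) *DB {
	db := &DB{}
	for _, o := range opts {
		o(db)
	}
	return db
}
```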

Experimental quantiles branch

Some prototyping code can be found in branch https://github.com/ethersphere/swarm/compare/gc-quantiles-updated.

@janos janos added this to the 1.0 milestone Dec 11, 2019
@janos janos added this to Backlog in Swarm Core - Sprint planning via automation Dec 11, 2019
@acud acud added the v1.0 label Jan 6, 2020