sector bitfield proposal #133
Comments
I have a few questions/observations, mostly as checks of whether I understand the problem and proposals.
What I'm getting at is that both proposals require sequential scanning/parsing of the data. Based on my guesses about how this needs to be used, it seems that we would need some amount of metadata or additional structure. Essentially, we would chunk the total data into groups with known parity (say, always beginning with a run of zeroes — though there are other strategies). The borders of these chunks would have known indices (known how depends on the strategy). If we needed to select beginning somewhere other than at a chunk boundary, we would have to scan forward from the beginning of the chunk. Does that sound right? |
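The chunking idea described above could look something like the following sketch (purely illustrative: the `IndexedRle` type and the every-Kth-run indexing strategy are assumptions, not anything proposed in the thread):

```rust
// Hypothetical sketch: alongside the run lengths we keep a small table
// mapping every `chunk_every`-th run boundary to its absolute bit index,
// so a lookup scans forward within one chunk instead of from the start.
struct IndexedRle {
    runs: Vec<u64>,           // run lengths; runs alternate parity, zeroes first
    index: Vec<(usize, u64)>, // (run offset, absolute bit index at that run)
}

impl IndexedRle {
    /// Build an index entry at every `chunk_every`-th run boundary.
    fn new(runs: Vec<u64>, chunk_every: usize) -> Self {
        let mut index = Vec::new();
        let mut bit = 0u64;
        for (i, &len) in runs.iter().enumerate() {
            if i % chunk_every == 0 {
                index.push((i, bit));
            }
            bit += len;
        }
        IndexedRle { runs, index }
    }

    /// Test whether bit `pos` is set: binary-search the index for the right
    /// chunk, then scan forward through at most `chunk_every` runs.
    fn get(&self, pos: u64) -> bool {
        if self.index.is_empty() {
            return false;
        }
        let entry = match self.index.binary_search_by_key(&pos, |&(_, b)| b) {
            Ok(i) => i,
            Err(0) => 0,
            Err(i) => i - 1,
        };
        let (mut run, mut bit) = self.index[entry];
        while run < self.runs.len() {
            bit += self.runs[run];
            if pos < bit {
                return run % 2 == 1; // odd-numbered runs are ones
            }
            run += 1;
        }
        false
    }
}

fn main() {
    // bits 0..7 clear, 7..13 set, 13..19 clear, 19..26 set
    let rle = IndexedRle::new(vec![7, 6, 6, 7], 2);
    assert!(!rle.get(0) && rle.get(7) && !rle.get(13) && rle.get(25));
    assert!(!rle.get(100)); // past the end
}
```

The index trades a little extra space for bounded forward scanning, which matches the "scan forward from the chunk boundary" behavior described above.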
Looking at http://roaringbitmap.org/ it is more complicated but the roaring+run paper (linked at the top) suggests much better compression than rle, as well as more efficient ways to get all set values. |
@dignifiedquire thanks for the link, though this in the summary of the paper makes me wonder
|
Yeah, that seems like an approach. I'm not sure exactly what you're proposing though |
@whyrusleeping there are ready-made libraries for it, so I think it would be best to generate a data set that we think is representative and test the differences, to make sure we choose an optimal solution. Also note that most of that sentence is remedied by the added features in that paper, as far as I understand.
…On 7 Feb 2019, 17:38 +0100, Whyrusleeping wrote:
@porcuquine
• sets do not have to be efficiently updateable
• not sure what you mean by requiring subsets be contiguous
• also not sure what you mean by subsets must be the same size
• yes, random indexing is necessary, but some preprocessing to speed up the random indexing is totally fine.
> The borders of these chunks would have known indices (known how depends on the strategy). If we needed to select beginning somewhere other than at a chunk boundary, we would have to scan forward from the beginning of the chunk.
> Does that sound right?
Yeah, that seems like an approach. I'm not sure exactly what you're proposing though
|
I was trying to unpack what you meant by this:
I thought you were talking about a way to record the whole set, with the ability to extract (select) subsets of the whole thing. I now think you might mean that the bitfield would only be used for the much smaller subsets. Is that right? |
No, I'm just saying that the bitfield to select 100 items from 1,000,000 should take less space to store than a 1,000,000-bit-long bitfield. |
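As a back-of-the-envelope check of this claim, compare a dense 1,000,000-bit field against the 100 selected indices stored as delta-encoded varints (an illustrative encoding, not either proposal; the even spacing of the set bits is an assumption made for the example):

```rust
// Size of a dense bitfield vs. delta-encoded varint indices for a sparse
// selection. Illustrative only; not the encoding from either proposal.
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

fn main() {
    let total_bits: u64 = 1_000_000;
    let dense_bytes = (total_bits as usize + 7) / 8; // 125,000 bytes

    // 100 evenly spaced set bits (an assumption for the example);
    // encode the gaps between consecutive indices as varints.
    let set: Vec<u64> = (0..100u64).map(|i| i * 10_000).collect();
    let mut sparse_bytes = 0;
    let mut prev = 0;
    for &idx in &set {
        sparse_bytes += varint_len(idx - prev);
        prev = idx;
    }

    println!("dense: {} bytes, sparse: {} bytes", dense_bytes, sparse_bytes);
    assert!(sparse_bytes < dense_bytes / 100); // ~200 bytes vs 125,000
}
```

With 10,000-bit gaps each delta fits in two varint bytes, so the sparse form lands around 200 bytes versus 125,000 for the dense field.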
If someone could benchmark various bitfield constructions under different scenarios, that would be super helpful. |
Thanks for the clarification. I agree that some tests on representative data should be the next step. If it looks like RLE might really be better for our use case we can spend more time fiddling with the details. |
I believe CONCISE is a more formal version of the suggested RLE + Extension, which we should test as well. |
Started writing some code for this here: https://github.com/filecoin-project/bitsets
Status:
- Done:
- Missing:
|
@dignifiedquire if you wanted to continue with these benchmarks I wouldn't be upset ;) |
Sounds good. Can you give some feedback on the test cases I listed, and whether you have others in mind that will come up? Also important for the evaluation: how important are the add, remove, and merge operations?
…On 21 Feb 2019, 03:19 +0100, Whyrusleeping wrote:
> @dignifiedquire if you wanted to continue with these benchmarks I wouldn't be upset ;)
|
Current example numbers.
|
@dignifiedquire how many random bits are being set in the random selection one? I'd like to see an array of, say, 100,000-ish bits with 100 random bits set, 1,000 random bits set, and 10,000 random bits set. Then I'd also like to see the same size of thing with N contiguous runs of 100 sectors, for N={1,2,3,5,10,20} |
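The requested inputs could be generated along these lines (an illustrative sketch, not the code from the bitsets repo; the LCG constants and the even spacing of the runs are assumptions made to keep the sketch deterministic and dependency-free):

```rust
// Generate the two benchmark scenarios described above: a 100,000-bit field
// with k random bits set, and one with N contiguous runs of 100 set bits.
const LEN: usize = 100_000;

/// Minimal deterministic PRNG (an LCG), good enough for test-data generation.
fn lcg(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

/// k distinct random bits set out of LEN.
fn random_bits(k: usize, seed: u64) -> Vec<bool> {
    let mut bits = vec![false; LEN];
    let mut state = seed;
    let mut set = 0;
    while set < k {
        let i = (lcg(&mut state) as usize) % LEN;
        if !bits[i] {
            bits[i] = true;
            set += 1;
        }
    }
    bits
}

/// n evenly spaced contiguous runs of `run_len` set bits (spacing assumed).
fn contiguous_runs(n: usize, run_len: usize) -> Vec<bool> {
    let mut bits = vec![false; LEN];
    let spacing = LEN / n;
    for r in 0..n {
        for i in 0..run_len {
            bits[r * spacing + i] = true;
        }
    }
    bits
}

fn main() {
    for &k in &[100, 1_000, 10_000] {
        assert_eq!(random_bits(k, 42).iter().filter(|&&b| b).count(), k);
    }
    for &n in &[1, 2, 3, 5, 10, 20] {
        assert_eq!(contiguous_runs(n, 100).iter().filter(|&&b| b).count(), n * 100);
    }
}
```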
|
@whyrusleeping this should give you a good overview of things (each section is the average over 10 runs) |
@dignifiedquire great results, thank you! Which RLE is that? The one I described, or something else? |
The maximally naive RLE: https://github.com/filecoin-project/bitsets/blob/master/src/main.rs#L185
…On 23 Feb 2019, 00:36 +0100, Whyrusleeping wrote:
> @dignifiedquire great results, thank you! Which RLE is that? The one I described, or something else?
|
I implemented an improved RLE, called RLE+, which I really like for its characteristics: |
@dignifiedquire I feel like something must be wrong with the results. The 2%-bits-set case shows RLE taking up 3x the space of the data itself? 2% bits set implies an average of 50 empty bits for every set bit, which gives plenty of room to compress via RLE. |
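The arithmetic here can be sanity-checked with a quick simulation: at ~2% density, a naive RLE that emits one varint per run should come in well under the raw bitfield size, not 3x above it (illustrative code only; the LCG and the one-varint-per-run size model are assumptions, not the repo's benchmark):

```rust
// Estimate the naive-RLE size of a 100,000-bit field with ~2% of bits set at
// random, and compare it against the raw bitfield size.
fn varint_len(mut v: u64) -> usize {
    let mut n = 1;
    while v >= 0x80 {
        v >>= 7;
        n += 1;
    }
    n
}

/// Size in bytes of a naive RLE: one varint per run, runs alternating and
/// starting with zeroes (possibly a zero-length first run).
fn naive_rle_size(bits: &[bool]) -> usize {
    let mut size = 0;
    let mut cur = false;
    let mut run: u64 = 0;
    for &b in bits {
        if b == cur {
            run += 1;
        } else {
            size += varint_len(run);
            cur = b;
            run = 1;
        }
    }
    size + varint_len(run)
}

fn main() {
    // Deterministic LCG so the result is reproducible without a rand crate.
    let mut state: u64 = 7;
    let bits: Vec<bool> = (0..100_000)
        .map(|_| {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            (state >> 33) % 100 < 2 // ~2% density
        })
        .collect();

    let raw_bytes = bits.len() / 8; // 12,500
    let rle_bytes = naive_rle_size(&bits);
    println!("raw: {} bytes, naive RLE: {} bytes", raw_bytes, rle_bytes);
    assert!(rle_bytes < raw_bytes); // RLE should compress, not blow up 3x
}
```

Most zero-runs at this density fit in a single varint byte (length < 128), so the expected cost is roughly two bytes per set bit, i.e. around 4,000 bytes against 12,500 raw.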
I agree, this does seem strange. Will investigate further. |
Many bugs, little wow in my code. Better results are here now: https://gist.github.com/dignifiedquire/2ee6d850105d2706bf1db5ba6091120f |
RLE+ still looks like a winner on all fronts, though |
and now in a much nicer format: https://gist.github.com/dignifiedquire/76d98419f003e4015402c810af7ab400 |
Cool, I'm liking RLE+. We should write a decoder and fuzz it a bit, but I think this is good enough for now. We probably want to put a marker in front of the serialized bitmap to identify the encoding type. |
Alright: decoder + tests are up and running: https://github.com/filecoin-project/bitsets/blob/master/src/rleplus.rs |
I also wrote a basic spec describing the format here: https://github.com/filecoin-project/bitsets/blob/master/src/rleplus.rs#L1 |
@dignifiedquire this is great! |
Could we copy that specification into a document in this repo, and then also add a prefix marker to the serialized set so that we could hypothetically change it in the future? |
Problem
Filecoin miners need a compact way to specify which of a (large) set of sectors are either bad or complete. The number of sectors in the set may be in the tens of millions, and a bitset selecting a subset of them must be much, much smaller than that.
If sectors are selected randomly, then it is unlikely that we can gain much over one bit per sector. However, it is very likely that miners will fill disks up one at a time, so sectors added around the same time would likely be physically near one another (on the same disk). If a disk failure occurred, an entire range of sectors would be lost, which is very easy to represent with something like run-length encoding.
Proposal 0 - Simple RLE
Encode the bitset as follows:
This works pretty well in most ideal cases, but breaks down if bits are set randomly (or in some pattern, like 011011011011).
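For concreteness, here is a minimal sketch of the simple-RLE encoder, assuming the convention suggested by the worked example in Proposal 1 (alternating varint run lengths, always starting with the run of zeroes):

```rust
// Minimal sketch of Proposal 0: emit one unsigned varint per run, runs
// alternating between zeroes and ones, starting with the zeroes run
// (possibly of length 0 if the first bit is set). Assumed convention,
// inferred from the Proposal 1 example.
fn put_uvarint(buf: &mut Vec<u8>, mut v: u64) {
    while v >= 0x80 {
        buf.push((v & 0x7f) as u8 | 0x80);
        v >>= 7;
    }
    buf.push(v as u8);
}

fn encode_rle(bits: &[bool]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut cur = false; // first run is zeroes
    let mut run: u64 = 0;
    for &b in bits {
        if b == cur {
            run += 1;
        } else {
            put_uvarint(&mut out, run);
            cur = b;
            run = 1;
        }
    }
    put_uvarint(&mut out, run);
    out
}

fn main() {
    // 7 zeroes, 6 ones, 6 zeroes, 7 ones -> runs 7, 6, 6, 7
    let mut bits = Vec::new();
    bits.extend(std::iter::repeat(false).take(7));
    bits.extend(std::iter::repeat(true).take(6));
    bits.extend(std::iter::repeat(false).take(6));
    bits.extend(std::iter::repeat(true).take(7));
    assert_eq!(encode_rle(&bits), vec![7, 6, 6, 7]);
}
```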
Proposal 1 - RLE + Extension
To address the cases in which the previous proposal is inefficient, we can give the encoding the ability to specify bitfields directly. To do this, we can say that while reading the encoding, if a negative varint is parsed, that many bytes should be read and treated as the exact bitfield for the duration.
For example:
000000011111100000011111110110110001010101
would encode as
varint<7> varint<6> varint<6> varint<7> varint<-2> []byte{108, 85}
(implementation note: it is sufficient just to implement a correct parser for this, and leave optimization of the encoding up to individual implementations)
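A sketch of such a parser follows. It leans on several assumptions the proposal leaves open: zigzag-encoded LEB128 varints are assumed for representing negatives, bits within literal bytes are taken most-significant-first to match the worked example (0b01101100 = 108), and the run parity after a literal block is left unchanged as a guess:

```rust
// Sketch of a Proposal 1 decoder: positive varints are run lengths that
// alternate parity starting with zeroes; a negative varint -k means "read k
// literal bytes as an exact bitfield". Varint format and bit order are
// assumptions, not fixed by the proposal.
fn read_varint(data: &[u8], pos: &mut usize) -> i64 {
    let (mut shift, mut raw) = (0u32, 0u64);
    loop {
        let byte = data[*pos];
        *pos += 1;
        raw |= ((byte & 0x7f) as u64) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // zigzag decode (assumption: how negative values are represented)
    ((raw >> 1) as i64) ^ -((raw & 1) as i64)
}

fn decode(data: &[u8]) -> Vec<bool> {
    let mut bits = Vec::new();
    let mut pos = 0;
    let mut cur = false; // runs alternate, starting with zeroes
    while pos < data.len() {
        let v = read_varint(data, &mut pos);
        if v >= 0 {
            bits.extend(std::iter::repeat(cur).take(v as usize));
            cur = !cur;
        } else {
            // literal block: -v bytes, most-significant bit first
            for _ in 0..(-v) {
                let byte = data[pos];
                pos += 1;
                for i in (0..8).rev() {
                    bits.push(byte & (1 << i) != 0);
                }
            }
            // parity after a literal block is unspecified; kept unchanged here
        }
    }
    bits
}

fn main() {
    // The worked example: varint<7> varint<6> varint<6> varint<7> varint<-2>
    // [108, 85]; zigzag-encoded, 7 -> 14, 6 -> 12, -2 -> 3.
    let encoded = [14u8, 12, 12, 14, 3, 108, 85];
    let s: String = decode(&encoded)
        .iter()
        .map(|&b| if b { '1' } else { '0' })
        .collect();
    assert_eq!(s, "000000011111100000011111110110110001010101");
}
```

Fuzzing this parser, as suggested earlier in the thread, would mainly need to exercise truncated varints, out-of-range literal lengths, and trailing partial bytes.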