Erasure coding for storage proofs #184
Conversation
design/proof-erasure-coding.md
Outdated
We are, however, unaware of any implementations of Reed-Solomon that use a field size larger than 2^16 while remaining efficient, i.e. O(N log(N)). [FastECC][2] uses a prime field of 20 bits, but it lacks a decoder and a byte encoding scheme. The paper ["An Efficient (n,k) Information Dispersal Algorithm Based on Fermat
---
The FastECC decoder is included in the private repo available to you: https://github.com/Bulat-Ziganshin/private/tree/main/FastECC2. This code isn't fully systematic for binary data, as it works in non-binary fields, but efficient encodings with very little waste exist.
OTOH, Leopard employs almost the same algorithm (only replacing the FFT with an "additive FFT") and works in binary fields, so why do you need other code at all? Leopard should be extensible to GF(2^32); ask the author (or just provide GF(2^32) operations, in addition to the 8- and 16-bit ones included in the repo). Yeah, it may have slower division, but 1/x = x^(2^32-2) should still work.
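The exponentiation trick generalizes: in any GF(2^n), x^(2^n-1) = 1 for nonzero x, so the inverse is x^(2^n-2). A minimal sketch, scaled down to GF(2^8) with the polynomial 0x11B for brevity (the field size and polynomial are illustrative, not Leopard's actual parameters):

```python
# Sketch: multiplicative inverse via exponentiation in a binary field.
# Scaled down to GF(2^8) for illustration; the same idea extends to
# GF(2^32), where the inverse is x^(2^32 - 2).

def gf_mul(a: int, b: int, poly: int = 0x11B, bits: int = 8) -> int:
    """Carry-less multiply of a and b, reduced modulo poly."""
    result = 0
    for i in range(bits):
        if (b >> i) & 1:
            result ^= a << i
    # Reduce the product modulo the field polynomial.
    for i in range(2 * bits - 2, bits - 1, -1):
        if (result >> i) & 1:
            result ^= poly << (i - bits)
    return result

def gf_inv(a: int, bits: int = 8) -> int:
    """Inverse as a^(2^bits - 2), by square-and-multiply."""
    result, base, exp = 1, a, (1 << bits) - 2
    while exp:
        if exp & 1:
            result = gf_mul(result, base)
        base = gf_mul(base, base)
        exp >>= 1
    return result

# Every nonzero element times its computed inverse gives 1.
for x in range(1, 256):
    assert gf_mul(x, gf_inv(x)) == 1
```

The same square-and-multiply loop costs about 2n field multiplications for GF(2^n), which is why division this way is slower than multiplication but still workable.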
---
This is actually very interesting. We do care about non-binary fields because they might perform better for some specific cryptographic verification schemes: SNARKs, polynomial commitments, etc.
---
Multiplication in GF(p) with large p is implemented as a*b mod p, which for fixed p can be somewhat optimized with the widely known "division by a known constant" algorithm, which involves several ordinary (binary) multiplications.
OTOH, optimized multiplication in GF(2^n) is implemented by splitting numbers into 4- or 8-bit "digits" and using multiplication-table lookups for them. BTW, Leopard AFAIK provides 12-bit codes, so you may add 20-bit or 24-bit ones, which should work faster than 32-bit.
If binary multiplications are faster than table lookups, then indeed GF(p) arithmetic can be faster than GF(2^n).
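To make the contrast concrete, here is a rough sketch of both strategies. The prime is the 32-bit one mentioned elsewhere in this thread; the GF(2^8) log/exp tables are a stand-in for the wider fields a real codec would use:

```python
# Sketch contrasting the two multiplication strategies described above.
# The prime and the GF(2^8) parameters are illustrative.

P = 0xFFF00001  # the 32-bit prime discussed in this thread

def gfp_mul(a: int, b: int) -> int:
    # "a*b mod p": for a fixed p, a C implementation replaces the modulo
    # with a few binary multiplications (division by a known constant);
    # Python's % operator hides that optimization.
    return (a * b) % P

# GF(2^8) multiplication via log/exp table lookups, built over the
# polynomial 0x11B with generator 3.
EXP = [0] * 510
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    nx = x ^ (x << 1)          # multiply by the generator, 3
    if nx & 0x100:
        nx ^= 0x11B            # reduce modulo the field polynomial
    x = nx
for i in range(255, 510):      # duplicate so LOG[a]+LOG[b] never overflows
    EXP[i] = EXP[i - 255]

def gf2_mul(a: int, b: int) -> int:
    """Table-lookup multiplication in GF(2^8)."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]
```

Which strategy wins depends on the hardware, as the comment says: if binary multiplies are cheaper than the cache traffic of table lookups, GF(p) comes out ahead.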
====
So, what do you need now?
- The algorithms (both FastECC and Leopard) were published by third-party researchers in 2014-2015, so they are public property.
- I can quickly add a public-domain moniker to my fastecc2 code, and overall I plan to push it into the public FastECC repo (sigh).
- Efficient recoding to binary is possible when p is close to some 2^n; in particular, check https://github.com/Bulat-Ziganshin/FastECC/blob/master/GF.md#efficient-data-packing
- A long-standing approach to discovering an RS codec's potential performance is to build its computation model. E.g. the FastECC coder executes `k*n*log2(n)` multiplications in GF(p), which results in `k2*n*log2(n)` 32-bit multiplications (we need to find k and then k2). Then, by learning how much time we spend executing a GF(p) or 32-bit multiplication in a given SNARK setup, we will be able to predict the performance.
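That computation model can be sketched directly. Below, c1 and c2 are the unknown constants to be found by profiling (the comment's k and k2, renamed so they don't clash with the usual code-rate k); the numbers fed in are placeholders:

```python
# Sketch of the proposed computation model: count field multiplications
# for an n-point transform, then scale by measured per-multiplication cost.
# c1, c2, and mul32_ns are placeholders to be determined by profiling.
import math

def fastecc_gfp_muls(n: int, c1: float = 1.0) -> float:
    """Roughly c1 * n * log2(n) GF(p) multiplications for an n-point transform."""
    return c1 * n * math.log2(n)

def predicted_time_ns(n: int, c1: float, c2: float, mul32_ns: float) -> float:
    """Each GF(p) multiplication costs c2 binary 32-bit multiplications,
    each taking mul32_ns nanoseconds on the target setup."""
    return fastecc_gfp_muls(n, c1) * c2 * mul32_ns

# Example: n = 2**16 shards, assuming 3 binary muls per GF(p) mul at 1 ns each.
print(predicted_time_ns(2**16, 1.0, 3.0, 1.0))  # nanoseconds
```

Once c1 and c2 are measured for a codec, the same formula predicts cost inside a SNARK by swapping in the SNARK's per-multiplication cost.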
How else can I help?
---
[FastECC][2] uses a prime field of 20 bits
Check lucky numbers. The prime I used for efficiency of SIMD computations is a 32-bit one, there is a nice 61-bit one, and we can implement "complex arithmetic", i.e. over GF(p^2), to work with 61*2 bits while still using operations on 64-bit binary numbers 🤣
Basically, using k-bit arithmetic (with a 2k-bit MUL result), we can work with any GF(p) where p < 2^k, or, using "complex numbers", with any GF(p^2), p < 2^k. But practically speaking, p = 0xFFF00001 should be enough for any current needs.
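What makes p = 0xFFF00001 "lucky" can be checked in a few lines: it fits in 32 bits, and p - 1 is divisible by 2^20, so GF(p) has roots of unity supporting radix-2 NTTs of length up to 2^20. A sketch:

```python
# Sketch checking the convenient properties of the FFT-friendly prime
# p = 0xFFF00001 = 2^32 - 2^20 + 1 mentioned in this thread.

p = 0xFFF00001
assert p < 2**32                 # fits a 32-bit word
assert p == 2**32 - 2**20 + 1
assert (p - 1) % 2**20 == 0      # enables radix-2 transforms up to length 2^20
assert (p - 1) == 2**20 * 4095

# Fermat checks with a few bases, enough for illustration
# (a full primality proof would use deterministic Miller-Rabin).
for a in (2, 3, 5, 7, 11):
    assert pow(a, p - 1, p) == 1

print("all checks passed")
```

The same divisibility criterion is what to look for when hunting larger "lucky" primes, e.g. 61-bit ones.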
---
catid/leopard#2 discusses implementing fields larger than 16 bits in Leopard
---
we do care about non-binary fields because they might perform better for some specific cryptographic verification schemes, SNARKS, polynomial commitments, etc...
At least in this documentation, the SNARK is used only to hash a small subset of blocks, i.e. the ECC codec itself only has to run directly on the CPU.
---
Hey, super helpful stuff, but I (we) need some time to digest it :-)
---
Overall, if we use a GF(p^n) codec, we should stick to p values close to some 2^k, since they allow efficient recoding to/from binary data (like losing only 1 bit per 4 KiB with p=0xFFF00001). This encoding scheme is generic.
Other implications:
- If the p used is close to 2^k, so we recode to k-bit numbers, RS coding will have to use blocks of `i*k` bits. E.g. with k=61, we can't use exactly 1 KiB source blocks; instead we have to use `8235 = 61*135`-bit blocks, adding 43 zero bits to each 1 KiB block, and then having to keep the extra 43 (non-zero!) bits for each 1 KiB parity block. These extra data can be kept together with the block checksum and other control data, so it's possible, but it adds some complications.
- If p > 2^k, then the source data are fine (all their k-bit fields are < p), but the parity data need to be recoded and the extra bits stored in control data. If p < 2^k, it's the opposite: we have to recode the source data and then reliably store the extra bits (e.g. one bit is required to distinguish between 0 and 0xFF if p=0xFF), since we will need these extra bits to successfully restore a source block from other data if it was lost.
---
Hi Bulat! Thanks for the great comments.
FastECC decoder is included in the private repo available to you
Thank you, I didn't know that it was already implemented. I'll update the text to reflect this.
This code isn't fully systematic for binary data as it works in non-binary fields
3. Efficient recoding to binary is possible when p is close to any 2^n, in particular check https://github.com/Bulat-Ziganshin/FastECC/blob/master/GF.md#efficient-data-packing
If I understand this correctly, it would mean that the parity data would be a little bigger than normal, because of the need to represent elements from a field that is slightly larger than 2^n. But I expect that we can leave the original data untouched, because we can always encode it into field elements when needed? So in that respect the code is systematic, because it leaves the original data untouched?
OTOH, Leopard employs almost the same algo (only replacing FFT with "additive FFT") and works in binary fields, so why do you need other code at all?
Good question, I was under the impression that this would lead to prohibitively large tables in Leopard, and that this wouldn't be the case with FastECC. But I could very well be mistaken.
I added a description of the multi-dimensional schemes that were described by @bkomuves in sampling.pdf and discussed in our research meeting. I also added the ability to play with multi-dimensional parameters to the spreadsheet.
---
Overall I think it is a very nice write-up that clearly explains the trade-offs between erasure-coding Galois-field limitations and SNARK hashing limitations.
Most of the analysis is motivated by the need to support very large slot sizes, which, I would argue, is not necessarily urgent. We could limit dataset sizes in the first MVP, and 4 GB (or smaller) slots should be sufficient.
Also, most of the write-up is motivated by a non-rational adversarial attacker that wants to remove the minimum amount of data needed to render the slot non-recoverable while still trying to pass the proofs. However, I am not sure how much sense such an attack would make, given that the attacker (in theory) does not control the other slots on other storage nodes. The "attack" would only really work if the other storage nodes did the same.
If an attacker controls enough nodes of a given dataset to render the global dataset truly non-recoverable while still providing false proofs (exploiting weaknesses of 1D erasure coding, or simply no erasure coding), then the fact that the attacker managed to provide false proofs is the least of our concerns; the real concern is that the attacker can simply withhold or completely erase the data.
With this in mind, I think it is good that we have this analysis, and we should make an effort to cover all the cases exposed here; but to put things in perspective, I would place this analysis in the context of a very specific scenario: providing accurate proofs for huge datasets in the presence of non-rational adversarial attackers.
design/proof-erasure-coding.md
Outdated
For the first two items we'll use a different erasure coding scheme than we do for the last. In this document we focus on the last item: an erasure coding scheme that makes it easier to detect missing or corrupted data on a host through storage proofs.
---
So basically, for the rest of the write-up, the entire dataset is inside one single storage node, and it corresponds to what we call a slot, correct? If so, it would be good to mention it explicitly.
---
I'm also not yet convinced that we can say "we use different erasure coding" and handle them separately. In the construct I'm thinking of, I'm not sure the erasure coding can be separated this way.
---
what we call a slot
Thanks, I've used 'slot data' in the updated version
not yet convinced we can say "we use different erasure coding"
In this proposed design we use a different erasure coding. It would be really cool if we could use a single erasure coding scheme for all purposes, but I haven't seen a design for it yet?
The disadvantage of interleaving is that it weakens the protection against adversarial erasure that Reed-Solomon provides.
---
I would add: "... in comparison to having an arbitrarily large number of shards if the Galois field allowed it."
---
Although that would make the sentence a bit more precise, it would also make it harder to read, I think.
of blocks and providing a Merkle proof for those blocks. The Merkle proof is generated inside a SNARK to compress it to a small size to allow for cost-effective verification on a blockchain.
---
Merkle proofs inside a SNARK to get a succinct proof are just one of the options. Maybe clarify that you are discussing this option?
---
I meant for this to be a design document, not a research paper. The design described here chooses to use a secondary erasure coding scheme with Merkle proofs inside a SNARK. As far as I can tell, this design is the only practical one that we've come up with so far?
Co-Authored-By: Leonardo Bautista-Gomez <leobago@gmail.com> Co-Authored-By: Csaba Kiraly <csaba.kiraly@gmail.com>
46d25f7 → 80a88c1
Co-Authored-By: Balazs Komuves <bkomuves@gmail.com>
Describes our erasure coding scheme for storage proofs, its limitations, and ways of addressing them.
Requires a thorough review, because I am not an expert in erasure coding and might have misunderstood some of its intricacies.