
Erasure coding for storage proofs #184

Merged · 11 commits · Oct 16, 2023
Conversation

markspanbroek (Member)

Describes our erasure coding scheme for storage proofs, its limitations, and ways of addressing them.

Requires a thorough review, because I am not an expert in erasure coding and might have misunderstood some of its intricacies.

We are however unaware of any implementations of Reed-Solomon that use a field
size larger than 2^16 and remain efficient, i.e. O(N log(N)). [FastECC][2] uses
a prime field of 20 bits, but it lacks a decoder and a byte encoding scheme. The
paper ["An Efficient (n,k) Information Dispersal Algorithm Based on Fermat […]
Bulat-Ziganshin

FastECC decoder is included in the private repo available to you, https://github.com/Bulat-Ziganshin/private/tree/main/FastECC2 . This code isn't fully systematic for binary data as it works in non-binary fields, but efficient encodings with very little waste exist.

OTOH, Leopard employs almost the same algorithm (only replacing the FFT with an "additive FFT") and works in binary fields, so why do you need other code at all? Leopard should be extensible to GF(2^32); ask the author (or just provide GF(2^32) operations, in addition to the 8- and 16-bit ones included in the repo). Yeah, it may have slower division, but 1/x = x^(2^32-2) should still work.
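
For illustration, a minimal Python sketch of that inversion-by-exponentiation idea, shown in GF(2^8) for brevity; the identity 1/x = x^(2^n-2) extends to GF(2^32). The reduction polynomial 0x11D is an assumption for the example, not necessarily the one Leopard uses.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed polynomial, for illustration)

def mul(a: int, b: int) -> int:
    """Carry-less (XOR-based) multiply in GF(2^8), reduced modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def inv(x: int) -> int:
    """1/x = x^(2^8 - 2), computed by square-and-multiply."""
    result, e = 1, (1 << 8) - 2
    while e:
        if e & 1:
            result = mul(result, x)
        x = mul(x, x)
        e >>= 1
    return result

# Sanity check: every nonzero element times its inverse is 1.
assert all(mul(x, inv(x)) == 1 for x in range(1, 256))
```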

dryajov (Contributor), Sep 21, 2023

This is actually very interesting; we do care about non-binary fields because they might perform better for some specific cryptographic verification schemes: SNARKs, polynomial commitments, etc.

Bulat-Ziganshin

Multiplication in GF(p) with large p is implemented as a*b mod p, which for a fixed p can be somewhat optimized with the widely known "division by a known constant" algorithm that involves several usual (binary) multiplications.

OTOH, optimized multiplication in GF(2^n) is implemented by splitting numbers into 4/8-bit "digits" and using multiplication-table lookups. BTW, Leopard afaik provides 12-bit codes, so you may add 20-bit or 24-bit ones, which should work faster than 32-bit.

If binary multiplications are faster than table lookups, then indeed GF(p) arithmetic can be faster than GF(2^n).
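
For concreteness, a minimal Python sketch (not taken from FastECC or Leopard) contrasting the two approaches: a GF(p) multiply that is one wide multiplication plus a reduction by a fixed constant, versus a table-driven GF(2^8) multiply. The prime 0xFFF00001 is the one mentioned later in this thread; the reduction polynomial 0x11D and generator 2 are assumptions chosen for the example.

```python
P = 0xFFF00001  # prime < 2^32, so a*b fits in 64 bits before reduction

def gf_p_mul(a: int, b: int) -> int:
    """GF(p) multiply: one wide multiplication plus a reduction. Plain '%' is
    used here for clarity; optimized code would replace division by the fixed
    constant P with a multiply/shift sequence."""
    return (a * b) % P

# GF(2^8) with (assumed) primitive polynomial 0x11D and generator 2:
# build log/antilog tables once, then multiply with two lookups and an add.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 512):  # duplicate so LOG[a] + LOG[b] never overflows
    EXP[i] = EXP[i - 255]

def gf_2_8_mul(a: int, b: int) -> int:
    """GF(2^8) multiply via log/antilog table lookups."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

assert gf_p_mul(123456789, 987654321) == (123456789 * 987654321) % P
assert gf_2_8_mul(2, 2) == 4  # x * x = x^2, no reduction needed
assert all(gf_2_8_mul(a, 1) == a for a in range(256))
```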

====

So, what do you need now?

  1. The algorithms (both FastECC's and Leopard's) were published by 3rd-party researchers in 2014-2015, so they are public property.
  2. I can quickly add a public-domain moniker to my fastecc2 code, and overall I plan to push it into the public FastECC repo (sigh)
  3. Efficient recoding to binary is possible when p is close to any 2^n, in particular check https://github.com/Bulat-Ziganshin/FastECC/blob/master/GF.md#efficient-data-packing
  4. The long-standing approach to discovering RS codecs' potential performance is to build their computation model, e.g. the fastecc coder executes k*n*log2(n) multiplications in GF(p), which results in k2*n*log2(n) 32-bit multiplications (we need to find k and then k2). Then, by learning how much time we spend executing a GF(p) or 32-bit multiplication in a given SNARK setup, we will be able to predict the performance (a small sketch of this estimate follows the list).
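
A small sketch of the cost model in point 4, with made-up placeholder constants (k and k2 still need to be measured for a concrete codec):

```python
import math

def rs_cost_model(n: int, k: float, k2: float) -> tuple[float, float]:
    """Returns (GF(p) multiplications, 32-bit multiplications) for an
    n-block encode: k*n*log2(n) and k2*n*log2(n) respectively."""
    nlogn = n * math.log2(n)
    return k * nlogn, k2 * nlogn

# Example with placeholder constants k=1, k2=4 and n = 2^16 blocks:
gf_muls, word_muls = rs_cost_model(1 << 16, k=1.0, k2=4.0)
print(f"{gf_muls:.2e} GF(p) muls, {word_muls:.2e} 32-bit muls")
```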

How else can I help?

Bulat-Ziganshin

[FastECC][2] uses a prime field of 20 bits

Check lucky numbers. The prime I used for efficiency of SIMD computations is a 32-bit one, there is a nice 61-bit one, and we can implement "complex arithmetic", i.e. over GF(p^2), to work with 61*2 bits while still using operations on 64-bit binary numbers 🤣

Basically, using k-bit arithmetic (with 2k-bit MUL result), we can work with any GF(p) where p<2^k, or using "complex numbers", with any GF(p^2), p<2^k. But practically speaking, p=0xFFF00001 should be enough for any current needs.
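
To make the GF(p^2) "complex arithmetic" concrete, here is a minimal Python sketch. It assumes the 61-bit prime is the Mersenne prime 2^61 - 1 (the comment does not name it); since p ≡ 3 (mod 4), -1 is a non-residue, so elements can be represented as a + b*i with i^2 = -1, and every coordinate fits in a 64-bit word.

```python
P = (1 << 61) - 1       # assumed 61-bit prime (Mersenne); note P % 4 == 3
assert P % 4 == 3       # so x^2 + 1 is irreducible over GF(p)

def add(x, y):
    return ((x[0] + y[0]) % P, (x[1] + y[1]) % P)

def mul(x, y):
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    a, b = x
    c, d = y
    return ((a * c - b * d) % P, (a * d + b * c) % P)

def inv(x):
    # 1/(a + bi) = (a - bi) / (a^2 + b^2); the denominator is nonzero for
    # x != 0 because -1 is a non-residue mod p.
    a, b = x
    n = (a * a + b * b) % P
    n_inv = pow(n, P - 2, P)        # Fermat inverse in GF(p)
    return ((a * n_inv) % P, (-b * n_inv) % P)

x = (123456789, 987654321)
assert mul(x, inv(x)) == (1, 0)     # multiplicative inverse round-trips
```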

Bulat-Ziganshin

catid/leopard#2 discusses implementation of >16 bits in Leopard

Bulat-Ziganshin

we do care about non-binary fields because they might perform better for some specific cryptographic verification schemes, SNARKS, polynomial commitments, etc...

At least in this documentation, the SNARK is used only to hash a small subset of blocks, i.e. the ECC codec itself only has to run directly on the CPU.

Contributor

Hey, super helpful stuff, but I (we) need some time to digest it :-)

Bulat-Ziganshin

Overall, if we use a GF(p^n) codec, we should stick to p values close to some 2^k, since they allow efficient recoding to/from binary data (like losing only 1 bit per 4 KiB with p=0xFFF00001). This encoding scheme is generic.

Other implications:

  • If the p we use is close to 2^k, so that we recode to k-bit numbers, RS coding will have to use blocks of i*k bits. E.g. with k=61 we can't use exactly 1 KiB source blocks; instead we have to use 8235-bit (61*135) blocks, adding 43 zero bits to each 1 KiB source block, and then having to keep the extra 43 (non-zero!) bits for each 1 KiB parity block (see the sketch after this list). These extra data can be kept together with the block checksum and other control data, so it's possible, but it adds some complications.
  • If p>2^k, then source data are fine (all their k-bit fields are <p), but parity data need to be recoded and the extra bits stored in control data. If p<2^k, it's the opposite: we have to recode the source data and then reliably store the extra bits (e.g. one bit is required to distinguish between 0 and 0xFF if p=0xFF), since we will need these extra bits to successfully restore this source block from other data if it was lost.
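
A tiny sketch of the packing arithmetic in the first bullet (an illustration, not FastECC's actual packing code), reproducing the 1 KiB / k=61 numbers:

```python
import math

def packing(block_bytes: int, k: int) -> tuple[int, int]:
    """How many k-bit field elements a source block needs, and how many
    padding bits that implies. The padding bits are zero in source blocks,
    but generally non-zero in parity blocks and must be stored."""
    bits = block_bytes * 8
    elements = math.ceil(bits / k)
    return elements, elements * k - bits

assert packing(1024, 61) == (135, 43)   # the 1 KiB, k=61 example above
print(packing(1024, 20))                # 20-bit field elements: (410, 8)
```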

markspanbroek (Member Author)

Hi Bulat! Thanks for the great comments.

FastECC decoder is included in the private repo available to you

Thank you, I didn't know that it was already implemented. I'll update the text to reflect this.

This code isn't fully systematic for binary data as it works in non-binary fields

3. Efficient recoding to binary is possible when p is close to any 2^n, in particular check https://github.com/Bulat-Ziganshin/FastECC/blob/master/GF.md#efficient-data-packing

If I understand this correctly, then this would mean that the parity data would be a little bigger than normal, because of the need to represent elements from a field that is slightly larger than 2^n. But I expect that we can leave the original data untouched, because we can always encode it into field elements when needed? So in that aspect it is a systematic code, because it leaves the original data untouched?

OTOH, Leopard employs almost the same algo (only replacing FFT with "additive FFT") and works in binary fields, so why do you need other code at all?

Good question, I was under the impression that this would lead to prohibitively large tables in Leopard, and that this wouldn't be the case with FastECC. But I could very well be mistaken.

markspanbroek (Member Author)

I added a description of the multi-dimensional schemes that were described by @bkomuves in sampling.pdf and discussed in our research meeting.

I also added the ability to play with multi-dimensional parameters to the spreadsheet.

leobago (Contributor) left a comment

Overall I think it is a very nice write-up that clearly explains the trade-offs between erasure-coding Galois field limitations and SNARK hashing limitations.

Most of the analysis is motivated by the need to support very large slot sizes, which, I would argue, is not necessarily urgent. We could limit dataset sizes in the first MVP, and 4 GB (or smaller) slots should be sufficient.

Also, most of the write-up is motivated by a non-rational adversarial attacker that wants to remove the minimum amount of data to render the slot non-recoverable and still try to pass the proofs. However, I am not sure how much sense such an attack would make, given that the attacker (in theory) does not control the other slots on other storage nodes. The "attack" would only really work if the other storage nodes did the same.

If an attacker controls enough nodes of a given dataset to render the dataset truly non-recoverable (globally) while still providing false proofs (exploiting weaknesses of 1D erasure coding, or simply no erasure coding), then the fact that the attacker manages to provide false proofs is the least of our concerns; the real concern is that the attacker can simply withhold or completely erase the data.

With this said, I think it is good that we have this analysis, and we should make an effort to cover all the cases exposed here, but to put things in perspective, I would place it in the context of a very specific scenario: providing accurate proofs for huge datasets in the presence of non-rational adversarial attackers.

For the first two items we'll use a different erasure coding scheme than we do
for the last. In this document we focus on the last item: an erasure coding
scheme that makes it easier to detect missing or corrupted data on a host
through storage proofs.
Contributor

So basically, for the rest of the write-up, the entire dataset is inside one single storage node, and it corresponds to what we call a slot, correct? If so, it would be good to mention it explicitly.

Contributor

I'm also not yet convinced that we can say "we use different erasure coding" and handle them separately. In the construct I'm thinking of, I'm not sure the erasure coding can be separated this way.

markspanbroek (Member Author)

what we call a slot

Thanks, I've used 'slot data' in the updated version

not yet convinced we can say "we use different erasure coding"

In this proposed design we use a different erasure coding. It would be really cool if we could use a single erasure coding scheme for all purposes, but I haven't seen a design for it yet?

(Six further review threads on design/proof-erasure-coding.md were resolved.)
-------------------

The disadvantage of interleaving is that it weakens the protection against
adversarial erasure that Reed-Solomon provides.
Contributor

I would add: "... in comparison to having an arbitrarily large number of shards, if the Galois field allowed it."

markspanbroek (Member Author)

Although that would make the sentence a bit more precise, it would also make it harder to read, I think.

(Two further review threads on design/proof-erasure-coding.md were resolved.)
of blocks and providing a Merkle proof for those blocks. The Merkle proof is
generated inside a SNARK to compress it to a small size to allow for
cost-effective verification on a blockchain.

Contributor

Merkle proofs inside a SNARK to obtain a succinct proof are just one of the options. Maybe clarify that you are discussing this option?

markspanbroek (Member Author)

I meant for this to be a design document, not a research paper. The design that is described here chooses to use a secondary erasure coding scheme with merkle proofs inside a SNARK. As far as I can tell, this design is the only practical one that we've come up with so far?

Co-Authored-By: Leonardo Bautista-Gomez <leobago@gmail.com>
Co-Authored-By: Csaba Kiraly <csaba.kiraly@gmail.com>