
Erasure coding for storage proofs #184

Merged · 11 commits · Oct 16, 2023
Conversation

markspanbroek (Member)

Describes our erasure coding scheme for storage proofs, its limitations, and ways of addressing them.

Requires a thorough review, because I am not an expert in erasure coding and might have misunderstood some of its intricacies.

We are however unaware of any implementations of Reed-Solomon that use a field
size larger than 2^16 and remain efficient, i.e. O(N log(N)). [FastECC][2] uses
a prime field of 20 bits, but it lacks a decoder and a byte encoding scheme. The
paper ["An Efficient (n,k) Information Dispersal Algorithm Based on Fermat […]
Bulat-Ziganshin

FastECC decoder is included in the private repo available to you, https://github.com/Bulat-Ziganshin/private/tree/main/FastECC2 . This code isn't fully systematic for binary data as it works in non-binary fields, but efficient encodings with very little waste exist.

OTOH, Leopard employs almost the same algorithm (only replacing the FFT with an "additive FFT") and works in binary fields, so why do you need other code at all? Leopard should be extensible to GF(2^32); ask the author (or just provide GF(2^32) operations, in addition to the 8- and 16-bit ones included in the repo). Yeah, it may have slower division, but 1/x = x^(2^32-2) should still work.
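
For illustration, a minimal Python sketch of that inversion-by-exponentiation idea, shown in GF(2^8) for brevity; the identity 1/x = x^(2^n-2) extends to GF(2^32). The reduction polynomial 0x11D is an assumption for the example, not necessarily the one Leopard uses.

```python
POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1 (assumed polynomial, for illustration)

def mul(a: int, b: int) -> int:
    """Carry-less (XOR-based) multiply in GF(2^8), reduced modulo POLY."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        if a & 0x100:
            a ^= POLY
        b >>= 1
    return r

def inv(x: int) -> int:
    """1/x = x^(2^8 - 2), computed by square-and-multiply."""
    result, e = 1, (1 << 8) - 2
    while e:
        if e & 1:
            result = mul(result, x)
        x = mul(x, x)
        e >>= 1
    return result

# Sanity check: every nonzero element times its inverse is 1.
assert all(mul(x, inv(x)) == 1 for x in range(1, 256))
```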

dryajov (Contributor), Sep 21, 2023

This is actually very interesting; we do care about non-binary fields because they might perform better for some specific cryptographic verification schemes: SNARKs, polynomial commitments, etc.

Bulat-Ziganshin

Multiplication in GF(p) with large p is implemented as a*b mod p, which for a fixed p can be somewhat optimized with the widely known "division by a known constant" algorithm that involves several usual (binary) multiplications.

OTOH, optimized multiplication in GF(2^n) is implemented by splitting numbers into 4/8-bit "digits" and using multiplication-table lookups. BTW, Leopard afaik provides 12-bit codes, so you may add 20-bit or 24-bit ones, which should work faster than 32-bit.

If binary multiplications are faster than table lookups, then indeed GF(p) arithmetic can be faster than GF(2^n).
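
For concreteness, a minimal Python sketch (not taken from FastECC or Leopard) contrasting the two approaches: a GF(p) multiply that is one wide multiplication plus a reduction by a fixed constant, versus a table-driven GF(2^8) multiply. The prime 0xFFF00001 is the one mentioned later in this thread; the reduction polynomial 0x11D and generator 2 are assumptions chosen for the example.

```python
P = 0xFFF00001  # prime < 2^32, so a*b fits in 64 bits before reduction

def gf_p_mul(a: int, b: int) -> int:
    """GF(p) multiply: one wide multiplication plus a reduction. Plain '%' is
    used here for clarity; optimized code would replace division by the fixed
    constant P with a multiply/shift sequence."""
    return (a * b) % P

# GF(2^8) with (assumed) primitive polynomial 0x11D and generator 2:
# build log/antilog tables once, then multiply with two lookups and an add.
EXP = [0] * 512
LOG = [0] * 256
x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= 0x11D
for i in range(255, 512):  # duplicate so LOG[a] + LOG[b] never overflows
    EXP[i] = EXP[i - 255]

def gf_2_8_mul(a: int, b: int) -> int:
    """GF(2^8) multiply via log/antilog table lookups."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

assert gf_p_mul(123456789, 987654321) == (123456789 * 987654321) % P
assert gf_2_8_mul(2, 2) == 4  # x * x = x^2, no reduction needed
assert all(gf_2_8_mul(a, 1) == a for a in range(256))
```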

====

So, what do you need now?

  1. The algorithms (both FastECC's and Leopard's) were published by 3rd-party researchers in 2014-2015, so they are public property.
  2. I can quickly add a public-domain moniker to my fastecc2 code, and overall I plan to push it into the public FastECC repo (sigh)
  3. Efficient recoding to binary is possible when p is close to any 2^n, in particular check https://github.com/Bulat-Ziganshin/FastECC/blob/master/GF.md#efficient-data-packing
  4. The long-standing approach to discovering RS codecs' potential performance is to build their computation model, e.g. the fastecc coder executes k*n*log2(n) multiplications in GF(p), which results in k2*n*log2(n) 32-bit multiplications (we need to find k and then k2). Then, by learning how much time we spend executing a GF(p) or 32-bit multiplication in a given SNARK setup, we will be able to predict the performance (a small sketch of this estimate follows the list).
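
A small sketch of the cost model in point 4, with made-up placeholder constants (k and k2 still need to be measured for a concrete codec):

```python
import math

def rs_cost_model(n: int, k: float, k2: float) -> tuple[float, float]:
    """Returns (GF(p) multiplications, 32-bit multiplications) for an
    n-block encode: k*n*log2(n) and k2*n*log2(n) respectively."""
    nlogn = n * math.log2(n)
    return k * nlogn, k2 * nlogn

# Example with placeholder constants k=1, k2=4 and n = 2^16 blocks:
gf_muls, word_muls = rs_cost_model(1 << 16, k=1.0, k2=4.0)
print(f"{gf_muls:.2e} GF(p) muls, {word_muls:.2e} 32-bit muls")
```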

How else can I help?

Bulat-Ziganshin

[FastECC][2] uses a prime field of 20 bits

Check lucky numbers. The prime I used for efficiency of SIMD computations is a 32-bit one, there is a nice 61-bit one, and we can implement "complex arithmetic", i.e. over GF(p^2), to work with 61*2 bits while still using operations on 64-bit binary numbers 🤣

Basically, using k-bit arithmetic (with 2k-bit MUL result), we can work with any GF(p) where p<2^k, or using "complex numbers", with any GF(p^2), p<2^k. But practically speaking, p=0xFFF00001 should be enough for any current needs.
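
To make the GF(p^2) "complex arithmetic" concrete, here is a minimal Python sketch. It assumes the 61-bit prime is the Mersenne prime 2^61 - 1 (the comment does not name it); since p ≡ 3 (mod 4), -1 is a non-residue, so elements can be represented as a + b*i with i^2 = -1, and every coordinate fits in a 64-bit word.

```python
P = (1 << 61) - 1       # assumed 61-bit prime (Mersenne); note P % 4 == 3
assert P % 4 == 3       # so x^2 + 1 is irreducible over GF(p)

def add(x, y):
    return ((x[0] + y[0]) % P, (x[1] + y[1]) % P)

def mul(x, y):
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    a, b = x
    c, d = y
    return ((a * c - b * d) % P, (a * d + b * c) % P)

def inv(x):
    # 1/(a + bi) = (a - bi) / (a^2 + b^2); the denominator is nonzero for
    # x != 0 because -1 is a non-residue mod p.
    a, b = x
    n = (a * a + b * b) % P
    n_inv = pow(n, P - 2, P)        # Fermat inverse in GF(p)
    return ((a * n_inv) % P, (-b * n_inv) % P)

x = (123456789, 987654321)
assert mul(x, inv(x)) == (1, 0)     # multiplicative inverse round-trips
```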

Bulat-Ziganshin

catid/leopard#2 discusses implementation of >16 bits in Leopard

Bulat-Ziganshin

we do care about non-binary fields because they might perform better for some specific cryptographic verification schemes, SNARKS, polynomial commitments, etc...

At least in this documentation, the SNARK is used only to hash a small subset of blocks, i.e. the ECC codec itself only has to run directly on the CPU.

Contributor

Hey, super helpful stuff, but I (we) need some time to digest it :-)

Bulat-Ziganshin

Overall, if we use a GF(p^n) codec, we should stick to p values close to some 2^k, since they allow efficient recoding to/from binary data (like losing only 1 bit per 4 KiB with p=0xFFF00001). This encoding scheme is generic.

Other implications:

  • If the p we use is close to 2^k, so that we recode to k-bit numbers, RS coding will have to use blocks of i*k bits. E.g. with k=61 we can't use exactly 1 KiB source blocks; instead we have to use 8235-bit (61*135) blocks, adding 43 zero bits to each 1 KiB source block, and then having to keep the extra 43 (non-zero!) bits for each 1 KiB parity block (see the sketch after this list). These extra data can be kept together with the block checksum and other control data, so it's possible, but it adds some complications.
  • If p>2^k, then source data are fine (all their k-bit fields are <p), but parity data need to be recoded and the extra bits stored in control data. If p<2^k, it's the opposite: we have to recode the source data and then reliably store the extra bits (e.g. one bit is required to distinguish between 0 and 0xFF if p=0xFF), since we will need these extra bits to successfully restore this source block from other data if it was lost.
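
A tiny sketch of the packing arithmetic in the first bullet (an illustration, not FastECC's actual packing code), reproducing the 1 KiB / k=61 numbers:

```python
import math

def packing(block_bytes: int, k: int) -> tuple[int, int]:
    """How many k-bit field elements a source block needs, and how many
    padding bits that implies. The padding bits are zero in source blocks,
    but generally non-zero in parity blocks and must be stored."""
    bits = block_bytes * 8
    elements = math.ceil(bits / k)
    return elements, elements * k - bits

assert packing(1024, 61) == (135, 43)   # the 1 KiB, k=61 example above
print(packing(1024, 20))                # 20-bit field elements: (410, 8)
```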

markspanbroek (Member Author)

Hi Bulat! Thanks for the great comments.

FastECC decoder is included in the private repo available to you

Thank you, I didn't know that it was already implemented. I'll update the text to reflect this.

This code isn't fully systematic for binary data as it works in non-binary fields

3. Efficient recoding to binary is possible when p is close to any 2^n, in particular check https://github.com/Bulat-Ziganshin/FastECC/blob/master/GF.md#efficient-data-packing

If I understand this correctly, then this would mean that the parity data would be a little bigger than normal, because of the need to represent elements from a field that is slightly larger than 2^n. But I expect that we can leave the original data untouched, because we can always encode it into field elements when needed? So in that aspect it is a systematic code, because it leaves the original data untouched?

OTOH, Leopard employs almost the same algo (only replacing FFT with "additive FFT") and works in binary fields, so why do you need other code at all?

Good question, I was under the impression that this would lead to prohibitively large tables in Leopard, and that this wouldn't be the case with FastECC. But I could very well be mistaken.

markspanbroek (Member Author)

I added a description of the multi-dimensional schemes that were described by @bkomuves in sampling.pdf and discussed in our research meeting.

I also added the ability to play with multi-dimensional parameters to the spreadsheet.

leobago (Contributor) left a comment

Overall I think it is a very nice write-up that clearly explains the trade-offs between erasure-coding Galois field limitations and SNARK hashing limitations.

Most of the analysis is motivated by the need to support very large slot sizes, which, I would argue, is not necessarily urgent. We could limit dataset sizes in the first MVP, and 4 GB (or smaller) slots should be sufficient.

Also, most of the write-up is motivated by a non-rational adversarial attacker that wants to remove the minimum amount of data to render the slot non-recoverable and still try to pass the proofs. However, I am not sure how much sense such an attack would make, given that the attacker (in theory) does not control the other slots on other storage nodes. The "attack" would only really work if the other storage nodes did the same.

If an attacker controls enough nodes of a given dataset to render the dataset truly non-recoverable (globally) while still providing false proofs (exploiting weaknesses of 1D erasure coding, or simply no erasure coding), then the fact that the attacker manages to provide false proofs is the least of our concerns; the real concern is that the attacker can simply withhold or completely erase the data.

With this said, I think it is good that we have this analysis, and we should make an effort to cover all the cases exposed here, but to put things in perspective, I would place it in the context of a very specific scenario: providing accurate proofs for huge datasets in the presence of non-rational adversarial attackers.

For the first two items we'll use a different erasure coding scheme than we do
for the last. In this document we focus on the last item: an erasure coding
scheme that makes it easier to detect missing or corrupted data on a host
through storage proofs.
Contributor

So basically, for the rest of the write-up, the entire dataset is inside one single storage node, and it corresponds to what we call a slot, correct? If so, it would be good to mention it explicitly.

Contributor

I'm also not yet convinced that we can say "we use different erasure coding" and handle them separately. In the construct I'm thinking of, I'm not sure the erasure coding can be separated this way.

markspanbroek (Member Author)

what we call a slot

Thanks, I've used 'slot data' in the updated version

not yet convinced we can say "we use different erasure coding"

In this proposed design we use a different erasure coding. It would be really cool if we could use a single erasure coding scheme for all purposes, but I haven't seen a design for it yet?

(Six further review threads on design/proof-erasure-coding.md were resolved.)
-------------------

The disadvantage of interleaving is that it weakens the protection against
adversarial erasure that Reed-Solomon provides.
Contributor

I would add: "... in comparison to having an arbitrarily large number of shards, if the Galois field allowed it."

markspanbroek (Member Author)

Although that would make the sentence a bit more precise, it would also make it harder to read, I think.

(Two further review threads on design/proof-erasure-coding.md were resolved.)
of blocks and providing a Merkle proof for those blocks. The Merkle proof is
generated inside a SNARK to compress it to a small size to allow for
cost-effective verification on a blockchain.

Contributor

Merkle proofs inside a SNARK to obtain a succinct proof are just one of the options. Maybe clarify that you are discussing this option?

markspanbroek (Member Author)

I meant for this to be a design document, not a research paper. The design that is described here chooses to use a secondary erasure coding scheme with merkle proofs inside a SNARK. As far as I can tell, this design is the only practical one that we've come up with so far?

Co-Authored-By: Leonardo Bautista-Gomez <leobago@gmail.com>
Co-Authored-By: Csaba Kiraly <csaba.kiraly@gmail.com>