Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.Sign up
Inefficiencies in data availability proofs #1194
There are three major sources of inefficiency in the current construction for data availability proofs:
Possible ideas to improve efficiency:
This is still in "think mode"; better ideas may come up. So far (3) seems least bad, plus possibly (7). The issues are far from fatal, but hopefully there are easy ways to remove at least some of the current theoretical ~8x inefficiency.
Very interesting problem. I would love to think about these. How final are these ideas? Could there be other schemes out there and would we be willing to consider them at this point?
One question about https://arxiv.org/pdf/1809.09044.pdf (is this a good reference?), section 7.1: I am currently thinking about polynomial and functional commitment schemes. Is it possible that such a scheme could be used, rather than a full zk-SNARK (which is currently prohibitively expensive)?
I am currently looking at some polynomial commitment schemes that could potentially replace the proof of custody with a non-interactive version. They aren't quite ready for prime time yet (still some problems with efficiency and outsourcability), but I'm planning to post my thoughts to get some more outside thoughts about it.
In terms of how much is final and how much can be changed, I'd say there's good arguments for the use of erasure coding and commitment to erasure coded data in some form being optimal:
As far as choice of scheme goes, I think biggest pragmatic reason to stick to hash-based things in this case, aside from simplicity of implementation, future-proofness and all the other usual reasons, is that erasure codes and Merkle trees are really fast to generate (eg. I can do it within an epoch for 512 MB of data in python) but I don't see any more complicated scheme having the same property (as precedent, think of how the proving time for STARKs is 10x faster than for SNARKs). Additionally purely hash-based schemes can work over binary fields and one reason binary fields are especially nice is that they can represent 32-byte chunks with no overhead, whereas any finite field runs the risk that even if a block is entirely composed of values
Though it does seem to me like polynomial commitment schemes are a superior alternative to SNARKs over Merkle trees; they're more "direct", rather than one construction wrapping over a completely foreign one and hence incurring arithmetization inefficiencies. And the assumptions in polynomial commitment schemes are definitely milder than the assumptions in SNARKs.
Block layout in commitment data is still very flexible, and my intuition is that whatever efficiencies remain would remain the same regardless of the scheme used.
Idea from a conversation with Justin yesterday: if there are large segments of contiguous zero bytes, fill them with any of:
All of these objects are larger than 64kb, so we can either randomly choose indices from where to include data each time, or store indices and try to cycle through the data.
I actually like this. There's definitely something unclean about "grab as many bytes of this thing as possible", but it allows us to preserve the existing (quite clean!) structure of the crosslink data.