Specify maximum byte sequence and tree structure lengths#648
fulldecent wants to merge 2 commits into ethereum:master
Conversation
I think, I haven't thought about ||t||.
Paper.tex
Outdated
Old:
\mathbb{L} & \equiv & \{ \mathbf{t}: \mathbf{t} = ( \mathbf{t}[0], \mathbf{t}[1], ... ) \; \wedge \; \forall_{n < \lVert \mathbf{t} \rVert} \; \mathbf{t}[n] \in \mathbb{T} \} \\
\mathbb{B} & \equiv & \{ \mathbf{b}: \mathbf{b} = ( \mathbf{b}[0], \mathbf{b}[1], ... ) \; \wedge \; \forall_{n < \lVert \mathbf{b} \rVert} \; \mathbf{b}[n] \in \mathbb{O} \}

New:
\mathbb{L} & \equiv & \{ \mathbf{t}: \mathbf{t} = ( \mathbf{t}[0], \mathbf{t}[1], ... ) \; \wedge \; \forall_{n < \lVert \mathbf{t} \rVert} \; \mathbf{t}[n] \in \mathbb{T} \; \wedge \; \lVert \mathbf{t} \rVert < 2^{256} \} \\
\mathbb{B} & \equiv & \{ \mathbf{b}: \mathbf{b} = ( \mathbf{b}[0], \mathbf{b}[1], ... ) \; \wedge \; \forall_{n < \lVert \mathbf{b} \rVert} \; \mathbf{b}[n] \in \mathbb{O} \; \wedge \; \lVert \mathbf{b} \rVert < 2^{256} \}
In both cases, it seems that in Appendix B, with B, AIUI, the input can be any arbitrary data array, not just one shorter than the stack limit of 2^256. Looking at the definitions of R_b and R_l, this seems to match up.
AIUI, no, it seems unnecessary, but you are welcome to convince me otherwise
Right, we can define a function that's not injective. We can later say it's injective for certain short inputs.
Hi, I've been reading your specification, and noticed this today. It tripped me up at first, so I figure it's worth preserving this argument for future readers. Injectivity is a desirable property for encoders: a non-injective encoder does not uniquely determine a decoder; in such a situation, a complete specification must decide amongst the space of decoders, or otherwise suitably restrict its encoder's domain. But (counter to OP's claim) a pair of distinct inputs whose encodings agree on the first byte is not sufficient for non-injectivity: there must be distinct inputs whose encodings' tails agree as well. And it seems to me that, while admitting byte-arrays with length greater than 2⁶⁴-1 means the first byte of an encoding is no longer sufficient to distinguish between a byte-array and an 𝕃-structure, that alone does not give a collision. Suppose Rᵦ(t₀) = Rₗ(t₁) for distinct t₀ and t₁. It must be the case that t₀ ∈ 𝔹 and t₁ ∈ 𝕃 (or symmetrically t₀ ∈ 𝕃 and t₁ ∈ 𝔹). For Rᵦ(t₀) = Rₗ(t₁), we must have agreement at the head and the tail.
There remains an argument that it may be desirable for a decoder to be able to discern between a byte-array and an 𝕃-structure based on the first byte of the encoding. I haven't finished reading the specification, so I can't say whether it'll ever encode a byte-array longer than 2⁶⁴-1; but if it doesn't, then implementations can just ignore this case, and if it does, restricting the input here won't help. So I don't think this change is merited.
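To make the first-byte argument concrete, here is a small Python sketch (mine, not from the Yellow Paper; the function names are hypothetical) computing the header byte that R_b and R_l would assign to long payloads. Within the 2⁶⁴-1 limit the two ranges are disjoint (184..191 versus 248..255); admit a byte-array of length 2⁶⁴ and its header byte collides with the list range.

```python
def first_byte_of_bytearray_encoding(length):
    """First byte of R_b for a byte array of the given length (assumes length >= 56)."""
    length_of_length = (length.bit_length() + 7) // 8  # bytes needed for big-endian length
    return 183 + length_of_length

def first_byte_of_list_encoding(payload_length):
    """First byte of R_l for a list whose concatenated payload has the given length (assumes >= 56)."""
    length_of_length = (payload_length.bit_length() + 7) // 8
    return 247 + length_of_length

# Within the 2^64 - 1 limit the ranges are disjoint: 184..191 vs 248..255.
assert first_byte_of_bytearray_encoding(2**64 - 1) == 191
assert first_byte_of_list_encoding(2**64 - 1) == 255

# A byte array of length 2^64 needs 9 length bytes, so its header byte is
# 183 + 9 = 192, which lands in the list range.
assert first_byte_of_bytearray_encoding(2**64) == 192
```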
The Yellow Paper says:
Now we have a byte holding the value 256 or more. Maybe that is injective, but hyper-real-imaginary numbers are outside the scope of the document. Looking at the theorem above, this breaks down on "distinguished by their suffix". RLP encodings are not distinguished by their suffix because:
And to be clear, my complaint is not that the function is non-injective. My complaint is that the function domain is wrongly specified.
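The "byte holding the value 256 or more" point can be checked with a little arithmetic (my sketch, not the author's): the header byte 183 + (number of length bytes) stays within one byte only while at most 72 length bytes are needed, and a byte-array of 2^584 zeros needs 74.

```python
# Header bytes only reach 255, i.e. 183 + k <= 255 requires k <= 72 length
# bytes, i.e. lengths below 256^72 = 2^576.
assert 183 + 72 == 255

# A length of 2^584 needs 74 big-endian bytes, pushing the header to 257,
# which is not representable as a single byte.
n = 2**584
length_of_length = (n.bit_length() + 7) // 8
assert length_of_length == 74
assert 183 + length_of_length == 257
```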
@fmap I did something similar in Isabelle/HOL a while ago https://github.com/pirapira/eth-isabelle/blob/master/attic/RLP.thy#L150
@fmap by the way, an RLP decoder should be able to receive an infinite byte stream and decide how many bytes at the beginning form an RLP object.
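A decoder of that shape only needs the header to know where the item ends. Here is a hedged Python sketch (my own illustration, not code from this repository) that reads just the leading bytes of a stream and reports the total length of the item; it assumes the standard RLP prefix ranges and that enough header bytes are available.

```python
def rlp_item_length(stream_prefix: bytes) -> int:
    """Total byte length of the RLP item starting at stream_prefix[0].

    Only the header bytes are inspected; the payload need not be present.
    """
    first = stream_prefix[0]
    if first < 0x80:                      # single byte, encodes itself
        return 1
    if first < 0xb8:                      # short byte array: 0x80 + len
        return 1 + (first - 0x80)
    if first < 0xc0:                      # long byte array: 0xb7 + len-of-len
        lol = first - 0xb7
        return 1 + lol + int.from_bytes(stream_prefix[1:1 + lol], "big")
    if first < 0xf8:                      # short list: 0xc0 + payload len
        return 1 + (first - 0xc0)
    lol = first - 0xf7                    # long list: 0xf7 + len-of-len
    return 1 + lol + int.from_bytes(stream_prefix[1:1 + lol], "big")

# "dog" encodes as 0x83 'd' 'o' 'g': 4 bytes total.
assert rlp_item_length(b"\x83dog") == 4
```

Note that this is exactly where the ambiguity above bites: a stream starting with 0xC0 could be the empty list (1 byte) or, if over-long byte-arrays were admitted, the start of a long byte-array header.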
Sorry, I don't want to read all of this now, but AIUI, RLP can recursively take any arbitrary input and convert it to a fixed size in order to send it over the wire protocol to users. @fulldecent if you're still concerned that the domain is wrongly specified, maybe you could submit a formal proof? Sorry, I just don't have time to work on too many things that are extraneous to working on sharding.
@fulldecent we have alternatives here
(+ is concatenation).
@pirapira writes:
Hi @pirapira. This is a productive restriction, thanks for introducing it; I believe for this to work the domain needs a corresponding restriction.

@fulldecent writes:
Sorry @fulldecent, I misinterpreted the context above. I agree with your objection, and believe the domain should be further restricted. How did you come up with the bounds in your patch? My numbers turned out different (in the case of byte-arrays) and in conceptual disagreement (in the case of branch structures).

In the case of byte-arrays, which you quote, we have a byte m within the closed ℕ-interval [184, 191], followed by m - 183 bytes (at most 8) encoding the length of the byte array (n), followed by the n bytes comprising the byte array. Correspondingly, for a byte-array to have an encoding, I believe its length must be strictly less than 256⁸ = (2⁸)⁸ = 2⁶⁴.

In the case of branch structures, we have a byte m within the closed ℕ-interval [248, 255], followed by m - 247 bytes (at most 8) encoding the length of the concatenation of encoded entries in the sequence (n), followed by the n bytes comprising that concatenation. Correspondingly, I believe for an element of 𝕃 to have an encoding, the length of the concatenation of the encodings of its entries must be strictly less than 2⁶⁴.

I don't think restricting the length of an 𝕃-sequence is an appropriate technique for restricting the size of its encoding. I can exhibit a singleton 𝕃-sequence which is too long to have an encoding: take any byte array b of length 2⁶⁴ - 1, let t = (b); then the encoding of b alone already exceeds 2⁶⁴ bytes, so t has no encoding even though ‖t‖ = 1.
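The arithmetic behind the singleton counterexample is short enough to check directly (my sketch of the lengths involved, under the header layout described above):

```python
# b is a byte array at the byte-array limit: length 2^64 - 1.
b_len = 2**64 - 1

# Its encoding is: 1 header byte (183 + 8 = 191) + 8 length bytes + payload.
encoded_b_len = 1 + 8 + b_len
assert encoded_b_len == 2**64 + 8

# The singleton list t = (b) has concatenated payload of that size, which
# already breaks the 2^64 bound, even though the list has only one entry.
assert encoded_b_len >= 2**64
```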
@fmap Thank you for this correction, I have updated the PR.

@jamesray1 tl;dr: here is a formal proof that demonstrates that a problem exists. Sequence concatenation is represented by hamburgers for clarity. Bytes are shown in decimal for clarity.

Specimen 1: Byte-array of 2^64 zeros

Sequence of zero values Ø

Specimen 2: Byte-array of 2^584 zeros
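As a sanity check on Specimen 1 (my own reconstruction of its header, assuming no domain restriction): a byte-array of 2^64 zeros would get header byte 183 + 9 = 192, the very prefix reserved for lists.

```python
# Hypothetical header for Specimen 1: a byte array of 2^64 zeros.
n = 2**64
length_bytes = n.to_bytes(9, "big")          # 0x01 followed by eight 0x00 bytes
header = bytes([183 + len(length_bytes)]) + length_bytes

assert header[0] == 192                      # same first byte as an encoded list
assert header == b"\xc0\x01" + b"\x00" * 8
```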
Ping @pirapira @jamesray1: is this explanation ^^ satisfactory for you?
pirapira left a comment:
The English description and the equation do not match.
\item If the concatenated serialisations of each contained item is less than 56 bytes in length, then the output is equal to that concatenation prefixed by the byte equal to the length of this byte array plus 192.

Old:
\item Otherwise, the output is equal to the concatenated serialisations prefixed by the minimal-length byte-array which when interpreted as a big-endian integer is equal to the length of the concatenated serialisations byte array, which is itself prefixed by the number of bytes required to faithfully encode this length value plus 247.

New:
\item If the concatenated serialisations of each contained item is less than $2^{64}$ bytes in length, then the output is equal to the concatenated serialisations prefixed by the minimal-length byte-array which when interpreted as a big-endian integer is equal to the length of the concatenated serialisations byte array, which is itself prefixed by the number of bytes required to faithfully encode this length value plus 247.
\item Otherwise, this sequence cannot be encoded.
This clause should appear in the next equation.
I have found something else that I think should be changed. In equation 180 the summation is from n = 0 but it should be from k = 0, and the n in b_n and 256^(||b|| - 1 - n) should be changed to k.
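For readers checking the index fix: with k as the summation variable, the sum Σ b_k · 256^(‖b‖ − 1 − k) is just the big-endian interpretation of the byte array, which can be verified numerically (my illustration, not text from the paper):

```python
# Big-endian value of a byte array via the corrected summation index k.
b = bytes([1, 2, 3])
value = sum(b[k] * 256 ** (len(b) - 1 - k) for k in range(len(b)))

# Matches Python's built-in big-endian interpretation: 0x010203 = 66051.
assert value == int.from_bytes(b, "big") == 66051
```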
I have spent more than an hour looking at this today (as well as more time previously). It is not very clear. If you have 2^64 zeros, then: let x = 0 ... 0 such that ||x|| = 2^64.
Let x = 2^584 zeros (0, 0, 0, ..., 0) such that ||x|| = 2^584. Should
I think the latter case that I have supposed makes more sense. This section could do with examples. |
Commit 2a79beb from PR #736 should take care of the issues in this PR. The revised definition of RLP encoding returns an "error value" (\varnothing) when an RLP tree cannot be encoded.

The definition of \mathbb{B} was not modified because it is a more general definition for all byte arrays, not just the ones that can be RLP-encoded. The definition of \mathbb{T} was not modified because the 2^64 limit applies to the concatenated serializations of the subtrees, not to the number of subtrees, and thus it needs the definition of encoding in order to be expressed. So it is more natural to have RLP encoding return an error value on non-RLP-encodable trees, and implicitly define the set of RLP-encodable trees as the complement (in \mathbb{T}) of the preimage of the error value, i.e. the set of RLP trees on which RLP encoding does not return the error value.

Can this PR be closed?
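The error-value convention described here can be sketched in a few lines of Python (my own sketch, using None to stand in for \varnothing; the helper `_with_header` is hypothetical, not from any Ethereum codebase):

```python
from typing import List, Optional, Union

RLPTree = Union[bytes, List["RLPTree"]]

def rlp_encode(t: RLPTree) -> Optional[bytes]:
    """RLP-encode a tree, returning None (the error value) when it cannot be encoded."""
    if isinstance(t, bytes):
        if len(t) == 1 and t[0] < 0x80:
            return t                      # a small single byte encodes itself
        return _with_header(0x80, 0xb7, t)
    parts = [rlp_encode(item) for item in t]
    if any(p is None for p in parts):
        return None                       # propagate the error value upward
    return _with_header(0xc0, 0xf7, b"".join(parts))

def _with_header(short_base: int, long_base: int, payload: bytes) -> Optional[bytes]:
    if len(payload) < 56:
        return bytes([short_base + len(payload)]) + payload
    length_bytes = len(payload).to_bytes((len(payload).bit_length() + 7) // 8, "big")
    if len(length_bytes) > 8:
        return None                       # payload length >= 2^64: not RLP-encodable
    return bytes([long_base + len(length_bytes)]) + length_bytes + payload

assert rlp_encode(b"dog") == b"\x83dog"
assert rlp_encode([]) == b"\xc0"
assert rlp_encode([b"dog"]) == b"\xc4\x83dog"
```

The set of RLP-encodable trees is then implicitly the set of inputs on which `rlp_encode` does not return None, matching the complement-of-the-preimage phrasing above.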
Without this addition, the encoding is ambiguous for very large inputs.