
Specify maximum byte sequence and tree structure lengths #648

Closed
wants to merge 2 commits into from

Conversation

fulldecent
Contributor

Without this addition, the coding is ambiguous for very large inputs

@pirapira
Member

I think ||b|| < 2^{80} is the limit. After that, ||BE(||b||)|| reaches 10, and then the image of R_b and the image of R_l overlap.

I haven't thought about ||t||.

Paper.tex Outdated
\mathbb{L} & \equiv & \{ \mathbf{t}: \mathbf{t} = ( \mathbf{t}[0], \mathbf{t}[1], ... ) \; \wedge \; \forall_{n < \lVert \mathbf{t} \rVert} \; \mathbf{t}[n] \in \mathbb{T} \} \\
\mathbb{B} & \equiv & \{ \mathbf{b}: \mathbf{b} = ( \mathbf{b}[0], \mathbf{b}[1], ... ) \; \wedge \; \forall_{n < \lVert \mathbf{b} \rVert} \; \mathbf{b}[n] \in \mathbb{O} \}
\mathbb{L} & \equiv & \{ \mathbf{t}: \mathbf{t} = ( \mathbf{t}[0], \mathbf{t}[1], ... ) \; \wedge \; \forall_{n < \lVert \mathbf{t} \rVert} \; \mathbf{t}[n] \in \mathbb{T} \; \wedge \; \lVert \mathbf{t} \rVert < 2^{256} \} \\
\mathbb{B} & \equiv & \{ \mathbf{b}: \mathbf{b} = ( \mathbf{b}[0], \mathbf{b}[1], ... ) \; \wedge \; \forall_{n < \lVert \mathbf{b} \rVert} \; \mathbf{b}[n] \in \mathbb{O} \; \wedge \; \lVert \mathbf{b} \rVert < 2^{256} \}
Contributor

In both cases it seems that, in Appendix B, with 𝔹, AIUI, the input can be any arbitrary data array, not just one shorter than the stack limit of 2^256. Looking at the definitions of R_b and R_l, this seems to match up.

Member

@jamesray1 so, do you think no limit is necessary?

Contributor

AIUI, no, it seems unnecessary, but you are welcome to convince me otherwise

Member

Right, we can define a function that's not injective. We can later say it's injective for certain short inputs.

@fmap

fmap commented Mar 22, 2018

Hi, I've been reading your specification, and noticed this today. It tripped me up at first, so I figure it's worth preserving this argument for future readers:

Injectivity is a desirable property for encoders: a non-injective encoder does not uniquely determine a decoder; in such a situation, a complete specification must decide amongst the space of decoders, or otherwise suitably restrict its encoder's domain.

But (counter to OP's claim) a pair of distinct inputs whose encodings agree on the first byte is not sufficient to establish non-injectivity: there must be distinct inputs whose encodings' tails agree as well. And it seems to me that, while the first byte of an encoding is no longer sufficient to distinguish a byte-array from an 𝕃-structure once we admit byte-arrays with length greater than 2⁶⁴ − 1, RLP is nonetheless injective.

Suppose that RLP were non-injective; then there must exist t₀ ≠ t₁ ∈ 𝕋 such that RLP(t₀) = RLP(t₁). It cannot be the case that t₀, t₁ ∈ 𝔹, because distinct inputs are distinguished by their suffix under Rᵦ (that is, Rᵦ is trivially injective). Likewise, distinct inputs to Rₗ are distinguished by their suffix, so Rₗ is injective and it cannot be the case that t₀, t₁ ∈ 𝕃.

So it must be the case that t₀ ∈ 𝔹 and t₁ ∈ 𝕃 (or symmetrically t₀ ∈ 𝕃 and t₁ ∈ 𝔹). For Rᵦ(t₀) = Rₗ(t₁), we must have agreement at the head and the tail. When can hd(Rᵦ(t₀)) = hd(Rₗ(t₁))? The minimum value of any hd(Rₗ(t)) is 192, corresponding to the empty branch structure (distinct from the empty byte array, which is coded as [128]). For which x ∈ 𝔹 do we have hd(Rᵦ(x)) ≥ 192? Only in the "fallback branch" of Rᵦ, in the restricted case where ∣BE(∣x∣)∣ > 8.

  • When 8 < ∣BE(∣t₀∣)∣ ≤ 64, for hd(Rᵦ(t₀)) = hd(Rₗ(t₁)) we must have

      ∣s(t₁)∣ < 56  ∧ 183 + ∣BE(∣t₀∣)∣ = 192 + ∣s(t₁)∣
               ...  ∧ ∣BE(∣t₀∣)∣ = 9 + ∣s(t₁)∣
    

    in which case

      ∣tl(Rᵦ(t₀))∣ = ∣BE(∣t₀∣) ∘ t₀∣        ∣tl(Rₗ(t₁))∣ = ∣s(t₁)∣
                   = ∣BE(∣t₀∣)∣ + ∣t₀∣
                   = 9 + ∣s(t₁)∣ + ∣t₀∣
    

    thus ∣Rᵦ(t₀)∣ > ∣Rₗ(t₁)∣, from which it follows that Rᵦ(t₀) ≠ Rₗ(t₁).

  • When 64 < ∣BE(∣t₀∣)∣, for hd(Rᵦ(t₀)) = hd(Rₗ(t₁)) we must have

      ∣s(t₁)∣ ≥ 56  ∧ 183 + ∣BE(∣t₀∣)∣ = 247 + ∣BE(∣s(t₁)∣)∣
               ...  ∧ ∣BE(∣t₀∣)∣ = 64 + ∣BE(∣s(t₁)∣)∣
    

    in which case

      ∣tl(Rᵦ(t₀))∣                          ∣tl(Rₗ(t₁))∣
        = ∣BE(∣t₀∣) ∘ t₀∣                     = ∣BE(∣s(t₁)∣) ∘ s(t₁)∣
        = ∣BE(∣t₀∣)∣ + ∣t₀∣                   = ∣BE(∣s(t₁)∣)∣ + ∣s(t₁)∣
        = 64 + ∣BE(∣s(t₁)∣)∣ + ∣t₀∣
    

    which prima facie suggests we may have tl(Rᵦ(t₀)) = tl(Rₗ(t₁)) when 64 + ∣t₀∣ = ∣s(t₁)∣; however in this situation t₀ ≠ s(t₁), and thus BE(∣t₀∣) ∘ t₀ ≠ BE(∣s(t₁)∣) ∘ s(t₁), from which it follows that tl(Rᵦ(t₀)) ≠ tl(Rₗ(t₁)).

RLP must therefore be injective ∎. (Sorry if any of the details are wrong, I don't feel like mechanising this right now.)

There remains an argument that it may be desirable for a decoder to be able to discern between a byte-array and an 𝕃-structure based on the first byte of the encoding. I haven't finished reading the specification, so I can't say whether it'll ever encode a byte-array longer than 2⁶⁴ − 1; but if it doesn't, then implementations can just ignore this case, and if it does, restricting the input here won't help. So I don't think this change is merited.
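fmap's head-byte observation above can be checked numerically. A minimal sketch (the helper names `be`, `rb_head`, and `rl_head` are my own; only the long-byte-array branch of Rᵦ is modelled, omitting the single-byte and short-array cases):

```python
def be(x: int) -> bytes:
    # minimal big-endian encoding; BE(0) is the empty byte array
    return x.to_bytes((x.bit_length() + 7) // 8, "big") if x else b""

def rb_head(n: int) -> int:
    # head byte R_b assigns to a byte array of length n >= 56 (183 + ||BE(n)||)
    return 183 + len(be(n))

def rl_head(n: int) -> int:
    # head byte R_l assigns to a list whose serialised payload is n bytes long
    return 192 + n if n < 56 else 247 + len(be(n))

# For n < 2**64 the long-array heads stay within [184, 191] ...
assert rb_head(56) == 184 and rb_head(2**64 - 1) == 191
# ... disjoint from the list heads, which occupy [192, 255] ...
assert rl_head(0) == 192 and rl_head(2**64 - 1) == 255
# ... but at n = 2**64 the array head collides with the empty list's head.
assert rb_head(2**64) == rl_head(0) == 192
```

So the head byte alone stops distinguishing the two cases exactly when byte arrays of length 2⁶⁴ or more are admitted, which is the overlap the proof then resolves via the tails.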

@fulldecent
Contributor Author

The Yellow Paper says:

... which is itself prefixed by the number of bytes required to faithfully encode this length value plus 183.

Now we have a byte holding the value 256 or more. Maybe that is injective, but hyper-real-imaginary numbers are outside the scope of the document.


Looking at the theorem above, this breaks down on "distinguished by their suffix". RLP encodings are not distinguished by their suffix, because:

When interpreting RLP data, if an expected fragment is decoded as a scalar and leading zeroes are found in the byte sequence, clients are required to consider it non-canonical and treat it in the same manner as otherwise invalid RLP data, dismissing it completely.


And to be clear, my complaint is not that the function is non-injective. My complaint is that the function's domain is wrongly specified.

@pirapira
Member

@fmap I did something similar in Isabelle/HOL a while ago https://github.com/pirapira/eth-isabelle/blob/master/attic/RLP.thy#L150

@pirapira
Member

pirapira commented Mar 22, 2018

@fmap by the way, an RLP decoder should be able to receive an infinite byte stream and decide how many bytes in the beginning form an RLP object.

@jamesray1
Contributor

Sorry, I don't want to read all of this now, but AIUI, RLP can recursively take any arbitrary input and convert it to a fixed size in order to send it over the wire protocol to users. @fulldecent if you're still concerned that the domain is wrongly specified, maybe you could submit a formal proof? Sorry, I just don't have time to work on too many things that are extraneous to working on sharding.

@pirapira
Member

pirapira commented Mar 23, 2018

@fulldecent we have alternatives here

  1. the encoding function is defined for very big inputs too, and decode(encode(data) + garbage bytes) is not always data. However, for small enough data, the equation holds.
  2. the encoding function is defined only for small enough inputs, and decode(encode(data) + garbage bytes) is always data.
  3. (@fmap) the encoding function is defined for very big inputs too, and decode(encode(data)) is always data, because big data corresponds to big encode(data).

(+ is concatenation).

@fmap

fmap commented Mar 23, 2018

@pirapira writes:

@fmap by the way, an RLP decoder should be able to receive an infinite byte stream and decide how many bytes in the beginning form an RLP object.

Hi @pirapira. This is a productive restriction, thanks for introducing it. I believe for this to work the domain of RLP should be restricted such that byte-arrays cannot exceed length 2⁶⁴ - 1. Consider a bytestream with prefix 0xc0, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00. A decoder can't decide whether the encoding ends at the first byte (for an RLP encoding of the empty branch structure), or whether the following 2⁶⁴ bytes in the stream should be interpreted as a byte-array.
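fmap's ambiguous stream can be spelled out concretely. A sketch of the two competing readings (variable names are my own):

```python
# fmap's 10-byte stream prefix: 0xc0 followed by 0x01 and eight zero bytes
stream = bytes([0xC0, 0x01] + [0x00] * 8)

# Reading 1: 0xC0 = 192 is the canonical encoding of the empty sequence,
# so the RLP object ends after a single byte.
assert stream[0] == 192

# Reading 2: if byte arrays longer than 2**64 - 1 were admitted, 0xC0 could
# equally be 183 + 9, announcing a 9-byte length field ...
length_field = stream[1:10]
assert stream[0] == 183 + len(length_field)
# ... that declares a byte array of 2**64 bytes still to come.
declared_len = int.from_bytes(length_field, "big")
assert declared_len == 2**64
```

A decoder reading this stream byte by byte cannot decide between the two readings, which is exactly why the domain restriction makes the encoding prefix-free.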

@fulldecent writes:

The Yellow Paper says:

... which is itself prefixed by the number of bytes required to faithfully encode this length value plus 183.

Now we have a byte holding the value 256 or more. Maybe that is injective, but hyper-real-imaginary numbers are outside the scope of the document.

Sorry @fulldecent, I misinterpreted the context above. I agree with your objection, and believe the domain should be further restricted. How did you come up with the bounds in your patch? My numbers turned out different (in the case of byte-arrays) and in conceptual disagreement (in the case of branch structures):

In the case for byte-arrays, which you quote, we have a byte m within the closed ℕ-interval [184, 191], followed by m - 183 bytes (at most 8) encoding the length of the byte array (n), followed by the n bytes comprising the byte array. Correspondingly for a byte-array to have an encoding, I believe its length must be strictly less than 256⁸ = (2⁸)⁸ = 2⁶⁴.

In the case for branch structures, we have a byte m within the closed ℕ-interval [248, 255], followed by m - 247 (at most 8) bytes encoding the length of the concatenation of encoded entries in the sequence (n), followed by n bytes comprising the concatenation of encoded entries. Correspondingly I believe for an element of 𝕃 to have an encoding, the length of the concatenation of the encodings of its entries must be strictly less than 2⁶⁴.

  Σ_{n=0}^{n<∣t∣} ∣RLP(tₙ)∣ < 2⁶⁴

I don't think restricting the length of an 𝕃-sequence is an appropriate technique for restricting the size of its encoding. I can exhibit a singleton 𝕃-sequence which is too long to have an encoding: take any byte array b of length 2⁶⁴ − 1 and let t = (b); then ∣RLP(b)∣ = 1 + 8 + (2⁶⁴ − 1) = 2⁶⁴ + 8 (one head byte, eight length bytes, and the payload), which is too long to serialise t.
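The arithmetic behind this singleton counterexample is easy to verify (variable names are my own):

```python
n = 2**64 - 1                      # ||b||, the longest admissible byte array
be_len = (n.bit_length() + 7) // 8
assert be_len == 8                 # ||BE(||b||)|| = 8, the largest length field

# |RLP(b)| = 1 head byte + 8 length bytes + n payload bytes
rlp_b_len = 1 + be_len + n
assert rlp_b_len == 2**64 + 8

# so the payload of t = (b) already exceeds the 2**64 - 1 limit for lists,
# even though t itself contains only one entry
assert rlp_b_len > 2**64 - 1
```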

@fulldecent
Contributor Author

fulldecent commented Mar 24, 2018

@fmap Thank you for this correction, I have updated the PR.


@jamesray1 tl;dr here is a formal proof that demonstrates that a problem exists.

Sequence concatenation is represented by hamburgers for clarity. Bytes are shown in decimal for clarity.

Specimen 1

Byte-array of 2^64 zeros

  • 183+‖BE(‖x‖)‖ 🍔 BE(‖x‖) 🍔 x
  • 192 🍔 1, 0, 0, 0, 0, 0, 0, 0, 0 🍔 0, 0, 0, ...

Empty sequence Ø

  • 192+‖s(x)‖ 🍔 s(x)
  • 192

Specimen 2

Byte-array of 2^584 zeros

  • 183+‖BE(‖x‖)‖ 🍔 BE(‖x‖) 🍔 x
  • 257 🍔 1, 0, 0, … 🍔 0, 0, 0, ...
  • 257 (= 183 + 74, since ‖BE(2^584)‖ = 74) is not a valid value for a byte
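Both specimens reduce to head-byte arithmetic. A quick check (`be_len` is my own helper name; note the second head value computes to 257 = 183 + 74, since 2^584 = 256^73 and its big-endian encoding takes 74 bytes):

```python
def be_len(n: int) -> int:
    # number of bytes in BE(n), the minimal big-endian encoding of n
    return (n.bit_length() + 7) // 8

# Specimen 1: a byte array of 2**64 zeros gets head byte 183 + 9 = 192,
# identical to the head byte of the empty sequence (192 + 0).
assert 183 + be_len(2**64) == 192

# Specimen 2: a byte array of 2**584 zeros would need head byte 183 + 74,
# which no single byte can hold.
assert 183 + be_len(2**584) == 257
assert 183 + be_len(2**584) > 255
```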

@fulldecent
Contributor Author

Ping @pirapira @jamesray1 is this explanation ^^ satisfactory for you?

Member

@pirapira pirapira left a comment

The English description and the equation do not match.

@@ -1345,7 +1345,9 @@ \section{Recursive Length Prefix}\label{app:rlp}\hypertarget{rlp}{}

\begin{itemize}
\item If the concatenated serialisations of each contained item is less than 56 bytes in length, then the output is equal to that concatenation prefixed by the byte equal to the length of this byte array plus 192.
\item Otherwise, the output is equal to the concatenated serialisations prefixed by the minimal-length byte-array which when interpreted as a big-endian integer is equal to the length of the concatenated serialisations byte array, which is itself prefixed by the number of bytes required to faithfully encode this length value plus 247.
\item If the concatenated serialisations of each contained item is less than $2^{64}$ bytes in length, then the output is equal to the concatenated serialisations prefixed by the minimal-length byte-array which when interpreted as a big-endian integer is equal to the length of the concatenated serialisations byte array, which is itself prefixed by the number of bytes required to faithfully encode this length value plus 247.
\item Otherwise, this sequence cannot be encoded.
Member

This clause should appear in the next equation.

@jamesray1
Contributor

jamesray1 commented Apr 9, 2018

I have found something else that I think should be changed. In equation 180 the summation is from n = 0 but it should be from k = 0, and the n in b_n and 256^(||b|| - 1 - n) should be changed to k.
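For reference, the big-endian interpretation in equation 180 (with the index written uniformly as k, as suggested) can be sketched as follows (`be_value` is my own name for the function):

```python
# interpret byte array b as sum over k in [0, ||b||) of b[k] * 256**(||b|| - 1 - k)
def be_value(b: bytes) -> int:
    return sum(b[k] * 256 ** (len(b) - 1 - k) for k in range(len(b)))

assert be_value(bytes([2, 72])) == 2 * 256 + 72 == 584
assert be_value(b"") == 0  # the empty byte array denotes zero
```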

@jamesray1
Contributor

I have spent more than an hour looking at this today (as well as more time previously). It is not very clear.

If you have 2^64 zeros, then:

let x = (0, ..., 0) such that
||x|| = 2^64
BE(||x||) = BE(2^64) = (1, 0, 0, 0, 0, 0, 0, 0, 0), i.e. 9 bytes. (Or is this line wrong?)
||BE(||x||)|| = 9
(183 + 9) . (1, 0, 0, 0, 0, 0, 0, 0, 0) . (0, 0, 0, ...)

@jamesray1
Contributor

jamesray1 commented Apr 9, 2018

Let x = (0, 0, ..., 0) be 2^584 zeros, so that ||x|| = 2^584.
Since 2^584 = 256^73, BE(||x||) = (1, 0, 0, ..., 0) with 73 trailing zeros, and so ||BE(||x||)|| = 74.
The encoding would then be (183 + 74) . (1, 0, 0, ..., 0) . (0, 0, ..., 0), but 183 + 74 = 257 is not a valid value for a byte.
Or are my assumptions wrong?

(Note I am assuming n < ||b|| actually means n runs from 0 to ||b|| - 1.)

This section could do with examples.

@acoglio
Member

acoglio commented Apr 13, 2019

Commit 2a79beb from PR #736 should take care of the issues in this PR.

The revised definition of RLP encoding returns an "error value" (\varnothing) when an RLP tree cannot be encoded. The definition of \mathbb{B} was not modified because it is a more general definition for all byte arrays, not just the ones that can be RLP-encoded. The definition of \mathbb{T} was not modified because the 2^64 limit applies to the concatenated serializations of the subtrees, not to the number of subtrees, and thus it needs the definition of encoding in order to be expressed. So it is more natural to have RLP encoding return an error value on non-RLP-encodable trees, and implicitly define the set of RLP-encodable trees as the complement (in \mathbb{T}) of the preimage of the error value, i.e. the set of RLP trees on which RLP encoding does not return the error value.
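The revised error-value approach can be sketched as a small Python model (this is my own rendering, not the paper's notation; `Tree`, `rlp`, and the use of `None` for \varnothing are assumptions, and the 2^64 bounds follow the discussion above):

```python
from typing import Optional, Union

Tree = Union[bytes, list]  # an element of T: a byte array or a sequence of trees

def be(x: int) -> bytes:
    # minimal big-endian encoding; BE(0) is the empty byte array
    return x.to_bytes((x.bit_length() + 7) // 8, "big") if x else b""

def rlp(t: Tree) -> Optional[bytes]:
    """Return the RLP encoding, or None (the error value) if t is not encodable."""
    if isinstance(t, bytes):
        if len(t) == 1 and t[0] < 128:
            return t                                          # single small byte
        if len(t) < 56:
            return bytes([128 + len(t)]) + t                  # short byte array
        if len(t) < 2**64:
            return bytes([183 + len(be(len(t)))]) + be(len(t)) + t
        return None                                           # too long to encode
    parts = [rlp(x) for x in t]
    if any(p is None for p in parts):
        return None                                           # error propagates up
    s = b"".join(parts)
    if len(s) < 56:
        return bytes([192 + len(s)]) + s                      # short sequence
    if len(s) < 2**64:
        return bytes([247 + len(be(len(s)))]) + be(len(s)) + s
    return None  # concatenated serialisations too long to encode

assert rlp(b"") == b"\x80"
assert rlp([]) == b"\xc0"
assert rlp(b"\x0f") == b"\x0f"
```

The set of RLP-encodable trees is then implicitly the preimage of non-`None` values, matching the complement construction described above.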

Can this PR be closed?

@fulldecent
Contributor Author

@acoglio Yes, 2a79beb and #736 do completely resolve this issue. Thanks for getting this through

@fulldecent fulldecent closed this Apr 14, 2019