Little-endian vs big-endian (take 2) #1046

Closed

JustinDrake opened this issue May 5, 2019 · 11 comments

@JustinDrake
Collaborator

In the last endianness debate we decided to favour little-endianness. Since then there have been various developments (see points 1-4 in favour of big-endianness) that warrant revisiting that decision.

Pros of little-endian

  1. Consistent with WASM
  2. Friendly to commodity hardware
  3. Consistent with parity-codec

Pros of big-endian

  1. Friendly to mathematical purity and spec readability and ease of reasoning (we get nice things such as the ith bit corresponding to 2**i; see the sketch after this list)
  2. Friendly to generalised indices and bigints (see here)
  3. Consistent with SHA256 (both external and internal serialisation), and potentially friendlier to MPCs and SNARKs/STARKs that involve hashed data.
  4. Consistent with BLS12-381 (Zcash's big-endian implementation is likely to get standardised)
  5. Consistent with big number implementations
  6. Consistent with RLP
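As a concrete illustration of pro 1 (a minimal Python sketch; the value and width are arbitrary): with big-endian serialisation, the byte string read as a bit string is literally the binary expansion of the number, with the bit at distance i from the right carrying weight 2**i.

```python
x = 2**12 + 2**3  # 0b1000000001000

# Big-endian serialisation: most significant byte first.
be = x.to_bytes(2, 'big')
bits = ''.join(f'{byte:08b}' for byte in be)

# The serialised bit string is the plain binary expansion of x,
# so the bit at distance i from the right has weight 2**i.
assert bits == format(x, '016b')
assert all((x >> i) & 1 == int(bits[-1 - i]) for i in range(16))
```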

All things considered, big-endianness may now be the favoured option. Brief counter-arguments to the little-endianness arguments:

  1. "Consistent with WASM" is the big one that made the scales favour of little-endianness last time. Would be good to get thoughts from WASM people. cc @wanderer, @axic, @jakelang, @lrettig, @cdetrio
  2. As I understand "Friendly to commodity hardware" is a non-issue for dual-endian hardware, and a minor performance penalty for little-endian hardware.
  3. As I understand "Consistent with parity-codec" is an annoyance for the Parity team and can be worked around relatively easily.
@djrtwo
Contributor

djrtwo commented May 5, 2019

big-endian

Consistent with SHA256 (both external and internal serialisation)

This is a crypto black-box that shouldn't affect us.

potentially friendlier to MPCs and SNARKs/STARKs that involve hashed data

I understand this is your hunch, but is there any actual evidence of this? source or something?

Consistent with BLS12-381

Also a crypto black-box. Not sure it's relevant.

Consistent with big number implementations

In general? Across languages? What big int implementations are you talking about?

Consistent with RLP

Consistency with VM serialization (single-endianness from top to bottom) trumps backwards compatibility imo

@JustinDrake
Collaborator Author

JustinDrake commented May 5, 2019

This is crypto black-box that shouldn't affect us.

crypto black-box. Not sure it's relevant

Serialisation of crypto black boxes may matter when those black boxes are implemented as MPC/SNARK/STARK circuits. One generally wants to reduce the depth and complexity of these circuits so avoiding unnecessary "flip" circuitry may help. Imagine an MPC circuit that does bigint arithmetic (naturally done using big-endian) and then does some hashing with the same bigint interpreted as data and SSZ little-endianness forces a bunch of unnecessary flipping. Imperceptible performance degradations in silicon circuits can turn out to be hugely significant in MPC or SNARK circuits.
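A hypothetical Python sketch of the flip described above (the value and widths are made up): the big-endian bytes a bigint library emits must be reversed before they can be hashed as little-endian data, and inside an MPC/SNARK circuit that reversal is extra wiring rather than a free pointer cast.

```python
import hashlib

# Pretend this value came out of a bigint library that serialises big-endian.
x = 0xDEADBEEFCAFEBABE
be = x.to_bytes(32, 'big')

# If the hash preimage is specified as little-endian, every byte gets
# flipped before hashing; in a circuit this reversal is extra wiring.
le = be[::-1]
assert int.from_bytes(le, 'little') == x
digest = hashlib.sha256(le).hexdigest()
```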

this is your hunch, but is there any actual evidence of this? source or something?

Yes, just a hunch. The thought is literally from a few hours ago—arguably FUD at this point 😂 To be confirmed/denied by an MPC expert. (I will be meeting Muthu from Ligero at MIT mid-May and will discuss with him.)

In general? Across languages? What big int implementations are you talking about?

My source is @mkalinin (see here). He writes: "all big number implementations in Java that I've seen uses big-endian to encode/decode numbers to/from byte arrays. So does Milagro, even in C implementation". I'm not qualified to comment beyond that.

Consistency with VM serialization trumps backwards compatibility imo

Agreed. The good news is that we may not have to enshrine VM serialisation inside consensus.

@mratsim
Contributor

mratsim commented May 5, 2019

Contrary to other debates like cough signed vs unsigned, I don't have any preference between big and little endian, but it is not true that big number implementations are big-endian.

The serialisation of their internal representation is often big-endian, however (but even if it were little-endian you would need a specific serialisation pass for most bigint/crypto libraries).

@lightclient
Member

From a performance point of view, most architectures support an instruction that efficiently reverses the byte order of a register value (see BSWAP on x86 or REV on ARM). This exists specifically to support fast decoding of network packets, which arrive in big-endian byte order, but it could equally apply to passing data to and retrieving it from our WASM VM.
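For intuition, a Python sketch of what BSWAP/REV do logically (bswap64 is a made-up name): the value's bytes are reinterpreted in the opposite order, which native ISAs do in a single instruction.

```python
def bswap64(x: int) -> int:
    """Logical equivalent of x86 BSWAP / ARM REV on a 64-bit value:
    re-read the value's eight bytes in the opposite order."""
    return int.from_bytes(x.to_bytes(8, 'little'), 'big')

assert bswap64(0x0102030405060708) == 0x0807060504030201
```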

@arnetheduck
Contributor

dual-endian hardware

Very rare; in particular, x86 and wasm are not on the list. x86 has instructions to help with big-endian data (MOVBE, BSWAP), but they all work by converting back and forth instead of operating on the data "natively" (there's no ADDBE).

wasm

wasm in particular does not have a native byte-swapping instruction, meaning that whenever you have to "read" big-endian data and convert it to an integer, you have to do so byte-by-byte, which takes (a lot of) code space. Here's a trivial example of implementing a simple add function operating on data serialised as BE and LE: https://gcc.godbolt.org/z/Dlcd46

Compare how, if the data arrives in big-endian format, we have to use a lot of (contract) space to convert to little-endian and back to perform the operation.
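A rough Python rendering of the same contrast (the function names are mine, not from the godbolt example): the little-endian decode corresponds to a single load on little-endian hardware, while the big-endian decode must be spelled out shift by shift, which is the extra code wasm ends up carrying.

```python
def load_le64(buf: bytes) -> int:
    # On a little-endian machine this is a single 64-bit load.
    return int.from_bytes(buf[:8], 'little')

def load_be64(buf: bytes) -> int:
    # With no byte-swap instruction (as in wasm), this becomes
    # eight explicit shift-and-or steps.
    x = 0
    for b in buf[:8]:
        x = (x << 8) | b
    return x

value = 2**64 - 2
assert load_le64(value.to_bytes(8, 'little')) == value
assert load_be64(value.to_bytes(8, 'big')) == value
```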

Consistent with big number implementations

not sure what this means - big number implementations will typically use native endianness internally for operations on the limbs (machine-word-size chunks) and compose these operations however is convenient - endianness doesn't come into play. For interoperability, big-endian is sometimes/often used during serialisation, but even that depends. Here's GMP, a popular bigint library, supporting both endians in serialisation: https://gmplib.org/manual/Integer-Import-and-Export.html - it supports both, plus "native" for maximum efficiency on a single platform. Most importantly, the serialisation is completely arbitrary / disconnected from what goes on inside the library - it's up to the protocol to choose a serialisation and be consistent.
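In the spirit of GMP's mpz_import/mpz_export order parameter, a small Python sketch: the same byte string yields different integers depending on the endianness chosen, and that choice lives in the protocol, not in the bigint machinery.

```python
raw = bytes(range(1, 9))  # b'\x01\x02...\x08'

# Same bytes, two interpretations; the protocol, not the library, decides.
as_be = int.from_bytes(raw, 'big')     # 0x0102030405060708
as_le = int.from_bytes(raw, 'little')  # 0x0807060504030201
assert as_be != as_le
```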

if/when we implement a bigint library for wasm, it will first convert everything to little-endian words (like the add example above), perform the operation and convert back to store it. less conversion = smaller code size.

@lightclient
Member

Compare how, if the data arrives in big-endian format, we have to use a lot of (contract) space to convert to little-endian and back to perform the operation.

Would this conversion not happen on the native machine before handing off the data to the WASM runtime via the linear memory buffer?

@djrtwo closed this as completed May 5, 2019
@djrtwo reopened this May 5, 2019
@djrtwo
Contributor

djrtwo commented May 5, 2019

whoops, accidentally closed. Reopened!

@arnetheduck
Contributor

Would this conversion not happen on the native machine before handing off the data to the WASM runtime via the linear memory buffer?

I guess that depends on where the data comes from and who does the interpreting of that data - a conversion de facto has to happen somewhere unless there's mechanical sympathy between the parts that make up the whole, something to consider when choosing the parts. I'd think we want as much as possible to be possible at the execution layer, where "possible" includes the notion of efficiency - for example, so we don't have to add too many black boxes and precompiles.

@lightclient
Member

lightclient commented May 6, 2019

I guess that depends on where the data comes from and who does the interpreting of that data

Since the data will flow into and out of the WASM runtime via the EEI, I figured a reverse operation could occur at that boundary in the native client.

@mkalinin
Collaborator

mkalinin commented May 8, 2019

not sure what this means - big number implementations will typically use native endian internally for operations on the limbs (machine-word-size chunks) and compose these operations however is convenient - endianess doesn't come into play

Speaking of big numbers, endianness may be used to specify the order of limbs in their representation. That's what I meant in the post about the Java big number implementation; it uses an order with the most significant byte (limb = byte) first, i.e. big-endian, with no other options.

Why might big numbers matter? Because, starting from a UintN type with N > 64, any language will have to use a big number implementation to carry values of these types, as they are bigger than a word that a processor register can handle. But it's hardly the case that such types will come into play.
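A sketch of the mismatch described above, with Python standing in for Java (the value and width are illustrative): Java's BigInteger.toByteArray() emits the most significant byte first with no little-endian option, so producing little-endian SSZ from such a library requires an explicit reverse.

```python
x = 2**96 + 5  # wider than a 64-bit word, i.e. a UintN with N > 64

# 'big' mirrors Java's BigInteger.toByteArray(), which has no
# little-endian option.
be = x.to_bytes(16, 'big')

# Producing little-endian SSZ from such a library means an explicit reverse.
le = be[::-1]
assert int.from_bytes(le, 'little') == x
```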

For UInt64 types, little-endian is obviously a win for languages that are able to work with memory pointers (C, C++, Go, Rust, etc.).

With regard to crypto black boxes: the best way for SSZ would be to serialise them as a pure byte string representing the serialisation format of the particular crypto black box, just as we do for BLS points.

I guess that depends on where the data comes from and who does the interpreting of that data

The only benefit to WASM of keeping SSZ little-endian that I can see is avoiding flipping in an SSZ implementation written in WASM.

big-endian pros:

  • conformance with some (perhaps most) big number implementations, a case that is unlikely to matter given our needs

little-endian pros:

  • often win for a number of languages
  • win for WASM implementation of SSZ

Comparing the two lists makes me think that keeping SSZ little-endian is the rational decision.

P.S. A discussion on endianness in different systems that I found interesting: https://news.ycombinator.com/item?id=9451284

@vbuterin
Contributor

vbuterin commented May 11, 2019

A couple more pros for big endian:

  • Big-endianness feels more natural to humans who are accustomed to decimal place value
  • The standard way to serialise big-endian into hex doesn't have the awkward mixed-endian effect that little-endian does (eg. the number 2 * 2**12 + 3 * 2**8 + 4 * 2**4 + 5 in big endian is 0x2345 but in little endian is 0x4523, which is just so hard to read)
  • With big-endian we have the invariant (a > b) <-> (int_to_bytesn(a) > int_to_bytesn(b)) (see the sketch after this list)
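Both of the last two points can be checked with a quick Python sketch (widths are illustrative; int_to_bytesn is a stand-in for the spec helper):

```python
x = 2 * 2**12 + 3 * 2**8 + 4 * 2**4 + 5

# Big-endian hex reads like the number; little-endian hex is "mixed".
assert x.to_bytes(2, 'big').hex() == '2345'
assert x.to_bytes(2, 'little').hex() == '4523'

# Fixed-width big-endian bytes compare lexicographically like the integers.
def int_to_bytesn(v: int, n: int = 8) -> bytes:
    return v.to_bytes(n, 'big')

a, b = 513, 258
assert (a > b) == (int_to_bytesn(a) > int_to_bytesn(b))
```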

Since the data will flow into and out of the WASM runtime via the EEI, I figured a reverse operation could occur at that boundary in the native client.

It's worth noting that most of the serialization is going to happen on layers higher than the base protocol; eg. in the phase 2 proposal the "transaction" that gets passed into the top-level execution is just a bytearray. So if WASM is little endian and SSZ is big endian there is going to be a lot of byte flipping code at many levels of the stack.

And this is honestly a big pro for little-endian, especially given our goals with abstraction. So it does seem like WASM is forcing our hand....

Agree that BLS12-381 and SHA256 are black boxes.

@djrtwo
Contributor

djrtwo commented Jul 30, 2019

Closed, as the phase 0 spec is frozen.
