The integer debate #626
Follow up issue after earlier gitter chat and implementers call today (ethereum/eth2.0-pm#29)
To pick the best solution, we need a more structured approach, listing the arguments for each "problem class".
Go, Rust, Swift, Nim all support signed/unsigned 32/64 bit numbers.
Range here: 1 slot per 6 seconds, for a few thousand years (could upgrade earlier...) = approx.
Signed numbers: No. However, there is a case for the first few epochs, where logic looks back at history from before genesis. This could potentially result in an underflow.
Ways to catch the underflow:
If we don't care about very long-term consistency, we can go for 32 bits. It seems unnecessary however, as there's more range available to every language/platform. E.g. we could even opt for an imaginary 48 bit (un)signed integer.
Highlight from earlier gitter chat:
An alternative would be to use big numbers, like the one Java also has; see below.
Java only supports signed 64 bit numbers ("long"). Of course, you could transport an unsigned 64 bit number over a signed number, as done previously in Guava and supported in Java 8. This does introduce other things to consider, please refer to comments: A, B.
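A minimal sketch of what that "transport" looks like with the Java 8 helpers (the values here are made up for illustration):

```java
// Carrying an unsigned 64-bit value over Java's signed long: the bits are
// the same, only the interpretation differs, so plain operators give the
// signed answer and the *Unsigned helpers give the unsigned one.
public class UnsignedTransport {
    public static void main(String[] args) {
        long a = 5L;
        long b = -1L; // bit pattern of 2^64 - 1 when read as unsigned

        // Signed comparison gets the unsigned ordering wrong:
        System.out.println(a > b);                          // true (signed view)
        // Long.compareUnsigned reads the same bits as unsigned values:
        System.out.println(Long.compareUnsigned(a, b) < 0); // true: 5 < 2^64 - 1

        // Rendering the bits as an unsigned decimal string:
        System.out.println(Long.toUnsignedString(b));       // 18446744073709551615
    }
}
```

The danger is exactly what the comments below discuss: nothing stops you from accidentally using `>` or `/` on such a value and silently getting the signed semantics.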
Alternative would be Big integers. (something that looks like
The approximate range here: 0...4,000,000 (worst case validator count).
Signed numbers: No. However, there could be a case for an
Validator indices are relatively low, and would fit easily in 32 bits:
Now the questions here are:
Fits in a 52 bit mantissa. ES6 supports bitwise operations only for 32 bits. If we want to do bit-magic on indices, we may want to just go for 32 bits or less.
Java only supports signed integers. "int": 32 bits, signed. docs
For balances there is a valid concern where we may not even want to use a 64 bit number, if we want precise/small balances.
Range: two options:
Signed: No. However, there is a case for ease in math to consider that clients may want to convert to signed numbers internally. Signs are not necessary anywhere after being encoded.
We know the limitations of these by now. Balances are likely to require the most resolution in the near-term. No shortcuts with ranges (like with slots). Highly important to get right and prevent bugs.
Personally a fan of using big-ints here, and use safe-math.
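As a sketch of what "safe-math" could mean in Java terms (the balance values are invented for illustration): checked native arithmetic that throws on overflow, with `BigInteger` as the fully overflow-free but allocation-heavy option.

```java
import java.math.BigInteger;

public class SafeBalances {
    // Checked addition on a signed long: throws instead of silently wrapping.
    static long addBalance(long a, long b) {
        return Math.addExact(a, b); // ArithmeticException on overflow
    }

    public static void main(String[] args) {
        System.out.println(addBalance(1_000_000_000L, 32L)); // 1000000032

        // BigInteger never overflows, at the cost of boxing/allocations:
        BigInteger max = BigInteger.valueOf(Long.MAX_VALUE);
        System.out.println(max.add(BigInteger.ONE)); // 9223372036854775808
    }
}
```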
I tried to outline the problem cases + options + considerations. Please comment if you think anything is missing, or want to make a strong case for any of the options.
Please share if you have a different view, and why.
Edit: changed opinion slightly, still same integer sizes, but open for either signed or unsigned ints. If I really had to make a choice at gunpoint, I think I would rather have unsigned ints (mostly because of simplicity of having no signs, although they could be just fine).
Beyond the implementers story there is also the spec side to consider.
Potentially the early slots become an edge case with special complex (?) treatment in the spec.
As discussed during the implementer calls, using signed integers requires conversion at very specific boundaries:
However using unsigned integers requires:
I do not support unsigned int because:
In short, signed integers when we need to do math, logic and indexing and unsigned when we need control over memory representation (EVM, serialization).
An important distinction I'd add is to differentiate serialization and logic.
For slots etc, it's entirely reasonable that the serialization is
I'd consider the offset solution impractical, mainly because it encourages cutting corners on correctness to gain some performance in the near term: going for the smaller data type when you should be using a bigint. That effectively penalizes languages that don't naturally support the given "reasonable" range, making real-world deployment either more bug-prone or less rich in terms of implementation diversity.
As an aside, I also find signed integers to be more difficult from a serialization perspective - their byte encoding (now little-endian) is onerous to work with in general (parsing, debugging etc)
It's my opinion that the spec should use unsigned integers in all situations where the described value should never be negative. Additionally, the spec should use unsigned integers of the minimal bit length required for the given purpose, and explicitly define under/overflow behavior.
Implementors can then use signed or unsigned integers as they see fit, as long as the requirements of the specification are maintained.
I believe it is important to properly support teams that work in Java/Kotlin and build Ethereum 2.0 clients. We (hat tip to @cleishm) have implemented an unsigned 64 bit number in Cava:
I have prepared a commit for Artemis to change its logic to use it (I replaced all uses of UnsignedLong):
I believe this alleviates some of the pains, especially as UInt64 supports bit shifting, exact additions and subtractions, and more.
If I can supply an opinion - Kotlin offers the flexibility you seek if you're looking for a flexible DSL that allows overloading operators.
"int: By default, the int data type is a 32-bit signed two's complement integer, which has a minimum value of -231 and a maximum value of 231-1. In Java SE 8 and later, you can use the int data type to represent an unsigned 32-bit integer, which has a minimum value of 0 and a maximum value of 232-1. Use the Integer class to use int data type as an unsigned integer. See the section The Number Classes for more information. Static methods like compareUnsigned, divideUnsigned etc have been added to the Integer class to support the arithmetic operations for unsigned integers."
"long: The long data type is a 64-bit two's complement integer. The signed long has a minimum value of -263 and a maximum value of 263-1. In Java SE 8 and later, you can use the long data type to represent an unsigned 64-bit long, which has a minimum value of 0 and a maximum value of 264-1. Use this data type when you need a range of values wider than those provided by int. The Long class also contains methods like compareUnsigned, divideUnsigned etc to support arithmetic operations for unsigned long."
Totally subjective opinion and not enough empirical evidence to argue for a debate.
Have you tested the speed of this? This should get easily optimized in the JIT. Many large financial institutions use Java for HFT (high frequency trading), which requires insane performance with large numbers.
As for the 'unreadable' part... it is common to build an API to simplify it for your use case.
True, this is what I meant with "transporting over a signed int", because this "support" is completely artificial.
See above, it's really just the same data-type, with a hacky workaround to provide support for unsigned behavior. It's not completely native to the JVM bytecode afaik (please correct me if I'm wrong).
I literally wrote up an entire post to start a debate on considerations with integers, with special attention to Java. And I'm familiar enough with the JVM and Java bytecode to work with JNI, reflection and know of awful hacks such as the lower integer cache. I tried my best documenting everything, and noted the transport-over-signed integer support, but yes I have my opinions.
Have not tested it myself, but there's plenty of other research/benchmarks into big-integers. And generally, the standard BigInteger is much slower than the one in Go. Now compare it to a native 64 bit integer with native operations, no boxing/allocations, and it's far behind.
Personally, I fundamentally dislike it because of the boxing (it's not necessary if you're on a 64 bit platform) and the awful syntax. There are cases where I would use it, however, like safe-math, or > 64 bit integers.
Also, I don't care about "usage in HFT" when it's not the bottleneck of the actual example application. Streaming and distribution across compute are much more important in such a case afaik. As a side note; I wonder if we can make the beacon-chain processing itself more parallelized...
I will just edit this, if it's too subjective.
My proposed alternative, if you want to pursue JVM with less of the concerns that Java raises: Kotlin.
Generally, Kotlin does a much better job at implementing the same thing (although still "experimental" phase in 1.3), but with types, readable constants, and readable operators. Still the same JVM limitations tho, as far as I know (but Kotlin also has a native target as well). For reasons like these, I think it deserves a look to transition to as a JVM based client now that it's still a relatively early phase.
TLDR: avoid hacky pseudo unsigned 64 bit support if possible, i.e. see if just signed numbers would work first. And if you do, use annotations, enforce them, and document the dangers.
Edit: fix quoting markup
If we're comparing big-ints, take a look at http://www.wilfred.me.uk/blog/2014/10/20/the-fastest-bigint-in-the-west/
That said, I think we need to stay on topic, and discuss the benefits and drawbacks of all integer signs and sizes. Let's not get stuck talking about just big-ints, as clearly, we can do better.
The question is: what is the best choice for a solid and fast implementation, while still being reasonable to implement in all the involved languages? (x3, the different problems may require different types of solutions)
@protolambda What is the relevance of quoting a blog post from 2014 with tests against Java 1.7?
In the case of ETH 2.0, which is still being heavily developed and worked on, shouldn't the focus be less on premature optimization and more on a solid implementation? Solid meaning that it passes all unit and integration tests.
Pretty and fast are optimizations that can be added later. Java makes profiling and refactoring quite straightforward.
Kotlin is a bike shed issue. Switching to a whole new language just over this issue seems like another premature optimization. It isn't trivial. Documentation, build systems, technical skills of the developers, etc... all need to be updated and changed. In other words, a whole number of risks for benefit of unknown percentage.
I'd just go with @atoulme's suggestion and use Cava's implementation of UInt64. Problem solved.
I agree that Java/Kotlin is a bit of a bike shed issue, much like big vs little endian, signed vs unsigned, etc. ;-)
That said, I disagree about the triviality. Kotlin is highly compatible with Java, you can mix it in the same source tree, and it all works in the same IDE. I'd argue it's more like a syntax variation than a different language. Indeed, much of Cava is implemented in Kotlin but consumed using each module's Java visible APIs. We do this because it's much easier to use Kotlin in many situations. Such a situation would be if there really is an extensive need to use unsigned integers - one can switch a bunch of classes into the Kotlin syntax and that problem is solved. Fairly trivially.
(BTW, Cava already provides an SSZ implementation using Kotlin unsigned integers: https://github.com/ConsenSys/cava/blob/master/ssz/src/main/kotlin/net/consensys/cava/ssz/experimental/SSZWriter.kt#L183. I expect other eth-2.0 related libraries in Cava will, where relevant, also expose Kotlin API using unsigned ints.)
Even though I'm in one of the teams working in Java, I don't think this discussion should be driven by what suits Java - one way or another we'll cope with whatever the final decision is.
For me, this ought to be a more purist conversation around formulating the spec in the safest way for implementers generally. Yesterday, I wrote a long justification for using signed integers, or, alternatively, providing clear bounds on valid ranges so implementers could safely do as they please. But then I realised I was just repeating the wisdom of @protolambda and @mratsim above, and there's no point going round in circles.
Having said that...
Using a native type if we can (e.g. signed long for epochs/slots) would be immensely simpler. The main challenge with the wrapped types is the extremely clunky syntax which is horrible to write, awful to read and horrendous to maintain. But, as I said, we'd cope. [Yes, we know about Kotlin!]
What I'd like to see is this issue updated with some actual code to be used in ETH2.0, done both ways (native types vs. 'extremely clunky syntax which is horrible to write'), in Java and maybe even Kotlin.
Then we can do a bit of performance testing as well as syntax evaluation and discussion, and work towards a good middle ground. Right now, it is all too subjective IMHO, we should be letting the code speak for itself.
@lookfirst OpenJDK hasn't even changed the big-number implementation since 2014. JDK 8 is from Mar 2014: https://hg.openjdk.java.net/jdk8/jdk8/jdk/file/tip/src/share/classes/java/math/BigInteger.java
@benjaminion Agree, to a certain extent. If there's something to choose, let's go for compatibility.
Yep, we should definitely be able to handle slots through native types for most languages. All we have to do is define a working range (enforced in spec) to support a distant but not too big future (e.g. 1000 years), and then we can support every client, each working with their native types. And unsigned/signed needs to be decided.
So let me try to simplify this slot debate to get it done with:
Slot/Epoch format debate nr. 1:
Slot/Epoch format debate nr. 2:
A) Enforce a range of max. x bits (e.g. 52) to be used (good enough for ... years/centuries), so that every client can safely use their native types.
Fair to say epochs get the same encoding: an epoch spans 64 slots, so its range is only about 6 bits smaller. Also just nice for consistency and ease of conversion.
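A sketch of that conversion, assuming the 64-slots-per-epoch figure from above (the constant name is mine, not from the spec):

```java
public class SlotToEpoch {
    static final int SLOTS_PER_EPOCH_LOG2 = 6; // 64 slots per epoch, as above

    // Epoch of a slot carried in a signed long. The unsigned shift (>>>)
    // would still give the right answer if the top bit were ever set,
    // where a signed division (slot / 64) would not.
    static long epochOf(long slot) {
        return slot >>> SLOTS_PER_EPOCH_LOG2;
    }

    public static void main(String[] args) {
        System.out.println(epochOf(0L));    // 0
        System.out.println(epochOf(63L));   // 0
        System.out.println(epochOf(64L));   // 1
        System.out.println(epochOf(1000L)); // 15
    }
}
```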
Then there is the validator index encoding: given the "4 million" worst case number, you would think that 32 bits is good and simple enough. So there's two questions to answer:
This should also be easy enough to decide on.
Then lastly, we have balances, which don't have a low range to start thinking about wide native support, and are hard to get right.
For this particular problem class, I kinda agree with @lookfirst here (although it is a lot of work):
Given that slots and indices are important, yet relatively easy to solve, I would prioritize this, and continue the debate on balances later.
Proposal to get the debate to 2/3 problems solved:
Slots & Epochs
We have to deal with the possibility that someone tries to propagate a block, created in a:
Given the 4M worst case often cited, or 8M if you go by the
As mentioned in the original write-up:
And since indices are just indices, not out-of-context keys (unlike the public key that everyone knows of for verification etc.), 32 bits should be enough for a long time. (IMO: Once we get to a point where we need to design to billions of validators, it's mainstream and big enough to have gone through much, much more updates/forks)
I think signed integers work slightly better for API-like reasons, but okay with unsigned here.
If anyone can make a strong case for unsigned / against signed, raise your voice.
"indices are not negative" is not good enough imho, as it's useful to have in special-cases (like mentioned above), and negative indices are no different than indices out of range on the other end, i.e.
Besides, life for Java is slightly easier with it being signed, since they never have to wrap them in a
Same for everywhere else, just update the current implicit
If some negative validator index is ever encoded and used in a slashing/block/whatever, it's the same as an index being out of range on the other side of the registry, and we mark it as invalid.
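That "same as out of range" observation collapses into a single check if you compare the bits unsigned (the registry size here is just the worst-case figure from the write-up):

```java
public class IndexBounds {
    // One comparison covers both "negative" and "too large": read as
    // unsigned, any negative bit pattern is a huge value and falls out of
    // range exactly like an index past the end of the registry.
    static boolean isValidIndex(long index, long registrySize) {
        return Long.compareUnsigned(index, registrySize) < 0;
    }

    public static void main(String[] args) {
        long size = 4_000_000L; // worst-case validator count from the write-up
        System.out.println(isValidIndex(0L, size));         // true
        System.out.println(isValidIndex(3_999_999L, size)); // true
        System.out.println(isValidIndex(4_000_000L, size)); // false
        System.out.println(isValidIndex(-1L, size));        // false
    }
}
```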
TBD later, part 3/3 of the debate.
Since big-integers are involved for some languages that don't support
For indices, signed works better because indices at a low level are pointer accesses, and those are signed (cf. ptrdiff_t), so this maps very well to C on 32- and 64-bit platforms. But if we enforce the range, that is an implementation detail.
Edit - Side-note: Quoting Google C++ style guide
For a specification, I still contend that it should describe the minimal requirement. If slots & epoch only need a range of 0 .. 2**52, then define it that way in the spec. If a bit-flag is needed to indicate an error, add that in the specification also. Implementations can then use whatever types their language of choice provides, as long as they can represent the required range and the implementation handles under/overflow as specified.
My teammates and I totally agree with @benjaminion. We are another Java team working on beacon chain implementation.
Alternative proposal, the unsigned int one
Ok, so alternatively, to make the spec more "pure", we can say the types for slots/epochs and validator indices are unsigned. But with very clear wording on the allowed range, and encoding. This way, clients can safely use the native types that they like ("long" (java signed int64), "Number" (JS sign + mantissa)).
Below representation is big-endian, to make it easy to read. Little-endian does not change much, other than the actual memory order of the bytes.
Exceptions can be made for API design (e.g. -1 for slot of unknown block-root).
52 bits should be plenty for slot/epoch numbers.
For validator numbers, we could use the same policy above for maximum capacity.
If you like this better than the earlier proposal, please add an emoji. If not, please show support for the earlier proposal, or write your own.
@protolambda So to paraphrase, are you saying that slot/epoch should be an unsigned 52-bit integer, but any part of the specifications that encodes these values should do so as 64-bits with the top 12 always set to zero?
Do we know all the places these values will be encoded?
@cleishm Not exactly.
What I'm saying is that we can loosen the design space by just stating that every client has to support 52 bits. Every platform can do this. This requirement is important, because if you transfer data through a peer, you want them to be able to handle it (without something akin to a reduction in resolution), so that the data can propagate to other peers without problem.
And it's two's complement, so the sign bit will be left-most, next to the unused bits, and all will be 0 for our use case (positive numbers from 0 ... n, with n bounded by time or validator count).
And we encode it as 64 bits, because it's much more standard, and we can loosen up the range towards 64 bits as soon as the need arises (in millions of years with current slot time, or when ethereum validator count explodes beyond imagination...).
Also, being unsigned would just mean the spec has to be explicit on simple underflows (e.g. slot lookback early in the chain, looking for data before genesis). And clients can choose to handle it as signed numbers instead of extra if statements, if they want to.
The alternative-alternative proposal would be to just use
@protolambda Thanks for the clarification - I think that agrees with what I had in mind. Specify slot/epoch as an integer value in the range 0 .. 2^52, and specify any encoding of that to use 64-bits (thus the most significant 12 bits must be zero). Additionally, define under/overflow semantics as wrap-around or error.
Implementations are then free, in their code, to treat slot/epoch as unsigned 52-bit integers, or 64-bit signed or unsigned values. They just have to make sure it encodes & decodes correctly, and handles under/overflow as specified.
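A sketch of what that decode-time range check could look like (the names are mine; treating the valid range as "fits in 52 bits", i.e. the top 12 bits of the 64-bit encoding are zero):

```java
public class SlotRangeCheck {
    static final int SLOT_BITS = 52; // range discussed above

    // A decoded 64-bit slot is in range iff it fits in 52 bits,
    // i.e. the 12 most significant bits are all zero.
    static boolean inRange(long slot) {
        return (slot >>> SLOT_BITS) == 0;
    }

    public static void main(String[] args) {
        System.out.println(inRange(0L));             // true
        System.out.println(inRange((1L << 52) - 1)); // true
        System.out.println(inRange(1L << 52));       // false
        System.out.println(inRange(-1L));            // false (sign bit set)
    }
}
```

A signed-long implementation gets the same guarantee for free: any value that passes this check is non-negative, so it can be handled with plain signed arithmetic.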
I wouldn't be opposed to adding something like "Any
After experimenting with SSZ encoding implementation and spec implementation more (to be published), I feel like the choice for unsigned integers works much better: it's not hard to deal with the unsigned math at all in practice, and exclusiveness of unsigned numbers reduces the complexity of encoding/decoding.
The complexity reduction I talked about is just not having to deal with two types of integers in your encoding/decoding functions. And allowing negative numbers in a communication space that has no concept of a negative number makes no sense imho. By being clear about the range of unsigned 64 bit integers (i.e. for each of the three classes: 1) slot/epoch, 2) val. indices, 3) balances), all clients should be able to support uint64.
Just to be clear:
Closing this issue now, it stays uint64, and we'll all try to be more clear about the usage of integers in future changes.