SIMD Operations for the EVM #616
Current PR: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-616.md
A proposal to provide Single Instruction Multiple Data types and operations for the Ethereum Virtual Machine, making full use of the 256-bit wide EVM stack items, and offering substantial performance gains for vector operations. Since a vector of one element is a scalar, the performance of native scalars for 64-bit and smaller quantities is also provided.
Almost all modern CPUs include SIMD hardware that operates on wide registers of data, applying a Single Instruction to Multiple Data lanes in parallel, where lanes divide a register into a vector of scalar elements of equal size. This model is an excellent fit for the wide stack items of the EVM, offering substantial performance boosts for operations that can be expressed as parallel operations on vectors of scalars. For some examples, a brief literature search finds SIMD speedups of
We define the following extended SIMD versions of the EVM's arithmetic, logic, and comparison operations, as well as a set of operations needed for working with SIMD vectors as data. As with the normal versions, they consume their arguments from the stack and place their results on the stack, except that their arguments are vectors rather than scalars.
We propose a simple encoding of SIMD operations as extended two-byte codes. The first byte is the opcode, and the second byte is the SIMD type: scalar type, lane count, and lane width.
Thus in 256-bit vectors we can specify SIMD types with unsigned integer lanes from 8 to 64 bits wide, in vectors of 32 down to 2 lanes, respectively. Using all the reserved bits, the encoding allows for 256-Kbit vectors.
Note that when the lane count is one the operation is on a single scalar, so this specification also provides native operations on scalars.
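To make the two-byte encoding concrete, here is a small sketch in Python. The bit layout of the type byte is an assumption of mine for illustration (the excerpt above does not fix one): low bits for the log2 lane width, middle bits for the log2 lane count, one bit for signedness, the rest reserved.

```python
# Hypothetical sketch of the SIMD type byte: the exact bit layout is an
# assumption, not the EIP's. Bits 0-2: log2(lane bits) - 3; bits 3-5:
# log2(lane count); bit 6: signedness; bit 7: reserved.

def encode_simd_type(lane_bits: int, lane_count: int, signed: bool = False) -> int:
    assert lane_bits in (8, 16, 32, 64), "lanes are 8 to 64 bits wide"
    assert lane_count & (lane_count - 1) == 0, "power-of-two lane count"
    assert lane_bits * lane_count <= 256, "must fit a 256-bit stack item"
    width_code = lane_bits.bit_length() - 4   # 8 -> 0, 16 -> 1, 32 -> 2, 64 -> 3
    count_code = lane_count.bit_length() - 1  # log2(lane_count)
    return width_code | (count_code << 3) | (int(signed) << 6)

def decode_simd_type(b: int):
    lane_bits = 8 << (b & 0b111)
    lane_count = 1 << ((b >> 3) & 0b111)
    signed = bool((b >> 6) & 1)
    return lane_bits, lane_count, signed

# A vector of 4 unsigned 32-bit lanes round-trips through one byte:
t = encode_simd_type(32, 4)
assert decode_simd_type(t) == (32, 4, False)
```

The point of the exercise is that one byte comfortably covers every lane width and count that fits a 256-bit stack item, with bits left over for the growth the proposal mentions.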
Wide integers, SIMD vectors, and bitstrings.
Wide integer values on the stack, in storage, and in memory are stored (conceptually) as MSB-first bitstrings. SIMD vectors, by contrast, are stored to match most SIMD hardware: as LSB-first bitstrings, starting with the least significant bit of the lowest-indexed element of the vector and proceeding upwards. This may yield a surprise: when a vector is converted to a wide integer, and the wide integer is then placed in memory or storage, the vector, like the wide integer, will be stored MSB-first. However, converting the wide integer back to a vector yields the correct value.
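The round trip described above can be sketched in Python. The helper names and the 4 x uint32 example are mine, not the spec's; the sketch only demonstrates the byte-order observation.

```python
# Sketch of the byte-order surprise: a SIMD vector is an LSB-first
# bitstring (lane 0 in the least significant bits), while a wide EVM
# integer is laid out MSB-first in memory and storage.

def vector_to_wide(lanes, lane_bytes=4):
    # LSB-first: lane 0 occupies the least significant bytes.
    n = 0
    for i, lane in enumerate(lanes):
        n |= lane << (8 * lane_bytes * i)
    return n

def wide_to_vector(n, lane_count=4, lane_bytes=4):
    mask = (1 << (8 * lane_bytes)) - 1
    return [(n >> (8 * lane_bytes * i)) & mask for i in range(lane_count)]

v = [1, 2, 3, 4]
w = vector_to_wide(v)

# Placed in memory as a wide integer, the image is MSB-first, so lane 0's
# bytes land at the *end* of the big-endian byte string:
mem = w.to_bytes(16, "big")
assert mem[-4:] == (1).to_bytes(4, "big")

# But the round trip through the wide integer is lossless:
assert wide_to_vector(w) == v
```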
Notation and Vector Types
In the pseudo-assembly below we denote the lane type, lane width, and number of lanes using Solidity's syntax, so the following says to push two SIMD vectors of 4 unsigned 32-bit ints onto the stack, and then add the two vectors together.
Arithmetic, logic, and bitwise operations
The extended operations from codes B0 through CF have the same semantics as the corresponding operations for codes 00 through 1F, except that the modulus varies by scalar type and the operations are applied pair-wise to the elements of the source operands to compute the destination elements, as above. E.g.
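The pair-wise, per-lane modular semantics can be illustrated in Python (this is a model, not EVM code; the name `xadd` follows the extended-opcode naming used elsewhere in the proposal).

```python
# Illustrative model of lane-wise semantics: an extended add on uint32[4]
# adds corresponding lanes, each modulo 2**32, just as scalar ADD works
# modulo 2**256.

def xadd(a, b, lane_bits=32):
    modulus = 1 << lane_bits
    return [(x + y) % modulus for x, y in zip(a, b)]

# Per-lane wraparound: the overflow in lane 0 does not carry into lane 1.
assert xadd([0xFFFFFFFF, 1, 2, 3], [1, 1, 1, 1]) == [0, 2, 3, 4]
```

The key property is that the modulus is per lane, so carries never cross lane boundaries.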
And so on for most of the twenty-three remaining operations in columns B and C.
There are exceptions:
Data motion operations
The operations from D0 through DF move data in and out of vectors, and move vectors between storage, memory, and the stack.
Only the SIMD operations are valid on SIMD vectors - this must be validated at contract creation time.
All SIMD operands must have the size, type and number of lanes specified by the operator - this must be validated at contract creation time.
Following EIP #615, a type-safe syntax for declaring subroutines taking vector arguments will be needed.
Currently, the lowest common denominator for SIMD hardware (e.g. Intel SSE2 and ARM Neon) is 16-byte registers supporting integer lanes of 1, 2, 4, and 8 bytes, and floating-point lanes of 4 and 8 bytes. More recent SIMD hardware (e.g. Intel AVX) supports 32-byte registers, and EVM stack items are also 32 bytes wide. The limits above derive from these numbers, ensuring that EVM code stays within the bounds of available hardware - and the reserved bits provide room for growth.
For most modern languages (including Rust, Python, Go, Java, and C++), compilers can do a good job of generating SIMD code for parallelizable loops, and/or there are intrinsics or libraries available for explicit access to SIMD hardware. So a portable software implementation will likely make good use of the hardware on most platforms, with intrinsics or libraries used as available and needed. Thus we can expect these operations to take about the same time to execute regardless of element size or number of elements (or, for 256-bit vectors on 128-bit hardware, up to twice the time).
One motivation for these operations, besides taking full advantage of the hardware, is assigning lower gas costs for operations on smaller scalars.
Measurements of three of the major clients shed some light on appropriate gas costs. For the C++ interpreter
Some relevant generalities on computation costs...
So measurement will tell, but to a first approximation, on the C++ interpreter:
There are some exceptions to these estimates.
High Level Languages
This new facility will be of limited use if not exposed to programmers in higher-level languages than EVM code. To demonstrate at least one workable approach, here is a possible extension to Solidity.
A SIMD vector type could simply require an explicit annotation to array declarations, which limits arrays to the element types and number of elements supported by the SIMD facility, so that
The arithmetic, logic, and comparison SIMD operations of the EVM would be mirrored by corresponding Solidity operations that operate element-wise on pairs of simd arrays.
is evaluated as
Comparison operations produce arrays of
is evaluated as
And of course other operations would need to be defined, like element access. For SIMD operations without corresponding language operators, named functions would be needed. E.g.
As an added bonus,
So XGET takes 2 types as input, but there are actually three vectors to consider in its implementation: the source vector, the index vector, and the result vector. The specification should either clarify the type of the result vector, or take 3 distinct types as input for the most general solution.
For XPUT there are really 4 vectors: source, replacements, replacement indices, and result. We also need to specify whether the type of the replacement vector is the same as the source's, and what the type of the result is.
XGET takes only 2 vectors - it's like the get vector sucks data out of the source vector and is modified in the process. The get vector can have a different type than the source vector.
XPUT takes 3 vectors: replacements, put indexes, and source. The source is misnamed - it's really the destination. I should be clearer on that, and on the replacements vector having the same type as the destination. It could have a different type if we want the generality, but I think that would quadruple the size of the implementation.
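As I understand the description above, XGET is a gather and XPUT is a scatter. A rough Python model of those semantics (the function names and the value-returning convention are mine, for illustration only):

```python
# Model of the gather/scatter semantics discussed above, in Python rather
# than EVM code. In the proposal the get vector is modified in place; here
# both operations return their result instead.

def xget(source, indexes):
    # XGET: the index ("get") vector pulls lanes out of the source; the
    # result has the index vector's shape, and may be a different type
    # than the source.
    return [source[i] for i in indexes]

def xput(destination, replacements, indexes):
    # XPUT: write replacement lanes into the destination at the given
    # indexes; the replacements share the destination's lane type.
    out = list(destination)
    for rep, i in zip(replacements, indexes):
        out[i] = rep
    return out

src = [10, 20, 30, 40]
assert xget(src, [3, 0]) == [40, 10]
assert xput(src, [99, 77], [1, 3]) == [10, 99, 30, 77]
```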
(I need a clearer scheme for describing the parameters and results throughout.)