EIP-616: SIMD Operations for the EVM #616
Comments
Sorry for hijacking this EIP, but any plan to include this in the "near" future? This would definitely allow more computational complexity in contracts and I must admit I am really excited for this feature.

Not hijacking at all. It won't happen before Serenity, however "near" that is.

So XGET gets 2 types as input, but actually there are three vectors to consider in its implementation - source vector, index vector and result vector. The specification should either clarify the type of the result vector or have 3 distinct types as input for the most general solution.
For XPUT there are really 4 vectors - source, replacement, replacement indices, result. We also need to specify whether the type of the replacement vector is the same as the source, and what the type of the result is.

XGET takes only 2 vectors - it's like the get vector sucks data out of the source vector and is modified in the process. The get vector can have a different type than the source vector. XPUT takes 3 vectors: replacements, put indexes, and source. The source is misnamed - it's really the destination. I should be more clear on that, and on the replacements vector having the same type as the destination. It could have a different type if we want the generality, but I think that would quadruple the size of the implementation. (I need a clearer scheme for describing the parameters and results throughout.)

I see now, XGET replaces the indices in the get vector with the values...

@gcolvin this EIP doesn't have a copyright statement. Can you please add one? EIP-1 suggests CC0 - see the

There has been no activity on this issue for two months. It will be closed in a week if no further activity occurs. If you would like to move this EIP forward, please respond to any outstanding feedback or add a comment indicating that you have addressed all required feedback and are ready for a review.

This issue was closed due to inactivity. If you are still pursuing it, feel free to reopen it and respond to any feedback or request a review in a comment.
Current PR: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-616.md
Working draft:
ABSTRACT
A proposal to provide Single Instruction Multiple Data types and operations for the Ethereum Virtual Machine, making full use of the 256-bit wide EVM stack items, and offering substantial performance gains for vector operations. Since a vector of one element is a scalar, the performance of native scalars for 64-bit and smaller quantities is also provided.
MOTIVATION
Almost all modern CPUs include SIMD hardware that operates on wide registers of data, applying a Single Instruction to Multiple Data lanes in parallel, where lanes divide a register into a vector of scalar elements of equal size. This model is an excellent fit for the wide stack items of the EVM, offering substantial performance boosts for operations that can be expressed as parallel operations on vectors of scalars. For some examples, a brief literature search finds SIMD speedups of
SPECIFICATION
Opcodes
We define the following extended SIMD versions of the EVM's arithmetic, logic, and comparison operations, as well as a set of operations needed for working with SIMD vectors as data. As with the normal versions, they consume their arguments from the stack and place their results on the stack, except that their arguments and results are vectors rather than scalars.
Encoding
We propose a simple encoding of SIMD operations as extended two-byte codes. The first byte is the opcode, and the second byte is the SIMD type: scalar type, lane count, and lane width.
Thus in 256-bit vectors, we can specify SIMD types with unsigned integer lanes from 8 to 64 bits wide in vectors of 32 to 2 lanes, respectively. Using all the reserved bits the encoding allows for 256-Kbit vectors.
Note that when the lane count is one the operation is on one scalar, so this specification also provides for native operations on single scalars.
Wide integers, SIMD vectors, and bitstrings
Wide integer values on the stack, in storage, and in memory, are stored (conceptually) as MSB-first bitstrings. SIMD vectors, to the contrary, are stored to match most SIMD hardware - as LSB-first bitstrings, starting with the least significant bit of the lowest-indexed element of the vector and proceeding upwards. But this may yield a surprise: when a vector is first converted to a wide integer, and the wide integer is then placed in memory or storage, the vector, like the wide integer, will be stored MSB first. However, if the wide integer is converted back to a vector, the correct value results.
Semantics
Notation and Vector Types
In the pseudo-assembly below we will denote the lane type, lane width, and number of lanes using Solidity's syntax, so the following says to push two SIMD vectors of 4 unsigned 32-bit ints on the stack, and then add two vectors of that type together.
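A sketch, with illustrative operand values:

```
XPUSH uint32[4] 1, 2, 3, 4   ; push a vector of four unsigned 32-bit ints
XPUSH uint32[4] 5, 6, 7, 8   ; push a second vector of the same type
XADD uint32[4]               ; pair-wise sum: 6, 8, 10, 12
```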
Currently, `uint8`, `uint16`, `uint32`, and `uint64` are the only supported lane types and widths, with available lane counts to support the following types:

`uint8`, `uint8[2]`, `uint8[4]`, `uint8[8]`, `uint8[16]`, `uint8[32]`
`uint16`, `uint16[2]`, `uint16[4]`, `uint16[8]`, `uint16[16]`
`uint32`, `uint32[2]`, `uint32[4]`, `uint32[8]`
`uint64`, `uint64[2]`, `uint64[4]`
Arithmetic, logic, and bitwise operations
The extended operations from codes B0 through CF have the same semantics as the corresponding operations for codes 00 through 1F, except that the modulus varies by scalar type and the operations are applied pair-wise to the elements of the source operands to compute the destination elements, as above. E.g.
XADD type
Takes two `type` vectors on the stack, and leaves on the stack their pair-wise sum.

And so on for most of the twenty-three remaining operations in columns B and C.
There are exceptions:
XLT type
XGT type
XSLT type
XSGT type
XEQ type
XISZERO type
The comparison operations return 1 for true and 0 for false.
XNOT type
Applies the one's complement operation to a single vector on the stack.
XVECTOW type
Replaces the bits of a vector on the stack with a 256-bit wide integer. The bits of the vector are taken LSB-first - starting with the least significant bit of the lowest-indexed element of the vector and proceeding upwards - and copied to the wide integer LSB first. Missing bits are zero-filled.
XWTOVEC type
Replaces the bits of a 256-bit wide integer on the stack with a vector. The bits of the wide integer are taken LSB first and copied to the vector LSB first - starting with the least significant bit of the lowest-indexed element of the vector and proceeding upwards. Extra bits are discarded.
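For example, a round trip through the wide-integer form recovers the original vector (a sketch, with illustrative values):

```
XPUSH uint8[2] 0x01, 0x02   ; element 0 = 0x01, element 1 = 0x02
XVECTOW uint8[2]            ; wide integer 0x0201 - element 0 lands in the
                            ; low-order bits, missing high bits are zero-filled
XWTOVEC uint8[2]            ; back to the vector 0x01, 0x02 - extra bits discarded
```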
Data motion operations
The operations from D0 through DF are for moving data in and out of vectors and moving vectors around storage, memory, and stack.
XPUSH type, inline_data
Takes its inline data to be the MSB-first representation of the bits of the specified vector, and pushes a vector formed from those bits on the stack. Only as many elements as needed to store the vector are stored inline. Missing elements are zero-filled on the stack.
XPOP type
Removes the topmost vector from the stack.
XDUP type, n
Copies the `SP[n]` vector to the top of the stack.

XSWAP type, n
Exchanges the `SP[0]` and `SP[n]` stack items.

XMLOAD type
Like MLOAD, takes one vector on the stack: a vector containing the memory address to load from as its first, 64-bit element. That vector is replaced by the addressed data.

XMSTORE type
Like MSTORE, takes two vectors on the stack: a vector containing a memory address as its first, 64-bit element, and a vector to store to memory at that address. Only as many elements are stored as are in a `type` vector.
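For illustration, a sketch of storing and reloading a vector; the operand order shown (address vector pushed last) is an assumption, since the draft does not pin down which operand is on top:

```
XPUSH uint64[4] 1, 2, 3, 4   ; the vector to store
XPUSH uint64[1] 0x80         ; address vector: its first 64-bit element is the offset
XMSTORE uint64[4]            ; store the four 64-bit elements at offset 0x80

XPUSH uint64[1] 0x80         ; the address vector again
XMLOAD uint64[4]             ; replaced by the addressed data: 1, 2, 3, 4
```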
XSLOAD type
Like SLOAD, takes one wide integer on the stack, containing the storage address to load from. That item is replaced by the stored data, converted to a `type` vector.

XSSTORE type
Like SSTORE, takes two items on the stack: a vector containing a storage address as a wide integer, and a vector to place in storage at that address. The vector is converted to a wide integer before being stored.
XGETVEC type, index_type
This operation takes two vectors on the stack: an index vector and a source vector. It leaves on the stack the source vector, and the indexed elements of the source vector in the index vector - that is, each element in the index vector is the index of the source element with which to replace it. The indexes are taken modulo the number of lanes in the vector, so the result is deterministic and cannot overflow. If the element types of the source and index vectors differ, values are taken modulo the size of the index type.
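For example (a sketch; which operand is on top of the stack is an assumption):

```
XPUSH uint32[4] 10, 20, 30, 40   ; source vector
XPUSH uint8[4]  3, 3, 0, 1       ; index vector, of index_type uint8[4]
XGETVEC uint32[4], uint8[4]      ; index vector becomes 40, 40, 10, 20
                                 ; (each gathered value fits in a uint8, so no
                                 ; modular truncation occurs); the source vector
                                 ; stays on the stack
```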
XPUTVEC type, index_type
Takes three vectors on the stack: an index vector, a source vector, and a destination vector. Leaves on the stack the modified destination vector - each element in the index vector is the index of the destination element to replace with the corresponding source element. The indexes are taken modulo the number of lanes in the vector, so the result is deterministic and cannot overflow. The index vector and the source vector have the same type. If the element types of the source and destination vectors differ, values are taken modulo the size of the destination type.
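For example (a sketch; the stack order of the operands is an assumption):

```
XPUSH uint32[4] 0, 0, 0, 0    ; destination vector
XPUSH uint32[2] 7, 9          ; source vector - same type as the index vector
XPUSH uint32[2] 2, 0          ; index vector: dest[2] gets 7, dest[0] gets 9
XPUTVEC uint32[4], uint32[2]  ; destination becomes 9, 0, 7, 0
```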
XSWIZZLE type
Takes two vectors on the stack: a permutation mask and a source vector to permute. Leaves on the stack the source vector and a permutation of the source vector in the mask vector - each element in the mask is the index of the source element with which to replace it. The permutation indexes are taken modulo the number of lanes in the vector, so the result is deterministic and cannot overflow.
XSHUFFLE type
Takes three vectors on the stack: a permutation mask and two source vectors to permute. Leaves on the stack a permutation of the source vectors in the mask vector. Each element in the mask is the index of the element with which to replace it. If the index is less than the number of lanes `n` in the source vectors the replacement is taken from the first source vector; if larger, from the second source vector at `((index - n) % n)`. Again, the modulo computation assures deterministic results.
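For example (a sketch; stack order is an assumption):

```
XPUSH uint32[4] 1, 2, 3, 4   ; first source vector
XPUSH uint32[4] 5, 6, 7, 8   ; second source vector
XPUSH uint32[4] 0, 4, 3, 7   ; mask - indexes 4..7 select from the second source
XSHUFFLE uint32[4]           ; the mask vector becomes 1, 5, 4, 8
```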
Notes
Only the SIMD operations are valid on SIMD vectors - this must be validated at contract creation time.
All SIMD operands must have the size, type and number of lanes specified by the operator - this must be validated at contract creation time.
Subroutines
Following EIP #615, a type-safe syntax for declaring subroutines taking vector arguments will be needed.
BEGINSUBX n_args, arg_types... n_results, result_types...
marks the single entry to a subroutine. The bytecode for a subroutine ends at the next BEGINSUB, BEGINSUBX or BEGINDATA instruction or at the end of the bytecode. `n_args` items of `arg_types` types are taken off of the stack at entry to the subroutine, and `n_results` items of `result_types` types are placed on the stack at return from it. `n_args`, `n_results`, and the types are given as one immediate byte each. For the purposes of specifying argument types, a type of 0x80 - the encoding for a single lane, 32 bytes wide - is taken to be a 256-bit wide integer.
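For instance, a subroutine taking two `uint32[8]` vectors and returning their sum might be declared as in this sketch (the type notation stands in for the actual immediate type bytes; `RETURNSUB` is from EIP #615):

```
BEGINSUBX 2, uint32[8], uint32[8], 1, uint32[8]   ; two vector args, one vector result
XADD uint32[8]                                    ; add the two argument vectors
RETURNSUB                                         ; return the result to the caller
```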
Notes
The `arg_types` and `result_types` are given in the same encoding as the second byte of the SIMD opcodes, and must match the values on the stack - this must be validated at contract creation time.
RATIONALE
Currently, the lowest common denominator for SIMD hardware (e.g. Intel SSE2 and ARM Neon) is 16-byte registers supporting integer lanes of 1, 2, 4, and 8 bytes, and floating point lanes of 4 and 8 bytes. More recent SIMD hardware (e.g. Intel AVX) supports 32-byte registers, and EVM stack items are also 32 bytes wide. The limits above derive from these numbers, assuring that EVM code is within the bounds of available hardware - and the reserved bits provide room for growth.
For most modern languages (including Rust, Python, Go, Java, and C++) compilers can do a good job of generating SIMD code for parallelizable loops, and/or there are intrinsics or libraries available for explicit access to SIMD hardware. So a portable software implementation will likely make good use of the hardware on most platforms, and intrinsics or libraries can be used as available and needed. Thus we can expect these operations to take about the same time (or, for 256-bit vectors on 128-bit hardware, up to twice the time) to execute regardless of element size or number of elements.
Gas
One motivation for these operations, besides taking full advantage of the hardware, is assigning lower gas costs for operations on smaller scalars.
Measurements of three of the major clients shed some light on appropriate gas costs. For the C++ interpreter:

JUMPDEST costs 1 gas and takes about 20 cycles
ADD or SUB costs 3 gas and can take over 140 cycles
MUL or DIV costs 5 gas and can take from over 220 to over 640 cycles

So JUMPDEST is overpriced and DIV is underpriced, while ADD, SUB and MUL are about right, and taking a unit of gas as worth about 40 cycles of computation in the C++ interpreter is reasonable: at 40 cycles per gas, ADD's 3 gas covers 120 cycles, close to the 140 measured, while JUMPDEST's 1 gas buys twice its 20 cycles and DIV's 5 gas falls well short of its worst case.

Some relevant generalities on computation costs... many of these SIMD operations will be about as cheap as JUMPDEST. The best we can do is charge 1 gas until a particle scheme or a general gas repricing lets us value these operations (and JUMPDEST) more accurately.

So measurement will tell, but to a first approximation, on the C++ interpreter:
There are some exceptions to these estimates.
XDIV
XMOD
SIMD units often have no or only partial support for integer division, so these will usually need to be done with a sequence of scalar operations. Even on CPUs division is expensive, so a price of at least 3 units of gas per element of the operand vectors will probably be needed.
XSLOAD
XSSTORE
These are no more difficult than SLOAD and SSTORE - and no less - so XSLOAD is assigned the same 200 gas as SLOAD, and XSSTORE the same gas cost function as SSTORE.

XSWIZZLE
XSHUFFLE
These can usually be helped by SIMD hardware, but not completely, and not in all cases, so a full or partial sequence of scalar operations will often be needed. But the individual operations per element are not that expensive, so a price of 2 units of gas per element of the mask vector seems fair.
High Level Languages
This new facility will be of limited use if not exposed to programmers in higher-level languages than EVM code. To demonstrate at least one workable approach, here is a possible extension to Solidity.
A SIMD vector type could simply require an explicit annotation to array declarations, which limits arrays to the element types and number of elements supported by the SIMD facility, so that a declaration might look like the sketch below.
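For instance (the `simd` annotation used here is hypothetical):

```
uint32[8] simd a;   // one 256-bit vector: 8 lanes of unsigned 32-bit ints
uint64[4] simd b;   // one 256-bit vector: 4 lanes of unsigned 64-bit ints
```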
The arithmetic, logic, and comparison SIMD operations of the EVM would be mirrored by corresponding Solidity operations that operate element-wise on pairs of simd arrays.
For example, given simd arrays `a` and `b` of the same type, the statement `c = a + b;` is evaluated as an element-wise loop over the lanes, and `c` becomes the vector of pair-wise sums. Comparison operations produce arrays of
`true=1` and `false=0` values, so a comparison like `d = a < b;` is likewise evaluated element-wise, and `d` becomes a vector of ones and zeros. Both are shown in the sketch at the end of this section. And of course other operations would need to be defined, like element access. For SIMD operations without corresponding language operators, named functions would be needed, e.g. `simd.swizzle()` and `simd.shuffle()`.

As an added bonus, `uint` variables of sizes <= 64 bits can be compiled to single-element vectors, which an implementation can map to native scalars.
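A sketch of how this might look, with hypothetical syntax and illustrative values:

```
uint32[4] simd a = [1, 2, 3, 4];
uint32[4] simd b = [10, 20, 30, 40];

uint32[4] simd c = a + b;   // compiles to XADD uint32[4]
// evaluated as: for (uint i = 0; i < 4; i++) c[i] = a[i] + b[i];
// c becomes [11, 22, 33, 44]

uint32[4] simd d = a < b;   // compiles to XLT uint32[4]
// each lane of d holds 1 for true, 0 for false
// d becomes [1, 1, 1, 1]
```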