New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIMD Operations for the EVM #616

Open
gcolvin opened this Issue Apr 27, 2017 · 5 comments

Comments

Projects
None yet
3 participants
@gcolvin
Collaborator

gcolvin commented Apr 27, 2017

#Current PR: https://github.com/ethereum/EIPs/blob/master/EIPS/eip-616.md
Working draft:

EIP: 616
Title: SIMD Operations for the EVM
Author: Greg Colvin, greg@colvin.org
Type: Standard Track
Category: Core
Status: Draft
Created: 2017-04-25

ABSTRACT

A proposal to provide Single Instruction Multiple Data types and operations for the Ethereum Virtual Machine, making full use of the 256-bit wide EVM stack items, and offering substantial performance gains for vector operations. Since a vector of one element is a scalar, the performance of native scalars for 64-bit and smaller quantities is also provided.

MOTIVATION

Most all modern CPUs include SIMD hardware that operates on wide registers of data, applying a Single Instruction to Multiple Data lanes in parallel, where lanes divide a register into a vector of scalar elements of equal size. This model is an excellent fit for the wide stack items of the EVM, offering substantial performance boosts for operations that can be expressed as parallel operations on vectors of scalars. For some examples, a brief literature search finds SIMD speedups of

SPECIFICATION

Opcodes

We define the following extended SIMD versions of the EVM's arithmetic, logic, and comparison operations, as well as a set of operations needed for working with SIMD vectors as data. As with the normal versions, they consume their arguments from the stack and place their results on the stack, except that their arguments are vectors rather than scalars.

lo\hi C D E
0 XLT XPUSH
1 XADD XGT XMLOAD
2 XMUL XSLT XMSTORE
3 XSUB XSGT XPOP
4 XDIV XEQ XSLOAD
5 XSDIV XISZERO XSTORE
6 XMOD XAND XDUP
7 XSMOD XOR XSWAP
8 XXOR XGETLOCAL
9 XNOT XPUTLOCAL
A XGETVEC
B XSHL XPUTVEC
C XSHR XSWIZZLE
D XSAR XSHUFFLE
E XVECTOW XROL XSCATTER
F XWTOVEC XROR XGATHER

Encoding

We propose a simple encoding of SIMD operations as extended two-byte codes. The first byte is the opcode, and the second byte is the SIMD type: scalar type, lane count, and lane width.

N bits Field
8 opcode
1 reserved: 0
1 reserved: 0
2 lane width: log base 2 of the number of bytes, as an MSB first integer
1 reserved: 0
3 lane count: log base 2 of the number of lanes, as an MSB first integer

Thus in 256-bit vectors, we can specify SIMD types with unsigned integer lanes from 8 to 64 bits wide in vectors of 32 to 2 lanes, respectively. Using all the reserved bits the encoding allows for 256-Kbit vectors.

Note that when the lane count is one the operation is on one scalar, so this specification also provides for native operations on single scalars.

Wide integers, SIMD vectors, and bitstrings.

Wide integer values on the stack, in storage, and in memory, are stored (conceptually) as MSB-first bitstrings. SIMD vectors, to the contrary, are stored to match most SIMD hardware - as LSB-first bitstrings, starting with the least significant bit of the lowest-indexed element of the vector and proceeding upwards. But this may yield a surprise: when vectors are first converted to a wide integer, and the wide integer then placed on memory or storage, the vector, like the wide integer, will be stored MSB first. However, if the wide integer is converted back to a vector the correct value results.

Semantics

Notation and Vector Types

In the pseudo-assembly below we will denote the lane type, lane width, and number of lanes using Solidity's syntax, so the following says to push two SIMD vectors of 4 unsigned 32-bit ints on the stack, and then add two vectors of that type together.

XPUSH [uint32(1), 2, 3, 4]
XPUSH [uint32(5), 6, 7, 8]
XADD uint32[4]

Currently, uint8, uint16, uint32, and uint64 are the only supported lane types and widths, with available lane counts to support the following types:

  • uint8, uint8[2], uint8[4],uint8[8],uint8[16], uint8[32]
  • uint16, uint16[2], uint16[4],uint16[8],uint16[16]
  • uint32, uint32[2], uint32[4],uint32[8]
  • uint64, uint64[2], uint64[4]

Arithmetic, logic, and bitwise operations

The extended operations from codes B0 through CF have the same semantics as the corresponding operations for codes 00 through 1F, except that the modulus varies by scalar type and the operations are applied pair-wise to the elements of the source operands to compute the destination elements, as above. E.g.

  • XADD type
    Takes two type vectors on the stack, and leaves on the stack their pair-wise sum.

E.g.

XPUSH [1, 2, 3, 4]
XPUSH [4, 5, 6, 7]
XADD uint8[3]

leaves

[5, 7, 9, 11]

on the stack

And so on for most of the twenty-three remaining operations in columns B and C.

There are exceptions:

  • XLT type
    XGT type
    XSLT type
    XSGT type
    XEQ type
    XISZERO type
    The comparison operations return 1 for true and 0 for false.

E.g.

XPUSH [1, 5, 3, 7]
XPUSH [4, 2, 6, 4]
XLT uint8[4]

leaves

[0, 1, 0, 1]

on the stack

  • XNOT type
    Applies the one's complement operation to a single vector on the stack.
  • XVECTOW type
    Replaces the bits of a vector on the stack with a 256-bit wide integer. The bits of the vector are taken LSB-first - starting with the least significant bit of the lowest-indexed element of the vector and proceeding upwards - and copied to the wide integer LSB first. Missing bits are zero-filled.
  • XWTOVEC type
    Replaces the bits of a 256-bit wide integer on the stack with a vector. The bits of the wide integer are taken LSB first and copied to the vector LSB first - starting with the least significant bit of the lowest-indexed element of the vector and proceeding upwards. Extra bits are discarded.

Data motion operations

The operations from D0 through DF are for moving data in and out of vectors and moving vectors around storage, memory, and stack.

  • XPUSH type, inline_data
    Takes its inline data to be the MSB-first representation of the bits of the specified vector, and pushes a vector formed from those bits on the stack. Only as many elements as needed to store the vector are stored inline. Missing elements are zero-filled on the stack.

  • XPOP type
    Removes the topmost vector from the stack.

  • XDUP type, n
    Copies the SP[n] vector to the top of the stack.

  • XSWAP type, n
    Exchanges the SP[0] and SP[n] stack items.

  • XMLOAD type
    Like MLOAD, takes one vector on the stack: a vector containing the memory address to load from as its first, 64-bit element. That vector is replaced by the addressed data.

  • XMSTORE type
    Like MSTORE, takes two vectors on the stack: a vector containing a memory address as its first, 64-bit element, and a vector to store to memory at that address. Only as many elements are stored as are in a type vector.

  • XSLOAD type
    Like SLOAD, takes one wide integer on the stack, containing the storage address to load from. That item is replaced by the stored data, converted to a type vector.

  • XSSTORE type
    Like SSTORE, takes two items on the stack: a vector containing a storage address as a wide integer, and a vector to place in storage at that address. The vector is converted to a wide integer before being stored.

  • XGETVEC type, index_type
    This operation takes two vectors on the stack: an index vector and a source vector. It leaves on the stack the source vector, and the indexed elements of the source vector in the index vector - that is, each element in the index vector is the index of the source element with which to replace it. The indexes are taken modulo the number of lanes in the vector, so the result is deterministic and cannot overflow. If the element type of the source type and index differ values are taken modulo the size of the index type.

E.g.

XPUSH [4, 5, 6, 0]    // source
XPUSH [3, 2]       // indexes
XGETVEC uint8[3] uint8[5]

leaves the source and the modified index vector on the stack:

[4, 5, 6, 0] 
[0, 6]
  • XPUTVEC type, index_type
    Takes three vectors on the stack: an index vector, a source vector, and a destination vector. Leaves on the stack the modified destination vector - each element in the index vector is the index of the destination element to replace with the corresponding source element. The indexes are taken modulo the number of lanes in the vector, so the result is deterministic and cannot overflow. The index vector and the source vector have the same type. If the element types of the source and destination vectors differ values are taken modulo the size of the destination type.

E.g.

XPUSH [4, 5, 6, 0] // destination
XPUSH [7, 8]       // replacements
XPUSH [0, 3]       // indexes
XGETVEC uint8[3] uint8[2]

leaves the modified source vector on the stack.

 [7, 5, 6, 8] 
  • XSWIZZLE type
    Takes two vectors on the stack: a permutation mask and a source vector to permute. Leaves on the stack the source vector and a permutation of the source vector in the mask vector - each element in the mask is the index of the source element with which to replace it. The permutation indexes are taken modulo the number of lanes in the vector, so the result is deterministic and cannot overflow.

E.g.

XPUSH [4, 5, 6, 0]  // source
XPUSH [0, 1, 3, 0]  // mask
XSWIZZLE uint8[4]

leaves the source and the modified mask vector on the stack.

[4, 5, 6, 0]
[4, 5, 0, 4]
  • XSHUFFLE type
    Takes three vectors on the stack: a permutation mask and two source vectors to permute. Leaves on the stack a permutation of the source vectors in the mask vector. Each element in the mask is the index of the element with which to replace it. If the index is less than the number of lanes n in the source vectors the replacement is taken from the first source vector, if larger, from the second source vector at ((index - n) % n). Again, the modulo computation assures deterministic results.

E.g.

XPUSH [0, 0, 0, 0]    // source 2
XPUSH [4, 5, 6, 7]    // source 1
XPUSH [0, 1,  2, 35]  // mask
XSHUFFLE uint8[4]

leaves the sources and the modified mask vector on the stack.

[0, 0, 0, 0]
[4, 5, 6, 7]
[4, 5, 6, 0]

Notes

Only the SIMD operations are valid on SIMD vectors - this must be validated at contract creation time.

All SIMD operands must have the size, type and number of lanes specified by the operator - this must be validated at contract creation time.

Subroutines

Following EIP #615, a type-safe syntax for declaring subroutines taking vector arguments will be needed.

  • BEGINSUBX n_args, arg_types... n_results, result_types...
    marks the single entry to a subroutine. The bytecode for a subroutine ends at the next BEGINSUB, BEGINSUBX or BEGINDATA instruction or at the end of the bytecode. n_args items of arg_types types are taken off of the stack at entry to, and n_results items of results_typestypes are placed on the stack at return from the subroutine. n_args, n_results, and the types are given as one immediate byte each. For the purposes of specifying argument types, a type of 0x80 - the encoding for a single lane, 32 bytes wide - is taken to be a 256-bit wide integer.

Notes

The arg_types and result_types are given in the same encoding as the second byte of the SIMD opcodes, and must match the values on the stack - this must be validated at contract creation time.

RATIONALE

Currently, the lowest common denominator for SIMD hardware (e.g. Intel SSE2 and ARM Neon) is 16-byte registers supporting integer lanes of 1, 2, 4, and 8 bytes, and floating point lanes of 4 and 8 bytes. More recent SIMD hardware (e.g. Intel AVX) supports 32-byte registers, and EVM stack items are also 32 bytes wide. The limits above derive from these numbers, assuring that EVM code is within the bounds of available hardware - and the reserved bits provide room for growth.

For most modern languages (including Rust, Python, Go, Java, and C++) compilers can do a good job of generating SIMD code for parallelizable loops, and/or there are intrinsics or libraries available for explicit access to SIMD hardware. So a portable software implementation will likely provide good use of the hardware on most platforms, and intrinsics or libraries can be used as available and needed. Thus we can expect these operations to take about the same (or for 256-bit vectors on 128-bit hardware up to twice) the time to execute regardless of element size or number of elements.

Gas

One motivation for these operations, besides taking full advantage of the hardware, is assigning lower gas costs for operations on smaller scalars.

Measurements of three of the major clients shed some light on appropriate gas costs. For the C++ interpreter

  • a no-op JUMDEST costs 1 gas and takes about 20 cycles
  • an ADD or SUB costs 3 gas and can take over 140 cycles
  • a MUL or DIV costs 5 gas and can take from over 220 to over 640 cycles.

So JUMPDEST is overpriced, DIV is underpriced, ADD, SUBand MUL are about right. and taking a unit of gas as worth about 40 cycles of computation in the C++ interpreter is reasonable.

Some relevant generalities on computation costs...

  • For arithmetic on quantities up to 64 bits CPU cores can do the job in 1 or 2 cycles or less, (except for division), plus a few cycles to marshal data between stack memory and the ALU if that hasn already happened. So they barely cost enough to charge any gas beyond the overhead charged to JUMPDEST. The best we can do is charge 1 gas until a particle scheme or a general gas repricing lets us value these operations (and JUMPDEST) more accurately.
  • For SIMD units 1 or 2 cycles per vector operation are needed, but with a few low-level operations sometimes needed for one opcode, and up to 10-30 cycles to marshal data between stack memory and the ALU and SIMD units. And most units can only operate on 128 bits at once.

So measurement will tell, but to a first approximation, on the C++ interpreter:

  • operations on up to 64 bits should take about 20-30 cycles and cost 1 gas
  • operations on up to 128 bits should take about 30-80 cycles and cost 2 gas
  • operations on up to 256 bits should take about 40-140 cycles and cost 3 gas.

There are some exceptions to these estimates.

  • XDIV
    XMOD
    SIMD units often have no or only partial support for integer division, so these will usually need to be done with a sequence of scalar operations. Even on CPUs division is expensive, so a price of at least 3 units of gas per element of the operand vectors will probably be needed.

  • XSLOAD
    XSSTORE
    These are no more difficult thanSLOAD and SSTORE - and no less - so XSLOAD is assigned the same 200 gas as SLOAD, and XSLOAD the same gas cost function as SSTORE.

  • XSWIZZLE
    XSHUFFLE
    These can usually be helped by SIMD hardware, but not completely, and not in all cases, so a full or partial sequence of scalar operations will often be needed. But the individual operations per element are not that expensive, so a price of 2 units of gas per element of the mask vector seems fair.

High Level Languages

This new facility will be of limited use if not exposed to programmers in higher-level languages than EVM code. To demonstrate at least one workable approach, here is a possible extension to Solidity.

A SIMD vector type could simply require an explicit annotation to array declarations, which limits arrays to the element types and number of elements supported by the SIMD facility, so that

uint64_t[4] simd example; // would be OK
uint32_t[9] simd example; // would not be OK

The arithmetic, logic, and comparison SIMD operations of the EVM would be mirrored by corresponding Solidity operations that operate element-wise on pairs of simd arrays.
For example, given

uint8_t[4] simd a = [1, 2, 3];
uint8_t[4] simd b = [5, 6, 7];

this statement

uint8_t[4]  simd c = a + b;

is evaluated as

c = [a[0] + b[0], a[1] + b[1],  a[2] + b[2]]

and c becomes

[6, 8, 10]

Comparison operations produce arrays of true=1 and false=0 values. So

uint8_t[3] simd d = a < b;

is evaluated as

d = [a[0] < b[0], a[1] < b[1],  a[2] < b[2]] 

and d becomes

[1, 1, 1]

And of course other operations would need to be defined, like element access. For SIMD operations without corresponding language operators, named functions would be needed. E,g simd.swizzle(), and simd.shuffle().

As an added bonus, uint variables of sizes <= 64 bits can be compiled to single-element vectors, which an implementation can map to native scalars.

@gcolvin gcolvin self-assigned this Jun 13, 2017

@PhABC

This comment has been minimized.

Show comment
Hide comment
@PhABC

PhABC Jun 21, 2017

Contributor

Sorry for hijacking this EIP, but any plan to include this in the "near" future? This would definitely allow more computational complexity in contracts and I must admit I am really excited for this feature.

Contributor

PhABC commented Jun 21, 2017

Sorry for hijacking this EIP, but any plan to include this in the "near" future? This would definitely allow more computational complexity in contracts and I must admit I am really excited for this feature.

@gcolvin

This comment has been minimized.

Show comment
Hide comment
@gcolvin

gcolvin Jun 21, 2017

Collaborator

Not hijacking at all. It won't happen before Serenity, however "near" that is.

Collaborator

gcolvin commented Jun 21, 2017

Not hijacking at all. It won't happen before Serenity, however "near" that is.

@gumb0

This comment has been minimized.

Show comment
Hide comment
@gumb0

gumb0 Jul 10, 2017

Member

XGET type, get_type

So XGET gets 2 types as input, but actually there are three vectors to consider in its implementation - source vector, index vector and result vector. The specification should either clarify what's the type of result vector or maybe have 3 distinct types as input for the most general solution.

XPUT type, put_type

For XPUT there are really 4 vectors - source, replacement, replacement indices, result. Also we need to specify whether the type of replacement vector is the same as source and what is the type of result

Member

gumb0 commented Jul 10, 2017

XGET type, get_type

So XGET gets 2 types as input, but actually there are three vectors to consider in its implementation - source vector, index vector and result vector. The specification should either clarify what's the type of result vector or maybe have 3 distinct types as input for the most general solution.

XPUT type, put_type

For XPUT there are really 4 vectors - source, replacement, replacement indices, result. Also we need to specify whether the type of replacement vector is the same as source and what is the type of result

@gcolvin

This comment has been minimized.

Show comment
Hide comment
@gcolvin

gcolvin Jul 10, 2017

Collaborator

XGET takes only 2 vectors - it's like the get vector sucks data out of the source vector and is modified in the process. The get vector can have a different type than the source vector.

XPUT takes 3 vectors, replacements, put indexes, and source. The source is misnamed, it's really the destination. I should be more clear on that, and on the replacements vector having the same type as the destination. It could have a different type if we want the generality, but that I think that would quadruple the size of the implementation.

(I need a more clear scheme for describing the parameters and results throughout.)

Collaborator

gcolvin commented Jul 10, 2017

XGET takes only 2 vectors - it's like the get vector sucks data out of the source vector and is modified in the process. The get vector can have a different type than the source vector.

XPUT takes 3 vectors, replacements, put indexes, and source. The source is misnamed, it's really the destination. I should be more clear on that, and on the replacements vector having the same type as the destination. It could have a different type if we want the generality, but that I think that would quadruple the size of the implementation.

(I need a more clear scheme for describing the parameters and results throughout.)

@gumb0

This comment has been minimized.

Show comment
Hide comment
@gumb0

gumb0 Jul 10, 2017

Member

I see now, XGET replaces the indices in the get vector with the values...

Member

gumb0 commented Jul 10, 2017

I see now, XGET replaces the indices in the get vector with the values...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment