Skip to content
Permalink
6 contributors

Users who have contributed to this file

@binji @lars-t-hansen @AndrewScheidecker @alexcrichton @aheejin @sunfishcode
470 lines (352 sloc) 19.7 KB

Bulk Memory Operations and Conditional Segment Initialization

Motivation for Bulk Memory Operations

Some people have mentioned that memcpy and memmove functions are hot when profiling some WebAssembly benchmarks. Some examples:

I've been looking at perf profiles for wasm unity benchmark a bit recently and see that some of the hottest functions are doing memcpy or memset like things. If this is any indication of normal wasm code patterns, I think we could see significant improvement with an intrinsic so it may be worth prioritizing.

In a number of game engines I've been optimizing and benchmarking, interestingly the performance of memcpy() does show up relatively high in profiles. (~2%-5% of total execution time)

Bulk Memory Operations Prototype

I implemented a prototype implementation of a memory.copy instruction in v8 which just calls out to v8's MemMove function. I compared this to an implementation generated by emscripten and currently used in the Unity demo. This implementation aligns then performs copies using i32.load and i32.store. I've also included performance achieved by unrolling this loop manually and increasing the size to i64.

Each test copies size bytes from one address to another, non-overlapping. This is repeated N times. Each row copies a total of 1 Gib of data, and only touches 1 Mib of memory in the source and destination ranges.

This is the core loop:

  let mask = Mib - 1;
  let start = performance.now();
  for (let i = 0; i < N; ++i) {
    f(dst_base + dst, src_base + src, size);
    dst = (dst + size) & mask;
    src = (src + size) & mask;
  }
  let end = performance.now();

The code for the benchmark can be found here. Note that this will not run properly without a WebAssembly implementation of memory.copy. For my tests, I hacked a version of v8 to replace any exported function called memcpy or memmove with a new function with the following contents:

(func (param $dst i32) (param $src i32) (param $size i32) (result i32)
  get_local $dst
  get_local $src
  get_local $size
  memory.copy
  get_local $dst)

Here are the results on my machine (x86_64, 2.9GHz, L1 32k, L2 256k, L3 256k):

intrinsic i64 load/store x 4 i64 load/store x 2 i32 load/store x 2 i32 load/store
size=32b, N=33554432 1.382 Gib/s 1.565 Gib/s 1.493 Gib/s 1.275 Gib/s 1.166 Gib/s
size=64b, N=16777216 3.285 Gib/s 2.669 Gib/s 2.383 Gib/s 1.861 Gib/s 1.639 Gib/s
size=128b, N=8388608 6.162 Gib/s 3.993 Gib/s 3.480 Gib/s 2.433 Gib/s 2.060 Gib/s
size=256b, N=4194304 9.939 Gib/s 5.323 Gib/s 4.462 Gib/s 2.724 Gib/s 2.213 Gib/s
size=512b, N=2097152 15.777 Gib/s 6.377 Gib/s 4.913 Gib/s 3.231 Gib/s 2.457 Gib/s
size=1.0Kib, N=1048576 17.902 Gib/s 7.010 Gib/s 6.112 Gib/s 3.568 Gib/s 2.614 Gib/s
size=2.0Kib, N=524288 19.870 Gib/s 8.248 Gib/s 6.915 Gib/s 3.764 Gib/s 2.699 Gib/s
size=4.0Kib, N=262144 20.940 Gib/s 9.145 Gib/s 7.400 Gib/s 3.871 Gib/s 2.729 Gib/s
size=8.0Kib, N=131072 21.162 Gib/s 9.258 Gib/s 7.672 Gib/s 3.925 Gib/s 2.763 Gib/s
size=16.0Kib, N=65536 20.991 Gib/s 9.758 Gib/s 7.756 Gib/s 3.945 Gib/s 2.773 Gib/s
size=32.0Kib, N=32768 22.504 Gib/s 9.956 Gib/s 7.861 Gib/s 3.966 Gib/s 2.780 Gib/s
size=64.0Kib, N=16384 22.534 Gib/s 10.088 Gib/s 7.931 Gib/s 3.974 Gib/s 2.782 Gib/s
size=128.0Kib, N=8192 29.728 Gib/s 10.032 Gib/s 7.934 Gib/s 3.975 Gib/s 2.782 Gib/s
size=256.0Kib, N=4096 29.742 Gib/s 10.116 Gib/s 7.625 Gib/s 3.886 Gib/s 2.781 Gib/s
size=512.0Kib, N=2048 29.994 Gib/s 10.090 Gib/s 7.627 Gib/s 3.985 Gib/s 2.785 Gib/s
size=1.0Mib, N=1024 11.760 Gib/s 10.091 Gib/s 7.959 Gib/s 3.989 Gib/s 2.787 Gib/s

Motivation for Conditional Segment Initialization

Under the current threading proposal, to share a module between multiple agents, the module must be instantiated multiple times: once per agent. Instantiation initializes linear memory with the contents in the module's data segments. If the memory is shared between multiple agents, it will be initialized multiple times, potentially overwriting stores that occurred after the previous initializations.

For example:

;; The module.
(module
  (memory (export "memory") 1)

  ;; Some value used as a counter.
  (data (i32.const 0) "\0")

  ;; Add one to the counter.
  (func (export "addOne")
    (i32.store8
      (i32.const 0)
      (i32.add
        (i32.load8_u (i32.const 0))
        (i32.const 1)))
  )
)
// main.js
let moduleBytes = ...;

WebAssembly.instantiate(moduleBytes).then(
  ({module, instance}) => {
    // Increment our counter.
    instance.exports.addOne();

    // Spawn a new Worker.
    let worker = new Worker('worker.js');

    // Send the module to the new Worker.
    worker.postMessage(module);
  });

// worker.js

function onmessage(event) {
  let module = event.data;

  // Use the module to create another instance.
  WebAssembly.instantiate(module).then(
    (instance) => {
      // Oops, our counter has been clobbered.
    });
}

This can be worked around by storing the data segments in a separate module which is only instantiated once, then exporting this memory to be used by another module that contains only code. This works, but it cumbersome since it requires two modules where one should be enough.

Motivation for combining Bulk Memory Operations + Conditional Segment Initialization Proposals

When discussing the design of Conditional Segment Initialization, we found that programmatic memory initialization from a read-only data segment (via the memory.init instruction, described below) has similar behavior to the proposed instruction to copy memory regions from linear memory (memory.copy, also described below.)

Design

Copying between regions in linear memory or a table is accomplished with the new *.copy instructions:

  • memory.copy: copy from one region of linear memory to another
  • table.copy: copy from one region of a table to another

Filling a memory region can be accomplished with memory.fill:

  • memory.fill: fill a region of linear memory with a given byte value

TODO: should we provide memory.clear and table.clear instead?

The binary format for the data section currently has a collection of segments, each of which has a memory index, an initializer expression for its offset, and its raw data.

Since WebAssembly currently does not allow for multiple memories, the memory index of each segment must be zero. We can repurpose this 32-bit integer as a flags field where new meaning is attached to nonzero values.

When the new flags field is 1, this segment is passive. A passive segment will not be automatically copied into the memory or table on instantiation, and must instead be applied manually using the following new instructions:

  • memory.init: copy a region from a data segment
  • table.init: copy a region from an element segment

A passive segment has no initializer expression, since it will be specified as an operand to memory.init or table.init.

Segments can also be discarded by using the following new instructions:

  • data.drop: prevent further use of a data segment
  • elem.drop: prevent further use of an element segment

An active segment is equivalent to a passive segment, but with an implicit memory.init followed by a data.drop (or table.init followed by a elem.drop) that is prepended to the module's start function.

The new encoding of a data segment is now:

Field Type Present? Description
flags varuint32 always Flags for passive and presence of fields below, only values of 0, 1, and 2 are valid
index varuint32? flags = 2 Memory index; 0 if the field is not present
offset init_expr? flags != 1 an i32 initializer expression for offset
size varuint32 always size of data (in bytes)
data bytes always sequence of size bytes

Another way of looking at it:

Flags Active? index offset
0 Active Always 0 Present
1 Passive - -
2 Active Present Present

Element segments

The new binary format for element segments is similar to the new format for data segments, but also includes an element type when the segment is passive. A passive segment also has a sequence of exprs instead of function indices.

Field Type Present? Description
flags varuint32 always Flags for passive and presence of fields below, only values of 0, 1, and 2 are valid
index varuint32? flags = 2 Table index; 0 if the field is not present
element_type elem_type? flags = 1 element type of this segment; anyfunc if not present
offset init_expr? flags != 1 an i32 initializer expression for offset
count varuint32 always number of elements
elems varuint32* flags != 1 sequence of function indices
elems elem_expr* flags = 1 sequence of element expressions

Another way of looking at it:

Flags Active? index element_type offset
0 Active Always 0 Always anyfunc Present
1 Passive - Present -
2 Active Present Always anyfunc Present

An elem_expr is like an init_expr, but can only contain expressions of the following sequences:

Binary Text Description
0xd0 0x0b ref.null end Returns a null reference
0xd2 varuint32 0x0b ref.func $funcidx end Returns a reference to function $funcidx

TODO: coordinate with other proposals to determine the binary encoding for ref.null and ref.func.

Segment Initialization

In the MVP, segments are initialized during module instantiation. If any segment would be initialized out-of-bounds, then the memory or table instance is not modified.

This behavior is changed in the bulk memory proposal.

Each active segment is initialized in module-definition order. For each segment, each byte in the data segment is copied into the memory, in order of lowest to highest addresses. If, for a given byte, the copy is out-of-bounds, instantiation fails and no further bytes in this segment nor further segments are copied. Bytes written before this point stay written.

The behavior of element segment initialization is changed similarly, with the difference that elements are copied from element segments into tables, instead of bytes being copied from data segments into memories.

memory.init instruction

The memory.init instruction copies data from a given passive segment into a target memory. The target memory and source segment are given as immediates.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

  • top-2: destination address
  • top-1: offset into the source segment
  • top-0: size of memory region in bytes

It is a validation error to use memory.init with an out-of-bounds segment index.

A trap occurs if:

  • the segment is used after it has been dropped via data.drop. This includes active segments that were dropped after being copied into memory during module instantiation.
  • any of the accessed bytes lies outside the source data segment or the target memory

Note that it is allowed to use memory.init on the same data segment more than once.

Initialization takes place bytewise from lower addresses toward higher addresses. A trap resulting from an access outside the source data segment or target memory only occurs once the first byte that is outside the source or target is reached. Bytes written before the trap stay written.

(Data are read and written as-if individual bytes were read and written, but various optimizations are possible that avoid reading and writing only individual bytes.)

Note that the semantics require bytewise accesses, so a trap that might result from, say, reading a sequence of several words before writing any, will have to be handled carefully: the reads that succeeded will have to be written, if possible.

data.drop instruction

The data.drop instruction prevents further use of a given segment. After a data segment has been dropped, it is no longer valid to use it in a memory.init instruction. This instruction is intended to be used as an optimization hint to the WebAssembly implementation. After a memory segment is dropped its data can no longer be retrieved, so the memory used by this segment may be freed.

It is a validation error to use data.drop with an out-of-bounds segment index.

A trap occurs if the segment was already dropped. This includes active segments that were dropped after being copied into memory during module instantiation.

memory.copy instruction

Copy data from a source memory region to destination region. The regions are said to overlap if they are in the same memory and the start address of one region is one of the addresses that's read or written (by the copy operation) in the other region.

This instruction has two immediate arguments: the source and destination memory indices. They currently both must be zero.

If the source region starts at a lower address than the target region, then the copy takes place as if from higher to lower addresses: the highest source address is read first and the value is written to the highest target address, then the next highest, and so on. Otherwise, the copy takes place as if from lower to higher addresses: the lowest source address is read first and the value is written to the lowest target address, then the next lowest, and so on.

(The direction of the copy is defined in order to future-proof memory.copy for shared memory and a memory read/write protection feature.)

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

  • top-2: destination address
  • top-1: source address
  • top-0: size of memory region in bytes

A trap occurs if:

  • any of the accessed bytes lies outside the source or target memory

A trap resulting from an access outside the source or target region only occurs once the first byte that is outside the source or target is reached (in the defined copy order). Bytes written before the trap stay written.

(Data are read and written as-if individual bytes were read and written, but various optimizations are possible that avoid reading and writing only individual bytes.)

memory.fill instruction

Set all bytes in a memory region to a given byte. This instruction has an immediate argument of which memory to operate on, and it must be zero for now.

The instruction has the signature [i32 i32 i32] -> []. The parameters are, in order:

  • top-2: destination address
  • top-1: byte value to set
  • top-0: size of memory region in bytes

A trap occurs if:

  • any of the accessed bytes lies outside the target memory

Filling takes place bytewise from lower addresses toward higher addresses. A trap resulting from an access outside the target memory only occurs once the first byte that is outside the target is reached. Bytes written before the trap stay written.

(Data are written as-if individual bytes were written, but various optimizations are possible that avoid writing only individual bytes.)

table.init, elem.drop, and table.copy instructions

The table.* instructions behave similary to the memory.* instructions, with the difference that they operate on element segments and tables, instead of data segments and memories. The offset and length operands of table.init and table.copy have element units instead of bytes as well.

Passive Segment Initialization Example

Consider if there are two data sections, the first is always active and the second is conditionally active if global 0 has a non-zero value. This could be implemented as follows:

(import "a" "global" (global i32))  ;; global 0
(memory 1)
(data (i32.const 0) "hello")   ;; data segment 0, is active so always copied
(data passive "goodbye")       ;; data segment 1, is passive

(func $start
  (if (get_global 0)

    ;; copy data segment 1 into memory 0 (the 0 is implicit)
    (memory.init 1
      (i32.const 16)    ;; target offset
      (i32.const 0)     ;; source offset
      (i32.const 7))    ;; length

    ;; The memory used by this segment is no longer needed, so this segment can
    ;; be dropped.
    (data.drop 1))
)

Instruction encoding

All bulk memory instructions are encoded as a 0xfc prefix byte, followed by another opcode, optionally followed by more immediates:

instr ::= ...
        | 0xfc operation:uint8 ...
Name Opcode Immediate Description
memory.init 0xfc 0x08 segment:varuint32, memory:0x00 🤔 copy from a passive data segment to linear memory
data.drop 0xfc 0x09 segment:varuint32 🤔 prevent further use of passive data segment
memory.copy 0xfc 0x0a memory_src:0x00 memory_dst:0x00 🤔 copy from one region of linear memory to another region
memory.fill 0xfc 0x0b memory:0x00 🤔 fill a region of linear memory with a given byte value
table.init 0xfc 0x0c segment:varuint32, table:0x00 🤔 copy from a passive element segment to a table
elem.drop 0xfc 0x0d segment:varuint32 🤔 prevent further use of a passive element segment
table.copy 0xfc 0x0e table_src:0x00 table_dst:0x00 🤔 copy from one region of a table to another region

DataCount section

The WebAssembly binary format is designed to be validated in a single pass. If a section requires information to validate, it is guaranteed that this information will be present in a previous section.

The memory.{init,drop} instructions break this guarantee. Both of these instructions are used in the Code section. They each have a data segment index immediate, but the vector of data segments is not available until the Data section is parsed, which occurs after the Code section.

To keep single-pass validation, the number of data segments defined in the Data section must be available before the Code section. This information is provided in a new DataCount section with the code 12.

Like all sections, the DataCount section is optional. If present, it must appear in the following order:

Section Name Code Description
Type 1 Function signature declarations
Import 2 Import declarations
Function 3 Function declarations
Table 4 Indirect function table and other tables
Memory 5 Memory attributes
Global 6 Global declarations
Export 7 Exports
Start 8 Start function declaration
Element 9 Elements section
DataCount 12 Data segment count
Code 10 Function bodies (code)
Data 11 Data segments

The DataCount section has just one field that specifies the number of data segments in the Data section:

Field Type Description
count varuint32 count of data segments in Data section

It is a validation error if count is not equal to the number of data segments in the Data section. It is also a validation error if the DataCount section is omitted and a memory.init or data.drop instruction is used.

You can’t perform that action at this time.