Describe the bug
table.init appears to have a very expensive non-empty path in Wasmtime in a minimal repeated microbenchmark.
I first found this in a generated differential benchmark, then reduced it to a much smaller testcase. The slowdown remains after removing loop-derived operand shaping and shrinking the table/element resources to the minimum needed.
The smallest clear reproducer I found is:
primary_reproducer_table_init_len1.wat
A close control with len = 0 is:
supporting_control_table_init_len0.wat
Test Case
test_cases.zip
Primary reproducer loop body:
i32.const 0
i32.const 0
i32.const 1
table.init 0 0
Minimal resources:
(table $tab0 1 funcref)
(elem funcref (ref.null func))
Supporting controls:
- supporting_control_table_init_len0.wat (len = 0)
- supporting_len2_table_init.wat (len = 2 with table/elem size 2)
- supporting_table_fill_len1.wat
- supporting_table_copy_len1.wat
Steps to Reproduce
- Build the primary testcase:
wat2wasm --enable-all primary_reproducer_table_init_len1.wat -o primary_reproducer_table_init_len1.wasm
- Warm up once:
wasmtime primary_reproducer_table_init_len1.wasm
- Measure runtime:
perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_table_init_len1.wasm
- For comparison, run the same flow on:
supporting_control_table_init_len0.wasm
supporting_len2_table_init.wasm
supporting_table_fill_len1.wasm
supporting_table_copy_len1.wasm
If helpful, I can also provide the exact commands I used for the other runtimes in the comparison table.
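The flow above can be scripted so that every testcase goes through the same build, warm-up, and measure steps. The following is only a sketch: the helper names `commands_for` and `run_flow` are hypothetical, it assumes the `.wat` files from test_cases.zip sit in the current directory, and it uses only the command lines already listed in this report. By default it prints the commands without executing them.

```python
import shlex
import subprocess

# Testcase stems named in this report; the .wat sources come from test_cases.zip.
TESTCASES = [
    "primary_reproducer_table_init_len1",
    "supporting_control_table_init_len0",
    "supporting_len2_table_init",
    "supporting_table_fill_len1",
    "supporting_table_copy_len1",
]


def commands_for(stem):
    """Return the build, warm-up, and measure commands for one testcase."""
    wasm = f"{stem}.wasm"
    return [
        ["wat2wasm", "--enable-all", f"{stem}.wat", "-o", wasm],
        ["wasmtime", wasm],
        ["perf", "stat", "-r", "3", "-e", "task-clock", "wasmtime", wasm],
    ]


def run_flow(stems, dry_run=True):
    """Print each command; also execute it unless dry_run is set."""
    for stem in stems:
        for argv in commands_for(stem):
            print(shlex.join(argv))
            if not dry_run:
                subprocess.run(argv, check=True)


# Dry run: only prints the commands, nothing is executed.
run_flow(TESTCASES)
```

Calling `run_flow(TESTCASES, dry_run=False)` would execute the commands, which requires `wat2wasm`, `wasmtime`, and `perf` on the PATH.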
Expected and actual Results
Primary reduced table.init results
| testcase | shape | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| const_len0 | dst=0, src=0, len=0, table=1, elem=1 | 13.2085 | 6.2617 | 2.8532 | 13.4362 | 59.9080 | 3.2286 |
| const_len1 | dst=0, src=0, len=1, table=1, elem=1 | 13.8520 | 9.0505 | 4.1151 | 13.9670 | 99.9186 | 4.6532 |
| const_len2 | dst=0, src=0, len=2, table=2, elem=2 | 14.6396 | 9.0903 | 4.41133 | 14.6610 | 132.7836 | 4.9468 |
| const_src1_len1 | dst=0, src=1, len=1, table=2, elem=2 | 13.7660 | 9.0285 | 4.1430 | 14.1467 | 99.7570 | 4.6662 |
Observed pattern:
- Wasmtime is already much slower than the comparison runtimes for len = 0.
- The cost rises sharply for len = 1 and again for len = 2.
- Changing src from 0 to 1 does not materially change the result.
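To put numbers on the pattern, the small sketch below recomputes two things from the table: how many times slower Wasmtime is than the fastest other runtime in each row, and how much Wasmtime's time grows per step in len. The times are copied from the table; the helper name `slowdown_vs_fastest` is hypothetical. Note that the len = 2 case also uses a larger table and element segment, so the second step is not a pure per-element cost.

```python
# Times in seconds, copied from the primary results table above.
RESULTS = {
    "const_len0": {"wasmer_llvm": 13.2085, "wasmedge_jit": 6.2617,
                   "wamr_llvm_jit": 2.8532, "wasmer_cranelift": 13.4362,
                   "wasmtime": 59.9080, "wamr_fast_jit": 3.2286},
    "const_len1": {"wasmer_llvm": 13.8520, "wasmedge_jit": 9.0505,
                   "wamr_llvm_jit": 4.1151, "wasmer_cranelift": 13.9670,
                   "wasmtime": 99.9186, "wamr_fast_jit": 4.6532},
    "const_len2": {"wasmer_llvm": 14.6396, "wasmedge_jit": 9.0903,
                   "wamr_llvm_jit": 4.41133, "wasmer_cranelift": 14.6610,
                   "wasmtime": 132.7836, "wamr_fast_jit": 4.9468},
    "const_src1_len1": {"wasmer_llvm": 13.7660, "wasmedge_jit": 9.0285,
                        "wamr_llvm_jit": 4.1430, "wasmer_cranelift": 14.1467,
                        "wasmtime": 99.7570, "wamr_fast_jit": 4.6662},
}


def slowdown_vs_fastest(row, target="wasmtime"):
    """Ratio of the target runtime's time to the fastest other runtime in the row."""
    fastest_other = min(t for name, t in row.items() if name != target)
    return row[target] / fastest_other


for case, row in RESULTS.items():
    print(f"{case}: wasmtime is {slowdown_vs_fastest(row):.1f}x the fastest other runtime")

# Growth of Wasmtime's total time between consecutive len values.
step_0_to_1 = RESULTS["const_len1"]["wasmtime"] - RESULTS["const_len0"]["wasmtime"]
step_1_to_2 = RESULTS["const_len2"]["wasmtime"] - RESULTS["const_len1"]["wasmtime"]
print(f"len 0 -> 1: +{step_0_to_1:.4f} s, len 1 -> 2: +{step_1_to_2:.4f} s")
```

On these measurements the ratio grows from roughly 21x at len = 0 to roughly 30x at len = 2, and each step in len adds tens of seconds to the Wasmtime total.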
Target-removed control
A target-removed control with the same outer loop / stack shaping but no table.init is very fast:
| testcase | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
| --- | --- | --- | --- | --- | --- | --- |
| control_no_target | 0.011744 | 0.022739 | 0.015508 | 0.29056 | 0.28542 | 0.43075 |
So this does not look like a loop/scaffold artifact. The expensive part seems tied to table.init itself.
Related bulk-table instructions
I also compared matched table.fill / table.copy cases with len = 1:
| testcase | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
| --- | --- | --- | --- | --- | --- | --- |
| table.fill len=1 | 5.0919 | 4.89015 | 2.18633 | 5.36801 | 12.0544 | 2.6832 |
| table.copy len=1 | 6.32213 | 8.8099 | 4.8734 | 6.64548 | 18.5358 | 6.4398 |
Wasmtime is not the fastest there either, but the slowdown is much less dramatic than for table.init.
So the anomaly looks more specific to table.init than to all small bulk-table operations in general.
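The gap between table.init and the other bulk-table instructions can be quantified with the short sketch below. It pairs each Wasmtime time for the len = 1 cases with the fastest other runtime in the same row (wamr_llvm_jit in all three rows); the values are copied from the tables in this report.

```python
# (wasmtime seconds, fastest other runtime seconds) for the len = 1 cases,
# copied from the tables in this report.
CASES = {
    "table.init len=1": (99.9186, 4.1151),
    "table.fill len=1": (12.0544, 2.18633),
    "table.copy len=1": (18.5358, 4.8734),
}

for name, (wasmtime_s, fastest_other_s) in CASES.items():
    print(f"{name}: wasmtime is {wasmtime_s / fastest_other_s:.1f}x the fastest other runtime")

# Wasmtime compared against itself across instructions.
init_s = CASES["table.init len=1"][0]
init_vs_fill = init_s / CASES["table.fill len=1"][0]
init_vs_copy = init_s / CASES["table.copy len=1"][0]
print(f"table.init / table.fill: {init_vs_fill:.2f}x")
print(f"table.init / table.copy: {init_vs_copy:.2f}x")
```

On these numbers, Wasmtime's table.init with len = 1 takes roughly 8x as long as its own table.fill and roughly 5x as long as its own table.copy.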
Versions and Environment
- Wasmtime version: wasmtime 41.0.0 (4898322a4 2025-12-18)
- wasmer: 6.1.0
- WAMR: iwasm 2.4.4
- wasmedge: 0.16.1-18-gc457fe30
- wabt: 1.0.39
- llvm: 21.1.5
- Host OS: Ubuntu 22.04.5 LTS x64
- CPU: 12th Gen Intel® Core™ i7-12700 × 20
If useful, I can also attach the generated CLIF for the reduced testcase.
Extra Info
For the reduced const_len1 testcase, Wasmtime still keeps the hot loop alive and still lowers the operation through the table.init builtin/helper path.
I generated CLIF with:
wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_table_init_len1.wasm
In the generated CLIF for the reduced case, the hot loop still contains a per-iteration call equivalent to:
call fn0(vmctx, 0, 0, 0, 0, 1)
So this does not appear to be caused by dead-code elimination or by loop-derived operand shaping.
Based on the measurements, the strongest trigger condition I can currently support is:
- repeated table.init 0 0
- in-bounds
- minimal table / passive element segment
- especially the non-empty path (len > 0)
I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.