
<Performance> fuzzbug: table.init has a very slow non-empty path in a minimal repeated microbenchmark #13258

@gaaraw

Description


Describe the bug

table.init appears to have a very expensive non-empty path in Wasmtime in a minimal repeated microbenchmark.

I first found this in a generated differential benchmark, then reduced it to a much smaller testcase. The slowdown remains after removing loop-derived operand shaping and shrinking the table/element resources to the minimum needed.

The smallest clear reproducer I found is:

  • primary_reproducer_table_init_len1.wat

A close control with len = 0 is:

  • supporting_control_table_init_len0.wat

Test Case

test_cases.zip

Primary reproducer loop body:

i32.const 0    ;; dst
i32.const 0    ;; src
i32.const 1    ;; len
table.init 0 0

Minimal resources:

(table $tab0 1 funcref)
(elem funcref (ref.null func))
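For reference, the loop body and the minimal resources above can be stitched into one self-contained module. This is a sketch only: the loop bound (100,000,000 iterations) and the exported function name are illustrative assumptions, not the exact values from the attached testcase.

```wat
;; Sketch of the reduced shape; loop bound and export name are illustrative.
(module
  (table $tab0 1 funcref)
  (elem funcref (ref.null func))   ;; passive element segment 0
  (func (export "_start") (local $i i32)
    (local.set $i (i32.const 100000000))
    (block $done
      (loop $l
        (br_if $done (i32.eqz (local.get $i)))
        i32.const 0    ;; dst
        i32.const 0    ;; src
        i32.const 1    ;; len
        table.init 0 0 ;; table 0, element segment 0
        (local.set $i (i32.sub (local.get $i) (i32.const 1)))
        (br $l)))))
```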

Supporting controls:

  • supporting_control_table_init_len0.wat (len = 0)
  • supporting_len2_table_init.wat (len = 2 with table/elem size 2)
  • supporting_table_fill_len1.wat
  • supporting_table_copy_len1.wat

Steps to Reproduce

  1. Build the primary testcase:
wat2wasm --enable-all primary_reproducer_table_init_len1.wat -o primary_reproducer_table_init_len1.wasm
  2. Warm up once:
wasmtime primary_reproducer_table_init_len1.wasm
  3. Measure runtime:
perf stat -r 3 -e 'task-clock' wasmtime primary_reproducer_table_init_len1.wasm
  4. For comparison, run the same flow on:
  • supporting_control_table_init_len0.wasm
  • supporting_len2_table_init.wasm
  • supporting_table_fill_len1.wasm
  • supporting_table_copy_len1.wasm
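The build / warm-up / measure flow above can be scripted. The sketch below assembles the exact commands from the steps; the helper name `repro_commands` and the testcase list are my own scaffolding, not part of the attached zip.

```python
# Sketch of the reproduction flow; the wat2wasm / wasmtime / perf
# invocations are copied from the steps above.
import subprocess

TESTCASES = [
    "primary_reproducer_table_init_len1",
    "supporting_control_table_init_len0",
    "supporting_len2_table_init",
    "supporting_table_fill_len1",
    "supporting_table_copy_len1",
]

def repro_commands(name):
    """Return the build, warm-up, and measure commands for one testcase."""
    wat, wasm = f"{name}.wat", f"{name}.wasm"
    return [
        ["wat2wasm", "--enable-all", wat, "-o", wasm],   # build
        ["wasmtime", wasm],                              # warm up once
        ["perf", "stat", "-r", "3", "-e", "task-clock",  # measure
         "wasmtime", wasm],
    ]

# To actually run the flow:
#   for name in TESTCASES:
#       for cmd in repro_commands(name):
#           subprocess.run(cmd, check=True)
```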

If helpful, I can also provide the exact commands I used for the other runtimes in the comparison table.

Expected and Actual Results

Primary reduced table.init results

| testcase | shape | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
|---|---|---|---|---|---|---|---|
| const_len0 | dst=0, src=0, len=0, table=1, elem=1 | 13.2085 | 6.2617 | 2.8532 | 13.4362 | 59.9080 | 3.2286 |
| const_len1 | dst=0, src=0, len=1, table=1, elem=1 | 13.8520 | 9.0505 | 4.1151 | 13.9670 | 99.9186 | 4.6532 |
| const_len2 | dst=0, src=0, len=2, table=2, elem=2 | 14.6396 | 9.0903 | 4.41133 | 14.6610 | 132.7836 | 4.9468 |
| const_src1_len1 | dst=0, src=1, len=1, table=2, elem=2 | 13.7660 | 9.0285 | 4.1430 | 14.1467 | 99.7570 | 4.6662 |

Observed pattern:

  • Wasmtime is already much slower than the comparison runtimes for len = 0.
  • The cost rises sharply for len = 1 and again for len = 2.
  • Changing src from 0 to 1 does not materially change the result.
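To make the scaling concrete, the per-element increments can be read directly off the wasmtime column above. A quick sanity check on those numbers:

```python
# Wasmtime timings (seconds) from the table above, keyed by len.
wasmtime_s = {0: 59.9080, 1: 99.9186, 2: 132.7836}

# Incremental cost of each additional element per table.init.
delta_0_to_1 = wasmtime_s[1] - wasmtime_s[0]  # ≈ 40.0 s
delta_1_to_2 = wasmtime_s[2] - wasmtime_s[1]  # ≈ 32.9 s

print(f"len 0 -> 1: +{delta_0_to_1:.1f} s")
print(f"len 1 -> 2: +{delta_1_to_2:.1f} s")
```

So even the len = 0 baseline is expensive, and each extra element adds a large, roughly constant cost on top.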

Target-removed control

A target-removed control with the same outer loop / stack shaping but no table.init is very fast:

| testcase | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
|---|---|---|---|---|---|---|
| control_no_target | 0.011744 | 0.022739 | 0.015508 | 0.29056 | 0.28542 | 0.43075 |

So this does not look like a loop/scaffold artifact. The expensive part seems tied to table.init itself.

Related bulk-table instructions

I also compared matched table.fill / table.copy cases with len = 1:

| testcase | wasmer_llvm (s) | wasmedge_jit (s) | wamr_llvm_jit (s) | wasmer_cranelift (s) | wasmtime (s) | wamr_fast_jit (s) |
|---|---|---|---|---|---|---|
| table.fill len=1 | 5.0919 | 4.89015 | 2.18633 | 5.36801 | 12.0544 | 2.6832 |
| table.copy len=1 | 6.32213 | 8.8099 | 4.8734 | 6.64548 | 18.5358 | 6.4398 |

Wasmtime is not the fastest there either, but the slowdown is much less dramatic than for table.init.

So the anomaly appears specific to table.init rather than to small bulk-table operations in general.
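One way to quantify "specific to table.init" is to compare wasmtime against the fastest runtime for each len = 1 case, with the numbers copied from the tables above (in each case wamr_llvm_jit happens to be the fastest):

```python
# (fastest other runtime, wasmtime) in seconds, from the tables above.
results = {
    "table.init": (4.1151, 99.9186),
    "table.fill": (2.18633, 12.0544),
    "table.copy": (4.8734, 18.5358),
}

for op, (best, wt) in results.items():
    print(f"{op}: wasmtime is {wt / best:.1f}x the fastest runtime")
```

This gives roughly a 24x gap for table.init versus about 5.5x for table.fill and 3.8x for table.copy.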

Versions and Environment

  • Wasmtime version: wasmtime 41.0.0 (4898322a4 2025-12-18)
  • wasmer: 6.1.0
  • WAMR: iwasm 2.4.4
  • wasmedge: 0.16.1-18-gc457fe30
  • wabt: 1.0.39
  • llvm: 21.1.5
  • Host OS: Ubuntu 22.04.5 LTS x64
  • CPU: 12th Gen Intel® Core™ i7-12700 × 20

If useful, I can also attach the generated CLIF for the reduced testcase.

Extra Info

For the reduced const_len1 testcase, Wasmtime still keeps the hot loop alive and still lowers the operation through the table.init builtin/helper path.

I generated CLIF with:

wasmtime compile -C cache=n --emit-clif out_dir primary_reproducer_table_init_len1.wasm

In the generated CLIF for the reduced case, the hot loop still contains a per-iteration call equivalent to:

call fn0(vmctx, 0, 0, 0, 0, 1)

So this does not appear to be caused by dead-code elimination or by loop-derived operand shaping.

Based on the measurements, the strongest trigger condition I can currently support is:

  • repeated table.init 0 0
  • in-bounds
  • minimal table / passive element segment
  • especially the non-empty path (len > 0)

I have not confirmed the internal root cause, so I’m only reporting the measured trigger pattern here.

Metadata

Labels: bug (Incorrect behavior in the current implementation that needs fixing), fuzz-bug (Bugs found by a fuzzer)
