Add a basic alias analysis with redundant-load elim and store-to-load forwarding opts. #4163

Merged
merged 7 commits into from
May 20, 2022

Member

@cfallin cfallin commented May 18, 2022

This PR adds a basic alias analysis, and optimizations that use it.
This is a "mid-end optimization": it operates on CLIF, the
machine-independent IR, before lowering occurs.

The alias analysis (or maybe more properly, a sort of memory-value
analysis) determines when it can prove a particular memory
location is equal to a given SSA value, and when it can, it replaces any
loads of that location.

This subsumes two common optimizations:

  • Redundant load elimination: when the same memory address is loaded two
    times, and it can be proven that no intervening operations will write
    to that memory, then the second load is redundant and its result
    must be the same as the first. We can use the first load's result and
    remove the second load.

  • Store-to-load forwarding: when a load can be proven to access exactly
    the memory written by a preceding store, we can replace the load's
    result with the store's data operand, and remove the load.

Both of these optimizations rely on a "last store" analysis that is a
sort of coloring mechanism, split across disjoint categories of abstract
state. The basic idea is that every memory-accessing operation is put
into one of N disjoint categories; it is disallowed for memory to ever
be accessed by an op in one category and later accessed by an op in
another category. (The frontend must ensure this.)

Then, given this, we scan the code and determine, for each
memory-accessing op, the single prior instruction (if one can be
identified) that is the last store to the same category. This "colors"
the instruction: it is, in a sense, a static name for that version of
memory.

This analysis provides an important invariant: if two operations access
memory with the same last-store, then no other store can alias in the
time between that last store and these operations. This must-not-alias
property, together with checks that the accessed address is exactly
the same (same SSA value and offset) and that the other attributes of
the access (type, extension mode) are the same, lets us prove that the
results are the same.
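
As an illustration of the coloring idea, here is a minimal Rust sketch of the last-store scan over one straight-line instruction sequence. It is not the code from this PR: `Category`, `Inst`, and `load_colors` are hypothetical names, instructions are identified by their index, and calls are assumed to clobber every category.

```rust
/// Disjoint pieces of abstract memory state, named after the categories
/// mentioned in this PR ("heap", "table", "vmctx", "other").
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Category {
    Heap,
    Table,
    Vmctx,
    Other,
}

impl Category {
    fn index(self) -> usize {
        match self {
            Category::Heap => 0,
            Category::Table => 1,
            Category::Vmctx => 2,
            Category::Other => 3,
        }
    }
}

/// Toy instruction stream (hypothetical; not Cranelift's instruction type).
#[derive(Clone, Copy, Debug)]
enum Inst {
    Load(Category),
    Store(Category),
    /// Assumed to clobber every category, i.e. to act as a full barrier.
    Call,
}

/// For each load, report `(load index, index of the last instruction that
/// may have written its category)`. `None` means nothing has been seen yet.
fn load_colors(insts: &[Inst]) -> Vec<(usize, Option<usize>)> {
    let mut last_store: [Option<usize>; 4] = [None; 4];
    let mut colors = Vec::new();
    for (i, inst) in insts.iter().enumerate() {
        match *inst {
            Inst::Load(cat) => colors.push((i, last_store[cat.index()])),
            Inst::Store(cat) => last_store[cat.index()] = Some(i),
            Inst::Call => last_store = [Some(i); 4],
        }
    }
    colors
}
```

In this sketch, two heap loads separated only by a table store receive the same color, so they remain candidates for sharing a value, while a call in between gives the later load a fresh color.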

Given last-store info, we scan the instructions and build a table from
"memory location" key (last store, address, offset, type, extension) to
known SSA value stored in that location. A store inserts a new mapping.
A load may also insert a new mapping, if we didn't already have one.
Then when a load occurs and an entry already exists for its "location",
we can reuse the value. This will be either RLE or St-to-Ld depending on
where the value came from.
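
The table described above can be sketched in Rust as follows. This is a simplified illustration under stated assumptions, not the implementation in this PR: `MemoryLoc`, `Access`, `Ext`, and `find_redundant_loads` are hypothetical names, SSA values are plain integers, the caller is assumed to have already computed `last_store` for every access (with a store naming itself as the last store), and the dominance check is omitted.

```rust
use std::collections::HashMap;

/// Toy SSA value id (hypothetical).
type Value = u32;

/// Extension mode of a load (hypothetical encoding).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Ext {
    None,
    Zero,
    Sign,
}

/// The "memory location" key: (last store, address, offset, type, extension).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct MemoryLoc {
    last_store: Option<usize>,
    address: Value,
    offset: i32,
    ty_bits: u16,
    ext: Ext,
}

#[derive(Clone, Copy, Debug)]
enum Access {
    Store { loc: MemoryLoc, data: Value },
    Load { loc: MemoryLoc, result: Value },
}

/// Returns `(load result, known value)` pairs: each load result listed here
/// could be replaced by the paired value.
fn find_redundant_loads(accesses: &[Access]) -> Vec<(Value, Value)> {
    let mut known: HashMap<MemoryLoc, Value> = HashMap::new();
    let mut aliases = Vec::new();
    for access in accesses {
        match *access {
            // A store always records its data operand for this location.
            Access::Store { loc, data } => {
                known.insert(loc, data);
            }
            Access::Load { loc, result } => match known.get(&loc).copied() {
                // Entry exists: RLE or store-to-load forwarding, depending
                // on whether a load or a store inserted it.
                Some(value) => aliases.push((result, value)),
                // First load of this location: remember its result.
                None => {
                    known.insert(loc, result);
                }
            },
        }
    }
    aliases
}
```

Because `last_store` is part of the key, any intervening store to the same category changes the key of later loads, so stale entries are simply never matched rather than needing explicit invalidation.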

Note that this does work across basic blocks: the last-store analysis
is a full iterative dataflow pass, and we are careful to check dominance
of a previously-defined value before aliasing to it at a potentially
redundant load. So we will do the right thing if we only have a
"partially redundant" load (loaded already but only in one predecessor
block), but we will also correctly reuse a value if there is a store or
load above a loop and a redundant load of that value within the loop, as
long as no potentially-aliasing stores happen within the loop.
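
The dominance condition can be illustrated with a small Rust sketch. It assumes a hypothetical representation in which blocks are numbered and `idom[b]` gives the immediate dominator of block `b` (the entry block is its own immediate dominator); it works at block granularity only, so ordering within a single block would need a separate check.

```rust
/// Returns true if block `a` dominates block `b`, walking up the
/// immediate-dominator chain from `b` until `a` or the entry is reached.
fn dominates(idom: &[usize], a: usize, mut b: usize) -> bool {
    loop {
        if a == b {
            return true;
        }
        let parent = idom[b];
        if parent == b {
            // Reached the entry block without meeting `a`.
            return false;
        }
        b = parent;
    }
}
```

In a diamond-shaped CFG (block 0 branches to blocks 1 and 2, which both jump to block 3), a value loaded in block 1 does not dominate block 3, so a load in block 3 would be left alone; a value defined in block 0 above a loop does dominate the loop body, so it may be reused there.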

Fixes #4131.

Passes tests and runs SpiderMonkey correctly locally; benchmarks TBD.

Creating this as a draft for early feedback; will likely refine the comments
and explanations a bit more, and benchmark this.

@cfallin cfallin requested a review from fitzgen May 18, 2022 23:58
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:wasm labels May 19, 2022
@fitzgen
Member

fitzgen commented May 19, 2022

Great! I will take a look at this tomorrow.

@cfallin
Member Author

cfallin commented May 19, 2022

I ran the Sightglass benchmarks on this and the only delta is for meshoptimizer:

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 957833251.60 ± 610441836.41 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [17612149560 17873615806.60 18098060500] new.so
  [17544919808 18831449058.20 19306513928] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 252025449.20 ± 160632667.79 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [4634388228 4703136639.50 4762142132] new.so
  [4616626357 4955162088.70 5080175835] old.so

(with no impact on compilation time for any benchmark). I'm not too surprised we don't see more opportunity in Wasm benchmarks, because a lot of the RLE and store-to-load opts will have already been done by the Wasm toolchain. Regardless it seems nice to have this in, if it doesn't hurt, and it could become more applicable if/when we inline.

//!
//! We partition memory state into several *disjoint pieces* of
//! "abstract state". There are a finite number of such pieces:
//! currently, we call them "heap", "table", "vmctx", and "other". Any
Contributor


It would be nice if all stack slots would be their own category until their address is leaked using stack_addr.

Member Author


Sure, that would be nice, but entails an escape analysis, which is outside the scope of this PR. Happy to discuss further (and review PRs!) once this is in, of course.

Contributor


Even stopping optimization entirely if stack_addr is used on a stack slot would likely help cg_clif a lot. Having each stack slot whose address is never leaked be optimized independently is the most important for cg_clif. I'm fine with leaving that for a later PR.

@bjorn3
Contributor

bjorn3 commented May 19, 2022

I think you will need to add support for volatile loads and stores for this to not miscompile rust code. It would also be nice if you could state an exhaustive list of UB it depends on. For example that reading/writing memory racing with the compiled clif function is UB unless it is an atomic or volatile load/store.

@bjorn3
Contributor

bjorn3 commented May 19, 2022

By the way, what is the time complexity of this optimization pass?

@cfallin
Member Author

cfallin commented May 19, 2022

volatile loads/stores to not miscompile

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Otherwise removing a load and replacing it with a known value is always legal, given CLIF semantics today. (If it helps, imagine the disjoint alias categories don't exist: by default all loads/stores fall into "other", so memory locations are named by the last store of any kind.)

Atomics should use atomic ops at the CLIF level of course; I believe those should act as full barriers across all categories, as calls do. (I don't think I've actually added this logic yet; will ensure it's there tomorrow.)

time complexity of this optimization pass?

Same as any of the other iterative dataflow analyses: worst case passes over any given basic block bounded by maximum descending chain depth in the lattice. Here the last-store vector meets to "current inst" on any conflict at a merge point so the lattice is effectively three levels deep, constant; so that's O(|blocks|) to converge last-store info (likely one visit per block, rarely two). Then one more visit per block to discover and use aliases.
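
To make the merge-point behavior concrete, here is a minimal Rust sketch of a meet operation of the kind described above. It is an illustration under assumptions, not the PR's code: `LastStores` and `meet` are hypothetical names, instruction ids are plain integers, and `None` stands for "no store seen yet".

```rust
/// Last store per category, in the order heap, table, vmctx, other.
type LastStores = [Option<u32>; 4];

/// Combine the states arriving from two predecessors at a merge point.
/// Where they agree, the common last store is kept; where they conflict,
/// the slot is set to `merge_inst`, an instruction at the merge point.
fn meet(a: &LastStores, b: &LastStores, merge_inst: u32) -> LastStores {
    let mut out = *a;
    for i in 0..out.len() {
        if a[i] != b[i] {
            out[i] = Some(merge_inst);
        }
    }
    out
}
```

In this sketch a slot can move from "nothing seen", to a specific store, to the merge point's own instruction, which matches the short descending chain mentioned above.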

@bjorn3
Contributor

bjorn3 commented May 19, 2022

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Volatile memory operations can't be changed as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized as rust defines data races to be UB.

@bjorn3
Contributor

bjorn3 commented May 19, 2022

This is a slight regression for the simple-raytracer benchmark of cg_clif. I would guess it increases live-range sizes causing a regalloc pessimization. Implementing #4163 (comment) would make it a beneficial optimization I think. I used to have a simple implementation of that in cg_clif which also looks at which bytes of a stack slot were accessed by the load/store to effectively allow SROA, but I removed it in bjorn3/rustc_codegen_cranelift@a793be8 as it was buggy and I didn't think I knew how to fix it. It also only operated within a single basic block. See https://github.com/bjorn3/rustc_codegen_cranelift/issues/846.

Benchmark 1: ./raytracer_cg_clif_release_alias_analysis
  Time (mean ± σ):      5.196 s ±  0.012 s    [User: 5.191 s, System: 0.004 s]
  Range (min … max):    5.169 s …  5.211 s    10 runs
 
Benchmark 2: ./raytracer_cg_clif_release_main
  Time (mean ± σ):      5.224 s ±  0.024 s    [User: 5.218 s, System: 0.005 s]
  Range (min … max):    5.197 s …  5.263 s    10 runs
 
Summary
  './raytracer_cg_clif_release_alias_analysis' ran
    1.01 ± 0.01 times faster than './raytracer_cg_clif_release_main'

@cfallin
Member Author

cfallin commented May 19, 2022

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Volatile memory operations can't be changed as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized as rust defines data races to be UB.

OK, so if I understand correctly, cg_clif is currently compiling volatiles at the Rust level to normal loads/stores at the CLIF level? That I agree would result in miscompiles; the issue though isn't alias analysis as the semantics of load and store in CLIF have always been (even if not exploited until now) those of normal loads/stores. Adding volatile memory ops seems like a reasonable feature request though and I'd be happy to review a PR for that as well.

@cfallin
Member Author

cfallin commented May 19, 2022

@bjorn3 I may be misreading your benchmark result, but:

Benchmark 1: ./raytracer_cg_clif_release_alias_analysis
  Time (mean ± σ):      5.196 s ±  0.012 s    [User: 5.191 s, System: 0.004 s]

Benchmark 2: ./raytracer_cg_clif_release_main
  Time (mean ± σ):      5.224 s ±  0.024 s    [User: 5.218 s, System: 0.005 s]

looks like alias_analysis ran a bit faster than main? Then './raytracer_cg_clif_release_alias_analysis' ran 1.01 ± 0.01 times faster than './raytracer_cg_clif_release_main' seems to confirm -- +1%, though maybe within a margin of error.

@cfallin cfallin marked this pull request as ready for review May 19, 2022 18:44
Member

@fitzgen fitzgen left a comment


Overall looks great. I'd like to dig more into the tests before giving my sign off and I think having test alias-analysis and comments will really help with that.

Thanks!

cranelift/codegen/src/inst_predicates.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/ir/memflags.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/ir/memflags.rs Outdated Show resolved Hide resolved
cranelift/filetests/filetests/alias/simple-alias.clif Outdated Show resolved Hide resolved
crates/cranelift/src/func_environ.rs Outdated Show resolved Hide resolved
cranelift/codegen/src/alias_analysis.rs Outdated Show resolved Hide resolved
@bjorn3
Contributor

bjorn3 commented May 19, 2022

looks like alias_analysis ran a bit faster than main?

Right, my bad.

Also, fix bug in which basic blocks were skipped.
@cfallin
Member Author

cfallin commented May 19, 2022

Updated, thanks!

@bjorn3
Contributor

bjorn3 commented May 20, 2022

Hang on, I think I know why it didn't help much for cg_clif. cg_clif uses stack_load and stack_store rather than the stack_addr+load/store this opt pass needs.

@cfallin
Member Author

cfallin commented May 20, 2022

@bjorn3 the alias analysis / redundant-load pass runs after legalization, so it should see plain loads/stores rather than stack_load/stack_store, I think.

Member

@fitzgen fitzgen left a comment


Thanks! A whole lot easier to review with test alias-analysis :)

Looks great. A few nitpicks/optional suggestions below. r=me with everything addressed/considered.

cranelift/codegen/src/alias_analysis.rs Show resolved Hide resolved
cranelift/codegen/src/alias_analysis.rs Show resolved Hide resolved
cranelift/codegen/src/alias_analysis.rs Outdated Show resolved Hide resolved
@pepyakin
Collaborator

pepyakin commented Jun 16, 2022

FWIW, I've noticed a regression of around 15%, and bisection points to this commit.

the results for 0824abb:

./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      7.098 s ±  0.033 s    [User: 7.080 s, System: 0.015 s]
  Range (min … max):    7.018 s …  7.146 s    10 runs

previous good, 89ccc56:

./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.071 s ±  0.039 s    [User: 6.052 s, System: 0.016 s]
  Range (min … max):    6.024 s …  6.153 s    10 runs

This particular benchmark is a standalone wasm file, i.e. no imports. The entry point loads a test wasm binary and interprets it with a compiled-in wasmi. The test binary is also standalone. The test wasm generates random numbers using xorshift and computes a Keccak hash of them. I can publish the wasm or the source to get the wasm.

UPD:

uname: Linux hetzner 5.10.78 #1-NixOS SMP Sat Nov 6 13:10:10 UTC 2021 x86_64 GNU/Linux
CPU: AMD Ryzen 9 5950X 16-Core Processor

@pepyakin
Collaborator

pepyakin commented Jun 16, 2022

Well, it gets weird.

If I apply the following change on top of 0824abb

diff --cc crates/wasmtime/src/config.rs
index 9c1ed87ca,9c1ed87ca..6e9bcd1f7
--- a/crates/wasmtime/src/config.rs
+++ b/crates/wasmtime/src/config.rs
@@@ -1287,6 -1287,6 +1287,7 @@@ fn compiler_builder(strategy: Strategy
              },
          )
          .unwrap();
++    builder.set("enable_pinned_reg", "true").unwrap();
      Ok(builder)
  }

diff --cc crates/wasmtime/src/engine.rs
index c81ae6e67,c81ae6e67..cc150af3a
--- a/crates/wasmtime/src/engine.rs
+++ b/crates/wasmtime/src/engine.rs
@@@ -312,7 -312,7 +312,7 @@@ impl Engine
              "baldrdash_prologue_words" => *value == FlagValue::Num(0),
              "enable_llvm_abi_extensions" => *value == FlagValue::Bool(false),
              "emit_all_ones_funcaddrs" => *value == FlagValue::Bool(false),
--            "enable_pinned_reg" => *value == FlagValue::Bool(false),
++            "enable_pinned_reg" => *value == FlagValue::Bool(true),
              "enable_probestack" => *value == FlagValue::Bool(false),
              "use_colocated_libcalls" => *value == FlagValue::Bool(false),
              "use_pinned_reg_as_heap_base" => *value == FlagValue::Bool(false),

then the result would be like this:

Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.111 s ±  0.050 s    [User: 6.094 s, System: 0.014 s]
  Range (min … max):    6.044 s …  6.226 s    10 runs

i.e. the regression disappears.

I'm confused because I'm not sure how that could possibly be related.

@cfallin
Member Author

cfallin commented Jun 17, 2022

@pepyakin I'm happy to take a look at this next week (I'm out of office now); can you publish the .wasm somewhere in the meantime?

@pepyakin
Collaborator

Sure!

wasmibox.wasm.gz

@cfallin
Member Author

cfallin commented Jun 28, 2022

@pepyakin I'm just getting to this now (sorry for the delay!) and I'm actually seeing a speedup from alias analysis, consistently, of about 6% (6.2s to 5.8s):

[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.22s user 0.01s system 99% cpu 6.246 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.84s user 0.01s system 99% cpu 5.856 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.19s user 0.01s system 99% cpu 6.216 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.85s user 0.01s system 99% cpu 5.868 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.22s user 0.01s system 99% cpu 6.242 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.77s user 0.01s system 99% cpu 5.792 total

These were built with current main (c1b3962f7b9fc1c193e1b9709db64c455699a295) with this patch to make alias analysis / redundant-load elimination optional; then

$ wasmtime compile --cranelift-set opt_level=speed --cranelift-set enable_alias_analysis=true wasmibox.wasm -o wasmibox_aa.cwasm
$ wasmtime compile --cranelift-set opt_level=speed --cranelift-set enable_alias_analysis=false wasmibox.wasm -o wasmibox_no_aa.cwasm

and I double-checked I wasn't swapping the two. Looking at a diff of the disassemblies, it's sort of what I expect: a few extra loads are removed and carried in registers instead, and this I would indeed expect to improve performance, as it does.

Can you confirm you're still seeing a slowdown, not speedup, and if so at which commit and anything else about your environment and measurements?

cfallin added a commit to cfallin/wasmtime that referenced this pull request Jun 28, 2022
…elimination.

This allows for experiments as in here [1] and also generally gives an
option to anyone who is concerned that the extra optimization may be
counterproductive or take too much time. The optimization remains
enabled by default.

[1]
bytecodealliance#4163 (comment)
cfallin added a commit that referenced this pull request Jun 28, 2022
…elimination. (#4349)
afonso360 pushed a commit to afonso360/wasmtime that referenced this pull request Jun 30, 2022
…elimination. (bytecodealliance#4349)
@pepyakin
Collaborator

pepyakin commented Jul 4, 2022

With the following script:

./wasmtime-0824abbae compile -g -O wasmibox.wasm -o wasmibox-0824abbae.cwasm
./wasmtime-89ccc56e4 compile -g -O wasmibox.wasm -o wasmibox-89ccc56e4.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=true -o wasmibox-a2197ebbe-aa.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=false -o wasmibox-a2197ebbe-no-aa.cwasm

hyperfine --show-output --warmup 2 "./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2  "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run"

I am getting the following results:

++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      6.887 s ±  0.046 s    [User: 6.872 s, System: 0.013 s]
  Range (min … max):    6.848 s …  6.990 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      5.941 s ±  0.024 s    [User: 5.927 s, System: 0.013 s]
  Range (min … max):    5.914 s …  5.978 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.666 s ±  0.037 s    [User: 4.651 s, System: 0.013 s]
  Range (min … max):    4.627 s …  4.743 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.524 s ±  0.133 s    [User: 4.510 s, System: 0.013 s]
  Range (min … max):    4.472 s …  4.898 s    10 runs

It confirms that on my machine the pre-AA commit performs considerably better than the commit where AA was introduced. The current main performs better than both of them regardless of whether AA is enabled. No AA performs better than with AA on main though.

However, if I remove the -g flag then the whole picture changes:

++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      5.826 s ±  0.082 s    [User: 5.815 s, System: 0.010 s]
  Range (min … max):    5.759 s …  6.040 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      6.600 s ±  0.073 s    [User: 6.589 s, System: 0.009 s]
  Range (min … max):    6.543 s …  6.722 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.375 s ±  0.013 s    [User: 4.364 s, System: 0.010 s]
  Range (min … max):    4.364 s …  4.406 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.721 s ±  0.016 s    [User: 4.711 s, System: 0.009 s]
  Range (min … max):    4.702 s …  4.746 s    10 runs

Now, it starts to make more sense, although I still don't understand how that could be possibly related.

@cfallin
Member Author

cfallin commented Jul 5, 2022

OK, I agree that it doesn't make any sense that -g would make the difference here. I suspect either some odd measurement effect or some pessimization caused by debuginfo registration or something like that. Given that (i) the normal config behaves as expected (alias analysis produces a nontrivial speedup), (ii) codegen looks as expected on manual examination (diff of no-AA vs AA in default config on main), and (iii) debuginfo in general is a big mess that needs more focused attention, I don't think I'm able to justify spending significantly more time looking into this; but if you find anything more please let us know.

Labels
cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:wasm cranelift Issues related to the Cranelift code generator
Successfully merging this pull request may close these issues.

Cranelift: develop an alias analysis
5 participants