Add a basic alias analysis with redundant-load elim and store-to-load fowarding opts. #4163

cfallin · 2022-05-18T23:58:10Z

This PR adds a basic alias analysis, and optimizations that use it.
This is a "mid-end optimization": it operates on CLIF, the
machine-independent IR, before lowering occurs.

The alias analysis (or maybe more properly, a sort of memory-value
analysis) determines when it can prove a particular memory
location is equal to a given SSA value, and when it can, it replaces any
loads of that location.

This subsumes two common optimizations:

Redundant load elimination: when the same memory address is loaded two
times, and it can be proven that no intervening operations will write
to that memory, then the second load is redundant and its result
must be the same as the first. We can use the first load's result and
remove the second load.
Store-to-load forwarding: when a load can be proven to access exactly
the memory written by a preceding store, we can replace the load's
result with the store's data operand, and remove the load.

Both of these optimizations rely on a "last store" analysis that is a
sort of coloring mechanism, split across disjoint categories of abstract
state. The basic idea is that every memory-accessing operation is put
into one of N disjoint categories; it is disallowed for memory to ever
be accessed by an op in one category and later accessed by an op in
another category. (The frontend must ensure this.)

Then, given this, we scan the code and determine, for each
memory-accessing op, when a single prior instruction is a store to the
same category. This "colors" the instruction: it is, in a sense, a
static name for that version of memory.

This analysis provides an important invariant: if two operations access
memory with the same last-store, then no other store can alias in the
time between that last store and these operations. This must-not-alias
property, together with a check that the accessed address is exactly
the same (same SSA value and offset), and other attributes of the
access (type, extension mode) are the same, let us prove that the
results are the same.

Given last-store info, we scan the instructions and build a table from
"memory location" key (last store, address, offset, type, extension) to
known SSA value stored in that location. A store inserts a new mapping.
A load may also insert a new mapping, if we didn't already have one.
Then when a load occurs and an entry already exists for its "location",
we can reuse the value. This will be either RLE or St-to-Ld depending on
where the value came from.

Note that this does work across basic blocks: the last-store analysis
is a full iterative dataflow pass, and we are careful to check dominance
of a previously-defined value before aliasing to it at a potentially
redundant load. So we will do the right thing if we only have a
"partially redundant" load (loaded already but only in one predecessor
block), but we will also correctly reuse a value if there is a store or
load above a loop and a redundant load of that value within the loop, as
long as no potentially-aliasing stores happen within the loop.

Fixes #4131.

Passes tests and runs SpiderMonkey correctly locally; benchmarks TBD.

Creating this as a draft for early feedback; will likely refine the comments
and explanations a bit more, and benchmark this.

… fowarding opts. This PR adds a basic *alias analysis*, and optimizations that use it. This is a "mid-end optimization": it operates on CLIF, the machine-independent IR, before lowering occurs. The alias analysis (or maybe more properly, a sort of memory-value analysis) determines when it can prove a particular memory location is equal to a given SSA value, and when it can, it replaces any loads of that location. This subsumes two common optimizations: * Redundant load elimination: when the same memory address is loaded two times, and it can be proven that no intervening operations will write to that memory, then the second load is *redundant* and its result must be the same as the first. We can use the first load's result and remove the second load. * Store-to-load forwarding: when a load can be proven to access exactly the memory written by a preceding store, we can replace the load's result with the store's data operand, and remove the load. Both of these optimizations rely on a "last store" analysis that is a sort of coloring mechanism, split across disjoint categories of abstract state. The basic idea is that every memory-accessing operation is put into one of N disjoint categories; it is disallowed for memory to ever be accessed by an op in one category and later accessed by an op in another category. (The frontend must ensure this.) Then, given this, we scan the code and determine, for each memory-accessing op, when a single prior instruction is a store to the same category. This "colors" the instruction: it is, in a sense, a static name for that version of memory. This analysis provides an important invariant: if two operations access memory with the same last-store, then *no other store can alias* in the time between that last store and these operations. This must-not-alias property, together with a check that the accessed address is *exactly the same* (same SSA value and offset), and other attributes of the access (type, extension mode) are the same, let us prove that the results are the same. Given last-store info, we scan the instructions and build a table from "memory location" key (last store, address, offset, type, extension) to known SSA value stored in that location. A store inserts a new mapping. A load may also insert a new mapping, if we didn't already have one. Then when a load occurs and an entry already exists for its "location", we can reuse the value. This will be either RLE or St-to-Ld depending on where the value came from. Note that this *does* work across basic blocks: the last-store analysis is a full iterative dataflow pass, and we are careful to check dominance of a previously-defined value before aliasing to it at a potentially redundant load. So we will do the right thing if we only have a "partially redundant" load (loaded already but only in one predecessor block), but we will also correctly reuse a value if there is a store or load above a loop and a redundant load of that value within the loop, as long as no potentially-aliasing stores happen within the loop. Passes tests and runs SpiderMonkey correctly locally; benchmarks TBD.

fitzgen · 2022-05-19T00:30:47Z

Great! I will take a look at this tomorrow.

cfallin · 2022-05-19T00:31:28Z

I ran the Sightglass benchmarks on this and the only delta is for meshoptimizer:

execution :: cycles :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 957833251.60 ± 610441836.41 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [17612149560 17873615806.60 18098060500] new.so
  [17544919808 18831449058.20 19306513928] old.so

execution :: nanoseconds :: benchmarks-next/meshoptimizer/benchmark.wasm

  Δ = 252025449.20 ± 160632667.79 (confidence = 99%)

  new.so is 1.02x to 1.09x faster than old.so!
  old.so is 0.92x to 0.98x faster than new.so!

  [4634388228 4703136639.50 4762142132] new.so
  [4616626357 4955162088.70 5080175835] old.so

(with no impact on compilation time for any benchmark). I'm not too surprised we don't see more opportunity in Wasm benchmarks, because a lot of the RLE and store-to-load opts will have already been done by the Wasm toolchain. Regardless it seems nice to have this in, if it doesn't hurt, and it could become more applicable if/when we inline.

bjorn3 · 2022-05-19T05:56:25Z

cranelift/codegen/src/alias_analysis.rs

+//!
+//! We partition memory state into several *disjoint pieces* of
+//! "abstract state". There are a finite number of such pieces:
+//! currently, we call them "heap", "table", "vmctx", and "other".Any


It would be nice if all stack slots would be their own category until their address is leaked using stack_addr.

Sure, that would be nice, but entails an escape analysis, which is outside the scope of this PR. Happy to discuss further (and review PRs!) once this is in, of course.

Even stopping optimization entirely if stack_addr is used on a stack slot would likely help cg_clif a lot. Having each stack slot whose address is never leaked be optimized independently is the most important for cg_clif. I'm fine with leaving that for a later PR.

bjorn3 · 2022-05-19T05:59:44Z

I think you will need to add support for volatile loads and stores for this to not miscompile rust code. It would also be nice if you could state an exhaustive list of UB it depends on. For example that reading/writing memory racing with the compiled clif function is UB unless it is an atomic or volatile load/store.

bjorn3 · 2022-05-19T06:04:06Z

By the way what is the time complexity of this optinization pass?

cfallin · 2022-05-19T06:15:30Z

volatile loads/stores to not miscompile

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Otherwise removing a load and replacing it with a known value is always legal, given CLIF semantics today. (If it helps, imagine the disjoint alias categories don't exist: by default all loads/stores fall into "other", so memory locations are named by the last store of any kind.)

Atomics should use atomic ops at the CLIF level of course; I believe those should act as full barriers across all categories, as calls do. (I don't think I've actually added this logic yet; will ensure it's there tomorrow.)

time complexity of this optimization pass?

Same as any of the other iterative dataflow analyses: worst case passes over any given basic block bounded by maximum descending chain depth in the lattice. Here the last-store vector meets to "current inst" on any conflict at a merge point so the lattice is effectively three levels deep, constant; so that's O(|blocks|) to converge last-store info (likely one visit per block, rarely two). Then one more visit per block to discover and use aliases.

bjorn3 · 2022-05-19T08:59:47Z

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Volatile memory operations can't be changed as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized as rust defines data races to be UB.

bjorn3 · 2022-05-19T09:49:36Z

This is a slight regresion for the simple-raytracer benchmark of cg_clif. I would guess it increases live-range sizes causing a regalloc pessimization. Implementing #4163 (comment) would make it a beneficial optimization I think. I used to have a simple implementation of that in cg_clif which also looks at which bytes of a stack slot were accessed by the load/store to effectively allow SROA, but I removed it in bjorn3/rustc_codegen_cranelift@a793be8 as it was buggy and I didn't think I knew how to fix it. It also only operated within a single basic block. See https://github.com/bjorn3/rustc_codegen_cranelift/issues/846.

Benchmark 1: ./raytracer_cg_clif_release_alias_analysis
  Time (mean ± σ):      5.196 s ±  0.012 s    [User: 5.191 s, System: 0.004 s]
  Range (min … max):    5.169 s …  5.211 s    10 runs
 
Benchmark 2: ./raytracer_cg_clif_release_main
  Time (mean ± σ):      5.224 s ±  0.024 s    [User: 5.218 s, System: 0.005 s]
  Range (min … max):    5.197 s …  5.263 s    10 runs
 
Summary
  './raytracer_cg_clif_release_alias_analysis' ran
    1.01 ± 0.01 times faster than './raytracer_cg_clif_release_main'

cfallin · 2022-05-19T15:02:50Z

@bjorn3 can you clarify if the cg_clif frontend currently expects all loads to not be elided, or otherwise what the requirements are?

Volatile memory operations can't be changed as they may have side effects like initiating a DMA transfer or changing the state of a device. Non-volatile and non-atomic memory accesses can be optimized as rust defines data races to be UB.

OK, so if I understand correctly, cg_clif is currently compiling volatiles at the Rust level to normal loads/stores at the CLIF level? That I agree would result in miscompiles; the issue though isn't alias analysis as the semantics of load and store in CLIF have always been (even if not exploited until now) those of normal loads/stores. Adding volatile memory ops seems like a reasonable feature request though and I'd be happy to review a PR for that as well.

cfallin · 2022-05-19T15:05:40Z

@bjorn3 I may be misreading your benchmark result, but:

Benchmark 1: ./raytracer_cg_clif_release_alias_analysis
  Time (mean ± σ):      5.196 s ±  0.012 s    [User: 5.191 s, System: 0.004 s]

Benchmark 2: ./raytracer_cg_clif_release_main
  Time (mean ± σ):      5.224 s ±  0.024 s    [User: 5.218 s, System: 0.005 s]

looks like alias_analysis ran a bit faster than main? Then './raytracer_cg_clif_release_alias_analysis' ran 1.01 ± 0.01 times faster than './raytracer_cg_clif_release_main' seems to confirm -- +1%, though maybe within a margin of error.

cranelift/codegen/src/alias_analysis.rs

fitzgen

Overall looks great. I'd like to dig more into the tests before giving my sign off and I think having test alias-anlysis and comments will really help with that.

Thanks!

cranelift/codegen/src/inst_predicates.rs

cranelift/codegen/src/ir/memflags.rs

cranelift/filetests/filetests/alias/simple-alias.clif

crates/cranelift/src/func_environ.rs

cranelift/codegen/src/alias_analysis.rs

bjorn3 · 2022-05-19T22:10:56Z

looks like alias_analysis ran a bit faster than main?

Right, my bad.

Also, fix bug in which basic blocks were skipped.

cfallin · 2022-05-19T23:23:09Z

Updated, thanks!

bjorn3 · 2022-05-20T06:17:05Z

Hang on, I think I know why it didn't help much for cg_clif. Cg_clif uses stack_load and stack_store rather than the stack_addr+load/store this opt pass needs.

cfallin · 2022-05-20T06:46:30Z

@bjorn3 the alias analysis / redundant-load pass runs after legalization, so it should see plain loads/stores rather than stack_load/stack_store, I think.

fitzgen

Thanks! A whole lot easier to review with test alias-analysis :)

Looks great. A few nitpicks/optional suggestions below. r=me with everything addressed/considered.

cranelift/filetests/filetests/alias/simple-alias.clif

cranelift/codegen/src/alias_analysis.rs

pepyakin · 2022-06-16T16:36:54Z

FWIW, I've noticed around ≈15% regression and bisection points on this commit.

the results for 0824abb:

./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      7.098 s ±  0.033 s    [User: 7.080 s, System: 0.015 s]
  Range (min … max):    7.018 s …  7.146 s    10 runs

previous good, 89ccc56:

./target/release/wasmtime compile -g -O wasmibox.wasm && , hyperfine --show-output --warmup 2 "./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm"
Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.071 s ±  0.039 s    [User: 6.052 s, System: 0.016 s]
  Range (min … max):    6.024 s …  6.153 s    10 runs

This particular benchmark is a standalone wasm file, i.e. no imports. The entry point loads a test wasm binary and interprets it with compiled in wasmi. The test binary is also standalone. The test wasm generates random numbers using xorshift and takes keccak hash out of it. I can publish the wasm or the source to get the wasm.

UPD:

uname: Linux hetzner 5.10.78 #1-NixOS SMP Sat Nov 6 13:10:10 UTC 2021 x86_64 GNU/Linux
CPU: AMD Ryzen 9 5950X 16-Core Processor

pepyakin · 2022-06-16T17:15:10Z

Well, it gets weird.

If I apply the following change on top of 0824abb

diff --cc crates/wasmtime/src/config.rs
index 9c1ed87ca,9c1ed87ca..6e9bcd1f7
--- a/crates/wasmtime/src/config.rs
+++ b/crates/wasmtime/src/config.rs
@@@ -1287,6 -1287,6 +1287,7 @@@ fn compiler_builder(strategy: Strategy
              },
          )
          .unwrap();
++    builder.set("enable_pinned_reg", "true").unwrap();
      Ok(builder)
  }

diff --cc crates/wasmtime/src/engine.rs
index c81ae6e67,c81ae6e67..cc150af3a
--- a/crates/wasmtime/src/engine.rs
+++ b/crates/wasmtime/src/engine.rs
@@@ -312,7 -312,7 +312,7 @@@ impl Engine
              "baldrdash_prologue_words" => *value == FlagValue::Num(0),
              "enable_llvm_abi_extensions" => *value == FlagValue::Bool(false),
              "emit_all_ones_funcaddrs" => *value == FlagValue::Bool(false),
--            "enable_pinned_reg" => *value == FlagValue::Bool(false),
++            "enable_pinned_reg" => *value == FlagValue::Bool(true),
              "enable_probestack" => *value == FlagValue::Bool(false),
              "use_colocated_libcalls" => *value == FlagValue::Bool(false),
              "use_pinned_reg_as_heap_base" => *value == FlagValue::Bool(false),

then the result would be like this:

Benchmark 1: ./target/release/wasmtime run -g -O --allow-precompiled --invoke wasm_kernel_run wasmibox.cwasm
  Time (mean ± σ):      6.111 s ±  0.050 s    [User: 6.094 s, System: 0.014 s]
  Range (min … max):    6.044 s …  6.226 s    10 runs

i.e. the regression disappears.

I'm confused because not sure how that could possibly be related.

cfallin · 2022-06-17T20:41:01Z

@pepyakin I'm happy to take a look at this next week (I'm out of office now); can you publish the .wasm somewhere in the meantime?

pepyakin · 2022-06-18T10:16:44Z

Sure!

wasmibox.wasm.gz

cfallin · 2022-06-28T21:43:15Z

@pepyakin I'm just getting to this now (sorry for the delay!) and I'm actually seeing a speedup from alias analysis, consistently, of about 6% (6.2s to 5.8s):

[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.22s user 0.01s system 99% cpu 6.246 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.84s user 0.01s system 99% cpu 5.856 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.19s user 0.01s system 99% cpu 6.216 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.85s user 0.01s system 99% cpu 5.868 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_no_aa.cwasm --invoke  6.22s user 0.01s system 99% cpu 6.242 total
[cfallin@xap]~/work/wasmtime% time target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke wasm_kernel_run
target/release/wasmtime run --allow-precompiled wasmibox_aa.cwasm --invoke   5.77s user 0.01s system 99% cpu 5.792 total

These were built with current main (c1b3962f7b9fc1c193e1b9709db64c455699a295) with this patch to make alias analysis / redundant-load elimination optional; then

$ wasmtime compile --cranelift-set opt_level=speed --cranelift-set enable_alias_analysis=true wasmibox.wasm -o wasmibox_aa.cwasm
$ wasmtime compile --cranelift-set opt_level=speed --cranelift-set enable_alias_analysis=false wasmibox.wasm -o wasmibox_no_aa.cwasm

and I double-checked I wasn't swapping the two. Looking at a diff of the disassemblies, it's sort of what I expect: a few extra loads are removed and carried in registers instead, and this I would indeed expect to improve performance, as it does.

Can you confirm you're still seeing a slowdown, not speedup, and if so at which commit and anything else about your environment and measurements?

…elimination. This allows for experiments as in here [1] and also generally gives an option to anyone who is concerned that the extra optimization may be counterproductive or take too much time. The optimization remains enabled by default. [1] bytecodealliance#4163 (comment)

…elimination. (#4349) This allows for experiments as in here [1] and also generally gives an option to anyone who is concerned that the extra optimization may be counterproductive or take too much time. The optimization remains enabled by default. [1] #4163 (comment)

…elimination. (bytecodealliance#4349) This allows for experiments as in here [1] and also generally gives an option to anyone who is concerned that the extra optimization may be counterproductive or take too much time. The optimization remains enabled by default. [1] bytecodealliance#4163 (comment)

pepyakin · 2022-07-04T10:43:08Z

With the following script:

./wasmtime-0824abbae compile -g -O wasmibox.wasm -o wasmibox-0824abbae.cwasm
./wasmtime-89ccc56e4 compile -g -O wasmibox.wasm -o wasmibox-89ccc56e4.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=true -o wasmibox-a2197ebbe-aa.cwasm
./wasmtime-a2197ebbe compile -g -O wasmibox.wasm --cranelift-set enable_alias_analysis=false -o wasmibox-a2197ebbe-no-aa.cwasm

hyperfine --show-output --warmup 2 "./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2  "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run"
hyperfine --show-output --warmup 2 "./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run"```

I am getting the following results:

```shell
++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run -g --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      6.887 s ±  0.046 s    [User: 6.872 s, System: 0.013 s]
  Range (min … max):    6.848 s …  6.990 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run -g --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      5.941 s ±  0.024 s    [User: 5.927 s, System: 0.013 s]
  Range (min … max):    5.914 s …  5.978 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.666 s ±  0.037 s    [User: 4.651 s, System: 0.013 s]
  Range (min … max):    4.627 s …  4.743 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run -g --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.524 s ±  0.133 s    [User: 4.510 s, System: 0.013 s]
  Range (min … max):    4.472 s …  4.898 s    10 runs

It confirms that on my machine the pre-AA commit performs considerably better than the commit when AA was introduced. The current main performs better than both of them regardless if AA is enabled or not. No AA performs better than with AA on main though.

However, if I remove the -g flag then the whole picture changes:

++ hyperfine --show-output --warmup 2 './wasmtime-0824abbae run --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-0824abbae run --allow-precompiled -O wasmibox-0824abbae.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      5.826 s ±  0.082 s    [User: 5.815 s, System: 0.010 s]
  Range (min … max):    5.759 s …  6.040 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-89ccc56e4 run --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-89ccc56e4 run --allow-precompiled -O wasmibox-89ccc56e4.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      6.600 s ±  0.073 s    [User: 6.589 s, System: 0.009 s]
  Range (min … max):    6.543 s …  6.722 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.375 s ±  0.013 s    [User: 4.364 s, System: 0.010 s]
  Range (min … max):    4.364 s …  4.406 s    10 runs

++ hyperfine --show-output --warmup 2 './wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run'
Benchmark 1: ./wasmtime-a2197ebbe run --allow-precompiled -O wasmibox-a2197ebbe-no-aa.cwasm --invoke wasm_kernel_run
  Time (mean ± σ):      4.721 s ±  0.016 s    [User: 4.711 s, System: 0.009 s]
  Range (min … max):    4.702 s …  4.746 s    10 runs

Now, it starts to make more sense, although I still don't understand how that could be possibly related.

cfallin · 2022-07-05T05:42:29Z

OK, I agree that it doesn't make any sense that -g would make the difference here. I suspect either some odd measurement effect or some pessimization caused by debuginfo registration or something like that. Given that (i) the normal config behaves as expected (alias analysis produces a nontrivial speedup), (ii) codegen looks as expected on manual examination (diff of no-AA vs AA in default config on main), and (iii) debuginfo in general is a big mess that needs more focused attention, I don't think I'm able to justify spending significantly more time looking into this; but if you find anything more please let us know.

cfallin requested a review from fitzgen May 18, 2022 23:58

cfallin force-pushed the alias-analysis branch from 2ebd8f8 to d7f8ff9 Compare May 18, 2022 23:58

github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift:wasm labels May 19, 2022

bjorn3 reviewed May 19, 2022

View reviewed changes

cfallin added 2 commits May 19, 2022 09:45

Treat atomic ops and memory fences as barriers for alias analysis

c423f27

Some comment updates.

0e339d0

cfallin marked this pull request as ready for review May 19, 2022 18:44

lqd reviewed May 19, 2022

View reviewed changes

cranelift/codegen/src/alias_analysis.rs Outdated Show resolved Hide resolved

fitzgen reviewed May 19, 2022

View reviewed changes

Fix typo.

7c00174

cfallin added 2 commits May 19, 2022 15:41

Additional tests.

836b523

Also, fix bug in which basic blocks were skipped.

Address review feedback.

20e683a

This was referenced May 19, 2022

Cranelift: alias analysis: track each individual table/heap separately #4166

Open

Cranelift: eliminate dead stores #4167

Open

fitzgen approved these changes May 20, 2022

View reviewed changes

cranelift/filetests/filetests/alias/simple-alias.clif Show resolved Hide resolved

cranelift/codegen/src/alias_analysis.rs Show resolved Hide resolved

cranelift/codegen/src/alias_analysis.rs Show resolved Hide resolved

cranelift/codegen/src/alias_analysis.rs Outdated Show resolved Hide resolved

Address review comments.

a340036

cfallin merged commit 0824abb into bytecodealliance:main May 20, 2022

cfallin deleted the alias-analysis branch May 20, 2022 20:19

This was referenced May 27, 2022

Cranelift: develop a mid-end optimization framework with passes in ISLE #4128

Closed

wasmtime 0.37 has much smaller call stack limit than 0.36 #4214

Closed

cfallin mentioned this pull request Jun 28, 2022

Cranelift: add a config option for alias analysis and redundant-load elimination. #4349

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a basic alias analysis with redundant-load elim and store-to-load fowarding opts. #4163

Add a basic alias analysis with redundant-load elim and store-to-load fowarding opts. #4163

cfallin commented May 18, 2022

fitzgen commented May 19, 2022

cfallin commented May 19, 2022 •

edited

Loading

bjorn3 May 19, 2022

cfallin May 19, 2022

bjorn3 May 19, 2022

bjorn3 commented May 19, 2022

bjorn3 commented May 19, 2022

cfallin commented May 19, 2022

bjorn3 commented May 19, 2022

bjorn3 commented May 19, 2022

cfallin commented May 19, 2022

cfallin commented May 19, 2022

fitzgen left a comment

bjorn3 commented May 19, 2022

cfallin commented May 19, 2022

bjorn3 commented May 20, 2022

cfallin commented May 20, 2022

fitzgen left a comment

pepyakin commented Jun 16, 2022 •

edited

Loading

pepyakin commented Jun 16, 2022 •

edited

Loading

cfallin commented Jun 17, 2022

pepyakin commented Jun 18, 2022

cfallin commented Jun 28, 2022

pepyakin commented Jul 4, 2022

cfallin commented Jul 5, 2022

Add a basic alias analysis with redundant-load elim and store-to-load fowarding opts. #4163

Add a basic alias analysis with redundant-load elim and store-to-load fowarding opts. #4163

Conversation

cfallin commented May 18, 2022

fitzgen commented May 19, 2022

cfallin commented May 19, 2022 • edited Loading

bjorn3 May 19, 2022

Choose a reason for hiding this comment

cfallin May 19, 2022

Choose a reason for hiding this comment

bjorn3 May 19, 2022

Choose a reason for hiding this comment

bjorn3 commented May 19, 2022

bjorn3 commented May 19, 2022

cfallin commented May 19, 2022

bjorn3 commented May 19, 2022

bjorn3 commented May 19, 2022

cfallin commented May 19, 2022

cfallin commented May 19, 2022

fitzgen left a comment

Choose a reason for hiding this comment

bjorn3 commented May 19, 2022

cfallin commented May 19, 2022

bjorn3 commented May 20, 2022

cfallin commented May 20, 2022

fitzgen left a comment

Choose a reason for hiding this comment

pepyakin commented Jun 16, 2022 • edited Loading

pepyakin commented Jun 16, 2022 • edited Loading

cfallin commented Jun 17, 2022

pepyakin commented Jun 18, 2022

cfallin commented Jun 28, 2022

pepyakin commented Jul 4, 2022

cfallin commented Jul 5, 2022

cfallin commented May 19, 2022 •

edited

Loading

pepyakin commented Jun 16, 2022 •

edited

Loading

pepyakin commented Jun 16, 2022 •

edited

Loading