
cranelift-wasm: Add a bounds-checking optimization for dynamic memories and guard pages #6031

Merged

Conversation


@fitzgen fitzgen commented Mar 16, 2023

This is a new special case for when we know that there are enough guard pages to cover the memory access's offset and access size.

The precise should-we-trap condition is

index + offset + access_size > bound

However, if we instead check only the partial condition

index > bound

then the most out of bounds that the access can be, while that partial check still succeeds, is `offset + access_size`.

However, when we have a guard region that is at least as large as `offset + access_size`, we can rely on the virtual memory subsystem handling these out-of-bounds errors at runtime. Therefore, the partial `index > bound` check is sufficient for this heap configuration.
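The applicability condition can be sanity-checked with a small Rust sketch (a hypothetical helper, not Cranelift's actual code): the partial check is safe exactly when the guard region covers the worst-case overshoot.

```rust
/// Hypothetical sketch: the partial `index > bound` check is sufficient
/// whenever the guard region covers the worst-case overshoot of
/// `offset + access_size` bytes past the bound.
fn partial_check_suffices(offset: u64, access_size: u64, guard_size: u64) -> bool {
    // Use checked addition so `offset + access_size` cannot wrap around.
    match offset.checked_add(access_size) {
        Some(overshoot) => overshoot <= guard_size,
        None => false,
    }
}
```

For example, a 4-byte load at static offset 16 is covered by a 64 KiB guard region, while a load whose offset alone exceeds the guard region is not.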

Additionally, this has the advantage that a series of Wasm loads that use the same dynamic index operand but different static offset immediates -- which is a common code pattern when accessing multiple fields in the same struct that is in linear memory -- will all emit the same `index > bound` check, which we can GVN.
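As a toy illustration (a hypothetical Rust model, not the generated code), several accesses off the same dynamic index can share a single comparison:

```rust
/// Hypothetical model of the shared check: three loads with the same dynamic
/// `index` but different constant offsets (all covered by the guard region)
/// need only one `index > bound` comparison between them.
fn load_struct_fields(heap: &[u8], bound: usize, index: usize) -> Option<(u8, u8, u8)> {
    // Single bounds check, shared (GVN'd) by all three loads below.
    if index > bound {
        return None;
    }
    // Offsets 0, 4, and 8 are assumed to fit within the guard region.
    Some((heap[index], heap[index + 4], heap[index + 8]))
}
```

Here `heap` is assumed to extend past `bound` by at least the largest `offset + access_size`, modeling the guard region.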

Only partially, though: the bounds-check comparison is GVN'd, but we still branch on values that we should know are always true if we get this far in the code. These are actual `br_if`s in the non-Spectre code, and `select_spectre_guard`s that we should know will always select the same way when Spectre mitigations are enabled. See the second commit for examples.

Improving the non-Spectre case is pretty straightforward: walk the dominator tree and remember which values we've already branched on at this point, and therefore we can simplify any further conditional branches on those same values into direct jumps.
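A minimal sketch of that idea (hypothetical, not an actual Cranelift pass), using plain integers in place of `ir::Value`:

```rust
use std::collections::HashSet;

/// Tracks condition values already branched on along the current
/// dominator-tree path.
struct KnownConditions {
    seen: HashSet<u32>, // stand-in for `ir::Value`
}

impl KnownConditions {
    fn new() -> Self {
        Self { seen: HashSet::new() }
    }

    /// Record a `br_if` on `cond`. Returns true if a dominating branch
    /// already tested this value, meaning this branch can be simplified
    /// into a direct jump.
    fn visit_branch(&mut self, cond: u32) -> bool {
        // `insert` returns false when the value was already present.
        !self.seen.insert(cond)
    }
}
```

A real pass would also need to discard recorded conditions when leaving a dominator subtree, and to track which way each branch went; that bookkeeping is omitted here.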

Improving the Spectre case requires something that is morally the same, but has a couple snags:

  • We don't have actual br_ifs to determine whether the bounds checking condition succeeded or not. We need to instead reason about dominating select_spectre_guard; {load, store} instruction pairs.

  • We have to be SUPER careful about reasoning "through" select_spectre_guards. Our general rule is never to do that, since it could break the speculative execution sandboxing that the instruction is designed for.
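For concreteness, here is a rough Rust model of the guarded-address pattern (names hypothetical; real codegen lowers the selection to a branchless conditional move, which is exactly why reasoning "through" it is dangerous):

```rust
/// Rough model of `select_spectre_guard`: on an out-of-bounds index the
/// computed address is replaced with 0 (a trapping address), so even
/// misspeculated loads cannot read attacker-chosen memory. The `if` below
/// stands in for a conditional move; real codegen emits no branch here.
fn spectre_guarded_addr(base: u64, index: u64, bound: u64) -> u64 {
    let addr = base.wrapping_add(index);
    if index > bound { 0 } else { addr }
}
```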

This pull request leaves implementing these new optimization passes for follow ups.


fitzgen commented Mar 16, 2023

FWIW: I looked over the wasm filetest changes pretty closely, but did not go over the ISA-specific versions of those same tests in much detail.


fitzgen commented Mar 16, 2023

Going to gather sightglass numbers now...


fitzgen commented Mar 16, 2023

Looks like this gives not just execution speed ups, but also compilation speed ups (presumably due to processing less IR):

$ sightglass benchmark -e main.so -e opt.so --engine-flags="--static-memory-maximum-size=0 --dynamic-memory-guard-size=65536" -m insts-retired -- benchmarks/default.suite 
execution :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 354918771.70 ± 209612.26 (confidence = 99%)

  opt.so is 1.08x to 1.08x faster than main.so!

  [4757395008 4757613072.90 4757854325] main.so
  [4402584769 4402694301.20 4403032801] opt.so

execution :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 2246146.90 ± 322.54 (confidence = 99%)

  opt.so is 1.07x to 1.07x faster than main.so!

  [35406698 35406850.20 35407505] main.so
  [33160532 33160703.30 33161229] opt.so

execution :: instructions-retired :: benchmarks/bz2/benchmark.wasm

  Δ = 20037307.30 ± 5.37 (confidence = 99%)

  opt.so is 1.05x to 1.05x faster than main.so!

  [402201103 402201106.00 402201111] main.so
  [382163791 382163798.70 382163803] opt.so

compilation :: instructions-retired :: benchmarks/pulldown-cmark/benchmark.wasm

  Δ = 80907.10 ± 30178.64 (confidence = 99%)

  opt.so is 1.01x to 1.01x faster than main.so!

  [9350309 9389659.00 9444444] main.so
  [9276889 9308751.90 9338252] opt.so

compilation :: instructions-retired :: benchmarks/spidermonkey/benchmark.wasm

  Δ = 1534345.20 ± 211755.93 (confidence = 99%)

  opt.so is 1.01x to 1.01x faster than main.so!

  [218580786 218834739.10 219048529] main.so
  [216969499 217300393.90 217501790] opt.so

@github-actions github-actions bot added the cranelift Issues related to the Cranelift code generator label Mar 16, 2023
@alexcrichton alexcrichton left a comment


Seems reasonable to me, but I think it'd be good to get one more approval on this PR as well.

Also, FWIW, I find the tests sort of unhelpful to review since there are so many (100+ files). I think you added some tests here, but I only ran into them by happenstance when scrolling through. I realize the full matrix is useful in the limit, but perhaps in a future PR the test suite could be trimmed down to something more easily reviewable?

Comment on lines +138 to +180
// 2.a. Emit a Spectre-safe `index > bound` check: the bounds comparison
// feeds a `select_spectre_guard` instead of an explicit branch.
HeapStyle::Dynamic { bound_gv }
if offset_and_size <= heap.offset_guard_size && spectre_mitigations_enabled =>
{
let bound = builder.ins().global_value(env.pointer_type(), bound_gv);
Reachable(compute_addr(
&mut builder.cursor(),
heap,
env.pointer_type(),
index,
offset,
Some(SpectreOobComparison {
cc: IntCC::UnsignedGreaterThan,
lhs: index,
rhs: bound,
}),
))
}

// 2.b. Emit explicit `index > bound` check.
HeapStyle::Dynamic { bound_gv } if offset_and_size <= heap.offset_guard_size => {
let bound = builder.ins().global_value(env.pointer_type(), bound_gv);
let oob = builder.ins().icmp(IntCC::UnsignedGreaterThan, index, bound);
builder.ins().trapnz(oob, ir::TrapCode::HeapOutOfBounds);
Reachable(compute_addr(
&mut builder.cursor(),
heap,
env.pointer_type(),
index,
offset,
None,
))
}
Member


I know at the top of this match it says that some duplication is necessary, but I personally found it a bit confusing to have the spectre/non-spectre cases split apart. Would it be possible, for dynamic heaps, to fuse these two arms together with a helper?

Member Author


Yeah I was thinking of doing this in a follow up. That cool?

cranelift/wasm/src/code_translator/bounds_checks.rs (review thread on an outdated diff; resolved)
@alexcrichton
Member

Hm, ok, well I also see now that a test failed: something that should trap didn't. That means I missed something in review, so I'm much less confident in my review now.

@cfallin cfallin left a comment


The actual bounds-check logic looks correct to me; modulo perhaps the overflow issue that Alex points to.

The failing test has this comment:

    // Technically this is out of bounds...
    assert!(i32_load.call(&mut store, page).is_err());
    // ... but implementation-wise it should still be mapped memory from before.
    // Note though that prior writes should all appear as zeros and we can't see
    // data from the prior instance.

which makes me think that the configuration in that test is a lie (we indicate that a guard region is present when it actually is not), or else the pooling allocator isn't respecting the requested offset-guard size for some reason...

@alexcrichton

Ah yeah, after digging in I believe that's correct. The configuration requires dynamic memories to have a 64k guard page region, but the pooling allocator in this configuration specifically isn't decommitting memory past the end to avoid syscalls, which means that there are actually no guard pages at all upon reinstantiation. That's ok with today's implementation of bounds checks, which doesn't try to exploit the fact that the arithmetic can be simplified (e.g. this PR).

This is a tough problem, though, because avoiding decommits to save on syscalls/contention was one of the assumptions behind dynamic memory and the pooling allocator. It's making me realize that if we want to take advantage of guard pages on dynamic memories (which I suspect we do, since it should drastically simplify the analysis required to handle multiple bounds checks), then that decommit-skipping optimization isn't possible.

The offending lines, I think, are here; those need to be executed unconditionally if the guard size for memory is more than 0 bytes, which I think is basically always the case.
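In other words, the soundness invariant being described can be sketched like this (a hypothetical helper, not Wasmtime's actual allocator code):

```rust
/// Hypothetical invariant: the pooling allocator may skip decommitting pages
/// past the accessible bound on reuse only when the configured guard size is
/// zero. Otherwise the compiler may have elided bounds checks that rely on
/// those pages being unmapped, and stale committed pages would break that
/// assumption.
fn may_skip_decommit_past_bound(guard_size: u64) -> bool {
    guard_size == 0
}
```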


fitzgen commented Mar 16, 2023

    The offending lines, I think, are here; those need to be executed unconditionally if the guard size for memory is more than 0 bytes, which I think is basically always the case.

Thanks! I was digging into the test but hadn't got that far before I was interrupted. Always nice when someone debugs the issue for you :)

@fitzgen fitzgen force-pushed the dynamic-memories-with-guard-regions branch from 41647a3 to b0e2e63 Compare March 17, 2023 02:11
tests/all/pooling_allocator.rs (review thread on an outdated diff; resolved)
@fitzgen fitzgen force-pushed the dynamic-memories-with-guard-regions branch from b0e2e63 to 66b4973 Compare March 17, 2023 18:02
@fitzgen fitzgen enabled auto-merge March 17, 2023 18:02