Fiber switching for WebAssembly #13107

Open. Wants to merge 29 commits into base: master.
Conversation

@lbguilherme (Contributor)

This PR is the base work needed to support the GC and the event loop on Wasm. With these changes, fibers are now usable and you can play with channels, as long as you don't do IO or sleep inside a fiber.

This PR requires some explanation since stack switching on WebAssembly is very different from other platforms.

WebAssembly, as built by LLVM, has three stacks:

  • A value stack for low-level operations. Wasm is a stack machine; for example, to write data to memory you first push the target address, then the value, then perform the 'store' instruction.
  • A stack of private frames. Each function has its own frame, accessed with the local.get and local.set instructions. These frames are private and can't be manipulated from outside the function.
  • A shadow stack in main memory. It is controlled by a __stack_pointer global and grows down. This is not a feature of WebAssembly, just a convention that LLVM uses.

We need to save and restore all three stacks for this to work.
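
To make the third point concrete, here is a tiny Crystal example (illustrative only, not from the PR): a local whose address is taken cannot live in a private wasm local, so LLVM spills it to the shadow stack in linear memory, at least in unoptimized builds, and addresses it relative to __stack_pointer.

def shadow_stack_example : Int32
  x = 42
  ptr = pointerof(x) # taking the address forces x onto the shadow stack
  ptr.value + 1
end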

There is an ongoing proposal to add stack switching to WebAssembly, but as of this writing it is not standardized and isn't implemented by any runtime, and I believe it will still take a few years before it is supported everywhere. At this point the proposal is essentially an "optimization" of what Asyncify already does. See details at https://github.com/WebAssembly/stack-switching.

Here we use an alternative implementation on top of Binaryen's Asyncify pass. It is a static transformation of the WebAssembly program that must be performed after the Crystal program is compiled. It rewrites every relevant function so that it can spill its stack frame to memory and unwind to its caller, and later restore that frame from memory when rewinding. It is all controlled by a global state variable. Please read more details at https://github.com/WebAssembly/binaryen/blob/main/src/passes/Asyncify.cpp.

Stack switching is accomplished by the cooperation between two functions: Fiber.swapcontext and Fiber.wrap_with_fibers. Before they are explained, the memory layout must be understood:

+--------+--------+------------------+----------------
| Unused | Data   | Main stack       | Heap ...
+--------+--------+------------------+----------------
|        |        |                  |
0    __data_start |              __heap_base
             __data_end        __stack_pointer

The first 1024 bytes of memory are unused (but they are usable, including address 0!). Next comes the static data, such as constant strings. On other platforms this would be read-only data, but here it is writable. The special symbols __data_start and __data_end are generated by LLVM to mark this region. Then comes the main stack. The __heap_base symbol points to the base of this stack (its highest address, since the stack grows down). The global __stack_pointer starts at this position and is moved by the program during execution. After that comes the rest of the memory, used mainly by malloc. It is important to note that during execution everything is just memory; there is no real difference between these sections.
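
For instance, this (illustrative) read of address 0 does not trap on wasm, because address 0 is ordinary linear memory inside the unused region:

byte_at_zero = Pointer(UInt8).new(0_u64).value # would fault on native targets, reads a byte here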

For our implementation of stack switching we manage this memory layout on every stack:

+--------+---------------------------+------------------------+
| Header | Asyncify buffer =>        |               <= Stack |
+--------+---------------------------+------------------------+

The header stores 4 pointers as metadata:

  • [0] => A function pointer to fiber_main
  • [1] => A pointer to the Fiber instance
  • [2] => The current position in the asyncify buffer (starts right after the header, grows up)
  • [3] => The end position of the asyncify buffer (the current stack top)

On the main stack all of them are guaranteed to start null.
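
As a minimal sketch (the helper and its names are illustrative assumptions, not the PR's actual API), initializing this header for a freshly allocated fiber stack could look roughly like this:

# Write the 4-word header at the base (lowest address) of a fiber stack.
def init_stack_header(stack_bottom : Pointer(Void), fiber_main : Pointer(Void),
                      fiber : Pointer(Void), stack_top : Pointer(Void)) : Nil
  header = stack_bottom.as(Pointer(Pointer(Void)))
  header[0] = fiber_main                     # [0] function pointer to fiber_main
  header[1] = fiber                          # [1] pointer to the Fiber instance
  header[2] = (header + 4).as(Pointer(Void)) # [2] asyncify buffer position: right after the header
  header[3] = stack_top                      # [3] asyncify buffer end: the current stack top
end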

Stack switching happens as follows:

  1. CrystalMain calls Fiber.wrap_with_fibers, passing a proc to the main function.
  2. Fiber.wrap_with_fibers calls this proc immediately.
  3. At some point a new Fiber is created and Fiber.swapcontext is called. It will:
    a. store the new context in Fiber.@@next_context.
    b. store the current __stack_pointer global in context.stack_top.
    c. update the current position of the asyncify buffer to be just after the header.
    d. update the end position of the asyncify buffer to be the current stack top.
    e. set Fiber.is_manipulating_stack_with_asyncify = true.
    f. begin stack unwinding with LibAsyncify.start_unwind() and return.
  4. As a consequence of the Asyncify transformation, every function on the call stack now behaves differently: instead of executing, it writes its local frame to the asyncify buffer and returns.
  5. At some point execution arrives back at Fiber.wrap_with_fibers. We know we are unwinding because Fiber.is_manipulating_stack_with_asyncify is set, which means we have to either start a new fiber or rewind into a previously suspended one. If there is an asyncify buffer, set up the rewinding process, then call into the fiber's main function. If it is null, this is the main fiber: just call the original block again. (A rough sketch of this trampoline follows below.)
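
To make steps 3 and 5 a bit more concrete, here is a rough Crystal sketch of the unwind/re-entry dance. Everything below (the LibAsyncify binding, the class variables, the header indices, the Proc trick for fiber_main) is an illustrative assumption based on the description above, not the PR's actual code; in the real implementation the asyncify functions are imported from the "asyncify" wasm module so that Binaryen's pass can recognize and rewrite the calls.

lib LibAsyncify
  # `data` points to a {current position, end position} pair of 32-bit fields,
  # which in this design lives at header slots [2] and [3].
  fun start_unwind(data : Pointer(Void))
  fun stop_unwind
  fun start_rewind(data : Pointer(Void))
  fun stop_rewind
end

module FiberSwitchSketch
  HEADER_WORDS = 4

  @@next_context = Pointer(Pointer(Void)).null
  @@unwinding = false

  # Step 3: suspend the running code by unwinding every live frame into the
  # current stack's asyncify buffer.
  def self.swapcontext(current_header : Pointer(Pointer(Void)), new_header : Pointer(Pointer(Void))) : Nil
    @@next_context = new_header
    local = uninitialized UInt8
    current_header[3] = pointerof(local).as(Pointer(Void))                 # stack top (the real code reads __stack_pointer)
    current_header[2] = (current_header + HEADER_WORDS).as(Pointer(Void))  # buffer starts right after the header
    @@unwinding = true
    LibAsyncify.start_unwind((current_header + 2).as(Pointer(Void)))
    # From here on, every transformed caller spills its frame and returns,
    # all the way up to wrap_with_fibers.
  end

  # Step 5: the trampoline that catches each unwind and enters the next context.
  def self.wrap_with_fibers(&block : ->) : Nil
    block.call
    while @@unwinding
      LibAsyncify.stop_unwind
      @@unwinding = false
      header = @@next_context
      if fiber_main = header[0] # null on the main stack
        # Replay the fiber's saved frames if it was suspended before, then
        # enter fiber_main through its function pointer. (stop_rewind on the
        # resumed side and the brand-new-fiber case are elided here.)
        LibAsyncify.start_rewind((header + 2).as(Pointer(Void))) unless header[2].null?
        Proc(Pointer(Void), Nil).new(fiber_main, Pointer(Void).null).call(header[1])
      else
        # All header slots start null on the main stack: re-enter the block.
        block.call
      end
    end
  end
end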

For all of this to work we need to run the asyncify pass on the compiler output after linking. I modified the compiler to do just that using Binaryen's wasm-opt. While I was at it, I also enabled running Binaryen optimizations when building with --release; those should improve the output size a lot.
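
As an illustration of that post-link step (a sketch under assumptions, not the PR's actual code; the exact wasm-opt flags and error handling may differ), it boils down to shelling out to wasm-opt, with the asyncify pass and the optimizations as separate invocations:

def run_binaryen(wasm_file : String, release : Bool) : Nil
  # Apply Binaryen's Asyncify transformation to the fully linked module.
  status = Process.run("wasm-opt", ["--asyncify", wasm_file, "-o", wasm_file],
                       output: STDOUT, error: STDERR)
  raise "wasm-opt --asyncify failed" unless status.success?

  # Optionally run Binaryen's generic optimizations to shrink the output.
  if release
    status = Process.run("wasm-opt", ["-O", wasm_file, "-o", wasm_file],
                         output: STDOUT, error: STDERR)
    raise "wasm-opt -O failed" unless status.success?
  end
end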


Also, it seems something is broken when importing custom wasm modules and building without single_module: the import module name disappears. I'll open an issue for that, but in the meantime I added a workaround to always build wasm as a single module. This is required to "link" against asyncify.

In the future the GC will work by unwinding and then rewinding the same fiber again, so that it is able to scan pointers on the asyncify buffer.

@Sija (Contributor) left a comment

Couple of stylistic suggestions.

lbguilherme and others added 2 commits February 23, 2023 11:49
Co-authored-by: Sijawusz Pur Rahnama <sija@sija.pl>
@lbguilherme (Contributor, Author)

Thanks! Too much time not programming in a beautiful language, I guess 😂

@lbguilherme (Contributor, Author) commented Feb 27, 2023

Turns out LLVM produces some really repetitive code when not optimizing.

The generated wasm code for String::Grapheme::codepoints has 34151 local variables without --release, but only 2 locals when optimizing. This is only a problem because Cranelift (wasmtime's compiler) is unable to handle functions that are "too large". If we take the same wasm file and run it with Node.js (using V8), it runs fine, so the file is valid and there is nothing wrong on our side.

We can either run the spec with V8 or optimize before running with wasmtime. As wasmtime is such a popular runtime for wasm, I opted for optimizing before executing.

This only became a problem with this PR because the asyncify pass increases the generated code size even more and it crossed some limit.

@HertzDevil (Contributor)

So does that mean the workaround in #4516 (comment) somehow does not work when targeting WASM?

@lbguilherme (Contributor, Author)

> So does that mean the workaround in #4516 (comment) somehow does not work when targeting WASM?

Yes, it still generates too much code. Also, compiling the specs in release mode takes 20s on my machine. But without optimizations it takes 50s! I didn't check, but these huge functions are most likely the cause.

Using static data embedded in a string literal would do wonders here. #4516 (comment)

@@ -51,7 +51,7 @@ jobs:
       run: make crystal # TODO: Remove this after next version update

     - name: Build spec/wasm32_std_spec.cr
-      run: bin/crystal build spec/wasm32_std_spec.cr -o wasm32_std_spec.wasm --target wasm32-wasi --release
+      run: bin/crystal build spec/wasm32_std_spec.cr -o wasm32_std_spec.wasm --target wasm32-wasi -Duse_pcre --release

(Member)
The wasm lib for PCRE2 should be available since #13109. Why -Duse_pcre?

(Member)
I was just merging master, so this is what's in master today

(Member)
Oh, but that seems to be a mistake. Apparently 46a6ac7 unintentionally merged it back in.
I think we can just remove it here?

(Member)
wouldn't it make more sense to fix master and re-merge?

(Member)
I guess, maybe. Don't think it matters much, though.

@beta-ziliani (Member)

Just stating the obvious: wasm specs are failing.

Error: failed to run main module `wasm32_std_spec.wasm`

Caused by:
    0: failed to invoke command default
    1: wasm trap: out of bounds memory access
       wasm backtrace:
           0: 0x6c588a - <unknown>!dlmalloc
           1: 0x6c4d19 - <unknown>!malloc
           2: 0x134ea3 - <unknown>!*PrettyPrint::GroupQueue#enq<PrettyPrint::Group>:Array(PrettyPrint::Group)
           3: 0x5a64f8 - <unknown>!~procProc(Nil)@spec/std/pretty_print_spec.cr:247
           4: 0xfd3b8 - <unknown>!*Spec::Example#internal_run<Time::Span, Proc(Nil)>:(Array(Spec::Result) | Nil)
           5: 0xdbb95 - <unknown>!*Spec::Example#run:Nil
           6: 0xdc49f - <unknown>!*Spec::ExampleGroup#run:Nil
           7: 0x3a803 - <unknown>!~procProc(Int32, (Exception | Nil), Nil)@src/spec/dsl.cr:208
           8: 0xae689 - <unknown>!main
           9: 0xaf264 - <unknown>!~procProc(Int32)@src/crystal/system/wasi/main.cr:32
          10: 0xaf35c - <unknown>!*Fiber::wrap_with_fibers<&Proc(Int32)>:Int32
          11: 0xaf1ee - <unknown>!main.1
          12: 0x6c84fc - <unknown>!__main_void
          13: 0xaf135 - <unknown>!_start
       note: using the `WASMTIME_BACKTRACE_DETAILS=1` environment variable may show more debugging information

@lbguilherme (Contributor, Author) commented Mar 16, 2023

I did a significant refactoring of this PR to make it safer and more organized. Relevant changes:

  • Created the Crystal::Asyncify module. It holds the stack-switching logic and keeps everything more organized. It lives in its own module because it will also be used by the exception handling runtime (see WebAssembly Exceptions #13130) and by the GC.
  • Previously, when --release was enabled, I ran the Binaryen optimization passes together with the asyncify pass. That is problematic because the asyncify runtime functions must not be inlined before the transformation is applied. I have now split this into two steps: first apply the asyncify pass, then optimize. That should be safe now.
  • I'm now using Wasmer's singlepass compiler to run WebAssembly in the CI instead of Wasmtime, because it is simpler and more predictable and has no issues with functions that are "too big". We still build with --release simply because the build is faster that way, and the result runs fine with wasmer.

Unfortunately the CI is still failing and I can't reproduce it locally (even though wasm is almost deterministic). As the crash is inside dlmalloc, I suspect we are running out of memory and need the GC... I'll investigate that further.

@lbguilherme (Contributor, Author)

It's fixed!

What was going on?

By default the main stack comes after data and grows down. Nothing prevents it from overflowing and overwriting something in the data section. This is very dangerous.

That layout was chosen deliberately to give static addresses very small absolute values and thus minimize code size: wasm immediates are LEB128-encoded, so the i32.const instruction takes fewer bytes when the constant is small.
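
To put a number on that (illustrative helper, not part of the PR): signed LEB128 needs roughly one byte per 7 bits of magnitude, so addresses in a low data section encode in one or two bytes while addresses pushed past an 8MB stack need four.

# Count the bytes a signed LEB128 immediate needs (illustration only).
def sleb128_size(value : Int32) : Int32
  bytes = 0
  loop do
    byte = value & 0x7f
    value >>= 7
    bytes += 1
    # Stop once the remaining bits are pure sign extension.
    break if (value == 0 && (byte & 0x40) == 0) || (value == -1 && (byte & 0x40) != 0)
  end
  bytes
end

sleb128_size(100)       # => 2 (a typical low data-section address)
sleb128_size(8_500_000) # => 4 (an address placed after an 8MB stack)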

The thing I overlooked is that by default the stack is only 64KB. That is very likely too small for most Crystal programs.

Also, the dlmalloc data structures happen to be allocated right at the end of the data section, the first place to be corrupted by a stack overflow.

See this issue: WebAssembly/wasi-libc#233

+--------+--------+------------------+----------------
| Unused | Data   | Main stack       | Heap ...
+--------+--------+------------------+----------------
|        |        |                  |
0    __data_start |              __heap_base
             __data_end        __stack_pointer

How was it fixed?

The first change is to use the linker flag --stack-first to place the data section after the stack, so a stack overflow can't corrupt it. The second change is to increase the stack size to 8MB.

This is the new layout:

+--------+----------------+---------------+----------------
| Unused | Main stack     | Data          | Heap ...
+--------+----------------+---------------+----------------
|        |                |               |
0      1024         __data_start      __data_end
                   __stack_pointer    __heap_base

The stack can still grow into the unused region, but that's fine since this region is really unused.


I'm still not sure why I couldn't reproduce this locally and it only happened with the CI. C'est la vie.

Please give me another review round. Thanks!

@@ -384,6 +384,7 @@ module Crystal
   [{cl, cmd, nil}]
 elsif program.has_flag? "wasm32"
   link_flags = @link_flags || ""
+  link_flags += " --stack-first -z stack-size=#{8 * 1024 * 1024}"

(Member)
I'm wondering if we should reuse Fiber::StackPool::STACK_SIZE or perhaps introduce a more generic property for this. It would also be great to allow configuration of the stack size at compile time.
Perhaps a generic CRYSTAL_STACK_SIZE environment variable could be useful for this? It would populate this value here and Fiber::StackPool::STACK_SIZE and have 8MB as default value.
Just thinking ahead, this can be implemented outside this PR. (It's probably better to extract it).
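
For illustration only (hypothetical, not part of this PR or of the current compiler): such a knob could be read where the wasm32 link flags are assembled, defaulting to 8MB.

link_flags = @link_flags || ""  # as in the diff above
# Hypothetical: honor a CRYSTAL_STACK_SIZE environment variable (in bytes).
stack_size = ENV.fetch("CRYSTAL_STACK_SIZE", (8 * 1024 * 1024).to_s).to_u64
link_flags += " --stack-first -z stack-size=#{stack_size}"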

(Contributor, Author)
I like having a CRYSTAL_STACK_SIZE environment variable! But let's do that on another PR.

@HertzDevil (Contributor)

Does this PR alone enable any new standard library specs?

@lbguilherme (Contributor, Author) commented Mar 27, 2023

Not any entire spec file. But there are some I can enable by adding flag checks inside the spec files. Should I?

Channels is a good example. Most tests pass, but some require exception handling to work.

@straight-shoota (Member)

Yes, that would be great. All specs that work should be enabled 👍

@refi64 (Contributor) commented Jan 31, 2024

I'm not sure what the progress of this was, but I merged master back in and (successfully) enabled concurrent_spec.cr (most of the channel specs also pass, but I'd have to litter the file with ifs to work around the lack of exceptions).

Bizarrely enough, though, I can't run all the specs: I ran into a Cranelift bug in wasmtime, wasmer's single-pass compiler crashes on arm64, and some other runtimes fail because the functions are so large, etc. (I think LLVM is being a little over-aggressive with its inlining(?), but I digress.)
