Baseline Compilation in Wasmtime

Authors: Saúl Cabrera (@saulecabrera); Chris Fallin (@cfallin)

Summary

This RFC proposes the addition of a new WebAssembly (Wasm) compiler to Wasmtime: a single-pass or “baseline” compiler. A baseline compiler works to improve overall compilation performance, yielding faster startup times, at the cost of less optimized code.

Motivation

Wasmtime currently uses Cranelift by default, which is an optimizing compiler. Cranelift performs code optimizations at the expense of slower compilation times. This makes Just-In-Time (JIT) compilation of Wasm unsuitable for cases where higher compilation performance is desired (e.g. short-lived trivial programs, cases in which startup time is more critical than runtime performance).

The introduction of a baseline compiler is a first step towards (i) faster compilation and startup times and (ii) a tiered compilation model in Wasmtime, similar to what is present in Wasm engines in Web browsers. Tiered compilation itself is out of scope for this RFC, which covers only the introduction of a baseline compiler.

Approximate measurements taken on a subset of the Sightglass benchmark suite, using different optimizing and baseline compilers (Cranelift from Wasmtime, Liftoff from V8 and RabaldrMonkey from SpiderMonkey), show that a baseline compiler on average compiles 15x to 20x faster while producing code that is on average 1.1x to 1.5x slower than that produced by an optimizing compiler. These measurements align on average with other measurements observed when comparing interpretation and compilation for WebAssembly [1].

Proposal: Winch, a baseline compiler for Wasmtime

Winch: WebAssembly Intentionally-Non-Optimizing Compiler and Host

Design Principles

  • Single pass over Wasm bytecode
  • Function as the unit of compilation
  • Machine code generation directly from Wasm bytecode – no intermediate representation
  • Avoid reinventing machine-code emission – use Cranelift's instruction emitter code to create an assembler library
  • Prioritize compilation performance over runtime performance
  • Simple to verify by inspection: it should be evident which machine instructions are emitted for each WebAssembly opcode
  • Adding and iterating on new (WebAssembly and developer-facing) features should be simpler than doing it in an optimizing tier (Cranelift)
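
The first three principles can be illustrated with a minimal sketch. It assumes a toy three-opcode subset and textual pseudo-instructions in place of real machine code; `Op` and `compile_function` are hypothetical names for illustration, not part of Winch or wasmparser.

```rust
/// Toy subset of Wasm opcodes (hypothetical; a real compiler would consume
/// the operators produced by wasmparser).
#[derive(Debug, Clone, Copy)]
pub enum Op {
    I32Const(i32),
    LocalGet(u32),
    I32Add,
}

/// Compiles one function body. Each opcode is visited exactly once and its
/// pseudo-instructions are appended immediately: no intermediate
/// representation is built and nothing already emitted is revisited.
pub fn compile_function(body: &[Op]) -> Vec<String> {
    let mut code = Vec::new();
    for op in body {
        match *op {
            Op::I32Const(v) => code.push(format!("push {}", v)),
            // Assumes 4-byte local slots purely for illustration.
            Op::LocalGet(i) => code.push(format!("push [locals + {}]", i * 4)),
            Op::I32Add => {
                code.push("pop r1".to_string());
                code.push("pop r0".to_string());
                code.push("add r0, r1".to_string());
                code.push("push r0".to_string());
            }
        }
    }
    code
}
```

Because every opcode maps to a fixed, visible instruction sequence, the output can be checked by reading the `match` arms, which is the property the "simple to verify" principle asks for.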

High-level overview

  graph TD;
  A(wasmparser)-->B(cranelift-wasm);
  A-->C(winch);
  C-->D(Assembler);
  D-->X(cranelift-asm);
  X-->E(MachInst);
  X-->F(MachBuffer);
  B-->G(cranelift);
  G-->X;

Assembler and Borrowing from Cranelift

We plan to factor out the lower layers of Cranelift that produce and operate on machine code in order to reuse them as a generic assembler library (“Assembler”).

The two key abstractions that will be useful to reuse are the MachInst (“machine instruction”) trait and its implementations for each architecture; and the MachBuffer, which is a machine-code emission buffer with some knowledge of branches and ability to do peephole optimizations on them. The former lets us reuse all the logic to encode instructions for an ISA; the latter lets us emit code with “labels” and references to labels, and have the fixups done for us.
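
To make the "labels and fixups" idea concrete, here is a heavily simplified sketch of a label-aware emission buffer. `ToyBuffer` and all of its methods are hypothetical and are not Cranelift's MachBuffer API; the jump is a toy two-byte encoding with a one-byte displacement, and no peephole optimization is attempted.

```rust
/// Hypothetical, simplified stand-in for a label-aware code buffer.
#[derive(Default)]
pub struct ToyBuffer {
    bytes: Vec<u8>,
    /// Offset bound to each label, once known.
    labels: Vec<Option<usize>>,
    /// (position of the 1-byte displacement, label it refers to).
    fixups: Vec<(usize, usize)>,
}

impl ToyBuffer {
    /// Creates a label that is not yet bound to any offset.
    pub fn new_label(&mut self) -> usize {
        self.labels.push(None);
        self.labels.len() - 1
    }

    /// Binds `label` to the current end of the buffer.
    pub fn bind(&mut self, label: usize) {
        self.labels[label] = Some(self.bytes.len());
    }

    /// Appends one raw byte.
    pub fn put(&mut self, byte: u8) {
        self.bytes.push(byte);
    }

    /// Emits a toy jump: an opcode byte followed by a placeholder
    /// displacement that is patched in `finish`.
    pub fn jump(&mut self, label: usize) {
        self.bytes.push(0xEB);
        self.fixups.push((self.bytes.len(), label));
        self.bytes.push(0);
    }

    /// Patches every recorded displacement, measured from the end of the
    /// jump to the label's offset, and returns the finished bytes.
    pub fn finish(mut self) -> Vec<u8> {
        for &(pos, label) in &self.fixups {
            let target = self.labels[label].expect("label was never bound") as i64;
            let disp = target - (pos as i64 + 1);
            self.bytes[pos] = disp as i8 as u8;
        }
        self.bytes
    }
}
```

The caller can jump to a label before binding it (a forward branch) and never computes a displacement by hand, which is the convenience a single-pass compiler needs for Wasm blocks and loops.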

The MachInst trait and its implementations, and the MachBuffer, can be mostly factored out into a separate crate cranelift_asm. This will require some care with respect to layering: in particular, definitions of machine-instruction types are currently done in the ISLE backends for each ISA within Cranelift. We can continue to use ISLE for these, but they will need to be moved to the separate crate.

As a result of this initial layering, one will be able to build a MachInst as a Rust data structure and emit it manually, for example:

let add = cranelift_asm::x64::AluRmiR { op: AluRmiR::Add, /* ... */ };
let mut buf = MachBuffer::new();
add.emit(&mut buf, /* ... */);

However, this is still quite cumbersome. As a next step, we will develop an API on top of this that provides procedural generation of instructions, i.e., one method call per instruction. Something like:

let mut masm = cranelift_asm::x64::Assembler::new(); 
masm.add(rd, rm);
masm.store(rd, MemArg::base_offset(ra, 64));

This would allow for fairly natural single-pass code emission. In essence, this is a lower-level approximation of the MacroAssembler idea from SpiderMonkey. Each architecture will have an implementation of the Assembler API; perhaps there can be a trait that abstracts commonalities, but enough will be different (e.g., instruction set quirks beyond the usual “add/sub/and/or/not” suspects, x64 two-operand form vs aarch64 three-operand form, and more) that we expect there to be different Assembler types for each ISA. This in turn implies different lowering code that invokes the Assembler per ISA in the baseline compiler. The lowering code can perhaps share many helpers that are monomorphized on the “common ISA core” trait.
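
One possible shape for the "common ISA core" trait is sketched below. Everything here is an assumption for illustration: `CommonAsm`, `X64Asm`, `Aarch64Asm` and `lower_add` are hypothetical names, registers are plain numbers, and the assemblers record textual instructions rather than encoding machine code.

```rust
/// Hypothetical trait capturing what the per-ISA assemblers have in common.
pub trait CommonAsm {
    /// Emits `dst = lhs + rhs`, using whatever sequence the target needs.
    fn add(&mut self, dst: u8, lhs: u8, rhs: u8);
    /// Instructions emitted so far, as text.
    fn listing(&self) -> &[String];
}

#[derive(Default)]
pub struct X64Asm {
    out: Vec<String>,
}

#[derive(Default)]
pub struct Aarch64Asm {
    out: Vec<String>,
}

impl CommonAsm for X64Asm {
    fn add(&mut self, dst: u8, lhs: u8, rhs: u8) {
        // Two-operand form: the destination doubles as the first source.
        if dst == lhs {
            self.out.push(format!("add r{}, r{}", dst, rhs));
        } else if dst == rhs {
            // Addition is commutative, so swap the operands instead of
            // overwriting rhs with a move.
            self.out.push(format!("add r{}, r{}", dst, lhs));
        } else {
            self.out.push(format!("mov r{}, r{}", dst, lhs));
            self.out.push(format!("add r{}, r{}", dst, rhs));
        }
    }

    fn listing(&self) -> &[String] {
        &self.out
    }
}

impl CommonAsm for Aarch64Asm {
    fn add(&mut self, dst: u8, lhs: u8, rhs: u8) {
        // Three-operand form: a single instruction is always enough.
        self.out.push(format!("add x{}, x{}, x{}", dst, lhs, rhs));
    }

    fn listing(&self) -> &[String] {
        &self.out
    }
}

/// Shared lowering helper, monomorphized once per assembler type.
pub fn lower_add<A: CommonAsm>(masm: &mut A, dst: u8, lhs: u8, rhs: u8) {
    masm.add(dst, lhs, rhs);
}
```

The helper is identical for both targets, while the two-operand versus three-operand difference stays inside each assembler; ISA-specific instructions would live on the concrete types rather than on the trait.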

In the above examples, we bypass the register-allocation support, i.e. the ability to hold virtual register operands rather than real registers, in the MachInsts. This is supported today by passing through RealRegs (“real registers”) instead. In the baseline compiler we expect register allocation to occur before invoking the Assembler; i.e., when generating the instructions we already know which register we are using for each operand. Doing otherwise (emitting with vregs first and editing later) requires actually buffering the MachInst structs in memory, which we do not wish to do.

We don’t expect to make any changes to Cranelift itself beyond the layering refactor to borrow its MachInst and MachBuffer implementations. In particular we don’t expect to use the Assembler wrapper in Cranelift, at least at first, because it will be built around constructing and emitting instructions to machine code right away, without buffering (as in Cranelift’s VCode). It’s possible in the future that we may find other ways to make Assembler generic and leverage it in Cranelift too, but that is beyond the scope of this RFC.

Register Allocation

We plan to implement register allocation in a single-pass fashion.

The baseline compiler will hold a reference to a register allocator abstraction, which will keep a per-ISA list of registers, represented by Cranelift's Reg abstraction, along with their availability. It will also hold a reference to a value stack abstraction, to keep track of operands and results and their location as it performs compilation. These are the two key abstractions for register allocation:

pub struct Compiler {
  //...
  allocator: RegisterAllocator,
  value_stack: ValueStack,
  //...
}

The value stack is expected to keep track of the location of its values. A particular value can be tagged as one of:

  • Local: representing a function local slot (index and type). The address of the local will be resolved lazily to reduce register pressure.
  • Register
  • Constant: representing an immediate value.
  • Memory Offset: the location of the value at a given memory offset
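
The four tags map naturally onto a Rust enum. The sketch below is one possible encoding under stated assumptions: the names `Value`, `ValType` and `ValueStack`, the field types, and the `registers_in_use` helper are all hypothetical and chosen only for illustration.

```rust
/// Value types tracked for locals (only two shown; illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ValType {
    I32,
    I64,
}

/// Where a value on the value stack currently lives.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Value {
    /// A function local slot; its address is resolved only when needed.
    Local { index: u32, ty: ValType },
    /// A value held in a machine register (a plain number here).
    Register(u8),
    /// An immediate value.
    Constant(i64),
    /// A value stored at a given memory offset.
    MemoryOffset(u32),
}

#[derive(Debug, Default)]
pub struct ValueStack {
    values: Vec<Value>,
}

impl ValueStack {
    pub fn push(&mut self, value: Value) {
        self.values.push(value);
    }

    pub fn pop(&mut self) -> Option<Value> {
        self.values.pop()
    }

    /// Number of entries that currently occupy a register.
    pub fn registers_in_use(&self) -> usize {
        self.values
            .iter()
            .filter(|v| matches!(v, Value::Register(_)))
            .count()
    }
}
```

Pushing a `Local` or a `Constant` emits no code and occupies no register; a register is only needed once an instruction actually consumes the value.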

Registers will be requested from the register allocator every time an operation requires one. If no registers are available, the baseline compiler will spill: it will move all locals and all registers on the value stack to memory and change their tag to a memory offset, effectively freeing up registers. Spilling will also be performed at control flow points. To reduce the number of spills, the baseline compiler will also perform limited constant rematerialization.
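
The spilling policy can be sketched as follows. This is a minimal illustration under several assumptions, not the planned implementation: `Loc` and `Ctx` are hypothetical names, registers are plain numbers, spill slots are 8 bytes, and spills are recorded as text rather than emitted as stores.

```rust
/// Location tag of a value-stack entry (hypothetical, simplified).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Loc {
    Local(u32),
    Reg(u8),
    Const(i64),
    Mem(u32),
}

/// Toy compilation context: free registers, value stack, emitted code.
pub struct Ctx {
    free: Vec<u8>,
    pub stack: Vec<Loc>,
    next_slot: u32,
    pub code: Vec<String>,
}

impl Ctx {
    pub fn new(regs: &[u8]) -> Self {
        Ctx { free: regs.to_vec(), stack: Vec::new(), next_slot: 0, code: Vec::new() }
    }

    /// Moves every local and register on the value stack to memory and
    /// retags it. Constants are left alone, since they can be
    /// rematerialized as immediates when they are used.
    fn spill(&mut self) {
        for loc in self.stack.iter_mut() {
            let src = match *loc {
                Loc::Reg(r) => {
                    self.free.push(r); // the register becomes available again
                    format!("r{}", r)
                }
                Loc::Local(i) => format!("local{}", i),
                Loc::Const(_) | Loc::Mem(_) => continue,
            };
            self.code.push(format!("spill {} -> [sp+{}]", src, self.next_slot));
            *loc = Loc::Mem(self.next_slot);
            self.next_slot += 8;
        }
    }

    /// Requests a general purpose register, spilling if none is free.
    /// Panics if spilling frees nothing, i.e. every register is held by a
    /// value that is not on the value stack.
    pub fn gpr(&mut self) -> u8 {
        if self.free.is_empty() {
            self.spill();
        }
        self.free.pop().expect("no register available even after spilling")
    }
}
```

Spilling everything at once keeps the allocator trivial, with no liveness analysis and no choice of victim, at the cost of extra memory traffic, which matches the stated priority of compilation speed over code quality.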

Assuming that we have an immediate at the top of the stack, emitting an add instruction with an immediate operand would look something like this:

let mut masm = cranelift_asm::x64::Assembler::new();
let imm = self.value_stack.pop(); 
// request a general purpose register;
// spill if none available
let rd = self.gpr();
masm.add(rd, imm);

Integration with Wasmtime

We plan to integrate the baseline compiler incrementally into Wasmtime, as an in-tree crate, winch. It will be introduced as a compile-time feature, off by default. Taking as a guideline Wasmtime's tiers of support, this means that the baseline compiler will be introduced as a Tier 3 feature.

In general, the development of the baseline compiler will be done in phases, each phase covering a specific set of features:

| Phase | Feature                  | Feature Type        |
|-------|--------------------------|---------------------|
| 1     | cranelift_asm crate      | Refactoring         |
| 1     | x64 support              | Target architecture |
| 1     | Initial aarch64 support  | Target architecture |
| 1     | wasi_snapshot_preview1   | WASI proposal       |
| 1     | wasi_unstable            | WASI proposal       |
| 1     | Multi-Memory             | Wasm proposal       |
| 1     | Epoch-based interruption | Wasmtime feature    |
| 1     | Parallel compilation     | Wasmtime feature    |
| 1     | Fuzzing integration      | Test coverage       |
| 2     | Reference Types          | Wasm proposal       |
| 2     | Fuel                     | Wasmtime feature    |
| 2     | SIMD                     | Wasm proposal       |
| 2     | Memory 64                | Wasm proposal       |
| 2     | Finalize aarch64 support | Target architecture |
| 3     | s390x                    | Target architecture |
| 3     | Debugging integration    | Debugging           |

Configuring compilation

We plan to extend wasmtime::Strategy to include a baseline compiler entry:

pub enum Strategy {
  Auto,
  Cranelift,
  Winch
} 

The strategy will be configurable via the strategy method on the wasmtime::Config struct:

config.strategy(Strategy::Winch);

We also plan to extend Wasmtime's run and compile subcommands to support a compiler argument:

wasmtime compile --compiler=<winch|cranelift> file.wasm
wasmtime run --compiler=<winch|cranelift> file.wasm

Performing compilation

The baseline compiler will implement the wasmtime_environ::Compiler trait, serving as the separation layer between Wasmtime and the compiler. We plan to modify the wasmtime::Engine::compiler method to account for the compilation strategy and choose the compiler accordingly.
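
The strategy-based selection could look roughly like the standalone sketch below. It does not use the real Wasmtime types: the `Compiler` trait here is a one-method stand-in for wasmtime_environ::Compiler, `CraneliftCompiler`, `WinchCompiler` and `compiler_for` are hypothetical names, and resolving `Auto` to Cranelift is an assumption based on Cranelift being the current default.

```rust
/// Mirror of the proposed strategy enum, redefined so the sketch is
/// self-contained.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Strategy {
    Auto,
    Cranelift,
    Winch,
}

/// One-method stand-in for the compiler trait that separates the engine
/// from a concrete compiler.
pub trait Compiler {
    fn name(&self) -> &'static str;
}

struct CraneliftCompiler;
struct WinchCompiler;

impl Compiler for CraneliftCompiler {
    fn name(&self) -> &'static str {
        "cranelift"
    }
}

impl Compiler for WinchCompiler {
    fn name(&self) -> &'static str {
        "winch"
    }
}

/// Chooses a compiler for the configured strategy.
pub fn compiler_for(strategy: Strategy) -> Box<dyn Compiler> {
    match strategy {
        // Assumption: Auto keeps resolving to the optimizing compiler.
        Strategy::Auto | Strategy::Cranelift => Box::new(CraneliftCompiler),
        Strategy::Winch => Box::new(WinchCompiler),
    }
}
```

Returning a trait object keeps the rest of the engine unaware of which compiler is in use, so adding the baseline compiler only touches the selection point.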

Development and long term maintenance

Saúl Cabrera (@saulecabrera) will be the main maintainer of the baseline compiler with support from Chris Fallin (@cfallin).

Footnotes

  1. Ben L. Titzer. A fast in-place interpreter for WebAssembly