
Add ahead-of-time ICs.#45

Merged
cfallin merged 3 commits into bytecodealliance:fastly/ff-124-0-2 from
cfallin:cfallin/aot-ics-ff124
Jul 17, 2024

Conversation

@cfallin
Member

@cfallin cfallin commented Jul 16, 2024

The SpiderMonkey CacheIR mechanism for inline caches (ICs) generates IC
bodies dynamically based on observed cases invoked by the user program.
Only these ICs will be compiled and exist at runtime. This is ideal from
a flexibility standpoint: we have the ability to add new ICs without
writing their bodies in full (and can, for example, programmatically
generate parts of them). Also, it avoids the overhead of compilation
until an IC is actually needed.

However, some environments require fully ahead-of-time code generation.
In addition, in some environments, we may have ample compilation time
available during a "preparation" phase, and wish to minimize latency for
the first use of an IC instead. In these cases, it would be better to
have a corpus of inline cache bodies, known ahead of time.

This PR adds an "ahead-of-time ICs" feature that includes a corpus of IC
bodies collected while running tests, built-in mechanisms to keep this
corpus up-to-date, and a mechanism to load the corpus when a `JitZone`
is created, so all ICs are ready.

The expectation is that any reasonable user program will likely only
generate ICs that are in this corpus; thus, there is no need to compile
ICs at first actual occurrence. In a system that can only AOT-compile
ICs, this means we will have ICs available in more cases.

Because CacheIR still allows for programmatically-generated IC bodies of
arbitrary content (e.g., due to arbitrarily long prototype chains), we
may not always have an IC in the corpus when we encounter its CacheIR in
the wild: that is, the corpus is not guaranteed to be "complete".
However, because it includes all ICs observed during execution of all
tests, we expect that any *reasonable* IC should be included. Note that
this aligns the incentives of keeping IC generation tested and keeping
the corpus close to complete.

In order to maintain the corpus, this feature includes an "enforcing"
mode. This should *ONLY* be used during testing: it aborts the process
when an unknown (new) IC body is encountered, after dumping the file.
For maintenance convenience, this file is in the format that we check
into the tree: one must simply move it into `js/src/ics/` and rebuild.
The idea is that as one adds new IC bodies, one runs tests, sees these
failures, and "blesses" the IC bodies as part of the corpus by adding
the file(s) as needed.
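The corpus-lookup and enforcing-mode behavior described above can be sketched roughly as follows. This is an illustrative sketch only: the names (`icCorpus`, `loadCorpus`, `getIC`, `serialize`) are hypothetical stand-ins, not the actual SpiderMonkey API, and the real implementation keys on a byte-exact CacheIR encoding rather than JSON.

```javascript
// Hypothetical sketch of an AOT IC corpus: a map from a serialized
// CacheIR body to a precompiled stub.
const icCorpus = new Map();

function serialize(cacheIROps) {
  // Stand-in for a byte-exact CacheIR encoding.
  return JSON.stringify(cacheIROps);
}

// Pre-load the corpus when the zone is created, so all ICs are ready.
function loadCorpus(entries) {
  for (const ops of entries) {
    icCorpus.set(serialize(ops), { compiled: true, ops });
  }
}

// Look up an IC body. In enforcing mode (testing only), an unknown
// body is an error: the real feature dumps the body and aborts so it
// can be "blessed" into the checked-in corpus.
function getIC(cacheIROps, { enforcing = false } = {}) {
  const key = serialize(cacheIROps);
  const stub = icCorpus.get(key);
  if (stub) return stub;
  if (enforcing) {
    throw new Error(`unknown IC body; add it to the corpus: ${key}`);
  }
  // Non-enforcing fallback: compile lazily at first occurrence.
  const fresh = { compiled: true, ops: cacheIROps };
  icCorpus.set(key, fresh);
  return fresh;
}
```

A corpus hit returns the precompiled stub immediately; only bodies outside the corpus (e.g., from arbitrarily long prototype chains) fall back to lazy compilation, or to an error in enforcing mode.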

This functionality is a prerequisite for later AOT compilation work, but
is also potentially useful on its own.

(I'm PR'ing this against the 124.0.2 branch to start the review, and including
one subsequent bugfix from upstream cherry-picked in that it depends on, but
I'm happy to rebase onto 127 once that lands!)

…ader. r=iain

Continuation of bug 1690702 to also set `isSameRealm` when reading call-flags.

LCallKnown from standard calls already left out the same realm check:
```js
function f() {
  // Don't inline this function to ensure we compile through LCallKnown.
  with ({}) ;
}

for (var i = 0; i < 1_000_000; ++i) {
  // Standard call.
  f();
}
```

But LCallKnown from FunCall calls had an extra same realm check:
```js
function f() {
  // Don't inline this function to ensure we compile through LCallKnown.
  with ({}) ;
}

for (var i = 0; i < 1_000_000; ++i) {
  // FunCall call.
  f.call();
}
```

Differential Revision: https://phabricator.services.mozilla.com/D206350
@cfallin cfallin requested a review from JakeChampion July 16, 2024 07:30
@cfallin
Member Author

cfallin commented Jul 16, 2024

> (I'm PR'ing this against the 124.0.2 branch to start the review, and including
> one subsequent bugfix from upstream cherry-picked in that it depends on, but
> I'm happy to rebase onto 127 once that lands!)

Actually a brief update on this: to avoid excessively headache-inducing second-order diffing, I'm going to finish my existing rebase and patch surgery on top of 124 and will hope to land these as PRs onto the fastly/ff-124-0-2 branch; once that's in, I'll do the cherry-pick and any needed updates for 127 separately.

@cfallin cfallin force-pushed the cfallin/aot-ics-ff124 branch from 44c19c8 to 1fb69da Compare July 17, 2024 15:07
Collaborator

@JakeChampion JakeChampion left a comment


This looks great to me - good idea to also add a mozconfig for this and include it in CI 👍

@cfallin cfallin merged commit f888f9e into bytecodealliance:fastly/ff-124-0-2 Jul 17, 2024
@cfallin cfallin deleted the cfallin/aot-ics-ff124 branch July 17, 2024 17:02
cfallin added a commit to cfallin/spidermonkey-wasi-embedding that referenced this pull request Jul 25, 2024
This pulls in work from

- bytecodealliance/gecko-dev#45 (Add ahead-of-time ICs.)
- bytecodealliance/gecko-dev#46 (JS shell on WASI: add basic Wizer
  integration for standalone testing.)
- bytecodealliance/gecko-dev#47 (Update PBL for performance and in
  preparation for applying weval.)
- bytecodealliance/gecko-dev#48 (Add weval support to PBL.)

as originally PR'd onto a SpiderMonkey v124.0.2 branch then rebased to
v127.0.2 in bytecodealliance/gecko-dev#51.
cfallin added a commit to cfallin/js-compute-runtime that referenced this pull request Aug 1, 2024
This PR pulls in my work to use "weval", the WebAssembly partial
evaluator, to perform ahead-of-time compilation of JavaScript using the
PBL interpreter we previously contributed to SpiderMonkey. This work has
been merged into the BA fork of SpiderMonkey in
bytecodealliance/gecko-dev#45, bytecodealliance/gecko-dev#46,
bytecodealliance/gecko-dev#47, bytecodealliance/gecko-dev#48,
bytecodealliance/gecko-dev#51, bytecodealliance/gecko-dev#52,
bytecodealliance/gecko-dev#53, bytecodealliance/gecko-dev#54,
bytecodealliance/gecko-dev#55, and then integrated into StarlingMonkey
in bytecodealliance/StarlingMonkey#91.

The feature is off by default; it requires a `--enable-experimental-aot`
flag to be passed to `js-compute-runtime-cli.js`. This requires a
separate build of the engine Wasm module to be used when the flag is
passed.

This should still be considered experimental until it is tested more
widely. The PBL+weval combination passes all jit-tests and jstests in
SpiderMonkey, and all integration tests in StarlingMonkey; however, it
has not yet been widely tested in real-world scenarios.

Initial speedups we are seeing on Octane (CPU-intensive JS benchmarks)
are in the 3x-5x range. This is roughly equivalent to the speedup that a
native JS engine's "baseline JIT" compiler tier gets over its
interpreter, and it uses the same basic techniques -- compiling all
polymorphic operations (all basic JS operators) to inline-cache sites
that dispatch to stubs depending on types. Further speedups can be
obtained eventually by inlining stubs from warmed-up IC chains, but that
requires warmup.
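The inline-cache dispatch pattern described above can be sketched as follows. This is an illustrative sketch of the general technique only, not engine code: the names (`makeICSite`, `genericAdd`) are hypothetical, and a real IC site guards on engine-internal type tags and shapes rather than JS-level predicates.

```javascript
// Hypothetical sketch of an inline-cache site for a polymorphic "add":
// a chain of (guard, body) stubs tried in order, with a generic fallback.
function makeICSite() {
  const stubs = [];
  return {
    attach(guard, body) { stubs.push({ guard, body }); },
    call(lhs, rhs) {
      for (const { guard, body } of stubs) {
        if (guard(lhs, rhs)) return body(lhs, rhs); // fast path: stub hit
      }
      // Fallback path: generic operation (a warming-up JIT would also
      // attach a new stub for the observed types here).
      return genericAdd(lhs, rhs);
    },
  };
}

function genericAdd(a, b) { return a + b; }

const addSite = makeICSite();
// Attach a stub specialized for the int32 + int32 case.
addSite.attach(
  (a, b) => Number.isInteger(a) && Number.isInteger(b),
  (a, b) => (a + b) | 0,
);
```

Integer operands take the specialized stub; anything else (e.g., string concatenation) falls through to the generic path. Inlining the bodies of warmed-up stub chains directly at the call site is the further speedup mentioned above that requires warmup.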

Important to note is that this compilation approach is *fully
ahead-of-time*: it requires no profiling or observation or warmup of
user code, and compiles the JS directly to Wasm that does not do any
further codegen/JIT at runtime. Thus, it is suitable for the per-request
isolation model (new Wasm instance for each request, with no shared
state).