Add ahead-of-time ICs. #45
Merged
cfallin merged 3 commits into `bytecodealliance:fastly/ff-124-0-2` on Jul 17, 2024
Conversation
…ader. r=iain
Continuation of bug 1690702 to also set `isSameRealm` when reading call-flags.
LCallKnown from standard calls already left out the same-realm check:
```js
function f() {
// Don't inline this function to ensure we compile through LCallKnown.
with ({}) ;
}
for (var i = 0; i < 1_000_000; ++i) {
// Standard call.
f();
}
```
But LCallKnown from FunCall calls had an extra same-realm check:
```js
function f() {
// Don't inline this function to ensure we compile through LCallKnown.
with ({}) ;
}
for (var i = 0; i < 1_000_000; ++i) {
// FunCall call
f.call();
}
```
Differential Revision: https://phabricator.services.mozilla.com/D206350
Author
Actually a brief update on this: to avoid excessively headache-inducing second-order diffing, I'm going to finish my existing rebase and patch surgery on top of 124 and will hope to land these as PRs onto the …
JakeChampion approved these changes on Jul 17, 2024
cfallin added a commit to cfallin/spidermonkey-wasi-embedding that referenced this pull request on Jul 25, 2024
This pulls in work from:

- bytecodealliance/gecko-dev#45 (Add ahead-of-time ICs.)
- bytecodealliance/gecko-dev#46 (JS shell on WASI: add basic Wizer integration for standalone testing.)
- bytecodealliance/gecko-dev#47 (Update PBL for performance and in preparation for applying weval.)
- bytecodealliance/gecko-dev#48 (Add weval support to PBL.)

as originally PR'd onto a SpiderMonkey v124.0.2 branch, then rebased to v127.0.2 in bytecodealliance/gecko-dev#51.
cfallin added a commit to cfallin/js-compute-runtime that referenced this pull request on Aug 1, 2024
This PR pulls in my work to use "weval", the WebAssembly partial evaluator, to perform ahead-of-time compilation of JavaScript using the PBL interpreter we previously contributed to SpiderMonkey. This work has been merged into the BA fork of SpiderMonkey in bytecodealliance/gecko-dev#45, bytecodealliance/gecko-dev#46, bytecodealliance/gecko-dev#47, bytecodealliance/gecko-dev#48, bytecodealliance/gecko-dev#51, bytecodealliance/gecko-dev#52, bytecodealliance/gecko-dev#53, bytecodealliance/gecko-dev#54, bytecodealliance/gecko-dev#55, and then integrated into StarlingMonkey in bytecodealliance/StarlingMonkey#91.

The feature is off by default; it requires an `--enable-experimental-aot` flag to be passed to `js-compute-runtime-cli.js`. This requires a separate build of the engine Wasm module to be used when the flag is passed.

This should still be considered experimental until it is tested more widely. The PBL+weval combination passes all jit-tests and jstests in SpiderMonkey, and all integration tests in StarlingMonkey; however, it has not yet been widely tested in real-world scenarios.

Initial speedups we are seeing on Octane (CPU-intensive JS benchmarks) are in the 3x-5x range. This is roughly equivalent to the speedup that a native JS engine's "baseline JIT" compiler tier gets over its interpreter, and it uses the same basic techniques: compiling all polymorphic operations (all basic JS operators) to inline-cache sites that dispatch to stubs depending on types. Further speedups can be obtained eventually by inlining stubs from warmed-up IC chains, but that requires warmup.

Important to note is that this compilation approach is *fully ahead-of-time*: it requires no profiling, observation, or warmup of user code, and compiles the JS directly to Wasm that does not do any further codegen/JIT at runtime. Thus, it is suitable for the per-request isolation model (new Wasm instance for each request, with no shared state).
The SpiderMonkey CacheIR mechanism for inline caches (ICs) generates IC
bodies dynamically based on observed cases invoked by the user program.
Only these ICs will be compiled and exist at runtime. This is ideal from
a flexibility standpoint: we have the ability to add new ICs without
writing their bodies in full (and can, for example, programmatically
generate parts of them). Also, it avoids the overhead of compilation
until an IC is actually needed.
However, some environments require fully ahead-of-time code generation.
In addition, in some environments, we may have ample compilation time
available during a "preparation" phase, and wish to minimize latency for
the first use of an IC instead. In these cases, it would be better to
have a corpus of inline cache bodies, known ahead of time.
This PR adds an "ahead-of-time ICs" feature that includes a corpus of IC
bodies collected while running tests, built-in mechanisms to keep this
corpus up-to-date, and a mechanism to load the corpus when a
`JitZone` is created, so all ICs are ready.
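The shape of this mechanism can be sketched as follows. This is a hypothetical illustration, not SpiderMonkey's actual API: the type and method names (`JitZoneICs`, `preload`, `lookup`, `CompiledStub`) are invented here, and the point is only that the corpus is keyed on the exact serialized CacheIR body and populated once, up front:

```cpp
// Hypothetical sketch (illustrative names, not SpiderMonkey's actual API):
// the AOT corpus maps the exact serialized CacheIR body to a precompiled
// stub, and is populated once when a zone's JIT state is created.
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

using CacheIRBytes = std::string;  // serialized CacheIR body, used as the key

struct CompiledStub {
  // The real engine would hold compiled code here; a size marker suffices.
  std::size_t codeSize = 0;
};

class JitZoneICs {
  std::map<CacheIRBytes, CompiledStub> corpus_;

 public:
  // Called once at zone creation with every checked-in IC body.
  void preload(const std::vector<CacheIRBytes>& bodies) {
    for (const CacheIRBytes& b : bodies) {
      corpus_[b] = CompiledStub{b.size()};
    }
  }

  // Exact-match lookup: a hit means the IC was compiled ahead of time.
  const CompiledStub* lookup(const CacheIRBytes& body) const {
    auto it = corpus_.find(body);
    return it == corpus_.end() ? nullptr : &it->second;
  }
};
```

Because lookup is an exact match on the IC body, any program whose ICs all appear in the corpus never triggers compilation at first use.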
The expectation is that any reasonable user program will likely only
generate ICs that are in this corpus; thus, there is no need to compile
ICs at first actual occurrence. In a system that can only AOT-compile
ICs, this means we will have ICs available in more cases.
Because CacheIR still allows for programmatically-generated IC bodies of
arbitrary content (e.g., due to arbitrarily long prototype chains), we
may not always have an IC in the corpus when we encounter its CacheIR in
the wild: that is, the corpus is not guaranteed to be "complete".
However, because it includes all ICs observed during execution of all
tests, we expect that any reasonable IC should be included. Note that
this aligns the incentives of keeping IC generation tested and keeping
the corpus close to complete.
In order to maintain the corpus, this feature includes an "enforcing"
mode. This should *ONLY* be used during testing: it aborts the process
when an unknown (new) IC body is encountered, after dumping the file.
For maintenance convenience, this file is in the format that we check
into the tree: one must simply move it into `js/src/ics/` and rebuild.
The idea is that as one adds new IC bodies, one runs tests, sees these
failures, and "blesses" the IC bodies as part of the corpus by adding
the file(s) as needed.
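The decision logic described above can be sketched roughly as follows; this is an illustrative sketch only, and the function name, enum, and dump text are invented here, not the engine's actual interface:

```cpp
// Hypothetical sketch of the enforcing-mode policy described above; the
// function and type names (and the dump text) are illustrative only.
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <set>
#include <string>

enum class ICSource { Corpus, RuntimeCompiled };

// In enforcing mode (testing only), an IC body missing from the corpus is
// dumped in the checked-in format and the process aborts, forcing the new
// body to be "blessed" into the tree. Otherwise, an environment that
// permits runtime codegen just compiles the body lazily.
ICSource resolveIC(const std::set<std::string>& corpus,
                   const std::string& body, bool enforcing) {
  if (corpus.count(body) != 0) {
    return ICSource::Corpus;  // precompiled ahead of time
  }
  if (enforcing) {
    std::fprintf(stderr, "unknown IC body; move the dump into the tree:\n%s\n",
                 body.c_str());
    std::abort();  // testing only: makes corpus gaps impossible to miss
  }
  return ICSource::RuntimeCompiled;
}
```

Aborting (rather than warning) is what aligns the incentives mentioned earlier: a test run only passes once every IC it exercises is in the checked-in corpus.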
This functionality is a prerequisite for later AOT compilation work, but
is also potentially useful on its own.
(I'm PR'ing this against the 124.0.2 branch to start the review, and including
one subsequent bugfix from upstream cherry-picked in that it depends on, but
I'm happy to rebase onto 127 once that lands!)