{"payload":{"feedbackUrl":"https://github.com/orgs/community/discussions/53140","repo":{"id":242231200,"defaultBranch":"main","name":"wasmtime","ownerLogin":"cfallin","currentUserCanPush":false,"isFork":true,"isEmpty":false,"createdAt":"2020-02-21T21:09:36.000Z","ownerAvatar":"https://avatars.githubusercontent.com/u/216148?v=4","public":true,"private":false,"isOrgOwned":false},"refInfo":{"name":"","listCacheKey":"v0:1725074405.0","currentOid":""},"activityList":{"items":[{"before":"098430f3c8fd7bb92968402beef0670d08023fba","after":"0bce096832b94da99d9f54ba46b7c904ca7877bb","ref":"refs/heads/main","pushedAt":"2024-09-07T23:25:11.000Z","pushType":"push","commitsCount":14,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Warn against `clippy::cast_possible_truncation` in Wasmtime (#9209)\n\n* Warn against `clippy::cast_possible_truncation` in Wasmtime\n\nThis commit explicitly enables the `clippy::cast_possible_truncation`\nlint in Clippy for just the `wasmtime::runtime` module. This does not\nenable it for the entire workspace since it's a very noisy lint and in\ngeneral has a low signal value. For the domain that `wasmtime::runtime`\nis working in, however, this is a much more useful lint. We in general\nwant to be very careful about casting between `usize`, `u32`, and `u64`\nand the purpose of this module-targeted lint is to help with just that.\nI was inspired to do this after reading over #9206 where especially when\nrefactoring code and changing types I think it would be useful to have\nlocations flagged as \"truncation may happen here\" which previously\nweren't truncating.\n\nThe failure mode for this lint is that panics might be introduced where\ntruncation is explicitly intended. Most of the time though this isn't\nactually desired so the more practical consequence of this lint is to\nprobably slow down wasmtime ever so slightly and bloat it ever so\nslightly by having a few more checks in a few places. This is likely\nbest addressed in a more comprehensive manner, however, rather than\nspecifically for just this one case. This problem isn't unique to just\ncasts, but to many other forms of `.unwrap()` for example.\n\n* Fix some casts in tests","shortMessageHtmlLink":"Warn against <code>clippy::cast_possible_truncation</code> in Wasmtime (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2511047653\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/9209\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/9209/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/9209\">bytecode…</a>"}},{"before":"c0c3a68c05971afa3888d6ac4ffed5ac275e0ce7","after":"098430f3c8fd7bb92968402beef0670d08023fba","ref":"refs/heads/main","pushedAt":"2024-08-31T03:20:32.000Z","pushType":"push","commitsCount":25,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Upgrade regalloc2 to 0.9.4 (#9191)\n\n* Upgrade to regalloc-0.9.4\n\n* Update filetests\n\n* Run `cargo vet`","shortMessageHtmlLink":"Upgrade regalloc2 to 0.9.4 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2497829870\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/9191\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/9191/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/9191\">bytecodealliance#9191</a>)"}},{"before":null,"after":"21036a4ba3428d3679c53bd238a938f6ba66621c","ref":"refs/heads/pcc-update","pushedAt":"2024-08-31T03:20:05.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"wip: reproduce PCC fuzzing failures (run instantiate target)","shortMessageHtmlLink":"wip: reproduce PCC fuzzing failures (run instantiate target)"}},{"before":"a8607bf87cbeaff0883fbd832b65a6d2a7a6ece1","after":"c0c3a68c05971afa3888d6ac4ffed5ac275e0ce7","ref":"refs/heads/main","pushedAt":"2024-08-21T16:26:54.000Z","pushType":"push","commitsCount":4,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Cranelift: Remove the old stack maps implementation (#9159)\n\nThey are superseded by the new user stack maps implementation.","shortMessageHtmlLink":"Cranelift: Remove the old stack maps implementation (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2476642195\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/9159\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/9159/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/9159\">bytecodealliance…</a>"}},{"before":"b526865150a2ef131e644069022d9890c4c6d870","after":"a8607bf87cbeaff0883fbd832b65a6d2a7a6ece1","ref":"refs/heads/main","pushedAt":"2024-08-20T19:07:03.000Z","pushType":"push","commitsCount":21,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Install `git` executable in container builds (#9152)\n\nPrecompiled artifacts for macOS show this for `wasmtime --version`\n\n    wasmtime-cli 24.0.0 (6fc3d274c 2024-08-20)\n\nwhereas for Linux they show\n\n    wasmtime-cli 24.0.0\n\nand this is due to `git` not being available in the build environment on\nLinux.","shortMessageHtmlLink":"Install <code>git</code> executable in container builds (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2476097718\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/9152\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/9152/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/9152\">bytecodealliance#9152</a>)"}},{"before":"ba864e987ef1ab87c439ca6b396264547d2425e1","after":"b526865150a2ef131e644069022d9890c4c6d870","ref":"refs/heads/main","pushedAt":"2024-08-15T03:20:30.000Z","pushType":"push","commitsCount":51,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Cranelift: Add a new backend for emitting Pulley bytecode (#9089)\n\n* Cranelift: Add a new backend for emitting Pulley bytecode\n\nThis commit adds two new backends for Cranelift that emits 32- and 64-bit Pulley\nbytecode. The backends are both actually the same, with a common implementation\nliving in `cranelift/codegen/src/isa/pulley_shared`. Each backend configures an\nISA flag that determines the pointer size, and lowering inspects this flag's\nvalue when lowering memory accesses.\n\nTo avoid multiple ISLE compilation units, and to avoid compiling duplicate\ncopies of Pulley's generated `MInst`, I couldn't use `MInst` as the `MachInst`\nimplementation directly. Instead, there is an `InstAndKind` type that is a\nnewtype over the generated `MInst` but which also carries a phantom type\nparameter that implements the `PulleyTargetKind` trait. There are two\nimplementations of this trait, a 32- and 64-bit version. This is necessary\nbecause there are various static trait methods for the mach backend which we\nmust implement, and which return the pointer width, but don't have access to any\n`self`. Therefore, we are forced to monomorphize some amount of code. This type\nparameter is fairly infectious, and all the \"big\" backend\ntypes (`PulleyBackend<P>`, `PulleyABICallSite<P>`, etc...) are parameterized\nover it. Nonetheless, not everything is parameterized over a `PulleyTargetKind`,\nand we manage to avoid duplicate `MInst` definitions and lowering code.\n\nNote that many methods are still stubbed out with `todo!`s. It is expected that\nwe will fill in those implementations as the work on Pulley progresses.\n\n* Trust the `pulley-interpreter` crate, as it is part of our workspace\n\n* fix some clippy warnings\n\n* Fix a dead-code warning from inside generated code\n\n* Use a helper for emitting br_if+comparison instructions\n\n* Add a helper for converting `Reg` to `pulley_interpreter::XReg`\n\n* Add version to pulley workspace dependency\n\n* search the pulley directory for crates in the publish script","shortMessageHtmlLink":"Cranelift: Add a new backend for emitting Pulley bytecode (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2454585588\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/9089\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/9089/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/9089\">bytecodeal…</a>"}},{"before":"7ac3fda7f25d3e6efc53181e45db309b63465350","after":"ba864e987ef1ab87c439ca6b396264547d2425e1","ref":"refs/heads/main","pushedAt":"2024-08-01T17:22:00.000Z","pushType":"push","commitsCount":117,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"docs: move wasi-keyvalue proposal to tier 3 (#9050)","shortMessageHtmlLink":"docs: move wasi-keyvalue proposal to tier 3 (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2439323781\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/9050\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/9050/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/9050\">bytecodealliance#9050</a>)"}},{"before":"601a96d1a48cc28b7ae3d8f2818963d05abc99c3","after":null,"ref":"refs/heads/remove-indirect-call-cache","pushedAt":"2024-06-27T17:36:37.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"}},{"before":"92930d5a6fefa9768e4e5e7307b6c7ae57461f6c","after":"601a96d1a48cc28b7ae3d8f2818963d05abc99c3","ref":"refs/heads/remove-indirect-call-cache","pushedAt":"2024-06-27T17:12:27.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Wasmtime: remove indirect-call caching.\n\nIn the original development of this feature, guided by JS AOT\ncompilation to Wasm of a microbenchmark heavily focused on IC sites, I\nwas seeing a ~20% speedup. However, in more recent measurements, on full\nprograms (e.g., the Octane benchmark suite), the benefit is more like\n5%.\n\nMoreover, in #8870, I attempted to switch over to a direct-mapped cache,\nto address a current shortcoming of the design, namely that it has a\nhard-capped number of callsites it can apply to (50k) to limit impact on\nVMContext struct size. With all of the needed checks for correctness,\nthough, that change results in a 2.5% slowdown relative to no caching at\nall, so it was dropped.\n\nIn the process of thinking through that, I discovered the current design\non `main` incorrectly handles null funcrefs: it invokes a null code pointer,\nrather than loading a field from a null struct pointer. The latter was\nspecifically designed to cause the necessary Wasm trap in #8159, but I\nhad missed that the call to a null code pointer would not have the same\neffect. As a result, we actually can crash the VM (safely at least, but\nstill no good vs. a proper Wasm trap!) with the feature enabled. (It's\noff by default still.) That could be fixed too, but at this point with\nthe small benefit on real programs, together with the limitation on\nmodule size for full benefit, I think I'd rather opt for simplicity and\nremove the cache entirely.\n\nThus, this PR removes call-indirect caching. It's not a direct revert\nbecause the original PR refactored the call-indirect generation into\nsmaller helpers and IMHO it's a bit nicer to keep that. But otherwise\nall traces of the setting, code pre-scan during compilation and special\nconditions tracked on tables, and codegen changes are gone.","shortMessageHtmlLink":"Wasmtime: remove indirect-call caching."}},{"before":"f1125ab512508cb11051fe761c553d1f3877e573","after":"92930d5a6fefa9768e4e5e7307b6c7ae57461f6c","ref":"refs/heads/remove-indirect-call-cache","pushedAt":"2024-06-27T15:58:19.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Wasmtime: remove indirect-call caching.\n\nIn the original development of this feature, guided by JS AOT\ncompilation to Wasm of a microbenchmark heavily focused on IC sites, I\nwas seeing a ~20% speedup. However, in more recent measurements, on full\nprograms (e.g., the Octane benchmark suite), the benefit is more like\n5%.\n\nMoreover, in #8870, I attempted to switch over to a direct-mapped cache,\nto address a current shortcoming of the design, namely that it has a\nhard-capped number of callsites it can apply to (50k) to limit impact on\nVMContext struct size. With all of the needed checks for correctness,\nthough, that change results in a 2.5% slowdown relative to no caching at\nall, so it was dropped.\n\nIn the process of thinking through that, I discovered the current design\non `main` incorrectly handles null funcrefs: it invokes a null code pointer,\nrather than loading a field from a null struct pointer. The latter was\nspecifically designed to cause the necessary Wasm trap in #8159, but I\nhad missed that the call to a null code pointer would not have the same\neffect. As a result, we actually can crash the VM (safely at least, but\nstill no good vs. a proper Wasm trap!) with the feature enabled. (It's\noff by default still.) That could be fixed too, but at this point with\nthe small benefit on real programs, together with the limitation on\nmodule size for full benefit, I think I'd rather opt for simplicity and\nremove the cache entirely.\n\nThus, this PR removes call-indirect caching. It's not a direct revert\nbecause the original PR refactored the call-indirect generation into\nsmaller helpers and IMHO it's a bit nicer to keep that. But otherwise\nall traces of the setting, code pre-scan during compilation and special\nconditions tracked on tables, and codegen changes are gone.","shortMessageHtmlLink":"Wasmtime: remove indirect-call caching."}},{"before":null,"after":"f1125ab512508cb11051fe761c553d1f3877e573","ref":"refs/heads/remove-indirect-call-cache","pushedAt":"2024-06-27T15:50:43.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Wasmtime: remove indirect-call caching.\n\nIn the original development of this feature, guided by JS AOT\ncompilation to Wasm of a microbenchmark heavily focused on IC sites, I\nwas seeing a ~20% speedup. However, in more recent measurements, on full\nprograms (e.g., the Octane benchmark suite), the benefit is more like\n5%.\n\nMoreover, in #8870, I attempted to switch over to a direct-mapped cache,\nto address a current shortcoming of the design, namely that it has a\nhard-capped number of callsites it can apply to (50k) to limit impact on\nVMContext struct size. With all of the needed checks for correctness,\nthough, that change results in a 2.5% slowdown relative to no caching at\nall, so it was dropped.\n\nIn the process of thinking through that, I discovered the current design\non `main` incorrectly handles null funcrefs: it invokes a null code pointer,\nrather than loading a field from a null struct pointer. The latter was\nspecifically designed to cause the necessary Wasm trap in #8159, but I\nhad missed that the call to a null code pointer would not have the same\neffect. As a result, we actually can crash the VM (safely at least, but\nstill no good vs. a proper Wasm trap!) with the feature enabled. (It's\noff by default still.) That could be fixed too, but at this point with\nthe small benefit on real programs, together with the limitation on\nmodule size for full benefit, I think I'd rather opt for simplicity and\nremove the cache entirely.\n\nThus, this PR removes call-indirect caching. It's not a direct revert\nbecause the original PR refactored the call-indirect generation into\nsmaller helpers and IMHO it's a bit nicer to keep that. But otherwise\nall traces of the setting, code pre-scan during compilation and special\nconditions tracked on tables, and codegen changes are gone.","shortMessageHtmlLink":"Wasmtime: remove indirect-call caching."}},{"before":"f4b49b8c8942523b5bc44d09aa455fadaf848e41","after":"7ac3fda7f25d3e6efc53181e45db309b63465350","ref":"refs/heads/main","pushedAt":"2024-06-27T15:30:39.000Z","pushType":"push","commitsCount":2,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Initial `f16` and `f128` support (#8860)","shortMessageHtmlLink":"Initial <code>f16</code> and <code>f128</code> support (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2367291014\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/8860\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/8860/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/8860\">bytecodealliance#8860</a>)"}},{"before":"9cd2218e95f863e72e0bc672c468ea3b4e46bc29","after":"9c528bb2fc8bc3bdcdb3ca4808e05a40dca022b5","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-27T05:34:44.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Test updates.","shortMessageHtmlLink":"Test updates."}},{"before":"0d6d1c0d0d7678407fe1935b508c0ba03ddab77f","after":"9cd2218e95f863e72e0bc672c468ea3b4e46bc29","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-27T05:32:59.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Fixes: (i) tag cache entries with signature ID as well; (ii) handle null code pointers.","shortMessageHtmlLink":"Fixes: (i) tag cache entries with signature ID as well; (ii) handle n…"}},{"before":"13b9f2378b9de6994a4e978569492c7545a63823","after":"0d6d1c0d0d7678407fe1935b508c0ba03ddab77f","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-25T03:17:48.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nThe downside to caching indexed on callee rather than callsite is that\nif there are a large number of callees, we can expect more cache\nconflicts and hence misses. (If funcref table indices 1 and 1025 are\nboth frequently called, a 1024-entry direct-mapped cache will thrash.)\nBut I expect with ICs in particular to have a lot of callsites and\nrelatively few (shared) callees.\n\nOn Octane-compiled-to-Wasm with my JS AOT compilation tooling using\n`call_indirect` for all ICs, I see: baseline score (higher is better,\nproportional to runtime speed) of 2406, score with old\none-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured on a\nmicrobenchmark (20%) -- at 5% on all of Octane -- but that's still worth\nit, IMHO.)\n\n(*) Note that slots are not actually contiguous: I did a\n    struct-of-arrays trick, separating cache tags from cache values, so\n    that the assembly lowering can use scaling amodes (`vmctx + offset +\n    4*idx` for u32 accesses, and `8*idx` for u64 accesses) for more\n    efficient code.","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"d5a2e06600fd5dedb19af713165003acaf0dbffc","after":"13b9f2378b9de6994a4e978569492c7545a63823","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-25T02:05:56.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nThe downside to caching indexed on callee rather than callsite is that\nif there are a large number of callees, we can expect more cache\nconflicts and hence misses. (If funcref table indices 1 and 1025 are\nboth frequently called, a 1024-entry direct-mapped cache will thrash.)\nBut I expect with ICs in particular to have a lot of callsites and\nrelatively few (shared) callees.\n\nOn Octane-compiled-to-Wasm with my JS AOT compilation tooling using\n`call_indirect` for all ICs, I see: baseline score (higher is better,\nproportional to runtime speed) of 2406, score with old\none-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured on a\nmicrobenchmark (20%) -- at 5% on all of Octane -- but that's still worth\nit, IMHO.)\n\n(*) Note that slots are not actually contiguous: I did a\n    struct-of-arrays trick, separating cache tags from cache values, so\n    that the assembly lowering can use scaling amodes (`vmctx + offset +\n    4*idx` for u32 accesses, and `8*idx` for u64 accesses) for more\n    efficient code.","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"ce96dcb71cc99c82ad1ce6ee3bb202652c980493","after":"d5a2e06600fd5dedb19af713165003acaf0dbffc","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-25T02:05:01.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nThe downside to caching indexed on callee rather than callsite is that\nif there are a large number of callees, we can expect more cache\nconflicts and hence misses. (If funcref table indices 1 and 1025 are\nboth frequently called, a 1024-entry direct-mapped cache will thrash.)\nBut I expect with ICs in particular to have a lot of callsites and\nrelatively few (shared) callees.\n\nOn Octane-compiled-to-Wasm with my JS AOT compilation tooling using\n`call_indirect` for all ICs, I see: baseline score (higher is better,\nproportional to runtime speed) of 2406, score with old\none-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured on a\nmicrobenchmark (20%) -- at 5% on all of Octane -- but that's still worth\nit, IMHO.)\n\n(*) Note that slots are not actually contiguous: I did a\n    struct-of-arrays trick, separating cache tags from cache values, so\n    that the assembly lowering can use scaling amodes (`vmctx + offset +\n    4*idx` for u32 accesses, and `8*idx` for u64 accesses) for more\n    efficient code.","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"753fc5c7d60212a45bc2867144e0439a2ae4f46b","after":"ce96dcb71cc99c82ad1ce6ee3bb202652c980493","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-25T02:01:29.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nThe downside to caching indexed on callee rather than callsite is that\nif there are a large number of callees, we can expect more cache\nconflicts and hence misses. (If funcref table indices 1 and 1025 are\nboth frequently called, a 1024-entry direct-mapped cache will thrash.)\nBut I expect with ICs in particular to have a lot of callsites and\nrelatively few (shared) callees.\n\nOn Octane-compiled-to-Wasm with my JS AOT compilation tooling using\n`call_indirect` for all ICs, I see: baseline score (higher is better,\nproportional to runtime speed) of 2406, score with old\none-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured on a\nmicrobenchmark (20%) -- at 5% on all of Octane -- but that's still worth\nit, IMHO.)\n\n(*) Note that slots are not actually contiguous: I did a\n    struct-of-arrays trick, separating cache tags from cache values, so\n    that the assembly lowering can use scaling amodes (`vmctx + offset +\n    4*idx` for u32 accesses, and `8*idx` for u64 accesses) for more\n    efficient code.","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"339d1fc5ea934142f7a0ed64af8fd7f3acd10c5a","after":"753fc5c7d60212a45bc2867144e0439a2ae4f46b","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-25T01:55:15.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots(*) (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nThe downside to caching indexed on callee rather than callsite is that\nif there are a large number of callees, we can expect more cache\nconflicts and hence misses. (If funcref table indices 1 and 1025 are\nboth frequently called, a 1024-entry direct-mapped cache will thrash.)\nBut I expect with ICs in particular to have a lot of callsites and\nrelatively few (shared) callees.\n\nOn Octane-compiled-to-Wasm with my JS AOT compilation tooling using\n`call_indirect` for all ICs, I see: baseline score (higher is better,\nproportional to runtime speed) of 2406, score with old\none-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured on a\nmicrobenchmark (20%) -- at 5% on all of Octane -- but that's still worth\nit, IMHO.)\n\n(*) Note that slots are not actually contiguous: I did a\n    struct-of-arrays trick, separating cache tags from cache values, so\n    that the assembly lowering can use scaling amodes (`vmctx + offset +\n    4*idx` for u32 accesses, and `8*idx` for u64 accesses) for more\n    efficient code.","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"f7b0deb041c8bf86c8d9b44bc616996b22674ece","after":"339d1fc5ea934142f7a0ed64af8fd7f3acd10c5a","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-25T01:49:40.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nOn Octane-compiled-to-Wasm with my JS AOT stuff, I see: baseline score\n(higher is better, proportional to runtime speed) of 2406, score with\nold one-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured -- at\n5% now -- likely due to a bunch of other optimizations I've made\nelsewhere -- but that's still worth it IMHO.)","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"7c78fbb6be01a0cf2e5112806c66d91fd72d3f8a","after":"f7b0deb041c8bf86c8d9b44bc616996b22674ece","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-25T01:49:28.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nOn Octane-compiled-to-Wasm with my JS AOT stuff, I see: baseline score\n(higher is better, proportional to runtime speed) of 2406, score with\nold one-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured -- at\n5% now -- likely due to a bunch of other optimizations I've made\nelsewhere -- but that's still worth it IMHO.)","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":null,"after":"7c78fbb6be01a0cf2e5112806c66d91fd72d3f8a","ref":"refs/heads/direct-mapped-indirect-cache","pushedAt":"2024-06-24T23:23:09.000Z","pushType":"branch_creation","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nOn Octane-compiled-to-Wasm with my JS AOT stuff, I see: baseline score\n(higher is better, proportional to runtime speed) of 2406, score with\nold one-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured -- at\n5% now -- likely due to a bunch of other optimizations I've made\nelsewhere -- but that's still worth it IMHO.)","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"85af0b8721b9fc4f4ead3659124f5e285f380e20","after":"f4b49b8c8942523b5bc44d09aa455fadaf848e41","ref":"refs/heads/main","pushedAt":"2024-06-24T23:23:02.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"riscv64: Dynamically emit islands for return calls (#8868)\n\n* riscv64: Increase max inst size\n\n* riscv64: Emit islands in return call sequence\n\n* riscv64: Update worst case size tests\n\nHaving duplicate registers was preventing\nsome moves from being generated","shortMessageHtmlLink":"riscv64: Dynamically emit islands for return calls (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2370738242\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/8868\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/8868/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/8868\">bytecodealliance#…</a>"}},{"before":"79146f0213dbfd319eb08bb7cc82d00b1612ba0e","after":"85af0b8721b9fc4f4ead3659124f5e285f380e20","ref":"refs/heads/main","pushedAt":"2024-06-24T23:19:59.000Z","pushType":"push","commitsCount":96,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nOn Octane-compiled-to-Wasm with my JS AOT stuff, I see: baseline score\n(higher is better, proportional to runtime speed) of 2406, score with\nold one-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured -- at\n5% now -- likely due to a bunch of other optimizations I've made\nelsewhere -- but that's still worth it IMHO.)","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"00d3928f6f595cfec72544b370ef7c7e368368a6","after":"5f8d8704e052958aba3cfa18e82bbe4893adfcf7","ref":"refs/heads/experiment-fast-calls","pushedAt":"2024-06-24T23:10:39.000Z","pushType":"force_push","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.\n\nCurrently, the indirect-call cache operates on a basis of *one slot per\ncallsite*: each `call_indirect` instruction in a module, up to a limit,\nhas its own slot of storage in the VM context struct that caches a\ncalled-table-index to called-raw-function-pointer mapping.\n\nThis is fine but means the storage requirement scales with the module\nsize; hence \"up to a limit\" above. It also means that each callsite\nneeds to \"warm up\" separately, whereas we could in theory reuse the\nresolved index->code pointer mapping for the same index across\ncallsites.\n\nThis PR switches instead to a \"direct-mapped cache\": that is, we have a\nfixed number of cache slots per table per instance, of user-configurable\ncount, and we look in a slot selected by the called table index (modulo\nthe cache size). As before, if the \"tag\" (cache key, called table index)\nmatches, we use the \"value\" (raw code pointer).\n\nThe main advantage of this scheme, and my motivation for making the\nswitch, is that the storage size is fixed and quite small, even for\narbitrarily-large modules: for example, on a 64-bit platform with\n12-byte slots (4-byte key, 8-byte resolved pointer), for a module with\none funcptr table, a 1K-entry cache uses 12KiB per instance. That's much\nsmaller than the total VMFuncRef array size in large modules and should\nbe no problem. My goal in getting to this constant size offset is that\nturning this on by default eventually will be easier to justify, and\nthat we won't have unexpected perf cliffs for callsites beyond a certain\nindex.\n\nThis also means that if one callsite resolves index 23 to some raw code\npointer, other callsites that call index 23 also receive a \"hit\" from\nthat warmup. This could be beneficial when there are many callsites but\na relatively smaller pool of called functions (e.g., ICs).\n\nOn Octane-compiled-to-Wasm with my JS AOT stuff, I see: baseline score\n(higher is better, proportional to runtime speed) of 2406, score with\nold one-entry-per-callsite scheme of 2479, score with this scheme of\n2509. So it's slightly faster as well, probably due to a combination of\nthe warmup benefit and a smaller cache footprint, even with the more\ninvolved logic to compute the slot address. (This also tells me the\nbenefit of this cache is smaller than I had originally measured -- at\n5% now -- likely due to a bunch of other optimizations I've made\nelsewhere -- but that's still worth it IMHO.)","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"6144fe450f329d34cbdfc0c1635549f48e768841","after":"00d3928f6f595cfec72544b370ef7c7e368368a6","ref":"refs/heads/experiment-fast-calls","pushedAt":"2024-06-24T23:02:11.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Switch to direct-mapped indirect-call cache.","shortMessageHtmlLink":"Switch to direct-mapped indirect-call cache."}},{"before":"813753921fbfb45aa860620183484c419a7edc32","after":"79146f0213dbfd319eb08bb7cc82d00b1612ba0e","ref":"refs/heads/main","pushedAt":"2024-05-31T05:24:37.000Z","pushType":"push","commitsCount":40,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"riscv64: Special-case `f32const 0` and `f64const 0` (#8701)\n\n* riscv64: Special-case `f32const 0` and `f64const 0`\n\nThis commit is inspired by discussion on #8695 which made me remember\nthe discussion around #7162 historically. In lieu of a deeper fix for\nthe issue of \"why can't `iconst 0` use `(zero_reg)`\" it's still possible\nto add special-cases to rules throughout the backend so this commit does\nthat for generating zero-value floats.\n\n* Fix tests\n\n* Run all tests on CI\n\nprtest:full","shortMessageHtmlLink":"riscv64: Special-case <code>f32const 0</code> and <code>f64const 0</code> (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2321813919\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/8701\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/8701/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/8701\">bytecodealliance…</a>"}},{"before":"91ec9a589cc6c7f031ef4cacdb295331c07b6063","after":"813753921fbfb45aa860620183484c419a7edc32","ref":"refs/heads/main","pushedAt":"2024-05-20T15:13:37.000Z","pushType":"push","commitsCount":17,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"Update docs to match new logging env var (#8656)","shortMessageHtmlLink":"Update docs to match new logging env var (<a class=\"issue-link js-issue-link\" data-error-text=\"Failed to load title\" data-id=\"2305809098\" data-permission-text=\"Title is private\" data-url=\"https://github.com/bytecodealliance/wasmtime/issues/8656\" data-hovercard-type=\"pull_request\" data-hovercard-url=\"/bytecodealliance/wasmtime/pull/8656/hovercard\" href=\"https://github.com/bytecodealliance/wasmtime/pull/8656\">bytecodealliance#8656</a>)"}},{"before":"1370b71f7d1b791a024e91bf82f998c309004289","after":null,"ref":"refs/heads/stackslot-alignment","pushedAt":"2024-05-16T17:02:51.000Z","pushType":"branch_deletion","commitsCount":0,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"}},{"before":"6aeb4d5dc9d1c63ed27452690ab54f800277d3e9","after":"1370b71f7d1b791a024e91bf82f998c309004289","ref":"refs/heads/stackslot-alignment","pushedAt":"2024-05-16T16:26:36.000Z","pushType":"push","commitsCount":1,"pusher":{"login":"cfallin","name":"Chris Fallin","path":"/cfallin","primaryAvatarUrl":"https://avatars.githubusercontent.com/u/216148?s=80&v=4"},"commit":{"message":"cargo-fmt from suggestion update.","shortMessageHtmlLink":"cargo-fmt from suggestion update."}}],"hasNextPage":true,"hasPreviousPage":false,"activityType":"all","actor":null,"timePeriod":"all","sort":"DESC","perPage":30,"cursor":"djE6ks8AAAAEr9nnmAA","startCursor":null,"endCursor":null}},"title":"Activity · cfallin/wasmtime"}