Changing an immediate operand from 0 to 1 causes a suspicious performance decrease #8706
Comments
Hi @hungryzzz -- while we appreciate these reports, they seem to be generated by some infrastructure and "mass-produced", so to speak, so to help us out, it may be good to do a little bit more work. In particular: for each performance-delta report (including this one), would you be able to reduce the report to assembly, show us the difference in generated assembly, and write a microbenchmark in assembly that isolates the performance effect and shows it to be valid (as I mentioned in a previous comment on a very similar issue)?
Hi @cfallin, I wrote a microbenchmark in assembly based on the generated machine code of the two cases. The execution times of `asm_good` and `asm_bad` are as follows:

```
➜ binary-asm time ./asm_good
./asm_good 1.52s user 0.00s system 99% cpu 1.524 total
➜ binary-asm time ./asm_bad
./asm_bad 3.00s user 0.00s system 99% cpu 3.004 total
```
Thanks @hungryzzz. Unfortunately there's still a nontrivial diff; I had hoped for a little more investigation to point to exactly what the microarchitectural difference was: take the two inner loops, transform one into the other one step at a time, and see where the cliff happens, that sort of thing. To be more direct: most folks here (at least those working on Cranelift full-time) don't have time to dig into a performance anomaly of several hundred lines of assembly with unclear prioritization (i.e., found via fuzzing rather than via a real-use-case report), so for this effort to be successful you'll have to do a little more of the investigation yourself and point to what we can change. Sorry and thanks!
@hungryzzz, another point to keep in mind is the diff represented by these two cases. Incidentally, they run in about the same time on a Ryzen 9 3950X, and I'm downright shocked at the IPC it managed. Here's
vs
I suspect an iterative transformation from one loop to the other probably won't be super illuminating: it will amount to "movs to and from the stack are slow on some machines, and as you add more it gets slower". But @hungryzzz, another approach that would certainly help here is reducing the input wasm program down: if the program were 30 instructions and switching the constant from 0 to 1 still produced significantly different codegen, that would be much easier to investigate.
Hi, when I tried to delete parts of the code in the above bad case, I found that many small deletions could not reproduce this bug, so I think I cannot reduce the bad case to a shorter one. But I found another interesting thing during the reduction. I commented out the following part of the code, as this branch must always be triggered. However, the execution times are quite different after this deletion.
I also find the generated machine code is still quite different (attached). So I guess maybe some analysis during compilation is wrong, resulting in the suboptimal code generation, but I don't know how to investigate this further. If more investigation is needed, I am willing to do it under your guidance.

```wat
block
  ;; i32.const 1
  ;; if
  local.get 6
  local.set 5
  ;; br 1
  ;; end
end
```
Ah, if making that assignment from local 6 to local 5 unconditional causes the performance delta to disappear, likely what is happening here is that you're seeing the lack of constant-branch folding in Cranelift. That's not a "wrong analysis", that's just an optimization we haven't implemented yet (and it's somewhat nontrivial to do in our egraph mid-end, though we have ideas). It's entirely possible that that could cascade into other optimizations (GVN, ...?) applying or not applying, causing more register pressure and the spills you're seeing.
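To illustrate what constant-branch folding would do to the snippet above, here is a minimal Python sketch over a toy IR. This is not Cranelift's actual egraph representation; the instruction tuples and names are hypothetical, chosen only to show the transformation.

```python
# Toy IR: a conditional branch is ("br_if", cond_value, then_target, else_target).
# Constant-branch folding rewrites a branch whose condition is a known constant
# into an unconditional jump, so the dead arm can later be removed entirely.
def fold_constant_branch(inst, known_consts):
    if inst[0] == "br_if":
        _, cond, then_tgt, else_tgt = inst
        if cond in known_consts:
            # wasm/CLIF semantics: nonzero condition takes the branch
            taken = then_tgt if known_consts[cond] != 0 else else_tgt
            return ("jump", taken)
    return inst  # condition unknown (or not a branch): leave unchanged

# e.g. the `i32.const 1` feeding the `if` in the snippet above:
known = {"v1": 1}
print(fold_constant_branch(("br_if", "v1", "block_then", "block_else"), known))
# -> ("jump", "block_then")
```

Once the branch is unconditional, the untaken block becomes unreachable, which is what lets downstream passes (GVN, regalloc) see a smaller, simpler function.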
So the only way to validate this guess is to implement constant-branch folding in Cranelift, right? But then how do we explain the performance difference shown at first?
The direct cause of the perf delta seems to be the additional spills, as @iximeow pointed out.
I got it, thank you! But I still cannot understand why the two cases shown at first would have such different machine code... so I continued to make some changes. This time I change the value which will be set to global variable [1]:

```
➜ cases diff bad.wat good.wat
124c124
< i32.const -65537
---
> i32.const 65536
```

```wat
;; part of bad.wat
i32.const -65537
global.set 1

;; part of good.wat
i32.const 65536
global.set 1
```

I have some confusions and findings about these cases. Therefore, I still think there is something wrong during the compilation of these cases.
@cfallin Hi, I went through the generated machine code and I think I understand something about the performance difference. I find that if the value to be set to global variable [1] is a negative number, then the compiler uses two instructions to represent the store:

```asm
# part of the machine code of bad.wasm
mov $0xfffeffff,%r10d
mov %r10d,0xa0(%rdi)

# part of the machine code of good.wasm
movl $0x10000,0xa0(%rdi)
```

After all of the above investigation, I guess the root cause of the performance differences is the different register allocation algorithm used by the two runtimes.
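For reference, the two constants from the diff viewed in 32-bit two's complement (a quick Python check; the helper name `to_u32` is mine, not from the codebase):

```python
def to_u32(v):
    # 32-bit two's-complement bit pattern of a Python integer
    return v & 0xFFFF_FFFF

# -65537 is exactly the 0xfffeffff immediate seen in the bad.wasm listing
assert to_u32(-65537) == 0xFFFEFFFF

# Both constants fit in a signed 32-bit immediate, so in principle either
# one could be stored with a single mov-immediate-to-memory instruction.
assert -2**31 <= -65537 < 2**31
assert -2**31 <= 65536 < 2**31

print(hex(to_u32(-65537)))  # 0xfffeffff
```

That is, there is no encoding reason for the negative constant to need an extra register; the two-instruction sequence points at a pattern-matching gap in the lowering rather than a hardware limitation.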
Ah, interesting, there may be something odd about the way we lower stores-of-immediates in the x64 backend that precludes this case. @abrown, do you think you'd be willing to take a look at this? (If true, it should be reproducible with a small CLIF test that stores a constant.)
Sure, I can put this on the TODO list, but no promises about when I get to it!
This commit removes the `simm32` extractor from lowerings as it's not as useful as it was when it was first introduced. Nowadays an `Imm64` needs to be interpreted with the type known as well to understand whether bits being masked off is significant or not. The old `simm32` extractor only took `Imm64` meaning that it was unable to do this and wouldn't match negative numbers. This is because the high 32 bits of `Imm64` were always zero and `simm64` would take the `i64` value from `Imm64` and try to convert it to an `i32`. This commit replaces `simm32`, and uses of it, with a new extractor `i32_from_iconst`. This matches the preexisting `i64_from_iconst` and is able to take the type of the value into account and produce a correctly sign-extended value. cc bytecodealliance#8706
@alexcrichton Hi, I pulled your PR and found that it fixes the above case, thank you! But could I ask why positive and negative numbers were stored in different ways in memory before this PR? And how did you fix it? (I tried to understand this when I went through the PR but failed...) Thanks again!
@hungryzzz the issue was a missed case in the instruction lowering. x86-64 has a store instruction (a form of `mov`) that writes a 32-bit immediate, sign-extended, directly to memory. Before the fix, the `simm32` extractor could not match negative `i32` constants, so the lowering fell back to materializing the constant into a register first and then storing the register.
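Based on the commit message above, the mismatch can be sketched like this. This is a Python model of the two extractors' logic, not Cranelift's actual ISLE code; the function names mirror the extractors named in the commit, but their bodies are my reconstruction.

```python
# An i32 `iconst` stores its value in an Imm64 whose high 32 bits are zero.
def imm64_of_i32(v):
    return v & 0xFFFF_FFFF

# Old `simm32`: read the Imm64 payload as a signed 64-bit value and require
# it to fit in i32. A zero-extended negative i32 (e.g. 0x00000000_FFFEFFFF)
# is a large positive i64, so it never matches.
def old_simm32(bits):
    v = bits - 2**64 if bits >= 2**63 else bits
    return v if -2**31 <= v < 2**31 else None

# New `i32_from_iconst`: mask to the type's width, then sign-extend bit 31,
# recovering the intended negative value.
def i32_from_iconst(bits):
    v = bits & 0xFFFF_FFFF
    return v - 2**32 if v & 0x8000_0000 else v

bits = imm64_of_i32(-65537)   # 0x00000000_FFFEFFFF
print(old_simm32(bits))       # None: pattern fails, so mov-to-register fallback
print(i32_from_iconst(bits))  # -65537: single store-immediate can be emitted
```

This is why only negative constants took the two-instruction path: positive `i32` values round-trip through the `Imm64` unchanged and matched the old extractor fine.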
Got it! Thanks a lot! |
Hi, I am also wondering why requiring just one more register leads to so many additional spills. Would it get worse if more registers were required? Thank you very much! (I really hope my questions aren't taking too much of your time...)
@hungryzzz we're happy to help with questions like these, but the Bytecode Alliance Zulip is probably a better venue for such discussions (which are really about how compilers work rather than about resolving this particular issue). With Alex's fix in #8842, are we OK to close this issue, or do you see any more performance deltas?
The performance anomaly has been fixed, thanks a lot!
Test Cases

cases.zip

Steps to Reproduce

Hi, I run the attached two cases (`good.wasm` & `bad.wasm`) in Wasmtime and WasmEdge (AOT), and collect their execution times respectively (measured by the `time` tool).

Expected Results & Actual Results

For `good.wasm`, the execution times in the different runtimes are as follows. For `bad.wasm`, the execution times in the different runtimes are as follows. The difference between the attached two cases is as follows, i.e., changing one of the operands of `i32.and` from `0` to `1`. This difference brings a 1.8s performance decrease on Wasmtime but has no negative effect on WasmEdge.

I checked the machine code generated by Wasmtime and WasmEdge respectively; the instruction counts of `func2` (the difference is in it) are as follows:

- WasmEdge: `good.wasm`: 89, `bad.wasm`: 92
- Wasmtime: `good.wasm`: 132, `bad.wasm`: 157

I think maybe the difference affects the data-flow analysis during optimization, so some optimization decisions are different, but I am really confused about why the instruction count of the generated machine code could change so much, and I think maybe the extra instructions in `bad.wasm` make it slower than the good one.

I also used the Perf tool to profile the hotspots in Wasmtime, but I could not find anything useful.

Versions and Environment