Improved register promotion in the presence of exception handling #6212
Here's a variant that demonstrates a different issue.
The inner loop comes out as:
And indeed this variant is still a little bit slower than
(and of course @dotnet/jit-contrib)
This is a known issue and we have a work item on our TFS side for it. I will say it's "by design" right now, in that it is intentional, although we would like to improve the design in the future. The current implementation was chosen to be easy to get correct and to stay reliable. I believe it was inherited from the x86 JIT's approach to register allocation in the presence of EH. JIT64, which was based on the C++ compiler, handles this problem in a more sophisticated way, albeit with more complexity and slower compile time.

It is not true in general that you can wrap a section of code in a TRY/CATCH and expect similar performance even when no exception is taken. The compiler must account for the fact that a thrown exception transfers control into the runtime stack walker, and register state is not preserved when the handler is invoked or when flow continues at the continuation after handling an exception. So in general, for any variable live on an exception edge, a store must be inserted to the frame so that its value can be restored. The default assumption is that each instruction executed may transfer control to the handler. Both cases that you mention above are in fact the same case, and this explains the stores that you see.

The reloads, however, are not necessary, and this is an opportunity for improvement in our implementation. It requires more sophisticated algorithms. Improving this involves major design decisions in the compiler's fundamental data-flow representation and model, so given our current ship deliverables we have not tackled it yet. I don't believe this is an area where we can put the work up for grabs. It is something we want to fix in a future release.

One additional point before someone else makes it: in your case it can be proven that no exception occurs in the TRY, so the compiler could eliminate the TRY/CATCH entirely, at which point the JIT could generate fully enregistered code. This is a separate optimization that we also don't do today.

BTW, thank you for the stand-alone benchmark that illustrates the problem. I would be very happy to accept someone adding this as an xunit JIT benchmark for us to run, to keep reminding the team until we get better.
Thank you for that enlightening explanation. I just tried it with […].

It seems to me that the only way to make it fast again would be to track the correspondence of registers and locals and to generate code that spills the regs to stack slots only when […]. Surely, the existence of exception filters does not make this easier. And in turn this seems to make it harder to coalesce stores, because the effect of stores could be visible to an exception filter. Indeed, I was not able to find any case where the JIT coalesces multiple stores to the same location.
I'm not sure how to define "potentially really slow"; it really depends on the workload. I wouldn't suggest wrapping your critical compute-bound inner loops in TRY/FINALLY; rather, nest them in an outer method. But for most cases I would typically tend to think of it as a small overhead that we'll chip away at in the JIT.

Generally people say our EH model is "zero cost" compared to older dynamic EH models that require code overhead on entry to and exit from TRY regions (remember SETJMP/LONGJMP-based EH models from back in the day). The 64-bit Windows model is static, and so generally cheaper at runtime.

Yes, finally/catch/filter/using all present similar complications to the JIT optimizer. Experience suggests the most critical thing is to make sure we can avoid unnecessary reloads in the TRY bodies, since loads typically cause latency. While stores do take up memory ports and instruction-encoding space in the cache, they have near-zero latency. So focusing on store coalescing is typically a 2nd- or 3rd-order impact.
Makes sense. Thank you! These information tidbits are really valuable.
I get the following on main today, on a 5950X:

Handler0: 426.4ms
Handler1: 434.2ms

But it still doesn't look like EH write-thru kicks in (diffs), so maybe we should keep this open anyway? Will leave it up to @kunalspathak
Here's a micro benchmark to test the performance effect of an exception handler. The handler is never executed; it's just there.

The slowdown is quite severe (6x). Inspecting the disassembly, I see that the inner loop in Handler1 spills to the stack after each arithmetic operation. This is so strange and undesirable that I'm reporting it as a bug.

This is .NET Desktop, 4.6.1, x64, Release, no debugger attached.
category:cq
theme:eh
skill-level:expert
cost:extra-large