Improved register promotion in the presence of exception handling #6212
Here's a variant that demonstrates a different issue.
The inner loop comes out as:
And indeed this variant is still a little bit slower than
(and of course @dotnet/jit-contrib)
This is a known issue and we have a work item on our TFS side for it. I will say it's "by design" right now, in that it is intentional, although we would like to improve the design in the future. The current implementation was chosen to be easy to get correct and to stay reliable. I believe it was inherited from the x86 JIT's approach to register allocation in the presence of EH. JIT64, which was based on the C++ compiler, handles this problem in a more sophisticated way, albeit with more complexity and slower compile time.

It is not true in general that you can wrap a section of code in a TRY/CATCH and expect similar performance even when no exception is taken. The compiler must account for the fact that a thrown exception transfers control into the runtime stack walker, and register state is not preserved when the handler is invoked or when flow continues at the continuation after handling an exception. So in general, for any variable live on an exception edge, a store must be inserted to the frame so that its value can be restored. The default assumption is that each instruction executed may transfer control to the handler. Both cases that you mention above are in fact the same case, and this explains the stores that you see.

The reloads, however, are not necessary, and this is an opportunity for improvement in our implementation. It requires more sophisticated algorithms. Improving this involves major design decisions in the compiler's fundamental data-flow representation and model, so given our current ship deliverables we have not tackled it yet. I don't believe this is an area where we can put the work up for grabs. It is something we want to fix in a future release.

One additional point before someone else makes it: in your case it can be proven that no exception occurs in the TRY, so the compiler could eliminate the TRY/CATCH entirely, at which point the JIT could generate fully enregistered code. This is a separate optimization that we also don't do today.

BTW, thank you for the stand-alone benchmark that illustrates the problem. I would be very happy to accept someone adding this as an xunit JIT benchmark for us to run, to keep reminding the team until we get better.
Thank you for that enlightening explanation. I just tried it with […].

It seems to me that the only way to make it fast again would be to track the correspondence of registers and locals and to generate code that spills the regs to stack slots only when […]. Surely, the existence of exception filters does not make this easier. And in turn this seems to make it harder to coalesce stores, because the effect of stores could be visible to an exception filter. Indeed, I was not able to find any case where the JIT coalesces multiple stores to the same location.
I'm not sure how to define "potentially really slow"; it really depends on the workload. I wouldn't suggest wrapping your critical compute-bound inner loops in TRY/FINALLY; rather, nest them in an outer method. But for most cases I would typically tend to think of it as a small overhead that we'll chip away at in the JIT.

Generally people say our EH model is "zero cost" compared to older dynamic EH models that require code overhead on entry to and exit from TRY regions (remember SETJMP/LONGJMP-based EH models from back in the day). The 64-bit Windows model is static, and so generally cheaper at runtime.

Yes, finally/catch/filter/using all present similar complications to the JIT optimizer. Experience suggests the most critical thing is to make sure we can avoid unnecessary reloads in the TRY bodies, since loads typically cause latency. While stores do take up memory ports and instruction-encoding space in the cache, they have near-zero latency. So focusing on store coalescing is typically a 2nd- or 3rd-order impact.
Makes sense. Thank you! These information tidbits are really valuable.
I get the following on main today, on a 5950X:

Handler0: 426.4ms
Handler1: 434.2ms

But it still doesn't look like EH write-thru kicks in (diffs), so maybe we should keep this open anyway? Will leave it up to @kunalspathak
Here's a micro benchmark to test the performance effect of an exception handler. The handler is never executed; it's just there.

The slowdown is quite severe (6x). Inspecting the disassembly, I see that the inner loop in Handler1 spills to the stack after each arithmetic operation. This is so strange and undesirable that I'm reporting it as a bug.

This is .NET Desktop, 4.6.1, x64, Release, no debugger attached.
category:cq
theme:eh
skill-level:expert
cost:extra-large