
Revert "Enable cetcompat for corerun" #103654

Closed

Conversation

EgorBo (Member) commented Jun 18, 2024

Let's see if reverting #103311 helps fix the massive regressions reported by dotnet/perf-autofiling-issues#36619.

dotnet-issue-labeler bot added the needs-area-label label Jun 18, 2024
EgorBo (Member, Author) commented Jun 18, 2024

Ah, my bot is Linux-only; I will check locally.

EgorBo (Member, Author) commented Jun 18, 2024

Yes, looks like #103311 caused this:

I've just reproduced it locally on this revert. @mangod9 @janvorli @jkotas any idea why?

If I read it correctly, it only enables CET for the corerun binary, so presumably it should not affect the VM/GC?

mangod9 (Member) commented Jun 18, 2024

Interesting. Is this only affecting the Collections benchmarks?

EgorBo (Member, Author) commented Jun 18, 2024

> Interesting. Is this only affecting the Collections benchmarks?

It looks like it affects more than half of all the micro-benchmarks we have, of different kinds; e.g. I was testing it on Perf_Version and IfStatements.

EgorBo (Member, Author) commented Jun 18, 2024

It's possible that it affected BDN itself, I guess.

jkotas (Member) commented Jun 18, 2024

CET makes the processor do more work, so it will make things run slower. It is by design. https://www.bing.com/search?q=Control+flow+enforcement+technology+performace+overhead

EgorBo (Member, Author) commented Jun 18, 2024

> CET makes the processor do more work, so it will make things run slower. It is by design. https://www.bing.com/search?q=Control+flow+enforcement+technology+performace+overhead

Yeah, I know; it's just that I was under the impression that it's only enabled for the corerun binary and shouldn't affect perf at all.

jkotas (Member) commented Jun 18, 2024

CET is enabled process wide. It is not a per-dll setting.

EgorBo (Member, Author) commented Jun 18, 2024

> CET is enabled process wide. It is not a per-dll setting.

Ok, do we accept the regressions then? So far I don't see any impact on TE, except startup-time regressions for some.

EgorBo (Member, Author) commented Jun 18, 2024

Unless aspnet uses a different core host.

mangod9 (Member) commented Jun 18, 2024

It was also recently enabled for apphost. I guess we should make it the new baseline.

We should check whether TE runs with dotnet or apphost, since dotnet is not CET compliant yet.

jkotas (Member) commented Jun 18, 2024

> Ok, do we accept the regressions then?

I do not see any other option. CET is going to be a hard requirement for all MS software eventually.

EgorBo closed this Jun 18, 2024
EgorBo (Member, Author) commented Jun 18, 2024

> I do not see any other option. CET is going to be a hard requirement for all MS software eventually.

Does this mean CET will eventually be enabled for the jitted code by default as well?

mangod9 (Member) commented Jun 18, 2024

Well, it has been supported for all runtime scenarios (on Windows) since .NET 7; we have now enabled it for apphosts, so explicit enablement isn't required per app.

jkotas (Member) commented Jun 18, 2024

> Does this mean CET will eventually be enabled for the jitted code by default as well?

Windows only uses the shadow-stack part of the CET hardware feature. The shadow stack is enabled process-wide, so JITed code will get it too if it is enabled for the process.

mangod9 (Member) commented Jun 18, 2024

I think we should leave the issue open; the regression feels rather high, so we should look at profiles to determine what is contributing to the increase.

VSadov (Member) commented Jul 5, 2024

I see similar extra costs in NativeAOT. With CET enabled, there is an impact on code that just does a lot of tiny calls.

My benchmark:

    using System;
    using System.Diagnostics;

    internal class Program
    {
        static void Main(string[] args)
        {
            for (int i = 10; i < 100; i++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                var result = Fib(i);
                sw.Stop();

                Console.Write("n: " + i);
                Console.Write(", time: " + sw.ElapsedMilliseconds);
                Console.WriteLine("ms,  result= " + result);
            }
        }

        static long Fib(long i)
        {
            if (i <= 1)
            {
                return 1;
            }

            return Fib(i - 2) + Fib(i - 1);
        }
    }

I am using the bits compiled off the current main.

When I compile with default settings, I get

n: 10, time: 0ms,  result= 89
n: 11, time: 0ms,  result= 144
n: 12, time: 0ms,  result= 233
n: 13, time: 0ms,  result= 377
n: 14, time: 0ms,  result= 610
n: 15, time: 0ms,  result= 987
n: 16, time: 0ms,  result= 1597
n: 17, time: 0ms,  result= 2584
n: 18, time: 0ms,  result= 4181
n: 19, time: 0ms,  result= 6765
n: 20, time: 0ms,  result= 10946
n: 21, time: 0ms,  result= 17711
n: 22, time: 0ms,  result= 28657
n: 23, time: 0ms,  result= 46368
n: 24, time: 0ms,  result= 75025
n: 25, time: 0ms,  result= 121393
n: 26, time: 0ms,  result= 196418
n: 27, time: 0ms,  result= 317811
n: 28, time: 1ms,  result= 514229
n: 29, time: 1ms,  result= 832040
n: 30, time: 3ms,  result= 1346269
n: 31, time: 5ms,  result= 2178309
n: 32, time: 8ms,  result= 3524578
n: 33, time: 13ms,  result= 5702887
n: 34, time: 22ms,  result= 9227465
n: 35, time: 36ms,  result= 14930352
n: 36, time: 58ms,  result= 24157817
n: 37, time: 94ms,  result= 39088169
n: 38, time: 153ms,  result= 63245986
n: 39, time: 248ms,  result= 102334155
n: 40, time: 402ms,  result= 165580141
n: 41, time: 650ms,  result= 267914296
n: 42, time: 1055ms,  result= 433494437
n: 43, time: 1703ms,  result= 701408733
n: 44, time: 2763ms,  result= 1134903170
n: 45, time: 4467ms,  result= 1836311903
n: 46, time: 7233ms,  result= 2971215073
n: 47, time: 11694ms,  result= 4807526976
n: 48, time: 18929ms,  result= 7778742049
n: 49, time: 30632ms,  result= 12586269025

If I compile with <CETCompat>false</CETCompat>, I get:

n: 10, time: 0ms,  result= 89
n: 11, time: 0ms,  result= 144
n: 12, time: 0ms,  result= 233
n: 13, time: 0ms,  result= 377
n: 14, time: 0ms,  result= 610
n: 15, time: 0ms,  result= 987
n: 16, time: 0ms,  result= 1597
n: 17, time: 0ms,  result= 2584
n: 18, time: 0ms,  result= 4181
n: 19, time: 0ms,  result= 6765
n: 20, time: 0ms,  result= 10946
n: 21, time: 0ms,  result= 17711
n: 22, time: 0ms,  result= 28657
n: 23, time: 0ms,  result= 46368
n: 24, time: 0ms,  result= 75025
n: 25, time: 0ms,  result= 121393
n: 26, time: 0ms,  result= 196418
n: 27, time: 0ms,  result= 317811
n: 28, time: 0ms,  result= 514229
n: 29, time: 1ms,  result= 832040
n: 30, time: 2ms,  result= 1346269
n: 31, time: 4ms,  result= 2178309
n: 32, time: 6ms,  result= 3524578
n: 33, time: 10ms,  result= 5702887
n: 34, time: 17ms,  result= 9227465
n: 35, time: 27ms,  result= 14930352
n: 36, time: 45ms,  result= 24157817
n: 37, time: 73ms,  result= 39088169
n: 38, time: 118ms,  result= 63245986
n: 39, time: 192ms,  result= 102334155
n: 40, time: 312ms,  result= 165580141
n: 41, time: 504ms,  result= 267914296
n: 42, time: 817ms,  result= 433494437
n: 43, time: 1322ms,  result= 701408733
n: 44, time: 2142ms,  result= 1134903170
n: 45, time: 3465ms,  result= 1836311903
n: 46, time: 5606ms,  result= 2971215073
n: 47, time: 9066ms,  result= 4807526976
n: 48, time: 14679ms,  result= 7778742049
n: 49, time: 23750ms,  result= 12586269025

The no-CET version is about 30% faster on code that is dominated by making calls.

This is on an AMD Ryzen 9 7950X, Windows 10.
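For a quick sanity check on that ~30% figure, the overhead can be computed directly from the n=49 timings in the two tables above (trivial arithmetic; the numbers are copied from the runs):

```python
# Sanity check of the "about 30%" claim using the n=49 timings above.
cet_ms = 30632     # default build (CET enabled), n=49
no_cet_ms = 23750  # <CETCompat>false</CETCompat> build, n=49

overhead = cet_ms / no_cet_ms - 1
print(f"CET overhead: {overhead:.1%}")  # → CET overhead: 29.0%
```

So the CET build is ~29% slower on this call-dominated workload, consistent with the "about 30% faster" observation for the no-CET build.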

EgorBo (Member, Author) commented Jul 5, 2024

@tannergooding mentioned that, technically, new AMD CPUs shouldn't be impacted by this, so perhaps Windows (or something else in the stack) is to blame?

EgorBo (Member, Author) commented Jul 5, 2024

Unfortunately, we don't run TE benchmarks on AMD + Windows to confirm (our only Windows x64 target is Intel).

EgorBo (Member, Author) commented Jul 5, 2024

> This is on AMD Ryzen 9 7950X, Windows10

From https://techcommunity.microsoft.com/t5/windows-os-platform-blog/developer-guidance-for-hardware-enforced-stack-protection/ba-p/2163340

Requirements
You can begin building and testing your application to support Hardware-enforced Stack Protection today, by ensuring you have the following:

Hardware: 11th Gen Intel Core Mobile processors and AMD Zen 3 Core (and newer)

Hardware-enforced Stack Protection capable OS: 19041.622 or 19042.622 and newer versions

mangod9 (Member) commented Jul 5, 2024

Perhaps some new AMD machines are available in Azure. Regardless, this feels like non-trivial overhead. Would be good to ensure TE is not seeing much impact.

VSadov (Member) commented Jul 5, 2024

uProf says nearly all the time is spent in Fib, and with CET the cost attributed to calls is noticeably higher.

I guess that however cheap the extra memory operations may be, doing an extra write when making a call and an extra read (and compare) when returning ends up costing something.

This is with no CET, computing Fib up to 50:

[uProf screenshot]

This is with CET:

[uProf screenshot]
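To make that mechanism concrete, here is a toy software model of what the shadow stack does on every call and return (an illustrative sketch only; the real shadow stack is maintained by the CPU in a separately protected memory region, and a mismatch raises a control-protection fault, not a software exception):

```python
# Toy software model of the CET shadow stack (illustration only).
shadow_stack = []

def on_call(return_addr):
    # The CPU also pushes the return address to the shadow stack:
    # one extra store per call.
    shadow_stack.append(return_addr)

def on_return(return_addr):
    # On return the CPU pops the shadow copy and compares it:
    # one extra load plus a compare per return.
    expected = shadow_stack.pop()
    if expected != return_addr:
        # Real hardware raises a control-protection (#CP) fault here.
        raise RuntimeError("shadow stack mismatch")

on_call(0x1000)    # outer call
on_call(0x2000)    # nested call
on_return(0x2000)  # matches: fine
on_return(0x1000)
print("balanced:", len(shadow_stack) == 0)  # → balanced: True
```

This is why call-dominated code pays the most: the per-call work is fixed, so the denser the calls, the larger the relative overhead.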

VSadov (Member) commented Jul 5, 2024

> Would be good to ensure TE is not seeing much impact.

This is an extreme case: a microbenchmark measuring the cost of calls.

TE benchmarks do a lot of other things, so the cost of calls may be much less impactful, possibly even hidden behind the latencies of other memory accesses.

VSadov (Member) commented Jul 5, 2024

I've tried the same on an Intel machine:

Intel(R) Core(TM) Ultra 7 165H, Windows 11

It is a laptop, and the CPU has 3 kinds of cores: Fast, Efficient and ... something else.
I do not know yet which are which, so I just tried pinning the benchmark to core #2 and core #20.

=== On core #2
No CET:

n: 48, time: 19595ms,  result= 7778742049
n: 49, time: 32292ms,  result= 12586269025

With CET:

n: 48, time: 19426ms,  result= 7778742049
n: 49, time: 31420ms,  result= 12586269025

=== On core #20 (seems to be slower than #2)
No CET:

n: 48, time: 29328ms,  result= 7778742049
n: 49, time: 47464ms,  result= 12586269025

With CET:

n: 48, time: 29320ms,  result= 7778742049
n: 49, time: 47442ms,  result= 12586269025

The impact of CET is much smaller on this machine; in particular, on the slower core CET makes basically no difference.

The non-laptop AMD 7950X is much faster on this benchmark. That could be why it is impacted more: there are more chances that memory/L1 operations cannot keep up with the computation and start limiting throughput.
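For scale, the per-call cost implied by the 7950X numbers can be estimated from the call count: this Fib makes 2*Fib(n) - 1 calls, so the delta between the two n=49 runs can be spread over the calls (rough arithmetic, using the timings reported above):

```python
# Rough per-call CET overhead implied by the 7950X n=49 numbers above.
fib_49 = 12586269025              # result printed by the benchmark
calls = 2 * fib_49 - 1            # this Fib performs 2*Fib(n) - 1 calls
extra_s = (30632 - 23750) / 1000  # CET minus no-CET, in seconds

ns_per_call = extra_s / calls * 1e9
print(f"~{ns_per_call:.2f} ns extra per call/return pair")
# → ~0.27 ns extra per call/return pair
```

At the 7950X's roughly 5 GHz clock that is on the order of 1-1.5 cycles per call/return pair, which is plausible for one extra store plus one extra load-and-compare that usually, but not always, hide in the pipeline.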

janvorli (Member) commented Jul 8, 2024

I chatted with Windows developers about this before my vacation. They told me that the cost of pushing to two stacks is what's showing up here.
One of the .NET benchmarks that shows a 40% regression on my Ryzen 5 CPU invokes the System.Collections.Generic.LinkedList<int>(System.Collections.Generic.IEnumerable<int>) constructor for an enumerable with 512 items. The body of the loop in the constructor that makes the copy looks as follows. As you can see, it is just a sequence of calls to JIT_WriteBarrier with two or three instructions in between.

00007ff9`ca4829e7 e8c484b65f call coreclr!JIT_TrialAllocSFastMP_InlineGetThread (7ffa29feaeb0)
00007ff9`ca4829ec 4c8bf8 mov r15, rax
00007ff9`ca4829ef 498d4f08 lea rcx, [r15+8]
00007ff9`ca4829f3 488bd3 mov rdx, rbx
00007ff9`ca4829f6 e815d6e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca4829fb 45897720 mov dword ptr [r15+20h], r14d
00007ff9`ca4829ff 4c8b6b08 mov r13, qword ptr [rbx+8]
00007ff9`ca482a03 4d85ed test r13, r13
00007ff9`ca482a06 743a je 00007FF9CA482A42
00007ff9`ca482a08 498d4f10 lea rcx, [r15+10h]
00007ff9`ca482a0c 498bd5 mov rdx, r13
00007ff9`ca482a0f e8fcd5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a14 498b5518 mov rdx, qword ptr [r13+18h]
00007ff9`ca482a18 498d4f18 lea rcx, [r15+18h]
00007ff9`ca482a1c e8efd5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a21 498b4d18 mov rcx, qword ptr [r13+18h]
00007ff9`ca482a25 488d4910 lea rcx, [rcx+10h]
00007ff9`ca482a29 498bd7 mov rdx, r15
00007ff9`ca482a2c e8dfd5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a31 498d4d18 lea rcx, [r13+18h]
00007ff9`ca482a35 498bd7 mov rdx, r15
00007ff9`ca482a38 e8d3d5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a3d e94affffff jmp 00007FF9CA48298C
00007ff9`ca482a42 498d4f10 lea rcx, [r15+10h]
00007ff9`ca482a46 498bd7 mov rdx, r15
00007ff9`ca482a49 e8c2d5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a4e 498d4f18 lea rcx, [r15+18h]
00007ff9`ca482a52 498bd7 mov rdx, r15
00007ff9`ca482a55 e8b6d5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a5a 488d4b08 lea rcx, [rbx+8]
00007ff9`ca482a5e 498bd7 mov rdx, r15
00007ff9`ca482a61 e8aad5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier

So it seems that the regression would hurt performance of applications that use this constructor and similar ones for large blocks of input data.

janvorli (Member) commented Jul 8, 2024

This simple asm code also shows a 30-40% regression with /CETCOMPAT:

LEAF_ENTRY EmptyCallee, _TEXT
    ret
LEAF_END EmptyCallee, _TEXT

NESTED_ENTRY CallingLoop, _TEXT
.ENDPROLOG
    mov rbx, 100000000
MainLoop:
    nop
    nop
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    dec rbx
    jnz MainLoop
    ret
NESTED_END  CallingLoop, _TEXT

Interestingly, if I change the EmptyCallee body to contain 6 nops (an approximation of the shortest code path in JIT_WriteBarrier), the regression is about 70% (the perf is unchanged without /CETCOMPAT w.r.t. the original EmptyCallee). So some internal CPU optimization seems to be at play.

github-actions bot locked and limited conversation to collaborators Aug 9, 2024
EgorBo reopened this Nov 12, 2024
EgorBo closed this Nov 12, 2024