
Revert "Enable cetcompat for corerun" #103654

Closed

Conversation

EgorBo (Member) commented Jun 18, 2024

Let's see if reverting #103311 helps fix the massive regressions reported by dotnet/perf-autofiling-issues#36619.

dotnet-issue-labeler bot added the needs-area-label label Jun 18, 2024
EgorBo (Member, Author) commented Jun 18, 2024

Ah, my bot is Linux-only; I will check locally.

EgorBo (Member, Author) commented Jun 18, 2024

Yes, looks like #103311 caused this:

I've just reproduced it locally on this revert. @mangod9 @janvorli @jkotas any idea why?

If I read it correctly, it only enables CET for the corerun binary, so presumably it should not affect the VM/GC?

mangod9 (Member) commented Jun 18, 2024

Interesting. Is this only affecting the Collections benchmarks?

EgorBo (Member, Author) commented Jun 18, 2024

> Interesting. Is this only affecting the Collections benchmarks?

It looks like it affects more than half of all the micro-benchmarks we have, of different kinds; e.g. I was testing it on Perf_Version and IfStatements.

EgorBo (Member, Author) commented Jun 18, 2024

It's possible that it affected BDN itself, I guess.

jkotas (Member) commented Jun 18, 2024

CET makes the processor do more work, so it will make things run slower. It is by design. https://www.bing.com/search?q=Control+flow+enforcement+technology+performace+overhead

EgorBo (Member, Author) commented Jun 18, 2024

> CET makes the processor do more work, so it will make things run slower. It is by design. https://www.bing.com/search?q=Control+flow+enforcement+technology+performace+overhead

Yeah, I know; it's just that I was under the impression that it's only enabled for the corerun binary and shouldn't affect perf at all.

jkotas (Member) commented Jun 18, 2024

CET is enabled process wide. It is not a per-dll setting.

EgorBo (Member, Author) commented Jun 18, 2024

> CET is enabled process wide. It is not a per-dll setting.

Ok, do we accept the regressions then? So far I don't see any impact on TE, except startup-time regressions for some.

EgorBo (Member, Author) commented Jun 18, 2024

Unless aspnet uses a different core host.

mangod9 (Member) commented Jun 18, 2024

It was also recently enabled for apphost. I guess we should make it the new baseline.

We should check whether TE runs with dotnet or apphost, since dotnet is not CET compliant yet.

jkotas (Member) commented Jun 18, 2024

> Ok, do we accept the regressions then?

I do not see any other option. CET is going to be a hard requirement for all MS software eventually.

EgorBo closed this Jun 18, 2024
EgorBo (Member, Author) commented Jun 18, 2024

> I do not see any other option. CET is going to be a hard requirement for all MS software eventually.

Does this mean CET will eventually be enabled for the jitted code by default as well?

mangod9 (Member) commented Jun 18, 2024

Well, it has been supported for all runtime scenarios (on Windows) since .NET 7; we have now enabled it for apphosts, so explicit enablement isn't required per app.

jkotas (Member) commented Jun 18, 2024

> Does this mean CET will eventually be enabled for the jitted code by default as well?

Windows only uses the shadow-stack part of the CET hardware feature. The shadow stack is enabled process-wide, so JITed code will get it too if it is enabled for the process.

mangod9 (Member) commented Jun 18, 2024

I think we should leave the issue open; the regression feels rather high, so we should look at profiles to determine what is contributing to the increase.

VSadov (Member) commented Jul 5, 2024

I see similar extra costs in NativeAOT. With CET enabled, there is an impact on code that just does a lot of tiny calls.

My benchmark:

    using System;
    using System.Diagnostics;

    internal class Program
    {
        static void Main(string[] args)
        {
            for (int i = 10; i < 100; i++)
            {
                Stopwatch sw = Stopwatch.StartNew();
                var result = Fib(i);
                sw.Stop();

                Console.Write("n: " + i);
                Console.Write(", time: " + sw.ElapsedMilliseconds);
                Console.WriteLine("ms,  result= " + result);
            }
        }

        static long Fib(long i)
        {
            if (i <= 1)
            {
                return 1;
            }

            return Fib(i - 2) + Fib(i - 1);
        }
    }

I am using the bits compiled off the current main.

When I compile with default settings, I get

n: 10, time: 0ms,  result= 89
n: 11, time: 0ms,  result= 144
n: 12, time: 0ms,  result= 233
n: 13, time: 0ms,  result= 377
n: 14, time: 0ms,  result= 610
n: 15, time: 0ms,  result= 987
n: 16, time: 0ms,  result= 1597
n: 17, time: 0ms,  result= 2584
n: 18, time: 0ms,  result= 4181
n: 19, time: 0ms,  result= 6765
n: 20, time: 0ms,  result= 10946
n: 21, time: 0ms,  result= 17711
n: 22, time: 0ms,  result= 28657
n: 23, time: 0ms,  result= 46368
n: 24, time: 0ms,  result= 75025
n: 25, time: 0ms,  result= 121393
n: 26, time: 0ms,  result= 196418
n: 27, time: 0ms,  result= 317811
n: 28, time: 1ms,  result= 514229
n: 29, time: 1ms,  result= 832040
n: 30, time: 3ms,  result= 1346269
n: 31, time: 5ms,  result= 2178309
n: 32, time: 8ms,  result= 3524578
n: 33, time: 13ms,  result= 5702887
n: 34, time: 22ms,  result= 9227465
n: 35, time: 36ms,  result= 14930352
n: 36, time: 58ms,  result= 24157817
n: 37, time: 94ms,  result= 39088169
n: 38, time: 153ms,  result= 63245986
n: 39, time: 248ms,  result= 102334155
n: 40, time: 402ms,  result= 165580141
n: 41, time: 650ms,  result= 267914296
n: 42, time: 1055ms,  result= 433494437
n: 43, time: 1703ms,  result= 701408733
n: 44, time: 2763ms,  result= 1134903170
n: 45, time: 4467ms,  result= 1836311903
n: 46, time: 7233ms,  result= 2971215073
n: 47, time: 11694ms,  result= 4807526976
n: 48, time: 18929ms,  result= 7778742049
n: 49, time: 30632ms,  result= 12586269025

If I compile with <CETCompat>false</CETCompat>, I get:

n: 10, time: 0ms,  result= 89
n: 11, time: 0ms,  result= 144
n: 12, time: 0ms,  result= 233
n: 13, time: 0ms,  result= 377
n: 14, time: 0ms,  result= 610
n: 15, time: 0ms,  result= 987
n: 16, time: 0ms,  result= 1597
n: 17, time: 0ms,  result= 2584
n: 18, time: 0ms,  result= 4181
n: 19, time: 0ms,  result= 6765
n: 20, time: 0ms,  result= 10946
n: 21, time: 0ms,  result= 17711
n: 22, time: 0ms,  result= 28657
n: 23, time: 0ms,  result= 46368
n: 24, time: 0ms,  result= 75025
n: 25, time: 0ms,  result= 121393
n: 26, time: 0ms,  result= 196418
n: 27, time: 0ms,  result= 317811
n: 28, time: 0ms,  result= 514229
n: 29, time: 1ms,  result= 832040
n: 30, time: 2ms,  result= 1346269
n: 31, time: 4ms,  result= 2178309
n: 32, time: 6ms,  result= 3524578
n: 33, time: 10ms,  result= 5702887
n: 34, time: 17ms,  result= 9227465
n: 35, time: 27ms,  result= 14930352
n: 36, time: 45ms,  result= 24157817
n: 37, time: 73ms,  result= 39088169
n: 38, time: 118ms,  result= 63245986
n: 39, time: 192ms,  result= 102334155
n: 40, time: 312ms,  result= 165580141
n: 41, time: 504ms,  result= 267914296
n: 42, time: 817ms,  result= 433494437
n: 43, time: 1322ms,  result= 701408733
n: 44, time: 2142ms,  result= 1134903170
n: 45, time: 3465ms,  result= 1836311903
n: 46, time: 5606ms,  result= 2971215073
n: 47, time: 9066ms,  result= 4807526976
n: 48, time: 14679ms,  result= 7778742049
n: 49, time: 23750ms,  result= 12586269025

The no-CET version is about 30% faster on code that is dominated by making calls.

This is on an AMD Ryzen 9 7950X, Windows 10.
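For a quick sanity check on that ~30% figure, the overhead can be computed directly from the n=49 timings in the two tables above (trivial arithmetic; the numbers are copied from the runs):

```python
# Sanity check of the "about 30%" claim using the n=49 timings above.
cet_ms = 30632     # default build (CET enabled), n=49
no_cet_ms = 23750  # <CETCompat>false</CETCompat> build, n=49

overhead = cet_ms / no_cet_ms - 1
print(f"CET overhead: {overhead:.1%}")  # → CET overhead: 29.0%
```

So the CET build is ~29% slower on this call-dominated workload, consistent with the "about 30% faster" observation for the no-CET build.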

EgorBo (Member, Author) commented Jul 5, 2024

@tannergooding mentioned that, technically, new AMD CPUs shouldn't be impacted by this, so perhaps Windows (or something else in the stack) is to blame?

EgorBo (Member, Author) commented Jul 5, 2024

Unfortunately, we don't run TE benchmarks on AMD + Windows to confirm (our only Windows x64 target is Intel).

EgorBo (Member, Author) commented Jul 5, 2024

> This is on AMD Ryzen 9 7950X, Windows10

From https://techcommunity.microsoft.com/t5/windows-os-platform-blog/developer-guidance-for-hardware-enforced-stack-protection/ba-p/2163340

Requirements
You can begin building and testing your application to support Hardware-enforced Stack Protection today, by ensuring you have the following:

Hardware: 11th Gen Intel Core Mobile processors and AMD Zen 3 Core (and newer)

Hardware-enforced Stack Protection capable OS: 19041.622 or 19042.622 and newer versions

mangod9 (Member) commented Jul 5, 2024

Perhaps some new AMD machines are available in Azure. Regardless, this feels like non-trivial overhead. Would be good to ensure TE is not seeing much impact.

VSadov (Member) commented Jul 5, 2024

uProf says nearly all the time is spent in Fib, and with CET the cost attributed to calls is noticeably higher.

I guess that however cheap the extra memory operations may be, doing an extra write when making a call and an extra read (and compare) when returning ends up costing something.

This is with no CET, computing Fib up to 50:

[uProf screenshot]

This is with CET:

[uProf screenshot]
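To make that mechanism concrete, here is a toy software model of what the shadow stack does on every call and return (an illustrative sketch only; the real shadow stack is maintained by the CPU in a separately protected memory region, and a mismatch raises a control-protection fault, not a software exception):

```python
# Toy software model of the CET shadow stack (illustration only).
shadow_stack = []

def on_call(return_addr):
    # The CPU also pushes the return address to the shadow stack:
    # one extra store per call.
    shadow_stack.append(return_addr)

def on_return(return_addr):
    # On return the CPU pops the shadow copy and compares it:
    # one extra load plus a compare per return.
    expected = shadow_stack.pop()
    if expected != return_addr:
        # Real hardware raises a control-protection (#CP) fault here.
        raise RuntimeError("shadow stack mismatch")

on_call(0x1000)    # outer call
on_call(0x2000)    # nested call
on_return(0x2000)  # matches: fine
on_return(0x1000)
print("balanced:", len(shadow_stack) == 0)  # → balanced: True
```

This is why call-dominated code pays the most: the per-call work is fixed, so the denser the calls, the larger the relative overhead.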

VSadov (Member) commented Jul 5, 2024

> Would be good to ensure TE is not seeing much impact.

This is an extreme case: a microbenchmark measuring the cost of calls.

TE benchmarks do a lot of other things, so the cost of calls may be much less impactful, possibly even hidden behind the latencies of other memory accesses.

VSadov (Member) commented Jul 5, 2024

I've tried the same on an Intel machine:

Intel(R) Core(TM) Ultra 7 165H, Windows 11

It is a laptop, and the CPU has 3 kinds of cores: Fast, Efficient and ... something else.
I do not know yet which are which, so I just tried pinning the benchmark to core #2 and core #20.

=== On core #2
No CET:

n: 48, time: 19595ms,  result= 7778742049
n: 49, time: 32292ms,  result= 12586269025

With CET:

n: 48, time: 19426ms,  result= 7778742049
n: 49, time: 31420ms,  result= 12586269025

=== On core #20 (seems to be slower than #2)
No CET:

n: 48, time: 29328ms,  result= 7778742049
n: 49, time: 47464ms,  result= 12586269025

With CET:

n: 48, time: 29320ms,  result= 7778742049
n: 49, time: 47442ms,  result= 12586269025

The impact of CET is much smaller on this machine; in particular, on the slower core CET makes basically no difference.

The non-laptop AMD 7950X is much faster on this benchmark. That could be why it is impacted more: there are more chances that memory/L1 operations cannot keep up with the computation and start limiting throughput.
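For scale, the per-call cost implied by the 7950X numbers can be estimated from the call count: this Fib makes 2*Fib(n) - 1 calls, so the delta between the two n=49 runs can be spread over the calls (rough arithmetic, using the timings reported above):

```python
# Rough per-call CET overhead implied by the 7950X n=49 numbers above.
fib_49 = 12586269025              # result printed by the benchmark
calls = 2 * fib_49 - 1            # this Fib performs 2*Fib(n) - 1 calls
extra_s = (30632 - 23750) / 1000  # CET minus no-CET, in seconds

ns_per_call = extra_s / calls * 1e9
print(f"~{ns_per_call:.2f} ns extra per call/return pair")
# → ~0.27 ns extra per call/return pair
```

At the 7950X's roughly 5 GHz clock that is on the order of 1-1.5 cycles per call/return pair, which is plausible for one extra store plus one extra load-and-compare that usually, but not always, hide in the pipeline.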

janvorli (Member) commented Jul 8, 2024

I chatted with Windows developers about this before my vacation. They told me that the cost of pushing to two stacks is what's showing up here.
One of the .NET benchmarks that shows a 40% regression on my Ryzen 5 CPU invokes the System.Collections.Generic.LinkedList<int>(System.Collections.Generic.IEnumerable<int>) constructor for an enumerable with 512 items. The body of the loop in the constructor that makes the copy looks as follows. As you can see, it is just a sequence of calls to JIT_WriteBarrier with two or three instructions in between.

00007ff9`ca4829e7 e8c484b65f call coreclr!JIT_TrialAllocSFastMP_InlineGetThread (7ffa29feaeb0)
00007ff9`ca4829ec 4c8bf8 mov r15, rax
00007ff9`ca4829ef 498d4f08 lea rcx, [r15+8]
00007ff9`ca4829f3 488bd3 mov rdx, rbx
00007ff9`ca4829f6 e815d6e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca4829fb 45897720 mov dword ptr [r15+20h], r14d
00007ff9`ca4829ff 4c8b6b08 mov r13, qword ptr [rbx+8]
00007ff9`ca482a03 4d85ed test r13, r13
00007ff9`ca482a06 743a je 00007FF9CA482A42
00007ff9`ca482a08 498d4f10 lea rcx, [r15+10h]
00007ff9`ca482a0c 498bd5 mov rdx, r13
00007ff9`ca482a0f e8fcd5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a14 498b5518 mov rdx, qword ptr [r13+18h]
00007ff9`ca482a18 498d4f18 lea rcx, [r15+18h]
00007ff9`ca482a1c e8efd5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a21 498b4d18 mov rcx, qword ptr [r13+18h]
00007ff9`ca482a25 488d4910 lea rcx, [rcx+10h]
00007ff9`ca482a29 498bd7 mov rdx, r15
00007ff9`ca482a2c e8dfd5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a31 498d4d18 lea rcx, [r13+18h]
00007ff9`ca482a35 498bd7 mov rdx, r15
00007ff9`ca482a38 e8d3d5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a3d e94affffff jmp 00007FF9CA48298C
00007ff9`ca482a42 498d4f10 lea rcx, [r15+10h]
00007ff9`ca482a46 498bd7 mov rdx, r15
00007ff9`ca482a49 e8c2d5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a4e 498d4f18 lea rcx, [r15+18h]
00007ff9`ca482a52 498bd7 mov rdx, r15
00007ff9`ca482a55 e8b6d5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier
00007ff9`ca482a5a 488d4b08 lea rcx, [rbx+8]
00007ff9`ca482a5e 498bd7 mov rdx, r15
00007ff9`ca482a61 e8aad5e5ff call 00007FF9CA2E0010; JIT_WriteBarrier

So it seems that the regression would hurt performance of applications that use this constructor and similar ones for large blocks of input data.

janvorli (Member) commented Jul 8, 2024

This simple asm code also shows a 30-40% regression with /CETCOMPAT:

LEAF_ENTRY EmptyCallee, _TEXT
    ret
LEAF_END EmptyCallee, _TEXT

NESTED_ENTRY CallingLoop, _TEXT
.ENDPROLOG
    mov rbx, 100000000
MainLoop:
    nop
    nop
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    call EmptyCallee
    dec rbx
    jnz MainLoop
    ret
NESTED_END  CallingLoop, _TEXT

Interestingly, if I change the EmptyCallee body to contain 6 nops (an approximation of the shortest code path in JIT_WriteBarrier), the regression is about 70% (the perf is unchanged without /CETCOMPAT w.r.t. the original EmptyCallee). So some internal CPU optimization seems to be at play.

github-actions bot locked and limited conversation to collaborators Aug 9, 2024
EgorBo reopened this Nov 12, 2024
EgorBo closed this Nov 12, 2024