Revert "Enable cetcompat for corerun" #103654
ah, my bot is Linux only, will check locally |
Interesting. Is this only affecting the Collections benchmarks? |
It looks like it affects more than half of all the micro-benchmarks we have, of different kinds, e.g. I was testing it on Perf_Version and IfStatements |
it's possible that it affected BDN itself I guess |
CET makes the processor do more work, so it will make things run slower. That is by design. https://www.bing.com/search?q=Control+flow+enforcement+technology+performace+overhead |
Yeah, I know. I was just under the impression that it's only enabled for the corerun binary, so it shouldn't affect perf at all |
CET is enabled process wide. It is not a per-dll setting. |
Ok, do we accept the regressions then? So far I don't see any impact on TE, except startup time regressions for some |
unless aspnet uses a different core host |
It was also recently enabled for apphost too. Guess we should make it the new baseline. We should check if TE runs with dotnet or apphost, since dotnet is not CET compliant yet. |
I do not see any other option. CET is going to be a hard requirement for all MS software eventually. |
Does this mean CET will be eventually enabled for the jitted code by default as well? |
Well, it has been supported for all runtime scenarios (on Windows) since .NET 7; we have now enabled it for apphosts, so explicit enablement isn't required per app. |
Windows only uses the shadow stack part of the CET hardware feature. The shadow stack is enabled process wide. JITed code will get it too if it is enabled for the process. |
I think we should leave the issue open, the regression feels rather high so we should look at profiles to determine what is contributing to the increase. |
I see similar extra costs in NativeAOT. With CET enabled there is an impact on code that just does a lot of tiny calls. My benchmark:
using System;
using System.Diagnostics;

internal class Program
{
    static void Main(string[] args)
    {
        // Time Fib(i) for increasing i; each run is dominated by call/return overhead.
        for (int i = 10; i < 100; i++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            var result = Fib(i);
            sw.Stop();
            Console.Write("n: " + i);
            Console.Write(", time: " + sw.ElapsedMilliseconds);
            Console.WriteLine("ms, result= " + result);
        }
    }

    // Naive doubly recursive Fibonacci: almost all of the work is tiny calls and returns.
    static long Fib(long i)
    {
        if (i <= 1)
        {
            return 1;
        }
        return Fib(i - 2) + Fib(i - 1);
    }
}
I am using the bits compiled off the current main. When I compile with default settings, I get
If I compile with I get:
The no-CET version is about 30% faster on code that is dominated by making calls. This is on an AMD Ryzen 9 7950X, Windows 10 |
@tannergooding mentioned that, technically, new AMDs shouldn't be impacted by this, so perhaps Windows (or something else) is to blame? |
Unfortunately, we don't run TE benchmarks on AMD-Windows to confirm (our only Windows-x64 target is Intel) |
perhaps some new AMD machines are available in Azure. Regardless, this feels like it's a non-trivial overhead. Would be good to ensure TE is not seeing much impact. |
uprof says nearly all the time is spent in Fib, and with CET the cost attributed to calls is noticeably higher. I guess that, however cheap the extra memory operations are, doing an extra write when making a call and an extra read (and compare) when returning ends up costing something. This is with no-CET for computing Fib up to 50: This is with CET |
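To make the "extra write on call, extra read and compare on return" point concrete, here is a toy software model of a shadow stack. It is only a sketch: the real mechanism lives in the CPU, and the names ShadowStackModel, Call, Return and CorruptTopOfDataStack are hypothetical, made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Toy model only: real shadow stacks are maintained by the hardware,
// not by managed code. All names here are hypothetical.
public sealed class ShadowStackModel
{
    private readonly Stack<long> _dataStack = new Stack<long>();   // ordinary stack
    private readonly Stack<long> _shadowStack = new Stack<long>(); // second copy of return addresses

    public int ExtraWrites { get; private set; }
    public int ExtraReads { get; private set; }

    // A call stores the return address twice: once per stack.
    public void Call(long returnAddress)
    {
        _dataStack.Push(returnAddress);
        _shadowStack.Push(returnAddress); // the extra write
        ExtraWrites++;
    }

    // A return reads both copies and compares them before continuing.
    public long Return()
    {
        long fromData = _dataStack.Pop();
        long fromShadow = _shadowStack.Pop(); // the extra read
        ExtraReads++;
        if (fromData != fromShadow)
        {
            // The real mechanism raises a control-protection fault here.
            throw new InvalidOperationException("Return address mismatch.");
        }
        return fromData;
    }

    // Simulates a return address being overwritten on the ordinary stack.
    public void CorruptTopOfDataStack(long bogusAddress)
    {
        _dataStack.Pop();
        _dataStack.Push(bogusAddress);
    }
}

public static class Demo
{
    public static void Main()
    {
        var model = new ShadowStackModel();
        model.Call(0x1000);
        model.Call(0x2000);
        Console.WriteLine(model.Return() == 0x2000); // copies match, return succeeds

        model.CorruptTopOfDataStack(0xDEAD);
        try
        {
            model.Return(); // copies differ, so the model throws
        }
        catch (InvalidOperationException e)
        {
            Console.WriteLine(e.Message);
        }

        Console.WriteLine($"extra writes: {model.ExtraWrites}, extra reads: {model.ExtraReads}");
    }
}
```

Every call/return pair costs one additional store and one additional load plus compare, which is why a benchmark made of nothing but tiny calls, like Fib, is close to the worst case.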
This is an extreme case. A microbenchmark to measure cost of calls. TE benchmarks do a lot of other things, so the cost of calls may be much less impactful and possibly even hidden behind latencies of other memory accesses. |
I've tried the same on an Intel machine: Intel(R) Core(TM) Ultra 7 165H, Windows 11. It is a laptop, and the CPU has 3 kinds of cores: Fast, Efficient and ... something else. === On core
With CET:
=== On core
With CET:
The impact of CET is much smaller on this machine. The non-laptop AMD 7950X is much faster on this benchmark. That could be why it is impacted more: there is a greater chance that memory/L1 operations do not keep up with computation and start limiting throughput. |
I chatted with Windows developers about this before my vacation. They told me that the cost of pushing to two stacks is what's showing up here.
So it seems that the regression would hurt performance of applications that use this constructor and similar ones for large blocks of input data. |
This simple asm code also shows 30-40% regression with /CETCOMPAT
and interestingly, if I changed the |
Let's see if reverting #103311 helps to fix the massive regressions reported by dotnet/perf-autofiling-issues#36619