Methods JITed during startup #67398

Closed
jkotas opened this issue Mar 31, 2022 · 32 comments

@jkotas
Member

jkotas commented Mar 31, 2022

Repro:

  • Build and run Console.WriteLine("Hello world".Replace('e', 'a'));
  • Collect list of JITed methods using e.g. perfview

Actual result: String.Replace is JITed in the foreground

Expected result: The R2R version of String.Replace is used; no JITing of String.Replace is observed.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI untriaged New issue has not been triaged by the area owner labels Mar 31, 2022
@jkotas jkotas added area-ReadyToRun-coreclr tenet-performance Performance related issue and removed area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Mar 31, 2022
@gfoidl
Member

gfoidl commented Mar 31, 2022

Could this be caused by the use of Vector<T> in string.Replace(char, char), and at runtime it's detected that AVX(2) is available, so the method gets JITed?

@danmoseley
Member

Cc @brianrob

@EgorBo
Member

EgorBo commented Mar 31, 2022

crossgen with --singlemethodname Replace --singlemethodtypename System.String --singlemethodindex 3 --verbose:

Info: Method `string.Replace(char,char)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint16>` requires runtime JIT

--instruction-set:avx2 fixes it

@jkotas
Member Author

jkotas commented Mar 31, 2022

> Could this be caused by the use of Vector<T> in string.Replace(char, char), and at runtime it's detected that AVX(2) is available, so the method gets JITed?

Yes, it is probably related. However, we should have enough flexibility in crossgen now to make it possible to precompile this successfully (as long as the app is running on current hardware at least).

@EgorBo
Member

EgorBo commented Mar 31, 2022

Related: dotnet/designs#173 (and dotnet/core#7131). I assume we're going to default to AVX2, so it kind of fixes itself.

@mangod9
Member

mangod9 commented Mar 31, 2022

@trylek @dotnet/crossgen-contrib @richlander . Yeah we have been discussing compiling with avx2 by default.

@richlander
Member

As @EgorBo raises, the basic question is why the R2R code cannot be used as-is. If all non-AVX2 code is thrown out, then that R2R code is useless, right?

Ideally, the following is true:

  • All R2R code is used (not just in weird scenarios, but commonly).
  • R2R code is optimized for modern hardware.
  • R2R code includes high-value (currently) JIT-only optimizations, like guarded de-virtualization, at least for known high-value methods (call-sites or targets).
  • The JIT optimizes high-value methods at runtime (tiered comp, dynamic PGO, ...).

As we all know, we are not doing this today. For this conversation, it seems like the first topic is the concern.

@EgorBo
Member

EgorBo commented Jun 26, 2022

> As @EgorBo raises, the basic question is why the R2R code cannot be used as-is. If all non-AVX2 code is thrown out, then that R2R code is useless, right?
>
> Ideally, the following is true:
>
>   • All R2R code is used (not just in weird scenarios, but commonly).
>   • R2R code is optimized for modern hardware.
>   • R2R code includes high-value (currently) JIT-only optimizations, like guarded de-virtualization, at least for known high-value methods (call-sites or targets).
>   • The JIT optimizes high-value methods at runtime (tiered comp, dynamic PGO, ...).
>
> As we all know, we are not doing this today. For this conversation, it seems like the first topic is the concern.

From my understanding, the main problem is that Vector<T> behaves differently in R2R'd code and in Tier1 code because it has a different width in each. Consider the following example:

// This method is Jitted (uses AVX2)
void Method1()
{
    Console.WriteLine(Method2(new Vector<long>(1)));
    // Expected: {2,2,2,2}
    // Actual: {2,2,1,1} or {2,2,0,0} or {2,2,garbage values}
}

// This method is R2R'd with SSE2
Vector<long> Method2(Vector<long> v)
{
    return v * 2;  // {v.long0 * 2, v.long1 * 2 }
}

So we don't allow this to happen: when we see Vector<T> on a non-trimmed path and don't have AVX2, we fall back to the JIT.
However it won't happen in e.g.

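if (Sse2.IsSupported)
{
    // Always taken when Sse2.IsSupported is a compile-time true,
    // so the branch below is dead code for crossgen.
}
else if (Vector.IsHardwareAccelerated)
{
    // Vector<T> is used only here, on a path that is never compiled.
}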
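
The width mismatch described in this comment can be observed directly. What follows is a minimal sketch, assuming .NET 7 or later; the class and method names are hypothetical and not part of the runtime. It reports the lane counts for the current process. The Vector<long> value depends on the hardware and on how the code was compiled, so no expected output is shown.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics;

// Hypothetical helper for illustration; not part of the runtime.
static class VectorWidthSketch
{
    // Number of 64-bit lanes in Vector<long>: typically 2 with 128-bit
    // registers or 4 when AVX2 is in use. The value is machine specific.
    public static int VariableLanes() => Vector<long>.Count;

    // Vector128<long> always has exactly two 64-bit lanes.
    public static int FixedLanes() => Vector128<long>.Count;

    // One-line summary suitable for Console.WriteLine.
    public static string Describe() =>
        $"Vector.IsHardwareAccelerated={Vector.IsHardwareAccelerated}, " +
        $"Vector<long>.Count={VariableLanes()}, " +
        $"Vector128<long>.Count={FixedLanes()}";
}
```

Calling Console.WriteLine(VectorWidthSketch.Describe()) on two machines with different ISA support should show the same Vector128<long>.Count but possibly a different Vector<long>.Count, which is why code compiled for one width cannot safely exchange Vector<T> values with code compiled for another.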

@richlander
Member

Right. That's my understanding, too. To my mind, our approach doesn't make sense. Why bother R2R-compiling methods that won't be used on most hardware? We should either compile those methods for the most common hardware (not SSE2) or not compile them at all.

> non-trimmed path

trimmed? Do you mean tiered?

@EgorBo
Member

EgorBo commented Jun 26, 2022

> trimmed? Do you mean tiered?

Trimmed, but that's not quite the right word here. Consider the following example:

    public static void Test1()
    {
        if (Sse2.IsSupported)
        {
            Console.WriteLine(Vector128.Create(2) * 2);
        }
        else
        {
            Console.WriteLine(new Vector<int>(2) * 2);
        }
    }

    public static void Test2(bool cond)
    {
        if (cond && Sse2.IsSupported) // the only difference is `cond &&`
        {
            Console.WriteLine(Vector128.Create(2) * 2);
        }
        else
        {
            Console.WriteLine(new Vector<int>(2) * 2);
        }
    }

When we run crossgen on this snippet, only Test1 is successfully precompiled, because the Vector<T> code path is never used: if (Sse2.IsSupported) is always true. Test2 fails to precompile because it's not clear whether the Vector<T> path will be used; that depends on cond, which is unknown at compile time, so we always have to JIT Test2 at startup.

Compiling [ConsoleApp23]Program..ctor()
Compiling [ConsoleApp23]Program.Test2(bool)
Info: Method `[ConsoleApp23]Program.Test2(bool)` was not compiled because `This function is using SIMD intrinsics, their size is machine specific` requires runtime JIT
Compiling [ConsoleApp23]Program.Test1()
Compiling [ConsoleApp23]Program.Main()
Processing 0 dependencies
Processing 1 dependencies
Moved to phase 1
Emitting R2R PE file:

@EgorBo
Member

EgorBo commented Jun 26, 2022

There are a few potential solutions for the String.Replace problem:

  1. When we prejit code and encounter if (Vector.IsHardwareAccelerated) while AVX2 is not enabled for crossgen, we replace it with false. That should let us use the fallback the user provided for their algorithm. The only problem is that the user's fallback might be throw new PlatformException("what kind of CPU is that? A potato? At least SSE2 or Neon is expected");
  2. Duplicate the code under if (Vector.IsHardwareAccelerated) so that prejitted code has both an SSE and an AVX version
  3. Return true for Vector.IsHardwareAccelerated but use "software" fallbacks for all Vector<T> functions
  4. Just address dotnet/designs#173 (Proposal for vector instruction default) and we will be fine as is: AVX2 will always be available for crossgen to emit (might be painful for e.g. Rosetta)

@EgorBo
Member

EgorBo commented Jun 26, 2022

List of methods in System.Private.CoreLib.dll which aren't prejitted due to the missing AVX2 capability:

Info: Method `[S.P.CoreLib]System.Text.Unicode.Utf16Utility.GetPointerToFirstInvalidChar(char*,int32,int64&,int32&)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint16>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.Text.Latin1Utility.WidenLatin1ToUtf16_Fallback(uint8*,char*,native uint)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.Text.Latin1Utility.NarrowUtf16ToLatin1(char*,uint8*,native uint)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint16>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.Text.Latin1Utility.GetIndexOfFirstNonLatin1Char_Default(char*,native uint)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint16>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.Text.ASCIIUtility.WidenAsciiToUtf16(uint8*,char*,native uint)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<int8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.Text.ASCIIUtility.NarrowUtf16ToAscii(char*,uint8*,native uint)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint16>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.Text.ASCIIUtility.GetIndexOfFirstNonAsciiChar_Default(char*,native uint)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint16>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<__Canon>(__Canon&,native uint,__Canon)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.LastIndexOfAny(uint8&,uint8,uint8,uint8,int32)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.LastIndexOfAny(uint8&,uint8,uint8,int32)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `string.Replace(char,char)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint16>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<char>(char&,native uint,char)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<native int>(native int&,native uint,native int)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<int32>(int32&,native uint,int32)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<GuidResult>(GuidResult&,native uint,GuidResult)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<float32>(float32&,native uint,float32)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<uint16>(uint16&,native uint,uint16)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<uint32>(uint32&,native uint,uint32)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<int64>(int64&,native uint,int64)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<uint64>(uint64&,native uint,uint64)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<int16>(int16&,native uint,int16)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.IndexOfValueType<int32>(int32&,int32,int32)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<int32>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.IndexOfValueType<int64>(int64&,int64,int32)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<int64>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<KeyValuePair`2<__Canon,__Canon>>(KeyValuePair`2<__Canon,__Canon>&,native uint,KeyValuePair`2<__Canon,__Canon>)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<CalendarId>(CalendarId&,native uint,CalendarId)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<float64>(float64&,native uint,float64)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<KeyValuePair`2<SessionInfo,bool>>(KeyValuePair`2<SessionInfo,bool>&,native uint,KeyValuePair`2<SessionInfo,bool>)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<KeyValuePair`2<int32,__Canon>>(KeyValuePair`2<int32,__Canon>&,native uint,KeyValuePair`2<int32,__Canon>)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT
Info: Method `[S.P.CoreLib]System.SpanHelpers.Fill<SessionInfo>(SessionInfo&,native uint,SessionInfo)` was not compiled because `[S.P.CoreLib]System.Numerics.Vector`1<uint8>` requires runtime JIT

@richlander
Member

Option 4 seems the most attractive. It's simple. I think we should validate that it is unacceptable before looking at more exotic options.

@ivdiazsa has been doing startup tests with R2R+AVX2 vs retail (what we ship today). He's been able to demonstrate a win with R2R composite but not with our regular crossgen/R2R configuration. That seems odd to me. It's hard for me to imagine that JITing would be faster than no JITing. I'm guessing that there is something else at play. It wouldn't surprise me if there is something else in our system not expecting AVX2 code in our R2R images.

Also, the test that @ivdiazsa was using might not be the best one (TE Plaintext). He's going to switch to a different ASP.NET test that is known to rely more on SIMD. I'm not sure if he was collecting the diff in # of JITed methods with those configurations. That would also be interesting to know. Perhaps @EgorBo could pair with @ivdiazsa to double the brain power on this problem. You know, we could crunch more data with the same clock cycles.

@MichalPetryka
Contributor

Wouldn't option 4 mean blocking creating non AVX2 R2R images?

@EgorBo
Member

EgorBo commented Jun 26, 2022

> @ivdiazsa has been doing startup tests with R2R+AVX2 vs retail (what we ship today). He's been able to demonstrate a win with R2R composite but not with our regular crossgen/R2R configuration.
>
> [...] the test that @ivdiazsa was using might not be the best one (TE Plaintext).

I assume it was Platform-Plaintext? I've just tried it on our perflab (a 28-core linux-citrine machine) and here are the results:

| application           | nor2r                     | r2r                       | change  |
| --------------------- | ------------------------- | ------------------------- | ------- |
| Start Time (ms)       |                       517 |                       193 | -62.67% |

| load                   | nor2r       | r2r         | change  |
| ---------------------- | ----------- | ----------- | ------- |
| First Request (ms)     |          97 |          77 | -20.62% |
| Mean latency (ms)      |        1.39 |        1.23 | -11.51% |
| Max latency (ms)       |       81.27 |       68.26 | -16.01% |

nor2r is DOTNET_ReadyToRun=0 (don't use R2R) and r2r is /p:PublishReadyToRun=true (I didn't even use Composite mode). R2R gives a pretty nice startup bonus compared to JIT-only.

@richlander
Member

> Wouldn't option 4 mean blocking creating non AVX2 R2R images?

No. What led you to that conclusion?

There are three decision points:

  • What R2R flavors are allowed/enabled?
  • What R2R flavor is created by default?
  • What R2R flavor does MS use for official builds?

We're exclusively talking about the last topic. The middle topic is also interesting. There is no discussion about the first topic.

> I've just tried it on our perflab

That configuration is not what I intended nor what @ivdiazsa is testing.

There are two things to measure:

  • Baseline with retail product (the one we're building today).
  • The whole stack compiled with AVX2.

Ideally, the app is NOT R2R compiled but JITs. That's what most customer apps do. Measuring all R2R is also interesting but a SECONDARY scenario since most customers don't do that.

/p:PublishReadyToRun=true -- This doesn't produce the second case above.

@EgorBo
Member

EgorBo commented Jun 26, 2022

@richlander sorry, but I am still not sure I understand what is being tested. I fully realize that the most common case is when only the BCL libs are prejitted (well, maybe a few 3rd party libs too) and the rest is jitted. What do we test this mode against? Against noR2R even for the BCL?

> The whole stack compiled with AVX2.

What does this mean exactly? VM/JIT compiled with mcpu=native to use AVX2? Or all managed libs are prejitted into a giant Composite R2R?

@richlander
Member

I'm only talking about R2R not how we configure the native toolset.

AFAIK, the ability to specify AVX2 as the SIMD instruction set is not specific to composite. It was at one point, but I believe that's been resolved. That's certainly something to validate!

I'd expect that all the managed assemblies we have today are R2R compiled the same way as today (not composite) with the addition of a flag to affect SIMD instructions. This necessarily includes the NetCoreApp and ASP.NET Core frameworks.

We're very close to having container images (hopefully this coming week; it is Sunday here) that have this configuration. That will enable us to do A vs B perf testing much more easily. It will also enable smart people to inspect the configuration to validate that it is correct. It is VERY EASY to do this wrong. We've been working on this problem for months and have had a lot of problems.

@mangod9 mangod9 removed the untriaged New issue has not been triaged by the area owner label Jul 7, 2022
@mangod9 mangod9 added this to the 7.0.0 milestone Jul 7, 2022
@mangod9
Member

mangod9 commented Jul 27, 2022

@EgorBo @richlander is this still pending on more perf validation of non-composite AVX2 config?

@richlander
Member

I'm not sure. Do we have any data that we can share?

@EgorBo
Member

EgorBo commented Jul 28, 2022

Neither am I. The root issue here can be fixed if we bump the minimal baseline for R2R (dotnet/designs#173). Do we plan to do that in 7.0?

@richlander
Member

richlander commented Jul 30, 2022

We're not making any more significant changes in 7.0, as you likely know.

We had a fair number of conversations about this topic over the last few months. Here's the basic context/problem:

  • AVX2 has a benefit at startup, but it isn't as high as one might expect (~2%). The real win is for steady-state, which is already solved via the JIT.
  • There is a win for AVX2 machines; however, the loss on non-AVX2 machines is expected to be much greater due to unintuitive side-effects. In short:
    • All SIMD R2R code is SSE2 today. There are a relatively small number of SIMD-dependent methods whose R2R methods need to be rejected on AVX2-capable machines. The reason is that we need APIs to always return the same answer, and we cannot guarantee this unless we use the same underlying functionality throughout process lifetime for the method implementation (the generated native code). This is particularly relevant for floating point math (per my understanding). This requires early jitting of high-quality code for those methods, which has both time and space costs.
    • A key point (again per my understanding) is that most vector code doesn't have this characteristic, so most R2R vector code does not need to be rejected and can safely upgrade from SSE2 R2R code to AVX2 JIT code across process lifetime.
    • If we consider the reverse (where all R2R code is compiled to AVX2), then all SIMD methods need to be rejected on non-AVX2 machines and jitted at first use. That makes the cost on non-AVX2 machines disproportionate relative to the gain on AVX2 machines. Also, it's likely that AVX2 chips are better and more able to adapt.
    • Last, a number of emulators (or maybe all of them) don't support AVX2. The same is true for the lowest end chips on sale for the cheapest laptops.
    • Compiling for both SSE2 and AVX2 for the methods where it matters (A) is a lot of work, since we don't have that capability (in either the R2R format or in crossgen), and (B) would bloat R2R images. Neither is attractive, and neither feels worth it.
  • We have not done extensive measurements here, so a lot of this is conjecture. We should do more extensive measurements so that we can make better decisions. This change is not an "if" but "when". Actual measurements would help us choose the "when" with confidence.

Did I get that right? @ivdiazsa @mangod9 @davidwrighton @tannergooding

@EgorBo
Member

EgorBo commented Jul 30, 2022

Btw, String.Replace can be quickly fixed by adding a Vector128 path (should also help with small inputs) cc @gfoidl
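
To illustrate what a fixed-width path could look like, here is a minimal sketch, assuming .NET 7 or later. ReplaceSketch and ReplaceChar are hypothetical names; this is not the CoreLib implementation of string.Replace and not the change that was eventually made. It only shows the shape of a Vector128 loop with a scalar tail.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical sketch; not the actual CoreLib implementation of string.Replace.
static class ReplaceSketch
{
    public static string ReplaceChar(string input, char oldChar, char newChar)
    {
        char[] result = new char[input.Length];

        // Treat UTF-16 code units as ushort so they can be loaded into Vector128<ushort>.
        ReadOnlySpan<ushort> src = MemoryMarshal.Cast<char, ushort>(input.AsSpan());
        Span<ushort> dst = MemoryMarshal.Cast<char, ushort>(result.AsSpan());
        ushort oldU = oldChar;
        ushort newU = newChar;
        int i = 0;

        if (Vector128.IsHardwareAccelerated)
        {
            Vector128<ushort> oldV = Vector128.Create(oldU);
            Vector128<ushort> newV = Vector128.Create(newU);
            int width = Vector128<ushort>.Count; // 8 code units per 128-bit vector

            for (; i <= src.Length - width; i += width)
            {
                Vector128<ushort> chunk = Vector128.Create(src.Slice(i, width));
                // Lanes equal to oldChar become all-ones in the mask.
                Vector128<ushort> mask = Vector128.Equals(chunk, oldV);
                // Take newV where the mask is set, the original lane otherwise.
                Vector128.ConditionalSelect(mask, newV, chunk).CopyTo(dst.Slice(i, width));
            }
        }

        // Scalar tail; also the whole loop when Vector128 is not accelerated.
        for (; i < src.Length; i++)
        {
            dst[i] = src[i] == oldU ? newU : src[i];
        }

        return new string(result);
    }
}
```

Because Vector128<T> has the same width on every platform, a method written this way avoids the machine-specific size of Vector<T>, which is the property crossgen reported when it declined to precompile string.Replace(char,char).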

@tannergooding
Member

> There are a relatively small number of SIMD-dependent methods whose R2R methods need to be rejected on AVX2-capable machines

IIRC, corelib is special: it ignores that restriction and works under the presumption that we "do the right thing" and ensure all paths behave identically. User code, by contrast, is presumed to have paths that could differ in behavior, so the R2R implementation is always thrown out if any path attempts to use an incompatible higher ISA. That is, SSE3, SSSE3, SSE4.1, and SSE4.2 are "encoding compatible with" SSE2, so the if (Isa.IsSupported) checks get emitted in R2R as actual dynamic checks. AVX, AVX2, FMA, BMI1, BMI2, and others are "encoding incompatible", so the if (Isa.IsSupported) checks force R2R to throw out the code.

There are a few things to then keep in mind for corelib code:

  1. SIMD is used in almost all our core code paths, often indirectly due to things like ReadOnlySpan<T> using it internally
  2. Vector<T> itself is always "encoding incompatible" and so any code path that uses Vector<T> will not have any R2R code
  3. Vector128 is the "common" size that all platforms support (x86, x64, Arm64, etc.)
  4. Vector128 will include WASM in the future, could include Arm32 if we added the support, etc
  5. Vector256 is, today, only supported by x86/x64
  6. Vector64 is, today, only supported by Arm64

x86-64-v3 is the "level" that is the "baseline" for Azure, AWS, and more. It supports AVX/AVX2/BMI1/BMI2/FMA/LZCNT/MOVBE and has been around since ~2013 for Intel, ~2015 for AMD, and is reported by Steam Hardware Survey to be in 87.78% of reporting hardware.

x86-64-v2 is the "level" that most emulators (including the x64 on Arm64 emulators provided by Microsoft/Apple) support. It supports SSE3/SSSE3/SSE4.1/SSE4.2/POPCNT/CX16 and has been around since ~2008 for Intel, ~2009 for AMD, and is reported by Steam Hardware Survey to be in 98.74% of reporting hardware.

x86-64-v1 is the "level" we currently target. It supports SSE/SSE2/CMOV/CX8 and has been around since x64 was first introduced in 2003. For 32-bit it's been around since around 2000 and is reported by Steam Hardware Survey to be in 100% of reporting hardware.

The discussion around targeting AVX/AVX2 is really one around targeting x86-64-v3. Since targeting it would only negatively impact startup for a likely minority of customers, I think it's still worth considering making it our baseline; if we, for whatever reason, think that is too much, x86-64-v2 is a reasonable alternative.
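
As a rough way to see which of these levels a given machine reaches, the ISA lists above can be mapped onto the IsSupported properties in System.Runtime.Intrinsics.X86. This is a sketch under stated assumptions: IsaLevelSketch and Classify are hypothetical names, and MOVBE and CX16 are not checked because this sketch does not assume an IsSupported property for them, so the result is an approximation of the level rather than a strict conformance test.

```csharp
using System.Runtime.Intrinsics.X86;

// Hypothetical helper; the grouping follows the ISA lists in the comment above.
static class IsaLevelSketch
{
    public static string Classify()
    {
        // On non-x86 hardware every IsSupported property below returns false.
        if (!Sse2.IsSupported)
            return "not x86/x64";

        // v2: SSE3/SSSE3/SSE4.1/SSE4.2/POPCNT (CX16 is not checked here).
        bool v2 = Sse3.IsSupported && Ssse3.IsSupported && Sse41.IsSupported
               && Sse42.IsSupported && Popcnt.IsSupported;

        // v3: AVX/AVX2/BMI1/BMI2/FMA/LZCNT (MOVBE is not checked here).
        bool v3 = v2 && Avx.IsSupported && Avx2.IsSupported && Bmi1.IsSupported
               && Bmi2.IsSupported && Fma.IsSupported && Lzcnt.IsSupported;

        return v3 ? "v3" : v2 ? "v2" : "v1";
    }
}
```

The value reflects what the runtime reports for the current process, so it can differ under an emulator, which matches the point above that most emulators stop at the v2 level.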


There is also a consideration that this impacts managed code but not native. Continuing to have the JIT, GC, and other native bits target a 20-year-old baseline computer has some downsides, with the main upside being that devs can "xcopy" their bits onto a flash drive and then run it on any computer.

Even Arm64 has the consideration that armv8.0-a is "missing" some fairly important instructions (like the atomics instruction set) and this is a major benefit that Apple M1 sees since it targets an armv8.5-a baseline instead.

I expect we could see some reasonable perf gains if we experimented with allowing a "less-portable" version of the JIT/GC that targeted a higher baseline (for both x64 and Arm64). I expect that this would be a net benefit for a majority of cases with it being "worse" only for a small subset of users with hardware that is more than 15 years old (at which point, they likely aren't running a modern OS and likely have many other considerations around general app usage).

@mangod9
Member

mangod9 commented Aug 3, 2022

> Btw, String.Replace can be quickly fixed by adding Vector128 path (should also help with small inputs) cc @gfoidl

is this something which can be done in 7? Otherwise we could move this broader decision around avx2 to 8.

@richlander
Member

We'll take another run in .NET 8.

@mangod9 mangod9 modified the milestones: 7.0.0, 8.0.0 Aug 5, 2022
@mangod9
Member

mangod9 commented Jul 3, 2023

we have enabled avx2 for the new "composite" container images in 8. Is there any other specific work required here?

@EgorBo
Member

EgorBo commented Jul 3, 2023

> we have enabled avx2 for the new "composite" container images in 8. Is there any other specific work required here?

Just tried Console.WriteLine("Hello world".Replace('e', 'a')); on the latest daily SDK:

   1: JIT compiled System.Runtime.CompilerServices.CastHelpers:StelemRef(System.Array,long,System.Object) [Tier1, IL size=88, code size=93]
   2: JIT compiled System.Runtime.CompilerServices.CastHelpers:LdelemaRef(System.Array,long,ulong) [Tier1, IL size=44, code size=44]
   3: JIT compiled System.SpanHelpers:IndexOfNullCharacter(ulong) [Instrumented Tier0, IL size=805, code size=1160]
   4: JIT compiled System.Guid:TryFormatCore[ushort](System.Span`1[ushort],byref,int) [Tier0, IL size=894, code size=892]
   5: JIT compiled System.Guid:FormatGuidVector128Utf8(System.Guid,bool) [Tier0, IL size=331, code size=584]
   6: JIT compiled System.HexConverter:AsciiToHexVector128(System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]) [Tier0, IL size=78, code size=359]
   7: JIT compiled System.Runtime.Intrinsics.Vector128:ShuffleUnsafe(System.Runtime.Intrinsics.Vector128`1[ubyte],System.Runtime.Intrinsics.Vector128`1[ubyte]) [Tier0, IL size=41, code size=50]
   8: JIT compiled System.Number:UInt32ToDecChars[ushort](ulong,uint) [Instrumented Tier0, IL size=114, code size=315]
   9: JIT compiled System.ArgumentOutOfRangeException:ThrowIfNegative[int](int,System.String) [Tier0, IL size=22, code size=50]
  10: JIT compiled System.Text.Unicode.Utf16Utility:GetPointerToFirstInvalidChar(ulong,int,byref,byref) [Instrumented Tier0, IL size=994, code size=1296]
  11: JIT compiled System.Text.Ascii:NarrowUtf16ToAscii(ulong,ulong,ulong) [Instrumented Tier0, IL size=491, code size=792]
  12: JIT compiled System.SpanHelpers:IndexOfNullByte(ulong) [Instrumented Tier0, IL size=844, code size=1540]
  13: JIT compiled System.PackedSpanHelpers:IndexOf[System.SpanHelpers+DontNegate`1[short]](byref,short,int) [Instrumented Tier0, IL size=698, code size=1911]
  14: JIT compiled System.PackedSpanHelpers:PackSources(System.Runtime.Intrinsics.Vector256`1[short],System.Runtime.Intrinsics.Vector256`1[short]) [Tier0, IL size=13, code size=52]
  15: JIT compiled System.PackedSpanHelpers:ComputeFirstIndexOverlapped(byref,byref,byref,System.Runtime.Intrinsics.Vector256`1[ubyte]) [Tier0, IL size=52, code size=124]
  16: JIT compiled System.PackedSpanHelpers:FixUpPackedVector256Result(System.Runtime.Intrinsics.Vector256`1[ubyte]) [Tier0, IL size=22, code size=42]
  17: JIT compiled System.SpanHelpers:LastIndexOfValueType[short,System.SpanHelpers+DontNegate`1[short]](byref,short,int) [Instrumented Tier0, IL size=963, code size=2111]
  18: JIT compiled System.Text.Ascii:WidenAsciiToUtf16(ulong,ulong,ulong) [Instrumented Tier0, IL size=604, code size=1784]
  19: JIT compiled (dynamicClass):InvokeStub_EventAttribute.set_Level(System.Object,System.Object,ulong) [FullOpts, IL size=25, code size=27]
  20: JIT compiled (dynamicClass):InvokeStub_EventAttribute.set_Message(System.Object,System.Object,ulong) [FullOpts, IL size=25, code size=28]
  21: JIT compiled (dynamicClass):InvokeStub_EventAttribute.set_Task(System.Object,System.Object,ulong) [FullOpts, IL size=25, code size=27]
  22: JIT compiled (dynamicClass):InvokeStub_EventAttribute.set_Opcode(System.Object,System.Object,ulong) [FullOpts, IL size=25, code size=27]
  23: JIT compiled (dynamicClass):InvokeStub_EventAttribute.set_Version(System.Object,System.Object,ulong) [FullOpts, IL size=25, code size=28]
  24: JIT compiled (dynamicClass):InvokeStub_EventAttribute.set_Keywords(System.Object,System.Object,ulong) [FullOpts, IL size=25, code size=28]
  25: JIT compiled System.Number:Int64ToHexChars[ushort](ulong,ulong,int,int) [Instrumented Tier0, IL size=67, code size=313]
  26: JIT compiled System.SpanHelpers:IndexOfAnyInRangeUnsignedNumber[ushort,System.SpanHelpers+DontNegate`1[ushort]](byref,ushort,ushort,int) [Tier0, IL size=142, code size=190]
  27: JIT compiled System.PackedSpanHelpers:IndexOfAnyInRange[System.SpanHelpers+DontNegate`1[short]](byref,short,short,int) [Instrumented Tier0, IL size=659, code size=1801]
  28: JIT compiled Program:<Main>$(System.String[]) [Tier0, IL size=20, code size=57]
  29: JIT compiled System.ArgumentOutOfRangeException:ThrowIfNegativeOrZero[int](int,System.String) [Tier0, IL size=36, code size=63]
Hallo world

I don't see string.Replace but I remember we used to have 4 or 5 functions jitted for a hello world 🤔 (tracked in #85791 so presumably @mangod9 this one can be closed)

@jkotas jkotas closed this as completed Jul 3, 2023
@mangod9
Member

mangod9 commented Jul 3, 2023

#85791 is marked as Future. Surprised that System.ArgumentOutOfRangeException:ThrowIfNegativeOrZero is being JITted.

@EgorBo
Member

EgorBo commented Jul 3, 2023

> Surprised that System.ArgumentOutOfRangeException:ThrowIfNegativeOrZero is being JITted.

public static void ThrowIfNegativeOrZero<T>(T value, [CallerArgumentExpression(nameof(value))] string? paramName = null)
    where T : INumberBase<T>
{
    if (T.IsNegative(value) || T.IsZero(value))
        ThrowNegativeOrZero(paramName, value);
}

I guess it's the same SVM (static virtual methods) issue, so presumably it will be fixed with #87438

@mangod9
Member

mangod9 commented Jul 3, 2023

hmm, makes sense

@ghost ghost locked as resolved and limited conversation to collaborators Aug 2, 2023