
x64 vs ARM64 Microbenchmarks Performance Study Report #67339

Open
4 of 19 tasks
adamsitnik opened this issue Mar 30, 2022 · 24 comments
Labels
area-Meta tenet-performance Performance related issue tracking This issue is tracking the completion of other related issues.
Milestone

Comments

@adamsitnik
Member

adamsitnik commented Mar 30, 2022

Recently @kunalspathak asked me if I could produce a report similar to #66848 for x64 vs arm64 comparison.

I took the .NET 7 Preview 2 results provided by @AndyAyersMS, @kunalspathak and myself for #66848, hacked the tool a little bit (it was not designed to compare results across architectures) and compared x64 vs arm64 using the following configs:

  • my 4-year-old MacBook Pro (x64): macOS Monterey 12.2.1, Intel Core i7-5557U CPU 3.10GHz (Broadwell), 1 CPU, 4 logical and 2 physical cores vs @AndyAyersMS's M1 Max (arm64): macOS Monterey 12.2.1, Apple M1 Max 2.40GHz, 1 CPU, 10 logical and 10 physical cores
  • @kunalspathak Windows 10 (10.0.20348.587) Intel Xeon Platinum 8272CL CPU 2.60GHz, 2 CPU, 104 logical and 52 physical cores vs @kunalspathak Windows 11 (10.0.25058.1000) ARM64 machine with lots of cores

Of course it was not an apples-to-apples comparison, just the best thing we could do right now.

Full public results (without absolute values, as I don't have the permission to share them) can be found here.
Internal MS results (with absolute values) can be found here. If you don't have the access please ping me on Teams.

As usual, I've focused on the benchmarks that take longer to execute on arm64 than on x64. If you are interested in benchmarks that take less time to execute, read the report linked above in reverse order.

Benchmarks:

@kunalspathak

  • System.Numerics.Tests.Perf_BitOperations.PopCount_ulong is 5-8 times slower (most likely due to lack of vectorization). PopCount_uint is slower only on Windows.

@tannergooding @GrabYourPitchforks

  • A lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are 6 to 16 times slower (most likely due to lack of vectorization). #35033

@stephentoub @kouvel

  • Some RentReturnArrayPoolTests benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. Faster thread local statics #63619
  • System.Threading.Tests.Perf_Timer.AsynchronousContention is 2-3 times slower.

@wfurt @MihaZupan

  • A lot of SocketSendReceivePerfTest benchmarks like System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend are 2 times slower.

@dotnet/area-system-drawing

  • System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation is a few times slower on Windows. Only the NoValidation benchmarks seem to run slower.

@stephentoub

  • A few RegularExpressions benchmarks like System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled) are 40-50% slower. This pattern uses IndexOfAny("HOho") to find the next possible match location. It has a 256-bit vectorization path on x64 but only 128-bit on ARM64.

@jkotas @AndyAyersMS

  • The PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to 3x slower. Same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod.

@dotnet/jit-contrib

  • The System.Text.Json.Serialization.Tests.WriteJson<BinaryData>.SerializeToStream benchmark can be from 16% to 4x slower.
  • Burgers.Test3 is 12-59% slower (most likely it's using a method that has not been vectorized).
  • System.Security.Cryptography.Tests.Perf_Hashing.Sha1 is 17-55% slower (most likely due to lack of vectorization).
  • SIMD.ConsoleMandel benchmarks are 40% slower (most likely due to lack of vectorization).
  • System.IO.Tests.Perf_StreamWriter.WriteString(writeLength: 100) is 21-46% slower (most likely due to lack of vectorization).

@tannergooding

  • System.MathBenchmarks.Double.Exp and System.MathBenchmarks.Single.Exp are 35% slower.
  • A lot of System.Collections.Contains and SequenceCompareTo benchmarks are 2-3 (up to 4) times slower (most likely due to lack of vectorization).

@dotnet/area-system-globalization

  • The System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja) benchmark can be from 20% to 7x slower (it's most likely an ICU problem). Initializing the "ja" culture takes 200ms when using ICU #31273

  • Various Perf_Interlocked benchmarks are slower, but this is expected due to memory model differences.

  • Various Perf_Process.Start benchmarks are slower, but only on macOS so it's most likely a macOS issue.

@adamsitnik adamsitnik added area-Meta tenet-performance Performance related issue tracking This issue is tracking the completion of other related issues. labels Mar 30, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the untriaged New issue has not been triaged by the area owner label Mar 30, 2022
@ghost

ghost commented Mar 30, 2022

Tagging subscribers to this area: @dotnet/area-meta
See info in area-owners.md if you want to be subscribed.

Issue Details


Benchmarks:

  • A lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are 6 to 16 times slower (most likely due to lack of vectorization). @tannergooding @GrabYourPitchforks is it expected?
  • System.Numerics.Tests.Perf_BitOperations.PopCount_ulong is 5-8 times slower (most likely due to lack of vectorization). PopCount_uint is slower only on Windows. @kunalspathak is this expected?
  • Some RentReturnArrayPoolTests benchmarks are up to a few times slower, but these are multi-threaded and very often multimodal benchmarks. @stephentoub @kouvel is it expected?
  • The System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja) benchmark can be from 20% to 7x slower (it's most likely an ICU problem). @dotnet/area-system-globalization is it expected?
  • A lot of System.Collections.Contains benchmarks are 2-3 times slower (most likely due to lack of vectorization). Same goes for System.Memory.Span<Char>.IndexOfValue, System.Memory.Span<Char>.Fill, System.Memory.Span<Int32>.StartsWith, System.Memory.Span<Byte>.IndexOfAnyTwoValues and System.Memory.ReadOnlySpan.IndexOfString(Ordinal). @tannergooding @EgorBo is it expected?
  • A lot of SequenceCompareTo benchmarks are 30% to 4 times slower (most likely due to lack of vectorization). @tannergooding @EgorBo is it expected?
  • The System.Text.Json.Serialization.Tests.WriteJson<BinaryData>.SerializeToStream benchmark can be from 16% to 4x slower. @dotnet/jit-contrib is this expected?
  • System.Threading.Tests.Perf_Timer.AsynchronousContention is 2-3 times slower. @stephentoub @kouvel is it expected?
  • A lot of SocketSendReceivePerfTest benchmarks like System.Net.WebSockets.Tests.SocketSendReceivePerfTest.ReceiveSend are 2 times slower. @wfurt @MihaZupan is it expected?
  • System.Drawing.Tests.Perf_Image_Load.Image_FromStream_NoValidation is a few times slower on Windows. @dotnet/area-system-drawing is it expected? Only the NoValidation benchmarks seem to run slower.
  • The PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to 3x slower. Same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod. @jkotas @AndyAyersMS is it expected?
  • A few RegularExpressions benchmarks like System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled) are 40-50% slower (most likely it's using a method that has not been vectorized). @stephentoub is it expected?
  • Burgers.Test3 is 12-59% slower (most likely it's using a method that has not been vectorized). @dotnet/jit-contrib is it expected?
  • System.Security.Cryptography.Tests.Perf_Hashing.Sha1 is 17-55% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
  • SIMD.ConsoleMandel benchmarks are 40% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
  • System.IO.Tests.Perf_StreamWriter.WriteString(writeLength: 100) is 21-46% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?
  • System.MathBenchmarks.Double.Exp and System.MathBenchmarks.Single.Exp are 35% slower. @tannergooding is it expected?
  • Various Perf_Interlocked benchmarks are slower, but this is expected due to memory model differences.
  • Various Perf_Process.Start benchmarks are slower, but only on macOS, so it's most likely a macOS issue.

@adamsitnik adamsitnik removed the untriaged New issue has not been triaged by the area owner label Mar 30, 2022
@EgorBo
Member

EgorBo commented Mar 30, 2022

Nice! I did a similar report last week and shared it at our perf meeting last Monday.

A lot of Base64Encode benchmarks like System.Buffers.Text.Tests.Base64Tests.Base64Encode(NumberOfBytes: 1000) are 6 up to 16 times slower (most likely due to lack of vectorization). @tannergooding @GrabYourPitchforks is it expected?

Base64 (for utf8) is only vectorized for x64, there is an issue for arm64 #35033 (I think we wanted to assign it to someone to ramp up)


System.Numerics.Tests.Perf_BitOperations.PopCount_ulong is 5-8 time slower (most likely due to lack of vectorization).

It is properly accelerated (I compared it with __builtin_popcnt in LLVM); the problem is that popcnt is vector-only on arm64, so we have some overhead on packing/extracting - 5 instructions vs 1 on x64.
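For illustration, here is a minimal C sketch of the operation being measured (the SWAR fallback and the guard macros are illustrative, not runtime code): the builtin path is a single POPCNT instruction on x64, while on arm64 the value has to be moved to a NEON register, counted per byte with CNT, and summed with ADDV, roughly five instructions.

```c
#include <stdint.h>

/* Population count of a 64-bit value. On x64, __builtin_popcountll lowers
 * to one POPCNT instruction; on arm64 it lowers to a fmov + CNT + ADDV
 * sequence because there is no scalar popcount instruction. */
uint32_t popcount_u64(uint64_t v) {
#if defined(__POPCNT__) || defined(__aarch64__)
    return (uint32_t)__builtin_popcountll(v); /* hardware-assisted path */
#else
    /* Portable SWAR fallback with identical semantics. */
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return (uint32_t)((v * 0x0101010101010101ULL) >> 56);
#endif
}
```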


Some RentReturnArrayPoolTests benchmarks are up to few times slower

My guess is that Rent-Return is most likely bottlenecked on TLS access speed; it can be improved with #63619 if arm64 has special registers for that.
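As a rough sketch of why TLS speed dominates here (names and structure are hypothetical, not the actual ArrayPool code): the Rent/Return fast path is essentially a thread-local one-element cache, so each iteration of the microbenchmark is little more than one TLS load plus one TLS store.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical one-element per-thread cache, mimicking the shape of the
 * ArrayPool fast path that #63619 aims to speed up. */
static _Thread_local void *tls_cached_array;

void *pool_rent(size_t size) {
    void *a = tls_cached_array;
    if (a != NULL) {              /* fast path: a single TLS load */
        tls_cached_array = NULL;
        return a;
    }
    return malloc(size);          /* slow path stands in for the shared pool */
}

void pool_return(void *a) {
    if (tls_cached_array == NULL) {
        tls_cached_array = a;     /* fast path: a single TLS store */
        return;
    }
    free(a);                      /* cache occupied: drop the array */
}
```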


A lot of System.Collections.Contains benchmarks are 2-3 times slower (most likely due to lack of vectorization).

A lot of SequenceCompareTo benchmarks are 30% up to 4 times slower (most likely due to lack of vectorization

That is expected due to lack of Vector256 I believe, I proposed to add dual-vector128 for arm64 here #66993

Burgers.Test3 is 12-59% slower (most likely it's using a method that has not been vectorized)

SIMD.ConsoleMandel benchmarks are 40% slower

Same here, it uses Vector<T> so it's Vector256 on x64 vs Vector128 on arm64


Various Perf_Interlocked benchmarks are slower, but this is expected due to memory model differences.

Correct, the codegen for interlocked ops is completely fine on both ARMv8.0 and ARMv8.1 (atomics).


System.MathBenchmarks.Double.Exp and System.MathBenchmarks.Single.Exp are 35% slower.

If the arm64 machine was an M1, then it's the jump-stubs issue, see #62302 (comment)


PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to x3 times slower. Same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod. @jkotas @AndyAyersMS is it expected?

My guess is that it's because we don't use relocs on arm64 and have to compose the full 64-bit address using several instructions to access a static field. E.g.:

static int field;

void IncrementField() => field++;

X64:

       FF05C6CC4200         inc      dword ptr [(reloc 0x7ffeb73eac3c)]

arm64:

        D2958780          movz    x0, #0xac3c
        F2B6E760          movk    x0, #0xb73b LSL #16
        F2CFFFC0          movk    x0, #0x7ffe LSL #32
        B9400001          ldr     w1, [x0]
        11000421          add     w1, w1, #1
        B9000001          str     w1, [x0]

Overall, I have a feeling that we might get a very nice boost for many benchmarks/GC if we integrate PGO for native code (VM/GC)

@vcsjones
Member

System.Security.Cryptography.Tests.Perf_Hashing.Sha1 is 17-55% slower (most likely due to lack of vectorization). @dotnet/jit-contrib is it expected?

SHA1.ComputeHash is backed by the platform's SHA1 implementation (OpenSSL, CNG, SecurityTransforms) and doesn't do any vectorization itself. It's possible that the platforms the tests were run on don't have optimized ARM64 implementations of SHA1.

@danmoseley
Member

Nice! I did a similar report last week and shared on our perf meeting last Monday

@EgorBo that data seems like something you could share on a gist for everyone? (Or perhaps just the scenarios with unusual ratios)

@danmoseley
Member

The System.Drawing ones may just be a difference in Windows GDI+ performance since it's largely a wrapper.

@AndyAyersMS
Member

PerfLabTests.LowLevelPerf.GenericClassGenericStaticField benchmark can be from 16% to x3 times slower. Same goes for PerfLabTests.LowLevelPerf.GenericClassGenericStaticMethod. @jkotas @AndyAyersMS is it expected?

My guess that it's because we don't use relocs on arm64 and have to compose full 64bit address using several instructions to access a static field.

https://github.com/dotnet/performance/blob/d7dac8a7ca12a28d099192f8a901cf8e30361384/src/benchmarks/micro/runtime/perflab/LowLevelPerf.cs#L320-L325

Access for generic statics (for shared generics at least, maybe for all?) can be more complicated -- the address must be looked up in runtime data structures. Worth investigating.

@tarekgh
Member

tarekgh commented Mar 30, 2022

System.Globalization.Tests.Perf_DateTimeCultureInfo.Parse(culturestring: ja) benchmark can be from 20% to x7 times slower (it's most likely an ICU problem).

Most likely it is because of ICU. We already have issue #31273 tracking that. I don't know, though, why the ARM64 runs are so much slower.

@danmoseley
Member

Access for generic statics (for shared generics at least, maybe for all?) can be more complicated -- the address must be looked up in runtime data structures. Worth investigating.

@EgorBo perhaps you could open an issue and update the top post?

@EgorBo
Member

EgorBo commented Mar 30, 2022

@EgorBo perhaps you could open an issue and update the top post?

Access for generic statics (for shared generics at least, maybe for all?) can be more complicated -- the address must be looked up in runtime data structures. Worth investigating.

right, but it doesn't look to be the case here since it's not shared

@EgorBo that data seems like something you could share on a gist for everyone?

Sure, let me see how to export an Excel sheet to a gist 😄

@ericstj
Member

ericstj commented Mar 30, 2022

The System.Drawing ones may just be a difference in Windows GDI+ performance since it's largely a wrapper.

There is a lot of interop in this scenario. It could be differences in interop or the performance of this callback:

public unsafe Interop.HRESULT CopyTo(IntPtr pstm, ulong cb, ulong* pcbRead, ulong* pcbWritten)

We could compare to the performance of a load that doesn't use a stream, which would be more of a GDI+ baseline. cc @eerhardt

@danmoseley
Member

@jkoritzinsky for that interop possibility. Jeremy anything notable in the interop here - any potentially relevant known issue on Arm64?

@EgorBo
Member

EgorBo commented Mar 30, 2022

The System.Text.Json.Serialization.Tests.WriteJson<BinaryData>.SerializeToStream benchmark can be from 16% to 4x slower.

This one serializes an array of bytes, so it spends most of its time encoding data into base64 - it's the same as #35033.


@jkoritzinsky
Member

for that interop possibility. Jeremy anything notable in the interop here - any potentially relevant known issue on Arm64?

We don't have any notable differences (or even any differences I can think of) between ARM64 and x64 in the portion of interop used there. I definitely wouldn't be amazed if some portion of GDI+ is better optimized for x64 and we're just seeing that here. @dotnet/interop-contrib in case anyone else on the interop team has issues that come to mind.

@danmoseley
Member

danmoseley commented Mar 30, 2022

For the regex ones -- do we know whether we have vectorization gaps specific to Arm64 in areas like StartsWith, IndexOf, IndexOfAny - @EgorBo? (For char, not byte.)

@stephentoub
Member

stephentoub commented Mar 31, 2022

Few RegularExpressions benchmarks like System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sher[a-z]+|Hol[a-z]+", Options: Compiled) are 40-50% slower (most likely it's using a method that has not been vectorized).

For the regex ones -- do we know we have vectorization gaps that are specific to Arm64 in any areas like -- StartsWith, IndexOf, IndexOfAny - @EgorBo ? (For char, not byte)

The cited pattern will use IndexOfAny("HOho") to find the next possible match location. It has a 256-bit vectorization path on x64 but only 128-bit on ARM64.
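As an illustration (the helper below is a hypothetical scalar sketch, not the BCL implementation), the vectorized IndexOfAny does exactly these comparisons but on a whole vector per step: with 16-bit chars that is 16 elements per iteration with Vector256 on x64 versus 8 with Vector128 on ARM64, which accounts for the gap.

```c
#include <stddef.h>

/* Scalar version of the IndexOfAny("HOho") scan the regex engine uses to
 * find the next candidate match start. The SIMD versions broadcast each
 * needle, compare a full vector of input chars at once, and OR the masks. */
ptrdiff_t index_of_any4(const unsigned short *s, size_t len,
                        unsigned short a, unsigned short b,
                        unsigned short c, unsigned short d) {
    for (size_t i = 0; i < len; i++) {
        unsigned short ch = s[i];
        if (ch == a || ch == b || ch == c || ch == d)
            return (ptrdiff_t)i;
    }
    return -1; /* no candidate found */
}
```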

@danmoseley
Member

danmoseley commented Mar 31, 2022

@EgorBo is that IndexOfAny(char, char..) work part of #66993 ?

@EgorBo
Member

EgorBo commented Mar 31, 2022

@EgorBo is that IndexOfAny(char, char..) work part of #66993 ?

It is, but I'm starting to think that we won't be able to properly lower Vector256 to two Vector128s in the JIT, so I wonder if we should do it at the C#/IL level instead (e.g. source generators) if we really want to. Some say that these APIs generally work with small data, and cases where we need to open a 0.5MB book and find a word in it are rare.

@tannergooding
Member

I really don't think it's worth focusing on or investing in that.


Like you mentioned, doing it in the JIT is somewhat problematic because you have to take Vector256<T>, which is a user-defined non-HVA struct (not equivalent to struct Hva256<T> { Vector128<T> _lower; Vector128<T> _upper; }), and then decompose it into 2x efficient 128-bit operations.

Decomposition here isn't necessarily trivial and has questionable throughput for various operations, leading users to a potential pit of failure, particularly when running on low-power devices (it may negatively impact mobile).

We could do some clever things here and various other optimizations to make it work nicely (including treating it as an HVA), but it's not a small amount of work.


On top of that, it won't really "close" the gap. The places where doing 2x 128-bit ops helps on ARM64 are likely the same places where doing 2x 256-bit ops on x64 would provide similar gains.

We simply shouldn't be trying to compare 128-bit Arm64 vs 256-bit x64, just like we shouldn't compare 256-bit x64 to 512-bit x64 (or 128-bit x64 to 256-bit x64); nor should we try to compare ARM SVE (if/when we get that support) against x64.

We should instead, when doing x64 vs Arm64 comparisons, compare 128-bit Arm64 to 128-bit x64. The "simplest" way to do that is generally COMPlus_EnableAVX2=0, but ideally we'd have a way to force 128-bit code paths without disabling any ISAs.
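In practice that looks something like the following (the benchmark filter is illustrative; only the environment variable itself is the documented knob):

```shell
# Force x64 onto 128-bit code paths so the comparison against Arm64's
# Vector128 paths is closer to like-for-like.
export DOTNET_EnableAVX2=0      # COMPlus_EnableAVX2=0 also works
dotnet run -c Release -- --filter 'System.Memory.Span*'
```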

@danmoseley
Member

some say that generally these APIs mostly work with small data and cases when we need to open a 0.5Mb book and find a word in it are rare..

I don't think you can assume this given they're critical to regex matching. @stephentoub @joperezr may have a better sense of typical regex text lengths (of course it also depends on how common hits are)

We simply shouldn't be trying to compare 128-bit Arm64 vs 256-bit x64

Comparing across hardware is inevitably bogus -- I thought the purpose of this exercise was to look for unusual ratios that might suggest room for targeted improvement by whatever means. Just sounds like there may not be a means, in this case.

@EgorBo
Member

EgorBo commented Mar 31, 2022

On top of that, it won't really "close" the gap. The places where doing 2x 128-bit ops on ARM64 are likely the same places where doing 2x 256-bit ops on x64 would provide similar gains.

I support your point; however, I think the SpanHelpers methods are core performance primitives (just like memset and memcpy) in many things, especially IndexOf, IndexOfAny and SequenceEqual. I've seen these three in a lot of profiles in different apps (though I haven't measured the average input size they worked on), so they might deserve a 2x256 path or even a 4x256 one - that's what native compilers do when you ask them to unroll a loop on e.g. Skylake; they will even do 2*(4*256) per iteration. Although, in order to close the gap here for arm64, we'd need SVE2 😄

We could add JIT support here, e.g. the JIT would be responsible for replacing SpanHelpers.IndexOf with a call to a heavily optimized, pipelined version if inputs are usually big (PGO).

@EgorBo
Member

EgorBo commented Mar 31, 2022

https://godbolt.org/z/MxhGPPvaj

Here I wrote a simple loop that adds 2 to all elements of an array of integers:

  1. arm64 with all ISAs available - two SVE2 vectors
  2. arm64 for Apple-M1 - two Vector128 operations
  3. x64 Skylake - 2 groups of 4 Vector256 operations

I didn't even use -O3 here 😐
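The loop in question is just this (a plain C version of the godbolt snippet, assuming the same int-array shape); auto-vectorizers widen it per target, which is exactly the 128-bit vs 256-bit (vs SVE2) difference discussed above:

```c
#include <stddef.h>

/* Add 2 to every element. Compiled for a generic modern arm64 target this
 * becomes SVE2 vector ops, for Apple M1 paired 128-bit NEON ops, and for
 * Skylake unrolled groups of 256-bit AVX2 ops. */
void add_two(int *a, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] += 2;
}
```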

@tannergooding
Member

I support your point, however, I think SpanHelpers methods are core performance primitives (just like memset and memcpy) in many things, especially IndexOf, IndexOfAny and SequenceEqual, I've seen these 3 in a lot of profiles in different apps (but I've not measured the average input size they worked on) so they might deserve to have 2x256 path or even 4x256 - that's what native compilers do when you ask them to unroll a loop on e.g. Skylake - they will even do 2*(4*256) per iteration. Although, in order to close the gap here for arm64 we need SVE2 😄

Right. My point is that we shouldn't drive the work solely based on closing some non-representative Arm64 vs x64 perf gap, because that will be impossible given the two sets of hardware we have (particularly if we actually try to do our best for each platform).

If it is perf critical, we should be hand tuning this to fit our needs for all the relevant platforms. If that includes manually unrolling and pipelining, then that's fine (assuming numbers across the hardware we care about show the respective gains).

@danmoseley
Member

These APIs are perf critical (certainly for char, if it matters) -- if we think it's feasible, at reasonable cost, to make them significantly faster on this architecture by whatever means, can we get an issue open for that?

@EgorBo
Member

EgorBo commented Apr 1, 2022

These API's are perf critical (certainly for 'char', if it matters)-- if we think it's feasible at reasonable cost to make them significantly faster on this architecture by whatever means, can we get an issue open for that?

Sure, but I'd first love to mine some data - from apps, first parties, and benchmarks - to understand typical inputs better.
