
[Arm64] Use SIMD register to zero init frame #46609

Merged
merged 14 commits on Jan 26, 2021

Conversation

echesakov
Contributor

@echesakov echesakov commented Jan 6, 2021

Below are the results of running benchmarks as in #32538

  • baseline
  • diff where DC ZVA instruction is prohibited
  • diff where DC ZVA instruction is permitted for memory regions greater than or equal to 256 bytes

windows-10 (benchmark chart)

ubuntu-18-04 (benchmark chart)

@echesakov echesakov added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Jan 6, 2021
@echesakov echesakov added this to the 6.0.0 milestone Jan 6, 2021
@echesakov echesakov self-assigned this Jan 6, 2021
@echesakov echesakov linked an issue Jan 6, 2021 that may be closed by this pull request
@echesakov echesakov added this to Needs Triage in .NET Core CodeGen via automation Jan 12, 2021
@echesakov echesakov moved this from Needs Triage to In progress in .NET Core CodeGen Jan 12, 2021
@echesakov echesakov mentioned this pull request Jan 12, 2021
@echesakov
Contributor Author

@TamarChristinaArm I was surprised to find that on Windows I see almost no performance improvement from the change. However, you can notice that the distribution becomes unimodal, as opposed to the baseline, where you can see two trends.

Would you say that the results look reasonable? Do you have any suggestions on what else I could try to improve here, other than using the dc zva instruction? (I decided to do that as a separate step.)

I figured I don't need to employ all the optimizations as in https://github.com/ARM-software/optimized-routines/blob/0f4ae0c5b561de25acb10130fd5e473ec038f89d/string/aarch64/memset.S since the region size is a multiple of 4 (actually, I think it is a multiple of 8, but I need to double-check this), while for memset it is important to handle all possible sizes.
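Not the actual JIT code, but a minimal C model of the zeroing sequence being discussed: 32-byte chunks correspond to an stp q0, q0 pair of SIMD stores, with smaller stores covering the tail. The instruction names in the comments are the AArch64 stores being modeled; memset stands in for each emitted store, and the multiple-of-4 assumption comes from the comment above.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of frame zeroing with SIMD register pairs.
 * Assumes size is a multiple of 4, per the discussion above. */
static void zero_frame_sketch(uint8_t *p, size_t size)
{
    assert(size % 4 == 0);
    while (size >= 32) {   /* stp q0, q0  - 32 bytes per store pair */
        memset(p, 0, 32);
        p += 32; size -= 32;
    }
    if (size >= 16) {      /* str q0      - 16 bytes */
        memset(p, 0, 16);
        p += 16; size -= 16;
    }
    if (size >= 8) {       /* str xzr     - 8 bytes */
        memset(p, 0, 8);
        p += 8; size -= 8;
    }
    if (size == 4)         /* str wzr     - 4 bytes */
        memset(p, 0, 4);
}
```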

@TamarChristinaArm
Contributor

@TamarChristinaArm I was surprised to find that on Windows I see almost no performance improvement from the change. However, you can notice that the distribution becomes unimodal, as opposed to the baseline, where you can see two trends.

Correct me if I'm wrong, but the SQ2 is a Cortex-A76 based big.LITTLE core right? In which case the bimodal behavior of the core is most likely because it didn't enter streaming mode[1] for all the cases. The larger stores of the stp of Q-regs is most likely tickling the heuristic more consistently.

Would you say that the results look reasonable? Do you have any suggestions on what else I could try to improve here, other than using the dc zva instruction? (I decided to do that as a separate step.)

No, the only other thing is what you already have on your roadmap, which is the loop alignment; 16 bytes should be enough for most cases.
The ZVA should help as well, as [1] explains it will enter streaming mode more readily.

[1] https://developer.arm.com/documentation/100798/0300/functional-description/level-1-memory-system/cache-behavior/write-streaming-mode

@TamarChristinaArm
Contributor

I figured I don't need to employ all the optimizations as in https://github.com/ARM-software/optimized-routines/blob/0f4ae0c5b561de25acb10130fd5e473ec038f89d/string/aarch64/memset.S since the region size is a multiple of 4 (actually, I think it is a multiple of 8, but I need to double-check this), while for memset it is important to handle all possible sizes.

Yeah, if on average the memsets are of large regions, then the smaller optimizations aren't really worth doing for now. For AoR it made sense, as benchmark suites such as SPECCPU were mostly using smaller memsets.

@echesakov
Contributor Author

echesakov commented Jan 16, 2021

Correct me if I'm wrong, but the SQ2 is a Cortex-A76 based big.LITTLE core right? In which case the bimodal behavior of the core is most likely because it didn't enter streaming mode[1] for all the cases. The larger stores of the stp of Q-regs is most likely tickling the heuristic more consistently.

I don't know how to tell what microarchitecture it's based on. Is there any Windows tool that can do this?

The Wikipedia article says it has 4+4 cores, which suggests it is big.LITTLE.

The ZVA should help as well as [1] explains it will enter streaming mode more readily.

[1] https://developer.arm.com/documentation/100798/0300/functional-description/level-1-memory-system/cache-behavior/write-streaming-mode

Thanks for the link - I am working on supporting dc zva in the JIT and VM. Will be back soon with new benchmark results.

@echesakov
Contributor Author

Updated the first post with new results. I experimentally found that it becomes profitable to use dc zva inside a loop for memory regions >= 256 bytes. Note that I haven't used the instruction outside of the loop - I am not sure the benefits are worth the complexity.

@TamarChristinaArm I have a question about using this instruction on big.LITTLE with different cache line sizes for big and little cores. Is it guaranteed for DCZID_EL0 to contain the same value if executed on a big and a LITTLE core? Would it be safe to check the register only once and cache its value as it is done here?

Would you suggest using this instruction more broadly - for example, in other places where memory needs to be zeroed (e.g. the GC heap)? If so, I can open an issue to track this work.

@dotnet dotnet deleted a comment from azure-pipelines bot Jan 21, 2021
@TamarChristinaArm
Contributor

Nice speedup! Clearly a gain on both platforms, good work :)

I don't know how to tell what microarchitecture it's based on. Is there any Windows tool that can do this?

Hmm, no, not that I'm aware of. You could get the part number, but unless specific support for it has been added to a public compiler, we wouldn't be able to tell what it means.
Not really "official", but [1] seems to indicate it's indeed a Cortex-A55 and Cortex-A76 big.LITTLE pair.

[1] https://www.cpu-monkey.com/en/cpu-qualcomm_snapdragon_microsoft_sq2-1850

Updated the first post with new results. I experimentally found that it becomes profitable to use dc zva inside a loop for memory regions >= 256 bytes. Note that I haven't used the instruction outside of the loop - I am not sure the benefits are worth the complexity.

Hmm, do you actually see a performance degradation with < 256 bytes? If not, always using ZVA when it's available and the aligned region is >= 64 bytes would simplify your logic a bit, wouldn't it?

@TamarChristinaArm I have a question about using this instruction on big.LITTLE with different cache line sizes for big and little cores. Is it guaranteed for DCZID_EL0 to contain the same value if executed on a big and a LITTLE core? Would it be safe to check the register only once and cache its value as it is done here?

So there are a few parts to this question. It's theoretically not enforced anywhere that the little and big cores have the same cache line sizes. Practically speaking however, almost all systems have this. However those that don't can't ever use ZVA because you don't know when the kernel migrates you. You could be migrated after you read the register but before you clear it.

It's a fairly dangerous setup when the cores don't match. To deal with this, in the Linux kernel DCZID_EL0 is virtualized, and it's turned off when the cache lines don't match. I don't know what the Windows kernel does here, but I assume it must do something similar for correctness.

Would it be safe to check the register only once and cache its value as it is done here?

So on Linux/Android the answer is yes [2] (No code in link, just cover letter so should be safe to read). On Windows I can only assume yes as I don't know the behavior of the kernel here. But if it doesn't do something like the above then that would be a bug.

[2] http://lists.infradead.org/pipermail/linux-arm-kernel/2016-September/453859.html

Would you suggest using this instruction more broadly - for example, in other places where memory needs to be zeroed (e.g. the GC heap)? If so, I can open an issue to track this work.

Yeah, I think it would be beneficial to use anywhere you need a memset. You might want to consider linking against AoR on Windows as well for the native code if that does a lot of memsets. On Linux you would get these for free through new glibc versions.

@echesakov
Contributor Author

Hmm, do you actually see a performance degradation with < 256 bytes? If not, always using ZVA when it's available and the aligned region is >= 64 bytes would simplify your logic a bit, wouldn't it?

Yes. First, I tried to unconditionally allow using DC ZVA for regions >= 192 bytes. But given that the JIT would still need to emit 4 stp q-reg instructions to zero the first and last 64 bytes of such a region, the overhead of those outweighed the performance benefit of the instruction. Then, I did two experiments - the JIT would emit a loop with DC ZVA only when the loop would run for at least 2 or 3 iterations - meaning DC ZVA was allowed for regions >= 256 or >= 320 bytes, respectively. I compared them and found the former to be slightly more performant.

ubuntu-18-04-256-vs-320 (benchmark chart)

I am not sure I understand how I could simplify the logic. As far as I know, the JIT can't tell whether an initialized memory region is aligned to a 64-byte boundary. The best we can do is to assume that the frame pointer is aligned to a 16-byte boundary and extend this assumption to the region when untrLclLo/untrLclHi are multiples of 16. The JIT would still need to manually initialize the first and last 64 bytes with stp q-reg, right?

As for R2R scenarios - if we don't know whether DC ZVA will be supported on a target machine and properly configured (i.e., its block size is 64 bytes), then in the AOT scenario the JIT emits a loop with a pair of stp q-reg. Hence the JIT still needs to support two different ways of zeroing the frame - with and without DC ZVA.
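The iteration-count reasoning above can be modeled as a small C sketch (names are illustrative, not the JIT's): with the first and last 64 bytes zeroed by stp q-reg stores, a region of at least 256 bytes leaves a middle of at least 128 bytes, i.e. at least two 64-byte dc zva iterations.

```c
#include <stddef.h>

/* Hypothetical model of the threshold logic described above. */
typedef struct {
    size_t head_bytes; /* cleared with stp q-reg before the loop */
    size_t zva_iters;  /* dc zva iterations, 64 bytes each */
    size_t tail_bytes; /* cleared with stp q-reg after the loop */
} ZeroPlan;

static ZeroPlan plan_zeroing(size_t size)
{
    ZeroPlan p = {0, 0, 0};
    if (size >= 256) {
        /* head/tail handle the possibly-unaligned 64-byte edges;
         * the >= 256 threshold guarantees at least 2 loop iterations */
        p.head_bytes = 64;
        p.tail_bytes = 64;
        p.zva_iters  = (size - 128) / 64;
    } else {
        /* below the threshold, everything goes through stp/str q-reg */
        p.head_bytes = size;
    }
    return p;
}
```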

@echesakov
Contributor Author

So there are a few parts to this question. It's theoretically not enforced anywhere that the little and big cores have the same cache line sizes. Practically speaking however, almost all systems have this. However those that don't can't ever use ZVA because you don't know when the kernel migrates you. You could be migrated after you read the register but before you clear it.

It's a fairly dangerous setup when the cores don't match. To deal with this, in the Linux kernel DCZID_EL0 is virtualized, and it's turned off when the cache lines don't match. I don't know what the Windows kernel does here, but I assume it must do something similar for correctness.

Thank you for the reply, Tamar! If that is the case, then I guess DCZID_EL0<4> would signal that DC ZVA is prohibited?

@echesakov echesakov marked this pull request as ready for review January 21, 2021 19:48
@echesakov
Contributor Author

@dotnet/jit-contrib This is ready for review, PTAL.

Member

@BruceForstall BruceForstall left a comment


LGTM. I wonder if the loop versions, especially the larger loop versions, should be activated for smaller sizes under a JIT stress mode.

We're not going to see the DC ZVA instruction except for native hosted JIT scenarios (or maybe SPMI playback of arm64 native PMI/test run collections). Should there be a way to generate this for altjit scenarios?

InstructionSet_Rdm_Arm64=17,
InstructionSet_Sha1_Arm64=18,
InstructionSet_Sha256_Arm64=19,
InstructionSet_Dczva=12,
Member

You need to change the JIT/EE version GUID.

Contributor Author

Done

@TamarChristinaArm
Contributor

Yes. First, I tried to unconditionally allow using DC ZVA for regions >= 192 bytes. But given that the JIT would still need to emit 4 stp q-reg instructions to zero the first and last 64 bytes of such a region, the overhead of those outweighed the performance benefit of the instruction. Then, I did two experiments - the JIT would emit a loop with DC ZVA only when the loop would run for at least 2 or 3 iterations - meaning DC ZVA was allowed for regions >= 256 or >= 320 bytes, respectively. I compared them and found the former to be slightly more performant.

Yeah, you can squeeze it to 160 by using overlapping stores (which is what AoR gets away with), but I have been told that older cores do not like this very much. So I think your approach is correct (and seems to match the threshold at which the glibc ifunc with ZVA enabled is used), as I think you are tuning for as broad a range of cores as possible.

I am not sure I understand how I could simplify the logic. As far as I know, the JIT can't tell whether an initialized memory region is aligned to a 64-byte boundary.

No, I think the confusion was that on my initial look I had misread the remainder check. It does look as optimal as possible without inadvertently slowing down older cores.

Thank you for the reply, Tamar! If that is the case, then I guess DCZID_EL0<4> would signal that DC ZVA is prohibited?

Yeah, that bit would then be set.
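As a reference sketch of the register layout being discussed: per the Arm ARM, bit 4 of DCZID_EL0 (DZP) set means DC ZVA is prohibited, and bits [3:0] (BS) hold log2 of the block size in 4-byte words. Actually reading the register requires an mrs x, dczid_el0 on AArch64; this only models decoding an already-read value.

```c
#include <stdbool.h>
#include <stdint.h>

/* DZP (bit 4): 1 means DC ZVA is prohibited */
static bool dc_zva_prohibited(uint64_t dczid)
{
    return (dczid >> 4) & 1;
}

/* BS (bits [3:0]): log2 of block size in 4-byte words,
 * so block size in bytes = 2^BS words * 4 bytes = 4 << BS */
static unsigned dc_zva_block_size_bytes(uint64_t dczid)
{
    unsigned bs = (unsigned)(dczid & 0xF);
    return 4u << bs;
}
```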

@BruceForstall
Member

@janvorli maybe you could double-check the VM side changes. And, fyi about the zeroing algorithm change.

…nSet_Dczva flag in src/coreclr/vm/arm64/asmhelpers.S src/coreclr/vm/arm64/asmhelpers.asm src/coreclr/vm/codeman.cpp
…rc/coreclr/tools/Common/Internal/Runtime/ReadyToRunInstructionSetHelper.cs src/coreclr/tools/Common/JitInterface/CorInfoInstructionSet.cs src/coreclr/tools/Common/JitInterface/ThunkGenerator/InstructionSetDesc.txt
…/jit/emitfmtsarm64.h src/coreclr/jit/instrsarm64.h
@echesakov echesakov force-pushed the Arm64-Use-SIMDReg-ZeroInitFrame branch from 98452f6 to 2352f16 Compare January 26, 2021 02:41
@echesakov
Contributor Author

Had to rebase on top of master in order to update jiteeversionguid.h

@echesakov
Contributor Author

cc @jeffschwMSFT @richlander @jkotas who might be interested in this work

Member

@kunalspathak kunalspathak left a comment


Minor feedback and a question.

// 3) if simdRegZeroed is true then 128-bit wide zeroSimdReg contains zeroes.

const int bytesUseZeroingLoop = 192;

Member

Could you use #define for the numeric constants like 192, 256, etc.?

Contributor Author

I could, but why would that be preferable to using const int? I doubt that having these numbers outside of the algorithm implementation (e.g. in target.h) would be more useful.

// add x9, fp, #(untrLclLo-32)
// mov x10, #(bytesToWrite-64)
//
// loop:
Member

Did you verify whether aligning the loop gives any benefits, as @TamarChristinaArm pointed out?

Contributor Author

I did two experiments, aligning the head of the loop on 16-byte and 32-byte boundaries - neither demonstrated measurable performance differences, so I decided not to implement the alignment.

@echesakov echesakov merged commit c1d8656 into dotnet:master Jan 26, 2021
.NET Core CodeGen automation moved this from In progress to Done Jan 26, 2021
@echesakov echesakov deleted the Arm64-Use-SIMDReg-ZeroInitFrame branch January 26, 2021 19:51
@AndyAyersMS
Member

Nice work, @echesakovMSFT. We should be looking at microbenchmark perf early next week. Will make sure to post back here if we see anything there from your change.

A couple of things for follow up:

  • Is it worth considering this for large internal initobj/initblk too? And the zeroing part of a zeroing stackalloc?
  • @TamarChristinaArm are the writes done by dc zva guaranteed to atomically update with some granularity?

@TamarChristinaArm
Contributor

@TamarChristinaArm are the writes done by dc zva guaranteed to atomically update with some granularity?

@AndyAyersMS No, architecturally there is nothing that guarantees DC ZVA writes are atomic, nor anything about the granularity of the final writes. Practically speaking, the behavior of ZVA is quite core-dependent, and how each core handles it can be quite different. The Cortex-A76 having a write streaming mode, for instance, makes it handle ZVA quite differently from older cores.

@echesakov
Contributor Author

Is it worth considering this for large internal initobj/initblk too? And the zeroing part of a zeroing stackalloc?

@AndyAyersMS Yes, I was planning to look into other places where this instruction can be used to give us some performance improvements. In addition, we can improve the JIT_MemSet implementation on windows-arm64 and use SIMD registers in the same fashion as AoR does for larger memory regions.

@msftbot msftbot bot locked as resolved and limited conversation to collaborators Feb 28, 2021
Successfully merging this pull request may close these issues.

[Arm64] Use stp and str (SIMD) for stack prolog zeroing
6 participants