Refactoring the ARM intrinsics to match API review and share code with x86 #25508

Merged
merged 30 commits into from Oct 11, 2019

@tannergooding
Member

commented Jun 30, 2019

This updates the ARM64 intrinsics to match the proposed layout from: dotnet/corefx#37199.

This also updates the ARM64 intrinsics to share much of the importation logic and various data structures that were already created for x86.

Currently, this also removes many of the APIs that were exposed as part of the Arm.AdvSimd class, but I am working on updating those to match the above proposal as well.
-- I don't think merging this should be blocked on that, but since this can't be merged until after master starts targeting .NET 5, I will try to get it completed before then.

@tannergooding

Member Author

commented Jun 30, 2019

Tagging @CarolEidt and @TamarChristinaArm in advance.

This can't/shouldn't be merged until after master is updated to target .NET 5, but this is a larger PR, so it will take more time to review.

// /// int64x1_t vabs_s64 (int64x1_t a)
// /// A64: ABS Dd, Dn
// /// </summary>
// public static Vector64<ulong> AbsScalar(Vector64<long> value) { throw new PlatformNotSupportedException(); }

tannergooding Jun 30, 2019

Author Member

There are native intrinsics that correspond to these and many other Vector64<double>, Vector64<long>, and Vector64<ulong> methods.

We should fix up the JIT to properly support these types so the methods can be exposed.

TamarChristinaArm Jul 1, 2019

Contributor

Hmm so the JIT doesn't currently handle Vector64? Should I then also comment them out in my local branch with the new intrinsics?

TamarChristinaArm Jul 1, 2019

Contributor

Oh, I see, you probably only mean the x1_t types.

tannergooding Jul 1, 2019

Author Member

Right. Vector64 works for 7/10 of the primitive types right now. The JIT doesn't properly support the x1_t types today.
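
To illustrate the "7/10" split: of the ten primitive element types, seven pack two or more elements into 64 bits, while the three 8-byte types (long, ulong, double) fill the whole vector with one element, which is what the x1_t native types describe. The following is a minimal sketch of that arithmetic only; `LanesIn64Bits` and `IsSingleLane` are hypothetical helper names and are not part of the JIT.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of elements of a given byte size that fit in a 64-bit vector.
inline std::size_t LanesIn64Bits(std::size_t elementSize)
{
    assert(elementSize == 1 || elementSize == 2 || elementSize == 4 || elementSize == 8);
    return 8 / elementSize;
}

// True for element types that occupy the whole 64-bit vector (the x1_t shapes).
template <typename T>
bool IsSingleLane()
{
    return LanesIn64Bits(sizeof(T)) == 1;
}
```

Under this model, `IsSingleLane<double>()`, `IsSingleLane<std::int64_t>()`, and `IsSingleLane<std::uint64_t>()` are the only true cases among the ten element types.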

#ifdef FEATURE_HW_INTRINSICS
#include "hwintrinsic.h"

instruction CodeGen::getOpForHWIntrinsic(GenTreeHWIntrinsic* node, var_types instrType)

tannergooding Jun 30, 2019

Author Member

This logic was removed and replaced with new logic in hwintrinsiccodegenarm64.cpp, to mirror the xarch file.

opts.setSupportedISA(InstructionSet_Vector128);
}

if(jitFlags.IsSet(JitFlags::JIT_FLAG_HAS_ARM64_ATOMICS) && JitConfig.EnableArm64Atomics())

tannergooding Jun 30, 2019

Author Member

All of these were pre-existing. I don't think we even support most of them today.

@@ -3,9 +3,44 @@
// See the LICENSE file in the project root for more information.

tannergooding Jun 30, 2019

Author Member

This file contains all the importation and HWIntrinsicInfo logic that can be shared between both architectures.

There are probably a few more places code could be shared with some ifdefs, but I haven't looked where.

#include "hwintrinsiclistxarch.h"
#elif defined (_TARGET_ARM64_)
#define HARDWARE_INTRINSIC(isa, name, ival, size, numarg, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, category, flag) \
{NI_##isa##_##name, #name, InstructionSet_##isa, ival, size, numarg, t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, category, static_cast<HWIntrinsicFlag>(flag)},

tannergooding Jun 30, 2019

Author Member

I changed ARM to use this because that is the pattern we were already following for naming things. Might be nice to also update x86 to do the same as it makes the #defines in hwintrinsiclist*.h smaller, but I felt that was a separate PR.
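
The naming pattern described here relies on token pasting (`NI_##isa##_##name`) and stringizing (`#name`), so a single list entry produces both the enum value and the table row. Below is a reduced, single-file sketch of that idea. It assumes a hypothetical three-column list (`DEMO_INTRINSIC_LIST`) rather than the real `HARDWARE_INTRINSIC` columns, and it passes the row macro as a parameter instead of re-including a header the way hwintrinsiclist*.h does.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, heavily reduced intrinsic list: (isa, name, numarg).
#define DEMO_INTRINSIC_LIST(X) \
    X(AdvSimd, Abs, 1)         \
    X(AdvSimd, Add, 2)

// First expansion: build the enum ids by pasting isa and name together.
enum DemoIntrinsicId
{
#define X(isa, name, numarg) NI_##isa##_##name,
    DEMO_INTRINSIC_LIST(X)
#undef X
    NI_Count
};

struct DemoIntrinsicInfo
{
    DemoIntrinsicId id;
    const char*     name;
    int             numArgs;
};

// Second expansion: build the info table from the same list.
static const DemoIntrinsicInfo demoInfoTable[] = {
#define X(isa, name, numarg) {NI_##isa##_##name, #name, numarg},
    DEMO_INTRINSIC_LIST(X)
#undef X
};

// Table rows are in enum order, so the id doubles as the index.
inline const DemoIntrinsicInfo& LookupDemoInfo(DemoIntrinsicId id)
{
    assert(id < NI_Count);
    return demoInfoTable[id];
}

// Linear search by method name; returns NI_Count when nothing matches.
inline DemoIntrinsicId FindDemoIntrinsic(const char* name)
{
    for (const DemoIntrinsicInfo& info : demoInfoTable)
    {
        if (std::strcmp(info.name, name) == 0)
        {
            return info.id;
        }
    }
    return NI_Count;
}
```

Because the list entries only carry `isa` and `name`, the `NI_` and `InstructionSet_` prefixes never have to be spelled out per row, which is the size reduction the comment refers to.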

//
// Return Value:
// true if the node has an imm operand; otherwise, false
bool HWIntrinsicInfo::isImmOp(NamedIntrinsic id, const GenTree* op)

tannergooding Jun 30, 2019

Author Member

Some of these methods aren't used by ARM anywhere yet, but they should still be applicable in a few spots later down the line.

#include "hwintrinsicArm64.h"
#ifdef FEATURE_HW_INTRINSICS

enum HWIntrinsicCategory : unsigned int

tannergooding Jun 30, 2019

Author Member

Categories, flags, and the HWIntrinsicInfo layout are all the same.

case NI_Vector128_AsSingle:
case NI_Vector128_AsUInt16:
case NI_Vector128_AsUInt32:
case NI_Vector128_AsUInt64:

tannergooding Jun 30, 2019

Author Member

This logic (specifically for As and get_Count) could be shared, but I didn't determine a good way to do so yet and decided to leave it to a future PR.

// Arguments:
// node - The hardware intrinsic node
//
void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node)

tannergooding Jun 30, 2019

Author Member

A lot of this logic is actually very similar to the x86 logic. There might be more opportunities to share bits here as well, but that isn't part of this PR.

@@ -1,306 +0,0 @@
// Licensed to the .NET Foundation under one or more agreements.

tannergooding Jun 30, 2019

Author Member

The hwintrinsicxarch.h and hwintrinsicarm64.h files aren't needed anymore, as all of this is being shared now.

@@ -981,105 +981,193 @@ int LinearScan::BuildSIMD(GenTreeSIMD* simdTree)
//
int LinearScan::BuildHWIntrinsic(GenTreeHWIntrinsic* intrinsicTree)

tannergooding Jun 30, 2019

Author Member

There is, again, quite a bit of logic that is similar between the x86 and ARM paths in lsra and lowering (basically everything that isn't the individual intrinsic handling), so there might be a good opportunity to share code in the future.

@tannergooding tannergooding added this to the Future milestone Jun 30, 2019
@tannergooding tannergooding force-pushed the tannergooding:arm-intrinsics branch 5 times, most recently from e31fd67 to 9d766b0 Jun 30, 2019
@tannergooding tannergooding force-pushed the tannergooding:arm-intrinsics branch from 9d766b0 to f9dfc03 Jul 19, 2019
@tannergooding

Member Author

commented Aug 20, 2019

CC. @BruceForstall.

This is the PR that updates most of the ARM infrastructure to be similar to the infrastructure we set up for x86/x64.

The one drawback that I am aware of is what I called out in the original post:

Currently, this also removes many of the APIs that were exposed as part of the Arm.AdvSimd class, but I am working on updating those to match the above proposal as well.

It, however, should be a good starting point for anyone wanting to finish this up (since I had gotten pulled off onto other work and haven't been able to yet).

Member

left a comment

It seems that I never published these comments, so I'm doing so now.

op2 = getArgForHWIntrinsic(argType, argClass);

#ifdef _TARGET_XARCH_
var_types op2Type = TYP_UNDEF;

CarolEidt Aug 20, 2019

Member

It's unclear to me why this can't be merged with the code below that handles the gather intrinsics, rather than having two separate ifdef regions.

tannergooding Sep 5, 2019

Author Member

I don't see any reason either. I've merged them.

tannergooding Sep 5, 2019

Author Member

It looks like it was split because gtIndexBaseType for the retNode needs to be set based on the argClass of op2. However, you have to get op1 before the retNode can be created and getting op1 mutates argClass.

I solved the issue by stashing the op2ArgClass in a local.
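
The ordering problem can be shown with a small model. This is a sketch under stated assumptions, not the importer code: `DemoImporter`, `PopArg`, and `ImportBinary` are hypothetical names, and strings stand in for the real argument and class handles. Arguments are popped last-first, each pop overwrites a shared `argClass`, and the node that needs op2's class can only be built after op1 has been popped.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct DemoArg
{
    std::string value;
    std::string argClass;
};

struct DemoImporter
{
    std::vector<DemoArg> stack;
    std::string          argClass; // overwritten by every PopArg call

    std::string PopArg()
    {
        assert(!stack.empty());
        DemoArg arg = stack.back();
        stack.pop_back();
        argClass = arg.argClass;
        return arg.value;
    }
};

struct DemoNode
{
    std::string op1;
    std::string op2;
    std::string indexBaseClass; // must describe op2, not op1
};

inline DemoNode ImportBinary(DemoImporter& importer)
{
    std::string op2 = importer.PopArg();
    // Stash op2's class in a local before the next pop replaces it.
    std::string op2ArgClass = importer.argClass;
    std::string op1 = importer.PopArg();
    return DemoNode{op1, op2, op2ArgClass};
}
```

Without the local, the node would be built from whatever class the last pop left behind, which in this model is op1's.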

@tannergooding tannergooding force-pushed the tannergooding:arm-intrinsics branch from f9dfc03 to ee5d6c2 Sep 4, 2019
@tannergooding

Member Author

commented Sep 4, 2019

Rebased onto dotnet/master. Will work on addressing comments made so far.

@tannergooding

Member Author

commented Sep 5, 2019

Addressed feedback given so far. Will resolve any CI failures once the jobs finish.

@CarolEidt

Member

commented Sep 5, 2019

@tannergooding you say above that:

Currently, this also removes many of the APIs that were exposed as part of the Arm.AdvSimd class

But it seems that they've been added back? However, the tests have been deleted. I realize it's probably a bit of work to resurrect the old tests, and perhaps it's best to just move to auto-generated tests, but I'm uncomfortable having no test exposure.

@tannergooding

Member Author

commented Sep 5, 2019

But it seems that they've been added back?

No, the Abs/Add APIs are the only ones exposed in the commit right now. The other APIs that were exposed are still missing.

I'm uncomfortable having no test exposure.

Right, I, at a minimum, will be adding generated tests for the APIs that are supported by this PR.

@tannergooding tannergooding force-pushed the tannergooding:arm-intrinsics branch from d2e1dcd to fce5683 Sep 6, 2019
@tannergooding

Member Author

commented Sep 6, 2019

@CarolEidt, I came across two issues:

  1. I can't add tests until after the reference assemblies are also updated (which requires something to be merged CoreCLR side first; at least until the repo merge).
  2. As best as I can tell, we don't currently support the LD1 instruction for doing Vector64 and Vector128 loads from memory (which limits what we can test). This is less of a problem, but I need to sit down with someone and ensure I understand how to add new instructions for ARM (it is quite a bit different than x86; and I think I've gotten 50-60% of the way there, but there are some unclear parts still).
@@ -872,59 +822,19 @@ void Lowering::ContainCheckSIMD(GenTreeSIMD* simdNode)
//
void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)

tannergooding Oct 9, 2019

Author Member

@CarolEidt, @echesakovMSFT

Just to confirm, there doesn't need to be containment for ARM64 except for immediates, correct?

CarolEidt Oct 9, 2019

Member

That's my understanding, i.e. there are no Arm64 intrinsics that can take either a register or memory operand.

TamarChristinaArm Oct 9, 2019

Contributor

Indeed, not SIMD but you do have other intrinsics such as the prefetch intrinsics (pld) that do take a memory operand. But that's probably out of scope for now I'd imagine.

tannergooding Oct 9, 2019

Author Member

Definitely out of scope for the PR; likely not outside the scope of the total work. We have Sse.Prefetch0/1/2 and Sse.PrefetchNonTemporal on the x86 side already.

CarolEidt Oct 10, 2019

Member

My point was that we only need to worry about the containment question when the same intrinsic can take either a memory or a register operand, which would not be the case for a prefetch.

@tannergooding

Member Author

commented Oct 9, 2019

At least locally, vector tests are passing. Looks like I need to define an insOpt for scalar values on a vector register to support AbsScalar and AddScalar.

Edit: Never mind, looks like I just need to ensure insOptsAnyArrangement returns false :)

@tannergooding tannergooding force-pushed the tannergooding:arm-intrinsics branch from 1092343 to cc76ea3 Oct 9, 2019
@tannergooding tannergooding force-pushed the tannergooding:arm-intrinsics branch from cc76ea3 to 06b1c76 Oct 9, 2019
HARDWARE_INTRINSIC(AdvSimd, Abs, -1, -1, 1, {INS_invalid, INS_abs, INS_invalid, INS_abs, INS_invalid, INS_abs, INS_invalid, INS_invalid, INS_fabs, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoContainment|HW_Flag_UnfixedSIMDSize)
HARDWARE_INTRINSIC(AdvSimd, AbsScalar, -1, 8, 1, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_fabs, INS_fabs}, HW_Category_SIMDScalar, HW_Flag_NoContainment)
HARDWARE_INTRINSIC(AdvSimd, Add, -1, -1, 2, {INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_fadd, INS_invalid}, HW_Category_SimpleSIMD, HW_Flag_NoContainment|HW_Flag_Commutative|HW_Flag_UnfixedSIMDSize)
HARDWARE_INTRINSIC(AdvSimd, AddScalar, -1, 8, 2, {INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_add, INS_fadd, INS_fadd}, HW_Category_SIMDScalar, HW_Flag_NoContainment|HW_Flag_Commutative)

tannergooding Oct 9, 2019

Author Member

@TamarChristinaArm.

What are the semantics around the unused bits for a given vector register?

That is, on Arm64 the vector register is 128 bits in length. However, some encodings only access the first 64 bits (Vector64<T> operations) and some instructions only access the first element (scalar instructions only access the first T).

So, for say AddScalar, which takes in Vector64<float> left, Vector64<float> right and returns a Vector64<float>: result[0] = left[0] + right[0], but what are the contents of result[1], and what are the contents of elements 2 and 3 of the backing register (bits 64-96 and bits 96-128)?

Are they cleared to zero, are they preserved from the last value in the register, etc.?

The same question goes for Vector64<T> Add(Vector64<T> left, Vector64<T> right) (what are the contents of bits 64-96 and bits 96-128 of the backing register)?

tannergooding Oct 9, 2019

Author Member

I'm asking because I'm seeing for:

 left: (0.41178197, 0.0033083581)
right: (0.11088856, 0.89114616)

You get:

result: (0.5226705, 0)

and I'm wanting to find out if this is deterministically 0 or if it is the last value contained in those bits.

TamarChristinaArm Oct 10, 2019

Contributor

By default, unless otherwise specified by the instruction (such as inserts or high operations), the unused bits of a register are always cleared to 0 on writes.

TamarChristinaArm Oct 10, 2019

Contributor

So, for say AddScalar, which takes in Vector64<float> left, Vector64<float> right and returns a Vector64<float>: result[0] = left[0] + right[0], but what are the contents of result[1], and what are the contents of elements 2 and 3 of the backing register (bits 64-96 and bits 96-128)?

They're all cleared to 0 in this case. They wouldn't be for an insert (e.g. INS), a structure load into one lane (e.g. LD3...[<lane>]), or some of the instructions which have a second part (those generally end with a 2 in the name, such as PMULL2).

tannergooding Oct 10, 2019

Author Member

By default, unless otherwise specified by the instruction (such as inserts or high operations), the unused bits of a register are always cleared to 0 on writes.

Awesome, thanks!

This is the same behavior as for x86 when looking at 128-bit vs 256-bit registers (just with 64-bit vs 128-bit on ARM). However, it differs from x86 when looking at scalar vs vector operations (x86 preserves the upper bits through bit 128 and clears bits 128-256, whereas ARM just clears all upper bits).

So I just wanted to get this validated in my head. It would likely also be something worth noting for when dealing with operations like Vector128.GetLower(), Vector64.ToVector128(), and when performing operations in general.
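
The difference can be modelled in a few lines. This is an illustrative sketch only: `Reg128`, `AddScalarArm64`, and `AddScalarSse` are hypothetical names, a 128-bit register is represented as four floats, and the x86 function models the legacy SSE form of ADDSS, where the destination is also the first source.

```cpp
#include <array>
#include <cassert>

// A 128-bit vector register viewed as four 32-bit float lanes.
using Reg128 = std::array<float, 4>;

// ARM64 model: a scalar FADD writes lane 0 and clears every other lane.
inline Reg128 AddScalarArm64(const Reg128& left, const Reg128& right)
{
    return Reg128{{left[0] + right[0], 0.0f, 0.0f, 0.0f}};
}

// x86 model (legacy SSE ADDSS): lane 0 is written and lanes 1-3 keep the
// values already held by the destination, which is also the left operand.
inline Reg128 AddScalarSse(const Reg128& left, const Reg128& right)
{
    return Reg128{{left[0] + right[0], left[1], left[2], left[3]}};
}
```

In this model the ARM64 result never depends on stale register contents, which matches the `(0.5226705, 0)` observation above, while the x86 result carries the left operand's upper lanes through.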

tannergooding Oct 10, 2019

Author Member

@CarolEidt, just for a sanity check: is the register allocator aware of these parts (scalar vs vector64 vs vector128)? Is there anything we need to set to tell it when an operation will preserve vs zero the "upper bits"?

CarolEidt Oct 10, 2019

Member

That raises a very interesting point. The register allocator only understands fully-overwritten or RMW operands. That is, an operation that partially preserves the value of a target should have that as both a source and a target, and that source should be marked as "delayFree". It is then up to the code generator to copy it if the RMW src and dst aren't allocated the same register. I suspect there are some issues here in what we're doing today.
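
A rough sketch of the code generator's side of that contract, for a lane insert that must keep the other lanes of its RMW source. Everything here is hypothetical (`DemoEmitter`, `GenInsertLane1`, plain integers as register numbers, instructions recorded as text), and the assert merely stands in for whatever guarantee the allocator actually provides about the non-RMW operand.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Records emitted instructions as text; stands in for a real emitter.
struct DemoEmitter
{
    std::vector<std::string> lines;

    void Emit(const std::string& text)
    {
        lines.push_back(text);
    }
};

inline std::string V(int reg)
{
    return "v" + std::to_string(reg);
}

// Inserts lane 0 of valueReg into lane 1 of the target while keeping the
// remaining lanes of rmwSrcReg.
inline void GenInsertLane1(DemoEmitter& emit, int targetReg, int rmwSrcReg, int valueReg)
{
    // Assumption: the allocator kept the value operand out of the target
    // register, so the copy below cannot clobber it.
    assert(valueReg != targetReg);

    if (targetReg != rmwSrcReg)
    {
        // RMW source and destination differ: copy the preserved bits first.
        emit.Emit("mov " + V(targetReg) + ".16b, " + V(rmwSrcReg) + ".16b");
    }
    emit.Emit("ins " + V(targetReg) + ".s[1], " + V(valueReg) + ".s[0]");
}
```

When the allocator assigns the same register to the RMW source and the target, only the insert is emitted; otherwise the copy comes first.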

@tannergooding tannergooding force-pushed the tannergooding:arm-intrinsics branch from 706fbda to 8f2a03e Oct 10, 2019
@tannergooding

Member Author

commented Oct 10, 2019

Going to mark this as "ready for review" now as all tests are passing locally (on my Windows on Arm64 device).

There are still a few methods which need tests (namely ArmBase), but I'd like to add those in a follow-up PR if possible, as that will unblock this PR from being merged and unblock the work being done by @TamarChristinaArm and @echesakovMSFT.

@tannergooding tannergooding marked this pull request as ready for review Oct 10, 2019
Member

left a comment

LGTM (still/again)

code |= insEncodeReg_Rn(id->idReg2()); // nnnnn
code |= insEncodeVLSElemsize(elemsize); // ss
code |= 0x5000; // xx.x - We only support the one register variant right now
code |= insEncodeReg_Rm(id->idReg3()); // mmmmm

echesakovMSFT Oct 11, 2019

Member

Actually, it looks like we might not need these as the instructions that load to multiple vector registers have different names (e.g. ld1, ld2, ld3, ld4).

I think we will need them to support ld1 with multiple vector registers; there are four variants of this instruction.
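
For reference, the four LD1 (multiple structures) variants differ only in the 4-bit opcode field at bits 15:12 of the A64 encoding. The sketch below encodes the no-offset form; it is not the JIT emitter, `Ld1OpcodeField` and `EncodeLd1NoOffset` are hypothetical names, and the post-indexed form with Rm that the diff touches is left out.

```cpp
#include <cassert>
#include <cstdint>

// Opcode field (bits 15:12) of LD1 (multiple structures), selected by the
// number of consecutive vector registers loaded.
inline std::uint32_t Ld1OpcodeField(unsigned registerCount)
{
    switch (registerCount)
    {
        case 1:
            return 0x7; // 0111
        case 2:
            return 0xA; // 1010
        case 3:
            return 0x6; // 0110
        case 4:
            return 0x2; // 0010
        default:
            assert(!"LD1 takes one to four registers");
            return 0;
    }
}

// Encodes LD1 {Vt...}, [Xn] with no offset.
// sizeField is the 2-bit element size (0 = 8-bit ... 3 = 64-bit).
inline std::uint32_t EncodeLd1NoOffset(
    unsigned registerCount, bool is128Bit, unsigned sizeField, unsigned rn, unsigned rt)
{
    assert(sizeField < 4 && rn < 32 && rt < 32);

    std::uint32_t code = 0x0C400000;             // fixed bits, load (L = 1)
    code |= (is128Bit ? 1u : 0u) << 30;          // Q
    code |= Ld1OpcodeField(registerCount) << 12; // opcode
    code |= sizeField << 10;                     // size
    code |= rn << 5;                             // Rn (base address)
    code |= rt;                                  // Rt (first vector register)
    return code;
}
```

For example, `EncodeLd1NoOffset(1, true, 0, 0, 0)` corresponds to `ld1 {v0.16b}, [x0]`, and passing a register count of 2 changes only the opcode field.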

@tannergooding

Member Author

commented Oct 11, 2019

I've logged #27139 (and assigned it to myself) to track adding the tests for the ArmBase APIs.

I have already started working on these and should hopefully have a PR up not terribly long after this one is merged 😄

@tannergooding tannergooding merged commit 8184d3f into dotnet:master Oct 11, 2019
44 of 46 checks passed
coreclr-ci Build #20191010.37 had test failures
coreclr-ci (Build OSX x64 checked) failed
WIP Ready for review
coreclr-ci (Build Linux arm checked) succeeded
coreclr-ci (Build Linux arm64 checked) succeeded
coreclr-ci (Build Linux arm64 release) succeeded
coreclr-ci (Build Linux x64 checked) succeeded
coreclr-ci (Build Linux_musl x64 checked) succeeded
coreclr-ci (Build Linux_musl x64 release) succeeded
coreclr-ci (Build Linux_rhel6 x64 release) succeeded
coreclr-ci (Build Test Pri0 CoreFX Linux x64 checked) succeeded
coreclr-ci (Build Test Pri0 CoreFX Windows_NT x64 checked) succeeded
coreclr-ci (Build Test Pri0 Linux arm checked) succeeded
coreclr-ci (Build Test Pri0 Linux arm64 checked) succeeded
coreclr-ci (Build Test Pri0 Linux_musl x64 release) succeeded
coreclr-ci (Build Test Pri0 R2R Windows_NT x64 checked) succeeded
coreclr-ci (Build Test Pri0 R2R Windows_NT x86 checked) succeeded
coreclr-ci (Build Test Pri0 Windows_NT arm checked) succeeded
coreclr-ci (Build Test Pri0 Windows_NT arm64 checked) succeeded
coreclr-ci (Build Test Pri0 Windows_NT x64 checked) succeeded
coreclr-ci (Build Test Pri0 Windows_NT x86 checked) succeeded
coreclr-ci (Build Windows_NT arm checked) succeeded
coreclr-ci (Build Windows_NT arm release) succeeded
coreclr-ci (Build Windows_NT arm64 checked) succeeded
coreclr-ci (Build Windows_NT arm64 release) succeeded
coreclr-ci (Build Windows_NT x64 checked) succeeded
coreclr-ci (Build Windows_NT x64 debug) succeeded
coreclr-ci (Build Windows_NT x64 release) succeeded
coreclr-ci (Build Windows_NT x86 checked) succeeded
coreclr-ci (Build Windows_NT x86 debug) succeeded
coreclr-ci (Checkout (Unix)) succeeded
coreclr-ci (Checkout (Windows)) succeeded
coreclr-ci (Formatting Linux x64) succeeded
coreclr-ci (Run Test Pri0 CoreFX Linux x64 checked) succeeded
coreclr-ci (Run Test Pri0 CoreFX Windows_NT x64 checked) succeeded
coreclr-ci (Run Test Pri0 Linux arm checked) succeeded
coreclr-ci (Run Test Pri0 Linux arm64 checked) succeeded
coreclr-ci (Run Test Pri0 Linux_musl x64 release) succeeded
coreclr-ci (Run Test Pri0 R2R Windows_NT x64 checked) succeeded
coreclr-ci (Run Test Pri0 R2R Windows_NT x86 checked) succeeded
coreclr-ci (Run Test Pri0 Windows_NT arm checked) succeeded
coreclr-ci (Run Test Pri0 Windows_NT arm64 checked) succeeded
coreclr-ci (Run Test Pri0 Windows_NT x64 checked) succeeded
coreclr-ci (Run Test Pri0 Windows_NT x86 checked) succeeded
coreclr-ci (Test crossgen-comparison Linux arm checked) succeeded
license/cla All CLA requirements met.