Consider expanding `Vector<T>` to support `nint` and `nuint` #36160

tannergooding · 2020-05-09T05:34:23Z

Rationale

Today Vector<T> supports the following 10 primitive types: byte, sbyte, short, ushort, int, uint, long, ulong, float, and double.

C# 9 is introducing support for nint and nuint which are variable sized integers matching the bitness of the underlying platform. That is, they are 32-bits on 32-bit systems and 64-bits on 64-bit systems.

As such, it may be beneficial to expand Vector<T> to additionally support these types so we can get rid of the using aliases and support performing the cross-platform vector operations using these new primitive types.

Proposal

Extend Vector<T> to support nint and nuint as valid primitive types. This will extend a number of existing generic functions which take a Vector<T> to also support taking the new types rather than throwing a PlatformNotSupportedException.

Additionally, the following non-generic APIs should be added for parity with the existing surface area:

namespace System.Numerics
{
    public partial struct Vector<T>
    {
        public static explicit operator Vector<nint>(Vector<T> value);
        public static explicit operator Vector<nuint>(Vector<T> value);
    }

    public static partial class Vector
    {
        public static Vector<nint> AsVectorIntPtr<T>(Vector<T> value);
        public static Vector<nuint> AsVectorUIntPtr<T>(Vector<T> value);

        public static Vector<nint> Equals(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> Equals(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> GreaterThan(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> GreaterThan(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> GreaterThanOrEqual(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> GreaterThanOrEqual(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> LessThan(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> LessThan(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> LessThanOrEqual(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> LessThanOrEqual(Vector<nuint> left, Vector<nuint> right);
    }
}

Other Considerations

For API names, we have a guideline that states to use the framework name rather than the language keyword name (e.g. Int32 and Int64 rather than int and long). However, the framework name for nint is IntPtr but the operators exposed and general use case for the two types is somewhat different, as such a name such as NInt and NUInt may be a better alternative. Whatever we choose, it should likely become the standard for nint moving forward.

The same request could be made for System.Runtime.Intrinsics, but the API bloat for this would be much larger and would need further consideration. It might be worthwhile to allow nint/nuint as valid T without exposing the additional overloads initially as that would at least unblock users from providing such APIs themselves.

The text was updated successfully, but these errors were encountered:

ghost · 2020-05-09T05:34:27Z

Tagging subscribers to this area: @tannergooding
Notify danmosemsft if you want to be subscribed.

tannergooding · 2020-05-09T05:35:01Z

CC. @pgovind, @CarolEidt, @echesakovMSFT

tannergooding · 2020-05-09T05:38:35Z

We have an existing use case in the runtime in Utf16Utility.Validation.cs where we are using the following to workaround not having this:

#if TARGET_64BIT
using nuint_t = System.UInt64;
#else
using nuint_t = System.UInt32;
#endif

RamType0 · 2020-05-16T03:00:58Z

Could you also support any reference types?
This could be so much powerful.(e.g. Collection look up)
Ofcource, we could not cast other type vectors to reference type vector.

tannergooding · 2020-05-16T05:05:20Z

Could you also support any reference types?

As I understand it, that would be a non-trivial ask and likely not for a lot of benefit. Many of the operations exposed aren't valid or don't make sense for reference types and the JIT doesn't support tracking GC types in the SIMD registers.
There are likely other considerations I'm not aware of as well and which would best be answered by someone like @jkotas or @Maoni0.

GrabYourPitchforks · 2020-05-16T20:54:15Z

I wonder if we could optimistically handle fast collection lookup anyway, even without support for reference types in SIMD. Consider that you have a T[] (where T is a reference type), and you want to quickly look up the index of any given T within the array. Assume referential equality checks, not deep equality checks.

The algorithm would then be as follows:

Pin the T instance that represents your search target, then project this T to a void* then to nuint.
Project the underlying T[] as a Span<nuint>.
Perform a vectorized search within the Span<nuint> looking for the "target" nuint.

Since the target is pinned, it cannot be moved by the GC, so you don't have to worry about having the GC track individual elements within the SIMD registers. It's possible that the GC might move other elements of the array while you're inspecting it, but that's ok since neither their original addresses nor their modified addresses will match the pinned address of the target you're seeking. This means you're guaranteed zero false positives and zero false negatives.

Again, this would only work for referential equality checks. Since I'm not sure how common referential vs. deep equality checks are I'm not sure how much benefit this would offer in practice.

RamType0 · 2020-05-17T03:08:08Z

BTW,maybe we also want support for enum.

RamType0 · 2020-05-17T03:23:08Z

Pin the T instance that represents your search target, then project this T to a void* then to nuint.

Maybe we need some compiler specific support or JIT intrinsic for done this effectively.
Such as

IntPtr e0,e1,e2,e3;//fixed-pointers
Unsafe.WriteUnaligned(ref Unsafe.As<IntPtr,byte>(ref e0),Unsafe.ReadUnaligned<Vector<nint>>(ref Unsafe.As<T,byte>(ref array[0])));
var vector = Unsafe.ReadUnaligned<Vector<nint>>(ref e0);

RamType0 · 2020-05-17T03:52:44Z

C# going to have record type feature.
So compiler can provide detailed,and structured information about it's equality.
If Vector<object> is supported (or work around with nint works), we could compare deep-equality by SIMD.

This is a pseudo code.

[EquatabilityContract]
[CompilerGenerated]
class Record
{
   [EquatabilityMember]
   long id;
}
Record[] records;
Record Find(long id)
{
   for(var i = 0;i + Vector256<object>.Count <records.Length;i+=Vector256<object>.Count)
   {
      var vector = Unsafe.ReadUnaligned<Vector256<object>>(ref records[i]);
      var ids = Avx2.GatherVector256<long>(JitHelper.OffsetOf<Record>(tokenof(id)),vector);//"tokenof" means member definition reference in IL.Compiler could generate it
      var compare = Avx2.CompareEqual(vector,Vector256.Create(id));
      var mm = Avx2.MoveMask(compare.AsBytes());
      if(mm!=0)
      {
         return Unsafe.As<Record>(vector.GetElement(mm/IntPtr.Size));
      }
   }
}

terrajobst · 2020-08-06T18:49:03Z

Video

We should use NInt and NUInt as the the non-keyword types names for APIs that have to refer to nint and nuint, as opposed to IntPtr and UIntPtr,

namespace System.Numerics
{
    public partial struct Vector<T>
    {
        public static explicit operator Vector<nint>(Vector<T> value);
        public static explicit operator Vector<nuint>(Vector<T> value);
    }

    public static partial class Vector
    {
        public static Vector<nint> AsVectorNInt<T>(Vector<T> value);
        public static Vector<nuint> AsVectorNUInt<T>(Vector<T> value);

        public static Vector<nint> Equals(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> Equals(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> GreaterThan(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> GreaterThan(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> GreaterThanOrEqual(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> GreaterThanOrEqual(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> LessThan(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> LessThan(Vector<nuint> left, Vector<nuint> right);

        public static Vector<nint> LessThanOrEqual(Vector<nint> left, Vector<nint> right);
        public static Vector<nuint> LessThanOrEqual(Vector<nuint> left, Vector<nuint> right);
    }
}

tannergooding · 2020-09-08T20:41:43Z

@jkotas, @CarolEidt: How important is maintaining perf of Vector<T> for platforms without acceleration (such as ARM32)?

Today, much of the code is generated via T4 templates and internally uses the Register struct, which is explicit layout with overlapping data.

However, given the existence of System.Runtime.CompilerServices.Unsafe we could remove the T4 templates and rewrite the software fallback logic to be generic loops instead which would greatly simplify the code and make it easier to insert the nint and nuint support, which doesn't work as nicely with the T4 setup.

The T4 templates also rarely work with the current S.P.Corelib setup and so the benefit of it is basically non-existent today

This would also remove duplication between the IsHardwareAccelerated and fallback paths, where the former uses for loops and the latter is manually unrolled.

The IsHardwareAccelerated path is relying on the basic loop unrolling support that exists for Vector<T>.Count, which we could enable for ARM32 if perf is a concern.

jkotas · 2020-09-08T21:26:15Z

Do you have any numbers for how much regression we would potentially see on ARM32?

It would impact Mono target platforms too. @marek-safar How much do we care about non-accelerated Vector<T> performance on Mono?

tannergooding · 2020-09-08T21:43:30Z

Do you have any numbers for how much regression we would potentially see on ARM32?

I don't have any currently but I can try and get some.

I imagine some parts would be improved, especially in user code, due to less data being zeroed (its currently a 16-byte, 66 field explicit layout struct). But some parts, namely the methods in S.P.Corelib, would likely regress due to running a for loop, rather than a manually unrolled loop (although several functions aren't unrolled today).

marek-safar · 2020-09-09T06:57:05Z

How much do we care about non-accelerated Vector performance on Mono?

@jkotas we don't care about performance for that config

tannergooding · 2020-09-09T23:05:09Z

Do you have any numbers for how much regression we would potentially see on ARM32?

@jkotas, it looks like the worst case is about a 4x* perf regression (for byte which has 16 elements). This only looks to be the case on ARM32 where the loop unrolling support around Vector<T>.Count doesn't exist.

Some, like LessThanOrEqualAll is actually closer to 6x, but this is because it does LessThan || Equals, rather than doing an efficient "single pass".

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.508 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores

Current:

Method	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
CountBenchmark	0.0000 ns	0.0000 ns	0.0000 ns	0.0000 ns	0.0000 ns	0.0000 ns	-	-	-	-
OneBenchmark	11.6882 ns	0.0298 ns	0.0264 ns	11.6883 ns	11.6550 ns	11.7447 ns	-	-	-	-
ZeroBenchmark	0.0000 ns	0.0000 ns	0.0000 ns	0.0000 ns	0.0000 ns	0.0000 ns	-	-	-	-
EqualsBenchmark	3.7958 ns	0.0291 ns	0.0258 ns	3.8016 ns	3.7472 ns	3.8370 ns	-	-	-	-
GetHashCodeBenchmark	99.3895 ns	1.4870 ns	1.2417 ns	99.2845 ns	97.4959 ns	101.7576 ns	-	-	-	-
AddOperatorBenchmark	14.7578 ns	0.0843 ns	0.0748 ns	14.7397 ns	14.6672 ns	14.9465 ns	-	-	-	-
BitwiseAndOperatorBenchmark	12.2149 ns	0.0902 ns	0.0800 ns	12.2393 ns	12.0871 ns	12.3594 ns	-	-	-	-
BitwiseOrOperatorBenchmark	9.1529 ns	0.0340 ns	0.0302 ns	9.1471 ns	9.1069 ns	9.2057 ns	-	-	-	-
DivisionOperatorBenchmark	79.2174 ns	0.0941 ns	0.0786 ns	79.1944 ns	79.1173 ns	79.3931 ns	-	-	-	-
EqualityOperatorBenchmark	0.7216 ns	0.0280 ns	0.0248 ns	0.7208 ns	0.6867 ns	0.7715 ns	-	-	-	-
ExclusiveOrOperatorBenchmark	9.1752 ns	0.0468 ns	0.0438 ns	9.1527 ns	9.1113 ns	9.2425 ns	-	-	-	-
InequalityOperatorBenchmark	3.9465 ns	0.0683 ns	0.0605 ns	3.9228 ns	3.8851 ns	4.0766 ns	-	-	-	-
MultiplyOperatorBenchmark	15.1731 ns	0.0715 ns	0.0633 ns	15.1714 ns	15.0914 ns	15.3134 ns	-	-	-	-
OnesComplementOperatorBenchmark	21.1975 ns	0.0804 ns	0.0671 ns	21.2021 ns	21.0049 ns	21.2608 ns	-	-	-	-
SubtractionOperatorBenchmark	15.2344 ns	0.0691 ns	0.0613 ns	15.2315 ns	15.1622 ns	15.3523 ns	-	-	-	-
UnaryNegateOperatorBenchmark	17.3482 ns	0.0274 ns	0.0243 ns	17.3372 ns	17.3272 ns	17.4067 ns	-	-	-	-
AbsBenchmark	4.8659 ns	0.0293 ns	0.0259 ns	4.8537 ns	4.8366 ns	4.9118 ns	-	-	-	-
AddBenchmark	18.1764 ns	0.0569 ns	0.0504 ns	18.1810 ns	18.1020 ns	18.2635 ns	-	-	-	-
AndNotBenchmark	29.3541 ns	0.1666 ns	0.1558 ns	29.3400 ns	29.1411 ns	29.6693 ns	-	-	-	-
BitwiseAndBenchmark	10.0123 ns	0.0400 ns	0.0355 ns	10.0063 ns	9.9612 ns	10.0709 ns	-	-	-	-
BitwiseOrBenchmark	10.0021 ns	0.1060 ns	0.0991 ns	9.9950 ns	9.8593 ns	10.1830 ns	-	-	-	-
ConditionalSelectBenchmark	49.0080 ns	0.3078 ns	0.2879 ns	49.0135 ns	48.4501 ns	49.5042 ns	-	-	-	-
DivideBenchmark	78.8175 ns	0.2677 ns	0.2373 ns	78.7779 ns	78.4765 ns	79.2487 ns	-	-	-	-
DotBenchmark	15.2722 ns	0.0261 ns	0.0231 ns	15.2709 ns	15.2363 ns	15.3179 ns	-	-	-	-
EqualsStaticBenchmark	22.6560 ns	0.1268 ns	0.1186 ns	22.6159 ns	22.5144 ns	22.9051 ns	-	-	-	-
EqualsAllBenchmark	4.1479 ns	0.0527 ns	0.0467 ns	4.1447 ns	4.0745 ns	4.2333 ns	-	-	-	-
EqualsAnyBenchmark	29.4765 ns	0.0926 ns	0.0866 ns	29.4362 ns	29.3856 ns	29.6265 ns	-	-	-	-
GreaterThanBenchmark	21.7368 ns	0.0631 ns	0.0559 ns	21.7294 ns	21.6671 ns	21.8384 ns	-	-	-	-
GreaterThanAllBenchmark	32.9013 ns	0.0527 ns	0.0493 ns	32.9126 ns	32.8253 ns	32.9649 ns	-	-	-	-
GreaterThanAnyBenchmark	28.0047 ns	0.0764 ns	0.0714 ns	27.9833 ns	27.9141 ns	28.1723 ns	-	-	-	-
GreaterThanOrEqualBenchmark	52.3921 ns	0.3169 ns	0.2646 ns	52.2974 ns	52.1671 ns	53.0578 ns	-	-	-	-
GreaterThanOrEqualAllBenchmark	60.9816 ns	0.1275 ns	0.1065 ns	60.9825 ns	60.7853 ns	61.2152 ns	-	-	-	-
GreaterThanOrEqualAnyBenchmark	58.1081 ns	0.1660 ns	0.1386 ns	58.0799 ns	57.8858 ns	58.3585 ns	-	-	-	-
LessThanBenchmark	20.2578 ns	0.1129 ns	0.1000 ns	20.2367 ns	20.0859 ns	20.4389 ns	-	-	-	-
LessThanAllBenchmark	33.3310 ns	0.1187 ns	0.0991 ns	33.3558 ns	33.1460 ns	33.5284 ns	-	-	-	-
LessThanAnyBenchmark	25.5108 ns	0.0980 ns	0.0869 ns	25.5065 ns	25.3958 ns	25.6639 ns	-	-	-	-
LessThanOrEqualBenchmark	49.8892 ns	0.1327 ns	0.1176 ns	49.8391 ns	49.7525 ns	50.1483 ns	-	-	-	-
LessThanOrEqualAllBenchmark	60.0621 ns	0.1442 ns	0.1349 ns	60.0328 ns	59.8471 ns	60.2880 ns	-	-	-	-
LessThanOrEqualAnyBenchmark	55.2470 ns	0.0496 ns	0.0388 ns	55.2529 ns	55.1691 ns	55.3128 ns	-	-	-	-
MaxBenchmark	21.8661 ns	0.0363 ns	0.0322 ns	21.8606 ns	21.8211 ns	21.9399 ns	-	-	-	-
MinBenchmark	21.0789 ns	0.0779 ns	0.0691 ns	21.0662 ns	20.9944 ns	21.2350 ns	-	-	-	-
MultiplyBenchmark	18.0297 ns	0.0527 ns	0.0467 ns	18.0289 ns	17.9634 ns	18.1398 ns	-	-	-	-
NegateBenchmark	18.5310 ns	0.0626 ns	0.0555 ns	18.5090 ns	18.4585 ns	18.6348 ns	-	-	-	-
OnesComplementBenchmark	20.7130 ns	0.1875 ns	0.1754 ns	20.7374 ns	20.3737 ns	20.9449 ns	-	-	-	-
SquareRootBenchmark	34.9111 ns	0.0887 ns	0.0786 ns	34.8872 ns	34.8279 ns	35.0818 ns	-	-	-	-
SubtractBenchmark	18.0638 ns	0.0379 ns	0.0336 ns	18.0603 ns	17.9950 ns	18.1135 ns	-	-	-	-
XorBenchmark	10.2635 ns	0.1394 ns	0.1304 ns	10.2707 ns	10.0667 ns	10.5332 ns	-	-	-	-

Generic Unsafe implementation:
// * Summary *

BenchmarkDotNet=v0.12.1.1405-nightly, OS=Windows 10.0.19041.508 (2004/May2020Update/20H1)
Microsoft SQ1 3.0 GHz, 1 CPU, 8 logical and 8 physical cores
.NET Core SDK=6.0.100-alpha.1.20459.9
[Host] : .NET Core 6.0.0 (CoreCLR 6.0.20.45424, CoreFX 6.0.20.45424), Arm RyuJIT
Job-XVEAND : .NET Core 5.0 (CoreCLR 42.42.42.42424, CoreFX 42.42.42.42424), Arm RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000 Arguments=/p:DebugType=portable Toolchain=CoreRun
IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15
WarmupCount=1

Method	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
CountBenchmark	0.6085 ns	0.0292 ns	0.0259 ns	0.6163 ns	0.5516 ns	0.6352 ns	-	-	-	-
OneBenchmark	15.0520 ns	0.0734 ns	0.0687 ns	15.0311 ns	14.9681 ns	15.1906 ns	-	-	-	-
ZeroBenchmark	0.7316 ns	0.1004 ns	0.1156 ns	0.7163 ns	0.5323 ns	0.9691 ns	-	-	-	-
EqualsBenchmark	4.0958 ns	0.0461 ns	0.0408 ns	4.0859 ns	4.0474 ns	4.1932 ns	-	-	-	-
GetHashCodeBenchmark	308.9952 ns	2.5950 ns	2.3004 ns	308.3442 ns	306.2050 ns	314.2366 ns	-	-	-	-
AddOperatorBenchmark	61.4049 ns	0.0881 ns	0.0781 ns	61.3829 ns	61.2802 ns	61.5378 ns	-	-	-	-
BitwiseAndOperatorBenchmark	8.7401 ns	0.0907 ns	0.0804 ns	8.7529 ns	8.4814 ns	8.8201 ns	-	-	-	-
BitwiseOrOperatorBenchmark	5.8710 ns	0.0195 ns	0.0163 ns	5.8664 ns	5.8459 ns	5.8993 ns	-	-	-	-
DivisionOperatorBenchmark	243.7459 ns	6.7311 ns	7.7516 ns	244.9533 ns	227.9078 ns	257.7710 ns	-	-	-	-
EqualityOperatorBenchmark	0.9734 ns	0.0154 ns	0.0136 ns	0.9692 ns	0.9549 ns	1.0001 ns	-	-	-	-
ExclusiveOrOperatorBenchmark	5.9003 ns	0.0257 ns	0.0228 ns	5.8996 ns	5.8748 ns	5.9447 ns	-	-	-	-
InequalityOperatorBenchmark	3.5203 ns	0.0322 ns	0.0285 ns	3.5061 ns	3.4928 ns	3.5887 ns	-	-	-	-
MultiplyOperatorBenchmark	61.9186 ns	0.1388 ns	0.1159 ns	61.9120 ns	61.7196 ns	62.1257 ns	-	-	-	-
OnesComplementOperatorBenchmark	21.6788 ns	0.1246 ns	0.1105 ns	21.6766 ns	21.5047 ns	21.8764 ns	-	-	-	-
SubtractionOperatorBenchmark	60.1066 ns	0.2208 ns	0.1844 ns	60.1600 ns	59.8257 ns	60.3687 ns	-	-	-	-
UnaryNegateOperatorBenchmark	67.4316 ns	0.2463 ns	0.2183 ns	67.3973 ns	67.1599 ns	67.8942 ns	-	-	-	-
AbsBenchmark	10.3173 ns	0.1574 ns	0.1472 ns	10.2980 ns	10.0591 ns	10.5487 ns	-	-	-	-
AddBenchmark	65.3602 ns	0.1875 ns	0.1662 ns	65.3597 ns	65.0336 ns	65.6387 ns	-	-	-	-
AndNotBenchmark	22.8476 ns	0.0983 ns	0.0872 ns	22.8684 ns	22.6440 ns	22.9391 ns	-	-	-	-
BitwiseAndBenchmark	10.1515 ns	0.1625 ns	0.1520 ns	10.1560 ns	9.8568 ns	10.3946 ns	-	-	-	-
BitwiseOrBenchmark	8.7920 ns	0.0685 ns	0.0641 ns	8.7941 ns	8.6943 ns	8.9188 ns	-	-	-	-
ConditionalSelectBenchmark	35.1013 ns	0.0871 ns	0.0814 ns	35.0999 ns	34.9701 ns	35.2477 ns	-	-	-	-
DivideBenchmark	230.8056 ns	3.8200 ns	3.5733 ns	231.1309 ns	223.5282 ns	236.5047 ns	-	-	-	-
DotBenchmark	106.8848 ns	1.5138 ns	1.4160 ns	106.6555 ns	104.8663 ns	109.8214 ns	-	-	-	-
EqualsStaticBenchmark	38.7218 ns	0.1433 ns	0.1341 ns	38.6835 ns	38.5266 ns	38.9829 ns	-	-	-	-
EqualsAllBenchmark	3.5746 ns	0.0356 ns	0.0316 ns	3.5639 ns	3.5331 ns	3.6283 ns	-	-	-	-
EqualsAnyBenchmark	57.1994 ns	0.2407 ns	0.2134 ns	57.1643 ns	56.9201 ns	57.6293 ns	-	-	-	-
GreaterThanBenchmark	36.8960 ns	0.0887 ns	0.0786 ns	36.8852 ns	36.7171 ns	37.0227 ns	-	-	-	-
GreaterThanAllBenchmark	105.6515 ns	0.7164 ns	0.6351 ns	105.6380 ns	104.6500 ns	106.7682 ns	-	-	-	-
GreaterThanAnyBenchmark	94.5897 ns	1.4236 ns	1.3316 ns	94.5560 ns	92.3444 ns	97.3495 ns	-	-	-	-
GreaterThanOrEqualBenchmark	144.2161 ns	0.3195 ns	0.2832 ns	144.1130 ns	143.8592 ns	144.8957 ns	-	-	-	-
GreaterThanOrEqualAllBenchmark	225.7534 ns	0.7733 ns	0.6457 ns	225.8675 ns	224.5361 ns	226.7384 ns	-	-	-	-
GreaterThanOrEqualAnyBenchmark	214.1536 ns	1.0426 ns	0.9752 ns	213.6999 ns	213.0955 ns	216.1802 ns	-	-	-	-
LessThanBenchmark	34.9333 ns	0.1168 ns	0.1036 ns	34.9611 ns	34.7754 ns	35.1236 ns	-	-	-	-
LessThanAllBenchmark	162.8080 ns	1.7153 ns	1.6045 ns	162.2848 ns	160.3714 ns	165.8838 ns	-	-	-	-
LessThanAnyBenchmark	44.2883 ns	0.4552 ns	0.4258 ns	44.3683 ns	43.1221 ns	44.9381 ns	-	-	-	-
LessThanOrEqualBenchmark	183.8210 ns	0.4592 ns	0.4071 ns	183.8070 ns	183.3386 ns	184.5908 ns	-	-	-	-
LessThanOrEqualAllBenchmark	401.9115 ns	8.9488 ns	10.3055 ns	402.8547 ns	384.7896 ns	423.1362 ns	-	-	-	-
LessThanOrEqualAnyBenchmark	265.0211 ns	3.5067 ns	3.2801 ns	264.9297 ns	260.6478 ns	270.4116 ns	-	-	-	-
MaxBenchmark	40.3965 ns	0.1073 ns	0.1004 ns	40.4098 ns	40.1970 ns	40.5682 ns	-	-	-	-
MinBenchmark	39.0784 ns	0.1081 ns	0.0959 ns	39.0810 ns	38.9268 ns	39.2049 ns	-	-	-	-
MultiplyBenchmark	63.0806 ns	0.2723 ns	0.2547 ns	63.0654 ns	62.6993 ns	63.6056 ns	-	-	-	-
NegateBenchmark	63.1924 ns	0.1658 ns	0.1550 ns	63.2034 ns	62.9797 ns	63.5404 ns	-	-	-	-
OnesComplementBenchmark	22.2950 ns	0.1434 ns	0.1271 ns	22.3171 ns	21.9820 ns	22.4744 ns	-	-	-	-
SquareRootBenchmark	129.6186 ns	0.5511 ns	0.4885 ns	129.5628 ns	128.9000 ns	130.5883 ns	-	-	-	-
SubtractBenchmark	66.1447 ns	0.3725 ns	0.3484 ns	66.1251 ns	65.3131 ns	66.5934 ns	-	-	-	-
XorBenchmark	9.4515 ns	0.1190 ns	0.1113 ns	9.4744 ns	9.2758 ns	9.6036 ns	-	-	-	-

jkotas · 2020-09-09T23:20:22Z

Thanks for collecting the data. I am supportive of your proposal to simplify the template.

CarolEidt · 2020-09-11T23:42:14Z

I think that removing the T4 templates is the right thing to do, but it also seems that it would be worth what might be a smallish investment to reduce the perf impact for arm32.

tannergooding · 2020-09-11T23:46:58Z

but it also seems that it would be worth what might be a smallish investment to reduce the perf impact for arm32.

I'll take a look and see if adding the loop unrolling support is trivial. From what I recall on doing it for Vector128<T>.Count, it might just be treating Count as intrinsic and ensuring the right flag is set.

CarolEidt · 2020-09-11T23:49:17Z

@tannergooding - that would be good, but perhaps that could also be done as a "cleanup" PR that includes the work to figure out why the HFA classification requires the Register type.

tannergooding · 2020-09-14T21:10:56Z

So we have the SIMDHandlesCache which is set based on matching the name of the generic type: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330.

However, we then map that later based on just the simdBaseType:

However, TYP_I_MPL and TYP_U_IMPL are themselves merely aliases for TYP_LONG/TYP_INT and TYP_ULONG/TYP_UINT, respectively; so the caching logic breaks down: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/vartype.h#L35-L50

Is there some way to get the handle for Vector<ulong>/Vector<uint> when we encounter a Vector<nuint> so the cache won't be invalid during later lookups or is there someway to track that a given TYP_ULONG is actually a TYP_U_IMPL (existing support that is)?

CC. @CarolEidt, @dotnet/jit-contrib

tannergooding · 2020-10-05T17:30:54Z

@jkotas, is there currently a JIT/EE method that can be used to resolve a handle? This would end up being "pay for play" as it would only be resolved once and only if a Vector<nint> or Vector<nuint> was the used handle.

I was also looking at finishing connecting the m_layout field as an alternative, but there are a number of places where we may synthesize nodes and ultimately create something which doesn't yet have a resolved/cached handle.

jkotas · 2020-10-05T17:54:24Z

Does getTypeInstantiationArgument method on JIT/EE help?

CarolEidt · 2020-10-05T18:07:32Z

I'm not really a fan of the SIMDHandlesCache, and I believe we'd be better off extending the ClassLayout to include the SIMD "base type" (instantiation type). Then we could replace the search through the SIMDHandlesCache with a lookup in the ClassLayoutTable.

there are a number of places where we may synthesize nodes and ultimately create something which doesn't yet have a resolved/cached handle

That seems like an issue we need to be able to handle in any event, and I wonder if a synthetic ClassLayout would be the way to address that. I confess I haven't investigated this in detail, but it doesn't seem like there's any other reasonable way to get class info if you don't have a handle (getTypeInstantiationArgument requires a handle).

tannergooding · 2020-10-05T18:28:26Z

Does getTypeInstantiationArgument method on JIT/EE help?

I don't think so. The issue is we are building the SIMDHandleCache by matching the name: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330
So what we have is a Vector`1[System.IntPtr], however we only track the simdBaseType in the created node, which will be TYP_I_IMPL, which is just an alias for TYP_LONG

This means that when we later try and get the handle back, such as in https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.cpp#L17354-L17356, we will try to get a handle for Vector`1[System.Int64], which doesn't exist.

I was hoping there was some way to, for example, from a class handle for Vector`1[System.IntPtr], resolve the class handle for Vector`1[System.Int64]. This would largely solve some of the handle lookup problems we have both for this case and cases where we are returning a type that hasn't been resolved (say for example, we internally create a node that temporarily treats a Vector`1[System.Single] as a Vector`1[System.Int32] -- we do this or similar in a few places).

I'm not really a fan of the SIMDHandlesCache, and I believe we'd be better off extending the ClassLayout to include the SIMD "base type" (instantiation type). Then we could replace the search through the SIMDHandlesCache with a lookup in the ClassLayoutTable.
That seems like an issue we need to be able to handle in any event, and I wonder if a synthetic ClassLayout would be the way to address that.

I think this, in general, probably needs a bit of thought.

We also have a couple issues around rewriting intrinsics into user calls in lowering (such as to better handle operands that later become constants) in which case I think we want to carry a CORINFO_METHOD_HANDLE instead (or in addition to). Like GenTreeIntrinsic does: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/gentree.h#L4743-L4748 and with the rewriting: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/rationalize.cpp#L265-L284)

I believe you can get the ClassLayout for a given return type or parameter from the method handle, but I think we'd also have the issue where we may not have a method handle (such as for an intrinsic node created to handle a helper method, for example).

jkotas · 2020-10-05T19:15:16Z

The issue is we are building the SIMDHandleCache by matching the name: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L325-L330

BTW: This name matching code is sub-optimal. It is using slow method for name formatting that it is meant to be used for debug-only tracing. The TODO at https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L257 talks about it.

We can add JIT/EE interface methods that let you create new generic instantiations if it helps JIT to represent things. JIT would have to guarantee that these handles are never embedded into the code (directly or indirectly), e.g. the JITed code cannot call these instantiations. Otherwise, it would cause problems for AOT.

tannergooding · 2020-10-05T19:58:30Z

BTW: This name matching code is sub-optimal. It is using slow method for name formatting that it is meant to be used for debug-only tracing. The TODO at https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L257 talks about it.

This is what getTypeInstantiationArgument is for, correct? It looks like we are actually using that for the Vector64/128/256<T> path: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/simd.cpp#L578...

Is it prohibitively expensive if we used it exclusively or would we still need some caching layer to help?

JIT would have to guarantee that these handles are never embedded into the code (directly or indirectly), e.g. the JITed code cannot call these instantiations

I think that's reasonable and don't believe we are using the handles for anything like that today.

What I'd like to do, ideally, is come to some middle ground between where we are today with the SIMDHandleCache and where we'd like to be with the ClassLayout info. I believe the latter is quite a bit more work, while the former would at least unblock this issue and related scenarios around accelerating Quaternion/Plane (which are functionally a Vector4 for most operations) and implementing the generic Vector2/3/4<T> types.

It sounds like we might be able to do something like:

Fixup the Vector<T> cache to use getTypeInstantiationArgument, which resolves the name lookup but keeps the mismatch
Track the CORINFO_TYPE_* rather than the TYP_ for the simdBaseType (it is trivial to go from the former to the latter)

This would allow us to track the proper base type and utilize Vector<T> for nint and nuint. We could then continue the discussions around how to handle ClassLayout and and if we can have some "synthetic" class layout like @CarolEidt suggested.

tannergooding · 2020-11-05T17:19:15Z

I got most of the changes done (it rounds out to +812, -722 lines, approx), but ran into a bit of "snag".

In particular JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152
For example, while CORINFO_TYPE_* for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*. INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variant (INT, LONG, I_IMPL).

This isn't hard to handle, but I was wondering if someone could explain the reason for the difference here? At first glance it looks like a bug, but I'm guessing there is some reason for it.
I don't see any surrounding comments, nor comments in https://github.com/dotnet/runtime/blob/master/src/coreclr/src/inc/corinfo.h or https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/typelist.h

echesakov · 2020-11-05T17:33:49Z

In particular JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152
For example, while CORINFO_TYPE_* for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*. INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variant (INT, LONG, I_IMPL).

Isn't this related to the fact that CLI models only 3 integral types on the computation stack - int32, native int and int64. Perhaps, you can distinguish between signed and unsigned variants by checking flag GTF_UNSIGNED?

tannergooding · 2020-11-05T17:40:24Z

Perhaps, you can distinguish between signed and unsigned variants by checking flag GTF_UNSIGNED?

The base type is distinct from the node type and while they often match up, there are many cases where they don't.

Isn't this related to the fact that CLI models only 3 integral types on the computation stack

That was my initial thought, but byte, ubyte, short, and ushort not mapping to int ends up not lining up with that and much of the JIT logic uses genActualType() to convert a var_types to the stack type, where that is relevant.

CarolEidt · 2020-11-06T00:09:42Z

In particular JITtype2varType (https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152) doesn't preserve the sign of certain types: https://github.com/dotnet/runtime/blob/master/src/coreclr/src/jit/ee_il_dll.hpp#L152
For example, while CORINFO_TYPE_* for BYTE, UBYTE, SHORT, and USHORT all translate over to the corresponding TYP_*. INT, UINT, LONG, ULONG, NATIVEINT, and NATIVEUINT all go to the signed variant (INT, LONG, I_IMPL).

I'm not sure why this is the case. It does seem quite inconsistent, but I would bet that there's JIT code somewhere that depends on it.

Dotnet-GitSync-Bot added area-System.Numerics untriaged New issue has not been triaged by the area owner labels May 9, 2020

tannergooding added api-suggestion Early API idea and discussion, it is NOT ready for implementation and removed untriaged New issue has not been triaged by the area owner labels May 9, 2020

tannergooding mentioned this issue May 9, 2020

Use C# compiler provided nint/nuint #36159

Merged

jkotas mentioned this issue May 9, 2020

Take advantage of nint and nuint in Utf8String code #33575

Closed

tannergooding mentioned this issue May 19, 2020

Revert unintentional alignment change for Vector256<T> #36673

Merged

tannergooding added api-ready-for-review and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Jun 12, 2020

tannergooding added this to the Future milestone Jun 23, 2020

carlossanlop added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-ready-for-review labels Jul 6, 2020

terrajobst added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation labels Aug 6, 2020

tannergooding mentioned this issue Sep 10, 2020

Removing the T4 templates used by Vector<T> #42050

Merged

jkotas mentioned this issue Nov 1, 2020

Fix remaining using nuing = aliases #44117

Closed

tannergooding added the Cost:M Work that requires one engineer up to 2 weeks label Jan 15, 2021

tannergooding mentioned this issue Apr 7, 2021

Updating Vector<T> to support nint and nuint #50832

Merged

ghost added the in-pr There is an active PR which will close this issue when it is merged label Apr 7, 2021

tannergooding closed this as completed in #50832 Apr 20, 2021

ghost removed the in-pr There is an active PR which will close this issue when it is merged label Apr 20, 2021

tannergooding mentioned this issue Apr 29, 2021

Add support System.Numerics.Vectors types with double precision #24168

Open

ghost locked as resolved and limited conversation to collaborators May 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider expanding `Vector<T>` to support `nint` and `nuint` #36160

Consider expanding `Vector<T>` to support `nint` and `nuint` #36160

tannergooding commented May 9, 2020

ghost commented May 9, 2020

tannergooding commented May 9, 2020

tannergooding commented May 9, 2020

RamType0 commented May 16, 2020

tannergooding commented May 16, 2020

GrabYourPitchforks commented May 16, 2020 •

edited

Loading

RamType0 commented May 17, 2020

RamType0 commented May 17, 2020

RamType0 commented May 17, 2020

terrajobst commented Aug 6, 2020 •

edited by dotnet-api-review bot

Loading

tannergooding commented Sep 8, 2020 •

edited

Loading

jkotas commented Sep 8, 2020

tannergooding commented Sep 8, 2020

marek-safar commented Sep 9, 2020

tannergooding commented Sep 9, 2020

jkotas commented Sep 9, 2020

CarolEidt commented Sep 11, 2020

tannergooding commented Sep 11, 2020

CarolEidt commented Sep 11, 2020

tannergooding commented Sep 14, 2020 •

edited

Loading

tannergooding commented Oct 5, 2020

jkotas commented Oct 5, 2020

CarolEidt commented Oct 5, 2020

tannergooding commented Oct 5, 2020 •

edited

Loading

jkotas commented Oct 5, 2020 •

edited

Loading

tannergooding commented Oct 5, 2020

tannergooding commented Nov 5, 2020 •

edited

Loading

echesakov commented Nov 5, 2020

tannergooding commented Nov 5, 2020

CarolEidt commented Nov 6, 2020

Consider expanding Vector<T> to support nint and nuint #36160

Consider expanding Vector<T> to support nint and nuint #36160

Comments

tannergooding commented May 9, 2020

Rationale

Proposal

Other Considerations

ghost commented May 9, 2020

tannergooding commented May 9, 2020

tannergooding commented May 9, 2020

RamType0 commented May 16, 2020

tannergooding commented May 16, 2020

GrabYourPitchforks commented May 16, 2020 • edited Loading

RamType0 commented May 17, 2020

RamType0 commented May 17, 2020

RamType0 commented May 17, 2020

terrajobst commented Aug 6, 2020 • edited by dotnet-api-review bot Loading

tannergooding commented Sep 8, 2020 • edited Loading

jkotas commented Sep 8, 2020

tannergooding commented Sep 8, 2020

marek-safar commented Sep 9, 2020

tannergooding commented Sep 9, 2020

jkotas commented Sep 9, 2020

CarolEidt commented Sep 11, 2020

tannergooding commented Sep 11, 2020

CarolEidt commented Sep 11, 2020

tannergooding commented Sep 14, 2020 • edited Loading

tannergooding commented Oct 5, 2020

jkotas commented Oct 5, 2020

CarolEidt commented Oct 5, 2020

tannergooding commented Oct 5, 2020 • edited Loading

jkotas commented Oct 5, 2020 • edited Loading

tannergooding commented Oct 5, 2020

tannergooding commented Nov 5, 2020 • edited Loading

echesakov commented Nov 5, 2020

tannergooding commented Nov 5, 2020

CarolEidt commented Nov 6, 2020

Consider expanding `Vector<T>` to support `nint` and `nuint` #36160

Consider expanding `Vector<T>` to support `nint` and `nuint` #36160

GrabYourPitchforks commented May 16, 2020 •

edited

Loading

terrajobst commented Aug 6, 2020 •

edited by dotnet-api-review bot

Loading

tannergooding commented Sep 8, 2020 •

edited

Loading

tannergooding commented Sep 14, 2020 •

edited

Loading

tannergooding commented Oct 5, 2020 •

edited

Loading

jkotas commented Oct 5, 2020 •

edited

Loading

tannergooding commented Nov 5, 2020 •

edited

Loading