Support for SSE4 intrinsics by RyuJIT #2209

Closed
redknightlois opened this issue Jun 30, 2015 · 78 comments

@redknightlois commented Jun 30, 2015

Support for many of the interesting instructions, like popcnt (technically SSE4a), would be a useful addition and could help avoid resorting to unmanaged code in certain performance-sensitive applications.

Many (technically all) of these operations can be emulated in software when the CPU does not support them, with optimizations specific to the target platform. With specially crafted if-then-else constructs it would even be possible to switch to an entirely different algorithm without any runtime impact (if the selection is done at the jitting phase).

@mellinoe (Contributor) commented Jun 30, 2015

Do you have any specific suggestions or use cases in mind here? If you're curious/interested in general JIT optimizations, it may be more relevant to discuss over at https://github.com/dotnet/coreclr (Runtime repo). But if there are more specific use cases that could be exposed through some sort of API or client library, that would be interesting to discuss here (and there as well, probably).

@CarolEidt (Member) commented Jul 1, 2015

It would be great to hear how & where this might be used. It would not be difficult to add an intrinsic, but we would probably want to avoid adding yet another configuration to support (in order to avoid exploding the test matrix) - so perhaps it would be something that could be enabled with AVX2.

For the SIMD intrinsics, we have an IsHardwareAccelerated property on the Vector class that allows the developer to select a different path. Perhaps something similar could be done here, as you seem to suggest.

That said, this is the first request that I've seen, so this is probably not something that would be high on our list.

@redknightlois (Author) commented Jul 1, 2015

This request actually came from one particular place where I have seen insane performance differences. A popcnt-enabled select-and-rank implementation can yield massive improvements in very low-level database indexing tech. For example, this is the actual algorithm I was looking into when I opened the issue: http://link.springer.com/chapter/10.1007%2F978-3-642-38527-8_15

But that is certainly not the only place where hardware intrinsics can make a huge difference. Not long ago I needed a very fast non-cryptographic hash and ended up implementing xxHash because I could achieve "decent" performance even without SSE bit-packing instructions. If I remember correctly, "decent" was about 70% of the memory bandwidth on my i5 (processing 2.5 Gb/sec in hashes) for the 64-bit variant. That could certainly be improved with SSE operations.

My biggest gripe with IsHardwareAccelerated is that it is not fine-grained enough. I wouldn't mind having specific "libraries" with Microsoft-approved JIT extensions if that helps alleviate the test-matrix issues.

About specific use cases, some can be found in Roslyn. We in the managed world know for a fact that we don't have access to low-level primitives, so we end up building stuff like this: http://odetocode.com/blogs/scott/archive/2015/02/19/roslyn-code-gems-counting-bits.aspx ... that is popcnt, the operation that motivated opening this discussion :)

Another example: @stephentoub opened an issue not too long ago (#2025); offloading the crc32 operation to a hardware intrinsic has a huge impact on commonly used framework functionality like DeflateStream. A 1.8x speedup on a general-use routine like that is not to be taken lightly.

Why would I like to stick with managed code? Because the jump to unmanaged is very costly. Not long ago I was able to gain 30% just by replacing the native memcmp (all safeguards off) with unsafe managed code, mainly because the transition to unmanaged code, for a tight routine that can be called billions of times in just two minutes, makes a huge difference. I wrote a whole series about memory comparisons, up to the point of finding the best unmanaged solution (http://ayende.com/blog/169825/excerpts-from-the-ravendb-performance-team-report-optimizing-memory-compare-copy-costs). After that I could get a further 30% on top just because of how the JIT was able to optimize the call site when going fully managed (even at the expense of losing 0.6% to the unmanaged code in the general case). The managed code in question: https://github.com/Corvalius/ravendb/blob/master/Raven.Sparrow/Sparrow/Memory.cs

I believe that supports the use-cases part. Why are there probably not many requests? I guess because asking for SIMD could be read as asking for access to special-purpose operations. Math intrinsics are just a bunch of those (very important and very welcome), but there are other kinds, like bit-packing and bit-manipulation instructions, that are very important in other domains. As of now not many are looking to implement high-performance code in .NET, but the introduction of SIMD and the open sourcing of the CLR (which implies support for other platforms) will certainly change that.

Most of the optimization issues are related to the JIT emitting better code when it really matters:
https://github.com/dotnet/coreclr/labels/optimization

Interest in performance is out there in the requests, and many of the issues are rooted in sub-par support for dealing with unsafe code or for exploiting the hardware:
dotnet/roslyn#1798
dotnet/roslyn#120
dotnet/coreclr#916
dotnet/coreclr#1015
#1168
dotnet/roslyn#166

And those are just the ones I am tracking; I am pretty sure that with some work we can dig up others.

In my dream world I would be able to write memcpy|memcmp|hashes|etc. routines in unsafe (but portable) managed code when I need them and compete with the fastest routines available in the C world, while continuing to write safe code with the flexibility and productivity I already have. I would also be able to compile specially crafted MSIL to OpenCL/CUDA too, but that is another topic :P

EDIT: @CarolEidt I just noticed you said: "so perhaps it would be something that could be enabled with AVX2". If you plan to implement the whole AVX2 instruction set, I will be VERY HAPPY!!! 😃

EDIT2: More issues.

@CarolEidt (Member) commented Jul 1, 2015

@redknightlois - thanks! It's really helpful to have such a good articulation of the need. Just to be clear, I don't think there's any chance that I/we will implement the whole AVX2 instruction set, but just that enabling something like popcnt only for the AVX2 target (presuming, I think correctly, though I haven't verified, that AVX2 hardware would always support SSE4a) would allow us to support it without adding another target to test.
Regarding the granularity of IsHardwareAccelerated - I agree that it is too coarse. What do you think of something like a HardwarePopCount that took a reference to an int for the return value, and returned a bool indicating whether it was successful? So you could write code like:
if (HardwarePopCount(long source, ref int count))
{
// code that depends on popcnt
}
else
{
// alternate implementation
}
I don't think it's ideal, but it's certainly finer granularity. The non-accelerated version (i.e. the one that lives in the IL) would always simply return false. One could then also provide a PopCount that looked like the above, but had a managed implementation in the else clause. But providing the HardwarePopCount would allow the developer to choose a completely different algorithm (not counting bits) if popcnt wasn't accelerated.

Thoughts?

@mburbea commented Jul 1, 2015

I think it would be useful to offer a series of constants that acted as a means of feature detection. That way I could just write:

if (Feature.SupportHardwarePopCount)
{
    // code that uses popcount goes here
}
else
{
    // code that can't count bits
}

RyuJIT can then optimize away the never-visited branch, as it already does, based on the value of the constant.

The current implementation is too rigid; for some algorithms, the lack of intrinsics makes it difficult or impossible to beat a non-SIMD implementation, or requires writing ugly code.

There is unfortunately little in the way of documentation for writing high-performance code that plays well with the JIT. Pointer tricks that work great in C/C++ do not always get optimized as you would expect, and you're pretty much forced to spend lots of time on trial and error to see whether the emitted IL gets turned into quality machine code.

@redknightlois (Author) commented Jul 1, 2015

@mburbea I was actually thinking along those lines (even if today the response for those would always be false and the code a library call):

Hardware.SSE4.IsAccelerated
Hardware.AVX2.IsAccelerated

If we then have special cases like:

Hardware.SSE4.IsPopCountAccelerated
Hardware.SSE4.IsCrc32Accelerated
Hardware.SSE4.IsBitPackingAccelerated

for groups of functionality, it would give far greater flexibility without losing generality.

@redknightlois (Author) commented Jul 17, 2015

@CarolEidt Other intrinsics that are very important for succinct and compact data structures (along with compression and indexing algorithms), while not SSE, are:

Count the number of leading zeros in a variable (byte, int, long). In GCC: __builtin_clz()
Count the number of trailing zeros in a variable (byte, int, long). In GCC: __builtin_ctz()
Most significant 1 bit. In VC++: https://msdn.microsoft.com/en-us/library/fbxyd7zd.aspx
Least significant 1 bit. In VC++: https://msdn.microsoft.com/en-us/library/wfd9z0bb.aspx
Byte swaps. In VC++: https://msdn.microsoft.com/en-us/library/a3140177.aspx

Without those, you have to go and implement something like this (instead of using a single CPU operation):

int LeadingZeros(int x)
{
        x |= (x >> 1);
        x |= (x >> 2);
        x |= (x >> 4);
        x |= (x >> 8);
        x |= (x >> 16);
        return(sizeof(int)*8 -Ones(x));
}

int Ones(int x)
{
        x -= ((x >> 1) & 0x55555555);
        x = (((x >> 2) & 0x33333333) + (x & 0x33333333));
        x = (((x >> 4) + x) & 0x0f0f0f0f);
        x += (x >> 8);
        x += (x >> 16);
        return(x & 0x0000003f);
} 

for every word size.

Given that these types of operations are typically used in very hot paths, the difference an intrinsic makes is INSANE!!! :) ... There are plenty of places in the framework where such things are done by hand, especially the byte swapping. Having these available would be a huge win in many situations, and they shouldn't complicate the test matrix either.

The good thing is that on platforms where the instruction is not available, a force-inlined library call can be used. Either the platform supports it or it doesn't; falling back to a library call is just fine.

@benaadams (Collaborator) commented Aug 20, 2015

I'd be very interested in Hamming weight/popcnt and bit scan.

On Intel these came in with Nehalem (Q4 2008) and have been in the chips since then: Westmere, Sandy Bridge, Ivy Bridge, Haswell, Broadwell and now Skylake; AMD since Barcelona (Q4? 2007); ARM in NEON, Cortex-A8/9 (2007?). So the fallback would probably be the road less taken.

Probably could have better names than the intrinsics though :)

@benaadams (Collaborator) commented Sep 11, 2015

@CarolEidt an example of where popcnt would be helpful in the aspnet code: https://github.com/aspnet/KestrelHttpServer/blob/dev/src/Microsoft.AspNet.Server.Kestrel/Http/FrameHeaders.Generated.cs#L66

It could be replaced with a single instruction.

@redknightlois (Author) commented Sep 11, 2015

@benaadams I just hope that implementation is not in a hot path; it is 15x slower than the naive (shift, add, and) implementation, and almost 30x slower than the optimized one using 12 arithmetic operations and one multiply. o.O

BenchmarkDotNet=v0.7.7.0
OS=Microsoft Windows NT 6.2.9200.0
Processor=Intel(R) Core(TM) i5-2500K CPU @ 3.30GHz, ProcessorCount=4
HostCLR=MS.NET 4.0.30319.42000, Arch=64-bit [RyuJIT]
Type=Algo_BitCount Mode=Throughput Platform=HostPlatform Jit=HostJit .NET=HostFramework

Method             AvrTime      StdDev      op/s
PopCount1          6.7466 us    0.9003 us   148,221.90
PopCount2          4.3872 us    0.0174 us   227,933.86
PopCount3          3.8315 us    0.0394 us   260,996.33
PopCountParallel2  3.0998 us    0.0256 us   322,604.58
Asp.Net            99.8271 us   0.7559 us   10,017.32

@dadhi commented Nov 27, 2015

Another use that requires popcount is persistent data structures like ideal hash tries (HAMT or CHAMP).

These are the foundation of very efficient immutable data structures that could provide an alternative to the current AVL-tree-based collections in the BCL.

The Clojure collections for instance are based on HAMT.

Using a Hamming-weight fallback instead of the native popcount drastically degrades the performance of such structures.

So ±100 @redknightlois

@ghost commented Dec 16, 2015

👍 would be nice to have both variants in the runtime.

Without codegen support, we can do something like this in native code:

#include <nmmintrin.h>
#include <intrin.h>

static inline bool HasPopcntIntrinsics()
{
    static bool is_capable(false), capability_tested(false);

    if (capability_tested)
        return is_capable;

    capability_tested = true;

    // see more examples at https://msdn.microsoft.com/en-us/library/hskdteyh.aspx
    int CPUInfo[4] = {-1};
    __cpuid(CPUInfo, 1); // leaf 1: feature flags; POPCNT is ECX bit 23
    is_capable = (CPUInfo[2] >> 23) & 1;
    return is_capable;
}

static inline int BitCountWithoutPOPCNT(uint64_t x)
{
    x -= ((x >> 1) & 0x5555555555555555ULL);
    x = (((x >> 2) & 0x3333333333333333ULL) + (x & 0x3333333333333333ULL));
    x = (((x >> 4) + x) & 0x0F0F0F0F0F0F0F0FULL);
    x *= 0x0101010101010101ULL;
    return static_cast<int>(x >> 56);
}

static inline int GetBitCount(uint64_t x)
{
    if (HasPopcntIntrinsics()) // runtime check
        return _mm_popcnt_u64(x);

    return BitCountWithoutPOPCNT(x);
}

Then expose GetBitCount to the managed surface area.

Alternatively, RyuJIT codegen could be equipped with the AVX2 instruction set, with fallback code that does the same thing a bit more efficiently.

@redknightlois (Author) commented Feb 5, 2016

Yet another place where Java is beating .NET in indexing technology because we don't have popcnt support. It is specifically called out as the reason for the performance improvement.

Better bitmap performance with Roaring bitmaps.
http://arxiv.org/pdf/1402.6407.pdf

BTW, I cannot implement this method because I don't have the supporting HW operations.

@jonathanmarston commented Feb 18, 2016

I'd be very interested in seeing support for popcnt. I have a project that I'm working on that heavily uses bitmaps and would benefit greatly. Right now I'm looking at needing to break down and write it in C++ instead of C#...

@CarolEidt (Member) commented Feb 18, 2016

I don't think that any of these requests would be difficult to implement as intrinsics. The main issue is to define the appropriate API. The "path of least resistance" would probably be to put them in System.Numerics.Vectors.dll, but I'm not sure that's the best place from a design perspective. However, to get the conversation started (and admitting up front that API design is not my field), here is a preliminary proposal for four methods that might be added to System.Numerics.Vector (the static Vector class):

public static int BitCount(long bits);
public static bool BitCountAccelerated();

public static int FirstSetBit(long bits);
public static bool FirstSetBitAccelerated();

This fixes the length of the "bit vector" at long, but has the attraction of simplicity.

I would not be in favor of a global "Feature" class that subsumed the responsibility for all "is feature XX accelerated", because I think it is better to associate them with the class that exposes the feature. I'm not invested in the "Accelerated" suffix, but I think it would be good to have a standard naming convention for these. One issue would be what "Accelerated" means - what if there is a JIT-generated code sequence that takes multiple instructions, but is otherwise more efficient than one could do in C#/F#/IL?

@redknightlois (Author) commented Feb 18, 2016

@CarolEidt I agree with you: "Accelerated" should mean "better than what we could do even writing the IL directly".

I can try to build a few examples of how I envision such an API to work (as I have already the stock implementation for a few of the most important routines). But, I have a few questions:

  • Is the idea to provide "stock / even if not HW accelerated" implementations to avoid every single project repeating itself? E.g. Roslyn, CoreFX and Kestrel all have/had their own implementation of PopCount/BitCount.
  • Should we focus on "feature set" or "behavior"? Feature set: is it SSE2, AVX, etc.; behavior: "logical shift", etc.
  • Should we prioritize, implementing a small subset first but leaving room for improvement API-wise?
  • Should we look into leveraging Vector itself? (Having 256/512-bit implementations of popcount while creating an API that is restricted to 64 bits does not look like a good choice to me.)

@GSPP commented Feb 19, 2016

Maybe the JIT could accelerate based on a well-known IL sequence instead of a method name. For example, the sequence

c = (v & 0x55555555) + ((v >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >> 16) & 0x0000FFFF);

could be converted to bitcount everywhere, no matter where it is defined. There should be documentation specifying the exact patterns being accelerated. That way there is no need to define an intrinsic method in the framework assemblies at all. Each project that wants to make use of these instructions can just copy and paste this implementation and achieve accelerated performance. This is a zero surface area approach.

I believe GCC and LLVM recognize these "magic" implementations and replace them with intrinsics. This is to create a portable way to implement a fast bitcount.

For each instruction to be exposed that way, the most common 1-3 patterns should be supported. That way user code can pick the fastest unaccelerated pattern for their case and still get it accelerated where possible.

For testing feature availability there could be a method JitCapabilities.IsFeaturePresent(string). User code can pull the result of that into static readonly bool variables. The JIT is currently already capable of inlining the value of such variables and eliminating dead code. User code could be:

static readonly bool isBitcountAccelerated = JitCapabilities.IsFeaturePresent("IsBitcountAccelerated");

if (isBitcountAccelerated) {
c = (v & 0x55555555) + ((v >> 1) & 0x55555555);
c = (c & 0x33333333) + ((c >> 2) & 0x33333333);
c = (c & 0x0F0F0F0F) + ((c >> 4) & 0x0F0F0F0F);
c = (c & 0x00FF00FF) + ((c >> 8) & 0x00FF00FF);
c = (c & 0x0000FFFF) + ((c >> 16) & 0x0000FFFF);
} else {
 //some other approach
}

After optimizations this should collapse to c = bitcount(v); the if goes away.

Whatever design is chosen, it should be suitable to expose many intrinsics. There are many useful x86 instructions to be exposed. Parallel extract comes to mind. It is very versatile.

@redknightlois (Author) commented Feb 19, 2016

@GSPP the problem there is that you have to write a specialized morpher for such complex chains of operations (and all their variations), which costs resources at runtime, leaving the JIT less time to do the rest of its work. In AoT compilers you won't care, but in a JIT compiler you have to be very careful about the time it takes to handle that. A library call has the advantage that everybody can use it from the same place, whether it is accelerated by the HW/JIT or not, and only those that really need the performance will have to make JitCapabilities.Features.IsBitCountSupported-style calls to use alternative code paths.

@redknightlois (Author) commented Jun 22, 2017

@tannergooding yes; from all I have witnessed, there is agreement among those needing these intrinsics that a simple straight-to-the-metal approach, with a very big "you can shoot yourself in the foot" warning label across the namespace, would be the one that provides the flexibility to build upon. So in essence the API issue boils down to the actual static class name and method names rather than design abstraction per se.

@damageboy commented Jul 21, 2017

I want to add to what @redknightlois wrote and say that I can't think of a single PL/environment where, when intrinsics are supported at all, they are not at least available through the straight-to-opcode approach.

MS need not go any further than its own C++ compiler to witness that.

I'm all for a more generalized (a la System.Numerics) approach for a cross-platform experience where that makes sense. But that cannot come instead of having the straight-to-opcode versions provided....

There are multiple reasons for the straight-to-opcode approach:

  • Programmers already wanting to use intrinsics would probably also want to manually control unrolling and interleave different intrinsics to achieve higher IPC; having simple, conventional naming allows them to understand more intuitively what they are about to ask the CPU to do, and to meaningfully consult resources like agner.org for reference...
  • .NET, via NuGet 3.x, already supports providing different implementations of managed code per OS/arch, allowing library writers who actually do care to provide different implementations for arm/x64 etc. via these requested straight-to-opcode intrinsics.
  • Users are very likely to use some baseline implementation already written in C/C++ with intrinsics as the starting point for whatever they do. While I do understand the immediate urge to throw up upon seeing something like System.Unsafe.Intrinsics.x64._pdep_u64() or, worse yet, System.Unsafe.Intrinsics.x64._mm256_slli_epi64(), I think it is actually the right and possibly the only sane way to present these to a would-be user.

@tannergooding (Member) commented Jul 21, 2017

While I do understand the immediate urge to throw up upon seeing something like System.Unsafe.Intrinsics.x64._pdep_u64() or worse-yet: System.Unsafe.Intrinsics.x64._mm256_slli_epi64()

I really hope that if such a feature is implemented, we choose better names:

  • _pdep_u64 -> DepositContiguousLowBits or DepositContiguousBits
  • _mm256_slli_epi64 -> ShiftPackedInt64 or ShiftPacked or even just Shift

No reason why we have to make it hard to read 😉

@damageboy commented Jul 21, 2017

@tannergooding I understand where you are coming from, and am definitely all for having readable/meaningful names...

However, people in this specific case are not going to use these sorts of intrinsics with a clean slate; at least many of them will have "prior convictions" and baggage coming from C/C++....

So while having nice meaningful names is something I would definitely like, I strongly feel that the "ugly" names should also be supported, for code-portability purposes if nothing else.

C#'s designers had the good instinct of not breaking with C/C++ where it wasn't required, and this allows easier porting of existing code when needed...

I feel the same here... and I also feel that, if anything, the GCC names and coverage of intrinsics are a better starting point than MSVC's....

For example, if I have the following working piece of code:

static const int32_t CHUNKMASK_SHIFT = 6;

int32_t GetKeyForIndexIntrinsicsUnrolled(int64_t index, uint64_t *bits)
{
  index++;

  auto p = (uint64_t *) bits;

  for (; index >= 256; p += 4)
    index -= __popcntq(p[0]) + __popcntq(p[1]) + __popcntq(p[2]) + __popcntq(p[3]);

  // As long as we are still looking for more than 64 bits
  auto prevIndex = index;
  while (index > 0) {
    prevIndex = index;
    index -= __popcntq(*(p++));
  }

  auto pos = __bsfq(_pdep_u64(1ULL << (prevIndex - 1), *(p - 1)));
  return ((p - 1 - bits) << CHUNKMASK_SHIFT) + pos;
}

The last thing I care about is finding out the exact name that the CLR folks thought __bsfq, or anything else here, should get.

I just want the code to work... And given that this is a very niche API, I don't see a good reason to make it pretty over functional for the target audience...

@benaadams (Collaborator) commented Jul 22, 2017

The C++ intrinsics don't match the asm opcodes in name anyway.

Would it not be better to match the asm descriptions and merge opcodes with overloading? Casting to a defined CLR type, if needed, can be done via the Unsafe class, as you can for Vector2, Vector4 and Vector<T> (Vector3 is an oddity instruction-wise, though a useful one).

While at the same time not shying away from the use of vowels, but staying away from underscore exuberance?


@benaadams (Collaborator) commented Jul 22, 2017

@damageboy the C# version could look something like this:

using System.Numerics;

const int CHUNKMASK_SHIFT = 6;

unsafe int GetKeyForIndexIntrinsicsUnrolled(long index, ulong* bits)
{
    index++;

    var p = bits;

    for (; index >= 256; p += 4)
    {
        index -= Bits.Count(p[0]) + Bits.Count(p[1]) + Bits.Count(p[2]) + Bits.Count(p[3]);
    }

    // As long as we are still looking for more than 64 bits
    var prevIndex = index;
    while (index > 0)
    {
        prevIndex = index;
        index -= Bits.Count(*(p++));
    }
    // or Bits.ScanForward(...)
    var pos = Bits.First((long)Bits.Scatter(1UL << (int)(prevIndex - 1), *(p - 1)));
    return (int)(((p - 1 - bits) << CHUNKMASK_SHIFT) + pos);
}

or with using static

using static System.Numerics.Bits;

unsafe int GetKeyForIndexIntrinsicsUnrolled(long index, ulong* bits)
{
    index++;

    var p = bits;

    for (; index >= 256; p += 4)
    {
        index -= Count(p[0]) + Count(p[1]) + Count(p[2]) + Count(p[3]);
    }

    // As long as we are still looking for more than 64 bits
    var prevIndex = index;
    while (index > 0)
    {
        prevIndex = index;
        index -= Count(*(p++));
    }
    // or ScanForward(...)
    var pos = First((long)Scatter(1UL << (int)(prevIndex - 1), *(p - 1)));
    return (int)(((p - 1 - bits) << CHUNKMASK_SHIFT) + pos);
}

@benaadams (Collaborator) commented Jul 22, 2017

@dsyme @redknightlois @jonathanmarston @mellinoe @damageboy @CarolEidt @russellhadley @mgravell @terrajobst

API starter for comment/feedback (example use in comment above)

namespace System.Numerics
{
    public static class Bits
    {
        // POPCNT on Intel
        public static byte Count(byte value);
        public static ushort Count(ushort value);
        public static uint Count(uint value);
        public static ulong Count(ulong value);

        // +/- shift values to rotate left and right
        public static byte Rotate(byte value, sbyte shift);
        public static short Rotate(short value, sbyte shift);
        public static int Rotate(int value, sbyte shift);
        public static long Rotate(long value, sbyte shift);

        // BSF on Intel
        public static int First(int value);
        public static int First(long value);

        // BSR on Intel
        public static int Last(int value);
        public static int Last(long value);

        // PEXT on Intel 
        public static uint Gather(uint value, uint bitMask);
        public static ulong Gather(ulong value, ulong bitMask);

        // PDEP on Intel
        public static uint Scatter(uint value, uint bitMask);
        public static ulong Scatter(ulong value, ulong bitMask);

        public static byte Crc(byte crc, byte value);
        public static short Crc(short crc, short value);
        public static int Crc(int crc, int value);
        public static long Crc(long crc, long value);

        // Byteswap
        public static short SwitchEndianness(short value);
        public static int SwitchEndianness(int value);
        public static long SwitchEndianness(long value);

        // LZCNT on Intel
        public static int LeadingZeros(int bitMask);
        public static int LeadingZeros(long bitMask);

        // TZCNT on Intel
        public static int TrailingZeros(int bitMask);
        public static int TrailingZeros(long bitMask);
    }
}

None are too exotic, so all could probably have software fallbacks; I'm not sure about detection of HW support, though.

@benaadams (Collaborator) commented Jul 22, 2017

Perhaps, to address @damageboy's concerns, also have an Intrinsics Interop:

namespace System.Numerics.Intrinsics
{
    public static class Interop
    {
        public static int _BitScanForward(uint value) => Bits.First((int)value);
        public static int _BitScanForward64(ulong value) => Bits.First((long)value);

        public static int __bsfd(uint value) => Bits.First((int)value);
        public static int __bsfq(ulong value) => Bits.First((long)value);

        public static uint _pdep_u32(uint source, uint mask) => Bits.Scatter(source, mask);
        public static ulong _pdep_u64(ulong source, ulong mask) => Bits.Scatter(source, mask);

        public static int __popcnt16(ushort value) => Bits.Count(value);
        public static int __popcnt(uint value) => (int)Bits.Count(value);
        public static int __popcnt64(ulong value) => (int)Bits.Count(value);

        public static int __popcntd(uint __X) => (int)Bits.Count(__X);
        public static int __popcntq(ulong __X) => (int)Bits.Count(__X);

        // ...
    }
}

Then you just need to add the header using static System.Numerics.Intrinsics.Interop; and all the C-style functions are available. So this would then be valid:

var pos = __bsfq(_pdep_u64(1UL << (prevIndex - 1), *(p - 1)));

Or if you were MSVC rather than gcc

var pos = _BitScanForward64(_pdep_u64(1UL << (prevIndex - 1), *(p - 1)));

@damageboy commented Jul 23, 2017

@benaadams Having those two versions is basically what I meant...

I like meaningful names just like any sane person, but when porting code, or implementing some paper you may be reading, it just makes sense to have the interop version around.

A few comments though:

  • Wouldn't it make more sense to have the interop version in some x64/x86 namespace, alongside ARM ones?
  • Can/should the canonical Bits stay uniform across archs?
  • There is a need for a higher-level CPUID wrapper that can report the various capabilities to the user, i.e. bits that tell the user when popcnt and friends are not supported...

@benaadams (Collaborator) commented Jul 23, 2017

Wouldn't it make more sense the have the interop version in some x64/x86 namespace, along side ARM ones?

Yes, throw for intrinsics on the wrong platform; also some are x-plat, so something like this?

namespace System.Numerics.Intrinsics
{
    [Flags]
    public enum CpuPlatform
    {
        x86   = 1 << 0,
        x64   = 1 << 1 | x86,

        ARM   = 1 << 8,
        ARM64 = 1 << 9 | ARM
    }

    public static class Interop
    {
        public static uint _BitScanForward(uint value) => Bits.First(value);
        public static ulong _BitScanForward64(ulong value) => Bits.First(value);

        // ...
    }
}

namespace System.Numerics.Intrinsics.x64
{
    public static class Interop
    {
        private static void ThrowPlatformNotSupportedException()
            => throw new PlatformNotSupportedException();

        private static void CheckPlatform()
        {
            if (!Environment.Is64BitProcess
                || (Environment.CpuPlatform & CpuPlatform.x64) != CpuPlatform.x64)
                ThrowPlatformNotSupportedException();
        }

        public static byte _mm_crc32_u8(byte crc, byte value)
        {
            CheckPlatform();
            return Bits.Crc(crc, value);
        }

        public static ushort _mm_crc32_u16(ushort crc, ushort value)
        {
            CheckPlatform();
            return Bits.Crc(crc, value);
        }

        public static uint _mm_crc32_u32(uint crc, uint value)
        {
            CheckPlatform();
            return Bits.Crc(crc, value);
        }

        public static ulong _mm_crc32_u64(ulong crc, ulong value)
        {
            CheckPlatform();
            return Bits.Crc(crc, value);
        }

        // ...
    }
}

namespace System.Numerics.Intrinsics.x86
{
    public static class Interop
    {
        private static void CheckPlatform()
        {
            if ((Environment.CpuPlatform & CpuPlatform.x86) != CpuPlatform.x86)
                throw new PlatformNotSupportedException();
        }
    }
}

namespace System.Numerics.Intrinsics.ARM
{
    public static class Interop
    {
        private static void CheckPlatform()
        {
            if ((Environment.CpuPlatform & CpuPlatform.ARM) != CpuPlatform.ARM)
                throw new PlatformNotSupportedException();
        }
    }
}

Can/should the canonical Bits stay uniform across archs?

Yes. They are fairly universal functions and the software fallbacks are well known, so I'd think they sit well as platform-independent "intrinsics".

Note this is different from interop intrinsics (as above) and from platform/CPU-specific intrinsics that either aren't common or have a complex software fallback (e.g. encryption opcodes); but I think that's a different discussion.
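As a concrete illustration, one well-known software fallback for population count is the branch-free SWAR reduction. This is only a sketch of what such a fallback could look like; the class and method names here are mine, not part of the proposal:

```csharp
using System;

static class BitsFallback
{
    // Branch-free SWAR population count, the classic "bit-twiddling hacks"
    // fallback a runtime could use when the POPCNT instruction is unavailable.
    public static int PopCount(ulong v)
    {
        v = v - ((v >> 1) & 0x5555555555555555UL);                          // pairwise 2-bit sums
        v = (v & 0x3333333333333333UL) + ((v >> 2) & 0x3333333333333333UL); // 4-bit sums
        v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FUL;                          // 8-bit sums
        return (int)((v * 0x0101010101010101UL) >> 56);                     // fold into top byte
    }
}
```

It runs in a constant ~12 ALU operations regardless of input, which is why it is the usual emulation when the hardware instruction is missing.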

There is some sort of need for a more higher level CPUID wrapper that would be able to report the various capabilities to the user, as in bits that inform the user when popcnt and friends are not supported...

For platform-independent intrinsics there should be an "is hardware accelerated" check that is branch-eliminated at JIT time. Something equivalent to a readonly static that has pre-checked CPUID, rather than doing the expensive check on every call.

For platform-specific intrinsics (always the same CPU opcode, though with type overloading), the same mechanism but "is hardware supported", with a branch-eliminated PlatformNotSupportedException path (as above).

Seem sensible?

Not sure on AoT
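That readonly-static pattern could be sketched like this; the DetectPopcnt body is a placeholder for a real CPUID query, and all names are assumptions rather than a proposed API:

```csharp
using System;

public static class Bits
{
    // Evaluated once at type initialization; because it is a static readonly
    // bool, the JIT can fold it to a constant and eliminate the dead branch.
    private static readonly bool s_popcntAccelerated = DetectPopcnt();

    // Placeholder: a real runtime would consult CPUID here.
    private static bool DetectPopcnt() => false;

    public static bool IsHardwareAccelerated => s_popcntAccelerated;

    public static int Count(ulong v)
    {
        if (s_popcntAccelerated)
        {
            // a JIT-aware implementation would emit the POPCNT instruction here
        }

        // software fallback: Kernighan's loop, one iteration per set bit
        int c = 0;
        while (v != 0) { v &= v - 1; c++; }
        return c;
    }
}
```

Because s_popcntAccelerated is fixed before the method is first jitted, the untaken branch costs nothing at run time.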

@damageboy commented Jul 23, 2017

Seems pretty sensible to me so far, yes.

Not sure on AoT

Well, there is something like the Intel way of doing things in ICC, where they can generate functions for several archs and then basically do a dynamic dispatch to the appropriate one.

For anything that takes a considerable number of cycles, that sort of approach gives you compile once, run "everywhere"...
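In managed code, that ICC-style dispatch could be approximated by picking an implementation into a delegate once at startup. Everything here, the names and the CPUID placeholder alike, is a hypothetical sketch rather than a real API:

```csharp
using System;

public static class Crc32Dispatch
{
    // Chosen once at type initialization; every later call goes straight
    // through the delegate, mimicking ICC's per-architecture multiversioning.
    private static readonly Func<uint, uint, uint> s_crc32 =
        HardwareCrcSupported() ? (Func<uint, uint, uint>)HardwareCrc : SoftwareCrc;

    // Placeholder: a real implementation would check CPUID for SSE4.2.
    private static bool HardwareCrcSupported() => false;

    private static uint HardwareCrc(uint crc, uint value)
        => throw new PlatformNotSupportedException(); // would be the SSE4.2 CRC32 opcode

    // Software CRC-32C (Castagnoli), bit-at-a-time; slow but runs anywhere.
    private static uint SoftwareCrc(uint crc, uint value)
    {
        crc ^= value;
        for (int i = 0; i < 32; i++)
            crc = (crc >> 1) ^ (0x82F63B78u & (uint)-(crc & 1));
        return crc;
    }

    public static uint Crc32(uint crc, uint value) => s_crc32(crc, value);
}
```

The delegate indirection costs a call, so it only pays off for operations that take a considerable number of cycles, exactly the case being discussed.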

@benaadams (Collaborator) commented Jul 23, 2017

For detection I was hoping there was some way to tie it directly to the method/method group itself with an extension like:

namespace System.Numerics.Intrinsics
{
    public static class IntrinsicExtensions
    {
        public static bool IsHardwareAccelerated(this MethodInfo intrinsicFunction);
        public static bool IsHardwareAccelerated(this MethodGroup intrinsicFunction);
    }
}

To do

int bits;
if (Bits.Count.IsHardwareAccelerated())
{
    bits = Bits.Count(value);
}
else
{
    // ...
}

But it doesn't seem that's valid C# 😞

@damageboy commented Jul 23, 2017

On the other hand

Bits.IsHardwareAccelerated(Bits.Count)

Is really not that bad
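That shape does compile, since a method group converts implicitly to a delegate. A minimal sketch, with a placeholder software Count and a stubbed check, neither of which is proposed API:

```csharp
using System;

public static class Bits
{
    // software popcount stand-in for the canonical intrinsic
    public static int Count(ulong v)
    {
        int c = 0;
        while (v != 0) { v &= v - 1; c++; }
        return c;
    }

    // Hypothetical overload: the method group Bits.Count converts implicitly
    // to Func<ulong, int>, so the call site reads Bits.IsHardwareAccelerated(Bits.Count).
    public static bool IsHardwareAccelerated(Func<ulong, int> intrinsic)
        => false; // placeholder: a real runtime would map the delegate to a CPUID result
}
```

One caveat: a delegate-taking check is harder for the JIT to branch-eliminate than a readonly static, and the runtime would need to map the delegate back to the underlying intrinsic; but the call site does read naturally.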

@damageboy commented Jul 23, 2017

One thing I'm not really clear about in this discussion is whether we are talking about numeric intrinsics per se here, or general intrinsics?

All of the examples so far are fine for System.Numerics, but if we start going into prefetching and cache-clearing instructions, then this becomes something completely different... at least from the title...

Maybe the whole thing needs to become slightly wider in scope and move into some future-sounding System.Runtime.Unsafe of some sort...?

@tannergooding (Member) commented Jul 23, 2017

@damageboy, there have been a few proposals on the subject of general intrinsics (dotnet/coreclr#6906 (comment)).

@benaadams (Collaborator) commented Jul 23, 2017

All of the examples so far are fine for System.Numerics, but if we start going into prefetching and cache-clearing instructions, then this becomes something completely different... at least from the title...

Maybe the whole thing needs to become slightly wider in scope and move into some future-sounding System.Runtime.Unsafe of some sort...?

Prefetching and clearing the cache can be inadvisable, but it's not strictly unsafe..? i.e. it's only performance that can go wrong, not a failure in operation.

e.g. a prefetch by byref would be safe, while a prefetch by pointer would be unsafe, but both are valid.

@damageboy commented Jul 24, 2017

@benaadams Right, bad naming...

@tannergooding Haven't seen that one before

There seem to be a few of these slogging around...

@danmosemsft (Member) commented Jul 24, 2017

@ericstj fyi.

@redknightlois (Author) commented Jul 24, 2017

@benaadams That C# example would work right off the bat with the software fallbacks I had to build because there were no intrinsics (even though the perf sucks :D).
@damageboy While I agree with most of your post, I am not in the camp of using the C++ intrinsic names, which are cryptic and do not convey intent... The Interop.XXX version that @benaadams proposed also has the interesting extra that we can build Roslyn "suggestions" to replace them with the appropriate API if you run into them (porting code would be a bliss that way).

@damageboy commented Jul 29, 2017

@redknightlois I really like attacking this through Roslyn suggestions. I'm the last person to willingly push forward the cryptic naming, but it really helps with getting stuff off the ground...

@jnm2 (Collaborator) commented Jul 29, 2017

You could make Roslyn fixes that work without the interop class too.

@redknightlois (Author) commented Jul 31, 2017

@jnm2 the downside of that is that while you are implementing a published algorithm, you would use the interop naming just to be able to follow the algorithm properly; later on you move to the better notation.

@tannergooding (Member) commented Jul 31, 2017

@fiigii (Contributor) commented Aug 3, 2017

The Intel hardware intrinsic API proposal has been opened at #22940.

@jkotas jkotas closed this Nov 18, 2017

@karelz karelz modified the milestones: Future, 2.1.0 Nov 18, 2017

@dsyme dsyme referenced this issue Nov 23, 2017
@grant-d grant-d referenced this issue Jan 25, 2019