API Proposal: Add Intel hardware intrinsic functions and namespace #22940

fiigii opened this Issue Aug 3, 2017 · 177 comments


@fiigii
Contributor

fiigii commented Aug 3, 2017

This proposal adds intrinsics that allow programmers to use managed code (C#) to leverage Intel® SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2, FMA, LZCNT, POPCNT, BMI1/2, PCLMULQDQ, and AES instructions.

Rationale and Proposed API

Vector Types

Currently, .NET provides System.Numerics.Vector&lt;T&gt; and related intrinsic functions as a cross-platform SIMD interface that automatically matches proper hardware support at JIT-compile time (e.g. Vector&lt;T&gt; is 128 bits wide on SSE2 machines and 256 bits wide on AVX2 machines). However, there is no way to simultaneously use vectors of different sizes, which limits the flexibility of SIMD intrinsics. For example, on AVX2 machines, XMM registers are not accessible from Vector&lt;T&gt;, but certain instructions have to work on XMM registers (e.g. SSE4.2). Consequently, this proposal introduces Vector128&lt;T&gt; and Vector256&lt;T&gt; in a new namespace System.Runtime.Intrinsics:

namespace System.Runtime.Intrinsics
{
    // 128 bit types
    [StructLayout(LayoutKind.Sequential, Size = 16)]
    public struct Vector128<T> where T : struct {}

    // 256 bit types
    [StructLayout(LayoutKind.Sequential, Size = 32)]
    public struct Vector256<T> where T : struct {}
}

This namespace is platform agnostic, and other hardware could provide intrinsics that operate over these types. For instance, Vector128&lt;T&gt; could be implemented as an abstraction of XMM registers on SSE-capable processors or as an abstraction of Q registers on NEON-capable processors. Meanwhile, other types may be added in the future to support newer SIMD architectures (e.g. adding 512-bit vector and mask vector types for AVX-512).

Intrinsic Functions

The current design of System.Numerics.Vector abstracts away processor-specific details. While this approach works well in many cases, developers may not be able to take full advantage of the underlying hardware. Intrinsic functions allow developers to access the full capability of the processors on which their programs run.

One of the design goals of the intrinsics APIs is to provide one-to-one correspondence to Intel C/C++ intrinsics. That way, programmers already familiar with C/C++ intrinsics can easily leverage their existing skills. Another advantage of this approach is that we leverage the existing body of documentation and sample code written for C/C++ intrinsics.

Intrinsic functions that manipulate Vector128/256&lt;T&gt; will be placed in a platform-specific namespace System.Runtime.Intrinsics.X86. Intrinsic APIs will be separated into several static classes based on the instruction sets they belong to.

// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Avx
    {
        public static bool IsSupported {get;}

        // __m256 _mm256_add_ps (__m256 a, __m256 b)
        [Intrinsic]
        public static Vector256<float> Add(Vector256<float> left, Vector256<float> right) { throw new NotImplementedException(); }
        // __m256d _mm256_add_pd (__m256d a, __m256d b)
        [Intrinsic]
        public static Vector256<double> Add(Vector256<double> left, Vector256<double> right) { throw new NotImplementedException(); }

        // __m256 _mm256_addsub_ps (__m256 a, __m256 b)
        [Intrinsic]
        public static Vector256<float> AddSubtract(Vector256<float> left, Vector256<float> right) { throw new NotImplementedException(); }
        // __m256d _mm256_addsub_pd (__m256d a, __m256d b)
        [Intrinsic]
        public static Vector256<double> AddSubtract(Vector256<double> left, Vector256<double> right) { throw new NotImplementedException(); }

        ......
    }
}

Some intrinsics benefit from C# generics and get simpler APIs:

// Sse2.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Sse2
    {
        public static bool IsSupported {get;}

        // __m128 _mm_castpd_ps (__m128d a)
        // __m128i _mm_castpd_si128 (__m128d a)
        // __m128d _mm_castps_pd (__m128 a)
        // __m128i _mm_castps_si128 (__m128 a)
        // __m128d _mm_castsi128_pd (__m128i a)
        // __m128 _mm_castsi128_ps (__m128i a)
        [Intrinsic]
        public static Vector128<U> StaticCast<T, U>(Vector128<T> value) where T : struct where U : struct { throw new NotImplementedException(); }
        
        ......
    }
}

Each instruction set class contains an IsSupported property that indicates whether the underlying hardware supports the instruction set. Programmers use these properties to ensure that their code can run on any hardware via platform-specific code paths. For JIT compilation, the results of capability checks are JIT-time constants, so dead code paths for the current platform will be eliminated by the JIT compiler (conditional constant propagation). For AOT compilation, the compiler/runtime executes the CPUID check to identify the corresponding instruction sets. Additionally, the intrinsics do not provide software fallback, and calling an intrinsic on a machine that lacks the corresponding instruction set causes a PlatformNotSupportedException at runtime. Consequently, we always recommend that developers provide a software fallback to keep their programs portable. A common pattern combining platform-specific code paths with a software fallback looks like this:

if (Avx2.IsSupported)
{
    // The AVX/AVX2 optimizing implementation for Haswell or above CPUs  
}
else if (Sse41.IsSupported)
{
    // The SSE optimizing implementation for older CPUs  
}
......
else
{
    // Scalar or software-fallback implementation
}

The scope of this API proposal is not limited to SIMD (vector) intrinsics; it also includes scalar intrinsics that operate over scalar types (e.g. int, short, long, or float) from the instruction sets mentioned above. As an example, the following code segment shows the Crc32 intrinsic functions from the Sse42 class.

// Sse42.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Sse42
    {
        public static bool IsSupported {get;}

        // unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
        [Intrinsic]
        public static uint Crc32(uint crc, byte data) { throw new NotImplementedException(); }
        // unsigned int _mm_crc32_u16 (unsigned int crc, unsigned short v)
        [Intrinsic]
        public static uint Crc32(uint crc, ushort data) { throw new NotImplementedException(); }
        // unsigned int _mm_crc32_u32 (unsigned int crc, unsigned int v)
        [Intrinsic]
        public static uint Crc32(uint crc, uint data) { throw new NotImplementedException(); }
        // unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)
        [Intrinsic]
        public static ulong Crc32(ulong crc, ulong data) { throw new NotImplementedException(); }

        ......
    }
}

Intended Audience

The intrinsics APIs bring the power and flexibility of accessing hardware instructions directly from C# programs. However, this power and flexibility means that developers have to be cognizant of how these APIs are used. In addition to ensuring that their program logic is correct, developers must also ensure that their use of the underlying intrinsic APIs is valid in the context of their operations.

For example, developers who use certain hardware intrinsics should be aware of their data alignment requirements. Both aligned and unaligned memory load and store intrinsics are provided, and if aligned loads and stores are desired, developers must ensure that the data are aligned appropriately. The following code snippet shows the different flavors of load and store intrinsics proposed:

// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Avx
    {
        ......
        
        // __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<sbyte> Load(sbyte* address) { throw new NotImplementedException(); }
        // __m256i _mm256_loadu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<byte> Load(byte* address) { throw new NotImplementedException(); }
        ......
        [Intrinsic]
        public static Vector256<T> Load<T>(ref T vector) where T : struct { throw new NotImplementedException(); }

        
        // __m256i _mm256_load_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<sbyte> LoadAligned(sbyte* address) { throw new NotImplementedException(); }
        // __m256i _mm256_load_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<byte> LoadAligned(byte* address) { throw new NotImplementedException(); }
        ......

        // __m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<sbyte> LoadDqu(sbyte* address) { throw new NotImplementedException(); }
        // __m256i _mm256_lddqu_si256 (__m256i const * mem_addr)
        [Intrinsic]
        public static unsafe Vector256<byte> LoadDqu(byte* address) { throw new NotImplementedException(); }
        ......
        
        // void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void Store(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
        // void _mm256_storeu_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void Store(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }
        ......
        [Intrinsic]
        public static void Store<T>(ref T vector, Vector256<T> source) where T : struct { throw new NotImplementedException(); }

        
        // void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void StoreAligned(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
        // void _mm256_store_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void StoreAligned(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }
        ......

        // void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void StoreAlignedNonTemporal(sbyte* address, Vector256<sbyte> source) { throw new NotImplementedException(); }
        // void _mm256_stream_si256 (__m256i * mem_addr, __m256i a)
        [Intrinsic]
        public static unsafe void StoreAlignedNonTemporal(byte* address, Vector256<byte> source) { throw new NotImplementedException(); }

        ......
    }
}
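As a usage sketch, alignment-sensitive code can select between the two flavors at runtime. This assumes the Load/LoadAligned names proposed above; LoadPreferAligned is a hypothetical helper used only for illustration:

```csharp
// Illustrative sketch only: picks the aligned load when the pointer is
// 32-byte aligned (the aligned form, vmovdqa, faults on misaligned
// addresses) and falls back to the unaligned load otherwise.
// LoadPreferAligned is a hypothetical helper, not part of this proposal.
public static unsafe Vector256<byte> LoadPreferAligned(byte* address)
{
    return ((ulong)address % 32 == 0)
        ? Avx.LoadAligned(address)
        : Avx.Load(address);
}
```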

IMM Operands

Most of the intrinsics can be directly ported to C# from C/C++, but certain instructions that take immediate parameters (i.e. imm8) as operands deserve additional consideration, such as pshufd, vcmpps, etc. C/C++ compilers treat these intrinsics specially and throw compile-time errors when non-constant values are passed as immediate parameters. Therefore, CoreCLR also requires an immediate-argument guard from the C# compiler. We suggest adding a new "compiler feature" to Roslyn that places a const constraint on function parameters. Roslyn could then ensure that these functions are invoked with literal values for the const formal parameters.

// Avx.cs
namespace System.Runtime.Intrinsics.X86
{
    public static class Avx
    {
        ......

        // __m256 _mm256_blend_ps (__m256 a, __m256 b, const int imm8)
        [Intrinsic]
        public static Vector256<float> Blend(Vector256<float> left, Vector256<float> right, const byte control) { throw new NotImplementedException(); }
        // __m256d _mm256_blend_pd (__m256d a, __m256d b, const int imm8)
        [Intrinsic]
        public static Vector256<double> Blend(Vector256<double> left, Vector256<double> right, const byte control) { throw new NotImplementedException(); }

        // __m128 _mm_cmp_ps (__m128 a, __m128 b, const int imm8)
        [Intrinsic]
        public static Vector128<float> Compare(Vector128<float> left, Vector128<float> right, const FloatComparisonMode mode) { throw new NotImplementedException(); }
        
        // __m128d _mm_cmp_pd (__m128d a, __m128d b, const int imm8)
        [Intrinsic]
        public static Vector128<double> Compare(Vector128<double> left, Vector128<double> right, const FloatComparisonMode mode) { throw new NotImplementedException(); }

        ......
    }
}

// Enums.cs
namespace System.Runtime.Intrinsics.X86
{
    public enum FloatComparisonMode : byte
    {
        EqualOrderedNonSignaling,
        LessThanOrderedSignaling,
        LessThanOrEqualOrderedSignaling,
        UnorderedNonSignaling,
        NotEqualUnorderedNonSignaling,
        NotLessThanUnorderedSignaling,
        NotLessThanOrEqualUnorderedSignaling,
        OrderedNonSignaling,
        ......
    }

    ......
}
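For readers coming from C/C++, these enum members correspond in order to the AVX _CMP_* predicate encodings (e.g. EqualOrderedNonSignaling is _CMP_EQ_OQ = 0, LessThanOrderedSignaling is _CMP_LT_OS = 1). A usage sketch against the Compare overload proposed above:

```csharp
// Sketch: build a per-lane mask for "left < right" (ordered, signaling),
// the named equivalent of passing the raw imm8 predicate _CMP_LT_OS (1).
public static Vector128<float> LessThan(Vector128<float> left, Vector128<float> right)
    => Avx.Compare(left, right, FloatComparisonMode.LessThanOrderedSignaling);
```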

Semantics and Usage

The semantics are straightforward for users who are already familiar with Intel C/C++ intrinsics. Existing SIMD programs and algorithms that are implemented in C/C++ can be directly ported to C#. Moreover, compared to System.Numerics.Vector&lt;T&gt;, these intrinsics leverage the full power of Intel SIMD instructions and do not depend on other modules (e.g. Unsafe) in high-performance environments.

For example, SoA (structure of arrays) is a more efficient pattern than AoS (array of structures) in SIMD programming. However, it requires dense shuffle sequences to convert the data source (usually stored in AoS format), and Vector&lt;T&gt; does not provide the necessary shuffles. Using Vector256&lt;T&gt; with AVX shuffle instructions (including shuffle, insert, extract, etc.) can lead to higher throughput.

public struct Vector256Packet
{
    public Vector256<float> xs {get; private set;}
    public Vector256<float> ys {get; private set;}
    public Vector256<float> zs {get; private set;}

    // Convert AoS vectors to SoA packet
    public unsafe Vector256Packet(float* vectors)
    {
        var m03 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[0])); // load lower halves
        var m14 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[4]));
        var m25 = Avx.ExtendToVector256<float>(Sse2.Load(&vectors[8]));
        m03 = Avx.Insert(m03, &vectors[12], 1);  // load higher halves
        m14 = Avx.Insert(m14, &vectors[16], 1);
        m25 = Avx.Insert(m25, &vectors[20], 1);

        var xy = Avx.Shuffle(m14, m25, 2 << 6 | 1 << 4 | 3 << 2 | 2);
        var yz = Avx.Shuffle(m03, m14, 1 << 6 | 0 << 4 | 2 << 2 | 1);
        var _xs = Avx.Shuffle(m03, xy, 2 << 6 | 0 << 4 | 3 << 2 | 0);
        var _ys = Avx.Shuffle(yz, xy,  3 << 6 | 1 << 4 | 2 << 2 | 0);
        var _zs = Avx.Shuffle(yz, m25, 3 << 6 | 0 << 4 | 3 << 2 | 1);

        xs = _xs;
        ys = _ys;
        zs = _zs; 
    }
    ......
}
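The 8-bit shuffle controls above pack four 2-bit lane selectors, in the style of C's _MM_SHUFFLE macro. A hypothetical helper (not part of this proposal) makes the encoding explicit; note that the proposal requires the control byte to be a compile-time constant, so such a helper is only viable if it can be folded to a literal:

```csharp
// Hypothetical helper mirroring C's _MM_SHUFFLE(z, y, x, w): packs four
// 2-bit source selectors into the imm8 control byte consumed by Shuffle.
// For example, MakeShuffleControl(2, 1, 3, 2) == (2 << 6 | 1 << 4 | 3 << 2 | 2).
private static byte MakeShuffleControl(byte z, byte y, byte x, byte w)
    => (byte)((z << 6) | (y << 4) | (x << 2) | w);
```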

public static class Program
{
    static unsafe int Main(string[] args)
    {
        var data = new float[Length];
        fixed (float* dataPtr = data)
        {
            if (Avx2.IsSupported)
            {
                var vector = new Vector256Packet(dataPtr);
                ......
                // Using AVX/AVX2 intrinsics to compute eight 3D vectors.
            }
            else if (Sse41.IsSupported)
            {
                var vector = new Vector128Packet(dataPtr);
                ......
                // Using SSE intrinsics to compute four 3D vectors.
            }
            else
            {
                // scalar algorithm
            }
        }
    }
}

Furthermore, conditional code is enabled in vectorized programs. Conditional paths are ubiquitous in scalar programs (if-else), but vectorized programs require specific SIMD instructions for them, such as compare, blend, or andnot.

public static class ColorPacketHelper
{
    public static IntRGBPacket ConvertToIntRGB(this Vector256Packet colors)
    {
        var one = Avx.Set1<float>(1.0f);
        var max = Avx.Set1<float>(255.0f);

        var rsMask = Avx.Compare(colors.xs, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);
        var gsMask = Avx.Compare(colors.ys, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);
        var bsMask = Avx.Compare(colors.zs, one, FloatComparisonMode.GreaterThanOrderedNonSignaling);

        var rs = Avx.BlendVariable(colors.xs, one, rsMask);
        var gs = Avx.BlendVariable(colors.ys, one, gsMask);
        var bs = Avx.BlendVariable(colors.zs, one, bsMask);

        var rsInt = Avx.ConvertToVector256Int(Avx.Multiply(rs, max));
        var gsInt = Avx.ConvertToVector256Int(Avx.Multiply(gs, max));
        var bsInt = Avx.ConvertToVector256Int(Avx.Multiply(bs, max));

        return new IntRGBPacket(rsInt, gsInt, bsInt);
    }
}

public struct IntRGBPacket
{
    public Vector256<int> Rs {get; private set;}
    public Vector256<int> Gs {get; private set;}
    public Vector256<int> Bs {get; private set;}

    public IntRGBPacket(Vector256<int> _rs, Vector256<int> _gs, Vector256<int>_bs)
    {
        Rs = _rs;
        Gs = _gs;
        Bs = _bs;
    }
}

As previously stated, traditional scalar algorithms can be accelerated as well. For example, CRC32 is natively supported on SSE4.2 CPUs.

public static class Verification
{
    public static bool VerifyCrc32(ulong acc, ulong data, ulong res)
    {
        if (Sse42.IsSupported)
        {
            return Sse42.Crc32(acc, data) == res;
        }
        else
        {
            return SoftwareCrc32(acc, data) == res;
            // The software implementation of Crc32 provided by developers or other libraries
        }
    }
}
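The SoftwareCrc32 helper above is left to the developer. As an illustrative sketch, the SSE4.2 crc32 instructions compute CRC-32C (the Castagnoli polynomial, reflected constant 0x82F63B78) as a raw update with no initial or final XOR, so an unoptimized bit-at-a-time equivalent of the per-byte intrinsic could look like this:

```csharp
// Illustrative, unoptimized software counterpart of Sse42.Crc32(uint, byte):
// a raw bit-at-a-time CRC-32C update over the reflected polynomial
// 0x82F63B78, matching the instruction's no-pre/post-conditioning semantics.
private static uint SoftwareCrc32(uint crc, byte data)
{
    crc ^= data;
    for (int i = 0; i < 8; i++)
        crc = (crc >> 1) ^ ((crc & 1) != 0 ? 0x82F63B78u : 0u);
    return crc;
}
```

A real fallback would typically use a table-driven or slice-by-eight variant of the same update for throughput.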

Implementation Roadmap

Implementing all the intrinsics in the JIT is a large-scale, long-term project, so the current plan is to initially implement a subset of them with unit tests, code quality tests, and benchmarks.

The first step in the implementation would involve infrastructure related items. This step would involve wiring the basic components, including but not limited to internal data representations of Vector128<T> and Vector256<T>, intrinsics recognition, hardware support checking, and external support from Roslyn/CoreFX. Next steps would involve implementing subsets of intrinsics in classes representing different instruction sets.

Complete API Design

Add Intel hardware intrinsic APIs to CoreFX #23489
Add Intel hardware intrinsic API implementation to mscorlib #13576

Update

08/17/2017

  • Change namespace System.Runtime.CompilerServices.Intrinsics to System.Runtime.Intrinsics and System.Runtime.CompilerServices.Intrinsics.X86 to System.Runtime.Intrinsics.X86.
  • Change ISA class name to match CoreFX naming convention, e.g., using Avx instead of AVX.
  • Change certain pointer parameter names, e.g., using address instead of mem.
  • Define IsSupported as a property.
  • Add Span<T> overloads to the most common memory-access intrinsics (Load, Store, Broadcast), but leave other alignment-aware or performance-sensitive intrinsics with the original pointer versions.
  • Clarify that these intrinsics will not provide software fallback.
  • Clarify Sse2 class design and separate small classes (e.g., Aes, Lzcnt, etc.) into individual source files (e.g., Aes.cs, Lzcnt.cs, etc.).
  • Change method name CompareVector* to Compare and get rid of Compare prefix from FloatComparisonMode.

08/22/2017

  • Replace Span<T> overloads by ref T overloads.

09/01/2017

  • Minor changes from API code review.
@fiigii

Contributor

fiigii commented Aug 3, 2017

@tannergooding

Member

tannergooding commented Aug 3, 2017

Overall I love this proposal. I do have a few questions/comments:

Each vector type exposes an IsSupported method to check if the current hardware supports

I think this can be a property, as it is in Vector<T>.

Does this take the type of T into account? For example, will IsSupported return true for Vector128<float> but false for Vector128<CustomStruct> (or is it expected to throw in this case)?

What about formats that may be supported on some processors, but not others? As an example, let's say there is instruction set X which only supports Vector128<float>, and later comes instruction set Y which supports Vector128<double>. If the CPU currently only supports X, would it return true for Vector128<float> and false for Vector128<double>, with Vector128<double> only returning true when instruction set Y is supported?

In addition, this namespace would contain conversion functions between the existing SIMD type (Vector) and new Vector128 and Vector256 types.

My concern here is the target layering for each component. I would hope that System.Runtime.CompilerServices.Intrinsics are part of the lowest layer, and therefore consumable by all other APIs in CoreFX. While Vector<T>, on the other hand, is part of one of the higher layers and is therefore not consumable.

Would it be better here to either have the conversion operators on Vector<T> or to expect the user to perform an explicit load/store (as they will likely be expected to do with other custom types)?

SSE2.cs (the bottom-line of intrinsic support that contains all the intrinsics of SSE and SSE2)

I understand that with SSE and SSE2 being required in RyuJIT this makes sense, but I would almost prefer an explicit SSE class to have a consistent separation. I would essentially expect a 1-1 mapping of class to CPUID flag.

Other.cs (includes LZCNT, POPCNT, BMI1, BMI2, PCLMULQDQ, and AES)

For this specifically, how would you expect the user to check which instruction subsets are supported? AES and POPCNT are separate CPUID flags and not every x86 compatible CPU may always provide both.

Some of intrinsics benefit from C# generic and get simpler APIs

I didn't see any examples of scalar floating-point APIs (_mm_rsqrt_ss). How would these fit in with the Vector based APIs (naming wise, etc)?

@redknightlois

redknightlois commented Aug 3, 2017

Looks good and in line with the suggestions I have made. The only thing that does not quite resonate with me (maybe because we deal with pointers on a regular basis in our codebase) is having to use Load(type*) instead of also supporting calls with a void*, as the semantics of the operation are very clear. Probably it is just me, but with the exception of special operations like a non-temporal store (where you would need to use an explicit Store/Load operation), not having support for arbitrary pointer types would only add bloat to the algorithm without any actual improvement in readability/understandability.

@tannergooding

Member

tannergooding commented Aug 3, 2017

Therefore, CoreCLR also requires the immediate argument guard from C# compiler.

Going to tag @jaredpar here explicitly. We should get a formal proposal up.

I think that we can do this without language support (@jaredpar, tell me if I'm crazy here) if the compiler can recognize something like System.Runtime.CompilerServices.IsLiteralAttribute and emits it as modreq isliteral.

Having a new recognized keyword (const) here is likely more complicated as it requires formal spec'ing in the language etc.

@mellinoe

Contributor

mellinoe commented Aug 4, 2017

Thanks for posting this @fiigii. I'm very eager to hear everyone's thoughts on the design.

IMM Operands

One thing that came up in a recent discussion is that some immediate operands have stricter constraints than just "must be constant". The examples given use a FloatComparisonMode enum, and functions accepting it apply a const modifier to the parameter. But there is no way to prevent someone from passing a non-enum value, still a constant, to a method accepting that parameter.

AVX.CompareVector256(left, right, (FloatComparisonMode)255);

EDIT: This warning is emitted in a VC++ project if you use the above code.

Now, this may not be a problem for this particular example (I'm not familiar with its exact semantics), but it's something to keep in mind. There were also other, more esoteric examples given, like an immediate operand which must be a power of two, or which satisfies some other obscure relation to the other operands. These constraints will be much more difficult, most likely impossible, to enforce at the C# level. The "const" enforcement feels more reasonable and achievable, and seems to cover most instances of the problem.

SSE2.cs (the bottom-line of intrinsic support that contains all the intrinsics of SSE and SSE2)

I'll echo what @tannergooding said -- I think it will be simpler to just have a distinct class for each instruction set. I'd like for it to be very obvious how and where new things should be added. If there's a "grab bag" sort of type, then it becomes a bit murkier and we have to make lots of unnecessary judgement calls.

@sharwell

Member

sharwell commented Aug 4, 2017

💭 Most of my initial thoughts go to the use of pointers in a few places. Knowing what we know about ref structs and Span<T>, what parts of the proposal can leverage new functionality to avoid unsafe code without compromising performance?

In the following code, would the generic method actually be expanded to each of the forms allowed by the processor, or would it be defined in code as a generic?

// __m128i _mm_add_epi8 (__m128i a,  __m128i b)
// __m128i _mm_add_epi16 (__m128i a,  __m128i b)
// __m128i _mm_add_epi32 (__m128i a,  __m128i b)
// __m128i _mm_add_epi64 (__m128i a,  __m128i b)
// __m128 _mm_add_ps (__m128 a,  __m128 b)
// __m128d _mm_add_pd (__m128d a,  __m128d b)
[Intrinsic]
public static Vector128<T> Add<T>(Vector128<T> left,  Vector128<T> right) where T : struct { throw new NotImplementedException(); }
@sharwell

Member

sharwell commented Aug 4, 2017

If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions? If we choose the former, would it make sense to rename IsSupported to IsHardwareAccelerated?

@tannergooding

Member

tannergooding commented Aug 4, 2017

Knowing what we know about ref structs and Span, what parts of the proposal can leverage new functionality to avoid unsafe code without compromising performance.

Personally, I am fine with the unsafe code. I don't believe this is meant to be a feature that app designers use and is instead meant to be something framework designers use to squeeze extra performance and also to simplify overhead on the JIT.

People using intrinsics are likely already doing a bunch of unsafe things already and this just makes it more explicit.

If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions?

The official design doc (https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md) indicates that it is up in the air whether software fallbacks are allowed.

I am of the opinion that all of these methods should be declared as extern and should never have software fallbacks. Users would be expected to implement a software fallback themselves or have a PlatformNotSupportedException thrown by the JIT at runtime.

This will help ensure that consumers are aware of the underlying platforms they are targeting and that they are writing code "suited" to the underlying hardware (running vectorized algorithms on hardware without vectorization support can cause performance degradation).

@benaadams

Collaborator

benaadams commented Aug 4, 2017

If the processor doesn't support something, do we fall back to simulated behavior or do we throw exceptions?

The official design doc (https://github.com/dotnet/designs/blob/master/accepted/platform-intrinsics.md) indicates that it is up in the air whether software fallbacks are allowed.

These are the raw CPU platform intrinsics (e.g. X86.SSE), so PlatformNotSupportedException is probably fine; it will help get them out quicker.

Assuming the detection is branch-eliminated, it should be easy to build a library on top that does software fallbacks, which can be iterated on (either in coreclr/corefx or by third parties).

@sharwell

Member

sharwell commented Aug 4, 2017

Personally, I am fine with the unsafe code.

I am not against unsafe code. However, given the choice between safe code and unsafe code that perform the same, I would choose the former.

I am of the opinion that all of these methods should be declared as extern and should never have software fallbacks.

The biggest advantage of this is the runtime can avoid shipping software fallback code that never needs to execute.

The biggest disadvantage of this is test environments for the various possibilities are not easy to come by. Fallbacks provide a functionality safety net in case something gets missed.

@tannergooding

Member

tannergooding commented Aug 4, 2017

The biggest disadvantage of this is test environments for the various possibilities are not easy to come by.

@sharwell, what possibilities are you envisioning?

The way these are currently structured, proposed, the user would code:

public static double Cos(double x)
{
    if (x86.FMA3.IsSupported)
    {
        // Do FMA3
    }
    else if (x86.SSE2.IsSupported)
    {
        // Do SSE2
    }
    else if (Arm.Neon.IsSupported)
    {
        // Do ARM
    }
    else
    {
        // Do software fallback
    }
}

Under this, the only way a user is faulted is if they write a bad algorithm or if they forget to provide any kind of software fallback (and an analyzer to detect this should be fairly trivial).

@redknightlois

This comment has been minimized.

redknightlois commented Aug 4, 2017

running vectorized algorithms on hardware without vectorization support can cause performance degradation.

I would rephrase @tannergooding thought into: "running vectorized algorithms on hardware without vectorization support will with utmost certainty cause performance degradation."

@fiigii

Contributor

fiigii commented Aug 4, 2017

For this specifically, how would you expect the user to check which instruction subsets are supported? AES and POPCNT are separate CPUID flags and not every x86 compatible CPU may always provide both.

@tannergooding We defined an individual class for each instruction set (except SSE and SSE2) but put certain small classes into the Other.cs file. I will update the proposal to clarify.

// Other.cs
namespace System.Runtime.CompilerServices.Intrinsics.X86
{
    public static class LZCNT
    {
     ......
    }

    public static class POPCNT
    {
    ......
    }

    public static class BMI1
    {
     .....
    }

    public static class BMI2
    {
     ......
    }

    public static class PCLMULQDQ
    {
     ......
    }

    public static class AES 
    {
    ......
    }
}
@tannergooding

Member

tannergooding commented Aug 4, 2017

For AOT compilation, however, the compiler generates CPUID checking code that would return different values each time it is called (on different hardware).

I don't think this needs to be true all the time. In some cases, the AOT can drop the check altogether, depending on the target operating system (Win8 and above require SSE and SSE2 support, for example).

In other cases, the AOT can/should drop the check from each method and should instead aggregate them into a single check at the highest entry point.

Ideally, the AOT would run CPUID once during startup and cache the results as globals (honestly, if the AOT didn't do this, I would log a bug). The IsSupported check then becomes essentially a lookup of the cached value (just like a property normally behaves). This behavior is what the CRT implementations do to ensure that things like cos(double) remain performant and that they can still run FMA3 code where supported.
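The caching pattern described above could be sketched as follows; CpuId.HasFeature and Feature are hypothetical names for an AOT runtime helper, not part of this proposal:

```csharp
// Sketch: query CPUID once at startup and cache the results in static
// readonly fields; subsequent IsSupported reads become a cheap field load
// (or a constant, where the compiler can see the initialized value).
// CpuId and Feature are hypothetical, used only for illustration.
internal static class CpuCapabilities
{
    public static readonly bool Fma3 = CpuId.HasFeature(Feature.Fma3);
    public static readonly bool Avx2 = CpuId.HasFeature(Feature.Avx2);
}
```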

@benaadams

Collaborator

benaadams commented Aug 4, 2017

For AOT compilation, however, the compiler generates CPUID checking code that would return different values each time it is called (on different hardware).

The implication would be from a usage perspective:

For the JIT we could be quite granular with the checks, as they are no-cost (branch-eliminated).

For AOT we'd need to be quite coarse with the checks and perform them at the algorithm or library level, to offset the cost of CPUID; this may push the granularity much higher than intended, e.g. you wouldn't use a vectorized IndexOf unless your strings were huge, because CPUID would dominate.

Probably could still cache on AOT at startup, so it would set the property; it wouldn't branch-eliminate, but would be fairly low cost?

@fiigii

Contributor

fiigii commented Aug 4, 2017

I understand that with SSE and SSE2 being required in RyuJIT this makes sense, but I would almost prefer an explicit SSE class to have a consistent separation. I would essentially expect a 1-1 mapping of class to CPUID flag.

I think it will be simpler to just have a distinct class for each instruction set. I'd like for it to be very obvious how and where new things should be added. If there's a "grab bag" sort of type, then it becomes a bit murkier and we have to make lots of unnecessary judgement calls.

@tannergooding @mellinoe The current design intent of class SSE2 is to make the intrinsic functions friendlier to users. If we had two classes SSE and SSE2, certain intrinsics would lose the generic signature. For example, SIMD addition only supports float in SSE, and SSE2 complements the other types.

public static class SSE
{
    // __m128 _mm_add_ps (__m128 a,  __m128 b)
    public static Vector128<float> Add(Vector128<float> left,  Vector128<float> right);
}

public static class SSE2
{
    // __m128i _mm_add_epi8 (__m128i a,  __m128i b)
    public static Vector128<byte> Add(Vector128<byte> left,  Vector128<byte> right);
    public static Vector128<sbyte> Add(Vector128<sbyte> left,  Vector128<sbyte> right);

    // __m128i _mm_add_epi16 (__m128i a,  __m128i b)
    public static Vector128<short> Add(Vector128<short> left,  Vector128<short> right);
    public static Vector128<ushort> Add(Vector128<ushort> left,  Vector128<ushort> right);
    
    // __m128i _mm_add_epi32 (__m128i a,  __m128i b)
    public static Vector128<int> Add(Vector128<int> left,  Vector128<int> right);
    public static Vector128<uint> Add(Vector128<uint> left,  Vector128<uint> right);

    // __m128i _mm_add_epi64 (__m128i a,  __m128i b)
    public static Vector128<long> Add(Vector128<long> left,  Vector128<long> right);
    public static Vector128<ulong> Add(Vector128<ulong> left,  Vector128<ulong> right);
    
    // __m128d _mm_add_pd (__m128d a, __m128d b)
    public static Vector128<double> Add(Vector128<double> left,  Vector128<double> right);
}

Compared to SSE2.Add<T>, the above design looks complex, and users have to remember SSE.Add(float, float) and SSE2.Add(int, int). Additionally, SSE2 is the baseline of RyuJIT code generation for x86/x86-64, so separating SSE from SSE2 has no advantage in functionality or convenience.

Although the current design (class SSE2 including SSE and SSE2 intrinsics) hurts API consistency, there is a trade-off between design consistency and user experience, which is worth discussing.

@benaadams

Collaborator

benaadams commented Aug 4, 2017

Rather than X86, maybe X86X64, as x86 is often used to denote 32-bit only?

@nietras

Collaborator

nietras commented Aug 4, 2017

Very excited we are finally seeing a proposal for this. My initial thoughts below.

AVX-512 is missing, probably since it is not that widespread yet, but I think it would be good to at least give this some thought and consider how to structure these classes, because the AVX-512 feature set is very fragmented. In this case I would assume we need to have a class for each set, i.e. (see https://en.wikipedia.org/wiki/AVX-512):

public static class AVX512F {} // Foundation 
public static class AVX512CD {} // Conflict Detection
public static class AVX512ER {} // Exponential and Reciprocal
public static class AVX512PF {} // Prefetch Instructions
public static class AVX512BW {} // Byte and Word
public static class AVX512DQ {} // Doubleword and Quadword
public static class AVX512VL {} // Vector Length
public static class AVX512IFMA {} // Integer Fused Multiply Add (Future)
public static class AVX512VBMI {} // Vector Byte Manipulation Instructions (Future)
public static class AVX5124VNNIW {} // Vector Neural Network Instructions Word variable precision (Future)
public static class AVX5124FMAPS {} // Fused Multiply Accumulation Packed Single precision (Future)

and add a struct Vector512<T> type, of course. Note that the latter two, AVX5124VNNIW and AVX5124FMAPS, are hard to read due to the number 4.

Some of these can have a huge impact for deep learning, sorting etc.

Regarding Load I have some concerns as well. Like @redknightlois, I think void* should be considered too, but more importantly also load from/store to ref. Given this, perhaps these should be relocated to the "generic"/platform-agnostic namespace and type, since presumably all platforms should support load/store for a supported vector size. So something like the following (not sure where we could put this, or how naming should be done, if it can be moved to a platform-agnostic type):

[Intrinsic]
public static unsafe Vector256<sbyte> Load(sbyte* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> LoadSByte(void* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<sbyte> Load(ref sbyte mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> Load(byte* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> LoadByte(void* mem) { throw new NotImplementedException(); }
[Intrinsic]
public static unsafe Vector256<byte> Load(ref byte mem) { throw new NotImplementedException(); }
// Etc.

The most important thing here is whether ref can be supported, as it would be essential for supporting generic algorithms. Naming should no doubt be revised; I am just trying to make a point. If we want to support load from void*, the method name needs to include the return type, or the method needs to be on a type-specific static class.
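For what it's worth, a ref-based load can already be expressed in terms of the existing System.Runtime.CompilerServices.Unsafe helpers, which suggests the shape such an API might take. In this sketch, Vector256<T> is the type proposed in this thread (it later shipped in System.Runtime.Intrinsics), and VectorLoads is a hypothetical helper class:

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.Intrinsics;

public static class VectorLoads
{
    // Loads 32 bytes starting at the given managed reference, without
    // pinning and without assuming any particular alignment.
    public static Vector256<byte> Load(ref byte source)
    {
        return Unsafe.ReadUnaligned<Vector256<byte>>(ref source);
    }
}
```

Because the source is a managed ref, this form composes with Span-based and generic algorithms without requiring fixed or raw pointers.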

@4creators

Contributor

4creators commented Aug 4, 2017

It's great we are discussing a concrete proposal right now. 😄

  1. The above linked const keyword usage language proposal was created explicitly to provide support for some of the SIMD instructions requiring immediate parameters. I think it will be straightforward to implement, but since it may delay the introduction of intrinsics, there were strong arguments in favor of going with a simple attribute implementation first and later expanding C# syntax and the API by including support for const method parameters.

  2. IMO we have to discuss forward-looking designs for two different areas in parallel:

  • System.Numerics API which can be partially implemented with support of discussed here x86 intrinsics
  • Intrinsics API which should comprise other architectures as well as this will have an impact on final shape of the intrinsics API

Intrinsics

Namespace and assembly

I would propose moving intrinsics to a separate namespace located relatively high in the hierarchy, and each platform's specific code into a separate assembly.

System.Intrinsics general top level namespace for all intrinsics
System.Intrinsics.X86 x86 ISA extensions and separate assembly
System.Intrinsics.Arm ARM ISA extensions and separate assembly
System.Intrinsics.Power Power ISA extensions and separate assembly
System.Intrinsics.RiscV RISC-V ISA extensions and separate assembly

The reason for the above division is the large API surface of every instruction set, i.e. AVX-512 will be represented by more than 2,000 intrinsics in the MSVC compiler. The same will soon be true for ARM SVE (see below). The size of the assembly, due to the string content alone, won't be small.

Register sizes (currently XMM, YMM, ZMM - 128, 256, 512 bits in x86)

Current implementations support limited set of register sizes:

  • 128, 256, 512 bits in x86
  • 128 in ARM Neon and IBM Power 8 and Power 9 ISA

However, ARM recently published:

ARM SVE - Scalable Vector Extensions

see: The Scalable Vector Extension (SVE), for ARMv8-A published on 31 March 2017 with status Non-Confidential Beta.

This specification is quite important as it introduces new register sizes - altogether there are 16 register sizes which are multiples of 128 bits. Details are on page 21 of the specification (table is below).

armv8_sve_beta

  • Maximum vector length: 2048 bits

  • Required vector lengths: 128, 256, 512, 1024 bits

  • Permitted vector lengths: 384, 640, 768, 896, 1152, 1280, 1408, 1536, 1664, 1792, 1920

It would be necessary to design an API capable of supporting, in the near future, 16 different register sizes and several thousand (or tens of thousands of) opcodes/functions (counting overloads). Predictions that we would not have 2048-bit SIMD instructions within a couple of years seem to have been falsified, to everyone's surprise, by ARM this year. Looking at history (ARM published the public beta of the ARMv8 ISA on 04 September 2013, and the first processor implementing it was globally available to users in October 2014, in the Samsung Galaxy Note 4), I would expect the first silicon with SVE extensions to be available in 2018. I suppose this would most probably be very close in time to the public availability of .NET SIMD intrinsics.

I would like to propose:

Vectors

Implement basic Vectors supporting all register sizes in System.CoreLib.Private

namespace System.Numerics
{
    [StructLayout(LayoutKind.Explicit)]
    public unsafe struct Register128
    {
        [FieldOffset(0)]
        public fixed byte Bytes[16];
        .....
        // accessors for other types    
    }

    // ....

    [StructLayout(LayoutKind.Explicit)]
    public unsafe struct Register2048
    {
        [FieldOffset(0)]
        public fixed byte Bytes[256];
        .....
        // accessors for other types    
    }

    public struct Vector<T, R> where T : struct where R : struct
    {
    }

    // Note: C# structs cannot inherit, so the derivations below are illustrative only.
    public struct Vector128<T> : Vector<T, Register128>
    {
    }

    // ....

    public struct Vector2048<T> : Vector<T, Register2048>
    {
    }
}

System.Numerics

All safe APIs would be exposed via Vector and VectorXXX structures and implemented with support of intrinsics.

System.Intrinsics

All vector APIs will use System.Numerics.VectorXXX.

public static Vector128<byte> MultiplyHigh(Vector128<byte> value1, Vector128<byte> value2);
public static Vector128<byte> MultiplyLow(Vector128<byte> value1, Vector128<byte> value2);

Intrinsics APIs will be placed in separate classes according to the feature-detection patterns provided by processors. In the case of the x86 ISA, this would be a one-to-one correspondence between CPUID detection and supported functions. This would allow for an easy-to-understand programming pattern where one uses functions from a given group in a way consistent with platform support.

The main reason for that kind of division is the requirement set by silicon manufacturers to use instructions only if they are detected in hardware. This allows, for example, shipping a processor whose support matrix comprises SSE3 but not SSSE3, or PCLMULQDQ and SHA but not AES-NI. This direct class-to-hardware-detection correspondence is the only safe way to have IsHardwareSupported detection and stay compliant with Intel/AMD instruction usage restrictions. Otherwise the kernel will have to catch the #UD exception for us 😸

Mapping APIs to C/C++ intrinsics or to ISA opcodes

Intrinsics usually abstract ISA opcodes in a 1-to-1 way; however, there are some intrinsics which map to several instructions. I would prefer to abstract opcodes (using nice names) and implement multi-opcode intrinsics as functions on VectorXxx.

@4creators

Contributor

4creators commented Aug 4, 2017

@nietras

Given this, perhaps these should be relocated to the "generic"/platform-agnostic namespace and type, since assumably all platforms should support load/store for a supported vector size.

The best place would be System.Numerics.VectorXxx<T>

@jkotas

Member

jkotas commented Aug 4, 2017

all platforms should support load/store for a supported vector size

Is the platform agnostic Load/Store any different from the existing Unsafe.Read/Write?

@battlebottle

battlebottle commented Jan 2, 2018

This API looks really great. There is one thing I would like to see changed however. Could we support a software fallback?

I know I'm very late to this discussion, but I'd like to make the case for this.

I can see the momentum in this discussion is against having any software fallback, but I don't see any in-depth discussion of the pros and cons. On the pro side, I see someone mention that any code that runs in a would-be software fallback mode would be a performance bug, and that it would be better to crash in that scenario for easier debugging. That's certainly true, but I would argue that this is not a very .NET way of doing things. Many aspects of .NET degrade gracefully in performance rather than throwing exceptions, and I'm sure it's written somewhere that this is part of the .NET philosophy: better for code to run slower than to crash outright, unless the programmer specifies that is what they want to happen. This is something I like about the old Vector API.

I think part of the argument for no software fallback is based on the fact that the audience for this API is pretty low-level developers who are used to using SIMD extensions from C++ and assembler, for whom having the code crash outright when the real instruction sets are not available is a more comfortable development environment. And while I believe this will be true for 98% of developers who use this API, I don't think we should forget the more typical .NET developer, or assume they would never want to explore this stuff to see if it could benefit them. In general, I think it's a mistake to design an API like this and assume only a certain type of developer will want to use it, especially something baked into .NET.

Here’s some of the pros I consider a software fallback would provide:

  • Better development experience: I'll accept the point that crashing when a used extension is not present has some advantages, but consider the benefits of a software fallback also. A software fallback provides a way of reliably exploring the use of all instruction set classes, including ones not supported on the developer's machine. This may not excite many in this discussion, but it does provide a nice way for developers to test algorithms and ensure they are logically sound before deciding whether they are worth testing on real hardware. Some debugging scenarios also become easier. As an example, if a user of a library reports a bug on ARM devices caused by code that uses NEON, a developer on that library has the possibility of fixing this from an x86 machine, as they can reproduce and fix the bug using the NEON software fallback. Of course it would be better if the developer had NEON hardware to debug with, but this is not always practical, and the developer is empowered to improve their code more easily and in less time than they would otherwise. A software fallback would also provide much better potential for unit tests that exercise the code paths for all instruction set classes, no matter what the local dev machine supports.

  • More reliable code: Certainly, any code that runs in software fallback mode where usable extensions (or a handwritten software algorithm) would run faster can be considered a performance bug. However, in the real world it is inevitable that developers have limited time to write and debug code, that mistakes will be made, and that developers will simply opt not to write code for extension sets they do not expect to be present. .NET excels at allowing developers to write code that works reliably quickly, and then optimize that code at their preference. Given a code base where the developer makes use of these extensions but does not have the time and resources to ensure the code runs appropriately on every platform, the consumer of the library or application is much better served by code that runs slower than by code that crashes completely. I believe this affects both developers consuming libraries written with this API, and end users who potentially face complete crashes because an application was never tested against the instruction sets available on their CPU.

In general, I think a software fallback would provide little if any disadvantage to developers who feel they would not benefit from it, but at the same time make the API much more accessible to regular .NET developers.

I don't expect this to change given how late I am to this, but I thought I'd at least put my thoughts on the record here.

@skesgin

skesgin commented Jan 16, 2018

I agree that having a software fallback capability would be nice. However, given that it is just a nice-to-have feature and can also be implemented by individual developers on an as-needed basis, or as a third-party library, I think it should be placed towards the bottom of the to-do list. I would rather see that energy directed towards full AVX-512 support, which has been available on server-grade CPUs for a while and is on its way to consumer CPUs.

@oscarbg

oscarbg commented Apr 11, 2018

Ping on AVX512 news?

@4creators

Contributor

4creators commented Apr 12, 2018

We still have some ISAs to implement before the already accepted APIs are finished: some AVX2 intrinsics and the whole of AES, BMI1, BMI2, FMA, and PCLMULQDQ. My expectation is that after this work is finished and the implementation is stabilized, we will start working on AVX-512. In the meantime, however, we still have a lot to do on the Arm64 implementations.

@fiigii could probably provide more info on future plans.

@4creators

Contributor

4creators commented Apr 12, 2018

I agree that having software fallback capability would be nice.

This API looks really great. There is one thing I would like to see changed however. Could we support a software fallback?

The current thinking around the implementation of hardware intrinsics is that we provide low-level intrinsics which allow assembly-like programming, plus several helper intrinsics which should make developers' lives easier.

An implementation which provides more abstraction and a software fallback is already partially available in the System.Numerics namespace with Vector<T>. The expectation is that hardware intrinsics will allow expanding the functionality of the Vector<T> implementation by adding new functionality backed by a software fallback. The Vector<T> implementation should then be viewed as a higher-level programming interface which can be used on all hardware platforms thanks to its software fallback.

The above, however, is the personal view of a community member.

@fiigii

Contributor

fiigii commented Apr 12, 2018

Ping on AVX512 news?

After finishing these APIs (e.g., AVX2, FMA, etc.), I think we have to investigate more potential performance issues (e.g., calling convention, data alignment) before we move to the next step, because these issues may blow up with wider SIMD architectures. Meanwhile, I prefer to improve/refactor the JIT backend (emitter, codegen, etc.) implementation before extending it to AVX-512. Yes, we definitely need to extend this plan to AVX-512 in the future, but for now it is better to focus on enhancing the 128/256-bit intrinsics.

@Jorenkv

Jorenkv commented Apr 13, 2018

Personally I don’t see software fallback as worth spending developer effort on, as the consumer can easily implement a software fallback themselves if they want one, and besides, it works better at the algorithm level than at the intrinsic level.

Actually implementing all the dozens of intrinsics that exist out there for all targeted platforms is not something the consumer can do themselves, so I personally would give that higher priority.

Great stuff by the way, I’m very much looking forward to having all these intrinsics available.

@voinokin

voinokin commented May 22, 2018

Minor API enhancement suggestion from my side:

Add a Count property to all vector value types, similar to System.Numerics.Vector<T>.Count, albeit giving a static value based solely on Vector64/128/256/etc<T>'s generic type argument.

The implementation could be something that looks like Unsafe.SizeOf<Vector128<T>>() / Unsafe.SizeOf<T>().

The reason for this proposal is that when the generic type argument is known beforehand (e.g. a concrete type like ushort, int, etc.), the vector dimension can simply be hardcoded into the source code. But this is not the case for code that takes a generic approach; there the dimension must often be recalculated in source code each time it is required.
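The suggested implementation can be written out directly. The type below is an illustrative stand-in, not the proposed Vector128<T> itself; it only mirrors the 16-byte size so the computed property can be demonstrated with the existing Unsafe.SizeOf helper:

```csharp
using System;
using System.Runtime.CompilerServices;

// Illustrative stand-in with 16 bytes of storage; the real Vector128<T>
// would expose the same computed property.
public struct Vector128Demo<T> where T : struct
{
    private long _lo, _hi; // 16 bytes total

    // Number of T elements the vector holds, e.g. 8 for ushort, 4 for int.
    public static int Count
        => Unsafe.SizeOf<Vector128Demo<T>>() / Unsafe.SizeOf<T>();
}
```

Because both SizeOf calls are JIT-time constants for a given T, the division folds away and Count behaves like a hardcoded literal.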

@colgreen

colgreen commented May 25, 2018

Q. Is there any chance this functionality will ever be available in .NET Standard?

I maintain a code library/nuget that would benefit from using these hardware intrinsics, but it currently targets .NET Standard to provide good portability.

Ideally I'd like to continue to offer portability, but also improved performance if the runtime platform/environment provides these intrinsics. Right now it seems my choice is either speed or portability, but not both - is this likely to change in the future?

@jkotas

Member

jkotas commented May 25, 2018

@colgreen This was discussed in #24346. I recommend moving the discussion there.
