Design and process for adding platform dependent intrinsics to .NET #13

Closed
russellhadley opened this Issue Jul 31, 2017 · 6 comments

russellhadley commented Jul 31, 2017

.NET Platform Dependent Intrinsics

Introduction

The .NET ecosystem supports many platforms and hardware architectures and, through its runtime implementation, ensures that any MSIL has reasonable performance on different platform/hardware combinations. This consistency is one of the key selling points of the .NET stack, but some high-end apps need platform-specific functionality to achieve peak performance. Examples include hardware-accelerated encoders such as crc32 on Intel, or the particulars of the NEON SIMD instructions. At the low level these are not consistent across platforms so a general functionality is not possible, and a higher-level, more abstract implementation, while consistent, imposes a performance penalty unacceptable to implementors seeking maximum app/service throughput. To enable this last bit of performance improvement, .NET defines platform dependent intrinsics. These allow platform/hardware providers to define low level intrinsics that map to particular hardware features that are not consistent across targets, i.e. are target dependent. This document outlines the process for proposing and implementing these intrinsics. This process is intended to be open and can be initiated and implemented by any contributor or partner.

Guidelines for Platform Dependent Intrinsics

  1. A platform intrinsic should expose a specific feature or semantic that is not in common with other platforms or hardware implementations.
    • If the functionality is common and performant - make it platform independent.
    • If instead the semantic is platform dependent, or there is a platform dependent high performance implementation, then there is a clear argument for making it a platform intrinsic, assuming that the next point holds.
  2. A platform intrinsic should be impactful. Ideally it solves a particular user problem. Stated another way, platform intrinsics add complexity to the runtime implementation and language stack, so they should help a concrete user scenario.
  3. A platform independent way of determining whether the currently executing platform supports the dependent functionality needs to be included. Users need to be able to easily check for hardware acceleration.
  4. Executing platform dependent APIs on a non-supporting platform may result in a System.PlatformNotSupportedException or an invalid instruction fault. Fallback implementations may be provided but are not required.

Example:

On the Intel platform there is a built-in CRC32 implementation that the example below would expose for use in C#. At a high level, intrinsics are methods of static classes that are marked with the [Intrinsic] attribute.

// SSE42.cs
namespace System.Runtime.CompilerServices.Intrinsics.X86
{
    public static class SSE42
    {
        public static bool IsSupported() { throw new NotImplementedException(); }

        // unsigned int _mm_crc32_u8 (unsigned int crc, unsigned char v)
        [Intrinsic]
        public static uint Crc32(uint crc, byte data) { throw new NotImplementedException(); }
        // unsigned int _mm_crc32_u16 (unsigned int crc, unsigned short v)
        [Intrinsic]
        public static uint Crc32(uint crc, ushort data) { throw new NotImplementedException(); }
        // unsigned int _mm_crc32_u32 (unsigned int crc, unsigned int v)
        [Intrinsic]
        public static uint Crc32(uint crc, uint data) { throw new NotImplementedException(); }
        // unsigned __int64 _mm_crc32_u64 (unsigned __int64 crc, unsigned __int64 v)
        [Intrinsic]
        public static ulong Crc32(ulong crc, ulong data) { throw new NotImplementedException(); }

        ......
    }
}

Note: This example hasn't been implemented yet, hence NotImplementedException rather than PlatformNotSupportedException, so details could change going forward.
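A caller following guidelines 3 and 4 might look roughly like the sketch below. It is only an illustration under stated assumptions: `Sse42Stub` is a hypothetical stand-in for the proposed `SSE42` class (it always reports no hardware support so that the snippet runs anywhere), and `Crc32C` with its bitwise software fallback is an assumed helper, not part of the proposal. The fallback uses the CRC-32C (Castagnoli) polynomial, which is the one the Intel `crc32` instruction implements.

```csharp
using System;

// Hypothetical stand-in for the proposed SSE42 class; not a real API.
// It reports "unsupported" so the sketch runs on any machine.
public static class Sse42Stub
{
    public static bool IsSupported() => false;

    public static uint Crc32(uint crc, byte data) =>
        throw new PlatformNotSupportedException();
}

// Assumed helper, not part of the proposal.
public static class Crc32C
{
    // Reflected CRC-32C (Castagnoli) polynomial.
    private const uint Polynomial = 0x82F63B78u;

    // Software fallback: folds one byte into the running CRC, bit by bit.
    public static uint SoftwareStep(uint crc, byte data)
    {
        crc ^= data;
        for (int bit = 0; bit < 8; bit++)
        {
            crc = (crc & 1) != 0 ? (crc >> 1) ^ Polynomial : crc >> 1;
        }
        return crc;
    }

    public static uint Compute(byte[] buffer)
    {
        uint crc = 0xFFFFFFFFu;

        // Platform independent capability check, done once outside the loop.
        bool accelerated = Sse42Stub.IsSupported();

        foreach (byte b in buffer)
        {
            crc = accelerated
                ? Sse42Stub.Crc32(crc, b)   // hardware path
                : SoftwareStep(crc, b);     // fallback path
        }

        return ~crc; // conventional CRC-32C final inversion
    }
}
```

With this sketch, `Crc32C.Compute(System.Text.Encoding.ASCII.GetBytes("123456789"))` returns `0xE3069283`, the standard CRC-32C check value, regardless of which path is taken.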

Process

  1. Design the API in the System.Runtime.CompilerServices.Intrinsics namespace. Take care to reuse what you can from the current system and to ensure that other platforms can implement their functionality as well.
  2. Open an issue in CoreFX for API review.
  3. Open an issue in CoreCLR for the implementation of the intrinsic in the runtime/JIT (link to CoreCLR issues).
  4. After API review approval, implement the intrinsic in CoreCLR.

Cross Platform vs Platform Dependent

.NET favors cross platform functionality that has full ecosystem support. Our current Vector<T> SIMD support, for instance, provides higher level access to hardware acceleration. In general it is preferable to use this kind of common functionality, but when dictated by performance, platform dependent implementations need to be available. As a consequence, there can be multiple ways to implement a given piece of functionality, so the trade-offs need careful consideration.

FAQ:

Q: What type and namespace should be used?
A: System.Runtime.CompilerServices.Intrinsics is the top namespace, but below that we would expect a breakout based on architecture and platform. Care should be taken to avoid collisions; an example might be an architecture like ARM, where there are multiple licensees. The best recommendation is to get feedback from the community early.

Q: What should the method/parameters be named?
A: Simple and clear is the best bet, but that doesn't always help. There are a lot of cases where there is prior art in C++ for these intrinsics, so having something that parallels that implementation (maybe a bit more clearly named) can make the intrinsics easier to use. That being said, developers should try to follow regular C# naming conventions and choose names that indicate the semantic usage.

Q: Is a software fallback implementation allowed? (Discussed above a bit)
A: Fallback is allowed but not required. For very low level implementations a fallback could even be misleading.

Q: How are immediate operands handled?
A: We're planning to add support for this through Roslyn but haven't settled on an implementation yet. Stay tuned.

Q: How are overloads handled?
A: Overloads are allowed but expected to be rare. Many intrinsics are likely to be distinguished by method name rather than by input type, but as the example above shows, overloads can be helpful.
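To illustrate how overloads keyed on operand width behave at a call site, here is a small sketch. `Sse42OverloadStub` and `OverloadDemo` are hypothetical names, and each stub method simply returns the name of the C++ intrinsic that the matching method in the example above is annotated with; none of this is part of the proposed API. The C# compiler picks the overload from the static type of the arguments.

```csharp
// Hypothetical stub: each overload returns the name of the C++ intrinsic
// noted in the comments of the SSE42 example above.
public static class Sse42OverloadStub
{
    public static string Crc32(uint crc, byte data) => "_mm_crc32_u8";
    public static string Crc32(uint crc, ushort data) => "_mm_crc32_u16";
    public static string Crc32(uint crc, uint data) => "_mm_crc32_u32";
    public static string Crc32(ulong crc, ulong data) => "_mm_crc32_u64";
}

public static class OverloadDemo
{
    public static string[] Run()
    {
        byte b = 1;
        ushort s = 1;
        uint i = 1;
        ulong l = 1;

        return new[]
        {
            Sse42OverloadStub.Crc32(0u, b), // byte operand
            Sse42OverloadStub.Crc32(0u, s), // ushort operand
            Sse42OverloadStub.Crc32(0u, i), // uint operand
            Sse42OverloadStub.Crc32(0u, l), // ulong operand; crc widens to ulong
        };
    }
}
```

Because the operand width is selected by the argument's static type, declaring or casting the data to the intended type at the call site keeps the chosen instruction width obvious to readers.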

Q: What happens when platform dependent intrinsics are compiled ahead of time (AOT)?
A: There are two cases here. First, the target independent checks for platform capability are used. These checks then become runtime checks, and both the accelerated code and the user provided independent fallback are preserved in the output and selected between based on the runtime check result. Second, an unguarded platform intrinsic is used. If this is run on a platform that doesn't support it, then a low level illegal instruction fault is generated.
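The two cases can be pictured with the sketch below. `AotStub` is a hypothetical stand-in for an intrinsic class running on hardware that lacks the feature, `AotCases` and its `Fallback` are assumed user code, and throwing `PlatformNotSupportedException` only models the failure; as noted above, a real unguarded intrinsic may instead produce an illegal instruction fault.

```csharp
using System;

// Hypothetical stand-in for an intrinsic class on hardware without the feature.
public static class AotStub
{
    public static bool IsSupported() => false;

    public static uint Crc32(uint crc, byte data) =>
        throw new PlatformNotSupportedException();
}

public static class AotCases
{
    // Assumed user-provided fallback (one bitwise CRC-32C step).
    private static uint Fallback(uint crc, byte data)
    {
        crc ^= data;
        for (int bit = 0; bit < 8; bit++)
        {
            crc = (crc & 1) != 0 ? (crc >> 1) ^ 0x82F63B78u : crc >> 1;
        }
        return crc;
    }

    // Case 1: guarded. Both branches stay in the compiled output and the
    // capability check selects one of them at run time.
    public static uint Guarded(uint crc, byte data) =>
        AotStub.IsSupported() ? AotStub.Crc32(crc, data) : Fallback(crc, data);

    // Case 2: unguarded. Nothing protects the call on unsupported hardware.
    public static uint Unguarded(uint crc, byte data) =>
        AotStub.Crc32(crc, data);
}
```

In this model, `AotCases.Guarded` always returns a result, while `AotCases.Unguarded` fails on the unsupported stub.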

Edited by @terrajobst to make it a bit easier on the eyes

tannergooding commented Jul 31, 2017

I went and tagged a bunch of issues I've been in discussions on that would be very much simplified, made easier to implement, etc. if this were implemented.

shaggygi commented Jul 31, 2017

Also linking this one as it seems somewhat relevant.

Real-Time specification for C#

guilt commented Aug 1, 2017

  1. One proposal would be to incorporate elements of liboil

Give pre-written loop code / plug variables in;

  1. One related proposal would be to give unrollable loops as code blocks, and allow folks to plug in the x4, x8 and x16 loop implementations. That way, you'll be able to create basic functions such as memcpy using the same template.

unroll.h

and

unroll.c - sample usage of unrolling blocks

russellhadley commented Aug 1, 2017

Closed by #14

terrajobst commented Aug 2, 2017

@guilt this proposal isn't trying to extend the language, it's about how processor specific features could be exposed as APIs.

guilt commented Aug 7, 2017

@terrajobst Got it. But in the context of the issue's description, I decided to add my suggestions.

I wrote what I did to address this: "At the low level these are not consistent across platforms so a general functionality is not possible, and a higher-level, more abstract implementation, while consistent, imposes a performance penalty unacceptable to implementors seeking maximum app/service throughput."

It is my opinion that the statement overlooks many known implementations, and abstract implementations are available. IMHO, people who already know how to use intrinsics use them without requiring anyone else to write these shims for them. Intrinsics are actually among the features most underutilized by programmers. I'm glad .NET is taking the first step.

But it is possible to give a general, consistent interface as long as your parallel adders/multipliers/movers, as well as the specialized interfaces, use providers, which you are building. It would benefit common people.

That's it.
