Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API Proposal: AVX-VNNI intrinsics #43780

Closed
hanblee opened this issue Oct 23, 2020 · 15 comments
Closed

API Proposal: AVX-VNNI intrinsics #43780

hanblee opened this issue Oct 23, 2020 · 15 comments
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics
Milestone

Comments

@hanblee
Copy link
Contributor

hanblee commented Oct 23, 2020

Background and Motivation

The upcoming Intel® Alder Lake and Sapphire Rapids processors will introduce AVX-VNNI instruction set architecture which provides VEX-encoded versions of the Vector Neural Network Instructions (reference: https://software.intel.com/content/dam/develop/external/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf). This proposal aims to expose AVX-VNNI instructions via intrinsics.

Proposed API

namespace System.Runtime.Intrinsics.X86
{
    public abstract class AvxVnni : Avx2
    {
        internal AvxVnni() { }

        public static new bool IsSupported { [Intrinsic] get { return false; } }

        public new abstract class X64 : Avx2.X64
        {
            internal X64() { }

            public static new bool IsSupported { [Intrinsic] get { return false; } }
        }

        /// <summary>
        /// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPBUSD xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPWSSD xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPBUSD ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPWSSD ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPBUSDS xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPWSSDS xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPBUSDS ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPWSSDS ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) { throw new PlatformNotSupportedException(); }
    }
}

/cc @tannergooding @CarolEidt

@hanblee hanblee added the api-suggestion Early API idea and discussion, it is NOT ready for implementation label Oct 23, 2020
@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added area-System.Runtime.Intrinsics untriaged New issue has not been triaged by the area owner labels Oct 23, 2020
@ghost
Copy link

ghost commented Oct 23, 2020

Tagging subscribers to this area: @tannergooding, @jeffhandley
See info in area-owners.md if you want to be subscribed.

@tannergooding tannergooding removed the untriaged New issue has not been triaged by the area owner label Oct 23, 2020
@tannergooding
Copy link
Member

Thanks for the proposal, these LGTM.

It might be beneficial to name the operands left, right, and addend to help disambiguate which each is used for.
It would also be good to identify it these operations are effectively "fused" and if so include that in the name. That is, is a * 2 - a == a or might it overflow and produce a different result?

@GrabYourPitchforks
Copy link
Member

@tannergooding What does the order of operations normally look like re: API review and implementation? Presumably we won't implement until we have test hardware, even if the API is approved?

@tannergooding
Copy link
Member

That is likely something that needs more discussion.

One could imagine implementing it as an "experimental API" even without hardware. But I don't believe we would ship without being able to validate things more end to end.

Of course Carol, Bruce, or others may have different opinions here 😄

@tannergooding
Copy link
Member

-- I'd also like to double check what we did here for ARM, as I feel we included Widening in the names for similar instructions.

CC. @echesakovMSFT

@echesakov
Copy link
Contributor

-- I'd also like to double check what we did here for ARM, as I feel we included Widening in the names for similar instructions.

CC. @echesakovMSFT

Yes, we did include Widening in the names of similar intrinsics on Arm64. For example,

/// <summary>
/// uint16x8_t vmlal_u8 (uint16x8_t a, uint8x8_t b, uint8x8_t c)
/// A32: VMLAL.U8 Qd, Dn, Dm
/// A64: UMLAL Vd.8H, Vn.8B, Vm.8B
/// </summary>
public static Vector128<ushort> MultiplyWideningLowerAndAdd(Vector128<ushort> addend, Vector64<byte> left, Vector64<byte> right) => MultiplyWideningLowerAndAdd(addend, left, right);

and

/// <summary>
/// int32_t vqdmlalh_lane_s16 (int32_t a, int16_t b, int16x4_t v, const int lane)
/// A64: SQDMLAL Sd, Hn, Vm.H[lane]
/// </summary>
public static Vector64<int> MultiplyDoublingWideningScalarBySelectedScalarAndAddSaturate(Vector64<int> addend, Vector64<short> left, Vector64<short> right, byte rightIndex) => MultiplyDoublingWideningScalarBySelectedScalarAndAddSaturate(addend, left, right, rightIndex);

Presumably, we can name the intrinsics the same way and, as an example, the one that corresponds to __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b) would become MultiplyWideningAndAdd.

Although on Arm64 "widening" always means doubling the size of the result while on Intel it would also include doubling and quadrupling (byte -> int)

@hanblee
Copy link
Contributor Author

hanblee commented Oct 23, 2020

Thanks for the feedback.

public static Vector64 MultiplyDoublingWideningScalarBySelectedScalarAndAddSaturate(...)

It looks like Saturate is used as a shortcut for WithSignedSaturation as well. Would the preference be for Saturate then and use Widening? E.g.,
MultiplyWideningAndAddSaturate instead of MultiplyAddWithSignedSaturation?

@echesakov
Copy link
Contributor

Thanks for the feedback.

public static Vector64 MultiplyDoublingWideningScalarBySelectedScalarAndAddSaturate(...)

It looks like Saturate is used as a shortcut for WithSignedSaturation as well. Would the preference be for Saturate then and use Widening? E.g.,
MultiplyWideningAndAddSaturate instead of MultiplyAddWithSignedSaturation?

If I understand what this instructions does - multiplies two values (bytes or shorts) with widening the result to larger type (short or int) and sum them up; if the value exceeds int positive or negative boundaries, it will saturate the value then the name seems right. I don't think we need to specify SignedSaturation since it's clear from the resulting value type.

I am curious why in

 MultiplyAdd(Vector128<int> source, Vector128<byte> a, Vector128<sbyte> b)

a corresponds to unsigned bytes while b to signed bytes?

@hanblee
Copy link
Contributor Author

hanblee commented Oct 23, 2020

The operand types are from the programming reference for VPDPBUSD:

Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand.

and that is also how the C intrinsic is designed:

__m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
Description
Multiply groups of 4 adjacent pairs of unsigned 8-bit integers in a with corresponding signed 8-bit integers in b, producing 4 intermediate signed 16-bit results. Sum these 4 results with the corresponding 32-bit integer in src, and store the packed 32-bit results in dst.

@hanblee
Copy link
Contributor Author

hanblee commented Nov 19, 2020

I updated the original post based on suggestions.

@hanblee
Copy link
Contributor Author

hanblee commented Mar 22, 2021

One could imagine implementing it as an "experimental API" even without hardware. But I don't believe we would ship without being able to validate things more end to end.

@tannergooding Since the HW spec is out in the public, I think the APIs are ready to be reviewed. I agree that it may not make sense to ship them until the hardware becomes available. Is there an experimental repo/branch where we can make a PR to get the implementation reviewed prior to hardware availability?

@tannergooding
Copy link
Member

tannergooding commented Mar 22, 2021

We can discuss exposing them in System.Runtime.Intrinsics.Experimental which is where we've exposed other "preview intrinsics". We'll also need to discuss how these will be tested even once hardware does become available since they won't be on our current CI machines.

I'm fine with marking this api-ready-for-review if there are no other concerns.

CC. @echesakovMSFT

@tannergooding tannergooding added api-ready-for-review API is ready for review, it is NOT ready for implementation and removed api-suggestion Early API idea and discussion, it is NOT ready for implementation labels Apr 5, 2021
@tannergooding tannergooding added this to the 6.0.0 milestone Apr 13, 2021
@tannergooding tannergooding added the blocking Marks issues that we want to fast track in order to unblock other important work label Apr 23, 2021
@bartonjs bartonjs added api-approved API was approved in API review, it can be implemented and removed api-ready-for-review API is ready for review, it is NOT ready for implementation blocking Marks issues that we want to fast track in order to unblock other important work labels Apr 27, 2021
@bartonjs
Copy link
Member

bartonjs commented Apr 27, 2021

Video

Looks good as proposed. One thing that was observed is one set of methods is short/short, but the other is byte/sbyte. Double check the sign bits.

namespace System.Runtime.Intrinsics.X86
{
    public abstract class AvxVnni : Avx2
    {
        internal AvxVnni() { }

        public static new bool IsSupported { [Intrinsic] get { return false; } }

        public new abstract class X64 : Avx2.X64
        {
            internal X64() { }

            public static new bool IsSupported { [Intrinsic] get { return false; } }
        }

        /// <summary>
        /// __m128i _mm_dpbusd_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPBUSD xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpwssd_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPWSSD xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAdd(Vector128<int> addend, Vector128<short> left, Vector128<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpbusd_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPBUSD ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpwssd_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPWSSD ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAdd(Vector256<int> addend, Vector256<short> left, Vector256<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpbusds_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPBUSDS xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<byte> left, Vector128<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m128i _mm_dpwssds_epi32 (__m128i src, __m128i a, __m128i b)
        /// VPDPWSSDS xmm, xmm, xmm
        /// </summary>
        public static Vector128<int> MultiplyWideningAndAddSaturate(Vector128<int> addend, Vector128<short> left, Vector128<short> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpbusds_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPBUSDS ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<byte> left, Vector256<sbyte> right) { throw new PlatformNotSupportedException(); }

        /// <summary>
        /// __m256i _mm256_dpwssds_epi32 (__m256i src, __m256i a, __m256i b)
        /// VPDPWSSDS ymm, ymm, ymm
        /// </summary>
        public static Vector256<int> MultiplyWideningAndAddSaturate(Vector256<int> addend, Vector256<short> left, Vector256<short> right) { throw new PlatformNotSupportedException(); }
    }
}

@hanblee
Copy link
Contributor Author

hanblee commented Apr 27, 2021

One thing that was observed is one set of methods is short/short, but the other is byte/sbyte. Double check the sign bits.

That's the way the instructions were designed.

The operand types are from the programming reference for VPDPBUSD:

Description
Multiplies the individual unsigned bytes of the first source operand by the corresponding signed bytes of the second
source operand, producing intermediate signed word results. The word results are then summed and accumulated
in the destination dword element size operand.

And for VPDPWSSD:

Description
Multiplies the individual signed words of the first source operand by the corresponding signed words of the second
source operand, producing intermediate signed, doubleword results. The adjacent doubleword results are then
summed and accumulated in the destination operand.

@tannergooding
Copy link
Member

This was implemented and merged.

@ghost ghost locked as resolved and limited conversation to collaborators Jul 18, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Runtime.Intrinsics
Projects
None yet
Development

No branches or pull requests

6 participants