Better Simd reduction #49

gfoidl · 2018-04-19T13:10:03Z

Fixes #43

#44 made some improvements here, but the codegen wasn't perfect; cf. #44 (comment).
#43 (comment) opened the door to better codegen, and this PR is the implementation.

The biggest improvement over #44 is in ReduceMinMax:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=2742187 Hz, Resolution=364.6724 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008618
  [Host]     : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT

Method  Mean       Error      StdDev     Scaled
Base    14.598 ns  0.0441 ns  0.0413 ns  1.00
New      1.644 ns  0.0216 ns  0.0202 ns  0.11

With wonderful dasm 😉

00007FFD1DDE7510  vzeroupper
00007FFD1DDE7513  vmovupd       ymm0,ymmword ptr [rcx]
00007FFD1DDE7518  vextractf128  xmm1,ymm0,1
00007FFD1DDE751E  vextractf128  xmm0,ymm0,0
00007FFD1DDE7524  vpermilpd     xmm2,xmm1,1
00007FFD1DDE752A  vpermilpd     xmm3,xmm0,1
00007FFD1DDE7530  vminpd        xmm1,xmm1,xmm2
00007FFD1DDE7535  vminpd        xmm0,xmm0,xmm3
00007FFD1DDE753A  vminpd        xmm0,xmm0,xmm1
00007FFD1DDE753F  vmovsd        qword ptr [r8],xmm0
00007FFD1DDE7544  vmovupd       ymm0,ymmword ptr [rdx]
00007FFD1DDE7549  vextractf128  xmm1,ymm0,1
00007FFD1DDE754F  vextractf128  xmm0,ymm0,0
00007FFD1DDE7555  vpermilpd     xmm2,xmm1,1
00007FFD1DDE755B  vpermilpd     xmm3,xmm0,1
00007FFD1DDE7561  vmaxpd        xmm1,xmm1,xmm2
00007FFD1DDE7566  vmaxpd        xmm0,xmm0,xmm3
00007FFD1DDE756B  vmaxpd        xmm0,xmm0,xmm1
00007FFD1DDE7570  vmovsd        qword ptr [r9],xmm0
00007FFD1DDE7575  vzeroupper
00007FFD1DDE7578  ret

gfoidl · 2018-04-19T13:33:40Z

Just for reference, a bit of C++:

#include <iostream>
#include <immintrin.h>
//-----------------------------------------------------------------------------
float max_sse(float* a)
{
    // Unaligned load: a plain float array is not guaranteed to be 16-byte aligned.
    __m128 maxval = _mm_loadu_ps(a);

    for (int i = 0; i < 3; ++i)
    {
        __m128 tmp = _mm_shuffle_ps(maxval, maxval, 0x93);
        maxval     = _mm_max_ps(maxval, tmp);
    }

    float res;
    _mm_store_ss(&res, maxval);
    return res;
}
//-----------------------------------------------------------------------------
double max_sse(double* a)
{
    // Unaligned load: a plain double array is not guaranteed to be 32-byte aligned.
    __m256d maxval = _mm256_loadu_pd(a);

    for (int i = 0; i < 3; ++i)
    {
        __m256d tmp = _mm256_permute4x64_pd(maxval, 0x39);
        maxval      = _mm256_max_pd(maxval, tmp);
    }

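    double res;
    // Store only the lowest lane; _mm256_store_pd would write 32 bytes into an 8-byte double.
    _mm_store_sd(&res, _mm256_castpd256_pd128(maxval));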
    return res;
}
//-----------------------------------------------------------------------------
// Same argument order as the standard _MM_SHUFFLE, so MM_SHUFFLE(2, 1, 0, 3) == 0x93.
#define MM_SHUFFLE(fp3,fp2,fp1,fp0) (((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | ((fp0)))
//-----------------------------------------------------------------------------
int main()
{
    int a = 0x93;
    int b = MM_SHUFFLE(2, 1, 0, 3);

    float arr[] = {1, 2, 3, 4};
    float max   = max_sse(arr);

    double darr[] = {1, 2, 3, 4};
    double dmax   = max_sse(darr);

    using namespace std;

    cout << max  << endl;
    cout << dmax << endl;
}
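
To make the shuffle immediate 0x93 in the float version above less opaque, here is a minimal scalar sketch of what the rotate-and-max loop does. The names shuffle_imm, shuffle_model and rotate_max_model are hypothetical helpers invented for illustration; the model assumes both shuffle operands are the same vector, as in max_sse.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical helper: builds a shuffle immediate using the same argument
// order as the standard _MM_SHUFFLE macro (selector for lane 3 first).
constexpr int shuffle_imm(int z, int y, int x, int w)
{
    return (z << 6) | (y << 4) | (x << 2) | w;
}

static_assert(shuffle_imm(2, 1, 0, 3) == 0x93, "0x93 encodes the selectors 2, 1, 0, 3");

// Scalar model of _mm_shuffle_ps(v, v, imm): lane i of the result is
// the source lane selected by bits [2i+1 : 2i] of the immediate.
std::array<float, 4> shuffle_model(const std::array<float, 4>& v, int imm)
{
    std::array<float, 4> r{};
    for (int i = 0; i < 4; ++i)
        r[i] = v[(imm >> (2 * i)) & 3];
    return r;
}

// Scalar model of the loop in max_sse: rotate by one lane, take the
// lane-wise maximum, and repeat three times.
std::array<float, 4> rotate_max_model(std::array<float, 4> v)
{
    for (int step = 0; step < 3; ++step)
    {
        const std::array<float, 4> tmp = shuffle_model(v, 0x93);
        for (int i = 0; i < 4; ++i)
            v[i] = std::max(v[i], tmp[i]);
    }
    return v;
}
```

With 0x93 each lane receives its lower neighbour (lane 0 wraps around to lane 3), so every step widens the window each lane has seen by one element. After three steps every lane holds the maximum of all four inputs, which is why the loop runs exactly three times.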

I haven't tested the double version in C#, because _mm256_permute4x64_pd is missing from the reference assembly (though it's available in CoreLib).
But I believe the implemented variant is faster, because it's just

extracting the two __m128d halves out of the __m256d
reversing each half
min/max

instead of rotating and min/max.
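
For comparison with the rotating version, the extract / reverse / min-max steps can be sketched in C++ as follows. This is only an illustration that mirrors the disassembly in the PR description, not the C# code of this PR; reduce_min4 and reduce_max4 are hypothetical names. The intrinsic path is used only when the compiler has AVX enabled (for example via -mavx); otherwise the sketch falls back to plain scalar code so that it still builds.

```cpp
#include <algorithm>
#include <cassert>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: minimum of four doubles.
double reduce_min4(const double* p)
{
#if defined(__AVX__)
    const __m256d v  = _mm256_loadu_pd(p);          // {p[0], p[1], p[2], p[3]}
    __m128d hi = _mm256_extractf128_pd(v, 1);       // {p[2], p[3]}
    __m128d lo = _mm256_extractf128_pd(v, 0);       // {p[0], p[1]}
    const __m128d hi_rev = _mm_permute_pd(hi, 1);   // {p[3], p[2]}
    const __m128d lo_rev = _mm_permute_pd(lo, 1);   // {p[1], p[0]}
    hi = _mm_min_pd(hi, hi_rev);                    // min(p[2], p[3]) in both lanes
    lo = _mm_min_pd(lo, lo_rev);                    // min(p[0], p[1]) in both lanes
    return _mm_cvtsd_f64(_mm_min_pd(lo, hi));       // lowest lane holds the result
#else
    // Scalar fallback when AVX is not enabled at compile time.
    return std::min(std::min(p[0], p[1]), std::min(p[2], p[3]));
#endif
}

// Hypothetical helper: maximum of four doubles, same structure as above.
double reduce_max4(const double* p)
{
#if defined(__AVX__)
    const __m256d v  = _mm256_loadu_pd(p);
    __m128d hi = _mm256_extractf128_pd(v, 1);
    __m128d lo = _mm256_extractf128_pd(v, 0);
    const __m128d hi_rev = _mm_permute_pd(hi, 1);
    const __m128d lo_rev = _mm_permute_pd(lo, 1);
    hi = _mm_max_pd(hi, hi_rev);
    lo = _mm_max_pd(lo, lo_rev);
    return _mm_cvtsd_f64(_mm_max_pd(lo, hi));
#else
    return std::max(std::max(p[0], p[1]), std::max(p[2], p[3]));
#endif
}
```

This variant needs two min/max steps on 128-bit vectors plus the extracts and in-lane swaps, whereas the rotating version needs three lane-crossing permutes and three min/max steps on the full 256-bit vector. Whether that is actually faster on a given CPU would still need a benchmark.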

gfoidl added 4 commits April 18, 2018 14:55

Multitargeting netstandard2.0, netcoreapp2.1

8bae1ef

Fixed build for netcoreapp2.1

a80df83

VectorHelper.ReduceSum

3b26dee

VectorHelper.ReduceMinMax

1a2a603

gfoidl added the performance label Apr 19, 2018

gfoidl added this to the v1.1.0 milestone Apr 19, 2018

gfoidl self-assigned this Apr 19, 2018

gfoidl merged commit cc76c50 into master Apr 19, 2018

gfoidl deleted the simd-reduction branch April 19, 2018 13:12

gfoidl mentioned this pull request Apr 23, 2018

Better SIMD algorithm for min/max? #56

Closed
