Better Simd reduction #49

Merged
merged 4 commits into from
Apr 19, 2018

gfoidl
Owner

@gfoidl gfoidl commented Apr 19, 2018

Fixes #43

#44 made some improvements here, but the codegen wasn't perfect (cf. #44 (comment)).
#43 (comment) opened the door for better codegen; this PR is the implementation.

The biggest improvement over #44 is in ReduceMinMax:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=2742187 Hz, Resolution=364.6724 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008618
  [Host]     : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT

 Method |      Mean |     Error |    StdDev | Scaled |
------- |----------:|----------:|----------:|-------:|
   Base | 14.598 ns | 0.0441 ns | 0.0413 ns |   1.00 |
    New |  1.644 ns | 0.0216 ns | 0.0202 ns |   0.11 |

With wonderful dasm 😉

00007FFD1DDE7510  vzeroupper
00007FFD1DDE7513  vmovupd       ymm0,ymmword ptr [rcx]
00007FFD1DDE7518  vextractf128  xmm1,ymm0,1
00007FFD1DDE751E  vextractf128  xmm0,ymm0,0
00007FFD1DDE7524  vpermilpd     xmm2,xmm1,1
00007FFD1DDE752A  vpermilpd     xmm3,xmm0,1
00007FFD1DDE7530  vminpd        xmm1,xmm1,xmm2
00007FFD1DDE7535  vminpd        xmm0,xmm0,xmm3
00007FFD1DDE753A  vminpd        xmm0,xmm0,xmm1
00007FFD1DDE753F  vmovsd        qword ptr [r8],xmm0
00007FFD1DDE7544  vmovupd       ymm0,ymmword ptr [rdx]
00007FFD1DDE7549  vextractf128  xmm1,ymm0,1
00007FFD1DDE754F  vextractf128  xmm0,ymm0,0
00007FFD1DDE7555  vpermilpd     xmm2,xmm1,1
00007FFD1DDE755B  vpermilpd     xmm3,xmm0,1
00007FFD1DDE7561  vmaxpd        xmm1,xmm1,xmm2
00007FFD1DDE7566  vmaxpd        xmm0,xmm0,xmm3
00007FFD1DDE756B  vmaxpd        xmm0,xmm0,xmm1
00007FFD1DDE7570  vmovsd        qword ptr [r9],xmm0
00007FFD1DDE7575  vzeroupper
00007FFD1DDE7578  ret
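
For readers who prefer intrinsics to disassembly, the sequence above can be approximated in C++. This is a hedged sketch, not the C# code from this PR: the names `reduce_min_max`, `horizontal_min` and `horizontal_max` and all parameter names are hypothetical, chosen to mirror the four pointer registers (rcx, rdx, r8, r9) in the listing. The GCC/Clang `target` attribute is only there so the snippet builds without `-mavx`; the CPU must still support AVX.

```cpp
#include <cassert>
#include <immintrin.h>

// Lets GCC/Clang compile the AVX intrinsics without -mavx (assumption: x86-64 host with AVX).
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX __attribute__((target("avx")))
#else
#define TARGET_AVX
#endif

// Hypothetical helper: horizontal minimum of the 4 doubles in v.
TARGET_AVX static inline double horizontal_min(__m256d v)
{
    __m128d hi   = _mm256_extractf128_pd(v, 1);  // upper half (vextractf128 ..., 1)
    __m128d lo   = _mm256_castpd256_pd128(v);    // lower half (the listing extracts with index 0)
    __m128d hi_s = _mm_permute_pd(hi, 1);        // vpermilpd: swap the two elements
    __m128d lo_s = _mm_permute_pd(lo, 1);
    hi = _mm_min_pd(hi, hi_s);                   // min within the upper half
    lo = _mm_min_pd(lo, lo_s);                   // min within the lower half
    return _mm_cvtsd_f64(_mm_min_pd(lo, hi));    // combine halves, take element 0
}

// Hypothetical helper: horizontal maximum, same shape with max instead of min.
TARGET_AVX static inline double horizontal_max(__m256d v)
{
    __m128d hi   = _mm256_extractf128_pd(v, 1);
    __m128d lo   = _mm256_castpd256_pd128(v);
    __m128d hi_s = _mm_permute_pd(hi, 1);
    __m128d lo_s = _mm_permute_pd(lo, 1);
    hi = _mm_max_pd(hi, hi_s);
    lo = _mm_max_pd(lo, lo_s);
    return _mm_cvtsd_f64(_mm_max_pd(lo, hi));
}

// Hypothetical counterpart of ReduceMinMax: one 4-double vector in for each
// of min and max, one scalar out for each.
TARGET_AVX void reduce_min_max(const double* min_vec, const double* max_vec,
                               double* min_out, double* max_out)
{
    assert(min_vec && max_vec && min_out && max_out);
    *min_out = horizontal_min(_mm256_loadu_pd(min_vec));
    *max_out = horizontal_max(_mm256_loadu_pd(max_vec));
}
```

The two half-width min (or max) operations do not depend on each other, which matches the paired vminpd/vmaxpd instructions in the listing.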

@gfoidl gfoidl added this to the v1.1.0 milestone Apr 19, 2018
@gfoidl gfoidl self-assigned this Apr 19, 2018
@gfoidl gfoidl merged commit cc76c50 into master Apr 19, 2018
@gfoidl gfoidl deleted the simd-reduction branch April 19, 2018 13:12
@gfoidl
Owner Author

gfoidl commented Apr 19, 2018

Just for reference, here is a rotation-based variant in C++:

#include <iostream>
#include <immintrin.h>
//-----------------------------------------------------------------------------
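float max_sse(float* a)
{
    // Unaligned load: a plain float array is not guaranteed to be 16-byte aligned.
    __m128 maxval = _mm_loadu_ps(a);

    // Rotate by one lane and take the max; after 3 rounds every lane holds the overall max.
    for (int i = 0; i < 3; ++i)
    {
        __m128 tmp = _mm_shuffle_ps(maxval, maxval, 0x93);
        maxval     = _mm_max_ps(maxval, tmp);
    }

    float res;
    _mm_store_ss(&res, maxval);
    return res;
}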
//-----------------------------------------------------------------------------
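double max_sse(double* a)
{
    // Unaligned load: a plain double array is not guaranteed to be 32-byte aligned.
    __m256d maxval = _mm256_loadu_pd(a);

    // Rotate by one lane and take the max; _mm256_permute4x64_pd requires AVX2.
    for (int i = 0; i < 3; ++i)
    {
        __m256d tmp = _mm256_permute4x64_pd(maxval, 0x39);
        maxval      = _mm256_max_pd(maxval, tmp);
    }

    // Store only the lowest lane; _mm256_store_pd would write 32 bytes into an 8-byte double.
    double res;
    _mm_store_sd(&res, _mm256_castpd256_pd128(maxval));
    return res;
}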
//-----------------------------------------------------------------------------
// Same argument order as the standard _MM_SHUFFLE, so that MM_SHUFFLE(2, 1, 0, 3) == 0x93.
#define MM_SHUFFLE(fp3,fp2,fp1,fp0) (((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | ((fp0)))
//-----------------------------------------------------------------------------
int main()
{
    int a = 0x93;
    int b = MM_SHUFFLE(2, 1, 0, 3);

    float arr[] = {1, 2, 3, 4};
    float max   = max_sse(arr);

    double darr[] = {1, 2, 3, 4};
    double dmax   = max_sse(darr);

    using namespace std;

    cout << max  << endl;
    cout << dmax << endl;
}
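
The immediates 0x93 and 0x39 used above are easier to read once decoded into lane indices. The following is a minimal sketch, assuming the 2-bits-per-lane encoding shared by `_mm_shuffle_ps` (with both inputs the same vector) and `_mm256_permute4x64_pd`, where bits [2i+1:2i] of the immediate pick the source lane for destination lane i. The helper names `source_lane` and `shuffle_imm` are hypothetical.

```cpp
#include <cstdint>

// Source lane selected for destination lane `dst` (0..3) by an 8-bit shuffle immediate.
constexpr int source_lane(std::uint8_t imm, int dst)
{
    return (imm >> (2 * dst)) & 0x3;
}

// Builds the immediate from the source lane chosen for each destination lane,
// highest destination lane first (the argument order of the standard _MM_SHUFFLE macro).
constexpr std::uint8_t shuffle_imm(int d3, int d2, int d1, int d0)
{
    return static_cast<std::uint8_t>((d3 << 6) | (d2 << 4) | (d1 << 2) | d0);
}

// 0x93: destination lanes {0, 1, 2, 3} read source lanes {3, 0, 1, 2} (rotate up by one lane).
static_assert(shuffle_imm(2, 1, 0, 3) == 0x93, "immediate of the float variant");

// 0x39: destination lanes {0, 1, 2, 3} read source lanes {1, 2, 3, 0} (rotate down by one lane).
static_assert(shuffle_imm(0, 3, 2, 1) == 0x39, "immediate of the double variant");
```

Both immediates therefore rotate the vector by one lane, just in opposite directions, which is why three rounds of rotate-and-max are enough to bring the maximum into every lane.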

I haven't tested the double version in C#, because _mm256_permute4x64_pd is missing from the reference assembly (though it's available in CoreLib).
But I believe the implemented variant is faster, because it's just

  • extracting the two __m128d halves out of the __m256d
  • swapping the two elements within each half
  • min/max

instead of three rounds of rotating and min/max.
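
The same fold-the-halves idea can also be applied to the float case from the C++ above: instead of three rotate-and-max rounds, 4 floats can be reduced in two steps. This is a sketch of an alternative that is not part of this PR and has not been benchmarked here; it uses only SSE intrinsics, and the function name `max_sse_fold` is hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Hypothetical two-step horizontal max of 4 floats (SSE only).
float max_sse_fold(const float* a)
{
    assert(a != nullptr);

    __m128 v  = _mm_loadu_ps(a);             // {a0, a1, a2, a3}, unaligned load
    __m128 hi = _mm_movehl_ps(v, v);         // {a2, a3, a2, a3}: upper half moved down
    __m128 m  = _mm_max_ps(v, hi);           // lanes 0 and 1: max(a0, a2), max(a1, a3)
    __m128 sw = _mm_shuffle_ps(m, m, 0x55);  // broadcast lane 1 into every lane
    m = _mm_max_ss(m, sw);                   // lane 0 now holds the overall max

    return _mm_cvtss_f32(m);
}
```

Called as `max_sse_fold(arr)` on a 4-element float array, it should return the same value as the rotation-based `max_sse(float*)`.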
