Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better SIMD algorithm for min/max? #56

Closed
gfoidl opened this issue Apr 23, 2018 · 1 comment
Closed

Better SIMD algorithm for min/max? #56

gfoidl opened this issue Apr 23, 2018 · 1 comment
Assignees
Milestone

Comments

@gfoidl
Copy link
Owner

gfoidl commented Apr 23, 2018

Maybe MinMaxCore1 or MinMaxCore2 are better implementations than current (MinMaxCore0)?

public static double MinMaxCore0(Vector<double> vector)
{
    Vector256<double> vec256 = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
    Vector128<double> hi128  = Avx.ExtractVector128(vec256, 1);                 // lat 3        thr 1
    Vector128<double> lo128  = Avx.ExtractVector128(vec256, 0);                 // lat 3        thr 1

    Vector128<double> tmp1 = Avx.Permute(hi128, 0b_01);                         // lat 1        thr 1
    Vector128<double> tmp2 = Avx.Permute(lo128, 0b_01);                         // lat 1        thr 1

    hi128 = Sse2.Min(hi128, tmp1);                                              // lat 4        thr 0.5
    lo128 = Sse2.Min(lo128, tmp2);                                              // lat 4        thr 0.5
    lo128 = Sse2.Min(lo128, hi128);                                             // lat 4        thr 0.5

    return Sse2.ConvertToDouble(lo128);                                         // lat 1        thr 1
}                                                                               // =  21        =   6.5

public static double MinMaxCore1(Vector<double> vector)
{
    Vector256<double> vec256       = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
    Vector256<double> swapped      = Avx.Permute2x128(vec256, vec256, 0x03);    // lat 3        thr 1
    Vector256<double> min256       = Avx.Min(vec256, swapped);                  // lat 4        thr 0.5
    Vector128<double> lo128        = Avx.ExtractVector128(min256, 0);           // lat 3        thr 1
    Vector128<double> lo128Swapped = Avx.Permute(lo128, 0b_01);                 // lat 1        thr 1
    Vector128<double> min128       = Sse2.Min(lo128, lo128Swapped);             // lat 4        thr 0.5

    return Sse2.ConvertToDouble(min128);                                        // lat 1        thr 1
}                                                                               // =  16        =   5

public static unsafe double MinMaxCore2(Vector<double> vector)
{
    Vector256<double> vec256        = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
    Vector256<double> swapped       = Avx.Permute2x128(vec256, vec256, 0x03);   // lat 3        thr 1
    Vector256<double> min256        = Avx.Min(vec256, swapped);                 // lat 4        thr 0.5
    Vector256<double> min256Swapped = Avx.Permute(min256, 0b_0101);             // lat 1        thr 1
    min256                          = Avx.Min(min256, min256Swapped);           // lat 4        thr 0.5
    Vector128<double> min128        = Avx.ExtractVector128(min256, 0);          // lat 3        thr 1

    return Sse2.ConvertToDouble(min128);                                        // lat 1        thr 1
}                                                                               // =  16        thr 5

Cf. #49

@gfoidl gfoidl added this to the v1.1.0 milestone Apr 23, 2018
@gfoidl gfoidl self-assigned this Apr 23, 2018
@gfoidl
Copy link
Owner Author

gfoidl commented Apr 23, 2018

Based on the numbers of latency and throughput they should be faster. But don't get fooled by these numbers, because instruction level parallelism comes into account, as can be seen by benchmarks:

BenchmarkDotNet=v0.10.14, OS=Windows 10.0.16299.371 (1709/FallCreatorsUpdate/Redstone3)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical and 4 physical cores
Frequency=2742187 Hz, Resolution=364.6724 ns, Timer=TSC
.NET Core SDK=2.1.300-preview3-008618
  [Host]     : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT
  DefaultJob : .NET Core 2.1.0-preview3-26411-06 (CoreCLR 4.6.26411.07, CoreFX 4.6.26411.06), 64bit RyuJIT

Method Mean Error StdDev Scaled ScaledSD
A 2.176 ns 0.0831 ns 0.0816 ns 1.00 0.00
B 2.266 ns 0.1006 ns 0.0941 ns 1.04 0.06
C 2.260 ns 0.0630 ns 0.0589 ns 1.04 0.04

In MinMaxCore0 almost always two instructions can be done in parallel.
Whereas in MinMaxCore1 and MinMaxCore2 there's more or less a strict sequential dependency, thus disabling ISP.

; MinMaxCore0
00007FFD1ECD28E0  vzeroupper
00007FFD1ECD28E3  vmovupd      ymm0,ymmword ptr [rcx]
00007FFD1ECD28E8  vextractf128 xmm1,ymm0,1
00007FFD1ECD28EE  vpermilpd    xmm2,xmm1,1
00007FFD1ECD28F4  vextractf128 xmm0,ymm0,0
00007FFD1ECD28FA  vpermilpd    xmm3,xmm0,1
00007FFD1ECD2900  vminpd       xmm1,xmm1,xmm2
00007FFD1ECD2905  vminpd       xmm0,xmm0,xmm3
00007FFD1ECD290A  vminpd       xmm0,xmm0,xmm1
00007FFD1ECD290F  vzeroupper
00007FFD1ECD2912  ret

; MinMaxCore1
00007FFD1ECC2960  vzeroupper
00007FFD1ECC2963  vmovupd      ymm0,ymmword ptr [rcx]
00007FFD1ECC2968  vperm2f128   ymm1,ymm0,ymm0,3
00007FFD1ECC296E  vminpd       ymm0,ymm0,ymm1
00007FFD1ECC2973  vextractf128 xmm0,ymm0,0
00007FFD1ECC2979  vpermilpd    xmm1,xmm0,1
00007FFD1ECC297F  vminpd       xmm0,xmm0,xmm1
00007FFD1ECC2984  vzeroupper
00007FFD1ECC2987  ret

; MinMaxCore2
00007FFD1ECF43B0  vzeroupper
00007FFD1ECF43B3  vmovupd      ymm0,ymmword ptr [rcx]
00007FFD1ECF43B8  vperm2f128   ymm1,ymm0,ymm0,3
00007FFD1ECF43BE  vminpd       ymm0,ymm0,ymm1
00007FFD1ECF43C3  vpermilpd    ymm1,ymm0,5
00007FFD1ECF43C9  vminpd       ymm0,ymm0,ymm1
00007FFD1ECF43CE  vextractf128 xmm0,ymm0,0
00007FFD1ECF43D4  vzeroupper
00007FFD1ECF43D7  ret

Though the aforementioned dependency can be better seen in VS with the C# code.

@gfoidl gfoidl closed this as completed Apr 23, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant