```csharp
double sum = 0;
for (int i = 0; i < Vector<double>.Count; ++i)
    sum += vec[i];
```

is equivalent to

```csharp
Vector.Dot(vec, Vector<double>.One);
```
while the latter is about twice as fast.
```
BenchmarkDotNet=v0.10.12, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.125)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2742191 Hz, Resolution=364.6719 ns, Timer=TSC
.NET Core SDK=2.1.4
  [Host]     : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
```
Benchmark code is here.
In the context of huge loops this reduction is done only once at the end, so the perf win will be small, but it should still be changed to get optimal code.
SIMD intrinsics are coming in netcoreapp2.1, so #44 (comment) can be written in C# like
```csharp
private static double Reduce(Vector<double> vector)
{
    if (Avx.IsSupported && Sse2.IsSupported)
    {
        Vector256<double> a = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
        Vector256<double> tmp = Avx.HorizontalAdd(a, a);
        Vector128<double> hi128 = Avx.ExtractVector128(tmp, 1);
        Vector128<double> s = Sse2.Add(Unsafe.As<Vector256<double>, Vector128<double>>(ref tmp), hi128);
        return Sse2.ConvertToDouble(s);
    }
    else
    {
        return Vector.Dot(vector, Vector<double>.One);
    }
}
```
This would need multi-targeting and
```xml
<ItemGroup>
  <PackageReference Include="System.Runtime.CompilerServices.Unsafe" Version="4.5.0-preview3-26417-03" />
  <PackageReference Include="System.Runtime.Intrinsics.Experimental" Version="4.5.0-preview2-26406-04" />
</ItemGroup>
```
Produces
```asm
vzeroupper
vmovupd      ymm0, ymmword ptr [rcx]
vhaddpd      ymm0, ymm0, ymm0
vextractf128 xmm1, ymm0, 1
vaddpd       xmm0, xmm1, xmm0
vzeroupper
ret
```
so exactly the same code as the C++ compiler generates.
gfoidl