Use Vector.Dot for simd-reduction (sum) #43

Closed
gfoidl opened this issue Feb 9, 2018 · 1 comment

gfoidl commented Feb 9, 2018

double sum = 0;

for (int i = 0; i < Vector<double>.Count; ++i)
    sum += vec[i];

is equivalent to

Vector.Dot(vec, Vector<double>.One);

while the latter is about twice as fast.
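As a minimal sketch of that equivalence (the class and method names `ReductionDemo`, `SumLoop` and `SumDot` are made up for illustration and are not part of the project), both variants can be put side by side. For general inputs the two results may differ in the last bits, because the additions are not guaranteed to happen in the same order.

```csharp
using System;
using System.Numerics;

public static class ReductionDemo
{
    // Scalar reduction: adds the lanes one by one.
    public static double SumLoop(Vector<double> vec)
    {
        double sum = 0;

        for (int i = 0; i < Vector<double>.Count; ++i)
            sum += vec[i];

        return sum;
    }

    // Reduction via the dot product with a vector of ones.
    public static double SumDot(Vector<double> vec)
        => Vector.Dot(vec, Vector<double>.One);

    public static void Main()
    {
        // Small integers in every lane, so both variants are exact.
        var values = new double[Vector<double>.Count];
        for (int i = 0; i < values.Length; ++i)
            values[i] = i + 1;

        var vec = new Vector<double>(values);

        Console.WriteLine(SumLoop(vec) == SumDot(vec));
    }
}
```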

BenchmarkDotNet=v0.10.12, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.125)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2742191 Hz, Resolution=364.6719 ns, Timer=TSC
.NET Core SDK=2.1.4
  [Host]     : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT

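| Method     | Mean      | Error     | StdDev    | Scaled | ScaledSD |
|------------|-----------|-----------|-----------|--------|----------|
| Loop_local | 0.6544 ns | 0.0212 ns | 0.0198 ns | 1.00   | 0.00     |
| Loop_field | 0.6873 ns | 0.0309 ns | 0.0289 ns | 1.05   | 0.05     |
| Dot        | 0.2862 ns | 0.0258 ns | 0.0229 ns | 0.44   | 0.04     |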

The benchmark code is here.

In the context of huge loops this reduction is done once at the end, so the perf-win will be small, but it should still be changed to have optimal code.
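To show where the reduction sits in such a loop, here is a rough sketch of a vectorized array sum. The names `ArraySum` and `Sum` are hypothetical and this is not the project's actual code; it only assumes the `System.Numerics` API.

```csharp
using System;
using System.Numerics;

public static class ArraySum
{
    public static double Sum(double[] data)
    {
        var acc = Vector<double>.Zero;
        int i = 0;

        // Index of the first element that no longer fills a whole vector.
        int lastBlock = data.Length - data.Length % Vector<double>.Count;

        // Main loop: one vector add per Vector<double>.Count elements.
        for (; i < lastBlock; i += Vector<double>.Count)
            acc += new Vector<double>(data, i);

        // Horizontal reduction, executed only once after the loop.
        double sum = Vector.Dot(acc, Vector<double>.One);

        // Scalar tail for the remaining elements.
        for (; i < data.Length; ++i)
            sum += data[i];

        return sum;
    }
}
```

The reduction runs once per call, while the vector add runs once per block, which is why its cost matters less as the array grows.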

gfoidl commented Apr 17, 2018

SIMD intrinsics are coming to netcoreapp2.1, so #44 (comment) can be written in C# like this:

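private static double Reduce(Vector<double> vector)
{
    // The reinterpretation below is only valid when Vector<double> is 256 bits wide (4 lanes).
    if (Avx.IsSupported && Sse2.IsSupported && Vector<double>.Count == 4)
    {
        // Reinterpret the Vector<double> as a 256-bit vector [a0, a1, a2, a3].
        Vector256<double> a     = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
        // tmp = [a0+a1, a0+a1, a2+a3, a2+a3]
        Vector256<double> tmp   = Avx.HorizontalAdd(a, a);
        // hi128 = upper half of tmp = [a2+a3, a2+a3]
        Vector128<double> hi128 = Avx.ExtractVector128(tmp, 1);
        // s = lower half of tmp + hi128, so every lane holds the total sum.
        Vector128<double> s     = Sse2.Add(Unsafe.As<Vector256<double>, Vector128<double>>(ref tmp), hi128);

        // Return the lowest lane.
        return Sse2.ConvertToDouble(s);
    }
    else
    {
        return Vector.Dot(vector, Vector<double>.One);
    }
}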

This would need multi-targeting and

<ItemGroup>
    <PackageReference Include="System.Runtime.CompilerServices.Unsafe" Version="4.5.0-preview3-26417-03" />
    <PackageReference Include="System.Runtime.Intrinsics.Experimental" Version="4.5.0-preview2-26406-04" />
</ItemGroup>

The code produces

vzeroupper
vmovupd         ymm0, ymmword ptr [rcx]
vhaddpd         ymm0, ymm0, ymm0
vextractf128    xmm1, ymm0, 1
vaddpd          xmm0, xmm1, xmm0
vzeroupper
ret

so exactly the same code as the C++ compiler generates.
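For readers less familiar with these instructions, the lane arithmetic of the listing can be traced in plain scalar code. This is only an emulation under the assumption of four lanes; `HorizontalAddTrace` is a hypothetical name and nothing here uses the intrinsics API itself.

```csharp
public static class HorizontalAddTrace
{
    // Scalar emulation of the lane operations for the input [a0, a1, a2, a3].
    public static double Reduce(double a0, double a1, double a2, double a3)
    {
        // vhaddpd ymm0, ymm0, ymm0  ->  [a0+a1, a0+a1, a2+a3, a2+a3]
        double lo = a0 + a1;
        double hi = a2 + a3;

        // vextractf128 xmm1, ymm0, 1  ->  xmm1 = [hi, hi]
        // vaddpd xmm0, xmm1, xmm0     ->  xmm0 = [hi+lo, hi+lo]
        // The lowest lane of xmm0 is the return value.
        return hi + lo;
    }
}
```

The additions are grouped as (a0 + a1) + (a2 + a3), while the scalar loop computes ((a0 + a1) + a2) + a3, so the two can differ by floating-point rounding.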
