```csharp
double sum = 0;
for (int i = 0; i < Vector<double>.Count; ++i)
    sum += vec[i];
```

is equivalent to

```csharp
Vector.Dot(vec, Vector<double>.One);
```
while the latter is about twice as fast.
```
BenchmarkDotNet=v0.10.12, OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.125)
Intel Core i7-7700HQ CPU 2.80GHz (Kaby Lake), 1 CPU, 8 logical cores and 4 physical cores
Frequency=2742191 Hz, Resolution=364.6719 ns, Timer=TSC
.NET Core SDK=2.1.4
  [Host]     : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
  DefaultJob : .NET Core 2.0.5 (Framework 4.6.26020.03), 64bit RyuJIT
```
Benchmark code is here.
In the context of huge loops this reduction is done only once at the end, so the perf win will be small, but it should still be changed to get optimal code.
SIMD intrinsics are coming in netcoreapp2.1, so #44 (comment) can be written in C# like
```csharp
private static double Reduce(Vector<double> vector)
{
    if (Avx.IsSupported && Sse2.IsSupported)
    {
        Vector256<double> a = Unsafe.As<Vector<double>, Vector256<double>>(ref vector);
        Vector256<double> tmp = Avx.HorizontalAdd(a, a);
        Vector128<double> hi128 = Avx.ExtractVector128(tmp, 1);
        Vector128<double> s = Sse2.Add(Unsafe.As<Vector256<double>, Vector128<double>>(ref tmp), hi128);
        return Sse2.ConvertToDouble(s);
    }
    else
    {
        return Vector.Dot(vector, Vector<double>.One);
    }
}
```
This would need multi-targeting and
```xml
<ItemGroup>
  <PackageReference Include="System.Runtime.CompilerServices.Unsafe" Version="4.5.0-preview3-26417-03" />
  <PackageReference Include="System.Runtime.Intrinsics.Experimental" Version="4.5.0-preview2-26406-04" />
</ItemGroup>
```
Produces
```asm
vzeroupper
vmovupd      ymm0, ymmword ptr [rcx]
vhaddpd      ymm0, ymm0, ymm0
vextractf128 xmm1, ymm0, 1
vaddpd       xmm0, xmm1, xmm0
vzeroupper
ret
```
so exactly the same code as the C++ compiler generates.
gfoidl