-
-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Very good dotProduct #9887
Labels
Comments
bearophile_hugs commented on 2011-04-28T18:55:10ZSome code I have written for arrays of floats:
float dot(float[] a, float[] b) {
enum size_t UNROLL_MASK = 0b111;
assert(a.length == b.length, "dot(): the two array lengths differ.");
typeof(a[0]) tot = void;
auto a_ptr = a.ptr;
auto b_ptr = b.ptr;
assert((cast(size_t)a_ptr & cast(size_t)0b1111) == 0,
"dot(): the first array is not aligned to 16 bytes.");
assert((cast(size_t)b_ptr & cast(size_t)0b1111) == 0,
"dot(): the second array is not aligned to 16 bytes.");
size_t len = (a.length & (~UNROLL_MASK)) * a[0].sizeof;
if (len) {
asm {
mov EAX, a_ptr;
mov ECX, b_ptr;
mov EDI, len;
xorps XMM0, XMM0;
xorps XMM3, XMM3;
xor EDX, EDX;
align 8;
LOOP_START:
movaps XMM1, [EAX + EDX];
mulps XMM1, [ECX + EDX];
movaps XMM2, [EAX + EDX + 16];
mulps XMM2, [ECX + EDX + 16];
add EDX, 32;
addps XMM0, XMM1;
cmp EDI, EDX;
addps XMM3, XMM2;
jne LOOP_START;
addps XMM0, XMM3;
// XMM0[0] = XMM0[0] + XMM0[1] + XMM0[2] + XMM0[3]
movhlps XMM1, XMM0;
addps XMM0, XMM1;
pshufd XMM1, XMM0, 0b01_01_01_01;
addss XMM0, XMM1;
movss tot, XMM0;
}
} else
tot = 0.0;
size_t left = a.length & UNROLL_MASK;
for (size_t i = left; i > 0; i--)
tot += a[$ - i] * b[$ - i];
return tot;
}
And for arrays of doubles:
import std.c.stdio;
void main() {
double[] A = [0.7644108908809033, 0.96458177717869509,
0.51166055832523683, 0.098840360055908461,
0.40813780586233483, 0.32285857447088551,
0.59137950751990331, 0.59518178287473289];
double[] B = [0.28061374331187983, 0.85036446787626108,
0.52498124274748326, 0.84170745998075014,
0.55819169392258683, 0.62586351111477134,
0.43021720539707864, 0.652708603473523];
// >>> sum(a*b for a,b in zip(A, B))
// 2.4593435789602185
if (A.length % 4 != 0) throw new Error("");
double tot1 = void,
tot2 = void;
auto a_ptr = cast(double*)A.ptr;
auto b_ptr = cast(double*)B.ptr;
size_t len = A.length * double.sizeof;
asm {
mov EAX, a_ptr;
mov ECX, b_ptr;
mov EDI, len;
xorps XMM0, XMM0;
xorps XMM3, XMM3;
xor EDX, EDX;
align 8;
LOOP_START:
movapd XMM1, [EAX + EDX];
mulpd XMM1, [ECX + EDX];
movapd XMM2, [EAX + EDX + 16];
mulpd XMM2, [ECX + EDX + 16];
add EDX, 32;
addpd XMM0, XMM1;
cmp EDI, EDX;
addpd XMM3, XMM2;
jne LOOP_START;
addpd XMM0, XMM3; // XMM0 += XMM3
movhpd tot1, XMM0;
movlpd tot2, XMM0;
}
printf("%.15lf\n", tot1 + tot2);
}
See bug 5880 too. |
andrei (@andralex) commented on 2011-04-28T20:10:55ZNice work. Got some benchmarks? Also, I wonder what version flag I should use to guard the assembler implementation. Don? |
dsimcha commented on 2011-04-28T20:12:06ZLooks rather interesting. Dot products are sufficiently universally useful that, if this is significantly faster than the current std.numeric implementation, inclusion is definitely justified. However, you'd greatly increase the chances of inclusion if you created some simple benchmarks (nothing fancy, just the obvious ones) and unit tests and submitted this as a pull request on Github. |
clugdbug commented on 2011-04-28T21:36:38ZDid you test this on Intel, or AMD? Blas1 code is generally limited by memory access, and AMD has two load ports, so it has different bottlenecks and in these operations always does better than Intel.
See also a discussion at:
http://www.bealto.com/mp-dot_sse.html
(I've talked to Eric Bealto before, he's happy for Phobos to use any of his stuff if we see anything we want). It's a bit misleading, though, because above a certain length, you become dominated by cache effects, so I don't know if unrolling by 4 is actually worthwhile in practice.
I also have some optimized x87 dot product code (AMD 32 bit CPUs don't have SSE2, so they still need x87 for doubles). |
lt.infiltrator commented on 2014-03-19T17:56:16ZAndrei, since this is assigned to you: are you waiting on somebody/thing; or should this not be assigned to you? |
andrei (@andralex) commented on 2014-03-19T18:32:47ZHave at it! |
lt.infiltrator commented on 2014-03-19T18:43:12ZAndrei, I shall take that as a "this should not be assigned to me".
Bearophile, would you happen to have (or could put together) a PR for this? |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
bearophile_hugs reported this on 2010-06-25T15:17:41Z
Transfered from https://issues.dlang.org/show_bug.cgi?id=4393
CC List
Description
The text was updated successfully, but these errors were encountered: