Instead of a loop-reduction in simd, Vector.Dot is used #44

Merged: gfoidl merged 1 commit into master from dot-for-sum-reduction on Feb 9, 2018

@gfoidl gfoidl commented Feb 9, 2018

Fixes #43

Disassembly (dasm) for the reduction:

G_M20916_IG01:
       C5F877               vzeroupper

G_M20916_IG02:
       C4E27D190524000000   vbroadcastsd ymm0, ymmword ptr[reloc @RWD00]        ; ymm0 = | 1 | 1 | 1 | 1 |
       C4E17D104F08         vmovupd  ymm1, ymmword ptr[rdi+8]                   ; ymm1 = | a | b | c | d |
       C4E17D59C1           vmulpd   ymm0, ymm1                                 ; ymm0 = | a | b | c | d |
       C4E17D7CC0           vhaddpd  ymm0, ymm0                                 ; ymm0 = | a + b | a + b | c + d | c + d |
       C4E37D19C201         vextractf128 ymm2, ymm0, 1                          ; xmm2 = | c + d | c + d |
       C4E17958C2           vaddpd   xmm0, xmm2                                 ; xmm0 = | a + b + c + d | a + b + c + d |

G_M20916_IG03:
       C5F877               vzeroupper                                          ; upper 128 bits of ymm0 zeroed; xmm0 unchanged
       C3                   ret
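
The lane-by-lane behavior of the listing above can be checked with a small scalar model. This is only a sketch for illustration: `Vec4`, `Vec2`, and the helper functions are hypothetical names that imitate the instructions in plain C++, with index 0 standing for the lowest lane. The point to note is that `vhaddpd` adds neighbouring pairs within each 128-bit half, so the upper half has to be extracted and added to the lower half to get the full sum.

```cpp
#include <array>

using Vec4 = std::array<double, 4>;  // stands in for a ymm register, index 0 = lowest lane
using Vec2 = std::array<double, 2>;  // stands in for an xmm register

// Models vmulpd: element-wise multiplication.
Vec4 mul_pd(const Vec4& x, const Vec4& y) {
    return { x[0] * y[0], x[1] * y[1], x[2] * y[2], x[3] * y[3] };
}

// Models vhaddpd on ymm registers: pairs are added within each 128-bit half,
// and the results from the two sources are interleaved.
Vec4 hadd_pd(const Vec4& x, const Vec4& y) {
    return { x[0] + x[1], y[0] + y[1], x[2] + x[3], y[2] + y[3] };
}

// Model vextractf128 with immediate 0 (lower half) and 1 (upper half).
Vec2 extract_lo(const Vec4& v) { return { v[0], v[1] }; }
Vec2 extract_hi(const Vec4& v) { return { v[2], v[3] }; }

// Models vaddpd on xmm registers.
Vec2 add_pd(const Vec2& x, const Vec2& y) {
    return { x[0] + y[0], x[1] + y[1] };
}

// Mirrors the instruction sequence: multiply by ones, horizontal add,
// then add the upper half to the lower half and read the lowest lane.
double reduce_model(const Vec4& v) {
    const Vec4 ones = { 1.0, 1.0, 1.0, 1.0 };
    const Vec4 prod = mul_pd(ones, v);
    const Vec4 h    = hadd_pd(prod, prod);
    const Vec2 s    = add_pd(extract_lo(h), extract_hi(h));
    return s[0];
}
```

For the input `{1, 2, 3, 4}` the horizontal add gives `{3, 3, 7, 7}` and the reduction returns 10.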

@gfoidl gfoidl added this to the v1.1.0 milestone Feb 9, 2018
@gfoidl gfoidl self-assigned this Feb 9, 2018
@gfoidl gfoidl merged commit 0e607fb into master Feb 9, 2018
@gfoidl gfoidl deleted the dot-for-sum-reduction branch February 9, 2018 08:45

gfoidl commented Feb 9, 2018

This could be even better with just the horizontal add, but .NET doesn't support that (yet).

In C++ this could be written as

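#include <immintrin.h>

// Sums the first four doubles at arr with a single horizontal add.
// arr must be 32-byte aligned (aligned load); n is unused in this minimal example.
double reduce_simd(double* arr, const int n)
{
    __m256d a     = _mm256_load_pd(arr);            // | a | b | c | d |
    __m256d tmp   = _mm256_hadd_pd(a, a);           // | a + b | a + b | c + d | c + d |
    __m128d hi128 = _mm256_extractf128_pd(tmp, 1);  // | c + d | c + d |
    __m128d lo128 = _mm256_extractf128_pd(tmp, 0);  // | a + b | a + b |
    __m128d s     = _mm_add_pd(lo128, hi128);       // | a + b + c + d | a + b + c + d |

    return _mm_cvtsd_f64(s);                        // read the lowest lane
}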

yielding

vmovapd         ymm0, YMMWORD PTR [rdi]
vhaddpd         ymm0, ymm0, ymm0
vextractf128    xmm1, ymm0, 0x1
vaddpd          xmm0, xmm1, xmm0
vzeroupper
ret

thus saving the broadcast of 1 and the multiplication.
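
As a rough sketch of how this horizontal reduction might sit at the end of a full array sum, the function below accumulates four doubles at a time and reduces the accumulator once at the end. `sum_simd` is a hypothetical name and not part of this PR; unaligned loads are used so no alignment is assumed, and the AVX path is only compiled when the compiler defines `__AVX__` (for example with `-mavx`), otherwise the scalar loop handles everything.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (not from the PR): sums n doubles starting at arr.
double sum_simd(const double* arr, std::size_t n)
{
    double sum = 0.0;
    std::size_t i = 0;

#if defined(__AVX__)
    // Vertical accumulation: four partial sums, one per lane.
    __m256d acc = _mm256_setzero_pd();
    for (; i + 4 <= n; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(arr + i));

    // Horizontal reduction of the accumulator, as in reduce_simd above:
    // hadd within each 128-bit half, then add the upper half to the lower half.
    __m256d tmp = _mm256_hadd_pd(acc, acc);
    __m128d hi  = _mm256_extractf128_pd(tmp, 1);
    __m128d lo  = _mm256_castpd256_pd128(tmp);
    sum = _mm_cvtsd_f64(_mm_add_pd(lo, hi));
#endif

    // Scalar tail (or the whole array when AVX is not enabled).
    for (; i < n; ++i)
        sum += arr[i];

    return sum;
}
```

Because the vector path adds elements in a different order than a plain loop, results for general floating-point data may differ in the last bits from a sequential sum.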
