blasfeo_ddot: suggestion for improvement #91

Open
roversch opened this issue Jan 30, 2019 · 2 comments

Comments

@roversch
Contributor

In the 'reduce' step of blasfeo_ddot, the final sum is computed with the horizontal add _mm_hadd_pd. Instead, one could replace

u_tmp = _mm_hadd_pd(u_tmp, u_tmp);

with

__m128d hi64 = _mm_unpackhi_pd(u_tmp, u_tmp);
u_tmp = _mm_add_sd(u_tmp, hi64);

effectively trading a packed-double operation for a scalar one.
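For illustration, the proposed replacement can be wrapped into a small standalone helper. This is only a sketch, not BLASFEO's code: the names hsum_pd_sse2 and hsum_pair are hypothetical, and it assumes an x86 target with SSE2 available.

```c
#include <emmintrin.h>  /* SSE2: _mm_unpackhi_pd, _mm_add_sd, _mm_cvtsd_f64 */

/* Hypothetical helper: sum the two double lanes of v using SSE2 only. */
static inline double hsum_pd_sse2(__m128d v)
{
    /* Broadcast the high lane: hi64 = [v1, v1]. */
    __m128d hi64 = _mm_unpackhi_pd(v, v);
    /* Scalar add on the low lane only: low lane becomes v0 + v1. */
    __m128d sum = _mm_add_sd(v, hi64);
    /* Extract the low lane as a plain double. */
    return _mm_cvtsd_f64(sum);
}

/* Hypothetical convenience wrapper: build a vector [lo, hi] and reduce it. */
static double hsum_pair(double lo, double hi)
{
    /* _mm_set_pd takes the high element first, then the low element. */
    return hsum_pd_sse2(_mm_set_pd(hi, lo));
}
```

Calling hsum_pair(lo, hi) returns lo + hi, which is the same low-lane value that _mm_hadd_pd(u_tmp, u_tmp) would produce for u_tmp = [lo, hi].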

@giaf
Owner

giaf commented Feb 2, 2019

Yes, what you propose would indeed reduce the latency by 1 clock cycle: from 5 for hadd to 4 = 1 (unpackhi) + 3 (add).
But in the end, the reduction code is not that important; the important part is the loop body.

In general, level 1 BLAS routines are not that important in what we do, and they gain much less from optimization than level 2 and especially level 3 routines, so they have received less attention.

The most important reason I would find to implement your improvement is that it removes the dependency on SSE3 when targeting machines with capabilities only up to SSE2. I don't know if this is the case for you. The choice to target SSE3 (i.e. the Core microarchitecture) was meant as a reasonable trade-off between convenience and availability of ISAs, including on embedded devices, which usually lag a bit behind.
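To make the two points above concrete (the loop body dominates, and the reduction is the only place SSE3 is needed), here is a rough sketch of a dot product with a selectable reduction. It is not the actual blasfeo_ddot implementation: the names reduce_pd and ddot_sketch are hypothetical, the arrays are plain double pointers rather than BLASFEO vector structs, and it assumes the compiler defines __SSE3__ when SSE3 is enabled (as GCC and Clang do).

```c
#include <emmintrin.h>      /* SSE2 intrinsics */
#ifdef __SSE3__
#include <pmmintrin.h>      /* SSE3: _mm_hadd_pd */
#endif

/* Hypothetical reduce step: sum the two lanes of v. */
static inline double reduce_pd(__m128d v)
{
#ifdef __SSE3__
    /* SSE3 path: horizontal add, as in the original reduction. */
    v = _mm_hadd_pd(v, v);
#else
    /* SSE2-only path: the replacement suggested in this issue. */
    v = _mm_add_sd(v, _mm_unpackhi_pd(v, v));
#endif
    return _mm_cvtsd_f64(v);
}

/* Hypothetical dot product over plain arrays (not the BLASFEO API). */
static double ddot_sketch(int n, const double *x, const double *y)
{
    __m128d acc = _mm_setzero_pd();
    int i = 0;

    /* Loop body: runs about n/2 times and needs only SSE2. */
    for (; i + 1 < n; i += 2) {
        __m128d xv = _mm_loadu_pd(x + i);   /* unaligned loads */
        __m128d yv = _mm_loadu_pd(y + i);
        acc = _mm_add_pd(acc, _mm_mul_pd(xv, yv));
    }

    /* Reduce step: runs once per call. */
    double dot = reduce_pd(acc);

    /* Scalar tail for odd n. */
    if (i < n)
        dot += x[i] * y[i];

    return dot;
}
```

With the SSE2-only branch, nothing in the routine requires SSE3, so the same source builds for targets limited to SSE2. Since the reduction executes once per call while the loop executes roughly n/2 times, the one-cycle latency difference matters mostly for very short vectors.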

@giaf
Owner

giaf commented Feb 2, 2019

Sure, if you want to make the changes and open a PR, I would be happy to merge it. Otherwise I would leave it as it is for now, since other things have higher priority on my side.

Thanks anyway for the suggestion :)
