SSE2 and SSE3 for smearing and clover_leaf.c #37
Comments
I'm actually running into issues with the SSE macros for the change to C99 complex as well and am currently having a look at them.
I am not sure it's not a compiler problem. However, the SSE macros were written explicitly for the Dirac operator, and there are inter-macro dependencies which might lead to the observed error messages if the macros are used elsewhere.
While we're discussing the SSE macros: I've noticed that with icc we disable the SSE2 and SSE3 defines (I guess because icc automatically attempts to use as much SSE as possible and we would conflict). If one forces their compilation with -DSSE2, the code compiles without errors but segfaults. I see that the code compiled with icc is faster despite this, but do we know that this is optimal? I would be interested in exploring this at some point.
You are saying icc without -DSSE2 is faster than gcc with -DSSE2? What about gcc with -DSSE3?
I wouldn't trust the icc-with-SSE numbers though, because the HMC produces loads of NaNs. I think there's a good reason the SSE flags are turned off in configure.in when using icc!
There might be an issue with the syntax there. We're using AT&T ordering, if I'm not mistaken, while icc might be expecting Intel ordering. That would exchange source and destination and explain the appearance of NaNs. In that case, it might be just a matter of setting compiler flags. Still, it is interesting that icc appears to do a better job than our manual code; I suppose that might well be progress in compiler design. It would be good to see what happens for the XLC compiler, actually.
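To make the operand-order point concrete, here is a minimal sketch of GNU-style inline assembly in AT&T order. The function name `sub_att` is made up for illustration and is not part of tmLQCD; whether icc really reads the project's macros in Intel order is only the hypothesis raised above, not something this sketch confirms. The assembly path is used only on x86 with a GNU-compatible compiler; elsewhere it falls back to plain C.

```c
#include <stdio.h>

/* Hypothetical helper, not from tmLQCD: returns a - b. */
int sub_att(int a, int b)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* AT&T order is "op source, destination": this computes a = a - b.
     * In Intel order the same operation is written "sub destination, source",
     * so reading the operands in the wrong order would compute b - a into
     * the wrong place. */
    __asm__("subl %1, %0"
            : "+r"(a)   /* %0: read-write destination */
            : "r"(b)    /* %1: source */
            : "cc");    /* the flags register is modified */
    return a;
#else
    /* Portable fallback for other architectures or compilers. */
    return a - b;
#endif
}
```

Calling `sub_att(7, 2)` gives 5. With swapped operand roles the result would land in the wrong register, which is the kind of silent corruption that could show up as NaNs in floating-point code.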
I see. The code also segfaulted on the p4 machine when the icc code was (force-)compiled with -DSSE2. I reproduced the benchmark results using the hmc (sample-hmc4.input) and the values are consistent: gcc without SSE is faster than gcc with SSE3, icc with SSE2/3 doesn't get anywhere because of NaNs, and icc without SSE is still the fastest on average even though it doesn't use the hand-written code.
Actually, I don't really understand this, because I see a significant difference between gcc with SSE3 and gcc without: the SSE3 build is a factor of 2 faster. So maybe I need to switch on some compiler flag?
I think we should remove most of the comments on this issue because I cannot reproduce these results at all anymore. As you say, there's almost a factor of 2 difference between gcc/sse and gcc/nosse. Having said that, it would still be interesting to see how fast the code with SSE would become if compiled with icc.
Does anyone still see an issue here? I don't...
Yes, I have undefined SSE2 and SSE3 at the beginning of the corresponding files, so it's not a bug anymore. But we could optimise the code there by using SSE2 and SSE3. So, should we close this issue?
Was this with GCC?
ah, sorry... yes, this was with the
I added a missing ALIGN to the declaration of a temporary variable in one of the smearing routines a few days ago. That fixed a segfault for me when compiling with SSE enabled using gcc. Perhaps this was a related issue?
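As a rough illustration of why a missing alignment attribute can segfault: aligned SSE loads and stores such as `movapd` fault when their memory operand is not 16-byte aligned, so temporaries touched by such code need explicit alignment. The sketch below uses made-up names (`ALIGN16`, `su3_like`, `temp_is_aligned`) and assumes a GNU-compatible compiler; it is not the project's actual ALIGN definition.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for an alignment macro (GCC/clang attribute syntax). */
#define ALIGN16 __attribute__((aligned(16)))

/* Hypothetical 3x3 complex matrix type, for illustration only. */
typedef struct { double re, im; } cplx;
typedef struct { cplx c[3][3]; } su3_like;

/* Returns 1 if p sits on a 16-byte boundary, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* Declares an aligned temporary, as a smearing routine might, and
 * reports whether its address is safe for aligned SSE loads/stores. */
int temp_is_aligned(void)
{
    su3_like tmp ALIGN16;          /* without ALIGN16 only 8-byte alignment is guaranteed */
    memset(&tmp, 0, sizeof tmp);   /* touch the variable so it really exists */
    return is_aligned16(&tmp);
}
```

Here `temp_is_aligned()` returns 1. Without the attribute the compiler is only obliged to align the struct to its natural 8-byte boundary, so whether an aligned SSE access works becomes a matter of stack layout.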
Compiling with --enable-sse3|2 gives errors:
../../tmLQCD/smearing/stout_stout_smear.c:39: error: can't find a register in class ‘GENERAL_REGS’ while reloading ‘asm’
../../tmLQCD/smearing/stout_stout_smear.c:30: error: ‘asm’ operand has impossible constraints
../tmLQCD/clover_leaf.c:719: error: can't find a register in class ‘GENERAL_REGS’ while reloading ‘asm’
../tmLQCD/clover_leaf.c:611: error: ‘asm’ operand has impossible constraints
These are due to problems in the inline assembly implementation of the su3 etc. macros.
We need to either rework the routines or undef the SSE macros in those files.
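The second option, undefining the SSE macros per file, could look roughly like the sketch below. The `#define` lines at the top only simulate a build with `-DSSE2 -DSSE3`; the function names `uses_sse_path` and `dot3` are invented for illustration and do not exist in tmLQCD. In a real source file the `#undef` lines would need to come before any header that selects the assembly macros.

```c
/* Simulate a build with -DSSE2 -DSSE3 (normally set by the compiler flags). */
#ifndef SSE2
#define SSE2
#endif
#ifndef SSE3
#define SSE3
#endif

/* Per-file opt-out: placed at the very top of the affected file,
 * before including headers that switch on SSE2/SSE3. */
#ifdef SSE2
#undef SSE2
#endif
#ifdef SSE3
#undef SSE3
#endif

/* Reports which code path this translation unit was compiled with. */
int uses_sse_path(void)
{
#if defined(SSE2) || defined(SSE3)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical routine: the plain C branch is what the compiler sees
 * once the macros are undefined. */
double dot3(const double *a, const double *b)
{
#if defined(SSE2) || defined(SSE3)
#error "the hand-written SSE path would be selected here"
#else
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
#endif
}
```

With the macros undefined, `uses_sse_path()` returns 0 and the file compiles through the portable branch. This avoids the register-allocation errors but gives up any speed-up from the hand-written code in those files.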