
Gpurational merge #282

Closed
wants to merge 124 commits

Conversation

florian-burger
Contributor

Removed #ifdef HAVE_GPU from the monomials; only usegpu_flag is used now.
usegpu_flag is set in read_input.l if HAVE_GPU is defined, an InitGPU block is present in the input file, and some checks for non-implemented features pass (e.g. only PARALLELT in the case of MPI).
The code compiles without GPU-related warnings.
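For illustration, a minimal sketch of the gating described above (the real logic lives in read_input.l and differs in detail; variable and parameter names here are assumptions):

    #include <stdio.h>

    /* Hypothetical sketch of the usegpu_flag gating; illustrative names only. */
    int usegpu_flag = 0;

    void set_usegpu_flag(int initgpu_block_present, int only_parallelt)
    {
      (void) initgpu_block_present;
      (void) only_parallelt;
    #ifdef HAVE_GPU
      if (initgpu_block_present) {
        usegpu_flag = 1;
    #ifdef MPI
        /* with MPI only the PARALLELT parallelisation is supported on the GPU */
        if (!only_parallelt) {
          fprintf(stderr, "InitGPU: only PARALLELT is supported with MPI, GPU code disabled\n");
          usegpu_flag = 0;
        }
    #endif
      }
    #endif
    }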

Florian Burger added 30 commits February 21, 2013 11:33
as a merge does not seem a good option here.
The functionality added by Falk will be repaired later on
added gpu support in gauge_monomial
changed dev_gauge_derivative to allow for a multiplicative constant in
momentum calculation
fought unnecessary warnings
fixed non-compile bug in observables.h with TEMPORALGAUGE
removed lengthy GPU related stuff for TEMPORALGAUGE to function in invert_eo
dev_Qtm_pm -> fused kernels to get more out on kepler, old version still available
dev_Qtm_pm_nd -> fused kernels, old version still available
double versions of both the deg. and nd eo matrices seem to be working
still issue with discrepancy of cpu/gpu(double) residues after solve (->
reconstruction issue?)
added first version of mm-solver (in mixed precision)
degenerate part is working but ONLY WITHOUT TEMPORALGAUGE -> Why??
added if use_gpuflag in invert.c
Carsten's bugfix for not correctly set default of nd_precision flags and
values
mem leak in eo gpu part -> fixes issue 262
some cosmetics on misleading error message
added relativistic basis support for clover
added texture support for clover
…d matrix gpu kernels; added gpu mms support in ndratcor monomial
all improvements (TEMPORAL_GAUGE etc.) working
some more minor fixes
gauge_field still 64 bits
kappas still 64 bits -> Matrix is slower than pure double version
Mixed solver faster than 64 bit cg
Florian Burger added 5 commits January 2, 2015 17:07
fixed #ifdef MPI in 32 bit related stuff - mixed solver working again
added solve_degenerate in light clover monomials
added mixed cg iteration and epsilon parameters
brought solver types in new .h file
@florian-burger
Contributor Author

cg+non-eo works now, as does mixedcg+non-eo (both only without clover).
The functionality with bicgstab is the same as before, but it is now wrapped through solve_degenerate.
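A rough sketch of what "wrapped through solve_degenerate" means here, i.e. a single entry point that picks the requested solver; the enum, the function names and the simplified signatures are illustrative assumptions, not the actual tmLQCD interface:

    /* Illustrative solver dispatch; names and signatures are assumptions. */
    typedef enum { SOLVER_CG, SOLVER_MIXEDCG, SOLVER_BICGSTAB } solver_type_t;

    extern int cg_solve(double *x, const double *b, int max_iter, double eps);
    extern int mixed_cg_solve(double *x, const double *b, int max_iter, double eps);
    extern int bicgstab_solve(double *x, const double *b, int max_iter, double eps);

    int solve_degenerate_sketch(double *x, const double *b, solver_type_t type,
                                int max_iter, double eps)
    {
      switch (type) {
        case SOLVER_CG:       return cg_solve(x, b, max_iter, eps);
        case SOLVER_MIXEDCG:  return mixed_cg_solve(x, b, max_iter, eps);
        case SOLVER_BICGSTAB: return bicgstab_solve(x, b, max_iter, eps);
        default:              return -1;  /* unknown solver type */
      }
    }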

@kostrzewa
Member

Thanks for the updates Florian and apologies for not getting back to you earlier. As I'm sure you can imagine I'm quite busy right now...

I wanted to ask you a question regarding the mixed solver implementation. Am I correct in remembering that you have been using this on Fermi and Cray XC30 for finite T simulations? If so, can you say anything about the time to solution of the mixed solver on BG/Q and HLRN3 compared to the double-precision solver? I have a strong feeling (which might be totally misplaced) that by using your mixed solver we could potentially halve our trajectory times, would you agree?

@florian-burger
Contributor Author

> Thanks for the updates Florian and apologies for not getting back to you earlier. As I'm sure you can imagine I'm quite busy right now...

Yes, I can imagine

> I wanted to ask you a question regarding the mixed solver implementation. Am I correct in remembering that you have been using this on Fermi and Cray XC30 for finite T simulations?

Yes, on both.

> Can you say anything about the time to solution of the mixed solver on BG/Q and HLRN3 compared to the double-precision solver? I have a strong feeling (which might be totally misplaced) that by using your mixed solver we could potentially halve our trajectory times, would you agree?

It depends. On BG/Q the main advantage is that one can use 16^4 local lattices without too hard a penalty (see the plot on the "Single Precision Matrix Performance" wiki page). Here one only profits from better cache usage. Note that in the current version the overlapping has been removed again.

On Intel one also profits from the native 32-bit instructions, and the matrix performance gain can really be up to a factor of 2, i.e. the theoretical factor. In practice I have seen a speedup from ~1200 s to ~800 s for one trajectory. But the speedup and how well the mixed solver performs depend on the situation; in general I think the benefit is even larger for worse-conditioned matrices.

But to halve the trajectory time, some more work is probably needed. So far no 32-bit-specific optimizations have been done; the speedup is merely the result of the algorithm and the better memory/cache properties, so there should still be much room for improvement.
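To make the structure of such a mixed solver concrete, here is a minimal defect-correction sketch, assuming hypothetical helpers apply_Q_double() (y = Q x in double) and cg_single() (approximate 32-bit solve); the actual tmLQCD mixed CG differs in structure and naming:

    #include <stdlib.h>

    /* Hypothetical operator application and single-precision inner solver. */
    extern void apply_Q_double(double *out, const double *in, int N);
    extern void cg_single(float *x, const float *b, int N);

    /* Outer loop in double precision: compute the true residual, solve for a
     * correction in 32 bit, add the correction, repeat until converged. */
    int mixed_cg_sketch(double *x, const double *b, int N, double eps_sq, int max_outer)
    {
      double *r  = malloc(N * sizeof(double));
      float  *rf = malloc(N * sizeof(float));
      float  *df = malloc(N * sizeof(float));
      int k, i, ret = -1;                        /* -1: not converged        */

      for (k = 0; k < max_outer; ++k) {
        apply_Q_double(r, x, N);                 /* r = Q x                  */
        double rr = 0.0;
        for (i = 0; i < N; ++i) {
          r[i] = b[i] - r[i];                    /* true residual in double  */
          rr  += r[i] * r[i];
        }
        if (rr < eps_sq) { ret = k; break; }     /* converged                */
        for (i = 0; i < N; ++i) rf[i] = (float) r[i];
        cg_single(df, rf, N);                    /* cheap 32-bit inner solve */
        for (i = 0; i < N; ++i) x[i] += (double) df[i];
      }
      free(r); free(rf); free(df);
      return ret;
    }

The 32-bit inner solve is where the memory-bandwidth and cache savings come from, while the double-precision outer residual keeps the final accuracy.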

@kostrzewa
Member

I seem to be having some issues compiling this. First of all, I have some problems getting ALIGN32 and ALIGN_BASE32 to be defined. In particular, when some alignment is chosen explicitly, I think these should be set too (currently they are only set in auto mode, and then only on BG/Q).
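A minimal sketch of the kind of fallback meant here, deriving the 32-bit macros whenever the 64-bit ones are defined; the macro names are taken from the comment above, and the values the real build system should use may well differ:

    /* Fallback: derive 32-bit alignment macros from the 64-bit ones if unset. */
    #if defined(ALIGN_BASE) && !defined(ALIGN_BASE32)
    #  define ALIGN_BASE32 ALIGN_BASE
    #endif
    #if defined(ALIGN) && !defined(ALIGN32)
    #  define ALIGN32 ALIGN
    #endif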

Secondly, when SSE2/3 is enabled, it seems that the compiler cannot make sense of the instructions in the full- and half-spinor operators, failing with errors like this one:

../../operator/D_psi_32.c:259:3: note: in expansion of macro ‘_su3_inverse_multiply’
   _su3_inverse_multiply(chi, (*u), psi);
   ^
/home/bartek/code/tmLQCD.kost/build_openmp/../sse.h:955:33: error: memory input 2 is not directly addressable
                       "m" (cimag((u).c22)), \

I was able to compile the half-spinor version without SSE, but unfortunately, as you remark, the clover parts are not quite ready yet; otherwise I would want to arrange for a test to be run by someone.

@urbach
Contributor

urbach commented Jun 9, 2015

@kostrzewa the problem with the macros is due to the fact that in the case of SSE there are special SSE versions of e.g. _su3_inverse_multiply, written for double-precision variables.

Most of the time all the macros work for single- and double-precision data types, as long as there is no special version available, as there is for SSE. A real 32-bit SSE implementation will require more work; such code is available, of course, or we finally rely on the compiler's capabilities!? A short-term fix would be to un-define SSEX in the 32-bit Dirac matrices.
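For illustration, this is why the generic macros are precision-agnostic while the SSE ones are not: written in plain C99 complex arithmetic, the same expression works for the double and the _32 (float) su3/spinor types, whereas the SSE variant is inline assembly hard-wired to double operands. The field names follow the usual c00..c22 / c0..c2 convention; treat this as a sketch, not the exact macro from su3.h:

    #include <complex.h>

    /* Sketch of a generic (non-SSE) chi = u^dagger * psi macro in the spirit of
     * _su3_inverse_multiply; field names are assumptions. */
    #define _generic_su3_inverse_multiply(chi, u, psi) \
      (chi).c0 = conj((u).c00) * (psi).c0 + conj((u).c10) * (psi).c1 + conj((u).c20) * (psi).c2; \
      (chi).c1 = conj((u).c01) * (psi).c0 + conj((u).c11) * (psi).c1 + conj((u).c21) * (psi).c2; \
      (chi).c2 = conj((u).c02) * (psi).c0 + conj((u).c12) * (psi).c1 + conj((u).c22) * (psi).c2;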

I have fixed the alignment thing, see my pull request to @florian-burger; however, the macros are now just defined, not necessarily to the correct values.

I have also fixed a missing definition of update_backward_gauge_32 for the case of --disable-halfspinor.

@florian-burger what is missing for the clover term?

@urbach
Contributor

urbach commented Jun 9, 2015

indeed, with

 #ifdef SSE
 # undef SSE
 #endif
 #ifdef SSE2
 # undef SSE2
 #endif
 #ifdef SSE3
 # undef SSE3
 #endif

in D_psi_32.c, Hopping_Matrix32.c and Hopping_Matrix32_nocom.c, just before #include "global.h", the SSEX versions compile as well.

@urbach
Contributor

urbach commented Jun 9, 2015

If wanted, I could include that in the pull request to @florian-burger.

@urbach
Contributor

urbach commented Jun 9, 2015

Is there any way to also get the interleaved version (not in this pull request...) into my InterleavedNDTwistedClover branch, after maybe merging this here?

@florian-burger
Contributor Author

As for the clover term, the non-SSE implementation would be straightforward. We only need the clover field in 32 bit and a function for initializing it from the double field. In addition we would need the two functions clover_gamma5 and clover_inv.
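A minimal sketch of the "initialize the 32-bit clover field from the double field" step; the flat-array view and the name init_sw_32 are assumptions, the real sw field is an array of su3 blocks rather than raw doubles:

    #include <stddef.h>

    /* Element-wise downcast of the double-precision clover field to 32 bit. */
    void init_sw_32(float *sw_32, const double *sw, size_t n_elements)
    {
      size_t i;
      for (i = 0; i < n_elements; ++i)
        sw_32[i] = (float) sw[i];
    }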

As for merging into InterleavedNDTwistedClover, that should be possible. You can check the last commit in my interleaved_mixed_cg branch (which was merged here at some point) to see if something is missing. I have undone the interleaving xchanges in commit 2d713cc...

@urbach
Contributor

urbach commented Nov 17, 2016

I'm closing this as we are not going to merge it any more.

urbach closed this Nov 17, 2016