Gpurational merge #282
as a merge does not seem a good option here. The functionality added by Falk will be repaired later on
added gpu support in gauge_monomial; changed dev_gauge_derivative to allow for a multiplicative constant in momentum calculation
fought unnecessary warnings
Cleaned up the ND solver a bit
fixed non-compile bug in observables.h with TEMPORALGAUGE; removed lengthy GPU related stuff for TEMPORALGAUGE to function in invert_eo
dev_Qtm_pm -> fused kernels to get more out on kepler, old version still available; dev_Qtm_pm_nd -> fused kernels, old version still available; double versions of both deg. and nd eo matrices seem to be working; still issue with discrepancy of cpu/gpu(double) residues after solve (-> reconstruction issue?); added first version of mm-solver (in mixed precision)
segfaults due to double free
degenerate part is working but ONLY WITHOUT TEMPORALGAUGE -> Why??
added if use_gpuflag in invert.c
Carsten's bugfix for not correctly set default of nd_precision flags and values
mem leak in eo gpu part -> fixes issue 262; some cosmetics on misleading error message
added relativistic basis support for clover; added texture support for clover
default is 4 gpus/node
…d matrix gpu kernels; added gpu mms support in ndratcor monomial
all improvements (TEMPORAL_GAUGE etc.) working; some more minor fixes
We need a 32 bit matrix as we get nans!!
gauge_field still 64 bits; kappa's still 64 bits -> Matrix is slower than pure double version
Mixed solver faster than 64 bit cg
fixed #ifdef MPI in 32 bit related stuff - mixed solver working again
added solve_degenerate in light clover monomials; added mixed cg iteration and epsilon parameters; brought solver types into new .h file
cg+non-eo works now, also mixedcg+non-eo (both only without clover).
and after inverting Q_minus_psi
Thanks for the updates, Florian, and apologies for not getting back to you earlier. As I'm sure you can imagine, I'm quite busy right now... I wanted to ask you a question regarding the mixed solver implementation. Am I correct in remembering that you have been using this on Fermi and Cray XC30 for finite T simulations? If so, can you say anything about the time to solution of the mixed solver on BG/Q and HLRN3 compared to the double-precision solver? I have a strong feeling (which might be totally misplaced) that by using your mixed solver we could potentially halve our trajectory times. Would you agree?
Yes, I can imagine
Yes, on both.
It depends. On BG/Q the main advantage is that one can use 16^4 local lattices without too severe a penalty (see the picture on the "Single Precision Matrix Performance" Wiki page). Here one only profits from better cache usage. Note that in the current version the overlapping has been removed again. On Intel one also profits from the native 32-bit instructions, and the matrix performance can really be up to a factor of 2 better, i.e. the theoretical factor. In practice I have seen a speedup from ~1200 sec to ~800 sec for one trajectory. But the speedup, and how well the mixed solver performs, depends on the situation. In general I think that the mixed solver works even better for worse-conditioned matrices. But probably some more work is needed to halve the time. So far no specific optimizations for 32 bit have been done. The speedup is merely the result of the algorithm and better memory/cache properties. So there should be much room for improvement.
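For readers who have not seen the idea before, here is a minimal sketch of a mixed-precision CG in the generic "iterative refinement" form: the residual and the solution are kept in double, while the correction is computed by an inner CG that runs entirely in float. This is not the solver from this branch. The dense 4x4 toy matrix, the names `mixed_cg`, `cg32`, `matvec32`, `matvec64`, and the inner tolerance and iteration count are all assumptions made for illustration.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

enum { N = 4 }; /* toy problem size, assumed for illustration */

/* y = A x in single precision; A is row-major N x N */
static void matvec32(const float *A, const float *x, float *y) {
  for (size_t i = 0; i < N; i++) {
    float s = 0.0f;
    for (size_t j = 0; j < N; j++) s += A[i * N + j] * x[j];
    y[i] = s;
  }
}

/* y = A x in double precision */
static void matvec64(const double *A, const double *x, double *y) {
  for (size_t i = 0; i < N; i++) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++) s += A[i * N + j] * x[j];
    y[i] = s;
  }
}

/* Inner CG in float: approximately solve A e = r0, starting from e = 0 */
static void cg32(const float *A, const float *r0, float *e,
                 int maxit, float rel_tol) {
  float r[N], p[N], Ap[N];
  float rr = 0.0f;
  for (size_t i = 0; i < N; i++) {
    e[i] = 0.0f;
    r[i] = p[i] = r0[i];
    rr += r[i] * r[i];
  }
  const float target = rel_tol * rel_tol * rr;
  for (int it = 0; it < maxit && rr > target; it++) {
    matvec32(A, p, Ap);
    float pAp = 0.0f;
    for (size_t i = 0; i < N; i++) pAp += p[i] * Ap[i];
    if (pAp <= 0.0f) break; /* breakdown guard: A must be positive definite */
    const float alpha = rr / pAp;
    float rr_new = 0.0f;
    for (size_t i = 0; i < N; i++) {
      e[i] += alpha * p[i];
      r[i] -= alpha * Ap[i];
      rr_new += r[i] * r[i];
    }
    const float beta = rr_new / rr;
    for (size_t i = 0; i < N; i++) p[i] = r[i] + beta * p[i];
    rr = rr_new;
  }
}

/* Outer loop in double. Returns the number of outer iterations used,
   or -1 if the relative residual tol was not reached. */
int mixed_cg(const double *A, const double *b, double *x,
             double tol, int max_outer) {
  assert(tol > 0.0 && max_outer > 0);
  float A32[N * N];
  for (size_t i = 0; i < N * N; i++) A32[i] = (float)A[i];
  double bb = 0.0;
  for (size_t i = 0; i < N; i++) { x[i] = 0.0; bb += b[i] * b[i]; }
  for (int k = 0; k < max_outer; k++) {
    double Ax[N], rr = 0.0;
    float r32[N], e32[N];
    matvec64(A, x, Ax);
    for (size_t i = 0; i < N; i++) {
      const double ri = b[i] - Ax[i]; /* true residual, in double */
      rr += ri * ri;
      r32[i] = (float)ri;
    }
    if (rr <= tol * tol * bb) return k;
    cg32(A32, r32, e32, 50, 1e-3f);   /* cheap correction, in float */
    for (size_t i = 0; i < N; i++) x[i] += (double)e32[i];
  }
  return -1;
}
```

The point of the design is that the expensive matrix applications happen in float, while the double-precision residual recomputation in the outer loop keeps the final accuracy at double level. How many inner iterations to spend per outer step is a tuning question that this sketch does not try to answer.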
I seem to be having some issues compiling this. First of all, I have some problems getting ALIGN32 and ALIGN_BASE32 to be defined. In particular, when some alignment is chosen explicitly, I think these should be set too (currently they're only set in auto mode and then only on BG/Q). Secondly, when SSE2/3 is enabled, it seems that the compiler cannot make sense of the instructions in the full- and half-spinor operators, failing with errors like this one:
I was able to compile the half-spinor version without SSE, but unfortunately, as you remark, the clover parts are not quite ready yet; otherwise I would want to arrange for a test to be run by someone.
…o-geometry in invert_eo.c and invert_doublet_eo.c when even_odd_flag is set
…e correct values yet
@kostrzewa the problem with the macros is due to the fact that in case of SSE there are special SSE versions of e.g. Most of the time all the macros work for single and double precision data types, as long as there is no special version available, like in the case of SSE. This will require more work for a real 32-bit SSE implementation. Such code is available, of course. Or we finally rely on the compiler capabilities!? A short-term fix can be to un-define SSEX in the 32-bit Dirac matrices. I have fixed the alignment thing, see my pull request to @florian-burger; however, they are now just defined, though not necessarily to the correct values. I have also fixed a missing definition of @florian-burger what is missing for the clover term?
indeed, with
in
If wanted, I could include that in the pull request to @florian-burger
Is there any way to also get the interleaved version (not in this pull request...) into my InterleavedNDTwistedClover branch, after maybe merging this here?
As for the clover term, the non-SSE implementation would be straightforward. We only need the clover field in 32 bit and a functionality for initializing it given the double field. Next we would also need the two functions clover_gamma5 and clover_inv. As for merging, InterleavedNDTwistedClover should be possible. You can check the last commit in my interleaved_mixed_cg branch (which was merged here at some point), if something is missing. I have undone the interleaving xchanges in commit 2d713cc...
I'm closing this as we are not going to merge it any more.
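To make the "initialize the 32-bit field from the double field" step concrete, a minimal sketch could look like the following. The types `cplx64` and `cplx32` and the function `field_to_32` are hypothetical stand-ins invented for this example; the real clover field layout in the code base is not shown in this thread and will differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a double and a single precision complex entry */
typedef struct { double re, im; } cplx64;
typedef struct { float re, im; } cplx32;

/* Fill a 32-bit copy of a field from its 64-bit original, entry by entry.
   dst and src must each hold n entries and must not overlap. */
void field_to_32(cplx32 *dst, const cplx64 *src, size_t n) {
  assert(dst != NULL && src != NULL);
  for (size_t i = 0; i < n; i++) {
    dst[i].re = (float)src[i].re; /* rounds to nearest float */
    dst[i].im = (float)src[i].im;
  }
}
```

Such a copy would have to be redone whenever the double field changes, so that the single precision operators never see stale data.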
Removed #ifdef HAVE_GPU from monomials and using now only usegpu_flag.
usegpu_flag is set in read_input.l if HAVE_GPU is defined, an InitGPU block is present in the input file, and some checks for non-implemented features are passed (e.g. only PARALLELT in case of MPI).
Code compiles without GPU-related warnings.
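The conditions described above for setting usegpu_flag can be sketched as a single decision function. This is only an illustration of the logic, not the code in read_input.l: the struct `gpu_config`, its fields, and the function `decide_usegpu_flag` are hypothetical, and only the one feature check named above (PARALLELT when running with MPI) is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the build and input-file state */
typedef struct {
  bool have_gpu;      /* code was compiled with HAVE_GPU */
  bool initgpu_block; /* input file contains an InitGPU block */
  bool use_mpi;       /* MPI build */
  bool parallel_t;    /* PARALLELT parallelisation */
} gpu_config;

/* Returns 1 if the GPU path should be used, 0 otherwise */
int decide_usegpu_flag(const gpu_config *c) {
  assert(c != NULL);
  if (!c->have_gpu) return 0;                 /* no GPU support compiled in */
  if (!c->initgpu_block) return 0;            /* user did not ask for the GPU */
  if (c->use_mpi && !c->parallel_t) return 0; /* unsupported MPI layout */
  return 1;
}
```

With a runtime flag of this kind, the monomials can branch on one integer instead of being wrapped in preprocessor conditionals.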