
Gpurational merge #282

Closed
wants to merge 124 commits

Conversation

florian-burger
Contributor

Removed #ifdef HAVE_GPU from the monomials; only usegpu_flag is used now.
usegpu_flag is set in read_input.l if HAVE_GPU is defined, an InitGPU block is present in the input file, and some checks for non-implemented features pass (e.g. only PARALLELT in the case of MPI).
The code compiles without GPU-related warnings.
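For illustration, a minimal sketch of the gating described above (the real logic lives in read_input.l and differs in detail; variable and parameter names here are assumptions):

    #include <stdio.h>

    /* Hypothetical sketch of the usegpu_flag gating; illustrative names only. */
    int usegpu_flag = 0;

    void set_usegpu_flag(int initgpu_block_present, int only_parallelt)
    {
      (void) initgpu_block_present;
      (void) only_parallelt;
    #ifdef HAVE_GPU
      if (initgpu_block_present) {
        usegpu_flag = 1;
    #ifdef MPI
        /* with MPI only the PARALLELT parallelisation is supported on the GPU */
        if (!only_parallelt) {
          fprintf(stderr, "InitGPU: only PARALLELT is supported with MPI, GPU code disabled\n");
          usegpu_flag = 0;
        }
    #endif
      }
    #endif
    }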

Florian Burger added 30 commits February 21, 2013 11:33
as a merge does not seem a good option here.
The functionality added by Falk will be repaired later on
added gpu support in gauge_monomial
changed dev_gauge_derivative to allow for a multiplicative constant in
momentum calculation
fought unnecessary warnings
fixed non-compile bug in observables.h with TEMPORALGAUGE
removed lengthy GPU related stuff for TEMPORALGAUGE to function in invert_eo
dev_Qtm_pm -> fused kernels to get more out on kepler, old version still available
dev_Qtm_pm_nd -> fused kernels, old version still available
double versions of both the deg. and nd eo matrices seem to be working
still issue with discrepancy of cpu/gpu(double) residues after solve (->
reconstruction issue?)
added first version of mm-solver (in mixed precision)
degenerate part is working but ONLY WITHOUT TEMPORALGAUGE -> Why??
added if use_gpuflag in invert.c
Carsten's bugfix for not correctly set default of nd_precision flags and
values
mem leak in eo gpu part -> fixes issue 262
some cosmetics on misleading error message
added relativistic basis support for clover
added texture support for clover
…d matrix gpu kernels; added gpu mms support in ndratcor monomial
all improvements (TEMPORAL_GAUGE etc.) working
some more minor fixes
gauge_field still 64 bits
kappas still 64 bits -> Matrix is slower than pure double version
Mixed solver faster than 64 bit cg
Florian Burger added 5 commits January 2, 2015 17:07
fixed #ifdef MPI in 32 bit related stuff - mixed solver working again
added solve_degenerate in light clover monomials
added mixed cg iteration and epsilon parameters
brought solver types in new .h file
@florian-burger
Contributor Author

cg+non-eo works now, as does mixedcg+non-eo (both only without clover).
The functionality with bicgstab is the same as before, but it is now wrapped through solve_degenerate.
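A rough sketch of what "wrapped through solve_degenerate" means here, i.e. a single entry point that picks the requested solver; the enum, the function names and the simplified signatures are illustrative assumptions, not the actual tmLQCD interface:

    /* Illustrative solver dispatch; names and signatures are assumptions. */
    typedef enum { SOLVER_CG, SOLVER_MIXEDCG, SOLVER_BICGSTAB } solver_type_t;

    extern int cg_solve(double *x, const double *b, int max_iter, double eps);
    extern int mixed_cg_solve(double *x, const double *b, int max_iter, double eps);
    extern int bicgstab_solve(double *x, const double *b, int max_iter, double eps);

    int solve_degenerate_sketch(double *x, const double *b, solver_type_t type,
                                int max_iter, double eps)
    {
      switch (type) {
        case SOLVER_CG:       return cg_solve(x, b, max_iter, eps);
        case SOLVER_MIXEDCG:  return mixed_cg_solve(x, b, max_iter, eps);
        case SOLVER_BICGSTAB: return bicgstab_solve(x, b, max_iter, eps);
        default:              return -1;  /* unknown solver type */
      }
    }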

@kostrzewa
Member

Thanks for the updates Florian and apologies for not getting back to you earlier. As I'm sure you can imagine I'm quite busy right now...

I wanted to ask you a question regarding the mixed solver implementation. Am I correct in remembering that you have been using this on Fermi and Cray XC30 for finite T simulations? If so, can you say anything about the time to solution of the mixed solver on BG/Q and HLRN3 compared to the double-precision solver? I have a strong feeling (which might be totally misplaced) that by using your mixed solver we could potentially halve our trajectory times, would you agree?

@florian-burger
Contributor Author

> Thanks for the updates Florian and apologies for not getting back to you earlier. As I'm sure you can imagine I'm quite busy right now...

Yes, I can imagine

> I wanted to ask you a question regarding the mixed solver implementation. Am I correct in remembering that you have been using this on Fermi and Cray XC30 for finite T simulations?

Yes, on both.

> Can you say anything about the time to solution of the mixed solver on BG/Q and HLRN3 compared to the double-precision solver? I have a strong feeling (which might be totally misplaced) that by using your mixed solver we could potentially halve our trajectory times, would you agree?

It depends. On BG/Q the main advantage is that one can use 16^4 local lattices without too hard a penalty (see the plot on the "Single Precision Matrix Performance" wiki page). Here one only profits from better cache usage. Note that in the current version the overlapping has been removed again.

On Intel one also profits from the native 32-bit instructions, and the matrix performance gain can really be up to a factor of 2, i.e. the theoretical factor. In practice I have seen a speedup from ~1200 s to ~800 s for one trajectory. But the speedup and how well the mixed solver performs depend on the situation; in general I think the benefit is even larger for worse-conditioned matrices.

But to halve the trajectory time, some more work is probably needed. So far no 32-bit-specific optimizations have been done; the speedup is merely the result of the algorithm and the better memory/cache properties, so there should still be much room for improvement.
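To make the structure of such a mixed solver concrete, here is a minimal defect-correction sketch, assuming hypothetical helpers apply_Q_double() (y = Q x in double) and cg_single() (approximate 32-bit solve); the actual tmLQCD mixed CG differs in structure and naming:

    #include <stdlib.h>

    /* Hypothetical operator application and single-precision inner solver. */
    extern void apply_Q_double(double *out, const double *in, int N);
    extern void cg_single(float *x, const float *b, int N);

    /* Outer loop in double precision: compute the true residual, solve for a
     * correction in 32 bit, add the correction, repeat until converged. */
    int mixed_cg_sketch(double *x, const double *b, int N, double eps_sq, int max_outer)
    {
      double *r  = malloc(N * sizeof(double));
      float  *rf = malloc(N * sizeof(float));
      float  *df = malloc(N * sizeof(float));
      int k, i, ret = -1;                        /* -1: not converged        */

      for (k = 0; k < max_outer; ++k) {
        apply_Q_double(r, x, N);                 /* r = Q x                  */
        double rr = 0.0;
        for (i = 0; i < N; ++i) {
          r[i] = b[i] - r[i];                    /* true residual in double  */
          rr  += r[i] * r[i];
        }
        if (rr < eps_sq) { ret = k; break; }     /* converged                */
        for (i = 0; i < N; ++i) rf[i] = (float) r[i];
        cg_single(df, rf, N);                    /* cheap 32-bit inner solve */
        for (i = 0; i < N; ++i) x[i] += (double) df[i];
      }
      free(r); free(rf); free(df);
      return ret;
    }

The 32-bit inner solve is where the memory-bandwidth and cache savings come from, while the double-precision outer residual keeps the final accuracy.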

@kostrzewa
Member

I seem to be having some issues compiling this. First of all, I have some problems getting ALIGN32 and ALIGN_BASE32 to be defined. In particular, when some alignment is chosen explicitly, I think these should be set too (currently they are only set in auto mode, and then only on BG/Q).
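A minimal sketch of the kind of fallback meant here, deriving the 32-bit macros whenever the 64-bit ones are defined; the macro names are taken from the comment above, and the values the real build system should use may well differ:

    /* Fallback: derive 32-bit alignment macros from the 64-bit ones if unset. */
    #if defined(ALIGN_BASE) && !defined(ALIGN_BASE32)
    #  define ALIGN_BASE32 ALIGN_BASE
    #endif
    #if defined(ALIGN) && !defined(ALIGN32)
    #  define ALIGN32 ALIGN
    #endif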

Secondly, when SSE2/3 is enabled, it seems that the compiler cannot make sense of the instructions in the full- and half-spinor operators, failing with errors like this one:

../../operator/D_psi_32.c:259:3: note: in expansion of macro ‘_su3_inverse_multiply’
   _su3_inverse_multiply(chi, (*u), psi);
   ^
/home/bartek/code/tmLQCD.kost/build_openmp/../sse.h:955:33: error: memory input 2 is not directly addressable
                       "m" (cimag((u).c22)), \

I was able to compile the half-spinor version without SSE, but unfortunately, as you remark, the clover parts are not quite ready yet; otherwise I would want to arrange for a test to be run by someone.

@urbach
Contributor

urbach commented Jun 9, 2015

@kostrzewa the problem with the macros is due to the fact that in the case of SSE there are special SSE versions of e.g. _su3_inverse_multiply, written for double-precision variables.

Most of the time all the macros work for single- and double-precision data types, as long as there is no special version available, as there is for SSE. A real 32-bit SSE implementation will require more work; such code is available, of course, or we finally rely on the compiler's capabilities!? A short-term fix would be to un-define SSEX in the 32-bit Dirac matrices.
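For illustration, this is why the generic macros are precision-agnostic while the SSE ones are not: written in plain C99 complex arithmetic, the same expression works for the double and the _32 (float) su3/spinor types, whereas the SSE variant is inline assembly hard-wired to double operands. The field names follow the usual c00..c22 / c0..c2 convention; treat this as a sketch, not the exact macro from su3.h:

    #include <complex.h>

    /* Sketch of a generic (non-SSE) chi = u^dagger * psi macro in the spirit of
     * _su3_inverse_multiply; field names are assumptions. */
    #define _generic_su3_inverse_multiply(chi, u, psi) \
      (chi).c0 = conj((u).c00) * (psi).c0 + conj((u).c10) * (psi).c1 + conj((u).c20) * (psi).c2; \
      (chi).c1 = conj((u).c01) * (psi).c0 + conj((u).c11) * (psi).c1 + conj((u).c21) * (psi).c2; \
      (chi).c2 = conj((u).c02) * (psi).c0 + conj((u).c12) * (psi).c1 + conj((u).c22) * (psi).c2;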

I have fixed the alignment thing, see my pull request to @florian-burger; however, the macros are now just defined, not necessarily to the correct values.

I have also fixed a missing definition of update_backward_gauge_32 for the case of --disable-halfspinor.

@florian-burger what is missing for the clover term?

@urbach
Contributor

urbach commented Jun 9, 2015

indeed, with

 #ifdef SSE
 # undef SSE
 #endif
 #ifdef SSE2
 # undef SSE2
 #endif
 #ifdef SSE3
 # undef SSE3
 #endif

in D_psi_32.c, Hopping_Matrix32.c and Hopping_Matrix32_nocom.c, just before #include "global.h", the SSEX versions compile as well.

@urbach
Contributor

urbach commented Jun 9, 2015

If wanted, I could include that in the pull request to @florian-burger.

@urbach
Contributor

urbach commented Jun 9, 2015

Is there any way to also get the interleaved version (not in this pull request...) into my InterleavedNDTwistedClover branch, after maybe merging this here?

@florian-burger
Contributor Author

As for the clover term, the non-SSE implementation would be straightforward. We only need the clover field in 32 bit and a function for initializing it from the double field. In addition we would need the two functions clover_gamma5 and clover_inv.
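A minimal sketch of the "initialize the 32-bit clover field from the double field" step; the flat-array view and the name init_sw_32 are assumptions, the real sw field is an array of su3 blocks rather than raw doubles:

    #include <stddef.h>

    /* Element-wise downcast of the double-precision clover field to 32 bit. */
    void init_sw_32(float *sw_32, const double *sw, size_t n_elements)
    {
      size_t i;
      for (i = 0; i < n_elements; ++i)
        sw_32[i] = (float) sw[i];
    }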

As for merging into InterleavedNDTwistedClover, that should be possible. You can check the last commit in my interleaved_mixed_cg branch (which was merged here at some point) to see if something is missing. I have undone the interleaving xchanges in commit 2d713cc...

@urbach
Contributor

urbach commented Nov 17, 2016

I'm closing this as we are not going to merge it any more.

urbach closed this Nov 17, 2016