This pull-request implements a simple version of interleaved communication and computation when using SPI communication with the halfspinor version of the hopping matrix. DO NOT MERGE, JUST FOR COMPARISON PURPOSES!
Two new arrays, g_body and g_surface, are defined. These contain the even-odd pre-conditioned indices of the surface and body of the local lattice. These index arrays are then used to address the input spinor, auxiliary spinor and gauge field in the computation of the surface and body contributions in the hopping matrix. None of the index orderings is changed.
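For illustration, the intended loop structure is roughly the following (a minimal sketch only; apart from g_surface and g_body, all names below, in particular the per-site kernel and the SPI helpers, are hypothetical placeholders):

```c
/* Minimal sketch of the interleaving idea, not the actual halfspinor_body
 * code.  Only g_surface and g_body are taken from this pull request; the
 * site kernel and the SPI helpers are hypothetical placeholders. */
extern int g_nb_surface, g_nb_body;          /* number of surface / body sites       */
extern int *g_surface, *g_body;              /* even/odd preconditioned site indices */

extern void hop_site(const int ix, const int ieo);    /* hypothetical per-site hop kernel   */
extern void spi_start_halfspinor_exchange(void);      /* hypothetical: post SPI descriptors */
extern void spi_finish_halfspinor_exchange(void);     /* hypothetical: wait for completion  */
extern void accumulate_boundary(const int ieo);       /* hypothetical boundary accumulation */

void hop_interleaved(const int ieo)
{
  /* 1. do the surface sites first, so their halfspinors are ready to send */
  for(int i = 0; i < g_nb_surface; i++) hop_site(g_surface[i], ieo);

  /* 2. start the SPI exchange and overlap it with the body computation */
  spi_start_halfspinor_exchange();
  for(int i = 0; i < g_nb_body; i++) hop_site(g_body[i], ieo);

  /* 3. wait for the exchange and accumulate the received halfspinors */
  spi_finish_halfspinor_exchange();
  accumulate_boundary(ieo);
}
```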
Roughly a 10% performance improvement is seen compared to the non-interleaved hopping matrix with plain SPI. The implementation has been tested using the HMC.
By removing all prefetching from the hopping matrix on BG/Q, a total increase in performance of 25% (compared to the old SPI version) has been observed.
perhaps I should leave these in, in case we figure out a way to make the prefetching better on BG/Q; they are undefined above anyway, so they will be of no consequence
yes, leave them in to remind us...!
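For the record, keeping the macros in the source while compiling them out could look like this (a sketch of the pattern only; the guard name is made up, and __dcbt is meant as the XL C cache-block-touch builtin):

```c
/* Sketch: keep the prefetch macros in the source, but define them to
 * nothing unless explicitly enabled.  Guard name made up for illustration. */
#ifdef BGQ_PREFETCH
#  define _prefetch_spinor(addr) __dcbt((void*)(addr))   /* XL C builtin */
#  define _prefetch_su3(addr)    __dcbt((void*)(addr))
#else
#  define _prefetch_spinor(addr)
#  define _prefetch_su3(addr)
#endif
```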
So in general it should be possible to rewrite the index arrays g_eo2lexic, g_lexic2eo etc. That should basically work out of the box; the only reasons why I didn't do this from the beginning are the following:
Otherwise it would be very easy to use the remapping I implemented in my NewHalfspinor version everywhere in the code...
That is excellent news, Carsten; if it works out, that will of course be faster and we should not merge this pull request.
The next aim for SPI is the gauge exchange, which currently eats 7 percent of the total time in the HMC; one could potentially shave off a bit by interleaving the gauge update with the communication.
Indeed, undefining the prefetching improves performance also in the NewHalfspinor branch...
Prefetching does help if there are very few prefetch instructions, confirming my hunch that it causes cache contention because of the limited number of available streams. For the linear algebra routines where you have used prefetching, I see a sub-percent improvement in HMC performance when comparing runs with and without prefetching.
I also tried reinstating prefetching of the gauge fields but this does not lead to improvement!
I tried playing around a bit more with the perfect prefetcher and I've at last understood how it works. I set up six sets of prefetching patterns per thread, so that the list prefetcher has a separate pattern for each thread, ieo part and loop (surface, body and final computation). This does not improve performance, but it also doesn't hurt it much... indicating that whatever prefetching is done by the compiler, it is ineffective at reducing cache misses (i.e. we rarely prefetch correctly). Luckily we fit into L2, so our overall efficiency is still in the double digits...
My hunch is that with OpenMP, since threads are given different chunks of the loop at each call, list prefetching simply does not work. Manual prefetching could work if you manually specify a good prefetch to be done without saturating the 16 cache lines per core (i.e. at most 4 per thread!).
I will try playing around a bit more with prefetch instructions, but I think Abdou's assessment holds for BG/Q too and hardware prefetching is the important thing. However, when using OpenMP, all bets for hardware prefetching are off because it is not known in which order memory will be accessed.
This might also be the reason why P.B. has chosen to use a more fine-grained threading. He can associate the same loop chunk with the same thread every time, maybe losing a bit of time for synchronisation, but gaining complete predictability. Looking at the Bagel source code, he also uses the perfect prefetcher.
I would like to look into using Bagel as our kernel. Although the documentation is disappointingly sparse, I think it is not that difficult to use BFM kernels from C.
I will also try manual work-sharing for the loops in the hopping matrix together with my perfect-prefetcher patterns. Since it is possible with OpenMP to have the same thread take the same chunk of the loop at each call (just not with #pragma omp for), this could increase performance...
#pragma omp for
I tried the manual chunk distribution and this does help, but not nearly as much as I would have expected: it gives an improvement of only about 3% over the nocomm result... In this situation I would expect the prefetching to be perfect, as each thread has a fixed access pattern that never changes. Maybe a call to the designers of the L1P is in order... there simply shouldn't be any cache misses...
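For reference, such a manual chunk distribution could look roughly like this (a sketch, not the code actually used; it assumes a flat loop of length N and an unchanged thread count, and the kernel pointer stands in for the actual loop body):

```c
#include <omp.h>

/* Sketch of call-invariant manual work sharing: with an unchanged thread
 * count, every thread always processes the same contiguous chunk of
 * [0, N), so its memory access pattern is identical in every call. */
void do_loop_manual_chunks(const int N, void (*site_kernel)(const int))
{
#pragma omp parallel
  {
    const int tid   = omp_get_thread_num();
    const int nthr  = omp_get_num_threads();
    const int chunk = (N + nthr - 1) / nthr;
    const int start = tid * chunk;
    const int end   = (start + chunk < N) ? (start + chunk) : N;

    for(int i = start; i < end; i++) site_kernel(i);
  } /* implicit barrier at the end of the parallel region */
}
```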
I think in combination with halfspinor one could indeed change the even/odd geometry without too much of a problem. As for all the other fields: well, actually, the gauge field is not in even/odd order, and therefore we don't run into problems in the communication.
uhh, now we get again a mess of #defines...
a lot of defines, as above
In the accumulation part (all the post macros) it would make sense to prefetch the phi field, as it is no longer fetched sequentially from memory.
In my NewHalfspinor branch, where this is also the case for the boundary accumulation, prefetching does help.
Well, one could define the inside of those loops as "_hop_pre_32_loop" etc.; then the code would consist essentially only of defines, which is acceptable in my opinion, especially since all the loop bodies are the same. In fact, one could go one step further and use the same loop body for both precisions with some replacement magic, as done by Albert in the buffers.
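To make that concrete, the replacement-magic variant could look something like this (all names made up for illustration; the real body would of course be the hop macros):

```c
/* Sketch of the "one loop body for both precisions" idea: the body is
 * written once as a macro parametrised on the floating-point type, in the
 * spirit of the replacement magic used in the buffers. */
#define _DECLARE_HOP_LOOP(NAME, FLT)                              \
  static void NAME(const int * const idx, const int n,           \
                   FLT * const out, const FLT * const in)        \
  {                                                               \
    for(int i = 0; i < n; i++) {                                  \
      const int ix = idx[i];                                      \
      /* ... the actual loop body, acting on type FLT ... */      \
      out[ix] += in[ix];                                          \
    }                                                             \
  }

_DECLARE_HOP_LOOP(hop_loop_64, double)  /* full precision   */
_DECLARE_HOP_LOOP(hop_loop_32, float)   /* sloppy precision */
```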
I will try the prefetching in the week of the 29th or maybe in Frankfurt.
Further, it is clear that this should be generalized to other interleaving-capable communication schemes, which will add even more ifdefs...
I have worked a bit to generalise my geometry approach, and I think it is working now... See my new branch InterleavedNDTwistedClover. Testing on a 512-node partition on FERMI with a 12^4 local volume gives me 767 Mflops per thread, 49129 Mflops per node. (I don't really understand why it is faster now than the NewHalfspinor one was previously.)
So far it seems to work in the HMC, but there are many things to test.
The idea is that I only changed the even/odd geometry used for the spinor fields. All the other geometry is left unchanged. This works only with halfspinor, of course, otherwise the exchange routines for the spinor fields would fail. It is not yet included in the most general way in the code, but if confirmed, we should maybe merge that one...!?
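To sketch what "changing only the even/odd geometry of the spinor fields" could mean in code (illustrative only, not the actual geometry routines; is_surface() and the array name are placeholders):

```c
/* Illustrative sketch: for the sites of one parity, surface sites get the
 * indices 0 .. n_surface-1 and body sites follow, so the hopping matrix can
 * run over two contiguous blocks.  Placeholder names only. */
extern int is_surface(const int ix);   /* placeholder: 1 if site ix touches the boundary */

void build_eo_remap(const int n_sites, int * const remap)
{
  int n_surface = 0;
  for(int ix = 0; ix < n_sites; ix++)
    if(is_surface(ix)) n_surface++;

  int i_surf = 0, i_body = n_surface;
  for(int ix = 0; ix < n_sites; ix++)
    remap[ix] = is_surface(ix) ? i_surf++ : i_body++;
}
```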
you mean LX-2 here...
I have also adapted the _IS_BODY define and the bodyV initialisation to work also with 0-, 1-, 2- and 3-dimensional parallelisation.
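Roughly along these lines (an illustrative reconstruction, not the committed code; the PARALLEL* guards follow the usual naming, but the macro bodies and the exact bounds, cf. the LX-2 remark above, are assumptions):

```c
/* Sketch of a body test that constrains a direction only when it is
 * parallelised; bounds and macro bodies are assumptions for illustration. */
#if defined PARALLELT || defined PARALLELXT || defined PARALLELXYT || defined PARALLELXYZT
#  define _BODY_T(t) ((t) > 0 && (t) < T - 1)
#else
#  define _BODY_T(t) (1)
#endif

#if defined PARALLELXT || defined PARALLELXYT || defined PARALLELXYZT
#  define _BODY_X(x) ((x) > 0 && (x) < LX - 1)
#else
#  define _BODY_X(x) (1)
#endif

#if defined PARALLELXYT || defined PARALLELXYZT
#  define _BODY_Y(y) ((y) > 0 && (y) < LY - 1)
#else
#  define _BODY_Y(y) (1)
#endif

#if defined PARALLELXYZT
#  define _BODY_Z(z) ((z) > 0 && (z) < LZ - 1)
#else
#  define _BODY_Z(z) (1)
#endif

#define _IS_BODY(t, x, y, z) (_BODY_T(t) && _BODY_X(x) && _BODY_Y(y) && _BODY_Z(z))
```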
hmm, so the scalar and the 4-dim inverter do work in InterleavedNDTwistedClover; now I have to see what's going wrong in the HMC...
testing now on my laptop with MPI (so no SPI), but I get acceptance with DET, DETRATIO and NDCLOVER monomials.
I need some more input on which input files and/or situations it did not work for...!
For me it was on BG/Q with a number of input files. I used the "hmc1" parameters with a 24^3x48 volume for instance.
Also there is a bug now in
/bgsys/drivers/ppcfloor/comm/xl.ndebug/bin/mpixlc_r -DHAVE_CONFIG_H -I/homea/hch02/hch028/include/ -I. -I/homea/hch02/hch028/code/tmLQCD.kost/build_bgq_hybrid_hs/ -I/homea/hch02/hch028/code/tmLQCD.kost/build_bgq_hybrid_hs/../ -I/homea/hch02/hch028/lime-1.3.2/install_bgq//include/ -I/include/ -I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/comm/xl.ndebug/include -qsmp=omp:noauto:schedule=static -O2 -qstrict=all -qtune=qp -qarch=qp -qmaxmem=-1 -c ../../solver/generate_dfl_subspace.c
"../../solver/generate_dfl_subspace.c", line 120.48: 1506-045 (S) Undeclared identifier repro.
make: *** [generate_dfl_subspace.o] Error 1
make: Leaving directory `/homea/hch02/hch028/code/tmLQCD.kost/build_bgq_hybrid_hs/solver'
make: *** [solver] Error 2
the bug should be fixed now. I had forgotten to push something there...
so, might be SPI causing the problem? Did you try with MPI instead?
it's not SPI causing the problem...
hmm, on a PC InterleavedNDTwistedClover runs without problems for sample-hmc1.input for 16 MPI processes and 1 or 2 threads. Any hints?
the small volume with 16 MPI processes works also on the BG/Q (64 threads), hmm, I don't understand...
okay, volume 4^3 x 8 gives wrong results on my PC and on the BG/Q, so somewhere the geometry is still not correct.
Strangely enough, the solver converges, though...
scalar version does work with 4^3 x 8 volume
maybe the solver is implicitly solving the wrong problem and some of the indices are mixed up in the end?
For the 4^4 volume and 16 MPI processes everything is exchanged (taking into account both exchange directions for each 2x2 plane), maybe that's a hint? I would try a 3^3x6 local volume to see whether it continues working.
In the definition of the geometry maybe there's a '&&' where there should be a '||', thus explaining why a local volume of 2^4 works but not one of 2^3x4?
could be solving the wrong problem; however, the initial heatbath energy for detratio agrees between the scalar and the 4-dim run to all digits, and this involves an inversion. And the scalar version gives acceptance... So it is likely to be an MPI-only problem, and therefore the xchange routines or, as you say, one of the defines...
But I could also cross-check an inversion again against a code version with no interleaving.
Any help is welcome; I probably have to stop working on this again now, as I have to prepare my lecture... :(
I have to finish my MPI write-up and organize some things but I will take a look tomorrow whether I can find out more.
Can you confirm for me that the CG still blows up for
2D_MPI_hs with 4 processes, 2x2x4x4 local volume (TXYZ)
Other than that I've been seeing apparently better behaviour after your changes in terms of hot starts and acceptance. For a 2^4 local volume I still see a large deviation from the reference though. (I will follow up with a real update once the run is more or less complete)
interleave communication and computation with SPI on BG/Q
move bgq_msync into a separate single section at the end of the body …
extend SPI interleaving to sloppy precision version
forgot #endif in sloppy precision part
safely allocate and deallocate memory for surface and body index arrays
Disable prefetching on BG/Q because it seems to hurt performance
Merge remote-tracking branch 'urbach/MoreTiming' into interleaving_hs
Merge branch 'master' of git://github.com/etmc/tmLQCD into interleavi…
Merge branch 'MoreTiming' of git://github.com/urbach/tmLQCD into inte…
fix regression bug in body calculation
regression bug in initialization of BODY and SURFACE without MPI
here we need to find some way to deal with it. Which version do we merge? And how do we keep the old halfspinor version? And what about Michael's code?
Still the same question, any opinions?
I think that Michael's code can only be merged with his help or a dedicated effort by all of us to understand, integrate and test it. In the long term this would be excellent.
In the short term I believe that we should merge your InterleavedNDTwistedClover branch.
With my last post in mind, I would propose that we actually have two hopping matrices in place: one with overlapping and the corresponding geometry, and one without overlapping. Configurable through a global define, we would then have two "halfspinor_body.c" files. This also cleanly merges the SPI hopping matrix and allows continued use of the non-overlapping matrix with two loops on pure MPI machines.
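Wiring that up could be as simple as the following (a sketch; the guard name and the second file name are placeholders, only "halfspinor_body.c" is taken from the discussion above):

```c
/* Sketch of the proposed compile-time switch between the two hopping
 * matrices; guard and file names are placeholders. */
#ifdef WITH_OVERLAPPING_HOPPING_MATRIX
#  include "halfspinor_body_overlap.c"   /* surface/body split, comms interleaved */
#else
#  include "halfspinor_body.c"           /* plain loops, comms done in between    */
#endif
```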