interleaving computation and communication in halfspinor hopping matrix on BG/Q with SPI #157

Open · wants to merge 11 commits


@kostrzewa
Member

This pull request implements a simple version of interleaved communication and computation when using SPI communication with the halfspinor version of the hopping matrix. DO NOT MERGE, JUST FOR COMPARISON PURPOSES!

Two new arrays, g_body and g_surface, are defined. They contain the even-odd preconditioned indices of the surface and the body of the local lattice. These index arrays are then used to address the input spinor, the auxiliary spinor and the gauge field in the computation of the surface and body contributions in the hopping matrix. None of the index orderings is changed.
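As an illustration of the intended control flow, a minimal sketch only: the helpers spi_start_send/spi_wait_recv, the hop_pre_site/hop_post_site stand-ins for the _hop_*_pre()/_hop_*_post() macro blocks and the *_volume counters are hypothetical; only g_surface and g_body are from this pull request.

/* placeholders for the SPI descriptor injection / completion wait and for
   the per-site _hop_*_pre() / _hop_*_post() macro blocks */
extern void spi_start_send(void);
extern void spi_wait_recv(void);
extern void hop_pre_site(int ix);
extern void hop_post_site(int ix);

extern int *g_surface, *g_body;   /* even-odd preconditioned site indices */
extern int g_surface_volume, g_body_volume, g_eo_volume;

void hopping_matrix_interleaved(void)
{
  /* 1. project the surface sites first, so their halfspinors can be sent */
  for (int i = 0; i < g_surface_volume; i++)
    hop_pre_site(g_surface[i]);

  spi_start_send();   /* start the halfspinor exchange */

  /* 2. overlap: project the body sites while the network is busy */
  for (int i = 0; i < g_body_volume; i++)
    hop_pre_site(g_body[i]);

  /* 3. wait for the exchange, then run the accumulation (post) loop;
     in principle the body part of this loop could also be overlapped */
  spi_wait_recv();
  for (int ix = 0; ix < g_eo_volume; ix++)
    hop_post_site(ix);
}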

Compared to the hopping matrix without interleaving, using just SPI, this gives roughly a 10% increase in performance. The implementation has been tested using the HMC.

Removing all prefetching from the hopping matrix on BG/Q gives a total performance increase of 25% compared to the old SPI version.

@kostrzewa kostrzewa and 1 other commented on an outdated diff Oct 2, 2012
operator/halfspinor_hopping.h
#define _hop_t_p_pre32() \
_vec_load2(rs0, rs1, rs2, s->s0); \
_vec_load2(rs3, rs4, rs5, s->s1); \
_vec_load2(rs6, rs7, rs8, s->s2); \
_vec_load2(rs9, rs10, rs11, s->s3); \
- _prefetch_spinor(s+1); \
@kostrzewa
kostrzewa Oct 2, 2012 Member

perhaps I should leave these in, in case we figure out a way to make the prefetching better on BG/Q. They are undefined above anyway, so they will be of no consequence.

@urbach
urbach Oct 2, 2012 Contributor

yes, leave them in to remind us...!

@urbach
Contributor
urbach commented Oct 2, 2012

So in general it should be possible to rewrite the index arrays g_eo2lexic, g_lexic2eo etc. That should basically work out of the box; the only reasons why I didn't do this from the beginning are the following:

  • At some points in the code the coordinates t,x,y,z are computed, and I was not sure whether this is done consistently if I change the index arrays.
  • I'm not sure that it also works in the derivative functions 'deriv_Sb' etc...

Otherwise it would be very easy to use the remapping I implemented in my NewHalfspinor version everywhere in the code...
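A very rough sketch of the consistency requirement (glossing over the even/odd sublattice offsets and the halo in the real arrays; perm is a hypothetical permutation, e.g. one that puts surface sites before body sites):

extern int *g_lexic2eo, *g_eo2lexic;
extern const int *perm;   /* hypothetical reordering of the eo index */

static void remap_eo_index(int volume)
{
  for (int ix = 0; ix < volume; ix++) {
    const int new_eo = perm[g_lexic2eo[ix]];
    g_lexic2eo[ix]     = new_eo;   /* remap lexicographic -> eo ... */
    g_eo2lexic[new_eo] = ix;       /* ... and keep the inverse map in sync */
  }
}

The open question raised above is then whether every place that computes the coordinates t,x,y,z does so consistently through these arrays.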

@kostrzewa
Member

That is excellent news, Carsten. If it works out, that will of course be faster and we should not merge this pull request.

The next aim for SPI is the gauge exchange. Currently this eats 7 percent of the total time in the HMC, and one could potentially shave off a bit by interleaving the gauge update with the communication.

@urbach
Contributor
urbach commented Oct 4, 2012

Indeed, undefining the prefetching improves performance also in the NewHalfspinor branch...

@kostrzewa
Member

Prefetching does help if there are very few prefetch instructions, confirming my hunch that it causes cache contention because of the limited number of available streams. From the linear algebra routines where you have used prefetching, I see a sub-percent improvement in HMC performance when comparing with and without prefetching.

I also tried reinstating the prefetching of the gauge fields, but this does not lead to any improvement!

@kostrzewa
Member

I tried playing around a bit more with the perfect prefetcher and I've at last understood how it works. I set up 6 sets of prefetching patterns for each thread, so that for the surface, body and final computation the list prefetcher has separate patterns for each thread, ieo part and loop. This does not improve performance, but it also doesn't hurt it much... indicating that whatever prefetching is done by the compiler is ineffective at reducing cache misses (i.e. we rarely prefetch correctly). Luckily we fit into L2, so our overall efficiency is still in the double digits...

My hunch is that with OpenMP, since threads are given different chunks of the loop at each call, list prefetching simply does not work. Manual prefetching could work if you specify a good prefetch by hand without saturating the 16 cache lines per core (i.e. a maximum of 4 per thread!).
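Purely as an illustration (not code from this branch): a budgeted manual prefetch could touch at most a few lines per iteration, so that with 4 threads per core the 16 available lines are not oversubscribed. __dcbt is the XL C "data cache block touch" built-in; the argument names are placeholders for the next site's spinor and gauge field.

#define CACHE_LINE_DOUBLES (128 / sizeof(double))   /* 128-byte L2 lines on BG/Q */

static void prefetch_next_site(double *sp_next, double *u_next)
{
  /* at most 4 touches per thread and iteration */
  __dcbt((void *) sp_next);
  __dcbt((void *) (sp_next + CACHE_LINE_DOUBLES));
  __dcbt((void *) u_next);
  __dcbt((void *) (u_next + CACHE_LINE_DOUBLES));
}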

I will try playing around a bit more with prefetch instructions, but I think Abdou's assessment holds for BG/Q too and hardware prefetching is the important thing. However, when using OpenMP, all bets for hardware prefetching are off because it is not known in which order memory will be accessed.

This might also be the reason why P.B. has chosen to use a more fine-grained threading. He can associate the same loop chunk with the same thread every time, maybe losing a bit of time for synchronisation, but gaining complete predictability. Looking at the Bagel source code, he also uses the perfect prefetcher.

I would like to look into using Bagel as our kernel. Although the documentation is disappointingly sparse, I think it is not that difficult to use BFM kernels from C.

@kostrzewa
Member

I will also try manual work-sharing for the loops in the hopping matrix, together with my perfect prefetcher patterns. Since it is possible with OpenMP to have the same thread take the same chunk of the loop at each call (just not with #pragma omp for), this could increase performance...
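A minimal sketch of such manual work-sharing, assumed to be called from inside an existing parallel region; the loop body stands for the _hop_*_pre() macros and the function name is invented:

#include <omp.h>

/* each thread always processes the same, fixed chunk of the loop on every
   call, so its memory access pattern is stable from call to call */
static void hop_pre_loop_manual(int N)
{
  const int tid      = omp_get_thread_num();
  const int nthreads = omp_get_num_threads();
  const int chunk    = (N + nthreads - 1) / nthreads;
  const int start    = tid * chunk;
  const int end      = (start + chunk < N) ? start + chunk : N;

  for (int i = start; i < end; i++) {
    /* ... _hop_*_pre() macros for site i ... */
  }

  /* explicit barrier replaces the implicit one of a work-sharing for */
#pragma omp barrier
}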

@kostrzewa
Member

I tried the manual chunk distribution and this does help, but not nearly as much as I would have expected. It gives an improvement of only about 3% over the nocomm result... In this situation I would expect the prefetching to be perfect, as each thread has an access pattern and the access pattern never changes. Maybe a call to the designers of the L1P is in order... there simply shouldn't be any cache misses...

@urbach
Contributor
urbach commented Oct 18, 2012

I think in combination with halfspinor one could indeed change the even/odd geometry without too much of a problem. As for all other fields: the gauge field is not in even/odd order anyway, and therefore we don't run into problems in the communication.

@urbach urbach commented on the diff Oct 18, 2012
operator/halfspinor_body.c
+
+ _hop_y_p_pre32();
+ U++;
+ ix++;
+
+ _hop_y_m_pre32();
+ ix++;
+
+ _hop_z_p_pre32();
+ U++;
+ ix++;
+
+ _hop_z_m_pre32();
+ }
+
+#else // SPI
@urbach
urbach Oct 18, 2012 Contributor

uhh, now we get again a mess of #defines...

@urbach urbach commented on the diff Oct 18, 2012
operator/halfspinor_body.c
+ U++;
+ ix++;
+
+ _hop_y_m_pre();
+ ix++;
+
+ _hop_z_p_pre();
+ U++;
+ ix++;
+
+ _hop_z_m_pre();
+ }
+
+#else // SPI
+
+ /* when not interleaving we compute body and surface together */
@urbach
urbach Oct 18, 2012 Contributor

a lot of defines, as above

@urbach
Contributor
urbach commented Oct 18, 2012

In the accumulation part (all the post macros) it would make sense to prefetch the phi field, as it is no longer fetched sequentially from memory.

In my NewHalfspinor branch, where this is also the case when accumulating the boundary, prefetching does indeed help.
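Purely illustrative (not code from this branch): since phi is accessed through an index array in that loop, one could touch the entries needed by the next site one iteration ahead. Here phi stands for the array of halfspinor pointers, idx for the index array, the types are placeholders, and __dcbt is the XL C touch built-in.

typedef struct { double s[24]; } halfspinor_t;   /* placeholder layout */
extern halfspinor_t **phi;                        /* placeholder pointer array */
extern const int *idx;                            /* placeholder index array */

static void accumulate_with_prefetch(int N)
{
  for (int i = 0; i < N; i++) {
    const int ix = idx[i];
    if (i + 1 < N)
      __dcbt((void *) phi[idx[i + 1]]);   /* touch the next site's halfspinor */

    /* ... _hop_*_post() macros for site ix ... */
    (void) ix;
  }
}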

@kostrzewa
Member

Well, one could define the inside of those loops as "_hop_pre_32_loop" etc., then the code would consist essentially only of defines, which is acceptable in my opinion, especially since all the loop bodies are the same. In fact, one could go one step further and use the same loop body for both precisions with some replacement magic, as done by Albert in the buffers.
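A hypothetical sketch of that suggestion, shown only for the y and z directions of the fragment quoted above (the real loop body and the token-pasting trick for the two precisions would of course be more involved):

#define _hop_pre_loop_body()       \
  /* ... t and x directions ... */ \
  _hop_y_p_pre(); U++; ix++;       \
  _hop_y_m_pre();      ix++;       \
  _hop_z_p_pre(); U++; ix++;       \
  _hop_z_m_pre();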

I will try the prefetching in the week of the 29th or maybe in Frankfurt.

@kostrzewa
Member

Further, it is clear that this should be generalized to other interleaving-capable communication schemes, which will add even more ifdefs...

@urbach
Contributor
urbach commented Oct 20, 2012

I have worked a bit to generalise my geometry approach, and I think it is working now... See my new branch InterleavedNDTwistedClover. A test on a 512-node partition on FERMI with a 12^4 local volume gives me 767 Mflops per thread, 49129 Mflops per node. (I don't really understand why it is faster now than the NewHalfspinor one was previously.)

So far it seems to work in the HMC, but there are many things to test.

The idea is that I only changed the even/odd geometry used for the spinor fields. All the other geometry is left unchanged. This works only with halfspinor, of course, otherwise the exchange routines for the spinor fields would fail. It is not yet included in the most general way in the code, but if confirmed, we should maybe merge that one...!?

@urbach urbach and 1 other commented on an outdated diff Nov 1, 2012
mpi_init.c
# elif defined PARALLELXT
RAND = 2*LZ*(LY*LX + T*LY);
EDGES = 4*LZ*LY;
+ BODY = LY*LZ*(T-2)*(X-2);
@urbach
urbach Nov 1, 2012 Contributor

you mean LX-2 here...

@urbach
Contributor
urbach commented Nov 1, 2012

I have also adapted the _IS_BODY define and the bodyV initialisation to work with 0-, 1-, 2- and 3-dimensional parallelisation as well.
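Purely illustrative (the actual define in the branch may look different): per direction the check only applies if that direction is parallelised, e.g.

#if defined PARALLELT || defined PARALLELXT || defined PARALLELXYT || defined PARALLELXYZT
#  define _IS_BODY_T(t) ((t) > 0 && (t) < T - 1)
#else
#  define _IS_BODY_T(t) (1)
#endif
#if defined PARALLELXT || defined PARALLELXYT || defined PARALLELXYZT
#  define _IS_BODY_X(x) ((x) > 0 && (x) < LX - 1)
#else
#  define _IS_BODY_X(x) (1)
#endif
/* ... analogously for Y and Z ... */
#define _IS_BODY(t, x, y, z) (_IS_BODY_T(t) && _IS_BODY_X(x) /* && ... */)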

@urbach
Contributor
urbach commented Nov 24, 2012

hmm, so the scalar and the 4-dim inverter do work in InterleavedNDTwistedClover; now I have to see what's going wrong in the HMC...

@urbach
Contributor
urbach commented Nov 24, 2012

testing now on my laptop with MPI (so no SPI), but I get acceptance with DET, DETRATIO and NDCLOVER monomials.

I need some more input on which input files and/or situations it did not work for...!

@kostrzewa
Member

For me it was on BG/Q with a number of input files. I used the "hmc1" parameters with a 24^3x48 volume for instance.

@kostrzewa
Member

Also there is now a compile error in

/bgsys/drivers/ppcfloor/comm/xl.ndebug/bin/mpixlc_r -DHAVE_CONFIG_H -I/homea/hch02/hch028/include/ -I. -I/homea/hch02/hch028/code/tmLQCD.kost/build_bgq_hybrid_hs/ -I/homea/hch02/hch028/code/tmLQCD.kost/build_bgq_hybrid_hs/../ -I/homea/hch02/hch028/lime-1.3.2/install_bgq//include/ -I/include/ -I/bgsys/drivers/ppcfloor/arch/include/ -I/bgsys/drivers/ppcfloor/comm/xl.ndebug/include -qsmp=omp:noauto:schedule=static -O2 -qstrict=all -qtune=qp -qarch=qp -qmaxmem=-1 -c ../../solver/generate_dfl_subspace.c
"../../solver/generate_dfl_subspace.c", line 120.48: 1506-045 (S) Undeclared identifier repro.
make[1]: *** [generate_dfl_subspace.o] Error 1
make[1]: Leaving directory `/homea/hch02/hch028/code/tmLQCD.kost/build_bgq_hybrid_hs/solver'
make: *** [solver] Error 2

@urbach
Contributor
urbach commented Nov 26, 2012

the bug should be fixed now; I had forgotten to push something there...

so, might SPI be causing the problem? Did you try with MPI instead?

@urbach
Contributor
urbach commented Nov 26, 2012

it's not SPI causing the problem...

@urbach
Contributor
urbach commented Nov 26, 2012

hmm, on a PC InterleavedNDTwistedClover runs without problems for sample-hmc1.input for 16 MPI processes and 1 or 2 threads. Any hints?

@urbach
Contributor
urbach commented Nov 26, 2012

the small volume with 16 MPI processes works also on the BG/Q (64 threads), hmm, I don't understand...

@urbach
Contributor
urbach commented Nov 26, 2012

okay, volume 4^3 x 8 gives wrong results on my PC and on the BG/Q, so somewhere the geometry is still not correct.

Strangely enough, the solver converges, though...

@urbach
Contributor
urbach commented Nov 26, 2012

scalar version does work with 4^3 x 8 volume

@kostrzewa
Member

maybe the solver is implicitly solving the wrong problem and some of the indices are mixed up in the end?
For the 4^4 volume and 16 MPI processes everything is exchanged (taking into account both exchange directions for each 2x2 plane), maybe that's a hint? I would try a 3^3x6 local volume to see whether it continues working.
In the definition of the geometry maybe there's a '&&' where there should be a '||', thus explaining why a local volume of 2^4 works but not one of 2^3x4?

@urbach
Contributor
urbach commented Nov 26, 2012

could be solving the wrong problem; however, the initial heatbath energy for detratio agrees between the scalar and the 4-dim run to all digits, and this involves an inversion. And the scalar version gives acceptance... So it is likely to be an MPI-only problem, and therefore the xchange routines or, as you say, one of the defines...

But I could also cross-check an inversion again against a code version with no interleaving.

Any help is welcome. I probably have to stop working on this again now, as I have to prepare my lecture... :(

@kostrzewa
Member

I have to finish my MPI write-up and organize some things, but I will take a look tomorrow to see whether I can find out more.

@kostrzewa
Member

Can you confirm for me that the CG still blows up for

sample-hmc2.input
InterleavedNDTwistedClover
2D_MPI_hs with 4 processes, 2x2x4x4 local volume (TXYZ)
seed=444
reproducerandomnumbers=no

?

Other than that I've been seeing apparently better behaviour after your changes in terms of hot starts and acceptance. For a 2^4 local volume I still see a large deviation from the reference though. (I will follow up with a real update once the run is more or less complete)

@urbach
Contributor
urbach commented Jan 26, 2013

here we need to find some way to deal with it. Which version do we merge? And how do we keep the old halfspinor version? And what about Michael's code?

Still the same question, any opinions?

@kostrzewa
Member

I think that Michael's code can only be merged with his help or a dedicated effort by all of us to understand, integrate and test it. In the long term this would be excellent.

In the short term I believe that we should merge your InterleavedNDTwistedClover branch.

@kostrzewa
Member

With my last post in mind, I would propose that we actually have two hopping matrices in place: one with overlapping and the corresponding geometry, and one without overlapping. Configurable through a global define, we would then have two "halfspinor_body.c" files. This also cleanly merges the SPI hopping matrix and allows continued use of the non-overlapping matrix with its two loops on pure MPI machines.
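A hypothetical sketch of how the selection could look (the define and the second file name are invented for illustration, following the proposal rather than existing code):

#ifdef TM_INTERLEAVE_COMMS
#  include "operator/halfspinor_body_interleaved.c"   /* overlapping version */
#else
#  include "operator/halfspinor_body.c"               /* non-overlapping version */
#endif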
