Clover QPX #228

Closed · wants to merge 16 commits

Contributor

urbach commented Feb 12, 2013

  • Interchanges the two innermost indices for sw_inv and sw, see issue #222
  • QPX implementation of multiplication with sw_inv, including a new data type
  • to do: sw multiplication
  • to do: fusing with hopping matrix

This is not for merging yet, but I wanted to have some discussion. For the sw multiplication we either store full 6x6 matrices, making the QPX implementation easy, or we make use of the hermiticity of the 6x6 colour matrix, but then the QPX implementation becomes more difficult.

Moreover, it doesn't seem to help that much...
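To make the trade-off concrete, here is a sketch of the two storage options (type and field names are illustrative, not taken from the branch):

    #include <complex.h>

    /* Option 1: store the full 6x6 colour block -- simple to feed into the
     * 4-wide QPX complex arithmetic, but twice the memory and bandwidth. */
    typedef struct {
      double _Complex w[6][6];
    } sw_full_t;

    /* Option 2: exploit hermiticity -- 6 real diagonal entries plus 15 complex
     * off-diagonal ones (36 real numbers instead of 72), but the irregular
     * layout makes a straightforward QPX implementation awkward. */
    typedef struct {
      double          diag[6];
      double _Complex offdiag[15];
    } sw_herm_t;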

Owner

kostrzewa commented Feb 13, 2013

Just to pinpoint the bottlenecks I ran a quick scalasca run with a twisted mass clover doublet for 14 trajectories on BG/Q.

In percent of total time:

sw_all ~ 12%
clover_gamma5 ~ 10%
clover_inv ~ 9%

steps per timescale and monomials:

ts[0] = 2, gauge
ts[1] = 8, cloverdet

sw_term and sw_invert are only marginal, at about 1% each.

My guess would be that sw_all is slow because of the atomic statements, but this does NOT seem to be confirmed by scalasca, which reports almost no thread idling there. But maybe scalasca doesn't register waiting for atomic operations (quite likely). There is a large amount of thread idling in clover_gamma5 and clover_inv though (8% of total each), which could potentially be reduced through QPX. Generally, thread idling occurs when memory bandwidth is the bottleneck (as in the hopping matrix, for instance), when there is a lot of OpenMP overhead, or during the many single sections that we have for communication.

The latter two functions have a lot of su3 multiplications and I believe they could strongly benefit from optimization.
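For context, the accumulation pattern that forces the atomics in sw_all looks roughly like this (a generic sketch, not the actual code: neighbour_of, local_term and deriv are placeholders standing in for the g_idn/g_iup lookups, the local clover contribution, and the su3adj derivative field):

    /* Generic sketch of the conflict: several sites ix can share the same
     * neighbour iy, so every component update on deriv[iy] must be atomic,
     * i.e. eight protected scalar additions per su3adj write. */
    extern int    neighbour_of(int ix);      /* placeholder for g_idn/g_iup lookups */
    extern double local_term(int ix, int c); /* placeholder for the local contribution */

    void accumulate_with_atomics(int volume, double (*deriv)[8])
    {
    #pragma omp parallel for
      for (int ix = 0; ix < volume; ix++) {
        const int iy = neighbour_of(ix);     /* may coincide for two threads */
        for (int c = 0; c < 8; c++) {
          const double t = local_term(ix, c);
    #pragma omp atomic
          deriv[iy][c] += t;                 /* serialises colliding updates */
        }
      }
    }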

Owner

kostrzewa commented Feb 13, 2013

If you'd like I can send you the directory for analysis with square

Contributor

urbach commented Feb 13, 2013

clover_inv is optimised with QPX already, so you could try to see the difference... Or did you use this optimised version already?

Owner

kostrzewa commented Feb 13, 2013

No, this was with the old one. I'll take a look tomorrow if I can (got some administrative stuff to take care of)

Owner

kostrzewa commented Feb 14, 2013

I have a possible optimization of sw_all which does away with ALL of the atomics when run with OpenMP. It uses a lot of memory though... ( VOLUMEPLUSRAND*4*40*sizeof(su3adj) bytes ) and two loops (the second of which is quite trivial)
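For scale, a rough estimate assuming su3adj holds eight doubles (64 bytes) and ignoring the RAND boundary: a 12^4 local volume gives 20736 * 4 * 40 * 64 bytes ≈ 210 MB, which is roughly consistent with the ~250 MB figure mentioned further down once the boundary sites are included.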

I will test it to see if it brings any benefit and set up a pull-request once it looks sensible.

Contributor

urbach commented Feb 14, 2013

@kostrzewa can you compare to my ImproveDerivSbND branch? Might be very similar...!?

Owner

kostrzewa commented Feb 14, 2013

@urbach I guess the idea is similar but your usage of _vector_tensor_vector_add_accum in the second loop of deriv_Sb_nd_tensor is not thread-safe as you are writing to iy = g_idn[ix] which can coincide for two threads (unless I'm missing something)

The following situation could happen, for instance (the subscript refers to the thread):

g_idn[ix_1][2] == g_idn[ix_24][3]

Keep in mind that thread execution is completely asynchronous and there is no synchronization between loop iterations; only when the full loop has been processed is memory synchronized between threads. This is a probabilistic effect, but it will happen.


What I did was to introduce a halo of 40 sites (all possible xpa (4), xma (4), xpamb (16), xmamb (16)) for each lattice point and direction of the derivative, thereby guaranteeing that threads never encounter write conflicts. I can then write there without atomic statements, and any writes to the current point [x] are guaranteed to be thread-safe and can therefore use the default add_assign macro and write directly to the derivative. For a limited number of writes to these halo sites I am able to use a write without accumulation, thereby gaining a little bit more performance.

The memory usage could be optimized slightly by taking into account that not all combinations C_{a,b} a={0,1,2,3}, b={0,1,2,3} can occur. I tried this but got so confused that I wanted to first test a running version before going further.

This is similar in principle to the half-spinor hopping matrix, except that in addition to direct neighbours it includes memory for certain 2-hop neighbours.
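A condensed sketch of that scheme (not the code in sw_all_OMP_optim; value_t, back[][] and contribution() are placeholders, and the direction index and the direct, thread-safe writes to [x] are omitted):

    #include <stdlib.h>

    /* Sketch of the sw_buffer idea.  value_t stands in for su3adj, NNEIGH for
     * the 40 halo sites per point.  contribution(ix, k) is the term computed at
     * site ix that must be added to its k-th halo neighbour; back[iy][k] is the
     * site whose k-th term targets iy. */
    typedef double value_t;
    enum { NNEIGH = 40 };

    void deriv_no_atomics(int vol, value_t *df,
                          const int (*back)[NNEIGH],
                          value_t (*contribution)(int ix, int k))
    {
      value_t *buf = calloc((size_t)vol * NNEIGH, sizeof(value_t));

    #pragma omp parallel
      {
        /* first loop: every (ix, k) slot has exactly one writer -> no atomics */
    #pragma omp for
        for (int ix = 0; ix < vol; ix++)
          for (int k = 0; k < NNEIGH; k++)
            buf[ix*NNEIGH + k] = contribution(ix, k);

        /* second, "quite trivial" loop: gather the terms aimed at each site */
    #pragma omp for
        for (int iy = 0; iy < vol; iy++)
          for (int k = 0; k < NNEIGH; k++)
            df[iy] += buf[back[iy][k]*NNEIGH + k];
      }
      free(buf);
    }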

Owner

kostrzewa commented Feb 15, 2013

Well, that's disappointing. My changes only reduced the time spent in sw_all from 12.5% to 10.5% of the total. I guess there is still substantial improvement potential if su3_times_su3 and co. are BG/Q optimized, thereby also reducing the time spent in the gauge derivative (~6% of total for the plaquette only...) and in the gauge update (2% of total time)

Contributor

urbach commented Feb 15, 2013

I see, you are right, it's not thread-safe, I overlooked that. Hmm...

The problem with optimising su3_times_su3 is that I don't think it will work very well. The vector unit is not made for SU(3)... We should modify QCD to have an SU(4) colour gauge group!

Owner

kostrzewa commented Feb 15, 2013

I think we can squeeze out at least a factor of 2 by using it like the old unit. I just wrote su3_times_su3 using 18 registers for the input, 9 for the output and 3 temporaries. I reduced the operations to "row times column" and further to "complex multiply". I can't test right now because someone is running hundreds of small jobs in a row on JuQueen.
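For reference, a scalar sketch of that decomposition (plain C99 complex, not the QPX code): each of the 9 output entries is a row-times-column reduction built from three complex multiplies, which is the building block that is then mapped onto the vector unit.

    #include <complex.h>

    /* Scalar sketch of the decomposition described above, not the QPX code:
     * 9 output entries, each a "row times column" of three complex multiplies.
     * su3_t is a stand-in for tmLQCD's su3 struct. */
    typedef struct { double _Complex c[3][3]; } su3_t;

    static void su3_times_su3_ref(su3_t *r, const su3_t *a, const su3_t *b)
    {
      for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
          double _Complex t = 0.0;          /* one of the few temporaries */
          for (int k = 0; k < 3; k++)
            t += a->c[i][k] * b->c[k][j];   /* "complex multiply" building block */
          r->c[i][j] = t;
        }
    }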

btw: The scalasca test of CloverQPX is also in the queue.

Contributor

urbach commented Feb 15, 2013

there is a trick for testing... submit a job which starts an xterm. Then you can test for half an hour...

yes, one can definitely optimise su3_times_su3, but not as much as one would like to. Probably you do a vec_ld2 for the first matrix and then you can do some parts very efficiently...

Owner

kostrzewa commented Feb 15, 2013

The CloverQPX branch is unfortunately marginally slower in total than the InterleavedTwistedNDClover branch. clover_inv and sw_invert are marginally faster, but it's not really worth mentioning... sw_term is a bit slower in exchange.

Contributor

urbach commented Feb 15, 2013

so, InterleavedTwistedNDClover + CloverQPX should be a little faster. But I was already fearing that we basically wait for data in these routines... Can one see this with scalasca?

Owner

kostrzewa commented Feb 15, 2013

Yes, in principle idling threads are either overhead or waiting for data. Both clover_inv and clover_gamma5 have some level of thread-idling, but not nearly as much as the hopping matrix. Interestingly, sw_all has no thread-idling, indicating that any optimization will bring large rewards (I guess).

Owner

kostrzewa commented Feb 15, 2013

so, InterleavedTwistedNDClover + CloverQPX should be a little faster

I'm not sure, I would need to check without scalasca to be certain.

Owner

kostrzewa commented Feb 15, 2013

Wow, the QPX su3 REALLY helps. It reduced the time spent in sw_all to 7% of the total. I will write su3d_times_su3 and then we can deploy this in general.

Owner

kostrzewa commented Feb 15, 2013

I think it will also be useful to implement the su3adj operations in QPX, especially since they lend themselves to it so nicely.
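They fit indeed: the eight real components of an su3adj fill exactly two 4-wide registers. A hypothetical sketch (IBM XL QPX intrinsics, BG/Q only; su3adj_add_assign_qpx is an illustrative name, the usual tmLQCD layout of eight doubles and 32-byte alignment of both operands are assumed):

    /* Hypothetical sketch: r += s becomes two vector loads, two adds and two
     * stores instead of eight scalar additions. */
    static inline void su3adj_add_assign_qpx(su3adj * restrict r,
                                             const su3adj * restrict s)
    {
      vector4double r0 = vec_ld(0L,  (double *) r);
      vector4double r1 = vec_ld(32L, (double *) r);
      vector4double s0 = vec_ld(0L,  (double *) s);
      vector4double s1 = vec_ld(32L, (double *) s);
      vec_st(vec_add(r0, s0), 0L,  (double *) r);
      vec_st(vec_add(r1, s1), 32L, (double *) r);
    }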

Contributor

urbach commented Feb 15, 2013

hmm, so maybe one should also try the clover term with QPX su3 operations. But maybe in sw_all the bandwidth-to-flop ratio is more favourable... Can I see your implementation somewhere already?

Owner

kostrzewa commented Feb 15, 2013

sure, it's in sw_all_OMP_optim, it's really naive though

and yes, there is no thread-idling in sw_all so memory is certainly not the problem

Contributor

urbach commented Feb 15, 2013

okay, and you only make use of half of the vector unit... So there is even more potential...

And indeed, the bandwidth/flop ratio is more favourable...!

Owner

kostrzewa commented Feb 15, 2013

Yes, I guess the columns of the right matrix could be fused "spinor style" and then maybe the existing macros could be used.

Oh, and ignore the fact that there is a ~250 MB memory leak (12^4 volume), I only wrote the sw_buffer thing for testing purposes and didn't bother freeing it (or the pointer arrays, for that matter). I don't think it's viable in practice because it uses so much memory and the gains are quite disappointing.

Contributor

urbach commented Feb 15, 2013

yes, indeed. But maybe right now the fusing is not needed... It would be interesting to see whether the original version of clover_inv and clover_gamma5 is faster with your su3 algebra. Or maybe even better for f6bfd47?

Owner

kostrzewa commented Feb 15, 2013

I'll try both on Monday but then I'll have to get started with the Lugano talk.

Contributor

urbach commented Feb 15, 2013

you start with talks early... ;)

Owner

kostrzewa commented Feb 15, 2013

It will be my first plenary and I'm a little nervous, so I'd rather get it down well ahead of time :)

Owner

kostrzewa commented Feb 16, 2013

I think the sw_buffer concept might also be useful for deriv_Sb and any other (future) derivative with "non-local impact". I think I might generalize it at some point as a derivative halo and see the impact on BG/Q performance.

Contributor

urbach commented Feb 16, 2013

I think we should try to move sw_all and parts of deriv_Sb (my deriv_Sb_nd_trace in the ImproveDerivSbND branch) into update_momenta. They'd need to be called only once per timescale and step, independent of the number and types of monomials on the timescale. The only question is the memory usage...

Owner

kostrzewa commented Feb 18, 2013

It would be interesting to see whether the original version of clover_inv and clover_gamma5 is faster with your su3 algebra

There are only su3_times_vector multiplications which I still need to implement...

Besides that, I think clover_inv would be much faster with separate (const * const) input and output and a removal of the unnecessary assignment to phi1 and phi3. I guess this has been accomplished anyway with the fusing with the hopping matrix. How come you only replaced it once in one of the operators?
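For illustration, a hypothetical out-of-place prototype along those lines (name and argument list are mine, not the branch's):

    /* Hypothetical out-of-place variant: with const- and restrict-qualified
     * pointers the compiler knows input and output do not alias, and the
     * intermediate phi copies become unnecessary. */
    void clover_inv_outofplace(spinor * restrict const l,        /* output */
                               const spinor * restrict const k,  /* input  */
                               const double mu);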

Owner

kostrzewa commented Feb 18, 2013

my deriv_Sb_nd_trace in the ImproveDerivSbND branch

In this, the many for loops will make things slow for (n>1)-dim parallelizations when OpenMP is used, because there is an implicit barrier at the end of each for loop. However, and correct me if I'm wrong, all these for loops are independent. They can therefore be called with

#pragma omp for nowait

That should help performance quite a bit.
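A minimal sketch of that pattern (hypothetical loop bodies, assumed independent as discussed above):

    /* Sketch only: loop1_body/loop2_body stand for two of the independent loops
     * in deriv_Sb_nd_trace.  With nowait, a thread that finishes its share of
     * the first loop moves on to the second immediately instead of waiting at
     * the implicit barrier; the parallel region still ends with one barrier. */
    extern void loop1_body(int ix);
    extern void loop2_body(int ix);

    void independent_loops(int vol)
    {
    #pragma omp parallel
      {
    #pragma omp for nowait
        for (int ix = 0; ix < vol; ix++)
          loop1_body(ix);

    #pragma omp for nowait
        for (int ix = 0; ix < vol; ix++)
          loop2_body(ix);
      }
    }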

As for pulling the traces out, I agree that this could provide quite a boost in general. The only problem is that having the derivative in a 3x3 complex representation prevents the usage of "atomic" completely (because an atomic operation needs to be reducible to a single operation on an elementary type, which is not possible for a complex number unless done manually). It will therefore require either clever halos so that writes are thread-safe by definition (with subsequent accumulation from those halos), or the usage of critical sections, which kill performance completely (and I mean by factors of 10 or worse!).
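For completeness, the "done manually" route would look roughly like this (a sketch; add_complex_atomic is an illustrative name, and the 3x3 entries are assumed to be plain double _Complex):

    #include <complex.h>

    /* Sketch: protect the two real components of a shared complex accumulator
     * separately.  Each scalar update is race-free; a reader between the two
     * atomics could see a half-updated value, but the sum after the loop
     * barrier is correct. */
    static void add_complex_atomic(double _Complex *target, double _Complex z)
    {
      double *t = (double *) target;        /* C99: complex == {re, im} in memory */
      const double re = creal(z), im = cimag(z);
    #pragma omp atomic
      t[0] += re;
    #pragma omp atomic
      t[1] += im;
    }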

Contributor

urbach commented Feb 18, 2013

The fusing is still in progress. I'll replace it everywhere else as well.

Some assignments cannot be avoided, unfortunately. phi1 and phi3 have already been removed, but you need some memcpy at some point.

Contributor

urbach commented Feb 18, 2013

However, and correct me if I'm wrong, all these for loops are independent. They can therefore be called with...

yes, I think that is correct. In principle one could also have different threads working on different loops...

Owner

kostrzewa commented Feb 18, 2013

However, and correct me if I'm wrong, all these for loops are independent. They can therefore be called with...

yes, I think that is correct. In principle one could also have different threads working on different loops...

That would happen by default with a nowait pragma: as a thread finishes its work-unit, it moves on to the next loop. One might think about adjusting the thread scheduling there to ensure that slow threads (say, a thread stuck in the first loop) are not assigned work in the subsequent loops, which would otherwise cause the whole code to wait until they have completed their work-units from all loops.

Owner

kostrzewa commented Nov 22, 2013

I'd like to revive this effort and obviously focus on clover_inv and clover_gamma5. I was misled by my unrealistic example as to the importance of sw_all. If there are enough CG iterations, sw_all can be ignored on the outermost timescale. What I said before was not quite correct: at lower timescales it contributes more (up to 10% of the total at a given timescale).

Owner

kostrzewa commented Nov 25, 2013

In the latest version of CloverQPX there seems to be a segfault on BG/Q. (There's also a forgotten mu parameter for the combined swinv_times_HoppingMatrix in the C file.) The segfault happens when swinv_times_HoppingMatrix in Qsw_pm_psi is commented out and Hopping_Matrix and clover_inv are called separately again.

kostrzewa closed this Feb 10, 2014
