
Timing the HMC #205

Closed
urbach opened this Issue Dec 31, 2012 · 30 comments

2 participants

@urbach
urbach commented Dec 31, 2012

Hmm, I'm timing the new D45 run that I'm repeating because of the RNG bug. It is much worse than I thought. For the various monomials in a trajectory, the derivatives take, normalized to the gauge one:

Gauge: 1
Det: 3
Detrat1: 5
Detrat2: 9
Poly: 161

in arbitrary time units. I just took the input file used previously, didn't optimise, but...

I think I know what I'm starting with on Wednesday... :(
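For reference, a minimal sketch of the kind of bookkeeping behind such numbers, assuming MPI_Wtime and a hypothetical monomial table (not the real tmLQCD structures):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch only: time each monomial's derivative and print the result
 * normalized to the gauge derivative. The monomial table and the
 * derivative functions are placeholders, not tmLQCD interfaces. */
typedef void (*derivative_fn)(void);

typedef struct {
  const char   *name;
  derivative_fn deriv;
  double        time;   /* accumulated seconds */
} monomial_t;

void time_derivatives(monomial_t *mono, int n)
{
  for (int i = 0; i < n; i++) {
    double t0 = MPI_Wtime();
    mono[i].deriv();                     /* one derivative call */
    mono[i].time += MPI_Wtime() - t0;
  }
  /* normalize to the gauge monomial, assumed to sit at index 0 */
  for (int i = 0; i < n; i++)
    printf("%-8s %6.1f\n", mono[i].name, mono[i].time / mono[0].time);
}
```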

@urbach
urbach commented Dec 31, 2012

These timings come from supermuc...

@urbach
urbach commented Jan 1, 2013

ah, Poly is called as often as Det; only Gauge is called more often.

@urbach
urbach commented Jan 2, 2013

I have worked a bit on this issue; you can find my changes in my ImproveDerivSbND branch. They help a lot in my local scalar PC version, and they also avoid any omp atomic pragmas. However, on supermuc there is no improvement at all, which I really don't understand. Does anyone have any idea?
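To illustrate what avoiding the atomics means in practice, here is a minimal sketch of the accumulate-per-thread-then-reduce pattern (illustrative only, not the branch code; VOL, the hopping stand-in and the buffer layout are made up):

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sketch: avoid "#pragma omp atomic" on shared derivative updates by letting
 * each thread accumulate into a private buffer and reducing afterwards. */
enum { VOL = 4096 };

void add_contributions(double *df /* [VOL] */, const double *src /* [VOL] */)
{
  int nthreads = 1;
#ifdef _OPENMP
  nthreads = omp_get_max_threads();
#endif
  /* one private buffer per thread, zero-initialized */
  double *priv = calloc((size_t)nthreads * VOL, sizeof(double));

#pragma omp parallel
  {
    int tid = 0;
#ifdef _OPENMP
    tid = omp_get_thread_num();
#endif
    double *mine = priv + (size_t)tid * VOL;

    /* each thread writes only into its own buffer: no atomics needed,
     * even when different iterations touch the same target site */
#pragma omp for
    for (int x = 0; x < VOL; x++) {
      int neighbour = (x + 1) % VOL;      /* stand-in for a hopping term */
      mine[x]         += 0.5 * src[x];
      mine[neighbour] += 0.5 * src[x];
    }

    /* reduce the private buffers into the shared derivative field */
#pragma omp for
    for (int x = 0; x < VOL; x++)
      for (int t = 0; t < nthreads; t++)
        df[x] += priv[(size_t)t * VOL + x];
  }
  free(priv);
}
```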

@kostrzewa
European Twisted Mass Collaboration member

My first test would be to check a pure MPI version of the code. Also, it would be very useful to compare to the performance on a numerically approximately equivalent partition of BG/P used in previous runs. If the trajectory takes 10 times as long then we have some performance regression in the poly derivative.

@urbach
urbach commented Jan 2, 2013

well, I'm a bit further... if one call of ndpoly_derivative takes 3.8 seconds, then it is divided as follows (see my ImproveDerivSbND branch):

  • 3.274360e-01 secs accumulated for all Q_tau1_sub_const_ndpsi calls in the first loop
  • 8.768349e-01 secs accumulated for all Q_tau1_sub_const_ndpsi calls in the second loop
  • 2.193796e+00 secs accumulated for all deriv_Sb_nd_tensor calls in the second loop
  • 4.388826e-01 secs accumulated for all H_eo_tm_ndpsi calls in the second loop

all but the second and third times make sense when compared to the timings in the CG. The time spent in this tensor product is ridiculous, and the time for the Q_tau1_sub_const_ndpsi calls in the second loop is a factor of two too large.

When different local volumes are used these times seem to scale with the local volume.

Could be a cache problem, but then the times would not scale with the local volume. In deriv_Sb_nd_tensor there is not that much to do computation-wise, and there are no omp atomic statements... I really don't get it!
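For completeness, accumulated numbers like the ones above can be gathered with a pattern along these lines (a sketch, not the branch code; expensive_call, the loop bound and the reduction to rank 0 are placeholders):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: accumulate the time of repeated calls inside a loop and report the
 * maximum over MPI ranks, so load imbalance does not hide in an average. */
extern void expensive_call(int i);   /* placeholder for e.g. the tensor call */

void timed_loop(int ncalls)
{
  double t_acc = 0.0;
  for (int i = 0; i < ncalls; i++) {
    double t0 = MPI_Wtime();
    expensive_call(i);
    t_acc += MPI_Wtime() - t0;
  }

  double t_max = 0.0;
  MPI_Reduce(&t_acc, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (rank == 0)
    printf("accumulated: %e secs (max over ranks)\n", t_max);
}
```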

@kostrzewa
European Twisted Mass Collaboration member

Try running it on a machine with as many cores as you can find, with pure OpenMP and the same local volume, using the Intel performance tools (the GUI works really well, the command line stuff is quite cumbersome); this will show clearly whether it is a cache problem. (This is how I found the benefit of g_hi in the full hopping matrix.)

@urbach
urbach commented Jan 2, 2013

yes, I'll try at some point... Now restarting D45 on FERMI doesn't work anymore: the plaquette gives NaN, while a hot start works. Reading a configuration generated on FERMI does work, reading one from another machine doesn't... I'll try recompiling lime...

However, as far as I can see, the picture for the timing is that my changes do help quite a lot on BG/Q: ndpoly_derivative gets a factor of 2 or more faster compared to the current master. Still, the tensor part seems unnaturally slow.

@urbach
urbach commented Jan 2, 2013

unfortunately it was not lime...

@urbach
urbach commented Jan 3, 2013

okay, here is the same timing for BG/Q: ndpoly_derivative takes about 4.4 secs in total (same number of physical cores as on supermuc, if I didn't miscount):

  • 0.41 secs for the first loop
  • 0.43 secs for the Q_tau1_sub_const_ndpsi calls in the second loop
  • 3.2 secs for the deriv_Sb_nd_tensor calls in the second loop
  • 0.35 secs for H_eo_tm_ndpsi in the second loop

So, the same phenomenon for the tensor thing, while the increased time for Q_tau1_sub_const_ndpsi calls in second loop is not present here.

The times needed for a trajectory are very similar for both machines:

  • supermuc: 1300 secs on 128 nodes (2*128*8 cores, 2*128*8 threads), in the solver we reach 3.6 Tflops
  • fermi: 1900 secs on bg_size = 128 (128*16 cores, 128*64 threads), in the solver we reach 2.9 Tflops (no interleaving)

The eigenvalues are identical on both machines (to the precision we ask for). Next I'll also try (my) interleaving version.

Could someone please have a quick look at deriv_Sb_nd_tensor and tell me whether I'm doing something incredibly stupid there? Am I right that we spend too much time there? Should we try a QPX version?

@kostrzewa
European Twisted Mass Collaboration member

I took a quick look yesterday evening and I can imagine that it has more OpenMP overhead than the naive version. But, as you say, this is more than offset by the fact that it is more efficient otherwise. One could try to count instructions and perhaps match that to our measured floating point performance in the solver?
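A back-of-the-envelope version of that check might look like this (sketch only; the flops-per-site value is a placeholder that would have to come from an actual count for deriv_Sb_nd_tensor, and the example volume and call numbers are made up):

```c
#include <stdio.h>

/* Sketch: turn a per-call flop estimate into an achieved-Gflop/s number that
 * can be compared with the solver performance. All inputs are placeholders. */
double achieved_gflops(long local_volume, long calls,
                       double flops_per_site, double seconds)
{
  double flops = (double)local_volume * (double)calls * flops_per_site;
  return flops / seconds / 1.0e9;
}

int main(void)
{
  /* example numbers only: 32^3 x 64 lattice split over 2048 ranks */
  long   local_volume   = 32 * 32 * 32 * 64 / 2048;
  long   calls          = 48;
  double flops_per_site = 1000.0;   /* placeholder, not a real count */
  double seconds        = 2.2;      /* measured time for all calls */

  printf("deriv: %.2f Gflop/s per rank\n",
         achieved_gflops(local_volume, calls, flops_per_site, seconds));
  return 0;
}
```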

@kostrzewa
European Twisted Mass Collaboration member

Running with scalasca could also shed light on any possible bottlenecks. The last time I ran with scalasca and the polynomial (A40.24 on one nodeboard), everything was as expected: around 70% of the time was spent in ndpoly_acc, mostly in the hopping matrix, and only 7% was spent computing the ndpoly derivative.

@kostrzewa
European Twisted Mass Collaboration member

Oh, and from the absolute time measurements and the number of visits I see a factor of around 6 ± 0.5 between Q_tau1_sub_const_ndpsi and deriv_Sb, which roughly matches what you have.

@kostrzewa
European Twisted Mass Collaboration member

Hmm.. this was a run on the 17th of December, so it's no surprise it matches what you see, if indeed we have some regression.

@urbach
urbach commented Jan 3, 2013

But the 7% of the time spent in the ndpoly derivative that you report is quite low. For me it's almost 100% of a trajectory!? (Okay, my volume is bigger, but that doesn't make such a difference.)

@kostrzewa
European Twisted Mass Collaboration member

That's interesting, let me dig out the parameters for this run... okay, I see what's going on now. This was a test run where I had explicitly used much heavier light quarks (using only the heaviest DET monomial in the input file), two timescales and only one integration step in each timescale (so... effectively one timescale). I remember I did this because the job would always be terminated before a trajectory was able to complete and I still wanted to see what was going on. I can do a more realistic test if you want, but even so I can see that in a realistic setting, where the derivative would be computed around 30 times per trajectory [1], the fractions would shift roughly as follows:

1.0 = other + ndpoly_deriv + ndpoly_acc = 0.23 + 0.07 + 0.7
→ (0.23 + 30*0.07 + 0.7)/3.03 = (0.23 + 2.1 + 0.7)/3.03 ≈ 0.08 + 0.69 + 0.23

so in a realistic run with a heavy light doublet, about 70% of the time would be spent in the poly derivative. This relative contribution should go down with a lighter doublet on account of the time spent in the lightest CG.

[1] does this make sense? should the heavy doublet be on the most frequent timescale?? seems a bit wasteful...

@urbach
urbach commented Jan 3, 2013

[1] does this make sense? should the heavy doublet be on the most frequent timescale?? seems a bit wasteful...

that's of course at the core of the problem. I'm currently trying to find out why the ndpoly is usually put on the most frequent (fermion) timescale.

@urbach
urbach commented Jan 3, 2013

update on the timing: on the BG/Q (FERMI) with interleaving I get

  • fermi: 2050 secs on bg_size = 128 (128*16 cores, 128*64 threads), in the solver we reach 4 Tflops (with interleaving)

the reason this version is slower, even though we reach 30% higher performance in the CG solver, is that here the ndpoly_derivative part is even slower: 4.2 instead of 3.2 secs in deriv_Sb_nd_tensor. I think I know why: I had to rewrite the exchange routine used there because of the different geometry. Maybe in the end it's the communication part causing the problem? Can you see in the scalasca measurement how much time is spent in xchange_2field?

@kostrzewa
European Twisted Mass Collaboration member

Maybe in the end it's the communication part causing the problem? Can you see in the scalasca measurement how much time is spent in xchange_2field?

One fifth of the total time spent in deriv_Sb is spent in the xchange.

@urbach
urbach commented Jan 3, 2013

hmm, that's a lot, but it cannot explain what we see...

@kostrzewa
European Twisted Mass Collaboration member

All I can say is that ISend here takes 10 times as long as IRecv, and IRecv + WaitAll ~ ISend. In the hopping matrix, ISend ~ IRecv and WaitAll ~ 2*(ISend + IRecv).
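To pin this down independently of scalasca, the three phases could be timed directly around the exchange, roughly like this (sketch; neighbours, counts and tags are placeholders for what xchange_2field actually does):

```c
#include <mpi.h>
#include <stdio.h>

/* Sketch: time the Isend, Irecv and Waitall phases of a non-blocking halo
 * exchange separately. Neighbours, counts and tags are placeholders. */
void timed_exchange(double *sendbuf, double *recvbuf, int count,
                    int up, int down, MPI_Comm comm)
{
  MPI_Request req[2];
  double t0, t_irecv, t_isend, t_wait;

  t0 = MPI_Wtime();
  MPI_Irecv(recvbuf, count, MPI_DOUBLE, down, 0, comm, &req[0]);
  t_irecv = MPI_Wtime() - t0;

  t0 = MPI_Wtime();
  MPI_Isend(sendbuf, count, MPI_DOUBLE, up, 0, comm, &req[1]);
  t_isend = MPI_Wtime() - t0;

  t0 = MPI_Wtime();
  MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
  t_wait = MPI_Wtime() - t0;

  printf("Irecv %.3e  Isend %.3e  Waitall %.3e secs\n",
         t_irecv, t_isend, t_wait);
}
```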

@kostrzewa
European Twisted Mass Collaboration member

You've asked me this question in the past already, and I remember measuring that our hopping matrix is much faster than it used to be, which is what makes deriv_Sb so dominant in polynomial runs. This happened roughly around the time you introduced the new model of having the body in a separate file and all the steps as macros. How does the performance compare to past runs on numerically similar partitions of BG/P?

@urbach
urbach commented Jan 3, 2013

I don't know, firstly because I didn't do the runs myself, and secondly, because the timing was missing (only the CG times were measured, as far as I remember). However, if we had an std-output file from a run on BG/P -- which we should find somewhere -- we could reconstruct the time needed for the derivative of the polynomial by comparing to the total time needed for a trajectory.

@urbach
urbach commented Jan 3, 2013

All I can say is that ISend here takes 10 times as long as IRecv, and IRecv + WaitAll ~ ISend. In the hopping matrix, ISend ~ IRecv and WaitAll ~ 2*(ISend + IRecv).

This is in contrast to my measurements, where basically all the time is spent in Waitall...

@kostrzewa
European Twisted Mass Collaboration member

I don't know, firstly because I didn't do the runs myself, and secondly, because the timing was missing (only the CG times were measured, as far as I remember). However, if we had an std-output file from a run on BG/P -- which we should find somewhere -- we could reconstruct the time needed for the derivative of the polynomial by comparing to the total time needed for a trajectory.

Yes, but the total trajectory time itself is already a good indication, no? At least insofar as it shows whether we have a regression or not. If the performance is comparable, this dominance is just an effect of the improvement of the hopping matrix.

@kostrzewa
European Twisted Mass Collaboration member

This is in contrast to my measurements, where basically all the time is spent in Waitall...

Might be a scalasca artefact... not sure

@urbach
urbach commented Jan 3, 2013

well, for D45 with the same input file one trajectory took about 980 secs with 4096 CPUs. All CG solves in such a trajectory sum up to 235 secs, not including deriv_Sb. Basically all the rest should be ndpoly, and most of that probably ndpoly_derivative... It seems the ratio of rest/ndpoly has become worse on the new machines, but in principle the problem was already there before...

These numbers also tell us that on BG/Q we are a factor of 4 faster than on BG/P on the same number of cores. When counting CPUs it's a factor of 16.

I'm surprised that no-one (including me) ever looked into these numbers.

@kostrzewa
European Twisted Mass Collaboration member

These numbers also tell us that on BG/Q we are a factor of 4 faster than on BG/P on the same number of cores. When counting CPUs it's a factor of 16.

So we have essentially no speedup on a "per-thread" basis, which is in line with what Stefan Krieg was explaining during the Lattice Practices in late 2012. In fact, BG/P was faster "per-thread" (3.4 Gflop/s per thread vs. 3.2 on BG/Q).

It seems the ratio of rest/ndpoly has become worse on the new machines, but in principle the problem was there already before...

I really think this is due to the rest becoming faster, at least that's in line with my observations when you introduced the new hopping matrix. I only started looking at the polynomial very late in the game...

@urbach
urbach commented Jan 4, 2013

for the record: after tuning the input parameters, the time for a trajectory is down to 435 secs on supermuc, compared to 1400 secs with the input file used previously, and with only a few percent lower acceptance. Maybe it's because of the RNG issue, but most probably the input parameters were not properly tuned back then... This will also strongly affect Luigi's new run.

@urbach
urbach commented Mar 24, 2013

so, should we close this now? It seems we don't have too much to discuss here right now...

@kostrzewa
European Twisted Mass Collaboration member

Yes, I think so, we can revive the issue at some point when time is not at such a premium.

@kostrzewa kostrzewa closed this Mar 24, 2013