remaining random numbers issues #182

Open
kostrzewa opened this Issue · 10 comments

2 participants

@kostrzewa
Owner

Just a collection of remaining issues to close the random numbers chapter:

  1. loc_seed = (seed + step*max_seed) % INT_MAX will result in loc_seed=seed for step=0 and step=g_nproc, but I don't think this can ever occur, right? After all, g_nproc > max(step). For g_nproc > INT_MAX this obviously breaks down, but I think for now it's safe to assume that we can live with that!
  2. seed^(nstore+1) will always modify the seed by flipping the lowest bit; this should be changed back. Also, it should be made very clear to users that nstore needs to be read in when continuing ensembles!
  3. It should be made very clear to users that two different ensembles must be started with different seeds. The seed(s) used should be listed with the rest of the information on the wiki! It would also be useful to have a full listing of the chunks that were computed (say 100 trajectories per run or so)
  4. z2 noise always operates in repro=0 mode
  5. routines in P_M_eta.c always use repro=1
  6. for inversions with z2 noise it must be made clear to users that they need to set different seeds for different runs, otherwise they introduce unwanted correlations. Maybe the inverter should use a random seed by default?
  7. source_generation.c has a lot of hard-coded stuff which really needs to be reviewed
  8. the repro=1 solution for random_spinor_field(..) is only valid for V=VOLUME...
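
To make items 1 and 2 concrete, here is a minimal sketch of the two seed manipulations. The function names are made up for illustration, and max_seed is simply passed in as a parameter because its definition in the code is not shown in this thread. Inputs are assumed to be non-negative.

```c
#include <limits.h>
#include <stdint.h>

/* Per-process seed following the formula quoted in item 1.
 * The intermediate sum is widened to 64 bits so that step * max_seed
 * cannot overflow a plain int before the modulo is taken.
 * Assumption: seed, step and max_seed are all non-negative. */
int local_seed(int seed, int step, int max_seed)
{
    int64_t wide = (int64_t)seed + (int64_t)step * (int64_t)max_seed;
    return (int)(wide % INT_MAX);
}

/* Item 2: XOR-ing the seed with (nstore + 1).
 * For nstore = 0 this flips only the lowest bit of the seed, so two
 * neighbouring seeds end up mapped onto each other. */
int xor_seed(int seed, int nstore)
{
    return seed ^ (nstore + 1);
}
```

With these definitions, local_seed(seed, 0, max_seed) returns seed unchanged for any seed below INT_MAX, and the same value comes back whenever step*max_seed is a multiple of INT_MAX. xor_seed(12344, 0) gives 12345 and xor_seed(12345, 0) gives 12344, which is the kind of seed overlap between ensembles that item 3 warns about.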
@kostrzewa
Owner

@urbach The repro=1 implementation is generally not valid for an arbitrary number of random numbers. Because it is for testing purposes only, do you think it would be worthwhile if I rewrote it so that the processes send the ranlux state around in a cartesian fashion, whatever the number of random numbers to be generated? That way the smaller functions can use repro mode too. If I'm not mistaken, this will be equivalent to the current implementation for N=VOLUME (or VOLUME/2 for the _eo function).

I was thinking of implementing:

(0,0,0,0) -> (0,0,0,1) -> (0,0,0,2) ... (NT,NX,NY,NZ)

This will probably be (quite?) a bit slower than what is used right now for repro=1 though.

@urbach
Owner

sorry, I think I didn't really understand what you wanted to do? What do you mean by

not valid for arbitrary number of random numbers

?

@urbach
Owner

One more thing that I also thought about and wanted to hear (all) your opinions on:

  • we could restart the random number generator at the beginning of each trajectory. This would make restarts reproducible. Is this a good idea?
  • how much slower is the current repro=1 implementation?
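
A rough sketch of the first bullet, just to illustrate the idea: if the generator is reseeded from (seed, trajectory number) at the start of every trajectory, a run that is interrupted and continued produces the same numbers as an uninterrupted one. The toy generator below is only a stand-in for ranlux, and trajectory_seed and run_trajectories are hypothetical names, not existing functions.

```c
#include <stdint.h>

/* Toy 64-bit LCG used purely as a stand-in for ranlux. */
typedef struct { uint64_t state; } toy_rng;

void toy_seed(toy_rng *r, uint64_t seed) { r->state = seed; }

uint32_t toy_next(toy_rng *r)
{
    r->state = r->state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(r->state >> 32);
}

/* Hypothetical per-trajectory seed: for a fixed base seed, distinct
 * trajectory numbers always give distinct values. */
uint64_t trajectory_seed(uint64_t base_seed, uint64_t traj)
{
    return base_seed * 0x9E3779B97F4A7C15ULL + traj;
}

/* Run trajectories first..last-1, reseeding at the start of each one.
 * out[traj] receives a dummy "observable" built from four draws. */
void run_trajectories(uint64_t base_seed, int first, int last, uint32_t *out)
{
    for (int traj = first; traj < last; ++traj) {
        toy_rng rng;
        toy_seed(&rng, trajectory_seed(base_seed, (uint64_t)traj));
        uint32_t acc = 0;
        for (int i = 0; i < 4; ++i)
            acc ^= toy_next(&rng);
        out[traj] = acc;
    }
}
```

Running trajectories 0 to 5 in one go fills the same array as running 0 to 2, stopping, and then running 3 to 5. Whether the per-trajectory streams are independent enough depends on how the real generator handles seeding, which this sketch does not address.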
@kostrzewa
Owner

sorry, I think I didn't really understand what you wanted to do? What do you mean by

not valid for arbitrary number of random numbers
?

well, there are a few functions in start.c and in other places that produce random numbers but not in units of VOLUME or VOLUME/2. Here the repro=1 implementation that you used breaks down. Many functions don't have a repro=1 implementation at all.

Now, one could also try to split up the request into chunks of N/g_nproc and then play the same game as now: generate all random numbers locally and use only the ones belonging to a given process. This would reuse the g_nproc_[t,x,y,z] and the local coordinates as is done now, but the upper limits of the loops, currently [T,LX,LY,LZ], would have to be replaced by g_nproc_[t,x,y,z]*N/g_nproc. Of course, this hinges on N being divisible by g_nproc, which is not a given...
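
In one dimension the chunked variant would look roughly like the sketch below. Every rank generates all N numbers from the same seed and keeps only its own chunk; the divisibility requirement shows up as an explicit error return. The LCG is a stand-in for ranlux and fill_local_chunk is a made-up name, so treat this as an illustration rather than the actual code.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy 64-bit LCG used purely as a stand-in for ranlux. */
uint32_t lcg_next(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(*state >> 32);
}

/* Every rank draws all n numbers from the same seed and stores only the
 * n / nproc numbers that belong to it.
 * Returns 0 on success, -1 if the arguments are invalid or n is not
 * divisible by nproc. */
int fill_local_chunk(uint64_t seed, size_t n, int nproc, int rank,
                     uint32_t *local)
{
    if (nproc <= 0 || rank < 0 || rank >= nproc || n % (size_t)nproc != 0)
        return -1;

    size_t chunk = n / (size_t)nproc;
    size_t begin = (size_t)rank * chunk;
    uint64_t state = seed;

    for (size_t i = 0; i < n; ++i) {
        uint32_t r = lcg_next(&state);   /* drawn on every rank */
        if (i >= begin && i < begin + chunk)
            local[i - begin] = r;        /* kept only by the owner */
    }
    return 0;
}
```

Each rank does O(N) work regardless of how many numbers it keeps, and a request such as N=10 on 4 processes is rejected outright.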

This would of course add a boatload of complexity which would be avoided by my suggestion altogether.

In repro mode, whenever random numbers would be generated, the state would be sent from one process to the other (in a cartesian fashion) and finally back to process 0 to guarantee that there is no mixing of random number modes.
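
As a serial simulation of that handoff, something like the following shows why the result is independent of how the request is split. It assumes that the cartesian order can be mapped onto a plain rank order, uses a toy generator in place of ranlux, and replaces the actual MPI send/receive of the state with a variable that is handed from one loop iteration to the next. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the saved ranlux state that would be sent between ranks. */
typedef struct { uint64_t s; } rng_state;

uint32_t rng_draw(rng_state *st)
{
    st->s = st->s * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(st->s >> 32);
}

/* Visit the ranks in order. Each rank "receives" the state, draws
 * counts[rank] numbers into the shared output, and passes the state on.
 * The returned value is the state that goes back to process 0. */
rng_state pass_state_around(rng_state start, int nproc,
                            const size_t *counts, uint32_t *out)
{
    rng_state token = start;
    size_t pos = 0;
    for (int rank = 0; rank < nproc; ++rank) {
        for (size_t i = 0; i < counts[rank]; ++i)
            out[pos++] = rng_draw(&token);
        /* in a parallel run, the state would be sent to rank + 1 here */
    }
    return token;
}
```

Because the state is threaded through all ranks, any split of the request (3, 0, 5 and 2 numbers on four ranks, say) gives the same sequence and the same final state as ten draws on a single process, with no divisibility condition. The price is that the ranks generate strictly one after another.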

how much slower is the current repro=1 implementation?

I can't really say, in Zeuthen the performance is so variable because of the braindead process mapping that I can't measure the effect for any of the parallelizations. (I don't even want to tell you how many runs it took me to figure out how much walltime buffer I need to absorb the performance variations and predict correct job lengths...)

I think the best machine to figure this out on is BG/Q because it has such tight limits on predictable performance.

we could restart the random number generator at the beginning of each trajectory. This would make restarts reproducible. Is this a good idea?

The difference between restart and continue never became clear to me looking at what hmc_tm does. What's the difference?

@kostrzewa
Owner

A workaround based on the argument that we have more important things to finish right now would be to make the functions which haven't been adjusted yet fail and abort program execution in repro=1 mode.

@urbach
Owner

Which functions do you have in mind? I'm not convinced that it's needed right now. But we should of course implement this in the near future.

@kostrzewa
Owner
  • random_spinor_field has a V argument and could be called with V != VOLUME (it is in fact called with V=N2=VOLUMEPLUSRAND in index_jd.c), z2_spinor_field has the same issue
  • random_spinor and *_su3_vector are externally accessible and can cause the RNG to become confused if called in repro=1 mode, the same goes for the simpler gauss_vector function

I agree with you that none of this is urgent but it really has to be done and I wanted to write it all down.

  • for the case of Z2 noise, Elena has already been using this on SuperMUC (with the broken loc_seed initialization I presume... I don't know whether the code she compiled from is dated after September 3rd) and I promised I'd fix that bit soonish.
@urbach
Owner

Yes, fine, let's make the routines not externally accessible, and random_spinor_field should be removed anyhow.

Concerning Elena, they should first fix invert!

@kostrzewa
Owner

random_spinor_field should be removed anyhow.

So I'm guessing index_jd.c is dead code then?

@urbach
Owner

currently not called anywhere...
