
Temperature in LB_GPU is too high #227

Closed
fweik opened this issue Apr 21, 2015 · 11 comments

@fweik (Contributor) commented Apr 21, 2015

Fri 30 Jan 2015 05:08:59 PM CET, original submission:
(1) The particle temperature is too high, as determined by running a longer version of the lb_gpu.tcl test case; see figure.
(2) The lb_gpu.tcl (and by extension the lb.tcl) test case uses the instantaneous fluid temperature as its error criterion; this should be the averaged fluid temperature.
(3) The LB fluid temperature may be too high; this could not be confirmed, since different seeds gave different averages. See figure for an instance of a too-high fluid temperature.
(4) The LB GPU is improperly seeded; it uses the global seed plus the thread index. If one uses the PID as the global seed (as done in the test case), successive runs produce systems with nearly identical progression of the kinetic energy.
(5) In the extended-run-length version of the lb_gpu.tcl test case there is unexpected divergence of the kinetic energy (temperature) when runs are initialized with identical seeds. They evolve identically at first, but diverge suddenly somewhere between 16000 and 20000 steps (depending on the seed and possibly the time step); see attached images. This could be an uninitialized-memory issue.
(6) The deviation of the temperature with respect to the input value becomes smaller with a smaller time step.
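Point (2) amounts to replacing the last instantaneous sample with a running average of the fluid temperature. A minimal sketch of such an accumulator (the class and variable names are hypothetical, not ESPResSo code; the temperatures are dummy data):

```python
class RunningMean:
    """Accumulate a running average of a scalar observable."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def add(self, sample):
        self.n += 1
        self.total += sample

    @property
    def mean(self):
        return self.total / self.n


# Compare the test's error criterion against the averaged fluid
# temperature rather than the value after the last integration step:
avg_temp = RunningMean()
for t in [1.1, 0.9, 1.05, 0.95]:   # instantaneous fluid temperatures (dummy data)
    avg_temp.add(t)
print(avg_temp.mean)   # the averaged fluid temperature (here close to 1.0)
```

The average fluctuates far less than any single sample, so the test tolerance can be tightened accordingly.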

@fweik (Contributor, Author) commented Apr 21, 2015

Fri 30 Jan 2015 05:38:21 PM CET, comment #1:
We just checked how the deviations depend on the particle friction. For a friction of 10 (instead of 2), the fluid temperature is unaffected, but the particle temperature deviation is 5 times higher (and closer to the higher fluid temperature). This suggests that the error happens in the fluid and the particles are just perturbed from their proper temperature by the coupling with the improperly thermalized fluid.

@fweik (Contributor, Author) commented Apr 21, 2015

Fri 30 Jan 2015 05:50:55 PM CET, comment #2:
Regarding (4): Could you elaborate in what way global seed plus thread index is improper seeding? Provided we use a reasonably uncorrelated RNG, that would be correct and is in fact a common approach to many-core seeding. Using the RNG itself to generate the seeds for the nodes, for example, would be plain wrong, because that introduces correlations.
If you see nearly identical progression of the kinetic energy, that rather indicates that the RNG seed isn't used at all.
Regarding (5): That's likely the one you want to hunt down. The program might initially always run the same because the initial memory contents are the same, but over time, due to MPI/PCI timing jitter, you will get different memory contents at places the code isn't supposed to read. Be advised that the last time we had such a read of uninitialized memory, valgrind was unable to detect it. printf debugging works, but you need to be able to handle GB-sized logs.
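The two seeding schemes being contrasted can be sketched side by side. This is an illustration in NumPy, not the ESPResSo implementation; the seed values are arbitrary:

```python
import numpy as np

n_threads = 4
global_seed = 12345

# The scheme under discussion: thread_seed = global_seed + thread_index.
# With a well-mixed generator, consecutive seeds give independent streams.
naive_seeds = [global_seed + i for i in range(n_threads)]

# A hashing alternative (NumPy's SeedSequence): the global seed is hashed
# before spawning per-thread streams, so nearby global seeds (e.g. PID and
# PID + 1) no longer yield overlapping per-thread seed sets.
children = np.random.SeedSequence(global_seed).spawn(n_threads)
streams = [np.random.default_rng(c) for c in children]

print(naive_seeds)   # [12345, 12346, 12347, 12348]
```

Whether `global_seed + thread_index` is adequate then hinges entirely on how well the underlying generator decorrelates neighboring seeds.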

@fweik (Contributor, Author) commented Apr 21, 2015

Fri 30 Jan 2015 05:53:21 PM CET, comment #3:
Just a caveat, if you increase the friction the integrator becomes less accurate which may also affect the particle temperature. It would be better to keep gamma*dt constant.
The conclusive way to check would be to measure the distribution of the modes, which would allow you to identify which one is off.
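The actual LB mode analysis is more involved, but the underlying idea of checking a distribution rather than a single mean value can be sketched with equipartition of velocity components (dummy Gaussian data standing in for measured velocities; kT = m = 1 are assumed units):

```python
import numpy as np

rng = np.random.default_rng(42)
kT, m, n = 1.0, 1.0, 100_000
# Stand-in for the measured velocities of a properly thermalized ensemble:
v = rng.normal(0.0, np.sqrt(kT / m), size=(n, 3))

# Equipartition: each velocity component should have variance kT/m.
# A single component (or mode) deviating from kT localizes the error.
per_component_T = m * v.var(axis=0)
print(per_component_T)   # all three entries should be close to kT = 1.0
```

Applied per LB mode instead of per velocity component, the same comparison pinpoints which mode is improperly thermalized.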

@fweik (Contributor, Author) commented Apr 21, 2015

Tue 03 Feb 2015 01:46:45 PM CET, comment #4:
I played around a bit with the test case and discovered that the problem gets worse as one lowers the time step. For time steps of 10^-4 and lower, the particle temperature seems to be 0.5 while the fluid temperature is more or less correct; not sure whether this is helpful.

@fweik (Contributor, Author) commented Apr 21, 2015

Tue 03 Feb 2015 02:51:47 PM CET, comment #5:
I discovered the cause of the discrepancy. I had ROTATION compiled in, and the rotational degrees of freedom were only being equilibrated during an initial warm-up. With a small time step these degrees of freedom did not have enough time to warm up, and the resulting temperature was then half of what it should be. Apologies for the red herring.
Owen

@fweik (Contributor, Author) commented Apr 21, 2015

Tue 03 Feb 2015 04:01:24 PM CET, comment #6:
(7) If there are no particles in the system, the global seed is not propagated to the GPU, which then uses uninitialized memory as the seed (according to Dominic).
@axel: With thread_seed = global_seed + thread_index, successive runs whose global seeds differ by less than the number of threads produce correlated temperatures. The reason is that the simulation uses the same random numbers to create the noise, just at shifted positions in the grid. The shift doesn't matter much, since the temperature uses the summed kinetic energies. The only difference between the two systems (as far as temperature is concerned) comes from the first few threads that actually use new random numbers, as well as from the different spatial arrangement of the noise. If this does indeed explain what we see, though, the drifting apart of two runs with very similar seeds (differing by, e.g., 6) takes a very long time.
Of course this is not a problem in principle; one just needs to use sufficiently different seeds. But seeding with the plain PID is common (as in the test case), which is why I thought it should be changed. Dominic uses PID^k with a large k at the TCL level; something like that would already work.
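The overlap described above can be counted directly: with thread seeds of the form global_seed + i, two runs whose global seeds differ by 6 share all but 6 of their per-thread seeds (the thread count and seed values here are illustrative, not the actual GPU configuration):

```python
n_threads = 1024
seeds_a = {100 + i for i in range(n_threads)}   # run 1: global_seed = 100
seeds_b = {106 + i for i in range(n_threads)}   # run 2: global_seed = 106

# The two runs consume almost entirely the same random streams, merely
# assigned to shifted thread indices.
shared = seeds_a & seeds_b
print(len(shared), "of", n_threads, "thread seeds coincide")   # 1018 of 1024
```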

@fweik (Contributor, Author) commented Apr 21, 2015

Tue 03 Feb 2015 04:25:11 PM CET, comment #7:
I don't think "the shift doesn't matter". If the shift is not a multiple of the box length, then the neighbor topology of the nodes is different, and very quickly different parts of the random sequence interact, which should lead to additional decorrelation.
Otherwise we would have a fundamental problem, since we use a 48-bit RNG. Its sequence takes just a couple of time steps to repeat, even for a classical MD with tens of thousands of particles, and for LB it is much less.

@fweik fweik assigned fweik and rempferg and unassigned fweik Apr 21, 2015
@fweik (Contributor, Author) commented Apr 21, 2015

Can this be closed?

@fweik fweik added the Bug label May 5, 2015
@dschwoerer (Contributor) commented:

In pull request #338, the LB test case failed again.

@rempferg (Member) commented:

No. Several things need to be done:

  1. Take into account the momentum from the initial kick and calculate the kinetic energy in the center of mass reference frame. That might fix the temperature offset.

  2. The original script only compared observables after the last integration step instead of their averages. Also, the "expected" variances were way too high, which meant the test case rarely failed anyway. I fixed the former and fudged some values for the latter; this should be done more rigorously, though.
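Point 1 amounts to computing the kinetic energy in the center-of-mass frame, so that the momentum imparted by the initial kick does not inflate the measured temperature. A minimal sketch (the function name and kB = 1 units are assumptions, not the actual test-case code):

```python
import numpy as np

def com_frame_temperature(m, v, kB=1.0):
    """Kinetic temperature with the center-of-mass drift removed.

    m: (N,) array of masses, v: (N, 3) array of velocities.
    """
    v_com = np.average(v, axis=0, weights=m)   # drift from the initial kick
    dv = v - v_com
    ekin = 0.5 * np.sum(m[:, None] * dv**2)
    n_dof = 3 * len(m) - 3                     # COM motion carries no heat
    return 2.0 * ekin / (n_dof * kB)

# Sanity check: a uniform drift must not change the temperature.
rng = np.random.default_rng(0)
m = np.ones(1000)
v = rng.normal(size=(1000, 3))
assert np.isclose(com_frame_temperature(m, v),
                  com_frame_temperature(m, v + 0.7))  # add a constant kick
```

Without the COM correction, the same constant kick would show up as a spurious temperature offset of exactly the kind described in point 1.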

@rempferg (Member) commented:

  1. The unexplained sudden divergence of different runs with the same seed remains.

  2. I don't think anyone implemented the GPU seeding interface.

These should be moved into a separate issue.

@fweik fweik closed this as completed Jun 17, 2016
jngrad pushed a commit to jngrad/espresso that referenced this issue Jan 22, 2020