
Temperature in LB_GPU is too high #227

Closed
fweik opened this issue Apr 21, 2015 · 11 comments

@fweik (Contributor) commented Apr 21, 2015

Fri 30 Jan 2015 05:08:59 PM CET, original submission:
(1) The particle temperature is too high, as determined by running a longer version of the lb_gpu.tcl test case; see figure.
(2) The lb_gpu.tcl (and by extension the lb.tcl) test case uses the instantaneous fluid temperature as its error criterion; this should be the averaged fluid temperature.
(3) The LB fluid temperature may be too high; this could not be confirmed, since different seeds gave different averages. See figure for an instance of a too-high fluid temperature.
(4) The LB GPU is improperly seeded; it uses the global seed plus the thread index. If one uses the PID as the global seed (as done in the test case), successive runs produce systems with nearly identical progression of the kinetic energy.
(5) In the extended-run-length version of the lb_gpu.tcl test case there is unexpected divergence of the kinetic energy (temperature) when runs are initialized with identical seeds. They evolve identically at first, but diverge suddenly somewhere between 16000 and 20000 steps (depending on the seed and possibly the time step); see attached images. This could be an uninitialized-memory issue.
(6) The deviation of the temperature with respect to the input value becomes smaller with a smaller time step.
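Point (2) amounts to replacing the last instantaneous sample with a running average of the fluid temperature. A minimal sketch of such an accumulator (the class and variable names are hypothetical, not ESPResSo code; the temperatures are dummy data):

```python
class RunningMean:
    """Accumulate a running average of a scalar observable."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def add(self, sample):
        self.n += 1
        self.total += sample

    @property
    def mean(self):
        return self.total / self.n


# Compare the test's error criterion against the averaged fluid
# temperature rather than the value after the last integration step:
avg_temp = RunningMean()
for t in [1.1, 0.9, 1.05, 0.95]:   # instantaneous fluid temperatures (dummy data)
    avg_temp.add(t)
print(avg_temp.mean)   # the averaged fluid temperature (here close to 1.0)
```

The average fluctuates far less than any single sample, so the test tolerance can be tightened accordingly.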

@fweik (Contributor, Author) commented Apr 21, 2015

Fri 30 Jan 2015 05:38:21 PM CET, comment #1:
We just checked how the deviations depend on the particle friction. For a friction of 10 (instead of 2), the fluid temperature is unaffected, but the particle temperature deviation is 5 times higher (and closer to the higher fluid temperature). This suggests that the error happens in the fluid and the particles are just perturbed from their proper temperature by the coupling with the improperly thermalized fluid.

@fweik (Contributor, Author) commented Apr 21, 2015

Fri 30 Jan 2015 05:50:55 PM CET, comment #2:
Regarding (4): Could you elaborate in what way global seed plus thread index is improper seeding? Provided we use a reasonably uncorrelated RNG, that would be correct and is in fact a common approach to many-core seeding. Using the RNG itself to generate the seeds for the nodes, for example, would be plain wrong, because that introduces correlations.
If you see nearly identical progression of the kinetic energy, that rather indicates that the RNG seed isn't used at all.
Regarding (5): That's likely the one you want to hunt down. The program might initially always run the same because the initial memory contents are the same, but over time, due to MPI/PCI timing jitter, you will get different memory contents at places the code isn't supposed to read. Be advised that the last time we had such a read of uninitialized memory, valgrind was unable to detect it. printf debugging works, but you need to be able to handle GB-sized logs.
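The two seeding schemes being contrasted can be sketched side by side. This is an illustration in NumPy, not the ESPResSo implementation; the seed values are arbitrary:

```python
import numpy as np

n_threads = 4
global_seed = 12345

# The scheme under discussion: thread_seed = global_seed + thread_index.
# With a well-mixed generator, consecutive seeds give independent streams.
naive_seeds = [global_seed + i for i in range(n_threads)]

# A hashing alternative (NumPy's SeedSequence): the global seed is hashed
# before spawning per-thread streams, so nearby global seeds (e.g. PID and
# PID + 1) no longer yield overlapping per-thread seed sets.
children = np.random.SeedSequence(global_seed).spawn(n_threads)
streams = [np.random.default_rng(c) for c in children]

print(naive_seeds)   # [12345, 12346, 12347, 12348]
```

Whether `global_seed + thread_index` is adequate then hinges entirely on how well the underlying generator decorrelates neighboring seeds.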

@fweik (Contributor, Author) commented Apr 21, 2015

Fri 30 Jan 2015 05:53:21 PM CET, comment #3:
Just a caveat, if you increase the friction the integrator becomes less accurate which may also affect the particle temperature. It would be better to keep gamma*dt constant.
The conclusive way to check would be to measure the distribution of the modes, which would allow you to identify which one is off.
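The actual LB mode analysis is more involved, but the underlying idea of checking a distribution rather than a single mean value can be sketched with equipartition of velocity components (dummy Gaussian data standing in for measured velocities; kT = m = 1 are assumed units):

```python
import numpy as np

rng = np.random.default_rng(42)
kT, m, n = 1.0, 1.0, 100_000
# Stand-in for the measured velocities of a properly thermalized ensemble:
v = rng.normal(0.0, np.sqrt(kT / m), size=(n, 3))

# Equipartition: each velocity component should have variance kT/m.
# A single component (or mode) deviating from kT localizes the error.
per_component_T = m * v.var(axis=0)
print(per_component_T)   # all three entries should be close to kT = 1.0
```

Applied per LB mode instead of per velocity component, the same comparison pinpoints which mode is improperly thermalized.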

@fweik (Contributor, Author) commented Apr 21, 2015

Tue 03 Feb 2015 01:46:45 PM CET, comment #4:
I played around a bit with the test case and discovered that the problem gets worse as one lowers the time step. For time steps of 10^-4 and lower, the particle temperature seems to be 0.5 while the fluid temperature is more or less correct; not sure whether this is helpful.

@fweik (Contributor, Author) commented Apr 21, 2015

Tue 03 Feb 2015 02:51:47 PM CET, comment #5:
I discovered the cause of the discrepancy. I had ROTATION compiled in, and the rotational degrees of freedom were only being equilibrated during an initial warm-up. With a small time step these degrees of freedom did not have enough time to warm up, and the resulting temperature was then half of what it should be. Apologies for the red herring.
Owen

@fweik (Contributor, Author) commented Apr 21, 2015

Tue 03 Feb 2015 04:01:24 PM CET, comment #6:
(7) If there are no particles in the system, the global seed is not propagated to the GPU, which then uses uninitialized memory as the seed (according to Dominic).
@axel: With thread_seed = global_seed + thread_index, successive runs whose global seeds differ by less than the number of threads produce correlated temperatures. The reason is that the simulation uses the same random numbers to create the noise, just at shifted positions in the grid. The shift doesn't matter much, since the temperature uses the summed kinetic energies. The only difference between the two systems (as far as temperature is concerned) comes from the first few threads that actually use new random numbers, as well as from the different spatial arrangement of the noise. If this does indeed explain what we see, though, the drifting apart of two runs with very similar seeds (differing by, e.g., 6) takes a very long time.
Of course this is not a problem in principle; one just needs to use sufficiently different seeds. But seeding with the plain PID is common (as in the test case), which is why I thought it should be changed. Dominic uses PID^k with a large k at the TCL level; something like that would already work.
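The overlap described above can be counted directly: with thread seeds of the form global_seed + i, two runs whose global seeds differ by 6 share all but 6 of their per-thread seeds (the thread count and seed values here are illustrative, not the actual GPU configuration):

```python
n_threads = 1024
seeds_a = {100 + i for i in range(n_threads)}   # run 1: global_seed = 100
seeds_b = {106 + i for i in range(n_threads)}   # run 2: global_seed = 106

# The two runs consume almost entirely the same random streams, merely
# assigned to shifted thread indices.
shared = seeds_a & seeds_b
print(len(shared), "of", n_threads, "thread seeds coincide")   # 1018 of 1024
```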

@fweik (Contributor, Author) commented Apr 21, 2015

Tue 03 Feb 2015 04:25:11 PM CET, comment #7:
I don't think "the shift doesn't matter". If the shift is not a multiple of the box length, then the neighbor topology of the nodes is different, and very quickly different parts of the random sequence interact, which should lead to additional decorrelation.
Otherwise we would have a fundamental problem, since we use a 48-bit RNG. Its sequence takes just a couple of time steps to repeat, even for a classical MD with tens of thousands of particles, and for LB it is much less.

@fweik fweik assigned fweik and rempferg and unassigned fweik Apr 21, 2015
@fweik (Contributor, Author) commented Apr 21, 2015

Can this be closed?

@fweik fweik added the Bug label May 5, 2015
@dschwoerer (Contributor) commented:

In pull request #338, the LB test case failed again.

@rempferg (Member) commented:

No. Several things need to be done:

  1. Take into account the momentum from the initial kick and calculate the kinetic energy in the center of mass reference frame. That might fix the temperature offset.

  2. The original script only compared observables after the last integration step instead of their averages. Also, the "expected" variances were way too high, which meant the test case rarely failed anyway. I fixed the former and fudged some values for the latter; this should be done more rigorously, though.
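Point 1 amounts to computing the kinetic energy in the center-of-mass frame, so that the momentum imparted by the initial kick does not inflate the measured temperature. A minimal sketch (the function name and kB = 1 units are assumptions, not the actual test-case code):

```python
import numpy as np

def com_frame_temperature(m, v, kB=1.0):
    """Kinetic temperature with the center-of-mass drift removed.

    m: (N,) array of masses, v: (N, 3) array of velocities.
    """
    v_com = np.average(v, axis=0, weights=m)   # drift from the initial kick
    dv = v - v_com
    ekin = 0.5 * np.sum(m[:, None] * dv**2)
    n_dof = 3 * len(m) - 3                     # COM motion carries no heat
    return 2.0 * ekin / (n_dof * kB)

# Sanity check: a uniform drift must not change the temperature.
rng = np.random.default_rng(0)
m = np.ones(1000)
v = rng.normal(size=(1000, 3))
assert np.isclose(com_frame_temperature(m, v),
                  com_frame_temperature(m, v + 0.7))  # add a constant kick
```

Without the COM correction, the same constant kick would show up as a spurious temperature offset of exactly the kind described in point 1.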

@rempferg (Member) commented:

  1. The unexplained sudden divergence of different runs with the same seed remains.

  2. I don't think anyone implemented the GPU seeding interface.

These should be moved into a separate issue.

@fweik fweik closed this as completed Jun 17, 2016
jngrad pushed a commit to jngrad/espresso that referenced this issue Jan 22, 2020