Not a number in L2 error #1

Open
pkestene opened this issue Mar 5, 2019 · 6 comments

pkestene commented Mar 5, 2019

Hi,

I have tested several GPU architectures (sm_35, sm_50) with different CUDA versions (8.0 and 10.0), and I keep getting the following in the output log:

--- FMM vs. direct ---------------
Rel. L2 Error (pot) : nan
Rel. L2 Error (acc) : nan

The bodyAcc array contains lots of NaNs.
cuda-memcheck does not complain.

I don't know where this problem could originate.

Can anyone help identify the problem, or confirm/invalidate this behavior?

@rioyokota (Contributor)

Managing 15 students doesn't give me much time to code or debug anymore.
I can give you some hints, though.
There must be something wrong with the initialization of the bodies.
Could you check the coordinates and charges of the initialized bodies?
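
Something along these lines would be enough to rule that out; it is only a rough sketch, and it assumes the bodies end up in a host-side float4 array with x/y/z = position and w = charge, so adjust it to the actual layout:

#include <cstdio>
#include <cmath>
#include <vector_types.h>   // float4, from the CUDA toolkit

// Quick sanity check on the initialized bodies: non-finite coordinates,
// all-zero charges, and the charge range.
void checkBodies(const float4* bodies, int numBodies)
{
  int badPos = 0, zeroCharge = 0;
  float minQ = 1e30f, maxQ = -1e30f;
  for (int i = 0; i < numBodies; i++) {
    const float4 b = bodies[i];
    if (!std::isfinite(b.x) || !std::isfinite(b.y) || !std::isfinite(b.z)) badPos++;
    if (b.w == 0.0f) zeroCharge++;
    if (b.w < minQ) minQ = b.w;
    if (b.w > maxQ) maxQ = b.w;
  }
  printf("bodies: %d, non-finite positions: %d, zero charges: %d, charge range: [%g, %g]\n",
         numBodies, badPos, zeroCharge, minQ, maxQ);
}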

pkestene (Author) commented Mar 6, 2019

OK, I'll try to see what I can find and report back soon.
Thanks for sharing this impressive piece of code.

pkestene (Author) commented Mar 8, 2019

The data are initialized correctly; no problem there.

I tried a smaller configuration (1024 bodies), running on a K80 (sm_37).
As reported above, cuda-memcheck does not complain, but since dynamic parallelism
is not supported by cuda-memcheck, I don't really know whether there might still be memory problems.

Trying to analyze the problem turns out to be quite hard:

  • Without debug flags, there are NaNs everywhere in bodyAcc (FMM) but not in bodyAcc2 (direct).
  • With debug flags, I had to remove the __launch_bounds__ constraint on the buildOctant kernel because ptxas reported an error (not enough registers); see the sketch after this list.
    In that case, weirdly, the very first execution "seems" OK: the L2 error for the potential is rather low (5e-2 for 1024 particles), but the L2 error for the acceleration is near 1. Running the same executable with the same configuration multiple times then gives strongly wrong results, even for the direct computation (no NaNs, just wrong values).
    I have the feeling the GPU is then left in a weird state; I need to reload the driver to reset/clean the GPU before I can again observe a small L2 error for the potential.
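
For reference, instead of deleting the constraint for good, one could guard it so it only applies to non-debug device builds, something along these lines (the bound values and the signature below are placeholders, not the actual ones from the kernel):

// Keep __launch_bounds__ only for non-debug device builds, so ptxas does not
// fail on the register constraint under -G.
#if defined(__CUDACC_DEBUG__)    // set by nvcc for -G builds (CUDA 9.0+; CUDA 8 has no such macro)
  #define OCTANT_BOUNDS          // no register constraint in device-debug builds
#else
  #define OCTANT_BOUNDS __launch_bounds__(256, 4)   // placeholder numbers, not the real ones
#endif

// Placeholder signature, just to show where the attribute goes;
// the real buildOctant arguments are of course different.
__global__ void OCTANT_BOUNDS buildOctant(int level)
{
  // kernel body unchanged
}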

Running larger configurations (still with debug flags) gives "mostly" good results, i.e. FMM and direct agree, except at some locations where the FMM potential produces crazy large numbers.

I also don't know whether I can "trust" cuda-memcheck: when running the executable (built with -g -G -O0), cuda-memcheck sometimes reports an out-of-bounds error in the buildOctant kernel, and sometimes the execution never stops!

I'd like to figure out whether there is really a problem with buildOctant (possibly related to the recursive kernel launches, i.e. CUDA dynamic parallelism). Would you recommend a configuration (number of particles, NCRIT, ...) better suited to analyzing this behavior?
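
To narrow things down, I'm thinking of adding a small per-body comparison after both passes, something like this (I'm assuming bodyAcc / bodyAcc2 are host-accessible float4 arrays with xyz = acceleration and w = potential; that would need to be double-checked against the actual layout):

#include <cstdio>
#include <cmath>
#include <vector_types.h>   // float4, from the CUDA toolkit

// Print how many bodies have non-finite FMM results and where the worst
// relative acceleration error is, to see whether bad values cluster.
void compareAcc(const float4* fmmAcc, const float4* directAcc, int numBodies)
{
  int numBad = 0, worstIdx = -1;
  float worstErr = 0.0f;
  for (int i = 0; i < numBodies; i++) {
    const float dx = fmmAcc[i].x - directAcc[i].x;
    const float dy = fmmAcc[i].y - directAcc[i].y;
    const float dz = fmmAcc[i].z - directAcc[i].z;
    const float ref = sqrtf(directAcc[i].x * directAcc[i].x
                          + directAcc[i].y * directAcc[i].y
                          + directAcc[i].z * directAcc[i].z) + 1e-30f;
    const float err = sqrtf(dx * dx + dy * dy + dz * dz) / ref;
    if (!std::isfinite(err)) { numBad++; continue; }
    if (err > worstErr) { worstErr = err; worstIdx = i; }
  }
  printf("non-finite FMM results: %d / %d, worst rel. acc error: %g at body %d\n",
         numBad, numBodies, worstErr, worstIdx);
}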

@rioyokota (Contributor)

The code was working at some point. Perhaps you could check out older revisions to see if they work?

@rioyokota (Contributor)

Also, this is the original code:
https://github.com/treecode/Bonsai
Maybe that one works?
It's much more complicated than my simplified version, though.

chaithyagr commented Aug 21, 2019

I just made some changes to the code.
It looks like drand48 is somehow not returning a pseudo-random number as expected, and all values are zero.
I updated the code to initialize with a uniform random number between 0 and 1 (sketched at the end of this comment), and I no longer see NaN as the L2 error:

--- FMM Profiling ----------------
Stack size : 16.5997 MB
Cell data : 7.99998 MB
Get bounds : 0.0005000 s
Grow tree : 0.0081382 s
Link tree : 0.0033362 s
Make groups : 0.0113270 s
Upward pass : 0.0012829 s
Traverse : 10.7919180 s (0.3498061 TFlops)
--- Total runtime ----------------
Total FMM : 10.8279550 s (0.3486419 TFlops)
Total Direct : 0.0099740 s (0.9419712 TFlops)
--- FMM vs. direct ---------------
Rel. L2 Error (pot) : 5.4503967e-03
Rel. L2 Error (acc) : 2.9218650e-02

--- Tree stats -------------------
Bodies : 524287
Cells : 4680
Tree depth : 4

For large sizes, though, I run into CUDA launch timeouts, since I have an active display attached to the GPU.
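
For completeness, the change was roughly of this form; the function and variable names here are placeholders rather than the exact code in the repo, and the charge scaling is a guess:

#include <random>
#include <vector_types.h>   // float4, from the CUDA toolkit

// Initialize body positions (and a small charge) with an explicit uniform
// generator in [0, 1), instead of relying on drand48().
void initBodies(float4* bodies, int numBodies)
{
  std::mt19937 gen(12345);                                // fixed seed for reproducibility
  std::uniform_real_distribution<float> uni(0.0f, 1.0f);
  for (int i = 0; i < numBodies; i++) {
    bodies[i].x = uni(gen);
    bodies[i].y = uni(gen);
    bodies[i].z = uni(gen);
    bodies[i].w = uni(gen) / numBodies;                   // charge; the scaling is a guess
  }
}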
