Not a number in L2 error #1

Open
pkestene opened this issue Mar 5, 2019 · 6 comments

pkestene commented Mar 5, 2019

Hi,

I have tested several GPU architectures (sm_35, sm_50) with different CUDA versions (8.0 and 10.0), and I keep getting the following in the output log:

--- FMM vs. direct ---------------
Rel. L2 Error (pot) : nan
Rel. L2 Error (acc) : nan

The bodyAcc array contains lots of NaNs.
cuda-memcheck does not complain.

I don't know where this problem could originate.

Can anyone help identify the problem, or confirm/invalidate this behavior?

@rioyokota (Contributor)

Managing 15 students doesn't give me much time to code or debug anymore.
I can give you some hints, though.
There must be something wrong with the initialization of the bodies.
Could you check the coordinates and charges of the initialized bodies?
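
Something along these lines would be enough to rule that out; it is only a rough sketch, and it assumes the bodies end up in a host-side float4 array with x/y/z = position and w = charge, so adjust it to the actual layout:

#include <cstdio>
#include <cmath>
#include <vector_types.h>   // float4, from the CUDA toolkit

// Quick sanity check on the initialized bodies: non-finite coordinates,
// all-zero charges, and the charge range.
void checkBodies(const float4* bodies, int numBodies)
{
  int badPos = 0, zeroCharge = 0;
  float minQ = 1e30f, maxQ = -1e30f;
  for (int i = 0; i < numBodies; i++) {
    const float4 b = bodies[i];
    if (!std::isfinite(b.x) || !std::isfinite(b.y) || !std::isfinite(b.z)) badPos++;
    if (b.w == 0.0f) zeroCharge++;
    if (b.w < minQ) minQ = b.w;
    if (b.w > maxQ) maxQ = b.w;
  }
  printf("bodies: %d, non-finite positions: %d, zero charges: %d, charge range: [%g, %g]\n",
         numBodies, badPos, zeroCharge, minQ, maxQ);
}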

pkestene (Author) commented Mar 6, 2019

OK, I'll try to see what I can find and report back soon.
Thanks for sharing this impressive piece of code.

pkestene (Author) commented Mar 8, 2019

The data are initialized correctly; no problem there.

I tried a smaller configuration (1024 bodies), running on a K80 (sm_37).
As reported above, cuda-memcheck does not complain, but since dynamic parallelism
is not supported by cuda-memcheck, I don't really know whether there might still be memory problems.

Trying to analyze the problem turns out to be quite hard:

  • Without debug flags, there are NaNs everywhere in bodyAcc (FMM) but not in bodyAcc2 (direct).
  • With debug flags, I had to remove the __launch_bounds__ constraint on the buildOctant kernel because ptxas reported an error (not enough registers); see the sketch after this list.
    In that case, weirdly, the very first execution "seems" OK: the L2 error for the potential is rather low (5e-2 for 1024 particles), but the L2 error for the acceleration is near 1. Running the same executable with the same configuration multiple times then gives strongly wrong results, even for the direct computation (no NaNs, just wrong values).
    I have the feeling the GPU is then left in a weird state; I need to reload the driver to reset/clean the GPU before I can again observe a small L2 error for the potential.
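
For reference, instead of deleting the constraint for good, one could guard it so it only applies to non-debug device builds, something along these lines (the bound values and the signature below are placeholders, not the actual ones from the kernel):

// Keep __launch_bounds__ only for non-debug device builds, so ptxas does not
// fail on the register constraint under -G.
#if defined(__CUDACC_DEBUG__)    // set by nvcc for -G builds (CUDA 9.0+; CUDA 8 has no such macro)
  #define OCTANT_BOUNDS          // no register constraint in device-debug builds
#else
  #define OCTANT_BOUNDS __launch_bounds__(256, 4)   // placeholder numbers, not the real ones
#endif

// Placeholder signature, just to show where the attribute goes;
// the real buildOctant arguments are of course different.
__global__ void OCTANT_BOUNDS buildOctant(int level)
{
  // kernel body unchanged
}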

Running larger configurations (still with debug flags) gives "mostly" good results, i.e. FMM and direct agree, except at some locations where the FMM potential produces crazy large numbers.

I also don't know whether I can "trust" cuda-memcheck: when running the executable (built with -g -G -O0), cuda-memcheck sometimes reports an out-of-bounds error in the buildOctant kernel, and sometimes the execution never stops!

I'd like to figure out whether there is really a problem with buildOctant (possibly related to the recursive kernel launches, i.e. CUDA dynamic parallelism). Would you recommend a configuration (number of particles, NCRIT, ...) better suited to analyzing this behavior?
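
To narrow things down, I'm thinking of adding a small per-body comparison after both passes, something like this (I'm assuming bodyAcc / bodyAcc2 are host-accessible float4 arrays with xyz = acceleration and w = potential; that would need to be double-checked against the actual layout):

#include <cstdio>
#include <cmath>
#include <vector_types.h>   // float4, from the CUDA toolkit

// Print how many bodies have non-finite FMM results and where the worst
// relative acceleration error is, to see whether bad values cluster.
void compareAcc(const float4* fmmAcc, const float4* directAcc, int numBodies)
{
  int numBad = 0, worstIdx = -1;
  float worstErr = 0.0f;
  for (int i = 0; i < numBodies; i++) {
    const float dx = fmmAcc[i].x - directAcc[i].x;
    const float dy = fmmAcc[i].y - directAcc[i].y;
    const float dz = fmmAcc[i].z - directAcc[i].z;
    const float ref = sqrtf(directAcc[i].x * directAcc[i].x
                          + directAcc[i].y * directAcc[i].y
                          + directAcc[i].z * directAcc[i].z) + 1e-30f;
    const float err = sqrtf(dx * dx + dy * dy + dz * dz) / ref;
    if (!std::isfinite(err)) { numBad++; continue; }
    if (err > worstErr) { worstErr = err; worstIdx = i; }
  }
  printf("non-finite FMM results: %d / %d, worst rel. acc error: %g at body %d\n",
         numBad, numBodies, worstErr, worstIdx);
}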

@rioyokota (Contributor)

The code was working at some point. Perhaps you could check out older revisions to see if they work?

@rioyokota (Contributor)

Also, this is the original code:
https://github.com/treecode/Bonsai
Maybe that one works?
It's much more complicated than my simplified version, though.

chaithyagr commented Aug 21, 2019

I just made some changes to the code.
It looks like drand48 is somehow not returning a pseudo-random number as expected, and all values are zero.
I updated the code to initialize with a uniform random number between 0 and 1 (sketched at the end of this comment), and I no longer see NaN as the L2 error:

--- FMM Profiling ----------------
Stack size : 16.5997 MB
Cell data : 7.99998 MB
Get bounds : 0.0005000 s
Grow tree : 0.0081382 s
Link tree : 0.0033362 s
Make groups : 0.0113270 s
Upward pass : 0.0012829 s
Traverse : 10.7919180 s (0.3498061 TFlops)
--- Total runtime ----------------
Total FMM : 10.8279550 s (0.3486419 TFlops)
Total Direct : 0.0099740 s (0.9419712 TFlops)
--- FMM vs. direct ---------------
Rel. L2 Error (pot) : 5.4503967e-03
Rel. L2 Error (acc) : 2.9218650e-02

--- Tree stats -------------------
Bodies : 524287
Cells : 4680
Tree depth : 4

For large sizes, though, I run into CUDA launch timeouts, since I have an active display attached to the GPU.
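
For completeness, the change was roughly of this form; the function and variable names here are placeholders rather than the exact code in the repo, and the charge scaling is a guess:

#include <random>
#include <vector_types.h>   // float4, from the CUDA toolkit

// Initialize body positions (and a small charge) with an explicit uniform
// generator in [0, 1), instead of relying on drand48().
void initBodies(float4* bodies, int numBodies)
{
  std::mt19937 gen(12345);                                // fixed seed for reproducibility
  std::uniform_real_distribution<float> uni(0.0f, 1.0f);
  for (int i = 0; i < numBodies; i++) {
    bodies[i].x = uni(gen);
    bodies[i].y = uni(gen);
    bodies[i].z = uni(gen);
    bodies[i].w = uni(gen) / numBodies;                   // charge; the scaling is a guess
  }
}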
