Data binning segfaults when data is in device memory on non-unified memory systems #1122
Comments
Hey @BenWibking, sorry for the delay. A reproducer would be great, and I can do my best to help you figure out this issue!
Thanks. I've put a reproducer here: parthenon-hpc-lab/athenapk#49. Let me know what you find.
@nicolemarsaglia I've rebuilt Ascent + TPLs with debugging info and I get a more informative backtrace. The segmentation fault happens here:
Here's
And
I've uploaded the core files here: https://cloudstor.aarnet.edu.au/plus/s/hTgYZQWYDYTPZn9
Thanks for the info! I'm a tad sick so I'm taking the rest of the day off (sorry!), but I can get back to this on Monday.
This is a very strange bug that I cannot reproduce on either Frontier or Summit. Somehow it appears to happen only on A100s.
Ok, I've traced the issue: the binning operation runs on the CPU and attempts to dereference a device pointer, because our code sends device-resident data to Ascent via zero-copy. This works on systems with unified memory, such as Summit and Frontier, but fails on systems without it.
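For anyone hitting the same thing, here is a minimal sketch of the usual workaround pattern: before CPU code touches a zero-copied buffer, check where it lives and stage device data through a host copy. This is not Ascent's or Conduit's code. `MemorySpace`, `FieldView`, `host_accessible`, and `bin_counts` are hypothetical names made up for illustration, and the device-to-host copy is simulated with `std::memcpy` so the sketch runs without a GPU. In a real CUDA build that step would be a `cudaMemcpy` with `cudaMemcpyDeviceToHost`, and the memory space could be queried with `cudaPointerGetAttributes` rather than carried in a tag.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical tag describing where a buffer lives (not an Ascent/Conduit type).
enum class MemorySpace { Host, Device };

// Non-owning view of field data handed to the analysis code via zero-copy.
struct FieldView {
    const double* data;
    std::size_t   count;
    MemorySpace   space;
};

// Stand-in for a device-to-host copy. ASSUMPTION: simulated with memcpy so the
// sketch runs anywhere; a CUDA build would call cudaMemcpy(..., cudaMemcpyDeviceToHost).
inline void copy_device_to_host(double* dst, const double* src, std::size_t count) {
    if (count == 0) return;
    std::memcpy(dst, src, count * sizeof(double));
}

// Returns a pointer that is safe to dereference on the CPU.
// Host data passes through untouched; device data is staged into `staging`.
inline const double* host_accessible(const FieldView& view, std::vector<double>& staging) {
    assert(view.data != nullptr || view.count == 0);
    if (view.space == MemorySpace::Host) {
        return view.data;
    }
    staging.resize(view.count);
    copy_device_to_host(staging.data(), view.data, view.count);
    return staging.data();
}

// CPU-side 1D binning: counts of values per uniform bin over [lo, hi).
inline std::vector<std::size_t> bin_counts(const FieldView& view,
                                           double lo, double hi, std::size_t nbins) {
    assert(nbins > 0 && hi > lo);
    std::vector<double> staging;
    const double* values = host_accessible(view, staging);  // never the raw device pointer

    std::vector<std::size_t> counts(nbins, 0);
    const double width = (hi - lo) / static_cast<double>(nbins);
    for (std::size_t i = 0; i < view.count; ++i) {
        if (values[i] < lo || values[i] >= hi) continue;     // skip out-of-range values
        std::size_t bin = static_cast<std::size_t>((values[i] - lo) / width);
        if (bin >= nbins) bin = nbins - 1;                   // guard against rounding at the edge
        ++counts[bin];
    }
    return counts;
}
```

The point of the design is that the CPU loop only ever sees the pointer returned by `host_accessible`, so a device-resident buffer costs one extra copy instead of a segfault. On unified-memory systems the copy is unnecessary, which is presumably why the problem never showed up there.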
Thanks for confirming this behavior. We will work to resolve these limitations for binning.
Data binning works fine when running on CPU. However, when running this actions file
on A100 GPUs, I get a segmentation fault:
I can provide a full reproducer if needed.