ChaNGa crashes/hangs with UCX machine layer (in SMP mode) on Frontera #2636
With just export UCX_ZCOPY_THRESH=-1, I saw a hang during the initial domain decomposition.
With both export UCX_ZCOPY_THRESH=-1 and +ucx_rndv_thresh=2048, ChaNGa ran 2 big steps (3553 and 3554), but didn't complete in 30 minutes and was hence killed by the scheduler. @trquinn was seeing another hang during the load balancing phase but mentioned that setting …
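For reference, a minimal sketch of how such a run is launched, assuming TACC's ibrun launcher on Frontera; the binary path, PE counts, and parameter file are placeholders:

    # Hypothetical SMP launch combining both workarounds discussed above
    export UCX_ZCOPY_THRESH=-1                     # disable UCX zero-copy sends
    ibrun ./ChaNGa +ppn 27 +ucx_rndv_thresh=2048 dwf1b.param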
@nitbhat, can you please try UCX_IB_RX_MAX_BUFS=32768 without any other settings?
Okay, I'll try that.
@brminich: I haven't been able to test that setting yet. (Frontera was down for maintenance on Tuesday, and now for some reason I'm getting weird errors while launching the MPI job; I'm in conversation with TACC about it.) I'll test it as soon as I can.
@brminich: I tried different values for UCX_IB_RX_MAX_BUFS from 32k down to 2k, and I got the same error. For the case where I set UCX_IB_RX_MAX_BUFS to 2048, I saw this warning: …
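The sweep itself was along these lines; a sketch, assuming the same placeholder launch command as above and halving the receive-buffer pool each time:

    # Hypothetical sweep over UCX receive-buffer pool sizes
    for bufs in 32768 16384 8192 4096 2048; do
        UCX_IB_RX_MAX_BUFS=$bufs ibrun ./ChaNGa +ppn 27 dwf1b.param \
            > run_rx_${bufs}.log 2>&1
    done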
Following the suggestion in #2635, I tried running ChaNGa with the master branch of ucx. With dwf1b running on 2 nodes/4 processes, I get the failure:
This run works with ucx releases 1.6.1, 1.7, and 1.8.0. Bisection says the failure starts to happen at ucx git hash 35f6d1189c410aa06a3c8f5fb18805527da91cf7 (although that commit fails with a segfault earlier; the change that produces the registered-memory failure happens at 896d76b8762bc5d54f8f74fbc805a25ed404d055).
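The bisection can be reproduced with something like the following sketch; test_dwf1b.sh is a hypothetical script that rebuilds UCX and Charm and runs the 2-node/4-process dwf1b case, exiting nonzero on failure:

    cd ucx
    git bisect start
    git bisect bad master          # failure reproduces on master
    git bisect good v1.8.0         # last known-good release
    git bisect run ./test_dwf1b.sh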
@trquinn are there any errors in dmesg on the machine which failed to register memory?
@trquinn: How can I get access to the dwf1b benchmark? Is it the same as dwf1.6144 as listed in https://github.com/N-BodyShop/changa/wiki/ChaNGa-Benchmarks?
I tried the …
@trquinn: Have you tried running ChaNGa based on ucx master to reproduce the occasional hangs that you saw during load balancing?
I get past the registration error when I run with the non-SMP version. However, the run crashes after step 3553.875.
The assertion failure happens in the send completion callback, and the failure indicates that one of the sends didn't complete successfully.
I'm guessing that the status object can be queried to understand more about why the send failed.
Correct: that benchmark can be downloaded from Google Drive.
Yes, and I got similar errors to yours.
@trquinn I was able to reproduce the memory registration error that you were seeing on 2 nodes/4 processes while running the dwf1b benchmark.
@yosefe: Looking at the dmesg output, I don't see any errors related to memory registration. I'm attaching the dmesg output from both nodes after the crash occurred. dmesg_output_1009912_c191-034.txt
@nitbhat, are you running on Frontera?
@brminich: Yes, I was running that on Frontera. Sure.
Let me know if you have any questions (and whether you are able to reproduce the crash).
@nitbhat, thanks for the instructions.
Yes, it crashes every time I run on Frontera. How many nodes did you run it on? 4 nodes? When trying with non-SMP, it seems like there is an issue with memory, since I see this error:
I was running on 2 nodes with 4 processes.
Okay, I think you can run it on 4 nodes (with 28 cores each) to better match the 2 Frontera nodes (with 56 cores each). Yes, in some runs I saw similar 'Could not malloc - are we out of memory' errors from an SMP 2-node run as well. In the non-SMP case it's always that error; in the SMP case I sometimes see that error and other times the error related to memory registration. However, running it on 2 nodes with the MPI layer (both SMP and non-SMP) doesn't crash, but takes longer to complete. Interestingly, when I increase the number of nodes (to 4 and 8), I still see "out of memory" errors with UCX (and MPI runs successfully for those cases as well). I'll try to determine the exact memory usage for UCX runs on 2/4/8 nodes.
I checked on expected memory use: when running on a single SMP process, this benchmark uses 16.3GB.
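One way to pin down the per-process memory use; a sketch, assuming GNU time is installed and using the same placeholder binary and input as above:

    # Report peak resident set size for a single-process SMP run
    /usr/bin/time -v ./ChaNGa +ppn 55 dwf1b.param 2>&1 \
        | grep "Maximum resident set size"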
@nitbhat, maybe we can have a joint debug session on Frontera?
They just upgraded the OFED libraries (and the system-installed UCX) on Frontera. We should see if that makes a difference first.
@nitbhat, next week is ok
Note that a similar issue is reported in the UCX repository: openucx/ucx#5291
I've done a little more investigation on Frontera, using the master branches of ucx and charm (as of Aug. 18) and the dwf1b benchmark, running 8 processes on 4 nodes.
and the output at the time of the crash typically looks like this:
So: ucx is registering a large number of memory segments. The actual amount of memory is large but (I think) not excessive: the total memory used by each process is about 32GB. (Again, 2 procs/node.) But the total number of memory segments seems very large: each node has on the order of 100,000 memory segments registered with the IB interface. I'm wondering if there is some fragmentation in the ucx memory pool. Trying to reduce the number of receive buffers with …
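The segment count can be checked directly against the kernel's per-process mapping limit while a run is underway; a sketch, assuming shell access to a compute node while ChaNGa is running:

    # Count memory mappings of the newest ChaNGa process on this node
    pid=$(pgrep -n ChaNGa)
    wc -l /proc/$pid/maps
    # Compare against the per-process limit (65530 by default on many systems)
    sysctl vm.max_map_count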
Any chance this will be fixed in 6.11?
@trquinn Does the issue still occur with UCX 1.9.0?
I just tried with UCX v1.9.0 and Charm v6.11.0-beta. The issue still occurs.
@brminich: Do you have any insights as to what might be happening here? (Or on the linked issue in the UCX repo, openucx/ucx#5291?)
I doubt I have su privileges on Frontera; the current setting is sysctl: vm.max_map_count = 65530
@trquinn, is it possible to try UCX master and Charm from this branch? https://github.com/brminich/charm/tree/topic/ucx_van_using_am
I get compile errors when building that branch of charm. I'm using gcc 9.1.0. Here are the first few:
@trquinn, sorry, I forgot to test the SMP version. Can you please check now? (I updated the same branch.) It can be used with UCX master only.
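For anyone following along, building that branch against a UCX master install looks roughly like this. A sketch only: the UCX install prefix is a placeholder and the build options are assumptions:

    git clone -b topic/ucx_van_using_am https://github.com/brminich/charm
    cd charm
    # --basedir points the build at a from-source UCX install
    ./build ChaNGa ucx-linux-x86_64 smp --with-production \
        --basedir=$HOME/ucx-install -j16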
I ran some tests with this on Frontera. The good news is that my standard benchmark runs very well. It used to fail at around four nodes, and now scales up to 24 nodes.
The various .yml files in the repository are examples of how we run our test suite as part of continuous integration. For example, UCX non-SMP:
Has this been fixed?
No. This problem has become more widespread since OpenMPI version 4 is now built on top of UCX. A new example is from SDSC Expanse, building charm with …
We are seeing the same thing with SpECTRE. We've so far only tested on Frontera. Specifically, we also get … We're testing OpenMPI 4.1.2 on our own cluster using UCX 1.12.1. I'll report back if really long runs work fine. Frontera uses UCX 1.11.0, for what it's worth.
Slightly off topic, but if you all are using the MPI layer while the issues with UCX are being diagnosed, it might be worth trying to use the preposted MPI receives option. You can enable it by compiling with the … If you do try this out, we'd appreciate it if you'd share some data on how it affected performance for you.
I am testing a rebased version of the UCX AM branch from @brminich, and I was able to reproduce this crash on Bridges2. It occurs shortly after ChaNGa's startup with small node-count runs. I will start testing other configurations to see whether I can make any headway 👍
@jszaday, is it possible to try increasing the map count with …?
Unfortunately, since this is running on an externally managed machine, it doesn't seem like I can change that option. Is there anything else I can modulate that doesn't require …?
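For clusters where admin access is available, checking and raising the limit would look like this; a sketch, with 262144 as an arbitrary example value:

    # Check the current per-process mapping limit, then raise it (needs root)
    sysctl vm.max_map_count
    sudo sysctl -w vm.max_map_count=262144
    # Persist across reboots
    echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf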
The 64-node run with the h148 cosmo dataset caused assertion failures at different places.
ChaNGa_6.10_debug_64nodes_h148_run.txt
is the run output.
The most common assertion failure was:
Others were:
@brminich mentioned that the assertion failure
Assertion "!(((uintptr_t)(status_ptr)) >= ((uintptr_t)UCS_ERR_LAST))" failed in file machine.C line 570.
happens because of a memory allocation failure.