
Cello/Enzo-P hangs with UCX machine layer (in nonSMP mode) on Frontera #2635

Closed
nitbhat opened this issue Dec 4, 2019 · 54 comments

@nitbhat nitbhat commented Dec 4, 2019

This bug was reported by James Bordner/Mike Norman at the Charm++ BoF at SC 19.

Bug reproduction details:
Enzo-E/Cello version: https://github.com/jobordner/enzo-e.git (solver-dd branch)

Modules:
module load gcc ucx boost

Environment:

   export BOOST_HOME=$TACC_BOOST_DIR
   export CHARM_HOME=/home1/00369/tg456481/Charm/charm.6A0-ucx-gcc.default #(change if needed)
   export CELLO_ARCH=frontera_gcc
   export CELLO_PREC=single
   export HDF5_HOME=/home1/00369/tg456481

Input files:

      Copy /work/00369/tg456481/frontera/cello-data/Cosmo/OsNa05/OsNa05-N512/N512/*
   to run directory

Input file: see attached
enzoe-frontera-ucx.in.txt

Batch script:

#SBATCH -N 64               # Total # of nodes
#SBATCH -n 3584             # Total # of mpi tasks
...

Run command used: charmrun +p3584 ./enzo-p enzoe-frontera-ucx.in.txt

@nitbhat nitbhat added Bug UCX labels Dec 4, 2019
@nitbhat nitbhat added this to the 6.10.0 milestone Dec 4, 2019
@nitbhat nitbhat self-assigned this Dec 4, 2019
@nitbhat nitbhat changed the title Cello/Enzo-P hangs with the UCX machine layer (in nonSMP mode) on Frontera Cello/Enzo-P hangs with UCX machine layer (in nonSMP mode) on Frontera Dec 4, 2019

@nitbhat nitbhat commented Dec 4, 2019

While attempting to reproduce this issue, I got this HDF5 related error while linking the final binary:

Install file: "build/Cello/libcello.a" as "lib/libcello.a"
/home1/03808/nbhat4/software/charm/ucx-linux-x86_64-prod/bin/charmc -language charm++ -o build/Enzo/enzo-p -O3 -Wall -g -ffast-math -funroll-loops -rdynamic -module CommonLBs build/Enzo/enzo-p.o build/Cello/main_enzo.o -Llib -L/home1/03808/nbhat4/software/hdf5-1.8.14/build/lib -L/opt/apps/gcc9_1/boost/1.69/lib -L/usr/lib64/lib -lenzo -lcharm -lsimulation -ldata -lproblem -lcompute -lcontrol -lmesh -lio -ldisk -lmemory -lparameters -lerror -lmonitor -lparallel -lperformance -ltest -lcello -lexternal -lhdf5 -lz -ldl -lpng -lgfortran -lboost_filesystem -lboost_system
/opt/apps/gcc/9.1.0/bin/ld: lib/libdisk.a(disk_FileHdf5.o): in function `FileHdf5::FileHdf5(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)':
/home1/03808/nbhat4/software/enzo-e/build/Cello/disk_FileHdf5.cpp:43: undefined reference to `H5P_CLS_DATASET_CREATE_ID_g'
collect2: error: ld returned 1 exit status
Fatal Error by charmc in directory /home1/03808/nbhat4/software/enzo-e
   Command g++ -rdynamic -L/usr/lib64/ -O3 -Wall -g -ffast-math -funroll-loops -rdynamic -Llib -L/home1/03808/nbhat4/software/hdf5-1.8.14/build/lib -L/opt/apps/gcc9_1/boost/1.69/lib -L/usr/lib64/lib build/Enzo/enzo-p.o build/Cello/main_enzo.o moduleinit123437.o -L/home1/03808/nbhat4/software/charm/ucx-linux-x86_64-prod/bin/../lib -lmoduleCommonLBs -lckmain -lck -lmemory-default -lthreads-default -lconv-machine -lconv-core -ltmgr -lconv-util -lconv-partition -lhwloc_embedded -lm -lmemory-default -lthreads-default -lldb-rand -lconv-ldb -lckqt -lucp -luct -lucs -lucm -ldl -lenzo -lcharm -lsimulation -ldata -lproblem -lcompute -lcontrol -lmesh -lio -ldisk -lmemory -lparameters -lerror -lmonitor -lparallel -lperformance -ltest -lcello -lexternal -lhdf5 -lz -ldl -lpng -lgfortran -lboost_filesystem -lboost_system -lmoduleCommonLBs -lmoduleNDMeshStreamer -lmodulecompletion -lz -lm /home1/03808/nbhat4/software/charm/ucx-linux-x86_64-prod/bin/../lib/conv-static.o -o build/Enzo/enzo-p returned error code 1
charmc exiting...
scons: *** [build/Enzo/enzo-p] Error 1

I got past it by using HDF5 version 1.8.13 (all newer versions gave me this ^ error).

Another setting that can simplify the build process is setting "use_grackle = 0" on line 102 of the SConstruct file.


@nitbhat nitbhat commented Dec 5, 2019

As previously stated by James over email, the program ran successfully when executed on 1 or 2 nodes. However, the bug appeared when running on 4, 8, and 64 nodes.

The last output printed before the hang for a 4 node run is as follows:

0 00386.25 Performance refresh_store time-usec 0
0 00386.25 Performance refresh_child time-usec 0
0 00386.25 Performance refresh_exit time-usec 0
0 00386.25 Performance refresh_store_sync time-usec 0
0 00386.25 Performance refresh_child_sync time-usec 0
0 00386.25 Performance refresh_exit_sync time-usec 0
0 00386.25 Performance control time-usec 80702761
0 00386.25 Performance compute time-usec 14184066599
0 00386.25 Performance output time-usec 133512
0 00386.25 Performance stopping time-usec 4354327957
0 00386.25 Performance block time-usec 57974302110
0 00386.25 Performance exit time-usec 0
0 00386.25 Performance simulation max-proc-blocks 199
0 00386.25 Performance simulation max-proc-particles 618658
0 00386.25 Performance simulation balance-blocks 19.031216
0 00386.25 Performance simulation balance-eff-blocks 0.840116 (167/199)
0 00386.25 Performance simulation balance-particles 3.249693
0 00386.25 Performance simulation balance-eff-particles 0.968526 (599186/618658)


@nitbhat nitbhat commented Dec 9, 2019

Since @brminich suggested export UCX_ZCOPY_THRESH=-1 for ChaNGa, I used it for a 64-node Cello/Enzo-P run and it completed without a hang (tried about 10 times, and it ran successfully every time).

This was my run script:

#!/bin/bash
#SBATCH -J enzo
#SBATCH -p normal
#SBATCH -t 00:05:00
#SBATCH -N 64
#SBATCH -n 3584
#SBATCH --ntasks-per-node=56

cd /home1/03808/nbhat4/software/enzo-e

export UCX_ZCOPY_THRESH=-1
ibrun ./enzo-p-nonsmp ./enzoe-frontera-ucx.in

@brminich: Since the hang exists without UCX_ZCOPY_THRESH=-1 and disappears with UCX_ZCOPY_THRESH=-1, I'm guessing the abundance of small eager messages being copied with the auto threshold is causing the hang? It would be good if UCX reported an error rather than just hanging.


@nitbhat nitbhat commented Dec 11, 2019

@brminich just confirmed that setting UCX_ZCOPY_THRESH=-1 actually sets the zero-copy threshold to infinity, i.e. all sends will be buffered and will complete immediately.

So setting UCX_ZCOPY_THRESH=-1 should use more memory than the default UCX_ZCOPY_THRESH of 8k or 16k.


@nitbhat nitbhat commented Dec 11, 2019

I was able to get Enzo-P to hang on 4 nodes, attach gdb to each process, and collect stack traces. I'm attaching the stack traces here.

stack_1node.txt
stack_2node.txt
stack_3node.txt
stack_4node.txt


@nitbhat nitbhat commented Dec 13, 2019

This is the typical size/count profile of messages sent by one of the processes in the application for a 64-node (3584-process) run; other processes seem to follow a similar profile:

Msg Size Count
96 630602
1184 216714
416 144234
4256 107880
160 34736
80 24294
3248 12119
4272 12119
6336 12119
9440 12119
12480 11943
944 8039
1200 8039
1728 8039
2528 8039
5312 7967
12464 5980
16560 5980
24768 5980
32960 5980
37088 5980
36992 2147
2144 2000
672 1284
192 424
320 372
112 336
208 301
128 295
1712 214
2224 214
3264 214
4544 214
4832 214
7360 176
960 144
1328 101
224 100
2944 72
656 62
816 62
1152 62
1664 62
3712 62
2576 24
3376 24
4992 24
7424 24
8192 24
14368 4
272 3
304 3
384 3
512 3
2048 3
20 1
452 1


@brminich brminich commented Dec 17, 2019

@nitbhat, does setting +ucx_rndv_thresh=64 solve the issue?
If yes, we may consider setting it as a default for now


@nitbhat nitbhat commented Dec 17, 2019

@brminich: No, it doesn't. I saw the hang when I passed +ucx_rndv_thresh=64.


@brminich brminich commented Dec 17, 2019

@nitbhat, have you tried it with some other options or just +ucx_rndv_thresh=64?


@nitbhat nitbhat commented Dec 18, 2019

@brminich: For now, I've just tried it with +ucx_rndv_thresh=64. I could try with some other values too.


@nitbhat nitbhat commented Dec 18, 2019

Also, I was working on a simpler test case to replicate the UCX failures (with similar message patterns). I've added this test as /tests/charm++/bombard to the branch nitin/ucx_debugging.

This test has three modes (where p is the number of processes/cores):

- Mode 0, point-to-point: process 0 sends messages to process p-1.
- Mode 1, all-to-one: processes 0 to p-2 (p-1 processes in total) send messages to process p-1.
- Mode 2, all-to-all: processes 0 to p-1 send messages (broadcast) to all processes (0 to p-1).

Because I saw about 600,000 messages of 96 bytes on each process, I tried my test with similar settings on 4 nodes (ibrun ./bombard 2 0 3000, where 2 is the mode, 0 is the size of the buffer, and 3000 is the number of iterations for which the sender bombards the receiver) and saw that it crashes with the UCX layer. In the nitin/ucx_debugging branch, I'm also printing the message sizes and counts for each process. A rough sketch of the mode-2 pattern is included at the end of this comment.

A similar test with the MPI layer takes a very long time, so I have yet to determine whether it passes. However, the crash in the UCX layer suggests this test is worth pursuing for debugging, because it should be simpler to debug than running Cello.
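For reference, here is a minimal, hypothetical sketch of what the mode-2 (all-to-all) pattern looks like as a Charm++ program. This is not the actual tests/charm++/bombard code from the nitin/ucx_debugging branch; all names and defaults below are made up for illustration.

```cpp
// ---- bombardsketch.ci (Charm++ interface file, hypothetical) ----
mainmodule bombardsketch {
  readonly CProxy_Main mainProxy;
  readonly int numIters;
  readonly int payloadBytes;

  mainchare Main {
    entry Main(CkArgMsg *m);
    entry [reductiontarget] void done();
  };

  group Bomber {
    entry Bomber();
    entry void start();
    entry void recv(int len, char data[len]);
  };
}

// ---- bombardsketch.C (hypothetical) ----
#include <cstdlib>
#include <vector>
#include "bombardsketch.decl.h"

/* readonly */ CProxy_Main mainProxy;
/* readonly */ int numIters;
/* readonly */ int payloadBytes;

class Main : public CBase_Main {
 public:
  Main(CkArgMsg *m) {
    payloadBytes = (m->argc > 1) ? atoi(m->argv[1]) : 0;    // 0 mimics Cello's ~96-byte messages
    numIters     = (m->argc > 2) ? atoi(m->argv[2]) : 3000;
    mainProxy    = thisProxy;
    CProxy_Bomber::ckNew().start();   // broadcast start() to one Bomber per PE
    delete m;
  }
  void done() { CkPrintf("all %d iterations delivered everywhere\n", numIters); CkExit(); }
};

class Bomber : public CBase_Bomber {
  long received = 0;
 public:
  void start() {
    std::vector<char> payload(payloadBytes + 1);  // +1 so data() is never null for a 0-byte payload
    // Mode 2 (all-to-all): every PE broadcasts numIters small messages to the whole group.
    for (int i = 0; i < numIters; ++i)
      thisProxy.recv(payloadBytes, payload.data());
  }
  void recv(int len, char *data) {
    // Expect numIters messages from each of the CkNumPes() senders.
    if (++received == (long)numIters * CkNumPes())
      contribute(CkCallback(CkReductionTarget(Main, done), mainProxy));
  }
};

#include "bombardsketch.def.h"
```

It would be built with charmc like any other Charm++ test and launched the same way, e.g. ibrun ./bombardsketch 0 3000 for a 0-byte payload and 3000 iterations per sender.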


@nitbhat nitbhat commented Dec 18, 2019

I just verified that although the MPI-layer-based ibrun ./bombard 2 0 3000 runs slowly, it doesn't crash. I think the UCX-layer-based ibrun ./bombard 2 0 3000 run completes much faster (probably because sends are buffered and complete immediately) and hence crashes. I have yet to figure out the reason for this.


@trquinn trquinn commented Dec 18, 2019

Nitin's observations are consistent with the behavior I see with ChaNGa: places in the simulation where UCX crashes are places that take a very long time with IMPI.


@nitbhat nitbhat commented Dec 18, 2019

@brminich: I'm guessing that UCX sends complete immediately (by buffering the messages), while MPI, on the other hand, might have some implicit way of throttling (or throttles the sends because it waits for matching receives). Is there a way to disable immediate completion in UCX so that we can throttle the sends in some fashion?


@evan-charmworks evan-charmworks commented Dec 18, 2019

@brminich In addition to trying on the HPC Advisory Council's Thor, it may also be worth trying on Rome and Helios since they have some different hardware that may better match Frontera.

Frontera

Intel Xeon Platinum 8280 ("Cascade Lake"), Clock rate: 2.7Ghz ("Base Frequency")
Mellanox InfiniBand, HDR-100
System Interconnect: Frontera compute nodes are interconnected with HDR-100 links to each node, and HDR (200Gb) links between leaf and core switches. The interconnect is configured in a fat tree topology with a small oversubscription factor of 11:9.

Thor

Dual Socket Intel® Xeon® 16-core CPUs E5-2697A V4 @ 2.60 GHz
Mellanox ConnectX-6 HDR100 100Gb/s InfiniBand/VPI adapters
Mellanox Switch-IB 2 SB7800 36-Port 100Gb/s EDR InfiniBand switches
Mellanox Connect-IB® Dual FDR 56Gb/s InfiniBand adapters
Mellanox SwitchX SX6036 36-Port 56Gb/s FDR VPI InfiniBand switches

Helios

Dual Socket Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
Mellanox ConnectX-6 HDR/HDR100 200/100Gb/s InfiniBand/VPI adapters with Socket Direct
Mellanox HDR Quantum Switch QM7800 40-Port 200Gb/s HDR InfiniBand

Rome

Dual Socket AMD EPYC 7742 64-Core Processor @ 2.25GHz
Mellanox ConnectX-6 HDR 200Gb/s InfiniBand/Ethernet
Mellanox HDR Quantum Switch QM7800 40-Port 200Gb/s HDR InfiniBand


@brminich brminich commented Dec 19, 2019

@nitbhat,
I already tried your test on 2 different clusters. It works fine with 30000 iterations:
time mpirun -n 112 --report-bindings ./bombard 2 96 30000
real 1m4.365s
user 0m3.170s
sys 0m3.607s

But it crashes with 300000 iterations.
Looking into it


@brminich brminich commented Dec 19, 2019

It crashes with both the UCX machine layer and the MPI machine layer (using HPCX).
The crash happens due to OOM:

[Thu Dec 19 12:42:10 2019] Out of memory: Kill process 16879 (bombard) score 74 or sacrifice child
[Thu Dec 19 12:42:10 2019] Killed process 16879 (bombard) total-vm:61742852kB, anon-rss:9719228kB, file-rss:0kB, shmem-rss:5300kB

Running with either the UD or TCP transport did not help.

With Intel MPI it hangs for several minutes and then crashes with:

 [Thu Dec 19 13:22:18 2019] bombard[21841]: segfault at 68 ip 00007fb1bc69cc49 sp 00007fffee32f8c0 error 6 in libmpi.so.12.0[7fb1bc2ee000+758000]


@brminich brminich commented Dec 19, 2019

By the way, while memory consumption is rather high (~60G per node), +ucx_rndv_thresh=0 helps the following test pass:

time mpirun -n 112 -x UCX_TLS=sm,self,ud  ./bombard 2 96 100000 +ucx_rndv_thresh=0
real    0m3.817s
user    0m0.054s
sys     0m0.092s


@nitbhat nitbhat commented Dec 19, 2019

Okay, I suspected that (OOM). By the way, I see that you're passing 96 in your argument list. Note that the size passed there is added on top of the Charm++ envelope that is already created. So, to get a total message of 96 bytes (and mimic Cello), you'd have to pass 0.
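In other words (a hypothetical back-of-the-envelope helper, assuming the envelope plus header overhead on this build is roughly 96 bytes as stated above; illustration only, not part of the bombard test):

```cpp
// Hypothetical helper: given total on-the-wire message sizes from the Cello
// profile and an assumed ~96-byte Charm++ envelope/header overhead, the
// payload to pass to the test is just the difference.
#include <cstdio>

int main() {
  const int kOverheadBytes = 96;                        // assumption, not measured here
  const int observedTotals[] = {96, 416, 1184, 4256};   // common sizes from the 64-node profile
  for (int total : observedTotals) {
    int payload = total - kOverheadBytes;
    std::printf("total %4d bytes -> pass payload %d\n", total, payload);
  }
  return 0;
}
```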


@nitbhat nitbhat commented Dec 19, 2019

@brminich: Also, is there a setting (an environment variable) that can disable immediate sends and complete sends only after the receives have been posted? If so, I would like to try it.


@brminich brminich commented Dec 19, 2019

No, there is no such setting. But how would that help? Does Charm++ issue all sends in a nonblocking manner? Can we limit its injection rate somehow? Does Charm++ provide blocking send capabilities?
The main problem is that the application (or the bombard test) may use an enormous amount of memory, more than is actually available.


@nitbhat nitbhat commented Dec 19, 2019

Oh okay. I was thinking it would help because we could then narrow the problem down to the bombardment of sends to the receiver (and having that control would let us throttle the sends).

Yes, charm issues all the sends in a non-blocking manner. Not that I'm aware of. We'd have to use a scheme where the sender buffers the messages until the receiver is ready (but I'm guessing the UCX layer already does that right now).
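To make the sender-side buffering/throttling idea concrete, here is a purely illustrative sketch of a credit-based send throttle. None of these names exist in Charm++ or UCX, and neither runtime currently does this; it only shows the kind of scheme being discussed.

```cpp
// Illustrative credit-based send throttle: cap the number of outstanding
// sends and defer the rest until earlier ones complete (or are acked).
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

class SendThrottle {
  std::size_t credits_;                        // remaining outstanding-send budget
  std::deque<std::function<void()>> pending_;  // deferred sends, FIFO order

 public:
  explicit SendThrottle(std::size_t maxOutstanding) : credits_(maxOutstanding) {}

  // Issue the send now if a credit is available, otherwise queue it.
  void send(std::function<void()> doSend) {
    if (credits_ > 0) {
      --credits_;
      doSend();
    } else {
      pending_.push_back(std::move(doSend));
    }
  }

  // Called when a send completes (or is acknowledged): either launch the
  // next queued send, or return the credit to the pool.
  void onComplete() {
    if (!pending_.empty()) {
      auto next = std::move(pending_.front());
      pending_.pop_front();
      next();
    } else {
      ++credits_;
    }
  }
};
```

The trade-off is the usual one: a small credit limit bounds memory at the sender but can serialize communication, while a large limit approaches the current unthrottled behavior.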


@brminich brminich commented Dec 19, 2019

> (but I'm guessing the UCX layer already does that right now).

Unfortunately not for all messages (for instance, not for small ones).


@nitbhat nitbhat commented Dec 19, 2019

You mean it doesn't buffer the small messages but buffers the large ones? That's strange.

Also, were you able to reproduce the Cello hang on Thor (or the other machines at the HPC Advisory Council)?

When I tried running Cello on Thor, I couldn't get the run past the initial stage:

0 00001.34 Define LINKFLAGS           -O3 -Wall -g -ffast-math -funroll-loops  -rdynamic   -module CommonLBs
0 00001.34 Define BUILD HOST          login01.hpcadvisorycouncil.com
0 00001.34 Define BUILD DIR           /global/home/users/nitinb/enzo-e
0 00001.34 Define BUILD DATE (UTC)    2019-12-18
0 00001.34 Define BUILD TIME (UTC)    18:22:14
0 00001.34 Define CELLO_CHARM_PATH    /global/home/users/nitinb/charm/mpi-linux-x86_64-prod
0 00001.34 Define NEW_OUTPUT          no
0 00001.34 Define CHARM_VERSION 61000
0 00001.34 Define CHANGESET     937ed39ac5aa2cea70fd727c79711819a325ddcb
0 00001.34 Define CHARM_BUILD         mpi-linux-x86_64-prod
0 00001.34 Define CHARM_NEW_CHARM Yes
0 00001.34 Define CONFIG_NODE_SIZE    64
0 00001.34 CHARM CkNumPes()           256
0 00001.34 CHARM CkNumNodes()         256
0 00001.34  BEGIN ENZO-P
0 00001.34 Memory bytes 0 bytes_high 0

I'm not sure if this is a hang or a very slow run. (This was seen for both MPI and UCX runs on 1, 2, 4, 16, and 32 nodes.)


@nitbhat nitbhat commented Feb 19, 2020

It looks like the chances of a hang increase as the number of ranks per node increases. Here are the results of 224-rank runs with different configurations on Frontera.


7 nodes - 224 ranks - 32 ranks/node - 4/10 hangs
14 nodes - 224 ranks - 16 ranks/node - 1/10 hangs
28 nodes - 224 ranks - 8 ranks/node - 1/10 hangs
56 nodes - 224 ranks - 4 ranks/node - 0/10 hangs


@nitbhat nitbhat commented Feb 20, 2020

On Thor, when I try building and running Enzo-P, I don't get past:

0 00006.79 Define Simulation processors 1024
0 00006.79 Define CELLO_ARCH          frontera_gcc
0 00006.79 Define CELLO_PREC          single
0 00006.79 Define CC                  gcc
0 00006.79 Define CFLAGS              -O3 -Wall -g -ffast-math -funroll-loops
0 00006.79 Define CPPDEFINES          CONFIG_PRECISION_SINGLE SMALL_INTS {'CONFIG_NODE_SIZE': 64} {'CONFIG_NODE_SIZE_3': 192} NO_FREETYPE CONFIG_USE_PERFORMANCE CONFIG_NEW_CHARM CONFIG_HAVE_VERSION_CONTROL
0 00006.79 Define CPPPATH             #/include /global/home/users/nitinb/hdf5-1.10.5/build/include /global/home/users/nitinb/boost_1_69_0/build//include /usr/lib64/include
0 00006.79 Define CXX                 /global/home/users/nitinb/charm/ucx-linux-x86_64-prod/bin/charmc -language charm++
0 00006.79 Define CXXFLAGS            -O3 -Wall -g -ffast-math -funroll-loops     -balancer CommonLBs
0 00006.79 Define FORTRANFLAGS        -O3 -Wall -g -ffast-math -funroll-loops
0 00006.79 Define FORTRAN             gfortran
0 00006.79 Define FORTRANLIBS         gfortran
0 00006.79 Define FORTRANPATH         #/include
0 00006.79 Define LIBPATH             #/lib /global/home/users/nitinb/hdf5-1.10.5/build/lib /global/home/users/nitinb/boost_1_69_0/build//lib /usr/lib64/lib
0 00006.79 Define LINKFLAGS           -O3 -Wall -g -ffast-math -funroll-loops  -rdynamic   -module CommonLBs
0 00006.79 Define BUILD HOST          login02.hpcadvisorycouncil.com
0 00006.79 Define BUILD DIR           /global/home/users/nitinb/enzo-e
0 00006.79 Define BUILD DATE (UTC)    2020-02-19
0 00006.79 Define BUILD TIME (UTC)    21:55:06
0 00006.79 Define CELLO_CHARM_PATH    /global/home/users/nitinb/charm/ucx-linux-x86_64-prod
0 00006.79 Define NEW_OUTPUT          no
0 00006.79 Define CHARM_VERSION 61000
0 00006.79 Define CHANGESET     937ed39ac5aa2cea70fd727c79711819a325ddcb
0 00006.79 Define CHARM_BUILD         ucx-linux-x86_64-prod
0 00006.79 Define CHARM_NEW_CHARM Yes
0 00006.79 Define CONFIG_NODE_SIZE    64
0 00006.79 CHARM CkNumPes()           1024
0 00006.79 CHARM CkNumNodes()         1024
0 00006.79  BEGIN ENZO-P
0 00006.79 Memory bytes 0 bytes_high 0

I think it is a hang. @brminich: You mentioned that you were able to run Enzo-P successfully about 45 times; was that on Thor? What was the output of module list?


@brminich brminich commented Feb 21, 2020

@Nithin, no, it was on the local jazz cluster.
Did you run it on Thor using the instructions you gave me a while ago?


@nitbhat nitbhat commented Feb 21, 2020


@nitbhat nitbhat commented Mar 18, 2020

I recently ran another set of experiments on Frontera with 4 nodes and 64 nodes, using the ucx machine layer and the mpi machine layer (with different backends). My findings suggest that although the hangs occur more often with the UCX machine layer, they are not limited to it: they show up with the MPI layer as well, even when it doesn't use UCX (as I saw when using the MPI layer built on MPICH without UCX on 64 nodes).

4 node runs with 224 ranks:

| Machine Layer | Backend Software | Runs over UCX | # Total Runs | # Completed | # Crashed | # Hung | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ucx-nonsmp | simple pmi | Yes | 10 | 5 | 0 | 5 | Inconsistent hang during simulation |
| ucx-nonsmp | pmix (through OMPI) | Yes | 10 | 6 | 0 | 4 | Inconsistent hang during initialization |
| mpi-nonsmp | Intel MPI | Yes | 10 | 10 | 0 | 0 | |
| mpi-nonsmp | MPICH (built with ucx) | Yes | 10 | 10 | 0 | 0 | |
| mpi-nonsmp | MPICH (built without ucx) | No | 10 | 0 | 0 | 10 | Consistent hang that seems to be during startup |
| mpi-nonsmp | Open MPI (built with ucx) | Yes | 10 | 10 | 0 | 0 | |
| mpi-nonsmp | Open MPI (built without ucx) | Yes | 10 | 10 | 0 | 0 | |

64 node runs with 3584 ranks

| Machine Layer | Backend Software | Runs over UCX | # Total Runs | # Completed | # Crashed | # Hung | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ucx-nonsmp | simple pmi | Yes | 10 | 0 | 0 | 10 | |
| ucx-nonsmp | pmix (through OMPI) | Yes | 10 | 6 | 1 | 3 | 1 crash due to segmentation fault described in enzo-project/enzo-e#35 |
| mpi-nonsmp | Intel MPI | Yes | 10 | 10 | 0 | 0 | |
| mpi-nonsmp | MPICH (built with ucx) | Yes | 10 | 0 | 10 | 0 | Crash due to segmentation fault described in enzo-project/enzo-e#35 |
| mpi-nonsmp | MPICH (built without ucx) | No | 10 | 0 | 0 | 10 | Consistent hang that seems to be during startup |
| mpi-nonsmp | Open MPI (built with ucx) | Yes | 10 | 10 | 0 | 0 | |
| mpi-nonsmp | Open MPI (built without ucx) | Yes | 10 | 0 | 8 | 2 | Crash reports "ORTE has lost communication with a remote daemon"; hang site seems similar to the one seen with the ucx machine layer |

Excel with more info about results and runs: https://docs.google.com/spreadsheets/d/1IFomjSjWbnszt_iB6MtnnsGc0Sny0HzE1fyFFtz1D_s/edit?usp=sharing


@nitbhat nitbhat commented Mar 27, 2020

For a 4-node nonsmp run with 224 processes (56 processes/node), I was able to get a stack trace for each of the processes.
stack_trace_hang_enzo_699121.txt


@nitbhat nitbhat commented Mar 27, 2020

I've filtered out the unnecessary lines here.
stack_trace_hang_enzo_699121_reduced.txt


@nitbhat nitbhat commented Mar 27, 2020

Below are the counts of the calls listed in frame 0 of the backtraces obtained from the run.

     12 call_cblist_keep
      3 call_cblist_remove
     15 CcdCallBacks
      3 ccd_heap_update
      4 CcdRaiseCondition
      2 CdsFifo_Dequeue
      2 CkQ<void*>::deq
     25 clock_gettime
      7 CmiGetNonLocal
      9 CmiHandleImmediate
      1 CmiIdleLock_checkMessage
      5 CmiNotifyStillIdle
      6 CmiWallTimer
      1 CqsDequeue
      4 CsdNextMessage
      7 CsdScheduleForever
      1 CsdStillIdle
      1 fmax
      1 fmin
      9 gethrctime
      9 LrtsAdvanceCommunication
      4 LrtsStillIdle
      1 LrtsTryLock
      2 LrtsUnlock
      8 PCQueuePop
      1 remove_n_elems
     15 std::chrono::__duration_cast_impl<std::chrono::duration<double,
      1 std::chrono::duration_cast<std::chrono::duration<double,
      9 std::chrono::duration<double,
      6 std::chrono::duration<long,
     11 std::chrono::operator-<long,
      4 std::chrono::operator-<std::chrono::_V2::steady_clock,
      4 std::chrono::time_point<std::chrono::_V2::steady_clock,
      1 ucp_tag_probe_nb@plt
      2 ucp_worker_progress
      2 ucs_arbiter_dispatch
      1 ucs_async_check_miss
      2 ucs_mpmc_queue_is_empty
      2 uct_dc_mlx5_iface_progress
      1 uct_dc_mlx5_poll_tx
      3 uct_ib_mlx5_cqe_is_hw_owned
      6 uct_ib_mlx5_get_cqe
      7 uct_mm_iface_poll_fifo
      4 uct_mm_iface_progress


@nitbhat nitbhat commented Mar 27, 2020

This is an additional stack trace from another run that hung.
stack_trace_hang_enzo_699682_reduced.txt

In both stack traces, I see 224 instances of CsdScheduleForever. So I tried determining which function is being called from CsdScheduleForever and got the following breakdown for the two jobs.


login1.frontera(1301)$ grep -ir "CsdScheduleForever" ./numbered_699121 -B 1 | grep -v "CsdScheduleForever" | grep -i "^#" | awk -F " " '{ if($2 ~/^[0-9]/) print $4; else print $2 }' | sort | uniq -c
     62 CcdCallBacks
     55 CsdNextMessage
    100 CsdStillIdle
      7 main
login1.frontera(1300)$ grep -ir "CsdScheduleForever" ./numbered_699682 -B 1 | grep -v "CsdScheduleForever" | grep -i "^#" | awk -F " " '{ if($2 ~/^[0-9]/) print $4; else print $2 }' | sort | uniq -c
     86 CcdCallBacks
     45 CsdNextMessage
     91 CsdStillIdle
      2 main
  1. main being printed is a false alarm. It's just the output from a previous interfering stack trace, which means that CsdScheduleForever is the last function shown in the stack trace.

  2. CsdStillIdle executes from SCHEDULE_IDLE macro when the scheduler is doing nothing.

  3. CsdNextMessage executes to dequeue and get the next message (from different queues). Digging in deeper into CsdNextMessage, there's no real evidence of there being an incoming message.


login1.frontera(1327)$ grep -ir "CsdNextMessage" ./numbered_699682 -B 1 | grep -v "CsdNextMessage" | grep -i "^#" | awk -F " " '{ if($2 ~/^[0-9]/) print $4; else print $2 }' | sort | uniq -c
      3 CdsFifo_Dequeue
     33 CmiGetNonLocal
      2 CqsDequeue
      7 main
login1.frontera(1328)$ grep -ir "CsdNextMessage" ./numbered_699121 -B 1 | grep -v "CsdNextMessage" | grep -i "^#" | awk -F " " '{ if($2 ~/^[0-9]/) print $4; else print $2 }' | sort | uniq -c
      4 CdsFifo_Dequeue
     46 CmiGetNonLocal
      1 CqsDequeue
      4 main
  4. CcdCallBacks is called periodically from the scheduler, via CsdPeriodic.
login1.frontera(1329)$ string="CcdCallBacks" && grep -ir "$string" ./numbered_699682 -B 1 | grep -v "$string" | grep -i "^#" | awk -F " " '{ if($2 ~/^[0-9]/) print $4; else print $2 }' | sort | uniq -c
      5 ccd_heap_update
     56 CmiWallTimer
      1 fmax@plt
      3 fmin
     21 main
login1.frontera(1330)$ string="CcdCallBacks" && grep -ir "$string" ./numbered_699121 -B 1 | grep -v "$string" | grep -i "^#" | awk -F " " '{ if($2 ~/^[0-9]/) print $4; else print $2 }' | sort | uniq -c
      3 ccd_heap_update
     42 CmiWallTimer
      1 fmax
      1 fmin
     15 main

From all of this, it definitely looks like all 224 processes are just spinning idle, waiting for messages, and not acting on any messages.


@nitbhat nitbhat commented Apr 16, 2020

I developed a message tracking infrastructure that adds every Charm/Converse message leaving the sender PE to a hashmap (with a unique id); on delivery, the receiver sends an acknowledgment message back to the sender, and on successful delivery the hashmap entry is erased on the sender.

The first draft of the message tracking infrastructure is here (https://github.com/UIUC-PPL/charm/tree/nitin/trackMessages). It is enabled only with CMK_ERROR_CHECKING and a runtime option of +trackMsgs.
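Conceptually, the sender-side bookkeeping is roughly the sketch below. This is illustrative only; the real hooks live in the Converse send/receive paths on the nitin/trackMessages branch, and the type and function names here are made up.

```cpp
// Sketch of per-PE tracking of in-flight messages: record each outgoing
// message under a unique id, erase it when the receiver's ack comes back,
// and dump whatever is left when a hang is suspected.
#include <cstdint>
#include <cstdio>
#include <unordered_map>

struct TrackedMsg {
  int destPe;        // destination PE
  int entryPoint;    // Charm++ entry-method index, if known
  std::size_t size;  // message size in bytes
};

class MsgTracker {
  std::unordered_map<std::uint64_t, TrackedMsg> inFlight_;
  std::uint64_t nextId_ = 0;

 public:
  // Called on the sender just before a message leaves this PE; the returned
  // id travels with the message (e.g. in the envelope).
  std::uint64_t onSend(int destPe, int entryPoint, std::size_t size) {
    std::uint64_t id = nextId_++;
    inFlight_[id] = {destPe, entryPoint, size};
    return id;
  }

  // Called on the sender when the receiver's acknowledgment arrives.
  void onAck(std::uint64_t id) { inFlight_.erase(id); }

  // On a suspected hang, anything still here was never acknowledged.
  void dumpUndelivered() const {
    for (const auto& kv : inFlight_)
      std::printf("msg %llu -> PE %d, EP %d, %zu bytes\n",
                  (unsigned long long)kv.first, kv.second.destPe,
                  kv.second.entryPoint, kv.second.size);
  }
};
```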

On running it with Enzo, I was able to get Enzo to hang and then using a script, I was able to print out the entries of the hashmap. I have attached the output of one such hung job. The hashmap entries are printed for each processor at the very end of the logs.

enzo-output-4nodes-ucx-debug-track2-731601-tracked.out.txt

In all three runs that hung (and for which I collected message tracking statistics), two PEs out of 224 showed undelivered messages. In this log, they were PEs 31 and 192. As seen in the log, PE 31 has 117 un-acked entries, all corresponding to charm entry point 161. PE 192 has 302 un-acked entries, corresponding to both charm entry points 179 and 161.

For the three runs that hung (and I tracked), the PEs and the number of entries varied, but all of them corresponded to charm entry point 161 and 179.

On printing out the entry method names and their charm entry point indices, I found that 161 corresponds to p_control_sync_count(int entry_point, int id, int count) and 179 corresponds to p_new_refresh_recv(MsgRefresh* impl_msg) in Enzo-P.


@nitbhat nitbhat commented May 4, 2020

One of my Enzo-P jobs built over UCX (pre-installed module ucx/1.6.1) caused Frontera's IB cards to fail, which then caused the compute nodes to crash. I'll post an update here once I hear back from TACC about the exact issue.


@nitbhat nitbhat commented May 5, 2020

From Tommy Minyard @ TACC:

this appears to be from job 775543; here's what we get in the log when your job was running:

[198307.229554] TACC: Running job 775543 for user nbhat4
[198542.853543] mlx5_core 0000:5e:00.0: device's health compromised - reached miss count
[198542.861625] mlx5_core 0000:5e:00.0: assert_var[0] 0x00000001
[198542.867495] mlx5_core 0000:5e:00.0: assert_var[1] 0x2097ce84
[198542.873367] mlx5_core 0000:5e:00.0: assert_var[2] 0x00000000
[198542.879232] mlx5_core 0000:5e:00.0: assert_var[3] 0x00000000
[198542.885107] mlx5_core 0000:5e:00.0: assert_var[4] 0x00000000
[198542.890975] mlx5_core 0000:5e:00.0: assert_exit_ptr 0x20a4996c
[198542.897021] mlx5_core 0000:5e:00.0: assert_callra 0x20a49a10
[198542.902895] mlx5_core 0000:5e:00.0: fw_ver 20.26.4012
[198542.908161] mlx5_core 0000:5e:00.0: hw_id 0x0000020f
[198542.913332] mlx5_core 0000:5e:00.0: irisc_index 0
[198542.918249] mlx5_core 0000:5e:00.0: synd 0x7: irisc not responding
[198542.924637] mlx5_core 0000:5e:00.0: ext_synd 0x40d0
[198542.929726] mlx5_core 0000:5e:00.0: raw fw_ver 0x141a0fac

that was from node c122-021, there were crashes on at least two other nodes with similar errors for other jobs, here is the one that crashed node c138-061:

[263011.149752] TACC: Running job 776827 for user nbhat4
[263212.049785] mlx5_core 0000:5e:00.0: device's health compromised - reached miss count
[263212.057864] mlx5_core 0000:5e:00.0: assert_var[0] 0x00000001
[263212.063730] mlx5_core 0000:5e:00.0: assert_var[1] 0x2097dc04
[263212.069592] mlx5_core 0000:5e:00.0: assert_var[2] 0x00000000
[263212.075455] mlx5_core 0000:5e:00.0: assert_var[3] 0x00000000
[263212.081316] mlx5_core 0000:5e:00.0: assert_var[4] 0x00000000
[263212.087180] mlx5_core 0000:5e:00.0: assert_exit_ptr 0x2080e34c
[263212.093220] mlx5_core 0000:5e:00.0: assert_callra 0x20ac8944
[263212.099088] mlx5_core 0000:5e:00.0: fw_ver 20.26.4012
[263212.104349] mlx5_core 0000:5e:00.0: hw_id 0x0000020f
[263212.109520] mlx5_core 0000:5e:00.0: irisc_index 2
[263212.114433] mlx5_core 0000:5e:00.0: synd 0x7: irisc not responding
[263212.120813] mlx5_core 0000:5e:00.0: ext_synd 0x40c0
[263212.125894] mlx5_core 0000:5e:00.0: raw fw_ver 0x141a0fac

unfortunately, that's about all we get from the system side, once the card is in this state, Lustre errors start and threads just hang and start dumping stack traces

[263393.108635] INFO: task enzo-p-nonsmp-u:336412 blocked for more than 120 seconds.
[263393.116352] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[263393.124508] enzo-p-nonsmp-u D ffff937d01f64100     0 336412      1 0x00000084
[263393.132017] Call Trace:
[263393.134689]  [<ffffffff939ea896>] ? handle_pte_fault+0x316/0xd10
[263393.140915]  [<ffffffff93b969cd>] ? list_del+0xd/0x30
[263393.146183]  [<ffffffff93f69f19>] schedule+0x29/0x70
[263393.151360]  [<ffffffff93f67a21>] schedule_timeout+0x221/0x2d0
[263393.157427]  [<ffffffffc04c5be7>] ? _mlx5_ib_post_send+0x3b7/0x13e0 [mlx5_ib]
[263393.164887]  [<ffffffff93f6a2cd>] wait_for_completion+0xfd/0x140
[263393.171099]  [<ffffffff938d7c40>] ? wake_up_state+0x20/0x20
[263393.176887]  [<ffffffffc04cfebb>] mlx5_ib_post_send_wait+0x9b/0x150 [mlx5_ib]
[263393.184350]  [<ffffffffc04cfbb0>] ? get_nchild+0x70/0x70 [mlx5_ib]
[263393.190739]  [<ffffffffc04d0be9>] unreg_umr+0x89/0xc0 [mlx5_ib]
[263393.196867]  [<ffffffffc04d1075>] dereg_mr+0x195/0x1e0 [mlx5_ib]
[263393.203087]  [<ffffffffc04d4de3>] mlx5_ib_dereg_mr+0x83/0x90 [mlx5_ib]
[263393.209823]  [<ffffffffc0469884>] ib_dereg_mr+0x34/0x50 [ib_core]
[263393.216126]  [<ffffffffc030a6b2>] uverbs_free_mr+0x12/0x20 [ib_uverbs]
[263393.222856]  [<ffffffffc0306a52>] destroy_hw_idr_uobject+0x22/0x60 [ib_uverbs]
[263393.230408]  [<ffffffffc0307184>] uverbs_destroy_uobject+0x34/0x190 [ib_uverbs]
[263393.238042]  [<ffffffffc03074e2>] uobj_destroy+0x52/0x70 [ib_uverbs]
[263393.244605]  [<ffffffffc0307841>] __uobj_get_destroy+0x31/0x60 [ib_uverbs]
[263393.251686]  [<ffffffffc0307881>] __uobj_perform_destroy+0x11/0x30 [ib_uverbs]
[263393.259236]  [<ffffffffc02fdadc>] ib_uverbs_dereg_mr+0x8c/0xc0 [ib_uverbs]
[263393.266316]  [<ffffffffc02f9190>] ib_uverbs_write+0x530/0x590 [ib_uverbs]
[263393.273313]  [<ffffffff938e143c>] ? update_curr+0x14c/0x1e0
[263393.279094]  [<ffffffff93afa597>] ? security_file_permission+0x27/0xa0
[263393.285832]  [<ffffffff93a42700>] vfs_write+0xc0/0x1f0
[263393.291175]  [<ffffffff93a4351f>] SyS_write+0x7f/0xf0
[263393.296441]  [<ffffffff93f76ddb>] system_call_fastpath+0x22/0x27


@nitbhat nitbhat commented May 6, 2020

When I try running Enzo with the message tracking infrastructure to determine the undelivered messages, some runs crash with the following stack trace. Any idea what is causing this, @brminich? Enzo here was built on top of charm++ master (which includes your polling performance improvement patch) and uses ucx/1.8.0.

[c137-064:228344:0:228344] Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))
==== backtrace (tid: 228344) ====
 0 0x0000000000050dfe ucs_debug_print_backtrace()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/ucs/debug/debug.c:625
 1 0x0000000000036340 killpg()  ???:0
 2 0x0000000000049fa0 ucs_arbiter_dispatch_nonempty()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/ucs/datastruct/arbiter.c:327
 3 0x000000000003e9c5 ucs_arbiter_dispatch()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/ucs/datastruct/arbiter.h:356
 4 0x000000000003e9c5 uct_dc_mlx5_iface_progress_pending()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/uct/ib/dc/dc_mlx5_ep.h:310
 5 0x000000000003e9c5 uct_dc_mlx5_poll_tx()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/uct/ib/dc/dc_mlx5.c:225
 6 0x000000000003e9c5 uct_dc_mlx5_iface_progress()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/uct/ib/dc/dc_mlx5.c:238
 7 0x0000000000021602 ucs_callbackq_dispatch()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/ucs/datastruct/callbackq.h:211
 8 0x0000000000021602 uct_worker_progress()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/uct/api/uct.h:2221
 9 0x0000000000021602 ucp_worker_progress()  /work/00410/huang/frontera/ucx/ucx-1.8.0-rc1/src/ucp/core/ucp_worker.c:1951
10 0x000000000065aa8d LrtsAdvanceCommunication()  ???:0
11 0x000000000065abb7 CmiGetNonLocal()  ???:0
12 0x000000000065cc5d CsdNextMessage()  ???:0
13 0x000000000065cd20 CsdScheduleForever()  ???:0
14 0x000000000065cfd5 CsdScheduler()  ???:0
15 0x000000000065b33a ConverseInit()  ???:0
16 0x0000000000577857 charm_main()  ???:0
17 0x0000000000022495 __libc_start_main()  ???:0
18 0x000000000053ae5c _start()  ???:0
=================================


@brminich brminich commented May 6, 2020

@nitbhat, I'm checking it. It could be one of the issues we fixed in the arbiter recently.
Can you please try using UCX master instead?
Or, as a workaround, you could try setting this UCX env var: UCX_TLS=sm,self,ud
You can also use RC instead of UD, but it has scalability limitations, so it may not work with all-to-all kinds of patterns with many processes.


@nitbhat nitbhat commented May 6, 2020

Okay, thanks @brminich.

I am trying to build ucx master (ed6b365dc9193dac8a4243f8c00b0ab8db368d89).

My configure line is as follows:

./configure --prefix=/scratch1/03808/nbhat4/ucx/build_master --enable-compiler-opt=3 --enable-optimizations --disable-logging --disable-debug --disable-assertions --disable-params-check --with-mcpu --with-march --with-rc --with-ud --with-dc --with-cm --with-mlx5-dv --with-ib-hw-tm --with-dm --with-knem=/opt/knem-1.1.3.90mlnx1 --enable-static=yes --enable-shared=yes

Configure ends with

checking for process_vm_readv... yes
checking whether KNEM_CMD_GET_INFO is declared... yes
configure: XPMEM - failed to open the requested location (guess), guessing ...
checking cray-ugni... no
checking that generated files are newer than configure... done
configure: creating ./config.status
config.status: creating src/ucm/cuda/Makefile
config.status: creating src/ucm/rocm/Makefile
config.status: creating src/ucm/Makefile
config.status: creating src/uct/cuda/gdr_copy/Makefile
config.status: creating src/uct/cuda/Makefile
config.status: creating src/uct/ib/cm/Makefile
config.status: creating src/uct/ib/rdmacm/Makefile
config.status: creating src/uct/ib/Makefile
config.status: creating src/uct/rocm/gdr/Makefile
config.status: creating src/uct/rocm/Makefile
config.status: creating src/uct/sm/cma/Makefile
config.status: creating src/uct/sm/knem/Makefile
config.status: creating src/uct/sm/mm/xpmem/Makefile
config.status: creating src/uct/sm/mm/Makefile
config.status: creating src/uct/sm/Makefile
config.status: creating src/uct/ugni/Makefile
config.status: creating src/uct/Makefile
config.status: creating src/tools/perf/lib/Makefile
config.status: creating src/tools/perf/cuda/Makefile
config.status: creating src/tools/perf/Makefile
config.status: creating test/gtest/ucm/test_dlopen/Makefile
config.status: creating test/gtest/ucs/test_module/Makefile
config.status: creating test/gtest/Makefile
config.status: creating Makefile
config.status: error: cannot find input file: `doc/doxygen/header.tex.in'

Then, on running make, I see:

login1.frontera(1045)$ make
cd . && /bin/sh ./config.status config.h
config.status: creating config.h
config.status: config.h is unchanged
make  all-recursive
make[1]: Entering directory `/scratch1/03808/nbhat4/ucx'
Making all in src/ucm
make[2]: Entering directory `/scratch1/03808/nbhat4/ucx/src/ucm'
Making all in .
make[3]: Entering directory `/scratch1/03808/nbhat4/ucx/src/ucm'
  CC       malloc/libucm_la-malloc_hook.lo
malloc/malloc_hook.c: In function ‘ucm_malloc_init_orig_funcs’:
malloc/malloc_hook.c:794:65: error: ‘malloc_usable_size’ undeclared (first use in this function); did you mean ‘dlmalloc_usable_size’?
  794 |         ucm_malloc_hook_state.usable_size = (size_t (*)(void *))malloc_usable_size;
      |                                                                 ^~~~~~~~~~~~~~~~~~
      |                                                                 dlmalloc_usable_size
malloc/malloc_hook.c:794:65: note: each undeclared identifier is reported only once for each function it appears in
malloc/malloc_hook.c: In function ‘ucm_malloc_install’:
malloc/malloc_hook.c:848:13: error: ‘__free_hook’ undeclared (first use in this function)
  848 |             __free_hook     = ucm_free;
      |             ^~~~~~~~~~~
malloc/malloc_hook.c:849:13: error: ‘__realloc_hook’ undeclared (first use in this function)
  849 |             __realloc_hook  = ucm_realloc;
      |             ^~~~~~~~~~~~~~
malloc/malloc_hook.c:850:13: error: ‘__malloc_hook’ undeclared (first use in this function)
  850 |             __malloc_hook   = ucm_malloc;
      |             ^~~~~~~~~~~~~
malloc/malloc_hook.c:851:13: error: ‘__memalign_hook’ undeclared (first use in this function)
  851 |             __memalign_hook = ucm_memalign;
      |             ^~~~~~~~~~~~~~~
make[3]: *** [malloc/libucm_la-malloc_hook.lo] Error 1
make[3]: Leaving directory `/scratch1/03808/nbhat4/ucx/src/ucm'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/scratch1/03808/nbhat4/ucx/src/ucm'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/scratch1/03808/nbhat4/ucx'
make: *** [all] Error 2


@hoopoepg hoopoepg commented May 7, 2020

Hi @nitbhat,
could you run autogen.sh and re-launch the configure script?
If it still fails, could you attach the config.log file?

Thank you


@nitbhat nitbhat commented May 7, 2020

Thanks @hoopoepg! I was able to build it successfully. I'll post here about how my runs go.


@nitbhat nitbhat commented May 11, 2020

I see that with ucx master (HEAD at b72cd117d474a3963640db72a1ac4de4e7442c81), the Enzo hang occurs rarely; the hang percentage has dropped significantly. Was there a specific change that affected small messages and could have caused this? That might help us identify the bug better. @brminich


@brminich brminich commented May 11, 2020

@nitbhat, did you compare it with 1.6? We have added quite a few features and bug fixes since then, but the message rate metric is not supposed to be affected anyway.


@nitbhat nitbhat commented May 11, 2020

I compared with both ucx/1.6.1 and ucx/1.8.0 on Frontera. With both of those, the hang rate is much higher: for example, on 4 nodes it hangs 5/10 times, and on 64 nodes it hangs 8/10 times.

With the new ucx master, I haven't seen it hang on 4 nodes so far (0/20 runs).
For 64 nodes, it hangs rarely (1/20 runs). So the problem occurs much less frequently but isn't entirely gone.


@nitbhat nitbhat commented Jun 9, 2020

I tried this experiment again with ucx updated to master (openucx/ucx@b72cd11) and I see that Enzo doesn't hang on 4 nodes or 64 nodes. I tried about 20 jobs each, starting from 4 nodes up to 64 nodes, doubling every time, and all the submitted jobs with charm based on ucx master ran successfully. This makes me believe that the hang I saw earlier with ucx master was either a very rare hang or an error on my part (where I killed the job thinking it was a hang).


@nitbhat nitbhat commented Jun 9, 2020

^ A similar result was seen for NAMD, as mentioned here.

I have yet to git bisect the ucx repo to determine the commit(s) that cause the hang to no longer occur.


@nitbhat nitbhat commented Jun 12, 2020

On bisecting, I found that this commit (openucx/ucx@7147812) seems to be the one that fixes the hang, and its commit message mentions fixing a 'deadlock'. Do you think this is applicable to what was seen on Frontera? If so, can the bug be characterized as the deadlock described in the commit message? @brminich


@nitbhat nitbhat commented Jun 29, 2020

On testing Enzo with the latest UCX release candidate, v1.8.1-rc1, I see that Enzo continues to hang (as it did previously with v1.6.1 and v1.8.0). However, with the current master, Enzo doesn't hang. Similar behavior is seen with NAMD, i.e. NAMD hangs with ucx 1.8.1-rc1 but not with the current master (as mentioned in #2716).


@nitbhat nitbhat commented Sep 10, 2020

I verified that this issue is solved when using UCX v1.9.0-rc1 (https://github.com/openucx/ucx/releases/tag/v1.9.0-rc1). The fix should be available in the upcoming UCX release.
