
Cello/Enzo-P hangs with UCX machine layer (in nonSMP mode) on Frontera #2635

Open
nitbhat opened this issue Dec 4, 2019 · 33 comments

nitbhat commented Dec 4, 2019

This bug was reported by James Bordner and Mike Norman at the Charm++ BoF at SC19.

Bug reproduction details:
Enzo-E/Cello version: https://github.com/jobordner/enzo-e.git (solver-dd branch)

Modules:
module load gcc ucx boost

Environment:

   export BOOST_HOME=$TACC_BOOST_DIR
   export CHARM_HOME=/home1/00369/tg456481/Charm/charm.6A0-ucx-gcc.default #(change if needed)
   export CELLO_ARCH=frontera_gcc
   export CELLO_PREC=single
   export HDF5_HOME=/home1/00369/tg456481

Input files:

   Copy /work/00369/tg456481/frontera/cello-data/Cosmo/OsNa05/OsNa05-N512/N512/* to the run directory

Input file: see attached
enzoe-frontera-ucx.in.txt

Batch script:

#SBATCH -N 64               # Total # of nodes
#SBATCH -n 3584             # Total # of mpi tasks
...

Run command used: charmrun +p3584 ./enzo-p enzoe-frontera-ucx.in.txt

nitbhat added the Bug and UCX labels on Dec 4, 2019
nitbhat added this to the 6.10.0 milestone on Dec 4, 2019
nitbhat self-assigned this on Dec 4, 2019
nitbhat changed the title from "Cello/Enzo-P hangs with the UCX machine layer (in nonSMP mode) on Frontera" to "Cello/Enzo-P hangs with UCX machine layer (in nonSMP mode) on Frontera" on Dec 4, 2019

nitbhat commented Dec 4, 2019

While attempting to reproduce this issue, I got this HDF5-related error while linking the final binary:

Install file: "build/Cello/libcello.a" as "lib/libcello.a"
/home1/03808/nbhat4/software/charm/ucx-linux-x86_64-prod/bin/charmc -language charm++ -o build/Enzo/enzo-p -O3 -Wall -g -ffast-math -funroll-loops -rdynamic -module CommonLBs build/Enzo/enzo-p.o build/Cello/main_enzo.o -Llib -L/home1/03808/nbhat4/software/hdf5-1.8.14/build/lib -L/opt/apps/gcc9_1/boost/1.69/lib -L/usr/lib64/lib -lenzo -lcharm -lsimulation -ldata -lproblem -lcompute -lcontrol -lmesh -lio -ldisk -lmemory -lparameters -lerror -lmonitor -lparallel -lperformance -ltest -lcello -lexternal -lhdf5 -lz -ldl -lpng -lgfortran -lboost_filesystem -lboost_system
/opt/apps/gcc/9.1.0/bin/ld: lib/libdisk.a(disk_FileHdf5.o): in function `FileHdf5::FileHdf5(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)':
/home1/03808/nbhat4/software/enzo-e/build/Cello/disk_FileHdf5.cpp:43: undefined reference to `H5P_CLS_DATASET_CREATE_ID_g'
collect2: error: ld returned 1 exit status
Fatal Error by charmc in directory /home1/03808/nbhat4/software/enzo-e
   Command g++ -rdynamic -L/usr/lib64/ -O3 -Wall -g -ffast-math -funroll-loops -rdynamic -Llib -L/home1/03808/nbhat4/software/hdf5-1.8.14/build/lib -L/opt/apps/gcc9_1/boost/1.69/lib -L/usr/lib64/lib build/Enzo/enzo-p.o build/Cello/main_enzo.o moduleinit123437.o -L/home1/03808/nbhat4/software/charm/ucx-linux-x86_64-prod/bin/../lib -lmoduleCommonLBs -lckmain -lck -lmemory-default -lthreads-default -lconv-machine -lconv-core -ltmgr -lconv-util -lconv-partition -lhwloc_embedded -lm -lmemory-default -lthreads-default -lldb-rand -lconv-ldb -lckqt -lucp -luct -lucs -lucm -ldl -lenzo -lcharm -lsimulation -ldata -lproblem -lcompute -lcontrol -lmesh -lio -ldisk -lmemory -lparameters -lerror -lmonitor -lparallel -lperformance -ltest -lcello -lexternal -lhdf5 -lz -ldl -lpng -lgfortran -lboost_filesystem -lboost_system -lmoduleCommonLBs -lmoduleNDMeshStreamer -lmodulecompletion -lz -lm /home1/03808/nbhat4/software/charm/ucx-linux-x86_64-prod/bin/../lib/conv-static.o -o build/Enzo/enzo-p returned error code 1
charmc exiting...
scons: *** [build/Enzo/enzo-p] Error 1

I got past it by using HDF5 version 1.8.13 (all newer versions gave me this error).

Another setting that simplifies the build is "use_grackle = 0" on line 102 of the SConstruct file.
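
For anyone reproducing this, a minimal sketch of that workaround (the install prefix below is a placeholder, not the exact path used here):

# Point the Enzo-E build at an HDF5 1.8.13 install; the newer HDF5 versions
# I tried all produced the undefined-reference error above.
export HDF5_HOME=$HOME/software/hdf5-1.8.13/build   # must contain include/ and lib/

# In SConstruct (line 102), set use_grackle = 0 to drop the Grackle
# dependency, then rebuild with scons as usual:
scons -j8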

nitbhat commented Dec 5, 2019

As previously stated by James over email, the program ran successfully when executed on 1 or 2 nodes. However, the bug appeared while running on 4, 8, and 64 nodes.

The last output printed before the hang for a 4-node run is as follows:

0 00386.25 Performance refresh_store time-usec 0
0 00386.25 Performance refresh_child time-usec 0
0 00386.25 Performance refresh_exit time-usec 0
0 00386.25 Performance refresh_store_sync time-usec 0
0 00386.25 Performance refresh_child_sync time-usec 0
0 00386.25 Performance refresh_exit_sync time-usec 0
0 00386.25 Performance control time-usec 80702761
0 00386.25 Performance compute time-usec 14184066599
0 00386.25 Performance output time-usec 133512
0 00386.25 Performance stopping time-usec 4354327957
0 00386.25 Performance block time-usec 57974302110
0 00386.25 Performance exit time-usec 0
0 00386.25 Performance simulation max-proc-blocks 199
0 00386.25 Performance simulation max-proc-particles 618658
0 00386.25 Performance simulation balance-blocks 19.031216
0 00386.25 Performance simulation balance-eff-blocks 0.840116 (167/199)
0 00386.25 Performance simulation balance-particles 3.249693
0 00386.25 Performance simulation balance-eff-particles 0.968526 (599186/618658)

nitbhat commented Dec 9, 2019

Since @brminich suggested export UCX_ZCOPY_THRESH=-1 for ChaNGa, I used that for a 64-node Cello/Enzo-P run and it ran without a hang (tried about 10 times and it completed successfully every time).

This was my run script:

#!/bin/bash
#SBATCH -J enzo
#SBATCH -p normal
#SBATCH -t 00:05:00
#SBATCH -N 64
#SBATCH -n 3584
#SBATCH --ntasks-per-node=56

cd /home1/03808/nbhat4/software/enzo-e

export UCX_ZCOPY_THRESH=-1
ibrun ./enzo-p-nonsmp ./enzoe-frontera-ucx.in

@brminich: Since the run hangs without UCX_ZCOPY_THRESH=-1 and does not hang with UCX_ZCOPY_THRESH=-1, I'm guessing the abundance of small eager messages being copied under the auto threshold is causing the hang? It would be good if UCX reported an error rather than just hanging.

nitbhat commented Dec 11, 2019

@brminich just confirmed that setting UCX_ZCOPY_THRESH=-1 means the zero-copy threshold is set to infinity, i.e., all sends are buffered and complete immediately.

So setting UCX_ZCOPY_THRESH=-1 should use more memory than the default UCX_ZCOPY_THRESH of 8k or 16k.
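
For concreteness, the two configurations being compared look like this (run command taken from the 64-node script above):

# Default: UCX auto-selects the zero-copy threshold (roughly 8k-16k); larger
# sends take the zero-copy path. This is the configuration that hangs.
ibrun ./enzo-p-nonsmp ./enzoe-frontera-ucx.in

# Workaround: set the threshold to infinity so every send is buffered and
# completes immediately, at the cost of extra memory per process.
export UCX_ZCOPY_THRESH=-1
ibrun ./enzo-p-nonsmp ./enzoe-frontera-ucx.in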

nitbhat commented Dec 11, 2019

I was able to get Enzo-P to hang on 4 nodes, attach gdb to each process, and capture stack traces. I'm attaching the stack traces here.

stack_1node.txt
stack_2node.txt
stack_3node.txt
stack_4node.txt

nitbhat commented Dec 13, 2019

This is the typical size/count profile of messages sent by one of the processes in the application for a 64-node (3584-process) run (other processes seem to follow a similar profile):

Msg size (bytes)   Count
96 630602
1184 216714
416 144234
4256 107880
160 34736
80 24294
3248 12119
4272 12119
6336 12119
9440 12119
12480 11943
944 8039
1200 8039
1728 8039
2528 8039
5312 7967
12464 5980
16560 5980
24768 5980
32960 5980
37088 5980
36992 2147
2144 2000
672 1284
192 424
320 372
112 336
208 301
128 295
1712 214
2224 214
3264 214
4544 214
4832 214
7360 176
960 144
1328 101
224 100
2944 72
656 62
816 62
1152 62
1664 62
3712 62
2576 24
3376 24
4992 24
7424 24
8192 24
14368 4
272 3
304 3
384 3
512 3
2048 3
20 1
452 1

brminich commented Dec 17, 2019

@nitbhat, does setting +ucx_rndv_thresh=64 solve the issue?
If yes, we may consider setting it as a default for now
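
For reference, the flag goes on the application command line, e.g. (a sketch reusing the 64-node run command from earlier in the thread):

ibrun ./enzo-p-nonsmp ./enzoe-frontera-ucx.in +ucx_rndv_thresh=64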

nitbhat commented Dec 17, 2019

@brminich: No, it doesn't. I saw the hang when I passed +ucx_rndv_thresh=64.

brminich commented Dec 17, 2019

@nitbhat, have you tried it with some other options or just +ucx_rndv_thresh=64?

nitbhat commented Dec 18, 2019

@brminich: For now, I've just tried it with +ucx_rndv_thresh=64. I could try with some other values too.

nitbhat commented Dec 18, 2019

Also, I have been working on a simpler test case to replicate the UCX failures (with similar message patterns). I've added this test as /tests/charm++/bombard on the branch nitin/ucx_debugging.

This test has three modes (where p is the number of processes/cores):

Mode 0, p2p communication: process 0 sends messages to process p-1.

Mode 1, all-to-one communication: processes 0 to p-2 (p-1 processes in total) send messages to process p-1.

Mode 2, all-to-all communication: processes 0 to p-1 send messages (broadcast) to all processes (0 to p-1).

Because I saw about 600,000 messages of 96 bytes on each process, I tried my test with similar settings on 4 nodes (ibrun ./bombard 2 0 3000, where 2 is the mode, 0 is the buffer size, and 3000 is the number of iterations for which the sender bombards the receiver) and saw that it crashes with the UCX layer. On that branch (nitin/ucx_debugging), I'm also printing the message sizes and counts for each process.

A similar test with the MPI layer takes a very long time, so I have yet to determine whether it passes. However, the crash in the UCX layer suggests this test is worth pursuing for debugging, since it should be simpler to debug than running Cello.
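
To make the three modes concrete, example invocations look like this (the argument order is <mode> <buffer-size-bytes> <iterations>, as in the run above; the iteration count here is arbitrary):

ibrun ./bombard 0 0 3000   # mode 0: process 0 bombards process p-1
ibrun ./bombard 1 0 3000   # mode 1: processes 0..p-2 all bombard process p-1
ibrun ./bombard 2 0 3000   # mode 2: every process broadcasts to all processes (the case that crashes on UCX)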

nitbhat commented Dec 18, 2019

I just verified that although the MPI-layer-based ibrun ./bombard 2 0 3000 run is slow, it doesn't crash. I think the UCX-layer-based ibrun ./bombard 2 0 3000 run completes much faster (probably because buffered sends complete immediately) and hence crashes. I have yet to figure out the exact reason for this.

trquinn commented Dec 18, 2019

Nitin's observations are consistent with the behavior I see with ChaNGa: places in the simulation where UCX crashes are places that take a very long time with IMPI.

nitbhat commented Dec 18, 2019

@brminich: I'm guessing that UCX sends complete immediately (by buffering the messages), whereas MPI might have some implicit way of throttling (or throttles the sends because it waits for matching receives). Is there a way to disable immediate completion in UCX so that we can throttle the sends in some fashion?

evan-charmworks commented Dec 18, 2019

@brminich In addition to trying on the HPC Advisory Council's Thor, it may also be worth trying on Rome and Helios since they have some different hardware that may better match Frontera.

Frontera

Intel Xeon Platinum 8280 ("Cascade Lake"), Clock rate: 2.7Ghz ("Base Frequency")
Mellanox InfiniBand, HDR-100
System Interconnect: Frontera compute nodes are interconnected with HDR-100 links to each node, and HDR (200Gb) links between leaf and core switches. The interconnect is configured in a fat tree topology with a small oversubscription factor of 11:9.

Thor

Dual Socket Intel® Xeon® 16-core CPUs E5-2697A V4 @ 2.60 GHz
Mellanox ConnectX-6 HDR100 100Gb/s InfiniBand/VPI adapters
Mellanox Switch-IB 2 SB7800 36-Port 100Gb/s EDR InfiniBand switches
Mellanox Connect-IB® Dual FDR 56Gb/s InfiniBand adapters
Mellanox SwitchX SX6036 36-Port 56Gb/s FDR VPI InfiniBand switches

Helios

Dual Socket Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
Mellanox ConnectX-6 HDR/HDR100 200/100Gb/s InfiniBand/VPI adapters with Socket Direct
Mellanox HDR Quantum Switch QM7800 40-Port 200Gb/s HDR InfiniBand

Rome

Dual Socket AMD EPYC 7742 64-Core Processor @ 2.25GHz
Mellanox ConnectX-6 HDR 200Gb/s InfiniBand/Ethernet
Mellanox HDR Quantum Switch QM7800 40-Port 200Gb/s HDR InfiniBand

brminich commented Dec 19, 2019

@nitbhat, I already tried your test on 2 different clusters. It works fine with 30000 iterations:

time mpirun -n 112 --report-bindings ./bombard 2 96 30000
real 1m4.365s
user 0m3.170s
sys 0m3.607s

But it crashes with 300000 iterations. Looking into it.

brminich commented Dec 19, 2019

It crashes with both the UCX machine layer and the MPI machine layer (using HPCX).
The crash happens due to OOM:

[Thu Dec 19 12:42:10 2019] Out of memory: Kill process 16879 (bombard) score 74 or sacrifice child
[Thu Dec 19 12:42:10 2019] Killed process 16879 (bombard) total-vm:61742852kB, anon-rss:9719228kB, file-rss:0kB, shmem-rss:5300kB

Running with either the UD or TCP transport did not help.

With Intel MPI it hangs for several minutes and then crashes with:

 [Thu Dec 19 13:22:18 2019] bombard[21841]: segfault at 68 ip 00007fb1bc69cc49 sp 00007fffee32f8c0 error 6 in libmpi.so.12.0[7fb1bc2ee000+758000]
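
These kernel messages (the OOM kill and the segfault) can be pulled from the affected compute node with something like:

dmesg -T | grep -iE "out of memory|killed process|segfault"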

brminich commented Dec 19, 2019

BTW, while memory consumption is rather high (~60 GB per node), +ucx_rndv_thresh=0 helps the following test pass:

time mpirun -n 112 -x UCX_TLS=sm,self,ud  ./bombard 2 96 100000 +ucx_rndv_thresh=0
real    0m3.817s
user    0m0.054s
sys     0m0.092s

nitbhat commented Dec 19, 2019

Okay, I thought so (about the OOM). BTW, I see that you're passing 96 in your argument list. Note that the size passed there is added on top of the Charm++ envelope. So, to get a total message of 96 bytes (and mimic Cello), you'd have to pass 0.
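
In other words (assuming the ~96-byte total above comes entirely from the Charm++ envelope):

ibrun ./bombard 2 0 3000    # buffer size 0  -> ~96-byte messages (envelope only), matching Cello's dominant size
ibrun ./bombard 2 96 3000   # buffer size 96 -> ~192-byte messages (96-byte payload plus the envelope)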

nitbhat commented Dec 19, 2019

@brminich: Also, is there a setting (an environment variable) that can disable immediate sends and complete them only after the receives have been posted? If so, I would like to try that.

brminich commented Dec 19, 2019

No, there is no such setting. But how would that help? Does Charm++ issue all sends in a non-blocking manner? Can we limit its injection rate somehow? Does Charm++ provide blocking send capabilities?
The main problem is that the application (or the bombard test) may use an enormous amount of memory, more than the amount actually available.

nitbhat commented Dec 19, 2019

Oh okay. I was thinking it would help because then we could confirm that bombarding the receiver with sends is causing the problem (and having that control would let us throttle the sends).

Yes, Charm++ issues all sends in a non-blocking manner. There is no way to limit the injection rate that I'm aware of; we'd have to use a scheme where the sender buffers the messages until the receiver is ready (but I'm guessing the UCX layer already does that right now).

brminich commented Dec 19, 2019

(but I'm guessing the UCX layer already does that right now).

Unfortunately, not for all messages (for instance, not for small ones).

nitbhat commented Dec 19, 2019

You mean it doesn't buffer the small messages but buffers the large ones? That's strange.

Also, were you able to reproduce the Cello hang on Thor (or the other machines at the HPC Advisory Council center)?

When I tried running Cello on Thor, I couldn't get the run past the initial stage:

0 00001.34 Define LINKFLAGS           -O3 -Wall -g -ffast-math -funroll-loops  -rdynamic   -module CommonLBs
0 00001.34 Define BUILD HOST          login01.hpcadvisorycouncil.com
0 00001.34 Define BUILD DIR           /global/home/users/nitinb/enzo-e
0 00001.34 Define BUILD DATE (UTC)    2019-12-18
0 00001.34 Define BUILD TIME (UTC)    18:22:14
0 00001.34 Define CELLO_CHARM_PATH    /global/home/users/nitinb/charm/mpi-linux-x86_64-prod
0 00001.34 Define NEW_OUTPUT          no
0 00001.34 Define CHARM_VERSION 61000
0 00001.34 Define CHANGESET     937ed39ac5aa2cea70fd727c79711819a325ddcb
0 00001.34 Define CHARM_BUILD         mpi-linux-x86_64-prod
0 00001.34 Define CHARM_NEW_CHARM Yes
0 00001.34 Define CONFIG_NODE_SIZE    64
0 00001.34 CHARM CkNumPes()           256
0 00001.34 CHARM CkNumNodes()         256
0 00001.34  BEGIN ENZO-P
0 00001.34 Memory bytes 0 bytes_high 0

I'm not sure if this is a hang or a very slow run (but it was seen for both MPI and UCX runs on 1, 2, 4, 16, and 32 nodes).

brminich commented Dec 19, 2019

I did not manage to reproduce it on the local cluster; I will try Thor.
BTW, can you please check memory usage with free during a Cello run?

brminich commented Dec 23, 2019

Running enzo 45 times in a row on a 4-node cluster did not reveal the problem. Memory consumption is about 75 GB per node (with 28 processes).
Can I use the same input files for other runs with more nodes?
@nitbhat, did you have a chance to check memory consumption during the hang?

nitbhat commented Dec 23, 2019

Oh okay. I see. Yes, the issue shows up more consistently on Frontera with 64 nodes (3584 processes).

Yes, you can use the same input files for larger runs too.

I have been working on another issue, so I haven't had a chance to check that so far.

nitbhat assigned nitbhat and brminich, and unassigned nitbhat, on Jan 2, 2020

nitbhat commented Jan 7, 2020

Running enzo 45 times in a row on a 4-node cluster did not reveal the problem. Memory consumption is about 75 GB per node (with 28 processes).
Can I use the same input files for other runs with more nodes?
@nitbhat, did you have a chance to check memory consumption during the hang?

@brminich: The following is the output of free on the 4 compute nodes where the hang is seen. This is for a 224-process run on 4 nodes (56 processes per node).

c104-091.frontera(1013)$ free -m
              total        used        free      shared  buff/cache   available
Mem:         191328       54321      132280         133        4726      133811
Swap:             0           0           0


c104-092.frontera(1002)$ free -m
              total        used        free      shared  buff/cache   available
Mem:         191328       53087      132368         189        5871      134707
Swap:             0           0           0


c104-093.frontera(1000)$ free -m
              total        used        free      shared  buff/cache   available
Mem:         191328       52244      133281         181        5801      135597
Swap:             0           0           0


c104-094.frontera(1001)$ free -m
              total        used        free      shared  buff/cache   available
Mem:         191328       39034      147392         185        4900      148944
Swap:             0           0           0

Looking at this, there appears to be enough memory available on each node.

trquinn commented Jan 7, 2020

The other thing to check (from my experience with the verbs layer) is the number of pinnable memory segments available.
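
A quick way to check the locked-memory limit on the compute nodes would be something like the following (a sketch; it only covers the ulimit side, since registered-segment accounting is driver-specific):

ulimit -l                                        # max locked (pinnable) memory per process; ideally "unlimited"
grep "locked memory" /proc/<enzo-p-pid>/limits   # same limit as seen by a running enzo-p process (PID is a placeholder)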

nitbhat commented Feb 19, 2020

It looks like the chances of a hang increase as the number of ranks per node increases. Here are the results for 224 ranks under different run configurations on Frontera:


7 nodes - 224 ranks - 32 ranks/node - 4/10 hangs
14 nodes - 224 ranks - 16 ranks/node - 1/10 hangs
28 nodes - 224 ranks - 8 ranks/node - 1/10 hangs
56 nodes - 224 ranks - 4 ranks/node - 0/10 hangs

nitbhat commented Feb 20, 2020

On Thor, when I try building and running Enzo-P, I don't get past:

0 00006.79 Define Simulation processors 1024
0 00006.79 Define CELLO_ARCH          frontera_gcc
0 00006.79 Define CELLO_PREC          single
0 00006.79 Define CC                  gcc
0 00006.79 Define CFLAGS              -O3 -Wall -g -ffast-math -funroll-loops
0 00006.79 Define CPPDEFINES          CONFIG_PRECISION_SINGLE SMALL_INTS {'CONFIG_NODE_SIZE': 64} {'CONFIG_NODE_SIZE_3': 192} NO_FREETYPE CONFIG_USE_PERFORMANCE CONFIG_NEW_CHARM CONFIG_HAVE_VERSION_CONTROL
0 00006.79 Define CPPPATH             #/include /global/home/users/nitinb/hdf5-1.10.5/build/include /global/home/users/nitinb/boost_1_69_0/build//include /usr/lib64/include
0 00006.79 Define CXX                 /global/home/users/nitinb/charm/ucx-linux-x86_64-prod/bin/charmc -language charm++
0 00006.79 Define CXXFLAGS            -O3 -Wall -g -ffast-math -funroll-loops     -balancer CommonLBs
0 00006.79 Define FORTRANFLAGS        -O3 -Wall -g -ffast-math -funroll-loops
0 00006.79 Define FORTRAN             gfortran
0 00006.79 Define FORTRANLIBS         gfortran
0 00006.79 Define FORTRANPATH         #/include
0 00006.79 Define LIBPATH             #/lib /global/home/users/nitinb/hdf5-1.10.5/build/lib /global/home/users/nitinb/boost_1_69_0/build//lib /usr/lib64/lib
0 00006.79 Define LINKFLAGS           -O3 -Wall -g -ffast-math -funroll-loops  -rdynamic   -module CommonLBs
0 00006.79 Define BUILD HOST          login02.hpcadvisorycouncil.com
0 00006.79 Define BUILD DIR           /global/home/users/nitinb/enzo-e
0 00006.79 Define BUILD DATE (UTC)    2020-02-19
0 00006.79 Define BUILD TIME (UTC)    21:55:06
0 00006.79 Define CELLO_CHARM_PATH    /global/home/users/nitinb/charm/ucx-linux-x86_64-prod
0 00006.79 Define NEW_OUTPUT          no
0 00006.79 Define CHARM_VERSION 61000
0 00006.79 Define CHANGESET     937ed39ac5aa2cea70fd727c79711819a325ddcb
0 00006.79 Define CHARM_BUILD         ucx-linux-x86_64-prod
0 00006.79 Define CHARM_NEW_CHARM Yes
0 00006.79 Define CONFIG_NODE_SIZE    64
0 00006.79 CHARM CkNumPes()           1024
0 00006.79 CHARM CkNumNodes()         1024
0 00006.79  BEGIN ENZO-P
0 00006.79 Memory bytes 0 bytes_high 0

I think it is a hang. @brminich: You mentioned that you were able to successfully run Enzo-P about 45 times; was that on Thor? What was your output of module list?

brminich commented Feb 21, 2020

@nitbhat, no, it was on the local Jazz cluster.
Did you run it on Thor using the instructions you gave me a while ago?

nitbhat commented Feb 21, 2020
